Momentum Accelerates Evolutionary Dynamics

We combine momentum from machine learning with evolutionary dynamics, where momentum can be viewed as a simple mechanism of intergenerational memory. Using information divergences as Lyapunov functions, we show that momentum accelerates the convergence of evolutionary dynamics including the replicator equation and Euclidean gradient descent on populations. When evolutionarily stable states are present, these methods prove convergence for small learning rates or small momentum, and yield an analytic determination of the relative decrease in time to converge that agrees well with computations. The main results apply even when the evolutionary dynamic is not a gradient flow. We also show that momentum can alter the convergence properties of these dynamics, for example by breaking the cycling associated to the rock-paper-scissors landscape, leading to either convergence to the ordinarily non-absorbing equilibrium, or divergence, depending on the value and mechanism of momentum.


Introduction
Gradient descent is commonly used in machine learning and in many scientific fields, including to model biological systems. Evolutionary algorithms are frequently mentioned as an alternative to gradient descent, particularly when the function to be minimized is not differentiable. With a long history in machine learning [1], evolutionary algorithms have found broad application, including in reinforcement learning [2] [3], neural architecture search [4], AutoML [5], and meta-learning [6], among other areas. Despite the perceived dichotomy between evolutionary algorithms and gradient descent, some evolutionary algorithms can be understood in terms of gradient descent. The replicator equation is a model of natural selection and can be recognized as gradient descent on a non-Euclidean geometry of the probability simplex where the potential function is the population mean fitness. Powerful methods exist to analyze the replicator equation and accordingly its long run behavior is relatively well-understood in many circumstances. We show that these methods inform the action of the ML concept of momentum, a method to carry forward prior values of the gradient into further iterations of the replicator dynamic or gradient descent. In particular, we give a simple way to understand how momentum accelerates gradient descent.
Momentum as used in ML may have plausible evolutionary interpretations. Mechanisms of memory are abundant in biological and cultural systems, capturing complex adaptive functions within the lifetimes of organisms, including epigenetics [7] and cultural transmission of information [8]. However, the bias of these extra-genetic forms of memory may only last a few generations, as opposed to information incorporated more permanently in the genome, for example into a highly conserved gene, which may encode a more fundamental physical adaptation (e.g. heat-shock proteins [9]). Hence we might simplistically model a short-term memory mechanism as having an exponentially decaying impact on natural selection by carrying over some memory of the fitness landscape of earlier generations to future generations.
We show that the addition of a simple exponentially-decaying memory mechanism accelerates the convergence of trajectories of the replicator equation [10] and its Euclidean analog. This mechanism is called (Polyak) momentum in the machine learning literature [11] [12], where it is known to increase the rate of convergence of gradient descent quadratically, in terms of condition number [13]. We will also consider Nesterov momentum [14] [15], which additionally has a look-ahead aspect.
After describing the replicator equation and important associated facts, we introduce momentum to the discrete replicator equation and give a Lyapunov function for the modified dynamic showing that the evolutionarily stable states, when they exist for a given landscape, are unchanged for small strength of momentum. Furthermore, we show analytically that the continuous replicator dynamic with momentum converges explicitly more quickly for typical values of momentum, slows for other regions, and reverses direction in some cases. Finally, we consider exceptional examples of nonzero learning rate and momentum that break typical dynamic behavior, such as the concentric cycles for the rock-paper-scissors landscapes.
Several authors have explored variations of the ideas presented here, including recent works exploring momentum and geometry [16], [17], [18], and earlier works regarding an aspect of memory to replicator dynamics [19], adding negative momentum to game dynamics [20], and other interactions between game theory and machine learning [21] [22]. Our contributions are as follows: (1) introducing momentum to the replicator dynamic in a way compatible with recent work in machine learning, (2) demonstrating that momentum accelerates convergence for the replicator dynamic, and (3) proving Lyapunov stability theorems for evolutionary dynamics with momentum.
Putting this manuscript in a broader context, we encourage the reader and other researchers to continue to explore the interactions of evolutionary game theory, information theory, and machine learning. It appears these fields may still have much to offer each other.

Preliminaries
We briefly review the necessary background, recommending [13] for an overview of momentum and gradient descent and [23] for an overview of the replicator equation and the use of information theory to analyze it.

Gradient Descent
First we describe gradient descent in Euclidean space. Let x ∈ R^n be a real-valued vector, U : R^n → R be a potential function (or simply a function to be optimized), and f = ∇U its gradient. Then discrete gradient descent takes the following form:
$$x_i' = x_i + \alpha f_i(x)$$
where α is the learning rate, also commonly called the step size. In what follows it will be convenient to use the notation of time-scale dynamics [24]. Let
$$x^\Delta_{i,\alpha} = \frac{x_i' - x_i}{\alpha}$$
be the "time-scale" derivative, corresponding to either the ordinary derivative (in the limit that α → 0) or a finite difference (α > 0 and fixed) as needed. Gradient descent with learning rate α is simply x^Δ_{i,α} = f_i(x). Since we will not consider dynamics with actively changing α we simply write x^Δ_i, though we will consider how a family of dynamics changes as α → 0, that is, as the difference equations converge to a continuous differential equation.

Gradient Descent with Momentum
Momentum adds a memory of the prior gradients to future iterations. We proceed in accordance with the ML literature [13]. Gradient descent with (Polyak) momentum [26] of strength β is given by:
$$z_i' = \beta z_i + f_i(x), \qquad x^\Delta_i = z_i'$$
where f is the gradient as before. When β = 0 the momentum-free gradient descent is recovered.
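The momentum iteration can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the quadratic potential, the starting point, and the parameter values are hypothetical, and the update direction f is taken here to be −∇U so that the iteration descends toward the minimum of U.

```python
def momentum_descent(f, x0, alpha, beta, steps):
    """Iterate z' = beta*z + f(x), x' = x + alpha*z', starting from z = 0."""
    x = list(x0)
    z = [0.0] * len(x)
    for _ in range(steps):
        fx = f(x)
        z = [beta * zi + fi for zi, fi in zip(z, fx)]
        x = [xi + alpha * zi for xi, zi in zip(x, z)]
    return x


def neg_grad(x):
    """Update direction -grad U for the toy potential U(x) = 0.5*(x0^2 + 25*x1^2)."""
    return [-x[0], -25.0 * x[1]]


def norm(x):
    """Euclidean distance from the minimizer at the origin."""
    return sum(xi * xi for xi in x) ** 0.5


start = [1.0, 1.0]
plain = momentum_descent(neg_grad, start, alpha=0.02, beta=0.0, steps=100)
heavy = momentum_descent(neg_grad, start, alpha=0.02, beta=0.5, steps=100)
print(norm(plain), norm(heavy))
```

With the same step size and number of iterations, the run with β = 0.5 should end closer to the minimizer than the run with β = 0.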

Replicator Dynamics and Gradient Descent
The replicator dynamic is an evolutionary dynamic describing the action of natural selection as well as the dynamics of iterated games [27]. Its theoretical properties are extensively studied in Evolutionary Game Theory (EGT) and the equation has applications in biology, economics, and other fields. The importance of geometry in the study of the replicator equation and related dynamics, including that special cases of the replicator equation are a form of gradient descent, has been studied in EGT [28] and Information Geometry [29].
In EGT one typically restricts to discrete probability distributions that represent populations of evolving organisms or players of a strategic game, hence it is necessary to reformulate the state space of gradient descent as described above. Let Δ^n = {x ∈ R^n | x_i ≥ 0 and Σ_i x_i = 1} be the (n − 1)-dimensional probability simplex and f : Δ^n → R^n a fitness function. The analog of gradient descent with respect to the Euclidean geometry on the simplex is a special case of the (orthogonal) projection dynamic, described below. Gradient descent with respect to the Fisher information metric (also known as the Shahshahani metric in EGT) is called the natural gradient in information geometry [30]. In the case of a symmetric and linear fitness landscape, this gradient of the mean fitness with respect to the same geometry is a special case of the replicator equation. The more general form of the replicator equation is not always a gradient flow; nevertheless it has a strong convergence theorem that is closely related to this geometric structure.
For our purposes the discrete replicator equation for a fitness landscape f : Δ^n → R^n takes the form:
$$x_i' = \frac{x_i f_i(x)}{\bar{f}(x)}$$
where x is a discrete population distribution over n types, described by a vector in the probability simplex Δ^n, and f̄(x) = x · f(x) is the mean fitness, which we will assume to be non-zero everywhere. Using a time-scale derivative with step-size α we can rewrite this equation as
$$x^\Delta_i = \frac{x_i \left(f_i(x) - \bar{f}(x)\right)}{\bar{f}(x)}$$
which reduces to the form above when α = 1. The continuous version of the replicator equation can be obtained by letting α → 0. The denominator f̄(x) on the right-hand side is often omitted as it can be eliminated with a change in time scaling without altering the continuous trajectories. This gives the following standard form of the continuous dynamic:
$$\dot{x}_i = x_i \left(f_i(x) - \bar{f}(x)\right)$$
Subtracting off the mean fitness means that the rate of change of the i-th population type is proportional to its excess fitness, that is, how far its fitness f_i lies above or below the mean. Mathematically, subtracting the mean fitness keeps the derivative in the tangent space of the simplex.
Similarly, the analog of discrete Euclidean gradient descent on the simplex is known as the (orthogonal) projection dynamic, given by
$$x^\Delta_i = f_i(x) - \bar{f}(x)$$
where now
$$\bar{f}(x) = \frac{1}{n} \sum_{j} f_j(x)$$
is the (unweighted) average fitness. The continuous form is given by
$$\dot{x}_i = f_i(x) - \bar{f}(x)$$
It is a gradient flow whenever the fitness landscape is itself a Euclidean gradient, as is the case for the replicator equation. In particular, when the fitness landscape is linear, f(x) = Ax with A a symmetric matrix, we can recover these dynamics as the appropriate gradients of the mean fitness f̄(x) = x · Ax. This models n alleles of a gene locus (one of the early versions of the replicator equation), or the repeated play of games where A is the payoff matrix for the game (not necessarily symmetric). In our computational examples we will use matrices of the form
$$A = \begin{pmatrix} 0 & a & b \\ b & 0 & a \\ a & b & 0 \end{pmatrix}$$
When a = −b = 1 this matrix is known as a rock-paper-scissors game and the (continuous) replicator dynamic cycles about the interior point of the simplex. Otherwise the trajectories converge to the center of the simplex or diverge to the boundary depending on the relative values and signs of a and b. When a = b > 0 this matrix can be seen as a three-dimensional version of the hawk-dove game.
In the case that the mean fitness f̄(x) is zero (e.g. for zero-sum games such as the rock-paper-scissors game when a = −b), it is common to either remove the denominator of the dynamics or to apply the softmax function [31] to the fitness landscape. Either allows the discrete dynamics to be well-defined. We choose to drop the denominator, so in the computational examples below we typically have that
$$x^\Delta_i = x_i \left(f_i(x) - \bar{f}(x)\right)$$
for the replicator dynamic.
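A single step of each discrete dynamic can be written out as a short Python sketch. This is not taken from the pyed library: the function names are hypothetical, the payoff matrix is assumed to be circulant with rows (0, a, b), (b, 0, a), (a, b, 0), and the replicator step drops the mean-fitness denominator as described above.

```python
def make_landscape(a, b):
    """Linear fitness f(x) = A x for an assumed circulant payoff matrix A."""
    A = [[0.0, a, b],
         [b, 0.0, a],
         [a, b, 0.0]]

    def f(x):
        return [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]

    return f


def replicator_step(x, f, alpha):
    """x_i' = x_i + alpha * x_i * (f_i - mean), with mean = x . f(x)."""
    fx = f(x)
    mean = sum(xi * fi for xi, fi in zip(x, fx))
    return [xi + alpha * xi * (fi - mean) for xi, fi in zip(x, fx)]


def projection_step(x, f, alpha):
    """x_i' = x_i + alpha * (f_i - avg), with avg the unweighted mean of f(x)."""
    fx = f(x)
    avg = sum(fx) / len(fx)
    return [xi + alpha * (fi - avg) for xi, fi in zip(x, fx)]


f = make_landscape(a=2.0, b=1.0)
x = [0.5, 0.3, 0.2]
x_rep = replicator_step(x, f, alpha=0.01)
x_proj = projection_step(x, f, alpha=0.01)
print(x_rep, sum(x_rep))
print(x_proj, sum(x_proj))
```

Both steps keep the coordinates summing to one, because the subtracted mean keeps the update in the tangent space of the simplex. The projection step in this sketch does not clip at the boundary, so it is only meaningful for interior points and small α.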

Lyapunov Functions and Evolutionarily Stable States
For dynamical systems the issue of convergence is critical. As analytically solving non-linear differential or difference equations explicitly is often extremely difficult, a common method to demonstrate stability of a dynamical system and convergence to a rest point is to find a Lyapunov function [32] [33], often an energy-like or entropy-like quantity that is positive definite and decreasing along trajectories of the dynamic toward an equilibrium point [24]. The existence of such a function is often sufficient to demonstrate local or asymptotic stability of the dynamic, and bounds on convergence rate can often be determined. We now describe how to obtain a Lyapunov function for the replicator equation, though the story that follows generalizes to a much larger class of evolutionary dynamics [34].
The replicator equation is often studied in terms of evolutionarily stable states (ESS) [35], somewhat analogous to extrema of potential functions or stationary distributions. An ESS for a fitness landscape f is a state x̂ such that x̂ · f(x) > x · f(x) for all x ≠ x̂ in a neighborhood of x̂. It can also be defined in terms of robustness to invasion by mutant subpopulations, similar in concept to a Nash equilibrium, a mixture of strategies such that no player has an incentive to unilaterally deviate. In this sense it is a stable population state for the fitness landscape.
When a fitness landscape has an evolutionarily stable state (ESS), it is well-known in EGT that the KL-divergence is a (local) Lyapunov function of the dynamic. It can then be seen that interior trajectories of the replicator dynamic converge to the ESS, and for the standard replicator equation there can only be one such ESS interior to the simplex. We restate this result below, which we will generalize with momentum, in the following theorem. An information-theoretic interpretation of Theorem 1 is that the population is learning information about the environment and encoding that information in the population structure (the distribution over different types).
Theorem 1. Let x̂ be an ESS for a replicator dynamic. Then the KL-divergence
$$D(\hat{x}\,\|\,x) = \sum_i \hat{x}_i \log \frac{\hat{x}_i}{x_i}$$
is a local Lyapunov function for the discrete and continuous replicator dynamic.
Theorem 1 is often stated in various alternative forms. The discrete time version with geometric considerations appears in [34] and is predated by a number of variations, going back at least to [36] and [37] in forms recognizable as information-theoretic (cross-entropy), and ultimately to [10]. Similarly, the Euclidean distance D(x) = ½ ||x − x̂||² is a Lyapunov function for the projection dynamic [38] [39], also realizable as a Bregman divergence [17]. These functions can be derived directly from the underlying geometries, Fisher and Euclidean for the replicator and projection dynamics, respectively. Moreover, given an information divergence, an associated geometry and dynamic can be derived, and an analog of Theorem 1 holds [40]. The proof of Theorem 1 will be a special case of the proof of Theorem 2.
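The statement of Theorem 1 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the circulant payoff matrix with rows (0, a, b), (b, 0, a), (a, b, 0), takes a = 2 and b = 1 so that the center of the simplex is an ESS (for this matrix that holds whenever a + b > 0), and uses the discrete replicator step with the denominator dropped.

```python
import math


def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def fitness(x, a=2.0, b=1.0):
    """f(x) = A x for the assumed matrix with rows (0,a,b), (b,0,a), (a,b,0)."""
    return [a * x[1] + b * x[2], b * x[0] + a * x[2], a * x[0] + b * x[1]]


def replicator_step(x, alpha):
    """Discrete replicator step with the mean-fitness denominator dropped."""
    fx = fitness(x)
    mean = sum(xi * fi for xi, fi in zip(x, fx))
    return [xi + alpha * xi * (fi - mean) for xi, fi in zip(x, fx)]


x_hat = [1.0 / 3.0] * 3  # interior ESS of this landscape, since a + b > 0
x = [0.6, 0.3, 0.1]
history = []
for _ in range(2000):
    history.append(kl_divergence(x_hat, x))
    x = replicator_step(x, alpha=0.005)

# A Lyapunov function should decrease at every step along the trajectory.
monotone = all(later < earlier for earlier, later in zip(history, history[1:]))
print(history[0], history[-1], monotone)
```

The recorded divergences should decrease at every step and shrink by several orders of magnitude over the run.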

Evolutionary Dynamics with Momentum
To introduce momentum to these dynamics we proceed in accordance with the ML literature [13]. The discrete replicator equation with fitness landscape f and momentum β is given by:
$$z_i' = \beta z_i + F_i(x), \qquad x^\Delta_i = z_i' \tag{9}$$
where
$$F_i(x) = \frac{x_i \left(f_i(x) - \bar{f}(x)\right)}{\bar{f}(x)}$$
for the replicator equation and we have suppressed the step size α in x^Δ_i. Alternatively F could be a gradient ∇U. Similarly, we obtain the projection dynamic with momentum by instead substituting
$$F_i(x) = f_i(x) - \bar{f}(x)$$
where the mean is again the unweighted average fitness. When β = 0 the usual momentum-free dynamics are obtained.
Another variation, known as Nesterov momentum, differs from Polyak momentum in that the function F is evaluated at a look-ahead step weighted by the momentum, while the position update x^Δ_i = z_i' is unchanged. For both flavors of momentum the dynamic starts at some initial population state x_0, and the initial value z_0 of the momentum term can be chosen to be the zero vector.
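The momentum update for the replicator dynamic can be sketched as follows. This is a hedged illustration, not the pyed implementation: it assumes the circulant payoff matrix with rows (0, a, b), (b, 0, a), (a, b, 0), drops the mean-fitness denominator in F, and exposes a hypothetical `lookahead` weight. The look-ahead point x + lookahead·z is only a stand-in for a Nesterov-style step, since the exact look-ahead formula is not reproduced here.

```python
def replicator_F(x, a=2.0, b=1.0):
    """F_i(x) = x_i * (f_i(x) - mean fitness), with the denominator dropped."""
    fx = [a * x[1] + b * x[2], b * x[0] + a * x[2], a * x[0] + b * x[1]]
    mean = sum(xi * fi for xi, fi in zip(x, fx))
    return [xi * (fi - mean) for xi, fi in zip(x, fx)]


def momentum_step(x, z, alpha, beta, lookahead=0.0):
    """One step of z' = beta*z + F(y), x' = x + alpha*z'.

    y = x gives Polyak momentum (lookahead = 0). A nonzero `lookahead`
    evaluates F at y = x + lookahead*z, an assumed Nesterov-style variant.
    """
    y = [xi + lookahead * zi for xi, zi in zip(x, z)]
    Fy = replicator_F(y)
    z_new = [beta * zi + Fi for zi, Fi in zip(z, Fy)]
    x_new = [xi + alpha * zi for xi, zi in zip(x, z_new)]
    return x_new, z_new


def run(x0, alpha, beta, steps, lookahead=0.0):
    """Iterate from x0 with the momentum term initialized to the zero vector."""
    x, z = list(x0), [0.0] * len(x0)
    for _ in range(steps):
        x, z = momentum_step(x, z, alpha, beta, lookahead)
    return x


x0 = [0.6, 0.3, 0.1]
x_plain = run(x0, alpha=0.005, beta=0.0, steps=500)
x_polyak = run(x0, alpha=0.005, beta=0.5, steps=500)
print(x_plain, x_polyak)
```

Because the components of F sum to zero at any point whose coordinates sum to one, z stays in the tangent space and the iterates remain normalized, including at the look-ahead point.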

Lyapunov Stability and Momentum
Now we show that adding small amounts of momentum with a nonzero learning rate typically does not alter the evolutionarily stable states of these discrete dynamics. (We'll also see later that the ESS of the continuous dynamics are not affected for typical values of momentum.) We state Theorem 2 as a generalization of Theorem 1 for small values of momentum β.
Theorem 2. For small positive β, or negative β, if x̂ is an evolutionarily stable state for the landscape f, the KL-divergence is a local Lyapunov function for the replicator dynamic with momentum and the Euclidean distance D(x) = ½ ||x − x̂||² is a local Lyapunov function for the projection dynamic with momentum. If the fitness landscape is continuous, this also holds for Nesterov momentum.
The proof of the theorem is straightforward and given in the appendix. We note that it holds for any learning rate α, but the permissible values of β may vary with both α and the fitness landscape. Below, we develop a similar result for the continuous dynamic (the limit that α → 0) which works for any β ≠ 1. In general there cannot be a variant of Theorem 2 for Polyak momentum, arbitrary learning rate α, and arbitrary momentum β: the hypothesis that at least one of α and β is small is necessary (see examples in Figures 1 and 2).

Effect of Momentum on Rate of Convergence
While it's good to know that the addition of some memory to the replicator equation does not alter the stable states, a more interesting effect is the acceleration of convergence. This is why momentum is of interest in machine learning. For evolutionary processes, this acceleration suggests one reason why epigenetic mechanisms may have evolved and persisted.

Time to Converge
We can again use Lyapunov methods to see that the rate of convergence increases with momentum. Empirically we find that convergence takes fewer steps, by a factor of approximately (1 − β) relative to the momentum-free case, which we now support with an analytic argument. First we note that this factor makes sense intuitively given the iterative nature of momentum in Equation 9, since
$$\sum_{k=0}^{\infty} \beta^k = \frac{1}{1-\beta} \quad \text{for } |\beta| < 1.$$
Figure 2: For large values of momentum the dynamic may fail to converge, unlike the momentum-free case, if α is not sufficiently small. For all trajectories here α = 0.01, a = 2, and b = −1. Lowering α to 0.001 restores convergence of the red β = 0.9 curve.
Now we hold momentum constant and allow the learning rate to converge to zero, yielding a continuous replicator dynamic with momentum associated to the discrete replicator dynamic, as follows.
From Equation 9, as the discrete dynamic converges, we set z_i' = z_i to find that z_i = F_i / (1 − β). Letting α → 0 we obtain, after substituting z_i into the second equation,
$$\dot{x}_i = \frac{1}{1-\beta}\, x_i \left(f_i(x) - \bar{f}(x)\right) \tag{10}$$
where a factor of 1/f̄(x) has been removed for brevity, corresponding to a scaling of time. Setting β = 0 recovers the standard definition of the replicator equation just as in the discrete case. The leading factor of 1/(1 − β) can similarly be eliminated by a change of time scaling in the continuous case without altering the trajectories of the continuous dynamic; however, we retain it to argue explicitly that the convergence rate increases with β on (0, 1) relative to the base case β = 0. Similarly, the convergence slows down for β ∈ (−∞, 0). Note that traditionally in EGT, scaling the continuous replicator equation this way would not be considered particularly interesting since the trajectories (and stable points) do not change; however, the increased rate of convergence is of paramount importance in machine learning (and perhaps to actual evolving populations).
Let V_β = D(x̂, x_β) be the KL-divergence with x̂ an ESS and x_β denoting that the trajectories evolve in time according to the replicator dynamic with momentum β as in Equation 10. An easy calculation shows that
$$\frac{dV_\beta}{dt} = \frac{1}{1-\beta}\, \frac{dV_0}{dt} \tag{11}$$
(with both derivatives evaluated at the same state), where we've used Equation 10 and the chain rule, i.e. we effectively scale the derivative of the Lyapunov quantity by the leading factor. From this simple fact follows Theorem 3, which shows that the convergence of the dynamic and the velocity along trajectories are altered according to 1/(1 − β). The theorem is summarized in Figure 5 in the supplement.
Theorem 3. Let V_β be defined as above. Then we have that:
1. For −∞ < β < 1, the ESS of the dynamic with (Polyak) momentum are the same as for the momentum-free case; equivalently the KL-divergence is still a Lyapunov function for β < 1.
2. For 1 < β < ∞, the directionality of the trajectory is reversed (so any ESS for β < 1 is no longer an ESS).
3. The speed of the convergence is increasing on the interval (−∞, 1), and decreasing on (1, ∞) (with direction reversed in the latter case).
4. In particular, the speed of convergence is faster than the momentum-free dynamic for 0 < β < 1 and the ESS are unchanged.
For the continuous dynamic, in the case that the dynamic converges to an ESS, Equation 11 also shows that it takes approximately (1 − β) times as long for the dynamic to come within ε of the ESS as it does for the momentum-free dynamic, as measured by the KL-divergence, starting from the same initial point. Thus the trajectories converge more quickly as β ranges from 0 to 1, and the convergence slows for β < 0.
Returning to the discrete dynamic, for continuous landscapes and smaller α, we also roughly have that time-scale derivatives of the KL-divergence scale by 1/(1 − β), though we cannot as easily compare directly along trajectories and the associated trajectories will not trace out the same curves (as seen in the examples above), so there is not a direct analog of Equation 11. Nevertheless we may reasonably predict that it takes approximately (1 − β) times as many steps as the momentum-free dynamic (β = 0) to come within ε of the ESS, as demonstrated in the computational examples. This approximation improves as the learning rate α → 0. Computationally we also find that the dynamic with Nesterov momentum exhibits a similar behavior (Figure 3). While the argument of Theorem 3 does not directly apply to Nesterov momentum, for small β and a continuous fitness landscape, a continuity argument suggests that the same approximation holds.
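The predicted (1 − β) scaling of the step count can be probed with a short experiment. The sketch below makes several assumptions that are not from the original code: the circulant payoff matrix with a = b = 1, the replicator step with the denominator dropped, Polyak momentum with z_0 = 0, and a hypothetical tolerance ε = 10⁻⁴ on the KL-divergence from the center of the simplex.

```python
import math


def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def replicator_F(x, a=1.0, b=1.0):
    """F_i(x) = x_i * (f_i(x) - mean fitness) for the assumed circulant matrix."""
    fx = [a * x[1] + b * x[2], b * x[0] + a * x[2], a * x[0] + b * x[1]]
    mean = sum(xi * fi for xi, fi in zip(x, fx))
    return [xi * (fi - mean) for xi, fi in zip(x, fx)]


def steps_to_converge(x0, alpha, beta, eps, max_steps=1_000_000):
    """Count Polyak-momentum steps until D(x_hat || x) drops below eps."""
    x_hat = [1.0 / 3.0] * 3
    x, z = list(x0), [0.0] * 3
    for step in range(max_steps):
        if kl_divergence(x_hat, x) < eps:
            return step
        Fx = replicator_F(x)
        z = [beta * zi + Fi for zi, Fi in zip(z, Fx)]
        x = [xi + alpha * zi for xi, zi in zip(x, z)]
    return max_steps


x0 = [0.6, 0.3, 0.1]
base = steps_to_converge(x0, alpha=0.005, beta=0.0, eps=1e-4)
for beta in (0.25, 0.5, 0.75):
    steps = steps_to_converge(x0, alpha=0.005, beta=beta, eps=1e-4)
    # Columns: beta, steps, measured ratio, predicted ratio (1 - beta).
    print(beta, steps, steps / base, 1.0 - beta)
```

For a small learning rate the measured ratio should sit close to 1 − β, and the agreement is expected to tighten as α is reduced further.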
For completeness, we note that Theorem 3 also holds in an analogous manner for the projection dynamic, that is, for gradient descent on the Euclidean geometry, and should similarly apply to other Riemannian geometries as described in [34].

Momentum can break cycling into convergence or divergence
For the rock-paper-scissors landscape with a = −b ≠ 0, the replicator equation is not a gradient. Since the mean fitness is zero (the game is zero-sum as the payoff matrix is skew-symmetric), the gradient flow is degenerate (the dynamic is motionless). However, the replicator equation with this landscape is not degenerate and the phase portrait consists of concentric cycles of constant KL-divergence from the interior center of the simplex. The cycles are non-absorbing and the KL-divergence is an integral of motion. In the continuous case (α → 0), momentum alters the time to cycle around the central point, and possibly also reverses the directionality of the cycles, in accordance with the inequalities in Theorem 3.
In contrast, for non-zero learning rate α, we find computationally that the momentum can cause the trajectories to converge inward or diverge outward. For Polyak momentum, the memory of the prior iterations causes the divergence, preventing the dynamic from turning sufficiently. For Nesterov momentum, it is the look-ahead aspect of the momentum that induces the convergence by causing the dynamic to turn more quickly.
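The Polyak half of this observation can be probed with a small experiment. The sketch below is an illustration under assumptions rather than the authors' code: it uses the rock-paper-scissors landscape with a = 1 and b = −1 in the assumed circulant form, the replicator step without a denominator (the mean fitness is zero here), a hypothetical starting point, and the parameters α = 1/200 and β = 0.65. Nesterov momentum is left out because its exact look-ahead formula is not reproduced here.

```python
import math


def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rps_F(x):
    """F_i(x) = x_i * f_i(x) with f(x) = A x, a = 1, b = -1 (mean fitness is 0)."""
    fx = [x[1] - x[2], x[2] - x[0], x[0] - x[1]]
    return [xi * fi for xi, fi in zip(x, fx)]


def final_kl(x0, alpha, beta, steps):
    """KL-divergence from the center after `steps` Polyak-momentum iterations."""
    x_hat = [1.0 / 3.0] * 3
    x, z = list(x0), [0.0] * 3
    for _ in range(steps):
        Fx = rps_F(x)
        z = [beta * zi + Fi for zi, Fi in zip(z, Fx)]
        x = [xi + alpha * zi for xi, zi in zip(x, z)]
    return kl_divergence(x_hat, x)


x0 = [0.4, 0.35, 0.25]
kl_start = kl_divergence([1.0 / 3.0] * 3, x0)
kl_plain = final_kl(x0, alpha=0.005, beta=0.0, steps=2000)
kl_polyak = final_kl(x0, alpha=0.005, beta=0.65, steps=2000)
print(kl_start, kl_plain, kl_polyak)
```

In this sketch the momentum-free iteration also drifts outward very slightly, a discretization effect of the fixed step size, while the Polyak run should move away from the center much faster.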

Discussion
We've shown that momentum can accelerate evolutionary dynamics in the probability simplex just as it does for gradient descent in the machine learning literature. Lyapunov methods, commonly used to analyze dynamical systems but not yet as commonly applied in machine learning, allow us to show analytically and explicitly that momentum decreases the time to converge for values of momentum typically used in ML, and otherwise causes divergence or slowdown of trajectories for momentum outside of the interval [0, 1). Crucially, we have shown that learning rate and momentum interact, so that preservation of the convergence properties of the dynamic is guaranteed only for small β or α, despite the frequently realized speed-up in convergence for larger values of momentum.
Interpreting the results, we've shown that the convergence of evolutionary dynamics can be accelerated by a mechanism of memory that can be viewed as a simple model of intergenerational information exchange such as epigenetics. This may also apply to immunity or cultural exchanges of information and explain the origin and persistence of extra-genetic information exchange in lineages and populations.

Code
The code to generate the trajectories and plots in this manuscript is available as a Python library, pyed, at https://github.com/marcharper/pyed. Ternary plots were generated with the python-ternary library [41].

Proof of Equation 11
Taking the derivative of the KL-divergence, using the definition of the continuous replicator equation, gives the following:
$$\frac{d}{dt} D(\hat{x}\,\|\,x) = -\sum_i \hat{x}_i \frac{\dot{x}_i}{x_i} = -\sum_i \hat{x}_i \left(f_i(x) - \bar{f}(x)\right) = x \cdot f(x) - \hat{x} \cdot f(x)$$
This quantity is less than zero if x̂ is an evolutionarily stable state (which proves Theorem 1 in the continuous case). If we instead use the replicator equation with momentum (Equation 10), a factor of 1/(1 − β) is present on the right-hand side of the equation above, proving Equation 11.

Proof of Theorem 2
Proof. Since the KL-divergence is positive and zero only at x̂, essentially one just needs to show that the quantity
$$D^\Delta(x) = \frac{D(\hat{x}\,\|\,x') - D(\hat{x}\,\|\,x)}{\alpha}$$
is less than zero when x̂ is an ESS (for fixed α) to establish it as a discrete Lyapunov function.
A straightforward algebraic calculation shows that this quantity is bounded by which is less than 0 for sufficiently small β and the inequality defining an ESS.

Figure 1: Examples of altered convergence time for Polyak (top 2) and Nesterov (bottom 2) momentum. In all cases we use a landscape with a = 2 and b = 1 and α = 1/200. As β increases, the dynamics typically converge faster, and the trajectories are not identical since α > 0. However, for Polyak momentum (top), as the value of β becomes closer to 1, the Lyapunov quantity eventually fails to be monotonic along the entirety of the trajectory (it is at best local). Contrast with the Nesterov momentum trajectories (bottom) for the same parameters, which in this case are all monotonically decreasing.

Figure 3: Left: Convergence speed-up for Polyak momentum. For small learning rates, the convergence time (in iterations) is well approximated by (1 − β) times the momentum-free convergence time (β = 0). Right: The dynamic with Nesterov momentum is also fairly well approximated by a constant factor times the momentum-free convergence time, but is clearly not scaled by the same factor. The fitness landscape is defined by a = 1 = b.

Figure 4: For the rock-paper-scissors landscape (a = 1, b = −1), the momentum-free replicator equation cycles indefinitely with constant KL-divergence determined by the initial point. Adding momentum (here β = 0.65) with a non-zero learning rate (here α = 1/200) can cause the cycling to break into either convergence or divergence. In this case Nesterov momentum causes the dynamic to converge while Polyak momentum causes the dynamic to slowly diverge to the boundary.