BIRTH-DEATH DYNAMICS FOR SAMPLING: GLOBAL CONVERGENCE, APPROXIMATIONS AND THEIR ASYMPTOTICS

Motivated by the challenge of sampling Gibbs measures with nonconvex potentials, we study a continuum birth-death dynamics. We improve results of previous works [51,57] and provide weaker hypotheses under which the probability density of the birth-death dynamics governed by the Kullback-Leibler divergence, or by the χ²-divergence, converges exponentially fast to the Gibbs equilibrium measure, with a universal rate that is independent of the potential barrier. To build a practical numerical sampler based on the pure birth-death dynamics, we consider an interacting particle system, which is inspired by the gradient flow structure and the classical Fokker-Planck equation and relies on kernel-based approximations of the measure. Using the technique of Γ-convergence of gradient flows, we show that on the torus, smooth and bounded positive solutions of the kernelized dynamics converge on finite time intervals to the pure birth-death dynamics as the kernel bandwidth shrinks to zero. Moreover, we provide quantitative estimates on the bias of minimizers of the energy corresponding to the kernelized dynamics. Finally, we prove long-time asymptotic results on the convergence of the asymptotic states of the kernelized dynamics towards the Gibbs measure.


Introduction
Sampling from a given target probability distribution π has diverse applications, including Bayesian statistics, machine learning, and statistical physics, among many others. In practice, the measure π is often the Gibbs measure corresponding to a potential V : R^d → R:

π(x) = Z^{−1} e^{−V(x)},  Z = ∫_{R^d} e^{−V(x)} dx,

where Z is the typically unknown normalization constant.
Some of the most popular methods for sampling such distributions are based on the Markov chain Monte Carlo (MCMC) approach. Much of the research on MCMC has been devoted to designing Markov chains that are ergodic with respect to the target probability measure and enjoy fast mixing properties. Popular sampling methods include Langevin MCMC [35,67], Hamiltonian Monte Carlo [3,9,62,68], bouncy particle and zigzag samplers [8,10], affine-invariant ensemble MCMC [34], and Stein variational gradient descent (SVGD) [53]. When the potential function V is strongly convex, these sampling methods perform quite well; we refer to the recent literature [23,28,56,74,77] and references therein for their convergence and computational complexity. However, the efficiency of these sampling methods is hampered by the multimodality of π (corresponding to a nonconvex V), as it takes an exponentially long time for the sampler to hop from one mode to another. Many diffusion-based samplers suffer from this metastability issue, and numerous techniques have been proposed to alleviate it, including in particular parallel and simulated tempering [59,61,73] and adaptive biasing methods [24,36,43,47,75].
The sampling problem can be recast as an optimization problem on the space of probability measures [7,77]. Indeed, inspired by the seminal work of Jordan, Kinderlehrer and Otto [38], the Fokker-Planck equation associated to the (overdamped) Langevin dynamics can be viewed as the Wasserstein gradient flow of the KL-divergence, so that the Langevin dynamics can be seen as the steepest descent flow of the KL-divergence, along which the initial distribution flows towards the target distribution. The gradient-flow perspective provides a way towards building new sampling dynamics by designing new objective functions or new metrics on the manifold of probability measures. For example, the underdamped Langevin dynamics can be viewed as a Nesterov accelerated gradient descent on the space of probability measures [58].
Langevin dynamics converges exponentially fast to the Gibbs measure under the assumption that the target measure satisfies the logarithmic Sobolev inequality [6]. Unfortunately, the convergence rate is limited by the optimal constant in the log-Sobolev inequality, which can be very small when the target measure is multimodal. This is because it can take an exponentially long time for the dynamics to overcome the energy barriers and hop between the multiple modes.
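To make this metastability concrete, the following minimal sketch (our own illustration, not taken from the cited works) runs overdamped Langevin dynamics on the double-well potential V(x) = (x² − 1)²; with inverse temperature β = 10, the Kramers hopping time scales like e^{β ΔV}, and the trajectory essentially never leaves the well it starts in over the simulated horizon:

```python
# Overdamped Langevin on the double-well V(x) = (x^2 - 1)^2.
# Illustrative sketch: with small noise the sampler stays trapped in
# the well it starts in, since the well-hopping time grows like
# exp(barrier height / temperature).
import math, random

def langevin_trajectory(x0=1.0, beta=10.0, dt=1e-3, n_steps=50_000, seed=0):
    rng = random.Random(seed)
    x, traj = x0, []
    for _ in range(n_steps):
        grad_v = 4.0 * x * (x * x - 1.0)  # V'(x)
        # Euler-Maruyama step: dx = -V'(x) dt + sqrt(2/beta) dW
        x += -grad_v * dt + math.sqrt(2.0 * dt / beta) * rng.gauss(0.0, 1.0)
        traj.append(x)
    return traj

traj = langevin_trajectory()
# fraction of time spent in the starting (right) well
frac_right_well = sum(1 for x in traj if x > 0) / len(traj)
```

With these parameters the barrier V(0) = 1 gives βΔV = 10, so hops are exceedingly rare and the empirical fraction of time in the starting well stays near 1, illustrating why potential-independent rates are desirable.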

Birth-death dynamics, and their long-time convergence
The issues described above prompt the following question: Can one construct a gradient-flow dynamics for sampling that achieves a potential-independent convergence rate?
Recent work [57] by Lu, Nolen and one of the authors gives an affirmative answer to this question by proposing the following birth-death dynamics for sampling:

∂_t ρ_t = −ρ_t ( log(ρ_t/π) − KL(ρ_t|π) ).   (BD)

An equation of a similar type on discrete state spaces, known as the replicator equation, also appears in the information geometry literature; see [5, Chapter 6] and references therein. Note that the dynamics (BD) is agnostic to the normalization constant. It has been shown that (BD) is a gradient flow of the KL-divergence with respect to the spherical Hellinger distance d_SH defined by (10) (see [45, Section 3]). Furthermore, the authors show an exponential rate of convergence for initial data such that ρ_0 is bounded below by a positive multiple of π and KL(ρ_0|π) ≤ 1. More recent work [51] established exponential convergence of (BD) if ρ_0/π is bounded from above and below. In Theorem 2.4 we improve both results by showing that the solution ρ_t contracts to the equilibrium with a uniform (potential-independent) rate from any ρ_0 that is only bounded from below, but not necessarily above, with respect to π. The condition KL(ρ_0|π) ≤ 1 is no longer required. This has an important practical consequence: it shows that it is sufficient to start the dynamics with an initial density that is more spread out than π, which is easy to guarantee for many target measures π. The removal of the upper bound also allows us to choose a sufficiently wide round Gaussian to satisfy the pointwise lower bound for a large class of π on R^d.
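The potential-independent contraction can be seen numerically. The sketch below discretizes the birth-death flow ∂_t ρ = −ρ(log(ρ/π) − KL(ρ|π)) (our reading of (BD), up to time normalization) on a periodic grid with a bimodal target; the grid size, time step, and explicit Euler scheme are our own illustrative choices, not the paper's numerical scheme:

```python
# Minimal grid discretization of the birth-death flow
#   d/dt rho = -rho * (log(rho/pi) - KL(rho|pi)),
# illustrating fast decay of KL(rho_t|pi) despite the bimodal target.
import math

def normalize(p):
    s = sum(p)
    return [v / s for v in p]

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

n, dt = 200, 1e-2
xs = [2 * math.pi * i / n for i in range(n)]
# bimodal target on the torus, flat (well-spread) initialization
target = normalize([math.exp(-3 * math.cos(2 * x)) for x in xs])
rho = normalize([1.0] * n)

kl0 = kl(rho, target)
for _ in range(int(1.0 / dt)):  # integrate up to t = 1
    d = kl(rho, target)
    rho = normalize([r - dt * r * (math.log(r / p) - d)
                     for r, p in zip(rho, target)])
kl1 = kl(rho, target)  # substantially smaller than kl0
```

By t = 1 the KL divergence has already dropped well below e^{−1} of its initial value; the rate does not degrade when the barrier between the two modes is raised.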
In Section 3 we also investigate the birth-death dynamics that arises as the spherical Hellinger gradient flow of the χ²-divergence, which is at the basis of the algorithm proposed in [50]. In particular, we prove that the χ²-divergence between the dynamics and the target measure converges to zero exponentially fast, with a rate independent of the potential; see Theorem 3.1. This complements the result of [50], which proved the convergence of the reverse χ²-divergence.

Approximations to (BD) that allow for discrete measures
Since the equation (BD) is not well-defined when ρ is a discrete measure, it is unclear whether one can build interacting particle sampling schemes directly based on it. A principled way to build particle approximations to (BD) and (BD2) is to define new dynamics which approximate (BD) and (BD2) and are well-defined for discrete measures. A further desirable property for these dynamics is to retain the (spherical) Hellinger gradient flow structure.
In doing so we are inspired by the work of Carrillo, Craig, and Patacchini [16], who considered the Wasserstein gradient flow of the regularized KL-divergence as an approximation of the Fokker-Planck equation. In particular, they introduced the regularized energy

F_ε(ρ) = ∫ log(K_ε * ρ) dρ + ∫ V dρ.

The gradient flow of F_ε with respect to the Wasserstein metric, studied in [16], is

∂_t ρ = ∇ · ( ρ ∇ (δF_ε/δρ) ),

where δF_ε/δρ is the functional derivative of F_ε, defined by

δF_ε/δρ = log(K_ε * ρ) + K_ε * ( ρ / (K_ε * ρ) ) + V.

For discrete initial data, i.e. ρ_0 = Σ_i m_i δ_{x_i}, the solution remains a discrete measure for any positive time, where the evolution of the particles is given by a system of ordinary differential equations. It heuristically provides a deterministic particle approximation of the Fokker-Planck equation, though a rigorous convergence analysis remains open.
The gradient flow of F_ε with respect to the spherical Hellinger distance is the equation

∂_t ρ^{(ε)} = −ρ^{(ε)} ( δF_ε/δρ − ∫ (δF_ε/δρ) dρ^{(ε)} ).   (BD ε)

Note that the right-hand side is well-defined even if ρ^{(ε)} is a discrete measure. This suggests the possibility of approximating (BD ε) with interacting particles. Indeed, we will introduce and discuss in Section 4.5 a particle-based jump process whose mean-field limit is heuristically characterized by equation (BD ε), and present some numerical experiments on its application to sampling in Section 5.
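To convey the idea, here is a schematic and deliberately simplified particle version: birth-death rates are computed from a kernel density estimate standing in for K_ε * ρ, and particles are killed or duplicated accordingly. The resampling rule and all parameter values below are our illustrative assumptions, not the jump process of Section 4.5:

```python
# Schematic kernelized birth-death step for a particle ensemble, in the
# spirit of (BD_eps): rates come from a kernel density estimate of the
# ensemble, and over-represented particles are killed while
# under-represented ones are duplicated.
import math, random

def kde(xs, x, eps):
    # kernel density estimate standing in for (K_eps * rho)(x)
    c = 1.0 / (math.sqrt(2 * math.pi) * eps * len(xs))
    return c * sum(math.exp(-(x - y) ** 2 / (2 * eps * eps)) for y in xs)

def birth_death_step(xs, log_target, eps, dt, rng):
    rates = [math.log(kde(xs, x, eps)) - log_target(x) for x in xs]
    mean_rate = sum(rates) / len(rates)  # centering conserves mass on average
    new_xs = []
    for x, lam in zip(xs, rates):
        p = (lam - mean_rate) * dt       # kept below 1 by the small dt
        if p > 0 and rng.random() < p:
            continue                     # kill where rho overshoots pi
        new_xs.append(x)
        if p < 0 and rng.random() < -p:
            new_xs.append(x)             # duplicate where rho undershoots pi
    return new_xs

rng = random.Random(1)
log_target = lambda x: -0.5 * x * x      # unnormalized standard Gaussian
xs = [-4.0 + 8.0 * i / 149 for i in range(150)]  # well-spread initialization
for _ in range(150):
    xs = birth_death_step(xs, log_target, eps=0.3, dt=0.2, rng=rng)
frac_center = sum(1 for x in xs if abs(x) < 1) / len(xs)
```

Starting from a uniform spread, mass concentrates where the target is large: the fraction of particles in |x| < 1 grows from 0.25 towards the Gaussian value of roughly 0.68. Note that the rates are agnostic to the normalization of the target, matching the discussion above.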
Bias of global minimizers of F_ε. For sufficiently small ε, we expect F_ε to be only a small perturbation of F, and hence the global minimizers of F_ε should also be small perturbations of π. In Section 4.1 we prove that this bias is at most of order ε, improving on the qualitative Γ-convergence result in [16]. More precisely, we show that for any minimizer π_ε of F_ε, the Wasserstein distance between π_ε and π is at most of order O(ε). The optimality of this upper bound is demonstrated by numerical experiments.
Convergence of the dynamics, on finite time intervals. In Section 4.3 we investigate the convergence of the solutions of (BD ε) towards solutions of the pure birth-death dynamics (BD). More precisely, we use the Γ-convergence of gradient flows to show that on arbitrary finite time intervals, smooth solutions of (BD ε) with initial condition ρ_0 bounded below on a bounded domain converge to solutions of (BD) as ε → 0. For unbounded domains there are substantial difficulties in handling the decay of π at infinity, and proving Γ-convergence of gradient flows in that setting remains an open problem.
Convergence of the dynamics, asymptotic states. Since Γ-convergence of gradient flows is in general stated for finite time intervals, it does not directly imply that the asymptotic states of the dynamics (BD ε) converge towards the asymptotic state of (BD), namely π. This is a general issue for the convergence of gradient flows and is of practical interest: in many applications one uses approximate gradient flows with the aim of approximating the limiting state of the original gradient flow. In Section 4.4 we investigate the relationship between Γ-convergence of gradient flows and the convergence of asymptotic states. Specifically, in Proposition 4.19 we prove convergence of asymptotic states in the general setting of gradient flows in metric spaces. We apply the result in two settings: one is the setting of the gradient flows of this paper (Theorem 4.21), and the other (Theorem 4.20) is the convergence of two-layer neural networks studied in [37], which we discuss at the end of this section. In particular, we prove in Theorem 4.21 that π must be the only possible limit of the asymptotic states of (BD ε). The proof is general and relies on the fact that π is the unique minimizer of the KL divergence.
Application to convergence of asymptotic states for 2-layer neural networks. In [37], the authors considered the problem of learning a strongly concave function f using bump-like neurons whose kernels have width δ ≪ 1. As the number of neurons approaches infinity, the process of stochastic gradient descent with noise τ converges to the Wasserstein gradient flow of an entropy-regularized risk functional, posed on a smooth, convex and compact domain Ω; we refer to the resulting gradient flow equation as (4). The authors proved that as δ → 0, with suitable initial and boundary conditions, the solution of (4) converges strongly in L² to the solution (5) of the limiting gradient flow, which is the Wasserstein gradient flow of the limiting functional (6). Moreover, since (6) is displacement convex with respect to Wasserstein geodesics, (6) has a unique minimizer and (5) converges exponentially to that minimizer as t → ∞. The work [37], however, does not provide results on the long-time behavior of the regularized gradient flow (4), which is the numerically approximated dynamics. Our Proposition 4.19 provides the tools to prove convergence of limiting states, resulting in Theorem 4.20 below. For more results on the long-time convergence of such dynamics arising from the training of neural networks, we refer the reader to the recent works [19,21].
In Section 5 we provide two numerical experiments, one on a toy example of a 2-dimensional Gaussian mixture, another on a real-world Bayesian classification problem, to demonstrate the effectiveness of the birth-death algorithm. In both examples we observe that the birth-death sampler allows significantly faster mixing of particles compared to Langevin dynamics or SVGD. More specifically, in the multimodal Example 5.1, one can see in Figure 3 that, once a high-probability mode is discovered, the birth-death sampler facilitates movement of particles towards this newly discovered mode, which helps overcome the issue of metastability. In the real-world Example 5.2, where the non-convexity is not strong and SVGD works well, the birth-death sampler reaches the equilibrium in an extremely short time.

Contributions
We highlight the major contributions of the present paper as follows: • We prove that the pure birth-death dynamics (BD) converges globally to its unique equilibrium measure π with a uniform rate with respect to the KL-divergence, improving the results in [51,57]. See Theorem 2.4 for the precise statements. Using similar techniques, we also investigate the algorithm proposed in [50], and prove that their time-rescaled infinite-particle equation converges exponentially in χ²-divergence with a rate independent of the potential; see Theorem 3.1.
• We show that under suitable conditions, any global minimizer of the regularized energy (32) is within Wasserstein distance of order O(ε) of the target measure π. The precise statement can be found in Theorem 4.2.
• We show in Theorem 4.16 that smooth solutions of the kernelized dynamics (BD ε) with densities bounded above and below on the torus Γ-converge to the pure birth-death dynamics (BD) on any finite time horizon in the limit of small kernel width. As a corollary, this justifies the convergence of the density ρ_t^{(ε)} of the kernelized dynamics with width ε to the target measure π as ε → 0 and t → ∞.
• Finally, we show in Theorem 4.21 that on the torus, the long-time limits of (BD ε) converge to π with respect to the Hausdorff distance induced by the Wasserstein distance.

Related works
The pure birth-death dynamics (BD) is a gradient flow of the relative entropy with respect to the spherical Hellinger distance we define in (10). The mass-conservative metric (10) was introduced in [11,40,45] and used in the study of equations modeling fluids and population dynamics. Later it was applied in the analysis of the training process of neural networks [69,76]. In the context of statistical sampling, the spherical Hellinger metric was first applied in [57] to accelerate Langevin dynamics for sampling, where a local exponential convergence (Theorem 2.3) was proved. The paper [30] uses the idea of birth-death dynamics to improve the training of normalizing flows that learn the target distribution. The recent paper [50] constructs an ensemble MCMC algorithm whose mean-field evolution is given by the spherical Hellinger gradient flow of the χ²-divergence (see (BD2)). Convergence to the equilibrium was also established therein on a finite state space. The construction of the birth-death dynamics (BD) is also related to the recent study of unbalanced optimal transport and associated gradient flows. In particular, the unbalanced transportation metric, called the Hellinger-Kantorovich metric, which interpolates between the 2-Wasserstein and Hellinger metrics, was defined and studied in [20,39,48] and allows for transport between measures with different masses.
Ensemble-based sampling methods, another important motivation of our work, have also been widely studied in recent years. Ensemble-based sampling allows for a global view of the particle configuration and enables the particles to exchange information. One of the most successful sampling methods in this category is the affine-invariant sampler introduced by Goodman and Weare [34]; see also [29]. Ensemble-based sampling is also related to sequential Monte Carlo [25] and importance sampling [12,66]. From a continuous-time point of view, ensemble-based samplers can be developed via interacting particle systems, examples of which include ensemble Kalman methods [32,65], consensus-based sampling [13] (which also draws ideas from optimization [14,64]), and ensemble Langevin dynamics [54].
We are particularly interested in sampling approaches that are defined as gradient flows of a functional measuring the difference from the target measure. A variety of functionals and metrics have been considered. The blob particle method [16] is the Wasserstein gradient flow of a regularization of the KL divergence that allows for discrete measures, for which it becomes an interacting particle system. Such a viewpoint has been applied to sampling in [21], where the authors considered the Wasserstein gradient flow of a regularized χ² energy. A different approach to creating gradient-flow based interacting-particle systems for sampling is the SVGD introduced in [53]. There the authors consider the gradient flow of the standard KL-divergence with respect to a metric (now known as the Stein geometry) which requires smoothness of the velocities; thus the gradient-flow velocity makes sense even when considered for particle measures [27,52,55]. The work [18] provides a new perspective on SVGD by viewing it as a kernelized gradient flow of the χ²-divergence. A further direction of research considers Wasserstein gradient flows of the distance to the target measure in a very weak metric that is well-defined for particles, in particular the kernelized Stein discrepancy [42]. Recently, the work [44] considered using Wasserstein gradient flows for variational inference, using Gaussian mixtures to approximate the target density, with the evolution of means and variances governed by gradient flows.

Pure birth-death dynamics governed by relative entropy
Let us first introduce the Benamou-Brenier formulation of the Hellinger distance (this distance plays an important role in information geometry; see for example [1,5]) on (not necessarily probability) measures:

d_H(ρ_0, ρ_1)² = min { ∫_0^1 ∫ |u_t|² dρ_t dt : (ρ_t, u_t) satisfies ∂_t ρ_t + u_t ρ_t = 0, ρ_{t=0} = ρ_0, ρ_{t=1} = ρ_1 }.   (7)

If the measures satisfy ρ_0, ρ_1 ≪ λ for some probability measure λ, then one can explicitly compute the minimal cost in (7) and obtain

d_H(ρ_0, ρ_1)² = 4 ∫ ( √(dρ_0/dλ) − √(dρ_1/dλ) )² dλ.   (8)

Moreover, this expression does not depend on the specific choice of λ. Indeed, substituting u_t = −(∂_t (dρ_t/dλ)) / (dρ_t/dλ) into (7) gives the upper bound, and equality is attained along the path interpolating linearly between the square roots of the densities. From the expression (8) we derive immediately that for probability measures

d_H(ρ_0, ρ_1)² = 8 ( 1 − ∫ √((dρ_0/dλ)(dρ_1/dλ)) dλ ),

and hence d_H is a cone geodesic distance on the space of positive measures satisfying [45, (2.1)]. Thus from Theorem 2.2 and Corollary 2.3 of [45] (see also [5, Chapter 2]) it follows that the spherical Hellinger distance, obtained by restricting the configurations and paths to probability measures and considering the same path lengths, is given by

d_SH(ρ_0, ρ_1) = 2 arccos ( ∫ √((dρ_0/dλ)(dρ_1/dλ)) dλ ).   (10)

This implies that d_SH ≤ π, and furthermore d_SH(ρ_0, ρ_1) = π if and only if ρ_0 and ρ_1 are orthogonal measures. From the definition (10) one also observes immediately that d_H ≤ d_SH on probability measures. Furthermore, (P(R^d), d_SH) is a geodesic metric space and [45, Theorem 2.7] identifies the geodesics, based on geodesics w.r.t. d_H in the cone of positive measures.
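The closed-form expressions are easy to check numerically on discrete probability vectors. The sketch below uses one self-consistent normalization (in which d_SH ≤ π, with equality exactly for mutually orthogonal measures, as stated above); the constants 4 and 2 are our normalization choices and should be compared against one's preferred form of (8) and (10):

```python
# Closed-form Hellinger and spherical Hellinger distances between
# discrete probability vectors, in a normalization with d_SH <= pi and
# equality exactly for mutually singular measures.
import math

def hellinger(p, q):
    # d_H^2 = 4 * sum_i (sqrt(p_i) - sqrt(q_i))^2
    return 2.0 * math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def spherical_hellinger(p, q):
    # d_SH = 2 * arccos( sum_i sqrt(p_i * q_i) )
    affinity = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return 2.0 * math.acos(min(1.0, affinity))

p = [0.5, 0.5, 0.0]
q = [0.0, 0.0, 1.0]      # mutually singular with p
r = [0.25, 0.25, 0.5]
```

In this normalization one can verify numerically that d_SH(p, q) = π for the mutually singular pair and that d_H ≤ d_SH (the chord is shorter than the arc on the sphere of probability measures).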
The distances d_H and d_SH metrize strong convergence of measures, as we state in the lemma below.
Lemma 2.1. Suppose {ρ_n}_{n=1}^∞ and ρ are measures on R^d, all absolutely continuous with respect to some measure λ, and suppose that ρ has finite total mass. Then (11) holds. As a consequence of (11), if we further assume that ρ_n, ρ are probability measures on R^d, then (12) holds.

Proof. Thanks to (11), it suffices to prove (12), which is a direct consequence of the inequalities of [72, Theorem 2.1] comparing the Hellinger distance with the total variation distance.

Before presenting the main results of this section, let us state the general assumptions on π that we impose throughout this work.
Assumption 1. The invariant measure π and the initial condition ρ_0 are absolutely continuous with respect to the Lebesgue measure, with density functions π(x) and ρ_0(x). Let Ω := {x ∈ R^d : π(x) > 0}. We require that ρ_0 > 0 in Ω and ρ_0 = 0 in Ω^c.
The pure birth-death dynamics (BD) is the d_SH-gradient flow of the relative entropy KL(·|π). Under sufficient regularity hypotheses, for any energy functional G, the d_SH-gradient flow of G has the form

∂_t ρ = −ρ ( δG/δρ − ∫ (δG/δρ) dρ ).

Here δG/δρ is the first variation density of G at ρ [2,15]. The following lemma shows the well-posedness of (BD) whenever ρ_0 > 0 on Ω. In addition, the proof reveals the deeper structure of (BD) which indicates exponential convergence to π. We prove the convergence rate in Theorem 2.4. We note that a similar discussion is carried out in the proof of Theorem 2.1 in [51].
Lemma 2.2. Suppose ρ_0, π satisfy Assumption 1 and KL(ρ_0|π) < ∞. Then there exists a unique solution ρ ∈ C^1([0, ∞), L^1(Ω)) of (BD), where the differentiability is with respect to the L^1 norm, and the solution dissipates the KL-divergence.

Proof. Let ρ be a solution of (BD). Then ρ = 0 a.e. in space and time outside Ω. In Ω we have, for ρ-a.e. x, that the function η_t = log(ρ_t/π) satisfies the equation

∂_t η_t = −η_t + KL(ρ_t|π).   (15)

Note that KL(ρ_t|π) depends only on t and is bounded. Thus, using the theory of linear ODEs, should a solution of (15) exist, it has to be of the form

η_t = e^{−t} η_0 + ψ_t, where ψ_t = ∫_0^t e^{s−t} KL(ρ_s|π) ds.

Taking exponentials, we have

ρ_t = Ψ_t ρ_0^{e^{−t}} π^{1−e^{−t}},   (16)

where Ψ_t = e^{ψ_t}. Since the solution of (BD) must be a probability density for all t, we have Ψ_t^{−1} = ∫_Ω (ρ_0/π)^{e^{−t}} dπ, which is uniquely determined by ρ_0 and π. Also, Ψ_t must be finite and positive since, by Hölder's inequality, ρ_0^{e^{−t}} π^{1−e^{−t}} ∈ L^1(Ω) and ∫_Ω ρ_0^{e^{−t}} π^{1−e^{−t}} ≤ 1. This means the equation (BD) has at most one solution. Finally, we can verify the existence of a solution by substituting the expression (16) into (BD). Direct computation also verifies that the KL divergence is nonincreasing.
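The interpolation formula (16) can be checked numerically: the sketch below compares it, on a periodic grid, against a direct Euler integration of the flow ∂_t ρ = −ρ(log(ρ/π) − KL(ρ|π)); the grid, target, and step size are our own illustrative choices:

```python
# Numerical check of the interpolation formula
#   rho_t = Psi_t * rho_0^{exp(-t)} * pi^{1-exp(-t)},
# with Psi_t the normalization, against Euler integration of the flow.
import math

def normalize(p):
    s = sum(p)
    return [v / s for v in p]

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

n = 200
xs = [2 * math.pi * i / n for i in range(n)]
target = normalize([math.exp(-2 * math.sin(x)) for x in xs])
rho0 = normalize([math.exp(-math.cos(x)) for x in xs])

def closed_form(t):
    a = math.exp(-t)
    return normalize([r ** a * p ** (1 - a) for r, p in zip(rho0, target)])

# Euler integration of d/dt rho = -rho * (log(rho/pi) - KL(rho|pi))
dt, t_end = 5e-4, 0.5
rho = list(rho0)
for _ in range(int(t_end / dt)):
    d = kl(rho, target)
    rho = normalize([r - dt * r * (math.log(r / p) - d)
                     for r, p in zip(rho, target)])

err = sum(abs(a - b) for a, b in zip(rho, closed_form(t_end)))
```

The two solutions agree up to the time-discretization error, and evaluating the closed form at increasing times confirms that KL(ρ_t|π) is nonincreasing, as the lemma asserts.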
Before stating our result we recall the convergence result of [57].
We note that the above theorem requires the condition KL(ρ_0|π) ≤ 1; the result would still hold for KL(ρ_0|π) < 2 but not for larger bounds. The result in [51] removes the condition KL(ρ_0|π) ≤ 1, but it also requires a pointwise upper bound on ρ_0/π. We now state our main results, which improve the conditions for convergence above by removing the requirement KL(ρ_0|π) ≤ 1, with (18) as the only assumption. Furthermore, our second result establishes that KL(ρ_t|π) contracts exponentially fast to 0 for all times t ≥ 0. We remark that asymptotically our rate becomes slower than the one in Theorem 2.3; however, once our bounds ensure that KL(ρ_t|π) ≤ 1, one can apply the results of the above theorem.
Theorem 2.4. Under the assumptions of Lemma 2.2, let ρ_t satisfy the pure birth-death dynamics (BD) with initial condition ρ_0 ∈ L^1(Ω) ∩ P(Ω). Then for any ρ_0 satisfying

inf_Ω ρ_0/π ≥ e^{−M}   (18)

for some constant M, we have for all t > 0 the decay estimate (19) as well as the exponential contraction (20).
Proof. We first prove (19). Recall from the proof of Lemma 2.2 that ρ_t = Ψ_t ρ_0^{e^{−t}} π^{1−e^{−t}} for some Ψ_t > 0. Moreover, notice that under the condition (18),

Ψ_t^{−1} = ∫_Ω (ρ_0/π)^{e^{−t}} dπ ≥ e^{−M e^{−t}},

which implies Ψ_t ≤ e^{M e^{−t}}. Combining this bound on Ψ_t with (16) yields (19). We now prove (20). The proof strategy is a modification of [40, Lemmas 2.11 and 2.12] and [41, Theorem 4.1]. We divide the proof into three steps.
Otherwise the complementary quantity must be at least 1/2. Notice the elementary identity for probability densities used below; we estimate its left-hand side by (25), while on the other hand (26) holds. Thus, combining (25) and (26), in either case we obtain the desired functional inequality, and by further combining (22) we arrive at (24).

Step 3: Proof of exponential convergence. From the proof of Lemma 2.2 we have Ψ_t ≥ 1. By taking the infimum on both sides of (16), we obtain that ρ_t satisfies (18) with M e^{−t} playing the role of M. Therefore, a combination of direct time differentiation and (24) yields a differential inequality for KL(ρ_t|π), and the convergence result (20) then follows directly from a Gronwall inequality.
Remark 2.5. The condition (18) can be relaxed or modified if we add conditions on π. One such modification, inspired by [17, Proposition 3.23], is the following: suppose for some p ∈ [1, ∞) we have M_p(π) := ∫_Ω |x|^p dπ(x) < ∞, and that (28) holds in Ω; then we still have exponential convergence along (BD). The proof follows closely that of (19), the difference being the corresponding estimate of the normalization factor, which we substitute into (21) to finish the proof. The new assumption (28) covers almost all reasonable scenarios with ρ_0 being Gaussian, as long as π has a finite second moment. The upper bound in the assumption of [17] is unnecessary. As is suggested in [57], the optimal asymptotic convergence rate should be e^{−2t}, which is proved in [26] under a different set of assumptions.
Remark 2.6. Combining Langevin dynamics with the birth-death dynamics results in dynamics whose convergence rate is at least the maximum of the log-Sobolev rate of π and the rates obtained in Theorem 2.4. That is, suppose π satisfies a logarithmic Sobolev inequality with constant C_LSI; then for the combined dynamics, as long as ρ_0 satisfies (18), we obtain the corresponding exponential convergence. Convergence at the rate C_LSI is guaranteed even without the condition (18).
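As a quick numerical illustration of condition (18) (with one-dimensional targets that are our own choices, not the paper's examples), the following sketch computes inf log(ρ_0/π) on a grid: an initialization wider than a Gaussian target satisfies (18) with a modest M, while a narrower one fails badly in the tails:

```python
# Illustrative check of the pointwise condition (18), inf rho_0/pi >= e^{-M}:
# a Gaussian wider than the target satisfies it with a small M, while a
# narrower Gaussian has log(rho_0/pi) -> -infinity in the tails.
import math

def log_gauss(x, sigma):
    return -x * x / (2 * sigma * sigma) - math.log(sigma * math.sqrt(2 * math.pi))

xs = [-10 + 0.01 * i for i in range(2001)]
# wide initialization vs. standard Gaussian target: log(rho0/pi) bounded below
m_wide = min(log_gauss(x, 3.0) - log_gauss(x, 1.0) for x in xs)
# narrow initialization: the log-ratio degenerates in the tails
m_narrow = min(log_gauss(x, 0.5) - log_gauss(x, 1.0) for x in xs)
```

Here m_wide = −log 3 ≈ −1.1, so (18) holds with M ≈ 1.1, while the narrow initialization already reaches log-ratios below −100 on the plotted window, confirming that the spread-out initialization is the easy direction to guarantee.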
At the end of this section, we would like to use two examples to illustrate that the pointwise lower bound condition (18) is not numerically restrictive.In particular, if V has at least quadratic growth at infinity, one can choose ρ 0 to be any sufficiently wide round Gaussian to satisfy (18).
In both examples one finds that ρ_0 satisfies (18) with M = (d/2) log(d/(2ε)) + 4/ε, and therefore the burn-in time needed is of order log M.

Pure birth-death dynamics governed by chi-squared divergence

In this section we consider the spherical Hellinger gradient flow with χ²(ρ|π) := ∫_Ω (ρ/π − 1)² dπ as the energy functional:

∂_t ρ_t = −2 ρ_t ( ρ_t/π − ∫ (ρ_t/π) dρ_t ).   (BD2)

This is the dynamics appearing in [50, (3.6)]. There the authors first derive a related family of dynamics, [50, (3.1)], as the continuum limit of the ensemble Monte Carlo sampling schemes they introduce. For the dynamics [50, (3.1)] with kernel Q = Id, they prove exponential decay of the "reverse" χ²-divergence. In the time scaling we consider, this can be stated as χ²(π|ρ_{Z_t}) ≲ e^{−t}, where Z_t is a suitable rescaling in time. Thus the exponential rates of [50, Theorem 1] imply asymptotic exponential rates of order e^{−t} for χ²(π|ρ_t). However, due to the asymmetry of the χ²-divergence, the convergence result on χ²(π|ρ_t) from [50] does not directly imply a convergence result for χ²(ρ_t|π), although the latter is the more natural Lyapunov function for the gradient flow (BD2), since it is the underlying energy. In the next theorem, we show that χ²(ρ_t|π) also contracts exponentially fast and provide a quantitative estimate on the convergence rate. The authors of [50] also note that the dynamics we consider, (BD2), is formally a spherical Hellinger gradient flow.

Theorem 3.1. Let ρ_0, π satisfy Assumption 1, and let ρ_t ∈ C^1([0, ∞), L^1(Ω)) ∩ P(Ω) be the solution of (BD2) with initial condition ρ_0. Then, for any initial probability density ρ_0(x) such that inf_Ω ρ_0/π ≥ e^{−M} for some M > 1, we have exponential convergence to equilibrium along the dynamics (BD2), with the rate given in (30).

Proof. The proof is similar to that of (20) in Theorem 2.4. The core step is the functional inequality (31), which holds for any probability density. Now let us return to the dynamics (BD2). Taking the time derivative and writing e^{−M(t)} = inf_Ω ρ_t/π, by the Gronwall inequality the dynamics converge exponentially with an instantaneous rate depending on e^{−M(t)}. Finally, notice that

(d/dt) e^{−M(t)} ≥ e^{−M(t)} ( 1 − e^{−M(t)} ).

Therefore, solving the ODE, one has e^{−M(t)} ≥ e^t / (e^M + e^t − 1), which gives the convergence rate in (30).
Notice that lim_{t→∞} λ(t) = 2/9, which we believe is suboptimal, based on the results in Theorem 2.4 (i) as well as the observation made in [50]. On the other hand, if M ≫ 1, then the instantaneous convergence rate is O(1) only when e^{M−t} = O(1), which means t ≥ O(M). Hence the waiting time of (BD2) is longer than that of (BD), which is O(log M).
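For intuition, the following sketch integrates a grid discretization of (BD2), using the first-variation form ∂_t ρ = −2ρ(ρ/π − ∫(ρ/π) dρ) (our reading of the flow; the factor 2 depends on the chosen time normalization), and records the decay of χ²(ρ_t|π) for a bimodal target:

```python
# Grid discretization of the chi-squared birth-death flow (BD2):
#   d/dt rho = -2 * rho * (rho/pi - E_rho[rho/pi]),
# recording the decay of chi^2(rho_t|pi).
import math

def normalize(p):
    s = sum(p)
    return [v / s for v in p]

def chi2(p, q):
    return sum((a / b - 1.0) ** 2 * b for a, b in zip(p, q))

n, dt = 200, 1e-3
xs = [2 * math.pi * i / n for i in range(n)]
target = normalize([math.exp(-3 * math.cos(2 * x)) for x in xs])
rho = normalize([1.0] * n)

chi0 = chi2(rho, target)
for _ in range(int(6.0 / dt)):  # integrate up to t = 6
    ratio = [r / p for r, p in zip(rho, target)]
    mean = sum(rt * r for rt, r in zip(ratio, rho))  # equals chi^2 + 1
    rho = normalize([r - 2 * dt * r * (rt - mean)
                     for r, rt in zip(rho, ratio)])
chi_end = chi2(rho, target)
```

The χ²-divergence decays by a large factor over the simulated horizon, consistent with the exponential contraction of Theorem 3.1, though the observed early-time decay is much faster than the asymptotic rate discussed above.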

Kernelized dynamics and its particle approximations
In this section we investigate a particle-based approximation to the dynamics (BD). We first introduce a nonlocal approximation of (BD) that is based on regularizing the relative entropy: it is the spherical Hellinger gradient flow of the regularized entropy

F_ε(ρ) = ∫ log(K_ε * ρ) dρ + ∫ V dρ + C,   (32)

where C = log(∫ exp(−V(x)) dx). We first study the energy F_ε on the whole space. In Sections 4.2 and 4.3 we study the well-posedness and the convergence as ε → 0 of the gradient flow on the torus.
We now state the conditions on the kernel K ε that we require for our results.
Assumption 2 also guarantees that there exists a kernel ξ ≥ 0 such that K = ξ * ξ and ∫_{R^d} ξ = 1, namely the kernel whose Fourier transform is the square root of that of K. One example that satisfies Assumption 2 is the Gaussian kernel. We also use ξ_ε to denote ε^{−d} ξ(·/ε).
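For the Gaussian case the factorization K = ξ * ξ can be verified directly: a Gaussian of variance σ² is the convolution of two Gaussians of variance σ²/2. The grid, bandwidth, and Riemann-sum quadrature below are illustrative choices:

```python
# Numerical check that a Gaussian kernel factors as K = xi * xi
# (convolution), with xi a Gaussian of half the variance of K.
import math

def gauss(x, var):
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

h = 0.01
xs = [-8 + h * i for i in range(1601)]
var = 0.25                                 # K has variance 0.25
xi = [gauss(x, var / 2) for x in xs]       # xi has variance 0.125

def conv(f, g, idx):
    # (f * g)(xs[idx]) by a Riemann sum on the grid; index n//2 is x = 0
    n = len(xs)
    return h * sum(f[j] * g[idx - j + n // 2]
                   for j in range(n) if 0 <= idx - j + n // 2 < n)

idxs = [400, 800, 1200]                    # sample points x = -4, 0, 4
errs = [abs(conv(xi, xi, i) - gauss(xs[i], var)) for i in idxs]
```

The quadrature error is negligible here because the integrand is smooth and decays rapidly inside the truncated domain, so (ξ * ξ)(x) matches the wider Gaussian K(x) to high accuracy at all sampled points.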

Quantitative distance between minimizers
The Γ-convergence of F ε , defined in (32), to relative entropy KL(ρ|π) and the convergence of minimizers are already proved in [16], which we restate in the following Theorem 4.1.
However, [16] did not prove quantitatively how close the minimizers π_ε are to π. In the case where V is sufficiently regular and has between quadratic and quartic growth, we can prove a more quantitative result. We remark that, while our assumption on V is slightly stronger than that of Theorem 4.1, we do not require K to be Gaussian in the following theorem.

Theorem 4.2. Suppose K_ε satisfies Assumption 2 with some ξ ∈ P_4(R^d). Suppose π satisfies a Talagrand inequality [63] with some constant m > 0:

W_2(ρ, π)² ≤ (2/m) KL(ρ|π)   (33)

for any probability measure ρ. Suppose also that V satisfies the stated regularity and growth conditions for some L > 1. Then, for any 0 < ε < m/(4 L M_2(ξ)), any minimizer π_ε of F_ε defined in (32) satisfies (35) for some constant C = C(π, ξ) > 0 independent of ε; in particular W_2(π_ε, π) ≤ O(ε), as stated in (36).

We remark that an analogous result holds if we consider the energy on the torus T^d; the integrals are then taken on the torus, while for evaluating convolutions all functions are extended periodically to R^d.

Proof. To get started, we have the expansion (40) for arbitrary ρ. In particular, taking ρ = π_ε where π_ε is any minimizer of F_ε, we obtain an upper bound on the energy difference. On the other hand, since K_ε = ξ_ε * ξ_ε and x → log x is concave, we use the Jensen inequality to obtain (41). We then use the regularity of V and the Talagrand inequality: for the first term we use the triangle inequality, and for the second term, by Taylor expansion, we obtain a second-order remainder evaluated at some ζ ∈ [x, y]. We then appeal to the symmetry of the kernel ξ to cancel the first-order term and bound the remainder through an arbitrary coupling γ(x, y) between ρ and π, which yields (42). Combining (40), (41) and (42) for ρ = π_ε, we obtain that there exists a constant C = C(π, ξ) > 0 independent of ε such that the claimed inequality holds, which leads to (35) after completing the square, and to (36) after optimizing the right-hand side with respect to W_2(π_ε, π).
Remark 4.3. We are not able to prove the sharpness of the error estimate (35). However, we demonstrate via numerical experiments that the error bound O(ε) above appears to be optimal. Consider the target measure π(x) ∝ e^{−V(x)} on the 1-dimensional torus T = R/2πZ with potential V(x) = sin x + 2 sin 2x. We solve the equations (BD) and (BD ε) using the finite difference method and use numerical integration to compute the evolution of KL(ρ_t^{(ε)}|π) for various values of ε. The left side of Figure 1 shows that for a fixed ε > 0, KL(ρ_t^{(ε)}|π) approaches a positive constant as t → ∞. The right side of Figure 1 plots this limiting value as a function of ε as ε → 0. This is consistent with the scaling in (35) in view of the Talagrand inequality (33).
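The experiment of this remark can be reproduced in a few lines. The sketch below integrates a grid discretization of (BD ε) with the first variation δF_ε/δρ = log(K_ε*ρ) + K_ε*(ρ/(K_ε*ρ)) + V (our reading of the regularized energy; grid size, time step, and the wrapped-Gaussian kernel are illustrative choices, not the paper's scheme), and compares the limiting KL values for two bandwidths:

```python
# Sketch of the finite-difference experiment behind Remark 4.3: integrate
# the kernelized flow (BD_eps) on a periodic grid for two bandwidths and
# compare the limiting values of KL(rho_t|pi), which grow with eps.
import math

n = 50
h = 2 * math.pi / n
xs = [h * i for i in range(n)]
V = [math.sin(x) + 2 * math.sin(2 * x) for x in xs]   # potential from the remark
Z = h * sum(math.exp(-v) for v in V)
target = [math.exp(-v) / Z for v in V]                # density of pi

def kernel(eps):
    row = []
    for i in range(n):
        d = min(xs[i], 2 * math.pi - xs[i])           # periodic distance to 0
        row.append(math.exp(-d * d / (2 * eps * eps)))
    s = h * sum(row)
    return [v / s for v in row]                       # normalized as a density

def limiting_kl(eps, dt=0.02, t_end=8.0):
    k = kernel(eps)
    conv = lambda f: [h * sum(k[(i - j) % n] * f[j] for j in range(n))
                      for i in range(n)]
    rho = [1.0 / (2 * math.pi)] * n                   # flat start
    for _ in range(int(t_end / dt)):
        kr = conv(rho)
        ratio = [r / c for r, c in zip(rho, kr)]
        dF = [math.log(c) + kc + v
              for c, kc, v in zip(kr, conv(ratio), V)]
        mean = h * sum(f * r for f, r in zip(dF, rho))
        rho = [r * (1 - dt * (f - mean)) for r, f in zip(rho, dF)]
        s = h * sum(rho)
        rho = [r / s for r in rho]                    # renormalize
    return h * sum(r * math.log(r / p) for r, p in zip(rho, target))

kl_small, kl_big = limiting_kl(0.25), limiting_kl(0.5)
```

Both runs settle at a positive KL value, and halving the bandwidth visibly reduces the bias, consistent with the scaling behavior described in the remark.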

Well-posedness of gradient flows (BD ε )
In this section we develop the well-posedness of (BD ε) considered on the torus T^d and establish the elements of its gradient flow structure relevant for the convergence argument. Due to the smoothness of the terms of the equation, it is easy to show well-posedness in the classical sense of ODEs in Banach spaces. More precisely, we first show local well-posedness in the spaces of measures on the torus which are bounded from above and below by a positive constant. We then show L^∞ bounds that allow us to extend the solution up to some large T_ε → ∞ as ε → 0. After establishing the existence and uniqueness of classical L^1 solutions, we show that both energies F and F_ε are λ-geodesically convex with respect to d_SH, in the restricted sense described in Definition 4.9. We then characterize the subdifferential of F_ε with respect to the d_SH geometry. These allow us to show that the classical solutions coincide with gradient flow solutions, as well as with the curves of maximal slope.
For C > 1 let P_C denote the set of probability densities on T^d bounded below by 1/C and above by C.

Lemma 4.4. Assume K_ε satisfies Assumption 2, and π(x) ∝ e^{−V(x)} where V is a C^2 function on T^d. For any C > 1 and ρ_0 ∈ P_{C/2} there exist T > 0, independent of ρ_0, and a unique solution of (BD ε) in C^1([0, T], P_C) with initial condition ρ_0.

Proof. The existence and uniqueness follow from the classical existence theory for ODEs in Banach spaces.
In particular we claim that the right-hand side of (BD ε) is Lipschitz continuous and bounded on P_C. Showing this is straightforward and we only sketch the argument. Observe that ρ ↦ K_ε * ρ is a 1-Lipschitz mapping on L^1 and that 1/C < K_ε * ρ < C for all ρ ∈ P_C. We also use that V is bounded and continuous on T^d. Furthermore, note that the remaining terms enjoy analogous bounds. Combining these facts provides the desired constants L and M. Thus there exists T > 0 such that a unique solution exists for all t ∈ [0, T], where the derivative is taken in t. The proof that the solution is in C^1(T^d) follows in the standard way. Namely, we first observe that once we know that ρ^{(ε)} is a solution in P_C, it satisfies ∂_t ρ^{(ε)} = h(x, t) ρ^{(ε)} where h is continuous and bounded. This implies that ρ^{(ε)}(t) ∈ C(T^d) for all t ∈ [0, T]. To show the C^1 regularity one needs to show that u = ∂_{x_i} ρ^{(ε)} satisfies an integral equation. Note that taking a derivative of (BD ε) gives that u satisfies a linear integral equation. Obtaining the existence of the solution and showing that it is the derivative of ρ^{(ε)} is straightforward, and we do not present the details to conserve space.
The next few lemmas aim to establish L^∞ upper and lower bounds for ρ^{(ε)}.
Lemma 4.6. Let ε > 0. Assume ρ^{(ε)} is a solution of (BD ε) with initial condition ρ_0 ∈ C^1(T^d) ∩ P_C(T^d) for some C > 1, and let w_t^{(ε)} = log ρ_t^{(ε)}. Then there exists a constant C such that the estimate (45) holds for any t ∈ [0, min{T, T_ε}], where T_ε is defined in the proof and satisfies T_ε → ∞ as ε → 0.
Proof. We start with the observation that w_t^{(ε)} satisfies the equation

Taking a spatial derivative, we have

Now fix an index i and let L_ε(t) = sup_x |∂_{x_i} w_t^{(ε)}|; we have

After taking the supremum on the left side above, this turns into

Notice that for every ε > 0, the solution z_ε to the ODE problem blows up at the finite time

However, as ε → 0, one has that z_ε(0) → 0 and ε‖∇V‖_{L^∞(T^d)} → 0. Therefore,

Notice that z_ε(t) is an increasing function of both t and ε. Let

Then L_ε(t) ≤ C for all t ≤ min{T, T_ε}. Taking account of the above in the differential inequality (47), one obtains from Gronwall's inequality that, for all t ∈ [0, min{T, T_ε}],

which is precisely (45).
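For the reader's convenience, the form of Gronwall's inequality invoked here (and again in the proof of Lemma 4.7) is the standard differential version:

```latex
y'(t) \le a(t)\, y(t) + b(t) \ \text{ for a.e. } t \in [0,T]
\quad \Longrightarrow \quad
y(t) \le e^{\int_0^t a(s)\,ds}\, y(0) + \int_0^t e^{\int_s^t a(r)\,dr}\, b(s)\, ds .
```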
Lemma 4.7. Under the same conditions as Lemma 4.6, suppose that w_t^{(ε)} is L-Lipschitz for all t ∈ [0, T]. Then

and inf

Here C_π is the constant depending on π that appears in Theorem 4.2, such that

To apply this lemma one can use the Lipschitz estimate of Lemma 4.6 and take L to be the right-hand side of (45) for t = T_ε.
Proof. Notice that

Using Gronwall's inequality, we have

The proof of the lower bound is similar, since

Theorem 4.8. Let K_ε satisfy Assumption 2 and be supported in B_ε. Then the solution of (BD ε) exists at least on [0, T_ε).

Proof. Let ε > 0. By the local well-posedness of Lemma 4.4 we know that a unique positive solution exists on some time interval [0, T_0). Consider the time T_ε defined in the proof of Lemma 4.6. We claim that the solution of (BD ε) exists at least on [0, T_ε). Namely, if the maximal time of existence T is less than T_ε, then by Lemma 4.7, ρ^{(ε)} is uniformly bounded from below and above by positive constants. Thus there exists C > 1 such that ρ_t^{(ε)} ∈ P_C for all t ∈ [0, T). By applying the local existence result of Lemma 4.4 starting at a time τ close enough to T, we can extend the solution beyond T and obtain a contradiction.
In order to study the convergence of (BD ε) to (BD) as ε → 0, we will rely on their gradient flow structure, which we investigate next.

Definition 4.9. Let G be a functional defined on P_C. We say G is λ-geodesically convex in P_C with respect to d_SH if, for any ρ_0, ρ_1 ∈ P_C, the d_SH-geodesic (ρ_t)_{t∈[0,1]} connecting ρ_0 to ρ_1 satisfies, for all t ∈ [0, 1],

    G(ρ_t) ≤ (1 − t) G(ρ_0) + t G(ρ_1) − (λ/2) t(1 − t) d_SH^2(ρ_0, ρ_1).

Note that the above definition does not require P_C itself to be a geodesically convex set for d_SH. As long as ρ is bounded above and below away from 0, both the relative entropy F and the regularized entropy F_ε are geodesically semiconvex with respect to d_SH, which is why we introduced the restricted submanifold P_C. For general results regarding displacement convexity with respect to the Hellinger-Kantorovich distance, we refer the readers to [49].
Proof. Consider ρ_0, ρ_1 ∈ P_C with ρ_0 ≠ ρ_1. Let us recall the d_SH-geodesic from ρ_0 to ρ_1 given by the expression in [45, Lemma 2.7]:

where ρ̃_t is the d_H-geodesic given in explicit form in (9). Since d_H(ρ_0, ρ_1) ≤ 2√2, we obtain from (10) that d_SH(ρ_0, ρ_1) ≤ π, hence β_t is well-defined and increases from 0 to 1. We may also obtain

which indicates that ρ_t ∈ P_{2C} for all t ∈ [0, 1]. Thus, one can explicitly calculate that for F(ρ) = KL(ρ|π),

The contribution from the second term is already non-negative; therefore, in view of the pointwise bounds on ρ_t and π, we only need to prove (53). Taking the second derivative in (51), using the chain rule and the quotient rule, we obtain

We complete the proof of (53) by combining the following facts:

We now turn our attention to the regularized energy F_ε. Taking the second derivative and following the same calculation as above, one has

Here the second term on the right side of the second line can be bounded using Assumption 2 on K_ε and Jensen's inequality:

The third term on the right side of the second line can be treated in an identical way. In view of (53), it remains to show

which is true since ρ_0, ρ_1 ∈ P_C and we may apply the bounds in (54).

We now introduce the elements needed to consider (BD ε) as a d_SH gradient flow. The space of all probability densities on T^d equipped with d_SH is a complete Riemannian manifold with corners. The explicit formula for geodesics (51) ensures that for all ρ_0, ρ_1 ∈ P_C and all t ∈ [0, 1] the geodesic satisfies ρ_t ∈ P_{2C}. The tangent spaces were already identified in the earlier work [31]: for ρ ∈ P_C, the tangent space at ρ with respect to d_SH can be identified with

We now compute the geometric logarithm (the inverse of the exponential map) on the space of probability measures endowed with the d_SH geometry at a point ρ ∈ P_C for some C. In other words, we compute the tangent vector to the unit-time geodesic connecting two measures.

Lemma 4.11. For ρ, µ ∈ P_C, we have

Proof. Let {ρ_t}_{t∈[0,1]} be the d_SH-geodesic connecting ρ to µ. We refer the readers to the expression in (51); applying the chain rule, we obtain

We note that, due to the remark after (10), sin(d_SH(ρ, µ)/2) > 0.
Note that the tangent spaces are Hilbert spaces, which allows us to define the subdifferentials as classical Fréchet subdifferentials:

We note that ϕ is the Fréchet differential if the inequality above is replaced by an equality.
Lemma 4.13. Consider ρ ∈ P_C for some C > 1. Suppose G : P → R is proper and lower semicontinuous. Assume that there exists ζ ∈ L^∞(T^d) such that for all h ∈ L^2(ρ) which are essentially bounded from below (i.e. h^− ∈ L^∞), the first variation of G in direction h exists and is of the form (δG/δρ)|_ρ[h] = ∫ ζ h dx. Furthermore, assume that G is λ-geodesically convex in P_C with respect to d_SH for some λ ∈ R. Then

Proof. Let us first show that any element of ∂_SH G(ρ) must be equal to ζ − ∫_{T^d} ζρ dx. Assume that ξ ∈ ∂_SH G(ρ), namely that it satisfies (59). For ϕ ∈ L_0^2(ρ) ∩ L^∞(T^d) and t > 0, considering tϕ in place of ϕ in (59), dividing by t and taking the limit as t → 0 provides ∫ ζϕρ dx ≥ ∫ ξϕρ dx.
Combining this with the λ-convexity of G implies that ζ − ∫_{T^d} ζρ dx ∈ ∂_SH G(ρ).
The above lemma allows us to identify the subgradients of F and F_ε. Since the subgradients are singletons, we identify each set with its only element. That is, for ρ ∈ P_C,

The last step to rigorously identify (BD) and (BD ε) as d_SH-gradient flows is to show that they are curves of maximal slope. Let us recall that for a curve ρ_t, the metric derivative is given by

and the metric slope is defined as [2, (10.0.9)]

We say that a path ρ ∈ AC([0, T], (P_C, d_SH)) is a gradient flow solution of the corresponding equation if the associated energy dissipation inequality holds. One inequality follows by considering the limit along µ_t → ρ. To show the opposite inequality, note that

where

With the above preparation, we are able to rigorously identify (BD ε) as the d_SH gradient flow of F_ε. The existence results of Lemma 2.2 and Theorem 4.8 imply that during the interval of existence the solutions remain in P_C. Furthermore, the solutions are in AC([0, T], (P_C, d_SH)), which can be verified by a direct check based on the solution formula of Lemma 2.2 for ρ_t, and by ρ^{(ε)} ∈ C^1([0, T], L^2(T^d)) using Theorem 4.8. Thus ρ and ρ^{(ε)} are gradient flows of the respective equations, and consequently curves of maximal slope.
Due to the semiconvexity of the functionals, the solutions also satisfy an evolution variational inequality (Chapter 11 of [2] for Wasserstein gradient flows, and [60] for general metric spaces). This implies that gradient flow solutions are unique. In particular, while our notion of λ-convexity is restricted, the proof of quantitative stability in Theorem 11.1.4 of [2] carries over. Thus gradient flow solutions coincide with the solutions of Lemma 2.2 and Theorem 4.8, respectively.
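For completeness, the evolution variational inequality in question has the standard form: a curve (ρ_t) is an EVI_λ solution of the gradient flow of G in (P_C, d_SH) if, for every σ ∈ P_C and a.e. t > 0,

```latex
\frac{1}{2} \frac{d}{dt}\, d_{SH}^2(\rho_t, \sigma)
+ \frac{\lambda}{2}\, d_{SH}^2(\rho_t, \sigma)
\le G(\sigma) - G(\rho_t).
```

Applying this to two solutions (with σ taken to be the other solution) and using Gronwall's lemma gives d_SH(ρ_t, ρ̃_t) ≤ e^{−λt} d_SH(ρ_0, ρ̃_0), which is the quantitative stability, and in particular the uniqueness, referred to above.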

Γ-convergence of gradient flows
The next question is whether the dynamics (BD ε) converges in any sense to the idealized dynamics (BD) as ε → 0. The natural notion to study is the Γ-convergence of gradient flows à la Sandier-Serfaty [70, 71]. We give a proof on T^d with a compactly supported kernel. The following theorem, essentially a rephrasing of [71, Theorem 2] adapted to our setting, summarizes the conditions we need to verify for the Γ-convergence of gradient flows.

Let ρ_t^{(ε)} be solutions of (BD ε) that are curves of maximal slope. Suppose the initial conditions are well-prepared, in the sense that, as ε → 0,

Suppose also that, as ε → 0, we have ρ_t^{(ε)} → ν_t in d_SH for almost every t, as well as the following conditions:

Proof. The plan of our proof is to first use the Arzelà-Ascoli theorem to identify the limiting curve ν_t, and then verify the three conditions of Theorem 4.15.
Notice that ∇ρ_t^{(ε)} = ρ_t^{(ε)} ∇w_t^{(ε)}; hence the combination of Lemmas 4.6 and 4.7, together with π ∈ P_C, yields that ∇ρ_t^{(ε)} is uniformly bounded for any t ∈ [0, T] when ε is sufficiently small. Thus, since ρ_t^{(ε)} is also uniformly bounded (cf. Lemma 4.7), we may invoke the Arzelà-Ascoli theorem, so that there exists a subsequence, still denoted by ρ_t^{(ε)}, that converges uniformly to some function ν_t as ε → 0.
We now verify the three conditions of Theorem 4.15. The proof of (ii) is straightforward by the Γ-convergence of the energy functionals [16, Theorem 4.1] (note that uniform convergence implies the convergence required there).

We now prove (i). The proof is standard and follows the arguments in [22, Theorem 5.6]. Assume without loss of generality that there exists 0 ≤ C < ∞ so that the metric derivatives are bounded by C in L^2(0, T); then, up to a further subsequence, they are weakly convergent to some v(t) ∈ L^2(0, T). Consequently, for any 0 ≤ s ≤ t ≤ T,

By taking limits in the definition of the metric derivative and using the lower semicontinuity of d_SH with respect to weak-* convergence (see the proof of [39, Theorem 5] for d_H, which also applies to d_SH using the conic structure (10)),

By [2, Theorem 1.1.2], this implies that |∂_t ν_t| ≤ v(t) for a.e. t ∈ (0, T). Thus, by the lower semicontinuity of the L^2(0, T) norm with respect to weak convergence,

Regarding (iii), we first claim that the uniform convergence of ρ_t^{(ε)} to ν_t implies that K_ε * ρ_t^{(ε)} also converges uniformly to ν_t as ε → 0.
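A sketch of the estimate behind this claim, using only that K_ε ≥ 0 has unit mass and is supported in B_ε:

```latex
\| K_\varepsilon * \rho^{(\varepsilon)}_t - \nu_t \|_{L^\infty}
\le \| K_\varepsilon * \bigl( \rho^{(\varepsilon)}_t - \nu_t \bigr) \|_{L^\infty}
 + \| K_\varepsilon * \nu_t - \nu_t \|_{L^\infty}
\le \| \rho^{(\varepsilon)}_t - \nu_t \|_{L^\infty}
 + \sup_{|y| \le \varepsilon} \| \nu_t(\cdot - y) - \nu_t \|_{L^\infty},
```

and the last term vanishes as ε → 0 by the uniform continuity of ν_t on T^d.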

Convergence of asymptotic sets
While F_ε defined in (32) may not have a unique minimizer, and the dynamics (BD ε) may not converge to a unique probability distribution as t → ∞, the Γ-convergence of the regularized gradient flows (BD ε) to (BD) and the long-time convergence of the limiting gradient flow (BD) guarantee that the long-time limiting set of (BD ε) is close to that of (BD), which is {π}. The goal of this section is to discuss some sufficient conditions under which convergence of the asymptotic sets of gradient flows holds.
We first present Proposition 4.19 below, which we state for general gradient flows in metric spaces. In particular, the proposition can be applied to gradient flows in the Wasserstein metric and has interesting consequences for 2-layer neural network training, which we discuss in Theorem 4.20.
Proposition 4.19. Let E_ε and E be energy functionals, and let (ρ_t^{(ε)})_{0≤t<T_ε^*} and (ρ_t)_{0≤t<∞} be continuous curves in a metric space with metric d, with maximal existence times T_ε^* and ∞ respectively, such that E_ε is nonincreasing along ρ_t^{(ε)} and E is nonincreasing along ρ_t. Assume the following conditions hold:

(i) E has a unique minimizer π, and ρ_t converges to π as t → ∞ in the sense that E(ρ_t) → E(π).
(iii) As ε → 0, we have T_ε^* → ∞. Moreover, for every t ≥ 0, we have that ρ_t^{(ε)} converges to ρ_t.

(iv) The sub-level sets of E_ε are uniformly precompact in the following sense: there exists ε_0 > 0 such that for any M ∈ (0, ∞) the set ⋃_{0<ε<ε_0} {ρ : E_ε(ρ) ≤ M} is precompact.

Then we have the following:

(a) For any ε > 0 and any time sequence (T_ε)_ε such that T_ε < T_ε^* and lim_{ε→0} T_ε = ∞, we have d(ρ_{T_ε}^{(ε)}, π) → 0 as ε → 0.

(b) Let ω(ρ^{(ε)}) be the ω-limit set of ρ_t^{(ε)}; then

Proof. We only prove (a), since the proof for (b) is identical. Fix δ > 0; then there exists T > 0 such that E(ρ_T) ≤ E(π) + δ.
To prove the claim we argue by contradiction. Assume that lim sup_{ε→0} d(ρ_{T_ε}^{(ε)}, π) ≥ λ > 0; then along a subsequence (not relabeled) ε → 0, there exists

Before we discuss some properties of and modifications to this jump dynamics, let us remark that there is an alternative way to create a jump process whose mean-field limit, as ε → 0, is expected to approach the pure birth-death process (BD): namely, simply replacing ρ by ρ * K_ε in the rates for birth and death in (BD). This is the approach considered in [57]. The jump rates for such a process are

The expected mean-field limit for fixed ε > 0 would be

A downside of the dynamics (73) is that it is unclear whether it possesses a gradient flow structure or a Lyapunov functional that approximates the KL divergence, which is why we modify the jump rates to (71).
An alternative ensemble-based sampling scheme, in which the birth-death process is realized via jumps, has recently been introduced and studied in [50], where the jump rate is

The limit of the mean-field dynamics as ε → 0 is the spherical Hellinger gradient flow of the χ² divergence, (BD2). In particular, the birth-death part of their dynamics relates to (BD2) in the same way as the dynamics of (73) relates to the Hellinger gradient flow of the KL divergence, (BD). Unlike our choice of (71), the rate in (74) does depend on the normalization constant of π, which [50] avoids by a rescaling of time.
We note that a serious issue with the jump processes discussed above is that the support of the measure is not expanding: the jumps only lead to new particles at already occupied locations. In [50] this issue is dealt with as follows: each particle created at some x is moved to a proposal obtained by sampling K_ε(· − x). This proposal is accepted according to the standard Metropolis procedure, thus ensuring that one is sampling the probability measure π. In our numerical experiments in Section 5 we combine the jump process of Λ_ε with the unadjusted Langevin algorithm (ULA). The latter sampler is responsible for exploring new territory, especially high-density regions of the state space. In effect we move each particle by a gradient descent step plus a sample of the Gaussian centered at the particle. We did not add a Metropolis step in our experiments.

For SVGD, the kernel bandwidth is chosen by the standard median trick, while for birth-death, since the bandwidth is proportional to the bias, we choose the bandwidth to be 0.1, 0.5 or 1, whichever provides the best performance. The results are shown in Table 1, which indicates that when the dimension is not too large and the running time is relatively long, both the birth-death sampler (with Langevin steps) and SVGD perform similarly well.
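To make the combination concrete, here is a minimal particle-level sketch in one dimension. The target (a double well), the Gaussian kernel, the bandwidth, the step size and the precise form of the centered rate are all illustrative assumptions; the experiments in Section 5 use the rate Λ_ε of (71).

```python
import numpy as np

rng = np.random.default_rng(0)

def V(x):
    """Hypothetical double-well potential (illustrative target, not from the paper)."""
    return 0.25 * (x**2 - 1.0) ** 2

def grad_V(x):
    return x * (x**2 - 1.0)

def bd_langevin_step(X, dt, eps):
    """One sweep of ULA plus a kernelized birth-death jump (a sketch).

    For each particle i we form the centered rate
        beta_i = log((K_eps * rho_N)(x_i)) + V(x_i) - mean over the ensemble,
    with rho_N the empirical measure and K_eps a Gaussian of bandwidth eps.
    A particle with beta_i > 0 is killed (replaced by a copy of a uniformly
    chosen particle) with probability 1 - exp(-beta_i * dt); one with
    beta_i < 0 duplicates onto a uniformly chosen slot with probability
    1 - exp(beta_i * dt), keeping the population size fixed.
    """
    N = len(X)
    # Unadjusted Langevin move: gradient step plus Gaussian noise
    X = X - dt * grad_V(X) + np.sqrt(2.0 * dt) * rng.standard_normal(N)
    # Gaussian kernel density estimate (K_eps * rho_N)(x_i)
    diff = X[:, None] - X[None, :]
    kde = np.exp(-0.5 * (diff / eps) ** 2).mean(axis=1) / (eps * np.sqrt(2.0 * np.pi))
    beta = np.log(kde) + V(X)
    beta -= beta.mean()
    for i in range(N):
        if rng.random() < -np.expm1(-abs(beta[i]) * dt):
            j = int(rng.integers(N))
            if beta[i] > 0:
                X[i] = X[j]  # kill i, duplicate j
            else:
                X[j] = X[i]  # duplicate i onto slot j
    return X

X = rng.standard_normal(200)     # initial ensemble
for _ in range(500):             # total time t = 5
    X = bd_langevin_step(X, dt=0.01, eps=0.3)
```

The Langevin part explores new regions while the jump part redistributes mass between the modes, so both wells of the double well remain populated; in Section 5 a multi-dimensional version of this combination is what we call the birth-death Langevin sampler.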

Numerical Examples
We also compare the behavior of both criteria, accuracy and log-likelihood, as time evolves over t ∈ [0, 5]. The results are shown in Figure 4. One can observe that the birth-death Langevin sampler reaches the desired accuracy at an extremely short time t ≈ 0.3, which SVGD cannot achieve before t = 5. This indicates that one can run BDLS for a much shorter time to reach equilibrium, significantly alleviating the computational cost of running a system of many particles. This is in the spirit of our Theorem 2.4.
We would like to comment that for the dataset "splice", where d = 61, the performance of the birth-death sampler is significantly worse, even with N = 2000. This is not entirely surprising, due to the limitations of kernel density estimation. It is an interesting open question to design a sampling algorithm which inherits the fast convergence properties of spherical Hellinger gradient flows while remaining robust in high-dimensional settings.

Lemma 4.5 below allows us to extend the local existence theory to long time intervals.

Lemma 4.5. Suppose that K ∈ C_c^∞(B_1) satisfies Assumption 2, and that w = log ρ is L-Lipschitz continuous. Then for any fixed x, we have

Lemma 4.14. Consider ρ ∈ P_C for some C > 1 and suppose that the assumptions of Lemma 4.13 are satisfied. Assume furthermore that ∂_SH G(µ) is continuous in L^∞(T^d) in an L^∞ neighborhood of ρ. Then the metric slope satisfies

Figure 3: Gaussian mixture example (Example 5.1). Top: positions of particles at T = 3; bottom: positions of particles at T = 10. Algorithms based on birth-death are better at attracting particles into under-explored regions.

Figure 4: Bayesian classification problem with dataset "Image". The birth-death Langevin algorithm reaches the desired accuracy and log-likelihood much faster than SVGD.