Position-dependent diffusion coefficients and free energies from Bayesian analysis of equilibrium and replica molecular dynamics simulations

Bayesian inference is used to obtain self-consistent estimates of free energies and position-dependent diffusion coefficients along complex reaction coordinates from molecular dynamics simulation trajectories. Effectively, exact solutions for the dynamics of a diffusive model are matched globally to the observed molecular dynamics data. The approach is first tested for a simple one-dimensional diffusion model, and then applied to the dihedral-angle dynamics of a peptide fragment dissolved in water. Both long equilibrium molecular dynamics simulations and short, appropriately initialized, replica simulations are used to sample the short-time dynamics of the peptide–water system. In both cases, accurate estimates of free energies and diffusion coefficients are obtained.


Introduction
Over its 65-year history, the Kramers [1] model of diffusive barrier crossing has become one of the most powerful and widely used approaches to describe transitions in molecular systems [2]. In Kramers theory, the underlying many-body dynamics of a molecular system is condensed into a one-dimensional (1D) diffusive motion along a chosen reaction coordinate Q. The problem of obtaining rate coefficients for molecular transitions then reduces to (i) finding the free-energy surface F(Q) along the reaction coordinate, in particular the height and shape of the barrier, and (ii) to estimating an effective, position-dependent diffusion coefficient D(Q) that describes the local dynamics on the free-energy surface. Obtaining the free-energy surface, or potential of mean force, amounts to counting population densities p(Q) ∝ e −βF(Q) as a function of Q under equilibrium conditions, which can be performed rigorously and with a variety of straightforward techniques of classical molecular simulations [3,4] and of non-equilibrium pulling experiments [5]. In contrast, estimating a local diffusion coefficient poses a more serious challenge because (i) diffusive dynamics is only an approximate assumption such that no rigorous expressions for D(Q), unlike F(Q), is expected; and (ii) the observed dynamics is determined by a combination of free energies and diffusion coefficients, requiring a 'deconvolution' step to remove contributions from a non-uniform free-energy surface F(Q) when going from observed trajectories to D(Q). In view of these difficulties in obtaining D(Q), it is reassuring that the rates, while exponentially sensitive to the barrier height, are only linearly proportional to the diffusion coefficient in the Kramers theory.
How can one estimate accurate position-dependent diffusion coefficients? Following the earlier works of Berne et al [6], Woolf and Roux [7] proposed an elegant approach in which free energies and diffusion coefficients are calculated effectively from the same set of simulations. In their formalism, umbrella sampling [8] with a harmonic bias on Q is used to sample p(Q) locally. From the Laplace transformation of the autocorrelation function of the velocityQ along the reaction coordinate in the harmonically restrained simulations, one can then estimate D(Q). As shown in appendix A, the expression of Woolf and Roux can be simplified considerably, and reduced exactly to where Q is the average of the reaction coordinate Q in the biased run, var(Q) = Q 2 − Q 2 is its variance and τ Q the characteristic time of its autocorrelation function, τ Q = ∞ 0 δQ(t)δQ(0) dt/var(Q) with δQ(t) = Q(t) − Q . Equation (1) is a relation between the diffusion coefficient and correlation time of a harmonic oscillator with overdamped Langevin dynamics [9] that has been used, for instance, to estimate the diffusion coefficients along protein [10] and peptide folding reaction coordinates [11].
Recently, Liu et al [12] proposed a different approach to calculate diffusion coefficients in anisotropic and inhomogeneous systems. For anisotropic systems, such as a liquid-vapour interface, the elements of a diagonal diffusion tensor are related to the conditional mean square displacement of only those particles that remain within a given position interval. If the system is inhomogeneous with a known free-energy surface F(Q), the friction coefficient in independently run Langevin dynamics simulations on F(Q) is varied to match the probability of remaining within a chosen interval, as obtained from the full molecular dynamics (MD) simulation.
Here, I follow a different approach and estimate diffusion coefficients and free energies self-consistently. The central idea is that long equilibrium simulation runs or, equivalently, an ensemble of independent and appropriately initialized short simulations can be used to probe the local 'propagators' along the coordinate Q. After coarse-graining in time, one can compare the observed motions along Q with those expected from diffusive dynamics. Such a global comparison will provide self-consistent information about the free-energy surface, F(Q), and the diffusion coefficient, D(Q). To perform this comparison, I use a global Bayesian analysis of the simulation data. Under the assumption of diffusive dynamics (after some short initial time accounting for fast molecular processes), and for a given free-energy surface and positiondependent diffusion coefficient, a likelihood function can be constructed that gives the probability of observing exactly the motions along Q seen in the simulation runs. Using Bayes' formula, the likelihood of observations given the parametrized diffusive model is turned into a posterior density of the unknown 'parameters'F(Q) and D(Q) of the diffusive model, given the simulation observations. Entering additionally into Bayes' formula is the 'prior', i.e., a distribution of the parameters that reflects what is known about them before the observations are made [13]. Implicit in the formalism is the assumption that Q is a 'good' reaction coordinate. If that is not the case, the dynamics along Q is not Markovian and the estimated diffusion coefficients will depend on the timescale at which the observations were made. Such time dependences can be used as a test of the underlying assumption of diffusive dynamics.
Within this Bayesian approach, one can infer the slow dynamics in the projected coordinate Q and construct a coarse master equation represented by a rate matrix [14]. At sufficiently fine discretizations along Q, this rate matrix contains the necessary information about free energies F(Q) and diffusion coefficients D(Q) in an inhomogeneous system. The general procedure will be illustrated for a simple 1D test system, and applied to the analysis of dihedral-angle transitions in a hydrated peptide fragment [15].

Master equation
As in [14], the starting point is Zwanzig's generalized master equation for Newtonian dynamics in configuration space [16]. If the complete configuration space is divided into N non-overlapping cells, the probability p i (t) of being in cell i at time t satisfies a generalized master equation, whereṗ i (t) ≡ dp i (t)/dt and K ij (t) is the transition memory kernel given formally in terms of projections [16]. Here, the cells i correspond to intervals along the coordinate Q. Note in particular the absence of the 'random force' (i.e., an inhomogeneous term) in equation (2), which requires that initially (at time t = 0) all phase-space variables are at equilibrium within a given cell i, and non-equilibrium in the initial conditions is limited to the partitioning between different cells i.
Here, one is interested in diffusion, and I assume that after some time, the dynamics becomes Markovian. With this coarse graining in time, one can approximate equation (2) by a Markovian rate equation,ṗ where the R ij are the constant elements of a rate matrix R, with R ij 0 for i = j, R ii 0, and i R ij ≡ 0. One can solve equation (3) in terms of a matrix exponential, The 'propagators' of the Markovian model are accordingly given by where p(i, t|j, 0) is the conditional probability that a trajectory starting from cell i (with equilibrium initial conditions in phase space within i) is in cell j at a later time t.

Diffusive dynamics
In the following, I will connect this general rate formalism to diffusive dynamics by spatially discretizing the Smoluchowski diffusion equation and turning it into a system of rate equations. The resulting rate matrix R describes the dynamics between neighbouring intervals along Q, and contains information about free energies F(Q) and diffusion coefficients D(Q). The Smoluchowski diffusion equation describes the time evolution of the probability density p(Q, t) along the coordinate Q, with β −1 = k B T , where k B is Boltzmann's constant and T the absolute temperature. Equation (6) can be discretized in space following Bicout and Szabo [9] to give a system of rate equations in the form of equation (3). In one dimension, the resulting rate equations take the following simple forṁ where p i (t) is the probability of being in interval i around Q i at time t and R N,N+1 ≡ R N1 and R N+1,N ≡ R 1N for periodic boundary conditions, and R N,N+1 = R N+1,N = 0 for reflecting boundary conditions, respectively. The free-energy surface can be estimated from the equilibrium probabilities P i of being in where Q = |Q i+1 − Q i | is the bin width (assumed to be a constant for simplicity). The vector P = (P 1 , . . . , P N ) T of equilibrium probabilities is an eigenvector of R with eigenvalue 0, , are related to the rate matrix and its equilibrium distribution through [9] where i and i + 1 are two neighbouring cells at centre-to-centre distance |Q i+1 − Q i |. Note that equation (9) is symmetric with respect to exchanging i and i + 1 because the rate matrix satisfies detailed balance, With equations (8) and (9), the positional coarse graining of the diffusion equation, equation (6), has turned the problem of finding D(Q) and F(Q) into that of estimating rate coefficients for transitions between adjacent intervals along Q. For systems that do not satisfy detailed balance (e.g., driven systems), but can be described by the kinetic scheme equation (3), the full rate matrix R ij could be estimated.

Likelihood function
Following [14], Bayesian inference will be used to estimate the rate matrix R from either long equilibrium simulations or short replica simulations. Consider that one has made the following observations in either the short replica runs or the equilibrium trajectory: at a time t α after the system was in state j α , the system is found in state i α . Under the assumption of the kinetic model equation (3), the likelihood of such an observation is p(i α , t α |j α , 0) = (e t α R ) i α j α . The likelihood L of a series of such observations is then simply the product of these probabilities, assuming that the observations are independent and that the dynamics is given by the Markovian rate matrix R. For non-Markovian dynamics, the observations are the actual paths, with a generalized path action functional determining their likelihood [14].

Bayesian analysis of simulation data
To estimate the rate coefficients R ij in equation (7), I use the Bayesian-inference method of [14].
Alternatively, one could also use a maximum-likelihood approach. In the Bayesian formalism, a posterior distribution of the model parameters R ij is constructed from the simulation data through where p(data|parameters) ≡ L is given by the likelihood function L of equation (10). In the following, I will assume a uniform prior distribution of model parameters, p(parameters) = constant, such that Implicit in the 'uniform' prior is the assumption of a particular integration measure because p(parameters|data) is a probability density. The condition of detailed balance reduces the number of free parameters in R, which can be written as For N states, there are N − 1 free equilibrium probabilities P i (with 1 P i > 0 and the Nth P i given by normalization, i P i = 1) and N(N − 1)/2 free rate coefficients in general. For 1D diffusion, the number of free off-diagonal rate coefficients is reduced to N and N − 1 with periodic and reflecting boundary conditions, respectively. To obtain the propagators p(i, t|j, 0) for a given R, I use an eigenvalue technique for the symmetric matrixR ij = P i −1/2 R ij P j 1/2 . Its matrix U of eigenvectors satisfiesRU = UΛ, where = diag(λ 1 , . . . , λ N ) is the diagonal matrix of eigenvalues. The matrix exponential determining the propagators then becomes e tR = diag(P 1 1/2 , . . . , P N 1/2 )e tR diag(P 1 −1/2 , . . . , P N −1/2 ), where e tR = Ue t U T with e t = diag(e tλ 1 , . . . , e tλ N ). In some cases, it may also be possible to avoid the spatial discretization of the diffusion equation and instead use exact analytic expressions for the continuous propagators p(Q, t|Q , 0), or at least accurate short-time expansions for locally smooth free-energy surfaces F(Q) and diffusion coefficients D(Q) (such as the shorttime propagators used to generate diffusive trajectories [17]). Such expressions could be used to obtain the likelihood function without numerical matrix computations.
To construct posterior distributions of the parameters according to equation (11), I will sample parameters using Metropolis Monte Carlo simulations [14,18] in which the negative log-likelihood, − ln L, serves as an effective energy function in parameter space. In the Monte Carlo simulations, the equilibrium free energies, g i = − ln P i , and the rate coefficients R ij (i > j) are randomly varied, the latter being restricted to the positive axis. For both R ij and g i , a uniform prior is assumed, unless specified otherwise. According to the Metropolis criterion [18], Monte Carlo moves are always accepted if the log-likelihood increases; if the log-likelihood decreases by , the move is accepted with probability exp(− ). From these Monte Carlo simulations in parameter space, one also obtains directly the posterior distributions of 'observables' derived from R, in particular, F(Q i ) and D(Q i ), according to equations (8) and (9). I will report the mean of these distributions, and uncertainty intervals about the mean given by the points where the cumulative distributions reach 0.1587 and 1 − 0.1587, corresponding to ±SD for a Gaussian distribution.

Prior for smooth free energies and diffusion coefficients
In many practical situations, one may expect that the free-energy profile F(Q) and the positiondependent diffusion coefficient D(Q) are smooth, such that their values in neighbouring intervals should be similar. Such 'a priori' expectations can be imposed by choosing an appropriate prior 'p(parameters)' in equation (11). Specifically for D(Q), one can multiply the likelihood L by a weighting function that puts a harmonic penalty on deviations |D(Q i ) − D(Q i+1 )| of diffusion coefficients at adjacent grid points, where small values of the parameter γ impose smooth D(Q). Alternatively, one could also impose smoothness by using low-order series expansions of the position-dependent F(Q) and D(Q), and then obtain Bayesian estimates of the series expansion coefficients.

Test for 1D diffusion
To test the algorithm for reconstructing F(Q) and D(Q) from simulation data, I first analyse a simple model system. The purpose of this comparison is to explore whether the Bayesian procedure can accurately recover both the underlying free-energy surface and the positiondependent diffusion coefficients from noisy simulation data created for a known diffusive model.
This likelihood function is used in the Bayesian estimate together with a uniform prior distribution for the equilibrium free energies, −k B T ln P i / ψ, and rate coefficients, R ij (i > j). By sampling the resulting posterior distribution of rate matrices, one obtains distributions of free energies, −k B T ln P i / ψ, and local diffusion coefficients D(ψ i ) according to equation (9). Figure 1 shows results for free energies and diffusion coefficients for n = 24 cells. One finds that the Bayesian estimate accurately recovers the underlying free-energy surface. Small deviations between the estimated and exact F(ψ) in figure 1 are caused by the insufficient sampling of the 100 ns trajectory, with the minimum at ψ = ±π slightly more populated during the simulation run. The Bayesian estimate of the position-dependent diffusion coefficient also agrees well with the D(ψ) used in the simulation. Note that the t = 0.5 ps observation time is comparable to the characteristic relaxation time in one of the free-energy wells, [βF (0)D(0)] −1 = 1.25 ps. At such a relatively long timescale, estimates of F(ψ) and D(ψ) from the local average drift and spread of trajectories would require substantial curvature corrections. Estimating the local diffusion coefficient as D(ψ 0 ) ≈ var(ψ; ψ 0 , t)/2t from the variance of trajectories starting from a fixed initial point ψ 0 and evolving for time t = 0.5 ps results in overand under-estimates of D by about 30% at the barrier tops and free energy minima, respectively.
As shown in figure 2, the estimated diffusion coefficients approach their reference value D(ψ) with an error that depends quadratically on the bin width. The discretization of the Smoluchowski diffusion equation, equation (7), uses centred finite differences [9] with errors of the order of ψ 2 . For n = 24 grid points, the estimated diffusion coefficients deviate by up to about 10% from the corresponding reference values. Overall, the results for the 1D test system show that the Bayesian analysis can produce accurate free energies and diffusion coefficients.

Alanine dipeptide in water
In the previous example, the underlying dynamics was diffusive by construction. In the following, I will use MD simulations of a small peptide fragment in water, and approximate its dihedralangle dynamics by Smoluchowski diffusion. To estimate the free-energy surface and diffusion coefficient of the hydrated alanine dipeptide along the Q = ψ dihedral angle, I again use the Bayesian analysis with a uniform prior. Figure 3 compares the results for F(ψ) from the Bayesian analysis of the replica runs of [15] to a long equilibrium MD simulation in explicit solvent. In a previous analysis [15], a related Chapman-Kolmogorov approach had been used in which local short-time propagators were calculated, but without imposing detailed balance on the coarse master equation. Without detailed balance, one obtains a steady state rather than an equilibrium solution. I find here that the Bayesian re-analysis, with detailed balance imposed, leads to improved estimates of the free-energy profile.
The Bayesian approach also produces a self-consistent estimate of the position-dependent diffusion coefficient D(ψ), as shown in the centre panel of figure 3. For reference, a local diffusion coefficient calculated for harmonically biased equilibrium MD is included.As described in appendix A, the formalism of [7] was used to estimate D(ψ) from the biased run. With the same data, I also used the analytical limit, equation (1) (see appendix A). As shown in the centre panel of figure 3, the results from the Bayesian analysis are in excellent agreement with the two estimates from the biased MD run. However, as discussed in appendix A, the two estimates from biased MD are subject to possible substantial systematic errors due to the particular choice of extrapolation schemes and integration cutoffs. The value of D(ψ) near the global free energy minimum at ψ ≈ −0.3 rad is also in good agreement with a previous estimate of 0.15 rad 2 ps −1 based on the equilibrium relaxation time [15].
In the bottom panel of figure 3, I show results for the diffusion coefficient estimated with a prior imposing 'smoothness'. In equation (14), γ was set to 0.05 rad 2 ps −1 for n = 24 cells. The results for the free-energy profile are essentially unchanged (not shown). The local diffusion coefficients also show no substantial changes, except for the removal of 'outliers' of D(ψ) in the relatively poorly sampled barrier regions.

Conclusions
The objective of this paper was to extract diffusion coefficients and free energies along complex reaction coordinates self-consistently from many-particle molecular dynamics simulations.   (14), for γ = 0.05 rad 2 ps −1 . Symbols as in panel (b). In all panels, error bars indicate a credibility range of ∼68% (corresponding to ±1 SD for a Gaussian).
The problem was approached by coarse-graining in space and time. Conformation space is divided into cells and MD trajectories are monitored at finite time intervals to identify transitions between cells. A model of diffusive dynamics is similarly coarse-grained in space, leading to coupled rate equations [9], and 'matched' to the observed dynamics with a Bayesian approach. The Bayesian formalism produces estimates of the underlying position-dependent free energies and diffusion coefficients, as well as their uncertainties. The approach is presented for diffusion along a 1D coordinate, but generalizations to higher dimensions are straightforward, following the spatial discretization procedure of Bicout and Szabo [9]. Estimates of F(Q) and D(Q) for a multidimensional coordinate Q will require trajectory data for a large number of initial conditions. However, the number of parameters can be effectively reduced if the free-energy surface or the diffusion coefficients are smooth functions of position.
To connect the diffusive model to the observed simulation data, a likelihood function is constructed. The likelihood L is defined as the probability of observing a set of transitions between conformation-space cells found in the MD simulations under the assumption of a diffusive model. Turning the likelihood around according to the Bayes theorem, one obtains a distribution of the model parameters given the trajectory data. A 'prior' entering into the distribution of parameters is assumed to be uniform, but priors imposing smoothness on the estimated free energies and diffusion coefficients are also considered.
Approaches related to the one presented here have been used to estimate molecular rate processes from single-molecule photon-count statistics [19,20]. Likelihood functions have also been developed for the analysis of dynamical systems [21], and have been used for Bayesian inference [22,23]. However, the applicability of Bayesian approaches in the context of dynamical systems is still a matter of debate [24]. Clearly, with poorly chosen priors, or inappropriate models, a Bayesian approach is unlikely to lead to reliable results. Alternative approaches applied in the analysis of dynamical systems, for instance, directly relate the local average drift and the spread of trajectories at short times to the gradient of the free-energy surface and diffusion coefficient [25]- [27]. In the context of coarse molecular dynamics simulations, such expressions based on the short-time expansion of propagators have been used to estimate free energies and diffusion coefficients [15,28].
Here, I obtained accurate estimates of the underlying free-energy surface and of local diffusion coefficients both for a simple 1D model, and for the dihedral-angle dynamics of a peptide fragment in explicit water. The diffusion coefficients agree well with those obtained by analysing a harmonically biased run using the method of Woolf and Roux [7] and the analytic limit of their formula (see appendix A). Compared to the previous estimate of the diffusion coefficient from the relaxation dynamics in a free-energy minimum [15], I found a slight speedup of the diffusion (∼0.2 versus 0.15 rad 2 ps −1 ). A relevant question is whether diffusive dynamics is useful to describe complex motions in a molecular system, such as peptide conformational changes. For alanine dipeptide in water, Bolhuis et al [29] used commitment probabilities to show that the ψ dihedral angle should be a relatively poor reaction coordinate for the α-to-extended transition. Nevertheless, with the diffusion coefficients estimated here, I expect a mean life time of ∼450 ps in the α-helical minimum. That value is bracketed by the two estimates from explicit MD simulations (∼400 ps from a 7-ns run with 607 water molecules, and ∼800 ps from a 24-ns run with 265 water molecules). Even though this result suggests that diffusive dynamics, as estimated after coarse-graining in time, is useful here, it is essential to examine the system for significant deviations of the observed dynamics from the underlying model.
where the harmonic bias restrains Q near Q i . In terms of a Laplace transform,Ĉ v (s; Q i ) = ∞ 0 e −st C v (t; Q i ) dt, Woolf and Roux arrive at the following expression [7]: where δQ = Q − Q i . This expression was obtained by integrating the memory function to estimate a friction coefficient in the presence of a harmonic potential. The local diffusion coefficient is then obtained by extrapolating D(s) to s → 0: One cannot immediately take this limit because, ideally, C v (s; Q i ) = 0 for the harmonically restrained system and perfect sampling. Instead, a numerical extrapolation procedure is used in which D(s; Q i ) is plotted as a function of s and extrapolated from some chosen range in s to s = 0. However, it turns out that one can simplify the expressions given above, and bring them in a more familiar form, by using the autocorrelation function of the position instead of the velocity.
and thus is the autocorrelation function of the reaction coordinate (with the average removed) in biasing window i. Laplace transformation of equation (A.5) produceŝ where I used that C Q (0; Equation (A.9) is exact for an overdamped harmonic oscillator, and has been used, for instance, to estimate diffusion coefficients in protein [10] and peptide folding [11]. Figure A.1 illustrates difficulties faced when estimating local diffusion coefficients from the autocorrelation functions calculated in harmonically biased simulations. A 240 ps MD simulation of alanine dipeptide in water was run as in [15], but with a harmonic restraint potential k(ψ − ψ 0 ) 2 /2 added to keep ψ near ψ 0 ≈ −0.3 rad. With a spring constant of k = 100 kcal mol −1 rad −2 , S.D. ∼ 0.074 rad for ψ.As shown in figureA.1, the resulting autocorrelation functions of position (ψ) and velocity (ψ) are highly oscillatory, making estimation of the correlation time in equations (1) and (A.9), and of D(s) in equation (A.8) difficult. The resulting estimates of D from var(ψ)/τ thus depend on the time range used to estimate τ. Equivalently, the estimated limit of D(s) for s → 0 depends on the range of s used in the extrapolation (with large s values giving high weight to the short-time correlations). Moreover, a singularity, caused by the numerical instability of equation (A.2), makes the result dependent on the chosen extrapolation method. Here, I used a quadratic fit for 3 < s < 10 ps −1 . But despite these difficulties, both methods produced D(ψ) values near those estimated with the Bayesian approach.