Differentiable Cosmological Simulation with the Adjoint Method

Rapid advances in deep learning have brought not only a myriad of powerful neural networks, but also breakthroughs that benefit established scientific research. In particular, automatic differentiation (AD) tools and computational accelerators like GPUs have facilitated forward modeling of the Universe with differentiable simulations. Based on analytic or automatic backpropagation, current differentiable cosmological simulations are limited by memory, and thus are subject to a trade-off between time and space/mass resolution, usually sacrificing both. We present a new approach free of such constraints, using the adjoint method and reverse time integration. It enables larger and more accurate forward modeling at the field level, and will improve gradient-based optimization and inference. We implement it in an open-source particle-mesh (PM) N-body library pmwd (PM with derivatives). Based on the powerful AD system JAX, pmwd is fully differentiable, and is highly performant on GPUs.


INTRODUCTION
Current established workflows of statistical inference from cosmological datasets involve reducing cleaned data to summary statistics like the power spectrum, and predicting these statistics using perturbation theories, semi-analytic models, or simulation-calibrated emulators.These can be suboptimal due to the limited model fidelity and the risk of information loss in data compression.Cosmological simulations (Hockney & Eastwood 1988;Angulo & Hahn 2022) can accurately predict structure formation even in the nonlinear regime at the level of the fields.Using simulations as forward models also naturally accounts for the cross-correlation of different observables, and can easily incorporate systematic errors.This approach has been intractable due to the large computational costs on conventional CPU clusters, but rapid advances in accelerator technology like GPUs open the possibility of simulation-based modeling and inference (Cranmer et al. 2020).Furthermore, model differentiability enabled by AD libraries can accelerate parameter constraint with gradient-based optimization and inference.A differentiable field-level forward model combining these two features is able to constrain physical parameters together with the initial conditions of the Universe.
The first differentiable cosmological simulations, such as BORG, ELUCID, and BORG-PM (Jasche & Wandelt 2013;Wang et al. 2014;Jasche & Lavaux 2019), were developed before the advent of modern AD systems, and were based on the analytic derivatives, which involve first convoluted derivation by hand using the chain rule (see e.g., Seljak et al. 2017, App.D) before implementing them in code.Later codes including FastPM and FlowPM (Feng et al. 2016;Feng 2018;Seljak et al. 2017;Modi et al. 2021) compute gradients using the AD engines, namely vmad (written by the same authors) and TensorFlow, respectively.The AD frameworks automatically apply the chain rule to the primitive operations that comprise the whole simulation, relieving the burden of derivation and implementation of the derivatives.Both analytic differentiation and AD backpropagate the gradients through the whole history, which requires saving the states at all time steps in memory.Therefore, they are subject to a trade-off between time and space/mass resolution, usually sacrificing both.As a result, they lose accuracy on small scales and in dense regions where the time resolution is important, e.g., in weak lensing (Böhm et al. 2021).
Alternatively, the adjoint method provides systematic ways of deriving the gradients of an objective function  under constraints (Pontryagin 1962), such as those imposed by the N-body equations of motion in a simulated Universe.It identifies a set of adjoint variables λ, dual to the state variables z of the model, and carrying the gradient information of the objective function with respect to the model state ∂/∂z.For time-dependent problems, the adjoint variables evolve backward in time by a set of equations dual to that of the forward evolution, known as the adjoint equations.For continuous time, the adjoint equations are a set of differential equations, while in the discrete case they become difference equations which are practically a systematic way to structure the chain rule or backpropagation.Their initial conditions are set by the explicit dependence of the objective on the simulation state, e.g., λ n = ∂/∂z n if  is a function of the final state z n .Solving the adjoint equations can help us to propagate the gradient information via the adjoint variables to the input parameters θ, to compute the objective gradients d/dθ.And we will see later that the propagated and accumulated gradients on parameters come naturally from multiple origins, each reflecting a θ-dependence at one stage of the modeling in Fig. 1a.
The backward adjoint evolutions depend on the states in the forward run, which we can re-simulate with reverse time integration if the dynamics is reversible, thereby dramatically reducing the memory cost (Chen et al. 2018).Furthermore, we derive the discrete adjoint equations dual to the discrete forward time integration, known as the discretize-then-optimize approach (e.g., Gholaminejad et al. 2019), to ensure gradients propagate backward along the same trajectory as taken by the forward time integration.This is in contrast with the optimize-then-discretize approach, that numerically integrates the continuous adjoint equations, and is prone to larger deviations between the forward and the backward trajectories due to different discretizations (Lanzieri et al. 2022).In brief, to compute the gradient we only need to evolve a simulation forward, and then backward jointly with its dual adjoint equations.We introduce the adjoint method first for generic time-dependent problems in both continuous and discrete cases in Sec. 3, and then present its application on cosmological simulation in Sec.3.6.
We implement the adjoint method with reverse time integration in a new differentiable PM library pmwd using JAX (Li et al. 2022).pmwd is memory efficient at gradient computation with a space complexity independent of the number of time steps, and is computation efficient when running on GPUs.

FORWARD SIMULATION
We first review and formulate all components of the N-body simulation-based forward model of the cosmological structure formation.

Initial Conditions & Perturbation Theories
N-body particles discretize the uniform distribution of matter at the beginning of cosmic history (scale factor a(t) → 0) at their Lagrangian positions q, typically on a Cartesian grid, from which they then evolve by displacements s to their later positions, i.e., x = q + s(q). (1) To account for the cosmic background that expands globally, while x is the comoving position relative to this background, the physical position grows with the expansion We describe different operators in Sec.2: Boltzmann solver ("Boltz") and initial condition generator by the Lagrangian perturbation theory ("LPT") in Sec.2.1; force solver (F) in Sec.2.2; time integration ("Integ"), kick (K), and drift (D) in Sec.2.3; observation ("Obs" and O) and objective ("Obj") in Sec.2.4.Gradients flow backward with all arrows reversed (Sec.3.6).
scale factor, i.e., ax.The expansion rate is described by the Hubble parameter The initial conditions of particles can be set perturbatively when |∇ • s|, the linear approximation of the density fluctuation, is much less than 1.We compute the initial displacements and momenta using the second order Lagrangian perturbation theory (2LPT, Bouchet et al. 1995): where p is the canonical momentum1 for canonical coordinate x.The temporal and spatial dependences separate at each order: the i-th order growth factor D i is only a function of scale factor a (or time t), and the i-th order displacement field s (i)   depends only on q.We have also used two types of time derivatives, □ ≜ d□/dt and □ ′ ≜ d□/d ln a, related by □ = H□ ′ .
Both the first and second order displacements are potential flows, with the scalar potentials sourced by where ϕ s,ij ≜ ∂2 ϕ s /∂q i ∂q j , and δ (1) is the linear order of the overdensity field, δ, related to the density field ρ and mean matter density ρ by δ ≜ ρ/ ρ − 1.
The linear overdensity δ (1) , which sources the 2LPT particle initial conditions, is a homogeneous and isotropic Gaussian random field in the consensus cosmology, with its Fourier transform δ (1) (k) = ∫ dq δ (1) (q)e −ik•q characterized by the linear matter power spectrum P lin , (5) The angle bracket takes the ensemble average of all possible realizations.Homogeneity demands that different wavevectors are uncorrelated, thus the Dirac delta δ d in the first equality.And with isotropy, P lin does not depend on the direction of the wavevector k, but only on its magnitude, the wavenumber k ≜ |k|.In a periodic box of volume V, where k is discrete, δ d is replaced by the Kronecker delta δ k in the second equality.Numerically, we can easily generate a δ (1) realization by sampling each Fourier mode independently, with ω(k) being any Hermitian white noise, i.e., Fourier transform of a real white noise field ω(q).Cosmological perturbation theory gives the linear power spectrum as2 where the shape of P lin is determined by the transfer function T , solution to the linearized Einstein-Boltzmann equations (Lewis & Challinor 2011;Blas et al. 2011).
T depends on the cosmological parameters, some of which already appear in (7): A s is the amplitude of the primordial power spectrum defined at some fixed scale k pivot ; n s describes the shape of the primordial power spectrum; Ω m is the total matter density parameter; Ω b is the baryonic matter density parameter; and H 0 is the Hubble constant, often parameterized by the dimensionless h as H 0 = 100h km/s/Mpc.Other parameters may enter in extensions of the standard Λ cold dark matter (ΛCDM) cosmology.In summary, other than the discretized white noise modes ω, to generate initial conditions we need the growth functions D and the transfer function T , both of which depend on the cosmological parameters θ.We compute D by solving the ordinary differential equations (ODEs) given in App.A, and employ the fitting formula for T from Eisenstein & Hu (1998).We illustrate these dependencies in the upper left triangle of Fig. 1a.
At early times and/or lower space/mass resolution, LPT can be accurate enough to directly compare to the observational data.However, more expensive integration of the N-body dynamics is necessary in the nonlinear regime.During LPT and the time integration we can "observe" the simulation predictions by interpolating on the past light cone of a chosen observer.These form the upper right square of Fig. 1a.

Force Evaluation
The core of gravitational N-body simulation is the gravity solver.The gravitational potential sourced by matter density fluctuation satisfies the Poisson equation where ∇ 2 is the Laplacian with respect to x.We separate the time dependence by defining φ ≜ aϕ, so φ satisfies that only depends on the matter overdensity δ.
While our adjoint method is general, we employ the PM solver in pmwd for efficiency, and leave the implementation of short-range forces to future development.With the PM method, we evaluate δ(x) on an auxiliary mesh by scattering particle masses to the nearest grid points.We use the usual cloud-in-cell (CIC), or trilinear, interpolation (Hockney & Eastwood 1988), to compute the fractions of a particle at x ′ going to a grid point at x, where l is the mesh cell size.
The gravitational field, −∇φ, can then be readily computed on the mesh with the fast Fourier transform (FFT), as the above partial differential equation becomes an algebraic one in Fourier space: In Fourier space, −∇φ(x) is just −ikφ(k), each component of which can be transformed back to obtain the force field.With 4 (1 forward and 3 inverse) FFTs, we can obtain −∇φ(x) from δ(x), with both on the mesh, efficiently.Finally, we interpolate the particle accelerations by gathering −∇φ from the same grid points with the same weights, as given in (10).

Time Integration
N-body particles move by the following equations of motion We use the FastPM time stepping (Feng et al. 2016), designed to reproduce in the linear regime the linear Lagrangian perturbation theory, i.e., the 1LPT as the first order in (2), also known as the the Zel'dovich approximation (hereafter ZA).We present a simplified derivation below.N-body simulations integrate (12) in discrete steps (Fig. 1b), typically with a symplectic integrator that updates x and p alternately.From t a to t b , which are named the drift and kick operators, respectively.In the second equalities of each equation we have introduced two time-dependent functions G D and G K .As approximations, they have been taken out of the integrals together with p and ∇φ at some intermediate representative time t c .We can make the approximation more accurate by choosing G D to have a time dependence closer to that of p. Likewise for ∇φ and G K .However, in most codes G D and G K are simply set to 1 (Quinn et al. 1997), lowering the accuracy when the number of time steps are limited.
FastPM chooses G D and G K according to the ZA growth history, thereby improving the accuracy on large scales and at early times.In ZA, the displacements are proportional to the linear growth factor, s ∝ D 1 , which determines the time Li et al. dependences of the momenta and the accelerations by ( 12).Therefore, we can set G D and G K in (13) to They are functions of D 1 and its derivatives, as given by (A7).With these choices, the drift and kick factors, defined in ( 13), then become While these operators are generally applicable in any symplectic integrator, we use them in the second order kick-drift-kick leapfrog, or velocity Verlet, integration scheme (Quinn et al. 1997).From t i−1 to t i , the particles' state (x, p) is updated in the following order: F i : The left column names the operators as shown in Fig. 1c.The force operator F on the third line computes the accelerations as described in Sec.2.2.It caches the results in a, so that they can be used again by the first K in the next step.Note that we need to initialize a 0 with F 0 before the first time step.

Observation & Objective
Because all observables live on our past light cone, we model observations on the fly by interpolating the j-th particle's state ẑj = (x j , pj ) when they cross the light cone at tj .Given z = (x, p) at t i−1 and t i , we can parametrize intermediate particle trajectories with cubic Hermite splines.Combined with the θ-dependent propagation of the light front, we can solve for the intersections, at which we can record the observed ẑ.The solution can even be analytic if the light propagation is also approximated cubically.Note that we only "observe" the dark matter phase space here, and leave more realistic observables to future works, including the forward modeling of real observational effects.Fig. 1c illustrates the observation operator O and its dependencies on the previous and the current time step.We can compare the simulated observables to the observational data, either at level of the fields or the summary statistics, by some objective function in case of optimization, or by a posterior probability for Bayesian inference.We refer to both cases by objective and denote it by  throughout.Note that in general it can also depend on θ and ω, in the form of regularization or prior probability.
Formally, we can combine the observation and objective operators as as illustrated in the lower right triangle in Fig. 1a.Note this form also captures the conventional simulation snapshots at the end of a time step or those interpolated between two consecutive steps, so we model all these cases as observations in pmwd.

BACKWARD DIFFERENTIATION -THE ADJOINT METHOD
We first introduce the adjoint method for generic time-dependent ODEs, and derive the adjoint equations following a pedagogical tutorial by Bradley (2019).We then adopt the discretize-then-optimize approach, and derive the discrete adjoint equations, that is more suitable for the N-body symplectic time integration.Finally we apply them to derive the adjoint equations and the gradients for cosmological simulations described in Sec. 2, and couple them with the reverse time integration to reduce the space complexity.

Variational (Tangent) and Adjoint Equations
Consider a vector state z(t) subject to the following ODEs and initial conditions for t ∈ [t 0 , t 1 ].Here the initial conditions can depend on the parameters θ.
A perturbation in the initial conditions propagates forward in time.For z(t, t 0 , z 0 ), the Jacobian of state variables describing this, evolves from identity ∆ 0 = I by following from ( 18). ( 20) is known as the variational or tangent equation.
The backward version of ∆, evolves backward in time from identity Λ 1 = I by (22) is called the adjoint equation, for the right-hand side is (∂f/∂z) ⊺ Λ.It can be derived from the time-invariance of Λ • ∆: Alternatively, the adjoint equation can be derived from the variational equation, using the facts that dM −1 /dt = −M −1 M M −1 for any invertible matrix M, and (∂z/∂z) −1 = ∂z/∂z.Like ( 20) and ( 22) As we can see next, the adjoint equation takes a similar form when one optimizes an objective function of the state.

Objective on the Final State
In the simplest case, the objective function depends only on the final state, e.g., the last snapshot of a simulation, and possibly the parameters too in the form of regularization or prior information, i.e., (z 1 , θ).To optimize the objective under the constraint given by the ODEs, we can introduce a time-dependent function λ(t) as the Lagrange multiplier: Note the minus sign we have introduced in front of λ for later convenience.Its total derivative with respect to θ is Integrating the first term of the integrand by parts: and plugging it back: Now we are free to choose that allow us to avoid all ∂z/∂θ terms in the final objective gradient: in which the first two terms come from the regularization and initial conditions, respectively.( 29) is the adjoint equation for the objective (z 1 , θ).With the initial conditions set at the final time, we can integrate it backward in time to obtain λ(t), which enters the above equation and yields d/dθ.Note that (29) has the same form as ( 22), and their solutions are related by And like Λ • ∆, λ • ∆ is time-invariant: Computing λ 1 • ∆ 1 directly is expensive, but solving (29) backward for λ 0 is cheap.This is related to the fact that the reverse-mode AD or backpropagation is cheaper than the forward mode for optimization.

Objective on the State History
The adjoint method applies to more complex cases too.Let's consider an objective as a functional of the evolution history with some regularization ℛ on θ The derivation is similar: has gradient The adjoint equation becomes with the objective gradient The time-invariant is now so 3.4.Objective on the Observables Now let's consider an objective that depends on different state components at different times, e.g., in the Universe where further objects intersected our past light cone earlier.It falls between the previous two scenarios, and we can derive its adjoint equation similarly.
We denote the observables by ẑ, with different components ẑj affecting the objective  (ẑ, θ) at different times tj , i.e., ẑj ≜ z j ( tj ).The Lagrangian becomes constraining only the parts of trajectories inside the light cone.Its gradient is where we have defined λ similarly with components λj ≜ λ j ( tj ).In the first equality, we have also dropped a vanishing term j λj z j − f j ( tj ) ∂ tj /∂θ, i.e., ∂ tj /∂θ does not directly enter the gradient.Now we find the adjoint equation that has the same form as (29), with a slightly different initial condition given at the respective observation time of each component.The objective gradient is also similar to the previous cases: Note that even though λ j (t) for t > tj does not affect the final gradient, they can enter the right-hand side of the adjoint equation, and affect those λ k with t < tk , i.e., inside the light cone.Physically however, ∂f j /∂z k should vanish for spacelike separated pairs of z j and z k , even though the Newtonian approximation we adopt introduces some small deviation.Therefore, we can set λ j (t) to 0 for t > tj , and bump it to ∂/∂ẑ j at tj .

Discretize Then Optimize
In practice, the time integration of ( 18) is discrete.Consider the explicit methods, which include the leapfrog integrator commonly used for Hamiltonian dynamics.We want to propagate the gradients backward along the same discrete trajectory as taken by the forward integration.Therefore, instead of the continuous adjoint equations derived above, we need the adjoint method for the discrete integrator.Without loss of generality, we derive the adjoint equation for an objective depending on the state at all time steps, which can be easily specialized to the 3 cases discussed above with only slight modifications.The discretized Lagrangian is now whose gradient is So the discrete adjoint equation is We can iterate it backward in time to compute the final objective gradient: These equations are readily adaptable to simulated observables.For snapshots at t n or interpolated between t n−1 and t n , all ∂/∂z i vanish except for the last one or two, respectively.For light cones, as discussed in Sec.3.4, each component of ẑ is interpolated at different times, thus all ∂/∂z i vanish except for those times relevant for its interpolation, and the corresponding λ i can be set to zero for i greater than the intersection time.
At the i-th iteration the adjoint variable requires the vector-Jacobian product (VJP) λ i • ∂F i−1 /∂z i−1 and the partial objective derivative ∂/∂z i−1 at the next time step, which can be easily computed by AD if the whole forward history of (44) has been saved.However, this can be extremely costly in memory, which can be alleviated by checkpointing algorithms such as Revolve and its successors (Griewank & Walther 2000).Alternatively, if a solution to ( 18) is unique, we can integrate it backward and recover the history, which is easy for reversible Hamiltonian dynamics and with reversible integrators such as leapfrog.When the N-body dynamics become too chaotic, one can use more precise floating-point numbers and/or save multiple checkpoints3 during the forward evolution, from which the backward evolution can be resumed piecewise.

Application to Simulation
The adjoint method provides systematic ways of deriving the objective gradient under constraints (Pontryagin 1962), here imposed by the N-body equations of motion.We have introduced above the adjoint method for generic time-dependent problems in both continuous and discrete cases.The continuous case is easier to understand and has pedagogical values, while the discrete case is the useful one in our application, for we want to propagate numerically the gradients backward along the same path as that of the forward time integration.
For the N-body particles, the state variable4 is z = (x, p) ⊺ .Their adjoint variables help to accumulate the objective gradient while evolving backward in time by the adjoint equation.Let's denote them by λ = (ξ, π).We can compare each step of ( 16) and ( 17) to ( 44), and write down its adjoint equation following (47).Taking D i i−1 for example, we can write it explicitly as in the form of (44).By (47), its adjoint equation is where we have used the fact that D(t c , t a , t b ) = −D(t c , t b , t a ), and left the ∂/∂z i term, the explicit dependence of the objective on the intermediate states (from, e.g., observables on the light cone), to the observation operator O below.This also naturally determines the subscripts of ξ and π.
Repeating the derivation for K and O, and flipping the arrow of time, we present the adjoint equation time stepping for ( 16) from t i to t i−1 : Like a, we have introduced α and ζ to cache the vector-Jacobian products on their right-hand sides, for the next time step in the kick operator and the objective gradient (see below), respectively.Note that in the reverse order, the F operator is at t i−1 instead of t i as in ( 16), and we need to initialize a n , α n , and ζ n with F n before stepping from t n to t n−1 .Likewise, the gradient of O n−1 n at t n is absent in (49) but enters via the initial conditions following (47).Explicitly, the initial conditions of (49) are The VJPs in F and the ∂/∂z's in O can be computed by AD if the whole forward integration and observation history of ( 16) and ( 17) has been saved.However, this Li et al. can be too costly spatially for GPUs, whose memories are much smaller than those of CPUs.Alternatively, we take advantage of the reversibility of the N-body dynamics and the leapfrog integrator, and recover the history by reverse time integration, which we have already included on the first lines of the K and D operators in (49).We can integrate the leapfrog and the adjoint equations jointly backward in time, and still benefit from the convenience of AD in computing VJPs and ∂/∂z's.In practice, the numerical reversibility suffers from the finite precision and the chaotic N-body dynamics, which we find is generally not a concern for our applications in the result section.
Finally, during the reverse time integration, we can accumulate the objective gradient following (48): where the latter backpropagates from fewer sources than the former as shown in Fig. 1a.To implement ( 49)-( 51) in pmwd with JAX, we only need to write custom VJP rules for the high-level N-body integration-observation loop, while the derivatives and VJPs of the remaining parts including regularization/prior, the observation, the initial conditions, the kick and drift factors, the growth and transfer functions, etc., can all be conveniently computed by AD.
Other than that, we also implement custom VJPs for the scatter and gather operations in Sec.2.2 following Feng (2017), and these further save memory in gradient computations of those nonlinear functions.
As in FastPM, the choice of time steps is flexible, and in fact with pmwd it can even be optimized in non-parametric ways to improve simulation accuracy at fixed computation cost.Here we use time steps linearly spaced in scale factor a, and leave such optimization in a follow-up work.

Simulation
We first test the forward simulations of pmwd.Fig. 2 shows the cosmic web in the final snapshot of a fairly large simulation for the size of GPU memories.
Because GPUs are inherently parallel devices, they can output different results for identical inputs.To test the reproducibility, in Table 1 we compare the root-meansquare deviations (RMSDs) of particle displacements and velocities between two runs, relative to their respective standard deviations, with different floating-point precisions, mesh sizes, particle masses, and numbers of time steps.Other than the precision, mesh size is the most important factor, because a finer mesh can better resolve the most nonlinear and dense structures, which can affect reproducibility as the order of many operations can change easily.The particle mass plays a similar role and less massive particles generally take part in nonlinear motions at earlier times.The number of time steps has very small impact except in the most nonlinear cases.And interestingly, more time steps improves the reproducibility most of the time.
Table 1.pmwd reproducibility on GPU.GPUs can output different results for identical inputs.We simulate 384 3 particles from a = 1/64 to a = 1, with two floating-point precisions, two mesh sizes, two particle masses (with box sizes of 192 Mpc/h and 384 Mpc/h), and two time step sizes.We take the root-mean-square deviations (RMSDs) of particle displacements and velocities between two runs at a = 1, and quote their ratios to the respective standard deviations, about 6 Mpc/h and 3 × 10 2 km/s.In general, the factors on the left of the left four columns affect the reproducibility more than the those on the right, and lower rows are more reproducible than the upper ones.

Differentiation
Model differentiation evolves the adjoint equations backward in time.To save memory, the trajectory of the model state in the forward run is not saved, but re-simulated together with the adjoint equations.Even though in principle the N-body systems are reversible, in practice the reconstructed trajectory can differ from the forward one due to the finite numerical precision and exacerbated by the chaotic dynamics.Better reversibility means the gradients propagate backward along a trajectory closer to the forward path, and thus would be more accurate.To test this, in Table 2 we compare the RMSDs of the forward-then-reverse particle displacements and velocities from the LPT initial conditions, relative to their respective standard deviations, that are very small at the initial time.As before, we vary the floating-point precision, the mesh size, the particle mass, and the number of time steps.The order of factor importance and their effects are the same as in the reproducibility test.This is because more nonlinear structures are more difficult to reverse.One way to improve reversibility is use higher order LPT to initialize the N-body simulations at later times (Michaux et al. 2021), when the displacements and velocities are not as small.We leave this for future development.
Table 2. pmwd reversibility on GPU.Our adjoint method reduces memory cost by reconstructing the forward evolution with reverse time integration.We test the numerical reversibility by comparing the displacements and velocities of particles that have evolved to a = 1 and then reversed to a = 1/64, to those of the LPT initial conditions at a = 1/64, in RMSDs.We take their ratios to the respective standard deviations, about 0.1 Mpc/h and 0.7 km/s.Their smallness is the main reason that the quoted relative differences here are orders of magnitude greater than those in Table 1 (see Fig. 3 where the reversibility is few times more important than the reproducibility in their impacts on the gradients).With the same setup as that in Table 1, we see the same general trend-left factors are more important than the right ones, and lower rows are more reversible than the upper ones.Next, we want to verify that our adjoint method yields the same gradients as those computed by AD.As explained in Sec.3.6, pmwd already utilizes AD on most of the differentiation tasks.To get the AD gradients we disable our custom VJP implementations on the N-body time integration, and the scatter and gather operations.In Fig. 3, we compare the adjoint and AD gradients on a smaller problem, because AD already runs out of memory if we double the number of time steps or increase the space/mass resolution by 2 3 × from the listed specifications in the caption.For better statistics, we repeat both adjoint and AD runs for 64 times, with the same cosmology and white noise modes, and compare their results by an asymmetric difference of X i − Y j , where 1 ⩽ j < i ⩽ 64.First we set X and Y to the adjoint and AD gradients respectively, and find they agree very well on the real white noise (so do their gradients on cosmological parameters not shown here; see ).In addition, we can set both X and Y to either adjoint or AD to check their respective reproducibility.We find both gradients are consistent among different runs of themselves, with AD being a lot more reproducible without uncertainty from the reverse time integration but only that from the GPU reproducibility.This , in a pmwd simulation of 128 3 particles in a (128 Mpc/h) 3 box, with a 256 3 mesh, 15 time steps, and single precision.We choose a mean squared error (MSE) objective between two realizations with the same cosmology but different initial modes, on their density fields on the 256 3 mesh at a = 1, and then compute the gradients with respect to one realization while holding the other fixed.We compare the adjoint gradients to those by AD, for which we have disabled the custom gradient implementation on the scatter, gather, and N-body time stepping operators.The adjoint and AD gradients agree as expected, with a RMSD of ≈ 4 × 10 −5 , 3 orders of magnitude smaller than the standard deviation of the gradients itself, ≈ 0.015.It is also comparable to the difference between two different adjoint gradients, with a RMSD ≈ 5 × 10 −5 .Different AD gradients are more consistent, with a tighter RMSD of ≈ 1 × 10 −5 due to the absence of uncertainty from reverse time integration.
implies that we can ignore the reproducibility errors (Table 1) when the reversibility ones dominate (Table 2).Though this statement should be verified again in the future when we reduce the reversibility errors using, e.g., 3LPT.
Our last test on the adjoint gradients uses them in a toy optimization problem in Fig. 4. We use the Adam optimizer (Kingma & Ba 2015) with a learning rate of 0.1 and the default values for other hyperparameters.Holding the cosmological A toy problem where we optimize the initial conditions by gradient descent to make some interesting pattern after projection.The particles originally fill a 16 × 27 × 16 grid, and then evolve from a = 1/64 to a = 1 for 63 time steps with single precision and a 32 × 54 × 32 mesh in a 160 × 270 × 160 Mpc 3 /h 3 box.We compute their projected density in 64 × 108 pixels and compare that to the target image at the same resolution with an MSE objective.We use the adjoint method and reverse time integration, assuming the latter can reconstruct the forward evolution history accurately.We validate this by demonstrating that the particles evolve backward to align on the initial grid.The optimized initial conditions successfully evolve into the target pattern, which improves with more iterations.Also see the animated reverse time evolution and initial condition optimization on YouTube.
parameters fixed, we can optimize the real white noise modes to source initial particles to evolve into our target pattern, in hundreds to thousands of iterations.Interestingly, we find that the variance of the modes becomes bigger than 1 (that of the standard normal white noise) after the optimization, and the optimized modes show some level of spatial correlation not present in the white noise fields, suggesting that the optimized initial conditions are probably no longer Gaussian.N-body per step, 2 3 x mesh N-body per step, 1x mesh 2LPT growth Figure 5. Performance of pmwd.Both the 2LPT and N-body components scale well from 512 3 particles to 64 3 particles, below which they become overhead dominated.Solving the growth ODEs takes a constant time, and can dominate the cost for small numbers of particles and few time steps, but generally does not affect problems with more than 128 3 particles.

Performance
pmwd benefits from GPU accelerations and efficient CUDA implementations of scatter and gather operations.In Fig. 5, we present performance test of pmwd, and find both the LPT and the N-body parts scale well except for very small problems.The growth function solution has a constant cost, and generally does not affect problem of moderate sizes.However, for a small number of particles and few time steps, it can dominate the computation, in which case one can accelerate the growth computation with an emulator (Kwan et al. 2022).

CONCLUSIONS
In this work, we develop the adjoint method for memory efficient differentiable cosmological simulations, exploiting the reversible nature of N-body Hamiltonian dynamics, and implement it with JAX in a new PM library, pmwd.We have validated the numerical reversibility and the accuracy of the adjoint gradients.pmwd is both computation and memory efficient, enabling larger and more accurate cosmological dark matter simulations.The next step involves modeling cosmological observables such as galaxies.One can achieve this with analytic, semi-analytic, and deep learning components running based on or in parallel with pmwd.In the future, it can also facilitate the simultaneous modeling of multiple observables and the understanding of the astrophysics at play.pmwd will benefit all the forward modeling approaches in cosmology, and will improve gradient based optimization and field-level inference to simultaneously constrain the cosmological parameters and the initial conditions of the Universe.The efficiency of pmwd also makes it a promising route to generate the large amount of training data needed by the likelihood-free inference frameworks (Cranmer et al. 2020;Alsing et al. 2019).
Currently, these applications require more development, including distributed parallel capability on multiple GPUs (Modi et al. 2021), more accurate time integration beyond FastPM (List & Hahn 2023), optimization of spatiotemporal resolution of the PM solvers (Dai & Seljak 2021;Lanzieri et al. 2022;Zhang et al. in prep), short-range supplement by direct summation of the particle-particle (PP) forces on GPUs (Habib et al. 2016;Potter et al. 2017;Garrison et al. 2021), differentiable models for observables, etc.We plan to pursue these in the future.

Li et al. APPENDIX
A. GROWTH EQUATIONS The 2LPT growth functions follow the following ODEs: where If the universe has been sufficiently matter dominated at a i , with Ω m (a i ) ≃ 1 and H ′ /H ≃ −3/2, the initial conditions of the ODEs can be set as the growing mode in this limiting case D 1 ≃ a in the matter dominated era, before being suppressed by dark energy.The above growth equations can be written in such suppression factors, for m ∈ {1, 2}, as with initial conditions We solve the growth equations in G m instead of D m , with the JAX adaptive ODE integrator implementing the adjoint method in the optimize-then-discretize approach.This is because the former can be integrated backward in time more accurately for early times, which can improve the adjoint gradients.We can then evaluate the FastPM time integration factors in ( 14) by and )

Figure 1 .
Figure 1.Simulation-based forward model of the Universe.(a) shows the overall model structure.Single arrows from the cosmological parameters θ and white noise modes ω indicate dependence on θ only, while double arrows imply dependence on both.The time integration loop in (b) expands the solid box in (a), and a single time step in (c) further expands the dashed box in (b).We describe different operators in Sec.2: Boltzmann solver ("Boltz") and initial condition generator by the Lagrangian perturbation theory ("LPT") in Sec.2.1; force solver (F) in Sec.2.2; time integration ("Integ"), kick (K), and drift (D) in Sec.2.3; observation ("Obs" and O) and objective ("Obj") in Sec.2.4.Gradients flow backward with all arrows reversed (Sec.3.6).

Figure 2 .
Figure 2. Relative matter density field, 1 + δ, at a = 1, projected from an 8 Mpc/h thick slab in a pmwd simulation, that has evolved 512 3 particles with single precision and a 1024 3 mesh in a (512 Mpc/h) 3 box for 63 time steps.The simulation takes only 13 seconds to finish on an NVIDIA H100 PCIe GPU.

Figure 3 .
Figure3.gradients of a 128 × 128 slice of the real white noise field d /dω (top panel), in a pmwd simulation of 128 3 particles in a (128 Mpc/h) 3 box, with a 256 3 mesh, 15 time steps, and single precision.We choose a mean squared error (MSE) objective between two realizations with the same cosmology but different initial modes, on their density fields on the 256 3 mesh at a = 1, and then compute the gradients with respect to one realization while holding the other fixed.We compare the adjoint gradients to those by AD, for which we have disabled the custom gradient implementation on the scatter, gather, and N-body time stepping operators.The adjoint and AD gradients agree as expected, with a RMSD of ≈ 4 × 10 −5 , 3 orders of magnitude smaller than the standard deviation of the gradients itself, ≈ 0.015.It is also comparable to the difference between two different adjoint gradients, with a RMSD ≈ 5 × 10 −5 .Different AD gradients are more consistent, with a tighter RMSD of ≈ 1 × 10 −5 due to the absence of uncertainty from reverse time integration.

Figure 4 .
Figure 4.A toy problem where we optimize the initial conditions by gradient descent to make some interesting pattern after projection.The particles originally fill a 16 × 27 × 16 grid, and then evolve from a = 1/64 to a = 1 for 63 time steps with single precision and a 32 × 54 × 32 mesh in a 160 × 270 × 160 Mpc 3 /h 3 box.We compute their projected density in 64 × 108 pixels and compare that to the target image at the same resolution with an MSE objective.We use the adjoint method and reverse time integration, assuming the latter can reconstruct the forward evolution history accurately.We validate this by demonstrating that the particles evolve backward to align on the initial grid.The optimized initial conditions successfully evolve into the target pattern, which improves with more iterations.Also see the animated reverse time evolution and initial condition optimization on YouTube.