Impacts of the physical data model on the forward inference of initial conditions from biased tracers

We investigate the impact of each ingredient in the employed physical data model on the (Bayesian) forward inference of initial conditions from biased tracers. Specifically, we use dark matter halos in a given cosmological simulation volume as tracers of the underlying matter density field. We study the effect of tracer density, grid resolution, gravity model, bias model and likelihood on the inferred initial conditions. We find that the cross-correlation coefficient between true and inferred phases reacts weakly to all ingredients above, and is well predicted by the theoretical expectation derived from a Gaussian model on a broad range of scales. The bias in the amplitude of the inferred initial conditions, on the other hand, depends strongly on the bias model and the likelihood. We conclude that the bias model and likelihood hold the key to an unbiased cosmological inference. Together they must keep the systematics---which arise from the sub-grid physics that are marginalized over---under control in order to obtain an unbiased inference.


Introduction
The distribution of large-scale structures (LSS) in our Universe, specifically galaxies (biased tracers of the underlying matter density field), is very far from the random distribution of initial quantum fluctuations that seed their formation. As complex as the process of galaxy formation is, on large and quasi-linear scales the total matter distribution follows that of dark matter (DM), which only feels the effect of gravity. Going forward in time, the gravitational evolution of the matter distribution is fully deterministic. It follows that, by forward-modeling this process, together with the relation between the matter and galaxy distributions, one should

JCAP03(2021)058
be able to retrieve a vast amount of information directly from the three-dimensional distribution of galaxies as probed by galaxy redshift surveys. This includes, but is not limited to, the initial conditions of our observed patch of the Universe and the parameters of our cosmological models.
There are four principal ingredients of the physical data model in such a field-level, forward inference approach [9,18]:

1. Prior for initial conditions: this element encodes our prior knowledge of the statistical distribution of matter density fluctuations at the very early stage of the Universe, which comes from the theory of initial fluctuations [32,33] and observations of the Cosmic Microwave Background (CMB) [34,35].

2. Gravity model for matter: this ingredient evolves the matter distribution forward in time, from a specific set of initial conditions, under the effect of gravity.

3. Bias model for deterministic matter-galaxy bias: this constituent relates the forward-evolved matter field to the deterministic galaxy field by means of a deterministic relation (see [36] for a review).

4. Likelihood for stochastic galaxy bias: this final piece models the scatter in the deterministic matter-galaxy bias relation, i.e. the difference between the deterministic and observed galaxy fields, in the form of a conditional probability [18-21].
We summarize the connection between these ingredients of the physical data model in figure 1, where (and hereafter) we have generalized "galaxy" to "biased tracer", partially because our results are obtained using DM halos in N-body simulations, and also because most of our results and arguments apply to generic biased tracers. We will discuss possible exceptions at the end of section 4.2.3.

How strongly does the quality of a forward inference depend on these ingredients? In particular, how sensitive are the inferred initial conditions to different choices for each ingredient? So far, this is an open question in the context of forward modeling (see [37] for a case study in the "backward" modeling approach). It is the aim of this paper to address that question. Specifically, inspired by previous findings in [22-24], our main goal is to identify the key ingredients for a robust, unbiased cosmological forward inference from LSS, and thereby what should be the focus of future studies to realize this goal.
In this paper, we specifically measure: the cross-correlation coefficient between true and inferred phases (phase-correlation hereafter), r(k), and the amplitude-bias in the corresponding power spectra, which, following the reconstruction literature, we hereafter refer to as the "transfer function", T_infer(k). Clearly, the former is important for studies making use of the inference for cross-correlations with other probes. The latter is important as it indicates the systematic biases in cosmological parameters that are to be expected. Specifically, a transfer function that is biased in a scale-independent way implies a bias in the inferred primordial normalization of the power spectrum A_s, whereas a scale-dependently biased transfer function entails biases not only in A_s but also in other cosmological parameters, e.g. Omega_m and n_s.
We use DM main halos identified in N-body simulations, for which the true initial conditions are known, as physical biased tracers. We further restrict our analysis to the halo rest frame, i.e. the case without redshift space distortion (RSD). We consider multiple inference setups corresponding to different choices for the ingredients two to four in the above list (bottom red panels in figure 1), as well as different options for halo number density and grid resolution, i.e. smoothing scale.
All forward inferences in this paper follow exactly the flowchart in figure 1, and are carried out using the borg algorithm first introduced in [9]. As our parameter space consists of the whole three-dimensional field of initial conditions, augmented by the bias parameters, efficiently sampling this very high-dimensional posterior with Markov chain Monte Carlo (MCMC) methods is a truly daunting task. The task is even more

critical in the particular problem of forward inference, for each sample requires essentially a full simulation from the early-time Gaussian initial conditions to the late-time non-Gaussian observed data, as diagrammed in figure 1, and hence a non-negligible numerical expense. The borg algorithm addresses this issue with the Hamiltonian Monte Carlo (HMC) sampling method [38-40], which combines geometrical information about the posterior, encoded in its gradient, with Hamiltonian dynamics. The latter ensures that the Metropolis-Hastings proposal always has a high acceptance rate, provided that errors in the numerical integration of the Hamiltonian equations of motion are kept under control [39,41]. We encourage readers to see [9] for further conceptual and technical details specific to borg. We note that, in an application to real data from a galaxy redshift survey, borg is also capable of routinely dealing with multiple sources of systematics, including survey geometries, selection functions, foreground contaminations and RSD effects [see, for example, 17]. These steps are not included in figure 1 as they are not relevant for our analysis using rest-frame halos in N-body simulations.
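The HMC update at the heart of borg can be illustrated with a toy example. The sketch below is a minimal, self-contained leapfrog HMC step for a generic log-posterior; it is illustrative only, not borg's actual implementation, and all function names are our own:

```python
import numpy as np

def hmc_step(x, log_post, grad_log_post, step=0.1, n_leap=20, rng=None):
    """One HMC update: draw an auxiliary momentum, integrate Hamiltonian
    dynamics with a leapfrog scheme, then accept/reject the endpoint
    with a Metropolis-Hastings test on the Hamiltonian difference."""
    rng = np.random.default_rng() if rng is None else rng
    p = rng.standard_normal(x.shape)              # momentum ~ N(0, 1)
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * step * grad_log_post(x_new)    # initial half kick
    for i in range(n_leap):
        x_new += step * p_new                     # drift
        if i < n_leap - 1:
            p_new += step * grad_log_post(x_new)  # full kick
    p_new += 0.5 * step * grad_log_post(x_new)    # final half kick
    # Hamiltonian = -log posterior (potential) + kinetic energy
    h_old = -log_post(x) + 0.5 * np.sum(p**2)
    h_new = -log_post(x_new) + 0.5 * np.sum(p_new**2)
    accept = rng.random() < np.exp(min(0.0, h_old - h_new))
    return (x_new, True) if accept else (x, False)
```

For a well-tuned step size, the leapfrog integration error stays small and the acceptance rate stays high; this is the property that makes HMC viable in the high-dimensional setting of field-level inference.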
Our results in section 4 are based on analyses of the outputs of multiple borg runs: each is an ensemble of MCMC samples that together represent the corresponding posterior distribution. In our particular case, after marginalizing over bias parameters, the posterior parameter space is the three-dimensional field of initial conditions (middle blue diamond panel in figure 1), physically constrained by the three-dimensional input field of halos (top red square panel in figure 1) and our choices of models for each ingredient (bottom red rectangle panels).
The rest of this paper is organized as follows. In section 2, we describe in more detail the three ingredients under investigation and the other settings that together characterize the configuration of each inference. We further detail the N-body simulations and the halos used as input data in section 3. In section 4, we first define the estimators used to assess the quality of the inference and then present our results. We highlight the ingredients that demand better modeling in section 5, and conclude in section 6. The appendices provide additional information about the derivation of the Gaussian expectation in section 4, numerical implementation details and bias priors relevant for section 4.2, as well as the MCMC chains analyzed throughout this paper.

Convention and notation
Our Fourier notation follows that of [18,36]:
$$ f(\mathbf{k}) = \int d^3x\, f(\mathbf{x})\, e^{-i\mathbf{k}\cdot\mathbf{x}}, \qquad f(\mathbf{x}) = \int_{\mathbf{k}} f(\mathbf{k})\, e^{i\mathbf{k}\cdot\mathbf{x}}, \quad \text{with} \quad \int_{\mathbf{k}} \equiv \int \frac{d^3k}{(2\pi)^3}. \quad (1.1) $$
For ensemble averages, we distinguish between $\langle O \rangle_k$, which denotes an average of O over Fourier wavenumber k, and $\langle O \rangle_s$, which denotes an average of O over MCMC samples. A spatial average of O, on the other hand, is denoted by $\bar{O}$.
The matter and halo density contrast fields (on a fixed time slice) are defined as
$$ \delta_m(\mathbf{x}) \equiv \frac{\rho_m(\mathbf{x})}{\bar\rho_m} - 1, \qquad \delta_h(\mathbf{x}) \equiv \frac{n_h(\mathbf{x})}{\bar n_h} - 1. \quad (1.2) $$
Throughout the paper, we assume adiabatic, growing-mode initial conditions such that the initial fluctuations in the matter field can be described by a single field $\delta_{m,\mathrm{ini}}$.

Further, we adopt a flat $\Lambda$CDM cosmology with a primordial power spectrum of the form [42]:
$$ P_\mathrm{prim}(k) = \frac{2\pi^2}{k^3}\, A_s \left(\frac{k}{k_\mathrm{pivot}}\right)^{n_s - 1}, \quad (1.3) $$
such that the linear matter power spectrum is given by [42]:
$$ P_\mathrm{L}(k, a \,|\, \theta) = \frac{4}{25} \frac{k^4\, T^2(k)\, D_+^2(a)}{\Omega_m^2 H_0^4}\, P_\mathrm{prim}(k), \quad (1.4) $$
where $\theta$ represents the set of cosmological parameters $(\Omega_m, A_s, n_s, H_0)$, namely the total matter density parameter, the primordial spectrum normalization, the scalar spectral index and the Hubble constant, respectively; $T(k)$ denotes the matter transfer function; and $D_+(a)$ is the linear growth factor. We also adopt a pivot scale of $k_\mathrm{pivot} = 0.05\ \mathrm{Mpc}^{-1}$, following the Planck team's convention [43].
We further assume the prior on the distribution of $\delta_{m,\mathrm{ini}} \equiv (\delta_{m,\mathrm{ini}}(\mathbf{x}_1), \ldots, \delta_{m,\mathrm{ini}}(\mathbf{x}_n))$, i.e. ingredient one in the above list, to follow a zero-mean, multivariate Gaussian
$$ \mathcal{P}(\delta_{m,\mathrm{ini}}) = \frac{1}{\sqrt{|2\pi C|}} \exp\!\left( -\frac{1}{2}\, \delta_{m,\mathrm{ini}}^{\top} C^{-1} \delta_{m,\mathrm{ini}} \right). \quad (1.5) $$
In a Fourier-space representation of $\delta_{m,\mathrm{ini}}$, the covariance matrix C is diagonal, and the diagonal is set by the linear matter power spectrum in eq. (1.4), at $a = a_\mathrm{ini}$. For all inferences presented in this work, we proceed by fixing all cosmological parameters to the values used in the simulations, with one exception. Instead of specifying $A_s$, we specify the normalization of the present-day linear matter power spectrum, $\sigma_8$, defined as
$$ \sigma_8^2 \equiv \int \frac{dk\, k^2}{2\pi^2}\, P_\mathrm{L}(k)\, W_\mathrm{th}^2(R_\mathrm{th} = 8\, h^{-1}\mathrm{Mpc},\, k), \quad (1.6) $$
where $W_\mathrm{th}(R_\mathrm{th}, k)$ is the Fourier transform of the top-hat filter, and fix it to the value in the corresponding simulation. This is due to the convention in the current numerical implementation of borg. In practice, this subtlety might lead to a percent-level bias in the measured transfer function T_infer(k), especially if the forward inference and the input simulation employ different matter transfer functions T(k). In our case, borg adopts the Eisenstein-Hu fitting formula [44,45], while our N-body simulations take the matter transfer functions computed with CAMB or CLASS as inputs (cf. section 3). The implication of this difference is examined in more detail in appendix B. In fact, the systematic shift this effect might induce on T_infer(k) is negligible compared to the bias we observe throughout section 4.2.
The numerical implementation of the gravity and bias models in the borg framework discretizes the matter and halo density fields using a cloud-in-cell (CIC) projection onto a grid with cell size $l_\mathrm{grid}$, such that
$$ \delta^i = \int d^3x\; W_\mathrm{cic}(\mathbf{x}, \mathbf{x}_i)\, \delta(\mathbf{x}), \quad (1.7) $$
where the superscript i denotes the cell index and $W_\mathrm{cic}(\mathbf{x}, \mathbf{x}_i)$ is the real-space, normalized CIC filter:
$$ W_\mathrm{cic}(\mathbf{x}, \mathbf{x}_i) = \frac{1}{l_\mathrm{grid}^3} \prod_{j=1}^{3} \left( 1 - \frac{|x_j - x_{i,j}|}{l_\mathrm{grid}} \right) \Theta\!\left( l_\mathrm{grid} - |x_j - x_{i,j}| \right), \quad (1.8a) $$

whose Fourier-space representation is given by
$$ W_\mathrm{cic}(l_\mathrm{grid}, k) = \mathrm{sinc}^2\!\left(\frac{k\, l_\mathrm{grid}}{2}\right). \quad (1.8b) $$
Let us briefly discuss the implications of the shape of the smoothing filter here. A sharp-k filter, for example, corresponds to a particular scale cut-off $\Lambda$. This means only modes below the cut-off scale $\Lambda$ enter the analysis, and the higher this cut-off is, the more non-linear modes are included. The CIC filter adopted here, on the other hand, is not localized in Fourier space and thus does not completely remove modes with $k > 1/l_\mathrm{grid}$. This might allow non-linear modes to slip into the inference, as further discussed in section 4.2.2 and section 4.2.3.
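The contrast between the two filters can be checked directly from eq. (1.8b). The snippet below is a simple numerical sketch evaluating both windows at a mode above the nominal resolution scale:

```python
import numpy as np

def w_cic(k, l_grid):
    """Fourier-space CIC window of eq. (1.8b): sinc^2(k * l_grid / 2).
    Note np.sinc(x) = sin(pi x)/(pi x), hence the division by pi here."""
    return np.sinc(k * l_grid / (2.0 * np.pi)) ** 2

def w_sharpk(k, cutoff):
    """Sharp-k filter: keeps modes below the cutoff, removes the rest."""
    return np.where(np.asarray(k) <= cutoff, 1.0, 0.0)

# At k = 2 / l_grid, above the nominal resolution scale 1 / l_grid,
# the CIC window is still far from zero while sharp-k has removed the mode.
l_grid = 15.6
leak = w_cic(2.0 / l_grid, l_grid)   # sinc^2 at argument 1 rad: sin(1)^2
```

This is the quantitative sense in which CIC "leaks" non-linear modes into the analysis while a sharp-k cut does not.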

Inference setup
Below, for each ingredient of the inference under investigation, we introduce and discuss the models or options considered in this study.

Gravity model
We choose the first- and second-order Lagrangian perturbation theories (LPT and 2LPT) [e.g. 46-48] as the gravity models for this comparison. We refer readers to [9] for specific details about the numerical implementation of these models in the borg framework employed here.
Our two choices are the most common and the most relevant for forward inference from current galaxy survey data [see, for example, 17]. A forward inference powered by either of the two approximations naturally includes optimal Baryon Acoustic Oscillation (BAO) reconstruction (to the degree that the forward model is accurate). In particular, with 2LPT, the final joint posterior should also include sub-leading effects of large-scale perturbation modes. For a more detailed discussion of the relation with BAO reconstruction, see section 6 of [18].

Bias model
In this paper, we consider one linear and two non-linear, non-perturbative bias models. All are Eulerian, local-in-matter-density (LIMD) [36] bias relations, i.e. the biased tracer density at a given location depends only on the matter density at that same location in the Eulerian frame. These models are the ones currently implemented in borg and recently employed for the analyses in [8,16,17,49,50].
Our actual observable is the tracer count in a given cell, $n^i_h$. The deterministic or mean-field prediction for this quantity, $n^i_{h,\mathrm{det}}$, is related to the quantity $\delta^i_{h,\mathrm{det}}$ returned by a bias model for the same cell through a nuisance parameter $n_0$, such that
$$ n^i_{h,\mathrm{det}} = n_0 \left( 1 + \delta^i_{h,\mathrm{det}} \right). \quad (2.1) $$
Note that, in practice, the inferred value of $n_0$ can differ from the true mean number of halos in a grid cell, $\bar n^i_h \equiv N_h / N_\mathrm{grid}^3$.

Linear bias
The linear bias model links the forward-evolved matter density $\delta^i_{m,\mathrm{fwd}}$ to the deterministic halo density $\delta^i_{h,\mathrm{det}}$ via a single bias coefficient $b_1$:
$$ \delta^i_{h,\mathrm{det}} = b_1\, \delta^i_{m,\mathrm{fwd}}. \quad (2.2) $$

It is worth noting that the linear bias coefficient $b_1$ in eq. (2.2) is different from the more familiar large-scale linear bias coefficient measured from n-point functions [51,52]. In fact, all bias coefficients appearing in the field-level approach are moment biases (cf. section 4.2 of [36]), which explicitly depend on both the specific shape of the smoothing filter and the smoothing scale. In particular, on linear and quasi-linear scales, where the perturbation theory (PT) for matter and biased tracers is guaranteed to converge to the correct result (when carried out to sufficiently high order), the moment bias $b_1^m$ is related to the large-scale bias $b_1^l$ as follows [36]:
$$ b_1^m = b_1^l \left[ 1 + \mathcal{O}\!\left( R_*^2\, \sigma_1^2(l_\mathrm{grid}) \right) \right], \quad (2.3) $$
where $R_*$ is the physically relevant scale for the tracer formation process and $\sigma_1^2 \equiv \int_k k^2 P_\mathrm{L}(k) W^2_\mathrm{cic}(k)$ is the first generalized spectral moment. Given the grid resolutions considered in this study (cf. section 2.4), these yield a leading-order correction term of $\mathcal{O}(R_*^2 \sigma_1^2(l_\mathrm{grid})) \approx 0.005-0.053$ and $0.050-1.327$ for the two N-body simulation setups (cf. section 3).

Power-law bias
In the power-law bias model, the deterministic halo density in a given cell is related to the matter density at that cell, again by only one free parameter, the power-law index $\beta$ [49]:
$$ 1 + \delta^i_{h,\mathrm{det}} = \left( 1 + \delta^i_{m,\mathrm{fwd}} \right)^{\beta}. \quad (2.4) $$
Expanding the right-hand side for small density fluctuations gives
$$ \left( 1 + \delta^i_{m,\mathrm{fwd}} \right)^{\beta} = 1 + \beta\, \delta^i_{m,\mathrm{fwd}} + \frac{\beta(\beta - 1)}{2} \left( \delta^i_{m,\mathrm{fwd}} \right)^2 + \dots \quad (2.5) $$
Thus, for cells where the evolved matter density fluctuations are still small, the role of the power-law index in eq. (2.4) reduces to that of the linear (moment) bias coefficient in eq. (2.2), i.e. $\beta \approx b_1$ for $|\delta^i_{m,\mathrm{fwd}}| \ll 1$. As shown in figure 2, however, there is a significant fraction of cells where fluctuations, evolved by either N-body or 2LPT, have become highly non-linear at the typical grid resolutions we study in this paper. Thus we expect no convergence of eq. (2.5), but instead a significant deviation of $\beta$ from $b_1$ (cf. right panel of figure 11). We will return to this point in the discussion near the end of section 4.2.3, where we will see that the same issue poses a problem for the validity of the Poisson likelihood (for halos).
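Assuming the power-law relation takes the form $1 + \delta_\mathrm{det} = (1 + \delta_m)^\beta$ (our reading of the model; names below are illustrative), the reduction to linear bias for small fluctuations, and its breakdown for non-linear cells, is easy to check numerically:

```python
import numpy as np

# Assumed form of the power-law bias: 1 + delta_det = (1 + delta)^beta.
def powerlaw_bias(delta_m, beta):
    """Deterministic tracer overdensity from the power-law bias model."""
    return (1.0 + np.asarray(delta_m)) ** beta - 1.0

beta = 1.7
# Linear regime: the effective bias delta_det / delta_m approaches beta.
eff_bias_small = powerlaw_bias(1e-3, beta) / 1e-3
# Non-linear cell: the truncated expansion no longer applies.
delta_nl = 2.0
linear_prediction = beta * delta_nl
actual = powerlaw_bias(delta_nl, beta)
```

For the non-linear cell, the full power law and the linear term differ at order unity, which is the regime in which the inferred beta decouples from b_1.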

Broken power-law bias
The broken power-law bias model, as its name indicates, introduces an additional cut-off into the power-law bias relation in eq. (2.4), such that
$$ 1 + \delta^i_{h,\mathrm{det}} = \left( 1 + \delta^i_{m,\mathrm{fwd}} \right)^{\beta} \exp\!\left[ -\left( \frac{1 + \delta^i_{m,\mathrm{fwd}}}{\rho} \right)^{-\epsilon} \right]. \quad (2.6) $$
This model essentially behaves like the power-law bias model but accounts for the exponential suppression of halo clustering inside voids [53]. It is characterized by three bias parameters: the power-law index $\beta$ and two hyperparameters $(\rho, \epsilon)$ appearing in the exponential cut-off. We emphasize again that both the power-law and broken power-law biases are LIMD bias models, as the right-hand sides (r.h.s.) of eq. (2.4) and eq. (2.6) each include only a single operator constructed out of the local matter density field, namely $\delta^i_{m,\mathrm{fwd}}$.
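Taking the broken power-law in the form written above (our assumed reading of eq. (2.6); all parameter values below are illustrative), the void suppression is easy to see: in a deeply underdense cell the exponential cut-off drives the predicted tracer density to zero, while in overdense cells the model tracks the pure power law:

```python
import numpy as np

# Assumed form of the broken power-law model (cf. eq. 2.6):
#   1 + delta_det = (1 + delta)^beta * exp(-((1 + delta) / rho)^(-eps)).
def broken_powerlaw(delta_m, beta, rho, eps):
    """Power-law bias with an exponential suppression of low-density cells."""
    x = 1.0 + np.asarray(delta_m)
    return x**beta * np.exp(-((x / rho) ** (-eps))) - 1.0

beta, rho, eps = 1.5, 0.5, 1.0
# Deep void: the cut-off drives the predicted tracer density towards -1 (empty).
void = broken_powerlaw(-0.99, beta, rho, eps)
# Overdense cell: close to the pure power-law prediction.
dense = broken_powerlaw(2.0, beta, rho, eps)
pure = (1.0 + 2.0) ** beta - 1.0
```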

Likelihood
For this case study, we compare Poisson and Gaussian distributions for the likelihood, which captures the stochasticity in the actual biased tracer field. These two distributions are the most commonly studied and considered in the literature on halo stochasticity [e.g. 54,55]. Together, they form the basis of all active likelihoods in borg [16,17,49,50].
The joint likelihood is a product of single-cell likelihoods evaluated at each individual cell,
$$ \mathcal{P}\!\left( \{ n^i_h \} \,\big|\, \{ n^i_{h,\mathrm{det}} \} \right) = \prod_i \mathcal{P}^{(1)}\!\left( n^i_h \,\big|\, n^i_{h,\mathrm{det}} \right), \quad (2.7) $$
where $\mathcal{P}^{(1)}$ is defined as in eq. (2.8) (for the Poisson likelihood) or eq. (2.9) (for the Gaussian likelihood). The assumption in eq. (2.7) is that the stochasticity is uncorrelated between grid cells.
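The factorization in eq. (2.7) means the joint log-likelihood is simply a sum of single-cell terms. A minimal sketch, with the Poisson and Gaussian forms written out as we read eqs. (2.8)-(2.9):

```python
import numpy as np
from math import lgamma

def log_poisson(n, mu):
    """Single-cell Poisson log-likelihood log P(n | mu), cf. eq. (2.8)."""
    return n * np.log(mu) - mu - lgamma(n + 1.0)

def log_gaussian(n, mu, sigma=1.0):
    """Single-cell Gaussian log-likelihood with universal sigma, cf. eq. (2.9)."""
    return -0.5 * ((n - mu) / sigma) ** 2 - 0.5 * np.log(2.0 * np.pi * sigma**2)

def joint_log_likelihood(counts, means, single_cell=log_poisson):
    """Eq. (2.7): stochasticity uncorrelated between cells, so the joint
    likelihood factorizes and its log is a sum over cells."""
    return sum(single_cell(n, mu) for n, mu in zip(counts, means))
```

Swapping `single_cell` is all it takes to move between the two likelihood choices compared in this section.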

Poisson likelihood
The single-cell Poisson likelihood is given by
$$ \mathcal{P}^{(1)}\!\left( n^i_h \,\big|\, n^i_{h,\mathrm{det}} \right) = \frac{\left( n^i_{h,\mathrm{det}} \right)^{n^i_h} \exp\!\left( -n^i_{h,\mathrm{det}} \right)}{n^i_h!}, \quad (2.8) $$
where $n^i_{h,\mathrm{det}}$ is defined as in eq. (2.1). The Poisson likelihood requires that $n^i_{h,\mathrm{det}} \geq 0$.

Gaussian likelihood
The single-cell Gaussian likelihood can be written as
$$ \mathcal{P}^{(1)}\!\left( n^i_h \,\big|\, n^i_{h,\mathrm{det}} \right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[ -\frac{\left( n^i_h - n^i_{h,\mathrm{det}} \right)^2}{2\sigma^2} \right], \quad (2.9) $$
in which the Gaussian standard deviation $\sigma$ is an additional nuisance parameter, assumed to be universal for all cells. We will return to this assumption in section 4.2.3.

N-body simulation setup
We use one of the GADGET-2 [56] N-body simulations first presented in [57] and later used in [18,22,23] as our main input data. We label this fiducial box as Box-1. Whenever a higher grid resolution inference is required to verify a trend observed in lower-resolution inferences, we employ a smaller GADGET-2 simulation box, first presented in [52], as our input data. We label this as Box-2. Box-2 has a box length $L_\mathrm{box} = 500\ h^{-1}\mathrm{Mpc}$ and $N_\mathrm{part} = 512^3$ particles, yielding a mass resolution of $M_\mathrm{part} = 7.0 \times 10^{10}\ h^{-1} M_\odot$. Box-2 also assumed a flat $\Lambda$CDM cosmology, albeit with slightly different parameters: $\Omega_m = 0.27$, $n_s = 0.950$, $h = 0.7$, $\sigma_8 = 0.831$. The input matter transfer function was computed with CAMB [66]. Halos were identified in the same fashion as in Box-1. We consider only a single mass bin of $\log_{10}(M_h / h^{-1} M_\odot) \in [12.55, 13.55)$ in our analysis of Box-2.
We summarize the halo samples in table 1. It is worth emphasizing that all inferences take halos in their own rest frame as input, i.e. neglecting the RSD effect.

Results
The phase-correlation coefficient between two given fields $\delta_1$, $\delta_2$ is given by
$$ r(k) = \frac{\left\langle \delta_1(\mathbf{k})\, \delta_2^*(\mathbf{k}) \right\rangle_k}{\sqrt{\left\langle |\delta_1(\mathbf{k})|^2 \right\rangle_k \left\langle |\delta_2(\mathbf{k})|^2 \right\rangle_k}}, \quad (4.1) $$
where $\langle \cdot \rangle_k$ denotes the average over Fourier modes of wavenumber k. Hereafter, we define the correlation coefficient eq. (4.1) between the inferred initial modes in MCMC sample s and the true ones as
$$ r_s(k) = \frac{\left\langle \delta^{\mathrm{infer}}_s(\mathbf{k})\, \delta^{\mathrm{input},*}(\mathbf{k}) \right\rangle_k}{\sqrt{\left\langle \left| \left\langle \delta^{\mathrm{infer}} \right\rangle_s (\mathbf{k}) \right|^2 \right\rangle_k \left\langle |\delta^{\mathrm{input}}(\mathbf{k})|^2 \right\rangle_k}}, \quad (4.2) $$
where $\langle \cdot \rangle_s$ denotes the borg ensemble average, i.e.
$$ \left\langle O \right\rangle_s = \frac{1}{N_\mathrm{samples}} \sum_s O_s, \quad (4.3) $$
where $N_\mathrm{samples} = 1401$ is the number of MCMC samples considered for each inference in our analysis. Additionally, we have shortened the notation by writing $\delta^\mathrm{infer}_{m,\mathrm{ini}} \equiv \delta^\mathrm{infer}$ and $\delta^\mathrm{input}_{m,\mathrm{ini}} \equiv \delta^\mathrm{input}$. Our estimator of the correlation coefficient for each MCMC chain is then the borg ensemble average of eq. (4.2),
$$ r_{ri}(k) = \left\langle r_s(k) \right\rangle_s, \quad (4.4) $$
whose variance is estimated using the sample variance over borg samples,
$$ \sigma^2\!\left[ r_{ri} \right]\!(k) = \frac{1}{N_\mathrm{samples} - 1} \sum_s \left[ r_s(k) - r_{ri}(k) \right]^2. \quad (4.5) $$
The inferred phases of the initial conditions are unbiased if $r_{ri}(k)$ is consistent with 1 within the uncertainty. Note that we use the MCMC ensemble mean $\langle \delta^\mathrm{infer} \rangle_s$ instead of the single-sample $\delta^\mathrm{infer}_s$ in the denominator of the estimator eq. (4.2). This ensures that, if the inferred initial conditions are drawn from a posterior centered on the true initial conditions, $r_{ri}(k)$ indeed asymptotes to 1. This would not be the case if one used $\delta^\mathrm{infer}_s$ in the denominator, since the inference noise would then always lead to an $r_{ri}(k)$ below 1.
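As an illustration of eq. (4.1)-type estimators, the sketch below bins the Fourier modes of two gridded fields by |k| and computes the mode-averaged cross-correlation coefficient. It is a simplified stand-in for the full estimator (it ignores the MCMC ensemble average; all names are our own):

```python
import numpy as np

def correlation_coefficient(field1, field2, box_size, n_bins=10):
    """Cross-correlation coefficient r(k) of two real 3D fields,
    averaging over Fourier modes in linearly spaced |k| bins."""
    n = field1.shape[0]
    f1 = np.fft.rfftn(field1)
    f2 = np.fft.rfftn(field2)
    kf = 2.0 * np.pi / box_size                   # fundamental frequency
    kx = np.fft.fftfreq(n, d=1.0 / n) * kf        # signed integer modes * kf
    kz = np.fft.rfftfreq(n, d=1.0 / n) * kf
    kmag = np.sqrt(kx[:, None, None]**2 + kx[None, :, None]**2
                   + kz[None, None, :]**2)
    edges = np.linspace(kf, kmag.max(), n_bins + 1)
    r = np.full(n_bins, np.nan)
    for i in range(n_bins):
        m = (kmag >= edges[i]) & (kmag < edges[i + 1])
        if not m.any():
            continue                              # leave empty bins as NaN
        cross = np.sum((f1[m] * np.conj(f2[m])).real)
        r[i] = cross / np.sqrt(np.sum(np.abs(f1[m])**2)
                               * np.sum(np.abs(f2[m])**2))
    return 0.5 * (edges[:-1] + edges[1:]), r
```

Identical fields give r(k) = 1 in every populated bin, while independent fields give r(k) scattered around zero with an amplitude set by the number of modes per bin.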
We use the Gaussian expectation of $r_{ri}$ as a reference. This large-scale (low-k) limit corresponds to the case wherein the $\delta_m$, $\delta_h$, and $\varepsilon_h$ (halo stochasticity) fields are all Gaussian, and the tracer and matter fields are linearly related. As derived in appendix A, the Gaussian expectation is given by
$$ r_{ri,\mathrm{lin}}(k) = \left[ 1 + \frac{P_\varepsilon(k)}{b_1^2\, P_\mathrm{L}(k)} \right]^{-1/2}, \quad (4.6) $$

where $P_\mathrm{L}(k)$ and $P_\varepsilon(k)$ denote the linear matter power spectrum and the halo stochasticity power spectrum, respectively. Eq. (4.6) clearly shows that there is an expected limit for the inference, set by halo stochasticity and bias, if the inference only has access to information from the linear evolution of LSS. Additionally, eq. (4.6) suggests that one expects $r_{ri,\mathrm{lin}}$ to asymptote to unity on large scales (low-k values) and to decrease at progressively smaller scales (higher-k values). This has nothing to do with non-linear evolution, but is simply due to the fact that $P_\varepsilon(k)$ asymptotes to a constant on large scales while $P_\mathrm{L}(k)$ is a decreasing function of k. Note that at very low k (below the range shown here), $r_{ri,\mathrm{lin}}$ actually shrinks again due to the turnover in $P_\mathrm{L}(k)$.

For comparison purposes, since the halo stochasticity power spectrum cannot be precisely determined, we approximate eq. (4.6) by the correlation coefficient between the actual halo modes and the input initial Fourier modes, $r_{hi}$, which is given by (cf. eq. (4.1)):
$$ r_{hi}(k) = \frac{\left\langle \delta_h(\mathbf{k})\, \delta^{\mathrm{input},*}(\mathbf{k}) \right\rangle_k}{\sqrt{\left\langle |\delta_h(\mathbf{k})|^2 \right\rangle_k \left\langle |\delta^{\mathrm{input}}(\mathbf{k})|^2 \right\rangle_k}}. \quad (4.7) $$
Below, we show eq. (4.7) as dotted lines in all of our figures of correlation coefficients to serve as a reference. Note that eq. (4.7) can deviate from eq. (4.6) at high k, where the fields are significantly non-Gaussian. However, at low k, for whatever weighting is applied (to the halos), eq. (4.6) always reduces to eq. (4.7).

The transfer function can be defined for each MCMC sample s as
$$ T_s(k) = \sqrt{ \frac{P^{\mathrm{infer}}_s(k)}{P^{\mathrm{input}}(k)} }, \quad (4.8) $$
where, as usual, we define the power spectrum via $\langle \delta(\mathbf{k})\, \delta^*(\mathbf{k}') \rangle = (2\pi)^3\, \delta_D(\mathbf{k} - \mathbf{k}')\, P(k)$. The borg ensemble mean and variance are then given by
$$ T_\mathrm{infer}(k) = \left\langle T_s(k) \right\rangle_s, \quad (4.9) $$
$$ \sigma^2\!\left[ T_\mathrm{infer} \right]\!(k) = \frac{1}{N_\mathrm{samples} - 1} \sum_s \left[ T_s(k) - T_\mathrm{infer}(k) \right]^2. \quad (4.10) $$
The mean inferred amplitudes of the initial conditions at different wavenumbers are unbiased compared to the truth if $T_\mathrm{infer}(k)$ is consistent with 1 within the uncertainty.

Below we vary each element in section 2 that could potentially affect the quality of the inference, while keeping the rest of the inference setup fixed.
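The transfer-function estimator admits a similarly compact numerical sketch: bin the power of each gridded field in |k| and take the square root of the ratio (illustrative only, ignoring the MCMC ensemble average):

```python
import numpy as np

def binned_power(field, box_size, edges):
    """Mode-averaged power in |k| bins from a real 3D field (NaN if empty)."""
    n = field.shape[0]
    fk = np.fft.rfftn(field)
    kf = 2.0 * np.pi / box_size
    kx = np.fft.fftfreq(n, d=1.0 / n) * kf
    kz = np.fft.rfftfreq(n, d=1.0 / n) * kf
    kmag = np.sqrt(kx[:, None, None]**2 + kx[None, :, None]**2
                   + kz[None, None, :]**2)
    p = np.full(len(edges) - 1, np.nan)
    for i in range(len(edges) - 1):
        m = (kmag >= edges[i]) & (kmag < edges[i + 1])
        if m.any():
            p[i] = np.mean(np.abs(fk[m]) ** 2)
    return p

def transfer_function(field_infer, field_input, box_size, n_bins=8):
    """T(k) = sqrt(P_infer / P_input): unity when amplitudes are unbiased."""
    n = field_infer.shape[0]
    kf = 2.0 * np.pi / box_size
    edges = np.linspace(kf, np.sqrt(3.0) * (n // 2) * kf, n_bins + 1)
    return np.sqrt(binned_power(field_infer, box_size, edges)
                   / binned_power(field_input, box_size, edges))
```

Rescaling a field by a constant factor shifts T(k) by exactly that factor on all scales, which is the scale-independent amplitude bias discussed in the text.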
We classify our results into two categories: those from inferences using mock data and those from inferences using N-body DM halo catalogs. For additional clarity, we explicitly specify the details of the setup at the beginning of each section.

Results with mock data
In addition to the N-body halo datasets described in section 3, we generate mock "halo" fields by consistently following the same procedure as the forward inference:

1. Draw a Gaussian realization of $\delta_{m,\mathrm{ini}}$ from eq. (1.5).

2. Forward evolve $\delta_{m,\mathrm{ini}} \to \delta_{m,\mathrm{fwd}}$, assuming a certain gravity model (LPT or 2LPT).

3. Apply the assumed bias model to obtain the deterministic tracer field (cf. section 2.2).

4. Draw the stochastic tracer counts from the assumed likelihood (cf. section 2.3).

We use these mock datasets in section 4.1.1 and section 4.1.2 below, each as input for an inference employing exactly the same choices of gravity model, bias model and likelihood. The goal is to isolate and study the effect of tracer density and grid resolution, respectively, on the inference.
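A toy version of this mock generation (Gaussian draw, gravity step, bias, stochastic draw) can be written down directly. Here linear scaling stands in for the LPT/2LPT step, white noise for the correlated prior, and a power-law bias plus Poisson draw for the last two ingredients; all choices and names are illustrative:

```python
import numpy as np

def make_mock(n_grid, n0, beta, growth=1.0, seed=0):
    """Toy mock generator: (1) Gaussian initial conditions (white noise
    stands in for the correlated prior of eq. 1.5), (2) gravity step
    (linear growth as a stand-in for LPT/2LPT), (3) deterministic
    power-law bias, (4) Poisson draw for the stochasticity."""
    rng = np.random.default_rng(seed)
    delta_ini = 0.1 * rng.standard_normal((n_grid,) * 3)      # step 1
    delta_fwd = growth * delta_ini                            # step 2
    n_det = n0 * np.clip(1.0 + delta_fwd, 0.0, None) ** beta  # step 3
    return rng.poisson(n_det)                                 # step 4
```

Because the mock is drawn from the same model family used in the inference, any residual inference error probes the sampler and the data volume rather than model mismatch, which is exactly the point of the tests in this section.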

Tracer number density
All results in this section are obtained with the following setup:

• 2LPT forward model.
We consider three mock "halo" catalogs, as described at the beginning of this section. We emphasize again that, as the biased field is generated from the exact models used in the inference, any difference between the qualities of different inferences should really arise from the tracer number density alone (the same applies for the grid resolution examined in section 4.1.2 below). We consider three cases, with tracer (comoving) number densities of $\bar n_h = 3.34 \times 10^{-4}$, $1.10 \times 10^{-4}$ and $2.91 \times 10^{-5}\ (h\,\mathrm{Mpc}^{-1})^3$. In figure 3, we show the results for these inferences.
As shown in the upper plot of the left panel of figure 3, the phase-correlation coefficient is scale-dependent. At moderately high k, the inference from mock data performs slightly better than the corresponding Gaussian expectation, while it agrees perfectly with the latter at low $k \leq 0.04\ h\,\mathrm{Mpc}^{-1}$. This trend indicates two things:

• The forward inference does have access to information encoded in the non-linear evolution of LSS and can recover the small-scale phases that have been processed by gravity in the mildly non-linear regime. The inference quality relative to the Gaussian expectation depends only very weakly on the tracer density, at least on the scales considered here.

• The Gaussian expectation correctly describes the phase inference on large scales, even in the absence of modeling error.
Indeed, the bottom plot in the left panel of figure 3, which shows the ratios between the phase-correlation coefficients and their corresponding Gaussian expectations, confirms that the Gaussian expectation is an excellent indicator of the cross-correlation between true and inferred initial conditions at low k for a wide range of tracer densities.
The right panel of figure 3 shows the amplitude-bias, or transfer function, in the corresponding inferences. The transfer functions are consistent with 1, which demonstrates the consistency of the inference pipeline.

[Figure 3. Left: phase-correlation coefficients; the Gaussian expectations (cf. eq. (4.7)) are plotted as dotted lines. Right: the amplitude-bias between inferred and true initial conditions, measured in terms of the transfer function T_infer, for the same cases shown on the left. The solid lines represent the ensemble means (cf. eq. (4.9)) while the shaded bands designate the 1-σ uncertainty (cf. eq. (4.10)).]

The right panel of figure 3 also shows a direct comparison between the transfer functions in the three inferences, which interestingly differ only at the 2% level at the lowest k values. Another important observation is that the uncertainties, introduced by marginalizing over the Fourier phases, are of order 2% for both the phase-correlation and the amplitude-bias. This shows that there is still a lot of information and constraining power even when the phases are sampled, as done in the field-level, forward-modeling approach.
We emphasize again, however, that there is no modeling error since the mock datasets employed in these inferences are sampled from the model assumed in the inference. Additionally, the three mock datasets were generated with the same phases, which explains the correlated fluctuations in the three different mock tracer cases at high k in the right panel of figure 3.

Grid resolution
The general setup in this section is as follows:

• 2LPT forward model.

A higher grid resolution gives the inference access to more small-scale information. However, as discussed in section 1, one must be careful not to include information from highly non-linear scales where our model is no longer valid. Fortunately, with mock data, we can focus solely on the effect of the former, i.e. access to more small-scale information, on the inference. The top panels of figure 4 compare the phase-correlation coefficients from six inferences with five different grid resolutions. Higher resolutions are observed to considerably improve r(k) at high-k values. Interestingly, the improvement when going from $l_\mathrm{grid} = 7.8\ h^{-1}\mathrm{Mpc}$ to $l_\mathrm{grid} = 5.2\ h^{-1}\mathrm{Mpc}$ is marginal. This is presumably because we are reaching the shot-noise-dominated regime, where most cells are empty at this tracer density. We show, in the bottom panels of figure 4, a comparison between the transfer functions from the corresponding inferences. Similarly to the case of varying tracer density, the transfer function appears rather insensitive to the grid resolution. Thus, for such cases where the physical data model perfectly matches the underlying mechanism generating the data, the inference is robustly unbiased.

Results with N-body data
Results shown in figure 3 and figure 4 are obtained with mock data, for which there is no mismatch between the mechanism generating the data and the models employed in the inference. When N-body halos from Box-1 are instead taken as input data in each of the setups shown above, i.e. with all inference settings kept the same, we obtain the results shown in figure 5. These results essentially confirm the trends observed with mock data, with one interesting exception, to which we will turn shortly.
First, for the phase-correlation, we observe the same trends, namely the scale-dependent behavior and the sharp decrease beyond some characteristic wavenumber. Despite possible differences between the physical data model and the actual data, the inferences continue to outperform the Gaussian expectations. Second, for the transfer function, we find no strong dependence on either tracer density or grid resolution. Interestingly, we observe, consistently across all inferences, a clear scale-dependent bias which cannot be explained by the scale-dependent, percent-level bias introduced by the different normalization conventions of CLASS (CAMB) and borg (see appendix B for more details).
The origin of this bias in the measured $T_\mathrm{infer}(k)$ is perhaps easiest to understand in the linear regime, where
$$ \delta_h(\mathbf{k}, a) = b_1\, D_+(a)\, \delta_{m,\mathrm{ini}}(\mathbf{k}) + \varepsilon(\mathbf{k}), \quad (4.12) $$
with $D_+(a)$ being the linear growth factor. One immediately sees that the linear bias $b_1$ and the amplitude of the initial fluctuations are perfectly degenerate at the linear level. A wrongly inferred $b_1$ (or $\beta$ in the context of the power-law bias model) thus entails a bias in the amplitude of the inferred modes. To correctly recover both the bias parameter(s) and the amplitude, the physical data model must be flexible enough while still keeping sub-grid systematics under rigorous control [24]. In particular, the fact that the bias in $\beta$ only shows up with N-body input further implies that this bias is of physical origin, and thus can only be addressed by improving the gravity model, the bias model, the likelihood, or any combination of these three ingredients of the physical data model. We repeatedly return to this point in sections 4.2.1-4.2.3, where we show that it is most likely a combination of the bias model and likelihood (and to some extent, the gravity model) that is responsible and therefore should be improved.
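The degeneracy at the heart of eq. (4.12) can be made explicit with a two-line numerical check: rescaling the initial amplitude by a factor alpha while rescaling $b_1$ by $1/\alpha$ leaves the deterministic linear prediction unchanged (a minimal illustration; all values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
delta_ini = rng.standard_normal(100)      # toy initial modes
b1, growth, alpha = 1.5, 0.8, 1.3

# Deterministic part of the linear-regime prediction: b1 * D(a) * delta_ini.
delta_h = b1 * growth * delta_ini
# Rescale the amplitude up by alpha and the bias down by alpha:
# the prediction is identical, so the data alone cannot separate
# b1 from the initial amplitude at this order.
delta_h_rescaled = (b1 / alpha) * growth * (alpha * delta_ini)
```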

Gravity model
To obtain results in this section, we use the following setup:

• Power-law bias model.
We run two inferences, using LPT and 2LPT as the forward model. We note that LPT speeds up the HMC sampler by approximately a factor of 3, on average. The warm-up phases, during which the chains approach the stationary target distribution (see appendix D), are very similar in both cases in terms of the number of MCMC samples required. As shown in figure 6, the performance of the inference using LPT mirrors that of the inference using 2LPT extremely well at $l_\mathrm{grid} = 15.6\ h^{-1}\mathrm{Mpc}$, in terms of both metrics. The ensemble means of the inferred initial density fields using LPT and 2LPT are 99% spatially correlated. We further verify that this trend is universal among inferences using different bias models and likelihoods. To test the limit of LPT, we increase the resolution to $l_\mathrm{grid} = 5.2\ h^{-1}\mathrm{Mpc}$ using Box-2. In this case we observe a 5-10% difference in performance, favoring 2LPT. Taking into consideration also the computational demand, these results jointly suggest that, for large-volume surveys where the grid resolution will in any case be limited (due to numerical restrictions on the number of grid cells), LPT could be a sufficient choice. In fact, the validity of LPT as the forward model for an inference has already been demonstrated in the BOSS-SDSS3 inference in [17] (see section 4.4, especially figure 6). The limit $l_\mathrm{grid} = 5.2\ h^{-1}\mathrm{Mpc}$, where LPT starts to fall behind 2LPT, should however be noted. Going from LPT to 2LPT shifts the transfer functions towards unity, which is the right direction. The shifts are, however, not nearly enough to yield transfer functions consistent with 1. We therefore argue that the gravity model is not the key to this issue.

Bias model
In this section, we adopt the following general setup:
• 2LPT forward model.
• Box-1 (fiducial sample) and Box-2.

We consider three deterministic bias relations: the linear, the power-law, and the broken power-law models (cf. section 2.2). Note that when combining the linear bias model (cf. eq. (2.2)) with the Poisson likelihood (cf. eq. (2.8)), we need to implement a soft-thresholder which truncates δ_h,det to ensure that n_h,det ≥ 0. Details about the thresholding procedure and the priors on our bias parameters can be found in appendix C. Figure 7 shows the results of inferences using the three bias models described above, for Box-1 at l_grid = 15.6 h^-1 Mpc (top panels) and Box-2 at l_grid = 5.2 h^-1 Mpc (bottom panels). For the former, all bias models considered perform comparably in both metrics: the differences in r(k) all lie within the 1-σ uncertainties, while the differences in the transfer functions favor the linear bias model, whose amplitude bias is roughly 6-10% smaller than that of the others. More interestingly, the results for the higher-resolution case clearly support this trend: both metrics favor the linear bias model, and the differences in the transfer function now increase significantly. Still, none of the cases yields a transfer function consistent with unity.
The fact that the differences between bias models are amplified at higher grid resolution strongly indicates that it is the cells whose fluctuations have become non-linear that dominate the r.h.s. of eq. (2.5), and that the expansion itself fails to converge at these resolutions. Note that the linear bias model, as implemented here with the Poisson likelihood, is not truly linear: the soft-thresholder described in appendix C, which ensures that n^i_h,det is always positive (as required for a Poisson variable), naturally deals with voids and extremely low-density regions by setting n^i_h,det = 0 in those cells. A fairer comparison between the linear and power-law bias models requires, for example, a Gaussian likelihood. We return to this subtlety in figure 9, where it becomes clear that the thresholding procedure applied in this section is highly unlikely to be the main reason for the superior performance of the linear bias model.
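For concreteness, the three deterministic bias relations can be sketched as follows. This is an illustrative numpy sketch, not the borg implementation; in particular, the truncated form of the broken power-law below, with hypothetical parameter names rho_g and eps_g, is one common parameterization and may differ in detail from the exact expressions in section 2.2.

```python
import numpy as np

def linear_bias(delta, b1):
    # Eq. (2.2): delta_h,det = b1 * delta (can go negative, hence the
    # soft-thresholder when paired with a Poisson likelihood).
    return b1 * delta

def power_law_bias(delta, nbar, beta):
    # Predicted counts proportional to (1 + delta)^beta.
    return nbar * (1.0 + delta)**beta

def broken_power_law_bias(delta, nbar, beta, rho_g, eps_g):
    # Truncated ("broken") power law: the exponential factor suppresses
    # predicted counts in underdense cells. rho_g and eps_g are
    # hypothetical parameter names used only in this sketch.
    rho = 1.0 + delta
    return nbar * rho**beta * np.exp(-rho_g * rho**(-eps_g))
```

The exponential truncation acts most strongly in voids, which is where the power-law and broken power-law predictions differ the most.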

Likelihood
In this section, we adopt the following general setup:
• 2LPT forward model.

We consider two forms of the likelihood: the Poisson and the Gaussian distribution (cf. section 2.3). In the case of the Gaussian likelihood, we find that leaving the Gaussian variance σ^2 (cf. eq. (2.9)) free is generally problematic when the average Poisson variance of the tracer counts-in-cell is small, that is N_h/N^3_grid ≲ 1. Typically in such cases, we find that the variance parameter σ^2 collapses to a very small value, i.e. σ^2 ≪ N_h/N^3_grid, and the chain only explores a narrow region around this value in posterior space. To avoid this issue, we fix σ^2 = N_h/N^3_grid in all inferences employing a Gaussian likelihood. It would be worth trying to alleviate this issue in the future by, for example, relaxing the universal-variance assumption in eq. (2.9) and allowing the variance to be density-dependent, as discussed in [24]. This way, the likelihood could essentially downweight regions with low signal-to-noise while upweighting those with higher signal-to-noise. As shown in the top panels of figure 8, the phase-correlation metric marginally favors the Poisson likelihood. Interestingly, however, the transfer function strongly prefers the Gaussian case. It is worth keeping in mind that the amplitude bias is robustly observed across different halo mass ranges, grid resolutions, forward models and bias models. We therefore conclude that the Poisson likelihood is incompatible with halo stochasticity. This is perhaps not entirely surprising, as previous literature has already pointed out that halo stochasticity can be either sub-Poisson, due to the halo exclusion effect [55, 69, 70], or super-Poisson in high-density regions [54]. Intuitively, we only expect halo stochasticity to asymptote to the Poisson limit at very high wavenumber k; however, precisely in this regime, our bias expansions no longer converge due to non-linearities in the matter density field (cf. figure 2).
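For reference, the two per-cell likelihoods compared here can be sketched as follows. This is a minimal sketch, not the borg implementation: lam stands for the predicted counts n_h,det per cell, and sigma2 is fixed by the caller to N_h/N^3_grid, as done in the text.

```python
import numpy as np
from math import lgamma

def log_poisson(counts, lam):
    """Poisson log-likelihood summed over cells; lam is the predicted
    halo count per cell, clipped to stay positive."""
    lam = np.clip(lam, 1e-12, None)
    log_fact = sum(lgamma(float(n) + 1.0) for n in np.ravel(counts))
    return float(np.sum(counts * np.log(lam) - lam) - log_fact)

def log_gaussian(counts, lam, sigma2):
    """Gaussian log-likelihood with a universal (density-independent)
    variance sigma2, as assumed in eq. (2.9)."""
    return float(-0.5 * np.sum((counts - lam)**2 / sigma2
                               + np.log(2.0 * np.pi * sigma2)))
```

Both log-likelihoods peak when the predicted field matches the observed counts; they differ in how strongly deviations are penalized in low- versus high-density cells, which is precisely where halo stochasticity departs from the Poisson limit.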
To summarize, our results demonstrate that using a Poisson likelihood to model halo stochasticity typically leads to a biased transfer function, i.e. a bias in the amplitudes of the inferred modes. This has important implications for future cosmological inference using the field-level, forward-modeling approach.
Note, however, that (main) halo and galaxy stochasticities might differ significantly: a likelihood, or a bias model for that matter, optimized on main halos might not be compatible with galaxy stochasticity or bias. Rather than a recommendation to always prefer the Gaussian over the Poisson likelihood, or the linear bias model over a more sophisticated one such as the broken power-law model, our results firmly suggest that the likelihood and the bias model should be rigorously derived from, for example, a perturbative approach, such that sub-grid systematics can be kept under control [e.g. 18-20, 22-24].
We now revisit the case of the linear bias model. Figure 9 shows a direct comparison between the Poisson and the Gaussian likelihood, combined with the linear bias model, at the highest resolution we have probed. The Gaussian likelihood does not involve any thresholding procedure. Interestingly, each metric now favors a different setup: the Poisson case seems to slightly outperform the Gaussian in the phase correlation at high k, while the transfer function again favors the Gaussian case, consistent with the trend shown in figure 7. We additionally note that the case shown in figure 9 exhibits the most significant difference in phase correlations across all of our comparisons, reaching 5% at k = 0.2 h Mpc^-1.

Discussion
In this paper, we have examined the forward-modeling approach to biased-tracer clustering in terms of inferring the initial conditions in the observed cosmological volume. Specifically, employing the borg framework and N-body simulations, we inspect how different ingredients of the physical data model affect the performance of the inference of the initial conditions. These include: tracer density, grid resolution, gravity model, bias model and likelihood. We quantify the performance of each inference by two metrics:
• The cross-correlation between inferred and input initial modes, quantified by the phase-correlation coefficient;
• The amplitude bias between the inferred and input amplitudes of these initial modes, quantified by the transfer function.
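The two metrics can be computed directly from the Fourier modes of the inferred and true initial fields. Below is a minimal numpy sketch; the spherical k-binning and normalization are illustrative assumptions rather than the exact convention used in the paper.

```python
import numpy as np

def r_and_T(delta_infer, delta_true, n_bins=8):
    """Phase-correlation coefficient r(k) and transfer function T(k)
    between two real 3D fields, in spherical k-bins (grid units)."""
    n = delta_infer.shape[0]
    fi = np.fft.rfftn(delta_infer)
    ft = np.fft.rfftn(delta_true)
    kx = np.fft.fftfreq(n)
    kz = np.fft.rfftfreq(n)
    kmag = np.sqrt(kx[:, None, None]**2 + kx[None, :, None]**2
                   + kz[None, None, :]**2)
    bins = np.linspace(0.0, kmag.max(), n_bins + 1)
    idx = np.clip(np.digitize(kmag.ravel(), bins) - 1, 0, n_bins - 1)
    # Bin the auto- and cross-power of the two fields.
    P_ii = np.bincount(idx, (np.abs(fi)**2).ravel(), n_bins)
    P_tt = np.bincount(idx, (np.abs(ft)**2).ravel(), n_bins)
    P_it = np.bincount(idx, (fi * np.conj(ft)).real.ravel(), n_bins)
    r = P_it / np.sqrt(P_ii * P_tt)   # -> 1 for perfectly correlated phases
    T = np.sqrt(P_ii / P_tt)          # -> 1 for unbiased amplitudes
    return r, T
```

Note that r(k) is insensitive to an overall rescaling of the inferred field, whereas T(k) is sensitive only to that rescaling: the two metrics probe complementary aspects of the inference.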

Tracer density and grid resolution
First, in section 4.1.1, using mock tracers, we observe that, in the absence of modeling errors, enhancing the tracer density generally improves the inference, even at scales where the clustering signal dominates over shot noise. Similarly, increasing the grid resolution increases the small-scale information the inference can access and consequently boosts the quality of the inference, as found in section 4.1.2. Both effects described above are perfectly captured by the linear Gaussian expectation we derived in eq. (4.6). At low k, the Gaussian model is an unbiased estimator of the cross-correlation between the inferred phases and the underlying truth. At high k, this relationship is biased, but depends on neither the tracer density nor the grid resolution.

Gravity model
Second, in section 4.2, using DM halos from N-body simulations as physical tracers, we have investigated the robustness of the inference when modeling error is present. In particular, the choice of gravity model considered in section 4.2.1, be it LPT or 2LPT, does not significantly affect the performance of the inference up to a moderate grid resolution of l_grid = 15.6 h^-1 Mpc. This has strong implications for cosmological inference using the forward-modeling approach with data from future large-volume surveys. We note, however, that 2LPT outperforms LPT at higher grid resolutions.

Bias model and likelihood
Last, but perhaps most interestingly, we observe for all inferences with N-body data that the Poisson likelihood results in a significant bias in the amplitude of the inferred modes. Further investigation is needed to determine the exact origin of this effect, though, as discussed in sections 4.2.2 and 4.2.3, our results suggest that it is an indication of uncontrolled systematics from highly non-linear fluctuations (cf. figure 2) and their couplings (see also the results in [24], who follow an effective-field-theory (EFT) approach which adopts a sharp-k filter in place of the CIC filter employed here). Let us summarize our findings below:

• The phase correlation is sensitive to a change in any ingredient. Fortunately, the shifts are generally below 5% in any direction and can be well predicted by the Gaussian expectation in all cases. This implies that analyses that cross-correlate the inference with other datasets [e.g. 71] will be robust to the details of the physical data model.
• The transfer function is found to be susceptible only to the choices of bias model and likelihood. The differences between likelihoods and between bias models are, however, significant. At this point, unfortunately, it is unclear how to predict these differences from first principles.

Conclusions
To our knowledge, our results represent the most stringent test so far of Bayesian forward modeling and inference of biased-tracer clustering within the borg framework. Our input data are main halos identified in N-body simulations, unlike the mock tracers generated from the same physical data model employed by previous analyses [see, e.g. 9, 50]. More realistic and rigorous tests still lie ahead on the way to the goal of unbiased inference of initial conditions and cosmological parameters with this approach. Specifically, extending the input biased tracers to include galaxies, either populated inside DM halos using a halo occupation distribution model or directly identified in hydrodynamic simulations, would be such a test. An even more realistic case must also consider redshift-space distortions (RSD), survey geometries and selection effects applied to the galaxy field. We defer such studies to future work as well.
Clearly, a transfer function T_infer(k) with a scale-dependent bias of up to ∼30% has significant implications for cosmological inference, especially when future analyses aim to reach percent-level constraints on cosmological parameters. In particular, one would expect biases not only in A_s but also in Ω_m, n_s and probably the distance scale as well. We thus view this issue as a call for attention to studies and developments of accurate bias models and likelihoods. As already pointed out above, borg and the work in this paper present a physically-motivated and technically-ready statistical framework for such studies.
Specifically, an EFT-based approach such as the one presented in [18, 22, 24] is certainly preferable. In fact, [24] showed that, for more conservative scale cut-offs, one can recover an unbiased primordial power spectrum normalization at the percent level for the same fiducial halo sample studied in this work. The analysis in [24], however, fixed the phases. It would be interesting to investigate how the EFT approach performs, specifically in terms of the level of systematic bias, when the phases are inferred from the data and marginalized over, as done in this paper. We leave this investigation for an upcoming work.

B CAMB versus Eisenstein-Hu transfer functions

A mismatch between the transfer functions computed by CAMB and those obtained from the Eisenstein-Hu fitting formula will result in a bias between the inferred and input initial conditions, and hence a deviation of T_rec(k) from 1. In this appendix, we examine this effect by measuring the ratio between these transfer functions. Specifically, figure 10 shows that one can expect a scale-dependent bias in T_rec of up to 1.7% on the scales of interest in this paper. This clearly cannot explain the bias observed in section 4.2.

C Priors and thresholding
For all inferences presented in this paper, we impose uniform priors and positivity constraints on all bias parameters, including the nuisance parameter n_0. The soft-thresholder truncates δ_h,det using the softplus function sp(x) = (1/c) ln(1 + e^(c x)) [72]. We choose c = 5 for our implementation. We emphasize again that this thresholding procedure is only needed for, and thus only applied to, inferences using both the linear bias model and the Poisson likelihood.
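A minimal sketch of such a scaled softplus thresholder follows; the exact functional form and how it is applied to δ_h,det in borg may differ, while c = 5 follows the value quoted above.

```python
import numpy as np

def softplus(x, c=5.0):
    """Scaled softplus sp(x) = (1/c) * log(1 + exp(c*x)): smooth,
    strictly positive, and approaching max(x, 0) as c grows."""
    # np.logaddexp(0, y) computes log(1 + exp(y)) in a numerically
    # stable way, avoiding overflow for large c*x.
    return np.logaddexp(0.0, c * np.asarray(x, dtype=float)) / c

# Applied so that predicted counts stay positive, e.g. schematically:
#   n_h,det = nbar * softplus(1 + b1 * delta)
```

Unlike a hard cutoff max(x, 0), the softplus keeps the forward model differentiable everywhere, which matters for the HMC sampler's gradient computations.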

D MCMC burn-in phase and chain thinning
This appendix describes in detail how we verify that our MCMC chains have indeed approached the target typical sets and how we select MCMC samples in our analysis.
All Markov chains considered in this paper were initialized in an over-dispersed state, presumably far from the stationary region of the posterior. We refer to this initial phase as the warm-up phase. For each chain, we ensure that samples included in our analysis do not come from this phase by carefully tracing the posterior power spectrum and the bias parameters. In figure 11, we show two such examples. Both trace plots confirm that the chain has reached a stationary distribution after around 2000 samples.
To achieve a reasonable effective sample size while avoiding wasted disk storage, we only store and analyze one in every 20 MCMC samples.
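The selection procedure amounts to dropping the warm-up samples and thinning the remainder. A minimal sketch with the numbers quoted in the text (2000 warm-up samples, 1-in-20 thinning); select_samples is a hypothetical helper, and the stationarity check itself is done on trace plots (cf. figure 11).

```python
import numpy as np

def select_samples(chain, warmup=2000, thin=20):
    """Drop the first `warmup` samples, then keep one in every
    `thin` of the remainder."""
    return np.asarray(chain)[warmup::thin]
```

For a chain of 3000 stored states this leaves 50 approximately independent samples for the analysis.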