Forward Modeling of Galaxy Populations for Cosmological Redshift Distribution Inference

We present a forward-modeling framework for estimating galaxy redshift distributions from photometric surveys. Our forward model is composed of: a detailed population model describing the intrinsic distribution of the physical characteristics of galaxies, encoding galaxy evolution physics; a stellar population synthesis model connecting the physical properties of galaxies to their photometry; a data model characterizing the observation and calibration processes for a given survey; and explicit treatment of selection cuts, both into the main analysis sample and for the subsequent sorting into tomographic redshift bins. This approach has the appeal that it does not rely on spectroscopic calibration data, provides explicit control over modeling assumptions and builds a direct bridge between photo-z inference and galaxy evolution physics. In addition to redshift distributions, forward modeling provides a framework for drawing robust inferences about the statistical properties of the galaxy population more generally. We demonstrate the utility of forward modeling by estimating the redshift distributions for the Galaxy And Mass Assembly (GAMA) survey and the Vimos VLT Deep Survey (VVDS), validating against their spectroscopic redshifts. Our baseline model is able to predict tomographic redshift distributions for GAMA and VVDS with respective biases of Δz ≲ 0.003 and Δz ≃ 0.01 on the mean redshift—comfortably accurate enough for Stage III cosmological surveys—without any hyperparameter tuning (i.e., prior to doing any fitting to those data). We anticipate that with additional hyperparameter fitting and modeling improvements, forward modeling will provide a path to accurate redshift distribution inference for Stage IV surveys.


INTRODUCTION
Accurate inference of the redshift distributions of ensembles of galaxies from their photometry is of central importance for deriving cosmological constraints from weak lensing surveys.Ongoing and upcoming surveys such as the Dark Energy Survey (Dark Energy Survey Collaboration: Fermilab & Flaugher 2005), the Kilo-Degree Survey (De Jong et al. 2015), Hyper Suprime-Cam (Aihara et al. 2018), the Vera C. Rubin Observatory's Legacy Survey of Space and Time (Abell et al. 2009) and Euclid (Laureijs et al. 2011) will map large volumes of the Universe, measuring the angular positions, images, and photometry for billions of galaxies.The unprecedented statistical power of these surveys makes them increasingly sensitive to systematic biases, with systematics on the redshift distributions in particular expected to be the single largest contributor to the total systematic error budget (see e.g., Hildebrandt et al. 2017).In order to reach the full potential of upcoming Stage IV surveys, the characterization of the redshift distributions (e.g., their measured means and variances) will need to improve by roughly an order of magnitude compared to the current state-of-the-art (Newman & Gruen 2022).
Cross-correlation and direct calibration both rely on spectroscopic redshift samples to compare against the photometric data in order to calibrate their redshift distributions.These approaches have the appeal that they leverage reliably measured redshifts to calibrate the photo-z distributions, and are relatively insensitive to modeling assumptions about the photometric data.On the other hand, they are limited by the lack of available spectroscopic redshifts at the depths probed by ongoing and upcoming surveys, and are vulnerable to biases due to spectroscopic selection effects that are not well-represented by re-weighting in the (broad-band) colors (Buchs et al. 2019;Hartley et al. 2020), and uncertainties in galaxy-bias modeling in the case of crosscorrelation methods (Gatti et al. 2018).
Template-based methods instead rely on building a statistical model for the photometric observations in order to constrain the redshifts of individual galaxies, and the redshift distributions of ensembles, from their photometry alone.Template approaches assert that galaxies belong to one of a finite set of "types", where each type has an associated (rest-frame) spectral template that defines its colors as a function of redshift.These templates can then be compared to the observed colors to give redshift (and type) likelihoods for each galaxy, which can then be combined in a hierarchical model to infer the redshift distributions of ensembles of galaxies (Leistedt et al. 2016(Leistedt et al. , 2019)).However, template methods are limited by the incapability of template-sets (and priors over galaxy types) to characterize the galaxy population at the necessary level of realism.Finite template sets tend to be overly restrictive, while methods that allow for linear combinations of templates result in large swathes of prior volume populated by unphysical galaxy spectra.Further, by relying on a model for the photometric data, template-based models need to treat selection effects explicitly in order to draw robust inferences at the population level (e.g., redshift distributions): this has so far not been done.
Physically, the redshift distributions of selected samples of galaxies arise from a sequence of three main processes: The statistical properties of the galaxy population (i.e., the intrinsic distribution of physical characteristics and redshifts) defines an intrinsic distribution for galaxy colors, from which galaxies in the Universe are sampled.The colors (photometry) of those galaxies in some patch of the sky then get observed by a survey, resulting in a catalog of noisy (measured) photometry.Selection cuts are then applied to those measurements to ensure a clean and high-quality galaxy sample, and to sort the galaxies into tomographic redshift bins.The redshift distributions of interest, then, are those of the galaxies that make it past the selection cuts and into a given tomographic bin.Therefore, if one is able to accurately characterize the galaxy population, observational processes, and selection effects, then one can predict the redshift distributions of interest.
In this paper we develop a forward modeling framework for estimating redshift distributions by explicitly modeling the processes that give rise to them.We construct a population model describing the joint distribution of physical characteristics (e.g., stellar, dust and gas content) of galaxies, encoding galaxy evolution physics in the relationships between galaxies physical properties.We use a stellar population synthesis (SPS) model to connect those physical parameters to the rest-frame spectra and hence photometry for each galaxy.The observation process is then characterized by a data-model that captures measurement noise, heterogeneous observing conditions and strategy, and photometric calibration.Finally, we have a selection model specifying the selection cuts.This parameterized forward model then forms the basis for Bayesian inference of cosmological redshift distributions, either by hierarchical inference, or simulation-based inference (SBI).
This forward modeling approach can be thought of as resolving the current limitations of template-based methods by replacing template-sets with a continuous SPS model, explicitly treating selection effects, and inferring population-and data-model parameters in a selfconsistent fashion.In particular, in replacing finite template-sets by a continuous model we are able to better capture the diversity of real galaxy spectra whilst having full control over the prior describing the statistical properties of the galaxy population.The use of SPS models for analyzing large samples of galaxies has only recently become feasible, thanks to fast neural emulators (e.g., speculator, Alsing et al. 2020).The use of physically motivated priors (e.g., Tanaka 2015 andRa-machandra et al. 2021) and continuous physical models for galaxy spectra (Ramachandra et al. 2021) has already led to promising improvements in photometric redshift inference for individual galaxies.
The structure of this paper is as follows.In §2 we describe the framework for forward modeling photometric surveys, and estimating redshift distributions in a forward modeling context.In §3 we describe our SPS model, galaxy population model, and data model assumptions.In §4 and 5 we show the ability of our baseline forward model to recover the tomographic redshift distributions for GAMA and VVDS respectively.We outline a roadmap for future forward modeling efforts and conclude in §7.
In a companion paper (Leistedt et al. 2022) we validate the forward modeling framework for inferring individual galaxy redshifts, including hierarchical calibration of hyper-parameters.

FORWARD MODELING OF GALAXY SURVEYS FOR REDSHIFT INFERENCE
In this section we describe our generative modeling framework for photometric surveys.We frame the forward model as a pipeline for simulating mock catalogs, which can then be compared to the observed catalog in a simulation-based inference setting, used to estimate the implied tomographic redshift distributions for a given set of modeling and hyper-parameter choices, or can be used as a basis for hierarchical inference via MCMC sampling.
Notation is summarized in §2.1 and Table 1.We describe the generative model in §2.2, and discuss how to perform inference of the model parameters in §2.4.

Notation
Each galaxy is described by a set of SPS parameters ϕ, which describe the stellar, gas and dust contents of the galaxy.The rest-frame spectrum l(λ) ≡ l(λ; ϕ) is connected to the parameters ϕ via a stellar population synthesis model, which computes the composite spectrum from the stars in the population given their initial mass function, ages and metallicities from the star formation and metallicity histories, plus modifications due to dust, and nebular emission (see Conroy 2013 for a review).
Combined with redshifts z, the SPS parameters predict the model photometry for each galaxy, i.e., the fluxes {f b } in band-passes {W b (λ)}, defined by: where d L (z) the luminosity distance for redshift z and τ (z, λ) is the optical depth of the inter-galactic medium.
We denote the vector of measured fluxes for each galaxy by d, where the photometric data-model specifies the sampling distribution P (d|ϕ, z, σ, η) of measured fluxes given the model fluxes and measurement uncertainties σ.The data model is parameterized by nuisance parameters η, which characterize the properties of the noise distribution, calibration (e.g., zero-point) parameters, and modeling error terms.
The intrinsic distribution of galaxy characteristics (SPS parameters and redshift) is described by a population model P (ϕ, z|ψ), with hyper-parameters ψ.This describes the joint distribution of galaxy characteristics for the background galaxy population, in the absence of any selection effects.
We assume that selection cuts for the main galaxy sample are made on observed photometry, with selection probability P (S|d, σ) equal to one or zero for selection or rejection., i.e., selection is deterministic given the photometric data vector.
Subsequent sorting of galaxies into tomographic bins introduces an additional selection S (k) (for the k-th bin) based on the measured galaxy colors (fluxes).Typically, tomographic binning will be based on some estimator for the redshift, which will be a deterministic function of the measured fluxes and their uncertainties: where ẑ(d, σ) is an estimator for the redshift, and z and the lower and upper limits for the k-th tomographic bin.
The measurement uncertainties on the photometry will be depend on the observing conditions (e.g., seeing), and strategy (e.g., exposure time), which will typically vary to some extent across the survey.The uncertainties will also scale with flux, owing to the Poisson photon count contribution to the errors, and have additional intrinsic scatter due to the varying difficulty in extracting fluxes from galaxy images with different morphologies.We denote the distribution of photometric uncertainties across the survey (as a function of flux) by the uncertainty model P (σ|f (ϕ, z), O), where O denotes the model assumptions (and parameters) characterizing that distribution.

Generative model
The forward model proceeds as follows: 1. Draw SPS parameters ϕ and redshifts z from the population model P (ϕ, z|ψ); The result is a catalog of noisy photometry for selected galaxies, with associated tomographic bin labels.
Repeating the above process until one obtains the same number N of selected objects as in the observed catalog provides a draw from the assumed generative model for the data conditioned on N selected objects1 .This can hence be used as a generative model for inferring the hyper-parameters via simulation-based inference, or as the basis for Bayesian hierarchical inference, as described below in §2.4.
Repeating this process in the limit of N → ∞ selected samples, and examining the redshift distributions of the galaxies that make it into each bin, provides the tomographic redshift distributions implied for a given set of modeling and hyper-parameter assumptions.Note that the target redshift distributions are given by a (typically intractable) integral over the population, data, and selection models.The tomographic redshift distribution n k (z) for galaxies passing selection both into the analysis sample, and subsequently into the k-th tomographic bin, is given by: n k (z) ≡ P (z|S (k) , ψ, η, O) = P (ϕ, z|ψ, S (k) , σ, η) × P (σ|S (k) , ψ, η, O)dϕ dσ = P (ϕ, z|ψ)P (S (k) |ϕ, z, σ, η) P (S (k) |ψ, σ, η) where S (k) denotes selection into both the analysis sample and k-th tomographic bin.Forward simulating from the generative model described above provides a way of estimating the target redshift distributions for a given set of hyper-parameters, without the need for direct integration.

Emulation of SPS models
Any applications of this forward model -either simulation of large mocks, or MCMC sampling the associated posterior -will require a vast number of SPS model calls.This is only made tractable by neural emulation of SPS models (Alsing et al. 2020), which speeds-up SPS computations by a factor of 10 4 compared to FSPS (Conroy & Gunn 2010).

Inference
Inference of redshift distributions under the generative model described in §2.2 requires inferring the population-and data-model parameters ψ and η under the forward model assumptions, which in turn provide marginal posteriors for the tomographic redshift distributions via Equation (3)2 .
In this paper we are focused on validating a baseline forward model for predicting tomographic redshift distributions, without performing inference (or optimization) of the forward model parameters.Nonetheless, it is useful to consider how inference works within our forward modeling framework, and highlight the advantages of performing redshift distribution inference in this fashion.
The joint posterior for the generative model described in §2.2 is given by (see appendix A for a derivation): where the selection term in the denominator is given by: P (S|ψ, σ, η) = P (S|d, σ)P (d|ϕ, z, σ, η) × P (ϕ, z|ψ) dd dϕ dz. (5) This joint posterior can then be sampled (using MCMC) to jointly infer the population-and data-model parameters, and SPS parameters and redshifts for each galaxy.
The marginal posterior over the hyper-and data-model parameters then provides a posterior over the target tomographic redshift distributions, via Equation (3).
Phrasing the redshift distribution inference task as a hierarchical model in this way has a number of advantages.Firstly, note that so far this model contains only photometry: the method does not explicitly require spectroscopic calibration data.Where spec-zs (or other spectroscopically-derived constraints on SPS parameters) are available for some sub-sets of the galaxies, they can be straightforwardly included by simply appending additional (sharply-peaked) likelihoods for those galaxies.Importantly, the fact that any external spec-z calibration data are not representative of the main sample is unimportant in this approach, provided selection cuts are only performed with respect to the main survey data.
Similarly, inclusion of additional data (e.g., additional bands) from external surveys for subsets of galaxies is also straightforwardly achieved by simply appending additional likelihood-terms for those galaxies.Again, the fact that auxiliary data is only available for biased (unrepresentative) sub-sets of the galaxies is not important, provided selection is not performed with respect to those auxiliary data.
Note that the population model P (ϕ, z|ψ) that appears in Equation ( 4) is a "global" quantity: it describes the statistical properties of the background galaxy population (without selection effects), and is therefore the same for all tomographic bins and across all surveys, regardless of their differing selection functions.This opens up strong synergies with the galaxy evolution community who are concerned with constraining (various aspects of) P (ϕ, z|ψ): as our understanding of the statistical properties of the galaxy population improves, this can be fed directly into improved photometric redshift inferences via improved priors on the hyper-parameters ψ.
Finally, since population-and data-model parameters are inferred from the photometric data in a selfconsistent fashion, uncertainties in the population and data model parameters will be fully propagated through to the final n(z) inferences.
However, MCMC sampling the joint posterior in Equation ( 4) requires computing the selection integral in Equation ( 5) for every galaxy in the sample, in every likelihood evaluation.This presents a severe computational bottleneck for sampling-based methods.In practise, for sampling to be computationally tractable the selection integral will require replacement with a fast emulator (e.g., Talbot & Thrane 2022).
Alternatively, simulation-based inference (SBI, aka likelihood-free inference) provides a framework for performing Bayesian inference under complex forward models using only simulations, bypassing the need to compute the likelihood (and selection integral) entirely (e.g., Alsing et al. 2018Alsing et al. , 2019;;Jeffrey & Wandelt 2020).For a recent application of SBI to a population model with selection effects, see Gerardi et al. (2021).

BASELINE FORWARD MODEL
In this paper we are focused on demonstrating the ability of a baseline forward model to recover redshift distributions, without performing additional inference or optimization of population-level parameters.We lay out the baseline forward model assumptions for the SPS model in §3.1, the galaxy population model in §3.2, and the data model in §3.3.

Stellar population synthesis (SPS) model
We assume an SPS model with nine free parameters, summarized in Table 2 and described below.
Star formation histories (star formation rates as a function of time) are parameterized by a double power law: where the transition time is defined as t * ≡ τ t univ (z) for lookback-time t univ (z), and the SFH is normalized such that it integrates to give the total stellar mass tuniv(z) 0 Ṁ dt = M .We define star formation rate (SFR) as the average of Ṁ over the past 100Myr.
The gas-phase metallicity log 10 Z gas is a free parameter, and the gas ionization parameter u is set so that it tracks the star-formation rate, assuming the Kaasinen et al. (2018) relation between gas-ionization and SFR.
The metallicity history of the stellar population assumed to build up with stellar mass production, such that present-day stellar and gas-phase metallicities are identical: where Z min is the minimum metallicity covered by the stellar templates.
Dust attenuation is modelled with two components describing birth cloud (stars younger than ten million years) and diffuse dust screens respectively, following Charlot & Fall (2000) (see Leja et al. 2017 for details).The birth cloud (τ 1 ) and diffuse (τ 2 ) attenuation, and the power law index n of the Calzetti et al. (2000) attenuation curve for the diffuse component, are all free model parameters.
We emulate the photometry (apparent magnitudes) in all relevant bands using speculator (Alsing et al. 2020).The apparent magnitude in each band as a function of the SPS parameters and redshift is parameterized by a dense neural network with four hidden layers of 128 units each, and activation functions described in (Alsing et al. 2020).Each emulator is trained on 6.4 × 10 6 training samples, with SPS parameters and redshifts drawn from the population model (see §3.2 below) and model photometry computed using FSPS (Conroy & Gunn 2010).Training is performed following the prescription in (Alsing et al. 2020), and we ensure that the 99.9% intervals for the emulator error distributions are better than 2% in all bands.

Galaxy population model
Specifying a population model for the SPS parameters and redshift amounts to specifying the joint (prior) distribution that characterizes the statistical properties of the galaxy population.Galaxy formation and evolution physics results in complex relationships between the stellar population parameters.In an effort to capture as much of this phenomenology as possible, we factorize the population prior into the following (generic) structure: Factorized this way, the population model is decomposed into a number of well-studied relations between galaxy characteristics: P (mass, redshift) is given by the redshift-evolving mass-function, P (metallicity|SFR, mass) characterizes the fundamental metallicity relation, P (SFR|mass, redshift) characterizes the star-forming sequence, P (dust|SFR, mass, metallicity) specifies the relationship between dust and the star-formation and chemical enrichment histories, and P (age|SFR, mass) the empirical relationship between ages and starformation histories.
For the SPS model set-up chosen for this study (summarized in Table 2), the specific population model assumptions and default parameters are taken as follows.

Mass function
The joint distribution of mass and redshift is defined by: where Φ(M, z) is the un-normalized mass function, and dV (z) the differential comoving volume element.For the mass-function, we assume a mixture of two Schechter functions, with default parameter values and redshift evolution taken from Leja et al. (2020).We assume a Planck 2015 (Ade et al. 2016) cosmology for the comoving volume element.The assumed mass-function and redshift prior is show in Figure 1.

Fundamental metallicity relation (FMR)
Galaxies undergo continuous chemical evolution, as heavier elements are produced in stars and expelled into the interstellar medium, and gas flows regulate the metal content by either dilution or expulsion of enriched gas out of the galactic potential.On global scales, this results in a tight relationship between the gas-phase metallicity and star formation history of a galaxy (e.g., mass and SFR) -the so-called fundamental metallicity relation (Yates et al. 2012;Andrews & Martini 2013;Nakajima & Ouchi 2014;Yabe et al. 2015;Salim et al. 2014Salim et al. , 2015;;Kashino et al. 2016;Cresci et al. 2019;Curti et al. 2020).Qualitatively, galaxies on the FMR tend towards lower metallicities for higher star formation rates, and higher metallicities for higher masses, but the overall shape is a non-linear function of both mass and SFR.The FMR is typically considered to be independent of redshift, with galaxies moving along the relation as they evolve (and preferentially occupying different regions of the relation at different redshifts), but the relation itself being constant over cosmic history3 .
We take the FMR parameterization and default parameter values from Curti et al. (2020), where the FMR was measured over the broad stellar-mass and SFR ranges covered by SDSS.The median gas-phase metallicity as a function of mass and SFR is parameterized as: where with default parameters ∆Z 0 = 0.09, γ = 0.3, m 0 = 10.1, m 1 = 0.56, and β = 2.1.We assume a Student's-t distribution4 (with two degrees-of-freedom) for P (log 10 Z gas |M, SFR), where the mean is given by Equation (10), and the FWHM is equal to 0.05.The FMR is shown in Figure 2. Measurement of the FMR is sensitive to the way in which SFRs and metallicities are estimated (see e.g., Figure 2. Fundamental metallicity relation prior on gasphase metallicity conditioned on mass and specific star formation rate (in units of Gyr −1 ).Solid lines represent the mean, and the bands show the FWHM of the assumed student-t distribution.Telford et al. 2016 andCresci et al. 2019), with the SFRs and metallicities used to calibrate the FMR from spectra and those arising in a given SPS model being subtly different proxies for those quantities.Therefore, when fitting the FMR to photometric data as part of the population model, it would be prudent to set reasonably broad priors on the FMR parameters in order to capture any biases relative to measurements based on spectra.
While the majority of galaxies are expected to live on the FMR (since the same physics drives chemical enrichment for most galaxies), those processes will be disrupted in merger events.Merged galaxies are therefore not expected to follow the same FMR relation (Bustamante et al. 2020).This can be compensated for by adding heavy tails to the FMR (as we do here with the student-t distribution), or deriving a separate model for the metallicities of merged galaxies.

Star forming sequence
The star-forming sequence (SFS) characterizes the (redshift-evolving) relationship between star formation rate and mass, with the vast majority of galaxies forming most of their mass either on or passing through the star-forming sequence (Leitner 2012;Abramson et al. 2015).Qualitatively, the SFS is characterized by starforming and quiescent galaxies with ongoing and negligible star formation rates respectively.The clustering of galaxies into those two populations with different characteristic SFRs results in a highly non-Gaussian -sometimes bimodal -distribution of SFRs (conditioned on mass and redshift) in the galaxy population as a whole (Daddi et al. 2007;Noeske et al. 2007;Karim et al. 2011;Rodighiero et al. 2011;Whitaker et al. 2012Whitaker et al. , 2014;;Speagle et al. 2014;Renzini & Peng 2015;Schreiber et al. 2015;Tomczak et al. 2016;Leslie et al. 2020;Leja et al. 2021).
In this work, we base our SFS model on the measured relation from Leja et al. (2021), who fit a normalizing flow to learn P (SFR|M, z) from 3D HST data.The galaxy sample used in Leja et al. (2021) is mass-complete down to around 10 9 M over the redshift range relevant for this study (z ≤ 1.5).We note that P (SFR|M, z) is a very smooth function of mass and redshift, and to ensure sensible extrapolation below the mass limit we fit a surrogate model (with good extrapolation properties) to the normalizing flow of Leja et al. (2021) (see Figure 4, and Appendix B for technical details).
Specifying a population model requires that we put a prior on the SFH parameters (in this case, the three parameters of the double-power SFH), but the star-forming sequence provides only a prior constraint on a derived quantity: the star formation rate, SFR(α, β, τ, z).Hence, we need to define our prior over the SFH parameters in such a way that the target prior over the SFR is satisfied.One of the problems with parametric SFH models such as the double power-law is that simple priors on the SFH parameters lead to strong (and undesirable) implied priors on derived quantities such as SFR (e.g., Carnall et al. 2019).For example, taking a baseline uniform prior in (log 10 α, log 10 β, τ ) over reasonable ranges (c.f.Table 2) leads to the undesirable implicit SFR prior P 0 (SFR|z) shown in Figure 3.
To ensure that our SFH priors only encode the SFS, without a spurious additional contribution from any baseline prior assumptions, we define the SFH priors as: where π 0 (α, β, τ ) is the baseline (uniform) prior on the SFH parameters, P (SFR|M, z) is the target SFS prior, and P 0 (SFR|z) is the implicit prior on SFR implied by π 0 .The implicit SFR prior is defined by the surface integral where dS is the surface element in the SFH parameter space (α, β, τ ).Dividing out the implicit SFR prior in this way insures that the overall prior on SFR is specified by the SFS only.To avoid having to compute the surface integrals in Equation ( 13) directly, we train a normalizing flow to learn the conditional density P 0 (SFR|z) so that it can be conveniently divided out (technical details are given in Appendix C).

Dust
The amount of dust attenuation, and the shape of the effective attenuation law, is governed by the total amount of dust, the grain composition, the dust-stargas geometry in the galaxy, and its inclination relative to the observer.This can be encoded as a relationship between dust attenuation parameters and star formation histories (SFR and mass), as well as potentially metallicity and redshift (see Salim & Narayanan 2020 for a review).
We take a relatively simple dust prior model, where the dust attenuation in the diffuse component is assumed to scale with SFR according to (following Tanaka 2015): where Θ is the Heaviside step function.We assume a Gaussian prior on the diffuse dust attenuation with the mean given above, and standard deviation of 0.2.The dust attenuation prior is shown in Figure 5.The index of the dust attenuation law (for the diffuse component) is assumed to vary as a function of the total dust attenuation, with mean given by: where δ is the (negative) offset from the index of the Calzetti attenuation curve (Calzetti et al. 2000).We take a Gaussian prior on δ, with mean given above and standard deviation 0.4.For the birth cloud component, we take a Gaussian prior on the ratio τ 1 /τ 2 of the birth cloud and diffuse dust components with mean equal to 1 and standard deviation 0.3.This is consistent with previous findings that the dust optical depth in nebular emission lines is roughly twice that of the stellar component (Calzetti et al. 1994;Price et al. 2014;Reddy et al. 2015).
We leave more sophisticated dust prior modeling, e.g., where dust attenuation properties are conditioned on SFR, mass, metallicity and redshift (e.g., Nagaraj et al. 2022), to future work.

Age
The double power-law SFH parameterization and prior implicitly links age and SFR, with older (younger) galaxies having lower (higher) star-formation rates, qualitatively in line with expectations.However, the assumed priors on the double-power-law SFH parameters allow for a tail down to very low ages; we therefore impose a lower cut of one Gyr on the galaxy ages5 to   eliminate spuriously young galaxies.We do not impose any additional priors on age.

Data model
The data-model characterizes the sampling distribution of the measured photometry, given the true (model) fluxes, encoding both calibration (i.e., zero points), modeling errors, and measurement noise.We treat the sampling distribution of observed fluxes as a student-t distribution with two degrees-of-freedom, where the total uncertainty Σ is given by: The data-model parameters η = (α zp , β) characterize the zero points α zp , and error floors β (encoding modeling errors, emulator errors, and an effective noise floor on the measurement uncertainties).
In the application to GAMA and VVDS data in §4-5, we fix the zero points to the values published by the respective survey collaborations, and assume a default value of 0.03 for the (fractional flux) error floors in all bands.
We note that this data model is readily extendable to include additional error terms for emission line modeling errors, SED modeling-error as a function of rest-frame wavelength, and parameters describing the shape (e.g., skewness or tailweight) of the data sampling distribution.A more sophisticated error-model (including hierarchical calibration of hyper-parameters) is investigated in our companion paper Leistedt et al. (2022).

Uncertainty model
The uncertainty model describes the distribution of photometric measurement uncertainties over the survey, which will vary from galaxy to galaxy due to heterogeneous observing conditions and strategy, varying difficulty in extracting photometry from galaxies images with different morphologies and geometries, and will also scale with the (true) flux owing to the Poisson photon count contribution to the overall measurement error.
We take a data-driven approach to uncertainty modeling, where we learn the distribution P (σ|f , O) directly from the data.This process proceeds in two steps.First, we fit each of the galaxies under the SPS model, population prior and data model assumptions described above, by MCMC sampling their individual posteriors (c.f.Equation ( 4) with fixed hyper-parameters).For each galaxy, this provides a maximum a posteriori (MAP) estimate for their true fluxes f .We then take the catalog of MAP estimated fluxes and associated uncertainties {f , σ} 1:N , and train a Mixture Density Network (MDN) to learn P (σ|f , O).The MDN parameterizes the uncertainty distribution as a Gaussian mixture: where the component weights, means and variances are functions of flux, parameterized by a dense neural network (whose weights and biases are denoted by w).The MDN is trained by minimizing the total (negative) loglikelihood of the data under the model with respect to the network weights6 : Throughout this paper we take a default MDN with 12 components, and a single hidden layer with 256 units and leaky-ReLU activation functions.Examples of trained MDNs for the uncertainty distributions for GAMA and VVDS are shown in Figures 6 and 9.

CASE STUDY I: GAMA
The Galaxy And Mass Assembly (GAMA) survey covers 250 square degrees, and has obtained ∼ 230, 000 spectroscopic redshifts over the past decade (Driver et al. 2011).The survey was designed to have simple target selection based on photometry alone (discussed below), making it an ideal dataset for validating our forward model.We take the GAMA data release 4 (Driver et al. 2022), with photometry in KiDS ugri and VIKING ZY HJKs bands.
The main selection is performed on the KiDS r-band, with spectroscopic redshifts measured for all galaxies with r < 19.65.An additional color cut of (J − Ks) > 0.025 is made for star-galaxy separation.With these cuts, we are left with a sample of 206, 454 galaxies with 9-band photometry (ugriZY HJKs), and measured spectroscopic redshifts (for validation).

Data model
We assume student-t uncertainties on the fluxes as described in §3.3, and take the extinction and zero point corrections provided with GAMA DR4 (Driver et al. 2022).
We train an MDN to model the distribution of measurement uncertainties as a function of flux, trained on the GAMA data, as described in §3.4.The uncertainty distributions and corresponding trained models are shown side-by-side in Figure 6.

Tomographic binning
For the purpose of tomographic binning, we train a simple (dense) neural network estimator for the redshift given the measured KiDS and VIKING photometry.We take a dense network with four hidden layers with 64 units each and leaky-ReLU activations, passing the measured magnitudes and magnitude errors in the 9-bands as inputs and the estimated redshift as output.The network is trained to minimize the mean square error on the redshift, trained on the GAMA photometry and redshifts.Training is performed with Adam using a batch size of 1024, a training:validation split of 90:10, and triggering early-stopping when the validation loss has ceased to improve after 30 epochs.The resulting redshift estimator has an overall accuracy of around σ z 0.06.
The GAMA galaxies are binned into two tomographic bins based on their estimated redshift, 0 < ẑ < 0.2 and ẑ > 0.2 for the two bins respectively.

Results
Forward model predictions are obtained by generating a large mock catalog following the prescription in §2.2.We MCMC sample the population model (imposing a prior limit of r < 20.65), draw uncertainties and add noise according to the uncertainty and data models described above, and apply selection cuts r < 19.65 and (J − Ks) > 0.025 to the simulated noisy photometry.Tomographic bin labels are assigned based on the redshift estimator described above.We continue sampling until 5 • 10 5 selected samples are obtained.
The tomographic redshift distributions predicted by the forward model are shown alongside the spec-z histograms in Figure 7, and the corresponding bias on the mean of the redshift distribution in shown in Figure 8.The forward model is able to predict the redshift distributions with a bias of around 0.003 and 0.001 on the mean, for the two respective tomographic bins.This is comfortably accurate enough for ongoing Stage III surveys (e.g., Asgari et al. 2021), where the statistical error on the mean redshift per tomographic bin is O(0.01), and cosmological parameter constraints should be insensitive to biases of 0.04 (Hildebrandt et al. 2016).The model predictions are very close to the accuracy requirements for Stage IV surveys, where the bias on the mean should not exceed7 ∆z < 0.002(1 + z) (Mandelbaum et al. 2018).We re-iterate that these model predictions are obtained by assuming (fixed) default parameters for the population model described in §3.2, fitting the uncertainty distribution to the GAMA data to characterize the distribution of photometric errors (as a function of flux), and assuming the zero-point calibration provided by the GAMA collaboration (Driver et al. 2022).We focus on VVDS-Wide, which covers 8.7 square degrees and obtained spectra for 25 805 galaxies down to I < 22.5.
We take photometry in the BV RI bands for VVDS-Wide, which were obtained with the CFH12K camera at CFHT.The main selection is performed in the I-band, with spectra obtained for objects with 17.5 < I < 22.5.The imaging survey is sufficiently deep (limiting magnitude of I = 24.8) to ensure 100% completeness down to I = 22.5 for the spectroscopic sample.Star-galaxy separation was performed on the spectra, so no other photometric cuts were performed.We selected only galaxies with redshift quality flags of 3 or 4 (> 95% probability to obtain a correct redshift according to Le Fèvre et al. 2013).The relevant information for modeling any correlations between the assessed reliability and redshift in detail is not publicly available for this catalog, and hence we make no attempt to model this implicit selection effect.However, Fig. 13 in Le Fèvre et al. ( 2013) for the VVDS-Deep sample in the same magnitude range suggests that the spectroscopic success rate for this subsample is expected to be roughly uniform up to z 1 and drop thereafter.This only impacts ∼ 1 − 2% of the sample, in a regime where photometric selection is expected to strongly dominate in any case.Hence we do not expect a significant impact on our results from the redshift dependence of the spectroscopic success rate.

Data model
We again assume student-t uncertainties on the fluxes as described in §3.3, and take the extinction and zero point corrections provided by the VVDS team (Le Fèvre et al. 2013).
We train a MDN to model the distribution of measurement uncertainties as a function of flux, trained on the VVDS data, as described in §3.4.Unlike GAMA, the VVDS BV RI photometry is patchy with roughly a quarter of objects missing measurements in at least one band.We set the fractional uncertainties for missing values to 1000, and constrain one component in the mixture model to be a delta function at 1000, where the relative weight of that component in the MDN then encodes the relative probability of having a missing value, as a function of flux.The uncertainty distributions (for non-missing values) and corresponding trained models are shown side-by-side in Figure 9.Note the multimodal structure in the uncertainty distributions, owing 0.005 0.000 0.005 bias on mean, z probability, P( z) bin 1: 0 < z < 0.2 0.005 0.000 0.005 bias on mean, z bin 2: 0.2 < z < 0.8 to varying depth photometry in all but the I-band.This makes for a more challenging test case for the forward modeling framework.

Tomographic binning
For the purpose of tomographic binning, we train a simple (dense) neural network estimator for the redshift given the measured BV RI photometry.We again take a dense network with four hidden layers with 64 units each and leaky-ReLU activations, passing the measured magnitudes and magnitude errors in the 4-bands as inputs and the estimated redshift as output.The network is trained on the VVDS photometry and spec-zs as described in §4.2.The resulting redshift estimator has an overall accuracy of around σ z 0.2.Note that the redshift estimator is considerably less accurate in this case comapred to GAMA, owing to the poorer constraining power of the BV RI bands, the prevalence of missing values in the VVDS photometry, and the smaller training set.

Results
As in §4.3, model predictions are obtained by MCMC sampling the population model (imposing a prior limit of I < 23.5), drawing uncertainties and adding noise according to the uncertainty and data models described above, and applying selection cuts 17.5 < I < 22.5 to the simulated noisy I-band magnitudes.Tomographic bin labels are assigned based on the redshift estimator described above.Sampling is continued until 5 • 10 5 selected samples are obtained.
The tomographic redshift distributions predicted by the forward model are shown alongside the spec-z histograms for VVDS in Figure 10, and the corresponding bias on the mean of the redshift distribution in shown in Figure 11.We note that our data model for VVDS has residual uncertainties that may be responsible for some of the redshift bias.We have assumed standard Johnson BV RI filters, which are approximately correct but there are some differences in detail (Le Fèvre et al. 2013).Zero-point calibration for VVDS was also reported to be challenging, with large (and sometimes differential) zero points required in some bands to achieve consistency between photometry and spectra (Vincent LeBrun, pri-vate communication).We have not included calibration uncertainties in our results.

DISCUSSION
While our baseline model is able to accurately recover the tomographic redshift distributions for GAMA and VVDS, a number of improvements are possible.
The baseline SPS model assumes a simple double power-law star-formation history parameterization.Such a simple SFH parameterization is not expected to capture the full diversity of real star formation histories, and can lead to overly restrictive correlations between important derived quantities (such as SFR and age), which might not be representative of real galaxies.These limitations can be alleviated by non-parametric (binned) star formation history models (e.g., Leja et al. 2019), or more physical SFH parameterizations (e.g., Alarcon et al. 2022).
Regarding the population model, the largest modeling uncertainties are expected to come from the dust attenuation prior, where our baseline model assumed that dust attenuation scales with SFR only.In reality, dust characteristics are expected to be related to the detailed star formation and metallicity enrichment histories, motivating a more sophisticated dust prior model (e.g., Nagaraj et al. 2022).
The data-model also has a number of simplifying assumptions.SPS modeling errors are expected to vary as a function of rest-frame wavelength, with emission- lines in particular being subject to potentially significant modeling biases (Leistedt et al. 2022).In the context of inferring SPS parameters for individual galaxies, photometric error floors have often been used to capture both modeling and calibration uncertainties, in order to increase the uncertainties on inferred SPS model parameters and reduce biases.While this strategy is fine for individual galaxies, increasing variances in order to cover potential biases in this way can lead to over-dispersion in population-level parameters.Therefore, data-modeling efforts should instead focus on parameterizing and modeling biases directly, rather than treating them as extra variance terms.The shape of the noise distribution also merits careful investigation, with (for example) the skewness and tail-weights of photometric measurement errors likely varying between bands, and as a function of flux and background noise levels.
Regarding the selection modeling, so far we have considered the scenario where selection is performed with respect to the measured photometry alone.However, for weak lensing surveys some additional selection cuts will typically be made on the images, such as image quality cuts to ensure reliable shear measurements, image-based star-galaxy separation, deblending, surface-brightness cuts, etc.Because galaxy image characteristics correlate with SPS parameters and redshift, image-based cuts will induce additional selection effects that could modify the resulting redshift distributions.
In Appendix A we show that the effect of image-based selection cuts can be addressed by replacing the population model with an effective population prior describing the statistical properties of the galaxy population that passes the image cuts (conditioned on the characteristics of the survey, etc.).Hence, image-based selection can be incorporated into the forward modeling framework by parameterizing the effect of those image cuts on the population prior over SPS parameters and redshift, and inferring those additional hyper-parameters alongside the other population-and data-model parameters.Alternatively, if it can be demonstrated (or orchestrated) that the photometric cuts are sufficiently stronger than any image based cuts, such that the image cuts have a negligible impact on the analysis sample, then those unmodelled selection effects can be safely ignored.
In addition to modeling improvements, inference (or optimization) of population-and data-model parameters from the photometric data should lead to additional improvements in accuracy.For robust inferences, data-model parameters should be self-consistently calibrated using the photometric data themselves, with photometric redshifts expected to be particularly sensitive to zero-point calibrations (but with all data-model parameters playing a role).Regarding the population model, we note that different aspects of the model are better constrained than others by external data.The dust prior and fundamental metallicity relations in particular are expected to be the least well understood and constrained, meriting broader priors on their parameters.The star-forming sequence is somewhat better constrained (e.g., Leja et al. 2021), while the mass function is relatively tightly constrained (e.g., Leja et al. 2020).

CONCLUSIONS
We have presented a forward modeling framework for photometric surveys, which is capable of accurately predicting the tomographic redshift distributions required for cosmological analyses.Scaling this forward modeling approach to large surveys is made possible by neural emulation of SPS models (Alsing et al. 2020).
Forward modeling has a number of advantages over existing methods for estimating cosmological redshift distributions.In contrast to direct-calibration methods, forward modeling does not require external spectroscopic data: it is therefore not hampered by the (lack of) availability of spectroscopic redshifts at the depth required for photometric surveys, and is not vulnerable to biases arising from spectroscopic selection effects that cannot be well-described by re-weighting in broad-band colour-space.In contrast to cross-correlation based estimators, it is not sensitive to galaxy-bias modeling assumptions.Our forward modeling framework also resolves a number of limitations of existing template-based methods, by replacing template-sets with a continuous physical model for galaxy spectra (with associated physical priors), carefully treating selection effects, and enabling self-consistent inference of model parameters describing the galaxy population and data-model.
By explicitly modeling the processes that give rise to the target redshift distributions, forward modeling allows for fine-control over the relevant modeling assumptions.In particular, it creates synergies between galaxy evolution physics and photometric redshift inference: as our constraints on the statistical properties of the galaxy population improve, those lead directly to improved priors on the population model parameters, and hence improved photometric redshift inferences.
We have demonstrated the utility of our forward modeling framework by accurately recovering the redshift distributions for the GAMA and VVDS surveys, validating against their spectroscopic redshifts.The model is able to predict the tomographic redshifts for those two surveys, with biases of ∆z 0.003 for GAMA and ∆z 0.01 for VVDS respectively, without performing inference or optimization of the model parameters describing the galaxy population and photometric calibration.This accuracy is sufficient for ongoing Stage III surveys, and approaching the accuracy requirements of Stage IV surveys.We anticipate that with additional modeling improvements, and optimization of model hyper-parameters, forward modeling can provide a path to accurate cosmological redshift distribution inference for Stage IV surveys.
In a companion paper (Leistedt et al. 2022) As described in 3.2, we want to put a prior on the SFH parameters such that the resulting prior on the SFR is given by our desired assumptions about the star-forming sequence (SFS).To re-cap from the main text, we construct a prior over SFH parameters as: P (α, β, τ |M, z) = π 0 (α, β, τ ) P (SFR|M, z) P 0 (SFR|z) (C23) where π 0 (α, β, τ ) is the baseline (uniform) prior on the SFH parameters, P (SFR|M, z) is the target SFS prior, and P 0 (SFR|z) is the implicit prior on SFR implied by π 0 .The implicit SFR prior is defined by the surface integral P 0 (SFR|z) = SFR=const.
π 0 (α, β, τ )dS, (C24) where dS is the surface element in the SFH parameter space (α, β, τ ).In order to circumvent having to compute those surface integrals directly every time we need to evaluate the prior density, we train a normalizing flow to learn P 0 (SFR|z) as follows.We construct a training set by drawing SFH parameters from the baseline prior π 0 (α, β, τ ), which we take as log-uniform in α and β, and uniform in τ , over the prior ranges given in Table 2, and redshifts drawn uniformly between 0 and 2. For each baseline prior (and redshift) sample, we compute the specific SFR for those SFH and redshift parameters, defined as the fractional mass formed in the last 100Myr.This provides a training set of {sSFR, α, β, τ, z}, on which we can train a conditional density estimator to learn P 0 (sSFR|z).We train a neural spline flow (Durkan et al. 2019) to learn P 0 (log 10 sSFR|z), with 16 spline knots spaced between −15 and −6 (in log 10 sSFR), a single hidden layer with 16 units and leaky-ReLU activation functions, and a base Gaussian density with location −11 and scale 1. Training is performed using Adam with a learning rate of 1e − 2 for 1000 epochs, with no batching on a training set of 10 6 samples (generated as described above).The implicit prior P 0 (log 10 sSFR|z) and the trained normalizing flow is shown in Figure 12, and use of the trained normalizing flow to divide out the implicit SFR prior is shown in Figure 3.

Figure 3 .
Figure3.The prior on the star formation rate implied by taking uniform priors over the double power-law SFH parameters (log 10 α, log 10 β, τ ) over the ranges specified in Table2is shown in blue, while the target prior (specified by the SFS measurement ofLeja et al. (2021)) is shown in orange.The SFR prior obtained by re-calibrating the baseline prior with a normalizing flow (as described in 3.2.3) is shown by the black dotted line, giving excellent agreement with the target SFS prior.

Figure 4 .
Figure 4. Star forming sequence prior on star formation rate (in units of M yr −1 ) conditioned on mass and redshift, based on the measurement from Leja et al. (2021) (see Appendix B for details).The vertical lines represent the mass-complete limit for the 3D HST data on which the Leja et al. (2021) fit was performed; we extrapolate the SFS below the mass limit as shown in the Figure.

Figure 5 .
Figure 5. Prior on the diffuse dust attenuation conditioned on SFR.Solid line shows the mean, and the bands show the 1-and 2-σ contours.

Figure 6 .
Figure 6.Magnitude uncertainties versus magnitudes for the GAMA data (left panels) versus the trained MDN model for the error distribution conditioned on flux (right panels).Note that the magnitudes in both the left and right panels are maximum a posteriori (MAP) magnitudes, from an initial fit of the SPS model to the GAMA galaxies as described in §3.4.

5.
CASE STUDY II: VVDS Similar to GAMA, the VIMOS VLT Deep Survey (VVDS; Le Fèvre et al. 2013) is a spectroscopic survey designed to have simple photometric target selection.

Figure 7 .
Figure 7. Tomographic redshift distributions obtained by the forward model (blue) compared to the histogram of the GAMA spectroscopic redshifts (orange).The model predictions are in excellent agreement with the distributions of the spectroscopic redshifts.

Figure 8 .
Figure 8. Bias on the mean redshift of the model redshift distributions versus the data for GAMA, ∆z = z model − z data .The distributions are obtained by bootstrapping samples from the model n(z), taking a kernel density estimate (KDE) of the bootstrapped sample means, and centering the KDE on the difference between the sample mean of the spec-zs and the mean of the model n(z).

Figure 9 .
Figure 9. Magnitude errors versus magnitudes for the VVDS data (left panels) versus the trained MDN model for the error distribution conditioned on flux (right panels).Note that the magnitudes in both the left and right panels are maximum a posteriori (MAP) magnitudes, from an initial fit of the SPS model to the VVDS galaxies as described in §3.4..

Figure 10 .Figure 11 .
Figure 10.Tomographic redshift distributions obtained by the forward model (blue) compared to the histogram of the VVDS spectroscopic redshifts (orange).The width of the histogram bars indicates ± √ N Poisson noise.The model predictions are in excellent agreement with the distributions of the spectroscopic redshifts.

Figure 12 .
Figure12.Samples from the prior on the star formation rate implied by taking uniform priors over the double power-law SFH parameters (log 10 α, log 10 β, τ ) over the ranges specified in Table2is shown as blue histograms, while the learned normalizing flow model for the implied distribution (as a function of redshift) is shown in orange.

Table 1 .
Notation for all model parameters included in the forward model.

Table 2 .
SPS model parameters and their prior ranges.
C. SETTING PRIORS ON DERIVED QUANTITIES USING NORMALIZING FLOWS