PopSED: Population-Level Inference for Galaxy Properties from Broadband Photometry with Neural Density Estimation

We present PopSED, a framework for the population-level inference of galaxy properties from photometric data. Unlike the traditional approach of first analyzing individual galaxies and then combining the results to determine the physical properties of the entire galaxy population, we directly make the population distribution the inference objective. We train normalizing flows to approximate the population distribution by minimizing the Wasserstein distance between the synthetic photometry of the galaxy population and the observed data. We validate our method using mock observations and apply it to galaxies from the GAMA survey. PopSED reliably recovers the redshift and stellar mass distribution of $10^{5}$ galaxies using broadband photometry within $<1$ GPU hr, being $10^{5-6}$ times faster than the traditional spectral energy distribution modeling method. From the population posterior, we also recover the star-forming main sequence for GAMA galaxies at $z<0.1$. With the unprecedented number of galaxies in upcoming surveys, our method offers an efficient tool for studying galaxy evolution and deriving redshift distributions for cosmological analyses.


INTRODUCTION
Galaxies are the building blocks of the Universe.The history of their formation and evolution are encoded in their spectral energy distributions (SEDs).Therefore, one of the most important tasks in extragalactic astronomy is to decode the physical properties of galaxies, including the redshift, stellar mass, star formation history (SFH), and chemical enrichment history, from the observed SEDs.The state-of-the-art SED modeling methods (e.g., Noll et al. 2009;Carnall et al. 2018;Johnson et al. 2021) that utilize stellar population synthesis (SPS) models (Conroy 2013) have been an indispensable component in many studies, ranging from individual high-redshift galaxies (Labbé et al. 2023) to large galaxy surveys (e.g., the Sloan Digital Sky Sur-Corresponding author: Jiaxuan Li jiaxuanl@princeton.eduvey, or SDSS, Gunn et al. 2006).Hundreds of millions of galaxies will be characterized with the upcoming observations from the Rubin Observatory Legacy Survey of Space and Time (LSST; Ivezić et al. 2019), Euclid (Racca et al. 2016), and the Roman telescope (Spergel et al. 2015), enabling a huge discovery space for the origin and evolution of galaxies as a population.
SED modeling is typically applied to individual galaxies.The point estimates or the posterior distributions of each galaxy need to be combined to study the galaxy population, such as measuring the stellar mass function (e.g., Wright et al. 2017;Hahn et al. 2023a), the starforming main sequence (SFMS; e.g., Speagle et al. 2014;Thorne et al. 2021), and the stellar mass-metallicity relation (e.g., Tremonti et al. 2004;Curti et al. 2020).However, modeling the SEDs of individual galaxies is very expensive.SED fitting is a high-dimensional problem (typically > 10 dimensions) that requires evaluating the SPS model and sampling the posterior distribution several million times for a single galaxy.A Bayesian SED fitting with traditional SPS models (e.g., FSPS; Conroy et al. 2009) takes ∼ 20 CPU hr per galaxy (Leja et al. 2019b), making it computationally formidable to analyze millions of galaxies, not to mention analyze the entire galaxy population.
This problem has been partially mitigated by recent developments in accelerating SED modeling by emulating the SPS models with neural networks (Alsing et al. 2020), building differentiable SPS models with a highperformance library (Hearin et al. 2023), and speeding up the sampling by using amortized simulation-based inference (SBI; Hahn & Melchior 2022;Khullar et al. 2022;Wang et al. 2023).However, some of these methods need sophisticated training and are costly to retrain when adapting to different SPS models or noise properties.Furthermore, to constrain the galaxy population distribution, one cannot simply combine point estimates, e.g. the Maximum A Posteriori (MAP) estimate, or directly average the individual posteriors.One needs thousands of samples from the posterior of each galaxy to construct the population posterior by running Markov Chain Monte Carlo (MCMC) for a hierarchical model (e.g., Malz & Hogg 2020;Alsing et al. 2022;Hahn et al. 2023a), which poses additional modeling and computational challenges.
In this paper, we introduce PopSED, an efficient and robust method to infer the properties of the galaxy population from broadband photometric data.We directly constrain the population-level distribution of galaxy properties without fitting for individual galaxies and circumvent the need for combining individual posteriors.We use normalizing flows to flexibly model the population distribution, and we train the flows by minimizing the Wasserstein distance between the observed photometric data and the synthetic data generated by the flow.PopSED could reliably recover the populationlevel properties for 10 5 galaxies with ∼ 1 GPU hr.The code is available on GitHub1 The paper is organized as follows.We describe our method in §2 and Figure 1, introduce the data we used in §3, validate our method using mock observations and real data in §4, and discuss the strengths and limitations of this work in §5.We adopt a Chabrier (2003) initial mass function (IMF) and a flat ΛCDM cosmology from Planck Collaboration et al. (2016), with Ω m = 0.307 and H 0 = 67.7 km s −1 Mpc −1 .The photometry used in this work is in the AB system (Oke & Gunn 1983).

METHOD
The goal of this paper is to infer the distribution of the physical properties θ of a large number of galaxies from photometric data {X i }.We approximate the underlying galaxy population distribution p(θ|{X i }) with a flexible neural density estimator, a normalizing flow q ϕ (θ) with parameters ϕ.In order to train the normalizing flow to approximate the population distribution, we compare the synthetic photometric data generated by the flow with the observed data.Specifically, after sampling from the flow θ ϕ j ∼ q ϕ (θ), we predict the corresponding broadband photometry Xϕ j = F (θ ϕ j ) using the SPS model.We then compare the observed photometry {X i } and the synthetic photometry { Xϕ j } by calculating the Wasserstein distance W 2 ({ Xϕ j }, {X i }) between the two distributions.The normalizing flow is trained to minimize W 2 until the synthetic photometry from the normalizing flow agrees with the observed photometry, at which point q ϕ (θ) serves as a MAP estimate to the galaxy population distribution p(θ|{X i }).We train an ensemble of flows {q ϕ (θ)} to further approximate the posterior of the population distribution.
Figure 1 provides a high-level overview of our method.We describe the forward model F used to model the photometry in §2.1 and §2.2, introduce the normalizing flow in §2.3 and Wasserstein distance in §2.4, and present the training strategy in §2.5.

SPS Modeling
An SPS model is needed to translate the physical parameters of a galaxy θ i to its SED X i (see Conroy 2013 for a review).Different SPS models have different choices for modeling the SFH, chemical enrichment history, and dust attenuation.In the following, we discuss each of these aspects in the SPS model we use.
In this work, the SFHs of galaxies are described using the PROVABGS model (Alsing et al. 2020;Hahn et al. 2023b), which is trained on the galaxies in the Illustris hydrodynamical simulation (Genel et al. 2014;Vogelsberger et al. 2014;Nelson et al. 2015).The SFH is modeled by a linear combination of four SFH bases {s SFH i } and one burst component δ(t − t burst ): (1) where β i is the coefficient of each SFH base, t burst is the lookback time when the starburst happens, and f burst is the fraction of the total stellar mass that is formed during the burst.The SFH bases {s SFH < l a t e x i t s h a 1 _ b a s e 6 4 = " e 0 C t 6 s 5 6 t d + v D + p y 0 L l j Z z B 7 6 U 9 b 3 L 3 7 S p Z Y = < / l a t e x i t > ✓ j ⇠ q (✓)

Normalizing flow
Parametrized by neural net ϕ Backprop Train 0 1 2 3 < l a t e x i t s h a 1 _ b a s e 6 4 = " q k P o 9 S j r y z < l a t e x i t s h a 1 _ b a s e 6 4 = " i + 2 c 3 f 5 o J < l a t e x i t s h a 1 _ b a s e 6 4 = " e Q Q 2 g W u E T R x f / G u 3 5 y C q 1 q v w z k 0 A L V E Z 3 q I K q i K I X 9 I b e 0 Y f 1 a n 1 a X 9 Z w 2 p q x Z j P 7 6 A + s 0 T d o p 6 c Q < / l a t e x i t > X j = F (✓ j ) Figure 1.A schematic diagram of PopSED (details in §2).The galaxy population distribution p(θ|{Xi}) is approximated by a normalizing flow q ϕ (θ).We sample from the normalizing flow and forward model the synthetic photometry { Xϕ j } using the galaxy SED emulator F (θ ϕ j ).Then we compare the distributions of the observed photometry and the synthetic photometry by calculating the Wasserstein distance W2({ Xϕ j }, {Xi}), which is used as a loss to train the normalizing flow until the synthetic photometry from the normalizing flow agrees with the observed photometry.
chocki & Phan 2009).As shown in Hahn et al. (2023b), the four bases are sufficient to capture the SFH of Illustris galaxies.Among the SFH bases, s SFH 1 corresponds to the most recent star formation, whereas s SFH 4 corresponds to the oldest star formation (see Fig. 5 and Appendix A in Hahn et al. 2023b).Such SFH bases are non-negative by construction and are physically intuitive to interpret.We require i β i = 1, therefore only three of the {β i } are independent2 .We define the total formed stellar mass to be the integration of SFH: M ⋆,formed = tage t=0 SFH(t, t age ).Unless otherwise noted, in this work we refer to the stellar mass M ⋆ as the total formed stellar mass M ⋆,formed3 .In total, the SFH of a galaxy is described by 7 parameters, namely β 1 , β 2 , β 3 , β 4 , f burst , t burst , and M ⋆ .Another important piece in the SPS model is the chemical enrichment history, which is usually simplified to be the metallicity history (ZH) by assuming a fixed element abundance ratio for all metals.In the original PROVABGS, the ZH is described as a linear combination of two ZH bases extracted from Illustris.Because photometric data alone are quite uninformative in inferring the ZH (e.g., Hahn et al. 2023b), we simplify the ZH to be a constant stellar metallicity log(Z ⋆ /Z ⊙ ) over time, as many other works assume (e.g., Carnall et al. 2019;Leja et al. 2019a).
In order to synthesize the stellar populations, we discretize the lookback time t into time bins.The first time bin corresponds to log (t/yr) < 6.05, and each time bin has a width of 0.1 dex until the lookback time reaches the age of the galaxy t age , which is determined by its redshift z.We treat the stellar population in each time bin as a simple stellar population (SSP) and we evaluate its SFR according to the SFH.In the end, we add the Li et al.
spectra of the SSPs in each time bin together weighted by the stellar mass formed in each bin.We assume a Chabrier (2003) IMF and use the MIST isochrones (Choi et al. 2016;Dotter 2016) and the empirical spectra library MILES (Sánchez-Blázquez et al. 2006) for 3800-7100 Å and the BaSeL library (Lejeune et al. 1997(Lejeune et al. , 1998;;Westera et al. 2002) for wavelengths outside of the range of MILES.We use FSPS (Conroy et al. 2009;Conroy & Gunn 2010;Johnson et al. 2021) to generate SSP spectra.We do not add nebular emissions to the spectra because broadband photometry reflects more about the continuum of the galaxy spectrum.
We add dust attenuation to the galaxy spectra following the recipe in Charlot & Fall (2000), which includes the birth-cloud attenuation and the diffuse dust screening.The birth-cloud attenuation only acts for stars younger than 10 7 yr (Conroy et al. 2009), whereas the diffuse dust component affects all stars.We refer interested readers to Hahn et al. (2023b) for details.There are three parameters in our dust model: τ 1 is the birthcloud optical depth at 5500 Å, τ 2 is the diffuse dust optical depth at 5500 Å, and n dust is the slope of the Calzetti et al. (2000) attenuation curve.
To summarize, our SPS model contains 12 parameters, as listed in Table 1.We refer the interested readers to Hahn et al. (2023b) for a more complete description of PROVABGS.We emphasize that our method for population-level inference is not limited to specific SPS models.One can use a different SPS model to perform the population-level inference on galaxy populations.If the SPS model is relatively slow to evaluate, one needs to train a neural emulator to accelerate the evaluation and bring differentiability to the inference, as we introduce below.

Galaxy Spectrum Emulator
Although we do not use MCMC to construct the population distribution in this work, we also need a fast and differentiable SPS model, such that the information contained in the photometric data (i.e., the Wasserstein distance; see §2.4) could efficiently flow back to the parameter space.Therefore, we train an emulator for the SPS model described in §2.1 following the approach in Alsing et al. (2020) and Hahn et al. (2023b).After training, the emulator takes the physical parameters θ i of a galaxy (listed in Table 1) and predicts its synthetic photometry Xi = F (θ i ) in SDSS ugriz bands with realistic noise added.We choose SDSS bands just to better match with the data used in §3.One can certainly in-clude more filters if needed.The details of training the spectrum emulator are presented in Appendix A.
Noise must be added to the noiseless SEDs in order to meaningfully forward model the observed data.For simplicity, we assume the noise in each filter is independent and construct the noise model in each band k.To be specific, we add Gaussian noise to the noiseless fluxes The noise level σ k should depend on the flux f k because, given a survey depth, fainter sources have a lower signalto-noise ratio (SNR k = f k /σ k ).We assume that the SNR for a given flux f k follows a Gaussian distribution: . For a given data set, we empirically estimate the median SNR (µ SNR ) and the standard deviation (σ SNR ) as a function of f k in each band k by evaluating them in magnitude bins and interpolating over the bins.During the forward modeling, we sample the distribution If the resulting flux fk < 0, we repeat the above procedure until fk ⩾ 0. Unlike in Hahn & Melchior (2022), who model p(σ k |f k ), we find that modeling p(SNR k |f k ) is more robust in the low-SNR regime.In the end, we convert noisy fluxes fi to magnitudes Xi .Our emulator is ∼ 10 3−4 times faster than the direct SPS computation.

Neural Density Estimator
We try to approximate the population distribution p(θ|{X i }) using density estimators.Although traditional density estimation techniques, such as Gaussian mixture models (e.g., Bovy et al. 2011), are easy to optimize and interpret, they are less flexible and scalable to larger data sets in higher dimensions than the neural density estimation (NDE) techniques.In this work, we use "normalizing flows" (e.g., Tabak & Vanden-Eijnden 2010;Tabak & Turner 2013;Kobyzev et al. 2019) to approximate the population distribution of galaxy properties.Being a neural density estimator, normalizing flows have been successfully applied to many fields in astronomy (e.g., Alsing et al. 2019;Zhang et al. 2021;Ciuca & Ting 2022;Dai & Seljak 2022;Hahn & Melchior 2022;Green et al. 2023) to approximate density distributions and generate new data.
A normalizing flow maps a complex distribution q ϕ (θ) to a simple base distribution π(z) using an invertible bijective transformation g : z → θ, which is described by a neural network with parameters ϕ.The base distribution is often chosen to be easy to evaluate and sample, such that we can evaluate the target distribution follow- Flat Dirichlet prior with 0 The lookback time when the star formation burst happens (Eq. 1) Uniform (10 −2 , 13.27) The fraction of total stellar mass formed in the star formation burst (Eq. 1) Uniform (0, 1) log(Z⋆/Z⊙) Stellar metallicity (Z⊙ = 0.019) Uniform (−2.6, 0.3) n dust The power-law index of the Calzetti et al. (2000) attenuation curve Uniform (−3.0, 1.0) τ1 Birth-cloud dust optical depth Uniform (0, 3.0) τ2 Diffuse dust optical depth Uniform (0, 3.0) From many different flow models (e.g., Papamakarios et al. 2017), we use the Neural Spline Flow (NSF; Durkan et al. 2019), where the base function is a multivariate Gaussian distribution and the transformations are described by monotonic rational quadratic splines.
The flexibility of the NSF model makes it well suited to modeling the population distribution.We use the NSF implementation in the sbi4 package (Greenberg et al. 2019;Tejero-Cantero et al. 2020).Our NSF has 20 blocks, 60 bins for the splines, and 100 latent features in each coupling layer.The original NSF model is initialized such that q ϕ (θ) is a standard multivariate Gaussian distribution.However, such an initialization poses significant challenges for training the flow.First, nonphysical parameters (e.g., negative redshift and stellar mass) or parameters that lie outside the prior of the emulator (Table 1) will be drawn from this initial distribution.These nonphysical values are either meaningless or incompatible with our forward model.Second, the standard Gaussian initialization imposes a quite strong prior, which could bias the population distribution.
To mitigate these issues, we opt for uniform distributions over the standard Gaussian to initialize the normalizing flow.We append an additional layer to the NSF flow that performs cumulative distribution function (CDF) transformation.The CDF transformation converts a standard Gaussian distribution into a uniform distribution within a user-specified range.We set the ranges of the uniform distributions following the priors listed in Table 1.In particular, for the results shown in §4, we narrow the redshift range to 0 < z < 0.8 and the stellar mass range to 7.5 < log(M ⋆ /M ⊙ ) < 13.0 to remove irrelevant parameter space and make the inference more efficient.After applying this CDF transformation, the initial normalizing flow describes a uniform distribution in all dimensions and ensures that all parameters drawn from the initial normalizing flow are physical.The uniform distribution serves as the prior of our inference for the population distribution.

Wasserstein Distance
The job now is to train the normalizing flow q ϕ (θ) to best approximate the population distribution.We use optimization to find a flow that is most probable to produce the photometric data that is consistent with the observations.This is equivalent to finding the MAP estimate to the posterior of the population distribution.In order to do so, we sample from the normalizing flow θ ϕ j ∼ q ϕ (θ) and generate corresponding synthetic photometry using the forward model Xϕ j = F (θ j ).The distance (or divergence) between the distribution of the observed data {X i } and the distribution of the synthetic data { Xϕ j } can be a proxy for the dissimilarity between the normalizing flow and the underlying population distribution.By minimizing this distance metric, we train the normalizing flow q ϕ (θ) so that it accurately approximates the underlying population distribution.
Choosing an appropriate distance metric for probability distributions is critical.However, traditional distance metrics are challenging to apply to highdimensional and discrete data sets.Because we try to compare two discrete samples with different sizes ({ Xϕ j } and {X i }), the commonly used Kullback-Leibler (KL) divergence cannot be employed without modeling the two distributions separately.
On the other hand, the divergences based on the optimal transport theory, such as the Wasserstein distance (also known as the earth mover's distance), are shown to be well behaved when the KL divergence is not applicable.The Wasserstein distance quantifies the minimal "total cost" required to move a distribution (which can be viewed as a volume of soil) to the target distribution (the other volume of soil).This distance metric allows one to compare discrete distributions whose supports do not overlap and to quantify the spatial shift between them.Importantly, the Wasserstein distance is differentiable by construction, symmetric under exchange of the two distributions, and satisfies the triangle inequality, making it well suited for comparing two discrete photometric distributions.Wasserstein distance has been successfully applied to various topics in statistics, including variational inference (Ambrogioni et al. 2018), approximate Bayesian computation (Bernton et al. 2019), and, most famously, Generative Adversarial Networks (Arjovsky et al. 2017).It is also used in astronomy to characterize galaxy images (Holzschuh et al. 2022) and the topology of large-scale structures (Tsizh et al. 2023).We refer interested readers to Peyré & Cuturi (2018) and Feydy et al. (2019) for more details on the Wasserstein distance and its application.
Practically, the optimal transport solution is approximated using the Sinkhorn algorithm (Sinkhorn 1964;Cuturi 2013), which provides an efficient and scalable approximation to the Wasserstein distance.The Sinkhorn algorithm tries to minimize the earth moving cost together with an entropic regularization term with a coefficient ε (also denoted as "temperature").Such a regularization makes the optimal transport problem solvable using iterations that only involve linear algebra and can run in parallel on GPUs with differentiablility.Physically speaking, the particles in the base distribution are mapped to a fuzzy collection of the target particles whose diameters are proportional to a blurring scale σ = ε 1/p (Feydy et al. 2019, where p is the index of the L p norm used to calculate the cost).Thus, the Sinkhorn distance converges to the Wasserstein distance as σ → 0. A larger σ will produce a fuzzier match between the two distributions but a faster convergence.
In this work, we use the Wasserstein-2 distance (W 2 , where the L 2 norm is used in the cost function) to characterize the distance between the observed photometry {X i } and the synthetic photometry { Xϕ j }.W 2 is calculated using the implementation of the Sinkhorn iteration in the Python package GeomLoss5 (Ramdas et al. 2017;Feydy et al. 2019;Charlier et al. 2021).In our case, the two photometric data sets are often not of the same size.GeomLoss could nicely handle this unbalanced optimal transport problem.The Wasserstein distance W 2 ({ Xϕ j }, {X i }) is then used as a loss to train the normalizing flow q ϕ (θ).

Training
During each training iteration, synthetic photometry is generated by sampling from the normalizing flow, and the Wasserstein distance W 2 ({ Xϕ j }, {X i }) is computed.The gradient of this distance metric is then backpropagated to train the normalizing flow.Because we initialize the normalizing flow to be a uniform distribution, the distribution of the synthetic photometry is quite far from the observed one in the first few steps.Therefore, we calculate the Wasserstein distance using a relatively large blurring scale σ at the beginning of the training to capture the global structure of the population distribution.Then we take an annealing strategy to gradually reduce σ such that the focus of the normalizing flow shifts from matching the mean values to matching increasingly subtle details as the training proceeds (Chui & Rangarajan 2000).Compared with using a small σ for all iterations, training using annealing σ is faster to converge.In practice, we initialize σ to be 0.3 (σ corresponds to the blur parameter in GeomLoss) and decrease σ by 0.05 every 60 steps until reaching σ = 0.05, after which we fix σ = 0.002 for the remaining iterations.We find that the specific annealing schedule does not significantly affect the outcomes.
The realistic noise added to the synthetic photometry Xϕ j can make the training much harder at the beginning, when the normalizing flow has not yet learned the global landscape of the population distribution.Therefore, we take an "anti-annealing" strategy, where the noise (described by an effective SNR) is added gradually as the training goes on.In this way, the normalizing flow will be guided by the bulk of the data in the beginning, without paying much attention to the noise.The flow is then exposed to more realistic noises later on and adjusts its shape to match the details in the observed data.To do so, we reduce the noise level σ k in the forward model (see §2.2) by a factor of R, which follows an exponential decay with time: R(t) = 1+R 0 •exp(−τ •t/T ), where τ is the decay rate and T is the total epochs of training.For the case studies in §4, we choose to use R 0 = 30, τ = 12, and T = 800.We determine these values by trial and error based on the mock test (see §4.1).
A trained normalizing flow is one MAP solution to the population distribution.It is also critical to understand the posterior of the population distribution, i.e., the variation of the MAP solutions.The deep ensemble method is a popular approach to characterizing the Bayesian posterior by training the same deep-learning model multiple times with different initializations and averaging the resulting models.The deep ensemble method is proved to better approximate the Bayesian posterior than methods such as variation inference (e.g., Wilson & Izmailov 2020).We take this ensemble learning approach and train a number of normalizing flows with different random seeds, then combine these models by drawing the same number of samples from them and aggregating the samples.The ensemble of flows would roughly approximate the posterior of the population distribution.
PopSED is implemented in PyTorch (Paszke et al. 2019) and trained with the stochastic gradient descent optimizer Adam (Kingma & Ba 2014).We use a one-cycle learning-rate policy (Smith & Topin 2017) with an initial learning rate of 3 × 10 −5 , a maximum learning rate of 3 × 10 −4 , and a minimum learning rate of 3 × 10 −6 .The learning rate increases from the initial learning rate to the maximum learning rate, and then slowly drops to a minimum learning rate.We train each normalizing flow for 800 epochs.The training loss and the validation loss are comparable, indicating that the flows do not overfit.

DATA
To test the performance of PopSED on inferring the galaxy properties and redshifts, we use the photometric data used in the Galaxy And Mass Assembly (GAMA) survey (Driver et al. 2011).The GAMA survey is a spectroscopic survey targeting galaxies selected from photometric surveys down to r < 19.8 mag.It is therefore an ideal data set to test how well we can recover the redshift distribution by comparing our result with the spectroscopic redshifts.We use the aperture-matched photometry (Driver et al. 2016) from GAMA Data Release 3 (DR3; 6 Baldry et al. 2018) in this work.The photometry in the ugriz bands comes from SDSS DR7 (Abazajian et al. 2009).We remove objects lacking the AUTO photometry (i.e., Kron photometry) and objects 6 http://www.gama-survey.org/dr3/with no flux uncertainty.We also apply a color cut (J − K s ) > 0.025 using photometry from the VIKING survey (Edge et al. 2013) for star-galaxy separation and an additional signal-to-noise ratio cut SNR > 1 for all SDSS ugriz bands to remove marginally detected sources with poor photometric quality.There are 83,692 objects in our GAMA sample.Their spectroscopic redshift distribution peaks around z = 0.15 and extends from z = 0 to z ∼ 0.60.In this work, we only use the AUTO photometry in the SDSS ugriz bands to infer the population distribution.We also build the noise models for each SDSS band as described in §2.2.

RESULTS
With the PopSED framework presented above, we first test how well it works by applying it to mock observations ( §4.1), where the ground truth is known.Then we apply our method to the GAMA data and compare the inferred properties with spectroscopic results and literature in §4.2.

Mock Observations
We construct a mock observation to test our method and understand its strengths and limitations.We first design a parameter distribution p(θ) mock that roughly imitates a real galaxy population in the Universe, then sample from it and generate realistic synthetic photometry as the mock data set.Acknowledging the fact that the redshift and stellar mass should be correlated due to the depth limit of the survey, we use the GAMA DR3 spectroscopic results as a basis and resample the joint distribution of stellar masses and spectroscopic redshifts from GAMA DR3.The distributions of other parameters are listed in Table 2.For simplicity, parameters other than log(M ⋆ /M ⊙ ) and z are assumed to be independent.Subsequently, we generate mock observations X i = F (θ i ) in the SDSS ugriz bands for θ i ∼ p(θ) mock .Realistic noise is added according to the noise model of GAMA DR3 (see §3).In total, our mock galaxy sample comprises 100,863 galaxies with SNR > 1 across all SDSS bands.
To better visualize the mock galaxy population, we calculate several characteristic quantities of galaxies, including the average SFR in the past 0.1 Gyr (log SFR 0.1Gyr ) and the mass-weighted age (t age,MW ), from the parameter distribution p(θ) mock .The gray contours in Figure 2 represent the mock galaxy population by showing the joint distributions of stellar mass, redshift, SFR, metallicity, and age.The corresponding distributions of photometry data are also displayed as gray contours in the right panel of Figure 2. We plot the color-magnitude and color-color distributions for better visibility, although we emphasize that PopSED is   agrees with the ground truth (gray contours) remarkably well in all dimensions.Moreover, the ensemble of flows encompasses the true distribution, suggesting that PopSED is able to approximate the posterior of the population distribution using the ensemble method.However, the stellar metallicity and mass-weighted age have relatively large scatters among different flows, but we understand this as the photometric data not being informative enough to constrain the distributions of these two parameters.

GAMA Sample
Motivated by the success of the mock test in §4.1, we apply PopSED to 83,692 galaxies in the GAMA sample (see §3).We run 30 independent normalizing flows and combine their samples together.The left panel in Figure 3 shows the inferred population distribution among the stellar mass, redshift, average SFR within the past 0.1 Gyr, stellar metallicity, and the mass-weighted age.Similar to Figure 2, the light red histograms correspond to individual flows.For completeness, we also show the full population distribution in Figure 7 in Appendix B. Thanks to GAMA spectroscopy, we are able to compare our inferred redshift distribution with the spectroscopic redshifts.The gray contours in the left panel of Figure 3 show distributions of the spectroscopic redshifts and the stellar masses from the GAMA DR3 catalog.Our redshift estimates align well with the spectroscopic results, despite the photometric data being intrinsically less informative of redshift.
In terms of stellar mass, we compare our stellar mass distribution to the distribution of the total formed stellar mass (logmintsfh) in the GAMA catalog.We simply combine the individual stellar masses in the GAMA catalog without considering the reported uncertainties.We find that our stellar mass distribution is on average 0.35 dex higher than that of GAMA.We note that the stellar mass in GAMA is derived for individual galaxies following the SPS model in Taylor et al. (2011), where a τ -model for SFH and the stellar evolution models in Bruzual & Charlot (2003) are used.Given these differences in SPS models between GAMA and our method, it is not surprising to see an offset between GAMA's stellar mass and ours.To account for this systematic effect, we add a constant 0.35 dex offset to all GAMA stellar masses and then compare with our results in Figures 3  and 7. We find that the GAMA stellar mass distribution is consistent with ours quite well, but is slightly skewed toward the lower-mass end.As with the mock test, constraints on metallicity and age are relatively less stringent.Additionally, the right panel in Figure 3 shows the distributions of the GAMA data and the synthetic photometric data from our inferred galaxy population.While a minor difference is observed in the u − g color, the two distributions show excellent agreement in other bands.
Leveraging the inferred population distribution, we can take slices and marginalize over irrelevant parameters to study correlations among key physical parameters.To showcase the capabilities of PopSED on such population-level analysis, we focus on constraining the SFMS for galaxies at z < 0.1.With the samples drawn from the inferred population distribution in hand, we select samples with z < 0.1 and calculate the average SFR within the past 0.1 Gyr and the surviving stellar mass using the fitting function in Hearin et al. (2023) 7 .Here we choose the surviving stellar mass rather than the total formed mass to better compare with literature results.The distribution of the inferred galaxy population on the M ⋆ -SFR plane is shown as the gray hexagons in Figure 4, colored on a logarithmic scale.Figure 4 clearly reveals the main sequence of star-forming galaxies as well as a continuous transition from star-forming to quiescent at 10 < log M ⋆ /M ⊙ < 11.5.
To contextualize our findings, we further compare the derived galaxy distribution on the M ⋆ -SFR plane with literature results.Renzini & Peng (2015, orange line in Figure 4) derived a linear SFMS using the SDSS galaxies at 0.02 < z < 0.085, whose SFRs are estimated using Hα following Brinchmann et al. (2004).Sánchez et al. (2019, blue line) studied the SFMS for galaxies in the MaNGA survey (Bundy et al. 2015) using the average SFR within the past 0.1 Gyr from SED fitting.Our results are qualitatively consistent with these results at 9.5 < log M ⋆ /M ⊙ < 11.5.At the lower-mass end, Mc-Gaugh et al. (2017, purple line) explored the SFMS of low-surface-brightness galaxies at D < 100 Mpc using the Hα-based SFRs.Our results agree well with Mc-Gaugh et al. (2017) at log M ⋆ /M ⊙ < 9.5 and even capture the trend of the SFMS slope flattening with increasing stellar mass.
We note that the positive correlation between M ⋆ and SFR is also seen in the mock test (Figure 2), where the SFH is designed to be independent of M ⋆ .However, the quiescent galaxy population and the shallower slope at low mass are nontrivial to capture and showcase the potential of PopSED for probing the dependence of SFH on M ⋆ .Nonetheless, it should be noted that proper weighting is needed to account for observational completeness when deriving the SFMS from the population distribution.We defer such analysis to future studies.

Advantage of Population-level Inference using
PopSED Population-level analyses of galaxies provide important insights into key questions of astrophysics.However, the traditional ways of SED modeling on an individual galaxy basis are very expensive.Using traditional SED fitting methods, analyzing 10 5 galaxies will take up to 2 × 10 6 CPU hr.Even with the development of accelerated SED fitting (e.g., Alsing et al. 2020;Hearin et al. 2023;Hahn & Melchior 2022;Khullar et al. 2022;Wang et al. 2023), an analysis of 10 5 galaxies will still take up to ∼ 10 3 GPU hr.PopSED is able to recover the posterior of the population distribution for ∼ 10 5 galaxies within ∼ 10 GPU hr, 100 times faster than the SBI-based methods.In contrast to PopSED, it is challenging to change the SPS and noise models in SBI-based methods because of the cost of generating training data and retraining the SBI model.
Moreover, it is nontrivial to combine the posteriors of individual galaxies from SED fitting to construct a population-level distribution in a statistically rigorous way.Individual posteriors should not be multiplied directly because the prior distribution will be multiplied many times and dominate the resulting posterior.Hierarchical Bayesian models are often used to tackle this problem, and such population-level inference has been successfully applied to the studies of galaxy redshifts (Leistedt et al. 2016;Malz & Hogg 2020;Alsing et al. 2022) and gravitational-wave sources (e.g., Wong et al. 2020).Taking the formulation in Malz & Hogg (2020), the population posterior is often described by simple statistical models (e.g., Gaussian mixture models) parameterized by hyperparameters φ.The posterior of the population distribution can be written as Evaluating this posterior requires the calculation of N integrals, where the integrals are evaluated using Monte Carlo samples from individual posteriors p(θ i |X i ).For each galaxy, one needs to save thousands of samples from its posterior to compute the integral.It is very expensive to store all the individual posteriors and computationally heavy to run MCMC chains to construct the population distribution in this fashion.It is also difficult to ensure that individual posteriors are accurate enough such that the combined population posterior is not biased when combining a large number of them.
PopSED bypasses the problem of deriving and combining individual posteriors, because we directly find the optimal solution for the population distribution by minimizing the distance between the observed data and the synthetic data.PopSED is much more efficient and robust at inferring the underlying population distribution.The ensemble of flows also provides an effective estimate of the posterior of the population distribution, shown as the light histograms in Figure 2 and 3.

Potential applications of PopSED
Many science cases would benefit from having the population distribution in hand.As we demonstrate in §4.2, the population distribution can be marginalized to learn about the key relationships, such as the SFMS and stellar mass function.When applied to different galaxy samples that are selected differently (e.g., by colors), PopSED will be able to tell the underlying physical difference between the two galaxy populations.The PopSED framework can also be enhanced by integrating known spectroscopic redshifts, allowing it to concentrate primarily on inferring the physical properties rather than the redshift.
Accurately determining the redshift distribution of galaxies is crucial for extracting cosmological parameters from weak-lensing surveys (e.g., Mandelbaum 2018;Newman & Gruen 2022;Dalal et al. 2023).In this work, we demonstrate that PopSED is capable of recovering the redshift distribution of GAMA galaxies very well.Such population-level analyses of galaxies (e.g., Alsing et al. 2022) could greatly help the upcoming weak-lensing surveys including LSST (Ivezić et al. 2019), Euclid (Racca et al. 2016), andRoman (Spergel et al. 2015) The PopSED framework can also be employed to generate synthetic observations in specified filters with different noise levels.These simulated observations can serve as a good reference for designing new surveys and assessing survey completeness (e.g., Luo et al. 2023).The target selection for dedicated studies (such as spectroscopic surveys) can now be done by selecting regions in the population distribution, and then translating them into the photometry (color) space.The framework also enables the exploration of outliers by sampling from low-probability regions in the population distribution (e.g., Liang et al. 2023) and subsequently identifying objects in the photometric space that resemble these outliers.
Furthermore, the idea of population-level inference presented in this paper could be generalized to understand the populations behind galaxy spectra (e.g., DESI: Hahn et al. 2023c;PFS: Greene et al. 2022), quasar spectra (e.g., Sun et al. 2023), and stellar spectra (e.g., Gaia XP spectra: Zhang et al. 2023).

Limitations and Future Work
Runtime.As the number of galaxies and bandpasses in photometric surveys increases, the bottleneck of running PopSED will be the cost of computing the Wasserstein distance.In this work, we use the Sinkhorn iteration to approximate the Wasserstein distance.However, the time complexity of the Sinkhorn algorithm is O(M N ), where M and N are the number of samples in the two discrete distributions (Altschuler et al. 2017).Thus, the cost of PopSED will grow faster with the number of galaxies than traditional methods.One possible solution is to use the "sliced Wasserstein distance" (e.g., Bonneel et al. 2015;Kolouri et al. 2018).Instead of solving optimal transport in high dimensions, one can slice the high-dimensional probability distributions into a number of one-dimensional distributions and calculate their Wasserstein distances.Therefore, the sliced Wasserstein distance is much faster to compute.We can also accelerate PopSED by calculating the Wasserstein distance using small batches, which also increases the stochasticity when training the normalizing flows.
Accuracy.The SPS model in this work is emulated using a neural network, which is trained using synthetic spectra by sampling the parameter space.We notice that generating training samples and training such an emulator can be costly.The emulator might also perform poorly near the boundary of the prior used to train it.Furthermore, our SPS model does not include emis-sion lines, making it inapplicable to narrowband photometric data.Hearin et al. (2023) presented DSPS, a fast and differentiable SPS model implemented in jax with no emulation.DSPS does not need training and can be directly used in PopSED once it is more developed and tested, as an alternative to the SPS emulator.
Robustness.Because of the low constraining power of broadband photometry on the physical parameters, we take the annealing strategies ( §2.5) to make sure the normalizing flow smoothly converges from the uniform prior to the solution.A more informative prior distribution would likely make the training more stable.Such a prior might be learned from running PopSED on a small subset of the data.We also notice that the average SNR in the u band is typically the lowest, thus it is specifically challenging for PopSED to extract information from the u-band data.A better treatment for the noise in the low SNR regime (e.g., Lupton et al. 1999) in the forward model might help mitigate the inference bias in the u − g color seen in Figure 3.
Selection effects.The survey completeness and selection effects are not encoded in PopSED.One solution is to include a differentiable selection function in the forward model (e.g., Alsing et al. 2022).Nevertheless, if one has a well-defined survey completeness as a function of photometry, one can sample from the population distribution, calculate the corresponding completeness for each sample, and then reweight the samples to construct a new population distribution.We defer such studies to future works.

SUMMARY
In this work, we propose PopSED, a novel framework to efficiently infer the distribution of physical parameters for an entire galaxy population using broadband photometric data.We focus on the populationlevel analysis of galaxies, rather than inferring physical properties for individual objects.We successfully applied PopSED to GAMA DR3 data and obtained a posterior of the population distribution that is consistent with spectroscopic results.Our main findings and prospects are as follows: • The overall structure of PopSED is highlighted in Figure 1.We approximate the population-level distribution of physical parameters using a normalizing flow ( §2.3), a highly flexible neural density estimator.We then forward model the synthetic photometry of galaxies using the samples generated from the normalizing flow and the emulator for the SPS model ( §2.1 and §2.2).The normalizing flow is trained to minimize the Wasserstein distance ( §2.4) between this synthetic photo-metric data and the observations.After training, the flow is able to approximate the population posterior and imitate the observed photometric data distribution.
• PopSED is able to analyze 10 5 galaxies within 1 GPU hr, being ∼ 10 5 times faster than traditional MCMC methods and ∼ 100 times faster than SBIbased methods.It also circumvents the problem of combining individual posteriors to derive the population distribution.
• We validate PopSED through a carefully designed mock observation ( §4.1).The results prove that PopSED can accurately recover the underlying population posterior, especially for key parameters such as stellar mass, redshift, and SFR (Figure 2).
• We then apply our method to 83,692 galaxies from the GAMA survey ( §4.2).Our inferred population distribution shows a remarkable agreement with the distribution of spectroscopic redshift and stellar mass, despite photometric data being less informative for constraining redshift (Figure 3).We further demonstrate the versatility of PopSED by studying the SFMS for galaxies at z < 0.1.Our results are in good agreement with the literature results (Figure 4).
• PopSED holds promise for application to future photometric surveys to derive the redshift distribution for weak-lensing studies and understand galaxy formation and evolution.The idea of population-level inference can also be applied to analyze stellar spectra and galaxy spectra.
GAMA is a joint European-Australasian project based around a spectroscopic campaign using the Anglo-Australian Telescope.The GAMA input catalog is based on data taken from the Sloan Digital Sky Survey and the UKIRT Infrared Deep Sky Survey.Complementary imaging of the GAMA regions is being obtained by a number of independent survey programs, including GALEX MIS, VST KiDS, VISTA VIKING, WISE, Herschel-ATLAS, GMRT, and ASKAP, providing UV to radio coverage.GAMA is funded by the STFC (UK), the ARC (Australia), the AAO, and the participating institutions.The GAMA website is http://www.gamasurvey.org/.To train the emulator, we generate rest-frame spectra L λ (θ) from 1000 Å to 60,000 Å for galaxies with fixed stellar masses M ⋆ = 1 M ⊙ .We sample the parameter space according to the prior distributions listed in Table 1 to make the training data more representative.Uninformative priors are employed to avoid introducing any bias when training the emulator.The SFH coefficients β i are sampled from a flat Dirichlet prior, which is equivalent to a uniform distribution over the open standard three-dimensional simplex.Following Betancourt (2012), we first sample κ j ∼ Uniform (0, 1), j = 1, 2, 3, then transform κ j to β i .Uniform priors are used for other SPS parameters.When generating the training spectra, the redshift z is only used to calculate the galaxy age t age which marks the endpoint of the SFH.In the end, we generate 3 × 10 6 rest-frame spectra to train the emulator and 10 4 spectra for validation.
We split the full wavelength range into five bins: 1000-2000 Å, 2000-3600 Å, 3600-5500 Å, 5500-7410 Å, 7410-60000 Å, and we train one emulator for each wavelength bin separately.In order to reduce the dimensionality, the rest-frame spectra are first compressed using the principal component analysis (PCA) technique (Alsing et al. 2020).We use N bases = 80, 50, 50, 50, 50 PCA bases for each wavelength bin, and we find these PCA bases are sufficient to recover the training spectra to a high accuracy (1σ error < 0.2%).Then we train a five-layer neural network to predict the PCA coefficients given the SPS parameters θ.We adopt the customized activation function that is used in Alsing et al. (2020): a(x) = γx + (1 − γ) x • sigmoid(βx), where β and γ are trainable parameters.This activation function combines the Swish function (also known as SiLU; Elfwing et al. 2017;Ramachandran et al. 2017) with a linear component and works better than the commonly used ones, such as Sigmoid or ReLU, on emulating galaxy spectra.The neural network is implemented in PyTorch (Paszke et al. 2019) and trained using the Adam optimizer (Kingma & Ba 2014), similar to Alsing et al. (2020).
Finally, we calculate the observed spectrum of a galaxy with a stellar mass M ⋆ at redshift z by redshifting and scaling the predicted rest-frame spectrum L λ (θ), following where d L (z) is the luminosity distance.We further convolve l λ (θ) with the transmission curves R k (λ) of the SDSS filters in Doi et al. (2010) to obtain the noiseless flux f k in the SDSS bands8 : f k (θ) = l λ (θ)R k (λ)dλ, k = {u, g, r, i, z}.(A2) We note that these operations are not embedded in the neural networks, but are rather post-processing.We show the validation accuracy of the spectrum emulator l λ (θ) at z = 0 in Figure 5, where we limit our wavelength range to 1000-12000 Å for brevity.The emulator achieves < 1% accuracy at 3000 Å < λ < 12000 Å and ∼ 2% accuracy at 1000 Å < λ < 3000 Å.The errors  (2010), assuming an airmass of 1.3.The emulator achieves a high accuracy of < 1% for λ > 4000 Å, but suffers at the very blue end.However, the emulated photometry is accurate to < 0.02 mag, which is negligible for analyzing nearby galaxies in large photometric surveys.
on the predicted broadband SEDs for galaxies at z = 0 are shown in the right panel of Figure 5.The emulator is able to predict the broadband photometry to an accuracy of ∼ 0.01 mag.This accuracy is more than enough for the population-level analysis of galaxy SEDs in photometric surveys.Since we are mostly interested in lowto intermediate-redshift galaxies (z < 1) in this work, the emulation error at the very blue end does not affect the inference, since the observation noise almost always dominates the photometry error.However, if one wants to study higher-redshift galaxies, a better emulator at the blue end is required.

B. FULL POPULATION POSTERIORS
Here we present the full population posteriors of galaxies from the mock galaxy population (Figure 6) and the GAMA sample (Figure 7).
s h a 1 _ b a s e 6 4 = " T Z 8 n P U C N s s d I I i I G n r P j g v E S 3 w c = " > A A A C B H i c b V B N S 8 N A E J 3 4 W e t X 1 G M v w S J 4 K o m I e i y I 4 L G C / Y C m h s 1 2 0 6 z d b M L u R i g h B y / + F S 8 e F P H q j / D m v 3 H T 5 q C t D w Y e 7 8 0 w M 8 r / g V e / e x K s 3 r / 4 S N 2 0 O 2 v p g 4 P H e D D P z / J g z B b b 9 Z Z S W l l d W 1 8 r r l Y 3 N r e 0 d c 3 e v r a J E E t o i E Y 9 k 1 8 e K c h b S F j D g t B t L i o X P a c c f X + V + 5 5 5 K x a L w T D I 5 o J H r I q c B C s E S t G p C d Y 3 e t d j / 3 6 M 1 O a R + E D 9 G P W k q Q b c p 9 T A k Z q 5 0 / c g E D q S g K B 5 6 e N w a D 9 9 O j G A c d X + L b o e j J 1 I W B A f u T j d r 5 g l + w J 8 C J x Z q S A Z q i 0 8 y O 3 E 9 F E s h C o I F o 3 H T u G V k o U c C r Y I O c m m s W E 9 k i X N Q 0 N i W S 6 l U 4 e G + A j o 3 S w H y l T I e C J + n s i J V L r v v R M 5 / g D P e + N x X 8 9 T 8 5 t B v + y l f I w T o C F d L r Y T w S G C I 9 T w x 2 u G A X R N 4 R Q x c 3 t m A Z E E Q o m 2 5 w J x Z m P Y J H U T k v O e c m 5 P y u U b 2 b x Z N E B O k R F 5 K

Figure 2 .−Figure 3 .
Figure 2. Left: the mock galaxy population (gray contours) and the inferred galaxy population (blue contours) using our method.We calculate the average SFR within the past 0.1 Gyr (log SFR0.1Gyr) and the mass-weighted age (tage,MW) using the inferred SPS parameters.The lighter blue histograms show the individual normalizing flows, and the dark blue histogram is the result after averaging 10 flows.The inferred galaxy population agrees with the truth very accurately.Right: the mock photometric data in the SDSS ugriz bands (gray contours) are practically indistinguishable from the photometry of the inferred galaxy population using PopSED (blue contours).trainedon magnitudes instead of colors.The distribution of the mock galaxy population in the full parameter space is shown in Figure6in Appendix B.We run PopSED on the mock photometric data of 100,863 galaxies.Each normalizing flow takes ∼ 40 minutes for training using an NVIDIA A100 GPU.We train 10 independent normalizing flows with different random seeds and aggregate the samples drawn from

Figure 4 .
Figure 4.The distribution of the inferred galaxy population at z < 0.1 on the M⋆-SFR plane.The SFMS and the quiescent population are clearly shown.The inferred galaxy population agrees well with the SFMSs from Renzini & Peng (2015), McGaugh et al. (2017), and Sánchez et al. (2019).

Figure 5 .
Figure5.Performance of the galaxy spectrum emulator.The left panel shows the fractional error of the emulated rest-frame spectrum L λ at 1000-12000 Å.The shaded regions correspond to the 68th (1σ) and 95th (2σ) percentiles of the fractional error distribution.The right panel shows the emulation error of the broadband photometry X k in the SDSS ugriz bands for galaxies at z = 0.Here we use the SDSS ugriz filters as a demonstration and take the transmission curves fromDoi et al. (2010), assuming an airmass of 1.3.The emulator achieves a high accuracy of < 1% for λ > 4000 Å, but suffers at the very blue end.However, the emulated photometry is accurate to < 0.02 mag, which is negligible for analyzing nearby galaxies in large photometric surveys.

Figure 6 .
Figure 6.Similar to the left panel in Figure 2, but showing the distributions of the mock galaxy population (gray) and the inferred galaxy population (blue) in the full parameter space.

Figure 7 .
Figure 7. Similar to the left panel in Figure 3, but showing the distributions of the inferred galaxy population (red) in the full parameter space.For comparison, the gray contours show the joint distribution of the spectroscopic redshift and the stellar mass from the GAMA catalog.

Table 1 .
Parameters in the SPS model and their priors for training the galaxy spectrum emulator

Table 2 .
The distribution of SPS parameters for the mock galaxy population .κj is used to generate βi of the SFH; see §2.1 and Appendix A.