
As Simple as Possible but No Simpler: Optimizing the Performance of Neural Net Emulators for Galaxy SED Fitting


Published 2023 August 29 © 2023. The Author(s). Published by the American Astronomical Society.
, , Citation Elijah P. Mathews et al 2023 ApJ 954 132 DOI 10.3847/1538-4357/ace720



Abstract

Artificial neural network emulators have been demonstrated to be a computationally efficient method for rapidly generating galaxy spectral energy distributions, for parameter inference or otherwise. Using a highly flexible and fast mathematical structure, they can learn the nontrivial relationship between input galaxy parameters and output observables. However, they do so imperfectly, and small errors in flux prediction can yield large differences in recovered parameters. In this work, we investigate the relationship between an emulator's execution time, uncertainties, correlated errors, and ability to recover accurate posteriors. We show that emulators can recover results consistent with traditional fits, with a precision of 25%–40% in posterior medians for stellar mass, stellar metallicity, star formation rate, and stellar age. We find that emulation uncertainties scale with an emulator's width N as ∝ N^(−1), while execution time scales as ∝ N^2, resulting in an inherent tradeoff between execution time and emulation uncertainties. We also find that emulators with uncertainties smaller than observational uncertainties are able to recover accurate posteriors for most parameters without a significant increase in catastrophic outliers. Furthermore, we demonstrate that small architectures can produce flux residuals with significant correlations, which can create dangerous systematic errors in colors. Finally, we show that the distributions chosen for generating training sets can have a large effect on an emulator's ability to accurately fit rare objects. Selecting the optimal architecture and training set for an emulator will minimize the computational requirements for fitting near-future large-scale galaxy surveys. We release our emulators on GitHub (http://github.com/elijahmathews/MathewsEtAl2023).


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The study of galaxy formation aims to answer questions such as how galaxies evolve and are structured, and how and when they form their stars. To answer these questions, we collect measurements of the light galaxies emit, in the form of photometry and/or spectra, which together constitute a galaxy's spectral energy distribution (SED). However, due to the complex physics that governs a galaxy's emission of light, it can often be very difficult to properly interpret these observations (Walcher et al. 2011; Conroy 2013).

In order to interpret observed galaxy SEDs, we first require stellar population synthesis (SPS) methods that can model and predict a galaxy's SED for a given set of physical parameters. In the simplest cases, these can be a dedicated stellar population spectral library (e.g., Bruzual & Charlot 2003; Maraston 2005), or alternatively a combination of a stellar spectral library (e.g., MILES; Sánchez-Blázquez et al. 2006) and a stellar isochrone library (e.g., MIST; Choi et al. 2016; Dotter 2016) from which an SED for a simple or composite stellar population (SSP or CSP, respectively) can be predicted assuming a given initial mass function (IMF; e.g., Salpeter 1955; Kroupa 2001; Chabrier 2003), dust attenuation law (e.g., Calzetti et al. 2000), and star formation history (SFH). 16 From these, physical parameters can be inferred using techniques such as χ2-minimization or maximum likelihood estimation. Recent codes, such as FSPS (Conroy et al. 2009; Conroy & Gunn 2010), also allow modifications of stellar physics (e.g., changes to horizontal branch morphology) or the addition of other expected contributions to galaxy SEDs, such as active galactic nuclei (AGNs), as is done in MAGPHYS (da Cunha et al. 2008), BayeSED (Han & Han 2012, 2014, 2019), and X-CIGALE (Yang et al. 2020).

SPS codes like FSPS typically require ∼10–100 ms to compute a prediction for a galaxy SED, depending on the number of features used in the prediction. 17 Furthermore, new fitting codes have been created that combine SPS routines with more advanced fitting techniques, often combining SPS with sampling routines to perform Bayesian inference on galaxy SEDs, leading to an ensemble of possible solutions (i.e., posteriors) for a galaxy's physical properties as opposed to only providing a single best-fit model. Some Bayesian SED-fitting codes employing Markov Chain Monte Carlo (MCMC) sampling include CIGALE and its X-ray module X-CIGALE (Burgarella et al. 2005; Noll et al. 2009; Boquien et al. 2019; Yang et al. 2020), BayeSED (Han & Han 2012, 2014, 2019), Prospector (Leja et al. 2017; Johnson et al. 2021), Bagpipes (Carnall et al. 2018, 2019), and MCSED (Bowman et al. 2020a, 2020b). Generating an estimate of the posterior typically requires a very large number of executions of the SPS code, of order 10^5–10^6 for photometric SEDs (Leja et al. 2019). 18

The ∼10–100 ms execution time of SPS presents a significant problem for the enormous amount of data we will soon face: it is too slow. Going forward, these existing SED-fitting procedures are orders of magnitude too slow to feasibly fit large catalogs of SEDs with Bayesian SED-fitting codes. This problem is most obvious when fitting photometric surveys, which has been shown to require ∼10–100 CPU hr per galaxy for a 14-parameter physical model (e.g., Leja et al. 2019). Within the next few years, the Vera C. Rubin Observatory (VRO) will begin science operations and initiate its Legacy Survey of Space and Time (Ivezić et al. 2019), which is expected to deliver photometry for ∼10^10 galaxies. This alone would overwhelm current computational resources, as fitting the full sample would require on the order of ∼10^11 CPU hours, equivalent to a ∼USD 10 billion 19 investment, an amount comparable to the cost of JWST (Witze 2018). The environmental impact would also be significant, producing an estimated ∼10^8 kg of CO2 with current electrical energy sources, comparable to the emissions of a modern widebody aircraft operating continuously for ∼200 days. 20 Furthermore, spectroscopic surveys will be even more computationally expensive to fit than photometric surveys of similar size due to their increased spectral resolution over multiband photometry and their need for significantly more likelihood function calls when fitting (e.g., ∼100 hr per fit; Tacchella et al. 2022), with several large spectroscopic surveys either planned or underway. For example, the zCOSMOS survey (Lilly et al. 2007, 2009; Knobel et al. 2012) provides spectra for ∼10^4 galaxies at z < 1. A sample of this size would be difficult enough to fit on its own, but ever larger spectroscopic surveys are on the way, including the Prime Focus Spectroscopic Galaxy Evolution Survey (Greene et al. 2022), the Hobby–Eberly Telescope Dark Energy Experiment (Hill et al. 2004), and the Dark Energy Spectroscopic Instrument (Aghamousa et al. 2016), which will deliver spectra for approximately 10^5, 10^6, and 10^7 galaxies, respectively.

Thus, faster SED-fitting methods are clearly needed in order to fit large samples of galaxy SEDs. One potential solution is to replace existing SPS routines altogether with an artificial neural network (ANN) trained to mimic an SPS code's behavior. ANNs are highly flexible mathematical structures that can mimic the behavior of arbitrarily complex functions (i.e., behaving as a highly nonlinear regression), both in the arbitrary-width (Cybenko 1989; Hornik 1991) and arbitrary-depth (Hanin & Sellke 2017; Lu et al. 2017; Kidger & Lyons 2019) limits. Crucially, ANNs are typically very fast to execute, ∼10^3–10^4 times faster than SPS routines, with execution times for typical ANN architectures of 1–100 μs. ANNs have been used to model highly nonlinear relationships in astrophysics for several decades (e.g., to predict a galaxy's morphological type based on observations; Storrie-Lombardi et al. 1992). They were first used to emulate SPS by Han & Han (2012) in the BayeSED SED-fitting code, which used an ANN with a single 20-node hidden layer to emulate a 5–6 parameter physical model. Alsing et al. (2020) further developed ANN SPS emulation with the Speculator code, which uses wider and deeper ANNs to produce photometric and spectroscopic FSPS emulators for the 14-parameter Prospector-α physical model as used by Leja et al. (2019).

However, while SPS routines are typically at worst interpolations of their underlying physical models (and are thus guaranteed to provide sensible output based on their underlying physics), SPS emulators are at best approximations of these SPS routines, providing no guarantee of sufficiently accurate or precise output for a given set of inputs. In practice, they produce flux uncertainties of order 0.01–0.1 mag depending on their ANN architecture, training set, and optimization hyperparameters. These inaccuracies in flux predictions, even if they are small and similar to assumed systematic uncertainties in the SPS methods (e.g., ∼5%), could result in highly incorrect posterior distributions when used to infer galaxy properties, owing to the highly complicated and degenerate nature of likelihood space, along with incorrect inferred colors if the errors vary as a function of wavelength. Furthermore, due to the black-box nature of ANNs, it may be extremely difficult to know when an emulator's rare but catastrophic errors lead to these incorrect solutions without comparing to a fit using the original SPS code it replaces (which defeats the purpose of using the emulator). This effect has not yet been thoroughly investigated in the literature; Kwon et al. (2023) found that spectroscopic emulators are able to recover accurate posteriors, but fit only a small sample of ∼100 mock galaxy SEDs and were therefore not sensitive to rare, catastrophic (e.g., ≥3σ) outliers in posterior estimation.

In this work, we use a technique similar to that of Alsing et al. (2020) to develop a set of photometric emulators that replace an SPS code for an 18-parameter physical model, which we then use to fit a large sample of ∼10^4 photometric galaxy SEDs from the 3D-HST photometric catalog (Brammer et al. 2012; Skelton et al. 2014). In doing so, we investigate the performance of each emulator, including the speed at which an emulator can produce a predicted SED and the accuracy and precision with which it is able to do so. We then examine their ability to recover accurate posterior distributions when combined with an MCMC-based SED-fitting code, and compare the rate of extreme posterior discrepancies. Using this information, we develop a method for selecting a suitable architecture for a given SED-fitting task that minimizes emulation execution time without sacrificing posterior accuracy. Section 2 describes in detail the codes that our emulators are trained to mimic, the data we train them on, the ANN architectures used, and the methods we use to fit SEDs with them, while Section 3 presents the performance of our emulators. Section 4 contains the discussion, and we state our conclusions in Section 5. Throughout the work, we assume a flat Lambda cold dark matter cosmology using the Hinshaw et al. (2013) parameter values. 21 The emulators trained in this work, along with the code used to fit SEDs with them, are available on GitHub. 22

2. Methods and Data

In this section, we detail the methods we used in this work to develop SPS emulators, fit galaxy SEDs using them, and quantify their performance. In Section 2.1, we describe the generation of our model spectra (from which photometry that we will train on can be computed), along with the chosen physical model. Then, in Section 2.2, we describe the methods used for our ANN emulation, including information about the networks' architectures and training procedures. Finally, in Section 2.3, we describe the photometric data that we will be fitting as a test for the emulators, in addition to the specifics of the fitting routine used.

2.1. Stellar Population Synthesis

To generate galaxy SEDs, we use FSPS 23 (Conroy et al. 2009; Conroy & Gunn 2010), accessed using the python-fsps 24 Python package (Johnson et al. 2021). Within FSPS, we select the MIST isochrones (Choi et al. 2016; Dotter 2016) and the MILES spectral library (Sánchez-Blázquez et al. 2006).

We begin our physical model with the Prospector-α model assumed by Leja et al. (2019), consisting of 14 parameters quantifying a galaxy's mass, stellar and gas-phase metallicity, SFH, dust properties, and AGN contribution. We then make a handful of modifications to the fiducial Prospector-α model. First, we extend the model's lower limit on the cumulative stellar mass formed to 10^6 M⊙ to allow for lower mass solutions when fitting objects at very low redshift. Furthermore, as we will be fitting photometry that has coverage in the rest-frame mid-IR, we allow three parameters governing the SED from thermal dust emission (i.e., Draine & Li 2007) to vary (whereas Prospector-α holds these three parameters to fixed values), namely the polycyclic aromatic hydrocarbon (PAH) mass fraction QPAH, the minimum radiation field strength for dust emission ${U}_{\min }$, and the fraction γ of starlight exposed to radiation field strengths ${U}_{\min }\lt U\leqslant {U}_{\max }$ (where ${U}_{\max }$ is fixed). Finally, while redshift will remain a fixed parameter for the vast majority of the fits we will perform, each galaxy requires a different fixed redshift. Thus, redshift must also be included in the emulator, although it is a particularly difficult parameter to emulate in photometric emulators and may warrant special treatment in the future (see Section 4.3 for further discussion). Together, these modifications yield an 18-parameter physical model whose parameters serve as the inputs to our ANN emulators. The 18 parameters, in addition to the distributions used for generating training sets (and test sets) and the distributions used as priors when fitting to photometry, are listed in Table 1.

Table 1. Physical Parameters

Parameter | Symbol | Train/Test Distribution | Prior Distribution
Cumulative stellar mass formed | ${\mathrm{log}}_{10}{M}_{* ,\mathrm{formed}}$ | Uniform (6, 12.5) | Uniform (6, 12.5)
Stellar metallicity | ${\mathrm{log}}_{10}({Z}_{* }/{Z}_{\odot })$ | Uniform (−1.98, 0.19) | Normal (μ, σ) with μ and σ from Gallazzi et al. (2005) mass–metallicity relation, truncated to $\left[-1.98,0.19\right]$
Adjacent SFR (six parameters) | ${\mathrm{log}}_{10}\left(\tfrac{{\mathrm{SFR}}_{i}}{{\mathrm{SFR}}_{i+1}}\right)$ | t(2), scaled to μ = 0 and σ = 3, truncated to $\left[-5,5\right]$ | t(2), scaled to μ = 0 and σ = 0.3, truncated to $\left[-5,5\right]$
Diffuse dust optical depth | τ2 | Normal (0.3, 3), truncated to $\left[0,4\right]$ | Normal (0.3, 1), truncated to $\left[0,4\right]$
Power-law modifier for Calzetti et al. (2000) dust attenuation law | n | Uniform (−1.2, 0.4) | Uniform (−1.2, 0.4)
Ratio of birth cloud dust to diffuse dust | τ1/τ2 | Normal (1, 1), truncated to $\left[0,2\right]$ | Normal (1, 0.3), truncated to $\left[0,2\right]$
Ratio of AGN luminosity to bolometric luminosity | ${\mathrm{log}}_{10}{f}_{\mathrm{AGN}}$ | $\mathrm{Uniform}\,(-5,{\mathrm{log}}_{10}3)$ | $\mathrm{Uniform}\,(-5,{\mathrm{log}}_{10}3)$
AGN torus dust optical depth | ${\mathrm{log}}_{10}{\tau }_{\mathrm{AGN}}$ | $\mathrm{Uniform}\,({\mathrm{log}}_{10}5,{\mathrm{log}}_{10}150)$ | $\mathrm{Uniform}\,({\mathrm{log}}_{10}5,{\mathrm{log}}_{10}150)$
Gas-phase metallicity | ${\mathrm{log}}_{10}({Z}_{\mathrm{gas}}/{Z}_{\odot })$ | Uniform (−2.0, 0.5) | Uniform (−2.0, 0.5)
Redshift | z | Uniform (0.5, 3) | Fixed to each object's zbest from Momcheva et al. (2016)
PAH mass fraction | QPAH | Normal (2, 4), truncated to $\left[0,7\right]$ | Normal (2, 2), truncated to $\left[0,7\right]$
Minimum radiation field for dust emission | ${U}_{\min }$ | Normal (1, 20), truncated to $\left[0.1,25\right]$ | Normal (1, 10), truncated to $\left[0.1,25\right]$
Fraction of dust mass exposed to ${U}_{\min }$ | ${\mathrm{log}}_{10}{\gamma }_{{\rm{e}}}$ | Normal (−2, 2), truncated to $\left[-4,0\right]$ | Normal (−2, 1), truncated to $\left[-4,0\right]$

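As an illustration of how the training-set distributions in Table 1 might be sampled, the following sketch draws the diffuse dust optical depth τ2 from its Normal (0.3, 3) distribution truncated to [0, 4]. The helper name is ours; scipy's truncnorm parameterizes the bounds in units of the standard deviation relative to the mean.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_truncated_normal(mu, sigma, lo, hi, size, rng):
    """Sample from Normal(mu, sigma) truncated to [lo, hi].

    scipy's truncnorm takes the bounds in units of sigma relative
    to mu, so convert the physical bounds first.
    """
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, size=size,
                         random_state=rng)

rng = np.random.default_rng(0)
# Training-set distribution for tau_2 from Table 1:
# Normal(0.3, 3) truncated to [0, 4].
tau2 = sample_truncated_normal(0.3, 3.0, 0.0, 4.0, 10_000, rng)
assert tau2.min() >= 0.0 and tau2.max() <= 4.0
```

The same pattern applies to the other truncated distributions in the table, with the Student's t rows requiring a truncated t(2) sampler instead.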

We then select a set of 137 photometric filters, whose spectral flux densities will collectively be the outputs of the emulator. We base our initial filter set on the filters necessary for fitting SEDs from the 3D-Hubble Space Telescope (HST) photometric catalog (Brammer et al. 2012; Skelton et al. 2014), which we then extend by adding additional filters, including filters from the Two Micron All Sky Survey, Galaxy Evolution Explorer, Herschel, JWST, the Sloan Digital Sky Survey, VISTA, and the Wide-field Infrared Survey Explorer. Since the filters used in the 3D-HST catalog are a subset of our full filter set, only a subset of our emulator's outputs will actually be used when fitting SEDs in this work. All of these aforementioned filters are included in sedpy 25 (Johnson 2021), a Python package for dealing with spectral fluxes and photometric filters.

Natively, sedpy outputs the spectral flux density observed through each filter in units of maggies; however, because the fluxes covered by this physical model span ∼40 orders of magnitude, we elect to put these on a logarithmic scale to provide the emulator with a simpler output, a method that has also been employed in previous works. The natural choice would be the AB magnitude system as originally defined by Oke & Gunn (1983); however, for this particular problem, we encounter an issue with the outputs of FSPS: FSPS tends to bottom out in the UV at very small values of spectral flux density due to strong IGM absorption, with mAB > +100 in some cases. With few exceptions, spectral flux densities fainter than about mAB ≈ +35 are beyond the capabilities of most detectors in use today.

The simplest way to solve this would be to raise all extremely faint spectral flux densities in our training set to a minimum spectral flux density of mAB = +35, such that

Equation (1)

${\hat{m}}_{\mathrm{AB}}=\min \left({m}_{\mathrm{AB}},\,+35\right)$

However, with this method, a discontinuity is introduced into the derivative of the emulator's output. Preserving an emulator's continuous differentiability is desirable to leave open the possibility of combining an emulator with a gradient-based sampler in the future (see Section 4.4 for further discussion). Thus, we avoid this differentiability issue by instead using an inverse hyperbolic sine (i.e., $\mathrm{arsinh}$) magnitude system nearly identical to the one proposed by Lupton et al. (1999), where for an observed spectral flux density ${E}_{e,\nu }$, one can define an "AB" $\mathrm{arsinh}$ magnitude μAB as

Equation (2)

${\mu }_{\mathrm{AB}}={\mu }_{0}-a\,\mathrm{arsinh}\left(\tfrac{1}{2}{E}_{e,\nu }\,{e}^{{\mu }_{0}/a}\right)$

where a is set to $2.5{\mathrm{log}}_{10}e$ to preserve the same scaling at high values of spectral flux density as the ordinary AB magnitude system and μ0 is a parameter that sets the location of the low-end cutoff in this magnitude system, which we set to μ0 = +35 in this work. The behavior of this magnitude system is shown in Figure 1. In the bright limit, this magnitude system asymptotically approximates the ordinary mAB magnitude system as one would expect, while in the faint regime, the system approaches a constant value of μAB = +35 regardless of the actual input spectral flux density. However, importantly, it does so smoothly and with a continuous derivative, thus preserving continuous differentiability.
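The limiting behavior described above can be verified numerically. The sketch below implements an arsinh magnitude consistent with this description (a = 2.5 log10 e, faint-end plateau μ0 = +35, flux in maggies); the exact algebraic form of Equation (2) may be written differently in the original, but any Lupton-style definition with these constants gives the same values.

```python
import numpy as np

A = 2.5 * np.log10(np.e)   # a = 2.5 log10(e), preserves the AB scaling
MU0 = 35.0                 # faint-end cutoff mu_0 in magnitudes

def arsinh_mag(flux_maggies):
    """Arsinh magnitude with a smooth faint-end plateau at MU0.

    Identical to AB magnitudes for bright sources, asymptotically
    approaching MU0 (with continuous derivative) as flux -> 0.
    """
    return MU0 - A * np.arcsinh(0.5 * flux_maggies * np.exp(MU0 / A))

# Bright limit: a 20th-mag source (10^-8 maggies) recovers m_AB = 20.
bright = arsinh_mag(1e-8)
# Faint limit: a vanishing flux plateaus at +35 instead of diverging.
faint = arsinh_mag(1e-30)
assert abs(bright - 20.0) < 1e-6
assert abs(faint - 35.0) < 1e-6
```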

Figure 1.

Figure 1. The difference in behavior between ordinary logarithmic magnitudes (e.g., AB magnitudes) and the $\mathrm{arsinh}$ magnitude system used in the emulators in this work. For logarithmic magnitudes significantly brighter than μ0, the $\mathrm{arsinh}$ system behaves identically to the logarithmic system. However, for logarithmic magnitudes ≥μ0, the $\mathrm{arsinh}$ system asymptotically approaches a constant value μ0 (equal to 35 in this work).


2.2. ANN Emulator

Choosing the proper emulator architecture for a particular SED-fitting problem is one of the central questions this work seeks to address. Put simply, our goal is to provide an SPS alternative that produces precise photometry estimates while requiring as little execution time as possible. Emulator architectures that are more complex, and can therefore fit more complex input–output relationships more precisely, necessarily require longer execution times than simpler architectures. Thus, there is a key balance to be struck: ideally, we would like to choose the simplest (i.e., fastest) possible emulator architecture that provides sufficiently accurate results (i.e., posteriors) when used for fitting SEDs. Choosing an overly complex emulator architecture may be counterproductive if the additional complexity does not yield a notable improvement in the recovery of posteriors, which involves an extra layer of nonlinear transformations beyond simply reproducing the spectral flux density accurately.

Following previous work (i.e., Han & Han 2012, 2014; Alsing et al. 2020), the emulation method selected for this work is a multilayer perceptron (MLP) ANN. In the context of emulation, MLP ANNs have a very desirable runtime advantage over other machine-learning techniques, such as the k-nearest neighbors algorithm and Gaussian processes, in that they are extremely fast to call and have a small memory footprint (whereas other methods have memory footprints and execution times that can scale with the size of the training set). An MLP is constructed of L trainable layers, where layer i contains Ni nodes, each using an activation function ϕi ( x ) (which generally operates element-wise). The MLP requires a set of weight matrices W i and bias vectors b i for each layer, such that evaluating

Equation (3)

${{\boldsymbol{x}}}_{i}={\phi }_{i}\left({{\boldsymbol{W}}}_{i}\,{{\boldsymbol{x}}}_{i-1}+{{\boldsymbol{b}}}_{i}\right)$

for each layer produces a useful output from the last layer, given an input vector x 0 (which in this case is our 18 parameters). For the purposes of this work, we use a total of L = 6 layers for our emulator, where the first five layers have N nodes per layer and use a nonlinear activation function. The final layer consists of 137 nodes to match the 137 bands of photometry to be predicted (see Section 2.1) and uses linear activation (i.e., f( x ) = x ). The full architecture described in this section is illustrated in Figure 2.
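The architecture just described (18 inputs, five GELU hidden layers of width N, and a 137-node linear output layer) can be sketched as a plain forward pass. The weight initialization below is illustrative only, not the trained emulator weights.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (see Section 2.2)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x**3)))

def init_layer(rng, n_in, n_out):
    # Illustrative random initialization (not the trained weights)
    W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

def mlp_forward(params, x):
    """Equation (3) applied layer by layer: x_i = phi_i(W_i x_{i-1} + b_i).

    All layers but the last use GELU; the last is linear.
    """
    *hidden, last = params
    for W, b in hidden:
        x = gelu(W @ x + b)
    W, b = last
    return W @ x + b

rng = np.random.default_rng(0)
N = 128                                  # nodes per hidden layer
sizes = [18] + [N] * 5 + [137]           # 18 inputs -> 5 hidden -> 137 filters
params = [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]
photometry = mlp_forward(params, np.zeros(18))
assert photometry.shape == (137,)
```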

Figure 2.

Figure 2. Illustration of the neural network architecture used in this work. At the far left, a vector θ of physical parameters is taken as input; a precomputed mean is subtracted and a precomputed standard deviation divided out to normalize the inputs. The input then passes through L − 1 hidden layers (where L = 6 for the entirety of this work), each of which has N fully connected nodes using the GELU activation function described in Section 2.2. The final fully connected layer, layer L, has linear activation and a number of nodes equal to the number of predicted photometric filters (137 in this work). Finally, a precomputed mean is added back to the output (the precomputed standard deviation is fixed to 1), which yields the predicted photometry.


To find where the balance between ANN architecture and emulator precision occurs, we train a total of six ANN emulators with varying numbers of nodes per layer, N:

Equation (4)

As is demonstrated in Section 3.1, this selection of N provides a wide range in emulator accuracies (i.e., an emulator's typical bias with respect to the true FSPS predictions) and precisions (i.e., the typical scale of the differences between an emulator's predictions and the true FSPS predictions) to gauge the dependence of posterior quality on emulator precision. All other hyperparameters relating to ANN architecture and training procedure (e.g., the depth of the network L, training set size, activation function, loss function) are fixed.

We use the Gaussian error linear unit (GELU; Hendrycks & Gimpel 2016) activation function for the nonlinear layers in our emulator, which we found to produce the best balance between execution time and emulation precision. It is formally defined as

Equation (5)

$\mathrm{GELU}(x)=x\,{\rm{\Phi }}(x)=\frac{x}{2}\left[1+\mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$

However, as there is no closed-form expression for $\mathrm{erf}(x)$, one must approximate it. We elect to use the "tanh" approximation,

Equation (6)

$\mathrm{GELU}(x)\approx \frac{x}{2}\left[1+\mathrm{tanh}\left(\sqrt{\frac{2}{\pi }}\left(x+0.044715\,{x}^{3}\right)\right)\right]$

which can be accessed using the jax.nn.gelu(x, approximate=True) function within the JAX (Bradbury et al. 2018) numerical library in Python or the Flux.gelu(x) function in the Flux numerical library in Julia.
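The quality of the "tanh" approximation is easy to verify directly: over a wide input range, it agrees with the exact erf-based definition to roughly the 10^-4 level.

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # Equation (5): x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Equation (6): the "tanh" approximation, as in
    # jax.nn.gelu(x, approximate=True)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x**3)))

x = np.linspace(-5.0, 5.0, 1001)
max_err = np.abs(gelu_exact(x) - gelu_tanh(x)).max()
assert max_err < 1e-3   # the two forms agree closely over this range
```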

In addition to this ordinary ANN architecture, special layers are added at the input and output of the emulator to normalize and denormalize the inputs and outputs using precomputed means and standard deviations of the training set parameters and photometry. Specifically, a normalization layer is added to the input, which subtracts the precomputed mean to center the data around zero (where floating-point precision is highest) and divides by the precomputed standard deviation so that each parameter has approximately the same scale. A denormalization layer added to the output performs the opposite operations on the output photometry, except that the precomputed standard deviations are all set to 1 to ensure each filter is weighted equally in the loss function.

In order to determine the optimal set of weight matrices W i and bias vectors b i for a given emulator, one must provide the emulator with a large set of input/output pairs so it can learn how the pairs are related. In this work, we generate a set of 10^7 parameter–SED pairs for training. For comparison, Alsing et al. (2020) chose to use 10^6 training samples; we increase the number of training samples due to the additional parameters in our model. While not done in this work, a useful avenue of inquiry would be to determine the ideal training set size for these types of emulators: larger training sets give an emulator more examples with which to hone its accuracy and precision, but at some point face diminishing returns. Parameter vectors are sampled from the distributions listed in Table 1. Similar to Alsing et al. (2020), we base these distributions on the distributions used as priors in our SED fits. However, all nonuniform distributions are altered to add more probability density in their tails to increase the number of samples the emulator is trained on in regions of parameter space that are rarely occupied (see Section 4.2 for further discussion on the choice of training set distributions). These parameter vectors are then fed into the SPS portion of our Prospector workflow to generate FSPS spectra, which are then fed to sedpy to convert the spectra into photometry (expressed in $\mathrm{arsinh}$ magnitudes as per Section 2.1). When training an emulator, we use the root mean square error (RMSE, i.e., the square root of the MSE) loss function, the same as used by Alsing et al. (2020). 26 We optimize each emulator's weights and biases in each step using the NADAM optimizer (Dozat 2016), a variation of the ADAM optimizer (Kingma & Ba 2014) that incorporates Nesterov momentum (Nesterov 1983). 27 The momentum decay parameters are set to β1 = 0.9 and β2 = 0.999.

The network is trained in three steps following a learning rate η and batch size Nbatch schedule:

Equation (7)

Equation (8)

In each step of the schedule, the full training set is randomly split into a training set and a validation set, using a 1:19 validation-to-training split (i.e., 5% of the available training data is devoted to validation). Training continues until either the loss over the validation set has not improved over the past 20 epochs or a total of 1000 epochs has been reached in that step. At the end of a given step, the weights and biases that resulted in the lowest validation loss are saved and training continues until the final step has been completed. These emulators typically take of order 10 hr to train on an Nvidia Tesla K80 GPU.
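The early-stopping logic of one schedule step can be sketched as below. The callables train_one_epoch and validate are hypothetical stand-ins for the real training code: the former updates the model in place for one epoch and the latter returns the validation loss.

```python
import copy

def run_schedule_step(train_one_epoch, validate, model,
                      patience=20, max_epochs=1000):
    """One step of the training schedule: stop after `patience` epochs
    without validation improvement (or at `max_epochs`), and keep the
    best weights seen so far."""
    best_loss, best_model, since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        loss = validate(model)
        if loss < best_loss:
            best_loss, best_model = loss, copy.deepcopy(model)
            since_best = 0
        else:
            since_best += 1
        if since_best >= patience:   # no improvement in `patience` epochs
            break
    return best_model, best_loss

# Toy demonstration: the "model" improves until w = 5, then worsens,
# so early stopping returns the epoch-5 state.
model = {"w": 0.0}
def train_one_epoch(m): m["w"] += 1.0
def validate(m): return abs(m["w"] - 5.0)
best, loss = run_schedule_step(train_one_epoch, validate, model)
assert best["w"] == 5.0 and loss == 0.0
```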

2.3. Fitting Photometry SEDs

The most stringent test of an SPS emulator is its ability to produce accurate posterior distributions when conditioned on galaxy SEDs. To test each emulator against this goal across a broad range of galaxy SED types, we fit a selection of galaxies in the 3D-HST photometric catalog (Brammer et al. 2012; Skelton et al. 2014), namely all 12,319 galaxies fit by Leja et al. (2019) in the COSMOS field. This sample size was selected to provide a robust measurement of the rate of 3σ posterior discrepancies, which theoretically occur at a rate of about 1 in 370. In addition to the observed photometric spectral flux densities and their measurement uncertainties, the zbest redshift from Momcheva et al. (2016) is extracted from the catalog, and the redshift is fixed to that value in our fits to avoid significant posterior discrepancies arising solely from multimodal photometric redshift solutions. All of the 3D-HST data used in this work are available on MAST: 10.17909/T9JW9Z.

We fit these SEDs using the Prospector 28 (Leja et al. 2017; Johnson et al. 2021) SED-fitting code, with our priors set to the distributions listed in the rightmost column of Table 1. These priors are largely the same as those used by Leja et al. (2019), with modifications including the aforementioned lower limit on total stellar mass formed, in addition to distributions for the added dust emission parameters that approximate their stacked posteriors from the Leja et al. (2017) work. 29 For sampling, we use the dynesty 30 (Skilling 2004, 2006; Speagle 2020; Koposov et al. 2022) nested sampling routine. Fitting a single SED with Prospector with the settings used in this work typically requires ∼1.8 × 10^6 likelihood function calls, for which each emulator takes ∼30 CPU minutes while FSPS requires ∼20 CPU hr. 31 Following the procedure of Leja et al. (2019), we assume an error floor of 5% in each band of the observed spectral flux densities to account for systematic uncertainties in FSPS predictions; however, to further account for the known emulation error, we convolve the observational errors with the emulator precision in that band (i.e., the standard deviation of the test set residuals; see Section 3.1). That is, for a given observational uncertainty σobs, estimated emulation precision σemul, and observed flux Eν , we assume an overall uncertainty σassume of

Equation (9)

${\sigma }_{\mathrm{assume}}=\sqrt{{\sigma }_{\mathrm{obs}}^{2}+{\left({\sigma }_{\mathrm{emul}}\,{E}_{\nu }\right)}^{2}}$
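The quadrature combination described above can be sketched as follows. Converting the per-band emulation precision (quoted in magnitudes) to flux space by multiplying by the observed flux is our simplifying assumption; the exact conversion in the original may differ by a constant factor.

```python
import numpy as np

def assumed_uncertainty(sigma_obs, sigma_emul_mag, flux):
    """Combine observational and emulation uncertainty in quadrature.

    sigma_emul_mag is the per-band emulation precision in magnitudes,
    converted (approximately) to a flux uncertainty via the observed
    flux before the quadrature sum -- an illustrative assumption.
    """
    sigma_emul_flux = sigma_emul_mag * flux
    return np.sqrt(sigma_obs**2 + sigma_emul_flux**2)

# A band with a 5% observational error and 1% emulation precision is
# still dominated by the observational term.
sigma = assumed_uncertainty(0.05, 0.01, 1.0)
assert abs(sigma - np.sqrt(0.05**2 + 0.01**2)) < 1e-12
```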

Another aspect of the fitting routine we are particularly concerned with is the settings used for the dynesty sampler, especially as they pertain to sampling variance. As the purpose of this work is to compare different emulation architectures against each other, we want to ensure that the differences between the emulators are the primary cause of the differences between fits. The effect of emulation errors could be hidden if the sampling variance is too large, such that an emulator's fit could be found discrepant with respect to a reference fit even if the emulator is far more precise than the observations. Indeed, we found that independent realizations of the same fit would occasionally result in highly discrepant posteriors when using the default dynesty settings. However, the combination of nlive_init=2000 (default 100) and target_n_effective=200000 (default 10000) decreases the sampling noise sufficiently to make the emulation errors evident. Increasing nlive_init, the number of live points used by dynesty, allows dynesty to explore more areas of parameter space, reducing the risk of missing solutions. Meanwhile, increasing target_n_effective, the target effective sample size (Kish 1965) of posterior samples, defined as ${n}_{\mathrm{eff}}={({\sum }_{i}{w}_{i})}^{2}/{\sum }_{i}{w}_{i}^{2}$ where wi are the sample weights, allows for more robust measurements of the extreme percentiles of the posterior (e.g., percentiles corresponding to ±3σ). Both of these qualities are relevant, as we will ultimately compare how often a solution proposed by one emulator is statistically excluded by a different emulator's fit, and in doing so we want to reduce the number of exclusions that occur solely due to the sampler. 32
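The Kish effective sample size quoted above is simple to compute from a set of posterior sample weights:

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Equal weights: n_eff equals the number of samples.
assert effective_sample_size(np.ones(1000)) == 1000.0
# Highly unequal weights: n_eff collapses toward 1.
assert effective_sample_size([1.0, 1e-6, 1e-6]) < 1.001
```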

Each of the 12,319 galaxies is fit once with each emulator, except for the 1024 node-per-layer emulator, with which each galaxy is fit twice to provide an estimate of the discrepancies between posteriors due solely to sampling uncertainties. Furthermore, in addition to fitting the galaxy SEDs with the emulators, we also fit a random subsample of 10³ galaxies with FSPS to isolate any errors that may be caused by emulation alone. Finally, we generate a mock catalog of 10⁴ galaxy SEDs and fit these with the emulator as well to verify that the posteriors are representative of galaxies' true properties. These mocks are generated in the same way as the training and test sets, except that (1) the parameters are sampled from the prior distributions rather than the training/test set distributions (although redshift is still sampled from a uniform (0.5, 3.0) distribution), (2) a random error sampled from a normal distribution with μ = 0 mag and σ = 0.1 mag is added to each filter flux, and (3) the filter set is restricted to only those filters provided by the 3D-HST COSMOS catalog.

3. Results

Our results are organized into two separate sets of tests: Section 3.1 evaluates the performance of the emulators in flux space, while Section 3.2 evaluates the performance of the emulators in derived parameter space, which is a more expensive computational test since it requires an estimate of the truth.

3.1. A Priori Emulation Performance

Here we demonstrate how the execution time of an emulator scales with the architecture. The dashed gray line shown in Figure 3 shows the measured ∼40 ms execution time for FSPS (with the proper settings for use with our physical model), which is the benchmark that all emulators must exceed in order to be useful. Meanwhile, the solid black curve denotes the typical execution times for ANN emulators of various widths on a CPU. 33 As expected, emulators with smaller architectures require less execution time. For example, an emulator with 1 node per layer (which would only feasibly be able to emulate the normalization of an SED) is approximately 10⁵ times faster than FSPS, while an emulator with 4096 nodes per layer requires roughly as much execution time as FSPS (and is thus the absolute upper limit in ANN width for an SPS emulator on a CPU). For emulators with more than ∼256 nodes per layer, we find that the execution time tends to scale quadratically with the number of nodes per layer (i.e., doubling the width of the emulator quadruples the execution time), with a least squares fit yielding a scaling of ∝${N}^{2.01\pm 0.01}$, directly attributable to the matrix multiplication involving N × N matrices in the hidden layers. For emulators with fewer than ∼256 nodes per layer, the curve flattens because the linear output layer (which has a fixed width of 137 nodes due to the 137 emulated filter magnitudes) becomes the most computationally expensive step in the execution, asymptotically approaching an execution time of ∼800 ns for extremely small widths. Thus, for large architectures, adding additional filters to the emulator will not significantly change its overall execution time, but these additional filters may change the emulator's precision after training due to the additional information the emulator needs to learn.
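The power-law index of the timing curve can be measured with a least squares fit in log-log space; a minimal sketch using illustrative placeholder timings rather than our measured benchmarks:

```python
import numpy as np

widths = np.array([256, 512, 1024, 2048, 4096])
# illustrative timings (seconds) growing quadratically with width;
# real benchmark data would be substituted here
times = 1e-8 * widths.astype(float) ** 2

# fit log(time) = slope * log(width) + intercept
slope, intercept = np.polyfit(np.log(widths), np.log(times), 1)
# slope near 2 indicates execution time scaling as N^2
```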

Figure 3. Execution time of ANN emulators (on a CPU) as a function of the width of the network (i.e., the number of nodes per layer). The solid black line denotes the median execution time among all benchmark trials evaluated with a given architecture, while the shaded region encompasses the ±1σ quantiles of those trials. The dashed gray line near the top of the plot denotes the execution time of FSPS with the settings used for our physical model. The colored points denote the network widths used for the emulators trained in this work. For network widths N ≥ 256, the execution time scales approximately as ${N}^{2.01\pm 0.01}$.


Next we assess how the precision of an emulator in flux space scales with network width. For each filter i, we estimate emulator errors from the test set (a data set of 10⁶ parameter-SED pairs generated using the same distributions as the training set) by computing the mean μi (a measure of the emulator's accuracy) and standard deviation σi (a measure of the emulator's precision) of the emulation errors in that filter, which are the differences between the flux predicted by the emulator and the flux predicted by FSPS. Overall, the accuracy of all emulators trained in this work is very good, with a maximum μi across all filter/emulator combinations of about 4 × 10⁻³ mag. 34 The full distributions for each emulator for a selection of four filters can be seen in Figure 4.
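The per-filter accuracy and precision statistics can be computed in a few lines; a minimal sketch with synthetic stand-in residuals (the function name and test data are ours):

```python
import numpy as np

def residual_stats(mag_emulated, mag_fsps):
    """Per-filter accuracy (mean) and precision (standard deviation)
    of the emulation errors.

    Both inputs are (n_test, n_filters) arrays of magnitudes; errors
    are defined as emulator minus FSPS.
    """
    resid = mag_emulated - mag_fsps
    return resid.mean(axis=0), resid.std(axis=0)

# synthetic stand-in: three filters with increasing emulation error
rng = np.random.default_rng(0)
truth = rng.normal(size=(10000, 3))
emul = truth + rng.normal(scale=[0.001, 0.005, 0.02], size=(10000, 3))
mu, sigma = residual_stats(emul, truth)
```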

Figure 4. The distribution of absolute test set residuals (i.e., ∣emulated − actual∣) for each emulator, evaluated for the Mayall Mosaic U, Keck LRIS Rs , JWST NIRCam F277W, and Spitzer MIPS 24 μm filters. The thin vertical dotted lines denote the residual corresponding to the distribution's ∼68th percentile.


However, the precision is a more relevant factor in parameter inference. In Figure 5, we show the emulation precision σi for each filter in each emulator. We first notice that the precision of a given emulator scales with the number of nodes per layer—specifically, we generally find that doubling the width of an emulator doubles the precision (i.e., σi decreases by a factor of 2). Given a fiducial emulator architecture and its precision after being trained, this provides a useful method of estimating the architecture that will be required in order to meet a given precision target. However, it appears that this trend breaks down for extremely complex architectures, with the 1024 node-per-layer emulator displaying diminishing returns in this regard. Additionally, we find that the 64, 128, 256, 512, and 1024 node-per-layer emulators have precisions that exceed the precision of the observations we fit in this work (which have an error floor of 5%) for at least a subsample of the emulated bands, with the 128 node-per-layer network being the simplest emulator to exceed this precision in all 137 of its filters. All emulators exhibit very little bias, with the 32 node-per-layer emulator showing biases no greater than 5 × 10⁻³ mag and the 1024 node-per-layer emulator showing biases no greater than 5 × 10⁻⁴ mag.
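Since precision scales roughly as 1/N, a required width can be estimated by rescaling from a trained fiducial emulator; a minimal sketch (the function name and the round-up-to-a-power-of-two convention are our own choices):

```python
import math

def required_width(n_fid, sigma_fid, sigma_target):
    """Estimate the network width needed to hit a precision target,
    assuming emulation precision scales as 1/N (as found empirically
    for all but the widest architectures). Rounds up to the next
    power of two, matching the widths trained in this work."""
    n = n_fid * sigma_fid / sigma_target
    return 2 ** math.ceil(math.log2(n))

# e.g., a 128-node emulator with ~3% errors, targeting ~1.5% errors
width = required_width(128, 0.03, 0.015)
```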

Figure 5. The emulation precision of each emulator trained in this work as a function of filter effective wavelength, where the precision is expressed in terms of $\mathrm{arsinh}$ magnitudes (the native outputs of the emulator, as defined in Section 2.1). Emulation precision generally doubles (i.e., σ decreases by a factor of 2) as network width doubles, except for filters in the far-IR and for extremely wide networks, which see diminishing returns. Solid lines denote architectures deemed sufficiently accurate and precise for SED fitting in this work, while dashed–dotted lines denote emulators deemed marginally sufficient and dotted lines denote emulators deemed insufficient.


Additionally, we note that there is a spectral trend in the precision of a given emulator—with this filter set, we find that the emulators tend to be most precise in the optical and near-IR (NIR), and have diminished precision near the extreme ends of the modeled SEDs. This is likely due in part to the fact that these are the wavelengths where the mass-to-light ratio (M/L) is most variable (Conroy 2013), in addition to the fact that the selected filters are most densely concentrated in the optical and NIR while all filters are weighted equally in the loss function. Finally, we can also see that there are instances where the precision of a given emulator changes rapidly from one filter to the next, particularly in the optical and NIR. The cause for this is likely twofold, with both causes sensitive to redshift. First, as a rest-frame SED is redshifted, sharp spectral features present in the SED (e.g., Lyman and Balmer breaks, emission lines) will rapidly change filter magnitudes as these features pass through each filter's transmission curve, which means that filters whose transmission curves contain these spectral features (when subject to the redshifts present in our test set) will be affected more than filters that do not. Second, while our filter selection primarily contains wideband filters, some of the filters have narrower transmission curves, including 18 Subaru Suprime-Cam intermediate bands between 400 and 900 nm and 12 JWST NIRCam medium band filters between 1 and 5 μm, which are affected even more strongly as spectral features pass through their transmission curves.

Now we assess to what extent the emulator errors are correlated. This is important: If an emulator typically provides an incorrect prediction of the observed spectral flux density for a single filter while simultaneously being correct in the remaining filters, then the errors in the emulator's predictions can be regarded as independent with regard to filter and the results of the filter-by-filter precision analysis from Figure 5 can be treated as independent noise (and thus, the sampler should be able to properly account for an emulator's errors). However, if emulator residuals are typically correlated for a group of filters (i.e., if emulation errors are correlated between some number of filters), it may lead to systematic color offsets that yield incorrect results. For example, if an otherwise precise emulator typically predicts a Balmer break that is too small (i.e., predicting too much flux for filters on the blueward side of the jump while simultaneously predicting too little flux for filters on the redward side), it would likely lead to a stellar age posterior that is artificially biased toward younger ages, especially since these correlations would not be accounted for if the emulator's errors are treated as independent with regard to filter.

A simple way to investigate this effect is by computing the covariance matrix of the emulator's test set residuals, where σij is the covariance between errors in filter i and filter j (and with this notation ${\sigma }_{{ii}}={\sigma }_{i}^{2}$). We display these covariance matrices for the 32 and 1024 node-per-layer emulators in Figure 6, where we normalize each of the covariances by the corresponding filter precisions (i.e., σij/(σi σj)) to compare how strong the covariances are relative to the individual precisions. We find that in the 32 node-per-layer emulator, the covariances are both large in comparison to the filter precision (i.e., maximal ∣σij/(σi σj)∣ of order 10⁻¹) and spectrally widespread, indicating that this emulator is very likely to be significantly wrong (and wrong in the same direction) in many unrelated filters when it is wrong in a given filter (i.e., its errors are highly correlated). In contrast, we find that emulators with more nodes per layer have increasingly less correlation in their flux errors—on the extreme end, the 1024 node-per-layer emulator exhibits much smaller covariances (i.e., maximal ∣σij/(σi σj)∣ of order 10⁻²), and these covariances are mostly concentrated in filters that are spectrally nearby (and in some cases are nearly identical filters, e.g., the V band on different telescopes/instruments). Thus, the standard assumption of uncorrelated Gaussian errors is a much safer assumption for a 1024 node-per-layer emulator than it is for the 32 node-per-layer emulator.
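The normalized covariance σij/(σi σj) is simply the correlation matrix of the residuals; a minimal sketch using independent synthetic residuals (the function name and test data are ours):

```python
import numpy as np

def normalized_covariance(residuals):
    """Correlation matrix sigma_ij / (sigma_i * sigma_j) of test-set
    residuals, where residuals has shape (n_test, n_filters)."""
    cov = np.cov(residuals, rowvar=False)
    sig = np.sqrt(np.diag(cov))
    return cov / np.outer(sig, sig)

# independent per-filter errors -> off-diagonal terms should be near zero
rng = np.random.default_rng(1)
resid = rng.normal(size=(50000, 4))
corr = normalized_covariance(resid)
```

For a well-behaved wide emulator, the off-diagonal entries of this matrix should be small; large, spectrally widespread entries (as for the 32 node-per-layer emulator) signal correlated errors that the sampler cannot absorb as independent noise.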

Figure 6. Covariance matrices for the 32 node-per-layer emulator (left) and 1024 node-per-layer emulator (right), normalized by the emulation precision in the relevant filters (i.e., plotted is σij /(σi σj )). Note the differing color bar scales between the two plots. Compared to the 32 node-per-layer emulator, the 1024 node-per-layer emulator provides covariances that are both smaller in scale and less spectrally widespread (i.e., they are limited to filters with similar effective wavelengths).


3.2. A Posteriori Emulation Performance

As the analysis that follows in this section is predicated on the assumption that our most accurate/precise emulator (the 1024 node-per-layer emulator) is a suitable replacement for FSPS, we first need to establish that this is actually the case. In Figure 7, we compare the posterior medians recovered by the emulator-based fits (using the 1024 node-per-layer emulator, the most accurate and precise emulator) to the posterior medians recovered for the random sample of 10³ 3D-HST galaxies using traditional FSPS-based fits for a selection of parameters. In this test, there is strong agreement in posterior medians between the two SPS methods. By measuring the 68th percentile of the absolute differences between the posterior medians, we find typical differences of order 0.10 dex for the total stellar mass formed, 0.14 dex for the stellar metallicity, 0.12 dex for the star formation rate (SFR) over the most recent 30 Myr, and 0.11 dex for the mass-weighted age. There is no apparent bias between the two SPS methods and the typical scale of the differences between the recovered posterior medians (shown by the gold-shaded region on the plots) is of a similar scale to the typical widths of the posteriors in these parameters (shown by the pink crosshairs). This demonstrates that the 1024 node-per-layer emulator produces fits that have posteriors equivalent to the posteriors produced by FSPS-based fits.

Figure 7. A comparison between posteriors recovered from 1000 Prospector fits to 3D-HST photometry, run with a 1024 node-per-layer emulator and with FSPS, to verify the ability of emulators to recover posteriors consistent with FSPS. Plots are shown for the stellar mass formed M*,formed, diffuse dust optical depth τ2, stellar metallicity Z*, and dust power-law index n (top four panels), in addition to the SFR for the most recent 30 Myr and the mass-weighted age 〈tm , which are derived parameters (bottom two panels). The posterior median from the FSPS-based fit is shown on the x-axis and the posterior median from the emulator-based fit is shown on the y-axis. The gold and blue shaded regions enclose the middle 68th and 95th percentiles of the differences in posterior medians, while the black line denotes the median difference. Shown in pink is a typical posterior width for the plotted parameter.


Additionally, while we have demonstrated that the emulator can recover posteriors that are consistent with the code it emulates (FSPS), we must also verify that these posteriors are actually representative of a galaxy's true physical properties, which can be evaluated using the fits to the sample of mock galaxies. The results of these fits are shown in Figures 8 and 9, the former for a comparison between 1024 node-per-layer emulator fits and the mock truth and the latter for a comparison between FSPS fits and the mock truth. Here we see that the posteriors recovered by both the emulator and FSPS are typically consistent with the true parameter values.

Figure 8. A test of Prospector 's ability to recover accurate estimates of various parameters from mock observations when paired with the 1024 node-per-layer emulator. Results are displayed similarly to the FSPS comparisons shown in Figure 7, with spreads somewhat wider than those found in that test.


Figure 9. A test of Prospector 's ability to recover accurate estimates of various parameters from mock observations when paired with FSPS, similar to Figure 8.


Using the same metric of measuring the 68th percentile of the absolute differences between the posterior median and the mock truth, we find that mock recovery performance is nearly identical between the two SPS methods, with typical differences of 0.15 dex for the stellar mass formed, 0.32 dex for the stellar metallicity, 0.50 dex for the SFR over the most recent 30 Myr, and 0.23 dex for the mass-weighted age. While the differences found here are larger than the scatter found between emulator and FSPS posterior medians in the previous 3D-HST test, we exclude emulation as the cause, given that the differences between the emulator and FSPS posterior medians on these mock observation fits are nearly identical to the 3D-HST results; the larger scatter is instead plausibly explained by the constant 0.1 mag random error built into the mock photometry. Thus, we conclude that the 1024 node-per-layer emulator posterior medians are of the same high quality as the FSPS posterior medians, and the emulator does not limit Prospector 's ability to recover the true parameter values for a given SED.

Next we investigate catastrophic outliers, which represent a serious mode of failure in this test. In this context, a catastrophic outlier is obtaining a posterior distribution from an emulator-based fit that would exclude (with high probability) a solution obtained from a traditional FSPS fit. One potential source of these errors is emulation errors not accounted for in the estimation of emulation precision. In Figure 10, we observe that emulators' errors tend to have tails far longer than that of a Gaussian distribution, making extreme errors more common than one would expect by assuming a normal distribution whose σ is set to the estimated precision of an emulator.
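A quick Monte Carlo comparison illustrates why heavy tails matter: with matched standard deviations, a Student's t(ν = 4) distribution produces 3σ errors several times more often than a Gaussian (the sample size and seed here are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Gaussian errors with unit standard deviation...
gauss = rng.normal(size=n)
# ...vs. Student's t (nu = 4) errors rescaled to unit standard deviation
# (the std of a t distribution with nu = 4 is sqrt(nu / (nu - 2)) = sqrt(2))
heavy = rng.standard_t(df=4, size=n) / np.sqrt(2.0)

# fraction of errors beyond 3 sigma under each distribution
p_gauss = np.mean(np.abs(gauss) > 3.0)
p_heavy = np.mean(np.abs(heavy) > 3.0)
```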

Figure 10. A demonstration of the non-Gaussianity of the emulation errors. Plotted in gray is a histogram of the test set emulation errors for the 1024 node-per-layer emulator's JWST NIRCam F115W filter, while plotted in gold, red, blue, and green are the probability density functions for normal, Student's t(ν = 4), Laplace, and logistic distributions for visual comparison, whose location and scales are set to the mean and standard deviation of emulation errors for this filter.


To assess the frequency at which these catastrophic emulation errors yield inaccurate posteriors, we define a discrepancy statistic:

${Q}_{i;j,k}=\displaystyle \frac{{m}_{i,j}-{m}_{i,k}}{\sqrt{{\sigma }_{i,j}^{2}+{\sigma }_{i,k}^{2}}}$ (10)

where mi,j is the posterior median value for parameter i from a fit using SPS method j and σi,j is the posterior width (defined as half the width between the marginalized posterior's 15.87% and 84.13% percentiles) for parameter i using SPS method j. This is largely analogous to a z-score. Given that sampling is an intrinsically random process, it would not necessarily be unexpected for one to obtain different results with independent realizations of the same fit, even if the fluxes were reproduced. Accordingly, we run fits using the same 1024 node-per-layer emulator twice in order to assess the standard rate of catastrophic outliers caused by sampling uncertainty rather than emulator inaccuracy.
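The discrepancy statistic can be evaluated directly from two sets of posterior samples; a minimal sketch (function names and the synthetic posteriors are ours):

```python
import numpy as np

def posterior_summary(samples):
    """Posterior median and width (half the distance between the
    15.87th and 84.13th percentiles of the marginalized posterior)."""
    lo, med, hi = np.percentile(samples, [15.87, 50.0, 84.13])
    return med, 0.5 * (hi - lo)

def discrepancy(samples_j, samples_k):
    """Q statistic between two fits of the same parameter: the
    difference in posterior medians, normalized by the quadrature
    sum of the posterior widths (analogous to a z-score)."""
    m_j, s_j = posterior_summary(samples_j)
    m_k, s_k = posterior_summary(samples_k)
    return (m_j - m_k) / np.hypot(s_j, s_k)

# two synthetic "posteriors" offset by 0.1 with unit width
rng = np.random.default_rng(3)
post_a = rng.normal(0.0, 1.0, size=100_000)
post_b = rng.normal(0.1, 1.0, size=100_000)
q = discrepancy(post_a, post_b)
```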

In Figure 11, we view the empirical cumulative distribution functions (ECDFs) of this discrepancy statistic for the cumulative stellar mass formed M*,formed and diffuse dust attenuation optical depth τ2, both of which are parameters that should be well constrained by panchromatic SED fitting (e.g., Burgarella et al. 2005), in addition to AGN dust optical depth τAGN and gas-phase metallicity Zgas, which should be less well constrained. We note that the distributions asymptotically approach the distribution reached by the 1024 node-per-layer emulator, to the extent that the 256 and 512 node-per-layer emulators produce nearly identical distributions of discrepancies to repeat runs of 1024 node-per-layer fits. This indicates that emulators with at least 256 nodes per layer are sufficiently accurate for fitting these 3D-HST SEDs.

Figure 11. ECDFs for the discrepancy statistic for the cumulative stellar mass formed M*,formed, diffuse dust optical depth τ2, AGN dust optical depth τAGN, and gas-phase metallicity Zgas. This is evaluated between the 1024 node-per-layer emulator and another emulator. As ANN width increases, the ECDF tends to asymptotically approach the ECDF reached by the 1024 node-per-layer fit, which denotes the distribution of discrepancies encountered from sampling variance alone. The CDF of a standard normal is shown for reference and to demonstrate the non-Gaussianity of the distributions of discrepancies. Solid lines denote architectures deemed sufficiently accurate and precise for SED fitting in this work, while dashed–dotted lines denote emulators deemed marginally sufficient and dotted lines denote emulators deemed insufficient.


We also investigate the rate at which catastrophic outliers occur, which we define as ∣Qi;j,k ∣ ≥ 3 (analogous to a 3σ outlier). Similarly, we see in Figure 12 that these rates tend to asymptotically approach the 1024 node-per-layer rates as ANN width increases. However, the rates are not strictly decreasing with increasing ANN width for all parameters, as one might expect—in fact, for some parameters (e.g., Zgas), the rate actually increases with increasing ANN width, while others exhibit noisy behavior (e.g., M*,formed). In the case of Zgas, this is likely because the parameter is rarely well constrained—it plays a minor role in a galaxy's rest-frame optical spectrum and is difficult to detect in photometry. Since larger observational errors are assumed with smaller ANN emulators, the sampler likely returns the prior as a result of the larger uncertainties. These wide posteriors create scenarios where it is exceedingly difficult to produce a catastrophic error in some parameters, resulting in the noted behavior. All told, the asymptotic behavior is the primary result—increasing ANN width provides increasingly similar catastrophic outlier rates, with all emulators with at least 256 nodes per layer providing rates within ∼30% of the target rate as measured by the repeated 1024 node-per-layer fits.
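Given an array of Q values from many fits, the catastrophic outlier rate is a one-liner; a minimal sketch (the function name is ours):

```python
import numpy as np

def outlier_rate(q_values, threshold=3.0):
    """Fraction of fits whose discrepancy statistic satisfies
    |Q| >= threshold (threshold = 3 mimics a 3-sigma outlier)."""
    q = np.asarray(q_values, dtype=float)
    return float(np.mean(np.abs(q) >= threshold))
```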

Figure 12. Rate of 3σ outliers (i.e., rate of ∣Qi;j,k ∣ ≥ 3 discrepancies) for four of the fitted parameters. We see the 3σ outlier rates generally approach the rate found in the 1024 node-per-layer emulator, which is caused only by sampling variance, as network width increases.


4. Discussion

In this section, we discuss the implications of our results and possible avenues for future work on efficient SPS computation.

4.1. Optimal Architecture

The key question this work aims to answer is how to determine what ANN architecture one should use when fitting a given set of galaxy SEDs. We aim to find the fastest (i.e., simplest) emulator that achieves posterior accuracy goals; however, the only metrics we have available for evaluating an emulator's ability to recover accurate posteriors (without running the entire analysis performed in this work), such as emulation precision, are all in flux space. Due to the complex and degenerate nature of the parameter-SED relationship, it is difficult to correlate a given parameter accuracy target with a flux precision target.

A natural target would be to compare the emulation precision to the precision of the observations being fit. If provided with only the results shown in Figure 5, one would conclude that all emulators except for the 32 node-per-layer emulator would potentially be sufficiently accurate and precise since the precision in some or all of their filters exceeds the precision of the observations (i.e., typical emulation errors of ∼5% or better). If we make a further constraint that all emulated filters must provide precision better than the observational uncertainty floor of 5%, then the 64 node-per-layer emulator would also be excluded. In addition, one would note from Figure 6 that correlated emulation errors are smaller in magnitude and predominantly concentrated in filters with similar effective wavelengths for wider architectures. Then, since we aim to choose the fastest architecture that meets our accuracy and precision targets (i.e., Figure 3), we would conclude from these a priori results that the 128 node-per-layer emulator would be the correct choice for this physical model and set of observations. However, if one is not as concerned about the runtime of the emulator, then increasing the width is desirable in order to minimize the emulator's contribution to the overall error budget.

We can then verify whether or not this decision to use the 128 node-per-layer emulator for SED fitting would actually lead to accurate posterior distributions. We find in Figure 11 that the three emulators consisting of at least 256 nodes per layer all converge to common distributions of Qi;j,k , with catastrophic outlier rates approaching the ideal value found by running duplicate 1024 node-per-layer fits as demonstrated in Figure 12. At this level of flux precision, the discrepancies of the posteriors can be solely attributed to sampling variance. In other words, the improved flux precision that comes from using emulators with more than 256 nodes per layer is not being met with a significant improvement in the recovery of accurate posterior distributions. Therefore, this a posteriori analysis would conclude that the 256 node-per-layer emulator is the optimal choice for this task.

Thus, there is a small discrepancy between the conclusions found by these two avenues of inquiry: the a priori results would indicate that the 128 node-per-layer emulator should be chosen due to it being the simplest emulator whose precision exceeds the observational precision (see Figure 5), while the a posteriori results would indicate that the 256 node-per-layer emulator should be chosen due to it being the simplest emulator that achieves similar distributions of discrepancies (see Figure 11) and similar rates of catastrophic outliers (see Figure 12) up to the limit imposed by sampling uncertainty. This demonstrates that the emulation precision acts as a good rule of thumb for choosing an appropriate architecture but fails to tell the full story. This is likely partly due to the long tails in emulators' errors (i.e., Figure 10), which result in the emulator producing significant errors (e.g., larger than 5% in flux) more often than one would predict when assuming Gaussian errors. Furthermore, not all parameters are affected equally by this, and thus the choice of parameters being fit must also enter into consideration when selecting an appropriate architecture. Figure 11 shows that M*,formed and τAGN are largely unaffected by the errors introduced by the 128 node-per-layer emulator, while τ2 and especially Zgas are more affected. For this reason, we deem the three emulators with at least 256 nodes per layer as being sufficient for use in fitting galaxy SEDs, the 128 node-per-layer emulator as being marginally sufficient (e.g., useful for cases where τ2 and Zgas are not being fit), and the remaining two emulators (32 and 64 nodes per layer) as being insufficient.

Based on these results, we propose a method for choosing a minimal emulator architecture for a given physical model and set of observations. First and foremost, more complex physical models require more complex emulator architectures for a given precision target due to the more complicated input/output relationship that these models pose. Thus, to prevent the emulator from adding significant noise to constraints on galaxy parameters, an architecture must be chosen that provides sufficient accuracy and precision for the emulator to exceed the observational uncertainties in all of the filters in question. Then, the covariance matrix must also be examined to ensure that an assumption of uncorrelated errors is valid, so that one can convolve an emulator's σi precision with the observational uncertainty for that filter. This work has shown that these steps, when combined, provide emulators capable of producing posteriors that are as accurate as sampling variance will permit for most parameters. However, a slight improvement in the emulator's precision (e.g., further doubling the network width, which produces emulation precision that is ∼5 times better than the observational uncertainties) can deliver a notable improvement in parameter recovery for some parameters, and may provide more confidence in posterior accuracy in situations where absolute accuracy is desired and the fastest possible speed is not strictly necessary.
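This selection procedure can be sketched as a simple filter over candidate architectures, ordered from narrowest to widest (the data structure, function name, and correlation threshold below are illustrative choices of ours, not values prescribed by the text):

```python
import numpy as np

def select_architecture(candidates, sigma_obs, max_offdiag=0.05):
    """Pick the narrowest emulator meeting the a priori criteria:
    (1) emulation precision beats the observational uncertainty in
    every filter, and (2) residual correlations are small enough that
    treating the errors as independent is safe.

    candidates : dict mapping width -> (per-filter sigma_emul array,
                 normalized covariance matrix sigma_ij / (sigma_i sigma_j))
    sigma_obs  : per-filter observational uncertainty array
    """
    for width in sorted(candidates):
        sig_emul, corr = candidates[width]
        offdiag = corr - np.diag(np.diag(corr))
        if np.all(sig_emul < sigma_obs) and np.abs(offdiag).max() < max_offdiag:
            return width
    return None  # no candidate meets the targets

# illustrative candidates: 64 nodes fails the precision cut, 128 passes
candidates = {
    64: (np.full(3, 0.08), np.eye(3)),
    128: (np.full(3, 0.02), np.eye(3)),
    256: (np.full(3, 0.01), np.eye(3)),
}
chosen = select_architecture(candidates, np.full(3, 0.05))
```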

4.2. Distributions for Training, Testing, and Priors

In this work, we deviate from previous works (e.g., Alsing et al. 2020; Kwon et al. 2023) in not using our prior distributions as the probability distributions used for generating our training and test sets. This is a subtle change, but an important one—while informative priors downweight the probability calculation in portions of parameter space that are either known or highly suspected to be rarely occupied by real galaxies, using these same nonuniform distributions for generating training sets will have the unintended consequence of also causing the emulator to maximize its errors for rare objects.

For example, the prior distribution used for the adjacent SFR ratios (which are SFH parameters controlling the ratio of SFR in adjacent time bins) is an extremely narrow distribution, providing almost no prior density near the upper and lower limits of the prior (which are values corresponding to objects which have rapidly increasing or decreasing SFRs in some portion of their SFHs). Specifically, only ∼0.2% of all samples in the training set have $4\leqslant | {\mathrm{log}}_{10}\left({\mathrm{SFR}}_{i}/{\mathrm{SFR}}_{i+1}\right)| \leqslant 5$.

To demonstrate the effect this subtlety can have, we generate two additional mock galaxy observations (denoted Galaxy A and Galaxy B) using the same specifications as the mock catalog created in Section 2.1 (including random 0.1 mag observational errors) but with manually specified SPS parameters. These two galaxies only differ by their SFH: Galaxy A represents a galaxy whose SFH parameters are located at the center of the training set and prior distributions (i.e., ${\mathrm{log}}_{10}({\mathrm{SFR}}_{i}/{\mathrm{SFR}}_{i+1})=0$), thus producing a flat SFH with mass-weighted age 〈tm ≈ 1.89 Gyr, while Galaxy B represents a galaxy whose SFH parameters are located far from the center of the distributions (i.e., ${\mathrm{log}}_{10}({\mathrm{SFR}}_{i}/{\mathrm{SFR}}_{i+1})=-2.5$), in this case producing a galaxy with a rapidly declining SFH with 〈tm ≈ 3.57 Gyr. Both galaxies are generated at z = 1.75, where tuniv ≈ 3.77 Gyr.

We then fit these observations using the same procedure as before, but in this case, we also fit for redshift using a prior distribution uniform (0.5, 3.0) to increase the sensitivity to small emulation errors due to the additional degeneracies redshift introduces. Furthermore, in addition to fitting the mock observations with the 1024 node-per-layer emulator described in this work (denoted as "Thick Tail" in Figures 13 and 14), we also fit the observations with a similar 512 node-per-layer emulator 35 whose training set used the narrower prior distributions listed in Table 1 rather than the wider training set distributions (denoted as "Thin Tail"). Figure 13 shows the true emulation errors for each of these emulators when evaluated at the true mock parameters. For both galaxies, we see that the thin tail emulator is less precise than the thick tail emulator; however, this effect is far more pronounced for Galaxy B, which lies in a region of parameter space where the thin tail emulator's training set is poorly sampled, with emulation errors in excess of 0.1 mag.

Figure 13. The emulation precision of the two emulators used to fit the two mock galaxies described in Section 4.2. Plotted in red is the emulation error for the emulator whose training set used the thinner prior distributions, and plotted in blue is the emulation error for the 1024 node-per-layer emulator described in this work. In both cases, the emulation error is evaluated using the true parameter values for the mock observations. We see that the precision of the thin tail emulator is dramatically worse for Galaxy B, whose SFH parameters are extreme, while the thick tail emulator does well for both galaxies.


We then see in Figure 14 the effect this degradation in emulation precision can have on an SED fit. For Galaxy A, both emulators accurately recover the galaxy's true stellar mass formed and mass-weighted age. However, for Galaxy B, the thin tail emulator does not fare as well as the thick tail emulator, producing posteriors that exclude the mock truth with high probability. As such, we conclude that it is important to carefully consider the distributions used when generating an emulator's training set, and to be aware of the unintended consequences that using physical prior distributions can have on SED-fitting results.

Figure 14. The posteriors for the fits to Galaxies A and B as described in Section 4.2. Left/right panels show the posteriors for the fits to Galaxy A/B for the total stellar mass formed ${\mathrm{log}}_{10}({M}_{* ,\mathrm{formed}}/{M}_{\odot })$ and the mass-weighted age ⟨t⟩m. The thick dashed black line denotes the true parameter value, while the thin colored dashed lines show the median of the posterior along with the ±1σ quantiles.


4.3. Emulating Difficult Parameters

In our tests, one problem we encountered was the difficulty in expanding the redshift training/test set distributions. This does not necessarily come as a surprise—an emulator trained on a training set where redshift is held to a fixed value only needs to learn how the rest-frame spectral flux density in each selected filter (with filter transmission curves blueshifted to the rest frame) changes as a function of each of the input parameters, and regions of the rest-frame SED not well covered by the selected filters' transmission curves can even be ignored entirely. 36 However, an emulator trained on a training set where redshift is allowed to vary must instead learn how the rest-frame spectrum (either in full or in large portions) changes as a function of parameters, along with the characteristics of each filter's transmission curve. This task necessarily becomes more difficult as the redshift domain an emulator is trained on is widened. Furthermore, this is a problem that is specific to photometric emulators, as most of the effects of redshift can be applied independently in the case of spectroscopic emulators (i.e., the emulator would predict the rest-frame spectrum, although the SFH definition used in our physical model is dependent on the age of the universe at a given redshift, and thus redshift would still need to be included as an input). Therefore, expanding the redshift domain will typically result in a loss in emulation precision, or alternatively, a longer execution time if a wider emulator architecture is selected to compensate for the loss in precision.

To combat these problems, we propose a method of stitching separate emulators trained on a variety of redshift domains into a single composite emulator. Suppose we have trained K emulators across various overlapping redshift domains, such that emulator i is trained on the redshift domain zL,i zzH,i , the emulators collectively cover the redshift domain zLzzH, and a maximum of two emulators overlap at any given redshift. Then, we can define a transition angle θi (z),

Equation (11)

where we also define zH,0 = −∞ and zL,K+1 = +∞ in order to handle the cases at the edges of the redshift domain. We then wish to generate a coefficient ai (z) representing the contribution emulator i should give to the total spectral flux density estimate:

Equation (12)

This coefficient scheme has a few desirable characteristics for use with emulators. First, it gracefully handles overlapping emulation domains: the squared sine and cosine functions let two overlapping emulators vote against one another such that ${\sum }_{i=1}^{K}{a}_{i}(z)=1$. Second, it has a continuous derivative as a function of redshift, which is useful with gradient-based samplers (see Section 4.4). However, it must be stressed that the overlaps must be chosen such that no more than two coefficients are nonzero at any given redshift (i.e., only two emulators may overlap at any given redshift). Furthermore, because two emulators must be evaluated at redshifts within an overlap, this method produces a slight increase in execution time, which grows with the width and number of the overlaps.
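One plausible realization of this coefficient scheme, reconstructed from the description above (squared sine and cosine weights across each overlap so the coefficients sum to unity with continuous derivatives); the exact angle mapping and all names here are our assumptions, not the paper's Equations (11)–(12) verbatim:

```python
import numpy as np

def transition_angle(z, z_lo, z_hi):
    """Map z linearly onto [0, pi/2] across the interval [z_lo, z_hi]."""
    t = np.clip((z - z_lo) / (z_hi - z_lo), 0.0, 1.0)
    return 0.5 * np.pi * t

def blend_coefficients(z, domains):
    """Coefficients a_i(z) for emulators with domains [(zL_i, zH_i), ...],
    ordered in redshift with at most pairwise overlap. Within an overlap,
    squared sine/cosine weights guarantee sum_i a_i(z) = 1 with a
    continuous derivative in z."""
    K = len(domains)
    a = np.zeros(K)
    for i, (z_lo, z_hi) in enumerate(domains):
        if not (z_lo <= z <= z_hi):
            continue  # emulator i contributes nothing outside its domain
        prev_hi = domains[i - 1][1] if i > 0 else -np.inf     # z_{H,i-1}
        next_lo = domains[i + 1][0] if i < K - 1 else np.inf  # z_{L,i+1}
        w = 1.0
        if z < prev_hi:   # inside the lower overlap: fade in as sin^2
            w *= np.sin(transition_angle(z, z_lo, prev_hi)) ** 2
        if z > next_lo:   # inside the upper overlap: fade out as cos^2
            w *= np.cos(transition_angle(z, next_lo, z_hi)) ** 2
        a[i] = w
    return a
```

Evaluated at z = 2.0 inside the first overlap of the Table 2 domains, this returns equal weights of 0.5 for the two overlapping emulators; the composite flux of Equation (13) is then the a-weighted sum of the individual emulator outputs.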

The total spectral flux density Eν (z, ...) can then be computed as a linear combination of the individual emulator outputs Eν,i (z, ...),

Equation (13)

and in Figure 15, we demonstrate one potentially useful application of this technique. Following the launch of JWST, there has been a deluge of candidate high-redshift, high-mass galaxies discovered (e.g., Naidu et al. 2022; Atek et al. 2023; Labbé et al. 2023). This has led to a demand to rapidly parse incoming photometric observations for new candidates, and to distinguish high-redshift, high-mass candidates from low-redshift, low-mass interlopers (e.g., fits that incorrectly interpret the Balmer break as the Lyman break). This task would require an emulator that covers a far wider redshift range than the one used in this work (i.e., Δz = 3.0 − 0.5 = 2.5), and would thus be a good candidate for a stitched emulator. Shown in Figure 15 is an example of a composite emulator that collectively covers a redshift domain of 0 ≤ z ≤ 12, allowing it to fit a large number of the high-redshift, high-mass candidates found to date. Six individual emulators are used in this example (with limits and widths listed in Table 2), each trained on a Δz ≈ 2.5 redshift domain, with overlap widths of Δz = 0.5. Assuming a sampler uniformly explores all redshifts within the composite domain, this choice of overlaps would result in a typical increase in execution time of ∼21% compared to running any of the individual emulators. Additionally, since the width of each emulator's redshift domain is similar to that of the emulator used throughout this work, each emulator's emulation precision and execution time would likely be similar to those found in Section 3.1. For demonstration purposes, the maximum redshift covered by this scheme is z = 12, but this could be extended to any higher redshift by appending additional emulators covering higher redshift domains.
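The ∼21% overhead quoted above follows from simple expectation arithmetic over the composite domain, assuming (as stated) a sampler that explores redshift uniformly:

```python
# Expected execution-time overhead of the 0 <= z <= 12 composite emulator:
# two emulators must be evaluated inside an overlap, one elsewhere.
z_total = 12.0
overlap_total = 5 * 0.5                       # five overlaps of width dz = 0.5
frac = overlap_total / z_total                # fraction of domain needing 2 calls
expected_cost = (1.0 - frac) * 1 + frac * 2   # relative to one emulator call
print(expected_cost)                          # ≈ 1.21, i.e. a ~21% increase
```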

Figure 15. An example of the emulator stitching method proposed in Section 4.3 to create a composite emulator spanning the redshift range of 0 ≤ z ≤ 12. To construct the composite emulator, six separate emulators are trained on slightly overlapping redshift domains (whose limits are represented by dashed lines), and they are then given a coefficient ai (z) to represent the contribution they should make to the overall flux estimate. Each emulator in this example covers a nominal redshift domain of Δz ≈ 2.5, similar to the redshift domain covered by the emulator in this work, and thus, would likely have similar emulation precision and execution time.


Table 2. Example 0 ≤ z ≤ 12 Composite Emulator

Emulator (i)    Lower Limit (zL,i)    Upper Limit (zH,i)    Width (Δzi)
1               0.00                  2.25                  2.25
2               1.75                  4.25                  2.50
3               3.75                  6.25                  2.50
4               5.75                  8.25                  2.50
5               7.75                  10.25                 2.50
6               9.75                  12.00                 2.25


4.4. Gradient-based Sampling

As described in Section 2.3, the sampler used in this work to perform the SED fits with Prospector was dynesty, a nested sampler that uses MCMC sampling. While this has been demonstrated to be a suitable sampling method for fitting galaxy SEDs (e.g., Leja et al. 2019), the computational burden of SED fitting with an emulator could likely be reduced further by using a gradient-based sampler instead (Neal 2011; Betancourt 2017), such as Hamiltonian Monte Carlo (Duane et al. 1987) or the No-U-Turn Sampler (Hoffman & Gelman 2011). ANNs are typically implemented in libraries compatible with automatic differentiation, which can provide exact evaluations of an emulator's derivative at a computational cost comparable to a hard-coded derivative, making them a prime candidate for pairing with a gradient-based sampler. However, this method should be approached with caution: gradients are not accounted for in the loss function used to train the emulators in this work, so the accuracy of emulated gradients is not directly controlled.
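To make the automatic-differentiation point concrete, the sketch below hand-derives the Jacobian of a toy one-hidden-layer photometric emulator via the chain rule; this is exactly the quantity an AD framework (e.g., JAX) would supply automatically at comparable cost. The architecture, shapes, and weights here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 5)), np.zeros(64)   # 5 parameters -> 64 hidden
W2, b2 = rng.normal(size=(7, 64)), np.zeros(7)    # 64 hidden -> 7 bands

def emulate(theta):
    """Toy emulator: predicted magnitudes in 7 bands from 5 parameters."""
    return W2 @ np.tanh(W1 @ theta + b1) + b2

def emulate_jacobian(theta):
    """d(mags)/d(theta), hand-derived via the chain rule through tanh."""
    h = np.tanh(W1 @ theta + b1)
    return (W2 * (1.0 - h**2)) @ W1   # equals W2 @ diag(1 - h^2) @ W1

theta = rng.normal(size=5)
J = emulate_jacobian(theta)           # shape (7, 5); feeds an HMC/NUTS sampler
```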

Alternatively, when using a gradient-based sampler, it may not be necessary to resort to an ANN emulator for SPS at all. There has been recent work to create differentiable SPS codes, which produce identical predictions to a code like FSPS (unlike an emulator, which is approximate) but do so in a way that can exploit the automatic differentiation capabilities of packages like JAX (Bradbury et al. 2018) in Python or ForwardDiff.jl (Revels et al. 2016) in Julia. One such implementation is DSPS (Hearin et al. 2023), which can generate predicted SEDs with roughly the same execution time as ANN emulators. However, DSPS is currently only capable of generating SEDs using a precomputed SSP spectral grid, and bringing it to feature parity with FSPS will likely result in either a slower execution time (if stellar spectral libraries and isochrones are used as the base) or an extremely high memory requirement (if high-fidelity precomputed SSP grids are used). Regardless, pairing a differentiable SPS code with a gradient-based sampler to explore a posterior distribution more efficiently would likely provide a net improvement in SED-fitting execution times.

5. Conclusions

In this work, we have evaluated the performance of SPS emulators with six different ANN architectures, with evaluations in both flux space and parameter space, to assess the relationship between emulator architecture, precision, and execution time, including testing each emulator's ability to recover physical parameters from observed galaxy SEDs. From our analyses, we have drawn a number of conclusions:

  • 1.  
    ANN emulators dramatically reduce the computational burden of SPS, with an upper limit on the execution-time improvement of ∼10⁵ compared to their native SPS counterparts on a CPU. The actual improvement will be smaller, depending on the ANN architecture required to reproduce fluxes to within the target precision (more complex architectures require longer execution times), but most architectures that provide reasonably accurate and precise emulators are at least ∼10³ times faster. In addition, we find that emulation execution time scales with network width as ∝ N². Simpler architectures are preferred in this respect.
  • 2.  
    ANN emulators can accurately, precisely, and quickly match the predictions given by existing SPS codes to well within the typical observational uncertainties encountered in real-world data, with emulation precision improving as the number of nodes per layer in the ANN emulator increases. Emulators in this work have precisions ranging from ∼0.2 mag at worst to ∼0.005 mag at best. Furthermore, we find that emulation precision tends to scale with network width as ∝ N⁻¹. More complex architectures are preferred in this respect.
  • 3.  
    ANN emulators typically have less correlated emulation errors when using more complex architectures, allowing users to safely convolve observational uncertainties with estimated emulation precision without requiring a full covariance matrix.
  • 4.  
    Emulators are capable of recovering posterior medians that are consistent with fits run with traditional SPS codes, with no apparent bias and typical differences only of order 25%–40% for stellar mass, stellar metallicity, SFR, and stellar age.
  • 5.  
    Based on a set of ∼10⁴ SED fits to observed galaxies in the deep extragalactic COSMOS field, we find that posteriors tend to coalesce to a common result (within sampling variance) as the emulators' precision increases. Convergence begins to occur at roughly the same architecture where the scale of observational errors exceeds the scale of typical emulation errors and where correlations in emulation errors between photometric filters become insignificant, setting a good benchmark for a minimal architecture for future studies that require the use of emulators.
  • 6.  
    By fitting a sample of 10³ mock observations using both an emulator and FSPS, we show that both emulators and FSPS are able to recover the true physical parameters for a galaxy to within the same level of precision.
  • 7.  
    The distributions chosen for generating emulator training sets can have large effects on an emulator's ability to fit rare objects. Choosing a nonuniform distribution leads to inaccurate flux predictions in regions of parameter space that are poorly sampled in the training set, which then leads to inaccurate posteriors when fitting SEDs. Using the same distributions for the training set as are used as a prior leads to this exact scenario, and is especially problematic since the most interesting objects tend to reside in rare regions of parameter space.
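Points 1 and 2 combine into a simple planning rule: with σ_emu ∝ N⁻¹ and t_exec ∝ N², execution time scales as σ_emu⁻², so halving the emulation uncertainty costs roughly four times the runtime. The constants below are purely illustrative (only the scalings, not these values, come from this work):

```python
C_SIGMA = 2.0    # mag * nodes: sigma_emu ~ C_SIGMA / N  (illustrative constant)
C_TIME = 1e-9    # s / node^2:  t_exec   ~ C_TIME * N**2 (illustrative constant)

def width_for(target_sigma):
    """Smallest network width N meeting a target emulation uncertainty."""
    return C_SIGMA / target_sigma

N_coarse, N_fine = width_for(0.01), width_for(0.005)  # halve the uncertainty
t_coarse, t_fine = C_TIME * N_coarse**2, C_TIME * N_fine**2
print(t_fine / t_coarse)   # → 4.0: 2x better precision costs 4x the time
```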

In sum, these conclusions pave the way for ANN-based emulators to be used in real-world SED-fitting applications. We demonstrate not only that photometric emulators can dramatically speed up SED fitting, but also that the results of these emulator-based fits can be trusted. Useful avenues of further inquiry include optimizing the architectures used for SPS emulation (e.g., convolutional layers) and extending this analysis to high-resolution spectroscopic emulators. These results open the door to fitting large photometric surveys with SPS emulators, reducing computational costs while preserving trustworthy results.

Acknowledgments

Based on observations with the NASA/ESA Hubble Space Telescope obtained from the Space Telescope Science Institute, which is operated by the Association of Universities for Research in Astronomy, Incorporated, under NASA contract NAS5-26555. Support for Program number AR-16146.008-A was provided through a grant from the STScI under NASA contract NAS5-26555. Computations for this research were performed on the Pennsylvania State University's Institute for Computational and Data Sciences' Roar supercomputer. K.W. acknowledges support for this work from the Astrophysics Data Analysis Program under NASA grant No. 80NSSC20K0416.

Software: Astropy (Robitaille et al. 2013), dynesty (Skilling 2004, 2006; Speagle 2020; Koposov et al. 2022), Flux.jl (Innes 2018; Innes et al. 2018), FSPS (Conroy et al. 2009; Conroy & Gunn 2010), JAX (Bradbury et al. 2018), NumPy (Harris et al. 2020), Plots.jl (Breloff 2022), Prospector (Leja et al. 2017; Johnson et al. 2021), python-fsps (Johnson et al. 2021), sedpy (Johnson 2021).

Footnotes

  • 16  

    This is not an exhaustive list of galaxy properties that influence a galaxy's SED. Refer to the Conroy (2013) review article for a more in-depth discussion on the topic of SPS.

  • 17  

    CPU benchmarks throughout based on an AMD Ryzen 7 2700X CPU.

  • 18  

    This work focuses on addressing timing issues that face on-the-fly SED-fitting codes like Prospector. Grid-based SED-fitting codes perform better in this regard by utilizing precomputed grids, but face multiplicative issues when adding more parameters (i.e., more grid axes) to their physical model.

  • 19  

    Assuming a typical cost per CPU hr of USD 0.10 and that each object is only fit once, which is unlikely to be the case as additional fits will likely be desired to determine how new VRO observations of each object further constrain objects' physical parameters.

  • 20  

    Environmental impact based on an assumed ∼1 g CO2 equivalent emissions per CPU hour. Aircraft comparison based on fuel capacity and range data provided by Boeing Commercial Airplanes (2021) for the Boeing 787-9, an assumed cruise speed of 900 km hr−1, and an assumed 10 kg CO2 equivalent emissions per US gallon of jet fuel burned.

  • 21  

    h = 0.6932, Ωm,0 = 0.2865, Ωb,0 = 0.04628, TCMB,0 = 2.725 K, Neff = 3.04, mν = 0 eV/c².

  • 22  
  • 23  

    The version of FSPS used in this work was the state of the Git repository at commit 5327f501ee71c8e40aa00d322d84d65294936946.

  • 24  

    The version of python-fsps used in this work was the state of the Git repository at commit d689181f6dd609d64f3c22b63be4520be017a657.

  • 25  

    The version of sedpy used in this work was v0.3.0.

  • 26  

    It should be noted that while Alsing et al. (2020) describe a mean square error (MSE) loss function, the Speculator source code shows that the RMSE loss function is used. We make this distinction here as we found that trained emulators' accuracy and precision improve significantly when the RMSE loss function is used in place of MSE.

  • 27  

    In short, gradient descent with Nesterov momentum delays the computation of the gradient until after the momentum has been applied, as opposed to the traditional method where it is applied after the gradient has been computed.

  • 28  

    The version of Prospector used in this work was v1.1.0.

  • 29  

    See Leja et al. (2017), Figure 29.

  • 30  

    The version of dynesty used in this work was v1.2.3.

  • 31  

    It should be noted that emulator-based fits are currently limited to a factor of ∼40 speedup in comparison to non-emulator-based fits due to bottlenecks unrelated to SPS in Prospector and dynesty.

  • 32  

    We would like to emphasize that we are not trying to claim that these settings are to be used in general use—we are selecting these settings solely to make the small differences between emulators more evident at the ∼3σ level.

  • 33  

    ANNs typically perform much faster on GPUs than CPUs, but we restrict this study to timings on a CPU due to the fitting software used in this work not being able to make effective use of a GPU's capabilities. Loosely speaking, one may expect about a factor of ∼10 reduction in execution time when using a GPU, as per Alsing et al. (2020).

  • 34  

    For brevity, the unit "mag" used for the remainder of this work is understood to refer to the arsinh magnitude system defined in Section 2.1. This distinction has no meaningful impact on fluxes brighter than mAB ≈ 32.

  • 35  

    The same architecture was not used for the thin tail emulator because adjusting the training set distributions affects the emulation precision after training. In this case, the 512 node-per-layer emulator provided test-set precision estimates similar to those of the 1024 node-per-layer emulator.

  • 36  

    This is by no means the exact logic used to train the emulator in either case; the logic used during training is whatever minimizes the loss function. However, it is indicative of the simplifications present when training on a fixed-redshift training set (or one sampled from a relatively narrow distribution of redshifts) and of the complexities present when redshift is allowed to vary.
