Propagating Uncertainties in the SALT3 Model Training Process to Cosmological Constraints

Type Ia supernovae (SNe Ia) are standardizable candles that must be modeled empirically to yield cosmological constraints. To understand the robustness of this modeling to variations in the model training procedure, we build an end-to-end pipeline to test the recently developed SALT3 model. We explore the consequences of removing pre-2000s low-$z$ or poorly calibrated $U$-band data, adjusting the amount and fidelity of SN Ia spectra, and using a model-independent framework to simulate the training data. We find the SALT3 model surfaces are improved by having additional spectra and $U$-band data, and can be shifted by $\sim 5\%$ if host galaxy contamination is not sufficiently removed from SN spectra. We find that resulting measurements of $w$ are consistent to within $2.5\%$ for all training variants explored in this work, with the largest shifts coming from variants that add color-dependent calibration offsets or host galaxy contamination to the training spectra, and those that remove pre-2000s low-$z$ data. These results demonstrate that the SALT3 model training procedure is largely robust to reasonable variations in the training data, but that additional attention must be paid to the treatment of spectroscopic data in the training process. We also find that the training procedure is sensitive to the color distributions of the input data; the resulting $w$ measurement can be biased by $\sim2\%$ if the color distribution is not sufficiently wide. Future low-$z$ data, particularly $u$-band observations and high signal-to-noise ratio SN Ia spectra, will help to significantly improve SN Ia modeling in the coming years.


INTRODUCTION
Since the discovery of cosmic acceleration (Riess et al. 1998;Perlmutter et al. 1999), Type Ia supernovae (SNe Ia) have played an important role in constraining the dark energy equation of state, w.Calibrated SN Ia distances at low redshift are also used to derive the most precise local measurements of the Hubble constant (H 0 ; Riess et al. 2022), currently in tension with inferred H 0 Corresponding author: M. Dai mi.dai@jhu.eduvalues from the cosmic microwave background (Planck Collaboration et al. 2018; see reviews from Verde et al. 2019;Di Valentino et al. 2021).
The most recent cosmological constraints from SNe Ia use up to ∼1500 SNe Ia (Scolnic et al. 2018;Jones et al. 2019;Brout et al. 2022) and in the near future, surveys such as Vera Rubin Observatory's Legacy Survey of Space and Time (The LSST Dark Energy Science Collaboration et al. 2018) and the SN survey from the Nancy Grace Roman Telescope (Hounsell et al. 2018;Rose et al. 2021) will increase the number of wellmeasured SNe Ia by orders of magnitude.Although the statistical uncertainties of the cosmological measurements will be greatly reduced with these future data sets, understanding and reducing the systematic uncertainties will be crucial.Most previous studies have found that the dominant systematic uncertainties in SN Ia cosmology analyses are caused by photometric calibration of the data in the cosmology sample and the sample used to train the SN for standardization (Betoule et al. 2014;Scolnic et al. 2018;Brout et al. 2019;Jones et al. 2019;Brout et al. 2022).However, explorations of systematic uncertainties in the SN standardization model are typically limited to photometric calibration offsets; potential systematic errors in the training process or definition of the model have been explored (e.g., Mosher et al. 2014) but are rarely propagated to cosmological parameter uncertainty budgets.
Substantial recent effort has been put into developing new SN Ia models (Saunders et al. 2018;Léget et al. 2020;Boone et al. 2021;Pierel et al. 2021;Kenworthy et al. 2021;Mandel et al. 2022).All SN Ia models for cosmology to date are empirical models and therefore require a training sample of well-measured SNe (with or without spectral data); therefore, it is important to understand how robust a SN model is to the choice and quality of the training sample, and the effect of that training sample on the resulting cosmological parameter measurements.In particular, given the large impact of photometric calibration uncertainties on the SN model training, other systematic uncertainties related to the model training must also be better understood.Finally, it is important to understand the ways in which new data could enhance the model, such as data from ATLAS (Tonry et al. 2018), the Zwicky Transient Facility (Bellm et al. 2019), the Carnegie Supernova Project (CSP, Krisciunas et al. 2017), and the Young Supernova Experiment (Jones et al. 2021).
The SALT2 model (Guy et al. 2007(Guy et al. , 2010;;Betoule et al. 2014;Taylor et al. 2021) is the baseline SN standardization model used for nearly all measurements of w in recent years.Advantages of the SALT2 model include its ability to use high redshift SNe for training, built-in k-corrections due to its continuous SED model across both phase and wavelength, and the fact that the SN amplitude is a free parameter in the training procedure, making the training independent of the cosmological model.Unfortunately, the original training code and data are not fully publicly available, and are difficult to modify and improve for systematic studies, with Mosher et al. (2014) being the most recent study to explore alterations to the SALT2 training procedure.However, a new "SALT3" model has recently been developed (Kenworthy et al. 2021, hereafter K21), which shares much of the functional form as SALT2 but includes improvements in the training procedure and includes new data that extends the model wavelength range from ∼ 0.9 to 1.1 µm.Though we use the K21 model in this work, we note that Pierel et al. (2022) recently extended the SALT3 model surfaces to 2 µm, albeit using a data set with somewhat larger calibration uncertainties, and Jones et al. (2022) built a SALT3 model that includes a host-galaxy dependent model surface.
Here, we use the open-source SALT3 model training code, SALTshaker1 , to quantify how training data variations or unknown physics in the data can affect the measurement of cosmological parameters.In particular, we investigate the exclusion of low-z training data without precisely measured filter throughputs, the exclusion of poorly calibrated ground-based near-UV bands, variations in the number of SN Ia spectra, and the impact of wavelength-dependent calibration errors and host-galaxy contamination in those spectra using simulations generated from a previously trained SALT2 model (Pierel et al. 2018).Furthermore, we use simulations generated from BYOSED (Pierel et al. 2021), a model framework that is independent of SALT, to determine the sensitivity of cosmological parameters to SED features that are not modeled by the SALT framework.The BYOSED framework generates SN Ia SEDs using a base model with added "perturbers" derived from composite spectra of different SN Ia sub-populations; this produces variations in the SN spectra that are not modeled with existing frameworks, such as velocity and hostgalaxy mass variations.
For each of these training scenarios, we use a simulation-based approach to quantify how variations in the available training data or unknown physics affects the recovery of the trained model, the correlations between light-curve parameters and luminosity, and the measured value of w.For this purpose, we have built an end-to-end simulation and analysis pipeline, which will enable future analyses to propagate modeling systematics into the systematic uncertainty budget of cosmological constraints.
In Section 2 we describe our analysis and pipeline, and in Section 3 we discuss our simulation approaches and training sample variations.In Section 4 we discuss the recovered models, nuisance parameters, and cosmological parameters, and in Section 5 we discuss our results and conclude.

DESCRIPTION OF THE ANALYSIS AND THE PIPELINE
Modern SN Ia cosmology analyses include a complex set of stages to go from input photometric SN Ia light curve data to cosmological parameter measurements.Those stages include fitting for light-curve parameters in order to standardize the SN brightness, correcting for observational biases using simulations, estimating nuisance parameters that relate stretch and color to brightness, computing a covariance matrix to account for systematic uncertainties, and finally, fitting for cosmological parameters.The community has developed tools to perform these individual tasks (e.g., Kessler et al. 2009a;Guy et al. 2010;Rubin et al. 2015;Kessler & Scolnic 2017) and a pipeline to control the workflow (Hinton & Brout 2020).
Typically missing from the evaluation of systematic uncertainties, however, are estimations of systematics related to the training procedure.Here, we build an end-to-end cosmological analysis pipeline that includes a SN model training stage.While previous studies such as the Pantheon+ analysis (Scolnic et al. 2021) have performed model re-training to incorporate calibration offsets in the SALT model, here we build a more flexible framework for SALT training options and systematic uncertainty evaluation.This open-source pipeline is built in Python and ties together pre-existing code to perform many of the steps necessary to go from input data or simulations to cosmological parameter estimation.In particular, many stages are built around methods implemented within the SNANA software (Kessler et al. 2010(Kessler et al. , 2019a)); SNANA is a collection of SN analysis methods that perform simulations, light curve parameter estimation, distance measurement, and cosmological parameter fitting.The individual components of the pipeline are described in more detail below.
The input model in this analysis pipeline is flexible, but the model training and analysis stages currently assume a SALT3 model defined as where M 0 and M 1 are model components that describe an average spectral surface and its first-order variation, and CL is the color law, which is a polynomial between 2800 and 8000 Å and is extrapolated beyond those wavelength ranges.The parameters x 0 , x 1 and c are light curve parameters that describe the overall amplitude, shape, and color of the light curve.
Using those light-curve parameters, distances are measured using the Tripp estimator (Tripp 1998): where x 1 and c are stretch and color parameters from a SALT3 light-curve fit, and m B is the log of the lightcurve amplitude x 0 .Nuisance parameters α and β are estimated to determine the correlation of x 1 and c with luminosity.M is a combination of the SN absolute magnitude and H 0 , which are degenerate in this work.∆µ bias is the bias correction term that is determined from simulations and used to correct for selection biases.Most versions of the Tripp estimator used in the last decade also include the variable γ, the correlation between host-galaxy mass and distance measurement (Kelly et al. 2010;Lampeitl et al. 2010;Sullivan et al. 2010), but we do not include it in this work since it is a second-order effect and requires proper simulation of the relations between the host-galaxy mass, the SN light-curve parameters, and the SN brightness (Smith et al. 2020;Popovic et al. 2021).
The stages of the pipeline are described below and illustrated in Figure 1: time at peak and the light curve parameters as input to the training process.
• Generating "BiasCor" Simulations.To correct the measured light curve parameters for sample selection biases, the pipeline again generates a large sample of SNe Ia from the newly trained SALT3 model; this sample is typically a factor of 10-100 times larger than the training or test samples to prevent noise in the bias corrections from significantly contributing to distance uncertainties.The extent to which SALT3 does not accurately model the true SN Ia features will propagate to errors in the bias correction.
• Light-Curve Fitting.The simulated photometry of the test sample and the BiasCor sample are fit with the newly trained SALT3 model.
• Bias Correction and Distance Measurement.After light-curve fitting, the BiasCor simulations are used to determine a mapping between simulated versus measured light-curve parameters as a function of redshift using the "BEAMS with Bias Corrections (BBC)" method (Kessler & Scolnic 2017).The BBC method determines the nuisance parameters α and β and applies Equation 2 to measure distances.For computational efficiency, we use a 1D bias correction method within the BBC framework, which corrects those distances as a function of redshift for selection effects and returns maximum-likelihood distances binned by redshift.
• Cosmological Parameter Estimation.Finally, the pipeline uses these distances to fit a wCDM model in a maximum likelihood fit to measure w with a cosmic microwave background (CMB) prior using the R(z * ) shift parameter (e.g.Eq. 69 in Komatsu et al. 2009) with σ R = 0.007.This is a computationally efficient way of producing CMB constraints with a constraining power that is similar to Planck Collaboration et al. (2018).Finally, the pipeline evaluates biases in the measured cosmology with respect to the input cosmology.
Our pipeline is publicly available from the SALTShaker package2 .

EVALUATING SYSTEMATIC UNCERTAINTIES FROM THE SALT MODEL
To quantify biases on measurements of cosmological parameters due to the SALT3 training procedure, we create realistic simulations and use them as input "data" to the SALTShaker training code.In different simulations, we vary the training set or the underlying properties of the SED model used to generate the SN Ia sample.We describe each of these simulations in detail below and we analyze the effect of each simulation on the trained SALT3 model and on the recovered cosmological parameters after re-training the SALT3 model for each in Section 4.
In K21, SNANA simulations were generated that reproduced the measured light-curve parameters of the real data in order to create a simple approximation of the distribution of data for a simulation-based training.Here, we instead adopt an approach in which light-curve parameters are drawn from Monte Carlo-sampled distributions that will on average yield the same x 1 and c distributions as the real data.In the baseline simulations and the variants in Section 3.2 below, we use SALT2-extended as our input model to allow our simulated model to be independent of the SALTShaker training process.
For the samples discussed above, SNANA simulations have already been developed as part of previous cosmological analyses.The low-z, SDSS, Pan-STARRS and SNLS simulations used here are from Scolnic et al. (2018) with x 1 and c populations from (Scolnic & Kessler 2016), the DES simulations are from Kessler et al. (2019b), and the Foundation simulations are from Jones et al. (2019).We keep the same x 1 /c distributions that are developed by the works above.The parameters of the x 1 /c distributions for each survey are summarized in Appendix A. We note that the same distributions are used for simulating both the training sample (TrainSim) and the test sample (TestSim) in the pipeline.Each pipeline run generates one TrainSim for one TestSim.A comparison between the light-curve parameters measured from simulations versus real training data is shown in Figure 2. Spectra are also generated as part of these simulations as discussed in K21 such that the total number of spectra, the distribution of their phases, and their noise properties as a function of wavelength for each individual survey are approximately equal to the real K21 training sample.

Varying the Input Training Data
We modify the input data used for SALT3 training in several ways, listed below, to explore the potential effect of removing less reliable data from the model training and the impact of the number and quality of the spectroscopic data on the resulting model surfaces.• Removing low-z SNe without measured filter throughputs.The low-z sample used in K21 is a compilation of data from various surveys dating back to the 1990s.Prior to the CfA3 sample (Hicken et al. 2009), the filter transmission functions were estimates and color transformations were used so that synthetic colors matched observations of Landolt standard stars.The lack of precise filter transmissions can introduce unknown systematic uncertainties in the calibration of the SN photometry.In particular, Scolnic et al. (2015); Brout et al. (2021) found that the photometry of these surveys could be systematically off by up to 3%, but they lacked the statistics to correct for these offsets.
Thanks to the recent wealth of low-z data from CSP, Foundation, and the later CfA surveys, it is possible to train the SALT3 model after removing data without measured filter transmissions: the Calan/Tololo and CfA1-2 low-z samples.Rather than testing the systematic effect of calibration offsets,we test the statistical impact of excluding these low-z samples from the training.All other simulated data for this variant remain the same as in our baseline simulations, although the total amount of photometric and spectroscopic data is significantly smaller due to removing these samples.On average, each random sample of this variant consists of 1000 SNe with 609 spectra.The Dai et al.
loss of SNe is about 4%, while the loss of spectra is nearly 50%.
• Removing U -band data.The observer-frame ultraviolet data (U /u band, central wavelength <4000 Å) from low-z surveys and SDSS are particularly difficult to calibrate, and in past analyses U -band model trainings have been found to be unreliable (Kessler et al. 2009b).Recent efforts to improve the calibration of legacy SN Ia data (Scolnic et al. 2015;Currie et al. 2020;Brout et al. 2021) have not attempted to recalibrate the U/u bands.Because the SALT procedure can use well-calibrated high-z data in the optical bands (with central wavelength > 4000 Å) to train the UV model, it might be preferable to omit these poorly calibrated data entirely.We therefore test the effect of removing the U /u band data from our simulated training set.
• Reducing the number of spectra.Including both photometric and spectroscopic data in the SALT3 training allows the model to more reliably account for variation in spectral features, improving the fidelity of K-corrections.However, spectral data are expensive to obtain relative to photometric observations and must be iteratively recalibrated to match the best-fit model during the training process.For this reason, we test the effects of randomly removing half of the spectra from the simulated training data; this allows us to explore the degree to which spectroscopic data can be removed from the training process while still yielding reliable SN Ia distances.
The relative, wavelength-dependent flux calibration of SN spectroscopy can be highly uncertain.For this reason, SALTShaker re-calibrates the spectra to match the best-fit SALT model during each iteration of the training process.We test the robustness of this recalibration procedure by including simulations of mis-calibrated spectra in the training data.For each simulated spectra, a multiplicative calibration warp in the form of dF/dλ → dF/dλ × (1 + sλ) is applied, with s being randomly selected between −10 −5 Å−1 and +10 −5 Å−1 .This degree of re-calibration changes the spectral flux by up to ±6% over the range from 2800 to 8700 Å, which are the minimum and maximum effective wavelengths of the photometric filters used in the training.The SALTShaker spectral recalibration procedure uses a polynomial to estimate the spectral warping; however, unlike the warping used in the simulations, the spectra are recalibrated using the exponential of a polynomial (a different functional form from the simulations), with coefficients estimated during the training process (K21, their Equation 6).
• Including host-galaxy contamination in the spectra.In many galaxies, particularly at higher redshifts, the brightness of a SN Ia is often comparable to the surface brightness of its host galaxy.
There can be large uncertainties in subtracting host-galaxy light from the SN spectrum, which could add unphysical spectral features to the SN model or even shift the model's color if the spectral re-calibration procedure is insufficient to remove higher-order calibration offsets.In the real K21 input data, host-galaxy emission lines were removed in the low-z data, and contamination was removed in the high-z data by fitting for a simultaneous SN and host spectrum, and subtracting the host component (Guy et al. 2007).However, this procedure could leave some fraction of residual host contamination.
We simulate the host-galaxy contamination effect by adding a scaled host-galaxy spectrum to the SN spectra, with the host-galaxy spectrum randomly selected from a small host-galaxy library.This host-galaxy library is compiled from a random subset of high-S/N spectra of the host galaxies of SNe discovered during the Lick Observatory Supernova Search (Filippenko et al. 2001).The host-galaxy spectrum is scaled such that where dF host /dλ is the host spectra, dF peak /dλ is the SN spectrum at peak brightness, and S host is the normalization factor for the host spectrum for a given host-galaxy contamination fraction (HOSTSNFRAC).We test HOSTSNFRAC of 100%, 50% and 10%, with respect to the SN brightness at peak.Example simulated SN spectra with different fractions of host contamination and their simulated host-galaxy spectra are shown in Figure 3.

Varying the Input SN SED with the BYOSED Model
The SALT formalism models the SN Ia SED with only one shape and one color parameter, but additional variability in SN Ia spectra and photometry has been seen (e.g., Fakhouri et al. 2015;Saunders et al. 2018;Léget et al. 2020;Boone et al. 2021).Observed correlations between Hubble residuals and host-galaxy mass (Kelly et al. 2010;Lampeitl et al. 2010;Sullivan et al. 2010), other host properties (Rigault et al. 2013;Jones et al. 2018;Roman et al. 2018;Rigault et al. 2020;Kelsey et al. 2021), host-galaxy reddening (Brout & Scolnic 2021;Rose et al. 2022;Meldorf et al. 2022), and the potential correlation between Hubble residuals and SN ejecta velocity (Siebert et al. 2019;Dettman et al. 2021), give additional evidence for variability beyond the standard SALT parameters.
BYOSED enables us to test the reliability of the SALT formalism for measuring accurate distances from a flexible suite of SN Ia models.These models are not necessarily intended to be true representations of real SN Ia data, but instead are a priori plausible models for the ways in which SN Ia spectra could depend on the SN shape, color, and additional standardization parameters.We use the following BYOSED model variants to test the SALTshaker training framework: • Shape and Color.We follow Pierel et al. (2021) in simulating a baseline Hsiao et al. (2007) model with shape variations modeled as a simple wavelength-independent time dilation and color variations modeled using the SALT2 color law.
The Hsiao et al. (2007) model presents an average spectrum at every phase.While this model uses the same color law as the SALT2 model, it does not have the SALT spline-basis interpolation features imprinted on the model surface.
• Spectral variation as a function of hostgalaxy mass.Following Pierel et al. (2021), we add simulated host galaxy-based variation to the shape+color simulations above using the same host-galaxy mass perturbers created by Pierel et al. (2021).Pierel et al. (2021) generates the perturbers by making composite spectra from SNe with host-galaxy masses log(M * /M ⊙ ) > 10 and log(M * /M ⊙ ) < 10 from the Kaepora database (Siebert et al. 2019).Though some of the observed spectral variation is likely due to differences in the distribution of x 1 and c between highand low-mass host galaxies, these composites are an approximation for testing how effectively a SN that varies as a function of its host-galaxy properties can be modeled by a procedure that does not include host-galaxy dependence in its model assumptions.Fifty percent of our simulated SNe use the low-mass composite versus the high-mass composite.We simulated two different host-galaxy mass perturbers as in Pierel et al. (2021), one with a static host-galaxy component with no redshift evolution, the other with a host-galaxy component that decreases in amplitude as a function of redshift.
• Spectral variation as a function of SN velocity.Similar to the host-galaxy mass procedure above, we added velocity perturbers constructed by Pierel et al. (2021).These velocity perturbers are generated by constructing spectral composites for SNe with low and high Si II velocities (the divide between low and high velocity is located at the sample median of 11,000 km/s).We draw from the observed distribution of Si II velocities following Pierel et al. (2021) to randomly assign velocities to SNe.This variant is meant to mimic the spectral variations that caused a 2.7-σ shift in Hubble residual as a function of Si II velocity found by Siebert et al. (2019).

RESULTS
In this section, we compare results from each of the SALT3 training variants presented in the previous section.We first run our pipeline (Section 2) 40 times for the baseline scenario described in Section 3.1; the baseline simulations were designed to reproduce the SALT3.K21 training sample.Then for each training variant described in Section 3.2 and Section 3.3, we run our pipeline 20 times.Each pipeline run starts with a different random seed.We compute the average of the trained model surfaces, inferred distances, nuisance parameters, and measurements of w for the 40 baseline runs and the 20 runs of each variant.We compare the averages of each 20 variant results to the average of the 40 baseline results.
First, however, we examined w results from the baseline simulations and found a surprising w bias of −0.024 ± 0.006 relative to ΛCDM.We traced this bias to low-z simulations that adopted the Scolnic & Kessler (2016) x 1 and c distributions for the legacy low-z data, which is the sample containing most of our spectra and some of the best-sampled photometry.These distributions do not fully match those of the K21 training sample and in particular the color distribution of the K21 data is wider; when simulating the low-z training sample with x 1 and c distributions that more closely match those of K21, we find a statistically insignificant w bias of w = −1.005± 0.009.We discuss the potential reasons for the w bias in the baseline version of the simulations in Appendix B and conclude that training sets consisting of SNe with a wider distribution of colors, as in K21, will be more robust to cosmological biases.
For simulations that use the BYOSED models as the underlying models, we only compare the distances, and the measurements in w, since it is not meaningful to compare the model surfaces when the underlying models used for simulation are different.We compare the BYOSED variants with the BYOSED stretch and color model described in 3.3, which serves as the baseline for the BYOSED models.The simulations that use the baseline BYOSED model yield w = 0.971 ± 0.016, which is consistent with an expected offset found in Pierel et al. (2021).
For convenience, we define short names for each training variant in Table 1.Below, we discuss variations in the trained model surfaces and the color law (Section 4.1), changes in the resulting distance moduli (Section 4.2) and changes in the nuisance parameters and cosmological parameters (Section 4.3).

Variations in the trained model
In this section, we examine the differences in the trained SALT3 model surfaces given the training variants described in Section 3, including the principal components M 0 and M 1 and the color law CL.

Spectral components
We first examine the changes in model components with respect to the baseline M 0 component: We note that because M 1 can be zero or negative, we define it fractionally with respect to M 0 .Therefore, ∆M 1 /M 0,baseline is the fractional difference in the predicted light curve or spectrum of a SN having an x 1 parameter that is one standard deviation away from the mean.
Figure 4 shows the relative model changes in wavelength space for the SALT3 model components at SN peak brightness.For the variants shown in Figure 4 (left), the M 0 components are consistent within 2% with the baseline model between 3000 Å and 7500 Å, but deviate more in the bluer and redder regions beyond that wavelength range, by up to ∼ 10%.The M 1 components are mostly consistent despite a larger variation in the bluer region below 3000 Å.For variants with different fractions of host-galaxy contaminations shown in Figure 4 (right), the models are mostly consistent within 5% with respect to the baseline when there is 10% hostgalaxy light, but unsurprisingly, large (>5%) spikes are seen in the M 0 components and some regions of the M 1 components when there is greater host-galaxy contamination.

Integrated fluxes
In Figure 5, we show the relative model flux variations in the U BV RI bands.We integrate the model components M 0 and M 1 over the U BV RI passbands for each variant, and show the relative changes with respect to the baseline M 0 model flux (also integrated over Including host-galaxy contamination with a fraction as bright as 100% of the SN peak brightness HOST-50 Including host-galaxy contamination with a fraction as bright as 50% of the SN peak brightness HOST-10 Including host-galaxy contamination with a fraction as bright as 10% of the SN peak brightness BYO-STRETCH-COLOR Baseline BYOSED model with stretch and color effects BYO-VEL BYOSED model with a velocity effect added to the baseline BYO-HOST BYOSED model with a static host-galaxy mass effect added to the baseline BYO-HOST-Z-DEP BYOSED model with a redshift dependent host-galaxy mass effect added to the baseline  the U BV RI passbands): We find that the M 0 component for the BV RI light curves is consistent to within 2% at phases greater than −15 days for NO-LOWZ, NO-U, MIS-CAL-SPEC and HALF-SPEC.The M 1 component shows larger variations, up to ∼3% and biases of up to 4% relative to M 0 .
Dai et al.The U -band model fluxes show significantly greater variation compared to the other bands, however; the training variants NO-LOWZ, NO-U, MIS-CAL-SPEC and HALF-SPEC all have substantially higher variation compared to the other bands in the UV model fluxes.This is unsurprising, as each of these variants removes a significant fraction of the available U -band spectra or photometry.This test therefore indicates that for the U -band SALT model to be robust, additional observations are needed to cover the rest-frame U band, likely with corresponding high-S/N spectra, perhaps from future facilities that will have well-calibrated u-band data such as the Rubin Observatory.Furthermore, the increased variation in ∆M 1 indicates that there is value in obtaining a larger SALT3 training set in order to better constrain the first principal component of the model.Finally, when the mis-calibration effect is simulated in the SN spectra, we see ∼2-5% offsets in U and I, which are near the blue and red ends of the model's spectral range, respectively, and which may be a result of limitations in the SALT3 re-calibration procedure.
In Figure 5 (right) we examine the effect of including host-galaxy contamination in the SALT training spectra in more detail by showing model flux variations from the 100%, 50%, and 10% contamination training variants.While HOST-10 has variation at the level of ∼3% across most of the phase range, higher contamination yields further degradation in the training when using SN spectra with strong host-galaxy contamination.The HOST-50 and HOST-100 variants show color-dependent offsets of ≳5% in the U , R, and I band model fluxes; we note that because the model is normalized to maximum light in the rest-frame B-band, this band shows smaller variation at the few-percent level.High-z SNe in particular tend to be fainter relative to the local surface brightness of their host galaxies, which can result in substantial host contamination; this level of bias in the model surfaces shows that high-z SN spectra must have their host-galaxy contributions carefully removed to avoid model biases.

Color law
The change in color laws of each data variant are shown in Figure 6.We describe the color law change with respect to the baseline with the following equation: Here, ∆m(c, λ) is the change in magnitude as a function of wavelength for a SN with color c relative to the baseline model.We choose a nominal c = 0.1 for the comparisons in Figure 6, equal to a difference in reddening between the B and V bands, compared to an average SN Ia, of 0.1 mag.Typical SN Ia cosmology cuts limit c < 0.3, so we note that the reddest -and the bluest -SNe Ia in a typical data set would be biased by three times the values shown in this figure.
Similarly to the spectral components, the color laws are consistent to 2% or better for c = 0.1 between 3000 Å and 7500 Å, but have larger deviations in the bluer and redder regions.The SALT3 color law is extrapolated beyond 8000 Å, and these extrapolated regions at redder wavelengths in particular can deviate by up to ∼10% when including host contamination effects, and therefore should not be considered robust to reasonable variations in the training data.

Differences in distance moduli
We show the changes in distance moduli for each data variant with respect to the baseline in Figure 7, and for BYOSED simulations in Figure 8.For z < 0.6, the distance modulus changes are below 0.025 for each of the input data variants.Larger changes of ∼0.5 are seen for z > 0.6, with the largest changes coming from MIS-CAL-SPEC and HOST-50.
The RMS of the Hubble residuals is consistent to within ∼2% across all variants that are simulated with the extended SALT2 model, with the MIS-CAL-SPEC model having the largest scatter, 0.149 mag, but negli-gibly higher than the baseline value of 0.147 mag.For the BYOSED variants, the RMS of the Hubble residuals are higher, ∼0.245.The reduced χ 2 of the SALT3 light curve fitting is also consistent across all the variants that are simulated with the extended SALT2 model.By excluding the error term due to the in-sample variancethe degree to which the SALT3 formalism does not fully encapsulate the intrinsic variability of the underlying SN data -we compare reduced χ 2 across training options.The baseline model has a median reduced χ 2 = 0.95, which could indicate slight under-regularization in the training procedure or very modest overestimation of the uncertainties.
Most training options yield a comparable reduced χ 2 to the baseline case, but the variants NO-U, HOST-10, HOST-50 and HOST-100 yield a slightly higher reduced χ 2 , ranging from 0.98 to 1.01.MIS-CAL-SPEC has a smaller reduced χ 2 = 0.92.The BYOSED models have slightly larger median reduced χ 2 ∼ 1.08, perhaps a consequence of these models being semi-independent of the SALT model framework.
In spite of the increased χ 2 from some variants, it is encouraging that the distances change very little on average.Biases in distances relative to the baseline model, averaged in redshift bins of 0 < z < 0.2 versus 0.4 < z < 0.6, are shown in Table 2.All biases, except HOST-100 or MIS-CAL-SPEC, are consistent with zero with an uncertainty of less than 0.01 mag, demonstrating that the training procedure remains effective at standardizing its training sample, even if the model surfaces themselves are shifting on the few-percent level.HOST-100 and MIS-CAL-SPEC have a bias ∼ 0.02 and −0.01, respectively.

Biases on cosmological and nuisance parameters
Despite the apparent shifts in M 0 and M 1 described in Section 4.1.2,we find that inferred distances are consistent.This may be due to the compensating shifts in α and β in the distance estimation stage.The differences in these nuisance parameters are shown in Table 2.We observe differences in α of ≲2σ for the SALT2-extended model simulations except for one high-significance shift of 0.014 for the HOST-10 variant.Biases in the β parameter are within 2-sigma significance for all variants simulated with SALT2-extended, with a magnitude of up to 0.092.
Finally, Table 2 shows the differences in measurements of w from these different training variants.All are consistent with the baseline model at the ∼2-σ level (∆w ≲ 0.025), but with the largest potential deviations coming from variants that adversely affect the reliability of the training spectra (including the NO-LOWZ variant, which removes many spectra).
We compare the BYOSED models with BYOSED-STRETCH-COLOR, and show the differences in w in Table 3.All values of w are consistent to within 0.025, showing that the SALT3 training procedure can largely account for the effect of possible correlations between the host galaxy and the SN Ia SED, or additional SN properties such as SN velocity.

DISCUSSION
In this paper, we test the robustness of the newly developed SALTShaker code and the SALT3 model, and quantify the systematic uncertainties of the training procedure on cosmological parameter measurements.We explore the effect of removing legacy low-z data without measured filter throughputs, observer-frame u/U data (which is poorly calibrated in the real training data), and a large fraction of the spectroscopic training data.We also test the consequences of mis-calibrated SN spectra and the effects of host-galaxy contamination in the spectra.Finally, we use a simulated model independent of SALT (BYOSED; Pierel et al. 2021) to test whether a SALT3 model trained on these data can produce consistent distances from 0 < z < 1.
We generally find better than 2% consistency of the model across these different training variants.The most significant changes in the model are seen in the U band, in pre-maximum light phases, and in the redder and bluer edges of the model surfaces.Host-galaxy contamination at 10% of the SN maximum brightness does not produce large changes in the trained model; however, host contamination at the level of > 50% of the SN a Relative to the baseline fitting results, the difference between average Hubble residual at 0.01 < z < 0.2 and the average Hubble residual at 0.4 < z < 0.6.We also see evidence that the 1990s-era legacy data and the observer frame U band data are playing important roles in constraining the model surfaces and the color law.It appears that individual SNe with photometry spanning the full available model wavelength range (U/u to I/z) may also improve the fidelity of the model, and the existing model training sample may need to be expanded to fully replace these data.We suggest compiling additional training samples, particularly at low redshift, from CSP and Rubin Observatory photometry that include uBV gri and ugriz (and perhaps y) data, respectively, for the same SNe; this will help to fully constrain the model surfaces and color law simultaneously and allow us to remove the less-reliable low-redshift training samples.SN spectra also appear to be important for constraining the model surfaces in the SALT3 training process, particularly at the bluest and reddest ends of the wavelength range; when half of our spectral training sample are removed, we find model variations on the order of up to ∼5-10%.
Although these different training options demonstrate the phases and wavelength ranges where the SALT3 training is less robust, we see consistent distance measurements across nearly the full redshift range, as well as consistent cosmological parameter measurements to within 2% in most cases; the MIS-CAL-SPEC variant has the largest deviation of 0.025 at 1.9-σ significance.Provided that the cosmological parameter estimation is run on data drawn from the same intrinsic distribution as the training data -i.e., it is important that the SALT3 model is re-trained on the data used in a given cosmology analysis -we find that cosmological parameter measurements from the SALT3 model are robust at the 2% level to most realistic variations in the SALT3 training process.However, we suggest that additional attention must be paid to the way in which SALT3 recalibrates spectra during the training process, and to the contamination of the high-redshift training spectra by host-galaxy light.Additionally, we find that the SALT3 training process is sensitive to the color distribution of the input training data, and the resulting w measurement can be biased by ∼ 2% if the color distribution is not sufficiently wide.
Finally, we suggest using the pipeline developed in this work, or similar approaches, to propagate variants like these into the systematic uncertainty budgets of future cosmological analyses.We can only be confident in our constraints on dark energy if we fully understand the assumptions that are propagated into SN standardization modeling.Fortunately, our analysis indicates that these types of systematic uncertainties in the training process bias w at a level below the precision of current analyses.However, as hundreds of thousands of additional SNe Ia are discovered in the coming years, w measurements become more precise, and these new data simultaneously improve our models for estimating SN distances and extend the viable wavelength range at which measuring SN Ia distances is possible, these types of end-to-end systematic uncertainty tests will become increasingly important.M.D. is supported by the Horizon Fellowship at the Johns Hopkins University.Support for D.O.J. was provided by NASA through the NASA Hubble Fellowship grant HF2-51462.001awarded by the Space Telescope Science Institute, which is operated by the Association of Universities for Research in Astronomy, Inc., for NASA, under contract NAS5-26555.Research on systematic uncertainties in supernova cosmology at Rutgers University is funded by DOE awards DE-SC0011636 and DE-SC0010008.The UC Santa Cruz team is supported in part by NASA grants 14-WPS14-0048, NNG16PJ34C, and NNG17PX03C, NSF grants AST-1518052 and AST-1815935, the Gordon and Betty Moore Foundation, the Heising-Simons Foundation, and by a fellowship from the David and Lucile Packard Foundation to R.J.F.This work was completed in part with resources provided by the University of Chicago's Research Computing Center.2013,2018,2022), Matplotlib (Hunter 2007), NumPy (Harris et al. 2020)

B. BIASES IN W FROM THE BASELINE SIMULATIONS
Our baseline simulations described in Section 3.1 (which use x 1 /c populations derived by Scolnic & Kessler 2016) yield a w bias of −0.024 ± 0.006.In order to investigate the origin of this bias, we create another set of baseline training set simulations following K21, which uses the x 1 and c parameter values from the actual K21 data sample.For this sample, we find a statistically insignificant w bias of w = −1.005± 0.009 (averaged on 20 random samples).
The above change was applied only to the legacy lowz simulations because we found that simulations based on the Scolnic & Kessler (2016) x 1 /c populations did not fully match the K21 data (Figure B.1).We note that this is not a deficiency of the Scolnic & Kessler (2016) results; rather, the K21 training data includes additional SNe that were not considered in Scolnic & Kessler (2016).These legacy low-z data, while a limited subset of the training, contain a majority of the spectra used in the training and also constitute some of the highest-S/N and best-sampled photometry.
Figure B.1 shows the color distributions of the low-z training data from our original baseline samples, the regenerated K21-like baseline samples, and the K21 data.We find that the K21-like simulations (and the K21 data) have significantly more SNe with blue colors compared to the original simulations, likely due to the addition of new data from the CfA4 survey, the CSP survey, and z < 0.01 SNe that are too nearby to be suitable for dark energy measurements (but can be used for lightcurve training).
An alternative source of bias could be differences between x 1 and c populations in the bias correction simulations versus the "test" simulations.Because the SALT training process enforces definitions of mean x 1 , c = 0, 0, subtle shifts in the simulated shape/color parameters of the simulated sample can cause mismatches between the training sample and our adopted x 1 /c distributions for the bias correction simulations.However, slight changes in the mean of the c distribution (∼0.01) to make up for these effects had no significant effect on the w bias.We therefore infer that the difference in w stems from differences in the two trained SALT3 models themselves, and that the trained models are sensitive to the color distributions of the input training data, though in future work we will re-compute x 1 and c populations as part of our pipeline for each individual data set following the method of Scolnic & Kessler (2016).Fig B .2 shows the differences in the averaged trained model components trained with these two samples.We find a > 5% difference in the rest-frame U band, and a slight difference in the color law.The color laws trained from these two samples are similar, with a slight difference for λ > 7000 Å.We note that although on average the color laws are not significantly different between these two sets of simulations, we find that the SALT3 color law is particularly sensitive to the c range of the training data for each individual random sample.We advise that future model trainings use a training sample with a wide distribution of colors (similar to K21) in order to avoid subtle cosmological biases such as the one found here.

Figure 2 .
Figure 2. Comparison of light curve parameter distributions between the average of 40 baseline simulations (Section 3.1) and real K21 data.Blue histograms are the simulated baseline data, with the shaded area being the standard deviation of the number in each bin from the simulations, and red histograms are the K21 data.

Figure 3 .
Figure3.Examples of simulated SN and host-galaxy spectra.Blue: simulated final SN spectra with host-galaxy contaminations added; from left to right the host-galaxy fractions are 100%, 50% and 10% relative to the SN peak brightness.Orange: the simulated host-galaxy spectra for each SN, scaled relative to the SN peak brightness.The uncontaminated SN spectra are simply a subtraction between the final SN spectra and the host-galaxy spectra.

Figure 4 .
Figure 4. Relative changes of model components M0 and M1 at peak for different variants of the training data with respect to the baseline M0 component, with the gray area showing the error of the mean of the baseline M0 component.Left: model changes for the following training data variations: removing the legacy low-z data, removing the u/U bands, adding calibration offsets to the training spectra, and removing 50% of the spectral data.Right: model changes after including 100%, 50% and 10% host-galaxy contamination in the spectral data, with contamination scaled relative to the SN brightness at maximum light.

Figure 5 .
Figure 5. Same as Figure 4 but showing relative changes of integrated model fluxes F0 and F1 with respect to the baseline model flux F0.Individual colors show the model components when integrated over the U BV RI passbands.

Figure 6 .
Figure 6.Difference in color law for each training data variation with respect to the baseline.The coefficient 0.1 is chosen to show the change in predicted magnitude as a function of wavelength for a SN with c = 0.1.

Figure 7 .Figure 8 .
Figure7.Changes in the distance moduli as a function of redshift for each training variant, with respect to the baseline.Left: distance modulus changes for the following training data variations (from top to bottom): removing the legacy low-z data, removing the u/U bands, including mis-calibrated spectra, and removing 50% of the spectral data.Right: distance modulus changes when including 100%, 50% and 10% host-galaxy contamination, with respect to SN brightness at maximum light, in the input spectral data.
Astropy (Astropy Collaboration et al. Figure B.1.Color distributions of the low-z training data from the original baseline simulations (with low-z x1/c parameters drawn randomly from previously generated populations, blue), the regenerated K21-like simulations (with K21 low-z x1/c parameters, green), and the K21 data (red).The shaded areas correspond to the standard deviation of the number in each bin for the multiple simulations.

Figure B. 2 .
Figure B.2. Difference in the averaged SALT3 model flux (integrated over UBVRI bands) between the two samples described in Appendix B -the original baseline samples and the regenerated K21-like samples -relative to the M0 flux of the original baseline samples.
Schematic overview of our analysis pipeline, from simulation of the underlying SED model to cosmological parameter constraints.The arrows show the flow of the pipeline and how each stage is connected.

Table 1 .
Short names for each training variant

Table 2 .
Biases on Nuisance Parameters and w for SALT-based Variants

Table A1 .
Asymmetric Gaussian parameters for x1 and c populationsNote: Following Scolnic & Kessler 2016, x1 /c is the value with the maximum probability for the asymmetric gaussian x 1 /c distribution, σ + and σ − are the corresponding gaussian widths on the low and high sides.