Improving Photometric Redshift Estimates with Training Sample Augmentation

Large imaging surveys will rely on photometric redshifts (photo-z's), which are typically estimated through machine-learning methods. Currently planned spectroscopic surveys will not be deep enough to produce a representative training sample for Legacy Survey of Space and Time (LSST), so we seek methods to improve the photo-z estimates that arise from nonrepresentative training samples. Spectroscopic training samples for photo-z's are biased toward redder, brighter galaxies, which also tend to be at lower redshift than the typical galaxy observed by LSST, leading to poor photo-z estimates with outlier fractions nearly 4 times larger than for a representative training sample. In this Letter, we apply the concept of training sample augmentation, where we augment simulated nonrepresentative training samples with simulated galaxies possessing otherwise unrepresented features. When we select simulated galaxies with (g-z) color, i-band magnitude, and redshift outside the range of the original training sample, we are able to reduce the outlier fraction of the photo-z estimates for simulated LSST data by nearly 50% and the normalized median absolute deviation (NMAD) by 56%. When compared to a fully representative training sample, augmentation can recover nearly 70% of the degradation in the outlier fraction and 80% of the degradation in NMAD. Training sample augmentation is a simple and effective way to improve training samples for photo-z's without requiring additional spectroscopic samples.


INTRODUCTION
Understanding the nature of dark energy is a major open question in cosmology. Stage-IV dark energy experiments, such as the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST, Ivezić et al. 2019), Euclid (Euclid Collaboration et al. 2022) and Roman (Akeson et al. 2019), are scheduled to come online in the coming years.
Imaging surveys will need to obtain redshifts to galaxies, but there will be too many for spectroscopic redshifts to be feasible. LSST alone is expected to observe billions of galaxies and will therefore rely on photometric redshifts (photo-z's). Photo-z's can be estimated through machine learning algorithms, which learn to associate photometric quantities, such as colors and magnitudes, with a redshift estimate.
Previous Stage-III dark energy surveys have also used machine learning for estimating photometric redshifts. The Hyper Suprime-Cam Subaru Strategic Program (HSC-SSP, Aihara et al. 2017) used DNNz and DEMPz (Hsieh & Yee 2014) for the Year 3 cosmology results (Rau et al. 2023; Sugiyama et al. 2023; Miyatake et al. 2023). Both DNNz and DEMPz are conditional density estimators. The Dark Energy Survey (DES, Abbott et al. 2018) has used a self-organizing map (SOMPZ, Myles et al. 2021) for estimating photo-z's. The Kilo-Degree Survey (KiDS, Heymans et al. 2021) has also used self-organizing maps for photo-z estimation (Hildebrandt et al. 2021).
Machine learning methods require a training sample of galaxies with both photometry and spectroscopic redshifts, and it is well known that machine learning methods trained on non-representative training data perform worse than when trained on representative training sets (see e.g., Beck et al. 2017 for a general evaluation of photo-z quality when non-representative training samples are used). Stylianou et al. (2022) also demonstrate the effect of some simplistic forms of training sample incompleteness on specific machine learning methods. However, existing spectroscopic samples are biased towards brighter, redder galaxies than LSST will observe in general, and these also tend to be at lower redshift than the typical LSST galaxy. This means that training samples for photo-z estimation will not be representative of LSST data, leading to poor photo-z estimation for galaxies with photometry not represented in the training sample. The Dark Energy Spectroscopic Instrument (DESI, Flaugher & Bebek 2014), along with spectroscopic redshifts from Euclid, Roman and 4MOST (de Jong et al. 2019), will alleviate this issue to an extent, but the DESI survey will not be as deep as LSST; additional spectroscopic redshifts from DESI cannot solve the problem alone. We will need methods to improve the redshift estimation that do not involve obtaining more spectroscopic redshifts.
One method for improving training samples without obtaining more spectroscopic redshifts is data augmentation, the process of modifying a training sample to increase the generality of a machine learning model (Shorten & Khoshgoftaar 2019). Data augmentation can be done by transforming existing training data, for example through rotations or deformations in the case of image recognition (Bloice et al. 2017), or by generating synthetic data for the training sample (Bird et al. 2021). Broussard & Gawiser (2021) used this synthetic data generation method for augmentation to estimate photo-z's.
In this paper, we investigate a slightly different method of augmenting the training sample by adding galaxies from simulated catalogs to our training sample. By selecting simulated galaxies with photometry and/or redshifts not otherwise represented in the training sample, this training sample augmentation can expand the range of feature space capable of producing good photo-z estimates, provided the simulated catalog used for augmentation has reasonable colors. If the simulated catalog is too unrealistic, the added galaxies will only confuse the model.
Section 2 describes our simulated data, including our stand-in for real LSST data and the simulated catalog used for augmenting the training sample. Section 3 describes our methodology, including how a realistically non-representative training sample is created, how we estimate photo-z's, and the process for augmenting the training sample. Section 4 discusses our results, and section 5 concludes.

DC2
The LSST Dark Energy Science Collaboration (DESC) Data Challenge 2 catalog (DC2; Abolfathi et al. 2021) is a 300 deg$^2$ area of simulated LSST observations. The base input for DC2 is the CosmoDC2 galaxy catalog (Korytov et al. 2019), which is derived from the Outer Rim N-body simulation (Heitmann et al. 2019). Galaxies were assigned to halos using UniverseMachine (Behroozi et al. 2019) and GalSampler (Hearin et al. 2020). Complete galaxy properties are generated with Galacticus (Benson 2012). Galaxy spectra are constructed from stellar population spectra computed with fsps (Conroy et al. 2009).
To generate the DC2 catalog, stars, supernovae, strong lenses and AGN are added to the CosmoDC2 catalog. The object catalogs are passed as inputs to the image simulation software imSim to generate LSST-like images, which are then processed by the LSST Science Pipelines.
From the DC2 catalog, we select objects with magnitude i > 17. We require S/N > 6 in the i-band, as well as S/N > 3 in at least one other band. To minimize contamination from stars, we select only extended objects. This extendedness cut does not entirely eliminate stars, but the stellar contamination is low, and the training sample selection process (described in section 3.1) ends up placing all stars in the application sample.
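The cuts above amount to a simple per-object filter. The following is a minimal sketch in Python, not the code used for this work: the `CatalogObject` container and its field names are hypothetical stand-ins, not DC2 column names.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogObject:
    """Hypothetical container; field names are illustrative, not DC2 columns."""
    mag_i: float                              # i-band magnitude
    snr: dict = field(default_factory=dict)   # band name -> signal-to-noise ratio
    extended: bool = True                     # True for extended (galaxy-like) objects


def passes_selection(obj):
    """Apply the magnitude, S/N and extendedness cuts quoted in the text."""
    snr_other_bands = [s for band, s in obj.snr.items() if band != "i"]
    return (
        obj.mag_i > 17.0                            # i > 17
        and obj.snr.get("i", 0.0) > 6.0             # S/N > 6 in the i band
        and any(s > 3.0 for s in snr_other_bands)   # S/N > 3 in at least one other band
        and obj.extended                            # extended objects only
    )


catalog = [
    CatalogObject(mag_i=22.5, snr={"i": 12.0, "r": 5.0, "g": 2.0}),
    CatalogObject(mag_i=16.0, snr={"i": 80.0, "r": 60.0}),                  # too bright
    CatalogObject(mag_i=24.8, snr={"i": 6.5, "r": 2.0, "z": 2.5}),          # no second band above 3
    CatalogObject(mag_i=21.0, snr={"i": 30.0, "r": 25.0}, extended=False),  # point source
]
selected = [obj for obj in catalog if passes_selection(obj)]
```

Only the first toy object survives all four cuts; the others each fail exactly one of them.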
The base, unaugmented training sample and the application sample of galaxies, for which we estimate redshifts, are formed from our selected DC2 objects. In this work, DC2 is a stand-in for real data. We use the term 'application sample' to refer to the DC2 stand-in for what would be unlabeled LSST data. While we do have true redshifts for this sample, and it functions as a testing set in this analysis, we keep the terminology to identify our application sample with eventual LSST data.

Buzzard
The Buzzard simulation (DeRose et al. 2019) was built from the L-GADGET2 dark matter simulations (Springel 2005). Galaxies are assigned to halos with AddGals (Wechsler et al. 2022), which uses the abundance matching technique. Galaxy SEDs were assigned to match the measured SED-luminosity-density relationship in SDSS data.
We use the Buzzard catalog selected for the DESC Tomographic Challenge (Zuntz et al. 2021). Details on selection cuts, post-processing and uncertainty generation can be found in that paper.
The Buzzard method of assigning spectra is completely independent from fsps, so Buzzard SEDs should be sufficiently different from DC2 SEDs to simulate adding simulated galaxies to training samples of real galaxies. In this work, we use the Buzzard catalog as a simulated catalog with which to augment the DC2 training sample.

Non-representative Training Sample
Existing spectroscopic galaxy samples are brighter and redder than expected LSST observations, and also tend to be at lower redshifts. To partition our DC2 catalog into a realistically non-representative training sample and application sample, we use the GridSelection degrader in the DESC RAIL software (LSST-DESC RAIL developer team et al. 2023). We briefly summarize the GridSelection degrader below. A more detailed discussion can be found in Moskowitz et al. (2023).
The GridSelection degrader is based on the second data release of the Hyper Suprime-Cam Subaru Strategic Program (Aihara et al. 2019). Galaxies with similar photometry to early LSST observations are selected from the HSC Wide catalog; some of these galaxies have photometry only, and some have matched spectroscopic redshifts. The ranges in i-band magnitude and (g-z) color are divided into 100×100 pixels. Within each pixel, a ratio of the number of galaxies with spectroscopic redshifts to the total number of galaxies is computed, along with the 99th percentile in spectroscopic redshift, denoted $z_{\mathrm{max}}$. The GridSelection degrader divides our DC2 galaxies into the same set of pixels in i vs (g-z) and automatically assigns DC2 objects with $z_{\mathrm{true}} > z_{\mathrm{max}}$ to the application sample. From the remaining DC2 objects, the GridSelection degrader randomly selects objects for the training sample such that the ratio of DC2 training objects to total objects in a pixel matches the ratio from HSC.
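The partitioning logic summarized above can be sketched as follows. This is a simplified illustration and not the RAIL implementation: the function names, the dictionary-based catalog, and the per-pixel lookup tables `spec_ratio` and `z_max` are all hypothetical, and pixels with no reference information are assumed to send every object to the application sample.

```python
import random
from collections import defaultdict


def pixel_index(value, lo, hi, n_pix=100):
    """Map a value in [lo, hi] to an integer pixel index in [0, n_pix - 1]."""
    frac = (value - lo) / (hi - lo)
    return min(max(int(frac * n_pix), 0), n_pix - 1)


def grid_partition(objects, spec_ratio, z_max, i_range, gz_range, seed=0):
    """Split objects into (training, application) lists.

    objects: list of dicts with keys "i", "gz" and "z_true".
    spec_ratio, z_max: dicts keyed by (i_pixel, gz_pixel), standing in for the
    per-pixel quantities measured from the reference survey.
    """
    rng = random.Random(seed)
    by_pixel = defaultdict(list)
    for obj in objects:
        key = (pixel_index(obj["i"], *i_range), pixel_index(obj["gz"], *gz_range))
        by_pixel[key].append(obj)

    training, application = [], []
    for key, members in by_pixel.items():
        ceiling = z_max.get(key, -1.0)  # assumption: unknown pixel -> nothing eligible
        eligible = [o for o in members if o["z_true"] <= ceiling]
        beyond = [o for o in members if o["z_true"] > ceiling]
        # Target count so that training / total in this pixel matches the reference ratio.
        n_train = min(len(eligible), round(spec_ratio.get(key, 0.0) * len(members)))
        chosen = set(rng.sample(range(len(eligible)), n_train))
        for idx, obj in enumerate(eligible):
            (training if idx in chosen else application).append(obj)
        application.extend(beyond)  # objects above z_max always go to the application sample
    return training, application


# Toy example: 100 objects in one pixel, 20 of them above the pixel's z_max.
toy = [{"i": 22.0, "gz": 1.0, "z_true": 0.5} for _ in range(80)]
toy += [{"i": 22.0, "gz": 1.0, "z_true": 1.5} for _ in range(20)]
i_range, gz_range = (17.0, 27.0), (-1.0, 4.0)
key = (pixel_index(22.0, *i_range), pixel_index(1.0, *gz_range))
train, app = grid_partition(toy, {key: 0.2}, {key: 1.0}, i_range, gz_range)
```

In the toy example the training sample holds 20 of the 100 objects in the pixel, all at or below the redshift ceiling, which mirrors how the real degrader produces a training sample that is incomplete in both photometry and redshift.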
After partitioning the full DC2 sample into training and application samples, the training sample contains 186,837 galaxies, while the application sample contains 5,520,458 objects. The left and center panels of Figure 1 show the resulting DC2 training and application samples, where it is clear that the training sample is redder and brighter than the majority of the application sample. Figure 2 shows the (normalized) redshift distributions of both samples, as well as the distributions of the Buzzard sample and the best performing augmentation choice. The training sample is biased towards lower redshifts than the application sample as a whole. The right panel of Figure 1 shows the Buzzard sample that will be used for augmentation. See section 3.3 for more details.

Photo-z estimation
Schmidt et al. (2020) tested twelve photo-z estimation codes, albeit using representative training data, and recommended FlexZBoost (Izbicki & Lee 2017; Dalmasso et al. 2020) as an appropriate estimator. Therefore, to estimate photo-z's, we use FlexZBoost as implemented in RAIL. FlexZBoost is a non-parametric conditional density estimator for redshifts. It takes as inputs the magnitudes and errors in each of the bands ugrizy and outputs a photo-z pdf for each object in the application sample.
To evaluate the quality of a set of photo-z estimates, we use the outlier fraction, catastrophic outlier fraction, normalized median absolute deviation (NMAD) and bias. We define an outlier as $|z_{\mathrm{true}} - z_{\mathrm{phot}}|/(1 + z_{\mathrm{true}}) > 0.15$, while a catastrophic outlier is defined as $|z_{\mathrm{true}} - z_{\mathrm{phot}}| > 1.0$. The NMAD is given by
$$\mathrm{NMAD} = 1.48 \times \mathrm{Median}\left(\left|\frac{\Delta z}{1 + z_{\mathrm{true}}} - \mathrm{bias}\right|\right),$$
where $\Delta z = z_{\mathrm{phot}} - z_{\mathrm{true}}$ and the bias is given by $\mathrm{Median}(\Delta z/(1 + z_{\mathrm{true}}))$. Although the pdf contains a wealth of useful information that can be used to quantify photo-z quality, such as the 3σ outlier fraction (see e.g. Jones et al. 2024), cosmological analyses typically involve assigning galaxies to tomographic redshift bins. Since galaxies can only be assigned to one redshift bin, little information is lost by compressing the pdf into a single photo-z point estimate used to assign the galaxy to a bin. We use the mean of the pdf as a point estimate for each photo-z.
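As an illustration of these statistics, the sketch below computes them for a small set of point estimates, along with the mean of a gridded pdf. It is a sketch under stated assumptions rather than the evaluation code used here: the sign convention Δz = z_phot − z_true and the conventional NMAD scaling factor of 1.48 are assumed, and the function names are hypothetical.

```python
from statistics import median


def pdf_mean(z_grid, pdf):
    """Mean of a pdf tabulated on a uniform redshift grid (normalization not required)."""
    return sum(z * p for z, p in zip(z_grid, pdf)) / sum(pdf)


def photoz_metrics(z_true, z_phot):
    """Return (outlier fraction, catastrophic outlier fraction, NMAD, bias)."""
    n = len(z_true)
    dz = [zp - zt for zt, zp in zip(z_true, z_phot)]        # assumed sign convention
    scaled = [d / (1.0 + zt) for d, zt in zip(dz, z_true)]  # dz / (1 + z_true)
    outlier = sum(abs(s) > 0.15 for s in scaled) / n
    catastrophic = sum(abs(d) > 1.0 for d in dz) / n
    bias = median(scaled)
    nmad = 1.48 * median([abs(s - bias) for s in scaled])   # assumed 1.48 factor
    return outlier, catastrophic, nmad, bias


z_true = [0.5, 1.0, 1.5, 2.0]
z_phot = [0.5, 1.0, 0.2, 0.9]   # the two high-redshift objects are badly underestimated
outlier, catastrophic, nmad, bias = photoz_metrics(z_true, z_phot)
point_estimate = pdf_mean([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
```

In this toy case half of the objects are both outliers and catastrophic outliers, and the bias is negative because the failures all underestimate the redshift.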
The photo-z's estimated from the base, nonrepresentative DC2 training sample are shown in the top panel of Figure 3. The outlier fraction is quite high at nearly 50%. In particular, the majority of galaxies with $z_{\mathrm{true}} \gtrsim 1.0$ have outlier $z_{\mathrm{phot}}$ estimates. This is because our DC2 training sample has very few objects with z > 1.0. For comparison, the bottom panel of Figure 3 shows the results for a fully representative DC2 training sample, which obtains an outlier fraction of 0.14. This represents the best we can expect to do using FlexZBoost. As shown in Figure 3, the unaugmented, non-representative training sample produces an outlier fraction almost a factor of 3.5 higher than the representative training sample.

Augmentation
We augment the training sample by adding 10,000 Buzzard galaxies according to a set of criteria, taking care to only use knowledge about the application sample that would be available for real data. The simplest criterion for augmentation is to select Buzzard galaxies with higher redshifts than those present in the DC2 training sample. We refer to this as redshift augmentation, and make the selection $z_{\mathrm{buzzard}} > 1.0$. This region is indicated by the vertical dashed line and arrow in Figure 2.
Since DC2 application galaxies are also dimmer and bluer than the training sample, we also choose magnitude and color selection criteria, which we call magnitude augmentation and color augmentation, respectively. For magnitude augmentation we make the selection $i_{\mathrm{buzzard}} > 23$, and for color augmentation we choose $(g - z)_{\mathrm{buzzard}} < 1.75$. These boundaries were chosen to match where the magnitude and (g-z) color distributions in the training sample start to decline. They are indicated by the dotted lines and open arrows in the right panel of Figure 1. We test augmentation with each of the features individually, as well as in combination with each other.
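The single-feature selections described above can be sketched as below. This is an illustrative sketch only: the dictionary keys (`redshift`, `mag_g`, `mag_i`, `mag_z`) and function names are hypothetical, and drawing the added galaxies uniformly at random from the qualifying pool is an assumption.

```python
import random

# Thresholds quoted in the text.
Z_CUT, I_CUT, GZ_CUT = 1.0, 23.0, 1.75

CRITERIA = {
    "z": lambda gal: gal["redshift"] > Z_CUT,                  # redshift augmentation
    "mag": lambda gal: gal["mag_i"] > I_CUT,                   # magnitude augmentation
    "col": lambda gal: gal["mag_g"] - gal["mag_z"] < GZ_CUT,   # color augmentation
}


def select_single_feature(sim_catalog, feature, n_add=10_000, seed=0):
    """Randomly draw up to n_add simulated galaxies that satisfy one criterion."""
    rng = random.Random(seed)
    pool = [gal for gal in sim_catalog if CRITERIA[feature](gal)]
    return rng.sample(pool, min(n_add, len(pool)))


def augment(training, sim_catalog, feature, n_add=10_000, seed=0):
    """Return a new training list with the selected simulated galaxies appended."""
    return training + select_single_feature(sim_catalog, feature, n_add, seed)


sim_catalog = [
    {"redshift": 1.3, "mag_g": 25.0, "mag_i": 24.1, "mag_z": 23.8},  # high z, faint, blue
    {"redshift": 0.4, "mag_g": 23.5, "mag_i": 21.0, "mag_z": 20.5},  # bright and red
    {"redshift": 0.8, "mag_g": 24.6, "mag_i": 23.6, "mag_z": 23.4},  # faint and blue only
]
training = [{"redshift": 0.3, "mag_g": 22.0, "mag_i": 20.0, "mag_z": 19.5}]
z_augmented = augment(training, sim_catalog, "z")
mag_augmented = augment(training, sim_catalog, "mag")
```

Only the simulated catalog's own redshifts and photometry enter the selection, so nothing about the true redshifts of the application sample is needed.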

Photometry Shifts
Since DC2 and Buzzard rely on different methods for determining galaxy SEDs, the colors as a function of redshift are different between the two simulations. If this augmentation method were used for real data, it would be advantageous to shift the simulated photometry to look like the real photometry in the application sample. Therefore, we also attempt to match the Buzzard photometry to the DC2 application sample. This modifies the color-redshift relationship in Buzzard to potentially more closely resemble the color-redshift relationship of DC2. Since we do not use the true redshifts of the application sample, this represents something we could do with real data.

Magnitude Shifts
The simplest way to transform Buzzard colors is to apply a single shift to the Buzzard magnitudes to make their median match the median of the DC2 application sample magnitudes in each band. We will refer to this sample as the 'magnitude shifted Buzzard' sample.
In addition to the medians, we can also rescale the NMADs to match in each band. This is a proxy for matching the first and second moments of the photometry distributions. To shift the NMADs, we apply the following transformation:
$$m_j' = \left(m_j - \mathrm{med}_j\right)\frac{\mathrm{NMAD}_{j,\mathrm{DC2}}}{\mathrm{NMAD}_j} + \mathrm{med}_{j,\mathrm{DC2}},$$
where $m_j$ and $m_j'$ are the original and shifted magnitudes in band j, $\mathrm{NMAD}_j$ refers to the NMAD in band j, $\mathrm{med}_j$ is the median magnitude in band j and all quantities are for Buzzard unless indicated by the DC2 subscript. We will refer to this sample as the 'NMAD shifted Buzzard' sample.
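Both shifts are affine transformations applied band by band. The sketch below illustrates them for a single band with toy numbers; it is not the code used in this work. The function names are hypothetical, and the NMAD of a magnitude distribution is assumed here to be 1.48 times the median absolute deviation from the median.

```python
from statistics import median


def nmad(values):
    """Assumed definition: 1.48 x median absolute deviation about the median."""
    med = median(values)
    return 1.48 * median([abs(v - med) for v in values])


def magnitude_shift(sim_mags, target_mags):
    """Add a constant offset so the simulated median matches the target median."""
    offset = median(target_mags) - median(sim_mags)
    return [m + offset for m in sim_mags]


def nmad_shift(sim_mags, target_mags):
    """Rescale about the median so both the median and the NMAD match the target."""
    scale = nmad(target_mags) / nmad(sim_mags)
    med_sim, med_target = median(sim_mags), median(target_mags)
    return [(m - med_sim) * scale + med_target for m in sim_mags]


# One band of toy magnitudes: the simulated sample is brighter and narrower.
buzzard_mags = [20.0, 21.0, 22.0, 23.0, 24.0]
dc2_mags = [22.0, 24.0, 26.0, 28.0, 30.0]

mag_shifted = magnitude_shift(buzzard_mags, dc2_mags)
nmad_shifted = nmad_shift(buzzard_mags, dc2_mags)
```

The magnitude shift leaves the width of the distribution untouched, while the NMAD shift also stretches it; neither changes the shape, which is the limitation the normalizing flow approach is meant to address.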

Normalizing Flows
The simple shift method is able to match the medians and NMADs of the Buzzard and DC2 color distributions, but not the shapes of the distributions. To attempt to more fully match the color distributions, we use normalizing flows to produce a catalog of DC2-like photometry with Buzzard-like redshifts.
We use the PZFlow package (Crenshaw et al. 2023) as implemented in RAIL for training the normalizing flows. We train two flows: one on DC2 photometry, and one on Buzzard photometry. The DC2 flow learns the probability distribution function of the DC2 photometry, p(photometry), while the Buzzard flow is a conditional flow that learns the probability density function of the redshift given the photometry, p(z|photometry). The features used for training are i-band magnitudes and (u-g), (g-r), (r-i), (i-z) and (z-y) colors. We train 100 epochs for the DC2 flow, and 150 epochs for the Buzzard flow.
Once the flows are trained, we sample from the DC2 flow to make a new catalog of galaxies with DC2-like photometry. We then sample from the Buzzard flow, using the new DC2-like photometry as conditions, to generate Buzzard-like redshifts for our DC2-like photometry. Finally, we use the RAIL LSSTErrorModel degrader to generate LSST-like errors on the magnitudes. This set of DC2-like photometry and Buzzard-like redshifts constitutes our flowed catalog from which we draw galaxies for augmentation. We will refer to this sample as the "flowed Buzzard" sample.

RESULTS
Table 1 summarizes the outlier fractions and NMADs achieved for each combination of augmentation features and each Buzzard sample (unshifted, magnitude shifted and flowed), as well as those achieved for the unaugmented training sample and a fully representative training sample. Every kind of augmentation we tested improved the outlier fraction and NMAD of the resulting photo-z's. Augmentations involving redshift selections performed better than those without. The magnitude shifted Buzzard sample generally produced better results than the unshifted or flowed Buzzard samples; however, when selecting galaxies for augmentation using only a single feature, the unshifted Buzzard catalog produced the best results.
Adding the NMAD shift to the magnitude shift did not hurt, but showed no improvement over the simple magnitude shift, so we only show results for the magnitude shifted sample. We also tried combining the magnitude shifted Buzzard sample with the normalizing flow method, but the standard color flow worked better.
Since FlexZBoost also takes in photometric errors, we tested the effect of changing the errors. We tested multiplying the errors by factors of 0.1, 2.0 and 2.0×(1+z). Variations in the outlier fractions and NMADs were smaller than the variations between augmentation cases.
Each method of augmentation produces an outlier fraction and NMAD for the resulting photo-z's. In addition to the raw metrics, we also report the ratio of the augmented results to the unaugmented results. Since the best case scenario is the fully representative training case, not an outlier fraction/NMAD of 0, we also report a percent recovery towards the outlier fraction/NMAD achieved in the representative case. The percent recovery is calculated as $(X_{\mathrm{unaug}} - X_{\mathrm{aug}})/(X_{\mathrm{unaug}} - X_{\mathrm{rep}})$, where X is either the outlier fraction or NMAD, the subscript 'aug' refers to the statistic for the augmented training sample, the subscript 'unaug' refers to the unaugmented training sample, and the subscript 'rep' refers to the representative training sample.
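As a worked example of these two summary numbers, the short sketch below evaluates them for the outlier fractions quoted elsewhere in the text (0.48 unaugmented, 0.245 for the best augmentation case, 0.14 for the representative sample). The function names are illustrative only.

```python
def ratio_to_unaugmented(x_aug, x_unaug):
    """Ratio of the augmented statistic to the unaugmented one (lower is better)."""
    return x_aug / x_unaug


def percent_recovery(x_unaug, x_aug, x_rep):
    """Percentage of the degradation relative to the representative case that is recovered."""
    return 100.0 * (x_unaug - x_aug) / (x_unaug - x_rep)


# Outlier fractions quoted in the text.
x_unaug, x_aug, x_rep = 0.48, 0.245, 0.14

ratio = ratio_to_unaugmented(x_aug, x_unaug)
recovery = percent_recovery(x_unaug, x_aug, x_rep)
print(f"ratio = {ratio:.2f}, recovery = {recovery:.0f}%")
```

A recovery of 0% corresponds to no improvement over the unaugmented training sample, and 100% corresponds to matching the fully representative training sample.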
The following subsections discuss the results for single feature, double feature and triple feature augmentations. Table 1 summarizes the results.

Augmentation with Individual Features
When augmenting with a single feature, we choose Buzzard galaxies with either $z_{\mathrm{true}} > 1.0$, i-mag > 23 or (g − z) < 1.75. When using a single feature for augmentation, the base, unshifted Buzzard catalog produced the lowest outlier fractions and NMADs.
Of the three features, redshift augmentation produces the best results when a single feature is used for augmentation, while color performs the worst. The redshift augmented training sample, using the unshifted Buzzard catalog, and the resulting photo-z estimates are shown in the left panel of Figure 4. Results for color and magnitude augmentation, as well as for the shifted and flowed Buzzard catalogs, are listed in Table 1.

Augmentation with Double Feature Combinations
There are three double feature combinations possible: magnitude+redshift, color+redshift and color+magnitude. Since the training sample creates a compact shape in color-magnitude space, we fit two lines to mimic the shape of the top of the color-magnitude distribution of the training sample. We then choose objects above and to the left of this region (see the dashed line and solid arrows in the right panel of Figure 1).
The training sample does not have a simple shape in either color-redshift or magnitude-redshift space. For these feature combinations, we use the intersection of the selection requirements for each feature rather than the union; for example, the magnitude+redshift augmentation selects Buzzard galaxies with both i-mag > 23 and $z_{\mathrm{true}} > 1.0$. The intersection worked mildly better than the union.
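The difference between the intersection and the union of two selection criteria can be made concrete with a small sketch. The dictionary keys and function names below are hypothetical, and the thresholds are those quoted for magnitude and redshift augmentation.

```python
def is_faint(gal):
    """Magnitude criterion quoted in the text: i-mag > 23."""
    return gal["mag_i"] > 23.0


def is_high_z(gal):
    """Redshift criterion quoted in the text: z_true > 1.0."""
    return gal["redshift"] > 1.0


def select(sim_catalog, criteria, mode="intersection"):
    """Keep galaxies passing all criteria (intersection) or any criterion (union)."""
    combine = all if mode == "intersection" else any
    return [gal for gal in sim_catalog if combine(c(gal) for c in criteria)]


sim_catalog = [
    {"mag_i": 24.0, "redshift": 1.2},  # faint and high redshift
    {"mag_i": 24.0, "redshift": 0.5},  # faint only
    {"mag_i": 21.0, "redshift": 1.4},  # high redshift only
    {"mag_i": 21.0, "redshift": 0.3},  # neither
]
both = select(sim_catalog, [is_faint, is_high_z], mode="intersection")
either = select(sim_catalog, [is_faint, is_high_z], mode="union")
```

Because the number of added galaxies is fixed, the intersection concentrates that budget on galaxies that are unrepresented in both features, whereas the union spreads it over a larger pool that includes galaxies the training sample already partly covers.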
When using multiple features to select Buzzard galaxies for augmentation, the magnitude shifted Buzzard sample produced the best results. Results for the unshifted and flowed Buzzard catalogs are also listed in Table 1.
Similar to the single feature augmentation, the double feature combinations that include redshift as a selection criterion perform better than combinations without redshift. Magnitude+redshift and color+redshift combinations perform virtually identically for the magnitude shifted Buzzard catalog, with outlier fractions of 0.259 and 0.258, respectively. This corresponds to a ratio to the unaugmented results of 0.54, and a 65% recovery towards the fully representative case. Both produce an NMAD of 0.086, which corresponds to a ratio to the unaugmented results of 0.452 and a 78% recovery towards the fully representative case. The color+redshift augmented training sample, using the magnitude shifted Buzzard catalog, and the resulting photo-z estimates are shown in the middle panel of Figure 4.
When compared to the outlier fraction for the redshift augmentation using the magnitude shifted catalog (see Table 1), adding color or magnitude provides a small improvement. The color+magnitude combination also provides a small improvement over either color or magnitude alone in most cases.

Color+Magnitude+Redshift Augmentation
Finally, we use all three features to select Buzzard galaxies for augmentation. We use the intersection of the color+magnitude and redshift selection criteria as this provides better results than the union. This is likely because we always augment the training sample with 10,000 galaxies; in this case, the intersection most efficiently probes the feature space not covered by the DC2 training sample. The color+magnitude+redshift augmented training sample, using the magnitude shifted Buzzard catalog, is shown in the top right of Figure 4, and the post augmentation redshift distribution is shown as the black dot-dashed line in Figure 2.
As with the double feature augmentation cases, the simple magnitude shifted Buzzard catalog produced the best results, and we show that case in the right panel of Figure 4.
This augmentation case produces the lowest outlier fraction and NMAD out of all combinations of color shifting and augmentation features. It achieves an outlier fraction of 0.245, which is a ratio of 0.51 to the unaugmented case and recovers 69% of the degradation resulting from the non-representative training sample. We achieve an NMAD of 0.084, a ratio of 0.44 to the unaugmented case, and an 80% recovery of the degradation in NMAD compared to the fully representative training sample. We also achieve 8 times less bias with this augmentation than in the unaugmented case (see Table 1).
The augmented training samples using the magnitude shifted Buzzard catalog have i-band magnitudes that extend fainter than the application sample. We tried imposing an i < 26 cut on the magnitude shifted Buzzard catalog before selecting galaxies for augmenting, but this produces slightly worse photo-z estimates. We suspect this is because FlexZBoost is a conditional density estimator, and a hard cut-off in the magnitudes makes it difficult for FlexZBoost to estimate the density at the cut-off magnitude. Allowing the magnitudes to drop off naturally makes it easier for FlexZBoost to estimate the density at i = 26, resulting in better photo-z estimates even though the magnitude range extends farther than required to match the application sample.

Comparison to TPZ
To test whether the results of augmentation depend strongly on the machine learning method used, we also estimated photo-z's using the RAIL implementation of the code Trees for Photo-Z (TPZ, Carrasco Kind & Brunner 2013), which uses a random forest method. TPZ produces similar results when no training sample augmentation is performed; the photo-z statistics (outlier fraction, catastrophic outlier fraction, NMAD and bias) for the unaugmented TPZ case are (0.51, 0.20, 0.22, -0.13), compared to (0.48, 0.21, 0.19, -0.12) for the unaugmented FlexZBoost case. The best performing augmentation case, using the magnitude shifted Buzzard sample and augmenting with color+magnitude+redshift, produced photo-z statistics of (0.32, 0.04, 0.11, -0.014) for TPZ. This improvement over the unaugmented case is smaller than that obtained with FlexZBoost, but still highly significant (see Table 1).

CONCLUSIONS
Large imaging surveys, such as LSST, will not have access to representative training samples for estimating photometric redshifts. When estimating photo-z's using a realistically non-representative training sample, the outlier fraction reaches nearly 50%, almost a factor of 3.5 worse than when using a representative training sample, with a similar increase in scatter as probed by the NMAD. Obtaining new spectroscopic samples of dim galaxies cannot solve this problem alone, as it is not feasible to obtain a large enough sample by the time LSST is expected to see first light. Training sample augmentation is an easy way to improve photo-z estimates without requiring additional spectroscopic samples of dim galaxies.
We used the DESC DC2 simulation as a stand-in for eventual LSST data, and investigated how augmenting a realistically non-representative training sample with simulated galaxies from Buzzard can improve the photo-z estimates. Even a relatively simple augmentation process of selecting simulated galaxies with redshifts higher than those present in the training sample can recover 65% of the degradation in the outlier fraction when compared to a fully representative training sample.
Shifting the photometry of the galaxy catalog used for augmenting the training sample can improve the results further. We shifted all Buzzard magnitudes so the median magnitude in each band matches the median magnitudes of the DC2 application sample. We then select galaxies for augmentation in regions of color-magnitude-redshift space not covered by the training sample. The resulting photo-z estimates, shown in the right panel of Figure 4 and Table 1, have an outlier fraction below 25%, NMAD of 0.084, and bias of -0.014, representing a nearly 50% reduction relative to the unaugmented photo-z outlier fraction, a 56% reduction in NMAD, and a factor of 8 reduction in bias. Augmentation has recovered 69% of the degradation in outlier fraction compared to the fully representative case, 80% of the degradation in NMAD, and 88% of the degradation in bias. With these results in mind, it is clear that training sample augmentation should be considered in the photo-z pipeline for large galaxy surveys, including LSST.
Since redshift seems to be the most important feature for augmentation, the Buzzard catalog is not the best-case scenario for augmentation given how few galaxies there are at z > 1.0. There are almost no Buzzard objects at z > 1.5, while there are still many application sample objects in that redshift range. When using augmentation for real data, choosing a simulation with sufficient redshift coverage, such as the DC2 simulation, should produce better results than were achieved here.
While we leave extensions of this method to real data to future work, we discuss below a few possible avenues to explore training sample augmentation for real data. One could, for example, use the deep photo-z catalogs from COSMOS2020 (Weaver et al. 2022) as a truth catalog, where the COSMOS2020 photo-z catalog would take the place of the DC2 catalog used in this work. It could also be worthwhile to explore using a deep photo-z catalog such as COSMOS2020 as the augmentation catalog, in place of the Buzzard catalog.
Self-organizing maps (SOMs) have also been used to correct for spectroscopic incompleteness of training samples in DES (see, for example, Myles et al. 2021; Buchs et al. 2019) and KiDS (van den Busch et al. 2022). Assigning training and application samples to SOM cells can be useful for identifying under-represented regions of photometry in the training sample that can be targeted for spectroscopic followup (Masters et al. 2015). Future efforts could use the SOM methodology to identify photometry regions to target with augmented training samples.
We have assumed in this work that the outlier fraction and NMAD are good indicators of photo-z quality. While this is a reasonable assumption, a full quantification of the improvements provided by augmentation would come from using the photo-z estimates to do a full cosmological parameter estimation. For augmentation to truly be useful, it should result in better cosmological parameter estimates than the unaugmented photo-z's. This analysis will be presented in a forthcoming paper.

Figure 1. The results of partitioning our DC2 catalog into training (left) and application (center) samples. The training sample is redder and brighter than the bulk of the application sample. The right panel shows the Buzzard sample used for augmenting the training sample. The horizontal dotted line shows the i-band selection criterion, while the vertical dotted line shows the (g-z) color criterion. The dashed line indicates the selection criterion for color+magnitude augmentation, which generally matches the shape of the DC2 training sample in the left panel. Open arrows indicate which regions of color-magnitude space are used for single feature augmentation, while solid arrows indicate regions used for color+magnitude augmentation. See section 3.3 for more details on augmentation criteria.

Figure 2. Normalized redshift distributions of the DC2 training sample (blue solid line), DC2 application sample (orange dashed line), and Buzzard sample (green dotted line). The DC2 training sample is biased to lower redshifts than the application sample. The vertical dashed line indicates the selection criterion for redshift augmentation, with the arrow indicating the region of redshift space used for augmentation. The black dot-dashed line shows the redshift distribution of the best performing, post augmentation training sample shown in the top right panel of Figure 4.

Figure 3. Top: Photo-z's estimated from the realistic, non-representative DC2 training sample shown in Figure 1. Solid black lines indicate the boundary for outliers. Bottom: Photo-z's estimated from a fully representative training sample drawn randomly from the DC2 application sample in Figure 1.

Figure 4. The best case photo-z estimation for single, double and triple feature augmentation (bottom row), with the corresponding augmented training samples (top row). Left: When only a single feature is used for augmentation, redshift augmentation produces the best photo-z estimates. The training sample was augmented with the unshifted Buzzard catalog, which in this case produced better photo-z statistics than the magnitude shifted or flowed Buzzard catalog. Center: When a double feature combination is used for selecting galaxies for augmentation, the combination of color+redshift produces the best results. The best results for this case came from using the magnitude shifted Buzzard catalog. Right: The best photo-z estimates were produced when all three features are combined to select galaxies for augmentation. The best results for this case came from using the magnitude shifted Buzzard catalog.
This paper has undergone internal review by the LSST Dark Energy Science Collaboration. The internal reviewers were Huan Lin and Eve Kovacs. The authors thank the internal reviewers for their valuable comments. IM and EG acknowledge support for this research from the LSST Corporation via grant #2021-42. IM and EG also acknowledge support from the U.S. Department of Energy, Office of Science, Office of High Energy Physics Cosmic Frontier Research program under Award Number DE-SC0010008. JFC acknowledges support from the U.S. Department of Energy, Office of Science, Office of High Energy Physics Cosmic Frontier Research program under Award Number DE-SC0011665. BHA acknowledges support by the National Science Foundation under Award Number AST-2009251. AIM acknowledges the support of Schmidt Sciences.
Author contributions are as follows. IM performed the analysis and wrote the majority of the paper. EG advised IM, suggested approaches, and provided feedback on the text. JFC suggested and advised on methodology for the normalizing flow, and provided feedback on the manuscript. BHA provided guidance on bias discussions and feedback on the manuscript. AIM designed and led the development of the RAIL software. SS advised on the use of FlexZBoost and provided feedback on the manuscript.
The DESC acknowledges ongoing support from the Institut National de Physique Nucléaire et de Physique des Particules in France; the Science & Technology Facilities Council in the United Kingdom; and the Department of Energy, the National Science Foundation, and the LSST Corporation in the United States. DESC uses resources of the IN2P3 Computing Center (CC-IN2P3-Lyon/Villeurbanne - France) funded by the Centre National de la Recherche Scientifique; the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231; STFC DiRAC HPC Facilities, funded by UK BEIS National E-infrastructure capital grants; and the UK particle physics grid, supported by the GridPP Collaboration. This work was performed in part under DOE Contract DE-AC02-76SF00515.

Table 1. Summary of outlier fractions, NMADs and bias achieved for all color-shifting and augmentation cases. We have abbreviated redshift augmentation as 'z', magnitude augmentation as 'Mag' and color augmentation as 'col'. Values in the parentheses in the outlier fraction columns are the catastrophic outliers. The bolded values correspond to the best performing augmentation case.