Improving Photometric Redshift Estimation for Cosmology with LSST Using Bayesian Neural Networks

We present results exploring the role that probabilistic deep learning models can play in cosmology from large-scale astronomical surveys through photometric redshift (photo-z) estimation. Photo-z uncertainty estimates are critical for the science goals of upcoming large-scale surveys such as the Legacy Survey of Space and Time (LSST); however, common machine learning methods typically provide only point estimates and lack uncertainties on predictions. We turn to Bayesian neural networks (BNNs) as a promising way to provide accurate predictions of redshift values with uncertainty estimates. We have compiled a galaxy data set from the Hyper Suprime-Cam Survey with grizy photometry, which is designed to be a smaller-scale version of large surveys like LSST. We use this data set to investigate the performance of a neural network and a probabilistic BNN for photo-z estimation and evaluate their performance with respect to LSST photo-z science requirements. We also examine the utility of photo-z uncertainties as a means to reduce catastrophic outlier estimates. The BNN outputs the estimate in the form of a Gaussian probability distribution, and we use its mean and standard deviation as the redshift estimate and uncertainty. We find that the BNN can produce accurate uncertainties. Using a coverage test, we find excellent agreement with expectation: 67.2% of galaxies in the range 0 < z < 2.5 have 1σ uncertainties that cover the spectroscopic value. We also include a comparison to alternative machine learning models using the same data. We find the BNN meets two out of three of the LSST photo-z science requirements in the range 0 < z < 2.5.


INTRODUCTION
Dark matter and dark energy comprise ∼95% of the energy density of the universe, but their natures are largely unknown. To investigate these entities, large-scale extragalactic surveys such as the Legacy Survey of Space and Time (LSST; e.g. Ivezić et al. 2008) and Euclid (e.g. Collaboration et al. 2022) will soon provide observations of billions of galaxies. Cosmological probes of dark matter and dark energy aim to measure the structure and evolution of the universe, and thus rely in part on accurately and precisely measuring the redshifts of hundreds of millions of galaxies with well-constrained uncertainties. Therefore the task of obtaining sufficiently accurate photometric redshift estimates and understanding the error properties of these estimates is a major challenge.
Corresponding author: Evan Jones, Tuan Do (evan.jones@astro.ucla.edu, tdo@astro.ucla.edu)
Spectroscopic redshift measurements are the most reliable method of obtaining redshifts, but they are too time consuming to supply the number of redshifts required for cosmological measurements. Photometric redshift estimation (photo-z) can provide redshifts for billions of galaxies; however, photo-z estimates are subject to significant systematic errors because the spectral information of a galaxy is sampled with only a limited number of imaging bands. These systematic errors can manifest as outlier predictions that are far from their true redshift, biases in the distribution of redshift predictions, and large scatter in redshift predictions (e.g. Newman & Gruen 2022). These systematics strongly affect science goals such as weak lensing inferences of cosmological parameters, since photo-z uncertainties are propagated into models constraining cosmological quantities. Any photo-z model developed for potential application to these science missions must therefore produce uncertainties on photo-z predictions.
According to the LSST Science Requirements Document (SRD; see footnote 1), sufficiently accurate photo-z estimates for ∼four billion galaxies are required to meet the LSST science goals for their main cosmological sample. Specifically, for the i < 25 flux-limited galaxy sample measured by LSST, one must achieve
• number of galaxies ≈ 10^7
• rms error < 0.2 (Equation 3 in Table 2)
• bias < 0.003 (Equation 4 in Table 2)
• 3σ catastrophic outliers < 10% of the total sample (Equation 2 in Table 2)
Currently, no published model satisfies the LSST photo-z science requirements up to z = 3 (Tanaka et al. 2018a; Schuldt et al. 2020a; Schmidt et al. 2020a). Additionally, methods for rejecting the majority of outliers and characterizing their effects on the predictions must be developed (Ivezić et al. 2018). Beyond the LSST metrics stated in the SRD, we consider additional probabilistic metrics for quantifying the quality of uncertainty estimates (see Table 2; Malz & Hogg 2020; Schmidt et al. 2020a; Jones et al. 2022). The requirement thresholds for the probabilistic metrics are not as well quantified at this time as those for point metrics, but they allow us to compare the performance of different probabilistic models evaluated on the same data. Techniques for identifying photo-z outlier predictions in machine learning models have been investigated in Jones & Singal (2020), Wyatt & Singal (2020), and Singal et al. (2022).
Photo-z estimation techniques have traditionally been divided into two main approaches. Template fitting methods, such as Lephare (Ilbert et al. 2006; Arnouts et al. 1999) and Mizuki (Nishizawa et al. 2020), fit observed photometry with template galaxy spectra. Empirical methods, such as DEmP (Tanaka et al. 2018b) and others, develop a mapping from input parameters to redshift with a training set of data in which the actual spectroscopic redshifts are known, then apply the mappings to data for which the redshifts are to be estimated. Both have their drawbacks: template fitting methods require assumptions about intrinsic galaxy spectra or their redshift evolution, and empirical methods require the training set and evaluation set to significantly overlap in parameter space. As machine learning approaches for photo-z estimation have increased in capability and larger data sets have been observed over the past decade, galaxy images can be effectively used as inputs to utilize morphological information for photo-z estimation, unlike template-fitting approaches.
1 https://docushare.lsstcorp.org/docushare/dsweb/Get/LPM-17
There have been a number of works studying the application of neural networks for photo-z estimation (Firth et al. 2003; Collister & Lahav 2004; Singal et al. 2011), while probabilistic NN techniques have had limited investigations until recently (Sadeh et al. 2016; Pasquet et al. 2019; Schuldt et al. 2020a; Zhou et al. 2022). Bayesian neural networks (BNNs), a type of probabilistic NN (Jospin et al. 2020), are a promising approach that has not been well explored. Probabilistic neural networks, conceptualized in the 1990s (Specht 1990), have previously been limited in their ability to process the volume of data required for photo-z estimation in large-scale surveys because of their computational complexity. However, recent breakthroughs in conceptual understanding and computational capabilities (e.g. Filos et al. 2019; Dusenberry et al. 2020) now make probabilistic deep learning possible for cosmology. Probabilistic deep learning with a BNN has many advantages compared to traditional neural networks, including better uncertainty representations, better point predictions, and better interpretability, because the networks can be viewed through the lens of probability theory. In this way we can draw upon decades of development in Bayesian inference analyses.
We have three goals in this work: (1) develop a probabilistic ML model that can produce robust uncertainties for photometric redshifts, (2) assess the model with respect to LSST requirements and alternative photo-z estimation methods, and (3) investigate the use of photo-z uncertainties to identify likely outliers in photometric redshift predictions. For the analysis in this work, we have created the largest publicly available machine-learning-ready galaxy image data set, consisting of ∼300k galaxies from the Hyper Suprime-Cam survey with five-band photometric images and known spectroscopic redshifts spanning 0 < z < 4. This data will be released in Do et al. 2024 (in prep). In §2 we discuss the data and network architecture. In §3 we state the results. In §4 and §5 we provide a discussion and conclusion.

Data: Galaxy observations
For the analysis in this work we compile a data set intended to approximate the data produced by future large-scale deep surveys for photo-z estimation (Collaboration et al. 2021). We use the Hyper Suprime-Cam (HSC) Public Data Release 2 (PDR2; Aihara et al. 2019), which is designed to reach similar depths as LSST but over a smaller portion of the sky. We choose the HSC survey because it mimics LSST in photometry and depth. Including photometry in infrared bands would improve photo-z estimates, but since LSST will provide observations in only optical bands (Ivezić et al. 2008), we restrict our analysis to optical bands. HSC is a wide-field optical camera with a field of view (FOV) of 1.8 deg^2 on the Subaru Telescope. HSC PDR2 surveys more than 300 deg^2 in five optical filters (grizy). The median seeing in the i-band is 0.6". This data set is presented in more detail in Do et al. 2024 (in prep). The final data set used in the analyses of this paper consists of ∼300k galaxies with 5-band grizy photometry and spectroscopic redshifts. (2013), Coil et al. (2011), Cool et al. (2013). We used data quality cuts similar to Nishizawa et al. (2020) and Schuldt et al. (2021) (see Table 1 and Do et al. 2024 (in prep) for a full list), which are intended to remove outlier photometric measurements and poorly measured spectroscopic redshifts. We also required detections in each band. The spectroscopic redshift values are treated as the ground truth for training and evaluation. In total, the data consists of 286,401 galaxies with broadband grizy photometry and known spectroscopic redshifts. Our galaxy sample spans 0.01 < z < 4; however, the majority of the sample lies between redshifts of 0.01 and 2.5, with peaks at z ∼ 0.3 and z ∼ 0.6 (see N(z) in Fig. 1). We use 80% of the galaxies for training, 10% for validation, and 10% for testing. The data used for training is available from Jones et al. (2021a). This dataset includes the photometry and spectroscopic redshifts. A future release will also include images (Do et al. in prep.).
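As an illustration of the 80%/10%/10% partition described above, the following is a minimal sketch, not the pipeline used in this work. The function name `split_indices`, the random seed, and the use of a simple random shuffle are assumptions made for the example.

```python
import numpy as np

def split_indices(n_gals, frac_train=0.8, frac_val=0.1, seed=0):
    """Return shuffled index arrays for training, validation, and testing."""
    rng = np.random.default_rng(seed)      # seed is an arbitrary choice
    order = rng.permutation(n_gals)        # random ordering of galaxy indices
    n_train = int(frac_train * n_gals)
    n_val = int(frac_val * n_gals)
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]         # the remainder goes to the test set
    return train, val, test

# 286,401 is the total number of galaxies quoted in the text.
train_idx, val_idx, test_idx = split_indices(286_401)
```

The three index arrays are disjoint by construction, so no galaxy used for training can appear in the evaluation set.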

Network architectures
We built two neural networks for this work: a fully connected neural network (NN) that produces single-valued redshift predictions and a Bayesian neural network (BNN) that outputs Gaussian probability distributions. The NN and BNN models are visualized in Fig. 3. Both the NN and the BNN are implemented in TensorFlow (Abadi et al. 2016) and have five input nodes for the five-band grizy photometry. We performed a grid search to optimize free parameters such as the number of epochs, number of layers, number of nodes per layer, learning rate, loss function, activation function, and optimizer. Both the NN and BNN used for the final analysis in this work contain four hidden layers with 200 nodes per layer and use a rectified linear activation function. The networks also have a skip connection between the input nodes and the final layer. The NN has an output node that produces a single point-estimate photo-z prediction, while the BNN has a final output node that produces a mean and standard deviation, assuming a Gaussian distribution for each photo-z prediction. For the BNN we use a negative log-likelihood loss function with RMS error as the metric. We choose the negative log-likelihood loss function for the BNN because it has been shown to be more effective than mean absolute error (MAE) for probabilistic NNs (Lakshminarayanan et al. 2016). The NN uses a mean absolute error loss function, and we also consider a custom loss function (Nishizawa et al. 2020) defined in Equation 6 of Table 2. The NN and BNN use the Adam optimizer and have learning rates of 0.0005 and 0.001, respectively. We train using a 16-core AMD Ryzen Threadripper PRO 3955WX CPU and an NVIDIA RTX A6000 GPU. Training and evaluation runtimes are typically under 30 minutes.
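To make the BNN loss concrete, the sketch below evaluates a Gaussian negative log-likelihood in plain NumPy. It is an illustration of the standard formula only, not the TensorFlow implementation used in this work; the function name `gaussian_nll` and the example values are assumptions.

```python
import numpy as np

def gaussian_nll(z_spec, mu, sigma):
    """Mean negative log-likelihood of z_spec under N(mu, sigma^2)."""
    z_spec, mu, sigma = map(np.asarray, (z_spec, mu, sigma))
    return float(np.mean(0.5 * np.log(2.0 * np.pi * sigma**2)
                         + 0.5 * ((z_spec - mu) / sigma) ** 2))

# Hypothetical example: the same wrong prediction (mu = 0.5, truth = 1.0)
# is penalized far more when the quoted uncertainty is too small.
nll_overconfident = gaussian_nll([1.0], [0.5], [0.05])
nll_honest = gaussian_nll([1.0], [0.5], [0.5])
```

Unlike a mean absolute error loss, this loss depends on the predicted standard deviation, which is why it can train a network to report uncertainties that track its actual errors.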

Other ML models
We use three other common ML models for comparison with the neural network performance: (1) a support vector machine (SVM) classification model, (2) a random forest (RF) regression model, and (3) a gradient boosted tree regression model. For the RF model we use the Scikit-Learn implementation (Pedregosa et al. 2011). For the SVM we use SPIDERz (Jones & Singal 2017, 2020), which implements support vector classification on classes of redshift bins of width Δz = 0.1 spanning 0 < z < 4. The RF model uses the RandomForestRegressor class to produce photo-z estimates, with default hyperparameters except for using 200 trees in the forest. We use the XGBoost software package for the gradient boosted tree model (the XGBRegressor class) with default hyperparameters. We performed a broad hyperparameter grid search for each model; however, the performance boost over default parameters is not significant.

PHOTO-Z METRICS
Photo-z uncertainties are propagated to measurement uncertainties on dark matter and dark energy. Therefore, our choice of metrics to evaluate the photo-z determinations in this work includes chiefly the photo-z metrics used in the LSST science requirements document (RMS error (Eq. 3), Bias (Eq. 4), and 3σ Outliers (Eq. 7)), which are calculated to provide the necessary precision to constrain important cosmological quantities. Specifically, for the purpose of constraining dark matter and dark energy we require photo-z RMS error (< 0.2), Bias (< 0.003), and 3σ Outliers (< 10%). In addition, we include in our analysis a number of point metrics that are commonly used in the photo-z literature (Outlier (Eq. 1), Catastrophic Outlier (Eq. 2), Scatter (Eq. 5), and Loss (Eq. 6)) for the purpose of comparison to other models, as well as additional probabilistic metrics to evaluate the photo-z uncertainties produced by the BNN. To measure model performance we evaluate predictions using the metrics in Table 2, which are separated into non-probabilistic and probabilistic categories. These metrics describe different ways to characterize the photometric redshift performance averaged over all predictions. Ideally, photo-z measurements should be accurate out to the redshift limit of LSST observations (z = 3.4 is where galaxies begin dropping out of the g band); however, the main redshift range of focus is 0.3 < z < 3.0. In this redshift range, LSST aims to measure the comoving distance as a function of redshift to an accuracy of 1-2%. In order to achieve this goal, LSST must obtain (1) a sufficiently large sample of galaxies (∼four billion) and (2) sufficiently accurate photo-z measurements for these galaxies as defined by the aforementioned requirements. In addition to meeting photo-z science requirements, the LSST team also requires 'methods for rejecting the majority of those outliers, and for characterizing their effects on the sample'.
We note that science missions of LSST and Euclid for which photo-zs are necessary divide the redshift ranges of interest into several discrete tomographic redshift bins (0 < z < 1.5 divided into four bins of width Δz = 0.3 in weak lensing analyses). The photo-z science requirements must be achieved on average throughout each tomographic redshift bin, rather than on average throughout the entire sample. This means that a full evaluation of a particular photo-z method must include an evaluation of important metrics as a function of redshift, rather than averaging across the entire photo-z sample. This distinction is particularly important for evaluating model performance in high redshift regions (z > 1.0), which contain significantly fewer galaxies than low redshift regions (see Fig. 1) and are thus more challenging for any photo-z method.
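Evaluating a metric per redshift bin rather than over the whole sample can be sketched as follows. This is an illustrative helper, not the analysis code of this work: the function name `metric_per_bin`, the bin edges, and the toy values are all assumptions.

```python
import numpy as np

def metric_per_bin(z_spec, values, edges):
    """Average `values` over galaxies in each [edges[i], edges[i+1]) bin of z_spec."""
    z_spec, values = np.asarray(z_spec), np.asarray(values)
    idx = np.digitize(z_spec, edges) - 1       # bin index for every galaxy
    out = np.full(len(edges) - 1, np.nan)      # NaN marks empty bins
    for b in range(len(edges) - 1):
        sel = idx == b
        if sel.any():
            out[b] = values[sel].mean()
    return out

# Hypothetical bin edges and per-galaxy values, for illustration only.
edges = np.array([0.0, 0.3, 0.6, 0.9, 1.2, 1.5])
per_bin = metric_per_bin([0.1, 0.2, 0.4, 2.0], [1.0, 3.0, 5.0, 7.0], edges)
```

Galaxies outside the outermost edges are ignored, and sparsely populated bins remain visible as NaN instead of being hidden inside a sample-wide average.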

Point Metrics
We use the conventional definitions for photometric redshift outliers and catastrophic outliers in Eqs. 1 and 2, where z_phot and z_spec are the estimated photo-z and actual (spectroscopically determined) redshift of the galaxy. The RMS photo-z error is given by a standard definition in Eq. 3, where n_gals is the number of galaxies in the evaluation testing set and Σ_gals represents a sum over those galaxies. Bias and scatter are defined in Eqs. 4 and 5. We follow Tanaka et al. (2018b) and define a loss function in Eq. 6 to characterize the point estimate photo-z accuracy with a single number, where we use γ = 0.15.
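The point metrics can be sketched in NumPy as below. Table 2 is the authoritative source for the definitions; since its equations are not reproduced in this text, the sketch assumes forms that are common in the photo-z literature: a residual normalized by (1 + z_spec), an outlier threshold of 0.15, a catastrophic outlier threshold of |z_phot − z_spec| > 1, a mean-based bias, and a loss of the form 1 − 1/(1 + (Δz/γ)^2). These choices, and the function name `point_metrics`, are assumptions and may differ from the paper's exact equations.

```python
import numpy as np

def point_metrics(z_phot, z_spec, gamma=0.15):
    """Point metrics under assumed, commonly used definitions (see lead-in)."""
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    dz = (z_phot - z_spec) / (1.0 + z_spec)          # assumed normalized residual
    return {
        "outlier_frac": float(np.mean(np.abs(dz) > 0.15)),                 # assumed threshold
        "cat_outlier_frac": float(np.mean(np.abs(z_phot - z_spec) > 1.0)),  # assumed threshold
        "rms": float(np.sqrt(np.mean(dz**2))),
        "bias": float(np.mean(dz)),                                         # assumed mean-based bias
        "loss": float(np.mean(1.0 - 1.0 / (1.0 + (dz / gamma) ** 2))),      # assumed loss form
    }

# Toy inputs: a perfect prediction and a single badly wrong one.
perfect = point_metrics([0.3, 1.2], [0.3, 1.2])
one_bad = point_metrics([3.0], [1.0])
```

The loss saturates at 1 for very large residuals, so a handful of catastrophic failures cannot dominate it the way they dominate the RMS error.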

Probability metrics
We propose coverage as a key metric for assessing the performance of the BNN (see Eq. 8). Coverage is typically used to assess whether confidence intervals are accurate. In this case, we define coverage as the fraction of galaxies that have a spectro-z within their 68% confidence interval. Ideally, 68% of evaluated galaxies should have true spectro-zs within their 68% confidence interval. If the coverage is over 68%, then the estimated uncertainties are on average too large. Similarly, if the coverage is below 68%, the estimated uncertainties are on average too small.
The inputs for both networks are five-band photometry in the g, r, i, z, y filters. The output for the NN is a single point photo-z estimate while the output for the BNN is a photo-z PDF, which we sample to obtain a photo-z estimate. We assume Gaussianity in the creation of the photo-z PDF, so a photo-z uncertainty is produced by the standard deviation of the PDF.
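As a sketch of the coverage test defined in this section, the code below counts how often the spectroscopic redshift falls within one standard deviation of the predicted mean, which is the 68% interval for a Gaussian PDF. The synthetic predictions, the seed, and the function name `coverage` are assumptions used only to show the behavior.

```python
import numpy as np

def coverage(z_spec, mu, sigma):
    """Fraction of galaxies whose z_spec lies within mu +/- sigma."""
    z_spec, mu, sigma = map(np.asarray, (z_spec, mu, sigma))
    return float(np.mean(np.abs(z_spec - mu) <= sigma))

# Synthetic check: well-calibrated Gaussian uncertainties give coverage near 0.68.
rng = np.random.default_rng(0)
mu = rng.uniform(0.0, 2.5, size=100_000)
sigma = np.full_like(mu, 0.05)
z_spec = rng.normal(mu, sigma)                      # truth scattered by exactly sigma
cov_calibrated = coverage(z_spec, mu, sigma)
cov_inflated = coverage(z_spec, mu, 2.0 * sigma)    # uncertainties too large
```

Doubling the quoted uncertainties pushes the coverage well above 68%, which is the over-covered case described above.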
Error in the bulk photo-z distribution width for the evaluation set can be difficult to distinguish from uncertainties associated with galaxy bias or uncertainties in the mean redshift of photo-z tomographic bins. The Probability Integral Transform (PIT) is a photo-z metric that can detect systematic error in the photo-z distribution width for galaxy samples with known spectroscopic redshifts (Malz & Hogg 2020; Malz 2021). The PIT value for a single galaxy is defined in Eq. 9 in Table 2, where p(z) is the predicted photo-z PDF.
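For a Gaussian photo-z PDF the PIT has a closed form. The sketch below assumes the standard definition of the PIT as the cumulative distribution of p(z) evaluated at z_spec (Eq. 9 itself is not reproduced in this text); the function name `gaussian_pit` is an assumption.

```python
import math

def gaussian_pit(z_spec, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at z_spec (assumed PIT definition)."""
    return 0.5 * (1.0 + math.erf((z_spec - mu) / (sigma * math.sqrt(2.0))))

# A spectroscopic redshift exactly at the predicted mean gives PIT = 0.5;
# one standard deviation above the mean gives about 0.84.
pit_center = gaussian_pit(0.5, 0.5, 0.1)
pit_plus_one_sigma = gaussian_pit(0.6, 0.5, 0.1)
```

If the predicted PDFs are well calibrated, the PIT values of the evaluation sample are uniformly distributed between 0 and 1, so a histogram of them is flat.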

Leveraging BNN for Outlier Identification
We propose a method for utilizing the photo-z uncertainties σ_z produced by the BNN to preemptively flag photo-z predictions with high uncertainties as potential poor predictions. The method is simple: all galaxies with a photo-z uncertainty greater than a specified σ_z cutoff value are flagged as potential outlier or catastrophic outlier candidates and removed from the evaluation sample. Figs. 4, 5, and 6 depict performance improvements with example σ_z removal values for a variety of performance metrics including the LSST photo-z requirements. An acceptable balance needs to be achieved between the number of galaxies correctly flagged as poor predictions versus the number of non-outlier galaxies removed for a given σ_z cutoff value. Other outlier removal strategies have previously been explored in Jones & Singal (2020), Wyatt & Singal (2020), and Singal et al. (2022).
With the data used in this work we find a significant reduction in the number of catastrophic outliers and outliers by sacrificing a minimal number of non-outlier predictions; for example, we find that by removing all galaxies in the evaluation sample with a photo-z uncertainty σ_z > 0.3, the RMS error was reduced by 57.6%, outliers were reduced by 70.1%, and catastrophic outliers were reduced by 80.43%, at the cost of removing only 11% of the evaluation set. See Fig. 9 for the N(z) distribution of removed galaxies for example cases of σ_z > 0.3 and σ_z > 0.5.
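The uncertainty cut itself is a one-line mask. The sketch below shows it in NumPy; the function name `apply_sigma_cut` and the toy arrays are assumptions, and the cutoff of 0.3 is one of the example values quoted in the text.

```python
import numpy as np

def apply_sigma_cut(z_phot, z_spec, sigma_z, cutoff=0.3):
    """Drop galaxies with sigma_z above the cutoff; also return the removed fraction."""
    z_phot, z_spec, sigma_z = map(np.asarray, (z_phot, z_spec, sigma_z))
    keep = sigma_z <= cutoff                 # galaxies that survive the cut
    removed_frac = 1.0 - float(keep.mean())
    return z_phot[keep], z_spec[keep], removed_frac

# Toy example: two of four galaxies exceed the cutoff and are removed.
zp, zs, frac = apply_sigma_cut(
    z_phot=[0.31, 2.90, 0.62, 0.10],
    z_spec=[0.30, 0.50, 0.60, 1.80],
    sigma_z=[0.10, 0.40, 0.20, 0.60],
)
```

Reporting the removed fraction alongside the metrics of the surviving sample makes the trade-off between purity and sample size explicit for each cutoff.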

RESULTS
The BNN generally satisfies LSST photo-z science requirements in the range 0.3 < z < 1.5 (the redshift range for weak lensing analyses; see Fig. 8) and performs as well as or better than the 6 common alternative methods investigated in this study (see Table 3 and Figs. 7 and 8). We compare the BNN and NN to a support vector machine (Cortes & Vapnik 1995), a random forest (Breiman 2001), and a gradient boosting model, XGBoost (Chen & Guestrin 2016), using the same data discussed in §2.1. We also compare to photometric redshift predictions from the HSC team (Nishizawa et al. 2020), which used the template-fitting model Mizuki and the empirical method DEmP (Hsieh & Yee 2014). For this comparison, we relied on the photometric redshifts produced by the HSC team. Mizuki and DEmP were trained and evaluated on a slightly larger data set of 300k galaxies by Nishizawa et al. (2020), but the majority of galaxies overlap with the data set introduced in this work. We crossmatched the HSC data set with the object IDs of our data to obtain a pre-evaluated sample of ∼60 thousand galaxies. Another photo-z investigation performed by Schuldt et al. (2021) utilized HSC imaging data and obtained a precision of Δz = |z_phot − z_spec| = 0.12 with a convolutional neural network averaged over all galaxies in the redshift range 0 < z < 4. We obtain Δz = 0.0031 for the NN and Δz = 0.0032 for the BNN averaged over all galaxies in our data set in this range. We note that a perfect comparison between photo-z models requires identical training, validation, and evaluation data sets. While the photo-z models from Schuldt et al. (2020b) and the HSC team compared in this work utilized largely the same data that was used in this work, there are some differences between their data and the data used in this investigation, which introduces additional uncertainty in the comparison between results.
We note that training a Bayesian neural network is not deterministic: repeated training runs on the same data produce different weights, because the weights in the variational layers are sampled from a Gaussian distribution. The results presented here are representative of a typical training run with the BNN model presented in this work. However, there can be variations of several percent in outlier rates and other metrics depending on the training run. There can also be variations in the final loss achieved at the end of training. We find that the accuracy of a particular training run is correlated with the final loss value.

Using BNN Uncertainties to Identify Outliers
The BNN with the outlier removal method discussed in §3.1 stands out as the overall best performing model for the majority of photo-z performance metrics considered in this work, achieving the lowest percentage of outliers, catastrophic outliers, and RMS error. The outlier removal method described in §3.1 is visualized in Figs. 4, 5, 6, 8, and 9. Notably, Fig. 5 shows the performance of the NN and BNN with respect to LSST photo-z science requirements. Using the photo-z uncertainties produced by the BNN to remove poor predictions significantly reduces RMS error. The BNN satisfies the LSST photo-z science requirements with respect to RMS error and 3σ outlier fraction across 0 < z < 2.5; however, the bias requirement is only partially met in the range 0.3 < z < 1.2. We note that the BNN bias deviation from the acceptable range coincides with the drop-off of the galaxy population in the N(z) distribution in Fig. 1.

Bayesian Neural Network Photo-z Uncertainty Estimates
We find that the BNN produces accurate uncertainties as defined by the probabilistic metrics. The quality of the uncertainties produced by the BNN is visualized in Figs. 5, 6, and 10. The BNN 3σ outlier fraction is shown in Fig. 5, which indicates that uncertainties are generally well-estimated on average across the redshift range 0 < z < 2.5. It is notable that the BNN performs best with respect to the 3σ outlier fraction when no galaxies with large uncertainties are removed. The PIT histogram produced for a sample determination with the BNN is shown in Fig. 10. The PIT histogram is generally flat, as is desired; however, the slight bump in the middle indicates that the photo-z PDFs tend to be overly broad. For a comparison of PITs produced by other probabilistic photo-z methods (performed on different data) see Schmidt et al. (2020b). The BNN uncertainty coverage of the sample is provided in Fig. 6, showing acceptable agreement with the target 68% confidence interval across the target redshift interval for weak lensing applications, 0.3 < z < 1.5, indicating that the uncertainties of photo-z estimates for this galaxy population are accurately defined.
The results from evaluating the NN and BNN models on the evaluation set are available at https://zenodo.org/doi/10.5281/zenodo.10145347.

Investigating the effect of non-representative training data
The brightness distribution of the training sample peaks at brighter magnitudes than that of the full HSC photometric sample because of the need for spectroscopy. We investigated how this bias might affect our results by re-sampling the testing dataset to have a magnitude distribution closer to that of the original HSC dataset. We find that the performance on this re-sample is similar or slightly worse, by about 1 to 2% depending on the metric. See Appendix A for more details.

DISCUSSION
Future large-scale astronomical surveys will provide high quality observations of billions of celestial objects that will be used to investigate the mysterious and unknown nature of dark matter and dark energy. LSST will play a crucial part in this investigation; we model our analysis here with respect to the photo-z science requirements provided by the LSST team. Bayesian neural networks have been used in the past for photo-z estimation (e.g. Zhou et al. 2022; Schuldt et al. 2020b; Jones et al. 2021b); however, this is the first BNN (following our prototype in Jones et al. (2022)) applied to photometry observations of a representative dataset similar to what we will obtain from future large scale surveys like LSST. This work is among the first that evaluates the photo-z model performance with respect to the LSST science requirements as a function of redshift.
The BNN largely satisfies LSST science requirements in the redshift range of interest for LSST weak lensing surveys (0.3 < z < 1.5) and outperforms alternative models on the same data; however, the BNN does not fully satisfy the bias requirement. We believe the BNN model can be further optimized for these requirements. Compared to the NN model, the BNN has the advantage of producing uncertainties for each prediction, which are both required for precision cosmology and can be used to eliminate galaxies with large uncertainties from the data sample. We note that both the NN and BNN models generally perform worse at higher redshifts, which is due in large part to the reduced signal to noise for distant dim sources and also to the disproportionately small number of high redshift sources (z > 2.5) compared to low redshift sources (see Fig. 1 and also the discussion in Wyatt & Singal (2020)).
The BNN model introduced here is an improved version of the model we introduced in a previous work (Jones et al. 2022). The uncertainty estimates produced by the BNN model discussed in this work are significantly improved over the previous model, due in large part to optimizing the learning rate during training and modifying the network architecture. The previous network contained four variational layers, which we adjusted to three dense layers and one variational layer. We find that the architecture with all variational layers produces coverage that is generally 10% larger than expectation (uncertainties too large). Using a single variational layer now produces more accurate coverage.
The BNN uncertainty coverage of the sample provided in Fig. 6 shows excellent agreement with the target 68% confidence interval up to z = 1.5, indicating the uncertainties of photo-z estimates for this galaxy population are accurately defined. Beyond z = 1.5, the coverage oscillates around the target 68% level. A likely explanation for the reduced quality of galaxy uncertainties beyond z = 1.5 is the lack of data samples at this redshift range (see Fig. 1) compared to lower redshifts. Another possible factor affecting photo-z uncertainties may result from a disparity between the complexity present in the band magnitudes compared to the BNN model; we use five photometric band fluxes paired with a single spectroscopic redshift per galaxy for training. In a future work we will apply galaxy photometric images to a Bayesian convolutional neural network, which is likely to contribute more useful information than the five photometric measurements per galaxy.
Another benefit of using a BNN for photo-z estimation is the use of the photo-z uncertainty σ_z to preemptively flag photo-z predictions with high uncertainties as potential poor predictions. An acceptable balance needs to be achieved between the number of galaxies correctly flagged as poor predictions versus the number of non-outlier galaxies removed for a given σ_z cutoff value. With the data used in this work we find that a significant reduction in the number of catastrophic outliers and outliers can be achieved by sacrificing a relatively small number of non-outlier predictions; for example, we find that by removing all galaxies in the evaluation sample with a photo-z uncertainty σ_z > 0.3, the RMS error was reduced by 57.6%, outliers were reduced by 70.1%, and catastrophic outliers were reduced by 80.43%, at the cost of removing only 11% of the evaluation set.

CONCLUSION
In preparation for the coming influx of data from large scale surveys like LSST, it is important to prepare photo-z estimation models in advance. Such models must provide both accurate photo-z predictions and reliable photo-z uncertainties, which are required for using photo-z predictions in subsequent cosmological analyses. The quality of photo-z models should be assessed using data that is representative of data from future large scale surveys, and principally evaluated using the scientific requirements provided by those surveys.
Table 3. Comparison of the performance results with each model discussed in §2. We use the data discussed in §2.1 to train and evaluate a NN, a BNN, the SVM SPIDERz (Jones & Singal 2017), a random forest (Breiman 2001), and the gradient boosting model XGBoost (Chen & Guestrin 2016). We also include a comparison to the template-fitting model, Mizuki, and empirical method, DEmP (Hsieh & Yee 2014), that were evaluated on a larger, overlapping data set in Nishizawa et al. (2020). To form a comparison to Mizuki and DEmP in this work, we crossmatched the larger data set with the object IDs of our data discussed in §2.1 to obtain a pre-evaluated sample of 60 thousand galaxies.
Figure 5. BNN and NN performance with respect to LSST photo-z requirements. We note that the 3σ outlier fraction can only be calculated with the BNN because the metric requires photo-z uncertainties, so we additionally include the standard outlier fraction for the NN and BNN for comparison. The plots reflect results with 80% of galaxies for training, 10% for validation, and 10% for evaluation. We include only those results in the redshift range 0 < z < 2.5 because the N(z) distribution of the data set degrades significantly at higher redshifts (see Fig. 1); results there would likely improve significantly given sufficient training data.

This work introduces a BNN model for photometric redshift estimation. We apply the BNN to data from the Hyper Suprime-Cam survey, which is designed to reflect the data we will soon receive from large scale surveys such as LSST. We evaluate the BNN with respect to LSST science requirements and compare the results to alternative photo-z estimation tools including a fully connected neural network, random forest, support vector machine (Jones & Singal 2017), XGBoost, Mizuki, and DEmP (Aihara et al. 2018, 2019). We find that the BNN meets two of the three LSST photo-z requirements in the redshift range considered for weak lensing cosmological probes (0.3 < z < 1.5) and provides superior photo-z estimations to the other models.
Figure 6. The fraction of galaxies that have a spectro-z within their 68% confidence interval. Ideally, 68% of evaluated galaxies should have true spectro-zs within their 68% confidence interval. If more than 68% of evaluated galaxies have spectro-zs within their 68% confidence interval, the galaxies are considered 'over-covered' because their photo-z uncertainties are too large. The same logic applies for 'under-covered' galaxies.
A key attribute of the BNN model is the production of photo-z uncertainties, which are needed for using photo-z results in cosmological analyses. We find that the BNN produces accurate uncertainties. Using a coverage test, we find excellent agreement with expectation: 68.5% of galaxies in the range 0 < z < 2.5 have 1σ uncertainties that cover the spectroscopic value. In addition, the BNN photo-z uncertainties can be used to flag likely outlier or catastrophic outlier estimates with high success.
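The coverage test described above can be illustrated with a minimal sketch. It assumes each galaxy has a point estimate and a Gaussian 1σ uncertainty, and counts how often the spectroscopic value falls inside the interval z_phot ± σz. The function name `coverage_fraction` and the toy numbers are hypothetical and are not taken from the analysis code; for well-calibrated Gaussian uncertainties the result should approach roughly 68%.

```python
def coverage_fraction(z_spec, z_phot, sigma_z, n_sigma=1.0):
    """Fraction of galaxies whose spectro-z lies within n_sigma * sigma_z of z_phot.

    Hypothetical helper for illustration; assumes symmetric Gaussian intervals.
    """
    if not (len(z_spec) == len(z_phot) == len(sigma_z)):
        raise ValueError("all inputs must have the same length")
    if len(z_spec) == 0:
        raise ValueError("inputs must not be empty")
    covered = sum(
        1
        for zs, zp, s in zip(z_spec, z_phot, sigma_z)
        if abs(zs - zp) <= n_sigma * s
    )
    return covered / len(z_spec)


# Toy values (not real survey data): three of the four galaxies are covered.
z_spec = [0.50, 1.20, 0.80, 2.00]
z_phot = [0.52, 1.00, 0.79, 2.30]
sigma_z = [0.05, 0.10, 0.02, 0.40]

frac = coverage_fraction(z_spec, z_phot, sigma_z)
print(f"1-sigma coverage: {frac:.2%}")
```

A coverage well above 68% would indicate uncertainties that are too large ('over-covered'), and a value well below 68% would indicate uncertainties that are too small.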
This analysis is subject to the potential sources of bias that affect most photometric redshift estimation studies. For example, spectroscopic redshift observations are biased toward high luminosity galaxies, particularly at higher redshift ranges (z > 1.5), which may not be fully representative of galaxy populations at a specific redshift range. Another source of bias in this analysis is the underrepresented galaxy population in the N(z) distribution beyond z = 1.5. Both of these potential sources of bias can be alleviated with improved spectroscopic samples in future galaxy surveys.
We will continue this analysis by applying the BNN method to galaxy images via a Bayesian convolutional neural network in a forthcoming paper.
We are grateful for the financial support for this work from the Sloan Foundation.
Figure 15. BNN and NN performance with respect to LSST photo-z requirements using an evaluation set with photometry that is re-sampled to approximate the bulk HSC photometry. The 3σ outlier fraction can only be calculated for the BNN because the metric requires photo-z uncertainties, so we additionally include the standard outlier fraction for the NN and BNN for comparison. The plots reflect results with 80% of galaxies for training, 10% for validation, and 10% for evaluation. We include only those results in the redshift range 0 < z < 2.5 because the N(z) distribution of the data set degrades significantly at higher redshifts (see Fig. 1); results there would likely improve significantly given sufficient training data.
Figure 16. Histogram of photo-z uncertainties produced by the BNN that exceed 0.3 and 0.5 using the re-sampled data set. By removing all galaxies in the evaluation sample with a photo-z uncertainty σz > 0.3, outliers were reduced by 70.1% and catastrophic outliers by 80.43%, at the cost of removing 11% of the evaluation set. Using a photo-z uncertainty cutoff of 0.5 reduces the number of outliers by 70.1% and catastrophic outliers by 67.8%, at the cost of removing 7.67% of the evaluation set.

Figure 1. Left: Example of a galaxy (z = 0.48) image in the i-band. Middle: five-band photometry for the same galaxy. Right: N(z) distribution for the data set discussed in §2.1. For the photo-z determinations in this work we use training, validation, and testing sets consisting of 229,120, 28,640, and 28,640 galaxies, respectively.

Figure 2. Example HSC galaxy images for the data set used in this work with grizy photometry for a low redshift galaxy at z = 0.05 (TOP), a high redshift galaxy at z = 3.92 (MIDDLE), and another low redshift galaxy at z = 0.14 (BOTTOM). The similarity between the high redshift galaxy and the bottom low redshift galaxy highlights the difficulty of photo-z estimation.

Figure 3. Left: NN architecture. Right: BNN architecture. The inputs for both networks are five-band photometry in the g, r, i, z, y filters. The output for the NN is a single point photo-z estimate, while the output for the BNN is a photo-z PDF, which we sample to obtain a photo-z estimate. We assume Gaussianity in the creation of the photo-z PDF, so a photo-z uncertainty is produced by the standard deviation of the PDF.
A histogram of PIT values for a galaxy sample should be uniform for an accurate collection of p(z) samples. Ideally, the PIT histogram is flat across all bins. If the PIT histogram peaks at the center, the p(z) collection is too broad. If the PIT histogram peaks at high and low PIT values, the p(z) samples are too narrow. For a comparison of several probabilistic photo-z methods, see Schmidt et al. (2020b).
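As an illustration of the PIT diagnostic, the sketch below assumes each photo-z PDF is Gaussian with mean μ and standard deviation σ, in which case the PIT value of a galaxy is the Gaussian CDF evaluated at its spectroscopic redshift. The helper names `gaussian_pit` and `pit_histogram` and the toy inputs are hypothetical and are not part of the original analysis.

```python
from math import erf, sqrt


def gaussian_pit(z_spec, mu, sigma):
    """PIT value: CDF of a Gaussian N(mu, sigma) evaluated at the spectro-z."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return 0.5 * (1.0 + erf((z_spec - mu) / (sigma * sqrt(2.0))))


def pit_histogram(pits, n_bins=10):
    """Count PIT values in n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in pits:
        # Clamp p == 1.0 into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Toy values (not real survey data).
z_spec = [0.50, 0.62, 1.10, 0.95]
mu = [0.50, 0.60, 1.30, 0.90]
sigma = [0.05, 0.05, 0.10, 0.10]

pits = [gaussian_pit(zs, m, s) for zs, m, s in zip(z_spec, mu, sigma)]
print(pit_histogram(pits, n_bins=5))
```

For a well-calibrated set of PDFs, the counts should be roughly equal across bins for a large sample; a central peak suggests PDFs that are too broad, and peaks near 0 and 1 suggest PDFs that are too narrow.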

Figure 4. Visualization of NN (top left) and BNN (top right) performance compared to the BNN with outlier removal criteria examples σz = 0.5 (bottom left) and σz = 0.3 (bottom right).

Figure 7. Visualization of predicted photo-zs versus measured spectroscopic redshifts by the models discussed in §2. The results of these determinations are quantified in Table 3. The colorbars indicate the density of evaluation data points as computed with a Gaussian kernel-density estimation.

Figure 8. Comparison of the percentage of outliers (Eqn 1) and catastrophic outliers (Eqn 2) achieved with each model. BNN 1 refers to the default BNN results with no galaxies removed based on a σz criterion. BNN 2 and BNN 3 refer to results obtained after removing all galaxies from the evaluation set with photo-z uncertainties greater than 0.5 and 0.3, respectively. We use the data discussed in §2.1 to train and evaluate a NN, a BNN, the SVM SPIDERz (Jones & Singal 2017), a random forest (Breiman 2001), and the gradient boosting model XGBoost (Chen & Guestrin 2016). We also include a comparison to the template-fitting model, Mizuki, and the empirical method, DEmP (Hsieh & Yee 2014), that were evaluated on a larger, overlapping data set in Nishizawa et al. (2020). To form a comparison to Mizuki and DEmP in this work, we crossmatched the larger data set with the object IDs of our data discussed in §2.1 to obtain a pre-evaluated sample of 60 thousand galaxies.

Figure 9. Histogram of photo-z uncertainties produced by the BNN that exceed 0.3 and 0.5. By removing all galaxies in the evaluation sample with a photo-z uncertainty σz > 0.3, outliers were reduced by 70.1% and catastrophic outliers by 80.43%, at the cost of removing 11% of the evaluation set. Using a photo-z uncertainty cutoff of 0.5 reduces the number of outliers by 70.1% and catastrophic outliers by 67.8%, at the cost of removing 7.67% of the evaluation set.
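The uncertainty cut described in this caption can be sketched as follows. The outlier definition used here, |z_phot - z_spec| / (1 + z_spec) > 0.15, is a common convention adopted as an assumption for this example; the paper's own definitions are given in Eqn 1 and Eqn 2 and may differ. The function names and toy values are hypothetical.

```python
def is_outlier(z_phot, z_spec, threshold=0.15):
    """Assumed outlier criterion: |dz| / (1 + z_spec) > threshold."""
    return abs(z_phot - z_spec) / (1.0 + z_spec) > threshold


def sigma_cut_summary(z_spec, z_phot, sigma_z, sigma_max):
    """Remove galaxies with sigma_z > sigma_max and report the effect.

    Returns (outliers_before, outliers_after, fraction_removed).
    """
    rows = list(zip(z_spec, z_phot, sigma_z))
    if not rows:
        raise ValueError("inputs must not be empty")
    kept = [(zs, zp, s) for zs, zp, s in rows if s <= sigma_max]
    outliers_before = sum(is_outlier(zp, zs) for zs, zp, _ in rows)
    outliers_after = sum(is_outlier(zp, zs) for zs, zp, _ in kept)
    fraction_removed = (len(rows) - len(kept)) / len(rows)
    return outliers_before, outliers_after, fraction_removed


# Toy values (not real survey data): the two outliers also have the
# largest uncertainties, so a cut at sigma_z = 0.5 removes both.
z_spec = [0.50, 1.00, 0.30, 2.00, 0.80]
z_phot = [0.52, 1.60, 0.31, 0.90, 0.82]
sigma_z = [0.05, 0.60, 0.04, 0.70, 0.20]

before, after, removed = sigma_cut_summary(z_spec, z_phot, sigma_z, sigma_max=0.5)
print(f"outliers: {before} -> {after}, removed {removed:.0%} of the sample")
```

The trade-off reported in the caption corresponds to the two quantities returned here: the drop in the outlier count and the fraction of the evaluation set that is discarded.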

Figure 11.Visualisation of the grizy bands before and after the data is re-sampled to approximate the bulk HSC photometry sample.

Figure 12.N(z) distributions of the original evaluation set discussed in §2.1 and the re-sampled evaluation set.

Figure 13. Visualization of the NN and BNN results using an evaluation set that is re-sampled to more closely approximate the bulk HSC photometry. The models are trained on the original data discussed in §2.1.

Figure 14. Comparison of the photo-z uncertainty coverage present in the original evaluation set compared to the re-sampled evaluation sample. Coverage is defined as the fraction of galaxies that have a spectro-z within their 68% confidence interval. Ideally, 68% of evaluated galaxies should have true spectro-zs within their 68% confidence interval. If more than 68% of evaluated galaxies have spectro-zs within their 68% confidence interval, the galaxies are considered 'over-covered' because their photo-z uncertainties are too large. The same logic applies for 'under-covered' galaxies.

Figure 17. PIT histogram of the photo-z PDF produced by the Bayesian Neural Network using an evaluation set with photometry that is re-sampled to approximate the HSC bulk photometry. The red horizontal line indicates the ideal PIT histogram distribution: if the PIT histogram peaks at the center, the photo-z PDFs are generally too broad, and if the PIT histogram peaks at high and low PIT values, the PDF samples are too narrow or contain a large number of catastrophic outliers.

Table 1. Quality cuts used to construct the data set.

Table 2. Metrics used to assess model performance.