Estimating Galaxy Parameters with Self-organizing Maps and the Effect of Missing Data

The current and upcoming large data volume galaxy surveys require the use of machine-learning techniques to maximize their scientific return. This study explores the use of Self-Organizing Maps (SOMs) to estimate galaxy parameters with a focus on handling cases of missing data and providing realistic probability distribution functions for the parameters. We train an SOM with a simulated mass-limited lightcone assuming a ugrizY JHK s +IRAC data set, mimicking the Hyper Suprime-Cam Deep joint data set. For parameter estimation, we derive SOM likelihood surfaces considering photometric errors to derive total (statistical and systematic) uncertainties. We explore the effects of missing data, including which bands are particularly critical to the accuracy of the derived parameters. We demonstrate that the parameter recovery is significantly better when the missing bands are “filled in” rather than if they are completely omitted. We propose a practical method for such recovery of missing data.


INTRODUCTION
The study of the physical mechanisms that drive galaxy formation and evolution (see Somerville & Davé 2015, for a review) starts with our ability to derive galaxy properties such as their distance, stellar mass, and star formation activity.We also need to do so for large enough galaxy populations that span all potential evolutionary stages, environments and cosmic epochs.In the past few decades, increasingly large multi-band imaging and spectroscopic surveys have been constructed to address this need (e.g.SDSS, York et al. 2000;COSMOS, Scoville et al. 2007; CANDELS, Grogin et al. 2011).Set to start in 2025, the Rubin Observatory Legacy Survey of Space and Time (LSST) will cover 18,000 deg 2 in six filters (ugrizy) and reach a 5 σ depth of 27.5 in r (LSST Science Collaboration et al. 2009).In many respects, the three-tiered Subaru Hyper Suprime-Cam (HSC) survey (Aihara et al. 2017) is a precursor of the Rubin LSST survey using nearly all the same filters and reaching comparable depth in its middle (HSC-Deep) tier which covers 27 deg 2 .Large photometric surveys are complemented by spectroscopic surveys (e.g.VVDS; Le Fèvre et al. 2013) although spectroscopic samples have been observationally expensive and therefore limited in the past.
Next generation spectrographs will allow for thousands of spectra to be taken simultaneously within wide fields of view allowing for dramatic increases in spectroscopic data samples (Wang et al. 2022;Greene et al. 2022;Jin et al. 2023).Survey planning with these upcoming spectrographs, as well as fully contextualizing the results thereof, critically depends on our understanding of the selection functions of the parent photometric samples as well as of the spectroscopic datasets themselves.
Techniques employed for determining galaxy parameters from such multi-band data can be classified as physically motivated or data-driven methods (Salvato et al. 2019).The first techniques, already widely used for decades, rely on physical models that have been improved by integrating all the physical knowledge acquired over the years.The estimation is carried out through template Spectral Energy Distribution (SED) fitting to the observed photometry.This method helps us derive parameters such z phot , stellar mass, stellar population age, metallicity, star formation history, etc. (Gallazzi et al. 2005).This approach works because each physical parameters leaves specific marks on the SED, determining its shape and amplitude.Template libraries, spanning a range of physical parameters, are constructed ahead of time or on the go in the process of exploring the parameter space.Then, stellar population parameters are assigned to the galaxy based on the properties corresponding to the SED model that best fits the observed photometry (e.g., Sawicki & Yee 1998;Walcher et al. 2010;Conroy 2013;Leja et al. 2017).The more sophisticated, and hence realistic, the modeling (e.g.treating the Star Formation History SFH as nonparametric in Leja et al. 2017), the more computationally expensive the SED fitting becomes.Multiplying this by millions and soon billions of galaxies makes this traditional approach largely unfeasible.The necessity of speeding up the process of galaxy parameter estimation from observed photometry or spectroscopy is one of the reasons for the widespread applications of machine learning ML in astronomy (Ball & Brunner 2010;Acquaviva 2015;Baron 2019).Machine learning methods are data-driven since they learn directly from the data used to train the algorithm.Techniques include data compression, dimensionality reduction, visualization techniques, etc. which all help to manage and process large amounts of information.Machine learning methods can both offer significant speed-up over traditional SED fitting, but also as they learn complex connections amongst multi-dimensional data can help us make additional inferences from them.
As stated above, ML is already widely used in astrophysics and cosmology (Longo et al. 2019;Baron 2019).Salvato et al. (2019) provides a comprehensive review of different ML methods that can be applied to estimate z phot .Star Formation Rate SFR and stellar mass M * are predicted by training deep neural networks for the Galaxy And Mass Assembly (GAMA) survey (Surana et al. 2020).Lovell et al. (2019) derive SFHs by training convolutional neural networks (CNN) with simulated galaxy spectra from Illustris and EAGLE.Acquaviva (2015) applies a range of supervised ML methods (regularized ridge regression, RF, extremely randomized trees ERT, boosted decision trees AdaBoost and support vector machines) to predict galaxy metallicty from five-band SDSS photometry.The Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project (Villaescusa-Navarro et al. 2021) was created to offer a large, state-of-the-art hydrodynamic simulation that can serve as a training dataset for a variety of ML methods.One such application involved using neural networks to estimate cosmological and astrophysical parameters from 17 simulated galaxy properties, revealing significant correlations between Ω m and galaxy properties (Villaescusa-Navarro et al. 2022).
The Self-Organizing Map (SOM, Kohonen 1982) is an unsupervised ML algorithm that projects high dimensional data onto a 2D map that nevertheless preserves the topology.In the case of galaxy surveys, inputting for example a catalog of 10+ observed colors results in galaxies with similar colors being grouped closely on the 2D map.This makes SOMs a powerful tool for visualizing large astronomical surveys (Geach 2012;Longo et al. 2019).Because observed colors depends on the redshift and stellar population parameters of galaxies, this effective grouping by multi-dimensional colors can also be used for accurate and computationally very efficient derivation of said parameters.SOMs have been used in the literature to classify stellar spectra (Mahdi 2011), classify galaxy morphology (Galvin et al. 2019), estimate photometric redshifts and other galaxy parameters (Masters et al. 2015;Hemmati et al. 2019;Davidzon et al. 2019Davidzon et al. , 2022)), or for target selection (to have spectroscopic samples that are more representative of the photometric ones) (Hemmati et al. 2019;Masters et al. 2019).
The goal of this paper is to investigate the use of SOM for parameter estimation including the effects of photometric errors and missing data, a common scenario in real astronomical datasets that is not adequately handled within SOM applications.To achieve this, we employ a fiducial K s -magnitude limited sample to mimic the joint HSC-Deep photometric dataset.This analysis lays the groundwork for future work involving training a SOM with real data from the HSC-Deep+ survey.
This paper is organized as follows.The employed simulated mass-limited and magnitude-limited datasets are described in detail in Section 2. In Section 3 we describe the set-up, training, labeling and visualization of our SOM.In Section 4.1 we present the visualization of the observed-like sample and the lessons learned therefrom.In Sections 4.2-4.7 we present parameter estimation results with further investigation on how derivations are influenced by realistic cases like noise and gaps in the photometric coverage.In Section 5 we discuss the implications of our findings as well as compare them with the literature.In Section 6 we present our summary and conclusions.In Appendix A, we compare a SOM trained on a simulated vs. an observed dataset.Throughout this paper, all the magnitudes are expressed in the AB system (Oke & Gunn 1983), and we adopt ΛCDM cosmology with H 0 = 70 km s −1 Mpc −1 , Ω M = 0.3, and Ω Λ = 0.7.

DATA
We start with a mass-limited simulated dataset which we color-calibrate with real galaxy data to provide realistic color-parameter correlations.We then generate a fiducial magnitude-limited simulated dataset which is modelled after the HSC-Deep survey with ancillary data.Below we present the details of our masslimited initial dataset, its color-calibrations and finally the observations-like simulated dataset.

Simulated lightcone
Our analysis is based on two publicly-available mock catalogs (described in Laigle et al. 2019) with observedframe photometry extracted from the Horizon-AGN light-cone 1 (Dubois et al. 2014).Horizon-AGN is a cosmological hydrodynamic simulation that contains 1024 3 dark matter particles in a box of L box = 100 h −1 Mpc, corresponding to a dark matter mass resolution of 8 × 10 7 M ⊙ .The simulation is run with the adaptive mesh refinement code ramses (Teyssier 2002), following the evolution of the gas including the effects of gravity, hydrodynamics, gas cooling and heating processes, star formation and stellar feedback, and feedback from black holes (see Kaviraj et al. 2017, for details).Gas is heated from an uniform UV background after the reionization epoch, according to Haardt & Madau (1996), and it cools down to 10 4 K via H, He and metals following Sutherland & Dopita (1993).Star formation is activated only in regions with gas number density n > 0.1 H cm −3 and according to the Schmidt law (Schmidt 1959): ρSF R = ϵρ gas /τ f f , where ρSF R is the SFR mass density, ϵ is the star formation efficiency, ρ gas is the gas mass density, and τ f f is the free fall time of the collapsing gas cloud.Stellar feedback is provided by stellar winds and type Ia and II supernovae, and AGN feedback is a combination of the radio or quasar mode based on the black hole accretion rate.
In Laigle et al. (2019), the 1 deg2 light-cone is build by running the AdaptaHOP halo finder (Aubert et al. 2004) on the Horizon-AGN light-cone over 0 < z < 4 and selecting structures with a density threshold 178 times the average matter density at that redshift, and then considering the total stellar mass within each halo.The extracted light-cone contains 789,354 galaxies with stellar masses >10 9 M ⊙ in a redshift range z = 0 − 4. For each galaxy, Laigle et al. (2019) compute its SED by adopting single stellar population models from Bruzual & Charlot (2003) and a Chabrier IMF (Chabrier 2003).They adopt an empirical conversion relation from the gas phase metals to dust and estimate the spatial distribution of that dust from the gas density in each cell.Integrated along the line of sight, this provides the column density of dust.Dust attenuation is then applied to the spectra by assuming the dust follows the R V = 3.1 Milky Way dust model by Weingartner & Draine (2001).They do not take into account scattering in or out of the line of sight, but only dust absorption.This results for example in higher (by ≈0.8dex) estimated magnitudes in the UV (Laigle et al. 2019).They also account for ab-1 https://www.horizon-simulation.org sorption by the IGM, but not for foreground extinction by the Milky Way, which is expected to be corrected for in any observed survey that would be compared with this simulated dataset.
Apparent magnitudes are obtained by convolving the thus computed galaxy spectra with the desired filter profiles.Of the two mock catalogs 2 we use, one is the COSMOS-like mock where the photometry matches the Laigle et al. (2016) COSMOS2015 catalog.Specifically, this catalog includes the u * -band from MegaCam on the Canada-Hawaii-France Telescope CHFT (Boulade et al. 2003), 6 broad optical bands (B, V, r, i + , z ++ ), 12 median and 2 narrow bands from the Suprime-Cam on the Subaru telescope (Taniguchi et al. 2007(Taniguchi et al. , 2015)), and the Y -band from the Hyper Suprime-Cam/Subaru (Miyazaki et al. 2012), which is slightly bluer than the Y filter from VIRCAM.The COSMOS2015 also includes near-IR YJHK s -band from VIRCAM on the VISTA telescope (Beard et al. 2004;McCracken et al. 2012), and the H and K s bands from WIRCam/CFHT (Puget et al. 2004;McCracken et al. 2010).Finally, the catalog includes the mid-IR Spitzer IRAC channels centred at 3.6 and 4.5µm (ch1 and ch2 respectively) (Fazio et al. 2004).For our analysis, we need a simulated mock of the HSC-Deep survey (Aihara et al. 2017) and its ancillary surveys as described in more detail in Section 2.2.To provide a similar photometric coverage, we use the u, r, i, z, Y , J, H, K s , ch1 and ch2 photometric datapoints from the Horizon-AGN COSMOS-like mock.However, the COSMOS-like mock is missing the g-band.So we also use the public Euclid +LSST-like mock which includes the LSST g-filter which is very close to the HSC g-filter.From these simulated photometric catalogs, we use only the photometry without any photometric errors applied since we add photometric errors ourselves for the specific surveys we are considering (see Section 4.3).In cross-matching the two mocks ≈ 100 objects are lost leaving us with 789,292 galaxies.In the simulation, the u-band exhibits a tail that extends to highly unrealistic values (corresponding to this band dropping shortward of the Lyman break).To mitigate this issue, we consider only galaxies with m u < 28, leaving us with 691,486 objects.This cut limits the redshifts to z ∼ 3.4, as shown in Figure 1 (top,) but has no other effects on the dataset.

Color-calibration
We expect the magnitude and color distributions of this light-cone to not be quite right, because of the simplicity of the dust attenuation treatment, as well as the fact that this idealized photometry does not take into account all the systematic effects that one would have from extracting photometry from an image (Laigle et al. 2019).We correct for this systematic offsets in the photometry following the method of Pearl et al. (2022).This method adjusts simulated magnitudes to match the correlation between M/L ν,obs ratios and sSFR U V,obs , ensuring consistency between simulations and observations.We performed the color-calibration using the UltraV-ISTA catalog (UVISTA; Muzzin et al. 2013) which has similar photometric coverage as well as all necessary derived quantities including redshift.To ensure maximum reliability for our color-calibration, we selected galaxies from the UltraVISTA catalog based on the following criteria: 0.2 < z peak < 4.0, 5 < logM * <12 [M ⊙ ], star=0, contamination=0, nan contam < 3, use=1, K s < 23.5 AB.This results in a K s -limited catalog with only ∼100 objects at z > 3.
We derived the SFR U V for both the simulated dataset and UVISTA following the Kennicutt & Evans (2012) conversion relation from L N U V , which was available for both datasets.For the sake of the calibration, we excluded simulated galaxies with log(sSFR U V )<-13 and Figure 2. M/Lν,u ratio for the observed u-band vs. the sSFRUV in the redshift bin 0.8 < z < 0.9, before and after color-calibration in the upper and lower panel, respectively.The black data points and their relative best-fit (black solid line) represent the simulations.The red dashed line represents the best fit to the observed UVISTA catalog (Muzzin et al. 2013).These best-fits are used to adjust the simulated photometry to match the observed relation.We perform such color-calibration for all bands in all redshift bins.log(sSFR)<-13, as such values are not typically observed.
We perform the color calibration in 70 redshift bins.The first 69 bins cover the range 0 < z ≤ 2.5, each with ≈8K simulated galaxies, making k-correction negligible within these narrow redshift bins.The remaining 132,281 simulated galaxies at z > 2.5 are all grouped in the final bin.The choice of this binning is motivated by the use of the K s -selected UVISTA catalog for the calibration, implicating that there are few observed objects at z > 2.5.Consequently, dividing this redshift range into smaller bins would yield too few objects for a reliable fit in the log(M/L ν )-log(sSFR U V ) plane. Figure 2 shows as an example the M/L ν,u ratio for the u-band vs. the sSFR U V for the redshift bin 0.8 < z < 0.9.We chose as an example the u-band as the needed calibration is by far the strongest in this band.As shown in Figure 2, prior to color-calibration, the slopes of the fitted lines in the log(M * /L u )-log(sSFR U V ) plane differ between the Horizon-AGN mock and UVISTA data.To align the simulated fitted line with the observations, we perform linear fits to both distributions and calculate ∆ m (slope difference) and ∆ b (y-intercept difference).The new luminosity in each band and redshift bin is derived as: where L ν,old is the old luminosity in a specific band ν.We then derive new apparent magnitudes from: In this way we obtain the color-calibrated version of the mock dataset.For the rest of this paper we refer to this mass-limited, color-calibrated mock as the mass-limited simulated dataset, and it is what we use to train the SOM (see Section 3) as well as to generate the observations-like sample as described in the next subsection.

The K s -limited dataset
From the mass-limited simulated dataset described above, we construct a simple K s -limited sample that is meant to mimic the Hyper Supreme-Cam (HSC) Deep survey and ancillary data.
The HSC Deep survey is one of the three layers of the HSC survey (Aihara et al. 2017).The HSC is a 1.8 deg 2 field of view imaging camera (Miyazaki et al. 2017) mounted on the 8.2-m Subaru telescope on the summit of Mauna Kea in Hawaii, operated by the National Astronomical Observatory of Japan (NAOJ).The product of its wide field of view and the large collecting area of the telescope makes HSC second only to the anticipated LSST Camera (Aihara et al. 2017).The Deep layer, which is our focus, covers 27 deg 2 spread across four well-studied extragalactic fields (XMM-LSS, COS-MOS, ELAIS N1, and DEEP2-F3).The HSC photometry includes five broad bands (g, r, i, z, Y ) and three narrow-band filters (NB387, NB816, NB921).
The HSC Deep joint dataset augments the above through the addition of several complementary surveys.The u-band coverage is provided by the CLAUDS survey (Sawicki et al. 2019), which uses the Megaprime camera on the CFHT telescope.The near-IR J, H, and K s bands are supplied from a variety of surveys including UKIDSS/DXS & UDS (Lawrence et al. 2007), VIDEO (Jarvis et al. 2012), UltraVISTA (Muzzin et al. 2013), and DUNES3 (in prep).Spitzer IRAC imaging no significant difference in our findings on parameter estimation accuracy or effects of missing data between the two HSC-like mocks.Our adopted HSC-like K s -limited dataset contains 232,118 galaxies, and its stellar mass vs. redshift distribution is shown in the bottom panel of Figure 1.The bottom panel of Figure 1 shows that the K smagnitude cut introduces an incompleteness in stellar mass as a function of redshift, with the stellar mass limit increasing with increasing redshift.As a result, lowmass galaxies are excluded from the sample at z ≳ 1, in particular those with log(M * /M ⊙ )< 10 are entirely absent by z ∼ 3. We caution that the simulated lightcone we started with, represents 1 deg 2 whereas the HSC-Deep survey covers 27 deg 2 .Rarer populations such as log(M * /M ⊙ )> 11 galaxies at z > 3 can be present in the real data even though largely missing here.

Self-organizing maps
A Self-Organizing Map (SOM; Kohonen 1982) is an unsupervised artificial neural network that performs a dimensional reduction of a multi-dimensional parameter space to a lower-dimensional space, while preserving the topology of the data.It thus consists of a lowerdimensional grid where each pixel is characterized by a weight vector that represents the mapping from the higher-dimensional to the lower-dimensional space.The training involves determining these "weights" so that objects that are similar in the high-dimensional parameter space are grouped together, maintaining the intrinsic structure of input data.For this reason, SOM is a powerful visualization tool especially for a final 2D grid, as considered in this work.We use the Python library SomPY4 (Moosavi et al. 2014) for constructing and training our SOM.Below we describe in more detail how the algorithm works.
The first step is selecting a representative training data, which is then normalized by SomPY to the unit variance with mean of zero.The weight vector of each SOM pixel, or neural network, in SomPY is initialized using principal component analysis (PCA; Chatfield & Collins 1980) to bring the initial weights closer to the input data (SomPY also allows for random initialization, but it extends the training duration).At this point, the Euclidean distance between each neurons and input data is calculated.The neuron with the smallest distance to an input data becomes the Best Match Unit (BMU).The latter is thus dragged closer to the data point and neurons around the BMU, namely within a 'neighborhood radius', are also dragged in the same direction.This process is repeated in an iterative way until all neurons are as close as possible to the input data.This process is unsupervised as it does not require any a priori labelling of input data, unlike for example traditional neural networks where the weights are optimized to match some particular outputs vector.After this optimization, similar objects in the high-dimensional parameter space will be grouped together in the final lower-dimensional (usually 2D) space, maintaining the intrinsic topology of input data.

The SOM trained with galaxy colors
In this work, the input parameter space has ten dimensions representing observed-frame colors for galaxies.These colors, based on the expected photometry available for the HSC-Deep joint dataset (see Section 2.2) are: We select these colors for their sensitivity to galaxy parameters such as redshift, normalized M * and SFR, and sSFR.To do so, we computed the distance correlation for all possible two-color combinations (55 in total) between the 11 available bands.This analysis identified the 10 colors with the strongest non-linear relationships with our target parameters.While we initially explored using adjacent colors for SOM training, our final selection minimized parameter spread within SOM pixels (Sec.3.4), leading to more accurate parameter estimations, particularly for M * and SFR.
For our training set, we selected a random subset from the mass-limited dataset.The training set consists of 228,524 galaxies or about 1/3 of the total.The performance of SOM depends on the user set map's size and geometry which need to be verified by the user depending on their specific case (see Davidzon et al. 2019).The choice of size and geometry of the map, should represent a compromise between good sampling of the data and a high resolution (less quantization error).We achieve this, following the Davidzon et al. (2019) procedure, by training different SOMs starting with a square 20x20 map and gradually increasing the size by adding 10 cells in both dimensions.We kept track of sampling vs. error finding that an 80×80 SOM represents a good compromise, similar with the Davidzon et al. ( 2019) findings.We also tested the effect of different aspect ratios, but found that a square map gave the best performance.Finally, in our final 80×80 SOM, we verified that the 10 color distributions in the input dataset and across SOM pixels are consistent.This ensures that thus trained SOM is representative of the input data5 .

Visualization
Figure 3 shows the distribution of each of the 10 normalized colors across the map.The distribution is typically smooth except for a sharp discontinuity in the middle of the SOM, more evident for some of the colors such as u − g or ch1 − ch2.This is because, in this case, the algorithm organized the input data in a ringlike structure, which causes the colors to change in a counter-clockwise direction.The point where the loop closes marks the discontinuity.However, this discontinuity is not random, but rather it has a physical meaning, denoting the transition between galaxies with low and high redshift and other stellar population properties.For instance, for ch1 -ch2 we expect redder colors at z > 1.7 when both IRAC channels sample blueward of the 1.6 µm bump.This means that lower values of ch1 − ch2 on the map would correspond to z < 1.7 and vice versa.The u − g colors also switch from low to high across the same discontinuity, likely due to the effects of redshift and SFR.Lower redshifts correspond to lowermass, passive galaxies, while higher redshifts correspond to redder colors, indicating more massive, star-forming galaxies.
At z > 1.7 the u − g colors redden further as a result of the Lyman break, which enters the u filter at z > 2.2.In general, both redshift and galaxy stellar population properties affect these distributions.Figure 3 clearly demonstrates that galaxies with similar 10-color sets are grouped together in the trained SOM.
Considering our ultimate goal is to replicate the methodology presented in this work with observed data, we conduct an additional sanity check shown in Appendix A. This check involves verifying the similarity of the component maps of two different SOMs, one trained with an observed dataset (UVISTA) and the other with simulated data where we match as closely as possible the selection function of the observed dataset.We find encouragingly consistent component maps.

SOM pixel labeling
The color distributions across the SOM (Figure 3) already allowed us to identify trends with redshift and stellar population parameters.To quantify these trends and use SOM for parameter estimation, we need to label each SOM pixel with the galaxy properties based on its corresponding 10-dimensional color combination.There are multiple approaches.Masters et al. (2015) use the sub-set of their galaxies with spectroscopic redshifts to label each pixel by the expected redshift.Hemmati et al. (2019) use simple empirical template libraries covering a range of parameters such as age of the stellar population and dustiness.In our case, following the approach of Davidzon et al. (2019), we label SOM pixels with the median of each parameter.The latter are the Horizon-AGN parameters from the training set galaxies falling within a given pixel.This works for redshifts and sSFRs which only depend on the observed-frame colors.Stellar mass M * and SFR, however, require information about the overall amplitude of a galaxy's spectrum, which is not modeled in our colors-trained SOM.To account for this amplitude-dependence, we normalise M * and SFR values from our simulated dataset to a reference magnitude J = 23 mag by doing: where X is the galaxy parameter that needs to be normalized and J is the J-magnitude for each particular galaxy.
The left panels of Figure 4 show how redshift (z), normalised stellar mass (nM * ), normalised star formation rate (nSFR), and sSFR vary across the SOM.Consistent with our initial visual inspection, the considered parameters increase in a counter clockwise trend.Specifically, more distant galaxies at z > 2.5 are also more massive and active.Note that the SOM's morphology is not universal; it varies depending on the specific colors and dataset used for training.However, its ability to reveal trends remains consistent.For instance, comparing these plots with Figure 3, it is evident that redder galaxies are the farther, more massive ones.
The right panels of Figure 4 show the distribution width δ for each parameter.The width is quantified by the difference between the 84th and 16th percentiles among galaxies within each pixel.The δ/2 represents the systematic uncertainties σ sys of the derived redshifts and stellar population pararameters using SOMs.Large parameter spread within a pixel can be due to degeneracies, although the use of 10 different colors tends to mitigate this to a large degree.We also notice that large spreads in our SOM tend to concentrate the edges.This  could be due to: 1) boundary effects, as Davidzon et al. (2019) explains, risen from galaxies with extreme colors put on the borders of the map by the algorithm; 2) poorly defined areas, where there are not enough objects to train properly a cell.

RESULTS
Now that we have a trained and calibrated SOM, we use it to both explore the selection functions for our observations-like simulated dataset (Sec.4.1), and the derivation of redshift and stellar population parameters (Sec.4.2).Specifically, for the latter we account for photometric errors (Sec. 4.3 and 4.4) and cases of missing data (Sec.4.5).

Visualization of Sample
One of the main strengths of SOMs is its ability to provide interpretable visualization of high-dimensional data.In particular, once a SOM is trained using the mass-limited dataset, it is possible to project on the map different samples derived by applying different cuts.Figure 5 shows the projection of the K s -limited sample, or HSC-Deep joint-like sample, reaffirming the conclusion of Figure 1.Indeed, a lack of objects at z ≳ 2.5 with log(M * /M ⊙ )< 10 is present on the SOM.This is due to the stellar mass incompleteness introduced by the K s magnitude cut.Additionally, we notice that the "missing" galaxies in the HSC-Deep joint-like sample are pre-dominantly higher sSFR galaxies with log(sSFR)≳-9.2yr −1 .In observed terms, these are galaxies with blue optical (e.g.g − Y and r − i), but red near/mid-IR (e.g.J − H and ch1 − ch2) colors.This projection of the K s -limited sample highlights the effectiveness of SOMs in visualizing the impact of both magnitude and color cuts.

Deriving redshift and stellar population parameters from SOM
We test the use of SOMs in deriving redshift and stellar population parameters by projecting the K s -limited dataset onto the SOM.For each galaxy, we assign median of z, log(nM * ), log(nSFR), and log(sSFR) of the SOM pixel in which it resides.To de-normalize M * and SFR, we use equation 3 and the observed J-band magnitude for each galaxy.Note that while we employ the K s -limited sample for predictions, the SOM was trained and its pixels labeled with a different dataset, specifically the training datset, which was randomly drawn from the mass-limited dataset.Although the distributions of these two datasets in terms of z, M * and SFR may differ, we expect the mapping color space-physical parameter space to be consistent.
Figure 6 shows the comparison between the true (in values from the HorizonAGN simulation) and the SOMderived values for z, M * , SFR and sSFR.For each parameter we also show four statistics including normalized median absolute deviation (σ N M AD ), outlier fraction η, root mean squared error (RMSE) and bias.The outlier fraction η is the percentage of objects that satisfy |z SOM − z sim |/(1 + z sim ) > 0.15 in the case of z phot , and |logX SOM − logX sim | > 2σ in the case of the other properties X.Here σ is the standard deviation of the difference log(predicted) -log(true).
We find that redshift outperforms other statistics, with σ N M AD and RMSE of 0.01 and 0.03, respectively.Stellar mass estimation is also accurate, with σ N M AD and RMSE of 0.06 and 0.07, respectively.As typical of all parameter estimation methods, the estimation of SFR and sSFR has higher uncertainties.The spread between the true and SOM-derived parameters in Figure 6 is due to the intrinsic spread δ in these parameters within a SOM pixel as shown in Figure 4.This spread does not include the effect of photometric errors.Therefore, the performance statistics presented here represents the ideal scenario.

Accounting for photometric uncertainties
To consider photometric uncertainties, we adopt relative errors per magnitude derived from the HSC DR2  (2023).We computed the relative photometric errors σ ν in each band by fitting a line between log(σ ν /f ν ) and magnitude in that band.This approach accounts for larger relative errors for the more abundant faint objects and vice versa.
6 https://hscdata.mtk.nao.ac.jp/schema browser3/#dr2.s18a dud u2k.forced For each galaxy in our simulated HSC-like dataset, we draw a new random flux value from a gaussian distribution centered at the true flux and a width given by the photometric error.We project these new perturbed data onto the SOM trained with noiseless colors.Afterwards, we repeat the procedure to estimate parameters as described in Section 4.2.Note that to de-normalised M * and SFR, we use the J-band magnitudes with added noise.
Figure 7 shows the comparison of true and SOMderived parameters, accounting for photometric uncertainties.As expected, parameter uncertainties increase with photometric errors, but the estimations remain reasonably accurate.For instance, the new redshift σ N M AD is 0.026 with only a 1.2% outlier fraction.The true vs. SOM-derived z phot distribution in Figure 7 however shows horizontal stripes where the z phot derivation is less reliable, with increased number of outliers, at specific redshift ranges.The most stripe is at z SOM ∼ 1 − 2. We investigated the primary cause of these effects by examining the position on SOM of these outliers.The latter are all located near the area of the abrupt change in K − ch2 and ch1 − ch2 (Figure 3), as a result of the move across the 1.6µm bump.
The larger photometric errors in the IRAC channels can cause galaxies to fall into neighboring SOM pixels associated with different redshift values.We validated this hypothesis by repeating the same analysis of this section but removing the IRAC photometric errors.As a result, the horizontal stripes visible in the upper-right panel in Figure 7 disappeared, confirming that the colors associated with the IRAC-channels were indeed the primary contributors.This illustrates how photometric noise in specific bands impacts derived parameter uncertainties.
Due to the degeneracies between photometric redshift and shape of the SED, the outliers of the horizontal stripes are responsible also for the outliers in mass and SFR derivation visible in Figure 7.For example, if objects at somewhat higher-z are incorrectly placed at lower-z (right-side of the stripe), their redder colors are associated to more quiescent galaxies, leading to an underestimation of SFR, and vice-versa.Similarly, if an object is placed at a lower redshift than its true value, the observed J-band suggests a lower stellar mass than in reality.This results in stripes below the 1:1 relation in M * (see Figure 7).Since sSFR is the ratio of SFR and M * , these effects are canceled out.This explains why our estimates for sSFR changed the least between the no-error and error case (Figure 6 vs. Figure 7).We tested this interpretation by running the analysis without any IRAC channels errors, and as a result, the horizontal stripes in z phot , in M * (logM * ,SOM /M ⊙ ∼ 9.5 − 10), and in SFR (log(SFR) SOM ∼ 0) disappeared.

Uncertainty estimation for individual galaxies
Figure 7 gives an overall sense of parameter uncertainties when photometric errors are considered.These parameter uncertainties can be statistical, resulting from photometric errors, and systematic, resulting from parameter value distribution within a single SOM pixel.Below, we explain how we compute both statistical and systematic parameter uncertainties for individual galaxies.
For any galaxy, the likelihood L of it occupying a SOM pixel with coordinates [x,y] is computed as: with: where, in the pixel [x,y], C i are the 10 colors of the test galaxy, C i are the 10 median colors in that pixel, and σ Ci represent the uncertainties of C i .The likelihood surface is normalized by dividing L [x,y] by its sum. Figure 8 (left panels) shows the likelihood surfaces for two test galaxies.These test galaxies are of similar redshift (z ∼ 1.2) and were chosen to represent a high-SNR (signal to noise ratio) galaxy (pixel = [61, 21], SNR=82.3 in g band, m g =23.4) and a low SNR galaxy (pixel = [71, 29], SNR=4.0 in g band, m g =27.7).The likelihood surface of the high-SNR test galaxy is very concentrated around its true position on the SOM, while for the low-SNR test galaxy it is more spread out.From these likelihood surfaces, combined with the pixel labels for a given parameter, we can derive probability distribution functions for any of our parameters.For example, the right-hand panels of Figure 8 show the redshift probability distribution functions P (z) derived from these likelihood surfaces.The P (z) of the low-SNR test galaxy exhibits a larger spread compared to the high-SNR test galaxy.This confirms the idea that the accuracy of parameter estimation with SOMs depends on a galaxy's SNRs, as indeed is the case for all methods used in deriving galaxy parameters.
We define the statistical uncertainty for the given parameter, σ stat,param , for any individual galaxy as the standard deviation of the probability distribution function, such as the P (z) functions discussed above.To estimate the parameter systematic uncertainty σ sys,param , we take half of the ⟨δ⟩ = 84% − 16%, the dispersion of galaxy parameter within the best-fit pixel for that galaxy (see right panels of Figure 4).The total uncertainty is: For example, to derive the uncertainties in the redshift estimation for the high-SNR test galaxy we look at its pixel [61,21] which has a redshift spread ⟨δ⟩ = 0.057, hence σ sys = 0.029, leading to a total redshift uncertainty of σ tot,z = √ 0.033 2 + 0.029 2 = 0.044.To compare this with the nominal σ N M AD in Figure 7, we need σ tot,z /(1 + 1.2) = 0.02, which is consistent with the σ N M AD of 0.026.

Handling missing data
The previous section's analysis assumes flux measurements are available across all bands.In reality, the surveys making up the HSC-Deep joint dataset do not perfectly overlap.Several square degrees of the HSC-Deep joint survey in fact will have no data outside the core bands of grizY .Additionally, even in overlapping areas, there can be instances of missing data in individual bands, due to issues such as bad data arising from image artifacts.The simplest approach would be to train a SOM excluding any missing bands.We tested this scenario for multiple individual missing bands or sets of missing bands.Figure 9 shows the z phot σ N M AD in case of missing u-band, J-band, H-band, K s -band, JHK s , K s +IRAC, and uJHK+IRAC.The σ N M AD gets worse compared to the baseline of complete data (dashed line) with σ N M AD =0.026.Note that this analysis takes into account photometric uncertainties in the same manner as done in Section 4.3, where each flux is perturbed by Gaussian noise based on the given flux uncertainty.
The degradation in parameter estimation seen in Figure 9 is due to the loss of information resulting from the missing bands.To address this, we assume that galaxies with missing data fall within the same color distribution as the training sample.This is justifiable when the missing data is simply due to the inhomogeneous coverage of the various multiband surveys.To recover the colors involved with missing bands, we perform 500 random draws from the color distribution of these same colors, as defined in the training set.Our color distribution again include photometric uncertainties as discussed above.The random draws involve either 1D, 2D or 3D color distributions depending on whether the missing band is on the edge (e.g.u-band), in the middle (e.g.H-band), or a set of adjacent bands is missing (e.g.JHK).With our 500 random draws for the "missing band" colors combined with the existing colors for each galaxy, we are able to essentially build a SOM likelihood surface for that galaxy in a manner similar to that discussed in Section 4.4.The adopted parameter value is then simply based on the maximum-likelihood SOM pixel.The resulting redshift σ N M AD values for each of our missing data scenarios are shown in Figure 9.We Squares represent the case when the SOM is trained on the complete data and datasets with missing data are projected onto it using the missing color recovery method described in Section 4.5.The horizontal dashed line indicates the σNMAD = 0.026 achieved using all bands (see Figure 7).
note that in most cases, this band recovery procedure allows us to achieve significantly better σ N M AD values than by simply training the SOM without the missing data.We emphasize that this improvement is based on the assumption that the color distributions determined from the data with complete coverage are representative of the data with incomplete coverage as well.
The only cases where our recovered data perform worse than the SOM trained by omitting the missing data are those involving IRAC channels.We tested that this worsening does not depend on the adopted photometric error.This effect is likely due to a more uncertain extrapolation when more than one color needs to be recovered right on the edge of the wavelength coverage as is the case of missing the IRAC channels.

Parameter recovery vs. wavelength coverage
The missing data analysis also provides us with information on the relative importance of having particular bands present for the recovery of our parameters.Table 1 goes beyond z phot , shown in Figure 9, to explore how the σ N M AD for z phot , M * and SF R all vary depending on which bands are missing and filled in with our recovery method (Section 4.5).We reach reasonably good accuracy for when only a single band is missing.As more bands are missing and filled-in, the values of σ N M AD increase.For example, the worst parameter estimation occurs when all ancillary bands outside the HSC-Deep core bands (grizY ), namely uJHK+IRAC, are missing and filled-in with our method.If at least uband, among the ancillary bands, is not missing (i.e. in the case of recovered missing JHK+IRAC) the parameter σ N M AD improves especially for z phot estimation, with a decreasing in σ N M AD of ∼ 32%.These findings show the critical importance of the ancillary surveys providing the u-band, the near-IR, and the Spitzer IRAC ch1 and ch2 coverage.
The variation of parameter σ N M AD with specific bands depends on their different sensitivity to galaxy parameters at different ranges.Figure 10 illustrates this for the cases when u-band (top panel), H-band (middle panel), and all ancillary to HSC-Deep data are missing (bottom panel).We chose these specific cases as the uband is crucial for accurate z phot estimation at z ≳ 2.5 (as this filter enters the Lyman break) and traces the rest-frame ultraviolet hence is relevant for the SFR estimation.The top panel in Figure 10 shows that indeed when the u-band is missing our redshift recovery gets worse beyond z ∼ 2.5, the SFR recovery is poor.As seen in Table 1, the SFR recovery here is the worst compared to any other single (or couple of) missing band cases.
The H-band, in our redshift range, traces the restframe optical to near-IR, therefore it is sensitive to the  stellar mass.It also affects the z phot determination since that is sensitive to the 1.6 µm bump at lower z, and to the 4000 Å-break at z > 3. The middle panel in Figure 10 shows that indeed, when the H-band is missing our redshift recovery is much worse at z > 2. The underestimated redshifts in this regime also lead to a pronounced bulge of underestimated stellar masses, and corresponding underestimated SFRs.
The last case we explored in more detail was the worse case scenario (see Table 1) which was the missing uJHK + IRAC scenario, or when we have no data outside the HSC-Deep core bands (grizY ).In this case, we use the scenario where the missing data are simply omitted rather than recovered, since the later shows worse parameter recovery (see Figure 9).We see in the bottom panel of Figure 10, that the redshift estimation is still reasonable at z < 1−1.5, where the optical bands trace the 4000 Å break, but there is essentially no handle on the redshift beyond that (up to the z ∼ 3 regime explored here).The longer wavelength (J+) bands are critical for the z ∼ 1.5 − 3 estimation due to their sampling both of the Balmer break and the 1.6 µm stellar bump.

Handling upper limits
Besides cases of missing data, we can also have upper limits if the source is too faint relative to the noise in a given image.The naive first approach to handle such upper limits is to draw random fluxes from uniform distributions representing the upper limit.However, this method can lead to unrealistic colors that place the galaxy outside the SOM surface, which only represents the 10D color space present in the training set.Dealing with upper limits is in general a significant challenge for machine learning methods, as it requires extrapolating outside the training set.
While a more comprehensive treatment of upper limits in SOMs is beyond the scope of the present work, we can adapt our missing data approach for one particular scenario involving upper limits.This is the case where upper limits are a result of some part of the survey having significantly shallower coverage in some band relative to another deeper portion of the survey.In this case, we can train the model using the full color set from the deep portion of the survey.The assumption is that the deeper-data training set contains realistic colors for sources with upper limits in the shallower portion.In this case, we can implement a modification of the missing data approach described in the previous section.Instead of drawing from the full color distribution, we draw from a truncated color distribution that accounts for the detection limit in the shallower data.Figure 11 illustrates this approach with an example of a test galaxy having only an upper limit in H band.The test galaxy is drawn from a fiducial dataset with H−band coverage two orders of magnitude shallower coverage than in the training set.In this case, the recovered colors are realistic and fall within the panchromatic space used to train our SOM, as showed by the grey region in Figure 11.Note that the tightness of the SOM likelihood survey of this upper limit galaxy is due to the fact that each SOM pixel involves 10 colors which helps compensate to a large degree for any missing/upper limit data.ing the effects of missing data and how to handle them.Fundamentally, we achieve a photo-z σ N M AD of 0.026 and an outlier fraction of 1.2%, while Davidzon et al. (2019) reported a σ N M AD of 0.044 and an outlier fraction of 6.1%.While the approach is very similar, this is a significant improvement in performance.Our tests suggests that the key aspect behind this improvement is the specific choice of colors.We use more widely separated colors, such as r − z, which is sensitive to the redshift, and we also add the g-band (which is omitted in Davidzon et al. (2019)) which also improves performance, especially in redshift regimes where it helps pinpoint the Balmer break.In addition, our improved performance is aided by our uncertainties in the u-band are ≈ 0.2mag lower, they are comparable in the visible bands, and ≈ 0.1mag lower in the NIR, although our IRAC channels errors are worse by ≈ 0.2mag.Davidzon et al. (2019) do not quote detailed statistics for SFR, but they quote σ N M AD < 0.2 which is consistent with our results.
In a more recent paper, Davidzon et al. (2022) used their SOM method to derive M * from the observed COS-MOS data and compare their SOM-derived values with traditional SED-fitting.Their results indicates that this method works equally well as traditional SED fitting but it is computationally much faster, making it suitable for large datasets.Their distribution of SOM-derived vs. SED fitting-derived stellar masses appears visually wider than our SOM-derived vs. true stellar mass.However, this is not a fair comparison since there are uncertainties associated with the SED-fitting derived values as well.

The handling of missing data
In this paper, we explore for the first time in the astronomical context the handling of missing data in SOMs.This issue has been explored in SED template fitting codes, e.g. by modifying the χ 2 method, usually at the expense of increased computational cost (Sawicki 2012;Boquien et al. 2019).A recent paper, Rejeb et al. (2023) does discuss how to train SOMs with incomplete data; however, this paper did not appear until the final stages of our current work therefore we have not had a chance to implement it.In future works, we plan to explore the possible use of their algorithm through their publicly available code7 .One needs to be cautious, however, about including incomplete data in the training, especially in cases of large variance, as discussed in Cottrell & Letrémy (2007).They argue that when a sufficient amount of data is available to build a robust model, it is preferable to train on the complete data and only afterwards consider the effects of missing data.Indeed, this is the approach we adopted in our study.In our paper, we outline a method to train the SOM on the complete dataset and "recover" the missing data where the data are incomplete by assuming the SOM training set is otherwise representative of the galaxies with missing bands.In line with our finding that it is better to fill-in the missing data rather than simply omit it, Chartab et al. (2023) achieved better results in SED fitting after first using a random forest model to fill-in any missing data.

SUMMARY & CONCLUSIONS
SOMs are an efficient means of grouping galaxies based on their colors, providing a powerful means for visualizing multi-dimensional galaxy datasets as well as estimate any SED-based parameters.In this study, we train a SOM on a mass-limited light-cone from the Horizon-AGN simulation that has been color-calibrated on real data.Our training color set is based on a fiducial ugrizY JHK s +IRAC photometric coverage characteristic of the HSC-Deep survey plus ancillary data (the HSC-Deep joint survey).We also examine an observations-like magnitude-limited sample with K s < 23.5.We assess the quality of the SOM-derived photometric redshifts, stellar masses, SFRs and sSFRs while considering realistic photometric errors for the HSC-Deep joint survey.Finally, we investigate the effects on the derived parameters in cases of missing data.Our key findings are: • The set of colors used for SOM training should be those most sensitive to the parameters we intend to estimate.For example u−g, J −H and ch1−ch2 are all colors that are critical to accurate redshift estimation, each particularly sensitive to different redshift ranges.
• SOMs can be used to interpret visualization of high-dimensional data.For example, the projection of the K s -limited sample onto the SOM (Figure 5) shows that this selection is similar to the mass-limited sample up to z ∼ 1, but loses galaxies with higher sSFR (log(sSFR)>-9) and lower stellar mass (log(M * /M ⊙ )<10), beyond that.
• With the HSC-Deep joint survey photometric uncertainties, the derived z phot 's have σ N M AD = 0.026.For M * , SFR, and sSFR we have σ N M AD < 0.2.The IRAC channels uncertainties are particularly critical in the z ∼ 1 − 2 range where the K − ch2 and ch1 − ch2 colors trace the 1.6µm bump.
• We consider both the systematic and statistical uncertainties in the derived parameters for individual galaxies.In particular, the probability distribution function is derived from the likelihood sur-face of a galaxy within the SOM, which is strongly dependent on the signal-to-noise ratio of the particular galaxy.
• We discuss how to recover and project cases of missing photometric bands onto a SOM trained with the full dataset.This leads to improved parameter estimations compared to simply omitting missing data from the training.In limited cases, this approach can be applied to upper limits as well.
• The quality of the derived parameters is strongly dependent on the overall wavelength coverage included in training the SOM.We provide a table for comparison of achieved σ N M AD for estimated z phot , M * , and SFR when different (sets of) missing bands are recovered.
We stress out that the findings presented in this study are based on simulations, despite our efforts to make the simulations more realistic through the color-calibration process.Our next step will be to use the methods outlined here to use SOMs for parameter estimation for the upcoming HSC-Deep joint catalog.Finally, in a companion paper (La Torre et al. 2024b in prep.), we conduct a performance comparison of SOM with three supervised ML methods: extreme gradient boosting, fully connected neural networks, and random forest.For comprehensive information on the training process and our findings, we refer the reader to that paper.

A. COMPARISON OF SOMS TRAINED WITH SIMULATED AND OBSERVED DATASETS
We test if two SOMs trained with a simulated or an observed dataset yield similar component maps.This is important as our final goal is to employ SOM to estimate galaxy parameters for the large HSC Deep joint survey.For this test we compare two SOMs, one trained with the mass-limited simulation and the other trained with observed galaxies from UltraVISTA catalog (Muzzin et al. 2013).The training colors of these two datasets should be similar if we want to check that simulated and observed galaxies with same colors are placed in same regions of the map after the training.To this effect, for this test we select only simulated and observed galaxies brighter than the peak of the observed magnitude's distribution in each band.Thus, the resulting simulated and observed color distributions are similar, as show in the top panel of Figure 12, and so they can be used to train two SOMs.Note that these components maps differs from Fig. 3, due to variations in the magnitude distribution utilized for color computation.In the analysis of this work, we exclusively apply K s -magnitude cuts, whereas in this test, the cuts are made at the peak of the observed magnitude distributions.The middle and bottom panels in Figure 12 illustrate the two SOMs color-coded by two of the training colors, showing that they yield similar component maps (more details in the figure caption).This means that if the two training datasets have same color distributions, than even the SOMs will be similar.Consequently, with the aim of using this analysis with similar observed color distributions, we can trust, within some limits, the analysis we are performing in this work with simulations.(bottom).A comparison between the middle and the bottom panels shows that the two datasets yield the same component maps, meaning that if the input objects have the same color distribution, as in the top panel, than even the SOMs will be similar.For this test we trained the two SOMs with the same color set considered in this paper.

Figure 1 .
Figure 1.Top: The density of galaxies in the stellar mass versus redshift plane from the mass-limited 1 deg 2 simulated Horizon-AGN light-cone, after applying a u-magnitude cut of mu < 28 (Sec.2.1).Bottom: The same simulated dataset after color-calibration and applying a Ks-magnitude cut of Ks < 23.5 (Sec.2.2).

Figure 3 .
Figure 3. Component maps showing each of the 10 input normalized colors across our trained SOM.The color bar indicates the normalized color values.

Figure 4 .
Figure 4. Left panels: SOM with cells color-coded according to the median of galaxy redshift, normalised stellar mass, normalised SFR, and specific SFR respectively from the top to the bottom panel.Right panels: SOM with cells color-coded according to the distribution width ⟨δ⟩ = ⟨84% − 16%⟩ of the corresponding galaxy parameters on the left side.

Figure 5 .
Figure5.The dark region represents the log of the number of galaxies projected on the SOM when Ks < 23.5 is applied to the dataset.The Ks-limited sample contains 232,118 objects.The rainbow contours represent levels of values of redshift (Figure4), with redder colors representing higher redshift.

Figure 6 .
Figure 6.The true simulated vs. SOM-derived parameter values in the ideal case with no photometric errors.These are based on the Ks-limited test dataset.The solid red line represents the bisection line; the red dashed lines indicate the boundaries of the outlier.The number of objects is 232,111.preliminaryu2K catalog 6 for u through K s bands (Aihara et al. 2019).We focused on the XMM-LSS field, since its IRAC ch1 and ch2 fluxes and uncertainties were available fromKrefting et al. (2020);Nyland et al. (2023).We computed the relative photometric errors σ ν in each band by fitting a line between log(σ ν /f ν ) and magnitude in that band.This approach accounts for larger relative errors for the more abundant faint objects and vice versa.

Figure 7 .
Figure 7.The true simulated vs. SOM-derived parameter values in the more realistic case including photometric errors.These are based on the Ks-selected test dataset.The solid red line indicates the bisection line; the red dashed lines show the boundaries of the outlier.The number of objects is 227,365.Circa 2% of objects is lost because of invalid values from the log.

Figure 8 .
Figure8.The panels on the left show the normalized likelihood surfaces for two test galaxies: a high-SNR (= 82.3 in g band, mg=23.4)example on top, and a low-SNR (= 4.0 in g band, mg=27.7)example on the bottom both with true redshifts around z ∼ 1.1 − 1.2.The rainbow colors correspond to the redshift levels as in Figure.5.The panels on the right show the redshift probability distribution based on these likelihood surfaces.The blue solid line represents the most likely redshift value, z peak , from the 2D surface.The red solid line represents the true redshift, zsim, while the red dotted line corresponds to the median of redshift, z med , used to label the pixel in which the test galaxy is located.Note that the standard deviation σ std of the distribution is smaller for an higher-SNR test galaxy.

Figure 9 .
Figure9.The σNMAD for estimated z phot as a function of missing bands.Stars represent the case when the missing band(s) are omitted from the SOM training.Squares represent the case when the SOM is trained on the complete data and datasets with missing data are projected onto it using the missing color recovery method described in Section 4.5.The horizontal dashed line indicates the σNMAD = 0.026 achieved using all bands (see Figure7).

Figure 10 .
Figure 10.From left to right, z phot , M * , and SFR predicted vs. true in case of recovered missing data for u-band (top) and H-band (middle).The procedure for handling missing data is described in Section 4.5.Bottom: predicted vs. true in case only HSC-Deep grizY bands are available and the missing bands are not recovered.Red lines as in Figures 6,7.

Figure 11 .
Figure 11.The log of the normalized likelihood for a test galaxy (hot diamond) with an upper limit for H-band, in the scenario where the SOM is trained with complete deeper data.The rainbow contour represent levels of redshift values as in Figure 5.

Figure 12 .
Figure 12.The top panel shows the comparison of distribution of simulated and UVISTA colors, obtained from magnitudes brighter than the peak of the observed magnitude's distribution in each band.The colors g-r on the left and Y-J on the right are shown as an example of two similar color distributions.The middle and bottom panels show the comparison of component maps obtained from training SOM with simulated galaxy colors (middle) and UVISTA(bottom).A comparison between the middle and the bottom panels shows that the two datasets yield the same component maps, meaning that if the input objects have the same color distribution, as in the top panel, than even the SOMs will be similar.For this test we trained the two SOMs with the same color set considered in this paper.

Table 1 .
Effect of parameter recovery for different missing bands "filled-in" as described in Section 4.5.