Morphological Parameters and Associated Uncertainties for 8 Million Galaxies in the Hyper Suprime-Cam Wide Survey

We use the Galaxy Morphology Posterior Estimation Network (GaMPEN) to estimate morphological parameters and associated uncertainties for $\sim 8$ million galaxies in the Hyper Suprime-Cam (HSC) Wide survey with $z \leq 0.75$ and $m \leq 23$. GaMPEN is a machine learning framework that estimates Bayesian posteriors for a galaxy's bulge-to-total light ratio ($L_B/L_T$), effective radius ($R_e$), and flux ($F$). By first training on simulations of galaxies and then applying transfer learning using real data, we trained GaMPEN with $<1\%$ of our dataset. This two-step process will be critical for applying machine learning algorithms to future large imaging surveys, such as the Rubin-Legacy Survey of Space and Time (LSST), the Nancy Grace Roman Space Telescope (NGRST), and Euclid. By comparing our results to those obtained using light-profile fitting, we demonstrate that GaMPEN's predicted posterior distributions are well-calibrated ($\lesssim 5\%$ deviation) and accurate. This represents a significant improvement over light profile fitting algorithms which underestimate uncertainties by as much as $\sim60\%$. For an overlapping sub-sample, we also compare the derived morphological parameters with values in two external catalogs and find that the results agree within the limits of uncertainties predicted by GaMPEN. This step also permits us to define an empirical relationship between the S\'ersic index and $L_B/L_T$ that can be used to convert between these two parameters. The catalog presented here represents a significant improvement in size ($\sim10 \times $), depth ($\sim4$ magnitudes), and uncertainty quantification over previous state-of-the-art bulge+disk decomposition catalogs. With this work, we also release GaMPEN's source code and trained models, which can be adapted to other datasets.


INTRODUCTION
The morphology of galaxies has been shown to be related to various other fundamental properties of galaxies and their environment, including galaxy mass, star formation rate, stellar kinematics, merger history, cosmic environment, and the influence of supermassive black holes (e.g., Bender et al. 1992; Tremaine et al. 2002; Pozzetti et al. 2010; Wuyts et al. 2011; Huertas-Company et al. 2016; Powell et al. 2017; Shimakawa et al. 2021; Dimauro et al. 2022). Therefore, quantitative measures of the morphological parameters for large samples of galaxies at different redshifts are of fundamental importance in understanding the physics of galaxy formation and evolution.
Distributions of morphological quantities alone can place powerful constraints on possible galaxy formation scenarios. When combined with other physical quantities, they can provide key insights into the evolutionary processes at play or even reveal the role of new physical mechanisms that impact evolution (e.g., Kauffmann et al. 2004; Weinmann et al. 2006; Schawinski et al. 2007; van der Wel 2008; Schawinski et al. 2014). However, such studies often involve subtle correlations or hidden variables within strong correlations that demand greater statistics and measurement precision than what has been available in the preceding decades. An often-overlooked factor in such studies has been the computation of robust uncertainties. Full Bayesian posteriors for the different morphological parameters are crucial for drawing scientific inferences that account for uncertainty, and are thus indispensable in the derivation of robust scaling relations (e.g., Bernardi et al. 2013; van der Wel et al. 2014) or tests of theoretical models using morphology (e.g., Schawinski et al. 2014).
A quantitative description of galaxy morphology is typically expressed in terms of its structural parameters (brightness, shape, and size), all of which can be determined by fitting a single two-dimensional analytic light profile to the galaxy image (e.g., Van Der Wel et al. 2012; Tarsitano et al. 2018). However, moving beyond single-component determinations by using separate components to analyze galaxy sub-structure (e.g., disk, bulge, bar, etc.) can provide us additional insights into the formation mechanisms of these components: bulges, disks, and bars may be formed as a result of secular evolution (e.g., Kormendy 1979; Kormendy & Kennicutt 2004; Genzel et al. 2008; Sellwood 2014) or due to the interaction of disk instabilities with smooth and clumpy cold streams (e.g., Dekel et al. 2009a,b). In this sense, contrary to what is often expected, bulges can also be formed without major galaxy mergers.
Over the last decade, machine learning (ML) has been increasingly used for a wide variety of tasks, from identifying exoplanets to studying black holes (e.g., Hoyle 2016; Kim & Brunner 2017; Shallue & Vanderburg 2018; Sharma et al. 2020; Natarajan et al. 2021). Unsurprisingly, these algorithms have become increasingly popular for determining galaxy morphology as well (e.g., Dieleman et al. 2015; Huertas-Company et al. 2015; Tuccillo et al. 2018; Ghosh et al. 2020; Hausen & Robertson 2020; Walmsley et al. 2020; Cheng et al. 2021; Vega-Ferrero et al. 2021; Tarsitano et al. 2022). The use of these techniques has been driven by the fact that traditional methods of analyzing morphologies (visual classification and template fitting) are not scalable to the data volume expected from future surveys such as the Vera Rubin Observatory Legacy Survey of Space and Time (LSST; Ivezić et al. 2019), the Nancy Grace Roman Space Telescope (NGRST; Spergel et al. 2013), and Euclid (Racca et al. 2016).
Most previous applications of ML to galaxy morphology produced broad, qualitative classifications rather than numerical estimates of morphological parameters. Tuccillo et al. (2018) did estimate parameters of single-component Sérsic fits. However, they did not analyze galaxy sub-structures or provide uncertainties. In order to address these challenges, in Ghosh et al. (2022), we introduced the Galaxy Morphology Posterior Estimation Network (GaMPEN). GaMPEN is a machine learning framework that estimates full Bayesian posteriors for a galaxy's bulge-to-total light ratio ($L_B/L_T$), effective radius ($R_e$), and flux ($F$). GaMPEN takes into account covariances between the different parameters in its predictions and has been shown to produce calibrated, accurate posteriors.
GaMPEN can also automatically crop input galaxy cutouts to an optimal size before determining their morphology. This feature is critical given that morphology-determination ML frameworks typically require input cutouts of a fixed size, and thus, cutouts of a "typical" size often contain secondary objects in the frame. By cropping out most secondary objects in the frame, GaMPEN can make more accurate predictions over wide ranges of redshift and magnitude. In this paper, we use GaMPEN to estimate Bayesian posteriors for $L_B/L_T$, $R_e$, and $F$ for $\sim 8$ million galaxies with $z \leq 0.75$ from the Hyper Suprime-Cam (HSC) Wide survey (Aihara et al. 2018). A few recent works have studied the morphology of smaller subsets of HSC galaxies: Shimakawa et al. (2021) classified $\sim 2 \times 10^5$ massive HSC galaxies into spiral and non-spiral galaxies, while Kawinwanichakij et al. (2021) analyzed $\sim 1.5 \times 10^6$ HSC galaxies using single-component Sérsic fits. To date, the state-of-the-art morphological catalog that provided bulge+disk decomposition parameters at low redshift has been that of Simard et al. (2011), which used Sloan Digital Sky Survey (SDSS; York et al. 2000) imaging to estimate morphological parameters of $\sim 1$ million $m < 18$ galaxies, most with $z < 0.2$. As Figure 1 shows, HSC imaging allows us to probe much fainter magnitudes than SDSS with significantly better seeing (0.85″ for HSC-W g compared to 1.4″ for SDSS g). The catalog presented in this paper builds on Simard et al. (2011) by providing an order of magnitude increase in sample size and probing four magnitudes deeper, to a higher redshift threshold. Along with estimates of parameters, this catalog also provides robust uncertainties on the predicted parameters, which have typically been absent from previous large morphological catalogs. This catalog represents a significant step forward in our capability to quantify the shapes and sizes of galaxies in our universe.
This paper also demonstrates that ML techniques can be used to study morphology in new surveys, which do not have already-classified large training sets available. Most previous works involving ML to study galaxy morphology have depended on the availability of an extensive training set of real galaxies with known properties from the same survey or a similar one. However, if Convolutional Neural Networks (CNNs) are to replace traditional methods for morphological analysis, we must be able to use them on new surveys that do not have a morphological catalog readily available to be used as a training set. In this paper, we demonstrate that by first training on simulations of galaxies and then using real data for transfer learning (fine-tuning the simulation-trained network), we can fully train GaMPEN, while having to label $< 1\%$ of our dataset. This work outlines an easy pathway to apply morphological ML techniques to upcoming large imaging surveys.
In §2, we describe the HSC data used in this study, along with the simulated two-dimensional light profiles. §3 and §4 provide a brief introduction to GaMPEN and how we train it. In §5, we outline how we determined the morphological parameters of $\sim 60,000$ HSC-Wide galaxies for transfer learning and for validating the results of our ML framework. §6 provides detailed information on the accuracy of GaMPEN's predictions. §7 compares our predictions to those of other catalogs. We end with a summary and discussion of our results in §8.
The full Bayesian posteriors for all $\sim 8$ million galaxies are being released with the publication of this work. We are also releasing the source code of GaMPEN, along with the trained models and extensive documentation and tutorials on how to use GaMPEN. The public data release is described in Appendix A.

Hyper Suprime-Cam Data
We apply GaMPEN to g-, r-, and i-band data from the Hyper Suprime-Cam (HSC) Subaru Strategic Program Public Data Release 2 (PDR2; Aihara et al. 2019). The Subaru Strategic Program, ongoing since 2014, uses the HSC prime-focus camera, which provides extremely high sensitivity and resolving power due to the large 8.2 meter mirror of the Subaru Telescope.
In order to have a large but uniform sample of HSC PDR2 galaxies, we focus on the largest volume HSC survey, namely, the Wide layer, which covers 1400 deg$^2$ to the nominal survey depth in all filters and contains over 450 million primary objects. To consistently perform morphology determination in the rest-frame g-band across our entire sample, we use different filters for galaxies at different redshifts, as shown in Figure 2. We use g-band for $z \leq 0.25$, r-band for $0.25 < z \leq 0.50$, and i-band for $0.50 < z \leq 0.75$. Given HSC's typical seeing, objects with sizes $\lesssim 5$ kpc cannot be resolved beyond $z = 0.75$. Therefore, given that the HSC-Wide light profiles of the large majority of galaxies beyond this redshift will be dominated by the PSF, we restrict this work to $z \leq 0.75$ and will explore the application of GaMPEN to deeper, higher-redshift HSC data in future work.
In order to select galaxies, we used the PDR2 galaxy catalog produced using forced photometry on coadded images. The HSC data has been reduced using a pipeline built on the prototype pipeline developed by the Rubin Observatory's Large Synoptic Survey Telescope's Data Management system, and we refer the interested reader to Bosch et al. (2018) for more details. We use the extendedness_value flag to select only extended sources. The extendedness_value flag relies on the difference between the Composite Model (CModel) and PSF magnitudes to separate galaxies from stars, and contamination (from stars) increases sharply for $m > 23$ for median HSC seeing in the Wide layer, as outlined here$^1$. Thus, for each redshift bin, we select galaxies with magnitude $m < 23$ (in the appropriate band). The full query to download the data in each redshift bin is available in Appendix B. We use spectroscopic redshifts when available ($\sim 2.5\%$) and high-quality photometric redshifts otherwise. The spectroscopic redshifts were collated by the HSC Subaru Strategic Program (SSP) team from a wide collection of spectroscopic surveys: zCOSMOS DR3 (Lilly et al. 2009), UDSz (Bradshaw et al. 2013), 3D-HST (Momcheva et al. 2016) (Pentericci et al. 2018). The photometric redshifts in HSC PDR2 were calibrated based on the above spectroscopic sample and were calculated using the Direct Empirical Photometric code (DEmP; Hsieh & Yee 2014), an empirical quadratic polynomial photometric redshift fitting code, and Mizuki (Tanaka 2015), a Bayesian template fitting code. For an extended description, we refer the interested reader to Nishizawa et al. (2020). To remove galaxies with insecure photometric redshifts, we set the photoz_risk_best parameter $< 0.1$, which controls the risk of the photometric redshift being outside the range $z_{\rm true} \pm 0.15(1 + z_{\rm true})$, with 0 being extremely safe and 1 being extremely risky. Using the cleanflags_any flag, we further excluded objects flagged to have any significant imaging issues (in the relevant band) by the HSC pipeline. This flag can be triggered by a wide range of issues; however, for $\sim 80\%$ of cases, the flag was triggered by cosmic ray hits, as shown in Appendix B. We checked and confirmed that none of the above cuts significantly modified the redshift or magnitude distribution of our galaxy sample.
The above process resulted in the selection of $\sim 8$ million galaxies, with $\sim 1$ million, $\sim 3$ million, and $\sim 4$ million galaxies in the low-z, mid-z, and high-z bins, respectively, as shown in Table 1. The magnitude and redshift distribution of the data is shown in Figure 3. Using the HSC Image Cutout Service$^2$, we downloaded cutouts for each galaxy with sizes of 30″, 24″, and 16″ for the low-, mid-, and high-z bins, respectively. These sizes are large enough that they should capture all objects in the relevant bin. Using the results of our light-profile fitting analysis on a sub-sample, as outlined in §5, we expect $\sim 99.5\%$ of our sample to have a downloaded cutout size $\geq 10 \times$ (size of the major axis of the galaxy). Figure 4 shows randomly chosen examples of galaxy cutouts for each redshift bin.
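The redshift binning, filter choice, magnitude cut, and cutout sizes described in this section can be summarized in a few lines of code. The following is only an illustrative sketch, not the query given in Appendix B: the names `BinConfig` and `assign_bin` are hypothetical, and the extendedness, photometric-redshift-risk, and image-quality cuts are omitted.

```python
from typing import Dict, NamedTuple, Optional


class BinConfig(NamedTuple):
    name: str             # redshift bin label
    band: str             # filter closest to rest-frame g-band
    cutout_arcsec: float  # downloaded cutout size


# (upper redshift edge, configuration), ordered by increasing redshift.
_BINS = [
    (0.25, BinConfig("low-z", "g", 30.0)),
    (0.50, BinConfig("mid-z", "r", 24.0)),
    (0.75, BinConfig("high-z", "i", 16.0)),
]


def assign_bin(z: float, mags: Dict[str, float],
               mag_limit: float = 23.0) -> Optional[BinConfig]:
    """Return the bin for a galaxy, or None if it fails the z or m cuts.

    `mags` maps band name ("g", "r", "i") to magnitude; the cut is
    applied in the band that belongs to the galaxy's redshift bin.
    """
    if z < 0.0:
        return None
    for z_max, config in _BINS:
        if z <= z_max:
            return config if mags[config.band] < mag_limit else None
    return None  # beyond z = 0.75
```

For example, `assign_bin(0.3, {"g": 21.0, "r": 20.5, "i": 20.1})` selects the mid-z bin, so that galaxy would be analyzed in the r-band with a 24″ cutout.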

Simulated Galaxies for Initial Training
As outlined later in §4.2, we perform GaMPEN's initial round of training on mock galaxy image cutouts, simulated to match HSC observations in the appropriate band. To generate mock images, we used GalSim (Rowe et al. 2015), the modular galaxy image simulation toolkit. GalSim has been extensively tested and shown to yield very accurate rendered images of galaxies. We simulated 150,000 galaxies in each redshift bin, with a mixture of both single- and double-component galaxies, in order to have a diverse training sample. To be exact, 75% of the simulated galaxies consisted of both bulge and disk components, while the remaining 25% had either a single disk or a bulge.
For both the bulge and disk components, we used the Sérsic profile, and the parameters required to generate the Sérsic profiles were drawn from uniform distributions over ranges given in Table 2. For the disk and bulge components, we allow the Sérsic index to vary between 0.8 − 1.2 and 3.5 − 5.0, respectively. We chose to have varying Sérsic indices as opposed to fixed values for each component in order to have a training set with diverse light profiles. Note that the single-component galaxies were included in the simulations to have some examples of galaxies that are purely disk-dominated (i.e., no bulge component) and some that are purely bulge-dominated (i.e., no disk component). Thus, the Sérsic indices chosen for the single-component galaxies mirror the values chosen for the disk and bulge components in the double-component galaxies.

$^2$ https://hsc-release.mtk.nao.ac.jp/das_cutout/pdr2/
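As a rough illustration of the sampling scheme described above, the sketch below draws component types, Sérsic indices, and flux fractions with NumPy. It is not the simulation code used for this work: the function name is hypothetical, the uniform [0, 1] range for the bulge flux fraction is an assumption (the actual ranges are listed in Table 2), and rendering the profiles with GalSim is left out.

```python
import numpy as np


def sample_sersic_components(n_galaxies, rng):
    """Draw component types, Sersic indices, and flux fractions."""
    # 75% double-component; the single-component 25% is split equally
    # between pure disks and pure bulges.
    kind = rng.choice(["double", "disk", "bulge"], size=n_galaxies,
                      p=[0.75, 0.125, 0.125])

    # Sersic indices drawn uniformly over the ranges quoted in the text.
    n_disk = rng.uniform(0.8, 1.2, n_galaxies)
    n_bulge = rng.uniform(3.5, 5.0, n_galaxies)

    # ASSUMPTION: bulge flux fraction uniform in [0, 1] for illustration.
    bulge_fraction = rng.uniform(0.0, 1.0, n_galaxies)
    bulge_fraction[kind == "disk"] = 0.0
    bulge_fraction[kind == "bulge"] = 1.0

    # Mark the absent component of single-component galaxies with NaN.
    n_disk = np.where(kind == "bulge", np.nan, n_disk)
    n_bulge = np.where(kind == "disk", np.nan, n_bulge)

    return {
        "kind": kind,
        "n_disk": n_disk,
        "n_bulge": n_bulge,
        "bulge_fraction": bulge_fraction,
        # Disk and bulge flux fractions always add up to 1.
        "disk_fraction": 1.0 - bulge_fraction,
    }
```

Marking the absent component with NaN keeps all arrays the same length, which makes it easy to pass the draws on to whatever renderer is used.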
The parameter ranges for fluxes and half-light radii are quite expansive and are representative of most galaxies in the appropriate redshift range. To obtain these parameter ranges, we first start with a set of parameters that represent most local galaxies (Binney & Merrifield 1998). Thereafter, we redshift these parameters for each galaxy using Planck18 cosmology ($H_0 = 67.7$ km/s/Mpc, Planck Collaboration et al. 2018) and the appropriate pixel scale.
To make the two-dimensional light profiles generated by GalSim realistic, we convolved these with representative point-spread functions (PSFs) downloaded from the HSC survey. We also added realistic noise using one thousand 2″ × 2″ "sky objects" from the HSC PDR2 Wide field. Sky objects are empty regions identified by the HSC pipeline that are outside object footprints and are recommended for being used in blank-sky measurements. For PSF convolution and noise addition, we follow the procedure detailed in Ghosh et al. (2022) and refer the interested reader to that work for more details. Figure 5 shows a randomly chosen simulated light profile and the corresponding image cutout generated after PSF convolution and noise addition. All the simulated galaxy images in each redshift bin were chosen to have cutout sizes equal to their real data counterparts, as outlined in §2.1.
We would like to note that even after PSF convolution and noise addition, our simulated galaxies are only semi-realistic and do not account for many specific features seen in real data (e.g., spiral arms, knots, non-classical bulges, etc.). The primary goal of the simulation dataset is to provide a large corpus of images on which the initial training can be done. The second step of fine-tuning GaMPEN on real data (described in §4.3) ensures that the framework also learns about the existence of features in the real data that are missed by the simulations.

BRIEF INTRODUCTION TO GaMPEN
The Galaxy Morphology Posterior Estimation Network (GaMPEN; Ghosh et al. 2022) is a novel machine learning framework that can predict posterior distributions for a galaxy's bulge-to-total light ratio ($L_B/L_T$), effective radius ($R_e$), and flux ($F$). In this section, we provide a brief introduction to GaMPEN; however, for a complete understanding of GaMPEN's architecture and how it predicts posteriors, we refer the reader to Appendix C and Ghosh et al. (2022).
The architecture of GaMPEN consists of an upstream Spatial Transformer Network (STN) module followed by a downstream Convolutional Neural Network (CNN) module. The upstream STN is used to automatically crop each input image frame to an optimal size before morphology determination. Based on the input image, the STN predicts the parameters of an affine transformation, which is then applied to the input image. The transformed image is then passed on to the downstream CNN, which estimates the joint posterior distributions of the morphological parameters. Because the transformation we use is differentiable, the STN can be trained using standard backpropagation along with the downstream CNN without any additional supervision.
The inclusion of the STN in the framework greatly reduces the amount of time spent on data pre-processing, as we do not have to worry about making cutouts of the "proper size", an inherently difficult task since most of the galaxies in our sample have not been morphologically analyzed before. More importantly, the STN greatly reduces the chance of spurious results by cropping most secondary objects out of frame (see §6.2).
Two primary sources of error contribute to the uncertainties in the parameters predicted by GaMPEN. The first arises from errors inherent to the input imaging data (e.g., noise and PSF blurring), and this is commonly referred to as aleatoric uncertainty. The second source of error comes from the limitations of the model being used for prediction (e.g., the number of free parameters in GaMPEN, the amount of training data, etc.); this is referred to as epistemic uncertainty.

Table 2 notes:
a The single-component galaxies are equally divided between galaxies with a Sérsic index between 0.8 − 1.2 and galaxies with a Sérsic index between 3.5 − 5.0.
b Fractional fluxes are noted here. The bulge flux fraction is chosen such that for each simulated galaxy it is added with the disk flux fraction to give 1.0. The total flux of the galaxies is varied between the values given in the top row for each sample.
c The bulge position angle differs from the disk position angle by a randomly chosen value between −15 and +15 degrees.
Note: The above table shows the ranges of the various Sérsic profile parameters used to simulate mock HSC cutouts. 75% of the simulated galaxies have both disk and bulge components, and the remainder have either a disk or a bulge component. All the simulation parameters are drawn from uniform distributions.
For every given input image, GaMPEN predicts a multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$, with mean $\mu$ and covariance matrix $\Sigma$. Although we would like to use GaMPEN to predict aleatoric uncertainties, the covariance matrix, $\Sigma$, is not known a priori. Instead, we train GaMPEN to learn these values by minimizing the negative log-likelihood of the output parameters for the training set. The covariance matrix here represents the aleatoric uncertainties in GaMPEN's predictions.
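To make the training objective concrete, the sketch below evaluates the negative log-likelihood of a target vector under a predicted multivariate Gaussian. This is the textbook expression only; the exact loss used by GaMPEN is given in Ghosh et al. (2022) and may contain additional terms. The function name `gaussian_nll` is hypothetical.

```python
import numpy as np


def gaussian_nll(y, mu, cov):
    """Negative log-likelihood of target y under N(mu, cov)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)

    resid = y - mu
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance matrix must be positive definite")

    # Mahalanobis term: penalizes predictions far from the target,
    # measured in units of the predicted uncertainty.
    maha = resid @ np.linalg.solve(cov, resid)

    # The log-determinant term penalizes needlessly broad covariances.
    return 0.5 * (maha + logdet + y.size * np.log(2.0 * np.pi))
```

The two terms pull in opposite directions: inflating the covariance shrinks the Mahalanobis term but raises the log-determinant, which is why minimizing this quantity lets a network learn a covariance without ever being shown one as a label.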
In order to obtain epistemic uncertainties, we use the Monte-Carlo Dropout technique (Srivastava et al. 2014), wherein during inference, each image is passed through the trained networks multiple times. During each forward pass, random neurons from the network are removed according to a Bernoulli distribution, i.e., individual nodes are set to zero with a probability, $p$, known as the dropout rate.
The entire procedure used to estimate posteriors is summarized in Figure 6. Once GaMPEN has been trained, we feed each input image, $X_n$, 500 times into the trained model with dropout enabled. During each iteration, we collect the predicted set of $(\mu_{n,t}, \Sigma_{n,t})$ for the $t$-th forward pass. Then, for each forward pass, we draw a sample $\hat{Y}_{n,t}$ from the multivariate normal distribution $\mathcal{N}(\mu_{n,t}, \Sigma_{n,t})$. The distribution generated by the collection of all 500 forward passes, $\hat{Y}_n$, represents the predicted posterior distribution for the test image $X_n$. The different forward passes capture the epistemic uncertainties, and each prediction in this sample also has its associated aleatoric uncertainty represented by $\Sigma_{n,t}$. Thus the above procedure allows us to incorporate both aleatoric and epistemic uncertainties in GaMPEN's predictions.

Figure 6 (caption excerpt). During this process, we re-scale the variables described in the text, and return them to the original variable space during inference. After the STN+CNN networks are trained, the posterior inference step consists of 500 forward passes with dropout enabled for each galaxy image. We draw a sample from the predicted multivariate Gaussian distribution during each forward pass, and the collection of these samples gives us the predicted posterior distribution.
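The sampling loop described above can be sketched as follows. This is a schematic, not GaMPEN's inference code: `stochastic_model` stands in for a trained network with dropout left on, and `toy_model` is a purely hypothetical replacement whose jittered mean mimics pass-to-pass (epistemic) scatter.

```python
import numpy as np


def sample_posterior(stochastic_model, image, n_passes=500, rng=None):
    """Collect one draw per stochastic forward pass.

    `stochastic_model(image, rng)` must return (mu, cov) and is expected
    to give slightly different answers on each call (dropout enabled).
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = []
    for _ in range(n_passes):
        mu, cov = stochastic_model(image, rng)
        # One draw from N(mu, cov) folds the aleatoric uncertainty
        # of this forward pass into the final sample.
        samples.append(rng.multivariate_normal(mu, cov))
    return np.asarray(samples)  # shape: (n_passes, n_parameters)


def toy_model(image, rng):
    """HYPOTHETICAL stand-in for a dropout-enabled network."""
    mu = np.array([0.0, 1.0, 2.0]) + rng.normal(0.0, 0.1, size=3)
    cov = 0.04 * np.eye(3)
    return mu, cov
```

In this toy setup the variance of the collected samples is roughly the sum of the pass-to-pass variance of the mean (0.01) and the per-pass covariance (0.04), which shows how the two kinds of uncertainty combine in the final distribution.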

TRAINING GaMPEN
Since most of the galaxies described in §2.1 have not been morphologically analyzed before, we devise a method to train GaMPEN with only 0.5% of our sample of real galaxies. In order to achieve this, we first train GaMPEN using the simulated galaxies described in §2.2. Thereafter, we apply "transfer learning", wherein we fine-tune the already trained network using a small sample of real galaxies, analyzed using GALFIT (Peng et al. 2002).
The process of training GaMPEN consists of the following steps, summarized in Figure 7.
• Simulating galaxies corresponding to the target data set (described previously in §2.2).
• Initial training of GaMPEN on the above simulated images (described further in §4.2).
• Fine-tuning GaMPEN using a small fraction of the real dataset. For this, we used $\sim 0.5\%$ of the HSC data described in §2.1. This process is known as transfer learning and is described further in §4.3.
• Testing the performance of the fine-tuned network on a test set of real galaxies.For this, we used ∼ 0.25% of the HSC data described in §2.1.
• Processing the remainder of the real data ( ∼ 99% of the HSC data described in §2.1) through the trained framework.
We describe each of the above steps in detail below.

Data Transformations
To make the training process more robust against numerical instabilities, we transform the input images and target variables following the steps outlined in Ghosh et al. (2022).
Since reducing the dynamic range of pixel values has been found to be helpful in neural network convergence (e.g., Zanisi et al. 2021; Walmsley et al. 2021; Tanaka et al. 2022), we pass all images in our dataset through the arsinh function. For all the target variables, we first apply the logit transformation to $L_B/L_T$ and log transformations to $R_e$ and $F$:

$$f''(\mathbf{Y}) = \left( \log \frac{L_B/L_T}{1 - L_B/L_T},\ \log R_e,\ \log F \right), \tag{1}$$

where $\mathbf{Y} = (L_B/L_T, R_e, F)$ is the target set of variables before transformation and $f''$ is how we will refer to the transformation in Equation 1. Next, we apply the standard scaler transformation to each parameter (calibrated on the training data), which amounts to subtracting the mean value of each parameter and scaling its variance to unity. These two transformations ensure that all three variables have similar numerical ranges and prevent variables with larger numerical values from making a disproportionate contribution to the loss function.

Figure 7. We first train GaMPEN using simulated light profiles described in §2.2. Thereafter, we fine-tune the simulation-trained framework using 0.5% of our real data sample, for which we pre-determined the morphological parameters using light-profile fitting, as described in §5. Finally, we process all the $\sim 8$ million galaxies in our dataset through the trained GaMPEN framework to obtain estimates of their morphological parameters and associated uncertainties.
Post training, during inference, we apply the inverse of the standard scaler function (with no re-tuning of the mean or variance), followed by the inverse of the logit and log transformations, $f''^{-1}$, as indicated in Figure 6. Besides transforming the variables back to their original space, these final transformations also ensure that the predicted values always conform to physically meaningful ranges ($0 \leq L_B/L_T \leq 1$; $R_e > 0$; $F > 0$).
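The forward and inverse target transformations can be illustrated with a short NumPy sketch. The function and class names are hypothetical, and the use of natural logarithms is an assumption; the text only specifies "logit" and "log". Images would be passed through `np.arcsinh` separately.

```python
import numpy as np


def forward_transform(targets):
    """Logit for L_B/L_T, log for R_e and F; targets has shape (n, 3).

    ASSUMPTION: natural logarithms. The logit requires 0 < L_B/L_T < 1.
    """
    bt, re, flux = np.asarray(targets, dtype=float).T
    return np.column_stack([np.log(bt / (1.0 - bt)), np.log(re), np.log(flux)])


def inverse_transform(transformed):
    """Inverse of forward_transform (sigmoid and exponentials)."""
    logit_bt, log_re, log_flux = np.asarray(transformed, dtype=float).T
    return np.column_stack([1.0 / (1.0 + np.exp(-logit_bt)),
                            np.exp(log_re), np.exp(log_flux)])


class Standardizer:
    """Zero-mean, unit-variance scaling calibrated on the training set."""

    def fit(self, x):
        self.mean_ = x.mean(axis=0)
        self.std_ = x.std(axis=0)
        return self

    def transform(self, x):
        return (x - self.mean_) / self.std_

    def inverse_transform(self, x):
        # Reuses the training-set statistics; nothing is re-fitted.
        return x * self.std_ + self.mean_
```

Because the inverse maps pass through a sigmoid and exponentials, any real-valued network output lands at a bulge-to-total ratio between 0 and 1 and at positive radius and flux.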

Initial Training of GaMPEN on simulated galaxies
The purpose of training GaMPEN initially on simulations is two-fold. Firstly, it greatly reduces the amount of real data needed for training the framework. Secondly, since simulated galaxies are the only situation where we have access to the "ground-truth" morphological parameters, they provide the perfect dataset to assess GaMPEN's typical accuracy for the different output parameters.
In Ghosh et al. (2022), we extensively tested and reported on GaMPEN's performance on simulated HSC $z \leq 0.25$ g-band galaxies. Here, we extend that to include simulated r-band $0.25 < z \leq 0.50$ and i-band $0.50 < z \leq 0.75$ galaxies. Out of the 150,000 galaxies simulated in each z-bin, we use 70% to train the framework and 15% as a validation set to determine the optimum value for various hyper-parameters (such as learning rate, batch size, etc.). Thereafter, we use the remaining 15%, which the framework has never encountered before, to evaluate the performance of the trained framework.
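A 70/15/15 split of this kind can be produced by shuffling indices once and slicing. The sketch below is a generic illustration with a hypothetical function name, not the code used to prepare our datasets.

```python
import numpy as np


def split_indices(n_total, rng, train_pct=70, val_pct=15):
    """Shuffle indices and split them into train/validation/test sets."""
    shuffled = rng.permutation(n_total)
    # Integer arithmetic avoids floating-point round-off in the sizes.
    n_train = n_total * train_pct // 100
    n_val = n_total * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # never seen during training
    return train, val, test
```

For 150,000 simulated galaxies this yields 105,000 training, 22,500 validation, and 22,500 test examples per redshift bin.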
We train GaMPEN by minimizing its loss function using Stochastic Gradient Descent. The different hyper-parameters that need to be tuned are: the learning rate (the step size during gradient descent), momentum (acceleration factor used for faster convergence), the strength of L2 regularization (the degree to which larger weight values are penalized), and batch size (the number of images processed before weights and biases are updated). To choose these hyper-parameters, we trained GaMPEN with different sets of hyper-parameters and chose the ones that resulted in the lowest value for the loss function on the validation set. The final hyper-parameters for the trained models are given in Appendix D.
In order to test the robustness of our simulation-trained framework, we compare the predictions made by GaMPEN on the test set to the true values determined from the simulation parameters. The results are similar across all redshift bins and closely follow what we determined previously in Ghosh et al. (2022). The histograms of residuals for GaMPEN's output parameters are shown in Figure 8 across all three redshift bins. Note that to make all three parameters dimensionless, we report (Residual $R_e$)/$R_e$, instead of simply Residual $R_e$. For each histogram, we also show the mean ($\mu$), median ($\tilde{\mu}$), and standard deviation ($\sigma$). The mean and median help demonstrate that the distributions are all centered around zero, and the standard deviation indicates the value of the "typical error" made by the framework (i.e., 68.27% of the time, GaMPEN's prediction errors are less than this value).
Note that for the simulated data, the typical error made by GaMPEN increases with redshift. This is expected given that in our simulations, galaxies in the higher redshift bins are preferentially smaller, fainter, and have lower signal-to-noise ratios than their lower redshift counterparts; thus, these galaxies are harder to analyze morphologically for any image processing algorithm. However, for all the parameters across all redshift bins, the typical GaMPEN error is $\lesssim 15\%$ for the simulated sample.
We would like to note that GaMPEN does not explicitly predict the number of components a galaxy has, but we performed an analysis of its relative performance on single- and double-component galaxies in Ghosh et al. (2022) (see Figure 15 therein). We found that accurately determining $L_B/L_T$ is more challenging for double-component galaxies (compared to single-component galaxies), and becomes even more difficult when one of the components strongly dominates the other.

Transfer Learning using Real Data
Transfer Learning as a data-science concept has been around since the late 1990s (e.g., Blum & Mitchell 1998) and involves taking a network trained on a particular dataset and optimized for a specific task, and fine-tuning the weights and biases for a slightly different task or dataset. In Ghosh et al. (2020), we introduced the concept of training on simulated light profiles and then transfer learning using real data for galaxy morphology analysis. Here, we employ the same idea, taking the networks trained on the simulated galaxies and then employing transfer learning using $\sim 15,000$ real HSC-Wide galaxies in each redshift bin.
In order to select a sample of galaxies to use for transfer learning, we start with the galaxies summarized in Table 1. Note that what matters in training a CNN is not matching the observed distributions of the simulation parameters; rather, it is spanning the full range of those parameters with high fidelity. Having too many galaxies of a particular type, even if that is the reality in real data, can result in lower accuracy for minority populations (e.g., Ghosh et al. 2020). As seen in Figure 3, the sample for all three redshift bins is heavily biased towards fainter galaxies. To ensure that GaMPEN gets to train on enough bright galaxies, we split the mid- and high-z samples into two sub-samples: $m \leq 21$ and $m > 21$. For the low-z sample, since it has a more substantial tail towards the lower $m$ values (compared to the mid- and high-z samples), we use three sub-samples: $m \leq 18$, $18 < m \leq 20$, and $m > 20$. We select 20,000 galaxies from each redshift bin, making sure to sample equally across the magnitude bins mentioned above. Thereafter, we determine their morphological parameters using light-profile fitting, as described later in §5. The magnitude and redshift distributions of the selected galaxies are shown in Figure 9. As seen from the figure, the transfer learning sample has sufficiently large numbers of examples from all parts of the parameter space. This allows us to optimize GaMPEN for the full range of galaxy morphologies.
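Sampling equally across magnitude bins, as described above, can be sketched as follows. The function name is hypothetical, and the handling of bins with too few galaxies (take all of them) is an assumption made for illustration.

```python
import numpy as np


def stratified_sample(mags, edges, n_total, rng):
    """Pick ~n_total galaxies, equally divided between magnitude bins.

    `edges` are the inner bin boundaries, e.g. [21.0] for the mid- and
    high-z samples or [18.0, 20.0] for the low-z sample.
    """
    mags = np.asarray(mags, dtype=float)
    # right=True puts m <= edge into the brighter (lower-index) bin.
    bin_index = np.digitize(mags, edges, right=True)
    n_bins = len(edges) + 1
    per_bin = n_total // n_bins

    chosen = []
    for b in range(n_bins):
        candidates = np.flatnonzero(bin_index == b)
        # ASSUMPTION: if a bin is under-populated, take everything in it.
        take = min(per_bin, candidates.size)
        chosen.append(rng.choice(candidates, size=take, replace=False))
    return np.concatenate(chosen)
```

With `edges=[21.0]` and `n_total=20000`, the function returns 10,000 indices with $m \leq 21$ and 10,000 with $m > 21$, regardless of how strongly the parent sample is skewed towards faint galaxies.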
Out of the 20,000 galaxies selected in each redshift bin, we use 75% for fine-tuning the simulation-trained GaMPEN models and another 5% for selecting the various hyper-parameters to be used during the transfer learning process. The remaining 20% is used to evaluate the performance of the fully trained GaMPEN frameworks in §6.
In order to artificially augment the number of galaxies being used for transfer learning, we apply random rotations and horizontal/vertical flips on the galaxies earmarked for training.This takes the effective number of samples used for fine-tuning in each bin from ∼ 15, 000 to ∼ 90, 000.Using the values determined by light-profile fitting as the labels, we fine-tune the three GaMPEN models trained on simulations.We choose the values of the different hyper-parameters based on the loss computed on the validation set, and the final chosen hyper-parameters for transfer learning are reported in Appendix D.
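A minimal sketch of this kind of augmentation is given below. It is a simplification under stated assumptions: rotations are restricted to multiples of 90 degrees so that the example needs no image library, images are plain nested lists, and producing five transformed copies per cutout (six variants in total) is our own choice to mirror the roughly sixfold increase quoted above. The function names are hypothetical.

```python
import random

def rotate90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [list(row) for row in img[::-1]]

def augment(img, n_copies, rng):
    """Return the original image plus n_copies randomly transformed copies."""
    out = [[list(row) for row in img]]
    for _ in range(n_copies):
        new = [list(row) for row in img]
        for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
            new = rotate90(new)
        if rng.random() < 0.5:
            new = flip_horizontal(new)
        if rng.random() < 0.5:
            new = flip_vertical(new)
        out.append(new)
    return out

rng = random.Random(0)
cutout = [[1, 2], [3, 4]]  # stand-in for a galaxy cutout
augmented = augment(cutout, n_copies=5, rng=rng)
```

Because rotations and flips only rearrange pixels, the labels obtained from light-profile fitting ($L_B/L_T$, $R_e$, $F$) remain valid for every augmented copy.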

Fine-Tuning the Dropout Rate
Aside from the hyper-parameters mentioned in §4.2, there is one more adjustable parameter in GaMPEN: the dropout rate, which directly affects the calculation of the epistemic uncertainties outlined in §3. On average, higher dropout rates lead networks to estimate higher epistemic uncertainties. To determine the optimal value for the dropout rate, during transfer learning, we trained variants of GaMPEN with dropout rates from $10^{-3}$ to $10^{-5}$, all with the same optimized values of momentum, learning rate, and batch size mentioned in Appendix D.
To compare these models, we calculate the percentile coverage probabilities associated with each model, defined as the percentage of the total test examples where the parameter value determined using light-profile fitting lies within a particular confidence interval of the predicted distribution. We calculate the coverage probabilities associated with the 68.27%, 95.45%, and 99.73% central percentile confidence levels, corresponding to the 1σ, 2σ, and 3σ confidence levels for a normal distribution. For each distribution predicted by GaMPEN, we define the 68.27% confidence interval as the region on the x-axis of the distribution that contains 68.27% of the most probable values of the integrated probability distribution. To estimate the probability distribution function from the GaMPEN predictions (which are discrete), we use kernel density estimation, which is a nonparametric technique to estimate the probability density function of a random variable.
We calculate the 95.45% and 99.73% confidence intervals of the predicted distributions in the same fashion. Finally, we calculate the percentage of examples for which the GALFIT-ed parameter values lie within each of these confidence intervals. An accurate and unbiased estimator should produce coverage probabilities equal to the confidence level for which they were calculated (e.g., the coverage probability corresponding to the 68.27% confidence interval should be 68.27%). For every redshift bin, we choose the dropout rate for which the calculated coverage probabilities are the closest to their corresponding confidence levels. This leads to a dropout rate of $4 \times 10^{-4}$ for the low-z bin, and $2 \times 10^{-4}$ for the mid- and high-z bins.
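The coverage-probability calculation can be sketched as follows. This is an illustrative approximation, not GaMPEN's implementation: instead of kernel density estimation, the interval containing the most probable values is approximated by the shortest interval that encloses the requested fraction of posterior samples (a reasonable stand-in for unimodal distributions), and the "posteriors" are synthetic Gaussians that are calibrated by construction. All function names are hypothetical.

```python
import math
import random

def shortest_interval(samples, level):
    """Shortest interval containing a fraction `level` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(level * n))  # number of samples inside the interval
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]

def coverage_probability(posteriors, truths, level):
    """Percentage of objects whose reference value lies inside the interval."""
    hits = 0
    for samples, truth in zip(posteriors, truths):
        lo, hi = shortest_interval(samples, level)
        hits += lo <= truth <= hi
    return 100.0 * hits / len(truths)

rng = random.Random(1)
posteriors, truths = [], []
for _ in range(400):
    centre = rng.uniform(0.0, 1.0)
    # Reference value and posterior samples share the same distribution,
    # so this toy estimator is well calibrated by design.
    truths.append(rng.gauss(centre, 0.1))
    posteriors.append([rng.gauss(centre, 0.1) for _ in range(500)])

for level in (0.6827, 0.9545, 0.9973):
    print(level, coverage_probability(posteriors, truths, level))
```

For a calibrated estimator the printed percentages should scatter around 68.27, 95.45, and 99.73. Artificially narrowing the sample spread (analogous to too low a dropout rate) would push the coverage below these levels, and widening it would push the coverage above them.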
As an example, we show in Figure 10 the coverage probabilities (averaged across the three output variables) for different dropout rates for the low-z sample. As can be seen, higher values of the dropout rate lead to GaMPEN over-predicting the epistemic uncertainties, resulting in coverage probabilities that are too high. In contrast, extremely low values lead to GaMPEN under-predicting the epistemic uncertainties. For a dropout rate of $4 \times 10^{-4}$, the calculated coverage probabilities are very close to their corresponding confidence levels, resulting in accurately calibrated posteriors.
Figure 9. Magnitude and redshift distributions for all galaxies in each redshift bin, plotted along with the galaxies selected for transfer learning (before and after applying quality cuts, as described in §5). Note that we plot density on the y-axis, not the number of samples. The total number of galaxies used for transfer learning is ∼0.5% of all the galaxies in our dataset. The relative density of some magnitude bins is higher than others (e.g., 18 < m ≤ 20 for low-z) because they span a smaller range while having roughly the same number of galaxies as the other bins.
Figure 10. The calculated percentile coverage probabilities for different dropout rates for the low-z bin. Note that the coverage probabilities have been averaged over the three output variables. The coverage probabilities are defined as the percentage of the total test examples where the value determined using light-profile fitting lies within a particular confidence interval of the predicted distribution. A dropout rate of $4 \times 10^{-4}$ leads to coverage probabilities very close to their corresponding confidence levels. A similar process of tuning in the mid-z and high-z bins leads to an optimal dropout rate of $2 \times 10^{-4}$ in both of them.

GALFITTING GALAXIES FOR TRANSFER LEARNING & VALIDATION
In this section, we describe a semi-automated pipeline that we developed and used to determine the morphological parameters of ∼60,000 galaxies (∼20,000 in each z-bin), which are used for transfer learning and to test the efficacy of the trained GaMPEN frameworks.
In order to estimate the parameters, we use GALFIT, which is a two-dimensional fitting algorithm designed to extract structural components from galaxy images. However, before running GALFIT, we run Source Extractor (Bertin & Arnouts 1996) on all the cutout frames in order to obtain segmentation maps for each input image. We use these segmentation maps to mask all secondary objects present in the cutout frame during light-profile fitting. We also use the Source Extractor estimates of various morphological parameters as the initial starting guesses during light-profile fitting. Lastly, we use the Source Extractor estimates to pick a cutout size for each galaxy, which we set to be ten times the effective radius estimated by Source Extractor.
Figure 11. Steps used in our light-profile fitting pipeline to determine morphological parameters, for two representative galaxies in each redshift bin. From left to right we show the input image, the mask generated by Source Extractor, the model generated by GALFIT, and the residuals. Note that since we do not explicitly model any Fourier bending modes or coordinate rotations, we expect features like spiral arms to show up in the residuals, as depicted in the second row.
Using GALFIT, we fit each galaxy with two Sérsic components: a disk component with Sérsic index n = 1, and a bulge component with 3.5 < n < 5.0. For each galaxy, we perform two consecutive rounds of fitting: a first round with some constraints placed on the values of the different parameters and a second round with almost no constraints. This helps the parameters converge quickly during the initial round while still allowing the full exploration of the parameter space in the subsequent round. In the initial round, we constrain the radius of each galaxy to be between 0.5 and 90 pixels (0.084 to 15.12 arcsec) and the magnitude to be within ±7.5 of the initial value guessed by Source Extractor. We also constrain the difference between the magnitudes of the two components to be between −7.5 and +7.5 (0.001 < $L_B/L_T$ < 0.999), and the relative separation between the centers of the two light profiles to be not more than 30 pixels (5.04 arcsec). These constraints are much more expansive than what we expect galaxy parameters to be in this z-range and are similar to what has been used in many previous studies (e.g., Van Der Wel et al. 2012; Tuccillo et al. 2018). During the second round of fitting, we only use the constraints on the Sérsic indices and the relative separation between the two components. After fitting the two components, we calculate the $R_e$ of each galaxy as the radius that contains 50% of the total light in the combined disk + bulge model. Figure 11 shows the image cutouts, masks, fitted models, and residuals for two typical galaxies chosen from each redshift bin.
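The correspondence between the magnitude-difference constraint and the quoted $L_B/L_T$ bounds follows from the standard magnitude-flux relation, and can be checked with a few lines of Python. The sign convention (Δm defined as bulge magnitude minus disk magnitude) and the function names are our own choices for this sketch; the pixel scale of 0.168 arcsec/pixel is the HSC value quoted later in this section.

```python
PIXEL_SCALE = 0.168  # arcsec per pixel (HSC value quoted in the text)

def bulge_to_total(delta_mag):
    """L_B/L_T for a magnitude difference delta_mag = m_bulge - m_disk."""
    flux_ratio = 10 ** (-0.4 * delta_mag)  # F_bulge / F_disk
    return flux_ratio / (1.0 + flux_ratio)

def pixels_to_arcsec(n_pixels):
    """Convert a length in pixels to arcsec."""
    return n_pixels * PIXEL_SCALE

# Bulge 7.5 mag fainter or brighter than the disk.
print(round(bulge_to_total(+7.5), 3), round(bulge_to_total(-7.5), 3))
# Upper radius limit and maximum centroid separation, in arcsec.
print(round(pixels_to_arcsec(90), 2), round(pixels_to_arcsec(30), 2))
```

A difference of 7.5 magnitudes corresponds to a flux ratio of $10^{3}$, which is why the allowed range of $L_B/L_T$ stops just short of 0 and 1.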
After the two rounds of fitting, we excluded galaxies for which GALFIT failed to converge (∼3.2%, ∼3.8%, and ∼2.9% for the low-, mid-, and high-z bins, respectively). Thereafter, we visually inspected the fits of a randomly selected sub-sample of ∼300 galaxies from each redshift bin. The process of visually inspecting ∼900 galaxies helped us to identify three failure modes of our fitting pipeline: i) for some galaxies, GALFIT assigned an extremely small axis ratio to one of the components, leading to an unphysical lopsided bulge/disk; ii) for some galaxies, the centroids of the two fitted components were too far away from each other; iii) some galaxies had residuals that were too large. Examples of each of these failure modes are shown in Appendix E. In order to remove these problematic fits from our GALFIT-ed sample, we use a mixture of GALFIT flags and calculated parameters. Specifically, as summarized in Table 3, we exclude fits that meet any of the following criteria:
a) The problematic value flags are set; these flag galaxies that have an extremely small axis ratio (< 0.1) or radius (< 0.5 pixels), or any other parameters that caused issues with numerical convergence.
b) The max iters flag is set; this flags galaxies for which GALFIT quit after reaching the maximum number of iterations (100).
c) The reduced $\chi^2$ of the fit is poor.
d) The distance between the centers of the two fitted components is too large.
Table 3 shows the criteria used to exclude fits in each redshift bin. The variation in the thresholds with redshift accounts for the fact that galaxies at higher redshift are preferentially smaller, fainter, and have lower signal-to-noise ratios. Figure 9 shows that the exclusion criteria do not selectively exclude more galaxies from certain regions of the magnitude/redshift parameter space compared to others. In order to determine appropriate thresholds for the various flags above, we balanced excluding too many galaxies against ensuring only good fits are included in the final transfer learning dataset. The choice of thresholds is arbitrary to a certain extent. In order to allow users to retrain GaMPEN using different criteria for their own scientific analysis, we are making public the entire catalog of GALFIT-ed values as outlined in Appendix A.
To the best of our knowledge, there is no published large catalog of bulge+disk decomposition of HSC galaxies against which we can compare the results of our light-profile fitting pipeline. However, Simard et al. (2011) performed bulge+disk decomposition using g- and r-band imaging from the Sloan Digital Sky Survey (SDSS). We should note that HSC-Wide differs from SDSS in a multitude of ways, with the most significant differences being in median seeing [HSC-Wide: 0.79 arcsec compared to SDSS: 1.4 arcsec in the g-band] and pixel scale [HSC: 0.168 arcsec/pixel compared to SDSS: 0.396 arcsec/pixel]. Thus, we do not expect our analysis to yield exactly the same results as that of Simard et al. (2011). However, given that a significant portion of our low-z sample overlaps with that of Simard et al. (2011), it is still useful to compare our results to theirs, as the overall trends should agree.
Using an angular cross-match diameter of 0.15 arcsec, we cross-matched our low-z GALFIT sample with that of Simard et al. (2011) to obtain a sample of ∼6500 galaxies. Note that although our HSC sample extends to g < 23, the cross-matched galaxies are mostly g < 20 due to the shallower depth of the SDSS data. In Figure 12, we compare the results of our light-profile fitting with those of Simard et al. (2011). The figure shows galaxies in hexagonal bins of roughly equal size, with the number of galaxies in each bin represented according to the colorbar on the right. Note that we are using a logarithmic colorbar to explore the full distribution of galaxies, down to 1 galaxy/bin. Although there are outliers present for all three parameters, and the scatter of the relationship depends on the parameter, our GALFIT-derived parameters strongly correlate with those of Simard et al. (2011). According to Spearman's rank correlation test (see Dodge 2008, for more details), there is a positive correlation for all three variables, and the null hypothesis of non-correlation can be rejected at extremely high significance ($p < 10^{-200}$). The correlation coefficients obtained for $L_B/L_T$, $R_e$, and $F$ are 0.85, 0.95, and 0.98, respectively. For both our GALFIT predictions and those of Simard et al. (2011), we define $R_e$ for double-component fits as the radius that encompasses 50% of the light from both components combined.
Note that the higher scatter in the $L_B/L_T$ relation is expected, given that bulge+disk decomposition involves constraining two different imaging components simultaneously and is thus more sensitive to algorithmic differences and the differences in imaging quality between SDSS and HSC mentioned above. These differences in imaging quality, data-reduction pipeline, and filters (Kawanomoto et al. 2018) might also explain the slight offsets in the $R_e$ and flux measurements. It is also important to note that most of the differences in $R_e$ measurements are at effective radii smaller than the median SDSS g-band seeing, as seen in the middle panel of Figure 12.
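For readers unfamiliar with Spearman's rank correlation, the coefficient is simply the Pearson correlation computed on the ranks of the two variables. The sketch below is a hand-written, dependency-free illustration with made-up inputs; it does not compute the p-value, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used. The helper names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up effective radii from two hypothetical catalogs (arcsec).
ours = [0.8, 1.1, 1.9, 2.4, 3.7, 5.0]
theirs = [1.0, 1.2, 1.8, 2.9, 3.5, 5.6]
rho = spearman_rho(ours, theirs)
```

Because only the ordering matters, the coefficient is insensitive to systematic offsets or monotonic rescalings between the two catalogs, which makes it a natural choice when comparing surveys with different seeing and pixel scales.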

EVALUATING GaMPEN'S PERFORMANCE
After both rounds of training are complete, we apply the final trained GaMPEN models to all the ∼8 million galaxies outlined in §2.1. In this section, we evaluate GaMPEN's performance to assess the reliability of its predictions.

Inspecting the Predicted Posteriors
Using the procedure outlined in §3 and Figure 6, we use the trained GaMPEN frameworks to obtain joint probability distributions of all the output parameters for each of the ∼8 million galaxies. Figure 13 shows the marginalized posterior distributions for six randomly selected galaxies, along with the input cutouts fed to GaMPEN. All the predicted distributions are unimodal, smooth, and resemble Gaussian/skewed-Gaussian distributions. For each predicted distribution, the figure also shows the parameter space regions that contain 68.27%, 95.45%, and 99.73% of the most probable values of the integrated probability distribution. We use kernel density estimation to estimate the probability distribution function (PDF; shown by a blue line in the figure) from the predicted values. The mode of this PDF is what we refer to as the "predicted value" henceforth. The figure also demonstrates how GaMPEN predicts distributions of different widths based on the input frame (e.g., the $L_B/L_T$ distribution in the second row is wider compared to the first row), and we will explore this in more detail in §6.4.
By design, GaMPEN predicts only physically possible values. This is especially apparent in the $L_B/L_T$ column of rows 1 and 4 of Figure 13. Note that to achieve this, we do not artificially truncate these distributions. Instead, we use data transformations, as outlined in §4.1. This ensures that the predicted $L_B/L_T$ values are always between 0 and 1. Similarly, we also ensure that the $R_e$ and $F$ values predicted by GaMPEN are positive through appropriate transformations.
While performing quality checks on the predicted posteriors of all the ∼8 million galaxies, we noticed that sometimes GaMPEN predicts $R_e$ and $F$ values outside the parameter range on which it was trained (e.g., m > 23). It has been shown that while machine learning frameworks are excellent at interpolation, one should be extremely cautious while trying to extrapolate too much beyond the training set (e.g., Quiñonero-Candela et al. 2009; Recht et al. 2019; Taori et al. 2020). Thus, for each redshift bin, we exclude galaxies with predicted $R_e$ and $F$ values which are outside the upper and lower bounds of the training set by more than 0.5 arcsec or 0.5 mag, respectively. This led to ∼6%, ∼2.5%, and ∼1.2% of the data being excluded in the low-, mid-, and high-z bins.
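One common way to obtain bounded outputs without truncation is to let the network work in an unbounded space and map back with an invertible function. The sketch below assumes a logit transformation for $L_B/L_T$ and a logarithmic transformation for $R_e$ and $F$; the exact transformations are those of §4.1, which is not reproduced here, so this choice should be read as an assumption. The numerical values are made up.

```python
import math

def logit(p):
    """Map a fraction in (0, 1) to the whole real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map any real number back to a fraction in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Unbounded network-space outputs (made-up numbers for illustration).
raw = [-6.0, -0.3, 0.0, 2.5, 9.0]

bulge_fractions = [inverse_logit(x) for x in raw]  # always inside (0, 1)
radii = [math.exp(x) for x in raw]                 # always positive
```

Since the inverse maps cover only the physically allowed range, every sample drawn in the transformed space corresponds to a valid parameter value, and the resulting posteriors pile up smoothly near 0 or 1 instead of being cut off.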
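The exclusion rule can be written as a simple range check. In the sketch below only the 0.5 arcsec and 0.5 mag margins come from the text; the training-set bounds and the function names are hypothetical placeholders.

```python
def within_training_range(value, train_min, train_max, margin):
    """True if value lies inside the training range widened by margin."""
    return (train_min - margin) <= value <= (train_max + margin)

def keep_prediction(radius_arcsec, magnitude, radius_bounds, mag_bounds):
    """Keep a galaxy only if both predictions stay near the training range."""
    return (within_training_range(radius_arcsec, *radius_bounds, margin=0.5)
            and within_training_range(magnitude, *mag_bounds, margin=0.5))

# Hypothetical training-set bounds for one redshift bin.
RADIUS_BOUNDS = (0.1, 10.0)  # arcsec
MAG_BOUNDS = (14.0, 23.0)    # mag

print(keep_prediction(1.0, 23.4, RADIUS_BOUNDS, MAG_BOUNDS))  # inside margin
print(keep_prediction(1.0, 23.6, RADIUS_BOUNDS, MAG_BOUNDS))  # too faint
```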

Evaluating the STN performance
As can be seen from Rows 3, 4, and 5 of Figure 13, GaMPEN can accurately predict morphological parameters even when the primary galaxy of interest occupies a small portion of the input cutout, and secondary objects are present in the input frame. This is primarily enabled by the upstream STN in GaMPEN which, during training, learns to apply an optimal amount of cropping to each input image.
Figure 14 shows examples of the transformations applied by the STN to randomly selected galaxies in all three redshift bins. As can be seen, the STN crops out most secondary galaxies present in the cutouts and helps the downstream CNN to focus on the galaxy of interest at the center.
To further validate the performance of the STN, we measured the amount of cropping applied by the STN for all galaxies in the low-z bin. We chose the lowest redshift bin for this test as it has the largest range of galaxy radii among the different redshift bins. After that, we sorted all the processed images based on the amount of cropping applied to each input image. In Figure 15, we show example images from our dataset with extremely high and extremely low values of applied crops. The s parameter shown in the lower left of each panel denotes what fraction of the input image was retained in the STN output; higher values of s denote that a larger fraction of the input image was retained in the output image produced by the STN (i.e., minimal cropping). Figure 15 demonstrates that, without us having to engineer this specifically, the STN correctly learns to apply the most aggressive crops to the smallest galaxies in our dataset, and the least aggressive crops to the largest galaxies. Thus, GaMPEN's STN learns to systematically crop out secondary galaxies in the cutouts and focus on the galaxy of interest at the center of the cutout. At the same time, the STN also correctly applies minimal cropping to the largest galaxies, making sure the entirety of these galaxies remains in the frame.

Comparing GaMPEN predictions to GALFIT predictions
Out of the 60,000 galaxies analyzed using GALFIT in §5, we use 80% as the training and validation sets. We use the remaining 20%, which the trained GaMPEN frameworks have never seen, to evaluate the accuracy of the predicted parameters. We refer to this as the "test set" henceforth.
In Figure 16, we show the coverage probabilities achieved by GaMPEN on the test set. Note that in §4.3.1, we tuned the dropout rate using the validation set, whereas the values in Figure 16 are calculated on the test set. In the ideal situation, they would perfectly mirror the confidence levels (e.g., 68.27% of the time, the true value would lie within 68.27% of the most probable volume of the predicted distribution). Clearly, the coverage probabilities achieved by GaMPEN are consistently close to the claimed confidence levels, both when averaged over the three output parameters, as well as for each parameter individually. The mean coverage probability never deviates by more than 4.5% from the claimed confidence interval, and when considered for each parameter individually, the coverage probability never deviates by more than 8.7% from the corresponding confidence interval. Additionally, we note that even for the case for which the coverage probabilities are most discrepant (68% flux confidence interval for the mid-z model), the uncertainties predicted by GaMPEN are in any case overestimates (i.e., conservative). If GaMPEN were used in a scenario that requires perfect alignment of coverage probabilities, users could employ techniques such as importance sampling (Kloek & van Dijk 1978) on the distributions predicted by GaMPEN. We note here that incorporating the covariances between the predicted parameters into our loss function was key to achieving simultaneous calibration of all three output variables.
Figure 15. (Left): Galaxies in the low-z bin with the lowest values of s (i.e., the most aggressive crops). (Right): Galaxies in the low-z bin with the highest values of s (i.e., the least aggressive crops). The s parameter denotes the fraction of the input image that was retained in the STN output. As can be seen, the STN correctly learns to apply the most aggressive crops to small galaxies, and the least aggressive crops to large galaxies.
Having shown above that the posteriors predicted by GaMPEN are well calibrated, we now investigate how close the modes (most probable values) of the predicted distributions are to the parameter values determined using light-profile fitting. Figure 17 shows the most probable values predicted by GaMPEN for the test set plotted against the values determined using GALFIT in hexagonal bins of roughly equal size. The number of galaxies is represented according to the colorbar on the right. Note that we use a logarithmic colorbar to visualize even small clusters of galaxies in this plane, down to 1 galaxy/bin. Across all three redshift bins, a large majority of all the galaxies are clustered around the line of equality, showing that the most probable values of the distributions predicted by GaMPEN closely track the values obtained using light-profile fitting. The scatter obtained for $L_B/L_T$ is larger compared to the other two output parameters, and we explore that in more detail later in this section.
In Figure 18, we show the residual distribution for GaMPEN's output parameters in all three redshift bins. We define the residual for each parameter as the difference between the most probable value predicted by GaMPEN and the value determined using light-profile fitting. The box in the upper left corner gives the mean ($\mu$), median ($\tilde{\mu}$), and standard deviation ($\sigma$) of each residual distribution. All nine distributions are normally distributed (verified using the Shapiro-Wilk test), and have $\mu \sim \tilde{\mu} \sim 0$. The $\sigma$ of each distribution also identifies the typical disagreement for each parameter (e.g., for the low-z bin, in 68.27% of cases, the predicted $L_B/L_T$ value is within ±0.17 of the value determined by light-profile fitting). The $R_e$ residuals, when converted to physical units, correspond to typical disagreements of 0.32 arcsec, 0.14 arcsec, and 0.14 arcsec for the low-, mid-, and high-z bins, respectively. The $L_B/L_T$ residuals are mostly constant across the three redshift bins, while the disagreement between GaMPEN and GALFIT predictions for $R_e$ and $F$ decreases slightly as we go from the low to the higher redshift bins. This could be driven by the fact that the HSC median seeing becomes better as we move from the g-band to the i-band (g-band: 0.79 arcsec; r-band: 0.75 arcsec; i-band: 0.61 arcsec). Additionally, lower redshift galaxies have preferentially more resolved structural features (e.g., spiral arms), which are not accounted for in our disk + bulge GALFIT decomposition pipeline. This could lead to a higher disagreement between the GALFIT and GaMPEN predictions.
Although Figures 17 and 18 indicate the overall agreement between GaMPEN and GALFIT, they do not reveal the dependence of this agreement on location in the parameter space. This is critical information as it identifies regions of parameter space where GaMPEN agrees especially well or badly with light-profile fitting results, so that future users can flag the reliability of predictions in these regions or use results only from certain regions of the parameter space for specific scientific analyses. Figure 19 shows the residuals for the three output parameters for the low-z bin plotted against the values predicted by GaMPEN. Note that in order to make the y-axis dimensionless for all three parameters, we plot the fractional $R_e$ residuals instead of absolute values. As in Figure 17, we have split the parameter space into hexagonal bins and used a logarithmic color scale to denote the number of galaxies in each bin. The trends between the residuals and different parameters are very similar across all three redshift bins. Thus, to keep the main text concise, we have shown the plot for the low-z bin here and shown the same plot for the mid- and high-z bins in Appendix F.
Figure 17. The most probable parameter values predicted by GaMPEN for all galaxies in the test set plotted against the values determined using GALFIT. Galaxies are plotted in hexagonal bins of roughly equal size, and the number of galaxies in each bin is represented according to the logarithmic colorbar to the right of each panel. The top, middle, and bottom rows show the results for the low-, mid-, and high-z bins, respectively. The dashed black y = x line represents the line of equality. Across all three redshift bins, values predicted by GaMPEN closely mirror the values obtained using light-profile fitting.
For most of the panels, the large majority of galaxies are clustered uniformly around the black dashed line, y = 0, which denotes the ideal case of perfectly recovered parameters (assuming the GALFIT parameters are correct). There are a few notable features on the top row, which depicts the $L_B/L_T$ residuals. In the top left panel, the $L_B/L_T$ residuals are highest near the limits of $L_B/L_T$. We noticed the same effect when testing GaMPEN with simulated galaxies in Ghosh et al. (2022), and refer to this as the "edge effect". For $L_B/L_T$ values near the edges (i.e., when the disk/bulge component completely dominates over the other component), precisely determining $L_B/L_T$ is challenging for GaMPEN (in fact, this is difficult for any image analysis algorithm). In some of these cases, GaMPEN assigns almost the entirety of the light to the dominant component, resulting in the streaks seen at the edges of the figure. Poor structural parameter determination for galaxies with $L_B/L_T$ < 0.2 and $L_B/L_T$ > 0.8 has also been independently observed in other studies using different algorithms (e.g., Euclid Collaboration et al. 2022; Häußler et al. 2022).
Figure 18. Distributions of residuals for all galaxies in the test set; specifically, the differences between the values predicted by GaMPEN and those obtained via light-profile fitting. The top, middle, and bottom rows show the results for the low-, mid-, and high-z bins, respectively. The boxes in the top-left corner of each panel show the mean ($\mu$), median ($\tilde{\mu}$), and standard deviation ($\sigma$) of each residual distribution. The $\sigma$ of each distribution identifies the typical disagreement for each parameter (e.g., for the low-z bin, in 68.27% of cases, the predicted magnitude is within ±0.41 of the value determined by light-profile fitting). The dashed red vertical line marks x = 0.
In order to mitigate this, GaMPEN users can choose to transform GaMPEN's quantitative predictions in the regions $L_B/L_T$ ≤ 0.2 and $L_B/L_T$ ≥ 0.8 (demarcated by shaded lines in Figure 19) to qualitative values such as "highly bulge-dominated" ($L_B/L_T$ ≥ 0.8) or "highly disk-dominated" ($L_B/L_T$ ≤ 0.2). We followed a similar procedure in Ghosh et al. (2022) and found the net accuracy of these labels to be ≳ 95%.
The top-middle panel of Figure 19 also shows that $L_B/L_T$ residuals are higher for galaxies with smaller $R_e$. In other words, GaMPEN and GALFIT systematically disagree more for galaxies with smaller sizes, and this effect becomes more pronounced as the sizes become comparable to the seeing of the HSC-Wide Survey (g-band: 0.79 arcsec).
To comparatively evaluate the accuracy of GaMPEN and GALFIT specifically for smaller galaxies, we ran our GALFIT pipeline on a subset of simulated galaxies with $R_e$ ≤ 2 arcsec. Thereafter, we compared these results to the predictions made by GaMPEN for the same galaxies. As shown in Appendix G, GaMPEN outperforms GALFIT for these smaller simulated galaxies. This provides preliminary evidence that GaMPEN's predictions on the smaller galaxies referred to in the previous paragraph are more accurate than those obtained using GALFIT. However, we would like to note that our simulated galaxies are semi-realistic and do not represent the full range of complexities present in real data. In a future publication, we will compare GaMPEN and GALFIT's performance on more realistic simulated galaxies generated using radiative transfer from hydrodynamical simulations.
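The conversion from quantitative predictions to qualitative labels near the edges amounts to a simple threshold rule. The function name below and the choice of returning `None` for intermediate values (where the quantitative prediction is kept) are our own conventions for this sketch; the thresholds are those quoted in the text.

```python
def morphology_label(bulge_to_total):
    """Qualitative label for edge values of L_B/L_T, or None in between."""
    if bulge_to_total >= 0.8:
        return "highly bulge-dominated"
    if bulge_to_total <= 0.2:
        return "highly disk-dominated"
    return None  # keep the quantitative prediction

for value in (0.05, 0.45, 0.93):
    print(value, morphology_label(value))
```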

Inspecting the Predicted Uncertainties
The primary advantage of a Bayesian ML framework like GaMPEN is its ability to predict the full posterior distributions of the output parameters instead of just point estimates. Thus, we would expect such a network to inherently produce wider distributions (i.e., larger uncertainties) in regions of the parameter space where residuals are higher.
Figure 20 shows the uncertainties for the three predicted parameters plotted against the predicted values for the low-z test set. We define the uncertainty predicted for each parameter as the width of the 68.27% confidence interval (i.e., the parameter interval that contains 68.27% of the most probable values of the predicted distribution; see Fig. 13). The y-axis of the middle row has been normalized so that all three panels show dimensionless fractional uncertainties. The distributions of uncertainties look very similar across all redshifts; thus, we have shown the uncertainty distributions for the mid- and high-z bins in Appendix F.
In Figure 19, we saw that GaMPEN's $L_B/L_T$ residuals are higher for lower values of $R_e$. Here, we see that GaMPEN accurately predicts higher $L_B/L_T$ uncertainties for lower values of $R_e$. This compensatory effect is what allows GaMPEN to achieve the calibrated coverage probabilities shown in Figure 16.
In the right column, we see that for all three parameters, the uncertainty in GaMPEN's predictions increases for fainter galaxies. This is in line with what we expect and had seen in Ghosh et al. (2022): morphological parameters for fainter galaxies are more difficult to constrain compared to brighter galaxies and thus should have higher uncertainties.
The top left panel of Figure 20 shows that GaMPEN is reasonably certain of its predicted bulge-to-total ratio across the full range of values but appears slightly more certain when $L_B/L_T$ ≤ 0.2 or $L_B/L_T$ ≥ 0.8. We had also seen the same effect in Ghosh et al. (2022) with simulated galaxies. We found that the smaller uncertainties at the limits corresponded to the single-component galaxies, while for the double-component galaxies, the edge effect is less pronounced (see Figure 15 of Ghosh et al. (2022)). Here, we are seeing the same effect: GaMPEN's uncertainties at the edges are systematically lower for galaxies that can be described completely by only a disk or bulge component. These lower uncertainties also contribute to the higher residuals near the edges of $L_B/L_T$ (as wider distributions would reduce the number of galaxies with high residuals at the edges). However, as can be seen from the top left panel of Figure 20, most of the galaxies with very low uncertainties lie in the regions $L_B/L_T$ ≤ 0.2 and $L_B/L_T$ ≥ 0.8, where we recommend transforming the quantitative $L_B/L_T$ predictions to qualitative labels, as outlined in §6.3.
Figure 20. [...] The $\sigma$ for each parameter is defined as the width of the 68.27% confidence interval. Note that we plot fractional uncertainties for the radius in order to make the y-axis dimensionless for all three rows. The line-shaded region in the top-left panel shows the region where we recommend transforming quantitative $L_B/L_T$ predictions to qualitative labels (see §6.3 for details).
The results shown in this section outline the primary advantage of using a Bayesian framework like GaMPEN: even in situations where the network is not perfectly accurate, it can predict the right level of precision, allowing its predictions to be reliable and well-calibrated.

Comparing GaMPEN's Uncertainty Estimates to Other Algorithms
As §6.3 and §6.4 demonstrate, GaMPEN predicts well-calibrated uncertainties on our HSC data. Previous studies (e.g., Haussler et al. 2007) have shown that analytical estimates of errors from traditional morphological analysis tools like GALFIT or GIM2D (Simard et al. 2002) are smaller than the true uncertainties by ≥ 70% for most galaxies.
Recently, Euclid Collaboration et al. (2022) reported coverage probabilities obtained by four different light-profile fitting tools, namely Galapagos-2 (Häußler et al. 2022), Morfometryka (Ferrari et al. 2015), ProFit (Robotham et al. 2017), and SourceXtractor++ (Bertin et al. 2020), on simulated Euclid data. The Euclid sample consisted of ∼1.5 million galaxies ranging from $I_E$ ∼ 15 to $I_E$ ∼ 30, simulated at 0.1″/pixel. The simulations included analytic Sérsic profiles with one and two components, as well as more realistic galaxies generated with neural networks. As we did not have access to the simulated Euclid dataset, we could not test GaMPEN's performance on the same data. Instead, we compared GaMPEN's coverage probabilities for the HSC data set to those reported for the Euclid simulations. Although the Euclid simulations differ significantly from our HSC sample, coverage probabilities reflect the ability of the predicted uncertainty to capture the true uncertainty and are not necessarily correlated with accuracy (which often varies across different data sets). Moreover, GaMPEN's predicted uncertainties can be tuned for specific data sets, as shown in Figure 10, which should only improve the GaMPEN outcome. Therefore, the results presented by Euclid Collaboration et al. (2022) allow us to perform a preliminary comparison of GaMPEN's uncertainty prediction to that of other algorithms.
Figure 21 shows the 68.27% coverage probabilities achieved by GaMPEN on the HSC data compared to values for the four light-profile fitting codes on the simulated Euclid dataset (averaging over the different structural parameters). When considering all galaxies, GaMPEN's uncertainties are at least ∼15–25% better calibrated than those of the other algorithms. The differences are much larger for brighter galaxies, suggesting that the uncertainties predicted by these algorithms depend primarily on the flux of the object. In contrast, GaMPEN's uncertainty predictions remain well-calibrated throughout and are better by as much as ∼60% for the brightest galaxies. The severe under-prediction of uncertainties seems to hold true even for Bayesian codes like ProFit. It is likely that GaMPEN's robust implementation of aleatoric and epistemic uncertainties, along with a carefully selected transfer learning set spanning the entire magnitude range (see Fig. 9), allows it to predict well-calibrated uncertainties across a wide range of magnitudes.
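To make the notion of a coverage probability concrete, the following is a minimal sketch rather than GaMPEN's actual implementation. It assumes that each galaxy comes with a list of posterior samples and a reference value, and it uses equal-tailed percentile intervals as a simplification (the intervals in this work are derived from kernel density estimates). The function names and the toy data are hypothetical.

```python
import random


def central_interval(samples, level):
    """Equal-tailed interval holding roughly `level` of the samples (illustrative)."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (n - 1))]
    upper = ordered[int(round((1.0 - tail) * (n - 1)))]
    return lower, upper


def coverage_probability(posteriors, true_values, level=0.6827):
    """Fraction of objects whose reference value lies inside the predicted interval."""
    hits = 0
    for samples, truth in zip(posteriors, true_values):
        lower, upper = central_interval(samples, level)
        if lower <= truth <= upper:
            hits += 1
    return hits / len(true_values)


# Toy data (assumption): Gaussian posteriors with the correct width, and
# posteriors whose width is underestimated by a factor of two.
rng = random.Random(0)
n_galaxies, n_samples, sigma = 1000, 200, 0.1
truths, calibrated, overconfident = [], [], []
for _ in range(n_galaxies):
    truth = rng.uniform(0.0, 1.0)
    center = truth + rng.gauss(0.0, sigma)  # noisy point estimate
    truths.append(truth)
    calibrated.append([rng.gauss(center, sigma) for _ in range(n_samples)])
    overconfident.append([rng.gauss(center, sigma / 2) for _ in range(n_samples)])

cov_good = coverage_probability(calibrated, truths)
cov_bad = coverage_probability(overconfident, truths)
print(f"calibrated: {cov_good:.3f}, overconfident: {cov_bad:.3f}")
```

In this toy setup, a well-calibrated predictor yields a coverage close to the nominal 68.27%, whereas a predictor that underestimates its uncertainties covers the reference value noticeably less often.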

COMPARING OUR PREDICTIONS TO OTHER CATALOGS
In §6.3, we compared GaMPEN's predictions to the values determined using our light-profile fitting pipeline. In order to further assess the reliability of GaMPEN's predictions, we now compare our predictions to two other morphological catalogs.
In Figure 22, we compare GaMPEN's predictions to the fits of Simard et al. (2011). Using an angular cross-match diameter of 0.15″, we cross-matched our entire sample to that of Simard et al. (2011) to obtain an overlapping sample of ∼20,000 galaxies. A large majority of these galaxies have m < 19.5 and z < 0.2. We have included error bars in this figure to show the typical width of predicted distributions for different regions of the parameter space. We binned the x-axis of each parameter into bins of equal width and plotted the median y-value in each bin as a point, with the error bars showing the average 68.27% and 95.45% confidence intervals for all galaxies in that bin.
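The binning procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the plotting code used for Figure 22: the function name `binned_medians`, the choice of bin edges spanning the full range of x, and the treatment of empty bins (they are skipped) are all hypothetical.

```python
import statistics


def binned_medians(x, y, n_bins):
    """Median of y in equal-width bins of x; returns (bin center, median) pairs."""
    x_min, x_max = min(x), max(x)
    if x_max == x_min:
        raise ValueError("x must span a non-zero range")
    width = (x_max - x_min) / n_bins
    bins = [[] for _ in range(n_bins)]
    for xi, yi in zip(x, y):
        # The largest x value falls exactly on the upper edge; keep it in the last bin.
        index = min(int((xi - x_min) / width), n_bins - 1)
        bins[index].append(yi)
    return [
        (x_min + (k + 0.5) * width, statistics.median(values))
        for k, values in enumerate(bins)
        if values  # skip empty bins
    ]


# Toy usage with a perfectly linear relation y = 2x.
x_values = list(range(11))
y_values = [2 * value for value in x_values]
trend = binned_medians(x_values, y_values, n_bins=2)
print(trend)
```

The per-bin error bars would be obtained analogously, by averaging the widths of the predicted confidence intervals of the galaxies that fall in each bin.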
Despite the differences in SDSS and HSC imaging quality (with regard to pixel scale and seeing), our results largely agree with those of Simard et al. (2011) within the ranges of predicted uncertainties. Using Spearman's rank correlation test, we obtain correlation coefficients of 0.87, 0.96, and 0.98 for $L_B/L_T$, $R_e$, and $F$, respectively. It is also interesting to note that the widths of the error bars account for almost the entire scatter in the distribution of points, showing the robustness of GaMPEN's uncertainty estimates.
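For readers unfamiliar with the statistic, Spearman's rank correlation coefficient is the Pearson correlation of the ranks of the two variables. The sketch below is a minimal standard-library illustration; the helper names are hypothetical, and it is not the code used to obtain the coefficients quoted above.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    return cov / (var_x * var_y) ** 0.5


# Toy usage: any monotonically increasing relation gives a coefficient of 1.
rho = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0])
print(rho)
```

Because the statistic depends only on ranks, it is insensitive to monotonic rescalings of either catalog's measurements, which makes it a reasonable choice when two pipelines may differ by a smooth systematic offset.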
In the $L_B/L_T$ panel, there is a cluster of points near $L_B/L_T = 0.8$, and this is due to the edge effect described in §6.3. GaMPEN and Simard et al. (2011)'s $R_e$ predictions become discrepant (even when accounting for uncertainty) for $R_e$ ∼ 1″. This is not unexpected given that these $R_e$ values are almost 0.4″ smaller than the median SDSS seeing, depicted by the dash-dotted vertical line in the middle panel of Figure 22. The slight offset seen in the magnitude estimates is due to the differences in SDSS and HSC imaging quality, data-reduction pipeline, and filters (Kawanomoto et al. 2018), given that this discrepancy disappears when we compare our results to another catalog that uses HSC imaging (see below). Note that some of the scatter in Figure 22 can also be attributed to the fact that Simard et al. (2011) use a morphology-determination pipeline that is significantly different from GaMPEN.
In Figure 23, we compare our results to those of Kawinwanichakij et al. (2021), wherein the authors fitted single Sérsic light profiles to $1.5 \times 10^6$ HSC i-band galaxies using LENSTRONOMY (Birrer & Amara 2018), a multipurpose open-source gravitational lens modeling Python package. Following Figure 22, we also show mean error bars in different bins of the fitted parameter in this figure. To compare results in the same band, we cross-matched our high-z sample with that of Kawinwanichakij et al. (2021) to obtain a sample of ∼200,000 galaxies with 0.50 < z ≤ 0.75 and i ≤ 23. Note that the magnitude discrepancy present in Figure 22 now disappears. GaMPEN and LENSTRONOMY radius measurements are also in agreement within the limit of uncertainties, with the mean trend deviating slightly from the y = x diagonal at $R_e$ values lower than the HSC median seeing, depicted by the dash-dotted vertical line in Figure 23. It is also important to note that while we define $R_e$ as the radius that contains 50% of the total light of the combined bulge + disk profile, Kawinwanichakij et al. (2021) define it as the semi-major axis of the ellipse that contains half of the total flux of the best-fitting Sérsic model. The correlation coefficients for $R_e$ and magnitude are 0.87 and 0.96, respectively.
Since Kawinwanichakij et al. (2021) used single-component fits, we cannot compare our $L_B/L_T$ predictions with their catalog. However, the left panel of Figure 23 shows the correlation between the fitted Sérsic index (n) and the measured bulge-to-total light ratio, and can be used empirically to convert one measure into the other. We find that, in line with expectations, higher Sérsic indices generally correspond to higher values of $L_B/L_T$. A large majority of galaxies with n ≤ 1.5 have most of their light in the disk component. Galaxies with n ≥ 3 have a large fraction of their light in the bulge component, although this fraction may vary from as low as 40% to as high as 95%. Note that these trends largely agree with what was reported by Simmons & Urry (2008).

CONCLUSIONS & DISCUSSION
In this paper, we used GaMPEN, a Bayesian machine learning framework, to estimate morphological parameters ($L_B/L_T$, $R_e$, $F$) and associated uncertainties for ∼8 million galaxies in the HSC Wide survey with z ≤ 0.75 and m ≤ 23. Our catalog is one of the largest morphological catalogs and is the first publicly available structural parameter catalog for HSC galaxies. It provides an order of magnitude more galaxies compared to the current state-of-the-art disk+bulge decomposition catalog of Simard et al. (2011) while probing four magnitudes deeper and having a higher redshift threshold. This represents an important step forward in our capability to quantify the shapes and sizes of galaxies and uncertainties therein.
We also demonstrated that by first training on simulations of galaxies and then utilizing transfer learning with real data, we are able to train GaMPEN using < 1% of our total dataset. This is an important demonstration that ML frameworks can be used to measure galaxy properties in new surveys, which do not have already-classified large training sets readily available. Our implemented two-step process provides a new framework that can be easily used for upcoming large imaging surveys like the Vera Rubin Observatory Legacy Survey of Space and Time, Euclid, and the Nancy Grace Roman Space Telescope.
We showed that GaMPEN's STN is adept at automatically cropping input frames and successfully removes secondary objects present in the frame for most input cutouts. Note that the trained STN framework can be detached from the rest of GaMPEN and used as a pre-processing step in any image analysis pipeline.
By comparing GaMPEN's predictions to values obtained using light-profile fitting, we demonstrated that the posteriors predicted by GaMPEN are well-calibrated and that GaMPEN can accurately recover morphological parameters. We note that the full computation of Bayesian posteriors in GaMPEN represents a significant improvement over estimates of errors from traditional morphological analysis tools like GALFIT, Galapagos, or ProFit. We demonstrated that while the uncertainties predicted by GaMPEN are well-calibrated with ≲ 5% deviation across a wide range of magnitudes, traditional light-profile fitting algorithms underestimate uncertainties by ∼15–60% depending on the flux of the galaxy being analyzed. These well-calibrated uncertainties will allow us to use GaMPEN for the derivation of robust scaling relations (e.g., Bernardi et al. 2013; van der Wel et al. 2014) as well as for tests of theoretical models using morphology (e.g., Schawinski et al. 2014).
GaMPEN's residuals increase for smaller galaxies, but GaMPEN correctly accounts for that by predicting correspondingly higher uncertainties for these galaxies. GaMPEN's $L_B/L_T$ residuals are also high when the bulge or disk component completely dominates over the other component. GaMPEN's quantitative $L_B/L_T$ predictions for these galaxies ($L_B/L_T \leq 0.2$ or $L_B/L_T \geq 0.8$) can be transformed into highly accurate qualitative labels with ≳ 95% accuracy. We did not detect a decline in GaMPEN's performance for fainter galaxies, and GaMPEN's residual patterns were also fairly consistent across all three redshift bins used in this study.
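As a rough illustration of how quantitative predictions at the extremes could be mapped to qualitative labels, consider the sketch below. The thresholds of 0.2 and 0.8 follow the ranges quoted above, but the function name and the label strings are assumptions made for this example; the labeling scheme actually recommended is described in §6.3.

```python
def qualitative_label(lb_lt, low=0.2, high=0.8):
    """Map a predicted bulge-to-total light ratio to a coarse label.

    The label strings are hypothetical placeholders, not catalog values.
    """
    if not 0.0 <= lb_lt <= 1.0:
        raise ValueError("bulge-to-total light ratio must lie in [0, 1]")
    if lb_lt <= low:
        return "highly disk-dominated"
    if lb_lt >= high:
        return "highly bulge-dominated"
    # In the intermediate regime the quantitative prediction is kept as is.
    return "intermediate"


labels = [qualitative_label(value) for value in (0.05, 0.5, 0.93)]
print(labels)
```

The design choice here is simply to trade resolution for reliability in the regions where the quantitative residuals are largest.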
In order to assess the reliability of our catalog using a completely independent analysis, we compared GaMPEN's predictions to those of Simard et al. (2011) and Kawinwanichakij et al. (2021). Within the limit of uncertainties predicted by GaMPEN, our results agree well with these two catalogs. We noticed a slight discrepancy in the flux values determined using SDSS and HSC imaging, which can be attributed to the differences in imaging quality, data-reduction pipeline, and filters between the two surveys. Comparing our $L_B/L_T$ predictions to Kawinwanichakij et al. (2021)'s measured Sérsic indices also allowed us to study the correlation between these two parameters, which can be used empirically to convert between them. We found that although galaxies with n ≥ 3 have a large fraction of their light in the bulge component, this fraction can be anywhere between 40% and 95%.
Similar to previous structural parameter catalogs (e.g., Simard et al. 2011; Tarsitano et al. 2018), we did not explicitly exclude merging galaxies from this catalog. We are currently working to incorporate a prediction flag within GaMPEN that will flag merging/irregular galaxies and highly blended sources so that they can be analyzed separately; we will demonstrate this in future publications. However, we would like to note that for the redshift ranges and magnitudes considered in this study, blending and mergers only constitute a limited fraction of the total sample. Using the m_(g|r|i)_blendedness_flag and m_(g|r|i)_blendedness_abs_flux parameters available in HSC PDR2, we estimate that ∼10%, ∼4%, and ∼3% of galaxies in the low-z, mid-z, and high-z redshift bins, respectively, have nearby sources within the parent object footprint that can affect structural parameter measurements. For an extended description of these flags, we refer the interested reader to Bosch et al. (2018) and note that users can choose to exclude these galaxies from their analysis by setting different thresholds for these two parameters.
With this work, we are publicly releasing: a) GaMPEN's source code and trained models, along with documentation and tutorials; b) a catalog of morphological parameters for the ∼8 million galaxies in our sample along with robust estimates of uncertainties; c) the full posterior distributions for all ∼8 million galaxies. All elements of the public data release are summarized in Appendix A.
Although we used GaMPEN here to predict $L_B/L_T$, $R_e$, and $F$, it can be used to predict any morphological parameter when trained appropriately. Additionally, although GaMPEN was used here on single-band images, we have tested that both the STN and CNN modules in GaMPEN can handle an arbitrary number of channels, with each channel being a different band. We defer a detailed evaluation of GaMPEN's performance on multiband images to future work. GaMPEN can also be used to morphologically analyze galaxies from other ground- and space-based observatories. However, in order to apply GaMPEN to these data sets, one would need to perform appropriate transfer learning using data from the target dataset. Finally, to give readers an estimate of GaMPEN's runtime, we note that once trained, it takes GaMPEN ∼1 millisecond on a GPU and ∼150 milliseconds on a CPU to perform a single forward pass on an input galaxy. These numbers, of course, change slightly based on the specifics of the hardware being used. However, as these numbers show, even with access to ∼1000 CPUs or ∼10 GPUs, GaMPEN can estimate full Bayesian posteriors for millions of galaxies in just a few days. Therefore, GaMPEN is fully ready for the large samples expected soon from Rubin-LSST and Euclid. Determinations of structural parameters along with robust uncertainties for these samples will allow us to characterize both morphologies and other relevant properties traced by morphology (e.g., merger history) as a function of cosmic time, mass, and the environment with unmatched statistical significance.
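The runtime estimate above can be checked with a back-of-the-envelope calculation. The sketch below combines the per-pass timings quoted in the text with the 500 forward passes per galaxy used for posterior inference (see the caption of Figure 6). It assumes perfect parallel scaling across devices and ignores data loading and other overheads, so the result should be read as an order-of-magnitude estimate only; the function and variable names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def wall_clock_days(n_galaxies, passes_per_galaxy, seconds_per_pass, n_devices):
    """Idealized wall-clock time in days, assuming perfect parallel scaling."""
    total_seconds = n_galaxies * passes_per_galaxy * seconds_per_pass
    return total_seconds / n_devices / SECONDS_PER_DAY


N_GALAXIES = 8_000_000      # approximate size of the sample
PASSES_PER_GALAXY = 500     # forward passes used to build each posterior

gpu_days = wall_clock_days(N_GALAXIES, PASSES_PER_GALAXY, 1e-3, n_devices=10)
cpu_days = wall_clock_days(N_GALAXIES, PASSES_PER_GALAXY, 150e-3, n_devices=1000)
print(f"10 GPUs: {gpu_days:.1f} days; 1000 CPUs: {cpu_days:.1f} days")
```

Under these idealized assumptions, both configurations finish in under a week, which is consistent with the "few days" quoted above.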
Culture, Sports, Science and Technology (MEXT), the Japan Society for the Promotion of Science (JSPS), Japan Science and Technology Agency (JST), the Toray Science Foundation, NAOJ, Kavli IPMU, KEK, ASIAA, and Princeton University.
This paper makes use of software developed for the Large Synoptic Survey Telescope. We thank the LSST Project for making their code available as free software at http://dm.lsst.org.
The Pan-STARRS1 Surveys (PS1) have been made possible through contributions of the Institute for Astronomy, the University of Hawaii, the Pan-STARRS Project Office, the Max-Planck Society and its participating institutes, the Max Planck Institute for Astronomy, Heidelberg and the Max Planck Institute for Extraterrestrial Physics, Garching, The Johns Hopkins University, Durham University, the University of Edinburgh, Queen's University Belfast, the Harvard-Smithsonian Center for Astrophysics, the Las Cumbres Observatory Global Telescope Network Incorporated, the National Central University of Taiwan, the Space Telescope Science Institute, the National Aeronautics and Space Administration under Grant No. NNX08AR22G issued through the Planetary Science Division of the NASA Science Mission Directorate, the National Science Foundation under Grant No. AST-1238877, the University of Maryland, and Eotvos Lorand University (ELTE) and the Los Alamos National Laboratory.
Based, in part, on data collected at the Subaru Telescope and retrieved from the HSC data archive system, which is operated by Subaru Telescope and Astronomy Data Center at National Astronomical Observatory of Japan.
As outlined in §2, we used the cleanflags_any parameter available as part of HSC PDR2 to exclude objects flagged by the HSC pipeline as having any significant imaging issues. The various triggers that contribute to the above flag, as well as their prevalence among the galaxies that we excluded from the analysis, are shown in Figure 24. As can be seen, ∼80% of the triggers are caused by cosmic-ray hits/interpolated pixels.
In addition, the full SQL queries used to download the low-, mid-, and high-z data are shown in Listings 1, 2, and 3. Note that after downloading the data using these queries, we further excluded data based on the flags referred to in the previous paragraph, as well as the quality of photometric redshift estimates. For an extended description, please refer to §2. As noted in §8, users may additionally choose to use the various blendedness flags available in HSC PDR2 to further exclude merging/blended galaxies.
Figure 24. We exclude ∼20% of our downloaded galaxies due to different flags being triggered. The distribution of various flags that contribute to galaxies being excluded from our sample is shown in this figure. As can be seen, the large majority of exclusions are due to cosmic-ray hits (and hence, interpolated pixels).

To give readers a visual understanding of how GaMPEN's different architectural components are organized, Figure 25 shows a schematic diagram outlining the structure of both the STN and CNN in GaMPEN. For complete details on individual layers in GaMPEN, please refer to Ghosh et al. (2022).

D. ADDITIONAL DETAILS ABOUT TRAINED GaMPEN MODELS
As noted in §4, the training procedure of GaMPEN involves the tuning of various hyper-parameters (e.g., learning rate, batch size, etc.). These hyper-parameters are chosen based on the combination of values that results in the best performance on the validation data set. The final chosen hyper-parameters for both the simulation-trained GaMPEN models and those fine-tuned on real data are shown in Table 4. Please refer to §4 for more details on how we train these models.

E. IDENTIFYING ISSUES WITH OUR LIGHT PROFILE FITTING PIPELINE
We described in §5 a semi-automated pipeline that we used to determine the structural parameters for ∼60,000 galaxies using light-profile fitting. After performing this analysis, we visually inspected the fits of a randomly selected sub-sample of ∼300 galaxies from each redshift bin to assess the quality of the fit. The process of visually inspecting ∼900 galaxies helped us to identify three failure modes of our fitting pipeline:
1. For some galaxies, GALFIT assigned an extremely small axis ratio to one of the components, leading to an unphysical lopsided bulge/disk.
2. For some galaxies, the centroids of the two fitted components were too far away from each other.
3. Some galaxies had residuals that were too large.
Note that for some galaxies, multiple failure modes were applicable. Examples of each of these failure modes are shown in Figure 26. We use a combination of different cuts on the fitted dataset to get rid of these failure modes (see §5 for an extended discussion).
Figures 27 and 28 show the distribution of residuals (difference between GaMPEN and GALFIT predictions) for the mid- and high-z bins.

G. COMPARING GaMPEN AND GALFIT'S PERFORMANCE ON SMALLER SIMULATED GALAXIES
As shown in §6.3, GaMPEN and GALFIT systematically disagree more for galaxies with smaller sizes. To ascertain their relative performance, specifically for smaller galaxies, we ran our GALFIT pipeline (described in §5) on ∼5000 simulated galaxies from each redshift bin with $R_e$ ≤ 2″. These galaxies were chosen randomly from the testing set of GaMPEN; thus, none of them were used to train GaMPEN. Thereafter, we compared the results of this fitting procedure to the predictions made by GaMPEN on the same galaxies. The residuals obtained for both GaMPEN and GALFIT are shown in Figure 31.
The typical error for each of the parameters is given by $\tilde{\mu} \pm \sigma$, where $\tilde{\mu}$ and $\sigma$ are the median and standard deviation of the residual distribution, respectively. As shown in Figure 31, GaMPEN outperforms GALFIT for all three parameters across all redshift bins. This provides preliminary evidence that GaMPEN's predictions on the smaller galaxies referred to in §6.3 are more accurate than those obtained using GALFIT.
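The summary statistic described above can be computed as in the following minimal sketch. It assumes that residuals are defined as the true value minus the predicted value (as in the caption of Figure 8) and that the population standard deviation is used; both the function name and these conventions are assumptions for illustration.

```python
import statistics


def typical_error(true_values, predicted_values):
    """Return (median, standard deviation) of the residuals true - predicted."""
    residuals = [t - p for t, p in zip(true_values, predicted_values)]
    return statistics.median(residuals), statistics.pstdev(residuals)


# Toy usage with four simulated effective radii (arcsec) and their predictions.
true_re = [1.0, 2.0, 3.0, 4.0]
predicted_re = [0.9, 2.1, 3.0, 3.8]
median_residual, sigma_residual = typical_error(true_re, predicted_re)
print(f"typical error: {median_residual:.3f} +/- {sigma_residual:.3f}")
```

Using the median rather than the mean makes the quoted offset robust against the occasional catastrophic fit, while the standard deviation still reflects the overall spread.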

Figure 1. The limiting absolute magnitudes probed by the Hyper Suprime-Cam (HSC) Wide Survey and Sloan Digital Sky Survey (SDSS) at different redshifts.

Figure 2. The filter used for each redshift bin is shown along with the wavelength range sampled by each filter. The blue line shows where rest-frame 450 nm emission falls for redshifts labeled on the x-axis. As this figure shows, the chosen filters allow us to consistently perform morphology determination in the rest-frame g-band.

Figure 4. Four randomly chosen galaxy cutouts are shown here for each redshift bin, with the object of interest at the center of each cutout. Note that most of these cutouts have secondary objects in the frame, which can often cause ML algorithms to produce spurious classifications. GaMPEN uses a Spatial Transformer Network to crop most secondary objects out of the frame (see §3).

Figure 5. Two stages of simulating an HSC galaxy. (Left): A randomly chosen two-dimensional light profile generated by GalSim. (Right): The same image after PSF convolution and noise addition. The white pixels represent (small) negative values that arise from the process of noise addition.

Figure 6. Diagram outlining the training (left) and posterior inference (right) phases of the GaMPEN workflow. Training consists of feeding galaxies (with pre-determined parameter values) through the STN and CNN modules, minimizing the loss function using Stochastic Gradient Descent. During this process, we re-scale the variables described in the text, and return them to the original variable space during inference. After the STN+CNN networks are trained, the posterior inference step consists of 500 forward passes with dropout enabled for each galaxy image. We draw a sample from the predicted multivariate Gaussian distribution during each forward pass, and the collection of these samples gives us the predicted posterior distribution.

Figure 7. Diagram outlining the different stages of training GaMPEN. We first train GaMPEN using simulated light profiles described in §2.2. Thereafter, we fine-tune the simulation-trained framework using 0.5% of our real data sample, for which we pre-determined the morphological parameters using light-profile fitting, as described in §5. Finally, we process all the ∼8 million galaxies in our dataset through the trained GaMPEN framework to obtain estimates of their morphological parameters and associated uncertainties.

Figure 8. Histograms of residuals for simulated galaxies (in the test set) across all three redshift bins. We define the residuals as the difference between the true value and the most probable value predicted by GaMPEN. The dashed vertical line represents x = 0, denoting cases with perfectly recovered parameter values. The mean ($\mu$), median ($\tilde{\mu}$), and standard deviation ($\sigma$) of each residual distribution are listed in each panel.

Figure 9. Magnitude and redshift distributions for all galaxies in each redshift bin, plotted along with the galaxies selected for transfer learning (before and after applying quality cuts, as described in §5). Note that we plot density on the y-axis, not the number of samples. The total number of galaxies used for transfer learning is ∼0.5% of all the galaxies in our dataset. The relative density of some magnitude bins is higher than others (e.g., 18 < m ≤ 20 for low-z) because they span a smaller range while having roughly the same number of galaxies as the other bins.

Figure 11. Steps used in our light-profile fitting pipeline to determine morphological parameters, for two representative galaxies in each redshift bin. From left to right we show the input image, the mask generated by Source Extractor, the model generated by GALFIT, and the residuals. Note that since we do not explicitly model any Fourier bending modes or coordinate rotations, we expect features like spiral arms to show up in the residuals, as depicted in the second row.

Figure 12. Morphological parameters determined for HSC-imaged galaxies using our light-profile fitting pipeline, versus morphological parameters for the same galaxies determined by Simard et al. (2011) based on SDSS imaging. The black dashed diagonal corresponds to perfect agreement. The vertical line in the middle panel shows the median SDSS g-band seeing.

Figure 13. Examples of predicted posterior distributions for two randomly chosen galaxies from each redshift bin. The blue shaded histograms show the predictions from GaMPEN, and the solid blue lines show the associated probability distribution functions estimated by kernel density estimation. These are used to calculate the confidence intervals shown in the figure with pink, yellow, and green shading. The mode (solid red line) shows the most probable value of each morphological parameter. As expected, in most cases, the GALFIT-ed value (dashed black line) lies within the 68.27% confidence interval.

Figure 14. Examples of the transformation applied by the STN to two randomly selected galaxies from each redshift bin. The top row shows the input galaxy images, and the bottom row shows the corresponding output from the STN. The numbers in the top-left yellow boxes match each output image to its input image. As can be seen, the STN learns to crop most secondary objects present in the input frame.

Figure 15. (Left): Galaxies in the low-z bin with the lowest values of s (i.e., the most aggressive crops). (Right): Galaxies in the low-z bin with the highest values of s (i.e., the least aggressive crops). The s parameter denotes the fraction of the input image that was retained in the STN output. As can be seen, the STN correctly learns to apply the most aggressive crops to small galaxies and the least aggressive crops to large galaxies.

Figure 16. Percentile coverage probabilities achieved on the test set, shown separately for each redshift bin. The leftmost set of bars in each panel shows the coverage probabilities when averaged over the three output parameters, and the right three sets of bars show the coverage probabilities for each parameter individually. The mean coverage probability never deviates by more than 4.5% from the claimed confidence interval, and when considered for each parameter separately, the coverage probability never deviates by more than 8.7%. This demonstrates that GaMPEN produces well-calibrated and accurate uncertainties.

Figure 19. Residuals of the output parameters (difference between GaMPEN and GALFIT predictions) plotted against the values predicted by GaMPEN for all galaxies in the low-z test set. To make the y-axis dimensionless for all three parameters, we plot the fractional $R_e$ residuals instead of absolute values. This figure allows us to assign quality labels to GaMPEN's predictions based on the output values (e.g., flagging regions of the parameter space with high levels of disagreement, as shown by the line-shaded region in the top-left panel). See §6.3 for details. The equivalent figures for the mid- and high-z bins are shown in Appendix F.

Figure 20. Uncertainties predicted by GaMPEN for each parameter plotted against the predicted values for the low-z test set. The σ for each parameter is defined as the width of the 68.27% confidence interval. Note that we plot fractional uncertainties for the radius in order to make the y-axis dimensionless for all three rows. The line-shaded region in the top-left panel shows the region where we recommend transforming quantitative $L_B/L_T$ predictions to qualitative labels (see §6.3 for details).

Figure 21. Percentile coverage probabilities for the 68.27% confidence interval obtained by GaMPEN on our HSC sample compared to coverage probabilities obtained by various light-profile fitting algorithms on simulated Euclid data (from Euclid Collaboration et al. 2022). The rightmost set of bars shows the values calculated on the entire dataset, while the other sets display values calculated on sub-samples of galaxies with specific magnitude ranges (AB mag, shown on the x-axis). Compared to light-profile fitting tools, the uncertainties predicted by GaMPEN are better calibrated by ∼15–25% overall and by as much as ∼60% for the brightest galaxies.

Figure 22. GaMPEN predictions plotted against values estimated by Simard et al. (2011) for a cross-matched sample of ∼20,000 galaxies with z < 0.2 and m < 19. The density of points in each histogram is represented according to the logarithmic colorbar on the right. The black dots show the median y values in bins of equal width along the x-axis, with the error bars depicting the average 68.27% and 95.45% confidence intervals predicted by GaMPEN in that bin. The dash-dotted vertical line in the middle panel shows the median SDSS g-band seeing.

Figure 23. GaMPEN predictions plotted against values estimated by Kawinwanichakij et al. (2021) for a cross-matched sample of ∼200,000 galaxies with 0.5 < z ≤ 0.75 and m ≤ 23. Similar to Figure 22, we show the median y-values and associated error bars in bins of equal width along the x-axis. The dash-dotted vertical line in the middle panel shows the median HSC i-band seeing.
High-z Sample SQL query

C. ADDITIONAL DETAILS ABOUT GaMPEN

Figure 25. A schematic diagram of the Galaxy Morphology Posterior Estimation Network. GaMPEN's architecture consists of a downstream CNN module preceded by an upstream STN module. The CNN module empowers GaMPEN to estimate posterior distributions of galaxy morphology parameters. The upstream STN module trains without any extra supervision and learns to apply appropriate cropping transformations to the input image before passing it on to the CNN (for more details about these modules, see §3). The numbers below each layer refer to the number of filters/neurons in each layer. The yellow boxes inside the convolutional layers show the kernel, and the number beside each box refers to the corresponding kernel size. Only one kernel is shown per set of convolutional layers; all other layers in the set have kernels of the same size. Conv2D and ReLU refer to Convolutional Layers and Rectified Linear Units, respectively.

Figure 26. The different failure modes of the semi-automated light-profile fitting code described in §5. From left to right, we show the input image, the mask generated by Source Extractor, the model generated by GALFIT, and the residuals.

Figures 29 and 30 show the uncertainties predicted by GaMPEN for each parameter plotted against the predicted values for the mid- and high-z bins.

Figure 27. Residuals of the output parameters (difference between GaMPEN and GALFIT predictions) plotted against the values predicted by GaMPEN for all galaxies in the mid-z test set. See §6.3 for details.

Figure 29. Uncertainties predicted by GaMPEN for each parameter plotted against the predicted values for the mid-z test set. The σ for each parameter is defined as the width of the 68.27% confidence interval. The line-shaded region in the top-left panel shows the region where we recommend transforming quantitative $L_B/L_T$ predictions to qualitative labels. See §6.4 for details.

Table 1. Data Characteristics

Figure 3. Redshift (top) and magnitude (bottom) distributions for the ∼8 million galaxies used in this study. We used spectroscopic redshifts when available and high-quality photometric redshifts otherwise. The spectroscopic completeness of each sub-sample is shown in Table 1.

Table 2. Parameter Ranges of Simulated Galaxies

Table 4. Tuned Values of Various Hyper-parameters