Revealing the Galaxy-Halo Connection Through Machine Learning

Understanding the connections between galaxy stellar mass, star formation rate, and dark matter halo mass represents a key goal of the theory of galaxy formation. Cosmological simulations that include hydrodynamics, physical treatments of star formation, feedback from supernovae, and the radiative transfer of ionizing photons can capture the processes relevant for establishing these connections. The complexity of these physics can prove difficult to disentangle and obfuscate how mass-dependent trends in the galaxy population originate. Here, we train a machine learning method called Explainable Boosting Machines (EBMs) to infer how the stellar mass and star formation rate of nearly 6 million galaxies simulated by the Cosmic Reionization on Computers (CROC) project depend on the physical properties of halo mass, the peak circular velocity of the galaxy during its formation history $v_\mathrm{peak}$, cosmic environment, and redshift. The resulting EBM models reveal the relative importance of these properties in setting galaxy stellar mass and star formation rate, with $v_\mathrm{peak}$ providing the most dominant contribution. Environmental properties provide substantial improvements for modeling the stellar mass and star formation rate in only $\lesssim10\%$ of the simulated galaxies. We also provide alternative formulations of EBM models that enable low-resolution simulations, which cannot track the interior structure of dark matter halos, to predict the stellar mass and star formation rate of galaxies computed by high-resolution simulations with detailed baryonic physics.


INTRODUCTION
Numerical simulation enables theoretical models of galaxy formation to include detailed physical models for baryonic processes. Simulations can capture the physics of cooling, supernova feedback, radiative feedback and ionization, and the role of dynamics simultaneously while tracking the growth of cosmological structure formation (e.g., Schaye et al. 2015;Pillepich et al. 2018;Davé et al. 2019). The simulated galaxy populations that result from these models reproduce observed stellar mass sequences such as the main sequence of star-forming galaxies (Brinchmann et al. 2004;Noeske et al. 2007) or the red sequence of quiescent galaxies ). The quest for realism in modeling these observed trends has also added substantial complexity, such that understanding which physical properties of a galaxy most influence its stellar mass and star formation rate Corresponding author: rhausen@ucsc.edu, brant@ucsc.edu can prove challenging. Many theoretical frameworks to describe these relations have been developed (e.g., Wechsler & Tinker 2018), including halo occupation distribution models (e.g., Jing et al. 1998), subhalo abundance matching (Vale & Ostriker 2004;Conroy et al. 2006), and semi-analytic models (for a review, see Somerville & Davé 2015). The complex physics encoded by these models and simulations can be difficult to interpret, and the relative contribution of baryonic feedback, dark matter halo formation, and environment in setting galaxy properties remains challenging to disentangle.
This complexity extends to cosmological models of galaxy formation in the reionization epoch. To capture the distribution of sizes of ionized regions with converged simulations (Iliev et al. 2014) and the largest observed features, such as dark gaps (Zhu et al. 2021), the volume of reionization simulations should extend to a least several hundred megaparsecs. Modeling such large volumes in a single simulation while maintaining the spatial resolution needed to include the complex physics of the current state-of-the-art projects, such as Cosmic Reionization on Computers (CROC Gnedin 2014), THESAN (Kannan et al. 2021), or Cosmic Dawn (CoDa, Ocvirk et al. 2016, remains computationally infeasible. Instead, we desire an intermediate approach where large volumes are simulated and the physics of galaxy formation are implemented with a approximate model that recovers the mean trends for galaxy baryonic properties predicted by more detailed calculations. With this goal in mind, a model for reionization sources that encapsulates the results of projects like CROC in a simple module is the first necessary step for deploying lower resolution simulations with much larger (L ∼ 500cMpc) simulation volumes. If the stellar mass and star formation rates of ionizing sources can be predicted from their dark matter halo properties and environment, then we can account for the ionizing photons produced by these sources in large-box simulations of the reionization process without resolving the baryonic physics in detail.
This work employs a machine learning method called Explainable Boosting Machines (EBMs Lou et al. 2013) to infer how stellar mass M and star formation rate SFR depend on the physical parameters θ of a host galaxy. In this work, we use the galaxy populations from the CROC simulations to provide our training and test data that populate samples in the multidimensional parameter space of M -SFR-θ. For the additional parameters θ, we use a wide range of physical characteristics measured for galaxies in CROC including the virial mass M vir , redshift z, environmental properties averaged on a length scale R, and the maximum peak circular velocity v peak . We can then use this approximate machine learning-based EBM model for galaxy formation as a basis for future development to incorporate the CROC galaxy population as sources in lower resolution, large-volume reionization simulations.
EBMs represent a form of Generalized Additive Models (Hastie & Tibshirani 1986, GAMs) where the dependencies of a target quantity, such as M or SFR, on each physical parameter θ i are encapsulated by feature functions of one parameter (e.g., f i (θ i )) or interaction functions of two parameters (e.g., f ij (θ i , θ j )). An EBM model is trained to fit these functions from a provided multidimensional dataset. The predicted value of the target quantity given the parameters (e.g., γ(M |θ)) is then a sum of the functions f i and f ij . EBM models are often described as interpretable because the magnitudes of the functions f i and f ij directly indicate the relative importance of θ in determining the target quantity. If a given parameter θ i is unimportant for determining the target quantity, the EBM will find f i → 0. A formal defintion of the EBM is provided in Section 2.1.
Previous works have applied machine learning models to infer connections between simulated galaxy properties. Lovell et al. (2021) use a tree-based learning method called Extremely Randomized Trees to map baryon information to dark matter halos in the EAGLE simulations. Xu et al. (2021) train a Random Forest to predict the number of central and satellite galaxies in dark matter halos in the Millennium simulation. Machado Poletti Valle et al. (2021) used an XG-Boost model to predict gas shapes in dark matter halos in the IllustrisTNG simulations. Bluck et al. (2022) used Random Forest classifiers to study quenching mechanisms in observations, semianalytical models, and cosmological simulations. Piotrowska et al. (2022) also used Random Forest classifiers to examine how supermassive black hole feedback quenches central galaxies in the EAGLE, Illustris, and IllustrisTNG simulations. Our approach complements these prior works by studying the detailed connection between halo and environmental properties, star formation rate, and stellar mass in a model that can be directly implemented in future largevolume cosmological simulations with limited spatial resolution.
The paper is organized as follows. In §2 we review the EBM methodology, define our training dataset and procedure, and introduce the evaluation metrics used to assess the performance of the model. In §3 we present the average contribution of each parameter to the target quantities, the best-fit feature and interaction functions, and the performance of the model in determining the distributions of stellar mass and star formation rate as a function of halo virial mass. We then explore in §4 methods for constructing composite EBM models to recover the stellar mass and star formation rate of simulated galaxies that only use instantaneous halo virial properties and environmental measures (i.e., excluding v peak ). We discuss our results in §5, and summarize them and conclude in §6. The Appendicies of the paper provide detailed model results for the EBM for M ( §A), the mathematical formalism of the composite EBM model ( §B), and detailed composite EBM model results for star formation rate ( §C) and stellar mass ( §D).

METHODS
To infer the connection between M , SFR, and other physical properties of simulated galaxies, we apply EBM models to the CROC simulated galaxy catalogs. In §2.1, we define the EBM model. We select our model parameters and describe the simulated galaxy catalog used to train the model in §2.2. The training procedure is outlined in §2.3.

Explainable Boosting Machines
Explainable Boosting Machine (Lou et al. 2013, EBM) models provide a fitted representation of the relationship between the target quantities y and the parameters θ. EBMs are an extension of Generalized Additive Models (Hastie & Tibshirani 1986, GAMs), which represent target quantities y as the sum of learned univariate functions f i (θ i ) that depend on only one parameter θ i . EBMs extend GAMs by including both univariate functions f i (θ i ) and bivariate functions f ij (θ i , θ j ) that represent dependencies on pairs of features (θ i , θ j ) beyond the dependence of the target quantity on either feature independently. Both EBMs and GAMs are forms of regression where the feature functions f i and f ij can be quite general.
The EBM aims to encode the average dependence of a target quantity y on the parameters θ. Mathematically, an EBM can therefore be represented as where γ(y|θ) is the predicted value of the target quantity y given n p parameters θ ∈ R n from the dataset. We will refer to learned parameter β y as the baseline value of the target quanity y. Though f i y and f ij y can be any interpretable function (e.g., linear regression, splines, etc.), Lou et al. (2012) found that gradient boosted trees (Friedman 2001) work best in practice. Using gradient boosted trees, the functions f i y and f ij y will be piece-wise one-and two-dimensional functions, respectively. By expressing the dependence of y on θ directly through the functions f i y and f ij y , EBMs are interpretable and decomposeable. Further, after training is complete the learned tree-based functions f i y and f ij y can be formulated as look-up tables for performant inference.

Simulated Galaxy Catalog Training Set
To engineer an EBM that describes the connection between simulated galaxy properties, their host dark matter halos, and features of the extrinsic environment, we turn to established observations and theoretical modeling to inform our choices for constructing a training dataset.
At fixed redshift and halo mass, average galaxy masses of central galaxies differ from satellite galaxies. Halos grow through hierarchical merging, in which small halos merge to form larger halos. As subhalos merge into larger halos, tidal heating and stripping reduce the mass of the more extended dark matter halo, while the satellite galaxy mass remains largely unaffected. For this reason, galaxy mass often correlates better with halo properties at the time of accretion than the current halo mass (e.g. Conroy et al. 2006;Vale & Ostriker 2006;Moster et al. 2010;Reddick et al. 2013). In particular, SHAM models find that using the halo peak circular velocity, v peak , to assign galaxy masss and/or luminosity best reproduces observed galaxy clustering (e.g. Reddick et al. 2013;Hearin et al. 2013;Lehmann et al. 2017).
Star formation rates correlate tightly with galaxy masses, and increase with redshift at fixed stellar mass (e.g. Noeske et al. 2007;Stark et al. 2009;Bouwens et al. 2012). While these trends hold on average, there is a distinct bimodal distribution in the star formation rates of galaxies, corresponding to star-forming and quiescent populations (e.g. Balogh et al. 2004). The observed fraction of quiescent galaxies increases as the Universe evolves (e.g. Tomczak et al. 2014), with the interpretation that some mechanism turns off star formation in galaxies. Many quenching mechanisms have been proposed, including secular/mass quenching (e.g. Kauffmann et al. 2004;Contini et al. 2020) and environmental quenching (e.g. Davies et al. 2016;Trussler et al. 2020). Which of these processes dominate may vary with redshift (Kalita et al. 2021).
Overdense environments may cause environmental quenching, by providing close pairs that can suppress gas accretion ("strangulation"), removing gas through rampressure stripping, or disrupting by interactions with other galaxies ("harassment"). Environment thereby influences star formation rates, and low-mass satellite galaxies are typically the most prone to environmental quenching (e.g. Davies et al. 2019).
Given these established trends, galaxy mass and star formation rate may depend on redshift, halo mass, peak circular velocity, and environmental properties. We will therefore select corresponding parameters from the CROC simulated galaxy catalogs to provide our dataset for training the EBM models. Details of the CROC simulations can be found in Gnedin (2014). At a range of redshifts z during the simulation, the computational grid and particle properties are written to disk. These simulation snapshots are post-processed to identify virialized galaxies, as described in Zhu et al. (2020), and the properties of the simulated galaxies are recorded in catalogs. Merger trees are used to identify the properties of simulated galaxies across redshift.
For our target quantities y, in this work we will model stellar mass M [h −1 M ] and star formation rate SFR [M yr −1 ]. The parameters θ selected from the simulated catalog include both intrinsic properties of galaxies and extrinsic properties set by the large scale environment. For intrinsic properties we include the galaxy virial mass M vir [h −1 M ], the redshift z at which the simulated galaxy properties were measured, and the maximum peak circular velocity v peak [km s −1 ] measured over the formation history of each galaxy. The extrinsic properties used are defined by a length scale R measured relative to each simulated galaxy. We follow convention and substitute R with a numerical value that indicates a number of comoving Mpc (e.g., σ 8 is the rms density fluctuations measured in spheres of radius of R = 8Mpc). We compute an environmental density ρ 1 ≡ 1 + ∆ 1 , where ∆ 1 is the dimensionless matter overdensity measured within 1 Mpc. We include an environmental gas temperature T 1 [K] averaged on 1 Mpc scales. From each simulated galaxy we also find the virial mass M max,0.1 of the most massive neighboring halo within 100 kpc. We then define the mass ratio Υ 0.1 ≡ 1 + M max,0.1 /M vir The simulated galaxy catalogs include roughly 8,426,327 objects covering a wide range of halo masses, stellar masses, star formation rates, redshifts, and other extrinsic properties. From the catalog of simulated galaxies, objects with a SFR < 0.001 M yr −1 were excluded owing to resolution effects artificially limiting their star formation rates. Af-ter this culling, the catalog contained 5,950,357 objects that formed our dataset. At this stage, we constructed the training and test datasets from our catalog using the parameter vector θ = [M vir , z, v peak , ρ 1 , T 1 , Υ 0.1 ] to model the target quantities y = [M , SFR]. We use k-fold cross-validation (Hastie et al. 2001) with k = 5, such that the test/training split is 20%/80% for each k-folding.

Training Procedure
The calculations presented in this paper leverage the Inter-pretML (Nori et al. 2019) implementation of EBMs, using the hyperparameters in Table 1. These InterpretML hyperparameters control the number of bins in the piece-wise f i y and f ij y functions (Q max , Q max,2D ), the distribution of bins across the fitted domain (B), and the learning rate of the optimization scheme (R l ). The Nori et al. (2019) implementation trains an EBM in two phases. First, the univariate functions are optimized using a gradient boosting approach applied round-robin on each parameter, as detailed in Lou et al. (2012). After the univariate functions have converged, the interaction terms are computed and the bivariate functions are optimized according the GA2M/FAST algorithms detailed in Lou et al. (2013). During training we use k-fold cross-validation, and merge the training and test datasets for the final performance evaluation of the model.
We evaluate the EBM performance using the mean absolute error (MAE), a variance metric r 2 , and the total outlier fraction ζ k . These statistics provide measures of how well the EBM reproduces the mean trends in the target quantities y as a function of the features θ, the width of the distribution about the mean trends in the training data, and the tails of that distribution.
We calculate the MAE of the model applied to the simulated galaxy sample as where N is the number of objects, y i is the true value of the target quantity for object i, andŷ i is the predicted value from the model for object i. We compute the r 2 ∈ [0, 1] variance metric as which provides a measure of how well the model captures the variance in the data relative to the mean y, with r 2 = 1 reflecting a perfect reproduction of the distribution of y in the training dataset. Note that the feature and interaction functions f i y and f ij y have a finite range, and thus not all values y i can be represented by Equation 1 even when the input parameters θ vary about the mean trends with halo mass or environment. Hence, even for high quality EBM models r 2 < 1 and we expect outliers. The ζ k metric represents the fraction of the total dataset that lies outside the range of predicted values, {ŷ}, as a function of one of the features θ k . We define Reported are values for the variance metric r 2 , the outlier fraction ζ, and the mean absolute error (MAE). Uncertainties are computed from the variation among the k-fold trials.
where the index i runs over the total number of samples N and g k,i (y i , θ k,i ) is a function that returns 1 if the true target quantity for object i lies outside the predicted range, i.e., y i ∈ {ŷ}. In practice, we compute the outlier fraction for feature k = log 10 M vir , and use 2D histograms of (y i , θ k,i ) and (ŷ i , θ k,i ) to calculate g k,i . In Table 2 we present the evaluation metrics for our EBM model fully trained on the simulated galaxy catalog. For the EBM model for star formation rate (y = SFR), we find a MAE ∼ 0.14 log 10 M yr −1 , a variance metric r 2 ∼ 0.9, and an outlier fraction of < 3%. For the EBM model for stellar mass (y = M ), we report a MAE ∼ 0.19 log 10 M yr −1 , a variance metric r 2 ∼ 0.88, and an outlier fraction of < 1%. The good performance of the EBM models in these metrics reflects the abililty of the EBMs to capture both the mean trends and full distributions of the target quantities y = [M , SFR] in the training set given the parameters θ = [M vir , z, v peak , ρ 1 , T 1 , Υ 0.1 ]. We describe the detailed model results in §3. eters θ, the relationships between the target quantities and the parameters can be analyzed. Below, we provide several analyses that quantify how the target quantities relate to the parameters and illustrate the performance of the EBM for our astrophysical applications.

Average Contribution
A key advantage of using EBM models over "black box" models (e.g., neural networks) is their clear interpretability (see §2.1). The contribution of each parameter θ i to the model of the target quantity y is provided by the output functions f i y and f ij y . Since these functions are vectors or two-dimensional matrices with a number of elements equal to the number of bins n b in the piece-wise function (see Table 1), a summary scalar quantity for each feature function is helpful for comparing their relative importance. We can define the average contributionf i y that provides the average absolute value of f i y or f ij y , with the average computed over the number of bins n b and weighted by the number of samples in each bin. Mathematically, we can writē where f is the feature function being averaged (f i y or f ij y from Equation 1), θ i,j is value of the parameter θ i in the jth bin, and N j is the number of samples in bin j. Intuitively, the average contributionf i y summarizes the importance of each parameter θ i for determining the target quantities when averaged over the samples in the final, merged dataset.
The average contributions of each feature (f i y ) or combination of features (f ij y ) are computed from the EBM. In each case, we rank order the features by decreasing average contribution and focus on the seven features or feature combinations with the largest average contribution. In each case the most important feature has an average contribution more than an order of magnitude larger than the seventh-ranked feature.
3.1.1. EBM Model Targeting Star Formation Rate SFR Figure 1 shows the average contribution of the top seven features for the EBM model targeting star formation rate log 10 SFR. In decreasing order, the seven most important features in determining SFR are maximum peak circular velocity v peak , virial mass M vir , environmental density ρ 1 , redshift z, environmental temperature T 1 , mass ratio of nearby halos Υ 0.1 , and the interaction between M vir and Υ 0.1 . The numerical values for the average contributions are provided in Table 3. The baseline value of SFR is β log 10 SFR = −2.1151 [log 10 M yr −1 ], typical of halos with log 10 M vir ∼ 9. The average contribution of v peak and M vir are quite similar, providing ∆ log 10 SFR > 0.2 on average, but their interaction term is small with f (log 10 v peak , log 10 M vir ) 0.01. Therefore peak circular velocity and virial mass provide important contributions to determining the star formation rate, and the univariate dependence of the SFR on these properties accounts for most of their contribution. At the few-percent level, environmental density, redshift, environmental gas temperature, and the presence of nearby massive halos also contribute. In order of decreasing importance, these features include peak circular velocity v peak , virial mass Mvir, environmental density ρ1, redshift z, environmental temperature T1, the mass ratio of nearby halos Υ0.1, and the interaction between virial mass Mvir and Υ0.1. Average contribution is calculated using the average of the absolute value of the feature functions weighted by the number of samples in each bin (see Equation 5).
The feature functions f i y for each feature are plotted in Figure 2. The functions indicate that there are positive correlations between the star formation rate log 10 SFR and either the peak circular velocity v peak , virial mass M vir , or environmental density ρ 1 . The star formation rate increases with increasing environmental temperature log 10 T 1 , but near T 1 ≈ 10 4 K the univariate function shows an enhancement just as hydrogen becomes mostly neutral and a deficit near the temperature at which it becomes ionized. Star formation rate increases with decreasing redshift over the range z ∼ 5 − 15, becoming more efficient after reionization.
The interaction functions f ij y learned by the EBM γ(SFR|θ) targeting the star formation rate SFR are plotted as "heat maps" in Figure 3. Most interaction functions do not contribute significantly to the star formation rate, and change the star formation rate by ∆ log 10 SFR 0.05. However, halos with low neighboring halo mass ratios Υ 0.1 and large peak circular velocity v peak have their star formation f (log 10 Mvir, log 10 Υ0.1) 0.0052 Table 3. Summary of the EBM model trained to predict SFR. The first entry, β log 10 SFR , is the baseline value learned model (see Section 2.1). The next seven entries are the average contributions of the most important feature functions listed in descending order (see Equation 5).
rate enhanced by ∆ log 10 SFR ≈ 0.15. Rephrased, locally dominant halos with large peak circular velocity show en-hanced star formation. Such enhancements likely owe to recent merger activity. While Equation 1 represents a complex, multidimensional manifold that provides the SFR as a function of the parameters θ, the distributions of simulated and predicted SFR as a function of a single parameter provide a graphical summary of the EBM model performance. Figure 4 shows the simulated and predicted SFR as a function of virial mass log 10 M vir , and we will refer to this figure as the model summary. Shown in this model summary are the distributions of SFR in the CROC simulated galaxy catalogs with virial mass and the SFR predicted by the EBM model γ(SFR|θ) using the parameters θ measured for each simulated galaxy. The EBM model captures roughly 97% of the simulated distribution of SFR with virial mass. The EBM model is highly predictive of the simulated connection between SFR and the intrinsic and extrinsic properties θ.
Given the combined complexity of the average contribution measures, univariate feature functions, and bivariate interaction functions, in what follows we will show the summary figure for other EBM models in the main text. For completeness, the average contribution, feature function, and Redshift z, log 10 v peak 0.02 Figure 3. Most important learned interaction functions f ij y for the EBM model γ(SFR|θ) targeting the star formation rate SFR, as a function of their parameter pairs. Each panel shows the contribution of the bivariate interaction terms, normalized such that the color map ranges between plus or minus the maximum of the norm of each function ||f ||max. Light blue areas indicate regions of joint parameter space where the feature interactions contribute positively to the star formation rate, while dark blue areas indicate regions with negative contributions. The table lists ||f ||max for the interaction functions, each with units log 10 M yr −1 . In absolute terms, the largest interaction occurs for halos with large peak circular velocity v peak and no large neighboring halos (Υ0.1 ≈ 0). The other interaction functions are relatively weak, and contribute changes to log SFR 0.05. interaction function figures for each model will be presented in the Appendices.

EBM Model Targeting Stellar Mass M
An EBM model γ(M |θ) targeting stellar mass M using the properties θ can be constructed through simple retraining. Using the simulated galaxy catalogs from CROC, we retrain the EBM to model M against θ. We report the average contribution, univariate feature functions, and bivariate interaction functions for γ(M |θ) in Appendix A. For reference, the baseline value of M is β log 10 M = 5.9629 [log 10 M yr −1 ] (see Table 6 in the Appendix), typical of halos with log 10 M vir /M ∼ 9. Figure 5 shows the model summary for the EBM model γ(M |θ). The EBM model provides an excellent representation of the distribution of stellar masses for the CROC simulated galaxy catalog. As the lower right panel of Figure 5 indicates, the γ(M |θ) model results in few out-liers for the CROC simulated galaxies and has an outlier fraction of 1%. Given the galaxy properties θ = [M vir , z, v peak , ρ 1 , T 1 , Υ 0.1 ], the distribution of stellar masses for CROC simulated galaxies can be recovered to 99% accuracy.

COMPOSITE EBMS FOR RESTRICTED PARAMETER SETS
The EBM models γ(SFR|θ) and γ(M |θ) presented in §3.1.1 and §3.1.2 are constructed using the parameter set θ = [M vir , z, v peak , ρ 1 , T 1 , Υ 0.1 ]. Our results show that the full distribution of SFR and stellar mass in the simulated CROC galaxy catalogs can be recovered accurately with only ≈ 1 − 3% outliers. These EBM models can therefore be applied to cosmological simulations using the parameters θ measured from simulated galaxy catalogs to recover the distribution of SFR and stellar mass computed by CROC.
The parameters θ include the peak circular velocity v peak , which requires both time-dependent tracking of formation histories for individual galaxies and high spatial resolution to capture the peak of the rotation curve for each object. As a result, as expressed above the models γ(SFR|θ) and γ(M |θ) cannot be applied directly to cosmological simulations with low spatial resolution or without merger trees to capture formation histories.
Instead of fitting EBM models using the full parameter set θ, consider the construction of an EBM model using the restricted parameter set θ = [M vir , z, ρ 1 , T 1 , Υ 0.1 ] that does not include v peak . The parameters θ can all be measured directly in cosmological simulations with sufficient resolution to capture individual galaxy-mass halos without the need to track merger trees. The EBM models γ(SFR|θ ) and γ(M |θ ) using the restricted parameter set θ perform substantially less well than the models γ(SFR|θ) and γ(M |θ) trained on the full parameter set θ that includes v peak . With the restricted parameter set θ , the EBM model shows 7.6% outliers when targeting SFR and 2.8% when targeting M . Comparing with the outlier fractions reported in Table 2 for the full parameter set including v peak , the EBM model trained on the restricted dataset has degraded its performance by a factor of ∼ 2 − 3.
To ameliorate the poorer performance of the EBM models trained on restricted parameter sets, we use a Composite EBM (CEBM) model. Given a target quantity y and a parameter set θ , we fit a base EBM γ(y|θ ) in the same manner as fitting the EBMs γ(SFR|θ) or γ(M |θ). We construct a dataset from the galaxies whose y values lie outside the predictions from γ(y|θ ), and then fit an outlier EBM δ(y|θ ) to these discrepant samples. We then weight the base and outlier EBMs to construct the CEBM model Γ(M |θ ) using a classifier EBM φ y (θ ). Instead of fitting the change in star formation rate or stellar mass at a given sample in θ , the classifier EBM fits the log odds that a given sample in θ is an outlier. We then define φ y (θ ) to be the sigmoid of these log odds, such that φ y (θ ) ∈ [0, 1]. The CEBM can then be written as We describe the CEBM approach in more detail in Appendix B, and provide information on the CEBMs Γ(SFR|θ ) and Γ(M |θ ) in Appendices C and D. Table 4 lists the evaluation metrics for the training of CEBM models targeting SFR and stellar mass without using v peak . The outlier fraction has improved to ≈ 5% for CEBM model Γ(SFR|θ ) and to 2% for Γ(M |θ ). The average parameter contributions and baseline value β log 10 SFR from Γ(SFR|θ ) are provided in Table 5 and for the CEBM targeting stellar mass in Table 7. The univariate feature functions and bivariate interaction functions for the CEBM models Γ(SFR|θ ) and Γ(M |θ ) are provided in Appendices C and D. Figure 6 shows the model summary for the CEBM targeting SFR, and Figure 7 shows the model summary for the CEBM targeting stellar mass. As both models demon-  Table 4. Training results for CEBM models for SFR and M using k-fold cross validation. See Section 2.3 for more information on the training process. Reported are values for the variance metric r 2 , the outlier fraction ζ, and the mean absolute error (MAE). Uncertainties are computed from the variation among the k-fold trials.
f (log 10 Mvir, log 10 Υ0.1) 0.0056 Table 5. Average contribution to the CEBM model Γ(SFR|θ ) trained to predict SFR from the parameter set θ . The first entry, β log 10 SFR , is the learned baseline of the model. The next seven entries are the feature functions with the highest average contribution listed in descending order. The average contribution is calculated using the average of the absolute value of the base EBM function values weighted by the number of samples in each bin and the output of the classification EBM for each sample (see Appendix B.2 for more details).
strate, the CEBM model accurately recovers the distribution of star formation rate and stellar mass in the CROC simulated galaxy sample. Between the models, the outlier fraction is only ≈ 2 − 5% despite using the restricted set of parameters θ that does not include v peak or any time-dependent tracking of individual systems.
5. DISCUSSION Explainable Boosting Machine (EBM) models provide a method to statistically infer relationships present in highdimensional data. Given their statistical nature, EBM models remain ignorant of the physics that generate the connection between star formation rate, stellar mass, and the properties of dark matter halos that host galaxies. Nonetheless, given the results of detailed physical modeling in the form of simulated galaxy catalogs from cosmological simulations, the EBM correctly identifies halo mass and maximum peak circular velocity as the most important halo properties for determining SFR and M (e.g., Figure 1). The EBM correctly infers that SFR and M increase with increasing halo mass or v peak , and the EBM univariate feature functions correctly identify the gas temperature at which star formation efficiency changes. To the extent that the physical connection between galaxy and halo properties are recorded in statistical relationships, the EBM models effectively recover some fraction of those relations.
EBM models also provide a means to implement a "subgrid" prescription for galaxy formation based on the properties of halos and their environments. The EBM models γ(SFR|θ) and γ(M |θ) capture better than 97% of the SFR and M distributions measured for simulated galaxies in the CROC simulations. The stellar masses and star formation rates of galaxies in CROC could be accurately recov- Using the CEBM model trained on the restricted parameter set θ = [M vir , z, ρ 1 , T 1 , Υ 0.1 ], ≈ 95 − 98% of the distribution of SFR and M of the CROC galaxies is recovered. One advantage of this parameter set is that the spatial resolution in the simulations required to compute them is less demanding than for v peak . A simulation with coarser resolution than CROC, such that the details of the star formation and feedback processes cannot be resolved, may still leverage the CEBM models Γ(SFR|θ ) and Γ(M |θ ) to model the star formation rate and stellar masses in dark matter halos. Further, the quantities θ used to train the CEBM models are measured at distinct redshifts such that no merger trees are required to recover accurately the CROC SFR and M distributions from halo and environmental properties. We note that for both the EBM and CEBM models the outlier fractions not well captured by the model are roughtly percent-level or less in the SFR or M distributions, and we expect that corre-sponding inaccuracies induced in, e.g., the ionizing photon budget or topology of reionization will be minimal.
By editing the dataset and retraining, the impact of environment on the performance of the EBM models can be estimated. Relative to γ(SFR|θ) and γ(M |θ) that use the full dataset θ including all environmental parameters, EBM models trained only on maximum peak circular velocity v peak , halo virial mass M vir , and redshift z have an outlier fraction increased by only ∼ 1% when modeling M and ∼ 10% when modeling SFR. Further, removing v peak and training only on [M vir , z] substantially degrades the model performance, and the outlier fractions increase to ∼ 20% when modeling M and ∼ 40% when modeling SFR. The importance of including v peak in the training dataset is much larger than the importance of accounting for the environmental measures selected in this analysis.
The EBM models enable an approximate translation of the galaxy formation model from one simulation to another. Provided the parameter sets θ or θ can be measured in both simulations, an EBM can recover the connection between SFR, stellar masses, halo properties, and environment from the training simulation and then be used to instill those relations in a different simulation. Since the θ parameter set does not require very high spatial resolution to capture, the net results for SFR and stellar mass from a high resolution simulation accurately tracking detailed baryonic physics can be translated into a simulation with resolution insufficient to capture those physics directly. In future work, we plan to transfer the CROC baryonic galaxy formation model into Cholla cosmological simulations (e.g., Villasenor et al. 2021a,b) via the EBM models presented here. Such a transferred model could be used to build models of feedback from galaxy formation on resolved scales that incorporate the regulatory effects of feedback on small-scale star formation.
Lastly, the ability of the EBM models to recover the SFR and M distributions using only halo and environmental properties allows for the rapid replacement of galaxy formation models based on EBMs. Models can be trained on the simulated galaxy catalogs from a variety of expensive, high-resolution training simulations including a wide range of physics. These EBM models can then be used interchangeably as effective galaxy formation models in the target simulations, and can also be modified posteriori to allow a broad parameter search or correct the inaccuracies of the training simulation. Such an approach could reduce the sensitivity of conclusions about, e.g., the reionization process on the detailed SFR and M distributions as multiple EBM models for these properties could be trained and implemented in the target simulations.
6. SUMMARY A complex interplay of physical processes gives rise to the distribution of star formation rates (SFRs) and stellar masses M of galaxies over cosmic time. Cosmological simulations provide powerful methods for modeling these physical processes, but the connection between SFR, M , and other galaxy properties can be obfuscated by complex-ity. Leveraging machine learning techniques, we use a variation of the Generalized Additive Model (Hastie & Tibshirani 1986) called Explainable Boosting Machines (EBM Nori et al. 2019) to infer the dependence of SFR and M in the Cosmic Reionization on Computers (CROC) simulations (Gnedin 2014) on dark matter halo properties including virial mass M vir , peak maximum circular velocity v peak , redshift, environmental density, environmental gas temperature, and the mass of neighboring halos. Our findings include: • SFR and M primarily depend on M vir and v peak , followed by redshift, environmental density, and environmental gas temperature.
• When including M vir and v peak in the parameter set used to train the EBM, the model recovers better than 97% of the distribution of M or SFR with virial mass M vir in the CROC simulations.
• If the model fit excludes v peak , the fraction of outliers in the CROC data relative to the predicted model distribution increases to 7.6% for SFR and 2.8% for M .
• To ameliorate the degradation of the model performance when excluding v peak , we define a composite EBM model comprised of a weighted sum of the base EBM model fit to main trend of SFR and M with the halo properties and a second EBM model to fit the outliers not represented in the base EBM. The weighting coefficients are themselves determined by an EBM model fit.
• The composite EBM model improves the performance to ≈ 95 − 98% accuracy in the distribution of SFR or M with virial mass, even when excluding v peak measurements from the training dataset.
The EBM models quantify the relative importance of halo properties like virial mass and maximum peak circular velocity for determining the stellar mass and star formation rate of the galaxy it hosts. Through these models, the physics of baryonic galaxy formation can be connected to the properties of dark matter halos and enable galaxy formation to be implemented as a "sub-grid" prescription in dark matteronly simulations or hydrodynamical simulations that do not resolve the small scale details of star formation and feedback.  In order of decreasing importance, these features include peak circular velocity, virial mass, redshift, environmental density, environmental temperature, the mass ratio of nearby halos, and the interaction between redshift and peak circular velocity. Peak circular velocity is about 50% more important than virial mass, which in turn is roughly a factor of two more important than redshift. The other features and interactions contribute to stellar mass at the 0.1 dex level. For reference the numerical values for the average contributions are provided in Table 6. ). The features with the largest contributon are v peak and Mvir, followed by redshift z, environmental density ρ1, environmental temperature T1, and mass ratio of nearby halos Υ0.1. The interaction with the largest average contribution involves [z,v peak ].

A.2. Feature Functions
The univariate functions determined by the EBM targeting stellar mass M are shown in Figure 9. Stellar mass increases with increasing peak circular velocity, virial mass, environmental density, and neighboring halo mass ratio. Stellar mass increases with decreasing redshift. As with star formation rate, the stellar mass increases with increasing environmental temperature T 1 , with a sharp enhancement near the temperature where hydrogen becomes neutral and a sharp deficit near where hydrogen ionizes. , for the EBM γ(M |θ) trained to predict M . Areas highlighted in orange indicate portions of the function that contribute positively to the predicted M and areas in red contribute negatively. Stellar mass increases with peak circular velocity and virial mass, increases with decreasing redshift, and increases with environmental density. Temperature correlates positively with stellar mass, with a strong feature near T1 ≈ 10 4 K where hydrogen ionizes. Stellar mass also increases with the mass ratio of neighboring halos.

A.3. Interaction Functions
The bivariate interaction functions f ij y (see Equation 1) learned by the EBM when targeting stellar mass M are plotted as heat maps in Figure 10. On average most interaction functions do not contribute significantly to galaxy stellar mass, but there are regions of parameter space where the interaction functions are important. For instance, halos with low environmental temperatures and high environmental densities have suppressed stellar mass. Large virial mass halos with small neighboring halo mass ratios log 10 Υ 0.1 , indicating halos that dominate their local environment, have stellar mass enhanced by ≈ 0.3 dex. This effect exceeds the maximum univariate contribution of log 10 Υ 0.1 alone. The deficit of stellar mass at environmental temperatures where hydrogen is becoming ionized is increased at high redshifts. B. COMPOSITE EBM The composite EBM (CEBM) models we present consist of a base EBM model trained to recover the main trend γ(y|θ ) of the targeted property y with the input parameters θ , an outlier EBM model that captures the outlying values of y not recovered by γ(y|θ ), and a classification EBM model φ y (θ ) that interpolates between them (see §4). Given a CEBM, we wish to construct analogs of the average contribution, feature functions, and interaction functions determined for a single EBM. We define these quantities for the CEBM function in §B.2 and B.3 below.

B.1. CEBM Feature and Interaction Functions
The feature functions of a single EBM are univariate and indicate directly how the expectation value of the targeted quantity depends on each parameter θ i ∈ θ. With a CEBM comprised of a weighted sum of two base EBMs, we define the analog of the feature function to be the weighted sum of the base EBM feature functions. We can write that where is the Hadamard or element-wise product operation and the sum is over the number of samples N . The quantity f i y is the vector of the individual EBM feature functions f i y . While the base EBM feature functions are individually univariate, by weighting the sum of these feature functions with the classifier EBM the resulting feature function analog in Equation B1 is not univariate.
The interaction functionsf ij y are defined as in Equation B1 but with the vector of the individual EBM interaction functions f ij y subsituted for f i y . While the interaction functions for a single EBM are bivariate, the CEBM interaction functions are not bivariate.

B.2. CEBM Average Contribution
The average contribution of each feature in a CEBM can be defined in a manner analogous to the average contribution computed for a single EBM (Equation 5). The CEBM average contribution can be written as wheref is either the CEBM feature functionf i y or the CEBM interaction functionf ij y . Equation B2 characterizes how important the parameter θ i is for modeling the target quantity y.

B.3. Visualizing CEBM Feature and Interaction Functions
The feature and interaction functionsf i y andf ij y are not univariate or bivariate by design, which allows them to model the outlier distribution about the base EBM model γ(y|θ ). To visualize the feature and interaction functions for CEBM models in a manner similar to the univariate feature and bivariate interaction functions for single EBM, we can average the values of f i y and f ij y . For the feature function averaged over N samples, consider n b bins along the θ i direction, with central values θ i,b and bin widths ∆θ i,b . The bin-averaged CEBM feature and interaction functions are then where θ j,i is the ith parameter of the jth sample θ j and the function α( C. COMPOSITE EBM MODEL FOR STAR FORMATION RATE The CEBM model Γ(SFR|θ ) for the star formation rate consists of a base EBM γ(SFR|θ ), a residual EBM δ(SFR|θ ) that attempts to capture the outlying values of SFR not recovered by γ(SFR|θ ), and the classifier EBM φ SFR (θ ). For each of these individual EBMs that form the CEBM model, we plot the average contribution, feature functions, and interaction functions. Figure 11 shows the average contribution, feature functions, and interaction functions for the EBM model γ(SFR|θ ) that forms the base of the CEBM model. The differences between γ(SFR|θ) and γ(SFR|θ ) reflect the additional information provided by the maximum peak circular velocity v peak . Without access to v peak , the base EBM γ(SFR|θ ) upweightsf (M vir ) such that its importance roughly equals the combined importance of M vir and v peak in determining γ(SFR|θ). The average contribution of ρ 1 , T 1 , z, Υ 0.1 , and (M vir , Υ 0.1 ) are similar between the models. The additional interaction term in the top seven average contributions is (z, ρ 1 ), with a percent-level contribution to SFR relative to M vir . The feature functions for γ(SFR|θ ) have shapes similar to the feature functions for γ(SFR|θ), but their minimum and maximum contributions to SFR are adjusted to account for the missing v peak contribution. The feature functionf (z) is noisier overall. For the interaction functions, the largest contributors now involve M vir rather than the missing parameter v peak , and the set of available functions is substantially different than with γ(SFR|θ). Figure 12 shows the average contribution, feature functions, and interaction functions for the outlier EBM δ(SFR|θ ) fit to the deviant samples not captured by the base EBM γ(SFR|θ ). The outlier EBM receives the highest contribution from virial mass, with an average contribution more than an order of magnitude larger than the next most important feature ρ 1 . The redshift z and environmental temperature T 1 have comparable importance to ρ 1 . The remaining features provide only percent-level contributions relative to M vir . Figure 13 shows the average contribution, feature functions, and interaction functions for the classifier EBM δ(SFR|θ ) that interpolates between the base and outlier EBMs when calculating the CEBM model. For the classifier EBM, the most important features are ρ 1 , M vir , and Υ 0.1 . Redshift z has middling importance, following by T 1 , [ρ 1 , Υ 0.1 ], and [M vir , ρ 1 ]. The feature functions show strong dependencies on ρ 1 , M vir , Υ 0.1 , z, and T 1 . The largest interaction functions involve the environmental temperature T 1 , redshift z, and virial mass M vir .
By weighting the base and outlier EBM models with the classifier EBM, we construct the CEBM for star formation rate as Figures 14 show the average contribution, feature functions, and interaction functions for the SFR CEBM. The most important feature is M vir , which dominates by a factor of ∼ 4 − 10 over environmental density ρ 1 , environmental temperature T 1 , Υ 0.1 , and redshift z. The interaction terms are roughly percent-level effects relative to M vir . The feature functions show a strongly increasing SFR with M vir , and enhanced SFR with environmental density ρ 1 . The temperature dependence shows the feature at log 10 T 1 ≈ 4 seen with the EBM model γ(SFR|θ). The interaction functions provide only important contributions over very limited areas of parameter space, with the most important adjustments occuring at low redshift and large virial mass, or for large temperatures and virial masses. For reference, the model summary Figure 6 illustrates the overall performance of the model.  Figure 16 shows the average contribution, feature functions, and interaction functions for the outlier EBM model δ(M |θ ). The average contribution is dominated by M vir , with the contributions from all other single parameters lower by a factor of ≈ 10 with the order of importance maintained relative to γ(M |θ ). For the feature functions, the redshift dependence changes dramatically and now increases with increasing redshift. The feature function for environmental densityf (ρ 1 ) becomes much weaker over a wide range of ρ 1 , but increases dramatically at high ρ 1 . Relative to the γ(M |θ ) feature functions, the feature functionf (Υ 0.1 ) for δ(M |θ ) is weak and noisy. The interaction functions show increased contributions at large [z, ρ 1 ], and for low T 1 and large ρ 1 . Figure 17 shows the average contribution, feature functions, and interaction functions for the classifier EBM φ M (θ ). For each of these properties, we note that in determining φ M (θ ) a sigmoid function σ is applied to the sum of β, f i y and f ij y that model the log odds that a galaxy is an outlier in stellar mass. In determining M , the features with the largest average contribution are environmental density ρ 1 , redshift z, Υ 0.1 , and virial mass M vir . The interaction terms with the largest contribution are (z, ρ 1 ) and (ρ 1 , T 1 ). Clearly, environmental density plays an important role in determining whether a given simulated galaxy is an outlier relative to the base EBM γ(M |θ ). The feature functions show that galaxies with large environmental densities ρ 1 , at low redshift z, or with a large neighboring galaxy (expressed by Υ 0.1 ) have an enhanced probability of being outliers relative to γ(M |θ ). Galaxies at both high and low M vir or large environmental temperature T 1 are also more likely to be outliers.
We construct the stellar mass CEBM with the sum Γ(M |θ ) ≡ [1 − φ M (θ )]γ(M |θ ) + φ M (θ )δ(M |θ ). Figure 18 shows the average contribution, feature functions, and interaction functions for Γ(M |θ ). The feature with largest average contribution is M vir , with redshift z, environmental density ρ 1 , environmental temperature T 1 , and the mass ratio of nearby galaxies Υ 0.1 having an lower average contribution by a factor of ∼ 5 − 10. Relative to M vir , the interactions [z, ρ 1 ] and [M vir , ρ 1 ] contribute at level of a few percent. The M CEBM feature functionf (M vir ) has increased in amplitude relative to The dominant feature is virial mass Mvir, with an average contribution to log 10 SFR roughly 8 − 10× larger than environmental density ρ1 and temperature T1. Compared with the average contributions to the EBM γ(SFR|θ) (see Figure 1), Mvir subsumes the contribution provided by the missing v peak parameter. Panel b) shows the feature functions contributing to the base EBM model. The SFR increases with Mvir, which provides the dominant contribution. A secondary contribution comes from environmental density ρ1.
Environmental temperature T1 has a minor contribution, but shows the same feature at T1 ≈ 10 4 K where hydrogen ionizes. The mass ratio of nearby halos Υ0.1 and redshift z provide minor contributions. Panel c) presents the interaction functions for the base EBM γ(SFR|θ ). Each panel shows the contribution of the bivariate interaction terms, normalized such that the color map ranges between plus or minus the maximum of the norm of each function ||f ||max. Purple indicates negative contributions and blue indicates positive contributions. The table lists ||f ||max for the interaction functions, each with units log 10 M yr −1 . In absolute terms, the largest interaction occurs for large virial mass Mvir and environmental temperature T1. SFR is partially reduced for low environmental temperature T1 and either low environmental density ρ1 or redshift z.   Figure 13. Details for the classification EBM model φSFR(θ ) that interpolates between the base EBM γ(SFR|θ ) and the outlier EBM δ(SFR|θ ) for creating the CEBM Γ(SFR|θ ). Panel a) displays the average contribution of features to the classification EBM model φSFR(θ ).
The most important features for determining whether a galaxy is an outlier in the SFR distribution are environmental density ρ1, virial mass Mvir, and nearby halo mass ratio Υ0.1. The average contributions are unit free, and represent changes to the log odds of a galaxy being an outlier in the SFR distribution. Panel b) shows the feature functions contributing to the classifier EBM φSFR(θ ). These feature functions represent the change in log odds that a given galaxy will be an outlier in SFR. Outliers tend to occur at high environmental density ρ1 or very low or high virial masses Mvir. Galaxies with massive neighbors, reflected by Υ0.1, or high environmental temperature T1 are also more likely to be outliers. The lowest redshift galaxies in the dataset are additionally likely be outliers in SFR. Panel c) presents the interaction functions for the classifier EBM φSFR(θ ). Each panel shows the contributions of the interaction terms, normalized such that the color map ranges between plus or minus the maximum of the norm of each function ||f ||max. Purple indicates negative log odds and blue indicates positive log odds that a given galaxy is an outlier in SFR. The table lists ||f ||max for the interaction functions, listed as the corresponding change in log odds. Galaxies with large environmental temperature T1 and with high environmental density ρ1, at high redshift z, or with large virial mass Mvir are more likely to be outliers. Massive galaxies in high environmental densities or at high redshift also are more likely outliers in SFR.   Figure 14. Details for the CEBM model Γ(SFR|θ ) trained to predict SFR. Panel a) displays the average contribution of features to the CEBM. Virial mass Mvir provides the largest average contribution to the star formation rate. The environmental density ρ1 provides a ∼ 6× smaller average contribution. The environmental temperature T1, nearby galaxy mass ratio Υ0.1, and redshift z provide a relative contribution roughly 10× smaller than Mvir. Panel b) shows the feature functions contributing to the CEBM Γ(SFR|θ ). The SFR increases with Mvir, which provides the largest contribution. A secondary contribution comes from environmental density ρ1. Environmental temperature T1 has a minor contribution, but shows the familiar feature at T1 ≈ 10 4 K where hydrogen ionizes. The mass ratio of nearby halos Υ0.1 and redshift z provide minor contributions. As expected, the CEBM feature functions are similar to the base EBM feature functions that represent the parameter dependence of star formation rate for most galaxies in the dataset (see Panel b) of Figure 11). Panel c) presents the interaction functions for the CEBM Γ(SFR|θ ). Each panel shows the contribution of the interaction terms, normalized such that the color map ranges between plus or minus the maximum of the norm of each function ||f ||max. Purple indicates negative contributions and blue indicates positive contributions. The table lists ||f ||max for the interaction functions, each with units log 10 M yr −1 . As for the interaction functions for the base EBM γ(SFR|θ ), the largest interaction occurs for large virial mass Mvir and large environmental temperature T1 or low redshift z.     The feature with the highest average contribution is virial mass Mvir, with an average contribution to log 10 M roughly 5× larger than that from redshift z or environmental density ρ1. The environmental temperature T1 and nearby halo mass ratio Υ0.1 provide smaller average contributions, and interactions between features are yet smaller. Panel b) shows the feature functions contributing to the base EBM model γ(M |θ ). The stellar mass increases with Mvir that provides the largest contribution. Secondary contributions come from redshift z, which increases M at later times, and the positive correlate environmental density ρ1. comes from environmental density ρ1. Environmental temperature T1 has a small contribution, and shows the familiar feature at T1 ≈ 10 4 K where hydrogen ionizes. The mass ratio of nearby halos Υ0.1 provides a minor contribution. Panel c) presents the interaction functions for the base EBM γ(M |θ ). Each panel shows the contribution of the bivariate interaction terms, normalized such that the color map ranges between plus or minus the maximum of the norm of each function ||f ||max. Teal indicates negative contributions and green indicates positive contributions. The table lists ||f ||max for the interaction functions, each with units log 10 M . In absolute terms, the largest interaction occurs for large virial mass Mvir and environmental temperature T1 (same as for the base EBM γ(SFR|θ ) modeling star formation rate, see Panel c) of Figure 11). Stellar mass is partially reduced for low environmental temperature T1 and either high environmental density ρ1 or high redshift z.    Figure 15). The stellar mass of outliers increases with increasing environmental density ρ1, with a large enhancement at very large ρ1. Unlike the base EBM γ(M |θ ), the stellar mass of the outliers increases with increasing redshift. The feature function for the nearby halo mass ratio Υ0.1 is weak and noisy. Panel c) presents the interaction functions for the outlier EBM δ(M |θ ). Each panel shows the contribution of the interaction terms, normalized such that the color map ranges between plus or minus the maximum of the norm of each function ||f ||max. Teal indicates negative contributions and green indicates positive contributions. The table lists ||f ||max for the interaction functions, each with units log 10 M . For outliers, stellar mass increases at high environmental density ρ1 with low environmental temperature T1 or high redshift z.   Figure 17. Details for the classification EBM model φM (θ ) that interpolates between the base EBM γ(M |θ ) and the outlier EBM δ(M |θ ) for creating the CEBM Γ(M |θ ). Panel a) displays the average contribution of features to the classification EBM model φM (θ ). The most important features for determining whether a galaxy is an outlier in the stellar mass distribution are environmental density ρ1, redshift z, nearby halo mass ratio Υ0.1, and virial mass Mvir. The average contributions are unit free, and represent changes to the log odds of a galaxy being an outlier in the stellar mass distribution. Panel b) shows the feature functions contributing to the classifier EBM φM (θ ). These feature functions represent the change in log odds that a given galaxy will be an outlier in M . Outliers tend to occur at high environmental density ρ1 or very low or high virial masses Mvir. Galaxies with massive neighbors, reflected by Υ0.1, or high environmental temperature T1 are also more likely to be outliers. The lowest redshift galaxies in the dataset are additionally likely be outliers in stellar mass. These trends are similar to the feature functions for the classifier EBM φSFR(θ ) (see Panel b) of Figure 13). Panel c) presents the interaction functions for the classifier EBM φM (θ ). Each panel shows the contributions of the interaction terms, normalized such that the color map ranges between plus or minus the maximum of the norm of each function ||f ||max. Teal indicates negative log odds and green indicates positive log odds that a given galaxy is an outlier in stellar mass. The table lists ||f ||max for the interaction functions, listed as the corresponding change in log odds. Galaxies with large environmental temperature T1 and at high redshift z are more likely to be outliers. Massive galaxies at high environmental density ρ1 or with large nearby halos (large Υ0.1) also tend to be outliers. Galaxies at low environmental density but with large nearby halos are also have an increased likelihood of being outliers in stellar mass. Virial mass Mvir provides the largest average contribution to the stellar mass. The environmental density ρ1 and redshift z provide ∼ 5× smaller average contributions. The environmental temperature T1 and nearby galaxy mass ratio Υ0.1 provide a relative contribution to stellar mass roughly 10× smaller than Mvir. Panel b) shows the feature functions contributing to the CEBM Γ(M |θ ).
The stellar mass increases with Mvir, which provides the largest contribution. Secondary contributions come from environmental density ρ1 and redshift z. Environmental temperature T1 has a smaller contribution, but shows the familiar feature at T1 ≈ 10 4 K where hydrogen ionizes. The mass ratio of nearby halos Υ0.1 provides a minor contribution. As expected, the CEBM feature functions resemble the base EBM feature functions that represent the parameter dependence of stellar mass for most galaxies in the dataset (see Panel b) of Figure 15). Panel c) presents the interaction functions for the CEBM Γ(M |θ ). Each panel shows the contribution of the interaction terms, normalized such that the color map ranges between plus or minus the maximum of the norm of each function ||f ||max. Teal indicates negative contributions and green indicates positive contributions. The table lists ||f ||max for the interaction functions, each with units log 10 M . As for the interaction functions for the base EBM γ(M |θ ), the largest interaction occurs for large virial mass Mvir and large environmental temperature T1. Stellar mass is partially reduced for low environmental temperature T1 and high environmental density ρ1. These trends are similar to those for the base EBM γ(M |θ ) modeling stellar mass (see Panel c) of Figure 15).