Mangrove: Learning Galaxy Properties from Merger Trees

Efficiently mapping baryonic properties onto dark matter is a major challenge in astrophysics. Although semianalytic models (SAMs) and hydrodynamical simulations have made impressive advances in reproducing galaxy observables across cosmologically significant volumes, these methods still require significant computation times, representing a barrier to many applications. Graph neural networks have recently proven to be the natural choice for learning physical relations. Among the most inherently graph-like structures found in astrophysics are the dark matter merger trees that encode the evolution of dark matter halos. In this paper, we introduce a new, graph-based emulator framework, Mangrove, and show that it emulates the galactic stellar mass, cold gas mass and metallicity, instantaneous and time-averaged star formation rate, and black hole mass (as predicted by an SAM) with an rms error up to 2 times lower than other methods across a (75 Mpc/h)^3 simulation box in 40 s, 4 orders of magnitude faster than the SAM. We show that Mangrove allows for quantification of the dependence of galaxy properties on merger history. We compare our results to the current state of the art in the field and show significant improvements for all target properties. Mangrove is publicly available: https://github.com/astrockragh/Mangrove.


INTRODUCTION
In the hierarchical paradigm of ΛCDM cosmology, dark matter is a crucial constituent of galaxy formation. While modeling the evolution of universes with only dark matter can be done either analytically (Sheth et al. 2001) or through numerical N-body simulations (Aarseth et al. 1979; Efstathiou et al. 1985; Maksimova et al. 2021), co-evolving dark matter and baryons still represents a major challenge, as no simple, direct mapping between the two exists (Contreras et al. 2015; de Santi et al. 2022). Instead we turn to simulations for modeling these complex interactions. Two widely accepted frameworks for doing so are semi-analytic models (SAMs) and hydrodynamic simulations, which in the last two decades have made it possible to populate cosmologically significant volumes with galaxies (Somerville et al. 2008; Vogelsberger et al. 2014; Somerville & Davé 2015; Naab & Ostriker 2017; Vogelsberger et al. 2020). However, the state-of-the-art hydrodynamical simulations take hundreds of millions of CPU hours to run (Schaye et al. 2015; Pillepich et al. 2018; Davé et al. 2019; Villaescusa-Navarro et al. 2021). SAMs achieve much greater computational efficiency by combining dark matter merger trees with a suite of physically motivated recipes for evolving the baryonic components of galaxies, but still require several hundreds of CPU hours to fill a (75 Mpc/h)^3 simulation box (White & Frenk 1991; Cole et al. 1994; Somerville & Primack 1999; Benson 2012; Lacey et al. 2016; Lagos et al. 2018). Both methods reproduce many key observables over a broad redshift range, and semi-analytic and numerical hydrodynamic simulations make qualitatively similar predictions for many galaxy properties (Somerville & Davé 2015), as well as agreeing halo-by-halo on stellar and interstellar medium (ISM) properties (Pandya et al. 2020; Ayromlou et al. 2021; Gabrielpillai et al. 2021).
Corresponding author: Christian Kragh Jespersen, ckragh@princeton.edu. Code: https://github.com/astrockragh/Mangrove
However, there are still significant differences between SAMs and hydrodynamic simulations, as well as tension between the behavior of individual SAMs and hydrodynamic simulations (Somerville & Davé 2015). This tension is seen both on a halo-by-halo basis and on a population basis, depending on the parameter in question. Kamdar et al. (2016), Agarwal et al. (2018), Jo & Kim (2019), Lovell et al. (2022), and de Santi et al. (2022) (hereafter K16a, A18, JK19, L22, dS22) all attempt to map between dark matter and galactic baryonic properties using simple ML algorithms, like Extremely Randomized Trees, Random Forests, Multi-Layer Perceptrons or a combination of the above. They use only features from the final halos at z = 0, or summary statistics believed to encode the merger history along with the features of the z = 0 halo. These methods all achieve reasonable success on select quantities such as galactic stellar mass and hot gas mass, but struggle to reconstruct quantities such as cold gas mass, star formation rate (SFR), and metallicity. Even in cases where the regression methods were able to predict the median values of a quantity with relatively low error (such as stellar mass), these techniques typically underestimated the dispersion in the baryonic property at a given halo mass (Agarwal et al. 2018). Galaxy properties such as stellar mass, ISM mass, stellar and ISM metallicity and SFR are known to lie in a relatively small sub-region of the high-dimensional parameter space; i.e., they populate a hyperplane. The ultimate goal of emulation methods is to reproduce the full hyperplane and its dispersion for the full suite of baryonic properties of interest.
In this work, we present a new method for learning this highly non-trivial mapping, using the natural choice for learning on merger trees, a Graph Neural Network (GNN). GNNs have lately been demonstrated to work extremely well at modeling various problems in astrophysics (e.g., Cranmer et al. 2019, 2020, 2021a,b; Villanueva-Domingo et al. 2021; Thiele et al. 2022; Lemos et al. 2022). This choice of model allows us to include the full merger history as recorded in the merger tree, since the merger tree can naturally be encoded as a graph.
As we will show, our model, Mangrove, outperforms all other models in the literature when predicting stellar mass, cold gas mass, black hole mass, cold gas metallicity, and SFR. This indicates that exploiting the inherent structure of the merger tree is indeed the stronger choice for mapping directly between dark matter and baryonic properties. In this paper, we will furthermore demonstrate some valuable use cases that Mangrove allows for. First, we can probe the extent to which the exact merger history is important for predicting baryonic quantities, in a way that is infeasible with SAMs and hydrodynamical simulations. Comparing results from using different parts of the merger trees allows us to quantify which aspects of baryonic properties are due to formation history, and which are due to the direct, time-independent dark matter-baryon connection, which has not been possible until now.
We can furthermore probe the importance of halo features along the merger tree in order to determine which parameters would be important in constructing an analytical theory that directly relates the merger tree to galactic properties, and where the most salient information lies. This is done by removing certain parameter sets and observing the change in model performance.
This paper is organized as follows. In §2 we present the simulations used to train Mangrove, as well as our data selection criteria. In §3 we briefly introduce GNNs and the loss function used for optimizing Mangrove. In §4 we present our results. In §4.2.1 we present the results for predicting galactic stellar mass at different redshifts. §4.3 compares our z = 0 results to the existing literature. §5 presents our exploration of the dependence of stellar mass on formation history, as well as an exploration of which dark matter features are most important for the stellar mass prediction. In §6 we discuss our results and possible future work, and in §7 we summarize our conclusions. An appendix with additional details is also provided.

Simulations and Merger Trees
We use the dark-matter-only version of the IllustrisTNG simulation, TNG-100-1-Dark. This simulation contains (1820)^3 particles within a box of 75 h^-1 Mpc on a side. This implies a dark matter particle mass of 6 × 10^6 h^-1 M⊙. The halo finding code rockstar (Behroozi et al. 2013a) has been run on 99 snapshots from this simulation, and the consistenttrees (Behroozi et al. 2013b) code is used to construct merger trees from these halo catalogues. See Gabrielpillai et al. (2021, hereafter G21) for more details on the halo finding and merger tree algorithms. The merger tree encodes the full formation history and temporally evolving dark matter features of all halos that merge to form the final halo. Galaxies co-evolve with their dark matter halo, and the dark matter merger trees are therefore useful for determining galaxy formation.
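The quoted particle mass follows from spreading the mean matter density of the box over all particles. The short check below is a sketch under assumptions that are not stated in the text: a matter density parameter Ω_m = 0.3089 (the Planck 2015 value adopted by IllustrisTNG) and the standard critical density of 2.775 × 10^11 h^2 M⊙ Mpc^-3. The function name is purely illustrative.

```python
# Back-of-the-envelope check of the dark matter particle mass.
# Assumed inputs (not given in the text): Omega_m and the critical density.
OMEGA_M = 0.3089       # assumed matter density parameter
RHO_CRIT = 2.775e11    # critical density in h^2 Msun Mpc^-3


def particle_mass(box_size, n_per_side, omega_m=OMEGA_M):
    """Mass per particle in h^-1 Msun for a box side given in h^-1 Mpc."""
    total_mass = omega_m * RHO_CRIT * box_size ** 3  # h^-1 Msun in the box
    return total_mass / n_per_side ** 3


m_dm = particle_mass(75.0, 1820)
print(f"particle mass ~ {m_dm:.1e} h^-1 Msun")
```

In a dark-matter-only run all matter is carried by the simulation particles, which is why the full Ω_m (rather than only its dark matter part) enters the estimate.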

Santa Cruz Semi-Analytic Model
We then run the well-established Santa Cruz Semi-Analytic Model (SC-SAM; Somerville & Primack 1999; Somerville et al. 2008, 2015) on the merger trees described above. The current version of the SC-SAM is documented in G21.
Fundamentally, the SAM uses a set of coupled ODEs to track the flow of matter, gas, and metals between different reservoirs (the diffuse intergalactic medium, hot gas halo, cold gas in the ISM, stars, etc.). The predictions of the SC-SAM have been extensively compared with observations from z ∼ 0-10, and are in good agreement with observables such as stellar mass functions, SFR distributions, and cold gas content (Somerville et al. 2008, 2015; Yung et al. 2019a,b; Somerville et al. 2021). Moreover, they are reasonably similar to the predictions of hydrodynamic simulations for the stellar mass content and cold ISM content of galaxies across a broad range of halo mass and cosmic time. See Somerville & Davé (2015) for a review comparing many SAMs and hydrodynamical simulations, Pandya et al. (2020) for a comparison of the Santa Cruz SAM with the FIRE simulations, and G21 for a comparison of the SC-SAM with the IllustrisTNG simulations.
The SAM reproduces the galaxy "assembly bias" (dependence of clustering on properties other than halo mass) seen in IllustrisTNG, which is by definition not reproduced by regular Halo Occupation Distribution (HOD) models (Hadzhiyska et al. 2021).

Data Selection
To ensure that the galaxies in the dataset have reliable features and targets, we employ a set of selection criteria. First, only merger trees where the final halo has a mass of 10^10 M⊙ or above are included. This choice is made as the mass of the final halo indicates both the reliability with which the dark matter properties can be measured and the reliability of the derived SAM baryonic properties. Second, only central galaxies are included, since centrals and satellites are believed to have different relationships with their host halos (Hearin et al. 2016). An inclusion of satellites in future work would be of great interest.
Since any given merger tree can contain upwards of millions of nodes, some reductions are made. As we are mainly interested in probing the merger history, we preserve nodes/halos that are either:
• A progenitor node, i.e., the first time a halo was detected in the simulation
• Pre-merger nodes, i.e., halos from the snapshot before they merge
• Post-merger nodes, i.e., halos that are the direct result of a merger
• The final node, i.e., the final halo
This reduces the number of nodes by a factor of ∼10-50, depending on the merger tree in question. A sample merger tree resulting from this selection can be seen in Figure 1. Our selection produces a strong inductive bias, since smooth accretion modes are not included. We also limit the total number of nodes to be < 2 × 10^4, which results in the exclusion of 107 merger trees. Since we regress logarithmic targets, only galaxies with non-zero target quantities are included, excluding 470 trees. In total, the z = 0 dataset consists of 108,338 merger trees. In the future, zero-valued targets should be handled more elegantly, either by combining Mangrove with a classifier that predicts whether a given galaxy has a zero-valued target, or by using another transformation such as the arcsinh transformation.
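The node selection above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for Mangrove: it assumes each tree is given as a hypothetical mapping from halo ID to descendant ID (None for the final halo), and it assumes that kept nodes are reconnected by skipping over the removed intermediate halos, which the text does not specify.

```python
from collections import defaultdict


def select_nodes(descendant):
    """Return the halo IDs kept by the four selection rules.

    descendant: dict mapping halo id -> descendant id (None for the final halo).
    """
    progenitors = defaultdict(list)
    for halo, desc in descendant.items():
        if desc is not None:
            progenitors[desc].append(halo)

    keep = set()
    for halo, desc in descendant.items():
        n_prog = len(progenitors.get(halo, []))
        first_detection = n_prog == 0                    # progenitor node
        post_merger = n_prog > 1                         # direct result of a merger
        final = desc is None                             # final halo
        pre_merger = desc is not None and len(progenitors[desc]) > 1
        if first_detection or post_merger or final or pre_merger:
            keep.add(halo)
    return keep


def rewire(descendant, keep):
    """Connect each kept node to its nearest kept descendant (assumed convention)."""
    edges = []
    for halo in keep:
        desc = descendant[halo]
        while desc is not None and desc not in keep:
            desc = descendant[desc]
        if desc is not None:
            edges.append((halo, desc))
    return sorted(edges)


# Toy tree: two branches (a*, b*) merge into m, which evolves into the final halo f.
tree = {"a0": "a1", "a1": "a2", "a2": "m", "b0": "b1", "b1": "m", "m": "m1", "m1": "f", "f": None}
kept = select_nodes(tree)
edges = rewire(tree, kept)
```

In the toy tree the smoothly accreting halos `a1` and `m1` are dropped, which is the inductive bias mentioned above: only the topology of first detections and mergers survives.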
In the basic dataset, we include all dark matter features that are not IDs, x,y,z positions, or x,y,z velocities, even features not explicitly used by the SAM. See §A in the appendix for a list of halo parameters. The description of our training, validation, and test set selection can also be found in §B of the appendix.
SAMs output a large range of baryonic galactic properties, but for exploring the possibility of emulating them with a GNN, we pick a few quantities of interest. The main target of interest is stellar mass (log(M_*/M⊙), hereafter M_*). This is a central quantity both for creating mock catalogues and for simulators to successfully reproduce, and is therefore also the main focus of this project.
To explore the possibility of emulating other baryonic properties that are physically significant or closely related to observables, we also include a range of other targets.
• Cold gas (ISM) mass (log(M_cold/M⊙), hereafter M_cold). The cold gas is the fuel for star formation, and can also be probed observationally through sub-mm emission lines such as CO and through the dust continuum. This is also explored by A18 and L22.
• Black hole mass (log(M_BH/M⊙), hereafter M_BH). The supermassive black hole at the center of a galaxy influences the entire galaxy far beyond its gravitational sphere of influence (Kormendy & Ho 2013). This is also explored by JK19 and L22.
• Cold gas (ISM) metallicity (log(M_Zgas/M_cold), hereafter Z_gas). The cold gas metallicity is observable through the strength of different metal lines and is used as a tracer of cold gas clouds. This quantity is also explored by A18 and L22.
• Star Formation Rate averaged over 100 Myr (log(SFR_100/(M⊙/yr)), hereafter SFR_100). Both SFR properties can be probed through observations of UV or FIR light, and emission lines such as Hα, and correlate strongly with color. We attempt to regress both the current and 100 Myr averaged SFR, since these in conjunction should provide crucial information about the recent merger history of a galaxy (Caplar & Tacchella 2019; Iyer et al. 2020).

GRAPH NEURAL NETWORKS
The most successful models are the ones which embed well-motivated inductive biases into the model that one wishes to fit to some data. In machine learning, using convolutional neural networks (which preserve locality) for images is a prime example of this general principle. Since halo and galaxy evolution are naturally encoded in merger trees, which are graphs, a Graph Neural Network (GNN) is an intuitive choice for mapping galactic baryonic physics onto dark matter. GNNs, like other ML methods, are built from a sequence of modules, called layers. These layers are stacked as a sequential series of message-passing or graph convolutional layers (Kipf & Welling 2017; Battaglia et al. 2018) which pass information from the nodes along the edges of the graph, followed by a differentiable pooling function and a decoder function, which is usually a Multi-Layer Perceptron (MLP) (Rumelhart et al. 1986). The pooling function is applied in order to standardize the outputs for graphs with varying numbers of nodes. The overall flow of Mangrove is visualized in Figure 2.
A description of the full structure of Mangrove, known as the architecture of the model, can be found in §E in the appendix.

Core Concepts
GNNs are a species of neural network which operate on graph-structured data (Scarselli et al. 2008; Bronstein et al. 2017; Battaglia et al. 2018). For our purpose, the graphs, G, on which GNNs operate are defined as 2-tuples, G = (V, E), where V = {v_i}_{i=1:N_v} is a set of node attribute vectors of dimensionality D_v, and N_v is the total number of nodes. The edges can be encoded by E = {(e_k, r_k, s_k)}, a set of edge attribute vectors e_k of dimensionality D_e, together with the indices r_k, s_k ∈ {1 : N_v} of the "receiving" and "sending" nodes connected by the k-th edge.
In this work, only node attributes and edge indices are used, although edge features could be created. Note that our graphs are directed, since merger trees are inherently directed in time. A directed graph means that information can only be passed one way on a given edge, which for our purpose follows the flow of time, since propagating information backwards in time would break causality.
The neighborhood of node i consists of all nodes that are connected to node i by an edge. Note that for a directed graph, this only includes the set of nodes that send to node i, i.e., the nodes s_k of the edges for which r_k = i. Some prefer to instead define two separate notions of neighborhoods for directed graphs, an incoming neighborhood and an outgoing neighborhood. Our definition would be the same as the incoming neighborhood. We denote the neighborhood of node i by N(i).
Figure 2. Node states are updated by applying a learnable function f to the current node states, applying a learnable function g to the mean of the node states of the neighboring nodes and adding these two as described in §3. Transferring into the latent space is marked by shaded colors and a change of shape from circles to rectangles. Adding neighborhood information is marked by mixing of colors. The model makes message-passing steps, after which the graph nodes are summed over, and this sum is then decoded by another learnable function, h, which gives the predictions and the Gaussian covariance matrix. All learnable functions are Multi-Layer Perceptrons.
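As a concrete illustration of the incoming neighborhood N(i), the sketch below builds it from a list of directed edges. The (sender, receiver) tuple layout and the function name are assumptions made for this example; nodes are indexed from 0 here rather than from 1.

```python
def incoming_neighborhoods(n_nodes, edges):
    """Return N(i) for every node i, given directed (sender, receiver) edges."""
    neighborhoods = [[] for _ in range(n_nodes)]
    for sender, receiver in edges:
        neighborhoods[receiver].append(sender)
    return neighborhoods


# Two progenitors (nodes 0 and 1) merge into node 2, which evolves into node 3.
edges = [(0, 2), (1, 2), (2, 3)]
nbrs = incoming_neighborhoods(4, edges)
print(nbrs)  # [[], [], [0, 1], [2]]
```

Because edges point forward in time, the earliest halos have empty neighborhoods and the final halo only ever receives information.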

GraphSAGE
As described above, the message-passing or graph convolutional layer makes up the cornerstone of GNNs. Information from the neighborhood is passed along the edges leading to a given node, in order to learn not just from the node feature vector but also from its neighbors.
There has been a variety of different proposed message-passing and graph convolutional layers. In this project we use the PyTorch Geometric (Fey & Lenssen 2019) implementation of the GraphSAGE convolutional layer from Hamilton et al. (2017). With each application of this layer, each node updates its state from the input state v_i to a hidden state v_i', through:
$$ v_i' = W_1 v_i + W_2 \cdot \underset{j \in N(i)}{\mathrm{mean}}\; v_j \tag{1} $$
where W_1 and W_2 are learnable weight matrices. Thus, W_1 operates on information from the node itself, and W_2 on the mean of the node states of the neighborhood nodes.
In order to gain some intuitive understanding of what this layer actually does, it is useful to think about these layers in terms of functions. We can introduce two learnable functions, f, the node function, and g, the neighborhood function, which are both constrained to be linear. Then a full application of GraphSAGE can be written as:
$$ v_i' = f(v_i) + g\left( \underset{j \in N(i)}{\mathrm{mean}}\; v_j \right) \tag{2} $$
The optimization task can then be framed as learning the functions f and g, expressed through matrices W_1 and W_2. See Figure 2 for a schematic of the flow of these learnable functions.
Since all functions in this layer are linear, a non-linear activation function, usually written as σ, is applied between each layer, allowing expression of nonlinear functions. Thus the node state is updated to:
$$ v_i' = \sigma\left( f(v_i) + g\left( \underset{j \in N(i)}{\mathrm{mean}}\; v_j \right) \right) \tag{3} $$
In this work we use the ReLU activation function between GraphSAGE layers (Agarap 2018).
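To make the update rule concrete, the sketch below re-implements one mean-aggregation GraphSAGE step with a ReLU activation, followed by sum pooling over the nodes, in plain Python. It is an illustration only and not the PyTorch Geometric layer used by Mangrove; bias terms are omitted, the weights are fixed by hand instead of learned, and the choice of a zero vector for the mean over an empty neighborhood is an assumed convention.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def sage_layer(states, neighborhoods, w1, w2):
    """One step: v_i' = ReLU(W1 v_i + W2 mean_{j in N(i)} v_j)."""
    updated = []
    for i, state in enumerate(states):
        nbrs = neighborhoods[i]
        if nbrs:
            mean = [sum(states[j][d] for j in nbrs) / len(nbrs) for d in range(len(state))]
        else:
            mean = [0.0] * len(state)  # assumed convention for nodes without neighbors
        pre_activation = [a + b for a, b in zip(matvec(w1, state), matvec(w2, mean))]
        updated.append([max(0.0, p) for p in pre_activation])  # ReLU
    return updated


def sum_pool(states):
    """Sum the node states so graphs of any size give a fixed-length vector."""
    return [sum(column) for column in zip(*states)]


# Nodes 0 and 1 merge into node 2.
states = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0]]
neighborhoods = [[], [], [0, 1]]
w1 = [[1.0, 0.0], [0.0, 1.0]]   # node function f (identity here)
w2 = [[1.0, 0.0], [0.0, -1.0]]  # neighborhood function g

hidden = sage_layer(states, neighborhoods, w1, w2)
pooled = sum_pool(hidden)
print(hidden[2], pooled)  # [2.0, 0.0] [6.0, 2.0]
```

Node 2 starts with an all-zero state, so everything in its updated state comes from the mean of its two progenitors. The negative second component is clipped by the ReLU, which is the non-linearity that separates stacked layers from a single linear map.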

Loss Function
The loss function L is central to the optimization and performance of any GNN, as the parameter set θ which makes up the GNN is optimized to satisfy min_θ L_θ({G}_train). In this work we employ a generalized Gaussian Negative Log-Likelihood (NLL).
For a single input, the general Gaussian NLL is defined (up to an additive constant) as:
$$ \mathcal{L} = \frac{1}{2} \left[ (y - \hat{y})^{T} \Sigma^{-1} (y - \hat{y}) + \log |\Sigma| \right] \tag{4} $$
where y is the true target vector, ŷ is the network prediction vector, Σ is the predicted covariance matrix, and |Σ| denotes its determinant. This is then easily extended to its batch form by simply summing over all inputs in a batch. In this paper, the quoted results are obtained via a purely diagonal covariance matrix, although in some cases we obtained better results with the full covariance matrix. The diagonal covariance matrix is preferred for its simplicity, and still renders Gaussian uncertainties, which are highly useful as discussed in §6. The metrics (see §4.1) between the two cases differ by no more than a few %. A further discussion of the uncertainties can be found in Appendix §C.
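For the diagonal covariance case, the Gaussian NLL reduces to a sum over targets of a scaled squared residual plus the log of the predicted variance. The sketch below writes this out in plain Python; it drops the constant term and uses a factor of 1/2, both of which are assumptions about the exact convention and do not affect the location of the minimum. The function names are illustrative.

```python
import math


def gaussian_nll_diag(y, y_hat, var):
    """Diagonal Gaussian NLL for one input, additive constant dropped.

    y, y_hat: true and predicted target vectors; var: predicted variances (> 0).
    """
    terms = ((t - p) ** 2 / v + math.log(v) for t, p, v in zip(y, y_hat, var))
    return 0.5 * sum(terms)


def batch_nll(targets, predictions, variances):
    """Batch form: sum of the single-input losses."""
    return sum(gaussian_nll_diag(y, p, v) for y, p, v in zip(targets, predictions, variances))


loss_exact = gaussian_nll_diag([10.0, -1.0], [10.0, -1.0], [1.0, 1.0])
loss_off = gaussian_nll_diag([10.0], [9.0], [1.0])
loss_overconfident = gaussian_nll_diag([10.0], [9.0], [0.01])
```

The last case shows why the predicted variances are informative: a prediction that is off by 1 dex while claiming a variance of 0.01 is penalized far more heavily than the same miss with an honest variance of 1.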

RESULTS
In this section, we first introduce the metrics used to characterize the performance of Mangrove. We then present results for our M_* predictions, compare them to other methods, and examine the possibility of generalization across different redshifts. We then explore the predictions of the other target parameters. We compare our results to results from four other frameworks.
• Our M_* prediction will be compared to the more widely used method for connecting halo masses and galactic stellar masses, Abundance Matching (Vale & Ostriker 2004).
• Where possible, we compare to other papers in the literature which have attempted to regress the same target parameters. However, these are not performed on the same dataset as ours.
• In order to mitigate this issue, we train an MLP on the z = 0 halos (final halos) of our dataset, which should be comparable to the methods in the literature.
• As a way of providing an estimate of the best possible performance on the test set, we run the SC-SAM with different random seeds, and calculate the comparison metrics between them. This should be an estimate of the lower information limit. If our predictions do significantly better than this across the entire simulation, it would be cause for some concern. This is not the same as the numerical uncertainty, but arises because, while the model can learn summary statistics for the probability distributions used in the SAM, it is a fully deterministic model and thus cannot emulate the random draws. Going forward this is referred to as the SAM probabilistic limit.

Metrics
The most commonly used metric to determine accuracy in astronomy is the scatter, defined as:
$$ \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( \Delta y_i - \overline{\Delta y} \right)^2 } \tag{5} $$
where ∆y ≡ y − ŷ is the residual of a single prediction and \overline{∆y} is the mean of the residuals. This metric has two significant caveats. First, it does not measure any systematic offset in the residuals. Therefore, as an important addition, we introduce the bias as an auxiliary metric, defined as the mean of the residuals, i.e.:
$$ \mathrm{bias} = \overline{\Delta y} = \frac{1}{N} \sum_{i=1}^{N} \Delta y_i \tag{6} $$
The bias effectively measures any systematic offset. The best possible predictions would have low scatter and no bias. Second, because the scatter is susceptible to outliers, we include two secondary metrics that are more stable and not directly optimized for in our loss function.
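• Pearson correlation coefficient (ρ), i.e., the linear correlation between the target and model prediction:
$$ \rho = \frac{\mathrm{cov}(y, \hat{y})}{\sigma_{y} \, \sigma_{\hat{y}}} \tag{7} $$
• Coefficient of determination (R^2), i.e., the fraction of the variance in the target that is explained by the model prediction:
$$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \tag{8} $$
For both of these metrics, a perfect set of predictions would correspond to ρ = R^2 = 1.
It is important to note that the ability to predict any specific target generally improved when Mangrove was trained to predict all target variables. We therefore distinguish between models trained for all targets and only a single target when presenting results.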
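The four metrics can be computed directly from the targets and predictions. The sketch below is a plain-Python illustration; it assumes the population (1/N) normalization for the scatter and the usual definition of R^2 as one minus the ratio of residual to total sum of squares, since the exact conventions are not recoverable from the text.

```python
import math


def regression_metrics(y, y_hat):
    """Scatter, bias, Pearson rho and R^2 for targets y and predictions y_hat."""
    n = len(y)
    residuals = [t - p for t, p in zip(y, y_hat)]
    bias = sum(residuals) / n
    scatter = math.sqrt(sum((r - bias) ** 2 for r in residuals) / n)

    mean_y = sum(y) / n
    mean_p = sum(y_hat) / n
    cov = sum((t - mean_y) * (p - mean_p) for t, p in zip(y, y_hat))
    var_y = sum((t - mean_y) ** 2 for t in y)
    var_p = sum((p - mean_p) ** 2 for p in y_hat)

    rho = cov / math.sqrt(var_y * var_p)
    r2 = 1.0 - sum(r ** 2 for r in residuals) / var_y
    return {"scatter": scatter, "bias": bias, "rho": rho, "r2": r2}


# Predictions that are perfectly correlated but offset by 0.1 dex.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 2.1, 3.1, 4.1]
metrics = regression_metrics(y_true, y_pred)
```

The toy case illustrates the first caveat: a constant offset leaves the scatter at zero and ρ at one, and only the bias (and, more weakly, R^2) registers it.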

Stellar Mass Results
As the central quantity of interest, the stellar mass received the most attention in this paper. On the test set, Mangrove achieves a scatter of 0.070 dex, with a bias of 0.002 dex (other metrics can be found in Table 1). This is shown in Figure 3, along with a comparison to the usual halo mass abundance matching approach. Abundance matching (Vale & Ostriker 2004) simply rank-orders all galaxies and halos by mass and assumes a monotonic matching relation exists between the two. We include this comparison as a baseline due to its simplicity and widespread use.
Figure 3 shows the relation between target value and predicted value, along with distributions on the respective axes. The figure shows the (target, prediction) relation as a 2D histogram with logarithmic bin heights. If this relation follows the diagonal, that would indicate perfect predictions. The more tightly the relation follows the diagonal, the better the predictions.
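The rank-ordering behind the abundance matching baseline can be sketched as follows. This is only a minimal illustration of the idea of a monotonic halo mass to stellar mass relation; how the baseline in this work was actually constructed (for example, which sample supplies the stellar mass distribution) is not specified here, and the function name is hypothetical.

```python
def abundance_match(halo_masses, stellar_masses):
    """Assign the k-th smallest stellar mass to the k-th smallest halo mass."""
    if len(halo_masses) != len(stellar_masses):
        raise ValueError("need one stellar mass per halo")
    halo_order = sorted(range(len(halo_masses)), key=lambda i: halo_masses[i])
    ranked_stellar = sorted(stellar_masses)
    matched = [0.0] * len(halo_masses)
    for rank, halo_index in enumerate(halo_order):
        matched[halo_index] = ranked_stellar[rank]
    return matched


# Log masses for three halos and the stellar masses of their galaxies.
log_m_halo = [12.0, 11.0, 13.0]
log_m_star = [10.2, 9.1, 10.0]
matched = abundance_match(log_m_halo, log_m_star)
print(matched)  # [10.0, 9.1, 10.2]
```

In the toy example the first and third galaxies swap stellar masses, because the matching ignores everything except the halo mass rank. That lost information is exactly what shows up as scatter around the diagonal in a target versus prediction plot.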
A few comparisons are beneficial to keep in mind:
• Training Mangrove to predict only M_* yields a scatter of 0.078 dex, 11% worse than the performance when training Mangrove to predict all quantities simultaneously.
• The performance of the model worsens to a scatter of 0.132 dex when using only the parameters of the final halo, indicating a strong dependence on assembly history.
• The scatter of Mangrove's M_* predictions is comparable to the SAM probabilistic limit as defined above, which has a scatter of 0.043 dex (see Table 1).
Further investigation of the impact of different features and the merger history can be found in §5.

Stellar mass at other redshifts
A central question for many ML models is to what extent their predictions will generalize. For astrophysical purposes, generalization across redshifts is crucial. We investigated whether Mangrove would perform well at z > 0 through several experiments.
1. Test models trained at a single redshift at redshifts where they were not trained. This in general leads to imprecise and highly biased results.
2. Train and test models at individual z ≥ 0. Mangrove can achieve a scatter below 0.08 dex at all redshifts, as can be seen in Figure 4.
3. Train a general model by pooling training sets at z ∈ {0, 0.5, 1, 2}, and testing at z ∈ {0, 0.5, 1, 2}. Compared to training and testing at individual redshifts, we obtain similar results at all redshifts.
4. Most surprisingly, by pooling training sets at z ∈ {0, 0.5, 1, 2}, and testing at z ∈ {0.25, 0.75, 1.5, 1.75}, where Mangrove was not trained, we obtain comparable bias and scatter to where Mangrove was trained.
5. Pooling training sets at z ∈ {0, 0.5, 1, 2}, and testing at z = 3, where Mangrove was not trained, we get a relatively inaccurate result, as seen in Figure 5.
The prediction scatter is improved by almost a factor of two compared to the state-of-the-art single-halo method and by more than a factor of 4 compared to the widely used abundance matching method (see Table 1). Mangrove's M_* scatter is comparable to, but still well above, the probabilistic limit of the SAM, which is at 0.043 dex. All experiments in this section were done with models trained to predict only M_*.
Mangrove performs well at all z > 0, even learning a smooth transformation for general redshifts in the interpolative regime. However, its performance worsens significantly when extrapolating.
Figure 5. Same as Figure 4, but including a point from extrapolating to z = 3 using the models trained at z ∈ {0, 0.5, 1, 2}. The predictions have much higher scatter.

M_cold, M_BH, Z_gas, SFR and SFR_100 Results
In general, we observe a weaker dependence on merger history for all other target quantities, but the predictions from using the full merger tree are always significantly better than those from using just the final halo, as can be seen in Table 1. The regression results for M_cold, M_BH, Z_gas, SFR and SFR_100 for a model trained to predict all targets at the same time are visualized in Figure 6. As can be seen, Mangrove generally performs very well but struggles in regions of SFR or SFR_100 > 0.1. This is due to the two diverging branches in SFR that occur around this value (see Figure 13 in the appendix).
Figure 6. Recovering galactic M_cold, Z_gas, SFR, SFR_100 and M_BH renders mostly tight and unbiased relationships across the entire ranges of these galactic features, with the notable exception of targets where SFR or SFR_100 > 0.1. This flaw is spurious and should be possible to mitigate. All of these results outperform results from the literature for all precision metrics (see Table 1). The fact that such mappings exist implies that the full hyperplane of galactic properties may be emulated with Mangrove.
Notably, the M_BH predictions are slightly better than the SAM probabilistic limit. Since the predictions are just below this limit, we can interpret the results as having accurately captured a generalizing rule for mapping between the dark matter merger tree and M_BH, with little cause for concern.
See §D in the appendix for an in-depth analysis of both the relations between halo mass and the target parameters, as well as the interdependence of the residuals of the target parameters.

Comparison to Benchmark
To evaluate our performance against the current state of the art in mapping baryonic properties directly onto dark matter, we provide Table 1, containing accuracy metrics (see the beginning of §4 for an overview of the metrics used) for this work and others in the literature at z = 0.
Although we here compare directly to the literature results, it should be noted that dS22 report values for the specific instantaneous SFR (sSFR, defined as SFR/M_*), which means that their results for just SFR have significantly higher scatter, since this prediction is dominated by the stellar mass prediction.
Since papers from the literature do not report all metrics used here, we provide an estimate for a final-halo-only result, which effectively follows the basic Neural Network approach of dS22. The final-halo-only accuracy should also be comparable to the approaches using tree-based algorithms, as shown by dS22, so the results of A18, JK19 and L22 should also be comparable.
Table 1. Metrics for the methods discussed in this paper. SAM probabilistic limit denotes the metrics obtained from comparing separate realizations of the SC-SAM. Mangrove denotes the results of Mangrove using the full merger history and all halo parameters. Final halo only denotes the results of Mangrove using all halo parameters for the z = 0 halo, i.e., the final halo. We bold the best emulator performance for each metric for each target variable. Exact values from JK19, dS22 and L22 were obtained through correspondence with the authors. Only for bias are our results comparable to results from the literature, although the bias is usually small enough to be dominated by noise. Note that although the SFR/SFR_100 scatter is quite high, so is the SAM probabilistic limit. For M_BH, Mangrove notably performs slightly better than the SAM limit.
As described in §4, we also provide the probabilistic limit of the SAM.
Comparing our results to results from the literature, we see that Mangrove outperforms other models across all categories, and that using the full tree structure yields remarkable improvements over using only the properties of the final halo. M_* and M_BH in particular show very significant improvements when comparing our merger tree approach with the final-halo-only approach.

DEPENDENCE ON ASSEMBLY HISTORY AND FEATURE ABLATION
A key method of investigating ML models is to simply remove certain parts of either the model or the input data, in order to probe the importance of said part. This process is known as ablation. In this section we seek to quantify the impact on our model accuracy of removing parts of the input data. We investigate two separate cases, in both cases with models trained to predict only M_* at z = 0.
• The dependence of our model accuracy on the fraction of the merger history included in the input merger trees.
• The dependence of our model accuracy on different sets of input features when given the full merger tree as an input.
These tests investigate what aspects of current galactic stellar masses are due to history, and which are inevitable due to the fundamental physical connection between different aspects of a galaxy and its host halo, and which of these aspects are the most important.

Dependence on Merger History
In order to quantify the dependence on merger history of our galactic stellar mass prediction, we perform a simple study. For all merger trees, we reduce the number of nodes by some fixed percentage, P, for which we retrain and retest the model. We reduce the number of nodes starting at earlier times (high redshift) by finding the scale factor corresponding to the P'th percentile and then excluding all halos at scale factors lower than this. Here we choose the fixed percentage in order to not bias the results towards lower-mass galaxies, since in our dark-matter-only simulations, their assembly would only start being recorded at lower redshifts, and their merger trees would thus not be pruned to the same extent if one pruned above a fixed redshift. Our chosen pruning method is illustrated in the difference between Figures 1 and 7. If our M_* prediction scatter using the P'th-percentile reduced tree is as good as when using the full merger trees, it implies that there was no useful information in the P'th percentile highest redshift nodes that is not also contained in the low redshift nodes.
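The percentile pruning can be sketched as below. The nearest-rank percentile definition, the treatment of ties (nodes at the cutoff scale factor are kept), and the function name are assumptions made for this illustration; the final halo is identified here simply as the node with the largest scale factor.

```python
def prune_tree(scale_factors, percent):
    """Return the indices of nodes kept after removing the `percent`% earliest nodes.

    scale_factors: scale factor a = 1 / (1 + z) of every node in one merger tree.
    """
    n_nodes = len(scale_factors)
    final_node = max(range(n_nodes), key=lambda i: scale_factors[i])
    n_removed = int(n_nodes * percent / 100)
    if n_removed >= n_nodes:
        return [final_node]  # 100% removed: only the final halo remains
    cutoff = sorted(scale_factors)[n_removed]  # nearest-rank percentile (assumed)
    return [i for i in range(n_nodes) if scale_factors[i] >= cutoff]


tree_scale_factors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0]
kept_75 = prune_tree(tree_scale_factors, 75)
kept_100 = prune_tree(tree_scale_factors, 100)
print(kept_75, kept_100)  # [6, 7] [7]
```

Because the cutoff is computed per tree, a low-mass halo whose history only starts at a = 0.5 loses the same fraction of its nodes as a massive halo traced back to a = 0.1, which a fixed redshift cut would not achieve.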
We train and test 25 times for each percentage, with 0, 50, 75, 85, 95, 99 and 100% of the merger history removed (see footnote 11). The median and 16th/84th percentiles of the test scatters are shown in Figure 8. It is clear that the impact of including more of the merger history is quite significant, with higher relative significance towards halos at lower redshifts.
11 Removal of 100% of the merger history corresponds to only using the final halo. Removing 0% corresponds to using the full merger history.
Figure 7. A partial merger tree showing the merger tree from Figure 1 with the 75% highest redshift nodes pruned, which is our method of quantifying the importance of nodes at high redshifts. If our M* prediction from using this reduced tree is as good as when using the full tree, it indicates that there was no useful information in the 75% highest redshift nodes. The relation between precision and cutoff percentage can be seen in Figure 8.
Figure 8. Median and 16th/84th percentile of the scatter for 25 models at each percentile cut. There is a clear trend towards a loss of accuracy as we remove more nodes, so the stellar mass prediction depends strongly on the assembly history of the given galaxy. The curve is approximately exponential. Note, however, that the drop in accuracy when pruning the 50% highest redshift nodes is small.

Feature Ablation
A GNN can act as a predictor of what features are most important in determining the properties of a galaxy. Say that one would like to test whether some connection between black hole mass and three halo parameters exists. Then one can simply train and test a GNN using said three parameters, and compare the results to when the GNN uses all parameters. Thus, in order to investigate which features are the most important for Mangrove's M* prediction, we choose this simple approach, where we only include certain sets of features during both training and testing of Mangrove.
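As a rough sketch of this procedure, the snippet below restricts every node of a tree to a chosen subset of features before training and testing. The feature names, the subsets and the constant placeholder used for the empty tree are hypothetical choices for illustration; the sets actually tested are those listed in Table 2.

```python
# Hypothetical feature names and subsets, for illustration only.
FEATURES = ["M_vir", "V_max", "R_vir", "R_s", "scale_factor"]
SUBSETS = {
    "all": FEATURES,
    "V_max only": ["V_max"],
    "NFW + redshift": ["R_vir", "R_s", "scale_factor"],
    "empty tree": [],
}

def select_features(nodes, subset):
    """Keep only the columns named in `subset` for every node of a tree."""
    cols = [FEATURES.index(name) for name in subset]
    if not cols:
        # Assumed convention: an empty tree keeps its graph structure and
        # every node carries the same constant placeholder feature.
        return [[1.0] for _ in nodes]
    return [[row[c] for c in cols] for row in nodes]
```

The same selection has to be applied to the training and test trees, after which the model is retrained from scratch for each subset and the resulting scatters are compared.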
Besides a series of physically motivated sets of quantities (see Table 2) from analytical approaches in the literature (Rodríguez-Puebla et al. 2016), we also attempt to regress M * from an empty tree, i.e., a tree with no features.The merger tree then contains no information but that encoded in the geometric structure itself.This approach is less precise than the final halo only regression, but it still outperforms abundance matching.
We observe that if one wishes to use only a single parameter, V max has a much lower scatter compared to the otherwise most common choice, the halo mass.
The most salient single group of parameters is the NFW profile parameters (see footnote 12). The NFW profile information can be encoded either as M_200c, M_500c and M_2500c or as R_s,Klypin and R_vir, which together make the NFW concentration parameter c_NFW. Combining redshift and NFW profile information rendered the most precise prediction using the fewest features.
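For reference, the standard NFW relations behind these two encodings are written out below. The characteristic density \rho_s and the dimensionless radius x are symbols introduced here for the illustration; R_s,Klypin is one particular estimator of R_s and is not distinguished from R_s in this sketch.

```latex
\rho(r) = \frac{\rho_s}{\left(r/R_s\right)\left(1 + r/R_s\right)^{2}},
\qquad
c_{\mathrm{NFW}} = \frac{R_{\mathrm{vir}}}{R_s},
\qquad
M(<r) = 4\pi \rho_s R_s^{3}\left[\ln(1 + x) - \frac{x}{1 + x}\right],
\quad x = \frac{r}{R_s}
```

Since the enclosed mass depends only on \rho_s and R_s, masses measured within several overdensity radii constrain the same profile shape as the pair R_s and R_vir, which is consistent with the two encodings carrying similar information.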
Some general interpretability methods rely on the same basic principles of ablation to investigate model behavior, such as SHAP values (Lundberg & Lee 2017).SHAP values are however not reliable for GNNs in their current implementation, but only for simpler models, such as Random Forests (Lundberg et al. 2018).

DISCUSSION AND FURTHER WORK
In this section we discuss some of the implications of our work and how these connect to possible future work. Among the topics discussed will be the issues with the constituents of star formation, M_cold, SFR and SFR_100, merger history dependence and interpretability, the bias and dispersion in relationships with halo mass, the predicted uncertainties, and how to use Mangrove with other simulations and combine results from Mangrove with observations.
12 Navarro-Frenk-White (NFW) profile (Navarro et al. 1997).
Table 2. Results for feature ablation for predicting only M*. Training on a smaller subset of features renders information about the importance of each subset. Interestingly, the empty tree regresses significantly better than abundance matching, demonstrating that there is significant information in just the geometrical structure of the merger tree. We also observe that V_max and c_NFW are the single parameters that allow Mangrove to make the best predictions.

6.1. M_cold, SFR and SFR_100
Although this paper shows that some highly accurate mappings between dark matter merger trees and baryonic galactic properties exist, there is still significant scatter between the Mangrove and SAM M_cold, SFR and SFR_100. It should, however, be noted that the scatters between different SAM runs due to only random seed variation in these quantities are already quite high (see Table 1). There were no notable differences in the reconstruction strength between SFR and SFR_100, which indicates that reconstructing the star formation history and the current SFR are similar tasks for Mangrove. As a way of investigating whether Mangrove has learned physically meaningful relationships for these targets, we test the interdependence of the target residuals. Here we find that the residuals of the two SFR targets and M_cold are strongly correlated (see Figure 12 in the appendix), meaning that if Mangrove predicted too high an M_cold, it would also predict too high an SFR, analogous to the Kennicutt-Schmidt relation (Kennicutt 1998). The improvement in these three quantities when going from using only the final halo to the full merger history was smaller than expected, since they are thought to be strongly connected to the merger history of the galaxy (White & Frenk 1991; Somerville & Davé 2015; Caplar & Tacchella 2019). This aspect of galaxy evolution could instead be more strongly connected to the galaxies' environments, which are not included as an input to Mangrove. An obvious possibility for future work is therefore to include an encoding of the environment. Since Lovell et al. (2022) showed that summary statistics do not make a significant difference, it would perhaps be wise to include the environmental dependence as an additional graph that extends spatially as in Makinen et al. (2022), such that the environmental dependence could be learned. This would naturally lead to including a more explicit subhalo model, with satellites included in the spatial graph. It would then be natural to also predict properties for the satellites.

Merger History Dependence and Interpretability
For M*, in contrast to M_cold, SFR and SFR_100, we find a very strong dependence on the percentage of merger history included, with the importance of including a specific node increasing the closer it is to z = 0 (see Figure 8). This is in contrast to the conclusions of McGibbon & Khochfar (2022), although their modelling framework does not fully use the formation history. Our analysis of merger history dependence is, however, only preliminary for all parameters but M*. The more powerful graph-based framework should facilitate deeper future investigations of the formation history dependence for a wider range of properties. Another interesting avenue for interpreting the model could be to investigate the model's behavior with symbolic regression (Cranmer et al. 2020), which would lead to a highly interpretable merger history dependence. A separate but interesting interpretability approach could be to investigate the merger trees from the point of view of unsupervised learning, as in Jespersen et al. (2020) and Hovis-Afflerbach et al. (2021), using either t-SNE or UMAP (Van der Maaten & Hinton 2008; McInnes et al. 2018). This would be done in the latent space of Mangrove, as the latent space would be more readily comparable.

Additional Physical Relationships
As can be seen in Figure 9, Mangrove reproduces not just the median relationship, but also the dispersion in the M_halo - M* relationship. Figure 13 in the appendix shows that this is the case for all quantities except the two SFRs. Combining this with the above-mentioned correct physical interdependence of the residuals leads to a suspicion that we are closer than ever to being able to emulate the full galaxy property hyperplane. Since Mangrove furthermore performs well at all redshifts, even when interpolating at redshifts where it had not been trained, it is possible to imagine a GNN being able to emulate the full galaxy property hyperplane for all redshifts. Relations like Figure 9 for all other target variables can be found in Appendix §D.
Another curious result is the tendency for Mangrove to improve as more output variables were included (with no increase in the number of model parameters), indicating that an even larger range of baryonic properties than included here could be predicted by one unified model with even greater accuracy, lending further credence to the possibility of emulating the full galaxy property hyperplane. The authors hypothesize that this is due to weight smoothing, which diminishes spurious correlations, since each weight is "regularized" by the loss from other targets.

Uncertainties
Predicting a set of Gaussian covariances, an addition to the literature on emulating baryonic physics, also opens some interesting possibilities. The diagonal entries, the variance of each parameter, could be used as a filter for selecting predictions that Mangrove is highly confident that it has gotten right. For example, if one filters our results such that only the 50% highest-confidence regressions, as judged by Mangrove, are included, we get significantly better results (see Table 3, and Figure 16 in the appendix). A further investigation into the drivers behind the variance value is beyond the scope of this paper, but highly valuable information could most likely be extracted from these, as demonstrated by Stiskalek et al. (2022). Distributions and correlations between the predicted uncertainties and galaxy properties can be found in Appendix C.
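The confidence filter described above amounts to ranking predictions by their predicted variance and keeping the lowest half. A minimal sketch, with the function name `most_confident` and the flat list of variances being assumptions made for the example:

```python
def most_confident(variances, keep_fraction=0.5):
    """Return the indices of the predictions with the lowest predicted variance."""
    # Rank all predictions from most to least confident.
    order = sorted(range(len(variances)), key=lambda i: variances[i])
    n_keep = int(len(order) * keep_fraction)
    # Return the kept indices in their original order.
    return sorted(order[:n_keep])
```

The returned indices can then be used to select both the predictions and the corresponding SAM targets before recomputing the scatter.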
A minor drawback to the generalized Gaussian loss function is that information-sparse areas of the target space can end up being "ignored" by Mangrove, which instead prefers ascribing a large variance to these points. An example of this can be seen in the halo mass-stellar mass relation in Figure 9, where low-mass halos with low-mass galaxies are effectively ignored in favor of a "safe" floor value of M* ≈ 5.9. Since the issue persists in a mass region where the SAM is already quite uncertain due to the mass resolution, and where the galaxies would not currently be of major importance, this is only of minor concern.
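This behavior can be read off a Gaussian negative log-likelihood. The expression below is a sketch assuming a diagonal covariance, with y_i the target, \hat{y}_i the prediction and \sigma_i the predicted uncertainty; the exact form and weighting used by Mangrove may differ.

```latex
-\ln \mathcal{L} = \frac{1}{2}\sum_{i}\left[\ln \sigma_i^{2} + \frac{\left(y_i - \hat{y}_i\right)^{2}}{\sigma_i^{2}}\right] + \mathrm{const.}
```

For a point that is hard to predict, increasing \sigma_i suppresses the squared-residual term at the cost of only a logarithmic penalty, so the optimizer can lower the loss by inflating the variance instead of improving the prediction.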

Extensions to Other Simulations
Mangrove can also be applied to the full magnetohydrodynamical version of IllustrisTNG, as well as any other SAM and hydrodynamical simulation. An analysis of the dependence of baryonic properties on merger history in IllustrisTNG similar to the one presented in this paper would render crucial information as to how and why SAMs and hydrodynamical simulations render different outputs. Since Mangrove works regardless of the number of snapshots included, an interesting avenue to explore could also be different subsampling methods, where specific snapshots are left out according to some scheme, after which Mangrove's performance on the subsampled merger tree is measured. This would inform decisions about how many snapshots a simulation team needs to store in order to achieve a satisfactory galaxy property reconstruction, possibly reducing the need for storing an extensive number of snapshots.

Connection to Observations
While extending our framework to other simulations is exciting, we should also consider possibilities for combination with observations. Here it should be noted that, as determined in §5, only the recent merger history, which is somewhat possible to observe, is required in order to regress the stellar mass significantly more precisely. This kind of recent merger information should be recoverable from spectroscopic missions (e.g. the PFS Galaxy Evolution Survey (Takada et al. 2014)), since recently merged galaxies can normally be identified from either a strong infrared emission from heated dust, an A-star population or Hα kinematics (Kennicutt & Evans 2012). Furthermore, we have determined that the most salient information comes from NFW profile information, which is becoming possible to measure (Niikura et al. 2015). Halo masses are measurable from dynamical masses, but most confidently from lensing. Therefore, one can imagine a simple population-level version of the model developed in this work being used along with measured stellar masses, recent merger histories and halo features to see if the connections that are emphasized by Mangrove truly are the most important.

CONCLUSION
Using the full merger history, we greatly improve upon the current state of the art for emulating galaxy properties with only dark matter properties. We have furthermore shown that interrogating Mangrove renders physical insights into the connection between merger trees and galaxy properties.
Considering first only the model's use as an emulator, three points are crucial:
• Mangrove outperforms all other known ML models when emulating the SC-SAM M*, M_cold, Z_gas, SFR, SFR_100 and M_BH by using the full merger history of a given galaxy. Predictions always improve when using the full merger history, and M* and M_BH in particular are regressed highly accurately when using the merger history. Since M* and M_BH are regressed so well, the predictions reproduce not just the median relationships, but also the width of the distributions.
• When trained, Mangrove is respectively 4 and 9 orders of magnitude faster than the SC-SAM and IllustrisTNG.Including training, this drops to respectively 2 and 7 orders of magnitude.Especially for populating large boxes (side length > Gpc), this would drastically reduce run times.
• Mangrove works at a range of redshifts and can reliably interpolate between redshifts, even if not trained on galaxies at a given redshift.
The physical insights that we have obtained from the model center around two aspects of the connection between merger trees and galactic stellar masses. First, whether the galactic stellar masses are directly related to the properties of the halo within which the galaxy resides, or to the formation history of the halo, and secondly, which dark matter features are the most valuable. Here we found three especially exciting results.
• The earliest half of the merger history of a galaxy can be discarded with only a minor loss of performance when predicting stellar mass.
• Including just 1% of the merger history closest to the present day leads to significantly improved regression.
• Mangrove identifies one especially important set of features, which encode the halo's 1-dimensional NFW profile. This can be encoded in two ways, either by using M_200c, M_500c and M_2500c or by using R_s and R_vir. The second most important parameter is V_max.

B. TRAINING, VALIDATION AND TEST SETS
As outlined in Kuhn & Johnson (2013), it is important that the final model evaluation is made on data that is not used in either the training or for optimizing hyperparameters. Therefore we here split our data into three groups: a training set, a validation set used for evaluating performance during hyperparameter tuning, and a test set for independently evaluating the performance of the final model. The test set is never used during training or hyperparameter optimization.
A 70/10/20 split is used. After optimizing the hyperparameters via the validation set, it is absorbed into the training set for the final training of the models before testing.
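A minimal sketch of such a split over merger tree indices is given below. The function name `split_indices`, the fixed seed and the use of a simple shuffle are assumptions made for the example, not a description of the Mangrove code.

```python
import random

def split_indices(n, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle n tree indices and split them into train / validation / test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # the remainder, never used for tuning
    return train, val, test

train, val, test = split_indices(1000)
# After hyperparameter tuning, the validation set is absorbed into the training set.
final_train = train + val
```

The final model is then trained on `final_train` and evaluated once on `test`.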

B.1. Training and Testing at Higher Redshifts
Since all hyper-parameter tuning is done at z = 0, only a training and testing set are constructed for predicting at z > 0. For training and testing at z > 0, it is important to keep in mind that most galaxies at any z = z 1 will be a progenitor of a galaxy at z 2 < z 1 .Thus, if one were to naively train a model on baryonic quantities at both z 1 and z 2 with randomly chosen training and testing sets, there would be significant information leakage from the training to the test set.
Therefore, we first construct the z = 0 dataset according to the above prescription. Next, for a dataset at any z_n > 0, for every merger tree, we test if it contains any part of any merger tree in any dataset at a redshift lower than z_n. If it does, we assign it to the set which the descendant galaxy is part of. All merger trees not assigned to either set are then split such that the overall dataset at z_n has an 80/20 split between training and testing.
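The assignment rule above can be sketched as follows, assuming each merger tree is represented by the set of halo identifiers it contains. The function name `assign_sets` and both dictionary layouts are hypothetical, and the sketch assumes that a tree overlapping with lower-redshift data inherits a single, consistent label.

```python
import random

def assign_sets(trees, assigned, train_fraction=0.8, seed=0):
    """Assign trees at redshift z_n to "train" or "test" without leakage.

    trees: {tree_id: set of halo ids} at z_n (hypothetical layout).
    assigned: {halo_id: "train" or "test"} for halos in lower-redshift trees.
    """
    labels, free = {}, []
    for tree_id, halos in trees.items():
        inherited = {assigned[h] for h in halos if h in assigned}
        if inherited:
            # A progenitor follows the set of its descendant galaxy.
            labels[tree_id] = next(iter(inherited))
        else:
            free.append(tree_id)
    # Split the remaining trees so that the overall dataset is 80/20.
    n_train_target = round(train_fraction * len(trees))
    n_train_have = sum(1 for v in labels.values() if v == "train")
    n_train_needed = max(0, n_train_target - n_train_have)
    random.Random(seed).shuffle(free)
    for k, tree_id in enumerate(free):
        labels[tree_id] = "train" if k < n_train_needed else "test"
    return labels
```

Trees that share halos with an already assigned descendant are never reshuffled, which is what prevents a progenitor from landing in the test set while its descendant is in the training set.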

C. UNCERTAINTIES
The meaning of uncertainties predicted by any neural network is open to interpretation. In a way, they are simply an ensemble spread from the training set, as most networks, including ours, are inherently deterministic. Although they lack a clear interpretation, we can at least examine their distributions and accuracies. Since we minimize a Gaussian likelihood, the pull/z-score distribution (z = Δy/σ) should be approximated by a unit Gaussian and the reduced χ² (χ²_N = χ²/N) should be close to 1. In Figure 10, we show the distributions of logarithmic uncertainties, the distribution of z-scores (pull plot) along with unit Gaussians, as well as relationships between the predictions/residuals and logarithmic uncertainties. Immediately noticeable is the fact that the two SFRs and M_BH have strong bimodalities in the uncertainties, and that the uncertainties in general span many orders of magnitude. From the pull plots, we see that all distributions are approximately Gaussian, but with variances ≈ 25% too high, so there is some departure from Gaussianity. The reduced χ² values are also too high in general, indicating generally overconfident (too low) uncertainty estimates. It should be noted, though, that the χ² value assumes that the uncertainties are correctly approximated by a Gaussian, which isn't quite true in our case. Therefore, the χ² should be used with some caution. We see some correlations between the predictions and the uncertainties, with especially noticeable cases being the high uncertainties given to both low and high M*, the high uncertainties given to the highest Z_gas, SFR, SFR_100 and M_BH, as well as the low uncertainties given to higher values of M_cold.
Figure 10. In the first column we show distributions of logarithmic uncertainties. In the second column we show distributions of pulls/z-scores Δy/σ, and a unit Gaussian to guide the eye, along with annotations of metrics concerning the pull distributions. The third column shows the relationship between predicted values and uncertainties, along with a red dashed line indicating their median relationship, as well as being annotated with a Pearson correlation coefficient. The fourth and final column shows the relationship between the residuals and the predicted uncertainties, with a clear broadening as one goes toward higher predicted uncertainty.
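The two diagnostics can be computed as in the sketch below. The function name `pull_diagnostics` and the sign convention Δy = target minus prediction are assumptions made for the example.

```python
from statistics import pvariance

def pull_diagnostics(y_true, y_pred, sigma):
    """Return the pulls, their variance, and the reduced chi-squared."""
    # Pull (z-score) of each prediction: residual in units of predicted sigma.
    pulls = [(t - p) / s for t, p, s in zip(y_true, y_pred, sigma)]
    # Reduced chi-squared as chi^2 / N, with N the number of predictions.
    chi2_reduced = sum(z * z for z in pulls) / len(pulls)
    return pulls, pvariance(pulls), chi2_reduced
```

For well-calibrated Gaussian uncertainties, both the pull variance and the reduced χ² should be close to 1; values above 1 point to uncertainties that are too small.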

D. INTERPRETATION PLOTS
There are several checks that we can perform in order to make sure that Mangrove is predicting physically meaningful things, as well as probing the regions where Mangrove struggles. The very tight scatter is highly indicative that most physically relevant relationships will be reproduced, but here we probe these relations further. We follow tests done by Agarwal et al. (2018) and Gabrielpillai et al. (2021) for:
• Median stellar mass - halo mass deviation dependence on NFW concentration. As shown by G21, this is a property reproduced by both IllustrisTNG and the SC-SAM. This comparison can be found in Figure 11.
• Halo mass - variable relations (as for stellar mass in Figure 9) are also generally very useful for identifying the regions where Mangrove fails to reproduce the SAM. This general comparison can be found in Figure 13. Here we quickly identify one of the reasons for Mangrove's poor performance on SFR and SFR_100, namely that it doesn't successfully capture the two diverging branches of SFR around M_halo ≈ 11.7, regressing only the lower branch accurately. We also observe that M*, M_cold, Z_gas and M_BH generally follow the median relation as well as reproducing the scatter. The scatter isn't reproduced for the two SFR targets. This is a problem discussed in Agarwal et al. (2018), which our method also improves significantly upon. We also investigate how these quantities evolve within fixed bins of a given halo mass in Figure 14.
• Residual - residual plots are also very useful for investigating the interdependence between predictions. Figure 12 shows residual - residual relations for Mangrove relative to the SAM targets, along with the slope (a) and intercept (b) of a line fitted using least squares (not using the σ predicted by Mangrove).
From this plot we clearly observe a strong interdependence between SFR and SFR_100 residuals (as expected), positive correlations between M* and SFR / SFR_100 / Z_gas residuals, positive correlations between M_cold and SFR / SFR_100 residuals (analogous to a Kennicutt-Schmidt relation) and a negative correlation between M_cold and Z_gas residuals.
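The slope and intercept annotated in such a plot follow from an ordinary, unweighted least-squares fit, sketched below in closed form. The function name `fit_line` is hypothetical, and the inputs stand for the residuals of two targets evaluated on the same galaxies.

```python
def fit_line(x, y):
    """Unweighted least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b
```

A positive slope between the M_cold and SFR residuals, for instance, corresponds to the behavior described above, where an overestimated cold gas mass goes together with an overestimated star formation rate.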
Besides these tests, we also provide precision figures in the style of Figures 3 and 6 for the performance of the final halo only regression (Figure 15), as well as for the 50% lowest variance objects (Figure 16).
Figure 13. Relation between halo masses and target parameters for all targets for both the SAM and Mangrove predictions, in similar style as Figure 9, with outliers clearly marked. We furthermore show general trends in the right column, where the dashed and solid lines show the medians and the shaded areas show the 16th and 84th percentiles for the parameter in question for both the SAM and Mangrove. Here we immediately see the source of some of the errors, for example the inability of Mangrove to accurately capture the two diverging branches in SFR.
Figure 14. Relation between M_halo and target parameter residuals with respect to the median of the target parameters in M_halo bins. 20 bins are used. This is a separate way of visualizing the main points of Figure 13, i.e. that the dispersion is quite accurate for all parameters but SFR and SFR_100. There is thus hope for producing a full galaxy property hyperplane with Mangrove or a similar GNN-based method.
Figure 16. Same as Figures 3 and 6, but using cuts where only the 50% lowest variance predictions, as predicted by Mangrove, are included. As can be clearly seen, Mangrove does significantly better, even though it tends to avoid regions that are information-poor, like the massive black hole region (M_BH > 6.1).
Figure 17. Same as Figures 3 and 6, but where the differences are taken between two different runs of the SC-SAM with nothing changed but the random seeds. The numbers aren't exactly the same as in Table 1, since those are the mean of the metrics of different SAM runs.

E. MODEL ARCHITECTURE
Here we wish to describe the architecture a bit more in depth for the purpose of reproducibility. The architecture is visualized in Figure 18. The merger tree is passed through a 2-layer Multi-Layer Perceptron (MLP) to encode the node state before any graph convolutional layers. Then the encoded merger tree is passed through 5 GraphSAGE layers, each with a ReLU activation layer between. The encoded merger tree is then summed over with a global sum pooling. Using a global max pooling renders similar performance. Each of the targets then has its own 3-layer MLP decoder "head". A "head" means a different branch of the model with all heads taking the same input, allowing each head to predict more independently of the others. If the uncorrelated Gaussian loss is used, no off-diagonal components of the covariance matrix are predicted, and Σ is diagonal, corresponding to the usual Gaussian uncertainties. The layer normalization description can be found in Ba et al. (2016). After the sequence of convolutional layers, a differentiable global pooling operator is applied across all nodes in order to standardize the output size. The dimensionality of the latent space (known as the number of hidden states) was 128.
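The data flow described above (encode, message passing, sum pooling, per-target heads) can be illustrated with the dependency-free sketch below. It is not the Mangrove implementation: the learnable functions are passed in as plain callables, layer normalization is omitted, a ReLU is applied after every layer, and nodes without progenitors receive a zero message, all of which are simplifying assumptions.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def sage_layer(states, edges, f, g):
    """GraphSAGE-style update: h_i <- ReLU(f(h_i) + g(mean of progenitor states))."""
    dim = len(states[0])
    incoming = {i: [] for i in range(len(states))}
    for src, dst in edges:  # messages flow forward in time: progenitor -> descendant
        incoming[dst].append(states[src])
    new_states = []
    for i, h in enumerate(states):
        nbrs = incoming[i]
        mean = [sum(col) / len(nbrs) for col in zip(*nbrs)] if nbrs else [0.0] * dim
        new_states.append(relu([a + b for a, b in zip(f(h), g(mean))]))
    return new_states

def sum_pool(states):
    """Global sum pooling: one fixed-size vector per tree."""
    return [sum(col) for col in zip(*states)]

def forward(features, edges, encoder, layers, heads):
    states = [encoder(x) for x in features]   # per-node encoder
    for f, g in layers:                        # one (f, g) pair per layer
        states = sage_layer(states, edges, f, g)
    pooled = sum_pool(states)
    return [head(pooled) for head in heads]    # one decoder head per target
```

In the architecture described above, `encoder` would be a 2-layer MLP, `layers` would hold 5 pairs of learnable functions acting on 128 hidden states, and each entry of `heads` would be a 3-layer MLP.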

F. TRAINING THE MODEL
We train the models using the PyTorch OneCycleLR learning rate scheduler (Smith & Topin 2018; Paszke et al. 2019), using a max learning rate of 10^-2 and a batch size of 256 with the Adam optimizer (Kingma & Ba 2017). For the scheduler we employ a 15% start percentage and a final division factor of 10^3. The models were trained for 1000 epochs when optimized for all targets, and 500 for 2 targets or fewer, as this was determined during hyperparameter (see footnote 13) optimization to be above the average number of epochs required for a model to converge. A Gaussian quantile transform (see footnote 14), which maps each parameter to a Gaussian distribution defined by the quantiles of the parameter in question, was fit on the training set and applied to all input data before training, except for categorical data such as the number of progenitor halos or whether the halo had recently undergone a major merger, which is encoded as a boolean in the data. This makes training more stable at the risk of destroying some information. We also attempted using a standard scaler, which scales data to have zero mean and unit variance. This resulted in slightly higher scatters, by about 3-5%.
13 The hyperparameters of the model and training scheme are defined as parameters not of the model itself, but about the model or training scheme. Examples include the dimensionality of the latent space, the number of layers and the learning rate.
14 sklearn source code.
A series of learning rate schedulers (constant, warmup with exponential, cosine annealing and one cycle) were attempted, all with reasonable success. Although the constant learning rate works well, it is suboptimal for long runs, and consistently underperforms by ≈ 5%. Among the others, we generally observe similar performance, although the cosine annealing schedule renders results with higher variance between runs.
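To make the shape of the one cycle schedule concrete, the sketch below re-implements a simplified two-phase version with cosine interpolation in plain Python. It is a stand-in for illustration, not the PyTorch scheduler itself: the initial division factor of 25 is an assumption (the source does not state it), and the step bookkeeping is simplified.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-2, pct_start=0.15,
                 div_factor=25.0, final_div_factor=1e3):
    """Learning rate at `step` for a simplified one-cycle schedule."""
    initial_lr = max_lr / div_factor          # assumed starting point
    min_lr = initial_lr / final_div_factor    # final learning rate
    warmup_steps = pct_start * total_steps
    if step < warmup_steps:
        # Phase 1: rise from initial_lr to max_lr.
        start, end = initial_lr, max_lr
        frac = step / warmup_steps
    else:
        # Phase 2: anneal from max_lr down to min_lr.
        start, end = max_lr, min_lr
        frac = (step - warmup_steps) / (total_steps - warmup_steps)
    # Cosine interpolation from `start` (frac = 0) to `end` (frac = 1).
    return end + (start - end) * (1.0 + math.cos(math.pi * frac)) / 2.0
```

The 15% start percentage and the final division factor of 10^3 quoted above correspond to the `pct_start` and `final_div_factor` arguments of `torch.optim.lr_scheduler.OneCycleLR`.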

G. REPRODUCING OUR RESULTS
The code for reproducing our results can be found on GitHub at https://github.com/astrockragh/Mangrove. The repository is provided under the MIT license. Data can be obtained from the IllustrisTNG website (https://www.tngproject.org/data/).

Figure 1. The merger tree of a 10^8.5 M⊙ galaxy. Only nodes that are progenitors, merging or final (at z = 0) are shown. Both colour and dot size encode the mass of the halos. The merger tree encodes the full formation history and temporally evolving dark matter features of all halos that merge to form the final halo. Galaxies co-evolve with their dark matter halo, and the dark matter merger trees are therefore useful for determining galaxy formation.

Figure 2. An illustration of the full workflow of Mangrove. Merger trees are encoded as graphs, which are then passed through Mangrove. Messages are passed forward in time only, since merger trees are directed in time. Node states are updated by applying a learnable function f to the current node states, applying a learnable function g to the mean of the node states of the neighboring nodes and adding these two as described in §3. Transferring into the latent space is marked by shaded colors and a change of shape from circles to rectangles. Adding neighborhood information is marked by mixing of colors. The model makes message-passing steps, after which the graph nodes are summed over, and this sum is then decoded by another learnable function, h, which gives the predictions and the Gaussian covariance matrix. All learnable functions are Multi-Layer Perceptrons.

Figure 3. Histogram of the SAM M* versus predicted M* with logarithmically colored bin heights. The left panel shows the target-prediction relation for our GNN, Mangrove, and the right panel shows the target-prediction relation of the common abundance matching approach. Kernel Density Estimates of the SAM and Mangrove-predicted distributions are shown on the relevant axes. By leveraging the formation history of the galaxy via the merger tree, we obtain precise and accurate predictions of M*. The prediction scatter is improved by almost a factor of two compared to the state-of-the-art single-halo method and by more than a factor of 4 compared to the widely used abundance matching method (see Table 1). Mangrove's M* scatter is comparable to, but still well above, the probabilistic limit of the SAM, which is at 0.043 dex.

Figure 4. Median, 16th and 84th percentile of 10 models trained to predict only M* at a series of different redshifts in three different ways. Red is when testing at the same single redshift where Mangrove was trained, black is from training and testing jointly at z ∈ {0, 0.5, 1, 2} and green is from training jointly at z ∈ {0, 0.5, 1, 2} and testing at z ∈ {0.25, 0.75, 1.5, 1.75}. Mangrove predicts M* at previously unseen redshifts with similar or lower scatter than the redshifts at which Mangrove was trained.

Figure 5. Same as Figure 4, but including a point from extrapolating to z = 3 using the models trained at z ∈ {0, 0.5, 1, 2}. The predictions have much higher scatters.

Figure 6. Histogram of SAM targets versus the predicted targets of our GNN, Mangrove, with logarithmically colored bin heights. Recovering galactic M_cold, Z_gas, SFR, SFR_100 and M_BH renders mostly tight and unbiased relationships across the entire ranges of these galactic features, with the notable exception of targets where SFR or SFR_100 > 0.1. This flaw is spurious and should be possible to mitigate. All of these results outperform results from the literature for all precision metrics (see Table 1). The fact that such mappings exist implies that the full hyperplane of galactic properties may be emulated with Mangrove.

Figure 9. M_halo - M* relation for the SAM (middle), and as predicted by our GNN, Mangrove (upper). For the 99% of points closest to the median M_halo - M* relation, points are plotted as a histogram with logarithmic colouring, whereas the remaining outliers are plotted as points. The outliers are regressed to a remarkable precision, especially for trees with M* far above the median M_halo - M* relation. As can be seen comparing the low M_halo, M* region, the Gaussian loss allows Mangrove to "give up" on some low-mass galaxies. This "limit" is indicated by a dashed black line. Figure 13 in the appendix shows the same relation for all targets.

Figure 11. The difference between M*/M_halo and the median value of M*/M_halo for halo-galaxy pairs in bins of halo mass, plotted as a function of the NFW concentration parameter of the halo. The dashed lines show the medians and the shaded areas show the 16th and 84th percentiles. The left panel shows the relationship for the SC-SAM, the middle panel shows that for Mangrove, and the rightmost panel shows a comparison between the two. Equal bin widths are chosen for comparison with Figure 12 in G21.

Figure 12. Residual - residual relations for all targets, along with linear (a*x+b) fits. Each window is annotated with the slope (a) and the intercept (b) of the residual - residual relation in question. The plot is made with the corner package (Foreman-Mackey 2016).

Figure 15. Same as Figures 3 and 6, but for regressing using only information from the final halo. We see a general decline in performance compared to using the full merger tree, although median relations are improved, and the SFR is slightly better when predicting SFR > 0.2. A single node is still technically a graph; however, no graph structure is used, so the GNN label may be slightly misleading.

Figure 18. A diagram of Mangrove's architecture for predicting values and the full covariance matrix. N_t is the number of targets one wishes to regress. The number of times a given block is repeated is written by the upper right corner of the block. A linear layer is the same as a 1-layer MLP. The flow is from left to right, but inside each box the flow is from top to bottom. Each layer operates with 128 hidden states. The off-diagonal covariances are not included in the work in this paper.

Figure 19. A series of sample training curves for regressing M* only. Hard lines are smoothed with an exponential kernel with strength 0.6, and shaded are the actual validation scatters. As can be seen, the training can be noisy, and have short, strong spikes, but eventually converges given enough epochs.

Table 3. Taking the 50% highest confidence regressions as predicted by Mangrove, our results improve greatly, indicating that the predicted variances could be used as a highly efficient filter. A figure to visualize this improvement can be found in Appendix §D.