Mining the Information Content of Member Galaxies in Halo Mass Modeling

Motivated by previous findings that the magnitude gap between certain satellite galaxies and the central galaxy can be used to improve the estimation of halo mass, we carry out a systematic study of the information content of different member galaxies in the modeling of the host halo mass using a machine-learning approach. We employ data from the hydrodynamical simulation IllustrisTNG and train a random forest algorithm to predict a halo mass from the stellar masses of its member galaxies. Exhaustive feature selection is adopted to disentangle the importance of different galaxy members. We confirm that an additional satellite does improve the halo mass estimation compared to that estimated by the central alone. However, the magnitude of this improvement does not differ significantly using different satellite galaxies. When three galaxies are used in the halo mass prediction, the best combination is always that of the central galaxy with the most massive satellite and the smallest satellite. Furthermore, among the top seven galaxies, the combination of a central galaxy and two or three satellite galaxies gives a near-optimal estimation of halo mass, and further addition of galaxies does not raise the precision of the prediction. We demonstrate that these dependences can be understood from the shape variation of the conditional satellite distribution, with different member galaxies accounting for distinct halo-dependent features in different parts of the cumulative stellar mass function.


INTRODUCTION
According to the standard cosmological paradigm, galaxies are believed to form and evolve in dark matter halos (White & Rees 1978). Hence the properties of galaxies are tightly linked to the properties of their dark haloes, giving rise to the so-called galaxy-halo connection. Studying this connection is of great value in understanding the mechanisms of galaxy formation and evolution, while also providing a way to infer the properties of dark matter haloes through galaxy observations.
Some recent studies have shown that the magnitude or mass difference (a.k.a. gap) between the central galaxy and some satellite galaxies may contain information about the assembly history of their host halo (e.g. Harrison et al. 2012; Deason et al. 2013; Solanes et al. 2016; Kang et al. 2016) and thus could be used to tighten the stellar-to-halo mass relation (SHMR). The magnitude gap between the brightest central galaxy (BCG) and the second brightest galaxy (M12) was originally studied as a diagnostic for selecting fossil groups (Ponman et al. 1994; Jones et al. 2003; Sales et al. 2007; von Benda-Beckmann et al. 2008). Later, Dariush et al. (2010) and Tavasoli et al. (2011) proposed that the magnitude gap between the BCG and the fourth brightest galaxy (M14) is a better indicator for selecting fossil groups. Subsequent studies have shown that using these gaps, in addition to the central luminosity, can indeed substantially reduce the scatter in the halo mass estimation (e.g. More 2012; Hearin et al. 2013; Shen et al. 2014; Lu et al. 2015; Golden-Marx & Miller 2018, 2019, 2021; Wang et al. 2021a). However, it is still not known whether the improvement in the halo mass estimation is equally effective for gaps between the BCG and differently ranked satellite galaxies, which gap can optimally constrain the dark halo mass, or equivalently, which satellite galaxy provides the most information in constraining the halo mass.
A related question is how many satellites are needed to optimally constrain the halo mass. Taking information from all member galaxies would certainly improve the halo mass constraint. For example, as satellite galaxies are expected to trace the dark matter distribution in a halo (e.g. Han et al. 2016), the total stellar mass or total luminosity can be used as a good proxy for halo mass (e.g. Zaritsky et al. 1997; Prada et al. 2003; Yang et al. 2005; Conroy et al. 2007; Han et al. 2015; Wang et al. 2021b). However, the complete population of member galaxies is usually not available observationally. Bradshaw et al. (2020) demonstrated that employing the sum of the stellar masses of the central and the $N$ most massive satellites (cen + $N$) as a new halo mass estimator can effectively reduce the scatter compared to using only the stellar mass of the central galaxy ($M_{*,\rm cen}$). They also showed that the scatter obtained with this estimator already approaches that obtained with the total stellar mass of all galaxy members ($M_{*,\rm tot}$). However, it should be noted that the total stellar mass alone could miss information from the relative mass distribution of different satellites, and thus may not itself be optimal in estimating the halo mass. Thus it remains to be seen which of the satellite galaxies play a major role and how to best combine the galaxy information to maximize the accuracy of halo mass prediction when only a few satellite galaxies are available.
In this work, we seek to clarify the roles played by different satellites in the estimation of the halo mass. The data we employ are from the hydrodynamic simulation IllustrisTNG. We make use of a machine learning technique called Random Forest (RF) regression to model the nonlinear joint connections between halo mass and the first few satellites, while sorting out the relative importances of different satellite combinations using the exhaustive feature selection method. We find that there is indeed an optimal combination of satellites that can lead to a nearly saturated improvement in the halo mass constraint. We further examine the results in the context of the conditional satellite distribution.
Table 1. Parameters of TNG100-1 and TNG300-1. From left to right: side length of the simulation box, the number of dark matter particles, and the masses of dark matter and baryonic particles.
The paper is organized as follows. In section 2 we introduce the IllustrisTNG simulation on which our analysis is based, and the process of data processing and filtering. In section 3 we describe the machine learning method and the training process. The main results from the machine learning analysis are presented in section 4. In section 5 we examine the results in the context of the conditional satellite distribution. Summary and conclusions are presented in section 6.

IllustrisTNG
Our analysis is based on data from IllustrisTNG, a suite of state-of-the-art magnetohydrodynamical cosmological simulations (Naiman et al. 2018; Springel et al. 2018; Pillepich et al. 2018; Nelson et al. 2018; Marinacci et al. 2018) run with the moving-mesh code Arepo (Springel 2010). TNG is the successor of the original Illustris simulation, improving on many aspects of its galaxy formation recipes. TNG follows the Λ Cold Dark Matter cosmology, adopting parameters from the Planck observations (Planck Collaboration et al. 2016), with $\Omega_\Lambda = 0.6911$, $\Omega_m = 0.3089$, $\Omega_b = 0.0486$, $\sigma_8 = 0.8159$ and $h = 0.6774$. The full TNG suite consists of simulations run in three different box sizes of roughly 50, 100 and 300 Mpc, referred to as TNG50, TNG100 and TNG300 respectively, each of which is also run at three or four levels of resolution. In this work, we choose data from the highest resolution runs of TNG100 and TNG300 (named TNG100-1 and TNG300-1 in the data release), as TNG300 has the largest volume and therefore provides a large sample, while TNG100 has a higher resolution compared to TNG300. Given that the cosmologies are the same for both simulations, the data from the two are joined together in order to cover a larger halo mass range. More details on TNG100 and TNG300 are provided in Table 1.

The halo and galaxy sample
Our study focuses on the sample at redshift $z = 0$. The halo mass is defined as $M_{\rm 200mean}$, the total mass in a sphere around the halo centre with an enclosed density of 200 times the mean density of the universe. We define satellite galaxies as those located within the virial radius, $R_{\rm 200mean}$, of the host halo, excluding the central galaxy. As the stellar mass function of the simulations becomes incomplete at $\sim 10^{6}\,M_\odot$ ($\sim 10^{7}\,M_\odot$) for TNG100 (TNG300), we only consider galaxies above these stellar mass limits respectively.
We further demand that each halo contains at least 7 member galaxies in the halo mass range studied. As lower mass halos are typically resolved with fewer satellites, this richness cut translates to a cut in the halo mass for our sample. In Fig. 1 we plot the stellar mass-halo mass relation for the top 7 most massive galaxies in each halo, as well as the fraction of halos with more than 7 members as a function of halo mass. As can be seen from Fig. 1, limiting the halo mass to $M > 10^{12.3}\,M_\odot$ in TNG100 ensures that the top 7 galaxies are well resolved with stellar masses above $10^{6}\,M_\odot$. For the TNG300 sample, a corresponding halo mass limit of $10^{12.8}\,M_\odot$ can be found. Under these selection criteria, we are left with 1235 valid halos in TNG100 and 8413 halos in TNG300, spanning a combined mass range of $10^{12.3} < M_{\rm halo}/M_\odot \lesssim 10^{15.3}$. It is known that the galaxy properties do not fully converge between TNG100 and TNG300 due to the resolution dependence of the hydrodynamical solver used. To correct for this, we simply multiply the stellar masses in TNG300 by a constant factor of 1.4 before combining them with TNG100, following Pillepich et al. (2018).
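As an illustration only, the selection described above can be sketched in Python. The function name `top_n_log_masses`, the flat input arrays and the toy catalogue are hypothetical and are not part of the TNG data products; applying the completeness limit before the rescaling factor is also an assumption of this sketch.

```python
import numpy as np

def top_n_log_masses(halo_ids, stellar_masses, n_top=7, mass_floor=1e6, boost=1.0):
    """Return {halo_id: log10 stellar masses of the n_top most massive members}.

    Galaxies below `mass_floor` are discarded, halos left with fewer than
    `n_top` members are dropped, and `boost` rescales the stellar masses
    (e.g. 1.4 for the TNG300 resolution correction).
    """
    halo_ids = np.asarray(halo_ids)
    raw = np.asarray(stellar_masses, dtype=float)
    keep = raw >= mass_floor                 # completeness cut on raw masses (assumption)
    halo_ids, masses = halo_ids[keep], raw[keep] * boost
    features = {}
    for hid in np.unique(halo_ids):
        members = np.sort(masses[halo_ids == hid])[::-1]   # descending in mass
        if members.size >= n_top:                            # richness cut
            features[int(hid)] = np.log10(members[:n_top])
    return features

# Toy catalogue: halo 0 has 8 galaxies (one below the limit), halo 1 has only 3.
halo_ids = [0] * 8 + [1] * 3
stellar = [1e11, 5e10, 2e10, 1e10, 5e9, 2e9, 1e9, 1e5, 1e10, 1e9, 1e8]
feats = top_n_log_masses(halo_ids, stellar, n_top=7, mass_floor=1e6)
feats_boosted = top_n_log_masses(halo_ids, stellar, boost=1.4)
```

In this toy case only halo 0 survives the richness cut, and its feature vector holds the seven ranked logarithmic stellar masses used as model inputs.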

METHOD
We employ the Random Forest (RF) algorithm (Breiman 2001) as implemented in scikit-learn (Pedregosa et al. 2011) to analyze the relation between halo mass and the masses of member galaxies. RF is a supervised machine learning algorithm that can be trained to map out the relation between the input and output data in a non-parametric way. It is an ensemble method that aggregates many base estimators called decision trees via the bagging approach. This enables the RF to overcome the common problem of overfitting faced by a single decision tree and improve the generalization ability of the model. Due to its simplicity and efficiency, RF has been widely used in many recent studies in astrophysics (e.g., Hoyle et al. 2015; Man et al. 2019; Petulante et al. 2021; Shi et al. 2021).
Figure 1. Stellar mass distributions of member galaxies in TNG100 (upper panel) and TNG300 (bottom panel). Each coloured solid curve shows the median stellar mass to halo mass relation for galaxies of a given rank as labelled, while the corresponding shaded region is bounded by the 16th and 84th percentiles in the stellar mass distribution. The light purple points in the background show the distribution of all the member galaxies. The black dashed line shows the fraction of halos with more than 7 members as a function of halo mass.
In the following we explain the algorithm in more detail.

Decision tree
As the basic unit of an RF, a decision tree is a tree-like decision model. For a given input parameter space, a decision tree aims at partitioning the parameter space into multiple nodes such that each node is mapped to a single prediction. The partitioning is done by splitting the parameter space along one dimension at a time according to a certain criterion, forming a tree-like structure after multiple operations. The complete input parameter space forms the root node of the tree, while nodes that no longer split are called leaf nodes. The predictions in the leaf nodes can be either discrete classes or continuous values, corresponding to a classification or a regression tree. For this analysis we use regression trees.
Consider a given data set consisting of $n$ observations, $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_i, y_i), \ldots, (\mathbf{x}_n, y_n)\}$, where $\mathbf{x}_i$ is an $m$-dimensional vector of $m$ input features, $y_i$ is the target feature we want to predict, and $i = 1, \ldots, n$ labels the $n$ observations. Expressing the decision tree as a function $f(\mathbf{x})$, the goal of the regression is to find an $f(\mathbf{x})$ that minimizes the Mean Squared Error (MSE) of the data set (also referred to as impurity),
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - f(\mathbf{x}_i)\right]^2 .$$
This is achieved by choosing an appropriate division at each step to minimize the MSE in each node, with $f(\mathbf{x})$ replaced by the mean value of $y$ in the node. More specifically, starting from the root node, we recursively divide each node into two child nodes $R_1(j, s) = \{\mathbf{x}\,|\,x^{(j)} \le s\}$ and $R_2(j, s) = \{\mathbf{x}\,|\,x^{(j)} > s\}$ according to a feature $j$ and a threshold $s$. To minimize the MSE of the final tree, we choose $(j, s)$ to minimize the MSE of each division,
$$\min_{j,s}\left[\sum_{\mathbf{x}_i \in R_1(j,s)} (y_i - c_1)^2 + \sum_{\mathbf{x}_i \in R_2(j,s)} (y_i - c_2)^2\right],$$
where $c_1$ and $c_2$ correspond to the averages of the labels $y_i$ in $R_1$ and $R_2$. The division continues until some stopping criteria regarding the depth of the tree or the size of the leaf node are satisfied, which we specify in section 3.4. Once a tree is constructed, it is straightforward to make predictions with it. A new input observation $\mathbf{x}_{\rm new}$ can be assigned to a leaf node through a tree walk, and the corresponding prediction is found as the average $y$ of the training data in the leaf.
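The search for the dividing feature $j$ and threshold $s$ at a single node can be written as a brute-force loop. The following is a minimal sketch, not the optimised routine used inside scikit-learn; the function name `best_split` and the choice of midpoints between consecutive feature values as candidate thresholds are assumptions made for illustration.

```python
import numpy as np

def best_split(X, y):
    """Search the feature j and threshold s that minimise the summed squared
    error of the two child nodes R1 = {x_j <= s} and R2 = {x_j > s}."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    best = (None, None, np.inf)              # (j, s, loss)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for s in 0.5 * (values[:-1] + values[1:]):
            left = X[:, j] <= s
            y1, y2 = y[left], y[~left]
            # c1 and c2 are the mean labels of the two child nodes.
            loss = ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best

# Toy node: the target is a step function of feature 0 only.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([10.0, 10.0, 20.0, 20.0])
j, s, loss = best_split(X, y)
```

Growing a full tree amounts to applying this search recursively to each child node until a stopping criterion is met.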

Random forest
A decision tree can easily overfit the data. For example, when a leaf node contains only one observation, any noise in the observation will be inherited by the model prediction. To overcome this problem, a random forest works by combining the predictions of many trees, each constructed from a bootstrap realization of the original data.
For each tree in the forest, when selecting splitting features on a node, a further pooling step is added to restrict the selection to a random subset of the original feature set. This randomness further enhances the generalization ability of the model. The final prediction is obtained by combining (averaging in the case of regression) the predictions of all trees.
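The two sources of randomness described above, bootstrap resampling and random feature subsets at each split, can be made explicit with a hand-rolled forest built from scikit-learn decision trees. This is a sketch for illustration rather than the `RandomForestRegressor` implementation itself; the helper names `fit_forest` and `predict_forest` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=50, max_features=0.5, seed=0):
    """Fit n_trees regression trees, each on a bootstrap realization of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)     # sample n rows with replacement
        tree = DecisionTreeRegressor(
            max_features=max_features,       # random feature subset at each split
            random_state=int(rng.integers(2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Average the predictions of all trees (regression case)."""
    return np.mean([t.predict(X) for t in trees], axis=0)

# Toy regression problem with noisy targets.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, 400)
X_new = rng.uniform(0, 1, size=(200, 3))
truth = X_new[:, 0] + 0.5 * X_new[:, 1]

trees = fit_forest(X, y)
mse_forest = np.mean((predict_forest(trees, X_new) - truth) ** 2)
mse_single = np.mean((trees[0].predict(X_new) - truth) ** 2)
```

Comparing `mse_forest` with `mse_single` shows how averaging suppresses the noise that a single fully grown tree inherits from its leaves.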

Feature importance and Exhaustive Feature Selection
A random forest not only serves as a predictive model that fits the data, but can also output an importance score for each feature quantifying its relative contribution to the prediction, which is of great significance in feature selection and helps to understand the underlying model construction process. In the scikit-learn (Pedregosa et al. 2011) package, the RF feature importance ranking is based on the Mean Decrease Impurity (MDI), which quantifies the average reduction in MSE contributed by the tree divisions in each feature. We provide the detailed definition of the MDI importance in Appendix A.
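For reference, the MDI importances and their tree-to-tree scatter can be read out from a fitted scikit-learn forest as sketched below. The toy data are an assumption of this example: two features are near-duplicates of the same underlying signal and one is pure noise, so that the share of MDI assigned to partially redundant inputs can be inspected.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
x_a = signal + rng.normal(0, 0.05, n)    # slightly cleaner copy of the signal
x_b = signal + rng.normal(0, 0.10, n)    # slightly noisier copy of the signal
x_c = rng.normal(size=n)                 # irrelevant feature
X = np.column_stack([x_a, x_b, x_c])
y = signal

rf = RandomForestRegressor(n_estimators=50, max_features=None, random_state=0)
rf.fit(X, y)

mdi = rf.feature_importances_            # normalised MDI, sums to 1
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])
mdi_std = per_tree.std(axis=0)           # scatter among individual trees
```

In this toy setup the cleaner copy `x_a` receives a larger MDI than `x_b`, although the two features carry nearly the same information.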
Although the MDI-based feature importance of random forests is widely used for feature selection, it has been shown in the literature that such importances may produce misleading results (Strobl et al. 2007; Louppe 2014; Scornet 2020). For completely independent variables and in the absence of variable interactions, MDI provides a variance decomposition of the output. However, for partially redundant variables that carry similar information, which is almost always the case in practice, the one with slightly more information may always stand out in the feature selection process of node splitting, leaving little MDI to the others. For this reason, we cannot completely rely on the importance ranking given by the random forest. Hence we take the strategy of still using the random forest as the regression model while combining it with an exhaustive method for feature selection.
Specifically, we try all possible feature combinations and train one model using each combination. The performances of the models are then compared to select the best combination of features for each number of features. The best feature combinations at each step are selected according to the $R^2$ score, which is used to evaluate the performance of regression models. The $R^2$ score is defined as
$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2},$$
where $y_i$ is the true target variable with $\bar{y}$ being its mean value in the sample, and $\hat{y}_i$ is the predicted value for observation $i$. This approach enables us to identify the most important feature combinations without having to worry about feature correlations. The evolution of the performance with the addition of features can also be used to understand the unique contributions of features to the model improvement.
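The exhaustive search over feature combinations can be sketched as follows, assuming scikit-learn's `cross_val_score` with the `"r2"` scorer as the performance measure. The function name `exhaustive_selection` and the four-feature toy data are hypothetical; with the seven features used in this work the same loop would visit $2^7 - 1 = 127$ combinations.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def exhaustive_selection(X, y, n_splits=5, **rf_kwargs):
    """Return {k: (best feature tuple, mean R^2)} for every subset size k."""
    n_features = X.shape[1]
    best = {}
    for k in range(1, n_features + 1):
        for combo in combinations(range(n_features), k):
            model = RandomForestRegressor(random_state=0, **rf_kwargs)
            score = cross_val_score(
                model, X[:, list(combo)], y, cv=n_splits, scoring="r2"
            ).mean()
            if k not in best or score > best[k][1]:
                best[k] = (combo, score)
    return best

# Toy data: only features 0 and 2 carry information about the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(0, 0.1, 300)
best = exhaustive_selection(X, y, n_estimators=20)
```

Because every combination is retrained from scratch, the ranking does not rely on how a single forest shares importance among correlated inputs, at the cost of a number of fits that grows exponentially with the number of features.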

Tuning the hyperparameters
To achieve the best performance of a machine learning model, it is crucial to tune the hyperparameters of the model. For RF in scikit-learn, there are several major hyperparameters to be tuned: n_estimators, max_depth, min_samples_leaf and min_samples_split. We start the tuning process from n_estimators, and obtain the number of trees in a range that makes the model perform best. The remaining parameters are then tuned one by one after fixing the previous parameters to their optimal values. The final set of adopted hyperparameters is presented in Table 2.
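The one-at-a-time tuning strategy can be sketched as below. The candidate grids and the toy data are placeholders chosen for this example and do not correspond to the adopted values in Table 2; the helper name `tune_sequentially` is likewise hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def tune_sequentially(X, y, grids, cv=3):
    """Tune hyperparameters one at a time, fixing earlier ones at their best value."""
    chosen = {}
    for name, candidates in grids.items():       # dict order sets the tuning order
        scores = []
        for value in candidates:
            model = RandomForestRegressor(random_state=0, **chosen, **{name: value})
            scores.append(cross_val_score(model, X, y, cv=cv, scoring="r2").mean())
        chosen[name] = candidates[int(np.argmax(scores))]
    return chosen

# Placeholder grids, starting from n_estimators as described in the text.
grids = {
    "n_estimators": [10, 30],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
}

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 300)
chosen = tune_sequentially(X, y, grids)
```

This coordinate-wise search is cheaper than a full grid search, but it can miss interactions between hyperparameters.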

Model Training and Performance
Our fiducial model is the random forest model that adopts the hyperparameters given in Table 2, taking the logarithmic stellar masses of the top seven galaxies in the mass ranking as the input feature variables and the logarithmic dark matter halo mass as the target variable to be predicted. Cross-validation is employed to evaluate the model performance, dividing the dataset into a training set and a test set, which are used for training and testing respectively. Figure 2 shows the relation between the predicted and true halo masses in the test set of our model. Overall, the model predicts the true halo mass without bias across the entire mass range, with a fairly small total MSE of 0.01 and an $R^2$ score of 0.946. The deviation of the few data points at the highest mass end is due to the limited number of haloes in this mass range that get allocated into a single leaf node.
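The training and evaluation procedure can be sketched as follows. The mock catalogue is a toy assumption (seven noisy linear tracers of the logarithmic halo mass), not TNG data, and the hyperparameters are scikit-learn defaults rather than those of Table 2, so the resulting numbers are not comparable to the values quoted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Toy mock: rank-k log stellar mass as a noisy linear function of log halo mass.
rng = np.random.default_rng(0)
n = 1000
log_mh = rng.uniform(12.3, 15.3, n)
X = np.column_stack(
    [0.6 * log_mh + 3.0 - 0.3 * k + rng.normal(0, 0.2, n) for k in range(7)]
)

X_tr, X_te, y_tr, y_te = train_test_split(X, log_mh, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

mse = mean_squared_error(y_te, pred)
r2 = r2_score(y_te, pred)
# The same score written out from its definition.
r2_manual = 1.0 - np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2)
```

The residuals `pred - y_te` on the held-out set are the quantity whose distribution is compared between models.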
Besides the fiducial model, we further train additional models using subsets of the available features as input.
In Fig. 3, we compare the residual distributions for models involving the central and another satellite in the halo mass prediction. As can be observed, the inclusion of a satellite galaxy leads to a more accurate prediction compared to that using only the central galaxy, indicating that satellite galaxies can indeed provide additional information for the halo mass estimation. However, the residual distributions involving different satellites are all very close to each other, suggesting that there is not an outstanding satellite that improves the prediction much more than the others. We will come back to this conclusion later.
In the following we explore the roles played by different galaxies in more detail using feature importance and exhaustive feature selection.

Importance Ranking
The MDI-based importance ranking given by RF is presented in Figure 4, together with the deviations of the importances among individual trees in the forest. As expected, the most important feature is the stellar mass of the central galaxy. The second most important feature is the stellar mass of the seventh most massive galaxy, which is also the least massive galaxy in the data we used.
As mentioned above, Exhaustive Feature Selection (EFS) is introduced because the default feature importance ranking provided by the random forest might carry some bias. We train our model using all possible feature combinations, and list the top four best scoring combinations for each number of features in Table 3. A 5-fold cross validation is adopted in this process. Specifically, we split the dataset into 5 equal subsets (a.k.a. folds), with each subset used once for validation while the 4 remaining folds form the training set. The error of the $R^2$ score is calculated from the deviations of the per-fold scores $s_i$ from their mean $\bar{s}$, where $s_i$ is the $R^2$ score of fold $i$ and $k$ is the number of folds.
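A minimal sketch of the 5-fold scoring is given below. The exact normalisation of the score error is not reproduced here: this example assumes the population standard deviation of the fold scores (`ddof=0`), and other conventions (for example `ddof=1`, or the standard error of the mean) would change the number by a known factor. The toy data are also an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.2, 500)

# Each of the k = 5 folds is used once for validation, the other 4 for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=30, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")   # s_i, one per fold

mean_score = scores.mean()         # mean score over the folds
score_err = scores.std(ddof=0)     # ASSUMPTION: population standard deviation
```

The pair (`mean_score`, `score_err`) plays the role of a table entry for one feature combination.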
In Fig. 5, we plot the performance of the models trained with different combinations versus the number of features. The scores of models trained with two features improve over that of the model trained with only the central galaxy (see also Table 3). This verifies once again that satellite galaxies make an extra contribution to the prediction of halo mass. However, when only two features are available, the scores of the different combinations do not differ significantly and no combination is outstanding. That is, satellite galaxies of different ranks play similar roles as complements to the central galaxy in the prediction of halo mass. For the case when only three features are input, the [127] combination (stellar masses of the first, second and seventh galaxies) gives the highest model score, which is almost as high as the highest score attainable. This means that instead of the whole population, we can use the information from only the first, second and seventh (here the least massive) galaxies as a high precision probe of halo mass. Moreover, once the number of input features reaches 4, the improvement of the model becomes less noticeable and is almost absent with further increase in the number of features. This result indicates that information from only a few satellite galaxy members is sufficient to make high-precision predictions of the halo mass, without requiring a complete galaxy population.

The roles of different galaxies
In order to disentangle the roles played by differently ranked galaxies in the prediction of the halo mass in more detail, we choose the first, second, fourth and seventh ranked galaxies for analysis.
We first consider the combinations of the central galaxy with one satellite galaxy, i.e., [12], [14] and [17], and train a model for each combination. Then we plot the predicted as well as true values of these models as functions of two stellar masses at a time in Figure 6. Overall, the contours that represent the halo masses are roughly perpendicular to the axes corresponding to the stellar mass of the central galaxy, suggesting that the halo mass is mostly determined by the stellar mass of the central galaxy. However, the slight inclination towards the x-axis indicates that it also depends on the satellite galaxies.
Taking the top row of the figure as an example, as the model is trained with the stellar masses of the central and the second galaxy, the predicted and true values match well in the sm1-sm2 plane (left panel), confirming that the model has successfully learned the mapping between halo mass and these two stellar masses. For the remaining panels to the right, however, obvious misalignment exists between the true and predicted contours. This reflects the difference in the information provided by the different galaxies for predicting halo mass. It implies that although the inclusion of different satellite galaxies provides roughly equivalent improvement to the halo mass estimation over the central galaxy alone, the supplementary information that they carry relative to the central galaxy is different. The deviation between the predicted and true masses is larger in the rightmost panel, indicating a larger information difference between (sm2, sm7) than between (sm2, sm4). This could also explain why the combination [127] is the best when we use only three features.
It is interesting to note that the galaxies [127] also appear in the top combinations involving larger numbers of features in Table 3. The substantial extra information carried by the 7th satellite relative to the 2nd may be because it is the least massive satellite in our halos. In other words, the largest differences exist between the satellite galaxies with the largest ranking separation. To further verify this interpretation, we perform the same analysis on complete samples containing 5 and 6 galaxy members respectively. The results are consistent: the best combination of three galaxies is always the first, second and smallest galaxies, as shown in Figure 7.

Mass range independence
To check that our results do not depend on the mass range, we examine the residuals of different models at different central galaxy masses in Figure 8. The residuals are concentrated around 0 over the full mass range, indicating that the models are unbiased over the whole mass range. Note that the large deviation at the highest masses is due to the rarity of halos there. In addition, for the same model, the scatter is also similar at various masses. This suggests that our results have no dependence on mass and are valid throughout the mass range. Comparing the dispersions in the residuals, the previous conclusions can also be recovered: the inclusion of satellite galaxies helps to improve the accuracy of the halo mass estimation, and the precision of the model constructed using the best combination of three features is already comparable to that using all the features.
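The mass dependence of the residuals can be summarised by their median and scatter in bins of central stellar mass. The following sketch assumes the scatter is measured as half the 16th to 84th percentile range; the function name `binned_residuals` and the toy numbers are hypothetical.

```python
import numpy as np

def binned_residuals(x, residual, edges):
    """Median and half the 16th-to-84th percentile range of residuals in bins of x."""
    x, residual = np.asarray(x, dtype=float), np.asarray(residual, dtype=float)
    which = np.digitize(x, edges) - 1            # bin index for each object
    med = np.full(len(edges) - 1, np.nan)
    scatter = np.full(len(edges) - 1, np.nan)
    for b in range(len(edges) - 1):
        r = residual[which == b]
        if r.size:                                # leave empty bins as NaN
            lo, med[b], hi = np.percentile(r, [16, 50, 84])
            scatter[b] = 0.5 * (hi - lo)
    return med, scatter

# Toy input: log central stellar mass and (predicted - true) log halo mass.
log_mcen = [10.1, 10.2, 10.3, 11.5]
resid = [-1.0, 0.0, 1.0, 2.0]
med, scatter = binned_residuals(log_mcen, resid, edges=[10.0, 11.0, 12.0])
```

A median close to zero in every bin indicates an unbiased model, while similar `scatter` values across bins indicate a precision that does not depend on mass.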

DISCUSSION: UNDERSTANDING THE GAPS WITH THE CONDITIONAL GALAXY DISTRIBUTION
The magnitude or stellar mass gaps, and the galaxy combinations studied here, are all constructed based on the ranks of galaxies. Such ranks and their corresponding sizes naturally appear as function values and random variables in the cumulative mass or luminosity functions. Such a connection has been exploited before to derive the distribution of magnitude gaps as well as that of the BCGs by drawing from the global or conditional luminosity functions (More 2012; Paranjape & Sheth 2012; Hearin et al. 2013; Shen et al. 2014; Paul et al. 2017).
Figure 6. Relationship between the predicted halo masses from different models and stellar masses of different galaxies. The axes are the logarithmic stellar masses of the central (sm1), second (sm2), fourth (sm4) and seventh (sm7) most massive galaxies. The grey filled contours show the signal-to-noise level of the data at each location, which reflects the number of halos within each bin. The thin coloured lines are the contours of the true halo masses, while the thick light lines are the contours of the predicted halo masses of the corresponding model labelled in each panel.
Unlike previous studies, our machine learning results allow us to explore this connection in a reverse manner, to directly identify where and how much information on halo mass is stored in the cumulative galaxy distribution. In this context, galaxies with different mass ranks control different segments of the cumulative stellar mass function (CSMF). Those with ranks 1 and 2 control the shape of the curve at the massive end, while those with rank 7 control the shape of the curve at the more distant end, i.e. the low mass end. The connection of the gap or rank statistics to halo mass can then be understood as the variation in the relevant segments of the conditional cumulative stellar mass function (CCSMF) with halo mass, $\phi(>M\,|\,M_h)$. The finite number of informative features then reflects the limited number of distinct mass-dependent features in the CCSMF, or the universality of the CCSMF subject to a few mass-dependent parameters.
To verify this conjecture, we plot the CCSMF for halos with the same predicted values but different true halo mass values in Figure 9. For a given model, fixing the prediction is equivalent to fixing the values of the input features and the corresponding segments in the CCSMF. The remaining differences in the CCSMF curves for different true halo masses then reflect the contributions of features outside the combination used. We test the CCSMF using the RF models constructed respectively with combinations [1], [12], [17] and [127], and show the results for three representative mass ranges centered at $\log(M/M_\odot) = 12.5$, 13.5 and 14 in predicted mass.
It can be seen from this plot that for the model trained with only the central galaxies (first row), the CSMF curves for different true halo masses are noticeably different at fixed prediction, and only converge at the most massive end where the central mass is fixed. In the second row, where the second-ranked galaxies are added to the model, the curves show a further strong tendency to bunch up at the massive end, but still with a clean separation in the low mass region. Correspondingly, in the third row, where the seventh-ranked galaxies are added to the central galaxies, one of the focal points of the curves shifts to the relatively lower mass end at $N = 7$, and yet some discrepancies between the curves can be seen between and beyond the two focal points. Finally, adding both the second and seventh galaxies as supplements to the central galaxy (last row), the CSMF curves already approach complete overlap in the region of our concern ($N \le 7$), exhausting the distinct features in the top 7 galaxies.
All these results are consistent with what we speculated earlier. When only the central galaxy is controlled, the apparent divergence between the curves implies that there is additional information beyond the central galaxy that we can utilise in the estimation of the halo mass. Further constraining both the central galaxy and the second or seventh galaxy, the curves converge further at the corresponding massive or low mass end, and the halo mass constraint is tightened further, suggesting that the inclusion of satellite galaxies improves the prediction and that the second and seventh galaxies contribute distinct information about their host halo. After restricting the central, second and seventh galaxies together, the separation between the curves in the region we consider almost disappears, indicating that the extraction of the information required for the prediction is almost maximised. This is consistent with the result in previous sections that the precision of the estimations is nearly saturated after the inclusion of the best combination of three features.
Figure 9. From top to bottom, models are trained with galaxies [1], [12], [17] and [127] as input features respectively. From left to right, the predicted values of halo mass are binned around $10^{12.5}\,M_\odot$, $10^{13.5}\,M_\odot$ and $10^{14}\,M_\odot$ respectively, with a bin width of 0.5 dex. Each coloured curve represents the CSMF for a given true halo mass as labelled in the colour bars.
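The link between galaxy ranks and the cumulative stellar mass function can be made concrete with a short sketch. The function name `mean_cumulative_smf` and the toy halos are hypothetical; in an analysis like the one above, the input list would hold the halos falling in one bin of predicted mass and one bin of true mass.

```python
import numpy as np

def mean_cumulative_smf(log_masses_per_halo, thresholds):
    """Average number of member galaxies above each log stellar-mass threshold."""
    counts = np.array(
        [[np.sum(np.asarray(m) > t) for t in thresholds] for m in log_masses_per_halo]
    )
    return counts.mean(axis=0)

# Two toy halos given as lists of log10 stellar masses of their members.
halos = [[11.0, 10.0, 9.0], [11.5, 10.5, 9.5, 8.5]]
thresholds = [8.0, 9.2, 10.2, 11.2]
n_gt = mean_cumulative_smf(halos, thresholds)

# Within one halo, the rank-k galaxy marks where the cumulative count reaches k.
m = np.sort(halos[1])[::-1]
rank_counts = [int(np.sum(m >= m[k - 1])) for k in range(1, len(m) + 1)]
```

Fixing the stellar mass of the rank-$k$ galaxy therefore pins the cumulative curve of that halo at $N = k$, which is why curves at fixed prediction meet at the ranks used as input features.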

Zhou & Han
It is interesting to notice that for the middle and right columns, where more satellites can be resolved in a halo, the CCSMF still diverges at the lowest mass end even in the [127] model. This means that the low mass end of the distribution still carries extra information that can be used to further constrain the halo mass, in addition to that already explored in the top 7 members. It is also consistent with our previous finding that it is the least massive satellite in the sample that contributes significantly to the halo mass estimation, rather than galaxy 7 being special.
Equivalently, the least massive satellite controls the overall amplitude of the faint end of the mass function, or the richness of the halo, which is known to be tightly connected to halo mass.

CONCLUSIONS
In this work we have explored the connection of the galaxy population to the host halo mass, to clarify the roles played by different galaxies in the halo mass estimation and to understand the information content of the galaxy mass distribution on the halo mass. To this end we extract halos with at least 7 member galaxies from the IllustrisTNG simulation, and train a random forest algorithm to systematically assess the importances of different galaxy mass combinations in the prediction of halo mass. The results are further examined in the context of the conditional stellar mass function.
Our findings and conclusions are summarised as follows.
• When only one galaxy is used, we confirm that the central galaxy is the most informative single feature in estimating the halo mass.
• Compared with models that only use the central galaxy mass, the inclusion of satellite galaxy masses does improve the estimation of the halo mass, and the most informative binary features are always the central galaxy mass combined with another satellite mass.
• For the case of a combination of only two galaxies, the difference between the improvement of the model by adding any of the satellite galaxies to the central galaxy is not significant. This means there is not an outstanding satellite galaxy which contributes much more than the others to the halo mass estimation. In other words, we do not find an obviously "optimal" mass gap to be used in mass estimation.
• For combinations of three galaxies, the best combination is always that of the central galaxy with the second and the least massive galaxies. This conclusion holds when examining the top 7, 6 or 5 galaxies, and may be generalised to a larger number of available galaxies. It suggests that the biggest and smallest satellite galaxies provide the greatest differential information.
• For the seven member galaxies studied, the combination of a central galaxy and 2 or 3 satellite galaxies gives a near-optimal model performance, and continued addition of feature variables barely improves the model performance further. In other words, only a few galaxies are required to build a model with comparable accuracy to that using the whole member galaxy population.
• The different roles played by differently ranked galaxies can be directly mapped to the variation of different segments of the CCSMF with halo mass. While the central galaxy controls the starting point of the CCSMF, the second most massive galaxy controls the variation at the high mass end, and the least massive galaxy controls the amplitude or shape at the low mass end. Once these 3 galaxies are fixed, the CCSMF, that is, the full mass distribution of all the member galaxies, becomes largely determined in the studied mass range, with little extra variation that can inform about halo mass. However, we notice that the CSMF still contains extra variation at even lower masses beyond the 7th galaxy, which could be used to further constrain the halo mass.
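The mapping between ranked galaxies and segments of the CCSMF can be made concrete with a minimal sketch. It assumes that the CCSMF of a single halo is simply the number of member galaxies at or above a given stellar mass; the function name `ccsmf` and the member masses are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def ccsmf(log_masses, thresholds):
    """Count member galaxies with log stellar mass >= each threshold."""
    ordered = sorted(log_masses)  # ascending order
    n = len(ordered)
    # bisect_left returns how many masses lie strictly below the threshold,
    # so n minus that number is the cumulative count at or above it.
    return [n - bisect_left(ordered, t) for t in thresholds]


# Hypothetical log10(M*/Msun) values for the top 7 members of one halo.
members = [11.2, 10.6, 10.1, 9.9, 9.7, 9.5, 9.3]
ranked = sorted(members, reverse=True)

# Masses of the central (rank 1), the second most massive (rank 2)
# and the least massive (rank 7) galaxy.
anchors = [ranked[0], ranked[1], ranked[6]]
counts = ccsmf(members, anchors)
print(counts)
```

By construction, and barring ties, the rank-k galaxy marks the stellar mass at which the cumulative count reaches k. Ranks 1, 2 and 7 therefore pin the starting point, the high mass end and the low mass end of the curve, respectively.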
The physical mechanism responsible for the information in the second most massive and the least massive galaxies might be that the former is related to recent as well as major merger events (Deason et al. 2013), while the latter characterises the total mass accretion of the halo. Recent and major mergers can significantly influence the mass distribution around the halo, causing it to deviate from the typical galaxy-halo connection. Moreover, within the satellite population, the second most massive galaxy and the smallest satellite galaxy differ the most in their time of entry into the host halo, and therefore also differ the most in the information they can provide.
Our findings can provide insights into how to choose members to obtain the most information about the halo mass when the available galaxy population in the halo is limited. The direct visualisation of the CSMF dependence on halo mass also has implications for how to describe the CSMF so as to maximise the information it carries about halo mass. It remains interesting to check whether current CCSMF models, or similar conditional luminosity function models (e.g., Yang et al. 2003; Guo et al. 2018), can fully capture this information. It is also straightforward to apply the analysis in this work to study the galaxy population-halo connection in other datasets, such as galaxy magnitude data and those from semi-analytical models, before applying the results to real observations. It is also worth extending these explorations to alternative halo mass definitions, given the recent new understanding of the physical boundaries of halos such as the splashback radius (Diemer & Kravtsov 2014; Adhikari et al. 2014; Shi 2016) and the depletion radius (Fong & Han 2021; Li & Han 2021).

ACKNOWLEDGMENTS
We acknowledge helpful discussions with Wenting Wang, Rui Shi and Qingyang Li. JH benefited from discussions with Houjun Mo and many others at the assembly bias workshop at SJTU in 2019, which motivated this study. This work is supported by the National Key Basic Research and Development Program of China (No. 2018YFA0404504), NSFC (11973032, 11890691, 11621303), the 111 project (No. B20019), and the science

Figure 2. Relationship between the predicted and true halo mass values. The diagonal dashed line represents y = x, and the red line shows the median relation between the true and predicted values. The red shaded region represents the 1σ percentile range. The data points are spread evenly and concentrated on both sides of the line, indicating that the difference between the predicted and true values is quite small and that the random forest makes a good prediction of the halo mass. The fitted $R^2$ score is 0.946 and the mean square error (MSE) is 0.01.
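For reference, the two quality metrics quoted in this caption can be written out as a minimal sketch. It assumes that both are evaluated on log halo masses; the helper names `mse` and `r2_score` and the sample values are hypothetical and are not taken from the simulation.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical true and predicted log10 halo masses for five halos.
true_log_mh = [12.0, 12.5, 13.0, 13.5, 14.0]
pred_log_mh = [12.1, 12.4, 13.0, 13.6, 13.9]

print(mse(true_log_mh, pred_log_mh), r2_score(true_log_mh, pred_log_mh))
```

An $R^2$ close to 1 means that the residual scatter about the y = x line is small compared with the spread of the true halo masses themselves.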

Figure 4. MDI-based feature importances given by the random forest. The bar height represents the relative importance and the error bar is the standard deviation of the importance across the individual trees in the forest.
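The quantities plotted in this figure can be sketched as follows. The example assumes a scikit-learn `RandomForestRegressor`, which the text does not name explicitly, and uses synthetic toy features in place of the IllustrisTNG stellar masses; the scatter values and sample size are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_halos, n_ranks = 300, 7

# Toy data (not from the simulation): log halo mass and seven noisy
# "log stellar mass" features with a hypothetical scatter per rank.
log_mh = rng.uniform(12.0, 14.0, n_halos)
scatter = np.linspace(0.1, 0.4, n_ranks)
X = log_mh[:, None] - 2.0 + rng.normal(0.0, 1.0, (n_halos, n_ranks)) * scatter

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, log_mh)

# MDI importance of the whole forest (normalised to sum to one).
importances = forest.feature_importances_
# Standard deviation of the importance over the individual trees.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
importance_std = per_tree.std(axis=0)

for rank, (imp, std) in enumerate(zip(importances, importance_std), start=1):
    print(f"rank {rank}: {imp:.3f} +/- {std:.3f}")
```

The toy features are not expected to reproduce the importances reported here; the sketch only shows how the bar heights and error bars could be obtained.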

Figure 5. Performance of the top four scoring feature combinations for each number of features. The data points are the $R^2$ scores from 5-fold cross-validation and the error bars are their standard deviations. Different colours represent different rankings, and the points are slightly offset horizontally for better visibility. The detailed combination names are labelled next to each point. The scores of single features are not plotted, to reduce the dynamic range of the figure.

Figure 9. The cumulative stellar mass function (CSMF) for halos with the same predicted mass but different true halo masses. From top to bottom, models are trained with galaxies [1], [12], [17] and [127] as input features, respectively. From left to right, the predicted halo masses are binned around $10^{12.5}\,M_\odot$, $10^{13.5}\,M_\odot$ and $10^{14}\,M_\odot$, respectively, with a bin width of 0.5 dex. Each coloured curve represents the CSMF for a given true halo mass as labelled in the colour bars.

Table 2. The optimal RF hyperparameters in our model. From left to right: number of trees in the forest, maximum depth of each decision tree, minimum number of samples per leaf, and minimum number of samples required to split an internal node.
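The four hyperparameters listed in this table correspond to the scikit-learn arguments `n_estimators`, `max_depth`, `min_samples_leaf` and `min_samples_split`. The sketch below shows one common way to tune them, a cross-validated grid search. This is an assumption: the text does not state here how the optimum was found, and the grid values and toy data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Toy data (not from the simulation): log halo mass and seven noisy features.
log_mh = rng.uniform(12.0, 14.0, 200)
X = log_mh[:, None] - 2.0 + rng.normal(0.0, 0.3, (200, 7))

# Hypothetical search grid over the four hyperparameters of the table.
param_grid = {
    "n_estimators": [10, 30],       # number of trees in the forest
    "max_depth": [5, None],         # maximum depth of each tree
    "min_samples_leaf": [1, 5],     # minimum number of samples per leaf
    "min_samples_split": [2, 10],   # minimum samples to split a node
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=5, scoring="r2"
)
search.fit(X, log_mh)
print(search.best_params_, round(search.best_score_, 3))
```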

Table 3. The top four scoring feature combinations from exhaustive feature selection. The numbers in brackets specify the stellar mass ranks (1 for the central and 2-7 for satellites) of the constituent galaxies. The scores are the $R^2$ scores of the corresponding models.
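The exhaustive feature selection behind this table can be sketched as a loop over all feature subsets, each scored by cross-validated $R^2$. This is a minimal illustration under stated assumptions: it uses scikit-learn, a hypothetical helper `exhaustive_selection`, a small forest, and synthetic toy features. It is not the pipeline used for the results reported here, and the toy data are not expected to reproduce the rankings.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def exhaustive_selection(X, y, max_size, cv=5):
    """Score every feature subset up to max_size by cross-validated R^2."""
    results = {}
    for size in range(1, max_size + 1):
        scored = []
        for subset in combinations(range(X.shape[1]), size):
            model = RandomForestRegressor(n_estimators=10, random_state=0)
            scores = cross_val_score(
                model, X[:, list(subset)], y, cv=cv, scoring="r2"
            )
            ranks = tuple(i + 1 for i in subset)  # 1 = central, 2-7 = satellites
            scored.append((scores.mean(), scores.std(), ranks))
        # Best combinations first, for each number of features.
        results[size] = sorted(scored, reverse=True)
    return results


rng = np.random.default_rng(2)

# Toy data (not from the simulation): log halo mass and seven noisy features.
log_mh = rng.uniform(12.0, 14.0, 200)
X = log_mh[:, None] - 2.0 + rng.normal(0.0, 0.3, (200, 7))

results = exhaustive_selection(X, log_mh, max_size=3)
for size, scored in results.items():
    for mean, std, ranks in scored[:4]:  # top four per subset size
        print(size, ranks, f"{mean:.3f} +/- {std:.3f}")
```

With seven candidate galaxies there are 7, 21 and 35 combinations of one, two and three features, so the full search over all subset sizes (127 combinations in total) remains cheap enough to run exhaustively.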