Disentangling Stellar Age Estimates from Galactic Chemodynamical Evolution

Stellar ages are key for determining the formation history of the Milky Way, but are difficult to measure precisely. Furthermore, methods that use chemical abundances to infer ages may entangle the intrinsic evolution of stars with the chemodynamical evolution of the Galaxy. In this paper, we present a framework for making probabilistic predictions of stellar ages, and then quantify the contribution of both stellar evolution and Galactic chemical evolution to those predictions using SHapley Additive exPlanations. We apply this interpretable prediction framework to both a simulated Milky Way sample containing stars in a variety of evolutionary stages and an APOGEE-mocked sample of red clump stars. We find that in the former case, stellar evolution is the dominant driver for age estimates, while in the latter case, the more restricted evolutionary information causes the model to proxy ages through the chemical evolution model. We show that as a result of the use of nonintrinsic Galactic chemical information, trends estimated with the predicted ages, such as the age–metallicity relation, can deviate from the truth.


INTRODUCTION
Measuring the chronochemodynamics of stars allows us to reconstruct a timeline of events in the Galaxy.Of particular importance are stellar ages; good measurements of stellar ages are necessary to understand stellar evolution and the formation history of the Milky Way, and to identify new stellar populations within the Galaxy (Mackereth et al. 2019;Lu et al. 2021b,a;Ness et al. 2019;Aguirre et al. 2018).However, stellar ages are particularly difficult to measure, as they are not directly related to any observable (see Soderblom 2010 for a review).Instead, we can only measure properties that are correlated with ages.For example, ages can be Corresponding author: Jeff Shen shenjeff@princeton.eduobtained using gyrochronology (e.g., Barnes 2003;Angus et al. 2019) uses the empirical relation between a star's rotation period and its age, and asteroseismology (e.g., Lebreton & Montalban 2008;Aguirre et al. 2018), which uses measurements of oscillation frequencies to probe the properties of the interiors of stars, which are related to age.
With the advent of surveys like APOGEE (Majewski et al. 2017), measurements of many stellar surface properties are available, and they can be used to infer ages.Surface properties like log g and T eff describe stellar type, so they can be used in conjunction with theoretical stellar evolution models to get information about the age of a star (Bressan et al. 2012;Degl'Innocenti 2016;Choi et al. 2016).Stellar evolution also changes surface abundances through physical processes like dredge-up, whereby CNO products (nitrogen-rich and carbon-poor) are mixed into the atmosphere of a star through convec-tion.The CNO process is also mass dependent, and as a result, abundance ratios like [C/N] can give insight into stellar evolution and be used to estimate ages (Salaris et al. 2015).Thus, information about stellar evolution (i.e., age) is also correlated with [Fe/H], [α/Fe], and other chemical abundances (Nissen et al. 2017;Casali et al. 2019).However, measured stellar abundances also depend on the material that a star was formed from; chemical properties of the interstellar medium vary across time and location within the Galaxy (Spina et al. 2016;Mena et al. 2019;Ness et al. 2019).For example, Nissen et al. (2017) find for Kepler LEGACY stars in the Galactic disk that [Mg/Fe] is correlated with age because the deposition of iron into the ISM from SNIa is delayed relative to the production of Mg from SNII.Thus, ages that are inferred from certain chemical abundances are entangled with Galactic chemical evolution.
For understanding the properties of stars and obtaining age estimates that are as precise as possible, it may make sense to use as much information as is available.However, to understand the assembly history of the Galaxy, it is ideal to have ages that are derived from information intrinsic to a star-Galactic information should be removed.
The primary contribution of this paper is to introduce a framework for making probabilistic predictions of the ages of stars.Here we use the Galactic information in the form of [Mg/Fe] and [Fe/H] as well as stellar information in the form of [Fe/H], log g and T eff .We then quantify how much age information comes from Galactic vs stellar models.We introduce the methods used to do so-probabilistic random forests and SHapley Additive exPlanations-and demonstrate the framework's potential by applying it to a simple but realistic simulation of Milky Way stars.We show that the model predictions are driven primarily by stellar evolution information, and as a result, we are able to make robust and interpretable predictions that reproduce expected trends in the data.
In Section 6 we repeat our analysis using a sample of APOGEE-mocked red clump stars.This is a harder test case because the stars in the sample have very similar log g values, as red clump stars lie in a small cluster on the Hertzsprung-Russell diagram.Since log g is be almost the same for all stars in that sample, we expect much less information to be present in the stellar models than for the Milky Way sample.Thus, the models may need to obtain age information by proxying through chemical correlations, which are not intrinsic to the star itself.We show that this is indeed the case, and discuss some implications of this result.We also make an attempt to explicitly decorrelate the chemical information from the age information, and show how that process affects our results.We note, however, that observers typically use [C/N] ratios rather than log g and T eff to determine ages (e.g., Salaris et al. 2015;Martig et al. 2016;Ho et al. 2017;Casali et al. 2019).In a follow-up paper, we will extend our framework to directly use stellar spectra rather than stellar parameters and abundances estimated from the spectra.
The layout of this paper is as follows.In Section 2 we describe our data and the underlying Galactic and stellar models.In Section 3 we explain the probabilistic random forest regression model which we use to make probabilistic age predictions.In Section 4 we explain SHAP values, which allow us to quantify the relative importance of certain features in our predictive models, and then use them to determine the relative importance of features in our predictive model.In Section 5 we discuss how well the models are able to reproduce simulated age-abundance trend in the Galaxy, as a function of (1) choice of age estimator, (2) metallicity, and (3) whether there are chemical correlations.In Section 6 we apply our framework to a harder test case of red clump stars where the model proxies age information using chemical correlations, and compare the results to those obtained from the Milky Way sample.A summary of our findings is presented in Section 7.

MOCKING MILKY WAY DATA
We produce a mock dataset of stars in the Milky Way, with known true ages, masses, and properties, and in Section 3, we train random forests to learn the masses from the properties.The dataset contains stars of various evolutionary stages from the Milky Way's low-α disk.The dataset contains age (mass), present-day Galactocentric radius, [Mg/H], [Fe/H], log g and T eff .The relationship between all the variables is represented in the graphical model in Figure 1.
We first sample from a variant of the Frankel et al. (2020) Milky Way's low-α disk model, to which we add [Mg/Fe] with a similar functional form as the [Fe/H](R birth , t) described in their Eqs.17 and 18, but with earlier enrichment1 .The model produces stars born from an inside-out star formation history, predicts with what abundances stars were born, and how they subsequently migrated from their birth places.The [Mg/Fe] aspect of the model is not a best fit to the Milky Way but contains plausible correlations seen in real data, in particular faster enrichment in alpha elements than in iron-peak elements.Given the Galactic variables, we then generate the stellar parameters using isochrones.We begin by sampling initial masses plausible given the stars ages and metallicities from the Kroupa (2001) initial mass function (IMF).We consider evolutionary stages from the zero age main sequence (at equivalent evolutionary point, EEP, of 202 (Choi et al. 2016)) to the thermally pulsating asymptotic giant branch (EEP = 808).Beginning with a coarse EEP grid (EEP steps of 2.5), we use Brutus (Speagle et al., submitted to ApJ) to evaluate MIST isochrones (Choi et al. 2016) over the EEP grid given some age and metallicity.Here the isochrones are only a function of age and metallicity; Mg is not given.This process yields stellar parameters for each value in the grid.We select the set of parameters for which the initial mass most closely matches the initial mass sampled from the IMF.We then perform a second pass using a finer grid (EEP steps of 0.05) around the EEP corresponding to the matched initial mass.We again select the set of parameters for which the initial mass most closely matches the sampled initial mass, and use these parameters as the final stellar parameters.

Age
Instead of working with stellar ages, we train our models to predict the stellar mass to avoid making assumptions about the mass loss prescription within our model itself.The age can be inferred from the mass using a number of methods afterward, because they are closely related by the evolutionary stage lifetime.Here we use MIST isochrones (Choi et al. 2016) to convert the predicted masses to ages.However, we note that because isochrone interpolation is not exact, the conversion from mass to age introduces a small amount of error.Thus, converting from age to mass and then back to age gives a result that differs from the original age by ∼ 2.5%.
The stellar parameters are correlated with both age and metallicity, since the isochrones were conditioned on these two variables.Therefore, metallicity contains age information through both the Galactic model and the stellar model (isochrones).In order to disentangle the effects of stellar ages from Galactic chemodynamical evolution, we also produce a "control sample" for which the ages are de-correlated from the chemical evolution model.To do this, we randomly shuffle the ages produced by the Milky Way model, before using them to produce isochrones and train the random forests.This shuffling step breaks the direct link between [X/H] and age (i.e., through the Galactic model).In this control sample, the only age information coming from [X/H] should be indirect-purely from correlations with stellar evolution, through isochrones.
After generating all the parameters of interest, we simulate observational noise by applying Gaussian errors of 0.05 dex to [X/H], 0.1 dex to log g, 2% to T eff , and 0.04 dex to mass (García Pérez et al. 2016;Majewski et al. 2017;Frankel et al. 2019).
In Figure 2 we show our generated Milky Way sample.

PROBABILISTIC MASSES (AGES) WITH RANDOM FORESTS
With the mock dataset generated in Section 2, we train a probabilistic random forest regression model (PRFR2 ) to learn the masses from the stellar parameters and chemistry.The goal is to understand how PRFR is using the information from the training sample, and in particular to quantify how much chemical evolution it is using to learn stellar masses.We split our dataset into a training set, a validation set, and a testing set, consisting of roughly 70%, 20%, and 10% of the full dataset respectively.Model training is done with the training set, validation with the validation set, and model performance is determined using the testing set which is not seen during training.
The model that we use to learn masses from stars is a modified version of the popular random forest regression model (Breiman 2001), which we call the probabilistic random forest regression model (PRFR).The algorithm trains a number of decision trees, where each tree is given a bootstrapped and noisy version of the dataset.Rather than aggregating (i.e., averaging) the predictions from all the trees as in a standard random forest regression model, we preserve all the individual predictions from the different trees.This means that for each sample, we obtain n predictions, where n is the number of trees in the forest.As a bootstrap can be thought of as a special case of the Bayesian bootstrap, which simulates a Bayesian posterior distribution (Rubin 1981;Hastie et al. 2009), we can treat the samples from our bootstrapped trees as approximate samples from an posterior distribution for mass of the stars.
To account for measurement noise in our input parameters, we train each decision tree in the PRFR model using a draw from a Gaussian distribution centered around the measured values, and with standard deviation equal to the noise in the original dataset.This allows us to, conditional on the noisy measured value, marginalize over the unknown true value.Noise is also incorporated when making predictions in the same way.During model training, we would also like to incorporate noise in the labels (i.e., the masses), which we do by using sample weighting.Each sample is given an inverse variance weighting, so that stars with smaller errors in mass are weighted more heavily.
To ensure that the posterior (samples) are wellcalibrated, we make predictions on the validation set and then compare the the masses in the validation set to the resulting mass PDFs.Ideally, we want our PDFs to be calibrated so that the credible intervals match the coverage probability (i.e., roughly 50% of the time, the true value lies in the 50% credible interval).We do this by finding a scaling factor to either narrow or broaden the PDF.We then calculate the empirical CDF (ECDF) of the scaled PDF and evaluate it at the measured ("true") mass.Across multiple predictions, the resulting distribution of probability integral transform (PIT) values should be uniform in the ideal case (Dawid 1984;Diebold et al. 1998;Gneiting et al. 2007).We find the scaling value that minimizes the difference between the quantiles of the distribution of PIT values and a uniform distribution (i.e., makes the PIT distribution the closest to uniform).We show in Appendix A the impact of this procedure on the calibration of the posteriors.
Using the validation set, we also fit a multivariate linear regression model to the residuals, defined as the difference between the true values and the mean prediction, to explicitly correct for bias.For predictions on the test set, we run the PRFR model (incorporating noise), adjust for bias, and apply the PDF calibration to obtain samples from a calibrated posterior for the masses.
Figure 3 shows the predicted mass against the true mass for a set of 25,000 mock stars from the sample described in Section 2.Here we use the modes of the predicted mass distributions for the predicted value, and the standard deviation of the distribution as the y-errorbar.The difference between the noisy training mass and the true mass is used as the x-errorbar.
The PRFR model, across the entire mass range of the data, is able to accurately and precisely estimate the true mass, even when the training data are noisy.The scatter in the residuals (defined to be the difference between the mode of the predicted mass distribution and the true mass) is 0.05 M ⊙ , and the bias (by design of the model) is negligible at 0.006 M ⊙ .
In the top left of the plot, we show the variation of residuals across parameter space.At low logg, the stellar mass can be underestimated.However, for most stars, the mode of the estimated posterior accurately traces the true mass of the stars.On the right, we show examples of posteriors across a range of masses.The blue lines represent kernel density estimates of the posteriors, and the red line shows the true mass.In general, the posteriors are narrow and centered around the true mass.That indicates that we are able to confidently recover the true mass, even when the training data are noisy.The top right panel shows an exception where the mass is underestimated, but the posterior is appropriately wider, leading the mode and the mass to be separated by a reasonable distance.

INTERPRETING MODEL PREDICTIONS WITH SHAP VALUES
When we use multiple variables in a model to make a prediction of age, how can we tell how much each of the variables contributes to our prediction?SHapley Additive exPlanations (SHAP) values Lundberg & Lee (2017) are a metric for quantifying the marginal contribution of different features in a model to that prediction.They can be used to explain, for individual stars, how the inclusion of [Mg/Fe] (for example) in the model changes the prediction for the age of that star.SHAP values are local, and the amount that a prediction changes due to conditioning on a particular feature depends on where in parameter space the star lies.
SHAP values have the desirable property of additivity; the total effect of multiple features is simply the sum of the effects of the individual features.Consequently, the sum of the SHAP values of all the features gives the difference between the prediction from the full model (using all variables) and the baseline Naive Bayes prediction (i.e., randomly sampling from the age distribution), as well as what drives that difference.
Figure 4 shows a graph of models conditioned on various variables and indicates how SHAP values can be calculated.We first calculate "model differences", which are the differences between models that include a particular feature and models that exclude that feature, all other features equal.Each of these model differences is represented by an arrow, with the color of the arrow indicating the feature being considered for inclusion.We focus on [Mg/Fe], indicated with blue arrows, for clarity.Because the included feature might depend on other features, the SHAP value for a feature is a weighted average of all the relevant model differences.The calculation of SHAP values requires that we have a separate model for each element in the power set of available features; for K features, we need 2 K models-each feature is either included or not.
The formula for calculating SHAP values is where j is the feature being considered, f is our model, s is some set of features, F is the set of all features, x is the current point in parameter space where the SHAP value is being calculated, P denotes the power set, and |s| is the cardinality of s.
In many cases, we would like to compare the relative importance of different features.To do so, we introduce normalized SHAP values, which are calculated as follows: where SHAP N is the normalized SHAP value and j is the feature.Simply, for a given object, the normalized SHAP value is the SHAP value of some feature divided by the sum of the absolute values of all the SHAP values for that object.The interpretation of normalized SHAP values remains similar to that of SHAP values: if a normalized SHAP value is close to 0, that feature is not very informative, because adding or removing it does not change predictions much relative to the other features.
Conversely, if a value is close to -1 or 1, then that feature is almost entirely responsible for driving the prediction.The magnitude of the normalized SHAP value is the relative importance of that feature, and the sign of the normalized SHAP value is the direction of the change in prediction.

Application of SHAP Values to Probabilistic Age Predictions
We train 16 PRFR models on our original mock dataset, and another 16 on the chemically decorrelated version; in each version, each of the 16 models is trained on a different subset of features.SHAP values, as described in Section 4, are calculated using scalar predictions.However, given that our models give us not only a single mass, but (samples from) a posterior distribution, we can instead calculate SHAP values using certain summary statistics (e.g., mean or median) of the posteriors.We note that in theory, it is possible to calculate SHAP values for a grid of quantiles across the posteriors for each star, and that would allow us to get an idea of how the posterior distribution is influenced by the inclusion or exclusion of each of the variables.
Here we show results for the version of the Milky Way sample with observational noise.For this simple test case of direct age estimation (see Section 4.2 for justification), our model is robust even in the presence of noise, and so we do not show the test case without noise.
In Figure 5, we take the fiducial dataset (i.e., with chemical correlations) and predict the mass distributions of 1000 stars using all 16 models, and then for each star we calculate the SHAP values for each of the four parameters using the means of the mass distributions.[Fe/H] remains informative; in addition to the Galactic correlations that it has with age, it also has information about the ages through isochrones.
We see that for this test case, the difference in SHAP values between the fiducial and the chemically decorrelated datasets is minimal.In particular, the SHAP values for [Mg/Fe] and [Fe/H] are (relatively) close to zero in all cases.The results are similar even when Galactic correlations are explicitly removed; this is an indication that the models are directly learning ages and masses through stellar models, and not proxying them through Galactic correlations.Indeed, the SHAP values for log g and T eff are much larger than those for [Fe/H] and [Mg/Fe]-even in the presence of noise-indicating that they are more important for driving changes in model predictions.

Where Is Age Information Coming From?
As a way to more explicitly visualize the changes in SHAP values due to the decorrelation process, we create a split-violin plot in Figure 7 using the noisy Milky Way dataset.Each of the violins shows the overall distribution of normalized SHAP values for the different model features; the blue side shows the overall normalized SHAP values from the models trained on the fiducial dataset, and the red side shows the same but for the decorrelated dataset.This gives us an overview of how the decorrelation process changes, overall, where information comes from.The values above each of the violins are the means of the absolute value of the normalized SHAP values, which roughly gives the percentage contribution of information from that particular variable.
Confirming our previous results, we see that the SHAP values for log g and T eff tend to be far from zero, in-dicating that they are relatively more informative than [Mg/Fe] and [Fe/H] for which the SHAP value distribution has the bulk of its mass around zero.
We note that the results shown here for the how well the PRFR models are able to predict ages using stellar models represents an ideal case where we have information about the mass loss prescription.Thus, the models here likely have more information than we would in practice.If we have a less complete understanding of how mass loss works, then the model might not be same as reality.In that case, log g and/or T eff might not be as informative, and there might be more information coming from chemical enrichment than is revealed by SHAP values.

AGE-ABUNDANCE TRENDS
We would like to examine whether the models are able to accurately reproduce age-abundance trends in the  data.In Figure 8, we show the age-abundance trends for the Milky Way sample.On the x-axes, we show the age of the star, and on the y-axes, we show the [Mg/Fe] value from the dataset.In each panel, we consider stars in three different [Fe/H] bins, as indicated by the color of the lines.Within each bin, there are three sets of lines of identical colors: one created with the true age, one created with the predictions from the model trained on the fiducial dataset, one created with the predictions from the model trained on the dataset with chemical correlations removed.That is, each line is created from datapoints with the same [Mg/Fe] values, but differ in their ages (x-positions).
Recalling that the outputs of our PRFR model are samples from a posterior, we can consider how taking different point estimates from the posterior can influence the predicted age-abundance trends.On the left, we consider random samples from the conditional age distributions (converted from the conditional mass distributions).On the right, we consider the mode from the conditional age distributions.We see that the observed trends are very similar between the different metallicity bins, and also between the two panels.In addition, the lines corresponding to the model predicted ages and the true ages agree well.This shows that in this test case where we are able to directly estimate the ages without proxying through chemical correlations, the predictions from the model-regardless of whether we take random samples (i.e., sample the entire posterior) or a summary statistic that traces the bulk of the distribution-we are able to accurately reproduce age-abundance trends.

RED CLUMP SAMPLE
We repeat our entire analysis using an APOGEEmocked red clump sample of stars.We generate this dataset following the method described in Section 2, but using EEPs ranging from 631 to 707, corresponding to the core helium burning phase of a star's lifetime.
Figure 9 shows the red clump sample in parameter space.The overall distributions of [Mg/Fe] and [Fe/H] in the red clump sample is identical to those in the Milky Way sample, but the log g and T eff values span a much smaller range.In particular, the log g range for the red clump sample, especially in the absence of observational noise, is extremely small, with most stars having log g ≈ 2.5.
This particular sample is a test case where the information content coming from stellar evolution should be much lower, precisely due to the highly constrained range in values as explained above-in the limiting case, we would be using a constant as one of the variables, which should be totally uninformative.As a consequence, the model tries to learn to predict masses using the available chemical proxies, rather than only through stellar evolution.We stress, however, that it is indeed possible to use stellar evolution to predict masses for red clump stars; there is simply less stellar evolution information available.
6.1.Mass (Age) Estimates with PRFR Figure 10 shows a simple one-to-one comparison of the true mass with the mass predictions from the PRFR model for the RC sample.In general, the model prediction is in agreement with the true mass (but less so than for the full sample); the scatter is 0.33 M ⊙ .However, there is a particular horizontal clump of stars where the PRFR model systematically underpredicts the mass to be ∼ 1.2 M ⊙ , even in cases where the true mass is ∼ 4 M ⊙ .
Torward the right side of the plot, we randomly select four stars from this clump and show their predicted mass distributions as inset axes.We see that in all cases, there is a clear peak in the posterior near ∼ 1.2 M ⊙ which is taken to be the prediction in the main scatterplot, as well as a secondary solution at higher masses which matches the true mass.This is an indication that point predictions are insufficient when trying to reproduce the full complexity of the mass distribution in the data; many of the posteriors are not, e.g., simple Gaussians, and thus the full posterior distribution should be taken into account.
We further examine the performance of the PRFR model in the top-left inset axes of Figure 10.The two panels show the residuals of the model across parameter space.We find that there is a diagonal residual gradient across [Mg/Fe] and [Fe/H] space, with masses at lower [Mg/Fe] and [Fe/H] being underpredicted.On the right panel, there is a red-colored clump of stars around T eff ≈ 4700 K and log g ≈ 2 where the residuals are mostly negative, indicating that the PRFR model systematically underpredicts the mass.These stars are likely the same as the ones in the clump in the main part of the figure.

SHAP Values
Here we recreate the 2D SHAP value plots from Section 4.1 using the RC sample.Figures 13 and 14 are created from the noisy RC datasets with and without chemical correlations, respectively.By comparing the results from the fiducial dataset to those from the chemically decorrelated dataset, we can identify the effect that the decorrelation process has on the model predictions.In the top row of Figure 11, we see that there are diagonal gradients in all panels; this indicates that the SHAP values for all four variables depends on chemistry.When chemical correlations are removed in Figure 12, the diagonal gradient effectively disappears in of the top panels, and the SHAP values are essentially independent of the combined chemistry, as expected.There is still a weak horizontal gradient in the upper right panel, indicating that the T eff SHAP value varies with metallicity, but this is reasonable since we expect the two to be covariant (Rix et al. 2022).
Compared to the Milky Way sample, the [Mg/Fe] and [Fe/H] SHAP values for the RC sample are far less clustered around zero.There are clear gradients in the left two panels in the upper row of Figure 13; the magnitudes of the SHAP values there are comparable to those in the right two panels in the upper row, indicating that the normalized SHAP values are much more balanced across the four variables.In constrast, in Figures 5 and 6 The SHAP value plots suggest that [Fe/H] and [Mg/Fe] provide more information for the red clump sample than for the Milky Way sample.While stellar evolution remains quite informative, chemical correlations play a much larger role in driving predictions.The decorrelation process shifts the bulk of the distribution of SHAP values for [Mg/Fe] to be close to zero, indicating that in most cases, [Mg/Fe] becomes less responsible for driving predictions than when chemical correlations are present.At the same time, log g and T eff become slightly more informative, as the bulks of the distributions shift slightly further away from zero in the negative direction.
When observational noise is included, we see that the decorrelation process has a similar effect on [Mg/Fe] as before.However, [Fe/H] now becomes much more informative.When we remove chemical correlations, we eliminate any direct age information contained in [Mg/Fe]; the only age information contained in the decorrelated [Mg/Fe] should be the correlation that it has with [Fe/H], which still contains age information because of isochrones.

Age-Abundance Trends
We now turn to the age-abundance trends for the RC sample.Figure 15 shows the age-abundance trends created from the PRFR models trained on the noisy RC sample.On the left-where we take random samples from the conditional age distributions, we see that the  predicted trends from the models trained on both the fiducial and chemically decorrelated datasets are in agreement with the true trends.However, on the right panel, we observe interesting deviations from the truth when we take the modes of the conditional age distribution.At large ages, the trends created from mode ages predicted from the model trained on the fiducial dataset (dashed line with triangles) have an upturn relative to the truth, while the trends from the mode ages predicted from the model trained on the chemically decorrelated dataset (dotted line with squares) have a downturn.
The cause of this is that there is a [Mg/Fe]-dependent difference in the mode predictions between the fiducial and chemically decorrelated models.In Figure 16, we explicitly show the structure of this [Mg/Fe] dependence.For low [Mg/Fe] stars, the decorrelated predictions tend to be older than the fiducial predictions, and for high [Mg/Fe] stars, they tend to be younger.Even though the two sets of mode age predictions have the same overall distributions, they differ for individual predictions, and that variation depends on [Mg/Fe].In particular, the right panel shows a diagonal gradient (i.e., across the 1:1 line) which indicates that there is [Mg/Fe] dependence in the variation between the two predictions.So at, for example, low [Mg/Fe] values, one would expect to be in the red region of the plot where the prediction from the chemically decorrelated model is older than the prediction from the chemically correlated model.Another way to think of this is that for a given star with some particular age, the [Mg/Fe] dependence of the model predictions means that the chemically correlated model is associated with a higher [Mg/Fe] value than the chemically decorrelated model.Thus, there is an upturn in the trends for the chemically correlated lines and a downturn for the decorrelated lines.
We can also choose to use random samples from the posterior instead of modes.Since random samples more representative of the full distribution of predictions, the variation of variation of [Mg/Fe] in true age and predicted age space is similar between the fiducial and chemically correlated cases, and thus the difference in the trends is much smaller than for modes.As a result, using random samples instead of point estimates can lead to more faithful reconstructions of the age-abundance trends (as seen in Figure 15).

SUMMARY
In this paper, we have introduced a framework for estimating probabilistic stellar ages from stellar parameters.We show that this framework is able to accurately and precisely recover ages, even in the face of observational noise, by applying it to the simple test case of a simulated dataset of Milky Way stars.Using SHAP values, we interpret the model predictions and show how each variable drives age predictions, and how the importance of each of those variables varies across parameter space.While in the fiducial case the PRFR model uses both stellar evolution and Galactic chemical correlations, we show that in a dataset where chemistry is explicitly decorrelated from ages, the only age information that is contained in [Mg/Fe] exists in the correlations it has with [Fe/H] (and by extension, stellar evolution).
Using these explanatory tools to compare the results before and after chemical decorrelation, we find that for the Milky Way sample, log g and T eff are almost entirely responsible for driving predictions.We then test our framework on a more difficult problem by applying it to

Figure 2 .
Figure 2. The simulated Milky Way sample.The left panel shows [Fe/H] against [Mg/Fe], with points colored by mean age within each hexagonal bin.The right panel shows the sample on a Hertzsprung-Russell diagram, where stellar parameters are generated conditional on ages and [Fe/H].The points are again colored by mean age.All variables include observational noise.The inset axes in each of the panels show the marginal distributions.

Figure 3 .
Figure 3. Predictions from PRFR model for Milky Way sample.The red diagonal line in the main panel is a 1:1 line.The blue crosses show the true mass and the mode estimate, with uncertainties corresponding to the simulated measurement error and the standard deviation of the posterior.In the top left, we show the residual mass as a function of different parameters.On the right, we show mass posteriors for several stars in blue, and the true mass for those stars as red vertical lines.
The figure shows how the normalized SHAP values vary across parameter space.Each column represents the SHAP values for a different feature.The top row shows the SHAP values as a function of [Fe/H] and [Mg/Fe], and the bottom row shows the same values as a function of T eff and log g.This plot indicates the regions of parameter space where the inclusion of a feature increases or decreases mass predictions-relative to other variables-and by how much.In Figure 6, we show the same plot, but for the decorrelated dataset.A priori, we expect the decorrelation process to reduce the importance of [Mg/Fe] and shunt the information into [Fe/H].This is because [Mg/Fe] becomes decorrelated from the ages and is thus mostly uninformative, except for the correlations that it retains through [Fe/H].

Figure 5 .Figure 6 .
Figure 5. Top row: normalized SHAP values as a function of [Fe/H] and [Mg/Fe].Bottom row: normalized SHAP values as a function of T eff and log g.Each column represents the SHAP values for a different feature; moving from left to right, they are [Fe/H], [Mg/Fe], log g, and T eff .The color scale indicates the magnitude of the normalized SHAP value.This plot is created with models trained on the noisy Milky Way data with chemical correlations.

Figure 7 .
Figure 7. Split violin plot showing normalized SHAP values for the different parameters.The blue distributions on the left side of each violin represent the distribution of the SHAP values for the PRFR models trained on the fiducial dataset, and the red distributions represent the distribution of the SHAP values for the PRFR models trained on the data where chemical correlations are removed.The numbers above the violins are the means of the distribution of the absolute value of the normalized SHAP value.The dots in the middle of each violin represent the median of the distribution, and the errorbars show the 16th and 84th percentile.This plot is created with models trained on the noisy Milky Way dataset.
, the [Mg/Fe] and [Fe/H] SHAP values are clustered around zero, as indicated by the near-white colors in the four left-most panels in each of the plots.The results from the RC sample show that the SHAP values for the chemically decorrelated dataset are somewhat similar to those for the fiducial dataset, particularly for log g and T eff .However, the SHAP values for [Fe/H] and [Mg/Fe] are clearly different; the strong gradient in the SHAP value for [Mg/Fe] (seen in the upper left panel of Figure 13) essentially disappears in Figure 14.At the same time, the magnitude of the SHAP values for [Fe/H] are larger in the decorrelated version; especially for low [Fe/H], the SHAP values in Figure 14 are closer to -1 than those in Figure 14.These two results, in combina-tion, suggest that information content from [Mg/Fe] has been shunted into [Fe/H] in the model.6.3.Where Is Age Information Coming From?

Figure 9 .
Figure 9.The simulated APOGEE-mocked red clump sample.The two plots in the left column show [Fe/H] against [Mg/Fe], with points colored by age.The plots in the right column show the sample on a Hertzsprung-Russell diagram, where stellar parameters are generated conditional on ages and [Fe/H].The points are again colored by age.The top two plots does not include noise, and shows the true underlying trend from the simulation; the bottom two plots and the (corresponding colorbar) show the noisy parameter values and ages.The inset axes in each panel show the marginal distributions.

Figure 10 .Figure 11 .Figure 12 .Figure 13 .Figure 14 .Figure 15 .
Figure10.Predictions from PRFR model for RC sample.The inset panels on the top left show how the residuals vary across parameter space; we find that there is structure in the residuals.The inset panels on the right show particular cases of poor predictions.In each inset panel the blue line shows a kernel density estimate of the predicted mass distribution, and the red vertical line shows the true mass.In all four cases the true mass is a secondary solution.
Figure 8. Trends in age and [Mg/Fe] across different metallicity bins, represented by the different colors.For each metallicity bin, we plot different age samples against [Mg/Fe] values:(1) the truth (i.e., the data) (2) the predictions from the PRFR model trained on the fiducial dataset, and (3) the predictions from the PRFR models trained on the chemically-decorrelated dataset.For each star, an age prediction is taken to be (a) a random sample from the conditional age distribution, or (b) the mode of the conditional age distribution.This gives us a sample of (age, [Mg/Fe]) pairs, which we then bin in age.We then use the center of each age bin and the median of all [Mg/Fe] values in that age and metallicity bin to create a line.