AspGap: Augmented Stellar Parameters and Abundances for 37 Million Red Giant Branch Stars from Gaia XP Low-resolution Spectra

We present AspGap, a new approach to inferring stellar labels from the low-resolution Gaia XP spectra, including precise [α/M] estimates—the first time these are obtained by such an approach. AspGap is a neural-network-based regression model trained on APOGEE spectra. In the training step, AspGap learns to use not only XP spectra to predict stellar labels but also the high-resolution APOGEE spectra that lead to the same stellar labels. The inclusion of this last model component—dubbed the hallucinator—creates a more physically motivated mapping and significantly improves the prediction of stellar labels in the validation, particularly that of [α/M]. For giant stars, we find cross-validated rms accuracies for T eff, log g, [M/H], and [α/M] of ∼1%, 0.12 dex, 0.07 dex, and 0.03 dex, respectively. We also validate our labels through comparison with external data sets and through a range of astrophysical tests that demonstrate that we are indeed determining [α/M] from the XP spectra, rather than just inferring it indirectly from correlations with other labels. We publicly release the AspGap codebase, along with our stellar parameter catalog for all giants observed by Gaia XP. AspGap enables the discovery of new insights into the formation and chemodynamics of our Galaxy by providing precise [α/M] estimates for 37 million giant stars, including 14 million with radial velocities from Gaia.


Introduction
Understanding the star formation and galactic enrichment history of the Milky Way is essential for gaining insights into the broader context of galaxy evolution and cosmology. For decades, the field of Galactic archaeology has been dedicated to unraveling the formation history of our own Galaxy. This endeavor has been greatly aided by large-scale spectroscopic surveys such as the Sloan Digital Sky Survey (SDSS)/the Apache Point Observatory Galactic Evolution Experiment (APOGEE; Majewski et al. 2017), the Milky Way Mapper of SDSS-V (Kollmeier et al. 2017; Almeida et al. 2023), the Large Sky Area Multi-object Fiber Spectroscopic Telescope (LAMOST; Cui et al. 2012; Deng et al. 2012; Luo et al. 2012), Galactic Archaeology with HERMES (GALAH; Buder et al. 2021), and the 4 m Multi-object Spectroscopic Telescope (4MOST; de Jong et al. 2019). These surveys have provided extensive spectroscopic data and enabled detailed investigations into the chemical and dynamical properties of stars in the Milky Way. In addition to spectroscopic data, the European Space Agency's Gaia mission (Gaia Collaboration et al. 2016) has played a pivotal role in this field by observing billions of stars with unprecedented precision. The Gaia mission has provided measurements of parallax and proper motion, enabling the construction of comprehensive six-dimensional phase-space information on our Galaxy, and thereby revolutionizing our knowledge of the structure of the Milky Way. Combined with this astrometric information, ongoing and future spectroscopic surveys have the potential to significantly expand our understanding of fundamental galactic astronomy. These surveys can broaden the distribution range of atmospheric parameters and chemical abundances, leading to valuable insights into various aspects such as the formation history of the Milky Way (Xiang & Rix 2022), the variation of the stellar initial mass function across different chemical environments and star formation histories (Li et al. 2023), and the discovery of the existence of very massive stars in the early Universe (Xing et al. 2023).
In the recent Gaia Data Release 3 (DR3; Gaia Collaboration et al. 2023), approximately 220 million low-resolution spectra of stars have been made available (De Angeli et al. 2023; Gaia Collaboration et al. 2023; Montegriffo et al. 2023). These spectra, obtained through the combined observations of the blue photometer (BP) and red photometer (RP) of the Gaia mission, provide essential information on stellar parameters, including chemical abundance measurements. The wavelength range covered by the combined BP/RP (XP) spectra spans 3300 to 10500 Å, with a resolution of R ≈ 15-85 (Andrae et al. 2023a). These spectra serve as valuable resources for inferring stellar parameters, distances, and extinctions for stars within the Milky Way (Andrae et al. 2023a).
The potential of using Gaia XP spectra for stellar parameter estimation was initially explored by Liu et al. (2012).
Subsequently, the Gaia General Stellar Parameterizer from Photometry, known as GSP-phot (Andrae et al. 2023a), applied a Bayesian forward modeling approach to fit XP spectra and produced a homogeneous catalog containing effective temperature (T eff), surface gravity, and metallicity estimates for approximately 471 million stars with G-band magnitudes brighter than 19. Although GSP-phot provides valuable information on fundamental stellar parameters, its estimates of metal abundance ([M/H]) are limited by the characteristics of the Gaia XP system and the limited information it provides about [M/H] (Andrae et al. 2023a, 2023b). A theoretical study by Ting et al. (2017) demonstrated that valuable information on element abundances, including [Fe/H] and α-abundance ([α/M]), can be gleaned from low-resolution spectra, although stellar labels become degenerate at R ≲ 100. Remarkably, the precision of these element abundances remains largely unaffected by the resolution of the spectra as long as the exposure time and number of detector pixels are held constant. In fact, low-resolution spectra such as those of Gaia XP offer the advantage of a higher signal-to-noise ratio (S/N) per pixel and a broader wavelength coverage within a single observation. These characteristics highlight the potential of low-resolution spectra for delivering accurate measurements of element abundances.
Accurately determining chemical abundances from XP spectra using traditional model-driven methods, which rely on comparing observed spectra to stellar spectral libraries, presents challenges due to the inherent systematic differences in flux calibration between observations and synthetic spectra. Theoretical spectra and observed spectra often exhibit separate distributions in the high-dimensional flux space because of imperfections in the theoretical spectra and errors introduced by observation conditions and instrument effects (Wang et al. 2023). This calibration issue becomes particularly problematic when attempting to identify α-sensitive spectral lines (Gavel et al. 2021). Given the advantages of low-resolution spectra and the challenges associated with traditional model-driven methods, employing a data-driven approach for estimating chemical abundances from low-resolution spectra becomes a natural and promising choice (Ting et al. 2017). By leveraging the information contained in the data itself, data-driven methods can overcome the limitations of model-driven approaches and provide more robust and accurate estimates of chemical abundances.
Data-driven methods and machine-learning techniques have been widely adopted for deriving stellar labels (parameters) from large volumes of low-resolution spectra. These methods offer an alternative to traditional model-driven methods and have shown great success in accurately estimating stellar properties in recent years (e.g., Ness et al. 2015; Ting et al. 2017, 2019; Zhang et al. 2020; Xiang et al. 2022; Guiglion et al. 2024).
There are two main categories of data-driven methods: empirical forward models and discriminative models. In the empirical forward model approach, models are built to predict the spectrum based on the stellar parameters (Ness et al. 2015; Casey et al. 2016; Ho et al. 2017; Ting et al. 2017, 2019; Zhang et al. 2020, 2023; Li et al. 2021; Xiang et al. 2022). These models utilize a large training data set with known stellar parameters to establish the relationship between the spectra and the stellar labels. By applying these models to new spectra, the stellar parameters can be inferred. On the other hand, discriminative models take spectra as input and output stellar labels (Leung & Bovy 2019; Rix et al. 2022; Andrae et al. 2023b; Yao et al. 2024). These models are trained using a labeled data set, where both the spectra and the corresponding stellar parameters are known. The models learn the complex mapping between the input spectra and the desired output labels, allowing them to predict stellar parameters for unseen spectra. Both empirical forward models and discriminative models have their strengths and applications. Empirical forward models directly predict spectra based on the stellar parameters, which can be useful for studying the physical processes shaping the spectra and for generating synthetic spectra for stellar population synthesis. Discriminative models, on the other hand, provide a more direct approach to inferring stellar labels from spectra, which is beneficial when the focus is on estimating stellar parameters for large data sets efficiently. Recently, Leung & Bovy (2024) demonstrated that a single transformer-based neural network trained on heterogeneous spectroscopic and photometric data sets could perform discriminative tasks, such as predicting stellar parameters from spectra, as well as generative tasks, such as generating spectra from parameters, which suggests new ways of using spectra and stellar labels together.
The Cannon (Ness et al. 2015) is a data-driven generative model that was introduced for spectroscopic data analysis. This approach involves establishing mappings between known stellar labels and spectra using a training data set. Once the data-driven model is trained, it can be applied to infer labels for observed spectra. The Cannon has demonstrated its effectiveness in deriving labels for high-resolution APOGEE spectra (Ness et al. 2015; Casey et al. 2016), as well as for low-resolution spectra from surveys like LAMOST (Ho et al. 2017). Data-driven methods offer several advantages, including the ability to learn patterns or features in spectra and make predictions based on them, as well as improved performance and efficiency as compared to other methods (Casey et al. 2016). One advantage of the forward modeling approach is its interpretability, allowing for the examination of residuals and the identification of new systematics and explanatory variables. It also has the capability to handle missing data (Andrae et al. 2023b). However, both generative and discriminative models have their limitations, including the risk of learning physically implausible relationships (Hogg et al. 2019). Additionally, they may not offer novel insights beyond our current understanding of the underlying physics.
In contrast, the direct discriminative approach can identify features in spectra that are correlated with stellar parameters. However, it may not capture all the relevant parameters and can be susceptible to systematic errors. These supervised-learning models are generally easier to train and better at avoiding overfitting, where the model becomes overly complex and fails to generalize to new data. Forward modeling approaches can address overfitting by incorporating a regularization term (Casey et al. 2016). O'Briain et al. (2021) introduced a hybrid generative domain adaptation method that utilizes unsupervised learning on large spectroscopic surveys to transform simulated stellar spectra into realistic spectra. This method successfully calibrates synthetic data to match observations (Wang et al. 2023), bridging the gap between theoretical models and practical observations. It also enables the identification of missing spectral lines in synthetic modeling (O'Briain et al. 2021). This innovative approach has the potential to enhance data analysis techniques in stellar spectroscopy and other fields that rely on large data sets. Notably, the methodology employed in this study offers a balanced approach to stellar parameterization, neither relying solely on forward models nor on discriminative models.
In this paper, we present a novel data-driven method called AspGap, which enables the simultaneous estimation of stellar labels (T eff, log g, [M/H], and [α/M]) for red giant branch (RGB) stars using Gaia XP spectra combined with APOGEE labels. Our approach lies between the forward modeling and direct supervised methods, leveraging the benefits of both. The architecture of our model functions as a mapping from XP spectra to APOGEE labels, but we incorporate APOGEE spectra during training to enhance the model's performance. This combination allows us to exploit the rich information in both data sets effectively. We demonstrate that Gaia XP spectra can accurately predict T eff and log g, because both are clearly reflected in the overall spectrum profile. However, the estimation of metal abundance ([M/H]) and α-abundance ([α/M]) from XP spectra is more challenging. Despite this challenge, our model achieves a precision in [M/H] and [α/M] comparable to that obtained with LAMOST spectra, which have a higher resolution of R ≈ 1800. For [M/H], the expected median absolute error (MAE) for Gaia XP spectra was estimated to be 0.1-0.2 dex (Liu et al. 2012), and a significantly improved MAE of 0.06 dex has been achieved when using only XP information. However, the estimation of α-abundance ([α/M]) from XP spectra has been found to be a remapping between [α/M] and other parameters rather than a direct causal effect of α information (Gavel et al. 2021).
In our study, we demonstrate that our approach successfully derives meaningful [α/M] values for a large sample of approximately 37 million stars. Our sample size is about 6 times greater than the previous largest available sample with α-abundance measurements. This earlier sample includes approximately six million stars observed with a resolution of R ∼ 1800 in the LAMOST Data Release 5 (DR5) data set (Xiang et al. 2019) and an additional six million stars from Gaia's Radial Velocity Spectrometer, processed through the GSP-Spec module (Recio-Blanco et al. 2023). This substantial increase in sample size provides unprecedented statistical power for studying the galactic enrichment history of the Milky Way. An additional advantage of the data product presented in this paper is its independence from complicated selection effects introduced by crossmatching with multiple catalogs. Our published catalog is based solely on the selection function of Gaia, ensuring a homogeneous and all-sky data set for studying the galactic enrichment history of the Milky Way.
The subsequent sections of the paper are organized as follows: Section 2 focuses on the data set utilized for training AspGap and provides insights into its composition and characteristics. In Section 3, we provide a comprehensive explanation of the AspGap method, including a detailed description of the model architecture and the loss function. In Section 4, we evaluate the performance of AspGap and present a catalog containing the labels and uncertainties of approximately 37 million red giant stars obtained using AspGap. Finally, in Section 5, we conclude the paper by discussing the implications of our results and addressing potential limitations and considerations associated with the use of our data product. The resulting catalogs generated in this study have been published online and can be accessed at Zenodo via doi:10.5281/zenodo.10469859.

Data
In this section we briefly introduce the Gaia XP data and their preprocessing, and the APOGEE training set.

XP: The Gaia BP/RP Low-resolution Spectra
The third data release of the Gaia mission (DR3; Gaia Collaboration et al. 2023) offers low-resolution aperture prism spectra (De Angeli et al. 2023; Montegriffo et al. 2023) for approximately 220 million stars. These spectra are obtained using the BP (330-680 nm) and RP (640-1050 nm) instruments. The Gaia observation coverage includes a staggering 78 billion transits, and the processing pipeline of BP/RP spectra generates calibrations for each individual transit spectrum out of a total of 65 billion epoch spectra (De Angeli et al. 2023). The final data product consists of more than two billion sources obtained by averaging the epoch spectra.
It is important to note that XP spectra differ from classical spectra in terms of their representation. Instead of providing flux values corresponding to specific sampled wavelengths, XP spectra are represented as a continuous function using a set of basis functions (De Angeli et al. 2023). The continuous spectra are then encoded as an array of coefficients, with the first coefficients capturing the majority of the flux and the higher-order coefficients storing narrow spectral features (De Angeli et al. 2023). This representation maximizes the information content of the XP spectroscopy data by efficiently representing the spectra with a reduced number of coefficients (De Angeli et al. 2023).

Training Data Set
For training, we use stars that are common to both the Gaia XP data set and the APOGEE catalog of targets in SDSS Data Release 17 (DR17; Abdurro'uf et al. 2022) and Gaia DR3, after some data cleaning. First, we rule out stars with problematic flags (ASPCAPFLAG) from ASPCAP. We remove stars with WARNING or BAD flags in the data model defined in ASPCAP for T eff, log g, [M/H], and [α/M], as well as stars with the CHI2 and NO_GRID flags. The reason we do not use a strict flag cut (i.e., ASPCAPFLAG ≠ 0) to obtain a clean RGB training sample is that labels on parameter boundaries tend to be harder to estimate, and expanding the boundaries to main-sequence stars would move the labels of the giant stars of interest away from these boundaries. Second, samples are selected using the criteria displayed in Table 1 both to obtain reliable stellar labels from APOGEE and to eliminate stars without meaningful astrophysical parameters.
Third, we further apply the following conditions to select 142,130 stars with good APOGEE labels, and we display the S/N distributions of coefficients in Figure 1. Here, the S/N is defined as the L2 norm of the given coefficients divided by the L2 norm of the corresponding errors. We find the following results: 1. Most of the stars have high global S/N (>100) values for the coefficients. 2. Although the S/N values of the high-order coefficients (11-55), which contain abundance information, are lower than those of the first-order coefficients (1-10), most of them are higher than 1, indicating that there is still valuable information present.
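The S/N definition above can be written out in a few lines. The following is a minimal sketch, not the released AspGap code: the function names are hypothetical, and the thresholds (100 for coefficients 1-10, 1 for coefficients 11-55) are the ones quoted in the Figure 1 caption, applied here to a single 55-element BP or RP array.

```python
import math


def l2_norm(values):
    """Euclidean (L2) norm of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values))


def coefficient_snr(coeffs, errors, start=0, stop=None):
    """S/N of coeffs[start:stop]: L2 norm of the values over L2 norm of the errors."""
    return l2_norm(coeffs[start:stop]) / l2_norm(errors[start:stop])


def passes_snr_cut(coeffs, errors, low_order_min=100.0, high_order_min=1.0):
    """Check one 55-element BP or RP array against the two S/N thresholds."""
    low = coefficient_snr(coeffs, errors, 0, 10)    # coefficients 1-10
    high = coefficient_snr(coeffs, errors, 10, 55)  # coefficients 11-55
    return low > low_order_min and high > high_order_min


# Toy spectrum: strong low-order coefficients, weak but detectable high-order ones.
toy_coeffs = [1000.0] * 10 + [2.0] * 45
toy_errors = [1.0] * 55
print(passes_snr_cut(toy_coeffs, toy_errors))
```

A star would enter the training sample only if both its BP and its RP arrays pass this check.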

Preprocessing
We preprocess the data before training AspGap. We first concatenate the BP and RP coefficients into a 110-element array. We then divide this 110-element XP array by the normalization value 10^{0.5(15 − G)}, where G is the Gaia G-band magnitude.
When the Gaia XP coefficients span a wide range of values, about 4 orders of magnitude from the first to the highest-order coefficients, the higher-order coefficients tend to cluster around similar values. In such cases, using the median and interquartile range as normalization measures can yield more robust results. Some works on data-driven methods preprocess the spectra with standard normalization, performed by removing the mean and scaling to unit variance (e.g., Zhang et al. 2020). However, outliers can often negatively affect the sample statistics, especially for noisy data. Hence we adopt the median value of each jth coefficient, μ_j, as the centering value, and the coefficient range (the difference between the 75th and 25th percentiles) as the scale. Let x_{i,j} be the jth coefficient of the ith spectrum; then we define the scaled coefficient $\hat{x}_{i,j}$ for the jth coefficient as
\[ \hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{q_{75,j} - q_{25,j}}, \]
where q_{25,j} and q_{75,j} are the 25th and 75th percentiles of the jth coefficient over the sample. Stellar labels are scaled in the same way. Because the label measurements are more robust than the high-order XP coefficients, here we choose a label scale of y_{97.5} − y_{2.5}, where y_m is the mth percentile value of label y.
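The preprocessing steps described in this subsection can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names are hypothetical, the XP array is assumed to be divided by the magnitude-based normalization value, percentiles use simple linear interpolation, and constant columns are left unscaled to avoid division by zero.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list of numbers."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def normalize_by_magnitude(bp, rp, g_mag):
    """Concatenate BP and RP coefficients and divide by 10**(0.5 * (15 - G))."""
    norm = 10.0 ** (0.5 * (15.0 - g_mag))
    return [c / norm for c in list(bp) + list(rp)]


def robust_scale_columns(rows, q_low=25.0, q_high=75.0):
    """Center each column on its median and divide by its percentile range."""
    n_cols = len(rows[0])
    scaled = [[0.0] * n_cols for _ in rows]
    for j in range(n_cols):
        column = [row[j] for row in rows]
        center = statistics.median(column)
        spread = percentile(column, q_high) - percentile(column, q_low)
        if spread == 0.0:
            spread = 1.0  # assumption: leave constant columns unscaled
        for i, row in enumerate(rows):
            scaled[i][j] = (row[j] - center) / spread
    return scaled


# XP coefficients use the 25th-75th range; labels use the wider 2.5th-97.5th range.
xp_row = normalize_by_magnitude([1.0] * 55, [1.0] * 55, g_mag=13.0)
toy_column = [[1.0], [2.0], [3.0], [4.0], [5.0]]
scaled_xp_like = robust_scale_columns(toy_column)
scaled_label_like = robust_scale_columns(toy_column, q_low=2.5, q_high=97.5)
```

The same `robust_scale_columns` helper covers both cases because only the percentile pair differs between coefficients and labels.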
Figure 1. Distribution of the S/N of the Gaia XP coefficients for the sample, for the BP portion of the spectrum (shown on the left) and for the RP portion of the spectrum (on the right). The black dashed histograms represent the sample's distribution when the S/N is averaged across all 55 coefficients in BP and RP. The blue and red histograms show the analogous distributions considering only the first 10 coefficients and the remaining higher-order coefficients, respectively. The vertical lines of different colors represent the S/N thresholds employed in our training sample. Specifically, we consider spectra with S/N values greater than 100 for both BP and RP first-order coefficients (1-10), while for coefficients 11-55, an S/N threshold of 1 is applied.
Note. Criterion 5 is used to identify stars that are situated on the main sequence or that reside on the RGB.

Method: Building the AspGap Model
Our goal is to create a data-driven model that estimates stellar labels from XP spectra.To achieve this, we have developed a neural-network-based model, AspGap, which leverages the rich abundance information of high-resolution APOGEE spectra.
The AspGap model is constructed in four blocks as shown in Figure 3: a pretrained APOGEE decoder, an XP encoder, a component dubbed the hallucinator, and an XP decoder. The key idea behind the AspGap architecture is to fully exploit the rich abundance information contained in APOGEE spectra to improve stellar parameter estimates from Gaia XP spectra. In a vanilla approach, we might only use XP spectra as input to predict stellar labels with an encoder and decoder. While this approach is valid, it does not impose any constraints on how the prediction is made, e.g., that the prediction draws on physically meaningful parts of the XP spectra. Training a neural network involves determining the weights and biases of a flexible model that fits the training data. Within the set of possible parameter combinations that fit the data well, some may result in overfitting, exhibiting unphysical behavior outside the training data. Our model introduces extra constraints that entail matching the APOGEE spectra, resulting in more robust results on unseen data. This forces the model to learn a representation of the XP spectra that is consistent with how stellar parameters are derived from APOGEE spectra.
By incorporating the hallucinator component, we effectively modify the XP encoder component (which is common to both subsequent branches), as it must be able to allow the prediction of APOGEE-like spectra, where stellar labels are determined from physically well-defined spectral features. We anticipate that this discourages the model from overfitting when mapping XP spectra to stellar labels alone, thereby improving the test-step performance when applied to new XP spectra. By leveraging both XP and APOGEE spectra simultaneously during training, AspGap explicitly exploits the rich information in APOGEE spectra. As we show below, this indeed substantially improves the stellar label estimates, in particular the [α/M] abundance measurements, from the low-resolution XP spectra.
As Figure 3 illustrates, AspGap starts with an XP encoder that maps the low-resolution XP spectra to a latent space. Then, the hallucinator network generates an APOGEE-like spectrum from this latent space representation. This spectrum then gets mapped to stellar labels, using a decoder that has been pretrained on real APOGEE spectra. In a second branch of AspGap, the initial latent embedding from the encoder gets mapped to the stellar labels directly via an XP decoder.
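To make the two-branch data flow concrete, the sketch below traces one XP spectrum through the four blocks. It is a toy illustration only: the class name, the single dense layer per block, the ReLU activation, and the latent and pseudo-APOGEE dimensions (16 and 64) are all assumptions, the weights are random rather than trained, and the uncertainty outputs of the real model are omitted. Only the 110-element input and the four output labels come from the text.

```python
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained placeholder)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x, relu=True):
    """Apply one dense layer, optionally followed by a ReLU activation."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


class AspGapSketch:
    """XP encoder -> (XP decoder | hallucinator -> APOGEE decoder)."""

    def __init__(self, n_xp=110, n_latent=16, n_apogee=64, n_labels=4, seed=0):
        rng = random.Random(seed)
        self.xp_encoder = make_layer(n_xp, n_latent, rng)
        self.xp_decoder = make_layer(n_latent, n_labels, rng)
        self.hallucinator = make_layer(n_latent, n_apogee, rng)
        # In the paper this decoder is pretrained on real APOGEE spectra.
        self.apogee_decoder = make_layer(n_apogee, n_labels, rng)

    def forward(self, xp_coeffs):
        latent = dense(self.xp_encoder, xp_coeffs)
        # Branch 1: latent embedding straight to labels.
        labels_xp = dense(self.xp_decoder, latent, relu=False)
        # Branch 2: latent embedding to an APOGEE-like spectrum, then to labels.
        pseudo_apogee = dense(self.hallucinator, latent, relu=False)
        labels_apogee = dense(self.apogee_decoder, pseudo_apogee, relu=False)
        return labels_xp, labels_apogee, pseudo_apogee


model = AspGapSketch()
labels_xp, labels_apogee, pseudo_apogee = model.forward([0.1] * 110)
```

Because both branches share `xp_encoder`, any training signal that passes through the hallucinator branch also shapes the latent embedding used by the direct XP decoder.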
Note that we do not require explicitly that the spectral features in the hallucinated APOGEE-like spectra vary with changes in labels exactly as expected from "physics." Instead, the hallucinated spectra must only generate accurate stellar labels when inputted into the pretrained APOGEE pipeline decoder, originally trained on genuine APOGEE spectra. Nonetheless, the hallucinated gradient spectra (i.e., the spectra's derivatives with respect to some stellar label) show that the hallucinated APOGEE spectra vary with [M/H], for example, at the wavelengths where physics-based spectral models expect them to. This is shown in Figure 4, and it demonstrates the hallucinator's ability to generate realistic synthetic spectra containing meaningful spectral information related to the stellar parameters.
In each block, we choose to use a multilayer perceptron (MLP) as our feature extractor, rather than other powerful feature extraction techniques like convolutional neural networks (CNNs). We do this mainly because an MLP is both versatile and simple. An MLP is capable of learning nonlinear relationships between input features, which makes it well suited for problems where relationships between features are complex and difficult to capture using linear models. While other feature extraction techniques, such as CNNs, can be very powerful, they may not always be necessary for every problem. In our case, we find that the MLP provides a good balance between performance and simplicity, allowing us to achieve good results without adding unnecessary complexity to our model.
While we may have qualitative reason to expect that the hallucinator enforces better latent embedding in the XP encoder in training, any improvements in the prediction of stellar test labels can only be estimated empirically, as we do below.

Objective Function
During the training process, we optimize the objective function to determine the model parameters θ. We denote the ith target's APOGEE labels as $\hat{y}_i$ and their uncertainties as σ_i, and we denote the predicted labels from the XP decoder as y_{model1,i} and those from the hallucinator and pretrained APOGEE decoder as y_{model2,i}. The overall data-model variance term is denoted s_i. For stability in the training, we introduce a floor ε (10^{−6}) to prevent excessively large values of 1/s_i. With these definitions, the loss function penalizes deviations of predicted labels from true labels and incorporates regularization terms to encourage model parameter sparsity. In the loss function, N represents the size of the training data, and α_1 and α_2 denote the relative weights for the loss contributed by the XP decoder and the APOGEE decoder. A good representation requires that the same stellar labels be predicted along both paths: directly from the XP spectra, and from the XP spectra via the hallucinated APOGEE spectra; hence we have both terms in our loss function. The loss function is thus a weighted sum of the losses computed by both decoders, with weights reflecting the relative importance of each decoder for the task at hand. In practice, these weights are obtained through a hyperparameter grid search, resulting in values of 0.4 for the hallucinator and 0.6 for the XP decoder. The term with ln s_i ensures that the model does not just arbitrarily increase its variance to reduce the loss. ∥θ∥_{L2} is the L2 norm of the model parameters θ, and the regularization term Λ_1∥θ∥_{L2} encourages parameters to shrink toward zero. Specifically, Λ_1 and Λ_2 represent the relative weights assigned to the L2 norm regularization term and the Kullback-Leibler (KL) divergence (D_KL), respectively. Λ_1 and Λ_2 control the balance between the main loss function and the corresponding regularization or divergence term.
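The structure of this objective can be illustrated with a short sketch. The exact functional form is an assumption: here the variance is taken as the sum in quadrature of the label uncertainty and a model-predicted uncertainty, floored at ε, and each branch contributes a squared residual over variance plus a log-variance term. The weights (0.6 and 0.4), the regularization strengths (10^{−10} and 1), and the floor (10^{−6}) are the values quoted in the text; the function and argument names are hypothetical.

```python
import math


def branch_penalty(y_pred, y_true, variance):
    """Squared residual over variance plus log-variance (Gaussian-style term)."""
    return (y_pred - y_true) ** 2 / variance + math.log(variance)


def aspgap_style_loss(y_true, sigma, y_model1, y_model2, sigma_model, params,
                      d_kl=0.0, alpha1=0.6, alpha2=0.4,
                      lambda1=1e-10, lambda2=1.0, eps=1e-6):
    """Weighted two-branch loss with L2 and KL penalty terms (assumed form)."""
    total = 0.0
    for yt, s_obs, y1, y2, s_mod in zip(y_true, sigma, y_model1, y_model2, sigma_model):
        # Assumption: data and model variances add, with a floor for stability.
        variance = max(s_obs ** 2 + s_mod ** 2, eps)
        total += alpha1 * branch_penalty(y1, yt, variance)  # XP decoder branch
        total += alpha2 * branch_penalty(y2, yt, variance)  # hallucinator branch
    data_term = total / len(y_true)
    l2_norm = math.sqrt(sum(p * p for p in params))
    return data_term + lambda1 * l2_norm + lambda2 * d_kl


# With a residual of 1, inflating the predicted variance raises the loss,
# because the log-variance term grows faster than the residual term shrinks.
tight = aspgap_style_loss([0.0], [1.0], [1.0], [1.0], [0.0], [0.0])
loose = aspgap_style_loss([0.0], [1.0], [1.0], [1.0], [10.0], [0.0])
print(tight, loose)
```

The comparison at the end shows why the log term is needed: without it, the model could always lower the loss by predicting larger uncertainties.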
To further encourage the model to estimate accurate uncertainties, we add a KL divergence penalty to the loss function. Specifically, we measure the divergence between the distribution of the uncertainty-normalized difference $\chi = (y_{\rm model} - \hat{y})/s$ and the standard normal distribution $\mathcal{N}(0, 1)$. Ideally, the probability density distribution of χ should be $\mathcal{N}(0, 1)$ if the predicted σ_model are accurate. We assume the uncertainty-normalized difference χ follows a Gaussian distribution $\mathcal{N}(\mu_{\rm res}, \sigma_{\rm res})$, with μ_res and σ_res representing the mean and standard deviation of the Gaussian distribution. In the case at hand, the KL divergence is then given by
\[ D_{\rm KL}\left(\mathcal{N}(\mu_{\rm res}, \sigma_{\rm res}) \,\|\, \mathcal{N}(0, 1)\right) = -\ln \sigma_{\rm res} + \frac{\sigma_{\rm res}^2 + \mu_{\rm res}^2}{2} - \frac{1}{2}. \]
By adding the KL divergence term to the loss function, the model is penalized when the distribution of uncertainty-normalized residuals deviates from the standard normal distribution. This encourages the model to produce more accurate uncertainties, as it must minimize the KL divergence in addition to minimizing the deviation between the predicted and true values. Overall, the KL divergence penalty helps AspGap to estimate accurate uncertainties by pushing the distribution of uncertainty-normalized residuals toward a standard normal distribution (we show the error analysis in Section 4).
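The KL penalty between a fitted Gaussian and the standard normal has a standard closed form, which the sketch below evaluates for a batch of residuals. The function names are hypothetical, the direction of the divergence (fitted Gaussian relative to the standard normal) is an assumption, and the mean and standard deviation are estimated with simple population statistics.

```python
import math
import statistics


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for sigma > 0."""
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0) - math.log(sigma)


def kl_penalty(y_pred, y_true, s):
    """Fit a Gaussian to chi = (y_pred - y_true) / s and compare it with N(0, 1)."""
    chi = [(p - t) / u for p, t, u in zip(y_pred, y_true, s)]
    mu_res = statistics.fmean(chi)
    sigma_res = statistics.pstdev(chi)
    return kl_to_standard_normal(mu_res, sigma_res)


# Residuals of +/-1 with quoted uncertainties of 1 are perfectly calibrated;
# quoting uncertainties of 2 for the same residuals is penalized.
calibrated = kl_penalty([1.0, -1.0], [0.0, 0.0], [1.0, 1.0])
overestimated = kl_penalty([1.0, -1.0], [0.0, 0.0], [2.0, 2.0])
print(calibrated, overestimated)
```

The penalty vanishes only when the normalized residuals have zero mean and unit scatter, so both biased predictions and miscalibrated uncertainties increase it.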

Running AspGap
The implementation of AspGap is numerically straightforward and stable. The core AspGap model is built using PyTorch and can be trained on a single NVIDIA V100 GPU. Prediction of stellar labels on new XP spectra during inference is efficient, taking approximately 0.2 ms per star on a single GPU.
We have open-sourced the full AspGap code at https://github.com/jiadonglee/aspgap to allow others to replicate our results, extend the model for new applications, and deploy it for making predictions on large data sets. During training, we use the Adam optimizer with an initial learning rate of 0.001. We adopt specific values for the hyperparameters, setting Λ_1 = 10^{−10} and Λ_2 = 1. Due to the limitations of our available GPU memory, the batch size for our training samples is set to 16,384. Training is run for 1000 epochs with early stopping based on the validation loss. We do not find any significant numerical instabilities during training or inference. The model implementation and training/inference procedures are encapsulated in Python scripts, making it easy to apply AspGap to new data sets from a programmatic interface or script.
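Early stopping on the validation loss can be sketched independently of the network itself. In the illustration below, `run_epoch` is a hypothetical callback that trains for one epoch and returns the validation loss, and the `patience` value is an assumption, since the text does not state the stopping criterion in detail; only the 1000-epoch cap comes from the text.

```python
def train_with_early_stopping(run_epoch, max_epochs=1000, patience=20):
    """Run epochs until the validation loss stops improving for `patience` epochs."""
    best_loss, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss


# Toy validation curve with a minimum at epoch 5.
visited = []


def toy_epoch(epoch):
    visited.append(epoch)
    return float((epoch - 5) ** 2)


best_epoch, best_loss = train_with_early_stopping(toy_epoch, patience=3)
print(best_epoch, best_loss, len(visited))
```

In a real training script the callback would also save a checkpoint whenever the validation loss improves, so that the best epoch can be restored afterward.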

Results and Validation
In this section we present the results of the AspGap training and application just described. After an initial internal validation, we present the stellar label catalog for a large sample of RGB stars, where we deem our results to be particularly robust and pertinent to astrophysical applications. We then present an extensive comparison with external data sets, followed by some "astrophysical" plausibility tests that imply that the determination of [α/M] (presented here for the first time for XP spectra) for giants is meaningful.

Self-validation and Error Analysis
For internal validation, we evaluate the performance of our AspGap method by repeated twofold cross-validation on the APOGEE training sample. This involves dividing the training data into two equal subsets, and then iteratively swapping the subsets as training and validation sets. Figure 5 displays the cross-validation results for our four stellar labels (T eff,XP, log g XP, [M/H] XP, and [α/M] XP) inferred by AspGap from XP coefficients, plotted against the corresponding APOGEE labels, which are taken to be the ground truth. The rms scatter of the four labels is 48 K for T eff, 0.12 dex for log g, 0.07 dex for [M/H], and 0.03 dex for [α/M]. The comparison immediately confirms that the performance of our stellar label prediction from Gaia low-resolution spectra is comparable to the quality of data-driven labels from, e.g., LAMOST (Ho et al. 2017). For bright stars (G < 14) the scatter is even slightly lower for all four labels (particularly for [α/M] XP): 46 K, 0.10 dex, 0.06 dex, and 0.021 dex, respectively. For all labels, we define an outlier rate η over the full sample of size N via
\[ \eta = \frac{1}{N}\,\bigl|\{\mathrm{deviation}_i : |\mathrm{deviation}_i| > 1\}\bigr| \qquad (5) \]
where deviation_i is the difference between the ground truth and the predicted values for the ith data point divided by a specified threshold for determining outliers, out_bound. For T eff and log g, the threshold is set to 15% of the ground truth value. For [M/H], outliers are identified when the residuals deviate by more than 0.15 dex from the ground truth. Similarly, for [α/M], outliers are detected when the residuals deviate by more than 0.05 dex from the ground truth.
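The outlier rate defined above reduces to counting residuals that exceed a per-label bound. The sketch below is a minimal illustration with hypothetical function and argument names; it supports both the absolute bounds used for [M/H] and [α/M] and the relative 15% bound used for T eff and log g, and it compares the absolute residual with the bound directly, which is equivalent to testing |deviation_i| > 1 for a positive bound.

```python
def outlier_rate(truth, predicted, out_bound=None, relative_bound=None):
    """Fraction of stars whose absolute residual exceeds the outlier bound."""
    if (out_bound is None) == (relative_bound is None):
        raise ValueError("give exactly one of out_bound or relative_bound")
    n_out = 0
    for t, p in zip(truth, predicted):
        # Absolute bound in label units, or a fraction of the ground-truth value.
        bound = out_bound if out_bound is not None else relative_bound * abs(t)
        if abs(t - p) > bound:
            n_out += 1
    return n_out / len(truth)


# Toy [M/H] sample with a 0.15 dex bound: one of four stars is an outlier.
eta_mh = outlier_rate([0.0, 0.1, -0.2, 0.3], [0.0, 0.3, -0.2, 0.3], out_bound=0.15)

# Toy T eff sample with a 15% relative bound: one of two stars is an outlier.
eta_teff = outlier_rate([4000.0, 5000.0], [4700.0, 5100.0], relative_bound=0.15)
print(eta_mh, eta_teff)
```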
The outlier rate for the predicted T eff is very low at 0.05%, meaning that only a tiny fraction of the predicted T eff values deviate from the ground truth by more than the threshold. The outlier rate for the predicted log g is higher at 6.67%, so a larger proportion of the predicted log g values fall outside the specified threshold.
We also show the comparison color coded by the AspGap-derived uncertainties in Figure 6. We find that AspGap-inferred labels that differ more from the true APOGEE labels are assigned larger AspGap uncertainties, confirming that these error estimates are meaningful.
Figure 7 further illustrates how the uncertainties change with S/N, showing the formal AspGap uncertainties, as well as the RMSE, MAE, scatter, and bias, at different S/N. As the S/N increases, the AspGap uncertainties as well as the RMSE, MAE, and scatter decrease rapidly, as expected. The scatters are ∼40 K, 0.10 dex, 0.05 dex, and 0.02 dex for T eff , log g, [M/H], and [α/M] for an (RP coefficient) S/N of ∼800. For the best S/N (>1500) in the RP coefficients, the scatters are 30 K, 0.06 dex, 0.04 dex, and 0.01 dex for T eff , log g, [M/H], and [α/M], respectively. These scatters are comparable to those published for data-driven stellar labels derived from high-resolution APOGEE spectra (Ness et al. 2015; Casey et al. 2016; Ting et al. 2019).
The cross-validation results generally align with the expected S/N scaling for T eff , log g, and [M/H], but such alignment is less apparent for [α/M]. This discrepancy likely arises because [α/M] primarily derives its information from the Mg II triplets, especially for low-resolution spectra or narrowband photometry, as indicated by Yang et al. (2022).
The label uncertainties derived by AspGap (formal errors) as a function of S/N are also shown in Figure 7. The formal errors from AspGap decrease with S/N, similar to the MAEs. At an S/N of 500 the uncertainties of all labels reach a floor, indicating that they are dominated by systematic errors rather than by random uncertainties.
We further validate the uncertainty estimates by looking at the χ² statistic. Figure 8 presents the uncertainty-normalized difference (χ) between the true labels and those predicted by the model. Compared to the standard normal distribution $\mathcal{N}(0, 1)$, we find that the AspGap uncertainty for T eff , log g, and [M/H] is underestimated by about 30% if we assume that the APOGEE labels are exact. The uncertainties of AspGap predictions are influenced by both the S/N and the G-band magnitude of the stars, as shown in Figure 7. For stars with high S/N and bright magnitudes (G < 12), the uncertainties are generally low. However, label precision is best predicted by the spectral S/N rather than by the magnitude itself. This implies that the quality of the spectra, as indicated by the S/N, plays a crucial role in determining the accuracy of the AspGap predictions. Therefore, when selecting samples with small uncertainties in their AspGap predictions, it is essential not only to take into account the G-band magnitude of the stars, but also to ensure a sufficient S/N.
Figure 8 (caption fragment): the orange solid line denotes the best-fitting normal distribution of χ. The formal uncertainties are meaningful, also for [α/M], but about 30% underestimated.
By conducting these validations, we gain access to the ground truth labels, which allows us to evaluate various performance metrics such as the MAE (χ).
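The χ test described above can be illustrated with synthetic data. The sketch below assumes that "underestimated by about 30%" means the distribution of χ has a width of about 1.3 rather than 1; the toy labels, the seed, and the helper name `chi_values` are hypothetical and are not drawn from the AspGap catalog.

```python
import random
import statistics


def chi_values(truth, pred, sigma):
    """Uncertainty-normalized residuals, chi_i = (pred_i - truth_i) / sigma_i."""
    return [(p - t) / s for t, p, s in zip(truth, pred, sigma)]


random.seed(0)
n = 20000

# Toy [M/H]-like labels with reported (formal) uncertainties.
truth = [random.uniform(-1.0, 0.3) for _ in range(n)]
sigma = [random.uniform(0.03, 0.10) for _ in range(n)]

# Simulate predictions whose true scatter is 1.3 times the formal error.
pred = [t + random.gauss(0.0, 1.3 * s) for t, s in zip(truth, sigma)]

chi = chi_values(truth, pred, sigma)

# For well-calibrated errors the width of chi is ~1; a width of ~1.3 means the
# formal errors would need to be inflated by about 30%.
inflation = statistics.pstdev(chi)
print(round(inflation, 2))
```

If the formal errors were well calibrated, the printed width would be close to 1; multiplying the formal errors by the measured width is one simple way to recalibrate them.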

The Catalog of RGB Stars
We now present the stellar labels (T eff , log g, [M/H], and [α/M]) derived in this study, and summarize the catalog (Table 2). First, we use the ADQL query presented in Appendix A to obtain ∼220 million stars with available XP data. We then employ one of the trained models from the twofold training set as the inference model. The labels and their corresponding uncertainties generated by the XP decoder serve as the estimated labels provided by XP. Additionally, we exclude hot stars by selecting those with G BP − G RP > 0.8. Second, we follow conditions 1-5 described in Section 2.2 to select physically reasonable stellar labels. Third, we apply S/N-based quality conditions to select a high-quality sample of approximately 37 million RGB stars. We note that such S/N-based sample cuts produce clean and easy-to-work-with subsamples, but may cause some difficulties if they become part of a modeling selection function (Rix et al. 2021). As final steps we make simple cuts in the H-R (or "Kiel") diagram, as shown in Figure 9, to distill it into a relatively pure RGB sample. We adopt a pseudoluminosity cut, where L pseudo is a scaling value of luminosity that equals $10^{M_G/5+2}$. This cut is effectively equal to M G > −0.003 T eff + 19 for high-quality parallax measurements. This final cut leaves ∼37 million stars, which we illustrate in Figure 10.
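To see why a pseudoluminosity is convenient, the following sketch reads the expression above as L pseudo = 10^(M_G/5 + 2), which is an assumption about the intended formula. With the standard relation M_G = G + 5 log10(ϖ/mas) − 10, this quantity reduces to ϖ·10^(G/5), which is linear in the parallax and therefore remains defined for zero or negative parallaxes. The function names and the example star are hypothetical.

```python
import math


def absolute_magnitude(g_mag, parallax_mas):
    """M_G = G + 5 log10(parallax / mas) - 10; requires parallax > 0."""
    return g_mag + 5.0 * math.log10(parallax_mas) - 10.0


def pseudo_luminosity(g_mag, parallax_mas):
    """Assumed form: 10**(M_G/5 + 2) = parallax * 10**(G/5).

    Linear in the parallax, so no logarithm of a non-positive number is taken.
    """
    return parallax_mas * 10.0 ** (g_mag / 5.0)


# Hypothetical star: G = 12 mag, parallax = 0.5 mas.
g, plx = 12.0, 0.5
m_g = absolute_magnitude(g, plx)
lhs = 10.0 ** (m_g / 5.0 + 2.0)
rhs = pseudo_luminosity(g, plx)
print(lhs, rhs)

# A negative parallax still yields a finite (negative) pseudoluminosity.
neg = pseudo_luminosity(12.0, -0.1)
print(neg)
```

The two expressions agree for positive parallaxes, while only the second can be evaluated for noisy measurements with ϖ ≤ 0.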
We deem this RGB sample to be most suitable for astrophysical analyses where [α/M] may play a role, and we present a description of the catalog in Table 2. The catalog can be accessed via Zenodo at doi:10.5281/zenodo.10469859. Figure 11 illustrates the larger coverage of the Galactocentric Cartesian X-Y plane by the RGB catalog compared to APOGEE. The all-sky nature and sample size of our RGB catalog from Gaia XP enhance our capabilities in Galactic archaeology, particularly when it comes to obtaining α-abundance estimates. Additionally, Figure 12 presents the G magnitude and errors of stellar parameters across 37 million RGB stars, with histograms for G magnitude, σ T eff , σ log g , σ [M/H] , and σ [α/M] , which also includes a subset of 14 million stars with Gaia radial velocity measurements.

Comparison with A23
We compare the labels provided by AspGap with those derived by A23 for 13,300,628 stars in the RGB catalog. To start, we note that both AspGap and A23 utilize XP spectra and train on APOGEE data. However, they use quite different approaches, as A23 employ a direct discriminative machine-learning algorithm trained on APOGEE. As shown in Figure 13, the scatter between AspGap and A23 for T eff , log g, and [M/H] is 73 K, 0.2 dex, and 0.1 dex, respectively, for all the crossmatched stars, with nearly no bias between the two catalogs (only 7 K, 0.02 dex, and 0 dex). The higher-quality labels given by AspGap, with σ T eff smaller than 50 K, σ log g less than 0.1 dex, and σ [M/H] < 0.1 dex, are also shown in Figure 13. For this subset, the scatter in T eff , log g, and [M/H] decreases to 45 K, 0.12 dex, and 0.07 dex. The [M/H] XP values from our study are systematically higher in the metal-poor regime ([M/H] < −2). This is due to the limitation of our training sample: the majority of the APOGEE [M/H] labels in the training data set are larger than −2 dex, whereas A23 add a training sample of very metal-poor stars (Li 2023) to improve the estimates in the metal-poor regime.
Figure 9 (caption fragment): the pseudoluminosity (y-axis) in the second cut is chosen to remain well defined even for ϖ ≤ 0. The cuts lead to a sample of 37 million objects.
To calculate the combined uncertainty when comparing the two catalogs, we can simply use

$$\delta = \sqrt{\delta_1^2 + \delta_2^2},$$

where δ 1 and δ 2 are the uncertainties of the two catalogs. Given the typical error of 50 K, 0.12 dex, and 0.07 dex for T eff , log g, and [M/H] for AspGap and 50 K, 0.08 dex, and 0.1 dex for A23, the combined uncertainties for the three labels (T eff , log g, and [M/H]) are 70 K, 0.14 dex, and 0.12 dex, assuming the errors are uncorrelated. These rough estimates are consistent with the comparison displayed in Figure 13 for T eff and [M/H], but the scatter of log g between A23 and AspGap in the figure is larger by 0.06 dex. After the quality cut of AspGap, the scatters are nearly the same as the self-validation values reported in Section 4.1.
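The combined uncertainties quoted above follow from adding the two catalogs' typical errors in quadrature, which is the usual rule for uncorrelated errors. A short check, using only the typical errors given in the text (the dictionary keys are illustrative names):

```python
import math


def combined_uncertainty(delta_1, delta_2):
    """Quadrature sum of two uncorrelated uncertainties."""
    return math.hypot(delta_1, delta_2)


# Typical errors quoted in the text (T_eff in K, log g and [M/H] in dex).
aspgap = {"teff": 50.0, "logg": 0.12, "mh": 0.07}
a23 = {"teff": 50.0, "logg": 0.08, "mh": 0.10}

combined = {key: combined_uncertainty(aspgap[key], a23[key]) for key in aspgap}
for key, value in combined.items():
    print(key, round(value, 3))
```

The results are about 70.7 K, 0.144 dex, and 0.122 dex, matching the rounded values of 70 K, 0.14 dex, and 0.12 dex. The observed log g scatter of 0.2 dex then exceeds the expectation by roughly 0.06 dex.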
We find that 88% of the A23 stars are contained in our catalog, which implies that much of the RGB sample of A23 also overlaps. However, it should be noted that our catalog is based only on Gaia DR3 data, while the parent catalog of A23 includes Gaia Data Release 2 (DR2), the Two Micron All Sky Survey (2MASS), and ALLWISE, resulting in a smaller parent catalog size compared to the AspGap catalog of approximately 37 million stars, mostly caused by ALLWISE's incompleteness. It is plausible that the main difference in catalog content between A23 and this work is attributable to the S/N cuts we use for selecting stars.

Comparison with LAMOST-LRS
For an independent validation with R ∼ 2000 spectroscopy, we use data sets beyond the APOGEE survey to verify the robustness of our results. In Figure 14 we compare our labels with those of the SLAM catalog (Zhang et al. 2020), which were derived by transferring labels from approximately 90,000 stars in common between LAMOST DR5 and APOGEE Data Release 15. We first select stars with S/N > 100 (in the SDSS g band) from the SLAM catalog, with the flag set to K giant, resulting in 142,608 stars. The S/N cut ensures that the errors of the SLAM labels are less than ∼50 K, 0.1 dex, 0.037 dex, and 0.026 dex for T eff , log g, [M/H], and [α/M], respectively (Zhang et al. 2020). We then crossmatch the selected LAMOST K giant stars with our catalog, resulting in 116,061 common stars. Comparisons between the AspGap labels from Gaia XP and the SLAM labels from LAMOST are shown in Figure 14. Generally, the differences between the AspGap labels and LAMOST labels are similar to the validation results shown in Figure 5. For T eff , the difference is larger at the edges of the label range, i.e., T eff < 4200 K and T eff > 5000 K. Similarly, the scatter of log g is higher in the range log g < 1.5. One reason for this could be that the K giant stars from the LAMOST catalog are selected using a hard cut on T eff and log g (Figure 6 in Zhang et al. 2020), which may erroneously include main-sequence stars mixed with RGB stars. To test this conjecture, we further restrict the sample to stars with a T eff difference of less than 200 K; the number of outliers is 2006. We find that the number of stars with a large [M/H] difference from LAMOST decreases after the cut. Our comparison with LAMOST illustrates the remarkable consistency of the AspGap labels with the literature, even outside of the APOGEE observations. However, we should note that both AspGap and SLAM are limited by the truncation of the APOGEE training data set.
For [M/H] and [α/M], we find some spurious differences in the metal-poor regime ([M/H] < −1). We suspect that the AspGap-derived [M/H] suffers from significant errors for metal-poor stars, which is consistent with our validation results shown in Figure 10. Additionally, we find large discrepancies for [α/M] > 0.2, but that might be a result of large errors in [M/H] for metal-poor stars, where [M/H] and [α/M] are strongly degenerate in this challenging regime for abundance estimation. Although the scatter of [M/H] between AspGap and SLAM is only 0.07 dex, there is a 0.06 dex bias between AspGap and SLAM. The bias might come from the updated pipeline between APOGEE DR17 and Data Release 14 (DR14). We find a similar bias for [α/M]: the bias between AspGap and SLAM is −0.03 dex.

Comparison with LAMOST-MRS
We further compare our results with those of Cycle-StarNet from Wang et al. (2023), which utilized MARCS model atmosphere theoretical synthetic spectra combined with a domain adaptation method to estimate the fundamental stellar parameters (T eff , log g, and [Fe/H]) and 11 chemical abundances for 1.38 million stars from the Medium-resolution (R ∼ 6500) Spectroscopic Survey (MRS) in LAMOST-II Data Release 8.
To perform our comparison, we crossmatch our RGB catalog with the data set from Wang et al. (2023) and obtain a sample of 301,478 stars. We then apply specific selection criteria based on the flags provided by Wang et al. (2023), namely Flag_Teff=0, Flag_logg=0, Flag_FeH=0, Flag_MgFe=0, SN_blue > 100, and SN_red > 100, and constrain the errors of [Fe/H] and [Mg/Fe] to smaller than 0.05 dex and 0.04 dex, respectively. After applying these data quality cuts, we obtain a final sample of 40,035 stars for our comparative analysis as shown in Figure 14.
As depicted in Figure 14, the comparison results with LAMOST-MRS exhibit similarities to those obtained from LAMOST-LRS, despite the application of different methods and spectra. Regarding the effective temperature (T eff ), the scatter shows a marginal difference, attributed to a slightly larger bias ranging from 0 to 83 K. In terms of log g, the scatter is slightly larger, measuring up to 0.17 dex. For metallicity ([M/H]), the scatter remains comparable to the comparison with LAMOST-LRS; however, there appears to be a smaller representation of metal-poor stars ([M/H] < −1) in the LAMOST-MRS sample. Since Wang et al. (2023) did not provide an overall α-abundance, we compare the AspGap-derived [α/M] with [Mg/Fe] as a reference. We find that the scatter is similar to that obtained for SLAM. In summary, we find good agreement between the T eff , log g, [M/H], and [α/M] derived from AspGap and those obtained from LAMOST-MRS.

Comparison with GALAH
We further conduct a comparison with the results obtained by Buder et al. (2021). GALAH DR3 consists of 768,423 high-resolution (R ∼ 28,000) optical spectra obtained from 342,682 stars. The stellar parameters in GALAH DR3 were estimated using the Spectroscopy Made Easy model-driven approach in combination with one-dimensional MARCS model atmospheres. Additionally, Buder et al. (2021) incorporated astrometric information from Gaia DR2 and photometric data from 2MASS to mitigate spectroscopic degeneracies, accounting for LTE/non-LTE effects in their computations.
To perform a comparative analysis, we crossmatch our results with GALAH DR3 and identify 13,504 stars with corresponding stellar parameters. Quality flags are applied to ensure the reliability of the GALAH labels, including flag_sp=0, flag_fe_h=0, flag_guess=0, and red_flag=0. Moreover, we impose constraints on the errors of T eff , log g, [Fe/H], and [α/Fe] given by GALAH DR3, setting them to smaller than 200 K, 0.25 dex, 0.2 dex, and 0.05 dex, respectively. After applying these quality criteria, we obtain a subset of 13,504 stars for the comparative analysis as displayed in Figure 14.
The comparison between AspGap and GALAH reveals a consistent pattern, although the level of consistency is weaker than that with A23, LAMOST-LRS, and LAMOST-MRS. The scatter values for T eff , log g, [M/H], and [α/M] amount to 77 K, 0.18 dex, 0.13 dex, and 0.07 dex, respectively. In terms of bias, the four labels show a slight deviation, with a bias of −30 K for effective temperature (T eff ).
A comparative analysis of the consistency of labels across various surveys is presented in Wang et al. (2023). LAMOST-MRS and APOGEE exhibit the highest level of consistency in their labels, followed by GALAH. The SLAM labels (derived from LAMOST-LRS) are trained using labels from APOGEE DR14, hence the expected consistency between them.
In conclusion, we perform independent validation by comparing the labels derived by AspGap with the LAMOST-LRS and LAMOST-MRS data sets and GALAH DR3. The comparison results demonstrate the remarkable consistency of the AspGap labels with the literature, indicating the accuracy and reliability of the AspGap model in providing stellar labels. However, it is important to note that AspGap has limitations in its training data set, which is based on the truncation of the APOGEE data set.

The Accuracy of Metallicity Assessed from Clusters
To assess the accuracy of the abundances estimated in our work, we explore the abundance derived with AspGap for stars in open clusters. Stars in open clusters serve as benchmark stars, due to their approximately chemically homogeneous nature. In Figure 15, we compare the [M/H] from AspGap with the literature values for 67 known open clusters (Donor et al. 2020). The differences between the AspGap and the literature values, as well as the comparison of the APOGEE [M/H] to the literature values, are shown in Figure 15. We find that although the deviations of the estimates from AspGap are larger than those of APOGEE, 66 of the 67 clusters have AspGap [M/H] within 0.2 dex. However, for 12 selected clusters with more than three member stars, the deviation of the AspGap-derived [M/H] from the literature values is within 0.07 dex as shown in Figure 15. There is no metallicity dependence for the deviations of estimates from AspGap.
Testing AspGap on open clusters shows that the error of the [M/H] estimate is mainly a random error when compared to the APOGEE abundances.

Verification of the [α/M] Abundances
The task of determining the [α/M] abundance is challenging for low-resolution spectra like those of Gaia XP. This is due to the potential degeneracy between the [α/M] abundance and other parameters, such as the metallicity ([M/H]), as shown in studies by Ting et al. (2017) and Gavel et al. (2021). In this section, we evaluate the precision and accuracy of our [α/M] abundance estimates, thereby validating the [α/M] derived from AspGap. We conduct three main tests to validate the [α/M] abundance prediction. These tests in particular include circumstances where we know how completely independent properties of the stars (such as their positions or velocities) correlate with [α/M]; we then check whether these correlations are seen at the expected level.
First, we examine how well AspGap can distinguish between different chemical components of the disk, specifically the so-called low-α and high-α disk, within the same range of [M/H]. This analysis is discussed in detail in Section 4.5.1. Then we explore the relationship between the orbit dynamics and [α/M] abundance for disk stars: do high-α stars (at a given [M/H]) form a hotter disk than low-α stars of the same [M/H]? For a detailed discussion, please refer to Section 4.5.2. While our comparison with LAMOST and GALAH data demonstrates agreement between AspGap-derived [α/M] abundance and these surveys, it is important to note that the majority of the LAMOST and GALAH samples consist of disk stars with [M/H] > −0.8, leaving the question open of whether our [α/M] estimates can also differentiate stellar populations in the metal-poor regime. To offer further independent validation, we specifically choose stars from halo substructures and the Large Magellanic Cloud (LMC), which exhibit distinct star formation histories in comparison to disk stars within the Milky Way. The validation process and the corresponding results are discussed in Section 4.5.3.

Assessing Accuracy through the Bimodality of Disk Stars
The well-established bimodality of [α/M] abundance in disk stars serves as the initial validation of our ability to separate chemically low-α (or "thin-disk") stars from high-α (or "thick-disk") stars. To accomplish this, we employ a twofold validation approach using AspGap labels. Specifically, we compare a selected sample of disk stars with −0.9 < [M/H] < 0 against the APOGEE training labels to evaluate the effectiveness of differentiating between low-alpha and high-alpha disk populations (the details can be found in Appendix B).
As shown in Table 3, for the low-alpha disk, we find that 97% of the XP-identified group corresponds to the low-alpha class according to the APOGEE labels, while 96% of the entire low-alpha disk sample identified by APOGEE is correctly classified by XP. Similarly, for the high-alpha disk, 93% of the XP-identified group represents the high-alpha class, and 94% of the complete high-alpha sample identified by APOGEE is accurately recognized by XP. These validation results demonstrate that the AspGap [α/M] exhibits a high level of precision and recall in distinguishing between low-alpha and high-alpha disk populations. Further details can be found in Appendix B.

[α/M] Validation via Orbit Dynamics
From the RGB catalog compiled from Gaia XP, we focus on stars within two fixed [M/H] groups: −0.7 < [M/H] < −0.5 and −0.5 < [M/H] < −0.3. We crossmatch these groups with the sample from Kordopatis et al. (2023) to obtain orbital parameters based on Gaia DR3. We obtain a total of 960,661 and 1,644,704 stars in the two [M/H] groups, respectively.
Figure 16 illustrates a clear trend: within the range −0.8 < [M/H] < −0.3, the high-α disk is vertically hotter (higher vertical action, J z ), has older ages, and has lower angular momentum (L z ), while the low-α disk is vertically cooler (lower J z ), has younger ages, and has higher angular momentum.
For reference, we overlay the analogous APOGEE results in Figure 16. In general, Gaia XP and APOGEE exhibit similar trends on the [α/M]-J z diagram, indicating a consistent relationship between α-abundance and vertical motion. Gaia XP also allows for a distinction between different [M/H] groups. On the α-abundance-L z panel, APOGEE consistently yields higher values compared to Gaia XP. This offset may be attributed to a selection effect, as Gaia XP contains a larger number of stars in the inner disk with lower values of angular momentum (L z ).
Furthermore, we compare the distributions of α-abundance and Galactocentric Cartesian X positions between Gaia XP and APOGEE for the group with −0.5 < [M/H] < −0.3. We observe that the sampling in Gaia XP is more uniform across …
While our comparison with LAMOST and GALAH data demonstrates agreement between AspGap-derived [α/M] abundance and these surveys, it is important to note that the majority of the LAMOST and GALAH samples consist of disk stars with [Fe/H] > −0.8. To offer further independent validation, we specifically choose stars from halo substructures and the LMC, which exhibit distinct star formation histories compared to disk stars within the Milky Way.
Verifying the accuracy of [α/M] measurements in extreme environments, such as Gaia Enceladus/Sausage (GES) and the LMC, holds significant importance. These regions are unique testing grounds for our [α/M] estimates as they have distinct stellar populations from the disk, which dominates our training set. GES stars are uncommon halo stars that are beyond the range of thin- and thick-disk stars, which dominate the training sample. Similarly, LMC stars are positioned at significantly greater distances than most training examples. Moreover, general associations observed in typical Milky Way stars, such as a tendency of farther stars to exhibit a higher [α/M], cannot be presumed to be applicable to these extragalactic populations. GES and LMC stars present novel challenges to the model and extend it beyond the standard training data. If AspGap is still able to recover accurate [α/M] values for these outliers, this will present compelling proof that the actual abundance is being measured instead of being only inferred from bulk correlations in the training set. These tests are pivotal in demonstrating that XP determines elemental abundances and not simply correlates labels.
First, we select high-confidence GES member stars on the basis of their radial orbits. Specifically, we select stars with E tot > −1.2 × 10^5 km^2 s^−2 and |L z | < 0.5 × 10^3 kpc km s^−1 , which should confidently exclude the disk and select the most energetic and radial GES stars. This selection is motivated by the expectation that GES stars, which were formed in the smaller potential of a former dwarf galaxy, would exhibit lower α-abundances at fixed [M/H] (e.g., Hasselquist et al. 2021).
We apply further cuts to the sample, requiring σ [M/H] < 0.1, σ [α/M] < 0.05, and log g < 2 to ensure good-quality labels. The results for this sample obtained from AspGap are depicted in Figure 17, with APOGEE labels overplotted as reference. This figure shows that the vast majority of the stars kinematically selected as GES members indeed lie below the diagonal line, where high-resolution studies and APOGEE analyses expect them to reside. This supports the [α/M] values derived by AspGap for these stars. We also find that there are a few stars that exhibit chemical characteristics similar to those of the high-alpha disk (above the diagonal line) in both APOGEE and Gaia XP. This can be attributed to the dynamical selection of GES stars in the energy-angular momentum plane, which results in a purity of approximately 24% and completeness of 41% as demonstrated in simulations (Carrillo et al. 2024), i.e., these stars' locations on the [α/M]-[M/H] plane may merely reflect the limitations of the kinematic selection.
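The kinematic and quality cuts used for the GES test can be written as two simple predicates. The sketch below uses the thresholds quoted in the text; the dictionary keys (`e_tot`, `l_z`, `sigma_mh`, `sigma_am`, `logg`) and the three example stars are hypothetical and do not correspond to actual catalog columns or entries.

```python
def is_ges_candidate(e_tot, l_z):
    """Energetic, radial orbits: E_tot > -1.2e5 km^2 s^-2, |L_z| < 0.5e3 kpc km s^-1."""
    return e_tot > -1.2e5 and abs(l_z) < 0.5e3


def has_good_labels(sigma_mh, sigma_am, logg):
    """Label-quality cuts: sigma_[M/H] < 0.1, sigma_[alpha/M] < 0.05, log g < 2."""
    return sigma_mh < 0.1 and sigma_am < 0.05 and logg < 2.0


# Toy stars for illustration only.
stars = [
    {"id": "a", "e_tot": -1.0e5, "l_z": 120.0, "sigma_mh": 0.06, "sigma_am": 0.03, "logg": 1.5},
    {"id": "b", "e_tot": -1.6e5, "l_z": 1500.0, "sigma_mh": 0.05, "sigma_am": 0.02, "logg": 1.8},
    {"id": "c", "e_tot": -0.9e5, "l_z": -300.0, "sigma_mh": 0.15, "sigma_am": 0.04, "logg": 1.2},
]

selected = [
    s["id"]
    for s in stars
    if is_ges_candidate(s["e_tot"], s["l_z"])
    and has_good_labels(s["sigma_mh"], s["sigma_am"], s["logg"])
]
print(selected)
```

Here star "b" fails the orbital cut (a disk-like orbit) and star "c" fails the [M/H] uncertainty cut, so only star "a" is kept.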
Second, we crossmatch our AspGap catalog with the catalog of Rix et al. (2022) to analyze the [α/M] abundances of stars in the "poor old heart" of the Milky Way. This crossmatching reveals a common sample of 1,144,026 stars. For the analysis of metal-poor stars in the inner disk, we specifically select 92,975 stars with [M/H] < −1.1 based on the results of Rix et al. (2022). We anticipate observing two distinct regimes within this metal-poor sample (Rix et al. 2022). The first regime consists of tightly bound metal-poor stars that have a broad eccentricity distribution ranging from 0.1 to 0.8 (i.e., that are approximately isotropic). These stars are members of the "poor old heart." Our AspGap estimates show them to be predominantly high-[α/M] stars, as expected. The other regime is one of loosely bound, radially anisotropic stars with eccentricities greater than 0.75 and apocenter radii R apo exceeding 10 kpc: these represent the pericenter members of GES. Consequently, we expect them to be metal-poor, low-[α/M] stars. Both aspects are confirmed by the top left panel of Figure 17, where we plot eccentricity versus R apo with the color coding representing the AspGap-derived [α/M].
… from foreground Milky Way stars based on Gaia DR3 kinematic data. From the common sample of 92,063 stars with high member probability >0.9, we select a subset of 77,277 stars with precise label estimates (σ [M/H] < 0.1, σ [α/M] < 0.05, and log g < 2). We also make sure to include 1822 stars from APOGEE DR17. The comparison of the high-probability LMC member stars with AspGap [α/M] XP predictions and APOGEE [α/M] measurements in Figure 17 yields consistent results. The [α/M] predictions from AspGap align with the expected values for the LMC (Russell & Dopita 1992) and are in agreement with the measurements obtained by APOGEE. This agreement provides further confirmation of the accuracy of the [α/M] predictions, even in challenging validation scenarios.

Caveats
We have demonstrated how well the low-resolution XP spectra can be used to predict stellar labels, now also including [α/M]. However, our model is inevitably based on a series of assumptions, and we remind the reader here of the caveats when using our catalog.
1. Our approach is based on the assumption that the low-resolution spectra are single-star spectra set by only four labels: T eff , log g, [M/H], and [α/M]. We assume stellar rotation and detailed chemical abundances to be negligible, and interstellar extinction to be a nuisance parameter. To break the degeneracy between temperature and extinction, additional infrared photometry, such as from 2MASS and ALLWISE (Andrae et al. 2023b), can be incorporated. However, our main objective is to create an RGB catalog solely from Gaia data. Consequently, in regions with high extinction, we may encounter challenges in accurately estimating stellar labels. Although Rix et al. (2022) have shown that [M/H] estimation remains unbiased even in the presence of significant extinction (e.g., A V = 3), we acknowledge the potential limitations of our stellar label estimates in high-extinction regions. In Appendix C, we present a comparison with the results of Andrae et al. (2023b) in relatively low extinction regions characterized by Galactic latitudes |b| > 30, as well as in high-extinction regions with |b| < 10. We find no systematic offset between the two Galactic latitude groups, indicating that any discrepancies in label estimation primarily arise from the increased scatter caused by extinction. The presence of extinction introduces an inherent latent variable that is intertwined within the spectra, which we neglect. Unlike the derivation of stellar labels through direct forward modeling, it is crucial to consider the existence of extinction as an important parameter. AspGap simplifies the model by choosing to ignore the presence of extinction, as it does not significantly influence our conclusions, as demonstrated in the validation process. We refer users whose scientific focus revolves around the extinction of Gaia XP to the work of Zhang et al. (2023) and related references therein.
2. We make the assumption that the training data, obtained from the overlap of Gaia and APOGEE data sets, constitutes a representative and sufficiently diverse sample of stars. Our main focus is on deriving accurate stellar parameters and abundances for RGB stars, which are well covered by APOGEE observations. To ensure the purity of the RGB catalog, we exclude white dwarfs and hot stars from our training set. Although this exclusion is expected to have minimal impact on the training process, there is a possibility that some hot stars may be inadvertently included in the RGB catalog if they fall within the predefined boundaries. However, we have taken precautionary measures by examining the G BP − G RP color index for the RGB catalog, and the fraction of misclassified hot stars is found to be negligible.
3. We impose a cutoff at [M/H] values larger than −2 for all stars in our analysis. … (2024), which specifically address the very metal-poor population.
4. Our assumption of all sources being single stars overlooks the presence of a significant fraction of stars in binary systems. While our model can accurately predict labels for the primary star in binaries with high mass (and light) ratios, the performance may be impacted for systems with close-to-equal mass ratios. This limitation is inherent in many data-driven methods. A possible strategy to tackle this issue is to explicitly consider each spectrum as a binary system. This approach involves comparing the goodness of fit, such as the reduced χ², between single-star and binary solutions for each spectrum (El-Badry et al. 2018). By doing so, it becomes feasible to identify potential binary systems (Z. Niu et al. 2024, in preparation). This avenue holds promise for further refining our understanding of binaries within the XP sample.
5. For metal-poor ([M/H] < −1) stars, the [α/M] uncertainties are relatively higher for stars with higher surface gravity (log g > 2.5) compared to metal-rich stars. This is especially true for α-poor stars, as observed in the validation process described in Section 4.5.3. This could be attributed to the fact that high log g values may result in pressure-broadened wings of strong metal lines (e.g., Mg I and Ca I; see Gray 2008 for details). In addition to line broadening effects, the inherent weak metal features in metal-poor stars can pose challenges in accurately estimating [α/M], given the existing correlation between [α/M] and [M/H]. For studies focusing on metal-poor stars, we recommend selecting stars with log g values lower than 2.5 and carefully considering the error estimation provided by AspGap to ensure the accuracy and validity of derived parameters.

Conclusion
This paper presents the AspGap model, a data-driven approach for performing nonlinear regressions that estimate stellar labels from the low-resolution spectroscopic data provided by the BP/RP (XP) spectra from Gaia DR3.Our approach has two new aspects as compared to published analyses: it also yields precise estimates for [α/M], and it employs a hallucinator.By utilizing a pretrained model based on high-resolution APOGEE spectra, AspGap, with the hallucinator component, achieves remarkable accuracy in predicting the effective temperature, surface gravity, metallicity, and α-abundance of stars.Through twofold crossvalidation, the model demonstrates accuracies of approximately 50 K, 0.12 dex, 0.07 dex, and 0.02 dex, respectively.Our study results in a comprehensive catalog containing fundamental parameters (T eff and g log ) and abundance predictions ([M/H] and [α/M]) for approximately 37 million RGB stars.This extensive data set, accompanied by the open-source code, is publicly accessible via Zenodo at doi:10.5281/zenodo.10469859.
The extensive catalog of stellar labels for ∼37 million RGB stars generated in this study provides the astronomical community with a valuable multipurpose data resource. The unprecedented scale of all-sky [α/M] measurements will facilitate novel insights into the formation history and chemodynamics of the Milky Way. Additionally, the public release of the open-source AspGap code will promote further methodological advancements in analyzing low-resolution spectra. This work demonstrates that meaningful [α/M] abundance information can be extracted from Gaia's low-resolution spectroscopic data. By developing innovative techniques tailored to these spectra, such as AspGap, we can unleash the full potential of the vast XP data set to illuminate our Galaxy's chemical evolution. The catalog and methods presented here will enable a diverse range of Galactic archaeology research and support upcoming spectroscopic surveys.

stars belonging to the low and high disk groups, as classified by APOGEE, undergo twofold cross-validated estimations of [M/H] and [α/M] using AspGap. To evaluate the performance of AspGap in distinguishing between high-alpha and low-alpha disk stars, we employ precision and recall metrics. Precision measures the proportion of correctly classified stars out of all stars classified into a particular category, while recall measures the proportion of correctly classified stars out of all stars belonging to that category.
As illustrated in Figure 18, our validation results highlight the remarkable performance of AspGap in differentiating between the low-alpha and high-alpha disk populations. For the low-alpha disk stars, 97% of the stars identified by AspGap correspond to the low-alpha class as defined by APOGEE labels. Furthermore, AspGap accurately classifies 96% of the entire low-alpha disk sample identified by APOGEE. Similarly, for the high-alpha disk, 93% (8808 out of 9434) of the stars identified by AspGap represent the high-alpha class, and 94% (8808 out of 9343) of the complete high-alpha sample identified by APOGEE are correctly recognized by AspGap.
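As a worked check of the precision and recall definitions, the short Python sketch below recomputes the high-alpha disk values from the counts quoted above. The function and argument names are illustrative and are not part of the AspGap code.

```python
def precision_recall(n_correct, n_predicted, n_actual):
    """Precision = correct / all predicted in the class;
    recall = correct / all true members of the class."""
    return n_correct / n_predicted, n_correct / n_actual


# High-alpha disk counts quoted in the text:
# 8808 correct, 9434 identified by AspGap, 9343 identified by APOGEE.
precision, recall = precision_recall(8808, 9434, 9343)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
# precision = 93.4%, recall = 94.3%
```

These values round to the 93% and 94% quoted for the high-alpha disk.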

Figure 2. AspGap training set for this application, taken from SDSS APOGEE DR17. The left panel shows the T eff-log g Kiel diagram of the training data, color coded by [M/H]. The right panel shows the number density distribution of the [M/H]-[α/M] abundance diagnostics in the training set. AspGap is designed to estimate T eff, log g, [M/H], and [α/M] values from XP spectra that are consistent with the SDSS APOGEE training data.

Figure 3. The architecture of the AspGap model, consisting of four blocks: a pretrained APOGEE decoder that takes APOGEE(-like) spectra to generate predictions for stellar labels; an XP encoder that generates 1024 shared latent variables from the 2 × 55 XP coefficients; the hallucinator, which generates APOGEE-like spectra from these embedded variables to be fed into the pretrained APOGEE decoder; and an XP decoder that generates stellar label predictions from the XP encoder's 1024 latent variables. During the training process, both sets of stellar label predictions, obtained from the XP decoder and the hallucinator/APOGEE decoder, are utilized in the objective function of AspGap. The fundamental concept behind the design of AspGap is the prospect that incorporating the stellar label generation through the hallucinator route will lead to a more robust and physically meaningful XP encoder following the training phase. In the testing phase and for inference on other data, we only use the labels generated by the XP decoder. This is because, once the model is well trained, the mean-square error between the hallucinator's predictions and the ground truth and that between the XP decoder's predictions and the ground truth become approximately the same.
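The data flow described in this caption can be sketched in a few lines of pure Python. This is a toy illustration only: each block is replaced by a single random linear layer, the length of the APOGEE-like spectrum (`N_PIX`) is a hypothetical small number, and the equal weighting of the two loss terms is an assumption. None of this reproduces the released AspGap implementation.

```python
import random


def linear(n_in, n_out, rng):
    """Toy dense layer with random weights, standing in for a trained block."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return lambda x: [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(pred, truth):
    """Mean-square error between two equal-length label vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


rng = random.Random(0)
N_XP = 2 * 55      # XP coefficients (from the caption)
N_LATENT = 1024    # shared latent variables (from the caption)
N_PIX = 32         # hypothetical toy length of the APOGEE-like spectrum
N_LABELS = 4       # T eff, log g, [M/H], [alpha/M]

xp_encoder = linear(N_XP, N_LATENT, rng)
xp_decoder = linear(N_LATENT, N_LABELS, rng)
hallucinator = linear(N_LATENT, N_PIX, rng)
apogee_decoder = linear(N_PIX, N_LABELS, rng)  # stands in for the pretrained block


def training_loss(xp_coeffs, true_labels):
    """Sum of the label errors from both routes (equal weights are an assumption)."""
    latent = xp_encoder(xp_coeffs)
    labels_xp = xp_decoder(latent)
    labels_hallucinated = apogee_decoder(hallucinator(latent))
    return mse(labels_xp, true_labels) + mse(labels_hallucinated, true_labels)


def predict(xp_coeffs):
    """Inference uses only the XP encoder and the XP decoder."""
    return xp_decoder(xp_encoder(xp_coeffs))


xp_coeffs = [rng.gauss(0.0, 1.0) for _ in range(N_XP)]
true_labels = [0.0, 0.0, 0.0, 0.0]  # toy, normalized labels
loss = training_loss(xp_coeffs, true_labels)
labels = predict(xp_coeffs)
```

The point of the sketch is structural: both routes share the same latent vector, so gradients from the hallucinator route also shape the XP encoder, while `predict` never touches the hallucinator.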

Figure 4. Example of a predicted APOGEE-like gradient spectrum produced inside the hallucinator, compared to an actual APOGEE gradient spectrum. During training, the hallucinator predictions are not directly compared to real APOGEE spectra. Rather, the hallucinated spectra only need to produce the correct stellar labels when fed into the pretrained APOGEE pipeline decoder. This allows the hallucinator to learn a mapping that captures relevant spectral features correlated with the stellar labels.

Figure 5. Validation of AspGap-derived labels from Gaia XP spectra (XP) vs. APOGEE labels (AP). From left to right, the panels show the results of twofold cross-validation drawn from the training set for T eff, log g, [M/H], and [α/M]. The (small) scatter and (negligible) bias are indicated in each panel, and η is defined as the outlier rate given by Equation (5). The pairs of dashed lines indicate the 1σ deviation in the difference between the compared labels. The color denotes the logarithm of the number density.

Figure 6. Validation of AspGap-derived labels from Gaia XP spectra (XP) vs. APOGEE labels (AP). This figure is analogous to Figure 5, except that the color coding now denotes AspGap's estimate of the label prediction's uncertainty. Most objects with more discrepant AspGap vs. APOGEE label estimates are correctly assigned larger AspGap label uncertainties.

Figure 7. Quality of the labels generated by AspGap, as a function of the input XP spectra's S/N and G-band magnitude, shown for T eff, log g, [M/H], and [α/M] from left to right. In all panels, the solid black curves denote the formal errors generated by AspGap. The teal, green, orange, and red lines denote the cross-validation values for different measures of precision and accuracy: the rms deviation (RMSE), the median absolute error (MAE), the scatter (standard deviation), and the bias, respectively. The thin dashed line indicates the naive 1/(S/N) expectation. The top row shows this as a function of the BP S/N, the middle row as a function of the RP S/N. The cross-validation results approximately follow the S/N-scaling expectations for T eff, log g, and [M/H], but less so for [α/M].
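The four cross-validation measures named in this caption can be computed as in the following Python sketch. It takes the bias to be the mean difference and the scatter to be the population standard deviation of the differences. Both conventions are assumptions here, as are the function name and the toy temperatures.

```python
import math
import statistics


def label_metrics(predicted, reference):
    """RMSE, median absolute error, scatter, and bias of predicted - reference."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    return {
        "rmse": math.sqrt(statistics.fmean([d * d for d in diffs])),
        "mae": statistics.median([abs(d) for d in diffs]),  # median absolute error
        "scatter": statistics.pstdev(diffs),                # standard deviation
        "bias": statistics.fmean(diffs),                    # mean offset (assumed)
    }


# Toy effective temperatures in K (illustrative values only).
xp_teff = [4800.0, 5010.0, 4650.0, 5120.0]
ap_teff = [4750.0, 5000.0, 4700.0, 5100.0]
metrics = label_metrics(xp_teff, ap_teff)
```

With these conventions the quantities satisfy rmse² = scatter² + bias², which is a convenient consistency check.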

Figure 8. Fidelity of the label uncertainty estimates. The panels show the probability density distribution of the uncertainty-normalized differences (χ) between AspGap and APOGEE labels. In each panel, the blue dashed curve represents the standard normal distribution N(0, 1), and the orange solid line denotes the best-fitting normal distribution of χ. The formal uncertainties are meaningful (also for [α/M]) but about 30% underestimated.

By conducting these validations, we gain access to the ground truth labels, which allows us to evaluate various performance metrics such as the MAE (χ).
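A minimal Python sketch of the uncertainty-normalized difference χ follows. It assumes that χ is the label difference divided by the AspGap formal uncertainty alone. Whether the reference uncertainties also enter is not stated here, and the function names and toy values are hypothetical.

```python
import statistics


def normalized_residuals(xp_labels, ap_labels, xp_sigmas):
    """chi = (XP label - APOGEE label) / formal uncertainty (assumed definition)."""
    return [(x - a) / s for x, a, s in zip(xp_labels, ap_labels, xp_sigmas)]


def chi_width(chi):
    """Standard deviation of chi: about 1 for well-calibrated uncertainties,
    above 1 when the formal uncertainties are underestimated."""
    return statistics.pstdev(chi)


# Toy [alpha/M] values in dex (illustrative only).
xp_alpha = [0.10, 0.05, 0.20, 0.12]
ap_alpha = [0.08, 0.09, 0.17, 0.12]
xp_sigma = [0.02, 0.03, 0.02, 0.03]

chi = normalized_residuals(xp_alpha, ap_alpha, xp_sigma)
width = chi_width(chi)
```

Comparing the width of the χ distribution with unity is the same check as comparing the best-fitting normal distribution with the standard normal curve.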

(L pseudo; Anderson et al. 2018) to select giant stars in the H-R diagram, instead of the absolute magnitude M G, to avoid losing sample members with negative parallax measurements (

Figure 9. Definition of the RGB sample for which the AspGap label estimates are most precise and robust.The sample cuts are illustrated in the H-R diagram (color coded by the number density of the Gaia XP sample) with two conditions T eff < 5300 K and 10 10 G T 5 0.003 19 5 10 eff

Figure 10. Summary of the stellar labels derived via AspGap for the ∼37 million members of the RGB catalog (see Figure 9). Left: T eff-log g diagram color coded by [M/H]. Right: [M/H]-[α/M] abundance diagnostic diagram color coded by logarithmic density.

Figure 11. The left panel shows the two-dimensional distribution of the median Galactocentric Cartesian X position vs. the Cartesian Y position for Gaia XP. In the right panel, a similar two-dimensional distribution is displayed for APOGEE data. By utilizing the sample provided by XP, we can assess whether the larger and all-sky Gaia XP sample covers the same physical extent as the APOGEE data.

Figure 12. Histograms of G magnitude and errors of stellar parameters. From left to right, the histograms represent the G magnitude, σ T eff, σ log g, σ [M/H], and σ [α/M] distributions. The black color represents the compiled data of 37 million RGB stars. A subset, comprising 14 million stars with available radial velocity measurements from Gaia, is also included.
, we compare the labels derived by AspGap for RGB stars from LAMOST Low-resolution (R ∼ 1800) Spectroscopic Survey (LRS) DR5 (Zhang et al. 2020) using the Stellar Label Machine (SLAM). The labels of the LAMOST RGB stars are trained by

Figure 13. Comparison of the stellar labels T eff, log g, and [M/H] (left to right) derived from essentially the same XP data but with two different approaches, AspGap and XGBoost as implemented by A23. (Note that A23 also used bandpass magnitudes derived from spectra and additional external photometry, such as that from the Wide-field Infrared Survey Explorer.) The diagrams are color coded by the logarithmic sample number density. The first row of panels represents the comparison for the entire crossmatched RGB sample, consisting of 10,825,736 stars. In the second row, the comparison is restricted to the subset of stars with an S/N of RP coefficients higher than 1000, comprising 917,921 stars.

Figure 14. Comparison between AspGap-derived labels and those from three different ground-based survey data sets: LAMOST-LRS, LAMOST-MRS, and GALAH. Each column corresponds to a specific parameter: T eff, log g, [M/H], or [α/M]. The top panel represents the comparison between Gaia XP (AspGap) and LAMOST-LRS (SLAM), color coded by the number density. The middle panel showcases the comparison with the LAMOST-MRS data set (Cycle-StarNet), and the bottom panel depicts the comparison with GALAH DR3. In each panel, the scatter and the median bias for each stellar label are marked. The black dotted line represents the one-to-one line, reflecting perfect agreement between the surveys. The pairs of gray dashed lines in the plots represent the deviation of 1σ in the difference between the compared labels.

Figure 15. Comparison between literature [Fe/H], AspGap [M/H], and APOGEE Data Release 16 [M/H] for open clusters. We compare the AspGap [M/H] and the [Fe/H] in Donor et al. (2020), with 1σ as the error bar in the left panel. In the right panel, all cluster samples crossmatched with literature are denoted as blue circles, and clusters with more than three identified members in APOGEE are denoted as red triangles. The marker size indicates the number of members in each cluster, and the metallicity difference Δ[M/H] > 0.2 is shown by the gray shaded areas.

Figure 16. Astrophysical validation of AspGap's [α/M] determinations, based on the fact that, at a given [M/H], the α-enhanced (or thick) disk has more vertical motion (e.g., Bovy et al. 2012). The left panel illustrates the relationship between the rms vertical action J z (quantifying the vertical kinematics) and the AspGap-derived [α/M] in two narrow bins of [M/H], −0.7 < [M/H] < −0.5 (red) and −0.5 < [M/H] < −0.3 (blue). The solid lines with circles show the results from AspGap labels, while the dashed lines with triangles represent those from APOGEE for reference. The AspGap results show the expected trend that agrees quite closely with that seen in APOGEE data; subtle offsets may simply reflect the different spatial selection function of the sample. The right panel shows a related astrophysical validation test, based on the known fact that, again at a given [M/H], the α-enhanced disk is much more centrally concentrated (e.g., Bovy et al. 2016). The panel displays the mean angular momentum L z as a function of [α/M] in the same two [M/H] bins, again showing the expected trend and good agreement with APOGEE.
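The binned statistic plotted in the left panel can be sketched as follows in Python: stars are selected in a narrow [M/H] window, grouped by [α/M], and the rms of J z is computed per group. The bin width, the function name, and the toy values are hypothetical, and the computation of the actions themselves is outside the scope of this sketch.

```python
import math
from collections import defaultdict


def rms_jz_by_alpha(mh, alpha, jz, mh_range, bin_width=0.05):
    """rms of J_z in [alpha/M] bins for stars with mh_range[0] < [M/H] < mh_range[1].

    Returns a dict mapping each bin center to the rms J_z of its stars."""
    lo, hi = mh_range
    groups = defaultdict(list)
    for m, a, j in zip(mh, alpha, jz):
        if lo < m < hi:
            groups[math.floor(a / bin_width)].append(j)
    return {
        round((k + 0.5) * bin_width, 10): math.sqrt(sum(j * j for j in v) / len(v))
        for k, v in sorted(groups.items())
    }


# Toy sample (illustrative values; J_z in arbitrary units).
mh = [-0.6, -0.6, -0.6, -0.6, -0.2]
alpha = [0.02, 0.03, 0.22, 0.23, 0.02]
jz = [3.0, 4.0, 6.0, 8.0, 100.0]

trend = rms_jz_by_alpha(mh, alpha, jz, mh_range=(-0.7, -0.5))
```

In this toy sample the last star falls outside the [M/H] window and is ignored, and the high-[α/M] bin has the larger rms J z, mimicking the expected trend.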
[α/M] XP values. Notably, we observe that the stars classified as nonisotropic display slightly lower [α/M] XP values compared to the isotropic stars. The color bar accompanying the plot indicates the number ratio of high-[α/M] XP stars to low-[α/M] XP stars, with high-[α/M] XP stars defined as those above the diagonal dashed line and low-[α/M] XP stars as those below it. The results depicted in Figure 17 provide additional evidence supporting the reliability of the AspGap-derived [α/M] XP values. We observe that the nonisotropic stars, which are likely associated with debris from the GES merger, exhibit relatively lower [α/M] XP values. This chemical behavior is consistent with the expected signature of GES debris, further strengthening the validation of [α/M] XP in accurately determining α-element abundances. Finally, we validate the [α/M] predictions in the LMC, which is a population that is fairly metal-poor with exceptionally low [α/M]. We crossmatch our sample with the catalog of Jiménez-Arranz et al. (2023), who employed a supervised neural network classifier to distinguish LMC stars

Figure 17. Three astrophysical validation tests for the quality of the AspGap-based [α/M] estimates in the metal-poor regime. All three validations are based on the idea that in some regimes the spatial or kinematic selection of stars leads to clear (externally derived) expectations for p([α/M]) in a given [M/H] regime. Top left: validation using the "poor old heart" of the Milky Way (Rix et al. 2022), showing the fraction of high-[α/M] XP stars as a function of R apo and eccentricity for stars with [M/H] XP < −1.1. At high eccentricities (>0.75) and large R apo (>10 kpc) the population should be dominated by the GES population, known to be low-α (Helmi et al. 2018; Hasselquist et al. 2021); this is exactly what our [α/M] XP values show. Top right: distribution of stars on the [M/H]-[α/M] plane that have been selected (purely kinematically) as likely GES members, expected to lie below the (black dashed) high-α vs. low-α dividing line in the diagram. The stars with AspGap-determined [M/H] and [α/M] labels (colored density) lie in the expected position, and also agree with APOGEE (green points); this latter coincidence is not a trivial consequence of the training, given that these stars are a tiny and unusual subsample. Bottom: validation in the LMC, using a similar [M/H]-[α/M] diagnostic diagram to that of the top right panel. The stars with AspGap-determined labels (colored density) lie at low [α/M], as expected for the LMC (Russell & Dopita 1992); they again agree with the APOGEE labels for this peculiar subset.
Andrae et al. (2023a) reported a slightly higher MAE of 0.21 dex for [M/H] derived from Gaia XP spectra compared to APOGEE, this information is still valuable, albeit at just the qualitative level.

Table 1
Criteria to Deem Stellar Labels from SDSS/APOGEE Reliable and Eliminate Stars without Meaningful Astrophysical Parameters

Table 2
Table Descriptions for ∼37 Million RGB Stars Predicted by AspGap

Table 3
Precision and Recall for Low-[α/M] Disk and High-[α/M] Disk

This decision is based on the limited coverage of the training sample of APOGEE in terms of [M/H] values, which excludes very metal-poor stars. The fraction of stars in the inner disk with [M/H] values below −2 is very small, approximately 0.003% (Rix et al. 2022). If the analysis requires a focus on very metal-poor stars with [M/H] < −2, we recommend referring to works such as Li et al. (2022) and Yao et al.