Robust Data-driven Metallicities for 175 Million Stars from Gaia XP Spectra

We derive and publish data-driven estimates of stellar metallicity [M/H] for ∼175 million stars with low-resolution XP spectra published in Gaia DR3. The [M/H] values, along with T_eff and log g, are derived using the XGBoost algorithm, trained on stellar parameters from APOGEE, augmented by a set of very-metal-poor stars. XGBoost draws on a number of data features: the full set of XP spectral coefficients, narrowband fluxes derived from XP spectra, and broadband magnitudes. In particular, we include CatWISE magnitudes, as they reduce the degeneracy of T_eff and dust reddening. We also include the parallax as a data feature, which helps constrain log g and [M/H]. The resulting mean stellar parameter precision is 0.1 dex in [M/H], 50 K in T_eff, and 0.08 dex in log g. This all-sky [M/H] sample is substantially larger than published samples of comparable fidelity across −3 ≲ [M/H] ≲ +0.5. Additionally, we provide a catalog of over 17 million bright (G < 16) red giants whose [M/H] values are vetted to be precise and pure. We present all-sky maps of the Milky Way in different [M/H] regimes that illustrate the purity of the data set, and demonstrate the power of this unprecedented sample to reveal the Milky Way’s structure from its heart to its disk.


INTRODUCTION
The chemical composition of stars, reflected in their photospheric abundances, is a fundamental stellar observable. To zeroth order, it can be summarized by the mean metallicity [M/H], which varies by orders of magnitude among the stellar populations in the Galaxy, while the individual abundance ratios among heavy elements tend to vary far less.
In the context of the Milky Way, or other resolved nearby galaxies such as the Magellanic Clouds, having vast samples of stars with [M/H] estimates across all stellar populations matters greatly for both galaxy and stellar evolution: [M/H] traces the "chemical evolution" of the galaxy that reflects the combination of the star formation history, stellar yields, gas inflow and feedback (e.g., Dekel & Silk 1986; Matteucci 1994; Tremonti et al. 2004). Large and systematically selected samples of low-[M/H] stars are needed to quantify and test stellar yields and the importance of different nucleosynthetic channels, in particular of the earliest stars in a galaxy (e.g., Tinsley 1979; McWilliam 1997). Furthermore, [M/H] is indispensable to study the chemo-dynamics of, say, the Milky Way: the determination and evolutionary interpretation of the stars' distribution in the space of orbits, ages, and element abundances (e.g., Hayden et al. 2015; Weinberg et al. 2019, 2022). Studies of the chemical and dynamical evolution of the Galaxy are linked closely, as the abundances, in particular [M/H], also serve as an (albeit complex) proxy for stellar ages (e.g., Tinsley 1980; Twarog 1980; Nordström et al. 2004; Gallazzi et al. 2005; Rix et al. 2022).
While [M/H] is the fundamental measurement for abundances, the α-element enhancement has long been established as arguably the next most important abundance observation, as it reflects the relative roles of core-collapse and thermonuclear supernovae in the enrichment of a star's birth material (Tinsley 1979; Hayden et al. 2015; Weinberg et al. 2019). The dimensionality of the abundance space for elements through the iron peak is not yet fully settled (Ness et al. 2019; Ting & Weinberg 2022). The abundances of elements beyond the iron peak, which arose primarily via the s- and r-process, are of great interest (e.g., Sneden et al. 2008). However, the observational determination of these elements requires spectra of relatively high resolution and S/N (e.g., Ji et al. 2019a,b).
These arguments have motivated over the last decade(s) a suite of large-scale spectroscopic surveys: SDSS I-IV (York et al. 2000), LAMOST (Cui et al. 2012), GALAH (De Silva et al. 2015), Gaia-ESO (Gilmore et al. 2012) in the past; SDSS-V (Kollmeier et al. 2017), WEAVE (Dalton et al. 2012) now; 4MOST (de Jong et al. 2019) in the future. Although these ground-based surveys have now reached sample sizes of nearly 10^7 stars, all have had highly incomplete sky coverage and complex selection functions. Only SDSS-V will provide ground-based spectroscopic all-sky coverage over the next few years (Kollmeier et al. 2017; Almeida et al. 2023).
As Gaia's most recent data release DR3 (Gaia Collaboration et al. 2022b) has again made abundantly clear, Gaia is not only a photometric and astrometric mission, but also a spectroscopic one. Gaia obtained spectra both with the RVS instrument, at a resolution of ∼8000 around the near-IR Ca triplet (Seabroke et al. in prep., Sartoretti et al. 2022); and very low-resolution spectra (R ∼ 40 − 150) taken with the two prisms BP and RP (Carrasco et al. 2021; De Angeli et al. 2022) that together cover the wavelength range from ∼350 nm to ∼1000 nm (Montegriffo et al. 2022). In the following, we denote these BP and RP data as XP spectra. These spectra, and astrophysical parameters derived from them, were released in Gaia DR3 for both RVS and XP: 220 million and 1 million spectra, as well as 470 million and 6 million sets of astrophysical parameters, for XP and RVS, respectively.
To maximize the Gaia data set suitable for chemodynamical studies of our Galaxy, one would need to have abundances, at least [M/H], for most stars that have RVS velocities. In the context of DR3, this can be done with [M/H] based on XP spectra, but not on RVS spectra or derived abundances, as these are published only for a subset of 6 million stars (Recio-Blanco et al. 2022) compared to 33.8 million stars that have RVS velocities (Katz et al. 2022).
An extensive set of metallicities for Gaia sources with XP spectra was published as part of DR3 (Andrae et al. 2022). By design, these [M/H] values were derived using synthetic model spectra in comparison with the XP spectra, with the goal of a consistent approach to stellar parameter estimates across much of the CMD. Unfortunately, external validation has shown that these [M/H] values have important shortcomings (e.g. systematics and a high rate of "catastrophic" outliers) due to two aspects known and stated at the time of publication. First, knowledge of the Gaia XP system is detailed but imperfect, so that significant discrepancies between the predictions of the synthetic model and the XP data exist and lead to erroneous [M/H] estimates. Second, the spectra of stars of different T_eff and log g have very different information content about [M/H] at low resolution, and for some temperatures (e.g. OB stars) the XP spectra are simply not informative on [M/H].
On the other hand, it has been established that for cool stars, even very low-resolution spectra are informative about [M/H] (Ting et al. 2017). By using a data-driven approach to estimate [M/H] and by focusing on stellar types whose low-resolution spectra are informative about [M/H], one can overcome these limitations. This has recently been shown by Rix et al. (2022), who produced a large set of [M/H] estimates towards the Galactic center. Similar methods have been employed to study the halo of the Galaxy, for example to map out its last major merger (Belokurov et al. 2023; Chandra et al. 2023).
Here we set out to build on this work and produce a comprehensive catalog of high-fidelity stellar [M/H] estimates (with T_eff and log g as corollaries) that
• includes essentially the entire sample of published XP spectra, acknowledging that the [M/H] estimates at low S/N and high T_eff are potentially unreliable. This requires the identification of subsamples where the [M/H] estimates are precise, accurate and robust (i.e. with negligible outliers), as verified by external comparison.
• is "all-sky", accepting that the reach of such a catalog varies across the sky due to a) the Gaia experimental set-up; b) the details of the DR3 data release; c) the changing source density and dust extinction.
• is data-driven, drawing on high-quality training sets that cover essentially the whole metallicity range present in the Local Group, −3 < [M/H] < 0.5.
• draws mostly on XP spectral information, but also utilizes relevant information that is available for most of the target sample: broad-band photometry across a wide range of wavelengths (extending to WISE in the infrared) to constrain the overall SED and reduce the T_eff-reddening degeneracy; and the parallaxes ϖ, which are, even at low or negative ϖ/δϖ, highly informative about the luminosity or absolute magnitude M of the star, as 10^(M_λ/5) ∝ ϖ · 10^(m_λ/5). Indirectly, ϖ therefore informs log g and [M/H].

TRAINING XGBOOST
We seek to train XGBoost models (Chen & Guestrin 2016) to estimate stellar metallicity, effective temperature and surface gravity from XP spectra, drawing on the subset of objects for which both XP spectra and externally-derived stellar parameters of high fidelity exist.

Training sample selection
Encouraged by the results in Rix et al. (2022), we train XGBoost models using as data features both XP coefficients (Carrasco et al. 2021; De Angeli et al. 2022) and synthesized photometry that was computed with GaiaXPy (e.g. Gaia Collaboration et al. 2022c). For the most part, we do this for stars with literature labels from APOGEE DR17 (Abdurro'uf et al. 2022). However, we augment this training sample with the very metal-poor stars of Li et al. (2022), which provide a consistent and fairly extensive set of [M/H] determinations for a set of (apparently) bright stars. We also replace the AllWISE photometry (Cutri et al. 2021) used in Rix et al. (2022) by CatWISE photometry (Marocco et al. 2021), which is deeper and thus achieves higher completeness.
APOGEE DR17 contains a total of 733 901 stars. Of these, 647 025 actually have stellar parameters T_eff, log g, and [M/H], and 643 401 have a cross-match to Gaia. Among these, only 599 662 achieve signal-to-noise ratios above 50 in the APOGEE spectra. For the purpose of this paper, we also require XP spectra to be available, which is the case in the Gaia DR3 data for 537 412 of these stars. We also require CatWISE photometry in the W1 and W2 bands, as these bands greatly help to reduce the temperature-extinction degeneracy. This reduces the APOGEE DR17 training set to 510 413 entries, which contain only 485 850 unique Gaia source IDs, i.e. there are "duplicates" which represent repeated APOGEE observations of the same star. In such cases, we adopt the mean APOGEE parameters averaged over all repeat observations as training labels. The results of Li et al. (2022) encompass 385 stars, of which 291 stars have published XP spectra, as well as W1 and W2 photometry in CatWISE.
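The averaging of repeat APOGEE observations described above can be sketched as follows. This is a minimal illustration and not the pipeline used for the catalog; the record layout (tuples of source ID, T_eff, log g, [M/H]) and the function name `average_repeat_labels` are assumptions made for the example.

```python
from collections import defaultdict


def average_repeat_labels(rows):
    """Average training labels over repeat observations of the same source.

    rows: iterable of (source_id, teff, logg, mh) tuples (hypothetical layout).
    Returns a dict mapping source_id -> (mean teff, mean logg, mean mh).
    """
    grouped = defaultdict(list)
    for source_id, teff, logg, mh in rows:
        grouped[source_id].append((teff, logg, mh))

    averaged = {}
    for source_id, labels in grouped.items():
        n = len(labels)
        # Column-wise mean over all repeat observations of this source.
        averaged[source_id] = tuple(sum(col) / n for col in zip(*labels))
    return averaged


# Toy input: source 1 was observed twice, source 2 once.
rows = [
    (1, 4800.0, 2.4, -0.50),
    (1, 4900.0, 2.6, -0.30),
    (2, 5700.0, 4.4, 0.00),
]
labels = average_repeat_labels(rows)
```

After this step every Gaia source ID appears exactly once, so no star is weighted more heavily in training merely because it was observed repeatedly.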
The resulting [M/H] distribution of this training sample is shown in Fig. 1. Already from APOGEE, we have good coverage for [M/H] < −1. But this figure also shows how critical the inclusion of stars from Li et al. (2022) is to cover metallicities below −2.5, eventually down to a minimum value of −4.37. This expanded training sample removes an important limitation at low metallicity of the work by Rix et al. (2022).
It is worth taking a closer look at the distribution of all stellar parameters in our training sample in Fig. 2, to understand over which range we can expect XGBoost to return robust estimates. While Fig. 2c suggests that we have a good coverage of main-sequence dwarfs and red-giant stars, the temperature range is limited to 3107 K to 6867 K. In particular, we have no OBA stars, no white dwarfs, and no ultra-cool dwarfs in our training sample. Furthermore, Fig. 2a shows that we have essentially no training examples for [M/H] < −2 and T_eff < 4000 K. Also, Fig. 2b shows that we have only very few training examples for metal-poor dwarfs with log g > 3.5 and [M/H] < −1. This will likely preclude robust and precise parameter estimates in this regime.
Rix et al. (2022) noticed that some bands synthesized from the XP coefficients with GaiaXPy had negative fluxes and thus invalid magnitudes, particularly narrow bands in the blue where fluxes are often low. This led to a rapidly decreasing completeness of the XGBoost predictions in Rix et al. (2022) at the faint end. Here, we address this issue more systematically to keep the completeness towards the faint end as high as possible. First, for all stars in our application sample, we synthesize their photometry with GaiaXPy in a set of photometric systems that we a priori believe to be useful for estimating metallicities (for details see Gaia Collaboration et al. 2022c).
Second, for all photometric bands derived from the XP spectra, CatWISE and AllWISE, we investigate the completeness of their magnitudes as a function of G_BP in Fig. 3. Evidently, the completeness of synthesized photometry diminishes much earlier in some bands than in others. Further investigation reveals that bands with pivot wavelengths below ≈420 nm are the first to be affected by incompleteness. This is consistent with our interpretation of the incompleteness arising through noise in the XP coefficients, given that the BP spectrum has low transmission and thus low signal-to-noise for wavelengths below 420 nm. Furthermore, Fig. 3 shows that AllWISE would limit the completeness at all magnitudes, and that CatWISE can reach much higher completeness, especially at the faint end where there are numerous stars. Still, CatWISE does not reach full completeness even at the bright end.

Input features for XGBoost
For the final set of input features for XGBoost, we adopt all bands that achieve a completeness of 95% or higher at G_BP = 18. These are 31 bands, and for each band the XGBoost input feature is the color obtained from the apparent G magnitude minus the magnitude in this band. We choose the G magnitude for all colors for two reasons: first, the G magnitude is measured independently from the XP spectra from which all synthetic photometry is derived; and second, the G magnitudes have very high S/N. Additionally, we use the three Gaia colors G − G_BP, G − G_RP and G_BP − G_RP as input features, as well as several colors including CatWISE photometry. Therefore, our final set of data features comprises 38 colors and all 110 XP coefficients normalized to G = 15. This may appear confusing at first, because the synthesized photometry is fully redundant with the XP coefficients and adding redundant features could even be detrimental to the scientific performance (curse of dimensionality). Ultimately though, the choice to include both XP coefficients and photometry synthesized from XP as input features for XGBoost is a matter of feature selection that we test during cross-validation (see Table 1 in Rix et al. 2022). As it turns out, both are required in order to achieve optimal [M/H] results, and the omission of either XP coefficients or synthetic photometry would lead to a noteworthy increase in the [M/H] errors during cross-validation and later application. This implies that XGBoost is unable to fully extract all information from the XP coefficients alone. Instead, our manual help to "re-phrase" the information in terms of synthesized photometry is required in order to make the information more easily accessible for XGBoost.
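As a rough sketch of the band selection just described, one could compute for each band the fraction of stars near G_BP = 18 that have a valid magnitude, keep the bands reaching 95%, and then form colors against G. The data layout (one dict per star), the ±0.25 mag bin half-width, and all function and key names are assumptions for illustration only.

```python
import math


def band_completeness(stars, band, gbp_center=18.0, half_width=0.25):
    """Fraction of stars with G_BP near gbp_center that have a finite magnitude in `band`.

    stars: list of dicts with keys "g", "gbp" and one key per band; a missing
    or NaN magnitude stands for a negative synthesized flux (hypothetical layout).
    """
    in_bin = [s for s in stars if abs(s["gbp"] - gbp_center) <= half_width]
    if not in_bin:
        return float("nan")
    valid = sum(1 for s in in_bin if math.isfinite(s.get(band, float("nan"))))
    return valid / len(in_bin)


def select_bands(stars, bands, threshold=0.95):
    """Keep the bands whose completeness at G_BP = 18 reaches the threshold."""
    return [b for b in bands if band_completeness(stars, b) >= threshold]


def color_features(star, bands):
    """Colors G - m_band, built on the independently measured G magnitude."""
    return {f"g_minus_{b}": star["g"] - star[b] for b in bands}


# Toy sample: band "a" is always valid, band "b" is valid for 18 of 20 stars.
stars = [
    {"g": 17.5, "gbp": 18.0, "a": 18.2, "b": 19.0 if i < 18 else float("nan")}
    for i in range(20)
]
kept = select_bands(stars, ["a", "b"])
```

In this toy sample band "b" reaches only 90% completeness and is dropped, which mirrors how blue narrow bands fall out of the feature set at the faint end.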
For deriving stellar parameters, in particular log g, the absolute magnitude is highly informative: e.g. it straightforwardly differentiates between giants and dwarfs. While a substantive subset of the stars with XP spectra have good parallax S/N, from which we can estimate absolute magnitudes, many sample members have parallaxes that are consistent with zero or even negative. Therefore, we added a data feature that reflects or places limits on the absolute magnitude, but is linear in the parallax in order to remain well behaved in cases of noisy or even negative parallaxes. Specifically, we opted for input features of the form ϖ · 10^(m_X/5) ∝ 10^((M_X + A_X)/5) (Eq. 1), where ϖ denotes the parallax, m_X is the apparent magnitude in some band X, while M_X and A_X are the absolute magnitude and dust attenuation in the same band X. We added five such input features, for the photometric bands X = G, G_BP, G_RP, W1, W2. The parallax in Eq. (1) has been corrected for the parallax zero-point according to Lindegren et al. (2021). We find that these additional features not only help to estimate log g, but also improve our [M/H] estimates by ∼10%, with metal-poor giants benefiting in particular. The complete list of all input features and the details of the XGBoost configuration are provided in Appendix A.
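To make the parallax-based feature concrete, the following sketch evaluates ϖ · 10^(m_X/5) for a single star. It assumes the parallax is given in milliarcseconds and has already been zero-point corrected; the function name and the example values are illustrative and not taken from the catalog.

```python
def parallax_feature(parallax_mas, apparent_mag):
    """Feature parallax * 10**(m/5): linear in the parallax, defined for any sign.

    parallax_mas: zero-point-corrected parallax in mas (assumed unit).
    apparent_mag: apparent magnitude m_X in the chosen band.
    """
    return parallax_mas * 10.0 ** (apparent_mag / 5.0)


# Noise-free check: a star at 100 pc (parallax 10 mas) with M_X = 5 and A_X = 0
# has m_X = 10, so the feature equals 100 * 10**((M_X + A_X) / 5) = 1000.
feature_clean = parallax_feature(10.0, 10.0)

# A faint star with a noisy, negative parallax still yields a finite feature,
# whereas an absolute magnitude via log10(parallax) would be undefined.
feature_noisy = parallax_feature(-0.2, 17.0)
```

Because the feature is linear in ϖ, Gaussian parallax noise stays Gaussian in the feature, which is the reason for preferring this form over an explicit absolute magnitude.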
We use the exact same set of input features for training the XGBoost models for [M/H], T_eff and log g, all based on training labels (see Sect. 2.1). Our objective is to maximize the number of stars for which all required input features are available. In that case, the completeness of our results would be dominated by the completeness of AllWISE photometry (see Fig. 3).

Internal 20-fold cross validation
For internal validation, we assess the quality of XGBoost results on the training sample, using 20-fold cross-validation: 20 times we set aside disjoint sets comprising 5% of the data for subsequent testing of a model trained on the other 95% of the data. In the end, the data features for each object in the training sample have been compared to a statistically independent XGBoost model prediction for them. These cross-validation results are summarized in Fig. 4.
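The 20-fold scheme above can be sketched in a library-independent way. In this minimal example the `fit` and `predict` callables are placeholders: a trivial mean predictor stands in for XGBoost, and all names are assumptions made for illustration.

```python
import random


def kfold_out_of_fold(features, labels, fit, predict, k=20, seed=0):
    """Return one out-of-fold prediction per object from k-fold cross-validation."""
    n = len(labels)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    # k disjoint test sets, each holding roughly n/k objects (5% for k = 20).
    folds = [order[i::k] for i in range(k)]

    predictions = [None] * n
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        # Train on the remaining ~95% of the data only.
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(model, features[i])
    return predictions


# Stand-in "model": predicts the mean training label (placeholder for XGBoost).
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


labels = [float(i % 5) for i in range(100)]
features = [[y] for y in labels]
oof = kfold_out_of_fold(features, labels, fit_mean, predict_mean)
```

Every object receives exactly one prediction, and that prediction comes from a model that never saw the object during training, which is what makes the residuals a fair estimate of the error.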
The first row of Fig. 4 shows the XGBoost residuals for [M/H], which improve on the results of Rix et al. (2022, see Table 1 therein). This improvement, despite the expanded coverage of the CMD, is mainly due to the inclusion of luminosity estimates (see Eq. (1)) as features. As in Rix et al. (2022), the current [M/H] estimates remain unbiased as the A_K extinction increases, as is evident from Fig. 5a. Including CatWISE photometry is the key here. Furthermore, Fig. 5b establishes that there are also no systematics with parallax (i.e. inverse distance).
While Rix et al. (2022) restricted their analysis to bright (G_BP < 16 mag) red-giant-branch (teff_xgboost < 5500 K and logg_xgboost < 3.5) stars, Fig. 4 panels (b) and (c) suggest that the [M/H] estimates from the current work are also robust outside the RGB, and panel (d) suggests that this also holds down to the faintest stars which have their XP spectra published in Gaia DR3. Note that we cannot test with this validation sample whether our [M/H] estimates remain equally precise and robust for very metal-poor stars.
The other two rows of Fig. 4 show the XGBoost residuals for T_eff (middle) and log g (bottom). The RMS differences are remarkably small: 54 K for T_eff and 0.089 for log g. Furthermore, the residuals do not show obvious systematics and appear to remain robust down to G_BP ∼ 20.

STELLAR PARAMETERS FROM XP SPECTRA VIA XGBOOST
We now turn to applying the XGBoost estimator, trained as just described, to an all-sky sample of stellar sources with XP spectra and CatWISE photometry.

Sample selection
We define the sample to which we apply the XGBoost estimator as all sources in Gaia DR3 that have XP spectra, valid parallaxes and proper motions, and valid XP-derived and CatWISE photometry; the parallaxes do not have to differ significantly from zero. The sample is selected with an ADQL query. The apparent magnitude distributions of the selected sources are shown in Fig. 6. It is important to note that in Gaia DR3 XP spectra were only published for sources brighter than G = 17.65, yet Fig. 6 shows sources fainter than that. The reason is that the XP spectra of presumed QSOs, galaxies, and ultracool dwarfs were exempt from the Gaia DR3 publication limit of G = 17.65. Consequently, these objects may be contaminants in our stellar parameter catalog: they manifest as a small bump at G ∼ 19 in the distributions of G and G_RP.
The overall result of this analysis is given in Table 1: the three stellar parameters [M/H], T_eff and log g for 175 million sources, specified by their Gaia DR3 source ID, together with a flag indicating whether each source was included in the XGBoost training.

External validation with other surveys
We can validate these XGBoost results by a comparison to other surveys not used in the training. Specifically, we compare to results from GSP-Spec after calibration in Gaia DR3 (Recio-Blanco et al. 2022), GALAH DR3 (Buder et al. 2021), and SkyMapper DR2 (Chiti et al. 2021).
To start, we compare [M/H] estimates from XGBoost with other metallicity estimates. For GSP-Spec (Fig. 7a), the agreement is excellent: there are no discernible systematics and very few outliers. Importantly, the GSP-Spec comparison is mostly limited to [M/H] > −1, where we expect [M/H] estimates to be robust. For GALAH (Fig. 7b), we still see a good overall agreement across the full metallicity range. However, there are some outliers, where GALAH estimates [Fe/H] below −1, while XGBoost estimates [M/H] above −0.5. We also note a small systematic offset below [Fe/H] of −1, where XGBoost's [M/H] is ∼0.2 lower than GALAH's [Fe/H]. These outliers and the slight offset have also been observed in the results of Rix et al. (2022). Yet, unlike in Rix et al. (2022), we no longer see a saturation of XGBoost metallicities below −2, where we now see a continuation of the one-to-one relation with GALAH. This is the result of including the very metal-poor stars of Li et al. (2022) in our training sample, thus extending the APOGEE metallicity range. For the SkyMapper photometric metallicities (Fig. 7c), a substantial scatter is evident both visually and quantitatively. Successful comparisons to external spectroscopic surveys suggest that this scatter is probably inherent to SkyMapper.
We further investigate the origin of the outliers and systematics of our [M/H] estimates in Fig. 8, where we directly compare metallicity estimates from APOGEE (i.e. the training sample underlying XGBoost) to those from GALAH. First, we observe the same small systematic offset below [Fe/H] of −1 between GALAH and APOGEE, i.e. the XGBoost model has correctly learned from APOGEE and simply reflects this difference. Second, we can also see the outliers where GALAH votes for [Fe/H] < −1 whereas APOGEE votes for [M/H] > −0.5, so again XGBoost has faithfully learned from its APOGEE training sample. Consequently, both effects are traced back to genuine differences between GALAH and APOGEE and are thus not introduced by XGBoost. In fact, these outliers in Fig. 7b mostly have high temperatures in GALAH and potentially correspond to outliers in GALAH DR3 itself.
Quantitatively, Fig. 7 shows that our XGBoost [M/H] estimates compare very well with those from GSP-Spec and GALAH, with half of the stars differing by no more than 0.092 and 0.068, respectively. This external validation error is somewhat larger than the cross-validation error of 0.042 found for APOGEE in Fig. 4. This most likely reflects subtle differences between APOGEE's and GSP-Spec's [M/H] estimates (our XGBoost estimates are tied to the APOGEE scale), as a similar scatter is found in the direct comparison of these surveys.
For the scatter of temperatures and surface gravities when comparing with GSP-Spec, we find RMS differences of 112 K for T_eff and 0.223 for log g; and when comparing with GALAH, we find 166 K for T_eff and 0.119 for log g, respectively. These are again slightly higher than the RMS differences from the 20-fold cross-validation on APOGEE (54 K and 0.089, respectively) quoted in Fig. 4. For log g, the difference to GSP-Spec is larger than for APOGEE or GALAH, but we also did not apply the empirical corrections for GSP-Spec's log g recommended in Recio-Blanco et al. (2022).
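The two scatter measures used in this comparison, the difference that half of the stars do not exceed (a median absolute difference) and the RMS difference, can be computed as in the sketch below. The toy values are invented purely to show how a single outlier affects the two statistics differently.

```python
import math
import statistics


def median_abs_difference(a, b):
    """Value that half of the pairwise absolute differences do not exceed."""
    return statistics.median(abs(x - y) for x, y in zip(a, b))


def rms_difference(a, b):
    """Root-mean-square of the pairwise differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Toy metallicities from two hypothetical surveys; the last pair is an outlier.
mh_xgboost = [0.0, -0.5, -1.0, -2.0]
mh_external = [0.1, -0.5, -1.3, -1.0]

mad = median_abs_difference(mh_xgboost, mh_external)
rms = rms_difference(mh_xgboost, mh_external)
```

The median-based measure is insensitive to the single discrepant pair, while the RMS is dominated by it, so the two numbers should not be compared directly with each other.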

[M/H] estimates at faint apparent magnitudes
Of particular interest is the publication limit of XP spectra in Gaia DR3, which was set at G = 17.65. Figure 4d suggests that our XGBoost results may remain robust as we approach the publication limit, but we would like to confirm this with an independent validation sample. Unfortunately, both GSP-Spec and GALAH DR3 are of no use in exploring this regime, as both samples are limited to bright stars. Therefore, we make use of the LAMOST DR6 data (Wu et al. 2011, 2014). As is evident from Fig. 9, the [M/H] differences between XGBoost and LAMOST degrade "gracefully" towards the faint end, which means that the random scatter increases smoothly and no systematics appear. At G = 16 the central 68% interval ranges from −0.2 to +0.2, and even at G = 17.65 it ranges from −0.3 to +0.4. In fact, these intervals include the [Fe/H] uncertainties from LAMOST, which typically are of the order of 0.25 around G = 17.65. Assuming that these uncertainties add in quadrature, the −0.3 to +0.4 interval at G = 17.65 implies an uncertainty of 0.17 to 0.32 attributable to XGBoost.
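The quadrature argument can be checked with a few lines of arithmetic. The sketch below subtracts an assumed LAMOST uncertainty of exactly 0.25 from each half-width of the 68% interval; since the text only states that the LAMOST uncertainty is "of the order of 0.25", the result is approximate (roughly 0.17 and 0.31 for these inputs).

```python
import math


def subtract_in_quadrature(total, known):
    """Remaining uncertainty if `total` is the quadrature sum of `known` and the rest."""
    if total < known:
        raise ValueError("total scatter is smaller than the known contribution")
    return math.sqrt(total**2 - known**2)


lamost_sigma = 0.25  # assumed typical LAMOST [Fe/H] uncertainty near G = 17.65

# Half-widths of the central 68% interval (-0.3 to +0.4) at G = 17.65.
lower = subtract_in_quadrature(0.3, lamost_sigma)
upper = subtract_in_quadrature(0.4, lamost_sigma)
```

The helper raises an error when the observed scatter is smaller than the assumed external uncertainty, a case in which the quadrature model itself would be inapplicable.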
While a random error of ∼0.33 in [M/H] at G ∼ 17 is acceptable, it still represents a substantial increase from the error of 0.1 at the bright end. What are the possible origins of this increased noise? First, an earlier version of our catalog was based on AllWISE photometry instead of CatWISE, but apart from having significantly lower completeness (see Fig. 3) it produced the same results at the faint end. This rules out the CatWISE photometry as the origin of the increased noise. Second, Gaia DR3 parallaxes can become very noisy towards the faint end, such that the features defined in Eq. (1) could begin to confuse XGBoost at the faint end. However, if we remove these features from the XGBoost input and thus become entirely independent of the parallax, we find no improvement either. This only leaves the XP spectra as the source of the increased noise towards the faint end. More precisely, we suspect that it is not the XP coefficients themselves but rather the synthesized narrow-band photometry which is becoming increasingly susceptible to noise towards the faint end. This interpretation is also supported by Fig. 3, which reminds us that the synthetic photometry becomes incomplete due to noise leading to negative flux values even when XP spectra are available.

External validation with solar analogs
Gaia Collaboration et al. (2022a) compiled a list of 5863 Solar-analog candidates, of which 5759 are in our sample. According to XGBoost, their mean [M/H] is 0.012 ± 0.105 and the central 90% interval ranges from −0.167 to 0.178. Figure 10 shows their distribution, which is consistent with a Gaussian of standard deviation 0.1. This is in excellent agreement with the Solar value and demonstrates that our [M/H] estimates are also reliable at least for Solar-like main-sequence dwarfs, whereas the estimates from Rix et al. (2022) were applicable only to giant stars.

External validation with clusters
Among our XGBoost results, we find 22 477 member stars in 36 open clusters from Gaia Collaboration et al. (2018). As a first instructive example, Fig. 11b shows how XGBoost's metallicity estimate varies with G_BP − G_RP color in the Praesepe cluster. The expected literature value is recovered only within a certain color range, but otherwise XGBoost systematically underestimates the metallicity. This underestimate is related to the limited temperature range of the training sample (3107 K to 6867 K, see Sect. 2.1). In the absence of interstellar extinction (such as for Praesepe), this temperature range roughly corresponds to a G_BP − G_RP color range from 0.5 to 2.3. Indeed, Fig. 11b shows that the underestimated [M/H] occurs mainly for colors bluer than 0.5 or redder than 2.5, with a less pronounced underestimation by about 0.15 also in the range from 1.5 to 2.5. As is also evident from Fig. 11a, the Praesepe member stars are dominated by main-sequence dwarfs. We also note that the Solar analogs showing excellent agreement in Fig. 10 have intrinsic colors of G_BP − G_RP = 0.818 ± 0.029 (Gaia Collaboration et al. 2022a) and would thus fall well within the regime of good agreement between 0.5 and 1.5 in Fig. 11b.
Obviously, many Praesepe member stars fall into a temperature range that is not covered sufficiently by our training sample. Unfortunately, we cannot select on our catalog's XGBoost temperature estimate because its values are also strictly confined to the range 3107 K to 6867 K of the training examples. Instead, we select on input features as shown in Fig. 12: for every individual cluster member star, we ask where its features fall in the two diagrams, and we retain it for further consideration if and only if it has a sufficiently high number of training examples nearby in these diagrams. After this filtering procedure, Fig. 13
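One simple way to implement such a "training examples nearby" criterion is to count training stars in grid cells of a feature diagram and to keep only targets that fall into well-populated cells. The sketch below does this for a single two-dimensional diagram; the cell size, the minimum count, and the feature pair are assumptions for illustration, and the actual criterion applied to the diagrams of Fig. 12 may differ.

```python
import math
from collections import Counter


def cell(x, y, size):
    """Index of the square grid cell of side `size` that contains (x, y)."""
    return (math.floor(x / size), math.floor(y / size))


def build_density(training_points, size):
    """Count the training examples in each grid cell of the feature diagram."""
    return Counter(cell(x, y, size) for x, y in training_points)


def well_covered(point, density, size, min_count):
    """True if the cell containing `point` holds at least min_count training examples."""
    return density[cell(point[0], point[1], size)] >= min_count


# Toy training set: 50 stars clustered around color ~1.2 in a hypothetical diagram.
training = [(1.0 + 0.01 * i, 0.5) for i in range(50)]
density = build_density(training, size=0.5)

keep_inside = well_covered((1.2, 0.6), density, size=0.5, min_count=10)
keep_outside = well_covered((3.0, 0.6), density, size=0.5, min_count=10)
```

Selecting on input features in this way avoids relying on the predicted T_eff, which by construction cannot leave the range spanned by the training labels.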

External validation with wide binaries
Given the catalog of El-Badry et al. (2021), we find 55 033 pairs of wide binaries where each component star is observed as an individual source by Gaia. Since both stars of each binary pair have formed from the same gas cloud, XGBoost should estimate the same metallicity for both. In fact, Fig. 14 shows that their [M/H] estimates are [...] This offset is slightly less than the systematic underestimation of −0.15 seen in Praesepe in Fig. 11b.
We also note that while most of the wide binaries in Fig. 14 [...] have Solar-like metallicities or higher. However, XGBoost assigns a median [M/H] of −0.443 to the stars in this sample, and about 11% of OBA stars are assigned an [M/H] lower than −1 by XGBoost. Consequently, the XGBoost estimates are clearly not viable for OBA stars. We note, however, that Gaia Collaboration et al. (2022a) report contamination from metal-poor stars, i.e. not all of those may actually be hot OBA stars.

ILLUSTRATION OF THE SAMPLE
The results of our XGBoost analysis are listed in Table 1. They are unprecedented in sample size at such precision and accuracy (σ([M/H]) ∼ 0.1 dex, σ(T_eff) ∼ 50 K, σ(log g) ∼ 0.08 dex) and can be used for a vast array of science applications, which are beyond the scope of this paper. This combination of sample size and data quality warrants a rigorous modeling of the selection function (see e.g. Rix et al. 2021), which is also beyond the scope of this paper. What we will do here is to provide two, only qualitative, illustrations of the sample's science potential: its total [M/H] distribution, and all-sky maps in different bins of [M/H]. More generally, we emphasize that each science application warrants specific vetting of the subsample used.

[M/H] distribution of the sample
Perhaps the most compact way to present the sample is to show its [M/H] distribution. We show this distribution in Fig. 15 for all 174 922 161 stars, for 43 520 755 likely RGB stars, and for 18 858 968 RGB stars with high-quality parallaxes (see below). The [M/H] distribution for this last subset in Fig. 15, restricted to the Milky Way within about 10 kpc, is the one that can be taken most at face value as an observational approximation of the "total" metallicity distribution of the Galaxy. Of course this is a flux-limited sample, whose extent is limited by distance, dust extinction and (in part of the sky) crowding. Proper volume corrections of this [M/H] distribution would be a complex exercise (see, e.g. Rix et al. 2021) beyond the scope of this work. Such an analysis must include the G ≤ 17.65 publication limit of XP spectra in Gaia DR3 (e.g. introducing foreground dust extinction), the completeness of photometry synthesized from XP spectra (i.e. loss of stars due to negative synthetic fluxes caused by noisy XP spectra, see Sect. 2.2) and the crowding-afflicted completeness of AllWISE photometry.
Figure 15 (caption fragment): [...] and RGB stars with high-quality parallaxes (orange). The differences between the distributions of RGB stars with and without high-quality parallaxes are physical, as the parallax cut eliminates mostly distant stars, often in the halo or the Magellanic Clouds, which are metal-poor. The parallax cut on the RGB sample matters, as most of this sub-sample also has RVS radial velocities: for these stars orbits can be calculated (e.g. Rix et al. 2022).
Nonetheless, this distribution shows a number of remarkable features, extending over a factor of 10 000 in metallicity within a single galaxy. It starts at [M/H] ≈ −3.5, rises steeply to [M/H] ≈ −2.5, and then follows [...] At this point, the onset of the old disk in metallicity, the [M/H] distribution rises quickly to a maximum near [M/H] ≈ −0.4, stays flat to [M/H] ≈ +0.2, and drops steeply beyond. The implications of this distribution in terms of chemical enrichment warrant study in a framework of chemical evolution models, such as that of Weinberg et al. (2017). The interpretation in terms of halo, old disk, thin disk, etc., will be most powerful when combining this information with orbital information. We do not pursue these avenues in this paper, but stress only two

Mono-abundance, all-sky maps
Since we have an all-sky sample with precise and robust metallicities, it behoves us to make all-sky maps as a function of metallicity to illustrate it. Fig. 17 and Fig. 18 provide all-sky maps of the sample's number density in various metallicity ranges for two samples: first, the complete (unfiltered) sample of all 174 922 161 stars, and second, a vetted sample of 17 558 141 RGB stars. The vetted RGB sample was designed to eliminate spurious [M/H] estimates at the expense of sample size, in particular to eliminate the contamination among the metal-poor stars that results from unrecognized instances of hotter but reddened stars. After some experimentation, we adopted the following selection criteria illustrated in Fig. 16:
• phot_g_mean_mag < 16
where M_W1 = W1 + 5 · log10(ϖ/100).
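The absolute magnitude entering these cuts, M_W1 = W1 + 5 · log10(ϖ/100), can be sketched as follows. This is a minimal illustration, assuming that ϖ is the parallax in milliarcseconds (so that the expression reduces to the usual distance modulus) and that no extinction correction is applied; the function name `absolute_w1` is hypothetical and not part of the published catalog.

```python
import math


def absolute_w1(w1_mag, parallax_mas):
    """Absolute W1 magnitude from apparent W1 and parallax.

    Implements M_W1 = W1 + 5 * log10(parallax / 100), assuming the
    parallax is given in milliarcseconds. No extinction correction.
    """
    if parallax_mas <= 0:
        # The logarithm is undefined for zero or negative parallaxes.
        raise ValueError("parallax must be positive")
    return w1_mag + 5.0 * math.log10(parallax_mas / 100.0)


# A star at 1 kpc has a parallax of 1 mas and a distance modulus of 10 mag,
# so an apparent W1 = 10 corresponds to an absolute magnitude near zero.
print(absolute_w1(10.0, 1.0))
```

Because the logarithm requires a positive parallax, a cut on parallax signal-to-noise is a natural companion to any selection in M_W1.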
Only 11 853 of the 2 371 118 OBA stars identified by Gaia Collaboration et al. (2022a) that are in our sample pass these quality cuts, i.e. these cuts succeed in eliminating 99.61% of OBA stars from this sample. Notably, Gaia Collaboration et al. (2022a) find that their OBA star sample has some contamination from "halo" stars (i.e. metal-poor stars), which they eliminate kinematically, but cannot eliminate for metal-poor stars with disk-like orbits. Hence, our metal-poor sample may be even purer than the above comparison implies.
For the convenience of the user, this vetted RGB subset is provided as a separate table, as described in Table 2. In this table, we also provide auxiliary information from Gaia about a source's astrometry, photometry, and RVS radial velocities that are available for a substantive fraction of the sample. This provides all the information necessary for the user to compute stellar orbits.
The three panels of the all-sky maps for the unfiltered sample in Fig. 17 clearly show the imprint of the Gaia scanning law: in particular, two "crescents" of lower sample density at high latitudes are attributable to too few transits, which prevented the publication of XP spectra in Gaia DR3 (cf. De Angeli et al. 2022, Fig. 29 therein). Moreover, dust extinction in the Galactic plane causes many stars to be dimmed to G > 17.65: they may be too faint to have XP spectra published or even too faint to be in the Gaia catalog at all. Note that the extinction in the Galactic plane appears most dramatic among the two metal-poor bins (top and middle panel), as these have far fewer foreground stars. The top map in Fig. 17 also shows an implausible set of seemingly metal-poor stars near the disk, preferentially in star-forming regions. Most likely, these objects have spuriously low [M/H] estimates, and actually are reddened OBA stars. It is these stars that most immediately show that the full unfiltered sample must contain some spurious [M/H] estimates, which motivated our vetted sample of bright giant stars. Comparison of the two top panels of Fig. 17 and Fig. 18 shows that these spurious sources are absent in the vetted giant sample.
Apart from these issues, the top maps in Fig. 17 and Fig. 18 both clearly show the central concentration of metal-poor stars toward the Galactic center, extensively discussed as the Poor Old Heart of the Milky Way in Rix et al. (2022). Figure 17 also shows the two Magellanic Clouds, which are missing from Fig. 18 due to the cut on parallax quality of ϖ/σ_ϖ > 4.

Potential filtering
While we have illustrated only two examples of how to vet or filter the overall table of [M/H] estimates, it is clear that there are further limitations of our stellar parameter estimates that may compromise the use of our catalog. Here, we provide some guidance on potential filtering by the user:
• In Rix et al. (2022), we used a bright RGB sample defined by teff_xgboost < 5300K, logg_xgboost < 3.5 and G BP < 16. While this is still possible, we point out that our results in this work also hold for main-sequence dwarfs (see Fig. 4c). Nevertheless, if the focus is on RGB stars, we recommend dropping or relaxing the selection on G BP, given that our results are robust towards the faint end (see Fig. 9).
• Given that OBA stars are problematic (see Sect. 3.7), one could take the golden sample of OBA stars from Gaia Collaboration et al. (2022a) and remove all known OBA stars from the sample.
• The Gaia DR3 publication limit of G = 17.65 for XP spectra was not strict. Rather, XP spectra for 162 686 QSOs and 26 500 galaxies were also published down to the survey detection limit. Furthermore, XP spectra of ultra-cool dwarfs were published beyond G = 17.65. Given our training sample's temperature range of 3107K to 6867K, ultra-cool dwarfs are not covered. Therefore, the user may consider removing all 129 997 results for G > 17.65.
• The user may want to give special consideration to globular clusters, as those represent regions of high source density where the CCD windows assigned to the XP spectra may begin to overlap, thus compromising the XP spectra and our derived [M/H] estimates. This effect was illustrated for Omega Centauri in Fig. 27 of Creevey et al. (2022).
• When working with [M/H] for stars of the main sequence, we recommend limiting the color range to 0.5 < G BP − G RP < 1.5 for unbiased results (see Fig. 10 and Fig. 11).The T eff estimates of stars on the main sequence may be precise over a wider range.
• In order to prevent invalid extrapolations beyond the training sample, the user can check if a source's colors fall within the training sample, as we did for clusters in Sect. 3.5 and Fig. 12. To this end, our catalog contains a column named in_training_sample, which is a boolean flag that indicates if a source was part of the training sample.
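Several of the suggestions above can be combined into a single row-level filter. The following is a minimal sketch rather than a recommended recipe: it assumes each catalog row is available as a dictionary, that the parameter columns are named teff_xgboost and logg_xgboost as in the text, and that source_id, phot_g_mean_mag and bp_rp carry their Gaia DR3 archive meaning. The helper names and the choice to treat every star that is not an RGB candidate as a main-sequence star are simplifying assumptions made for illustration.

```python
G_PUBLICATION_LIMIT = 17.65  # nominal Gaia DR3 limit for XP spectra


def is_rgb_candidate(star):
    """RGB selection quoted in the text: teff < 5300 K and logg < 3.5."""
    return star["teff_xgboost"] < 5300.0 and star["logg_xgboost"] < 3.5


def passes_user_filters(star, known_oba_ids=frozenset()):
    """Apply the magnitude, OBA and main-sequence colour suggestions."""
    # Drop sources published beyond the nominal XP magnitude limit.
    if star["phot_g_mean_mag"] > G_PUBLICATION_LIMIT:
        return False
    # Drop sources listed in an external OBA-star catalog.
    if star["source_id"] in known_oba_ids:
        return False
    # Simplification: anything that is not an RGB candidate is treated as
    # main sequence and restricted to 0.5 < G_BP - G_RP < 1.5.
    if not is_rgb_candidate(star) and not (0.5 < star["bp_rp"] < 1.5):
        return False
    return True


# Made-up rows for illustration only; these are not catalog entries.
stars = [
    {"source_id": 1, "phot_g_mean_mag": 15.2, "bp_rp": 1.3,
     "teff_xgboost": 4800.0, "logg_xgboost": 2.4},
    {"source_id": 2, "phot_g_mean_mag": 18.9, "bp_rp": 0.9,
     "teff_xgboost": 5600.0, "logg_xgboost": 4.4},
    {"source_id": 3, "phot_g_mean_mag": 14.0, "bp_rp": 2.1,
     "teff_xgboost": 3900.0, "logg_xgboost": 4.7},
    {"source_id": 4, "phot_g_mean_mag": 13.5, "bp_rp": 0.8,
     "teff_xgboost": 5800.0, "logg_xgboost": 4.4},
]
kept = [s["source_id"] for s in stars if passes_user_filters(s)]
print(kept)
```

Which of these conditions to apply, and how strictly, depends on the science case; the globular-cluster and training-coverage checks are not included here because they require positional and color information beyond these columns.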

SUMMARY
We have derived and presented a catalog of data-driven, precise, accurate and robust metallicity estimates [M/H] (as well as T eff and log g) for 175 million stars from Gaia DR3. These estimates were derived using an externally trained XGBoost algorithm that draws on an extensive set of data features: parallaxes, low-resolution XP spectra, robust synthetic photometry based on those XP spectra, and CatWISE photometry. By construction, the resulting parameters are tied to the stellar parameter scale of the main training set, SDSS DR17 (APOGEE). The entire catalog is published and available online (Andrae et al. 2023).
This catalog greatly improves on our earlier catalog in Rix et al. (2022) in several respects: 1) It is all-sky, not restricted to stars towards the Galactic center. 2) It covers much of the stellar color-magnitude plane, not just red giants. 3) It encompasses all stars with XP spectra, not just the bright ones (G < 16). 4) The [M/H] estimates overcome the [M/H] ≳ −2.5 limitation in Rix et al. (2022) by augmenting the main APOGEE DR17 sample with the very and extremely metal-poor stars from Li et al. (2022) in the training of the XGBoost algorithm. 5) It replaces AllWISE (Cutri et al. 2021) with CatWISE (Marocco et al. 2021), thus improving completeness substantially.
For stars within our training sample's temperature range (from 3107K to 6867K), our empirical approach recovers the [M/H] to within an RMS test error of 0.1 (from cross-validation in Fig. 4) and an RMS validation error of 0.146 on GALAH DR3 (see Fig. 7). In particular, our empirical results exhibit the same systematics that our APOGEE-dominated training sample exhibits in comparison to GALAH DR3, i.e. our results are consistent with the known discrepancies between spectroscopic surveys. An independent validation on Solar-analog candidates from Gaia Collaboration et al. (2022a), on members of the Praesepe cluster, and on wide binaries from El-Badry et al. (2021) confirms a typical [M/H] uncertainty of 0.1 with negligible bias for stars on the main sequence in this intermediate temperature regime. Towards the faint end, [M/H] errors increase moderately, reaching 0.15 at G ∼ 14, 0.2 at G ∼ 16, and finally ∼ 0.4 at G ∼ 17.65 (see Fig. 9), but we do not see systematic errors emerge. We suspect that this "graceful" degradation is probably caused by increasing noise in the narrow-band photometry synthesized from XP spectra.
We provide the full, unfiltered catalog of all ∼ 175 million [M/H] estimates without applying any quality cuts (see Table 1), as we had already published a smaller catalog with highly conservative cuts as part of Rix et al. (2022). The purpose of the current work is to push towards what is maximally possible in [M/H] estimates from XP spectra, to allow further and broader scientific exploitation of the Gaia DR3 data. Consequently, the user is advised to carefully vet each data subset to understand its limitations for each astrophysical application. In particular, all applications that draw on stars with parameters not well represented in the training set require caution. We provide some guidance in Sect. 4.3. For user convenience, we also define a vetted sample of 17.5 million RGB stars (see Table 2) with conservative cuts to ensure high data quality. This sample is still much larger than the sample in Rix et al. (2022), which contained only 2 million RGB stars towards the Galactic center.
However, we emphasize that main-sequence stars from the overall sample presented here can also be used reliably for [M/H] analysis, as long as they are in the color range 0.5 < G BP − G RP < 1.5, where they achieve typical [M/H] uncertainties between 0.079 for wide binaries (see Sect. 3.6) and 0.1 for Solar analogs (see Fig. 10). Additionally, in Appendix B we provide instructions on how to retrieve all Gaia DR3 sources with radial velocity measurements and astrometry and match those with our metallicity catalog in order to facilitate chemodynamical studies.
This catalog is already being used for various upcoming research projects, e.g. on the metallicity gradient in the Large Magellanic Cloud (Andrae et al. in prep.), on the chemodynamics of the Milky Way disk (Chandra et al. in prep.), and on stellar rotation in open clusters (Pancino et al. in prep.).
Although Gaia DR3 is only a few months old, our work also allows us to be very optimistic for Gaia DR4: The fact that we can obtain robust and reliable [M/H] estimates even for the faintest stars at the publication limit of XP spectra in Gaia DR3 (see Fig. 4d and Fig. 9) suggests that useful [M/H] estimates might also be achievable for fainter XP spectra that will be published in Gaia DR4.In addition, it is likely that the XP spectra themselves will improve substantially from Gaia DR3 to Gaia DR4, due to improved processing and about twice as many observing epochs.All this bodes extremely well for the science potential of the XP spectra that will be published in Gaia DR4.
[M/H]. Since the first submission of this manuscript, two similar works have been published: Yao et al. (2023) classify 188 000 candidates for very-metal-poor stars ([Fe/H] < −2) using XP spectra and XGBoost, while Zhang et al. (2023) build an empirical forward model from LAMOST training examples in order to estimate stellar parameters with realistic uncertainties for all 220 million published XP spectra. The rest of the paper is organized as follows: In Sect. 2, we explain how we compile the training sample, what input features we choose for XGBoost and we show first internal validation results. In Sect. 3, we define our application sample and validate our results on external data that have not been used for testing. In the closing Sect. 4, we illustrate the power of this sample by showing a set of all-sky maps in different metallicity bins, which illustrates that even the (rare) low-metallicity subsamples have little if any contamination. In the summary and outlook, we touch on obvious future astrophysical uses of this sample. The catalogs produced in this work are published online (Andrae et al. 2023).

Figure 1. Distribution of [M/H] in the training sample of stars, which is drawn from SDSS-APOGEE DR17 (red) and the very metal-poor stars from Li et al. (2022) (blue).

Figure 2. Distributions of the [M/H] training sample in terms of effective temperature, surface gravity and metallicity. The dominant SDSS-APOGEE DR17 part of the sample is shown as the logarithmic density map; the metal-poor training stars from Li et al. (2022) as black dots.

Figure 3. Completeness of the sample at a given G BP magnitude in several bandpasses that limit the application of XGBoost, which requires the full set of features for both training and testing. For G BP ≲ 17.7 CatWISE is the most severe limitation, while for G BP ≳ 17.7 the two narrow bandpasses synthesized with GaiaXPy in the far blue (e.g. Pristine_mag_CaHK and Jplus_mag_J0395) become severely incomplete.
Fig. 4 is most important because it shows the cross-validation of the [M/H] estimates. Panel (a) shows that, for the most part, our results are accurate, i.e. unbiased with respect to the APOGEE reference [M/H]. There are only a few outliers where XGBoost assigns a higher [M/H] value than APOGEE. However, for training labels [M/H] < −3 XGBoost tends to overestimate [M/H]. Most likely, this is a consequence of mixing different definitions of "metallicity" in our training sample: While APOGEE provides [M/H] estimates, Li et al. (2022) provide estimates of [Fe/H]. Astrophysically, very old stars will have low iron content but may already have been enhanced in other elements, such that the stars from Li et al. (2022) may be genuinely low in [Fe/H] but have higher [M/H], which is recognized by XGBoost learning [M/H] from the majority of training examples provided by APOGEE. Panel (b) shows how our [M/H] estimates depend on T eff values: the agreement is overall very good, and for stars hotter than ∼5500K our [M/H] estimates are closer to the training labels than for cooler stars. The main reason for this is that for T eff > 5500K the APOGEE sample contains virtually no stars with [M/H] below −0.75, limiting the comparison to the "easy" metal-rich regime. Panel (c) shows that our [M/H] residuals do not exhibit any noteworthy trends with the training sample's log g, i.e. our [M/H] estimates work just as well for main-sequence dwarfs as they do for red giant stars.

Figure 4. Cross-validation of the XGBoost parameters on the training sample (SDSS-APOGEE DR17 and Li et al. (2022)). The plots show the results of the twenty-fold cross-validation of the 5% portions of the training sample, withheld in the training. Rows from top to bottom show [M/H] residuals, T eff residuals and log g residuals. Columns from left to right show residuals vs. the training sample's [M/H], T eff, log g and Gaia's apparent G BP magnitude. The numbers in the top right corners quote the median absolute difference (MedAD) and the root mean square difference (RMSD). The density map is logarithmic. These plots illustrate the remarkable precision of the approach: 0.10 dex in [M/H], 50K in T eff, and 0.08 dex in log g. Note that these variances still include all the uncertainties in the APOGEE estimates.
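The two summary statistics quoted in the cross-validation panels can be computed as in the sketch below. It assumes that MedAD denotes the median of the absolute differences between estimate and reference label (rather than a deviation about the median) and that RMSD is the root mean square of the same differences; the function names are illustrative and not taken from the authors' code.

```python
import math
import statistics


def medad(estimates, references):
    """Median absolute difference between estimates and reference labels."""
    return statistics.median(abs(e - r) for e, r in zip(estimates, references))


def rmsd(estimates, references):
    """Root mean square difference between estimates and reference labels."""
    diffs = [e - r for e, r in zip(estimates, references)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Made-up [M/H] values for illustration only.
xgboost_mh = [0.0, -1.0, -2.0]
reference_mh = [0.1, -1.0, -2.3]
print(medad(xgboost_mh, reference_mh), rmsd(xgboost_mh, reference_mh))
```

The RMSD is sensitive to a few large outliers while the MedAD is not, which is why quoting both gives a fuller picture of the residual distribution.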

Figure 5. Cross-validation of XGBoost on the training sample (SDSS-APOGEE DR17 and Li et al. (2022)). Dependence of test error for [M/H] on WISE AK extinction (panel a) and parallax (panel b). [M/H] shows no systematics with either.

Figure 6. Apparent magnitude distributions of G (black), G BP (blue), G RP (red) and W1 (grey) for the full application sample of 174 922 161 stars without any quality cuts. The bump in the G-band distribution at G ∼ 19 reflects contamination by galaxies, QSOs, and ultracool dwarfs.
source_id FROM gaiadr3.gaia_source_lite WHERE has_xp_continuous='true' AND parallax IS NOT NULL results in 218 132 063 stars. Since we require complete photometry in CatWISE and synthesized passbands (see Sect. 2.2), not all of them have the complete set of input features to XGBoost. The final number of stars that satisfy these additional conditions is 174 922 161 (∼80.2%).
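The quoted completeness follows directly from the two source counts, as the short check below illustrates. The query string is included only for orientation: its leading SELECT keyword is an assumption, since the start of the query is not reproduced in the text, and the string is not executed here.

```python
# Presumed form of the archive query; the leading "SELECT" is an assumption.
ADQL_QUERY = (
    "SELECT source_id FROM gaiadr3.gaia_source_lite "
    "WHERE has_xp_continuous='true' AND parallax IS NOT NULL"
)

# Source counts quoted in the text.
n_with_xp_and_parallax = 218_132_063
n_with_all_features = 174_922_161

# Fraction of sources that retain the complete set of XGBoost input features.
fraction = n_with_all_features / n_with_xp_and_parallax
print(f"{100 * fraction:.1f}%")  # prints 80.2%
```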

Figure 7. Comparison of XGBoost [M/H] estimates with GSP-Spec's calibrated metallicity estimates in Gaia DR3 (Recio-Blanco et al. 2022), with GALAH DR3 (Buder et al. 2021), and with SkyMapper DR2 (Chiti et al. 2021). For SkyMapper DR2, we impose the quality flag equal to 0. Numbers quote the median absolute difference (MedAD) and the root-mean-square difference (RMSD). No quality cuts were applied, and the density maps are logarithmic. The comparison with GSP-Spec DR3 is very good, but the comparison is limited (mostly by GSP-Spec) to [M/H] ≳ −1. The comparison with GALAH DR3 is also very good, with a small systematic offset at [M/H] ≤ −1, already noted in Rix et al. (2022). Comparison with the photometric [M/H] estimates from SkyMapper DR2 shows a substantially increased scatter. The good comparison of XGBoost results with other surveys makes it likely that this is attributable to SkyMapper issues.

Figure 9. Differences of [M/H] between XGBoost and LAMOST DR6 as a function of apparent G magnitude. The Gaia DR3 publication limit for XP spectra is G = 17.65. Black lines indicate the 16th, 50th and 84th percentiles as a function of G. Color maps indicate logarithmic number density.

Figure 10. Distribution of [M/H] estimates from XGBoost for 5759 Solar analog candidates from Gaia Collaboration et al. (2022a). A Gaussian with zero mean and standard deviation of 0.1 is given by the dashed line, illustrating that the [M/H] estimates are precise and accurate on the main sequence at high metallicities and for T eff ∼ 5772K.

Figure 11. Validation of the [M/H] estimates on the main sequence, using the Praesepe cluster. The cluster's color-magnitude diagram of all 653 members with [M/H] is illustrated in panel (a). Their [M/H] estimates are shown as a function of color in panel (b). The horizontal dashed line indicates the metallicity of 0.16 (Z = 0.02) adopted by Gaia Collaboration et al. (2018). For 0.5 < G BP − G RP < 1.5 the metallicity agreement is excellent, whereas for 1.5 < G BP − G RP < 2.5 the estimates are systematically too low by 0.15 dex. Outside of these color ranges the agreement is poor. We attribute the offsets and the poor estimates to possible systematics and poor sampling of the CMD space in the training sample. The [M/H] estimates for main-sequence stars with colors outside 0.5 < G BP − G RP < 2.5 are manifestly unreliable.
shows that the XGBoost [M/H] estimates agree reasonably well with the adopted mean metallicities of the 36 open clusters from Gaia Collaboration et al. (2018).We do see a slight positive offset below [Fe/H] of -1 and a slight negative offset around solar [Fe/H].The latter is probably similar to the offset seen in Fig. 11 for Praesepe and colors G BP −G RP > 1.4.The slight positive offset below [Fe/H] of -1 may be due to XGBoost occasionally overestimating [M/H] in that regime (see Fig. 4a).

Figure 12. XGBoost input feature distributions of the application sample (colormaps), overlaid with black contours of the training sample. The lowest contour is at 1 star per bin, i.e. it encloses the full training sample, and the other contours successively increase by factors of 10. This shows that our full sample extends across important portions of color-color space that are not covered by the training sample. The resulting parameter estimates in these regimes will inevitably be unreliable.

Figure 13. Comparison of [M/H] estimates from XGBoost after filtering on input features to mean metallicities of 36 open clusters from Gaia Collaboration et al. (2018). Black dots show median [M/H] and grey errorbars show 16th and 84th percentiles in each cluster.
are at [M/H] > −0.5, there are a handful of systems with [M/H] < −1 and that in those cases the XGBoost estimates still hold.

3.7. OBA stars as a failure mode
In this section, we investigate the results for OBA stars. Gaia Collaboration et al. (2022a) compiled a list of 3 023 388 OBA stars, whereof 2 371 118 are in our sample. These are beyond the temperature range of the training sample, i.e. we intentionally break the model assumptions of our XGBoost model. Given that absorption lines are often washed out in very hot stars, we expect XGBoost to misinterpret such stars as metal-poor. Since OBA stars are mostly young, they should

Figure 14. Comparison of [M/H] estimates for components of 30 748 wide binaries from El-Badry et al. (2021). We quote the root-mean-squared difference (RMSD) divided by √2 because we want to quantify the difference between two noisy [M/H] estimates.
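The reason for the factor √2 can be made explicit: if both components of a binary share the same true [M/H] and each estimate carries an independent error with the same standard deviation σ, the difference of the two estimates has variance 2σ², so σ is recovered as RMSD/√2. The sketch below illustrates this under exactly those assumptions (equal, independent errors and chemically homogeneous pairs); the function name is hypothetical.

```python
import math


def single_star_scatter(mh_primary, mh_secondary):
    """Per-star [M/H] scatter inferred from wide-binary pairs.

    Assumes both components share the same true [M/H] and have independent
    errors of equal size, so Var(difference) = 2 * sigma**2.
    """
    diffs = [a - b for a, b in zip(mh_primary, mh_secondary)]
    rmsd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return rmsd / math.sqrt(2.0)


# Made-up pairs for illustration: differences of +0.1 and -0.1 dex.
print(single_star_scatter([0.1, -0.1], [0.0, 0.0]))
```

If the two components have very different brightness, their errors are unlikely to be equal, and the result is then better read as an effective average scatter.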

Figure 15. [M/H] distributions resulting from XGBoost for all stars in our catalog (black), RGB stars (red, teff_xgboost < 5300K and logg_xgboost < 3.5) and RGB stars with high-quality parallaxes (orange). The difference in the distributions between the RGB stars with and without high-quality parallaxes is physical, as the parallax cut eliminates mostly distant stars, often in the halo or the Magellanic Clouds, which are metal-poor. The parallax cut on the RGB sample matters, as most of this sub-sample also has RVS radial velocities: for these stars orbits can be calculated (e.g. Rix et al. 2022).
also shows an implausible set of seemingly metal-poor stars near the disk, preferentially in star-forming regions. Most likely, these objects have spuriously low [M/H] estimates, and actually are reddened OBA stars
and Fig. 18 both clearly show the central concentration of metal-poor stars toward the Galactic center, extensively discussed as the Poor Old Heart of the Milky Way in Rix et al. (2022). Figure 17 also shows the two Magellanic Clouds, which are missing from Fig. 18 due to the cut on parallax quality of ϖ/σ_ϖ > 4.

Figure 16. Illustration of quality cuts in effective temperature and M_W1 = W1 + 5 · log10(ϖ/100) for the vetted RGB sample (colored density map, 17 558 141 stars) compared to the full sample (gray density map, 174 922 161 stars). The cuts were designed with two goals in mind: first, isolate a subsample of bright giants for which the [M/H] estimates should be most precise and robust, to be used e.g. in Galactic chemodynamics. Second, they are limited in temperature to 5200K, which was empirically found to be highly effective to eliminate contamination of the metal-poor subsample by unrecognized hotter and reddened stars.

Figure 17. Skymaps showing the logarithmic number density of all unfiltered stars with −3 < [M/H] < −1.2 (top panel), −0.9 < [M/H] < −0.6 (middle panel) and −0.3 < [M/H] < 0.1 (bottom panel). These illustrate two important issues: the incompleteness of XP spectra in the two sickle-shaped regions at high latitude, and a significant contamination of the metal-poor bin in the unfiltered sample, which manifests itself as a thin disk near the Galactic plane; these are presumably hotter and highly reddened stars with weaker metal lines that are not recognized as such by our algorithm, as it lacks good training sets in this CMD regime.

Figure 18. Skymaps showing the logarithmic number density of vetted RGB stars for −3 < [M/H] < −1.2 (top panel), −0.9 < [M/H] < −0.6 (middle panel) and −0.3 < [M/H] < 0.1 (bottom panel). This vetted sample (see Fig. 16) is restricted to red giants with good S/N (G < 16), T eff cuts that eliminate contaminants in the low-metallicity subsample and significant parallax (which explains the "disappearance" of the Magellanic Clouds compared to Fig. 17). The top panel qualitatively illustrates how clean the metal-poor subsample is: it prominently shows the Poor Old Heart of the Galaxy (Rix et al. 2022), without any traces of spurious sample members in the disk that is so dominant in the high-metallicity subsample (bottom panel).
(2022a), on members of the Praesepe cluster, and on wide binaries from El-Badry et al. (2021) confirms typical [M/H] uncertainty of 0.1 with negligible bias for stars on the main sequence in this intermediate temperature regime.Towards the faint end, [M/H] errors increase moderately, reaching 0.15 at G ∼ 14, 0.2 at G ∼ 16, and finally ∼ 0.4 at G ∼ 17.65 (see Fig.

Table 1. Table description of the 174 922 161 XGBoost estimates presented in this work. The full table is available online (Andrae et al. 2023). The Gaia DR3 source_id is sorted in ascending order. The boolean flag in_training_sample indicates whether or not a source was part of the XGBoost training sample. We provide minimal information in order to save data volume.

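Table 2. Table description of 17 558 141 vetted RGB results provided online (Andrae et al. 2023). We adopt the column names from the Gaia DR3 archive where appropriate. We emphasize that the zero-point correction of Lindegren et al. (2021) has been applied to the parallaxes in this table.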