Mass and Age determination of the LAMOST data with different Machine Learning methods

We present a catalog of 948,216 stars with mass labels and a catalog of 163,105 red clump (RC) stars with both mass and age labels. The training dataset is obtained by cross-matching LAMOST (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope) DR5 with high-resolution asteroseismology data, and masses and ages are predicted with the random forest method, combined with a convex hull algorithm for the large sample. We extract the stellar parameters that are highly correlated with mass and age. On the test dataset, the median relative error of the predicted mass for the large sample is 3\%, while for red clump stars it is 4\% for mass and 7\% for age. We also compare the predicted ages of red clump stars with recent works and find that the final uncertainty of the RC sample could reach 18\% for age and 9\% for mass, while the final precision of the mass for the large sample, which contains different types of stars, could reach 13\% without considering systematics. These results imply that the method could be widely used in the future. Moreover, we explore the performance of different machine learning methods on our sample, including Bayesian linear regression (BYS), gradient boosting decision tree (GBDT), multilayer perceptron (MLP), multiple linear regression (MLR), random forest (RF) and support vector regression (SVR). We find that the nonlinear models generally perform better than the linear models, and that GBDT and RF perform relatively better than the others.


INTRODUCTION
To describe the current structure, evolution and formation history of the Milky Way, it is necessary to accurately estimate the masses and ages of a large number of stars distributed throughout our home galaxy. From stellar spectra, astronomers can derive many stellar parameters (Mathur et al. 2017; Wu et al. 2019; Huang et al. 2020; Zhang et al. 2020, 2021). However, it is still not easy to determine stellar ages accurately and precisely. The indirect isochrone method can obtain the ages of clusters with relatively high precision by matching the observed data to stellar evolution models (Soderblom 2010; Xiang et al. 2017), but for field stars the precision of this method is limited because it requires highly accurate stellar parameters.
Corresponding author: HFW, hfwang@bao.ac.cn
For a long time, owing to the limitations of observations and data analysis, ages could only be estimated for a small number of stars in the solar neighborhood (Edvardsson et al. 1993; Nordström et al. 2004; Takeda et al. 2007; Haywood et al. 2013; Bergemann et al. 2014). With large sky surveys such as LAMOST (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope), Xiang et al. (2015, 2017) estimated the ages of a large number of stars. Subsequently, a relation was found between the carbon and nitrogen abundances and the ages of giant stars, which has been used to predict the ages of red giant branch stars (Martig et al. 2016; Ness et al. 2016; Ho et al. 2017).
There are also other surveys that can provide ages for large samples. The GALAH survey (Galactic Archaeology with HERMES), a high-resolution spectroscopic survey, is designed as a chemical tagging experiment (Freeman & Bland-Hawthorn 2002; Bland-Hawthorn et al. 2010). For very bright stars, more than 30 different elements can be measured, and an age and kinematic inventory of the solar neighbourhood has been provided by Buder et al. (2019). Bright giant stars are the primary targets of the APOGEE survey (the Apache Point Observatory Galactic Evolution Experiment), another high-resolution spectroscopic survey that has produced several works on mass and age (Zasowski et al. 2013; Martig et al. 2016; Majewski et al. 2017). Recently, a precision of ∼5% for mass and ∼20% for age was achieved by Silva Aguirre et al. (2020) with TESS (Transiting Exoplanet Survey Satellite) data. Other results have similar precision, such as ∼6% for mass and ∼20% for age in Stello et al. (2021), who used TESS asteroseismology of the Kepler red giants, and ∼10% for mass and ∼30% for age in Mackereth et al. (2021), who used asteroseismology of giant stars in the TESS continuous viewing zones and beyond.
A correlation has been found between the ages of solar-like stars and their surface rotation, and it has been studied in detail with asteroseismology data (García et al. 2014; McQuillan et al. 2014; Ceillier et al. 2016; van Saders et al. 2016). Asteroseismology is known to be an effective method for estimating the masses and ages of stars (Gai et al. 2011; Chaplin et al. 2014). However, it requires high-precision, long-duration and high-resolution photometric observations, so the available asteroseismic samples are still not large enough.
Although there are many methods to predict the masses and ages of stars, their precision and efficiency are still limited. We need to make full use of big data to obtain larger samples and to try more methods to improve the prediction precision, so that we can explore more effectively the assembly history of the Galaxy and further properties of the Milky Way's mass distribution, population structure and dynamical evolution (e.g., Wang et al. 2018a,b, 2019, 2020a,b,c, 2022a; Bland-Hawthorn et al. 2019; Yu et al. 2021; Yang et al. 2022, and references therein).
Machine learning is a branch of artificial intelligence in which algorithms are trained on high-quality data. By combining machine learning with high-quality asteroseismology data, we can learn the relationship between stellar mass (age) and stellar parameters, and then derive these two quantities for a large sample with high confidence.
In this paper, we use machine learning methods to estimate the masses of a large sample of stars and the masses and ages of a smaller sample of red clump stars in LAMOST. Furthermore, we quantitatively compare different machine learning methods for the first time.
The paper is structured as follows: Section 2 presents the data we adopt, Section 3 introduces the methods, Section 4 shows our results, Section 5 is the discussion, and Section 6 gives a brief summary of our work.

2. DATA
2.1. Catalogs
Xiang et al. (2019) provided a catalog of 8,162,566 stars from the LAMOST survey, with chemical abundances derived from the DD-Payne model, which inherits from both the Payne and the Cannon (Ness et al. 2015). In this work, we use this catalog to obtain the chemical abundances of stars. Ting et al. (2018) provided 175,202 red clump stars in LAMOST with 3% contamination, together with two asteroseismic parameters, ∆P and ∆ν. We use this catalog to obtain the RC label. Note that ∆P and ∆ν are also derived from the stellar spectra. The frequency separation (∆ν) between adjacent acoustic p-modes and the period spacing (∆P) of the mixed gravity g-modes and acoustic p-modes can be used to separate red clump stars from red giant branch stars (Hawkins et al. 2018). The precision of the LAMOST ∆P and ∆ν is 50 s and 1 μHz respectively, which is sufficient for the age and mass determination according to previous results. In this work, we determine the final ages and masses using a new training dataset and the methods we choose, and then compare with other catalogs to test the robustness of the different methods.
Pinsonneault et al. (2018) provided ages for 6,676 stars in APOKASC-2, derived from the models of Serenelli et al. (2018) using mass, radius, [Fe/H] and [α/Fe]. We train our models for mass and age on this high-quality, high-resolution asteroseismology catalog. More specifically, the catalog gives stellar properties for a large sample of evolved stars with APOGEE spectroscopic parameters and Kepler asteroseismic data, obtained with five independent techniques. The median random mass uncertainty is at the 4% level for red giant branch (RGB) stars and the 9% level for RC stars, and the age precision is within 8%, which makes the catalog suitable as a training sample.
In short, we use the chemical abundances from Xiang et al. (2019), the precise masses and ages from Pinsonneault et al. (2018), and the red clump labels, ∆P and ∆ν from the red clump catalog described above. We then use machine learning methods, trained on the high-quality asteroseismic ages and masses, to estimate the masses of the large sample of Xiang et al. (2019) and the ages and masses of the stars in the red clump catalog.
After cross-matching the above catalogs, we first obtain 4,479 stars for predicting the large-sample mass (LS-mass), and 1,806 stars for the red clump mass (RC-mass) and red clump age (RC-age). Note that these are not the final datasets, as shown in the next part. The distribution on the sky, in Galactic longitude and latitude, of the samples to be predicted is shown in Fig. 1.

Final training datasets
To improve the precision of the machine learning predictions, we carry out the following experiment on the three datasets mentioned above.
The three datasets obtained from the first cross-match are each split equally into a test sample and a training sample, and we first use RF to train and to predict the mass and age of the test sample. For the large-sample stars (LS-mass), we select stars whose absolute error of the mass prediction is less than 1 M⊙ and whose relative error is less than 0.3. For RC stars, we select stars whose absolute error of the mass (age) prediction is less than 1 M⊙ (3 Gyr) and whose relative error is less than 0.4. In this step we use only 200 decision trees and take all the stellar parameters shown in Fig. 3 as input. We finally obtain an LS-mass set of 4,246 stars, an RC-mass set of 1,751 stars and an RC-age set of 1,384 stars for training and prediction, as detailed in Fig. 2, which shows the final training mass and age distributions on the Teff-log g plane.
The machine learning methods used in this paper are mainly taken from Scikit-learn (sklearn) (Anghel et al. 2019; Mediratta & Oswal 2019; Florescu & England 2020), whose functionality can be divided into six categories: classification, regression, clustering, dimensionality reduction, model selection and preprocessing.
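The cleaning cuts applied to the first-pass predictions can be illustrated with a minimal Python sketch. It assumes that a first-pass prediction already exists for every star (for instance from the 200-tree RF run) and only shows the selection step. The function names and the example values are hypothetical and are not taken from the actual pipeline.

```python
def passes_cut(true_value, predicted_value, max_abs_error, max_rel_error):
    """Return True if a star's prediction error is inside both limits."""
    abs_error = abs(predicted_value - true_value)
    rel_error = abs_error / abs(true_value)
    return abs_error < max_abs_error and rel_error < max_rel_error


def clean_training_set(stars, max_abs_error, max_rel_error):
    """Keep only stars whose first-pass prediction passes both cuts.

    `stars` is a list of (true_value, predicted_value) pairs.
    """
    return [s for s in stars if passes_cut(s[0], s[1], max_abs_error, max_rel_error)]


# Made-up (true, predicted) masses in solar masses, for illustration only.
example = [(1.2, 1.25), (1.0, 1.5), (2.5, 1.4), (0.9, 1.1)]

# LS-mass cuts quoted in the text: |error| < 1 Msun and relative error < 0.3.
kept = clean_training_set(example, max_abs_error=1.0, max_rel_error=0.3)
print(kept)
```

For the RC samples the same function would be called with a relative limit of 0.4, and with an absolute limit of 3 (in Gyr) when the predicted quantity is age.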
First, we explore the feature importance of the stellar parameters for the mass and age of the three selected training samples with the RF method, as shown in Fig. 3. To prevent a single feature from dominating the prediction simply because of its units or numerical scale, we standardize the features, which can also accelerate the convergence of the weight parameters. Standardization, or Z-score normalization, transforms each feature by subtracting its mean and dividing by its standard deviation.
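Z-score normalization of a single feature can be sketched in a few lines of Python. This is only an illustration of the transformation itself: the function name and the temperature values are made up, and the use of the population standard deviation is an assumption (it is the convention followed by scikit-learn's StandardScaler).

```python
from statistics import fmean, pstdev


def standardize(values):
    """Z-score one feature column: subtract the mean, divide by the std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (assumed convention)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / std for v in values]


# Made-up effective temperatures in K, for illustration only.
teff = [4500.0, 4700.0, 4900.0, 5100.0]
teff_scaled = standardize(teff)
print(teff_scaled)
```

After the transformation every feature has zero mean and unit standard deviation, so features measured in kelvin and in dex enter the model on the same numerical scale.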
The RF method adopted here is built on decision trees, and the final prediction combines the results of these trees. Correlations between different parameters can be identified through the information gain used to train the model, so the method is robust and helps to avoid overfitting.
The importance indicates relative significance. We find in a test that the importances of many stellar parameters are highly correlated, so it is reasonable to use only the first six or nine parameters to estimate mass and age. As shown in Equation 3 of Pinsonneault et al. (2018), the mass is very sensitive to ∆ν, and age is in turn sensitive to mass, so it is not surprising that ∆ν is the most important feature for the RC age and mass.

Features choice
The relation between the prediction precision and the number of features in the training dataset, based on the distribution of relative errors versus the number of features, is shown in Fig. 4. The mean relative error of the test dataset decreases as the number of training features increases (orange line) until it becomes stable. Based on this pattern, we choose the first six stellar parameters to train the model for the mass of the large-sample stars (LS-mass), and the first nine features for the mass and the first six features for the age of RC stars.
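One possible way to turn "until it becomes stable" into a rule is sketched below in Python. The stopping criterion is an assumption made for illustration (no later feature count improves the error by more than a chosen tolerance), as are the function name, the tolerance and the error curve; the text does not state which criterion was actually applied to Fig. 4.

```python
def smallest_stable_feature_count(errors, tolerance):
    """Return the smallest number of features after which the error is stable.

    `errors[i]` is the test-set relative error when the top (i + 1) features
    are used. "Stable" is an assumed rule: using more features never lowers
    the error by more than `tolerance`.
    """
    for i, err in enumerate(errors):
        best_later = min(errors[i:])
        if err - best_later <= tolerance:
            return i + 1
    return len(errors)  # only reached for an empty curve


# Made-up error curve (fractions, not percent), for illustration only.
curve = [0.20, 0.14, 0.10, 0.08, 0.07, 0.062, 0.061, 0.060, 0.060]
n_features = smallest_stable_feature_count(curve, tolerance=0.005)
print(n_features)
```

A looser tolerance selects fewer features at the cost of a slightly larger error, which is the trade-off behind preferring a model that works with fewer input parameters.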
We notice that the LS-mass sample mixes different types of stars, some of which might not be represented in the training dataset. We therefore use the first six stellar parameters, [C/Fe], Teff, [Mg/Fe], [N/Fe], log g and [Ba/Fe], to construct a convex hull in order to determine which stellar types our trained model is suitable for, as displayed in Fig. 5. Our sample consists mainly of K giant stars, including RC and RGB stars, with very few stars of possibly other types such as G type, which is consistent with APOKASC consisting mainly of RGB and RC stars. Our large sample therefore consists almost entirely of K giants. LAMOST DR5 contains around 1 million K giant stars, and the convex hull selects 948,216 stars, so the two numbers are self-consistent. Our method could be widely applied to different types of stars if the quality and quantity of the training dataset become sufficient in the future. To avoid mixing RGB and RC stars, we choose not to estimate the ages of the whole large sample here; the age estimation for RGB stars will be presented in a future work. Algorithms that construct convex hulls of various objects have been widely used in astrophysics, mathematics and computer science. Finally, we have 948,216 stars for LS-mass that are suitable for the model trained on Pinsonneault et al. (2018). We have also removed entries with missing values from the red clump catalog before the mass and age determination, which leaves 163,105 stars to be predicted without the convex hull algorithm.
The distribution of the predicted mass and age on the sky is shown in Fig. 6, coloured by mass or age in Galactic longitude and latitude. For the mass distribution, the more massive stars are located in the disk, similar to the mass pattern of the red clump stars in the middle panel, and the age distribution of red clump stars also shows that the younger stars are mainly located at low latitudes. This is naturally understood, since there are more star-forming regions in the disk, so more massive and younger stars are located in the disk and at low latitudes.
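The idea behind the convex hull selection of the LS-mass sample can be illustrated in two dimensions with a short Python sketch. This is a deliberate simplification: the hull used for the catalog is built in the space of six stellar parameters, whereas the sketch uses only two (labelled Teff and log g) so that it fits in a few lines. The function names and all coordinates are made up for illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(point, hull):
    """True if `point` is inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return False  # degenerate hull: treat as covering nothing
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


# Made-up training stars as (Teff [K], log g) pairs, for illustration only.
training = [(4500, 2.0), (5000, 2.0), (5000, 3.0), (4500, 3.0), (4750, 2.5)]
hull = convex_hull(training)

# Made-up survey stars: only those inside the hull would receive a prediction.
candidates = [(4800, 2.4), (5500, 4.4)]
selected = [c for c in candidates if inside_hull(c, hull)]
print(selected)
```

Stars outside the hull lie in a region of parameter space that the training set does not cover, so a prediction for them would be an extrapolation; this is why the selection restricts the large sample to stars resembling the training stars.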
The distribution of age in the right ascension and declination plane is shown in the left panel of Fig. 7. The numbers and fractions of stars with declination beyond 20 and 30 degrees are given at the top: 110,071 (68%) and 82,739 (51%) respectively. The middle panel shows the density distribution in the longitude and latitude plane; the star counts and fractions for latitude beyond 20 and 30 degrees are 47,926 (29%) and 24,673 (15%) respectively. The right panel shows the R-Z plane in cylindrical Galactic coordinates, coloured by stellar number density; the numbers and fractions of stars with distance larger than 10 and 15 kpc are 78,424 (48%) and 2,694 (2%) respectively.
Fig. 8 shows the results of our method for the test datasets of the three groups. From the top left to the top right, the y-axis is the predicted mass, the absolute mass error and the relative error, and the x-axis is the true mass from asteroseismology. As shown in the figure, the dispersion of the predicted large-sample mass is 0.13 M⊙, the mean absolute error is 0.08 M⊙ and the median is 0.05 M⊙; the mean relative error is 6% and the median is 3%. Here the dispersion is the standard deviation of the predicted age or mass minus the true value in the catalog we used, the absolute error is the predicted value minus the true value, and the relative error is the predicted value minus the true value, divided by the true value. Throughout this work we use the median relative error as the final precision.
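The error statistics defined above can be written compactly in Python. The sketch below makes two assumptions that the text does not spell out: the mean and median errors are taken over the magnitudes of the differences, and the dispersion is the population standard deviation of the signed differences. The function name and the example masses are made up.

```python
from statistics import fmean, median, pstdev


def error_summary(true_values, predicted_values):
    """Summarise prediction errors with the statistics defined in the text."""
    diffs = [p - t for t, p in zip(true_values, predicted_values)]
    abs_errors = [abs(d) for d in diffs]  # assumed: magnitudes are averaged
    rel_errors = [abs(d) / t for d, t in zip(diffs, true_values)]
    return {
        "dispersion": pstdev(diffs),  # std of (predicted - true)
        "mean_abs": fmean(abs_errors),
        "median_abs": median(abs_errors),
        "mean_rel": fmean(rel_errors),
        "median_rel": median(rel_errors),  # quoted as the final precision
    }


# Made-up true and predicted masses in solar masses, for illustration only.
true_mass = [1.0, 2.0, 4.0]
pred_mass = [1.1, 1.8, 4.4]
summary = error_summary(true_mass, pred_mass)
print(summary)
```

The median is less sensitive than the mean to a few badly predicted stars, which is consistent with the median relative errors quoted here being smaller than the mean ones.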
Similarly, the middle of Fig. 8 is for the mass of red clump stars. As shown in the labels, the dispersion of the predicted RC mass is 0.14 M⊙, the mean absolute error is 0.09 M⊙ and the median value is 0.05 M⊙; the mean relative error is 6% and the median value is 4%.
Figure 3. The results of feature extraction using random forest. The three panels correspond to the three training samples we selected: the top one is for the large sample containing different types of stars, for which we only estimate mass, and the middle and bottom ones are for red clump stars, for which we estimate mass and age. The importance represents the contribution of each stellar parameter to our prediction model; it is a relative importance.
Fig. 8 shows that, for the mass prediction on the test dataset, the precision for RC stars (4%) is slightly worse than that for the large-sample stars (3%). The main reason is that the number of stars in the training samples is different: the larger the sample, the more effectively the machine learning method can learn the underlying relation. Moreover, the dispersion of the predicted RC age is 0.68 Gyr, the mean absolute error is 0.42 Gyr and the median value is 0.21 Gyr; the mean relative error is 11% and the median relative error is 7%. We speculate that the RC age precision could be improved with a higher-quality training catalog.
We then explore the relation between the predicted age and [C/N], as shown in Fig. 9. In the region where the age is less than or equal to 8 Gyr, the age and [C/N] show a good linear relationship, which is consistent with our expectation. In the region where the age is older than 8 Gyr, there is no obvious pattern, because RC stars tend to belong to the relatively younger population and the number of old stars in our sample is very small, so high-precision statistics are not possible.
Fig. 10 shows the comparison between the mass or age we predict and the reference values we use; the consistency supports the robustness of our method. We also compare our predicted ages with other works based on LAMOST, APOGEE and Gaia data, which provides independent verification of the method. The comparison results are shown in Fig. 11: the top left panel is the comparison for the common stars of APOGEE, the top right panel is for LAMOST, the bottom left panel is for Gaia (Sanders & Das 2018), and the last one is for the work of Ho et al. (2017). Although there are some differences, the consistency of the overall trend is acceptable. Similarly, the first four subfigures of Fig. 13 show the mass comparisons with other works. The left panels compare with Yu et al. (2018) and the right panels with Ho et al. (2017); the top panels are for LS-mass and the bottom panels for RC-mass. All agree well, with some reasonable differences.

More comparisons
Compared with the high-quality APOGEE data, the precision of the RC age in this work could reach 18% (top left of Fig. 11), and by matching with the high-precision Kepler asteroseismology data, the uncertainty of the RC mass could reach 9% (bottom left of Fig. 13). Meanwhile, the precision of LS-mass could be 13% (top left of Fig. 13). All these final precisions are based on the relative error analysis using the high-precision asteroseismology dataset. We acknowledge that systematics are not taken into account, so more work is needed in the future.
Moreover, we also compare open cluster (OC) ages using our final sample. The OC members are chosen by their spatial locations, kinematics (line-of-sight velocity, proper motions) and metallicity clustering. As shown in Fig. 12, the relative errors are NGC 6811: 9.1%, NGC 2420: 9.3%, NGC 6819: 23.4%, NGC 2682: 9.5%, NGC 6791: 2.7%, Be 17: 33.5%. The final median relative error is 9.5%, which strongly supports our final conclusions. Note that we use our final LAMOST RC catalog to select OC members and then compare with literature values. In our final RC catalog, the numbers of members for these open clusters are NGC 6811: 2, NGC 2420: 1, NGC 6819: 6, NGC 2682: 2, NGC 6791: 2, Be 17: 4.
We also explore the relationship between the RC-age relative error and the SNR (the signal-to-noise ratio of the spectra used for the LAMOST stellar parameter estimation). As shown in Fig. 14, the relative error tends to become stable as the SNR increases. The distributions of the relative errors of mass and age for our test dataset against the stellar parameters Teff, log g and [Fe/H] are displayed in Fig. 15, which shows the robustness of our method, with small dispersion.
As mentioned above, almost all parameters are correlated with age, so why do we choose only the first six to nine parameters for our method, and why do the other parameters shown in Fig. 3 not have high importance? The reason is related to the properties of the random forest method: when multiple features are correlated, the RF extracts the one with the greatest contribution, and the importance of the other features can then become artificially low (e.g., [Fe/H]).
As a test, we use the six most important stellar parameters to independently predict the other stellar parameters in the RC-age sample and check the predicted results. As shown in Fig. 17, the other stellar parameters can be predicted from the first six. Because the first six features are more or less related to the other features, the importance of the other features appears less significant in this analysis.
Conversely, we also randomly choose several other relevant stellar parameters to empirically predict the age and compare with our previous results, as shown in Fig. 18. We find that a similar precision can be reached even when other parameters are used for the prediction. These results show that our method for age and mass estimation is reasonable, and that many different parameters can be used to estimate age and mass for other catalogs even when some chemical abundances are lacking.
Figure 6. The distribution of predicted mass (age) in celestial sphere coordinates. The top panel is the mass distribution of the large sample, the middle panel is the mass distribution of red clump stars, and the bottom panel is the age distribution of red clump stars.

Comparisons for age prediction using different catalogs
In this paper, we choose the RC ages of APOKASC-2 as the training dataset because it is a high-resolution asteroseismology sample. To compare ages based on APOKASC-2 and APOGEE, we use these two catalogs separately to predict age. As shown in Fig. 19, the x-axis is the age trained on APOGEE and the y-axis is the age trained on APOKASC-2; the two training sets have different numbers of stars. We find that for older stars, the age predicted from APOKASC-2 is systematically higher than that predicted from APOGEE, possibly because of the different precision of the two datasets. Fig. 20 shows the relative error analysis for the two catalogs: the age based on APOGEE is systematically smaller than that based on APOKASC-2, and the difference becomes more obvious with increasing age. Overall, however, almost all of the differences are within 10%, which is acceptable and implies that the precision of the prediction depends on the quality of the training dataset.

Comparison of common stars between two different mass predictions during this work
We have predicted the masses of two samples: LS-mass with the convex hull algorithm and RC-mass without it. After cross-matching, we find 155,532 common stars, for which we compare the two slightly different mass predictions.
As can be seen from Fig. 21, the relative error is 8%, which shows that the difference between the masses predicted by the two methods is small and that the results are self-consistent.
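A comparison of this kind can be sketched in Python as follows. The sketch assumes that the two catalogs share a star identifier that can be used for the cross-match, that the summary statistic is the median, and that the second catalog is taken as the reference in the denominator. The function name, identifiers and masses are all made up for illustration.

```python
from statistics import median


def median_relative_difference(masses_a, masses_b):
    """Return (number of common stars, median |a - b| / b).

    `masses_a` and `masses_b` map a star identifier to a predicted mass;
    only identifiers present in both catalogues are compared.
    """
    common_ids = masses_a.keys() & masses_b.keys()
    if not common_ids:
        raise ValueError("no common stars")
    rel = [abs(masses_a[i] - masses_b[i]) / masses_b[i] for i in common_ids]
    return len(common_ids), median(rel)


# Made-up predictions in solar masses, for illustration only.
ls_mass = {"s1": 1.10, "s2": 1.50, "s3": 2.00, "s4": 0.95}
rc_mass = {"s2": 1.40, "s3": 2.10, "s4": 1.00, "s5": 1.30}

n_common, rel_diff = median_relative_difference(ls_mass, rc_mass)
print(n_common, rel_diff)
```

Because the two predictions come from models trained with different feature sets and selections, a small median difference for the common stars is a consistency check rather than an independent validation.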

Comparison of different machine learning methods
The machine learning methods used in this work each have their own characteristics, and none is absolutely better or worse than the others; the choice depends on the specific purpose. We choose RF because, after many attempts, we find that it is best in line with our expectations. This section presents a quantitative comparison of the six machine learning methods: Bayesian linear regression (hereafter BYS), gradient boosting decision tree (hereafter GBDT), multilayer perceptron (hereafter MLP), multiple linear regression (hereafter MLR), random forest (RF) and support vector regression (hereafter SVR). Fig. 22 shows the relation between the number of features used in the training model and the median relative error for the different methods, and Fig. 23 shows the age prediction of the different methods for the test dataset. Based on the values labeled in the panels of these two figures, BYS and MLR perform relatively worse: both have a higher median relative error of ∼28% and a larger dispersion of 0.97 Gyr. This might be because the relation we predict for RC stars is nonlinear, whereas BYS and MLR are linear models.
Among the nonlinear methods, MLP is difficult to tune in our experiments and its performance is hard to keep stable; its median relative error is 13% and its dispersion is 0.73 Gyr. The median relative error and dispersion of SVR are 14% and 0.74 Gyr. As seen in Fig. 22, the precision of GBDT is similar to that of RF, with a median relative error of 10% and a dispersion of 0.68 Gyr, but GBDT needs more features (10) than RF (6) before the median relative error becomes stable. To make our trained model applicable to more stars with fewer features, we choose RF for this work. More details on the six machine learning methods are presented in the Appendix.

CONCLUSIONS
In this paper, with the help of LAMOST, APOGEE and asteroseismology data, we use random forest to predict the masses of 948,216 large-sample stars and the masses and ages of 163,105 RC stars. We select stellar parameters that are highly correlated with mass and age to construct the training model, and then use these features, the convex hull algorithm and the random forest method to determine the ages and masses of the larger samples.
We find that, for the test dataset, the mass precision could reach 3% for the large-sample stars and 4% for RC stars, and the RC age precision could be 7% (shown in Fig. 8). Compared with other high-quality samples, the median relative error could reach 13% for the mass of the large-sample stars, 9% for the mass of RC stars and 18% for the age of RC stars. In general, our results compare well with recent works, in particular for open clusters, for which the median relative error is 9.5%, which implies that the method could be widely used in the future.
We also explore the performance of different machine learning methods for the first time, in particular for age. No method is absolutely better or worse than the others, and each has its own applications depending on the purpose. After the comparison, we find that the nonlinear models are more in line with our expectations than the linear models, and that GBDT and RF perform better. To make the model suitable for more stars, we choose RF, which needs fewer features to achieve our scientific goal in this work.
To some extent, this paper can be considered the first in a series of works, and the catalog will be shared online with the community. The method could be applied to other catalogs or surveys, and we will also attempt to account for systematics and possible zero-points for age in the future.
Figure 10. The comparison between the mass and age we predict and the reference values we use in this work. The number marked on the figure is the median relative error for our method. The sample consists of the stars in common between the LAMOST data we predict and APOKASC-2. The purpose here is method validation, and the precision is naturally quite good for this dataset since we use APOKASC-2 for training.
Figure 11. Comparison of our predicted ages with other works using LAMOST, APOGEE and Gaia data. The top left panel is the age of APOGEE data obtained with a different method, the top right panel is the age of LAMOST data, the bottom left panel is the age of Gaia data (Sanders & Das 2018), and the last one is the work of Ho et al. (2017). The median relative error is shown at the top left and the consistency is acceptable. We have fewer stars around 2 Gyr in the training dataset, so there are apparent disconnected features.
We would like to thank the anonymous referee for his/her very helpful and insightful comments. Thanks for the helpful comments from López-Corredoira Martín. HFW is supported by the CNRS-K.C.Wong Fellow in France and we acknowledge the science research grants from the China Manned Space Project with NO. CMS-CSST-2021-B03, CMS-CSST-2021-A08. HFW also acknowledges the support from the project "Complexity in self-gravitating systems" of the Enrico Fermi Research Center (Rome, Italy). L.Y.P is supported by the National Key ... H.F.W. is fighting for the plan "Mapping the Milky Way (Disk) Population Structures and Galactoseismology (MWDPSG) with large sky surveys" in order to establish a theoretical framework in the future to unify the global picture of the disk structures and origins with a possible comprehensive distribution function. We pay our respects to elders, colleagues and others for comments and suggestions, thanks to all of them.
The Guo Shou Jing Telescope (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope, LAMOST) is a National Major Scientific Project built by the Chinese Academy of Sciences. Funding for the project has been provided by the National Development and Reform Commission. LAMOST is operated and managed by National Astronomical Observatories, Chinese Academy of Sciences. This work has also made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement.
Figure 19. Comparison of age predictions using two different age catalogs. The x-axis is the age trained on APOGEE and the y-axis is the age trained on APOKASC-2 in our work; the figure is coloured by star counts.
Figure 20. The relative age error of APOGEE and APOKASC-2 as a function of age for the common stars. We use the two catalogs to make predictions and find that the older the star, the more obvious the difference. The error bars show Poisson noise.