A Classification Catalog of Periodic Variable Stars for LAMOST DR9 Based on Machine Learning

Identifying and classifying variable stars is essential to time-domain astronomy. The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) has acquired a large amount of spectral data. However, the data contain no corresponding variable-source information, which constrains the utilization of LAMOST data for scientific research. In this study, we systematically investigated variable-source classification methods for LAMOST data. We constructed a 10-class classification model using three mainstream machine-learning methods and, through performance comparison, chose the LightGBM and XGBoost models. We further identified variable-source candidates in the r band in LAMOST DR9 and obtained 281,514 candidates with probabilities greater than 95%. Subsequently, we selected the periodic variable sources using the generalized Lomb–Scargle periodogram and classified them with the classification model. Finally, we present a reliable periodic variable star catalog containing 176,337 stars with specific types.


Introduction
The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) is a special quasi-meridian reflecting Schmidt telescope (Cui et al. 2012; Zhao et al. 2012) with an effective aperture of 3.6–4.9 m and a field of view of about 5°, designed to collect 4000 spectra in a single exposure (spectral resolution R = 1800, limiting magnitude as faint as r = 19 mag, wavelength coverage 370–900 nm). Since LAMOST entered routine observations, more than 19.46 million spectra have been released. The LAMOST DR9 data set v1.0 was released in 2023 February. This data release comprises two components: low-resolution spectral data and medium-resolution spectral data. Specifically, it encompasses 11.21 million low-resolution spectra and 8.25 million medium-resolution spectra.
However, the current LAMOST catalog lacks variable star information, which constrains the use of LAMOST data for studying variable stars. In practice, variable stars are an effective tool in astrophysics and a primary scientific driver in many research areas. Variable stars have been essential for establishing distance indicators (Richards et al. 2011; Riess et al. 2018; Pietrzyński et al. 2019), identifying dwarf galaxies (Izotov & Thuan 2009), understanding galaxy formation history (Genovali et al. 2015), and probing stellar structure and stellar evolution theory (Walkowicz et al. 2009).
Several studies have been conducted using the LAMOST catalog to investigate variable stars. In some of these studies, variable sources are initially identified and then crossmatched with published catalogs to determine their specific types. For instance, Tian et al. (2020) presented a LAMOST radial velocity variable star catalog consisting of 80,702 radial velocity variable stars with a probability greater than 60%; through crossmatching with other catalog data, 3138 sources were classified. Xu et al. (2022) built a catalog that includes 631,769 LAMOST variable star candidates with a probability greater than 95%; by crossmatching other catalogs, 85,669 variable source types were identified. The Tsinghua University-Ma Huateng Telescopes for Survey (TMTS) began monitoring the LAMOST sky region in 2020 and produced about 10 million light curves in 2 yr. Lin et al. (2022) analyzed these light curves and combined the color-magnitude diagram, Gaia parameters, and International Variable Star Index (Watson et al. 2006) types to classify the variable sources, determining the types of about 1100 short-period variable stars. Although variable source types obtained by crossmatching are credible, many identified variable sources have not been classified accurately due to the limited coverage of the reference catalogs. Several studies have also focused on identifying a particular type of variable source. Zhang et al. (2023) constructed a self-paced ensemble unbalanced classifier to recognize young stellar objects (YSOs) from the LAMOST and Zwicky Transient Facility (ZTF) databases, finally obtaining 8210 YSO candidates. Jia et al.
(2023) employed machine-learning techniques to discover potential symbiotic star candidates. Three machine-learning algorithms (XGBoost, LightGBM, and decision tree) were applied to a data set containing 198 confirmed symbiotic stars. Subsequently, these models were used to analyze data from the LAMOST survey, identifying 11,709 potential symbiotic star candidates. Identifying specific classes can satisfy scientists' needs, but much variable source information remains untapped in LAMOST.
In recent years, machine-learning methods have been applied to classify variable sources and have shown significant advantages (Belokurov et al. 2003; Dubath et al. 2011; Jayasinghe et al. 2018; Narayan et al. 2018; Hosenie et al. 2019). Mahabal et al. (2017) devised a convolutional neural network classifier using labeled data sets of periodic variables obtained from CRTS. They transformed each light curve (time series) into a two-dimensional mapping (dm-dt) based on the changes in magnitude (dm) over the time difference (dt); their algorithm achieved an accuracy of ∼83% in multiclass classification. Kim & Bailer-Jones (2016) developed the UPSILoN software package, utilizing 16 features extracted from light curves and employing the random forest algorithm to identify periodic variable sources with reasonable results. van Roestel et al. (2021) computed 37 different features from ZTF light curves and utilized the XGBoost algorithm for multiclass classification of variable stars, achieving improved accuracy in variable star classification.
Overall, the classification of variable sources for LAMOST still needs to be explored. In light of previous research methods, we identify and classify variable sources in LAMOST DR9 to support future studies of variable sources in LAMOST. In the rest of the paper, we discuss data preparation in Section 2, which includes preparing the variable star classification sample set and extracting relevant features. In Section 3, we present the construction and evaluation of our classifier models. In Section 4, we detail the process of periodic variable source identification, giving a catalog of periodic variable source candidates for LAMOST in the r band; we then apply the optimal model obtained in Section 3 to predict the type of each variable source. Finally, we present the conclusions in Section 5.

Data Preparation
We performed data preparation in two respects to classify the variable sources in the LAMOST catalog. One is selecting plausible variable source features for classification, and the other is constructing a labeled sample set for training the variable source classification model.

Classification Feature Selection
Determining suitable variable star classification features is key to further exploiting machine learning. In this study, with reference to previous work, we selected 30 features of two kinds.
Note. Features 1-20 are calculated from the light curves; features 21-30 are taken from the relevant catalogs.
These features are intrinsic statistical properties related to variable stars' scale, morphology, period, and other properties.
They are calculated from a light curve's three vectors: time, magnitude, and magnitude error. These features are highly explainable and robust against bias (Nun et al. 2015; Cabral et al. 2018), and they have been proven to work well for identifying and classifying variable stars through a machine-learning approach (van Roestel et al. 2021).
The period is a very significant property of periodic variable stars. We employed the method described in Jayasinghe et al. (2018, 2019a) to measure the periods of variable sources. The period search range was from 0.025 to 1000 days, with a frequency step of 0.001, utilizing three period-finding methods: the generalized Lomb-Scargle periodogram (GLS; Lomb 1976; Scargle 1982), the box least squares (Kovács et al. 2002), and the Multi-Harmonic Analysis Of Variance (MHAOV; Schwarzenberg-Czerny 1996). We initialized the MHAOV periodogram with Nharm = 5 harmonic terms, and the remaining parameters were consistent with Jayasinghe et al. (2018). Each method provided five candidate periods.
Aliasing occurs frequently in period calculations, for example at multiples of 1 day, and can significantly impact the final predictions. An effective way to identify potentially erroneous periods is to inspect the window function (Dawson & Fabrycky 2010; VanderPlas 2018). We utilized the Lomb-Scargle window function (Astropy Collaboration et al. 2013), which uses a light curve with the same time sampling as the data but sets all magnitude measurements to a constant value. We saved the 10 strongest peaks in this window function and considered them potential aliases for the light curve. The 15 candidate periods were compared with the aliases obtained from the associated window function, and those within 10^-2 of an alias were removed. The remaining periods were passed to phase dispersion minimization (PDM; Stellingwerf 1978) to determine the optimal period. While some aliased periods may remain, the classification of periodic variable sources stays accurate because other features, such as those related to variability amplitude and timescale, compensate.
Considering that extinction and reddening by interstellar dust grains affect the observed brightness of variable stars (Green et al. 2018, 2019), we corrected for them using the three-dimensional Bayestar19 reddening map (Green et al. 2019) to obtain the intrinsic brightness and color of the observed objects. Combined with the features extracted from the light curves, this promises to improve variable star classification accuracy.
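As a minimal illustration of the dereddening step: once a line-of-sight reddening E has been queried from the Bayestar19 map (e.g., via the dustmaps package), the correction reduces to subtracting a band-dependent coefficient times E. The coefficients below are placeholders for illustration; the actual values must be taken from Green et al. (2019).

```python
# Hypothetical extinction coefficients (mag per unit Bayestar19 reddening).
# These numbers are placeholders -- consult Green et al. (2019) for real ones.
R_BAND = {"g": 3.518, "r": 2.617, "i": 1.971}


def deredden(mag, ebv, band="r"):
    """Intrinsic magnitude: m0 = m - R_band * E, with E from Bayestar19."""
    return mag - R_BAND[band] * ebv
```

The same subtraction applied per band also yields dereddened colors.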

Sample Set Preparation
We built a sample set for variable source classification by crossmatching. Figure 1 shows the flow of the entire sample set preparation process. To ensure the accuracy of the sample labels for the machine-learning model, we selected two variable catalogs: Chen et al. (2020) and the ASAS-SN variable star database. We crossmatched these two catalogs to obtain variable source labels, then crossmatched the result with ZTF DR11 to obtain light-curve data, both with a matching radius of 1.5″. Chen et al. (2020) presented a periodic variable catalog for ZTF DR2 including 781,602 variables, whose density distribution is uniform in Galactic coordinates. The 781,602 variable stars were classified into 11 classes by density-based spatial clustering of applications with noise (DBSCAN). Based on crossmatching with other variable source catalogs, the misclassification probability was only 2%.
ASAS-SN is an optical survey that monitors the entire sky and has published a series of variable source catalogs based on its photometric data (Jayasinghe et al. 2018, 2019a, 2019b, 2020a, 2020b; Pawlak et al. 2019). The ASAS-SN variable star database contains approximately 688,000 clearly labeled variable stars. We obtained labels by crossmatching the ASAS-SN variable star database with the Chen et al. (2020) catalog, using a matching radius of 1.5″, and selected the sources with consistent labels as the label source for our sample set.
Considering that ASAS-SN and Chen et al. (2020) use different naming schemes, we adopted a matching scheme to convert their variable source types. Specifically, the W Virginis, BL Herculis, fundamental-mode Cepheid, and first-overtone Cepheid types in ASAS-SN, as well as the Cepheid (CEP) and CEPII types in Chen et al. (2020), were treated as a single Cepheid class. After obtaining reliable labels for the variable sources, we crossmatched them with ZTF DR11 to obtain the light curves. With its wide field of view and faint limiting magnitude, ZTF provides a robust database for identifying and classifying variable sources. To improve the reliability of our results, we imposed the following quality criteria on the light curves: (1) Select sources with at least 50 observations; having only a small number of light-curve observations introduces significant uncertainty and hinders comprehensive follow-up research.
(2) Reject the data marked as unusable. We eliminated the epochs flagged as unreliable in the light curves to ensure the reliability of the final results; specifically, we required INFOBITS < 33554432 and catflags < 32768.
(3) Remove data points deviating from the mean by more than 3σ, where σ is the standard deviation of the light curve, to eliminate occasional fluctuations due to inaccurate photometry.
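The three quality cuts can be combined into a single filter. The column names (mag, catflags, infobits) are assumed ZTF-like, and the flag thresholds follow our reading of the text, so treat this as a sketch rather than the authors' pipeline.

```python
import numpy as np
import pandas as pd


def clean_light_curve(lc):
    """Apply the three light-curve quality cuts described in the text.

    Returns the cleaned DataFrame, or None if fewer than 50 epochs survive.
    """
    # (2) reject epochs flagged as unusable; the exact masks
    #     (INFOBITS < 33554432, catflags < 32768) are assumptions
    lc = lc[(lc["catflags"] < 32768) & (lc["infobits"] < 33554432)]
    # (3) 3-sigma clip around the mean magnitude
    dev = np.abs(lc["mag"] - lc["mag"].mean())
    lc = lc[dev <= 3.0 * lc["mag"].std()]
    # (1) require at least 50 surviving observations
    return lc if len(lc) >= 50 else None
```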
After crossmatching with Gaia DR3, ALLWISE, and 2MASS, we obtained 44,838 variable sources with reliable labels. During the crossmatching process, we noticed that some sources were missing key information, such as parallax. To compensate, we adopted a unified median-filling strategy to ensure the completeness and consistency of the data.
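A minimal version of the median-filling strategy, assuming the features are gathered in a pandas DataFrame:

```python
import pandas as pd


def fill_missing(features: pd.DataFrame) -> pd.DataFrame:
    """Replace each missing value with the median of its numeric column."""
    return features.fillna(features.median(numeric_only=True))
```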
As shown in Figure 2, the histogram of the distribution of variable source types reveals an apparent imbalance between classes. To simplify the experimental process, we used random undersampling to reduce the size of the EW type, which has the largest number of samples, to 8867 samples (50% of the EW samples were randomly excluded). After downsampling EW, our data set contains 35,972 samples, which we randomly divided into a training set (70%) and a test set (30%). Because of the remaining imbalance, we also tried the SMOTE oversampling technique (Chawla et al. 2002). Table 2 shows the number of samples in the training and test sets in the original data set and in the training set after SMOTE oversampling.
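The EW undersampling step can be sketched as follows; the class label and the 50% fraction follow the text, while the function name and seed are illustrative. A 70/30 split (e.g., scikit-learn's train_test_split with test_size=0.3) would then be applied to the result.

```python
import numpy as np


def undersample_class(X, y, label, frac=0.5, seed=42):
    """Randomly drop a fraction of one over-represented class (here: EW)."""
    rng = np.random.default_rng(seed)
    idx = np.where(y == label)[0]
    drop = rng.choice(idx, size=int(len(idx) * frac), replace=False)
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return X[keep], y[keep]
```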

Classification Models
Based on the features defined and the sample set in the previous section, we used machine-learning approaches to build classification models for turning the feature information into a probabilistic statement about the class of variable stars.

Modeling
We applied three machine-learning algorithms: random forest (RF), XGBoost, and LightGBM. RF and XGBoost have been proven to have good performance and robustness in variable source classification (Richards et al. 2011; Hosenie et al. 2019; Becker et al. 2020; van Roestel et al. 2021). LightGBM is a machine-learning algorithm based on gradient-boosted decision trees proposed by Microsoft in 2016 (Ke et al. 2017). LightGBM adopts a more efficient decision tree learning method than the traditional gradient-boosting algorithm: it uses a histogram-based algorithm to select splitting points, significantly improving training efficiency and making it suitable for large-scale data sets. LightGBM also performs well with sparse data and a large number of categorical features, and it is gradually being applied in astronomy. For example, Hu et al. (2021) used LightGBM to discover 225 cataclysmic variable star candidates in LAMOST DR7, of which four are newly discovered, and Ribeiro & Gradvohl (2021) utilized LightGBM to classify solar flares automatically.

Hyperparameter Tuning
We used hyperparameter optimization to achieve optimal performance. Hyperparameter optimization is the process of finding the most appropriate combination of hyperparameter values for a model, and it directly impacts the prediction accuracy of machine-learning algorithms. Common methods include grid search, random search, and Bayesian optimization. In our study, we applied a random search approach, which automatically iterates through different hyperparameter configurations until a specified number of iterations or another stopping condition is met, aiming to find the hyperparameter values that maximize the balanced accuracy of the output classifier.
To evaluate model performance, we combined the random search with fivefold cross validation on the training set. This allows us to assess the model's performance and ensure its robustness. Table 3 lists the tunable hyperparameter ranges for the three models.
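A randomized search with fivefold cross validation of the kind described can be set up with scikit-learn. The search space below is a small illustrative stand-in for the ranges in Table 3, and RF stands in for all three models:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space; the actual ranges are those listed in Table 3.
param_dist = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,                     # stop after a fixed number of random draws
    cv=5,                         # fivefold cross validation
    scoring="balanced_accuracy",  # maximize balanced accuracy
    random_state=0,
)
```

Calling `search.fit(X_train, y_train)` then exposes `best_params_` and `best_score_`.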

Model Performance
Machine-learning models are usually evaluated by accuracy, precision, recall, and F1 score (Forman 2003). After training and testing the RF, LightGBM, and XGBoost models, we calculated these performance metrics and performed the same evaluation on the SMOTE-augmented models. Table 4 shows the performance of the three models using the original data and after oversampling with SMOTE.
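The four metrics can be computed per model with scikit-learn; macro averaging over the 10 classes is one plausible choice here (the text does not state the averaging scheme used):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 (macro-averaged)."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": p, "recall": r, "f1": f1}
```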
Table 4 shows that SMOTE oversampling improves the recall but decreases the precision. Because we pursue high precision in the final classification results, we chose the models built on the original data as the optimal models. Table 4 also shows that both LightGBM and XGBoost consistently perform excellently in variable star classification, achieving accuracy and F1 scores of up to 94%. In comparison, RF does not perform as strongly as LightGBM and XGBoost.
Since RF does not perform as well as LightGBM and XGBoost, we only show the confusion matrices and receiver operating characteristic (ROC) curves for the latter two models, as shown in Figure 3.
As can be seen from the confusion matrices, our models can distinguish most of the subclasses, such as DSCT, Mira, and RRAB. However, there is some confusion between EB and EA/EW, all of which are eclipsing binaries; their similar physical properties make them challenging to distinguish. In addition, there is a small amount of confusion between RRC and EW/RRAB.

Periodic Variable Source Identification and Classification in LAMOST DR9
After building the classification models, we further classified LAMOST DR9 sources. The classification process has two steps: (1) identifying periodic variable sources from LAMOST DR9; (2) classifying the periodic variable sources.

Identification of Periodic Variable Stars
According to the official LAMOST documentation, observations with a signal-to-noise ratio (S/N) less than 15 have large uncertainties in the radial velocity measurements and in the estimated stellar atmospheric parameters. Therefore, to ensure the usability of the LAMOST data, we required S/N ≥ 20. We followed the approach presented by Xu et al. (2022) to identify the variable stars with S/N ≥ 20 in LAMOST DR9: (1) Modeling variability parameters using light curves from both variable and nonvariable sources. Different from Xu et al. (2022), our variable source labels were derived from the ASAS-SN variable star database.
We selected all variable sources from ASAS-SN with a classification probability greater than 95%. After crossmatching with ZTF DR11, we obtained light curves for 123,527 variable sources in the r band (≥50 observations). The nonvariable source labels were derived from the standard stars in SDSS (Ivezić et al. 2007); we also crossmatched them with ZTF DR11 to obtain light-curve data, yielding 901,710 co-observed sources. To balance the numbers of variable and nonvariable sources, we randomly selected 123,527 of these light curves.
(2) Obtaining the optimal model for identifying variability parameters through rigorous testing and evaluation.
(3) Applying the optimal model to identify LAMOST variable candidates and assessing correctness through cross validation of the catalog data.
As a result, we identified 281,514 variable sources in the r band with a confidence level exceeding 95%. The corresponding crossmatch results are presented in Table 5.
However, the coincidence rate between our results and the variable source catalog of Xu et al. (2022) is relatively low. This difference mainly comes from the different data sets: the light curves used in Xu et al. (2022) were retrieved from ZTF DR2, while ours originate from ZTF DR11, a more extensive data set with more observation points. Furthermore, the labeled data set of Xu et al. (2022) came from Kepler and comprises 3752 variable sources of three types: rotating variables, pulsating variables, and eclipsing binaries. In contrast, we used all variable source classes in ASAS-SN, about 120,000 variable sources, to build the variable source identification model. Differences in the distributions of the data sets and variable source types inevitably impact the results. Overall, the results of our model in identifying variable sources are still credible.
After obtaining the variable source candidates, we searched all 281,514 of them for periodicity using the Lomb-Scargle periodogram and selected periodic variable star candidates with a false-alarm probability < 0.001, which quantifies the confidence of the periodicity determination. This process yielded 198,548 periodic variable star candidates. The eliminated candidates were recorded as suspected variables; they can be better classified with enhanced data in the future.

Classification Result
We applied the LightGBM and XGBoost models to classify the 198,548 periodic variable candidates. Given the excellent performance of both models, and to further improve the purity of the predictions, we adopted the model label as the final classification only when the predictions of the two models agreed. On this basis, we generated a variable source classification catalog for LAMOST DR9 with 176,337 variable sources.
Variable sources with inconsistent classifications are recorded as suspected variables and are not added to the catalog. Table 6 provides the classifications of variable sources predicted by the models. The histogram of predicted variable source types is shown in Figure 4.
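The agreement rule can be expressed compactly; "SUSPECTED" is an illustrative placeholder label, not the authors' naming:

```python
import numpy as np


def consensus_labels(pred_a, pred_b):
    """Keep a label only where both classifiers agree; else mark it suspected."""
    pred_a, pred_b = np.asarray(pred_a), np.asarray(pred_b)
    agree = pred_a == pred_b
    labels = np.where(agree, pred_a, "SUSPECTED")
    return labels, agree
```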
As shown in Figure 4, most variable source candidates are classified as ROT, followed by 55,764 EA variables; CEP variables are the least numerous, numbering only 56. In the model evaluation, we did not use an additional validation set, mainly because of the small sample sizes of some variable star types such as CEP and DSCT. To verify the accuracy of the classification results, we crossmatched our catalog with publicly available variable source catalogs; the results are shown in Table 7.
Table 7 shows that the coincidence rate of most classes in our catalog reaches 90%. We also notice that some types have relatively low coincidence rates, e.g., a purity of 90% for RRC, because the more symmetric RRC light curve can easily be mistaken for that of an EW eclipsing binary. The purity of the eclipsing binaries is also not that high, mainly due to the inherent difficulty of distinguishing the three eclipsing binary types. The ROT coincidence rate is 90.60%; subsequent analysis shows that ROT is mainly confused with eclipsing binaries. Further examination of the crossmatch with Chen et al. (2020) shows that the differences mainly occur between RS CVn variables and eclipsing binaries, while BY Dra variables perform well; this suggests that we need to find more useful features to improve these results in the future.

Feature Importance in the Classification Model
To understand which features are most important for the models' performance and decision-making, we obtained the feature importance rankings of LightGBM and XGBoost, shown in Figure 5. Period is the most significant feature in both models. Parallax, BP − RP, and Gskew also rank high in both models, meaning they play a key role in classification. Although the two algorithms differ in the specific ordering of feature importance, their evaluations are highly consistent: in predicting the 198,548 variable source candidates in LAMOST DR9, the two models gave the same results for 176,337 sources.
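Both LightGBM and XGBoost expose feature importances through the scikit-learn wrapper attribute `feature_importances_`; the sketch below uses scikit-learn's RandomForest as a dependency-free stand-in on a toy data set, since the attribute works the same way:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in: sklearn's RandomForest exposes the same feature_importances_
# attribute as the fitted LightGBM/XGBoost sklearn wrappers.
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]  # most important first
```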
It should be noted that parallax is a significant feature for periodic variable source classification, largely owing to its strong correlation with other core parameters. This is consistent with Chen et al. (2020), who showed that parallax correlates with luminosity and is a key parameter in variable source classification.

Hyperparameter Tuning on Classification Performance
We systematically tuned the hyperparameters in this study. After a series of experiments, we found that hyperparameter tuning brought about a 1%-2% performance improvement for some specific types, such as EA and ROT.
Although it improves the classification accuracy of the model to some extent, the overall gain is not significant. This may be due to inherent challenges in the data set itself, such as the imbalanced distribution of types, which make it difficult for the model to improve through simple hyperparameter tuning alone.

Impact of an Imbalanced Data Set for Classification Performance
The imbalance of the data sample is a critical issue in this study. After thorough research and experimental validation, we evaluated the impact of the SMOTE technique on classification performance. As an oversampling technique, SMOTE addresses class imbalance by synthesizing additional minority-class samples. In our study, however, while SMOTE noticeably improved the recall for categories such as CEP and EA, it also decreased the precision, and the overall improvement in classification performance was not significant.
To investigate further, we used the SMOTE-based model to predict the types of the LAMOST variable source candidates and compared the results with published catalogs. The results of the SMOTE model were not superior to those of the model trained on the original data. This finding further supports our view that SMOTE has limitations when dealing with the class imbalance issues here.
We speculate that SMOTE's synthetic samples may not accurately represent the distribution of the original data set. In addition, SMOTE's focus on minority-class samples may lead to the neglect of other relevant information, resulting in suboptimal performance on the test set. Therefore, when deciding whether to use SMOTE, it is crucial to consider its impact on both recall and precision, as well as the potential for overfitting. In future studies, we aim to explore more effective data augmentation and class-balancing strategies, including advanced synthetic sample generation methods, combinations of different sampling techniques, or more sophisticated model architectures.

Conclusions and Future Works
The LAMOST survey has accumulated a large amount of spectral data, but the study of variable sources has been limited by the lack of photometric information. Building on previous statistical analysis methods, we identified candidate variable sources in the r band in LAMOST DR9 and obtained 281,514 candidates with probabilities greater than 95%. We then classified the periodic variable sources using the XGBoost and LightGBM models, both of which showed high performance in our evaluation. Through thorough analysis and evaluation, we finally constructed a catalog of 176,337 variable sources, classified as DSCT, ROT, EW, EB, EA, RRAB, RRC, Cepheids, SR, and Miras. This variable source catalog provides a valuable resource for subsequent data analysis.
Our method mines more variable source information from LAMOST than previous studies. Machine-learning methods greatly reduce the required human effort, providing a more direct and efficient way to classify the large amounts of variable source data that future sky survey projects will produce.
However, we must also recognize the limitations of this research. Based on the classification results on the test data, our model needs improvement in classifying eclipsing binaries. Distinguishing the three subclasses of eclipsing binaries requires specialized methods, which we will consider in future work.
The subclassification of variable stars is complex and varied. We will continue to classify more subclasses of variable stars based on data from multiple surveys such as LAMOST, ZTF, and 2MASS. It is also worth noting that we mainly subclassified the periodic sources and have yet to study the nonperiodic sources in depth; this part of the data remains valuable. In future work, we aim to explore and develop more comprehensive models for identifying and classifying the various classes of variable sources.

Figure 1 .
Figure 1. The data-flow diagram of the sample set preparation.

Figure 2 .
Figure 2. The histogram of the number of different variable source types in the initial sample set.

Figure 3 .
Figure 3. The confusion matrices and ROC curves of LightGBM and XGBoost.

Figure 4 .
Figure 4. The histogram of predicted variable source types.

Figure 5 .
Figure 5. The feature importance of LightGBM and XGBoost.

Table 2
Number of Training and Test Samples in the Original Data Set and after Oversampling Using SMOTE

(ZTF DR11 light curves: https://irsa.ipac.caltech.edu/data/ZTF/lc/lc_dr11/)

Table 3
Parameters of the RF, XGBoost, and LightGBM Models

Table 4
The Model Performance Metrics for 10-class Variable Stars Based on LightGBM, XGBoost, and RF with Original Data and SMOTE

Table 5
The Cross Identification between Our Catalog and Published Catalogs in the r Band

Table 7
Variable Purity Comparison between Our Catalog and Other Catalogs

Table 6
The Classification Results of Variable Stars for LAMOST DR9 Based on LightGBM and XGBoost (This table is available in its entirety in machine-readable form.)