
A Robust Identification Method for Hot Subdwarfs Based on Deep Learning


Published 2022 February 17 © 2022. The Author(s). Published by the American Astronomical Society.
Citation: Lei Tan et al 2022 ApJS 259 5. DOI: 10.3847/1538-4365/ac4de8


Abstract

Hot subdwarf stars are a particular type of star that is crucial for studying binary evolution and atmospheric diffusion processes. In recent years, identifying hot subdwarfs with machine-learning methods has become a popular topic, but limitations in automation and accuracy remain. In this paper, we propose a robust identification method based on a convolutional neural network. We first constructed the data set using the spectral data of LAMOST DR7-V1. We then constructed a hybrid recognition model comprising an eight-class classification model and a binary classification model. The model achieved an accuracy of 96.17% on the testing set. To further validate the accuracy of the model, we selected 835 hot subdwarfs that were not involved in the training process from the identified LAMOST catalog (2428 spectra, including repeated observations) as the validation set, on which an accuracy of 96.05% was achieved. On this basis, we used the model to filter and classify all 10,640,255 spectra of LAMOST DR7-V1 and obtained a catalog of 2393 hot subdwarf candidates, of which 2067 have been confirmed. We found 25 new hot subdwarfs among the remaining candidates by manual validation. The overall accuracy of the model is 87.42%. Overall, the model presented in this study can effectively identify specific spectra with robust results and high accuracy, and can be further applied to the classification of large-scale spectra and the search for specific targets.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Hot subdwarfs are core helium-burning stars located below the upper main sequence of the Hertzsprung–Russell diagram and are referred to as extreme horizontal branch stars because of their evolutionary stage. In general, hot subdwarfs are classified into three types by their spectra: subdwarf B (sdB), subdwarf O (sdO), and subdwarf OB (sdOB).

The study of hot subdwarfs has significant scientific value. O'Connell (1999) and Han et al. (2007) found that hot subdwarfs are the primary sources of the ultraviolet (UV) upturn phenomena found in elliptical galaxies. They are also critical for studies of close binary interactions, because most sdB stars are formed in binary systems. Geier et al. (2007) found that hot subdwarf binaries with massive white dwarf companions are believed to be progenitors of type Ia supernovae. The peculiar atmospheres of hot subdwarfs can be used to study gravitational settling and radiative levitation (O'Toole & Heber 2006; Geier 2013). Hot subdwarfs with pulsations can be used for asteroseismic analyses (Charpinet et al. 2011). Lei et al. (2015, 2016) used hot subdwarfs to study the extended horizontal branch morphology of globular clusters.

Driven by scientific research, searching for and identifying hot subdwarfs and constructing hot subdwarf catalogs have become popular topics in hot subdwarf research. Traditional search methods are mainly based on the basic observational characteristics of hot subdwarfs. For example, Vennes et al. (2011) identified 48 hot subdwarfs based on UV photometry from the Galaxy Evolution Explorer survey. Luo et al. (2016) identified 166 hot subdwarfs using photometric information. Geier et al. (2019) used color, absolute magnitude, and reduced-proper-motion cuts to compile a candidate catalog of 39,800 hot subdwarfs. However, these identification methods rely mainly on manual processing, which is laborious and cannot meet the demands of handling large-scale spectral data.

The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST; Cui et al. 2012; Deng et al. 2012; Zhao et al. 2012) is a Schmidt telescope with an effective aperture of 3.6−4.9 m and a field of view of about 5°. Being a survey telescope, LAMOST was designed to collect 4000 spectra in a single exposure (spectral resolution R ∼ 1800, limiting magnitude r ∼ 19 mag, wavelength coverage 3700–9000 Å). The LAMOST DR7-V1 catalog contains 10,640,255 calibrated spectra, comprising 9,881,260 star spectra, 198,393 galaxy spectra, 66,406 QSO spectra, and 494,196 spectra of an unknown type.

Identifying hot subdwarfs from the LAMOST catalog has essential research value because LAMOST can reveal the spectral characteristics of hot subdwarfs that show details of their formation and evolution. However, identifying hot subdwarfs from the LAMOST data is a challenging task. The critical reason is that LAMOST does not have homogeneous color data (Luo et al. 2015), and therefore traditional methods such as a color cut (Bu et al. 2019) cannot be used to search for hot subdwarfs in its catalog. Another reason is the massive amount of catalog data. Scientists have experimented with other methods to identify hot subdwarfs from the LAMOST catalog. Luo et al. (2021) identified 1587 hot subdwarfs using the spectra released by LAMOST and the catalog of hot subdwarf candidates released by Gaia DR2. With the development of machine-learning techniques, especially deep learning, Bu et al. (2017) used machine-learning methods to search for hot subdwarf candidates in LAMOST DR1 and obtained 10,000 candidates, and Lei et al. (2019) further identified 56 hot subdwarfs from these candidates. Bu et al. (2019) applied a convolutional neural network (CNN) to construct a binary classification model for hot subdwarfs, searched in LAMOST DR4, and achieved an F1 value (defined in Section 3.3) of 76.98%.

Overall, these artificial-intelligence-based identification methods have achieved remarkable results. Identifying hot subdwarfs based on deep learning can significantly reduce the difficulty of manual identification and obtain credible results. It can be considered one of the most effective methods to search for a specific target in a large amount of catalog data.

However, current research on deep-learning-based recognition still has some deficiencies, mainly concerning the imbalance of data samples. In the model of Bu et al. (2019), 510 hot subdwarfs and 5458 objects of other classes (stars, galaxies, unknown objects, and so on) were used for the binary classification model. From a practical point of view, this ratio is grossly unbalanced. Studies have shown that unbalanced class sizes in a data set can affect the reliability and accuracy of the classification results, and an insufficient sample size can lead to overfitting of the model (Zhan et al. 2018; Zhu et al. 2018). At the same time, the model will learn incomplete features when the sample size is insufficient (Du et al. 2016).

In light of the previous literature, a hybrid model comprising an eight-class classification model and a binary classification model is proposed and applied to search for hot subdwarfs in the LAMOST catalog. We introduce the data and preprocessing in Section 2, including data preparation and data augmentation. In Section 3, we describe the construction process of our classification model and give a detailed performance evaluation. In Section 4, the model is applied to classify the entire LAMOST spectral data set and obtain hot subdwarf candidates, which are then checked by manual validation. The ability of our model is further discussed in Section 5. In Section 6, we summarize the work and propose future directions.

2. Data and Preprocessing

2.1. Sample Set

Based on the LAMOST DR7-V1 catalog, we created a sample set by extracting spectra of all classes (O, hot subdwarf, white dwarf, B, A, F, G, K, and M) to meet the labeling requirements of deep-learning techniques. Following the sample selection criteria of Bu et al. (2019), we selected samples with a signal-to-noise ratio (S/N) greater than 10 to construct the sample set. With this cut there were sufficient samples of hot subdwarf, white dwarf, B, A, F, G, K, and M stars, but only 72 O-type stars.

2.2. Data Augmentation

To ensure the balance of the sample, we have to augment the O-type stars according to the characteristics of the existing 72 samples. The spectrum of a star is approximately a blackbody radiation spectrum accompanied by emission and absorption lines. That is to say, a generated spectrum must satisfy both aspects: the blackbody continuum and the emission/absorption lines.

1. Blackbody radiation spectrum fitting

The blackbody radiation spectrum of a star is described by the Planck formula (Equation (1), a relationship between the flux, the wavelength λ, and the temperature t). To further compensate for the effects of relative flux correction and the sky background, we modify the formula to Equation (2), where a compensates for the effects of relative flux correction and b compensates for the sky background.

Equation (1): ${\rm{flux}}(\lambda ,t)=\displaystyle \frac{8\pi {hc}}{{\lambda }^{5}}\cdot \displaystyle \frac{1}{{e}^{\tfrac{{hc}}{\lambda {kt}}}-1}$

Equation (2): ${\rm{flux}}(\lambda ,t,a,b)=a\cdot \displaystyle \frac{8\pi {hc}}{{\lambda }^{5}}\cdot \displaystyle \frac{1}{{e}^{\tfrac{{hc}}{\lambda {kt}}}-1}+b$

In order to fit the parameters t, a, and b in Equation (2), in practice we obtained 3700 flux values at wavelengths from 3700–8672 Å by truncating each O-type star spectrum. The emission and absorption lines were removed by median filtering, and the corresponding blackbody radiation spectrum was obtained. The fitting process can be realized with the scipy.optimize.curve_fit function in the Python SciPy package. For each known O-type star spectrum, optimal values of the parameters (t, a, b) were obtained so that the sum of the squared residuals of ${\mathrm{flux}}_{i}-{\mathrm{flux}}_{i}^{{\prime} }$ was minimized (i.e., Equation (3) was minimized). In Equation (3), flux is the original blackbody radiation spectrum and $\mathrm{flux}^{\prime} $ is the fitted flux value obtained by substituting the fitted parameters into Equation (2).

Equation (3): $\mathop{\min }\limits_{t,a,b}\ \displaystyle \sum _{i=1}^{3700}{({\mathrm{flux}}_{i}-{\mathrm{flux}}_{i}^{{\prime} })}^{2}$

Consequently, 72 sets of parameters (t, a, b) were obtained by fitting the 72 O-type samples. Figure 1 shows a real blackbody radiation spectrum and the corresponding generated one, which are consistent in their physical characteristics.
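As an illustration of this fitting step, the sketch below fits Equation (2) to a synthetic spectrum with scipy.optimize.curve_fit. The temperature, scale a, offset b, noise level, and median-filter width are all invented for the example and are not the values used in this work.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.signal import medfilt

# Physical constants (SI units).
H = 6.62607015e-34   # Planck constant [J s]
C = 2.99792458e8     # speed of light [m/s]
K = 1.380649e-23     # Boltzmann constant [J/K]

def modified_planck(lam, t, a, b):
    """Equation (2): scaled Planck continuum plus a sky-background offset."""
    return a * (8 * np.pi * H * C / lam**5) / np.expm1(H * C / (lam * K * t)) + b

# Synthetic stand-in for one O-type spectrum: 3700 points over 3700-8672 Angstroms.
wavelength = np.linspace(3700e-10, 8672e-10, 3700)   # metres
true_t, true_a, true_b = 50_000.0, 1.0e-9, 0.1       # illustrative values only
rng = np.random.default_rng(0)
flux = modified_planck(wavelength, true_t, true_a, true_b)
flux += rng.normal(0.0, 0.01 * flux.mean(), flux.size)   # mild noise

# Median filtering suppresses narrow features before fitting the continuum.
continuum = medfilt(flux, kernel_size=51)

# Recover (t, a, b) by least squares, i.e. minimize Equation (3).
popt, _ = curve_fit(modified_planck, wavelength, continuum,
                    p0=(40_000.0, 1.0e-9, 0.0), maxfev=10_000)
t_fit, a_fit, b_fit = popt
```

In the paper's pipeline this loop runs once per observed O-type spectrum, storing one (t, a, b) triple per star.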

Figure 1. The real blackbody radiation spectrum (fluxi , the blue curve) and the generated blackbody radiation spectrum (${\mathrm{flux}}_{i}^{{\prime} }$, the red curve) under the same observational conditions (temperature t, flux correction a, and sky background b). Three samples are shown. The first two are cases where the sum of the squared residuals is relatively small, and sample (c) is a poor fit.


2. Obtaining emission and absorption lines

Considering that the emission and absorption lines of a given class are basically the same, the emission and absorption lines of O-type stars can be obtained by subtracting the blackbody radiation spectrum (obtained by median filtering) from the original spectrum. For the 72 known O-type stars, 72 sets of emission and absorption lines were finally obtained.

To generate a new O-type blackbody radiation spectrum, we set a random temperature (between 28,000 and 100,000 K) and pick a random set of parameters (a, b) to obtain the flux calculated by Equation (2). The final generated spectrum is obtained by randomly combining the fitted blackbody radiation spectrum with a set of emission and absorption lines. Algorithm 1 lists the spectrum generation steps in pseudocode. As the model training results in Section 3.3 show, the generated data effectively solve the overfitting problem caused by the insufficient amount of data while maintaining good accuracy, which meets the requirements of deep learning.

Algorithm 1. O-type spectrum generation

Input:
• Spec_original: the 72 O-type samples, each containing 3700 flux values at wavelengths from 3700 to 8672 Å
• T_otype: temperature of an O-type star, T_otype ∈ [28,000 K, 100,000 K]
• Modified Planck formula (Equation (2)): ${\rm{flux}}(\lambda ,t,a,b)=a\cdot \displaystyle \frac{8\pi {hc}}{{\lambda }^{5}}\cdot \displaystyle \frac{1}{{e}^{\tfrac{{hc}}{\lambda {kt}}}-1}+b$

Output:
• Generated O-type spectrum, Spec_generation

1: for each spectrum Spec_original_i, i ∈ [1, 72] do
2:   /* Step 1: blackbody radiation fitting */
3:   Median-filter the spectrum to obtain its blackbody radiation spectrum, flux_i;
4:   Fit the modified Planck formula by minimizing the objective function ${\sum }_{p=1}^{3700}{({{\rm{flux}}}_{p}-{{\rm{flux}}}_{p}^{{\prime} })}^{2};$
5:   Record the fitted parameters: temperature t_i, a_i, and b_i;
6:   /* Step 2: acquisition of absorption and emission lines */
7:   Obtain the emission and absorption lines: em_ab_line_i = Spec_original_i − flux_i;
8: end for
9: for i ∈ [1, N (number of spectra to be generated)] do
10:   /* Step 3: O-type spectrum generation */
11:   Randomly choose a temperature T_otype_k between 28,000 K and 100,000 K;
12:   Randomly choose a pair (a, b) fitted in Step 1 and feed a, b, and T_otype_k into Equation (2) (with t = T_otype_k) to generate the blackbody radiation spectrum, spec_blackbody_gen_k;
13:   Randomly choose a set of emission and absorption lines, em_ab_line_j, obtained in Step 2;
14:   Generate the O-type spectrum: Spec_generation = spec_blackbody_gen_k + em_ab_line_j;
15: end for
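Algorithm 1 can be sketched in Python roughly as follows. The five synthetic input spectra, the fixed scale a, and the single toy absorption feature stand in for the 72 real O-type samples, so this illustrates the control flow rather than reproducing the authors' code.

```python
import numpy as np
from scipy.signal import medfilt

H, C, K = 6.62607015e-34, 2.99792458e8, 1.380649e-23  # SI constants

def planck_mod(lam, t, a, b):
    """Equation (2): scaled Planck continuum plus an offset."""
    return a * (8 * np.pi * H * C / lam**5) / np.expm1(H * C / (lam * K * t)) + b

rng = np.random.default_rng(42)
lam = np.linspace(3700e-10, 8672e-10, 3700)   # 3700 points, 3700-8672 Angstroms

# Stand-ins for the 72 observed O-type spectra (synthetic toy data here).
n_obs = 5
fitted_params = [(rng.uniform(3.0e4, 6.0e4), 1.0e-9, rng.uniform(0.0, 0.2))
                 for _ in range(n_obs)]
observed = []
for t, a, b in fitted_params:
    spec = planck_mod(lam, t, a, b)
    spec[1800] *= 0.7                          # a toy absorption feature
    observed.append(spec)

# Steps 1-2: continuum via median filtering; lines = spectrum - continuum.
continua = [medfilt(s, kernel_size=101) for s in observed]
em_ab_lines = [s - c for s, c in zip(observed, continua)]

# Step 3: combine a random-temperature continuum with a random line set.
def generate_spectrum():
    t_new = rng.uniform(28_000.0, 100_000.0)   # random O-type temperature
    _, a, b = fitted_params[rng.integers(n_obs)]
    blackbody = planck_mod(lam, t_new, a, b)
    return blackbody + em_ab_lines[rng.integers(n_obs)]

new_spec = generate_spectrum()
```

Each call produces one new 3700-point spectrum; repeating the call N times fills out the O-type class.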


3. A CNN-based Spectral Classification Model

3.1. Model Design

In this section, a CNN-based model for classifying the spectra is proposed. A straightforward idea of model construction is to classify all the spectra into nine classes and get different types of candidates. However, the preliminary experiments proved that the probability of confusing hot subdwarfs and white dwarfs in the nine-class classification model is high due to their similar blackbody radiation characteristics. Therefore, a more refined hybrid model with an eight-class classification model and a binary classification model is constructed. The eight-class classification model disregards the effect of white dwarfs, and the binary classification model further separates hot subdwarfs from white dwarfs.

The network consists of six convolutional layers (Figure 2). A max-pooling layer follows each convolutional layer. After the convolutional and max-pooling layers, two fully connected layers are added. In the convolutional layer, we process the output data by the ReLU function (Nair & Hinton 2010). The cross entropy function (Hinton & Salakhutdinov 2006) is used as the loss function, and the stochastic gradient descent function (Hardt et al. 2016) is used as the optimizer. During training, we start with small convolutional kernels and gradually increase the size of the kernels, and the final parameters of each layer are given in Table 1. The Softmax function (Ian et al. 2016) gives the final classification results, which are probability values of the samples being classified into different classes.

Table 1. Parameters of Each Layer in the Model

Layer | Type | Maps | Kernel Size | Striding | Activation
IN | Input | 1 | | |
C1 | Convolution | 10 | 1 × 2 | 1 | ReLU
S2 | Max Pooling | 10 | 1 × 2 | 1 |
C3 | Convolution | 20 | 1 × 3 | 1 | ReLU
S4 | Max Pooling | 20 | 1 × 2 | 1 |
C5 | Convolution | 30 | 1 × 4 | 1 | ReLU
S6 | Max Pooling | 30 | 1 × 2 | 2 |
C7 | Convolution | 40 | 1 × 5 | 1 | ReLU
S8 | Max Pooling | 40 | 1 × 2 | 1 |
C9 | Convolution | 50 | 1 × 7 | 1 | ReLU
S10 | Max Pooling | 50 | 1 × 2 | 2 |
C11 | Convolution | 60 | 1 × 9 | 1 | ReLU
S12 | Max Pooling | 60 | 1 × 2 | 2 |
F13 | Fully Connected | | | |
F14 | Fully Connected | | | |
OUT | Fully Connected | | | | Softmax
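Given the kernel sizes and strides in Table 1, the feature-map length after each stage can be checked with the standard output-length formula. The paper does not state the padding scheme, so unpadded ('valid') convolutions are assumed here; the resulting lengths are therefore illustrative rather than the paper's exact figures.

```python
def out_len(n, kernel, stride, padding=0):
    """Output length of a 1D convolution/pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

# (kernel width, stride) per layer, read from Table 1; zero padding is an assumption.
layers = [(2, 1), (2, 1),   # C1, S2
          (3, 1), (2, 1),   # C3, S4
          (4, 1), (2, 2),   # C5, S6
          (5, 1), (2, 1),   # C7, S8
          (7, 1), (2, 2),   # C9, S10
          (9, 1), (2, 2)]   # C11, S12

def propagate(n):
    for k, s in layers:
        n = out_len(n, k, s)
    return n

eight_class_len = propagate(3700)  # 3700-point input spectra -> 454
binary_len = propagate(1400)       # 1400-point line-only input -> 167
```

The three stride-2 pooling layers do most of the downsampling; the stride-1 pools only trim one point each.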


3.2. Data Set

After sample augmentation of the O-type star spectrum, we randomly selected data and performed manual validation until there were 2000 samples for each class (O, hot subdwarf, white dwarf, B, A, F, G, K, and M stars; a total of 18,000 samples). 70% of the samples were randomly selected as the training set, and 30% were used for testing. Considering that our goal was to construct a model for searching hot subdwarfs, we did not include white dwarfs in the training and testing set of the eight-class classification model. A total of 3700 flux values in a wavelength range of 3700–8672 Å were obtained by truncating each sample. Specifically, LAMOST DR7-V1 provides spectral data with a wavelength increment of lg(λi+1) − lg(λi ) = 0.0001, therefore there are 3700 data points sampled in the wavelength range of 3700–8672 Å. We did not resample the original spectral data in our analysis.

The binary classification model was constructed to accurately distinguish hot subdwarfs from white dwarfs. Since the blackbody radiation spectra of hot subdwarfs and white dwarfs are very similar, it is difficult for the model to distinguish them when trained with the original spectra. Therefore, for the binary classification model, the emission and absorption lines of the spectra were used for training and testing. Experiments showed that the model performed best when trained with the 1400 flux values (the emission and absorption lines) corresponding to wavelengths between 3700 and 5106 Å of each spectrum. A total of 835 hot subdwarfs with S/Ns greater than 10 that were not involved in the training process were used to further validate the model. The eight-class classification model and the binary classification model were trained on the same hot subdwarf training set to obtain a more robust model. All samples were normalized by Equation (4) to prevent gradient problems during deep-learning model training, where xmin is the minimum value of the flux and xmax is the maximum value of the flux:

Equation (4): ${x}^{{\prime} }=\displaystyle \frac{x-{x}_{\min }}{{x}_{\max }-{x}_{\min }}$
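A minimal sketch of this min-max normalization on a toy flux array:

```python
import numpy as np

def min_max_normalize(flux):
    """Equation (4): rescale a spectrum's flux values into [0, 1]."""
    x_min, x_max = flux.min(), flux.max()
    return (flux - x_min) / (x_max - x_min)

# Toy flux values, purely illustrative.
spec = np.array([120.0, 80.0, 200.0, 160.0])
norm = min_max_normalize(spec)
```

Applied per spectrum, this maps the smallest flux to 0 and the largest to 1 while preserving the line shapes.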

3.3. Modeling and Performance Analysis

The performance of a deep-learning model is usually evaluated by accuracy, precision, recall, and F1 score (Forman et al. 2003), which are parameters calculated from the confusion matrix. A confusion matrix includes four parts: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). TP and TN are the observations that are correctly predicted. FP and FN are cases of misclassification. Accordingly, the definitions of accuracy, precision, recall, and F1 score are in Equations (5), (6), (7), and (8), respectively.

Figure 2. The architecture of the classification model. The two rows of numbers under the model represent the feature-map sizes; the top row is for the eight-class classification model and the bottom row is for the binary classification model.


For the eight-class classification model, we set the batch size to 100 and the number of iterations to 2500. After training, we found that the model gradually converged after about 2200 iterations. For the binary classification model, we set the batch size to 10 and the number of iterations to 2000; it gradually converged after about 1750 iterations. Through repeated training and testing, we finally obtained an accuracy of 94.21% for the eight-class classification model and an accuracy of 96.17% for the binary classification model. The confusion matrices are shown in Figure 3. The precision, recall, and F1 score for each class are presented in Tables 2 and 3. We also report the mean (ε) and variance (μ) of the difference between the predicted and true labels of the two models. For the eight-class classification model, the mean is 0.0166 and the variance is 0.67. For the binary classification model, the mean is 0.023 and the variance is 0.46. The results show that the model performs well, with a tiny difference between the predicted and true labels.

Equation (5): $\mathrm{Accuracy}=\displaystyle \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$

Equation (6): $\mathrm{Precision}=\displaystyle \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$

Equation (7): $\mathrm{Recall}=\displaystyle \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$

Equation (8): ${F}1=\displaystyle \frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$
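The four metrics follow directly from the confusion-matrix counts; the counts below are hypothetical, chosen only to illustrate the formulas:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for the positive class of a binary classifier.
acc, prec, rec, f1 = metrics(tp=575, fp=21, tn=579, fn=25)
```

Note that F1 is the harmonic mean of precision and recall, so it penalizes a classifier that trades one for the other.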

Figure 3. The confusion matrix. Panel (a) is the confusion matrix of the eight-class classification model and panel (b) is the confusion matrix of the binary classification model.


Table 2. Precision, Recall and F1 Score of the Eight-class Classification Model for Each Class

Class | Precision | Recall | F1
O | 98.98% | 99.32% | 99.15%
H | 98.48% | 97.82% | 98.15%
B | 95.41% | 94.44% | 94.92%
A | 88.25% | 92.85% | 90.49%
F | 88.97% | 89.43% | 89.20%
G | 92.23% | 88.93% | 90.55%
K | 95.26% | 92.45% | 93.83%
M | 96.47% | 98.53% | 97.49%


Table 3. Precision, Recall and F1 Score of the Binary Classification Model for Each Class

Class | Precision | Recall | F1
Hot subdwarf | 96.45% | 95.81% | 96.13%
White dwarf | 95.89% | 96.52% | 96.20%


To further test the ability of our model at searching for hot subdwarfs, we used the 835 identified hot subdwarfs that were not involved in the training process for validation. The classification results showed that 802 hot subdwarfs (96.05% of the total) were correctly identified. An analysis and discussion of the misclassified samples are given in Section 5.

4. Hot Subdwarf Identification and Manual Validation

We classified all spectra in LAMOST DR7-V1 using the hybrid CNN model described in the previous section. The hot subdwarf candidates obtained by the model were then verified manually, and a final credible catalog is given.

The classification results of the model gave the categories and the corresponding probabilities for all LAMOST spectra. Analyzing the prediction probabilities of the 2235 identified hot subdwarfs with S/Ns greater than 10, we found that the model gave a prediction probability of more than 80% for 2103 of them, of which 2043 had a predicted probability of more than 99%. With reference to the work of Zheng et al. (2020), we considered samples with a prediction probability above 99% to be more reliable.

In order to obtain reliable results, we further screened the model predictions in two ways: (a) for candidates with an S/N (g band) greater than 10, we retained the recognized spectra with a probability higher than 99%; (b) for candidates with an S/N (g band) lower than 10, we selected the top 200 samples with the highest prediction probabilities. We thereby obtained 2393 hot subdwarf candidates.
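The two screening rules can be sketched as follows; the prediction tuples and identifiers are hypothetical stand-ins for the model's per-spectrum outputs:

```python
# Hypothetical model outputs: (identifier, g-band S/N, predicted probability).
predictions = [
    ("a", 35.2, 0.995), ("b", 22.1, 0.910), ("c", 6.4, 0.970),
    ("d", 4.8, 0.880), ("e", 55.0, 0.999),
]

# Rule (a): S/N > 10, keep only predictions above the 99% probability cut.
high_snr = [p for p in predictions if p[1] > 10 and p[2] > 0.99]

# Rule (b): S/N < 10, keep the top 200 by predicted probability.
low_snr = sorted((p for p in predictions if p[1] < 10),
                 key=lambda p: p[2], reverse=True)[:200]

candidates = high_snr + low_snr
```

In this toy run, "b" is dropped because its probability falls below the 99% cut despite its good S/N.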

We checked the obtained candidates by cross validation with the catalogs of hot subdwarfs published by Luo et al. (2021) and Geier et al. (2017), plus manual validation. Manual validation results for the 2393 hot subdwarf candidates are presented in Table 4. Along with basic information about the spectra, the last column of Table 4 describes the manual validation result, summarized into four categories: (1) hot subdwarfs that have been identified by previous work; (2) newly discovered hot subdwarfs; (3) objects identified as other classes by manual validation; and (4) objects whose class cannot be determined due to a low S/N. Among the 2393 candidates, 2092 are hot subdwarfs (25 newly discovered), 183 are other types of spectra misclassified as hot subdwarfs, and 118 spectra have an S/N too low to be manually confirmed. In other words, 87.42% of the hot subdwarfs predicted by the model were proved to be hot subdwarfs by manual verification.

Table 4. Manual Verification Results of the 2393 Hot Subdwarf Candidates

Designation (LAMOST) | OBSID (LAMOST) | R.A. (deg) | Decl. (deg) | S/N (g) | S/N (u) | S/N (z) | S/N (r) | S/N (i) | Type
J023200.24+333436.1 | 632206097 | 38.001041 | 33.576702 | 60.56 | 21.51 | 25.19 | 53.68 | 47.63 | 1
J211531.47+123957.5 | 592502156 | 318.881150 | 12.665982 | 52.55 | 35.69 | 17.17 | 39.60 | 35.41 | 1
J155144.87+002948.8 | 133709100 | 237.936982 | 0.496901 | 18.40 | 7.45 | 6.73 | 18.98 | 15.05 | 1
J071856.25+102638.3 | 446310014 | 109.734390 | 10.443990 | 64.56 | 35.33 | 24.47 | 53.17 | 49.80 | 1
J212356.68+153323.5 | 592415028 | 320.986190 | 15.556550 | 36.19 | 18.30 | 12.39 | 31.63 | 28.95 | 1
J210132.70+135622.5 | 371602229 | 315.386270 | 13.939585 | 37.99 | 29.44 | 10.61 | 27.60 | 22.89 | 2
J034322.46+463222.2 | 399504032 | 55.843622 | 46.539507 | 18.85 | 8.98 | 9.62 | 18.38 | 18.24 | 2
J045031.77+230712.9 | 197203189 | 72.632382 | 23.120270 | 41.74 | 26.74 | 23.16 | 33.79 | 40.42 | 2
J060619.60+200141.7 | 504103064 | 91.581701 | 20.028262 | 43.07 | 16.75 | 36.92 | 40.71 | 50.95 | 2
J182942.24+110429.1 | 746715115 | 277.426030 | 11.074758 | 38.31 | 16.52 | 17.89 | 29.15 | 31.51 | 2
J072222.78+445904.4 | 504804107 | 110.594950 | 44.984569 | 26.37 | 13.70 | 7.60 | 21.90 | 16.67 | 3
J122420.49+264738.8 | 714311122 | 186.085402 | 26.794137 | 17.85 | 8.02 | 7.97 | 13.39 | 13.35 | 3
J113143.40+370128.0 | 450104014 | 172.930840 | 37.024465 | 37.58 | 17.88 | 17.81 | 37.05 | 33.04 | 3
J101134.32+342150.7 | 732201084 | 152.893033 | 34.364110 | 2.92 | 0.85 | 3.17 | 4.18 | 4.56 | 3
J104052.58+284856.7 | 630714012 | 160.219100 | 28.815752 | 21.98 | 7.24 | 7.29 | 14.95 | 13.58 | 3
J222527.75+012522.5 | 484002197 | 336.365666 | 1.422931 | 5.74 | 3.83 | 2.23 | 4.17 | 4.64 | 4
J004511.44+321708.6 | 195305132 | 11.297683 | 32.285725 | 9.69 | 2.65 | 5.83 | 8.55 | 10.90 | 4
J004938.10+460953.1 | 498808130 | 12.408772 | 46.164764 | 5.17 | 1.94 | 6.25 | 6.29 | 9.29 | 4
J095728.68+275506.2 | 50401068 | 149.369516 | 27.918390 | 9.54 | 14.27 | 1.76 | 1.56 | 2.63 | 4
J105916.42+512443.1 | 148611045 | 164.818450 | 51.411984 | 7.49 | 3.89 | 1.52 | 6.28 | 4.22 | 4

Note. Columns 1–9: Basic information in the LAMOST catalog. Column 10: Type description as follows. (1) Hot subdwarfs that have been identified by previous work. (2) Newly discovered hot subdwarfs. (3) Identified as other classes by manual validation. (4) Cannot determine the class due to a low S/N.

Only a portion of this table is shown here to demonstrate its form and content. A machine-readable version of the full table is available.


Following the works of Luo et al. (2016) and Luo et al. (2021), atmospheric parameters (effective temperature Teff, surface gravity log(g), and He abundance $\mathrm{log}({n}_{\mathrm{He}}/{n}_{{\rm{H}}})$) for the 25 newly identified hot subdwarf stars were obtained by fitting the LAMOST spectra with TLUSTY/SYNSPEC non-LTE synthetic spectra (Hubeny & Lanz 1995, 2011). All parameters of the 25 newly discovered hot subdwarfs are listed in Table 5, which includes the LAMOST designation, R.A., decl., effective temperature, surface gravity, He abundance, the S/N in the u band, apparent magnitudes in the u and g bands of Sloan Digital Sky Survey (SDSS) DR9, and the apparent G-band magnitude of Gaia DR2.

Table 5. The Atmospheric Parameters of 25 Newly Identified Hot Subdwarfs in this Work

Designation (LAMOST) | R.A. (deg) | Decl. (deg) | Teff (K) | log(g) (cm s−2) | log(nHe/nH) | S/N (u band) | SDSS u (mag) | SDSS g (mag) | Gaia DR2 G (mag)
J011346.77+002828.6 | 18.444903 | 0.474638 | 66,154 ± 377 | 6.5 ± 0.03 | 2.38 ± 0.45 | 63.05 | 14.51 | 14.96 | 15.23
J021334.02+435651.7 | 33.391767 | 43.947714 | 50,287 ± 1961 | 5.86 ± 0.31 | −0.07 ± 0.19 | 2.01 | ... | ... | 16.89
J023007.26+292256.2 | 37.530274 | 29.382299 | 74,994 ± 1650 | 6.50 ± 0.08 | 1.89 ± 2.49 | 21.37 | 16.99 | 17.39 | 17.61
J032425.51+430302.7 | 51.106302 | 43.050760 | 49,080 ± 511 | 6.19 ± 0.08 | 0.76 ± 0.13 | 6.28 | 17.52 | 17.58 | 17.67
J034322.46+463222.2 | 55.843622 | 46.539507 | 48,674 ± 346 | 5.81 ± 0.07 | 1.00 ± 0.12 | 8.98 | 16.80 | 16.73 | 16.70
J042351.08+322201.0 | 65.962855 | 32.366951 | 50,248 ± 977 | 5.73 ± 0.13 | 0.11 ± 0.13 | 3.61 | 17.38 | 17.39 | 17.45
J045031.77+230712.9 | 72.632382 | 23.120270 | 41,776 ± 75 | 5.79 ± 0.02 | 0.84 ± 0.02 | 26.74 | 17.73 | 17.56 | 17.47
J051135.26+391706.5 | 77.896933 | 39.285150 | 42,434 ± 273 | 6.31 ± 0.09 | 1.08 ± 0.20 | 4.47 | ... | ... | 17.13
J051304.75+114717.1 | 78.269827 | 11.788084 | 49,634 ± 257 | 6.50 ± 0.01 | 2.07 ± 0.11 | 73.99 | ... | ... | 14.38
J054559.57+222738.0 | 86.498229 | 22.460566 | 46,032 ± 797 | 5.52 ± 0.11 | 1.03 ± 0.24 | 8.75 | ... | ... | 17.63
J060619.60+200141.7 | 91.581701 | 20.028262 | 47,105 ± 321 | 5.87 ± 0.07 | −0.25 ± 0.03 | 16.75 | ... | ... | 16.90
J064529.73+412642.0 | 101.373890 | 41.445024 | 55,297 ± 0 | 5.07 ± 0 | 2.50 ± 0 | 4.81 | ... | ... | 16.61
J064558.40+111223.6 | 101.493340 | 11.206556 | 39,831 ± 329 | 5.54 ± 0.06 | 0.74 ± 0.08 | 4.66 | 17.44 | 17.52 | 17.68
J070700.65+193526.9 | 106.752730 | 19.590809 | 51,530 ± 469 | 6.50 ± 0.07 | 1.08 ± 0.20 | 12.07 | ... | ... | 17.42
J084223.13+375900.2 | 130.596410 | 37.983394 | 59,127 ± 263 | 6.27 ± 0.10 | 2.40 ± 3.53 | 12.98 | 17.04 | 17.51 | 17.86
J084350.85+361419.5 | 130.961886 | 36.238770 | 38,293 ± 166 | 5.62 ± 0.01 | 2.27 ± 0.04 | 7.12 | 16.87 | 17.01 | 17.23
J091029.42+090205.1 | 137.622608 | 9.034770 | 39,329 ± 235 | 6.23 ± 0.02 | 1.50 ± 0.04 | 5.05 | 17.00 | 17.09 | 17.26
J101457.79+451906.1 | 153.740813 | 45.318381 | 42,182 ± 172 | 5.61 ± 0.01 | 0.57 ± 0.05 | 9.99 | 19.64 | 17.69 | 19.01
J124931.17+750727.2 | 192.379883 | 75.124244 | 55,650 ± 2437 | 5.17 ± 0.15 | −3.68 ± 0.73 | 6.04 | ... | ... | 16.83
J180237.05+032550.1 | 270.654400 | 3.430598 | 44,673 ± 561 | 6.07 ± 0.13 | 1.05 ± 0.23 | 6.89 | 17.72 | 25.11 | 17.65
J181325.00+065315.9 | 273.354170 | 6.887751 | 50,584 ± 434 | 5.80 ± 0.06 | 0.60 ± 0.07 | 21.92 | ... | ... | 15.96
J182942.24+110429.1 | 277.426030 | 11.074758 | 52,791 ± 533 | 6.07 ± 0.07 | 0.60 ± 0.01 | 16.52 | ... | ... | 16.70
J205901.96+334933.0 | 314.758180 | 33.825842 | 74,995 ± 7667 | 5.97 ± 0.27 | 2.14 ± 11.20 | 4.01 | ... | ... | 17.49
J210132.70+135622.5 | 315.386270 | 13.939585 | 74,981 ± 251 | 6.50 ± 0.01 | −0.33 ± 0.06 | 29.44 | ... | ... | 16.49
J230829.86+214333.8 | 347.124442 | 21.726063 | 28,470 ± 536 | 5.07 ± 0.11 | −2.86 ± 0.13 | 3.23 | 17.16 | 15.94 | 15.60


5. Discussions

5.1. Model Performance

Compared with the previous binary classification model (hot subdwarfs versus other types), a more accurate hybrid deep-learning model is constructed in this paper. It achieved an accuracy of 96.17% on the testing set, and 802 of the 835 hot subdwarfs (96.05%) in the validation set were correctly identified. From Table 3, the F1 value for searching for hot subdwarfs is 96.13%, much higher than the 76.98% of Bu et al. (2019).

Comparative analysis shows that the overfitting problem caused by insufficient data can be solved by data augmentation. The improvement of training samples also enables the model to learn more complete features, resulting in better results. Overall, the model structure that combined eight-class classification with the binary classification model obtains better performance than the binary classification model alone.

5.2. Misclassification

Analyzing the classification results in Section 4, we find that of the 2428 hot subdwarfs in the LAMOST catalog, the model eventually recovered 2067. We therefore analyzed the remaining 361 hot subdwarfs: 89 were misclassified by our model, and 272 were predicted as hot subdwarfs with a low probability and filtered out during candidate screening (Section 4).

The spectral characteristics of most correctly classified hot subdwarfs are consistent with Figure 4(a). A detailed analysis of the misclassified and low-probability spectra reveals three main conditions: (1) low-S/N data (Figure 4(b)), in which the absorption and emission lines are difficult to identify; our experiments show that the prediction probability gradually decreases as the noise increases. (2) Spectra with abnormal values, i.e., spectra dominated by sharp spikes either from cosmic rays or instrumental artifacts (Figure 4(c)); we found experimentally that artificially injecting abnormal values into hot subdwarf spectra likewise leads to misclassification and low prediction probabilities. (3) Spectra exhibiting the features of a binary star, as seen in Figure 4(d), which is probably a binary (possibly a hot subdwarf and an M star). Improving the classification accuracy for such cases requires more labeled training data, but there are very few samples of these types in the LAMOST catalog. These three factors lead to a sample being misclassified, or classified as a hot subdwarf with a very low probability.

Figure 4. Examples of hot subdwarfs that were misclassified.


For spectra with a low S/N whose category cannot be manually confirmed (Table 4, type 4), repeat observations by LAMOST can be followed up at a later stage.

6. Conclusions

In this study, our proposed approach uses a hybrid CNN-based model to classify the spectra and search for hot subdwarfs in the LAMOST data. For the problem of insufficient training data, we propose a spectral generation method based on fitting the Planck formula. The model was applied to identify the known hot subdwarfs in the LAMOST catalog, and most of them were correctly identified; moreover, our approach offers superior performance to the previous binary classification approach. The proposed model was finally used to classify the LAMOST data and search for hot subdwarfs, and the experiments indicate that 87.42% of the hot subdwarfs predicted by our model are proved to be hot subdwarfs. The 25 newly identified hot subdwarfs will provide a reference for follow-up research in related fields.

In addition to discovering new hot subdwarfs, the current approach is well suited to searching for specific targets in vast amounts of spectral data. In particular, spectrum generation provides a solution for classes with few samples and guarantees the reliability of deep-learning methods in spectrum classification. Enlarging the sample size is one of our next efforts and is essential for improving the model accuracy. We believe that the multiclassification model implemented in this paper is promising and can be applied to search for other specific types of stars; its capability and performance are worthy of further experiments. Code for the spectral generation algorithm, model construction, and model training is available.

This work is supported by the National SKA Program of China (2020SKA0110300), the National Science Foundation for Young Scholars (11903009), the Joint Research Fund in Astronomy (U1831204, U1931141) under cooperative agreement between the National Natural Science Foundation of China (NSFC) and the Chinese Academy of Sciences (CAS), Funds for International Cooperation and Exchange of the National Natural Science Foundation of China (11961141001), the National Science Foundation of China (12173028), Fundamental and Application Research Project of Guangzhou (202102020677), and the Innovation Research for the Postgraduates of Guangzhou University under grant 2021GDJC-M15.

We thank the anonymous referee for valuable and helpful comments and suggestions.
