Classification of Astronomical Spectra Based on Multiscale Partial Convolution

The automated and efficient classification of astronomical spectra is an important research issue in the era of large sky surveys. Most current studies on automatic spectral classification primarily focus on specific data sets and demonstrate outstanding performance. However, the diversity in spectra poses formidable challenges for these classification models, as they exhibit limited capability to generalize across more comprehensive data sets. In response to these challenges, we pioneer a method called the multiscale partial convolution net (MSPC-Net), which amalgamates partial, large kernel, and grouped convolution to facilitate multilabel spectral classification. By harnessing the capabilities of partial convolution, MSPC-Net can effectively reduce the number of model parameters, accelerate the training process, and mitigate the overfitting issue. Integrating large kernel and grouped convolution empowers the model to capture local and global features simultaneously, enhancing its overall classification efficacy. To rigorously evaluate the model’s performance, we generate ten different data sets sourced from the Sloan Digital Sky Survey and Large Sky Area Multi-Object Spectroscopic Telescope. These data sets encompass stellar class, stellar subclass, and full classification, providing a comprehensive assessment across various application scenarios. The experimental results reveal that MSPC-Net consistently outperforms the other models across different data sets, especially demonstrating superior performance in the last two data sets with full classification. Consequently, MSPC-Net is poised to find extensive applications in the detailed classification for large-scale sky survey projects. This work not only addresses the challenges of generalization in spectral classification but also contributes significantly to the advancement of robust models for astronomical research.


Introduction
Astronomical spectra furnish vital insights into stars, galaxies, and the Universe. Over the years, various sky survey missions, including the Sloan Digital Sky Survey (SDSS; York & Adelman 2000), the Large Sky Area Multi-Object Spectroscopic Telescope (LAMOST; Cui et al. 2012), the Dark Energy Spectroscopic Instrument (DESI; Dey et al. 2019), and the Global Astrometric Interferometer for Astrophysics (Gaia; Gaia Collaboration et al. 2016), have been undertaken, yielding extensive spectral data. Consequently, the precise classification of these spectra presents an escalating challenge in astronomical spectroscopy research. To handle this growing complexity, it is essential to investigate novel approaches that enhance the comprehension of celestial objects. The process of spectral classification has evolved from manual expert assessments to automated techniques such as template matching and machine-learning methods.
Researchers in astronomy have explored the advantages of transitioning from manual expert methods to template matching for spectral classification (Duan et al. 2009; Bolton et al. 2012; Gray & Corbally 2014) to automate the process, while acknowledging that template matching may encounter difficulties with complex spectra and accurate recognition. To address these challenges, machine-learning algorithms have been employed. For instance, von Hippel et al. (1994) and Singh et al. (1998) employed artificial neural networks for stellar spectral classification. Brice & Andonie (2019) utilized k-nearest neighbors and random forest (RF) algorithms. Li et al. (2019) conducted stellar spectral classification and feature evaluation using an RF algorithm. Chen et al. (2014) applied the restricted Boltzmann machine for spectral classification. Kheirdastan & Bazarghan (2016) explored probabilistic neural networks, support vector machines, and k-means for spectral classification. Liu & Zhao (2017) proposed an entropy-based approach for unbalanced spectral classification. Yang et al. (2022) provided a comprehensive review of machine-learning applications in astronomical spectral classification.
Deep-learning techniques, specifically convolutional neural networks (CNNs), have gained popularity in astronomical spectral classification. Notably, Liu et al. (2019) employed a nine-layer CNN architecture for stellar spectrum classification, achieving notable improvements compared to traditional machine-learning methods. This research showcased the efficacy of CNNs in spectral classification by achieving high accuracy in classifying large astronomical datasets. The nine-layer CNN extracted informative features from the spectra, thereby enhancing classification accuracy. Additionally, Zou et al. (2020) proposed an innovative approach by incorporating residual and attention mechanisms into the CNN architecture to further enhance performance. The residual block facilitated direct information propagation from previous layers, aiding in capturing significant spectral features, while the attention mechanism allocated weights to different channels or layers to focus on essential spectral regions. However, these studies often revolve around specific spectral datasets. For instance, Liu et al. (2019) classified stars into three classes (F, G, and K) from the SDSS dataset while also considering subclass classifications such as A0, A5, F0, F5, G0, G5, K0, K5, M0, and M5 from SDSS. Similarly, Zou et al. (2020) employed the STAR, GALAXY, QSO, and UNKNOWN categories for the quadruple classification task, alongside the stellar classification of F, G, and K categories from LAMOST.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
With the availability of massive data in astronomy, the versatility, efficiency, and effectiveness of supervised learning make it a valuable tool for astronomers to analyze vast astronomical datasets, classify celestial objects, predict their properties, and uncover new insights into the universe. The adoption of supervised learning techniques, encompassing both machine-learning and deep-learning approaches, has significantly advanced the field of spectral classification and feature extraction (Zhang et al. 2020; Li & Lin 2023; Tan et al. 2023; Wang et al. 2023). However, due to the diversity of astronomical spectra, these methods face significant challenges when applied to more generalized and comprehensive datasets, resulting in poor performance.
These challenges frequently arise due to the substantial size of deep-learning models and their propensity to learn redundant or incorrect features. Consequently, there is an elevated risk of overfitting, which in turn leads to a decline in classifier performance. To address these issues, our study concentrates on reducing model size while maintaining or even enhancing performance. Recent research in computer vision (Ding et al. 2022; Guo et al. 2022; Liu et al. 2022) suggests that a larger receptive field, represented by the convolution kernel size, can enhance classification performance. This insight motivated us to expand the convolution kernel size, despite the significant increase in model parameters that this entails. To offset that increase, we introduce partial convolution (Chen et al. 2023) and grouped convolution (Ioannou et al. 2017; Xie et al. 2017; Wang et al. 2023b). It is essential to clarify that the partial convolution utilized in this study, as referenced in Chen et al. (2023), operates at the channel level, distinguishing it from the more commonly presumed pixel-level partial convolution discussed by Liu et al. (2018). These techniques allow us to achieve multiscale receptive fields and reduce the number of parameters. By implementing these enhancements, our method outperforms other methods on ten datasets from LAMOST and SDSS. Moreover, our method exhibits superior resistance to overfitting compared to other models, affirming its effectiveness.
The subsequent sections of this article are structured as follows: Section 2 provides an overview of data sources and preprocessing techniques; Section 3 elucidates the specific methodologies employed; Section 4 presents experimental results; Section 5 furnishes an in-depth analysis based on these results; and finally Section 6 concludes the study.

Data and Preprocessing
The SDSS and LAMOST surveys have amassed tens of millions of spectra. To evaluate the model's generalization performance, spectra from SDSS DR18 and LAMOST DR10 are employed in this study.
While the eight datasets mentioned above assess the model's performance under ideal conditions, where each class is equally represented and subject to an S/N threshold, practical application scenarios may not adhere to such S/N restrictions, and the number of objects for classification may vary. To gauge the model's performance under realistic conditions, we acquire Dataset FC-S from SDSS and Dataset FC-L from LAMOST. These datasets encompass all categories, including galaxies, quasars, and all stellar subclasses, and they are selected randomly without S/N limitation. The datasets are described in Table 1. For ease of comparison with past and future research endeavors, each dataset underwent five-fold cross validation, where the testing set for each fold constituted 20% of the data. Additionally, 25% of the training set was used as a validation set, resulting in a final split ratio of 60% for training, 20% for validation, and 20% for testing. These partitioned datasets are available on China-VO.

Note to Table 1. (2) Stellar Subclass Classification: these datasets also have an S/N greater than 5 and greater than 10, respectively. (3) Spectral Full Classification: these datasets have no specific limit on the S/N. SC, SS, and FC in the names of the datasets stand for the three tasks of stellar classification, stellar subclass classification, and full classification, respectively. S and L stand for data from SDSS DR18 and LAMOST DR10, respectively.

The preprocessing steps for all spectra, including the SDSS spectra and the LAMOST low-resolution spectra, are consistent because their resolutions are similar. Initially, all spectra are cropped to a uniform dimension of 1 × 3,522, covering wavelengths from 4000 to 9000 Å. Subsequently, each spectrum is normalized using the z-score (Al Shalabi et al. 2006), as represented in Equation (1):

$$x' = \frac{x - \mu}{\sigma}, \qquad (1)$$

where $x$ is the flux, $\mu$ represents the mean flux, and $\sigma$ represents the standard deviation. Figure 1 shows spectral samples from SDSS and LAMOST before and after normalization.
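The z-score normalization described above can be sketched in a few lines. This is a minimal illustration, assuming each cropped spectrum is held as a 1D NumPy array of fluxes; the function name `zscore_normalize` and the small `eps` guard are illustrative assumptions and are not taken from the original pipeline.

```python
import numpy as np


def zscore_normalize(flux, eps=1e-12):
    """Return (flux - mean) / std for a single spectrum.

    `eps` is an assumed guard against division by zero for a
    constant-flux spectrum; the text does not describe one.
    """
    flux = np.asarray(flux, dtype=np.float64)
    mu = flux.mean()
    sigma = flux.std()
    return (flux - mu) / (sigma + eps)


# Synthetic stand-in for one cropped spectrum of length 3,522.
rng = np.random.default_rng(0)
spectrum = rng.normal(loc=50.0, scale=8.0, size=3522)
normalized = zscore_normalize(spectrum)
```

After normalization each spectrum has approximately zero mean and unit standard deviation, so spectra from the two surveys share a common flux scale.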

Convolution and Large-kernel Convolution
CNNs (Krizhevsky et al. 2012; Simonyan & Zisserman 2015; Tan & Le 2021; Wu et al. 2021) have demonstrated exceptional success in the realm of natural image processing. Likewise, in recent years, 1D CNNs have exhibited promising performance in the classification of astronomical spectra (Liu et al. 2019; Zou et al. 2020). Nevertheless, with the advent of transformers (Vaswani et al. 2017) and their application to natural images (Dosovitskiy et al. 2021), several subsequent studies (Liu et al. 2021; Huang et al. 2022; Zhu et al. 2023) have progressively surpassed the majority of CNN models. It is worth noting that transformer-based models demand a substantial amount of feature-rich data for training, rendering them less suitable for datasets characterized by a relatively modest number of features, such as astronomical spectra.
Recent research (Guo et al. 2022; Liu et al. 2022) has underscored that transformer-based models outperform traditional CNNs because of their ability to capture global information rather than solely focusing on local details. Consequently, the performance of CNN-based models can be enhanced by increasing the size of the convolutional kernel to obtain a larger receptive field (Tan & Le 2019; Wang et al. 2023a). In the domain of natural images, the convolutional kernel is typically expanded to a maximum size of 31 × 31 because the computational cost grows with the square of the kernel size. Nonetheless, noteworthy outcomes have been achieved (Ding et al. 2022). Inspired by this concept, we extended the convolution kernel of the 1D CNN to a size of 1 × 405, as depicted in Figure 2 (right panel). The chosen kernel size closely matches the length of the downsampled spectral features (446), allowing us to effectively capture global features. The feature length of 446 is obtained by downsampling the data multiple times, while the kernel size of 1 × 405 is selected through experimental evaluation among various kernel sizes. Such a large kernel is feasible because 1D CNNs have lower computational requirements than their 2D counterparts. Specifically, as shown in Equation (2), the image input is denoted as $I \in \mathbb{R}^{c \times h \times w}$, where $c$ represents the number of channels, and $h$ and $w$ represent the height and width of the image, respectively. The spectral input is denoted as $S \in \mathbb{R}^{c \times l}$, where $l$ represents the length of the spectrum. The two-dimensional convolution kernel for images is represented by $W_I \in \mathbb{R}^{k \times k}$, while the one-dimensional convolution kernel for spectra has size 1 × $k$. Nevertheless, the utilization of sizable convolution kernels in 1D CNNs necessitates substantial computational resources, resulting in protracted training durations and potential overfitting. Section 3.2 elaborates on the strategies employed to mitigate these challenges.
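The cost argument above can be made concrete with the standard multiply-accumulate count for a dense convolution. The expressions below are a hedged sketch, assuming stride 1, "same" padding, and equal numbers of input and output channels $c$; they are the textbook counts and may differ in form from the authors' Equation (2).

```latex
\begin{aligned}
\mathrm{FLOPs}_{\mathrm{2D}} &= h \times w \times k^{2} \times c^{2},\\
\mathrm{FLOPs}_{\mathrm{1D}} &= l \times k \times c^{2}.
\end{aligned}
```

Under these assumptions the cost of a 2D kernel grows quadratically with $k$, whereas the cost of a 1D kernel grows only linearly, which is consistent with 2D kernels being capped near 31 × 31 while a 1 × 405 kernel remains affordable for spectra.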

Multiscale Partial Convolution
In their respective studies, Han et al. (2020) and Zhang et al. (2020) demonstrated that convolutional models often acquire redundant information across various channels, rendering them susceptible to overfitting. The issue of high redundancy in features extracted by regular convolutions for spectra is elaborated in Section 5. Addressing this concern, Chen et al. (2023) introduced Partial Convolution, a technique that selectively applies convolution to a portion of the input channels to extract spatial characteristics while leaving the remainder untouched. This approach offers a viable solution for the model to acquire essential information while significantly reducing computational demands.

Figure 1. Spectral preprocessing (upper panels for original spectra and lower panels for their normalized spectra; left panels for LAMOST and right panels for SDSS).
In this study, we present the multiscale partial convolution structure, which amalgamates the principles of grouped convolution (Ioannou et al. 2017; Xie et al. 2017; Wang et al. 2023b), partial convolution (Chen et al. 2023), and large-kernel convolution (Ding et al. 2022; Guo et al. 2022; Liu et al. 2022). Figure 2 displays the structures of regular convolution, partial convolution, and multiscale partial convolution (MSPC). Regular Convolution applies convolution operations with small kernels to all channels. Partial Convolution, on the other hand, selectively applies small convolution kernels to specific channels while leaving the remaining channels unaltered. MSPC uses group convolution to apply kernels of different sizes to different subsets of the channels and performs no operation on the remaining channels. As depicted in Figure 2 (right panel), MSPC judiciously employs convolution operations on specific channels, facilitating the acquisition of features at various scales through group convolution. This architectural choice empowers the model to encompass both global and local information while mitigating redundancy in feature acquisition.
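The MSPC operation described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' PyTorch implementation: the helper names `conv1d_same` and `mspc_forward`, the equal-width channel groups, the zero "same" padding, and the toy sizes are all assumptions made for the example.

```python
import numpy as np


def conv1d_same(x, weight):
    """Dense 1D cross-correlation with zero 'same' padding.

    x: (c_in, length); weight: (c_out, c_in, k) with odd k.
    """
    k = weight.shape[2]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # windows has shape (c_in, length, k)
    windows = np.lib.stride_tricks.sliding_window_view(xp, k, axis=1)
    return np.einsum("oik,ilk->ol", weight, windows)


def mspc_forward(x, weights):
    """Multiscale partial convolution on x of shape (channels, length).

    Each entry of `weights` convolves its own slice of channels; the
    channels left over pass through unchanged (identity).
    """
    outputs, start = [], 0
    for w in weights:
        width = w.shape[0]
        outputs.append(conv1d_same(x[start:start + width], w))
        start += width
    outputs.append(x[start:])  # untouched identity channels
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
channels, length, group = 16, 512, 2   # assumed toy sizes
kernel_sizes = (5, 45, 405)            # kernel sizes quoted in the text
weights = [rng.normal(size=(group, group, k)) * 0.01 for k in kernel_sizes]
x = rng.normal(size=(channels, length))
y = mspc_forward(x, weights)
```

In this toy configuration 6 of the 16 channels are convolved and 10 pass through untouched, which corresponds to an Identity ratio of 5/8.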
Furthermore, Equation (3) demonstrates that our approach does not significantly increase the computational complexity compared to the regular 1D CNN, despite using larger convolution kernel sizes, denoted by $k_1$, $k_2$, and $k_3$ (5, 45, and 405, respectively, in this study, chosen through ablation experiments). Here, $l$ represents the spectral length, and $c$ represents the number of channels.
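A rough back-of-the-envelope count illustrates the complexity claim. The sketch below assumes that the convolved channels are split equally among the three kernel sizes, that each group is a dense convolution with equal input and output channels, and that the Identity ratio is 5/8 with 128 channels; the function names and the use of length 446 are illustrative choices, not values taken from Equation (3).

```python
def conv1d_flops(length, kernel_size, channels):
    """Multiply-accumulates of a dense 1D convolution with equal in/out channels."""
    return length * kernel_size * channels * channels


def mspc_flops(length, kernel_sizes, group_channels):
    """Sum of the per-group costs; identity channels cost nothing."""
    return sum(conv1d_flops(length, k, group_channels) for k in kernel_sizes)


length = 446             # downsampled feature length quoted in the text
channels = 128           # default channel number
identity_ratio = 5 / 8   # ratio used for the SDSS datasets
kernel_sizes = (5, 45, 405)

convolved = int(channels * (1 - identity_ratio))   # 48 channels are convolved
group_channels = convolved // len(kernel_sizes)    # assumed equal split: 16 each

regular = conv1d_flops(length, 5, channels)
multiscale = mspc_flops(length, kernel_sizes, group_channels)
ratio = multiscale / regular
```

Under these assumptions the multiscale block costs about 1.4 times as much as a dense 1 × 5 convolution over all 128 channels, even though its largest kernel is 81 times longer.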

Overall Structure and Details of the Model
The comprehensive architecture of our model, multiscale partial convolution net (MSPC-Net), is elucidated in Figure 3. This design draws inspiration from MetaFormer (Yu et al. 2022) and is organized into four stages (although only two are depicted in the figure). The model begins with an embedding layer, employing a 1×3 convolution with a stride of 1 to increase the number of channels. Subsequently, a merging layer equipped with a 1×3 convolution and stride 2 is applied before each of the last three stages to downsample the features.

Note. The total number for each category is 400. In terms of performance, higher true positive (TP) values are better, while lower false negative (FN) and false positive (FP) values are better. The entries that exhibit strong performance are highlighted in bold.
Within each stage, we incorporate the Token-mixer and Channel-MLP modules. The Token-mixer integrates the previously discussed MSPC structure. To account for the diverse features captured by MSPC across distinct channels, we introduce a channel attention mechanism in the initial segment of Channel-MLP. To optimize the model size, we adopt the efficient channel attention (ECA; Wang et al. 2020) mechanism, which is recognized for its computational efficiency. This decision is further supported by the ablation experiments presented in Table 7(d). Subsequently, we employ an inverted residual structure comprising two 1×1 convolutional layers and a residual connection. The model's output is processed through the Softmax function to obtain each category's probability without undergoing additional corrective operations:

$$\mathrm{Softmax}(z_n) = \frac{e^{z_n}}{\sum_{j=1}^{K} e^{z_j}},$$

where $z_n$ is the model output for category $n$, $K$ represents the total number of categories, and $n$ denotes the index of the current category.
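The channel attention step can be illustrated with a small NumPy sketch of ECA as described by Wang et al. (2020): global average pooling, a short 1D convolution across neighbouring channels, and a sigmoid gate. The function name `eca`, the fixed three-tap kernel, and the feature shape are assumptions for illustration; in a trained model the kernel weights are learned.

```python
import numpy as np


def eca(x, kernel):
    """Efficient channel attention for features x of shape (channels, length).

    kernel: 1D weights of odd length shared across all channels.
    Returns the re-weighted features and the per-channel gate.
    """
    pooled = x.mean(axis=1)                   # global average pooling -> (channels,)
    pad = len(kernel) // 2
    padded = np.pad(pooled, pad)              # zero padding along the channel axis
    # Cross-correlate along the channel axis so each channel sees its neighbours.
    mixed = np.correlate(padded, kernel, mode="valid")
    gate = 1.0 / (1.0 + np.exp(-mixed))       # sigmoid -> weights in (0, 1)
    return x * gate[:, None], gate


rng = np.random.default_rng(0)
features = rng.normal(size=(128, 446))        # assumed (channels, length)
kernel = np.array([0.25, 0.5, 0.25])          # assumed weights; learned in practice
weighted, gate = eca(features, kernel)
```

Because the only trainable component is the short shared kernel, the mechanism adds just a handful of parameters, which is consistent with the text's remark that the model size remains nearly unchanged.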
For an exhaustive understanding of the model's design, please consult Table 2. Different model architectures were designed specifically for each data source (SDSS and LAMOST), and each parameter was chosen experimentally to give the best results. The model designed for SDSS is relatively shallow and contains fewer Identity connections in the convolutional part. On the other hand, the model designed for LAMOST is deeper and incorporates more Identity connections in the convolutional part. As a result, both models have a similar overall size.

Experiments and Results
In this section, we assess the performance of the MSPC-Net model on each of the ten datasets from SDSS and LAMOST. To facilitate comparison, we employ 1D SSCNN (Liu et al. 2019) and Rac-Net (Zou et al. 2020) for astronomical spectral classification, alongside a 1D implementation of ConvNeXt (Liu et al. 2022) originally designed for computer vision tasks. These comparison models can be obtained from https://github.com/qintianjian-lab/Spectrum-Classification-Models. Furthermore, we conduct comprehensive ablation experiments to corroborate the efficacy of the MSPC-Net model and validate our design choices.

Setup
We conduct experiments to evaluate the model's performance using the ten datasets described in Table 1. The first eight datasets correspond to the ideal cases of stellar classification and stellar subclass classification. Additionally, Dataset FC-S and Dataset FC-L are employed to assess the model's performance in practical application scenarios. To ensure the reliability of our experiments, we conduct five-fold cross validation on each dataset, with the training set, validation set, and test set divided in a ratio of 6:2:2.
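The 6:2:2 partition under cross validation can be sketched as below. This is a plain-Python illustration with an assumed function name `five_fold_splits` and an assumed random seed; it does not stratify by class, and the text does not state whether the authors' splits were stratified.

```python
import random


def five_fold_splits(n_samples, n_folds=5, val_fraction=0.25, seed=0):
    """Yield (train, val, test) index lists for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]                                   # 20% held out for testing
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        n_val = int(len(rest) * val_fraction)             # 25% of the remainder
        yield rest[n_val:], rest[:n_val], test


splits = list(five_fold_splits(1000))
train, val, test = splits[0]
```

With 1,000 samples each fold yields 600 training, 200 validation, and 200 test indices, and every sample appears in exactly one test fold.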
Model Variants: An integral aspect of MSPC-Net's design is its focus on enhancing generalization performance, ensuring consistent and robust results across various spectral datasets. This adaptability hinges on two dynamic parameters: n_div, adjusting the model's width by modifying the partial ...

Training: for all experiments, we empirically use the Adam optimizer (Kingma & Ba 2015). The learning rate is adjusted using cosine annealing with warm restarts (Loshchilov & Hutter 2017), starting at 1e-4 and gradually decreasing to 1e-8; these values were chosen after several experiments. In addition, we empirically add stochastic depth (Huang et al. 2016). The cross-entropy loss function is utilized.
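The learning-rate schedule can be written out directly from the formula of Loshchilov & Hutter (2017). In the sketch below the bounds 1e-4 and 1e-8 come from the text, while the cycle length `t_0`, the multiplier `t_mult`, and the function name are assumptions, since the restart period is not stated. PyTorch provides `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts` for this kind of schedule, although the text does not say which implementation was used.

```python
import math


def cosine_warm_restart_lr(epoch, lr_max=1e-4, lr_min=1e-8, t_0=10, t_mult=1):
    """Learning rate at an integer epoch under cosine annealing with warm restarts."""
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:          # locate the position inside the current cycle
        t_cur -= t_i
        t_i *= t_mult
    cos_term = 1.0 + math.cos(math.pi * t_cur / t_i)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


# Three assumed cycles of ten epochs each.
schedule = [cosine_warm_restart_lr(e) for e in range(30)]
```

The rate decays along a half cosine within each cycle and jumps back to the maximum at every restart.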
Environment: all code is implemented in Python 3.9 using PyTorch 2.0 and executed on hardware with an i9-13900K processor and an RTX-4080 graphics card.

Results
We evaluate each model's performance using accuracy and F1-score as metrics. The cross-validation results for all models on the eight datasets with different folds are presented in Tables 3 and 4. First, let us delve into stellar classification. As shown in Table 3, MSPC-Net outperforms the other models. Transitioning to stellar subclass classification, it is noteworthy that, as depicted in Table 4, MSPC-Net excels in terms of both accuracy and F1-score across all four datasets, despite having a similar number of parameters. We provide category-specific details in Table 5 to facilitate a comprehensive comparison of the models' performance. These details include true positive, false negative top-1, and false positive top-1 values for Dataset SS-S (>5). Due to space limitations, we exclusively compare SSCNN and MSPC-Net, where MSPC-Net outperforms SSCNN across most categories.
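The two metrics can be computed from predicted and true labels as sketched below. The macro average over classes is an assumption, since the text does not state how the per-class F1-scores are aggregated, and the function name and toy labels are illustrative.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / len(pairs), sum(f1_scores) / len(f1_scores)


y_true = ["F", "F", "G", "G", "K", "K"]
y_pred = ["F", "G", "G", "G", "K", "F"]
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
```

The per-class TP, FN, and FP counts inside the loop are the same quantities reported per category in Table 5.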
To further validate the model's feature extraction capability, we conduct data dimensionality reduction using t-SNE (Wattenberg et al. 2016; Ma et al. 2022) for all model-extracted features and visually represent the results. In Figure 4 (left panel), we showcase the feature visualization of Dataset SS-S (>10), focusing solely on five out of the 24 categories for illustrative purposes. It is discernible that all models encounter difficulty in distinguishing between G0 and K0, which aligns with the low true positive quantity for G0 in Table 5. This result is consistent with findings reported by Liu et al. (2015), suggesting a minimal distinction between late G-type stars and early K-type stars. However, in the remaining three categories, notably F0, MSPC-Net outperforms all other models, underscoring MSPC-Net's advantage in feature extraction. For Dataset SS-L (>10), we visualize four of the categories in Figure 4 (right panel), and the overall differentiation is not as pronounced as in Dataset SS-S (>10). MSPC-Net notably exhibits a more robust feature extraction ability than other models for Dataset SS-L (>10).
Following the evaluation of MSPC-Net under ideal conditions (balanced classes; high S/N), we test each model on Dataset FC-S and Dataset FC-L, which closely emulate practical application scenarios. As demonstrated in Table 6, the results consistently reveal MSPC-Net's superiority across all metrics, affirming its applicability in practical scenarios. It is important to note that the relatively low F1-score for each model on Dataset FC-S is attributable to the sample's extreme imbalance.
To further substantiate the impact of S/N on model accuracy, as depicted in Figure 5, we present the distribution of S/N for Dataset FC-S and Dataset FC-L, alongside the corresponding accuracy of each model. It is discernible that the majority of data in Dataset FC-S has a relatively low S/N, whereas the S/N for the data in Dataset FC-L exhibits a marked improvement. We speculate that this dissimilarity in S/N distribution accounts for the notable disparity in accuracy between these two datasets. Moreover, MSPC-Net consistently outperforms alternative models across most S/N intervals in both datasets, underscoring its exceptional performance and robustness.
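An accuracy-versus-S/N curve of the kind shown in Figure 5 can be produced by binning the test spectra by S/N and computing the accuracy inside each bin. The sketch below uses assumed bin edges and a handful of made-up values purely to show the bookkeeping; it does not reproduce the binning used for the figure.

```python
def accuracy_by_snr(snr, correct, edges):
    """Accuracy within each half-open S/N interval [edges[i], edges[i + 1])."""
    totals = [0] * (len(edges) - 1)
    hits = [0] * (len(edges) - 1)
    for value, ok in zip(snr, correct):
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                totals[i] += 1
                hits[i] += int(ok)
                break
    # None marks an interval that contains no spectra.
    return [h / t if t else None for h, t in zip(hits, totals)]


snr = [1.2, 3.4, 4.9, 7.5, 12.0, 25.0, 40.0]       # illustrative S/N values
correct = [False, True, False, True, True, True, True]
edges = [0, 5, 10, 20, 50]                          # assumed bin edges
per_bin = accuracy_by_snr(snr, correct, edges)
```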

Ablation Study
Several ablation experiments are conducted to assess the effectiveness of each component in our model. We conduct ablation experiments across all datasets, with a particular emphasis on showcasing the results for Dataset SS-S (>5). Detailed results of the ablation experiments for the remaining datasets can be found in Appendix A.
Partial convolution: Table 7(a) presents the results for different Identity ratios (the proportion of channels without any operation in the convolution). As the ratio increases, the model size gradually decreases, while the accuracy and F1-score of the model increase, affirming the effectiveness of partial convolution. For the datasets sourced from SDSS, a ratio of 5/8 is utilized, while a ratio of 13/16 is employed for the datasets sourced from LAMOST.
Token-mixer and Channel-MLP: in Table 7(b), we showcase the performance of various multiscale convolutional kernels. In comparison to conventional 1×3 and 1×5 convolutions, our multiscale convolutional kernel demonstrates a notable enhancement in performance without a substantial increase in the number of parameters. Furthermore, we validate the importance of the ECA attention mechanism, as elucidated in Table 7(d). The model's size remains nearly unchanged with the incorporation of ECA, while the performance is also partially improved.
Channels: given the critical role of the number of channels in CNN-based models, we conduct comparative experiments for different channel numbers, as outlined in Table 7(c). Although a channel number of 256 yields better performance, considering both performance and efficiency, the default channel number used in this study is 128.

Discussion
In Section 1, we propose that recent approaches to spectral classification might be prone to overfitting due to redundant features in their models. To validate this hypothesis, we plot the training-set and validation-set accuracy of all models during training on Dataset SS-S (>5) in Figure 6. As shown in Figure 6, accuracy generally improves as the number of epochs increases; the only exceptions are two sudden decreases in training accuracy, which are due to the warm-up learning rate adjustment strategy. Except for MSPC-Net, each model exhibits very significant overfitting. Curves that end before epoch 100 result from early stopping. We observe varying degrees of overfitting both in SSCNN, which is larger than MSPC-Net, and in Rac-Net and ConvNeXt, which are smaller than MSPC-Net. Notably, MSPC-Net demonstrates a higher degree of resilience to overfitting than the other models.
To deepen our comprehension of MSPC-Net's remarkable performance, we provide a visualization of the model's learned representation in Figure 7. By leveraging a multiscale convolutional kernel and partial convolution, MSPC-Net can acquire distinctive features in different channels. In contrast, the regular convolutional model extracts similar features across channels, implying the existence of redundant features that contribute to its subpar performance and susceptibility to overfitting.
Additionally, we conduct an in-depth analysis of the test set from Dataset SC-S (>5). We cross-match the 2,800 stars in the test set with the LAMOST DR10 catalog within a two-arcsecond radius and identify 329 corresponding labels from the LAMOST pipeline. We then categorize these data into two groups: 301 for which MSPC-Net's predictions match the SDSS labels (considered correct predictions), and 28 that show discrepancies (considered incorrect predictions). As illustrated in Figure 8, comparing the SDSS and LAMOST labels for these two subsets of data reveals that the consistency in the MSPC-Net correct prediction group is significantly higher than in the incorrect prediction group.
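A positional cross-match within a two-arcsecond radius can be sketched with the haversine formula for great-circle separation. The brute-force search, the function names, and the toy coordinates below are assumptions for illustration; the text does not describe the tool used for the actual cross-match, and a catalog-scale match would normally rely on a spatial index.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds for coordinates given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra, d_dec = ra2 - ra1, dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600.0


def cross_match(sources, catalog, radius_arcsec=2.0):
    """Return {source index: nearest catalog index} for matches within the radius."""
    matches = {}
    for i, (ra, dec) in enumerate(sources):
        seps = [angular_separation_arcsec(ra, dec, c_ra, c_dec)
                for c_ra, c_dec in catalog]
        j = min(range(len(seps)), key=seps.__getitem__)
        if seps[j] <= radius_arcsec:
            matches[i] = j
    return matches


# Toy (RA, Dec) pairs in degrees: the first is offset by 1", the second by 10".
sources = [(150.0, 2.0), (151.0, 2.5)]
catalog = [(150.0, 2.0 + 1.0 / 3600.0), (151.0, 2.5 + 10.0 / 3600.0)]
matches = cross_match(sources, catalog)
```

Only the first source is matched, because the second lies 10 arcseconds from its nearest catalog entry and therefore falls outside the two-arcsecond radius.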
Upon closer examination of the cases where predictions are inaccurate, most of MSPC-Net's misclassifications (20 out of 28) occur in instances where the classification systems of SDSS and LAMOST are inconsistent. Furthermore, these misclassifications often align with the labels provided by LAMOST (17 out of 28). This observation suggests that MSPC-Net has the potential to identify spectra prone to misclassification, facilitating subsequent corrections.
Additionally, we conduct a visual inspection of these 28 spectra. Figure 9 shows one example, and all 28 spectra are available in the online journal. The outcomes of this inspection are categorized as follows:
1. MSPC-Net's correct predictions where SDSS labels are incorrect (15 instances).
2. Cases that are hard to judge because the sources lie in the transition between two neighboring subclasses (two instances).
3. Instances where MSPC-Net's predictions are incorrect while SDSS labels are accurate (seven instances).
4. Low-quality spectra (four instances), with two of them confirmed by LAMOST spectra to have been accurately predicted by MSPC-Net.
MSPC-Net demonstrates remarkable performance when applied to SDSS and LAMOST datasets, effectively tackling various spectral classification tasks using extensive labeled data. As depicted in Figure 5, its superiority becomes evident particularly in handling low-S/N data, surpassing competing models. This capability fills a crucial gap where traditional template matching methods falter. However, despite these merits, MSPC-Net's reliance on vast labeled datasets presents challenges in specific astronomical tasks. Unlike the more universally adaptable template matching approach, MSPC-Net necessitates retraining due to the typically diverse data sourced from various sky surveys, thereby limiting its versatility across different surveys. Moreover, akin to other deep-learning methodologies, MSPC-Net encounters issues related to limited interpretability. In future work, we aim to enhance MSPC-Net's capabilities through transfer learning techniques (Kolesnikov et al. 2020; Tan et al. 2023) to diminish its dependence on extensive labeled datasets and augment its adaptability across diverse surveys. Additionally, we are actively developing an interactive exploration tool to enhance the interpretability of deep-learning models, thereby broadening their applicability in the domain of astronomical spectroscopy.

Conclusion
This study introduces MSPC-Net, a novel spectral classification method that addresses the limitations of existing astronomical spectral classification models, especially in general datasets. Extensive experiments demonstrate that MSPC-Net brings significant improvements to model classification performance. The proposed method leverages large-kernel and MSPC techniques to enhance accuracy. To reduce the model's number of parameters and mitigate overfitting, partial convolution is employed, selectively excluding specific channels from the convolution operation. Experimental results validate that this approach not only improves model performance but also reduces the number of parameters. Furthermore, the combination of large-kernel convolution and group convolution creates an MSPC block capable of extracting both local and global features, leading to substantial enhancements in model performance. Additionally, the use of grouped convolution ensures that each channel represents distinct features, and the introduction of the ECA mechanism enables the model to learn feature importance without affecting the number of parameters. Experimental validation confirms the effectiveness of these strategies.
To validate the model's performance, we create ten datasets from SDSS DR18 and LAMOST DR10. These datasets represent different aspects of celestial object classification, including stellar subclass classification, stellar classification, and full classification. To create the stellar classification and stellar subclass classification datasets, we filter the data based on the S/N threshold and the number of samples per category. On the other hand, the full classification dataset is randomly chosen without any specific restrictions. We are able to test the model's performance in both ideal and practical application scenarios with these datasets. All of these datasets are further divided into training, test, and validation sets, and we have made them open-source for future use. By doing so, we hope that other researchers and practitioners can also evaluate the performance of their models using these datasets expediently. The results of our evaluation show that all MSPC-Net models performed well in the stellar classification dataset, which has a smaller number of categories. However, in the stellar subclass classification dataset, which has a larger number of categories, as well as in the full classification dataset, our models demonstrate significant performance advantages. These results confirm the effectiveness and robustness of our MSPC-Net model in all applications. Due to its robust feature extraction capabilities and adaptable nature, MSPC-Net holds potential for a wide array of spectral data analysis tasks (e.g., spectral classification and identification of rare celestial objects).
Our work sheds light on the generalization problem that classification models face when confronted with a diverse range of astronomical spectra. By demonstrating the effectiveness of MSPC-Net across various datasets, we contribute valuable insights to the development of robust and versatile classification models for large sky surveys. In summary, MSPC-Net represents a significant advance in the automated classification of astronomical spectra, offering a solution that performs well across diverse datasets. The thorough experimentation and analysis further strengthen the credibility of our findings and highlight the potential applications of our method in large-scale spectroscopic sky survey projects (e.g., LAMOST, SDSS, and DESI).

Figure 3. The overall architecture of MSPC-Net. The macroscopic design of the model is based on MetaFormer (Yu et al. 2022) and has four stages in this study; stages 3 and 4 are omitted for brevity. Embedding or Merging precedes each stage for encoding and downsampling. Each stage contains a Token-mixer and a Channel-MLP.

Figure 4. The t-SNE visualizations of the samples (five classes for Dataset SS-S (>10) and four classes for Dataset SS-L (>10)) generated by different models. Features extracted by the models for Dataset SS-S (>10) (left panel) and Dataset SS-L (>10) (right panel) are reduced in dimension and visualized with t-SNE. The red rectangles highlight regions of clustering error. For Dataset SS-S (>10), all models perform poorly in distinguishing between G0 and K0, but MSPC-Net outperforms the other models on the remaining classes. For Dataset SS-L (>10), no model clearly separates A0 from the other classes, and overall, MSPC-Net clusters better than the other models.

Figure 5. Accuracy as a function of S/N interval (upper panels) and the S/N distribution (lower panels) for Dataset FC-S (left panels) and Dataset FC-L (right panels).

Figure 6. Training accuracy (left panel) and validation accuracy (right panel) curves over training epochs for each model.

Figure 7. Feature visualization. Regular convolution and MSPC-Net extract the features of the spectrum, with the horizontal axis indicating the length of the features and the vertical axis indicating the features of the different channels. Unlike regular convolution, where the features of each channel are very similar, the features of each channel of MSPC-Net show diversity. This visualization highlights the feature extraction capabilities of the multiscale large-kernel convolution (channels 1 to 48) and partial convolution (channels 49 to 128) within the MSPC module. The former illustrates the diversity of extracted features, while the latter showcases the reduced parameter count, notably with the Identity function requiring no additional parameters.

Figure 8. Label consistency between SDSS and LAMOST for the group of correct predictions (left panel) and the group of incorrect predictions (right panel).

Figure 9. An example of MSPC-Net's incorrect predictions. The spectrum is labeled A-type by SDSS and F-type by LAMOST, predicted as F-type by MSPC-Net, and identified as F-type after visual inspection. The templates and spectrum of SDSS are officially provided by SDSS, while the spectrum of LAMOST is officially provided by LAMOST. (The complete figure set (28 images) is available in the online journal.)

Table 2
Model Design Details
Note. Different model architectures are designed specifically for each data source (SDSS and LAMOST), and each parameter value is the best-performing one selected after several experiments.

Table 3
Accuracy and F1-score on Stellar Classification
Note. The entries that exhibit strong performance are highlighted in bold.

Table 4
Accuracy and F1-score on Stellar Subclass Classification
Note. The entries that exhibit strong performance are highlighted in bold.

Table 5
Details of the Classification Results of Each Category of SSCNN and MSPC-Net on Dataset SS-S (>5)

Table 6
Accuracy and F1-score on Full Classification

Table 7
Note. Ablation experiments on Dataset SS-S (>5). (a) Partial convolution. An identity ratio of 5/8 works best. (b) Multiscale large-kernel convolution. The combination of 1 × 5, 1 × 45, and 1 × 405 kernels gives the best results. (c) Channels. 256 channels give the best results, but 128 is chosen to balance performance and efficiency. (d) Attention. ECA attention gives better results. The best performance is in bold, and the model setting adopted in this study is marked with a gray background.
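
As a rough worked example of what the adopted settings imply, the short calculation below splits 128 channels using the 5/8 identity ratio and counts the depthwise kernel weights. The even split of the convolved channels across the three kernel sizes, the omission of bias terms, and the single-kernel baseline are assumptions made for illustration, not figures reported in the tables.

```python
def depthwise_params(channels, kernel_size):
    """Weights of a depthwise 1D convolution: one kernel per channel, no bias."""
    return channels * kernel_size


total_channels = 128
identity_channels = total_channels * 5 // 8             # 5/8 identity ratio -> 80
conv_channels = total_channels - identity_channels      # 48 convolved channels

kernel_sizes = (5, 45, 405)
per_scale = conv_channels // len(kernel_sizes)          # assumed even split -> 16

# Multiscale partial convolution: only 48 channels carry kernel weights.
partial_weights = sum(depthwise_params(per_scale, k) for k in kernel_sizes)

# Illustrative baseline (assumed): every channel uses the largest kernel.
baseline_weights = depthwise_params(total_channels, max(kernel_sizes))

print(identity_channels, conv_channels, per_scale)      # 80 48 16
print(partial_weights, baseline_weights)                # 7280 51840
```

The 48 and 80 channel counts agree with the channel ranges quoted in the caption of Figure 7 (channels 1 to 48 and 49 to 128).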