Enhancing Pulsar Candidate Identification with Self-tuning Pseudolabeling Semisupervised Learning

In the field of astronomy, machine-learning technologies are becoming increasingly crucial for identifying radio pulsars. However, the process of acquiring labeled data, which is both time-consuming and potentially biased, poses a significant limitation to current methodologies. In response to these challenges, this study proposes and validates a self-tuning pseudolabeling semisupervised learning approach. This approach synthesizes a vast unlabeled data set with a considerably smaller set of labeled data, markedly enhancing classifier performance and effectuating a transition from traditional fully supervised learning methods to more efficient radio pulsar detection strategies. Our experimental outcomes demonstrate that even with a training set comprising only 100 labeled pulsar candidates, this method can attain a recall rate of 92.35% and an F1 score of 93.89%. When the number of labeled examples is increased to 800, we observe a further improvement in performance, with the recall rate rising to 97.50% and the F1 score reaching 97.16%. The utility of the semisupervised learning approach is evident even with minimal labeled data, which is a common scenario in the search for pulsars, including in environments like globular clusters. What stands out is the method's capacity to detect pulsar candidates effectively with only a limited number of labeled examples. This emphasizes the robust potential of our approach to facilitate early-stage pulsar surveys and highlights its capability to yield substantial results even when labeled data are in short supply.


Introduction
Pulsars, representing an exceptional category of high-energy celestial objects, hold unparalleled scientific value within the cosmological context. They are characterized by exceedingly rapid rotation periods, which can be as short as 1.396 ms (such as PSR J1748-2446ad; Hessels et al. 2006) and as long as approximately 76 s (such as PSR J0901-4046; Caleb et al. 2022). Their intense magnetic fields, ranging from 10^8 to 10^15 G, facilitate the beaming of periodic radio pulses that are captured by radio telescopes on Earth. Although pulsars emit across the electromagnetic spectrum, our detection capabilities are predominantly tuned to their radio-wave emissions. The exploration and study of pulsars not only enhance our understanding of cosmic evolution but also provide an optimal experimental environment for quantifying important fundamental astrophysical parameters (e.g., Manchester et al. 2001; Cordes et al. 2004; Kramer 2004; Kramer & Wex 2009). Consequently, the pulsar search has emerged as a pivotal research subject within the field of radio astronomy.

Pulsar signals, as intercepted by radio telescopes, furnish an abundance of data, enabling the enhancement of our comprehension of these distinct celestial bodies. The radio pulses emitted by pulsars are intrinsically linked to their magnetic field intensity and rotational velocity, rendering this information indispensable for scrutinizing the mechanisms of pulsar genesis, evolution, and eventual extinction. The stability of pulsar rotational periods renders these objects exceptionally precise natural clocks, facilitating their utility in astrometric studies that require high-precision timing. Furthermore, pulsars serve as invaluable tools for probing the distribution of the interstellar medium and for investigating the architecture of the Galactic magnetic field (Manchester 1972; Han et al. 2006). Nevertheless, despite the sophisticated detection prowess of modern telescopes, substantial challenges persist in the detection and identification of pulsars (e.g., Lorimer & Kramer 2005; Rosen et al. 2013; Stovall et al. 2013). On the one hand, the radio pulses emitted by pulsars are frequently faint, exhibiting flux densities that typically span from μJy to mJy; this faint output, coupled with background astronomical noise and terrestrial interference, substantially complicates their distinction and reliable identification. On the other hand, the escalating volume of observational data has outpaced the capacity of traditional manual screening and identification techniques (Lyon et al. 2016). As a result, the development of automated and efficient methodologies for pulsar identification has evolved into a crucial line of research.

As deep-learning techniques continue to evolve, automatic feature-learning methods underpinned by these techniques have achieved noteworthy advancements in pulsar candidate identification. Such methods are capable of autonomously extracting effective features from the data, circumventing the challenges associated with manual feature design. For example, Zhu et al. (2014) employed convolutional neural networks (CNNs) to directly extract latent features from the diagnostic plots of pulsar candidate samples, integrating this approach with support vector machines and logistic regression models to advance the identification of pulsar candidates. Other researchers adopted more sophisticated models, such as the hybrid ensemble method of Y. Wang et al. (2019) and the multi-input CNN method of Zhao et al. (2022). Among them, deep residual networks have been widely applied to problems with intricate structures and massive data. For instance, Liu et al. (2021) and Yin et al. (2022) employed ResNet-based models for pulsar candidate identification and achieved relatively good performance. ResNet allows layers to explicitly fit residual mappings by adding shortcut connections, which resolves performance degradation with increasing depth. Additionally, researchers have explored generative models for pulsar candidate identification. For example, Bao et al. (2022) proposed a method combining deep learning and generative models: new pulsar candidate samples are produced by a generative model and then classified with a deep-learning model. This leverages both the generative capacity of generative models and the classification strength of deep-learning models, achieving superior identification performance.
From these studies, we observe a trend of researchers applying more sophisticated models and advanced techniques for pulsar candidate identification. These novel methods and technologies open up new possibilities for improving identification accuracy. However, although remarkable success has been achieved in various applications, processing and analyzing the massive amount of unlabeled data collected daily by radio telescopes remains challenging. Moreover, due to the rarity of pulsar signals and the high cost of acquiring labeled data, researchers often have access to only limited labeled data. This scarcity of labeled data restricts the performance of current approaches. In this context, semisupervised learning has emerged as a promising solution. It strives to effectively connect scarce labeled data with abundant unlabeled data to enhance model generalization.
In pulsar candidate identification, the number of real pulsar candidates (positive samples) is relatively small compared to the vast amount of noise or interference (negative samples). Moreover, accurately labeling pulsar candidates is extremely difficult due to the complex physics of pulsars and the impact of noise and interference on observational data. In computer vision, semisupervised image classification primarily leverages two key techniques: consistency regularization (Abuduweili et al. 2021) and pseudolabeling (Lee et al. 2013). Current advanced algorithms like MixMatch (Berthelot et al. 2019b), ReMixMatch (Berthelot et al. 2019a), FixMatch (Sohn et al. 2020), and FeatMatch (Kuo et al. 2020) typically hybridize both approaches. Consistency regularization dominates semisupervised learning by applying tiny perturbations to input data via data augmentation so that the model produces consistent outputs, thereby improving generalization on unlabeled data. However, its performance often relies on appropriate data augmentation strategies. For specific tasks like pulsar candidate identification, selecting suitable data augmentation is challenging given the data characteristics. Moreover, excellent consistency regularization requires extensive upfront work, including the search for data-set-specific augmentation. For instance, top-performing strategies on CIFAR like AutoAugment (Cubuk et al. 2018) and RandAugment (Cubuk et al. 2020) were proposed 1-2 yr before the optimal models. In summary, semisupervised image classification succeeds by effectively combining consistency regularization and pseudolabeling while requiring proper data augmentation to optimize consistency regularization.
In specialized tasks like pulsar candidate identification, the candidate images are generated from highly domain-specific and intricate astronomical data, reflecting the unique time-frequency signatures of pulsars (Eatough et al. 2010). Applying common data augmentation like rotation, scaling, and color change may disrupt these subtle features and distort or misrepresent the data. While data augmentation aims to improve model generalization by introducing variations, complex augmentation here may destroy the physical meaning of the original data. Thus, consistency regularization based on data augmentation is unsuitable for pulsar candidate identification. Additionally, a common pitfall of pseudolabeling is assigning high confidence to all samples regardless of label correctness. Training on a large number of incorrectly labeled unlabeled samples introduces substantial noise, severely impacting model performance.
To address these issues, we propose a self-adjusting pseudolabeling semisupervised approach integrating pseudolabel generation, confidence-threshold adjustment, and dynamic weight allocation. This effectively resolves the problems of low pseudolabel utilization caused by overly high confidence thresholds and of imbalanced learning between categories, improving semisupervised performance. Moreover, our method dynamically adjusts sample weights based on predicted confidence, enabling the model to optimize the learning process. Specifically, it prioritizes high-confidence samples first, then gradually learns lower-confidence ones. This staged, prioritized strategy makes learning more efficient and targeted.

Data Introduction
We use two radio telescope data sets for our pulsar identification models: FAST (H. Wang et al. 2019) and HTRU (Morello et al. 2014). FAST is collected by the Five-hundred-meter Aperture Spherical Radio Telescope (Han et al. 2021). HTRU is from the High Time Resolution Universe midlatitude pulsar survey observations (Keith et al. 2010). The pulsar candidates in the data sets are processed with the PRESTO software (Ransom 2001) to obtain diagnostic feature plots: pulse profiles, subband plots, and subintegration plots (Figure 1). The pulse profile shows the average pulse over all observed frequencies, often a single narrow peak for pulsars. Subband plots show the superimposed pulses at different frequencies. Subintegration plots depict the variation over time of the pulse intensity averaged across the frequency bandwidth of the receiver. Brighter colors indicate larger amplitudes. These informative diagnostic plots help accurately identify real pulsar candidates. In Figure 1, we adopt a rasterization process to convert the profile into a pixelated image and merge it with the subband and subintegration plots to generate a red, green, and blue (RGB) image. This approach is advantageous for the subsequent data processing required by the generative model. It should be noted that, in addition to the three aforementioned diagnostic plots, the dispersion measure (DM) curve is also a vital characteristic for pulsar candidate identification. Although these four diagnostic plots are interrelated, our experiments have shown that current generative models fail to accurately replicate the nonzero signature inherent to pulsar DM curves. Including these imprecise generated DM data could negatively impact the final model's efficacy. Therefore, we opted not to incorporate DM curve data in the training data set for model construction.
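The channel-merging step described above can be sketched as follows. This is a minimal illustration only, assuming the three diagnostic plots have already been rasterized to a common 2D shape; the function name and the min-max normalization are ours, not part of the PRESTO pipeline.

```python
import numpy as np

def merge_diagnostics(profile, subband, subint):
    """Stack the three diagnostic plots into one RGB image.

    Each input is assumed to be a 2D array already rasterized to the
    same (H, W) shape; each is min-max normalized to [0, 255] and
    placed in its own color channel.
    """
    def norm(a):
        a = a.astype(float)
        rng = a.max() - a.min()
        return np.zeros_like(a) if rng == 0 else (a - a.min()) / rng * 255
    rgb = np.stack([norm(profile), norm(subband), norm(subint)], axis=-1)
    return rgb.astype(np.uint8)

# Toy usage with random stand-ins for the rasterized diagnostic plots.
img = merge_diagnostics(np.random.rand(64, 64),
                        np.random.rand(64, 64),
                        np.random.rand(64, 64))
```

Keeping each diagnostic in a separate channel lets a standard RGB-input CNN consume all three plots at once.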

Evaluation Metrics
In the pulsar candidate identification task, we mainly use four metrics to evaluate the model performance: precision, recall, specificity, and F1 score. These four metrics are commonly used evaluation metrics in binary classification problems and can comprehensively reflect the model performance.
Precision. Precision is the proportion of samples predicted as positive by the model (i.e., real pulsar candidates) that are actually positive. It measures the accuracy of the model's positive predictions. The calculation formula for precision is

Precision = TP / (TP + FP),

where TP (true positives) is the number of samples predicted as positive that are actually positive, and FP (false positives) is the number of samples predicted as positive that are actually negative.

Recall. Recall is the proportion of all actual positive samples that are correctly predicted by the model. It measures the model's detection rate for actual positives. The calculation formula for recall is

Recall = TP / (TP + FN),

where FN (false negatives) is the number of samples predicted as negative that are actually positive.

Specificity. Specificity is the proportion of all actual negative samples that are correctly predicted by the model. It measures the model's detection rate for actual negatives. The calculation formula for specificity is

Specificity = TN / (TN + FP),

where TN (true negatives) is the number of samples predicted as negative that are actually negative.

F1 score. The F1 score is the harmonic mean of precision and recall, balancing the two to reflect model performance more comprehensively. When both precision and recall matter, the F1 score can be used as the evaluation metric. The calculation formula for the F1 score is

F1 = 2 × Precision × Recall / (Precision + Recall).

In the experiments, we calculate these four metrics to comprehensively evaluate the performance of our model on the pulsar candidate identification task.
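As a quick sanity check, the four metrics follow directly from the confusion-matrix counts; the helper below is a minimal sketch (function name and example counts are ours).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, specificity, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # detection rate for actual positives
    specificity = tn / (tn + fp)       # detection rate for actual negatives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, specificity, f1

# Example: 90 true positives, 10 false positives, 5 false negatives, 95 true negatives.
p, r, s, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

With these counts, precision is 0.90 and recall is 90/95, so the F1 score is their harmonic mean, 12/13 ≈ 0.923.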

Methods
In the realm of pulsar candidate recognition, we are confronted with the dual hurdles of a marked class imbalance, where negative examples vastly outnumber positive ones, and the onerous task of manually annotating data. Such an imbalance predisposes models to bias toward the majority class in their predictions, which can hinder the accurate detection of pulsar candidates, the minority class. Moreover, manual annotation is a time-intensive and laborious process, making the compilation of expansive labeled data sets a formidable challenge.
To mitigate these challenges, we have devised a novel approach that integrates the principles of semisupervised learning with the techniques of data augmentation.The cornerstone of our technique involves initially training a model on a small, labeled data set, after which this model is employed to assign pseudolabels to a substantially larger corpus of unlabeled data.Incorporating these pseudolabeled examples into subsequent training iterations enables the model to benefit from a wider data pool, which helps balance class representations and decreases the dependency on manual data labeling.
In tandem with semisupervised learning, we engage in data augmentation to counteract the imbalance further and bolster the model's ability to generalize.We augment unlabeled pulsar candidates using a vector-quantized variational autoencoder (VQVAE; Van Den Oord & Vinyals 2017) and GPT-2 (Radford et al. 2019).Specifically, we use the VQVAE to encode unlabeled pulsar candidates into a compact latent form.Following this, we employ GPT-2 to generate novel latent representations.These are then transformed back into the original data space using the decoder of VQVAE to yield new, artificial pulsar candidates.This innovative approach maximizes the use of available unlabeled data and enriches the training set with additional pseudopulsar candidates, thereby supplementing the limited labeled data and fortifying the overall balance within the data set.

Self-tuning Pseudolabeling Semisupervised Learning
Semisupervised learning combines the strengths of supervised and unsupervised learning and is well suited to scenarios with scarce labeled but abundant unlabeled data, such as pulsar candidate identification. Major semisupervised approaches include deep generative, consistency regularization, graph-based, pseudolabeling, and hybrid methods (Yang et al. 2022). Consistency regularization is based on the assumption that the addition of noise or perturbation to the input data does not significantly change the predictions of the model. However, the complex physics of pulsars and the high demands on data augmentation make consistency regularization challenging. In addition, although graph-based methods can effectively exploit data topology, their intensive resource requirements also limit their applicability to pulsar candidate identification. Considering these factors, we adopt pseudolabeling to address pulsar candidate identification. Pseudolabeling generates labels for unlabeled data using the model's self-predictions, then leverages these pseudolabels for supervised learning. This significantly reduces the need for labeled data while improving learning via unlabeled data. However, as pseudolabels are model-generated, overfitting can occur. The model may also be overconfident about predictions for some unlabeled samples, which could be erroneous, degrading pseudolabel quality and learning. Therefore, to accommodate the characteristics of pulsar candidates, we improve pseudolabeling and propose a self-adjusting pseudolabeling semisupervised approach for enhanced pulsar candidate identification by optimizing model training, feature selection, and pseudolabel generation strategies.
The pseudolabeling technique demonstrates certain efficacy within the context of semisupervised learning. By utilizing model self-predictions, it generates pseudolabels for unlabeled data. These pseudolabeled data are then applied in supervised learning, improving learning efficiency by making use of the unlabeled data. However, this strategy is not without significant shortcomings and challenges. A critical task lies in the selection strategy for pseudolabeled samples, for which a unified solution is currently nonexistent. Ideally, samples for which the model can make accurate predictions should be selected for generating pseudolabels. However, if the number of selected samples is excessive, it could introduce considerable noise, thereby diminishing the model's learning effectiveness. Conversely, if too few samples are selected, the model may not fully exploit the unlabeled data. As such, striking an appropriate balance between quantity and quality to optimize the selection of pseudolabeled samples remains an unresolved issue (Chen et al. 2023).
In semisupervised learning, we leverage a pseudolabeling strategy to fully exploit unlabeled data. To further optimize this approach, we designed a dynamic weighting system that adjusts sample weights based on the prediction confidence of each unlabeled sample.
First, our model generates prediction probabilities for each unlabeled sample in the given batch. These probabilities reflect the model's confidence in its predictions. We rank the samples based on these probabilities and then adjust the numerical range of the rankings such that the adjusted rankings fall within the interval defined by the maximum and minimum prediction probabilities. Specifically, for each sample, we calculate its adjusted ranking, r_i*, which not only reflects its relative order within the entire batch but also maps this order to an interval bounded by the highest and lowest prediction probabilities. This adjustment is designed to maintain the properties of the ranking while recalibrating the rank values for each sample according to the actual distribution range of the prediction probabilities.
Subsequently, we establish a threshold for prediction confidence, initially set at 0.90 in our experiments. When r_i* falls below this threshold, we employ a truncated Gaussian function to adjust the corresponding sample weights. The Gaussian function is unimodal, peaking at the mean and decaying symmetrically on either side, which ensures that the sample weights reach a maximum near the area of highest confidence. The weight calculation formula is

w_i = max_l · exp(−(r_i* − μ_t)² / (2σ²)), if r_i* < μ_t,
w_i = max_l, otherwise,

where w_i represents the weight attributed to the ith sample and σ controls the width of the Gaussian decay. This weight is adjusted based on the confidence of the model's prediction, serving as a significant factor that influences the model's learning from this sample. Another vital parameter, max_l, denotes the maximum weight that can be allocated to any given sample; by setting an upper limit on the weights, we prevent any particular sample from exerting an overly dominant influence on the learning process. The threshold for prediction confidence is represented by μ_t. Should the normalized ranking r_i* fall below this threshold, the weight w_i of the ith sample is reduced according to the Gaussian decay; otherwise, it remains at the maximum value, max_l. Through this mechanism, our model effectively balances the influence of each sample based on its prediction confidence.
Thus, each unlabeled sample's weight depends jointly on its prediction confidence and its rank within the batch. This focuses model training on unlabeled data with higher prediction reliability, thereby more effectively leveraging unlabeled data.
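The rank adjustment and truncated-Gaussian weighting described above can be sketched as follows. This is our illustrative reading of the scheme, not the paper's code; in particular the linear rescaling of ranks and the width parameter sigma are assumptions.

```python
import numpy as np

def pseudo_label_weights(probs, mu_t=0.90, max_l=1.0, sigma=0.1):
    """Rank-based truncated-Gaussian weighting for pseudolabeled samples.

    probs : predicted confidences for one batch of unlabeled samples.
    mu_t  : confidence threshold (initially 0.90 in the paper).
    max_l : maximum weight any sample may receive.
    sigma : width of the Gaussian decay (assumed hyperparameter).
    """
    probs = np.asarray(probs, dtype=float)
    # Rank samples by confidence, then rescale the ranks linearly so they
    # span [probs.min(), probs.max()] -- the "adjusted ranking" r_i*.
    order = probs.argsort().argsort()  # 0 = lowest confidence
    if len(probs) > 1:
        r_star = probs.min() + order / (len(probs) - 1) * (probs.max() - probs.min())
    else:
        r_star = probs.copy()
    # Truncated Gaussian: full weight at/above the threshold,
    # Gaussian decay below it.
    return np.where(r_star >= mu_t,
                    max_l,
                    max_l * np.exp(-(r_star - mu_t) ** 2 / (2 * sigma ** 2)))

weights = pseudo_label_weights([0.99, 0.95, 0.60, 0.30])
```

In this toy batch, only the top-ranked sample keeps the full weight; the others decay smoothly toward zero as their adjusted rank falls further below the threshold.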
Furthermore, in semisupervised learning, a delicate balance must be maintained between the loss from labeled data and the loss from pseudolabeled data. A commonly employed technique for this purpose leverages a weighting coefficient. The loss from labeled data is calculated directly. Conversely, the loss from pseudolabeled data is computed and then multiplied by an adjustable coefficient. This coefficient can be tuned based on specific requirements as well as the quality of the pseudolabels.
Specifically, given a data set with N labeled and M unlabeled samples and model parameters θ, the loss on a labeled sample is L(x_i, y_i; θ) and the loss on an unlabeled sample is L(x_j, ŷ_j; θ), where

L(x, ŷ; θ) = −Σ_y 1(ŷ = y) log P(y | x; θ). (6)

Here, ŷ_j is the predicted pseudolabel with maximum probability.
With x_i, y_i denoting labeled features and labels and x_j, ŷ_j denoting unlabeled features and pseudolabels, the total loss is

L_total = (1/N) Σ_{i=1}^{N} L(x_i, y_i; θ) + α · (1/M) Σ_{j=1}^{M} L(x_j, ŷ_j; θ),

where α is a weighting coefficient balancing the labeled and pseudolabeled losses. However, a potential problem with this method is that the pseudolabel quality for unlabeled samples may vary significantly: some pseudolabels may be accurate, while others are totally erroneous. To address this, we introduce dynamic weights w_j that adjust each unlabeled sample's contribution based on its prediction confidence. The loss function of the model can then be further modified as

L_total = (1/N) Σ_{i=1}^{N} L(x_i, y_i; θ) + α · (1/M) Σ_{j=1}^{M} w_j L(x_j, ŷ_j; θ).

Setting the confidence threshold is also challenging. An excessively high threshold may discard many uncertain pseudolabels, causing class imbalance and underutilization of the unlabeled data.
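The combination of the labeled loss with the confidence-weighted pseudolabeled loss can be written compactly; the sketch below is a minimal framework-agnostic version (per-sample losses are passed in as plain numbers, and the function name is ours).

```python
import numpy as np

def semi_supervised_loss(labeled_nll, unlabeled_nll, weights, alpha=1.0):
    """Total loss: mean labeled cross-entropy plus alpha times the
    confidence-weighted mean cross-entropy on pseudolabeled samples.

    labeled_nll   : per-sample negative log-likelihoods on labeled data, shape (N,)
    unlabeled_nll : per-sample negative log-likelihoods under the pseudolabels, shape (M,)
    weights       : dynamic weights w_j from the confidence-based scheme, shape (M,)
    alpha         : coefficient balancing the two terms.
    """
    labeled_term = np.mean(labeled_nll)
    unlabeled_term = np.mean(np.asarray(weights) * np.asarray(unlabeled_nll))
    return labeled_term + alpha * unlabeled_term

loss = semi_supervised_loss([0.2, 0.4], [0.1, 0.9], weights=[1.0, 0.5], alpha=0.5)
```

Down-weighting the second unlabeled sample halves its contribution, so a poorly labeled sample cannot dominate the pseudolabeled term.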
However, lowering the threshold early to include more pseudolabels risks introducing lower-quality ones. Balancing pseudolabel utilization and quality through an adaptive threshold requires further research.
We take a performance-based dynamic adjustment approach. Specifically, after each training epoch, the validation performance adjusts the confidence threshold: improved performance increases the threshold, as it implies more reliable predictions, while degraded performance decreases the threshold, as it implies less reliable predictions.
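A minimal sketch of this epoch-wise update rule follows; the fixed step size and the clipping bounds are assumptions of ours, not values given in the text.

```python
def update_threshold(threshold, val_f1, prev_val_f1, step=0.01,
                     lo=0.50, hi=0.99):
    """Performance-based confidence-threshold update after each epoch.

    Raise the threshold when validation performance improved (predictions
    look more reliable); lower it when performance degraded. Step size and
    clipping bounds are illustrative hyperparameters.
    """
    if val_f1 > prev_val_f1:
        threshold += step
    elif val_f1 < prev_val_f1:
        threshold -= step
    return min(max(threshold, lo), hi)

t = update_threshold(0.90, val_f1=0.95, prev_val_f1=0.93)  # improved -> raise
```

Clipping keeps the threshold in a sensible range so it can neither accept everything nor reject everything after a long streak of updates.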

VQVAE and GPT-2
We propose combining VQVAE and GPT-2 to generate pulsar candidate data, addressing the data imbalance as shown in Figure 3.
Autoencoders compress and decompress data using encoder and decoder neural networks. The encoder compresses the input X into a lower-dimensional latent representation Z. The decoder attempts to reconstruct the original input X′ from Z. Variational autoencoders (VAEs) extend autoencoders by constraining the latent space Z, usually to a Gaussian distribution. New samples are drawn from this distribution and decoded.
However, for some tasks, like pulsar candidate identification, the real data may not fit predefined distributions. VQVAEs can help here. VQVAEs use vector quantization to discretize the latent space. Like VAEs, they have encoder, latent variable, and decoder stages. However, the discrete latent representation is obtained by a nearest-neighbor lookup in a learned codebook. This makes the latent distribution more controllable than a Gaussian. Thus, VQVAEs can generate more informative images for pulsar candidate identification.
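The nearest-neighbor codebook lookup at the heart of the quantization step can be sketched as follows (a toy NumPy illustration with a hypothetical 3-entry codebook, not the trained model's codebook).

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Nearest-neighbor codebook lookup, the core of VQVAE quantization.

    z_e      : encoder outputs, shape (n, d)
    codebook : learned embedding vectors, shape (K, d)
    Returns the chosen codebook indices and the quantized vectors z_q.
    """
    # Squared Euclidean distance from every encoder output to every code.
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
idx, z_q = vector_quantize(np.array([[0.9, 1.2], [0.1, -0.1]]), codebook)
```

Each continuous encoder output is thus snapped to its closest code, and only the discrete index needs to be modeled by the autoregressive prior.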
The VQVAE objective includes a reconstruction error and a vector quantization error. Referring to the literature (Van Den Oord & Vinyals 2017), we give the final loss function as shown in Equation (10):

L = log p(x | z_q) + ||sg[z] − e||²₂ + β ||z − sg[e]||²₂, (10)

where z is a d-dimensional vector or a matrix of any size, z_q denotes the codebook vector e closest to z, sg[·] is the stop-gradient operator, and β weights the commitment term.
GPT-2 models images autoregressively, pixel by pixel, decomposing the joint distribution into a product of conditional probabilities:

p(x) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i−1}).
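This chain-rule factorization can be illustrated with a toy sketch; the helper and the uniform toy conditional below are ours, purely for illustration, and GPT-2's learned conditionals would take the conditional's place.

```python
import numpy as np

def joint_log_prob(tokens, cond_prob):
    """Log of the joint probability of a token sequence under the
    chain-rule factorization p(x) = prod_i p(x_i | x_<i).

    cond_prob(prefix) must return a probability vector over the next token.
    """
    logp = 0.0
    for i, t in enumerate(tokens):
        logp += np.log(cond_prob(tokens[:i])[t])
    return logp

# Toy conditional: a uniform distribution over 4 codebook indices.
uniform = lambda prefix: np.full(4, 0.25)
lp = joint_log_prob([0, 3, 1], uniform)  # = 3 * log(0.25)
```

Sampling works the same way in reverse: draw each token from cond_prob on the growing prefix, which is how new latent sequences are generated before the VQVAE decoder maps them back to images.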

Data Generation and Partitioning
The FAST and HTRU data sets collectively comprise 2287 pulsar samples and 100,321 nonpulsar samples. Specifically, the FAST data set includes 1091 pulsar samples and 10,322 nonpulsar samples, whereas the HTRU data set contains 1196 pulsar samples and 89,999 nonpulsar samples. The quantity of pulsar samples in both data sets is thus significantly smaller than that of the nonpulsar samples, resulting in a severe imbalance in the sample distribution.
To address this challenge of sample imbalance, we employed the VQVAE and GPT-2 models for data augmentation. More specifically, we used the three types of diagnostic plots (the profile plot, the subband plot, and the subintegration plot) as input features. The VQVAE and GPT-2 models were trained according to the method delineated in Section 3.2. The trained models were then used to generate new pulsar samples. In total, we generated 50,000 new pulsar samples, as depicted in Figure 4.
Our strategy for data partitioning in this study is as follows. The training data set comprises both labeled and unlabeled data. The labeled data include 400 pulsar samples and 400 nonpulsar samples, while the unlabeled data consist of 50,000 pulsar samples generated by the VQVAE and GPT-2 models and 50,000 nonpulsar samples. The validation data set is drawn entirely from the original data set and contains 228 pulsar samples and 772 nonpulsar samples. The test data set consists of 1173 pulsar samples and an equal number of nonpulsar samples, where the nonpulsar samples are randomly selected from the remaining stock at each test iteration, as outlined in Table 1. This data partitioning strategy ensures that our model has ample data for learning during the training phase and allows for a precise evaluation of the model's performance during the testing phase. Additionally, by using the generated samples only in the training set, we prevent their potential impact on the evaluation of the model's performance, thereby ensuring the reliability of our evaluation results.

Experimental Results and Analysis
In this investigation, we utilized WideResNet (Zagoruyko & Komodakis 2016) as the underlying model architecture and, within the semisupervised learning framework, contrasted three distinct methodologies: the heuristic threshold method, the adaptive threshold method, and our newly proposed self-tuning pseudolabel method. The WideResNet configuration we adopted features a depth of 28 layers and a widening factor of 2. Initially, concerning the outcomes on the training set, all three methodologies surpassed 99% in terms of accuracy, recall, and F1 score, signifying that all models were capable of effectively fitting the training data. However, a model's performance cannot be fully represented solely by its fit to the training data. A high-quality model should not only excel on the training data but, more crucially, should possess robust generalizability; i.e., it should maintain high performance on the test data set. Therefore, while the results on the training data hold importance, they cannot serve as the sole criterion for assessing model performance. A comprehensive analysis necessitates the incorporation of results from test data. In Figure 5, we conduct a detailed comparison of the heuristic threshold method, the adaptive threshold method, and the self-adjusting pseudolabel method across a range of labeled sample sizes (400, 600, 800, 1000, and 1200). Relying on a fixed threshold of 0.90, the heuristic threshold method showed limited generalization across varying sample sizes, with its peak F1 score reaching only 88.25% with 800 labeled samples. In contrast, the adaptive threshold method, which dynamically adjusts its threshold, consistently outperformed the heuristic threshold method across all sample sizes, achieving a top F1 score of 91.31%. Remarkably, our self-adjusting pseudolabel method demonstrated superior performance across all labeled sample sizes, with F1 scores exceeding 95%. Notably, as the labeled sample size increased beyond 800, its performance leveled off, stabilizing around 97%. Therefore, although the three methods exhibited similar performances on the training set, the self-adjusting pseudolabel method showcased significantly better generalization and recognition accuracy on the test set than the other methods. This conclusively validates the efficacy and distinct advantage of the proposed approach. When comparing our semisupervised learning methods, we also conducted an in-depth analysis of the performance of the same WideResNet architecture in a fully supervised context. In this setup, the model was trained exclusively with actual labeled training data, without incorporating any VQVAE-generated images. Given the severe class imbalance present in the training data, particularly the scarcity of pulsar samples, the purely supervised model significantly underperformed our semisupervised approaches on the test data set. Specifically, across 10 tests, the supervised model achieved an average recall of 85.57% and an F1 score of 91.25%.
In selecting the quantity of training samples, we executed the experiments detailed in Table 2. The labeled samples used for training consist of equal numbers of pulsar and nonpulsar samples. With 100 or 200 labeled samples, the model could not attain a satisfactory level of generalization owing to the limited sample size, resulting in lower recall rates and F1 scores on the test data set. With 400 labeled samples, performance improved somewhat, but the test metrics remained unimpressive because the numbers of positive and negative samples were still limited. A significant enhancement was observed when the sample size reached 800, with both the recall rate and F1 score surpassing 97%, suggesting that the larger sample size enriches the model's feature learning. Further increasing the sample size to 1200 or even 1600 produced no considerable improvement in the test metrics, yielding outcomes similar to those with 800 samples. In short, a moderate increase in sample size improves performance, though the gains progressively plateau; considering computational costs, a sample size of 800 is deemed optimal. Additionally, we evaluated the HTRU and FAST samples from the test set separately. For HTRU data, the model achieved a high precision of 97.70%, with a recall rate of 98.98% and an F1 score of 98.33%. For FAST data, the precision was slightly lower at 94.96%, though the recall rate remained comparably high at 98.83%, giving an F1 score of 96.86%. These findings suggest that the model performs better on HTRU data than on FAST data, a disparity that may be attributable to the unique feature composition of the HTRU data and the characteristics of the generated data. Despite this distinction, the overall performance of the model is consistent with leading models in the field, exhibiting its efficacy and competitive advantage.
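As a quick consistency check, the reported F1 scores follow directly from the harmonic mean of precision and recall; the figures below are the HTRU and FAST numbers quoted above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# HTRU: precision 97.70%, recall 98.98% -> F1 close to 98.33%
htru_f1 = f1_score(0.9770, 0.9898)

# FAST: precision 94.96%, recall 98.83% -> F1 close to 96.86%
fast_f1 = f1_score(0.9496, 0.9883)
```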

Comparison with Other Methods
In this study, we undertake a comprehensive comparison of several models that perform strongly in typical supervised and semisupervised learning tasks: the fully supervised models PICS-resnet (H. Wang et al. 2019) and CCNN (Zeng et al. 2020) and the semisupervised model SGAN (Balakrishnan et al. 2021). We then juxtapose these models with our newly proposed model for a comparative analysis. To ensure a fair comparison, we first retrained PICS-resnet, CCNN, and SGAN, so that all models were trained on an identical data set and subsequently evaluated on the same test data set; this guarantees the reliability and fairness of our comparative results. The evaluation outcomes are presented in Table 3. From these results, it is evident that the proposed model excels across a range of evaluation metrics, including accuracy, recall, and F1 score, underscoring its efficacy in tackling such tasks. However, while our model demonstrates superior overall performance, there may be specific categories or scenarios in which other models outperform ours; choosing the most appropriate model therefore requires careful consideration of the application context and the unique characteristics of the data.
Additionally, an interesting observation from our study is that, despite SGAN being a semisupervised model, it does not outperform the fully supervised models. This may be attributable to SGAN's generator component, which may not fully capture the true data distribution when generating new samples; as a result, the generated samples may not help improve the model's performance. This finding suggests that the success of semisupervised learning hinges not just on the ability to leverage unlabeled data but also on how effectively those data are utilized. In summary, our experimental results underscore the robust performance and considerable potential of our proposed model in handling this task. Moreover, the research carried out by Bethapudi & Desai (2018) on the HTRU data set has shown that machine-learning models built on manually extracted features can also yield commendable results.

Discussion of the Selection of Unlabeled Samples and the Setting of Confidence Thresholds
The selection of unlabeled samples is crucial during training. Initially, the model fits simple samples well, so relatively simple unlabeled samples should be selected for auxiliary training; as the model iterates and its generalization improves, more challenging samples can be trained on and recognized. We employ Equation (5) to assign weights to the unlabeled samples, considering a sample recognizable by the current model when its confidence surpasses a certain threshold. Samples meeting this criterion are assigned higher weights in subsequent computations, while those below the threshold receive lower weights, minimizing their impact on model training. Figure 6 illustrates the proportion of selected unlabeled samples surpassing the confidence threshold and the corresponding threshold value at each training iteration. The figure reveals an increasing trend in both the selected-sample proportion and the confidence threshold, indicating continuous improvement in training effectiveness. After the 120th iteration, however, the confidence threshold starts to decrease because the selection of unlabeled samples reaches saturation and further improvement in model performance becomes limited.
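The weighting scheme can be sketched as follows. Since Equation (5) is not reproduced in this excerpt, the hard binary weighting below (full weight above the confidence threshold, a small fixed weight below it) is an illustrative assumption, and `pseudolabel_weights` is a hypothetical helper name.

```python
def pseudolabel_weights(probs, threshold, low_weight=0.0):
    """Weight unlabeled samples by the current model's confidence.

    probs: per-sample class-probability lists (softmax outputs).
    A sample whose highest class probability exceeds `threshold`
    is treated as recognizable by the current model and gets full
    weight; the rest are down-weighted so they contribute little
    to the unsupervised loss.
    """
    return [1.0 if max(p) > threshold else low_weight for p in probs]

# Three unlabeled candidates scored by the model:
probs = [[0.97, 0.03], [0.55, 0.45], [0.05, 0.95]]
weights = pseudolabel_weights(probs, threshold=0.9)  # [1.0, 0.0, 1.0]
```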

Impact of Unlabeled Sample Size on the Model
Within the framework of semisupervised learning, unlabeled samples are indispensable for enhancing model performance. This section, taking 800 labeled samples as the base, explores the impact of varying quantities of unlabeled samples on model efficacy. According to Balakrishnan et al. (2021), while incorporating unlabeled samples improves model performance, the benefits taper off beyond a certain number of samples, so identifying an optimal quantity of unlabeled samples is crucial. To this end, we conducted a series of experiments to observe the influence of different quantities of unlabeled samples on the model's F1 score (Figure 7). The outcomes revealed that increasing the number of unlabeled samples from 0 to 20,000 significantly boosts the F1 score; however, as the quantity rises toward 50,000, the marginal gains gradually diminish, with the growth in F1 score becoming almost stagnant at 50,000 unlabeled samples. This trend corroborates the finding of Balakrishnan et al. (2021) that the positive impact of unlabeled samples decreases beyond a certain quantity threshold. Therefore, for settings with 800 labeled samples, employing 50,000 unlabeled samples is deemed optimal: it effectively enhances model performance while avoiding the computational waste of an excess of unlabeled samples. Note, however, that the optimal quantity may also depend on the specific task, the model architecture, and the number of labeled samples available; while our experimental results provide guidance on this choice, it should be adapted flexibly to the circumstances of each application.

Conclusion
In this study, we have introduced and validated a self-tuning pseudolabeling semisupervised learning approach for the identification of radio pulsars. This methodology confronts the challenges posed by the dearth of labeled data and the class imbalance prevalent in conventional supervised learning methods. By leveraging a wealth of unlabeled data and combining it with a smaller volume of labeled data, our approach significantly enhances the performance of radio pulsar classification. Our method pioneers an adaptive threshold weight adjustment method for pseudolabel selection, allowing the model to prioritize learning from high-confidence samples and fostering a more efficient and targeted learning process. Additionally, we have employed VQVAE and GPT-2 in tandem to generate supplementary pulsar candidate data; these data are then used as unlabeled training data, improving the usefulness of the generated data as training inputs while remedying the issue of class imbalance. Through rigorous experimentation, we have demonstrated that our proposed method outperforms other benchmark models across several metrics, including accuracy, recall, and F1 score, underscoring its superior generalization ability. Remarkably, with a training set of merely 800 labeled samples, our approach achieves an accuracy of 99%, a recall rate of 97.50%, and an F1 score of 97.16%, highlighting its high-quality classification capability even when operating with limited labeled data. These findings underscore the substantial potential of our method for practical applications, especially as a rapidly growing number of pulsar candidates exacerbates the challenge of maintaining large labeled data sets. We envisage this technique as a pivotal tool for future pulsar surveys, where it can provide even greater
utility. To ensure transparency and facilitate further research in the field of pulsar candidate identification, we have updated the related code. The latest version of the code can be accessed on GitHub under a 2-Clause BSD License, and version 1.0 is archived in Zenodo (Yi et al. 2024). We encourage researchers and practitioners to utilize and build upon our work to advance knowledge and application in pulsar candidate identification.

Figure 1 .
Figure 1. Pulsar candidate diagnostic feature plots: pulse profile, subband, and subintegration plots.

Figure 2 .
Figure 2. The modeling framework of self-tuning pseudolabeling semisupervised learning. "sup_loss" represents the labeled loss, and "unsup_loss" stands for the unlabeled loss.

Figure 3 .
Figure 3. Sample generation framework combining VQVAE and GPT models. The black line indicates the training process of VQVAE, while the blue line represents the training process of GPT-2.

Figure 4 .
Figure 4. Four generated sample instances: profile, subband, and subintegration plots represented in the R, G, and B channels of the RGB image, respectively.

Figure 6 .
Figure 6. Illustration of the increasing ratio of unlabeled sample selection and confidence threshold with training iterations.

Figure 7 .
Figure 7. Impact of unlabeled sample quantity on model F1 score under the condition of 800 labeled samples.
We initialize a confidence threshold μ_t, a step size m_rate, and maximum m_max and minimum m_min values. After each epoch, the validation F1 score is computed; we smooth it via a sliding window and adjust μ_t based on the smoothed value F1_smoothed.
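A minimal sketch of this self-tuning step, assuming the threshold is raised when the smoothed validation F1 improves and lowered otherwise; the exact update rule, window length, and parameter values are not given in this excerpt, so they are illustrative assumptions here.

```python
def adjust_threshold(mu, f1_history, window=5,
                     m_rate=0.01, m_max=0.98, m_min=0.80):
    """One self-tuning step for the confidence threshold mu.

    f1_history holds one validation F1 score per epoch.  The last
    `window` scores are averaged (sliding-window smoothing); if the
    smoothed F1 improved over the previous smoothed value, mu is
    raised by m_rate, otherwise lowered, then clipped to the range
    [m_min, m_max].
    """
    if len(f1_history) < window + 1:
        return mu  # not enough history to smooth yet
    smoothed_now = sum(f1_history[-window:]) / window
    smoothed_prev = sum(f1_history[-window - 1:-1]) / window
    mu += m_rate if smoothed_now >= smoothed_prev else -m_rate
    return min(m_max, max(m_min, mu))

mu = 0.90
# Validation F1 just improved, so the threshold is raised by m_rate.
mu = adjust_threshold(mu, [0.80, 0.80, 0.80, 0.80, 0.80, 0.90])
```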
The stop gradient sg disables gradient flow through sg[·] in backpropagation. To prioritize reconstructing z over quantizing it, γ < β is required.
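For context, this stop-gradient notation matches the standard VQ-VAE objective; assuming the paper follows that form with its β and γ weights (which term each weight scales is our assumption), the loss reads:

```latex
\mathcal{L} \;=\; \| x - \hat{x} \|_2^2
  \;+\; \beta \,\| \operatorname{sg}[z_e(x)] - e \|_2^2
  \;+\; \gamma \,\| z_e(x) - \operatorname{sg}[e] \|_2^2
```

In the β term, sg blocks the gradient through the encoder output, so only the codebook vector e is updated; in the γ term, only the encoder receives gradient. Taking γ < β then keeps the encoder focused on reconstruction rather than on matching the quantized codes.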

Table 1
Sample Division across Training, Validation, and Test Data Sets

Table 2
Comparison of Test Data Set Results for Models Trained with Different Numbers of Samples