Two-stage Hierarchical Framework for Solar Flare Prediction

Solar flares, often accompanied by coronal mass ejections and other solar phenomena, are among the most important drivers of space weather. Investigating forecasting approaches for solar flares is therefore important for mitigating their destructive effects on the Earth. Statistical analysis of data from 2010 to 2017 in the Space-weather HMI Active Region Patches (SHARPs), collected by the Solar Dynamics Observatory's Helioseismic and Magnetic Imager, reveals a distribution divergence between the two types of active regions (ARs) that host solar flares. A two-stage hierarchical prediction framework is formulated to better exploit this intrinsic distribution information. Specifically, we first identify the ARs in which at least one solar flare event occurs within the next 48 hr as flaring ARs, using balanced random forest and naive Bayes methods, and then predict the events from flaring ARs with a cascade module of learning models. An empirical evaluation on SHARP data from 2016 to 2019 verifies the promising performance of our framework, e.g., a true skill statistic of 0.727.


Introduction
Solar flares are explosive releases of strong magnetic energy occurring in local areas of the solar atmosphere (Sturrock 1966). When a solar flare event occurs, the irradiance of X-rays and extreme ultraviolet in the corona above the active regions (ARs) suddenly increases, which appears as an intense brightness enhancement (Priest & Forbes 2001; Liu et al. 2021). Generally, the duration of a solar flare eruption is only a few minutes to dozens of minutes. This process releases huge amounts of energy that propagate at the speed of light, completing the journey of 93 million miles to the Earth in just 8 minutes (Russell 2000; Tsurutani et al. 2009). The resulting ionospheric disturbance promptly disrupts transionospheric radio wave propagation, damaging communication and power systems and causing huge economic losses (Cinto et al. 2020; Thaduri et al. 2020). Therefore, the prediction of solar flares is a major topic for the space environment community.
In recent times, machine learning-based methods have gained widespread usage in various fields and have garnered increasing attention in the realm of solar flare prediction. Table 1 highlights several noteworthy works, including both traditional machine-learning algorithms such as support vector machines (SVM; Ahmadzadeh et al. 2019, 2021) and gradient boosting (Cinto et al. 2020), as well as deep-learning algorithms such as LSTM (Liu et al. 2019; Wang et al. 2020; Chen et al. 2021) and Deep Flare Net (Nishizuka et al. 2021).
These studies have significantly contributed to enhancing the accuracy of solar flare prediction.
The main challenge in building a prediction model is the extreme class imbalance of the solar flare data sample (Ahmadzadeh et al. 2019). Solar flares are generally graded as A, B, C, M, or X based on the value of the peak X-ray flux. The frequency of solar flares in different years exhibits significant variability, as demonstrated by the statistics presented in Figure 1 (Pal'shin et al. 2014). In this study, we focus on the strong flares above M class. Samples are denoted as flaring samples (positive samples) when M-class-and-above flares appear in the next 48 hr, while the rest are denoted as flaring-quiet samples (negative samples). Based on statistics over a data set integrating the Space-weather HMI Active Region Patches (SHARPs) with solar flare information provided by the National Oceanic and Atmospheric Administration (NOAA), the ratio of positive samples to negative samples reaches 1:24.7.
To address this issue, it is necessary to investigate the correlation between the magnetic field in the AR and solar flares. The energy released by solar flares mainly comes from magnetic energy released in the corona (Russell 2000; Liu et al. 2021). It has been found that solar flares are closely related to the sunspot area, the total magnetic flux, the life span of ARs, and other physical quantities reflecting the complexity of ARs and sunspots (Colak & Qahwaji 2009; Bloomfield et al. 2012). Although the triggering mechanism of solar flares remains unknown, it has been empirically observed that sunspots with complicated magnetic flux structures are often accompanied by high-level solar flares. Furthermore, ARs exhibiting strong activity tend to erupt solar flares more frequently (Zirin & Marquette 1991; Sammis et al. 2000; Li et al. 2008; Barnes et al. 2016). Consequently, it is logical to establish a prediction model for solar flares based on magnetic information.
Many previous studies focus on determining whether an AR is flare-eruptive or flare-quiet (Xiao et al. 2013; Bobra & Couvidat 2015) and on successive flare bursts (Ran et al. 2022). Drawing from the analysis above, we hypothesize that the frequency of occurrence of solar flares differs between ARs with different levels of magnetic flux structure, and we propose a two-stage prediction framework called the hierarchical prediction framework (HPF). We first classify the ARs into two categories according to the frequency of solar flares. Regions in which at least one solar flare event occurs within the life cycle of the sunspot group are defined as flaring ARs, while the others are defined as nonflaring ARs. Then, a cascade consisting of several machine-learning models is used to predict flare productivity in flaring ARs. The main advantages of HPF are as follows:
1. Applicability: the construction and application of HPF are in line with actual prediction scenarios, making full use of the ARs' information to achieve stable long-term prediction.
2. Effectiveness: taking logistic regression (LR) as the base classifier, for example, the F1 score and true skill statistic (TSS) are significantly improved to 0.594 and 0.727, respectively, after boosting with HPF.
The rest of this paper is organized as follows. In Section 2, the data set used in this study is described. The details of our method, HPF, are presented in Section 3. Section 4 shows the experimental results illustrating the effectiveness of the method. Then, we discuss the stability of long-term solar flare prediction in Section 5. Section 6 concludes the paper.

Data
The data set used in this paper is the 720s SHARP series observed by the Helioseismic and Magnetic Imager on board the Solar Dynamics Observatory. Each sample is a time series on a 12 minute cadence that contains various features for the region calculated over the magnetogram. Ten parameters are used as features: TOTUSJH, TOTPOT, TOTUSJZ, ABSNJZH, SAVNCPP, USFLUX, AREA_ACR, MEANPOT, R_VALUE, and SHRGT45. The definitions and calculation formulas of each magnetic parameter are shown in Table 2. For further information, please refer to Angryk et al. (2020). The data are sampled from 2010 May 1 to 2019 January 31 at an interval of 96 minutes. For each sample, the starting/ending time, ID number of the AR, and flare class are derived from the information provided by NOAA. The samples that contain M-class-and-above flare events within the next 48 hr are defined as flaring samples (positive samples), while the others are defined as flaring-quiet samples (negative samples). According to the statistics, a total of 73,810 samples are unevenly distributed over 1277 ARs, of which 2988 are positive samples and 70,822 are negative samples. Table 3 shows the number of positive and negative samples and the ratio between them from 2013 to 2017. It can be seen that a flare event is a rare event, leading to an extreme class imbalance in the data, which poses challenges for the prediction task.

Method
Extreme class imbalance of samples limits the application of machine learning-based methods to solar flare prediction (Japkowicz 2000; Ahmadzadeh et al. 2019). To establish a robust prediction method, it is necessary to find a way to alleviate this situation. We find that the frequency of solar flares in each AR differs according to the level of its magnetic flux structure. Many ARs never exhibit strong flares of M class and above. These regions are referred to as nonflaring ARs, while those containing at least one flaring sample are flaring ARs. Removing the samples from nonflaring ARs is therefore a natural way to balance the ratio of positive and negative samples. As shown in Figure A1 in Appendix A, there is a gap between the distributions of the 10 selected features of the flaring-quiet samples in the nonflaring ARs and in the flaring ARs, which enables us to effectively distinguish the source of flare samples. Inspired by this, we first predict the eruption possibility of the active region from which each sample came. For simplicity, we denote by Y_sp = 1/0 the sample label of positive/negative samples and by Y_AR = 1/0 the region label of flaring/nonflaring ARs, respectively.
The workflow of HPF is shown in Figure 2. First, Model-1 classifies the samples into two subsets based on their ARs' activity. We denote by Data Set 1 and Data Set 0 the subsets whose samples come from the predicted flaring ARs and nonflaring ARs, respectively. For samples from Data Set 1, Model-2 is applied to predict whether flares occur within the prediction window, while the samples from Data Set 0 are input into the baseline model.
Model-1 judges whether a sample's region is a flaring AR, that is, it predicts the region label Y_AR for each sample. Generally, flare outbreaks tend to repeat in the same flaring AR (Barnes et al. 2016), so we focus mainly on flaring ARs. The numbers of flaring samples and flaring-quiet samples in Data Set 1 are 2988 and 3335, respectively. The ratio of positive samples to negative samples thus decreases from 1:24.7 in the original data set to 1:1.12 in Data Set 1; the problem of extreme class imbalance is greatly alleviated through AR stratification. We construct Model-2 to further predict the sample label Y_sp for the samples from Data Set 1. In order to increase the recall as much as possible, Model-2 is designed as a cascade consisting of logistic generalized additive models (LogisticGAM; Hastie & Tibshirani 1986), RUSBoost (Seiffert et al. 2008), and linear discriminant analysis (LDA; Fisher 1936). As long as the prediction result of any one stage is Y_sp = 1, we consider the sample a flaring sample.
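The OR-combination at the heart of the Model-2 cascade can be sketched as follows. The thresholded stand-in stages below are purely hypothetical placeholders for LogisticGAM, RUSBoost, and LDA; any fitted classifiers exposing a `.predict()` method would slot in the same way.

```python
import numpy as np

def cascade_predict(models, X):
    """Label a sample positive (Y_sp = 1) as soon as ANY stage predicts 1.

    `models` is an ordered list of fitted classifiers with a .predict()
    method; in the paper these stages are LogisticGAM, RUSBoost, and LDA.
    """
    y = np.zeros(len(X), dtype=int)
    for model in models:
        # OR-combine: once a sample is flagged, later stages cannot unflag it.
        y |= model.predict(X).astype(int)
    return y

# Hypothetical stand-in stages (simple per-feature thresholds, illustration only):
class Threshold:
    def __init__(self, col, t):
        self.col, self.t = col, t
    def predict(self, X):
        return (X[:, self.col] > self.t).astype(int)

X = np.array([[0.1, 2.0], [0.9, 0.2], [0.2, 0.1]])
stages = [Threshold(0, 0.5), Threshold(1, 1.0)]
y_sp = cascade_predict(stages, X)  # → [1, 1, 0]: a sample is flaring if any stage fires
```

This structure favors recall over precision by design: a single positive vote suffices, matching the paper's stated goal for Model-2.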
The samples from Data Set 0 and the samples predicted as negative by Model-2 are also input to the baseline model to obtain the sample label Y_sp. By comparing the results of the baseline models before and after boosting with the HPF framework, the performance improvement brought by the proposed framework can be assessed. As a control group, the baseline model is trained on the raw data set. We select 13 common models as baseline models, including the decision tree (DT), SVM, and multilayer perceptron (MLP), among others.

Results
In this section, we give a brief introduction to the experimental setting and show the results illustrating the effectiveness of the proposed framework. The code is available at https://github.com/alphabet666/HPF-Two-stage-Hierarchical-Framework-for-Solar-Flare-Prediction.

Experimental Setting
The task of flare prediction is essentially a binary classification problem. Let the true positives (TP) be the flaring samples correctly predicted as positive and the false negatives (FN) be the flaring samples wrongly predicted as negative, while the true negatives (TN) are the nonflaring samples correctly predicted as negative and the false positives (FP) are the nonflaring samples wrongly predicted as positive. We employ the Precision and Recall to evaluate the prediction performance of the algorithms, defined respectively as Precision = TP/(TP + FP) and Recall = TP/(TP + FN). Usually, there is a trade-off between Precision and Recall: generally speaking, the higher the Recall, the lower the Precision, and vice versa. For a comprehensive view, we also consider the F0.5 and F1 scores, which reveal composite performance: F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall), with β = 0.5 and β = 1, respectively. The following measures are also used in this study. The TSS measures the difference between the Recall and the false positive rate, TSS = TP/(TP + FN) − FP/(FP + TN), which corrects for the dependency on the class ratio while still keeping the advantages of Recall (Allouche et al. 2006). The updated Heidke Skill Score (HSS2; Heidke 1926), HSS2 = 2(TP · TN − FN · FP)/[(TP + FN)(FN + TN) + (TP + FP)(FP + TN)], quantifies a model by comparing it to random guessing; unlike TSS, HSS2 is sensitive to the imbalance ratio. Both TSS and HSS2 are commonly used as robust measures under class imbalance. The area under the curve (AUC) is the probability that the model ranks a positive sample's score above a negative sample's score.
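All of these skill scores follow directly from the four confusion-matrix counts. A minimal sketch, using made-up counts purely for illustration:

```python
def skill_scores(tp, fp, fn, tn):
    """Compute the confusion-matrix skill scores used in flare prediction."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # hit rate / probability of detection

    def f_beta(beta):
        # Weighted harmonic mean of precision and recall.
        return ((1 + beta**2) * precision * recall
                / (beta**2 * precision + recall))

    tss = recall - fp / (fp + tn)  # recall minus the false positive rate
    hss2 = (2 * (tp * tn - fn * fp)
            / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)))
    return {"Precision": precision, "Recall": recall,
            "F0.5": f_beta(0.5), "F1": f_beta(1.0),
            "TSS": tss, "HSS2": hss2}

# Hypothetical counts for illustration (not results from the paper):
scores = skill_scores(tp=80, fp=40, fn=20, tn=860)
```

Note that TSS is unchanged if the negative class is scaled up or down uniformly, whereas HSS2 is not, which is why the paper reports both.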
In the following, each feature of the raw data is standardized to a mean of 0 and a standard deviation of 1 to avoid differences in orders of magnitude. In addition, since the missing values account for a tiny part of the data (<0.001%), we simply fill them with the feature means.
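A minimal numpy sketch of this preprocessing (mean imputation followed by standardization), assuming missing values are encoded as NaN:

```python
import numpy as np

def preprocess(X):
    """Mean-impute missing values, then z-score each feature column."""
    X = X.copy()
    col_mean = np.nanmean(X, axis=0)             # per-feature mean, ignoring NaNs
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_mean[nan_cols]   # fill the few missing entries
    return (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit std per feature

# Tiny illustrative matrix: 3 samples, 2 features, one missing value.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])
Z = preprocess(X)
```

In practice the imputation means and scaling statistics should be fitted on the training partition only and reused on the test partition, to keep the chronological split honest.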
We split the data set through cross-validation on a rolling basis to validate the performance. As shown in Figure 3, we chronologically split the data into eight partitions, where the former four partitions predict the flares in the following year and the latter four partitions predict the flares two years ahead. Unlike naive K-fold cross-validation, this data partition method ensures that the training set always predates the test set, which is in line with the actual prediction scenario (Nishizuka et al. 2021).
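The splitting scheme amounts to an expanding training window with a one- or two-year forecast horizon. A sketch with an illustrative year range (the paper's actual eight partitions follow Figure 3):

```python
def rolling_splits(years, horizon=1):
    """Expanding-window splits: train on all years up to some cutoff and
    test on the year `horizon` steps ahead, so the training set always
    predates the test set."""
    splits = []
    for i in range(1, len(years) - horizon + 1):
        train = years[:i]
        test = years[i + horizon - 1]
        splits.append((train, test))
    return splits

years = list(range(2011, 2018))          # illustrative year range
one_year = rolling_splits(years, horizon=1)   # predict the following year
two_year = rolling_splits(years, horizon=2)   # predict two years ahead
# e.g. one_year[0] == ([2011], 2012), two_year[0] == ([2011], 2013)
```

This is the key difference from K-fold cross-validation, where a fold's test samples may predate its training samples and leak future information.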

Performance Comparison
To demonstrate the effectiveness of HPF, experiments are conducted from the following aspects: whether there is a statistically significant improvement in the performance of prediction algorithms boosted with HPF, and how HPF compares with other strategies for addressing class imbalance. In the interest of fairness, the parameter settings for all experiments remain consistent and the hyperparameters of the algorithms have not been fine-tuned.
Table 4 compares the results of the baseline models before and after boosting with HPF. The F1, Recall, HSS2, TSS, and AUC of all baseline models improve in a statistically significant way, which shows the effectiveness of HPF. However, the Precision of LR, LDA, SVM, MLP, GB, AdaBoost, and extra trees decreases because some samples are inevitably misclassified into nonflaring ARs in the first-stage prediction. Missed detections are more serious than false alarms in this prediction task, so even a small reduction in Precision is an acceptable trade-off for the overall performance gain. Notably, the Recall of these models improves to a greater extent.
In addition, we compare the effectiveness of HPF with several resampling strategies: random undersampling (RUS), random oversampling (ROS), the synthetic minority oversampling technique (SMOTE), adaptive synthetic sampling (ADASYN), the neighborhood cleaning rule (NCL), and the edited nearest neighbor rule (ENN). The results of the baselines before and after using these resampling strategies are shown in Tables B1-B6 in Appendix B. In Table 5, we compare the mean F1 of the baseline models after boosting with HPF and after applying the resampling strategies, respectively. For all selected baseline models, both HPF and the resampling strategies improve performance, with HPF yielding the largest improvement. Table 6 compares the Precision after boosting with HPF and after resampling. While the two-stage framework HPF results in a noticeable decrease in Precision for several baseline models, including LR, LDA, SVM, MLP, GB, AdaBoost, and extra trees, the resampling strategies other than NCL and ENN reduce it to an even greater extent.
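For concreteness, the two simplest strategies (RUS and ROS) can be sketched in a few lines of numpy; SMOTE, ADASYN, NCL, and ENN are more involved and are typically taken from a library such as imbalanced-learn. The data below are randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X, y):
    """RUS: drop majority-class samples until the two classes are balanced."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep])
    return X[idx], y[idx]

def random_oversample(X, y):
    """ROS: duplicate minority-class samples until the two classes are balanced."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([pos, neg, extra])
    return X[idx], y[idx]

# Illustrative data with a ~1:24 imbalance, echoing the raw data set.
X = rng.normal(size=(100, 10))
y = np.array([1] * 4 + [0] * 96)
Xu, yu = random_undersample(X, y)   # 4 positives + 4 negatives
Xo, yo = random_oversample(X, y)    # 96 positives + 96 negatives
```

RUS discards information from the majority class, while ROS risks overfitting to duplicated minority samples; HPF sidesteps both by removing samples via the physically motivated AR stratification instead.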
The scores of the methods boosted with HPF are also higher than those of the second-place method. The Recall of the three methods boosted with HPF exceeds 0.7, which is significantly better. The downside is that our method is only slightly higher than the second-place method in accuracy and lower than the first-place method. Part of the reason is that the first-place method uses both image data and extracted features, while our method uses only the extracted features.

Discussions
The frequency of solar flare eruptions varies with the sunspot activity cycle, which is approximately 11.4 yr, ranging from several times a day in the "active" state to less than once a week in the "dormant" state, making long-term prediction difficult (Gnevyshev 1977). As a result, the inconsistency between the distributions of the training and test sets impairs the stability of long-term solar flare predictions.
Figure 4 shows the F1 scores of the baseline methods before and after boosting with HPF. The number shown is the ratio between the imbalance ratio of the training set and that of the test set; the closer this value is to 1, the closer the distributions of the two sets. The F1 scores fluctuate greatly with changes in the training set. As the sample size of the training set increases, the F1 scores of DT, extra trees, MLP, RF, and SVM increase. On the other hand, when the difference in imbalance ratio between the training set and the test set is larger, the performance of most models tends to be worse, yet almost all baseline methods boosted with HPF still achieve higher F1 scores. Furthermore, we find that using the same training set to predict flares in the next two years is not necessarily worse than predicting flares in the next year, which also shows that the flare distributions of consecutive years are not necessarily more similar. Based on the above analysis, HPF helps guarantee the stability of the model.

Conclusions
To address the challenge of extreme class imbalance in solar flare prediction, we propose a novel hierarchical prediction method called HPF. By first identifying flare-productive ARs, HPF improves the accuracy of predicting solar flares within the next 48 hr. This finding validates that regions of high solar flare activity tend to experience consecutive flares.
Extensive experiments on a large-scale data set of solar flare observations demonstrate that HPF significantly enhances the performance of nearly all classification models. However, the current version of HPF does not fully exploit the temporal information available in solar flare data. As a potential future direction, we propose integrating time series methods such as long short-term memory (LSTM) into the framework. LSTM is a powerful model that can capture the temporal dependencies of events and has been widely used in recent research on flare prediction. It is also worth noting that the current version of HPF is limited to binary classification, i.e., predicting whether flares occur or not; it cannot predict the level of the flares. To address this limitation, we plan to investigate multiclass classification in future work.
In summary, our proposed framework HPF has demonstrated superior performance in predicting solar flares with extreme class imbalance.However, there is still room for improvement by incorporating time series methods and extending the framework to multiclass classification problems.

Appendix B
The results of the baselines before and after using the six resampling strategies are shown in Tables B1-B6.
The balanced random forest (BRF; Chen & Breiman 2004) and Gaussian naive Bayes (NB; Wheatland 2004) are used to predict the region label Y_AR for each sample. If both models predict Y_AR = 1 with a probability greater than 55%, we set Y_AR = 1 and classify the sample into Data Set 1; otherwise, we set Y_AR = 0 and classify the sample into Data Set 0. There are 98 flaring ARs and 1179 nonflaring ARs, so the problem of class imbalance inevitably occurs during the training of Model-1. The neighborhood cleaning rule (NCL; Laurikkala 2001) is used to downsample the majority class and alleviate this problem.
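The conjunction rule described above reduces to a few lines. The probabilities below are hypothetical, standing in for the predicted P(Y_AR = 1) outputs of the fitted BRF and NB models.

```python
import numpy as np

def route_samples(p_brf, p_nb, threshold=0.55):
    """Assign the region label Y_AR: a sample enters Data Set 1 only when
    BOTH models give P(Y_AR = 1) above the threshold; otherwise Data Set 0.

    `p_brf` and `p_nb` are the predicted probabilities from the balanced
    random forest and Gaussian naive Bayes models, respectively.
    """
    return ((p_brf > threshold) & (p_nb > threshold)).astype(int)

# Hypothetical model outputs for three samples:
p_brf = np.array([0.90, 0.60, 0.40])
p_nb = np.array([0.80, 0.50, 0.70])
y_ar = route_samples(p_brf, p_nb)  # → [1, 0, 0]: only the first enters Data Set 1
```

Requiring agreement from both models makes the flaring-AR assignment conservative, which limits how many flaring-quiet samples leak into Data Set 1 and keeps its class ratio near 1:1.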

Figure 3 .
Figure 3. Diagram of data splitting. The blue and red/green parts indicate the years of observations in the training set and the test set, respectively.
Magnetograms and extracted features are used as input; those from 2010 May 1 to 2015 December 31 are used as training sets, and those from 2016 January 1 to 2019 January 31 are used as test sets. The training set and the test set contain 1000 and 277 ARs, respectively. The results are shown in Table

Figure 4 .
Figure 4. F1 scores of baseline methods before and after boosting with HPF, where the horizontal axis is the time interval of the training set; the number is the ratio between the imbalance ratio of the training set and that of the test set.

Table 1
Methods for Solar Flare Prediction

Table 2
Illustration of Selected Magnetic Field Parameters
Note. N_1 and N_0 are the numbers of flaring samples and flaring-quiet samples, respectively.

Table 4
Results of the Baseline Models before and after Boosting with HPF

Table 5
The Mean and Deviation of F0.5 and F1 of Models with Resampling Strategies
Note. The first and second rows display the F0.5 and F1, respectively. Bold font indicates the optimal values.

Table 7
Comparison of Results on the Solar Flare Prediction Challenge