Abstract
Space weather phenomena such as solar flares have massive destructive power when they reach a certain magnitude. Here, we explore a deep-learning approach to build a solar flare-forecasting model, while examining its limitations and feature-extraction ability based on the available Geostationary Operational Environmental Satellite (GOES) X-ray time-series data. We present a multilayer 1D convolutional neural network to forecast the occurrence probability of M- and X-class solar flare events at 1, 3, 6, 12, 24, 48, 72, and 96 hr time frames. The forecasting models were trained and evaluated in two different scenarios: (1) random selection and (2) chronological selection, which were afterward compared in terms of common score metrics. Additionally, we compared our results to state-of-the-art flare-forecasting models. The results indicate that (1) when X-ray time-series data are used alone, the suggested model achieves higher scores for X-class flares and scores similar to previous studies for M-class flares; (2) the two different scenarios obtain opposite results for the X- and M-class flares; and (3) the suggested model, combined with X-ray time series alone, fails to distinguish between M- and X-class magnitude solar flare events. Furthermore, the scores achieved with the suggested method, obtained solely from X-ray time-series measurements, indicate that substantial information regarding solar activity and physical processes is encapsulated in the data, and that augmenting additional data sets, both spatial and temporal, may lead to better predictions, while gaining a comprehensive physical interpretation regarding solar activity. All source codes are available at https://github.com/vladlanda.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
A sudden outburst of electromagnetic radiation originating at the solar surface travels at the speed of light and reaches Earth within 500 s (Liu et al. 2004). These electromagnetic bursts, known as the solar flare phenomenon, emit extreme-ultraviolet (EUV) and X-ray radiation, leading to an ionization effect in the ionospheric D, E, and F2 layers (Sweet 1958; Reuveni & Price 2009; Reuveni et al. 2010). Solar flares can interfere with radio communication systems, affect global navigation satellite systems, neutralize satellite equipment, cause electric power blackouts on Earth, and harm the health of astronauts, and when they reach a very high magnitude they can easily mean a loss exceeding several billion dollars in repairs and months of reconstruction (Marusek 2007; Riswadkar & Dobbins 2010). Therefore, scientists are constantly seeking accurate and consistent tools and methods for predicting where and when solar flares and X-ray bursts are likely to occur (Tóth et al. 2005; Clilverd et al. 2009). However, although our knowledge regarding solar activity and physical processes is constantly improving, attaining real-time solar flare forecasts, similar to our daily atmospheric weather forecasts (Leontiev & Reuveni 2017, 2018; Leontiev et al. 2021), remains an unattained goal so far (Lyutikov et al. 2018), and space technologies remain vulnerable to such threats.
Thus, extracting an accurate and reliable solar flare forecast while considering multiple ranges of time windows is essential for decision makers when protective measures are taken in critical mission situations.
Attempts to construct solar flare forecasts began in the 1930s, when Giovanelli (1939) suggested examining the probability of an eruption taking place based on the sunspot characteristics associated with it. Today, the number of studies that lean on machine-learning (ML) algorithms, considered data-driven approaches, is increasing drastically. ML algorithms such as the support vector machine (SVM), random forest (RF), multilayer perceptron (MLP), and artificial neural network (ANN) have been applied in the field of solar flare prediction. Li et al. (2011) proposed an unsupervised clustering approach, combined with learning vector quantization, based on characteristics extracted from the Solar and Heliospheric Observatory (SOHO)/Michelson Doppler Imager (MDI) data. Yuan et al. (2010) used photospheric magnetic measurements in order to construct an automatic solar flare forecast within a 24 hr window, based on logistic regression and the SVM model. Furthermore, Huang et al. (2013) developed a forecasting model by combining DARAL (the distance between active regions and predicted active longitudes) and solar magnetic field parameters based on an instance-based learning model, while Li & Zhu (2013) used a predictive solar flare model based on the MLP and learning vector quantization, trained on sequential sunspot data for a 48 hr flare prediction. More recently, Muranushi et al. (2015) introduced a fully automated solar flare prediction framework called universal forecast constructor by optimized regression of inputs (UFCORIN) with two integrated regression models: an SVM and a handwritten linear regression algorithm. Nishizuka et al. (2017) examined and compared the performance of the SVM, k-nearest neighbors (k-NN), and extremely randomized trees in predicting the maximum class of flares occurring in the next 24 hr, training the models on vector magnetograms, ultraviolet (UV) emission, and soft X-ray emission, available from the Solar Dynamics Observatory (SDO) and the Geostationary Operational Environmental Satellite (GOES), while Asaly et al. (2021) used ionospheric total electron content (TEC) data as an SVM training set to build a solar flare X- and M-class predictor. Bobra & Couvidat (2015) attempted to forecast M- and X-class solar flare events using an SVM trained with SDO/Helioseismic and Magnetic Imager (HMI) data.
In the past decade, the advancement of ML in the field of deep neural networks, combined with the massive growth of big data and hardware development in graphical processing units, has allowed artificial intelligence (AI) algorithms to achieve human-level performance in computer vision, including image classification, object detection, and segmentation (LeCun et al. 2015; Russakovsky et al. 2015). Moreover, AI has shown promising results in natural language processing, including machine translation, image captioning, and text generation (Vinyals et al. 2015; Wu et al. 2016).
Recently, scientists have applied deep neural networks (DNNs) to space weather predictions, in particular the forecast of solar flares. Nagem et al. (2018) used the GOES X-ray flux 1-minute time-series data for solar flare predictions by integrating three neural networks (NNs): the first NN maps the GOES time series into Markov transition field (MTF) images, the second NN extracts all relevant features from the MTF, and the third network is a convolutional neural network (CNN; LeCun et al. 1998) that generates the prediction. An additional study, performed by Chen et al. (2019), proposed to identify solar flare precursors by automated feature extraction and to classify flare events versus quiet time for active regions, as well as strong (X- and M-class) versus weak (A-, B-, and C-class) events, at time frames of 1, 3, 6, 12, 24, 48, and 72 hr before the event. Two types of models were examined: a CNN and a recurrent neural network based on a long short-term memory cell (Hochreiter & Schmidhuber 1997), trained on multiple data sources: GOES, SDO, and HMI. Park et al. (2018) presented a forecast application based on a CNN model with a binary outcome: 1 or 0 for a daily flare occurrence of the X, M, and C classes. They compared their model with two well-known models, AlexNet (Krizhevsky et al. 2017) and GoogLeNet (Szegedy et al. 2015), by training them using a transfer-learning technique. Finally, Huang et al. (2018) proposed a deep-learning method for learning forecasting patterns from line-of-sight solar active region magnetograms based on the available data from SOHO/MDI and SDO/HMI. Their method forecasts solar flare events at 6, 12, 24, and 48 hr window frames, compared to other state-of-the-art forecasting models.
Here, we propose to use a 1D CNN model, designed as a time-series classifier for a solar flare forecast application, using solely GOES soft X-ray time-series data without hand-crafted features or dedicated data preprocessing, in contrast to previous studies. The suggested model takes as input the GOES X-ray time-series data and outputs the probability of an X-class or M-class flare occurrence. In addition, we also examine the ability of this model design to learn and extract time-series features that distinguish between different solar flare class events.
The outline of the paper is as follows. Data description and preparation are presented in Section 2. The CNN architecture, training, evaluation processes, and overall method are proposed in Section 3. The model performance and comparison are presented in Section 4. The discussion and conclusions follow in Section 5.
2. Data
The solar flux is known to differ over numerous timescales, ranging from minutes to months and decades (Unruh et al. 2008; Reuveni & Price 2009). The fluctuations in the total solar output have been monitored and recorded since the late 1970s (Willson & Hudson 1988), and various measured properties have been presented (Frohlich & Lean 1998; Willson & Mordvinov 2003; Dewitte et al. 2004). While the short-term (minutes to hours) changes are largely dominated by convection currents and solar fluctuations (mainly acoustic and gravity waves), the diurnal to annual changes are due to the occurrence of sunspot regions and variations in the surface magnetic field, coupled with the solar rotation that migrates solar active regions back and forth across the sunlit side of the Earth. Within an 11 yr cycle, sunspots are transported toward the solar equator, while new ones accumulate at high latitudes (Stix 2002).
We use the 1-minute average X-ray (0.1–0.8 nm) time-series data available from the GOES (Schmit et al. 2013) mission. The first GOES (GOES-1) was launched in 1975 by the United States' National Oceanic and Atmospheric Administration (NOAA), and was operated by NOAA's National Weather Satellite, Data, and Information Service division. All GOES mission spacecraft are geosynchronous satellites, located at a height of about 35,800 km, providing a full-disk view of the Earth as well as an unobstructed view of the Sun. The main GOES mission is collecting infrared radiation and visible solar reflection from the Earth's surface and atmosphere using imager equipment, as well as collecting atmospheric temperature, moisture profiles, surface and cloud-top temperatures, and the ozone distribution using sounder equipment. Moreover, GOES spacecraft carry on board a space environment monitor instrument consisting of a magnetometer, an X-ray sensor, a high-energy proton and alpha-particle detector, along with an energetic particle sensor. The X-ray sensor found on board is capable of registering two wavelength bands: 0.05–0.4 nm and 0.1–0.8 nm. The X-ray flux class is defined by the long wave band (0.1–0.8 nm) magnitude as it reaches certain thresholds: 10⁻⁴, 10⁻⁵, and 10⁻⁶ W m⁻² for the X, M, and C classes, respectively. GOES X-ray flux data constitute the main source for confirming a solar flare occurrence, and they are extensively used by previous and current studies, associating flare events with different measured data sources (Chen et al. 2019; Huang et al. 2018). Hence, the GOES X-ray data source can act as a primary base for a forecasting application without introducing additional sources of measured data. In order to form a sequential time-series X-ray data signal ranging from 1998 July to 2019 December, multiple GOES mission sources were used, namely GOES-10, GOES-14, and GOES-15.
The GOES-10 data range from 1998 July to 2009 December, GOES-14 from 2010 January to 2010 December, and GOES-15 from 2011 January to 2019 December. All three data sources were merged into one chronological sequence of 1-minute-averaged X-ray signal, covering almost entirely two solar cycles (cycles 23 and 24), from 1998 July to 2009 December and from 2010 January to 2019 December, respectively.
2.1. Data Normalization, Scaling, and Splitting
Using the X-ray signal magnitudes, we sorted all the X and M solar flare events based on the corresponding thresholds associated with them: 1 × 10⁻⁴ and 1 × 10⁻⁵ W m⁻², respectively. In order to create two separate data sets for the X and M solar flare classes with different prediction frames of 1, 3, 6, 12, 24, 48, 72, and 96 hr, while preserving 48 hr of data as an input to the model, we suggest the following scheme: first, we replaced all the missing values, which appear as "−99999" in the time series, with the GOES-15 nominal minimum value of 1 × 10⁻⁹ W m⁻². This provides a continuous and smooth sequence, free of unexpected negative spikes, which is considered part of the "no event" or "quiet time" sequence. Then, for every solar flare event peak (M or X separately) that was found, we confirmed that no additional higher-magnitude events appeared 12 hr after or 97 hr before the peak (1 hr before the peak plus 96 hr for the prediction frames). As a next step, a no-event frame was selected by choosing a random time point and confirming that no event higher than the M-class threshold appeared 12 hr after or 97 hr before it. Moreover, we required that the total variance of the selected frame not fall below the 1 × 10⁻²⁰ threshold, thus eliminating frames dominated by the nominal minimum value. In this way, an event/no-event data frame has a length of 144 hr: 96 hr for the maximum prediction frame, plus 48 hr as input (to examine the 96 hr prediction). Figure 1 visualizes the data preparation process, and Figure 2 shows 48 hr of data as the model input for M-class versus no-flare events, together with the test and train data sets for the two different data split approaches. The total numbers of event frames found for the X and M classes were 171 and 1522, respectively, while the no-event frame set counted 1057 events; see Tables 1 and 2.
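The frame-selection rules above can be sketched as follows. This is a simplified NumPy sketch under our own naming and peak-handling conventions, not the authors' implementation (their code is at the linked repository):

```python
import numpy as np

def find_event_frames(flux, threshold):
    """Sketch of the event-frame selection above (names and exact
    peak handling are ours). `flux` is the 1-minute X-ray series;
    `threshold` is 1e-4 (X class) or 1e-5 (M class) W m^-2."""
    H = 60                                   # samples per hour
    before, after = 97 * H, 12 * H           # isolation windows around a peak
    frame_len = 144 * H                      # 96 hr prediction gaps + 48 hr input
    frames = []
    i = max(before, frame_len)
    while i < len(flux) - after:
        if flux[i] >= threshold:
            window = flux[i - before:i + after]
            # keep the event only if no higher-magnitude value appears
            # 97 hr before or 12 hr after it
            if window.max() <= flux[i]:
                frames.append(flux[i - frame_len:i])
            i += after                       # skip past this event
        else:
            i += 1
    return frames
```

A no-event frame would be drawn analogously at a random time point, with the additional variance check described above.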
Each event frame set is split into training and testing sets by two different approaches: the simple random sampling (SRS) approach and the chronological approach. The SRS approach splits each set into a training and a testing set by selecting samples with a uniform distribution, i.e., each sample in the set has an equal probability of being selected for the training or the testing set. This approach is most commonly used in similar applications and leads to low biases when applied to balanced data (Reitermanov 2010). For the chronological approach, we followed the data-splitting method suggested by Park et al. (2018), who noted that the data-splitting method might influence the forecasting performance of the model, where the random selection approach can inflate its performance because training and testing events might be chosen from adjacent time periods. Thus, a comparison of the model performance trained with the same data but with different splitting techniques is meaningful. We therefore selected the data ranging from 1998 July to 2009 December as the training set and the data from 2010 January to 2019 December as the testing set. In this case, the training set consisted of events solely from solar cycle 23 and the testing set solely from solar cycle 24, ensuring that the testing events appear chronologically after the training events. In addition, every training and testing data set was structured with an even flare/no-flare number of events, based on the event type with the smallest count (when compared between the X and M classes). For both splitting approaches, the data are scaled by 1 × 10⁹ in order to normalize the minimum value to 1.0 W m⁻². Afterward, we applied the natural logarithm to the resulting sequence, narrowing the range of values to [0, ln(10⁶)], as the maximum nominal value of the GOES-15 data set is 1 × 10⁻³ W m⁻².
Finally, we applied a standard normalization procedure, which shifts the mean value to 0 and scales the variance to 1, to the training and testing sets, based on the normalization parameters obtained from the training sets.
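The normalization chain above (gap filling, scaling, logarithm, standardization) can be sketched in NumPy; the function name is ours, and only the constants stated in the text are used:

```python
import numpy as np

def preprocess(flux, train_stats=None):
    """Sketch of the normalization chain above (function name is ours):
    fill the -99999 gaps, scale, take the natural log, then standardize
    with statistics fitted on the training set only."""
    x = np.asarray(flux, dtype=np.float64).copy()
    # Missing samples are flagged as -99999; replace them with the
    # GOES-15 nominal minimum of 1e-9 W m^-2 ("quiet time" level).
    x[x == -99999] = 1e-9
    # Scale by 1e9 so the minimum maps to 1.0, then take the natural log;
    # values now lie in [0, ln(1e6)] since the nominal maximum is 1e-3 W m^-2.
    x = np.log(x * 1e9)
    if train_stats is None:                  # fit mean/std on the training set
        train_stats = (x.mean(), x.std())
    mean, std = train_stats
    return (x - mean) / std, train_stats     # zero mean, unit variance
```

The fitted `train_stats` pair is reused verbatim on the test set, as described above.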
Table 1. Number of X, M, and No Events for SRS Split
| Classes | Train 50% (X vs M vs No) | Test 30% (X vs M vs No) | Validation 20% (X vs M vs No) | Train 50% (M vs No) | Test 30% (M vs No) | Validation 20% (M vs No) | Available | Extra |
|---|---|---|---|---|---|---|---|---|
| X class | 84 | 51 | 36 | ... | ... | ... | 171 | 0 |
| M class | 84 | 51 | 36 | 443 | 265 | 177 | 1522 | 294 |
| No event | 84 | 51 | 36 | 443 | 265 | 177 | 1057 | 0 |
| Total | 513 (X vs M vs No) | | | 1772 (M vs No) | | | | |
Note. This table shows the number of available X-class, M-class, and no-event frames with the SRS split approach, based on first pulling 30% as the test set; the remainder is then divided into 70% for training and 30% for validation.
Table 2. Number of X, M, and No Events for Chronological Split
| Classes | Train 70%, Cycle 23 (X vs M vs No) | Test, Cycle 24 (X vs M vs No) | Validation 30%, Cycle 23 (X vs M vs No) | Train 70%, Cycle 23 (M vs No) | Test, Cycle 24 (M vs No) | Validation 30%, Cycle 23 (M vs No) | Available, Cycle 23 | Extra, Cycle 23 | Available, Cycle 24 | Extra, Cycle 24 |
|---|---|---|---|---|---|---|---|---|---|---|
| X class | 86 | 47 | 37 | ... | ... | ... | 124 | 0 | 47 | 0 |
| M class | 86 | 47 | 37 | 196 | 503 | 84 | 972 | 569 | 550 | 0 |
| No event | 86 | 47 | 37 | 196 | 503 | 84 | 405 | 0 | 656 | 106 |
| Total | 510 (X vs M vs No) | | | 1566 (M vs No) | | | | | | |
Note. This table shows the number of available X-class, M-class, and no-event frames for the chronological split approach, with cycle 24 as the test set and cycle 23 divided into 70% for training and 30% for validation.
3. Method
CNN models have shown human-level performance in the field of computer vision and image processing, and are currently being deployed in autonomous cars, flying drone systems, autonomous robotics, gaming, and medicine. The core layers of these models are the convolutional layers, consisting of several filters, also referred to as kernels, whose number and shape are defined prior to the training process by the hyperparameters. When input data (a tensor) are passed into the convolutional layer, every kernel of the layer is convolved with the input tensor, generating a feature map. A general case of a discrete 2D convolution is given by the following equation:
$$y_{k}^{l}(i,j)=\sum_{n=-N}^{N}\sum_{m=-M}^{M}w_{k}^{l}(n,m)\,x^{l-1}(i-n,\,j-m),$$

where $y_{k}^{l}(i,j)$ is the feature map k of a layer l at index i, j, $x^{l-1}$ is the output from the previous layer l − 1, which becomes the input to the current layer l, and $w_{k}^{l}$ is the kernel, with a size of (2N + 1) × (2M + 1). Two additional hyperparameters are the stride S, which defines the kernel move step along the input tensor, and the padding P, which pads the input boundary.
Because the convolution operation is linear and a CNN is a deep stack of linear combination layers, similar to DNNs it is also designed with activation functions that allow modeling a nonlinear mapping from an input domain into an output domain. The CNN model architecture often includes the rectified linear unit (ReLU; Fukushima & Miyake 1982; Nair & Hinton 2010), which is described as follows:

$$\mathrm{ReLU}(x)=\max(0,x).$$
Furthermore, the CNN feature maps encapsulate the spatial features found in the input tensor, associated with the kernel values that are learned during the training process. In general, the spatial features that describe an input sample are not necessarily grouped together in one location, but rather might be spread over different locations; therefore, capturing those features can lead to better performance. Thus, CNN models include pooling layers that pool information (based on the pooling layer type) from the feature maps after the activation function is applied. Only a few pooling layer types are used in CNN models, where one of the most popular is the max pooling layer, defined by the following expression:
$$p_{k}^{l}(i,j)=\max_{0\le n<N,\;0\le m<M}a_{k}^{l}(iS+n,\,jS+m),$$

where $p_{k}^{l}(i,j)$ is the pooling tensor k of layer l at index i, j, operating on a feature map $a_{k}^{l}$ with a max pooling kernel of size N × M and stride S, which are defined by the hyperparameters.
A general classification CNN model (Fawaz et al. 2019) consists of stacks of layers one after another, such that each convolutional layer operates on the output tensor of the previous layer. The result is then passed through the activation function, converting it into a feature map, from which the pooling layer pools spatial information. At the end of the model architecture there are a few fully connected layers, followed by the softmax activation function, which is defined as
$$\sigma(\mathbf{x})_{i}=\frac{e^{x_{i}}}{\sum_{j=1}^{K}e^{x_{j}}},\qquad i=1,\ldots,K,$$

where x is an input vector of real numbers and K is the number of categories, mapping the processed input into the output domain of categorical probabilities (Figure 3).
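The three building blocks (ReLU, max pooling, softmax) can be illustrated numerically on a short 1D signal; the toy values below are our own:

```python
import numpy as np

# Toy 1D activation values (ours) passed through the three building blocks.
x = np.array([1.0, -2.0, 3.0, -1.0, 2.0, -3.0])

relu = np.maximum(0, x)                 # ReLU: max(0, x) elementwise

# Max pooling with kernel size 2 and stride 2: keep the max of each pair.
pooled = relu.reshape(-1, 2).max(axis=1)

def softmax(v):
    e = np.exp(v - v.max())             # shift by the max for numerical stability
    return e / e.sum()

probs = softmax(pooled)                 # categorical probabilities, sum to 1
```

Here `relu` zeroes the negative entries, `pooled` reduces the six values to three, and `probs` maps them to a probability vector, mirroring the equations above.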
3.1. Model Architecture
We used a general CNN architecture in order to develop a time-series classification model as a solar flare-forecasting tool. In contrast to the general CNN model, which takes a 2D image as input, the X-ray time-series data are 1D; hence, we designed a 1D CNN based on the general CNN case (Wang et al. 2017). Our model (Figure 4) consists of four convolutional layers, each followed by a ReLU activation function, four max pooling layers, a fully connected layer, and an output layer with a softmax activation function. In addition, every max pooling layer is followed by a dropout layer, with a dropout probability of 10% (0.1 is the hyperparameter value), for regularization and the avoidance of model overfitting (Srivastava et al. 2014).
- 1. The first convolutional layer (conv1) has 64 feature maps, a kernel of size 1 × 30, and a stride of 1; the total conv1 size is 1 × 2880 × 64. The following max pooling layer has kernels of size 1 × 15 with a stride of 15 and a shape of 1 × 192 × 64.
- 2. The second convolutional layer (conv2) has 256 feature maps, kernels of size 1 × 15, and a stride of 1; the total conv2 size is 1 × 192 × 256, and its max pooling layer has kernels of size 1 × 5 with a stride of 5 and a shape of 1 × 39 × 256.
- 3. The third convolutional layer (conv3) has 512 feature maps, kernels of size 1 × 5, and a stride of 1; the total conv3 size is 1 × 39 × 512. The following max pooling layer has kernels of size 1 × 3, with a stride of 3 and a shape of 1 × 13 × 512.
- 4. The final convolutional layer (conv4) has 512 feature maps, kernels of size 1 × 3, and a stride of 1; the total conv4 size is 1 × 13 × 512, and its max pooling layer has kernels of size 1 × 3, with a stride of 3 and a shape of 1 × 5 × 512.
At the end of the model architecture, a flattening layer flattens the output of the last max pooling layer from 1 × 5 × 512 into 2560 × 1, connecting it to the output layer of size 2 × 1, making it a fully connected layer. The output layer passes through the softmax activation function to map the output into the categorical probability space.
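Under the shapes listed above, the architecture can be sketched in PyTorch. This is our illustrative reconstruction (the authors' code is in the linked repository), assuming "same" convolution padding and ceil-mode pooling, which reproduce the stated 192/39/13/5 sequence lengths:

```python
import torch
import torch.nn as nn

class FlareCNN(nn.Module):
    """Sketch of the four-block 1D CNN described above (our PyTorch
    reconstruction; padding/ceil-mode choices are our assumptions)."""
    def __init__(self):
        super().__init__()
        def block(c_in, c_out, k, pool):
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=k, stride=1, padding="same"),
                nn.ReLU(),
                nn.MaxPool1d(pool, stride=pool, ceil_mode=True),
                nn.Dropout(0.1),            # 10% dropout after each pooling layer
            )
        self.features = nn.Sequential(
            block(1, 64, 30, 15),     # length 2880 -> 192
            block(64, 256, 15, 5),    # 192 -> 39
            block(256, 512, 5, 3),    # 39 -> 13
            block(512, 512, 3, 3),    # 13 -> 5
        )
        self.head = nn.Linear(5 * 512, 2)   # 2560 -> flare / no-flare logits

    def forward(self, x):                   # x: (batch, 1, 2880)
        z = self.features(x).flatten(1)     # -> (batch, 2560)
        return self.head(z)                 # softmax is applied in the loss
```

The head returns raw logits; the softmax mapping to categorical probabilities is folded into the cross entropy loss during training.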
3.2. Data Preparation and Training
In order to cover a wide range of prediction time frames, different data split methods, and solar flare class types, we trained the individual model for each combination of the following categories:
- 1. Solar flare class type: X-class flare versus no flare, or M-class flare versus no flare (two in total).
- 2. Data split type: chronological or random (two in total).
- 3. Prediction time frame: 1, 3, 6, 12, 24, 48, 72, and 96 hr (eight in total).
We ended up training the architecture with 32 different configurations, one for each combination (2 × 2 × 8 = 32). In addition, we investigated the ability of the proposed CNN architecture to distinguish between M- and X-class solar flares for each time frame from category 3 and data split type from category 2, leading to an additional 16 trained models (2 × 8 = 16). In order to train our model on the various prediction time frames, we pulled a 48 hr window of data, shifted by the prediction gap, out of the available 144 hr in the event frame, forming training and testing sets with a range of 2880 minutes (Figure 5). In total, we trained 48 individual models with the following hyperparameters: we used the Adam (Kingma & Ba 2014) optimizer with a learning rate of 3 × 10⁻⁵. We also adopted the cross entropy loss function for the training procedure. The cross entropy loss function is given by the following formula:
$$L(y,\hat{y})=-\sum_{i=1}^{m}y_{i}\log(\hat{y}_{i}),$$

where y is the ground-truth one-hot encoded vector of size m, and $\hat{y}$ is the model output prediction vector, also of size m, encoded with probability entries that sum up to 1. Further, we used a mini-batch size of 16 and 75 epochs in total. In addition, an early stopping mechanism (Prechelt 1998) was added to the training process to snapshot the model weights once the validation set loss reached a new minimum value; see Figure 6 for the validation set loss graphs.
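The training procedure above can be sketched as the following loop. This is our hedged reconstruction: the Adam learning rate, cross entropy loss, batch size of 16, and 75-epoch cap come from the text, while the `patience` value and checkpointing details are our assumptions (the paper only cites early stopping):

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=75, patience=10):
    """Sketch of the training loop described above; `patience` is our
    assumption, the other hyperparameters are taken from the text."""
    opt = torch.optim.Adam(model.parameters(), lr=3e-5)
    loss_fn = nn.CrossEntropyLoss()          # softmax + cross entropy on logits
    best_loss, best_state, stale = float("inf"), None, 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:            # mini-batches of 16 samples
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        if val < best_loss:                  # snapshot at a new loss minimum
            best_loss, stale = val, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            stale += 1
            if stale >= patience:            # early stopping
                break
    model.load_state_dict(best_state)        # restore the best snapshot
    return model
```

One such model would be trained per configuration (48 in total), each on its own class type, split method, and prediction time frame.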
4. Results and Comparison
A classifier is evaluated based on the statistical scores it achieves with the test set. The scores are calculated according to the confusion matrix (Fawcett 2006), which measures the performance of a machine-learning algorithm based on four different combinations of the predicted and actual (ground truth) values; see Table 3. Here, we followed seven commonly used score metrics that were adopted by previous studies (Park et al. 2018; Bobra & Couvidat 2015; Huang et al. 2018; Chen et al. 2019).
- 1. Accuracy (ACC) is defined as the ratio of the number of correct predictions, ranging from 0 to 1, where 1 is a perfect accuracy score:
$$\mathrm{ACC}=\frac{TP+TN}{TP+TN+FP+FN}.$$
- 2. Precision (positive predicted value, PPV) measures the ability not to label a negative event as positive, ranging from 0 to 1, where 1 is a perfect precision score:
$$\mathrm{PPV}=\frac{TP}{TP+FP}.$$
- 3. Recall (true-positive rate, TPR) measures the ability to find all positive events, ranging from 0 to 1, where 1 is a perfect recall score:
$$\mathrm{TPR}=\frac{TP}{TP+FN}.$$
- 4. F1-score (F1) measures the ability to find all positive events without misclassifying negative ones. It is the harmonic mean of the precision and recall scores, ranging between 0 and 1, where a value of 1 indicates perfect precision and recall:
$$F1=\frac{2\cdot\mathrm{PPV}\cdot\mathrm{TPR}}{\mathrm{PPV}+\mathrm{TPR}}.$$
- 5. Heidke skill score 1 (HSS1), ranging from −∞ to 1, where 1 is a perfect HSS1 skill score. HSS1 measures the improvement over a model that always predicts negative events (baseline model, HSS1 = 0):
$$\mathrm{HSS1}=\frac{TP+TN-N}{P},\qquad P=TP+FN,\quad N=TN+FP.$$
- 6. Heidke skill score 2 (HSS2), ranging from −1 to 1, where 1 is a perfect HSS2 skill score. HSS2 is a skill score measured relative to a random forecast:
$$\mathrm{HSS2}=\frac{2\,(TP\cdot TN-FN\cdot FP)}{(TP+FN)(FN+TN)+(TP+FP)(FP+TN)}.$$
- 7. True skill score (TSS) measures the difference between the true-positive and false-positive rates and ranges from −1 to 1. The TSS is evaluated as the maximum distance of the receiver operating characteristic (ROC) curve from its diagonal line:
$$\mathrm{TSS}=\frac{TP}{TP+FN}-\frac{FP}{FP+TN}.$$
Table 3. Confusion Matrix Description Table
| | Flare Predicted | No-flare Predicted |
|---|---|---|
| Flare occurred (P) | True positive (TP) | False negative (FN) |
| Flare did not occur (N) | False positive (FP) | True negative (TN) |
Note. Flare occurred (P): row of all the positive events. Flare did not occur (N): row of all the negative events. True positive (TP): the number of positive events predicted by the model as positive. False negative (FN): the number of positive events predicted by the model as negative. False positive (FP): the number of negative events predicted by the model as positive. True negative (TN): the number of negative events predicted by the model as negative.
For our first evaluation, we compare the results for the random and chronological split method types. Figure 7 shows the ROC curve of each split method for all the prediction time frames, separated by solar flare class. In addition, Figure 8 shows the comparison between the different metric skill scores for the two split types, separated by solar flare class. Figure 9 presents a statistical analysis of the different metrics by data split method. Then, in order to create a compatible comparison with previous studies, we followed the data split method suggested by Park et al. (2018), who adopted the chronological split method as more suitable for space weather forecast platforms. Therefore, our current comparison was made solely with the results achieved by training and testing our model with the chronological data set split method. Four recent studies considered state-of-the-art flare-forecasting models were chosen for comparison: Chen et al. (2019), Park et al. (2018), Huang et al. (2018), and Bobra & Couvidat (2015), the majority of which used DNNs, except for Bobra & Couvidat (2015), who used the SVM technique. All the comparisons were made based on the skill score metrics. Figures 10 and 11 show metric visualizations for the M-class and X-class solar flare classifiers compared with the previous four studies, respectively. An overall comparison of all the prediction time frames, metrics, and models is given in Table 4. Moreover, the test evaluation of our model, examining its ability to distinguish whether an M- or X-class solar flare event will occur, using both split method types, is shown in Figure 12 as ROC curve graphs.
Table 4. Performance Comparison for X and M Models Trained with Chronological Split
Columns list the prediction time frame in hours.

| Work | Metric | 1 hr | 3 hr | 6 hr | 12 hr | 24 hr | 48 hr | 72 hr | 96 hr |
|---|---|---|---|---|---|---|---|---|---|
| Current work (X-class prediction) | Accuracy | 0.947 | 0.936 | 0.904 | 0.979 | 0.926 | 0.915 | 0.851 | 0.819 |
| | Precision | 0.92 | 0.918 | 0.913 | 0.959 | 0.935 | 0.933 | 0.851 | 0.895 |
| | Recall | 0.979 | 0.957 | 0.894 | 1.0 | 0.915 | 0.894 | 0.851 | 0.723 |
| | F1 score | 0.948 | 0.938 | 0.903 | 0.979 | 0.925 | 0.913 | 0.851 | 0.8 |
| | HSS | 0.894 | 0.872 | 0.809 | 0.957 | 0.851 | 0.83 | 0.702 | 0.638 |
| | HSS2 | 0.895 | 0.874 | 0.813 | 0.957 | 0.854 | 0.833 | 0.713 | 0.65 |
| | TSS | 0.894 | 0.872 | 0.809 | 0.957 | 0.851 | 0.83 | 0.702 | 0.638 |
| Current work (M-class prediction) | Accuracy | 0.877 | 0.881 | 0.875 | 0.865 | 0.847 | 0.81 | 0.789 | 0.795 |
| | Precision | 0.926 | 0.914 | 0.931 | 0.922 | 0.907 | 0.842 | 0.846 | 0.813 |
| | Recall | 0.819 | 0.841 | 0.809 | 0.797 | 0.773 | 0.763 | 0.708 | 0.767 |
| | F1 score | 0.869 | 0.876 | 0.866 | 0.855 | 0.835 | 0.801 | 0.771 | 0.789 |
| | HSS | 0.753 | 0.761 | 0.75 | 0.73 | 0.694 | 0.62 | 0.579 | 0.59 |
| | HSS2 | 0.759 | 0.768 | 0.755 | 0.736 | 0.703 | 0.637 | 0.597 | 0.611 |
| | TSS | 0.753 | 0.761 | 0.75 | 0.73 | 0.694 | 0.62 | 0.579 | 0.59 |
| Chen et al. 2019 | Precision | 0.93 | 0.93 | 0.91 | 0.92 | 0.89 | 0.88 | 0.86 | ... |
| | Recall | 0.88 | 0.87 | 0.85 | 0.85 | 0.77 | 0.72 | 0.68 | ... |
| | F1 score | 0.9 | 0.9 | 0.88 | 0.88 | 0.83 | 0.79 | 0.76 | ... |
| | HSS | 0.81 | 0.8 | 0.77 | 0.77 | 0.68 | 0.62 | 0.57 | ... |
| | HSS2 | 0.81 | 0.79 | 0.77 | 0.77 | 0.68 | 0.62 | 0.56 | ... |
| | TSS | 0.81 | 0.8 | 0.77 | 0.77 | 0.68 | 0.62 | 0.56 | ... |
| Park et al. 2018 | Accuracy | ... | ... | ... | ... | 0.83 | ... | ... | ... |
| | Recall | ... | ... | ... | ... | 0.85 | ... | ... | ... |
| | HSS2 | ... | ... | ... | ... | 0.63 | ... | ... | ... |
| | TSS | ... | ... | ... | ... | 0.63 | ... | ... | ... |
| Huang et al. 2018 | HSS2 | ... | ... | 0.054 | 0.081 | 0.143 | 0.206 | ... | ... |
| | TSS | ... | ... | 0.662 | 0.632 | 0.662 | 0.621 | ... | ... |
| Bobra and Couvidat 2015 | Accuracy | ... | ... | ... | ... | 0.962 | 0.973 | ... | ... |
| | Precision | ... | ... | ... | ... | 0.69 | 0.797 | ... | ... |
| | Recall | ... | ... | ... | ... | 0.627 | 0.714 | ... | ... |
| | F1 score | ... | ... | ... | ... | 0.656 | 0.751 | ... | ... |
| | HSS | ... | ... | ... | ... | 0.342 | 0.528 | ... | ... |
| | HSS2 | ... | ... | ... | ... | 0.636 | 0.737 | ... | ... |
| | TSS | ... | ... | ... | ... | 0.61 | 0.703 | ... | ... |
Note. Full table of the metric comparison divided by prediction hours and skill scores.
5. Discussion and Conclusions
In this study we designed a 1D CNN for time-series classification as a space weather forecasting tool. The network was trained solely on the GOES X-ray time-series data available for solar cycles 23 and 24. We focused on training two models: one predicting X-class solar flare events and one predicting M-class solar flare events. Both models were trained for different prediction time frames before the event, using two data set split methods: random and chronological (trained on past events and tested on future events). For both split methods, the training and testing sets were kept balanced. The models were evaluated with several skill scores commonly used for binary classification in recent space weather forecasting studies.

For both models and both split methods, performance degrades as the prediction time frame increases. This is expected behavior for a forecasting platform: the farther ahead the forecast, the higher the uncertainty that is introduced (Camporeale 2019). Despite previously reported results regarding the influence of data set separation on forecast performance (Nishizuka et al. 2017), in our study the difference between the chronological and random splits amounts to only 3% for the M-class model and 2% for the X-class model, averaged over all measured skill scores. Moreover, for the M-class model the 3% difference favors the random split, whereas for the X-class model the difference favors the chronological split.

We chose to compare our results using the models trained with the chronological split, as suggested by Park et al. (2018); our M-class model achieves high scores that are comparable with previous works. The work presented by Chen et al.
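The skill scores used in this comparison can be computed directly from the binary confusion matrix. As an illustrative sketch (not taken from the paper's repository), the TSS and the Heidke skill score can be implemented as below, assuming 0/1 label lists; note that the HSS2 variant reported in the table differs from HSS only in the choice of reference forecast, and conventions vary across the flare-forecasting literature:

```python
def confusion_counts(y_true, y_pred):
    """TP, FP, TN, FN for binary 0/1 label sequences (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def tss(y_true, y_pred):
    """True skill statistic: recall minus false-alarm rate.

    1 is a perfect forecast, 0 is no skill; unlike accuracy, the score
    is insensitive to class imbalance.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) - fp / (fp + tn)


def hss(y_true, y_pred):
    """Heidke skill score: improvement over a random reference forecast
    (one common convention in the flare-forecasting literature)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (2.0 * (tp * tn - fp * fn) /
            ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)))
```

On a balanced set, a forecast that always predicts the positive class scores 0 on both TSS and HSS, while a perfect forecast scores 1, which is why these metrics dominate the comparison above rather than raw accuracy.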
(2019) achieves higher results at several prediction time frames for a few skill scores, but their work was based on the random split approach. The other previous studies do not provide all the skill scores at the same prediction time frames, as their main focus was different; considering the skill scores they do report, our M-class model achieves higher TSS values than Park et al. (2018) and Huang et al. (2018) at all prediction time frames. Our M-class model also achieves a higher TSS than Bobra & Couvidat (2015) at the 12 hr time frame, but a lower score at the 24 hr time frame; the same pattern holds for the HSS2 scores. In addition, Park et al. (2018) achieved a better recall score at the 24 hr time frame, and Bobra & Couvidat (2015) achieved a higher accuracy score at both the 12 and 24 hr time frames. On the other hand, our X-class model achieves higher skill scores than all compared studies, except for the accuracy values reported by Bobra & Couvidat (2015).

As a last step, we examined the ability of our model to distinguish between the likelihood of an X-class and an M-class flare occurring at different time frames. The results reveal that the current model can barely discriminate between the two: at the 96 hr time frame, it achieves an area under the ROC curve (AUC) of 0.506 with the random split approach, which is equivalent to a model flipping a fair coin. Distinguishing between solar flare classes requires a more comprehensive and profound study. It matters in particular when a model is trained to predict M- and X-class flares jointly (a binary prediction: the model outputs 1 when either an X- or an M-class flare will occur, and 0 otherwise), because the recall skill score is then calculated over both flare classes and does not truly express the false-negative rate for X-class flares alone.
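The "coin-flip" interpretation of an AUC near 0.5 follows from the rank-statistic definition of the ROC AUC: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal illustrative implementation (not the paper's code) makes this concrete:

```python
import random


def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison (the Mann-Whitney U statistic):
    the probability that a random positive outranks a random negative,
    with ties counted as half a win."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Perfect separation between the classes gives AUC = 1.0.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0

# Scores uncorrelated with the labels give AUC near 0.5,
# i.e. a model no better than flipping a fair coin.
rng = random.Random(0)
labels = [i % 2 for i in range(2000)]
scores = [rng.random() for _ in range(2000)]
print(roc_auc(labels, scores))  # close to 0.5
```

Against this baseline, the 0.506 AUC reported above for X- versus M-class discrimination at 96 hr indicates essentially no class-separating information in the model's output.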