Abstract
Space weather phenomena such as solar flares have massive destructive power when they reach a certain magnitude. Here, we explore a deep-learning approach to build a solar flare-forecasting model, while examining its limitations and feature-extraction ability based on the available Geostationary Operational Environmental Satellite (GOES) X-ray time-series data. We present a multilayer 1D convolutional neural network to forecast the occurrence probability of M- and X-class solar flare events at 1, 3, 6, 12, 24, 48, 72, and 96 hr time frames. The forecasting models were trained and evaluated in two different scenarios: (1) random selection and (2) chronological selection, which were afterward compared in terms of common score metrics. Additionally, we compared our results to state-of-the-art flare-forecasting models. The results indicate that (1) when X-ray time-series data are used alone, the suggested model achieves higher scores for X-class flares and scores similar to previous studies for M-class flares; (2) the two different scenarios obtain opposite results for the X- and M-class flares; and (3) the suggested model, combined with X-ray time series alone, fails to distinguish between M- and X-class magnitude solar flare events. Furthermore, the scores achieved with the suggested method, obtained solely from X-ray time-series measurements, indicate that substantial information regarding solar activity and physical processes is encapsulated in the data, and that augmenting additional data sets, both spatial and temporal, may lead to better predictions, while gaining a comprehensive physical interpretation regarding solar activity. All source codes are available at https://github.com/vladlanda.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
A sudden outburst of electromagnetic radiation originating at the solar surface travels at the speed of light and reaches Earth within 500 s (Liu et al. 2004). These electromagnetic bursts, known as the solar flare phenomenon, emit extreme-ultraviolet (EUV) and X-ray radiation, leading to an ionization effect in the ionospheric D, E, and F2 layers (Sweet 1958; Reuveni & Price 2009; Reuveni et al. 2010). Solar flares can interfere with radio communication systems, affect global navigation satellite systems, neutralize satellite equipment, cause electric power blackouts on Earth, and harm the health of astronauts, and when they reach a very high magnitude they can easily mean a loss exceeding several billion dollars in repairs and months of reconstruction (Marusek 2007; Riswadkar & Dobbins 2010). Therefore, scientists are constantly seeking accurate and consistent tools and methods for predicting where and when solar flares and X-ray bursts are likely to occur (Tóth et al. 2005; Clilverd et al. 2009). However, although our knowledge regarding solar activity and physical processes is constantly improving, attaining real-time solar flare forecasts, similar to our daily atmospheric weather forecasts (Leontiev & Reuveni 2017, 2018; Leontiev et al. 2021), remains an unattained goal so far (Lyutikov et al. 2018), and space technologies remain vulnerable to such threats.
Thus, extracting an accurate and reliable solar flare forecast while considering multiple ranges of time windows is essential for decision makers when protective measures are taken in critical mission situations.
Attempts to construct solar flare forecasts began in the 1930s, when Giovanelli (1939) suggested examining the probability of an eruption taking place based on the sunspot characteristics associated with it. Today, the number of studies that lean on machine-learning (ML) algorithms, considered data-driven approaches, is increasing drastically. ML algorithms such as the support vector machine (SVM), random forest (RF), multilayer perceptron (MLP), and artificial neural network (ANN) have been applied in the field of solar flare prediction. Li et al. (2011) proposed an unsupervised clustering approach, combined with learning vector quantization, based on characteristics extracted from the Solar and Heliospheric Observatory (SOHO)/Michelson Doppler Imager (MDI) data. Yuan et al. (2010) used photospheric magnetic measurements in order to construct an automatic solar flare forecast within a 24 hr window, based on logistic regression and the SVM model. Furthermore, Huang et al. (2013) developed a forecasting model by combining DARAL (the distance between active regions and predicted active longitudes) and solar magnetic field parameters based on an instance-based learning model, while Li & Zhu (2013) used a predictive solar flare model based on the MLP and learning vector quantization, trained on sequential sunspot data for a 48 hr flare prediction. More recently, Muranushi et al. (2015) introduced a fully automated solar flare prediction framework called universal forecast constructor by optimized regression of inputs (UFCORIN) with two integrated regression models: an SVM and a handwritten linear regression algorithm. Nishizuka et al. (2017) examined and compared the performance of the SVM, k-nearest neighbors (k-NN), and extremely randomized trees in predicting the maximum class of flares occurring in the next 24 hr, training the models on vector magnetograms, ultraviolet (UV) emission, and soft X-ray emission, available from the Solar Dynamics Observatory (SDO) and the Geostationary Operational Environmental Satellite (GOES), while Asaly et al. (2021) used ionospheric total electron content (TEC) data as an SVM training set to build a solar flare X- and M-class predictor. Bobra & Couvidat (2015) attempted to forecast M- and X-class solar flare events using an SVM trained with SDO/Helioseismic and Magnetic Imager (HMI) data.
In the past decade, the advancement of ML in the field of deep neural networks, combined with the massive growth of big data and hardware development in graphical processing units, has allowed artificial intelligence (AI) algorithms to achieve human-level performance in computer vision, including image classification, object detection, and segmentation (LeCun et al. 2015; Russakovsky et al. 2015). Moreover, AI has shown promising results in natural language processing, including machine translation, image captioning, and text generation (Vinyals et al. 2015; Wu et al. 2016).
Recently, scientists have applied deep neural networks (DNNs) to space weather predictions, in particular the forecast of solar flares. Nagem et al. (2018) used the GOES X-ray flux 1-minute time-series data for solar flare predictions by integrating three neural networks (NNs): the first NN maps the GOES time series into Markov transition field (MTF) images, the second NN extracts all relevant features from the MTF, and the third network is a convolutional neural network (CNN; LeCun et al. 1998) that generates the prediction. An additional study, performed by Chen et al. (2019), proposed to identify solar flare precursors by automated feature extraction and to classify flare events versus quiet time for active regions, as well as strong (X- and M-class) versus weak (A-, B-, and C-class) events, at time frames of 1, 3, 6, 12, 24, 48, and 72 hr before the event. Two types of models were examined: a CNN and a recurrent neural network based on a long short-term memory cell (Hochreiter & Schmidhuber 1997), trained on multiple data sources: GOES, SDO, and HMI. Park et al. (2018) presented a forecast application based on a CNN model with a binary outcome: 1 or 0 for a daily flare occurrence of the X, M, and C classes. They compared their model with two well-known models, AlexNet (Krizhevsky et al. 2017) and GoogLeNet (Szegedy et al. 2015), by training them using a transfer-learning technique. Finally, Huang et al. (2018) proposed a deep-learning method for learning forecasting patterns from line-of-sight solar active region magnetograms based on the available data from SOHO/MDI and SDO/HMI. Their method forecasts solar flare events at 6, 12, 24, and 48 hr window frames, compared to other state-of-the-art forecasting models.
Here, we propose to use a 1D CNN model, designed as a time-series classifier for a solar flare forecast application, using solely GOES soft X-ray time-series data without hand-crafted features or dedicated data preprocessing, in contrast to previous studies. The suggested model takes as input the GOES X-ray time-series data and outputs the probability of an X-class or M-class flare occurrence. In addition, we also examine the ability of this model design to learn and extract time-series features that distinguish between different solar flare class events.
The outline of the paper is as follows. Data description and preparation are presented in Section 2. The CNN architecture, training, evaluation processes, and overall method are proposed in Section 3. The model performance and comparison are presented in Section 4. The discussion and conclusions follow in Section 5.
2. Data
The solar flux is known to differ over numerous timescales, ranging from minutes to months and decades (Unruh et al. 2008; Reuveni & Price 2009). The fluctuations in the total solar output have been monitored and recorded since the late 1970s (Willson & Hudson 1988), and various measured properties have been presented (Frohlich & Lean 1998; Willson & Mordvinov 2003; Dewitte et al. 2004). While the short-term (minutes to hours) changes are largely dominated by convection currents and solar fluctuations (mainly acoustic and gravity waves), the diurnal to annual changes are due to the occurrence of sunspot regions and variations in the surface magnetic field, coupled with the solar rotation that migrates solar active regions back and forth across the sunlit side of the Earth. Within an 11 yr cycle, sunspots are transported toward the solar equator, while new ones accumulate at high latitudes (Stix 2002).
We use the 1-minute average X-ray (0.1–0.8 nm) time-series data available from the GOES (Schmit et al. 2013) mission. The first GOES (GOES-1) was launched in 1975 by the United States' National Oceanic and Atmospheric Administration (NOAA), and was operated by NOAA's National Weather Satellite, Data, and Information Service division. All GOES mission spacecraft are geosynchronous satellites, located at a height of about 35,800 km, providing a full-disk view of the Earth as well as an unobstructed view of the Sun. The main GOES mission is collecting infrared radiation and visible solar reflection from the Earth's surface and atmosphere using imager equipment, as well as collecting atmospheric temperature, moisture profiles, surface and cloud-top temperatures, and the ozone distribution using sounder equipment. Moreover, GOES spacecraft carry on board a space environment monitor instrument consisting of a magnetometer, an X-ray sensor, a high-energy proton and alpha-particle detector, along with an energetic particle sensor. The X-ray sensor found on board is capable of registering two wavelength bands: 0.05–0.4 nm and 0.1–0.8 nm. The X-ray flux class is defined by the long wave band (0.1–0.8 nm) magnitude as it reaches certain thresholds: 10⁻⁴, 10⁻⁵, and 10⁻⁶ W m⁻² for the X, M, and C classes, respectively. GOES X-ray flux data constitute the main source for confirming a solar flare occurrence, and they are extensively used by previous and current studies, associating flare events with different measured data sources (Chen et al. 2019; Huang et al. 2018). Hence, the GOES X-ray data source can act as a primary base for a forecasting application without introducing additional sources of measured data. In order to form a sequential time-series X-ray data signal ranging from 1998 July to 2019 December, multiple GOES mission sources were used, namely GOES-10, GOES-14, and GOES-15.
The GOES-10 data range from 1998 July to 2009 December, GOES-14 from 2010 January to 2010 December, and GOES-15 from 2011 January to 2019 December. All three data sources were merged into one chronological sequence of 1-minute-averaged X-ray signal, covering almost entirely two solar cycles (cycles 23 and 24), from 1998 July to 2009 December and from 2010 January to 2019 December, respectively.
2.1. Data Normalization, Scaling, and Splitting
Using the X-ray signal magnitudes, we sorted all the X and M solar flare events based on the corresponding thresholds associated with them: 1 × 10⁻⁴ and 1 × 10⁻⁵ W m⁻², respectively. In order to create two separate data sets for the X and M solar flare classes with different prediction frames of 1, 3, 6, 12, 24, 48, 72, and 96 hr, while preserving 48 hr of data as an input to the model, we suggest the following scheme: first, we replaced all the missing values, which appear as "−99999" in the time series, with the GOES-15 nominal minimum value of 1 × 10⁻⁹ W m⁻². This provides a continuous and smooth sequence, free of unexpected negative spikes, which is considered part of the "no event" or "quiet time" sequence. Then, for every solar flare event peak (M or X separately) that was found, we confirmed that no additional higher-magnitude events appeared 12 hr after or 97 hr before the peak (1 hr before the peak plus 96 hr for the prediction frames). As a next step, a no-event frame was selected by choosing a random time point and confirming that no event higher than the M-class threshold appeared 12 hr after or 97 hr before it. Moreover, we required that the total variance of the selected frame not fall below the 1 × 10⁻²⁰ threshold, thus eliminating frames dominated by the nominal minimum value. In this way, an event/no-event data frame has a length of 144 hr: 96 hr for the maximum prediction frame, plus 48 hr as input (to examine the 96 hr prediction). Figure 1 visualizes the data preparation process, and Figure 2 shows 48 hr of data as the model input for M-class versus no-flare events, together with the test and train data sets for the two different data split approaches. The total numbers of event frames found for the X and M classes were 171 and 1522, respectively, while the no-event frame set counted 1057 events; see Tables 1 and 2.
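The frame-selection rules above can be sketched as follows. This is a simplified NumPy sketch under our own naming and peak-handling conventions, not the authors' implementation (their code is at the linked repository):

```python
import numpy as np

def find_event_frames(flux, threshold):
    """Sketch of the event-frame selection above (names and exact
    peak handling are ours). `flux` is the 1-minute X-ray series;
    `threshold` is 1e-4 (X class) or 1e-5 (M class) W m^-2."""
    H = 60                                   # samples per hour
    before, after = 97 * H, 12 * H           # isolation windows around a peak
    frame_len = 144 * H                      # 96 hr prediction gaps + 48 hr input
    frames = []
    i = max(before, frame_len)
    while i < len(flux) - after:
        if flux[i] >= threshold:
            window = flux[i - before:i + after]
            # keep the event only if no higher-magnitude value appears
            # 97 hr before or 12 hr after it
            if window.max() <= flux[i]:
                frames.append(flux[i - frame_len:i])
            i += after                       # skip past this event
        else:
            i += 1
    return frames
```

A no-event frame would be drawn analogously at a random time point, with the additional variance check described above.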
Each event frame set is split into training and testing sets by two different approaches: the simple random sampling (SRS) approach and the chronological approach. The SRS approach splits each set into a training and a testing set by selecting samples with a uniform distribution, i.e., each sample in the set has an equal probability of being selected for the training or the testing set. This approach is most commonly used in similar applications and leads to low biases when applied to balanced data (Reitermanov 2010). For the chronological approach, we followed the data-splitting method suggested by Park et al. (2018), who noted that the data-splitting method might influence the forecasting performance of the model, where the random selection approach can inflate its performance because training and testing events might be chosen from adjacent time periods. Thus, a comparison of the model performance trained with the same data but with different splitting techniques is meaningful. We therefore selected the data ranging from 1998 July to 2009 December as the training set and the data from 2010 January to 2019 December as the testing set. In this case, the training set consisted of events solely from solar cycle 23 and the testing set solely from solar cycle 24, ensuring that the testing events appear chronologically after the training events. In addition, every training and testing data set was structured with an even flare/no-flare number of events, based on the event type with the smallest count (when compared between the X and M classes). For both splitting approaches, the data are scaled by 1 × 10⁹ in order to normalize the minimum value to 1.0 W m⁻². Afterward, we applied the natural logarithm to the resulting sequence, narrowing the range of values to [0, ln(10⁶)], as the maximum nominal value of the GOES-15 data set is 1 × 10⁻³ W m⁻².
Finally, we applied a standard normalization procedure, which shifts the mean value to 0 and scales the variance to 1, to the training and testing sets, based on the normalization parameters obtained from the training sets.
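The normalization chain above (gap filling, scaling, logarithm, standardization) can be sketched in NumPy; the function name is ours, and only the constants stated in the text are used:

```python
import numpy as np

def preprocess(flux, train_stats=None):
    """Sketch of the normalization chain above (function name is ours):
    fill the -99999 gaps, scale, take the natural log, then standardize
    with statistics fitted on the training set only."""
    x = np.asarray(flux, dtype=np.float64).copy()
    # Missing samples are flagged as -99999; replace them with the
    # GOES-15 nominal minimum of 1e-9 W m^-2 ("quiet time" level).
    x[x == -99999] = 1e-9
    # Scale by 1e9 so the minimum maps to 1.0, then take the natural log;
    # values now lie in [0, ln(1e6)] since the nominal maximum is 1e-3 W m^-2.
    x = np.log(x * 1e9)
    if train_stats is None:                  # fit mean/std on the training set
        train_stats = (x.mean(), x.std())
    mean, std = train_stats
    return (x - mean) / std, train_stats     # zero mean, unit variance
```

The fitted `train_stats` pair is reused verbatim on the test set, as described above.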
Table 1. Number of X, M, and No Events for SRS Split
| Classes | Train 50% (X vs M vs No) | Test 30% (X vs M vs No) | Validation 20% (X vs M vs No) | Train 50% (M vs No) | Test 30% (M vs No) | Validation 20% (M vs No) | Available | Extra |
|---|---|---|---|---|---|---|---|---|
| X class | 84 | 51 | 36 | ... | ... | ... | 171 | 0 |
| M class | 84 | 51 | 36 | 443 | 265 | 177 | 1522 | 294 |
| No event | 84 | 51 | 36 | 443 | 265 | 177 | 1057 | 0 |
| Total | 513 (X vs M vs No) | | | 1772 (M vs No) | | | | |
Note. This table shows the number of available X-class, M-class, and no-event frames with the SRS split approach, based on first pulling 30% as the test set; the remainder is then divided into 70% for training and 30% for validation.
Table 2. Number of X, M, and No Events for Chronological Split
| Classes | Train 70%, Cycle 23 (X vs M vs No) | Test, Cycle 24 (X vs M vs No) | Validation 30%, Cycle 23 (X vs M vs No) | Train 70%, Cycle 23 (M vs No) | Test, Cycle 24 (M vs No) | Validation 30%, Cycle 23 (M vs No) | Available, Cycle 23 | Extra, Cycle 23 | Available, Cycle 24 | Extra, Cycle 24 |
|---|---|---|---|---|---|---|---|---|---|---|
| X class | 86 | 47 | 37 | ... | ... | ... | 124 | 0 | 47 | 0 |
| M class | 86 | 47 | 37 | 196 | 503 | 84 | 972 | 569 | 550 | 0 |
| No event | 86 | 47 | 37 | 196 | 503 | 84 | 405 | 0 | 656 | 106 |
| Total | 510 (X vs M vs No) | | | 1566 (M vs No) | | | | | | |
Note. This table shows the number of available X-class, M-class, and no-event frames for the chronological split approach, with cycle 24 as the test set and cycle 23 divided into 70% for training and 30% for validation.
3. Method
CNN models have shown human-level performance in the field of computer vision and image processing, and are currently being deployed in autonomous cars, flying drone systems, autonomous robotics, gaming, and medicine. The core layers of these models are the convolutional layers, consisting of several filters, also referred to as kernels, whose number and shape are defined prior to the training process by the hyperparameters. When input data (a tensor) are passed into the convolutional layer, every kernel of the layer is convolved with the input tensor, generating a feature map. A general case of a discrete 2D convolution is given by the following equation:
$$y_{k}^{l}(i,j)=\sum_{n=-N}^{N}\sum_{m=-M}^{M}w_{k}^{l}(n,m)\,x^{l-1}(i-n,\,j-m),$$

where $y_{k}^{l}(i,j)$ is the feature map k of a layer l at index i, j, $x^{l-1}$ is the output from the previous layer l − 1, which becomes the input to the current layer l, and $w_{k}^{l}$ is the kernel, with a size of (2N + 1) × (2M + 1). Two additional hyperparameters are the stride S, which defines the kernel move step along the input tensor, and the padding P, which pads the input boundary.
Because the convolution operation is linear and a CNN is a deep stack of linear combination layers, similar to DNNs it is also designed with activation functions that allow modeling a nonlinear mapping from an input domain into an output domain. The CNN model architecture often includes the rectified linear unit (ReLU; Fukushima & Miyake 1982; Nair & Hinton 2010), which is described as follows:

$$\mathrm{ReLU}(x)=\max(0,x).$$
Furthermore, the CNN feature maps encapsulate the spatial features found in the input tensor, associated with the kernel values that are learned during the training process. In general, the spatial features that describe an input sample are not necessarily grouped together in one location, but rather might be spread over different locations; therefore, capturing those features can lead to better performance. Thus, CNN models include pooling layers that pool information (based on the pooling layer type) from the feature maps after the activation function is applied. Only a few pooling layer types are used in CNN models, where one of the most popular is the max pooling layer, defined by the following expression:
$$p_{k}^{l}(i,j)=\max_{0\le n<N,\;0\le m<M}a_{k}^{l}(iS+n,\,jS+m),$$

where $p_{k}^{l}(i,j)$ is the pooling tensor k of layer l at index i, j, operating on a feature map $a_{k}^{l}$ with a max pooling kernel of size N × M and stride S, which are defined by the hyperparameters.
A general classification CNN model (Fawaz et al. 2019) consists of stacks of layers one after another, such that each convolutional layer operates on the output tensor of the previous layer. The result is then passed through the activation function, converting it into a feature map, from which the pooling layer pools spatial information. At the end of the model architecture there are a few fully connected layers, followed by the softmax activation function, which is defined as
$$\sigma(\mathbf{x})_{i}=\frac{e^{x_{i}}}{\sum_{j=1}^{K}e^{x_{j}}},\qquad i=1,\ldots,K,$$

where x is an input vector of real numbers and K is the number of categories, mapping the processed input into the output domain of categorical probabilities (Figure 3).
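The three building blocks (ReLU, max pooling, softmax) can be illustrated numerically on a short 1D signal; the toy values below are our own:

```python
import numpy as np

# Toy 1D activation values (ours) passed through the three building blocks.
x = np.array([1.0, -2.0, 3.0, -1.0, 2.0, -3.0])

relu = np.maximum(0, x)                 # ReLU: max(0, x) elementwise

# Max pooling with kernel size 2 and stride 2: keep the max of each pair.
pooled = relu.reshape(-1, 2).max(axis=1)

def softmax(v):
    e = np.exp(v - v.max())             # shift by the max for numerical stability
    return e / e.sum()

probs = softmax(pooled)                 # categorical probabilities, sum to 1
```

Here `relu` zeroes the negative entries, `pooled` reduces the six values to three, and `probs` maps them to a probability vector, mirroring the equations above.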
3.1. Model Architecture
We used a general CNN architecture in order to develop a time-series classification model as a solar flare-forecasting tool. In contrast to the general CNN model, which takes a 2D image as input, the X-ray time-series data are 1D; hence, we designed a 1D CNN based on the general CNN case (Wang et al. 2017). Our model (Figure 4) consists of four convolutional layers, each followed by a ReLU activation function, four max pooling layers, a fully connected layer, and an output layer with a softmax activation function. In addition, every max pooling layer is followed by a dropout layer, with a dropout probability of 10% (0.1 is the hyperparameter value), for regularization and the avoidance of model overfitting (Srivastava et al. 2014).
- 1. The first convolutional layer (conv1) has 64 feature maps, a kernel of size 1 × 30, and a stride of 1; the total conv1 size is 1 × 2880 × 64. The following max pooling layer has kernels of size 1 × 15 with a stride of 15 and a shape of 1 × 192 × 64.
- 2. The second convolutional layer (conv2) has 256 feature maps, kernels of size 1 × 15, and a stride of 1; the total conv2 size is 1 × 192 × 256, and its max pooling layer has kernels of size 1 × 5 with a stride of 5 and a shape of 1 × 39 × 256.
- 3. The third convolutional layer (conv3) has 512 feature maps, kernels of size 1 × 5, and a stride of 1; the total conv3 size is 1 × 39 × 512. The following max pooling layer has kernels of size 1 × 3, with a stride of 3 and a shape of 1 × 13 × 512.
- 4. The final convolutional layer (conv4) has 512 feature maps, kernels of size 1 × 3, and a stride of 1; the total conv4 size is 1 × 13 × 512, and its max pooling layer has kernels of size 1 × 3, with a stride of 3 and a shape of 1 × 5 × 512.
At the end of the model architecture, a flattening layer flattens the output of the last max pooling layer from 1 × 5 × 512 into 2560 × 1, connecting it to the output layer of size 2 × 1, making it a fully connected layer. The output layer passes through the softmax activation function to map the output into the categorical probability space.
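Under the shapes listed above, the architecture can be sketched in PyTorch. This is our illustrative reconstruction (the authors' code is in the linked repository), assuming "same" convolution padding and ceil-mode pooling, which reproduce the stated 192/39/13/5 sequence lengths:

```python
import torch
import torch.nn as nn

class FlareCNN(nn.Module):
    """Sketch of the four-block 1D CNN described above (our PyTorch
    reconstruction; padding/ceil-mode choices are our assumptions)."""
    def __init__(self):
        super().__init__()
        def block(c_in, c_out, k, pool):
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=k, stride=1, padding="same"),
                nn.ReLU(),
                nn.MaxPool1d(pool, stride=pool, ceil_mode=True),
                nn.Dropout(0.1),            # 10% dropout after each pooling layer
            )
        self.features = nn.Sequential(
            block(1, 64, 30, 15),     # length 2880 -> 192
            block(64, 256, 15, 5),    # 192 -> 39
            block(256, 512, 5, 3),    # 39 -> 13
            block(512, 512, 3, 3),    # 13 -> 5
        )
        self.head = nn.Linear(5 * 512, 2)   # 2560 -> flare / no-flare logits

    def forward(self, x):                   # x: (batch, 1, 2880)
        z = self.features(x).flatten(1)     # -> (batch, 2560)
        return self.head(z)                 # softmax is applied in the loss
```

The head returns raw logits; the softmax mapping to categorical probabilities is folded into the cross entropy loss during training.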
3.2. Data Preparation and Training
In order to cover a wide range of prediction time frames, different data split methods, and solar flare class types, we trained the individual model for each combination of the following categories:
- 1. Solar flare class type: X-class flare versus no flare, or M-class flare versus no flare (two in total).
- 2. Data split type: chronological or random (two in total).
- 3. Prediction time frame: 1, 3, 6, 12, 24, 48, 72, and 96 hr (eight in total).
We ended up training the architecture with 32 different configurations, one for each combination (2 × 2 × 8 = 32). In addition, we investigated the ability of the proposed CNN architecture to distinguish between M- and X-class solar flares for each time frame from category 3 and data split type from category 2, leading to an additional 16 trained models (2 × 8 = 16). In order to train our model on the various prediction time frames, we pulled a 48 hr window of data, shifted by the prediction gap, out of the available 144 hr in the event frame, forming training and testing sets with a range of 2880 minutes (Figure 5). In total, we trained 48 individual models with the following hyperparameters: we used the Adam (Kingma & Ba 2014) optimizer with a learning rate of 3 × 10⁻⁵. We also adopted the cross entropy loss function for the training procedure. The cross entropy loss function is given by the following formula:
$$L(y,\hat{y})=-\sum_{i=1}^{m}y_{i}\log(\hat{y}_{i}),$$

where y is the ground-truth one-hot encoded vector of size m, and $\hat{y}$ is the model output prediction vector, also of size m, encoded with probability entries that sum up to 1. Further, we used a mini-batch size of 16 and 75 epochs in total. In addition, an early stopping mechanism (Prechelt 1998) was added to the training process to snapshot the model weights once the validation set loss reached a new minimum value; see Figure 6 for the validation set loss graphs.
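The training procedure above can be sketched as the following loop. This is our hedged reconstruction: the Adam learning rate, cross entropy loss, batch size of 16, and 75-epoch cap come from the text, while the `patience` value and checkpointing details are our assumptions (the paper only cites early stopping):

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=75, patience=10):
    """Sketch of the training loop described above; `patience` is our
    assumption, the other hyperparameters are taken from the text."""
    opt = torch.optim.Adam(model.parameters(), lr=3e-5)
    loss_fn = nn.CrossEntropyLoss()          # softmax + cross entropy on logits
    best_loss, best_state, stale = float("inf"), None, 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:            # mini-batches of 16 samples
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        if val < best_loss:                  # snapshot at a new loss minimum
            best_loss, stale = val, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            stale += 1
            if stale >= patience:            # early stopping
                break
    model.load_state_dict(best_state)        # restore the best snapshot
    return model
```

One such model would be trained per configuration (48 in total), each on its own class type, split method, and prediction time frame.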
4. Results and Comparison
A classifier is evaluated based on the statistical scores it achieves with the test set. The scores are calculated according to the confusion matrix (Fawcett 2006), which measures the performance of a machine-learning algorithm based on four different combinations of the predicted and actual (ground truth) values; see Table 3. Here, we followed seven commonly used score metrics that were adopted by previous studies (Park et al. 2018; Bobra & Couvidat 2015; Huang et al. 2018; Chen et al. 2019).
- 1. Accuracy (ACC) is defined as the ratio of the number of correct predictions, ranging from 0 to 1, where 1 is a perfect accuracy score:
$$\mathrm{ACC}=\frac{TP+TN}{TP+TN+FP+FN}.$$
- 2. Precision (positive predicted value, PPV) measures the ability not to label a negative event as positive, ranging from 0 to 1, where 1 is a perfect precision score:
$$\mathrm{PPV}=\frac{TP}{TP+FP}.$$
- 3. Recall (true-positive rate, TPR) measures the ability to find all positive events, ranging from 0 to 1, where 1 is a perfect recall score:
$$\mathrm{TPR}=\frac{TP}{TP+FN}.$$
- 4. F1-score (F1) measures the ability to find all positive events without misclassifying negative ones. It is the harmonic mean of the precision and recall scores, ranging between 0 and 1, where a value of 1 indicates perfect precision and recall:
$$F1=\frac{2\cdot\mathrm{PPV}\cdot\mathrm{TPR}}{\mathrm{PPV}+\mathrm{TPR}}.$$
- 5. Heidke skill score 1 (HSS1), ranging from −∞ to 1, where 1 is a perfect HSS1 skill score. HSS1 measures the improvement over a model that always predicts negative events (baseline model, HSS1 = 0):
$$\mathrm{HSS1}=\frac{TP+TN-N}{P},\qquad P=TP+FN,\quad N=TN+FP.$$
- 6. Heidke skill score 2 (HSS2), ranging from −1 to 1, where 1 is a perfect HSS2 skill score. HSS2 is a skill score measured relative to a random forecast:
$$\mathrm{HSS2}=\frac{2\,(TP\cdot TN-FN\cdot FP)}{(TP+FN)(FN+TN)+(TP+FP)(FP+TN)}.$$
- 7. True skill score (TSS) measures the difference between the true-positive and false-positive rates and ranges from −1 to 1. The TSS is evaluated as the maximum distance of the receiver operating characteristic (ROC) curve from its diagonal line:
$$\mathrm{TSS}=\frac{TP}{TP+FN}-\frac{FP}{FP+TN}.$$
Table 3. Confusion Matrix Description Table
| | Flare Predicted | No-flare Predicted |
|---|---|---|
| Flare occurred (P) | True positive (TP) | False negative (FN) |
| Flare did not occur (N) | False positive (FP) | True negative (TN) |
Note. Flare occurred (P): row of all the positive events. Flare did not occur (N): row of all the negative events. True positive (TP): the number of positive events predicted by the model as positive. False negative (FN): the number of positive events predicted by the model as negative. False positive (FP): the number of negative events predicted by the model as positive. True negative (TN): the number of negative events predicted by the model as negative.
For our first evaluation, we compare the results for the random and chronological split method types. Figure 7 shows the ROC curve of each split method for all the prediction time frames, separated by solar flare class. In addition, Figure 8 shows the comparison between the different metric skill scores for the two split types, separated by solar flare class. Figure 9 presents a statistical analysis of the different metrics by data split method. Then, in order to create a compatible comparison with previous studies, we followed the data split method suggested by Park et al. (2018), who adopted the chronological split method as more suitable for space weather forecast platforms. Therefore, our current comparison was made solely with the results achieved by training and testing our model with the chronological data set split method. Four recent studies considered state-of-the-art flare-forecasting models were chosen for comparison: Chen et al. (2019), Park et al. (2018), Huang et al. (2018), and Bobra & Couvidat (2015), the majority of which used DNNs, except for Bobra & Couvidat (2015), who used the SVM technique. All the comparisons were made based on the skill score metrics. Figures 10 and 11 show metric visualizations for the M-class and X-class solar flare classifiers compared with the previous four studies, respectively. An overall comparison of all the prediction time frames, metrics, and models is given in Table 4. Moreover, the test evaluation of our model, examining its ability to distinguish whether an M- or X-class solar flare event will occur, using both split method types, is shown in Figure 12 as ROC curve graphs.
Table 4. Performance Comparison for X and M Models Trained with Chronological Split
Columns list the prediction time frame in hours.

| Work | Metric | 1 hr | 3 hr | 6 hr | 12 hr | 24 hr | 48 hr | 72 hr | 96 hr |
|---|---|---|---|---|---|---|---|---|---|
| Current work (X-class prediction) | Accuracy | 0.947 | 0.936 | 0.904 | 0.979 | 0.926 | 0.915 | 0.851 | 0.819 |
| | Precision | 0.92 | 0.918 | 0.913 | 0.959 | 0.935 | 0.933 | 0.851 | 0.895 |
| | Recall | 0.979 | 0.957 | 0.894 | 1.0 | 0.915 | 0.894 | 0.851 | 0.723 |
| | F1 score | 0.948 | 0.938 | 0.903 | 0.979 | 0.925 | 0.913 | 0.851 | 0.8 |
| | HSS | 0.894 | 0.872 | 0.809 | 0.957 | 0.851 | 0.83 | 0.702 | 0.638 |
| | HSS2 | 0.895 | 0.874 | 0.813 | 0.957 | 0.854 | 0.833 | 0.713 | 0.65 |
| | TSS | 0.894 | 0.872 | 0.809 | 0.957 | 0.851 | 0.83 | 0.702 | 0.638 |
| Current work (M-class prediction) | Accuracy | 0.877 | 0.881 | 0.875 | 0.865 | 0.847 | 0.81 | 0.789 | 0.795 |
| | Precision | 0.926 | 0.914 | 0.931 | 0.922 | 0.907 | 0.842 | 0.846 | 0.813 |
| | Recall | 0.819 | 0.841 | 0.809 | 0.797 | 0.773 | 0.763 | 0.708 | 0.767 |
| | F1 score | 0.869 | 0.876 | 0.866 | 0.855 | 0.835 | 0.801 | 0.771 | 0.789 |
| | HSS | 0.753 | 0.761 | 0.75 | 0.73 | 0.694 | 0.62 | 0.579 | 0.59 |
| | HSS2 | 0.759 | 0.768 | 0.755 | 0.736 | 0.703 | 0.637 | 0.597 | 0.611 |
| | TSS | 0.753 | 0.761 | 0.75 | 0.73 | 0.694 | 0.62 | 0.579 | 0.59 |
| Chen et al. 2019 | Precision | 0.93 | 0.93 | 0.91 | 0.92 | 0.89 | 0.88 | 0.86 | ... |
| | Recall | 0.88 | 0.87 | 0.85 | 0.85 | 0.77 | 0.72 | 0.68 | ... |
| | F1 score | 0.9 | 0.9 | 0.88 | 0.88 | 0.83 | 0.79 | 0.76 | ... |
| | HSS | 0.81 | 0.8 | 0.77 | 0.77 | 0.68 | 0.62 | 0.57 | ... |
| | HSS2 | 0.81 | 0.79 | 0.77 | 0.77 | 0.68 | 0.62 | 0.56 | ... |
| | TSS | 0.81 | 0.8 | 0.77 | 0.77 | 0.68 | 0.62 | 0.56 | ... |
| Park et al. 2018 | Accuracy | ... | ... | ... | ... | 0.83 | ... | ... | ... |
| | Recall | ... | ... | ... | ... | 0.85 | ... | ... | ... |
| | HSS2 | ... | ... | ... | ... | 0.63 | ... | ... | ... |
| | TSS | ... | ... | ... | ... | 0.63 | ... | ... | ... |
| Huang et al. 2018 | HSS2 | ... | ... | 0.054 | 0.081 | 0.143 | 0.206 | ... | ... |
| | TSS | ... | ... | 0.662 | 0.632 | 0.662 | 0.621 | ... | ... |
| Bobra and Couvidat 2015 | Accuracy | ... | ... | ... | ... | 0.962 | 0.973 | ... | ... |
| | Precision | ... | ... | ... | ... | 0.69 | 0.797 | ... | ... |
| | Recall | ... | ... | ... | ... | 0.627 | 0.714 | ... | ... |
| | F1 score | ... | ... | ... | ... | 0.656 | 0.751 | ... | ... |
| | HSS | ... | ... | ... | ... | 0.342 | 0.528 | ... | ... |
| | HSS2 | ... | ... | ... | ... | 0.636 | 0.737 | ... | ... |
| | TSS | ... | ... | ... | ... | 0.61 | 0.703 | ... | ... |
Note. Full table of the metric comparison divided by prediction hours and skill scores.
5. Discussion and Conclusions
In this study we designed a 1D CNN for time-series classification as a space weather forecasting tool. The network was trained solely on the GOES X-ray time-series data available for solar cycles 23 and 24. We focused on training two models: one predicting X-class solar flare events and one predicting M-class solar flare events. Both models were trained for different prediction time frames before the event, using two data set split methods: random and chronological (trained on past events and tested on future events). For both split methods, the training and testing sets were kept balanced. The models were evaluated with several skill scores commonly used for binary classification in recent space weather forecasting studies.

For both models and both split methods, performance degrades as the prediction time frame increases. This is expected behavior for a forecasting platform: the farther ahead the forecast, the higher the uncertainty that is introduced (Camporeale 2019). Despite previously reported results regarding the influence of data set separation on forecast performance (Nishizuka et al. 2017), in our study the difference between the chronological and random splits amounts to only 3% for the M-class model and 2% for the X-class model, averaged over all measured skill scores. Moreover, for the M-class model the 3% difference favors the random split, whereas for the X-class model the difference favors the chronological split.

We chose to compare our results using the models trained with the chronological split, as suggested by Park et al. (2018); our M-class model achieves high scores that are comparable with previous works. The work presented by Chen et al.
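The skill scores used in this comparison can be computed directly from the binary confusion matrix. As an illustrative sketch (not taken from the paper's repository), the TSS and the Heidke skill score can be implemented as below, assuming 0/1 label lists; note that the HSS2 variant reported in the table differs from HSS only in the choice of reference forecast, and conventions vary across the flare-forecasting literature:

```python
def confusion_counts(y_true, y_pred):
    """TP, FP, TN, FN for binary 0/1 label sequences (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def tss(y_true, y_pred):
    """True skill statistic: recall minus false-alarm rate.

    1 is a perfect forecast, 0 is no skill; unlike accuracy, the score
    is insensitive to class imbalance.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) - fp / (fp + tn)


def hss(y_true, y_pred):
    """Heidke skill score: improvement over a random reference forecast
    (one common convention in the flare-forecasting literature)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (2.0 * (tp * tn - fp * fn) /
            ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)))
```

On a balanced set, a forecast that always predicts the positive class scores 0 on both TSS and HSS, while a perfect forecast scores 1, which is why these metrics dominate the comparison above rather than raw accuracy.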
(2019) achieves higher results at several prediction time frames for a few skill scores, but their work was based on the random split approach. The other previous studies do not provide all the skill scores at the same prediction time frames, as their main focus was different; considering the skill scores they do report, our M-class model achieves higher TSS values than Park et al. (2018) and Huang et al. (2018) at all prediction time frames. Our M-class model also achieves a higher TSS than Bobra & Couvidat (2015) at the 12 hr time frame, but a lower score at the 24 hr time frame; the same pattern holds for the HSS2 scores. In addition, Park et al. (2018) achieved a better recall score at the 24 hr time frame, and Bobra & Couvidat (2015) achieved a higher accuracy score at both the 12 and 24 hr time frames. On the other hand, our X-class model achieves higher skill scores than all compared studies, except for the accuracy values reported by Bobra & Couvidat (2015).

As a last step, we examined the ability of our model to distinguish between the likelihood of an X-class and an M-class flare occurring at different time frames. The results reveal that the current model can barely discriminate between the two: at the 96 hr time frame, it achieves an area under the ROC curve (AUC) of 0.506 with the random split approach, which is equivalent to a model flipping a fair coin. Distinguishing between solar flare classes requires a more comprehensive and profound study. It matters in particular when a model is trained to predict M- and X-class flares jointly (a binary prediction: the model outputs 1 when either an X- or an M-class flare will occur, and 0 otherwise), because the recall skill score is then calculated over both flare classes and does not truly express the false-negative rate for X-class flares alone.
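The "coin-flip" interpretation of an AUC near 0.5 follows from the rank-statistic definition of the ROC AUC: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal illustrative implementation (not the paper's code) makes this concrete:

```python
import random


def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison (the Mann-Whitney U statistic):
    the probability that a random positive outranks a random negative,
    with ties counted as half a win."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Perfect separation between the classes gives AUC = 1.0.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0

# Scores uncorrelated with the labels give AUC near 0.5,
# i.e. a model no better than flipping a fair coin.
rng = random.Random(0)
labels = [i % 2 for i in range(2000)]
scores = [rng.random() for _ in range(2000)]
print(roc_auc(labels, scores))  # close to 0.5
```

Against this baseline, the 0.506 AUC reported above for X- versus M-class discrimination at 96 hr indicates essentially no class-separating information in the model's output.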