Data augmentation for self-paced motor imagery classification with C-LSTM

Daniel Freer; Guang-Zhong Yang

doi:10.1088/1741-2552/ab57c0

1. Introduction

In situations where other control methods are difficult or impossible, a brain–computer interface (BCI) can be used to get a high-level understanding of the user's intent. This control signal could be used with an assistive robot to aid stroke patients in their daily lives, or to help workers in extreme environments such as space. In these cases, much of the robotic control can be done autonomously, but certain commands from a user are still necessary, such as initiating a specific task or giving directional commands. The most commonly utilised BCI is electroencephalograpy (EEG) because of its non-invasiveness and temporal resolution, though the signal is noisy and has low spatial resolution. To apply EEG to robotic control, different mental strategies are employed by the user, one of which is called 'motor imagery' (MI), which is typically imagining the opening and closing of your left or right hand, or similar imagined movements [1, 2].

While many MI decoding methods have been developed that can both achieve practical results [3] and improve our understanding of the brain [4], very few studies have considered whether state-of-the-art algorithms are applicable for self-paced control with window sizes smaller than 2 s. While using large window sizes may not be a problem in many cases, processing in longer windows of time may increase the complexity of the desired response in that particular time window. For example, imagine a situation where a person is trying to move a robot to the right, but they go too far and need to quickly correct to the left. If you use a large window size, the input will mostly include data from the 'right' class, which could contaminate the results. Even if you use a classifier that learns how features are changing over time, in order for it to reliably recognize this more complex trajectory, you may need to specifically provide this in the training data. For this reason, minimising the time window of considered data is a low-cost solution to achieve seamless robotic control with negligible delay. Assistive robotic control in these contexts should feel smooth to the user, with a quick response to changes in control commands.

Another limitation of using MI control for daily assistance is the length and intensity of training protocols. For four class MI, training protocols can require 8–15 experimental sessions [3], while for one study investigating MI control of 3D movement, subjects used up to 50 experimental sessions amounting to more than 20 hours of training per subject in some cases [5]. Such arduous training protocols are not ideal for easy translation to home-based assistive technology, so any methods to minimize this are welcomed. While some groups have investigated adaptive protocols to improve training [6], others have posited that the generation of realistic artificial EEG signals of a subject through data augmentation may be more useful [7, 8].

This paper aims to (i) present a new solution to the problem of MI classification for application to self-paced control; (ii) assess the robustness of state-of-the-art MI classification methods to shrinking window size; and (iii) investigate the effects of data augmentation strategies that can enhance EEG classification. The next section of this paper will highlight work in the field that is related to our experimental implementation. After this, we will describe how we prepared the data from BCI Competition IV Dataset 2a for classification, implemented each classifier, and augmented the data to achieve notable results. After comparing the robustness of these classifiers to smaller and overlapping windows with four-class data, the paper then shifts its focus to self-paced MI tasks by considering all of the resting data from this dataset as an additional class. This presents various problems such as data imbalance. To combat these issues, various data augmentation techniques will be explored, which could additionally decrease the amount of training data needed for an individual user, making training protocols less strenuous.

2. Related work

2.1. Motor imagery classifiers

MI tasks can be distinguished through the synchronisation or desynchronisation of rhythms in the mu (7–13 Hz) and beta (13–30 Hz) bands on either side of the motor cortex [9]. To counteract the large amount of noise in EEG signals, MI tasks have typically been processed in windows of time. Initial efforts to do this used power features of the signal on a small number of electrodes [10] or applied hand-made spatial filters to the electrodes [11] to simplify the classification problem. Similar methods are still used in current real-time application of BCI technology to control robots [3]. However, these studies still typically limit the number of classes to two or three, or divide a complex task into multiple steps, which is not ideal for practical real-time control in many cases.

The next series of classifiers for distinguishing MI tasks utilised the common spatial patterns (CSP) algorithm, which learns the proper spatial filters to use for a particular subject [12]. However, this method still requires explicit selection of temporal filters for each subject. To ameliorate this problem, the winner of the BCI Competition IV [13] in 2008 performed CSP on various band-pass filtered representations of the data, or filter banks, resulting in the name filter bank CSP (FBCSP) [14, 15]. After this, mutual-information-based feature selection was performed before using a standard classifier such as linear discriminant analysis (LDA), though many feature selection algorithms and classifiers have been investigated for these final steps.

Another recent classifier made use of spatial covariance matrices and calculated distances in Riemannian rather than Euclidian space [16, 17]. This effective and straightforward method of MI classification caught the attention of the BCI community by winning the NER 2015 BCI Competition. Because this is compared to our proposed method, it will be discussed in more detail in section 3.2.1. In addition, Deep Learning models have recently been applied to EEG signals, which has both improved accuracy in some cases [18] and provided new ways to visualise the features learned from our brain data [4].

2.2. Deep learning

Deep Learning has only begun to be explored in the BCI community in recent years, but various studies related to error detection, memory, seizure detection, and other applications have been conducted in addition to MI classification [4]. The review by Roy et al [19] estimates that 21 of the 136 papers related to deep learning with BCI have pertained to MI detection. For all of these applications, convolutional neural networks (CNN) have been the most commonly proposed solution [9, 18–22], though several studies have utilised other machine learning concepts such as autoencoders, restricted Boltzmann machines [23], or recurrent layers, including long short-term memory (LSTM) layers [24, 25]. Recent review papers [19, 21, 26] have more fully investigated the use of deep learning concepts for EEG processing, so interested parties can refer to these for a more in-depth understanding.

CNNs are a class of neural networks that make use of relevant spatial or temporal relationships between neighbouring datapoints to infer which features are useful to a given machine learning task. This technique greatly reduces the number of connections and parameters in a deep network while achieving similar performance, allowing for larger, deeper, and more efficient networks to be developed [27]. Deep CNNs have been used heavily in image processing after surpassing the best of traditional techniques in image classification tasks [27, 28]. EEG data, like images, contain a wealth of features that are difficult to define by hand. Also like images, each EEG data point has a relationship to its spatial and temporal neighbours, though this relationship is not as clearly defined or understood as with images. For these reasons, BCI using EEG is an intuitive application for a CNN, in that CNNs have the ability to extract features from brain signals that are both known and yet to be discovered. Initial application of CNNs to EEG signals have attempted to directly use the time-domain signals [4, 29], while others utilised the frequency domain for input [9, 30].

Another common method in deep learning is the use of recurrent layers, which take the previous prediction layer of a given data segment as additional inputs for the next segment. This innovation planted the concept of utilising how features within a neural network change over time. With the addition of forget, memory, and output logic gates, the LSTM module was created, which solved many of the problems with other recurrent neural networks, especially with respect to exploding or vanishing gradients [31]. Recurrent networks have perhaps most commonly been used for language processing [32, 33], including speech recognition. Human speech, like EEG data, is a biological signal in which frequency components are among the most indicative features for classification. For this reason, applying similar methods to EEG signal processing is intuitive.

In addition, LSTM layers have more recently been combined with convolutional layers to complete tasks such as caption generation for images [34]. Similar networks have been used for video classification [35], which is analogous to EEG processing, as EEG data have previously been characterised as a series of images [24]. Very few papers have investigated the combination of convolutional and recurrent layers in EEG signal processing. The method formulated by Zhang et al [25] was an example of this, but the method was not robust to changes in data preprocessing in preliminary experiments conducted by the authors of this manuscript. In particular, their model appears to overfit to data which was overlapping between training and test sets, resulting in astoundingly high accuracies (98%). With clearer separation between the training and testing sets, we found their model to achieve approximately 36% accuracy on the four-class problem.

Several recent studies have also investigated the use of data augmentation strategies with EEG data. Data augmentation and artificial data generation have been shown to greatly improve the performance of deep neural networks, as in the context of image classification [27]. Methods of artificial data generation have ranged from the simple addition of Gaussian noise [7] to the use of deep generative adversarial networks (GAN) [8, 36]. Other methods have divided real samples from the same class and recombined them, both in the time domain [37] and using empirical mode decomposition (EMD) [38, 39], and some have additionally considered the generation of artificial points on a Riemannian manifold [40]. Applying these strategies to EEG data has improved classification accuracy when limited data is available.

3. Methods

3.1. Data and training protocol

The data used for this study was from the BCI Competition IV Dataset 2a [13]. In this dataset, each of nine untrained subjects performed MI of the left hand, right hand, both feet, and of the tongue when prompted by a computer screen. Each datafile contained 72 examples of each of these four classes (which may be referred to as the 'control' classes throughout the paper), with resting or 'ambient' data between.

In the presented work, the datapoints between 0.5 and 4 s after the prompt from the computer were labeled as their respective classes (1–4), while all other datapoints that were not a part of any trial were labeled as the ambient class (0). While this labeling held true for the validation and test sets, the trial data from 0 to 0.5 s and from 2.5 to 4 s after the start of the trial were removed from the training set. Additionally, to account for imbalance in the training labels because the ambient class was represented nearly ten times as much, ambient data was only added to the training data when it was not being overrepresented compared to the other classes. While this created some discontinuity in the training signal, it ensured that any classifier would not fit only to this class, and that we could easily control the balance between the classes. When using data augmentation, this balance was handled differently, but will be discussed in more detail in section 3.4. In contrast, the validation and testing data included all control class and ambient data, and were never changed or augmented except when only considering the four control classes for a baseline comparison.

In preprocessing, the data was first passed through a bandpass filter (7–30 Hz) using the FIRwin filter in the MNE python library [41], then divided into windows of time between 2 and 0.25 s long, which were evaluated in both overlapping and non-overlapping conditions. Many overlapping conditions were considered, but most experiments were carried out with an overlapping condition of 3, meaning that two-thirds of a given time window would be overlapped with the next time window. The data was then Z-normalised by the mean and standard deviation within only that window, and mean scaled to be between 0 and 1. For this normalisation procedure, each signal window (X_i) of shape $[t\times n_{ch}]$ had only one mean and variance, ensuring that there was still variation between channels and over time for a given window. This process can be seen in equations (1) and (2).

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle X_{norm} = \frac{X_i - mean(X_i)}{std(X_i)} \label{eq:norm} \nonumber \end{align} \tag{ 1 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \label{eq:scale} X_{out} = \frac{X_{norm} - min(X_{norm})}{max(X_{norm}) - min(X_{norm})}. \nonumber \end{align} \tag{ 2 }$

After this, any of the data augmentation strategies that will be described in section 3.4 were carried out, and all resulting data was used as input to the network.

The first 60 percent of the data was used for training, the next 20 percent was used as a validation set for model and parameter selection, while the remaining 20 percent was used as the test set to compare the finally selected model and methods. The training of the deep networks were stopped whenever the loss of the classifier on the validation set reached above 1.1 times its minimum value for a given training session. All deep networks were trained with the Adam optimizer with a learning rate of 0.001 and a batch size of 32, using negative log likelihood as the loss function and batch normalisation with $\alpha=0.1$ . All statistical tests were 2-tailed t-tests, using the standard error of the mean of each subject, and all 95% confidence intervals similarly were determined to be two standard errors away from the mean.

Because part of the goal of this paper was to investigate BCI accuracy in situations that are more realistic for self-paced and real-time control, the authors made most decisions with consideration to these factors rather than attempting to maximize accuracy. For this reason, the final label of a window was determined to be the last label in the window, because this would provide the most 'real-time' response to incoming data. We additionally decided that all of the test data should be considered when determining the accuracy of the system and that the data should be divided so that the classifiers do not know when the motor imagery task began. In all experiments other than the 4-class baseline comparison, the validation and test set data were unchanged apart from the preprocessing and labeling as described above. This method of data segmentation and labeling was intended to ensure that the classifier could identify the proper class even when switching from the ambient class to a control class, or vice versa [42]. A classifier trained in this manner should be able to more precisely determine when a control command is or is not being sent. While we did not complete any real-time testing for this manuscript, the authors believe that the methods taken here provide more realistic conditions for real-time and self-paced control.

3.2. Classifier implementation

In this paper, three different classifiers were utilised and compared based on their MI classification performance. The three methods are using a Riemannian minimum distance to the mean (MDM) protocol [16], a CNN conceptually based on FBCSP [4], and an extension of the CNN additionally using an LSTM module, which is first proposed in this manuscript. A comparison of FBCSP, the CNN, and our proposed network are shown in figure 1.

**Figure 1.** A representation of the evolution of EEG MI classifiers from FBCSP (left) to convolutional neural networks (CNN) (right). The right-most branch of the right diagram shows the new model proposed in this paper, which utilises a LSTM module to make use of how features change over time.
Download figure:
Standard image High-resolution image

3.2.1. Riemannian geometry classifier

An EEG signal has several known features of interest, namely the power of the signal, the spatial location, and the frequency distribution. Two of these crucial features, the spatial and power information, can be succinctly expressed by a spatial covariance matrix. With a pre-filtered signal, covariance matrices can represent an entire signal while reducing the dimensionality of a longer time window. In addition, because they are symmetric and positive definite (SPD), covariance matrices are compatible with calculations in Riemannian space [17]. Riemannian geometry, at its essence, is the study of differential representations of a surface in multiple dimensions [43]. Using concepts of Riemannian geometry, various distances between covariance matrix representations of a signal can be calculated. Therefore, if the training data is represented as a set of labeled covariance matrices, these distances can be used to determine the label associated with a previously unseen signal.

Following the example of Barachant et al, spatial covariance matrices were first estimated for each window of time using equation (3), where X_i is the matrix representation of the signal within a given window. X_i in this case is of shape $[t\times n_{ch}]$ , with t being the number of datapoints within a given window, and n_ch being the number of channels. The signal was then considered in tangent space using geodesic filtering with fisher discriminant analysis (FGDA) [16], and classification was carried out by calculating the minimum distance to the Riemannian mean (MDM) of each class in the training data. Equation (4) shows how to calculate the length of the geodesic curve between two points in Riemannian space, while equation (5) shows how to calculate the mean of a set of points in Riemannian space. Note that for our method, the mean for each class c as calculated through equation (5) would be input as P₂ in equation (4), while P₁ would be a single spatial covariance matrix used as input to the classifier. $\lambda_i$ represents the real eigenvalues of $P_1^{-1}P_2$ .

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle P_i = \frac{1}{T_s - 1}X_i X_i^T \label{eq:cov} \nonumber \end{align} \tag{ 3 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle \delta_R(P_1, P_2) = ||log(P_1^{-1}P_2)||_F = \left[\sum_{i=1}^n log^2\lambda_i\right]^{1/2} \label{eq:dist} \nonumber \end{align} \tag{ 4 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle P_{mean_c}(P_1,...,P_I) = argmin\sum_{i=1}^I\delta_R^2(P, P_i)\label{eq:Rmean}. \nonumber \end{align} \tag{ 5 }$

Our method utilised the publicly available code provided by the original authors⁴. We also considered the use of other Riemannian classifiers such as the Tangent Space Classifier, but found in preliminary experiments that the described method achieved superior results on our validation set with the conditions we have described.

3.2.2. Shallow FBCSP-based CNN

This paper extends the work of Schirrmeister et al, who first developed a convolutional deep learning framework based on FBCSP [4]. The network is organised as a combination of convolutional layers, each considering different dimensions of the signal, then a final fully-connected classification layer. After initial preprocessing through a band-pass filter (7–30 Hz) and segmentation of the data, a 1D convolutional layer is applied temporally to the signal coming from each electrode. This layer, in essence, does the same for filter banks as CSP did for spatial filtering, in that temporal filters which are useful will be learned as opposed to predefined. The next layer is a 2D convolution across the electrodes (spatially) and the filters selected from the first layer. Pooling is then used for feature selection, which is a necessary step in FBCSP to avoid overfitting, and after a dropout layer (drop_prob = 0.5), a final convolutional layer with a softmax function is used as the final classifier.

In our implementation, we began with the publicly available code from the original authors⁵, but the amount of data considered for classification of tasks was 2 s at a minimum in the original paper. In order to make the model compatible with smaller window sizes, a few changes needed to be made regarding the length and stride of the final pooling layer (which were chosen to be 20 and five, respectively), and the length of the final convolutional kernel (which was chosen as four). These changes were necessary because as large convolutions take place they shrink the size of the data, because no data is available outside the bounds of the original data segment. With much smaller data segments, they were eventually shrinking to 0, which led to errors in the program. A different option could have been to add padding to each layer, but as this technique was not used in the original paper, we decided not to implement this. These parameters were kept for all window sizes for consistency, but with larger window sizes this resulted in a slight mismatch between the size of the filter and the size of the remaining 'temporal' features. This changed the structure of the final feature layer, but because all of the features were fed into a final convolutional decision layer regardless, this was not seen as a major issue.

3.2.3. Proposed network: C-LSTM

While the CNN presented by Schirrmeister et al [4] can adequately classify entire trials of motor imagery, in order to have real-time motor imagery classification with less delay and a high update rate, there would ideally be some time-dependence between neighbouring datapoints. For this reason, in the proposed network an LSTM module was added between the pooling layer (after dropout) and the final selection layer of the modified network described in the previous subsection. The LSTM took as input the filtered data features at each pseudo-time point, and the output of the LSTM was put through the final classification layer in the same way as with our modified CNN. In this way, the network was able to use information about how the features were changing to make a decision about the class of the data.

The size of the LSTM was determined through a validation protocol, as it was discovered that the network was often overfitting to the training data. Because the LSTM size is related to the number of filters in the previous layer, the number of filters for both convolutional layers prior to the LSTM were also changed to the LSTM size during these initial experiments. The different considerations can be seen in figure 2, with the final parameters chosen as a single layer with a size of 30. A comparison of the proposed network architecture with FBCSP and the CNN from Schirrmeister et al [4] is shown in figure 1.

**Figure 2.** Results on validation data to determine the size and number of layers for the LSTM component of our proposed network, with a window size of 0.5 s and different overlapping conditions.
Download figure:
Standard image High-resolution image

3.3. Performance metrics

Various metrics were used to assess the performance of each of the classifiers and data augmentation methods explored in this manuscript. The main metrics considered were overall accuracy (OA), balanced accuracy (BA), ambient F1 score (AmbF1) and control class F1 score (CCF1). These methods were computed using the following equations, where prec and rec are precision and recall, respectively, and TP, FN, and FP are true positives, false negatives, and false positives for a given class (n), respectively:

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle prec_n = \frac{TP_n}{TP_n + FN_n} \nonumber \end{align} \tag{ 6 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle rec_n = \frac{TP_n}{TP_n + FP_n} \nonumber \end{align} \tag{ 7 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle OA = \frac{TP_{0...4}}{All_{0...4}} \nonumber \end{align} \tag{ 8 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle BA = \frac{\sum\nolimits_{n=0}^{4} rec_n}{5} \nonumber \end{align} \tag{ 9 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle AmbF1 = 2*\frac{prec_0 * rec_0}{prec_0 + rec_0} \nonumber \end{align} \tag{ 10 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle CCF1 = \frac{\sum\nolimits_{n=1}^{4} 2*\frac{prec_n * rec_n}{prec_n + rec_n}}{4}. \nonumber \end{align} \tag{ 11 }$

3.4. Data augmentation

3.4.1. Noise, multiplication, flip, and frequency shift

Data augmentation, which has been especially used in the context of deep learning, provides a simple way to generate more labeled data to train a network. In the context of images, data augmentation often means rotation, flipping, color shift, or other similar two-dimensional manipulations that preserve the validity of the image and label [27]. However, because of the nature of EEG data, the methods we decided to employ varied the signal enough to enhance the network, but still keep similar frequency, spatial, and power components. The four methods shown in the equations below were used, where data[i] is data from one of the control classes at a particular time point, rand is a randomly generated number from a uniform distribution between −0.5 and 0.5, data_k was the data appended to the training set, and C_k represents a constant value. F_shift() represents a function that shifts the frequency of the signal via a Hilbert Transform, as expressed in equation (16), where F is the Fourier transform, U is the unit step function, and dt is the reciprocal of the data collection frequency (250 Hz):

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle data_{noise} = data[i] + rand * stddev(data[i])/C_{noise} \nonumber \end{align} \tag{ 12 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle data_{mult1, mult2} = data[i] * (1 \pm C_{mult}) \nonumber \end{align} \tag{ 13 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle data_{flip} = max(data[i]) - data[i] \nonumber \end{align} \tag{ 14 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle data_{freq1, freq2} = F_{shift}(data[i], \pm C_{freq} \hspace{0.5em} hz) \nonumber \end{align} \tag{ 15 }$

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle F_{shift}(d, C) = F^{-1}(F(d)2U) * e^{2j\pi*C*dt} \label{eq:freq}. \nonumber \end{align} \tag{ 16 }$

Because the multiplication and frequency shift methods alter the power and frequency distribution of the signal, respectively, each of these methods was performed twice for each call, in the positive and negative direction. This ensured there was no mean shift in the training data. Adding noise and flipping the data had no such problem, so these were only performed once for each call. It is worth noting here that the noise and flip methods occur on every segment of the control class training data, therefore there is as much augmented data as there is real data in these scenarios for the control classes. For the multiplication and frequency shift methods, there is twice as much augmented data as real data for these classes. In contrast, only real ambient data was used, and was never augmented at any point during this study.

The augmented data was added to the end of the training dataset, which made the end of the training data entirely control classes with no ambient data between. While this may be part of the reason for any decrease in accuracy when using data augmentation, the other apparent options would be to either augment the ambient data as well, or to add true ambient data in chunks between data of each control class. In the author's opinion, neither of these solutions are particularly more realistic than the proposed one, and regardless any good real-time MI classifier should also be able to switch directly between control classes without any ambient data between, though this was not expressed in the test set.

To determine the constant values which should be used for C_mult, C_freq and C_noise, preliminary experiments were carried out which varied these parameters between no modulation and maximum modulation (figure 3). The optimal parameter value was selected based on which achieved the best control class F1 score for the most classifiers without significantly decreasing the performance of the others. These criteria were chosen because we expected that skewing the training data toward the ambient class, as described in section 3.4.2, would have more effect on classification of the ambient class, while the data augmentation methods shown here should improve control class accuracy more. One could also argue that the values should be optimised for each individual classifier, but this would not allow a direct comparison of the classifiers in terms of generalisability, as they would have been trained on data with different characteristics. These preliminary experiments were carried out on the validation data as described above, with a window size of 0.5 s and an overlapping condition of 3.

**Figure 3.** Control Class F1 score on the validation dataset with changing data augmentation constants C_noise, C_mult, and C_freq. These results determined the values used in each of the described data augmentation methods.
Download figure:
Standard image High-resolution image

The Riemannian classifier's ability to distinguish between the control classes generally decreased as each of the values were modified to have a larger effect on the data, meaning a decrease for C_noise and an increase for C_mult and C_freq. However, for the deep networks, there was typically an increase in performance until a certain value, which began to decrease as the augmented data became too dissimilar from real data. For C_noise, the CNN and C-LSTM peaked at a value of 2. While there was a slight dropoff for the Riemannian classifier at this value, it was still chosen for future experiments. Scaled data augmentation (mult) had the least effect on the Riemannian classifier, but had a large effect on the other classifiers. For this constant, none of the classifiers 'agreed' on which value was best, so C_mult was chosen as 0.05, which was the maximum for the CNN and did not cause any significant change to the other classifiers. C_freq was chosen as 0.2, as this gave the maximum Control Class F1 score for both the CNN and C-LSTM.

In this experiment, we also tested how the classifiers would react if multiple data augmentation methods were used at once. This meant that a larger amount of augmented data was created, rather than that a single data window was augmented in multiple ways before being input to the network. Not all different combinations of data augmented strategies were tested, but Flip+Noise, Mult+Noise, and Flip+Freq were chosen as three representative options. Combining more than two or three together was attempted in preliminary experiments, but resulted in not enough unaugmented ambient data to cope with the control class data with higher data skew values.

3.4.2. Data skew

We can assume that in daily life, robotic assistance will not be necessary in most situations, and therefore the ambient class will occur much more commonly than any of the control classes. For this reason, any BCI classification module used for this purpose should be able to accurately predict control sequences even with imbalanced classes in the test set. In other words, a self-paced BCI needs to have no control (NC) support [42]. With the training method used in the previously described experiment, the proposed C-LSTM model greatly underclassifies the ambient class when tested on data that has not been modified. This is both a theoretical problem and a practical one, as you might rather have the classes that move the robot be underclassified because there would likely be no danger associated with the ambient class. One method to combat this underclassification is to deliberately overexpress this class in the training data. Other typical ways of dealing with problems like this involve a separate control sequence that waits for a trigger to activate the classifier, such as saying 'Hey, Siri' to an iPhone. However, such a control sequence may not always be possible with disabled individuals or in extreme environments like space, so consideration of imbalanced classes may be crucial. In addition, even in situations when the classifier should be active, the ambient class may be much more common than any other class, as was seen with this dataset.

To solve this problem, a data skew value was created which would dictate the relationship between the amount of data from the ambient class and the amount of data in each of the control classes. The skew value increased from 1 to 2.5, meaning that there would be 1 to 2.5 times as much ambient data as data from any one of the control classes. When utilising data augmentation, to make use of as much real data as possible from the resting class, collected ambient data was added to the training data until it achieved a number greater than the data skew value, multiplied by the average of the amount of data from the other classes. The modifier was increased by its initial value for each method that was utilised for a particular run, ensuring that the training data would be balanced to the desired class distribution when the data augmentation was carried out. The calculation for how much ambient data to include in the training set is given by equation (17), where DA_skew is the independent data skew variable, DA_mult is a multiplier accounting for any extra control class data generated as a result of data augmentation, and CC_labels is the amount of total control class labels there are for a given subject's data.

$\begin{align} \newcommand{\e}{{\rm e}} \displaystyle NAmbData = DA_{skew}*DA_{mult}*\frac{CC_{labels}}{4} \label{eq:skew}. \nonumber \end{align} \tag{ 17 }$

All methods can be explored further in the publicly available code⁶.

4. Experimental results

4.1. Baseline comparison

To provide a baseline comparison, each of the classifiers was initially tested on 4-class data. Previously, the implementation of the state-of-the-art networks were confirmed to achieve approximately the same results on a trial-wise basis as their original papers when using trial-wise preprocessing [44]. This comparison was conducted as a function of robustness to smaller window sizes (from 2 s to 0.25 s) and with changing overlap conditions. For these initial experiments, all data were ensured to have approximately balanced classes and no data augmentation was performed. The classification methods compared were those described in section 3.2.

From this experiment (figure 4), we could see that both the Riemannian and CNN classifiers outperformed the C-LSTM classifier with a larger window size, but with a window size of 0.5 s or smaller, the three networks were comparable. It should also be noted that an overlap condition of 3 resulted in the most intuitive changes to each classifier's performance with respect to window size, which indicated that both the Riemannian classifier and the C-LSTM showed peak performance with a window size of 0.5 s, while the CNN seemed to have optimal performance with a window size of 1 s. With an overlap condition of 3, the performances of the three classifiers were also not significantly different at any stage, though it appeared that the C-LSTM generally had the worst performance of the three under these conditions. We suspect that the decrease in performance with increasing window size and overlapping windows is because this scenario increases the probability that there were multiple classes of data within the same window. However, we did not robustly investigate this theory.

4.2. Data skew

For most machine learning methods, including more of one particular class than the others usually results in a more frequent prediction of this class. Because the unaltered validation and test sets contained such an abundance of ambient data, this scenario could be an intuitive application for intentionally skewing the data in this way. With no additional data augmentation, our results showed that overall accuracy increased for all three classifiers when the data was skewed toward the ambient class, largely due to higher recall of the ambient class. However, the performance of the CNN on control classes decreased in performance as the data was skewed more toward the ambient class, as evidenced by a declining F1 score in figure 5. This did not noticeably occur in the other two classifiers.

**Figure 5.** The changes in motor imagery classification accuracy based on how much extra data is included from the ambient class when compared to the other classes. The shown metrics are overall accuracy (solid line) and control class F1 score (dotted line) for all three classifiers. The presented results are with no data augmentation.
Download figure:
Standard image High-resolution image

The results from these experiments also revealed that data skew seemed to have less overall effect on the Riemannian classifier, which, after enough skewing, had worse overall accuracy than the other two classifiers. The greatest change in overall accuracy with regard to the data skew value was the CNN, which increased in overall accuracy from 26.5% to 45.3%, despite decreasing from 0.224 to 0.197 (p < 0.01) in terms of control class F1 score. It should be noted here that higher overall accuracy does not necessarily mean better performance, as it is most likely due to the imbalance between ambient and control classes. Nevertheless, these results are interesting in revealing that the deep neural networks are more susceptible to skewing the data in this way, with CNNs being the most susceptible.

4.3. Data augmentation

As the next step in our study, we began to investigate the effect of our proposed data augmentation methods on each of the classifiers, also considering an unaltered testing dataset. Confusion matrices for all nine subjects combined are shown in figure 6. These results were collected with no skew toward the ambient class.

**Figure 6.** The effect of different data augmentation methods on each motor imagery class using the Riemannian (top), CNN (middle) and C-LSTM (bottom) classifiers, as seen in confusion matrices. Presented results are with 0.5 s windows and an overlap condition of 3, and with a data skew value of 1.
Download figure:
Standard image High-resolution image

The results show that each classifier was affected by the data augmentation in different ways. Each of the data augmentation methods improved the overall accuracy of the Riemannian classifier, but this was mainly due to higher recall of the ambient class, while recall of the control classes decreased. For the CNN and C-LSTM, the opposite was generally true, with the recall of ambient data generally decreasing when more artificial data was used. Figure 6 shows an increasing amount of augmented data used, from left to right, with Noise data augmentation doubling the amount of overall training data, Mult data augmentation tripling the amount of data, and Flip + Freq quadrupling it.

The C-LSTM showed a stark contrast in ambient classification ability when data augmentation was and was not used. With any amount of data augmentation, this classifier tended to largely ignore the ambient class in favor of the other four. The CNN, in contrast, had a slower transition into favoring the four control classes, depending on how much augmented data was used in training, which seemed to have more impact on the CNN's performance than any particular method of data augmentation. For example, using additional scaled data (Mult) showed similar results to that of modulating the frequency (Freq), because both tripled the amount of training data for each class. The same was true for adding random noise and flipping the data, which both doubled the amount of training data, and for any combination of methods which produced the same amount of training data.

4.4. Combined data augmentation

When using data skew and our data augmentation methods together, a more complete picture of how data manipulations can affect motor imagery classification begins to take shape. Because of the unbalanced nature of the data, there seemed to be an inherent tradeoff between overall accuracy and balanced accuracy when considering different data augmentation strategies, especially for the CNN and C-LSTM. As can be seen in figure 7, most data augmentation methods resulted in higher balanced accuracy and lower overall accuracy for these two networks. For the Riemannian classifier, data augmentation seemed to help in both metrics, with the only exception to this being the addition of noise to each time segment. Utilising this method pushed the Riemannian classifier to predict ambient data at a much higher rate, which generally increased the overall accuracy, but decreased the balanced accuracy.

**Figure 7.** The effect of different data augmentation methods on each motor imagery class using the Riemannian (left), CNN (middle) and C-LSTM (right) classifiers. Presented results are with 0.5 s windows and an overlap condition of 3, averaged over all data skew values. The top graph shows overall accuracy generally increasing for the Riemannian classifier, while the CNN and LSTM decrease. The bottom graph, rather, shows balanced accuracy increasing for all three classifiers. The error bars represent the 95% confidence interval of the condition with no data augmentation.
Download figure:
Standard image High-resolution image

In terms of balanced accuracy, the best data augmentation method for the Riemannian, CNN, and C-LSTM classifiers were Mult, Mult+Noise, and Flip+Noise, respectively. However, the differences between these methods and other data augmentation methods were statistically insignificant according to a t-test using the standard error of the mean. Of course, all of these results could change depending on the numeric parameters selected in section 3.4.

Taking averages in terms of either the data augmentation modifier or the data augmentation types showed that the Riemannian classifier had the best average performance in both overall and balanced accuracy. However, this result is deceiving in that it is mostly due to the fact that both the CNN and C-LSTM had several conditions in which they performed very poorly, while they also had some conditions that resulted in a very strong performance. Additionally, different classifiers showed aptitude for different subjects. While the above figures indicate general trends in the data, there was no single solution that was best for all of the subjects, as can be seen in figure 8.

**Figure 8.** The best model and data augmentation methods for each individual subject and each metric: overall accuracy (OA), class-balanced accuracy (BA), and F1 scores for the ambient (AmbF1) and control classes (CCF1), respectively). R, C, and L indicate the Riemannian, CNN, and C-LSTM classifiers, respectively, while N, Fl, M, and Fr represent Noise, Flip, Multiplication and Frequency Shift data augmentation, respectively. The number to the right of each model type shows the data skew value. Also presented are the averages of the values for individual subjects, and the same values if a single strategy was applied to all subjects.
Download figure:
Standard image High-resolution image

For five of the nine subjects, the CNN best classified ambient data, which additionally resulted in the highest overall accuracy. To achieve the best classifier with respect to these metrics, the best option was to either use no augmented data, or to augment the data with the addition of noise. For the remaining 4 subjects, the Riemannian classifier showed the best performance, always with the strategy of adding random noise to the data and skewing the data as much as possible toward the ambient class, which was also a successful strategy when using the CNN.

In terms of control class data and balanced accuracy, the C-LSTM and Riemannian classifiers showed better performance than the CNN. While no single data augmentation strategy was dominant among all or even most subjects, either using scaled data (mult) or noisy data (noise) gave the best performance in the majority of subjects, though they might have additionally been combined with other data augmentation methods or each other.

As seen from figure 9, the average improvement from skewing the data was 14.0% (p < 0.01) in terms of the overall accuracy, with most of this coming from improved ambient data classification, whose F1 score increased by 0.207. This largely mirrors the improvement of the CNN in overall accuracy when the data was skewed. From the data augmentation methods, the average improvement among subjects was between five and seven percent for all accuracy metrics and F1 scores, which was still significant (p < 0.01).

When monolithically applying a single strategy to all subjects, the skew greatly improved the F1 score of the ambient data by 0.303, and the overall accuracy was also greatly improved due to the data augmentation method, in this case the addition of noise. All improvements that were made related to the control classes were statistically insignificant when attempting to apply the same strategy to all subjects.

5. Discussion and conclusions

The presented work proposes new methods to facilitate the use of deep neural networks in the context of MI classification. Nonetheless, this work also leaves a great amount of space for extension, and many questions about the potential for using deep learning in this field. The work also explores classification methods that could be more easily transferred to self-paced or real-time control, partially through the use of smaller and overlapping windows of time.

The first question to answer with this context is why deep learning, rather than expansion on more traditional methods of MI classification, is the way forward for BCI control. Traditional methods, including FBCSP, have been based upon features that we as humans have observed and found a way to mathematically define. The same was true of image classification, which is a topic now pervaded by research related to deep learning because deep neural networks were able to find features that humans could not easily define. Similarly, the features extracted from EEG signals by deep learning could greatly improve our understanding of the brain and extend non-invasive BCI control to more people and more control classes.

However, much work still must be done before deep learning can be reliably applied to real-time and self-paced EEG MI control. While the proposed C-LSTM showed the best performance in distinguishing the control classes from each other for five of the nine subjects, it did not have much success in determining which data was rest and which was not. Conversely, the CNN was able to adequately distinguish ambient from non-ambient data, but showed the worst performance of the three networks compared in this study in terms of control class F1 score. While these deep networks may be promising for specific applications, these models should be redesigned to achieve acceptable results in all metrics. The Riemannian classifier, in contrast, showed better results in most metrics for most subjects, and as a result is still the most reliable classifier of the three. Further research could additionally explore how deep learning and Riemannian geometry could be combined for EEG classification, and how to ensure stability for the deep learning models.

One point of contention regarding the results presented here may be that the choice of values for data augmentation disadvantaged the Riemannian classifier, as its performance tended to only decrease with added data augmentation (figure 3). This is particularly true with noise-based data augmentation, where the chosen constant value resulted in an F1 score that was more than 0.02 lower than the maximum for the Riemannian classifier, but was higher for both of the other classifiers. The same was true for frequency-based data augmentation, though to a lesser extent. It should also be noted, in this respect, that the C-LSTM may have been similarly disadvantaged when using multiplication-based data augmentation, as the other two classifiers had maximums here, while the C-LSTM's F1 score was about 0.01 lower with this choice than with its maximum. Additional experiments using different constant values for data augmentation could provide a much fuller picture of how these methods affect the classifiers, and could lead to more avenues to compare these networks and methods.

As shown in this paper by the combination of multiple data augmentation strategies, data augmentation may be an important consideration in future BCI processing, especially with deep networks. For one, training protocols for BCI control are often quite intensive for users, and so it is important to ensure that each piece of data is used to the best of its ability so that this procedure can take less time. In addition, if more data can be reliably generated, the networks that can be used for classification could become deeper, which was shown to improve performance on image recognition [27, 45]. The data augmentation techniques presented here could be modified to optimally fit a particular subject or dataset, or could be used in multiple combinations. While these results provide some general guidelines for how a particular classifier may respond to data augmentation, the desired function of the controlled technology must always be considered. For example, the Riemannian or CNN classifier may be a safer choice than the C-LSTM to be used with an assistive robot, because they were more likely to choose the resting class. Even if this choice was incorrect, no harm could be done with the robot in a resting state, so it could be more desirable than receiving an incorrect control command. In other control scenarios, this may not be the case.

In addition, there should be improved protocols for EEG data collection when the intent is to apply it to self-paced control. The current paradigms for data collection have been partially based on the knowledge that MI classification methods are not very robust to quick switching between classes and imbalanced classes. While simple trial-based classification is a logical place to start, in order for the field of BCI control to progress, the data collection paradigms must do the same. This has been done with some simpler classification methods when the final application of the technology was the focus of the study [3, 5], but there are few publicly available datasets that consider self-paced application due to various challenges [42].

In summation, we present several interesting results regarding EEG motor imagery classification protocols. Firstly, while the compared CNN and C-LSTM networks may improve MI classification in some specific applications, the implemented Riemannian classifier is still the most reliable of the three. Secondly, the efficacy of various data augmentation methods were evaluated to combat data imbalance and limits on the size of MI datasets, revealing that while no strategy alone improves the performance of a classifier, the effects of each data augmentation strategy can be combined to augment both precision and recall of the control and ambient classes. With these revelations, the field has taken a step toward real-time and high accuracy self-paced multi-class EEG control.

Acknowledgments

The authors would like to acknowledge the funding source for this research: EPSRC grant EP/R026092/1. The authors would also like to thank Fani Deligianni and Yao Guo for providing advice and support related to this paper. Daniel would also like to thank his amazing wife for being supportive through this process.

Data augmentation for self-paced motor imagery classification with C-LSTM

Article metrics

Submit

Author e-mails

Author affiliations

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction