A Computationally Efficient Single-Channel EEG Sleep Stage Scoring Approach using Simple Structured CNN

Automatic sleep stage classification has become a prominent research direction because hand-crafted feature engineering is highly inefficient. However, current studies of automatic sleep stage scoring focus on designing complex neural network structures to improve model performance while neglecting model efficiency. As a result, both lengthy training times and highly demanding hardware are needed for model training, which is unfavorable for future industrial applications. This work demonstrates that a simple model, such as a shallow Convolutional Neural Network (CNN) combined with proper data processing techniques, can achieve performance (overall accuracy of 79.0%) comparable to that of complex models (overall accuracy of 74.9-82.0%). The model designed in this work also significantly improves model efficiency by reducing the number of learnable parameters in the neural network. This approach offers a new insight for automatic sleep stage scoring, as well as other deep learning studies: data processing and model design are equally important.


Introduction
Sleep is one of the most vital circadian rhythms of the human body. Tracking the sleep rhythm and its quality has long been an inseparable part of neuroscience research and clinical diagnosis [1]. Accurately identifying the corresponding sleep stage from Polysomnography (PSG) signals, commonly known as sleep stage scoring, plays a significant role in this field.
Traditionally, the scoring is done by well-trained technicians. The process is labour-intensive, time-consuming, and easily affected by subjective judgment [2,3]. With the rapid development of computer science, classic machine learning models such as the Support Vector Machine (SVM) [4], random forest [5,6], and Linear Discriminant Analysis (LDA) [7] have been studied. One major drawback that all these models share is that they require extensive manual feature engineering, which is highly inefficient and does not work well on large datasets [8].
More recently, deep learning has overtaken classic machine learning models in popularity due to the accelerating development of both theory and supporting hardware. Deep learning can circumvent manual feature engineering and is thus not restricted by the size of the dataset. There are two cornerstones of deep learning models: the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), both of which are well studied [9].
Many studies have applied CNNs to sleep stage scoring [9,10]. A CNN can learn features from a 1D array, such as the PSG signals in sleep stage scoring, by convolving the array with filters of any desired size and stride. The weights within the convolution layer are shift-invariant, which enhances the flexibility and generality of the model. Furthermore, the training difficulty of a CNN is much lower than that of a fully-connected deep network [11]. Meanwhile, RNNs such as Long Short-Term Memory (LSTM) can learn the temporal dependencies within sequential data [12]. However, due to its lengthy series of neural network structures, LSTM suffers from exploding gradients, which may sabotage the prediction results, and it usually takes much longer to train than a CNN. Therefore, incorporating LSTM into a deep learning model for sleep stage scoring would elongate the training time significantly and may not necessarily improve the model accuracy [13].
Recent studies tend to develop more complex and deeper CNNs and RNNs, such as a CNN connected to an LSTM in series, to enhance model performance [10]. However, a complex model escalates the training difficulty, resulting in elongated training time and more demanding hardware.
Herein, it is hypothesized that a simple neural network model can perform just as well as a complex one. In response to this hypothesis, this work aims to prove that a simple CNN, with the help of proper data processing methods, can outperform the traditional machine learning methods and complex deep learning models investigated previously. It is expected that a satisfactory overall accuracy and macro-averaged F1 score, together with promising computational efficiency, can be achieved with the simple model presented in this work without relying on any hand-crafted features.

Sleep-EDF dataset
In this work, the single-channel EEG signal (Fpz-Cz) was retrieved from the Sleep-EDF Expanded dataset [14]. The dataset consists of two subsets, named Sleep Cassette (SC) and Sleep Telemetry (ST). All subjects were taken from the SC subset, which consists of healthy Caucasians aged 25-100. Each subject's data contains PSG recordings with a sampling rate of 100 Hz. Each 30-second epoch of the recording was manually scored by well-trained technicians according to the R&K standard [15]. The scoring scheme has 8 categories: N1, N2, N3, N4, Wake, REM, Movement, and Unknown. In this work, N4 stages were merged into N3, and Movement and Unknown stages were removed, to be consistent with previous work [3]. Outside the sleeping period, only the 30 minutes immediately before and after it were included, to avoid introducing an excessive number of epochs with Wake labels.

Experimental design
A total of 45 subjects were used as the training set, and another 6 independent subjects were evenly split into validation and test sets. Only the Fpz-Cz channel Electroencephalography (EEG) was retrieved from the dataset. For the first 20 subjects in the training set, all data were included. However, the classes were highly imbalanced due to the presence of an extensive number of N2 stages. Therefore, more data from the remaining four stages were selectively extracted from the other 25 subjects and added to the training set to compensate for the imbalanced data distribution. This work also employed Welch's method to convert the raw EEG signal from the time domain into the frequency domain to save computational resources when deploying the neural network.
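The pre-processing step above can be sketched with SciPy's implementation of Welch's method. This is a minimal illustration, not the exact pipeline: the segment length (nperseg) is an assumed value not stated in this work, and the random array stands in for one real EEG epoch.

```python
import numpy as np
from scipy.signal import welch

fs = 100                           # sampling rate of the Sleep-EDF recordings (Hz)
epoch = np.random.randn(fs * 30)   # stand-in for one 30-second EEG epoch (3000 samples)

# Welch's method averages periodograms over overlapping segments, trading
# frequency resolution for a lower-variance spectral estimate; nperseg=256
# is an illustrative choice.
freqs, psd = welch(epoch, fs=fs, nperseg=256)

print(freqs.shape, psd.shape)      # both (129,): nperseg // 2 + 1 frequency bins
```

The resulting power spectral density, rather than the raw time series, is what would be fed to the network.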

Neural network structure
The first neural network structure used in this work is a shallow CNN with only two convolutional layers followed by a dense layer. The first convolutional layer has 100 output channels, a filter size of 1000, and a stride of 50; the large filter helps capture the overall trend of the frequency-domain EEG signal. The second convolutional layer has 100 output channels, a filter size of 2, and a stride of 1 to better organize the convolved signal. After flattening the 3D convolved signal into 2D, the signal was fed into a dense layer with 5 neurons as the last layer for prediction. ReLU was selected as the activation function for both the convolutional and dense layers, and cross-entropy loss was used as the loss function. Softmax was used as the activation function of the last layer to estimate the probability of each sleep stage. As a comparison to the above shallow CNN, a Bidirectional LSTM (BiLSTM) was connected to the end of the shallow CNN in series as the second neural network used in this work. This model shows how well a more complex model performs when fed identical data. The input size of the BiLSTM was set to 100 to be consistent with the output channels of the CNN. The hidden size was set to 256, and 2 layers were used. The output tensor from the BiLSTM was then flattened and fed into the same dense layer described above.
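As a rough sketch, the shallow CNN described above can be written in PyTorch as follows. The input length of 3000 points per epoch and the single input channel are illustrative assumptions; only the channel counts, filter sizes, and strides are taken from the description above.

```python
import torch
import torch.nn as nn

class ShallowCNN(nn.Module):
    def __init__(self, in_len=3000, n_classes=5):
        super().__init__()
        # First conv layer: large filter (1000) with stride 50 to capture
        # the overall trend of the frequency-domain EEG signal.
        self.conv1 = nn.Conv1d(1, 100, kernel_size=1000, stride=50)
        # Second conv layer: small filter (2) with stride 1 to refine
        # the convolved representation.
        self.conv2 = nn.Conv1d(100, 100, kernel_size=2, stride=1)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)
        conv_out = (in_len - 1000) // 50 + 1   # length after conv1
        conv_out = conv_out - 2 + 1            # length after conv2
        # Final dense layer with 5 neurons, one per sleep stage.
        self.fc = nn.Linear(100 * conv_out, n_classes)

    def forward(self, x):                      # x: (batch, 1, in_len)
        x = self.dropout(self.relu(self.conv1(x)))
        x = self.dropout(self.relu(self.conv2(x)))
        x = torch.flatten(x, start_dim=1)      # flatten channels x length
        return self.fc(x)                      # logits; softmax is applied
                                               # within the cross-entropy loss

model = ShallowCNN()
logits = model(torch.randn(4, 1, 3000))
print(logits.shape)  # torch.Size([4, 5])
```

Note that the exact parameter count depends on the assumed input length, so it will differ slightly from the ~184K figure reported later.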

Training parameters
Adam was chosen as the solver for weight optimization. After a grid search, both the weight decay and the learning rate were set to 10^-4. The batch size was set to 50 so that the model could fully comprehend the information in the training dataset. Dropout layers with a probability of 0.5 were used in both the convolutional and dense layers. Early stopping was another regularization technique employed in this work to alleviate overfitting: the patience was set to 50, and the overall number of training iterations was accordingly capped at 500.
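The training setup can be sketched as below. This is a hedged outline under stated assumptions: `model`, `train_loader`, and `val_loss` are stand-ins for the actual pipeline, which is not specified in code in this work.

```python
import copy
import torch

def train(model, train_loader, val_loss, max_epochs=500, patience=50):
    # Adam with learning rate and weight decay both 1e-4, as found by grid search.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
    criterion = torch.nn.CrossEntropyLoss()
    best, best_state, waited = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:            # batches of size 50
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        current = val_loss(model)            # validation loss after each epoch
        if current < best:                   # improvement: reset the counter
            best, best_state = current, copy.deepcopy(model.state_dict())
            waited = 0
        else:                                # no improvement: count towards patience
            waited += 1
            if waited >= patience:           # stop after 50 stagnant epochs
                break
    model.load_state_dict(best_state)        # restore the best checkpoint
    return model
```

Restoring the best checkpoint rather than the last one is a common convention with early stopping, assumed here rather than stated in the text.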

Implementation
The model was built with PyTorch (version 1.6.0). A GTX 1080 with an up-to-date NVIDIA driver (version 451.67) was used for training. The training was repeated 10 times; on average, it took 3.5 minutes to train the model, and the model needed around 850 milliseconds for prediction.

Evaluation metrics
To evaluate the performance of the trained model, a classification report was generated, summarizing precision (PR), recall (RE), and F1-score (F1) for each class using a one-versus-rest strategy. The overall accuracy and macro-averaged F1 score were also included in the report. In this work, model efficiency is specifically determined by two factors: the length of the training time and the number of learnable parameters.
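The report described above corresponds closely to scikit-learn's classification report, sketched here with illustrative labels (not actual model output):

```python
from sklearn.metrics import classification_report, accuracy_score, f1_score

stages = ["Wake", "N1", "N2", "N3", "REM"]
y_true = [0, 1, 2, 2, 3, 4, 2, 0]   # toy ground-truth stage indices
y_pred = [0, 2, 2, 2, 3, 4, 1, 0]   # toy model predictions

# Per-class precision, recall, and F1 via one-versus-rest,
# plus the macro average across classes.
print(classification_report(y_true, y_pred, target_names=stages, zero_division=0))
print("overall accuracy:", accuracy_score(y_true, y_pred))
print("macro-averaged F1:", f1_score(y_true, y_pred, average="macro"))
```

The macro average weights every class equally, which is why it complements overall accuracy on an imbalanced dataset like this one.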

Dataset
In this work, only a single-channel EEG signal was used to train the model, excluding other PSG signals such as Electromyography (EMG) and Electrooculography (EOG). Although introducing more signals may improve performance, since both EMG and EOG capture the subject's muscle and eyeball activity during sleep, the overall efficiency would drop because more time would be needed to train the model. In the Sleep-EDF dataset, two EEG channels were recorded, Fpz-Cz and Pz-Oz, corresponding to the front and back of the head, respectively. It is reported that, due to the position of the Fpz-Cz electrode, EEG measured at this channel better captures the delta wave activity and sleep spindles in the low-frequency range, and these features are decisive for sleep stage scoring [16]. Consequently, the Fpz-Cz channel EEG was selected as the training data for the neural network model.

        Wake   N1    N2    N3    REM   Total
Before  3711   1475  9061  2600  3837  20684
After   9061   3666  9061  5245  8615  35648
Table 1. Sleep label distribution in the training dataset before and after importing the data of the additional 25 subjects.

Table 1 summarizes the sleep label distribution in the training dataset before and after importing the data of the additional 25 subjects. It is evident that the initial training dataset was highly imbalanced; this is a very common issue in sleep stage scoring, with no universal solution so far [17]. Directly feeding the highly imbalanced dataset into the neural network would result in the model predicting only the majority label. Standard imbalanced-data treatment techniques are not good options in sleep stage scoring. Because the raw EEG signal is quite noisy and cannot be separated without proper data treatment, under-sampling methods would be futile in this case. On the other hand, the blindly generated samples from over-sampling methods such as the Synthetic Minority Over-sampling Technique (SMOTE) could introduce more noise into the dataset and mislead the neural network into predicting the wrong result [18].
In practice, after testing several methods, it was found that the best solution is to selectively feed in more data from the minority classes, which are Wake, N1, N3, and REM in this case. The more balanced dataset performed much better than the previous one. The advantage of this method is that no synthetic signal is added, which maintains the objectivity of the training dataset; the disadvantage is that more training data are required than usual.
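The selective rebalancing strategy above can be sketched as follows. This is an assumed formalization, not the exact procedure used in this work: real epochs of the minority classes are drawn from a pool of extra subjects until each class approaches the majority-class size, limited by how many real epochs are available. The "extra" pool here is simulated; in this work it came from 25 further subjects.

```python
import numpy as np

rng = np.random.default_rng(0)

def rebalance(labels_initial, labels_extra, target):
    """Return indices into labels_extra to append to the training set."""
    picks = []
    for cls in np.unique(labels_initial):
        have = int(np.sum(labels_initial == cls))
        need = max(0, target - have)                # shortfall vs. majority class
        pool = np.flatnonzero(labels_extra == cls)  # candidate epochs from extra subjects
        picks.extend(pool[:need].tolist())          # take only as many as available
    return picks

# Class counts before rebalancing, matching Table 1 (Wake, N1, N2, N3, REM).
initial = np.array([0] * 3711 + [1] * 1475 + [2] * 9061 + [3] * 2600 + [4] * 3837)
extra = rng.integers(0, 5, size=40000)              # simulated pool of extra epochs
idx = rebalance(initial, extra, target=9061)
balanced = np.concatenate([initial, extra[idx]])
```

Capping each class by pool availability mirrors the fact that some minority classes (e.g. N1) could not be raised all the way to the majority-class count in Table 1.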

Model performance and efficiency in this work
Both the validation and test sets kept the original data label distribution, without any selectively added data, to maintain the reliability of the predicted results. Table 2 summarizes the performance of the trained model on the test set. As a further comparison, Table 3 summarizes the performance of the trained model when the LSTM was added.
The simpler model took around 3.5 minutes to train. Meanwhile, the model used only one GPU without data parallelization, which is not hardware demanding. The high efficiency comes from the relatively small number of learnable parameters in the CNN, around 184K in this model. In comparison, as Table 3 shows, adding the LSTM increased the number of learnable parameters to 2.76M, a 15-fold increase. The additional parameters also elongated the training time to around 7.5 minutes, which is undesirable. At the same time, the performance of the model before and after adding the LSTM differed very little, which shows that further complicating the neural network structure is unnecessary. This may be because this work utilized the EEG signal in the frequency domain, whose temporal features within the series might not be as characteristic as the general pattern of the signal, which the CNN could already learn effectively.
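The parameter counts behind this comparison can be reproduced approximately in PyTorch. The BiLSTM mirrors the one described earlier (input size 100, hidden size 256, 2 layers, bidirectional); the dense layer is omitted here because its size depends on the flattened input length, so the totals below are indicative rather than the paper's exact figures.

```python
import torch.nn as nn

def count_params(module):
    # Sum the element counts of all trainable tensors.
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

convs = nn.Sequential(
    nn.Conv1d(1, 100, kernel_size=1000, stride=50),  # 1*100*1000 + 100 weights
    nn.Conv1d(100, 100, kernel_size=2, stride=1),    # 100*100*2 + 100 weights
)
bilstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2,
                 bidirectional=True, batch_first=True)

print(count_params(convs))   # 120200
print(count_params(bilstm))  # 2310144 -- the LSTM alone dwarfs the CNN layers
```

Even without the dense layer, the BiLSTM contributes an order of magnitude more parameters than the convolutional layers, consistent with the roughly 15-fold increase reported above.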
Furthermore, Table 2 shows that the model performed relatively well on the Wake, N2, N3, and REM labels but not on the N1 stage. For example, the F1 scores of Wake, N2, N3, and REM are all greater than 0.70, while the F1 score of the N1 stage is less than 0.5. This is because N1 is still a minority class in the training dataset as well as in both the validation and test datasets, so the model might not learn the features that distinguish N1 effectively. More importantly, this observation is consistent with sleep physiology research showing that the N1 stage is intrinsically ill-defined [3]. Therefore, the model's inferior performance on the N1 stage is tolerable. Table 4 compares the model performance in this work to other similar research. It shows that the model in this work achieved better overall accuracy than most of the other studies while maintaining a simpler neural network structure with fewer learnable parameters. Although Supratak et al. [10] achieved better overall accuracy, the complex neural network structure they employed indubitably raises the barrier of training and the hardware requirements. Moreover, the model developed in this work also surpassed most of the traditional machine learning models based on hand-crafted feature engineering. It is known that hand-crafted feature engineering requires multiple data processing steps and involves the mathematical extraction of characteristic features, such as Power Spectral Intensity and Hjorth parameters, from the EEG signal [19]. In some cases, feature selection methods such as minimum Redundancy Maximum Relevance (mRMR) and Recursive Feature Elimination (RFE) are even required [6]. Executing these steps sequentially is vastly time-consuming. Under the current research trend of automatic sleep stage scoring, hand-crafted feature engineering no longer fulfills the future needs of industrial application.
The reason the designed simple neural network outperforms many complex models is probably that it utilized Welch's method to convert the EEG signal from the time domain into the frequency domain. When the EEG signal in the time domain was fed into the model, the loss on the development set did not decrease at all, meaning the model learned nothing from the training dataset. This is because the raw EEG signal in the time domain contains a huge amount of noise, which misleads the model during training [20]. EEG signals in the frequency domain, on the other hand, circumvent the noise and augment the effect of valuable features, such as beta and delta wave activity, for sleep stage scoring. Moreover, Welch's method is computationally efficient and easy to implement [21]. Therefore, data pre-processing and neural network structure are equally important when designing a model, especially when balancing model accuracy and efficiency.

(Fragment of Table 4: [5] Ensemble SVM, N/A parameters, Fpz-Cz+Pz-Oz, hand-crafted features, 78.9% accuracy; [24] kNN, N/A parameters, Fpz-Cz+Pz-Oz, hand-crafted features, 80.0% accuracy.)

Future work
There are limitations in this work that leave several possible directions for future research. Firstly, this work only investigated single-channel EEG as the training data, excluding other PSG signals. Studies have reported the benefits of combining different signals into a multimodal input, allowing the neural network to learn more extensive features from different signals [3]. Therefore, future work should focus on how to introduce multimodal input while maintaining high efficiency. Secondly, as mentioned previously, the current strategy for alleviating the imbalanced class problem is to selectively import more data with minority labels, which demands more training data. Thus, future work should also focus on training the neural network with a limited amount of data without compromising overall model performance.

Conclusions
In conclusion, this work proves that with a simple neural network structure and an appropriate data pre-processing method, performance comparable to that of complex models for automatic sleep stage scoring can be achieved alongside a significant improvement in model efficiency. This approach can be generalized to other deep learning research as well: researchers should not only concentrate on designing complex neural network structures, but also improve model efficiency by considering proper data pre-processing methods without any manual feature extraction. Future work should focus on investigating multimodal input and on new strategies for alleviating imbalanced datasets without requiring extensive additional data.