
MSLTE: multiple self-supervised learning tasks for enhancing EEG emotion recognition


Published 17 April 2024 © 2024 IOP Publishing Ltd
Citation: Guangqiang Li et al 2024 J. Neural Eng. 21 024003. DOI: 10.1088/1741-2552/ad3c28


Abstract

Objective. The instability of EEG acquisition devices may lead to information loss in the channels or frequency bands of the collected EEG. This phenomenon is often ignored by available models, which leads to overfitting and low generalization. Approach. Multiple self-supervised learning tasks are introduced into the proposed model to enhance the generalization of EEG emotion recognition and to reduce overfitting to some extent. Firstly, channel masking and frequency masking are introduced to simulate the information loss in certain channels and frequency bands resulting from the instability of EEG acquisition, and two self-supervised feature reconstruction tasks combining masked graph autoencoders (GAE) are constructed to enhance the generalization of the shared encoder. Secondly, to take full advantage of the complementary information contained in these two self-supervised learning tasks and to ensure the reliability of feature reconstruction, a weight sharing (WS) mechanism is introduced between the two graph decoders. Thirdly, an adaptive weight multi-task loss (AWML) strategy based on homoscedastic uncertainty is adopted to combine the supervised learning loss and the two self-supervised learning losses to further enhance performance. Main results. Experimental results on the SEED, SEED-V, and DEAP datasets demonstrate that: (i) the proposed model generally achieves higher averaged emotion classification accuracy than various baselines in both subject-dependent and subject-independent scenarios; (ii) each key module contributes to the performance enhancement of the proposed model; (iii) the proposed model achieves higher training efficiency, a significantly smaller model size, and lower computational complexity than the state-of-the-art (SOTA) multi-task-based model; (iv) the performance of the proposed model is relatively insensitive to the key parameters. Significance. The introduction of self-supervised learning tasks helps to enhance the generalization of the EEG emotion recognition model and to mitigate overfitting to some extent, and the approach can be adapted to other EEG-based classification tasks.


1. Introduction

Emotion plays an important role in our daily lives [1]. It is a complex psychological and physiological state that can be characterized by behavioral or physiological signals [2]. Neuroscience studies show that physiological signals can represent emotional states more truly and precisely because they are difficult to disguise or hide [3]. Consequently, various physiological signals, such as functional magnetic resonance imaging (fMRI), electroencephalography (EEG), and stereo-electroencephalography (SEEG), have been employed in affective brain–computer interfaces (aBCIs) [4] to recognize or even modulate human emotions. Among these signals, EEG is more suitable for aBCI systems due to its non-invasive nature, high temporal resolution, and ease of collection [5–7]. Such aBCI systems can be applied in the treatment of psychiatric disorders or in assessing the emotional state of the subject in daily life [8].

Traditional EEG emotion recognition models have relied on handcrafted features (e.g. power spectral density (PSD) [9], differential entropy (DE) [10]) and conventional classifiers (e.g. support vector machine (SVM) [11]). While these methods have proven useful, deep learning architectures can extract the high-level, non-linear representations present in EEG signals more effectively [12, 13]. For instance, in [14], a compact convolutional neural network (CNN) combining depthwise and separable convolutions is proposed to extract robust features from EEG signals acquired in different paradigms. In [15], inspired by the neuroscience finding that different human brain regions have distinct responses to emotion [16], a hierarchical feature learning model based on bidirectional long short-term memory (BiLSTM), called R2G-STNN, is designed to extract spatio-temporal features from regional to global brain regions for emotion recognition. However, the above methods overlook the correlation between EEG channels and the irregular topological information presented in EEG data, both of which are crucial for accurate emotion recognition.

To consider the topological patterns in EEG signals, some researchers have introduced graph convolutional networks (GCNs) into the EEG emotion recognition task. In [17], a dynamic graph convolutional neural network (DGCNN) is constructed to learn the intrinsic relationship between different EEG channels. In ECLGCNN [18], GCN is combined with LSTM to extract spatial-temporal features from EEG for emotion recognition. In [19], graph fusion and graph enhancement are integrated into a GCN to construct a semi-supervised EEG emotion recognition model called EGFG. In [20], a spatial-temporal attention mechanism and a self-adaptive brain network adjacency matrix are designed to capture the significant sequential segments and spatial location information in EEG signals, aiming to represent the diverse activation patterns under different emotion categories.

EEG signals typically display a highly heterogeneous and nonstationary pattern, because emotion production involves many neural processes [21]. As a result, large data distribution shifts impair generalization to data from different subjects or to new situations of the current subjects [22]. Some researchers have adopted domain adaptation (DA) methods, such as domain-adversarial neural networks (DANN) [23] and adversarial discriminative domain adaptation (ADDA) [24], to improve the generalization of the models. However, most DA-based methods need to be trained on labeled training data and unlabeled test data, which is often infeasible in real applications. Considering that the previously mentioned EEG emotion recognition models are based on single-task learning, which may lead to overfitting and limit the generalization of the features learned by the model, Li et al [25] introduce multi-task learning by constructing three pseudo-tasks based on data augmentation, and propose a graph-based multi-task self-supervised learning (GMSS) model. However, firstly, GMSS [25] is based on data augmentation, which has a high time cost. Secondly, the augmentation method used by GMSS [25] is based on disrupting the EEG channel order, so it is only applicable to specific EEG datasets, which limits its flexibility. Thirdly, independent classifiers are adopted among the multiple pseudo-tasks, which increases the model's parameters and reduces training efficiency. Fourthly, and most importantly, such pseudo-tasks cannot eliminate the adverse effects of instability during EEG signal acquisition, such as the information loss in certain channels or frequency bands of the EEG signal. To overcome these limitations, we propose a multiple self-supervised learning tasks-based model for enhancing EEG emotion recognition, called MSLTE, which combines the masked graph autoencoder (GAE) [26] and a weight sharing (WS) mechanism. The main contributions can be summarized as follows:

  • (1)  
    Channel masking (CM) and frequency masking (FM) are introduced to simulate the information loss caused by the instability during EEG signal acquisition, and two self-supervised feature reconstruction tasks based on masked graph autoencoders (GAE) are constructed to enhance the generalization of the proposed model.
  • (2)  
    The WS mechanism is introduced to enhance the information interaction between the two self-supervised tasks, leveraging their common and complementary information to improve the model's overall performance, while reducing the model parameters to some extent.
  • (3)  
    Extensive experiments on SEED [27], SEED-V [28], and DEAP [2] datasets validate the effectiveness and generalizability of the proposed MSLTE model.
  • (4)  
    The proposed MSLTE model achieves higher emotion classification accuracy and training efficiency, and far fewer model parameters and lower computational complexity than the state-of-the-art (SOTA) multi-task-based model, GMSS [25].

2. Method

The architecture of the proposed MSLTE model is shown in figure 1. It is composed of three tasks: one supervised classification task for learning discriminative emotion features, and two self-supervised learning (SSL) tasks based on masked GAE [26] for enhancing the generalization of the features learned by the supervised classification task while mitigating the overfitting problem.


Figure 1. The architecture of the MSLTE model. The first two branches are self-supervised learning tasks, which are realized by the channel masking-based graph autoencoder (CM-GAE) and the frequency masking-based graph autoencoder (FM-GAE), respectively. The third branch is the target supervised learning task, which is enhanced by the above two self-supervised learning tasks. The three branches share one encoder. Weight sharing (WS) is performed between the channel decoder and the frequency decoder. The whole model is trained using the adaptive weight multi-task loss (AWML).


2.1. Supervised learning task

Assume that there are N samples in the training set and that each sample includes C channels. Considering that multi-band DE features perform well in EEG emotion recognition tasks [10, 17, 20], we adopt DE features as the model inputs. Specifically, each channel of the input EEG is decomposed into F frequency bands, δ (1–4 Hz), θ (4–8 Hz), α (8–14 Hz), β (14–31 Hz), and γ (31–50 Hz), and then DE is extracted from each frequency band. Thus, the multi-band DE feature of the ith sample can be denoted as $\boldsymbol{X}_i\in \mathbb{R} ^{C\times F}$. To represent the topological pattern contained in the multiple channels of EEG signals, X i is mapped to a graph, denoted as $\boldsymbol{\mathcal{G}}_i = (\boldsymbol{\mathcal{V}}, \boldsymbol{A}, \boldsymbol{X}_i)$, where $\boldsymbol{\mathcal{V}} = \left \{v_1, v_2,\ldots,v_C \right \}$ is the set of nodes (i.e. channels). The adjacency matrix $\boldsymbol{A}\in \mathbb{R} ^{C\times C}$ is obtained based on the location relationship between the channels (see figure 2). A minimal sketch of this feature-extraction step is given below.
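As a concrete illustration, the following sketch (ours, not the authors' released code) computes multi-band DE features for one EEG segment, assuming a Butterworth band-pass filter and a 200 Hz sampling rate; for a band-limited signal modeled as Gaussian, the DE reduces to $\frac{1}{2}\ln(2\pi e\sigma^{2})$ [10].

```python
# Minimal DE feature extraction sketch (assumptions: Butterworth band-pass,
# fs = 200 Hz, Gaussian model so DE = 0.5 * ln(2*pi*e*sigma^2)).
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}

def de_features(eeg, fs=200):
    """eeg: (C, T) raw EEG segment -> (C, F) differential entropy features."""
    feats = np.zeros((eeg.shape[0], len(BANDS)))
    for k, (lo, hi) in enumerate(BANDS.values()):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        filtered = filtfilt(b, a, eeg, axis=-1)   # band-limited signal
        var = filtered.var(axis=-1) + 1e-8        # per-channel variance
        feats[:, k] = 0.5 * np.log(2 * np.pi * np.e * var)
    return feats  # one row per channel, one column per frequency band
```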


Figure 2. EEG channel location relationships on (a) SEED and SEED-V, and (b) DEAP.


Then, as shown in figure 1, the obtained graph $\boldsymbol{\mathcal{G}}_i$ is passed through a shared encoder $\mathcal{F}(\bullet )$, which is composed of a linear layer and a K-order Chebyshev GCN layer, to extract its latent spatial-frequency representation H i . Finally, H i is fed into the classifier $\mathcal{C} (\bullet)$ to obtain the predicted emotion probability vector, each element of which is the probability of belonging to the corresponding emotion category. $\mathcal{C} (\bullet)$ comprises 3 fully-connected layers and a softmax layer. The cross-entropy loss function is adopted to train the supervised classification task. Assuming that the one-hot encoding of the true label of the ith sample is y i , the classification loss, denoted as $\mathcal{L}_{\mathrm{cls}}$, can be obtained by equation (1):

$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N}\sum_{i = 1}^{N}\boldsymbol{y}_{i}^{\top}\log\big(\mathcal{C}(\boldsymbol{H}_{i})\big)\qquad(1)$
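For illustration, a minimal PyTorch sketch of a K-order Chebyshev graph convolution and the shared-encoder layout described above is given below; the hidden size (32) and the use of the scaled graph Laplacian as input are our assumptions, not values stated in the paper.

```python
# Sketch of a K-order Chebyshev GCN layer: out = sum_k T_k(L) X W_k, where
# T_0(L) = I, T_1(L) = L, T_k(L) = 2 L T_{k-1}(L) - T_{k-2}(L).
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    def __init__(self, in_dim, out_dim, K):
        super().__init__()
        self.K = K
        self.weight = nn.Parameter(torch.randn(K + 1, in_dim, out_dim) * 0.01)

    def forward(self, x, L):  # x: (B, C, in_dim); L: (C, C) scaled Laplacian of A
        Tx = [x, torch.einsum("cd,bdf->bcf", L, x)]
        for _ in range(2, self.K + 1):
            Tx.append(2 * torch.einsum("cd,bdf->bcf", L, Tx[-1]) - Tx[-2])
        return sum(torch.einsum("bcf,fo->bco", Tx[k], self.weight[k])
                   for k in range(self.K + 1))

# Shared encoder F(.): a linear layer followed by the ChebConv (K = 2 in the paper).
linear = nn.Linear(5, 32)       # 5 frequency bands -> 32-dim node features
cheb = ChebConv(32, 32, K=2)
# H_i = cheb(linear(X_i), L) for X_i of shape (batch, C, 5)
```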

2.2. Self-supervised learning tasks

To simulate the information loss in different channels and frequency bands due to the instability during EEG signal acquisition, CM and FM are introduced. In addition, to improve the generalization of the features extracted by the shared encoder $\mathcal{F} (\bullet )$ and mitigate the overfitting problem to some extent, two SSL tasks, the FM-based GAE (FM-GAE) and the CM-based GAE (CM-GAE), are combined in the proposed model. Furthermore, to leverage the common and complementary information between the unmasked frequency bands and unmasked channels for better feature reconstruction, which further improves the performance of the proposed model, a WS mechanism is introduced between the frequency decoder and the channel decoder.

2.2.1. FM-GAE

An FM-GAE-based SSL task is designed to reconstruct the masked frequency band features by utilizing the intrinsic relationship between the frequency bands, so as to improve the generalization of the proposed model. Assume that the set of frequency bands is $\boldsymbol{\mathcal{U}} = \left \{u_1, u_2,\ldots, u_F \right \}$. A masked subset of it, denoted as $\tilde{\boldsymbol{\mathcal{U}}} \subset \boldsymbol{\mathcal{U}}$, is obtained by random sampling, and the size of $\tilde{\boldsymbol{\mathcal{U}}}$ is defined as $F_m = | \tilde{\boldsymbol{\mathcal{U}}} |$. As shown in figure 3(a), the features of the masked frequency bands are set to zero. The resulting FM feature and the corresponding graph of X i are denoted as $\boldsymbol{X}_{f,i}\in \mathbb{R} ^{C\times F}$ and $\boldsymbol{\mathcal{G}}_{f,i} = (\boldsymbol{\mathcal{V}}, \boldsymbol{A}, \boldsymbol{X}_{f,i})$, respectively. Formally, the element of the kth frequency band of the jth channel in $\boldsymbol{X}_{f,i}$, denoted as $x_{f,i}(j,k)$, can be expressed by equation (2):

$x_{f,i}(j,k) = \begin{cases} 0, & u_k \in \tilde{\boldsymbol{\mathcal{U}}} \\ x_{i}(j,k), & \text{otherwise} \end{cases}\qquad(2)$

Then, $\boldsymbol{\mathcal{G}}_{f,i}$ is fed into the shared encoder $\mathcal{F} (\bullet )$ to obtain $\boldsymbol{H}_{f,i}$. To reconstruct the input feature, a frequency decoder $\mathcal{D}_{f} (\bullet)$, which is mirror-symmetric to the encoder $\mathcal{F}(\bullet )$, is constructed. After applying $\mathcal{D}_{f} (\bullet)$ to $\boldsymbol{H}_{f,i}$, the reconstructed feature, denoted as $\boldsymbol{X}^{\mathrm{rec}}_{f,i}$, is obtained. The mean square error (MSE) is calculated between the features of the masked frequency bands of $\boldsymbol{X}^{\mathrm{rec}}_{f,i}$ and those of the corresponding original feature X i to obtain the FM reconstruction loss, denoted as $\mathcal{L}_{\mathrm{fm}}$, with equation (3):

$\mathcal{L}_{\mathrm{fm}} = \frac{1}{N}\sum_{i = 1}^{N}\mathrm{MSE}\big(\boldsymbol{X}^{\mathrm{rec}}_{f,i}[\tilde{\boldsymbol{\mathcal{U}}}],\,\boldsymbol{X}_{i}[\tilde{\boldsymbol{\mathcal{U}}}]\big)\qquad(3)$

where $[\bullet]$ means sampling the matrix by index.
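The masking operations of equations (2) and (4) and the masked-only MSE of equations (3) and (5) can be sketched as follows; the helper names (`mask_axis`, `masked_mse`) are ours, and the two SSL branches differ only in the masked axis.

```python
# Sketch of the masking and masked-only reconstruction loss (our helpers).
import torch

def mask_axis(X, axis, mask_rate):
    """X: (B, C, F). Zero a random subset along `axis` (1 = channels, 2 = bands)."""
    n = X.shape[axis]
    idx = torch.randperm(n)[: int(mask_rate * n)]  # randomly sampled masked subset
    Xm = X.clone()
    if axis == 1:
        Xm[:, idx, :] = 0.0   # channel masking (CM), equation (4)
    else:
        Xm[:, :, idx] = 0.0   # frequency masking (FM), equation (2)
    return Xm, idx

def masked_mse(X_rec, X, idx, axis):
    """MSE computed only at the masked positions, as in equations (3) and (5)."""
    if axis == 1:
        return ((X_rec[:, idx, :] - X[:, idx, :]) ** 2).mean()
    return ((X_rec[:, :, idx] - X[:, :, idx]) ** 2).mean()
```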


Figure 3. (a) Frequency masking and (b) channel masking, where gray circles represent the retained features and white circles represent the masked features, which are set to 0.


2.2.2. CM-GAE

Considering the connectivity between brain regions and the fact that different brain regions contribute differently to emotional expression, a CM-GAE-based SSL task is designed to reconstruct the masked channel features by exploiting the natural connections between EEG channels, which further improves the generalizability of the proposed model. First, a masked subset of the channel set $\boldsymbol{\mathcal{V}}$, denoted as $\hat{\boldsymbol{\mathcal{V}}} \subset \boldsymbol{\mathcal{V}}$, is obtained by random sampling, and the size of the subset $\hat{\boldsymbol{\mathcal{V}}}$ is defined as $C_m = | \hat{\boldsymbol{\mathcal{V}}} |$. As shown in figure 3(b), the features of the masked channels are set to zero. The resulting CM feature and the corresponding graph of X i are denoted as $\boldsymbol{X}_{c,i}\in \mathbb{R} ^{C\times F}$ and $\boldsymbol{\mathcal{G}}_{c,i} = (\boldsymbol{\mathcal{V}}, \boldsymbol{A}, \boldsymbol{X}_{c,i})$, respectively. Formally, the element of the kth frequency band of the jth channel in $\boldsymbol{X}_{c,i}$, denoted as $x_{c,i}(j,k)$, can be expressed by equation (4):

$x_{c,i}(j,k) = \begin{cases} 0, & v_j \in \hat{\boldsymbol{\mathcal{V}}} \\ x_{i}(j,k), & \text{otherwise} \end{cases}\qquad(4)$

Then, $\boldsymbol{\mathcal{G}}_{c,i}$ is fed into the shared encoder $\mathcal{F} (\bullet )$ to obtain $\boldsymbol{H}_{c,i}$. Similarly, to reconstruct the input feature, a channel decoder $\mathcal{D}_{c} (\bullet)$ with the same structure as the frequency decoder $\mathcal{D}_{f} (\bullet)$ is constructed. After applying $\mathcal{D}_{c} (\bullet)$ to $\boldsymbol{H}_{c,i}$, the reconstructed feature, denoted as $\boldsymbol{X}^{\mathrm{rec}}_{c,i}$, is obtained. The MSE is only calculated between the features of the masked channels of $\boldsymbol{X}^{\mathrm{rec}}_{c,i}$ and those of the corresponding original feature X i to obtain the CM reconstruction loss, denoted as $\mathcal{L}_{\mathrm{cm}}$, with equation (5):

$\mathcal{L}_{\mathrm{cm}} = \frac{1}{N}\sum_{i = 1}^{N}\mathrm{MSE}\big(\boldsymbol{X}^{\mathrm{rec}}_{c,i}[\hat{\boldsymbol{\mathcal{V}}}],\,\boldsymbol{X}_{i}[\hat{\boldsymbol{\mathcal{V}}}]\big)\qquad(5)$
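Reusing the helpers sketched in section 2.2.1, the CM branch differs from the FM branch only in the masked axis:

```python
# CM-GAE usage of the same helpers (encoder/channel_decoder defined elsewhere).
X = torch.randn(64, 62, 5)                            # a batch of DE features (B, C, F)
X_c, masked_ch = mask_axis(X, axis=1, mask_rate=0.7)  # channel masking
# L_cm = masked_mse(channel_decoder(encoder(X_c, L)), X, masked_ch, axis=1)
```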

2.2.3. WS

To leverage the common and complementary information between the unmasked frequency bands and unmasked channels, and thus ensure efficient feature reconstruction for the FM-GAE and CM-GAE tasks, a WS mechanism is introduced between the two decoders, $\mathcal{D}_{c} (\bullet)$ and $\mathcal{D}_{f} (\bullet)$. As shown in figure 3, the FM operation results in the loss of certain frequency-band information within every channel. With the WS mechanism between $\mathcal{D}_{f} (\bullet)$ and $\mathcal{D}_{c} (\bullet)$, the information of the unmasked channels can compensate for the information loss caused by the masked frequency bands to some extent. Similarly, the information of the unmasked frequency bands can assist $\mathcal{D}_{c} (\bullet)$ in reconstructing the information of the masked channels.
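One plausible realization of WS (a sketch; the paper does not spell out exactly which decoder parameters are tied) is to let the two decoders reuse the same underlying modules, so that gradients from both reconstruction losses update a common set of weights. The decoder below mirrors the encoder sketched in section 2.1.

```python
# Weight sharing sketch: D_f and D_c are the same module instance, so the FM
# and CM reconstruction losses both update one set of decoder weights.
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, hid_dim=32, out_dim=5, K=2):
        super().__init__()
        self.cheb = ChebConv(hid_dim, hid_dim, K)  # mirror of the encoder's GCN
        self.linear = nn.Linear(hid_dim, out_dim)  # mirror of the encoder's linear

    def forward(self, H, L):
        return self.linear(self.cheb(H, L))

shared_decoder = Decoder()
# X_rec_f = shared_decoder(H_f, L)   # acts as the frequency decoder D_f
# X_rec_c = shared_decoder(H_c, L)   # acts as the channel decoder D_c
```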

2.3. Joint training of multi-task

A joint training strategy is adopted to train the proposed model by combining multiple losses. Considering that selecting appropriate weights for these losses is time-consuming, an adaptive weight multi-task loss (AWML) strategy based on homoscedastic uncertainty [29] is adopted to adjust the loss weights for each task during the training phase. Specifically, the total loss, denoted as $\mathcal{L}$, can be calculated by equation (6):

$\mathcal{L} = \mathcal{W}_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{cls}} + \mathcal{W}_{\mathrm{cm}}\,\mathcal{L}_{\mathrm{cm}} + \mathcal{W}_{\mathrm{fm}}\,\mathcal{L}_{\mathrm{fm}}\qquad(6)$

where $\mathcal{W}_{\mathrm{cls}} = \left ( \sigma^{2}_{\mathcal{L}_{\mathrm{cls}}}+\epsilon \right ) ^{-1} $, $\mathcal{W}_{\mathrm{cm}} = \left ( \sigma^{2}_{\mathcal{L}_{\mathrm{cm}}}+\epsilon \right ) ^{-1} $, and $\mathcal{W}_{\mathrm{fm}} = \left ( \sigma^{2}_{\mathcal{L}_{\mathrm{fm}}}+\epsilon \right ) ^{-1} $ are the weights for the losses $\mathcal{L}_{\mathrm{cls}}$, $\mathcal{L}_{\mathrm{cm}}$, and $\mathcal{L}_{\mathrm{fm}}$, respectively; $\epsilon$ is a very small constant that prevents the denominator from going to zero; and $\sigma_{\mathcal{L}_{\mathrm{cls}}}$, $\sigma_{\mathcal{L}_{\mathrm{cm}}}$, and $\sigma_{\mathcal{L}_{\mathrm{fm}}}$ are the observation noise scalars [29] of the corresponding tasks.
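A sketch of this weighting scheme with learnable noise parameters is shown below; the added $\log\sigma^{2}$ regularizer, which keeps the learned variances from growing unboundedly, follows the usual homoscedastic-uncertainty formulation [29] and is our assumption, since the paper does not state its exact regularization term.

```python
# AWML sketch: W_t = (sigma_t^2 + eps)^-1 with learnable log-variances.
import torch
import torch.nn as nn

class AWML(nn.Module):
    def __init__(self, n_tasks=3, eps=1e-8):
        super().__init__()
        self.log_sigma2 = nn.Parameter(torch.zeros(n_tasks))  # log sigma_t^2
        self.eps = eps

    def forward(self, losses):                # losses: [L_cls, L_cm, L_fm]
        w = 1.0 / (self.log_sigma2.exp() + self.eps)
        total = sum(wi * li for wi, li in zip(w, losses))
        return total + self.log_sigma2.sum()  # regularizer (our assumption)

awml = AWML()
# total_loss = awml([loss_cls, loss_cm, loss_fm]); total_loss.backward()
```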

3. Materials

3.1. Datasets

Extensive experiments are conducted on three datasets, SEED [27], SEED-V [28], and DEAP [2], to evaluate the performance of the proposed model in comparison with the baselines. The SEED dataset consists of 62-channel EEG data from 15 subjects (7 males, 8 females), each with three sessions. Each session involves the presentation of 15 film clips to induce positive, neutral, and negative emotions, with 5 clips per emotion. That is, each session comprises 15 trials, with approximately 185–265 samples of 1 s each per trial, resulting in around 3400 samples per session. The SEED-V dataset consists of 62-channel EEG data from 16 subjects (6 males and 10 females), each with three sessions. Each session displays 15 film clips (3 clips per emotion) to induce five emotions: happy, sad, neutral, fear, and disgust. Specifically, each session consists of 15 trials, a total of 45 trials across the three sessions. Each trial is composed of approximately 13–74 samples of 4 s each, resulting in approximately 1800 samples per subject. The DEAP dataset comprises 32-channel EEG data from 32 subjects (16 male and 16 female), with only one session per subject. Each subject watches 40 music videos and rates valence and arousal from 1 to 9. Each music video lasts 60 s, that is, there are 60 samples per trial, a total of 2400 samples per subject. For both the valence and arousal dimensions, binary classification experiments are performed in this work. Specifically, a rating greater than 5.0 is labeled as '1', and one equal to or less than 5.0 is labeled as '0'.

3.2. Experimental settings

To evaluate the effectiveness and the generalization of the proposed model, both subject-dependent and subject-independent experiments are conducted on all three datasets. In the subject-dependent experiment, we follow the experimental protocol in [20]. That is, for the SEED dataset, the EEG data from the first 9 trials in each session are adopted as training data, while the remaining 6 trials are adopted as testing data for each subject. The average accuracy over the three sessions is calculated to obtain the accuracy of each subject. For the SEED-V dataset, a 3-fold cross-validation strategy is used, i.e. samples from the first 5 trials of the 3 sessions are adopted as training data for fold 1 (with similar operations for folds 2 and 3). The average accuracy over the 3 folds is taken as the accuracy for the subject. For the DEAP dataset, a 10-fold cross-validation strategy is adopted to verify the performance of the model. In the subject-independent experiment, we adopt the leave-one-subject-out (LOSO) cross-validation strategy on all three datasets, following the previous works [20, 30]. Specifically, the EEG data of one subject are adopted as the testing data, while the EEG data of the remaining subjects are adopted as the training data, until all subjects have been tested once. Notably, for the SEED dataset, the average accuracy over all subjects in one session is adopted to evaluate the model's performance.
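For reference, a minimal LOSO iterator matching this protocol might look as follows, assuming `data` maps a subject id to that subject's (features, labels) pair:

```python
# Leave-one-subject-out splits: each subject is the test set exactly once.
def loso_splits(data):
    for test_subject in sorted(data):
        train = {s: d for s, d in data.items() if s != test_subject}
        yield train, {test_subject: data[test_subject]}
```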

All experiments are implemented based on the PyTorch deep learning framework and conducted on a machine with an NVIDIA GeForce RTX 3060 GPU and an Intel i5-12400F CPU. The set of learning rates adopted in the experiments is $\left \{0.05,0.01,0.001,0.005 \right \} $, the batch size is 64, 128, or 1024, and the number of epochs is set to 200 for all experiments. The Chebyshev filter order K in the GCN and the mask rate are set to 2 and 0.7, respectively. The code of MSLTE can be found at https://github.com/L-guangQ/MSLTE.

4. Results

In this section, firstly, the emotion recognition performances achieved by the MSLTE model on all three datasets under both subject-dependent and subject-independent scenarios are compared with those of the baselines. Secondly, to investigate the effectiveness of each module in the proposed model, ablation experiments are conducted on all three datasets. Thirdly, a parameter analysis is conducted on the SEED dataset to assess the influence of the key hyperparameters, i.e. the Chebyshev filter order K and the mask rate, on the model's performance. Finally, the computational complexity and training efficiency of the proposed model are compared with those of the SOTA multi-task-based model, GMSS [25], to demonstrate the superiority of the proposed model in terms of model size and training efficiency.

4.1. Subject-dependent experiment

To verify the superiority of the proposed MSLTE model over the available methods, the machine learning-based model, i.e. SVM [11], the LSTM-based model, i.e. R2G-STNN [15], the DA-based models, i.e. DANN [23] and ADDA [24], the GCN-based models, i.e. ECLGCNN [18], DGCNN [17], and EEG-GCN [20], and the multi-task learning-based model, i.e. GMSS [25], are adopted as the baselines. The performances of the proposed model in terms of the mean and standard deviation (std) of the classification accuracy under the subject-dependent scenario are shown in table 1 in comparison with those of the baselines. It should be noted that in table 1, the performances of the models with the $^{\mathrm{*}}$ symbol are obtained by our implementation, while those of the models without the $^{\mathrm{*}}$ symbol are quoted directly from the corresponding references. R2G-STNN [15], ECLGCNN [18], and EEG-GCN [20] do not provide experimental results on the SEED-V dataset in the corresponding references. In GMSS [25], the spatial jigsaw puzzle pseudo-task relies on the locations of the EEG channels adopted. Since the EEG channels in DEAP differ from those in the SEED and SEED-V datasets, the performance of GMSS on DEAP cannot be obtained under either the subject-dependent (see table 1) or subject-independent (see table 2) scenario.

Table 1. The mean and standard deviation of the classification accuracy in percent achieved by different models under the subject-dependent scenario.

| Models | SEED | SEED-V | DEAP Valence | DEAP Arousal |
| --- | --- | --- | --- | --- |
| SVM [11]* | 72.88/6.58 | 57.57/13.31 | 63.09/6.22 | 69.65/13.41 |
| R2G-STNN [15] | 79.20/8.38 | -/- | 77.68/3.95 | 78.31/7.52 |
| DANN [23]* | 81.72/10.99 | 62.48/15.62 | 71.70/4.89 | 70.09/8.79 |
| ADDA [24]* | 83.27/9.42 | 64.21/14.23 | 81.62/5.63 | 79.95/6.58 |
| ECLGCNN [18] | 84.30/8.13 | -/- | 80.45/4.04 | 80.83/8.62 |
| DGCNN [17]* | 84.88/7.71 | 66.73/14.24 | 79.23/3.85 | 80.57/8.47 |
| EEG-GCN [20] | 85.65/7.49 | -/- | 81.77/5.58 | 81.95/7.71 |
| GMSS [25]* | 88.52/10.39 | 69.07/13.25 | -/- | -/- |
| MSLTE | **90.62/8.45** | **76.64/12.10** | **85.80/3.62** | **83.73/4.74** |

* For models with $^{\mathrm{*}}$, the results are obtained by our implementation; for models without $^{\mathrm{*}}$, the results are quoted directly from the corresponding references. The best result in each column is shown in bold.

Table 2. The mean and standard deviation of the classification accuracy in percent achieved by different models under the subject-independent scenario.

| Models | SEED | SEED-V | DEAP Valence | DEAP Arousal |
| --- | --- | --- | --- | --- |
| SVM [11]* | 54.29/12.16 | 26.80/8.87 | 51.20/10.97 | 49.79/16.27 |
| R2G-STNN [15] | 70.24/9.85 | -/- | -/- | -/- |
| DGCNN [17]* | 73.73/8.83 | 37.38/8.64 | 56.68/4.80 | 57.43/8.05 |
| DANN [23]* | 74.48/8.48 | 46.87/10.09 | 57.28/6.64 | 60.63/10.24 |
| ADDA [24]* | 77.88/9.87 | 47.97/9.43 | 58.64/6.10 | 62.36/9.13 |
| EEG-GCN [20] | 77.30/8.21 | -/- | -/- | -/- |
| AD-TCN [30] | -/- | -/- | **64.33/7.06** | 63.25/4.62 |
| GMSS [25]* | 79.17/9.41 | 46.34/9.65 | -/- | -/- |
| MSLTE | **82.16/6.90** | **50.12/8.59** | 61.95/5.52 | **65.39/9.15** |

* For models with $^{\mathrm{*}}$, the results are obtained by our implementation; for models without $^{\mathrm{*}}$, the results are quoted directly from the corresponding references. The best result in each column is shown in bold.

From table 1 it can be seen that: (i) On all three datasets, the proposed model achieves the highest mean classification accuracy. (ii) Compared with the single-task-based models [11, 15, 17, 18, 20, 23, 24], the mean classification accuracy of the proposed model is improved by at least 4.94%, 9.91%, 4.03%, and 1.78% on SEED, SEED-V, DEAP-Valence, and DEAP-Arousal, respectively. Compared with the multi-task-based GMSS [25] model, the average classification accuracy of the proposed model is improved by 2.10% and 7.57% on SEED and SEED-V, respectively. (iii) The proposed model achieves the lowest std of classification accuracy on SEED-V, DEAP-Valence, and DEAP-Arousal, which are 12.10%, 3.62%, and 4.74%, respectively. The std achieved on SEED is slightly higher than those of the baselines in [11, 15, 17, 18, 20], with a maximum gap of 1.87%. The possible reason is that the proposed model performs very well on some subjects in SEED and only moderately on others, so the std is slightly higher. Compared with the multi-task-based model, GMSS [25], the std of the proposed method on SEED and SEED-V is reduced by 1.94% and 1.15%, respectively.

4.2. Subject-independent experiment

For the subject-independent experiment, the same baselines as in section 4.1 are included, except for ECLGCNN [18], which does not report performance in the subject-independent scenario. In addition, the adversarial discriminative time convolutional network (AD-TCN) [30], which introduces domain adaptation and adopts both labeled training data and unlabeled test data to train the model, is included. The mean and std of the classification accuracy achieved by the proposed model in comparison with the baselines are shown in table 2.

From table 2 it can be seen that: (i) In all four cases of SEED, SEED-V, DEAP-Valence, and DEAP-Arousal, the proposed model achieves the highest mean classification accuracy, except on DEAP-Valence, where the proposed model performs worse than AD-TCN [30] with a gap of 2.38%. (ii) Compared with the single-task-based models [11, 15, 17, 20, 23, 24, 30], the proposed model enhances the mean classification accuracy by at least 4.28%, 2.15%, 3.31%, and 2.14% on SEED, SEED-V, DEAP-Valence, and DEAP-Arousal, respectively, excluding the result of AD-TCN [30] on DEAP-Valence. Compared with the multi-task-based model, GMSS [25], the proposed model enhances the average classification accuracy by 2.99% and 3.78% on SEED and SEED-V, respectively. (iii) The proposed model achieves the lowest std of classification accuracy on SEED and SEED-V, which are 6.90% and 8.59%, respectively. On DEAP-Valence, the std of the proposed model is slightly higher than that of DGCNN [17], but it is reduced by 1.54% compared with the SOTA AD-TCN [30]. On DEAP-Arousal, the std of the proposed model is higher than that of AD-TCN [30].

4.3. Ablation study

To study the effectiveness of each key module in the proposed model, namely FM-GAE, CM-GAE, WS, and AWML, ablation experiments are conducted on all three datasets under both subject-dependent and subject-independent scenarios. The experimental results are shown in table 3, where the 6th row represents the full proposed model, the 1st row represents the proposed model with the two SSL tasks removed so that only the supervised classification task is retained, the 2nd and 3rd rows represent the proposed model without the CM-GAE and FM-GAE SSL tasks, respectively, and the 4th and 5th rows represent the proposed model without the AWML and WS modules, respectively. From table 3 it can be seen that: (i) No matter which dataset or scenario is considered, the proposed model achieves the highest mean classification accuracy. (ii) The results in the 2nd and 3rd rows surpass those in the 1st row, which indicates that the CM-GAE or FM-GAE-based SSL task contributes to the performance improvement of the supervised task in the proposed model. (iii) The results in the 6th row outperform those in the 4th row, which indicates the effectiveness of the AWML strategy. (iv) The results in the 6th row are better than those in the 5th row, which demonstrates the effectiveness of the WS mechanism. (v) The comparison of the results in the 2nd and 3rd rows shows that the CM-based SSL task is more effective than the FM-based one.

Table 3. Ablation experimental results under subject-dependent (Dependent) and subject-independent (Independent) scenarios in terms of the mean and standard deviation of classification accuracy on SEED, SEED-V, and DEAP datasets.

| # | Configuration | SEED Dep. | SEED Indep. | SEED-V Dep. | SEED-V Indep. | DEAP Val. Dep. | DEAP Aro. Dep. | DEAP Val. Indep. | DEAP Aro. Indep. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Supervised only | 84.53/10.79 | 76.70/10.40 | 72.75/13.51 | 44.58/8.17 | 82.61/3.76 | 80.61/4.75 | 59.76/5.69 | 63.84/8.92 |
| 2 | w/o CM-GAE | 87.31/8.99 | 77.63/8.68 | 74.57/12.54 | 47.65/7.79 | 83.07/3.72 | 81.24/5.07 | 60.21/5.58 | 64.22/9.23 |
| 3 | w/o FM-GAE | 88.74/8.51 | 79.48/8.15 | 75.94/12.37 | 48.45/6.87 | 83.70/3.77 | 81.70/4.86 | 60.88/5.55 | 64.99/9.23 |
| 4 | w/o AWML | 88.90/9.52 | 76.99/7.67 | 74.87/13.15 | 46.55/7.75 | 83.16/4.25 | 82.31/4.73 | 60.91/5.44 | 64.33/9.09 |
| 5 | w/o WS | 89.90/9.16 | 80.40/6.84 | 75.67/12.10 | 47.73/6.64 | 84.39/4.02 | 83.09/4.82 | 60.95/5.59 | 65.07/9.17 |
| 6 | MSLTE (full) | **90.62/8.45** | **82.16/6.90** | **76.64/12.10** | **50.12/8.59** | **85.80/3.62** | **83.73/4.74** | **61.95/5.52** | **65.39/9.15** |

Row 1 retains only the supervised classification task; rows 2–5 remove the CM-GAE, FM-GAE, AWML, and WS module from the full model, respectively; row 6 is the proposed MSLTE. The best result in each column is shown in bold.

In addition, the loss-accuracy curves obtained on the 1st subject of the SEED dataset by the single-task model (1st row in table 3) and by the multi-task model (6th row in table 3) under the subject-dependent scenario are compared in figure 4. It can be seen that: (i) For the single-task-based model, although the model converges quickly in the training phase, after about 110 epochs the testing accuracy decreases and the corresponding testing loss increases, which indicates that severe overfitting occurs. (ii) For the testing stage of the proposed model, the accuracy and loss fluctuate around a relatively higher and lower value, respectively. This suggests that the model based on multiple SSL tasks can alleviate the overfitting problem to some extent. It is worth mentioning that similar experimental results can be obtained on the other two datasets (SEED-V and DEAP) under both subject-dependent and subject-independent scenarios.


Figure 4. Comparison of loss-accuracy curves obtained on the 1st subject of the SEED dataset by (a) the single-task model (1st row in table 3) and (b) the proposed model (6th row in table 3) under the subject-dependent scenario.


4.4. Parameter analysis

The influence of the Chebyshev filter order K in the GCN and the mask rate adopted in FM and CM on the performance of the proposed model is examined. It is worth noting that in this work, the same mask rate is adopted for both the FM and CM operations. Figure 5 presents the classification accuracies achieved by the proposed model under different combinations of these two hyperparameters. In addition, the average classification accuracy (in percent) obtained by the proposed model when one hyperparameter is fixed and the other is adjusted is shown as a histogram. It can be seen that: (i) Along the vertical axis, the proposed model achieves optimal performance when the Chebyshev filter order K is set to 2. Performance deteriorates when K exceeds 2, which may result from over-smoothing effects in the graph convolution. (ii) As for the mask rate, the highest performance is achieved at a mask rate of 0.7, and the performance begins to decline when the mask rate exceeds 0.7. (iii) The proposed model achieves its best performance when the Chebyshev filter order K and the mask rate are set to 2 and 0.7, respectively. (iv) The difference between the highest and lowest accuracy is 3.11%, which indicates that the model is not overly sensitive to these hyperparameters.


Figure 5. Classification accuracies achieved by the proposed model under different combinations of the Chebyshev filter order K and the mask rate of FM and CM on the SEED dataset. When one hyperparameter is fixed and the other is adjusted, the average classification accuracy obtained by the proposed model is shown as a histogram.


4.5. Computational complexity and training efficiency

To study the computational complexity, model size, and training efficiency of the proposed model, the multiply-accumulate operations (MACs), the number of model parameters (Param.), the dataloader preload time (DPT), and the time to train one epoch (TOET) of the proposed model under both subject-dependent and subject-independent scenarios are compared with those of the SOTA multi-task-based model, GMSS [25], on the SEED dataset. To make a fair comparison, the results of the proposed model and those of GMSS are obtained with the batch size set to 128. The DPT and TOET are reported as the mean and std over 5 runs to ensure reliability. The experimental results in table 4 demonstrate that: (i) Compared with GMSS [25], the MACs and Param. of the proposed model are reduced by 44.76% and 45.20%, respectively. (ii) Under the subject-dependent scenario, compared with GMSS [25], the DPT and TOET achieved by the proposed model are reduced by 92.38% and 54.85%, respectively. (iii) Under the subject-independent scenario, compared with GMSS [25], the DPT and TOET achieved by the proposed model are reduced by 95.37% and 84.32%, respectively.
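For reference, MACs and parameter counts such as those in table 4 can be measured with a profiling package such as thop; this is our assumption, as the paper does not name its profiling tool, and the stand-in model below is only a placeholder for an MSLTE instance.

```python
# Profiling sketch with thop (pip install thop); the model is a placeholder.
import torch
import torch.nn as nn
from thop import profile

model = nn.Sequential(nn.Flatten(),
                      nn.Linear(62 * 5, 64), nn.ReLU(), nn.Linear(64, 3))
x = torch.randn(128, 62, 5)            # a batch of SEED-style DE features
macs, params = profile(model, inputs=(x,))
print(f"MACs: {macs / 1e6:.2f} M, Params: {params / 1e6:.2f} M")
```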

Table 4. Computational complexity, model size, and training efficiency achieved by the proposed model in comparison with GMSS model under both subject-dependent (Dependent) and subject-independent (Independent) scenarios on the SEED dataset.

| Models | MACs (M) | Param. (M) | DPT (s), Dep. | TOET (s), Dep. | DPT (s), Indep. | TOET (s), Indep. |
| --- | --- | --- | --- | --- | --- | --- |
| GMSS [25] | 752.58 | 4.69 | 9.19 ± 0.02 | 4.54 ± 0.21 | 216.85 ± 0.86 | 69.19 ± 0.28 |
| MSLTE | **415.75** | **2.57** | **0.70 ± 0.01** | **2.05 ± 0.12** | **10.05 ± 0.09** | **10.85 ± 0.05** |

M and s denote million and second, respectively. The best result in each column is shown in bold.

5. Discussion

5.1. Performance comparison with the baselines

From the subject-dependent (table 1) and subject-independent (table 2) experimental results, we can see that: (i) Generally, the multi-task-based models (the proposed model and GMSS [25]) achieve higher emotion classification accuracies than the single-task-based ones [11, 15, 17, 18, 20, 23, 24, 30], including the models employing a domain adaptation strategy [23, 24, 30], on all three datasets. This indicates that introducing multi-task learning helps to enhance the generalization of emotion recognition models. The only exception occurs on DEAP-Valence under the subject-independent scenario, where the proposed model achieves lower classification accuracy than AD-TCN [30], with a gap of 2.38%. The possible reason is that AD-TCN [30] adopts an additional classification calibration mechanism, which determines the classification threshold by performing k-means clustering on the subjective rating data of each subject to reduce the impact of difficult samples near the threshold. (ii) The proposed model outperforms the SOTA multi-task-based model, GMSS [25], in terms of the mean and std of emotion classification accuracy on the SEED and SEED-V datasets under both subject-dependent and subject-independent scenarios. This indicates that, compared with the data augmentation-based pseudo-classification tasks adopted by GMSS [25], the feature masking-based self-reconstruction regression tasks in the proposed model can extract more generalized features, which improves the performance of emotion recognition. In addition, the spatial jigsaw puzzle task in GMSS is specifically designed for the SEED-series datasets and cannot be applied directly to other datasets. (iii) As shown in table 4, the proposed model achieves lower computational complexity and fewer model parameters than GMSS [25]. In addition, no matter which scenario (subject-dependent or subject-independent) is considered, the training efficiency of the proposed model, as measured by DPT and TOET, is much higher than that of GMSS [25]. Moreover, compared with GMSS [25], the improvements in DPT and TOET are much larger in the subject-independent scenario than in the subject-dependent scenario, which indicates that the superiority of the proposed model in terms of training efficiency is more prominent on larger datasets.

5.2. Ablation examination

From the ablation experimental results (table 3) it can be seen that: (i) The introduction of the two SSL tasks, CM-GAE and FM-GAE, effectively improves the emotion classification accuracy of the proposed model. The reason is that both CM and FM can simulate the information loss due to instability during EEG signal acquisition, and the two SSL tasks can reduce the influence caused by this uncertainty to some extent, which helps to improve the generalization of the multi-task model. (ii) The CM-based SSL task outperforms the FM-based SSL task, suggesting that the potential instabilities in the EEG channels are more severe than those in the frequency components of the EEG signal itself. (iii) Since the contributions of the tasks are not equal, the introduction of the AWML strategy helps to further enhance the performance of the proposed model. (iv) The WS mechanism between the channel decoder and the frequency decoder helps to improve the performance of the proposed model. The reason is that, with the WS mechanism, the masked information in one dimension (channel or frequency) can be better reconstructed with the help of the unmasked information in the other dimension (frequency or channel). Specifically, as shown in figure 3, the information of the masked frequency bands in the FM-GAE task can be better reconstructed by utilizing the information of the frequency bands included in all unmasked channels in the CM-GAE task. Similarly, the information of the masked channels in the CM-GAE task can be better reconstructed by utilizing the channel information included in all unmasked frequency bands in the FM-GAE task.

As for the influence of the key parameters (the Chebyshev filter order and the mask rate) on the proposed model, from figure 5 it can be seen that within the range of $[0.4, 0.8]$ for the mask rate and $[2, 4]$ for the Chebyshev filter order, the gap between the highest and lowest accuracies is less than 1.58%, which indicates that the performance of the proposed model is relatively insensitive to the choice of these key parameters.

5.3. Limitations

Although the proposed model achieves good performance in some aspects, it still has the following limitations: (i) Only the spatial and frequency properties of EEG are considered in constructing the SSL tasks; the temporal dynamic pattern contained in EEG, which is also very important for emotion recognition, is not taken into consideration. (ii) In practice, there are relatively few labeled EEG samples and many unlabeled EEG samples. The proposed model currently relies on labeled EEG samples for training and does not take full advantage of the large number of unlabeled EEG samples. Therefore, the model needs to be extended to unsupervised learning. (iii) The proposed model requires a predefined graph structure based on the EEG electrode positions in each dataset, and is thus not suitable for cross-dataset scenarios. Models with adaptive graph structures can be investigated in the future.

6. Conclusions

A novel model based on multiple self-supervised learning tasks is proposed for enhancing EEG emotion recognition. The model combines supervised and self-supervised tasks, utilizing the WS mechanism and a joint training strategy to enhance the model's generalization and its EEG emotion classification accuracy. Experimental results on three datasets demonstrate that the introduced FM-GAE task, CM-GAE task, AWML strategy, and WS mechanism all contribute to the performance enhancement of the proposed model. It outperforms the single-task-based models and the SOTA multi-task-based model. Future research will explore other SSL methods (e.g. contrastive learning) and integrate them with multi-task learning to develop self-supervised models that can be applied in cross-dataset scenarios.

Acknowledgment

This work is supported by the National Natural Science Foundation of China [Grant Numbers 61771196, 61872143]. The authors would like to thank the anonymous reviewers and the associate editor for their insightful comments that significantly improved the quality of this paper.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://bcmi.sjtu.edu.cn/home/seed/seed.html.
