Electroencephalography recognition based on encephalic region and temporal sequence transformer

Stereoscopic vision is the key to good motor control and accurate cognition, and its formation is closely related to brain control. Early methods of measuring stereoscopic vision rely on the subject's judgment, which may be influenced by inadvertent misjudgments. To address this problem, we collected electroencephalography (EEG) signals from subjects watching dynamic random-dot stereograms for stereogram recognition. To analyze these stereogram EEG signals, this paper proposes a Transformer-based encephalic region temporal sequence analysis network. Inspired by the concept of brain regions, the network uses an encephalic region Transformer module to capture global spatial features within each brain region and across the whole brain. Based on the spatial features of electrodes in different brain regions, the global spatial dependence of all electrodes can be further obtained. Then, a temporal sequence Transformer module is adopted to learn global temporal EEG features. Finally, we utilize a spatial-temporal multi-scale convolution module to extract advanced spatial-temporal fusion features for recognition. Simulation results on two public EEG datasets illustrate the excellent classification performance of the proposed model, which outperforms 9 existing comparison models in EEG recognition.


Introduction
As an essential physiological indicator of visual function, stereoscopic vision is the key to good motor control and accurate stereoscopic cognition. In medical terms, stereoscopic vision is embodied in stereoscopic acuity, the minimum parallax that triggers stereoscopic perception [1]. The Frisby distance test and the Howard-Dolman test are early methods of measuring stereoscopic acuity. However, these methods contain monocular cues, which results in low sensitivity and certain test limitations. Random-dot stereograms (RDS) are presented separately to the two eyes and can eliminate the influence of monocular cues, so they have gradually been adopted in the field of stereoscopic cognition. RDS-based methods have been applied to the measurement of stereoscopic acuity [2]. Recently, it has been found that higher sensitivity can be obtained when using dynamic random-dot stereograms (DRDS) to study stereoscopic vision. In addition, stereoscopic vision research can employ precise acquisition equipment to obtain the subject's physiological electrical signals, including electrooculography, electromyography, electrocardiography, and electroencephalography (EEG). Among these signals, EEG is frequently adopted in stereoscopic vision recognition [3] because of its precision and high temporal resolution.
Traditional machine learning algorithms for EEG signal recognition generally include manual feature extraction and classification. Differential entropy (DE), power spectral density (PSD) [4], and common spatial patterns (CSP) [5] are usually extracted as EEG features. The extracted features are then fed into classifiers for recognition. However, these methods rely on a few handcrafted features and might ignore potential information, leading to performance degradation.
Recently, many deep learning algorithms have been applied to EEG classification tasks, as they can automatically obtain deeper intrinsic feature representations from raw data. Li et al. [6] utilized a convolutional neural network (CNN) to decode motor imagery EEG (MI-EEG). Chen et al. [7] used a two-dimensional CNN to recognize different human emotions. Although CNNs have advantages in local feature extraction, they struggle to learn the global spatial relationships between EEG electrodes. Neuroimaging studies [8] have found that visual information is processed along visual pathways spanning multiple brain regions. Therefore, these complex spatial relationships are significant for the EEG recognition task.
Some scholars have begun to adopt graph convolutional networks (GCNs) to learn the spatial connections of EEG electrodes and have achieved good results. Song et al. [9] designed a DGCNN for EEG emotion recognition. Wang et al. [10] proposed an AMCNN-DGCN for detecting driver fatigue. In a previous study, we proposed the MTS-DGCHN [3] for DRDS-EEG recognition. These methods employ GCNs to capture the spatial connections of EEG. However, a GCN uses the adjacency matrix to learn global connections among all EEG electrodes and cannot extract the spatial information of local brain regions, which may degrade network performance.
To tackle these issues, this paper proposes a Transformer-based encephalic region temporal sequence analysis network (TER-TSAN) for EEG recognition. TER-TSAN is mainly composed of an encephalic region Transformer [11] module, a temporal sequence Transformer module, and a spatial-temporal multi-scale convolution module. First, the encephalic region Transformer module learns the spatial connections of electrodes within different encephalic regions and across the whole brain. Then, the temporal sequence Transformer module extracts global temporal features from the EEG sequence. Finally, the spatial-temporal multi-scale convolution module captures global and local spatial-temporal features for fusion and EEG recognition.

Method
As shown in Figure 1, we first describe the overall framework of TER-TSAN. Then, we introduce how each module in TER-TSAN is implemented.

Encephalic region transformer module
The encephalic region transformer module is implemented based on the division of EEG electrodes. Within each encephalic region, a four-headed single-layer Transformer Encoder structure is used to carry out global information learning. First, the specific encephalic regions need to be determined. Combining prior knowledge of brain region distribution with the EEG electrode locations, the specific assignment of the 30 EEG electrodes can be found in Table 1. From the perspective of EEG encephalic regions, the electrodes are divided into 6 brain region sets, and the features of each encephalic region are sent to the corresponding encephalic region Transformer Encoder to learn the global information of that brain region. Then, the features from the different encephalic regions are concatenated and fed into a whole-brain encephalic region Transformer Encoder to learn the global importance of each electrode across the whole brain. After these operations, the encephalic region Transformer module has extracted global importance information for each region and for the whole brain. The main implementation process is as follows.
Assume that our input is $X = [x_1, x_2, \dots, x_N] \in \mathbb{R}^{N \times P}$, $x_i \in \mathbb{R}^{1 \times P}$, in which N is the number of electrodes and P is the temporal length of the EEG series. Firstly, a convolution operation over time is implemented to obtain preliminary features and retain the spatial information of the signal. Meanwhile, a reshape operation is utilized to ensure that the dimensions match:

$$X' = \mathrm{Reshape}(\mathrm{Conv}(X)), \quad X' \in \mathbb{R}^{N \times P} \quad (1)$$

Then, the signals from the N electrodes are divided into 6 brain regions and sent to the corresponding brain region Encoder modules, as shown in Figure 2. The encephalic region Encoder in Figure 2 mainly adopts the Transformer Encoder structure. In this module, the feature learning process of each brain region does not involve dimensional changes. The calculation process of the i-th brain region is shown in Formula (2):

$$Z_i = \mathrm{Encoder}_i(X'_i), \quad Z_i \in \mathbb{R}^{N_i \times P} \quad (2)$$
where $i = 1, 2, \dots, 6$ and $N_i$ is the number of electrodes in the i-th region. Then, the features of all brain regions are concatenated and sent into the whole-brain encephalic region Encoder module to obtain the output EEG feature Z of the encephalic region transformer module, as in Formula (3):

$$Z = \mathrm{Encoder}_{\mathrm{whole}}([Z_1; Z_2; \dots; Z_6]), \quad Z \in \mathbb{R}^{N \times P} \quad (3)$$

After the encephalic region transformer module, the network has learned the spatial information of each brain region and of the whole brain. We then send the feature maps learned by this module to the temporal sequence Transformer module for further temporal EEG information extraction.
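To make the procedure concrete, the following PyTorch sketch shows one possible implementation of the encephalic region Transformer module under the stated dimensions (N = 30, P = 128, four-headed single-layer encoders). The electrode-to-region grouping, the temporal convolution kernel, and all class and variable names are illustrative assumptions; the actual region assignment follows Table 1.

```python
import torch
import torch.nn as nn

# Hypothetical electrode-to-region grouping for N = 30 electrodes;
# the real assignment is given in Table 1 of the paper.
REGION_INDICES = [
    list(range(0, 5)), list(range(5, 10)), list(range(10, 15)),
    list(range(15, 20)), list(range(20, 25)), list(range(25, 30)),
]

def _encoder(d_model: int, n_heads: int = 4) -> nn.TransformerEncoder:
    # Four-headed single-layer Transformer Encoder, as used throughout TER-TSAN.
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=1)

class EncephalicRegionTransformer(nn.Module):
    def __init__(self, p_len: int = 128):
        super().__init__()
        # Temporal convolution keeping the (N, P) layout, cf. Formula (1).
        self.temporal_conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)
        # One encoder per brain region (Formula (2)) plus a whole-brain
        # encoder over the concatenated regions (Formula (3)).
        self.region_encoders = nn.ModuleList(_encoder(p_len)
                                             for _ in REGION_INDICES)
        self.whole_encoder = _encoder(p_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, P) raw EEG; convolve each electrode over time.
        b, n, p = x.shape
        x = self.temporal_conv(x.reshape(b * n, 1, p)).reshape(b, n, p)
        # Encode each region independently; the regions partition the electrodes.
        z = torch.empty_like(x)
        for idx, enc in zip(REGION_INDICES, self.region_encoders):
            z[:, idx, :] = enc(x[:, idx, :])
        # Learn the global importance of every electrode across the whole brain.
        return self.whole_encoder(z)

# z = EncephalicRegionTransformer()(torch.randn(8, 30, 128))  # -> (8, 30, 128)
```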

Temporal sequence transformer module
The temporal sequence transformer module is also based on a four-headed single-layer Transformer encoder architecture, which takes advantage of the Transformer's ability to model long-range dependencies to capture temporal sequence information in DRDS-EEG signals. The implementation process is described in detail below.
We feed the EEG signals through the encephalic region transformer module above to obtain the output features $Z \in \mathbb{R}^{N \times P}$. In this section, we first transpose them and then use the temporal sequence Transformer Encoder structure to learn global weighted features over the P time-slice sequences, each of length 1×N. The schematic of the temporal sequence Encoder module is given in Figure 3. After the temporal sequence Transformer module, to ensure that the features match the subsequent convolutional feature learning, the dimension of the feature map S is extended and reshaped into a three-dimensional feature $S^* \in \mathbb{R}^{1 \times N \times P}$. The specific implementation is shown in Formula (4) (see Figure 3).
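A minimal sketch of this module, under the same assumptions as before, is given below. Because PyTorch requires the model dimension to be divisible by the number of attention heads, the sketch projects the 1×N time slices (N = 30) to a width of 32 and back; this projection, like all names in the sketch, is an assumption rather than a step described in the paper.

```python
import torch
import torch.nn as nn

class TemporalSequenceTransformer(nn.Module):
    def __init__(self, n_electrodes: int = 30, d_model: int = 32,
                 n_heads: int = 4):
        super().__init__()
        # PyTorch needs d_model divisible by nhead, so the 1 x N time
        # slices are projected to width 32 and back (an assumption).
        self.proj_in = nn.Linear(n_electrodes, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.proj_out = nn.Linear(d_model, n_electrodes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, N, P) -> (batch, P, N): one token per time slice.
        s = self.proj_out(self.encoder(self.proj_in(z.transpose(1, 2))))
        # Extend the dimension: reshape S into a 3-D map (batch, 1, N, P)
        # so it matches the subsequent convolutional layers.
        return s.transpose(1, 2).unsqueeze(1)

# s_star = TemporalSequenceTransformer()(torch.randn(8, 30, 128))  # (8, 1, 30, 128)
```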

Spatial-temporal multi-scale convolution module
This module extracts the features $S^* \in \mathbb{R}^{1 \times N \times P}$ learned by the above encephalic region and temporal sequence Transformer modules at a deeper level. In this paper, N = 30 and P = 128. To further learn EEG features, the feature $S^*$ is first sent into three spatial multi-scale convolutional layers to learn deep spatial EEG information from local and global perspectives, and then the information from the three scales is spliced to form the high-level spatial EEG features. The convolution transformation is:

$$F = \mathrm{Concat}\big(\mathrm{Conv}_1(S^*), \mathrm{Conv}_2(S^*), \mathrm{Conv}_3(S^*)\big)$$
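The sketch below illustrates the spatial multi-scale branch described above. The three kernel heights (3, 7, 15) and the channel count are hypothetical placeholders, as the paper does not list the exact scales here; only the local-to-global multi-scale splicing pattern is taken from the text.

```python
import torch
import torch.nn as nn

class SpatialMultiScaleConv(nn.Module):
    def __init__(self, out_ch: int = 8):
        super().__init__()
        # Three kernel heights cover local to global electrode neighbourhoods;
        # the scales (3, 7, 15) are placeholders, not the paper's values.
        self.branches = nn.ModuleList(
            nn.Conv2d(1, out_ch, kernel_size=(k, 1), padding=(k // 2, 0))
            for k in (3, 7, 15))

    def forward(self, s_star: torch.Tensor) -> torch.Tensor:
        # s_star: (batch, 1, N, P); splice the three scales on the channel axis.
        return torch.cat([branch(s_star) for branch in self.branches], dim=1)

# f = SpatialMultiScaleConv()(torch.randn(8, 1, 30, 128))  # -> (8, 24, 30, 128)
```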

Model classification and loss function
To reduce the number of features and speed up network training, 1×1 convolution and 4×4 pooling operations are employed to obtain the final fused feature $F^*$. The calculation formula is as follows.
$$F^* = \mathrm{Pool}_{4 \times 4}\big(\mathrm{Conv}_{1 \times 1}(F)\big), \quad F^* \in \mathbb{R}^{1 \times \frac{N}{4} \times \frac{P}{4}}$$

Then, the fused feature $F^*$ is fed into 2 linear layers and 1 SoftMax layer to realize the three-class EEG signal classification. In TER-TSAN, we utilize the cross-entropy loss L to measure the inconsistency between the real label $\hat{y}$ and the predicted label $y$, as shown in Formula (15):

$$L = -\frac{1}{M}\sum_{m=1}^{M}\sum_{c=1}^{3} \hat{y}_{m,c} \log y_{m,c} + \lambda \|\Theta\|_1 \quad (15)$$
In the above formula, M is the sample size, Θ is the set containing all trainable parameters of TER-TSAN, $\|\cdot\|_1$ represents the 1-norm of a vector, and λ is a constant coefficient. In the loss function, the term $\lambda\|\Theta\|_1$ is employed to prevent overfitting.
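Putting the head and the loss together, the sketch below shows a plausible reading of this section: a 1×1 convolution, 4×4 pooling, two linear layers, and cross-entropy with an L1 penalty on the parameters. The input channel count, hidden width, and λ value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    def __init__(self, in_ch: int = 24, n: int = 30, p: int = 128,
                 n_classes: int = 3):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, 1, kernel_size=1)  # 1x1 convolution
        self.pool = nn.AvgPool2d(kernel_size=4)           # 4x4 pooling
        self.fc1 = nn.Linear((n // 4) * (p // 4), 64)     # first linear layer
        self.fc2 = nn.Linear(64, n_classes)               # second linear layer

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch, in_ch, N, P) -> fused feature F* -> class logits.
        f_star = self.pool(self.reduce(feat)).flatten(1)
        return self.fc2(torch.relu(self.fc1(f_star)))  # softmax applied in the loss

def loss_fn(logits, labels, model, lam=1e-4):
    # Cross-entropy plus an L1 penalty on all trainable parameters, cf. Formula (15).
    l1 = sum(p.abs().sum() for p in model.parameters())
    return F.cross_entropy(logits, labels) + lam * l1
```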

Experimental results
This paper adopts two public EEG datasets, SRDA [12] and SRDB [3], to verify the proposed TER-TSAN. In the data processing stage, the signals are down-sampled from 1000 Hz to 256 Hz, as in [3]. The SGD optimizer is used for backpropagation during training. Each model is trained for 100 epochs, and the best experimental results are saved. The other details are the same as in [3].
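A sketch of the stated preprocessing and optimization choices is shown below; the learning rate, momentum, and data-loading details are assumptions, as the paper defers them to [3].

```python
import numpy as np
from scipy.signal import resample

def downsample(eeg: np.ndarray, fs_in: int = 1000, fs_out: int = 256) -> np.ndarray:
    # eeg: (trials, channels, samples) at fs_in; resample along the time axis.
    n_out = int(eeg.shape[-1] * fs_out / fs_in)
    return resample(eeg, n_out, axis=-1)

# model = ...  # TER-TSAN assembled from the module sketches above
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # lr/momentum assumed
# for epoch in range(100):  # 100 training epochs; the best results are saved
#     ...
```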
As shown in Table 2, on the SRDA dataset, TER-TSAN achieves a classification accuracy of 96.84%, superior to all comparison methods, indicating that TER-TSAN can extract more discriminative features. The main reason is that TER-TSAN can fully connect each brain region with the whole brain, capture the temporal information of EEG signals, and extract advanced spatial-temporal fusion features over multiple receptive fields. In addition, the proposed model also achieves the highest performance of 0.961 and 0.953 on the other two indexes, F1-score and Kappa, which fully demonstrates the excellent classification effect of TER-TSAN. Subsequently, the performance of TER-TSAN is further verified on SRDB. As given in Table 2, the proposed TER-TSAN obtains an overall performance of 96.33%, 0.957, and 0.945 in classification accuracy, F1-score, and Kappa value respectively, still superior to all comparison algorithms. Compared with our prior algorithm MTS-DGCHN, the model improves the average classification accuracy by 1.14%. Overall, the results prove that TER-TSAN can fully extract the temporal and spatial features of the brain regions and shows superior performance in DRDS-EEG signal recognition.

Ablation Study
To further analyze the importance of each sub-module in TER-TSAN, ablation experiments are performed on all subjects in the two datasets. The TER-TSAN model mainly includes the encephalic region transformer block (ERTB), the temporal sequence transformer block (TSTB), and the spatial-temporal multi-scale convolution block (STMCB). To verify the necessity of the different modules, as shown in Table 3, this section removes each submodule from the TER-TSAN model in turn while retaining the remaining two submodules. From the average results on each dataset, removing the ERTB or TSTB module reduces the overall performance of the model by more than 2%, which indicates that the ERTB and TSTB are key to improving the recognition performance of TER-TSAN. Besides, when the STMCB is removed, the classification performance decreases by about 7%, which indicates that the STMCB, responsible for fusing advanced spatial-temporal features at a deeper level, is crucial to the overall performance of the network. Therefore, the design of TER-TSAN is rational and efficient.

T-SNE Visualization
To verify the quality of the features extracted by the TER-TSAN model, we visualize the separability of the extracted features on a two-dimensional plane with t-SNE. The more tightly the features of one class aggregate and the farther they lie from the features of other classes, the stronger the discrimination ability of the network. To illustrate the feature learning capability of the network, TER-TSAN is compared with the second-best method, MTS-DGCHN; the t-SNE visualization results are shown in Figure 4. We can see that the features discriminated by TER-TSAN are more aggregated within each class and more separated between classes than those of MTS-DGCHN. This result demonstrates the advantage of TER-TSAN.
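The visualization itself can be reproduced with scikit-learn's t-SNE, as in the sketch below; `features` and `labels` are assumed to be NumPy arrays holding penultimate-layer features and the three class labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray) -> None:
    # Project the high-dimensional EEG features onto a 2-D plane.
    emb = TSNE(n_components=2, random_state=0).fit_transform(features)
    # One colour per class: tight, well-separated clusters indicate
    # strong discriminative ability of the learned features.
    for c in np.unique(labels):
        pts = emb[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=f"class {c}")
    plt.legend()
    plt.show()
```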

Figure 3. Temporal sequence Encoder module.

In the temporal sequence Transformer module, the learned feature S is computed as shown in Formula (4):

$$S = \mathrm{Encoder}_{\mathrm{temporal}}(Z^{\top}), \quad S \in \mathbb{R}^{P \times N} \quad (4)$$

Figure 4. t-SNE visualizations of the three state-recognition features learned by MTS-DGCHN and TER-TSAN from subject 1 on the two datasets.

Conclusion
This paper proposes a Transformer-based encephalic region temporal sequence analysis network for the DRDS-EEG recognition task. The network consists of an encephalic region Transformer module, a temporal sequence Transformer module, and a spatial-temporal multi-scale convolution module. The designed encephalic region Transformer module captures local and global spatial EEG features from each brain region and the whole brain, and the temporal sequence Transformer module fully extracts global temporal dependencies within different time segments. The spatial-temporal multi-scale convolution module further learns the temporal and spatial features to obtain advanced EEG spatial-temporal features for classification. The proposed TER-TSAN is extensively evaluated on two EEG datasets, SRDA and SRDB. The experimental results demonstrate that TER-TSAN achieves the best classification performance, superior to 9 existing EEG classification models.

Table 2. The overall performance of the compared methods.