Multi-modal Physiological Signal Fusion for Emotion Classification: A Multi-Head Attention Approach

In this paper, a model-level fusion technique for multi-modal physiological signals based on Multi-Head Attention is studied, and a framework that uses multi-modal physiological signals for emotion classification is proposed. First, a GCRNN model, which combines a Graph Convolutional Network (GCN) with a Long Short-Term Memory (LSTM) network, captures the distinctive features of electroencephalogram (EEG) signals; this combination precisely captures both the spatial and the temporal information contained in EEG signals. A CCRNN model, which combines a Convolutional Neural Network (CNN) equipped with Channel-wise Attention and an LSTM, is used for the peripheral physiological signals; it extracts useful features from these signals and automatically learns to weigh the importance of the various channels. Finally, Multi-head Attention is employed to fuse the outputs of the GCRNN and CCRNN models: it automatically learns the relevance and importance of the different modalities and weighs them accordingly. Emotion classification is performed by adding a Softmax layer that maps the fused representation to discrete emotion categories. The DEAP dataset was used for experimental verification, and the results indicate that the multi-modal physiological signal fusion method achieves substantially higher accuracy than the method using EEG signals alone. Additionally, the Multi-head Attention fusion method outperforms previous fusion techniques.


Introduction

ICAITA-2023 Journal of Physics: Conference Series 2637 (2023) 012047 IOP Publishing doi:10.1088/1742-6596/2637/1/012047

In the past few years, research in psychology, neuroscience, and medicine has turned to the topic of emotion recognition. Emotion recognition draws on a variety of measurement methods, which fall into two main categories [1]: audiovisual techniques and physiological techniques. Audiovisual techniques rely on external expressions such as facial expressions, speech, and gestures, but they tend to miss subtle emotions and can be deliberately controlled or suppressed. In contrast, physiological measures, including EEG and peripheral physiological signals, capture signals emanating from the central and autonomic nervous systems. Because the autonomic nervous system is activated unconsciously to govern these signals, they are more robust [2] and offer a high degree of objectivity and accuracy. Emotion identification based on physiological signals has significant practical value. In medical treatment, for instance, regular monitoring of emotion-related physiological indicators and real-time recognition of patients' emotional states can help channel negative emotions during treatment and promote recovery; such monitoring can also track patient behavior to prevent unexpected events. With the development of wearable sensor technology, more focus will be placed on emotion identification based on physiological signals.
Physiological signals such as EEG, ECG, and EMG are inherently non-stationary random signals. Ordinary time-frequency analysis extracts limited information from them, and the resulting recognition suffers from low accuracy and weak generalization across individuals. In recent years, many studies have applied Deep Learning to learn physiological signal characteristics and improve emotion recognition [3][4]. Zheng et al. [5] employed a Deep Belief Network (DBN) to capture differential entropy (DE) features from multi-channel EEG data on the SEED dataset and obtained an average classification accuracy of 86.08%. Li et al. [6] extracted power spectral density (PSD) features from multi-channel EEG on the DEAP dataset, constructed multi-dimensional EEG feature images, and built the hybrid neural network CLRNN from CNN, LSTM, and recurrent layers for EEG emotion identification, reaching an average per-subject classification accuracy of 75.21%.
Emotion recognition from physiological signals is usually affected by noise and individual differences, so multi-modal fusion can combine complementary information from different signal types to enhance reliability. Kwon et al. [7] fused EEG with Galvanic Skin Response (GSR) and achieved an emotion recognition rate of 73.4% on the DEAP dataset. Chen et al. [8] used Support Vector Machines (SVM) and Long Short-Term Memory (LSTM) networks to classify experimentally collected EEG and ECG data, reaching an accuracy of 85.38% with decision-level fusion. Chen et al. [9] proposed SCA-CNN, a convolutional network for image captioning that combines spatial and channel attention within the CNN and significantly outperforms state-of-the-art captioning methods based on visual attention.
In this paper, an emotion recognition technique based on the model-level fusion of EEG and peripheral physiological signals is proposed. First, features of the 32-channel EEG signals are extracted with the GCRNN model: the GCN captures the relationships and dependencies between EEG channels, while the LSTM models the temporal dynamics, yielding information-rich features. Second, for the 8-channel peripheral physiological signals, the CCRNN model is used: Channel-wise Attention automatically learns the importance of each channel, letting the model focus on the channels most useful for emotion recognition, while the CNN extracts spatial features and the LSTM models temporal information. Finally, Multi-head Attention integrates the features extracted by the two models; it learns the correlations and weights between features, makes comprehensive use of the different physiological modalities, and enhances the effectiveness and precision of emotion recognition.

GCRNN Model for EEG
GCN is a Deep Learning method based on the graph structure, designed to handle irregular data representations. EEG channels are not distributed in a grid pattern but exhibit irregular patterns of connections. GCN addresses this by establishing connections between channels and describing their strength with an adjacency matrix; convolution is then performed in the graph domain to aggregate information and features from adjacent channels.
When a network model is constructed with GCN, the EEG channels can be viewed as nodes in the graph, and the connections between channels correspond to its edges. Graph convolution operations propagate and aggregate information between nodes, effectively capturing channel relationships and interactions. This method exploits the overall structural features of EEG signals to improve performance on emotion recognition tasks.
In mathematical terms, a graph can be written as

G = (V, E, W),

where V is the set of graph nodes, E is the set of edges, W ∈ R^{N×N} is the adjacency matrix describing the relationships between EEG channels, N denotes the number of EEG channels, and the value W_ij describes the connection between channel i and channel j.
Functional connectivity, distance-based measures, and neural networks can all be employed to determine the value of W_ij. In this paper, the Euclidean distances between EEG electrodes are used to construct an adjacency matrix that reflects the spatial relationships between the EEG signals.
The adjacency matrix W is constructed as

W_ij = exp(−d_ij² / σ²) if exp(−d_ij² / σ²) ≥ ε, and W_ij = 0 otherwise,

where d_ij denotes the Euclidean distance between nodes i and j, σ is the kernel width, and ε is the threshold controlling the sparsity of the matrix.
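As a concrete illustration, such a distance-based adjacency matrix can be sketched as follows (the Gaussian kernel width `sigma` and the electrode coordinates are illustrative assumptions; the paper specifies only the sparsity threshold):

```python
import numpy as np

def build_adjacency(positions, sigma=1.0, eps=0.35):
    """Distance-based adjacency matrix: a Gaussian kernel of the pairwise
    Euclidean electrode distances, with entries below the sparsity
    threshold eps set to zero."""
    diff = positions[:, None, :] - positions[None, :, :]  # (N, N, 3)
    d = np.linalg.norm(diff, axis=-1)                     # d_ij
    w = np.exp(-d ** 2 / sigma ** 2)
    w[w < eps] = 0.0                                      # enforce sparsity
    np.fill_diagonal(w, 0.0)                              # no self-loops
    return w
```

The result is symmetric with entries in [0, 1], so it can be used directly as the weighted graph of the 32 EEG electrodes.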
Strictly speaking, the normalized Laplacian matrix of graph G is

L = E − D^{−1/2} W D^{−1/2},

where E is the unit matrix, D = diag(d_1, …, d_N) is the degree matrix whose entry d_i = Σ_j W_ij gives the (weighted) number of neighbors of node i, and W is the adjacency matrix.
Given a spatial signal x ∈ R^N, its graph Fourier transform is

x̂ = U^T x,

where x̂ is the frequency-domain signal and U is the orthogonal matrix obtained from the eigendecomposition L = U Λ U^T; the inverse transform is x = U x̂. For two signals x and y on the graph G, the convolution operation can be described as

x *_G y = U((U^T x) ⊙ (U^T y)),

where ⊙ is the Hadamard product.
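A minimal NumPy sketch of the normalized Laplacian and this spectral convolution (the helper names are ours, not the paper's):

```python
import numpy as np

def normalized_laplacian(w):
    """L = E - D^{-1/2} W D^{-1/2} for a symmetric adjacency matrix w."""
    d = w.sum(axis=1)
    d_inv_sqrt = np.power(d, -0.5, out=np.zeros_like(d), where=d > 0)
    return np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]

def graph_convolve(x, y, w):
    """x *_G y = U((U^T x) . (U^T y)) via the eigenvectors U of L."""
    lam, u = np.linalg.eigh(normalized_laplacian(w))  # L = U Lambda U^T
    return u @ ((u.T @ x) * (u.T @ y))                # elementwise product
```

The eigenvalues of the normalized Laplacian always lie in [0, 2], which is what justifies the λ_max = 2 rescaling used in the Chebyshev approximation below.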

 
g_θ denotes a filter, and the signal x filtered by g_θ(L) can be stated as

y = g_θ(L) x = U g_θ(Λ) U^T x,

where Λ = diag(λ_1, …, λ_N) contains the eigenvalues of L. Since the eigendecomposition of L is time-consuming, the K-order Chebyshev polynomial is employed to approximate the spectral convolution kernel g_θ(Λ) and reduce the parameter complexity:

g_θ(Λ) ≈ Σ_{k=0}^{K−1} θ_k T_k(Λ̃), Λ̃ = 2Λ/λ_max − E,

where θ_k is the Chebyshev polynomial coefficient and T_k is the Chebyshev polynomial, computed by the recurrence

T_0(x) = 1, T_1(x) = x, T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x).

Combining the expressions above, the filtering can be converted to

y ≈ Σ_{k=0}^{K−1} θ_k T_k(L̃) x, L̃ = 2L/λ_max − E,

where E is the unit matrix.
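The Chebyshev approximation can be sketched as follows (the helper names and the default λ_max = 2 are our assumptions):

```python
import numpy as np

def normalized_laplacian(w):
    d = w.sum(axis=1)
    d_inv_sqrt = np.power(d, -0.5, out=np.zeros_like(d), where=d > 0)
    return np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]

def cheb_filter(x, w, theta, lmax=2.0):
    """y ~= sum_k theta_k T_k(L~) x, with T_k = 2 L~ T_{k-1} - T_{k-2}."""
    l_tilde = 2.0 * normalized_laplacian(w) / lmax - np.eye(len(w))
    t_prev, t_cur = x, l_tilde @ x                 # T_0 x and T_1 x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_cur
    for k in range(2, len(theta)):                 # K-order truncation
        t_prev, t_cur = t_cur, 2.0 * (l_tilde @ t_cur) - t_prev
        out = out + theta[k] * t_cur
    return out
```

With K = 2 (as in the parameter settings of this paper), only θ_0 and θ_1 are learned, so no eigendecomposition is ever computed.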
The structure of the GCRNN model, which combines GCN and LSTM, is shown in Figure 1. In this model, GCN extracts the graph features of T seconds of EEG data, and an LSTM layer memorizes the time-domain information across the EEG channels within those T seconds. The output of the model is y_eeg ∈ R^{batch_size × channels × rnn_units}.
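A minimal PyTorch sketch of such a GCN + LSTM stack (the layer sizes, the channel pooling, and the single first-order graph convolution are our simplifying assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class GCRNN(nn.Module):
    """Graph convolution over the EEG channels, then an LSTM over time."""
    def __init__(self, adj, in_feats, rnn_units=64):
        super().__init__()
        a_hat = adj + torch.eye(adj.size(0))              # add self-loops
        d_inv = a_hat.sum(1).pow(-0.5)
        self.register_buffer("a_norm",
                             d_inv[:, None] * a_hat * d_inv[None, :])
        self.theta = nn.Linear(in_feats, rnn_units)       # spectral weights
        self.lstm = nn.LSTM(rnn_units, rnn_units, batch_first=True)

    def forward(self, x):                                 # x: (B, T, N, F)
        h = torch.relu(self.a_norm @ self.theta(x))       # graph conv per step
        out, _ = self.lstm(h.mean(dim=2))                 # pool channels, LSTM
        return out[:, -1]                                 # (B, rnn_units)
```

Here the sketch pools the channel axis before the LSTM for brevity; the paper's model keeps a per-channel output of shape batch_size × channels × rnn_units.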

CCRNN Model for Peripheral Physiological Signals
When exploring feature information, attention can alter the relative importance of the different channels, and it can be integrated into a CNN [10].
The CCRNN model structure is shown in Figure 2. Channel-wise Attention alters the weights of the different channels in order to examine the feature data and extract the more significant channel information.
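An SE-style sketch of channel-wise attention (the exact gating used in the paper's CCRNN is not specified, so the squeeze/excite form and the reduction factor are assumptions):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Learn a weight in (0, 1) per physiological channel and rescale it."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                   # x: (batch, channels, time)
        s = x.mean(dim=-1)                  # squeeze the time axis
        w = self.fc(s)                      # per-channel importance
        return x * w.unsqueeze(-1)          # reweight the channels
```

Because the gate outputs lie in (0, 1), uninformative peripheral channels are attenuated while useful ones pass through nearly unchanged.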

Multi-Head Attention Fusion Model
The model that uses Multi-head Attention to integrate the EEG and peripheral physiological signal features is shown in Figure 3. In this model, the outputs of the GCRNN and CCRNN models are concatenated as inputs. Multi-head Attention dynamically adjusts the degree of attention paid to each modality and fuses the features according to their importance and correlation, yielding a more discriminative and robust feature representation. This strategy fully exploits the variety and complementarity of multi-modal physiological signals, enhancing the correlation and complementarity between the modalities and thereby improving the performance and accuracy of the model.
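A minimal sketch of this fusion step using PyTorch's built-in `nn.MultiheadAttention` (the feature dimension, head count, and two-token arrangement are illustrative assumptions):

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Treat the two modality features as a 2-token sequence and let
    Multi-head self-attention learn their cross-modal weights."""
    def __init__(self, dim=64, heads=4, classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cls = nn.Linear(dim, classes)      # Softmax applied in the loss

    def forward(self, f_eeg, f_phys):           # each: (batch, dim)
        seq = torch.stack([f_eeg, f_phys], 1)   # (batch, 2, dim)
        fused, weights = self.attn(seq, seq, seq)
        return self.cls(fused.mean(dim=1))      # (batch, classes)
```

The returned attention `weights` show how strongly each modality attends to the other, which is the learned importance weighting the text describes.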
Multi-head Attention is computed as

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

with Attention(Q, K, V) = softmax(Q K^T / √d_k) V, where the projections are the parameter matrices W_i^Q, W_i^K, W_i^V, and W^O.

The data preprocessing method is as follows. First, the EOG artifacts are removed and the data are downsampled to 128 Hz. A band-pass filter with a range of 4.0 to 45.0 Hz is then applied to bring the data to the same baseline. Since human emotional states generally last from 1 to 12 seconds, each 60-second EEG trial was sliced into 12-second chunks to increase the amount of training data. The Fast Fourier Transform (FFT) is applied to each window using the "FFT" function in the SciPy Python package, and the logarithmic amplitude of the non-negative frequency components is preserved. The final preprocessed data format is shown in Table 1.
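The slicing and FFT step can be sketched as follows (NumPy's `rfft` stands in for the SciPy FFT call so the snippet is self-contained; the sampling rate and window length follow the text):

```python
import numpy as np

def fft_log_amplitude(eeg, fs=128, win_sec=12):
    """Slice a (channels, samples) trial into win_sec-second windows and keep
    the log amplitude of the non-negative frequency components."""
    n = fs * win_sec
    starts = range(0, eeg.shape[1] - n + 1, n)
    chunks = [eeg[:, i:i + n] for i in starts]
    return np.stack([np.log(np.abs(np.fft.rfft(c, axis=-1)) + 1e-8)
                     for c in chunks])         # (windows, channels, n//2 + 1)
```

For a 60-second, 32-channel trial this yields five 12-second windows per trial, multiplying the number of training samples by five.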

Evaluation Metrics and Parameter Settings
In this study, the network parameters are updated iteratively by backpropagation until they reach their optimal values. To this end, we define a loss function based on the Mean Square Error (MSE) with L1 regularization:

loss(p, l) = mse(p, l) + α ||W||_1.

The performance of the model is evaluated with classification accuracy and F1-score, computed as follows.
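As a sketch, this regularized loss can be written directly in PyTorch (α defaults to the paper's value of 0.01):

```python
import torch

def loss_fn(pred, label, params, alpha=0.01):
    """Mean square error plus an L1 penalty on all model parameters."""
    mse = torch.mean((pred - label) ** 2)
    l1 = sum(p.abs().sum() for p in params)
    return mse + alpha * l1
```

In training, `params` would be `model.parameters()`, so the L1 term discourages large weights across the whole network.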

Accuracy = (TP + TN) / (TP + TN + FP + FN),

F1 = 2 · Precision · Recall / (Precision + Recall),

where TP, FP, TN, and FN stand for true positives, false positives, true negatives, and false negatives, respectively, with Precision = TP / (TP + FP) and Recall = TP / (TP + FN).

The Chebyshev polynomial order is set to K = 2, the number of graph nodes equals the number of EEG channels (32), the number of hidden-layer units is rnn_units = 64, the matrix sparsity threshold is 0.35, the maximum number of iterations MAX is 300, the dropout probability is 0, the batch size of the training set is 512, and the batch size of the validation and test sets is 128. The regularization coefficient of the loss function is α = 0.01, and the Adam optimizer is used. The model was trained and tested on an RTX 3090, implemented with Python 3.8.10 and PyTorch 1.11.0. The training, validation, and test sets were split in the proportion 8:1:1.
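The two evaluation metrics can be computed from the confusion counts as in this sketch (binary labels assumed):

```python
import numpy as np

def accuracy_f1(y_true, y_pred):
    """Accuracy and F1-score from TP/TN/FP/FN counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, f1
```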

Ablation Experiment
In this section, an ablation experiment comparing four methods was performed. First, the GCRNN model alone is used for emotion recognition from single-modal EEG signals. Second, the EEG and peripheral physiological signal outputs of the two models are directly averaged. Third, Dot-product Attention is used to fuse the multi-modal physiological signals. Finally, the proposed Multi-head Attention fusion is evaluated on the DEAP dataset and compared with the other methods. The experimental results show that multi-modal physiological signal fusion is significantly superior to using EEG signals alone. Moreover, the Multi-head Attention fusion method achieves the best accuracy, which demonstrates its effectiveness and superiority for emotion classification.
In conclusion, the Multi-head Attention fusion method proposed in this paper performs model-level fusion of multi-modal physiological signals and achieves good performance on emotion classification tasks. This technique not only increases the precision of emotion classification but also offers a practical approach to studying and applying emotion identification. Future work could further explore and extend this approach, applying multi-modal fusion to a wider range of situations and tasks and enhancing the precision and accessibility of emotion recognition.
In the CCRNN model, Channel-wise Attention reweights the channel information, and the LSTM captures temporal dependencies in the signal data to learn meaningful features. The output of the model is y_phys ∈ R^{batch_size × rnn_units}.

Figure 3. A model-level fusion model based on Multi-head Attention.

3. Experiments

3.1. Introduction to Dataset and Preprocessing

The DEAP database is a multi-modal dataset collected by Queen Mary University of London and partner institutions to study human emotional states. 32 volunteers watched 40 one-minute music videos while the researchers recorded their EEG and peripheral physiological signals. On a scale from 1 to 9, participants rated each video for valence, arousal, dominance, and liking. Following the properties of the dataset and the emotional representations used in prior studies, we use valence, arousal, and dominance to represent emotions, classifying ratings greater than 5 as high and ratings lower than 5 as low. The experiments in this paper use the 40 channels of labeled signals in the dataset: 32 channels of EEG signals and 8 channels of peripheral physiological signals, namely 2 channels of EOG, 2 channels of EMG, and 1 channel each of GSR (Galvanic Skin Response), Resp (respiration belt), Plet (plethysmograph), and Temp (temperature).
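The high/low split of the 1-9 ratings described above can be sketched as:

```python
import numpy as np

def binarize_labels(ratings, thresh=5.0):
    """Map 1-9 valence/arousal/dominance ratings to low (0) / high (1)."""
    return (np.asarray(ratings, dtype=float) > thresh).astype(int)
```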
In the loss function, p and l represent the model's predicted value and the actual label of the training data, respectively; W represents all parameters of the model; and α is the regularization coefficient. The MSE term mse(p, l) measures the difference between the model's predictions and the actual values, while the regularization term α ||W||_1 prevents overfitting during parameter learning.
Finally, a model-level fusion method for multi-modal physiological signals using Multi-head Attention is proposed in this paper. To gauge how well these techniques work, classification accuracy on the validation sets is measured for the valence, arousal, and dominance dimensions; the results are shown in Figures 4, 5, and 6. The accuracy results on the test set are presented in Figure 7. The experimental findings allow the emotion recognition performance of the above techniques to be assessed. The GCRNN model already performs well on single-modal EEG signals, but its performance is further improved by integrating multi-modal signals. Directly averaging the EEG and peripheral physiological signal features is a simple and effective fusion method with stable accuracy. When Dot-product Attention is used for model-level fusion, the accuracy on some affective dimensions may decrease. The Multi-head Attention approach presented in this research achieves good results in the model-level fusion of multi-modal physiological signals and significantly improves the accuracy of emotion recognition. In conclusion, the proposed Multi-head Attention method further enhances the correlation and complementarity between the different physiological signals and has clear advantages in integrating multi-modal physiological signals, which enhances the precision of emotion recognition.

Figure 4. Validation set accuracy on the valence dimension.

Figure 5. Validation set accuracy on the arousal dimension.

Figure 6. Validation set accuracy on the dominance dimension.

Figure 7. Test set accuracy of the different methods.

4. Conclusions

To sum up, we have successfully applied Multi-head Attention to the model-level fusion of multi-modal physiological signals and achieved remarkable results in emotion classification. The main conclusions and contributions of this paper are as follows. First, we propose an innovative framework for emotion classification using multi-modal physiological signals. By combining the GCRNN model with the CCRNN model, we successfully extract pertinent information from EEG signals and peripheral physiological inputs. The GCRNN model uses the combination of GCN and LSTM to capture the temporal and spatial information of EEG signals. The CCRNN model combines CNN and LSTM and integrates Channel-wise Attention, which extracts useful features from peripheral physiological signals and automatically learns channel importance.