CNN combined with attention module for facial expression recognition

The rapid development of information technology has drawn increasing scholarly interest in artificial intelligence and its constituent algorithms, such as deep learning, multilayer perceptrons, and convolutional neural networks. Facial expression analysis in particular has emerged as a popular research topic. However, classifying facial expressions is challenging because expressions vary within a single emotion, different emotions can look similar, and a large number of facial features must be considered. In this paper, a basic Convolutional Neural Network (CNN) model was first trained on the FER-2013 facial expression dataset and then optimized by adding an attention module that learns and analyses the key features of faces, thereby improving the classification accuracy of the model. The resulting model achieved an accuracy of 71.65% and an F1-score of 0.66, whereas the model without the attention module achieved an accuracy of only 69.86% and an F1-score of 0.65, an improvement of about 2%. The analysis demonstrates that the attention mechanism gives more weight to important features, e.g. the eyes and mouth, thus improving classification accuracy.


Introduction
Emotions, which consist of a series of neurophysiological changes, are fundamental to human decision-making and interaction. According to studies, only around 45% of human communication is spoken, with the other 55% mostly expressed through facial expressions of emotion [1]. Understanding facial expressions is therefore a useful tool for non-verbal communication, because they reveal people's complex and variable inner thoughts. Emotions naturally stimulate the movement of muscles under the facial skin, and various facial expressions are thus produced. Automatic emotion recognition systems have improved significantly with the advancement of computer science and a better understanding of emotional states [2]. In practice, however, feature-selection problems may be encountered in image recognition [3]. Common image features include color features, point features, line features, edge features, and spatial-relation characteristics [3,4]. When the brain processes external information (such as visual or auditory information), it normally focuses only on the major parts of the information or on areas of interest. How to make a recognition model such as a Convolutional Neural Network (CNN) focus on the key feature regions, so that its accuracy can be improved, therefore deserves more attention.
Since emotion recognition is a challenging task, it has received much attention from scientists in many fields. Traditional approaches to classifying facial expressions often rely on handcrafted features such as Local Binary Patterns (LBP), coupled with algorithms such as the Support Vector Machine (SVM) [5]. While these techniques may perform well on certain datasets, their performance is likely to deteriorate on more complex expression datasets collected in uncontrolled environments. Recent advances in deep learning, however, have proven to be a significant breakthrough in the effectiveness of facial expression classification, as these techniques have been successful in addressing image classification problems.
With the advancement of computer vision in recent years, the attention module has found extensive use in a number of fields, including saliency detection [6], crowd counting [7], and facial expression identification [8]. CNN-based methods usually represent images by feature maps in deep layers, which contain the global details of the whole image. However, previous research has shown that emotions are highly related to local facial regions [9], while traditional CNN methods treat all regions equally. This means that basic CNN methods cannot fully extract all the relevant features when processing facial expression images. It therefore seems reasonable to let the model assign weights to different image patches or regions to indicate their importance for Facial Expression Recognition (FER) tasks. The attention mechanism boosts the weights of important characteristics such as the eyes, nose, and mouth to help increase the model's performance, which is the approach considered in this study.

Dataset
The dataset used in this paper is FER-2013, which is publicly available on Kaggle [10]. To facilitate the import and expansion of the data, this article uses ImageDataGenerator to shuffle the pictures in the dataset and to expand it by flipping and shifting the images vertically [11]. The augmented results are shown in Figure 3. In addition, this study loaded the images in grayscale mode.
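The paper performs this augmentation with Keras' ImageDataGenerator; purely as an illustration, the flip-and-shift idea can be sketched in NumPy as below. The 48 × 48 size matches FER-2013, while the shift amount of 4 pixels is an assumption for the example.

```python
import numpy as np

def augment(img, shift=4):
    """Generate two simple augmented variants of a grayscale image:
    a horizontal flip and a vertical shift (vacated rows zero-padded)."""
    flipped = img[:, ::-1]                   # mirror left-right
    shifted = np.zeros_like(img)
    shifted[shift:, :] = img[:-shift, :]     # move the image down by `shift` px
    return flipped, shifted

# A dummy 48x48 grayscale "image" with distinct pixel values.
img = np.arange(48 * 48, dtype=np.float32).reshape(48, 48)
flipped, shifted = augment(img)
```

In practice ImageDataGenerator applies such transforms on the fly with random parameters each epoch, so the effective dataset size grows without storing extra images.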

CNN
A Convolutional Neural Network (CNN) is a type of multi-layer neural network that contains several 2D planes, each of which comprises multiple individual neurons. The fundamental architecture of the CNN is illustrated in Figure 4. The CNN has gained significant attention in a variety of fields, particularly speech detection and image recognition. It is adept at identifying intricate details in raw images, eliminating the need for data reconstruction and explicit feature extraction [11]. Moreover, the CNN is robust against various forms of deformation, such as translation, tilt, and zoom, rendering it ideal for image-processing tasks involving large amounts of data. Convolutional layers in a CNN are primarily used for extracting image features. The initial convolutional layer can only detect low-level features, such as edges, lines, and angles. As the network deepens, the convolutional layers iteratively learn more complex features by building on the low-level ones. The matrix produced by convolving a filter across an image and computing the dot product is referred to as an 'Activation Map' or 'Convolved Feature'.
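The slide-and-dot-product operation that produces an activation map can be made concrete with a minimal NumPy sketch; the vertical-edge kernel here is a standard illustrative choice, not one of the paper's learned filters.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel over the image and take the
    dot product at each position, yielding an activation map."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel responds strongly where intensity changes left to right.
image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # dark left half, bright right half
edge_kernel = np.array([[-1.0, 1.0]])
amap = conv2d(image, edge_kernel)       # activation concentrated at the edge
```

The activation map is nonzero only at the column where the intensity jumps, which is exactly the "low-level feature detection" behaviour described above.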
Through the convolution operation, the CNN model can comprehend the image data and thereby preserve the spatial relationship between pixels. The primary function of the pooling layer is to preserve essential features, suppress noise, and reduce information redundancy. This layer effectively decreases the model's computational load and helps prevent overfitting. Pooling layers are generally divided into two types: maximum pooling and average pooling. The fully connected layer comprises numerous neurons arranged in a tiled structure, derived from the pooling layer. The convolutional layers preceding the fully connected layer are employed for feature extraction, while the fully connected layer amalgamates all the extracted features for classification.
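The two pooling variants mentioned above differ only in the statistic kept per window; a small NumPy sketch with non-overlapping 2 × 2 windows makes the contrast explicit.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: keep the max (most salient response) or the
    mean (smoothed summary) of each size x size window."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]            # crop to a multiple of `size`
    blocks = x.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1., 2., 0., 1.],
              [3., 4., 1., 0.],
              [0., 0., 5., 6.],
              [0., 0., 7., 8.]])
mx = pool2d(x, mode="max")    # each 2x2 window collapses to its maximum
av = pool2d(x, mode="avg")    # ... or to its mean
```

Either way the spatial resolution is halved, which is what reduces the downstream computation and redundancy.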

Attention Module
The attention module is a crucial component in many studies and is mainly classified into two categories: the Soft Attention Module and the Hard Attention Module [12,13]. The Soft Attention Module computes a weighted average of the N input items instead of selecting a single item from the N when choosing the information fed into the neural network. In contrast, the Hard Attention Module selects specific information at a given position in the input sequence, for example by randomly sampling an item or by selecting the item with the highest probability. This paper uses the Soft Attention Module for its experiments.
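The soft/hard distinction can be summarized in a few lines: soft attention blends all N inputs with softmax weights, whereas hard attention would pick one. The sketch below uses arbitrary illustrative scores, not values from the paper.

```python
import numpy as np

def soft_attention(values, scores):
    """Soft attention: a softmax over the scores gives weights that sum to 1,
    and the output is the weighted average of all N inputs. (Hard attention
    would instead select a single input, e.g. the argmax.)"""
    w = np.exp(scores - scores.max())   # shift for numerical stability
    w = w / w.sum()
    return w @ values, w

values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # N = 3 input vectors
scores = np.array([0.1, 0.1, 3.0])                       # third item scored highest
out, w = soft_attention(values, scores)
```

Because the whole computation is differentiable, soft attention can be trained end-to-end with backpropagation, which is why it is the variant chosen here.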
Recently, a prominent attention module called the Squeeze-and-Excitation Network (SENet), shown in Figure 5, has gained popularity and is used in this paper. The objective of SENet is to capture cross-channel relationships in feature maps by learning modulation weights that are unique to each channel.

Implementation details
The basic CNN module comprises four two-dimensional convolutional layers, three maximum pooling layers, and two fully connected layers. The output layer uses the SoftMax activation function. To reduce overfitting, dropout layers with a rate of 0.3 were used. Each layer is activated by the ReLU function. Other hyperparameters must also be specified, e.g. the learning rate, optimizer, and loss function.
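As a sanity check on such an architecture, one can trace the feature-map shapes from the 48 × 48 input to the 7-class output. The kernel sizes, 'same' padding, filter counts, and fully connected width below are all assumptions for illustration; the paper does not state them.

```python
def shape_walk(h=48, w=48):
    """Trace feature-map sizes through a CNN like the one described: four conv
    layers (3x3, 'same' padding assumed, so H x W is preserved), three 2x2
    max-pool layers, then two fully connected layers. Filter counts (32..256)
    and the FC width (128) are illustrative, not the paper's values."""
    shapes = [("input", h, w, 1)]
    filters = [32, 64, 128, 256]
    for i, f in enumerate(filters):
        shapes.append((f"conv{i + 1}", h, w, f))   # 'same' padding keeps H x W
        if i < 3:                                  # pool after the first 3 convs
            h, w = h // 2, w // 2
            shapes.append((f"pool{i + 1}", h, w, f))
    flat = h * w * filters[-1]                     # flatten before the FC layers
    shapes += [("fc1", flat, 128), ("fc2 (softmax, 7 classes)", 128, 7)]
    return shapes

shapes = shape_walk()
```

Walking the shapes like this before training catches mismatches (e.g. pooling a map below 1 × 1) without building the network at all.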

Figure 6. The architecture of the CNN Module without the Attention Module.
In the Attention Module, the input feature map is first globally average-pooled over its width and height. The result is then subjected to a channel-based concatenation operation and activated by the ReLU function to reduce the dimensionality. The sigmoid function is then used to generate the attention features. Finally, the resulting feature is multiplied by the module's input feature to derive the final output feature. The architectures of the CNN module and the attention module are shown in Figure 6 and Figure 7, respectively.
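The pool → ReLU bottleneck → sigmoid → rescale pipeline described above is the SE pattern, and can be sketched in NumPy as follows. The random weights, reduction ratio r = 4, and tensor layout (H, W, C) are assumptions for the example; in the real module the two projections are learned.

```python
import numpy as np

def se_attention(x, r=4, seed=0):
    """SE-style channel attention: squeeze each channel to one number by
    global average pooling, pass it through a ReLU bottleneck and a sigmoid
    to obtain per-channel weights in (0, 1), then rescale the input."""
    rng = np.random.default_rng(seed)
    h, w, c = x.shape
    z = x.mean(axis=(0, 1))                  # squeeze: (c,) channel descriptor
    w1 = rng.standard_normal((c, c // r))    # illustrative (untrained) weights
    w2 = rng.standard_normal((c // r, c))
    s = np.maximum(z @ w1, 0.0)              # ReLU bottleneck: (c // r,)
    a = 1.0 / (1.0 + np.exp(-(s @ w2)))      # sigmoid gates: (c,)
    return x * a, a                          # excite: rescale each channel

x = np.ones((6, 6, 8))                       # dummy 6x6 feature map, 8 channels
y, a = se_attention(x)
```

Multiplying back onto the input means channels the gates rate near 1 pass through unchanged while the rest are suppressed, which is how eye- and mouth-related channels can be emphasized.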

Result and discussion
The present study analyses the attention mechanism incorporated in the two models and evaluates their differences through metrics such as the saliency map and the F1 score (Table 1 and Table 2). Figure 8 and Figure 9 depict the optimization process and final accuracy of the models, which were trained for 70 epochs under identical conditions, with the exception of the attention mechanism. Tables 1 and 2 present the metrics used to evaluate the models, including recall and precision. Notably, both models exhibited superior performance in identifying 'happy' and 'surprise' facial expressions. As shown above, the model is relatively accurate overall but has lower accuracy for the disgust and sad expressions, probably because disgust expressions are more similar to other expressions and are therefore more difficult for the CNN model to distinguish. Even so, the model can still correctly classify over 2,512 images. In addition, the accuracy of the model improved only slightly between epochs 60 and 70 and was in a relatively stable state.
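For reference, the per-class precision, recall, and F1 reported in Tables 1 and 2 follow the standard definitions, sketched below. The counts used are illustrative, not the paper's actual confusion-matrix values.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from confusion-matrix counts:
    precision = TP / (TP + FP), recall = TP / (TP + FN),
    F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one class (e.g. a hard class like 'disgust'):
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```

Because F1 is a harmonic mean, a class with high precision but poor recall (or vice versa) still scores low, which is why F1 exposes the weak 'disgust' and 'sad' classes even when overall accuracy looks acceptable.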

Conclusion
This study presents an application of the attention mechanism to enhance facial expression recognition by incorporating it into a CNN model. By effectively isolating the crucial facial features, such as the eyes, mouth, and nose, from extraneous image factors during classification, the attention mechanism yields superior accuracy. A comprehensive dataset of images is employed for training and validation. The experimental outcomes illustrate that, with the given hyperparameters and model, the accuracy is consistently elevated by approximately 1%, reaching approximately 71% and 67% on the training and test sets, respectively. Future research will optimize the model via transfer learning and augment the dataset while increasing training depth to mitigate bias and variance.
The sample images are displayed in Figure 1. The image size in this dataset is 48 × 48 pixels. The dataset is divided into seven categories, and the distribution of the amount of data in each category is shown in Figure 2. The training set includes 28,709 examples and the public test set consists of 3,589 examples.

Figure 1. The sample images on the collected FER-2013 dataset.

Figure 2. The data distribution of the collected FER-2013 dataset.

Figure 3. Augmented images in the dataset.

Figure 4. The architecture of the CNN.

Figure 7. The architecture of the CNN Module with the Attention Module.

Figure 8. Accuracy and loss vs. epoch with the attention module.

Figure 9. Accuracy and loss vs. epoch without the attention module.

Table 1. The evaluation metrics of the model with the attention module.