Research on Facial Expression Recognition Based on Improved Deep Residual Network Model

Facial expressions are the main external manifestation of human emotions, and facial expression recognition technology can be applied in medicine, criminal investigation, education, and other scenarios. To address the slow convergence and low recognition accuracy of traditional neural networks, and to optimize the network more efficiently and improve the facial expression recognition rate, a facial expression recognition method based on an improved convolutional neural network is proposed. First, the deep residual network ResNet18 is improved; the improved ResNet18 network is then used to extract global expression features from face images, and a self-attention weighting module is introduced that computes a weight for the expression features of each face image and uses it to weight the loss function. Finally, the method is experimentally verified on the two public expression data sets CK+ and RAF-DB, where facial expression recognition accuracy reaches 98.89% and 87.13%, respectively. Compared with the original deep residual network ResNet18 model and other network models, the facial expression recognition rate is significantly improved, which shows that this method can effectively improve the accuracy of facial expression recognition and has practical application value.


Introduction
Facial expression recognition (FER) uses image processing and analysis technology to effectively read the information conveyed by facial expressions. At present, facial expression recognition technology has been applied in many fields [1], touching many aspects of our work and life. For example, by recognizing the facial expressions of a teacher in class, the effectiveness of the teacher's teaching can be assessed indirectly [2].
In recent years, research on facial expression recognition has developed rapidly. Many scholars have applied deep learning methods to facial expression recognition and achieved good results. In 2018, Kaur et al. [3] used facial expression recognition to predict student concentration. In 2019, Xie et al. [4] constructed an adaptive weighted loss function to fuse handcrafted features with deep features, effectively improving facial expression classification. Facial expression recognition based on deep learning is still developing.
To further improve the accuracy of facial expression recognition and enhance the generalization ability of the network model, this paper improves on the traditional ResNet18 network by adding a convolutional layer and a maximum pooling layer and by weighting the loss function with a self-attention module.

Deep Residual Network ResNet18
The basic unit of the deep residual network is the residual structural unit, whose structure is shown in Figure 1. The residual unit is realized as a cross-layer identity connection: the output of each residual unit is obtained by element-wise addition of the unit's input and the output of a cascade of convolutional layers, followed by the nonlinear activation function ReLU. The entire deep residual network is a deep neural network formed by stacking multiple residual units. The shortcut connections and identity mappings introduced by the deep residual network let data flow across layers, so the input feature information is well characterized by the network's output. The residual unit also keeps the number of network parameters and the computational complexity under control.
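The residual unit described above can be sketched in PyTorch (the framework used in the experiments); the channel count and layer sizes here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Basic residual unit: cascaded 3x3 convolutions plus a
    cross-layer identity shortcut, followed by a final ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # add input and convolutional output element-wise, then activate
        return self.relu(out + x)

x = torch.randn(2, 64, 56, 56)
y = ResidualUnit(64)(x)  # output keeps the input shape
```

Because the shortcut is an identity mapping, the unit's output has the same shape as its input, which is what allows many such units to be stacked.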

Improved deep residual network
This paper builds on the deep residual network ResNet18, replacing the final average pooling (Avgpool) layer of the original network with a 1×1 convolutional layer followed by a BN layer, a ReLU activation function, and a max pooling (Maxpool) layer; the structure is shown in Figure 3. We name the improved model C-ResNet18.

Figure 3. Improved deep residual network C-ResNet18

The 1×1 convolution with ReLU activation adds nonlinearity without losing feature information [5], letting the network extract more discriminative features. Max pooling reduces the shift of the estimated mean caused by convolutional-layer parameter errors, retains more texture information, prevents over-fitting, and improves the generalization ability of the model.
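A minimal sketch of the replacement head, assuming the 512-channel feature map output by ResNet18's last residual stage (the exact layer dimensions are our assumption, not taken from the paper):

```python
import torch
import torch.nn as nn

class ImprovedHead(nn.Module):
    """Replacement for ResNet18's final average pool:
    1x1 conv -> BN -> ReLU -> max pool, as in C-ResNet18."""
    def __init__(self, channels=512):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.AdaptiveMaxPool2d(1)  # Maxpool replaces Avgpool

    def forward(self, x):  # x: (N, 512, H, W) feature map
        return self.maxpool(self.relu(self.bn(self.conv(x)))).flatten(1)

feat = ImprovedHead()(torch.randn(2, 512, 7, 7))  # -> (2, 512) feature vectors
```

The resulting 512-dimensional feature vector then feeds the fully connected classification layer, as in the original ResNet18.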

Improved self-attention weighted deep residual network model
The improved self-attention weighted deep residual network model is shown in Figure 4. The face image is first input into C-ResNet18 for feature extraction, and the extracted features are fed to the fully connected layer and to the weighting module. The weighting module learns sample weights from the facial features, which are used to weight the loss; finally, the facial expressions are classified by softmax.

The self-attention weighting module measures the contribution of each input sample to training, outputting a weight in the range 0 to 1 that is used to weight the loss. Let F = [F_1, F_2, …, F_N] denote the facial features of N images; the module takes F as input and outputs a weight for each feature. Specifically, the weighting module consists of a linear fully connected (FC) layer and a sigmoid activation function. The extracted expression feature F_i passes through the fully connected layer and the sigmoid activation, yielding the weight of the corresponding expression image:

α_i = σ(W_a F_i)  (1)

where σ is the sigmoid function and W_a is the parameter of the fully connected layer. The loss value is the objective of the overall network optimization and participates in the optimization process, so as to train a neural network model with higher accuracy. In this paper, the weighting module is used to weight the cross-entropy loss [6], whose expression is:

L_wce = −(1/N) Σ_{i=1}^{N} α_i log( e^{W_{y_i}^T F_i} / Σ_{j=1}^{C} e^{W_j^T F_i} )  (2)

where W_j represents the jth classifier and C is the number of categories. It can be seen from the above formula that L_wce and α_i are positively correlated.
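The weighting module and the weighted cross-entropy loss of equations (1) and (2) can be sketched as follows (the 512-dimensional feature size and 7 expression classes are assumptions based on the rest of the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionWeighting(nn.Module):
    """FC layer + sigmoid: maps each feature F_i to a weight
    alpha_i in (0, 1), as in equation (1)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, feats):  # feats: (N, feat_dim)
        return torch.sigmoid(self.fc(feats)).squeeze(1)  # (N,)

def weighted_ce_loss(logits, targets, alpha):
    """Cross-entropy weighted per sample by alpha_i, equation (2)."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (alpha * per_sample).mean()

feats = torch.randn(4, 512)                    # features from C-ResNet18
alpha = SelfAttentionWeighting(512)(feats)     # per-sample weights
logits = nn.Linear(512, 7)(feats)              # 7 expression classes
loss = weighted_ce_loss(logits, torch.tensor([0, 1, 2, 3]), alpha)
```

Since the sigmoid keeps every α_i in (0, 1) and the weights multiply the per-sample cross-entropy, samples the module judges less informative contribute less to the gradient.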

Data set and preprocessing
The experimental data sets are CK+ and RAF-DB, which include seven types of expressions: fear, surprise, happiness, anger, disgust, neutral, and sadness. The RAF-DB facial expression data set contains 15,339 expression pictures in total, of which 10,226 are in the training set and 5,113 are in the test set, as shown in Figure 5.

Experimental parameter setting
Experimental environment: the experiments are based on the PyTorch framework on the Windows platform. The hardware configuration is an Intel i5 processor, 8 GB of memory, and an NVIDIA GeForce GTX 1050 Ti graphics card with 4 GB of video memory; Python 3.7 is installed.
Parameter setting: the tuning of training parameters is very important to network performance, and the learning rate is a key parameter; a learning rate that is too high or too low harms training. The improved network model uses different learning rate settings for the two data sets. When training on the CK+ data set, the initial learning rate is set to 0.001 and is decayed to one-tenth at the 15th and 30th training epochs. When training on the RAF-DB data set, the initial learning rate is set to 0.005 and is likewise decayed to one-tenth at the 15th and 30th epochs.
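The schedule above (decay to one-tenth at epochs 15 and 30) can be expressed with PyTorch's built-in MultiStepLR scheduler; the model and optimizer here are placeholders:

```python
import torch

model = torch.nn.Linear(512, 7)                       # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.001)   # CK+ initial rate
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[15, 30], gamma=0.1)              # decay to 1/10 twice

for epoch in range(35):
    # ... one training epoch would run here ...
    sched.step()

final_lr = opt.param_groups[0]["lr"]  # 0.001 -> 0.0001 -> 0.00001
```

For the RAF-DB runs, only the initial learning rate changes (0.005), with the same milestones.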

Experimental results
This paper uses the improved self-attention weighted network model to conduct experiments on the CK+ and RAF-DB data sets. The experimental results are shown in Tables 1-4. Tables 1 and 2 show the facial expression recognition results of the improved network model and the original network model on the CK+ and RAF-DB data sets, respectively. It can be seen from Tables 1 and 2 that the facial expression recognition accuracy of the improved network is significantly higher than that of the original network. Tables 3 and 4 compare the recognition rate of the improved model in this paper with other models on the CK+ and RAF-DB data sets, respectively. The results in Tables 3-4 show that the improved model achieves a significant improvement in facial expression recognition accuracy over the other models.

Table 3. Comparison of facial expression recognition accuracy between the improved model in this paper and other models on the CK+ data set

Network model                        Accuracy/%
C-LeNet5 [3]                         94.37
EM-AlexNet [4]                       93.02
WGAN [5]                             96.00
Self-attention weighted C-ResNet18   98.89

Table 4. Comparison of facial expression recognition accuracy between the improved model in this paper and other models on the RAF-DB data set

Network model                        Accuracy/%
VGG16 [5]                            80.96
Wang et al. [6]                      82.83
Li et al. [7]                        85.07
Self-attention weighted C-ResNet18   87.13

Concluding remarks
Based on the deep residual network ResNet18, this paper proposes an improved network, C-ResNet18. The improvement replaces the last average pooling layer of the original network with a convolutional layer and a maximum pooling layer, so that the network extracts more discriminative features and avoids over-fitting; the loss function is weighted by a self-attention module, and repeated training yields a better-optimized neural network model. Experiments on the CK+ and RAF-DB data sets show that facial expression recognition accuracy reaches 98.89% and 87.13%, respectively, a significant improvement over the original ResNet18 network and other network models, which demonstrates that this method can effectively improve the accuracy of facial expression recognition and has practical application value.