Self-supervised Learning for Expression Recognition on Small-scale Data Set

In this paper, we propose a facial micro-expression recognition method that combines self-supervised learning with the Vision Transformer. We employ a contrastive learning approach for self-supervision and extract image features using an attention mechanism. Various data augmentation techniques are utilized, and we design an augmentation method specifically for facial recognition. By combining the strengths of the Vision Transformer and CNN models for feature extraction, our approach achieves improved recognition accuracy even with limited labeled data. Experimental evaluation shows that the proposed method performs well on facial micro-expression recognition tasks.

Machine learning methodologies, such as support vector machines (SVMs) and neural networks, have been employed since the 1990s to recognize facial expressions. Furthermore, researchers have assembled a plethora of datasets, including CK+, JAFFE and FERET, to assess and compare the efficacy of various expression recognition techniques. By extracting the locations and variations of facial feature points, Pantic and Rothkrantz were able to discern expressions, identify facial expressions based on texture and calculate the optical flow field of facial motion to capture the dynamic aspects of expressions [4]. The deployment of machine learning techniques has notably enhanced the performance of expression recognition, and the creation of dedicated datasets has significantly supported this research.
The performance of expression recognition has also begun to improve through the incorporation of methodologies such as transfer learning and multimodal learning. Lopes et al. proposed a convolutional neural network-based technique for facial expression recognition. To overcome the issue of insufficient training data, they applied transfer learning, adapting network weights pre-trained on large-scale image datasets such as ImageNet to the task of expression recognition [5]. Table 1 shows the common expression datasets, the number of facial expression images each contains and the best-performing model for each dataset so far. Other scholars have developed a multi-task framework based on deep learning that simultaneously performs facial key point detection and attribute prediction [6].
Facial expression recognition, especially micro-expression recognition, relies heavily on extensive data annotation. However, if feature extraction can be achieved via a self-supervised method, the volume of labeled data required for model training can be significantly reduced and recognition efficiency enhanced. Currently, numerous researchers are investigating the potential application of self-supervision in facial expression recognition.
In response to these challenges, this paper introduces a facial expression recognition method premised on self-supervised learning, CNNs and the Vision Transformer. Our approach employs self-supervised learning techniques to train the model. We utilize the Vision Transformer model, which excels at processing long sequence data, thereby enhancing the recognition of micro-expressions and emotional changes. Compared with traditional methods, our approach significantly improves the recognition accuracy of micro-expressions and exhibits superior generalization capabilities, enabling expression recognition on small-scale datasets. Through experimental evaluation, we demonstrate the strong performance of our proposed method on the task of facial micro-expression recognition.
During data augmentation, we explore an effective strategy for expression recognition. Some emotions cause distortions in the typically symmetrical facial features. We extract these distortions and amplify them using computer vision methods, then compare the facial features after enforcing symmetry.

Self-Supervised Learning in Expression Recognition
Self-Supervised Learning (SSL) is an unsupervised learning method that autonomously generates labels as it learns to represent input data. Tasks such as image reconstruction, image rotation prediction and contrastive learning are typical applications of self-supervised learning. Recent contrastive learning algorithms include SimCLR [7], MoCo [8] and BYOL [9]. Both natural language processing and computer vision have greatly benefited from this strategy, with many machine vision researchers adopting it.
Koepke et al. proposed a self-supervised framework for learning facial attributes from video data, using a model that embeds multiple frames of the same individual into the state space [10]. This framework allows the learning of attributes such as pose, key points and expressions for each independent task without supervised data. Jakab et al. proposed a method for training landmark detectors for visual elements, such as the eyes and nose on a face, without any manual guidance [11]. Their approach can be applied to a broad range of datasets, including faces, humans, 3D objects and digits, without the need for any modifications. Wang et al. suggested a self-supervised method for building adversarial networks using a CNN structure for a small unlabeled expression dataset [12]. After training the GAN on the target dataset, the CNN model was modified by fusing the generated images with the source data to improve recognition accuracy.

Transformer in Expression Recognition
Since the introduction of attention research, numerous transformer models have been developed. The transformer utilizes self-attention to capture global information more effectively than CNN models, yielding modest improvements in performance and efficiency. Xue et al. proposed a transformer-based method for facial expression recognition [13]. After extracting feature maps using a backbone CNN, local CNN blocks were created to identify various local patches. A transformer encoder then explored the global relationships between these local patches using a multi-head self-attention dropping module.

The challenge of small-scale datasets
Currently, the datasets available for facial expression recognition are small in comparison to those used for natural language processing (NLP) and image classification tasks, making it impractical to train large models like ChatGPT. Recent studies have indicated that residual blocks with shortcut connections can be optimized more easily and that the issue of vanishing gradients is less prominent. When dealing with small expression recognition datasets, robust recognition using residuals can enhance recognition performance [14].

Method
The goal of this study is to train a hybrid model for expression recognition using self-supervised learning, CNN and Transformer, while also accurately recognizing micro-expressions. The conceptual diagram of the model is depicted in Figure 1.
We employ contrastive learning for self-supervision. The input facial expression images are processed using an augmentation method specifically designed for faces. For feature extraction, a hybrid model is utilized. This model begins with a CNN, to which an attention mechanism is added after the convolutional layers. The NT-Xent loss function is used for training and the processed image features are then computed. The study's main focus lies in designing a training model for facial recognition that delivers exceptional results for micro-expression recognition.

Data Augmentation
Data augmentation is a common technique in both supervised and self-supervised learning. The utilization of appropriate data augmentation methods can significantly enhance learning efficiency. In contrastive learning, data augmentation plays an especially crucial role. Research in data processing suggests that using a single type of data augmentation often yields minimal effectiveness. Rather, combined data augmentation methods, particularly those prioritizing random cropping and random color distortion, demonstrate superior results [7]. The sequence of these operations is important: random cropping is done first, followed by color distortion. Given that our experiments use grayscale images, color distortion is not an ideal augmentation method. As a result, we propose a symmetry-based augmentation method specifically for facial recognition. This method yields impressive results, especially in contrastive learning, where the feature loss between both sets of images is calculated.

If we denote the original image as I and I(i, j) as the pixel value at position (i, j), we can apply an augmentation technique of flipping the grayscale image, tailored specifically for facial recognition:

I′(i, j) = I(i, W − j + 1),  (1)

where W is the image width. This results in two vertically symmetrical augmented images. Within this process, we also incorporate some random cropping. The results show that these combined data augmentation techniques contribute to the effective training of the model for feature extraction.
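The flip-based augmentation described above can be sketched in a few lines of plain Python. This is an illustrative implementation rather than the authors' code: the image is represented as a list of pixel rows, the crop size is a hypothetical parameter, and the mirrored view is paired with the cropped view to form the positive pair for contrastive training.

```python
import random

def hflip(img):
    """Mirror a grayscale image (list of rows) about its vertical axis,
    i.e. the flipped pixel at column j comes from column W - 1 - j."""
    return [list(reversed(row)) for row in img]

def random_crop(img, size, rng=random):
    """Crop a random size x size window from a grayscale image."""
    h, w = len(img), len(img[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def symmetric_pair(img, crop_size, rng=random):
    """Produce two mirror-symmetric augmented views of one face image."""
    view = random_crop(img, crop_size, rng)
    return view, hflip(view)
```

Because the two views differ only by the mirror symmetry of the face, any left-right asymmetry caused by an expression shows up as a difference between the pair, which the contrastive objective must then reconcile.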

Feature extraction
For the evaluation of facial expression recognition datasets, the majority of the utilized methods are based on CNN architectures, as shown in Table 1. Considering that most of these methods are supervised, our attempts at developing an unsupervised model hold considerable potential.
In processing facial expression images, we initially perform data augmentation. Subsequently, we employ two distinct methods. The first approach leverages the synergy of Convolutional Neural Networks (CNNs) and contrastive learning to extract salient features. In the second approach, we harness a fusion of CNN and Transformer architectures for image processing. Finally, the outputs from both methods are combined to train the final predictive model.
During feature extraction from the augmented data, we first employ the CNN method. The network comprises six convolution layers and three Max-Pooling layers, followed by two densely connected layers. The number of convolution filters doubles after each Max-Pooling layer, starting with 64, then 128 and finally 256. The filter window size is 3×3 and Max-Pooling layers with a stride of 2 are placed after every two convolutional layers.
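The reported feature-map sizes (48×48, then 23×23, then 11×11 in Table 1) are consistent with 'same'-padded 3×3 convolutions, which preserve spatial size, and 3×3 pooling windows with stride 2. The paper states only the sizes, so the padding and window choices here are our assumptions; a short sketch verifies the arithmetic:

```python
def pool_out(n, k=3, s=2):
    """Spatial size after a k x k pooling window with stride s and no
    padding: floor((n - k) / s) + 1."""
    return (n - k) // s + 1

# With 'same'-padded convolutions, the spatial size is changed only by
# the pooling layers, reproducing the reported block sizes:
sizes = [48]
for _ in range(2):
    sizes.append(pool_out(sizes[-1]))
# sizes == [48, 23, 11]
```

A plain 2×2 pooling window with stride 2 would instead give 48 → 24 → 12, which does not match the reported sizes; this is why we assume a 3×3 window.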
The blue segment of Figure 1 denotes the vector generated following contrastive learning, while the green segment represents the vector yielded by the projection head after processing by the Transformer. These two are jointly utilized to extract features, from which we derive the final model.
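As a minimal sketch of this fusion step (the details are our assumption, since the paper only states that the two vectors are jointly utilized), the two branch embeddings can be L2-normalized and concatenated before the final classifier:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (no-op for zero vectors)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def fuse(cnn_feat, transformer_feat):
    """Concatenate the two normalized branch embeddings into one joint
    feature vector, so neither branch dominates by sheer magnitude."""
    return l2_normalize(cnn_feat) + l2_normalize(transformer_feat)
```

Normalizing before concatenation keeps the contrastive (blue) and Transformer (green) features on a comparable scale, which matters when one branch produces systematically larger activations.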

Loss Function and Training
For loss calculation, we use the InfoNCE loss function [8]. This function aims to increase the similarity between pairs of augmented images derived from the same image while minimizing their similarity with other images. We found that this loss function performs better than alternatives such as logistic loss or margin loss.
(z_i, z_i^+) denotes a pair of positive examples. In InfoNCE, higher similarity between positive examples is preferred [8]. This loss function has been utilized in prior work [15].
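A reference implementation of this loss in plain Python may clarify the computation. The batch layout, pairing scheme and temperature value below are illustrative assumptions, not the paper's settings:

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nt_xent(z, pairs, tau=0.5):
    """NT-Xent / InfoNCE loss over a list of embeddings z.
    pairs[i] is the index of z[i]'s positive (the other augmented view).
    For each anchor i the loss is
        -log( exp(sim(i, i+)/tau) / sum_{k != i} exp(sim(i, k)/tau) ),
    averaged over all anchors."""
    n = len(z)
    total = 0.0
    for i in range(n):
        denom = sum(math.exp(cos_sim(z[i], z[k]) / tau)
                    for k in range(n) if k != i)
        pos = math.exp(cos_sim(z[i], z[pairs[i]]) / tau)
        total += -math.log(pos / denom)
    return total / n
```

Raising the similarity of the positive pair increases the numerator, while pushing apart all other embeddings shrinks the denominator, so minimizing this loss does both at once.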

Data Reduction
In this study, we make several attempts to reduce the amount of data, aiming to find a method capable of extracting features with minimal data. Initially, we planned to use several small datasets for the experiments and applied the dropout method to reduce the amount of data used. Since this study targets expression datasets and micro-expression recognition, the dataset used is itself small-scale. We also considered altering the dataset by randomly removing data from it. During training, the network learns to develop meaningful feature representations on unlabeled datasets, extracting the best features from as little data as possible.

Dataset
FER2013 (Facial Expression Recognition 2013) is a dataset for static facial expression recognition comprising 35,887 grayscale images of human faces. The images, sized 48×48 pixels, depict seven categories of expression: anger, disgust, fear, happiness, sadness, surprise and neutrality. The dataset is split into training, public test and private test sets. CASME (Chinese Academy of Sciences Micro-Expression) is a micro-expression recognition dataset that includes 195 videos of facial micro-expressions. Due to the fleeting and subtle nature of micro-expressions, the CASME dataset is primarily used for dynamic facial expression detection and micro-expression analysis tasks.
Given that the CASME dataset consists of images extracted from videos, it inherently runs a risk of overfitting during the training phase. As illustrated in Figure 2, there are significant differences in color and resolution between the images of the FER2013 and CASME datasets. Consequently, we standardize all images to grayscale in the experiment. Furthermore, due to the limited number of images capturing emotional expressions within the CASME dataset, it must be combined with other datasets during both the training and testing stages. To address these issues, we employ a two-step approach: first training the model on the FER2013 and CASME datasets, then applying it to the CASME dataset. Our findings indicate that, when training on the FER2013 dataset, the contrastive learning method we used achieves 70% accuracy. Although this does not reach the state of the art, the hybrid model combining self-supervised learning and attention with a CNN can exploit both to reduce the learning loss to a significantly low range. Moreover, the current models that outperform this study all rely on auxiliary labeling and large-scale model pre-training. In contrast, the utilization of self-supervision and the transformer in this study is a novel idea, and there is still much room for improvement.
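The grayscale standardization can be done with the widely used ITU-R BT.601 luma weights. The paper does not specify its conversion formula, so this is one plausible sketch:

```python
def to_grayscale(rgb_img):
    """Convert an RGB image (rows of (R, G, B) tuples) to grayscale using
    the ITU-R BT.601 luma weights: Y = 0.299 R + 0.587 G + 0.114 B."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_img]
```

Applying the same conversion to both datasets removes the color and tone differences visible in Figure 2, so the model cannot use color statistics to tell FER2013 and CASME images apart.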
The model trained on the FER2013 dataset was used for validation on the micro-expression dataset CASME. Initially, we intended to use the CASME dataset alone for the experiments, but because the data in CASME are too similar to one another, the resulting models suffered from excessive loss and poor accuracy. We therefore used a mixture of FER2013 and part of the CASME dataset as the training set, then performed micro-expression recognition on a held-out set of CASME data. From each image sequence, the one or several frames in which the expression changes are extracted. Our model performs quite well in this task. Validated against the micro-expression hints given by the authors of the dataset for each group, most of the images in each micro-expression group can be verified, as shown in Figure 3.
When testing on the micro-expression dataset, since the dataset is not partitioned into training, test and validation sets, we performed this segregation ourselves. After normalizing the image sizes, we loaded the model trained on the FER2013 dataset and retrained it on the training split of the micro-expression dataset. Validation was based on the experimenters' emotional data provided by the dataset's authors. Emotional variations were identified in a series of video-cropped images and classification was performed according to the labels of the FER2013 dataset.

Training Strategies
We employed different fine-tuning approaches for different datasets. As the FER2013 images are static and the CASME images are sequential, the CASME dataset exhibits a stronger temporal correlation. This makes it a better fit for contrastive learning, which benefits from stronger correlations between randomly sampled images.
During stochastic gradient descent optimization, we experimentally adjusted the momentum and scheduled the learning rate. To prevent overfitting, we added an L2 regularization term to the loss function and used a small weight-decay coefficient.
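The described optimizer, SGD with momentum plus an L2 penalty (weight decay), can be sketched as a per-step update. The hyperparameter values below are illustrative defaults, not the paper's settings:

```python
def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 regularization.
    The L2 term contributes weight_decay * w[i] to each gradient
    component, pulling the weights toward zero."""
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, velocity):
        g = gi + weight_decay * wi   # gradient of loss + (wd/2) * ||w||^2
        v = momentum * vi + g        # momentum accumulation
        new_v.append(v)
        new_w.append(wi - lr * v)    # parameter update
    return new_w, new_v
```

Keeping the weight-decay coefficient small, as described above, adds just enough shrinkage to discourage overfitting without overwhelming the task gradient.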

Discussion
During the experiments, we found that the model employing the attention mechanism required more epochs than the model using only a CNN. Although the slower convergence consumed more resources, the attention mechanism resulted in better feature extraction.
The inherent relatedness of the micro-expression dataset makes it easier to extract features using contrastive learning methods. This demonstrates the potential of self-supervised learning, coupled with attention mechanisms, to learn from these complex and nuanced facial expression datasets.

Conclusion
In this study, we proposed a facial recognition model based on self-supervised learning. Leveraging contrastive learning and feature extraction using self-attention methods, we developed a model capable of recognizing subtle nuances in facial expressions. Various image augmentation strategies were employed, and a specialized augmentation method designed specifically for facial recognition showed promising results.
Our model, integrating both Transformer and Convolutional Neural Network (CNN) architectures for feature extraction, demonstrated the power of these combined methods in capturing complex facial expression information. Notably, even with a limited amount of data, our self-supervised approach aided effective feature extraction, thus improving recognition efficiency. By carefully adjusting the model and its parameters, our approach provides a promising avenue for improved facial and micro-expression recognition.

Figure 3. Pictures of happy micro-expressions occurring.

Table 1. Facial expression recognition datasets.

The input image size is 48×48. The first convolution block output size is 48×48×64, the second convolution block output size is 23×23×128 and the third 11×11×256, as shown in Figure 1.