CNN combined with data augmentation for face recognition on a small dataset

Faces share a universal structure yet contain features that are distinct among individuals. Recognizing individuals by their faces has long been a popular topic in pattern recognition and computer vision, and many traditional approaches have yielded satisfying results. In recent years, rapid growth in deep learning has encouraged researchers to apply deep learning methods to authentication problems. Convolutional neural networks are among the most popular deep neural networks; they stack multiple layers and reduce the number of parameters by using kernels to capture features from the input. They perform outstandingly in pattern recognition because they extract features automatically and take images directly as inputs. In machine learning, data augmentation is a technique that seemingly enlarges a dataset to avoid the underfitting or overfitting problems caused by insufficient data. This paper uses a convolutional neural network to solve a face recognition problem on a small dataset, compares its performance with traditional face recognition methods such as Principal Component Analysis, and examines the impact of data augmentation on performance. Overall, data augmentation boosts the accuracy of the network but also results in an unsteady learning curve. The convolutional neural network performs well on pattern recognition and obtains an accuracy of 94% on the augmented dataset with only two convolutional layers.


Introduction
Face recognition is a subcategory of biometric identification. As a research topic in computer vision and pattern recognition that has been pursued for 50 years, it has grown rapidly in recent years and is considered one of the most successful biometric systems for its convenience, accuracy, and wide range of applications [1]. Compared with fingerprint recognition systems, which need specific devices to obtain clear fingerprints, or voice verification systems, which may suffer from noise and tone changes, face recognition systems are both consistent and able to obtain inputs from non-ideal environments using inexpensive cameras [2]. Face recognition has gradually become an inseparable part of modern identification systems and is extensively used in areas such as law enforcement, security, and consumer electronics.
Face recognition analyzes human facial structure, such as the distance between the eyes, the angle of the chin, and the height and width of the nose. The process translates these features into numerical codes; the system then analyzes those codes and matches them against a collection of stored faces. A face recognition system is a three-step process: face detection, feature extraction, and face recognition [3]. 1) Face detection: this process detects and locates faces in a given image, separating human faces from the background. It is the key step of automatic face recognition, and in most applications it requires fast processing with real-time capability. 2) Feature extraction: this process identifies the geometric structure of human faces. It applies normalization to the detected faces and obtains the shape or features of face components such as the nose and eyes. 3) Face recognition: this step compares the previously extracted features with a collection of stored faces and decides the identity of the input individual. It shares a similar approach with most biometric systems, which take an input image and produce a vector of numerical codes representing the characteristics of an individual.
The idea of face recognition first emerged in 1970, when it was treated as a 2D pattern recognition problem [4]. Before the rise of deep learning and artificial intelligence, the most common approaches were the holistic approach and the feature-based approach. Holistic approaches take the entire face as a feature for detection and ignore individual face components such as the eyes, nose, and mouth; they include Eigenfaces, PCA, LDA, and ICA. Feature-based approaches, on the contrary, use individual facial features for detection and rely on heuristic parameters.
As deep learning and artificial intelligence have developed rapidly in recent years, many classic computer vision problems have taken the deep neural network approach. The Convolutional Neural Network (CNN) is a popular artificial neural network specifically designed to process image data and is widely used in image recognition and processing. The built-in layers of a convolutional neural network can reduce the dimensionality of an image without losing information. In the context of face recognition, the CNN approach generates feature maps for each input face; however, it does not focus on specific facial features like the feature-based approach. This network is capable of learning features directly from the image and has been used in image recognition tasks such as handwriting recognition and medical image classification.
This paper attempts to build and train a convolutional neural network for a face recognition task and to explore the performance and accuracy of the system. The system should be able to identify individuals despite differences in facial expression, accessories, and lighting. To achieve this, a small face dataset called the Olivetti dataset is used. In the end, with data augmentation, the convolutional neural network approach yields an accuracy rate of 94% on the Olivetti dataset. In comparison, the most popular face recognition method, Principal Component Analysis, has an accuracy rate of 92% on the same dataset [5]. Despite its better performance, the convolutional neural network with data augmentation also has its drawbacks; this paper further analyzes the performance of the system and discusses its potential improvements. The proposed system consists of four steps, shown in Figure 1. First, it splits the target dataset into training and testing sets. Next, it utilizes the data augmentation technique, applying a series of image preprocessing methods to generate an augmented training dataset. The system then trains and tests the CNN to obtain the corresponding results.

Olivetti face dataset
There are numerous open-source face databases, such as PubFig, the colour FERET database, the CelebA dataset, and MTFL. In this paper, the Olivetti faces dataset is used. The Olivetti faces dataset contains ten greyscale images of each of 40 subjects. Each subject is labelled between 0 and 39, indicating the identity of the pictured person. Lighting, facial expressions, and facial details vary. The pictures were taken between April 1992 and April 1994 at AT&T Laboratories Cambridge [6]. This paper uses 70% of the Olivetti dataset for training (i.e. 280 images) and 30% for testing (i.e. 120 images).
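The split above can be sketched with scikit-learn, which ships the 64 x 64 version of the dataset described in the next paragraph. A stratified 70/30 split keeps 7 training and 3 testing images per subject; the `random_state` value here is an arbitrary choice for illustration, not one taken from the paper.

```python
# Load the Olivetti faces (400 greyscale 64x64 images, labels 0-39) and
# make a stratified 70/30 split. The data is downloaded on first use.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split

faces = fetch_olivetti_faces()
X, y = faces.images, faces.target          # X: (400, 64, 64), y: (400,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)         # (280, 64, 64) (120, 64, 64)
```

Stratifying on the labels matters here: with only ten images per subject, a plain random split could leave some identities with almost no training examples.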
The original dataset has a 92 x 112 resolution that shows the entire head, hair, and neck. This paper uses the scikit-learn version of the dataset, which is scaled to 64 x 64 and centred on the face structure. In the scikit-learn version, the chin in some images is cropped off, resulting in an incomplete face shape. This change would not be ideal for holistic approaches, since some of them consider the entire face as a feature, and face shape is an essential part of it. However, this processed version of the Olivetti dataset highlights unique features such as the nose, eyes, and mouth. Meanwhile, it normalizes the faces for training and testing.
The Olivetti dataset has its drawbacks. Since the dataset contains only 400 images, 30% of which are used for testing, the training set may not be representative enough for the system. The model may fail to capture the relationship between input and output variables and is therefore prone to underfitting. To solve this, a technique called data augmentation is used in this study.

Data augmentation
Data augmentation is a technique to artificially increase a dataset by generating new data from existing data [7]. It slightly modifies the existing dataset to make it look like a larger dataset to the model. However, a normalized face image should not be sheared, flipped upside down, or zoomed: those operations would distort the similarity among samples and defeat the very purpose of dataset preprocessing. Rotation, horizontal flipping, and slight shifts in both directions, on the other hand, make the dataset more complex while still fitting the face recognition methodology. Images in the training set are rotated between 0 and 15 degrees clockwise or counterclockwise, shifted 0 to 6 pixels in both directions, and flipped horizontally. Note that none of these augmentations is fixed, and each image may have different augmented versions in different epochs, therefore making the dataset seemingly larger. Figure 2 shows an example of data augmentation on the Olivetti dataset. Data augmentation is applied randomly here: the first image in the first row has width and height shifts and rotation, while the last image in the first row is only rotated slightly.
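The paper does not name the augmentation library it uses; one plausible sketch, using Keras's ImageDataGenerator with the parameters stated above, is the following. The random batch of faces is a stand-in for the real training images.

```python
# Random augmentation matching the text: rotation up to 15 degrees either
# way, shifts up to 6 pixels (6/64 of the 64x64 image) in each direction,
# and random horizontal flips. Because transforms are re-sampled on every
# pass, each image gets a different augmented version each epoch.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,          # up to 15 degrees, either direction
    width_shift_range=6 / 64,   # up to 6 pixels horizontally
    height_shift_range=6 / 64,  # up to 6 pixels vertically
    horizontal_flip=True)       # random mirror

# A stand-in batch of 20 greyscale faces, with a channel dimension added.
batch = np.random.rand(20, 64, 64, 1).astype("float32")
augmented = next(augmenter.flow(batch, batch_size=20, shuffle=False))
print(augmented.shape)          # shape is preserved: (20, 64, 64, 1)
```

Feeding the generator to training (rather than pre-computing augmented copies) is what makes the dataset "seemingly larger": the model never sees exactly the same training image twice.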

Convolutional neural network
The Convolutional Neural Network is one of the most popular networks in the deep learning field and is widely used in many tasks [8,9]. It has the capacity to handle extensive data and, in recent years, has surpassed classic approaches in many areas, especially pattern recognition and computer vision. A CNN is a multi-layer network that consists of convolutional layers, activation layers, pooling layers, and fully-connected layers. The convolutional layer produces feature maps of the input; the activation layer then adjusts and activates the feature maps. Next, the pooling layer reduces complexity for further layers. Lastly, the fully-connected layer produces the outputs [10].
In the proposed system, the CNN takes a 64 x 64 image as input, computes its feature maps in the convolutional layers, and produces an output between 0 and 39 that represents one of the 40 individuals. Essentially, this system treats the face recognition problem as a classification problem: each individual is its own class.
The CNN in this system is built with TensorFlow. It comprises two convolutional layers (with 2 x 2 kernel size, no zero-padding, and 1 x 1 strides), each followed by max-pooling, and a fully-connected layer with a softmax classifier at the end. The network uses ReLU as the activation function, Adam as the optimizer, and cross-entropy as the loss function. A dropout rate of 0.25 is applied at the second convolutional layer. The network is trained on the 280 images separated from the Olivetti faces dataset and tested on the remaining 120 images. It takes a 64 x 64 image as input, and its output is a number between 0 and 39 representing the image's identity. The CNN has a batch size of 20 and a learning rate of 0.001. This study measures the accuracy on the testing set after each epoch to evaluate performance.

Figure 3 shows the performance of the CNN without data augmentation. Since the Olivetti dataset is small, the results without data augmentation are not ideal. The testing accuracy at 100 epochs is 91.67%. The model remains mostly constant after 20 epochs, and there is a significant gap between testing loss and training loss. This is a sign of underfitting, caused by a small training set that is not representative enough for the model to capture the relationship between its input and output. When data augmentation is used, the increases in the training and validation curves become less steep, meaning the model takes longer to converge. This is because the model is working on a seemingly larger dataset that is slightly different in every epoch. It yields an accuracy of 94.17% at the end of 150 epochs. Additionally, comparing the performance in Figure 4 and Figure 5, as more data augmentation methods are added, the accuracy curve takes longer to converge. Similarly, the gap between validation loss and training loss narrows as more augmentation methods are added. Clearly, with data augmentation, the previously stated underfitting problem is alleviated.
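A minimal Keras sketch of the architecture described above might look as follows. The 2 x 2 kernels, valid (no zero) padding, unit strides, max-pooling, 0.25 dropout at the second convolutional layer, softmax over 40 classes, Adam with learning rate 0.001, and cross-entropy loss all follow the text; the filter counts (32 and 64) are assumptions, as the paper does not state them.

```python
# Two-convolutional-layer CNN for 40-class face recognition, per the
# description in the text. Filter counts 32/64 are assumed, not stated.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 1)),
    tf.keras.layers.Conv2D(32, (2, 2), strides=1, padding="valid",
                           activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (2, 2), strides=1, padding="valid",
                           activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Dropout(0.25),          # dropout at the second conv block
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(40, activation="softmax"),  # one class per subject
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

print(model.output_shape)   # (None, 40)

# Training would then be along the lines of:
# model.fit(X_train, y_train, batch_size=20, epochs=150,
#           validation_data=(X_test, y_test))
```

Sparse categorical cross-entropy is used here because the Olivetti labels are integers 0-39 rather than one-hot vectors; with one-hot labels, plain categorical cross-entropy would be the equivalent choice.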

Discussion
Comparing the accuracy curves in Figure 5, the training accuracy tends to fluctuate more as data augmentation is introduced. This tendency is caused by the training set being randomly re-augmented in each epoch, making it harder for the model to fit.
The testing accuracy is higher than the training accuracy for a similar reason. Data augmentation is not applied to the testing set; therefore, the model tends to fit the testing set better than the augmented training set. This is an inevitable outcome of data augmentation, since the artificial samples are not true data and may contain errors that deviate from the ground truth.

Conclusion
In this paper, a deep neural network is built and tested for face recognition. It yields an accuracy rate of 94%, which outperforms the classic machine learning approach on the same dataset. CNNs trained on small datasets tend to underperform, and data augmentation can artificially enlarge a dataset and boost the overall accuracy rate. This technique improves performance, but it requires more time for the learning curve to converge, as the training set becomes more complex than the original, and the learning curve becomes unsteady. This reveals the drawback of data augmentation: artificially created images tend to deviate from the testing subjects, and after repetitive training this deviation becomes a challenge for the current network to process. Possible solutions include manually enlarging the dataset by adding authentic images of human faces, or adding more layers to the CNN to extract more features capable of handling datasets with more variation.

Figure 1. Flow chart of face recognition with the CNN network and the Olivetti faces dataset.

Figure 2. Olivetti faces dataset with data augmentation (left) and without data augmentation (right).

Figure 3. Performance of the model without data augmentation.

Figure 4. Performance of the model with image rotation and flip.

Figure 5. Performance of the model with image rotation, flip, and shift.