Efficient Implementation of Face Mask Detection Using MobileNet

During the global pandemic, wearing a mask in public areas is a direct and mandated measure to prevent the spread of the virus. Developing a machine learning model that detects whether a mask is worn properly is therefore both practical and meaningful. This paper presents mask recognition based on MobileNet, a lightweight branch of the Convolutional Neural Network (CNN) family in deep learning. Motivated by the background and pressing need for face mask detection, we first review MobileNet's history and its suitability for demanding applications. We then explain MobileNet's structure in more detail and show how its design choices set it apart. In the experiment, approximately 9,000 images were used to train and optimize the model. The resulting lightweight model achieves acceptable accuracies of 87.96% for detecting whether a person wears a mask and 93.5% for detecting whether the mask is worn correctly. We also identified several shortcomings and limitations during testing; these problems, along with possible future directions for MobileNet, are discussed. Overall, MobileNet shows strong potential for further improvement and for deployment on small but sophisticated devices.


Introduction
Since the coronavirus outbreak in 2019, the virus has spread rapidly and widely, bringing countless grievous headlines, sharp increases in unemployment, and ruined families. To date, the pandemic has caused more than 4 million deaths worldwide, comparable to the total number of war deaths over the past 40 years. Because the coronavirus is highly contagious and transmitted through droplets and breathing, it is necessary to wear a face mask that keeps the virus out of the respiratory system. A mask that reaches more than 95% filtering efficiency is our first protective armor against contact with the virus.
Recently, the good news is that more people now wear masks as a matter of course when going out. According to wearing guidelines, a face mask should cover the user's mouth and nose, the two entrances to our respiratory system. However, some people fail to follow these instructions, so supervision is still needed. This is where technologists encounter several difficulties, especially in checking automatically whether individuals wear their masks correctly.

The Outline
Although many detection algorithms already exist, a faster and more accurate one is still needed for practical deployment. To keep the detection algorithm simple enough for a built-in chip with limited capacity, MobileNet, a lightweight but still efficient neural network, is the natural first choice. First, we apply MobileNet to a dataset of about 9,000 face images, distributed roughly in thirds: with a mask, without a mask, and with a mask worn incorrectly. We then divide the project into two tasks: one measures MobileNet's accuracy in checking whether a mask is worn at all, while the other inspects whether the mask is worn correctly. By observing the behavior of the network and adjusting hyperparameters, the test accuracies of the two tasks reach 87.96% and 93.5% respectively, which is acceptable. Under these training conditions, MobileNet demonstrates strong performance, suggesting considerable potential for practical deployment.
The rest of this paper is organized as follows. Section 2 gives an overview of the proposed method, centered on MobileNet, a widely used lightweight neural network. Section 3 describes the experimental process, the test results for the two tasks, and the stages of optimizing the method by adjusting hyperparameters. Limitations and future prospects are discussed in Section 4.

Methods Used
The dataset for this project was downloaded from Kaggle and is subdivided into three folders: with-mask, without-mask, and with-mask-incorrectly. We chose the first version of MobileNet from the series to deploy and train on this image set, since this generation introduced the first valid definition of depthwise separable convolution and the overall network structure, both dedicated to reducing computation and parameter counts. Once the algorithm was confirmed to be available, the model-training experiments began. An introduction to MobileNet V1 follows.
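As a sketch of the data pipeline, the three class folders can be indexed into labeled samples. The folder names and file extension below are assumptions based on the description above, not the exact Kaggle layout.

```python
from pathlib import Path

# Hypothetical folder names matching the three classes described above
CLASS_NAMES = ["with-mask", "without-mask", "with-mask-incorrectly"]

def index_dataset(root):
    """Collect (path, label) pairs from root/<class-folder>/ subdirectories."""
    root = Path(root)
    samples = []
    for label, name in enumerate(CLASS_NAMES):
        for path in sorted((root / name).glob("*.png")):
            samples.append((path, label))
    return samples
```

Indexing paths up front keeps the split deterministic and lets the images be decoded lazily during training.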

MobileNet V1
In this part, we first show the model's architecture, including the types, numbers, and combinations of layers used, followed by the innovative resource-saving principles of MobileNet, and finally the even smaller "thinner" models obtained with the width multiplier and resolution multiplier. Table 1 shows the layer structure of MobileNet [1]. Every standard convolutional layer and depthwise separable convolution is followed by batch normalization and a ReLU nonlinearity, and the final features are fed into a softmax layer for classification. Pooling layers are removed throughout the 28-layer network (the average pool and softmax are excluded from this count); instead, strided convolutions are frequently used for downsampling [10]. As for training cost, the model concentrates almost all intensive operations in 1x1 convolutions: about 95% of the computation time and three-quarters of the parameters are spent there, with the remainder in the fully connected layer. In contrast to networks that grow ever deeper and more complicated, MobileNet aims for a small but efficient architecture, which matches our requirements perfectly. The challenge is to reduce the model's complexity while maintaining its stability. MobileNet copes with this by factorizing each standard convolution into a depthwise convolution followed by a 1x1 pointwise convolution, which together form a depthwise separable convolution [6]. In terms of computational savings, the cost of a depthwise separable convolution relative to a standard convolution is 1/N + 1/D_K^2, where N is the number of output channels and D_K is the kernel size. The goal is to reduce the model size and improve its speed while maintaining performance. For instance, when N is relatively large and a 3x3 kernel is used, depthwise separable convolution reduces the computation by roughly a factor of 9 compared with standard convolution.
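The factorization can be sketched in NumPy: a depthwise stage filters each channel independently, then a 1x1 pointwise stage mixes channels. This is a minimal loop-based illustration (no padding or stride), not the optimized implementation.

```python
import numpy as np

def depthwise_separable_conv(x, dw_filters, pw_filters):
    """x: (H, W, M) input; dw_filters: (K, K, M); pw_filters: (M, N)."""
    h, w, m = x.shape
    k = dw_filters.shape[0]
    oh, ow = h - k + 1, w - k + 1
    # Depthwise stage: one K x K filter per input channel, no channel mixing
    dw = np.zeros((oh, ow, m))
    for c in range(m):
        for i in range(oh):
            for j in range(ow):
                dw[i, j, c] = np.sum(x[i:i+k, j:j+k, c] * dw_filters[:, :, c])
    # Pointwise stage: 1x1 convolution combines the M channels into N outputs
    return dw @ pw_filters

x = np.random.rand(8, 8, 16)
out = depthwise_separable_conv(x, np.random.rand(3, 3, 16), np.random.rand(16, 32))
# out has shape (6, 6, 32), the same as a standard 3x3 convolution would give
```

The parameter counts make the 1/N + 1/D_K^2 ratio concrete: the two stages use K*K*M + M*N weights versus K*K*M*N for a standard convolution.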
Figure 2 shows the basic structure of a depthwise separable convolution compared with a standard convolutional layer. We can easily see that more ReLU layers are used, which enhances the model's generalization ability. Even though these techniques already reduce the scale of the model significantly, engineers often need an even smaller model, and the width multiplier and resolution multiplier help here. The width multiplier shrinks the number of channels in each layer proportionally, while the resolution multiplier shrinks the spatial size of the feature maps proportionally. Introducing these two parameters inevitably degrades MobileNet's accuracy, so choosing them is a compromise between accuracy and computation, and between accuracy and model size.
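The effect of the two multipliers on the multiply-add count follows directly from the layer cost formula. The layer sizes below are illustrative values, not the exact MobileNet configuration.

```python
def separable_madds(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """Multiply-adds of a depthwise separable layer with width multiplier
    alpha (scales channel counts M, N) and resolution multiplier rho
    (scales the feature-map side D_F)."""
    m_a, n_a = alpha * m, alpha * n
    d_fr = rho * d_f
    depthwise = d_k * d_k * m_a * d_fr * d_fr
    pointwise = m_a * n_a * d_fr * d_fr
    return depthwise + pointwise

full = separable_madds(3, 256, 256, 14)
thin = separable_madds(3, 256, 256, 14, alpha=0.5, rho=0.5)
# Halving both multipliers cuts the cost to well under a tenth of the full layer
```

Because the pointwise term dominates, the cost scales roughly as alpha^2 * rho^2, which is why modest multiplier reductions shrink the model so aggressively.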

Experiment Results and Analysis
The experiment tests MobileNet on the dataset of 9,000 images in three categories (with-mask, without-mask, and with-mask-incorrectly). The dataset, obtained from Kaggle and shared by Vijay Kumar, has already been cleaned and is equally distributed across the classes. All experimental results were obtained by adjusting hyperparameters. There are two tasks: one checks whether a mask is worn at all, and the other inspects whether the mask is worn correctly or not. Each task is trained on 6,000 images, which are resized to 128x128 and split into training, validation, and test sets in an 8:1:1 ratio. The results show that the proposed method performs well on this dataset. For task 1, the with/without check, the highest accuracy reaches 87.96% with the learning rate ranging from 0.01 to 0.001 over 50 epochs; for task 2, the correct/incorrect check, the best trained model reaches 93.5% with the same learning rate range over 30 epochs.
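The 8:1:1 split described above can be realized with a simple seeded shuffle. This is a sketch of one plausible implementation, not necessarily the exact procedure used.

```python
import random

def split_indices(n, seed=0):
    """Shuffle n sample indices and split them into train/val/test at 8:1:1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(6000)
# 4800 training, 600 validation, and 600 test images per task
```

Fixing the shuffle seed keeps the split reproducible across the hyperparameter runs reported below.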

Task 1: Does the person wear a mask?
In the first attempt, with the learning rate ranging from 0.001 to 0.0001 and 25 epochs, the test accuracy is only 52%, which is relatively low. We then raised the learning rate range from (0.0001, 0.001) to (0.001, 0.01) and varied the number of epochs over 25, 30, and 50 to push the accuracy as high as possible. As the table below shows, the first column increases steadily as the number of epochs grows. After multiplying the initial learning rate by ten, the second column shows that the model is stable and reliable with the given settings. The runs with the learning rate ranging from 0.01 to 0.001 in the second line generally outperform the former, and at 50 epochs we obtain the best result so far: a test accuracy of 87.96%.
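The runs above describe learning rates "ranging from" a start value to an end value over a training run. One common way to realize that is an exponential decay schedule; the sketch below is written under that assumption, since the exact schedule is not specified in the text.

```python
def lr_schedule(epoch, total_epochs, lr_start=0.01, lr_end=0.001):
    """Exponentially decay the learning rate from lr_start down to lr_end."""
    fraction = epoch / (total_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    return lr_start * (lr_end / lr_start) ** fraction

rates = [lr_schedule(e, 50) for e in range(50)]
# decays smoothly from 0.01 at epoch 0 to 0.001 at the final epoch
```

With this shape, a tenfold change in the range simply scales every per-epoch rate by ten, matching the way the two table rows are compared.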

Task 2: Does the person wear the mask correctly?
With the experience from task 1, we initially set the learning rate range down to 0.0001 to 0.00001, which quickly produced an accuracy of 99.0%, a probable sign of overfitting. We then adjusted the model to fit the dataset more smoothly. In the subsequent experiments with the learning rate ranging from 0.001 to 0.0001, the accuracies were 73.46%, 88.34%, and 84.78% at 25, 30, and 50 epochs respectively. Although these outcomes already meet expectations, even better results appear with the learning rate ranging from 0.01 to 0.001, as the third line below shows. However, at 50 epochs the resulting accuracy of 99.83%, which far exceeds expectations and comes genuinely close to a total non-error model, is likely another case of overfitting. We therefore exclude that value and take 93.5% as the highest accuracy. Overall, as in task 1, the model achieves better test accuracy whenever the learning rate range is scaled up; likewise, accuracy generally rises with the number of epochs, the other factor that directly affects the outcome.

Discussion

Limitations
On the one hand, we find that MobileNet's depthwise separable convolutions and its two multiplier hyperparameters genuinely reduce resource consumption, in both training time and parameter count. On the other hand, we must admit that they also cost some accuracy.
Although the research group progressively improved the stability of the algorithm from v1 to v3, being both faster and more accurate remains a dilemma. In this experiment the reduction in time cost is not that clear, and several hidden concerns remain. Even though the test results approach or exceed 90%, indicating excellent performance, realistic situations can be much more complex, with larger sample sizes and stricter response-time requirements. When considering the feasibility of putting the detection model into production, every factor matters, from the algorithm itself to implementation details such as the hardware required, the maintenance needed, and the ever-present demand for optimization. All of these elements may erode the ideal results we obtained. Moreover, the model needs repeated training to reach stable behavior.

Future studies
Researchers continue to work on improving MobileNet's performance, even by small steps, while keeping its decisive advantage in computational cost. Additionally, MobileNet will hopefully be adapted for more computer vision applications, especially on compact or portable devices.
Combined with the need for face mask detection, MobileNet is technologically capable of further deployment on all kinds of mobile or embedded machines. Models trained with this convolutional neural network approach are expected to perform correctly and consistently. Going further, MobileNet could be used for more ambitious tasks, such as recognizing identities even when a face mask is worn, which may help pick out suspects from a crowd. Many more functions remain to be achieved or improved, and the general trend is the pursuit of ever-higher accuracy.

Conclusion
After reviewing MobileNet V1 and running the full set of experiments on it, we can confidently state that MobileNet is well suited to face mask detection.
Although the architecture still has a long way to go before mass production, its capability is already apparent even at this preliminary planning stage. The introduction and experiments centered on MobileNet show its potential for deployment on access gates, checkpoint machines, and almost any device with limited capacity.