An elevator passenger behavior recognition method based on two-stream convolutional neural network

In order to reduce elevator accidents caused by passengers' incorrect use, a method for elevator passenger dangerous behavior recognition based on a two-stream convolutional neural network is proposed after analysing existing elevator passenger behavior recognition methods. A two-stream convolutional neural network model framework and its parameters for elevator passenger behavior recognition are also given. Herein, the appearance features of human behavior are extracted from the spatial domain, and the motion features of human behavior are extracted from the temporal domain. Finally, the softmax outputs of the two streams are merged by linear weighting to realize human behavior recognition. The model is trained and evaluated on a video dataset of elevator passenger behavior. The results show that this method can identify the unsafe behavior of elevator passengers, with an average recognition rate of 96.87%.


Introduction
Elevators are among the facilities most closely tied to public daily life. On March 15, 2021, the State Administration for Market Regulation (SAMR) of China announced the national safety situation of special equipment in 2020. The data show that by the end of 2020, the number of elevators in China had reached 7.8655 million [1]. With the increase in the number of elevators, the relationship between elevator safety and public safety has attracted wide concern. Especially when an elevator accident occurs in a crowded public place, it can easily cause many casualties [2]. According to statistics [1], unsafe passenger behavior is one of the key factors causing elevator accidents, for example, the incorrect self-rescue actions passengers take when trapped in an elevator [3]. Such accidents can be reduced by using video monitoring technology to observe passenger behavior in the elevator car, and many cameras are accordingly installed in elevator cars. However, this traditional monitoring method cannot automatically identify abnormal events in the elevator [4]. Therefore, how to automatically identify and raise alarms about passengers' unsafe behavior from elevator surveillance video has become a concern of researchers.
Several methods of abnormal behavior detection in elevator cars have been proposed in recent years. An abnormal behavior detection model based on corner kinetic energy was proposed in [5], designed for robbery and violent behavior in the elevator car. According to the number of elevator passengers, Zhu et al. [6][7] proposed three models and algorithms to detect the falling or crouching behavior of a single person, violent behavior between two people, and group panic behavior, respectively. A two-dimensional pose estimation method based on Part Affinity Fields (PAFs) was proposed in [8], which can detect the behavior of forcing the door open. However, these methods depend on hand-crafted features, so it is difficult for them to automatically extract deep information from the image. Moreover, each algorithm in these methods can only detect one kind of unsafe behavior.
Deep learning technology shows excellent performance in automatic feature extraction for image classification and detection [9][10]. Behavior recognition in video can be regarded as an image classification problem that varies with time. Therefore, deep learning methods from image recognition have been widely applied to behavior recognition in video [11], among which the convolutional neural network (CNN) is the most widely used. AlexNet [12], GoogLeNet [13], VGGNet [14] and other classic CNN architectures have not only made breakthroughs in image processing but also achieved remarkable results in video processing. A CNN was applied to driver behavior recognition in [15], realizing the recognition of 10 types of driving behavior with an average recognition rate of 97.13%. However, deep learning technology has not yet been applied to elevator passenger behavior recognition.
Herein, an elevator passenger behavior recognition model based on a two-stream CNN is proposed, and the model is trained and evaluated on a video dataset of elevator passenger behavior. This work helps prevent elevator accidents caused by passengers' unsafe behavior and supports intelligent video monitoring of elevators.

Convolutional neural network
A convolutional neural network (CNN) is mainly composed of convolution layers, pooling layers and dense layers. It imitates the process of image processing and recognition in the visual cortex, so it has a wide range of applications in the field of image recognition. A CNN automatically extracts image features through convolution and pooling operations, and combines feature extraction and classification output into a whole, so as to obtain higher recognition efficiency and better performance. Figure 1 shows a simple convolutional neural network. It consists of two convolution layers, two pooling layers and two dense layers. The convolution layers and pooling layers are the key to automatic feature extraction.

Convolution layer.
The convolution layer uses several different convolution kernels to convolve its input (Figure 2), adds a bias, and then obtains several feature maps through an activation function. The calculation process of the convolution layer can be expressed as follows:
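The equation referenced here is missing from this version of the text. A standard formulation consistent with the description above (convolve each input map with a kernel, sum, add a bias, and apply the activation function) is the following; the notation is assumed, not taken from the original:

```latex
x_j^{l} = f\Big( \sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l} \Big)
```

where $x_j^{l}$ is the $j$-th feature map of layer $l$, $M_j$ is the set of input maps connected to it, $k_{ij}^{l}$ is the convolution kernel, $b_j^{l}$ is the bias, $*$ denotes convolution, and $f(\cdot)$ is the activation function.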

Pooling layer.
The pooling layer, also known as the subsampling layer, samples the feature maps of the previous layer to achieve dimensionality reduction and thereby reduce the computation of the model. The pooling operation is a special convolution operation. Common pooling methods are mean-pooling and max-pooling [16], as shown in Figure 3. Mean-pooling keeps the background information as much as possible by averaging the pixel values in the neighborhood. Max-pooling takes the maximum pixel value in the neighborhood, which better retains the texture information of the image. What we need is the information of passenger behavior in the elevator, not the background information of the car. Therefore, the pooling layers of the model constructed in this work use max-pooling.
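As a minimal NumPy sketch of the two pooling methods described above (the 4 × 4 feature map is a hypothetical example, not data from our model):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over a 2-D feature map (stride = window size)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    # Group pixels into (size x size) windows, then reduce each window
    windows = x[:h * size, :w * size].reshape(h, size, w, size)
    if mode == "max":
        return windows.max(axis=(1, 3))   # keeps texture (strongest response)
    return windows.mean(axis=(1, 3))      # keeps background (average response)

fmap = np.array([[1, 3, 2, 0],
                 [5, 6, 1, 2],
                 [4, 2, 9, 8],
                 [3, 1, 7, 6]], dtype=float)

print(pool2d(fmap, mode="max"))   # [[6. 2.] [4. 9.]]
print(pool2d(fmap, mode="mean"))  # [[3.75 1.25] [2.5 7.5]]
```

Each 2 × 2 neighborhood collapses to one value, halving each spatial dimension.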

Dense layer.
The dense layer is also known as the fully connected layer. Its function is to map the features extracted by the network to the label space, so that the output layer can conveniently output the recognition results. The dense layer transforms several two-dimensional feature maps into a one-dimensional vector, which reduces the influence of the position of a feature value on the classification result. To solve multi-classification problems, the dense layer is usually used together with a softmax layer: the size of each element in the vector produced by the dense layer is unbounded, and the softmax layer normalizes it to obtain the probability that the sample is identified as each category.
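The softmax normalization described above can be sketched in a few lines of NumPy; the logits here are hypothetical values standing in for a dense-layer output:

```python
import numpy as np

def softmax(z):
    """Normalize unbounded dense-layer outputs into class probabilities."""
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical dense-layer output
probs = softmax(logits)
print(probs.round(3))  # [0.659 0.242 0.099]
```

The outputs are non-negative and sum to 1, so each entry can be read as the probability of the corresponding class.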

Two-stream convolutional neural network
In order to make better use of the information contained in the temporal component of video, Simonyan et al. [16] proposed the two-stream convolutional neural network framework in 2014. A two-stream convolutional neural network consists of a spatial stream CNN and a temporal stream CNN, as shown in Figure 4. The spatial stream CNN extracts action appearance information in the spatial domain from a single video frame to identify actions. The temporal stream CNN obtains the motion information of human behavior by calculating the optical flow between consecutive video frames, and then realizes behavior recognition. Finally, the predicted scores of the two streams are combined by linear weighting.
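The linear weighted fusion of the two streams' softmax scores can be sketched as follows; the class scores and fusion weights are illustrative assumptions, since the text does not state the values used:

```python
import numpy as np

# Hypothetical softmax scores for 3 classes from each stream
p_spatial  = np.array([0.70, 0.20, 0.10])
p_temporal = np.array([0.30, 0.60, 0.10])

# Linear weighted fusion of the two streams' predictions
# (weights are illustrative; they must sum to 1 to keep a valid distribution)
w_spatial, w_temporal = 0.4, 0.6
p_fused = w_spatial * p_spatial + w_temporal * p_temporal
predicted_class = int(p_fused.argmax())
print(p_fused, predicted_class)
```

Because each stream's scores sum to 1 and the weights sum to 1, the fused scores remain a probability distribution over the behavior classes.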

Elevator passenger behavior recognition model
In order to make full use of the appearance information and motion information in elevator monitoring video, a two-stream convolutional neural network model for elevator passenger behavior recognition is proposed. The architecture of this model, shown in Figure 5, corresponds to the original two-stream convolutional network architecture of [16]. The difference is that the original two-stream model uses the same parameters for both stream networks, whereas the parameters of our model are based on the VGG-M-2048 model [17] and the Flow_Net model [18], as shown in Table 1. In particular, the activation function of the convolution layers in this model is the sigmoid function, and the loss function is the cross-entropy loss function.
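As a sketch of the loss named above, the cross-entropy for a single sample with a one-hot label reduces to the negative log-probability assigned to the true class; the probabilities below are hypothetical, not outputs of our model:

```python
import numpy as np

def cross_entropy(pred, label):
    """Cross-entropy loss for one sample with a one-hot target:
    -log of the probability assigned to the true class."""
    return -np.log(pred[label] + 1e-12)  # epsilon guards against log(0)

# Hypothetical softmax output over 9 behavior classes; true class is 0
pred = np.array([0.90, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
print(cross_entropy(pred, 0))  # ~0.105, small because the model is confident
```

The loss approaches 0 as the predicted probability of the true class approaches 1, matching the convergence behavior reported in Figure 9.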

Unsafe behavior of elevator passengers
According to our statistics, 88.18% of elevator accidents in China from 2002 to 2019 were related to people's unsafe behavior, as shown in Figure 6. In addition, when the elevator fails, 78.92% of the accidents were related to people's unsafe behavior. Figure 7 shows that 61.64% of these accidents were related to elevator passengers' unsafe behavior. Therefore, we summarize the unsafe behaviors of elevator passengers involved in these accidents, as shown in Table 2. The behaviors listed in Table 2 can lead to elevator failures or accidents. In order to prevent elevator accidents caused by passengers' unsafe behavior, the model proposed in Section 3.1 is utilized to automatically identify passengers' unsafe behavior in the elevator car.

Experimental dataset
There is no standard elevator passenger behavior video dataset, because elevator monitoring video involves the privacy of passengers. Therefore, we recruited 10 volunteers to imitate various passenger behaviors in an elevator and collected these videos for the experiment. Figure 8 shows example images of various behaviors in the elevator.

Preprocessing of dataset
Firstly, each video is divided into single-frame RGB images by extracting one frame per 10 frames, yielding 18000 single-frame RGB images with a size of 1280 × 720. However, an image that is too large is not conducive to network training. In order to fit the input of the network, we resize the images to 224 × 224.
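The sampling and resizing steps can be sketched as follows; the zero-filled "video" array is a placeholder for decoded frames, and the nearest-neighbour resize is a simple stand-in for whatever interpolation an image library such as OpenCV would provide:

```python
import numpy as np

def nearest_resize(img, out_h, out_w):
    """Nearest-neighbour resize via index lookup (a stand-in for
    library-provided interpolation)."""
    rows = (np.arange(out_h) * img.shape[0] / out_h).astype(int)
    cols = (np.arange(out_w) * img.shape[1] / out_w).astype(int)
    return img[rows][:, cols]

# Placeholder "video": 100 frames of 720 x 1280 RGB
video = np.zeros((100, 720, 1280, 3), dtype=np.uint8)

frames = video[::10]                                        # one frame per 10
frames = np.stack([nearest_resize(f, 224, 224) for f in frames])
print(frames.shape)  # (10, 224, 224, 3)
```

Every tenth frame is kept and each kept frame is shrunk to the 224 × 224 network input size.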
The input of the temporal stream CNN is optical flow images, so the Lucas-Kanade algorithm [19] is used to calculate the optical flow. In this work, the Lucas-Kanade algorithm is implemented with OpenCV. At the same time, each optical flow image accumulates the optical flow changes of 10 frames, so that the temporal stream receives the same amount of data as the spatial stream CNN.
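The 10-frame stacking can be sketched as follows, with random arrays standing in for the per-frame flow fields (computing the flow itself would require the OpenCV implementation mentioned above); each dense flow field contributes a horizontal and a vertical displacement channel, so 10 frames yield a 20-channel input:

```python
import numpy as np

H, W, L = 224, 224, 10   # frame size and number of stacked flow frames

# Stand-ins for L consecutive dense optical flow fields; each has a
# horizontal (u) and vertical (v) displacement channel per pixel
flows = [np.random.randn(H, W, 2).astype(np.float32) for _ in range(L)]

# Stack the u/v channels of all L frames into one 2L-channel tensor,
# the form the temporal stream CNN consumes
flow_input = np.concatenate(flows, axis=-1)
print(flow_input.shape)  # (224, 224, 20)
```

One such stacked tensor carries the motion over 10 frames, matching one RGB frame fed to the spatial stream.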

Results and discussion
The elevator passenger behavior recognition model is trained and tested on the elevator passenger behavior dataset using Python 3.8 in Anaconda 3. In this work, 80% of the RGB images and optical flow images are used for model training and validation, and the remaining 20% are used for testing. During training, 20% of the training data is used as the validation set to check whether the model is overfitting. The learning rate is 0.8, the batch size is 32, and the number of epochs is 100. The convergence curve of the loss function is shown in Figure 9, and the accuracy curve is shown in Figure 10. Figure 9 shows that the loss curve decreases rapidly at the beginning of training, which indicates that the learning rate we set is appropriate and that gradient descent is proceeding. As training continues, the loss curve levels off (the loss gradually approaches 0), which indicates that the model converges stably during training. Figure 10 shows that as the number of epochs increases, the accuracy of the model on the training set and validation set gradually increases and levels off (the accuracy gradually approaches 1), indicating that the generalization ability of our model is good.
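The 80/20 test split with a further 20% validation split can be sketched at the index level as follows (file handling is omitted; the random seed is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 18000                              # total single-frame RGB images
idx = rng.permutation(n)               # shuffle before splitting

n_test = int(0.2 * n)                  # 20% held out for testing
test_idx, trainval_idx = idx[:n_test], idx[n_test:]

n_val = int(0.2 * len(trainval_idx))   # 20% of the remainder for validation
val_idx, train_idx = trainval_idx[:n_val], trainval_idx[n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 11520 2880 3600
```

The three index sets are disjoint, so no image appears in more than one partition.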
When the epoch count is below 40, the loss on the training set is greater than that on the validation set (Figure 9), and the accuracy on the training set is lower than that on the validation set (Figure 10). This shows that the model is underfitting and needs more training epochs. As the number of epochs increases, the loss decreases and the accuracy increases, showing that the model fits well and there is no overfitting.
We tested the model on the test set; the confusion matrix of the test results is shown in Figure 11. The results show that the elevator passenger behavior recognition model can effectively identify the behavior of elevator passengers, with an accuracy of 97.22%. From the confusion matrix, we can calculate the precision, recall and F1 value of each behavior class, and the macro precision, macro recall and macro F1 value over all samples on the test set, as shown in Table 3. Table 3 shows that the recognition rate of Class-1 (one person taking the elevator normally) is the highest, reaching 98.74%, while Class-9 (blocking the closing of the elevator door) has the lowest recognition rate at 93.67%. Class-9 is mainly misidentified as Class-1 and Class-8 (kicking the elevator door), with error recognition rates of 3.62% and 1.81%, respectively (Figure 11). After checking the video dataset, we believe the model's recognition errors have two causes. On the one hand, in some videos, blocking the closing of the elevator door cannot be captured by the camera because of where the passengers stand in the car, so the model cannot detect this behavior. On the other hand, some passengers block the closing of the elevator door with their feet, which resembles kicking the elevator door and is misjudged by the model.
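The per-class and macro metrics can be computed directly from a confusion matrix as follows; the 3 × 3 matrix here is a hypothetical stand-in for the paper's 9 × 9 matrix in Figure 11:

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall and F1 from a confusion matrix
    (rows = true class, columns = predicted class)."""
    tp = np.diag(cm).astype(float)       # correct predictions per class
    precision = tp / cm.sum(axis=0)      # tp / all predicted-as-class
    recall = tp / cm.sum(axis=1)         # tp / all truly-in-class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical 3-class confusion matrix
cm = np.array([[50,  2,  0],
               [ 3, 45,  4],
               [ 1,  1, 44]])

p, r, f1 = per_class_metrics(cm)
print("macro precision:", p.mean())   # mean of per-class precisions
print("macro recall:   ", r.mean())
print("macro F1:       ", f1.mean())
```

The macro variants simply average the per-class values, weighting each behavior class equally regardless of how many test samples it has.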
For the first situation, there are two solutions. The first is to avoid monitoring blind spots by using two cameras, but this increases the cost of monitoring. The second is to move the camera to the middle of the top of the car. For the second situation, the motion frequency of kicking the elevator door is significantly higher than that of blocking the closing of the door with the foot. Therefore, these two types of behavior can be effectively distinguished by adjusting the number of video frames used to calculate the optical flow.

Conclusions
Elevator accidents are closely related to the unsafe behavior of passengers when the elevator fails. In this work, we summarize the unsafe behaviors of elevator passengers, collect an elevator passenger behavior video dataset, and propose an elevator passenger behavior recognition model based on a two-stream convolutional neural network. The model is trained and evaluated on the collected elevator passenger behavior video dataset. The results show that the model can effectively identify unsafe behaviors of elevator passengers, with an average recognition rate of 96.87%. The accuracy of the model reached 97.22% on the test set.

Conflicts of interest
There are no conflicts to declare.