Falling Behavior Detection System for Elevator Passengers Based on Deep Learning and Edge Computing

Due to factors such as dense crowds and narrow viewing angles, previous deep-learning models for detecting passenger behavior in elevators often lack effectiveness. Traditional cloud-based data transmission methods suffer from high latency, high resource usage, and privacy threats, particularly during periods of heavy usage. To address these issues, we propose a falling behavior detection system for elevator passengers based on deep learning and edge computing. A two-stream neural network model improved by 3D ResNet is presented, which runs on an edge device for elevator passenger fall detection. A self-collected dataset of elevator passenger falling behavior is used to train and evaluate the system. The results demonstrate that the system effectively detects passengers' falling behavior in elevators, with an average accuracy of 89.2%. The system's feasibility was also verified in a real elevator, where it performed well. The application of this system in this field holds significant research value.


Introduction
In recent years, the issue of passengers falling inside elevators has garnered widespread attention due to the risks it poses to public safety and well-being. Elevators are an integral and practically indispensable part of modern infrastructure. However, their confined space and dynamic motion environment make passengers susceptible to falls and related injuries. Passenger falls inside elevators can be attributed to various factors, including sudden changes in elevator speed, abrupt stops and starts, mechanical malfunctions, and passenger-related issues. Despite the presence of safety features such as handrails and emergency stop buttons, real-time detection and prevention of falls remain a challenge. This is particularly critical when passengers are alone or unable to seek assistance promptly.
Our research pertains to human action recognition in deep learning. This is a long-standing area of computer vision in which many advanced techniques have been developed. Wang et al. [1] proposed TSN, a novel framework that combines a sparse temporal sampling strategy with video-level supervision. The authors of [2] proposed an LSTM-based action recognition model that achieved state-of-the-art results on datasets such as UCF-101 and Sports-1M. Later algorithms such as SlowFast [3] and the video transformer [4] also achieved strong performance at the time and remain in use today.
Currently, significant research has been conducted worldwide on recognizing passenger behavior inside elevator cabins. Liu et al. [5] introduced a falling detection algorithm based on machine vision and multi-feature fusion. However, the recognition performance of their method needs improvement, particularly in cases of occlusion. Shi et al. [6] used a deep-learning CNN model to recognize abnormal behavior in elevator cabins. However, the model they employed is relatively simple and does not perform well in complex elevator environments. Lan et al. [7] achieved good results with a framework based on a two-stream convolutional neural network for recognizing the dangerous behaviors of elevator passengers. However, these traditional cloud-based methods have limitations, including high network resource consumption, high latency, and privacy concerns.
Edge computing is a computing paradigm that performs computation, storage, networking, and application processing at edge nodes close to the data source. Unlike traditional cloud computing, edge computing migrates data processing and computation to devices nearer the data source. As shown in Figure 1, edge computing places computing nodes in a fine-grained mesh closer to the end devices, making it a feasible approach to meeting the high-computation and low-latency requirements of deep learning on edge devices [8]. Edge computing offers several advantages, including fast response, strong privacy and security, reduced transmission costs, and greater flexibility and scalability. In this paper, a falling detection system for elevator passengers based on deep learning and edge computing is proposed. We make the following contributions:
- In terms of the falling detection algorithm, we utilize a two-stream network, improved by 3D ResNet, to detect passenger behaviors. It achieves 89.2% recognition accuracy on our dataset, which is competitive with other classical methods.
- In terms of edge computing, we innovatively apply it to the recognition of passenger behavior inside elevators. This approach overcomes the high latency of traditional methods, enabling nearly "zero-latency" real-time processing, and it effectively protects passengers' privacy. We have deployed the system in an elevator, demonstrating its feasibility and effectiveness.
- We also captured our own video dataset of passenger falling behavior in elevators, laying the groundwork for further research in this direction.

Detection model of falling behavior
The falling detection model for elevator passengers employs a two-stream network enhanced by 3D ResNet. Within this model, the spatial stream uses 3D ResNet to identify actions within video frames, while the temporal stream analyzes optical flow images with a conventional convolutional neural network to capture the temporal information between consecutive frames. Finally, the softmax outputs of the two streams are fused.

Two-stream network
Simonyan and Zisserman [9] first proposed this model, which decomposes video into spatial and temporal components. The spatial stream is represented by single video frames, which carry information about the objects and scenes in the video, while the temporal stream is represented by optical flow images that capture the motion between frames.
In the temporal stream, the input volume of the convolutional network is constructed as in Equation (1) and Equation (2):

$I_\tau(u, v, 2k-1) = d^{x}_{\tau+k-1}(u, v)$,  (1)

$I_\tau(u, v, 2k) = d^{y}_{\tau+k-1}(u, v)$,  (2)

for $k = 1, \dots, L$, where $L$ is the number of stacked flow frames, $d_t(u, v)$ represents the displacement vector of point $(u, v)$ in frame $t$, and $d^{x}_{t}$ and $d^{y}_{t}$ can be regarded as horizontal and vertical image channels, respectively.
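As an illustrative sketch (not the authors' code), the channel stacking in Equations (1) and (2) can be expressed in a few lines of NumPy; here random arrays stand in for the horizontal and vertical displacement fields, which in practice would come from an optical flow algorithm:

```python
import numpy as np

def stack_flow(dx, dy):
    """Interleave L horizontal/vertical displacement fields into a
    2L-channel input volume: channel 2k-1 <- dx_k, channel 2k <- dy_k."""
    L, H, W = dx.shape
    vol = np.empty((2 * L, H, W), dtype=np.float32)
    vol[0::2] = dx  # odd channels (1-based): horizontal flow d^x
    vol[1::2] = dy  # even channels (1-based): vertical flow d^y
    return vol

# Toy example: L = 10 flow fields of size 112x112
rng = np.random.default_rng(0)
dx = rng.standard_normal((10, 112, 112)).astype(np.float32)
dy = rng.standard_normal((10, 112, 112)).astype(np.float32)
vol = stack_flow(dx, dy)
print(vol.shape)  # (20, 112, 112)
```

The resulting 2L-channel volume is what the temporal-stream network consumes as a single input.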

3D ResNet
We used the 3D ResNet-18 architecture [10] in the spatial flow section instead of the convolutional neural network (CNN) described in [9].
To address the issue of vanishing or exploding gradients in deep networks, 3D ResNet introduces residual connections: each block learns the difference between its input and output, and this residual is added back to the output, preventing the loss of information and making the network easier to train. The network takes as input a 16-frame RGB clip with a tensor size of 3 × 16 × 112 × 112; its convolution kernels are of size 3 × 3 × 3, and the temporal stride of conv1 is 1, comparable to that of C3D [11]. When the number of feature maps increases, identity shortcuts with zero padding are used to avoid an increase in the number of parameters.
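The zero-padded shortcut can be illustrated with a minimal NumPy sketch (a toy stand-in, not the actual 3D ResNet layers): when the residual branch doubles the number of feature maps, the identity path is padded with zero channels so the two can still be added without extra parameters:

```python
import numpy as np

def zero_pad_shortcut(x, out_channels):
    """Identity shortcut that zero-pads the channel axis so a residual
    branch with more feature maps can still be added parameter-free."""
    c = x.shape[0]
    pad = out_channels - c
    return np.concatenate([x, np.zeros((pad,) + x.shape[1:], x.dtype)])

def residual_block(x, f):
    """y = F(x) + shortcut(x), padding x when F increases the channels."""
    fx = f(x)
    if fx.shape[0] != x.shape[0]:
        x = zero_pad_shortcut(x, fx.shape[0])
    return fx + x

# Toy "branch" that doubles the channels of a (C, T, H, W) tensor
double = lambda x: np.concatenate([x, x])
x = np.ones((4, 16, 8, 8), dtype=np.float32)
y = residual_block(x, double)
print(y.shape)  # (8, 16, 8, 8)
```

In the real network, `f` would be the block's stack of 3 × 3 × 3 convolutions; the shortcut logic is the same.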

Two-stream fusion
The final detection result is obtained by averaging the processed outputs of the two streams. Figure 2 displays the model's precise design.
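The late-fusion rule amounts to averaging the two streams' softmax scores and taking the argmax; the sketch below shows this in NumPy (the class names and logit values are illustrative, not taken from the trained model):

```python
import numpy as np

CLASSES = ["stand", "fall", "get_up"]

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def fuse(spatial_logits, temporal_logits):
    """Late fusion: average the two streams' softmax scores."""
    p = (softmax(spatial_logits) + softmax(temporal_logits)) / 2
    return CLASSES[int(p.argmax())], p

# Toy logits: both streams lean toward the "fall" class
label, p = fuse(np.array([0.2, 2.0, 0.1]), np.array([0.5, 1.5, 0.3]))
print(label)  # fall
```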

System establishment
The construction of the prototype and the system is shown in Figure 3. A simple prototype was built by mounting an edge device capable of neural network computation and a USB camera on a tripod inside the elevator. The chosen edge device is the NVIDIA Jetson Xavier NX computing platform, which delivers approximately 21 TOPS of computing power. To ensure the efficient operation of the model on the edge device, Int8 quantization was applied to the neural network; this enables smooth operation on the Jetson platform, albeit with a slight reduction in performance.
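The idea behind Int8 quantization can be sketched as symmetric per-tensor quantization in NumPy (an illustration of the principle only; an actual Jetson deployment would use the platform's toolchain, e.g. TensorRT, which also calibrates activations):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Quantize a toy weight tensor and measure the round-trip error
rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.dtype, err <= scale / 2)
```

Storing weights as int8 cuts memory and bandwidth by 4x versus float32, at the cost of at most half a quantization step of error per weight, which is the "slight reduction in performance" observed above.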
After we deployed the system physically inside an elevator, it operates as shown in Figure 4. The camera transfers the captured video to the edge device, which extracts the video frames and inputs them into our model. The model then produces the detection results and transfers them to the cloud server. In this way, an edge computing system for detecting falling behavior in elevator passengers is implemented.
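The edge-side loop can be sketched as a sliding window over incoming frames; only the classification label leaves the device, which is what keeps the raw video private. The names `frame_source`, `model`, and `send_to_cloud` are hypothetical stand-ins for the camera feed, the quantized two-stream model, and the uplink:

```python
from collections import deque

WINDOW = 32  # frames per clip, matching the dataset preparation

def run_edge_loop(frame_source, model, send_to_cloud):
    """Sliding-window inference on the edge device: buffer frames,
    classify each full window, and push only the label to the cloud."""
    buf = deque(maxlen=WINDOW)
    for frame in frame_source:
        buf.append(frame)
        if len(buf) == WINDOW:
            send_to_cloud(model(list(buf)))

# Stub components standing in for the camera, model, and uplink
results = []
run_edge_loop(frame_source=range(64),
              model=lambda clip: "STAND",
              send_to_cloud=results.append)
print(len(results))  # 33 windows from 64 frames
```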

Dataset
The dataset used for training was collected by filming ourselves with a camera mounted on a tripod inside the elevator. The videos were then categorized into three behaviors: standing, falling, and getting up. All videos were recorded when no individuals other than the experimenters were present inside the elevator. Part of the dataset used for training is presented in Figure 5. The dataset encompasses scenarios with both single and multiple individuals using the elevator, which better represents actual elevator environments and enhances the model's generalizability.
When building the dataset from the videos, each video of approximately 2 s was divided into 32 frames and compressed to 224 × 224 pixels. A total of 4544 RGB images and 9052 optical flow images were generated, and the dataset was then split into training, validation, and testing sets at a ratio of 7:2:1.
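One way the 7:2:1 split can be reproduced is a simple seeded shuffle-and-slice (a sketch of the procedure, not the authors' exact script; integer indices stand in for the image files):

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle and split into train/validation/test at the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# 4544 RGB images split 7:2:1 (the remainder goes to the test set)
train, val, test = split_dataset(range(4544))
print(len(train), len(val), len(test))  # 3180 908 456
```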

Results
The training parameters were set as follows: the batch size was 12, the learning rate was 5 × 10⁻⁴, and the total number of epochs was 100. The results of the training process are shown in Figure 6.
During the initial stage of training, the model's loss decreased rapidly. However, as the number of epochs increased, the model's accuracy oscillated, and the final accuracy did not reach a fully satisfactory level. This may be due to the Int8 quantization applied for the edge device, which causes some performance degradation. It suggests that future work could optimize the model to reduce its computational requirements, or switch to a higher-performance edge device. Nevertheless, the model's accuracy remains high, and its ability to run smoothly on edge devices makes it practically valuable.
Additionally, we evaluated the performance of various models on this dataset, and the outcomes are presented in Table 1. The table indicates that, compared with several classical computer-vision neural network models, our two-stream network improved by 3D ResNet demonstrates better performance. The compared models are all classical ones; some newly introduced models with good reported performance have yet to be tested, which could be a direction for future improvement.

Field validation of the system
After training was complete, we conducted a simple validation experiment with the system installed in the elevator. The result of the validation is presented in Figure 4. The camera mounted inside the elevator captures real-time video, which is processed by the edge device running the falling behavior detection model. The model successfully detects the different behaviors and forwards the results to the cloud server, which is programmed to send an alert whenever it receives abnormal behavior data not categorized as STAND. During our experiments, the system recognized our actions quickly, regardless of whether one or multiple individuals performed a range of actions. In the validation, the system detected the three behaviors with accuracies of 92.5%, 86.7%, and 85.4%, respectively.
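The cloud-side alerting rule described above reduces to a one-line check; the sketch below makes it explicit (label strings are the ones used in this paper's validation, the function name is illustrative):

```python
def handle_result(label):
    """Cloud-side rule: raise an alert for any behavior other than STAND."""
    return "ALERT" if label != "STAND" else "OK"

# The three behaviors the model reports
for label in ("STAND", "FALL", "GET_UP"):
    print(label, handle_result(label))
```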
This validation experiment demonstrates the practicality of our system, confirms our overall design, and offers guidelines for the system's further development.

Conclusions
This paper addresses elevator accidents caused by falling passengers. We combined edge computing and deep learning techniques to detect such behavior: a video dataset of elevator passenger falling behavior was created, and a detection system for elevator passenger falls was proposed. The system's effectiveness was demonstrated with an average recognition rate of 89.2% in a simple test. After training, we verified the feasibility of the system in the field, inside an elevator, confirming its practical use. These experiments validate the accuracy and efficiency of our system end to end. Prompt identification of falling behavior can prevent potential accidents and ensure the safety and well-being of elevator users, and the video dataset offers an essential resource for further research and analysis in this field. Overall, the proposed detection system for elevator passenger falling behavior presents significant potential for preventing elevator accidents and contributes meaningfully to relevant research.

Figure 1. Edge computing allows neural networks to operate at the edge and the end.

Figure 2. Model architecture for falling behavior detection.

Figure 3. The construction of the prototype.

Figure 4. The operational pattern of the falling behavior detection system.

Figure 6. Results of training.