Performance evaluation of deep learning techniques for a human activity recognition system

Human Activity Recognition (HAR) is crucial in various applications, such as sports and surveillance. This paper focuses on the performance evaluation of a HAR system using deep learning techniques. Features are extracted using a 3DCNN, and classification is performed using an LSTM. In addition, 3DCNN and RNN, two other well-known classification techniques, are applied in order to compare the effectiveness of the three classifiers. The 3DCNN-LSTM approach contributes the highest overall accuracy of 86.57%, followed by 3DCNN-3DCNN and 3DCNN-RNN with overall accuracies of 86.07% and 79.60%, respectively. Overall, this paper contributes to the field of HAR and provides valuable insights for the development of activity recognition systems.


Introduction
In this modern era, recognition of human activities has become a popular field of research. Human Activity Recognition (HAR) describes the ability of machines to recognize human activity [1]. It involves making predictions by analysing human poses in a scene so that the type of human activity can be automatically classified [2].
Activities can be of different kinds, including running, climbing, playing basketball, playing violin, applauding, or waving a hand. To recognize activities, a model needs to be trained. To do so, the machine must be provided with references, such as pictures or videos, to allow it to be more efficient when analysing future data [3]. Given a video of any human activity, the action can be detected easily, and comprehensive information can be obtained using deep learning [4].

Methodology
The process of human activity recognition is divided into a few parts: database acquisition, pre-processing, feature extraction, and classification. The flow chart of the human activity recognition system is shown in Figure 1.

Database Acquisition
This paper uses an online dataset, the UCF50 Action Recognition Dataset [12], to train and test human activities. UCF50 is made up of 50 action categories with a total of 6676 videos and at least 100 videos per action class. It is an online, public dataset made up of realistic YouTube videos, which distinguishes it from most action recognition datasets that are staged by actors.

Pre-processing
Pre-processing techniques play an important role in deep learning prior to feature extraction. Pre-processing is necessary to improve the quality of the data, as it has a significant impact on the performance of the deep learning algorithm. Low-quality images or videos will affect the results of feature extraction and classification. The pre-processing steps implemented in this paper include frame re-sizing, grayscale conversion, and normalization. Frame re-sizing is the process of re-sizing all the videos' frames to the same size. The dataset first undergoes frame re-sizing to a fixed width and height of [64, 64]. All frames are then converted to grayscale, which stores intensity values in a single channel, to simplify the algorithm and reduce computational requirements. Lastly, normalization is used to transform the features onto a similar scale and so improve the model's performance.
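As an illustration, the three steps above can be sketched with NumPy alone; the nearest-neighbour resize and the luma weights below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def preprocess_frame(frame, size=(64, 64)):
    """Grayscale, resize, and normalize one RGB frame (assumed pipeline)."""
    # Grayscale conversion: weighted sum of the RGB channels (BT.601 luma).
    gray = frame @ np.array([0.299, 0.587, 0.114])
    # Nearest-neighbour resize to the fixed [64, 64] target size.
    h, w = gray.shape
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    resized = gray[rows][:, cols]
    # Normalization: map pixel values onto the common [0, 1] range.
    return resized / 255.0
```

A production pipeline would typically use a library resize (e.g. OpenCV) with interpolation, but the sketch shows the three transformations in order.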

Feature Extraction
3D Convolutional Neural Networks (3DCNN) are widely used for image and video recognition and classification. They are designed to automatically learn spatial hierarchical representations from input data. Therefore, 3DCNN has been chosen as the feature extraction method for human activity recognition. Max pooling will be implemented in this paper, whereby it reports the largest element from each region of the feature map.

Classification
Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network. It is specifically intended to work with sequences of data, since it considers all previous inputs when generating an output. The 3D Convolutional Neural Network (3DCNN) is specifically developed to operate on image data, analyzing images and making predictions on them [13]. Figure 2 shows the architecture of a 3D Convolutional Neural Network (3DCNN). Furthermore, the Recurrent Neural Network (RNN) is a type of Artificial Neural Network that analyzes data sequences. These algorithms process sequences by storing the prior value or state in memory [14]. They are mainly used for sequence classification, especially sentiment classification and video classification. Note that the database is split into 75% and 25% for training and testing, respectively.
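The 3D max-pooling operation used in the feature-extraction step can be sketched in NumPy; `max_pool_3d` is a hypothetical helper operating on a single (depth, height, width) volume whose dimensions are divisible by the pool size:

```python
import numpy as np

def max_pool_3d(volume, pool=(2, 2, 2)):
    """Report the largest element in each non-overlapping 3D window."""
    d, h, w = volume.shape
    pd, ph, pw = pool
    # Split each axis into (blocks, block_size) so every pooling window
    # becomes its own sub-array, then take the maximum over the window axes.
    blocks = volume.reshape(d // pd, pd, h // ph, ph, w // pw, pw)
    return blocks.max(axis=(1, 3, 5))
```

Each (2, 2, 2) window is reduced to its largest element, halving every dimension of the feature map.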

Evaluation of Accuracy
Evaluation of accuracy is vital, as it indicates the performance of a research work. A confusion matrix is used to evaluate the accuracy and the performance of human activity recognition by tabulating a summary of the numbers of correct and incorrect predictions made by a classifier.
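As a sketch, the overall accuracy can be read off a confusion matrix as the diagonal sum divided by the grand total; the 3-class matrix below is hypothetical, not the paper's actual results:

```python
import numpy as np

def overall_accuracy(cm):
    """Correct predictions lie on the diagonal of the confusion matrix."""
    return np.trace(cm) / cm.sum()

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted).
cm = np.array([[36, 2, 4],
               [1, 40, 2],
               [0, 3, 40]])
```

With the paper's totals, 174 correct predictions out of 201 samples give the reported overall accuracy of 86.57%.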

Pre-processing Results
Based on the video dataset, the duration of each video is between 3 and 8 seconds. Prior to pre-processing, each video is converted into 20 frames. Figure 3 displays 6 frames out of the total of 20 for one activity. The frames are displayed to observe the entire action sequence from beginning to end. After the frames have been extracted, the pre-processing steps are performed on the dataset. The frames of the videos are resized to a fixed width and height of [64, 64] to reduce computation and speed up training, since small images contain fewer pixels. Next, the frames are converted to grayscale to reduce computational complexity. Lastly, normalization is applied to each frame after the resizing and grayscale steps. It is a technique that changes the values of a numeric column in the dataset to a common range, between 0 and 1. Normalization helps to ensure the input values are in a suitable range for neural networks to learn effectively.
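The exact frame-sampling strategy is not stated; a common approach, sketched below under that assumption, is to take evenly spaced frames across the clip (`sample_frame_indices` is a hypothetical helper):

```python
def sample_frame_indices(total_frames, n_frames=20):
    """Pick n_frames evenly spaced frame indices covering the whole clip."""
    # Stride between sampled frames; at least 1 for very short clips.
    step = max(total_frames // n_frames, 1)
    return [i * step for i in range(n_frames)]
```

For a 5-second clip at 30 fps (150 frames), this yields 20 indices spread from the first frame to near the end, so the sampled frames span the entire action.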

Feature Extraction Results
In this paper, the feature extraction approach implemented for human activity recognition is the 3DCNN. The architecture consists of several convolutional layers and max pooling layers. The first layer has 16 filters with a kernel size of (3, 3, 3) and employs the ReLU activation function. The following convolutional layers have 32 and 64 filters respectively, maintaining the same kernel size and activation function. A 3D max pooling layer with a pooling window size of (2, 2, 2) is added after each convolutional layer to down-sample the output. This architecture enables the extraction of relevant features from the input data, making the data useful for the subsequent classification tasks. The feature extraction results are all in matrix form.
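A minimal sketch of this feature-extraction backbone, assuming TensorFlow/Keras: the input shape of 20 grayscale 64x64 frames follows the pre-processing above, while the 'same' padding is an assumption not stated in the paper:

```python
from tensorflow.keras import layers, models

def build_3dcnn_extractor(input_shape=(20, 64, 64, 1)):
    """Three Conv3D blocks (16, 32, 64 filters), each followed by pooling."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (16, 32, 64):
        # (3, 3, 3) kernel with ReLU activation, as described in the paper.
        model.add(layers.Conv3D(filters, (3, 3, 3),
                                activation='relu', padding='same'))
        # (2, 2, 2) max pooling down-samples every dimension by half.
        model.add(layers.MaxPooling3D(pool_size=(2, 2, 2)))
    return model
```

After three pooling stages, a (20, 64, 64, 1) clip is reduced to a (2, 8, 8, 64) feature volume, which can then be flattened or fed to the downstream classifier.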

Classification Results
In this section, the classification results are shown using the confusion matrix method. The confusion matrices of the 3DCNN-LSTM, 3DCNN-3DCNN, and 3DCNN-RNN models are illustrated in Figures 4(a), 4(b), and 4(c) below. In these matrices, the rows indicate the actual classes and the columns represent the predicted classes. Each cell contains the count of samples that fall into a particular class combination. For instance, the number of samples from true class A (Baseball Pitch) that were accurately classified as class A (Baseball Pitch) appears in the cell at row 1, column 1. Similarly, the cell at row 2, column 3 indicates the number of class B (Basketball) samples that were misclassified as class C (Horse Riding). The samples that were successfully classified are represented by the diagonal elements of the matrix.

In Figure 4(a), 36 out of 42 samples were correctly classified as the activity Baseball Pitch, 2 samples were misclassified as Basketball, 2 samples as Playing Violin, and 2 samples as Tai Chi, while no Baseball Pitch samples were misclassified as Horse Riding, Skate Boarding, etc. In total, 174 out of 201 samples were correctly predicted, so the overall accuracy of the classification model is 86.57%. Next, the confusion matrix of the 3DCNN is shown in Figure 4(b), where 173 out of 201 samples were correctly predicted, giving an overall accuracy of 86.07%. Furthermore, the confusion matrix depicted in Figure 4(c) gives an overview of how the RNN model performed: 160 out of 201 samples were correctly predicted, which corresponds to 79.60%.

The bar graph in Figure 5 illustrates the per-class accuracy performance of the three models, namely LSTM, 3DCNN, and RNN. Playing Violin achieved the highest accuracy among all the activities, followed by Tai Chi. This shows that Playing Violin and Tai Chi stand out in terms of recognition performance compared to the other activities in the dataset. This is because the static and repetitive motion of these activities makes them easier to recognize and classify than activities with more dynamic and varied motions.
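The per-class accuracies compared in Figure 5 can be recovered from a confusion matrix by dividing each diagonal entry by its row sum; the 2-class matrix below is hypothetical:

```python
import numpy as np

def per_class_accuracy(cm):
    """Rows are actual classes; each diagonal entry over its row sum
    gives that class's recognition accuracy (its recall)."""
    return np.diag(cm) / cm.sum(axis=1)

# Hypothetical 2-class confusion matrix (rows = actual, columns = predicted).
cm = np.array([[36, 6],
               [4, 40]])
```

This breakdown makes it easy to see which activities a model recognizes reliably, independent of the class sample sizes.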

Conclusion and Future Work
In conclusion, this study investigated the effectiveness of 3DCNN-LSTM, 3DCNN-3DCNN, and 3DCNN-RNN models for human activity recognition. The goal was to analyze the accuracy of a human activity recognition system while investigating the differences in classification performance between the LSTM, 3DCNN, and RNN methods. Extensive experimentation showed that the 3DCNN-LSTM model outperformed both the 3DCNN and RNN models, achieving the highest accuracy in human activity classification. This is because 3DCNNs excel at image data while LSTM networks excel at sequence data; combining them brings together the strengths of both models, enabling the effective resolution of complex computer vision challenges such as video classification. Additionally, several aspects may influence the performance of the models, such as the pre-processing methods, the model architecture, and others. Future studies could therefore focus on fine-tuning strategies to further enhance the performance of each model. It is also suggested to investigate hybrid models that integrate the strengths of the LSTM, 3DCNN, and RNN models. All in all, this paper highlights the 3DCNN-LSTM model's superior performance in human activity recognition over the 3DCNN and RNN models. These findings have significant implications for real-world applications, such as eliminating the need for manpower to manually operate cameras, follow athletes, and record an entire running event: the system can autonomously capture and analyze athletes' activities, providing a cost-effective way to monitor sports events.

Figure 3. Frames for Basketball Activity

Figure 5. Comparisons of Accuracy Performances

Figure 6. Average of True Positive

The 3D Convolutional Neural Network (3DCNN) is used for feature extraction, whereas Long Short-Term Memory (LSTM) is used for classification. In addition, two other classification methods, 3DCNN and Recurrent Neural Network (RNN), were implemented to compare their performance and accuracy with the LSTM method.