Lightweight Human Motion Recognition Method with Multiscale Temporal Features

To address the large size of deep-learning-based human motion recognition models and the low recognition accuracy caused by insufficient mining of data features, a lightweight human motion recognition algorithm based on multi-scale temporal features is proposed. The algorithm automatically extracts features through a multiscale feature fusion model, after which the fused features are modeled by an improved temporal convolutional network (TCN). In the TCN structure, depthwise separable convolution is used instead of the normal convolutional layers to reduce computational complexity, and the Leaky ReLU activation function is used instead of ReLU to improve training efficiency. Experiments are conducted on the public WISDM dataset, achieving fast real-time recognition of actions; structural and parametric optimization is performed through experiments to effectively improve accuracy, with a final accuracy rate of 99.06%. In comparison with other methods, this method reduces model size while maintaining a high accuracy rate.


Introduction
With the continuous progress of science and technology, the improvement of people's quality of life, and growing attention to human health monitoring, human activity recognition (HAR) has become a very important research direction, with the goal of having machines carry out action recognition in place of humans. There are two main types of mainstream recognition methods: video-based and sensor-based. Video-based methods are affected by conditions such as lighting and obstacles and intrude to some extent on personal privacy. The inertial sensor-based approach [1], in which the whole acquisition and data transmission process can be automated, is therefore also a very important research direction. Inertial sensors are found in smartphones and a range of wearable devices [2]. The main principle is to use small sensors such as magnetometers and accelerometers in wearable devices to obtain acceleration and angular velocity information for different parts of the human body, and then to infer and predict human activities through intelligent algorithms such as deep learning [3], combined with the characteristics of human motion.
At the current stage, deep learning achieves automatic feature extraction through end-to-end neural networks, reducing time-consuming and laborious manual feature engineering. Deep learning methods are widely used in HAR for their higher efficiency and classification accuracy. For example, Mekruksavanich et al. [4] proposed enhancing a recurrent neural network by combining an attention mechanism with bidirectional gated recurrent units, effectively recognizing complex actions and achieving a 95.83% recognition rate on the PAMAP2 dataset. Zhang et al. [5] introduced a multi-headed convolutional neural network with an attention mechanism for feature extraction; in their experiments, the final classification accuracy reached 96.50%. Lu et al. [6] proposed a multichannel fusion model based on the idea of division: the structure uses a multichannel convolutional neural network to fuse features, with feature labeling and enhanced feature representation by gated recurrent units (GRU) and global average pooling (GAP); the model obtained 96.41% accuracy on the dataset. However, that network structure is relatively complex.
Current deep learning-based recognition methods do not fully explore feature information at different scales and often cannot guarantee low model complexity; conversely, if an algorithm is designed only to reduce the computational complexity of the recognition process, performance may be lost, resulting in relatively low recognition accuracy. In this paper, we propose a lightweight human motion recognition method with multiscale temporal features (MTL-HAR), drawing on ideas from the Inception [7] network structure and combining them with the characteristics of the TCN [8] network to construct a multi-scale feature fusion model. The computational cost of the network is reduced through structural optimization. The model effectively improves recognition accuracy in human motion recognition tasks, reaching 99.06%.

Action Pattern Recognition Method Based on MTL-HAR
The neural network structure of the MTL-HAR algorithm is shown in Figure 1, and the main structure is divided into five parts: data acquisition, data pre-processing, multiscale temporal data extraction part, lightweight TCN module, and output layer.

Data Pre-Processing
In the process of collecting action data using sensors, slight sensor shifts may generate noise. To address this problem, the algorithm uses data normalization [9] to constrain the range of variation of the data. A visualization of the triaxial accelerometer data for human walking after normalization is shown in Figure 2. Normalization retains the original features while reducing the overall magnitude, making the data easier to transmit and store, improving the efficiency of the model, and making global features easier to extract.
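The paper does not specify which normalization scheme [9] is used; as a minimal sketch, assuming per-axis z-score normalization of the triaxial accelerometer stream:

```python
import numpy as np

def normalize_axes(acc, eps=1e-8):
    """Z-score normalize triaxial accelerometer data per axis.

    acc: array of shape (n_samples, 3) holding x/y/z acceleration.
    Returns data with zero mean and unit variance per axis, which
    reduces the overall magnitude while preserving the signal shape.
    """
    mean = acc.mean(axis=0, keepdims=True)
    std = acc.std(axis=0, keepdims=True)
    return (acc - mean) / (std + eps)

# Example: a simulated accelerometer trace (gravity on the y axis)
rng = np.random.default_rng(0)
raw = rng.normal(loc=[0.0, 9.8, 0.0], scale=[1.0, 2.0, 0.5], size=(200, 3))
norm = normalize_axes(raw)
print(norm.mean(axis=0))  # approximately 0 on each axis
print(norm.std(axis=0))   # approximately 1 on each axis
```

Other schemes such as min-max scaling would serve the same purpose of bounding the data range.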

Multiscale Temporal Data Extraction Part
The structure of the multiscale temporal data extraction part of the MTL-HAR algorithm is shown in Figure 3; it consists of three 1-by-1 reduced-dimensional convolutional units, four convolutional units, and a concatenate layer. The overall structure combines straight-through tandem connections with multi-branch parallel splicing. The idea of multi-scale feature extraction is to sample the temporal features at different granularities, observing them under different receptive fields, and to fuse information across channels after dimensionality reduction so as to grasp the features from a global perspective. Convolutional kernels of different sizes extract features at different scales of the data: smaller kernels extract more detailed feature information, while larger kernels obtain feature information at a higher, more adequate semantic level. The original three-axis acceleration data can be regarded as one-dimensional gridded [10] data obtained by sampling on the time axis, and temporal data [11] at different scales are obtained after multi-scale fusion. As the convolutional layers deepen, the extracted feature information becomes progressively richer, enabling the model to obtain more comprehensive feature information. The kernel sizes of convolution units 1, 2, 3, and 4 are 1, 3, 5, and 7, respectively, each with 16 kernels; reduced-dimensional convolution units 1, 2, and 3 have kernel size 1, with 2, 2, and 4 kernels, respectively; the final output is temporal data of size (100, 11).
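The channel arithmetic above can be checked with a toy NumPy sketch. The exact wiring — convolution unit 1 acting as a shared stem and the raw input being concatenated alongside the reduced branches — is an assumption made here so that the channel counts sum to 11; the weights are random stand-ins and BN is omitted for brevity:

```python
import numpy as np

def conv1d_same(x, n_filters, k, rng):
    """Toy 1-D convolution with 'same' padding (odd k): x is (T, C_in),
    output is (T, n_filters). Random weights stand in for learned ones;
    ReLU is applied, BN omitted for brevity."""
    T, c_in = x.shape
    w = rng.normal(size=(k, c_in, n_filters))
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, n_filters))
    for t in range(T):
        out[t] = np.einsum('kc,kcf->f', xp[t:t + k], w)
    return np.maximum(out, 0.0)

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 3))           # 100 time steps, triaxial input

# Convolution unit 1 (kernel size 1, 16 filters) used as a shared stem --
# one plausible reading of the "straight-through tandem" wiring (assumption).
stem = conv1d_same(x, 16, 1, rng)

# Parallel convolution units 2-4: kernel sizes 3, 5, 7, 16 filters each.
b2 = conv1d_same(stem, 16, 3, rng)
b3 = conv1d_same(stem, 16, 5, rng)
b4 = conv1d_same(stem, 16, 7, rng)

# 1-by-1 reduced-dimensional units shrink the branches to 2, 2, 4 channels.
r2 = conv1d_same(b2, 2, 1, rng)
r3 = conv1d_same(b3, 2, 1, rng)
r4 = conv1d_same(b4, 4, 1, rng)

# Concatenating the raw input with the reduced branches gives
# 3 + 2 + 2 + 4 = 11 channels, matching the stated (100, 11) output.
fused = np.concatenate([x, r2, r3, r4], axis=1)
print(fused.shape)  # (100, 11)
```

'Same' padding keeps the time dimension at 100 in every branch, so only the channel axis grows during concatenation.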
The convolutional units of the multiscale temporal extraction part have the same structure as the reduced-dimensional convolutional units: a one-dimensional convolutional layer, a batch normalization (BN) layer, and a nonlinear activation function (ReLU), with BN and ReLU enabling better mining of action information. The one-dimensional convolution is calculated as shown in Equation 1, where y, h, and x denote the output sequence, the convolutional kernel sequence, and the input sequence, respectively, u denotes the coordinate of an element of the output sequence, and N denotes the length of the input sequence. The BN layer keeps the distribution of the data consistent after each convolution by placing constraints on it; applied together with the nonlinear activation function, it improves the convergence speed of the network [12] and helps curb overfitting.
To reduce the dimensionality of the data before splicing, the MTL-HAR algorithm adds a 1-by-1 reduced-dimensional convolution unit after the outputs of convolution units 2, 3, and 4. After the dimensionality reduction, the channel counts of convolution units 2, 3, and 4 are reduced to 2, 2, and 4, respectively, and the data volume becomes 1/8, 1/8, and 1/4 of the original.

Lightweight TCN Module
TCN, as a relatively new neural network structure, performs well on tasks such as temporal feature prediction. TCN adds causal convolution and dilated convolution to the traditional convolutional neural network and increases the fitting ability of the model through a residual structure and Dropout [12] layers, enabling the TCN to recognize temporal sequence information from part to whole.
The MTL-HAR algorithm exploits the temporal nature of action information: the features fused by the multi-scale module are passed through the TCN layer to learn higher-semantic-level representations and more abstract features. However, TCNs suffer from heavy computation and parameter redundancy when solving temporal feature prediction tasks. This paper proposes an improved lightweight TCN module to solve these problems: it uses depthwise separable convolution [13] instead of traditional convolution to construct a causal dilated depthwise separable convolution layer, and uses the Leaky ReLU [14] activation function in place of ReLU.
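A minimal NumPy sketch of one such layer — a causal dilated depthwise convolution followed by a 1-by-1 pointwise convolution and Leaky ReLU — with toy random weights; the kernel size, dilation, and channel counts here are illustrative, not the paper's settings:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def causal_sep_conv(x, k=3, d=2, c_out=8, seed=0):
    """Causal dilated depthwise separable 1-D convolution (toy weights).

    x: (T, C_in). Depthwise step: one k-tap dilated filter per input
    channel with left-only (causal) padding, so output[t] never sees
    inputs later than t. Pointwise step: 1x1 conv mixing channels.
    """
    rng = np.random.default_rng(seed)
    T, c_in = x.shape
    dw = rng.normal(size=(k, c_in))       # depthwise kernels
    pw = rng.normal(size=(c_in, c_out))   # pointwise (1x1) kernels
    pad = (k - 1) * d                     # causal left padding
    xp = np.pad(x, ((pad, 0), (0, 0)))
    depth = np.zeros((T, c_in))
    for t in range(T):
        # taps at times t, t-d, ..., t-(k-1)d (in padded coordinates)
        depth[t] = (xp[t: t + pad + 1: d] * dw).sum(axis=0)
    return leaky_relu(depth @ pw)

rng = np.random.default_rng(2)
x = rng.normal(size=(100, 11))
y = causal_sep_conv(x)
print(y.shape)  # (100, 8)

# Causality check: perturbing a future sample leaves earlier outputs unchanged
x2 = x.copy()
x2[50] += 10.0
y2 = causal_sep_conv(x2)
print(bool(np.allclose(y[:50], y2[:50])))  # True
```

The causality check illustrates why stacking such layers with growing dilation lets the TCN recognize temporal information "from part to whole" without ever leaking future samples into the past.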

Depthwise Separable Convolution
Depthwise separable convolution is an efficient convolution method that reduces the number of parameters and the computational effort of convolution operations; it is widely used in image classification tasks and, as demonstrated by the MobileNets [15] family of structures, achieves good performance for a given number of parameters. Normal convolution must consider spatial information and channel correlation jointly when processing data. In contrast, depthwise separable convolution accomplishes the function of normal convolution through a two-step operation: channel (depthwise) convolution and pointwise convolution. The channel convolution is performed separately on each input channel, reducing the number of convolution kernels, while pointwise convolution uses 1-by-1 kernels to further reduce the computational effort. Depthwise separable convolution can therefore improve computational efficiency when training deep neural networks. The lightweight TCN module takes advantage of this idea, replacing the original normal convolution with depthwise separable convolution and using channel convolution plus pointwise convolution to complete the function of a normal convolution. The three convolution methods are shown in Figure 5.
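The saving can be seen by counting parameters: a standard 1-D convolution needs k·C_in·C_out weights, while the depthwise-plus-pointwise pair needs only k·C_in + C_in·C_out, a reduction by roughly a factor of 1/C_out + 1/k. The channel and kernel sizes below are illustrative, not taken from Table 2:

```python
def conv1d_params(k, c_in, c_out, bias=True):
    """Parameters of a standard 1-D convolution layer."""
    return k * c_in * c_out + (c_out if bias else 0)

def depthwise_separable_params(k, c_in, c_out, bias=True):
    """Depthwise (one k-tap filter per input channel) plus pointwise 1x1."""
    depthwise = k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise

# Example: kernel size 7, 64 input channels, 64 output channels
std = conv1d_params(7, 64, 64, bias=False)               # 7*64*64 = 28672
sep = depthwise_separable_params(7, 64, 64, bias=False)  # 7*64 + 64*64 = 4544
print(std, sep, round(std / sep, 1))  # 28672 4544 6.3
```

The larger the kernel and the channel count, the closer the saving approaches the full factor of k.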

The normal convolution is calculated as shown in Equation 2, where W denotes the convolution kernel, y the input feature map, and m the number of channels, and the resolutions of the input and output feature maps are denoted by (i, y) and (o, q), respectively.

ICAITA-2023, Journal of Physics: Conference Series 2637 (2023) 012042, IOP Publishing, doi:10.1088/1742-6596/2637/1/012042

The channel convolution is calculated as shown in Equation 3 and the pointwise convolution as shown in Equation 4; combining Equations 3 and 4 gives the depthwise separable convolution, as shown in Equation 5.

Output layer
The MTL-HAR algorithm uses a Softmax classifier as the output layer for action recognition, and the classification loss uses a balanced cross-entropy loss function. The Softmax classifier is calculated as shown in Equations 6 and 7, where W^T, x, and b in Equation 6 are the weight matrix, input vector, and bias, respectively. The output value y is calculated for each output node, and the probability value ŷ_i of the corresponding action is calculated by Equation 7, where n is the number of hidden units in the fully connected layer, e denotes the natural constant, and y_i is the output value of the i-th output node.
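A minimal sketch of the output layer's computation — y = W^T x + b followed by the softmax normalization of Equation 7 — with illustrative layer sizes; the "balanced" class weighting of the loss is omitted here:

```python
import numpy as np

def softmax(y):
    """Numerically stable softmax over the output nodes (Equation 7)."""
    z = y - y.max()          # shift for stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, true_idx, eps=1e-12):
    """Cross-entropy loss for a single sample with an integer label
    (unweighted; the paper's loss additionally balances the classes)."""
    return -np.log(probs[true_idx] + eps)

# Fully connected output: y = W^T x + b (toy sizes, 6 action classes)
rng = np.random.default_rng(3)
W = rng.normal(size=(11, 6))   # weight matrix
b = np.zeros(6)                # bias
x = rng.normal(size=11)        # input vector from the TCN module
y = W.T @ x + b

probs = softmax(y)
print(round(float(probs.sum()), 6))  # 1.0
print(int(probs.argmax()))           # predicted action label
```

The max-shift inside `softmax` avoids overflow for large logits without changing the resulting probabilities.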

Setting of network model parameters
In this paper, we first replace the lightweight TCN module in MTL-HAR with two recurrent neural networks, the long short-term memory network (LSTM) and the gated recurrent unit network (GRU), for comparison experiments, constituting a Multiscale CLT network (CNN+LSTM) and a Multiscale CGU network (CNN+GRU); both variants use the same number of convolutional channels as MTL-HAR. MTL-HAR is also compared with three cascaded networks without the multiscale fusion operation: CNN+LSTM, CNN+GRU, and CNN+TCN. The network parameters of the MTL-HAR algorithm and the comparison algorithms are shown in Table 1.

Complexity analysis with different models
Table 2. Comparison of the complexity of the comparison algorithms and the MTL-HAR model.
As seen in Table 2, compared with the CNN+TCN, CNN+LSTM, and CNN+GRU algorithms, MTL-HAR uses a multi-scale feature fusion model instead of cascaded convolutional layers. By connecting across layers it fuses feature information at different scales, completing the fusion of data dimensions with fewer parameters; and by increasing the width of the model and extracting features with convolutional kernels of different scales, the overall computational effort of the model is reduced.
In the comparison of MTL-HAR with the Multiscale CLT network (CNN+LSTM) and the Multiscale CGU network (CNN+GRU), replacing the LSTM and GRU modules with the lightweight TCN module proposed in this paper effectively reduces the overall number of parameters and the computation of the network, because TCNs use convolutional neural networks for temporal sequence modeling: the parallelism of convolutional operations allows multiple time steps to be processed simultaneously, reducing the computational cost.

Dataset
In this paper, the model was experimentally validated using the WISDM dataset, a HAR benchmark provided by the Wireless Sensor Data Mining Lab research team. Thirty-six participants, each carrying an Android smartphone in a front trouser pocket, completed specified movements in a controlled environment. The triaxial accelerometer built into each smartphone collected the x-, y-, and z-axis acceleration values, yielding 1,098,207 sample points at a sampling frequency of 20 Hz. Participants were asked to perform six activities: sitting, standing, walking, walking upstairs, walking downstairs, and jogging.
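Before training, the continuous WISDM stream is typically segmented into fixed-length windows; a window of 100 samples (5 s at 20 Hz) would match the (100, 11) shape quoted earlier, though the window and step sizes below are illustrative choices rather than values stated in the paper:

```python
import numpy as np

def sliding_windows(acc, labels, win=100, step=50):
    """Segment an (N, 3) acceleration stream into fixed-length windows.

    win=100 samples corresponds to 5 s at the WISDM rate of 20 Hz.
    Each window is assigned the majority label of its samples.
    """
    xs, ys = [], []
    for start in range(0, len(acc) - win + 1, step):
        seg = acc[start:start + win]
        seg_labels = labels[start:start + win]
        majority = int(np.bincount(seg_labels).argmax())
        xs.append(seg)
        ys.append(majority)
    return np.stack(xs), np.array(ys)

# Toy stream: 400 samples of one activity (0) followed by 400 of another (1)
rng = np.random.default_rng(4)
acc = rng.normal(size=(800, 3))
labels = np.array([0] * 400 + [1] * 400)
X, y = sliding_windows(acc, labels, win=100, step=50)
print(X.shape)  # (15, 100, 3)
```

Overlapping windows (step < win) increase the number of training samples at the cost of some correlation between adjacent windows.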

Analysis of experimental results
The confusion matrix of the MTL-HAR algorithm trained on this dataset is shown in Figure 6. Each column represents the number of samples predicted as that category, and each row represents the number of samples actually in each category. The coordinates 0 to 5 in Figure 6 correspond to the six actions jogging, walking, sitting, standing, going upstairs, and going downstairs, respectively. The model achieved an accuracy of about 99% across all six action patterns; among them, 7 instances of the walking action (label 2) were confused with jogging and 11 were confused with standing, mainly due to the similar magnitudes of triaxial acceleration and movement amplitude; the remaining actions were recognized relatively well. The evaluation indexes of the experimental results are shown in Table 3.
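The per-class indexes in Table 3 can be derived directly from a confusion matrix laid out as described above (rows = true class, columns = predicted class); the counts below are made-up illustrations, not the actual Figure 6 values:

```python
import numpy as np

# Toy confusion matrix (rows = true class, columns = predicted class),
# with illustrative counts only -- not the paper's actual results.
cm = np.array([
    [500,   7,   0,   0,   0,   0],  # jogging
    [  5, 480,   0,   2,   0,   0],  # walking
    [  0,   0, 300,   1,   0,   0],  # sitting
    [  0,   3,   1, 290,   0,   0],  # standing
    [  0,   0,   0,   0, 150,   4],  # upstairs
    [  0,   0,   0,   0,   3, 140],  # downstairs
])

accuracy = np.trace(cm) / cm.sum()          # correct / total
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (columns)
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (rows)
f1 = 2 * precision * recall / (precision + recall)
print(round(float(accuracy), 4))
print(np.round(recall, 3))
```

Off-diagonal entries such as walking predicted as jogging correspond exactly to the confusions discussed above.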

Comparison of recognition accuracy of different models
In this paper, [5], [6], and [9] are used for comparative analysis; these three studies conducted experiments on the same dataset as this paper. In [9], a two-layer LSTM network was used for human action recognition; its recognition accuracy was lower than that of this paper because only the temporal properties of the features were considered and the network structure was simpler. In [5], a combination of multi-channel convolutional neural networks and attention was introduced, and the recognition rate was higher than that of an ordinary CNN. In [6], a multi-channel CNN-GRU model was proposed, raising the accuracy further. However, the models in [5] and [6] were made larger and more complex in pursuit of higher recognition accuracy. Figure 7 shows the recognition accuracy curves of each algorithm. MTL-HAR effectively improves the recognition capability of the model by introducing a multi-scale temporal feature extraction module to fuse temporal features at different levels and scales; the fused features are then used to learn higher-semantic-level representations and more abstract features through a lightweight TCN module. To reduce the complexity of the algorithm, the model size is effectively reduced through the optimization of structure and parameters and the lightweight processing of the TCN module. As Figure 7 shows, in comparison with the LSTM method of [9], the multi-head convolution attention model of [5], the multichannel CNN-GRU model of [6], and the comparative models proposed in this paper, MTL-HAR fits the data well and maintains high accuracy with a small number of parameters.

Conclusion
Human motion recognition plays a very important role in the field of human health monitoring. This paper combines human motion recognition with deep learning and proposes the MTL-HAR model for classification and recognition, introducing a multi-scale temporal feature extraction module and an improved lightweight TCN module to address the problems of large model size and low accuracy due to inadequate mining of data features in current neural networks. In comparison with traditional methods, the MTL-HAR model avoids tedious and complicated feature engineering; its performance indexes are ahead of other deep learning algorithms, with a lower number of model parameters and higher equipment utilization efficiency. In future work, we will continue to study the introduction of improved deep learning algorithms into applications such as human health monitoring.

Figure 1. Structure of the MTL-HAR algorithm model.

Figure 2. Standardized acceleration data curve of the walking process.

Figure 3. Structure of the multi-scale temporal data extraction section.

Figure 6. Experimental confusion matrix results graph.

Table 3. The evaluation index of the proposed model on the dataset (%).

Figure 7. Comparison of recognition accuracy of different models.

Table 1. Network parameters of the comparison algorithms and MTL-HAR.

Table 4. Comparison before and after TCN improvement.