Activity Recognition Using 1D Convolution from Accelerometer Data

To address the problem of human activity recognition, this paper proposes a method based on a convolutional neural network (CNN) that classifies six types of human movement: Downstairs, Jogging, Sitting, Standing, Upstairs and Walking. The network consists of an input layer, two convolutional layers, two pooling layers, a fully connected layer, and an output layer. A sliding window transforms the sensor data into a three-channel format analogous to an RGB image, and the features of the three-axis acceleration sensor data are extracted automatically to classify each action. The model was implemented in TensorFlow and achieved a recognition rate of 89.35% on the open-source WISDM database. The experimental results show that the proposed method performs well on human movement recognition.


Introduction
With the development of the Internet of Things and the spread of embedded devices, large amounts of sensing data can now be collected. More and more edge computing devices, such as mobile phones, integrate a variety of powerful sensors, including accelerometers, gyroscopes, light sensors, and GPS receivers. These sensors have become data sources for measurement applications. In particular, the real-time acceleration data that a mobile phone's accelerometer returns along the x-axis, y-axis, and z-axis can be used for activity recognition. Compared with video-based recognition, this approach is not susceptible to environmental interference such as lighting conditions, camera angle, and obstructions.

Related Work
Many successful applications have appeared in this field [1]. Sensor-based activity recognition was first proposed in the early 2000s [2] [3], and numerous results have shown that the approach is feasible. For example, Tapia et al. [4] proposed that daily activities can be identified through sensors, and Hollosi [5] applied sensor data to early childhood education. Although traditional machine learning methods such as support vector machines and decision trees have made some progress [6], their classification performance on activity data is still not ideal. Neural networks have better feature extraction capabilities and are currently widely used in image processing. Inspired by these studies, this paper applies convolutional neural networks to sensor sequence data to address the problem of activity recognition.

Our Proposal
In this paper, we use a convolutional neural network (CNN) as the activity classifier and propose a method for activity recognition from acceleration sensor data. The proposed method acquires acceleration data from sensors worn on the user's wrist and thigh, applies 1-dimensional convolution to extract features from the input sequence, and finally recognizes the human activity.
We used the data set provided by the Wireless Sensor Data Mining (WISDM) Lab at Fordham University [7], which was collected in a controlled experimental environment. The data set contains a total of 1,098,203 records in six activity categories: Downstairs, Jogging, Sitting, Standing, Upstairs and Walking. The distribution of records across activities (labels) is shown in Fig. 1. In this project, researchers recorded acceleration data along three axes during daily movements. The z-axis characterizes the forward motion of the leg, the y-axis characterizes the user's upward and downward motion, and the x-axis characterizes the user's horizontal leg motion.

Data preprocessing
The main task in this step is data normalization, which can improve both the convergence rate and the accuracy of the model. More importantly, some learning algorithms require it: in some models, the optimal solution after non-uniform scaling of the individual dimensions is not equivalent to the original one. For a model such as an SVM, unless the ranges of the data in each dimension are already similar, the data must be standardized so that the model parameters are not dominated by the dimensions with larger or smaller ranges. For other models, such as logistic regression, the optimal solution after non-uniform scaling of the dimensions is equivalent to the original one.
CNNs were first applied to the classification of 2D images [8]. A CNN first uses convolution kernels to extract features from the pixels of the original image, then uses pooling layers to reduce the convolved data to smaller dimensions, subsampling the maxima to capture the spatial correlation between them. Finally, a softmax layer performs the classification.
Because training a CNN is essentially an iterative process, it is best to standardize the data even for models that are invariant to scaling. Zero-mean normalization is suitable when the maximum and minimum values are unknown or when the data contain outliers, so in this paper we use zero-mean normalization, also called Z-score normalization. The processed data have a mean of 0 and a standard deviation of 1. The values of an attribute $A$ are normalized using the mean and standard deviation of $A$. A new value $v'$ is obtained from an original value $v$ using equation (1), where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of attribute $A$:

$$v' = \frac{v - \bar{A}}{\sigma_A} \qquad (1)$$

We divide the sensor data into segments of equal length, turning the dynamic data sequence into static data so that the spatiotemporal characteristics of the data can be extracted. This makes the sequence data easier to process and to analyze for activity recognition. Taking the Downstairs activity as an example, Figure 3 shows the processed sequence data visually.
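The two preprocessing steps described above, Z-score normalization and segmentation into fixed-length windows, can be sketched in plain Python as follows. This is a minimal illustration rather than the code used in the experiments: the function names `z_score` and `sliding_windows` and the synthetic toy signal are assumptions, and the window length of 90 with a step of 45 is taken from the Model Structure section.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Scale one axis to zero mean and unit standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation of the axis
    if std == 0:
        raise ValueError("a constant signal cannot be z-score normalized")
    return [(v - mean) / std for v in values]


def sliding_windows(samples, size=90, step=45):
    """Cut a list of (x, y, z) samples into fixed-length, overlapping windows."""
    return [samples[start:start + size]
            for start in range(0, len(samples) - size + 1, step)]


# Hypothetical toy recording: 180 samples of three synthetic axes.
x = z_score([float(i % 7) for i in range(180)])
y = z_score([float(i % 5) for i in range(180)])
z = z_score([float(i % 3) for i in range(180)])

windows = sliding_windows(list(zip(x, y, z)))
# Number of windows, samples per window, axes per sample.
print(len(windows), len(windows[0]), len(windows[0][0]))  # prints "3 90 3"
```

Each window is a 90 by 3 block, which matches the 1*90*3 input size of the network. Normalizing each axis separately is one reasonable reading of the text; normalizing over the whole recording before windowing, as done here, avoids per-window statistics.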

Model Structure
In this section, we discuss our CNN architecture. Fig. 4 shows the structure of the proposed approach. It has three kinds of layers: 1) an input layer whose values are fixed by the input data; 2) hidden layers whose values are derived from previous layers; 3) an output layer whose values are derived from the last hidden layer.

Figure 4. Structure of CNN
Considering how human movement actually unfolds, we set the window size to 4.5 s, which corresponds to 90 samples. Consecutive windows overlap by half the window size, so each input sample to the CNN has size 1*90*3, where the depth of 3 corresponds to the three axes of the preprocessed acceleration data. Behavior features are then extracted by convolution and pooling. After preprocessing, the data enter the first convolutional layer. The kernel size of convolution 1 is 1*10, and it has 60 kernels with different weights. The output of convolution 1 passes to the pooling layer, which applies max-pooling, and then to convolution 2. The kernel size of convolution 2 is 1*6, and it has 10 kernels with different weights. Finally, the actions are recognized by the fully connected layer and the softmax layer.
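To make the layer sizes concrete, the following sketch traces how the length of the time axis changes through the network. The window length, kernel sizes, and kernel counts are those given in the text. The text does not state the padding, the strides, or the pooling window, so the sketch assumes no padding, stride 1 in both convolutions, and a pooling window of 2 with stride 2; the resulting numbers hold only under those assumptions.

```python
def conv_output_length(length, kernel, stride=1):
    """Output length of a 1D convolution or pooling step without padding."""
    return (length - kernel) // stride + 1


# Values stated in the text.
WINDOW, CHANNELS = 90, 3
CONV1_KERNEL, CONV1_FILTERS = 10, 60
CONV2_KERNEL, CONV2_FILTERS = 6, 10

# Assumptions, not given in the text: no padding, stride 1 in both
# convolutions, and a max-pooling window of 2 with stride 2.
POOL_SIZE, POOL_STRIDE = 2, 2

after_conv1 = conv_output_length(WINDOW, CONV1_KERNEL)
after_pool = conv_output_length(after_conv1, POOL_SIZE, POOL_STRIDE)
after_conv2 = conv_output_length(after_pool, CONV2_KERNEL)

shapes = [
    ("input", (1, WINDOW, CHANNELS)),
    ("conv1", (1, after_conv1, CONV1_FILTERS)),
    ("max-pool", (1, after_pool, CONV1_FILTERS)),
    ("conv2", (1, after_conv2, CONV2_FILTERS)),
]
for name, shape in shapes:
    print(f"{name:9s} {shape}")
print("flattened features:", after_conv2 * CONV2_FILTERS)
```

Under these assumptions the time axis shrinks from 90 to 81, 40, and 35 samples, so the fully connected layer would receive 350 features. A different pooling window or padding mode changes these values.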
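Formally, a feature map is extracted by the convolution operation given in equation (4) [9]:

$$a_j^{(l+1)}(\tau) = \sigma\left(b_j^{l} + \sum_{f=1}^{F^{l}} K_{jf}^{l}(\tau) * a_f^{l}(\tau)\right) = \sigma\left(b_j^{l} + \sum_{f=1}^{F^{l}} \sum_{p=1}^{P^{l}} K_{jf}^{l}(p)\, a_f^{l}(\tau - p)\right) \qquad (4)$$

where $a_j^{l}(\tau)$ denotes feature map $j$ in layer $l$, $\sigma$ is the ReLU non-linear function, $F^{l}$ is the number of feature maps in layer $l$, $K_{jf}^{l}$ is the kernel convolved over feature map $f$ in layer $l$ to create feature map $j$ in layer $l+1$, $P^{l}$ is the length of the kernels in layer $l$, and $b^{l}$ is a bias vector.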
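The computation of a single output feature map can be sketched in plain Python as below. This is an illustrative sketch, not the TensorFlow implementation used in the experiments: the function name `conv1d_feature_map` and the toy inputs are assumptions. It sums, over all input feature maps, the product of a kernel with a sliding slice of the input, adds a bias, and applies ReLU. Like most deep learning libraries, it slides the kernel without flipping it (cross-correlation), which equals a convolution with the kernel reversed.

```python
def relu(value):
    """Rectified linear unit."""
    return value if value > 0.0 else 0.0


def conv1d_feature_map(inputs, kernels, bias):
    """Compute one output feature map from several input feature maps.

    inputs:  list of F input feature maps, each a list of T values
    kernels: list of F kernels (one per input map), each of length P
    bias:    scalar bias of this output map
    Only positions where the kernel fully overlaps the input are computed.
    """
    length = len(inputs[0])
    width = len(kernels[0])
    output = []
    for t in range(length - width + 1):
        total = bias
        for feature_map, kernel in zip(inputs, kernels):
            for p in range(width):
                total += kernel[p] * feature_map[t + p]
        output.append(relu(total))
    return output


# Hypothetical toy example: two input maps of length 4, kernels of length 2.
inputs = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, -4.0]]
kernels = [[-1.0, 1.0], [0.5, 0.5]]
feature_map = conv1d_feature_map(inputs, kernels, bias=0.5)
print(feature_map)  # prints [2.0, 2.0, 0.0]
```

The last position sums to -0.5 before the non-linearity, so ReLU clips it to 0.0. A full convolutional layer repeats this computation once per output feature map, each with its own kernels and bias.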

Training Process
When the learning rate is set too low, the model converges very slowly; when it is set too high, the gradients can explode and the model fails to converge. Therefore, in this paper we use Adam as the optimizer, with the initial learning rate set to 0.001, to avoid the problem of an improperly chosen learning rate. The learning rate follows equation (5):

$$\eta = \eta_0 \cdot \gamma^{\,t/T} \qquad (5)$$

where $\eta_0$ is the initial learning rate, $\gamma$ is the decay rate of each round of learning, $t$ is the current learning step, and $T$ is the number of steps per round of learning, given by equation (6):

$$T = \frac{N}{B} \qquad (6)$$

where $N$ is the total number of samples and $B$ is the size of each batch. The loss function is the softmax loss defined in equation (7), because we assume that the samples follow a Bernoulli distribution:

$$L = -\sum_{i} y_i \log s_i \qquad (7)$$

where $L$ is the loss, $s_i$ is the $i$-th value of the output vector $S$ from softmax, and $y_i$ is the desired output.
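The learning rate schedule and the softmax loss described above can be sketched in plain Python as follows. This is a hedged illustration, not the training code: the function names are assumptions, the decay rate of 0.9, the sample count, and the batch size are hypothetical values chosen for the example, and only the initial learning rate of 0.001 comes from the text. The step count is rounded up so that a final partial batch counts as a step, which is also an assumption.

```python
import math


def steps_per_round(num_samples, batch_size):
    """Number of optimization steps in one pass over the data."""
    return math.ceil(num_samples / batch_size)


def decayed_learning_rate(initial_rate, decay_rate, step, decay_steps):
    """Exponential decay: the rate shrinks by decay_rate every decay_steps."""
    return initial_rate * decay_rate ** (step / decay_steps)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, one_hot_target):
    """Softmax loss for one sample: minus the log-probability of the true class."""
    return -sum(t * math.log(p)
                for p, t in zip(probabilities, one_hot_target) if t > 0)


# Hypothetical settings; only the initial rate of 0.001 is from the text.
decay_steps = steps_per_round(num_samples=1000, batch_size=32)
rate_after_one_round = decayed_learning_rate(0.001, 0.9, decay_steps, decay_steps)

# Six equal logits: the model is undecided between the six activities.
uniform_loss = cross_entropy(softmax([0.0] * 6), [1, 0, 0, 0, 0, 0])
print(decay_steps, rate_after_one_round, uniform_loss)
```

With six classes, an undecided model has a loss of log 6, roughly 1.79, which is a useful reference point when reading a training curve: a loss that stays near this value means the network has not yet learned to separate the activities.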

Experimental Analysis
All experiments were conducted in TensorFlow on an ordinary computer with a GTX 1060 GPU and 8 GB of memory. We used cross-validation: the whole sample set was divided into three parts by random sampling, two of which were used for training and one for testing. Cross-validation yields lower accuracy than a single sampled test set, but it is closer to the classification accuracy on unlabeled samples. To analyze the results in more detail, we show the confusion matrix (Table 1) and the normalized confusion matrix (Fig. 5) for the Actitracker dataset using the CNN.
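The evaluation procedure can be sketched in plain Python as follows: a random split into three parts, and a confusion matrix whose rows are normalized to per-class proportions. The function names, the fixed seed, and the toy labels are assumptions made for illustration. The text does not say whether each of the three parts was held out in turn, so rotating the test part over all three folds is also an assumption.

```python
import random


def three_fold_splits(num_samples, seed=0):
    """Yield (train_indices, test_indices) for random three-fold cross-validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::3] for i in range(3)]
    for held_out in range(3):
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, folds[held_out]


def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Entry (i, j) counts samples of true class i predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its sum to obtain per-class proportions."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Hypothetical toy labels for three classes.
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]
counts = confusion_matrix(true_labels, predicted_labels, num_classes=3)
normalized = normalize_rows(counts)
print(counts)      # prints [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(normalized)  # prints [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.5, 0.0, 0.5]]

splits = list(three_fold_splits(10))
```

The diagonal of the normalized matrix gives the recall of each activity, which makes classes with few samples comparable to classes with many.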

Conclusion
Inspired by convolutional neural networks in machine vision, we designed and implemented a convolutional neural network model for human activity recognition based on triaxial acceleration sensor data from smartphones. The model generalizes well and achieves high recognition accuracy for walking, sitting and standing. The accuracy for upstairs and downstairs, although it reached 83.94%, is not yet satisfactory. Overall, the method offers a higher total recognition rate, better fault tolerance and stability, and stronger robustness, and it is more convenient and flexible to use.