Research on Behavior Recognition Algorithm Based on Compact Representation of Low-level Feature

Recently, the bag of visual words (BOVW) model based on spatial-temporal interest points (STIPs) has been used increasingly widely in the field of behavior recognition. However, the model ignores the temporal order between frames and the intra-frame position information of interest points. In this paper, an algorithm is proposed to capture the geometrical and temporal distribution of STIPs. First, the STIPs mutual information (STIPsMI) algorithm, based on the co-occurrence matrix, is proposed to describe the spatial-temporal relationship of STIPs between different visual words. The resulting descriptor is then concatenated with the BOVW histogram to form the final representation. Two authoritative human action datasets, KTH and UCF Sports, were used to test the algorithm. Experimental results verify the robustness of the algorithm, which outperforms the BOVW model and other mainstream methods.


Introduction
In the field of computer vision, behavior recognition and detection have been very important modules over the past ten years. With the continuous development and maturity of Internet technology and the continuous expansion of video surveillance applications, more and more applications involve the automatic recognition of video events. Among these, behavior recognition is a hot topic, widely used in surveillance, virtual reality, content-based video retrieval, and human-machine interaction [1].
The mainstream methods of behavior recognition rely on low-level feature extraction, which can effectively describe human motion in video; behavior patterns are then learned from these low-level features and used to classify behavior categories [2]. However, human behaviors are complex and diverse, so no single feature model can be used universally.
It is well known that the STIPs-based BOVW model for behavior recognition is effective and easy to implement, since it requires neither moving-object detection and tracking nor background modeling. It provides an approach to understand and analyze interest points from video sequences directly. The BOVW model is therefore widely used in behavior recognition research [3][4][5][6][7][8][9][10][11][12]. However, the BOVW model ignores the spatial location information and temporal structure of STIPs inside a video, which has been proved essential for boosting the discrimination of behavior recognition [4].
To address these shortcomings, this paper improves the BOVW model and proposes a behavior recognition algorithm based on STIPs mutual information. First, the method of Dollar et al. [12] is used to extract interest points, which are described by the HOG3D feature. The K-means algorithm is then used to generate the bag of visual words by clustering these HOG3D features. Next, the spatial-temporal relationship between pair-wise visual words is represented by the STIPs mutual information descriptor, based on the co-occurrence matrix. Finally, the two kinds of information, the BOVW histogram and the STIPs mutual information, are fused as the descriptor of a video sequence, and an SVM classifier is used for recognition and classification. The overall algorithm flow is shown in Figure 1.
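The codebook stage of the pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random vectors stand in for real HOG3D descriptors, and the codebook size is kept small for brevity.

```python
# Sketch of the BOVW stage: cluster HOG3D-style descriptors into visual
# words, then build a normalized word histogram per video.
import numpy as np
from sklearn.cluster import KMeans

def bovw_histogram(features, kmeans):
    """Normalized visual-word histogram for one video's STIP descriptors."""
    labels = kmeans.predict(features)
    hist = np.bincount(labels, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(200, 96))    # stand-in HOG3D descriptors
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(train_feats)
hist = bovw_histogram(rng.normal(size=(40, 96)), kmeans)
```

In the paper's setting the codebook size would be 500 (KTH) or 800 (UCF Sports) rather than 8.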
The paper is arranged as follows. Section 2 reviews related work on improved BOVW methods. Section 3 introduces the extraction and description of spatial-temporal interest points. Section 4 gives the detailed procedure to compute STIPs mutual information. Section 5 shows the results on two challenging datasets. Section 6 analyzes and summarizes the paper.

Related work
In recent years, algorithms for integrating geometric information into BOVW have been proposed. The most common method is the spatial-temporal pyramid (STP). In this method, descriptor-level statistics are pooled, and the video sequence is repeatedly and evenly divided into a set of spatial and temporal cells. The approach can roughly capture the spatial layout and temporal order of an action sequence. However, directly using the BOVW histogram to represent video increases the feature scale while greatly increasing the cost of learning and storage [4].
Later, some scholars put forward modeling methods for spatial-temporal interest points. Wu X et al. [5] propose using multiple Gaussian Mixture Models (GMMs) at different temporal and spatial scales to represent the distribution of the local spatial-temporal background between interest points as a feature representation. Q Hu et al. [6] propose the spatial-temporal context in the spatial-temporal domain; the spatial-temporal context is a representative video collection including video words and context words, and the context can capture the semantics of video words. Wang et al. [7] propose using dense trajectories and motion boundary information to describe video. X Yang et al. [4] propose an effective coding scheme to merge spatial-temporal information by aggregating low-level descriptors into the super descriptor vector (SDV). Kovashka et al. [8] propose forming candidate neighborhoods by learning the shapes of the neighborhoods of space-time features, and then composing descriptors from words associated with nearby points and their orientation relative to the central interest point. Wong and Cipolla [9] apply non-negative matrix factorization (NNMF) to the entire video set to add global information to interest point detection. Bregonzio et al. [10] model the global spatial and temporal distribution by extracting holistic features of interest point clouds accumulated over multiple time scales. C Yuan et al. [11] use the 3D discrete Radon transform (R transform), an extension of the 2D Radon transform, to capture the detailed geometric distribution of interest points and obtain a global characterization. However, these methods are complex and have not received extensive attention. Therefore, it remains a challenging task to mine the STIPs of a video for better behavior recognition.

STIPs extraction and description
In this paper, we use the interest point detector proposed by Dollar et al. [12], which determines spatial-temporal salience by calculating a response value at each pixel of the video sequence with a two-dimensional Gaussian smoothing function and a pair of temporal Gabor wavelet functions. The detected STIPs are shown in Fig. 2. The response function is defined as equation (1), with the quadrature Gabor pair given by equations (2) and (3) respectively:

R = (I * g * h_ev)^2 + (I * g * h_od)^2    (1)

h_ev(t; τ, ω) = -cos(2πtω) e^(-t²/τ²)    (2)

h_od(t; τ, ω) = -sin(2πtω) e^(-t²/τ²)    (3)

where I is the video, * denotes convolution, g(x, y; σ) is the 2-D Gaussian smoothing kernel applied in the spatial domain, and ω = 4/τ. σ and τ are the spatial-scale and temporal-scale parameters, respectively.
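The Dollar et al. response function described above can be sketched as follows: spatial Gaussian smoothing per frame followed by a quadrature pair of temporal Gabor filters. This is an illustrative reading of the detector, with default scales taken from the experimental section; the full detector additionally finds local maxima of R.

```python
# Sketch of the Dollar et al. cuboid detector response:
# R = (I * g * h_ev)^2 + (I * g * h_od)^2
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def cuboid_response(video, sigma=1.5, tau=1.5):
    """Periodic-motion response map for a (T, H, W) grayscale video."""
    omega = 4.0 / tau                                   # frequency tied to tau
    # spatial 2-D Gaussian smoothing on every frame (no temporal smoothing)
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    ev = convolve1d(smoothed, h_ev, axis=0, mode="reflect")
    od = convolve1d(smoothed, h_od, axis=0, mode="reflect")
    return ev**2 + od**2            # STIPs are local maxima of this map

R = cuboid_response(np.random.default_rng(1).random((20, 16, 16)))
```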
It is usually necessary to describe the spatial-temporal information of the cuboid centered at each interest point, i.e., to construct a spatial-temporal feature descriptor. We employ the HOG3D feature to build this descriptor for each interest point.
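As a rough illustration of such a cuboid descriptor, the sketch below histograms spatial gradient orientations inside the cuboid around a STIP. It is a deliberately simplified stand-in for the full HOG3D descriptor (which quantizes 3-D gradient orientations over a cell grid), and the cuboid size and bin count are arbitrary choices.

```python
# Simplified HOG3D-style cuboid descriptor (illustration only):
# a magnitude-weighted histogram of spatial gradient orientations.
import numpy as np

def cuboid_descriptor(video, point, size=9, bins=8):
    """Descriptor for the cuboid centred at STIP (t, y, x) in a (T, H, W) video."""
    t, y, x = point
    h = size // 2
    cub = video[t-h:t+h+1, y-h:y+h+1, x-h:x+h+1].astype(float)
    gy, gx = np.gradient(cub, axis=(1, 2))       # per-frame spatial gradients
    ang = np.arctan2(gy, gx)                     # orientation in [-pi, pi]
    mag = np.hypot(gx, gy)                       # gradient magnitude weights
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    return hist / max(hist.sum(), 1e-9)          # L1-normalized histogram

video = np.random.default_rng(0).random((21, 33, 33))
d = cuboid_descriptor(video, (10, 16, 16))
```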

STIPs Co-occurrence matrix
Inspired by spatial co-occurrence matrix for 2-D interest points in an image [13], this paper extends the concept and applies it to the action recognition task. STIPs co-occurrence matrix (STIPsCM) is to obtain the spatial-temporal correlation information of interest points between different visual words.
The set of spatial-temporal interest points V in a video sequence S is quantized by the BOVW model into n classes, expressed as w_1, w_2, ..., w_n. For any pair of STIPs v_i = (x_i, y_i, t_i) and v_j = (x_j, y_j, t_j), if their spatial-temporal distance is no larger than the threshold r, as defined in equation (4), their relation is called 3-D co-occurrent:

d(v_i, v_j) = sqrt((x_i - x_j)² + (y_i - y_j)² + (t_i - t_j)²) ≤ r    (4)

The co-occurrence statistic between any two words w_q and w_p is defined as equation (5):

CM(w_q, w_p) = Σ_{v_i ∈ w_q} Σ_{v_j ∈ w_p} [ d(v_i, v_j) ≤ r ∧ v_i ≠ v_j ]    (5)
where ∧ represents logical AND. In fact, the item CM(w_q, w_p) represents the total number of STIPs of visual word w_q co-occurring with visual word w_p within the spatial-temporal radius r. The resulting STIPs co-occurrence matrix over all pair-wise visual words is thus an n×n matrix, as shown in Fig. 3; the detailed co-occurrence relation is shown in Fig. 4. Finally, to obtain the co-occurrence probability distribution, each row in Fig. 3 is normalized by dividing by the sum of its row elements. Fig. 4 shows the structure of the STIPs co-occurrence statistics between any two words.
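The co-occurrence counting and row normalization described above can be sketched as follows. This is an illustrative implementation under the stated definitions; the brute-force double loop is kept for clarity rather than speed.

```python
# Sketch of the STIPs co-occurrence matrix (eqs. 4-5) with row normalization.
import numpy as np
from scipy.spatial.distance import cdist

def stips_cooccurrence(points, labels, n, r):
    """Row-normalized n x n co-occurrence matrix.
    points: (N, 3) array of (x, y, t); labels: visual-word index per point."""
    d = cdist(points, points)                    # Euclidean distances, eq. (4)
    close = (d <= r) & ~np.eye(len(points), dtype=bool)   # exclude self-pairs
    cm = np.zeros((n, n))
    for q in range(n):
        for p in range(n):
            cm[q, p] = close[np.ix_(labels == q, labels == p)].sum()
    rows = cm.sum(axis=1, keepdims=True)         # normalize each row to a
    return np.divide(cm, rows, out=np.zeros_like(cm), where=rows > 0)

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [100.0, 0.0, 0.0]])
lab = np.array([0, 0, 1])
cm = stips_cooccurrence(pts, lab, n=2, r=5.0)
```

Here the two nearby points of word 0 co-occur with each other, while the far-away point of word 1 co-occurs with nothing.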

STIPs mutual information
Although the STIPs co-occurrence matrix can represent the spatial-temporal distribution of interest points among different words, its n×n dimension is high, which results in costly computation and large storage [14,15]. In this paper, we propose the concept of spatial-temporal interest points mutual information based on the co-occurrence matrix, which not only reduces the dimension of the feature vector to n, but also preserves the spatial-temporal information.
The concept of mutual information originates in information theory: the amount of information that an event Y provides about another event X is defined as equation (6):

I(X; Y) = log [ p(X | Y) / p(X) ]    (6)
The STIPs co-occurrence matrix represents the spatial-temporal distribution of all pairs of spatial-temporal interest points. Each element in the p-th row of the co-occurrence matrix represents the probability of the visual word w_p matching another visual word. Following equation (6), the mutual information of visual word w_p given w_q is defined as equation (7):

I(w_p; w_q) = log [ p(w_p | w_q) / p(w_p) ]    (7)

where p(w_p | w_q) is the corresponding entry of the normalized co-occurrence matrix and p(w_j) is the frequency of the j-th word in the BOVW histogram descriptor. By this transformation, the n×n-dimensional co-occurrence matrix of spatial-temporal interest points in section 4.1 can be reduced to an n-dimensional mutual information descriptor, called STIPs mutual information and abbreviated STIPsMI.
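A possible reading of the STIPsMI computation is sketched below. The pair-wise mutual information follows equation (7); the reduction from n×n to n dimensions is not fully specified in the text, so the averaging over the conditioning word used here is our assumption, as is the epsilon used to guard the logarithm.

```python
# Sketch of the STIPsMI descriptor: pair-wise MI from the normalized
# co-occurrence matrix and BOVW word frequencies, collapsed to n dims.
import numpy as np

def stips_mi(cm_norm, bovw_hist, eps=1e-9):
    """n-dimensional STIPsMI from a row-normalized n x n co-occurrence
    matrix (rows = conditioning word w_q) and a BOVW histogram."""
    p_w = bovw_hist / max(bovw_hist.sum(), eps)            # word priors p(w_j)
    ratio = (cm_norm + eps) / (p_w[np.newaxis, :] + eps)   # p(w_p|w_q)/p(w_p)
    mi = np.log(ratio)                                     # eq. (7), n x n
    return mi.mean(axis=0)          # assumed reduction: average over w_q

mi = stips_mi(np.eye(3), np.array([1.0, 1.0, 2.0]))
```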

Algorithm validation and data analysis
In this section, we evaluate the robustness of the proposed algorithm through experimental results on two authoritative datasets: KTH and UCF Sports. Recognition is performed by a linear SVM evaluated with leave-one-out cross-validation (LOOCV). The KTH dataset includes 6 kinds of behaviors; each was collected from 25 individuals in 4 different scenarios, as shown in Figure 2. The UCF Sports dataset consists of 150 video sequences covering 10 kinds of behavior, as shown in Figure 5. The dataset has been widely used in many applications, such as action recognition, motion localization, and saliency detection.
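The evaluation protocol above (linear SVM with leave-one-out cross-validation) can be sketched as follows; the synthetic two-class data stands in for real fused video descriptors, and C and the iteration cap are arbitrary defaults.

```python
# Sketch of the LOOCV evaluation with a linear SVM on fused descriptors.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import LinearSVC

def loocv_accuracy(descriptors, labels):
    """Mean leave-one-out accuracy of a linear SVM."""
    clf = LinearSVC(C=1.0, max_iter=10000)
    scores = cross_val_score(clf, descriptors, labels, cv=LeaveOneOut())
    return scores.mean()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 4)),     # stand-in class-0 descriptors
               rng.normal(5, 0.1, (10, 4))])    # stand-in class-1 descriptors
y = np.array([0] * 10 + [1] * 10)
acc = loocv_accuracy(X, y)
```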
The main parameters of the algorithm include the temporal scale τ, the spatial scale σ, the codebook size n of the BOVW, and the radius r of the location operator. For the KTH dataset, according to previous work and prior knowledge [11], the spatial scale is set to 1.5, the temporal scale to 1.5, and the BOVW size to 500. Similarly, following previous work [4], the UCF Sports settings are: spatial scale 2, temporal scale 2.5, and BOVW size 800. Therefore, in this paper we mainly test the effect of the parameter r and of the two kinds of low-level descriptors (BOVW, STIPsMI) on recognition accuracy.

Performance of the low-level descriptors (BOVW, STIPsMI) on recognition accuracy
To verify the feasibility of the algorithm, the two basic descriptors, BOVW and STIPsMI, as well as their different combinations, were tested on the KTH and UCF Sports datasets. Tables 1 and 2 show that recognition accuracy based on the duplicated descriptors (BOVW+BOVW, STIPsMI+STIPsMI) does not improve significantly and even declines. However, recognition accuracy improves significantly when different descriptors are cascaded. The descriptor BOVW+STIPsMI improves accuracy by 4.5%-6.0% compared with the BOVW descriptor on the KTH and UCF datasets. This shows that local features are important in describing video. The results also demonstrate that the STIPsMI and BOVW descriptors are complementary: STIPsMI describes the local relationships between different visual words, which compensates for the BOVW descriptor.
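The cascading of the two descriptors can be sketched as a simple concatenation. The per-part L2 normalization and the weight w are our assumptions (the weight anticipates the weighted fusion mentioned as future work), not details stated for the paper's experiments.

```python
# Sketch of descriptor fusion: L2-normalize each part, then concatenate.
import numpy as np

def fuse(bovw_hist, stipsmi, w=1.0):
    """Cascade the BOVW histogram and STIPsMI into one video descriptor."""
    a = bovw_hist / max(np.linalg.norm(bovw_hist), 1e-9)
    b = stipsmi / max(np.linalg.norm(stipsmi), 1e-9)
    return np.concatenate([a, w * b])   # fed to the SVM classifier

f = fuse(np.ones(4), np.ones(3))
```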
On the KTH dataset, the recognition accuracy of the fused descriptors reaches 95.29%. Because jogging and running are themselves highly similar, they are easily confused, so their recognition accuracy is relatively low; the other 4 kinds of behavior can be identified accurately. On the UCF database, the highest recognition accuracy is 87.33%. STIPsMI captures the local distribution of STIPs and performs well whether the video contains multi-person or single-person behavior.

Performance of parameter r on recognition accuracy
Changes of r exert a remarkable impact on recognition accuracy, as shown in Figure 6, which presents experimental results on the KTH and UCF Sports datasets. The optimal value of the parameter r lies in approximately [20, 25].

Aiming at the limitations of the BOVW model in behavior recognition, this paper proposes a behavior recognition algorithm based on STIPs mutual information. The STIPs mutual information represents the spatial-temporal distribution of interest points between different visual words within a small neighborhood. Table 3 compares our results with state-of-the-art results based on the BOVW model. The cited papers each have over 350 citations, and two of them are cited more than 1000 times, so they serve as a reliable benchmark.

Table 3. Comparison with state-of-the-art methods (recognition accuracy, %)
Method                 KTH    UCF Sports
Dollar et al. [12]     81.2   -
Bregonzio et al. [10]  93.1   -
Kovashka et al. [8]    94.5   87.2
Le et al. [3]          93.9   86.5
Wang et al. [7]        92.1   85.6
Ours (BOVW+STIPsMI)    95.29  87.33

In addition, STIPsMI based on the co-occurrence matrix can also be considered a dimension-reduction method from 2-D to 1-D that does not change the sparsity of the data; that is, the feature is still sparse if the matrix is sparse. This is one of the reasons why we use STIPsMI instead of 2DPCA or (2D)²PCA. More sophisticated parameter optimization methods will be explored in future work, for instance assigning different weights to the BOVW and STIPsMI features.