A Novel Channel Attention Mechanism for Human Action Recognition Based on Convolutional Kernel

With improvements in computer performance, deep learning has gradually expanded from 2D image tasks to 3D video tasks. Human action recognition is a typical 3D video task, which achieves category classification by capturing the characteristics of human actions. However, most videos are currently processed by encoding and decoding technology, so motion details become blurry, which makes human action recognition difficult. To address this problem, we utilize the attention mechanism to “ignore” the blurred features caused by video encoding and decoding, and we embed this attention mechanism in a 3D spatiotemporal CNN. Compared with a plain 3D CNN, the effectiveness of our method is verified on the UCF101 and HMDB51 datasets. Our method also shows that the convolutional kernel implies channel-wise dependence. Although the improvement of the proposed method is limited, we hope the channel attention mechanism can help researchers study neural networks.


Introduction
Recently, the convolutional neural network (CNN) has proven effective in 3D video tasks. However, are the features of moving objects really acquired by CNNs? When we observe videos, we always focus on the regions of interest and the meaningful dynamic content, and we empirically ignore some irrelevant details. Because of this characteristic of human observation, some researchers have studied how to achieve a better viewing experience with less bandwidth and storage memory, i.e., video encoding and decoding technology. For example, the H.264 standard, a typical video coding and decoding standard, was developed in response to the growing need for higher compression of moving pictures [1]. As shown in Figure 1, the details of the hand and the arrow are blurry when they move too fast. Therefore, we propose a novel channel attention mechanism based on the convolutional kernel, embedded in a 3D CNN for human action recognition, to "ignore" the blurry regions.

3D spatiotemporal CNN
With the advent of AlexNet [2], researchers sought to transfer the successful experience of 2D CNNs to 3D CNNs. For example, Shuiwang Ji et al. [3] first proposed a 3D CNN for human action recognition. Kensho Hara et al. [4] then modified 2D ResNet into 3D ResNet for human action recognition in 2017. In the same year, Joao Carreira et al. [5] proposed the I3D method, built on an ImageNet pre-trained model. CNN+LSTM is another effective approach to human action recognition, distinct from 3D CNN. In recent years, with the growth of datasets, the recognition accuracy of 3D CNNs has been greatly improved by pre-training on large-scale datasets.

Attention mechanism
The attention mechanism can selectively emphasize informative features and suppress less useful ones [6]. In recent years, a variety of attention mechanisms have been proposed for computer vision. One is the channel attention mechanism, which allocates weights to feature maps along the channel dimension, such as SEnet [6]. Another is the spatial attention mechanism, which allocates weights to all feature maps in the spatial dimension. CBAM [7] and BAM [8], for example, use channel attention and spatial attention at the same time. However, these attention mechanisms obtain dependence from the feature maps, while the convolutional kernel may also be an important source of channel-wise dependence.
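As a concrete illustration of the channel attention described above, here is a minimal NumPy sketch of an SE-style block: squeeze by global average pooling, excite with a two-layer bottleneck, and rescale the channels. The function name, weight matrices `w1`/`w2`, and the reduction ratio are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def se_channel_attention(feature_maps, w1, w2):
    """SE-style channel attention on a (C, T, H, W) feature map.

    Squeeze:    global average pooling over T, H, W -> one descriptor per channel.
    Excitation: bottleneck MLP (ReLU then sigmoid) -> per-channel weights in (0, 1).
    Scale:      reweight each channel of the input.

    w1 has shape (C // r, C) and w2 has shape (C, C // r), where r is the
    (assumed) reduction ratio.
    """
    # Squeeze: (C, T, H, W) -> (C,)
    z = feature_maps.mean(axis=(1, 2, 3))
    # Excitation: s = sigmoid(w2 . relu(w1 . z))
    s = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))
    # Scale: broadcast the channel weights over T, H, W
    return feature_maps * s[:, None, None, None]
```

Because the gate `s` lies in (0, 1), each channel of the output is a softly attenuated copy of the corresponding input channel.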

Method
Normally, a convolution operation takes $X \in \mathbb{R}^{C' \times T' \times H' \times W'}$ as input feature maps and generates output feature maps $U = [u_1, u_2, \ldots, u_C] \in \mathbb{R}^{C \times T \times H \times W}$, where $k_c$ refers to the parameters of the $c$-th filter. The outputs can be written as $u_c = k_c * X$, where $*$ denotes convolution.

Our method, named kernel-channel attention, is inspired by SEnet, which acquires channel-wise dependence from the feature maps. In contrast, we consider that the convolutional kernel itself implies channel-wise dependence. Thus, we squeeze the convolutional kernel to get the weight of each channel.
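To make the notation $u_c = k_c * X$ concrete, the following naive NumPy loop applies one filter $k_c$ (spanning all input channels) per output channel. It is a shape-level sketch only; as is conventional in deep learning, the operation implemented is cross-correlation without padding or stride.

```python
import numpy as np

def conv3d_naive(x, kernels):
    """Naive 'valid' 3D convolution matching the notation u_c = k_c * X.

    x:       input feature maps,  shape (C_in, T, H, W)
    kernels: one filter per output channel, shape (C_out, C_in, t, h, w)
    returns: output feature maps U, shape (C_out, T-t+1, H-h+1, W-w+1)
    """
    c_out, c_in, t, h, w = kernels.shape
    assert c_in == x.shape[0]
    T, H, W = x.shape[1:]
    u = np.zeros((c_out, T - t + 1, H - h + 1, W - w + 1))
    for c in range(c_out):                    # each filter k_c yields one u_c
        for i in range(T - t + 1):
            for j in range(H - h + 1):
                for l in range(W - w + 1):
                    patch = x[:, i:i + t, j:j + h, l:l + w]
                    u[c, i, j, l] = np.sum(kernels[c] * patch)
    return u
```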
The kernel-channel attention consists of squeeze and excitation, as shown in Figure 2.
Figure 2. The structure of kernel-channel attention.

Squeeze
To obtain the channel weights, we first clone the convolutional kernel that generated the feature map and then squeeze it into a per-channel descriptor.
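The original squeeze equation was lost in extraction; a plausible SE-style formulation, assuming a global average over the parameters of the $c$-th kernel $k_c \in \mathbb{R}^{C' \times t \times h \times w}$, would be:

```latex
% Assumed SE-style squeeze applied to the kernel parameters
% (the paper's exact pooling is not recoverable from this text)
z_c = F_{sq}(k_c)
    = \frac{1}{C' \, t \, h \, w}
      \sum_{m=1}^{C'} \sum_{i=1}^{t} \sum_{j=1}^{h} \sum_{l=1}^{w} k_c(m, i, j, l)
```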

Excitation
Then we perform the excitation operation on the information aggregated by the squeeze operation to fully capture channel-wise dependencies. Finally, we redistribute the channel weights to the feature maps $U$ and obtain the ultimate feature maps $\tilde{X}$.
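Putting squeeze, excitation, and rescaling together, the whole module can be sketched in NumPy as follows. This is a sketch under assumptions: the excitation is taken to be SEnet's $s = \sigma(W_2\,\delta(W_1 z))$ form (the paper's exact equations were lost in extraction), and the weight matrices `w1`/`w2` are illustrative names.

```python
import numpy as np

def kc_attention(u, kernels, w1, w2):
    """Sketch of kernel-channel (KC) attention, assuming SEnet-style excitation.

    u:       feature maps produced by the convolution, shape (C, T, H, W)
    kernels: the kernels that produced u, shape (C, C_in, t, h, w)
    w1, w2:  assumed excitation weights, shapes (C // r, C) and (C, C // r)
    """
    # Squeeze: one descriptor per channel from the kernel parameters,
    # not from the feature maps as in SEnet.
    z = kernels.mean(axis=(1, 2, 3, 4))              # (C,)
    # Excitation (assumed SE form): s = sigmoid(w2 . relu(w1 . z))
    s = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))
    # Redistribute the channel weights onto the feature maps: x~_c = s_c * u_c
    return u * s[:, None, None, None]
```

Note the design choice this illustrates: because `z` depends only on the kernel parameters, the channel weights are fixed for a given set of learned kernels rather than recomputed from each input.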

Experiments
In this section, we present the details and results of our experiments. We choose the 3D ResNet34 (R3D-34) network for comparison and train the models from scratch on the UCF101 [9] and HMDB51 [10] datasets. We report the accuracy on split 1. Table 1 provides the architectures of R3D-34 and our method. The input is 16 consecutive RGB frames (i.e., L=16), and each frame is resized and cropped to 112×112 pixels. On this basis, we build our method: from conv2 to conv5, we add the kernel-channel attention module (KC-Attention).

Details
All experiments were conducted on a server with an NVIDIA RTX 2080Ti GPU running Windows 10. The specific parameter settings are as follows. Table 2 compares our method with the original R3D-34. The top-1 accuracy of our method is improved by 0.53% and 0.85% on the UCF101 and HMDB51 datasets, respectively, and the top-5 accuracy is improved by 2.01% and 2.77%, respectively. As shown in Figure 4, the loss of the model with the KC-attention mechanism is lower on both the UCF101 and HMDB51 datasets. Compared with the original 3D ResNet34, the 3D CNN with embedded KC-attention achieves higher accuracy and lower loss. In addition, the improvement is more obvious on the HMDB51 dataset in terms of both accuracy and loss. These results demonstrate the effectiveness of our method.
Although the proposed kernel-channel attention mechanism is effective, these results alone cannot prove that it overcomes the interference caused by the frame blurring introduced by video encoding and decoding technology.

Conclusion
In this paper, we proposed a kernel-channel attention mechanism based on 3D ResNet34 to achieve video classification. Compared with the original 3D ResNet34, our method improved the accuracy on the UCF101 and HMDB51 datasets, which indicates that the convolutional kernel also contains channel-wise dependence. The proposed kernel-channel attention mechanism has only been verified on the 3D ResNet34 architecture, and whether it is effective on other architectures still requires further experiments. We hope that our kernel-channel attention mechanism can inspire new constructions of attention mechanisms. In the future, we will further explore the impact of video encoding and decoding technology on human action recognition and investigate whether the attention mechanism can "ignore" the blurry regions in video frames.