Human pose estimation based on an Improved High-Resolution Network

Human pose estimation has become a central problem in human-computer interaction and other intelligent applications, but the results obtained by previous deep learning networks are unsatisfactory for small-scale human instances. To address the problem of scale variation in human pose estimation, and in particular to pinpoint the keypoints of small-scale human instances, this paper proposes an improved High-Resolution Network (Improved HRNet). The main improvement is as follows: a dual attention mechanism is added to the forward pass of the parallel sub-networks, with the aim of assigning weights to the propagated information without changing the number of channels, giving high weights to useful information and reducing the interference caused by irrelevant information. The network structure is validated on the COCO dataset: the Average Precision (AP) of Improved HRNet is 66.4, which is 2.3 higher than that of the High-Resolution Network (HRNet), with only a 0.4% increase in parameters.


Introduction
With the gradual development of computer vision, human pose estimation, as an indispensable component of human-computer interaction, has come into the spotlight. Human pose estimation plays different roles in different fields and is widely used for human analysis in intelligent surveillance, medical rehabilitation and sports. Within human pose estimation, estimating the 2D human pose from an image or a video is a fundamental stage: applications such as human tracking, motion perception, 3D pose research and human-computer interaction all require accurate 2D human pose estimation as a foundation. The study of 2D pose estimation is therefore particularly important.
Human pose estimation is performed in two steps, namely human target detection and human keypoint detection. Currently, human pose estimation is mainly performed with deep learning methods, and the network used for keypoint detection is typically a deep convolutional neural network. These methods can be subdivided into two types: the first estimates keypoint heatmaps, and the second regresses keypoint positions directly. In the heatmap approach, the network extracts the corresponding feature information from each image, and the location with the highest heat value in each heatmap is selected as the keypoint.
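The heatmap decoding step described above can be sketched as follows; this is a minimal illustration of taking the argmax of each keypoint heatmap, not the paper's exact post-processing (which is not specified here).

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Pick the highest-response location in each keypoint heatmap.

    heatmaps: array of shape (K, H, W), one heatmap per keypoint.
    Returns a (K, 3) array of (x, y, confidence).
    """
    num_keypoints, height, width = heatmaps.shape
    keypoints = np.zeros((num_keypoints, 3))
    for k in range(num_keypoints):
        # flatten, find the peak, and convert back to (row, col)
        flat_index = np.argmax(heatmaps[k])
        y, x = divmod(flat_index, width)
        keypoints[k] = (x, y, heatmaps[k, y, x])
    return keypoints

# Toy example: a single 4x4 heatmap peaking at (x=2, y=1).
hm = np.zeros((1, 4, 4))
hm[0, 1, 2] = 0.9
kp = decode_heatmaps(hm)
print(kp)  # keypoint at x=2, y=1 with confidence 0.9
```

In practice the peak coordinates are mapped back from heatmap resolution to the original image resolution; that rescaling is omitted here.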
In 2016 the Stacked Hourglass Network (SHN) [1] was proposed, which consists of sub-networks connected in series from high to low resolution. The stacked hourglass network therefore operates by repeatedly downsampling from high to low resolution and then upsampling from low back to high across different resolutions. This process cannot fully exploit the spatial feature information, resulting in partial loss of spatial features, so the output high-resolution representations are less than perfect. The Cascaded Pyramid Network (CPN) proposed in [2] compensates for this drawback of SHN and is able to fuse low- and high-resolution feature map information during the upsampling operation.
In 2019 Sun et al. proposed the High-Resolution Network (HRNet) [3]. HRNet abandons the series connection used in previous networks and instead connects sub-networks of different resolutions in parallel, from high to low, which allows feature information to be used effectively during multi-scale fusion.

HRNet
Compared with previous network models, HRNet is a completely new architecture: it maintains a high-resolution feature sub-network throughout the forward pass. To realize the transmission of high-resolution features, the network is divided into multiple sub-networks. The backbone is the high-resolution sub-network, which always keeps the same resolution as the input feature map; low-resolution sub-networks are then gradually added through down-sampling operations, so that multiple sub-networks at different scales run in parallel. The parallel low-resolution sub-networks continuously exchange information through fusion, so that the feature maps of the high-resolution sub-network also contain the deeper features of the lower-resolution sub-networks. Because the backbone always carries high-resolution feature maps, less information is lost, and human keypoints are estimated more accurately. The HRNet network is shown in Figure 1, where the horizontal and vertical directions correspond to the depth of the network and the scale of the feature map, respectively.
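The fusion between parallel branches described above can be sketched for the two-branch case; this is an illustrative PyTorch sketch (channel counts and the nearest-neighbor upsampling are assumptions), not HRNet's exact exchange unit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchExchange(nn.Module):
    """Minimal sketch of HRNet-style fusion between a high-resolution
    and a low-resolution branch (channel counts are illustrative)."""
    def __init__(self, high_channels=32, low_channels=64):
        super().__init__()
        # low -> high: 1x1 conv to match channels, then upsample
        self.low_to_high = nn.Conv2d(low_channels, high_channels, 1)
        # high -> low: strided 3x3 conv to halve the resolution
        self.high_to_low = nn.Conv2d(high_channels, low_channels, 3,
                                     stride=2, padding=1)

    def forward(self, high, low):
        up = F.interpolate(self.low_to_high(low), size=high.shape[2:],
                           mode="nearest")
        down = self.high_to_low(high)
        # each branch keeps its own resolution and absorbs the other's features
        return high + up, low + down

high = torch.randn(1, 32, 64, 48)
low = torch.randn(1, 64, 32, 24)
new_high, new_low = TwoBranchExchange()(high, low)
print(new_high.shape, new_low.shape)  # each branch keeps its own size
```

The full network repeats such exchanges across every pair of parallel branches at each fusion stage.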

Improved HRNet
Inspired by SENet [4], this paper adds a channel attention mechanism to the forward pass of the parallel sub-networks. Channel attention computes a weight for each feature map so that the network focuses on the features carrying more useful information, reducing the interference of irrelevant information on feature extraction and exploiting the interdependence between features. The structure of channel attention is shown in Figure 2. In Figure 2, the input feature map X is first transformed by a convolution operation F_tr into a feature map U of size H × W × C. A feature vector of size 1 × 1 × C is then obtained by global average pooling, as shown in Equation (1):

z_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} U_c(i, j)    (1)

where U_c(i, j) is the pixel value at location (i, j) of the c-th channel of the feature map U. Then, the correlation between channels is modeled by two fully connected layers, which learn the degree of dependence on each channel and adjust the feature map according to that degree of dependence, as shown in Equation (2):

s = F_ex(z, W) = σ(W_2 δ(W_1 z))    (2)
Here F_ex is the excitation function, σ is the sigmoid activation, δ is the ReLU activation, and W_1 and W_2 are the weights of the two fully connected layers, which are adjusted continuously through training. Finally, the feature maps are rescaled with a different weight per channel through F_scale, as shown in Equation (3):

X̃_c = F_scale(U_c, s_c) = s_c · U_c    (3)
Adding channel attention makes it possible to exchange information without reducing the dimension of the sub-network. This not only preserves computational performance but also limits the complexity of the network model. The channel attention mechanism does not change the size or the number of channels of the feature map; instead, all pixels in the same channel are multiplied by the same value. If a channel is important, the weight applied to all of its elements is a high coefficient, and vice versa.
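The SE-style channel attention of Equations (1)–(3) can be sketched as follows; the reduction ratio is an illustrative choice, not a value given in the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze by global average pooling,
    excite with two fully connected layers (W1, W2), rescale channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # W2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))         # Eq. (1): global average pooling
        s = self.fc(z)                 # Eq. (2): per-channel weights
        return x * s.view(b, c, 1, 1)  # Eq. (3): rescale each channel

x = torch.randn(2, 32, 16, 12)
out = ChannelAttention(32)(x)
print(out.shape)  # unchanged: attention alters neither H, W nor C
```

Note that the output shape equals the input shape, matching the claim above that the mechanism changes neither the size nor the channel count of the feature map.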
In addition to the channel attention mechanism, a spatial attention mechanism is added to optimize feature extraction. In this paper, the spatial attention feature map is generated using the spatial relations among features. Unlike channel attention, spatial attention focuses on "where" on the feature map the informative content lies, and is used to complement channel attention. To compute spatial attention, average pooling and maximum pooling are first applied along the channel direction; according to [5], pooling along the channel direction can highlight useful information in the features. This yields two maps of size H × W × 1, in which each pixel holds the mean or the maximum over the channel direction at that location. The two pooled maps are then fused by a conventional convolution layer to obtain the final spatial attention map, which reweights the features already processed by the channel attention mechanism, improving the feature representation at the spatial scale. The structure of the spatial attention mechanism is shown in Figure 3, and the process is expressed as follows:

M_s(F) = σ( f^{7×7}( [AvgPool(F); MaxPool(F)] ) )

where σ is the sigmoid function and f^{7×7} is a 7 × 7 convolution operation used to fuse the output feature maps of the two pooling layers.
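The spatial attention step above can be sketched as a CBAM-style module; this is a minimal illustration of the pool-concatenate-convolve-sigmoid pipeline described in the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: average- and max-pool along the channel axis,
    concatenate the two H x W x 1 maps, fuse with a 7x7 convolution,
    and squash with a sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        # 2 input maps (avg, max) -> 1 attention map, same H x W
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)        # (B, 1, H, W)
        max_map = x.max(dim=1, keepdim=True).values  # (B, 1, H, W)
        attn = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn  # reweight every channel at each spatial location

x = torch.randn(2, 32, 16, 12)
out = SpatialAttention()(x)
print(out.shape)  # same shape as the input
```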
In this paper, a dual attention module is added to the forward pass of the parallel sub-networks, which lets the network learn channel importance and spatial importance separately. During feature transmission, the network focuses on the features with more useful information, reduces the interference of irrelevant information on feature extraction, exploits the interdependence between features, and improves the accuracy of keypoint prediction in the feature map. The structure of the dual attention module is shown in Figure 4.
Here M′ is the output feature of the dual attention module, M is its input, F_scale is the per-channel weight output by the channel attention module, and M_sf is the feature map output by the spatial attention module.
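Putting the two mechanisms together, the dual attention module can be sketched as channel attention followed by spatial attention on the channel-weighted features; the sequential ordering is an assumption based on the text's description, and the reduction ratio is illustrative.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Sketch of the dual attention module: channel attention (F_scale)
    first, then spatial attention (M_sf) on the channel-weighted
    features, producing M' from the input M."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.spatial_conv = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, m):
        b, c, _, _ = m.shape
        # channel attention: per-channel weights F_scale
        s = self.channel_fc(m.mean(dim=(2, 3))).view(b, c, 1, 1)
        m = m * s
        # spatial attention on the channel-weighted features
        pooled = torch.cat([m.mean(dim=1, keepdim=True),
                            m.max(dim=1, keepdim=True).values], dim=1)
        return m * torch.sigmoid(self.spatial_conv(pooled))

m = torch.randn(1, 32, 64, 48)
m_out = DualAttention(32)(m)
print(m_out.shape)  # M' keeps the size of the input M
```

Because neither sub-module changes the tensor shape, the block can be dropped into the forward pass of each parallel branch without altering the rest of the network, which is consistent with the small (+0.4%) parameter increase reported.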

Datasets
This paper uses the COCO dataset. The COCO dataset contains more than 200,000 images and more than 250,000 human instances labeled with 17 keypoints. The model is trained on the COCO2017 training set, which includes more than 57,000 images and 150,000 human instances. In addition, the network structure is evaluated on the COCO2017 validation set and the COCO2017 test set, which contain 5,000 and 20,000 images respectively.

Image preprocessing
Since the original images in the MSCOCO2017 dataset have different sizes, the images must be preprocessed before training. The preprocessing consists of the following two steps: (1) Each dataset image is cropped around the hip of the main human instance; the crop is adjusted to a fixed height:width ratio of 4:3 and resized to 256×192, which is convenient for network training. (2) Data augmentation with random rotation (−45°, 45°) and random scaling (0.65, 1.35) is applied. For the larger input configuration, the image is likewise cropped with the hip of the main human instance as the center and, with the same height:width ratio of 4:3, resized to 384×288 to achieve the input size required for network training.
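The two steps above can be sketched as follows; the function and parameter names are illustrative, and the actual affine warp to 256×192 (or 384×288) pixels is omitted.

```python
import random

def augmentation_params(rotation_range=45, scale_range=(0.65, 1.35)):
    """Sample the random rotation (degrees) and scale factor
    described in the text."""
    angle = random.uniform(-rotation_range, rotation_range)
    scale = random.uniform(*scale_range)
    return angle, scale

def crop_box(center_x, center_y, box_w, box_h, aspect=3 / 4):
    """Expand a person box around its center to the fixed width:height
    ratio of 3:4 (i.e. height:width 4:3) before resizing to 256x192."""
    if box_w > box_h * aspect:
        box_h = box_w / aspect   # too wide: grow the height
    else:
        box_w = box_h * aspect   # too tall: grow the width
    return (center_x - box_w / 2, center_y - box_h / 2, box_w, box_h)

# A 30x80 person box centered at (100, 100) is widened to 60x80 (3:4).
print(crop_box(100, 100, 30, 80))
```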

Evaluation criteria
Experiments are carried out on the COCO2017 dataset with two evaluation measures: (1) the Object Keypoint Similarity (OKS) defined by MS COCO; (2) the Percentage of Correct Keypoints (PCK). Table 1 summarizes the results on the COCO2017 test-dev set. As the table shows, Improved HRNet surpasses the previous methods and becomes the best bottom-up method (AP 70.5). The improved high-resolution network proposed in this paper performs better than the original high-resolution network, with only a slight increase in parameters (+0.4%). Figure 5 shows the experimental comparison between HRNet and Improved HRNet on a single-person simple pose, a two-person simple pose and a multi-person complex pose, which further reflects the superiority of the improved network structure.
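The OKS measure used above is defined by MS COCO as a Gaussian similarity per keypoint, averaged over the labeled keypoints; a minimal implementation:

```python
import numpy as np

def object_keypoint_similarity(pred, gt, visibility, area, kappas):
    """OKS as defined for MS COCO keypoint evaluation.

    pred, gt: (K, 2) arrays of (x, y) coordinates.
    visibility: (K,) array; v > 0 marks a labeled keypoint.
    area: object segment area (s^2 in the OKS formula).
    kappas: (K,) per-keypoint falloff constants.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)       # squared distances
    labeled = visibility > 0
    e = np.exp(-d2 / (2 * area * kappas ** 2))  # Gaussian similarity
    return e[labeled].mean()                    # average over labeled kps

# A perfect prediction gives OKS = 1.0.
gt = np.array([[10.0, 10.0], [20.0, 20.0]])
oks = object_keypoint_similarity(gt, gt, np.array([2, 2]),
                                 area=100.0, kappas=np.array([0.1, 0.1]))
print(oks)  # 1.0
```

AP is then obtained by thresholding OKS (0.50 to 0.95 in steps of 0.05, in the COCO protocol) and averaging the resulting precisions.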

Conclusions
In this paper, the high-resolution network is improved, and the improved high-resolution network structure is described in detail. A dual attention mechanism is added to the forward pass to perform feature extraction without changing the number of channels, focusing on the features with more useful information and reducing the interference of irrelevant information. With this Improved HRNet, the proposed method surpasses the previous methods and becomes the best bottom-up method (AP 70.5).

Figure 5 Comparison of experimental results