ICANet: optimization of lightweight human pose estimation based on HRNet

Existing high-resolution human pose estimation models have limited applicability in practice due to their large parameter counts and high computational complexity. To address these issues, this paper proposes ICANet, a lightweight inverted residual coordinate attention network for human pose estimation built on the high-resolution network (HRNet). By introducing the CoordAttention mechanism and the inverted residual module, two lightweight network modules, ICAneck and ICAblock, are proposed; they not only reduce the model's parameter count and computational complexity, but also enhance long-range dependencies and precise positional information along the spatial directions of the feature map. Experimental results show that, compared with HRNet, the proposed ICANet reduces the parameter count by 53.7% and the computational complexity by 32.4% on the COCO validation set, and reduces the parameter count by 53.7% and the computational complexity by 32.6% on the MPII validation set. Practical applications show that ICANet still achieves high-precision detection of human key points with fewer parameters and lower computational complexity, and therefore has better applicability and practicality than common human pose estimation networks such as the Stacked Hourglass Network (Hourglass), the Cascaded Pyramid Network (CPN), and SimpleBaseline.


Introduction
As an important research hotspot in computer vision, human pose estimation predicts the accurate spatial positions of human body key points from input images [1], and is widely used in person re-identification, pedestrian detection, and human-computer interaction.
In the development of human pose estimation research, graph structure models were a mainstream traditional algorithm. However, due to the diversity of human postures, such algorithms struggle with feature extraction and with detection accuracy and efficiency, making them difficult to apply in practical scenarios.
In recent years, with the rise of deep convolutional neural networks in computer vision [2], human pose estimation networks have fallen into two major categories according to their prediction targets: coordinate regression networks [3] and heat map prediction networks [4]. A coordinate regression network treats human pose estimation as a regression problem over the body's key points, but its key point detection accuracy falls short of expectations, which led to the emergence of heat map prediction networks. Such a network generates a Gaussian mask centered at each key point to form a heat map for prediction.

Human pose estimation network
At present, human pose estimation models follow two mainstream frameworks: Top-Down and Bottom-Up. The Top-Down framework first uses a human body detector [10] to obtain a bounding box for each person in the image; the bounding boxes are then cropped to a fixed size and fed into a human pose estimation network for individual pose estimation. Consequently, the accuracy of key point detection depends on the performance of the human body detector, but the Top-Down framework currently achieves the highest accuracy. As a lightweight Top-Down network, the Spatial Shortcut Network (SSN) [11] contains feature mobility modules consisting of a main module and an attention mechanism module: the former obtains feature information from the feature map, while the latter determines the relevance between the displaced feature region and the original one; the final pose estimate is then produced through feature fusion. By reducing the cost of information flow, the SSN lowers both the model's parameter count and its computational complexity. In 2022, Wang proposed an edge real-time multi-person pose estimation model [12], observing that high-resolution network models have high computational costs and are unsuitable for practical applications; the model therefore removes the low-resolution branches to improve runtime efficiency and performance.
In contrast, the Bottom-Up framework uses no human detector; it first detects all human key points in the input image and then groups them, according to each model's grouping principle, to obtain human poses. Although this approach runs fast and offers good real-time performance, grouping errors are prone to occur when key points of the same type lie close to each other, so designing effective grouping information is crucial for Bottom-Up algorithms. Using this framework, the Associative Embedding (AE) method [13] adopts Hourglass as the backbone to detect and group key points: labels are introduced to determine grouping, key points of the same person are matched only with those sharing the same label, and the final pose is generated accordingly. In 2021, Geng proposed Disentangled Keypoint Regression (DEKR) [14], a multi-branch model in which each branch uses adaptive convolutional kernels to learn pixel features around different key points and uses the learned features to regress the spatial positions of those key points.

Inverted residual module
MobileNet V2 is designed as a lightweight network and proposes the inverted residual module. Building on the inverted residual module, EfficientNet [15] finds a balance among network depth, network width, and image resolution while balancing the lightness and performance of the model, and MixNet [16] rethinks the impact of convolutional kernel scale on model accuracy.

Attention mechanism
An attention mechanism is implemented by learning convolutional kernels and global average pooling layers to reassign weights to feature channels and spatial pixel values. Studies show that adding a lightweight attention mechanism can improve a model's performance with only a small increase in parameter count and computational complexity. The two most commonly used attention mechanisms are channel attention and spatial attention. Channel attention recomputes the weights of the different channels of the input feature map, distinguishing their importance by the values assigned to them, while spatial attention focuses on the correlation of spatial information in the feature map and improves the localization accuracy of regions of interest by assigning weights to pixels.
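As an illustration of the channel attention idea described above, here is a minimal SE-style channel attention block in PyTorch. This is a generic sketch, not the attention module used in ICANet; the reduction ratio of 4 is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: global average pooling (squeeze),
    then two 1x1 convolutions that produce per-channel weights."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                            # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(self.pool(x))             # re-weight each channel

x = torch.randn(1, 32, 16, 16)
y = ChannelAttention(32)(x)
assert y.shape == x.shape
```

Because the weights pass through a sigmoid, the block can only attenuate or preserve channels, which keeps it stable to train as an add-on.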

Lightweight network
Currently, lightweight network design improves deep convolutional neural networks mainly through spatially separable convolution and depthwise separable convolution, reducing parameter count and computational complexity to achieve the goal of a lightweight network.
Spatially separable convolution splits a convolution kernel along its spatial dimensions and then applies the two smaller kernels in sequence, for example splitting a 3×3 kernel into a 3×1 kernel followed by a 1×3 kernel to reduce the computation; however, not all convolution kernels can be split this way. Depthwise separable convolution differs in how it splits: it decomposes the convolution into Depthwise Convolution (Dwise) [17] and Pointwise Convolution (Pwise) [18], significantly lowering the model's parameter count, although this can result in insufficient feature extraction.
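The Dwise/Pwise decomposition can be sketched in PyTorch as follows; the 64→128 layer sizes are illustrative only, chosen to show the parameter saving over a standard 3×3 convolution.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        # Depthwise (Dwise): one k x k filter per input channel (groups=c_in)
        self.dwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False)
        # Pointwise (Pwise): 1 x 1 convolution mixes information across channels
        self.pwise = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.pwise(self.dwise(x))

std = nn.Conv2d(64, 128, 3, padding=1, bias=False)   # standard convolution
sep = DepthwiseSeparableConv(64, 128)
n_std = sum(p.numel() for p in std.parameters())     # 64*128*9 = 73728
n_sep = sum(p.numel() for p in sep.parameters())     # 64*9 + 64*128 = 8768
assert n_sep < n_std
```

For a 3×3 kernel this cuts the parameter count by roughly a factor of eight at these channel widths, which is exactly the saving the lightweight modules below exploit.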

ICANet
This paper makes improvements with HRNet as the basic architecture and proposes ICANet, whose structure is shown in Figure 1.
Figure 1 shows that ICANet has four parallel branches with different resolutions and channel counts, and is divided into four stages of sub-networks, namely Stage1, Stage2, Stage3, and Stage4, according to those differences. The specific processing procedure of ICANet is as follows:
(1) A multi-scale preprocessing module reduces the feature map resolution to 1/4 of the original and transforms the 3-channel RGB image into 64 channels.
(2) The preprocessed feature map is taken as the input to Stage1, where four ICAneck modules extract features from it.
(3) The output of Stage1, processed by the information exchange unit, serves as the input to Stage2. In the following three stages, ICAblock modules with different resolutions (1/4, 1/8, 1/16, 1/32) and channel counts (C, 2C, 4C, 8C) extract features from the feature map.
This paper adopts a network architecture with C=32, and adjusts the resolution and channel count of the feature maps between stages. Following reference [6], doubling the number of feature map channels whenever the resolution is halved can compensate for the loss of spatial positioning caused by the reduced resolution.
During heavy downsampling, human pose details in the feature map are quickly lost, and it is difficult to improve key point prediction accuracy even if feature information learned from the blurred low-resolution maps is fused with the features extracted from the high-resolution feature maps in the upper layer. The implementation involves the multi-scale preprocessing, inverted residual, CoordAttention, ICAneck, and ICAblock modules.

Multi-scale preprocessing module
Because feature maps in the early stage of ICANet have higher spatial resolution and contain more positional information about human key points, the preprocessing stage uses a multi-scale preprocessing module with two 3×3 convolutional kernels and two 5×5 ones for feature extraction from the input images. The extracted features are then fused before being used as inputs to Stage1. The specific structure of the multi-scale preprocessing module is shown in Figure 2.
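A hypothetical PyTorch sketch of this module is given below. The text specifies two 3×3 and two 5×5 kernels, a 1/4 resolution reduction, and a 3→64 channel transformation, but not the fusion scheme; the parallel-branch layout, the equal 32/32 channel split, and concatenation as the fusion are assumptions made here for illustration.

```python
import torch
import torch.nn as nn

class MultiScalePreprocess(nn.Module):
    """Sketch: a 3x3 branch and a 5x5 branch, each with two stride-2
    convolutions, reduce resolution to 1/4; outputs are fused by
    channel concatenation (3 -> 64 channels)."""
    def __init__(self, c_out: int = 64):
        super().__init__()
        c = c_out // 2
        def branch(k):
            p = k // 2
            return nn.Sequential(
                nn.Conv2d(3, c, k, stride=2, padding=p, bias=False),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True),
                nn.Conv2d(c, c, k, stride=2, padding=p, bias=False),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            )
        self.b3, self.b5 = branch(3), branch(5)

    def forward(self, x):
        return torch.cat([self.b3(x), self.b5(x)], dim=1)

x = torch.randn(1, 3, 256, 192)
y = MultiScalePreprocess()(x)
assert y.shape == (1, 64, 64, 48)   # 1/4 resolution, 64 channels
```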

Inverted residual module
The structure of the inverted residual module is shown in Figure 3. To reduce the parameter count, the module first takes a low-dimensional feature tensor as input and applies a 1×1 convolution to expand its channels into a high-dimensional feature tensor, which is then spatially encoded by a depthwise convolution. Finally, a 1×1 convolution maps the high-dimensional tensor back to a low-dimensional one. Following reference [6], the skip connection is omitted when the input and output channel counts differ.
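The expand / depthwise / project sequence described above can be sketched in PyTorch as follows. The ReLU6 activations and the linear (activation-free) projection are conventional MobileNetV2 choices, not confirmed by the text, and the tensor sizes are illustrative.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual: 1x1 expansion (factor T),
    3x3 depthwise convolution, 1x1 linear projection. The skip connection
    is used only when input and output channel counts match."""
    def __init__(self, c_in: int, c_out: int, t: int = 1):
        super().__init__()
        c_mid = c_in * t
        self.use_skip = (c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),                           # expand
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, padding=1, groups=c_mid, bias=False), # depthwise
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False),                          # project
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

x = torch.randn(1, 32, 64, 48)
assert InvertedResidual(32, 32, t=1)(x).shape == x.shape
```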
The parameter count and computational complexity of the inverted residual module are:

P_IRM = C_in·(T·C_in) + 3²·(T·C_in) + (T·C_in)·C_out    (1)
F_IRM = H·W·[C_in·(T·C_in) + 3²·(T·C_in) + (T·C_in)·C_out]    (2)

where C_in and C_out are the input and output channel counts of the module, T is the scaling coefficient, and H and W are the height and width of the feature map, respectively.
To verify the impact of the scaling coefficient T on key point detection, comparative experiments with different scaling coefficients are conducted on the MPII dataset; the results are shown in Table 1. T=1 outperforms T=2 in detecting the different key points of the human body, with overall performance higher by 0.8 percentage points. Meanwhile, the model's parameter count and computational complexity grow in proportion to T. Therefore, the scaling coefficient is set to T=1 in this experiment.

CoordAttention module
The CoordAttention module not only obtains feature information between channels, but also captures accurate position information of human key points in spatial directions.
The module first extracts feature information from each channel of the input feature map using adaptive average pooling along the horizontal and vertical directions. Next, it concatenates the two resulting feature maps and applies a pointwise convolution to generate features encoding both spatial directions. The feature map is then split along the spatial directions, and pointwise convolutions restore the channel count to that of the input feature map. Finally, the attention weights obtained for the two spatial directions are multiplied with the input feature map to produce an attention-weighted feature map. The specific structure of the CoordAttention module is shown in Figure 4. The module uses three pointwise convolutions in total, so its parameter count and computational complexity are:

P_CA = C_in·C_mid + 2·C_mid·C_out    (3)
F_CA = (H+W)·(C_in·C_mid + C_mid·C_out)    (4)

where C_in is the input channel count, C_out the output channel count, and C_mid the compressed feature channel count; H and W are the height and width of the feature map, respectively.
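The steps above can be sketched in PyTorch following the published CoordAttention design; the reduction ratio used to obtain C_mid is an assumption here, and batch normalization/non-linear activation details are simplified.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Directional average pooling along H and W, a shared pointwise
    convolution on the concatenated result, then two pointwise
    convolutions producing per-direction attention weights."""
    def __init__(self, c_in: int, c_out: int, reduction: int = 8):
        super().__init__()
        c_mid = max(8, c_in // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # B x C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # B x C x 1 x W
        self.conv1 = nn.Conv2d(c_in, c_mid, 1)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(c_mid, c_out, 1)
        self.conv_w = nn.Conv2d(c_mid, c_out, 1)

    def forward(self, x):
        _, _, h, w = x.shape
        xh = self.pool_h(x)                             # B x C x H x 1
        xw = self.pool_w(x).permute(0, 1, 3, 2)         # B x C x W x 1
        y = self.act(self.conv1(torch.cat([xh, xw], dim=2)))
        yh, yw = torch.split(y, [h, w], dim=2)          # split the two directions
        ah = torch.sigmoid(self.conv_h(yh))                          # B x C x H x 1
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))      # B x C x 1 x W
        return x * ah * aw                              # broadcast over both axes

x = torch.randn(1, 32, 64, 48)
assert CoordAttention(32, 32)(x).shape == x.shape
```

Because `ah` broadcasts over width and `aw` over height, each output pixel is modulated by one weight per row and one per column, which is what lets the module encode precise positions in both spatial directions.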

ICAneck module and ICAblock module
This article proposes two basic modules for network models, namely ICAneck and ICAblock modules, whose structures are shown in Figure 5.
Even after removing the preprocessing module and the information exchange units, HRNet still has a large parameter count and high computational complexity, mainly caused by the Bottleneck and Basicblock modules. Therefore, this paper redesigns HRNet with the ICAneck and ICAblock modules. First, the standard 3×3 convolution is replaced with lightweight depthwise convolution; second, the scaling coefficient of the reduction and expansion layers of the inverted residual module is reconsidered to preserve the ability to extract features from the feature map; lastly, CoordAttention is added to each module to capture cross-channel feature information and accurate spatial position information.
Overall, after the 3×3 convolutions are replaced, the ICAneck structure becomes similar to the inverted residual module, while the ICAblock module adds an inverted residual module between two depthwise convolutions, which preserves the reduction in parameter count and computational complexity and extracts feature information more effectively by compressing and expanding the feature channels.
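To make the depthwise conv → inverted residual → depthwise conv layout concrete, here is a hypothetical sketch of an ICAblock-style module. The exact layer ordering, normalization, and attention placement are assumptions; in particular, a simple SE-style channel gate stands in for the real CoordAttention module to keep the sketch short, and the HRNet-style residual connection is retained as the text describes.

```python
import torch
import torch.nn as nn

def dwise(c):
    """Lightweight 3x3 depthwise convolution (one filter per channel)."""
    return nn.Sequential(
        nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),
        nn.BatchNorm2d(c), nn.ReLU(inplace=True))

class ICABlockSketch(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.dw1, self.dw2 = dwise(c), dwise(c)
        self.ir = nn.Sequential(                  # inverted residual with T=1
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c),
        )
        self.att = nn.Sequential(                 # stand-in attention gate
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.dw2(self.ir(self.dw1(x)))
        return x + y * self.att(y)                # residual kept, output re-weighted

x = torch.randn(1, 32, 64, 48)
assert ICABlockSketch(32)(x).shape == x.shape
```

With T=1 every layer keeps the channel count at C, so no channel conversion is needed anywhere inside the block, consistent with the design discussed below.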
Both the ICAneck and ICAblock modules adopt a CoordAttention module for better performance. To make the lightweight network run faster, this paper takes the scaling coefficient T=1 in the inverted residual module. When designing the basic modules of ICANet, the original residual architecture of the HRNet basic modules is retained, and the input and output channel counts of both the inverted residual module and the CoordAttention module inside a basic module are kept equal. From the parameter-count formulas of the HRNet and ICANet basic modules, it can be concluded that the parameter-reduction ratio during network training is inversely proportional to the scaling coefficient T. Equations (9) and (10) give the parameter-reduction ratios of the first ICAneck module and of the three following ICAneck modules in Stage1, respectively. The first module needs a 1×1 convolution to convert its input channels (C_in=64) to its output channels (C_out=256), while the other three modules require no channel conversion; since the skip connection adds feature maps without channel conversion, the two ratios r_neck1 and r_neck2 differ. Equation (11) gives the parameter-reduction ratio of the ICAblock module relative to the Basicblock module in Stage2, Stage3, and Stage4. Because the scaling coefficient is T=1, the input and output channel counts of both the inverted residual module and the CoordAttention module are equal, so no channel conversion is needed inside the ICAblock module, and the parameter-reduction ratio of the ICAblock modules on the parallel branches of the different stages can be expressed by a single ratio r_block.

Dataset description
The COCO dataset [19] is used for training and validation. In this experiment, the training set contains 118,287 images and the validation set contains 5,000 images. The annotations cover 17 key points of the entire body.

Evaluation criteria
The validation standard used here is Object Keypoint Similarity (OKS), where AP50 represents the accuracy of detected key points at OKS=0.5, AP75 the accuracy at OKS=0.75, mAP the average accuracy of predicted key points over the 10 thresholds OKS=0.50, 0.55, …, 0.90, 0.95, APM the accuracy for medium-size objects, APL the accuracy for large-size objects, and AR the average recall over the same 10 thresholds. The specific computation is shown in Equation (12):

OKS = Σᵢ exp(−dᵢ² / (2·s²·kᵢ²))·δ(vᵢ > 0) / Σᵢ δ(vᵢ > 0)    (12)

where dᵢ represents the Euclidean distance between the i-th detected key point and the annotated key point in the dataset, vᵢ is the visibility flag of the ground-truth key point, s is the target scale, kᵢ is the attenuation constant of each key point type, and s·kᵢ represents the standard deviation of each key point. The similarity of each detected key point lies within the range [0, 1], and the larger the OKS, the higher the accuracy of key point prediction.
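Equation (12) can be computed directly, for example as follows (a minimal NumPy sketch with illustrative inputs):

```python
import numpy as np

def oks(pred, gt, vis, s, k):
    """Object Keypoint Similarity for one person.
    pred, gt: (N, 2) predicted / ground-truth keypoint coordinates
    vis:      (N,)  visibility flags (v_i > 0 means annotated)
    s:        object scale; k: (N,) per-keypoint attenuation constants."""
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared Euclidean distances d_i^2
    e = np.exp(-d2 / (2 * s ** 2 * k ** 2))      # per-keypoint similarity in [0, 1]
    m = vis > 0                                  # only annotated keypoints count
    return e[m].sum() / max(m.sum(), 1)

gt = np.zeros((17, 2))
pred = np.zeros((17, 2))                         # perfect prediction
vis = np.ones(17)
k = np.full(17, 0.1)
assert abs(oks(pred, gt, vis, 1.0, k) - 1.0) < 1e-9   # OKS = 1 when d_i = 0
```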

Training implementation details
The experimental environment is configured as follows: an Ubuntu 18.04 LTS 64-bit system, one GeForce RTX 3090 graphics card, and the PyTorch 1.8.1 deep learning framework.
During training, images in the COCO dataset are cropped and scaled to a fixed 256×192 size, the minimum batch size per GPU is 32, and both random rotation and horizontal flipping are used for data augmentation. With Adam as the optimizer, training runs for a total of 220 epochs; the learning rate starts at 1e-3, decays to 1e-4 at epoch 170, and decays again to 1e-5 at epoch 210.
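This schedule maps directly onto a step decay, as in the PyTorch sketch below; the one-layer model is a stand-in, and the training loop body is omitted.

```python
import torch
import torch.nn as nn

# Adam with initial lr 1e-3, decayed x0.1 at epochs 170 and 210, 220 epochs total.
model = nn.Conv2d(3, 17, 1)                      # stand-in for the pose network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[170, 210], gamma=0.1)

lrs = []
for epoch in range(220):
    # ... one training epoch over 256x192 crops with flip/rotation augmentation ...
    optimizer.step()                             # placeholder for the update loop
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

assert abs(lrs[0] - 1e-3) < 1e-12                # before the first milestone
assert abs(lrs[169] - 1e-4) < 1e-8               # after epoch 170
assert abs(lrs[219] - 1e-5) < 1e-9               # after epoch 210
```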

Experimental validation analysis
To verify the accuracy of ICANet in detecting human key points, this paper trains on the COCO training set and validates directly on the COCO validation set. The experimental results in Table 2 show that the improved model has fewer parameters and lower computational complexity, and achieves decent key point detection performance compared with other advanced models.
In terms of computational cost, Equations (9), (10), and (11) predict a reduction of about 70.8% in the total parameter count of the ICAneck and ICAblock basic modules compared with HRNet, and Equations (2) and (4) predict a decrease of about 89.5% in computational complexity. In practice, the reductions in parameter count and computational complexity are 53.7% and 32.4%, respectively, because the model also includes the improved multi-scale preprocessing module and the information exchange modules.
In terms of model performance, although ICANet loses 0.2 percentage points in the key point detection indicator mAP compared with HRNet, it increases AP50 by 3 percentage points and APM by 0.5 percentage points. The increase in AP50 indicates that ICANet focuses more on detecting all key points in the image, with only slight deviations in their positions, while the increase in APM shows the model's advantage in detecting medium-scale human key points more accurately. The other performance indicators are on par with those of HRNet.
Compared with the latest lightweight Lite-HRNet-18 and Lite-HRNet-30 models [20], the proposed ICANet raises mAP by 8.4 and 6.0 percentage points, respectively, though at the cost of a higher parameter count and computational complexity; it also outperforms both models in all key point detection indicators. Compared with popular human pose estimation networks such as Hourglass, CPN [21], CPN+OHKM, G-RMI [22], and IPR [23], ICANet improves the average key point prediction accuracy by 6.3, 4.6, 3.8, 8.3, and 5.4 percentage points, respectively. Compared with recent models such as SimpleBaseline [24] and DARK [25], it not only maintains the advantages of fewer parameters and lower computational complexity under different network frameworks, but also outperforms both models in the major indicators.

Dataset description
The MPII dataset [26] consists of 24,984 images containing about 40,000 different human instances, of which about 28,000 are used as training samples and about 11,000 as test samples. The annotations contain 16 key points of the entire body.

Evaluation criteria
The MPII experiments in this paper adopt the head-normalized Probability of Correct Keypoints (PCKh) evaluation metric: a predicted key point is considered correct if the distance between the predicted and ground-truth coordinates is less than α·l, where α is a threshold and l a reference distance. The MPII dataset takes α=0.5 (PCKh@0.5), with the reference distance l taken as the diagonal length of the head bounding box.
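The metric reduces to a thresholded distance check, as in this minimal NumPy sketch (the array shapes and example values are illustrative):

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """PCKh: a predicted keypoint is correct if its distance to the
    ground truth is below alpha times the reference distance (the
    head-box diagonal), averaged over all keypoints.
    pred, gt: (P, N, 2) coordinates; head_sizes: (P,) reference distances."""
    d = np.linalg.norm(pred - gt, axis=2)        # (P, N) Euclidean distances
    thr = alpha * head_sizes[:, None]            # per-person threshold alpha * l
    return (d < thr).mean()

pred = np.array([[[0.0, 0.0], [10.0, 0.0]]])    # one person, two keypoints
gt   = np.array([[[0.0, 0.0], [0.0, 0.0]]])
head = np.array([4.0])                           # threshold = 0.5 * 4 = 2
assert pckh(pred, gt, head) == 0.5               # first keypoint correct, second not
```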

Training implementation details
During training on the MPII dataset, the cropped images are scaled to a fixed size of 256×256; the other training details, parameter configuration, and experimental environment are the same as for the COCO dataset.

Experimental validation analysis
To verify the detection performance of ICANet on the MPII dataset, the human bounding boxes detected in the dataset are cropped to 256×256 before being fed into the pose estimation network, a size different from that used for the COCO dataset. From Equations (5), (6), (7), and (8), the size of the human bounding box has no effect on the model's parameter count, but it does affect the computational complexity. It is thus calculated that ICANet theoretically reduces the parameter count by 70.8% and the computational complexity by 88.6% when training on the MPII dataset; the actual model reduces them by 53.7% and 32.6%, respectively. The performance comparison between ICANet and other human pose estimation networks is shown in Table 3. Since ICANet is not pre-trained on the ImageNet dataset, the models in Table 3 are compared without loading pre-trained models. Table 3 shows that the accuracy of ICANet on head key points equals that of HRNet, while on shoulder and hip key points the accuracy is higher than HRNet's by 0.1 and 0.5 percentage points, respectively.
The accuracy on elbow and knee key points follows the same distribution, and although there are differences on key points that are hard to detect, such as wrists and ankles, ICANet shows only a slight decrease of 0.2 percentage points in overall performance.
Compared with traditional human pose estimation algorithms such as DARK-50, DARK-101, DARK-152, and SimpleBaseline-50, ICANet improves the overall key point prediction performance by 1.6, 1.2, 0.6, and 0.6 percentage points, respectively, at the same resolution and environment configuration, and it also outperforms these algorithms in detecting the individual key points. Compared with the lightweight Lite-HRNet-18 and Lite-HRNet-30 models, its performance increases by 3.0 and 2.1 percentage points as well.
These experimental results show that ICANet has a lower parameter count and computational complexity than traditional network models. With the improved multi-scale preprocessing module in the early stage, and the inverted residual and CoordAttention modules used to construct the ICAneck and ICAblock modules, ICANet strengthens feature extraction along the channel and spatial dimensions of the feature map and pays more attention to detecting all key points in the input image, contributing to good key point detection performance. The fact that performance on the MPII dataset tends to be saturated may explain why the performance differences on the MPII validation set are smaller than those on the COCO validation set.

Ablation experiment
This paper adopts the controlled variable method and constructs different ICANet variants to verify the respective impact of the multi-scale preprocessing module, the inverted residual module, and the CoordAttention module on ICANet's feature extraction ability and key point prediction accuracy. The experiments train on the MPII training set and validate on the MPII validation set without loading pre-trained models.
As shown in Table 4, replacing the basic module of HRNet with the inverted residual module reduces the parameter count by 69.1% and the computational complexity by 39.0% compared with HRNet. The model that adds both the multi-scale preprocessing module and the inverted residual module increases the parameter count by 1.1% and the computational complexity by 8.6% compared with the one adding only the inverted residual module, but improves overall performance by 0.8 percentage points. Meanwhile, the model with both the CoordAttention module and the inverted residual module improves overall performance by 0.5 percentage points over the one with the inverted residual module alone.
Thus, adding the inverted residual module alone to improve the HRNet basic module causes a slight overall performance decrease of 1.2 percentage points, but significantly reduces the parameter count and computational complexity while preserving model performance, achieving a high-quality lightweight network. On this basis, adding the multi-scale preprocessing module in the early stage and the CoordAttention module in the basic module still has no significant influence on the parameter count and computational complexity of the overall model.
In summary, the respective additions of the multi-scale preprocessing module, the inverted residual module, and the CoordAttention module all improve the accuracy of ICANet in detecting key points.

Visualization research and analysis
This paper conducts visualization studies on the COCO and MPII datasets.
For the COCO dataset, multi-person images with body folding and occlusion are randomly selected. As shown in Figure 6, the points mark the key point positions on the human body, and the lines model the relationships between key points. In human pose estimation, the difficulty of detecting key points varies across body parts, so results differ between parts; for instance, detecting key points on the waist and legs is significantly harder than detecting those near the head. Comparing Figure 6(b) and Figure 6(c), when people are densely packed and human key points are small in scale, ICANet predicts human key points more accurately and detects more of them than HRNet, owing to its enhanced spatial feature extraction ability. Moreover, ICANet pays more attention to detecting all human key points in the input image, especially those on the waist and legs, with only slight differences in the predicted positions, and it provides accurate human pose modeling and corrects modeling errors.
The results show that ICANet reduces its parameter count and computational complexity by adding the multi-scale preprocessing module in the early stage and adopting ICAneck and ICAblock to improve the basic modules of HRNet. Although these two modules essentially use depthwise separable convolution, which can extract fewer features from the map and weaken feature generalization, adding the multi-scale preprocessing, inverted residual, and CoordAttention modules enhances feature extraction from the feature maps and compensates for these deficiencies to some extent. Overall, the model effectively predicts small-scale and occluded human key points with good robustness. For the MPII dataset, Figure 7 shows the visualization results. Using the same weight file as for the COCO dataset, the experiment selects images from different perspectives, including images with obvious key point occlusion as well as blurred, low-quality ones. Under these complex conditions, the proposed ICANet still predicts well.

Conclusion
This paper proposes a lightweight human pose estimation network named ICANet, with an improved multi-scale preprocessing module in its early stage and two lightweight modules, ICAneck and ICAblock, constructed by improving the HRNet basic modules with an inverted residual module and a CoordAttention module. Comparative experiments against existing human pose estimation models are then conducted on both the COCO and MPII datasets.
The results indicate that ICANet preserves model accuracy while lowering the parameter count and computational complexity. In addition, it performs well in predicting small-scale and occluded human key points, as well as in correcting human pose modeling errors.
Designing a lightweight human pose estimation network better suited to real scenes remains a main direction for future research. If hardware facilities permit, the larger ImageNet dataset can be used for pre-training, and there is still room to optimize model accuracy.

Figure 1. The structure of the ICANet model.

Figure 2. The structure of the multi-scale preprocessing module.

C_in and C_out represent the input and output channel counts respectively, and C_mid is the number of intermediate feature channels. Equations (5) and (6) give the parameter-count formulas of the HRNet basic modules. ICAneck and ICAblock are mainly composed of the inverted residual module and the CoordAttention module, with the ICAblock module retaining the Basicblock architecture and adding two lightweight depthwise convolutions. Equations (7) and (8) therefore give the parameter-count formulas of the improved ICAneck and ICAblock basic modules, respectively. On the basis of the above formulas, Equations (9)-(11) derive the ratios of parameter reduction between the basic modules of ICANet and HRNet during network training.

The 17 annotated key points of the entire body are: Dot 0 the nose, Dot 1 the left eye, Dot 2 the right eye, Dot 3 the left ear, Dot 4 the right ear, Dot 5 the left shoulder, Dot 6 the right shoulder, Dot 7 the left elbow, Dot 8 the right elbow, Dot 9 the left wrist, Dot 10 the right wrist, Dot 11 the left hip, Dot 12 the right hip, Dot 13 the left knee, Dot 14 the right knee, Dot 15 the left ankle, and Dot 16 the right ankle.

Figure 6. Images with a large number of overlapping key points.

Figure 7. Visualization results on the MPII dataset.

Table 2. Performance comparison on the COCO validation set (parameter count, GFLOPs, mAP, AP50, AP75, APM, APL, AR).