Multi-level Salient Feature Mining Network for Person Re-identification

Person re-identification (Re-ID) algorithms retrieve images of the same pedestrian from an image gallery captured by multiple cameras, given a query pedestrian image. Because of changes in pedestrian posture, illumination, and perspective, improving the accuracy of person re-identification remains a significant challenge. Although the attention mechanism can alleviate some of these issues, attention-based methods tend to focus excessively on the most salient areas of an image while ignoring discriminant features outside those areas, so the features they extract are insufficiently discriminative. To address this, we propose a Multi-level Salient Feature Mining Network (MSFM-Net). First, by embedding attention modules in ResNet-50, the model extracts the most salient pedestrian feature maps. Second, two sub-salient feature mining branches extract the second-level and third-level salient feature maps (collectively referred to as sub-salient feature maps). Third, a feature maps fusion module combines the most salient feature maps with the sub-salient feature maps to obtain fused salient feature maps. Finally, the model pools the fused salient feature maps to produce more discriminative pedestrian representations. Results on two benchmark datasets demonstrate that MSFM-Net's performance reaches the current state of the art.


Introduction
Given a pedestrian image, person re-identification algorithms retrieve images of the same pedestrian from an image gallery taken by multiple cameras. Person re-identification has become a popular research topic because it can be combined with face recognition, pedestrian detection, and pedestrian tracking to play an important role in security applications. It is still confronted with challenges such as changes in pedestrian posture, illumination, and perspective, so improving the accuracy of person re-identification remains a significant challenge.
Several works [1][2][3][4][5] have successfully applied the attention mechanism to Re-ID. Hu et al. [1] proposed the squeeze-and-excitation module, which models channel relationships and assigns a weight to each channel based on this relationship information, locating salient areas and suppressing extraneous information. Woo et al. [2] proposed CBAM, which incorporates both spatial and channel attention modules and outperforms using either attention module alone. Li et al. [3] proposed MSCAN, which forgoes traditional hard partitioning and instead employs the attention mechanism to locate and learn pedestrian parts, thereby resolving the pedestrian misalignment problem and the impact of posture changes. Jing et al. [4] proposed PPA, which employs the attention mechanism to improve the accuracy of local person features extracted from a pose estimation model and to fully utilize person pose information in the network. Zhang et al. [5] proposed RGA, which learns valuable information in the global structure by modeling the semantic relationships between local feature nodes and locates salient areas based on this information, greatly improving recognition accuracy.
However, most attention-based methods attend only to the most salient areas of pedestrian images, ignoring discriminant features beyond those areas and producing pedestrian features with insufficient discriminability. We therefore propose a Multi-level Salient Feature Mining Network that not only focuses on the most salient features of pedestrian images, by incorporating attention modules into the backbone network, but also extracts discriminant features beyond the most salient areas via sub-salient feature mining branches.

Framework of Network
As shown in Figure 1, we add an attention module after each of the first two convolutional blocks of the backbone. The feature map X output by the backbone's third convolutional block is then processed by another attention module to obtain the most salient feature map X_1. Subtracting X_1 from X yields the second-level salient feature map X_2; X_2 is fed into an attention module to obtain the feature map X_2a, and subtracting X_2a from X_2 yields the third-level salient feature map X_3. The above operations can be formulated as:

$$X_1 = \mathrm{att}(X),\quad X_2 = X - X_1,\quad X_{2a} = \mathrm{att}(X_2),\quad X_3 = X_2 - X_{2a} \tag{1}$$

where att(·) represents the attention module.
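To make the mining step concrete, below is a minimal PyTorch sketch of Eq. (1). The text does not specify whether the attention modules at the three stages share weights, so a single callable `att` stands in for whichever module instance is used:

```python
import torch.nn as nn

def mine_salient_levels(x, att: nn.Module):
    """Multi-level salient feature mining, following Eq. (1).

    x   : feature map from the backbone's third convolutional block
    att : attention module returning a map shaped like its input
    """
    x1 = att(x)          # most salient feature map X_1
    x2 = x - x1          # second-level salient feature map X_2
    x2a = att(x2)        # salient part of the residual X_2
    x3 = x2 - x2a        # third-level salient feature map X_3
    return x1, x2, x3
```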
After obtaining X_1, X_2, and X_3, we first process X_1 with a Layer4 block and an attention module to get the most salient feature map F_1; two further Layer4 blocks then process X_2 and X_3 to get the second-level and third-level salient feature maps F_2 and F_3, respectively. Finally, we use the FFM to fuse F_1, F_2, and F_3 into the final feature map F_4.
During the training phase, GeM pooling processes F_1, F_2, F_3, and F_4 to obtain the corresponding feature vectors, which are used to compute the cross-entropy loss and the triplet loss. In the testing phase, we use the GeM-pooled vector of F_4 as the final pedestrian feature.
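GeM pooling generalizes average and max pooling through a learnable exponent p; a standard implementation is sketched below (the initial value p = 3 is the common default, not stated in the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized-mean pooling: p -> 1 recovers average pooling,
    large p approaches max pooling."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x):                       # x: (B, C, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1)         # (B, C, 1, 1)
        return x.pow(1.0 / self.p).flatten(1)   # (B, C) feature vector
```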

Feature Maps Fusion Module
We design a feature maps fusion module (FFM) to reduce the impact of redundant information when fusing multiple pedestrian features. As shown in Figure 2, given three feature maps F_1, F_2, and F_3, each map is first fed into a 1×1 convolutional layer followed by a BatchNorm layer. The resulting maps are spliced along the channel dimension, and a Softmax operation along the channel dimension yields the weight map W. W is then sliced along the channel dimension into W_1, W_2, and W_3, and the FFM's output feature map F_4 is the weighted sum of the inputs. The above operations can be formulated as follows:

$$W = \mathrm{Softmax}([\mathrm{Conv}_1(F_1);\ \mathrm{Conv}_1(F_2);\ \mathrm{Conv}_1(F_3)]),\quad W_1, W_2, W_3 = \mathrm{Split}(W),\quad F_4 = W_1 \otimes F_1 + W_2 \otimes F_2 + W_3 \otimes F_3 \tag{2}$$

where Conv_1(·) indicates the 1×1 convolutional layer followed by a BatchNorm layer, [;] indicates the splicing operation, Split(·) indicates the splitting operation, ⊗ indicates element-wise multiplication, and Softmax(·) indicates the Softmax function.
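The following is a minimal sketch of one reading of the FFM, assuming each 1×1 convolution projects its input to a single-channel score so that the Softmax yields one spatial weight map per input; the channel width of the score maps is not stated in the text:

```python
import torch
import torch.nn as nn

class FFM(nn.Module):
    """Feature maps fusion module sketch: per-pixel Softmax weights
    over three input maps, followed by a weighted sum."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1),
                          nn.BatchNorm2d(1))
            for _ in range(3))

    def forward(self, f1, f2, f3):
        maps = (f1, f2, f3)
        s = torch.cat([m(f) for m, f in zip(self.score, maps)], dim=1)
        w = torch.softmax(s, dim=1)         # (B, 3, H, W) fusion weights
        w1, w2, w3 = w.split(1, dim=1)      # slice along the channel dim
        return w1 * f1 + w2 * f2 + w3 * f3  # weighted sum F_4
```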

Attention Module
SAM: Spatial attention locates the salient areas of a feature map. As shown in Figure 3, our SAM consists of a global branch and a local branch. Given an input feature map $X_f \in \mathbb{R}^{C\times H\times W}$, the global branch produces a one-channel map X_g by processing X_f with a 1×1 convolutional layer, a BatchNorm layer, and a ReLU layer:

$$X_g = \mathrm{Conv}_1(X_f) \tag{3}$$

where Conv_1(·) represents the 1×1 convolutional layer followed by the BatchNorm and ReLU layers.

In the local branch, X_f is first evenly divided into four parts x_1, x_2, x_3, and x_4 along the channel dimension. Each part is compressed into a one-channel map by the deep pooling operations pool_1(·) and pool_2(·) (see Figure 5). Splicing the global map and the four local maps along the channel dimension gives $X_a \in \mathbb{R}^{5\times H\times W}$, from which the spatial attention map is obtained:

$$A_s = \sigma(\mathrm{Conv}_1(X_a)) \tag{4}$$

where σ(·) represents the sigmoid function.

CAM: Channel attention models the relationship between channels to obtain the importance of each channel. The CAM we designed is shown in Figure 4. Global pooling of X_f yields the channel descriptor v, which is processed by two FC layers and a sigmoid function to obtain the CAM map A_c:

$$A_c = \sigma(W_2 W_1 v) \tag{5}$$

where W_1 and W_2 represent the parameters of the FC layers.

Attention module: The AM is usually formed by combining CAM and SAM in parallel or in series [2]. However, simply connecting the CAM and SAM in series or in parallel may result in information redundancy. As shown in Figure 1, we instead use our FFM to fuse the input feature map, SAM's output feature map, and CAM's output feature map to obtain the attention module's output feature map.
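As an illustration, here is a minimal sketch of the CAM described above; the reduction ratio of the two FC layers and the ReLU between them follow the squeeze-and-excitation design [1] and are assumptions, since the text does not state them:

```python
import torch.nn as nn

class CAM(nn.Module):
    """Channel attention sketch: global average pooling, two FC layers,
    and a sigmoid produce the channel attention map A_c."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):                   # x: (B, C, H, W)
        v = x.mean(dim=(2, 3))              # channel descriptor v
        a_c = self.fc(v)                    # channel attention map A_c
        return x * a_c[:, :, None, None]    # reweighted feature map
```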

Loss Function
We use a loss function to supervise the most salient features, the second-level salient features, the third-level salient features, and the final pedestrian features. The loss function comprises a triplet loss and a cross-entropy loss:

$$L = L_{cls} + L_{tri} \tag{6}$$

where L_cls and L_tri represent the cross-entropy loss and the triplet loss, respectively.

Cross-entropy loss: We employ the label smoothing strategy to prevent the model from overfitting. Accordingly, we employ the following cross-entropy loss:

$$L_{cls} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M} q_m^n \log p_m^n \tag{7}$$

where N and M respectively represent the number of images in each batch and the number of identity classes; q_m^n represents the smoothed label of the nth image for class m, equal to 1 − ε + ε/M when m equals the real label y_n and ε/M otherwise, with ε the label smoothing factor; and p_m^n represents the predicted probability that the nth image belongs to class m.

Triplet loss: The traditional triplet loss employs an excessive number of triplets, most of which are simple and easy to distinguish, resulting in slow convergence and easy overfitting of the network model. The batch-hard triplet loss improves the network's generalization ability, so we employ the following triplet loss:

$$L_{tri} = \sum_{i=1}^{P}\sum_{a=1}^{M}\Big[m + \max_{p=1,\dots,M} D(x_a^i, x_p^i) - \min_{j\neq i,\ n=1,\dots,M} D(x_a^i, x_n^j)\Big]_+ \tag{8}$$

where P indicates that each mini-batch contains P pedestrian identities and M means each identity contributes M images; x_a^i and x_p^i form positive pairs belonging to the ith identity, while x_a^i and x_n^j form negative pairs belonging to different identities; [x]_+ is the maximum-with-zero function; D(·,·) is the Euclidean distance; and m is the margin factor.
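The batch-hard selection in Eq. (8) can be written compactly, as in the sketch below for one mini-batch of features and integer identity labels; the label-smoothed cross-entropy of Eq. (7) is available directly as nn.CrossEntropyLoss(label_smoothing=0.1) in PyTorch:

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, use the farthest
    positive and the nearest negative within the mini-batch.
    Assumes the batch contains at least two identities.

    feats : (N, D) pedestrian feature vectors
    labels: (N,)   integer identity labels
    """
    dist = torch.cdist(feats, feats, p=2)       # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]   # True for same-identity pairs
    hardest_pos = dist.masked_fill(~same, 0.0).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(margin + hardest_pos - hardest_neg).mean()
```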

Datasets
The Market-1501 dataset contains 1,501 pedestrian identities captured by 6 different cameras. It comprises 32,668 pedestrian images, some of which are generated by the DPM detector.
The DukeMTMC-ReID dataset contains 36,411 images of 1,812 pedestrian identities captured by 8 cameras. Of these, 1,404 identities are captured by two or more cameras. We divide these 1,404 identities into non-overlapping training and test sets, and then insert the images of the remaining 408 identities, each captured by only one camera, into the test set's gallery.

Experimental details
Backbone network: We use a ResNet-50 model with some modifications as the backbone. We remove the last down-sampling operation of the last convolutional block to obtain a larger feature map. Furthermore, the Layer4 convolutional blocks in the second-level and third-level salient feature mining branches are initialized with the parameters of the pre-trained ResNet-50 model's fourth convolutional block.
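As a sketch, the last down-sampling of torchvision's ResNet-50 can be removed by setting the stride of layer4's first bottleneck to 1, which enlarges the output map from 9×5 to 18×9 for a 288×144 input; the Layer4 copies for the two sub-branches can then be cloned from the pre-trained block:

```python
import copy
import torchvision.models as models

def build_backbone():
    """ResNet-50 with the last down-sampling removed (stride 2 -> 1)."""
    net = models.resnet50(weights="IMAGENET1K_V1")
    net.layer4[0].conv2.stride = (1, 1)          # main path of first bottleneck
    net.layer4[0].downsample[0].stride = (1, 1)  # matching shortcut convolution
    # The sub-salient branches get their own Layer4 blocks, initialized
    # from the pre-trained fourth convolutional block.
    layer4_level2 = copy.deepcopy(net.layer4)
    layer4_level3 = copy.deepcopy(net.layer4)
    return net, layer4_level2, layer4_level3
```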
Image processing: We resize each image to 288 × 144 and process it with random erasing, horizontal flipping, and random cropping.
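A plausible torchvision pipeline for this step is sketched below; the padding amount before random cropping and the erasing probability are assumptions, as the text does not specify them:

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((288, 144)),                  # target image size
    T.RandomHorizontalFlip(p=0.5),
    T.Pad(10),                             # pad, then crop back to size
    T.RandomCrop((288, 144)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.5),                # random erasure on the tensor
])
```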
Training process: We use the Adam optimizer with a batch size of 64 and train for 75 epochs. The weight decay factor is 5e-4 and the momentum is 0.9. We use a warm-up strategy during training: the initial learning rate is 8e-5, increased to 3e-4 over the first 10 epochs, and then divided by 10 at the 45th and 70th epochs.
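The schedule can be expressed as a multiplicative factor on the base rate of 3e-4; a sketch using LambdaLR follows (the linear shape of the warm-up is an assumption):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

BASE_LR = 3e-4

def lr_factor(epoch):
    """Warm up from 8e-5 to 3e-4 over 10 epochs, then /10 at 45 and 70."""
    start = 8e-5 / BASE_LR
    if epoch < 10:
        return start + (1.0 - start) * epoch / 10
    if epoch < 45:
        return 1.0
    if epoch < 70:
        return 0.1
    return 0.01

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model.parameters()
optimizer = torch.optim.Adam(params, lr=BASE_LR, weight_decay=5e-4)
scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)  # step once per epoch
```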
Loss function: In this paper, the margin m of the batch-hard triplet loss is 0.3, and the label smoothing factor ε is 0.1.

Comparison with State-of-the-Art Methods
Table 1. Performance (%) on the Market-1501 dataset

Method      R-1    R-5    R-10   mAP
RGA [5]     95.8   -      -      88.

Table 2. Performance (%) on the DukeMTMC-ReID dataset

Method      R-1    R-5    R-10   mAP
RGA [5]     86.1   -      -      74.9
MSFM-Net    89.7   95.6   96.6   80.1

From Tables 1 and 2, it can be seen that MSFM-Net reaches the optimal values, which fully proves its effectiveness. Table 3 shows that both the SAM and the CAM we designed improve MSFM-Net's performance, and that, compared with a simple series or parallel connection, our FFM improves MSFM-Net's performance further.

Conclusion
This paper proposes a multi-level salient feature mining network for person Re-ID. Through the extraction and fusion of multi-level salient feature maps, MSFM-Net obtains richer and more discriminative pedestrian representations, overcoming to some extent the influence of posture changes, uneven illumination, and complex backgrounds on accuracy. Extensive experimental results show that the accuracy of MSFM-Net reaches the current state of the art.

Figure 3. SAM. The output feature maps of the global branch and the local branch are spliced along the channel dimension to obtain $X_a \in \mathbb{R}^{5\times H\times W}$, and the attention feature map A_s of the SAM is obtained as $A_s = \sigma(\mathrm{Conv}_1(X_a))$.

Figure 4. CAM.


Figure 5. Deep pooling. In particular, pool_1(·) and pool_2(·) used in the SAM's local branch are two different deep pooling operations. Since their only difference is the order of dimension exchange, we introduce only pool_1(·): given an input $X_f \in \mathbb{R}^{C\times H\times W}$, the C and W dimensions are first exchanged; the result is processed through a convolutional layer, a BatchNorm layer, and a ReLU layer; and the pooling result is obtained by exchanging the dimensions back.


Table 3. Ablation study on the Market-1501 dataset