Multi-target tracking algorithm based on anchor-free detection

To alleviate the frequent identity switches caused by mutual occlusion between targets and the missed detections caused by small target sizes in multi-target tracking, this paper proposes a multi-target tracking algorithm based on anchor-free detection. The algorithm first uses YOLOX for detection and embeds a ReID algorithm to enhance pedestrian re-identification. It then uses the DeepSORT algorithm for tracking and applies the Correction of Abnormal Trajectory algorithm to occluded targets. Finally, experiments are conducted on the MOT16 dataset and a self-made dataset. The results show that the algorithm achieves higher tracking accuracy than the compared algorithms and that its tracking performance on small targets is also effectively improved.


Proposed algorithm
To improve the tracking speed and accuracy for multiple targets that are small or occluded, this article uses YOLOX as the detector and embeds ReID features to enhance re-identification. The DeepSORT algorithm is used as the tracker, and the Correction of Abnormal Trajectory algorithm is applied to occluded targets.

Detector
1) YOLOX
The YOLOX network is mainly composed of three parts: the backbone feature extraction network, the enhanced feature extraction network, and the YoloHead. Its network structure is shown in Figure 1. The backbone network uses the Focus structure to expand the input channels by a factor of four: after slicing and concatenation, the feature layer changes from the original 3 channels to 12 channels, which reduces the number of parameters and the amount of computation. At the FPN layer, the PANet structure is used to fuse features through both upsampling and downsampling. YoloHead performs classification and regression separately and then integrates them for the final prediction.
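The channel expansion performed by the Focus structure can be illustrated with a minimal pure-Python sketch. It only shows the slicing and concatenation step (3 channels become 12 while the height and width are halved); the function name `focus_slice`, the nested-list image layout, and the order in which the four slices are concatenated are assumptions made for illustration, not details taken from this article.

```python
def focus_slice(image):
    """Space-to-depth slicing as used by a Focus layer (illustrative sketch).

    `image` is a list of C channels, each an H x W list of rows, with H and W
    even. Returns 4*C channels of size (H/2) x (W/2).
    """
    height, width = len(image[0]), len(image[0][0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    out = []
    # Four pixel positions inside every 2x2 block: (row offset, column offset).
    # The concatenation order is an assumption of this sketch.
    for dy, dx in ((0, 0), (1, 0), (0, 1), (1, 1)):
        for channel in image:
            out.append([row[dx::2] for row in channel[dy::2]])
    return out


# Toy 3-channel 4x4 "image"; each value encodes channel, row, and column.
image = [[[100 * c + 10 * y + x for x in range(4)] for y in range(4)]
         for c in range(3)]
sliced = focus_slice(image)
print(len(sliced), len(sliced[0]), len(sliced[0][0]))  # 12 2 2
```

The slicing itself discards no pixel values: the 3 × 4 × 4 input and the 12 × 2 × 2 output both hold 48 values, so the savings come from the later convolutions operating on a map with half the height and width.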
2) ReID
To deal with small and occluded targets, ReID technology is embedded in the detection network to enhance pedestrian re-identification. The embedded ReID algorithm is shown in Figure 2. The input image is preprocessed, flipped, and automatically augmented to achieve data augmentation. The preprocessed results are passed to the backbone network to obtain the feature map of the image. The feature map is input into the aggregation module, where average pooling extracts the global features. Finally, the global features are passed to the network head for normalization to obtain the final results. The average pooling equation used in the aggregation module is as follows.
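The input is $X \in \mathbb{R}^{W \times H \times C}$ (W, H, and C are the width, height, and number of channels of the feature map, respectively). The output vector is $f = [f_1, \dots, f_c, \dots, f_C]$, where

$$f_c = \frac{1}{W H} \sum_{i=1}^{W} \sum_{j=1}^{H} X_{i,j,c} \qquad (1)$$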
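The aggregation step described above can be sketched in a few lines of pure Python. The sketch averages each channel of a W × H × C feature map into one value and then normalizes the resulting vector. The article does not state which normalization the head applies, so the L2 normalization used here is an assumption, as are the function names `global_average_pool` and `l2_normalize`.

```python
import math


def global_average_pool(feature_map):
    """Average a W x H x C feature map (feature_map[i][j][c]) over i and j."""
    width = len(feature_map)
    height = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for column in feature_map:
        for cell in column:
            for c, value in enumerate(cell):
                totals[c] += value
    return [total / (width * height) for total in totals]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length (assumed normalization, see lead-in)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]


# Toy feature map with W = 2, H = 2, C = 2.
feature_map = [[[1.0, 2.0], [3.0, 4.0]],
               [[5.0, 6.0], [7.0, 8.0]]]
pooled = global_average_pool(feature_map)  # one mean per channel
embedding = l2_normalize(pooled)
print(pooled)  # [4.0, 5.0]
```

With unit-length embeddings, the similarity of two pedestrians can be compared with a simple dot product, which is one common reason for normalizing ReID features.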

Tracker
1) DeepSORT
This article uses DeepSORT as the tracker. A Kalman filter predicts the target state in the next frame, and fused metrics are then used to calculate the degree of matching between the detection results and the tracking results. Finally, the Correction of Abnormal Trajectory algorithm (CATA) [8] is applied to targets that are occluded along their motion trajectory to obtain a more accurate tracking trajectory. The DeepSORT tracking process is shown in Figure 3.
DeepSORT first performs cascade matching between the predictions obtained from the Kalman filter and the detections produced by the detector. If the match succeeds, the trajectory is updated in the trajectory pool. If the match fails, IoU matching is performed, which yields three types of results: unmatched trajectories, unmatched detections, and matched trajectories. For matched trajectories, the Kalman filter is updated. An unmatched detection is added to the trajectory pool as a new trajectory. An unmatched trajectory is deleted if it exceeds the threshold or is still unconfirmed. The remaining trajectories are placed back in the trajectory pool for a new round of prediction and update. Cascade matching is used to reduce the number of identity switches by giving priority to frequently occurring targets.
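The IoU matching stage and its three kinds of results can be sketched as follows. This is a minimal illustration rather than the DeepSORT implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the threshold `min_iou=0.3` is an arbitrary illustrative value, and the optimal assignment is found by brute force over permutations as a stand-in for the Hungarian algorithm, which is only practical for a handful of boxes.

```python
from itertools import permutations


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_by_iou(tracks, detections, min_iou=0.3):
    """Return (matches, unmatched_tracks, unmatched_detections) as indices."""
    n_t, n_d = len(tracks), len(detections)
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p))
                      for p in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(p, range(n_d)))
                      for p in permutations(range(n_t), n_d))
    best_pairs, best_score = [], -1.0
    for pairs in candidates:  # brute-force stand-in for the Hungarian algorithm
        score = sum(iou(tracks[t], detections[d]) for t, d in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    # Reject assigned pairs whose overlap is too small.
    matches = [(t, d) for t, d in best_pairs
               if iou(tracks[t], detections[d]) >= min_iou]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_tracks = [t for t in range(n_t) if t not in matched_t]
    unmatched_detections = [d for d in range(n_d) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_detections


tracks = [(0, 0, 10, 10), (20, 20, 30, 30), (100, 100, 110, 110)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
result = match_by_iou(tracks, detections)
print(result)  # ([(0, 1), (1, 0)], [2], [2])
```

In this toy example the third trajectory has no overlapping detection and would be a candidate for deletion or occlusion handling, while the third detection would start a new trajectory.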

2) Kalman filtering
Kalman filtering is suited to linear systems with Gaussian noise and mainly consists of two alternating processes: prediction and update. In the prediction process, the state equation is first used to predict the state variables of the current frame from the information of the previous frame, and the error covariance of the current frame is calculated; the filter then enters the update process. In the update process, the Kalman gain is calculated first, the state estimate is then corrected with the observation, and finally the error covariance is updated.
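The predict and update cycle can be made concrete with a deliberately simplified scalar filter. The sketch below tracks a single coordinate under a constant-position model, whereas the tracker in this article operates on a multi-dimensional box state; the noise values `q` and `r` and the function names are assumptions chosen only for illustration.

```python
def kalman_predict(x, p, q):
    """Prediction: propagate the state and enlarge the error covariance."""
    x_pred = x          # constant-position model: the state carries over
    p_pred = p + q      # process noise q increases the uncertainty
    return x_pred, p_pred


def kalman_update(x_pred, p_pred, z, r):
    """Update: compute the gain, correct the state, shrink the covariance."""
    k = p_pred / (p_pred + r)            # Kalman gain
    x_new = x_pred + k * (z - x_pred)    # correct the state with observation z
    p_new = (1.0 - k) * p_pred           # updated error covariance
    return x_new, p_new


# Start from an uncertain guess of 0.0 and observe the value 1.0 three times.
x, p = 0.0, 1.0
for z in [1.0, 1.0, 1.0]:
    x, p = kalman_predict(x, p, q=0.0)
    x, p = kalman_update(x, p, z, r=1.0)
print(round(x, 4), round(p, 4))  # 0.75 0.25
```

Each observation pulls the estimate toward the measured value while the error covariance shrinks, which is the behavior the prediction and update loop relies on when a target is briefly unobserved.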
The CATA algorithm is mainly divided into four processes: annotation, calculation, comparison, and update. The specific implementation is as follows:
• Annotation: The targets in the tracking process are annotated. Assuming that there are n targets in total during tracking, they are numbered $m = 0, 1, 2, \dots, n$;
• Calculation: The average height and width of the m-th target are calculated, as well as the average height and width of the tracking box for that target at a certain time, and a threshold is set.
Equation (2) gives the updated coordinate used when the occlusion is along the vertical axis. Equation (3) gives the updated coordinate used when the occlusion is along the horizontal axis. $\Delta h$ is the change in height between the current moment and the previous moment; $\Delta w$ is the change in width between the current moment and the previous moment.

Algorithm process
The specific operation process is as follows:
Step 1. The YOLOX algorithm is used to detect pedestrians and obtain the corresponding detection boxes and confidence scores.
Step 2. The Kalman filter is initialized for the current frame.
Step 3. The Kalman filter predicts the target position in the next frame to obtain a prediction box.
Step 4. The intersection over union (IoU) between the prediction boxes and the detection boxes is calculated for cascade matching.
Step 5. The Hungarian algorithm performs data association. If the matching succeeds, the tracking box coordinates are output. If the matching fails, a new tracker is established and its coordinates and features are saved. If this tracker is tracked successfully for 3 consecutive frames, it is treated as a new target and a new Kalman filter is initialized. If tracking fails, the target is regarded as lost and is passed to the Correction of Abnormal Trajectory algorithm for searching. If it is not found after 30 frames, the tracker is deleted.
Step 6. If all images have been processed, the algorithm ends; otherwise, it returns to Step 3.
The flowchart is shown in Figure 4.
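The tracker life cycle in Step 5 can be sketched as a small state machine. The thresholds of 3 consecutive frames and 30 lost frames come from the steps above; the class name `Track`, the state names, the choice to count the creating detection as the first of the 3 frames, and the rule that a previously confirmed target becomes confirmed again as soon as it is re-found are assumptions made for this sketch.

```python
class Track:
    """Minimal tracker life cycle: tentative -> confirmed -> lost -> deleted."""

    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.n_init = n_init          # consecutive frames needed to confirm
        self.max_age = max_age        # lost frames tolerated before deletion
        self.state = "tentative"
        self.consecutive_hits = 1     # assumption: the creating detection counts
        self.frames_lost = 0
        self.was_confirmed = False

    def mark_matched(self):
        """Call when data association assigns a detection to this track."""
        if self.state == "deleted":
            return
        self.consecutive_hits += 1
        self.frames_lost = 0
        if self.was_confirmed or self.consecutive_hits >= self.n_init:
            self.state = "confirmed"
            self.was_confirmed = True
        else:
            self.state = "tentative"

    def mark_missed(self):
        """Call when no detection is assigned; the target is treated as lost."""
        if self.state == "deleted":
            return
        self.consecutive_hits = 0
        self.frames_lost += 1
        self.state = "deleted" if self.frames_lost >= self.max_age else "lost"


track = Track(track_id=1)
track.mark_matched()
track.mark_matched()
print(track.state)  # confirmed
```

While a track is in the `lost` state, a full tracker would hand it to the occlusion-handling step to search for the target; that search is outside the scope of this sketch.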

Experimental environment
This article uses the COCO dataset to train the model and tests the algorithm on the MOT16 dataset and a self-made dataset. Experimental environment: Ubuntu 18.04, Intel Xeon E5-2660 v2 CPU @ 2.20 GHz × 40, GeForce RTX 2080 Ti GPU, and Python 3.6.12.

Evaluation criteria
The evaluation criteria used in the experiment are multiple object tracking accuracy (MOTA), the total number of identity switches (ID_Sw), the total number of false positives (FP), the total number of missed detections (FN), and tracking speed (Hz) [9][10]. MOTA is calculated as follows:

$$\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{ID\_Sw}}{\mathrm{GT}} \qquad (4)$$

where GT is the total number of ground-truth targets; FN is the number of missed detections; FP is the number of false positives; ID_Sw is the number of times the target identity has been switched.
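As a quick check of how the three error counts combine, the MOTA definition can be written as a short function. The counts in the example are hypothetical values chosen for illustration and are not results from this article.

```python
def mota(fn, fp, id_sw, gt):
    """MOTA = 1 - (FN + FP + ID_Sw) / GT, with all counts summed over frames."""
    if gt <= 0:
        raise ValueError("gt must be a positive number of ground-truth targets")
    return 1.0 - (fn + fp + id_sw) / gt


# Hypothetical counts: 200 misses, 50 false positives, 10 identity switches
# over 1000 ground-truth targets.
score = mota(fn=200, fp=50, id_sw=10, gt=1000)
print(f"{score:.1%}")  # 74.0%
```

Because the three error counts are summed before dividing, MOTA can become negative when the total number of errors exceeds the number of ground-truth targets.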

Experimental results and analysis
To objectively evaluate the improved algorithm, two sets of experiments were designed to compare and analyze its performance.
1) Experiment 1 compares our algorithm with other algorithms on the MOT16 dataset. The comparison algorithms are DeepSORT, SORT, and MTDF [11].
2) Experiment 2 compares the proposed algorithm with the DeepSORT algorithm on a self-made dataset.
The experimental results are compared in Tables I and II. Table I shows that the proposed algorithm achieves the highest MOTA, 75.3%, and the fewest identity switches, with an ID_Sw of 355. Its MOTA is 13.9 percentage points higher than that of DeepSORT, 15.5 percentage points higher than that of SORT, and 29.6 percentage points higher than that of MTDF. Its number of ID switches is 426 fewer than that of DeepSORT, 1068 fewer than that of SORT, and 1632 fewer than that of MTDF.
Table II shows that the proposed algorithm also has the highest MOTA on the self-made dataset, 14.5 percentage points higher than that of the DeepSORT algorithm, with 114 fewer ID switches than the DeepSORT algorithm.
To demonstrate the performance of the proposed algorithm visually, comparison experiments were conducted on the MOT16-01 sequence and the self-made dataset. The comparison between our algorithm and the DeepSORT algorithm under occlusion on the MOT16 dataset is shown in Figure 5. The first to third columns are the 420th, 428th, and 437th frames of the MOT16-01 video sequence, respectively. The comparison between our algorithm and the DeepSORT algorithm for a large number of small targets on the self-made dataset is shown in Figure 6. The first to third rows are the 17th, 251st, and 477th frames of the self-made dataset, respectively.
Figure 5 shows that the DeepSORT algorithm has no ID switching before and after occlusion, but it misses detections in frame 420: it misses not only the pedestrians who are about to cross paths but also the pedestrians standing behind them. In contrast, the proposed algorithm tracks all pedestrian targets. This indicates that the proposed algorithm has the best tracking performance on the MOT16 dataset. Figure 6 shows that there are 9 pedestrian targets in total in frame 17. The DeepSORT algorithm tracks only one relatively large target, while the proposed algorithm tracks one large target and four small targets. In frame 251, there are 7 pedestrian targets in total. The DeepSORT algorithm tracks only 1 pedestrian target, while the proposed algorithm tracks 6 small targets. In frame 477 of the image, there are also a total of 7 pedestrian

Figure 1. Framework of YOLOX. The detection network first extracts features through the backbone network, then passes them into the feature pyramid for feature fusion to enhance feature extraction, and finally classifies and regresses them through YoloHead.

Figure 4. Flowchart of the proposed algorithm

Figure 5. Comparison results of the proposed algorithm and DeepSORT on the MOT16 dataset

• Comparison: At a given moment, the average height is compared with the tracking box height and the average width is compared with the tracking box width; the center point of target m at this moment is also compared with its center point at the previous moment.

TABLE I. QUANTITATIVE TRACKING RESULTS OF DIFFERENT ALGORITHMS ON THE MOT16 DATASET