Improved Faster R-CNN Traffic Sign Detection Algorithm Based on Transformer

Traffic sign detection is becoming increasingly important in assisted driving, yet targets are often small, occluded, or set against complex backgrounds. This paper therefore proposes an improved Faster R-CNN traffic sign detection algorithm based on the Transformer. The algorithm uses a shifted-window Transformer network as the backbone, and a cascade fusion module fuses the multi-level feature maps across layers to obtain a fused feature map, improving the model's feature extraction ability. In the detection head module, RoI Align eliminates the rounding error introduced by quantization. Experiments on the TT100K dataset show that both the mAP and the detection speed of the proposed algorithm are improved, demonstrating the effectiveness of the improved model and its applicability to real-world scenarios.


Introduction
In autonomous driving technology, vehicles obtain current environmental information through the target detection system, and traffic sign detection can alleviate traffic jams and assist driving [1], so this field has important development significance. In real-world scenarios, detection is susceptible to problems such as light intensity, weather changes, complex backgrounds, and small targets, which can lead to false or missed detection of traffic signs. Many detection algorithms have been proposed by researchers in this regard. General object detection algorithms fall into two categories: one-stage and two-stage. One-stage algorithms run faster but are less accurate: their speed meets real-time requirements, but their detection accuracy does not. Therefore, this paper chooses the two-stage algorithm Faster R-CNN [2] as the basis for the improved network. In previous research on this network, Shao et al. [3] proposed simplifying the Gabor wavelet through a region proposal algorithm to improve the recognition speed of the network. The authors of [4] addressed the problems of a single detection scale and low utilization of feature information by adding a reverse feature-fusion process to the traditional FPN, but this strategy deepened the network and lowered the detection speed.
This paper proposes a Transformer-based improved Faster R-CNN traffic sign detection algorithm, optimizing the Faster R-CNN network in three ways. First, we replace the original, relatively simple backbone network VGG16 [5] with a Transformer network based on shifted windows [6] to enhance the expressive power of high-dimensional image features in the network. Second, we design a cascade fusion module to meet the needs of small target detection. Finally, RoI Align [7] replaces RoI Pooling in the detection head module to eliminate the double-quantization rounding error and further improve detection accuracy.
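The difference between RoI Pooling's coordinate rounding and RoI Align's bilinear sampling can be illustrated with a minimal NumPy sketch. This is only the single-point, single-channel case, not the full sampling grid of the real operators; the feature map and coordinates are illustrative:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at a fractional (y, x) via bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

feat = np.arange(16, dtype=float).reshape(4, 4)

# RoI Pooling quantizes: the fractional coordinate is rounded to a grid cell,
# so the sampled value jumps discontinuously as the RoI moves.
pooled = feat[int(round(1.4)), int(round(2.6))]

# RoI Align keeps the fractional coordinate and interpolates, so no
# quantization error is introduced.
aligned = bilinear_sample(feat, 1.4, 2.6)
```

In the real detection head, this interpolation is applied at several regularly spaced sample points inside each output bin before pooling them.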

The Improved Algorithm in This Paper
The algorithm flow of this paper is shown in Figure 1. Step a is the backbone network: the input is the original image, image features are extracted with a shifted-window Transformer network, and the output is a set of multi-level feature maps. Step b is the cascaded feature fusion module, which fuses feature maps of different levels across layers and outputs a fused feature map combining shallow and deep features. Step c is the RPN, which outputs candidate regions for the detected traffic signs. Step d is the RoI Align part: the candidate regions of different sizes generated in step c are mapped onto the fused feature map output by step b and resampled to a fixed size. Step e is the subsequent detection head module, whose input is the fixed-size region feature map and whose output is the detection result.

First, the Patch Partition module splits the image into non-overlapping patches. In stage 1, the linear embedding module takes all patch data of the image as input and outputs the two-dimensional vector representation of the image; this is then passed through several consecutive Transformer modules, and the feature map F1 is output. Each of the following three stages first passes through the Patch Merging module, which halves the height and width of the feature map and doubles the number of channels, and then repeats the learning process of the Transformer extraction module, finally outputting the feature maps F2, F3, and F4 respectively. The Transformer module of this paper's backbone is shown in Figure 3. To reduce the computational complexity of the attention mechanism and capture local information, W-MSA performs multi-head self-attention calculations within divided windows. In addition, SW-MSA adds a shifted-window operation to realize cross-window connections and obtain global attention information. The formulas are expressed as:

ẑ^l = W-MSA(LN(z^(l-1))) + z^(l-1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)

where z^l denotes the output of the l-th Transformer module and LN denotes layer normalization.
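The window partition behind W-MSA and the cyclic shift behind SW-MSA can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the 8×8 map, window size 4, and channel count are illustrative:

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping (ws, ws, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, C)

def shift_windows(x, ws):
    """SW-MSA pre-step: cyclically shift by ws//2 so that the next window
    partition mixes features from neighbouring windows (cross-window links)."""
    return np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))

x = np.random.rand(8, 8, 3)
wins = window_partition(x, 4)                        # 4 windows of size 4x4
shifted_wins = window_partition(shift_windows(x, 4), 4)
```

W-MSA then runs self-attention independently inside each window of `wins`; SW-MSA does the same on `shifted_wins`, which is what lets information cross the original window borders.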

Cascade Feature Fusion Module
Inspired by the feature pyramid network (FPN) structure [8], this paper adds a cascaded feature fusion module between the backbone network and the RPN network. The multi-level feature maps output by the backbone are fused across layers to obtain the fused feature maps [9]. Its structural block diagram is shown in Figure 4(b). Continuing the cascade: ③ F5, F4, and F3' are upsampled to the same size as F2 and smoothed by the convolution unit Conv(•); the smoothed maps are multiplied with F2, and the result is concatenated with the smoothed F3' to obtain the feature map F2'; ④ F5, F4, F3', and F2' are upsampled to the same size as F1 and smoothed by Conv(•); the smoothed maps are multiplied with F1, and the result is concatenated with the smoothed F2' to obtain the feature map F1'. In the second part, F1', F2', F3', F4, and F5 are each dimensionally reduced through the convolution unit Conv(•), and the final output fused feature maps are P1, P2, P3, P4, and P5. The specific operations are given in Equations (5)–(10).
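One cross-layer fusion step can be roughly sketched as follows. NumPy stand-ins replace the learned parts: nearest-neighbour upsampling for the upsample, a fixed 1×1 channel mix for the convolution unit Conv(•); the real module uses learned convolutions and fuses all five levels. Channel counts and sizes are illustrative:

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(f):
    """2x downsampling by striding (stand-in for strided conv/pooling)."""
    return f[:, ::2, ::2]

def smooth(f, out_ch):
    """Stand-in for the convolution unit Conv(.): a fixed 1x1 channel mix."""
    w = np.ones((out_ch, f.shape[0])) / f.shape[0]
    return np.einsum('oc,chw->ohw', w, f)

# Multi-level maps from the backbone: channels double, resolution halves.
F1, F2 = np.random.rand(96, 32, 32), np.random.rand(192, 16, 16)
F3, F4 = np.random.rand(384, 8, 8), np.random.rand(768, 4, 4)
F5 = downsample2x(F4)                      # extra coarse level

# One fusion step at F3's scale: bring F4 up, smooth, multiply with F3
# elementwise, then concat with the smoothed map.
up = smooth(upsample2x(F4), 384)
P3 = np.concatenate([up * F3, up], axis=0)
```

The elementwise product reweights the shallow map with deep semantics, while the concat preserves the deep features themselves for the following 1×1 reduction.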

Experimental dataset
The experiments are trained on the domestic traffic sign dataset TT100K [10]. The TT100K dataset was created by Tsinghua University and Tencent from high-definition Tencent Street View images. The processed dataset contains 9457 pictures, divided in a 7:2:1 ratio: the training, validation, and test sets contain 6598, 1889, and 970 pictures respectively, covering a total of 44 categories.
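A 7:2:1 split of this kind can be sketched as below. Note this is a generic ratio split for illustration; the exact per-split counts reported above (6598/1889/970) come from the dataset's own processing, so a naive ratio split yields slightly different numbers:

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle and split a list of samples into train/val/test by ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed for reproducibility
    n = len(items)
    n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(9457))
```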
Figure 7 shows the categories and numbers of traffic signs.

Experimental settings and evaluation indicators
The experiments in this paper are carried out under Ubuntu 18.04. The CPU is an Intel(R) Xeon(R) Platinum 8358P, and the GPU is an NVIDIA RTX 3090Ti. All models are trained under the deep learning framework MMDetection [11], based on PyTorch 1.13. To evaluate the recognition accuracy and speed of the model, the experiments use mAP_0.5:0.95 and frames per second (FPS) as evaluation indicators. The calculation equation is as follows.
AP = ∫₀¹ P(R) dR,  mAP = (1/n) Σᵢ₌₁ⁿ APᵢ  (12)

where P(R) is the precision at recall R, APᵢ is the AP of the i-th category, and n is the number of traffic sign categories to be recognized.
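Equation (12) can be computed numerically. Below is a minimal sketch using the trapezoid rule on a toy precision-recall curve for one class; the paper's mAP_0.5:0.95 additionally averages over IoU thresholds, which is omitted here:

```python
import numpy as np

def average_precision(recall, precision):
    """AP = integral of P(R) dR, approximated with the trapezoid rule.
    `recall` must be sorted ascending and aligned with `precision`."""
    dr = recall[1:] - recall[:-1]
    return float(np.sum(dr * (precision[1:] + precision[:-1]) / 2))

def mean_ap(ap_per_class):
    """mAP: mean of the per-class APs over the n categories."""
    return sum(ap_per_class) / len(ap_per_class)

# Toy P-R curve for a single class.
recall = np.array([0.0, 0.5, 1.0])
precision = np.array([1.0, 0.8, 0.6])
ap = average_precision(recall, precision)
```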

Experimental results and analysis
To evaluate the performance before and after the improvement, the proposed algorithm and the original algorithm are trained and tested under the same conditions. The final results are shown in Table 1, and the confusion matrix of each category is shown in Figure 5.
Table 1. Comparison of model performance before and after improvement.
From Table 1, compared with the original algorithm, the evaluation indicators obtained by the proposed algorithm are improved to a certain extent: mAP increases by 9.72% and detection speed by 31.58%. It can be seen from Figure 5 that the detection effect on all 44 categories of traffic signs is good. In summary, the improved modules fuse well and ensure both the accuracy and the speed of detection.
To compare the detection performance before and after the model improvement, an ablation experiment combining the four improved modules and strategies is set up on the TT100K dataset, and the contribution of each module is quantitatively evaluated with objective metrics. The results are shown in Table 2. As shown in Table 2, among the single-module improvement strategies, the improved backbone has the greatest impact on recognition accuracy: mAP increases by 6.13% and FPS by 4.73 f/s. After adding the cascade fusion module, FPS drops by 2.53 f/s, but detection accuracy rises by a further 0.58%. Overall, the improvements proposed in this paper enhance the detection effect while maintaining detection speed; the best of the improved strategies reaches a detection accuracy of 83.87%.
To reflect the detection performance of the improved algorithm on traffic signs more intuitively, a visual detection experiment was carried out on the test set. Figure 6 compares some of the detection results: the first row shows the results of the original algorithm, and the second row shows those of the algorithm in this paper. In the first column of images, both can accurately detect the traffic signs, but due to environmental factors such as illumination, the confidence scores of the original algorithm are lower than those of the pn and p27 signs detected by our method. In the second column, the proposed method also accurately detects the small, distant traffic sign w55. In the third column, where the background is complex and the traffic signs are incomplete or occluded, the original algorithm misses i2 and p26 and wrongly predicts i5 as ip and p27 as p26, while the proposed algorithm accurately detects all traffic signs in the image; in addition, the confidence of po and i2 is also improved to some extent. The visualization results show that the original algorithm is less robust and easily affected by complex backgrounds and lighting, while the proposed method effectively reduces missed and false detections, better adapts to scale changes of the targets and to environmental factors such as illumination, and achieves a better overall detection effect.

Conclusion
The target detection algorithm based on the Transformer and cascade fusion proposed in this paper uses a shifted-window Transformer network as the backbone for feature extraction, which improves the feature extraction ability of the backbone. A cascaded feature fusion strategy is adopted to improve the model's multi-scale fusion ability and the efficiency of feature fusion. Finally, RoI Align is used in the detection head module to generate fixed-size feature maps, which eliminates the double-quantization rounding error and further improves the positioning accuracy of the detection box. The experimental results show that the improved algorithm has high robustness and accuracy in the traffic sign detection task: mAP and FPS reach 83.87% and 54.29 f/s respectively, increases of 13.64% and 31.58% over the original Faster R-CNN algorithm. In follow-up research, we plan to further accelerate the algorithm while maintaining traffic sign detection accuracy, and to combine it with rain and fog removal algorithms to improve the detection effect in rainy and foggy environments.

Figure 1 .
Figure 1. The algorithm execution flow in this paper.

Backbone Network Based on Shifted-Window Transformer
The Transformer backbone network based on the shifted window proposed in this paper is divided into four stages to obtain feature maps of different levels; it repeats the learning process of the Transformer-based extraction module four times and finally obtains four feature maps {Fᵢ | i = 1, 2, 3, 4}. Its network block diagram is shown in Figure 2.

Figure 2.
Figure 2. Transformer backbone network block diagram based on shifted windows

Figure 3 .
Figure 3. Block diagram of Transformer module based on shift window

Figure 4 .
Figure 4. Block diagram of cascade fusion module. The specific operation is divided into two cascading parts. In the first part, ① the feature map F4 is additionally downsampled by a factor of 2 to obtain the feature map F5; ② F5 and F4 are upsampled to the same size as F3 and passed through the convolution unit Conv(•); the smoothed maps are multiplied with F3, and the result is concatenated with the smoothed F4 to obtain the feature map F3'.

Figure 5 .
Figure 5. Confusion matrix diagram of the detection results of various types of traffic signs

Figure 6 .
Figure 6. Comparison of some detection effects.

Table 2 .
Results of ablation experiments.