YOLOv5 UAV Detection Algorithm Based on Attention Mechanism

To address the poor accuracy of small-target recognition by UAV detection systems, this study presents an improved YOLOv5 detection method with an attention mechanism. First, CBAM is integrated into the Backbone to suppress irrelevant features and enhance the network's attention to spatial and channel information, helping the network learn more discriminative representations of objects in the image. Then, the introduction of BiFormer into the Neck removes redundant information from the algorithm structure, endows the network with dynamic query-aware sparsity, and enhances its ability to detect small targets. The experimental findings demonstrate that the proposed algorithm achieves a mean detection accuracy of 84.6% on the self-built UAV dataset and can accurately complete the detection of small UAV targets.


Introduction
Owing to their small size and high flexibility, low-altitude small UAVs (unmanned aerial vehicles) have been rapidly popularized in public life and are utilized extensively in many fields, including mapping and automated video surveillance [1]. However, the wide application of UAVs inevitably poses a certain threat to public privacy. Therefore, effective control of low-altitude small UAVs is urgently needed, and detecting UAVs quickly and effectively is the primary task at present.
At present, common UAV detection methods include radar detection and acoustic detection [2][3][4]. However, owing to drawbacks such as large equipment volume and high cost, these methods cannot be deployed on a large scale. In recent years, deep learning has advanced significantly, leading to the success of detection algorithms based on convolutional neural networks in extracting deep features from images. Object detection algorithms in computer vision fall into two main types: two-stage methods such as Faster R-CNN [5] and one-stage methods such as the YOLO [6] series. YOLOv5 has great advantages in model deployment because of its stability and efficiency, balancing detection accuracy and speed, so it is well suited to the detection of low-altitude small-target UAVs in this paper.
This work proposes an improved YOLOv5 UAV object detection method to increase the detection accuracy of small-target UAVs. The method employs CBAM [7], which boosts information transmission within the Backbone network and improves its capacity to extract features. At the same time, to address the challenge of detecting flexible and small UAV targets, this paper introduces the BiFormer [8] attention mechanism, which is highly effective for small-target detection. This further enhances the accuracy of detecting small targets, specifically low-altitude small-target UAVs.

YOLOv5s algorithm
The YOLOv5 algorithm includes four versions that vary in network depth and width. This paper uses YOLOv5s as the benchmark model; Figure 1(a) illustrates its overall structure, which consists mainly of the Backbone, Neck, and Output.

CBAM
CBAM is a simple yet effective attention module for feedforward convolutional neural networks. Figure 2 illustrates its general structure. By adding CBAM to the Backbone of YOLOv5 [9], multi-scale feature fusion can be achieved, enhancing the training of feature information between channels. The improved backbone network can strengthen attention to important feature information through CAM and SAM, suppress attention to secondary information, and extract UAV feature information more effectively. The trained model is therefore more robust for UAV target detection. Equations (1) and (2) describe the complete CBAM computation:

F′ = M_c(F) ⊗ F    (1)
F″ = M_s(F′) ⊗ F′    (2)

where F is the input feature map, F′ is the feature information obtained by weighting F through CAM, F″ is the feature information obtained by weighting F′ through SAM, M_c(F) is the weight information generated by F after CAM, M_s(F′) is the weight information generated by F′ after SAM, and ⊗ denotes element-wise multiplication. After processing, the output retains the size of the original feature map.
CAM receives the input feature map. Its main objective is to compress the spatial dimensions and force the module to focus on the important details in the input image. SAM receives the feature information processed by CAM. Its main purpose is to keep the spatial dimensions unchanged while compressing the channel dimension so that the module attends to the location information of the target.
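The CAM and SAM steps described above, combined as in Equations (1) and (2), can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' exact implementation; the reduction ratio of 16 and the 7×7 spatial convolution are the defaults from the original CBAM paper and are assumptions here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: compress spatial dims (avg + max pool), produce per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """SAM: compress the channel dim, produce a per-position weight map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Apply CAM then SAM, weighting the feature map element-wise each time."""
    def __init__(self, channels):
        super().__init__()
        self.cam = ChannelAttention(channels)
        self.sam = SpatialAttention()

    def forward(self, f):
        f1 = self.cam(f) * f     # Eq. (1): F' = M_c(F) ⊗ F
        f2 = self.sam(f1) * f1   # Eq. (2): F'' = M_s(F') ⊗ F'
        return f2
```

Because both attention maps pass through a sigmoid, the output has the same shape as the input and every element is a down-weighted version of the corresponding input element.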

Biformer
Because the UAV's flight altitude can be high, the UAV target may appear very small in the image, and the detection accuracy of the YOLOv5s+CBAM algorithm may not meet the requirements of this paper. To address this, a Transformer with Bi-Level Routing Attention (BiFormer) structure is introduced into the Neck. As shown in Figure 3, it is used for detecting small targets, achieving more flexible computation allocation and feature perception.
BiFormer is a novel dynamic sparse attention mechanism built on the Transformer. Compared with the standard Transformer, BiFormer can filter out most non-critical key-value pairs at the coarse region level and retain only a small portion of routing regions, thereby removing redundant information from the structure and endowing it with dynamic query-aware sparsity [10]. Based on the features, attention is paid adaptively to a small number of relevant tokens without being distracted by other irrelevant tokens, achieving more flexible computation allocation and good small-object detection performance.
Figure 5 shows the structure of the Bi-Level Routing Attention method, which collects key-value pairs from the top-k relevant windows and skips the least important regions through sparsity, saving parameters and computation. The gathering step is given by Equation (3):

K^g = gather(K, I^r),  V^g = gather(V, I^r)    (3)

where K^g is the gathered key tensor, V^g is the gathered value tensor, and I^r is the region-level routing index. The method then applies fine-grained token-to-token attention to these gathered key-value pairs within the concentrated regions to obtain the output:

O = Attention(Q, K^g, V^g)    (4)
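The routing and gathering steps of Equations (3) and (4) can be illustrated with a deliberately simplified single-head sketch, assuming the tokens have already been partitioned into regions; the full BiFormer implementation adds multi-head projections and other details not shown here.

```python
import torch

def bi_level_routing_attention(q, k, v, topk):
    """Simplified single-head sketch of Bi-Level Routing Attention.
    q, k, v: (R, T, d) tensors -- R regions, T tokens per region, dim d.
    Returns an output tensor of the same shape as q."""
    # Step 1: region-level routing -- coarse affinity between region means,
    # keeping only the top-k most relevant regions per query region (I^r).
    qr = q.mean(dim=1)                         # (R, d) region-level queries
    kr = k.mean(dim=1)                         # (R, d) region-level keys
    affinity = qr @ kr.t()                     # (R, R) region-to-region scores
    idx = affinity.topk(topk, dim=1).indices   # (R, k) routing index I^r

    # Step 2: gather key/value pairs from the routed regions only (Eq. 3).
    kg = k[idx].flatten(1, 2)                  # (R, k*T, d)  K^g
    vg = v[idx].flatten(1, 2)                  # (R, k*T, d)  V^g

    # Step 3: fine-grained token-to-token attention on the gathered set (Eq. 4).
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ kg.transpose(1, 2) * scale, dim=-1)
    return attn @ vg                           # O = Attention(Q, K^g, V^g)
```

Because each query region attends to only k of the R regions, the token-level attention cost scales with k rather than with the whole image, which is the source of the sparsity described above.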

Experiment and analysis
Comparative tests were carried out on a self-built dataset to assess the improved model's ability to detect UAV targets. The experiments compared the YOLOv5s algorithm, the YOLOv5s+CBAM algorithm, and our algorithm. The self-built dataset consists of 2330 UAV flight images, including some small-target UAV images. The dataset is split 7:3 into a training set and a testing set, and the images measure 608 × 608 pixels. The images were manually labeled using LabelImg, and the number of training epochs is 500. The experimental environment is Ubuntu 18.04, CUDA 11.0, cuDNN 8.0, and PyTorch 1.7.
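A 7:3 train/test split of the kind described above can be sketched as follows; the function name and seed are illustrative assumptions, not part of the paper's pipeline.

```python
import random

def split_dataset(image_paths, train_ratio=0.7, seed=0):
    """Shuffle a list of labeled image paths and split it 7:3 into
    train/test subsets, as described for the 2330-image UAV dataset."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)   # fixed seed for reproducibility
    n_train = round(len(paths) * train_ratio)
    return paths[:n_train], paths[n_train:]
```

With 2330 images this yields 1631 training images and 699 testing images.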

Comparative Experiment of Target Detection
To validate the algorithm's performance and ensure a fair comparison, this work trains the three methods in the same PyTorch framework with identical training settings until the loss function converges. Table 1 shows the comparison results of the three algorithms. This article selects P (Precision), mAP50, and mAP50-95 as evaluation indicators: mAP50 denotes the mean Average Precision at an IoU threshold of 50%, and mAP50-95 denotes the mean Average Precision averaged over IoU thresholds from 50% to 95% in steps of 5%. The results indicate that the algorithm proposed in this work outperforms the original YOLOv5s: P for UAV targets increases by 3.4%, mAP50 by 2.0%, and mAP50-95 by 1.7%. Compared with the simple combination of YOLOv5s and CBAM, there is also an improvement, with a 1.0% increase in P, a 0.6% increase in mAP50, and a 0.3% increase in mAP50-95.
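The IoU criterion underlying the mAP50 and mAP50-95 indicators can be computed as below; this is a standard definition, shown here only to make the thresholds concrete.

```python
def box_iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2).
    mAP50 counts a detection as correct when IoU >= 0.5; mAP50-95
    averages AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The mAP50-95 threshold grid: 0.50 to 0.95 in steps of 0.05 (10 values).
thresholds = [0.50 + 0.05 * i for i in range(10)]
```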

Experimental Visualization Analysis
To validate the practical applicability of our algorithm for detecting UAVs, we evaluated its reliability using images not included in the original dataset.
An analysis of Figures 4 and 5 shows that adding CBAM to the YOLOv5s algorithm significantly increases the detection accuracy for tiny UAV objects. Additionally, as shown in Figure 6, placing BiFormer in the Neck further enhances the algorithm's ability to recognize tiny UAV targets. The analysis above demonstrates the high flexibility and increased accuracy of the improved algorithm over YOLOv5s in identifying small UAV targets. To address the challenge of detecting small targets, BiFormer is introduced to focus selectively on relevant tokens, which improves small-target detection accuracy. The algorithm's efficacy is evaluated on a dataset constructed specifically for this research, and the experimental results demonstrate that the proposed algorithm accurately identifies small UAVs.

Figure 1 (
b) shows the improved network structure diagram. The Backbone is mainly composed of CBS, C3, and SPPF modules and is mainly responsible for feature extraction from the input images. CBS is composed of Conv + BN + SiLU, and C3 is mainly composed of CBS and Bottleneck modules joined through Concat. The Neck is a structural fusion of FPN + PAN, mainly used to process the feature information extracted by the Backbone and to output fused features at three different resolutions for transmission to the Output. (a) YOLOv5s Structure (b) Improved YOLOv5s Structure. Figure 1. Algorithm Network Structure.
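The CBS building block described above (Conv + BN + SiLU) can be sketched in PyTorch as follows; the kernel size and stride defaults are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """CBS block from the YOLOv5s Backbone: Conv + BatchNorm + SiLU."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        # bias is omitted because BatchNorm supplies its own shift term
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```

The C3 module then stacks CBS blocks and Bottlenecks and concatenates their outputs, as described in the text.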

Figure 6.
Figure 6. The algorithm results of this article.

Conclusion
To address the poor accuracy of small-target detection in UAVs, this study presents an enhanced YOLOv5s algorithm with an attention mechanism. The proposed algorithm augments the network's capacity to extract image features and improves detection accuracy by incorporating CBAM.

Table 1 .
Comparison of the performance of various detection algorithms