Remote-sensing Small-target Detection Based on Feature-dense Connection

To address the problem that remote-sensing images contain many small targets that are difficult to detect, this paper proposes a detection model, DenseYOLOv5, aimed at practical applications. DenseYOLOv5 is based on YOLOv5s and adds a small-target detection head, P2, together with its feature-fusion branch to improve small-target detection performance. To address the loss of small-target semantic information caused by repeated upsampling in YOLOv5s, DenseYOLOv5 reconstructs the feature pyramid network (FPN) structure and incorporates dense connections. In addition, DenseYOLOv5 uses transposed convolution for upsampling to further improve small-target detection. DenseYOLOv5 achieves better detection results with low memory and computational overhead and is therefore well suited to practical use.


Introduction
Remote-sensing image target detection is one of the research hotspots in the field of remote-sensing applications and has been widely used in urban planning [1], agricultural management [2], intelligent transportation [3], national defense and the military [4], environmental monitoring [5], and other fields. At present, for large and medium-sized targets, deep-learning-based remote-sensing detection methods can usually achieve good results. However, because small targets occupy little area in the image, provide little usable feature information, and have weak feature representations, common remote-sensing detection models are often unsuitable for small-target detection, and the detection of small targets in remote-sensing images remains poor [6].
Limited by the physical size of targets and the spatial resolution of remote-sensing images, small targets have always been a very important class of targets in remote-sensing imagery, such as vehicles and ships in satellite images and pedestrians in UAV images. In recent years, given the difficulty of remote-sensing small-target detection, many scholars have proposed improvements from various directions: introducing a transposed-convolution module combined with low-level features [7], enhancing feature extraction with ECA attention [8] and contextual-information reasoning [9], and generating super-resolution images with GAN networks [10], bidirectional convolutional networks [11], and perception loss with texture-matching loss [12]. These efforts have achieved some success. From the perspective of practical application, maximizing small-target detection accuracy while guaranteeing real-time detection in remote-sensing images is of great significance for fully exploiting the application potential of remote-sensing imagery, and can be expected to promote the development of related fields.
At present, the YOLOv5 algorithm is mature and widely used in industry, offering high detection efficiency while maintaining good detection performance. Based on YOLOv5 6.0, this paper proposes an object detection algorithm, DenseYOLOv5, which performs feature fusion through dense connections. DenseYOLOv5 is optimized in three parts: a small-target detection head and its feature-fusion branch, a dense feature fusion pyramid (DFPN), and upsampling by transposed convolution. It is verified on the remote-sensing small-target dataset AI-TOD [13]. Experiments show that the proposed algorithm achieves good detection results while maintaining real-time detection efficiency.

Re-design of the small-target detection head and anchor boxes
The YOLOv5 model uses only the C5, C4, and C3 feature layers, with P5 (large), P4 (medium), and P3 (small) detection heads. The last three layers of the backbone contain only a small amount of small-target location information, and much of it is lost during convolutional downsampling. To address this loss of small-target feature information, the upsampling part of the FPN is extended and combined with the C2 layer, which retains richer small-target location information, and a P2 head is added to improve small-target detection. In addition, the prior boxes in YOLOv5 are generated from the COCO dataset by an adaptive anchor clustering algorithm, which suits most detection datasets but not the small targets of remote-sensing images. In this paper, clustering is instead performed on the target dataset itself to obtain more suitable prior boxes.
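YOLO-style anchors are commonly re-estimated by clustering the ground-truth box sizes of the target dataset under a 1 − IoU shape distance. The paper does not give its exact clustering procedure, so the NumPy sketch below illustrates the common approach; the function name, k = 12 (three anchors for each of the four detection heads), and the iteration count are assumptions for illustration:

```python
import numpy as np

def kmeans_anchors(wh, k=12, iters=100, seed=0):
    """Cluster ground-truth box sizes (w, h) into k anchors.

    Assignment uses IoU between corner-aligned box shapes, so only
    width/height matter, not position.
    """
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every anchor (broadcast to N x k).
        inter = np.minimum(wh[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(wh[:, None, 1], anchors[None, :, 1])
        union = wh[:, None, 0] * wh[:, None, 1] + \
                anchors[None, :, 0] * anchors[None, :, 1] - inter
        assign = np.argmax(inter / union, axis=1)  # best-matching anchor
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    # Sort by area so anchors map to detection heads small -> large.
    return anchors[np.argsort(anchors.prod(axis=1))]
```

On a dataset such as AI-TOD, where 86% of targets are under 16 pixels, the resulting anchors are far smaller than the COCO defaults, which is the motivation for re-clustering.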

The dense feature fusion network (DFPN)
In YOLOv5, the PAN structure is used for feature fusion, as shown in Figure 2. After the P2 detection head and its feature-fusion branch are added, continuous up- and downsampling causes a large amount of feature-information loss, making it difficult to transfer high-level semantic information and low-level spatial information intact. Therefore, this paper adds a dense feature-connection structure to the FPN. As shown in Figure 2, DenseFusion modules inserted between layers supplement feature information to preserve its integrity. Skip connections are added to the bottom-up structure to pass features from the backbone to the bottom-up feature pyramid, compensating for the loss of spatial feature information. The DenseFusion module designed in this paper is shown in Figure 3. The two feature layers before and after sampling are restored to feature maps of the same size by sub-pixel convolution [14], then fully fused by concatenation. Finally, feature extraction is carried out by a C3 block.
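Assuming a PyTorch implementation, the DenseFusion steps described above — sub-pixel convolution to equalize spatial size, concatenation, then re-extraction — might be sketched as follows. The channel sizes and the plain conv block standing in for YOLOv5's C3 module are illustrative assumptions, not the paper's exact layers:

```python
import torch
import torch.nn as nn

class DenseFusion(nn.Module):
    """Sketch of a DenseFusion-style block: upsample the deeper
    (spatially smaller) feature map with sub-pixel convolution,
    concatenate with the shallower map, and re-extract features."""

    def __init__(self, c_deep, c_shallow, c_out, scale=2):
        super().__init__()
        # Sub-pixel convolution: expand channels by scale^2, then
        # rearrange channels into spatial resolution (PixelShuffle).
        self.subpixel = nn.Sequential(
            nn.Conv2d(c_deep, c_deep * scale ** 2, kernel_size=1),
            nn.PixelShuffle(scale),
        )
        # Simple conv block as a stand-in for YOLOv5's C3 module.
        self.fuse = nn.Sequential(
            nn.Conv2d(c_deep + c_shallow, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, deep, shallow):
        up = self.subpixel(deep)  # now matches `shallow` in H and W
        return self.fuse(torch.cat([up, shallow], dim=1))
```

For example, fusing a 20×20 deep map with a 40×40 shallow map yields a 40×40 output carrying information from both levels, which is the "supplemented" feature the dense connections rely on.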

Upsampling using transposed convolution
In YOLOv5, the top-down upsampling in the FPN uses nearest-neighbor interpolation, which has few parameters and is fast, but its quality is limited: considerable feature-map information is lost after upsampling, which is unfavorable for small-target detection that relies on shallow feature information. To mitigate this, we replace nearest-neighbor interpolation with learnable transposed convolution [15] to reduce the information loss of small targets in the upsampled feature maps.
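In PyTorch terms, the swap amounts to replacing the parameter-free `nn.Upsample` with a learnable `nn.ConvTranspose2d` of matching scale; the channel count below is an illustrative assumption:

```python
import torch
import torch.nn as nn

# YOLOv5's default 2x upsampling: parameter-free nearest-neighbor.
nearest = nn.Upsample(scale_factor=2, mode="nearest")

# Learnable replacement: a 2x transposed convolution whose kernel is
# trained with the rest of the network. kernel_size=2 with stride=2
# exactly doubles H and W without overlap artifacts.
transposed = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)

x = torch.randn(1, 256, 20, 20)
assert nearest(x).shape == (1, 256, 40, 40)
assert transposed(x).shape == (1, 256, 40, 40)  # same output size
```

Both produce the same output shape, so the module is a drop-in replacement in the FPN; the difference is that the transposed-convolution weights can learn to preserve small-target detail instead of merely replicating pixels.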

Datasets
To demonstrate the correctness and effectiveness of the proposed small-target detection method, the AI-TOD dataset is used for experimental verification. AI-TOD contains eight classes of small targets, with a total of 28,036 images and 70,621 instance targets. The image size is 800×800 pixels. The largest targets are smaller than 64 pixels, 86% of targets are smaller than 16 pixels, and the mean target size is 12.8 pixels, much smaller than in other remote-sensing image datasets.

Experimental details and evaluation metrics
The experiments were run on Ubuntu 20.04 with Python 3.7, PyTorch 1.8, and CUDA 11.1. All models were trained with dual-card distributed hybrid training on two NVIDIA GeForce RTX 3060 GPUs. To analyze the detection performance of the proposed method quantitatively, mAP0.5, mAP0.5:0.95, model weight size, and detection speed (FPS) were adopted as evaluation metrics.

Ablation experiment
To analyze and verify the effect of each proposed improvement on small-target detection in remote-sensing images, YOLOv5s is used as the baseline model, and ablation experiments are conducted both by comparing each component individually and by adding them incrementally. The results are shown in Table 1, from which the following conclusions can be drawn. The weight of the proposed model is 17.9 MB and its detection speed is 41 FPS. Compared with YOLOv5s, the model weight increases by 3.3 MB and the detection speed decreases by 17 FPS, but mAP0.5 increases by 9.9% and mAP0.5:0.95 increases by 5.3%. Overall, the proposed model effectively improves the mAP of small-target detection while preserving real-time performance.

Comparative experiment
To further illustrate the effectiveness and superiority of the proposed model, several mainstream detection models are selected for two sets of comparative experiments. To ensure fairness, the dataset-specific clustering algorithm is used to generate anchor boxes, and "aut" denotes an adaptive anchor box. The settings are as follows: Experiment 1 compares against mainstream lightweight models; Experiment 2 compares against mainstream standard models. As shown in Table 2, with an input size of 800×800 pixels, the weight of DenseYOLOv5 is 17.9 MB, which meets the requirements of real-time detection. Meanwhile, its mAP0.5 is 0.9% and 2.1% higher than that of YOLOv3 and Scaled-YOLOv4, respectively, and similar to that of YOLOv5l. This shows that high detection accuracy can be obtained while maintaining real-time detection, which is of high application value.

Model validity analysis
To fully illustrate the actual detection performance of the proposed method in different scenarios, the four most representative and difficult scenarios in the AI-TOD test set are selected in this section. As Figures 4 to 7 show, DenseYOLOv5 achieves good detection results in four types of complex scenes, namely dense scenes, thin cloud cover, night scenes, and ground-object occlusion, indicating that DenseYOLOv5 has good robustness.

Table 1. The ablation experiment on the AI-TOD dataset

Table 2. Comparative experimental results of different lightweight models