Small Target Detection Algorithm Based on Wavelet Domain Features and Weighted Sample Allocation Strategy

Small target detection using UAV aerial photography has emerged as a popular research topic. Small target images have low resolution and complex backgrounds. In this paper, taking the single-stage target detection algorithm ATSS as its foundation, we propose SDW-Net, a small target detection network that combines deformable convolution with wavelet domain feature enhancement, weighting the samples of both the classification and regression tasks and exploiting the strong local feature representation ability of the discrete wavelet transform. First, the original features are decomposed with the discrete wavelet transform, and the weighted wavelet features are obtained through a deformable convolution block and a spatial-channel attention unit; second, the weighted features of the regular domain are recovered through the inverse wavelet transform; finally, the original features and the wavelet features are fused to enrich the details of the small target image and sent to the detection head for output. According to the experimental findings, compared with the ATSS algorithm, the proposed detection model achieves an average increase in prediction accuracy of 7.2% and a recall rate increase of 3.7%, and it alleviates missed and false detections, providing a new technical solution for small-sized target detection scenarios.


Introduction
The automatic and accurate detection, on computing platforms, of small target images taken at high altitude plays a significant role in unmanned aerial vehicle (UAV) applications and remote monitoring [1]. However, the background of small target images is complex, the visual features are not obvious, the available information is scarce, and detection is difficult, which easily leads to low detection accuracy and serious missed and false detections. Although deep learning-based object detection technology has become increasingly mature, its performance on small targets still faces challenges.
In recent years, scholars at home and abroad have conducted research on deep learning-based small target detection algorithms. Fan proposed a Half Wavelet Attention Block (HWAB) [2], which captures the detailed information of small target images at low complexity. Zhang [3] proposed Adaptive Training Sample Selection (ATSS) on the basis of a single-stage target detector; the statistical characteristics of the target are utilized for the automatic selection of positive and negative samples in object detection, which significantly improves the accuracy of detecting small targets.
Nevertheless, the structure of most attention modules is designed empirically, with little consideration given to integration with traditional algorithms. This results in poor interpretability of the network, and few attention modules are effective for small-sized target detection.
To solve the above problems, this paper builds on the single-stage ATSS and uses the learning ability of Swin-Transformer [4] for feature extraction, generating high-resolution feature maps and capturing global information. At the same time, starting from wavelet domain features fused with deformable convolution [5], we propose SDW-Net (Swin-Transformer based on Deformable convolution in the Wavelet domain), which focuses on the target area of interest and improves the detection accuracy of small-sized target detectors by enriching the semantic information of small target objects. The main contributions are summarized as follows:

ATSS algorithm
Target detection algorithms are categorized into anchor-box-based and anchor-free methods [6]. The current mainstream anchor-free approach is the central domain method [7]: objects are detected by laying anchor points, transforming detection into the classification and regression of anchor points and avoiding anchor-box-related computation. In regression, the distances from the predicted anchor point to the top, bottom, left, and right boundaries of the ground-truth box are estimated. The algorithm must address the issue of differentiating between positive and negative samples: negative samples are easy to classify and each has a small loss value, but their quantity is large, so their summed loss overwhelms the positive-sample loss.
In view of the unbalanced distribution of positive and negative samples, it is unnecessary to tile multiple anchor points at each position of the image to detect targets. Therefore, ATSS was proposed, which automatically selects positive and negative samples by analyzing the statistical properties of the target and greatly enhances the effectiveness of the model without additional computation or parameters. The allocation strategy of ATSS is shown in Figure 1. The average IoU of the target object serves as an indicator of the suitability of its predefined anchor boxes [8]. To obtain better candidate anchor points, ATSS uses the sum of the mean and the standard deviation of the candidate IoUs as the IoU threshold. As the structure diagram illustrates, placing the anchor point closer to the center of the target yields a higher-quality detection anchor box; therefore, the best candidate anchor point is the one nearest to the object's center.
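The mean-plus-standard-deviation thresholding described above can be sketched in a few lines. This is a minimal illustration, not the full ATSS pipeline: it assumes the IoUs of a ground-truth box with its k distance-nearest candidate anchors have already been gathered, and the function name is our own.

```python
import numpy as np

def atss_select(ious: np.ndarray) -> np.ndarray:
    """Select positive anchors for one ground-truth box, ATSS-style.

    ious: IoU between the GT box and its k distance-nearest candidate
    anchors (already gathered across FPN levels). The adaptive IoU
    threshold is the mean plus the standard deviation of these IoUs.
    Returns a boolean mask over the candidates.
    """
    threshold = ious.mean() + ious.std()
    return ious >= threshold

# Four candidates, one clearly better than the rest: the adaptive
# threshold (~0.53 here) keeps only the strong candidate.
mask = atss_select(np.array([0.1, 0.15, 0.2, 0.7]))
```

When the candidate IoUs are uniformly mediocre, the standard deviation is small and the threshold stays near the mean, so several candidates can still pass; a single dominant candidate raises the threshold and filters out the rest, which is the statistical behavior the paper relies on.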

SDW-Net Algorithm Framework
In view of the shortcomings of the above algorithms for detecting small target images, this paper improves the backbone network and detection head on the basis of the end-to-end ATSS target detection framework. The overall framework of the improved algorithm is presented in Figure 2. The backbone uses Swin-Transformer to extract features; the features extracted at multiple scales are sent to the feature pyramid for fusion to obtain five layers of features, and the head is connected to each feature layer. On top of the traditional head predictor, a half-wavelet attention block is added to enhance small target image details. Simultaneously, the classification and regression branches are trained with weighted positive and negative samples. The center branch of the regression model represents the distance weight to the object center and suppresses low-quality prediction boxes.

Swin-Transformer backbone network
Swin-Transformer proposes a hierarchical transformer in which features are learned with shifted windows, which solves the problem of information interaction between adjacent windows and allows flexible modeling at various scales to achieve global modeling capability. The model mainly includes Patch Partition, Linear Embedding, the SW block, Window Multi-head Self-Attention (W-MSA), and Shifted Window Multi-head Self-Attention (SW-MSA). The framework is shown in Figure 3. The network subsamples the input image and divides it into multiple non-overlapping pixel blocks; the spatial dimensions are reduced to 1/4 and the channels become 48. After the linear embedding layer, the channels are adjusted to 96, giving a first-layer feature map of size H/4×W/4×96. In the first layer, the dimension is adjusted by the pixel-block fusion layer to obtain hierarchical features. The SW basic module is composed of pixel fusion and Swin-Transformer blocks. After two SW basic modules, the second layer of feature maps is obtained, whose size is H/8×W/8×192. The third- and fourth-layer feature maps are generated similarly, using 6 and 2 consecutive SW basic modules respectively.
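The stage-by-stage shapes above follow a simple pattern: the stride doubles from 4 to 32 while the channel width doubles from 96. A sketch of this arithmetic (illustrative only; the function name and interface are ours, not the actual model code):

```python
# Sketch of the hierarchical feature-map shapes produced by a
# Swin-Transformer-style backbone for an H x W input, following the
# strides described above (4x, 8x, 16x, 32x) and base channel
# width 96. Purely illustrative arithmetic, not the real model.
def swin_feature_shapes(h: int, w: int, base_channels: int = 96):
    shapes = []
    for stage in range(4):
        stride = 4 * (2 ** stage)                 # 4, 8, 16, 32
        channels = base_channels * (2 ** stage)   # 96, 192, 384, 768
        shapes.append((h // stride, w // stride, channels))
    return shapes

shapes = swin_feature_shapes(224, 224)
# stage 1: H/4 x W/4 x 96; stage 2: H/8 x W/8 x 192; and so on.
```

This makes explicit why the second-layer map is H/8×W/8×192: each pixel-block fusion halves the resolution and doubles the channels.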
However, for the VisDrone dataset, because the images are shot at high altitude, the image resolution is low and small-sized objects account for a large proportion of each picture. To improve the resolution for small targets, this paper samples the lowest-layer feature map of the Swin-Transformer. High-resolution feature maps allow smaller anchor box sizes, making it easier to fit the position offset between the ground-truth and anchor boxes [9].

Improvement with a half-wavelet attention mechanism
Structure of the half-wavelet attention module. The wavelet transform itself has certain extraction and separation characteristics for image features, which can promote the learning of network models [10] and helps capture the details of small target images. At the same time, after the input feature undergoes the wavelet transform, the feature map is halved in size, which reduces the computation of the entire network and speeds up training. The half-wavelet-based attention module not only focuses on features in the wavelet domain but also enriches semantic information. The module adds a dual attention unit, consisting of channel and spatial attention, to extract features, and the features produced by the discrete wavelet transform are fed into a deformable convolution. Figure 4 illustrates the overall structure. First, the normal-domain part of the input feature (b, c, h, w) is retained; as shown in Figure 4, the feature is divided along the channel direction into two parts of shape (b, c/2, h, w), and the discrete wavelet transform is applied to one part to obtain the wavelet-domain feature f_w of shape (b, 2c, h/2, w/2). After the deformable convolution block, f_w is sent to the spatial-channel attention module to compute the weighted wavelet feature, and the inverse wavelet transform converts the weighted wavelet-domain feature back into a weighted normal-domain feature of shape (b, c/2, h, w). This wavelet-transformed feature is added to the other part of the original feature and passed through a 3×3 convolutional layer and a ReLU activation layer. The original-domain feature, after a 1×1 channel transformation, is added to the result to obtain the wavelet attention feature, and the output feature has shape (b, c, h, w).
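The channel bookkeeping above — (b, c/2, h, w) in, (b, 2c, h/2, w/2) out — follows from the wavelet transform producing four subbands per input channel. The sketch below uses a one-level Haar transform as a concrete instance (the paper does not specify which wavelet it uses; Haar is an assumption for illustration, and the subband naming is conventional):

```python
import numpy as np

def haar_dwt(x: np.ndarray) -> np.ndarray:
    """One-level 2D Haar DWT on NCHW features.

    (b, c, h, w) -> (b, 4c, h/2, w/2): the four subbands (LL, LH,
    HL, HH) are stacked along the channel axis. Applied to c/2
    channels, this yields the (b, 2c, h/2, w/2) shape in the text.
    """
    a = x[:, :, 0::2, 0::2]  # even rows, even cols
    b = x[:, :, 0::2, 1::2]  # even rows, odd cols
    c = x[:, :, 1::2, 0::2]  # odd rows, even cols
    d = x[:, :, 1::2, 1::2]  # odd rows, odd cols
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return np.concatenate([ll, lh, hl, hh], axis=1)

def haar_idwt(y: np.ndarray) -> np.ndarray:
    """Inverse of haar_dwt: (b, 4c, h/2, w/2) -> (b, c, h, w)."""
    ll, lh, hl, hh = np.split(y, 4, axis=1)
    a = (ll + lh + hl + hh) / 2.0
    b = (ll - lh + hl - hh) / 2.0
    c = (ll + lh - hl - hh) / 2.0
    d = (ll - lh - hl + hh) / 2.0
    n, ch, h2, w2 = a.shape
    out = np.zeros((n, ch, h2 * 2, w2 * 2), dtype=y.dtype)
    out[:, :, 0::2, 0::2] = a
    out[:, :, 0::2, 1::2] = b
    out[:, :, 1::2, 0::2] = c
    out[:, :, 1::2, 1::2] = d
    return out
```

The transform is exactly invertible, which is why the module can weight features in the wavelet domain and recover a normal-domain feature of the original spatial size without information loss.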
Deformable Convolution Network (DCN). The traditional CNN uses convolution with a fixed sampling grid, which cannot fully conform to the target shape and lacks modeling and expression capability for objects with variable shapes. In DCN, an offset (δx, δy) is added to the position of each sampling point in the convolution kernel, allowing sampling around the current position instead of being limited to regular lattice points; Figure 5 shows the implementation of DCN. In this paper, the half-wavelet attention module uses deformable convolution to spatially separate the features in the wavelet domain, which helps the features be learned more efficiently there. Replacing only some convolution operations with deformable convolutions enhances the model's performance while avoiding the additional computational cost and optimization difficulty caused by stacking DCN layers.
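Because the learned offsets (δx, δy) are fractional, deformable convolution must read the feature map at non-integer locations; this is done with bilinear interpolation. A minimal sketch of that core sampling step (illustrative only; a full DCN also learns the offsets and applies kernel weights):

```python
import numpy as np

def bilinear_sample(feat: np.ndarray, y: float, x: float) -> float:
    """Bilinearly interpolate a 2D feature map at a fractional
    location, the operation that lets deformable convolution sample
    off the regular grid. Out-of-range neighbors contribute 0."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < h and 0 <= xx < w:
                wy = 1.0 - abs(y - yy)  # weight falls off linearly
                wx = 1.0 - abs(x - xx)
                val += wy * wx * feat[yy, xx]
    return val
```

A regular convolution would read feat at integer grid points only; adding a learned offset to each kernel tap and sampling through this routine is what lets the kernel deform around an object's actual shape.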
Weighted Sample Allocation Strategy. Given an image I, suppose there are N ground-truth boxes and P predicted boxes based on predefined anchors. For each anchor candidate P_i, the foreground probability p̂_i and the regressed bounding box b̂_i are output for each class. The cost matrix C_i,π(i) then combines the classification confidence p̂_π(i) with the regression quality IoU(b̂_π(i), b_i), where C_i,π(i) represents the matching score of the π(i)-th predicted box for the i-th ground-truth box, and Ω_i represents the set of potential predicted boxes for the i-th ground-truth box.
During allocation, the K predictions with the highest cost values are selected at each FPN level, and a candidate is assigned as a foreground sample when its matching value exceeds an adaptive threshold computed from batch statistics. The classification branch prioritizes samples with high foreground probability, while the regression branch focuses more on regression quality. To divide more suitable samples between the classification and regression tasks and to separate their label assignment, this paper introduces a hyperparameter α∈[0, 1] to balance the two effects. Following [11], the sample weights for the two tasks are set to 0.5 and 0.8 respectively, which yields better training results. The classification and regression tasks are thus decoupled and trained separately, which better supports the detection of small objects.
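One common way such an α balances classification confidence against regression quality is a geometric combination; the sketch below illustrates that idea. This is an assumption for illustration, not the exact formulation of [11] or of this paper:

```python
import numpy as np

def matching_score(p_hat: np.ndarray, iou: np.ndarray,
                   alpha: float) -> np.ndarray:
    """Illustrative alpha-weighted matching score combining the
    foreground probability p_hat and the regression quality (IoU)
    geometrically. alpha in [0, 1] shifts emphasis between the two
    tasks: alpha=1 ranks by classification only, alpha=0 by IoU only.
    The exact weighting used in the paper / in [11] may differ."""
    return (p_hat ** alpha) * (iou ** (1.0 - alpha))

p_hat = np.array([0.9, 0.4])   # confident but loosely localized, vs.
iou = np.array([0.3, 0.8])     # less confident but tightly localized
cls_rank = matching_score(p_hat, iou, alpha=0.5)
reg_rank = matching_score(p_hat, iou, alpha=0.8)
```

Using different α values per branch means the two tasks can each be assigned the candidates that best serve them, which is the separation the paper exploits.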

Dataset Introduction
To confirm the viability and precision of the algorithm, this paper conducts experiments on the VisDrone2019 dataset [12], collected by the Machine Learning and Data Mining Laboratory of Tianjin University with various drone cameras; 6,471 images are used for training, and 548 for validation and testing.

Training program
All experiments in this paper are run on a GPU server in a Linux development environment with CUDA 10.1, PyTorch 1.8, a Tesla V100, and 32 GB of video memory.
The training parameters are as follows: the image size is 1333×800, and stochastic gradient descent (SGD) is used to learn and update the network parameters, with momentum 0.9 and weight decay 0.0001. Gradient clipping is applied during training to prevent gradient explosion. The learning rate starts at 0.005 and is decayed during epochs 18 to 22. The backbone network loads the pre-trained weights of Swin-Transformer-Tiny. Finally, we use soft non-maximum suppression (Soft-NMS) [13] to eliminate overlapping detection boxes, with the confidence threshold set to 0.6, generating the final top 100 plausible predictions for each image.
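Unlike hard NMS, Soft-NMS decays the scores of overlapping boxes instead of discarding them outright, which matters in the dense scenes of UAV imagery. A minimal sketch of the Gaussian variant (function names and the toy boxes are ours; the paper's exact σ and implementation may differ):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) \
        + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Gaussian Soft-NMS sketch: boxes overlapping the current top
    box have their scores decayed by exp(-iou^2 / sigma) rather than
    being removed. Returns surviving indices in final score order."""
    scores = scores.astype(float).copy()
    idxs = np.arange(len(scores))
    keep = []
    while len(idxs) > 0:
        top = idxs[np.argmax(scores[idxs])]
        keep.append(int(top))
        idxs = idxs[idxs != top]
        if len(idxs) == 0:
            break
        ious = np.array([iou(boxes[top], boxes[i]) for i in idxs])
        scores[idxs] *= np.exp(-(ious ** 2) / sigma)
        idxs = idxs[scores[idxs] > score_thr]  # prune tiny scores
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]],
                 dtype=float)
order = soft_nms(boxes, np.array([0.9, 0.8, 0.7]))
```

In the toy example, the second box heavily overlaps the first, so its score is decayed below the disjoint third box but the detection is not deleted; for crowded small targets this preserves true positives that hard NMS would suppress.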

Experimental results
Comparison of detection results of mainstream detection algorithms on the VisDrone2019 dataset. At present, common small target detection models include VFNet, YOLOv5, YOLOv7, and YOLOX. In general, the more complex the model, the greater the amount of computation and the higher the detection accuracy. Therefore, this paper selects the above classic deep network models for small-sized target detection experiments, and also cites the detection results of different algorithms from the multi-scale segmentation attention UAV aerial image target detection work published by Mao et al. [14]. Table 1 illustrates the results obtained from the experiments. To test the algorithm's scientific rigour and effectiveness, six sets of experiments (all loaded with pre-trained weights) were designed for comparative analysis. As shown in Table 1, on the small target dataset, the ATSS model has a greater accuracy advantage than the classic YOLOv5-L model, with an increase of 1.0%. For the algorithm based on the improved ATSS network, precision increases by 7.2% and the recall rate increases by 3.7%, which proves that the SDW-Net model has a greater advantage in the task of UAV image target detection.
Ablation experiments of different improvement schemes. To test the efficacy and scientific validity of the backbone feature extraction network, the adaptive sample allocation strategy, and the half-wavelet attention module based on deformable convolution, this paper uses the ATSS algorithm as the baseline and changes one variable in each group of experiments to conduct ablation experiments. The results under evaluation indicators such as mAP50 are shown in Table 2. As shown in Table 2, on the basis of the baseline ATSS model, the feature extraction ability of the self-attention-based Swin-Transformer network is superior to that of the CNN-based ResNet50 network, an increase of 1.6%. On this basis, for the head predictor to assign weighted samples to the classification and regression tasks, this paper conducts experiments following [11], so that each task is assigned its best samples for prediction; at this point, the accuracy on the VisDrone dataset is the highest, reaching 46.8%. Meanwhile, adding the half-wavelet attention increases the parameter count, but mAP50 increases by 0.8% and the recall rate by 0.5%.
A picture is randomly selected from the validation set and inferred with the algorithm models in this paper. The group of pictures in Figure 6 shows the detection results of the original ATSS network, the ATSS network with the Swin backbone, the Swin-weighting network (Swin backbone plus the weighting strategy), and the improved SDW-Net network (Swin-Transformer + weighted sample allocation + half-wavelet attention mechanism). In Figure 6, the model trained with the original ATSS network has many missed and false detections: for example, in picture (a), the motorcycle next to the trash can and the motorcycle hidden in the large tree at the lower right are missed, and the dense crowd in the distance at the upper right is also not detected. With the Swin-Transformer backbone network and the adaptive sample allocation strategy, some missed targets can be detected correctly; the improved model's inference is superior to the original ATSS network, but the tricycle at the upper right is still not detected. The inference of the SDW-Net network on the validation pictures is further improved, and the above-mentioned missed targets are all detected correctly. It can be seen that SDW-Net can effectively alleviate some of the missed-detection and false-detection issues.

Conclusion
This paper presents an algorithm for detecting small targets that combines wavelet features with a weighted sample allocation strategy. It uses Swin-Transformer-Tiny to extract features, weights the samples for the classification and regression tasks, exploits the feature separation and extraction ability of the discrete wavelet transform by feeding wavelet features into deformable convolution to learn the detailed features of small targets, removes duplicate detections of dense targets with Soft-NMS, and combines data enhancement strategies and related optimization methods to obtain a model with superior performance. Compared with the original ATSS algorithm, the SDW-Net network improves accuracy by 7.2%, effectively improving the target detection algorithm's ability to perceive the details of small target objects. However, this work is only verified on the VisDrone2019 dataset; follow-up work needs to verify whether better results can be achieved on other small target datasets and to enhance the algorithm's versatility.

Figure 3. The overall structure of the Swin-Transformer

Figure 5. The implementation process of DCN

Figure 6. Partial detection results on the test set

Table 1. Results of traditional small-sized target detection models