End-to-end Remote Sensing Image Aircraft Target Detection and Fine-grained Recognition Framework

For practical remote sensing applications, this paper proposes an end-to-end framework for aircraft target detection and fine-grained recognition that performs both tasks accurately and quickly. The main network combines design ideas from target detection and fine-grained recognition methods based on candidate region extraction and visual attention, which ensures the accuracy of detection and recognition. To reduce the high false detection and missed detection rates for densely arranged targets, we propose a re-detection mechanism. To reduce the heavy computation of deep networks and improve real-time performance, we introduce depthwise separable convolution to optimize the network. Finally, a weight mapping idea based on transfer learning is adopted to ease the data labeling problem, which also benefits detection and fine-grained recognition. The results show that the proposed framework has good robustness, versatility, and efficiency in aircraft detection and fine-grained recognition tasks.


Introduction
Aircraft target detection and fine-grained recognition in high-resolution remote sensing images is important for military reconnaissance, strategic defense, and strikes, and the highly informationized modern military environment requires real-time and accurate target detection technology [1][2][3]. This paper proposes an end-to-end framework for aircraft detection and fine-grained recognition. The framework achieves high-precision detection and recognition results, handles densely arranged aircraft targets in remote sensing images, and has high real-time performance.
The end-to-end training method [4,5] facilitates optimization of the model and has the advantages of simpler training and stronger real-time performance. Current research includes relatively little work and experimentation on fine-grained recognition of aircraft models. Overcoming the technical difficulties in recognition and completing end-to-end recognition with high precision and high speed is challenging and of great significance. In addition, the dense arrangement of targets in remote sensing images, especially of many aircraft, can reduce the detection accuracy of aircraft. In an end-to-end algorithm framework, this further reduces the recognition accuracy. Therefore, this paper proposes a new end-to-end framework for aircraft target detection and fine-grained recognition in remote sensing images.
Firstly, to address multi-scale aircraft target detection against remote sensing backgrounds, a feature extraction module that uses receptive field enhancement for feature fusion is proposed. Secondly, the dense arrangement of aircraft targets reduces, to some extent, the detection accuracy of current mainstream detection methods, so a local re-detection module is designed to address this issue. Then, channel attention and spatial attention are combined for fine-grained target recognition, which addresses the difficulty of locating and extracting discriminative features. Finally, to reduce the excessive parameter count and computational complexity of complex deep networks, this paper introduces depthwise separable convolution, which greatly reduces the computational complexity of the model and improves real-time performance.
Section 2 first introduces the main framework of the algorithm in this paper and then introduces several key design ideas and structures. Section 3 introduces the experiments and the analysis of the results. Section 4 summarizes the work and significance of this paper.
The main process of the algorithm in this chapter is shown in Figure 1. Firstly, a candidate region extraction network [6] is used to generate many target candidate regions for the input image. The candidate regions are filtered and sorted to remove areas with high overlap and simple backgrounds. Secondly, the feature extraction module is used to extract and fuse global and local features. Then, fine-grained recognition of candidate regions is performed based on the attention mechanism, and the detection box coordinates are adjusted. Finally, the detection results are post-processed, and the final results are output to achieve the detection and recognition goals.
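The filtering and sorting of candidate regions described above can be illustrated with greedy non-maximum suppression. The following is a minimal sketch, not the paper's implementation: the function names, the (x1, y1, x2, y2) box format, and the default values of `iou_thresh` and `top_k` are assumptions made for illustration, and the removal of simple-background regions is not covered.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_candidates(boxes, scores, iou_thresh=0.7, top_k=300):
    """Sort candidates by score and greedily drop highly overlapping ones.

    Returns the indices of the kept boxes, best score first.
    iou_thresh and top_k are illustrative defaults, not values from the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
        if len(keep) == top_k:
            break
    return keep
```

For example, with boxes `(0, 0, 10, 10)`, `(1, 1, 11, 11)`, and `(20, 20, 30, 30)` scored 0.9, 0.8, and 0.7, a threshold of 0.5 keeps the first and third boxes, because the first two overlap with an IoU of about 0.68.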

Local re-detection mechanism for dense targets
Through the analysis of numerous images and experimental results, we found that in some data the aircraft targets are densely arranged, and some are even tightly connected or mutually occluded, which increases the difficulty of target localization and recognition [7,8]. Densely arranged targets are hard to distinguish and are prone to false and missed detections. To address this and further improve detection accuracy, this paper designs a local re-detection mechanism, which adds a local re-detection sub-network during the testing process. The structure of this sub-network is shown in Figure 3. The input is a feature map with detection box information that has been filtered and extracted through candidate regions. The image is enlarged through an upsampling layer so that the targets become relatively larger and more dispersed. Then, feature extraction is performed on these targets to achieve local re-detection. At the end of the testing process, NMS and coordinate regression are still used to correct the detection boxes and output the final detection and recognition results.
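The enlarge-then-re-detect idea can be sketched as follows. This is a simplified illustration under stated assumptions: it works on a plain 2-D array rather than a feature map, uses nearest-neighbour enlargement in place of the learned upsampling layer, and takes a hypothetical `detector` callable that returns `(box, score)` pairs in the coordinates of the enlarged crop. All names and the default `scale` are illustrative.

```python
def crop(image, region):
    """Cut region (x1, y1, x2, y2) out of a 2-D image stored as a list of rows."""
    x1, y1, x2, y2 = region
    return [row[x1:x2] for row in image[y1:y2]]


def upsample_nearest(patch, scale):
    """Enlarge a 2-D patch by an integer factor using nearest-neighbour copying."""
    out = []
    for row in patch:
        wide = [v for v in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def to_image_coords(box, region, scale):
    """Map a box found in the enlarged crop back to original image coordinates."""
    x0, y0 = region[0], region[1]
    return (x0 + box[0] / scale, y0 + box[1] / scale,
            x0 + box[2] / scale, y0 + box[3] / scale)


def redetect(image, region, detector, scale=2):
    """Re-run a detector on an enlarged crop of a dense region.

    detector is a hypothetical callable: enlarged patch -> [(box, score), ...].
    """
    enlarged = upsample_nearest(crop(image, region), scale)
    return [(to_image_coords(box, region, scale), score)
            for box, score in detector(enlarged)]
```

The boxes returned by `redetect` are already in image coordinates, so they can be merged with the first-pass detections before the usual NMS and coordinate regression steps.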

Model lightweighting based on depthwise separable convolution
As a deep learning network becomes deeper, its number of parameters and computational complexity increase significantly. This slows down inference and affects the real-time performance of detection and recognition. Therefore, to make the proposed detection and recognition framework faster and lighter, we replace ordinary convolution with depthwise separable convolution [9] in the deep network, dividing the ordinary convolution operation into two parts: depthwise convolution and 1 × 1 pointwise convolution.
Suppose a layer has an input feature map with M = 128 channels and spatial size D_F × D_F = 224 × 224, an output feature map with N = 64 channels, and a convolution kernel of size K × K = 3 × 3. Assuming the output keeps the spatial size of the input, the ratio of the computational cost of depthwise separable convolution to that of ordinary convolution is:

$$\frac{K^2 M D_F^2 + M N D_F^2}{K^2 M N D_F^2} = \frac{1}{N} + \frac{1}{K^2} = \frac{1}{64} + \frac{1}{9} \approx 0.13$$

From this, it can be seen that appropriately replacing ordinary convolutions in deep networks with depthwise separable convolutions can greatly reduce computational complexity, making the model lighter and faster.
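The ratio of about 0.13 can be checked numerically. The sketch below counts multiply-accumulate operations for the layer in the example, assuming stride 1 and padding that keeps the output at the input's spatial size; the function names are illustrative and not taken from the paper.

```python
def standard_conv_macs(c_in, c_out, k, h, w):
    """Multiply-accumulates of an ordinary k x k convolution on an h x w output."""
    return k * k * c_in * c_out * h * w


def separable_conv_macs(c_in, c_out, k, h, w):
    """Multiply-accumulates of a depthwise k x k plus a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Layer from the example: 128 input channels, 64 output channels,
# 3 x 3 kernel, 224 x 224 feature map.
ratio = (separable_conv_macs(128, 64, 3, 224, 224)
         / standard_conv_macs(128, 64, 3, 224, 224))
print(round(ratio, 2))  # 0.13
```

The spatial size cancels out of the ratio, so the saving depends only on the kernel size and the number of output channels.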

Training based on Transfer learning
Transfer learning aims to overcome the isolated learning paradigm by using knowledge gained on one task to solve similar or related problems. In deep learning research, most models that solve complex problems require large data samples. Given limited time and manpower, it is very hard to obtain a large amount of labeled data for supervised models. Therefore, transfer learning has been widely studied and applied. Some studies [10] show that transferring the pre-trained model, weight parameters, and features of prior tasks can effectively alleviate problems such as insufficient data, long training time, and low accuracy in related tasks. Therefore, this paper proposes a prediction layer weight transfer method, as shown in Figure 5. Firstly, the model is trained on the source domain dataset to obtain the prediction layer weights in the source domain. Then, the mapping transformation to the target domain is learned. Finally, the mapped weights are used for aircraft detection and recognition.
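One way to picture the weight mapping step is as a linear map applied to the source-domain prediction-layer weights. The sketch below is an assumption-laden illustration, not the paper's training procedure: it fits the map by ordinary least squares with NumPy, whereas the text does not specify how the mapping is learned, and the function names, array shapes, and the use of regression targets are all hypothetical.

```python
import numpy as np


def fit_weight_mapping(w_src, feats, targets):
    """Fit a linear map A so that the target-domain weights are A @ w_src.

    w_src:   (C_src, d) prediction-layer weights learned on the source domain
    feats:   (n, d) features of target-domain samples
    targets: (n, C_tgt) desired outputs, for example one-hot class labels
    Least squares is an illustrative stand-in for the unspecified learning step.
    """
    src_scores = feats @ w_src.T                            # (n, C_src)
    a_t, *_ = np.linalg.lstsq(src_scores, targets, rcond=None)
    return a_t.T                                            # (C_tgt, C_src)


def map_weights(w_src, a):
    """Apply the learned linear transformation to get target-domain weights."""
    return a @ w_src                                        # (C_tgt, d)
```

Because the target weights are expressed through the source weights, only the small matrix `A` has to be estimated from target-domain data, which is the appeal of this kind of mapping when labeled target data is scarce.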

[Figure 5 labels: extract weights; linear transformation to target domain weights]

Experimental setup
The experiment used the transfer method described in Section 2.4, with the DOTA2 dataset as the source domain. Firstly, the model was pre-trained on the DOTA2 dataset, and then the trained weights were transferred to the dataset used in this paper to continue training. During training, the input image scale is set to 640.
The paper evaluates the experiment from different perspectives using four evaluation indicators: accuracy, recall, mAP (mean average precision), and single image processing time.
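As a rough illustration of how such indicators are commonly computed, the sketch below derives precision and recall from detection counts and average precision from ranked detections. It assumes that "accuracy" here corresponds to detection precision, that each detection has already been matched to ground truth (for example by an IoU threshold), and that all-point interpolation is used; the paper does not state these details, and the function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve of one class (all-point interpolation).

    scores: confidence of each detection
    is_tp:  whether each detection matched a ground-truth object
    num_gt: number of ground-truth objects of this class
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each detection, best score first
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))
    ap, prev_recall = 0.0, 0.0
    for k, (recall, _) in enumerate(points):
        # Precision envelope: best precision at this recall or any higher recall.
        best = max(p for _, p in points[k:])
        ap += (recall - prev_recall) * best
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)
```

Single image processing time, the fourth indicator, would simply be the wall-clock inference time averaged over the test images.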

Experimental comparison and analysis
Firstly, we compared the algorithm with the Faster R-CNN and YOLOv4 models through experiments. The network framework without the dense-target local re-detection sub-network was used as the main algorithm in this paper. The experimental results in Table 1 show that, compared to the Faster R-CNN and YOLOv4 models, our algorithm achieves significant improvements in accuracy, recall, and mAP. Replacing the five deepest ordinary convolutional layers in our algorithm with depthwise separable convolution improves the running speed of the model. Although it is slower than the single-stage algorithm YOLOv4, it still has good real-time performance. Adding the local re-detection mechanism during the testing process is effective, showing that the local re-detection sub-network helps with densely arranged targets. Overall, the algorithm in this paper has high detection and recognition accuracy and good real-time performance.
Then, to compare the performance of the algorithm on aircraft targets in different scale ranges, this paper carried out grouped experiments and evaluations based on target scale. In Table 2, the evaluation indicators for small targets are the lowest, indicating that the detection and recognition of small targets is a difficult problem. When the target scale varies widely, the network design in this paper plays an important role, especially feature fusion and receptive field enhancement. For medium-sized aircraft targets, which account for a large proportion of the data, the model demonstrated excellent performance, indicating that the network framework in this chapter has good robustness and universality.
Figure 6 to Figure 8 visualize some of the experimental results of the algorithm. Figure 6 shows that when the illumination intensity or image brightness changes, which may affect the target features, the algorithm can resist the interference and has a certain degree of stability. Figure 7 shows the experimental results on samples with high inter-class similarity in fine-grained recognition, which are difficult to distinguish. The algorithm can discriminate such targets and detect and recognize them more accurately. Figure 8 shows the experimental results for multiple categories, illustrating the diversity of the samples and the universality and effectiveness of the algorithm on targets of different types and scales.

Conclusions
The paper proposes a novel framework for aircraft target detection and fine-grained recognition. The network framework supports end-to-end optimization and inference and provides high-precision detection and recognition of different aircraft targets in remote sensing images. Analysis of the results revealed that densely arranged targets are difficult to detect, so a local re-detection mechanism was proposed to re-detect such targets during the testing phase. Then, in response to the high computational complexity and low real-time performance of deep networks, ordinary convolutions were replaced with depthwise separable convolutions. Finally, a transfer learning training method that uses source domain information was adopted. A large number of experiments demonstrate the effectiveness of the modules and strategies. The stability and universality of our algorithm were demonstrated through quantitative comparison and visual display.

2 End-to-end aircraft detection and fine-grained recognition algorithm
2.1 Algorithm framework design

Figure 1. Flow of the detection and fine-grained recognition algorithm

Figure 2. Algorithm network structure diagram of this chapter
The network diagram of the algorithm is shown in Figure 2, where the RPN sub-network is the same as in Faster R-CNN. The feature extraction and fusion network is used to extract and fuse local and global features. The channel and spatial attention network is used to capture category attention and spatial position attention. A local re-detection network is designed to address false and missed detections of densely arranged targets. The dashed arrows indicate that, during the testing process, local re-detection of such targets can improve detection performance. Finally, after learning the dual tasks of classification and regression, the detection and recognition results are output.

Figure 3. Local re-detection sub-network
We conducted experiments to verify the effectiveness of this module. For data with densely arranged targets, adding a local re-detection sub-network during the experiment can greatly improve the detection accuracy of the targets. The experimental results in Figure 4 show that when densely arranged targets are not re-detected, the detection results are easily affected by the arrangement of the targets, leading to false detections and missed detections. The local re-detection mechanism in this paper can effectively mitigate this problem and improve detection accuracy to a certain extent.

Figure 5. Design of weight transfer prediction
3 Experiment and analysis

Figure 6. Display of experimental results on image brightness changes
Figure 7.

Figure 8. Display of experimental results on category diversity

Table 1. Comparison of experimental results of different models
Table 2. Comparison of experimental results at different scale ranges