Military Unmanned Equipment Image Target Recognition Method based on Improved Deep Learning

Target recognition for military unmanned equipment is currently a research hotspot and trend in the field of military intelligence. The small sample size and complex recognition scenarios of military unmanned equipment image datasets result in low recognition accuracy. This paper proposes a military unmanned equipment image target recognition method based on improved deep learning: Faster R-CNN is adopted as the target recognition framework, the K-means++ algorithm is used to cluster the annotation boxes of a custom dataset, and OHEM is added to the framework to improve the network's recognition accuracy on hard-to-recognize samples. The proposed algorithm reaches an accuracy of 93.8%, which is 2.8% higher than the YOLOv5 algorithm, providing an improved deep learning method for military unmanned equipment image target recognition.


Introduction
Environmental perception and autonomous target recognition are among the most important core technologies of intelligent equipment. Sensors, cameras, and other devices capture large numbers of images of various types and resolutions across different times and places, from which the category and location of military targets can be obtained. Deep-learning-based object detection models fall into two main types. Two-stage networks, represented by Faster R-CNN, have high accuracy but slow training and prediction speed [1][2][3][4]. One-stage networks, represented by YOLO [5][6][7] and SSD [8], directly regress the category and bounding box of the detected target in the output layer, but have lower accuracy on complex scenes and small targets. The image features of military unmanned equipment are complex and diverse, and the similarity between target and background is high, which places high demands on accuracy.
To address the problems of limited datasets, low accuracy, and slow speed in image target recognition of military unmanned equipment in complex scenes, this paper first constructs a military unmanned equipment target dataset covering Military Unmanned Aerial Vehicles (Military UAV), Military Unmanned Ground Vehicles (Military UGV), Military Unmanned Surface Vehicles (Military USV), and Military Unmanned Underwater Vehicles (Military UUV). Faster R-CNN is then adopted as the network framework for military unmanned equipment image target recognition. On this basis, the network is improved by using the K-means++ algorithm [9] to cluster the annotation boxes of the custom dataset, and OHEM (Online Hard Example Mining) [10] is added to the framework to improve recognition accuracy on hard-to-recognize samples.

Faster R-CNN Framework
Faster R-CNN develops from Fast R-CNN by adding an RPN (Region Proposal Network) to form a complete network structure. Faster R-CNN consists of four parts: a feature extraction network, the RPN, an ROI Pooling module, and a classification and regression network. The model structure is shown in the figure. The input image is passed through the feature extraction backbone to obtain a feature map, which is fed into the region proposal network to generate multiple candidate target boxes. ROI Pooling unifies the candidate region feature maps to a fixed size, and finally the candidate boxes are classified and their position and size are refined by regression.

Improved Faster R-CNN
To address the drawbacks of Faster R-CNN detection described above, a new method for military unmanned equipment image target recognition based on improved deep learning is proposed, achieving better recognition performance than the original Faster R-CNN.
As shown in Figure 1, the improved Faster R-CNN resizes military unmanned equipment images and feeds them into MobileNetV2. Customized anchor boxes are designed using the K-means++ clustering algorithm. The output feature map is processed by the region proposal network to obtain initial candidate boxes. After ROI pooling, the feature maps are unified in size, and finally the candidate boxes are classified and regressed. In addition, the online hard example mining strategy is applied to this framework. The following is a detailed analysis of the improved Faster R-CNN.

K-means++ Clustering
Anchor boxes are prior boxes preset in the network before training. The anchor boxes in the original Faster R-CNN were derived empirically from the annotations of public datasets such as VOC 2007: nine anchor boxes in total, with scales of 128, 256, and 512 and aspect ratios of 0.5, 1, and 2. The anchor box values directly affect training speed and accuracy. For the military unmanned equipment target recognition dataset, the scale and aspect ratio of the annotation boxes differ significantly from those of general public datasets. Therefore, this article uses the K-means++ algorithm to cluster the annotation box parameters and compute anchor boxes suited to military unmanned equipment image target recognition.

The K-means++ clustering algorithm is an improvement of K-means. K-means generates cluster centers randomly, which strongly affects clustering performance: random initialization can lead to slow convergence or even non-convergence, and often traps the algorithm in a local optimum rather than the global optimum. To address this, K-means++ improves the initialization by generating cluster centers one at a time, choosing each new center with probability proportional to its squared distance from the nearest center already selected, so the initial centers are spread out over the data.

Based on the clustering results, anchor boxes of three different scales are designed according to target size, and three different aspect ratios are designed according to the aspect ratios of the annotation boxes, giving nine anchor box types in total that match the basic sizes and proportions of the annotation boxes in the dataset: small prior anchor boxes for detecting small targets, and large prior anchor boxes for detecting large targets. The specific anchor box parameters are shown in Table 1.
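The clustering step can be sketched as follows. This is a toy example with synthetic box sizes; a real run would read the (width, height) pairs from the dataset's VOC-format annotation files, and the three cluster counts per group are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (width, height) pairs of annotation boxes, in pixels,
# drawn around three size regimes to mimic small/medium/large targets.
rng = np.random.default_rng(0)
boxes_wh = np.vstack([
    rng.normal((40, 30), 5, (50, 2)),     # small targets
    rng.normal((140, 100), 15, (50, 2)),  # medium targets
    rng.normal((360, 240), 30, (50, 2)),  # large targets
])

# init="k-means++" seeds each new centre far from existing ones, which
# avoids the poor local optima of purely random initialization.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
km.fit(boxes_wh)
anchors = km.cluster_centers_.round().astype(int)
print(anchors)  # three (width, height) anchor templates
```

The resulting cluster centers replace the hand-set scales and ratios of the original Faster R-CNN, so the priors track the dataset's actual box statistics.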

OHEM
In military unmanned equipment image target recognition, some detected targets have small sample sizes or closely resemble the background, so the samples can be divided into easy-to-detect and hard-to-detect samples. The loss values of hard-to-detect samples are relatively large.
If hard-to-detect samples can be selected during training and trained further, the network training becomes more efficient. Therefore, the Online Hard Example Mining (OHEM) strategy is used to improve the Faster R-CNN algorithm. OHEM trains region-based convolutional neural networks by automatically selecting hard examples instead of tuning hyper-parameters, thereby improving detection efficiency. The steps of the OHEM strategy are: first, select hard-to-detect samples based on the size of the loss, and then use the samples that most affect the final detection accuracy for further gradient descent training.
As shown in the figure, the specific operation is as follows: the final classification and regression network is duplicated into two copies, NetA and NetB, both connected to the ROI Pooling layer. NetA performs forward propagation only, with no backpropagation, and is used to compute sample losses as a reference. NetB performs normal forward and backward propagation: hard-to-detect samples are input, losses are computed, and the network parameters are updated. As the dashed box in the figure shows, NetA and NetB have the same structure and share weights. After computing the losses, NetA applies non-maximum suppression (IoU threshold 0.6) to the candidate boxes with larger losses and passes them to NetB, which computes the loss, backpropagates, and updates the network parameters. FC in the figure denotes a fully connected layer.
When a class has few samples and low detection accuracy, the loss computed by NetA is large, so such images are more likely to be fed into NetB, reducing the miss rate. Applying the OHEM strategy to Faster R-CNN effectively mitigates the low detection rate of individual targets caused by uneven sample sizes in the military unmanned equipment image dataset, further improving detection accuracy.
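The NetA/NetB selection step can be sketched in isolation (standalone toy code, not the paper's implementation): NetA's read-only pass scores every candidate box by its loss, and only the hardest boxes reach NetB's backward pass.

```python
import torch
import torch.nn.functional as F

def ohem_select(cls_logits, labels, keep):
    """NetA's role: score each RoI by its classification loss (forward
    only, no gradients) and return the indices of the `keep` hardest."""
    with torch.no_grad():
        per_roi_loss = F.cross_entropy(cls_logits, labels, reduction="none")
    return torch.topk(per_roi_loss, keep).indices

# 8 candidate RoIs over 5 classes; keep the 3 hardest.
torch.manual_seed(0)
logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
hard_idx = ohem_select(logits, labels, keep=3)

# NetB's role: backpropagate the loss of the hard RoIs only.
loss = F.cross_entropy(logits[hard_idx], labels[hard_idx])
```

In the full framework the same weights serve both roles, so the top-k selection costs one extra forward pass while the backward pass shrinks to the hard examples only.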

Experiment
Military unmanned equipment images were collected from the internet, covering Military UAV, Military UGV, Military USV, and Military UUV. After screening, a four-class military unmanned equipment image dataset of 500 images was obtained in VOC 2012 format. A single image may contain multiple military unmanned equipment targets at different scales, giving 623 annotations in total. The experiment was run on Windows 10 with an AMD Ryzen 7 5800H CPU, 16 GB of memory, an NVIDIA GeForce RTX 3070 GPU with 8 GB of graphics memory, and Python 3.7. CUDA 11.0 and cuDNN 8.1 were used for accelerated computation, and the deep learning framework was PyTorch 1.7.1. The training parameters are as follows: after pre-training, the model was trained for 50 epochs on the military unmanned equipment image dataset. The Adam optimizer was used with an initial learning rate of 0.005; as training epochs increase, the learning rate is gradually decreased to prevent overfitting. The batch size, limited by GPU memory, was set to 4. TensorboardX was used for training visualization, displaying real-time changes in loss and accuracy.
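The reported training schedule can be sketched as below. The exact decay rule is not stated in the text, so the StepLR schedule (halving every 10 epochs) is an assumption, and the linear layer merely stands in for the detector.

```python
import torch

# Reported hyper-parameters: Adam, initial lr 0.005, 50 epochs, batch
# size 4; the decaying schedule below is an illustrative assumption.
model = torch.nn.Linear(10, 5)  # stand-in for the detection network
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=10, gamma=0.5)

for epoch in range(50):
    # ... one pass over the 500-image dataset in batches of 4 ...
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # decayed learning rate
```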
In the experiment, mAP (mean average precision) was used as the detection accuracy indicator; the larger the mAP value, the higher the detection accuracy.
The precision P and recall R of target recognition are

P = tp / (tp + fp),  R = tp / (tp + fn)

where tp is the number of true targets detected as targets, fp is the number of non-targets detected as targets, and fn is the number of true targets missed by detection. AP is the area under the precision-recall curve of a class, and mAP is the mean of the AP values over all classes.
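Worked numerically with illustrative counts (not the paper's results):

```python
# Precision and recall from tp/fp/fn counts, as defined above.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # fraction of detections that are correct
    recall = tp / (tp + fn)     # fraction of true targets that are found
    return precision, recall

# e.g. 90 correct detections, 10 false alarms, 15 missed targets
p, r = precision_recall(tp=90, fp=10, fn=15)
print(round(p, 3), round(r, 3))  # 0.9 0.857
```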
In order to verify the effectiveness of the improved Faster R-CNN algorithm, the recognition results of three networks YOLOv5, Faster R-CNN, and the improved Faster R-CNN were compared.
Table 2 compares the detection results of the three network detection methods on military unmanned equipment images. Compared with the single-stage YOLOv5 framework, the detection speed of the proposed method is slightly lower, but the mean average precision is 2.8% higher. The two-stage improved Faster R-CNN method is clearly more accurate than the single-stage network; when the detection speed requirement is not high, the improved Faster R-CNN is the more appropriate choice. Compared with the original Faster R-CNN, the mean average precision of our algorithm improves by 5.4%. This result indicates that combining customized anchors with the OHEM strategy improves accuracy on images where targets closely resemble the background. Among the methods compared, the improved Faster R-CNN proposed in this paper achieves the highest detection accuracy, with an mAP of 93.8%. Figure 2 shows the loss and learning rate of the improved Faster R-CNN.
As shown in Figure 3, the recognition results of the three methods are presented; each detection includes the category, specific location, and confidence score.

Conclusion
Based on the practical needs of target recognition in military unmanned equipment images, an improved Faster R-CNN algorithm was studied for the classification and localization of military unmanned equipment images. The improved algorithm was compared experimentally with the YOLOv5 and original Faster R-CNN object detection frameworks. The results show that the detection accuracy of the improved Faster R-CNN method increased from 88.4% to 93.8%, 2.8% higher than the YOLOv5 algorithm, providing an effective new algorithm for image target recognition of military unmanned equipment.

Figure 2. Loss and learning rate

Figure 3. (a) is the recognition result of YOLOv5, (b) is the recognition result of Faster R-CNN, and (c) is the recognition result of the improved Faster R-CNN. The detected images are a test set that did not participate in training. The figure shows that the proposed method performs best, with targets falling inside the detection boxes and high confidence scores, demonstrating that the improved Faster R-CNN detection method has excellent performance in classifying and locating military unmanned equipment targets.

Table 1. Anchor box parameters

Table 2. Performance comparison of different algorithms