Summary of Target Detection Algorithms

In recent years, the rapid development of convolutional neural networks (CNNs) has driven computer vision algorithms toward maturity. This paper briefly introduces several representative target detection algorithms and systematically analyzes their underlying problems, the methods proposed to improve them, and their likely future directions in light of their respective merits and shortcomings. Detection models are generally divided into single-stage and two-stage models according to whether candidate regions must be extracted before further processing. Within the two-stage family, algorithms are further divided into single-scale and multi-scale detection according to whether the network structure fuses features across scales, which improves accuracy on small targets. Within the single-stage family, models are divided into anchor-based and anchor-free approaches according to whether they rely on preset anchor boxes. Finally, the likely future development of target detection algorithms is outlined.


Introduction
With the continuous development of modern society, computer vision technology has been integrated into many aspects of daily life. Target detection is a basic but very important task in computer vision. It has achieved outstanding results in fields such as public security management, traffic and vehicle monitoring, environmental pollution detection, forest disaster early warning, and national defense. The task of target detection is to recognize and locate one or more targets of interest in a digital image. Training images containing the target are processed to extract stable, distinctive features or abstract semantic features; these features are then matched, or fed to a classification algorithm that assigns a confidence score to each category.
Target detection algorithms have been studied for many years. In the 1990s, many effective traditional target detection algorithms appeared; they mainly used hand-crafted feature extraction algorithms combined with template matching or classifiers for target recognition. However, traditional algorithms hit a development bottleneck due to their lack of strong semantic information and their computational complexity. In 2014, Ross Girshick proposed R-CNN [1], a convolutional-network-based target detection model with high detection accuracy and strong robustness and generalization ability, drawing much more attention to convolution-based detection.

Convolutional Neural Network
In 2014, the R-CNN convolutional network opened a new stage in the development of target detection. Its accuracy and stability greatly exceeded those of traditional target detection algorithms, so it was quickly adopted. Convolutional detection models are mainly divided into single-stage and two-stage models. The difference is that a two-stage model must first generate candidate regions (region proposals), which increases computational complexity and makes real-time detection difficult. A single-stage model discards this step and casts target detection as a regression problem; some accuracy is sacrificed, but computation is much faster and the model can run in real time.

The process of convolutional neural network detection
Convolutional neural network detection shares certain similarities with traditional detection: both can be regarded as feature extraction followed by recognition using those features, organized as a backbone feature extraction network plus a "detection head". Convolutional networks extract high-level semantic features of the image, generally in the backbone of the network, and then process the resulting feature maps. For example, a fully connected network with a softmax or SVM can be attached to form a classification head that completes the classification task, or small convolution kernels can map the features to the required dimensions, with a position loss function used for target localization.
The network measures its error through the loss function and updates the weight parameters by backpropagating gradients through the network, continuously reducing the loss value to improve detection accuracy. Trained over a large amount of data through many iterations, the detection network learns a set of near-optimal weights for predicting the detection target. At its core, a convolutional layer multiplies and accumulates the image (or feature map) with a set of convolution kernels at each spatial position to obtain the next-level feature map.

LeNet-5
The earliest classic convolutional backbone network was LeNet-5 [11], proposed by LeCun et al. in 1998. Although the model was relatively simple by today's standards, it already contained the most basic components of convolutional neural networks: convolution, pooling, and fully connected layers. Mainly used for handwriting recognition, it played a guiding role in the development of convolutional neural networks.

AlexNet
Convolutional neural networks became popular in 2012, when Alex Krizhevsky et al. proposed AlexNet [12]. Its overall structure is similar to LeNet: convolution first, then fully connected layers. But the network is more complex, with five convolutional layers, three fully connected layers, and a final 1000-way softmax output. AlexNet was trained across two GPUs, which greatly improved computing efficiency, and it achieved a top-5 error rate of 15.3% in the ILSVRC-2012 competition. To obtain a larger receptive field, the first layer uses an 11*11 convolution kernel, and local response normalization (LRN) was added after convolutional layers to improve accuracy; however, the 2015 VGG paper ("Very Deep Convolutional Networks for Large-Scale Image Recognition") found LRN to be essentially useless.

ZFNet
To better understand the working principles of convolutional neural networks, ZFNet [13] was proposed in 2013; it provides a visualization method for inspecting the individual layers of a convolutional network. Its main improvements over AlexNet are smaller convolution kernels, which reduce time complexity, and a deconvnet for visualizing feature maps; it also won the ILSVRC-2013 championship. The visualizations show that low-level layers extract edge and texture features while high-level layers extract abstract features, and that these features have translation and scale invariance but not rotation invariance.

GoogLeNet
The most direct way to improve a neural network's performance is to increase its depth and width, but this inflates the number of parameters and the amount of computation, and with a limited training set it causes problems such as gradient dispersion and overfitting. For example, AlexNet has about 60 million parameters, whereas the 22-layer GoogLeNet [14] proposed in 2014 achieves comparable capability with only about 5 million. It does so mainly by factorizing convolutions. GoogLeNet v1 decomposes a 5*5 convolution into two stacked 3*3 convolutions; they cover the same receptive field while using about 28% fewer parameters. GoogLeNet v2 [15] decomposes a 3*3 convolution into 1*3 and 3*1 convolutions. GoogLeNet v3 [16] decomposes a 7*7 convolution kernel into 7*1 and 1*7 kernels, deepening the network while reducing parameters. GoogLeNet v4 [17] builds on v3 by adding residual connections, greatly increasing depth.

VGGNet
VGGNet [18], proposed by Karen Simonyan et al. in 2014, is essentially a deepened version of AlexNet, consisting of two parts: convolutional layers and fully connected layers. All activation layers use ReLU, and all pooling layers use max pooling. Its simple structure and strong feature extraction ability earned it second place in the ILSVRC-2014 classification task and first place in the localization task. At test time, VGGNet converts the fully connected layers into convolutional layers (using 1*1 convolution kernels), which overcomes the traditional fully connected layer's requirement of a fixed input dimension. This enables multi-scale training, in which the training image scale is randomly sampled with side length in the range [256, 512]; this scale jittering augments the training set.

ResNet series
As network depth increases, the extracted features become richer, but optimization becomes harder and detection accuracy drops due to exploding and vanishing gradients. For a shallower network, normalizing the input data of each layer can make the network converge, but deep networks still suffer from optimization problems. In 2015, Kaiming He et al. proposed ResNet [19] to break this bottleneck, mainly through skip (shortcut) connections. ResNet v2 [20], proposed in 2016, reordered the normalization, activation, and convolutional layers to find the best-performing skip structure (Figure 22). ResNeXt [21], proposed in 2017, draws on GoogLeNet's ideas by adding 1*1 convolutions on both sides of each convolutional layer to control the number of channels, reducing the parameters by about two-thirds.

DenseNet
Previous convolutional networks were either wide, like GoogLeNet, or deep, like ResNet. The authors of DenseNet [22], published in 2017, observed two characteristics of neural networks through experiments: 1. If an intermediate layer is removed and the next layer is connected directly to the one above it, the network still works; that is, a neural network need not be a strictly progressive hierarchy in which each layer learns features only from its immediate predecessor. 2. Randomly dropping many layers of ResNet during training does not affect the network's convergence or predictions, which shows that ResNet has obvious redundancy and that each layer extracts only a few new features (the so-called residual). Compared with ResNet, DenseNet offers clear advantages: better performance with fewer parameters.

SENet
SENet [23], published in 2019, proposes channel-wise weighting that combines the idea of attention to suppress features that are not useful for the current task. The SE module assigns weights to the channels of a convolutional layer, and its self-contained, modular form makes it compatible with other networks. The paper mainly applies it to ResNet, and embedding the SE module in ResNeXt, BN-Inception, and Inception-ResNet-v2 also yields substantial gains. This shows that the benefit of SE is not limited to particular network structures; it generalizes well.

EfficientNet
Previous networks mainly improved accuracy by scaling a single dimension of the model: width (e.g., WideResNet and MobileNets), depth, or input resolution. The EfficientNet model [24] quantifies the relationship among these three dimensions and scales all three simultaneously in a fixed ratio to keep the network balanced.

VoVNet series
The VoVNet network [25], proposed in 2019, comprehensively surpasses ResNet and can serve as a backbone for real-time target detection. Considering energy consumption and model inference speed, it optimizes memory access cost (which is lowest when the numbers of input and output channels are equal) and GPU computing efficiency (GPUs excel at processing large tensors, CPUs at small ones). Most importantly, it proposes the OSA (one-shot aggregation) module, which fixes the inefficiency of DenseNet's densely connected modules. The 2020 CenterMask paper improved it further, adding residual connections and the eSE module (an improvement of the original SE module in which one FC layer replaces the original two FC layers to avoid the information loss of the reduction step), greatly increasing performance and forming the VoVNet-v2 structure. Compared with ResNet, VoVNet extracts small-target features better and is superior in both speed and accuracy.

Network Models of Target Detection
A detection model generally has to examine sub-regions of the image. Because exhaustive sliding-window detection requires a huge amount of computation, candidate boxes are used to roughly locate regions of interest first, and each candidate region is then examined, which greatly reduces the computational complexity of the network. Algorithms that first extract candidate regions and then detect and localize targets within them are called two-stage detection algorithms. Their accuracy is high, but the amount of computation is still large, so real-time detection is difficult to achieve.
Given the practical limits of two-stage detection, single-stage detection algorithms skip candidate-region extraction and perform regression prediction directly on the feature maps, greatly reducing the time complexity of the algorithm. In recent years the accuracy of single-stage detectors has approached that of two-stage detectors while maintaining high detection speed, so their development has attracted growing attention.

Network models of two-stage target detection
The development of two-stage detection models started from the original R-CNN, and many improved models are built around it, such as SPP-net, Fast R-CNN, Faster R-CNN, and R-FCN. These models improve the R-CNN network on single-scale features, greatly improving accuracy and speed.
Other improvements to the R-CNN family incorporate the idea of multi-scale feature fusion, such as ION, FPN, and Mask R-CNN; fusing features across scales improves the model's ability to detect small targets.

Based on the single-scale feature model
The R-CNN [26] pipeline proposed by Ross Girshick et al. in 2014 is relatively simple. First, selective search extracts roughly 2000 candidate boxes from the input image; each is warped to 227*227 and fed through the AlexNet CNN to extract features, yielding a 2000*4096 feature matrix. SVMs then classify each region: the feature matrix is multiplied by a 4096*20 weight matrix (one column per class for the 20 VOC classes), and a region whose score for a class exceeds a threshold is assigned to that class. The R-CNN model reached 53.7% mAP on VOC 2010.
The SPP-net [27] proposed by Kaiming He et al. in 2015 solved two problems of the R-CNN network of the time: 1. Candidate boxes are selected from the original image, and each one passes through the feature network separately; this repeated convolution greatly increases the computational burden. 2. A fixed-size input image is required, so the original image must be cropped or scaled, and these operations may lose target information and hurt accuracy. For the first problem, a shared convolutional feature layer is used and candidate regions are selected on the final convolutional feature map, reducing the amount of computation. For the second problem, the root cause is that the fully connected layer needs a fixed-dimensional feature vector; SPP-net therefore adds a pyramid pooling layer after the last convolutional layer (feature map) of the feature network to produce a fixed-length, ordered output. … Faster R-CNN [29] was proposed to solve two problems in Fast R-CNN: 1. The proposal boxes come from a selective search algorithm, which greatly increases the network's computation. 2. The localization loss function based on L1 distance is unstable near the optimal solution. Faster R-CNN is trained end to end, shares most of its computation, and offers high detection accuracy and robustness to interference. Although it is not truly real-time, its distinctive region proposal network (RPN) became a foundation for later models. … FPN [32] fuses high-level semantic information into the lower layers of the network structure to raise the resolution of semantically strong feature maps; predicting on larger feature maps captures more of the feature information of small targets, so small-target detection improves markedly. The FPN model reached 59.1 AP on COCO.
The Mask R-CNN [33] proposed by Kaiming He et al. in 2018 is similar in structure to Faster R-CNN. It is a flexible multi-task detection framework that can perform target detection, target instance segmentation, and target keypoint detection. Simply put, a segmentation "head" (mask task layer) is added to the Faster R-CNN framework; the mask branch lets the network handle segmentation and keypoint tasks. Its RoIAlign operation avoids the two quantization steps of Faster R-CNN's RoI pooling and improves detection accuracy.

Network models of single-stage target detection
The earliest single-stage detection model was YOLO v1. One line of improvement is anchor-based: on the feature maps produced by the feature extraction network, targets are detected point by point according to preset anchor boxes. Examples include SSD, YOLO v2, RetinaNet, YOLO v3, YOLO v4, and EfficientDet.
The other line of improvement uses the anchor-free idea: the network directly predicts key points of the target, such as its two corner points or its center point, and uses them to regress the target's location. Examples include CornerNet, CenterNet, CornerNet-Lite, FCOS, and CenterMask. Anchor-free models overcome five shortcomings of anchor-based models (a sketch of anchor generation follows this list): 1. Detection performance is very sensitive to the size, aspect ratio, and number of anchor boxes, so these anchor-related hyperparameters must be tuned carefully. 2. Because anchor sizes and aspect ratios are fixed, the detector struggles with candidates that deform strongly, especially small targets. 3. Pre-defined anchor boxes also limit the detector's generalization, since they must be redesigned for different object sizes or aspect ratios. 4. To achieve a high recall rate, dense anchor boxes must be placed over the image; most of them are negative samples, creating an imbalance between positive and negative samples. 5. The large number of anchor boxes increases the computation and memory consumed in calculating intersection-over-union.

Based on the anchor-based detection model
The SSD [34] … The YOLO v4 [38] proposed by Alexey Bochkovskiy et al. in 2020 adopts the best optimization strategies of recent years in the CNN field on top of the traditional YOLO framework, applying optimizations of varying degrees to data processing, the backbone network, network training, the activation function, the loss function, and more. It balances detection speed and accuracy and is a great improvement over YOLO v3. The YOLO v4 (CSPDarknet-53) model reached 43.5 AP on COCO.
The EfficientDet [39] proposed by Mingxing Tan … The CornerNet-Lite [43] proposed by Hei Law et al. in 2019 optimizes the backbone of CornerNet to form CornerNet-Squeeze, and its CornerNet-Saccade variant uses an attention mechanism to crop away image regions unlikely to contain targets (similar to two-stage detection: the approximate target area is cut out first and then examined). This approach achieves a good breakthrough in both speed and accuracy, reaching the highest accuracy (47.0%) of the … The FCOS [44] network model proposed by Zhi Tian et al. in 2019 consists roughly of an FPN feature pyramid and three detection-head branches. FCOS discards the traditional anchor box and performs regression directly at each point on the feature map, and FPN's multi-scale hierarchical detection greatly reduces the ambiguous samples that arise when multiple bounding boxes overlap one location. Center-ness weighting combined with NMS (non-maximum suppression) effectively suppresses low-quality bounding boxes far from the target center. Compared with mainstream one-stage and two-stage detectors, FCOS outperforms the classic Faster R-CNN, YOLO, and SSD in detection efficiency; it trades some speed for accuracy, yet still beats RetinaNet on both accuracy and speed. The FCOS (ResNeXt-64x4d-101-FPN) model reached 44.7 AP on COCO.

Based on the anchor-free detection model
The CenterMask [45] proposed by Youngwan Lee et al. in 2020 builds on FCOS, adding the SAG-Mask instance segmentation module with an integrated attention mechanism and replacing the feature extraction backbone with VoVNet-v2. With a ResNet101-FPN backbone it reaches 38.3% mask AP, surpassing all previous networks, but runs at only 13.9 FPS. The lightweight CenterMask-Lite reaches 33.4% mask AP and 38% box AP at 35 FPS, meeting real-time requirements. The CenterMask (V-39-FPN) model reached 36.3 mask AP on COCO.

The future development direction and summary of the target detection algorithm
In pursuit of faster and more accurate target detection models, algorithms will incorporate more ideas from other advanced models, and single-stage and two-stage methods will gradually merge. For example, the target position estimation in the single-stage CornerNet-Lite model adopts the idea of two-stage target detection.
As detection tasks diversify, target detection models are no longer single-task models. Instance segmentation has been added (similar to multi-target detection, but using edge contours instead of bounding boxes), and some models also add panoptic segmentation, which combines semantic segmentation and instance segmentation: semantic segmentation assigns a category to every pixel in the image (distinguishable by color) but does not distinguish individuals, whereas after panoptic segmentation we know which individual of which category each pixel belongs to, a more refined classification task. There is also keypoint detection for estimating human posture: the joints of the human body are represented as points connected by adjacent line segments, abstractly representing the body's pose and actions.