Comparison and summary of Faster R-CNN, YOLOv3 and YOLOv5 applied in vehicle detection

Nowadays, computer vision and machine learning are widely applied in vehicle detection, and object detection techniques are essential in this field. However, because so many object detection models exist, it is important to find the model best suited to detecting vehicles in a given situation. In this paper, we compare and summarize Faster R-CNN, YOLOv3 and YOLOv5 as applied to vehicle detection. We introduce the models in some detail and design an experiment to verify their performance. The methodology is to train the three models on the same vehicle dataset, compare the resulting metrics, and identify the most suitable vehicle-detection scenario for each model.


Introduction
Vehicle detection has become increasingly significant in recent years for traffic monitoring and driving records. Nearly 43,000 individuals died in motor vehicle traffic crashes in 2021, according to the American National Highway Traffic Safety Administration (NHTSA) [1], a 10.5% increase from the 38,824 fatalities in 2020. According to the World Health Organization (WHO) [2], 1.35 million people died globally in traffic-related accidents in 2016, and an additional 20 to 50 million suffered non-fatal injuries and/or impairments. It has therefore become increasingly important to develop methods for automatically recognizing moving vehicles; such methods will be crucial for improving road safety and protecting lives and property.
Vehicle detection is also applied in autonomous driving. While the car is moving, a variety of sensors collect signals around it and transmit them to the on-board intelligent system, which comprehensively judges the situation at each moment. Autonomous driving is achieved through the cooperation of these modules, so the visual signals around the car, and how they are processed, are very important.
Vehicle detection is a branch of object detection in machine learning: the vehicle is simply the object to be detected, so general object detection methods can be applied directly to vehicle detection.
Prior to 2012, object detection relied on traditional techniques. Many deep learning models appeared after 2012 as processing power and data quality improved; early work mainly adopted the two-stage detection paradigm introduced by R-CNN. After 2016, however, as mobile applications placed greater demands on detection efficiency, research shifted toward one-stage models led by YOLO [3].
Faster R-CNN, an improved model building on the lessons of R-CNN and Fast R-CNN, was released in 2016. It is a true end-to-end deep learning detection algorithm. Its most significant innovation is the RPN network, which generates candidate boxes via the anchor mechanism and ultimately combines feature extraction, candidate box selection, bounding box regression, and classification into a single network, effectively increasing both detection accuracy and efficiency.
You Only Look Once (YOLO) is a deep neural network-based technique for detecting and locating objects [4]. Its primary advantage is speed, making it suitable for real-time systems. YOLO represents a significant advance in object detection because it is the first single-stage detector to treat detection as a regression problem: the detection architecture looks at the image only once to identify each object's position and category [5]. Because the YOLO family contains many variants, this article specifically examines YOLOv3 and YOLOv5.
To test and compare the performance of these two categories of models and find their best application scenarios for vehicle detection, we designed an experiment that trains them on the same dataset of 1,572 images with three label classes: person, car and motorbike. We compare attributes of the models such as loss, recognition speed, and recognition accuracy.

Theory of Faster R-CNN
Faster R-CNN generates feature maps by scaling the input image and passing it through convolution layers to extract features. The feature map is then fed to the RPN network to produce a list of candidate boxes. Next, the original feature maps and all candidate boxes output by the RPN are fed to the RoI Pooling layer, which extracts each proposal and pools it to a fixed 7 × 7 size before sending it to the fully connected layers for target classification and coordinate regression. The structure of Faster R-CNN can thus be divided into four parts: the convolutional neural network (CNN), the Region Proposal Network (RPN), RoI Pooling, and classification and regression.

Convolutional neural network (CNN).
The feature extraction structure of Faster R-CNN is the same as that of ordinary CNN networks. Common backbones such as VGG, ResNet or Inception (only the part before the fully connected layers) can be used to extract features from the input images and output feature maps [6].

Region proposal network (RPN).
In both the SPPnet and Fast R-CNN network structures, region candidates are produced by the Selective Search algorithm. The improvement of Faster R-CNN is to unify region candidate extraction, classification and bounding box regression into an end-to-end detection algorithm. The RPN is the component of the Faster R-CNN structure dedicated to extracting candidate boxes. Compared with Selective Search, the RPN takes less time and is easy to integrate into the network as a whole [6].
The RPN is a fully convolutional network whose central concept is the "sliding window + anchor" mechanism for creating candidate boxes. Concretely, a 3 × 3 convolution kernel slides over the 40 × 60 feature map produced by the preceding convolution layers, and at the centre of each window it constructs 9 candidate boxes with different aspect ratios and scales (40 × 60 × 9 ≈ 20,000 in total). These are mapped back to the rescaled image with a stride of 16, pre-proposals that exceed the image boundary are discarded, and the remainder are sorted by their Softmax scores from large to small. The top 2,000 pre-proposals are kept, non-maximum suppression is applied to them, the survivors are sorted again, and 300 proposals are output to Faster R-CNN for prediction. At this stage the prediction categories of Faster R-CNN do not include background, because RPN outputs are assumed to be foreground. Moreover, during training Faster R-CNN alternates between training the RPN from initialized weights and training the convolutional network on the candidate regions extracted by the RPN to update the weights.
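The anchor arithmetic above can be sketched as follows. This is a minimal illustration of the "sliding window + anchor" idea; the scales and aspect ratios are the commonly cited defaults, not necessarily the exact values used in this paper's experiments.

```python
# Sketch of the RPN anchor mechanism: at each position of a 40x60 feature
# map, 9 anchors (3 scales x 3 aspect ratios) are generated and mapped to
# image coordinates with a stride of 16.

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) boxes centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * (r ** 0.5)   # width grows with sqrt(ratio)
            h = s / (r ** 0.5)   # height shrinks, so w*h == s*s
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

stride = 16
feat_h, feat_w = 40, 60
all_anchors = []
for i in range(feat_h):
    for j in range(feat_w):
        # centre of the sliding window, mapped back to image coordinates
        all_anchors.extend(make_anchors(j * stride + stride // 2,
                                        i * stride + stride // 2))

print(len(all_anchors))  # 40 * 60 * 9 = 21600, the ~20,000 boxes above
```

Boundary filtering, Softmax scoring and non-maximum suppression would then prune this set down to the 300 proposals described in the text.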

Roi Pooling.
RoI Pooling has two specific functions here: (a) generating the proposal regions from the feature maps; and (b) max-pooling inputs of inconsistent size (here, regions of the convolution-layer feature map) into small feature maps of fixed size [6].
A procedure for converting variable-size inputs to a fixed size is necessary because the size of the proposals produced by the RPN fluctuates, while the fully connected layers used for classification require input of constant length. In the earlier R-CNN structure, each proposal was scaled or clipped to a fixed size; the side effect was that the original input was deformed or information was lost, resulting in inaccurate classification. RoI Pooling avoids this problem entirely: a proposal can be pooled into a fixed-length fully connected input without deformation.
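The fixed-size pooling described above can be sketched in a few lines. This is a simplified, pure-Python illustration (a single channel, integer bin boundaries) of the max-pooling-into-a-fixed-grid idea, not the actual Faster R-CNN implementation.

```python
# Minimal sketch of RoI max pooling: a variable-sized region of the feature
# map is divided into a fixed grid of bins and each bin is max-pooled,
# giving the same output size regardless of the proposal's size.

def roi_max_pool(region, out_h, out_w):
    """region: 2-D list (h x w of floats); returns an out_h x out_w grid."""
    h, w = len(region), len(region[0])
    out = []
    for i in range(out_h):
        row = []
        y0 = i * h // out_h
        y1 = max((i + 1) * h // out_h, y0 + 1)  # each bin covers >= 1 cell
        for j in range(out_w):
            x0 = j * w // out_w
            x1 = max((j + 1) * w // out_w, x0 + 1)
            row.append(max(region[y][x] for y in range(y0, y1)
                                        for x in range(x0, x1)))
        out.append(row)
    return out

# A 4x4 proposal pooled to a fixed 2x2 grid:
a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(roi_max_pool(a, 2, 2))  # [[6, 8], [14, 16]]
```

A proposal of any other size (say 3 × 5) pooled through the same function also yields a 2 × 2 grid, which is exactly why the fully connected layers can accept proposals of fluctuating size.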

Theory of YOLOv3
The YOLO algorithm is a one-stage target detection algorithm; the biggest difference between one-stage and two-stage detectors is operating speed. YOLO-series algorithms divide the image into a grid of cells and generate prior boxes based on the anchor mechanism, so only a single step is needed to produce the detection boxes. This greatly improves the prediction speed of the algorithm.
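The grid-division idea can be made concrete with a small sketch: each object is assigned to the grid cell containing its centre, and that cell's anchors are responsible for predicting it. The 416 × 416 input and 13 × 13 grid below are typical YOLO values used for illustration.

```python
# Sketch of YOLO's grid assignment: the image is divided into an S x S
# grid, and an object belongs to the cell containing its centre point.

def grid_cell(cx, cy, img_size=416, s=13):
    """Return the (row, col) of the grid cell owning centre (cx, cy)."""
    cell = img_size / s  # side length of one grid cell, here 32 px
    return int(cy // cell), int(cx // cell)

# An object centred at (200, 100) falls into row 3, column 6:
print(grid_cell(200, 100))  # (3, 6)
```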
YOLOv3 network structure can be roughly divided into three parts: Backbone, PANet, and YOLO Head.

Backbone: Darknet-53.
The main body of Darknet-53 is structurally similar to ResNet, with multiple residual modules stacked and a kernel_size=3×3, stride=2 convolution layer between them, mainly used for downsampling [7]. The "53" counts the backbone's 52 convolution layers plus the final connect (fully connected) layer, for 53 layers in total.
The first 3×3 convolution kernel is mainly used to increase the number of channels, obtaining more effective feature maps without changing the image size and expanding the receptive field. The second 3×3 convolution kernel, with stride=2, is mainly used for downsampling, reducing the number of parameters and the amount of computation; multiple residual blocks are stacked, and the resulting feature maps are finally average-pooled.
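The effect of the stride-2 convolutions on spatial resolution can be checked with the standard convolution output-size formula (this tracks sizes only; no convolution is actually performed, and the 416 input is a typical YOLOv3 resolution used for illustration):

```python
# Each stride-2, kernel-3, padding-1 convolution halves the feature map.
# Darknet-53 applies five such downsampling convolutions between its
# residual stages.

def conv_out(size, kernel=3, stride=2, pad=1):
    """Standard convolution output size: (n + 2p - k) // s + 1."""
    return (size + 2 * pad - kernel) // stride + 1

sizes = []
size = 416  # typical YOLOv3 input resolution
for _ in range(5):
    size = conv_out(size)
    sizes.append(size)
print(sizes)  # [208, 104, 52, 26, 13]
```

The final three resolutions (52, 26, 13) are the scales at which YOLOv3 makes its predictions.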

PANet.
PANet corresponds to the Neck module and is an improved version of FPN. FPN's main method is to upsample deep, small-scale feature maps along a top-down path, fuse them with the shallow, large-scale feature maps, and output feature maps at multiple scales for prediction [7]. On this basis, PANet adds a further bottom-up path that downsamples the fused shallow feature maps and concatenates them with the deep, small-scale feature maps.
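At the shape level, one top-down fusion step can be sketched as follows. The channel counts are illustrative, not the exact widths of YOLOv3's layers.

```python
# Shape-level sketch of one FPN fusion step: a deep (small-scale) map is
# upsampled 2x and concatenated with the next shallower map on the
# channel axis.

def fuse(deep, shallow):
    """deep, shallow: (h, w, c) shapes; returns the fused shape."""
    (dh, dw, dc), (sh, sw, sc) = deep, shallow
    # after 2x nearest-neighbour upsampling, spatial sizes must match
    assert (dh * 2, dw * 2) == (sh, sw)
    return sh, sw, dc + sc  # concatenation adds the channel counts

# A 13x13 deep map fused with the 26x26 map from the previous stage:
print(fuse((13, 13, 256), (26, 26, 512)))  # (26, 26, 768)
```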

YOLO Head.
The YOLO head is a decoder [7]. Its main structure is a conv + BN + activation module followed by a kernel_size=1×1 convolution layer that integrates the classes; a 1×1 convolution is used instead of a fully connected layer for classification.
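The output width of that final 1×1 convolution follows directly from the prediction format: per anchor, 4 box offsets, 1 objectness score, and one score per class. A quick sketch of the channel count:

```python
# Channel count of the YOLO head's final 1x1 convolution.

def head_channels(num_anchors=3, num_classes=80):
    # (x, y, w, h) + objectness + per-class scores, for each anchor
    return num_anchors * (4 + 1 + num_classes)

print(head_channels())               # 255 for the standard 80-class COCO setup
print(head_channels(num_classes=3))  # 24 for the 3-class dataset in this paper
```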

Theory of YOLOv5
The official YOLOv5 code provides four variants of the detection network: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. YOLOv5s is the network in the series with the smallest depth and the smallest feature-map width; the other three progressively deepen and widen it, raising AP accuracy at the cost of speed. YOLOv5s is thus the smallest and fastest variant with the lowest AP, but if the scenario mostly involves large targets and prioritizes speed, it remains a viable option.
The network structure of YOLOv5 is mainly divided into the following four parts: Input, Backbone, Neck, and Prediction.

Input.
The Mosaic augmentation used in YOLOv5 is based on the CutMix data augmentation method proposed at the end of 2019 [8], but whereas CutMix splices only two images, Mosaic splices four images that are randomly scaled, randomly cropped, and randomly arranged. This improves detection of small targets.
The YOLO method uses anchor boxes whose initial width and height are defined for each dataset. During training, the network generates prediction boxes from the initial anchor boxes, compares them with the ground truth to measure the difference, and then updates the network parameters by backpropagation; the initial anchor boxes are therefore crucial. In YOLOv3 and YOLOv4, the initial anchor values for a new dataset are calculated by a separate program, but YOLOv5 builds this step into the code, adaptively computing the best anchor values for each training run. The automatic anchor calculation can be disabled if its results are believed to be inaccurate.
YOLOv5 also has an adaptive image scaling module. In practice, many images have different aspect ratios, so after scaling and padding, the black borders at the two ends differ in size; padding more than necessary introduces redundant information and slows inference. The letterbox function in datasets.py was therefore modified in YOLOv5's code to adaptively add the fewest black borders to the original image. With less black border along the image height, the amount of inference computation decreases, increasing target recognition speed.
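The shape arithmetic of that adaptive letterbox can be sketched as follows (shapes only; this is a simplified illustration of the idea, not YOLOv5's actual letterbox code, and it assumes padding to the nearest multiple of the 32-pixel network stride):

```python
# Sketch of adaptive letterboxing: scale by the limiting side, then pad
# only up to the next multiple of the network stride instead of padding
# all the way to a full square.

def letterbox_shape(w, h, target=640, multiple=32):
    scale = min(target / w, target / h)      # keep the aspect ratio
    new_w, new_h = round(w * scale), round(h * scale)
    pad_w = (-new_w) % multiple              # minimal stride-aligned padding
    pad_h = (-new_h) % multiple
    return new_w + pad_w, new_h + pad_h

# A 1280x720 image: naive padding would produce a full 640x640 square
# with large black bars; adaptive padding stops at 640x384.
print(letterbox_shape(1280, 720))  # (640, 384)
```

The 640 × 384 result contains the same image content as a 640 × 640 square but with 40% fewer pixels to run inference on, which is the speed-up the text describes.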

Backbone.
YOLOv5 adopts the Focus structure and the CSP structure. The Focus structure performs a slicing operation [8]. Taking the YOLOv5s structure as an example, the original 608 × 608 × 3 image enters the Focus structure and first becomes a 304 × 304 × 12 feature map through the slicing operation; then, after a convolution using 32 convolution kernels, it becomes a 304 × 304 × 32 feature map.
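The slicing step can be reproduced directly: every second pixel is taken in both directions, and the four interleaved half-resolution slices are stacked on the channel axis. A minimal sketch using nested lists:

```python
# Sketch of the Focus slicing operation: 608x608x3 -> 304x304x12.
# Each output position concatenates the channels of a 2x2 pixel block,
# so no information is lost, only rearranged onto the channel axis.

def focus(img):
    """img: h x w x c nested lists; returns an (h/2) x (w/2) x 4c array."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            # concatenate the four neighbouring pixels' channel vectors
            row.append(img[y][x] + img[y + 1][x]
                       + img[y][x + 1] + img[y + 1][x + 1])
        out.append(row)
    return out

img = [[[1, 2, 3] for _ in range(608)] for _ in range(608)]
out = focus(img)
print(len(out), len(out[0]), len(out[0][0]))  # 304 304 12
```

The subsequent 32-kernel convolution mentioned in the text then maps this 304 × 304 × 12 tensor to 304 × 304 × 32.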
CSP stands for Cross Stage Partial Network (CSPNet). In terms of network design, it primarily addresses the problem of excessive computation during inference. According to CSPNet's authors, duplicated gradient information during network optimization is to blame for the excessive inference computation. The CSP module therefore splits the base layer's feature map into two parts and then merges them through a cross-stage hierarchy, maintaining accuracy while reducing the amount of computation.

Neck.
The Neck of YOLOv5 adopts the FPN+PAN structure [8]. FPN+PAN draws on PANet (CVPR 2018), which at the time was mainly used in the field of image segmentation; Alexey, the YOLOv4 author, adapted it to YOLOv4 to further enhance the network's feature extraction capability.
Whereas the Neck structure of YOLOv4 uses standard convolution modules, the Neck structure of YOLOv5 uses the CSP2 structure, designed with reference to CSPNet, to improve the network's feature fusion ability.

Method
In order to compare and summarize Faster R-CNN, YOLOv3 and YOLOv5 applied in vehicle detection, we used a VOC-format dataset containing 1,572 images with three label classes: person, car and motorbike. The three models were trained on the same dataset to control variables, with the ratio of the training set to the validation set fixed at 9:1.
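The 9:1 split can be sketched in a few lines. The filenames and the random seed below are placeholders, not the paper's actual file list.

```python
# Sketch of the 9:1 train/validation split on the 1,572-image dataset.
import random

random.seed(0)  # fixed seed so the split is reproducible
images = [f"img_{i:04d}.jpg" for i in range(1572)]  # placeholder names
random.shuffle(images)

split = int(len(images) * 0.9)
train, val = images[:split], images[split:]
print(len(train), len(val))  # 1414 158
```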

Training process of Faster R-CNN
For the model parameters before training, we set the backbone feature extraction network to ResNet-50 and saved the model every 5 epochs. The model was trained for 100 epochs, including 50 frozen epochs. The loss declined to around 0.715 and almost stopped falling, although it is not certain that the loss reached the global minimum. After testing, we found that Faster R-CNN had high GPU requirements, so we moved the training of this model to Google Colab.

Training process of YOLOv3
Since YOLOv3 has only one backbone network, there was no need to choose a backbone type. The model was saved every 5 epochs and trained for 150 epochs, including 50 frozen epochs. The loss declined to around 0.041 and almost stopped falling, although it is not certain that the loss reached the global minimum. The whole training procedure ran on a laptop with an NVIDIA RTX 2070 Max-Q GPU.

Training process of YOLOv5
For YOLOv5, four main types of pre-trained weights are provided (YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x). Because the majority of images in the training dataset contained small objects, we chose YOLOv5s, expecting it to help the model recognize small objects more precisely. The model was saved every 5 epochs and trained for 150 epochs, including 50 frozen epochs. The loss declined to around 0.068 and almost stopped falling, although it is not certain that the loss reached the global minimum. The whole training procedure ran on a laptop with an NVIDIA RTX 2070 Max-Q GPU.

Evaluation
After training the three models, we used the weights with the lowest loss to evaluate their performance. The indicators compared mainly include each model's AP, F1, Precision, Recall, fps and mAP values. To control variables, we set the models' confidence threshold to 0.5.
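For reference, the per-class indicators compared below are derived from true positives (TP), false positives (FP) and false negatives (FN) at the chosen threshold. A minimal sketch with hypothetical counts (AP, the area under the precision-recall curve, is not reproduced here):

```python
# Precision, recall and F1 from detection counts at a fixed threshold.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of detections that are correct
    recall = tp / (tp + fn)      # fraction of ground truths detected
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for one class, for illustration only:
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.89 0.84
```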

Evaluation of Faster R-CNN
The indicators for Faster R-CNN are shown in Table 1, which gives the AP, F1, Precision and Recall values after validation for the three classes car, person and motorbike. The AP values for all three classes are above 73%, which is relatively high. The F1 values are mostly below 0.85, with an average of around 0.72. The precision values are low while the recall values are high. Because precision and recall are, to a certain extent, mutually exclusive indicators, it can be concluded that under the chosen non-maximum suppression value and threshold score, Faster R-CNN performs better at confirming true positive examples.

Evaluation of YOLOv3
The indicators for YOLOv3 are shown in Table 2, which gives the AP, F1, Precision and Recall values after validation for the three classes car, person and motorbike. The AP values for all three classes are above 90%, which is relatively high. The F1 values are all higher than 0.81, with an average of around 0.88, and both precision and recall are high. Compared with Faster R-CNN, YOLOv3 performs better; judging from the data, YOLOv3 should be more capable and accurate at identifying correct examples.

Evaluation of YOLOv5
The indicators for YOLOv5 are shown in Table 3, which gives the AP, F1, Precision and Recall values after validation for the three classes car, person and motorbike. The AP values for all three classes are around 90%, relatively higher than Faster R-CNN's. The F1 values are all higher than 0.79, with an average of around 0.84, and both precision and recall are high. Compared with YOLOv3, however, the data suggest that YOLOv5 is inferior here. After testing, we concluded that the problem was probably an improper choice of training weights: we initially thought YOLOv5s would help the model identify small objects, but our dataset may contain fewer small objects than expected, so pre-trained weights with greater width and depth, such as YOLOv5m, might have produced better training results. We also found that, judging by the F1 values, YOLOv5's recognition ability varies less across the three object classes than YOLOv3's, indicating relatively stable recognition. On the whole, YOLOv5 still shows strong recognition ability and performance.

Analysis and comparison of other indicators
Some indicators have not yet been analysed. Table 4 shows the mAP and fps values of Faster R-CNN, YOLOv3 and YOLOv5. Clearly, in both mAP and fps, the YOLO-series models basically perform better. Because YOLO-series models "only look once", their recognition rate is significantly higher than that of Faster R-CNN, but this fast recognition speed sacrifices some recognition accuracy. For example, when encountering small objects or similar-looking objects, YOLO's weaker discrimination ability appears, mainly reflected in the positional deviation of the suggested boxes and the confidence assigned to the objects, as shown in Figure 1. This is also a major area in which YOLOv5 improves on YOLOv3.

Conclusions
As the evaluation shows, YOLOv3 and YOLOv5 have an advantage in detection speed. Their suitable application scenarios are those that require rapid recognition along with reasonable recognition accuracy. For example, a vehicle's visual recognition system is appropriate for them, because it is mainly used to recognize the traffic situation and identify vehicle types in real time, and YOLOv3 and YOLOv5 both support real-time recognition. Faster R-CNN, although slower at detection, has significant advantages in recognition accuracy. It is more suitable for scenarios that tolerate relatively slow recognition and require static or low-speed inspection, such as static aircraft inspection on airport surfaces or checking mechanical components and appearance integrity.
As for the limitations of this experiment's methodology, the preparation before training was not fully complete. The models should be trained on a more comprehensive and adequate dataset to improve their accuracy and generality and to reduce loss. Likewise, when selecting training weights for YOLOv5, each weight should be trained to find the one most suitable for this training set, so that the comparison of the three models' performance is as comprehensive and rigorous as possible.
In future work, we will continue to follow the latest models in machine vision, fix the shortcomings in our training dataset selection, and adopt more rigorous and complete training methods, such as a more comprehensive selection of training weights.

Figure 1. Detection examples from Faster R-CNN, YOLOv3 and YOLOv5. Although YOLOv5's indicators are similar to YOLOv3's, YOLOv5 actually performs better at identifying small objects and distinguishing similar objects, a major optimization of YOLOv5 over YOLOv3.

Table 4. mAP and fps of Faster R-CNN, YOLOv3 and YOLOv5.