Improved YOLOv7 Intelligent Environment Perception Algorithm Based on Loss Function

To address the difficulty of environment perception in unmanned driving technology, an improved intelligent environment perception model based on the YOLOv7 loss function is proposed. The Deep SORT target tracking algorithm is also added so that the surrounding environment can be perceived through the camera, ensuring the safety of the driver. By replacing the aspect-ratio loss term with the differences between the predicted and target width and height, the fitting speed of the model is effectively improved. Focal Loss is introduced to mitigate the sample imbalance in target bounding box regression: it reduces the weight, within the overall loss, of the numerous anchor boxes that overlap little with the true target bounding box, so that the model concentrates more on highly overlapping targets. The dataset used in this article is BDD100K. The mAP of the proposed algorithm reaches 60.3%, 2.2 percentage points above the baseline. Experiments indicate that the proposed algorithm achieves a clear improvement in the intelligent perception model for unmanned driving.


1. Introduction
In recent years, with the advancement of computer vision technology and the prevalence of artificial intelligence research, unmanned driving technology has been continuously pushed toward commercialization [1]. Nowadays, unmanned driving technology already provides assisted-driving and automatic-driving functions. While providing a better driving experience, it can also safeguard the driver and effectively alleviate traffic congestion encountered during driving [2]. The purpose of the driverless visual perception system is to perceive targets such as roads, buildings and pedestrians in real time, so as to determine the area where vehicles can pass. Therefore, the accuracy of the environment perception system is of great significance to the safety of unmanned driving [3].

BDD100K dataset
In recent years, datasets in the field of autonomous driving have grown rapidly; their data formats mainly include video, images and some radar data. The dataset used in this article is BDD100K. It includes road object bounding boxes, drivable areas and lane marking lines. The videos in the dataset were collected from all over the United States, covering different times of day, weather conditions and driving scenarios. For road object detection, target bounding boxes are labeled on 100,000 images for ten categories: rider, bike, car, person, truck, bus, traffic sign, traffic light, train and motor. As can be seen from the class distribution of BDD100K depicted in Figure 1, category imbalance is extremely serious in this dataset: the count of car, the most frequent class, is about a thousand times that of train, the least frequent class. This category imbalance has a very serious impact on model training.

Algorithm overview
The target detection task consists of target classification and target localization, and it has invariably been one of the most basic problems in computer vision. At present, state-of-the-art target detection algorithms rely on a target bounding box regression module to locate the target, so the quality of the loss function is highly important to the detection algorithm. This article introduces the Focal-EIOU loss function into the YOLOv7 [8] model. It rewrites the loss in terms of the differences of three pivotal geometric elements: overlap area, center point and width-height values. At the same time, a focal weighting is added to enhance the weight of target bounding boxes with large IOU during model optimization. Finally, these two approaches are integrated into a new target detection loss function, Focal-EIOU, to achieve accurate and efficient target detection.

Loss Function of BBR
Target bounding box regression is a key step in target detection algorithms. The bounding box is used to describe the spatial position of an object. It is rectangular and is determined either by the (x, y) coordinates of its upper-left and lower-right corners, or by the (x, y) coordinates of its center together with its width and height. By far, most target detection methods use bounding box regression, and researchers have spent a great deal of effort designing its loss function. YOLOv1 [9] uses the square root of the size of the predicted bounding box to reduce the scale sensitivity between the real and predicted boxes. Following Fast R-CNN [5], Faster R-CNN [6] adopts the SmoothL1 loss, which is more robust than the loss used in R-CNN [4]. The dynamic SmoothL1 loss can control the loss function dynamically and pay more attention to bounding boxes that overlap the ground truth more. Subsequently, the IOU loss was proposed and achieved superior performance on the FDDB benchmark. GIOU [10] loss was then put forward to handle a weakness of the IOU loss, namely that when the two bounding boxes do not intersect, the IOU is always zero and the loss provides no useful signal. DIOU and CIOU [11] followed, with faster convergence and better performance. Aiming at the shortcomings of these existing loss functions, an effective EIOU loss function is introduced in this article for target detection.
This part first analyzes the shortcomings of the existing common loss functions and then presents the EIOU loss function.
(1) Limitations of IOU loss. The IOU loss measures the similarity between two arbitrary shapes (here, boxes) A and B:

L_IOU = 1 − |A ∩ B| / |A ∪ B|

The IOU loss has good scale invariance. However, it has two main drawbacks: if there is no intersection between the two boxes, the IOU is always zero regardless of how far apart they are, so the loss cannot correctly reflect the closeness of the two boxes and provides no gradient; moreover, its convergence speed is slow.
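The saturation problem can be checked numerically. The following is an illustrative sketch (not the authors' implementation) of the IOU loss on axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes are disjoint)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred, target):
    return 1.0 - iou(pred, target)

# Two disjoint boxes: the loss saturates at 1 no matter how far apart they are.
near = iou_loss((0, 0, 2, 2), (3, 0, 5, 2))    # -> 1.0
far = iou_loss((0, 0, 2, 2), (30, 0, 32, 2))   # -> 1.0
```

Because `near` and `far` are identical, gradient descent receives no information about how to move a non-overlapping prediction toward the target, which is exactly the first drawback noted above.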
(2) Limitations of GIOU loss. To handle the shortcomings of the IOU loss, the GIOU loss introduces the smallest enclosing box C of the two boxes A and B and is defined as:

L_GIOU = 1 − IOU + |C \ (A ∪ B)| / |C|

The penalty term works by enlarging the predicted box toward overlapping the target box, rather than directly reducing the spatial position difference. The area |C \ (A ∪ B)| is always small or equal to zero (when A contains B, or vice versa, the term is exactly zero), and in those circumstances the GIOU loss degenerates to the IOU loss. Therefore, the convergence speed of the GIOU loss is still excessively slow.
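A minimal, self-contained sketch of the GIOU loss (again an illustration, not the authors' code) shows how the enclosing-box penalty restores a distance signal between disjoint boxes:

```python
def giou_loss(a, b):
    """GIOU loss for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest enclosing box C of the two boxes
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c_area = cw * ch
    return 1.0 - iou + (c_area - union) / c_area

# Unlike the plain IOU loss, GIOU still distinguishes disjoint boxes:
near = giou_loss((0, 0, 2, 2), (3, 0, 5, 2))    # smaller loss
far = giou_loss((0, 0, 2, 2), (30, 0, 32, 2))   # larger loss
```

Here `near < far`, so the gradient now pushes a distant prediction toward the target; but as noted above, when one box contains the other the penalty vanishes and the loss behaves like plain IOU.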
(3) Limitations of CIOU loss. CIOU loss takes into account three pivotal geometric elements, i.e., overlap area, center distance and aspect ratio. Given a predicted box b and a target box b^gt, the CIOU loss is defined as:

L_CIOU = 1 − IOU + ρ²(b, b^gt) / c² + αν

where b and b^gt denote the center points of the predicted and target boxes, respectively; ρ(·) = ||b − b^gt||₂ denotes the Euclidean distance; c is the diagonal length of the smallest enclosing box covering the two boxes; ν = (4/π²)(arctan(w^gt/h^gt) − arctan(w/h))² measures the consistency of aspect ratio; and α = ν / ((1 − IOU) + ν) is a trade-off weight.

In this formulation, ν only reflects the difference in aspect ratio, not the true relationship between w and w^gt or between h and h^gt. In particular, whenever the two boxes have the same aspect ratio, ν = 0 and the aspect-ratio loss term fails, even if their actual widths and heights differ. Therefore, the CIOU loss may optimize the similarity of bounding boxes unreasonably.
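The degenerate case can be verified numerically. This sketch of the standard ν term (assumed here from the usual CIOU formulation, not taken from the paper's own code) shows that ν vanishes whenever the aspect ratios agree, regardless of absolute size:

```python
import math

def v_term(w, h, w_gt, h_gt):
    """Aspect-ratio consistency term nu of the CIOU loss."""
    return (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2

# A 2x1 prediction vs. a 20x10 target: widths and heights are very
# different, yet the aspect ratios are equal, so nu contributes nothing.
print(v_term(2, 1, 20, 10))   # -> 0.0
```

This is precisely the failure mode that motivates replacing αν with separate width and height terms in EIOU.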
(4) Improvement of EIOU. To solve the above issues, the CIOU loss is modified into a more effective loss, namely the EIOU loss, defined as:

L_EIOU = L_IOU + L_dis + L_asp = 1 − IOU + ρ²(b, b^gt) / c² + ρ²(w, w^gt) / C_w² + ρ²(h, h^gt) / C_h²

where C_w and C_h refer to the width and height of the smallest enclosing box covering the two boxes, respectively. The loss function is thus split into three parts: the IOU loss L_IOU, the distance loss L_dis and the aspect loss L_asp. In this way, the beneficial properties of the CIOU loss are retained; at the same time, the EIOU loss directly minimizes the difference in width and height between the target box and the anchor box, bringing faster convergence and better localization.
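Putting the three terms together, a self-contained sketch of the EIOU loss as defined above (an illustration under the stated formulas, not the authors' implementation) looks like this:

```python
def eiou_loss(a, b):
    """EIOU loss = IOU loss + center-distance loss + width/height losses.
    Boxes are axis-aligned tuples (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    union = wa * ha + wb * hb - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest enclosing box: its diagonal normalizes the center distance,
    # and its sides C_w, C_h normalize the width/height differences.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    center_sq = (cax - cbx) ** 2 + (cay - cby) ** 2
    l_iou = 1.0 - iou
    l_dis = center_sq / (cw ** 2 + ch ** 2)
    l_asp = (wa - wb) ** 2 / cw ** 2 + (ha - hb) ** 2 / ch ** 2
    return l_iou + l_dis + l_asp

# Identical boxes give zero loss; all three terms vanish.
print(eiou_loss((0, 0, 4, 2), (0, 0, 4, 2)))   # -> 0.0
```

Unlike the CIOU sketch above, two same-aspect-ratio boxes of different sizes here still incur a nonzero aspect term, because widths and heights are compared directly.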

Improved Focal Loss
One-stage detectors have a serious foreground-background imbalance problem. To solve this issue, SSD [7] uses hard negative mining and retains only a small part of the background for training. Focal loss re-weights the background and foreground samples, assigning a larger weight to the foreground. Because target objects are sparse in an image, the number of high-overlap target bounding boxes with small regression error is much smaller than the number of low-overlap ones. The study of OHEM [12] shows that the numerous low-overlap bounding boxes produce excessive gradients, which is detrimental to the training process. It is therefore crucial to give the highly overlapping target bounding boxes more weight during network training. As the bounding box regression error x changes, we can posit a desired curve for the gradient magnitude and construct a family of functions with a parameter that controls the shape of the curve. As this parameter rises, the gradient magnitude of outliers is strongly suppressed; however, the gradient magnitude of high-quality bounding boxes also descends. To reduce this decline, another parameter is added, and the gradient magnitudes are normalized to [0, 1], giving the FocalL1 loss for the bounding box regression error, in which a constant C ensures that the loss is continuous at x = 1. To make the loss focus its weight on highly overlapping target boxes, the regression error in this formulation is replaced by the EIOU loss, yielding the final Focal-EIOU loss:

L_Focal-EIOU = IOU^γ · L_EIOU

where γ denotes a parameter that controls the degree of outlier suppression.
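The re-weighting in the final formula can be sketched directly. The γ value below is a hypothetical default chosen for illustration; the paper does not state its setting here:

```python
def focal_eiou_loss(eiou_value, iou, gamma=0.5):
    """Focal-EIOU re-weighting: scale the EIOU loss by IOU**gamma so that
    low-overlap (outlier) boxes contribute less to the total gradient.
    gamma controls how strongly outliers are suppressed (illustrative default)."""
    return (iou ** gamma) * eiou_value

# Same raw EIOU loss, different overlap quality:
hi = focal_eiou_loss(0.3, iou=0.9)   # high-overlap box keeps most weight
lo = focal_eiou_loss(0.3, iou=0.1)   # low-overlap box is suppressed
```

Here `hi > lo`, so during training the model concentrates its updates on predictions that already overlap the target well, which is the behavior the section above motivates.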

Experimental environment parameter setting
The software and hardware versions used in this experiment are shown in Table 1. To analyze the superiority of the improved YOLOv7 intelligent environment perception algorithm based on the loss function, the mAP (mean average precision) results before and after the improvement are shown in Figure 2 and Figure 3. According to these figures, the improved model shows a very obvious improvement in several categories: bus increases by 0.7 points, bike by 0.6 points, traffic light by 2.9 points, and train by 2 points. The mAP of the final model improves by 2.2 points, which fully illustrates the advantage of the model.

Qualitative analysis
Some images in the validation set are selected and tested; the results are shown in Figures 3 and 4. As shown in Figure 3, under poor lighting, the target in the yellow box of the Baseline result is recognized incorrectly, whereas the corresponding target in the yellow box of Ours is recognized correctly. The misrecognized targets in the figure are traffic signs, which belong to the category of small targets; they have fewer features and are more difficult to identify. Under the same training parameters, the effect of Ours is clearly better than that of Baseline.
Figure 4 shows that while perceiving the targets around the vehicle, this paper also performs short-term re-identification (Re-ID) assignment, i.e., each target is assigned an ID and tracked for a certain period of time. For example, the bus in Figure 4 is numbered 1, so the count of that vehicle in the view does not change during this period. This module facilitates subsequent behavior analysis of surrounding targets, such as judging whether a pedestrian's or vehicle's trajectory conflicts with the ego vehicle's trajectory, or predicting whether a collision will occur, which can effectively prevent accidents.

Conclusion
To improve safety control in the process of unmanned driving and reduce its risk coefficient, this paper combines YOLOv7 target detection with the Deep SORT target tracking algorithm and improves the loss function of the target detection algorithm. The aspect-ratio loss term is replaced with separate width and height loss terms, which efficiently deals with the stagnation issue during model fitting. The improved Focal loss is also adopted to effectively handle the imbalance between foreground and background in the dataset. On the BDD100K dataset, the model reaches a mean average precision of 60.3%. The approach can effectively reduce the safety hazards caused by inaccurate environment perception and improve the safety factor of unmanned driving.

Figure 2. mAP before improvement. Figure 3. mAP after improvement.

Table 1. Hardware and software versions.
In the training of a deep learning model, good parameter settings make the model more effective. After many experiments, Table 2 lists the most appropriate training parameters.