YOLOv5 Vehicle Detection Model in Fog Based on Channel Attention Enhancement

Vehicle detection in foggy weather plays an indispensable role in intelligent transportation. To address the insufficient detection accuracy and frequent missed detections of most algorithms in foggy weather, this article proposes an improved YOLOv5 vehicle detection model. First, the AOD-Net network is used to defog the original image as a preprocessing step. Then, the SE attention mechanism is fused into the C3 module of the Backbone feature extraction network to adaptively allocate weights, strengthen attention to important features, and reduce the impact of noise and irrelevant information. Finally, BiFPN replaces the original PANet in the Neck feature fusion network to enhance the model's feature fusion ability. Experiments on the Cityscapes and RTTS datasets show that, compared with the original model, the improved YOLOv5 algorithm achieves significant gains in precision, recall, and mean average precision of 8.4%, 9.5%, and 9.2%, respectively, and better adapts to vehicle detection tasks in foggy weather.


Introduction
In recent years, automobile ownership in China has grown steadily with technological advances and rising incomes. Consequently, intelligent transportation systems have developed rapidly, and vehicle detection has gained increasing prominence in intelligent transportation and traffic safety. Vehicle detection in foggy environments, affected by complex weather conditions such as haze, has become a significant research direction in computer vision.
Deep learning-based object detection can be classified into one-stage and two-stage methods. One-stage methods directly extract target information from images through regression and classification; examples include YOLO [1][2] and SSD [3]. Two-stage methods first generate candidate boxes and then refine them through regression and classification; examples include R-CNN [4], Fast R-CNN [5], and Faster R-CNN [6].
Deep learning-based methods can detect targets in images quickly and accurately. However, in heavy fog, vehicle features may be lost, which increases detection difficulty. Li [7] used the multi-scale Retinex algorithm to restore the color of haze images and obtained an enhanced dataset. Zhai et al. [8] used an improved histogram equalization method to enhance image contrast and adopted depth-separable convolution as the basic unit, reducing the number of parameters and improving model efficiency. Huang Kaiqi et al. [9] applied the SSR defogging algorithm to preprocess foggy images and then used confidence scores to select detection box positions, improving vehicle detection efficiency. Wang Yudong et al. [10] added a fog concentration discrimination module to the detection network to improve robustness and adaptability, and introduced an attention mechanism to strengthen the model's feature extraction ability. Yuan LaoHu et al. [11] improved the data augmentation of the original YOLOv5 by randomly cropping and merging nine images to accelerate convergence, added a CBAM attention mechanism before the prediction end to strengthen feature extraction, and improved NMS non-maximum suppression; the resulting YOLOv5-CBAM improved detection accuracy and efficiency. Liu et al. [12] used an adversarial network to remove fog from images, replaced the model's CSPDarkNet53 backbone with ShuffleNet to improve detection speed, and added a CA mechanism to the feature fusion part to enhance attention to small targets. Wang Zhongmei et al. [13] used a generative adversarial network for image preprocessing and proposed a multi-scale fusion module to combine shallow and deep features.
Although the above methods improve detection accuracy, problems remain under haze, such as the limited effect of defogging algorithms and insufficient extraction of feature information. Based on this, this paper proposes an improved object detection method: AOD-Net is first used to defog the image and improve its clarity, and the processed image is then fed into an improved YOLOv5 model. Firstly, SE channel attention is incorporated into the C3 module to improve representation ability. Secondly, BiFPN replaces the PANet feature fusion part of YOLOv5 to fuse multi-level features and enhance detection performance.
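As an illustration of this two-stage design, the following minimal PyTorch sketch chains a dehazer and a detector. Here `dehazer` and `detector` are hypothetical stand-ins for a trained AOD-Net and the improved YOLOv5 model; the names are not from the paper's code.

```python
import torch

def detect_in_fog(image: torch.Tensor, dehazer, detector):
    """Defog first, then detect: the two-stage pipeline described above.

    `dehazer` and `detector` are hypothetical stand-ins for a trained
    AOD-Net and the improved YOLOv5 model.
    """
    with torch.no_grad():
        clean = dehazer(image)    # AOD-Net preprocessing improves clarity
        return detector(clean)    # improved YOLOv5 detects on the dehazed image
```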

YOLOv5 model
YOLOv5 is an object detection algorithm that uses anchor boxes to detect objects. The YOLOv5 model structure is shown in Figure 1.

Figure 1. YOLOv5 model structure
YOLOv5 is built on a lightweight CSP network structure composed of a series of convolutional layers and residual blocks, and is mainly divided into three components: Backbone, Neck, and Head. The Backbone adopts the CSPDarkNet53 architecture, in which multiple convolutional and pooling layers gradually extract high-level feature representations from the input image. The Neck uses the SPP structure as an intermediate processing network that further processes the feature maps extracted by the Backbone to obtain richer, multi-scale features. The Head contains multiple detection layers, which map the feature maps from the Neck to the prediction space and output detection results.
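To make this dataflow concrete, here is a hypothetical minimal PyTorch skeleton of the Backbone, Neck, and Head pipeline; single convolution layers stand in for the much larger real modules.

```python
import torch
import torch.nn as nn

class DetectorSkeleton(nn.Module):
    """Hypothetical minimal skeleton of the Backbone -> Neck -> Head dataflow.
    Single conv layers stand in for CSPDarkNet53, the SPP/PANet neck, and the
    multi-layer detection head, which are far larger in the real model."""
    def __init__(self, num_outputs: int = 255):  # e.g. 3 anchors x (80 classes + 5) for COCO
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.SiLU())   # feature extraction
        self.neck = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.SiLU())     # feature fusion
        self.head = nn.Conv2d(128, num_outputs, 1)                            # prediction layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.neck(self.backbone(x)))
```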

Image defogging processing
In recent years, environmental pollution has made haze weather increasingly common in cities, posing great challenges to image processing and computer vision. Among dehazing methods, the deep learning-based all-in-one dehazing network AOD-Net [14] has achieved good results. The main idea of AOD-Net is to establish an adaptive deep model that uses the powerful feature extraction and data fitting ability of deep learning to estimate the K values, mapping the hazy image and the corresponding transmission image to a haze-free reference image and performing haze removal automatically. The method primarily comprises two components: a K-value estimation module, which uses five convolutional layers to estimate K(x), and a clean image generation module. The network contains an encoder, responsible for feature extraction, and a decoder, responsible for generating haze-free images. The network structure is shown in Figure 2.

Improvement of YOLOv5 model

SE attention mechanism
The attention mechanism is a commonly used technique in deep learning that increases the importance weights of certain inputs, allowing the model to automatically focus on the parts of the input most relevant to its predictions and thereby improving performance. The SE [15] module applies channel attention: it weights channel features so that the model concentrates on effective features when processing input data, which improves accuracy and lets the model assign different weights to different parts of the features according to image content. In this paper, SE attention is introduced into the feature extraction stage; its module structure is shown in Figure 4. The SE module consists of two operations: squeeze and excitation.
The squeeze operation compresses each channel of the input feature map to a single real number by global average pooling, as shown in formula (1):

$$z_c = \frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H} u_c(i, j) \qquad (1)$$

where $u_c$ is the output feature of the convolutional layer on channel $c$, $C$ is the number of channels, and $W$ and $H$ are the width and height of the feature map.
The excitation operation processes the compressed real numbers through two fully connected layers to enhance the module's non-linear representation. First, the first fully connected layer reduces the dimensionality and applies a ReLU activation; then the second fully connected layer restores the dimensionality and applies a sigmoid activation to output the result, as shown in formula (2):

$$s = \sigma\left(W_2\,\delta(W_1 z)\right) \qquad (2)$$
where $W_1$ and $W_2$ are the parameters of the two fully connected layers, $\sigma$ is the sigmoid activation function, and $\delta$ is the ReLU activation function. Finally, the channel coefficients obtained from the excitation operation reweight the original features channel by channel to produce the module's output, as shown in formula (3):

$$\tilde{x}_c = s_c \cdot u_c \qquad (3)$$
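A minimal PyTorch sketch of the SE module described by formulas (1) to (3) might look as follows; the reduction ratio of 16 is a common default and an assumption here, as the paper does not state it.

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-Excitation channel attention (formulas (1)-(3));
    reduction ratio 16 is a common default, not stated in the paper."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global average pooling, formula (1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1: dimensionality reduction
            nn.ReLU(inplace=True),                       # ReLU activation (delta)
            nn.Linear(channels // reduction, channels),  # W2: dimensionality increase
            nn.Sigmoid(),                                # sigmoid (sigma), formula (2)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * s                                     # channel-wise reweighting, formula (3)
```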

Optimization of C3 module
In YOLOv5, the C3 module is a key component used to extract image features and further enhance the detection network's expressive ability. The C3 module consists of three parts: dimension reduction, residual blocks, and dimension increase; this design effectively improves the network's expressive ability and detection accuracy. However, in images with complex backgrounds or dense targets, target features may be obscured or submerged.
Adding an attention mechanism in the C3 module can further improve the model's expressive power and performance, while reducing misjudgments caused by background interference.
Incorporating the SE attention mechanism into the C3 module of YOLOv5 can further improve the model's detection accuracy and generalization ability, making the model focus more on the important features of objects and improving the accuracy and robustness of object detection. The C3 module with the attention mechanism added is shown in Figure 5.
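As a sketch of how the SE layer can be fused into C3, the following hypothetical PyTorch module appends the SE block (from the earlier listing) to a simplified C3; the paper's exact insertion point inside C3 and its bottleneck design may differ.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv + BN + SiLU, the basic convolution unit used throughout YOLOv5."""
    def __init__(self, c1: int, c2: int, k: int = 1, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class C3SE(nn.Module):
    """C3 with an SE layer appended to its output; uses the SE class from the
    earlier listing. The residual stack is simplified for brevity."""
    def __init__(self, c1: int, c2: int, n: int = 1):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = ConvBlock(c1, c_)                     # dimension reduction branch
        self.cv2 = ConvBlock(c1, c_)                     # parallel shortcut branch
        self.m = nn.Sequential(*(ConvBlock(c_, c_, 3) for _ in range(n)))  # simplified residual stack
        self.cv3 = ConvBlock(2 * c_, c2)                 # dimension increase after concatenation
        self.se = SE(c2)                                 # channel attention on the fused features

    def forward(self, x):
        y = torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1)
        return self.se(self.cv3(y))
```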

Feature pyramid optimization
YOLOv5 combines FPN and PAN to build its feature pyramid. However, these adopt unidirectional paths to transmit information, and each level of the pyramid can only obtain information from the feature maps of the adjacent level, which may cause information loss and insufficient information transmission. To better fuse high-level and low-level features, reduce computation cost, and improve model robustness, this paper adopts the weighted bidirectional feature pyramid network (BiFPN) [16]. It simplifies the PAN structure and fuses multi-scale feature information while establishing bidirectional connections between feature maps of the same scale, alleviating feature information loss to some extent [17]. The feature pyramid structure is shown in Figure 6.
For the input feature maps at each level, the weighted fusion performed by the BiFPN module is shown in formula (4):

$$F_i^{out} = \frac{w_{i,1} F_{i,1} + w_{i,2} F_{i,2} + w_{i,3} F_i}{w_{i,1} + w_{i,2} + w_{i,3} + \varepsilon} \qquad (4)$$

where $F_i^{out}$ represents the output feature map, $F_{i,1}$ represents the upsampled feature map, $F_{i,2}$ represents the downsampled feature map, $w_{i,1}$, $w_{i,2}$, and $w_{i,3}$ are weights learned through training, and $\varepsilon$ is a small constant that avoids division by zero.
The feature maps after up-sampling and down-sampling can be calculated using formulas (5) and (6).
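The weighted fusion at a single BiFPN node can be sketched as follows, based on the fast normalized fusion of [16]; this covers one fusion node only, not the full bidirectional pyramid, and the module name is illustrative.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion of a single BiFPN node (cf. formula (4) and [16]):
    learnable non-negative weights, normalized so the result is a weighted
    average of the input feature maps."""
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # e.g. w_{i,1}, w_{i,2}, w_{i,3}
        self.eps = eps

    def forward(self, feats):
        # feats: same-shape maps, e.g. upsampled, downsampled, and lateral features
        w = torch.relu(self.w)           # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)     # normalize the weights
        return sum(wi * f for wi, f in zip(w, feats))
```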

Dataset
The dataset used in this experiment consists of two parts. The first part is the RTTS dataset [18] from RESIDE, jointly released by institutions including Tsinghua University and Beijing Jiaotong University. It contains road traffic images captured in real scenes across different regions and conditions, and has strong representativeness and application value. The second part is the Foggy Cityscapes dataset [19][20], derived from Cityscapes, jointly released by the Computer Vision Center at the University of Stuttgart and the Intelligent Systems Institute. It contains over 5000 high-resolution images from more than 50 cities in Germany and Switzerland and is commonly used for semantic scene segmentation and object detection tasks.

Experimental Environment
This experiment was conducted under the PyTorch deep learning framework, on the Windows 11 operating system with an NVIDIA GeForce GTX 1650 graphics card, using Python 3.9. The model uses the SGD optimizer, with an initial learning rate (lr) of 0.01, a decay parameter of 0.0005, a batch size of 4, and the number of iterations set to 200.
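Under these settings, the optimizer construction might look like the following sketch; the momentum value (0.937, YOLOv5's common default) is an assumption, since the paper lists only the learning rate, decay, batch size, and iteration count.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder module; stands in for the improved YOLOv5 network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,              # initial learning rate from the paper
    momentum=0.937,       # assumed YOLOv5 default; not stated in the paper
    weight_decay=0.0005,  # decay parameter from the paper
)
```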

Evaluation Indicators and Effect Analysis
We use precision, recall, average precision, and mean average precision at an IoU threshold to evaluate the model. The confusion matrix divides test results into true/false positives and true/false negatives. Recall is calculated as shown in formula (10), precision as shown in formula (11), and average precision and mean average precision as shown in formulas (12) and (13), respectively:

$$R = \frac{TP}{TP + FN} \qquad (10)$$

$$P = \frac{TP}{TP + FP} \qquad (11)$$

$$AP = \int_0^1 P(R)\,dR \qquad (12)$$

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \qquad (13)$$
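For reference, a minimal sketch of these metrics from confusion-matrix counts, assuming the per-class AP values have already been computed from the P-R curves; all names and example numbers are illustrative.

```python
def precision_recall(tp: int, fp: int, fn: int, eps: float = 1e-9):
    """Precision (formula (11)) and recall (formula (10)) from confusion-matrix counts."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return precision, recall

def mean_average_precision(ap_per_class):
    """Formula (13): mAP as the mean of per-class APs; each AP (formula (12))
    is the area under that class's P-R curve."""
    return sum(ap_per_class) / len(ap_per_class)

# Example with hypothetical counts and per-class AP values.
print(precision_recall(tp=90, fp=10, fn=20))
print(mean_average_precision([0.92, 0.88]))
```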
The P-R curve plots precision on the vertical axis against recall on the horizontal axis; the IoU threshold for mAP is commonly set to 0.5. The P-R curve during training is shown in Figure 7, which shows that the vehicle class achieves the highest precision and the best detection effect.

Ablation experiment
To verify the impact of the proposed improvements on the detection performance of the YOLOv5 model, four experiments were designed on the RTTS dataset. Without changing the training parameters, the YOLOv5s model was used as the baseline in the first experiment, and the improvements were then added to the YOLOv5 model one by one, controlling variables across the subsequent experiments. The impact of the different improvement methods on detection performance is shown in Table 1, where the symbol "√" indicates that the corresponding improvement is used in the network model. As the table shows, the precision of the original YOLOv5 is 84.33%, the recall is 75.327%, and the mean average precision is 83.11%. Compared with the original YOLOv5, every improvement raises all indicators to some extent, indicating that each improved part contributes to the detection accuracy of the YOLOv5 model.

Comparative experiment
To further verify the effectiveness of the proposed algorithm, the improved YOLOv5 model was compared with the Faster R-CNN, YOLOv3-multiscale, and YOLOv5-CBAM models, all of which have good detection performance. The comparison results of the four object detection algorithms are shown in Table 2. As Table 2 indicates, the mean average precision of the proposed algorithm is 6% higher than that of Faster R-CNN, 11% higher than that of YOLOv3, and 2% higher than that of YOLOv5-CBAM. This shows that the improved algorithm outperforms the other object detection models, further demonstrating the feasibility of the proposed method for vehicle detection in real foggy weather conditions.

Conclusion
This paper proposes an improved YOLOv5 algorithm for vehicle detection in foggy weather, aiming to solve the low accuracy and missed detections of existing models in this scene. First, the AOD-Net network dehazes the image to enhance clarity; then the SE attention mechanism is added to the C3 module, allowing the model to pay more attention to important features and enhancing feature expression and classification accuracy. Finally, in the feature fusion stage, the weighted bidirectional feature pyramid network BiFPN replaces the original FPN+PAN structure to improve multi-scale feature fusion. Experimental results show that the proposed algorithm effectively addresses vehicle detection in foggy weather, attaining a mean average precision of 92.4%, which is 9.3% higher than the original model's mAP and higher than that of existing models with good detection performance. This demonstrates the effectiveness and practicality of the proposed algorithm for vehicle detection tasks in foggy weather environments.

Figure 2. AOD-Net network diagram

K-estimation is the feature extraction module of AOD-Net; it extracts useful features from the input hazy image and estimates the depth and relative haze level. The module consists of two parts: a global branch, which extracts global feature information, and a local branch, which captures local texture information. The outputs of the two branches are fused by a fusion layer to obtain the final result, the weighted sum of the global haze density estimate and the local haze density estimation matrix. The structure of the K-estimation module is shown in Figure 3.

Figure 3. Structure of the K-estimation module

The image generation module generates a clear image from the estimated haze density K value and the input hazy image. The module consists of two parts: a convolutional neural network, which learns useful features from the input hazy image, and a deconvolutional neural network, which contains four deconvolutional layers, each using a 3×3 convolutional kernel and a ReLU activation function.


Figure 4. SE attention mechanism

The SE module mainly consists of two operations: squeeze and excitation. The squeeze operation performs global average pooling on an H×W×C feature map to obtain a 1×1×C feature map, compressing the features of each channel into C real numbers; this extends the receptive field to the global range and better captures global information, enhancing the model's perception ability. The real numbers are calculated as shown in formula (1).

Figure 6. BiFPN network diagram

Assuming the input feature maps are $F_1, F_2, \cdots, F_n$, where $n$ represents the number of input feature maps, the calculation formula of the BiFPN module is shown in equation (4).

Figure 7. P-R curve of model training

Figure 8. Change curve of the loss function

To intuitively demonstrate the superiority and detection effect of the improved algorithm, we used the original YOLOv5 model and the improved model to detect and compare two groups of images. The results are shown in Figure 9.

Figure 9. Comparison between the original YOLOv5 model and the improved model

As the figure shows, both models can recognize vehicle targets, but the confidence of the original YOLOv5 model's detections is generally lower than that of the proposed model, and the original model misses some targets. This indicates that the proposed algorithm achieves a better detection effect and better performance on the foggy weather dataset.

Table 1. Results of the ablation experiment

Table 2. Results of the comparative experiment