Real-time Lightweight Target Detection Network under Autonomous Driving

Target detection in traffic scenes has been a focal point of computer vision, especially in the context of autonomous driving, which requires the recognition of vehicles, pedestrians, and traffic signs. However, the computational power of devices operating in traffic scenarios is limited, placing stringent demands on computational resources and latency. To address these challenges, this study proposes a lightweight detection algorithm, GV2-YOLO, which prioritizes both speed and accuracy. First, we apply the GhostNetV2 architecture and incorporate the Ghost module to decrease the parameter count of the backbone feature extraction network. Second, the algorithm integrates the SPPF module and the Slim-neck by GSConv technique, which effectively reduce computational complexity while avoiding a significant loss of detection accuracy. The proposed algorithm achieves a mAP of 84.7% on the CCTSDB dataset with a total parameter count of 9.1 M, making it well suited for deployment on embedded autonomous driving platforms.


Introduction
Target detection is one of the key technologies for autonomous driving and intelligent driver assistance, helping drivers improve driving safety. Traffic sign detection is one of its branches and is therefore a current research hot spot. Mainstream target detection algorithms are mainly classified into single-stage [1] and two-stage [2] object detection. Although classification-based two-stage algorithms have greatly improved detection performance, they struggle to meet real-time detection requirements: their accuracy is relatively high, but they require numerous parameters.
Under current hardware conditions, it is not feasible to handle large-scale convolutional operations in autonomous driving scenarios. It has therefore become imperative to develop compact neural network models that meet the requirements of autonomous driving. With this in mind, we pursue the following research objectives: (1) enhance feature extraction capability by employing the lightweight backbone network GhostNetV2, supplemented with hardware-friendly DFC attention; (2) fuse local and global features and accelerate inference by introducing the SPPF module, and further lighten the neck with the improved Slim-neck by GSConv; (3) validate the detection efficiency and accuracy of the proposed model through experiments on the CCTSDB traffic sign dataset. Consequently, the model can be seamlessly deployed on embedded platforms and applied to autonomous driving in the future.

Related Work
One-stage object detection algorithms have greatly improved the efficiency of object detection, making real-time detection a reality and making them suitable for autonomous driving scenarios. Redmon and colleagues introduced the YOLO algorithm, which predicts bounding boxes and object classes directly with a single convolutional neural network, achieving truly real-time object detection. In recent years, researchers have advanced the YOLO family, producing more sophisticated versions such as v5, v6 [3], and v7.
The neural networks required for autonomous driving need to be lightweight and efficient [4]. A common practice is to design a new lightweight network architecture. Howard et al. [5] proposed the MobileNet series, using depthwise separable convolutions and related operations to build resource-efficient blocks. Li et al. [6] proposed a lightweight convolution method called GSConv, which achieves performance comparable to ordinary convolution at close to the cost of depthwise separable convolution. Han et al. [7] proposed a cost-effective Ghost operation to reduce feature redundancy across channels. Tang et al. [8] added an attention mechanism on top of these methods to enhance feature extraction. Based on the analysis above, we propose an improved lightweight, high-accuracy object detection network.
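To make the cost savings of such cheap operations concrete, the following back-of-the-envelope sketch contrasts the weight count of a standard 3x3 convolution with that of a Ghost-style module producing the same number of output channels. The layer sizes and the ratio/kernel defaults are hypothetical, chosen only for illustration.

```python
# Parameter counts: standard k x k convolution vs. a Ghost-style module.
# All sizes below are illustrative assumptions, not values from the paper.

def conv_params(c_in, c_out, k):
    # weight count of a standard k x k convolution (bias omitted)
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k, cheap_k=3, ratio=2):
    # a primary conv produces c_out / ratio "intrinsic" feature maps;
    # cheap depthwise (cheap_k x cheap_k) ops generate the remaining
    # "ghost" maps from those intrinsic features
    intrinsic = c_out // ratio
    primary = conv_params(c_in, intrinsic, k)
    cheap = intrinsic * (ratio - 1) * cheap_k * cheap_k
    return primary + cheap

std = conv_params(256, 256, 3)     # 589,824 weights
ghost = ghost_params(256, 256, 3)  # 296,064 weights, roughly half
```

With a ratio of 2, the module replaces half of the expensive dense filters with per-channel cheap operations, which is where the roughly 2x parameter reduction comes from.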

Network Architecture
In this study, the architecture is depicted in Figure 1. First, the backbone originally used by YOLOv4 for raw feature extraction is CSPDarknet53, which is based on CSPNet (Cross-Stage Partial Network) proposed in the literature. To enrich feature extraction at lower cost, this backbone is replaced with GhostNetV2, a lightweight network with hardware-friendly DFC attention. In addition, SPP (spatial pyramid pooling) is improved to SPPF, which is more computationally efficient while retaining an equivalently large receptive field. This module outputs a fixed-size feature matrix, avoids information loss due to cropping, and alleviates the multiscale problem to some extent. The path aggregation network (PAN), which combines a top-down pyramid and a bottom-up pyramid in parallel, is improved with the lightweight GSConv; this reduces parameters while enhancing feature extraction and multiscale fusion. Finally, the YOLO head predicts the detection results from features that have undergone multiple convolutional stages. The network thus combines the advantages of a lightweight convolutional backbone, the DFC attention mechanism, the SPPF module, and the PANet Slim-neck.
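The efficiency of SPPF over SPP comes from replacing SPP's parallel 5/9/13 max pools with three sequential stride-1 pools of size 5, whose outputs have the same receptive fields. The sketch below demonstrates this equivalence in 1-D with NumPy for clarity; the 2-D case is analogous, and the function name is our own.

```python
import numpy as np

def maxpool1d(x, k):
    # stride-1 max pool with "same" padding (padded with -inf)
    p = k // 2
    xp = np.pad(x, p, constant_values=-np.inf)
    return np.lib.stride_tricks.sliding_window_view(xp, k).max(axis=-1)

x = np.random.default_rng(0).random(64)
p5 = maxpool1d(x, 5)    # receptive field 5
p9 = maxpool1d(p5, 5)   # two sequential 5-pools == one 9-pool
p13 = maxpool1d(p9, 5)  # three sequential 5-pools == one 13-pool

assert np.allclose(p9, maxpool1d(x, 9))
assert np.allclose(p13, maxpool1d(x, 13))
```

SPPF therefore reuses the intermediate pooling results instead of recomputing each scale from scratch, which is what speeds up inference without changing the output features.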

Backbone Feature Extraction
Convolutional neural networks composed of multiple convolutional layers require a significant amount of computational resources. As a result, a series of efficient networks have emerged that introduce depthwise separable convolutions or shuffling operations to construct CNNs with fewer floating-point operations (FLOPs). However, the remaining 1x1 convolutional layers still consume a significant amount of memory and FLOPs. We therefore introduce the Ghost module to obtain more features from low-cost operations. Using the Ghost module, the Ghost bottleneck of the lightweight CNN is designed; it is built from stacked Ghost modules and uses the same residual structure as ResNet. GhostNet equipped with DFC attention is called GhostNetV2. The DFC attention branch operates in parallel with the initial Ghost module to enrich the expanded features. The architecture is illustrated in Figure 2. The feature maps in CNNs are usually low-rank, thereby obviating the need for dense input-output connections across diverse spatial locations. Leveraging the 2D configuration of CNN features presents an opportunity to decrease the computational load of the fully connected (FC) layers by decomposing the aggregation into two directions. Specifically, the attention map is aggregated along the horizontal and vertical directions separately:

$$a'_{hw} = \sum_{h'=1}^{H} F^{H}_{h,h'w} \odot z_{h'w}, \quad h = 1, 2, \ldots, H,$$

$$a_{hw} = \sum_{w'=1}^{W} F^{W}_{w,hw'} \odot a'_{hw'}, \quad w = 1, 2, \ldots, W.$$

This article assesses the efficacy of each model using precision (P), recall (R), F1 score, and mean average precision (mAP). The mAP is employed as the primary metric for evaluating the comprehensive performance of the models:

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i,$$

where N is the number of target classes.
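The evaluation metrics above can be sketched as a few lines of plain Python. The detection counts and per-class AP values below are illustrative assumptions, not results from the CCTSDB experiments.

```python
# Minimal sketch of the evaluation metrics; all numbers are illustrative.

def precision(tp, fp):
    # fraction of predicted positives that are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of ground-truth targets that are detected
    return tp / (tp + fn)

def f1_score(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

def mean_ap(ap_per_class):
    # mAP = (1/N) * sum of per-class average precisions
    return sum(ap_per_class) / len(ap_per_class)

p = precision(tp=80, fp=20)   # 0.8
r = recall(tp=80, fn=40)      # ~0.667
f = f1_score(p, r)            # ~0.727
m = mean_ap([0.9, 0.8, 0.7])  # mean over N = 3 hypothetical classes
```

Note that each AP value is itself the area under a class's precision-recall curve; mAP then averages these areas over the N classes.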

Experimental Results
To demonstrate the advantages of the GV2-YOLO algorithm introduced in this research, we conducted experiments on the CCTSDB2021 dataset and compared GV2-YOLO against classical detection algorithms of both stages; the experimental outcomes are displayed in Table 2.
Table 2. Metric evaluation using different detection algorithms on the CCTSDB2021 dataset.

The precision, recall, and mAP of the classical SSD were only 86.5%, 27.4%, and 49.2%, respectively. The R-CNN algorithm family has undergone a series of enhancements; one of the stronger variants is Libra R-CNN, whose precision, recall, and mAP were 83.7%, 60.0%, and 61.4%, respectively. In the YOLO series, YOLOv4 added many training techniques, improving the F1 score by 5.4% and mAP by 1.2% over YOLOv3. Newly improved versions of the YOLO algorithm have demonstrated outstanding performance in recent years: YOLOv5's F1 score and mAP increased by 24.4% and 30.4%, respectively, compared to YOLOv3. YOLOv7-tiny performed well, but GV2-YOLO outperformed it by 1.0% in F1 score and 4.8% in mAP. GV2-YOLO reached a mAP of 84.7%, essentially achieving the expected effect. The model's prediction results are depicted in Figure 4.

Conclusions
We propose a lightweight and efficient network model that can be deployed in autonomous driving scenarios. First, the model introduces the GhostNetV2 lightweight backbone and the improved SPPF module for model compression and inference acceleration. Second, we transform the neck layer into a slim neck. Finally, our experimental comparisons validate the proposed model's efficacy and superiority. The comparative experiment on the CCTSDB dataset shows that GV2-YOLO achieves the best results in detection accuracy, recall, and F1 score. Our network reaches 41.22 FPS (frames per second), meeting real-time requirements. This indicates that the model can run in real time on embedded devices and can be applied in autonomous driving scenarios. In future work, we will further optimize the algorithm to achieve better performance.
Table 1 lists the main parameters used for training in this experiment.