Research and Implementation of 3D Object Detection Based on Autonomous Driving Scenarios

Perception, as a core component of autonomous driving, underpins the safety and intelligence of the vehicle: hardware sensors capture the surrounding environment and provide the basis for subsequent driving decisions. In recent years, deep learning has made breakthrough progress in object detection. Building on this, this paper combines LiDAR point cloud data with deep learning theory and methods to address the 3D object detection task, covering theoretical analysis, method verification, and result analysis. A 3D object detection method based on LiDAR point clouds is proposed. The model adopts the backbone network of VoxelNet; after the backbone outputs the feature matrix, sparse regions of the point cloud are densified by point cloud completion, and the feature matrix is then decoded to generate candidate boxes. Experiments on the KITTI dataset show that the model effectively addresses 3D object detection in autonomous driving and achieves good performance.


Introduction
Intelligent connected vehicles have been defined as a new strategic direction for the development of the global automobile industry. The "Made in China 2025" Technology Roadmap for Key Areas proposes that by 2025 the equipment rates of vehicles at the driver-assistance or partially automated level and at the highly automated level will reach 40% and 50% respectively, relying on cyber-physical systems and information and communication technology to make vehicles automated and intelligent. An autonomous driving system comprises several parts. The perception module continuously tracks monitored targets and analyzes their movement trends, providing a basis for obstacle avoidance planning and behavior decision-making. The software and algorithms include data fusion, path planning, control, object detection, and tracking; these algorithms are the core of autonomous driving technology. The power system, including the engine and transmission, provides power for the self-driving vehicle. The human-machine interaction part mainly covers interaction with the driver in the autonomous driving scenario.
3D object detection in the autonomous driving environment has the following research significance: (1) It provides environmental awareness for autonomous driving. The autonomous driving system must accurately perceive the three-dimensional information around the vehicle in real time and detect the various target objects on the road, such as other vehicles, pedestrians, and traffic facilities. 3D object detection provides essential information for this environment perception.
(2) The detected targets provide the judgment basis for obstacle avoidance and path planning. From the detected three-dimensional targets, the autonomous driving system can judge the movement trend and trajectory of each obstacle, and achieve safe obstacle avoidance and more comfortable driving by planning lane changes in advance and adjusting speed.
(3) It provides accurate positioning information. While driving, the vehicle detects the objects around it and senses their three-dimensional coordinates, sizes, and orientations, so that the self-driving vehicle can estimate its own position accurately even without GPS; fused with GPS data, it obtains even more accurate positioning.
To sum up, in an autonomous driving system, 3D object detection directly determines the system's environment perception ability, positioning accuracy, obstacle avoidance planning, and the realization of advanced autonomous driving functions; its significance is self-evident. As autonomous driving technology continues to develop, 3D object detection becomes ever more important, and high precision and low latency have become key indicators in its development.

Related Work
The main challenges of 3D object detection lie in the difficulty of collecting and processing 3D data and in designing efficient deep learning algorithms that exploit both 3D and 2D data. With advances in 3D sensing and computing power, however, 3D object detection [1][2][3] continues to make progress. Existing 3D object detection methods for autonomous driving scenes fall into three categories: methods based on LiDAR point clouds, methods based on images, and methods combining the two.

Detection method based on Lidar point cloud
The input of LiDAR-based 3D object detection methods is the LiDAR point cloud, and the output is the bounding boxes of the detected objects. Representative results include VoxelNet [4] and PointPillars [5]. VoxelNet is an end-to-end 3D object detection method proposed by Apple: it takes raw LiDAR point clouds as input and, through a neural network, directly outputs learned features and detection boxes. VoxelNet operates on voxels in 3D space and integrates accelerated convolution, which allows it to process point cloud data effectively while mitigating the high demand for computing resources.

Image-based detection methods
The core of image-based 3D object detection is to first project the 3D point cloud into a 2D image plane, producing a 2D view of the point cloud. Existing 2D object detection methods are then applied to this 2D view [6][7][8][9]. Finally, the 2D detection results are projected back into 3D space to obtain the 3D object detection boxes.

General introduction to the network
The LiDAR point cloud data serve as the original dataset, and the input data are detected directly with an end-to-end scheme. As shown in Figure 1, the model consists of the following parts; a schematic sketch of the overall flow is given after this list.
(1) Data input: the raw LiDAR point cloud data of the KITTI dataset are used as input.
(2) Feature extraction: this module extracts features from the raw LiDAR point cloud data, uses the backbone network of VoxelNet to output the feature matrix, and completes the point cloud in sparse regions to increase the density of the features.
(3) Feature self-attention: this module decodes the LiDAR point cloud features output by the previous step and further processes them with self-attention and cross-attention mechanisms; self-attention is mainly responsible for decoding, and cross-attention for the generation of candidate boxes.
(4) Candidate box generation and model evaluation: this final module outputs the data with candidate boxes, from which the detection effect of the model can be seen directly; accuracy and other indicators are integrated to evaluate the model comprehensively.
(5) Data output: the 3D target boxes and related information are output.
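As a schematic illustration of the flow just described, the sketch below chains the stages in order. Every function name is a hypothetical placeholder for the corresponding module (identity stubs here); it conveys only the order of operations, not a real API.

```python
# Order of operations in the pipeline of Figure 1, with identity stubs standing
# in for each stage. Hypothetical names; no real implementation is implied.
import numpy as np

def voxelnet_backbone(points):          # feature extraction (stub)
    return points

def complete_sparse_regions(features):  # point cloud completion (stub)
    return features

def self_attention_decode(features):    # Self-Attention: decoding (stub)
    return features

def cross_attention_proposals(decoded): # Cross-Attention: candidate boxes (stub)
    return decoded[:1]                  # pretend one candidate box survives

def detect_3d_objects(lidar_points):
    features = voxelnet_backbone(lidar_points)    # data input -> features
    features = complete_sparse_regions(features)  # densify sparse regions
    decoded = self_attention_decode(features)     # decode the feature matrix
    return cross_attention_proposals(decoded)     # output candidate boxes

boxes = detect_3d_objects(np.random.rand(10000, 4))  # x, y, z, intensity
```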

VoxelNet model optimization
The main implementation details of VoxelNet are as follows; a minimal voxelization sketch follows this list.
(1) Voxels are used to discretize the point cloud data: the three-dimensional space is divided into uniform small voxel units, and each voxel contains all the points within its spatial range.
(2) A 3D convolutional neural network performs feature extraction on the voxel grid: the 3D CNN gradually reduces the voxel scale to obtain semantic features of the point cloud at different spatial resolutions.
(3) A region proposal network (RPN) generates target box candidates on the feature map; the 3D RPN can generate target boxes directly on the point cloud feature map.
(4) An RoI pooling layer extracts the voxel features of each target box for subsequent recognition and model refinement; RoI pooling extracts the features within the corresponding target box from the feature map to identify the target type.
(5) 3D NMS filters the candidate boxes to obtain the final detection result, using strategies such as IoU to remove overlapping boxes in three-dimensional space.
(6) The feature encoding follows the backbone part of VoxelNet, and the rest is optimized: a Transformer is used to decode the model, which ultimately improves its accuracy and detection performance.
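A minimal NumPy sketch of the voxelization step in item (1) is shown below: each point is assigned to a uniform voxel grid and points are grouped by voxel index. The grid bounds and voxel size are illustrative values, not VoxelNet's exact KITTI settings.

```python
# Assign each point to a uniform voxel grid and group points by voxel index.
# Voxel size and detection range below are illustrative, not VoxelNet's exact
# KITTI configuration.
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """points: (N, 3) array of x, y, z. Returns dict voxel_index -> point array."""
    mins = np.array(pc_range[:3])
    maxs = np.array(pc_range[3:])
    size = np.array(voxel_size)
    # Keep only points inside the detection range.
    mask = np.all((points >= mins) & (points < maxs), axis=1)
    points = points[mask]
    # Integer voxel coordinates for every remaining point.
    coords = np.floor((points - mins) / size).astype(np.int32)
    voxels = {}
    for coord, pt in zip(map(tuple, coords), points):
        voxels.setdefault(coord, []).append(pt)
    return {k: np.stack(v) for k, v in voxels.items()}

pts = np.random.uniform(low=(0, -40, -3), high=(70.4, 40, 1), size=(10000, 3))
voxels = voxelize(pts)
print(len(voxels), "non-empty voxels")
```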

Point cloud completion of sparse point cloud
We use VRCNet to perform completion on sparse point clouds. VRCNet comprises two sub-networks: the probabilistic modeling network (PMNet) and the relational enhancement network (RENet). First, PMNet generates a rough point cloud skeleton from the partial cloud; RENet then combines this rough skeleton with the partial observation to infer relational structure, enhancing the detailed features of the final generated point cloud. By learning and predicting multiple symmetries, VRCNet can generate high-quality complete point clouds.
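A heavily schematic sketch of this two-stage coarse-to-fine flow is given below, assuming PyTorch. `PMNetStub` and `RENetStub` are hypothetical stand-ins (plain MLPs) for VRCNet's actual probabilistic modeling and relational enhancement modules; only the data flow matches the description above.

```python
# Schematic coarse-to-fine completion: a stand-in "PMNet" predicts a coarse
# skeleton from the partial cloud, and a stand-in "RENet" refines the skeleton
# fused with the partial observation. NOT the real VRCNet modules.
import torch
import torch.nn as nn

class PMNetStub(nn.Module):
    """Stand-in for the probabilistic modeling network (coarse skeleton)."""
    def __init__(self, n_coarse=1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 256))
        self.decoder = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                                     nn.Linear(512, n_coarse * 3))
        self.n_coarse = n_coarse

    def forward(self, partial):                         # partial: (B, N, 3)
        feat = self.encoder(partial).max(dim=1).values  # global feature (B, 256)
        return self.decoder(feat).view(-1, self.n_coarse, 3)

class RENetStub(nn.Module):
    """Stand-in for the relational enhancement network (detail refinement)."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, coarse, partial):
        pts = torch.cat([coarse, partial], dim=1)  # fuse skeleton + observation
        return pts + self.refine(pts)              # residual detail refinement

partial = torch.rand(2, 2048, 3)                   # batch of partial clouds
dense = RENetStub()(PMNetStub()(partial), partial) # (2, 3072, 3) completed cloud
```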

Use Transformer as a model decoder
First, a detection Transformer is added to decode the feature matrix of the point cloud data from the previous step; self-attention and cross-attention mechanisms then process the decoded feature matrix. Self-attention is mainly responsible for decoding, and cross-attention for the generation of candidate boxes. DETR predicts all objects in a single pass, trains the network end to end with a loss function, and uses that loss function to match predicted values to ground-truth values.
DETR training proceeds in three stages. First, image features are extracted with a convolutional neural network (CNN). Second, the image features are fed into the Transformer's encoder-decoder architecture: the encoder learns the global information of the image, and the decoder generates prediction boxes by performing attention operations over the image features and the object queries. Third, the predicted boxes are matched to the ground-truth boxes through a bipartite matching loss, after which a feed-forward network (FFN) computes the classification and bounding box losses.
The inference stage of DETR likewise has three steps: first, CNN features are extracted from the image; second, the features are fed into the Transformer's encoder-decoder to generate a series of predicted boxes; third, the predicted boxes whose confidence exceeds a threshold are kept as the final output.
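As a concrete illustration, the sketch below (assuming PyTorch) shows the decoder pattern described above: learned object queries first self-attend, then cross-attend to the encoder's feature memory. It mirrors the spirit of DETR's decoder layer rather than reproducing the exact implementation, which stacks several such layers and adds positional encodings.

```python
# Minimal sketch of a DETR-style decoder layer: self-attention over object
# queries, then cross-attention from queries to encoder features.
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 1024), nn.ReLU(), nn.Linear(1024, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, memory):
        # Self-attention: object queries exchange information (decoding).
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        # Cross-attention: queries attend to encoder features (box generation).
        q = self.norm2(q + self.cross_attn(q, memory, memory)[0])
        return self.norm3(q + self.ffn(q))

queries = torch.rand(2, 100, 256)  # 100 learned object queries
memory = torch.rand(2, 400, 256)   # flattened encoder feature map
out = DecoderLayerSketch()(queries, memory)  # (2, 100, 256) -> class/box heads
```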

Dataset Introduction
KITTI is a mainstream dataset in the field of autonomous driving, used mainly in computer vision. It was created jointly by the Karlsruhe Institute of Technology (KIT) and the Toyota Technological Institute at Chicago (TTIC). KITTI is a public dataset built from data collected by vehicles driving on real roads. The collection vehicles carry rich and diverse sensors, including binocular cameras, LiDAR, and a GPS/IMU integrated navigation and positioning system, together with a large number of calibrated ground-truth values. In terms of labels, the dataset covers multiple categories: Car, Van, Truck, Pedestrian, Person (sitting), Cyclist, Tram, and Misc (e.g., trailers, Segways).

Implementation Details
All experiments train the network end to end from the raw data. The training and validation sets contain 7481 images and the testing set contains 7518 images. The model is trained with the SGD optimizer on 16 GTX 1080Ti GPUs using a gradient warm-up strategy: the learning rate is 0.002, the number of warm-up iterations is 500, the warm-up ratio is 0.33, and the batch size is 32. We apply a weight of 0.2 to the depth regression term when training our baseline model to make training more stable.
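A hedged sketch of this optimization setup, assuming PyTorch, is shown below; `model` is a placeholder for the detection network, and the momentum value is an assumption not stated in the text.

```python
# Sketch of SGD with linear LR warm-up: base LR 0.002, 500 warm-up iterations,
# warm-up ratio 0.33 (values from the text). Momentum is an assumed value.
import torch

model = torch.nn.Linear(128, 7)  # placeholder for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.9)

warmup_iters, warmup_ratio = 500, 0.33

def warmup_factor(it):
    # Ramp the LR linearly from 0.002 * 0.33 up to 0.002 over 500 iterations.
    if it >= warmup_iters:
        return 1.0
    alpha = it / warmup_iters
    return warmup_ratio * (1.0 - alpha) + alpha

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for it in range(1000):  # skeleton of the training loop
    # forward pass, loss computation, and loss.backward() omitted in this sketch
    optimizer.step()
    scheduler.step()
```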

Evaluation indicators

IoU (Intersection over Union)
IoU denotes intersection over union, i.e., the ratio of the intersection to the union of the predicted box and the ground-truth box. It measures the degree of overlap between the two boxes: the higher the value, the greater the similarity.
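Formally, IoU is the volume (or area) of the intersection divided by that of the union of the two boxes. The sketch below computes IoU for axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2); note that KITTI's official evaluation uses rotated 3D boxes, so this is illustrative only.

```python
# IoU for axis-aligned 3D boxes (min corner, max corner). Illustrative only;
# KITTI's official evaluation uses rotated boxes.
import numpy as np

def iou_3d_axis_aligned(a, b):
    # Overlap extent along each axis (zero if the boxes do not intersect).
    inter_dims = np.minimum(a[3:], b[3:]) - np.maximum(a[:3], b[:3])
    inter = np.prod(np.clip(inter_dims, 0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)

box_pred = np.array([0.0, 0.0, 0.0, 4.0, 2.0, 1.5])
box_gt = np.array([0.5, 0.2, 0.0, 4.2, 2.1, 1.5])
print(iou_3d_axis_aligned(box_pred, box_gt))  # higher means better overlap
```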

Precision
Precision indicates the percentage of the detections reported by the detector that are actually correct; it evaluates how reliable the detector's reported detections are.
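In standard form, writing TP for true positives and FP for false positives:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}
```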

Recall
Recall indicates the percentage of all ground-truth positive samples that the detector successfully detects; it evaluates the detector's coverage of all targets to be detected.
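In standard form, writing FN for false negatives (missed targets):

```latex
\mathrm{Recall} = \frac{TP}{TP + FN}
```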

mAP (Mean Average Precision)
mAP, the mean of the per-category average precision, measures the overall precision performance of the algorithm across all categories. The mAP value is one of the most important evaluation indicators for object detection algorithms.
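In standard form, the average precision of category c is the area under its precision-recall curve, and mAP averages this over all C categories; detection benchmarks such as KITTI approximate the integral by sampling a fixed set of recall positions:

```latex
\mathrm{AP}_c = \int_0^1 p_c(r)\,dr, \qquad \mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{AP}_c
```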

Experimental results and analysis
The experimental results are compared with the benchmark networks PointPillars, VoxelNet, SECOND, and other classic networks; the performance figures for these networks are taken directly from the corresponding articles. The final results are shown in Table 1, which reports the detection performance of different 3D object detection networks, based on different sensors, on the KITTI test set for cars, pedestrians, and cyclists. The Method column gives the 3D detection network used and the Modality column gives the sensor type, where LiDAR refers to the laser radar and Img refers to the camera. The table shows that LiDAR-based 3D detection networks achieve the best detection performance, because LiDAR perceives depth and distance information best. The network in this paper performs well on 3D object detection and strikes a good balance between detection accuracy and speed; compared with PointPillars, it achieves significant improvements in all three categories. Across tasks, our network improves on the other networks while keeping real-time performance and accuracy in balance. PointPillars, however, retains a clear advantage in detection speed, because it divides the detection space directly into pillar-shaped voxels at the voxelization stage, avoiding 3D convolution and voxel RoI modules, and it is further accelerated by NVIDIA TensorRT.

The ablation experiment
The ablation experiment evaluates the performance and contribution of each part of our method on the KITTI dataset according to the metrics, as shown in Table 2. This section presents two ablation experiments: one for point cloud completion and one for the self-attention module. The point cloud completion ablation verifies the role of point cloud completion in the network and its improvement of the features; the self-attention ablation verifies the contribution of feature self-attention to the network and its performance improvement. To verify the validity and detailed contribution of the different components, two variants of the network model are evaluated:
a. Variant 1: the point cloud completion is removed, the feature matrix is decoded directly without completing the sparse matrix, and the result is observed.
b. Variant 2: the feature self-attention is removed, the features are not decoded, candidate boxes are generated directly, and the result is observed.
The analysis of the experimental results is as follows. If decoding is performed directly after feature extraction, without the point cloud completion operation, the overall experimental results suffer and accuracy trends downward. If the feature self-attention stage is removed, the detection accuracy also decreases. Each component of the network model is therefore necessary on the KITTI dataset, and the full model has advantages in terms of accuracy and precision.

Conclusion
This article addresses the problem of 3D object detection in autonomous driving scenarios efficiently, improves on previous methods, and meets the real-time and accuracy requirements of such scenarios. Future research will consider the practical implementation of the algorithm and its deployment in vehicles. As the cost of the technology falls, self-driving cars will become more affordable and gradually reach the mass market. At the same time, large-scale testing and verification to ensure the safety and reliability of the system is a necessary step toward widespread adoption.

Table 1. Results of the experiment

Table 2. Results of the ablation experiment