Hardware-Accelerated YOLOv5 Based on MPSoC

This paper details the development of a hardware acceleration system for YOLOv5, with flame detection as its primary application. The implementation leverages the APU and DPU integrated into the Zynq UltraScale+ MPSoC XCZU7EV. The proposed solution addresses the challenge of achieving real-time target detection on mobile terminals, ensuring both real-time operation and ultra-low power consumption for YOLOv5. Notably, our design approach facilitates the deployment on mobile devices of any target detection algorithm developed under TensorFlow. To optimize model efficiency, we employ saturated linear mapping quantization with calibration. This technique maps model weights, biases, and activations from 32-bit to 8-bit, incurring only a 1.64% accuracy loss. The data flow design is realized through efficient data exchange between the DDR, APU, and DPU over the AXI4 bus. Image pre-processing and post-processing run on the APU, while neural network inference runs on the DPU. The accelerated system demonstrates compelling experimental results: a detection speed of 56 FPS, an accuracy of 36.56% on the COCO2014 dataset, and a total system power consumption of only 4.147 W. Furthermore, the energy consumption ratio is 15.41 GOPS/W, surpassing the RTX A6000 graphics card by a factor of 55.


Introduction
Machine vision algorithms currently rely heavily on large servers for processing. The exponential growth of cloud data has imposed significant processing burdens on these servers. Moreover, transmitting video data to the cloud over mobile networks makes it difficult to guarantee real-time processing performance. With the emergence of edge computing [1], there is a growing demand for running target detection algorithms locally on mobile devices. Hardware acceleration technology [2] aims to leverage programmable circuits for specialized, real-time, and low-power computation of neural networks [3].
YOLOv5, a lightweight iteration of the fourth-generation YOLO [4][5][6][7], boasts rapid detection speed and can be quickly adapted to specific detection requirements, making it particularly well-suited for deployment on embedded platforms. However, the YOLOv5 network structure, despite its efficiency, is intricate and large. Constructing such a network from the underlying gates on an FPGA [8] is time-consuming and labor-intensive. Xilinx has introduced the DPU (Deep-Learning Processing Unit), designed specifically for convolutional neural networks on MPSoC [9]. Functioning as a semi-custom FPGA block, it frees developers from considering the underlying hardware structure during development [10]. Using the Vitis-AI development tool library, developers can efficiently customize the processing elements in the DPU to optimize and execute convolutional neural networks. The development of this system encompasses: training the YOLOv5 model under TensorFlow, quantizing the network weights and activation values with the Vitis-AI tool library, constructing and generating the DPU hardware system, and developing the YOLOv5 inference program on the board. The overall development process is illustrated in figure 1.

Training
Training YOLOv5 demands substantial memory and computing power, making it impractical on hardware terminals. To address this, we train on a GPU-equipped PC, and only the trained network model is transferred to the hardware terminal.
For the backbone we employ Focus+CSPNet [11], while the neck uses FPN+PAN [7]. To improve compatibility with the DPU and optimize detection efficiency, we modify the original YOLOv5 network: all activation functions are replaced with Sigmoid [12], and the pooling kernels of the SPP [13] structure are set to 3×3, 5×5, and 7×7. The training settings are: a single target category (Fire); 9 anchors [14] of (10,13), (16,30), (33,23), (30,61), (62,45), (59,119), (116,90), (156,198), (373,326); and an input image resolution of 416×416. We apply mosaic data augmentation [15] during training, with 24 frozen training iterations out of 48 in total. The batch size is 16 during frozen training and 8 during non-frozen training. Training uses the Adam optimizer [16] with a maximum learning rate of 1e-3, decayed in 'cos' mode.
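The settings above can be collected into a configuration sketch. The grouping of the 9 anchors onto the three detection scales follows the usual YOLO convention (smallest anchors on the finest grid); this grouping and all variable names are illustrative assumptions, not the authors' code.

```python
# Training configuration described in the text (values from the paper;
# names and the anchor-to-scale grouping are illustrative assumptions).
NUM_CLASSES = 1            # single category: Fire
INPUT_SIZE = (416, 416)    # network input resolution

# 9 anchors listed smallest to largest, 3 per detection scale.
ANCHORS = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]

ANCHOR_GROUPS = {
    52: ANCHORS[0:3],   # finest grid: small objects (assumed convention)
    26: ANCHORS[3:6],   # medium objects
    13: ANCHORS[6:9],   # coarsest grid: large objects
}

TRAIN = {
    "frozen_iterations": 24, "total_iterations": 48,
    "frozen_batch_size": 16, "unfrozen_batch_size": 8,
    "optimizer": "Adam", "max_lr": 1e-3, "lr_schedule": "cos",
    "mosaic_augmentation": True,
}
```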
A total of 1000 images were annotated, with 90% allocated for training and 10% for testing, using the AP50 evaluation method: a detection counts as a positive sample when the intersection-over-union between the Ground Truth and the Bounding Box exceeds 50%. Tested on the PC, YOLOv5 achieves a recognition accuracy of approximately 67.86% and a detection speed of about 13.89 ms per image, equivalent to 72 FPS on the RTX A6000 graphics card.
The loss function comprises three components: prediction box loss, confidence loss, and classification loss [7]. The loss value is computed by accumulating over each grid cell at the 13×13, 26×26, and 52×52 scales. To keep the three components balanced, we introduce three weights: λobj = 0.05, λbox = 1.0, and λcls = 0.5. The loss is defined in formula (1).
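The weighted combination in formula (1) can be sketched as follows; the function name is illustrative, and each component is assumed to already be summed over the three grid scales.

```python
def total_loss(box_loss, obj_loss, cls_loss,
               lambda_box=1.0, lambda_obj=0.05, lambda_cls=0.5):
    """Weighted sum of the three loss components (formula (1)); each
    component is assumed to be accumulated over the 13x13, 26x26 and
    52x52 grids before weighting."""
    return lambda_box * box_loss + lambda_obj * obj_loss + lambda_cls * cls_loss
```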
For the prediction box loss we adopt the CIoU model [17], which jointly addresses three crucial aspects of regression positioning: the intersection ratio, the aspect ratio, and the distance between center points. The CIoU loss is defined in formula (2).
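A minimal sketch of the CIoU loss of [17] for two axis-aligned boxes, combining the three terms just listed (overlap, center distance, aspect-ratio consistency); box format and epsilon values are assumptions, not the authors' code.

```python
import math

def ciou_loss(box_a, box_b):
    """CIoU loss between two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection-over-union (overlap term)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter + 1e-9)

    # Squared distance between the two box centers
    rho2 = ((ax1 + ax2) - (bx1 + bx2)) ** 2 / 4 + ((ay1 + ay2) - (by1 + by2)) ** 2 / 4

    # Squared diagonal of the smallest enclosing box
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + 1e-9

    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan((bx2 - bx1) / (by2 - by1))
                              - math.atan((ax2 - ax1) / (ay2 - ay1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)

    return 1 - (iou - rho2 / c2 - alpha * v)
```

Identical boxes give a loss near 0; disjoint boxes give a loss above 1, since the center-distance term still penalizes them even when the IoU is 0.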

Lightweight
YOLOv5 demands significant memory, but hardware terminals operate under tight memory constraints. To ensure real-time performance and operational reliability, we quantize the model [18], converting its floating-point operations to fixed-point or integer operations [19]. This substantially reduces the model's memory footprint at the cost of only a marginal loss of accuracy.
Our quantization method of choice is saturated linear mapping quantization with calibration. This technique determines an optimal quantization threshold for each layer's weights and activations, minimizing the accuracy loss caused by quantization. Non-saturated quantization maps the full floating-point range onto [-128, 127], so a few excessively large or small activation values can concentrate the quantized values within a small range and waste much of the interval information; saturated linear mapping avoids this and preserves precision, as illustrated in figure 4. For each layer, the quantization involves the following steps: (1) Calibration set selection: choose a subset of the validation set, 50 images in total, as the calibration set. (2) Floating-point inference on the calibration set: run the floating-point network on the calibration data, collect the activation values of each layer, and count them into a 256-group histogram. (3) Threshold selection for quantization: the median value of each group is taken as a candidate threshold T; the quantized result is used as the network model to run inference on the calibration dataset, and the activation values of each layer are collected into a histogram, until all candidate thresholds have been traversed.
(4) KL divergence for threshold optimization: use the KL divergence [20] to find the threshold that minimizes the loss of information before and after quantization, and record this threshold T. KL divergence, or relative entropy, is derived from the information entropy H(p) and the cross entropy H(p, qj). We take the activation value histogram obtained by inference with floating-point weights as the true distribution p, and the distribution qj obtained by inference with fixed-point weights as the predicted distribution. The minimum KL divergence is then computed as in formula (3).
KL(p || qj) = H(p, qj) − H(p)    (3)

KL divergence quantifies the dissimilarity between two distributions. The objective is to identify the quantization threshold T that minimizes the KL divergence between the true and predicted distributions. This forms the crux of the saturated linear mapping quantization scheme, aiming for the least loss of precision.
(5) Saturated linear mapping and quantization: apply the saturated linear mapping and quantization to this layer's activation values using the identified threshold T.
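The calibration loop above can be sketched as follows. This is a simplified, TensorRT-style version of the KL-based threshold search under stated assumptions (256 histogram bins, 128 quantization levels, candidate thresholds swept over bin edges); the authors' exact binning and search may differ.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) over histogram bins where p > 0 (cf. formula (3))."""
    p = p / p.sum()
    q = q / max(q.sum(), 1e-12)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12))))

def find_saturation_threshold(activations, num_bins=256, num_levels=128):
    """Search the clipping threshold T minimising the KL divergence between
    the floating-point activation distribution and its clipped, 8-bit
    quantised counterpart. Simplified sketch of steps (1)-(5)."""
    acts = np.abs(np.asarray(activations, dtype=np.float64))  # |x| first, as in the text
    hist, edges = np.histogram(acts, bins=num_bins)
    best_t, best_kl = float(edges[-1]), float("inf")
    for i in range(num_levels, num_bins + 1):
        p = hist[:i].astype(np.float64).copy()
        p[-1] += hist[i:].sum()          # reference keeps the clipped tail mass
        q = np.zeros(i)                  # candidate: first i bins in 128 levels
        group = np.arange(i) * num_levels // i
        for level in range(num_levels):
            sel = group == level
            if sel.any():
                q[sel] = hist[:i][sel].sum() / sel.sum()
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, float(edges[i])
    return best_t

def quantize_int8(x, threshold):
    """Saturated linear mapping: clip at T, then map [-T, T] onto int8."""
    scale = threshold / 127.0
    return np.clip(np.round(np.asarray(x) / scale), -128, 127).astype(np.int8)
```

Because the reference distribution keeps the clipped tail mass while the candidate does not, aggressive clipping is penalized, which is what distinguishes this saturated scheme from simply mapping the full range.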
To make the effect of the curve easier to observe and to simplify the quantitative assessment, an absolute value operation is applied to the activation values beforehand. Figure 5(a) shows the probability distribution of activation values before and after quantization when the optimal threshold T is used for a specific layer. Notably, the activation values concentrate in the first 50 of the 256 groups, so for visual clarity only the first 50 groups are shown.

Hardware Design
In practical operation, a lightweight model structure alone is not enough; a set of data streaming strategies is also required. Together, these components constitute the mobile target detection system.
The system is designed around the MPSoC, as shown in figures 6 and 7, with a primary focus on minimizing power consumption. Its core is the Xilinx XCZU7EV chip, used through the cooperative operation of the Processing System (PS) and Programmable Logic (PL). Communication between PS and PL takes place over the AXI4 bus [21]. Data exchange is handled by the DMA control bus, and a handshake module manages the state interaction between the PS-side and PL-side data paths. DDR4 high-speed SDRAM stores the image and model data at runtime. The data flow proceeds as follows: (1) Image acquisition and pre-processing: the APU reads the image from DDR; pre-processing is handled by the quad-core ARM Cortex-A53 processors in the APU. (2) Data transmission within the Processing System (PS): the pre-processed image is transmitted from the PS side to the input data buffer on the PL side via the AXI bus.
(3) DPU inference: the DPU runs inference on the image using the network model; the results are stored in the PL's output data buffer.
(4) Result transmission within the Processing System (PS): the inferred result is transmitted from the PL side to the DDR on the PS side via the AXI bus.
(5) Result display: the APU reads the processing result from DDR and displays it.

The DPU IP core is customized in Vivado, and on-board resources are selected for the hardware build. The specific procedure is as follows: (1) IP module selection in Vivado: use Vivado to add the DPU IP core to the hardware block design. (2) DPU configuration in Vivado: configure the DPU IP core parameters, such as the number of cores, BRAM utilization, channel enhancement, frequency, and bus settings; see table 1 for the configuration used in this system.
(3) Generation of xsa Files: Generate Xilinx System Archive (xsa) files containing the configured DPU IP core.
(4) Platform Project Compilation with Petalinux: Compile the generated xsa files into a Platform project using the Petalinux tool.
(5) Image File Building with Vitis: Use Vitis to build an image file based on the compiled Platform project.
(6) SD Card Image Burning with imageUSB: Employ the imageUSB tool to burn the generated image file onto the SD card for subsequent deployment.

Compilation and Inference
DPU inference relies on three parts, as shown in figure 8. The dynamic link library contains the model information, from which the DPU builds its internal PE units to form a computing model. The system image generated by Vivado includes the DPU's configuration information and sets its working frequency. The associated driver and tool libraries let users call the DPU from Python to create and shut down work tasks.
The overall flow of the YOLOv5 inference program is: obtain the categories and anchors, call the DPU, create DPU tasks, pre-process the image, load the image into DPU memory, obtain the network input tensor, run the network model on the DPU, post-process the results, and display them. The system operation flow chart is shown in figure 9.
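The DPU-facing steps of this flow can be sketched with Vitis-AI's XIR/VART Python interface. The API calls below follow Xilinx's documented VART Python API; the model file name and int8 buffer dtype are illustrative assumptions, and the code must run on the board with the vart/xir packages installed, so it is wrapped in a function rather than executed at import time.

```python
def run_dpu_inference(xmodel_path="yolov5_fire.xmodel"):
    """Sketch of calling the DPU from Python via XIR/VART (assumed file name)."""
    import numpy as np
    import vart
    import xir

    # Load the compiled model and pick out the DPU subgraph
    graph = xir.Graph.deserialize(xmodel_path)
    subgraphs = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
                 if s.has_attr("device") and s.get_attr("device").upper() == "DPU"]
    runner = vart.Runner.create_runner(subgraphs[0], "run")  # create the DPU task

    in_t = runner.get_input_tensors()[0]
    out_ts = runner.get_output_tensors()

    # Pre-processed input batch and output buffers (shapes come from the model)
    inp = [np.zeros(tuple(in_t.dims), dtype=np.int8)]
    outs = [np.zeros(tuple(t.dims), dtype=np.int8) for t in out_ts]

    job_id = runner.execute_async(inp, outs)  # launch inference on the DPU
    runner.wait(job_id)                       # block until the task completes
    return outs                               # feature maps for post-processing
```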

Experiment
The results show that pre-processing (APU) of a single image takes about 215 ms and post-processing (APU) about 210 ms, while the convolutional neural network computation on the DPU takes about 17.5 ms. The theoretical detection speed therefore reaches 56 FPS, and the computation speed is about 63.89 GOPS. Figure 10 shows a test image.
In target detection, model accuracy is commonly evaluated with AP. Here TP denotes positive samples predicted as positive, TN negative samples predicted as negative, FP negative samples predicted as positive, and FN positive samples predicted as negative. The metrics are:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)
Precision = TP / (TP + FP)    (5)
Recall = TP / (TP + FN)    (6)

We use AP50 for testing. The test set contains 226 images, with 297 positive samples and 13 negative samples. From formulas (4), (5), and (6), the precision-recall curve shown in figure 11 is obtained.
The area under this curve is the accuracy AP, computed by formula (7):

AP = Σ_i [(R_i − R_{i−1}) × P_i]    (7)

On the MPSoC, the detection accuracy of the model is about 65.19%, a decrease of about 1.64% compared with the PC. In addition, we tested on the public COCO2014 dataset, obtaining an accuracy of 36.56%. The theoretical power analysis can be produced after the Vivado model runs, as shown in figure 12. The RTX A6000 runs at 72 FPS and the MPSoC neural network at 56 FPS; both meet real-time requirements. However, the MPSoC simulation results show a total power of 4.147 W, of which 0.702 W is static power and 3.446 W is dynamic power. The power consumption of the MPSoC is only about 1.4% of the RTX A6000's 395 W.
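Formulas (5)-(7) can be sketched as follows: detections are ranked by confidence, cumulative TP/FP counts give the precision-recall points, and AP is the stepwise area under the curve. Function and argument names are illustrative, not the authors' code.

```python
import numpy as np

def precision_recall_ap(scores, is_tp, num_gt):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), and
    AP = sum_i (R_i - R_{i-1}) * P_i over the ranked detections.
    `num_gt` = TP + FN is the number of ground-truth boxes."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by confidence
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    recall = tp / num_gt
    precision = tp / (tp + fp)
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):   # stepwise area under the P-R curve
        ap += (r - prev_r) * p
        prev_r = r
    return precision, recall, ap
```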

η = GOP / J    (8)
Note that the detection speed of the MPSoC is not as fast as that of the RTX A6000. However, the hardware resources of the two are not of the same order of magnitude, so their total energy consumption cannot be compared directly. This paper therefore compares energy consumption ratios. Equation (8) gives the formula for the energy consumption ratio in target detection, where GOP is the amount of computation and J is the power consumption. The target detection energy consumption ratio of the RTX A6000 is about 0.28 GOPS/W, while that of the MPSoC is 15.41 GOPS/W, about 55 times higher, with extremely low power consumption.
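Plugging in the reported figures confirms the stated ratio; the function name is illustrative.

```python
def energy_efficiency(gops, watts):
    """Energy consumption ratio eta = GOP / J from equation (8):
    computation throughput divided by power draw."""
    return gops / watts

# Figures reported in the paper:
mpsoc_ratio = energy_efficiency(63.89, 4.147)  # ~15.41 GOPS/W
a6000_ratio = 0.28                             # reported for the RTX A6000
advantage = mpsoc_ratio / a6000_ratio          # ~55x
```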

Conclusion
In this paper, we present a novel hardware acceleration system for YOLOv5, with a primary focus on flame detection. Table 2 compares hardware-accelerated systems for YOLO algorithms from 2019 to 2023; our system shows good performance. We attribute this to effective quantization methods, a reasonable data flow strategy, and the high performance of the DPU. Figure 13 shows part of the test images.

Figure 2. Schematic diagram of ground truth and bounding box variables

Figure 3. Part of the training set images

Figure 4. Comparison of non-saturating/saturated linear mapping quantization methods. (a): non-saturating linear mapping; (b): saturated linear mapping

Figure 5. Comparison of activation values before and after quantization

Figure 6. Schematic diagram of the overall architecture of the system hardware and the data stream transmission strategy

Figure 8. Schematic diagram of the working principle of DPU
Figure 9. System operation flow chart

Table 1. The DPU configuration of this system

Table 2. Hardware-accelerated system performance of some YOLO algorithms from 2019 to 2023