Hardware Acceleration Schemes for Convolutional Neural Networks

This paper presents a hardware acceleration design for convolutional neural networks. Floating-point to fixed-point conversion, pipelined inter-layer parallel acceleration, and design space exploration are the three key areas of optimization, and the optimized modules can be combined to build various convolutional networks according to the specifications of the application scenario, achieving a general-purpose design. The experimental results show that the optimized use of hardware resources improves the speed and performance of the algorithm and can accommodate larger data volumes and stricter real-time requirements. The system achieves an accuracy of 95.09% and an inference speed of 0.237 ms per image. As a result, the design solutions presented in this paper allow convolutional neural networks to be used in a wider variety of application scenarios and to manage larger datasets and higher real-time demands.


Introduction
With Moore's Law gradually failing, software acceleration solutions have hit a bottleneck in performance improvement [1]. Especially for emerging applications that are compute-intensive and data-intensive, software solutions implemented on central processors can no longer meet their needs [2]. Hardware acceleration techniques can address the needs of these applications, as they provide abundant computational resources while devoting little overhead to control flow [3][4].
In the 1990s, Lee et al. introduced the ANNA chip, which implemented 64 convolution operations at the 8×8 scale and was the first hardware implementation of a convolutional task [5].
At the beginning of the 21st century, FPGAs saw a rapid increase in capacity and design size, integrated digital signal processing modules, and offered abundant internal multiply-accumulate units capable of performing the large number of multiplication and addition operations required by convolution [6]. In 2019, Leon et al. [7] and Huang et al. [8] reduced the parameters of YOLO network models by about a factor of six using an FPGA platform. In recent years, ever better performance has been pursued to accommodate deep learning algorithms.
This study focuses on a ZYNQ-driven, hardware-accelerated convolutional neural network. There are three key parts: floating-point to fixed-point conversion, pipelined inter-layer parallel acceleration, and design space exploration. Finally, experiments are designed to verify the optimization of the convolutional neural network.

Relevant theoretical foundations
Unlike traditional neural networks, a convolutional neural network is made up of a large number of convolutional layers, pooling layers, and fully connected layers, with the convolutional layer serving as the network's primary building block, as shown in Figure 1.

Convolutional layers
This layer performs a convolution operation on the input image through a series of filters, producing a series of convolutional feature maps. These feature maps can be thought of as abstractions of the input image at different levels: low-level features typically capture edges, lines, and corner points, whereas higher-level features capture details such as the size, color, and texture of objects in the image. The functionality of the convolution layer can be stated as Equation (1):

$$O_{n,x,y} = f\left(\sum_{m}\sum_{i=0}^{K-1}\sum_{j=0}^{K-1} W_{m,n,i,j}\, I_{m,\,xS+i,\,yS+j} + b_n\right) \quad (1)$$

where $O_{n,x,y}$ is the output neuron located at position $(x, y)$ on the $n$th output feature map, $K$ stands for the kernel size and $S$ for the kernel stride, $W_{m,n,i,j}$ is the weight connecting the $m$th input feature map to the $n$th output feature map, $b_n$ is the bias of the $n$th output feature map, $I_{m,xS+i,yS+j}$ is the corresponding neuron on the $m$th input feature map, and $f$ is the nonlinear activation function.
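For concreteness, the following C sketch implements the direct computation of Equation (1) with ReLU as the activation f. The array dimensions, stride, and all identifiers are illustrative assumptions rather than the layer sizes used in this paper's network.

```c
/* Illustrative direct convolution corresponding to Equation (1).
 * M input feature maps, N output feature maps, K x K kernels, stride S.
 * Dimensions are example values, not those used in the paper. */
#define M 3      /* input feature maps   */
#define N 8      /* output feature maps  */
#define K 5      /* kernel size          */
#define S 1      /* stride               */
#define IN_H 32  /* input height         */
#define IN_W 32  /* input width          */
#define OUT_H ((IN_H - K) / S + 1)
#define OUT_W ((IN_W - K) / S + 1)

static float relu(float v) { return v > 0.0f ? v : 0.0f; }

void conv_layer(const float in[M][IN_H][IN_W],
                const float w[M][N][K][K],
                const float bias[N],
                float out[N][OUT_H][OUT_W])
{
    for (int n = 0; n < N; n++)
        for (int y = 0; y < OUT_H; y++)
            for (int x = 0; x < OUT_W; x++) {
                float acc = bias[n];                 /* b_n            */
                for (int m = 0; m < M; m++)          /* input maps     */
                    for (int i = 0; i < K; i++)      /* kernel rows    */
                        for (int j = 0; j < K; j++)  /* kernel columns */
                            acc += w[m][n][i][j] * in[m][y * S + i][x * S + j];
                out[n][y][x] = relu(acc);            /* f(.) = ReLU    */
            }
}
```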
ReLU has become a popular activation function because of its computational simplicity and its mitigation of the vanishing-gradient problem; three common activation functions are shown in Figure 2.

Pooling layer
A pooling layer is typically placed after the convolution layer; its primary purposes are to reduce the dimensionality of the feature map, reduce the number of parameters, and improve computational efficiency. Maximum and average pooling are two common pooling techniques. In this paper, maximum pooling with a stride of 2 is used, represented as Equation (2):

$$p_{\text{out}} = \max_{(i,j)\in R} p_{\text{in}}(i,j) \quad (2)$$

where $p_{\text{in}}(i,j)$ refers to the value at row $i$ and column $j$ of the pooling window $R$ in the input feature map and $p_{\text{out}}$ indicates the pooling result.
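A minimal C sketch of 2×2 maximum pooling with stride 2, matching Equation (2); the channel count and feature-map dimensions are example assumptions.

```c
/* Illustrative 2x2 max pooling with stride 2, per Equation (2).
 * Dimensions and identifiers are example assumptions. */
#define CH    8
#define POOL_IN_H 28
#define POOL_IN_W 28

void max_pool_2x2(const float in[CH][POOL_IN_H][POOL_IN_W],
                  float out[CH][POOL_IN_H / 2][POOL_IN_W / 2])
{
    for (int c = 0; c < CH; c++)
        for (int y = 0; y < POOL_IN_H / 2; y++)
            for (int x = 0; x < POOL_IN_W / 2; x++) {
                float m = in[c][2 * y][2 * x];
                /* take the maximum of the 2x2 window */
                for (int i = 0; i < 2; i++)
                    for (int j = 0; j < 2; j++)
                        if (in[c][2 * y + i][2 * x + j] > m)
                            m = in[c][2 * y + i][2 * x + j];
                out[c][y][x] = m;
            }
}
```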

Fully connected layer
The fully connected layer frequently serves as the output layer of a convolutional neural network; its main role is to take the output of the preceding convolutional and pooling layers and perform operations such as classification or regression. It usually consists of several neurons, each corresponding to a category or a regression output.
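A minimal C sketch of a fully connected layer as described above; the input and output dimensions are example assumptions, not those of the paper's network.

```c
/* Illustrative fully connected (dense) layer: each output neuron is a
 * weighted sum of all inputs plus a bias. Sizes are example assumptions. */
#define FC_IN_DIM  120
#define FC_OUT_DIM 10

void fc_layer(const float in[FC_IN_DIM],
              const float w[FC_OUT_DIM][FC_IN_DIM],
              const float bias[FC_OUT_DIM],
              float out[FC_OUT_DIM])
{
    for (int o = 0; o < FC_OUT_DIM; o++) {
        float acc = bias[o];
        for (int i = 0; i < FC_IN_DIM; i++)
            acc += w[o][i] * in[i];
        out[o] = acc;   /* for classification, a softmax would follow */
    }
}
```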

Floating-point fixed-point
Typically, the training of convolutional neural networks is implemented on CPUs and GPUs with floating-point arithmetic, which offers high accuracy and a large dynamic range. Training with 32-bit floating point ensures the accuracy and smooth convergence of the network. However, floating-point operations on FPGAs consume many DSP resources: a 32-bit floating-point multiplication requires 5 DSP slices, whereas a 32-bit fixed-point operation requires only 2 and a 16-bit fixed-point operation only 1. Floating-point arithmetic is also considerably more complex to implement, while fixed-point arithmetic trades away some accuracy. In other words, to implement efficient convolutional neural networks on FPGAs, fixed-point operations and optimization algorithms must be used to reduce resource and power consumption while preserving accuracy as much as possible.
In this paper, we use 16-bit fixed-point quantization. The quantized data are less precise, but they occupy less storage, reduce bandwidth usage, and thereby reduce power consumption. The fixed-point representation is given by Equation (3):

$$x = 2^{-p}\left(-B_{b_w-1}\,2^{\,b_w-1} + \sum_{i=0}^{b_w-2} B_i\, 2^{\,i}\right) \quad (3)$$

where $b_w$ is the bit width, $p$ is the fixed-point exponent (the position of the binary point), and $B_i \in \{0,1\}$. Fixed-point numbers are represented in two's complement.
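The following C sketch illustrates 16-bit fixed-point quantization in the sense of Equation (3): values are stored as two's-complement integers scaled by 2^-p, and a fixed-point multiplication reduces to one integer multiply and a shift. The number of fractional bits P and the saturation policy are illustrative assumptions, not the paper's actual parameters.

```c
/* Illustrative 16-bit fixed-point quantization per Equation (3):
 * a real value is stored as a two's-complement integer scaled by 2^-p.
 * The fractional bit count P is an example assumption. */
#include <stdint.h>
#include <math.h>

#define P 8   /* number of fractional bits (fixed-point exponent) */

static int16_t to_fixed(float x)
{
    float scaled = x * (float)(1 << P);
    /* round to nearest and saturate to the 16-bit range */
    long r = lroundf(scaled);
    if (r >  32767) r =  32767;
    if (r < -32768) r = -32768;
    return (int16_t)r;
}

static float to_float(int16_t q)
{
    return (float)q / (float)(1 << P);
}

/* 16-bit fixed-point multiply: the 32-bit product is shifted back by
 * P bits to restore the scale, which maps onto a single DSP slice. */
static int16_t fixed_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;
    return (int16_t)(prod >> P);
}
```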

Pipelined inter-layer parallelism
Throughout the computation of a convolutional neural network, the latency of a single pass through all CNN layers is difficult to reduce, because the number of layers to be computed is fixed. Fortunately, throughput can be increased by pipelining. We therefore propose a balancing strategy in which the even-numbered layers are computed in one phase and the odd-numbered layers in the other, as shown in Figure 3. The design uses an overall pipelined structure. Handshake signals implement the synchronization circuit between the computation layers: each layer has a start signal and a completion signal, and when the completion signal is raised after the start signal has been asserted, that layer has finished its computation. A FIFO implements the synchronization for the input and output data: once external input data have been written to the FIFO, the input feature map is stored in BRAM.
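The following C sketch is a simplified software model, not the paper's RTL, of how a FIFO and a start/completion handshake can couple two pipelined layer stages: a stage consumes from its input FIFO only after its start signal is asserted, and it raises its completion signal once the result has been pushed to the output FIFO. All identifiers and the FIFO depth are illustrative assumptions.

```c
/* Behavioural model of the start/done handshake and FIFO coupling
 * between pipelined layer stages. In hardware, each stage would be a
 * separate module and the FIFO a hardware queue, not a C array. */
#include <stdbool.h>

#define FIFO_DEPTH 64

typedef struct {
    float data[FIFO_DEPTH];
    int head, tail, count;
} fifo_t;

static bool fifo_push(fifo_t *f, float v)
{
    if (f->count == FIFO_DEPTH) return false;   /* full: producer stalls */
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return true;
}

static bool fifo_pop(fifo_t *f, float *v)
{
    if (f->count == 0) return false;            /* empty: consumer waits */
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}

typedef struct {
    bool start, done;   /* per-layer handshake signals */
} handshake_t;

/* One scheduling step of a layer stage: it starts when its start flag
 * is raised and its input FIFO has data; it raises done once the
 * result has been written to the output FIFO. */
void layer_step(handshake_t *hs, fifo_t *in, fifo_t *out)
{
    float v;
    if (hs->start && fifo_pop(in, &v)) {
        /* ... per-layer computation on v would happen here ... */
        if (fifo_push(out, v))
            hs->done = true;                     /* completion signal */
    }
}
```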

Exploring Design Space
Computational capacity and bandwidth are two important constraints in the hardware deployment of convolutional neural networks. A system is usually limited by either a bandwidth constraint or a compute constraint, and researchers have found that, using the same FPGA logic resources, performance can vary by up to 90% across different tiling-parameter configurations in an accelerator that allows such configurations.
To analyze a network and obtain better performance, some researchers have begun to use the roofline model to design optimization schemes for accelerators.
For a hardware platform, the platform computational capacity π refers to the theoretical upper limit on the number of floating-point operations the platform can perform per unit time, in FLOP/s, and the bandwidth β refers to the maximum amount of memory traffic the platform can sustain per unit time, in Byte/s. The maximum number of floating-point operations that can be performed per byte exchanged with memory gives the upper bound of computational intensity, as shown in Equation (4):

$$I_{\max} = \frac{\pi}{\beta} \quad (4)$$

As shown in Figure 4, while the computational intensity of a model is below this bound, its theoretical attainable performance on the platform grows with intensity; in this region the model is bandwidth-constrained and the platform's computational capacity is not fully utilized. When the theoretical attainable performance remains essentially constant, the model has reached, or is close to, the computational ceiling π of the platform and the hardware resources are fully utilized. The compute-bound region (the green area in Figure 4) is therefore the target of the design-space optimization.
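A small C sketch of the roofline relation, under the assumption that attainable performance is min(π, β × I) for a model with computational intensity I; the peak-performance and bandwidth numbers below are placeholders, not the figures of the platform used in this paper.

```c
/* Illustrative roofline calculation: attainable performance is bounded
 * by min(platform peak, bandwidth x computational intensity). The
 * numbers in main() are placeholders, not the paper's platform figures. */
#include <stdio.h>

/* pi_flops: platform peak (FLOP/s); beta: bandwidth (Byte/s);
 * intensity: model computational intensity (FLOP/Byte).           */
static double attainable_perf(double pi_flops, double beta, double intensity)
{
    double bw_bound = beta * intensity;
    return bw_bound < pi_flops ? bw_bound : pi_flops;
}

int main(void)
{
    double pi_flops = 80e9;              /* example: 80 GFLOP/s peak         */
    double beta     = 4e9;               /* example: 4 GB/s memory bandwidth */
    double i_max    = pi_flops / beta;   /* Equation (4): intensity bound    */

    for (double I = 1.0; I <= 64.0; I *= 2.0)
        printf("I = %5.1f FLOP/Byte -> %6.2f GFLOP/s (%s-bound)\n",
               I, attainable_perf(pi_flops, beta, I) / 1e9,
               I < i_max ? "bandwidth" : "compute");
    return 0;
}
```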

Results and analysis from experiments
The resources of the FPGA accelerator consist of look-up tables, flip-flops, block RAM, and DSP slices. The BRAM is mainly used for caching intermediate data, the DSP slices are mainly used for the multiplication and addition operations in convolution, and a large number of flip-flops are used to cache the weight data from DDR and the data read from BRAM. The resource utilization is shown in Table 1 [9][10].

Conclusion
Through the rational use and optimization of hardware resources, the design solution in this paper achieves a very significant performance improvement on the LeNet-5 network. On the MNIST test set, the design reduces the inference time from a few milliseconds to 0.237 ms per image, achieving high processing speed and real-time performance while maintaining 95.09% accuracy. This means the design solution can handle larger datasets and higher real-time requirements, allowing convolutional neural networks to play a crucial role in broader application scenarios. Future research can further optimize this design to better meet the requirements of different application scenarios.

Figure 2. Three common activation functions

Figure 3. Timing diagram of the pipelined structure

Table 1. Resource utilization of the FPGA accelerator
The performance of the method described in this research is displayed in Table 2, with a chip power consumption of 1.578 W and a processing speed of 0.237 ms per image, and is compared with a CPU platform scheme and other design solutions in the literature. The simulation in Vivado 2019.1 is shown in Figure 5.

Table 2. Performance parameters of the scheme on a Xilinx ZYNQ 7020 (CLG400-2) FPGA