Compressing deep neural networks on FPGAs to binary and ternary precision with HLS4ML

We present the implementation of binary and ternary neural networks in the hls4ml library, designed to automatically convert deep neural network models to digital circuits with FPGA firmware. Starting from benchmark models trained with floating point precision, we investigate different strategies to reduce the network's resource consumption by reducing the numerical precision of the network parameters to binary or ternary. We discuss the trade-off between model accuracy and resource consumption. In addition, we show how to balance between latency and accuracy by retaining full precision on a selected subset of network components. As an example, we consider two multiclass classification tasks: handwritten digit recognition with the MNIST data set and jet identification with simulated proton-proton collisions at the CERN Large Hadron Collider. The binary and ternary implementation has similar performance to the higher precision implementation while using drastically fewer FPGA resources.


Introduction
Field-programmable gate arrays (FPGAs) are an efficient and flexible processing solution for low-latency, high-bandwidth inference of deep neural networks (DNNs). Their architecture is well suited to parallelizing the mathematical operations typical of DNN inference tasks, namely matrix multiplication and the application of activation functions. FPGAs can be reprogrammed, which offers unquestionable advantages in terms of flexibility with respect to application-specific integrated circuits (ASICs). At the same time, they share some of the advantages offered by ASICs, such as low power consumption and speed.
Typically, FPGAs are used to emulate generic digital circuits as a preliminary step toward the design of custom ASICs or as an alternative to them. For instance, hundreds of FPGAs are used as custom electronic logic to process in real time the proton-proton collisions at the CERN Large Hadron Collider (LHC). With beams colliding every 25 ns and thanks to a built-in buffering system, a typical LHC experiment has O(1) µs to decide whether to keep or discard a given event. This real-time decision-making system, referred to as the "L1 trigger", consists of a set of digital circuits implementing physics-motivated rule-based selection algorithms. Currently, these algorithms are emulated on FPGAs, mounted on custom boards.
The severe L1 latency constraint prevents the LHC experimental collaborations from deploying complex rule-based algorithms on the L1 FPGA boards. Machine learning (ML) solutions, and in particular DNNs, are currently being investigated as fast-to-execute and parallelisable approximations of rule-based algorithms. For instance, the CMS collaboration has deployed boosted decision trees in the L1 trigger electronic logic [1]. Following this approach, one could train a DNN to process a given input (e.g., energy deposits in a calorimeter) and return the output of an event reconstruction algorithm (e.g., to regress the energy of the incoming particle that caused these energy deposits or to identify its nature). Since the complexity of LHC collision events is going to increase, we expect this approach to become popular in the near future.
In order to facilitate the deployment of DNNs in the L1 triggers of high energy physics (HEP) experiments, we developed a software library, hls4ml, to convert a DNN model into FPGA firmware through an automatic workflow [2]. In HEP, the deployment of deep learning (DL) models on FPGAs has been discussed in the context of the online data-selection system of the LHC experiments. Alternative solutions based on VHDL [3] have been explored. Similar studies and comparable results have been shown in Ref. [4].
The hls4ml design is characterized by two aspects: (i) it relies on high-level synthesis (HLS) back-ends, in order to allow a fully automated workflow from a trained model to an FPGA firmware; (ii) it is designed so that the final outcome is a fully on-chip logic, which enables the latency to stay within typical values of O(1) µs. Our ultimate goal is to support the most popular DNN model ingredients (layers, activation functions, etc.) and an interface to the most popular DL training libraries, either directly (e.g., for TensorFlow [5], Keras [6], and PyTorch [7]) or through the ONNX [8] interface. The library is under development and many of these ingredients are already supported. While hls4ml was initially conceived for LHC applications, its potential use cases go well beyond HEP. In general, hls4ml provides a user-friendly interface to deploy custom DNN models on FPGAs, used as co-processing accelerators or as digital circuits in resource-constrained, low-latency computing environments.
The main challenge in deploying a DNN model on an FPGA is the limited computational resources. Typically, one would reuse resources for the inference operations across multiple clock cycles, at the price of a larger latency. The reuse factor quantifies how many times a resource is reused and is equal to the initiation interval (II) for that operation. A complementary approach consists of compressing the model, e.g., by reducing the number of operations needed in the inference step (pruning) or their cost (e.g., quantizing the network to a fixed-point numerical representation). In a previous publication [2], we showed that pruning and quantization allow one to execute simple fully-connected DNN models with state-of-the-art performance on a specific LHC problem within a latency of O(100) ns, while using only a fraction of the FPGA resources. In this paper, we investigate how a similar result can be obtained with binary and ternary networks [9,10,11], following closely the studies presented in Refs. [9,12,13]. This paper is structured as follows: Section 2 introduces the benchmark problems and data sets. The implementation of binary and ternary networks in hls4ml is described in Section 3. Section 4 describes the different model architectures considered in this study, while their application to the two benchmark classification problems is discussed in Section 5. A summary and outlook is given in Section 6.
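As an illustration of the latency-resource trade-off controlled by the reuse factor, the sketch below (our own illustrative arithmetic, not code from hls4ml) estimates how many multipliers a dense layer instantiates for a given reuse factor:

```python
import math

def multipliers_needed(n_in: int, n_out: int, reuse_factor: int) -> int:
    """Illustrative estimate: a dense layer needs n_in * n_out multiplications;
    with a reuse factor R, each multiplier is time-multiplexed over R clock
    cycles (the initiation interval), so roughly n_in * n_out / R multipliers
    are instantiated in parallel."""
    return math.ceil(n_in * n_out / reuse_factor)

# Trading latency for resources on a 64x32 dense layer:
print(multipliers_needed(64, 32, 1))  # fully parallel, II = 1
print(multipliers_needed(64, 32, 8))  # 8x fewer multipliers, II = 8
```

Larger reuse factors shrink the multiplier count roughly linearly, at the price of a proportionally larger initiation interval and latency.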

Benchmark models and data sets
We consider two benchmark classification tasks: a digit recognition task with the MNIST data set [14] and the LHC jet tagging task discussed in Ref. [2]. The MNIST data set consists of gray-scale images of handwritten digits. Each image is represented as a 28 × 28 pixel array, storing the gray-scale content of each pixel in the original image. For our purpose, we flatten the 2D array to a 1D array, concatenating each row of the image to the end of the previous one. The resulting 1D array is passed as input to a multilayer perceptron (MLP) [15] with an input (output) layer of 784 (10) nodes and three hidden layers with 128 nodes each. Rectified linear unit (ReLU) activation functions [16] are used for the hidden-layer nodes, while a softmax activation function is used for the output layer.
The other benchmark task consists of classifying jets from a set of 16 physics-motivated high-level features, as described in Ref. [2]. The network receives as input a vector of 16 quantities and processes them through an MLP with three hidden layers of 64, 32, and 32 nodes with ReLU activation functions. The output layer consists of five nodes with softmax activation. The five output values correspond to the probability that a given jet belongs to one of the jet classes (light quark q, gluon g, W boson, Z boson, or top quark t). The utilized data set is available on the Zenodo repository [17].
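Both benchmarks are plain dense stacks, so their inference can be sketched in a few lines of NumPy. The sketch below reproduces the layer sizes of the jet classifier (16 → 64 → 32 → 32 → 5) with random, untrained weights, purely to illustrate the data flow; it is not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

# Layer sizes of the jet-tagging MLP: 16 -> 64 -> 32 -> 32 -> 5.
sizes = [16, 64, 32, 32, 5]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # ReLU on the hidden layers, softmax on the five-node output layer.
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return softmax(x @ weights[-1] + biases[-1])

probs = forward(rng.normal(size=16))  # five class probabilities summing to 1
```

The MNIST classifier follows the same pattern with sizes [784, 128, 128, 128, 10].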
The architectures of the baseline MNIST and LHC jet classifiers are illustrated in Fig. 1. Both are implemented and trained with Keras in floating point precision (FPP). Their performance is shown in Fig. 2 in terms of receiver operating characteristic (ROC) curves and normalized confusion matrices. The area under the curve (AUC) of each ROC curve is quoted in the figure, as well as in Table 1, where the corresponding accuracy values are also given. Following convention, we define the accuracy as the fraction of correctly labeled examples. In practice, this is done by applying an arg max function to the array of scores returned by the network and comparing it to the corresponding target array. The total accuracies of the MNIST and LHC jet classifiers, computed across all categories, are found to be 98% and 75%, respectively.

Implementing binary and ternary networks in hls4ml

Binary and ternary networks are extreme examples of quantized neural networks. A network is quantized when the numerical representation of its parameters is a fixed precision. This precision can be constant across the full network or specific to each component (e.g., to different layers). Quantization allows one to reduce the computing resources required by a given model for inference, and it can be tuned so that it comes with little or no loss in terms of performance. In the case of binary (ternary) networks, each weight assumes a value of +1 or −1 (+1, 0, or −1). Two- and three-valued activation functions are used after each layer, acting as discrete versions of the tanh function. As alternatives, we also investigate a standard ReLU function as well as its clipped version [18], defined as min(ReLU(x), y_max), with y_max being a positive hyperparameter. In our study, we fix y_max = 1. The four functions are represented in Fig. 3.
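The four activation functions can be sketched in a few lines of NumPy; the ternary dead-zone threshold of 0.5 used here is an illustrative choice, not a value prescribed by the text:

```python
import numpy as np

def binary_tanh(x):
    # Two-valued step: +1 for x >= 0, -1 otherwise (discrete version of tanh).
    return np.where(x >= 0, 1, -1)

def ternary_tanh(x, threshold=0.5):
    # Three-valued step: 0 in a dead zone around the origin, +-1 outside.
    # The threshold of 0.5 is an assumed, illustrative value.
    return np.sign(x) * (np.abs(x) > threshold)

def clipped_relu(x, y_max=1.0):
    # min(ReLU(x), y_max), with y_max = 1 as in the text.
    return np.minimum(np.maximum(x, 0.0), y_max)
```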
In order to convert the models described in Section 2, we rely on the MLP-related functionalities offered by the hls4ml library, discussed at length in Ref. [2]. In addition, we exploit a set of custom implementations [12], specific to binary and ternary networks, that allow one to speed up the execution of the building-block architecture shown in Fig. 4. These solutions are integrated in recent versions of the hls4ml library, starting with the v0.1.6 tag of the GitHub repository [19]. With respect to the work presented in Ref. [2], this version provides special support for large dense layers, which allows one to deal with large numbers of nodes, as in the models we consider in this study. This functionality will be described in more detail in a future publication.
Binary networks use 1-bit representations for both weights and activations. In this case, the product between two quantities can be optimised to an extremely lightweight operation. By encoding the arithmetical value '−1' as '0', the product can be expressed as an XNOR operation, as shown in Table 2. For models using ternary weights or more than 1 bit for activations, FPGA logic is always used rather than DSPs.
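The mapping in Table 2 can be verified in a few lines: encoding −1 as the bit 0, a 1-bit XNOR reproduces the ±1 product exactly:

```python
def encode(v):
    # arithmetic {-1, +1} -> bit {0, 1}
    return 1 if v == 1 else 0

def xnor(a_bit, b_bit):
    # 1-bit XNOR: 1 when the bits agree, 0 when they differ
    return 1 - (a_bit ^ b_bit)

def decode(bit):
    # bit {0, 1} -> arithmetic {-1, +1}
    return 1 if bit == 1 else -1

# The XNOR of the encoded bits reproduces the arithmetic product
# for all four input combinations:
for a in (-1, 1):
    for b in (-1, 1):
        assert decode(xnor(encode(a), encode(b))) == a * b
```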
The binary and ternary tanh activation functions are implemented by testing the sign (in the case of binary tanh) or the sign and magnitude (for ternary tanh) of the input and yielding the corresponding value ±1 or 0, as seen in Fig. 3. A binary or ternary tanh activation layer preceded by a BN layer can be further optimized. The usual BN transformation is y = γ (x − µ)/√(σ² + ε) + β, given the mean µ, variance σ², scale γ, and shift β computed during the network training. For a BN followed by a binary tanh activation, the sign of y is enough to determine a node output value. To avoid calculating the scaling of x using FPGA logic, the four BN parameters are used to compute the value of x at which y flips sign. This calculation is performed at compile time, when the model is converted to HLS firmware using hls4ml. Similarly, the two values of x around which the output of the ternary tanh activation changes are also calculated at compile time. In the FPGA, each node output is then simply compared against these pre-computed thresholds, outputting the corresponding ±1 or 0. An additional optimization step sets the type of x in the HLS implementation to an integer with a bitwidth corresponding to the largest integer expected for each binary/ternary layer, also computed at compile time. This procedure allows one to further save FPGA resources.
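The compile-time threshold can be sketched as follows: solving y = γ(x − µ)/√(σ² + ε) + β = 0 for x gives the flip point, and (assuming γ > 0 for simplicity) the BN + binary tanh pair collapses to a single comparison:

```python
import numpy as np

def bn_sign_threshold(mu, sigma2, gamma, beta, eps=1e-5):
    """Precompute the input value at which the batch-normalized output
    y = gamma * (x - mu) / sqrt(sigma2 + eps) + beta changes sign.
    Assumes gamma > 0; for gamma < 0 the comparison direction flips."""
    return mu - beta * np.sqrt(sigma2 + eps) / gamma

# Example BN parameters (arbitrary, for illustration):
mu, sigma2, gamma, beta = 0.3, 2.0, 1.5, -0.4
thr = bn_sign_threshold(mu, sigma2, gamma, beta)

def bn_binary_tanh(x):
    # At inference, BN + binary tanh is a single threshold comparison.
    return 1 if x >= thr else -1

# Cross-check against the full BN computation followed by a sign test:
for x in np.linspace(-3, 3, 61):
    y = gamma * (x - mu) / np.sqrt(sigma2 + 1e-5) + beta
    assert bn_binary_tanh(x) == (1 if y >= 0 else -1)
```

The ternary case works the same way, with two precomputed thresholds instead of one.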
The binary and ternary layers considered for this work are fully integrated and compatible with the hls4ml package.
While not explored here, the package also supports models mixing binary/ternary layers with higher precision layers for fully customised networks.

Binarization and ternarization strategies
Given a full-precision model, one could follow different strategies to turn it into a binary or ternary model. One could simply replace each full-precision component by the corresponding binary/ternary element, in order to minimize resource utilization. This might result in a loss of accuracy. As an alternative, one could train a binary/ternary model with an arbitrarily large architecture, in order to match the accuracy obtained at full precision, at the cost of a larger latency and resource consumption. The ultimate strategy to follow depends on the use case. In this work, we present a few options, covering these two extremes as well as intermediate solutions.
We focus on binary/ternary MLPs. The basic structure of the adopted architectures is shown in Fig. 4. Each model consists of a sequence of blocks, each composed of a dense, batch normalization (BN) [20], and activation layer. The BN layer shifts the output of the dense layers to the range of values in which the activation function is nonlinear, enhancing the network's capability of modeling nonlinear responses. For binary and ternary tanh, a BN+activation layer sequence can be implemented at small resource cost (see Section 3), which makes this choice particularly convenient for fast inference on edge devices.
The binarization/ternarization of a given model can be done in different ways, e.g., preserving the model architecture or its performance. As a consequence, for each benchmark problem we consider seven models:
• Baseline: the three-layer MLP described in Section 2.
• Binarized (BNN): a binary version of the baseline model, built preserving the model architecture (number of layers and nodes) while applying the following changes: use a binary representation (±1) for the weights; replace the inner-layer ReLU activation functions with a binary tanh (see Fig. 3); introduce BN layers in between the binary dense layers and the activation functions; remove the softmax activation function in the output layer.
• Ternarized (TNN): a ternary version of baseline model, built preserving the model architecture (number of layers and nodes) while applying the following changes: use a ternary representation (−1, 0, +1) for the weights; replace the inner-layer ReLU activation functions with a ternary tanh (see Fig. 3); introduce BN layers in between the ternary dense layers and the activation functions; remove the softmax activation function in the output layer.
• Best BNN: same structure as the BNN model, but with more nodes in each layer to improve performance. We obtain this model with a Bayesian optimization, aimed at minimizing the validation loss in the training process.
• Best TNN: same structure as the TNN model, but with the number of nodes per layer chosen through a Bayesian optimization of the architecture, as for the best BNN model.
• Hybrid BNN: same as the BNN model, but with ReLU or clipped ReLU activation functions rather than the binary tanh of Fig. 3.
• Hybrid TNN: same as the TNN model, but with ReLU or clipped ReLU activation functions rather than the ternary tanh of Fig. 3.
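The weight binarization/ternarization underlying the BNN and TNN models can be sketched with simple quantizers. The ternary threshold Δ = 0.7·mean(|w|) used below is a common heuristic from the ternary-network literature, assumed here for illustration rather than taken from this paper:

```python
import numpy as np

def binarize(w):
    # Map each weight to +-1 by its sign (zero mapped to +1).
    return np.where(w >= 0, 1.0, -1.0)

def ternarize(w, delta=None):
    # Map each weight to {-1, 0, +1}; weights inside [-delta, delta] become 0.
    # delta = 0.7 * mean(|w|) is a common heuristic (an assumption here).
    if delta is None:
        delta = 0.7 * np.mean(np.abs(w))
    return np.sign(w) * (np.abs(w) > delta)

w = np.array([0.8, -0.05, 0.02, -0.9, 0.3])
print(binarize(w))   # [ 1. -1.  1. -1.  1.]
print(ternarize(w))  # [ 1.  0.  0. -1.  1.]  (delta ~ 0.29 here)
```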
The baseline model is taken as a benchmark of ideal performance and the other models represent different strategies toward a more resource-friendly representation. The BNN and TNN models are simple translations of the baseline model. They are designed to reduce the resource consumption, at the potential cost of a performance drop. The best models are designed to match (as closely as possible) the performance of the baseline model, which might result in a larger resource consumption with respect to what the BNN and TNN models achieve. The hybrid models are a compromise between the two approaches. The fixed-precision conversion is applied only to the weights and biases of the nodes in the dense layers, while ReLU or clipped ReLU activation functions are used. Given the relatively small resources used by the ReLU/clipped ReLU activations, the hybrid models allow one to reach performance closer to the baseline model without inflating the number of nodes and, consequently, of numerical operations. The best BNN and TNN models are only presented for the LHC jet problem, since in that case the simple binarization or ternarization of the baseline model results in a substantial performance loss. The effect is much milder for the MNIST classification problem, so the binary and ternary architectures are not re-optimized in that case.
One should notice that not all the components of a binary (ternary) model come in binary (ternary) precision, e.g., the output of a ReLU activation function in a hybrid model.For this reason, in the following we discuss bit precision and network quantization even in the context of binary and ternary models.
All models are implemented in Keras [6], with TensorFlow [5] as a back-end, using the implementation in Ref. [13] for binary and ternary layers, which we also cross-checked with QKeras [21], obtaining similar results. The network training was performed on an NVIDIA V100 GPU. During training, binary/ternary precision is employed during forward propagation, while full precision is used during backward propagation. The baseline models of Section 2 are trained minimizing a categorical cross entropy. The binary and ternary models are trained minimizing a hinge loss function [22].
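This training scheme (quantized forward pass, full-precision backward pass) is commonly realized with a straight-through estimator (STE). A toy single-layer sketch with a hinge loss, under our own simplifying assumptions and not the paper's actual training code:

```python
import numpy as np

rng = np.random.default_rng(1)

# One binary dense node trained with a hinge loss and the STE: binary
# weights in the forward pass, full-precision "latent" weights updated
# in the backward pass.
w = rng.normal(0, 0.5, size=16)        # latent full-precision weights
x = rng.normal(size=16)                # one input example
t = 1.0                                # target label in {-1, +1}
lr = 0.1

w_b = np.where(w >= 0, 1.0, -1.0)      # forward: binarize the weights
out = w_b @ x
loss = max(0.0, 1.0 - t * out)         # hinge loss

# Backward: d(loss)/d(w_b) = -t * x where the hinge is active; the STE
# passes this gradient straight through to the latent weights.
w0 = w.copy()
grad_wb = -t * x if loss > 0 else np.zeros_like(x)
w -= lr * grad_wb                      # update the full-precision weights
```

At the next forward pass the updated latent weights are binarized again, so the quantized weights can flip sign as training progresses.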

Experiments
The results presented below are synthesized with Vivado HLS version 2018.2 for a Xilinx Virtex UltraScale+ FPGA with part number xcvu9p-flga2104-2L-e. The clock frequency is fixed at 200 MHz, which is typical for the LHC L1 triggers. Unless otherwise specified, the quoted results are derived after the HLS compilation step. The network implementation is further refined by the logic synthesis. We verified that this final step does not affect the accuracy while it reduces the resource consumption.

Handwritten digits classification
We first evaluate the performance of the HLS neural network implementation for the models described in Section 4 with different fixed-point precisions, scanning the number of both integer (I) and fractional (F) bits. In the following, a given choice of fixed-point precision is specified as ⟨T, I⟩, where T = I + F is the total number of allocated bits. For each case, the minimum number of bits yielding an accuracy above 90% after quantization is considered. We then study the latency and resource utilization in these configurations. Table 3 shows a comparison of the performance obtained for the baseline, binary, and ternary models, in terms of accuracy and AUCs, before and after quantization.
Table 3: Accuracy and AUCs of the different MNIST-classification models described in Section 4 before and after quantization, for the fixed-point precision settings chosen for this study. Both the number of integer (I) and fractional (F) bits are specified, using the notation ⟨I + F, I⟩. For each case, we report the range spanned by the 10 AUC values and the model accuracy evaluated on the testing sample.

For binary and ternary models, the hls4ml library applies a further level of per-layer customization of the fixed-point representation, to match the numerical precision of each layer separately, as discussed in Section 3. The outcome of this optimization is shown in the right plot of Fig. 5 for the BNN model, where the gray areas cover different numerical ranges for different layers, despite the common precision specified at compilation (⟨16, 8⟩ in this case). During the optimization, the inputs and the outputs are still represented by the fixed-point precision specified by the user, while the precision of the other network components is optimized.

When quantizing a model, one should allocate I and F bits so that the range of values one can cover overlaps with the range of values returned by the network layers, in order to reduce the impact on accuracy. This is shown in the left plot of Fig. 5, where the profile of output values returned by each layer of the baseline model is compared to the range covered by the allocated fixed-point precision. For each layer, we consider the distribution of the output values obtained running the network on a test sample. In the figure, the box represents the quartiles of the distribution, while the line inside the box shows the median. The lines extending beyond the box show the minimum and maximum values. The gray area represents the numerical range covered by the allocated precision. Overall, the optimized precision matches the bulk of the output values at each layer. The only exception is observed for the output layer. In this case, the allocated precision (gray area in the last row of the left plot in Fig. 5) does not cover the bulk of values returned by the layer (red box in the figure). This happens whenever a given example is associated to a specific class with a score close to 1, so that the other values are pushed close to 0 and out of the supported range. In practice, this fact would not alter the classification outcome in inference. For the baseline model, the quantization from floating-point precision to ⟨18, 8⟩ results in an accuracy drop from 98% to 95%. This is almost entirely induced by the softmax activation function applied to the last layer, and it results from the limited precision of the LUT implementing the exp function in the softmax. This parameter is hard-coded in the version of hls4ml used for this study. One could avoid this accuracy loss by removing the softmax function at the end, as long as one is interested only in which class has the highest score and not in the individual scores. An alternative option is to further optimize the precision of the LUT implementing the softmax activation function. In this case, we verified that an ⟨18, 8⟩ quantized baseline with ⟨22, 10⟩ precision for the softmax LUT recovers an accuracy of 97% without affecting the resources. The ability to externally configure the precision of the softmax LUT will be implemented in future versions of hls4ml.
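The range covered by a given ⟨T, I⟩ allocation can be computed directly. The sketch below follows the signed ap_fixed convention (I integer bits including the sign bit, F = T − I fractional bits) used by HLS:

```python
def fixed_point_range(total_bits: int, integer_bits: int):
    """Representable range and resolution of a signed fixed-point type
    <T, I>, with I integer bits (including the sign bit) and
    F = T - I fractional bits, following the ap_fixed convention."""
    frac_bits = total_bits - integer_bits
    lsb = 2.0 ** (-frac_bits)            # smallest representable step
    lo = -(2.0 ** (integer_bits - 1))    # most negative value
    hi = 2.0 ** (integer_bits - 1) - lsb # most positive value
    return lo, hi, lsb

# The <18, 8> type used for the quantized baseline model:
lo, hi, lsb = fixed_point_range(18, 8)
print(lo, hi, lsb)  # -128.0 127.9990234375 0.0009765625
```

Any layer output falling outside [lo, hi] overflows, which is exactly the failure mode visible in the output-layer row of Fig. 5.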
For the hybrid BNN/TNN models, the same number of bits used for the BNN/TNN cases allows one to achieve the FPP accuracy, on the condition of allocating more integer (10 instead of 6) and fewer fractional (6 instead of 10) bits. This behaviour can be understood from Fig. 6, which shows the range of outputs returned by each hybrid BNN layer. While for I = 10 the allocated precision spans the full range of outputs returned by each layer, frequent overflows are observed for the Dense 1, Dense 3, and Dense 4 layers when we set I = 6. A further optimization of the BN operations for hybrid models could in principle push the DSP utilization closer to zero.
The LUT usage is largely overestimated by the HLS compiler for all binary and ternary NN models, while it is found to be well within the available resources after the logic synthesis. Hybrid models require more LUTs with respect to the standard BNN/TNN, because of the wider data bitwidth at the input of each binary or ternary layer.
Figure 7 shows the dependence of the resource utilization on the maximum latency achieved by the design (controlled by the II) for the baseline and BNN models. Results for the TNN model are very close to the BNN ones. For all latency values, the resources used by the BNN/TNN models are typically reduced with respect to the baseline model. In particular, binary/ternary networks yield a large reduction in the number of DSPs up to a few-µs latency. For higher latency values, the II is large enough to allow a small usage of DSPs even for the baseline model. In that case, the advantage of using a binary or ternary quantization would be minor.
As a final test, we train a larger BNN model consisting of three dense layers with 256 nodes each, as in the study of Ref. [12], allowing for a direct comparison of our implementation of a binary architecture with the one presented there.

Figure 1 :
Figure 1: Network architecture for the baseline MNIST (top) and LHC jet (bottom) classifiers used as benchmark models in this study.

Figure 2 :
Figure 2: Classification performance evaluated on the testing sample of the baseline MNIST (top) and LHC jet (bottom) classifiers used as benchmark models in this study: ROC curves (left) and normalized confusion matrices (right). On the left, numbers in parentheses correspond to the AUC of each class. On the right, the text is omitted for bins corresponding to a false positive rate below 1%.

Figure 4 :
Figure 4: The MLP architecture used in this study, consisting of a sequence of repeating blocks. Each block, fully connected to the previous and following one, consists of a dense layer, a batch normalization layer, and an activation layer. The last block does not have an activation layer.

Figure 5 :
Figure 5: Profile of the range of output values of each layer, sampled during inference on the test dataset, for the baseline (left) and BNN (right) MNIST models. For each layer, the box represents the quartiles of the distribution, while the line inside the box shows the median. The lines extending beyond the box show the minimum and maximum values. The gray shaded areas represent the range covered by the allocated fixed-point precision for each layer. In the left plot, these ranges correspond to the precision specified at compilation (⟨18, 8⟩). In the right plot, an optimization procedure implemented in hls4ml for binary and ternary networks automatically adapts the precision of each layer to match the range covered by the output distribution; as the BN layer is merged with the binary tanh in the HLS implementation, its output precision is 1 bit. Dense, batch normalization (BN), and activation layers are presented in order from the input (top) to the output (bottom).

Figure 6 :
Figure 6: Profile of the range of output values of each layer, sampled during inference on the test dataset, for the hybrid BNN+ReLU model quantized to 16-bit precision, when 10 (left) or 6 (right) bits are used for the integer part. For each layer, the box represents the quartiles of the distribution, while the line inside the box shows the median. The lines extending beyond the box show the minimum and maximum values. The gray shaded areas represent the range covered by the allocated fixed-point precision for each layer. Dense, batch normalization (BN), and activation layers are presented in order from the input (top) to the output (bottom).

Table 1 :
Classification performance evaluated on the testing sample of the baseline MNIST and LHC jet classifiers used as benchmark models in this study: AUC and per-class accuracy.

Table 2 :
Left: All possible products between A and B with values constrained to ±1.Right: The corresponding truth-table when the quantities A and B are each encoded with 1-bit, and the XNOR operation is used for the product.

Table 4
provides a comparison of the resource utilization and latency for the configurations presented in Table 3. For each configuration, we quote both the resource utilization estimated by the HLS compiler and that obtained by the logic synthesis. The logic synthesis transforms the register-transfer level (RTL) design created by the HLS compiler into a gate-level implementation, applying additional optimizations that result in a more accurate assessment of the resource utilization. In the table, the II represents the number of clock cycles needed before the algorithm can accept a new set of inputs. In our study, the II value is fixed by requiring that the resulting resource utilization is below the maximum allowed on the target FPGA. Lower II values would result in a network design that would not fit the device. Larger II values would result in higher latency.
At the very low latency values (O(100) ns) that we are targeting, BNN/TNN models allow one to reach competitive performance while saving most of the FPGA resources. About half of the observed accuracy loss can be recovered using hybrid BNN/TNN models, paying a small price in terms of DSP utilization, induced by an explicit allocation of BN layers before the ReLU/clipped ReLU activation functions rather than the bit-shift implementation described in Section 3.

Table 4 :
Comparison of the resource utilization for the MNIST-classification models described in Section 4, together with timing information.Resources estimated by the HLS compiler (C) and obtained by the logic synthesis (S) are quoted for a chosen initiation interval (II).