Exploring machine learning to hardware implementations for large data rate x-ray instrumentation

Over the past decade, innovations in radiation and photonic detectors have considerably improved their resolution, pixel density, sensitivity, and sampling rate, all of which contribute to increased data generation rates. This growth in data volume increases the amount of storage required, as well as the cabling between the source and the storage units. To overcome this problem, edge machine learning (EdgeML) proposes to move computation units near the detectors, utilizing machine learning (ML) models to emulate non-linear mathematical relationships between the detector's output data. ML algorithms can be implemented in digital circuits, such as application-specific integrated circuits and field-programmable gate arrays, which support both parallelization and pipelining. EdgeML combines the benefits of edge computing and ML models to compress data near the detectors. This paper explores the currently available tool-flows designed to translate software ML algorithms to digital circuits near the edge. The main focus is on tool-flows that provide a diverse range of supported models, optimization techniques, and compression methods. We compare their accessibility, performance, and ease of use, and evaluate them for two high data-rate instrumentation applications: (1) CookieBox, and (2) billion-pixel camera.


Introduction
New instrumentation detectors have better sensitivity, sampling rate, and pixel density. These improvements significantly increase the total data velocity, exceeding terabytes per second (TB s⁻¹) in particle physics and medical imaging experiments and surpassing the capacity of current acquisition systems [1,2]. For example, the data generation rate of Large Hadron Collider (LHC) experiments at CERN reaches 1200 GB s⁻¹ [3]. The detectors at the LHC use multi-level trigger systems and still need massive data centers to analyze, compress, and save the final data. The current methods save only a small fraction of the total generated data, recording only 1 in 10 to 1 in 100 bunch crossings happening at 40 MHz [4,5]. Another example is the LINAC Coherent Light Source (LCLS) at the Stanford Linear Accelerator Center (SLAC) National Accelerator Laboratory, which has a repetition rate of 1 MHz leading to colossal data velocity, exceeding TB s⁻¹ [6].
The current paradigm of collecting all raw data in a centralized node requires expensive hardware and significant power, and has a large environmental impact. A potential solution is to move the computational units closer to the edge of the system, either in adjacent hardware or directly embedded within the detector control and acquisition circuits, a method called edge computing [7]. By placing computing resources and data storage at the system's edge, edge computing reduces system latency while enabling real-time analytics and reducing operational costs [8,9].
Edge computing allows only limited computing resources due to power and physical constraints, which may not be enough for analyses requiring complex classical algorithms. Training machine learning (ML) algorithms to model these complex algorithms can achieve the same behavior with less computing complexity. Combining edge computing and ML is called edge ML (EdgeML). EdgeML offers reduced latency and bandwidth requirements. Figure 1 presents where EdgeML fits in an edge computing paradigm and how it differs from a standard data acquisition (DAQ) system [10].
As can be seen, ML on the edge can provide intelligent, low-latency feedback to the detector and the radiation source, enabling parameter adjustments. The amount of stored data in the EdgeML DAQ paradigm is also significantly lower compared to standard DAQ. ML algorithms can be implemented in hardware to increase pipelining and parallelization. While several processing units and digital circuits are available for general tasks and ML purposes, only a few are suitable for use near the edge.
Choosing the correct processing units for EdgeML applications is crucial because real-time DAQ requires very low latency. Moreover, low power consumption, low inference latency, low unit cost, and a high integration level are the next crucial factors for a good EdgeML system. Algorithms can be implemented in hardware to increase pipelining and parallelization [11]. Application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) are two types of integrated circuits that can run with lower latency than micro-controllers and common processors such as central processing units and graphics processing units [12]. ASICs and FPGAs also achieve inherent parallelism through their optimized architecture, allowing for the execution of multiple tasks or operations simultaneously. Moreover, low power usage and a large number of input/output (I/O) ports for high-throughput communication also make ASICs and FPGAs two excellent choices for EdgeML applications [13]. The architectures of ML algorithms and these digital circuits are a great match, since ML algorithms generally use arithmetic that is simple for digital circuits to execute, such as additions and multiplications [14]. However, ML algorithms are progressing rapidly, and ASICs require long development cycles, making FPGAs a better choice for prototyping and low-cost EdgeML applications.
Generally, there is ongoing research to integrate FPGA-based EdgeML models in high data-rate instrumentation, particularly for online event selection and DAQ paradigms at the edge of the system [15]. In [16], an FPGA-based ML event classification for custom electronics-based trigger systems in high energy physics is introduced, where the lowest latency for real-time event classification is required. Additionally, the authors in [17] have presented an FPGA-embedded system for ML-based tracking and triggering in the electron-ion collider experiment. Moreover, fast muon tracking with ML implemented in an FPGA for the first-level trigger at an LHC experiment is presented in [18].
Although FPGAs are a great choice for high data-rate instrumentation, implementing ML models on FPGAs requires a high level of expertise and knowledge in hardware design. In this paper, we first explore the available tool-flows for mapping ML algorithms onto FPGAs, near the edge of the sensors. The main focus is on tool-flows that provide a wide range of supported models, optimization techniques, and lower latency, in order to find the most suitable one for instrumentation. After finding the most suitable one, we use it to implement different ML models on FPGAs focused on two high data-rate instrumentation applications: (1) CookieBox, and (2) billion-pixel camera. We design our ML models, translate them to hardware code, and implement the models for both applications on the Zynq UltraScale+ MPSoC ZCU104 evaluation kit as the target board.
The rest of the paper is organized as follows: section 2 compares the current methods for implementing ML on FPGA and explains the available ML to FPGA tool-flows. Section 3 describes the methodology and the experiment setup that we use to compare the tool-flow performances on FPGA. The simulation and hardware implementation results are presented in section 4. Finally, we discuss the performance of available tool-flows for instrumentation and conclude the paper in sections 5 and 6, respectively.

Background
The usual FPGA programming languages are hardware description languages (HDLs) such as Verilog and very high-speed integrated circuit hardware description language (VHDL). ML model hardware implementations in real-time systems are complicated, multi-step endeavors. Translating a complex ML algorithm to HDL requires sufficient hardware knowledge and is time-consuming.
High-level synthesis (HLS) is a relatively new alternative for developing FPGA applications. HLS allows software engineers to design applications for FPGA and ASIC platforms using more common programming languages, namely C and C++ [19]. HLS tools can be used to implement ML algorithms with much less development time and effort, but they may bring some compromise on performance compared to direct HDL implementation [20,21]. Several groups want to take automation a step further and create a tool-flow that covers the entire implementation chain from inference model to hardware implementation. These tool-flows use an ML model from common libraries such as TensorFlow, Keras, and PyTorch, enabling FPGA implementation with no hardware knowledge. Table 1 provides a thorough compilation of the presently accessible tool-flows for ML to FPGA applications. It includes information about the supported models, ML environment, release date, and the supported ML model compression methods. The following paragraphs highlight a few of these tool-flows.
HLS4ML [22,46] is an open-source Python package for ML inference in FPGAs. It first converts a Keras or PyTorch model to an HLS model and maps it to the corresponding HDL code. HLS4ML was initially designed for microsecond latency applications like the CERN LHC [47]. The HLS code generated by HLS4ML can also be used for ASIC design [48,49]. HLS4ML offers many configuration settings such as I/O type, reuse factor, precision, and different implementation strategies. The reuse factor is a parameter of HLS4ML that determines how many times each FPGA multiplier will be used, directly impacting the model latency.
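The trade-off behind the reuse factor can be illustrated with a rough resource model. The sketch below assumes an idealized dense layer in which every product is mapped onto a multiplier that is shared by `reuse_factor` operations. The function name `multipliers_needed` and the layer sizes are hypothetical, and real HLS4ML results also depend on precision, strategy, and synthesis optimizations.

```python
import math


def multipliers_needed(n_in: int, n_out: int, reuse_factor: int) -> int:
    """Idealized multiplier count for a dense layer with n_in * n_out products.

    Assumption: each physical multiplier is time-shared by `reuse_factor`
    products, so a larger reuse factor trades latency for fewer multipliers.
    """
    if reuse_factor < 1:
        raise ValueError("reuse_factor must be >= 1")
    return math.ceil(n_in * n_out / reuse_factor)


# Hypothetical 16-input, 64-output dense layer (sizes not taken from the paper).
for rf in (1, 4, 64):
    print(f"reuse factor {rf}: {multipliers_needed(16, 64, rf)} multipliers")
```

In this simplified picture, a reuse factor of 1 corresponds to a fully parallel layer, while larger values serialize the multiplications over more clock cycles.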
FINN [23] is a framework for building fast and flexible FPGA accelerators using a heterogeneous streaming architecture. The FINN framework targets binarized neural networks (BNNs) and highly quantized neural networks (NNs) for small boards. FINN converts each layer to an HLS design, and subsequently stitches these sub-components together to make the whole network. Custom models can also be imported from an Open Neural Network Exchange (ONNX) model by calling FINN from a Python script. Compared to HLS4ML, FINN offers less customization: only the target clock, target throughput, and quantization can be changed.
It is worth mentioning that the HLS4ML and FINN teams are working together on a more unified test flow so that both Keras and PyTorch models can be translated into quantized ONNX, as shown in figure 2 [50]. This will make the interaction of HLS4ML and FINN much easier in the future.
The Vitis AI [24] platform is a comprehensive artificial intelligence (AI) inference development solution for Xilinx devices and Alveo Data Center acceleration cards. Vitis AI is a proprietary configurable intellectual property (IP) core with internal parallelism [48]. Some commonly used models supported by Vitis AI, such as ImageNet networks and some object detection networks, are provided in the Xilinx Model Zoo [51]. Vitis AI also supports custom models, so users can supply their own NN models.
Versatile tensor accelerator (VTA) [25] is an open, generic, and customizable deep learning accelerator with a complete Apache tensor virtual machine-based compiler stack. Generally, a VTA instance consists of a vector-matrix core and an arithmetic logic unit core, supporting operations on matrix operands. VTA targets NN architectures similar to ResNet and MobileNet.
MATLAB deep learning processor (DLP) [26] is a subset of the commercial MATLAB suite and a tool-flow supporting full ML model compilation, including quantization. It can target any platform compatible with the MATLAB HDL Coder [52], such as Xilinx's Zynq and Zynq UltraScale+ platforms. MATLAB DLP has its own front end, but it can also import NN models from currently available libraries such as PyTorch. It also supports ONNX, making it interoperable with other NN libraries and other tool-flows.
OpenVINO provides a set of tools and libraries for optimizing and deploying deep learning models on various Intel hardware platforms, including FPGAs [27]. Although OpenVINO is not designed specifically for FPGAs, it has the same functionality as MATLAB DLP and Vitis AI for FPGA accelerators. It provides boosted deep learning performance for vision, audio, and other models from popular frameworks like TensorFlow and PyTorch. It also supports different quantization and optimization techniques but only supports a limited number of Intel FPGA boards.
OpenHLS is a lightweight compiler framework that uses a combination of compiler and HLS techniques to compile an entire deep NN into a fully scheduled register-transfer level design [28]. Its architecture is similar to HLS4ML and FINN, but it focuses on convolutional NNs (CNNs) in particular, while using the low level virtual machine (LLVM) as its core compiler [53].
The tool-flows summarized above are the most actively developed ones, based on the level of activity on their GitHub repositories, and offer better configuration and optimization support than other, less active tools. Other available tool-flows have relatively less community support and fewer features compared to the first seven tool-flows of table 1. Among the tool-flows explained, only HLS4ML and FINN fully support both fully connected NN (FCNN) and CNN layers with no board support limitation. Therefore, we see potential in these tool-flows to implement fully customized models on FPGAs. Additionally, researchers can migrate from FPGA to ASIC for fixed applications if the tool-flow provides HLS/HDL codes, which HLS4ML and FINN do. Therefore, HLS4ML and FINN are chosen as potential candidates for EdgeML high data-rate instrumentation.
In the rest of this paper, we investigate HLS4ML and FINN, compare them, and explore their different configuration settings to find the optimal configuration. Our objective is to find the optimal tool-flow considering the latency and use the best one for two high data-rate instrumentation applications: (1) CookieBox and (2) billion-pixel camera. We design and translate ML models to hardware code, and implement the models for both applications on the Zynq UltraScale+ MPSoC ZCU104 evaluation kit as the target board.

Tool-flows comparison
We first selected two different training data sets and designed NN models for these applications, ensuring that they fit on our board. For each application we designed an FCNN model and a CNN model.
The first model (FCNN) has three hidden layers and a total size of 3171 parameters. The training data set for the FCNN model is UNSW-NB15, a large data set created to provide comprehensive network-based data that reflects modern network traffic scenarios [54].
The second model (CNN) has three convolutional layers, two dense layers, and a total size of 4460 parameters. The training data set for the CNN model is the Street View House Numbers (SVHN) data set, which is similar in flavor to MNIST and contains over 600 000 labeled images [55].
Table 2 provides an overview of the key characteristics of both NN models for testing tool-flows. The architectures of both models are presented in figure 3. Since FINN is only focused on highly quantized models, we also used QKeras in the HLS4ML front-end to quantize the model for a better side-by-side comparison with FINN. Unfortunately, due to a software limitation in Vivado, we were unable to compare the FCNN and CNN models using the same data sets in the first experiment. Vivado restricts the number of parameters per layer to 4096, which would be exceeded if we were to implement an FCNN model for the SVHN data set, due to the large input shape.
Once we have selected the NN models and used HLS4ML and FINN to generate their corresponding HLS code, we extract the corresponding IP block of the model, bring it into the Vivado design suite, and complete the final block design to test and implement on the board. The overall design flow is presented in figure 4. In the final block design, it is crucial to appropriately connect the NN block to various components. These components include a memory controller, a processor, and a counter. The counter counts the number of clock cycles that it takes to complete the inference of the ML block, which indicates the latency of the ML block. The latency has no variability and depends only on the architecture of the trained model. We use the advanced extensible interface (AXI) developed by ARM as the communication bus protocol in the block design. We chose the Zynq UltraScale+ MPSoC ZCU104 evaluation kit as the target board, since it has an ARM processor and sufficient resources for our ML models. The processor is not necessary but facilitates the testing process. We created our own ZCU104 block design for HLS4ML, since it is not fully supported by [...]. The NN models implemented by HLS4ML were trained on a PC using the Keras library, since Keras is fully supported as an HLS4ML front-end. The same NN models are implemented by FINN using PyTorch, since FINN does not support Keras at this time.
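The counter-based latency measurement reduces to a simple conversion from clock cycles to time. The following is a minimal sketch of that arithmetic, assuming the 100 MHz target clock used for the experiments; the function name `cycles_to_latency_us` and the cycle count in the example are hypothetical and not measured values.

```python
def cycles_to_latency_us(cycles: int, clock_hz: float = 100e6) -> float:
    """Convert a cycle count read from the on-chip counter to microseconds."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    # One cycle lasts 1 / clock_hz seconds; scale the total to microseconds.
    return cycles * 1e6 / clock_hz


# Hypothetical counter reading; at 100 MHz each cycle lasts 10 ns.
example_cycles = 190
print(f"{example_cycles} cycles -> {cycles_to_latency_us(example_cycles):.2f} us")
```

Because the latency has no variability, a single counter reading per model is enough in this scheme.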

Instrumentation applications
After comparing HLS4ML and FINN using the mentioned models and selecting the one with the lowest latency, we apply a similar process to two real high data-rate instrumentation applications: the CookieBox and the billion-pixel camera [6,57]. The objective of this experiment is to determine whether the performance of the tool-flows is sufficient for these applications. We again compare both FCNN and CNN models to find the optimal ML configuration. The first application is the CookieBox, an angular streaking detector used as an online x-ray beam diagnostic tool in the LCLS-II project at SLAC [6,58]. LCLS-II operates at a repetition rate of 1 MHz, resulting in a massive amount of data exceeding terabytes per second. To handle this data overload, a strategy is employed to veto certain pulses. The CookieBox detector is used to make these veto decisions. The purpose of the ML model designed for the CookieBox is to classify different x-ray beam shots and veto the unnecessary ones [59].
The second application is the billion-pixel camera, an x-ray camera for synchrotron and x-ray free-electron laser experiments. The billion-pixel camera will generate 1000 to 10 000 images per second, and each image is around 1-2 GB in size. Accordingly, the billion-pixel camera will generate between 1 TB s⁻¹ and 10+ TB s⁻¹ of data [60]. The purpose of the ML model designed for the billion-pixel camera is to compress input images by finding sparse representations of the camera's images, followed by quantization and entropy coding for data compression [61]. It is worth mentioning that we needed to add a custom layer to the HLS4ML back-end for the billion-pixel camera experiment. We had to use a parametric softshrink activation for the billion-pixel camera ML model, which is not supported by the default HLS4ML layers. The softshrink activation with threshold $\lambda$ is defined in its standard form as:

$$\mathrm{softshrink}(x) = \begin{cases} x - \lambda, & x > \lambda \\ x + \lambda, & x < -\lambda \\ 0, & \text{otherwise} \end{cases}$$

The softshrink activation helps increase the code sparsity, which we measure as the number of zero-value elements in the encoding divided by the total element count. Since the function is not a supported HLS4ML layer, we modified the back-end accordingly in order to convert the model. The NN model training setup for both applications is presented in table 3. The architectures of both models are also presented in figures 6 and 7, respectively.
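The link between softshrink and code sparsity can be illustrated in a few lines. This is a minimal sketch using the standard softshrink definition with a fixed threshold `lam`; in a parametric variant the threshold would presumably be learned during training, which is an assumption here. The function names and the sample values are hypothetical.

```python
from typing import Iterable


def softshrink(x: float, lam: float = 0.5) -> float:
    """Standard softshrink: shrink x toward zero by lam, zeroing small values."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def sparsity(code: Iterable[float]) -> float:
    """Number of zero-valued elements divided by the total element count."""
    values = list(code)
    if not values:
        raise ValueError("code must not be empty")
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical pre-activation values, not taken from the camera model.
raw = [-1.2, -0.3, 0.0, 0.4, 2.0]
encoded = [softshrink(v, lam=0.5) for v in raw]
print(encoded, sparsity(encoded))
```

Every value whose magnitude is at most the threshold becomes exactly zero, so a larger threshold yields a sparser code at the cost of discarding more small coefficients.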

Tool-flow comparison
In the first experiment, our objective was to determine the tool-flow with the best latency for high data-rate instrumentation. We first use the high-throughput configuration of FINN, which focuses on the lowest latency (highest throughput). The configuration of HLS4ML for the first experiment is set to a reuse factor of 1, the latency strategy, and a parallel data structure.
For the second experiment, we use the base configuration of FINN, which focuses on minimizing resource usage. The configuration of HLS4ML for the second experiment is set to a reuse factor of 64, the resource strategy, and a stream data structure. Finally, the target clock frequency for all experiments is set to 100 MHz. The results for both experiments are shown in tables 4 and 5, respectively. We also compare the implementations of the mentioned models using HLS4ML and FINN with the available related works in both tables. We have chosen related works that concentrate on both latency and resource implementation to draw meaningful comparisons with our implementation. In table 4, we select the models with the lowest latency from the related works for comparison. For a fair comparison in table 5, we opt for more compressed models from the related works, as we are emphasizing lower resource utilization in the second experiment.
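The two HLS4ML settings described above can be summarized as plain Python data. This is only a sketch: the keys `ReuseFactor` and `Strategy` and the I/O type names `io_parallel` and `io_stream` follow the naming used in HLS4ML's configuration, but mapping the "parallel" and "stream" data structures onto those I/O types is an assumption, the helper `make_experiment_config` is hypothetical, and no precision is listed because it is not stated for these experiments.

```python
def make_experiment_config(reuse_factor: int, strategy: str, io_type: str,
                           clock_mhz: float = 100.0) -> dict:
    """Bundle the settings of one experiment into a dictionary (illustrative only)."""
    if strategy not in ("Latency", "Resource"):
        raise ValueError("strategy must be 'Latency' or 'Resource'")
    if io_type not in ("io_parallel", "io_stream"):
        raise ValueError("io_type must be 'io_parallel' or 'io_stream'")
    return {
        # Model-level settings in the style of an HLS4ML configuration dictionary.
        "hls_config": {"Model": {"ReuseFactor": reuse_factor, "Strategy": strategy}},
        "io_type": io_type,
        # 100 MHz target clock corresponds to a 10 ns clock period.
        "clock_period_ns": 1000.0 / clock_mhz,
    }


latency_run = make_experiment_config(1, "Latency", "io_parallel")
resource_run = make_experiment_config(64, "Resource", "io_stream")
print(latency_run)
print(resource_run)
```

Keeping both configurations side by side makes it explicit that only the reuse factor, strategy, and data structure change between the two experiments, while the clock target stays fixed.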
As can be seen, the latency of HLS4ML is lower than that of FINN in both experiments, since HLS4ML utilizes several HLS pragmas, such as loop unrolling, in the final HLS code, resulting in low latency. By switching the implementation strategy from latency optimization to resource minimization in the HLS4ML configuration, resource usage decreases at the cost of increased latency. FINN performs better when it comes to resource utilization, especially digital signal processor (DSP) usage. FINN shows efficient power usage, as expected, since it primarily targets smaller boards and models. It is important to note that the power values in the tables are those reported by Xilinx Vivado after post-layout implementation. Both tool-flows demonstrate good performance compared to the related works. Although [22] demonstrates a better latency, it uses a higher clock frequency and a bigger board compared to the others, and also uses many more resources. The resource utilization of HLS4ML and FINN is also relatively better, considering our target is a smaller board (except for [48]). Additionally, power usage data for the other works is not available, so a direct power comparison was not feasible. In summary, HLS4ML outperforms FINN in terms of latency, which is the most crucial factor for high data-rate instrumentation applications, as previously mentioned. Consequently, we move forward with HLS4ML and test real instrumentation models with various configurations.

Instrumentation applications
First, we implemented an FCNN model for both the CookieBox and the billion-pixel camera, focusing on both latency and resource utilization in separate tests. The results are shown in table 6. In this table, the resource utilization percentage indicates the share of each FPGA resource that is used.
As mentioned earlier, the NN model used for the billion-pixel camera is larger than the one used for the CookieBox. This results in higher resource utilization and power usage. Although the FCNN model for the billion-pixel camera is larger, it has a smaller input shape. This leads to lower latency compared to the model designed for the CookieBox. By examining the waveforms in Vivado simulations, we noticed that most of the inference time is spent by the HLS4ML blocks fetching the input data. Consequently, a smaller input data shape results in lower latency. Moreover, the latency results with the resource implementation strategy are slightly higher, but we can achieve lower resource utilization, which is ideal for smaller boards. We repeat both tests with a CNN model, as presented in table 7. Like the FCNN results, the latency of the billion-pixel camera is lower than that of the CookieBox for the resource-optimal case. However, the billion-pixel camera latency is higher for the best-latency case. The main reason is that, due to the custom layer used for the billion-pixel camera application, the latency strategy did not give usable results. Instead, we used the resource strategy with a reuse factor of 1. The bigger model of the billion-pixel camera causes higher resource utilization. CNN models are usually big and difficult to fit on smaller boards. However, with the HLS4ML resource strategy, it is feasible to fit CNN models on a board with much lower resource utilization.
Although the latency results in both tables 6 and 7 are the lowest achieved with HLS4ML, these results are for uncompressed ML models. To compress the model size, we used different quantization settings with the HLS4ML configuration best suited to latency (reuse factor 1 and latency strategy) to find the best configuration for the CookieBox and the billion-pixel camera. We additionally decrease the bit depth of various models, which corresponds to the input image size and the complexity of the model input. The implementation results for different quantization bit depths for the CookieBox application are presented in table 8. We were able to achieve a latency of 1.9 µs with similar accuracy. Moreover, reduced bit depth also shrinks the model size and lowers resource utilization. Thus, by using a lower bit depth in HLS4ML, the final model can be implemented on a small board with excellent latency. According to table 8, choosing higher bit depths and CNN models leads to higher latency and resource utilization.
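The effect of bit depth on numerical resolution can be sketched with a generic signed fixed-point quantizer. The split between integer and fractional bits, the round-to-nearest behavior, and the saturation on overflow are all assumptions made for illustration, since the exact quantization scheme is not detailed here; the function name `quantize_fixed` is hypothetical.

```python
def quantize_fixed(x: float, total_bits: int, int_bits: int) -> float:
    """Quantize x to a signed fixed-point grid with saturation.

    Assumptions: `int_bits` includes the sign bit, rounding is to the nearest
    grid point (Python's round, ties to even), and overflow saturates.
    """
    if not 0 < int_bits <= total_bits:
        raise ValueError("need 0 < int_bits <= total_bits")
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits                 # smallest representable increment
    lowest = -(2.0 ** (int_bits - 1))        # most negative representable value
    highest = 2.0 ** (int_bits - 1) - step   # most positive representable value
    quantized = round(x / step) * step
    return min(max(quantized, lowest), highest)


# Hypothetical weight value quantized at two bit depths with 2 integer bits.
weight = 0.3
for bits in (8, 4):
    print(bits, quantize_fixed(weight, total_bits=bits, int_bits=2))
```

Fewer bits give a coarser grid, which is why lower bit depths reduce resource usage but can cost accuracy.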
The results for the FCNN with 7 bit depth are not available because of a software limitation in Vivado, which limits the number of parameters per layer to 4096. The 7 bit FCNN exceeds that limit when using a 16 × 128 flattened image input size.
The same experiment is repeated for the billion-pixel camera application in table 9. Here, the NN aims to find sparse representations of large gray-scale images. Because this is an image encoding and decoding process, there is no accuracy metric to rely on. Instead, we use the sparsity rate and the peak signal-to-noise ratio (PSNR) to judge the quality of the encoding and decoding, respectively, where higher values indicate better NN performance. Similar to the CookieBox, we test multiple quantization bit depths. The 8 bit depth is a reference value due to its good latency, sparsity, PSNR, and low resource utilization. We test both a lower bit depth for lower latency and a higher bit depth to confirm the effect of quantization on resource utilization. The relation between quantization and FPGA implementation resource utilization for the CookieBox and billion-pixel camera applications is also illustrated in figures 8 and 9, respectively. The provided information includes latency calculations in microseconds, along with the usage ratio for DSP, look-up table,
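For reference, the PSNR metric used to judge decoding quality can be computed as in the sketch below. It uses the standard definition, 10·log10(MAX² / MSE), and assumes 8-bit gray-scale pixels (maximum value 255); that maximum, the function name `psnr`, and the pixel values are assumptions for illustration only.

```python
import math
from typing import Sequence


def psnr(original: Sequence[float], reconstructed: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no reconstruction error
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical four-pixel image and its decoded version.
image = [0, 128, 255, 64]
decoded = [0, 130, 250, 64]
print(f"PSNR: {psnr(image, decoded):.2f} dB")
```

A higher PSNR means the decoded image is closer to the original, which is why it complements the sparsity rate when comparing bit depths.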

Discussion
It is evident that selecting a higher bit depth leads to improved accuracy in the final FPGA implementation. However, increasing the quantization bits, especially in CNNs, substantially increases resource utilization and latency, although the relationship between model size and resource utilization is not exactly linear. Choosing a high-precision model for FCNN with our current strategy is limited by the cap of 4096 parameters per layer. To address this issue, one approach is to use a smaller model. Additionally, Vivado's forthcoming features and updates for larger boards may provide a solution to this challenge in the future. Most existing ML to FPGA tool-flows are designed for CNNs rather than other neural network architectures. CNNs are widely used in several image applications but are harder to fit on FPGAs due to the large number of operations in the convolutional layers. However, for high data-rate instrumentation applications, FCNN models are often sufficient. These models are fully supported by only two available tool-flows: HLS4ML and FINN. Moreover, after testing an EdgeML application with an FPGA, researchers may migrate to ASICs for the final fixed application. This is not possible with the tool-flows that do not provide the HLS/HDL codes, whereas moving to ASIC using HLS4ML and FINN is feasible. All in all, we chose HLS4ML and FINN as potential candidates in EdgeML instrumentation applications.
We first showed that FINN is generally a better choice for smaller boards and that HLS4ML performs better in terms of latency, which makes it a great candidate for high data-rate instrumentation applications. As shown in the results for the instrumentation applications, we were able to achieve excellent results with different quantized models using HLS4ML for the CookieBox and the billion-pixel camera. As mentioned earlier, although EdgeML models have been utilized for instrumentation, there has been limited prior research focusing on the CookieBox and the billion-pixel camera. An HDL-based ML model for the CookieBox was used in [58], demonstrating a 20 µs latency, which is higher than that of this paper's implementation using HLS4ML. In addition, [61] demonstrates a remarkable 100:1 compression ratio and a 99% code sparsity for the billion-pixel camera with a minimal latency of 0.89 µs on FPGA, using the HLS4ML tool-flow.
Furthermore, we examined the latency of different models and observed that a significant portion of the latency arises from fetching the input data, rather than processing it. This explains why the billion-pixel camera model runs with lower latency, as it has a smaller input data shape. This situation could be improved by increasing the input bus limit, resulting in a significant decrease in latency. The reason HLS4ML and FINN might not allow this could be their reliance on main target boards with processors, which imposes limits on increasing the input bus. Additionally, we noticed that the ML blocks generated by HLS4ML are not fully pipelined. This is not a problem for the block latency but limits the throughput. Pipelining can enhance the throughput of all models in the ML block.
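The distinction between latency and throughput can be made concrete with the initiation interval, that is, the number of clock cycles between the starts of two consecutive inferences. The sketch below assumes that a non-pipelined block accepts a new input only after the previous inference finishes, while an ideally pipelined block accepts one input per cycle; the function name and the cycle count are hypothetical.

```python
def inferences_per_second(clock_hz: float, initiation_interval_cycles: int) -> float:
    """Throughput of a block that accepts a new input every N clock cycles."""
    if clock_hz <= 0 or initiation_interval_cycles < 1:
        raise ValueError("clock_hz must be positive and the interval >= 1 cycle")
    return clock_hz / initiation_interval_cycles


clock_hz = 100e6       # 100 MHz target clock
latency_cycles = 190   # hypothetical block latency in cycles

# Non-pipelined: the initiation interval equals the full latency.
non_pipelined = inferences_per_second(clock_hz, latency_cycles)
# Ideally pipelined: one new input per cycle, same latency per inference.
fully_pipelined = inferences_per_second(clock_hz, 1)

print(f"non-pipelined:   {non_pipelined:,.0f} inferences/s")
print(f"fully pipelined: {fully_pipelined:,.0f} inferences/s")
```

In both cases each individual inference still takes the same number of cycles, which is why missing pipelining affects throughput but not the block latency.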
It is worth mentioning that although HLS4ML supports a wide selection of layers, it is not straightforward to change its back-end code and add custom layers. We had to add a custom layer for the billion-pixel camera application, but the new custom layer behaved unexpectedly in some cases, specifically with the CNN model, where the latency strategy did not produce meaningful results. The use of the resource strategy explains the high latency for the billion-pixel camera's best-latency case in table 7, as well as the low resource usage for the CNN in table 9, especially considering DSPs. The incompatibility between the latency strategy and the custom layer may be fixed in future updates.

Conclusion
In this paper, we explained the new developments in high data-rate instrumentation and why they need low-latency solutions such as EdgeML. We presented an exhaustive exploration of currently available tool-flows for EdgeML on FPGA with a focus on their usability for scientific high data-rate instrumentation applications. Our selection focused on those tool-flows that offer an attractive variety of supported networks, optimization, compression, platforms, and accessibility. After comparing and testing ML to FPGA tool-flows, we found that the best choice for practical low-latency instrumentation, especially high data-rate instrumentation applications, is HLS4ML due to its numerous optimizations, configuration options, and the possibility of being used for ASICs. For lower-resource platforms and smaller FPGAs, FINN is a more suitable tool-flow since it is mainly focused on small and highly quantized NN models. HLS4ML and FINN differ in their implementation strategies: HLS4ML demonstrates the potential for low-latency ML applications, and FINN minimizes resource and power usage. Here, we see HLS4ML as an excellent candidate for further research in instrumentation, as we tested it for two different high data-rate instrumentation applications: (1) CookieBox and (2) billion-pixel camera. In the short term, we plan to implement EdgeML models using these tool-flows near a detector emulator, such as an arbitrary waveform generator. Subsequently, this work will guide future FPGA implementations as a part of an EdgeML-based real-time analysis of high-velocity data in large experiments.

Figure 1. Comparison of classic data acquisition system and EdgeML data acquisition system.

Figure 3. FCNN and CNN model architectures for testing tool-flows.

Figure 4. The test flow for evaluating the tool-flows.

Figure 8. Relation between bit width and implementation results for the CookieBox model.

Figure 9. Relation between bit width and implementation results for the billion-pixel camera model.

Table 1. An overview of the available ML to FPGA tool-flows.
a Fully connected neural network.
b Convolutional neural network.
c Recurrent neural network.
d OpenVINO is the only tool-flow that targets Intel FPGAs. Other tool-flows target AMD Xilinx FPGAs.

Table 2. Neural network characteristics for testing tool-flows.

Table 3. Neural network characteristics for testing instrumentation applications.
a Sparse categorical cross-entropy (SCC).
b Mean square error + regularization term.

Table 4. Implementation results for both fully connected (FCNN) and convolutional (CNN) models targeting a lower latency.

Table 5. Implementation results for both fully connected (FCNN) and convolutional (CNN) models targeting a lower resource utilization.

Table 6. Instrumentation experiment (CookieBox and billion-pixel camera) results with an FCNN model.

Table 7. Instrumentation experiment (CookieBox and billion-pixel camera) results with a CNN model.

Table 9. Billion-pixel camera model quantization results.