Fast Neural Network Inference on FPGAs for Triggering on Long-Lived Particles at Colliders

Experimental particle physics demands a sophisticated trigger and acquisition system capable of efficiently retaining the collisions of interest for further investigation. Heterogeneous computing employing FPGA cards may emerge as a key technology for the triggering strategy of the upcoming high-luminosity program of the Large Hadron Collider at CERN. In this context, we present two machine-learning algorithms for selecting events where neutral long-lived particles decay within the detector volume, and we study their accuracy and inference time when accelerated on commercially available Xilinx FPGA accelerator cards. The inference time is also compared with that of CPU- and GPU-based hardware setups. The proposed algorithms prove efficient for the considered benchmark physics scenario, and their accuracy is found not to degrade when accelerated on the FPGA cards. The results indicate that all tested architectures fit within the latency requirements of a second-level trigger farm and that exploiting accelerator technologies for real-time processing of particle-physics collisions is a promising research field that deserves additional investigation, in particular with machine-learning models with a large number of trainable parameters.


Introduction
A crucial aspect of particle physics experiments at colliders is the trigger and data acquisition system. Efficiently collecting the products of the collisions that result in interesting physics processes is a challenging task, owing both to the complexity and sparsity of the detector data to be analysed and to the stringent latency requirements imposed by the high collision frequency.
Both the ATLAS and CMS experiments [1,2], the two multi-purpose particle-physics detectors with cylindrical geometry currently running at the Large Hadron Collider (LHC) at CERN [3], employ a two-tier trigger system for selecting the products of the proton-proton collisions, so-called events, for storage and analysis [4,5].
The initial 40 MHz rate of proton-proton collisions produced by the LHC is first reduced to O(100 kHz) by a hardware-based Level-1 (L1) trigger system, and then further reduced to O(1 kHz) by a software High Level Trigger (HLT). Triggering events is therefore an optimisation problem: how to maximise the variety and richness of the physics program within the limitations in terms of latency, throughput, data transfer, and storage capabilities. The selection at L1 must occur with a latency of O($10^{-1}$ to $10^{0}$ µs) and is obtained by using low-resolution detector information. The selection at the HLT, instead, is based on software running on a commercial CPU-based farm and, with access to more granular detector information, needs to occur with typical latencies of O($10^{-1}$ to $10^{0}$ s).
With the upcoming high-luminosity phase of the LHC (HL-LHC) [6], the design of the trigger and data acquisition system needs to cope with the higher occupancy and the higher number of readout channels of the upgraded detectors. The advancement in single-processor computing performance is not adequate, and more modern solutions based on heterogeneous computing may offer an interesting avenue of exploration [7,8]. In particular, the works presented in [9,10] suggest that FPGA-accelerated inference of machine-learning algorithms is a promising option for particle physics experiments, requiring minimal modifications to the current computing models.
In this context, we study the possibility of implementing algorithms based on deep neural networks for the event selection at the HLT, and of using commercial accelerator boards based on FPGA processors to improve the performance in terms of processing time and throughput. FPGAs are reconfigurable hardware architectures which can be adapted for specific tasks and are traditionally programmed using hardware description languages like VHDL or Verilog. In recent years several tools and libraries have been developed to facilitate the implementation and deployment of both traditional and machine learning algorithms on FPGAs. Xilinx [11], for example, has released Vitis-AI [12], an AI-inference development platform for AMD devices, boards, and Alveo data center acceleration cards. Similarly, Intel has developed the FPGA AI Suite based on OpenVINO [13].
In this work we construct and characterize deep neural networks targeting the selection of events where neutral long-lived particles decay within the detector volume. We present the design and the results of the implementation in a working engineering pipeline that goes from the pre-processing of the input data, through the training of the deep neural network-based model, to the optimization and deployment on two Xilinx FPGA accelerators, the Alveo U50 and the Alveo U250, all based on publicly available libraries. Two approaches, based on a deep convolutional neural network and on an autoencoder, are developed and presented. A comparison of the performance of the deployed algorithms on CPU, GPU and FPGA accelerators is also shown. We stress the complementarity of this approach, also in terms of development and maintenance of the needed libraries, with respect to the ongoing work on deploying neural networks on FPGA boards with a latency compatible with the selection occurring at L1, where a dedicated software library, hls4ml, is being developed [14,15] and dedicated implementations have recently been proposed [16].
The paper is organized as follows. In Section 2 we describe the physics benchmark and the dataset. In Section 3 we introduce the trigger strategies we have tested and the associated algorithms: a convolutional neural network (CNN) and an autoencoder (AE) architecture. In Section 4 we present and discuss the results. Finally, we provide our concluding remarks in Section 5.
The dataset used for the presented results is made available on Zenodo at the link in Ref. [17]. The code for constructing the algorithms, converting the models and evaluating their performance is available on request by contacting the authors.
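The rate reduction quoted above can be turned into rejection factors with a few lines of arithmetic. The sketch below uses only the order-of-magnitude rates given in the text; the function and constant names are illustrative and not part of any experiment's software.

```python
# Order-of-magnitude trigger rates quoted in the text, in Hz.
COLLISION_RATE_HZ = 40e6    # proton-proton collisions delivered by the LHC
L1_OUTPUT_RATE_HZ = 100e3   # O(100 kHz) after the Level-1 trigger
HLT_OUTPUT_RATE_HZ = 1e3    # O(1 kHz) after the High Level Trigger


def rejection_factor(rate_in_hz: float, rate_out_hz: float) -> float:
    """Number of incoming events per event accepted by a trigger level."""
    if rate_out_hz <= 0:
        raise ValueError("output rate must be positive")
    return rate_in_hz / rate_out_hz


l1_rejection = rejection_factor(COLLISION_RATE_HZ, L1_OUTPUT_RATE_HZ)
hlt_rejection = rejection_factor(L1_OUTPUT_RATE_HZ, HLT_OUTPUT_RATE_HZ)
total_rejection = l1_rejection * hlt_rejection

print(f"L1 keeps about 1 event in {l1_rejection:.0f}")
print(f"HLT keeps about 1 event in {hlt_rejection:.0f}")
print(f"Overall about 1 event in {total_rejection:.0f} is stored")
```

With these inputs the HLT has to reject roughly two orders of magnitude of the events accepted by L1, which is the regime in which the algorithms studied here would operate.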

Physics benchmark and datasets
The Standard Model (SM) of particle physics provides an excellent description of all observed phenomena up to the energies presently explored. A variety of beyond-the-SM scenarios have been proposed in the literature to address open questions such as naturalness, baryogenesis, dark matter and the origin of neutrino masses, and new long-lived particles (LLPs) are often present in these scenarios [18].
Among the models with neutral LLPs, the ones in Refs. [19,20,21,22] are of particular interest because of the unconventional phenomenology they predict in the collisions at the LHC. The peculiarity of the reconstructed final states, containing collimated signatures of pairs of leptons and/or light hadrons with no detector activity connecting such signatures with the interaction point of the colliding protons, resulted in the development of dedicated signature-driven triggers to enable searches for these models at the LHC and HL-LHC [23,24,25,26].
This paper focuses on the identification of a neutral LLP decay with the data collected by the muon spectrometer (MS) of a typical experiment at the LHC. In this way, the search sensitivity, both in terms of geometrical acceptance and of reduced backgrounds, is typically maximised. A toy simulation of the monitored drift tube (MDT) detector and of the superconducting toroidal magnetic field of the ATLAS experiment is developed, together with the physics benchmark of a neutral LLP decaying to charged particles. In particular, the generation, simulation and reconstruction chain can be summarised as follows: (i) generation of a neutral LLP and of its charged decay products in the MDT detector volume and within the magnetic field; (ii) simulation of the detection of the decay products through the formation of hits in the MDT chambers; (iii) estimation of the experimental effects on the hit positions using the resolutions of the ATLAS MDT detector as in [27]; (iv) addition of detector noise and background accounting for the measured average rate during the data-taking of the LHC [28].
The simulated experimental conditions are considered sufficiently detailed for the scope of this article, which is to demonstrate, as a proof of principle, the benefits of inference acceleration in the context of triggering applications for particle-physics experiments.
Physics processes are simulated with a number of charged decay products ranging from two to ten, representative of the cases of two-body and multi-body decays of an X particle with a decay length $L_r$ uniformly distributed in the range [0, 5] m. An example of these simulated processes is depicted in Figure 1, where the cases of two and ten tracks are shown. Each bin of the vertical axis corresponds to one of the 20 layers of the MDT chambers. The horizontal axis linearly maps the longitudinal coordinate of the MDT chambers. The number of bins on the horizontal axis is set to 333, a realistic average number of MDT tubes in the ATLAS detector. Images such as those in Figure 1 constitute a convenient representation of the simulated and reconstructed physics process for training neural-network-based algorithms. For each choice of charged particle multiplicity, 5k images are generated separately, for a total of 45k available events across the nine multiplicities. The sample is randomly split in two parts so that 80% of the images are employed for the training and the remaining 20% for the evaluation.
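To make the image representation concrete, the sketch below builds a 20 x 333 binary image and performs the 80%/20% split. It is a deliberately crude stand-in and not the toy simulation used for the results: it assumes straight tracks from a common vertex (no magnetic field, no resolution smearing), a flat per-pixel noise probability, and a hypothetical slope range. All function and parameter names are illustrative.

```python
import random

N_LAYERS = 20   # vertical bins: MDT layers (value from the text)
N_TUBES = 333   # horizontal bins: tubes along the longitudinal coordinate (value from the text)


def make_toy_image(n_tracks, noise_prob=0.01, rng=None):
    """Return a 20 x 333 list-of-lists image with hits set to 1.

    ASSUMPTION: straight-line tracks and flat noise, chosen only to
    illustrate the data layout, not the physics of the real simulation.
    """
    rng = rng or random.Random()
    image = [[0] * N_TUBES for _ in range(N_LAYERS)]
    vertex_layer = rng.randrange(N_LAYERS - 1)   # layer where the decay occurs
    vertex_tube = rng.uniform(0, N_TUBES - 1)    # longitudinal decay position
    for _ in range(n_tracks):
        slope = rng.uniform(-3.0, 3.0)           # tubes crossed per layer (hypothetical)
        for layer in range(vertex_layer, N_LAYERS):
            tube = round(vertex_tube + slope * (layer - vertex_layer))
            if 0 <= tube < N_TUBES:
                image[layer][tube] = 1
    for layer in range(N_LAYERS):                # uncorrelated noise hits
        for tube in range(N_TUBES):
            if rng.random() < noise_prob:
                image[layer][tube] = 1
    return image


def split_train_test(items, train_fraction=0.8, rng=None):
    """Shuffle and split a collection into training and evaluation parts."""
    rng = rng or random.Random()
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


image = make_toy_image(n_tracks=2, rng=random.Random(0))
print("hits in image:", sum(map(sum, image)))
```

Hits are only produced in layers at or beyond the decay vertex, which mimics the absence of detector activity between the interaction point and the LLP decay.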

Neural network models
Algorithms for the trigger selection of the experiments at the LHC generically fall into two categories. The first category is based on the ability to identify unique characteristics of the specific signature of interest, and defines a selection capable of preserving such a signature with high purity. This approach is effective for selected processes, but it lacks, by design, generalizability to other, possibly unknown, physics phenomena. The second category builds on the concept of anomaly detection [29] to overcome the limitation of the first. In this scenario, the trigger selection is based on the likelihood that the event is not generated by known physics processes. This approach is particularly appropriate for model-independent searches for new physics signatures.
In this article two algorithms, representative of the two triggering philosophies just described, are developed and characterised. A deep convolutional neural network (CNN) is trained to regress the $L_r$ parameter of the neutral LLP, while an autoencoder (AE) is trained exclusively on events where the decay of the LLP occurred near the interaction point, in order to detect anomalies. These two algorithms follow the two distinct triggering criteria because the AE, unlike the CNN, is only exposed to events with short lifetimes during training, and hence remains agnostic about what a neutral LLP decay would look like in the detector. Once trained and deployed in the trigger and acquisition system, the CNN and the AE can be employed to define a selection criterion based, in the first case, on the inferred $L_r$ parameter and, in the second case, on the likelihood that the event does not contain only prompt decays.
The CNN model [30,31] is presented in Figure 2. It comprises convolutional layers, ReLU activation functions, MaxPooling operators, and a final multi-layer perceptron with a single output node to regress the $L_r$ of the LLP. The implemented loss function is the mean squared error between the true and the predicted values, $L_r$ and $\hat{L}_r$ respectively.
In a similar fashion, the AE is also based on a CNN and is also presented in Figure 2. The encoder part is composed of convolutional layers, ReLU activations, and MaxPooling operators. The loss function of the AE is the binary cross-entropy, which performs the pixel-by-pixel comparison between the input and the reconstructed images. A second term in the loss was investigated but was found not to provide a substantial improvement in the discrimination performance. This second term compared the high-level features in the latent space of another AE, constructed with the same architecture and trained on a different dataset with the same statistics, to compute a perceptual loss, inspired by the work in Ref. [32]. Once the AE is trained, only its encoder part is employed for constructing the discriminant for the trigger selection, as explained in Section 4, and consequently only the encoder is used when studying the performance and the inference time. For simplicity, the encoder part of the AE is referred to as the AE model throughout the text. The CNN model has ∼2.8M trainable parameters, while the full AE has ∼398k, of which ∼162k belong to the encoder part. The different number of parameters of the two models influences the studies of the inference time and throughput, as shown in Section 4.
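As a reminder of what the regression loss measures, the following minimal sketch evaluates the mean squared error between true and predicted decay lengths. The numerical values are made up for illustration and the function name is not taken from the authors' code.

```python
def mean_squared_error(true_values, predicted_values):
    """Average of the squared differences between truth and prediction."""
    if len(true_values) != len(predicted_values):
        raise ValueError("inputs must have the same length")
    if not true_values:
        raise ValueError("inputs must not be empty")
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return sum(squared) / len(squared)


# Hypothetical decay lengths in metres: truth and network output.
true_lr = [0.5, 2.0, 4.5]
predicted_lr = [0.7, 1.9, 4.0]
print("MSE [m^2]:", mean_squared_error(true_lr, predicted_lr))
```

Because the loss is quadratic, a few events with a badly mis-reconstructed decay length dominate it, which is one reason the residual distribution is inspected alongside the efficiency curve.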
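The pixel-by-pixel comparison performed by the binary cross-entropy can be written out explicitly. The sketch below is a plain-Python illustration on flattened images; the clipping constant `eps` is an assumption added for numerical safety and is not a parameter quoted in the text.

```python
import math


def binary_cross_entropy(target_pixels, reconstructed_pixels, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between two flattened images.

    target_pixels: values in {0, 1} (hit / no hit in the input image).
    reconstructed_pixels: values in [0, 1] produced by the decoder.
    """
    if len(target_pixels) != len(reconstructed_pixels):
        raise ValueError("images must have the same number of pixels")
    total = 0.0
    for y, p in zip(target_pixels, reconstructed_pixels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(target_pixels)


# Toy 4-pixel image and two candidate reconstructions.
target = [1, 0, 0, 1]
good_reconstruction = [0.9, 0.1, 0.2, 0.8]
poor_reconstruction = [0.5, 0.5, 0.5, 0.5]
print("good:", binary_cross_entropy(target, good_reconstruction))
print("poor:", binary_cross_entropy(target, poor_reconstruction))
```

An uninformative reconstruction of 0.5 in every pixel yields a loss of ln 2 per pixel, which is a useful reference when judging whether the AE has learned anything about the background images.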
We highlight that the chosen architectures are not ideal for the typical sparsity and cardinality of the data emerging from particle-physics collisions; they were chosen, instead, because they are fully supported by the adopted publicly available libraries. Support for other architectures, such as recurrent or graph neural networks, would definitely broaden the potential interest for the physics applications of fast inference on commercially available FPGA accelerator cards.
Data are labelled as background if the LLP decay is within 0 m < $L_r$ < 1 m and as signal if 3 m < $L_r$ < 5 m, and these definitions are consistently provided to both the CNN and AE models for the evaluation of the performance. Only background data are provided to the AE training, without any explicit label, while the whole dataset, regardless of the true $L_r$ value, is provided when training the CNN model. The CNN training includes decays to charged particles with multiplicity between two and ten, for both background and signal. By contrast, only background decays with multiplicity between two and four are considered when training the AE model. The reason is that the CNN model was found to provide excellent discrimination performance regardless of the composition of the background in terms of track multiplicity and opening angles of the decay products, whereas the AE model was found to be more sensitive to the composition of the background; for this reason only the background with a low number of tracks was kept in its training. Further optimisation of the AE architecture and training procedure, and also proper simulations of background events originating from SM particles decaying promptly in the detector, are possible and are left for future studies.
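The labelling and training-sample definitions above amount to a pair of simple selections, sketched here under the assumption that each event carries its true decay length in metres and its charged-track multiplicity. The function names are hypothetical.

```python
def label_event(decay_length_m):
    """Evaluation label from the true decay length, following the text."""
    if 0.0 < decay_length_m < 1.0:
        return "background"
    if 3.0 < decay_length_m < 5.0:
        return "signal"
    return None  # intermediate region: not used in the signal/background evaluation


def used_in_ae_training(decay_length_m, n_tracks):
    """AE training keeps only low-multiplicity background events."""
    return label_event(decay_length_m) == "background" and 2 <= n_tracks <= 4


def used_in_cnn_training(decay_length_m, n_tracks):
    """CNN training uses every simulated event with two to ten tracks."""
    return 0.0 <= decay_length_m <= 5.0 and 2 <= n_tracks <= 10


for lr, n in [(0.4, 3), (0.4, 8), (2.0, 2), (4.2, 10)]:
    print(lr, n, label_event(lr), used_in_ae_training(lr, n), used_in_cnn_training(lr, n))
```

Events with 1 m < L_r < 3 m still enter the CNN regression training but carry no label in the signal-versus-background evaluation.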
For studying the inference time, two different FPGA boards are considered, the Xilinx Alveo U50 and U250. The AE model was not run on the U250 board due to missing support in the Xilinx Vitis-AI tool for the Reshape layer, which is included in the architecture of the AE. It is also important to note that the servers hosting the two accelerator cards are equipped differently, hence a direct comparison between the performance of the U50 and U250 cards for the CNN model is not straightforward given the different CPU load on each.
The actual acceleration on the FPGA cards requires several preliminary operations, and the Vitis-AI tool distributed by Xilinx provides a complete workflow for this purpose. The Post Training Quantization of Vitis-AI is chosen for the model quantization, converting the original trained parameters to 8-bit integer precision. From the quantized model, the Vitis-AI compiler produces the so-called Xilinx Intermediate Representation (XIR) of the model, a representation in a workable format for the specific chosen FPGA board. Vitis-AI provides a Python-based script which manages the model inference using XIR and the Vitis AI Run Time libraries. The first part of the script consists of the deserialization of the compiled model, where, starting from the XIR representation of the model, an xmodel file is returned. Then a final utility checks the correctness of the XIR model and performs the actual deployment to the accelerator card.
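To illustrate what converting trained parameters to 8-bit integer precision involves, the sketch below applies a generic symmetric per-tensor quantization. This is an assumption made for illustration only: it is not the Vitis-AI implementation, whose scaling and calibration choices may differ, and it uses no Vitis-AI API.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers with one shared scale factor."""
    if not values:
        raise ValueError("need at least one value")
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


# Hypothetical layer weights.
weights = [-1.0, -0.26, 0.0, 0.013, 0.5, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
worst_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
print("int8 weights:", q_weights)
print("largest rounding error:", worst_error)
```

The rounding error per parameter is bounded by half the scale factor, which gives an intuition for why a small loss of accuracy is expected after quantization.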
The two Alveo cards come with different architecture configurations and also contain different Deep-Learning Processing Units (DPUs).In addition, slightly different quantization bit-width methods are employed, together with two different versions of the Vitis-AI software, v1.4.1 and v2.5, respectively for the U50 and U250 boards.

Results
In this section, we evaluate both the CNN and AE models in terms of physics performance, inference time and throughput.
Figure 3 presents the distribution of the residuals between the predicted and the true decay length of the neutral LLP and the trigger efficiency of the CNN model, considering the float model, the quantized model, and the models actually deployed on the U50 and U250 accelerator cards. The quantization of the models resulted in a small degradation in accuracy but did not significantly impact the efficiency curve, indicating an acceptable performance degradation. Similarly, a slight degradation was observed between the quantized model and the one deployed on the FPGA. The trigger efficiency is defined as the fraction of neutral LLP decays with predicted decay length $\hat{L}_r$ > 3 m as a function of the true decay length $L_r$. The steep turn-on curve of the efficiency confirms that the residual distributions are under control and indicates the feasibility of the approach.
The evaluation results of the AE are shown in Figure 4, with a discriminant defined as the sum of the hidden features of the AE latent space. Other variations of the discriminant were investigated but, due to the sparsity and the relative contribution of noise in the original and reconstructed images, a discriminant based on the latent space features was found to be more adequate for the purpose of anomaly detection. Among the possible choices for combining the latent space features, the sum was chosen for its simplicity and compatibility with the Vitis-AI software. As with the CNN model, a small degradation is observed when quantizing the model, while the quantized model and the one deployed on the FPGA were found to be in substantial agreement.
The performance of the two models is also studied with the ROC curves presented in Figure 5. These curves are computed by consistently labelling as signal the LLP decays with 3 m < $L_r$ < 5 m, and as background those with 0 m < $L_r$ < 1 m. The backgrounds with the different charged particle multiplicities are summed, and ROC curves are constructed with respect to a signal with a particular charged particle multiplicity. To be consistent with the training procedure outlined in Section 3, track multiplicities of the neutral LLP decay between two and ten, and between two and four, are used for creating the background sample for the CNN and the AE models, respectively. The ROC curves demonstrate the capability of the CNN model to effectively learn the decay position independently of the multiplicity of the charged decay products, while a dependence on the multiplicity is clearly evident for the AE model. The performance of the AE model for two-track signals is not displayed because no discrimination is achieved in this case. The performance of the AE model was also found to depend on other aspects of the generation, for example on the opening angle of the decay products of the neutral LLP. Additional studies in this direction are considered beyond the scope of this article, as the results discussed so far already demonstrate the capability of the chosen network architectures to effectively select the neutral LLP decays of interest.
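The trigger efficiency used for the CNN model, namely the fraction of decays whose predicted decay length exceeds 3 m in bins of the true decay length, can be computed as in the following sketch. The 1 m bin width and the example values are assumptions made for illustration.

```python
def trigger_efficiency(true_lengths, predicted_lengths,
                       threshold_m=3.0, bin_edges=(0, 1, 2, 3, 4, 5)):
    """Fraction of events passing predicted > threshold, per bin of true length.

    Returns one entry per bin; None marks bins without any event.
    """
    n_bins = len(bin_edges) - 1
    passed = [0] * n_bins
    total = [0] * n_bins
    for true_lr, predicted_lr in zip(true_lengths, predicted_lengths):
        for i in range(n_bins):
            if bin_edges[i] <= true_lr < bin_edges[i + 1]:
                total[i] += 1
                if predicted_lr > threshold_m:
                    passed[i] += 1
                break
    return [passed[i] / total[i] if total[i] else None for i in range(n_bins)]


# Hypothetical true and predicted decay lengths in metres.
true_lr = [0.5, 0.6, 3.5, 3.6, 4.5]
predicted_lr = [0.4, 3.2, 3.4, 2.9, 4.4]
print(trigger_efficiency(true_lr, predicted_lr))
```

A well-behaved regression gives values close to zero below the threshold and close to one above it, with the width of the transition set by the resolution on the predicted decay length.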
The inference time and the throughput of the CNN and AE models on different architectures are also studied, and the results are presented in Tables 1 and 2. The inference on the FPGA accelerator cards requires images to be batched with a fixed size, which depends on the DPU of the accelerator card and is declared by the manufacturer. The U50 and U250 cards require a batch size of three and four, respectively. For consistency in the reported results, a batch size of four is also used for studying the inference time and the throughput on the CPU and GPU architectures. The measurements on the CPU and GPU architectures have been performed by converting the model into the Open Neural Network Exchange (ONNX) format and using the runtime engines corresponding to these two architectures [33]. The ONNX format was found to substantially improve the results, by more than one order of magnitude on both architectures. The measurements on the CPU were performed using all the cores of a machine equipped with AMD EPYC 7302 16-Core processors. The measurements on the GPU were performed on an NVIDIA Tesla V100 GPU, using the float models before quantization. The inference time results are obtained by averaging over a few tens of measurements. The throughput is estimated by running inference on 10k images. The first measurements of both inference time and throughput on accelerators are discarded since they were observed to be systematically higher.
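The measurement procedure described above (fixed batch size, first measurement discarded, average over a few tens of repetitions, throughput over many images) can be sketched as a small timing harness. The `fake_inference` function is a hypothetical stand-in for a real model call, and the repetition counts are assumptions, not the authors' settings.

```python
import time
from statistics import mean

BATCH_SIZE = 4  # batch size used for the CPU and GPU measurements in the text


def fake_inference(batch):
    """Hypothetical stand-in for a model call on one batch of images."""
    return [sum(image) for image in batch]


def measure_inference_time_ms(infer, batch, n_repeats=30, n_warmup=1):
    """Average latency per batch in ms, discarding the warm-up calls."""
    timings = []
    for i in range(n_warmup + n_repeats):
        start = time.perf_counter()
        infer(batch)
        elapsed_ms = (time.perf_counter() - start) * 1e3
        if i >= n_warmup:
            timings.append(elapsed_ms)
    return mean(timings)


def measure_throughput_fps(infer, images, batch_size=BATCH_SIZE):
    """Images processed per second when looping over the whole sample."""
    start = time.perf_counter()
    n_done = 0
    for first in range(0, len(images), batch_size):
        batch = images[first:first + batch_size]
        infer(batch)
        n_done += len(batch)
    return n_done / (time.perf_counter() - start)


images = [[0.0] * (20 * 333) for _ in range(1000)]  # flattened dummy images
print("latency [ms]:", measure_inference_time_ms(fake_inference, images[:BATCH_SIZE]))
print("throughput [fps]:", measure_throughput_fps(fake_inference, images))
```

Latency and throughput are measured separately because a device that processes several batches concurrently can reach a high throughput without a correspondingly low latency per batch.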
Overall, the study indicates that all architecture technologies offer inference time and throughput adequate for the typical latency requirements of a high-level trigger selection in a general-purpose experiment at the LHC or HL-LHC. The inference time for the CNN model suggests that the acceleration on FPGA gives an advantage compared to the CPU-based approach. A similar advantage is not evident for the AE model. This can be attributed to the lightness of the model in terms of number of parameters, which results in the actual inference time being negligible compared to the time needed for loading the data onto the FPGA itself. A proper study of the time needed for performing the various sub-tasks that enable the inference on the FPGA is considered of great interest, but was not possible with the tools at hand. The throughput measurements also indicate the superiority of the FPGA-acceleration approach compared to the CPU-based one for the CNN model, and not for the AE model, for the same reasons just given. In addition, the throughput on the GPU architecture seems to suggest the superiority of this approach, but this is achieved, as the corresponding measurements of the inference time confirm, only thanks to the capability of GPUs to process inferences concurrently, and such a high degree of concurrent computing cannot be directly exploited within a multi-node high-level trigger farm at colliders. In summary, considering the necessity of deploying larger models when dealing with real experiments, the results presented in this study suggest that a heterogeneous computing model with FPGA-based acceleration has the potential to improve the real-time processing and the responsiveness of a trigger system.
It is also worth noting that the inference time and throughput for the Xilinx Alveo U50 and U250 FPGAs, as presented in Tables 1 and 2, were obtained without utilizing the multi-threading capability of the devices. We tested the throughput with varying numbers of parallel threads, and observed an almost linear improvement in the throughput versus the number of concurrent threads, as shown in Figure 6. Hence, by leveraging the multi-threading capabilities, it becomes possible to achieve better performance than that shown in Tables 1 and 2.
One final consideration is the comparison of the power dissipation declared by the manufacturers for the considered architectures. The NVIDIA Tesla V100 GPU has a power consumption of 300 W, to be compared with 75 W for the Xilinx Alveo U50. The actual power dissipation will obviously depend on the amount of resources actually used when performing the inference. This has not been studied because it will depend on the multi-node architecture of the farm, equipped or not with accelerators; hence the values declared by the manufacturers are considered sufficient to support these remarks. More studies in this direction will be necessary and are considered an important step for the definition of the trigger and data acquisition system of the future experiments operating at the HL-LHC.
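The scaling of throughput with the number of concurrent threads can be illustrated with a thread pool that submits batches in parallel. In this sketch `fake_device_call` is a hypothetical placeholder that simply sleeps, mimicking a blocking call to an accelerator; it does not use the Vitis AI runtime, and the 2 ms duration is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_device_call(batch):
    """Hypothetical blocking accelerator call: waits, then reports the batch size."""
    time.sleep(0.002)
    return len(batch)


def throughput_fps(n_threads, n_batches=100, batch_size=4):
    """Process all batches with n_threads workers; return (images, images per second)."""
    batches = [[0] * batch_size for _ in range(n_batches)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        n_images = sum(pool.map(fake_device_call, batches))
    elapsed = time.perf_counter() - start
    return n_images, n_images / elapsed


for n_threads in (1, 2, 4):
    n_images, fps = throughput_fps(n_threads)
    print(f"{n_threads} thread(s): {n_images} images, {fps:.0f} fps")
```

In this idealised setting the throughput grows roughly in proportion to the number of threads because each call spends its time waiting; on real hardware the gain saturates once the device or the data transfer becomes the bottleneck.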

Conclusions
This article discusses the performance evaluation of machine learning algorithms on commercially available Xilinx FPGA accelerator cards. The necessary steps, including model training, quantization, compilation, and actual deployment on the FPGA board, were all performed. The post-training quantization technique provided by Vitis-AI was used for model quantization, which resulted in an acceptable level of model accuracy degradation and a reduced model size. Two neural-network models based on different architectures were trained and characterised for selecting events with neutral long-lived particle decays within the geometrical acceptance of the muon spectrometer of a general-purpose experiment at the LHC. A model based on convolutional neural networks and trained to regress the decay length of the neutral long-lived particle, and a model based on an autoencoder architecture and trained to detect such decays as anomalies, are presented. The first model was deployed on Xilinx Alveo U50 and U250 accelerator cards using the Vitis-AI compiler, while the second model was only deployed on the U50 card. The models were demonstrated to efficiently retain events of physics interest while rejecting background collisions. The inference time and throughput of the models were also compared between a CPU and different architectures with GPU-based or FPGA-based acceleration. The results indicate that the measured inference times on all tested architectures fit within the typical latency requirements of a high-level trigger selection in a general-purpose experiment at the LHC or HL-LHC.

Acknowledgments
We thank the INFN IT teams in Genoa and Rome, and in particular Mirko Corosu and Luca Rei, for useful support in instrumenting the local computing resources to accommodate the FPGA accelerator cards. This work is partially supported by ICSC - Centro Nazionale di Ricerca in High Performance Computing, Big Data and Quantum Computing, funded by European Union - NextGenerationEU.

Figure 1: Image representation of a neutral LLP decaying to two (a) and ten (b) charged particles with the signal pattern and detector noise released into the MDT chambers.

Figure 2: Diagrams with layer-by-layer details of the two architectures considered in this work. The CNN model for regressing the $L_r$ parameter of the neutral LLP is in (a), while the AE model for detecting anomalies, defined as decays not occurring near the interaction point, is in (b). For both diagrams, further details on the individual blocks are given in the text.

Figure 3: (a) Residual plot between the true decay length of the neutral LLP and the one predicted by the CNN model. (b) Efficiency plot of the CNN model, where the efficiency is defined as the fraction of neutral LLP decays with predicted decay length $\hat{L}_r$ > 3 m as a function of the true decay length $L_r$.

Figure 5: ROC curves for the (a) CNN and the (b) AE models. In both cases, LLP decays are labelled as signal if 3 m < $L_r$ < 5 m and as background if 0 m < $L_r$ < 1 m. Track multiplicities between two and ten, and between two and four, are used for creating the background dataset for the CNN and the AE models, respectively. In contrast, the track multiplicity is considered separately for signal, as a way to estimate the discrimination performance for different hypotheses of new physics signatures. The CNN model is found to provide better discrimination performance than the AE model, and not to depend on the track multiplicity of the neutral LLP decay.

Figure 6: Throughput in frames per second as a function of the number of concurrent threads as measured with the CNN model deployed on the FPGA U50 and U250 accelerator cards.

Table 1: Inference time in ms and throughput in frames per second for the CNN model on different target architectures. The results include the actual deployment of the model on the FPGA U50 and U250 accelerator cards.

Table 2: Inference time in ms and throughput in frames per second for the AE model on the different target architectures. The results include the actual deployment of the model on the FPGA U50 accelerator card.