Design and implementation of neural network based conditions for the CMS Level-1 Global Trigger upgrade for the HL-LHC

The CMS detector will be upgraded to maintain, or even improve, the physics acceptance under the harsh data taking conditions foreseen during the High-Luminosity LHC operations. In particular, the trigger system (Level-1 and High Level Triggers) will be completely redesigned to utilize detailed information from sub-detectors at the bunch crossing rate: the upgraded Global Trigger will use high-precision trigger objects to provide the Level-1 decision. Besides cut-based algorithms, novel machine-learning-based algorithms will also be included in the Global Trigger to achieve a higher selection efficiency and detect unexpected signals. Implementation of these novel algorithms is presented, focusing on how the neural network models can be optimized to ensure a feasible hardware implementation. The performance and resource usage of the optimized neural network models are discussed in detail.


Introduction
The new CMS trigger system for the High-Luminosity LHC upgrade [1] will exploit detailed information from the calorimeter, muon and tracker subsystems at the bunch crossing rate. The final stage of the Level-1 Trigger apparatus, the Global Trigger (GT), will receive high-precision trigger objects from the upstream systems. Implemented in modern Field Programmable Gate Arrays (FPGAs), it will determine the Level-1 decision based on a trigger menu consisting of more than 1000 trigger algorithms. The current system [2] relies on cut-based algorithms that act on specific combinations of reconstructed particle properties. To reach a higher selection efficiency and to select unexpected signals, the upgraded GT will also include neural-network-based conditions. Implementing these neural-network-based conditions in the GT algorithm chain requires meeting stringent latency and resource requirements. The upgrade targets a total latency of 1 μs (40 bunch crossings, BX) for the entire GT. Three quarters of this budget is used by high-speed serial links, demultiplexers, distribution and the Final-OR stage. Since neural networks (NN) are typically resource intensive, extensive optimization is required during and after training to ensure they can be integrated alongside the cut-based algorithms while meeting the target latency of ∼10 BX. Two different flavours of NNs are considered: deep binary classifiers and deep auto-encoders. To reduce the models' resource usage and latency, multiple optimizations have been applied. Some of these optimizations, such as synapse pruning, hyper-parameter quantization and precision tuning, can be performed without completely redesigning the model. However, others require a new model to be designed and trained from scratch. In this work, a technique known as knowledge distillation was used to further reduce the resource usage of the final NN model.

Neural network model development
Deep binary classifiers and deep auto-encoders are studied. The primary purpose of the deep binary classifiers is to discern specific signal signatures, while the deep auto-encoders are designed to learn the unlabeled data and flag anything that deviates from it as anomalous. The latter rely on an unsupervised learning technique and, in this particular case, aim to learn an efficient encoding of the features of the well-understood physics scenarios ("background"). They encode the input data into a lower-dimensional representation (latent space) and then decode it back to its original form, attempting to minimize the reconstruction error. As a result, any signature which differs substantially from the background will be reconstructed poorly. The distance between the input and the reconstructed event is then used as the anomaly score. To compare the performance of deep binary classifiers and auto-encoders, we consider three different signal signatures, denoted A, B and C. Each signal signature is associated with its own trained binary classifier, whereas a single auto-encoder is trained using background events only. Binary classifiers are trained with supervised learning on a mixture of signal and background events.
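The reconstruction-error logic described above can be sketched in a few lines of Python. The two-feature "auto-encoder" below is a hypothetical stand-in with hand-picked weights, not the trained model of this work; it only illustrates why an event far from the learned background manifold gets a large anomaly score.

```python
import numpy as np

def anomaly_score(x, reconstruct):
    """Per-event anomaly score: mean squared error between the
    input feature vector and its auto-encoder reconstruction."""
    x_hat = reconstruct(x)
    return np.mean((x - x_hat) ** 2, axis=-1)

# Toy stand-in for a trained auto-encoder: project onto a
# 1-dimensional latent space and decode back (hypothetical weights).
W_enc = np.array([[1.0], [1.0]]) / np.sqrt(2.0)   # encoder, 2 -> 1
W_dec = W_enc.T                                    # decoder, 1 -> 2

def toy_autoencoder(x):
    return (x @ W_enc) @ W_dec

background = np.array([1.0, 1.0])    # lies along the learned direction
anomaly    = np.array([1.0, -1.0])   # orthogonal to it

s_bkg = anomaly_score(background, toy_autoencoder)
s_sig = anomaly_score(anomaly, toy_autoencoder)
# The anomalous event reconstructs poorly, so its score is larger.
```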
Hardware used for real-time inference in the Level-1 Trigger has limited computational capacity due to size and latency constraints. Incorporating resource-intensive models without a loss in performance poses a great challenge, and significant model compression is therefore necessary. Hyper-parameter and input/output precision quantization using QKeras [3] is employed to reduce the complexity of the multiplications. Additionally, synapse pruning is implemented through the TensorFlow model optimization toolkit [4]. These optimization processes occur during training and reduce the model size by more than a factor of three compared to the uncompressed model implementation [5]. Despite the aforementioned compression methods, the auto-encoders typically remain too large to be implemented in FPGAs. To tackle this challenge, one more compression technique is harnessed: a basic implementation of knowledge distillation [6]. First, a larger auto-encoder (referred to as the "teacher") is trained with only background events. A secondary, more compact model (referred to as the "student") is then trained to reproduce the teacher's anomaly score using the background events and random samples. The anomaly score is computed as the Mean Squared Error (MSE). The two training schemes are illustrated in figure 1. Data pre-processing is performed with a normalization layer: the training dataset's variables are re-scaled to have a mean equal to zero and a standard deviation equal to one. The same re-scaling parameters are applied during training and in the hardware inference. The input variables of the two model topologies are listed in table 1.
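As an illustration of the distillation idea only (not the actual QKeras training used in this work), the following numpy sketch fits a compact "student" regressor to reproduce a hypothetical teacher anomaly score on background-like and random samples; `teacher_score` is a made-up stand-in for the trained teacher auto-encoder's MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "teacher" anomaly score: a fixed nonlinear function
# standing in for the trained auto-encoder's reconstruction MSE.
def teacher_score(x):
    return np.mean(x ** 2, axis=1)

# Training set: background-like events plus random samples,
# mirroring the student's training mixture described in the text.
x_train = np.concatenate([rng.normal(0.0, 1.0, (512, 4)),
                          rng.uniform(-3.0, 3.0, (512, 4))])
y_train = teacher_score(x_train)

# Compact "student": a linear model on squared features, fitted by
# least squares to regress the teacher's score directly.
phi = x_train ** 2                       # simple feature map
w, *_ = np.linalg.lstsq(phi, y_train, rcond=None)

y_pred = phi @ w
mse = np.mean((y_pred - y_train) ** 2)   # distillation residual
```

In the real system the student is a small deep network trained with gradient descent, but the target construction is the same: the teacher's score, not a class label, is the regression target.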
pT, η and φ are the representation of the candidate particle's momentum [7]. Since the usage of some of these variables does not result in a notable increase in the signal efficiency of the binary classifiers, they are not used during training. Binary classifiers feature a single hidden layer with 64 nodes and ReLU activations; the output is a single node with a sigmoid activation function. The auto-encoder (teacher) features multiple hidden layers in the encoder, a latent space with 7 nodes and a decoder that mirrors the encoder's architecture. The student is a deep neural network with one output (the anomaly score). ReLU activations are used in the hidden layers and a linear activation at the output. To translate the NN models into firmware, hls4ml [5], developed by the CMS community, is used. It translates high-level model descriptions (Keras/QKeras) into synthesizable C code, which is then translated by the AMD VITIS High Level Synthesis compiler [8] into a VHDL module. This results in a block ready for integration into an FPGA design, from which firmware can eventually be built using AMD VIVADO [9]. Table 2 compares the signal selection efficiency at a given fixed rate for the three reference samples between the binary classifiers and the auto-encoder. Keras models are trained with single-precision floating-point (FP32) without synapse pruning. In the QKeras and hls4ml models, different quantizations are applied for the hidden and output layers, favoring higher bit precision in the output layer for enhanced performance. Hardware-deployable models usually lose performance when quantization (fixed-point with 8/6 bits) and pruning (50%) are applied. In this work, the Keras to hls4ml porting of the binary classifiers incurs less than a 6% signal efficiency loss, which is small considering the reduction in model size (see section 4). These results demonstrate that the auto-encoder is sensitive to the signal samples, albeit with reduced efficiency compared to the binary classifiers, even though it was not trained with signal events.
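The effect of the fixed-point quantization quoted above (e.g. 8 total bits with a reduced integer part) can be illustrated with a short numpy sketch. The round-to-nearest and saturation behaviour assumed here is only an approximation of the `ap_fixed` types used in the HLS flow, whose rounding and overflow modes are configurable.

```python
import numpy as np

def to_fixed_point(x, total_bits, int_bits):
    """Round x to a signed fixed-point grid with `total_bits` bits,
    `int_bits` of which sit above the binary point (a sketch of the
    ap_fixed<total, int> types used by hls4ml)."""
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** (int_bits - 1))                 # saturation bounds
    hi = (2.0 ** (int_bits - 1)) - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

w = np.array([0.1234, -0.5678, 0.9, -1.2])        # toy FP32 weights
w_q = to_fixed_point(w, total_bits=8, int_bits=2)  # ap_fixed<8,2>-like
err = np.max(np.abs(w - w_q))  # bounded by half an LSB inside range
```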

Interface between the Global Trigger and neural networks
As mentioned in section 2, a pre-processing step is applied at the inputs of the NN models, and the re-scaling parameters have to be passed to the hardware. The mathematical operation is described in eq. (3.1), where the two passed parameters are the mean (μ) and the standard deviation (σ) of each input variable in the training dataset:

x̂ = (x − μ) / σ.   (3.1)
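A minimal sketch of this re-scaling step follows, assuming (as is common in firmware, though not stated explicitly in the text) that the division is replaced by a multiplication with the precomputed inverse of the standard deviation:

```python
import numpy as np

# Toy training dataset: re-scaling parameters are extracted from it,
# per eq. (3.1), and would be passed to the hardware as constants.
x_train = np.array([[20.0, 1.2],
                    [35.0, -0.4],
                    [50.0, 0.1]])
mu = x_train.mean(axis=0)
sigma = x_train.std(axis=0)

# Assumption of this sketch: the firmware multiplies by the
# precomputed inverse instead of dividing event by event.
inv_sigma = 1.0 / sigma

def rescale(x):
    return (x - mu) * inv_sigma

x_scaled = rescale(x_train)
# After re-scaling, each variable has zero mean and unit variance.
```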

Implementation of neural networks in the Global Trigger hardware
Each development step, if not managed properly, could lead to timing violations in the final hardware implementation. The hls4ml step inherits all the optimizations described in section 2. During VITIS HLS compilation, the target clock frequency for the auto-encoder (student) was increased to 300 MHz to avoid possible timing violations, while 240 MHz was kept for the binary classifiers. The clock uncertainty was increased to 33% for both. To relax possible routing congestion within the NN block, the input vector is registered twice, allowing the place-and-route process to focus on the NN itself rather than its external connections [12]. Crossing from 240 MHz back to 480 MHz at the output stage requires multi-cycle path constraints. Finally, timing constraints were met by enabling the most aggressive implementation strategies in VIVADO [9]. A breakdown of the resource usage and latency is given in table 3. Post-Training Quantization (PTQ) was used for the uncompressed models, while Quantization-Aware Training (QAT) was used for the compressed ones. The latency shown for the re-scaler modules comprises the normalization and the clock domain crossing (CDC) logic. The prototype firmware is implemented on a Serenity ATCA board [13] equipped with a Xilinx VU9P FPGA featuring 3 Super Logic Regions (SLRs). The prototype design features one auto-encoder (student) and the three binary classifiers, each replicated in all SLRs, for a total of 3 auto-encoders (student) and 9 binary classifiers (figure 4). First, data are read from the data link buffers, then demultiplexed and distributed across the whole chip, and finally injected into the NN interfaces. Algorithm bits are written to the output channels and sent to the Final-OR board, where monitoring, pre-scaling and masking take place [10].
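The latency bookkeeping implied by these clock frequencies can be checked with a short sketch relating algorithm clock cycles to LHC bunch crossings (BX, 40 MHz, i.e. 25 ns each). The 60-cycle figure below is a hypothetical example used only to exercise the conversion, not a measured latency of the NN blocks.

```python
BX_NS = 25.0  # one LHC bunch crossing at 40 MHz

def cycles_to_bx(n_cycles, clock_mhz):
    """Number of bunch crossings spanned by n_cycles of a clock
    running at clock_mhz (frequencies as quoted in the text)."""
    period_ns = 1000.0 / clock_mhz
    return n_cycles * period_ns / BX_NS

# e.g. a hypothetical 60-cycle NN block at 240 MHz occupies 10 BX,
# matching the ~10 BX budget left for the GT algorithms; one BX
# corresponds to 12 cycles of the 480 MHz output-stage clock.
bx_240 = cycles_to_bx(60, 240.0)
cycles_per_bx_480 = 1.0 / cycles_to_bx(1, 480.0)
```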

Summary
The CMS Global Level-1 Trigger for Phase-2 features novel algorithms based on machine learning. In this study, we utilized quantization-aware training, pruning and knowledge distillation to compress the neural network models for implementation in the FPGA fabric. Binary classifiers offer better performance in discerning known signal signatures, with lower latency and resource usage than the auto-encoders, but a distinct model is needed for each signal type, requiring prior knowledge to generate the requisite dataset. Conversely, a single trained auto-encoder can be employed to detect known and unknown signatures, utilizing solely background events for its training. Furthermore, there is a notable difference in the final model sizes between the two approaches: despite the use of knowledge distillation, the auto-encoder (student) is approximately ten times larger than a binary classifier. Meeting timing constraints in this intricate architecture necessitated particular coding techniques [12] to guide the VIVADO implementation algorithms. Deep neural network models have been developed, evaluated and successfully tested on a Serenity prototype board.

Figure 1 .
Figure 1. Left: in supervised learning, a model is trained knowing the output labels. Right: auto-encoders rely on unsupervised learning, where the model is trained with only background events. With the knowledge distillation technique, the student model is trained to behave like the teacher model.

Table 1 .
Input variables of the two model topologies.

Table 2 .
Relative performance of auto-encoder with respect to binary classifiers.

Table 3 .
Resource usage breakdown of the relevant modules: uncompressed vs. compressed synthesizable models' size comparison. On the bottom, the resource usage of the re-scaler modules is shown.