NaNet: a flexible and configurable low-latency NIC for real-time trigger systems based on GPUs

NaNet is an FPGA-based PCIe X8 Gen2 NIC supporting 1/10 GbE links and the custom 34 Gbps APElink channel. The design has GPUDirect RDMA capabilities and features a network stack protocol offloading module, making it suitable for building low-latency, real-time GPU-based computing systems. We provide a detailed description of the NaNet hardware modular architecture. Benchmarks for latency and bandwidth for GbE and APElink channels are presented, followed by a performance analysis on the case study of the GPU-based low level trigger for the RICH detector in the NA62 CERN experiment, using both the NaNet GbE and APElink channels. Finally, we give an outline of future project activities.


Introduction
Thanks to their considerable computing power and favorable price/performance and power consumption/performance ratios, GPU architectures such as NVIDIA Fermi and Kepler are gaining popularity in the HEP experiments community. Their usage in high level trigger systems, leveraging their computing power to reduce the number of computing farm nodes, is currently under study with encouraging results [1,2,3]. For the same reasons, low level triggers could also benefit from GPU adoption; the main issue to be taken into account in this context is the strict real-time requirement typical of such systems.
Low level triggers are designed to perform very rough selection based on a sub-set of the available information, in a pipelined structure housed in custom electronics, in order to bring to a manageable level the high data rate that would otherwise reach the software stages behind them. Due to the small buffer sizes in the read-out electronics, such systems typically require very low latency; however, thanks to the fast and cheap DDR memories available nowadays, this requirement will be relaxed in the near future. On the other hand, GPUs provide so much computing power that taking complex decisions at speeds matching significant data rates is feasible; this would mean more accurate selection and more stringent trigger conditions, providing purity and efficiency comparable to those obtained on commodity PCs without forfeiting the real-time constraint. At the same time, GPUs would represent a great step forward in terms of reprogrammability when compared to custom electronics. GPU real-time performance needs careful assessment to match the requirements of the lowest trigger levels, the main issue being the network transfer from the custom readout (RO) electronics to the server hosting the GPU on the PCIe bus. Another caveat of GPU architectures is the need to saturate the computing cores, which requires a significant number of events and a buffering stage; both factors weigh on trigger answer latency. Latency stability is another feature that must be carefully considered for real-time applications: computing on GPUs is mostly deterministic as soon as data has landed in the internal memories but, considering the low level trigger as a whole, latency fluctuations stem from the transit from the RO system through the network interface card (NIC) and the PCIe bus.
Our approach to this problem is twofold: first, we designed a NIC able to inject RO data directly from the links into the memory of NVIDIA Fermi- and Kepler-class GPUs without any intermediate buffering or CPU operation (GPUDirect RDMA is the commercial name of the feature); second, we implemented a dedicated engine in the NIC to offload the CPU from network stack protocol management duties. In this way, transfer latency and its fluctuations are reduced and possible OS jitter effects are avoided. These two features coexist in the NaNet FPGA-based NIC: the first was inherited from the development of our HPC-dedicated 3D NIC, APEnet+ [4]; the second comes from adapting and integrating an open core by the FPGA vendor.
NaNet is flexible, supporting 4 different link technologies, namely a custom 1 Gbps optical serial link, GbE (1000BASE-T/1000BASE-X), 10-GbE (IEEE 802.3aq) and the APElink channel (4 bonded PCML lanes over QSFP+ cables capable of 34 Gbps raw data bandwidth [5]); like any FPGA-based design, the NaNet logic can be effectively tailored to different usage scenarios by adding dedicated custom logic blocks, e.g. to compress or reshuffle the data stream.
NaNet is currently being used in a pilot project within the CERN NA62 experiment aiming at investigating GPU usage in the central Level 0 trigger processor (L0TP) [6].
In the following we provide a detailed description of the NaNet hardware modular architecture and a performance analysis for a case study on the GPU-based Level 0 trigger of the NA62 RICH detector using both the NaNet GbE and APElink channels.
Results of this study motivated the current development of the NaNet design, aimed at including 10 GbE link support; preliminary results and additional FPGA resource requirements are shown.
Finally, we report an outline of future project developments.

NaNet
NaNet is a modular design of a low-latency NIC dedicated to real-time GPU-based systems and supporting a number of different physical links; its design baseline comes from the APEnet+ PCIe Gen 2 x8 3D NIC. The Distributed Network Processor (DNP) is the APEnet+ core logic, acting as an off-loading engine for the computing node in performing inter-node communications [7]. The DNP provides hardware support for the Remote Direct Memory Access (RDMA) protocol, guaranteeing low-latency data transfers. Moreover, APEnet+ is also able to directly access the memory of Fermi- and Kepler-class NVIDIA GPUs (provided that both devices share the same upstream PCIe root complex), leveraging their peer-to-peer capabilities. This is a first-of-its-kind feature for a non-NVIDIA device (GPUDirect RDMA being its commercial name), allowing unstaged off-board GPU-to-GPU transfers with unprecedented low latency [8]. An overview of the typical APEnet+ data flow is in figure 1: inward and outward traffic over the 34 Gbps APElink channel is directly routed to and from GPU internal memory.
The NaNet design inherits GPUDirect RDMA capabilities from APEnet+, extends them with support for standard network links (namely GbE and 10 GbE) and adds to the logic a network stack protocol management offloading engine, to avoid possible OS jitter effects and reduce latency even further. The NaNet design supports a configurable number and kind of I/O channels; incoming data streams are processed by a Physical Link Coding block feeding the Data Protocol Manager, which in turn extracts the payload data. These payload data are encapsulated by the NaNet Controller in the APEnet+ data packet protocol and sent to the APEnet+ Network Interface, which takes care of their delivery to the destination memory. A Custom Logic block joins in by performing any data manipulation needed by the specific application context (see figure 2). In the following, we focus on the characterization of the NaNet-1 design configuration, then we describe current developments for one supporting a 10 GbE interface, NaNet-10; finally, we present a sketch of the NaNet 3 design, its main feature being its deterministic latency links.

NaNet-1 architectural overview
NaNet-1 is a PCIe Gen 2 x8 NIC featuring GPUDirect RDMA over 1 GbE and, optionally, 3 APElink channels. The NaNet-1 board employs the Altera Stratix IV EP4SGX230KF40C2 FPGA (see figure 3); a custom mezzanine was designed to be optionally mounted on top of the Altera board. The mezzanine mounts 3 QSFP+ connectors, thus making NaNet able to manage 3 bi-directional APElink channels with switching capabilities up to 34 Gbps. APElink adopts a proprietary data transmission word stuffing protocol, which NaNet-1 inherits at no additional cost.
As for the implementation of the GbE transmission system, we follow the general I/O interface architecture description of figure 2.
We exploit the Altera Triple Speed Ethernet Megacore (TSE MAC) as Physical Link Coding, providing complete 10/100/1000 Mbps Ethernet IP modules. The design employs the SGMII standard interface to connect the MAC to the PHY, including Management Data I/O (MDIO); the MAC is a single module in FIFO mode for both the receive and the transmit sides (2048x32 bits).
The Data Protocol Manager tasks are carried out by the UDP Offloader, which deals with UDP packet payload extraction and provides a 32-bit wide channel achieving 6.4 Gbps (6 times greater than the standard GbE requirements). The UDP Offloader component collects data coming from the Avalon Streaming Interface of the Altera Triple Speed Ethernet Megacore and redirects UDP packets into a hardware processing data path. In this way, the FPGA on-board µcontroller (Nios II) is entirely relieved of UDP packet traffic management.
The I/O interface data flow control logic is managed by the NaNet Controller, a hardware component able to encapsulate data packets in the APEnet+ protocol, formed by a header, a footer (each a 128-bit word) and a payload with a maximum size of 4096 bytes. The NaNet Controller implements an Avalon-ST Sink Interface collecting the GbE data flow from the UDP Offloader, parallelizing incoming 32-bit data words into 128-bit APEnet+ data words.
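As an illustration of the word-width conversion and encapsulation described above, the following Python sketch groups 32-bit words into 128-bit words and wraps them between a 128-bit header and a 128-bit footer. The field layout of header and footer, the byte order, the zero padding and the names `pack_words` and `encapsulate` are assumptions made for this example only; they are not the actual APEnet+ packet format, and the real logic is implemented in FPGA hardware.

```python
import struct

MAX_PAYLOAD_BYTES = 4096  # maximum APEnet+ payload size quoted in the text
WORD128_BYTES = 16        # header, footer and data words are 128 bits wide


def pack_words(words32):
    """Group 32-bit words four at a time into 128-bit (16-byte) words.

    Zero padding of the last group and little-endian order are assumptions.
    """
    padded = list(words32) + [0] * (-len(words32) % 4)
    return [struct.pack("<4I", *padded[i:i + 4])
            for i in range(0, len(padded), 4)]


def encapsulate(words32, dest_addr):
    """Return header + payload + footer for one packet (hypothetical layout)."""
    payload = b"".join(pack_words(words32))
    if len(payload) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload exceeds 4096 bytes")
    # Hypothetical header: 64-bit destination address, 32-bit length, 32 reserved bits.
    header = struct.pack("<QII", dest_addr, len(payload), 0)
    # Hypothetical footer: payload length repeated, remaining bits zero.
    footer = struct.pack("<IIQ", len(payload), 0, 0)
    return header + payload + footer
```

Five 32-bit input words, for instance, are padded to two 128-bit payload words, so the resulting packet is four 128-bit words long including header and footer.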
Data coming from the I/O interface are managed by the Router component; it supports a configurable number of channels, acting as a multiplexer for a customizable number of ports.
Finally, the Network Interface comprises the PCIe X8 Gen2 link to the host system for a maximum data rate of 4+4 GB/s, the packet injection processing logic, the RX block and the GPU I/O accelerator providing hardware support for the RDMA protocol for CPU and GPU, managed by the Nios II µcontroller operating at 200 MHz. Table 1 shows a recap of the used FPGA logic resources as measured by the synthesis software.
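The rates quoted in this section can be cross-checked with simple arithmetic. The sketch below assumes the standard PCIe Gen2 figures of 5 GT/s per lane with 8b/10b line encoding, and ignores packet protocol overhead; the function name is ours. It also derives the clock frequency implied by a 32-bit channel running at 6.4 Gbps, a value the text does not state explicitly for the UDP Offloader.

```python
def link_rate_gbytes(lanes, gtransfers_per_s, payload_bits, coded_bits):
    """Per-direction data rate in GB/s after line-encoding overhead."""
    return lanes * gtransfers_per_s * payload_bits / coded_bits / 8


# PCIe Gen2 x8: 5 GT/s per lane, 8b/10b encoding (standard values).
gen2_x8 = link_rate_gbytes(8, 5.0, 8, 10)

# A 32-bit wide channel sustaining 6.4 Gbps implies this clock, in MHz.
implied_clock_mhz = 6.4e9 / 32 / 1e6

print(f"PCIe Gen2 x8: {gen2_x8:.1f} GB/s per direction")
print(f"32-bit channel at 6.4 Gbps: {implied_clock_mhz:.0f} MHz")
```

The first figure reproduces the 4+4 GB/s quoted for the host link; the second gives 200 MHz, the same frequency quoted for the Nios II µcontroller.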

Software Stack
The NaNet-1 software stack runs partly on the x86 host and partly on the Nios II FPGA-embedded µcontroller. On the host side, a GNU/Linux kernel driver controls the device and an application level library provides an API to: open/close the NaNet-1 device; inject commands to register and de-register circular lists of persistent receiving buffers (CLOPs) in GPU and/or host memory, necessary to allocate, pin and return the virtual address of these buffers to the application; manage events generated by the device when receiving packets on the registered buffers, in order to promptly invoke the GPU kernel that processes the data just received. On the µcontroller, a single-process C program configures the device, computes the destination virtual address inside the CLOP for the payload of incoming packets and performs the virtual-to-physical memory address translation necessary to initiate the PCIe DMA transaction towards the destination buffer.
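The two µcontroller tasks described above, destination address computation inside a CLOP and virtual-to-physical translation, can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the NaNet firmware (which is a C program on the Nios II): the class `Clop`, the function `virt_to_phys`, the 4096-byte page size, the dictionary used as page table and the buffer-advance policy are all hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size for the illustrative page table


class Clop:
    """Circular list of persistent receive buffers (illustrative model)."""

    def __init__(self, base_vaddr, buf_size, n_bufs):
        self.base_vaddr = base_vaddr
        self.buf_size = buf_size
        self.n_bufs = n_bufs
        self.cur_buf = 0  # index of the buffer currently being filled
        self.offset = 0   # bytes already written into that buffer

    def next_dest(self, payload_len):
        """Return (destination virtual address, buffer_full flag)."""
        if payload_len > self.buf_size:
            raise ValueError("payload larger than a receive buffer")
        if self.offset + payload_len > self.buf_size:
            # Payload does not fit: move on to the next buffer, wrapping around.
            self.cur_buf = (self.cur_buf + 1) % self.n_bufs
            self.offset = 0
        vaddr = self.base_vaddr + self.cur_buf * self.buf_size + self.offset
        self.offset += payload_len
        full = self.offset == self.buf_size
        if full:
            # A full buffer is where an event would be raised to the host.
            self.cur_buf = (self.cur_buf + 1) % self.n_bufs
            self.offset = 0
        return vaddr, full


def virt_to_phys(vaddr, page_table):
    """Translate using a {virtual page number: physical page base} mapping."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    return page_table[page] + offset
```

With two 2048-byte buffers and 1024-byte payloads, every second packet completes a buffer and the fifth packet lands again at the base address of the list.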

NaNet-1 enhancements and roadmap to NaNet-10
As described in section 2.2, the NaNet GPU memory addressing is managed by the Nios II firmware. Implementing new features with a µcontroller is a fast and efficient strategy during the debugging phase, but the Nios II introduces a considerable latency in performing the basic RDMA tasks: buffer search and translation of virtual addresses to physical ones. Moreover, it is responsible for jitter effects on the hardware latency path [9]. Thus, two major improvements are currently under development for NaNet-1: a Translation Lookaside Buffer (TLB), an associative cache where a limited number of entries can be stored in order to perform memory management tasks, taking only ∼ 200 ns, and a hardware module for virtual address generation for GPU memory management.
The expected demand for increased data rates and considerations of future-proofing for the NaNet IP pushed the design of a board supporting the more advanced 10-GbE industrial standard: NaNet-10. Since the Altera Stratix IV development board is not natively equipped with a 10-GbE interface, an additional board from Terasic (Dual XAUI To SFP+ HSMC) is employed; it mounts a Broadcom BCM8727 dual-channel 10-GbE SFI-to-XAUI transceiver and provides 2 full duplex 10-GbE channels with a XAUI backend interface. This mezzanine card is plugged into the HSMC connector of the Altera board. At the moment, this makes the 10-GbE interface mutually exclusive with the custom mezzanine providing the APElink channels.
The final configuration foresees migrating the design to a Stratix V FPGA, to exploit enhanced Altera transceivers with switching capabilities up to 12.5 Gbps and a Gen3-compliant PCIe bus able to sustain 8 + 8 GB/s.

NaNet 3 four way deterministic latency 1 Gbps optical link NIC
To be complete, an overview of the NaNet board family must mention the ongoing development of the NaNet 3 board for the KM3 HEP experiment [10]. In KM3 the board is tasked with delivering global clock and synchronization signals to the underwater electronic system and receiving photomultiplier data via optical cables. The design employs Altera Deterministic Latency Transceivers with an 8B10B encoding scheme as Physical Link Coding and a Time Division MultiPlexing (TDMP) data transmission protocol. The current implementation is being developed on the Altera Stratix V development board with a Terasic SFP-HSMC daughtercard plugged on top, providing 4 transceiver-based SFP ports (see figure 4).
Measurements were conducted using one of the host GbE ports to send UDP packets according to the NA62 RICH RO data protocol to the NaNet-1 GbE interface: using the x86 Time Stamp Counter (TSC) register as a common time reference, it was possible in a single-process test application to measure latency as the time difference between the moment a received buffer is signalled to the application and the moment just before the first UDP packet of a bunch (needed to fill the receive buffer) is sent through the host GbE port. Similarly, we connected 2 of the 3 available APElink ports in a loopback configuration and performed the same measurement. Note that in the measurement setup described above ("system loopback"), the latency of the send process is also taken into account.
Benchmark results for GbE link bandwidth, varying the size of the GPU memory receiving buffers, are shown in figure 5; the bandwidth remains practically constant, at the maximum value for the link, in the region of interest for the reference application. Figure 6 shows latencies for transfers of varying buffer size into GPU memory using the GbE link. Besides the smooth behaviour with increasing receive buffer sizes, fluctuations are minimal, matching the constraints of both real-time operation and, compatibly with link bandwidth, low latency on data transfers; for a more detailed performance analysis, see [9].

The NA62 RICH Detector GPU-Based low level Trigger Case Study
The NA62 experiment at CERN [11] aims at measuring the Branching Ratio (BR) of the ultra-rare decay of the charged Kaon into a pion and a neutrino-antineutrino (νν̄) pair. Due to the very high precision of the theoretical prediction of this BR, a precise measurement at the level of 100 events would be a stringent test of the Standard Model, since this BR is also highly sensitive to any new physics particle.
The ∼ 10 MHz rate of particles reaching the detectors must be reduced by a set of trigger levels down to a ∼ kHz rate, manageable for data recording. The first level (L0) is implemented in hardware (FPGAs) on the RO boards and performs rough cuts on their output, reducing the data stream rate by a factor of ∼ 10 to cope with the design event readout rate of at most 1 MHz. Events passing L0 are transferred for further reconstruction and event building to the upper trigger levels (L1 and L2), implemented in software on a farm of commodity PCs. In the standard implementation, FPGAs on the L0 trigger RO boards compute simple trigger primitives on-the-fly, which are time-stamped and sent to a central processor for matching and trigger decision. Thus, the maximum latency allowed for the synchronous L0 trigger is related to the maximum data storage time available on the data acquisition boards, up to 1 ms for NA62. The Ring Imaging Čerenkov detector (RICH) identifies pions and muons in the momentum range 15 GeV/c to 35 GeV/c, giving a µ suppression factor better than 10^-2 with a good time resolution.
As a first example of GPU application in the NA62 trigger we studied ring reconstruction in the RICH. The RICH L0 trigger processor is a low-latency synchronous level, so the possibility of using a GPU must be verified. In order to test feasibility and performance, as a starting point we have implemented 5 algorithms for single ring finding in a sparse matrix of 1000 points (centered on the PMs in the RICH spot) with 20 firing PMs ("hits") on average. Results of this study are available in [12] and show that GPU processing latency is stable and reproducible once data are available in the device internal memory.
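To give a concrete idea of what single ring finding involves, the sketch below fits a circle to a set of hit coordinates with a textbook algebraic least-squares method. It is chosen here purely for illustration and is not claimed to be one of the 5 algorithms studied in [12], which run as CUDA kernels; the function name `fit_ring` and the plain-Python formulation are ours.

```python
import math


def fit_ring(hits):
    """Algebraic least-squares circle fit; returns (x_center, y_center, radius).

    hits is a sequence of (x, y) coordinates of the firing PMs.
    """
    n = len(hits)
    if n < 3:
        raise ValueError("at least 3 hits are needed")
    mean_x = sum(x for x, _ in hits) / n
    mean_y = sum(y for _, y in hits) / n
    suu = svv = suv = suuu = svvv = suvv = svuu = 0.0
    for x, y in hits:
        u, v = x - mean_x, y - mean_y  # coordinates relative to the centroid
        suu += u * u
        svv += v * v
        suv += u * v
        suuu += u * u * u
        svvv += v * v * v
        suvv += u * v * v
        svuu += v * u * u
    det = suu * svv - suv * suv
    if det == 0.0:
        raise ValueError("hits are collinear")
    rhs_u = 0.5 * (suuu + suvv)
    rhs_v = 0.5 * (svvv + svuu)
    # Solve the 2x2 normal equations for the centre (Cramer's rule).
    uc = (rhs_u * svv - rhs_v * suv) / det
    vc = (rhs_v * suu - rhs_u * suv) / det
    radius = math.sqrt(uc * uc + vc * vc + (suu + svv) / n)
    return mean_x + uc, mean_y + vc, radius
```

The fit needs only a single pass over the hits to accumulate a handful of sums, which is the kind of regular, data-parallel workload that maps well onto a GPU when many events are processed at once.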
In order to fully characterize latency and throughput of the GPU-based RICH L0 trigger processor (GRL0TP), we took into account, besides GPU-assisted ring reconstruction, the data transfer needed to move primitives data from the RO boards to GPU internal memory through multiple (4 to 6) GbE links and the host PCIe bus. The NaNet-1 NIC was integrated in the GRL0TP prototype, using the "system loopback" setup described in section 3. The host simulates the RO board by sending UDP packets containing primitives data from the GbE port of the hosting system to the GbE port of the hosted NaNet-1, which in turn streams data directly towards a circular list of receive buffers in GPU memory that are sequentially consumed by the CUDA kernel implementing the ring reconstruction algorithm. Communication and kernel processing tasks were serialized in order to perform the measurement; results are shown in figure 6. This represents a worst-case situation: during normal operation, given the NaNet-1 RDMA capabilities, this serialization does not happen, and kernel processing seamlessly overlaps with data transfer. This is confirmed by the throughput measurements in figure 5. Combining the results, it is clear that the system remains within the 1 ms time budget with GPU receive buffer sizes in the 128 to 1024 events range while keeping a ∼ 1.7 MEvents/s throughput. Although the physical link and data protocol of the real system were used to show the real-time behaviour of NaNet-1, we measured on a reduced-bandwidth, single GbE port system that could not match the 10 MEvents/s experiment requirement for the GRL0TP.
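A rough feeling for how buffer size relates to the time budget can be obtained from the time needed just to accumulate a receive buffer at the quoted event rate. The sketch below is back-of-the-envelope arithmetic using only the ∼ 1.7 MEvents/s throughput and the 1 ms budget from the text; it does not reproduce the measured latencies of figure 6, and the function name is ours.

```python
THROUGHPUT_EVENTS_PER_S = 1.7e6  # throughput quoted in the text
TIME_BUDGET_US = 1000.0          # 1 ms L0 latency budget for NA62


def fill_time_us(n_events, event_rate_hz):
    """Time in microseconds to accumulate n_events at the given event rate."""
    return n_events / event_rate_hz * 1e6


for n_events in (128, 1024):
    t = fill_time_us(n_events, THROUGHPUT_EVENTS_PER_S)
    print(f"{n_events:5d} events: {t:6.1f} us to fill, "
          f"{TIME_BUDGET_US - t:6.1f} us left in the budget")
```

At this rate a 128-event buffer takes about 75 µs to fill and a 1024-event buffer about 602 µs, so in both cases the accumulation time alone stays below the 1 ms budget.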
To demonstrate the suitability of the NaNet-1 design for the full-fledged RICH L0TP, we decided to perform equivalent benchmarks using one of its APElink ports instead of the GbE one. Results for throughput and latency of the APElink-fed RICH L0TP are shown in figures 7 and 8: a single NaNet-1 APElink data channel between the RICH RO and GRL0TP systems roughly matches trigger throughput and latency requirements for receiving buffer sizes in the 4 to 5 Kevents range.

Conclusions and Future Work
In this paper we presented the NaNet board family, a modular design of a low-latency NIC dedicated to real-time GPU-based systems and supporting a number of different physical links.
A performance analysis of the NaNet-1 board has been provided, showing the real-time features of its GbE channel.
We demonstrated that using a single NaNet-1 APElink channel to feed the RICH L0 GPU-based trigger processor roughly fulfils the latency and throughput requirements of the system. While adding an APElink channel to the RO board is likely infeasible, as it would need a major redesign, this result encouragingly hints at the suitability of NaNet-10 as the NIC for the RICH L0 GPU-based trigger processor.

Figure 4. NaNet 3 testbed: board is connected to offshore RO system via optical cable.

Figure 8. Latency of NaNet-1 APElink data transfer and of ring reconstruction CUDA kernel processing.

Bandwidth and latency performances for the NaNet-1 APElink channel are in figures 7 and 8. The current implementation of APElink is able to sustain a data flow up to ∼ 20 Gbps. The APElink bandwidth plateau in figure 7 is due to the RX path implementation of NaNet-1. RDMA-related tasks weigh on the Nios II; for a ∼ 200 MHz clock, this adds ∼ 1.6 µs of latency to each packet.

Table 1. An overview of NaNet resource consumption.

[...] entries can be stored in order to perform memory management tasks, taking only ∼ 200 ns, and a hardware module for virtual address generation for GPU memory management.