Front-end RDMA over Converged Ethernet, real-time firmware simulation

Several physics experiments are moving towards new acquisition models. In this work we explore some ideas for implementing Remote Direct Memory Access (RDMA) directly on the front-end electronics, so that part of the computing farm's CPU resources can be freed. New simulation techniques are introduced to understand the RDMA over Converged Ethernet (RoCE) firmware block developed at ETH Zürich, including a real-time firmware simulation that leverages SystemVerilog features such as the Direct Programming Interface (DPI-C). The ability to explore a wider, dynamic set of inputs increases the likelihood of uncovering potential issues, identifying edge cases, and validating the system's performance across a broader range of scenarios.


Introduction
In a data acquisition (DAQ) system a large fraction of the CPU resources is engaged in networking rather than in data processing. Common network stacks usually move data through several copies, performing expensive operations. Thus, when the CPU is asked to handle networking, the main drawbacks are reduced throughput and increased latency, due to the overhead added to the data transmission process. Zero-copy networking can be achieved by adding an RDMA layer to the network stack and letting dedicated hardware take over the burden of stack handling. The main goal of the RDMA implementation is to move the adoption of smart networking protocols up to the data producer.
In the context of a physics experiment this means implementing the network protocol in the detector front-end electronics. The front-end electronics can then initiate the RDMA transfer to the computing farm, eliminating the point-to-point connection between front-end and back-end and leaving the freedom to dynamically switch the routing to the computing nodes according to their processing availability. An appropriate choice of the network protocol for RDMA brings a two-fold benefit: adopting commodity hardware reduces the reliance of the data acquisition (DAQ) system on custom hardware, and it exploits all the advantages of a mature technology. In this way, the DAQ system gains in scalability and ease of maintenance. RDMA over Converged Ethernet (RoCE) is the de facto industry-standard Ethernet-based RDMA solution with a multi-vendor ecosystem, making it the natural choice.
In this work the implementation and simulation of the main firmware blocks for the realisation of a RoCE endpoint have been explored. A real-time firmware simulation of the RoCE network stack has been developed, in which real network packets are exchanged between free-running SystemVerilog code and the host machine via a TUN/TAP device that emulates a connection with a physical device (FPGA). The second part is devoted to the verification of the modified RoCE stack using the tools developed so far, such as the novel simulation framework. No performance study is carried out here, since such numbers would reflect only the characteristics of the simulator and of the host machine, not those of the network implementation itself.

FEROCE idea
With the ever-growing demand for larger bandwidth in big data systems, many works point in the direction of implementing network stacks on custom hardware. In particular, FPGAs are the natural target for reducing time to market and keeping a low entry barrier. All this common effort has followed the paradigm of a datacenter, where the servers of a computing farm exchange data between nodes as efficiently as possible. The main goal of FEROCE is to move the adoption of smart networking protocols up to the data producer (figure 1(b)). It is therefore the front-end electronics that takes care of initiating the RDMA transfer towards the computing farm. In this way we eliminate the point-to-point connection between the front-end and the back-end (figure 1), leaving the freedom to dynamically switch the routing to the computing nodes according to their processing availability. The choice of the RDMA protocol is crucial to enable the use of commodity hardware, which brings mature and proven technologies while reducing the use of custom hardware. Examples are the bandwidth aggregation offered by Ethernet network protocols and traffic prioritisation such as Quality of Service. In this way, the DAQ system gains in scalability and ease of maintenance. RDMA over Converged Ethernet (RoCE) is the most commonly used RDMA technology for Ethernet networks and is deployed at scale in some of the largest datacenters in the world. RoCE is the only industry-standard Ethernet-based RDMA solution with a multi-vendor ecosystem delivering network adapters and, in its second version (RoCEv2), it operates over standard layer 2 and layer 3 Ethernet switches, since it includes IP and UDP headers in the packet encapsulation. The target RoCE implementation needs to be scalable, in the sense that it has to be configurable to tackle different scenarios: front-ends with lower or higher data rates will require 1 Gb/s or 10 Gb/s respectively, while a data aggregator board will require 100 Gb/s.
The use of FPGAs at the front-end level has gained attention in the last decade in many physics experiments. The well-known advantages of FPGAs over ASICs, such as re-configurability, a low entry barrier and a quick and easy design flow, have made small-size FPGAs appealing for the front-end electronics in the outer detector regions, where the radiation levels are acceptable for such devices. After the digitization of the detector data, front-end designers choose FPGAs for handling the data movement towards the back-end electronics. The market availability, at affordable prices, of small FPGAs provided with high-speed transceivers able to sustain an aggregated bandwidth of the order of 100 Gb/s makes these devices the perfect candidate for driving high-speed optical links. On the other hand, FPGAs are also largely employed in big data processing. They are very suitable for networking purposes because of their natural strength in data streaming [1][2][3][4]. FEROCE aims to leverage such networking capability to boost the data movement from the front-end.

Dynamical simulation
Setting up a suitable global simulation environment is crucial for the successful development of a network stack. Indeed, debugging after deployment in a network is almost impossible due to the layering of the protocol. The difficulties in firmware simulation arise partly from the heterogeneity of the languages used to build the stack, but mainly from the need to create a dynamic simulation. Typically, firmware simulations are purely static and testbench-driven. In a dynamic scheme, instead, the simulation model interacts with its environment by exchanging network packets with the host operating system running the simulator [5]. The two approaches are shown in figure 2. A static simulation injects a stimulus into the Device Under Test (DUT), in this case the network stack, and a checker compares the output with the reference response. A dynamic simulation, on the other hand, sends real data packets from software to the DUT through the simulation interface, and the outputs are checked directly at the software level. In this work a dynamic simulation is adopted, since the network stack must be tested with real Ethernet frames. To do so, a TAP device is created to exchange data between the RTL simulation and the Linux network stack. A MAC address and an IP address are assigned to both the TAP device and the RTL simulation. Two ring buffers, one per data direction, are implemented to mitigate possible packet drops. The top-level testbench file is written in SystemVerilog, where the Direct Programming Interface (DPI-C) is set up to communicate with the TAP device.
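The role of the two ring buffers can be illustrated with a minimal sketch. The Python class below is hypothetical: the name FrameRing, the capacity of 64 frames and the drop-on-full policy are assumptions made for illustration and are not taken from the testbench, which talks to the TAP device through DPI-C. It only shows how a fixed-size ring absorbs a burst of frames in one direction.

```python
class FrameRing:
    """Fixed-capacity FIFO of Ethernet frames (hypothetical helper)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._head = 0      # index of the oldest stored frame
        self._count = 0     # number of frames currently stored
        self.dropped = 0    # frames rejected because the ring was full

    def push(self, frame):
        """Store a frame; return False (and count a drop) if the ring is full."""
        if self._count == len(self._slots):
            self.dropped += 1
            return False
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = bytes(frame)
        self._count += 1
        return True

    def pop(self):
        """Return the oldest frame, or None if the ring is empty."""
        if self._count == 0:
            return None
        frame = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return frame

    def __len__(self):
        return self._count


# One ring per data direction, as in the text (the size is an assumption).
host_to_rtl = FrameRing(capacity=64)
rtl_to_host = FrameRing(capacity=64)
```

With this policy a frame is lost only when the consumer lags behind by more than the ring capacity, and the dropped counter makes such losses visible during a simulation run.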

Understanding the ETH network stack
The stack from the Systems Group at ETH Zürich [6] was selected for the first prototype. It features AXI-Stream connections between the network layers and supports the UDP/IP, TCP/IP and RoCEv2 protocols. The stack can achieve speeds of 10 or 100 Gb/s. Communication with the external network relies on the Xilinx 10-Gigabit Ethernet MAC (10 Gb/s) [7] or the Integrated 100G Ethernet Subsystem (100 Gb/s) [8] IPs. Although 100 Gb/s would not be implementable in a small-size rad-hard FPGA, it is a useful addition that leaves open the possibility of implementing the stack in other environments rather than locking it to 1 or 10 Gb/s. Such blocks can directly target the Gigabit transceivers or they can output an XGMII stream. The latter is used for communicating with the dynamic simulation (figure 3). To understand the network stack we first used the UDP protocol with a hard-coded firmware loop-back. A message was then sent from the Linux network stack to the TAP device, on to the RTL simulation, and back through the same path. Once the message had arrived, the waveform of the process was inspected. A similar procedure was set up for the TCP protocol, with the only difference that a connection must be opened at the beginning of the message exchange.
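On the host side, the UDP loop-back check can be sketched in a few lines. The function below is a hypothetical helper: its name and the timeout are assumptions, and it presumes that the firmware loop-back returns the datagram to the sender's address and port, which the text does not specify.

```python
import socket


def udp_loopback_ok(address, port, payload, timeout=5.0):
    """Send payload over UDP and report whether the same bytes come back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(payload, (address, port))
        try:
            echoed, _peer = sock.recvfrom(65535)
        except socket.timeout:
            # No reply within the timeout: treat it as a failed loop-back.
            return False
    return echoed == payload
```

With the simulation running, the call would take the IP address assigned to the RTL simulation and a UDP port served by the loop-back, both of which depend on the testbench configuration. A generous timeout is advisable, since the RTL simulation runs much slower than real hardware.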
The ETH RDMA network stack features an AXI-Stream port used to send data without going through the DMA engine. First, the RDMA parameters have to be set via the two connection context ports.¹ These ports carry the Queue Pair (QP) connection information: QP numbers, Packet Sequence Number (PSN), remote key, virtual address and IP address. Data can then be sent to the receiver by giving the target IP address, the message length in bytes and the operation type (write or read). In the resulting exchange, the two end-points must first be synchronized. Then the transmitter sends the data alongside the address, the remote key and the QP number. The receiver sends an acknowledgement after each packet that is received correctly.
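The connection information listed above can be pictured as a small record. The sketch below is illustrative only: the class and field names are hypothetical and do not reflect the port layout of the ETH stack. The field widths are those of the InfiniBand transport headers (24-bit QP numbers and PSN, 32-bit remote key, 64-bit virtual address), and an IPv4 address is assumed.

```python
from dataclasses import dataclass
import ipaddress


@dataclass(frozen=True)
class QpConnection:
    """Connection information for one Queue Pair (hypothetical container)."""
    local_qpn: int      # local QP number, 24 bits
    remote_qpn: int     # remote QP number, 24 bits
    psn: int            # starting Packet Sequence Number, 24 bits
    rkey: int           # remote key, 32 bits
    vaddr: int          # remote virtual address, 64 bits
    remote_ip: str      # IPv4 address of the remote end-point (assumed IPv4)

    def __post_init__(self):
        limits = {"local_qpn": 24, "remote_qpn": 24, "psn": 24,
                  "rkey": 32, "vaddr": 64}
        for name, bits in limits.items():
            value = getattr(self, name)
            if not 0 <= value < (1 << bits):
                raise ValueError(f"{name}={value:#x} does not fit in {bits} bits")
        ipaddress.IPv4Address(self.remote_ip)  # raises ValueError if malformed


# remote_qpn, psn, rkey and remote_ip are the values quoted in the Results
# section; local_qpn and vaddr are placeholders chosen for illustration.
ctx = QpConnection(local_qpn=0x1, remote_qpn=0x11, psn=10337885,
                   rkey=0x22d, vaddr=0x1000, remote_ip="22.1.212.210")
```

Validating the widths on the host before loading the values into the firmware catches truncation errors that would otherwise only show up as rejected packets.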
An InfiniBand architecture RDMA WRITE operation is made up of one or multiple packets, depending on the size of the message to be sent [9]. If the message is not larger than the Maximum Transmission Unit (MTU), it is sent as a single packet with the RDMA WRITE ONLY op-code. If instead it is larger than the MTU, the message is split into multiple packets, with the op-codes RDMA WRITE FIRST, MIDDLE and LAST. Since the main goal of our project is to push data from the front-end to the receiver, only RDMA WRITE is considered in the following studies. The RDMA WRITE operation is described in figure 4. First, an out-of-band handshake must take place to synchronize the two end-points: here the QP numbers, the PSN, the remote key and the address are sent to the FPGA firmware, ready to be loaded through the connection info ports. Once the connection is established, the data transmission can be initiated by defining the receiver IP address and the length of the DMA transfer. At the end of each packet the receiver sends an acknowledgement (ACK) if the received PSN matches the expected one; otherwise a not-acknowledge (NACK) is sent and that packet must be re-transmitted. Once the message has been sent, another out-of-band signal notifies the receiver that the data stored in its memory is ready to be read.
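The segmentation rule and the PSN check can be sketched as follows. This is a simplified model and not the firmware logic: the function names are hypothetical, the op-code values are those of the InfiniBand Reliable Connection transport, the PSN is treated as a 24-bit counter that wraps around, and the receiver answers every packet as described in the text.

```python
# InfiniBand Reliable Connection op-codes for RDMA WRITE.
RDMA_WRITE_FIRST = 0x06
RDMA_WRITE_MIDDLE = 0x07
RDMA_WRITE_LAST = 0x08
RDMA_WRITE_ONLY = 0x0A
PSN_MODULUS = 1 << 24   # the PSN is a 24-bit counter


def segment_write(message, mtu, first_psn):
    """Split a message into (opcode, psn, payload) tuples for RDMA WRITE."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    if len(message) <= mtu:
        return [(RDMA_WRITE_ONLY, first_psn % PSN_MODULUS, message)]
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)]
    packets = []
    for index, chunk in enumerate(chunks):
        if index == 0:
            opcode = RDMA_WRITE_FIRST
        elif index == len(chunks) - 1:
            opcode = RDMA_WRITE_LAST
        else:
            opcode = RDMA_WRITE_MIDDLE
        packets.append((opcode, (first_psn + index) % PSN_MODULUS, chunk))
    return packets


def receive(packets, expected_psn):
    """Return 'ACK' or 'NACK' for each packet, based on the PSN alone."""
    responses = []
    for _opcode, psn, _payload in packets:
        if psn == expected_psn:
            responses.append("ACK")
            expected_psn = (expected_psn + 1) % PSN_MODULUS
        else:
            responses.append("NACK")
    return responses
```

For instance, a 2500-byte message with a 1024-byte MTU yields three packets (FIRST, MIDDLE, LAST) carrying 1024, 1024 and 452 bytes, with consecutive PSNs.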
¹ QP context info and QP connection info.

JINST 19 C03038

Results
As mentioned in section 4, to validate the novel simulation a hard-coded loop-back is established in the UDP layer: the messages sent and received must be identical. Secondly, to test the RDMA stack, a simple increasing counter is placed in the firmware and its value is packaged by the network stack into one or multiple RDMA frames; these frames are sent to the TAP device, and finally the message is read via Soft-RoCE [10] and analyzed with Wireshark. First, the two end-points must be synchronized. To do so, the parameters are generated by the receiver's Soft-RoCE application and then passed to the RTL simulation. As soon as the handshake is completed, the message is sent. The Wireshark capture of this exchange is shown in figure 5. All the relevant parameters can be found in the picture: the QP number (0x11), the PSN (10337885), the remote key (0x22d) and the IP addresses (Tx: 22.1.212.209, Rx: 22.1.212.210). The test is executed on a single machine, which allows the communication between the RDMA simulation and the Soft-RoCE application. This test makes it possible to verify the integrity of the RoCE protocol for RDMA WRITE operations through Wireshark, while the data stored in the buffer is checked against the expected one.²
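The comparison of the received buffer with the expected counter pattern can be sketched as below. The word width and byte order of the firmware counter are not stated in the text, so a 32-bit little-endian counter is assumed here, and both function names are hypothetical.

```python
import struct


def counter_payload(n_words, start=0):
    """Build the reference pattern: consecutive 32-bit little-endian values."""
    return b"".join(struct.pack("<I", (start + i) & 0xFFFFFFFF)
                    for i in range(n_words))


def first_mismatch(buffer, start=0):
    """Return the index of the first wrong word, or None if the buffer matches."""
    if len(buffer) % 4:
        raise ValueError("buffer length is not a multiple of 4 bytes")
    for i, (word,) in enumerate(struct.iter_unpack("<I", buffer)):
        if word != (start + i) & 0xFFFFFFFF:
            return i
    return None
```

Reporting the index of the first mismatching word, rather than a plain pass or fail, helps to relate a corrupted buffer to a specific RDMA frame in the Wireshark capture.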

Summary
In the framework of the FEROCE project, we proposed the implementation of a dynamic simulation of a multi-protocol network stack that includes RDMA capabilities. This is a first step towards the development of a lightweight RoCE firmware block that can be deployed on FPGAs with a small resource pool. Possible target devices could be, for instance, rad-hard FPGAs used in front-end detector boards. The next step would be to implement the very same firmware on a Xilinx evaluation board and send the data to a ConnectX NIC, measuring the effective throughput (10 or 100 Gb/s) and the latency. The dynamic simulation allows a safer pruning of the unneeded functionalities while remaining compliant with the network protocol. For example, we can envisage a front-end endpoint that implements only a one-sided communication primitive (RDMA WRITE), given that it acts only as source of data and initiator of the transfer.
² The data is just an increasing counter.

Figure 1. Point-to-point connection between front-end and back-end (left): data are read and packaged into network frames by the computing farm. Network protocol implemented directly in the front-end (right), allowing the use of commercial switches right after the front-end electronics.

Figure 2. Static (left) vs. dynamic (right) simulation. A static stimulus can cover a limited test-case space, while dynamic inputs can reach a wider space.

Figure 3. Diagram of the communication between the RTL simulation with Synopsys VCS and Soft-RoCE.A TAP device is used to access the RTL simulation and the Linux network stack.

Figure 4. RC RDMA WRITE operation. First, the two end-points must be synchronized. Then, the transmitter sends the data alongside the address, the remote key and the QP number. The receiver sends an acknowledgement after each packet that is received correctly.