100 Gbps PCI-Express readout for the LHCb upgrade

We present a new data acquisition system under development for the next upgrade of the LHCb experiment at CERN. We focus in particular on the design of a new generation of readout boards, the PCIe40, and on the viability of PCI-Express as an interconnect technology for high speed readout. We show throughput measurements across the PCI-Express bus, on Altera Stratix 5 devices, using a DMA mechanism and different synchronization schemes between the FPGA and the readout unit. Finally we discuss hardware and software design considerations necessary to achieve a data throughput of 100 Gbps in the final readout board.


Introduction
The LHCb experiment will be upgraded during the Long Shutdown 2 (2018-2019) of the Large Hadron Collider in order to reach unprecedented precision in the b- and c-quark flavour sectors [1].
One of the main objectives of the LHCb upgrade is to create a trigger-less readout system, working at the full LHC event rate of 40 MHz, backed by a purely software trigger. In the current experiment, large readout inefficiencies result from the 1.1 MHz trigger rate imposed by the Level-0 hardware trigger. These limitations will become even more relevant at the increased instantaneous luminosity which will be achieved after Long Shutdown 2.
Removing this bottleneck requires the implementation of a readout system able to cope with a sustained bandwidth of around 4 TBytes/s, and therefore the redesign of all aspects of the present readout architecture. This requirement translates, in particular, into a careful choice of the fastest and most cost-effective technologies at all levels of the readout chain.
PCI-Express Gen3 technology was selected as the principal communication protocol between the readout boards (which receive physics data directly from all subdetector frontends) and the event builder units (which assemble event fragments from different parts of the experiment into complete events).

Readout architecture
Figure 1 shows a simplified schema of the upgraded readout system architecture.
All detectors perform zero-suppression on their respective frontends. Radiation-hardened, simplex optical links connect all frontend channels to a layer of FPGA-based readout boards. Because of power and space constraints in the underground cavern, the most cost-effective configuration requires all DAQ hardware to be concentrated on the surface. The optical links therefore have to be designed to operate reliably over these relatively long distances (350 m).

JINST 10 C04018
The readout boards for all subdetectors share the same hardware design, henceforth called the PCIe40. Subdetector-specific behavior is implemented by reconfiguring the on-board FPGA with dedicated firmware.
Timing and Fast Control (TFC) commands are propagated to the entire system by a second hierarchy of PCIe40 boards. This second flavour of the PCIe40 is characterized by the use of duplex optical links to configure and monitor the state of all frontends. All configuration and monitoring aspects fall under the responsibility of the LHCb Experiment Control System (ECS).
Flow control is implemented through a throttle signal. In the absence of a hardware trigger, throttling can be used to relieve data backpressure if necessary. This signal is generated in the DAQ layer and propagated by the TFC.
The number of PCIe40 boards necessary to read out the entire LHCb experiment has been estimated at around 500 units in the DAQ layer and fewer than a hundred in the TFC layer.
Every PCIe40 board in the DAQ layer is associated with a dedicated readout unit. The two are connected by a point-to-point PCI-Express Gen3 x16 link. A readout unit consists of a powerful server running a distributed event reconstruction algorithm. In order to exchange partial event fragments, all readout units are interconnected by the so-called event builder network.
Lastly, the event builder network is also connected to the filter farm, where the full-software HLT (High Level Trigger) selects candidate physics events for long-term storage.

PCIe40 architecture
Figure 2 shows a high-level diagram of the future board. The design can present up to 48 bidirectional optical links communicating with the frontend electronics and one communicating with the TFC system. For the boards used in the DAQ layer, 24 of these links will be populated with optical receivers. Low-occupancy subdetectors can elect to install more receivers, provided that the readout unit bandwidth constraints are respected.
A high-density Arria 10 FPGA provides this board with powerful reconfigurable logic capabilities. The FPGA transceivers are connected to the frontend optical links on one side and to a PCIe switch on the other. The PCIe switch, by PLX Technologies, allows two 8-lane PCIe Gen3 interfaces from the FPGA to appear to the event builder as a single 16-lane PCIe Gen3 link.
In order to read out 24 frontend optical links, running at up to 4.5 Gbit/s, the downstream PCI-Express bandwidth has to be as close as possible to a nominal figure of 110 Gbit/s.
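The bandwidth budget above can be reproduced with a few lines of arithmetic. The following is a minimal sketch: the link count and link rate come from the text, whereas the PCIe Gen3 line parameters (8 GT/s per lane, 128b/130b encoding) are assumptions taken from the published standard, and packet-level overhead is ignored.

```python
# Back-of-the-envelope check of the PCIe40 bandwidth budget.
# Link count and rate are taken from the text; the PCIe Gen3 line
# parameters below are assumptions based on the published standard.

FRONTEND_LINKS = 24          # populated optical receivers per DAQ board
LINK_RATE_GBPS = 4.5         # maximum rate per optical link, Gbit/s

GEN3_GT_PER_LANE = 8.0       # assumed: raw signalling rate per lane
GEN3_ENCODING = 128 / 130    # assumed: 128b/130b line encoding
LANES = 16                   # x16 link seen by the event builder


def input_bandwidth_gbps(links: int = FRONTEND_LINKS,
                         rate: float = LINK_RATE_GBPS) -> float:
    """Aggregate input bandwidth of the optical links, in Gbit/s."""
    return links * rate


def pcie_gen3_raw_gbps(lanes: int = LANES) -> float:
    """Link bandwidth after line encoding, before packet overhead."""
    return lanes * GEN3_GT_PER_LANE * GEN3_ENCODING


if __name__ == "__main__":
    print(f"optical input : {input_bandwidth_gbps():.1f} Gbit/s")
    print(f"PCIe Gen3 x16 : {pcie_gen3_raw_gbps():.1f} Gbit/s (upper bound)")
```

Under these assumptions, the optical input amounts to 108 Gbit/s, close to the nominal 110 Gbit/s figure, while the encoded x16 link rate is an upper bound from which packet headers and flow-control traffic still have to be subtracted.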

PCIe40 firmware architecture
Since all subdetectors will rely on a common readout board, a generic firmware architecture is currently under development to provide a common readout platform for all subdetectors.
This generic framework implements all of the low-level management functions required by the PCIe40: on the frontend side this includes decoding, synchronization and aggregation of all the optical links; on the backend side this requires providing a fast communication component to transmit event fragments across the PCI-Express bus from the PCIe40 into the main memory of the readout unit.
Efficient data transmission across the PCI-Express link at the rates required by the LHCb upgrade can be achieved with the implementation of a protocol based on DMA (Direct Memory Access). With DMA, data is copied from FPGA buffers into host memory without preempting the event builder CPU. This communication mechanism is inherently asynchronous and careful synchronization facilities between the event builder CPU and the FPGA are required to guarantee data integrity. In order to satisfy all these requirements, a fast and efficient DMA controller was implemented. This DMA controller exposes to the firmware a simplified, streaming interface and abstracts away all of the cross-link communication and synchronization complexity.

PCIe DMA controller architecture
Each PCI-Express interface present on the PCIe40 board will be bound to a dedicated instance of the DMA controller. A simplified diagram of the DMA controller architecture and the associated low-level software interfaces is shown in figure 3.
The DMA controller receives event fragments as a continuous data stream. In the final PCIe40 firmware, event fragments are prepared by aggregating the data from all input optical channels corresponding to the same LHC bunch crossing (identified in the TFC system by a numerical event identifier). Before streaming, subdetector-specific logic can align and reorder the data if necessary.
The DMA controller temporarily stores this data in intermediate internal buffers on the FPGA. Buffering is required in order to optimally exploit the PCI-Express transport. Since internal buffering consumes on-chip memory resources that would otherwise be available for event fragment reordering and preliminary online data processing, the size of the buffers has to be minimized.
Several buffering configurations were studied in order to achieve the best possible performance with the smallest amount of on-chip storage. Ultimately, increasing the number of buffers and the individual buffer size does not significantly increase the final performance, except in the case of very small buffers (up to 2 KiB).
The current implementation uses 16 buffers of 4 KiB, for a total of 64 KiB of memory. Such modest storage requirements are possible thanks to an extremely efficient buffer management policy implemented by the DMA controller. As soon as the completion of a DMA transaction is confirmed by the PCIe root complex of the event builder, the DMA controller can immediately recycle the corresponding buffers and accept new data.
This implementation has the cost-saving benefit of being able to efficiently buffer and transmit a data stream running at over 55 Gbit/s without having to add external memory to the readout board. Buffering is completely offloaded onto the readout server where large memory capacities are available at a fraction of the integration cost of the same memory on a custom readout board.
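The recycling policy described above can be illustrated with a small software model. This is only a sketch in Python, not the firmware itself: the class and method names are hypothetical, and only the buffer count and size are taken from the text.

```python
from collections import deque

BUFFER_COUNT = 16        # from the text
BUFFER_SIZE = 4 * 1024   # 4 KiB, from the text


class BufferPool:
    """Toy model of on-chip buffers recycled on DMA completion.

    The names used here are hypothetical; the real logic lives in firmware.
    """

    def __init__(self, count=BUFFER_COUNT, size=BUFFER_SIZE):
        self.size = size
        self.free = deque(range(count))   # buffers ready to accept data
        self.busy = set()                 # buffers filled or awaiting completion

    def acquire(self):
        """Return a free buffer index, or None if the stream must stall."""
        if not self.free:
            return None
        index = self.free.popleft()
        self.busy.add(index)
        return index

    def complete(self, index):
        """Recycle a buffer once its DMA write has been confirmed."""
        self.busy.remove(index)
        self.free.append(index)

    @property
    def total_bytes(self):
        """Total on-chip storage managed by the pool."""
        return (len(self.free) + len(self.busy)) * self.size
```

In this model the input stream only stalls when all 16 buffers are busy, so the time taken by the host to confirm each write determines how many buffers are needed to keep the link saturated.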
On the FPGA, communication between the DMA controller and the internal DMA engine happens through a descriptor mechanism. The DMA controller calculates the size and the source and destination addresses for each DMA operation (the source being a pointer in the internal memory-mapped fabric of the FPGA, the destination being an address in the PCIe bus space). Once sufficient data has been buffered, the controller generates write descriptors for the engine. In parallel, the DMA engine notifies the controller of the execution status of each DMA operation. The controller uses this information to update its internal representation of buffer space availability, both on the board and on the PC.
Careful synchronization mechanisms within the DMA controller ensure that data can be simultaneously written from the event stream into the internal buffers and from the internal buffers onto the PCI-Express bus at the highest possible rate, reaching close to 100% buffer utilization.
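The descriptor mechanism can be sketched as follows. The three descriptor fields follow the description in the text; the field names, the function name and the `max_transfer` limit are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteDescriptor:
    """One DMA write operation (field names are hypothetical)."""
    source: int        # address in the FPGA's internal memory-mapped fabric
    destination: int   # address in the PCIe bus space
    size: int          # number of bytes to transfer


def make_descriptors(buffer_base, buffered_bytes, host_address,
                     max_transfer=4096):
    """Split buffered data into descriptors of at most max_transfer bytes.

    max_transfer is an assumed limit, set here to the 4 KiB buffer size.
    """
    descriptors = []
    offset = 0
    while offset < buffered_bytes:
        size = min(max_transfer, buffered_bytes - offset)
        descriptors.append(
            WriteDescriptor(buffer_base + offset, host_address + offset, size))
        offset += size
    return descriptors
```

For example, `make_descriptors(0x1000, 10000, 0x80000000)` yields three descriptors covering the 10000 buffered bytes, the last one being shorter than the others.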

PCIe driver architecture
Another synchronization mechanism is required between the event builder application and the DMA controller. This serves two purposes:
• notify the application of the presence of new data;
• point the DMA controller to free PC memory to use for new events.
As on the FPGA, the memory storing event fragments on the event builder is managed as a circular buffer; the two implementations are, however, radically different. On the FPGA, cost and efficiency reasons demand the smallest possible buffer, but no such constraints hold on the event builder server. Cheap and plentiful memory can be exploited on the readout unit to implement very large buffers.
In the event builder, it is desirable to store many event fragments at once. Due to the distributed nature of the event building algorithm, event fragments have to be transmitted in parallel from each event builder unit to each of the other units (around 500 in total). Maximizing the number of events transmitted in each network operation exploits the event builder network bandwidth to the fullest extent.
From the point of view of the event builder application, this circular memory buffer is managed through a traditional read/write pointer pair. The write pointer is managed and updated by the FPGA after each DMA write and the read pointer is owned and updated by the event builder after each network transmission. The PCIe40 driver provides a high-level programming interface to manage these synchronization resources.
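A read/write pointer pair of this kind can be modelled in a few lines. The sketch below is a generic single-producer, single-consumer ring and not the PCIe40 driver interface: the class and method names are hypothetical, and keeping one byte unused to distinguish "full" from "empty" is a design choice of this example.

```python
class CircularBuffer:
    """Single-producer, single-consumer ring indexed by byte offsets."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.write = 0   # advanced by the producer (the FPGA in the text)
        self.read = 0    # advanced by the consumer (the event builder)

    def used(self):
        """Bytes written by the producer and not yet consumed."""
        return (self.write - self.read) % self.capacity

    def free(self):
        """Bytes the producer may still write."""
        # One byte is kept unused so that write == read always means "empty".
        return self.capacity - 1 - self.used()

    def produce(self, nbytes):
        """Advance the write pointer; refuse if data would be overwritten."""
        if nbytes > self.free():
            return False
        self.write = (self.write + nbytes) % self.capacity
        return True

    def consume(self, nbytes):
        """Advance the read pointer by at most the amount of data available."""
        nbytes = min(nbytes, self.used())
        self.read = (self.read + nbytes) % self.capacity
        return nbytes
```

Because each pointer has exactly one writer, neither side ever needs to modify state owned by the other; each only needs an up-to-date view of the opposite pointer.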
In addition to pointer synchronization, the most important duty of the PCIe40 driver is to manage the physical memory used to create the main circular buffer. Due to the nature of the Linux physical memory allocation algorithm, physically contiguous memory (suitable for DMA) can only be allocated with a granularity which is much smaller (4 MiB at the time of this writing) than the desired size of the entire circular buffer (several GiB).
In order to overcome this limitation, the DMA controller on the FPGA implements a simple memory management unit. This mechanism allows the circular buffer to be described to the PCIe40 as a list of scattered memory blocks. The linear association between positions in the circular buffer and physical addresses in this discontinuous physical address space is maintained inside a memory map stored in the FPGA. Every entry in this table holds the size and physical address on the PCIe bus of a memory block allocated by the PCIe40 driver upon initialization.
In the current implementation, this allows each DMA controller to access up to 4 GiB of memory. On a PCIe40 with two instances of this component, this results in up to 8 GiB of buffer capacity for physics events.
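The address translation performed by such a memory map can be sketched in software. The example below is an illustration rather than the firmware logic: the class and method names are hypothetical, the lookup uses a binary search for convenience, and any physical addresses passed to it are made up.

```python
from bisect import bisect_right
from itertools import accumulate


class MemoryMap:
    """Translate circular-buffer offsets into scattered physical addresses."""

    def __init__(self, blocks):
        # blocks: list of (physical_address, size) pairs, one per table entry
        self.blocks = list(blocks)
        if not self.blocks:
            raise ValueError("at least one memory block is required")
        sizes = [size for _, size in self.blocks]
        # starts[i] is the linear offset at which block i begins
        self.starts = [0] + list(accumulate(sizes))[:-1]
        self.total = sum(sizes)

    def translate(self, offset):
        """Return (physical address, contiguous bytes left in that block)."""
        offset %= self.total                      # the buffer is circular
        index = bisect_right(self.starts, offset) - 1
        address, size = self.blocks[index]
        within = offset - self.starts[index]
        return address + within, size - within
```

With the figures quoted in the text, a 4 GiB buffer built from 4 MiB blocks corresponds to 1024 table entries per DMA controller. The second value returned by `translate` indicates how many bytes can be written before a transfer would have to be split at a block boundary.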

Performance measurement
Figure 4 shows DMA throughput measurements obtained using our DMA controller implementation on a commercial Altera Stratix 5 board connected over a single PCIe Gen3 x8 interface to an Intel Sandy Bridge CPU. For a long-running data acquisition platform it is important to guarantee that memory throughput from the DMA controller be consistently above the specified requirements over sustained periods of uninterrupted data taking. The histogram sums up the downstream write performance observed from the DMA engine over a 12-hour period of steady operation.
Performance for a single instance of the DMA controller bound to a PCIe x8 Gen3 interface is shown to be consistently above 54 Gbit/s (55.49 Gbit/s on average) and therefore compatible with the LHCb upgrade requirements.
Other separate studies have also observed that the insertion of the PLX PCIe bridge between the two PCIe interfaces to be used on the PCIe40 board does not negatively affect the performance of either interface.
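The per-interface figures can be related to the board-level requirement with simple arithmetic. The sketch below assumes that the two x8 interfaces scale linearly when used together, which is consistent with the observation above but is not demonstrated by this calculation; the function names are hypothetical.

```python
MEASURED_X8_GBPS = 55.49    # average, single x8 interface (from the text)
MINIMUM_X8_GBPS = 54.0      # lower bound observed over 12 hours (from the text)
INTERFACES = 2              # x8 interfaces per PCIe40 (from the text)
REQUIRED_GBPS = 24 * 4.5    # 24 links at up to 4.5 Gbit/s (from the text)


def board_throughput_gbps(per_interface, interfaces=INTERFACES):
    """Assumes the interfaces scale linearly and do not disturb each other."""
    return per_interface * interfaces


def headroom(per_interface):
    """Fractional margin of the board throughput over the optical input."""
    return board_throughput_gbps(per_interface) / REQUIRED_GBPS - 1.0


if __name__ == "__main__":
    for label, rate in (("average", MEASURED_X8_GBPS),
                        ("minimum", MINIMUM_X8_GBPS)):
        total = board_throughput_gbps(rate)
        print(f"{label}: {total:.2f} Gbit/s, headroom {headroom(rate):+.1%}")
```

Under this assumption the average rate corresponds to about 111 Gbit/s per board, while the 54 Gbit/s lower bound corresponds to 108 Gbit/s, which equals the maximum aggregate rate of 24 optical links.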
Lastly, one of the two PCIe links on the board will also be used to configure and monitor the PCIe40 (through dedicated ECS software running on the same event builder server). For this reason it was necessary to evaluate the possible impact, if any, of this additional source of bidirectional PCIe traffic on the DMA performance. A control system emulator was implemented, running on the host CPU and generating both read and write operations to internal FPGA registers. Figure 5 shows different throughput measures obtained by varying the ratio between configuration (write) and monitoring (read) operations in the synthetic benchmark. Even under these circumstances, DMA bandwidth can be consistently measured in excess of 54 Gbit/s.

Conclusions
To conclude, PCI-Express Gen3 technology has been proven to be compatible with the very high performance targets demanded by the future upgrade of the LHCb experiment. This design allows the implementation of a data acquisition system that is scalable, cost-effective and able to pervasively leverage commercial off-the-shelf solutions. This strategy allows the development of custom electronics to be kept to a minimum. The PCIe40 will be a common platform for all data acquisition and experiment control tasks within LHCb. Development activities continue on both the software and firmware in order to facilitate integration of the PCI-Express transport with the future event building software, also under active development.

Figure 1. Architecture of the upgraded LHCb readout system.

Figure 3. DMA firmware and software architecture.

Figure 5. DMA throughput with concurrent monitoring and configuration traffic.