Common read-out receiver card for ALICE Run2

Abstract: ALICE at the CERN LHC uses custom FPGA-based computer plug-in cards as the interface between the optical detector read-out links and the PC clusters of Data Acquisition (DAQ) and High-Level Trigger (HLT). The cards used by DAQ and HLT during Run1 were developed as independent projects and now face similar problems: obsolete host interfaces and limited link speeds and processing capabilities. A new common card has been developed to enable the upgrade of the read-out chain towards higher link rates while providing backward compatibility with the current architecture. First prototypes have been tested successfully and have raised interest from other collaborations.


ALICE online architecture during Run1
ALICE is the heavy-ion experiment at the CERN LHC dedicated to the study of the physics of strongly interacting matter. It has been designed to cope with the high particle densities reached in central Pb-Pb collisions. The data captured from all 18 subdetectors is read out by the Data Acquisition (DAQ) system via Front-End Read-Out electronics and around 500 serial optical links called Detector Data Links (DDL) [1].
The data sent via DDL from the cavern to the counting rooms is received on custom FPGA-based DAQ Read-Out Receiver Cards (D-RORC). These boards are installed in server PCs acting as Local Data Concentrators (LDC). An exact copy of the incoming data is forwarded within the D-RORC FPGA to another DDL towards the High-Level Trigger (HLT). A simplified overview of the read-out architecture is shown in figure 1.
The HLT is the first system in ALICE where data from all detectors is combined and reconstructed online. This system has continuously evolved during Run1 and finally consisted of 224 nodes, divided into Front-End Processing (FEP) nodes, Compute Nodes (CN), output nodes and several infrastructure machines. Each FEP node is equipped with two custom FPGA-based HLT Read-Out Receiver Cards (H-RORC) to receive the detector data via DDL and perform first reconstruction steps. In addition to software-based data processing on the FEP nodes, the computing power of the HLT could be significantly enhanced by implementing the cluster finding for Time Projection Chamber (TPC) data in the H-RORC firmware [2]. Most of the Compute Nodes are equipped with Graphics Processing Units (GPUs) to perform online track reconstruction of the TPC data [3]. Output nodes provide the processed data back to Data Acquisition via H-RORCs and DDL. All HLT nodes are interconnected with an InfiniBand network.
From the DAQ point of view, the HLT appears as just another detector. Event fragments are assembled into sub-events in the LDC nodes and sent over the Event Building Network for further processing and finally long-term storage.
The Read-Out Receiver Cards on the DAQ and HLT side have similar requirements; however, they have been developed as independent projects. The H-RORC contains a Xilinx Virtex-4 FPGA and connects to the DDL via pluggable add-on boards hosting the optical links. The interface to the host machine is implemented with PCI-X at 133 MHz. The D-RORCs have been used in two different revisions: one with a PCI-X and one with a PCIe interface to the host machine. These boards use Altera APEX or Stratix II FPGAs and have two optical interfaces on the same board.

Requirements for Run2
Already during the heavy-ion runs of Run1 the LHC exceeded its design luminosity by a factor of two. After Long Shutdown 1, luminosities are expected to be in the range of 1–4 × 10²⁷ cm⁻² s⁻¹ with a center-of-mass energy of 5.1 TeV for Pb-Pb collisions. The read-out system used during Run1 will not be able to handle these data rates. One limiting factor in detector read-out is the bandwidth of the read-out controllers of the Time Projection Chamber (TPC) and the Transition Radiation Detector (TRD). The TPC, for example, will upgrade its Readout Control Unit (RCU) to double the number of branches per RCU and move to higher link rates by implementing the second generation of the Detector Data Link (DDL2) [4].
The increasing data rates and read-out changes also affect the systems of DAQ and HLT and in particular the Read-Out Receiver Cards. Both types of RORCs used during Run1 are limited in their optical read-out capabilities to around 2 to 3 Gbps as required for DDL1. Apart from that, the electrical interface to the host machines became obsolete for all PCI-X based boards and is no longer available in recent server PCs. These facts require a replacement of the Read-Out Receiver Cards.
The state-of-the-art bus interface to the host machine is PCI Express, and the new card should comply with the corresponding form factor to allow installation in commercial off-the-shelf server PCs. The number of optical input links is increased to twelve because six links cover a complete TPC segment and DAQ has to send a copy of all data to the HLT. This reduces the number of required RORCs and significantly improves data locality for event assembly and reconstruction in the DAQ and HLT input nodes.
The optical link rate capabilities of the new RORC should allow the implementation of the second generation of the DDL while providing backward compatibility with DDL1 for all detectors that will not upgrade their read-out electronics for Run2. The FPGA has to be large enough to allow the previously used online hardware preprocessing to be reused or even extended for the increased number of input links.
A custom interface that was used on the D-RORC and is still required for the new board is an LVDS connection to a device called the BusyBox, which monitors and verifies the transfer of event data from the Front-End Electronics (FEE) to the DAQ system.
In order to manage and maintain a cluster with a large number of boards, the configuration process of each FPGA has to provide an interface for monitoring and control. The board should be able to hold at least two configuration images to provide a backup image if anything goes wrong.

Commercially available platforms
There are several FPGA boards available that come with a PCI Express interface and serial optical links. The first board that was actually available at the time the collection of requirements started was the Xilinx ML605 Virtex-6 evaluation board. It supports PCIe with up to four lanes Gen2 or eight lanes Gen1 and has a DDR3 SO-DIMM socket for operation at up to 800 Mbps. Unfortunately, this board has only one serial optical link via SFP, which cannot be used when the board is installed in a server PC.
A board that matches the ALICE requirements more closely is a Virtex-6 platform from HitechGlobal. This board features a PCIe interface with eight lanes Gen2 and supports DDR3 at up to 1066 Mbps. As with the Xilinx board, it comes with only little optical connectivity installed on the board itself. A big advantage of this device is that the optical connectivity can be extended to eight links with two QSFP modules on an FMC add-on board. This setup has been used as an evaluation platform but could also not be fully integrated into a server PC due to its size.
Boards with Virtex-6 FPGAs supporting 10 Gbps cannot be used because their transceivers cannot be operated at link rates in the range of two to five gigabit per second. During the development of the custom board, a device with an Altera FPGA was introduced that comes with PCIe Gen3 and at least eight onboard serial optical links using two QSFP modules. Boards with Series-7 FPGAs also became available during the development of the custom board. However, none of them came with the required optical connectivity or supported link rates. In addition to their low optical connectivity, none of the mentioned boards provides sufficient FPGA configuration monitoring or management options, nor are they able to run LVDS via RJ45.

Design of the common read-out receiver card
As described above, the lack of suitable commercial platforms and the requirements on the new Read-Out Receiver Card led to a custom board development. A photo of the final board is shown in figure 2. The decision on the FPGA was made in favor of the Xilinx Virtex-6 because these FPGAs were the most recent devices actually available at the time the development started. The LX series of the Virtex-6 comes with transceivers that can be used for a broad spectrum of link rates, covering the required 2.125 Gbps for DDL1 and the anticipated 4.25 or 5.135 Gbps for DDL2. The maximum PCI Express bandwidth achievable with this FPGA is eight lanes Gen2 (40 Gbps raw). The board is designed in a way that Virtex-6 FPGAs of different logic sizes can be used as long as they share the same package.
The increased link density cannot be realized with SFP modules like on the previous RORCs because the IO bracket of PCI devices simply does not provide enough space for more than four SFPs per PCI slot. However, there are several parallel optical transceiver technologies available that provide an increased link density on a much smaller footprint. The decision was made in favor of QSFP modules because they provide four bidirectional links per module and are available from several manufacturers. The required twelve optical links have been realized using three QSFP modules. This setup leaves enough space on the IO bracket for an RJ45 socket to provide the required LVDS connectivity. Using twelve optical input links with DDL1 results in an overall input bandwidth of 25.5 Gbps, which can easily be handled by the PCIe interface. Due to the use of QSFP modules, the number of input links can also be reduced to eight links with only two QSFPs when using higher link speeds without online preprocessing or compression.
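As a back-of-the-envelope cross-check of the bandwidth budget quoted above, the sketch below compares the aggregate DDL1 input rate with the capacity of an eight-lane PCIe Gen2 interface. The link and lane counts are taken from the text; the 8b/10b encoding factor applied to PCIe Gen2 is added here only for context.

    #include <stdio.h>

    /* Back-of-the-envelope check of the C-RORC bandwidth budget.
       Link and lane counts are taken from the text; the 8b/10b factor
       for PCIe Gen2 is standard but shown here only for context. */
    int main(void)
    {
        const double ddl1_gbps = 2.125;  /* DDL1 line rate                 */
        const int    links     = 12;     /* three QSFPs, four links each   */
        const double pcie_lane = 5.0;    /* PCIe Gen2, GT/s per lane       */
        const int    lanes     = 8;

        double input_raw = links * ddl1_gbps;      /* 25.5 Gbps            */
        double pcie_raw  = lanes * pcie_lane;      /* 40 Gbps raw          */
        double pcie_data = pcie_raw * 8.0 / 10.0;  /* 32 Gbps after 8b/10b */

        printf("DDL1 input: %.1f Gbps over %d links\n", input_raw, links);
        printf("PCIe Gen2 : %.1f Gbps raw, %.1f Gbps after 8b/10b\n",
               pcie_raw, pcie_data);
        return 0;
    }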
The interface to the existing fiber installation will be made with break-out fibers, allowing the parallel optical links to be connected to the existing patch panels via LC or E2000 connectors. The matching of the optical output levels can be done with compatible QSFP modules or optical attenuation components.
The support for a broad range of link rates is realized by using a configurable reference clock for the FPGA transceivers. The board comes up with a default frequency that can be used for 2.125 and 4.25 Gbps, but can be reconfigured to any other frequency at runtime via I2C. This allows the board to be used for link rates covering the whole range supported by the FPGA transceivers.
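As a purely illustrative consistency check, the sketch below assumes a default reference clock of 212.5 MHz (a hypothetical value, not stated in the text) and prints the ratio between the line rates quoted above and that reference; the ratios the transceiver PLLs actually accept depend on the firmware configuration, and other link rates require reprogramming the oscillator via I2C.

    #include <stdio.h>

    /* Illustrative only: ratio between line rate and an assumed 212.5 MHz
       reference clock. The accepted ratios depend on the transceiver PLL
       settings chosen in firmware. */
    int main(void)
    {
        const double refclk_mhz   = 212.5;          /* assumed default     */
        const double rates_gbps[] = { 2.125, 4.25 };/* rates from the text */

        for (unsigned i = 0; i < sizeof rates_gbps / sizeof rates_gbps[0]; i++)
            printf("%.3f Gbps -> line rate / refclk = %.0f\n",
                   rates_gbps[i], rates_gbps[i] * 1000.0 / refclk_mhz);
        return 0;
    }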
In order to get an FPGA-based PCIe device detected by the host PC, it has to be up and running relatively early during the boot phase of the PC. The PCIe specification allows a maximum of 100 ms after de-assertion of the PCIe reset until a device has to be able to respond [5]. The time between applying power to the system and the de-assertion of the reset depends on the system but can easily be as low as another 100 ms [6]. Given the fact that all onboard power supply converters, clocks and the FPGA itself also need several ms after applying power, there is not much time left to get the FPGA configured and able to respond to PCIe requests. A synchronous flash memory is used on the board to configure the FPGA in time. Unfortunately, these synchronous memories were only available with 16 MB capacity, and a full configuration for a Virtex-6 LX240 requires roughly 9 MB, so that only a single configuration image can be stored in one flash chip. For that reason a second synchronous flash chip has been installed in parallel to provide the ability to hold at least two full FPGA configurations on the board. The user can select which flash chip should be used for configuration. The contents of the flash memories can be read and written via PCIe or using a Xilinx programming cable.
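A rough estimate illustrates why a fast, synchronous read from flash matters for this budget. The bitstream size is the one quoted above, while the configuration bus width and the read rates below are assumptions made only for this sketch.

    #include <stdio.h>

    /* Rough estimate of FPGA configuration time from flash.
       Bitstream size (~9 MB) is from the text; bus width and read clocks
       are assumed values for illustration only. */
    int main(void)
    {
        const double bitstream_mb = 9.0;
        const double bus_bytes    = 2.0;  /* 16-bit configuration data bus   */
        const double sync_mhz     = 50.0; /* assumed synchronous read clock  */
        const double async_mhz    = 10.0; /* assumed effective async rate    */

        double t_sync  = bitstream_mb * 1e6 / (bus_bytes * sync_mhz  * 1e6) * 1e3;
        double t_async = bitstream_mb * 1e6 / (bus_bytes * async_mhz * 1e6) * 1e3;

        printf("synchronous read : ~%.0f ms\n", t_sync);   /* ~90 ms  */
        printf("asynchronous read: ~%.0f ms\n", t_async);  /* ~450 ms */
        return 0;
    }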
The monitoring of the FPGA configuration process is done with an onboard microcontroller. This device can detect whether the FPGA configuration from flash succeeded and can trigger a reconfiguration from the other flash chip if required. A big advantage of this approach is that the microcontroller is connected to the host system's SMBus interface via PCIe sideband signals. This means that the host machine can access the microcontroller with standard Linux tools like i2cset/i2cget even if the PCIe link is down. This allows FPGA reconfiguration from a specific flash memory to be triggered from host software, or a board health status with several board supply voltage measurements to be read out.
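As an illustration of this out-of-band access path, a monitoring value could in principle be read from user space through the Linux i2c-dev interface, the same mechanism used by i2cset/i2cget. The bus number, slave address and register in the sketch below are placeholders, not the actual C-RORC assignments.

    #include <fcntl.h>
    #include <linux/i2c-dev.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Bus number, slave address and register are placeholders; the real
           values depend on the host mainboard and the C-RORC firmware. */
        int fd = open("/dev/i2c-0", O_RDWR);
        if (fd < 0) { perror("open /dev/i2c-0"); return 1; }
        if (ioctl(fd, I2C_SLAVE, 0x3c) < 0) { perror("I2C_SLAVE"); return 1; }

        uint8_t reg = 0x00;  /* hypothetical status/voltage register */
        uint8_t val;
        if (write(fd, &reg, 1) != 1 || read(fd, &val, 1) != 1) {
            perror("i2c transfer");
            return 1;
        }
        printf("register 0x%02x = 0x%02x\n", reg, val);
        close(fd);
        return 0;
    }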
For onboard storage, two DDR3 SO-DIMM sockets have been chosen. This allows onboard memory to be installed only where needed and gives a lot of flexibility regarding memory capacity. The FPGA is able to operate both DDR3 interfaces independently with up to 1066 Mbps (PC3-8500) for single-ranked modules and 606 Mbps for dual-ranked modules.
The FPGA itself provides temperature sensors and supply voltage measurements that can be read out via PCIe. The QSFP modules also provide temperature readings that can be accessed via their slow electrical interface. The required LVDS interface is realized with an LVDS buffer and an RJ45 socket in the same way as on the D-RORC boards.
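The QSFP temperature reading mentioned above is typically reported as a signed 16-bit value in units of 1/256 °C in the module's lower memory page (bytes 22 and 23 in the SFF-8436 memory map). The conversion sketched below assumes that layout; the raw bytes used in the example are made up, and the register access itself goes through the same kind of I2C path as in the previous sketch.

    #include <stdint.h>
    #include <stdio.h>

    /* Convert a QSFP temperature reading to degrees Celsius, assuming the
       SFF-8436 convention: signed 16-bit value in units of 1/256 degC,
       read from bytes 22 (MSB) and 23 (LSB) of the lower memory page. */
    static double qsfp_temperature_degc(uint8_t msb, uint8_t lsb)
    {
        int16_t raw = (int16_t)((msb << 8) | lsb);
        return raw / 256.0;
    }

    int main(void)
    {
        /* Example raw bytes: 0x1F00 corresponds to 31.0 degC. */
        printf("%.2f degC\n", qsfp_temperature_degc(0x1f, 0x00));
        return 0;
    }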

Hardware test results
The first series of C-RORCs produced has been used to extensively test all interfaces of the board. The PCIe interface is implemented using the PCIe hard block in the FPGA and the Xilinx IP core for the transaction interface. The Direct Memory Access (DMA) functionality is a custom development because no commercially available DMA engine could provide twelve independent channels. This implementation supports scatter-gather lists, so it can handle memory allocated from the regular Linux memory subsystem without requiring kernel patches or memory separated at boot. The host-side software consists of a microdriver in kernel space and the main part of the device driver in user space [7]. DMA throughput tests have been performed.
The configuration flashes can be read and written via PCIe and allow quick updates of the onboard firmware. The configuration monitoring and the reconfiguration of the FPGA by accessing the microcontroller via SMBus have also proven to work without problems. Time measurements from power-on to FPGA configuration-done confirm that the PCIe requirements are met. The board is detected reliably in all tested machines.
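To make the scatter-gather idea concrete, the DMA engine can be pictured as walking a list of physical-address/length entries, one list per channel. The descriptor layout below is a generic illustration, not the actual C-RORC descriptor format.

    #include <stddef.h>
    #include <stdint.h>

    /* Generic sketch of a scatter-gather descriptor list: each entry points
       to one physically contiguous fragment of a buffer that is scattered
       in host memory. This is NOT the actual C-RORC descriptor format. */
    struct sg_entry {
        uint64_t bus_addr;  /* physical/bus address of the fragment        */
        uint32_t length;    /* fragment length in bytes                    */
        uint32_t flags;     /* e.g. end-of-event or end-of-list marker     */
    };

    /* With twelve independent DMA channels the firmware would hold one
       such list per optical link. */
    struct sg_list {
        struct sg_entry *entries;
        size_t           count;
    };

    int main(void)
    {
        struct sg_entry e = { .bus_addr = 0x100000000ULL, .length = 4096, .flags = 0 };
        struct sg_list  l = { .entries = &e, .count = 1 };
        (void)l;
        return 0;
    }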
A firmware able to test the correct operation of all major interfaces at once is currently being prepared to simplify hardware tests for the production of larger numbers of boards.

Conclusion and outlook
The increased demands on link rates and processing capabilities for Run2 require a new Read-Out Receiver Card for the ALICE Data Acquisition and High-Level Trigger. It has been realized as a common project of both systems. First boards have been produced and thoroughly tested. All hardware tests were successful. The board is compatible with the read-out architecture, link rates and protocols used during Run1 and provides room for the upgrades proposed for Run2. The development of the board has also raised interest from the ATLAS collaboration for the upgrade of their read-out system. A common purchase of the boards to be used for Run2 is ongoing.

Figure 2. Photo of the Common Read-Out Receiver Card equipped with three QSFP modules with parallel fibers attached and two DDR3 memory modules.