The ALICE Data Quality Monitoring: qualitative and quantitative review of three years of operations

ALICE (A Large Ion Collider Experiment) is a detector designed to study the physics of strongly interacting matter produced in heavy-ion collisions at the CERN Large Hadron Collider (LHC). Due to the complexity of ALICE in terms of number of detectors and performance requirements, Data Quality Monitoring (DQM) plays an essential role in providing online feedback on the data being recorded. It aims to provide shifters with precise and complete information so that they can quickly identify problems and, as a consequence, to ensure the acquisition of high-quality data. This paper presents a review of the ALICE DQM system during the first three years of LHC operations from a quantitative and qualitative point of view. We start by presenting the DQM software and tools before moving on to the various analyses carried out. An overview of the produced monitoring quantities is given, presenting the diversity of usage and flexibility of the DQM. Well-prepared shifters and experts, in addition to a precise organisation, were required to ensure smooth and successful operations. The description of the measures taken to ensure both aspects and an account of the DQM shifters' job are followed by a summary of the evolution of the system. We then give a quantitative review of the final setup of the system used during the whole of 2012. We conclude the paper with use cases where the DQM proved to be very valuable, scalable and efficient, and with the plans for the coming years.


ALICE
ALICE [1] is a general-purpose detector dedicated to the study of heavy-ion collisions at the CERN LHC. It is optimized to study the properties of the deconfined state of quarks and gluons produced in such collisions, known as quark-gluon plasma. It is also well suited to the study of elementary collisions such as proton-proton and proton-nucleus interactions. Robust tracking and particle identification over a wide transverse-momentum (pT) range are ensured by different types of detectors designed to cope with high particle multiplicities. The commissioning of the experiment was carried out during 2008 and 2009 in the underground experimental pit. From the startup of the LHC in November 2009 until the long shutdown in February 2013, the experiment took data successfully and almost continuously.

DQM
Data Quality Monitoring (DQM) is an important component of every high-energy physics experiment, especially in the LHC era, where the detectors are extremely sophisticated devices. To ensure the recording of high-quality data, online feedback is needed on the quality of the data actually being recorded for offline analysis. The DQM system provides this feedback and helps shifters and experts to quickly identify potential issues. It involves the online collection of data, their analysis by user-defined algorithms, and the storage and visualization of the produced monitoring information.

AMORE design and architecture
A DQM software system, called AMORE (Automatic MOnitoRing Environment) [2] [3], was developed for the ALICE experiment. It is built on the ROOT data analysis framework [4] and uses the DATE monitoring library [5]. AMORE follows the publisher-subscriber paradigm (see Figure 1): a large number of processes, called agents, execute detector-specific decoding and analysis on raw data samples and publish their results in a pool. Clients can then connect to the pool and visualize the monitoring results through dedicated user interfaces (see section 4.2 for details). The data samples feeding the agents can come from the DAQ nodes, from other agents or from files. Other systems, such as the High-Level Trigger or the Detector Control System, can also provide data and can feed the data pool directly without going through an agent. The resulting quantities are generally histograms, although there is no restriction on their type, and are published encapsulated in MonitorObjects. A MonitorObject also contains metadata, for example the quality and the target audience of the object. Finally, the only direct communication between publishers and clients consists of notifications sent through the DIM (Distributed Information Management) system. AMORE uses a plug-in architecture to avoid any dependency of the framework on users' code. Users, usually detector teams as well as the framework developers themselves, implement specific code that is built into dynamic libraries called modules. These modules are loaded at runtime by the framework if and when they are needed.
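The publisher-subscriber flow described above can be illustrated with a minimal Python sketch. It is not AMORE code: the class names `Pool`, `Agent` and `Client`, the field names of `MonitorObject`, and the in-memory dictionary standing in for the pool are all assumptions made for illustration, and the plain callback only plays the role that DIM notifications play in the real system.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class MonitorObject:
    """A published quantity plus metadata (field names are illustrative)."""
    name: str
    payload: Any              # usually a histogram; any type is allowed
    quality: str = "unknown"
    audience: str = "expert"  # hypothetical values: "expert" or "shifter"


class Pool:
    """In-memory stand-in for the data pool (assumption, not the real store)."""

    def __init__(self) -> None:
        self._objects: Dict[str, MonitorObject] = {}
        self._listeners: List[Callable[[str], None]] = []

    def subscribe(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def publish(self, obj: MonitorObject) -> None:
        self._objects[obj.name] = obj
        # The notification carries only the object name; clients fetch the
        # object from the pool themselves.
        for notify in self._listeners:
            notify(obj.name)

    def fetch(self, name: str) -> MonitorObject:
        return self._objects[name]


class Agent:
    """Publisher: analyses data samples and publishes the result."""

    def __init__(self, name: str, pool: Pool,
                 analyse: Callable[[Iterable[int]], Any]) -> None:
        self.name = name
        self.pool = pool
        self.analyse = analyse  # detector-specific code, plugged in by the user

    def process(self, samples: Iterable[int]) -> None:
        payload = self.analyse(samples)
        self.pool.publish(MonitorObject(f"{self.name}/hits", payload,
                                        quality="good", audience="shifter"))


class Client:
    """Subscriber: reacts to notifications by reading from the pool."""

    def __init__(self, pool: Pool) -> None:
        self.pool = pool
        self.seen: List[MonitorObject] = []
        pool.subscribe(self.on_update)

    def on_update(self, name: str) -> None:
        self.seen.append(self.pool.fetch(name))


def count_hits(samples: Iterable[int]) -> Dict[int, int]:
    """Toy analysis: number of hits per channel id."""
    return dict(Counter(samples))
```

In this sketch an agent and a client never reference each other; they share only the pool, which mirrors the decoupling the text attributes to the design. Passing `analyse` in as a callable is a loose analogue of the plug-in modules, not a model of dynamic library loading.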

DQM shifter
Since the start-up of ALICE operations and during the whole period of data taking, a dedicated DQM shifter monitored the online data taking by means of the DQM system. Shifters were asked to inspect the histograms and quality objects made available through the AMORE Graphical User Interface (GUI). The latter allows browsing and visualization of the published objects, which are organized in a tree and can be filtered according to specific parameters (see figure 2). Each monitored quantity is automatically compared with reference values (set by the experts) or with the expected behaviour, so that properly trained shifters could verify the data quality without being experts. The shifters were responsible for alerting the shift crew and the on-call experts in case of anomalous behaviour of the monitored quantities or of the framework itself. The shifters had to react as promptly as possible so that immediate action could be taken and, ultimately, the highest data-taking efficiency ensured.
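As an illustration of the automatic comparison with expert-set references, the following Python sketch flags a histogram whose normalised shape deviates from a reference by more than a tolerance. The metric (largest per-bin difference of the normalised contents), the function names and the default tolerance are assumptions chosen for the example; the text does not specify which checks the AMORE modules actually apply.

```python
from typing import List, Sequence


def normalise(bins: Sequence[float]) -> List[float]:
    """Scale bin contents so that they sum to 1."""
    total = sum(bins)
    if total <= 0:
        raise ValueError("histogram is empty")
    return [b / total for b in bins]


def check_against_reference(observed: Sequence[float],
                            reference: Sequence[float],
                            tolerance: float = 0.05) -> str:
    """Return "good" or "bad" from a shape comparison.

    The tolerance is a hypothetical expert-set threshold on the largest
    per-bin difference between the two normalised histograms.
    """
    if len(observed) != len(reference):
        raise ValueError("histograms must have the same binning")
    obs = normalise(observed)
    ref = normalise(reference)
    worst = max(abs(o - r) for o, r in zip(obs, ref))
    return "good" if worst <= tolerance else "bad"
```

A shifter-facing display would then only need to show the resulting flag, for instance `check_against_reference(current_bins, reference_bins)`, which is how a trained non-expert can judge the data without interpreting the distribution in detail.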

Shifters' tools
The DQM system is operated through a number of tools, based on the Tcl scripting language and the Tk GUI toolkit [6], that have been designed to ease the AMORE configuration. The most important shifters' tool is the amoreAgentsManager. Its GUI allows the user to check the status of all available AMORE agents and to start, restart or stop their execution. The amoreEditDb and the amoreConfigFileBrowser are intended for more expert usage and operate mainly on the AMORE database for configuration purposes. The main operations performed with these two tools are editing configuration files and modifying the execution parameters of the agents.
The ALICE electronic logbook [7] is another continuously consulted tool, which stores run details as well as human entries. It is used not only by the DQM shifters but also by the experts, who can access its web interface remotely. It provides information on the agents running during each data acquisition run, including configuration parameters and details on archived objects. It also gives the possibility to download the AMORE objects for further manipulation.
Furthermore, the Event Display complements the data monitoring and is used to display in 3D an event that has been partially or fully reconstructed beforehand. It also includes the possibility to apply online selection criteria. Finally, the user can customize the view by colouring or hiding tracks and can retrieve details about them.

Organisation
The organisation of the DQM shifts followed the ALICE scheme, in which three 8-hour shifts per day alternated during normal operations. To assist the shifters, 24-hour on-call shifts were organised, so that the expert on duty could help the shifter with technical and physics-related matters. Given the variety of the operators' backgrounds and experience, all candidate shifters attended training lectures on operating the DQM framework. These lectures were followed by two 8-hour training shifts, during which the newcomer was asked to shadow the DQM shifter during online operations. The training process evolved into a more demanding procedure in the last year of operations, when a web-based test was also used to verify the shifters' preparation. The test proved its effectiveness by leading to an overall improvement in the DQM shifters' efficiency and awareness.
Monthly meetings gathered detector experts and software developers, in order to discuss framework developments, requests and the status of the whole DQM system.

Monitored quantities
Data taking with a detector as complex as ALICE requires not only monitoring the behaviour of each sub-detector, but also inspecting online the physics properties of the events being recorded. AMORE provides the framework structure to access the results of the online analysis of the raw data and additional information coming from different sub-systems. As an example, it includes the output of the reconstruction of the primary interaction vertex (see figure 2) provided by the Detector Algorithms (DA) of the Silicon Pixel Detector (SPD) and the online track reconstruction in the High-Level Trigger (HLT) farm [1]. As a compromise between the statistical relevance and the readiness of the results, the monitoring objects were updated every minute, thus showing time-integrated quantities or time-wise trending behaviour.
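The distinction between time-integrated quantities and time-wise trends can be made concrete with a short Python sketch. The class name `TrendingMonitor` and its interface are hypothetical; the example only shows one plausible way to keep both views in step when a new batch of counts arrives at each update interval (one minute in the setup described above).

```python
from typing import List, Sequence, Tuple


class TrendingMonitor:
    """Keeps a time-integrated histogram and a per-interval trend."""

    def __init__(self, n_bins: int) -> None:
        self.integrated: List[int] = [0] * n_bins
        # One (interval index, total counts in that interval) pair per update.
        self.trend: List[Tuple[int, int]] = []

    def update(self, interval: int, counts: Sequence[int]) -> None:
        """Fold one update interval of per-bin counts into both views."""
        if len(counts) != len(self.integrated):
            raise ValueError("unexpected number of bins")
        for i, c in enumerate(counts):
            self.integrated[i] += c          # time-integrated view
        self.trend.append((interval, sum(counts)))  # time-wise view
```

The integrated histogram gains statistical weight with every update, while the trend exposes a sudden change that the integrated view would dilute, which is the compromise the paragraph refers to.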
During its operations, ALICE exploited several trigger configurations [8], ranging from minimum bias to "rare" Level-2 (L2) triggers (classes), issued by the Central Trigger Processor (CTP) unit as the result of the combination of several trigger inputs. The trigger rates were monitored at each level, from the single sub-detector inputs to the final L2 trigger decision. Furthermore, the busy time of each detector was monitored in order to prevent potential issues in the data-taking flow. For similar reasons, the average data size of each sub-detector's DAQ equipment was monitored over time. All the sub-detectors provided different objects, most commonly monitoring the detector readout configuration, the shape of the signal, the detector occupancy, the noise in the readout electronics and the readout efficiency. The system also made it possible to discriminate between different L2 trigger classes, allowing the monitoring of the signals originating from the trigger input and the physics

Quantitative review
During the data-taking period with LHC beams, the DQM was used intensively in a demanding and constantly evolving environment. More than 40 AMORE agents ran 24/7, publishing an average of more than 10000 objects updated every minute, for a total of 10 MB/s of data stored during a typical run. Of these 10000 objects, about 95% were marked for experts, while 120 were actually monitored continuously by the DQM shifters, as figure 3 shows. Since 2010, more than 150 trained shifters per year spent a total of more than 23500 hours on shift, arranged in blocks of 8 hours/day for 6 consecutive days/week. About 120 new shifters were trained in 2012 alone, successfully passing the web-based test.
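The figures quoted above can be put in proportion with some back-of-the-envelope arithmetic, sketched here in Python. The inputs are the rounded values from the text, several of which are lower bounds ("more than"), so the results are indicative only. The mean size per object update additionally assumes that the whole 10 MB/s is due to the roughly 10000 objects updated once per minute, which the text does not state.

```python
# Rounded inputs taken from the text (several are quoted as "more than").
objects_total = 10_000
expert_share_percent = 95
shifter_objects = 120
shift_hours_total = 23_500
shift_length_hours = 8
storage_rate_kb_per_s = 10_000   # 10 MB/s, taking 1 MB = 1000 kB
update_period_s = 60

# Objects marked for experts, and the remainder not marked for experts.
expert_objects = objects_total * expert_share_percent // 100      # 9500
non_expert_objects = objects_total - expert_objects               # 500

# Share of all objects that the shifters watched continuously.
shifter_share_percent = 100 * shifter_objects / objects_total     # 1.2

# Number of 8-hour shifts corresponding to the quoted hours.
shifts_total = shift_hours_total / shift_length_hours             # 2937.5

# Assumption: all stored data comes from the per-minute object updates.
mean_kb_per_update = storage_rate_kb_per_s * update_period_s / objects_total  # 60.0
```

Under these assumptions the shifters continuously watched about 1.2% of the published objects, and the quoted hours correspond to roughly 2900 eight-hour shifts.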

Best stories
The DQM system has played a fundamental role since the LHC start-up, when ALICE provided the event display of the first pp collisions and fast feedback on the data quality, which was instrumental in publishing the first LHC physics paper within a week of the first collision. As on that occasion, during the first Pb-Pb, p-Pb and Pb-p collisions the monitoring of the interaction vertex position and of the background levels provided substantial information that allowed the LHC experts to better tune the beam parameters for collisions. On one occasion, suspicious behaviour of the monitoring objects for the Time-Projection Chamber revealed a problem in the High-Level Trigger chain. The notification of the issue by the DQM, when no other system raised an alarm, prevented data from being recorded in the wrong format. Finally, at the beginning of 2013, the output of the online raw data analysis was used for a study on the ageing of the V0 detector's scintillators.

Future
The long shutdown in 2013-2014 gives a unique opportunity to review the experience of the many users and to react accordingly by adjusting the tools, the physics analysis and the monitored metrics. Tools, including the generic GUI, are being re-implemented as web applications. This change makes it possible for people outside the control room to use them, to integrate them with the CERN authentication system and to have them coded in modern languages. The shifters' training course and the final test will be improved, with the aim of increasing the preparation of the shifters and reinforcing the cooperation with the other members of the shift crew.
These various changes will be available for the commissioning in summer 2014.

Conclusion
The Data Quality Monitoring system proved its usefulness and reliability throughout the first three years of LHC operations. It provided online feedback on the data being recorded and allowed the shift crew to notice problems immediately and react promptly when they arose. It gave the experts helpful insight into the conditions and causes of these problems, thereby improving reaction and correction times.
The positive experience we gained, as well as the various changes we plan, make us confident that the DQM will be fully available for the LHC restart and will again play its crucial role during the next years of data taking.