New operator assistance features in the CMS Run Control System

During Run-1 of the LHC, many operational procedures have been automated in the run control system of the Compact Muon Solenoid (CMS) experiment. When detector high voltages are ramped up or down, or upon certain beam-mode changes of the LHC, the DAQ system is automatically partially reconfigured with new parameters. Certain types of errors, such as those caused by single-event upsets, may trigger an automatic recovery procedure. Furthermore, the top-level control node continuously performs cross-checks to detect sub-system actions that become necessary because of changes in configuration keys, changes in the set of included front-end drivers, or potential clock instabilities. The operator is guided to perform the necessary actions through graphical indicators displayed next to the relevant command buttons in the user interface. These indicators ensure a consistent configuration of CMS. However, manually following the indicators can still be inefficient at times. A new operator assistant has therefore been developed that can automatically perform all the necessary actions in a streamlined order. If additional problems arise, the new assistant tries to recover from these automatically. With the new assistant, a run can be started from any state of the sub-systems with a single click, and an ongoing run may be recovered with a single click once the appropriate recovery action has been selected. We review the automation features of CMS Run Control and discuss the new assistant in detail, including first operational experience.


Introduction
The Run Control System [1,2,3,4] of the Compact Muon Solenoid (CMS) [5,6] experiment at CERN's Large Hadron Collider (LHC) is a distributed web application running on a set of Apache Tomcat servers. It allows users to define a hierarchical structure of control nodes, called Function Managers (FMs). Function Managers are developed in Java and are dynamically loaded into the web application upon the start of a configuration [7]. Function Managers are controlled through a web browser. During global data-taking, all operations are controlled through the top-level control node, also called the Level-0 FM, as illustrated in figure 1. Hardware access and transport of data are handled by the C++ based XDAQ online software [8]. Monitoring services collect monitor data from the XDAQ and run control layers and make them available to monitor clients, error & alarm panels and an expert system. A separate control system, the Detector Control System (DCS) [9], controls detector voltages, gas flows, etc. The DCS is in communication with the LHC control system.

Review of Operator Assistance Features
At the start of data-taking with CMS in 2008, the top-level control node provided full control of the global state machine, control of individual sub-systems, and control of configuration keys applicable to various sub-systems. All required operations were available to the operator, but the operator had to know how to choose compatible configuration keys, when to perform actions on individual sub-systems, and how these actions depended on each other. Operation of the experiment required a lot of detailed knowledge of the system and was rather error-prone. During Run-1 of the LHC (2009-2013), many operator assistance features were added to the top-level control node (Level-0 Function Manager) of the run control system in order to ease the job of the operator and to reduce the possibility of operator error. We briefly review these features below. For a more in-depth description we refer the reader to [10].

Improved Configuration Handling
Previously, operators had to manually choose compatible configuration keys for the first-level trigger, high-level trigger and sub-detectors that were appropriate for the data-taking condition at hand. Inconsistent configuration of the trigger system or of sub-detectors frequently led to down time or to data-quality problems. To remedy the situation, configuration keys were combined: first a combined trigger key was introduced for the first-level and high-level trigger, and later a combined run-mode that ties together compatible sets of all configuration keys for all sub-systems needed for certain modes of data-taking (for example: collisions, cosmics, circulating beam). These combined keys and run-modes could thus be prepared and validated by experts ahead of time, while operators only needed to select the appropriate run-mode for a given data-taking condition. This solution, especially when combined with the guidance system explained in the following subsection, greatly reduced problems due to inconsistent configuration.
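The idea of a combined run-mode can be illustrated with a minimal sketch in Java, the language of the Function Managers. All run-mode names and key values below are hypothetical, not the actual CMS keys: the point is only that selecting one run-mode resolves one expert-validated, mutually compatible key per component.

```java
import java.util.Map;

// Minimal sketch: a run-mode ties together a compatible, expert-prepared
// set of configuration keys, so operators select one run-mode instead of
// choosing individual keys. All names below are illustrative.
public class RunModes {

    static final Map<String, Map<String, String>> RUN_MODES = Map.of(
        "collisions", Map.of(
            "l1TriggerKey", "l1_collisions_v12",
            "hltKey",       "hlt_collisions_v12",
            "trackerKey",   "tracker_physics"),
        "cosmics", Map.of(
            "l1TriggerKey", "l1_cosmics_v3",
            "hltKey",       "hlt_cosmics_v3",
            "trackerKey",   "tracker_cosmics"));

    // Selecting a run-mode resolves every component key at once.
    static Map<String, String> keysFor(String runMode) {
        Map<String, String> keys = RUN_MODES.get(runMode);
        if (keys == null)
            throw new IllegalArgumentException("unknown run-mode: " + runMode);
        return keys;
    }

    public static void main(String[] args) {
        // One selection yields a consistent set of keys.
        System.out.println(keysFor("collisions").get("hltKey")); // hlt_collisions_v12
    }
}
```

Because the sets are prepared ahead of time, an inconsistent combination of keys simply cannot be selected at run start.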

Guidance System
A guidance system was introduced in order to ensure consistent configuration of the experiment. It continuously checks whether all components are configured with the selected configuration, which is typically implied by the overall run-mode described above. The guidance system resolves configuration keys in a number of databases to detect whether underlying definitions of keys have been updated. The resolved keys are then compared to the actual configurations of the components. The system also ensures that configuration changes are applied to sub-systems in the correct order, which is particularly important if the clock source of the experiment is changed or the experiment needs to recover from LHC clock instabilities. The guidance system indicates necessary actions to the operator by flashing indicator lights next to the corresponding sub-system action buttons in the GUI. For example, if the clock of the experiment is changed, the guidance system reminds the operator to first reconfigure the Trigger Control and Distribution System and then to reconfigure or to reinitialize certain sub-systems. The guidance system, together with the improved configuration handling described above, practically eliminated problems due to inconsistent configuration.
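The core of the guidance check can be sketched as a comparison of the resolved (expected) keys against the configuration each sub-system actually holds; any mismatch raises an indicator next to the corresponding action button. The sub-system names and key values here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the guidance consistency check: resolved keys
// are compared with the configuration each sub-system actually holds,
// and every mismatch raises an indicator in the GUI.
public class GuidanceCheck {

    // Returns the sub-systems whose indicator light should flash.
    static List<String> staleSubsystems(Map<String, String> resolved,
                                        Map<String, String> actual) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, String> e : resolved.entrySet()) {
            if (!e.getValue().equals(actual.get(e.getKey()))) {
                stale.add(e.getKey()); // needs reconfigure / reinitialize
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Map<String, String> resolved = Map.of("TCDS", "clock_lhc_v2",
                                              "PIXEL", "gains_v5");
        Map<String, String> actual   = Map.of("TCDS", "clock_local_v1",
                                              "PIXEL", "gains_v5");
        System.out.println(staleSubsystems(resolved, actual)); // [TCDS]
    }
}
```

Running such a check continuously means an outdated key definition is flagged as soon as it appears in the database, not only at the next run start.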

Automatic actions in response to detector and LHC state changes
When detector high voltages are ramped up during preparation for data-taking (some of them just after stable beams have been declared), certain actions have to be performed through run control. In the silicon pixel sub-system, gains need to be adjusted, while in the silicon strips sub-system payload suppression needs to be turned off (payload suppression is needed to avoid large fragment sizes due to noise when high voltages are off). At the beginning of data-taking with CMS, these actions had to be performed manually by the sub-system expert, which meant that physics runs could only be started once the LHC had declared stable beams and the detector high voltages had been ramped up. These manual actions necessary at the start of a fill led to considerable down time. In order to streamline operations, the aforementioned actions have been automated. The top-level control node now receives information about the LHC state and about detector high voltages from the CMS detector control system. The top-level control node passes the high-voltage status information to the sub-systems upon run starts so that sub-systems can take care of adjusting gains and payload suppression. If a run is ongoing when the LHC or the high voltages change state, an automatic action is triggered that pauses the run, propagates updated settings to sub-systems and resumes the run. This way, a physics run can be started long before stable beams are declared. Once beams are stable and high voltages are up, settings are automatically adjusted as needed.
The information about the LHC state is also used to decide when a clock recovery needs to be performed (the LHC clock is guaranteed to be stable only in certain beam modes) and to automatically select the appropriate run-mode (collisions, cosmics, circulating beam, etc.) for a certain data-taking condition. The operator thus no longer needs to select the run-mode when data-taking conditions change, but simply needs to follow prompts by the guidance system to configure the experiment consistently with the automatically selected new run-mode.
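The pause/propagate/resume reaction to a state change can be sketched as follows. The `RunControl` interface and method names are hypothetical, not the real CMS run-control API; the sketch only captures the order of the three steps:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the automatic reaction to an LHC / high-voltage state
// change during an ongoing run: pause, propagate the new settings so that
// sub-systems can adjust gains and payload suppression, then resume.
public class HvChangeHandler {

    interface RunControl {
        void pause();
        void propagateHvStatus(boolean hvOn);
        void resume();
    }

    static void onHvChange(RunControl rc, boolean hvOn, boolean runOngoing) {
        if (!runOngoing) return; // settings are applied at the next run start instead
        rc.pause();
        rc.propagateHvStatus(hvOn);
        rc.resume();
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        RunControl rc = new RunControl() {
            public void pause() { log.add("pause"); }
            public void propagateHvStatus(boolean hvOn) { log.add("hv=" + hvOn); }
            public void resume() { log.add("resume"); }
        };
        onHvChange(rc, true, true);
        System.out.println(log); // [pause, hv=true, resume]
    }
}
```

The early return for a run that is not ongoing reflects the division of labour described above: at a run start the high-voltage status is simply passed along with the start command.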

Automatic Error Recovery for 'Soft' Errors
As LHC luminosity increased towards the end of Run-1 (2011 onwards), frequent 'soft' errors occurred in various sub-systems, which in some cases could be traced to single-event upsets in the readout electronics caused by increased irradiation. Recovery initially required stopping the run, reconfiguring the affected sub-system and starting a new run, a very time-consuming operation. A new automatic recovery mechanism was therefore designed in which sub-systems locally detect a 'soft' error condition and signal it to the top-level control node by changing their state. The top-level control node then executes a recovery sequence that consists of: pausing the run, issuing a newly defined recovery transition to the requesting sub-system, resynchronizing the system, and resuming. Additional sub-systems may register to receive the recovery transition in order to perform preventive actions in the shadow of a recovery. In 2012 alone, at least 46 hours of down time (about 50% of the remaining total down time) were avoided by this automatic error recovery mechanism.
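The recovery sequence described above can be sketched as a fixed ordering of steps, with registered sub-systems recovered "in the shadow" of the requesting one. The sub-system names and step labels are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of the soft-error recovery sequence: pause the run, issue the
// recovery transition to the requesting sub-system (and to any sub-system
// registered for preventive recovery), resynchronize, resume.
public class SoftErrorRecovery {

    static List<String> recoverySequence(String requester, Set<String> registered) {
        List<String> steps = new ArrayList<>();
        steps.add("pause");
        steps.add("recover:" + requester);
        for (String s : registered) {
            // Registered sub-systems are recovered in the shadow of the
            // requesting one, at no extra cost in down time.
            if (!s.equals(requester)) steps.add("recover:" + s);
        }
        steps.add("resync");
        steps.add("resume");
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(recoverySequence("ECAL", Set.of("ECAL", "PIXEL")));
        // [pause, recover:ECAL, recover:PIXEL, resync, resume]
    }
}
```

The key property is that the run is paused rather than stopped, so the expensive stop/configure/start cycle of the original manual procedure is avoided entirely.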

The new Level-0 Automator
While the Soft Error Recovery covers frequent and well-understood errors, there remains a category of errors that are not, or cannot be, recovered by the Soft Error Recovery because they are rare, not well understood, or cannot be diagnosed locally. Such errors currently have to be dealt with by the shift crew according to instructions that describe how to diagnose the error and that prescribe a recovery procedure. Typically this procedure consists of 1) stopping the run, 2) re-configuring or re-initializing a sub-system, and 3) starting a new run. A long-term goal is to also automate recovery from this second category of errors. An automated recovery can be factored into two largely independent steps: diagnosing the error situation and executing the recovery procedure. The diagnosis can be performed by an expert system based on the observables of the online data-flow monitoring. During Run-1, we successfully used the Perl-based DAQ Doctor [10], which gave advice to the shift crew. For Run-2 (2015-2018) we have been developing a new tool written in Java, called DAQ Expert. At the time of writing, the new tool successfully recognizes most error situations and prescribes the appropriate recovery action. In the present paper we focus on the second step of the recovery: the execution of the recovery procedure. This second step consists of executing a sequence of commands to the global state machine or to individual sub-systems. In addition to simply following the prescribed steps, the recovery should also re-configure or re-initialize sub-systems if indicated by the guidance system, in order to ensure consistent configuration. Moreover, if sub-systems fail during any step of the recovery action, a number of attempts should be made to recover them automatically, if possible in the shadow of other necessary steps.
Just like a very experienced operator, the software orchestrating the recovery should have knowledge of the interdependency of the transition actions of all sub-systems in order to be able to execute actions in the optimal order and to parallelize actions wherever possible. Since the execution of the recovery procedure requires fast feedback on the state of sub-systems, we decided to implement it within the run control system rather than in the external expert tool, which receives monitoring data with a certain latency. As the recovery procedure acts on the system in the way an experienced operator would, we chose to implement it in a new Function Manager called the Level-0 Automator FM that is located at the top of the hierarchy, between the operator and the Level-0 FM, as shown in figure 2. This also has the advantage of keeping the complexity in a separate entity, which is optional and can be deployed with minimal risk to the production system. The new Function Manager fulfils all the requirements discussed above. Its user interface, shown in figure 3, allows the definition of a recovery procedure by selecting which sub-systems need re-configuration or re-initialization. Clicking the 'Recover' button then triggers the recovery procedure. Alternatively, the recovery can also be triggered by sending the recovery instructions to a SOAP interface. This interface will in future be used by the DAQ Expert tool. Since the Level-0 Automator knows how to bring the system to the running state from any state of the sub-systems, it can also be used by the operators to start a run in the most streamlined manner. To this end a 'Start' button is available that will not stop any ongoing run, but otherwise triggers the same actions as the 'Recover' button. Since the Level-0 Automator performs actions that so far have been done by the operator, it is important to record the history of these actions and make it accessible to the operator.
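The ordering and parallelization of sub-system actions mentioned above amounts to scheduling over a dependency graph. The following sketch groups independent actions into stages that could run in parallel (a layered topological sort); the dependency graph and action names are hypothetical, not the actual CMS transition dependencies:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch: order sub-system actions by their interdependencies
// and group independent actions into stages that can run in parallel.
public class ActionScheduler {

    // deps.get(a) = the set of actions that must complete before a may start.
    static List<Set<String>> stages(Map<String, Set<String>> deps) {
        Map<String, Set<String>> remaining = new HashMap<>();
        deps.forEach((k, v) -> remaining.put(k, new HashSet<>(v)));
        List<Set<String>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            Set<String> ready = new TreeSet<>();
            for (Map.Entry<String, Set<String>> e : remaining.entrySet())
                if (e.getValue().isEmpty()) ready.add(e.getKey());
            if (ready.isEmpty())
                throw new IllegalStateException("cyclic dependencies");
            ready.forEach(remaining::remove);
            remaining.values().forEach(v -> v.removeAll(ready));
            result.add(ready); // all actions in this stage can run in parallel
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> deps = Map.of(
            "configure TCDS",      Set.of(),
            "reconfigure TRACKER", Set.of("configure TCDS"),
            "reconfigure ECAL",    Set.of("configure TCDS"),
            "start run",           Set.of("reconfigure TRACKER", "reconfigure ECAL"));
        System.out.println(stages(deps));
    }
}
```

In the example, the TCDS configuration comes first, the two sub-detector reconfigurations can proceed in parallel, and the run start waits for both, which mirrors the ordering an experienced operator would follow.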
We therefore developed a timeline viewer, which shows the evolution over time of the global and all sub-system states and also shows all actions, whether triggered by the Level-0 Automator or triggered manually by the operator. The timeline of a recovery is shown in figure 4 while the timeline of the start of a run is shown in figure 5. Since the operator needs to be informed about sub-system states and configuration, and in some situations still needs to perform manual actions on individual sub-systems, the user interfaces of the Level-0 Automator and the Level-0 FM have been integrated. The user interface of the Level-0 Automator is shown as a bar on top of the web page while the Level-0 control page is available below. Whenever the Level-0 Automator executes a recovery or start procedure, the user interface of the Level-0 FM is locked so that the operator cannot interfere by performing manual operations. During other periods, the operator may use the controls of the Level-0 FM directly.

Configuration of the Level-0 Automator FM
The Level-0 Automator FM's configuration is stored in the resource service database [7], as for other Function Managers. Configuration options include the number of attempts to recover a sub-system and various timeouts. In the recovery example shown, the reconfiguration of the TCDS system implies a change of clock, so that several sub-systems require re-initialization in addition to re-configuration. The light-bulb symbols indicate the state of the indicators from the guidance system (red: indicator appearing, grey: indicator disappearing).

Experience
The new Level-0 Automator FM was deployed in early 2016. Most DAQ operators immediately embraced the new tool and are now routinely using it to start runs and perform recoveries. Besides knowing the optimal sequence of actions, the Automator also frees the operators from watching the screen to detect the completion of steps in a sequence merely to trigger the next step. Operators are thus free to focus on other tasks such as communication with the shift crew or documenting their actions. The Level-0 Automator FM has helped to considerably reduce the down time caused by error recovery.

Conclusion and Outlook
We have reviewed the operator assistance features that have been added to the CMS run control system since the start of operation of the experiment. In particular, we have for the first time presented a recent development, the Level-0 Automator, which can execute complex recovery actions in an automated fashion in the most streamlined way. By following the guidance system it ensures consistent configuration of the experiment. It is also able to recover from additional problems occurring during the recovery of an initial problem. The tool has sped up recovery from typical error situations and has considerably simplified the operators' task. In the near future we are planning to integrate the Level-0 Automator with the new DAQ Expert tool in order to automatically trigger recovery when certain error situations are detected.