Phronesis, a diagnosis and recovery tool for system administrators

The LHCb experiment relies on the Online system, which includes a very large and heterogeneous computing cluster. Ensuring the proper behavior of the different tasks running on the more than 2000 servers represents a huge workload for the small operator team and is a 24/7 task. At CHEP 2012, we presented a prototype of a framework that we designed in order to support the experts. The main objective is to provide them with steadily improving diagnosis and recovery solutions in case of misbehavior of a service, without having to modify the original applications. Our framework is based on adapted principles of the Autonomic Computing model, on Reinforcement Learning algorithms, as well as innovative concepts such as Shared Experience. While the submission at CHEP 2012 showed the validity of our prototype on simulations, we here present an implementation with improved algorithms and manipulation tools, and report on the experience gained with running it in the LHCb Online system.


Introduction
LHCb [1] is one of the four large experiments at the Large Hadron Collider at CERN. This experiment relies on a large computing infrastructure [2] to (i) control the data acquisition system and the detector, and (ii) manage the data it produces. The team in charge of installing and administering this system comprises fewer than 10 people, three of whom work on it full time. To help the system administrators reach their goal of high availability, we set out to provide them with software that proposes a diagnosis and recovery solution when problems occur, improves with experience, and acts as a knowledge and problem-history database.
The paper we published at CHEP 2012 [3] introduced the concepts used in our software. The validity of these concepts was demonstrated in several simulations. Since then, the algorithms have been improved, the code consolidated, and manipulation tools developed. Further simulations were run to test the software's capabilities in more depth, and it has now been deployed on a much larger scale in the LHCb Online environment.

LISA: LearnIng approach for System Administration
In [3], we presented methods that address problems similar to ours: expert systems [4] and autonomic computing principles such as the MAPE-K loop [5]. Building on these historical approaches and adding innovative concepts such as the Shared Experience principle, we now define the methodology of our framework as follows:
• Linux systems represent the greatest share of the Online environment. We thus decided to focus only on them. Network or Windows-based machine diagnoses are not addressed.
• Because of the great variety of software running on the LHCb Online HLT farm, our solution needs to be as generic as possible. As files and processes are the components of any application, we decided to use them as the basic blocks for our diagnoses. Each type of problem that can be encountered with such entities (wrong file permission, wrong process user, etc.) is associated with a default recovery solution. The approach is generic enough that it should in principle carry over to Windows servers, but we have not addressed them.
• The software performs no monitoring; it waits to be informed of problems by external sources.
• Existing implementations associate one MAPE-K loop instance with one system and rely on multi-agent theory for synchronization and cooperation. Our approach is to have a single loop for all the systems. This allows the software to spot the dependencies between the various systems.
• By using Reinforcement Learning algorithms, we improve the diagnostic speed and scalability by reducing the number of components that are checked before the faulty one is found.
• The Shared Experience principle consists of sharing experience between similar systems (like two websites). It shortens the learning phase of the learning algorithms and reduces the description workload of the users.
• Using Convention over Configuration [6] helps reduce the configuration work required by the software.
• Our software offers a default recovery solution, with the full procedure needed for the fix to be taken into account, as well as information regarding previously encountered situations on the same problematic entity. However, users have to perform the correction themselves.

Phronesis
Our implementation of the above methodology is called Phronesis. It is divided into several modules, described in this section.

Compiler
We defined a new configuration grammar that allows us to describe services as a composition of files, processes and other services. This grammar is inspired by the object model: the objects are mainly files, processes or services, and inheritance is used to express the Shared Experience principle. The user can also define two types of rules:
• Dependency rule: this rule states that one service needs another one to be fully functional.
• Recovery rule or Trigger: this rule lists what a given recovery action involves. For example, if the recovery action consists of changing the content of a file, one recovery rule could state that a process must be stopped before the file is changed, and another that it must be started again after the modification.
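To make the object-model analogy concrete, here is a minimal Python sketch of services composed of files, processes and sub-services, with a parent link standing in for the inheritance that expresses Shared Experience. It is not the Phronesis grammar: the class names, fields and example paths are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FileEntity:
    path: str


@dataclass
class ProcessEntity:
    name: str
    user: str = "root"


@dataclass
class Service:
    name: str
    files: list = field(default_factory=list)
    processes: list = field(default_factory=list)
    subservices: list = field(default_factory=list)
    parent: Optional["Service"] = None            # hypothetical stand-in for inheritance
    requires: list = field(default_factory=list)  # Dependency rules

    def entities(self):
        """Files and processes of this service, including inherited and nested ones."""
        inherited = self.parent.entities() if self.parent else []
        own = self.files + self.processes
        nested = [e for s in self.subservices for e in s.entities()]
        return inherited + own + nested


# A generic web site description, shared by every concrete site.
web_site = Service(
    "web_site",
    files=[FileEntity("/etc/httpd/conf/httpd.conf")],
    processes=[ProcessEntity("httpd", user="apache")],
)
database = Service("database", processes=[ProcessEntity("mysqld", user="mysql")])

# site_a inherits the generic description and depends on the database.
site_a = Service(
    "site_a",
    files=[FileEntity("/var/www/site_a/index.html")],
    parent=web_site,
    requires=[database],
)
```

In this sketch, anything learned about the generic `web_site` entities would apply to every service that names it as a parent, which is the intent behind describing similar systems once.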
The compiler was developed in Python using the pyparsing library [7]. Python was chosen for its dynamic characteristics, such as introspection and dynamic typing. The compiler reads the configuration files and produces an SQL script as output. One critical aspect of the compilation is not to lose the experience previously gained by the Reinforcement Learning algorithm. This is achieved using custom graph-matching algorithms between the configuration files and the current content of the database.

Remote Agent
The remote agent is a program that runs on all the machines the user wants to supervise. Its only purpose is to answer queries from the Core (see 3.3). A query can concern any attribute of files, processes or the general environment. The difficulties lie purely at the technical level and are implementation details. The agent is developed in C++, using several Boost libraries [8].
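As an illustration of the kind of answer such an agent returns, the sketch below handles a query about the attributes of a single file. It is written in Python for brevity, whereas the actual agent is written in C++; the function name and the shape of the reply are assumptions, not the Phronesis protocol.

```python
import os
import stat


def file_attributes(path):
    """Answer a (hypothetical) query about one file with a plain dictionary."""
    try:
        info = os.stat(path)
    except OSError as error:
        # Missing file, permission problem, etc.: report it instead of failing.
        return {"path": path, "exists": False, "error": error.strerror}
    return {
        "path": path,
        "exists": True,
        "owner_uid": info.st_uid,
        "group_gid": info.st_gid,
        "mode": stat.filemode(info.st_mode),  # e.g. '-rw-r--r--'
        "size": info.st_size,
    }
```

The agent holds no diagnostic logic here: it only reports facts, and the decision about whether an attribute is wrong is left to the Core.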

Core
The "Core" module is the central part of the software; it contains all the algorithms used to diagnose problems and offer recovery solutions. The main algorithms are listed here:
• Sorting algorithm: when several problems are reported at the same time, this algorithm has to decide in which order they are analyzed. The order is very important for performance reasons, but also because there might be situations in which one problem cannot be solved before the others are. This algorithm uses Dependency rules to establish the order.
• Recovery algorithm: once the root cause of a problem is found, it can usually be fixed quite easily (e.g. fix a corrupted file, restart a process). For the changes to be taken into account, extra actions might be required. These actions are defined by Recovery rules. The complication comes from the fact that actions can be required before or after the fix is applied. Computing the full chain of events is a non-trivial task.
• Reinforcement Learning algorithm: the reinforcement learning algorithm is used to optimize the exploration path from a reported problem to its faulty component. The chosen method consists of keeping track of the paths that were successful in previous cases. Each path has an associated counter, which is incremented each time the path leads to the faulty component. When a new problem is reported, one can rely on these counters to choose the most appropriate path. There are two strategies: either sorting the counters in decreasing order, or making a weighted random choice. Simulations (see 4.1) show that, on average, both strategies are equivalent. Although simple, this counter-based method has great advantages. If a path is reinforced when it should not have been, the user can very easily correct it. The user can also provide a priori knowledge. Finally, from a technical point of view, applying the Shared Experience principle to this method is straightforward.
• Dependency algorithm: one of the most interesting features of our software is its ability to find dependencies between services based on previous experience. This capacity allows our software to infer new Dependency rules, and thus to provide better diagnoses.
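The counter-based exploration can be sketched as follows. This is a simplified Python illustration rather than the C++ code of the Core: the class and method names are hypothetical, each "path" is reduced to the candidate component it leads to, and the `+ 1` added to every weight (so that unexplored candidates can still be drawn) is our assumption.

```python
import random
from collections import Counter


class PathCounters:
    """Counts how often each candidate turned out to be the faulty component."""

    def __init__(self, prior=None):
        # `prior` lets the user inject a priori knowledge.
        self.counts = Counter(prior or {})

    def reinforce(self, candidate):
        self.counts[candidate] += 1

    def ordered(self, candidates):
        """Strategy 1: explore candidates by decreasing counter."""
        return sorted(candidates, key=lambda c: self.counts[c], reverse=True)

    def weighted(self, candidates, rng=random):
        """Strategy 2: draw an exploration order by weighted random choice."""
        remaining = list(candidates)
        order = []
        while remaining:
            weights = [self.counts[c] + 1 for c in remaining]  # assumed smoothing
            pick = rng.choices(remaining, weights=weights, k=1)[0]
            order.append(pick)
            remaining.remove(pick)
        return order


counters = PathCounters(prior={"disk": 1})
for _ in range(3):
    counters.reinforce("httpd process")

candidates = ["httpd.conf", "httpd process", "disk"]
ranking = counters.ordered(candidates)
```

Because the state is just a table of integers, correcting a wrongly reinforced path amounts to editing one counter, which is the property the text highlights.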
The implementation is done in C++ and uses Boost libraries. It can be run as a daemon, as an interactive program, or as a one-off full check of all the services known to it.
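To illustrate why computing the full chain of events for a recovery is not trivial, the following Python sketch expands a fix into the ordered list of actions implied by Recovery rules, where each rule may require actions before and after the fix, and those actions may have rules of their own. The rule format, the function name and the action labels are hypothetical; the real Core is written in C++.

```python
def expand(action, rules, _active=()):
    """Return the ordered list of actions needed to carry out `action`.

    `rules` maps an action to a pair (actions_before, actions_after).
    """
    if action in _active:
        raise ValueError(f"cyclic recovery rules at {action!r}")
    before, after = rules.get(action, ((), ()))
    chain = []
    for extra in before:
        chain += expand(extra, rules, _active + (action,))
    chain.append(action)
    for extra in after:
        chain += expand(extra, rules, _active + (action,))
    return chain


rules = {
    # Changing the file requires stopping the process first and starting it afterwards.
    "replace httpd.conf": (["stop httpd"], ["start httpd"]),
    # Hypothetical nested rule: starting the process implies a further action.
    "start httpd": ([], ["reload proxy"]),
}

chain = expand("replace httpd.conf", rules)
```

This sketch does not merge actions that appear several times in the chain; a fuller treatment would have to decide whether repeated actions should be deduplicated.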

Tools
There are two kinds of interaction between the software and the user: output communication, so that the user knows what the software is doing, and input communication, for the user to report problems or give feedback. This bidirectional communication is made possible by an Application Programming Interface (API). The output communication is based on the Observer pattern [9], while the input messages are similar to Remote Procedure Calls. Based on the API, several ready-to-use user interfaces were developed:
• phrUtils: a command line tool.
• phrGUI: a GUI based on the Qt framework [10], currently being prototyped.
• phrXml: only for output communication. It stores all the output in an XML-file-based ring buffer.
• phrSimu: an interface used by our simulation software to test the algorithms.
• phrIcinga: an interface that gathers data from Icinga [11], the monitoring software used at LHCb.
• phrWeb: a web interface based on phrXml and the Django framework [12].
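The two communication directions can be pictured with a small Python sketch: observers subscribe to output messages, while input arrives through calls that resemble remote procedures. The class and method names are assumptions made for illustration and do not reflect the actual Phronesis API; the bounded `deque` only mimics the ring-buffer behavior described for phrXml.

```python
from collections import deque


class CoreApi:
    """Hypothetical facade: Observer pattern for output, RPC-like calls for input."""

    def __init__(self):
        self._observers = []

    def subscribe(self, callback):
        # Output side: every registered observer receives every message.
        self._observers.append(callback)

    def publish(self, message):
        for callback in self._observers:
            callback(message)

    def report_problem(self, service):
        # Input side: in a real deployment this call would arrive over the network.
        self.publish(f"problem reported on {service}")


# A ring buffer keeping only the most recent messages.
recent = deque(maxlen=2)

api = CoreApi()
api.subscribe(recent.append)
api.report_problem("web_site")
api.publish("checking httpd process")
api.publish("diagnosis: httpd process not running")
```

With this separation, adding a new front end means registering one more observer; the component producing the messages does not need to change.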

Simulations
To test our algorithms, it was important to be able to simulate realistic situations. To achieve this, we developed a complete set of tools to produce Monte-Carlo simulations. For simulations, Phronesis needs to be compiled in a particular way. The reason is that the simulation tool tests the algorithms of the Core module, not the code quality of the Agents: under normal usage, remote servers are queried to get information before processing it; in simulation mode, the query is intercepted and a local Agent is instructed what to return. This allows us to test Phronesis on a single local machine. Another program is used to randomly generate problems based on user input, inject signals into the Core to mock the agents' analysis, interact with it to confirm or deny its diagnoses, and produce statistics about the behavior of Phronesis. This tool can reproduce almost any kind of environment. Various situations were simulated, which validated the importance of Dependency rules as well as the Shared Experience principle. The simulations also showed that the two strategies for exploring a faulty service mentioned earlier are equivalent on average.
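The idea of intercepting agent queries can be sketched in a few lines of Python. Here a scripted agent replaces the remote one, a random fault is injected at each trial, and the number of checks needed to find it is recorded. All names are hypothetical, and the diagnosis loop is deliberately naive (it checks candidates in a fixed order), so the sketch shows the structure of such a simulation, not the behavior of Phronesis.

```python
import random


class SimulatedAgent:
    """Returns scripted answers instead of querying a remote server."""

    def __init__(self, healthy):
        self.healthy = dict(healthy)  # entity -> True if healthy
        self.queries = 0

    def check(self, entity):
        self.queries += 1
        return self.healthy[entity]


def diagnose(agent, candidates):
    """Check candidates in order and return the first faulty one, if any."""
    for entity in candidates:
        if not agent.check(entity):
            return entity
    return None


def average_checks(entities, trials, seed=0):
    """Monte-Carlo estimate of the number of checks needed per diagnosis."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        faulty = rng.choice(entities)  # randomly generated problem
        agent = SimulatedAgent({e: e != faulty for e in entities})
        diagnose(agent, entities)
        total += agent.queries
    return total / trials


entities = ["httpd.conf", "httpd process", "disk"]
estimate = average_checks(entities, trials=1000)
```

Swapping the fixed exploration order in `diagnose` for a learned one is where different strategies could be compared on the same injected faults.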

Real case application
Phronesis is now being deployed on the entire LHCb Online cluster. Note that it does not replace any solution already in place, but is expected to complement it. At the time of writing, a fair fraction of the LHCb Online system is already covered, and the diagnoses we had the opportunity to trigger proved useful. Systems under Phronesis' supervision include the log aggregation cluster, the event filter software, the web services and the monitoring infrastructure. Although there were only a small number of unexpected and unprovoked situations, Phronesis made several correct diagnoses and offered appropriate solutions. Among these, several diagnoses were a direct consequence of the Convention over Configuration approach, because the root cause pointed at elements which the user had not defined manually. Examples of diagnoses are:
• Full inodes for log servers: the log servers store a large number of tiny files (around 50 000 files with a median size of 100 Kb) on a clustered file system. As a consequence, the pool of inodes was exhausted well before the actual storage space. The solution, correctly suggested by Phronesis, was to remove files. In fact, this problem was spotted before it actually happened, thanks to the default threshold set to 99% of used inodes. This was fortunate, because otherwise all the new logs that would have required a new file would have been silently lost.
• Incorrect mount options on a web service: one of the web services required a particular folder to be mounted with the write option, which was not the case. Phronesis suggested remounting it with the appropriate option. Although correct, this would not have worked immediately, because an NFS server over which Phronesis had no control was not configured to accept it.
• Incorrect DIM [13] name server address: the file containing the information was corrupted.
• Various problems on the monitoring infrastructure: mail alerts not being sent, tracked down to a process not running; out-of-date results, tracked down to a full disk; and checks not executed because some servers were not running, are a few of the issues that Phronesis correctly diagnosed.
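The inode diagnosis above boils down to comparing inode usage with a threshold. A minimal Python sketch of such a check, assuming a Unix system where `os.statvfs` is available, could look like this; the function names are ours, and only the 99% default comes from the text.

```python
import os


def inode_usage_percent(total, free):
    """Percentage of inodes in use, or None if the file system reports none."""
    if total == 0:
        return None
    return 100.0 * (total - free) / total


def inodes_nearly_exhausted(path, threshold=99.0):
    """True if the file system holding `path` has used at least `threshold` % of its inodes."""
    stats = os.statvfs(path)  # Unix only
    usage = inode_usage_percent(stats.f_files, stats.f_ffree)
    return usage is not None and usage >= threshold
```

A file system can fail this check while still having plenty of free blocks, which is the situation the log servers were in.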
In some cases, Phronesis completely missed the root cause of the problems. We have observed two types of failures:
• Errors due to a situation not foreseen in the design. Examples are disk errors or cluster setups. When it did not imply heavy modifications, the code was improved. Other cases were left for future developments.
• Errors due to incomplete configuration, like missing information or an unsupervised service. The configuration was always updated to cover future occurrences of similar cases.

Outlook
There is still considerable room for improvement, both in terms of the technical implementation and of functionality. This includes (i) an extension of the configuration grammar, which is unfortunately more verbose than we had hoped at the beginning, (ii) better native support for cluster systems, and (iii) dynamic constraints on the properties of files and processes. The plan is to add more systems under the supervision of Phronesis and to add coverage for the corner cases. We hope to be able to release it as an open source solution that the community would pick up and develop further.

• Various problems on MySQL servers: running out of disk space and errors in the configuration files were among the problems diagnosed by Phronesis on the MySQL database.

20th International Conference on Computing in High Energy and Nuclear Physics (CHEP2013), IOP Publishing, Journal of Physics: Conference Series 513 (2014) 062021, doi:10.1088/1742-6596/513/6/062021