Using container orchestration to improve service management at the RAL Tier-1

In recent years container orchestration has been emerging as a means of gaining many potential benefits compared to a traditional static infrastructure, such as increased utilisation through multi-tenancy, improved availability due to self-healing, and the ability to handle changing loads due to elasticity and auto-scaling. To this end we have been investigating migrating services at the RAL Tier-1 to an Apache Mesos cluster. In this model the concept of individual machines is abstracted away and services are run in containers on a cluster of machines, managed by schedulers, enabling a high degree of automation. Here we describe Mesos and the infrastructure deployed at RAL, and present in detail the example of running a batch farm on Mesos.


Introduction
At Rutherford Appleton Laboratory (RAL) we run a Tier-1 computing facility used by all four Large Hadron Collider (LHC) experiments in addition to an increasing number of other experiments. We have been deploying production services on both bare metal and a variety of virtualisation platforms for many years. Despite the significant simplification of configuration and deployment of services due to the use of virtual machines (VMs) and a configuration management system, maintaining services still requires a lot of effort. Deployment, routine maintenance and updates all require manual work. In addition, the current approach of running services on static machines results in a lack of fault tolerance. Services are assigned to fixed numbers of VMs with fixed hostnames and therefore depend on the availability of both the VMs and hypervisors. While we have recently started migrating to a cluster of hypervisors with shared storage, allowing live migration of VMs to take place if necessary, this introduces a shared storage system as a single point of failure.
In the current climate more and more non-LHC communities are becoming important, potentially requiring additional instances of existing services as well as entirely new services, while at the same time staff effort is more likely to decrease than increase. It is therefore important that we are able to reduce the effort required to maintain services whilst ideally improving availability, in addition to being able to maximise the utilisation of resources and become more adaptive to changing conditions and requirements.
These problems are not unique to RAL, and from looking at what is happening in the wider world it is clear that container orchestration has the potential to provide a solution to many of these issues. We therefore began investigating the migration of applications to an Apache Mesos [1] cluster. In such an environment any host or application failures, as well as procedures such as rolling restarts or upgrades, can be handled automatically and no longer require any human intervention. Similarly, the number of instances of applications can be scaled automatically in response to changes in load. On top of this it also gives us the important benefit of being able to run a wide range of services on a single set of resources without involving virtualisation. The benefits of using Docker containers [2] rather than VMs have been investigated previously (e.g. [3], [4]); however, the primary focus has generally been on performance. It is when containers are managed by schedulers that further benefits become apparent. This is not yet a technology that is used widely within the high energy physics community, but it has been slowly gaining momentum in recent years [5][6][7].
In this paper we will begin with an introduction to Mesos and describe the Mesos infrastructure that has been deployed at RAL. We then discuss how we deal with the challenges of service discovery, monitoring and logging in a dynamic environment. As an example use case we will demonstrate how we have run production jobs from the LHC experiments on Mesos.

Mesos
Apache Mesos is a cluster manager that originated at the University of California, Berkeley [8], and later became a top-level project of the Apache Software Foundation. At the time many "Big Data" applications were emerging, such as Hadoop [9], but the only way for an organisation to run more than one such application was to partition resources and have one application per cluster. This approach is clearly a very inefficient use of resources and cannot achieve a high level of utilisation or flexibility. Mesos was developed as a means for avoiding this problem by providing a way for multiple clustered applications to share a single set of resources.

Mesos architecture
In principle, the idea of allowing multiple distributed applications, which were all developed independently and have their own scheduling policies and requirements, to share resources is very complex. In particular, having a single monolithic scheduler that has to encompass the scheduling decisions from many applications would be particularly complex and not scalable. Mesos simplifies the problem by using an abstraction to separate the allocation of resources and the scheduling of tasks.
An overview of the Mesos architecture is shown in Figure 1. The Mesos master manages the cluster and the Mesos agents provide resources. For fault tolerance it is standard practice to run multiple Mesos masters, using ZooKeeper [10] for leader election and for sharing state. A framework is a distributed application consisting of two parts: a scheduler and an executor. The master offers resources to frameworks based on configured policies. The scheduler of each framework decides which resources to accept and how to make use of them, and provides Mesos with descriptions of the tasks to be run, which are then executed on the appropriate agent nodes. The executor, part of each framework, is used to actually run the tasks. Since each framework is entirely responsible for making use of the resources it is offered, it is possible to have multiple, potentially fundamentally different, schedulers sharing the same resources. For example, one framework could be designed for long-running applications and another for short-lived batch jobs.
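To make the resource-offer model more concrete, the following is a minimal sketch of a framework scheduler subscribing to the Mesos v1 scheduler HTTP API and printing the offers it receives. The master address is hypothetical, and a real scheduler would also accept or decline each offer and launch tasks.

import json
import requests

MASTER = "http://mesos-master.example:5050"   # hypothetical master address

def records(response):
    # The scheduler API returns a RecordIO stream: "<length>\n<json record>" repeated.
    buf = b""
    for chunk in response.iter_content(chunk_size=1024):
        buf += chunk
        while b"\n" in buf:
            length, rest = buf.split(b"\n", 1)
            n = int(length)
            if len(rest) < n:
                break                          # wait for the rest of this record
            yield json.loads(rest[:n])
            buf = rest[n:]

# Register ("subscribe") a framework with the master.
subscribe = {"type": "SUBSCRIBE",
             "subscribe": {"framework_info": {"user": "nobody",
                                              "name": "offer-logger"}}}
response = requests.post(MASTER + "/api/v1/scheduler", json=subscribe,
                         headers={"Accept": "application/json"}, stream=True)

# Print the resources offered to this framework; a real scheduler would respond
# with ACCEPT or DECLINE calls and launch tasks on the chosen agents.
for event in records(response):
    if event["type"] == "OFFERS":
        for offer in event["offers"]["offers"]:
            print(offer["hostname"], offer["resources"])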

Marathon
Marathon [11] is a Mesos framework designed for managing long-running applications, including groups of applications with dependencies. Marathon ensures that the specified number of instances of each application is running. If a Mesos agent were to fail, Marathon would start new instances elsewhere of any applications previously running on that node. Health checks can be defined for each application, allowing Marathon to kill unhealthy instances so that they can be replaced.
Marathon itself can easily be configured and deployed to be highly available. Multiple instances of Marathon can be run (typically three or five), again making use of ZooKeeper for leader election.
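As an illustration, the sketch below submits a hypothetical application definition to Marathon's REST API. The application id, image name, port and health-check parameters are illustrative assumptions rather than our production configuration.

import requests

MARATHON = "http://marathon.example:8080"   # hypothetical Marathon endpoint

app = {
    "id": "/frontier-squid",                 # illustrative application id
    "instances": 3,                          # Marathon keeps three copies running
    "cpus": 1,
    "mem": 2048,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "docker-registry.example/frontier-squid:1.0",
                   "network": "BRIDGE",
                   "portMappings": [{"containerPort": 3128}]},
    },
    # Replace any instance that fails three consecutive TCP health checks.
    "healthChecks": [{"protocol": "TCP", "portIndex": 0,
                      "gracePeriodSeconds": 60, "intervalSeconds": 30,
                      "maxConsecutiveFailures": 3}],
}

requests.post(MARATHON + "/v2/apps", json=app).raise_for_status()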

Mesos at RAL
There are two clear use cases for Mesos in a facility like the RAL Tier-1. One is managing long-running applications such as grid middleware. The second is providing a generic platform that can be used for running multiple computing activities serving multiple communities.

Deployment
We are using five Mesos masters deployed on VMs hosted on our Hyper-V virtualisation infrastructure. Each of these nodes also runs ZooKeeper, Marathon and a Consul server (see below). The choice of five nodes means we can lose up to two without any effect, giving a high level of fault tolerance. For example, one node could be undergoing an intervention and another node could fail without causing any problems. For fault-tolerant distributed systems consisting of five nodes, a minimum of three functioning nodes is required for quorum to be maintained; by design ZooKeeper, the Mesos masters and Consul will each stop working when quorum is lost.
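The majority-quorum arithmetic behind this choice is simple; the short calculation below shows why a five-node ensemble tolerates two failures whereas a three-node ensemble tolerates only one.

def quorum(n_nodes):
    # Majority quorum for an ensemble of n_nodes (ZooKeeper, Consul, Mesos masters).
    return n_nodes // 2 + 1

for n in (3, 5):
    print("%d nodes: quorum %d, tolerates %d failure(s)" % (n, quorum(n), n - quorum(n)))
# 3 nodes: quorum 2, tolerates 1 failure(s)
# 5 nodes: quorum 3, tolerates 2 failure(s)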
We are using entirely bare metal for the Mesos agents, using hardware that would normally be used as standard grid worker nodes. We have tested up to almost 300 nodes in the Mesos cluster, corresponding to over 7000 cores, and have not experienced any scaling issues. This is not unexpected as Mesos is known to scale to tens of thousands of nodes, and indeed is used in production at this scale at companies such as Twitter.
The Mesos masters and agents are fully configured using Quattor [12], our configuration management system. In principle our templates could be shared with other sites using Quattor. The configuration required is relatively straightforward, and in fact simpler than that of the grid middleware typically run at sites providing resources to the LHC experiments.

Image creation and storage
A Docker registry is required in order to store container images. In order to avoid dependencies on external services we use a private Docker registry with Ceph [13] as the storage backend. A central Docker registry instance, with authentication provided via an httpd proxy, is used for write access. In addition we run a read-only registry on every Mesos agent. This avoids having a single point of failure as well as a potential bottleneck: a single registry instance could quite easily be overloaded, whereas with a registry on every agent images can be pulled to many Mesos agents simultaneously without any performance issues, as the image data is read from the Ceph cluster directly by each agent. Since the Docker registry is very lightweight, e.g. memory usage is typically well under 100MB and CPU usage is minimal, running an instance on every agent does not have a significant impact on the total resources available in the cluster. So far we have been creating Docker images manually and pushing them to the private Docker registry, but we intend to develop a solution where images can be created automatically from our configuration management system and pushed to the registry.
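A minimal sketch of the manual build-and-push step, using the Docker SDK for Python, is shown below; the registry address, image name and build directory are hypothetical.

import docker

REGISTRY = "docker-registry.example:5000"      # hypothetical private registry

client = docker.from_env()

# Build an image from a local directory containing a Dockerfile and tag it
# for the private registry (the SDK returns the image object and a build log).
image, build_log = client.images.build(path="./frontier-squid",
                                       tag=REGISTRY + "/frontier-squid:1.0")

# Push to the central read-write registry; the read-only registries on the
# Mesos agents then serve the image layers from the shared Ceph backend.
for line in client.images.push(REGISTRY + "/frontier-squid", tag="1.0",
                               stream=True, decode=True):
    print(line)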

Metrics
We use Telegraf [14] to collect host metrics as well as application-specific metrics from the infrastructure, making use of existing plugins for applications such as Mesos, ZooKeeper and Consul. cAdvisor [15] is used to collect resource usage and application metrics from containers. Metrics are stored in InfluxDB [16] and Grafana [17] is used for visualisation. Rather than having cAdvisor send metrics directly to InfluxDB, a custom script is used to tag metrics collected by cAdvisor with information from Mesos, such as the task ID, before sending them to InfluxDB. Having this information as metadata greatly simplifies the aggregation of metrics from different instances of particular applications.
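The following is a minimal sketch of such a tagging script, not our production code: it assumes that containers launched via the Docker containerizer carry a MESOS_TASK_ID environment variable, and the cAdvisor and InfluxDB endpoints, database name and measurement name are illustrative.

import docker
import requests

CADVISOR = "http://localhost:8080"                          # cAdvisor on the agent
INFLUXDB = "http://influxdb.example:8086/write?db=mesos"    # hypothetical endpoint

client = docker.from_env()
points = []
for container in client.containers.list():
    env = dict(e.split("=", 1) for e in container.attrs["Config"]["Env"])
    task_id = env.get("MESOS_TASK_ID")
    if task_id is None:
        continue                       # skip containers not launched by Mesos

    # Latest resource usage sample for this container from cAdvisor.
    stats = requests.get("%s/api/v1.3/docker/%s" % (CADVISOR, container.id)).json()
    latest = list(stats.values())[0]["stats"][-1]

    # InfluxDB line protocol, with the Mesos task ID attached as a tag.
    points.append("container_memory,task_id=%s usage=%di"
                  % (task_id, latest["memory"]["usage"]))

requests.post(INFLUXDB, data="\n".join(points))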

Logging
While it is possible to log in to Mesos agents to find log files, this is not ideal since applications are not tied to specific nodes, can move around, and the number of instances can change. We use Filebeat [18] to send the standard output and error from every container to Logstash [19]. Logstash is configured to tag each log with the Mesos task ID and application name, and sends the data to Elasticsearch [20]. Kibana [21] can then be used for viewing logs.
Logs from the Mesos masters and agents are also sent to Logstash for parsing and filtering before being forwarded to Elasticsearch. This gives us a record of all events relating to tasks running on the Mesos cluster, for example containers being created or killed, and any failures.

Service discovery
In our traditional infrastructure it is standard practice to hardwire static hostnames or static DNS aliases into configuration files. This only makes sense in a static environment where hostnames rarely change. In a dynamic environment an alternative method of accessing services must be used.
We make use of Consul [22], a distributed system for service discovery. A daemon called Registrator [23] runs on each Mesos agent, monitoring the local Docker engine and watching for containers being created and destroyed, and updates a catalog within Consul. We run a Consul server on each of the five Mesos master nodes, and a Consul agent on every Mesos agent.
Within the Mesos cluster, services can be discovered and accessed via a dynamic DNS server provided by Consul. External access to services in the Mesos cluster is provided by a pair of highly available load balancers based on HAProxy [24] and Keepalived [25]. HAProxy performs load balancing, with configuration updated dynamically by Consul, and Keepalived provides highly available floating IP addresses.
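As a sketch of how a client inside the cluster can locate a service, the example below queries the local Consul agent via both the HTTP API and the DNS interface; the service name is a hypothetical one registered by Registrator.

import socket
import requests

CONSUL = "http://localhost:8500"    # Consul agent running on every node

# HTTP API: list healthy instances of a service registered by Registrator.
service = "frontier-squid"          # hypothetical service name
for entry in requests.get(CONSUL + "/v1/health/service/" + service,
                          params={"passing": "true"}).json():
    address = entry["Service"]["Address"] or entry["Node"]["Address"]
    print(address, entry["Service"]["Port"])

# DNS interface: the same service can be resolved as <name>.service.consul.
print(socket.gethostbyname(service + ".service.consul"))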

Running LHC jobs on Mesos
Since we are interested in using Mesos for both long-running applications as well as a flexible platform for compute resources, running jobs from LHC experiments is a relevant and interesting example use case that involves both aspects.
As a starting point we dynamically create containerised HTCondor [26] worker nodes which join our existing production HTCondor pool and run jobs from the LHC experiments that were submitted to our standard grid Computing Elements (CEs). An alternative to using complete HTCondor worker nodes in containers would be to run individual jobs directly as Mesos tasks. The LHC jobs also depend on squid caches, used for access to software via CVMFS and conditions data via Frontier, and a stateless service such as squid is an ideal application to be managed by Marathon. The squid containers we developed expose a number of application-specific metrics, such as the request rate and hit rates, which are collected automatically by cAdvisor. This data can then be used for auto-scaling the number of squid instances. We have been using the squid request rate alone for scaling and this seems to work quite well. Of course, since squids are caches that need to be warmed up to give best performance, in practice it is best to always run an adequate number of instances rather than creating and destroying instances too frequently, and let auto-scaling handle more extreme situations; a minimal sketch of such a scaling loop is given below. In Figure 2 we show an example where an increase in the request rate results in additional squid tasks being created.
While it is also possible to manage containerised HTCondor worker nodes using Marathon, scaling down or performing upgrades becomes problematic: Marathon may kill worker nodes that could be running jobs, which is something we want to avoid. As an alternative to Marathon we have developed a custom Mesos framework for the sole purpose of running a pool of worker nodes. Scaling down is achieved by draining existing worker nodes, which are configured to exit after being idle for a specified period. Similarly, rolling upgrades can be achieved by draining existing worker nodes and creating new instances using the updated image.
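Returning to the squid auto-scaling described above, the following is a minimal sketch of a scaling loop rather than our production implementation: the Marathon and InfluxDB endpoints, the measurement and application names, and the per-instance request capacity are all illustrative assumptions.

import requests

MARATHON = "http://marathon.example:8080"            # hypothetical endpoints
INFLUXDB = "http://influxdb.example:8086/query"
APP_ID = "/frontier-squid"                           # hypothetical application id
REQS_PER_INSTANCE = 500                              # assumed capacity of one squid

# Mean request rate over the last five minutes (measurement name is an assumption).
query = "SELECT mean(request_rate) FROM squid WHERE time > now() - 5m"
result = requests.get(INFLUXDB, params={"db": "mesos", "q": query}).json()
rate = result["results"][0]["series"][0]["values"][0][1]

# Current number of instances according to Marathon.
app = requests.get(MARATHON + "/v2/apps" + APP_ID).json()["app"]
current = app["instances"]

# Keep between 2 and 10 instances; avoid churn by only scaling when needed.
wanted = max(2, min(10, int(rate / REQS_PER_INSTANCE) + 1))
if wanted != current:
    requests.put(MARATHON + "/v2/apps" + APP_ID, json={"instances": wanted})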
We use worker node containers that both set up the required CVMFS repositories and run an HTCondor startd. This means that the Mesos agents do not need any worker-node-related configuration or software installed, and allows us to run as many worker nodes as necessary without having to dedicate a set of resources configured as worker nodes for the LHC experiments. The downside currently, however, is that privileged containers must be used in order for CVMFS to work, as it uses the FUSE kernel module. In future we intend to improve this by, for example, running worker nodes in unprivileged containers which access CVMFS provided by dedicated privileged containers.
Since CVMFS is in the same container in which the jobs run, a new cache needs to be populated every time a new instance starts, so it is more efficient to run multiple jobs in a single container. This in turn means it is important to provide isolation between the different jobs. We make use of HTCondor's existing ability to run jobs in their own cgroups, in order to constrain CPU and memory usage if necessary. These cgroups are children of the Docker container in which the worker node is running, as shown in Figure 3. This means that the resource usage of all jobs on a worker node is limited by the resources allocated to the worker node container. We also make use of HTCondor's ability to run jobs within PID namespaces so that jobs cannot see any running processes from other jobs in the same container.
For traceability reasons HTCondor in each worker node container is configured so that the image and Mesos task ID are included as startd attributes. In turn, the schedds on the CEs are configured to include these attributes in the ClassAds of running jobs. This means we have a record of the image and container used for each HTCondor job. For example, if we are informed about a security incident involving a particular job we can easily determine which container was used to run that job and on which Mesos agent it ran.
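A sketch of how such information could be retrieved with the HTCondor Python bindings is shown below; the attribute names are purely illustrative assumptions, not the ones used in our configuration.

import htcondor

# Query a CE schedd for running jobs (JobStatus == 2) and print the machine
# attributes recording the worker node image and Mesos task (names assumed).
schedd = htcondor.Schedd()
for ad in schedd.query("JobStatus == 2",
                       ["ClusterId", "ProcId",
                        "MachineAttrMesosTaskId0", "MachineAttrDockerImage0"]):
    print(ad.get("ClusterId"), ad.get("ProcId"),
          ad.get("MachineAttrMesosTaskId0"), ad.get("MachineAttrDockerImage0"))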
During testing we successfully ran production jobs from all four LHC experiments on up to over 5000 cores simultaneously. We found Mesos to be a very reliable platform. As expected, the worker node containers were not affected by the status of the scheduler in the custom framework: the scheduler could be killed or restarted without any effect on the running worker nodes. Since we used a failover timeout of one week, the default value used by other frameworks such as Marathon, the scheduler could be down for up to a week before Mesos would kill all running tasks. Also, the basic "self-healing" ability provided by Marathon means that the squids managed by Marathon have a significant advantage compared to our traditional squids running on static VMs. A squid process could be killed, or a Mesos agent killed, and the squid instance would be replaced automatically. And as previously mentioned, a sudden increase in request rate would trigger the creation of additional squid instances, giving a much better chance of being able to handle the increased load. With our production squids these types of problems all require manual intervention.

Summary and outlook
We have investigated migration of services to containers running on a Mesos cluster and it is clear that there are many benefits compared to a traditional static infrastructure. For long-running applications, container orchestration provides a way of enabling higher availability and a higher degree of automation, thus reducing the manual effort required to manage services. Furthermore, it becomes possible to have a generic and flexible set of resources that can be used for a variety of computational activities. We have successfully tested running jobs from the LHC experiments on Mesos.
Mesos is not yet a production service in the RAL Tier-1. Running services in this way is a significant change in philosophy compared to what's been done over the past decade or more. Since we have to meet high service level agreements for WLCG, moving away from a well-established infrastructure takes time.
Future work will include testing the use of Ceph for providing storage volumes to containers, which will be useful for situations where applications require persistent storage. We will also investigate running OpenStack hypervisors in containers, which could enable us to run both virtual machines and computationally-intensive work in containers on a common platform. Integration with our existing configuration management system is also an important point, in particular the ability to automatically create container images, upload them to our private Docker registry and carry out other actions, such as vulnerability analysis. Finally, it is worth noting that technologies initially looked at as part of the work on Mesos are now used more widely, and in production, within the RAL Tier-1. A load balancer based on HAProxy and Keepalived has been used in front of the file transfer service for over six months, and recently other grid services have been migrated. This enables us to gain higher availability and prevent interventions and exceptions from being visible to users. Also, Telegraf, InfluxDB and Grafana are being used as a replacement for Ganglia for monitoring over 900 hosts and a wide variety of services.