ATLAS Cloud R&D

The computing model of the ATLAS experiment was designed around the concept of grid computing and, since the start of data taking, this model has proven very successful. However, new cloud computing technologies bring attractive features to improve the operations and elasticity of scientific distributed computing. ATLAS sees grid and cloud computing as complementary technologies that will coexist at different levels of resource abstraction, and two years ago created an R&D working group to investigate the different integration scenarios. The ATLAS Cloud Computing R&D has been able to demonstrate the feasibility of offloading work from grid to cloud sites and, as of today, is able to integrate various cloud resources transparently into the PanDA workload management system. The ATLAS Cloud Computing R&D is operating various PanDA queues on private and public resources and has provided several hundred thousand CPU days to the experiment. As a result, the ATLAS Cloud Computing R&D group has gained significant insight into the cloud computing landscape and has identified points that still need to be addressed in order to fully utilize this technology. This contribution will explain the cloud integration models that are being evaluated and will discuss ATLAS' experience in collaborating with leading commercial and academic cloud providers.


Introduction
The ATLAS experiment [1] at the Large Hadron Collider (LHC) is designed to explore the fundamental properties of matter for the next decade. Since LHC start-up in 2009, the experiment has produced and distributed hundreds of petabytes of data worldwide among more than 100 computer centers. Thousands of physicists are engaged in analyzing these billions of events. The ATLAS Computing model [2] is based on a grid paradigm [3], with multilevel, hierarchically distributed computing and storage resources. However, new cloud computing technologies bring attractive features to improve the operations and elasticity of scientific distributed computing. ATLAS sees grid and cloud computing as complementary technologies that will coexist at different levels of resource abstraction, and two years ago created an R&D working group to investigate the different integration scenarios. Earlier ATLAS studies in virtualization and cloud computing were reported elsewhere [4]. In this paper we will describe some of the cloud utilization models that are being evaluated as well as ATLAS' experience in collaboration with leading commercial and academic cloud providers. ATLAS utilizes the PanDA workload management system [5] (WMS) for distributed data processing and analysis.
Distributed cloud computing: a grid of clouds
ATLAS has been evaluating a distributed cloud computing system for running Monte Carlo simulation jobs. A distributed cloud computing system provides a number of attractive features. Virtualization shields the application suite from changing technologies and reduces the need for systems personnel to be knowledgeable about the user application, allowing resource centers that do not have expertise in grid computing or ATLAS software to nevertheless contribute to ATLAS distributed computing. Also, Infrastructure as a Service (IaaS) clouds provide a simple way to dynamically manage the load between multiple projects within a single center, and a distributed cloud aggregates heterogeneous clouds into a unified resource with a single entry point for users (namely, a PanDA queue). The idea of a distributed cloud was initially suggested by Keahey et al [6], who called it "sky computing"; it could also be termed a "grid of clouds".
The scheduling of jobs is performed by HTCondor [7] and the management of virtual machines (VMs) by Cloud Scheduler [8], a general-purpose component. HTCondor is a well-known job scheduler designed as a cycle scavenger, making it an ideal fit for a dynamic environment where VM resources appear and disappear on demand. Cloud Scheduler periodically reviews the requirements of the jobs in the HTCondor batch queue and makes requests to boot user-specified VM images on any of the IaaS clouds that meet those requirements. VMs are reused to run further jobs until no matching jobs remain, which minimizes the overhead of booting and shutting down VMs. CernVM [9] images are used (both Xen and KVM flavors), and Puppet [10] is used to manage the system configuration of the VMs.
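The matching cycle described above can be sketched in Python. This is an illustrative sketch only: the class and function names are assumptions for this example, not the actual Cloud Scheduler code.

```python
# Hypothetical sketch of a Cloud Scheduler-style matching cycle:
# review queued jobs, boot matching VM images on clouds with capacity,
# and reuse already-running VMs instead of booting new ones.

from dataclasses import dataclass, field

@dataclass
class Job:
    image: str   # user-specified VM image required by the job
    cores: int

@dataclass
class Cloud:
    name: str
    free_slots: int
    running: list = field(default_factory=list)  # images of booted VMs

def schedule_cycle(queued_jobs, clouds):
    """One pass: boot a VM for each queued job on the first cloud with capacity.

    VMs are reused: if a matching VM is already running on some cloud,
    no new VM is booted for that job (HTCondor will match the job to it).
    """
    booted = []
    for job in queued_jobs:
        if any(job.image in c.running for c in clouds):
            continue  # an existing VM can pick this job up
        for cloud in clouds:
            if cloud.free_slots >= job.cores:
                cloud.free_slots -= job.cores
                cloud.running.append(job.image)
                booted.append((job.image, cloud.name))
                break
    return booted
```

In the real system the "boot" step is an IaaS API call and the reuse decision is made per cloud, but the shape of the loop is the same: match job requirements against available clouds each cycle.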
The distributed cloud computing system currently uses a total of approximately 15 academic IaaS clouds, located in Canada (Victoria, Edmonton, Quebec and Ottawa), at CERN, in the United States (FutureGrid [11] clouds in San Diego and Chicago), in Australia (Melbourne and Queensland), and in the United Kingdom. The Victoria and FutureGrid clouds run Nimbus [12], while the other clouds use OpenStack [13]. Several different PanDA queues are used, each corresponding to a geographic region. Since April 2012, the system has completed over 1,000,000 ATLAS jobs (running up to 1,000 at a time) on these clouds, located on three different continents. A full description of the system can be found in [14].

Cloud platform on the ATLAS HLT farm
With the LHC collider at CERN currently going through the period of Long Shutdown 1 (LS1), there is a valuable opportunity to use the computing resources of the large trigger farms of the experiments for other data processing activities. In the case of the ATLAS experiment, the TDAQ High Level Trigger (HLT) farm [15,16,17], consisting of more than 1500 compute nodes deployed in the SDX1 area at LHC Point 1, is particularly suitable for running Monte Carlo production jobs. In 2013 ATLAS initiated the design and deployment of a virtualized platform running on the ATLAS TDAQ computing resources, using it to run large groups of CernVM-based virtual machines operating as a single CERN-P1 WLCG site. This platform has been designed to avoid any interference with TDAQ usage of the farm and to guarantee the security of the TDAQ infrastructure.

Cloud data processing at BNL
At Brookhaven National Laboratory (BNL) we have worked on both the utilization of commercial cloud services and on advancing the usage of dedicated local resources via the IaaS approach. Our system consisted of several components which worked together to create an integrated system. These components included Boxgrinder [18] (to automate and generalize VM image creation and customization), HTCondor (both to submit VMs to cloud platforms and to handle batch job scheduling on the virtual cluster), the Open Science Grid [19] worker node middleware (for authorization and file transfer), the CVMFS network file system [20] (to provide globally visible application software), and our custom pilot submission utility, AutoPyFactory (APF), which handled the coordination of both local batch scheduling and VM invocation. We utilized Amazon EC2 [21] using the spot pricing mechanism, and deployed OpenStack to provision local, dedicated resources so that they could be used transparently alongside EC2 resources. Much IaaS usage has focussed on replicating traditional grid/Virtual Organization (VO) entities on VMs running on cloud platforms.
Rather than deal with the complexity and heterogeneity of these systems, we adopted a hybrid approach, with facility-based central services combined with cloud-based cluster execute hosts, all relatively homogeneous. Since the components of the system were also used in other ATLAS cloud projects, we describe them in some detail below. The HTCondor batch system has been used for years in ATLAS for local batch cluster management and scheduling and for traditional grid submission. In the last two years, HTCondor added a full HTCondor-G interface to any resource provided via the Amazon EC2 API, allowing the invocation of EC2 VMs which can then be queried and terminated like any job. We also used HTCondor as the execute-host batch system, with VM execute hosts connecting back to a central manager at BNL to retrieve pilot jobs, which in turn retrieve work from the ATLAS central system at CERN. In 2013, OSG moved to an RPM-based distribution of its middleware. This greatly streamlined its usage on our worker node (WN) images generated by BoxGrinder, since BoxGrinder installs software via the YUM utility. The advent of CVMFS (an HTTP-based, secure, distributed, read-only filesystem) was the development that enabled our hybrid approach to IaaS usage, since no shared filesystem needs to be maintained in the cloud. The dynamic management capability of AutoPyFactory (APF) version 2 was a new addition in late 2013. We had already been using APF for pilot submission to US and European grid sites via HTCondor-G (and in several cases via local cluster submission), so it was logical to extend this usage to EC2 VM submission. Further combining this with local batch submission of pilots, from the same submit host, is what enabled well-integrated management of VM-based batch clusters.
By combining information about execute hosts state (idle, busy, retiring) with the state of the VM job hosting the execute node, we can dynamically scale the size of a cloud-based cluster up and down depending on whatever parameters we choose (the amount of waiting work to be done, spot price, etc.).
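A minimal sketch of such a scaling decision follows. The state names (idle, busy, retiring) come from the description above; the decision rules and thresholds are illustrative assumptions, not the actual APF logic.

```python
# Illustrative sketch: combine execute-host states with the amount of
# waiting work to decide how many VMs to start or retire.
# The policy here is an assumption for the example, not the real APF code.

def scaling_decision(host_states, waiting_jobs, max_vms):
    """Return (vms_to_start, vms_to_retire).

    host_states: list of "idle" | "busy" | "retiring", one per VM host.
    waiting_jobs: number of jobs queued with no slot to run on.
    max_vms: cap on the total number of non-retiring VMs.
    """
    idle = host_states.count("idle")
    active = sum(1 for s in host_states if s != "retiring")

    if waiting_jobs > idle:
        # More waiting work than idle slots: grow, within the cap.
        return (min(waiting_jobs - idle, max_vms - active), 0)
    if waiting_jobs == 0 and idle > 0:
        # No demand: retire idle hosts (busy ones finish their jobs first).
        return (0, idle)
    return (0, 0)
```

In practice other parameters mentioned in the text, such as the current spot price, would feed into the same decision.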

Amazon EC2 Spot Market and OpenStack
Aside from special, unusual cases, standard commercial clouds initially were too expensive for most scientific computing. The advent of EC2 spot pricing, which trades a willingness to tolerate VM terminations for significantly reduced hourly prices, made IaaS an economically viable competitor to site-based dedicated resources. At the same time, the maturation of software platforms for provisioning local, physical resources via a cloud interface (e.g. OpenStack) offered the possibility of combining owned, local resources with remote, commercial resources in a flexible manner. Our system fully realizes this possibility. BNL created a local OpenStack (v4) cloud at the 200-VM scale, which was used alongside EC2 spot-priced resources during a scaling test performed in 2013. In total, 5,000 VMs were invoked and managed using APF, with all the execute hosts joining a single virtual cluster that spanned BNL and the Amazon us-east, us-west-1 and us-west-2 regions. This cluster performed standard ATLAS production simulation jobs (high CPU, low I/O), which staged data in from, and out to, the standard grid storage interfaces at BNL.
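Spot-priced VM requests of the kind used in this test can be expressed through HTCondor-G's EC2 grid type, whose submit description a tool like APF generates. The sketch below renders such a description in Python; the attribute names follow HTCondor's ec2 grid type, while the AMI id, instance type, spot price and credential paths are illustrative placeholders, not the values used in the actual test.

```python
# Sketch: render an HTCondor-G submit description that requests `count`
# EC2 spot instances of a given AMI. Values passed in are placeholders.

EC2_SUBMIT_TEMPLATE = """\
universe              = grid
grid_resource         = ec2 https://ec2.us-east-1.amazonaws.com/
executable            = atlas-wn-{ami}
ec2_ami_id            = {ami}
ec2_instance_type     = {itype}
ec2_access_key_id     = {access_key_file}
ec2_secret_access_key = {secret_key_file}
ec2_spot_price        = {spot:.3f}
queue {count}
"""

def ec2_submit(ami, itype, spot, count,
               access_key_file="/etc/condor/ec2.access",
               secret_key_file="/etc/condor/ec2.secret"):
    """Render a submit file that boots `count` spot instances of `ami`."""
    return EC2_SUBMIT_TEMPLATE.format(ami=ami, itype=itype, spot=spot,
                                      count=count,
                                      access_key_file=access_key_file,
                                      secret_key_file=secret_key_file)
```

Once submitted, each VM appears as a job that can be queried and terminated with the usual HTCondor tools, which is what makes the EC2 and local OpenStack resources manageable from the same submit host.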

Project with Google Compute Engine
ATLAS was invited to participate in the Google Compute Engine (GCE) [22] trial period in August 2012. At that time Google was rolling out a new IaaS cloud platform with modern hardware and a brand new API. After several months of preparation and testing, Google agreed to allocate additional resources for ATLAS: about 5 million core-hours on 4,000 cores. The idea of this project was to run a production site on GCE, on a scale similar to a typical ATLAS Tier-2 site, continuously operational for several weeks. Resources were organized as an HTCondor-based PanDA queue, similar to the setup employed for the EC2 run described earlier in this paper. That allowed for the utilization of common components and the transparent inclusion of cloud resources into the ATLAS computational grid. Since GCE at that time did not allow running user-provided VM images, we built a custom image based on the Google-provided CentOS 6 [23] image. Initially a set of custom scripts was used for image building; later a set of Puppet recipes was developed to standardize the procedure. In total we ran on GCE for about 8 weeks in March-April 2013, with two weeks planned for start-up, debugging and scaling up to full size. We ran computationally intensive workloads: physics event generators, fast detector simulation and full detector simulation. Produced data were automatically transferred to ATLAS grid storage for subsequent processing and analysis. We completed 458,000 jobs and generated and processed about 214 million events. During the whole running period GCE was very stable in all aspects of operations, quite a remarkable feat for a brand new cloud platform. The overall job error rate on GCE was about 6% over the whole period, with most of the errors occurring during the start-up and debugging period.
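As a rough consistency check, the quoted figures hang together: 4,000 cores running for about 8 weeks give an upper bound close to the roughly 5 million core-hours allocated.

```python
# Back-of-envelope check of the GCE run figures quoted above.

cores = 4000
weeks = 8
core_hours = cores * weeks * 7 * 24   # upper bound if fully occupied
jobs = 458_000
error_rate = 0.06

failed_jobs = round(jobs * error_rate)

print(core_hours)   # 5376000 core-hours, consistent with the ~5 M allocated
print(failed_jobs)  # 27480 failed jobs over the whole period
```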

Summary
In this paper, an overview of some activities of the ATLAS virtualization and cloud computing R&D project over the last 1.5 years was given. ATLAS continues to actively explore cloud technology using private and public cloud platforms. Many of the projects are characterized by a shift from pure R&D to production-level operation, with cloud resources delivered to ATLAS comparable to Tier-2 or even Tier-1 centers on the grid. Active engagement with large commercial cloud providers such as Amazon and Google has also been a hallmark of the period since CHEP 2012.