Evolution of user analysis on the grid in ATLAS

More than one thousand physicists analyse data collected by the ATLAS experiment at the Large Hadron Collider (LHC) at CERN, using 150 computing facilities around the world. Efficient distributed analysis requires optimal resource usage and the interplay of several factors: robust grid and software infrastructures, and the capability of the system to adapt to different workloads. The continuous automatic validation of grid sites and the user support provided by a dedicated team of expert shifters have proven to yield a solid distributed analysis system for ATLAS users. Typical user workflows on the grid, and their associated metrics, are discussed. Measurements of user job performance and typical requirements are also shown.


The ATLAS distributed computing system
The ATLAS [1] computing model is based on distributed resources [2]. Collider and simulated Monte Carlo data samples are centrally produced, and distributed over more than 150 computing centers spread around the world. The first LHC data reconstruction is done at the Tier-0, the computing facilities at CERN and at the Wigner Research Centre for Physics, and at the Tier-1s, primary computing facilities worldwide. Monte Carlo production is also done at Tier-1s, and at secondary facilities around the world, the Tier-2s.
The output format of the ATLAS reconstruction algorithms is the so-called xAOD (Analysis Object Data). The total size of the xAOD samples needed for a typical analysis is of the order of tens of PBs. To further reduce the total size of the samples to be analysed to a few TBs per analysis or group of similar analyses, a common reduction framework has been developed [3]. The output format of the reduction framework is the DxAOD, or derived xAOD. Currently there are about 100 different DxAOD formats in ATLAS, tailored to the needs of specific groups. Most DxAOD formats have a size less than 1% of the original xAOD for data, while typical reduction factors on MC samples are of a few percent. The data reduction is achieved by skimming, slimming and thinning procedures. Skimming removes whole events that fail a selection, thinning removes individual objects within an event, and slimming removes unneeded variables from the retained objects. The DxAOD format (and also the original xAOD) is readable by both ROOT and Athena, the ATLAS reconstruction and analysis framework [4].
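The three reduction operations can be illustrated on toy events. The event structure, container names, and selections below are invented for illustration and do not reflect the real xAOD schema or the actual reduction framework.

```python
# Illustrative sketch of the three reduction operations: skimming (drop
# events), thinning (drop objects within events), slimming (drop variables).
# Event/variable names here are assumptions, not the real xAOD schema.

def skim(events, selection):
    """Skimming: keep only events passing a selection."""
    return [e for e in events if selection(e)]

def thin(events, keep_object):
    """Thinning: within each event, drop individual objects (here: jets)."""
    return [{**e, "jets": [j for j in e["jets"] if keep_object(j)]}
            for e in events]

def slim(events, keep_vars):
    """Slimming: drop unneeded variables from each retained object."""
    return [{**e, "jets": [{k: j[k] for k in keep_vars} for j in e["jets"]]}
            for e in events]

events = [
    {"n_leptons": 2, "jets": [{"pt": 45.0, "eta": 1.2, "width": 0.1},
                              {"pt": 12.0, "eta": 3.8, "width": 0.2}]},
    {"n_leptons": 0, "jets": [{"pt": 80.0, "eta": 0.3, "width": 0.05}]},
]

derived = slim(
    thin(skim(events, lambda e: e["n_leptons"] >= 1),
         lambda j: j["pt"] > 20.0),
    keep_vars=("pt", "eta"),
)
# One event survives skimming; its soft jet is thinned away, and only
# pt and eta are kept for the remaining jet.
```

In a real derivation, the combination of the three operations is what drives the DxAOD size down to the percent level of the original xAOD.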
The reduction framework is run at Tier-1s and Tier-2s. The final DxAOD datasets are distributed to the Tier-1s, the Tier-2s, and additional analysis facilities, the Tier-3s. All these activities are referred to as central production, since they are centrally managed and organized. ATLAS users can use the same distributed infrastructure for data analysis. Access to grid resources is provided by the PanDA workload management system [5], which is used to submit tasks to ATLAS grid resources. A task is a collection of jobs executing the same payload on a subset of the task input dataset. Users and production managers submit tasks, which are then split into individual jobs by PanDA to optimize the usage of distributed resources. The system is based on pilot jobs sent to the sites, while tasks are submitted to a central queue operating on the following grid infrastructures: WLCG [6], Open Science Grid (OSG) [7], European Grid Initiative (EGI) [8], and NorduGrid [9]. The ATLAS Distributed Data Management system is based on Rucio [10]. ATLAS users may submit tasks using either PanDA or Ganga [11] client tools.
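The task-to-job splitting can be sketched as chunking the input dataset into bounded groups of files. The file-count limit and dataset names below are invented; the real PanDA brokerage takes many more factors into account (site capacity, data locality, walltime estimates, and so on).

```python
# Hedged sketch of task splitting: one task over an input dataset becomes
# several jobs, each processing at most a fixed number of input files.
# Names and the chunking rule are illustrative, not the PanDA algorithm.

def split_task(input_files, max_files_per_job):
    """Return the list of jobs, each a chunk of the task's input files."""
    return [input_files[i:i + max_files_per_job]
            for i in range(0, len(input_files), max_files_per_job)]

# A hypothetical 10-file input dataset split with at most 4 files per job.
dataset = [f"DAOD_EXOT.{i:05d}.pool.root" for i in range(10)]
jobs = split_task(dataset, max_files_per_job=4)
# -> 3 jobs, of sizes 4, 4 and 2
```

Each resulting job executes the same payload, so failures can be retried at the granularity of a single chunk rather than the whole task.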

Distributed analysis for LHC Run 2
User tasks are executed on the same distributed resources as centrally managed production tasks. The processing share based on the number of concurrently running jobs for the various ATLAS distributed computing activities is shown in figure 2. The analysis share ranges from 10% to 20%, and it is mostly driven by the interplay of user requests and production activities. Apart from periods when new re-processing or production campaigns take place, the overall share distribution is constant. The number of ATLAS grid users ranges from 400 to 500 per week, except for the holiday periods (see figure 3). Over one year, more than a thousand unique users submit jobs to ATLAS distributed resources.
Users compete among themselves and with centrally managed production activities for access to ATLAS grid resources. Analysis and production jobs are assigned by PanDA to separate queues, with different shares assigned at each site. At Tier-1s the minimum analysis share is 5%, but it can be increased on a voluntary basis by each site. Tier-2s usually have a larger share dedicated to analysis, with typical values ranging between 20% and 50%. The analysis share at Tier-3s is 100%.

User jobs are executed on dedicated analysis queues according to a priority system. Each user job has an assigned priority, which depends on the number of jobs from the same user already in the system, and on the user's CPU usage in the last 24 hours. The job priority starts at 1000 and decreases fairly rapidly. Some users or groups may be granted priority bonuses, or no priority penalty, for a limited amount of time if their analysis is recognised as critical by ATLAS. Moreover, users working for specific groups or activities, such as sub-detector studies, may submit jobs using a special role ("working group"), so that their personal priority is not decreased. The majority of user jobs have priority greater than -2000, as shown in figure 4, and they wait in the system on average 1.4 hours before being executed. The waiting time for jobs with low priority (less than -2000) is 2.3 hours (see figure 5).

User workflows on the grid are heterogeneous. Most users process DxAOD samples to produce final results for their analysis. These jobs are typically short (less than one hour), process a large number of input files, and typically produce several small output files. MC event generation and simulation, and in general reconstruction activities, are also typical user workflows, as can be seen in figure 6. Such workflows process far fewer events than typical DxAOD- or xAOD-based analyses, but they are much more expensive in terms of resource consumption (both CPU and memory).
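A toy model can make the priority scheme concrete. Only the starting value of 1000, the existence of a per-user penalty, and the "working group" role exemption come from the description above; the linear penalty formula below is an invented illustration, not the real PanDA algorithm.

```python
# Toy model of the user-job priority scheme. The linear penalty is an
# assumption for illustration; PanDA's actual formula differs.

BASE_PRIORITY = 1000

def job_priority(n_user_jobs_queued, penalty_per_job=2, working_group=False):
    """Priority assigned to the next job submitted by a user."""
    if working_group:
        # Group-role submissions do not decrease the user's personal priority.
        return BASE_PRIORITY
    return BASE_PRIORITY - penalty_per_job * n_user_jobs_queued

# A moderately busy user stays well above the -2000 "low priority" line,
# while a very heavy submitter eventually drops below it.
moderate = job_priority(150)     # 1000 - 2*150  = 700
heavy = job_priority(2000)       # 1000 - 2*2000 = -3000
```

The key property the sketch preserves is that heavy submitters self-penalise, so occasional users get their jobs scheduled quickly without manual intervention.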

Distributed analysis performance
ATLAS grid users typically submit about ten thousand jobs per month, processing on average more than 150 million events. User jobs typically run for about one hour, and on average they are executed around two hours after being submitted. The average memory consumption is 600 MB per job. A data reduction of a factor of 150 (measured as the ratio of the input and output dataset sizes) is normally achieved by user jobs. The average output size is around 100 MB per job. The user job efficiency is defined as the number of successful jobs divided by the total number of user jobs. As shown in figure 7, the user job efficiency is flat over time, and it is typically above 80%. User job failures are typically due to either infrastructure problems (mainly storage failures) or errors in the user code. About 50% of the total walltime consumed by failed user jobs is due to failures in the user code.

To help users submit jobs to ATLAS distributed resources and retrieve their outputs, ATLAS established a dedicated support service, the Distributed Analysis Support Team (DAST), reached through a mailing list. DAST is a team of expert shifters (organized in two daily 8-hour shifts to cover both European and American time zones), and provides support on the PanDA, Ganga and Rucio clients, site services and issues, physics analysis tools and monitoring systems related to grid activities. Since its creation in 2008, more than a thousand physicists have used the DAST service, with an average of 10000 emails exchanged every year on the mailing list.
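The two failure metrics quoted above can be written down explicitly. All numbers in the example below are made up for illustration; they are not ATLAS measurements.

```python
# Job efficiency and the walltime fraction lost to user-code failures,
# as defined in the text. All input numbers are invented for illustration.

def job_efficiency(n_success, n_total):
    """Successful jobs divided by the total number of user jobs."""
    return n_success / n_total

def user_code_walltime_fraction(failed_walltime_by_cause):
    """Fraction of failed-job walltime attributed to user-code errors."""
    total_failed = sum(failed_walltime_by_cause.values())
    return failed_walltime_by_cause["user_code"] / total_failed

eff = job_efficiency(n_success=8200, n_total=10000)      # 0.82, above 80%
frac = user_code_walltime_fraction(
    {"user_code": 500.0, "storage": 400.0, "other": 100.0})  # 0.5
```

Tracking the walltime (rather than only the job count) of failures matters because a few long failed jobs can waste more resources than many short ones.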
To mitigate the infrastructure-related failures, ATLAS grid sites are continuously validated by the HammerCloud [14,15,16] service. HammerCloud submits a constant flow of test jobs to ATLAS distributed resources that are used either for user analysis or for central production activities. Sites failing the HammerCloud tests are automatically excluded from job brokerage until they successfully execute the test jobs again. The auto-exclusion mechanism has been in place for analysis queues since 2010, and for production queues since 2012.
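The auto-exclusion idea can be sketched as a simple state machine over test results. The consecutive-failure threshold and the data structures below are assumptions for illustration, not HammerCloud's actual policy.

```python
# Hedged sketch of a HammerCloud-style auto-exclusion policy: a site that
# fails several test jobs in a row is removed from brokerage; a single
# passing test resets its failure count. Threshold and state handling
# are invented for illustration.

def update_site_states(states, test_results, fail_threshold=3):
    """states: site -> consecutive failures. Returns currently excluded sites."""
    excluded = set()
    for site, passed in test_results.items():
        states[site] = 0 if passed else states.get(site, 0) + 1
        if states[site] >= fail_threshold:
            excluded.add(site)
    return excluded

# Three rounds of (hypothetical) test results: SITE_B keeps failing.
states = {}
excluded = set()
for round_results in [
    {"SITE_A": True, "SITE_B": False},
    {"SITE_A": True, "SITE_B": False},
    {"SITE_A": False, "SITE_B": False},
]:
    excluded = update_site_states(states, round_results)
# SITE_B has failed three consecutive rounds and is excluded; SITE_A,
# with a single recent failure, stays in brokerage.
```

Because exclusion is driven by the same kind of jobs that users run, a site returns to brokerage automatically as soon as it can execute the tests again, with no manual ticket required.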

Future developments
High job efficiency and short waiting times are key ingredients of a successful distributed analysis system. To improve the balance of production and analysis activities, ATLAS is working on a global share system for all activities. The individual share of each activity can be changed dynamically, to ease central operations and give more flexibility to the system. This will allow for a better resource allocation when one (or more) particular activity becomes critical for ATLAS. The distributed analysis workflows are highly heterogeneous. They could be categorized into further sub-activities, and each share could be allocated dynamically. Moreover, ATLAS is planning to increase the number of analysis queues dedicated to special needs, such as jobs with high memory requirements, long jobs, and multi-core workflows. Standard analysis queues will have uniform limits on job length, memory requirements, etc. across the majority of sites.

Currently 95% of user tasks complete within three days of their submission. A challenge for the future is to tackle the remaining 5% of tasks, and to further reduce the time needed for a task to complete. The biggest limitation to fast job execution today is the availability of input datasets. Analysis jobs do not trigger data movement across sites, and therefore can only be executed at the sites where the input datasets are stored. ATLAS has automated the creation of additional replicas for popular datasets, and is continuously working on improving techniques to predict the usage of certain datasets for analysis. Another possible improvement is to enable remote file access for datasets on nearby sites with good network connectivity [17].
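The global-share idea can be sketched as a normalisation of relative shares into per-activity slot targets that can be changed at runtime. The activity names, numbers, and function below are illustrative assumptions, not the actual PanDA global-share implementation.

```python
# Sketch of a global-share allocation: relative shares per activity are
# normalised into concurrent-slot targets for the whole grid. All values
# and the API are invented for illustration.

def slot_targets(shares, total_slots):
    """Map activity -> target number of concurrently running job slots."""
    total = sum(shares.values())
    return {activity: round(total_slots * s / total)
            for activity, s in shares.items()}

# Hypothetical shares; operators could raise "analysis" dynamically when
# that activity becomes critical, without touching per-site configuration.
targets = slot_targets({"production": 80, "analysis": 15, "test": 5},
                       total_slots=200000)
```

The benefit over static per-site shares is that a single central knob redistributes capacity across all sites at once.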

Conclusions
The ATLAS distributed analysis system exhibits stable performance. The user job efficiency is typically larger than 80%; about half of the job failures are due to problems in the user code, and the other half to infrastructure-related issues, mainly storage failures. User jobs are executed on ATLAS grid sites according to a priority system which guarantees fair usage of resources. The average job waiting time is around two hours. Current and future efforts aim to reduce the number of corner cases in which jobs wait in the queues much longer than average. This can be achieved by a better allocation of resources among the various activities, and by improving the availability of input datasets.