Journal of Physics: Conference Series, Volume 331, 2011

072001

The following article is Open access

Experiment Dashboard for Monitoring of the LHC Distributed Computing Systems

J Andreeva, M Devesas Campos, J Tarragon Cros, B Gaidioz, E Karavakis, L Kokoszkiewicz, E Lanciotti, G Maier, W Ollivier, M Nowotka et al

View article, Experiment Dashboard for Monitoring of the LHC Distributed Computing Systems PDF, Experiment Dashboard for Monitoring of the LHC Distributed Computing Systems

LHC experiments are currently taking collisions data. A distributed computing model chosen by the four main LHC experiments allows physicists to benefit from resources spread all over the world. The distributed model and the scale of LHC computing activities increase the level of complexity of middleware, and also the chances of possible failures or inefficiencies in involved components. In order to ensure the required performance and functionality of the LHC computing system, monitoring the status of the distributed sites and services as well as monitoring LHC computing activities are among the key factors. Over the last years, the Experiment Dashboard team has been working on a number of applications that facilitate the monitoring of different activities: including following up jobs, transfers, and also site and service availabilities. This presentation describes Experiment Dashboard applications used by the LHC experiments and experience gained during the first months of data taking.

https://doi.org/10.1088/1742-6596/331/7/072001

072002

The following article is Open access

ATLAS Grid Information System

Alexey Anisenkov, Alexei Klimentov, Roman Kuskov and Torre Wenaus

View article, ATLAS Grid Information System PDF, ATLAS Grid Information System

ATLAS is a particle physics experiment at the Large Hadron Collider at CERN. The detector produces huge volumes of data, of the order of petabytes per year. To satisfy ATLAS requirements for petabyte scale production and distributed analysis processing, a grid technology is used to distribute, store and analyse these immense amounts of data. In this paper we present ATLAS Grid Information System (AGIS) designed to integrate configuration and status information about resources, services and topology of whole ATLAS Grid needed by ATLAS Distributed Computing applications.

https://doi.org/10.1088/1742-6596/331/7/072002

072003

The following article is Open access

Experiences with http/WebDAV protocols for data access in high throughput computing

Gerard Bernabeu, Francisco Martinez, Esther Acción, Arnau Bria, Marc Caubet, Manuel Delfino and Xavier Espinal

View article, Experiences with http/WebDAV protocols for data access in high throughput computing PDF, Experiences with http/WebDAV protocols for data access in high throughput computing

In the past, access to remote storage was considered to be at least one order of magnitude slower than local disk access. Improvement on network technologies provide the alternative of using remote disk. For those accesses one can today reach levels of throughput similar or exceeding those of local disks. Common choices as access protocols in the WLCG collaboration are RFIO, [GSI]DCAP, GRIDFTP, XROOTD and NFS. HTTP protocol shows a promising alternative as it is a simple, lightweight protocol. It also enables the use of standard technologies such as http caching or load balancing which can be used to improve service resilience and scalability or to boost performance for some use cases seen in HEP such as the "hot files". WebDAV extensions allow writing data, giving it enough functionality to work as a remote access protocol. This paper will show our experiences with the WebDAV door for dCache, in terms of functionality and performance, applied to some of the HEP work flows in the LHC Tier1 at PIC.

https://doi.org/10.1088/1742-6596/331/7/072003

072004

The following article is Open access

Virtualization for the LHCb Online system

Enrico Bonaccorsi, Loic Brarda, Gary Moine and Niko Neufeld

View article, Virtualization for the LHCb Online system PDF, Virtualization for the LHCb Online system

Virtualization has long been advertised by the IT-industry as a way to cut down cost, optimise resource usage and manage the complexity in large data-centers. The great number and the huge heterogeneity of hardware, both industrial and custom-made, has up to now led to reluctance in the adoption of virtualization in the IT infrastructure of large experiment installations.

Our experience in the LHCb experiment has shown that virtualization improves the availability and the manageability of the whole system.

We have done an evaluation of available hypervisors / virtualization solutions and find that the Microsoft HV technology provides a high level of maturity and flexibility for our purpose. We present the results of these comparison tests, describing in detail, the architecture of our virtualization infrastructure with a special emphasis on the security for services visible to the outside world. Security is achieved by a sophisticated combination of VLANs, firewalls and virtual routing - the cost and benefits of this solution are analysed.

We have adapted our cluster management tools, notably Quattor, for the needs of virtual machines and this allows us to migrate smoothly services on physical machines to the virtualized infrastructure. The procedures for migration will also be described.

In the final part of the document we describe our recent R&D activities aiming to replacing the SAN-backend for the virtualization by a cheaper iSCSI solution - this will allow to move all servers and related services to the virtualized infrastructure, excepting the ones doing hardware control via non-commodity PCI plugin cards.

https://doi.org/10.1088/1742-6596/331/7/072004

072005

The following article is Open access

Experience with the CMS Computing Model from commissioning to collisions

Daniele Bonacorsi and the CMS Computing project

View article, Experience with the CMS Computing Model from commissioning to collisions PDF, Experience with the CMS Computing Model from commissioning to collisions

In this presentation we will discuss the early experience with the CMS computing model from the last large scale challenge activities through the first six months of data taking. Between the initial definition of the CMS Computing Model in 2004 and the start of high energy collisions in 2010, CMS exercised the infrastructure with numerous scaling tests and service challenges. We will discuss how those tests have helped prepare the experiment for operations and how representative the challenges were to the early experience with data taking. We will outline how the experiment operations has evolved during the first few months of operations. The current state of the Computing system will be presented and we will describe the initial experience with active users and real data. We will address the issues that worked well in addition to identifying areas where future development and refinement is needed.

https://doi.org/10.1088/1742-6596/331/7/072005

072006

The following article is Open access

Commissioning of a CERN Production and Analysis Facility Based on xrootd

Simone Campana, Daniel C van der Ster, Alessandro Di Girolamo, Andreas J Peters, Dirk Duellmann, Miguel Coelho Dos Santos, Jan Iven and Tim Bell

View article, Commissioning of a CERN Production and Analysis Facility Based on xrootd PDF, Commissioning of a CERN Production and Analysis Facility Based on xrootd

The CERN facility hosts the Tier-0 of the four LHC experiments, but as part of WLCG it also offers a platform for production activities and user analysis. The CERN CASTOR storage technology has been extensively tested and utilized for LHC data recording and exporting to external sites according to experiments computing model. On the other hand, to accommodate Grid data processing activities and, more importantly, chaotic user analysis, it was realized that additional functionality was needed including a different throttling mechanism for file access. This paper will describe the xroot-based CERN production and analysis facility for the ATLAS experiment and in particular the experiment use case and data access scenario, the xrootd redirector setup on top of the CASTOR storage system, the commissioning of the system and real life experience for data processing and data analysis.

https://doi.org/10.1088/1742-6596/331/7/072006

072007

The following article is Open access

ATLAS Muon Calibration Framework

Gianpaolo Carlino, Alessandro De Salvo, Andrea Di Simone, Alessandra Doria, Manoj Kumar Jha, Luca Mazzaferro, Rodney Walker and (forthe Atlas collaboration)

View article, ATLAS Muon Calibration Framework PDF, ATLAS Muon Calibration Framework

Automated calibration of the ATLAS detector subsystems (like MDT and RPC chambers) are being performed at remote sites, called Remote Calibration Centers. The calibration data for the assigned part of the detector are being processed at these centers and send the result back to CERN for general use in reconstruction and analysis. In this work, we present the recent developments in data discovery mechanism and integration of Ganga as a backend which allows for the specification, submission, bookkeeping and post processing of calibration tasks on a wide set of available heterogeneous resources at remote centers.

https://doi.org/10.1088/1742-6596/331/7/072007

072008

The following article is Open access

The LHCb Computing Model and Real Data

Philippe Charpentier and (on behalf ofthe LHCb Collaboration)

View article, The LHCb Computing Model and Real Data PDF, The LHCb Computing Model and Real Data

After several years of experience with Grid production and Analysis dealing with simulated data, the first LHC collision data (as of March 2010) have confronted the LHCb Computing Model with real data. The LHCb Computing Model is somewhat different from the traditional MONARC hierarchical model used by the other LHC experiments: first pass reconstruction, as well as further reprocessing passes, are performed at a set of 7 Tier-1 sites (including CERN), while Tier2 sites are used mainly for simulation productions. User analysis is performed at LHCb Analysis Centers for which the baseline is the 7 Tier1s. Analysis relies on the concept of reduced datasets (so-call stripped datasets) that are centrally produced at the 7 Tier-1's and then distributed to all the analysis centers. We shall review the performance of this model with the 2010 real data, and give an outlook for possible modifications to be put in place for the 2011 run.

https://doi.org/10.1088/1742-6596/331/7/072008

072009

The following article is Open access

LcgCAF: CDF access method to LCG resources

Gabriele Compostella, Matteo Bauce, Simone Pagan Griso, Donatella Lucchesi, Massimo Sgaravatto and Marco Cecchi

View article, LcgCAF: CDF access method to LCG resources PDF, LcgCAF: CDF access method to LCG resources

Up to the early 2011, the CDF collaboration has collected more than 8 fb⁻¹ of data from p collisions at a center of mass energy TeV delivered by the Tevatron collider at Fermilab. Second generation physics measurements, like precision determinations of top properties or searches for the Standard Model higgs, require increasing computing power for data analysis and events simulation.

Instead of expanding its set of dedicated Condor based analysis farms, CDF moved to Grid resources. While in the context of OSG this transition was performed using Condor glideins and keeping CDF custom middleware software almost intact, in LCG a complete rewrite of the experiment's submission and monitoring tools was realized, taking full advantage of the features offered by the gLite Workload Management System (WMS). This led to the development of a new computing facility called LcgCAF that CDF collaborators are using to exploit Grid resources in Europe in a transparent way.

Given the opportunistic usage of the available resources, it is of crucial importance for CDF to maximize jobs efficiency from submission to output retrieval. This work describes how an experimental resubmisson feature implemented in the WMS was tested in LcgCAF with the aim of lowering the overall execution time of a typical CDF job.

https://doi.org/10.1088/1742-6596/331/7/072009

072010

The following article is Open access

The Use of Proxy Caches for File Access in a Multi-Tier Grid Environment

R Brun, D Duellmann, G Ganis, A Hanushevsky, L Janyst, A J Peters, F Rademakers and E Sindrilaru

View article, The Use of Proxy Caches for File Access in a Multi-Tier Grid Environment PDF, The Use of Proxy Caches for File Access in a Multi-Tier Grid Environment

The use of proxy caches has been extensively studied in the HEP environment for efficient access of database data and showed significant performance with only very moderate operational effort at higher grid tiers (T2, T3). In this contribution we propose to apply the same concept to the area of file access and analyse the possible performance gains, operational impact on site services and applicability to different HEP use cases. Base on a proof-of-concept studies with a modified XROOT proxy server we review the cache efficiency and overheads for access patterns of typical ROOT based analysis programs. We conclude with a discussion of the potential role of this new component at the different tiers of a distributed computing grid.

https://doi.org/10.1088/1742-6596/331/7/072010

072011

The following article is Open access

Reinforcing user data analysis with Ganga in the LHC era: scalability, monitoring and user-support

Johannes Elmsheuser, Frederic Brochu, Ivan Dzhunov, Johannes Ebke, Ulrik Egede, Manoj Kumar Jha, Lukasz Kokoszkiewicz, Hurng-Chun Lee, Andrew Maier, Jakub Mościcki et al

View article, Reinforcing user data analysis with Ganga in the LHC era: scalability, monitoring and user-support PDF, Reinforcing user data analysis with Ganga in the LHC era: scalability, monitoring and user-support

Ganga is a grid job submission and management system widely used in the ATLAS and LHCb experiments and several other communities in the context of the EGEE project. The particle physics communities have entered the LHC operation era which brings new challenges for user data analysis: a strong growth in the number of users and jobs is already noticeable. Current work in the Ganga project is focusing on dealing with these challenges. In recent Ganga releases the support for the pilot job based grid systems Panda and Dirac of the ATLAS and LHCb experiment respectively have been strengthened. A more scalable job repository architecture, which allows efficient storage of many thousands of jobs in XML or several database formats, was recently introduced. A better integration with monitoring systems, including the Dashboard and job execution monitor systems is underway. These will provide comprehensive and easy job monitoring. A simple to use error reporting tool integrated at the Ganga command-line will help to improve user support and debugging user problems. Ganga is a mature, stable and widely-used tool with long-term support from the HEP community. We report on how it is being constantly improved following the user needs for faster and easier distributed data analysis on the grid.

https://doi.org/10.1088/1742-6596/331/7/072011

072012

The following article is Open access

Computing for the next generation flavour factories

F Bianchi, D Brown, M Corvo, A Di Simone, A Fella, A Gianoli, E Luppi, M Morandin, E Paoloni, M Rama et al

View article, Computing for the next generation flavour factories PDF, Computing for the next generation flavour factories

The next generation of Super Flavor Factories, like SuperB and SuperKEKB, present significant computing challenges. Extrapolating the BaBar and Belle experience to the SuperB nominal luminosity of 10³⁶cm⁻²s⁻¹, we estimate that the data size collected after a few years of operation is 200 PB and the amount of CPU required to process them of the order of 2000 KHep-Spec06. Already in the current phase of detector design, the amount of simulated events needed for estimating the impact on very rare benchmark channels is huge and has required the development of new simulation tools and the deployment of a worldwide production distributed system.

With the collider is in operation, very large data set have to be managed and new technologies with potential large impact on the computational models, like the many core CPUs, need to be effectively exploited. In addition SuperB, like the LHC experiments, will have to make use of distributed computing resources accessible via the Grid infrastructures while providing an efficient and reliable data access model to its final users. To explore the key issues, a dedicated R&D program has been launched and is now in progress. A description of the R&D goals and the status of ongoing activities is presented.

https://doi.org/10.1088/1742-6596/331/7/072012

072013

The following article is Open access

arcControlTower: the System for Atlas Production and Analysis on ARC

Andrej Filipčič and (forthe Atlas collaboration)

View article, arcControlTower: the System for Atlas Production and Analysis on ARC PDF, arcControlTower: the System for Atlas Production and Analysis on ARC

PanDA, the Atlas management and distribution system for production and analysis jobs on EGEE and OSG clusters, is based on pilot jobs to increase the throughput and stability of the job execution on grid. The ARC middleware uses a specific approach which tightly connects the job requirements with cluster capabilities like resource usage, software availability and caching of input files. The pilot concept renders the ARC features useless. The arcControlTower is the job submission system which merges the pilot benefits and ARC advantages. It takes the pilot payload from the panda server and submits the jobs to the Nordugrid ARC clusters as regular jobs, with all the job resources known in advance. All the pilot communication with the PanDA server is done by the arcControlTower, so it plays the role of a pilot factory and the pilot itself. There are several advantages to this approach: no grid middleware is needed on the worker nodes, the fair-share between the production and user jobs is tuned with the arcControlTower load parameters, the jobs can be controlled by ARC client tools. The system could be extended to other submission systems using central distribution.

https://doi.org/10.1088/1742-6596/331/7/072013

072014

The following article is Open access

WLCG Operations and the First Prolonged LHC Run

M Girone and J Shiers

View article, WLCG Operations and the First Prolonged LHC Run PDF, WLCG Operations and the First Prolonged LHC Run

By the time of CHEP 2010 we had accumulated just over 6 months' experience with proton-proton data taking, production and analysis at the LHC. This paper addresses the issues seen from the point of view of the WLCG Service. In particular, it answers the following questions: Did the WLCG service delivered quantitatively and qualitatively? Were the "key performance indicators" a reliable and accurate measure of the service quality? Were the inevitable service issues been resolved in a sufficiently rapid fashion? What are the key areas of improvement required not only for long-term sustainable operations, but also to embrace new technologies. It concludes with a summary of our readiness for data taking in the light of real experience.

https://doi.org/10.1088/1742-6596/331/7/072014

072015

The following article is Open access

Using widgets to monitor the LHC experiments

I González Caballero and S Sarkar

View article, Using widgets to monitor the LHC experiments PDF, Using widgets to monitor the LHC experiments

The complexity of the LHC experiments requires monitoring systems to verify the correct functioning of different sub-systems and to allow operators to quickly spot problems and issues that may cause loss of information and data. Due to the distributed nature of the collaborations and the different technologies involved, the information data that need to be correlated is usually spread over several databases, web pages and monitoring systems. On the other hand, although the complete set of monitorable aspects is known and fixed, the subset that each person needs to monitor is often different for each individual. Therefore, building a unique monitoring tool that suits every single collaborator becomes close to impossible. A modular approach with a set of customizable widgets, small autonomous portions of HTML and JavaScript, that can be aggregated to form private or public monitoring web pages can be a scalable and robust solution, where the information can be provided by a simple and thin set of web services. Among the different widget development toolkits available today, we have chosen the open project UWA (Unified Widget API) because of its portability to the most popular widget platforms (including iGoogle, Netvibes and Apple Dashboard). As an example, we show how this technology is currently being used to monitor parts of the CMS Computing project.

https://doi.org/10.1088/1742-6596/331/7/072015

072016

The following article is Open access

A prototype for JDEM science data processing

Erik E Gottschalk

View article, A prototype for JDEM science data processing PDF, A prototype for JDEM science data processing

Fermilab is developing a prototype science data processing and data quality monitoring system for dark energy science. The purpose of the prototype is to demonstrate distributed data processing capabilities for astrophysics applications, and to evaluate candidate technologies for trade-off studies. We present the architecture and technical aspects of the prototype, including an open source scientific execution and application development framework, distributed data processing, and publish/subscribe message passing for quality control.

https://doi.org/10.1088/1742-6596/331/7/072016

072017

The following article is Open access

Virtual pools for interactive analysis and software development through an integrated Cloud environment

C Grandi, A Italiano, D Salomoni and A K Calabrese Melcarne

View article, Virtual pools for interactive analysis and software development through an integrated Cloud environment PDF, Virtual pools for interactive analysis and software development through an integrated Cloud environment

WNoDeS, an acronym for Worker Nodes on Demand Service, is software developed at CNAF-Tier1, the National Computing Centre of the Italian Institute for Nuclear Physics (INFN) located in Bologna. WNoDeS provides on demand, integrated access to both Grid and Cloud resources through virtualization technologies. Besides the traditional use of computing resources in batch mode, users need to have interactive and local access to a number of systems. WNoDeS can dynamically select these computers instantiating Virtual Machines, according to the requirements (computing, storage and network resources) of users through either the Open Cloud Computing Interface API, or through a web console. An interactive use is usually limited to activities in user space, i.e. where the machine configuration is not modified. In some other instances the activity concerns development and testing of services and thus implies the modification of the system configuration (and, therefore, root-access to the resource). The former use case is a simple extension of the WNoDeS approach, where the resource is provided in interactive mode. The latter implies saving the virtual image at the end of each user session so that it can be presented to the user at subsequent requests. This work describes how the LHC experiments at INFN-Bologna are testing and making use of these dynamically created ad-hoc machines via WNoDeS to support flexible, interactive analysis and software development at the INFN Tier-1 Computing Centre.

https://doi.org/10.1088/1742-6596/331/7/072017

072018

The following article is Open access

A tool for optimization of the production and user analysis on the Grid, C. Grigoras for the ALICE Collaboration

Costin Grigoras, Federico Carminati, Olga Vladimirovna Datskova, Steffen Schreiner, Sehoon Lee, Jianlin Zhu, Mihaela Gheata, Andrei Gheata, Pablo Saiz, Latchezar Betev et al

View article, A tool for optimization of the production and user analysis on the Grid, C. Grigoras for the ALICE Collaboration PDF, A tool for optimization of the production and user analysis on the Grid, C. Grigoras for the ALICE Collaboration

With the LHC and ALICE entering a full operation and production modes, the amount of Simulation and RAW data processing and end user analysis computational tasks are increasing. The efficient management of all these tasks, all of which have large differences in lifecycle, amounts of processed data and methods to analyze the end result, required the development and deployment of new tools in addition to the already existing Grid infrastructure. To facilitate the management of the large scale simulation and raw data reconstruction tasks, ALICE has developed a production framework called a Lightweight Production Manager (LPM). The LPM is automatically submitting jobs to the Grid based on triggers and conditions, for example after a physics run completion. It follows the evolution of the job and publishes the results on the web for worldwide access by the ALICE physicists. This framework is tightly integrated with the ALICE Grid framework AliEn. In addition to the publication of the job status, LPM is also allowing a fully authenticated interface to the AliEn Grid catalogue, to browse and download files, and in the near future will provide simple types of data analysis through ROOT plugins. The framework is also being extended to allow management of end user jobs.

https://doi.org/10.1088/1742-6596/331/7/072018

072019

The following article is Open access

CMS distributed computing workflow experience

Jennifer Adelman-McCarthy, Oliver Gutsche, Jeffrey D Haas, Harrison B Prosper, Valentina Dutta, Guillelmo Gomez-Ceballos, Kristian Hahn, Markus Klute, Ajit Mohapatra, Vincenzo Spinoso et al

View article, CMS distributed computing workflow experience PDF, CMS distributed computing workflow experience

The vast majority of the CMS Computing capacity, which is organized in a tiered hierarchy, is located away from CERN. The 7 Tier-1 sites archive the LHC proton-proton collision data that is initially processed at CERN. These sites provide access to all recorded and simulated data for the Tier-2 sites, via wide-area network (WAN) transfers. All central data processing workflows are executed at the Tier-1 level, which contain re-reconstruction and skimming workflows of collision data as well as reprocessing of simulated data to adapt to changing detector conditions. This paper describes the operation of the CMS processing infrastructure at the Tier-1 level. The Tier-1 workflows are described in detail. The operational optimization of resource usage is described. In particular, the variation of different workflows during the data taking period of 2010, their efficiencies and latencies as well as their impact on the delivery of physics results is discussed and lessons are drawn from this experience. The simulation of proton-proton collisions for the CMS experiment is primarily carried out at the second tier of the CMS computing infrastructure. Half of the Tier-2 sites of CMS are reserved for central Monte Carlo (MC) production while the other half is available for user analysis. This paper summarizes the large throughput of the MC production operation during the data taking period of 2010 and discusses the latencies and efficiencies of the various types of MC production workflows. We present the operational procedures to optimize the usage of available resources and we the operational model of CMS for including opportunistic resources, such as the larger Tier-3 sites, into the central production operation.

https://doi.org/10.1088/1742-6596/331/7/072019

072020

The following article is Open access

Monitoring the Readiness and Utilization of the Distributed CMS Computing Facilities

J Flix, J M Hernández and A Sciabà

View article, Monitoring the Readiness and Utilization of the Distributed CMS Computing Facilities PDF, Monitoring the Readiness and Utilization of the Distributed CMS Computing Facilities

The CMS experiment has adopted a computing system where resources are distributed worldwide in more than 100 sites. The operation of the system requires a stable and reliable behavior of the underlying infrastructure. CMS has established procedures to extensively test all relevant aspects of a site and their capability to sustain the various CMS computing workflows at the required scale. The Site Readiness monitoring infrastructure has been instrumental in understanding how the system as a whole was improving towards LHC operations, measuring the reliability of sites when running CMS activities, and providing sites with the information they need to solve eventual problems. This paper reviews the complete automation of the Site Readiness program, with the description of monitoring tools, the impact in improving the overall reliability of the Grid from the point of view of the CMS computing system, as well as the resource utilization and performance seen at the sites during the first year of LHC running.

https://doi.org/10.1088/1742-6596/331/7/072020

072021

The following article is Open access

Computing at Belle II

Thomas Kuhr

View article, Computing at Belle II PDF, Computing at Belle II

The next generation B-factory experiment Belle II will collect a huge data sample which is a challenge for the computing system. In this article, the computing model of the Belle II experiment is presented and the core components of the computing system are introduced.

https://doi.org/10.1088/1742-6596/331/7/072021

072022

The following article is Open access

Workflow management in large distributed systems

I Legrand, H Newman, R Voicu, C Dobre and C Grigoras

View article, Workflow management in large distributed systems PDF, Workflow management in large distributed systems

The MonALISA (Monitoring Agents using a Large Integrated Services Architecture) framework provides a distributed service system capable of controlling and optimizing large-scale, data-intensive applications. An essential part of managing large-scale, distributed data-processing facilities is a monitoring system for computing facilities, storage, networks, and the very large number of applications running on these systems in near realtime. All this monitoring information gathered for all the subsystems is essential for developing the required higher-level services—the components that provide decision support and some degree of automated decisions—and for maintaining and optimizing workflow in large-scale distributed systems. These management and global optimization functions are performed by higher-level agent-based services. We present several applications of MonALISA's higher-level services including optimized dynamic routing, control, data-transfer scheduling, distributed job scheduling, dynamic allocation of storage resource to running jobs and automated management of remote services among a large set of grid facilities.

https://doi.org/10.1088/1742-6596/331/7/072022

072023

The following article is Open access

Large scale commissioning and operational experience with tier-2 to tier-2 data transfer links in CMS

J Letts and N Magini

View article, Large scale commissioning and operational experience with tier-2 to tier-2 data transfer links in CMS PDF, Large scale commissioning and operational experience with tier-2 to tier-2 data transfer links in CMS

Tier-2 to Tier-2 data transfers have been identified as a necessary extension of the CMS computing model. The Debugging Data Transfers (DDT) Task Force in CMS was charged with commissioning Tier-2 to Tier-2 PhEDEx transfer links beginning in late 2009, originally to serve the needs of physics analysis groups for the transfer of their results between the storage elements of the Tier-2 sites associated with the groups. PhEDEx is the data transfer middleware of the CMS experiment. For analysis jobs using CRAB, the CMS Remote Analysis Builder, the challenges of remote stage out of job output at the end of the analysis jobs led to the introduction of a local fallback stage out, and will eventually require the asynchronous transfer of user data over essentially all of the Tier-2 to Tier-2 network using the same PhEDEx infrastructure. In addition, direct file sharing of physics and Monte Carlo simulated data between Tier-2 sites can relieve the operational load of the Tier-1 sites in the original CMS Computing Model, and already represents an important component of CMS PhEDEx data transfer volume. The experience, challenges and methods used to debug and commission the thousands of data transfers links between CMS Tier-2 sites world-wide are explained and summarized. The resulting operational experience with Tier-2 to Tier-2 transfers is also presented.

https://doi.org/10.1088/1742-6596/331/7/072023

072024

The following article is Open access

Overview of ATLAS PanDA Workload Management

T Maeno, K De, T Wenaus, P Nilsson, G A Stewart, R Walker, A Stradling, J Caballero, M Potekhin, D Smith et al

View article, Overview of ATLAS PanDA Workload Management PDF, Overview of ATLAS PanDA Workload Management

The Production and Distributed Analysis System (PanDA) plays a key role in the ATLAS distributed computing infrastructure. All ATLAS Monte-Carlo simulation and data reprocessing jobs pass through the PanDA system. We will describe how PanDA manages job execution on the grid using dynamic resource estimation and data replication together with intelligent brokerage in order to meet the scaling and automation requirements of ATLAS distributed computing. PanDA is also the primary ATLAS system for processing user and group analysis jobs, bringing further requirements for quick, flexible adaptation to the rapidly evolving analysis use cases of the early datataking phase, in addition to the high reliability, robustness and usability needed to provide efficient and transparent utilization of the grid for analysis users. We will describe how PanDA meets ATLAS requirements, the evolution of the system in light of operational experience, how the system has performed during the first LHC data-taking phase and plans for the future.

https://doi.org/10.1088/1742-6596/331/7/072024

072025

The following article is Open access

Data oriented job submission scheme for the PHENIX user analysis in CCJ

T Nakamura, H En'yo, T Ichihara, Y Watanabe and S Yokkaichi

View article, Data oriented job submission scheme for the PHENIX user analysis in CCJ PDF, Data oriented job submission scheme for the PHENIX user analysis in CCJ

The RIKEN Computing Center in Japan (CCJ) has been developed to make it possible analyzing huge amount of data corrected by the PHENIX experiment at RHIC. The corrected raw data or reconstructed data are transferred via SINET3 with 10 Gbps bandwidth from Brookheaven National Laboratory (BNL) by using GridFTP. The transferred data are once stored in the hierarchical storage management system (HPSS) prior to the user analysis. Since the size of data grows steadily year by year, concentrations of the access request to data servers become one of the serious bottlenecks. To eliminate this I/O bound problem, 18 calculating nodes with total 180 TB local disks were introduced to store the data a priori. We added some setup in a batch job scheduler (LSF) so that user can specify the requiring data already distributed to the local disks. The locations of data are automatically obtained from a database, and jobs are dispatched to the appropriate node which has the required data. To avoid the multiple access to a local disk from several jobs in a node, techniques of lock file and access control list are employed. As a result, each job can handle a local disk exclusively. Indeed, the total throughput was improved drastically as compared to the preexisting nodes in CCJ, and users can analyze about 150 TB data within 9 hours. We report this successful job submission scheme and the feature of the PC cluster.

https://doi.org/10.1088/1742-6596/331/7/072025

072026

The following article is Open access

Parallel Monte Carlo simulations on an ARC–enabled computing grid

Jon K Nilsen and Bjørn H Samset

View article, Parallel Monte Carlo simulations on an ARC–enabled computing grid PDF, Parallel Monte Carlo simulations on an ARC–enabled computing grid

Grid computing opens new possibilities for running heavy Monte Carlo simulations of physical systems in parallel. The presentation gives an overview of GaMPI, a system for running an MPI-based random walker simulation on grid resources. Integrating the ARC middleware and the new storage system Chelonia with the Ganga grid job submission and control system, we show that MPI jobs can be run on a world-wide computing grid with good performance and promising scaling properties. Results for relatively communication-heavy Monte Carlo simulations run on multiple heterogeneous, ARC-enabled computing clusters in several countries are presented.

https://doi.org/10.1088/1742-6596/331/7/072026

072027

The following article is Open access

Analyzing Ever Growing Datasets in PHENIX

Christopher Pinkenburg and (forthe PHENIX Collaboration)

View article, Analyzing Ever Growing Datasets in PHENIX PDF, Analyzing Ever Growing Datasets in PHENIX

After 10 years of running, the PHENIX experiment has by now accumulated more than 700 TB of reconstructed data which are directly used for analysis. Analyzing these amounts of data efficiently requires a coordinated approach. Beginning in 2005 we started to develop a system for the RHIC Atlas Computing Facility (RACF) which allows the efficient analysis of these large data sets. The Analysis Taxi is now the tool which allows any collaborator to process any data set taken since 2003 in weekly passes with turnaround times of typically three to four days.

https://doi.org/10.1088/1742-6596/331/7/072027

072028

The following article is Open access

PANDA Grid – a Tool for Physics

D Protopopescu and K Schwarz

View article, PANDA Grid – a Tool for Physics PDF, PANDA Grid – a Tool for Physics

PANDA Grid is the computing tool of the bar P ANDA experiment at FAIR with concerted efforts dedicated to evolving it beyond passive computing infrastructure, into a complete and transparent solution for physics simulation, reconstruction and analysis, a tool right at the fingertips of the physicist. bar P ANDA's position within the larger FAIR community, synergies with other FAIR experiments and with ALICE@LHC, together with recent progress are reported.

https://doi.org/10.1088/1742-6596/331/7/072028

072029

The following article is Open access

Hiding the Complexity: Building a Distributed ATLAS Tier-2 with a Single Resource Interface using ARC Middleware

S Purdie, G Stewart, M Kenyon, S Skipsey, A Washbrook, W Bhimji and A Filipčič

View article, Hiding the Complexity: Building a Distributed ATLAS Tier-2 with a Single Resource Interface using ARC Middleware PDF, Hiding the Complexity: Building a Distributed ATLAS Tier-2 with a Single Resource Interface using ARC Middleware

Since their inception, Grids for high energy physics have found management of data to be the most challenging aspect of operations. This problem has generally been tackled by the experiment's data management framework controlling in fine detail the distribution of data around the grid and the careful brokering of jobs to sites with co-located data. This approach, however, presents experiments with a difficult and complex system to manage as well as introducing a rigidity into the framework which is very far from the original conception of the grid.

In this paper we describe how the ScotGrid distributed Tier-2, which has sites in Glasgow, Edinburgh and Durham, was presented to ATLAS as a single, unified resource using the ARC middleware stack. In this model the ScotGrid 'data store' is hosted at Glasgow and presented as a single ATLAS storage resource. As jobs are taken from the ATLAS PanDA framework, they are dispatched to the computing cluster with the fastest response time. An ARC compute element at each site then asynchronously stages the data from the data store into a local cache hosted at each site. The job is then launched in the batch system and accesses data locally.

We discuss the merits of this system compared to other operational models and consider, from the point of view of the resource providers (sites), and from the resource consumers (experiments); and consider issues involved in transitions to this model.

https://doi.org/10.1088/1742-6596/331/7/072029

072030

The following article is Open access

Large scale and low latency analysis facilities for the CMS experiment: development and operational aspects

H Riahi, S Gowdy, P Kreuzer, J Bakken, M Cinquilli, D Evans, S Foulkes, R Kaselis, S Metson, D Spiga et al

View article, Large scale and low latency analysis facilities for the CMS experiment: development and operational aspects PDF, Large scale and low latency analysis facilities for the CMS experiment: development and operational aspects

While a majority of CMS data analysis activities rely on the distributed computing infrastructure on the WLCG Grid, dedicated local computing facilities have been deployed to address particular requirements in terms of latency and scale.

The CMS CERN Analysis Facility (CAF) was primarily designed to host a large variety of latency-critical workflows. These break down into alignment and calibration, detector commissioning and diagnosis, and high-interest physics analysis requiring fast turnaround. In order to reach the goal for fast turnaround tasks, the Workload Management group has designed a CRABServer based system to fit with two main needs: to provide a simple, familiar interface to the user (as used in the CRAB Analysis Tool[7]) and to allow an easy transition to the Tier-0 system. While the CRABServer component had been initially designed for Grid analysis by CMS end-users, with a few modifications it turned out to be also a very powerful service to manage and monitor local submissions on the CAF. Transition to Tier-0 has been guaranteed by the usage of the WMCore, a library developed by CMS to be the common core of workload management tools, for handing data driven workflow dependencies. This system is now being used with the first use cases, and important experience is being acquired. In addition to the CERN CAF facility, FNAL has CMS dedicated analysis resources at the FNAL LHC Physics Center (LPC). In the first few years of data collection FNAL has been able to accept a large fraction of CMS data. The remote centre is not well suited for the extremely low latency work expected of the CAF, but the presence of substantial analysis resources, a large resident community, and a large fraction of the data make the LPC a strong facility for resource intensive analysis.

We present the building, commissioning and operation of these dedicated analysis facilities in the first year of LHC collisions; we also present the specific development to our software needed to allow for the use of these computing facilities in the special use cases of fast turnaround analyses.

https://doi.org/10.1088/1742-6596/331/7/072030

072031

The following article is Open access

Operating a production pilot factory serving several scientific domains

I Sfiligoi, F Würthwein, W Andrews, J M Dost, I MacNeill, A McCrea, E Sheripon and C W Murphy

View article, Operating a production pilot factory serving several scientific domains PDF, Operating a production pilot factory serving several scientific domains

Pilot infrastructures are becoming prominent players in the Grid environment. One of the major advantages is represented by the reduced effort required by the user communities (also known as Virtual Organizations or VOs) due to the outsourcing of the Grid interfacing services, i.e. the pilot factory, to Grid experts. One such pilot factory, based on the glideinWMS pilot infrastructure, is being operated by the Open Science Grid at University of California San Diego (UCSD). This pilot factory is serving multiple VOs from several scientific domains. Currently the three major clients are the analysis operations of the HEP experiment CMS, the community VO HCC, which serves mostly math, biology and computer science users, and the structural biology VO NEBioGrid. The UCSD glidein factory allows the served VOs to use Grid resources distributed over 150 sites in North and South America, in Europe, and in Asia. This paper presents the steps taken to create a production quality pilot factory, together with the challenges encountered along the road.

https://doi.org/10.1088/1742-6596/331/7/072031

072032

The following article is Open access

HERA Data Preservation plans and activities

J Szuba and (on behalf ofthe DESY Data Preservation Group)

View article, HERA Data Preservation plans and activities PDF, HERA Data Preservation plans and activities

An international inter-experimental study group on data preservation and long-term analysis in HEP (DPHEP) was convened at the end of 2008 and held a series of workshops during 2009. The HERA experiments H1, ZEUS, HERMES as well as the IT division and the Library are well represented in DPHEP and efforts are now being made to form a coherent approach at DESY. Various options for preservation are explored, from permanent evolution (H1) to the use of virtualisation techniques (ZEUS). Both experiments have planned the computing and the associated resources until 2013 and now explore possibilities to ensure the maintenance of the data analysis capabilities beyond 2013. A common effort and additional resources may lead to longer viability of data analysis. Technical solutions have been investigated by DESY-IT and involve virtualisation systems tailored for long term software preservation as well as systems for self consistent data archiving and migration. The communication between experiments, DESY-IT and the Library have put forward the possibility for further common developments related to documentation scanning and storage as well as pilot projects within the HEP documentation system INSPIRE. The evaluation of such projects is ongoing and concrete proposals to ensure HERA data analysis after 2013 are expected in 2010.

https://doi.org/10.1088/1742-6596/331/7/072032

072033

The following article is Open access

Xrootd Storage Element Deployment at GridKa WLCG Tier-1 Site

Artem Trunov

View article, Xrootd Storage Element Deployment at GridKa WLCG Tier-1 Site PDF, Xrootd Storage Element Deployment at GridKa WLCG Tier-1 Site

In the work we present our effort in designing and building of Petabyte-level storage element based on Xrootd with storage backend and SRM integration. This Storage Element is implemented at the German LCG Tier-1 Center "GridKa" for the ALICE experiment. Motivation, use cases and details of implementations are presented. It's shown that Xrootd SE at GridKa has no single point of failure and satisfies the initial requirements.

https://doi.org/10.1088/1742-6596/331/7/072033

072034

The following article is Open access

ATLAS Operations: Experience and Evolution in the Data Taking Era

I Ueda and (forthe ATLAS collaboration)

View article, ATLAS Operations: Experience and Evolution in the Data Taking Era PDF, ATLAS Operations: Experience and Evolution in the Data Taking Era

This paper summarises the operational experience and improvements of the ATLAS hierarchical multi-tier computing infrastructure in the past year leading to taking and processing of the first collisions in 2009 and 2010. Special focus will be given to the Tier-0 which is responsible, among other things, for a prompt processing of the raw data coming from the online DAQ system and is thus a critical part of the chain. We will give an overview of the Tier-0 architecture, and improvements based on the operational experience. Emphasis will be put on the new developments, namely the Task Management System opening Tier-0 to expert users and Web 2.0 monitoring and management suite. We then overview the achieved performances with the distributed computing system, discuss observed data access patterns over the grid and describe how we used this information to improve analysis rates.

https://doi.org/10.1088/1742-6596/331/7/072034

072035

The following article is Open access

CMS distributed analysis infrastructure and operations: experience with the first LHC data

E W Vaandering

View article, CMS distributed analysis infrastructure and operations: experience with the first LHC data PDF, CMS distributed analysis infrastructure and operations: experience with the first LHC data

The CMS distributed analysis infrastructure represents a heterogeneous pool of resources distributed across several continents. The resources are harnessed using glite and glidein-based work load management systems (WMS). We provide the operational experience of the analysis workflows using CRAB-based servers interfaced with the underlying WMS. The automatized interaction of the server with the WMS provides a successful analysis workflow. We present the operational experience as well as methods used in CMS to analyze the LHC data. The interaction with CMS Run-registry for Run and luminosity block selections via CRAB is discussed. The variations of different workflows during the LHC data-taking period and the lessons drawn from this experience are also outlined.

https://doi.org/10.1088/1742-6596/331/7/072035

072036

The following article is Open access

HammerCloud: A Stress Testing System for Distributed Analysis

Daniel C van der Ster, Johannes Elmsheuser, Mario Úbeda García and Massimo Paladin

View article, HammerCloud: A Stress Testing System for Distributed Analysis PDF, HammerCloud: A Stress Testing System for Distributed Analysis

Distributed analysis of LHC data is an I/O-intensive activity which places large demands on the internal network, storage, and local disks at remote computing facilities. Commissioning and maintaining a site to provide an efficient distributed analysis service is therefore a challenge which can be aided by tools to help evaluate a variety of infrastructure designs and configurations. HammerCloud is one such tool; it is a stress testing service which is used by central operations teams, regional coordinators, and local site admins to (a) submit arbitrary number of analysis jobs to a number of sites, (b) maintain at a steady-state a predefined number of jobs running at the sites under test, (c) produce web-based reports summarizing the efficiency and performance of the sites under test, and (d) present a web-interface for historical test results to both evaluate progress and compare sites. HammerCloud was built around the distributed analysis framework Ganga, exploiting its API for grid job management. HammerCloud has been employed by the ATLAS experiment for continuous testing of many sites worldwide, and also during large scale computing challenges such as STEP'09 and UAT'09, where the scale of the tests exceeded 10,000 concurrently running and 1,000,000 total jobs over multi-day periods. In addition, HammerCloud is being adopted by the CMS experiment; the plugin structure of HammerCloud allows the execution of CMS jobs using their official tool (CRAB).

https://doi.org/10.1088/1742-6596/331/7/072036

072037

The following article is Open access

Distributed analysis at LHCb

Mike Williams, Ulrik Egede, Stuart Paterson and (on behalf of the LHCb Collaboration)

View article, Distributed analysis at LHCb PDF, Distributed analysis at LHCb

The distributed analysis experience to date at LHCb has been positive: job success rates are high and wait times for high-priority jobs are low. LHCb users access the grid using the GANGA job-management package, while the LHCb virtual organization manages its resources using the DIRAC package. This clear division of labor has benefitted LHCb and its users greatly; it is a major reason why distributed analysis at LHCb has been so successful. The newly formed LHCb distributed analysis support team has also proved to be a success.

https://doi.org/10.1088/1742-6596/331/7/072037

072038

The following article is Open access

Simulation of the job processing performance at an ALICE Tier-2 site with MONARC

Č Zach, L Betev, D Adamová and (forthe ALICE Collaboration)

View article, Simulation of the job processing performance at an ALICE Tier-2 site with MONARC PDF, Simulation of the job processing performance at an ALICE Tier-2 site with MONARC

The MONARC (MOdels of Networked Analysis at Regional Centers) framework has been developed and designed with the aim to provide a tool for realistic simulations of large scale distributed computing systems, with a special focus on the Grid systems of the experiments at the CERN LHC. In this paper, we describe a usage of the MONARC framework and tools for a simulation of the job processing performance at an ALICE Tier-2 site.

https://doi.org/10.1088/1742-6596/331/7/072038

072039

The following article is Open access

Event Display for the Visualization of CMS Events

L A T Bauerdick, G Eulisse, C D Jones, D Kovalskyi, T McCauley, A Mrak Tadel, J Muelmenstaedt, I Osborne, M Tadel, Y Tu et al

View article, Event Display for the Visualization of CMS Events PDF, Event Display for the Visualization of CMS Events

During the last year the CMS experiment engaged in consolidation of its existing event display programs. The core of the new system is based on the Fireworks event display program which was by-design directly integrated with the CMS Event Data Model (EDM) and the light version of the software framework (FWLite). The Event Visualization Environment (EVE) of the ROOT framework is used to manage a consistent set of 3D and 2D views, selection, user-feedback and user-interaction with the graphics windows; several EVE components were developed by CMS in collaboration with the ROOT project. In event display operation simple plugins are registered into the system to perform conversion from EDM collections into their visual representations which are then managed by the application. Full event navigation and filtering as well as collection-level filtering is supported. The same data-extraction principle can also be applied when Fireworks will eventually operate as a service within the full software framework.

https://doi.org/10.1088/1742-6596/331/7/072039

072040

The following article is Open access

BAT – The Bayesian Analysis Toolkit

Frederik Beaujean, Allen Caldwell, Daniel Kollár and Kevin Kröninger

View article, BAT – The Bayesian Analysis Toolkit PDF, BAT – The Bayesian Analysis Toolkit

The main goals of data analysis are to infer the free parameters of models from data, to draw conclusions on the models' validity, and to compare their predictions allowing to select the most appropriate model.

The Bayesian Analysis Toolkit, BAT, is a tool developed to evaluate the posterior probability distribution for models and their parameters. It is centered around Bayes' Theorem and is realized with the use of Markov Chain Monte Carlo giving access to the full posterior probability distribution. This enables straightforward parameter estimation, limit setting and uncertainty propagation. Additional algorithms, such as Simulated Annealing, allow to evaluate the global mode of the posterior.

BAT is implemented in C++ and allows for a flexible definition of models. It is interfaced to software packages commonly used in high-energy physics: ROOT, Minuit, RooStats and CUBA. A set of predefined models exists to cover standard statistical problems.

https://doi.org/10.1088/1742-6596/331/7/072040

072041

The following article is Open access

Deployment of the CMS software on the WLCG Grid

W Behrenhoff, C Wissing, B Kim, S Blyweert, J D'Hondt, J Maes, M Maes, P Van Mulders, I Villella and L Vanelderen

View article, Deployment of the CMS software on the WLCG Grid PDF, Deployment of the CMS software on the WLCG Grid

The CMS Experiment is taking high energy collision data at CERN. The computing infrastructure used to analyse the data is distributed round the world in a tiered structure. In order to use the 7 Tier-1 sites, the 50 Tier-2 sites and a still growing number of about 30 Tier-3 sites, the CMS software has to be available at those sites. Except for a very few sites the deployment and the removal of CMS software is managed centrally. Since the deployment team has no local accounts at the remote sites all installation jobs have to be sent via Grid jobs. Via a VOMS role the job has a high priority in the batch system and gains write privileges to the software area. Due to the lack of interactive access the installation jobs must be very robust against possible failures, in order not to leave a broken software installation. The CMS software is packaged in RPMs that are installed in the software area independent of the host OS. The apt-get tool is used to resolve package dependencies. This paper reports about the recent deployment experiences and the achieved performance.

https://doi.org/10.1088/1742-6596/331/7/072041

072042

The following article is Open access

Usage of Message Queueing Technologies in the ATLAS Distributed Data Management System

Philippe Calfayan, Zang Dongsong and Vincent Garonne

View article, Usage of Message Queueing Technologies in the ATLAS Distributed Data Management System PDF, Usage of Message Queueing Technologies in the ATLAS Distributed Data Management System

The ATLAS Distributed Data Management system is composed of semi-autonomous, heterogeneous, and independently designed subsystems. To achieve successful operation of such a system, the activities of the agents controlling the subsystems have to be coordinated. In addition, external applications can require to synchronize on events relative to data availability. A common way to proceed is to implement polling strategies within the distributed components, which leads to an increase of the load in the overall system. We describe an alternative based on notifications using standard message queuing. The application of this technology in the distributed system has been exercised.

https://doi.org/10.1088/1742-6596/331/7/072042

072043

The following article is Open access

The Anatomy of a Grid portal

Daniele Licari and Federico Calzolari

View article, The Anatomy of a Grid portal PDF, The Anatomy of a Grid portal

In this paper we introduce a new way to deal with Grid portals referring to our implementation. L-GRID is a light portal to access the EGEE/EGI Grid infrastructure via Web, allowing users to submit their jobs from a common Web browser in a few minutes, without any knowledge about the Grid infrastructure. It provides the control over the complete lifecycle of a Grid Job, from its submission and status monitoring, to the output retrieval. The system, implemented as client-server architecture, is based on the Globus Grid middleware. The client side application is based on a java applet; the server relies on a Globus User Interface. There is no need of user registration on the server side, and the user needs only his own X.509 personal certificate. The system is user-friendly, secure (it uses SSL protocol, mechanism for dynamic delegation and identity creation in public key infrastructures), highly customizable, open source, and easy to install. The X.509 personal certificate does not get out from the local machine. It allows to reduce the time spent for the job submission, granting at the same time a higher efficiency and a better security level in proxy delegation and management.

https://doi.org/10.1088/1742-6596/331/7/072043

072044

The following article is Open access

The CMS Electromagnetic Calorimeter Detector Control System

D Di Calafiori, P Adzic, G Dissertori, O Holme, D Jovanovic, S -W Li, W Lustermann, P Milenovic, J Puzovic and S Zelepoukine

View article, The CMS Electromagnetic Calorimeter Detector Control System PDF, The CMS Electromagnetic Calorimeter Detector Control System

This paper presents the Detector Control System (DCS) designed and implemented for the Electromagnetic Calorimeter (ECAL) of the Compact Muon Solenoid (CMS) experiment at CERN. The focus is on its distributed controls software architecture, the deployment of the application into production and its integration into the overall CMS DCS. The knowledge acquired from operational issues during the detector commissioning and the first phase of the Large Hadron Collider (LHC) physics runs is discussed and future improvements are presented.

https://doi.org/10.1088/1742-6596/331/7/072044

072045

The following article is Open access

ATLAS Distributed Computing Operations Shift Team Experience

Kaushik De, Xavier Espinal, Alessandra Forti, Elena Korolkova, Kai Leffhalm, Peter Love, Jaroslava Schovancová, Yuri Smirnov and (forthe ATLAS Collaboration)

View article, ATLAS Distributed Computing Operations Shift Team Experience PDF, ATLAS Distributed Computing Operations Shift Team Experience

Description of ATLAS Distributed Computing Operations Shift (ADCoS) duties and available monitoring tools along with the operational experience is presented in this paper.

https://doi.org/10.1088/1742-6596/331/7/072045

072046

The following article is Open access

CMS Configuration Editor: GUI based application for user analysis job

A de Cosa

View article, CMS Configuration Editor: GUI based application for user analysis job PDF, CMS Configuration Editor: GUI based application for user analysis job

We present the user interface and the software architecture of the Configuration Editor for the CMS experiment. The analysis workflow is organized in a modular way integrated within the CMS framework that organizes in a flexible way user analysis code. The Python scripting language is adopted to define the job configuration that drives the analysis workflow. It could be a challenging task for users, especially for newcomers, to develop analysis jobs managing the configuration of many required modules. For this reason a graphical tool has been conceived in order to edit and inspect configuration files. A set of common analysis tools defined in the CMS Physics Analysis Toolkit (PAT) can be steered and configured using the Config Editor. A user-defined analysis workflow can be produced starting from a standard configuration file, applying and configuring PAT tools according to the specific user requirements. CMS users can adopt this tool, the Config Editor, to create their analysis visualizing in real time which are the effects of their actions. They can visualize the structure of their configuration, look at the modules included in the workflow, inspect the dependences existing among the modules and check the data flow. They can visualize at which values parameters are set and change them according to what is required by their analysis task. The integration of common tools in the GUI needed to adopt an object-oriented structure in the Python definition of the PAT tools and the definition of a layer of abstraction from which all PAT tools inherit.

https://doi.org/10.1088/1742-6596/331/7/072046

072047

The following article is Open access

ATLAS operations in the GridKa T1/T2 Cloud

G Duckeck, T Harenberg, S Kalinin, G Kawamura, K Leffhalm, J Meyer, S Nderitu, A Olszewski, A Petzold, J Schultes et al

View article, ATLAS operations in the GridKa T1/T2 Cloud PDF, ATLAS operations in the GridKa T1/T2 Cloud

The ATLAS GridKa cloud consists of the GridKa Tier1 centre and 12 Tier2 sites from five countries associated to it. Over the last years a well defined and tested operation model evolved. Several core cloud services need to be operated and closely monitored: distributed data management, involving data replication, deletion and consistency checks; support for ATLAS production activities, which includes Monte Carlo simulation, reprocessing and pilot factory operation; continuous checks of data availability and performance for user analysis; software installation and database setup. Of crucial importance is good communication between sites, operations team and ATLAS as well as efficient cloud level monitoring tools. The paper gives an overview of the operations model and ATLAS services within the cloud.

https://doi.org/10.1088/1742-6596/331/7/072047

072048

The following article is Open access

The distributed production system of the SuperB project: description and results

D Brown, M Corvo, A Di Simone, A Fella, E Luppi, E Paoloni, R Stroili and L Tomassetti

View article, The distributed production system of the SuperB project: description and results PDF, The distributed production system of the SuperB project: description and results

The SuperB experiment needs large samples of MonteCarlo simulated events in order to finalize the detector design and to estimate the data analysis performances. The requirements are beyond the capabilities of a single computing farm, so a distributed production model capable of exploiting the existing HEP worldwide distributed computing infrastructure is needed. In this paper we describe the set of tools that have been developed to manage the production of the required simulated events. The production of events follows three main phases: distribution of input data files to the remote site Storage Elements (SE); job submission, via SuperB GANGA interface, to all available remote sites; output files transfer to CNAF repository. The job workflow includes procedures for consistency checking, monitoring, data handling and bookkeeping. A replication mechanism allows storing the job output on the local site SE. Results from 2010 official productions are reported.

https://doi.org/10.1088/1742-6596/331/7/072048

072049

The following article is Open access

Design and early experience with promoting user-created data in CMS

M Giffels and E W Vaandering

View article, Design and early experience with promoting user-created data in CMS PDF, Design and early experience with promoting user-created data in CMS

The Computing Model of the CMS experiment [1] does not address transfering user-created data between different Grid sites. Due to the limited resources of a single site, distribution of individual user-created datasets between sites is crucial to ensure accessibility. In contrast to official datasets, there are no special requirements for user datasets (e.g. concerning data quality). The StoreResults service provides a mechanism to elevate user-created datasets to central bookkeeping ensuring the data quality is the same as an official dataset. This is a prerequisite for further distribution within the CMS dataset infrastructure.

https://doi.org/10.1088/1742-6596/331/7/072049

072050

The following article is Open access

Distributed analysis functional testing using GangaRobot in the ATLAS experiment

Federica Legger and (forthe ATLAS collaboration)

View article, Distributed analysis functional testing using GangaRobot in the ATLAS experiment PDF, Distributed analysis functional testing using GangaRobot in the ATLAS experiment

Automated distributed analysis tests are necessary to ensure smooth operations of the ATLAS grid resources. The HammerCloud framework allows for easy definition, submission and monitoring of grid test applications. Both functional and stress test applications can be defined in HammerCloud. Stress tests are large-scale tests meant to verify the behaviour of sites under heavy load. Functional tests are light user applications running at each site with high frequency, to ensure that the site functionalities are available at all times. Success or failure rates of these tests jobs are individually monitored. Test definitions and results are stored in a database and made available to users and site administrators through a web interface. In this work we present the recent developments of the GangaRobot framework. GangaRobot monitors the outcome of functional tests, creates a blacklist of sites failing the tests, and exports the results to the ATLAS Site Status Board (SSB) and to the Service Availability Monitor (SAM), providing on the one hand a fast way to identify systematic or temporary site failures, and on the other hand allowing for an effective distribution of the work load on the available resources.

https://doi.org/10.1088/1742-6596/331/7/072050

072051

The following article is Open access

Measuring and understanding computer resource utilization in CMS

J Andreeva, S Belforte, S Blyweert, K Bloom, D Evans, T Kress, J Letts, M Maes, S Padhi, S Sarkar et al

View article, Measuring and understanding computer resource utilization in CMS PDF, Measuring and understanding computer resource utilization in CMS

Significant funds are expended in order to make CMS data analysis possible across Tier-2 and Tier-3 resources worldwide. Here we review how CMS monitors operational success in using those resources, identifies and understands problems, monitors trends, provides feedback to site operators and software developers, and generally accumulates quantitative data on the operational aspects of CMS data analysis. This includes data transfers, data distribution, use of data and software releases for analysis, failure analysis and more.

https://doi.org/10.1088/1742-6596/331/7/072051

072052

The following article is Open access

Dynamic PROOF clusters with PoD: architecture and user experience

Anar Manafov

View article, Dynamic PROOF clusters with PoD: architecture and user experience PDF, Dynamic PROOF clusters with PoD: architecture and user experience

PROOF on Demand (PoD) is a tool-set, which sets up a PROOF cluster on any resource management system. PoD is a user oriented product with an easy to use GUI and a command-line interface. It is fully automated. No administrative privileges or special knowledge is required to use it. PoD utilizes a plug-in system, to use different job submission front-ends. The current PoD distribution is shipped with LSF, Torque (PBS), Grid Engine, Condor, gLite, and SSH plug-ins. The product is to be extended. We therefore plan to implement a plug-in for AliEn Grid as well. Recently developed algorithms made it possible to efficiently maintain two types of connections: packet-forwarding and native PROOF connections. This helps to properly handle most kinds of workers, with and without firewalls. PoD maintains the PROOF environment automatically and, for example, prevents resource misusage in case when workers idle for too long. As PoD matures as a product and provides more plug-ins, it's used as a standard for setting up dynamic PROOF clusters in many different institutions. The GSI Analysis Facility (GSIAF) is in production since 2007. The static PROOF cluster has been phased out end of 2009. GSIAF is now completely based on PoD. Users create private dynamic PROOF clusters on the general purpose batch farm. This provides an easier resource sharing between interactive local batch and Grid usage. The main user communities are FAIR and ALICE.

https://doi.org/10.1088/1742-6596/331/7/072052

072053

The following article is Open access

The German National Analysis Facility as a tool for ATLAS analyses

W Ehrenfeld, K Leffhalm and S Mehlhase

View article, The German National Analysis Facility as a tool for ATLAS analyses PDF, The German National Analysis Facility as a tool for ATLAS analyses

In 2008 the German National Analysis Facility (NAF) at DESY was established. It is attached to and builds on top of DESY Grid infrastructure. The facility is designed to provide the best possible analysis infrastructure for high energy particle physics of the ATLAS, CMS, LHCb and ILC experiments.

The Grid and local infrastructure of the NAF is reviewed with a focus on the ATLAS part. Both parts include large scale storage and a batch system. Emphasis is put on ATLAS specific customisation and utilisation of the NAF. This refers not only to the NAF components but also to the different components of the ATLAS analysis framework.

Experience from operating and supporting ATLAS users on the NAF is presented in this paper. The ATLAS usage of the different components are shown including some typical use cases of user analysis. Finally, the question is addressed, if the design of the NAF meets the ATLAS expectations for efficient data analysis in the era of LHC data taking.

https://doi.org/10.1088/1742-6596/331/7/072053

072054

The following article is Open access

Exploiting the ALICE HLT for PROOF by scheduling of Virtual Machines

Marco Meoni, Stefan Boettger, Pierre Zelnicek, Volker Lindenstruth and Udo Kebschull

View article, Exploiting the ALICE HLT for PROOF by scheduling of Virtual Machines PDF, Exploiting the ALICE HLT for PROOF by scheduling of Virtual Machines

The HLT (High-Level Trigger) group of the ALICE experiment at the LHC has prepared a virtual Parallel ROOT Facility (PROOF) enabled cluster (HAF - HLT Analysis Facility) for fast physics analysis, detector calibration and reconstruction of data samples. The HLT-Cluster currently consists of 2860 CPU cores and 175TB of storage. Its purpose is the online filtering of the relevant part of data produced by the particle detector. However, data taking is not running continuously and exploiting unused cluster resources for other applications is highly desirable and improves the usage-cost ratio of the HLT cluster. As such, unused computing resources are dedicated to a PROOF-enabled virtual cluster available to the entire collaboration. This setup is especially aimed at the prototyping phase of analyses that need a high number of development iterations and a short response time, e.g. tuning of analysis cuts, calibration and alignment. HAF machines are enabled and disabled upon user request to start or complete analysis tasks. This is achieved by a virtual machine scheduling framework which dynamically assigns and migrates virtual machines running PROOF workers to unused physical resources. Using this approach we extend the HLT usage scheme to running both online and offline computing, thereby optimizing the resource usage.

https://doi.org/10.1088/1742-6596/331/7/072054

072055

The following article is Open access

ATLAS Tier-2 at the Compute Resource Center GoeGrid in Göttingen

Jörg Meyer, Arnulf Quadt, Pavel Weber and (forthe ATLAS collaboration)

View article, ATLAS Tier-2 at the Compute Resource Center GoeGrid in Göttingen PDF, ATLAS Tier-2 at the Compute Resource Center GoeGrid in Göttingen

GoeGrid is a grid resource center located in Göttingen, Germany. The resources are commonly used, funded, and maintained by communities doing research in the fields of grid development, computer science, biomedicine, high energy physics, theoretical physics, astrophysics, and the humanities. For the high energy physics community, GoeGrid serves as a Tier-2 center for the ATLAS experiment as part of the world-wide LHC computing grid (WLCG). The status and performance of the Tier-2 center is presented with a focus on the interdisciplinary setup and administration of the cluster. Given the various requirements of the different communities on the hardware and software setup the challenge of the common operation of the cluster is detailed. The benefits are an efficient use of computer and personpower resources.

https://doi.org/10.1088/1742-6596/331/7/072055

072056

The following article is Open access

Visual Physics Data Analysis in the Web Browser

M Brodski, M Erdmann, R Fischer, A Hinzmann, T Klimkovich, D Klingebiel, M Komm, G Müller, J Steggemann and T Winchen

View article, Visual Physics Data Analysis in the Web Browser PDF, Visual Physics Data Analysis in the Web Browser

The project VISPA@WEB provides a novel graphical development environment for physics analyses which only requires a standard web browser on the client machine. It resembles the existing analysis environment available from the project Visual Physics Analysis VISPA, including the connection and configuration of modules for different tasks. High level logic can be programmed using the Python language, while performance-critical tasks can be implemented in C++ modules. The use cases range from simple teaching examples to highly complex scientific analyses.

https://doi.org/10.1088/1742-6596/331/7/072056

072057

The following article is Open access

Experience with PROOF-Lite in ATLAS data analysis

S Y Panitkin, C Hollowell, H Ma, S Ye and (forthe ATLAS Collaboration)

View article, Experience with PROOF-Lite in ATLAS data analysis PDF, Experience with PROOF-Lite in ATLAS data analysis

We discuss our experience with PROOF-Lite in a context of ATLAS Collaboration physics analysis of data obtained during the LHC physics run of 2009-2010. In particular we discuss PROOF-Lite performance in virtual and physical machines, its scalability on different types of multi-core processors and effects of multithreading. We will also describe PROOF-Lite performance with Solid State Drives (SSDs).

https://doi.org/10.1088/1742-6596/331/7/072057

072058

The following article is Open access

The ATLAS PanDA Monitoring System and its Evolution

A Klimentov, P Nevski, M Potekhin and T Wenaus

View article, The ATLAS PanDA Monitoring System and its Evolution PDF, The ATLAS PanDA Monitoring System and its Evolution

The PanDA (Production and Distributed Analysis) Workload Management System is used for ATLAS distributed production and analysis worldwide. The needs of ATLAS global computing imposed challenging requirements on the design of PanDA in areas such as scalability, robustness, automation, diagnostics, and usability for both production shifters and analysis users. Through a system-wide job database, the PanDA monitor provides a comprehensive and coherent view of the system and job execution, from high level summaries to detailed drill-down job diagnostics. It is (like the rest of PanDA) an Apache-based Python application backed by Oracle. The presentation layer is HTML code generated on the fly in the Python application which is also responsible for managing database queries. However, this approach is lacking in user interface flexibility, simplicity of communication with external systems, and ease of maintenance. A decision was therefore made to migrate the PanDA monitor server to Django Web Application Framework and apply JSON/AJAX technology in the browser front end. This allows us to greatly reduce the amount of application code, separate data preparation from presentation, leverage open source for tools such as authentication and authorization mechanisms, and provide a richer and more dynamic user experience. We describe our approach, design and initial experience with the migration process.

https://doi.org/10.1088/1742-6596/331/7/072058

072059

The following article is Open access

The LHCb Experience on the Grid from the DIRAC Accounting Data

Adrian Casajús, Ricardo Graciani, Albert Puig, Ricardo Vázquez and (on behalf ofthe LHCb Collaboration)

View article, The LHCb Experience on the Grid from the DIRAC Accounting Data PDF, The LHCb Experience on the Grid from the DIRAC Accounting Data

DIRAC is the software framework developed by LHCb to manage all its computing operations on the Grid. Since 2003 it has been used for large scale Monte Carlo simulation productions and for user analysis of these data. Since the end of 2009, with the start-up of LHC, DIRAC also takes care of the distribution, reconstruction, selection and analysis of the physics data taken by the detector apparatus. During 2009, DIRAC executed almost 5 million jobs for LHCb. In order to execute this workload slightly over 6 million of pilot jobs were submitted, out of which approximately one third were aborted by the Grid infrastructure. In 2010, thanks to their improved efficiency, DIRAC pilots are able, on average, to match and execute between 2 and 3 LHCb jobs during their lifetime, largely reducing the load on the Grid infrastructure.

Given the large amount of submitted jobs and used resources, it becomes essential to store detailed information about their execution to track the behaviour of the system. The DIRAC Accounting system takes care, among other things, to collect and store data concerning the execution of jobs and pilots, making it available to everyone via the public interface of the LHCb DIRAC web portal in the form of time-binned accumulated distributions. The analysis of the raw accounting data stored allow us to improve and debug the system performance, as well as, to give a detailed picture on how LHCb uses its Grid resources. A new tool has been developed to extract the raw records from the DIRAC Accounting database and to transform them into ROOT files for subsequent study. This contribution presents an analysis of such data both for LHCb jobs and the corresponding pilots, including resource usage, number of pilots per job, job efficiency and other relevant variables that will help to further improving the LHCb Grid experience.

https://doi.org/10.1088/1742-6596/331/7/072059

072060

The following article is Open access

Optimization of Large Scale HEP Data Analysis in LHCb

Daniela Remenska, Roel Aaij, Gerhard Raven, Marcel Merk, Jeff Templon, Reinder J Bril and (on behalf ofthe LHCb Collaboration)

View article, Optimization of Large Scale HEP Data Analysis in LHCb PDF, Optimization of Large Scale HEP Data Analysis in LHCb

Observation has lead to a conclusion that the physics analysis jobs run by LHCb physicists on a local computing farm (i.e. non-grid) require more efficient access to the data which resides on the Grid. Our experiments have shown that the I/O bound nature of the analysis jobs in combination with the latency due to the remote access protocols (e.g. rfio, dcap) cause a low CPU efficiency of these jobs. In addition to causing a low CPU efficiency, the remote access protocols give rise to high overhead (in terms of amount of data transferred). This paper gives an overview of the concept of pre-fetching and caching of input files in the proximity of the processing resources, which is exploited to cope with the I/O bound analysis jobs. The files are copied from Grid storage elements (using GridFTP), while concurrently performing computations, inspired from a similar idea used in the ATLAS experiment. The results illustrate that this file staging approach is relatively insensitive to the original location of the data, and a significant improvement can be achieved in terms of the CPU efficiency of an analysis job. Dealing with scalability of such a solution on the Grid environment is discussed briefly.

https://doi.org/10.1088/1742-6596/331/7/072060

072061

The following article is Open access

Interactive Analysis using PROOF in a GRID Infrastructure

Ana Yaiza Rodríguez Marrero, Isidro González Caballero, Alberto Cuesta Noriega and Francisco Matorras Weinig

View article, Interactive Analysis using PROOF in a GRID Infrastructure PDF, Interactive Analysis using PROOF in a GRID Infrastructure

Current high energy physics experiments aim to explore new territories where new physics is expected. In order to achieve that, a huge amount of data has to be collected and analyzed. The accomplishment of these scientific projects require computing resources beyond the capabilities of a single user or group, thus the data is treated under the grid infrastructure. Despite the reduction applied to the data, the sample used in the last step of the analysis is still large. At this phase, interactivity contributes to a faster optimization of the final cuts in order to improve the results. The Parallel ROOT Facility (PROOF) is intended to speed up even further this procedure providing the user analysis results within a shorter time by simultaneously using more cores. Taking profit of the computing resources and facilities available at Instituto de Física de Cantabria (IFCA), shared between two major projects LHC-CMS Tier-2 and GRID-CSIC, we have developed a setup that integrates PROOF with SGE as local resource management system and GPFS as file system, both common to the grid infrastructure. The setup was also integrated in a similar infrastructure for the LHC-CMS Tier-3 at Universidad de Oviedo that uses Torque (PBS) as local job manager and Hadoop as file system. In addition, to ease the transition from a sequential analysis code to PROOF, an analysis framework based on the TSelector class is provided. Integrating PROOF in a cluster provides users the potential usage of thousands of cores (1,680 in the IFCA case). Performance measurements have been done showing a speed improvement closely correlated with the number of cores used.

https://doi.org/10.1088/1742-6596/331/7/072061

072062

The following article is Open access

Understanding I/O patterns and performance of CMS Data Analysis across T2s worldwide

Leonardo Sala and (on behalf ofthe CMS Collaboration)

View article, Understanding I/O patterns and performance of CMS Data Analysis across T2s worldwide PDF, Understanding I/O patterns and performance of CMS Data Analysis across T2s worldwide

We discuss the I/O patterns and performance of CMS Data Analysis based on measurements on "the bench" as well as operational observations during the initial LHC data taking period. It was observed the default I/O settings for CMS data analysis resulted in poor CPU efficiency and high I/O load at many T2 sites. This effort developed tools to measure CMSSW I/O performance for grid-based analysis. Our work enabled CMSSW developers to understand how new I/O strategies work across the full range of deployed T2 storage solutions. We discuss our results, based upon aggregate statistics collected worldwide.

https://doi.org/10.1088/1742-6596/331/7/072062

072063

The following article is Open access

PROOF on a Batch System

W Behrenhoff, W Ehrenfeld, J Samson and H Stadie

View article, PROOF on a Batch System PDF, PROOF on a Batch System

The "parallel ROOT facility" (PROOF) from the ROOT framework provides a mechanism to distribute the load of interactive and non-interactive ROOT sessions on a set of worker nodes optimising the overall execution time. While PROOF is designed to work on a dedicated PROOF cluster, the benefits of PROOF can also be used on top of another batch scheduling system with the help of temporary per user PROOF clusters. We will present a lightweight tool which starts a temporary PROOF cluster on a SGE based batch cluster or, via a plugin mechanism, e.g. on a set of bare desktops via ssh. Further, we will present the result of benchmarks which compare the data throughput for different data storage back ends available at the German National Analysis Facility (NAF) at DESY.

https://doi.org/10.1088/1742-6596/331/7/072063

072064

The following article is Open access

A Job Monitoring and Accounting Tool for the LSF Batch System

Subir Sarkar and Sonia Taneja

View article, A Job Monitoring and Accounting Tool for the LSF Batch System PDF, A Job Monitoring and Accounting Tool for the LSF Batch System

This paper presents a web based job monitoring and group-and-user accounting tool for the LSF Batch System. The user oriented job monitoring displays a simple and compact quasi real-time overview of the batch farm for both local and Grid jobs. For Grid jobs the Distinguished Name (DN) of the Grid users is shown. The overview monitor provides the most up-to-date status of a batch farm at any time. The accounting tool works with the LSF accounting log files. The accounting information is shown for a few pre-defined time periods by default. However, one can also compute the same information for any arbitrary time window. The tool already proved to be an extremely useful means to validate more extensive accounting tools available in the Grid world. Several sites have already been using the present tool and more sites running the LSF batch system have shown interest. We shall discuss the various aspects that make the tool essential for site administrators and end-users alike and outline the current status of development as well as future plans.

https://doi.org/10.1088/1742-6596/331/7/072064

072065

The following article is Open access

The consistency service of the ATLAS Distributed Data Management system

Cédric Serfon, Vincent Garonne and (forthe ATLAS Collaboration)

View article, The consistency service of the ATLAS Distributed Data Management system PDF, The consistency service of the ATLAS Distributed Data Management system

With the continuously increasing volume of data produced by ATLAS and stored on the WLCG sites, the probability of data corruption or data losses, due to software and hardware failures is increasing. In order to ensure the consistency of all data produced by ATLAS a Consistency Service has been developed as part of the DQ2 Distributed Data Management system. This service is fed by the different ATLAS tools, i.e. the analysis tools, production tools, DQ2 site services or by site administrators that report corrupted or lost files. It automatically corrects the errors reported and informs the users in case of irrecoverable file loss.

https://doi.org/10.1088/1742-6596/331/7/072065

072066

The following article is Open access

Distributed Russian Tier-2 – RDIG in Simulation and Analysis of Alice Data From LHC

A Bogdanov, L Jancurova, A Kiryanov, V Kotlyar, V Mitsyn, Y Lyublev, E Ryabinkin, G Shabratova, S Smirnov, L Stepanova et al

View article, Distributed Russian Tier-2 – RDIG in Simulation and Analysis of Alice Data From LHC PDF, Distributed Russian Tier-2 – RDIG in Simulation and Analysis of Alice Data From LHC

On the threshold of LHC data there were intensive test and upgrade of GRID application software for all LHC experiments at the top of the modern LCG middleware (gLite). The update of such software for ALICE experiment at LHC, AliEn[1] had provided stable and secure operation of sites developing LHC data. The activity of Russian RDIG (Russian Data Intensive GRID) computer federation which is the distributed Tier-2 centre are devoted to simulation and analysis of LHC data in accordance with the ALICE computing model [2]. Eight sites of this federation interesting in ALICE activity upgrade their middle ware in accordance with requirements of ALICE computing what ensured success of MC production and end-user analysis activity at all eight sites. The result of occupancy and efficiency of each site in the time of LHC operation will be presented in the report. The outline the results of CPU and disk space usage at RDIG sites for the data simulation and analysis of first LHC data from the exposition of ALICE detector [3] will be presented as well. There will be presented also the information about usage of parallel analysis facility based on PROOF [4].

https://doi.org/10.1088/1742-6596/331/7/072066

072067

The following article is Open access

A policy system for Grid Management and Monitoring

Federico Stagni, Roberto Santinelli and (on behalf ofLHCb Collaboration)

View article, A policy system for Grid Management and Monitoring PDF, A policy system for Grid Management and Monitoring

Organizations using a Grid computing model are faced with non-traditional administrative challenges: the heterogeneous nature of the underlying resources requires professionals acting as Grid Administrators. Members of a Virtual Organization (VO) can use a subset of available resources and services in the grid infrastructure and in an ideal world, the more resoures are exploited the better. In the real world, the less faulty services, the better: experienced Grid administrators apply procedures for adding and removing services, based on their status, as it is reported by an ever-growing set of monitoring tools. When a procedure is agreed and well-exercised, a formal policy could be derived. For this reason, using the DIRAC framework in the LHCb collaboration, we developed a policy system that can enforce management and operational policies, in a VO-specific fashion. A single policy makes an assessment on the status of a subject, relative to one or more monitoring information. Subjects of the policies are monitored entities of an established Grid ontology. The status of a same entity is evaluated against a number of policies, whose results are then combined by a Policy Decision Point. Such results are enforced in a Policy Enforcing Point, which provides plug-ins for actions, like raising alarms, sending notifications, automatic addition and removal of services and resources from the Grid mask. Policy results are shown in the web portal, and site-specific views are provided also. This innovative system provides advantages in terms of procedures automation, information aggregation and problem solving.

https://doi.org/10.1088/1742-6596/331/7/072067

072068

The following article is Open access

Iberian ATLAS Cloud response during the first LHC collisions

M Villaplana Perez, G Amorós, G Borges, C Borrego, J Carvalho, M David, X Espinal, A Fernández, J Gomes, S González de la Hoz et al

View article, Iberian ATLAS Cloud response during the first LHC collisions PDF, Iberian ATLAS Cloud response during the first LHC collisions

The computing model of the ATLAS experiment at the LHC (Large Hadron Collider) is based on a tiered hierarchy that ranges from Tier0 (CERN) down to end-user's own resources (Tier3). According to the same computing model, the role of the Tier2s is to provide computing resources for event simulation processing and distributed data analysis. Tier3 centers, on the other hand, are the responsibility of individual institutions to define, fund, deploy and support. In this contribution we report on the operations of the ATLAS Iberian Cloud centers facing data taking and we describe some of the Tier3 facilities currently deployed at the Cloud.

https://doi.org/10.1088/1742-6596/331/7/072068

072069

The following article is Open access

PanDA Pilot Submission using Condor-G: Experience and Improvements

Xin Zhao, John Hover, Tomasz Wlodek, Torre Wenaus, Jaime Frey, Todd Tannenbaum, Miron Livny and ATLAS Collaboration

View article, PanDA Pilot Submission using Condor-G: Experience and Improvements PDF, PanDA Pilot Submission using Condor-G: Experience and Improvements

PanDA (Production and Distributed Analysis) is the workload management system of the ATLAS experiment, used to run managed production and user analysis jobs on the grid. As a late-binding, pilot-based system, the maintenance of a smooth and steady stream of pilot jobs to all grid sites is critical for PanDA operation. The ATLAS Computing Facility (ACF) at BNL, as the ATLAS Tier1 center in the US, operates the pilot submission systems for the US. This is done using the PanDA "AutoPilot" scheduler component which submits pilot jobs via Condor-G, a grid job scheduling system developed at the University of Wisconsin-Madison. In this paper, we discuss the operation and performance of the Condor-G pilot submission at BNL, with emphasis on the challenges and issues encountered in the real grid production environment. With the close collaboration of Condor and PanDA teams, the scalability and stability of the overall system has been greatly improved over the last year. We review improvements made to Condor-G resulting from this collaboration, including isolation of site-based issues by running a separate Gridmanager for each remote site, introduction of the 'Nonessential' job attribute to allow Condor to optimize its behavior for the specific character of pilot jobs, better understanding and handling of the Gridmonitor process, as well as better scheduling in the PanDA pilot scheduler component. We will also cover the monitoring of the health of the system.

https://doi.org/10.1088/1742-6596/331/7/072069

Table of contents

Volume 331

Contributed

Poster

Journal links