Table of contents

Volume 664

2015

Data store and access

Accepted papers received: 13 November 2015
Published online: 23 December 2015

042001
The following article is Open access

During operations, NOvA produces between 5,000 and 7,000 raw files per day with peaks in excess of 12,000. These files must be processed in several stages to produce fully calibrated and reconstructed analysis files. In addition, many simulated neutrino interactions must be produced and processed through the same stages as data. To accommodate the large volume of data and Monte Carlo, production must be possible both on the Fermilab grid and on off-site farms, such as the ones accessible through the Open Science Grid. To handle the challenge of cataloging these files and to facilitate their off-line processing, we have adopted the SAM system developed at Fermilab. SAM indexes files according to metadata, keeps track of each file's physical locations, provides dataset management facilities, and facilitates data transfer to off-site grids. To integrate SAM with Fermilab's art software framework and the NOvA production workflow, we have developed methods to embed metadata into our configuration files, art files, and standalone ROOT files. A module in the art framework propagates the embedded information from configuration files into art files, and from input art files to output art files, allowing us to maintain a complete processing history within our files. Embedding metadata in configuration files also allows configuration files indexed in SAM to be used as inputs to Monte Carlo production jobs. Further, SAM keeps track of the input files used to create each output file. Parentage information enables the construction of self-draining datasets which have become the primary production paradigm used at NOvA. In this paper we will present an overview of SAM at NOvA and how it has transformed the file production framework used by the experiment.
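
As an illustration of this workflow, here is a minimal sketch of declaring a file with embedded metadata to SAM from Python, assuming the samweb_client package; the experiment name, metadata fields and file names are invented for illustration:

    import samweb_client  # Python client for the SAM web interface

    # Connect to the experiment's SAM instance (name is an assumption).
    samweb = samweb_client.SAMWebClient(experiment='nova')

    # Hypothetical metadata for a reconstructed file; the parentage
    # recorded in 'parents' is what makes self-draining datasets possible.
    metadata = {
        'file_name': 'reco_example.root',
        'file_size': 123456789,
        'data_tier': 'full-reconstructed',
        'parents': ['raw_example.raw'],
    }
    samweb.declareFile(md=metadata)

    # Datasets are defined by metadata queries ("dimensions").
    files = samweb.listFiles('data_tier full-reconstructed')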

042002
The following article is Open access

Data generation rates are expected to grow very fast for some database workloads going into LHC run 2 and beyond. In particular this is expected for data coming from controls, logging and monitoring systems. Storing, administering and accessing big data sets in a relational database system can quickly become a very hard technical challenge, as the size of the active data set and the number of concurrent users increase. Scale-out database technologies are a rapidly developing set of solutions for deploying and managing very large data warehouses on commodity hardware and with open source software. In this paper we will describe the architecture and tests on database systems based on Hadoop and the Cloudera Impala engine. We will discuss the results of our tests, including tests of data loading and integration with existing data sources and in particular with relational databases. We will report on query performance tests done with various data sets of interest at CERN, notably data from the accelerator log database.
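
A sketch of the kind of access being benchmarked: querying an Impala daemon from Python with the impyla package. The host, table and column names are hypothetical:

    from impala.dbapi import connect  # impyla package

    # 21050 is Impala's default HiveServer2-compatible port.
    conn = connect(host='impala.example.cern.ch', port=21050)
    cur = conn.cursor()

    # Hypothetical aggregation over accelerator-log style records.
    cur.execute("""
        SELECT variable_name, COUNT(*) AS n, AVG(value) AS mean_value
        FROM accelerator_log
        WHERE ts BETWEEN '2015-01-01' AND '2015-02-01'
        GROUP BY variable_name
    """)
    for name, n, mean in cur.fetchall():
        print(name, n, mean)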

042003
The following article is Open access

The EventIndex is the complete catalogue of all ATLAS events, keeping the references to all files that contain a given event in any processing stage. It replaces the TAG database, which had been in use during LHC Run 1. For each event it contains its identifiers, the trigger pattern and the GUIDs of the files containing it. Major use cases are event picking, feeding the Event Service used on some production sites, and technical checks of the completion and consistency of processing campaigns. The system design is highly modular so that its components (data collection system, storage system based on Hadoop, query web service and interfaces to other ATLAS systems) could be developed separately and in parallel during LS1. The EventIndex is in operation for the start of LHC Run 2. This paper describes the high-level system architecture, the technical design choices and the deployment process and issues. The performance of the data collection and storage systems, as well as the query services, is also reported.

042004
The following article is Open access

The distributed file system landscape is scattered. Besides a plethora of research file systems, there is also a large number of production grade file systems with various strengths and weaknesses. The file system, as an abstraction of permanent storage, is appealing because it provides application portability and integration with legacy and third-party applications, including UNIX utilities. On the other hand, the general and simple file system interface makes it notoriously difficult for a distributed file system to perform well under a variety of different workloads. This contribution provides a taxonomy of commonly used distributed file systems and points out areas of research and development that are particularly important for high-energy physics.

042005
The following article is Open access

The ATLAS detector at the LHC consists of several sub-detector systems. Both data taking and Monte Carlo (MC) simulation rely on an accurate description of the detector conditions from every subsystem, such as calibration constants, different scenarios of pile-up and noise conditions, size and position of the beam spot, etc. In order to guarantee database availability for critical online applications during data-taking, two database systems, one for online access and another one for all other database access, have been implemented.

The long shutdown period has provided the opportunity to review and improve the Run-1 system: revise workflows, include new and innovative monitoring and maintenance tools and implement a new database instance for Run-2 conditions data. The detector conditions are organized by tag identification strings and managed independently by the different sub-detector experts. The individual tags are then collected and associated into a global conditions tag, assuring synchronization of the various sub-detector improvements. Furthermore, a new concept was introduced to maintain conditions over all different data run periods within a single tag, by using Interval of Validity (IOV) dependent detector conditions for the MC database as well. This allows on-the-fly preservation of past conditions for data and MC and assures their sustainability with software evolution.
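
The IOV mechanism can be pictured with a small sketch (invented data structures, not the actual ATLAS COOL API): conditions under a tag are ordered by the start of their interval of validity, and the payload for a given run is the one whose interval covers it:

    import bisect

    # One tag: conditions ordered by the first run of each IOV; each
    # payload is valid until the next interval begins.
    iov_starts = [0, 200000, 267000]
    payloads = ['align-v1', 'align-v2', 'align-v3']

    def condition_for_run(run):
        """Return the payload whose IOV covers the given run number."""
        i = bisect.bisect_right(iov_starts, run) - 1
        return payloads[i]

    assert condition_for_run(250000) == 'align-v2'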

This paper presents an overview of the commissioning of the new database instance, improved tools and workflows, and summarizes the actions taken during the Run-2 commissioning phase in the beginning of 2015.

042006
The following article is Open access

CERN's tape-based archive system has collected over 70 Petabytes of data during the first run of the LHC. The Long Shutdown is being used for migrating the complete 100 Petabyte data archive to higher-density tape media. During LHC Run 2, the archive will have to cope with yearly growth rates of up to 40-50 Petabytes. In this contribution, we describe the scalable setup for coping with the storage and long-term archival of such massive data volumes. We also review the resulting challenges and the mechanisms devised for measuring and enhancing availability and reliability, as well as for ensuring the long-term integrity and bit-level preservation of the complete data repository. The procedures and tools for the proactive and efficient operation of the tape infrastructure are described, including the features developed for automated problem detection, identification and notification. Finally, we present an outlook on future capacity growth and how it matches the expected evolution of tape technology.

042007
The following article is Open access

CASTOR (the CERN Advanced STORage system) is used to store the custodial copy of all of the physics data collected from the CERN experiments, both past and present. CASTOR is a hierarchical storage management system that has a disk-based front-end and a tape-based back-end. The software responsible for controlling the tape back-end has been redesigned and redeveloped over the last year and was put in production at the beginning of 2015. This paper summarises the motives behind the redesign, describes in detail the redevelopment work and concludes with the short and long-term benefits.

042008
The following article is Open access

Within WLCG much has been discussed concerning the possible demise of the Storage Resource Manager (SRM) and its replacement with different technologies such as XRootD and WebDAV. Each of these storage interfaces presents different functionalities, and experiments currently make use of all of them at different sites. At the RAL Tier-1 we have been monitoring the usage of both SRM and XRootD by all the major experiments to assess when and if some of the nodes hosting the SRM should be redeployed to support XRootD. Initial results, presented previously, showed that the SRM still handles the majority of requests issued by most experiments but with an increasing usage of XRootD, particularly by ATLAS. This paper updates those results with several months of additional data and shows that the SRM is still widely used by all the major WLCG VOs. We break down this usage by read/write operations and by geographic source to show how a large Tier 1 is used globally. We also analyse usage by 'use case' (archival storage, persistent storage and scratch storage) and show how different experiments make use of the SRM.

042009
The following article is Open access

Several scientific fields, including Astrophysics, Astroparticle Physics, Cosmology, Nuclear and Particle Physics, and Research with Photons, estimate that by the 2020s they will require data handling systems with data volumes approaching the zettabyte scale, distributed amongst as many as 10^18 individually addressable data objects (Zettabyte-Exascale systems). It may be convenient or necessary to deploy such systems using multiple physical sites. This paper describes the findings of a working group composed of experts from several

042010
The following article is Open access

In this paper we describe specific technical solutions put in place in various database applications of the ATLAS experiment at the LHC, where we make use of several partitioning techniques available in Oracle 11g. With the broadly used range partitioning and its option of automatic interval partitioning, we add our own logic in PL/SQL procedures and scheduler jobs to sustain sliding data windows in order to enforce various data retention policies. We also make use of the new Oracle 11g reference partitioning in the Nightly Build System to achieve uniform data segmentation. However, the most challenging issue was to segment the data of the new ATLAS Distributed Data Management system (Rucio), which resulted in tens of thousands of list-type partitions and sub-partitions. Partition and sub-partition management, index strategy, statistics gathering and query execution plan stability are important factors when choosing an appropriate physical model for the application data management. We also share the knowledge accumulated so far and our analysis of the new Oracle 12c features that could be beneficial.

042011
The following article is Open access

In April of 2014, the UCSD T2 Center deployed hdfs-xrootd-fallback, a UCSD-developed software system that interfaces Hadoop with XRootD to increase the reliability of the Hadoop file system. The hdfs-xrootd-fallback system allows a site to depend less on local file replication and more on the global replication provided by the XRootD federation to ensure data redundancy. Deploying the software has allowed us to reduce Hadoop replication on a significant subset of files in our cluster, freeing hundreds of terabytes in our local storage, and to recover HDFS blocks lost due to storage degradation. An overview of the architecture of the hdfs-xrootd-fallback system will be presented, as well as details of our experience operating the service over the past year.

042012
The following article is Open access

A common use pattern in the computing models of particle physics experiments is running many distributed applications that read from a shared set of data files. We refer to this data as auxiliary data, to distinguish it from (a) event data from the detector (which tends to be different for every job), and (b) conditions data about the detector (which tends to be the same for each job in a batch of jobs). Conditions data also tends to be relatively small per job, whereas both event data and auxiliary data are larger per job. Unlike event data, auxiliary data comes from a limited working set of shared files. Since there is spatial locality in the auxiliary data access, the use case appears to be identical to that of the CernVM-Filesystem (CVMFS). However, we show that distributing auxiliary data through CVMFS causes the existing CVMFS infrastructure to perform poorly. We utilize a CVMFS client feature called "alien cache" to cache data on existing local high-bandwidth data servers that were engineered for storing event data. This cache is shared between the worker nodes at a site and replaces caching CVMFS files on both the worker node local disks and on the site's local squids. We have tested this alien cache with the dCache NFSv4.1 interface, Lustre, and the Hadoop Distributed File System (HDFS) FUSE interface, and measured performance. In addition, we use high-bandwidth data servers at central sites to perform the CVMFS Stratum 1 function instead of the low-bandwidth web servers deployed for the CVMFS software distribution function. We have tested this using the dCache HTTP interface. As a result, we have a design for an end-to-end high-bandwidth distributed caching read-only filesystem, using existing client software already widely deployed to grid worker nodes and existing file servers already widely installed at grid sites. Files are published in a central place and are soon available on demand throughout the grid, cached locally at each site, with a convenient POSIX interface. This paper discusses the details of the architecture and reports performance measurements.
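
For illustration, the client side of such a setup amounts to a few CVMFS parameters; CVMFS_ALIEN_CACHE is the documented setting that redirects the client cache to an externally managed shared location, while the repository name, proxy and path below are site-specific assumptions:

    # /etc/cvmfs/default.local (sketch; values are assumptions)
    CVMFS_REPOSITORIES=auxdata.example.org
    CVMFS_HTTP_PROXY="http://squid.example.org:3128"
    # Place the cache on a shared high-bandwidth filesystem (e.g. a
    # FUSE-mounted HDFS path) instead of the worker node local disk.
    CVMFS_ALIEN_CACHE=/mnt/hdfs/cvmfs-alien-cache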

042013
The following article is Open access

A cloud-based Virtual Analysis Facility (VAF) for the ALICE experiment at the LHC has been deployed in Bari. Similar facilities are currently running at other Italian sites, with the aim of creating a federation of interoperating farms able to provide their computing resources for interactive distributed analysis. The use of cloud technology, along with elastic provisioning of computing resources as an alternative to the grid for running data-intensive analyses, is the main challenge of these facilities. One of the crucial aspects of user-driven analysis execution is data access. A local storage facility has the disadvantage that the stored data can be accessed only locally, i.e. from within the single VAF. To overcome this limitation, a federated infrastructure, which provides full access to all the data belonging to the federation independently of the site where they are stored, has been set up. The federation architecture exploits both cloud computing and XRootD technologies in order to provide a dynamic, easy-to-use and well-performing solution for data handling. It should allow users to store files and efficiently retrieve data, since it implements a dynamic distributed cache among many datacenters in Italy connected to one another through the high-bandwidth national network. Details on the preliminary architecture implementation and performance studies are discussed.

042014
The following article is Open access

With the exponential growth of LHC (Large Hadron Collider) data in the years 2010-2012, distributed computing has become the established way to analyse collider data. The ATLAS experiment Grid infrastructure includes more than 130 sites worldwide, ranging from large national computing centres to smaller university clusters. So far the storage technologies and access protocols to the clusters that host this tremendous amount of data vary from site to site. HTTP/WebDAV offers the possibility to use a unified industry standard to access the storage. We present the deployment and testing of HTTP/WebDAV for local and remote data access in the ATLAS experiment for the new data management system Rucio and the PanDA workload management system. Deployment and large scale tests have been performed using the Grid testing system HammerCloud and the ROOT HTTP plugin Davix.
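
From the client side, such access reduces to opening an HTTP(S) URL; a minimal PyROOT sketch with a hypothetical WebDAV endpoint and tree name (ROOT dispatches http/dav URLs to the Davix plugin):

    import ROOT

    # Hypothetical HTTPS door of a grid storage element.
    url = "https://se.example.org:443/atlas/user/example/ntuple.root"

    # TFile.Open routes the URL to the Davix-based HTTP plugin, so the
    # remote file is read over HTTP/WebDAV like a local one.
    f = ROOT.TFile.Open(url)
    tree = f.Get("events")  # tree name is an assumption
    print(tree.GetEntries())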

042015
The following article is Open access

Starting from the experience collected by the ATLAS and CMS experiments in handling condition data during the first LHC run, we present a proposal for a new generation of condition databases, which could be implemented by 2020. We will present the identified relevant data flows for condition data and underline the common use cases that lead to a joint effort for the development of a new system.

Condition data is needed in any scientific experiment. It includes any ancillary data associated with primary data taking, such as detector configuration, state or calibration, or the environment in which the detector is operating. Condition data typically reside outside the primary data store for various reasons (size, complexity or availability) and are best accessed at the point of processing or analysis (including for Monte Carlo simulations). The ability of any experiment to produce correct and timely results depends on the complete and efficient availability of the needed conditions for each stage of data handling. Therefore, any experiment needs a condition data architecture which can not only store conditions, but also deliver the data efficiently, on demand, to a potentially diverse and geographically distributed set of clients. The architecture design should include facilities to ease conditions management and to monitor condition entry, access and usage.

042016
The following article is Open access

Usage of condition data in ATLAS is extensive for offline reconstruction and analysis (e.g. alignment, calibration, data quality). The system is based on the LCG Conditions Database infrastructure, with read and write access via an ad hoc C++ API (COOL), a system which was developed before Run 1 data taking began. The infrastructure dictates that the data is organized into separate schemas (assigned to subsystems/groups storing distinct and independent sets of conditions), making it difficult to access information from several schemas at the same time. We have thus created PL/SQL functions containing queries to provide content extraction at multi-schema level. The PL/SQL API has been exposed to external clients by means of a Java application providing DB access via REST services, deployed inside an application server (JBoss WildFly). The services allow navigation over multiple schemas via simple URLs. The data can be retrieved either in XML or JSON formats, via simple clients (like curl or Web browsers).
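
Client access then reduces to plain HTTP; a sketch using the Python requests package, where the base URL, path and parameters are invented placeholders for whatever the deployed REST services expose:

    import requests

    # Hypothetical endpoint of the multi-schema REST service.
    base = "https://conddb.example.cern.ch/api"

    # Request JSON; the same URL with Accept: application/xml
    # would return XML instead.
    resp = requests.get(base + "/iovs",
                        params={"schema": "ATLAS_COOLOFL_CALO",
                                "tag": "example-tag"},
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    for iov in resp.json():
        print(iov)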

042017
The following article is Open access

A wide range of detector commissioning, calibration and data analysis tasks is carried out by CMS using dedicated storage resources available at the CMS CERN Tier-2 centre. Relying on the functionalities of the EOS disk-only storage technology, the optimal exploitation of the CMS user/group resources has required the introduction of policies for data access management, data protection, cleanup campaigns based on access patterns, and long-term tape archival. The resource management has been organised around the definition of working groups, with the composition of each group delegated to an identified responsible person. In this paper we illustrate the user/group storage management, and the development and operational experience at the CMS CERN Tier-2 centre in the 2012-2015 period.

042018
The following article is Open access

The Disk Pool Manager is an example of a multi-protocol, multi-VO system for data access on the Grid that went through a considerable technical evolution in recent years. Among other features, its architecture offers the opportunity of testing its different data access frontends under exactly the same conditions, including hardware and backend software. This characteristic inspired the idea of collecting monitoring information from various testbeds in order to benchmark the behaviour of the HTTP and Xrootd protocols for the use case of data analysis, batch or interactive. A source of information is the set of continuous tests that are run towards the worldwide endpoints belonging to the DPM Collaboration, which accumulated relevant statistics in its first year of activity. On top of that, the DPM releases are based on multiple levels of automated testing that include performance benchmarks of various kinds, executed regularly every day. At the same time, the recent releases of DPM can report monitoring information about any data access protocol to the same monitoring infrastructure that is used to monitor the Xrootd deployments. Our goal is to evaluate under which circumstances the HTTP-based protocols can be good enough for batch or interactive data access. In this contribution we show and discuss the results that our test systems have collected under circumstances that include ROOT analyses using TTreeCache and stress tests on the metadata performance.
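
The ROOT-side settings exercised by such analyses are few; a PyROOT sketch with a hypothetical DPM URL and tree name:

    import ROOT

    # Remote file served by a DPM frontend (URL is an assumption);
    # an https:// URL would exercise the HTTP frontend instead.
    f = ROOT.TFile.Open("root://dpm.example.org//home/vo/data.root")
    tree = f.Get("events")

    # Enable TTreeCache: a 30 MB cache that learns the used branches
    # over the first 100 entries, then issues large vectored reads.
    tree.SetCacheSize(30 * 1024 * 1024)
    tree.SetCacheLearnEntries(100)

    for i in range(tree.GetEntries()):
        tree.GetEntry(i)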

042019
The following article is Open access

During the LHC Run-1, Grid resources in ATLAS have been managed by the PanDA and DQ2 systems. In order to meet the needs for the LHC Run-2, Prodsys2 and Rucio are used as the new ATLAS Workload and Data Management systems.

The data are stored under various formats in ROOT files and end-user physicists have the choice to use either the ATHENA framework or directly ROOT. Within the ROOT data analysis framework it is possible to perform analysis of huge sets of ROOT files in parallel with PROOF on clusters of computers (usually organised in analysis facilities) or multi-core machines. In addition, PROOF-on-Demand (PoD) can be used to enable PROOF on top of an existing resource management system.

In this work, we present the first performance results obtained enabling PROOF-based analysis at CERN and in some of the Italian ATLAS Tier-2 sites within the new ATLAS workload system. Benchmark tests of data access with the httpd protocol, also using the httpd redirector, will be shown. We also present results of startup latency tests using the new PROOF functionality of dynamic worker addition, which improves the performance of PoD using Grid resources. These new results will be compared with the expected improvements discussed in a previous work.

042020
The following article is Open access

For the past eight years, the CERN IT Database group has based its backend storage on a NAS (Network-Attached Storage) architecture, providing database access via the NFS (Network File System) protocol. In the last two and a half years, our storage has evolved from a scale-up architecture to a scale-out one. This paper describes our setup and a set of functionalities providing key features to other services such as Database on Demand [1] or the CERN Oracle backup and recovery service. It also outlines a possible evolution path that storage for databases could follow.

042021
The following article is Open access

At CERN, a number of key database applications are running on user-managed MySQL, PostgreSQL and Oracle database services. The Database on Demand (DBoD) project was born out of an idea to provide the CERN user community with an environment to develop and run database services as a complement to the central Oracle-based database service. Database on Demand empowers users to perform certain actions that have traditionally been done by database administrators, providing an enterprise platform for database applications. It also allows the CERN user community to run different database engines; presently, three major RDBMS (relational database management system) vendors are offered. In this article we show the actual status of the service after almost three years of operations, some insight into our software redesign and re-engineering, and the near-future evolution.

042022
The following article is Open access

The LHC program has been successful in part due to the globally distributed computing resources used for collecting, serving, processing, and analyzing the large LHC datasets. The introduction of distributed computing early in the LHC program spawned the development of new technologies and techniques to synchronize information and data between physically separated computing centers. Two of the most challenging services are the distributed file systems and the distributed data management systems. In this paper I will discuss how we have evolved from local site services to more globally independent services in the areas of distributed file systems and data management, and how these capabilities may continue to evolve into the future. I will address the design choices, the motivations, and the future evolution of the computing systems used for High Energy Physics.

042023
The following article is Open access

In recent years the concepts of Big Data have become well established in IT. Systems managing large data volumes produce metadata that describe data and workflows. These metadata are used to obtain information about the current system state and for statistical and trend analysis of the processes these systems drive. Over time the amount of stored metadata can grow dramatically. In this article we present our studies demonstrating how metadata storage scalability and performance can be improved by using a hybrid RDBMS/NoSQL architecture.

042024
The following article is Open access

The Condition Database plays a key role in the CMS computing infrastructure. The complexity of the detector and the variety of the sub-systems involved set tight requirements for handling the Conditions. In the last two years the collaboration has put a substantial effort into the re-design of the Condition Database system, with the aim of improving the scalability and the operability for the data taking starting in 2015. The re-design has focused on simplifying the architecture, using the lessons learned during the operation of the Run I data-taking period (2009-2013). In the new system the relational features of the database schema are mainly exploited to handle the metadata (Tag and Interval of Validity), allowing for a limited and controlled set of queries. The bulk condition data (Payloads) are stored as unstructured binary data, allowing the storage in a single table with a common layout for all of the condition data types. In this paper, we describe the full architecture of the system, including the services implemented for uploading payloads and the tools for browsing the database. Furthermore, the implementation choices for the core software will be discussed.

042025
The following article is Open access

The DIRAC Interware provides a development framework and a complete set of components for building distributed computing systems. The DIRAC Data Management System (DMS) offers all the necessary tools to ensure data handling operations for small and large user communities. It supports transparent access to storage resources based on multiple technologies, and is easily expandable. The information on data files and replicas is kept in a File Catalog of which DIRAC offers a powerful and versatile implementation (DFC). Data movement can be performed using third party services including FTS3. Bulk data operations are resilient with respect to failures due to the use of the Request Management System (RMS) that keeps track of ongoing tasks.

In this contribution we will present an overview of the DIRAC DMS capabilities and its connection with other DIRAC subsystems such as the Transformation System. This paper also focuses on the DIRAC File Catalog, for which a lot of new developments have been carried out, so that LHCb could migrate its replica catalog from the LCG File Catalog to the DFC. Finally, we will present how LHCb achieves a dataset federation without the need of an extra infrastructure.

042026
The following article is Open access

This paper presents an algorithm providing recommendations for optimizing the LHCb data storage. The LHCb data storage system is a hybrid system. All datasets are kept as archives on magnetic tapes. The most popular datasets are kept on disks. The algorithm takes the dataset usage history and metadata (size, type, configuration etc.) to generate a recommendation report. This article presents how we use machine learning algorithms to predict future data popularity. Using these predictions it is possible to estimate which datasets should be removed from disk. We use regression algorithms and time series analysis to find the optimal number of replicas for datasets that are kept on disk. Based on the data popularity and the number of replicas optimization, the algorithm minimizes a loss function to find the optimal data distribution. The loss function represents all requirements for data distribution in the data storage system. We demonstrate how our algorithm helps to save disk space and to reduce waiting times for jobs using this data.
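
A sketch of the popularity-prediction step under stated assumptions (scikit-learn, invented features built from the usage history; the actual features, algorithms and loss function are specific to the LHCb study):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # One row per dataset: accesses last week, accesses last month,
    # size in TB, age in weeks (all values invented).
    X = np.array([[120, 400, 2.5, 30],
                  [  0,   3, 0.8, 90],
                  [ 45,  60, 1.2, 10]])
    y = np.array([100.0, 1.0, 70.0])  # accesses in the following period

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, y)

    # Datasets with low predicted popularity become candidates for
    # removal from disk (the archival tape copy always remains).
    to_demote = model.predict(X) < 5.0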

042027
The following article is Open access

Ceph-based storage solutions have become increasingly popular within the HEP/NP community over the last few years. In the current state of the Ceph project, both the object storage and block storage (RBD) layers are production ready on a large scale, and the Ceph file system storage layer (CephFS) is rapidly getting to that state as well.

This contribution contains a thorough review of various functionality, performance and stability tests performed with all three (object storage, block storage and file system) levels of Ceph, using the RACF computing resources in 2012-2014 on various hardware platforms and with different networking solutions (10/40 GbE and IPoIB/4X FDR InfiniBand based). We also report the status of commissioning a large-scale (1 PB of usable capacity, 4k HDDs behind the RAID arrays by design) Ceph-based object storage system provided with Amazon S3 compliant RadosGW interfaces deployed in RACF, as well as performance results obtained while testing the RBD and CephFS storage layers of our Ceph clusters.
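
Because RadosGW speaks the S3 protocol, standard S3 clients work unchanged; a short sketch with boto3, where the endpoint and credentials are placeholders:

    import boto3

    # RadosGW endpoint and credentials are site-specific assumptions.
    s3 = boto3.client('s3',
                      endpoint_url='http://radosgw.example.org:8080',
                      aws_access_key_id='ACCESS_KEY',
                      aws_secret_access_key='SECRET_KEY')

    s3.create_bucket(Bucket='test-bucket')
    s3.put_object(Bucket='test-bucket', Key='hello.txt', Body=b'hello ceph')
    obj = s3.get_object(Bucket='test-bucket', Key='hello.txt')
    print(obj['Body'].read())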

042028
The following article is Open access

The Florida CMS Tier2 center has been using the Lustre filesystem for its data storage backend system since 2004. Recently, the data access pattern at our site has changed greatly due to various new access methods that include file transfers through the GridFTP servers, read access from the worker nodes, and remote read access through the xrootd servers. In order to optimize file access performance, we have to consider all the possible access patterns, and each pattern needs to be studied separately. In this presentation, we report on our work to optimize file access with the Lustre filesystem at the Florida CMS T2 using an approach based on analyzing these access patterns.

042029
The following article is Open access

We report on the status of the data preservation project at DESY for the HERA experiments and present the latest design of the storage, which is a central element for bit-preservation. The HEP experiments based at the HERA accelerator at DESY collected large and unique datasets during the period from 1992 to 2007. As part of the ongoing DPHEP data preservation efforts at DESY, these datasets must be transferred into storage systems that keep the data available for ongoing studies and guarantee safe long-term access. To achieve a high level of reliability, we use the dCache distributed storage solution and make use of its replication capabilities and tape interfaces. We also investigate a recently introduced Small File Service that allows for the fully automatic creation of tape-friendly container files.

042030
The following article is Open access

AMGA (ARDA Metadata Grid Application) is a grid metadata catalogue system that has been developed as a component of the EU FP7 EMI consortium based on the requirements of the HEP (High-Energy Physics) and biomedical user communities. Currently, AMGA is exploited to manage the metadata in the gBasf2 framework at the Belle II experiment, one of the largest particle physics experiments in the world. In this paper, we present our efforts to optimize the metadata query performance of AMGA to better support the massive MC Campaign of the Belle II experiment. Although AMGA exhibits outstanding performance for a relatively small amount of data, as the number of directories and the metadata size increase (e.g. hundreds of thousands of directories) during the MC Campaign, AMGA suffers from severe query processing performance degradation. To address this problem, we modified the query search mechanism and the database schema of AMGA, providing dramatic improvements in metadata search performance and query response time. Through our comparative performance analysis of metadata search operations, we show that AMGA can be an optimal solution for a metadata catalogue in a large-scale scientific experimental framework.

042031
The following article is Open access

The STAR online computing infrastructure has become an intensive dynamic system used for first-hand data collection and analysis, resulting in a dense collection of data output. As we have transitioned to our current state, inefficient, limited storage systems have become an impediment to fast feedback to online shift crews. A centrally accessible, scalable and redundant distributed storage system had become a necessity in this environment. OpenStack Swift Object Storage and Ceph Object Storage are two promising technologies whose community use and development have led to success elsewhere. In this contribution, OpenStack Swift and Ceph have been put to the test with single and parallel I/O tests, emulating real-world scenarios for data processing and workflows. The Ceph file system storage, offering a POSIX-compliant file system mounted similarly to an NFS share, was of particular interest as it aligned with our requirements and was retained as our solution. I/O performance tests were run against the Ceph POSIX file system and have presented surprising results indicating true potential for fast I/O and reliability. STAR's online compute farm has historically been used for job submission and first-hand data analysis. Reusing the online compute farm to maintain a storage cluster alongside job submission will be an efficient use of the current infrastructure.

042032
The following article is Open access

In this article we summarize several years of experience with database replication technologies used at WLCG, and we provide a short review of the available Oracle technologies and their key characteristics. One of the notable changes and improvements in this area in the recent past has been the introduction of Oracle GoldenGate as a replacement for Oracle Streams. We report in this article on the preparation and later upgrades for remote replication done in collaboration with ATLAS and Tier 1 database administrators, including the experience from running Oracle GoldenGate in production. Moreover, we report on another key technology in this area: Oracle Active Data Guard, which has been adopted in several of the mission-critical use cases for database replication between the online and offline databases of the LHC experiments.

042033
The following article is Open access

This paper provides an overview of an integrated program of work underway within the ATLAS experiment to optimise I/O performance for large-scale physics data analysis in a range of deployment environments. It proceeds to examine in greater detail one component of that work, the tuning of job-level I/O parameters in response to changes to the ATLAS event data model, and considers the implications of such tuning for a number of measures of I/O performance.

042034
The following article is Open access

CERN's accelerator complex generates a very large amount of data. A large volume of heterogeneous data is constantly generated from control equipment and monitoring agents. These data must be stored and analysed. Over the decades, CERN's research and engineering teams have applied different approaches, techniques and technologies for this purpose. This situation has minimised the necessary collaboration and, more relevantly, the cross-domain data analytics. These two factors are essential to unlock hidden insights and correlations between the underlying processes, which enable better and more efficient daily accelerator operations and more informed decisions. The proposed Big Data Analytics as a Service Infrastructure aims to: (1) integrate the existing developments; (2) centralise and standardise the complex data analytics needs for CERN's research and engineering community; (3) deliver real-time, batch data analytics and information discovery capabilities; and (4) provide transparent access and Extract, Transform and Load (ETL) mechanisms to the various mission-critical existing data repositories. This paper presents the desired objectives and properties resulting from the analysis of CERN's data analytics requirements; the main technological, collaborative and educational challenges; and potential solutions.

042035
The following article is Open access

CERN IT DSS operates the main storage resources for data taking and physics analysis, mainly via three systems: AFS, CASTOR and EOS. The total usable space available on disk for users is about 100 PB (with relative ratios 1:20:120). EOS actively uses the two CERN Tier-0 centres (Meyrin and Wigner) with a 50:50 ratio. IT DSS also provides sizeable on-demand resources for IT services, most notably OpenStack and NFS-based clients: this is provided by a Ceph infrastructure (3 PB) and a few proprietary servers (NetApp). We will describe our operational experience and recent changes to these systems, with special emphasis on the present usage for LHC data taking and the convergence to commodity hardware (nodes with 200 TB each, with optional SSDs) shared across all services. We also describe our experience in coupling commodity and home-grown solutions (e.g. CERNBox integration in EOS, Ceph disk pools for AFS, CASTOR and NFS) and finally the future evolution of these systems for WLCG and beyond.

042036
The following article is Open access

Nowadays, many database systems are available, but they may not be optimized for storing time series data. Monitoring DIRAC jobs would be better done using a database optimised for storing time series data. So far this was done using a MySQL database, which is not well suited for such an application. Therefore alternatives have been investigated. Choosing an appropriate database for storing huge amounts of time series data is not trivial, as one must take into account different aspects such as manageability, scalability and extensibility. We compared the performance of the Elasticsearch, OpenTSDB (based on HBase) and InfluxDB NoSQL databases, using the same set of machines and the same data. We also evaluated the effort required for maintaining them. Using the LHCb Workload Management System (WMS), based on DIRAC, as a use case, we set up a new monitoring system in parallel with the current MySQL system and stored the same data into the databases under test. We evaluated the Grafana (for OpenTSDB) and Kibana (for Elasticsearch) metrics and graph editors for creating dashboards, in order to have a clear picture of the usability of each candidate. In this paper we present the results of this study and the performance of the selected technology. We also give an outlook of other potential applications of NoSQL databases within the DIRAC project.
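
As an illustration of the write path being compared, indexing a single time-series point into Elasticsearch with the official Python client; the index name and document fields are invented:

    from datetime import datetime
    from elasticsearch import Elasticsearch

    es = Elasticsearch([{'host': 'es.example.cern.ch', 'port': 9200}])

    # One monitoring point for a WMS job-state time series.
    doc = {'timestamp': datetime.utcnow().isoformat(),
           'site': 'LCG.CERN.ch',
           'status': 'Running',
           'jobs': 1342}
    es.index(index='wms-history', body=doc)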

042037
The following article is Open access

X.509, the dominant identity system from grid computing, has proved unpopular with many user communities. More popular alternatives generally assume the user is interacting via their web browser. Such alternatives allow a user to authenticate with many services using the same credentials (user-name and password). They also allow users from different organisations to form collaborations quickly and simply.

Scientists generally require that their custom analysis software has direct access to the data. Such direct access is not currently supported by alternatives to X.509, as they require the use of a web-browser.

Various approaches to solve this issue are being investigated as part of the Large Scale Data Management and Analysis (LSDMA) project, a German funded national R&D project. These involve dynamic credential translation (creating an X.509 credential) to allow backwards compatibility in addition to direct SAML- and OpenID Connect-based authentication.

We present a summary of the current state of art and the current status of the federated identity work funded by the LSDMA project along with the future road map.

042038
The following article is Open access

The availability of cheap, easy-to-use sync-and-share cloud services has split the scientific storage world into the traditional big data management systems and the very attractive sync-and-share services. With the former, the location of data is well understood while the latter is mostly operated in the Cloud, resulting in a rather complex legal situation.

Beside legal issues, those two worlds have little overlap in user authentication and access protocols. While traditional storage technologies, popular in HEP, are based on X.509, cloud services and sync-and-share software technologies are generally based on username/password authentication or on mechanisms like SAML or OpenID Connect. Similarly, the data access models offered by the two are somewhat different, with sync-and-share services often using proprietary protocols.

As both approaches are very attractive, dCache.org developed a hybrid system, providing the best of both worlds. To avoid reinventing the wheel, dCache.org decided to embed another Open Source project: OwnCloud. This offers the required modern access capabilities but does not support the managed data functionality needed for large capacity data storage.

With this hybrid system, scientists can share files and synchronize their data with laptops or mobile devices as easily as with any other cloud storage service. On top of this, the same data can be accessed via established mechanisms, like GridFTP to serve the Globus Transfer Service or the WLCG FTS3 tool, or the data can be made available to worker nodes or HPC applications via a mounted filesystem. As dCache provides a flexible authentication module, the same user can access their storage via different authentication mechanisms, e.g. X.509 and SAML. Additionally, users can specify the desired quality of service or trigger media transitions as necessary, thus tuning data access latency to the planned access profile. Such features are a natural consequence of using dCache.

We will describe the design of the hybrid dCache/OwnCloud system, report on several months of operations experience running it at DESY, and elucidate the future road-map.

042039
The following article is Open access

Many experiments in the HEP and Astrophysics communities generate large extremely valuable datasets, which need to be efficiently cataloged and recorded to archival storage. These datasets, both new and legacy, are often structured in a manner that is not conducive to storage and cataloging with modern data handling systems and large file archive facilities. In this paper we discuss in detail how we have created a robust toolset and simple portal into the Fermilab archive facilities, which allows for scientific data to be quickly imported, organized and retrieved from the multi-petabyte facility.

In particular we discuss how the data from the Sudbury Neutrino Observatory laboratory (SNOLAB) for the COUPP dark matter detector was aggregated, cataloged, archived and re-organized to permit it to be retrieved and analyzed using modern distributed computing resources both at Fermilab and on the Open Science Grid. We pay particular attention to the methods that were employed to uniquify the namespaces for the data, derive metadata for the over 460,000 image series taken by the COUPP experiment, and what was required to map that information into coherent datasets that could be stored and retrieved using the large-scale archive systems.

We describe the data transfer and cataloging engines that are used for data importation and how these engines have been setup to import data from the data acquisition systems of ongoing experiments at non-Fermilab remote sites including the Laboratori Nazionali del Gran Sasso and the Ash River Laboratory in Orr, Minnesota. We also describe how large University computing sites around the world are using the system to store and retrieve large volumes of simulation and experiment data for physics analysis.

042040
The following article is Open access

The ATLAS Metadata Interface (AMI) is now a mature application. Over the years, the number of users and the number of provided functions has dramatically increased. It is necessary to adapt the hardware infrastructure in a seamless way so that the quality of service remains high. We describe the evolution of AMI from its beginnings, when it was served by a single MySQL backend database server, to its current state: a cluster of virtual machines at the French Tier-1, an Oracle database at Lyon with complementary replication to the Oracle DB at CERN, and an AMI backup server.

042041
The following article is Open access

This paper describes the recent improvement of the AMGA (ARDA Metadata Grid Application) python client library for the Belle II Experiment. We were drawn to the action items related to library improvement after in-depth discussions with the developers of the Belle II distributed computing system. The improvements include client-side metadata federation support in python, DIRAC SSL library support, as well as API refinement for synchronous operation. Some of the improvements have already been applied to the AMGA python client library as bundled with the Belle II distributed computing software. The recent mass Monte-Carlo (MC) production campaign shows that the AMGA python client library is reliably stable.

042042
The following article is Open access

EOS is an open source distributed disk storage system in production since 2011 at CERN. Development focus has been on low-latency analysis use cases for LHC and non-LHC experiments and on life-cycle management using JBOD hardware for multi-PB storage installations. The EOS design implies a split of hot and cold storage and introduced a change to the traditional HSM-based workflows at CERN.

The 2015 deployment brings storage at CERN to a new scale and is expected to exceed 100 PB of disk storage in a distributed environment using tens of thousands of (heterogeneous) hard drives. EOS has brought major improvements to CERN compared to past storage solutions by allowing quick changes in the quality of service of storage pools. This allows the data centre to quickly meet the changing performance and reliability requirements of the LHC experiments with minimal data movement and dynamic reconfiguration. For example, the software stack has met the specific needs of the dual computing centre set-up required by CERN and allowed the fast design of new workflows accommodating the separation of long-term tape archive and disk storage required for the LHC Run II.

This paper will give a high-level state of the art overview of EOS with respect to Run II, introduce new tools and use cases and set the roadmap for the next storage solutions to come.

042043
The following article is Open access

The EOS[1] storage software was designed to cover CERN disk-only storage use cases in the medium term, trading scalability against latency. To cover and prepare for long-term requirements, the CERN IT data and storage services group (DSS) is actively conducting R&D and making open source contributions to experiment with a next-generation storage software based on CEPH[3] and Ethernet-enabled disk drives. CEPH provides a scale-out object storage system, RADOS, and additionally various optional high-level services such as an S3 gateway, RADOS block devices and a POSIX-compliant file system, CephFS. The acquisition of CEPH by Red Hat underlines the promising role of CEPH as the open source storage platform of the future. CERN IT is running a CEPH service in the context of OpenStack on a moderate scale of 1 PB of replicated storage. Building a 100+ PB storage system based on CEPH will require software and hardware tuning. It is of capital importance to demonstrate the feasibility and possibly iron out bottlenecks and blocking issues beforehand. The main idea behind this R&D is to leverage and contribute to existing building blocks in the CEPH storage stack and implement a few CERN-specific requirements in a thin, customisable storage layer. A second research topic is the integration of Ethernet-enabled disks. This paper introduces various ongoing open source developments, their status and applicability.

042044
The following article is Open access

The CMS experiment at the Large Hadron Collider (LHC) at CERN, Geneva, Switzerland, is made of many detectors which in total sum up to more than 75 million channels. The detector monitoring information of all channels (temperatures, voltages, etc.), detector quality, beam conditions, and other data crucial for the reconstruction and analysis of the experiment's recorded collision events is stored in an online database. A subset of that information, the "conditions data", is copied out to another database from where it is used in the offline reconstruction and analysis processing, together with alignment data for the various detectors. Conditions data sets are accessed by a tag and an interval of validity through the offline reconstruction program CMSSW, written in C++. About 400 different types of calibration and alignment exist for the various CMS sub-detectors.

With the CMS software framework moving to a multi-threaded execution model, and profiting from the experience gained during the data taking in Run-1, a major re-design of the CMS conditions software was done. During this work, a study was done to look into possible gains by using multi-threaded handling of the conditions. In this paper, we present the results of that study.

042045
The following article is Open access

The ATLAS EventIndex System, developed for use in LHC Run 2, is designed to index every processed event in ATLAS, replacing the TAG System used in Run 1. Its storage infrastructure, based on the Hadoop open-source software framework, necessitates revamping how information in this system relates to other ATLAS systems. It will store more indexes, since the fundamental mechanisms for retrieving them are better integrated into all stages of data processing, allowing more events from later stages of processing to be indexed than was possible with the previous system. Connections with other systems (conditions database, monitoring) are critical to assess dataset completeness, identify data duplication, and check data integrity, and also enhance access to information in the EventIndex by user and system interfaces. This paper gives an overview of the ATLAS systems involved and the relevant metadata, and describes the technologies we are deploying to complete these connections.

042046
The following article is Open access

The ATLAS EventIndex contains records of all events processed by ATLAS, in all processing stages. These records include the references to the files containing each event (the GUID of the file) and the internal pointer to each event in the file. This information is collected by all jobs that run at Tier-0 or on the Grid and process ATLAS events. Each job produces a snippet of information for each permanent output file. This information is packed and transferred to a central broker at CERN using an ActiveMQ messaging system, and then is unpacked, sorted and reformatted in order to be stored and catalogued into a central Hadoop server. This contribution describes in detail the Producer/Consumer architecture to convey this information from the running jobs through the messaging system to the Hadoop server.
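
The producer side of this chain can be sketched in a few lines with the stomp.py client; the broker address, queue name and snippet layout are assumptions, not the actual ATLAS configuration:

    import json
    import stomp  # STOMP client, able to talk to ActiveMQ brokers

    conn = stomp.Connection([('mq.example.cern.ch', 61613)])
    conn.connect('producer', 'secret', wait=True)

    # One snippet per permanent output file: the file GUID plus the
    # identifiers of the events it contains (values invented).
    snippet = {'guid': '00000000-0000-0000-0000-000000000000',
               'events': [[266904, 1234567]]}
    conn.send(destination='/queue/atlas.eventindex',
              body=json.dumps(snippet))
    conn.disconnect()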

042047
The following article is Open access

Data management constitutes one of the major challenges that a geographically-distributed e-Infrastructure has to face, especially when remote data access is involved. We discuss an integrated solution which enables transparent and efficient access to on-line and near-line data through high-latency networks. The solution is based on the joint use of the General Parallel File System (GPFS) and of the Tivoli Storage Manager (TSM). Both products, developed by IBM, are well known and extensively used in the HEP computing community. Owing to a new feature introduced in GPFS 3.5, so-called Active File Management (AFM), the definition of a single, geographically-distributed namespace, characterised by automated data flow management between different locations, becomes possible. As a practical example, we present the implementation of AFM-based remote data access between two data centres located in Bologna and Rome, demonstrating the validity of the solution for the use case of the AMS experiment, an astro-particle experiment supported by the INFN CNAF data centre with large disk space requirements (more than 1.5 PB).

042048
The following article is Open access

We introduce a service for dCache that allows automatic and transparent packing and extraction of file sets into and from large container files. This is motivated by the adoption of dCache by new communities that often work with rather small data files, which cause large time overheads on reads and writes by the underlying tape systems. Our service can be attached to existing dCache instances, and we show that it is able to improve average access times for tape-resident data.

042049
The following article is Open access

Archiving data to tape is a critical operation for any storage system, especially for the EOS system at CERN, which holds production data for all major LHC experiments. Each collaboration has an allocated quota it can use at any given time; therefore, a mechanism for archiving "stale" data is needed so that storage space is reclaimed for online analysis operations. The archiving tool that we propose for EOS aims to provide a robust client interface for moving data between EOS and CASTOR (the tape-backed storage system) while enforcing best practices when it comes to data integrity and verification.

All data transfers are done using a third-party copy mechanism which ensures point-to-point communication between the source and destination, thus providing maximum aggregate throughput. Using the ZMQ message-passing paradigm and a process-based approach enabled us to achieve optimal utilisation of the resources and a stateless architecture which can easily be tuned during operation. The modular design and the implementation in a high-level language like Python have enabled us to easily extend the code base to address new demands, like offering full and incremental backup capabilities.
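
The process-based, stateless design maps naturally onto ZMQ socket patterns; a toy PUSH/PULL sketch with pyzmq (invented payload, not the archiver's actual protocol):

    import zmq

    ctx = zmq.Context()

    # Dispatcher side: pushes one third-party-copy task per file.
    sender = ctx.socket(zmq.PUSH)
    sender.bind('tcp://127.0.0.1:5557')

    # Worker side: pulls tasks and performs transfers; workers can be
    # added or removed freely since no transfer state lives in them.
    receiver = ctx.socket(zmq.PULL)
    receiver.connect('tcp://127.0.0.1:5557')

    sender.send_json({'src': 'root://eos.example.org//eos/file',
                      'dst': 'root://castor.example.org//castor/file'})
    print(receiver.recv_json())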

042050
The following article is Open access

With the restart of the LHC in 2015, the growth of the CMS Conditions dataset will continue; the need for consistent and highly available access to the Conditions therefore makes a strong case for revisiting different aspects of the current data storage solutions.

We present a study of alternative data storage backends for the Conditions Databases, evaluating some of the most popular NoSQL databases to support a key-value representation of the CMS Conditions. The definition of the database infrastructure is based on the need to store the conditions as BLOBs. Because of this, each condition can reach a size that may require special treatment (splitting) in these NoSQL databases. As big binary objects may be problematic in several database systems, and also to establish an accurate baseline, a testing framework extension was implemented to measure the characteristics of the handling of arbitrary binary data in these databases. Based on the evaluation, prototypes of a document store, a column-oriented store and a plain key-value store were deployed. An adaptation layer to access the backends in the CMS Offline software was developed to provide transparent support for these NoSQL databases in the CMS context. Additional data modelling approaches and considerations in the software layer, deployment and automatisation of the databases are also covered in the research. In this paper we present the results of the evaluation as well as a performance comparison of the prototypes studied.
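
The splitting issue can be made concrete with a small sketch: a payload BLOB larger than the backend's comfortable value size is stored as numbered chunks plus a descriptor (pure Python; the key layout and size limit are invented):

    CHUNK = 1024 * 1024  # assumed per-value size limit of the backend

    def split_blob(key, blob, store):
        """Store a large payload as key:0..key:n-1 plus a descriptor."""
        n = (len(blob) + CHUNK - 1) // CHUNK
        for i in range(n):
            store['%s:%d' % (key, i)] = blob[i * CHUNK:(i + 1) * CHUNK]
        store[key] = {'chunks': n, 'size': len(blob)}

    def read_blob(key, store):
        desc = store[key]
        return b''.join(store['%s:%d' % (key, i)]
                        for i in range(desc['chunks']))

    store = {}  # stand-in for the key-value backend
    payload = b'\x00' * (3 * CHUNK + 17)
    split_blob('conditions/payload/abc123', payload, store)
    assert read_blob('conditions/payload/abc123', store) == payload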

042051
The following article is Open access

, , , and

The state of the art in Grid-style data management is to achieve increased resilience of data via multiple complete replicas of data files across multiple storage endpoints. While this is effective, it is not the most space-efficient approach to resilience, especially when the reliability of individual storage endpoints is sufficiently high that only a few will be inactive at any point in time. We report on work performed as part of GridPP[1], extending the Dirac File Catalogue and file management interface to allow the placement of erasure-coded files: each file is distributed as N identically-sized chunks of data striped across a vector of storage endpoints, encoded such that any M chunks can be lost and the original file still reconstructed. The tools developed are transparent to the user and, as well as allowing upload and download of data to and from Grid storage, also provide the possibility of parallelising access across all of the distributed chunks at once, improving data transfer and IO performance. We expect this approach to be of most interest to smaller VOs, which have tighter bounds on the storage available to them, but larger (WLCG) VOs may also be interested as their total data volume grows during Run 2. We provide an analysis of the costs and benefits of the approach, along with future development and implementation plans in this area. At present, the overhead of multiple file transfers is the largest obstacle to the competitiveness of this approach.
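
A highly simplified illustration of the encoding idea, for the special case M = 1: split a file into k equal (zero-padded) chunks plus one XOR parity chunk, so that any single lost chunk can be rebuilt from the others. A production deployment would use a general erasure code tolerating M losses; this sketch only shows the principle:

    def stripe(data: bytes, k: int):
        """Split data into k zero-padded chunks plus one XOR parity chunk."""
        size = -(-len(data) // k)  # ceiling division
        chunks = [data[i*size:(i+1)*size].ljust(size, b"\0") for i in range(k)]
        parity = bytearray(size)
        for c in chunks:
            for j, byte in enumerate(c):
                parity[j] ^= byte
        return chunks + [bytes(parity)]

    def rebuild(pieces, missing):
        """Recover the piece at index `missing` by XOR-ing all the others."""
        size = len(next(p for p in pieces if p is not None))
        out = bytearray(size)
        for i, p in enumerate(pieces):
            if i != missing:
                for j, byte in enumerate(p):
                    out[j] ^= byte
        return bytes(out)

    pieces = stripe(b"some file payload", k=4)  # 4 data chunks + 1 parity
    lost = pieces[2]
    pieces[2] = None                            # simulate a lost endpoint
    assert rebuild(pieces, missing=2) == lost

Note that reads can fetch all surviving chunks in parallel, which is the source of the IO performance gain mentioned above.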

042052
The following article is Open access

, , , , and

The Object Store model has quickly become the basis of most commercially successful mass storage infrastructure, backing so-called "Cloud" storage such as Amazon S3, but also underlying the implementation of most parallel distributed storage systems. Many of the assumptions in Object Store design are similar, but not identical, to concepts in the design of Grid Storage Elements, although the requirement for "POSIX-like" filesystem structures on top of SEs makes the disjunction seem larger. As modern Object Stores provide many features that most Grid SEs do not (block-level striping, parallel access, automatic file repair, etc.), it is of interest to see how easily we can provide interfaces to typical Object Stores via plugins and shims for Grid tools, and how well experiments can adapt their data models to them. We present an evaluation of, and first-deployment experience with, Xrootd-Ceph interfaces for direct object-store access, as part of an initiative within GridPP[1] hosted at RAL. Additionally, we discuss the trade-offs and experience of developing plugins for the currently popular Ceph parallel distributed filesystem for the GFAL2 access layer, at Glasgow.
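
From the user's side, a protocol plugin is invisible: GFAL2 selects it from the URL scheme alone. The sketch below shows what a copy through the GFAL2 Python bindings looks like once such a plugin is in place, assuming gfal2-python is installed; the URLs and parameter values are placeholders:

    import gfal2

    ctx = gfal2.creat_context()  # note the historical spelling

    params = ctx.transfer_parameters()
    params.overwrite = True
    params.timeout = 300

    # The plugin used is chosen from the URL scheme, so an object-store
    # back-end is transparent to this client code.
    ctx.filecopy(params,
                 "root://ceph-gw.example.org//store/file1",
                 "file:///tmp/file1")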

042053
The following article is Open access

, , , , , , , and

Data-taking and analysis infrastructures in HEP (High Energy Physics) have evolved over many years into a well-understood problem domain. In contrast to HEP, third-generation synchrotron light sources and existing and upcoming free-electron lasers are confronted with an explosion in data rates, driven primarily by recent developments in 2D pixel array detectors. The next generation of detectors will produce data at rates upwards of 50 GB per second. At synchrotrons, users traditionally carried their data away on portable media after data taking; this clearly will not scale.

We present first experiences with our new architecture and underlying services, based on results from the resumption of data taking in April 2015. Technology choices were made over a period of twelve months. The work involved a close collaboration between central IT, beamline controls, and beamline support staff. In addition, a collaboration was established between DESY IT and IBM to draw on industrial research and development experience and skills.

Our approach integrates HPC technologies for storage systems and protocols. In particular, our solution uses a single file-system instance with multiple-protocol access, operating within a single namespace.

042054
The following article is Open access

, , , and

In 2013, CERN IT evaluated and then deployed a petabyte-scale Ceph cluster to support OpenStack use-cases in production. With more than a year of smooth operation now behind us, we present our experience and tuning best practices. Beyond the cloud storage use-cases, we have been exploring Ceph-based services to satisfy the growing storage requirements during and after Run 2. First, we have developed a Ceph back-end for CASTOR, allowing this service to deploy thin disk server nodes which act as gateways to Ceph; this feature marries the strong data archival and cataloguing features of CASTOR with the resilient and high-performance Ceph subsystem for disk. Second, we have developed RADOSFS, a lightweight storage API which builds a POSIX-like filesystem on top of the Ceph object layer. When combined with Xrootd, RADOSFS can offer a scalable object interface compatible with our HEP data processing applications. Lastly, the same object layer is being used to build a scalable and inexpensive NFS service for several user communities.
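
All three services sit on the same RADOS object layer, which is directly scriptable. A minimal sketch with the python-rados bindings, assuming a reachable cluster; the pool and object names are illustrative:

    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("data-pool")    # assumed pool name
        ioctx.write_full("file.0001", b"payload")  # replace object contents
        print(ioctx.read("file.0001"))             # read the object back
        ioctx.close()
    finally:
        cluster.shutdown()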

042055
The following article is Open access

, , and

Storage capacity at CMS Tier-1 and Tier-2 sites reached over 100 Petabytes in 2014, and will be substantially increased during Run 2 data taking. The allocation of storage for individual users' analysis data, which is not accounted as centrally managed storage space, will be increased to up to 40%. For comprehensive tracking and monitoring of storage utilisation across all participating sites, CMS developed a space monitoring system which provides a central view of the geographically dispersed, heterogeneous storage systems. The first prototype was deployed at pilot sites in summer 2014, and has been substantially reworked since then. In this paper we discuss the functionality of the system and our experience of deploying and operating it at full CMS scale.

042056
The following article is Open access

, , , , , , , , , et al

The CMS experiment at the LHC relies on 7 Tier-1 centres of the WLCG to perform the majority of its bulk processing activity and to archive its data. During the first run of the LHC, these two functions were tightly coupled, as each Tier-1 was constrained to process only the data archived on its hierarchical storage. This lack of flexibility in the assignment of processing workflows occasionally resulted in uneven resource utilisation and increased latency in the delivery of results to the physics community.

The long shutdown of the LHC in 2013-2014 was an opportunity to revisit this mode of operation, disentangling the processing and archive functionalities of the Tier-1 centres. The storage services at the Tier-1s were redeployed, breaking the traditional hierarchical model: each site now provides a large disk storage to host input and output data for processing, and an independent tape storage used exclusively for archiving. Movement of data between the tape and disk endpoints is not automated, but triggered externally through the WLCG transfer management systems.

With this new setup, CMS operations actively controls at any time which data is available on disk for processing and which data should be sent to archive. Thanks to the high-bandwidth connectivity guaranteed by the LHCOPN, input data can be freely transferred between disk endpoints as needed to take advantage of free CPU, turning the Tier-1s into a large pool of shared resources. Output data can be validated before being archived permanently, and temporary data formats can be produced without wasting valuable tape resources. Finally, the data hosted on disk at Tier-1s can now also be made available for user analysis, since there is no longer a risk of triggering chaotic staging from tape.

In this contribution, we describe the technical solutions adopted for the new disk and tape endpoints at the sites, and we report on the commissioning and scale testing of the service. We detail the procedures implemented by CMS computing operations to actively manage data on disk at Tier-1 sites, and we give examples of the benefits brought to CMS workflows by the additional flexibility of the new system.

042057
The following article is Open access

, , , , and

The RACF (RHIC-ATLAS Computing Facility) has operated a large, multi-purpose dedicated computing facility since the mid-1990s, serving a worldwide, geographically diverse scientific community that is a major contributor to various HEPN projects. A central component of the RACF is the Linux-based worker node cluster that is used for both computing and data storage purposes. It currently has nearly 50,000 computing cores and over 23 PB of storage capacity distributed over 12,000+ (non-SSD) disk drives. The majority of these disk drives provide a cost-effective solution for dCache/XRootD-managed storage, and a key concern is the reliability of this solution over the lifetime of the hardware, particularly as the number of disk drives and the storage capacity of individual drives grow. We report initial results of a long-term study to measure the lifetime PB read/written by disk drives in the worker node cluster. We discuss the historical disk drive mortality rate and how disk drive manufacturers' published MPTF (Mean PB to Failure) data correlate with our results. The results help the RACF understand the productivity and reliability of its storage solutions and have implications for other highly available storage systems (NFS, GPFS, CVMFS, etc.) with large I/O requirements.
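
As a toy illustration of the metric under study, MPTF can be estimated from per-drive lifetime traffic counters in direct analogy to MTBF; the records below are fabricated placeholders, not RACF measurements:

    # Hypothetical per-drive records: (PB read+written over lifetime, failed?)
    drives = [
        (3.2, False), (4.1, True), (2.7, False), (5.0, True), (3.8, False),
    ]

    total_pb = sum(pb for pb, _ in drives)
    failures = sum(1 for _, failed in drives if failed)

    # Analogue of MTBF: total traffic observed divided by failures seen.
    mptf = total_pb / failures if failures else float("inf")
    print(f"observed MPTF: {mptf:.1f} PB per failure")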

042058
The following article is Open access

, , , , , , , , , et al

JLDG is a data-grid for the lattice QCD (LQCD) community in Japan. Several large research groups in Japan have been working on lattice QCD simulations using supercomputers distributed over distant sites. The JLDG provides such collaborations with an efficient method of data management and sharing.

File servers installed at nine sites are connected to the NII SINET VPN and are bound into a single file system with Gfarm. The file system looks the same from any site, so users can run analyses on a supercomputer at one site using data generated and stored in the JLDG at a different site, as the sketch below illustrates.
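
Because the namespace is shared, a file registered at one site is visible everywhere under the same Gfarm path. A minimal sketch using the standard Gfarm command-line tools, with purely illustrative paths:

    import subprocess

    # Register a locally generated gauge configuration into the shared
    # Gfarm namespace (both paths are illustrative placeholders).
    subprocess.run(["gfreg", "conf_0001.dat", "/lqcd/groupA/conf_0001.dat"],
                   check=True)

    # From any other site the same path is visible; list it to confirm.
    subprocess.run(["gfls", "-l", "/lqcd/groupA"], check=True)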

We present a brief description of the hardware and software of the JLDG, including a recently developed subsystem for cooperating with the HPCI shared storage, and report on the performance and usage statistics of the JLDG. As of April 2015, 15 research groups (61 users) store 4.7 PB of research data (including replicas) in a total of 68 million files. Works that used the JLDG have produced 98 publications. The large number of publications and the recent rapid increase in disk usage convince us that the JLDG has grown into a useful infrastructure for the LQCD community in Japan.