WLCG Transfers Dashboard: a Unified Monitoring Tool for Heterogeneous Data Transfers

The Worldwide LHC Computing Grid (WLCG) provides resources for the four main virtual organizations. Along with data processing, data distribution is the key computing activity on the WLCG infrastructure. The scale of this activity is very large: the ATLAS virtual organization (VO) alone generates and distributes more than 40 PB of data in 100 million files per year. Another challenge is the heterogeneity of data transfer technologies. Currently there are two main alternatives for data transfers on the WLCG: the File Transfer Service and the XRootD protocol. Each LHC VO has its own monitoring system, which is limited to the scope of that particular VO. There is a need for a global system which would provide a complete cross-VO and cross-technology picture of all WLCG data transfers. We present a unified monitoring tool, the WLCG Transfers Dashboard, where all the VOs and technologies coexist and are monitored together. The scale of the activity and the heterogeneity of the system raise a number of technical challenges. Each technology comes with its own monitoring specificities and some of the VOs use several of these technologies. This paper describes the implementation of the system, with particular focus on the design principles applied to ensure the necessary scalability and performance and to ease the integration of any new technology, including additional functionality which might be specific to that technology.


Introduction
The Worldwide LHC Computing Grid [1] (WLCG) is a global collaboration between 150 computer centres geographically distributed over the world. The LHC experiments generate 25 PB of new data every year and more than 200 PB of data are already stored. Meanwhile, the computing models of the experiments are evolving from hierarchical data models with centrally managed data pre-placement towards federated storage. Today, the WLCG provides data transfer services for the four main virtual organizations (VOs in the following) using two key technologies: File Transfer Service (FTS) [2] and XRootD [3]. Monitoring them is essential to obtain a complete picture of the data movements and to understand the on-going traffic. In this paper, the WLCG Transfers Dashboard is presented. It is a global system which provides a complete cross-VO and cross-technology picture of data transfer activity on the Grid. First, the initial architecture will be presented. Then, the evolution of the requirements will be described. Finally, the new architecture of the WLCG Transfers Dashboard, designed to meet these requirements, will be discussed.

Initial needs and architecture
The WLCG is a complex environment shared by the four LHC VOs. In the early stage of the WLCG Transfers Dashboard, the focus was on FTS only, a technology shared by ATLAS, CMS and LHCb. Although ALICE was not covered, the picture given by monitoring FTS was a good approximation of the global traffic for the other VOs. This section focuses on the initial architecture of the WLCG Transfers Dashboard, shown in Figure 1. Every FTS server has been instrumented to send a monitoring message each time a transfer starts or finishes (successfully or not). Initially, there were 13 FTS servers sending monitoring data, and this number is growing today along with the deployment of the new FTS3 service.
To handle this monitoring dataflow, a reliable, yet modular and scalable, transport layer is required. The message broker ActiveMQ [4] has been chosen for its advanced features as well as for the wide range of programming languages it supports. While allowing near real-time consumption of the stored messages, ActiveMQ can also act as a buffer in front of the database: the messages can be consumed asynchronously, which dramatically improves the reliability of the monitoring chain.
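The buffering role of the broker can be illustrated with a minimal sketch. It does not use ActiveMQ: a bounded in-memory `queue.Queue` stands in for the broker queue, and the function name `run_buffered_pipeline`, the `store` list and the message layout are hypothetical names chosen for illustration only.

```python
import queue
import threading


def run_buffered_pipeline(messages, store, maxsize=1000):
    """Push messages through a bounded buffer and drain them asynchronously.

    The queue is an in-memory stand-in for the broker; `store` is a list
    standing in for the database. Both are assumptions of this sketch.
    """
    buffer = queue.Queue(maxsize=maxsize)
    sentinel = object()  # marks the end of the message stream

    def consumer():
        # The consumer drains the buffer at its own pace, independently
        # of the rate at which the producer publishes.
        while True:
            msg = buffer.get()
            if msg is sentinel:
                break
            store.append(msg)

    worker = threading.Thread(target=consumer)
    worker.start()
    for msg in messages:
        buffer.put(msg)  # blocks only when the buffer is full
    buffer.put(sentinel)
    worker.join()
    return len(store)
```

The point of the design is that producers never write to the database directly, so a slow or temporarily unavailable consumer delays the data without losing it, as long as the buffer has capacity.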
The raw monitoring data are then consumed by a collector which persistently stores them in a relational database. To improve the performance of the web application, the raw data are aggregated into statistics with different time period granularities (10 minutes for queries requesting less than 2 days of data, 24 hours for the others).
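As an illustration of this aggregation step, the sketch below bins raw transfer records and selects the granularity from the length of the queried interval. The record fields (`timestamp`, `src`, `dst`, `bytes`) and the function names are assumptions made for the example, not the actual schema; timestamps are assumed to be timezone-aware UTC `datetime` objects.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

TEN_MINUTES = timedelta(minutes=10)
ONE_DAY = timedelta(hours=24)


def choose_bin(start, end):
    """Return 10-minute bins for queries under 2 days, 24-hour bins otherwise."""
    return TEN_MINUTES if (end - start) < timedelta(days=2) else ONE_DAY


def aggregate(transfers, bin_width):
    """Sum bytes and count files per (bin start, source, destination)."""
    width = int(bin_width.total_seconds())
    stats = defaultdict(lambda: {"bytes": 0, "files": 0})
    for t in transfers:
        ts = int(t["timestamp"].timestamp())
        # Round the timestamp down to the start of its bin (UTC-aligned).
        bin_start = datetime.fromtimestamp(ts - ts % width, tz=timezone.utc)
        key = (bin_start, t["src"], t["dst"])
        stats[key]["bytes"] += t["bytes"]
        stats[key]["files"] += 1
    return dict(stats)
```

Pre-computing both granularities means that a query over several months reads one row per day and per link instead of scanning the raw transfer records.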
All database access goes through a well-defined API providing data in two human-readable formats, JSON and XML. The web API consists of a Python application deployed on top of an Apache web server and provides a comprehensive list of possible actions (backed by queries to the database), ensuring re-usability and consistency of the data between different applications.
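A minimal sketch of how one action could expose the same rows in both formats is given below. It uses only the Python standard library; the element names (`transfers`, `transfer`), the row fields and the `render` function are hypothetical and do not reproduce the actual Dashboard API.

```python
import json
import xml.etree.ElementTree as ET


def to_json(rows):
    """Serialize a list of row dictionaries as a JSON document."""
    return json.dumps({"transfers": rows}, sort_keys=True)


def to_xml(rows):
    """Serialize the same rows as XML, one <transfer> element per row."""
    root = ET.Element("transfers")
    for row in rows:
        item = ET.SubElement(root, "transfer")
        for name, value in sorted(row.items()):
            ET.SubElement(item, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def render(rows, fmt="json"):
    """Dispatch to the serializer matching the requested format."""
    serializers = {"json": to_json, "xml": to_xml}
    if fmt not in serializers:
        raise ValueError("unsupported format: %s" % fmt)
    return serializers[fmt](rows)
```

Because both serializers consume the same rows, every client sees identical data regardless of the format it requests.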
Finally, end-users access the application through a web interface from which they can parameterize their queries and get the results in different representations such as a bar chart, a matrix or a world map. This architecture has been partially inherited from the ATLAS DDM Dashboard [5] and is generic enough to be shared by other monitoring tools (today the same code is shared between 5 applications). However, the computing model of the experiments is evolving and new technologies are joining the landscape of data transfer services, bringing new requirements.
The increase in XRootD traffic has generated two new main requirements from the data traffic monitoring perspective. On the one hand, the administrators of an XRootD federation need a specific monitoring tool [8] with technology-specific features and the finest possible information granularity, which includes local and remote data access. On the other hand, all the WLCG traffic should be correlated in a single cross-VO and cross-technology user interface. To meet these requirements, the monitoring traffic of XRootD is multicast downstream of the message broker, as shown in Figure 3. At that point, data transfer monitoring consists of three distinct instances of monitoring services: the AAA and FAX Dashboards (for the CMS and ATLAS XRootD federations, respectively) and the WLCG Transfers Dashboard for global traffic. Consequently, the XRootD traffic monitoring data are duplicated at the database level, causing potential inconsistencies at the user interface level. The following section describes a novel architecture in which a top-level user interface retrieves and aggregates data from the underlying monitoring tools.

Towards a federated dashboard
The previously described architecture is a good fit for technology-specific monitoring dashboards. However, limitations are quickly reached in a heterogeneous multi-technology environment. This is why WLCG Transfers monitoring has evolved from a single system with a central DB to well-defined technology-specific applications aggregated together in a top-level monitoring view. Figure 4 shows the new database-less architecture adopted for this top-level view. It consists of an application which makes HTTP calls to the other dashboards to retrieve their data and aggregate them. The results are then displayed in the web user interface, which is based on xbrowse [10], the Dashboard in-house MVC µ-framework, shared to a large extent with other data management monitoring applications. To improve performance, a caching layer based on Memcached [9] has been added to the web server in front of the database. There are numerous advantages to this new design. Firstly, all technologies and VOs can now be correlated, as shown in Figure 5, and EOS traffic for ATLAS and CMS is now available in the global picture (it was not included in the previous WLCG Transfers Dashboard for scalability reasons). Secondly, consistency is guaranteed by design as the raw data are no longer duplicated. Finally, the integration of other technologies which will be used in the future, such as HTTP/WebDAV Dynamic Federation [11], will be greatly simplified.
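The federated query pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the Dashboard implementation: the per-technology `fetch` callables stand in for the HTTP calls to the underlying dashboards, the `TTLCache` class is a small in-process substitute for Memcached, and all names and row fields are hypothetical.

```python
import time


class TTLCache:
    """Tiny in-process cache with expiry (a stand-in for Memcached)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires, value = entry
        if self.clock() >= expires:
            del self._data[key]  # drop the stale entry
            return None
        return value

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)


def federated_query(backends, params, cache):
    """Query every backend dashboard once and merge the rows.

    `backends` maps a technology name to a callable returning a list of
    row dictionaries; in a real deployment each callable would wrap an
    HTTP request to that dashboard's API.
    """
    key = repr(sorted(params.items()))
    cached = cache.get(key)
    if cached is not None:
        return cached
    merged = []
    for technology, fetch in backends.items():
        for row in fetch(params):
            # Tag each row so that technologies can be correlated later.
            merged.append(dict(row, technology=technology))
    cache.set(key, merged)
    return merged
```

Since the top-level view keeps no copy of the raw data, adding a technology amounts to registering one more entry in `backends`, and repeated identical queries within the cache lifetime do not reach the underlying dashboards.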

Conclusion
The WLCG Transfers Dashboard is a cross-VO and cross-technology data transfer monitoring system used daily by more than 80 distinct users. Its architecture has evolved from a single system with a central DB to a federated model combining technology-specific instances of monitoring services. The new architecture provides better scalability and performance. For example, it allows the integration of EOS data (which are generated at a peak rate of 800 Hz) within the global picture. It also provides a clean isolation of the workflows and facilitates the integration of future data transfer technologies.