Table of contents

Track 2: Data Analysis - Algorithms and Tools

042001
The following article is Open access

Electron and photon triggers covering transverse energies from 5 GeV to several TeV are essential for signal selection in a wide variety of ATLAS physics analyses to study Standard Model processes and to search for new phenomena. The ATLAS trigger system is divided into a hardware-based Level-1 trigger and a software-based high-level trigger, both of which were upgraded during the LHC shutdown in preparation for Run 2 operation. To cope with the increasing luminosity and more challenging pile-up conditions at a centre-of-mass energy of 13 TeV, the trigger selections at each level are optimised to control the rates and keep efficiencies high. To achieve this goal, multivariate analysis techniques are used. The ATLAS electron and photon triggers and their performance with Run 2 data are presented.

042002
The following article is Open access

Modified statistical homogeneity tests on weighted data samples are commonly used in high energy physics applications. We typically apply the tests to check the homogeneity of weighted and unweighted samples, e.g. Monte Carlo simulations compared to real data measurements. The asymptotic approximations of the p-values of our weighted variants of homogeneity tests are investigated by means of simulation experiments. The simulation is performed for various probability sample distributions. We show that the asymptotic characteristics of the weighted homogeneity tests are valid for the specific distribution of weights.

042003
The following article is Open access

The LHCb detector is a single-arm forward spectrometer designed for the efficient reconstruction of decays of c- and b-hadrons. LHCb has introduced a novel real-time detector alignment and calibration strategy for LHC Run II. Data collected at the start of the fill are processed in a few minutes and used to update the alignment, while the calibration constants are evaluated for each run. This is one of the key elements that allow the reconstruction quality of the software trigger in Run II, which fully includes the particle identification selection criteria, to be as good as the offline quality of Run I. This approach greatly increases the efficiency, in particular for the selection of charm and strange hadron decays. We discuss the strategy and performance of this novel approach, followed by a presentation of the recent developments implemented for the 2017 run of data taking and of the performance and reconstruction quality achieved by the LHCb experiment in LHC Run II.

042004
The following article is Open access

Finding and fitting circles from a set of points is a frequent problem in the data analysis of high-energy physics experiments. In a tracker immersed in a homogeneous magnetic field, tracks are close to perfect circles when projected onto the bending plane. In a ring-imaging Cherenkov (RICH) detector, circles of photons around the crossing point of charged particles have to be found and their radii estimated. In both cases, non-negligible background may be present that tends to complicate the pattern recognition and to bias the circle fit. In this contribution we present a robust circle fit based on a modified Riemann fit that removes or significantly reduces the effect of background points. As in the standard Riemann fit, the measured points are projected to the Riemann sphere or paraboloid, and a plane is fitted to the projected points. The fit is made robust by replacing the usual least-squares regression by a least median of squares (LMS) regression. Because of the high breakdown point of the LMS estimator, the fit is insensitive to background points. The LMS plane is used to initialize the weights of an M-estimator that refits the plane in order to suppress any remaining outliers and to obtain the final circle parameters. The method is demonstrated on three sets of artificial data: points on a circle plus a comparable number of background points; points on two overlapping circles with additional background; and points obtained by the simulation of tracks in a drift chamber with mirror points and additional background. The results show high circle finding efficiency and small contamination of the final fitted circles.
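
The steps described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes the paraboloid projection z = x² + y², approximates the LMS plane by random sampling of point triples, and uses Tukey biweight weights for the M-estimator refit. The function names, the number of trials and the tuning constant 4.685 are all assumptions made for the example.

```python
import numpy as np

def plane_to_circle(c):
    # Plane z = c0 + c1*x + c2*y on the paraboloid z = x^2 + y^2
    # corresponds to a circle with centre (c1/2, c2/2).
    a, b = c[1] / 2.0, c[2] / 2.0
    r2 = c[0] + a * a + b * b
    return a, b, np.sqrt(max(r2, 0.0))

def robust_circle_fit(x, y, n_trials=500, n_refits=10, seed=0):
    rng = np.random.default_rng(seed)
    z = x**2 + y**2                      # projection to the paraboloid
    A = np.column_stack([np.ones_like(x), x, y])

    # Approximate LMS: try planes through random triples and keep the
    # one with the smallest median of squared residuals.
    best_c, best_med = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(x), size=3, replace=False)
        try:
            c = np.linalg.solve(A[idx], z[idx])
        except np.linalg.LinAlgError:    # collinear triple, skip it
            continue
        med = np.median((z - A @ c) ** 2)
        if med < best_med:
            best_c, best_med = c, med

    # M-estimator refit (iteratively reweighted least squares with
    # Tukey biweight weights), started from the LMS plane.
    c = best_c
    for _ in range(n_refits):
        res = z - A @ c
        scale = 1.4826 * np.median(np.abs(res)) + 1e-12
        u = res / (4.685 * scale)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
        sw = np.sqrt(w)
        c, *_ = np.linalg.lstsq(A * sw[:, None], z * sw, rcond=None)
    return plane_to_circle(c)
```

Calling `robust_circle_fit(x, y)` with NumPy arrays of point coordinates returns the estimated centre and radius. The residuals used here are algebraic (distance to the plane along z), which keeps the sketch short; a production fit would likely treat measurement errors more carefully.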

042005
The following article is Open access

Graphical Processing Units (GPUs) represent one of the most sophisticated and versatile parallel computing architectures that are currently entering the High Energy Physics field. GooFit is an open source tool interfacing ROOT/RooFit to the CUDA platform on NVIDIA GPUs; it acts as an interface between the MINUIT minimization algorithm and a parallel processor, which allows a Probability Density Function to be evaluated in parallel.

In order to test the computing capabilities of GPUs with respect to traditional CPU cores, a high-statistics pseudo-experiment method has been implemented in both the ROOT/RooFit and GooFit frameworks with the purpose of estimating the local statistical significance of an already known signal. The optimized GooFit application running on GPUs provides a striking speed-up with respect to the RooFit application parallelized on multiple CPU workers by means of the PROOF-Lite tool.

This method is extended to situations in which, dealing with an unexpected signal, a global significance must be estimated. The Look-Elsewhere Effect is taken into account by means of a scanning technique in order to consider - within the same background-only fluctuation and everywhere in the relevant mass spectrum - any fluctuating peaking behavior with respect to the background model. The execution time of the fitting procedure for each MC toy can increase considerably, so the RooFit-based approach becomes so time-expensive that it may become unreliable, while GooFit is an excellent tool to reliably carry out this p-value estimation method.
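
To illustrate the idea of a scan over background-only pseudo-experiments, here is a heavily simplified sketch. It does not use RooFit or GooFit and performs no likelihood fits. Instead it assumes a binned spectrum with a known expected background and uses, as a stand-in test statistic, the largest Gaussian-approximated excess found in any sliding window of bins. All names and parameters are hypothetical.

```python
import numpy as np

def max_local_z(counts, expected, width=3):
    # Scan every window of `width` adjacent bins and return the largest
    # excess in units of sqrt(expected background).
    kernel = np.ones(width)
    n = np.convolve(counts, kernel, mode="valid")
    b = np.convolve(expected, kernel, mode="valid")
    return np.max((n - b) / np.sqrt(b))

def global_p_value(observed, expected, n_toys=20000, width=3, seed=1):
    # Fraction of background-only toys whose largest excess anywhere in
    # the spectrum is at least as large as the one seen in data.
    rng = np.random.default_rng(seed)
    t_obs = max_local_z(observed, expected, width)
    toys = rng.poisson(expected, size=(n_toys, len(expected)))
    t_toys = np.array([max_local_z(t, expected, width) for t in toys])
    return (np.sum(t_toys >= t_obs) + 1) / (n_toys + 1)
```

Because each toy is scanned over the whole spectrum, the resulting p-value accounts for the chance of a fluctuation appearing anywhere, not only at the position of the observed excess. Replacing the counting statistic with a full fit per toy is what makes the procedure expensive in practice.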

042006
The following article is Open access

We present a classification algorithm that applies the machine learning paradigm of Learning from Label Proportions (LLP) [1] to enable learning on unlabelled data. Our algorithm, Weakly Supervised Classification, receives as its only input the class proportions of batches of data but makes per-instance classification decisions matching the performance of fully supervised approaches. We apply our model to the problem of Quark-Gluon tagging and show that it is robust to underlying mismodelling of the simulated data unlike fully supervised learning.

042007
The following article is Open access

The ALPHA experiment at CERN is designed to produce and trap antihydrogen for the purpose of making a precise comparison with hydrogen. The basic technique consists of driving an antihydrogen resonance which will cause the antiatom to leave the trap and annihilate. The main background to antihydrogen detection is due to cosmic rays. When an experimental cycle extends for several minutes while the number of trapped antihydrogen atoms remains fixed, background rejection can become challenging. Machine learning methods have been employed in ALPHA for several years, leading to a dramatic reduction of the background contamination. This allowed ALPHA to perform the first laser spectroscopy experiment on antihydrogen.

042008
The following article is Open access

The bright future of particle physics at the Energy and Intensity frontiers poses exciting challenges to the scientific software community. The traditional strategies for processing and analysing data are evolving in order to (i) offer higher-level programming model approaches and (ii) exploit parallelism to cope with the ever increasing complexity and size of the datasets. This contribution describes how the ROOT framework, a cornerstone of software stacks dedicated to particle physics, is preparing to provide adequate solutions for the analysis of large amounts of scientific data on parallel architectures.

The functional approach to parallel data analysis provided with the ROOT TDataFrame interface is then characterised. The design choices behind this new interface are described and compared with other widely adopted tools such as Pandas and Apache Spark. The programming model is illustrated, highlighting the reduction of boilerplate code, the composability of the actions and data transformations, and the ability to deal with different data sources such as ROOT, JSON, CSV or databases. Details are given about how the functional approach allows transparent implicit parallelisation of the chain of operations specified by the user.

The progress made in the field of distributed analysis is examined. In particular, the power of the integration of ROOT with Apache Spark via the PyROOT interface is shown.

In addition, the building blocks for the expression of parallelism in ROOT are briefly characterised together with the structural changes applied in the building and testing infrastructure which were necessary to put them in production.

042009
The following article is Open access

The outcome of a machine learning algorithm is a prediction model. Typically, these models are computationally expensive, and improving the quality of the prediction leads to a decrease in the inference speed. However, there is not always a trade-off between quality and speed. In this paper we show that it is possible to speed up the model by using additional memory without losing significant prediction quality for a novel boosted trees algorithm called CatBoost. The idea is to combine two approaches: training fewer trees and merging trees into a kind of hashmap called DecisionTensors. The proposed method allows for Pareto-optimal reduction of the computational complexity of the decision tree model with regard to the quality of the model. In the considered example the number of lookups was decreased from 5000 to only 6 (speedup factor of 1000) while the AUC score of the model was reduced by less than 10⁻³.
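
One way to picture how several trees can be merged into a single lookup table is sketched below. This is an illustration only and not the authors' DecisionTensor implementation. It assumes oblivious trees (every level of a tree applies the same feature/threshold split, so a tree of depth d is a table of 2^d leaf values) and represents each tree as a hypothetical `(splits, leaves)` pair.

```python
import numpy as np
from itertools import product

def tree_index(splits, x):
    # Build the leaf index from the outcome of each split, first split
    # in the most significant bit.
    idx = 0
    for feature, threshold in splits:
        idx = (idx << 1) | int(x[feature] > threshold)
    return idx

def predict_trees(trees, x):
    # Ordinary ensemble prediction: one table lookup per tree.
    return sum(leaves[tree_index(splits, x)] for splits, leaves in trees)

def merge_trees(trees):
    # Collect the distinct splits of all trees and precompute, for every
    # combination of split outcomes, the summed output of all trees.
    all_splits = sorted({s for splits, _ in trees for s in splits})
    pos = {s: i for i, s in enumerate(all_splits)}
    table = np.zeros(2 ** len(all_splits))
    for bits in product((0, 1), repeat=len(all_splits)):
        merged_idx = 0
        for b in bits:
            merged_idx = (merged_idx << 1) | b
        for splits, leaves in trees:
            idx = 0
            for s in splits:
                idx = (idx << 1) | bits[pos[s]]
            table[merged_idx] += leaves[idx]
    return all_splits, table

def predict_merged(merged, x):
    # Merged prediction: a single lookup replaces one lookup per tree.
    splits, table = merged
    return table[tree_index(splits, x)]
```

The table grows exponentially with the number of distinct splits, which is the memory cost paid for fewer lookups. Merging is therefore only practical for groups of trees that share most of their splits.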

042010
The following article is Open access

We present the Pileup Mitigation with Machine Learning (PUMML) algorithm for pileup removal at the Large Hadron Collider (LHC) based on the jet images framework using state-of-the-art machine learning techniques. We demonstrate that our algorithm outperforms existing methods on a wide range of jet observables up to pileup levels of 140 collisions per bunch crossing. We also investigate what aspects of the event our algorithms are utilizing by understanding the learned parameters of a simplified version of the model.

042011
The following article is Open access

The LHC data analysis software used to derive and publish experimental results is an important asset that must be preserved in order to fully exploit the scientific potential of a given measurement. An important use-case is the re-usability of the analysis procedure in the context of new scientific studies such as the reinterpretation of searches for new physics in terms of signal models that were not studied in the original publication (RECAST). We present the usage of the graph-based workflow description language yadage to drive the reinterpretation of preserved HEP analyses. The analysis software is preserved using Docker containers, while the workflow structure is preserved using plain JSON documents. This allows the re-execution of complex analysis workflows on modern distributed container orchestration systems and enables a systematic reinterpretation service based on such preserved analyses.

042012
The following article is Open access

By colliding protons and examining particles emitted from the collisions, the Large Hadron Collider aims to study the interactions of quarks and gluons at the highest energy accessible in a controlled experimental way. In such collisions, W bosons or top quarks with TeV-scale momentum become accessible. Reconstructing such boosted jets is becoming important. In particular, the ability to identify the original particle that decays to quarks against normal QCD jets plays a central role in various searches at high energy scales. This is typically done by the use of a single physically motivated observable constructed from the constituents of the jet. In this contribution, multiple complementary observables are combined using boosted decision trees and neural networks to increase the ability to distinguish W bosons and top quarks from light quark jets in the ATLAS experiment.

042013
The following article is Open access

Hydra is a header-only, templated and C++11-compliant framework designed to perform the typical bottleneck calculations found in common HEP data analyses on massively parallel platforms. The framework is implemented on top of the C++11 Standard Library and a variadic version of the Thrust library and is designed to run on Linux systems, using OpenMP, CUDA and TBB enabled devices. This contribution summarizes the main features of Hydra. A basic description of the overall design, functionality and user interface is provided, along with some code examples and measurements of performance.

042014
The following article is Open access

The GooFit package provides physicists a simple, familiar syntax for manipulating probability density functions and performing fits, and is highly optimized for data analysis on NVIDIA GPUs and multithreaded CPU backends. GooFit was updated to version 2.0, bringing a host of new features. A completely revamped and redesigned build system makes GooFit easier to install, develop with, and run on virtually any system. Unit testing, continuous integration, and advanced logging options are improving the stability and reliability of the system. Developing new PDFs now uses standard CUDA terminology and provides a lower barrier for new users. The system now has built-in support for multiple graphics cards or nodes using MPI, and is being tested on a wide range of different systems. GooFit also has significant improvements in performance on some GPU architectures due to optimized memory access. Support for time-dependent four-body amplitude analyses has also been added.

042015
The following article is Open access

Daily operation of a large-scale experiment is a resource-consuming task, particularly from the perspective of routine data quality monitoring. Typically, data come from different sub-detectors and the global quality of the data depends on the combined performance of each of them. In this paper, the problem of identifying channels in which anomalies occurred is considered. We introduce a generic deep learning model and prove that, under reasonable assumptions, the model learns to identify 'channels' which are affected by an anomaly. Such a model could be used to cross-check and assist data quality managers and to identify good channels in anomalous data samples. The main novelty of the method is that the model does not require ground truth labels for each channel; only a global flag is used. This effectively distinguishes the model from classical classification methods. Applied to CMS data collected in 2010, this approach proves its ability to decompose anomalies by separate channels.

042016
The following article is Open access

Faced with physical and energy density limitations on clock speed, contemporary microprocessor designers have increasingly turned to on-chip parallelism for performance gains. Algorithms should accordingly be designed with ample amounts of fine-grained parallelism if they are to realize the full performance of the hardware. This requirement can be challenging for algorithms that are naturally expressed as a sequence of small-matrix operations, such as the Kalman filter methods widely in use in high-energy physics experiments. In the High-Luminosity Large Hadron Collider (HL-LHC), for example, one of the dominant computational problems is expected to be finding and fitting charged-particle tracks during event reconstruction; today, the most common track-finding methods are those based on the Kalman filter. Experience at the LHC, both in the trigger and offline, has shown that these methods are robust and provide high physics performance. Previously we reported the significant parallel speedups that resulted from our efforts to adapt Kalman-filter-based tracking to many-core architectures such as Intel Xeon Phi. Here we report on how effectively those techniques can be applied to more realistic detector configurations and event complexity.

042017
The following article is Open access

High-precision modeling of subatomic particle interactions is critical for many fields within the physical sciences, such as nuclear physics and high energy particle physics. Most simulation pipelines in the sciences are computationally intensive – in a variety of scientific fields, Generative Adversarial Networks have been suggested as a solution to speed up the forward component of simulation, with promising results. An important component of any simulation system for the sciences is the ability to condition on any number of physically meaningful latent characteristics that can affect the forward generation procedure. We introduce an auxiliary task to the training of a Generative Adversarial Network on particle showers in a multi-layer electromagnetic calorimeter, which allows our model to learn an attribute-aware conditioning mechanism.

042018
The following article is Open access

Training a classifier for signal and background separation is a common part of data analysis in High Energy Physics. When the signal under investigation is a rare process, the signal sample is simulated and the background sample is taken from real data. Such a setting creates an unwanted bias: the classifier might learn not the characteristics of the signal but the characteristics of the imperfect simulation. The challenge is therefore to train the classifier in such a way that it picks up the signal/background difference and does not overfit to simulation-specific features. The suggested approach is based on a cross-domain adaptation technique using neural networks with gradient reversal. The network architecture is a dense multi-branch structure. One branch is responsible for the signal/background discrimination, while the second branch helps to avoid overfitting on the Monte Carlo training dataset. The tests showed that this architecture is a robust mechanism for choosing trade-offs between discrimination power and overfitting. The resulting network successfully distinguishes the signal from the background but does not distinguish simulated events from real ones. Moreover, such an architecture could easily be extended with more branches, each of which could be responsible for specific discrete and continuous domains. For example, an additional third branch could help to reduce the correlation between the classifier predictions and the reconstructed mass of the decay, thereby making the approach viable for a wide variety of physics searches. Such extensions of the network were not investigated in this work.

042019
The following article is Open access

MicroBooNE is a liquid argon time projection chamber (LArTPC) neutrino experiment that is currently running in the Booster Neutrino Beam at Fermilab. LArTPC technology allows for high-resolution, three-dimensional representations of neutrino interactions. A wide variety of software tools for automated reconstruction and selection of particle tracks in LArTPCs are actively being developed. Short, isolated proton tracks, the signal for low-momentum-transfer neutral current (NC) elastic events, are easily hidden in a large cosmic background. Detecting these low-energy tracks will allow us to probe interesting regions of the proton's spin structure. An effective method for selecting NC elastic events is to combine a highly efficient track reconstruction algorithm to find all candidate tracks with highly accurate particle identification using a machine learning algorithm. We present our work on particle track classification using gradient tree boosting software (XGBoost) and the performance on simulated neutrino data.

042020
The following article is Open access

Relativistic invariants are key variables in high energy physics and are believed to be learned implicitly by deep learning approaches. We investigate the minimum network complexity needed to accurately extract such invariants. Doing so will help us understand how complex a neural network needs to be to obtain certain functions. We find that neural networks do well at predicting the transverse momentum of a collision, which illustrates the fact that nonlinear functions can be learned. On the other hand, invariant mass was much more difficult to predict. Further work will be done to learn the reason why. However, the non-linearity of the function can be ruled out as the sole reason.
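
For reference, the two target quantities mentioned in this abstract can be written down directly from four-momenta. The sketch below uses the standard definitions in natural units (c = 1); the function names and the (E, px, py, pz) array layout are choices made for this example and are not taken from the paper.

```python
import numpy as np

def transverse_momentum(px, py):
    # pT = sqrt(px^2 + py^2): momentum component transverse to the beam (z) axis.
    return np.hypot(px, py)

def invariant_mass(p4s):
    # p4s has shape (n_particles, 4) with columns (E, px, py, pz).
    # m^2 = (sum E)^2 - |sum p|^2 for the combined system.
    E, px, py, pz = np.sum(p4s, axis=0)
    m2 = E**2 - (px**2 + py**2 + pz**2)
    return np.sqrt(max(m2, 0.0))  # clip tiny negative values from rounding
```

Both functions are nonlinear in their inputs. The invariant mass additionally involves a difference of squared sums over all particles, whereas the transverse momentum depends on only two components.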

042021
The following article is Open access

Geant4 is the leading detector simulation toolkit used in high energy physics to design detectors and to optimize calibration and reconstruction software. It employs a set of carefully validated physics models to simulate interactions of particles with matter across a wide range of interaction energies. These models, especially the hadronic ones, rely largely on directly measured cross-sections and phenomenological predictions with physically motivated parameters estimated by theoretical calculation or measurement. Because these models are tuned to cover a very wide range of possible simulation tasks, they may not always be optimized for a given process or a given material. This raises several critical questions, e.g. how sensitive Geant4 predictions are to the variations of the model parameters, or what uncertainties are associated with a particular tune of a Geant4 physics model, or a group of models, or how to consistently derive guidance for Geant4 model development and improvement from a wide range of available experimental data. We have designed and implemented a comprehensive, modular, user-friendly software toolkit to study and address such questions. It allows one to easily modify parameters of one or several Geant4 physics models involved in the simulation, and to perform collective analysis of multiple variants of the resulting physics observables of interest and comparison against a variety of corresponding experimental data. Based on modern event-processing infrastructure software, the toolkit offers a variety of attractive features, e.g. flexible run-time configurable workflow, comprehensive bookkeeping, easy-to-expand collection of analytical components. Design, implementation technology, and key functionalities of the toolkit are presented and illustrated with results obtained with Geant4 key hadronic models.

042022
The following article is Open access

Feature extraction algorithms, such as convolutional neural networks, have introduced the possibility of using deep learning to train directly on raw data without the need for rule-based feature engineering. In the context of particle physics, such end-to-end approaches can be used for event classification to learn directly from detector-level data in a way that is completely independent of the high-level physics reconstruction. We demonstrate a technique for building such end-to-end event classifiers to distinguish simulated electromagnetic decays in a high-fidelity model of the CMS Electromagnetic Calorimeter.

042023
The following article is Open access

Charged particle reconstruction in dense environments, such as the detectors of the High Luminosity Large Hadron Collider (HL-LHC), is a challenging pattern recognition problem. Traditional tracking algorithms, such as the combinatorial Kalman Filter, have been used with great success in HEP experiments for years. However, these state-of-the-art techniques are inherently sequential and scale quadratically or worse with increased detector occupancy. The HEP.TrkX project is a pilot project that aims to identify and develop cross-experiment solutions based on machine learning algorithms for track reconstruction. Machine learning algorithms bring a lot of potential to this problem thanks to their capability to model complex non-linear data dependencies, to learn effective representations of high-dimensional data through training, and to parallelize easily on high-throughput architectures such as FPGAs or GPUs. In this paper we present the evolution and performance of our recurrent (LSTM) and convolutional neural networks, moving from basic 2D models to more complex models, and the challenges of scaling up to realistic dimensionality/sparsity.

042024
The following article is Open access

The global view of the ATLAS Event Index system was presented at the 17th ACAT Conference. This article concentrates on the architecture of the system core component. This component handles the final stage of the event metadata import. It organizes its storage and provides fast and feature-rich access to all information. A user is able to interrogate metadata in various ways, including by executing user-provided code on the data to make selections and to interpret the results. A wide spectrum of clients is available, from a set of Linux-like commands to an interactive graphical Web Service. The stored event metadata contain the basic description of the related events, the references to the experiment event storage and the full trigger record, and can be extended with other event characteristics. Derived collections of events can be created. Such collections can be annotated and tagged with further information.

042025
The following article is Open access

Traces of electromagnetic showers in neutrino experiments may be considered as signals of dark matter particles. For example, the SHiP experiment is going to use emulsion film detectors similar to the ones designed for the OPERA experiment for dark matter searches. The goal of this research is to develop an algorithm that can identify traces of electromagnetic showers in particle detectors, so that it becomes possible to analyse and compare various dark matter hypotheses. Both real data and signal simulation samples for this research come from the OPERA experiment. We have also used an existing algorithm for electromagnetic shower identification as a baseline, although in this research we have used no hints about the shower origin.

042026
The following article is Open access

Neural networks are going to be used in the pipelined first level trigger of the upgraded flavor physics experiment Belle II at the high luminosity B factory SuperKEKB in Tsukuba, Japan. An instantaneous luminosity of L = 8 × 10³⁵ cm⁻² s⁻¹ is anticipated, 40 times larger than the world record reached with the predecessor KEKB. Background tracks, with vertices displaced along the beamline (z-axis), are expected to be severely increased due to the high luminosity. Using input from the central drift chamber, the main tracking device of Belle II, the online neural network trigger provides 3D track reconstruction within the fixed latency of the first level trigger. In particular, the robust estimation of the z-vertices allows a significantly improved suppression of the machine background. Based on a Monte Carlo background simulation, the high event rate faced by the first level trigger is analyzed and the benefits of the neural network trigger are evaluated.

042027
The following article is Open access

The BESIII spectrometer is located at the Beijing Electron-Positron Collider (BEPCII). Recently, the endcap parts of the Time-Of-Flight system (TOF) have been upgraded and, consequently, an upgrade of the BESIII visualization software (BesVis) is necessary. The Event Display visualizes particle interactions in the detector and plays an important role for the data acquisition system (DAQ), the tuning of reconstruction algorithms and physics analyses. The graphical interface of the Event Display is based on the ROOT GUI. The detector description is stored in GDML files and is converted into the ROOT geometry system.

042028
The following article is Open access

Scientific computing has advanced in the ways it deals with massive amounts of data, since production capacities have increased significantly over the last decades. Most large science experiments require vast computing and data storage resources in order to provide results or predictions based on the data obtained. For scientific distributed computing systems with hundreds of petabytes of data and thousands of users, it is important to keep track not just of how data is distributed in the system, but also of individual users' interests in the distributed data (revealing implicit interconnections between users and data objects). This, however, requires the collection and use of specific statistics such as correlations between data distribution, the mechanics of data distribution, and mainly user preferences. This work focuses on user activities (specifically, data usage) and interests in such a distributed computing system, namely PanDA (Production ANd Distributed Analysis system). PanDA is a high-performance workload management system originally designed to meet production and analysis requirements for a data-driven workload at the Large Hadron Collider Computing Grid for the ATLAS Experiment hosted at CERN (the European Organization for Nuclear Research). In this work we investigate whether data collected in the past in PanDA show any trends indicating that users could have mutual interests that persist in subsequent data usage (i.e., data usage patterns), using data mining techniques such as association analysis, sequential pattern mining, and the basics of the recommender system approach. We show that such common interests between users indeed exist and thus could be used to provide recommendations (in terms of collaborative filtering) to help users with their data selection process.
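
As a small illustration of the association analysis mentioned here, the sketch below computes support and confidence for pairs of datasets that appear together in users' access lists. It is a generic textbook-style example with made-up input, not the analysis performed on PanDA data; the function name and the `min_support` parameter are assumptions.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.3):
    # transactions: one list of dataset names per user (or per job).
    # Returns tuples (antecedent, consequent, support, confidence).
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n              # fraction of users accessing both
        if support < min_support:
            continue
        # confidence(a -> b): of the users who accessed a, the fraction
        # who also accessed b
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules
```

A rule with high confidence, such as "users of dataset A usually also use dataset B", is the kind of pattern a recommender could turn into a suggestion for users who have so far accessed only A.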

042029
The following article is Open access

Deep learning has led to several breakthroughs outside the field of high energy physics, yet in jet reconstruction for the CMS experiment at the CERN LHC it has not been used so far. This report shows results of applying deep learning strategies to jet reconstruction at the stage of identifying the original parton association of the jet (jet tagging), which is crucial for physics analyses at the LHC experiments. We introduce a custom deep neural network architecture for jet tagging. We compare the performance of this novel method with the other established approaches at CMS and show that the proposed strategy provides a significant improvement. The strategy provides the first multi-class classifier, instead of the few binary classifiers that were used previously, and thus yields more information in a more convenient way. The performance results obtained with simulation imply a significant improvement for a large number of important physics analyses at the CMS experiment.

042030
The following article is Open access

Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems for distributed data processing, collectively called "Big Data" technologies, have emerged from industry and open source projects to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and tools, promising a fresh look at analysis of very large datasets that could potentially reduce the time-to-physics with increased interactivity. Moreover, these new tools are typically actively developed by large communities, often benefiting from industry resources, and under open source licensing. These factors boost the adoption and maturity of the tools and the communities supporting them, while at the same time helping to reduce the cost of ownership for the end users. In this talk, we present studies of using Apache Spark for end user data analysis. We study the HEP analysis workflow separated into two thrusts: the reduction of centrally produced experiment datasets and the end analysis up to the publication plot. For the first thrust, CMS is working together with CERN openlab and Intel on the CMS Big Data Reduction Facility. The goal is to reduce 1 PB of official CMS data to 1 TB of ntuple output for analysis. We present the progress of this 2-year project with first results of scaling up Spark-based HEP analysis. For the second thrust, we present studies on using Apache Spark for a CMS Dark Matter physics search, investigating Spark's feasibility, usability and performance compared to the traditional ROOT-based analysis.

042031
The following article is Open access

and

The separation of b-quark initiated jets from those coming from lighter quark flavors (b-tagging) is a fundamental tool for the ATLAS physics program at the CERN Large Hadron Collider. The most powerful b-tagging algorithms combine information from low-level taggers, exploiting reconstructed track and vertex information, into machine learning classifiers. The potential of modern deep learning techniques is explored using simulated events, and compared to that achievable from more traditional classifiers such as boosted decision trees.

042032
The following article is Open access

, , and

In 2017, the LHC delivered an instantaneous luminosity of roughly 2.0 × 10³⁴ cm⁻² s⁻¹ to the Compact Muon Solenoid (CMS) experiment, with about 60 simultaneous proton-proton collisions (<µ>) per event. In these challenging conditions, it is important to be able to intelligently monitor the rate at which data are being collected (the trigger rate). It is not enough to simply look at the trigger rate; it is equally important to compare the trigger rate with expectations. We present a set of software tools that have been developed to accomplish this. The tools include a real-time component - a script that monitors the rates of individual triggers during data-taking, and activates an alarm if rates deviate significantly from expectation. Fits are made to previously collected data and extrapolated to higher <µ>. The behavior of triggers as a function of <µ> is then monitored as data are collected - plots are automatically produced on an hourly basis and uploaded to a web area for inspection. This same set of tools can also be used offline in data certification, as well as in more complex offline analysis of trigger behavior.
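
The fit-and-compare idea in this abstract can be illustrated with a minimal sketch. It assumes a quadratic dependence of the rate on <µ> and a fixed relative tolerance; the function names, the reference numbers and the 20% threshold are hypothetical and are not taken from the CMS tools.

```python
import numpy as np

def fit_rate_model(mu, rate, degree=2):
    """Fit a polynomial rate(<mu>) model to previously collected data."""
    return np.polynomial.Polynomial.fit(mu, rate, degree)

def check_rate(model, mu_now, rate_now, tolerance=0.2):
    """Return (expected rate, alarm flag).

    The alarm flag is True when the relative deviation of the observed
    rate from the extrapolated expectation exceeds `tolerance`.
    """
    expected = float(model(mu_now))
    deviation = abs(rate_now - expected) / expected
    return expected, deviation > tolerance

# Hypothetical reference data: rate in Hz recorded at <mu> between 20 and 40.
mu_ref = np.linspace(20.0, 40.0, 21)
rate_ref = 5.0 + 1.5 * mu_ref + 0.02 * mu_ref**2

model = fit_rate_model(mu_ref, rate_ref)

# Extrapolate to <mu> = 60 and compare two hypothetical observed rates.
expected_60, alarm_ok = check_rate(model, 60.0, 170.0)
_, alarm_bad = check_rate(model, 60.0, 300.0)
```

In this toy setup the first observed rate lies within the tolerance of the extrapolated expectation and raises no alarm, while the second one does.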

042033
The following article is Open access

and

We present methods to perform high statistics data analyses to investigate fundamental neutrino properties in large volume neutrino detectors, fast and with modest computational resources. The introduced measures are threefold: speeding up computations using graphics processors, evaluating the underlying physics processes on a grid instead of treating every event individually, and lastly applying smoothing methods to quantities obtained from Monte Carlo simulations. We show that with our method we can get reliable analysis results using significantly less simulation than is usually needed, and that the time needed to run an analysis with our method is independent of sample size.
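
As an illustration of the second measure only (evaluation on a grid), the sketch below histograms toy events once and then evaluates a two-flavour survival probability at the bin centres instead of per event. The event sample, binning and parameter values are hypothetical; the GPU and smoothing steps of the abstract are not shown.

```python
import numpy as np

def survival_probability(energy_gev, baseline_km, sin2_2theta, dm2_ev2):
    """Two-flavour survival probability 1 - sin^2(2 theta) sin^2(1.267 dm^2 L / E)."""
    phase = 1.267 * dm2_ev2 * baseline_km / energy_gev
    return 1.0 - sin2_2theta * np.sin(phase) ** 2

# Toy event sample (hypothetical): uniform in energy and baseline.
rng = np.random.default_rng(0)
n_events = 200_000
energy = rng.uniform(5.0, 50.0, n_events)
baseline = rng.uniform(1000.0, 12000.0, n_events)

# One-off step: bin the events on an (energy, baseline) grid.
e_edges = np.linspace(5.0, 50.0, 91)
l_edges = np.linspace(1000.0, 12000.0, 111)
counts, _, _ = np.histogram2d(energy, baseline, bins=[e_edges, l_edges])
e_centres = 0.5 * (e_edges[:-1] + e_edges[1:])
l_centres = 0.5 * (l_edges[:-1] + l_edges[1:])
e_grid, l_grid = np.meshgrid(e_centres, l_centres, indexing="ij")

def expected_on_grid(sin2_2theta, dm2_ev2):
    """Cost depends on the number of bins, not on the number of events."""
    prob = survival_probability(e_grid, l_grid, sin2_2theta, dm2_ev2)
    return float(np.sum(counts * prob))

def expected_per_event(sin2_2theta, dm2_ev2):
    """Reference calculation that touches every event."""
    return float(np.sum(survival_probability(energy, baseline, sin2_2theta, dm2_ev2)))

grid_total = expected_on_grid(1.0, 2.5e-3)
event_total = expected_per_event(1.0, 2.5e-3)
relative_difference = abs(grid_total - event_total) / event_total
```

Once the histogram is filled, each new parameter hypothesis costs one pass over the grid, which is why the run time per hypothesis does not grow with the sample size.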

042034
The following article is Open access

, , , , and

There has been considerable recent activity applying deep convolutional neural nets (CNNs) to data from particle physics experiments. Current approaches on ATLAS/CMS have largely focussed on a subset of the calorimeter, and on identifying objects or particular particle types. We explore approaches that use the entire calorimeter, combined with track information, for directly conducting physics analyses: i.e. classifying events as known-physics background or new-physics signals.

We use an existing RPV-Supersymmetry analysis as a case study and explore CNNs on multi-channel, high-resolution sparse images, run on GPU and multi-node CPU architectures (including Knights Landing (KNL) Xeon Phi nodes) on the Cori supercomputer at NERSC.

We compare statistical performance of our approaches with selections on high-level physics variables from the current physics analyses, and shallow classifiers trained on those variables. We also compare time-to-solution performance of CPU (scaling to multiple KNL nodes) and GPU implementations.

042035
The following article is Open access

, , and

A columnar data representation is known to be an efficient way to store data, specifically in cases when the analysis is often based on only a small fragment of the available data structures. A data representation like Apache Parquet is a step forward from a columnar representation, which splits data horizontally to allow for easy parallelization of data analysis. Based on the general idea of columnar data storage, working on the FNAL LDRD Project FNAL-LDRD-2016-032, we have developed a striped data representation, which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approach allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed NoSQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple, so that it allows for a common data model and APIs for a wide range of underlying storage mechanisms such as distributed NoSQL databases and local file systems. Striped storage adopts NumPy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service that hides the server implementation details from the end user, easily exposes data to WAN users, and makes it possible to use well known and developed data caching solutions to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise scale data analysis platform for High Energy Physics and similar areas of data processing.
We have been testing this architecture with a 2 TB dataset from a CMS dark matter search and plan to expand it to multiple 100 TB or even PB scale. We will present the striped format, Striped Data Server architecture and performance test results.
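
The general idea of striping columns stored as NumPy arrays can be sketched as follows. This is not the Striped Data Server API: the helper names, the column names and the stripe size are hypothetical, and a thread pool stands in for whatever parallel back end is actually used.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def make_stripes(columns, stripe_size):
    """Split every column into row ranges ('stripes') of at most stripe_size rows."""
    n_rows = len(next(iter(columns.values())))
    stripes = []
    for start in range(0, n_rows, stripe_size):
        stop = start + stripe_size
        stripes.append({name: col[start:stop] for name, col in columns.items()})
    return stripes

def count_selected(stripe, pt_cut=30.0):
    """Worker that reads only the 'pt' and 'eta' columns of one stripe."""
    mask = (stripe["pt"] > pt_cut) & (np.abs(stripe["eta"]) < 2.4)
    return int(mask.sum())

# Toy columnar dataset (hypothetical columns and values).
rng = np.random.default_rng(1)
columns = {
    "pt": rng.exponential(25.0, 10_000),
    "eta": rng.uniform(-3.0, 3.0, 10_000),
    "phi": rng.uniform(-np.pi, np.pi, 10_000),
}

stripes = make_stripes(columns, 1_000)

# Each stripe is processed independently; partial results are then reduced.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_selected, stripes))
total_selected = sum(partial_counts)
```

Because a worker touches only the columns it needs and only its own row range, the same function can run unchanged on any number of stripes.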

042036
The following article is Open access

and

Reconstruction and identification of particles in calorimeters of modern High Energy Physics experiments is a complicated task. Solutions are usually driven by a priori knowledge about expected properties of reconstructed objects. Such an approach is also used to distinguish single photons in the electromagnetic calorimeter of the LHCb detector at the LHC from overlapping photons produced in decays of high momentum π0. We studied an alternative solution based on first principles. This approach applies neural networks and a classifier based on the gradient boosting method to primary calorimeter information, that is, the energies collected in individual cells of the energy cluster. The combined application of these methods improves the separation performance in an analysis based on Monte Carlo data. The receiver operating characteristic score of the classifier increases from 0.81 to 0.95, which means that the fake rate for primary photons is reduced by a factor of two or more.

042037
The following article is Open access

, and

In high energy physics experiments, efficient data analysis tools are required to extract interesting information from massive amounts of data. For large-scale liquid-based neutrino experiments, neutrino signals are usually overwhelmed by huge backgrounds. By constructing a toy model of a liquid neutrino detector, we generate simulation data in Geant4 [1] and run reconstruction for signal-background discrimination. The low-level photomultiplier tube (PMT) hits are also projected onto a 2D plane to create visualization outputs for classification. With the 2D images as input, we use a Convolutional Neural Network (CNN) as a specific application, which shows remarkable performance in signal and background discrimination and outperforms classification based on high-level reconstruction outputs. With further study, the method is expected to be used in neutrino experiments such as JUNO.
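
One way to project PMT hits onto a 2D plane is sketched below, assuming PMTs on a sphere and a simple (polar angle, azimuth) binning weighted by hit charge. The geometry, image size and function name are hypothetical; the abstract does not specify the projection that was actually used.

```python
import numpy as np

def hits_to_image(positions, charges, n_theta=32, n_phi=64):
    """Project 3D PMT hit positions onto a (theta, phi) image weighted by charge."""
    x, y, z = positions.T
    r = np.linalg.norm(positions, axis=1)
    theta = np.arccos(np.clip(z / r, -1.0, 1.0))  # polar angle in [0, pi]
    phi = np.arctan2(y, x)                        # azimuth in [-pi, pi]
    image, _, _ = np.histogram2d(
        theta,
        phi,
        bins=[n_theta, n_phi],
        range=[[0.0, np.pi], [-np.pi, np.pi]],
        weights=charges,
    )
    return image

# Toy detector (hypothetical): 500 hit PMTs on a sphere of radius 20 m.
rng = np.random.default_rng(2)
directions = rng.normal(size=(500, 3))
pmt_positions = 20.0 * directions / np.linalg.norm(directions, axis=1, keepdims=True)
hit_charges = rng.poisson(3.0, size=500).astype(float)

image = hits_to_image(pmt_positions, hit_charges)
```

The resulting array has a fixed shape per event and can be passed to an image classifier as a single-channel input.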

042038
The following article is Open access

, , , , , , , , and

One of the most important aspects of data analysis at the LHC experiments is particle identification (PID). In LHCb, several different sub-detectors provide PID information: two Ring Imaging Cherenkov (RICH) detectors, the hadronic and electromagnetic calorimeters, and the muon chambers. To improve charged particle identification, we have developed models based on deep learning and gradient boosting. The new approaches, tested on simulated samples, provide better identification performance than the current solution for all charged particle types. It is also desirable for the efficiencies to have a flat dependence on spectator variables such as particle momentum, in order to reduce systematic uncertainties in the physics results. For this purpose, models that improve the flatness of the efficiencies have also been developed. This paper presents this new approach and its performance.

042039
The following article is Open access

, , , , , and

The data management infrastructure operated at CNAF, the central computing and storage facility of INFN (Italian Institute for Nuclear Physics), is based on both disk and tape storage resources. About 40 Petabytes of scientific data produced by the LHC (Large Hadron Collider) at CERN and by other experiments in which INFN is involved are stored on tape. This is the highest latency storage tier within the HSM (Hierarchical Storage Management) environment. Write and read requests on tape media are served by a set of Oracle-StorageTek T10000D tape drives, shared among different scientific communities. In the coming years, the usage of tape drives will become more intense, not only due to the growing amount of scientific data to manage but also due to the general trend, announced by the main user communities, of using tapes as "slow disk". In order to reduce hardware purchases, a key point is to minimize the inactivity periods of tape drives. In this paper we present a software solution designed to optimize the efficiency of the shared usage of tape drives in our environment.

042040
The following article is Open access

, , and

Starting with Run II, future development projects for the Large Hadron Collider will steadily increase the nominal luminosity, with the ultimate goal of reaching a peak luminosity of 5 × 10³⁴ cm⁻² s⁻¹ for the ATLAS and CMS experiments, planned for the High Luminosity LHC (HL-LHC) upgrade. This rise in luminosity will directly result in an increased number of simultaneous proton collisions (pileup), up to 200, which will pose new challenges for the CMS detector and, specifically, for track reconstruction in the Silicon Pixel Tracker. One of the first steps of the track finding workflow is the creation of track seeds, i.e. compatible pairs of hits from different detector layers, which are subsequently fed to higher level pattern recognition steps. However, the set of compatible hit pairs is highly affected by combinatorial background, so that the next steps of the tracking algorithm process a significant fraction of fake doublets. A possible way of reducing this effect is to take into account the shape of the hit pixel cluster when checking the compatibility between two hits. Each doublet is associated with a collection of two images built from the ADC levels of the pixels forming the hit cluster. The task of fake rejection can thus be seen as an image classification problem, for which Convolutional Neural Networks (CNNs) have been widely proven to provide reliable results. In this work we present our studies on the application of CNNs to the filtering of track pixel seeds. We will show the results obtained for simulated events reconstructed in the CMS detector, focusing on the estimation of the efficiency and fake rejection performance of our CNN classifier.

042041
The following article is Open access

There are numerous approaches to building analysis applications across the high-energy physics community. Among them are Python-based, or at least Python-driven, analysis workflows. We aim to ease the adoption of a Python-based analysis toolkit by making it easier for non-expert users to gain access to Python tools for scientific analysis. Experimental software distributions and individual user analyses have quite different requirements. Distributions tend to worry most about stability, usability and reproducibility, while users usually strive to be fast and nimble. We discuss how we built and now maintain a Python distribution for analysis while satisfying the requirements of both a large software distribution (in our case, that of CMSSW) and user-level, or laptop-level, analysis. We pursued the integration of tools used by the broader data science community as well as Python packages developed within HEP (e.g., histogrammar, root_numpy). We discuss concepts we investigated for package integration and testing, as well as issues we encountered through this process. Distribution and platform support are important topics. We discuss our approach and progress towards a sustainable infrastructure for supporting this Python stack for the CMS user community and for the broader HEP user community.

042042
The following article is Open access

, , , , , , , , , et al

The first implementation of a Machine Learning Algorithm inside a Level-1 trigger system at the LHC is presented. The Endcap Muon Track Finder (EMTF) at CMS uses Boosted Decision Trees (BDTs) to infer the momentum of muons in the forward region of the detector, based on 25 different variables. Combinations of these variables representing 2³⁰ distinct patterns are evaluated offline using regression BDTs. The predictions for the 2³⁰ input variable combinations are stored in a 1.2 GB look-up table in the EMTF hardware. The BDTs take advantage of complex correlations between variables, the inhomogeneous magnetic field, and non-linear effects, such as inelastic scattering, to distinguish high momentum signal muons from the overwhelming low-momentum background. The new momentum algorithm reduced the background rate by a factor of three with respect to the previous analytic algorithm, with further improvements foreseen in the coming year.
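
The look-up-table approach can be sketched on a small scale. Reading the number of patterns as 2^30 (a 30-bit address), a 1.2 GB table corresponds to roughly 9 bits per entry. The toy below uses a 12-bit address built from three 4-bit fields, and a simple function stands in for the trained regression BDT; the field layout, names and stand-in model are all hypothetical.

```python
import numpy as np

# Size estimate from the numbers quoted in the abstract.
ADDRESS_BITS = 30
lut_bytes = 1.2e9
bits_per_entry = lut_bytes * 8 / 2**ADDRESS_BITS  # roughly 9 bits

# Toy layout (hypothetical): three quantized inputs of 4 bits each.
FIELD_BITS = (4, 4, 4)

def pack_address(values):
    """Concatenate the quantized inputs into a single table address."""
    address = 0
    for value, width in zip(values, FIELD_BITS):
        assert 0 <= value < (1 << width)
        address = (address << width) | value
    return address

def unpack_address(address):
    """Inverse of pack_address."""
    values = []
    for width in reversed(FIELD_BITS):
        values.append(address & ((1 << width) - 1))
        address >>= width
    return tuple(reversed(values))

def offline_model(values):
    """Stand-in for the offline regression BDT (hypothetical), 8-bit output."""
    a, b, c = values
    return min(255, 5 * a + 3 * b + c * c)

# Offline step: evaluate the model once for every possible address.
n_addresses = 1 << sum(FIELD_BITS)
lut = np.array(
    [offline_model(unpack_address(a)) for a in range(n_addresses)],
    dtype=np.uint8,
)

def online_lookup(values):
    """Online step: a single memory read, no model evaluation."""
    return int(lut[pack_address(values)])
```

The design trades memory for latency: all model evaluation happens offline, and the online cost is one table read regardless of the model's complexity.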

042043
The following article is Open access

, , , and

Physics analyses at the LHC require accurate simulations of the detector response and the event selection processes, generally done with the most recent software releases. The trigger response simulation is crucial for the determination of overall selection efficiencies and signal sensitivities and should be done with the same software release with which the data were recorded. This potentially requires running software dating back many years, the so-called legacy software, in which algorithms and configuration may differ from their current implementation. A strategy for running legacy software in a modern environment therefore becomes essential when data simulated for past years start to represent a sizeable fraction of the total. The requirements and possibilities for such a simulation scheme within the ATLAS software framework were examined and a proof-of-concept simulation chain has been successfully implemented. One of the greatest challenges was the choice of a data format that promises long term compatibility with old and new software releases. Over the time periods envisaged, data format incompatibilities are also likely to emerge in databases and other external support services. Software availability may become an issue, e.g. when support for the underlying operating system stops. The encountered problems and developed solutions will be presented, and proposals for future development will be discussed. Some ideas reach beyond the retrospective trigger simulation scheme in ATLAS, as they also touch more generally on aspects of data preservation.

042044
The following article is Open access

, , , , , , , , , et al

Latest developments in many research fields indicate that deep learning methods have the potential to significantly improve physics analyses. They not only enhance the performance of existing algorithms but also pave the way for new measurement techniques that are not possible with conventional methods. As the computation is highly resource-intensive, both dedicated hardware and software are required to obtain results in a reasonable time, which poses a substantial entry barrier. We provide direct access to this technology after a revision of the internet platform VISPA to serve the needs of researchers as well as students. VISPA equips its users with working conditions on remote computing resources comparable to a local computer through a standard web browser. To provide the required hardware resources for deep learning applications, we extend the CPU infrastructure with a GPU cluster consisting of 10 nodes, each with 2 GeForce GTX 1080 cards. Direct access through VISPA, preinstalled analysis software and a workload management system allowed us on the one hand to support more than 100 participants in a workshop on deep learning and in corresponding university classes, and on the other hand to achieve significant progress in particle and astroparticle research. We present the setup of the system and report on the performance and achievements in the above-mentioned use cases.

042045
The following article is Open access

and

The P̄ANDA experiment, which is currently under construction at the Facility for Antiproton and Ion Research (FAIR) in Darmstadt, Germany, will address fundamental questions in hadron and nuclear physics via interactions of antiprotons with nuclei. It will be installed at the High Energy Storage Ring (HESR), which will provide an antiproton beam with a momentum range of 1.5 - 15 GeV/c and a high average interaction rate on the fixed target of 2 × 10⁷ events/s. The P̄ANDA experiment will adopt continuous data acquisition, and the expected data rate transmitted to a high-bandwidth computing network will be of the order of 200 GB/s. However, in order to be able to select many different and rare physics processes, an indiscriminate hardware trigger would not suffice. Instead, an online software-based data selection system will be used to select only relevant data and thereby reduce the data rate by a factor of up to 1000. Due to the high interaction rate, a highly advanced online analysis must be developed that can also deal with overlapping events. Scalability and parallelization of the reconstruction algorithms are therefore a particular focus in the development process. A simulation framework called PandaRoot is used to further optimize the detector performance and to develop and evaluate different reconstruction algorithms for event building, tracking and particle identification. An overview of PandaRoot and the requirements on the event reconstruction algorithms is presented, and algorithms for the event time reconstruction currently under development are discussed.