Machine Learning Developments in ROOT

ROOT is a software framework for large-scale data analysis that provides basic and advanced statistical methods used by high-energy physics experiments. It includes machine learning tools from the ROOT-integrated Toolkit for Multivariate Analysis (TMVA). We present several recent developments in TMVA, including a new modular design, new algorithms for pre-processing, cross-validation, hyperparameter tuning and deep learning, and interfaces to other machine-learning software packages. TMVA is additionally integrated with Jupyter, making it accessible from a browser.


Introduction
ROOT is an object-oriented framework that provides statistical methods, visualization and storage libraries for data analysis in high-energy physics (HEP) experiments, such as those at the Large Hadron Collider (LHC) in Geneva, Switzerland [1]. Although originally designed for HEP applications, ROOT is also widely used in other scientific fields.
Today, machine learning is at the core of many particle physics analyses, such as searches for new physics and precision Standard Model studies, including the identification of rare decays of the newly discovered Higgs boson. The ROOT framework provides machine learning tools through the Toolkit for Multivariate Analysis (TMVA) [2], which contains many machine learning algorithms.
TMVA provides implementations of these algorithms for both classification and regression, together with a way to compare their performance on the same dataset, which is useful when choosing the optimal algorithm for a particular analysis task. Recently, TMVA has undergone a significant upgrade targeting greater flexibility, modular design, new features and interfaces.

New Algorithms and Features
Several high-level algorithms and libraries have been added: deep neural networks, k-fold cross-validation and hyperparameter tuning. Additionally, TMVA is integrated with Jupyter [3], allowing for interactive execution in a browser. The new functionality and features of TMVA are described in what follows.

Pre-processing
Pre-processing algorithms in TMVA transform the input variables into a representation that can be effectively exploited by machine learning tasks (Figure 1). The existing pre-processing classes have been extended with the VarTransformHandler class, which implements algorithms such as Hessian Local Linear Embedding (HLLE) [4].

Deep Learning
A deep neural network is an artificial neural network with several hidden layers and a large number of neurons in each layer. Recent developments in machine learning have shown that these networks are capable of learning complex, non-linear relationships when trained on a sufficiently large dataset.
A new deep neural network (DNN) was implemented in TMVA, extending the existing artificial neural network Multi-Layer Perceptron (MLP) library. The new DNN implementation provides a deep learning library for efficient training on modern multi-core and GPU architectures.
As is common for deep neural networks, the DNN implementation uses the stochastic batch gradient descent method to train the network. In each training step the weights $W^k_{i,j}$ and bias terms $\theta^k_i$ of a given layer $k$ are updated using the gradient of the loss function evaluated on a randomly chosen input batch $x_b$ with expected output $y_b$. If regularization is applied, the loss may also contain contributions from the weight terms $W^k_{i,j}$ of each layer. This implementation also supports training with momentum, controlled by a parameter $p$; for $p = 0$ the standard stochastic batch gradient descent method is obtained. The training batch $(x_b, y_b)$ is chosen randomly and without replacement from the set of training samples.
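The training scheme can be illustrated with a minimal Python sketch on a one-dimensional linear model. This is not the TMVA code. The learning rate `alpha`, the helper names, the squared-error loss and the exact form of the momentum update (an update direction d -> p d + (1 - p) dJ/dW, followed by W -> W - alpha d) are assumptions, chosen so that p = 0 reduces to plain stochastic batch gradient descent.

```python
import random

def make_batches(samples, batch_size, rng):
    """Shuffle once, then slice: batches are drawn randomly, without replacement."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

def batch_gradient(w, b, batch):
    """Gradient of the mean squared-error loss of the toy model y = w*x + b."""
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

def train(samples, alpha=0.05, p=0.0, epochs=200, batch_size=4, seed=1):
    """Stochastic batch gradient descent with an assumed momentum form."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    dw, db = 0.0, 0.0  # momentum state: running update directions
    for _ in range(epochs):
        for batch in make_batches(samples, batch_size, rng):
            gw, gb = batch_gradient(w, b, batch)
            # For p = 0 the direction is just the current batch gradient.
            dw = p * dw + (1.0 - p) * gw
            db = p * db + (1.0 - p) * gb
            w -= alpha * dw
            b -= alpha * db
    return w, b

# Noise-free toy data generated from y = 3x + 1.
data = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w_plain, b_plain = train(data, p=0.0)
w_mom, b_mom = train(data, p=0.9)
print(w_plain, b_plain, w_mom, b_mom)
```

Both runs should approach the generating parameters of the toy data; in a real network the same update is applied to every layer's weights and biases.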
The new implementation provides several backends. The standard backend is based on multi-core CPU architectures and will work on any platform where ROOT is installed. It uses multithreading to perform the training in parallel and requires a multi-threaded BLAS implementation and the Intel TBB library [5]. The new GPU backends can be used to train on CUDA- and OpenCL-capable GPU architectures.
The numerical throughput and classification performance of the new deep learning library were evaluated on the Higgs dataset [6]. This dataset contains both low-level (kinematic) and high-level (domain-knowledge-inspired) features. As Figure 2 shows, the new library achieves higher classification performance than boosted decision trees trained on the same dataset. Figure 2 also illustrates that the new library is able to extract useful information directly from the low-level features. Furthermore, the TMVA-Cuda backend exhibits higher numerical throughput than the TMVA-OpenCL, TMVA-CPU and Theano [7] deep learning implementations (Figure 3).

Regression
Previously, a single hard-coded loss function was used for boosted decision trees. A new loss function class, with a number of new loss function options, was added. These options include least-squares and absolute deviation, in addition to the default Huber loss function [8]. Figure 4 (left) shows an example of using different loss functions on the default input dataset in TMVA. Figure 4 (right) shows an example application of the deep learning (DNN) library on the same dataset. One can observe the benefit of adding depth to the neural network for this regression task.
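For reference, the three loss options can be written down in their common textbook forms, as in the Python sketch below. The function names and the fixed threshold `delta` are assumptions made for illustration; how TMVA chooses the Huber threshold is not shown here.

```python
def least_squares(residual):
    """Quadratic loss: strongly penalises large residuals (outlier sensitive)."""
    return 0.5 * residual ** 2

def absolute_deviation(residual):
    """Linear loss: robust to outliers, not differentiable at zero."""
    return abs(residual)

def huber(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond; continuous at delta."""
    size = abs(residual)
    if size <= delta:
        return 0.5 * residual ** 2
    return delta * (size - 0.5 * delta)

# Compare the three losses for a small, a borderline and an outlier residual.
for r in (0.5, 1.0, 3.0):
    print(r, least_squares(r), absolute_deviation(r), huber(r))
```

The Huber loss follows the least-squares loss for small residuals and grows only linearly for large ones, which is why it is a common robust default for regression.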

Cross Validation
K-fold cross-validation is a model evaluation technique that estimates how well a machine-learning model generalizes to unseen data. The dataset is partitioned into k folds, or partitions. In one cross-validation round, k-1 folds are used for model training and the remaining fold for model testing. Several rounds of cross-validation are performed with different partitions, and the model performance results are averaged. One advantage of cross-validation over a simple split into a training and a testing set is that the full dataset is used to validate the model. Cross-validation helps to detect over-fitting and leads to a more accurate estimate of the performance of the machine-learning model on unseen data [9].
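The partitioning described above can be sketched in plain Python as follows. The function names and the shuffling with a fixed seed are assumptions made for illustration; this is not the TMVA implementation.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validation_rounds(n_samples, k, seed=0):
    """Yield (train, test) index lists: each fold is the test set exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    for test_fold in range(k):
        test = folds[test_fold]
        train = [i for j, fold in enumerate(folds) if j != test_fold for i in fold]
        yield train, test

for train, test in cross_validation_rounds(n_samples=10, k=5):
    print(len(train), "training events,", len(test), "test events")
```

Every event is used for testing in exactly one round and for training in the other k-1 rounds, so the averaged result makes use of the full dataset.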
Cross-validation in TMVA is performed with a standalone CrossValidation class. Figure 5 shows five receiver operating characteristic (ROC) curves, one for each cross-validation fold, for a basic TMVA example. This example has four random variables with Gaussian distributions, plus several derived variables obtained by applying basic mathematical operations to them.
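As an illustration of how one such curve is obtained from the classifier output on a test fold, the sketch below builds a ROC curve and its area in plain Python. It assumes distinct scores and labels of 1 (signal) and 0 (background), and it uses the false-positive-rate versus true-positive-rate convention; the axes of the TMVA plots may be defined differently. The function names are hypothetical.

```python
def roc_curve(scores, labels):
    """Return (false-positive rate, true-positive rate) points of the ROC curve."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the decision threshold one event at a time, highest score first.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal integral of the curve; 1.0 means perfect separation."""
    return sum((x1 - x0) * (y1 + y0) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = area_under_curve(roc_curve(scores, labels))
print(auc)
```

Computing one such curve per fold and comparing them gives a visual impression of how stable the classifier performance is across folds.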

Hyperparameter Tuning
Support for hyperparameter tuning has been added to TMVA for boosted decision trees and support vector machines. It performs an automatic search through the hyperparameter space to find the optimal classifier, and relies on k-fold cross-validation to evaluate the performance of each candidate.
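The idea can be sketched as a grid search scored by k-fold cross-validation. In the Python example below a toy one-dimensional nearest-neighbour classifier stands in for the boosted decision trees and support vector machines, and the number of neighbours plays the role of the hyperparameter; all names, the grid and the toy data are assumptions, and the search strategy used by TMVA may differ.

```python
import random

def knn_predict(train, x, n_neighbors):
    """Majority vote among the n_neighbors training events closest to x."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:n_neighbors]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def cv_accuracy(samples, n_neighbors, k=5):
    """Mean test accuracy over k cross-validation rounds."""
    folds = [samples[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        correct = sum(knn_predict(train, x, n_neighbors) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

def grid_search(samples, grid, k=5):
    """Score every hyperparameter value in the grid and return the best one."""
    scores = {value: cv_accuracy(samples, value, k) for value in grid}
    best = max(scores, key=scores.get)
    return best, scores

# Toy data: signal (label 1) and background (label 0) as shifted Gaussians.
rng = random.Random(42)
samples = ([(rng.gauss(1.0, 1.0), 1) for _ in range(100)]
           + [(rng.gauss(-1.0, 1.0), 0) for _ in range(100)])
rng.shuffle(samples)
best_k, scores = grid_search(samples, grid=[1, 5, 15, 31])
print(best_k, scores)
```

Because every candidate is scored on events it was not trained on, the selected value is less likely to be one that merely over-fits the training sample.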

TMVA and Jupyter Notebooks
Another new feature in TMVA is its integration with Jupyter notebooks. Jupyter is a web application that combines live code, rich text, links and formulas in a user-friendly format. All the previous functionality of the TMVA

TMVA Interfaces to External Machine Learning Tools
Another useful addition to TMVA is a set of interfaces to external machine-learning tools written in the R and Python languages.

RMVA Interface
RMVA is a set of TMVA plugins that allows the use of machine-learning methods available in R directly from TMVA. The goal of RMVA is to allow direct comparison of R-based algorithms with existing tools in TMVA for a given problem. Currently, the following machine-learning packages in R are supported:
• Decision trees and rule-based models (C50) [10].

Python with TMVA (PyMVA)
PyMVA is a set of TMVA plugins based on the Python API that allows direct use of machine-learning methods written in Python from within TMVA. The following Python-based methods from the Scikit-learn software package [14] are currently available in TMVA:
• Random Forest (PyRandomForest)
• Gradient Boosted Regression Trees (PyGTB)
• Adaptive Boosting (PyAdaBoost)

PyKeras
An interface to Keras [15] has been added to TMVA. Keras is a high-level deep learning library, written in Python, that works with the Theano [7] and TensorFlow [16] deep learning frameworks.
The new interface permits running Keras from within TMVA.

Conclusions
Machine learning tools in ROOT have undergone a significant upgrade. In particular, TMVA has a new design targeting greater flexibility and modularity, new features such as deep neural networks, cross-validation and hyperparameter tuning, and interfaces to R and Python-based machine-learning tools. In addition, TMVA is available in Jupyter notebooks, making it accessible from a web browser.