Development of Machine Learning Tools in ROOT

ROOT is a framework for large-scale data analysis that provides basic and advanced statistical methods used by the LHC experiments. These include machine learning algorithms from the ROOT-integrated Toolkit for Multivariate Analysis (TMVA). We present several recent developments in TMVA, including a new modular design, new algorithms for variable importance and cross-validation, interfaces to other machine-learning software packages, and the integration of TMVA with Jupyter, making it accessible from a web browser.


Introduction
ROOT is an object-oriented data analysis framework that provides statistical methods, visualization and storage libraries for data analysis of high-energy physics (HEP) experiments such as the Large Hadron Collider (LHC) in Geneva, Switzerland [1]. Although originally designed for HEP applications, ROOT is also widely used in other scientific fields outside of particle physics.
Machine learning algorithms have become integral to modern HEP analyses. The ROOT framework provides TMVA, the Toolkit for Multivariate Analysis [2], which contains machine-learning algorithms widely used in HEP. Among the most popular methods in HEP are boosted decision trees, neural networks and support vector machines (SVM). TMVA provides implementations of these popular methods for machine-learning classification and regression, and offers a way to compare their performance on the same dataset, allowing an apples-to-apples comparison that is useful when choosing the optimal algorithm for a particular analysis task.

New Algorithms and Features
New algorithms that provide useful information to the user have been added, such as cross-validation and variable importance. In addition, TMVA was integrated with Jupyter, allowing execution in a web browser. In the following sections we describe some of the new functionality and features of TMVA.

Data Loader
One new design development in TMVA is the DataLoader class. This class brings greater flexibility and modularity to training different combinations of classifiers and variables. Previously, the choice of variables was defined once and could not be changed later. The new DataLoader class allows the user to combine different choices of variables and methods with the data. This design flexibility makes it possible to implement other useful algorithms, such as the cross-validation and variable importance described in the following sections.

Cross Validation
Cross-validation is a machine-learning model evaluation technique, important for model generalization to unseen data. During k-fold cross-validation, the dataset is partitioned into k folds, or partitions. In one cross-validation round, k-1 folds are used for model training and the remaining fold for model testing. Several rounds of cross-validation are performed with different partitions, and the model performance results are averaged. One advantage of cross-validation over a simple split into training and testing sets is the use of the full dataset to validate the model. Performing cross-validation is known to reduce over-fitting of the data, leading to a more accurate estimate of the performance of the machine-learning model on unseen data [3].
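The k-fold procedure described above can be sketched in plain Python. This is a minimal illustration, not TMVA's implementation; the `train` and `evaluate` callables are hypothetical stand-ins for fitting a model and scoring it on held-out data.

```python
import random

def k_fold_cross_validation(dataset, k, train, evaluate, seed=0):
    """Estimate model performance by k-fold cross-validation.

    `train(samples)` fits a model; `evaluate(model, samples)` returns a
    performance score. Both are supplied by the user.
    """
    data = list(dataset)
    random.Random(seed).shuffle(data)          # randomize before partitioning
    folds = [data[i::k] for i in range(k)]     # k roughly equal partitions
    scores = []
    for i in range(k):
        test_fold = folds[i]                   # one fold held out for testing
        train_folds = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_folds)             # train on the remaining k-1 folds
        scores.append(evaluate(model, test_fold))
    return sum(scores) / k                     # average over the k rounds

# Toy usage: the "model" is just the mean of the training labels, and the
# score is the mean squared error on the held-out fold.
data = [(x, 2.0 * x) for x in range(20)]
mean_label = lambda samples: sum(y for _, y in samples) / len(samples)
mse = lambda m, samples: sum((y - m) ** 2 for _, y in samples) / len(samples)
avg_mse = k_fold_cross_validation(data, 5, mean_label, mse)
```

Because every sample lands in exactly one test fold, the averaged score uses the full dataset for validation, as described above.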
Cross-validation in TMVA is done with a standalone CrossValidation class. Figure 5 illustrates five receiver-operating characteristic (ROC) curves, one for each cross-validation fold, for a basic TMVA example. This example has four random variables with Gaussian distributions, plus several derived variables obtained by applying basic mathematical operations to these variables.

Variable Importance
Currently, TMVA provides a number of method-specific variable importance algorithms. Each one is relevant only for the chosen method and is computed during model construction. For example, for decision trees, variable importance is derived by counting the number of splits for each variable, weighted by the square of the information gained from each split; for neural networks, it is the sum of the weights between the inputs and the hidden layer [2].
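The decision-tree rule above can be illustrated with a small sketch. The split records here are invented for illustration; in TMVA they would be accumulated while growing the tree.

```python
from collections import defaultdict

def tree_variable_importance(splits):
    """Rank variables by summing the squared information gain over all
    splits in which each variable was used (the decision-tree rule above)."""
    importance = defaultdict(float)
    for variable, gain in splits:
        importance[variable] += gain ** 2   # count splits, weighted by gain^2
    total = sum(importance.values())
    # normalize so the importances sum to 1
    return {v: s / total for v, s in importance.items()}

# Hypothetical splits recorded while growing a tree: (variable, gain) pairs.
splits = [("var1", 0.9), ("var2", 0.4), ("var1", 0.3), ("var3", 0.2)]
ranking = tree_variable_importance(splits)
```

Variables used in many high-gain splits (here `var1`) dominate the ranking, which is the intended behavior of the weighting.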
In addition to these, a new method-independent variable importance algorithm was added. This algorithm, described in [4], computes variable importance in the context of classifier performance. A number of seeds are randomly generated, each corresponding to a variable subspace. For each seed, the contribution of an individual variable is measured as the loss of classifier performance when that variable is removed. Figure 4 shows a sample variable importance plot for a basic example in TMVA.
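A minimal sketch of this seed-based scheme is shown below, under the simplifying assumption that classifier performance for a variable subset can be queried through a single `performance` callable (in practice this would be a full train-and-evaluate cycle, e.g. returning the ROC integral). The function names and the toy performance model are illustrative, not part of TMVA's API.

```python
import random

def variable_importance(variables, performance, n_seeds=50, seed=0):
    """Method-independent variable importance: draw random variable
    subsets ("seeds") and credit each variable with the performance lost
    when it is removed from a subset containing it."""
    rng = random.Random(seed)
    importance = {v: 0.0 for v in variables}
    counts = {v: 0 for v in variables}
    for _ in range(n_seeds):
        # each seed selects a random variable subspace
        subset = frozenset(v for v in variables if rng.random() < 0.5)
        if not subset:
            continue
        full = performance(subset)
        for v in subset:
            loss = full - performance(subset - {v})  # drop one variable
            importance[v] += loss
            counts[v] += 1
    # average the performance loss over the seeds that contained each variable
    return {v: importance[v] / counts[v] if counts[v] else 0.0
            for v in variables}

# Toy performance model: each variable contributes a fixed, additive amount,
# so the ranking should recover those contributions exactly.
weights = {"var1": 0.30, "var2": 0.15, "var3": 0.05}
perf = lambda subset: sum(weights[v] for v in subset)
ranking = variable_importance(list(weights), perf)
```

With a real classifier the performance function is not additive, and the averaging over many random subspaces is what makes the estimate robust to correlations between variables.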

TMVA and Jupyter Notebooks
Another new feature in TMVA is its integration with Jupyter Notebooks [5]. Jupyter is a web application that combines live code, rich text, links and formulas in a user-friendly format.
With the creation of a ROOT Jupyter kernel, additional integration of machine-learning tools in ROOT with Jupyter notebooks became possible. Currently, all the functionality of TMVA is available in Jupyter notebooks, requiring only a web browser to run TMVA.

Interfaces to external machine learning tools
Another useful functionality added to TMVA is the interface to external machine-learning tools written in the R and Python languages. Figure 6 shows the relationship between ROOT, TMVA and other statistical packages.

Figure 6. Interplay of Machine Learning Tools in ROOT.

ROOT-R Interface
R is a free software framework for statistical computing [6]. The ROOT-R interface was developed to call R functions directly from ROOT. It opens up a large set of statistical tools available in R, including machine-learning packages, for use within ROOT. The ROOT-R interface design is shown in Figure 7.

RMVA Interface
RMVA is a set of TMVA plugins based on the ROOT-R interface. It allows the use of machine-learning methods available in R directly from TMVA. The goal behind RMVA is not to replace the R package itself, but to allow its direct comparison with existing tools in TMVA for a given problem, while adding more methods to the TMVA users' toolbox.
The RMethodBase class in TMVA starts the R environment using ROOT-R, imports the required modules and maps the DataLoader events into R data-frame objects using the helper class ROOT::R::TRDataFrame. Each of the methods inherits from the base class RMethodBase, as shown in Figure 8. Currently, the following machine-learning packages in R are supported:
• Decision trees and rule-based models (C50) [7]
As shown in Figure 10, the above machine-learning methods in R can be executed within the TMVA framework for the basic example.

Python with TMVA (PyMVA)
Similarly to RMVA, PyMVA is a set of TMVA plugins based on the Python API that allows direct use of machine-learning methods written in Python from within TMVA. The goal, as with RMVA, is not to replace the original method, but to allow its comparison with the other methods in TMVA using the same dataset, selections and performance metrics. The PyMethodBase class in PyMVA initializes the Python environment, imports the required modules and maps the DataLoader events into NumPy arrays using the C API. Each PyMVA method inherits from the base class PyMethodBase, as illustrated in Figure 11. Figure 12 shows how the dataset is mapped from ROOT trees to NumPy arrays. The following Python-based methods from the Scikit-learn software package [11] are currently available in TMVA:
• Random Forest (PyRandomForest)
• Gradient Boosted Regression Trees (PyGTB)
• Adaptive Boosting (PyAdaBoost)
Figure 13 shows the ROC curves of various PyMVA methods for a basic example.

Conclusions
Machine learning tools in ROOT have undergone a significant makeover and upgrade. In particular, TMVA has a new design targeting greater flexibility and modularity, as well as new features such as cross-validation, variable importance and interfaces to R- and Python-based machine-learning tools. In addition, TMVA is available in Jupyter notebooks, making it accessible in a web browser.