Density Estimation Trees as fast non-parametric modelling tools

Density Estimation Trees (DETs) are decision trees trained on a multivariate dataset to estimate its probability density function. While not competitive with kernel techniques in terms of accuracy, they are incredibly fast, embarrassingly parallel and relatively small when stored to disk. These properties make DETs appealing in the resource-expensive horizon of the LHC data analysis. Possible applications may include selection optimization, fast simulation and fast detector calibration. In this contribution I describe the algorithm, made available to the HEP community in a RooFit implementation. A set of applications under discussion within the LHCb Collaboration are also briefly illustrated.


Introduction
The fast increase of computing resources needed to analyse the data collected in modern hadroncollider experiments, and the lower cost of processing units with respect to storage, pushes High-Energy Physics (HEP) experiments to explore new techniques and technologies to move as much as possible of the data analysis at the time of the data acquisition (online) in order to select candidates to be stored on disk, with maximal, reasonably achievable, background rejection. Besides, research on multivariate algorithms, active both within and outside of the HEP community, is approaching the challenge of operating in distributed computing environments, which represents a further motivation for studies of new classes of algorithms.
Statistical inference of probability density functions underlying experimental datasets is common in High Energy Physics. Fitting is an example of parametric density estimation. When possible, defining a parametric form of the underlying distribution and choose the values for the parameters maximizing the likelihood is usually the best approach. In multivariate problems with a large number of variables and important correlation, however, fitting may become unpractical, and non-parametric density estimation becomes a valid, largely employed, solution.
In HEP, the most common non-parametric density estimation, beyond the histogram, is probably kernel density estimation [1], based on the sum of normalized kernel functions centered on each data-entry.
In this write-up, I discuss Density Estimation Trees, algorithms based on a multivariate, binary tree structure, oriented to non-parametric density estimation. Density Estimation Trees are less accurate than kernel density estimation, but much faster. Integrating Density Estimation Trees is also trivial and fast, making iterative search algorithms convenient. Finally, storing a Density Estimation Trees and propagate it through the computing nodes of a distributed system is relatively cheap, offering a reasonable solutions for compressing the statistical information of large datasets.
An implementation of the algorithm in ROOT/RooFit is available through CERN GitLab 1 .

The algorithm
The idea of iteratively splitting a data sample, making the density estimation to coincide with the average density in each portion of the data space is not new. However, the technique had little room for applications in analyses of datasets up to a few thousands of entries described by small sets of correlated variables.
Recently, kd-trees [2] have been used to split large samples into sub-sets consisting of equal fractions of the data entries. The idea underlying kd-trees is the iterative splitting of the datasample using as threshold the median of a given projection. While powerful to solve a vast range of problems, including notably nearest-neighbour searches, the lack of appropriateness of kd-trees to estimate probability densities is evident considering samples including multiple data-entries.
Ideally, the density estimationf (x) approximating the underlying density function f (x) should minimize the quantity R = (f (x) − f (x)) 2 dx. It can be shown [3] that, exploiting the Monte Carlo approximation the minimization of R is equivalent to growing a Density Estimation Tree iteratively splitting the node , into the two sub nodes R and L minimizing the Gini index, Here, R( ) represents the replacement error, defined by where V is the hyper-volume of the portion of the data-space associated to the node , and N the number of data entries it includes; N tot is the number of data entries in the whole dataset. Figure 1 shows an example of how the training is performed.

Overtraining
As in the case of Classification algorithms, overtraining is the misinterpretation of statistical fluctuations of the dataset as relevant features to be reproduced by the model. An example of overtraining of Density Estimation Trees is presented in Figure 2. In the presented dataset, the alignment of data-entries in one of the input variables is interpreted as narrow spikes. To compensate spikes, in terms of absolute normalization, the density is underestimated in all of the other points of the parameter space.
Overtraining in decision trees is controlled through an iterative approach consisting in pruning and cross-validation: finding and removing the branches increasing the complexity of the tree without enhancing the statistical agreement with a set of test samples. Cross-validation is very expensive in terms of computing power and often fails to identify problems of over-training arising close to the root of the decision tree.
Overtraining in density estimation is fought, instead, by defining a priori an expected resolution width, neglecting fluctuations under that resolution while building the statistical Data sample First step

Second step
Last step and evaluation x y Schematic representation of the training and the evaluation procedures of a Density Estimation Tree. Top, an example of an overtrained Density Estimation Tree. The random alignment of data-entries with respect to one of the variables describing the problem is misinterpreted as a spike in the density estimation, as evident in the projection onto the vertical axis shown on the bottom. model. For example, kernel density estimation algorithms require a parameter, named bandwidth as an input. The bandwidth is related to the width of the kernel function. Abundant literature exists on techniques to optimise the bandwidth for a certain dataset, most of them represent a preliminary step of the density estimation algorithm. Growing a Density Estimation Tree with a minimal leaf width is fast, doesn't require post-processing and it is found to result in betterquality estimations with respect to cross-validation procedures. The same techniques used to compute the optimal bandwidth parameter for kernel density estimation algorithms can be used to define the optimal minimal leaf width of the Density Estimation Tree.

Integration
As mentioned in the introduction, fast integration of the statistical model built using Density Estimation Trees is one of the strengths of the algorithm. Integration usually responds to two different needs: normalization and slicing (or projecting, or marginalizing).
Integrals to compute the overall normalization of the density estimation, or the contribution in a large fraction of the data-space, gain little from exploiting the tree structure of the density estimator. A sum over the contributions of each leaf represents the best strategy.
Instead, integrals over a narrow subset of the data-space should profit of the tree structure of the density estimation to exclude from the integration domain as many leaves as possible, as early as possible. Exploiting the tree structure when performing integrals of slices can drastically reduce the computing time in large density estimation trees.

Operations with Density Estimation Trees
Combining weighted Density Estimation Trees can be useful to model data samples composed of two or more components. Combination is achieved implementing both scalar and binary operations. There is not much to discuss about scalar operations, where the scalar operation is applied to each leaf independently. Instead, binary operations require the combination of two different Density Estimation Trees, which is not trivial because the boundaries are a priori different. The algorithm to combine two Density Estimation Trees consists of the iterative splitting of the terminal nodes of the first tree, following the boundaries of the terminal nodes of the second one. Once the combination is done, the first tree is compatible with the second one and the binary operation can be performed node per node. The resulting tree may have several additional layers with respect to the originating trees, therefore a final step removing division between negligibly different nodes is advisable. PID calibration samples are sets of decay candidates reconstructed and selected relying on kinematic variables only, to distinguish between different types of long-lived particles: electrons, muons, pions, kaons, and protons.
The PID strategy of the LHCb detector relies on the combined response of several detectors: two ring Cherenkov detectors, an electromagnetic calorimeter, a hadronic calorimeter and a muon system [4]. The response of the single detectors are combined into likelihoods used at analysis level to define the tightness of the PID requirements.
Calibration samples count millions of background-subtracted candidates, each candidate is defined by a set of kinematic variables, for example momentum and pseudorapidity, and a set of PID likelihoods, one per particle type. The correlation between all variables is important and not always linear.

Efficiency tables
The first application considered is the construction of tables defining the probability that the PID likelihood of a candidate, defined by a set of kinematic variables, satisfies a particular requirement.
Building two Density Estimation Trees with the kinematic variables defining the data-space, one with the full data sample (tree t all ), and one with the portion of data sample passing the PID criteria (tree t pass ), allows to compute the efficiency for each combination of the kinematic variables by evaluating the Density Estimation Tree obtained taking the ratio t pass /t all .
For frequently-changing criteria a dynamic determination of the efficiency can be envisaged. For simplicity, consider the generic univariate PID criterion y > 0. In this case a single Density Estimation Tree d(x 1 , x 2 , y) defined by the kinematic variables (x 1 , x 2 ), and one PID variable y, has to be trained on the calibration sample. The dynamic representation of the efficiency for a candidate having kinematic variables (x 1 ,x 2 ) is the ratio Thanks to the fast slice-integration algorithm, the computation of this ratio can be included in an iterative optimization procedure aiming at an optimization of the threshold on y.

Sampling as fast simulation technique
Another important application is related to fast simulation of HEP events. Full simulation, including interaction of the particles with matter, is becoming so expensive to be expected exceeding the experiments' budgets in the next few years. Parametric simulation is seen as a viable solution, as proved by the great interest raised by the DELPHES project [5]. However, parametrizing a simulation presents the same pitfalls as parametrizing a density estimation: when correlation among different variables becomes relevant, the mathematical form of the parametrization increases in complexity up to the point it becomes unmanageable.
Density Estimation Trees are an interesting candidate for non-parametric fast simulation. Let d(x 1 , ..., x N , y 1 , ..., y n ) be a Density Estimation Tree trained on a set of candidates defined by generator variables x ≡ (x 1 , ..., x N ), and by variables y ≡ (y 1 , ..., y n ) obtained through full simulation. For example, x could represent the kinematic variables of a track and y the PID likelihoods.
The aim of fast non-parametric simulation is, given a new set of values (x 1 , ...,x N ) for x, to compute a set of values for y distributed according to the conditional probability density function d(y 1 , ..., y n |x 1 , ...,x N ).
Once the DET is trained, the tree structure of the density estimator is used to compute for each leaf the hyper-volume V ∩x of the intersection between and hyper-plane defined by x = (x 1 , ...,x N ).
A random leaf L ∈ { } is then chosen with probability proportional to V L∩x , and variables y are generated following a flat distribution bounded within L.
A set of generator variables x can then be completed by the corresponding y variables without full simulation, but relying on the joint multivariate distribution learnt by the Density Estimation Tree.

Summary and outlook
I discussed Density Estimation Tree algorithms as fast modelling tools for high statistics problems characterized by a large number of correlated variables and for which an approximated model is acceptable. The fast training and integration capabilities make these algorithms of interest for the high-demanding future of the High-Energy Physics experiments. The examples discussed, which benefited from an active discussion within the Particle Identification Group of the LHCb collaboration, explore cases where the statistical features of huge samples have to be assessed in a time shorter than what standard estimators would require. In future, Density Estimation Trees could be sampled to train Regression Multivariate Algorithms, such as Neural Networks, in order to smooth the response and further speed up the query time, at the cost of loosing its fast-integration properties.