New approaches for boosting to uniformity

The use of multivariate classifiers has become commonplace in particle physics. To enhance the performance, a series of classifiers is typically trained; this is a technique known as boosting. This paper explores several novel boosting methods that have been designed to produce a uniform selection efficiency in a chosen multivariate space. Such algorithms have a wide range of applications in particle physics, from producing uniform signal selection efficiency across a Dalitz-plot to avoiding the creation of false signal peaks in an invariant mass distribution when searching for new particles.


Introduction
Methods of machine learning play an important role in modern particle physics. Multivariate classifiers, e.g., boosted decision trees (BDTs) and artificial neural networks (ANNs), are now commonly used in analysis selection criteria. BDTs are now even used in software triggers [1,2]. To enhance the performance, a series of classifiers is typically trained; this is a technique known as boosting. Boosting involves training many simple classifiers and then building a single composite classifier from their responses. The classifiers are trained in series with the inputs of each member being augmented based on the performance of its predecessors. This augmentation is designed such that each new classifier targets those events which were poorly classified by previous members of the series. The classifier obtained by combining all members of the series is typically much more powerful than any of the individual members.
In particle physics, the most common usage of BDTs is in classifying candidates as signal or background. The BDT is determined by optimizing some figure of merit (FOM), e.g., the signal purity or approximate signal significance. This approach is optimal for a counting experiment; however, in many cases the BDT-based selection obtained in this way is not optimal. For example, in a Dalitz-plot analysis (or any angular or amplitude analysis), obtaining a selection efficiency for signal candidates that is uniform across the Dalitz-plot is more important than any integrated FOM. Similarly, when measuring a mean particle lifetime, obtaining an efficiency that is uniform in lifetime is what is desired. In both cases, obtaining a uniform selection efficiency greatly reduces the systematic uncertainties involved in the measurement. When searching for a new particle, an analyst may want a uniform efficiency in mass for selecting background candidates so that the BDT-based selection does not generate a fake signal peak. Furthermore, the analyst may also desire a uniform selection efficiency of signal candidates in mass (or other variates) since the mass of the new particle is not known. In such cases, the BDT is often trained on simulated data generated with several values of mass (lifetime, etc.). A uniform selection efficiency in mass ensures that the BDT is sensitive to the full range of masses involved in the search.

Uniformity Boosting Methods
The variates used in the BDT are denoted by x, while the variates in which uniformity is desired are denoted by y. Some (perhaps all) of the x variates will be biasing in y, i.e., they provide discriminating power between signal and background that varies in y. A uniform BDT selection efficiency can be obtained by removing all such variates; however, this will also reduce the power of the BDT. The goal of the boosting algorithms presented in this paper is to balance the biases to produce the optimal uniform selection.
One category of boosting works by assigning training events more weight based on classification errors made by previous members of the series. For example, the AdaBoost [3] algorithm updates the weight of event i, $w_i$, according to
\[ w_i \to w_i \exp(-\gamma_i p_i), \tag{2.1} \]
where $\gamma_i = +1\,(-1)$ for signal (background) events and $p_i$ is the prediction for each event produced by the last classifier in the series. The uBoost technique, described in detail in Ref. [4], alters the event-weight updating procedure to achieve uniformity in the signal-selection efficiency.
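As an illustration of the weight update described above, the following minimal sketch applies w_i -> w_i exp(-gamma_i p_i) to a toy sample. The function name, the toy inputs and the assumption that predictions are real numbers whose sign indicates the predicted class are ours and are not taken from the paper or its code.

```python
import math

def adaboost_update(weights, gammas, predictions):
    """Apply w_i -> w_i * exp(-gamma_i * p_i) to every event."""
    return [w * math.exp(-g * p) for w, g, p in zip(weights, gammas, predictions)]

# Toy inputs (hypothetical): two signal events (+1) and two background events (-1).
weights = [1.0, 1.0, 1.0, 1.0]
gammas = [+1, +1, -1, -1]
predictions = [0.5, -0.5, -0.5, 0.5]  # the second and fourth events are misclassified

new_weights = adaboost_update(weights, gammas, predictions)
```

Misclassified events (where gamma_i and p_i have opposite signs) gain weight, so the next classifier in the series concentrates on them.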
Another approach to obtain uniformity, introduced in this paper, involves defining a more general expression of the AdaBoost criteria:
\[ w_i \to w_i \exp\Big(-\gamma_i \sum_j a_{ij} p_j\Big), \tag{2.2} \]
where $a_{ij}$ are the elements of some square matrix $A$. For the case where $A$ is the identity matrix, the AdaBoost weighting procedure is recovered. Other choices of $A$ will induce non-local effects, e.g., consider the sparse matrix $A_{\rm knn}$ given by
\[ a^{\rm knn}_{ij} = \begin{cases} \frac{1}{k}, & j \in {\rm knn}(i),\ \text{events } i \text{ and } j \text{ belong to the same class} \\ 0, & \text{otherwise}, \end{cases} \tag{2.3} \]
where ${\rm knn}(i)$ denotes the set of k-nearest-neighbor events to event i. This procedure for updating the event weights, which we refer to as kNNAdaBoost, accounts for the score of each event's k nearest neighbors and not just each event individually.

The gradient boosting [5] (GB) algorithm category requires the analyst to choose a differentiable loss function with the goal of building a classifier that minimizes the loss. A popular choice of loss function is the so-called AdaLoss function
\[ L_{\rm ada} = \sum_i \exp(-\gamma_i s_i). \tag{2.4} \]
The scores $s$ are obtained for each event as the sum of predictions of all elements in the series. At each stage in the gradient boosting process, a regressor (a decision tree in our case) is trained whose purpose is to decrease the loss. This is accomplished using the gradient descent method and the pseudo-residuals
\[ -\frac{\partial L_{\rm ada}}{\partial s_i} = \gamma_i \exp(-\gamma_i s_i), \tag{2.5} \]
which are positive (negative) for signal (background) events and have larger moduli for poorly classified events. The gradient-boosting algorithm is general in that it only requires the analyst to specify a loss function and its gradient. The AdaLoss function considers each event individually, but can easily be modified to take into account non-local properties of the classifier as follows:
\[ L_{\rm general} = \sum_i \exp\Big(-\gamma_i \sum_j a_{ij} s_j\Big). \tag{2.6} \]
For example, the loss function obtained from Eq. 2.6 using $A_{\rm knn}$, which we refer to as kNNAdaLoss and denote $L_{\rm knn}$, accounts for the score of each event's k nearest neighbors and not just each event individually. The pseudo-residuals of $L_{\rm knn}$ are
\[ -\frac{\partial L_{\rm knn}}{\partial s_k} = \sum_i \gamma_i a^{\rm knn}_{ik} \exp\Big(-\gamma_i \sum_j a^{\rm knn}_{ij} s_j\Big). \tag{2.7} \]
One can see that the direction of the gradient will be influenced the most by events whose k-nearest-neighbor events are classified poorly. We generically refer to GB methods designed to achieve uniform selection efficiency as uniform GB (uGB). The specific algorithm that uses kNNAdaLoss will be called uGBkNN.
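The kNN-based loss can be sketched in a few lines. The code below builds a sparse matrix with entries 1/k for the k nearest same-class events, evaluates the loss sum_i exp(-gamma_i sum_j a_ij s_j), and computes its pseudo-residuals. It is a minimal sketch and not the authors' implementation: the function names, the one-dimensional y variate, the unweighted events, and the choice to search for neighbours only within the same class (with each event counted as its own neighbour) are all assumptions made for illustration.

```python
import math

def knn_matrix(y, gammas, k):
    """Sparse A_knn as a list of rows {j: 1/k}; neighbours are searched within the same class."""
    rows = []
    for i in range(len(y)):
        same_class = [j for j in range(len(y)) if gammas[j] == gammas[i]]
        # Sort by distance in y; the second key keeps event i itself first on ties.
        nearest = sorted(same_class, key=lambda j: (abs(y[j] - y[i]), j != i))[:k]
        rows.append({j: 1.0 / k for j in nearest})
    return rows

def knn_ada_loss(gammas, scores, rows):
    """L = sum_i exp(-gamma_i * sum_j a_ij * s_j)."""
    return sum(math.exp(-g * sum(a * scores[j] for j, a in row.items()))
               for g, row in zip(gammas, rows))

def knn_ada_residuals(gammas, scores, rows):
    """-dL/ds_k = sum_i gamma_i * a_ik * exp(-gamma_i * sum_j a_ij * s_j)."""
    residuals = [0.0] * len(scores)
    for g, row in zip(gammas, rows):
        term = g * math.exp(-g * sum(a * scores[j] for j, a in row.items()))
        for k_idx, a in row.items():
            residuals[k_idx] += a * term
    return residuals

# Toy sample (hypothetical values): five signal and three background events.
y = [0.10, 0.20, 0.35, 0.80, 0.95, 0.15, 0.85, 0.50]
gammas = [+1, +1, +1, +1, +1, -1, -1, -1]
scores = [0.4, -0.2, 0.1, 0.7, -0.5, -0.3, 0.2, 0.0]

rows = knn_matrix(y, gammas, k=2)
residuals = knn_ada_residuals(gammas, scores, rows)
```

With k = 1 the matrix reduces to the identity and the residuals reduce to gamma_i exp(-gamma_i s_i), i.e., those of the plain AdaLoss.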
Another approach is to include some uniformity metric in the definition of the loss function. Consider first the case where the data have been binned in y. If the distribution of classifier responses in each bin, $f_b(s)$, is the same as the global response distribution, $f(s)$, then any cut made on the response will produce a uniform selection efficiency in y. Therefore, performing a one-dimensional goodness-of-fit test of the hypothesis $f_b \equiv f$ in each bin provides an assessment of the selection uniformity. For example, one could perform the Kolmogorov-Smirnov test in each bin and define a loss function as follows:
\[ L_{\rm flat(KS)} = \sum_b w_b \max_s |F_b(s) - F(s)|, \tag{2.8} \]
where $F_b(s)$ and $F(s)$ denote the cumulative distributions of $f_b(s)$ and $f(s)$, and $w_b$ is the fraction of signal events in the bin.
The gradient of the Kolmogorov-Smirnov loss function is zero for events with responses greater than the value of $s$ at which $\max|F_b(s) - F(s)|$ occurs. Thus, it is not suitable for gradient boosting due to its instability. Instead, we use the following flatness loss function:
\[ L_{\rm flat(FL)} = \sum_b w_b \int |F_b(s) - F(s)|^2 \, ds, \tag{2.9} \]
whose pseudo-residuals are (b is the bin containing the kth event)
\[ -\frac{\partial L_{\rm flat(FL)}}{\partial s_k} \approx 2 \frac{w_b w_k}{\sum_{i \in b} w_i} \left[ F_b(s_k) - F(s_k) \right]. \tag{2.10} \]
This so-called flatness loss penalizes non-uniformity but does not consider the quality of the classification. Therefore, the full loss function used is
\[ L_{\rm ada+flat} = L_{\rm flat(FL)} + \alpha L_{\rm ada}, \tag{2.11} \]
where α is a real-valued parameter that is typically chosen to be small. The first term in Eq. 2.11 penalizes non-uniformity, while the second term penalizes poor classification. We refer to this algorithm as uGB with flatness loss (uGBFL). In principle, many different flatness loss functions can be defined and could be substituted for our choice here. See Appendix A for a detailed discussion on this topic.

The loss function given in Eq. 2.11 can also be constructed without binning the data using k-nearest-neighbor events. The cumulative distribution $F_{\rm knn}(s)$ is easily obtained and the bin weight, $w_b$, is replaced by a k-nearest-neighbor weight, $w_{\rm knn}$. First, each event is weighted by the inverse of the number of times it is included in the k-nearest-neighbor sample of another event. Then, $w_{\rm knn}$ is the sum of such weights in a k-nearest-neighbor sample divided by the total sum of such weights in the full sample. This procedure is followed to offset the fact that some events are found in more k-nearest-neighbor samples than other events. We study two versions of uGBFL below: uGBFL using bins, denoted by uGBFL(bin), and uGBFL using kNN collections, denoted by uGBFL(kNN). The algorithms are summarized in Table 1.

Name          Description
kNNAdaBoost   AdaBoost modification using matrix A_knn
uGBkNN        gradient boost using kNNAdaLoss loss function
uGBFL(bin)    gradient boost using flatness loss + α AdaLoss as in Eq. 2.11 (data binned for FL)
uGBFL(kNN)    same as uGBFL(bin) except kNN events are used rather than bins
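To make the binned flatness loss concrete, the sketch below evaluates sum_b w_b * integral |F_b(s) - F(s)|^2 ds for a set of signal events, using weighted empirical cumulative distributions. Because empirical distributions are step functions, the integral is computed exactly as a sum over the intervals between consecutive scores. The function names and toy inputs are hypothetical, only signal events are assumed to be passed in, and this is not the implementation from the paper's repository.

```python
def weighted_cdf(points, scores, weights):
    """Weighted empirical CDF of `scores`, evaluated at each value in `points`."""
    total = sum(weights)
    return [sum(w for s, w in zip(scores, weights) if s <= t) / total for t in points]

def flatness_loss(scores, weights, bin_index):
    """sum_b w_b * integral |F_b(s) - F(s)|^2 ds for signal events binned in y."""
    grid = sorted(set(scores))  # both CDFs are constant between consecutive grid points
    F = weighted_cdf(grid, scores, weights)
    total = sum(weights)
    loss = 0.0
    for b in sorted(set(bin_index)):
        in_bin = [(s, w) for s, w, i in zip(scores, weights, bin_index) if i == b]
        s_b = [s for s, _ in in_bin]
        w_b = [w for _, w in in_bin]
        F_b = weighted_cdf(grid, s_b, w_b)
        integral = sum((grid[m + 1] - grid[m]) * (F_b[m] - F[m]) ** 2
                       for m in range(len(grid) - 1))
        loss += sum(w_b) / total * integral  # bin weight = fraction of signal weight in bin
    return loss

# Same response distribution in both bins: the loss vanishes.
uniform = flatness_loss([0.1, 0.5, 0.9, 0.1, 0.5, 0.9], [1.0] * 6, [0, 0, 0, 1, 1, 1])
# Low scores in bin 0 and high scores in bin 1: the loss is positive.
biased = flatness_loss([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], [1.0] * 6, [0, 0, 0, 1, 1, 1])
```

In a full uGBFL-style loss this term would be added to alpha times the AdaLoss, so that alpha controls the trade-off between uniformity and classification quality.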

Example Analysis
The example analysis studied here involves a so-called Dalitz-plot analysis. In such analyses, the distribution of events in a 2-D space is typically fit to extract some information of physical interest. The regions of the Dalitz-plot that tend to have the highest sensitivity to the desired information are the edges. Unfortunately, the edge regions also typically have the most background contamination and the least discrimination against background. Therefore, traditional classifier-based selections tend to produce selections for Dalitz-plot analyses with lower efficiency near the edges.
This study uses simulated event samples produced using the official LHCb simulation framework. The software used for the generation of the events is described in LHCb publications as follows: In the simulation, pp collisions are generated using PYTHIA [6] with a specific LHCb configuration [7]. Decays of hadronic particles are described by EvtGen [8], in which final state radiation is generated using PHOTOS [9]. The interaction of the generated particles with the detector and its response are implemented using the GEANT toolkit [10,11] as described in Ref. [12].
All simulated event samples are generated inside the LHCb detector acceptance. The signal used in this analysis consists of $D_s^\pm \to \pi^+\pi^-\pi^\pm$ decays. Figure 1 shows the Dalitz-plot distributions for signal and background events. These samples are split into training and testing samples and then various BDTs are trained. For the BDTs designed to produce uniform selections, the y variates are the Dalitz masses with the choice of uniform selection efficiency on signal candidates in the Dalitz-plot. Figure 2 shows the ROC curves obtained for the various classifiers studied in this paper. For the uGBFL algorithms, there is a choice to be made for the value of α, which defines the relative weight of the flatness loss vs AdaLoss. As expected, increasing α, which increases the weight of AdaLoss, drives the ROC curve to be similar to that of AdaBoost. Analysts will need to choose how much ROC performance to sacrifice to gain uniformity in the selection efficiency. In general, the ROC curves for the uniformity-driven BDTs are not too different from that of AdaBoost. Figure 3 shows how the uniformity of the selection efficiency depends on α. As expected, as α is decreased the selection becomes more uniform.
Figure 4 shows the efficiency obtained for each classifier vs distance from a corner of the Dalitz-plot [...]esting corner regions. The kNNAdaBoost algorithm does not improve upon the AdaBoost result much. This is likely due to the fact that while kNNAdaBoost uses non-local kNN information, it does not utilize global information. The uGBkNN algorithm overcompensates and drives the efficiency higher at the corners. This suggests that if this algorithm is to be used, some stopping criterion or throttling of the event-weight updating should be implemented. The uGBFL (binned and unbinned kNN) and uBoost algorithms each produce an efficiency which is statistically consistent with uniform across the Dalitz-plot. As stated above, the analyst is free to optimize the choice of α for uGBFL by defining a metric that involves signal efficiency, background rejection and uniformity, e.g., using the uniformity metrics discussed in detail in Appendix A.
As a separate study using the same data samples, consider the case where one has simulated signal events and uses data from a nearby region, a so-called sideband, for background. This is a common situation in particle-physics analyses. Figure 5 shows the training samples used. A major problem can arise in these situations as typically input variates to the BDT are correlated with the parent particle mass. Therefore, the BDT may learn to reject the background in the training using the fact that the mass of the background and signal candidates is different. This is just an artifact of how the background sample is obtained and will not be true for background candidates under the signal peak. Figure 5 shows the background mis-identification rate vs D candidate mass. AdaBoost has clearly learned to use this mismatch in signal and background candidate masses in the training. The background in the region of the signal is about three times higher than one would expect from looking only at the sideband data. Figure 5 also shows the background mis-identification rate vs D candidate mass for the various uniform classifiers where y = m(D) and the choice is for uniformity in the background efficiency. The uBoost algorithm does better than AdaBoost here but is still not optimal. The way that uBoost achieves uniformity is not such that it can be trusted to work outside the region of training. The algorithms presented in this paper each do well in achieving similar performance in the training and signal regions. Consider, e.g., the uGBFL approach to achieving uniform selection efficiency. In this case the training drives the BDT response itself to have the same PDF everywhere in the region 1.75 < m(D) < 1.85 GeV (the training region). This does not guarantee that the BDT response is truly independent of m(D) outside the training region, but does strongly suppress learning to use m(D) and in this example results in the desired behavior. Finally, if both high and low m(D) sidebands had been used, it is possible for a BDT to create a fake peak near the signal peak location. The use of uGBFL greatly reduces the chances and possible size of such an effect.

CPU Resources
One drawback of the uBoost technique is that it has a high degree of computational complexity: while AdaBoost trains M trees (a user-defined number), uBoost builds 100 × M trees. The algorithms presented in this paper only build M trees; however, the boosting involves some more complicated algorithms. Training each of the M trees scales as follows for N training events:
• uGBkNN: O(k × N) for A_knn, and O(#nonzero elements in the matrix) for an arbitrary matrix A;
• uGBFL(bin): O(N ln N);
• uGBFL(kNN): O(N ln N + N k ln k).
In the example analysis studied in this paper, we find that the training time for these new algorithms is within a factor of two of that of AdaBoost. The CPU-resource usage of these new algorithms is therefore not prohibitive.

Summary
A number of novel boosting algorithms have been presented that consider uniformity of selection efficiency in a multivariate space in addition to mis-classification errors. Of these, the uGBFL algorithm has the best performance on the example analyses studied in this paper. This algorithm is expected to be useful in a wide variety of analyses performed in particle physics.

Source code
The code for classifiers proposed in this article as well as for metrics of uniformity is publicly available at repository https://github.com/anaderi/lhcb_trigger_ml.

A. Measures of uniformity
In this section we discuss different methods for measuring the uniformity of prediction. One typical way of 'checking' uniformity of prediction used by physicists is fitting the distribution of the events that were classified as signal (or background) over the feature for which you wish to check uniformity. This approach requires assumptions about the shape of the distribution, which makes quantitative comparisons of different classifiers difficult. Our aim here is to explore uniformity figures of merit which make comparing classifiers easier, analogously to how the area under the ROC curve can be used to compare absolute classifier performance.

The output of event classification is the probability of each event being signal or background, and it is only after we apply a cut on this probability that events are classified. An ideal uniformity of signal prediction can then be defined for a given "uniform feature" of interest. It means that whichever cut we select, the efficiency for a signal event to pass the cut doesn't depend on the uniform feature. Uniformity for background can be defined in the same manner, but for simplicity, in what follows we will only discuss the uniformity of efficiency for signal events.
A trivial example of a classifier that has ideal uniformity is a classifier which returns a random classification probability, but such a classifier is of course not very useful. One can try to design a uniform classifier with respect to a given feature by not using this feature, or any correlated features, in the classification; in practice, however, this approach also tends to lead to poorly performing classifiers. The approach which we take in this paper is to explicitly let the classifier learn how to balance non-uniformities coming from different features in such a way as to generate a classification which is uniform on average. It is then important to be able to accurately measure the uniformity of classification.
Before proceeding, it is useful to define some desirable properties of uniformity metrics:
1. The metric shouldn't depend strongly on the number of events used to test uniformity.
2. The metric shouldn't depend on the normalization of the event weights: if we multiply all the weights by some arbitrary number, it shouldn't change at all.
3. The metric should depend only on the order of predictions, not the exact values of probabilities. This is because we care about which events pass the cut and which don't, not about the exact values of predictions. For example, the correlation of prediction and mass doesn't satisfy this restriction.
4. The metric should be stable against any of its own free parameters: if it uses bins, changing the number of bins shouldn't affect the result; if it uses k-nearest neighbors, it should be stable against different values of k.
In what follows we will consider different metrics which satisfy these criteria, and then compare their performance in some test cases.

Standard Deviation of Efficiency on Bins (SDE)
If the space of uniform features is split into bins, it is possible to define the global efficiency
\[ {\rm eff} = \frac{\text{total weight of signal events that passed the cut}}{\text{total weight of signal events}}, \]
as well as the efficiency in every bin,
\[ {\rm eff}_{\rm bin} = \frac{\text{weight of signal events in bin that passed the cut}}{\text{weight of signal events in this bin}}. \]
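One measure of non-uniformity is the standard deviation of bin efficiencies from the global efficiency:
\[ \sqrt{\sum_{\rm bin} \left( {\rm eff}_{\rm bin} - {\rm eff} \right)^2}. \]
To make the metric more stable against fluctuations in bins which contain very few events, we add weights to the bins (note that $\sum_{\rm bin} {\rm weight}_{\rm bin} = 1$):
\[ {\rm weight}_{\rm bin} = \frac{\text{total weight of signal events in bin}}{\text{total weight of signal events}}, \]
giving the weighted standard deviation (SDE) formula
\[ {\rm SDE}({\rm eff}) = \sqrt{\sum_{\rm bin} {\rm weight}_{\rm bin} \left( {\rm eff}_{\rm bin} - {\rm eff} \right)^2}. \]
This formula is valid for any given cut value. To measure the overall non-flatness of the selection, we take several global efficiencies and use
\[ {\rm SDE}^2 = \frac{1}{k} \sum_{{\rm eff} \in [{\rm eff}_1, \ldots, {\rm eff}_k]} {\rm SDE}^2({\rm eff}). \]
Another power $p \neq 2$ can be used as well, but $p = 2$ is considered as the default value.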
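The SDE at a single cut value can be sketched as follows: compute the global efficiency, the per-bin efficiencies and the bin weights, then take the weighted root-mean-square deviation. The function name and the toy inputs are hypothetical, and only signal events are assumed to be passed in.

```python
import math

def sde(passed, weights, bin_index):
    """Weighted standard deviation of bin efficiencies around the global efficiency."""
    total = sum(weights)
    eff = sum(w for w, p in zip(weights, passed) if p) / total
    variance = 0.0
    for b in sorted(set(bin_index)):
        bin_total = sum(w for w, i in zip(weights, bin_index) if i == b)
        bin_passed = sum(w for w, i, p in zip(weights, bin_index, passed) if i == b and p)
        variance += (bin_total / total) * (bin_passed / bin_total - eff) ** 2
    return math.sqrt(variance)

bins = [0, 0, 1, 1]
# Every event in bin 0 passes and none in bin 1: strongly non-uniform.
sde_biased = sde([True, True, False, False], [1.0] * 4, bins)
# Half of each bin passes: perfectly uniform.
sde_uniform = sde([True, False, True, False], [1.0] * 4, bins)
```

Because only ratios of weights enter, multiplying all event weights by a constant leaves the result unchanged, in line with the second desirable property listed earlier in this appendix.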

Theil Index of Efficiency
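The Theil Index is frequently used to measure economic inequality:
\[ {\rm Theil} = \frac{1}{N} \sum_i \frac{x_i}{\langle x \rangle} \ln \frac{x_i}{\langle x \rangle}, \qquad \langle x \rangle = \frac{1}{N} \sum_i x_i. \]
In our case we have to alter the formula a bit to take into account that different bins have different impact; thus the formula turns into
\[ {\rm Theil}({\rm eff}) = \sum_{\rm bin} {\rm weight}_{\rm bin} \frac{{\rm eff}_{\rm bin}}{{\rm eff}} \ln \frac{{\rm eff}_{\rm bin}}{{\rm eff}}. \]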
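The weighted Theil index of bin efficiencies can be sketched in the same way as the SDE. The function name and toy inputs are hypothetical; the convention that a bin with zero efficiency contributes zero (the limit of x ln x as x goes to 0) is an assumption of this sketch.

```python
import math

def theil(passed, weights, bin_index):
    """sum_bin weight_bin * (eff_bin / eff) * ln(eff_bin / eff) for signal events."""
    total = sum(weights)
    eff = sum(w for w, p in zip(weights, passed) if p) / total
    result = 0.0
    for b in sorted(set(bin_index)):
        bin_total = sum(w for w, i in zip(weights, bin_index) if i == b)
        bin_passed = sum(w for w, i, p in zip(weights, bin_index, passed) if i == b and p)
        ratio = (bin_passed / bin_total) / eff
        if ratio > 0.0:  # assumed convention: an empty-efficiency bin contributes 0
            result += (bin_total / total) * ratio * math.log(ratio)
    return result

bins = [0, 0, 1, 1]
theil_biased = theil([True, True, False, False], [1.0] * 4, bins)   # equals ln 2
theil_uniform = theil([True, False, True, False], [1.0] * 4, bins)  # equals 0
```

In the biased toy case, bin 0 has twice the global efficiency and bin 1 has none, so the index is 0.5 × 2 × ln 2 = ln 2.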
To measure the overall non-flatness, we average values for several global efficiencies:
\[ {\rm Theil} = \frac{1}{k} \sum_{{\rm eff} \in [{\rm eff}_1, \ldots, {\rm eff}_k]} {\rm Theil}({\rm eff}). \]
[...] in particular because Kolmogorov-Smirnov measures are too sensitive to local non-uniformities. The advantage of this method is that we don't need to select some global efficiencies like in the previous metrics.

Knn-based modifications
Though operating with bins is usually both simple and very efficient, in many cases it is hard to find the optimal size of bins in the space of uniform features (specifically in the case of more than two dimensions).As mentioned earlier, problems can also arise due to bins with very low populations.
In these cases we can switch to k-nearest neighbors: for each signal event we find the k nearest signal events (including the event itself) in the space of uniform features. Now we can compute the efficiency ${\rm eff}_{{\rm knn}(i)}$ from the empirical distribution $F_{{\rm knn}(i)}$ of nearest neighbors. The weights for ${\rm knn}(i)$ are proportional to the total weight of events in ${\rm knn}(i)$:
\[ {\rm weight}_{{\rm knn}(i)} \propto \sum_{j \in {\rm knn}(i)} w_j. \]
The knn approach suffers from a drawback: the impact of different events has very little connection with the weights, because some events are selected as nearest neighbours much more frequently than others. This effect can be suppressed by dividing the initial weight of the event by the number of times it is selected as a nearest neighbour.
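The kNN variant can be sketched for a one-dimensional uniform feature. The code below finds each event's k nearest neighbours (itself included), computes the efficiency in each neighbourhood, and divides each event weight by the number of neighbourhoods in which the event appears. All function names and toy values are hypothetical, and the brute-force neighbour search is chosen for brevity rather than speed.

```python
def knn_lists(y, k):
    """Indices of the k nearest events to each event in 1-D y, the event itself included."""
    n = len(y)
    return [sorted(range(n), key=lambda j: (abs(y[j] - y[i]), j != i))[:k]
            for i in range(n)]

def knn_efficiencies(passed, weights, neighbours):
    """Weighted fraction of passing events within each event's neighbourhood."""
    effs = []
    for group in neighbours:
        total = sum(weights[j] for j in group)
        effs.append(sum(weights[j] for j in group if passed[j]) / total)
    return effs

def occurrence_corrected_weights(weights, neighbours):
    """Divide each weight by the number of neighbourhoods that contain the event."""
    counts = [0] * len(weights)
    for group in neighbours:
        for j in group:
            counts[j] += 1
    return [w / c for w, c in zip(weights, counts)]

# Toy signal sample (hypothetical): positions in the uniform feature and cut decisions.
y = [0.0, 1.0, 3.0, 10.0]
passed = [True, True, False, False]
weights = [1.0, 1.0, 1.0, 1.0]

neighbours = knn_lists(y, k=2)
effs = knn_efficiencies(passed, weights, neighbours)
corrected = occurrence_corrected_weights(weights, neighbours)
```

Here the event at y = 1.0 appears in three neighbourhoods while the isolated event at y = 10.0 appears only in its own, which is exactly the imbalance the weight correction is meant to offset.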

Figure 1: Dalitz-plot distributions for (left) signal and (right) background for the $D_s^\pm \to \pi^+\pi^-\pi^\pm$ decay. The three pions are labeled here as 1, 2 and 3 and ordered according to increasing momentum.

The signal decays are simulated using the D_DALITZ model of EvtGen to describe the intermediate resonances which contribute to the three-pion final state. The background candidates are three-pion combinations reconstructed in simulated samples of $c\bar{c}$ and $b\bar{b}$ events, where the charm and bottom quark decays are inclusively modelled by EvtGen. The simulated events contain "truth" information which identifies them as signal or background, and which identifies the physical origin of the three-pion combinations reconstructed in the $c\bar{c}$ and $b\bar{b}$ simulated samples.

Figure 2: (left) ROC curves for classifier algorithms studied in this paper. For the uGBFL algorithms α = 0.02 is shown. (right) ROC curves for uGBFL(bin) for different values of α.

Figure 3: Uniformity of the selection efficiency across the Dalitz-plot, as measured using the so-called SDE metric described in detail in the appendix, vs α for uGBFL(bin). The dashed line indicates the SDE value for AdaBoost. Lower values of α produce more uniform selection efficiencies.

Figure 4: Efficiency vs distance to a corner of the Dalitz-plot. An arbitrary working point of 50% integrated efficiency is displayed. For the uGBFL algorithms α = 0.02 is shown.

Figure 6: Demonstration of the distribution similarity approach. (left) Predictions are uniform in mass; the distribution of predictions in the bin (yellow) is close to the global distribution (blue). (right) Distribution with a peak in the middle; the distribution in the bin is quite different from the global distribution. In both cases the yellow rectangle shows the events in the bin over mass.

Table 1: Description of uniform boosting algorithms.