Speeding up complex multivariate data analysis in Borexino with parallel computing based on Graphics Processing Unit

A spectral fitter based on the graphics processor unit (GPU) has been developed for Borexino solar neutrino analysis. It is able to shorten the fitting time to a superior level compared to the CPU fitting procedure. In Borexino solar neutrino spectral analysis, fitting usually requires around one hour to converge since it includes time-consuming convolutions in order to account for the detector response and pile-up effects. Moreover, the convergence time increases to more than two days when including extra computations for the discrimination of 11C and external γs. In sharp contrast, with the GPU-based fitter it takes less than 10 seconds and less than four minutes, respectively. This fitter is developed utilizing the GooFit project with customized likelihoods, pdfs and infrastructures supporting certain analysis methods. In this proceeding the design of the package, developed features and the comparison with the original CPU fitter are presented.


Introduction
The complexity of the analysis techniques increased rapidly in the past 50 years thanks to the Moore's law on the computation power.Besides the improvement of the single chip computation power, recently the development of the numerous-cores chip and parallel programming toolkit can speed up the analysis even more and the parallel computing becomes more and more popular.
Borexino is a liquid scintillator based neutrino experiment located in Laboratori Nazionali del Gran Sasso in Italy.In Borexino analysis, one of the methods uses the analytical function to model the detector response.Many analysis technologies, such as the multivariate analysis, are used to suppress backgrounds.As a result, the computation becomes too heavy and specific efforts are needed to reduce the fitting time, so we have developed a package to speed up the analysis utilizing the parallel computing techniques working on the Graphic Processing Unit (GPU).The expected number of events in each bin are computed simultaneously by thousands of core in the GPU.It is found to be able to speed up the fitting by more than 2000 times compared with the original code, and thus becomes widely used in analyses [1,2].In this proceeding we briefly introduced the package[3] and presented its performance.

The Borexino experiment
The Borexino detector is a liquid scintillator based calorimeter.Its structure is shown schematically in Figure 1 [4].In the most inner part there is the target, an organic solution.Its solvent is PC (pseudocumene, 1,2,4-trimethylbenzene C 6 H 3 (CH 3 ) 3 ), and the solute is PPO (2,5-diphenyloxazole, C 15 H 11 NO) at a concentration of 1.5 g/L (0.17% by weight).Its mass is around 278 tons.The neutrinos can be detected via the elastic scattering channel on electrons, and the anti-neutrinos can be detected via the inverse beta decay on the hydrogen atoms.Separated by the inner vessel (IV), a 125 µm specially treated ultra-low-radioactivity nylon vessel, it was surrounded by quenched liquid scintillator which shields it from the gammas rays from the natural radioactivity in the PMTs and stainless steel sphere (SSS).Another nylon vessel, the outer vessel, together with the inner vessel works both as the barriers against radon atoms diffusing inward from outer part of the detector.The radius of the inner vessel, outer vessel and the stainless steel sphere are 4.25 m, 5.5 m and 6 m, respectively.Outside the stainless steel sphere there is the water Cherenkov detector as the active muon veto.3800 meters water equivalent rock provides passive shield to further suppress muon and cosmogenic backgrounds.for a 1 MeV energy deposition event in the detector center.For each hit, two signals, one for the energy and one for the time measurements, are produced by the front-end circuit [5].The energy signal is produced by a special analogues gateless charge integrator.The integrator automatically resets itself and has no dead time.By sampling the integrator output twice separated by 80 ns the charge due to the p.e. arriving during this period is obtained.The timing signal is produced by amplifying the PMT signal with two low noise amplifiers.
Since the start of data taking on May 2007, Borexino has given the current best measurement of 7 Be solar neutrinos [6], the first evidence of pep solar neutrinos and the best upper limit on solar CNO neutrino flux [7], the only real time detection of pp solar neutrinos[?], as well as the 8 B solar neutrinos with the lowest threshold as low as 3 MeV [8,9].Borexino also gave the detection of geo-neutrinos [10] and provided various rare processes studies.Additionaly, it is planned to search for sterile neutrino using an artificial 144 Ce-144 Pr source via short baseline neutrino oscillation within the SOX project [10].

Complexity in the Borexino analysis
In Borexino analysis, there are three energy estimators.For each estimator, both two methods, the analytical method and the Monte Carlo method, are studied extensively to describe the actual status of the detector honestly.
In the analytical method, the deposited energy spectrum is calculated priorly, then the visible energy spectrum is predicted by convolving the deposited energy spectrum with the response function where N x is the variable used in the analysis, such as the charge, number of detected photons, number of fired channels etc., f vis (N x ) is the visible energy spectrum, i.e. the spectrum of variable N x , f deposit (E) is the spectrum of the deposited energy, R x is the corresponding response function, p 0 , p 1 , ... are its parameters [11].
To have unbiased results with analytical response function, several techniques are developed.The pile-up events are the critical background for ν(pp) signal.To get precise ν(pp) rate, we modelled the pile-up effect by convolving the visible energy spectrum with the spectrum acquired with random trigger.
where g random trigger is the spectrum of N x variable from random trigger.
To improve the sensitivity on ν(pep) the three-fold-coincidence techniques is used to suppress the cosmogenic 11 C.The techniques tags events associated with a muon event and a subsequent neutron event.Since ν( 7 Be) and ν(pp) neutrino are not affected by the 11 C, instead of applying cuts we fit both spectra with a joint likelihood to increase the exposure for them To further suppress the impact of the backgrounds for ν(pep), multivariate analysis is developed.We include the radius in the likelihood to suppress the γs from the natural radioactivity from PMTs etc. and include the pulse shape parameter to suppress the residual cosmogenic 11 C after TFC-veto.Instead of a real high-dimensional fit, the radial and pulse-shape information are included by adding additional terms of likelihood [12].
The analytical response function is validated against Monte Carlo p.d.f.Toy Monte Carlo spectra are prepared and are fitted against the analytical response function.The fit results are not biased, so the analytical response function is able to describe the detector response, at least as good as the Monte Carlo p.d.f.

GPU based parallel computing
To speed up the computation, a new fitting tool based on GPU is developed.Because the computations of the expected number of events in each bin are independent, they can be performed simultaneously by thousands of cores inside the GPU and thus the fitting time is reduced significantly.An existing project GooFit is used as the interface between the minimization package and the GPU.TMinuit from ROOT is used as the minimization engine.
The workflow of package, controlled by the AnalysisManager object, is the following: (i) Load the fit configuration, the energy histogram and miscellaneous information such as the exposure etc. into the InputDataManager; (ii) Construct the energy spectrum, the response function object, the dataset and all likelihood terms; (iii) Construct the TMinuit object; (iv) The GooFit associate the constructed likelihood object to the the TMinuit object; (v) Call the TMinuit::minimize method.The TMinuit will vary the parameters and evaluate the likelihood and find the maximum of the likelihood; (vi) When TMinuit requests the evaluation of the likelihood, the control is passed to GooFit, GooFit will launch the parallel computation and evaluate the expected number of events in each bin simultaneously.The evaluation is done by calling the function pointer saved in each likelihood object, and these functions are written in cuda, a C-like advanced programming language for instructing the GPU card; (vii) When the TMinuit finds the minimum, the control is given back to package.The fit result is saved and the plots are created.

Comparison and validation against the original CPU code
An example output of the package is shown in the left panel of Figure 2. The package is validated against the original CPU version of the fitter.Fits are performed with two fitters under same configuration on the same dataset.The comparison is shown in Figure 2. The converging time tested is listed in Table 1.As one can see the speed up is at a factor of more than 2000 when switching on the multivariate likelihood.

Outlook
All the analysis techniques used in the solar spectral analysis have been implemented in this package.The package is also validated and has shown an excellent performance.In the near future we plan to implement the multi-dimensional fitting and more sophisticated and accurate response function proposed recently.With a deeper understanding of the principle of GPU and the low level manipulation of the cuda it is foreseen that the performance of the package can be further improved, and an upgrade of the package is planned in the near future.

Figure 2 .
Figure 2. Left panel: An example output with Monte Carlo dataset.Right panel: Comparison between the package and the original fitter.On the x-axis it is log 10 |∆| where ∆ is the difference.

Table 1 .
Comparison between CPU fitter and package.