Model-independent partial wave analysis using a massively-parallel fitting framework

The functionality of GooFit, a GPU-friendly framework for doing maximum-likelihood fits, has been extended to extract model-independent S-wave amplitudes in three-body decays such as $D^+ \to h^+h^+h^-$. A full amplitude analysis is done where the magnitudes and phases of the S-wave amplitudes are anchored at a finite number of $m^2(h^+h^-)$ control points, and a cubic spline is used to interpolate between these points. The amplitudes for P-wave and D-wave intermediate states are modeled as spin-dependent Breit-Wigner resonances. GooFit uses the Thrust library, with a CUDA backend for NVIDIA GPUs and an OpenMP backend for threads with conventional CPUs. Performance on a variety of platforms is compared. Executing on systems with GPUs is typically a few hundred times faster than executing the same algorithm on a single CPU.


Introduction
The physics of scalar mesons below 2 GeV is a long-standing puzzle in light meson spectroscopy [1], this is partly due to the fact that these mesons often have large widths and overlap with each other, which make them hard to model. Contributions from Light scalars can be extracted by using Dalitz plot analyses [2] in three-body hadronic decays of charm mesons. First developed by the E791 Collaboration [3], the Model Independent Partial Wave Analysis (MIPWA) technique extracts S-wave amplitudes in the Dalitz plot analysis with no assumption about the nature of the S-wave. To pin down its large numbers of free parameters, the MIPWA technique requires the large samples of three-body decay events that have become increasingly available at the B-factories and the LHC. Notably, the LHCb experiment has recorded charm decays with sample sizes exceeding any previous experiment by more than an order of magnitude, offering a unique opportunity to study S-wave structures with unprecedented levels of precision. Analysing very large statistics samples requires disproportionately more computing power. Running all the calculations in a single Central Processing unit (CPU) thread would take a prohibitively long time.
Originally, a Graphics Processing Unit (GPU) was a specialised electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images for output to a video display. The highly parallel structure makes GPUs more effective than CPUs for algorithms where large blocks of data are processed in parallel. Today, the functionality of GPUs has been extended to enable highly parallel computing for scientific and other more general applications. An open-source framework called GooFit [4] has been developed to exploit the processing power of these GPUs for parallel function evaluation, particularly in the context of maximum likelihood fits. This paper reports an extension of the GooFit framework to support MIPWA of a threebody decay with vastly reduced processing time.

MIPWA method
Essentially all studies of three-body hadronic D (s) decays employ the same technique: the unbinned maximum likelihood fit of the Dalitz plot, in which the quantum mechanical matrix element governing the decay is represented by a coherent sum of amplitudes [1]. These amplitudes correspond to the possible intermediate states in the decay chain D (s) → Rh 3 , R → h 1 h 2 (h = K, π). The amplitudes are grouped according to the orbital angular momentum L of R and h 3 in the rest frame of the D (s) , and the total amplitude is where s 12(13) ≡ m 2 (h 1 h 2(3) ). The amplitudes A L k are weighted by constant complex coefficients c L k , and the series is truncated at L = 2. For a resonance with non-zero spin, the amplitude A k is most often described using a relativistic Breit-Wigner function multiplied by a real spindependent angular factor [5]. For the S-wave, two qualitatively different approaches exist. In the Isobar model, the S-wave is treated as a sum of a constant non-resonant term and Breit-Wigner functions for the scalar resonances. While the Isobar model provides reasonably good descriptions for narrow resonances, it fails to describe the overlap of broad resonances. Additionally, the physical interpretation of the constant non-resonant term is problematic.
To address these issues, the MIPWA describes the ensemble of scalar components using a purely phenomenological set of parameters derived from the data. The s 12(13) mass spectrum is divided into N − 1 slices with N boundary points separating the slices and at the two ends of the spectrum. The S-wave is represented by a generic complex function A 0 (s) = a 0 (s)e iφ 0 (s) . At each of the k boundary points, A 0 (s = s k ) = a k e iφ k where a k and φ k are real parameters. Between the N boundary points, the S-wave is parametrised by a cubic spline [6] in the complex plane. The set of {a k , φ k } are free parameters, along with the coefficients c L l of the higher spin terms, are determined in the MIPWA fit. The large data sets studied, along with the large number of parameters (2N ) required to describe the S-wave, make maximising the likelihood a computationally intensive problem. Interpolating between the boundary points leads to correlations between the parameters, and hence to non-linear behavior.

MIPWA method with GooFit
GooFit provides an interface to allow probability density functions (PDFs) to be evaluated in parallel, using either GPUs or multicore CPUs as back-ends. While the original intention of GooFit was to utilise the massive computational power of NVIDIA GPUs based on the proprietary Compute Unified Device Architecture language (CUDA) [7], the Thrust parallel algorithms library [8] also supports an OpenMP backend for conventional CPUs.
GooFit has been used to perform a number of amplitude analyses, for example that of Ref. [9]. The ResonancePdf class has been extended to add support for the MIPWA method. The entire S-wave is treated as a single resonance (ResonancePdf object), with a set of free parameters {a k 0 , φ k 0 } to be determined in the fit. The projections of m 2 (K + K − ) (left) and m 2 (K + K + ) (right) from D + → K + K + K − . In each plot, the toy MC signal events (points with error bars ) are shown together with the total fit (blue line), φ resonance (red line), and S-wave (magenta line) determined from the MIPWA.

MIPWA in
The D + → K + K + K − decay offers a good opportunity to study the K + K − S-wave amplitude directly. LHCb collected a large D + → K + K + K − data sample from 2.1 fb −1 of √ s = 7 TeV pp collisions recorded by the experiment during 2012. About 100K candidates were selected with a signal purity of 90%. The resonant structure of the K + K − S-wave amplitude was studied for the first time using an Isobar model, and the results were first presented during the CHARM 2016 workshop [10]. This analysis indicates that the S-wave component accounts for about 90% of the decay rate, while the P-wave (φ(1020) resonance) makes up the rest.
A specific goal of the work presented here is to provide a tool to extract the K + K − S-wave by performing an MIPWA on the same LHCb data sample. In this case, the Dalitz plot is symmetrised along the two axes of m 2 (K + K − ). The K + K − mass squared range is divided into 39 slices using 40 boundary points. Because the final state contains two indistinguishable K + mesons, the K + K − amplitudes interfere with themselves, and this sort of interference allows the S-wave phase to be determined. To test the GooFit extension to support MIPWA fits, samples of 100K signal events are generated from a Dalitz plot PDF ("Toy MC"). The mass projections from the fit to a test sample can be seen in Fig. 1. First, the fit quality of the MIPWA method is tested by comparing the fitted {a k , φ k } values with the inputs. In each fit iteration, the Dalitz plot normalisation method ("DalitzPlotPdf::normalise()") is called so that the integral of the total PDF is equal to one. The normalisation integral is calculated numerically based on evenly distributed grid points in the Dalitz plot plane. Figure 2 shows the improvement in the fit quality as the normalisation grid spacing is reduced. Although the finer granularity increases numbers of calculations, the high speed of the GPUs makes this problem tractable.  Differences between the fitted values and input ones for each boundary point. Shown are set of points with different grid spacings for the normalisation: 0.01 GeV 2 (purple), 0.004 GeV 2 (red), and 0.001 GeV 2 (blue).
As shown in Table 1, the GooFit performance on three different GPU platforms has been measured by running the MIPWA fit over the same toy MC sample of 100 K events (see Fig. 1 for the fit projections). For comparison, the MIPWA fit using an older code with the same functionality that runs on one CPU core takes about eight hours to complete. Perhaps surprisingly, an older generation mobile GPU (a GeForce GT 650M with 384 cores) provides excellent performance; a newer HPC GPU board (a Tesla K40c with 2880 cores) provides better performance, but not in proportion to the number of cores; a high-end gamer board (a GeForce GTX 980) provides the best performance, albeit by a small margin. Based on other studies with GooFit, significantly better performance on the new P100 boards is anticipated which utilise NVIDIA's new Pascal GPU architecture. In addition to timing GooFit's new MIPWA performance on GPUs using Thrust's CUDA backend, its performance on two different CPU platforms is timed using Thrust's OpenMP backend, as shown in Fig. 3. With two Intel Xeon E5-2680 v3 CPUs, the fit uses 791 seconds for one OpenMP thread, 50 seconds for 24 threads. The speedup is almost linear with the increase of the number of OpenMP threads, up until the number of threads equals the numbers of physical cores.

Summary
This paper describes an extension of GooFit to support MIPWA fits for three-body decays, and have achieved speedups of a few hundred by using GPUs. The main branch of GooFit's source code is in a GooFit repository at https://github.com/GooFit/GooFit, while the updated code  Table 2: Goofy (Intel Xeon CPU E5-2680 v3 x 2) and Cerberus (Intel Xeon CPU E5520 x 2).