Scalable Software for Multivariate Integration on Hybrid Platforms

Abstract. The paper describes the software infrastructure of the PARINT package for multivariate numerical integration, layered over a hybrid parallel environment with distributed memory computations (on MPI). The parallel problem distribution is typically performed at the region level in the adaptive partitioning procedure. Our objective has been to provide the end-user with state-of-the-art problem solving power packaged as portable software. We give test results of the multivariate PARINT engine, with significant speedups for a set of 3-loop Feynman integrals. An extrapolation with respect to the dimensional regularization parameter (ε) is applied to sequences of multivariate PARINT results Q(ε) to obtain the leading asymptotic expansion coefficients as ε → 0. This paper further introduces a novel method for the parallel computation of the Q(ε) sequence as the components of the integral of a vector function.


Introduction
The PARINT parallel integration package supports the computation of multivariate integrals over hyperrectangular and simplex regions. The algorithms and tools in the parallel integration engine include: (i) an adaptive region subdivision algorithm, equipped with a load balancing strategy to handle localized integrand difficulties such as peaks and singularities; (ii) a non-adaptive Quasi-Monte Carlo (QMC) method based on Korobov lattice rules; (iii) Monte Carlo (MC) methods for high dimensions and/or erratic integrands; (iv) a user interface based on the PARINT plugin compiler, which pre-processes the user problem specification and integrand function to be linked with the PARINT executable; (v) 1D rules corresponding to those in QUADPACK [1], which can be used for repeated integration in successive coordinate directions. Repeated integration with QUADPACK programs has proven extremely effective for applications to Feynman loop integrals with severe singularities in high-energy physics [2]. However, repeated or iterated integration has the drawback that the expense of the method grows rapidly for higher-dimensional problems (say, dimensions ≥ 6). In a sequential setting, multivariate adaptive integration is considered suitable for moderate dimensionality (say, up to 10) while possibly dealing with mildly irregular integrand behavior, and for higher dimensions with well-behaved functions. These limits can be moved up considerably in a parallel environment by allowing finer subdivisions with higher numbers of integrand evaluations.
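As a small illustration of repeated (iterated) integration in successive coordinate directions, and of its cost growth with dimension, consider the following Python sketch. It uses a nested 3-point Gauss-Legendre rule as a stand-in for the QUADPACK programs (the rule and function are illustrative only):

```python
import math

# 3-point Gauss-Legendre rule on [0, 1] (exact for polynomials up to degree 5)
_NODES = [0.5 - 0.5 * math.sqrt(0.6), 0.5, 0.5 + 0.5 * math.sqrt(0.6)]
_WEIGHTS = [5.0 / 18.0, 8.0 / 18.0, 5.0 / 18.0]

def repeated_integral(f, dim, point=()):
    """Integrate f over the unit dim-cube one coordinate direction at a
    time; the evaluation count grows as 3**dim, which illustrates why
    iterated integration becomes expensive in higher dimensions."""
    if dim == 0:
        return f(point)
    return sum(w * repeated_integral(f, dim - 1, point + (x,))
               for x, w in zip(_NODES, _WEIGHTS))

# example: the integral of x1 + x2 + ... + x5 over the 5-cube equals 5/2
val = repeated_integral(lambda p: sum(p), 5)
```

Here each added dimension multiplies the work by the number of 1D abscissae, so for dimensions ≥ 6 the cost escalates quickly.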
Written in C and layered over MPI [3], the PARINT methods are implemented as tools for automatic integration, where the user defines the integrand function and the domain, and specifies a relative and absolute error tolerance for the computation (t r and t a , respectively). The integrand is defined as a vector function over the domain D. Denoting the (exact) integral by

I f = ∫_D f(x) dx,   (1)

it is then the objective to return an approximation Q f and absolute error estimate E a f such that || Q f − I f || ≤ || E a f || ≤ max{ t a , t r || I f || }, or indicate with an error flag that the requested accuracy cannot be achieved. When only a relative accuracy needs to be satisfied we set t a = 0 (and vice-versa). If both t a and t r are nonzero, the weaker of the two error tolerances is imposed; if both t a = t r = 0 then the program will reach an abnormal termination, typically when the maximum number of function evaluations is reached. Optionally the PARINT installation can be configured to use long doubles instead of doubles. This paper focuses on the distributed adaptive algorithm in the PARINT multivariate integration package combined with extrapolation techniques as addressed in Section 2, and presents test results for Feynman loop integrals in Section 3.
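For a scalar result, the acceptance criterion above can be sketched in a few lines of Python (an illustration only, not the PARINT code; for vector integrands the package applies the corresponding norms):

```python
def accuracy_reached(q, ea, i_exact, ta, tr):
    """Check || Qf - If || <= || Ea f || <= max{ ta, tr * || If || }
    for a scalar result q with error estimate ea."""
    return abs(q - i_exact) <= abs(ea) <= max(ta, tr * abs(i_exact))

# pure relative tolerance: set ta = 0 so only tr * |If| is imposed
ok = accuracy_reached(q=0.9999999, ea=5e-7, i_exact=1.0, ta=0.0, tr=1e-6)
```

Note that the error estimate is required to bound the actual error while itself staying within the user tolerance.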

Adaptive integration and extrapolation
2.1. PARINT parallel adaptive integration
Layered over MPI [3] for distributed computing, the PARINT adaptive code assigns the roles of controller and worker processes for adaptive region partitioning (see also [4] and [5] for an explanation of the algorithm). The integration domain is divided initially among the workers. Each working on its own part of the domain, the workers engage in an adaptive partitioning strategy similar to that of DQAGE in QUADPACK [1], DCUHRE [6] and HALF [7], generating by successive bisections a local priority queue of subregions as a task pool. The priority queue is implemented as a max-heap keyed with the estimated integration errors of the subregions, so that the subregion with the largest estimated error is stored in the root of the heap; alternatively it is implemented as a deap (double-ended heap), which allows efficient deletion of the minimum as well as the maximum element in case the user specifies a maximum size to be retained for the heap structure.
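The bisection strategy with an error-keyed max-heap can be sketched in one dimension as follows (illustrative Python; the local rule here, Simpson's rule with |Simpson − trapezoid| as error estimate, is a hypothetical stand-in for the cubature rules used by PARINT):

```python
import heapq
import math

def local_rule(f, a, b):
    """Toy local rule: Simpson approximation with a crude error estimate."""
    m = 0.5 * (a + b)
    trap = 0.5 * (b - a) * (f(a) + f(b))
    simp = (b - a) / 6.0 * (f(a) + 4.0 * f(m) + f(b))
    return simp, abs(simp - trap)

def adaptive(f, a, b, tol, max_regions=100000):
    """Bisect the subregion with the largest estimated error until the
    total estimated error meets tol. heapq is a min-heap, so errors are
    negated to obtain max-heap behavior."""
    q, e = local_rule(f, a, b)
    total_q, total_e = q, e
    heap = [(-e, a, b, q)]
    while total_e > tol and len(heap) < max_regions:
        neg_e, a, b, q = heapq.heappop(heap)   # worst region at the root
        e = -neg_e
        m = 0.5 * (a + b)
        q1, e1 = local_rule(f, a, m)           # bisect and re-estimate
        q2, e2 = local_rule(f, m, b)
        total_q += q1 + q2 - q                 # update global results
        total_e += e1 + e2 - e
        heapq.heappush(heap, (-e1, a, m, q1))
        heapq.heappush(heap, (-e2, m, b, q2))
    return total_q, total_e

Q, E = adaptive(math.exp, 0.0, 1.0, 1e-8)      # integral of e^x over [0,1]
```

In the distributed algorithm each worker maintains such a task pool over its own part of the domain, and the load balancer moves regions between pools.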
An important mechanism of the distributed integration algorithm is the load balancing strategy, which is receiver-initiated and geared to keeping the loads on the worker task pools balanced, particularly for problems with irregular integrand behavior. The message passing is performed in a non-blocking, asynchronous manner, which permits overlapping of computation and communication. As a result of the asynchronous processing and message passing on MPI, PARINT executes on a hybrid platform (multicore and distributed) by assigning multiple processes to each node. The parallelization may, however, incur a loss in the form of extra work performed at the end of the computation, owing to the time elapsed while workers continue to generate subregions after the termination message has been issued but before receiving it. This may increase the total amount of work (compared to the sequential work), particularly when the number of processes is large.

When PARINT is used as a stand-alone executable, it uses the PARINT Plug-in Library (PPL) mechanism to specify integrand functions. The functions are written by the user, added to the library (along with related attributes), and then compiled using a PARINT-supplied compiler into plug-in modules (.ppl files). A single PPL file is loaded at runtime by the PARINT executable. Using a function library enables quick access to a predefined set of functions and lets PARINT users add and remove integrand functions dynamically without re-compiling the PARINT binary. Once these functions are stored in the library, they can be selected by name for integration.
For an execution on MPI, the MPI host file (myhostfile) contains lines of the form

   node_name slots=ppn

where ppn is the number of processes to be used on each participating node. A typical MPI run from the command line may be of the form

   mpirun -np 64 --hostfile myhostfile ./parint -f fcn -lf 10000000 -ea 0.0 -er 5.0e-10

For example, with 4 nodes listed in myhostfile and ppn = 16, a total of 64 processes is requested. The integrand function of this run is named fcn in the user's library; the maximum number of function evaluations is 10000000, and the absolute and relative error tolerances are 0 and 5.0e-10, respectively.

2.2. Use of extrapolation
We apply numerical extrapolation to integrals with an asymptotic expansion in the dimensional regularization parameter ε, of the form

S(ε) ∼ Σ_{k ≥ K} C_k ϕ_k(ε), as ε → 0.   (2)

For example, the ϕ_k(ε) functions may be integer powers of ε, ϕ_k(ε) = ε^k, k ≥ K, with K = 0 for finite integrals. If the ϕ_k(ε) functions are known, we can apply a linear extrapolation by approximating S(ε) for decreasing values of ε = ε_ℓ, and truncating Eq (2) after 2, 3, . . . , ν terms to form linear systems of increasing size to be solved for the C_k variables. When the ϕ_k(ε) functions are unknown we perform a non-linear numerical extrapolation with the ε-algorithm [9,10,11,12]. In that case we generate a sequence of integral approximations for S(ε_ℓ), using a geometric progression of ε_ℓ that tends to zero with increasing ℓ (see also [13,2]), and employ a version of the ε-algorithm code (DQEXT) from QUADPACK. In between calls, the DQEXT implementation retains the last two lower diagonals of the triangular extrapolation table. When a new element of the input sequence is provided, the algorithm calculates a new lower diagonal, together with an estimate or measure of the distance of each newly computed element from preceding neighboring elements, with the location of the "new" element in the table relative to e0, e1, e2, e3 pictured as:

            e0
      e3    e1    new
            e2
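Both extrapolation variants can be sketched numerically as follows (illustrative Python with a toy sequence S(ε); this is not the PARINT or QUADPACK code). Linear extrapolation solves growing Vandermonde-type systems for the C_k when ϕ_k(ε) = ε^k; Wynn's ε-algorithm accelerates a sequence without knowledge of the ϕ_k:

```python
import math
import numpy as np

def linear_extrapolation(eps_seq, s_seq, nu):
    """Solve sum_{k=0}^{nu-1} C_k * eps^k = S(eps) at the first nu values
    of eps; returns the coefficient estimates C_0, ..., C_{nu-1}."""
    v = np.vander(np.asarray(eps_seq[:nu]), nu, increasing=True)
    return np.linalg.solve(v, np.asarray(s_seq[:nu]))

def wynn_epsilon(s):
    """Wynn's epsilon algorithm, built column by column; with an odd number
    of input terms, the single entry of the last (even-numbered) column is
    returned as the accelerated estimate."""
    n = len(s)
    prev_prev = [0.0] * (n + 1)   # epsilon_{-1} column of zeros
    prev = list(s)                # epsilon_0 column: the input sequence
    for k in range(1, n):
        cur = [prev_prev[i + 1] + 1.0 / (prev[i + 1] - prev[i])
               for i in range(n - k)]
        prev_prev, prev = prev, cur
    return prev[0]

# linear extrapolation recovers the C_k from a geometric eps progression
eps = [0.5 ** l for l in range(4)]
s = [2.0 + 3.0 * e + 5.0 * e * e for e in eps]   # toy S(eps)
c = linear_extrapolation(eps, s, 3)

# epsilon algorithm on the partial sums of ln 2 = 1 - 1/2 + 1/3 - ...
partial = list(np.cumsum([(-1.0) ** (j + 1) / j for j in range(1, 8)]))
est = wynn_epsilon(partial)
```

In practice DQEXT works incrementally, keeping only the last two lower diagonals of the table, whereas the sketch above recomputes the full table.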

Results for Feynman loop integrals
One prominent application of multivariate integration is the computation of Feynman loop integrals, which contribute higher-order corrections for accurate theoretical predictions of the cross-section for particle interactions. The integral corresponding to an n-dimensional scalar Feynman diagram with L loops and N internal lines can be represented in Feynman parameter space as

I = (−1)^N Γ(N − nL/2) ∫_0^1 Π_{r=1}^N dx_r δ(1 − Σ_r x_r) C^{N−n(L+1)/2} / (D − iϱC)^{N−nL/2}.   (3)

Here the functions C and D are polynomials determined by the topology of the corresponding Feynman diagram [14]. After removing the δ-function in (3) and eliminating one of the integration variables, the integral is taken over an (N−1)-dimensional unit simplex. The term iϱC prevents the integral in (3) from diverging if D vanishes in the interior of the domain. As this does not happen for the problems under consideration in this paper we can set ϱ = 0. We set n = 4 − 2ε to compute the leading order coefficients in an asymptotic expansion of the integral with respect to the dimensional regularization parameter ε.
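The Feynman parameter integrals are thus taken over a simplex, and for the numerical results they are mapped to the unit cube. One standard recursive transformation with its Jacobian can be sketched as follows (an illustration; not necessarily the transformation implemented in PARINT):

```python
def cube_to_simplex(u):
    """Map a point u in the unit cube to x in the unit simplex
    { x_i >= 0, sum x_i <= 1 }, returning (x, jacobian)."""
    x, rem, jac = [], 1.0, 1.0
    for ui in u:
        x.append(ui * rem)   # scale each coordinate into what remains
        jac *= rem
        rem -= ui * rem
    return x, jac

# check: integrating 1 over the 3-simplex gives its volume 1/3! = 1/6;
# a 2-point Gauss-Legendre product rule integrates the polynomial
# Jacobian exactly
g = 0.5 / 3.0 ** 0.5
nodes, weights = [0.5 - g, 0.5 + g], [0.5, 0.5]
vol = 0.0
for i, ui in enumerate(nodes):
    for j, uj in enumerate(nodes):
        for k, uk in enumerate(nodes):
            _, jac = cube_to_simplex([ui, uj, uk])
            vol += weights[i] * weights[j] * weights[k] * jac
```

The integrand over the cube is then the original integrand composed with the map, multiplied by the Jacobian factor.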
Test results are presented below for a set of 3-loop self-energy diagrams from Laporta [15] (given in Fig 1). In order to compare our integral approximations with Laporta's we set all masses m r = 1 and s = 1, and furthermore divide the integral by Γ^3(1 + ε). For the numerical approximations and timings we use the HPCC (High Performance Computation Center) cluster at WMU. The tests are run on 16-core cluster nodes with Intel Xeon E5-2670, 2.6GHz processors and 128GB of memory, using the Infiniband interconnect for message passing via MPI. Tables 1-4 show various types of results obtained by calling the pi_integrate() function and running the executable with 16 processes on the HPCC cluster. The integrals are transformed from the (unit) simplex to the (unit) cube and the integration is taken over the cube, using the basic integration rule of polynomial degree 9 [6].

Table 1. Integral and leading order expansion coefficients using PARINT with 16 procs., and linear extrapolation for 3-loop integral of Laporta Fig 2(r); err. tol. ta = 10^−12; max. # evals = 20B; T (s) = elapsed time (s); Ea = integration estim. abs. error

Table 2. Integral and leading order expansion coefficients using PARINT with 16 procs., and extrapolation with the ε-algorithm for 3-loop integral of Laporta Fig 2(r); err. tol. ta = 10^−12; max. # evals = 20B; T (s) = elapsed time (s); Ea = integration estim. abs. error

Table 3. Integral and leading order expansion coefficients using PARINT with 16 procs., and linear extrapolation for 3-loop integral of Laporta Fig 2(u); err. tol. ta = 10^−12; max. # evals = 20B; T (s) = elapsed time (s); Ea = integration estim. abs. error

Tables 1 and 3 are computed with consecutive calls to pi_integrate() in a loop and linear extrapolation, for the functions of (Laporta) Fig 2(r) with N = 7, and Fig 2(u) with N = 8, respectively (depicted in Fig 1). The values of C 0 , C 1 and C 2 are listed (K = 0 in Eq (2)).
The abbreviation "B" refers to billion with regard to the number of integrand evaluations, and T (s) represents the integration time in seconds. Table 2 is obtained similarly for Fig 2(r), but with non-linear extrapolation by the ε-algorithm. The results in Table 4 for Fig 2(u) are computed with one call to pi_integrate() from the main program, integrating a vector function that contains the ε-dependence in the definition of its components. The integration time for the vector function (at 2180 seconds) is considerably less than the total integration time through the loop in Table 3, with comparable or slightly better accuracy for the same problem. If less accuracy is needed overall, the computation takes less time with both methods. The extrapolations in Tables 1-4 show good agreement with the values of the Eq (2) coefficients from Laporta [15]. Table 5 presents timing results on the HPCC cluster, corresponding to the six diagrams in Fig 1. Since the integrals with kinematics specified by Laporta [15] are finite and only the integral needs to be computed, we set ε = 0 in the integrand codes. The speedup S p = T 1 /T p (where T p is the integration time incurred with p processes) indicates good scalability of the parallel implementation (see also [16]). For a total of p = 64 MPI processes on 4 cluster nodes, 16 procs are spawned per node.
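The single-pass strategy behind Table 4 can be sketched as follows (illustrative Monte Carlo in Python with a toy ε-dependent integrand family, not the PARINT rule-based code): one integration of a vector-valued integrand yields the whole sequence Q(ε_ℓ) at once, since all components share each evaluation point:

```python
import numpy as np

eps = 0.5 ** np.arange(4)        # geometric progression of epsilon values

def f_vector(x):
    """Toy integrand family f(x; eps) = (x1*x2)**eps; all eps values are
    evaluated at the same point x, as components of a vector function."""
    return (x[0] * x[1]) ** eps

def mc_integrate_vector(f, dim, n, seed=0):
    """One Monte Carlo pass over the unit cube returns all components
    Q(eps_l) simultaneously."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n, dim))
    vals = np.array([f(x) for x in pts])   # shape (n, len(eps))
    return vals.mean(axis=0)

q = mc_integrate_vector(f_vector, dim=2, n=200_000)
exact = 1.0 / (1.0 + eps) ** 2   # integral of (x1*x2)**eps over the square
```

Sharing the evaluation points (and the subdivision work, in the adaptive case) across the components is what makes the vector computation cheaper than a separate integration per ε value.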

Concluding remarks
In this paper we explored distributed, automatic integration with the PARINT multivariate integration package, in particular its adaptive integration engine. We demonstrated its accuracy and parallel scalability for the computation of the leading order coefficients in an asymptotic expansion with respect to the dimensional regularization parameter ε, as ε → 0, for a class of 3-loop self-energy integrals from [15]. This procedure relies on obtaining approximations of a sequence of integrals as ε → 0, and on an extrapolation for convergence acceleration of the sequence. In view of the compute-intensive nature of these problems even for moderate integral dimensions, the parallelization has proven beneficial for obtaining considerable accuracy.