A development of an accelerator board dedicated for multi-precision arithmetic operations and its application to Feynman loop integrals

Higher order corrections in perturbative quantum field theory are required for precise theoretical analysis to investigate new physics beyond the Standard Model. This indicates that we need to evaluate Feynman loop diagrams with multi-loop integrals which may require multi-precision calculation. We developed a dedicated accelerator system for multiprecision calculations (GRAPE9-MPX). We present performance results of our system for the case of Feynman two-loop box and three-loop selfenergy diagrams with multi-precision.


Introduction
With the discovery of Higgs particle at the CERN Large Hadron Collider, precision measurements are expected at a future International Linear Collider to explore new physics beyond the Standard Model. In tandem with experiments, an accurate theoretical prediction is required. To meet such a demand, higher order correction in perturbative quantum field theory becomes more and more important, and methods to evaluate multi-loop integrals precisely should be provided.
We have been developing DCM (Direct Computation Method) for loop integrals [1]. This is a fully numerical method of the combination of multi-dimensional integration and the extrapolation technique. It is known that the accurate numerical evaluation of the integral is a hard problem due to its divergent nature. Yuasa et al. [2] reported that the numerical evaluation is numerically unstable with double-precision operations. A solution to this difficulty is to compute the integral in multi-precision arithmetic, that is, it needs more bits for mantissa than double-precision. In addition, we have a case that needs more bits for exponent. With Double Exponential Formulas for numerical integration (DE) [3], we can estimate that 15-bit wise exponent is required to reduce a relative error smaller than a reference value, although the exponent of double-precision is only 11-bit wise.
There are several ways to accomplish multi-precision arithmetic. One way is to use the specific softwares, for examples, GMP [4], MPFR [5], ARPREC [6], MPFUN90 [6], DD/QD [6], and quadmath on a new version of GNU compiler. These softwares are easy to use. However, high performance can not be expected, since one multi-precision arithmetic operation is realized with, at least, more than 10 double-precision arithmetic operations in these softwares. Another way is to design a new, dedicated hardware which has circuits for calculation with precision higher than doubleprecision. We have been developing such a hardware based on the fundamental idea of GRAPE originally developed for gravitational N -body simulations [7]. We designed GRAPE-Multi-Precision (MP) processor which consisted of a number of processing elements (PE) and memory components with dedicated circuits for quadruple, hexuple, octuple-precision arithmetic. We implemented this processor on a structured ASIC (Application Specific Integrated Circuit) which was called GRAPE-MP [8] , and on an FPGA (Field Programmable Gate Array) board with a control processor, which was called GRAPE-MPX [9]. On these systems, we measured the performance of quadruple, hexuple, octuple-precision calculation and obtained the higher performance compared to a software implementation.
In this work, we developed a new system called GRAPE9-MPX. It is similar to GRAPE-MPX but is a rather larger system. This means that larger number of PEs can be implemented in the new system, which will be suited for large scale simulations. In addition to designing a hardware, we have been developing a programming interface named Goose and LSUMP which enable us easily to use GRAPE9-MPX.
This paper is organized as follows. Next section covers a brief explanation of the design and the implementation of the new system from hardware and software viewpoints. In section 3, we briefly explain Feynman loop integrals used for the performance measurement. The results are shown in section 4. The last section is devoted to the summary and our future prospects.  Figure 1 shows a schematic diagram and a picture of GRAPE9-MPX system. It is an accelerator which is connected to a host computer. It consists of FPGA boards (up to 8 boards available) which are housed on a PCIe extender board (G9MG). For the FPGA board, we used Altera Arria V board from Altera Co. [10]. On this FPGA, we implemented the MP processor and its Control Processors (CP). The MP processor consists of PEs and memory components for quadruple/hexuple/octuple-precision arithmetic (named as MP4/MP6/MP8), which forms a Single-Instruction-Multiple-Data (SIMD) processor. Each PE has a floating-point multiply unit and an add unit which perform in every clock cycle. Note that for MP4, we used the numerical representation compatible with IEEE-754-2008:binary128 format. For details of the PE, CP, and the numerical representations, see Daisaka et al. [8] and Nakasato et al. [9].
One of applications suited for our system is an interaction type calculation as describes an interaction form of X i and Y j . Note that data X i is set on i-th PE, whereas data Y j is set on all PE, then each PE calculates f (X i , Y j ) and sums up from j = 1 to j = n j in accordance with instructions. An example of this type calculation is gravitational interactions among particles. We should note that a multi-dimensional integration can be expressed in a similar form by loop fusion technique that merges multiple loops into a single loop. This indicates that we can accelerate the calculation of Feynman loop integrals effectively by using GRAPE9-MPX.

Programming interface
We have also developed Goose compiler which provides a programming interface for GRAPE9-MPX system. Figure 2 shows the flow of the compile process for our system, and the part of a sample program for one-loop box integral. Goose is a directive base compiler like OpenMP, in which a directive is inserted in an original code. By the directive (#pragma goose parallel seen in the sample code), Goose extracts the loop next to the directive and generates an intermediate representation which is like an assembler code. Goose also generates API calls in which functions of I/O interfaces for GRAPE9-MPX are embedded. In order to convert the intermediate representation to machine instructions (kernel code) for the MP processor, Goose uses LSUMP backend [11] which is a domain specific language (DSL) compiler specially developed for the MP processor. The kernel code is read by an executable file which is generated from the API calls by C/C++ compiler. For more detail, see Nakasato [11].   Figure 3 shows the two-loop crossed box diagram and the three-loop selfenergy diagram used for the performance measurement. The number of dimensions of the loop integral is six for the two-loop crossed box, and seven for the three-loop selfenergy, respectively. We consider the case where all masses of the internal lines are 1 and set the Mandelstam variables s = t = 1 and the external momenta p 2 i = 1(1 ≤ i ≤ 4) for the two-loop crossed box diagram and s = 1 and p 2 = 1 for the three-loop selfenergy diagram, respectively. We choose these values to compare numerical results to ones in the previous studies [12,13].
The numerical results in quadruple precision obtained by GRAPE9-MPX in the maximum problem size are I 2loop = 0.008536 and I 3loop = 0.279609, which are within the relative errors ACAT2014 IOP Publishing Journal of Physics: Conference Series 608 (2015) 012011 doi:10.1088/1742-6596/608/1/012011 of 10 −3 for the two-loop crossed box and 10 −5 for the three-loop selfenergy compared to the previous works [12,13]. It seems that the error is slightly larger because of smaller number of N total , but the calculation with this error level is achieved in a very short time, as seen later.

Performance results
We measured the performance of GRAPE9-MPX with up to 4 FPGA boards. It is connected a Linux PC (Scientific Linux 6.5) with Intel Xeon 2687W (3.4GHz) CPU on ASRock Xtreme11 motherboard (X79 chipset). The performance results are plotted in Fig.4. From this figure, we can see that for the two-loop crossed box integral, the effective performance in the maximum problem size was 2.4 Gflops with a single board, 4.7 Gflops with 2 boards, and 9.1 Gflops with 4 boards, respectively. On the other hand, the effective performance for the three-loop selfenergy integral in the maximum problem size was 8.7 Gflops with 4 boards. This figure also shows that the speed up by multiple boards is well scalable in the maximum problem size. It becomes 3.8 times faster using 4 boards, and 1.9 times faster using 2 boards than using a single board. On the other hand, the scalability is not good in smaller N j . This is because the overhead by communication between the GRAPE9-MPX boards and the host computer via PCI express is not negligible for smaller N j .
We compared the efficiency of the effective performance for the two-loop crossed box integral and for the three-loop selfenergy integral. We show that it is about 34% of the theoretical peak performance for the two-loop crossed box, whereas about 33% for the three-loop selfenergy. The reason why the latter is slightly lower than the former comes from the difference of the number of instructions. From LSUMP, the operation count of the instruction is 91 for two-loop crossed box, whereas it is 89 for three-loop selfenergy.  We also compared the computation time for the two-loop integral in the maximum problem size measured in GRAPE9-MPX of MP4 with 4boards, and the time measured in software implementations with the same numerical format (15-bit exponent and 112-bit mantissa). The measured calculation time is 21.3 sec for GRAPE9-MPX, 1074.7 sec for quadmath on gcc4.6.3, and 6086.7 sec for GMP-MPFR. Although we used OpenMP for the multi-threaded computing and the number of threads used was 16, GRAPE9-MPX system was 52.9 times faster than quadmath and, 319.5 times faster than GMP-MPFR. Thus, we could show the advantage of our system in the performance.

Summary
We have developed GRAPE9-MPX system, an accelerator system for multi-precision arithmetic operations which can accelerate the calculation of Feynman loop integrals. Our system consists of multiple FPGA boards in which a lot of PEs are implemented and forms SIMD way. Each PE has a dedicated logic for multi-precision operations. We measured the performance of the system in quadruple precision (15-bit exponent and 112-bit mantissa) for the two-loop crossed box and the three-loop selfenergy integrals, and got the effective performance of 9.1 Gflops and of 8.7 Gflops with 4 boards in the maximum program size, respectively.
One of the advantages of our system is that we have developed not only the hardware with high performance in multi-precision calculation, but also the programming interface which enables us to port applications easily for our system. The combination of Goose and LSUMP provides a simple way to use GRAPE9-MPX. All we have to do is to insert the directive in the original code. This simple programming model plays an essential role in treating various kinds of loop integrals in the automatic computation of Feynman diagrams.
Although we mainly presented the results of MP4 in this paper, we have measured the performance of MP6 and MP8 for the same Feynman loop integral. For the case of the two-loop crossed box, we achieved 4.4 Gflops for MP6 and 2.3 Gflops for MP8 with 4 boards, respectively. The detail of the results will appear in near future.