Parallel iterative solution of the Hermite Collocation equations on GPUs II

Hermite Collocation is a high order finite element method for Boundary Value Problems modelling applications in several fields of science and engineering. Application of this integration-free numerical solver to linear BVPs results in a large, sparse, general system of algebraic equations, calling for an efficient iterative solver, especially for realistic simulations. In part I of this work an efficient parallel algorithm combining the Schur complement method with the Bi-Conjugate Gradient Stabilized (BiCGSTAB) iterative solver was designed for multicore computing architectures with a Graphics Processing Unit (GPU). In the present work the proposed algorithm is extended to high performance computing environments consisting of multiprocessor machines with multiple GPUs. Since this architecture combines distributed GPU memory with shared CPU memory, a hybrid memory treatment is needed in the development of the parallel algorithm. The algorithm was realized on an HP SL390 multiprocessor machine with Tesla M2070 GPUs using the OpenMP and OpenACC standards. Execution time measurements reveal the efficiency of the parallel implementation.


Introduction
The Collocation discretization method, based on Hermite bi-cubic elements, is a well-known, highly accurate finite element technique for solving elliptic boundary value problems (BVPs) [1,2,3]. When the method is applied to problems on square domains, using a uniform n_s × n_s discretization and the appropriate red-black numbering of unknowns and equations [4], the resulting red-black collocation linear system Ax = b (1) is in 2-cyclic normal form [5], with block-diagonal and nonsingular matrices D_R and D_B on its diagonal [6]. The large size of the collocation system, especially for fine discretizations, and its demanding storage requirements point directly to iterative methods [7,8,9] on parallel computing environments [10,3,11,12] for its efficient solution. The preconditioned BiCGSTAB [13] iterative method is an efficient solver for the collocation linear system (1) on high performance parallel architectures [11,10,12]. In [14] a new Schur complement type iterative solver was proposed for multicore machines with a Graphics Processing Unit (GPU). The method iteratively approximates only the black coloured unknowns, while the red ones are computed directly from the black ones. Since the method is based on applying a two-sided preconditioning technique to the linear system (1), the iterative procedure for the Schur complement linear system involves an unpreconditioned solver; the BiCGSTAB method is chosen for this purpose and evaluates all the black unknowns. The most computationally intensive part of the solution process consists of the two matrix-vector multiplications involving the Schur complement matrix at every iterative step of the BiCGSTAB algorithm, so a significant part of the parallel calculations for these matrix-vector multiplications is assigned to the GPU in order to accelerate performance. The algorithm of [14] is based on a shared memory model for the CPU and GPU cores. When more than one GPU is available, however, each GPU stores its portion of the data in its own local memory and, due to the specific structure of the matrices H_R and H_B, data dependencies arise in every matrix-vector multiplication. The following section presents the extended hybrid algorithm, based on the shared-distributed memory model, required for the case of multiple GPUs.
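To make the Schur complement construction explicit, the following sketch assumes the usual red-black 2-cyclic block arrangement, with D_R, D_B on the diagonal and H_B, H_R off the diagonal; the precise block ordering of (1) should be taken from part I of this work:

\[
\begin{bmatrix} D_R & H_B \\ H_R & D_B \end{bmatrix}
\begin{bmatrix} x_R \\ x_B \end{bmatrix}
=
\begin{bmatrix} b_R \\ b_B \end{bmatrix}
\;\Longrightarrow\;
\bigl(D_B - H_R D_R^{-1} H_B\bigr)\, x_B = b_B - H_R D_R^{-1} b_R,
\qquad
x_R = D_R^{-1}\bigl(b_R - H_B x_B\bigr).
\]

Under this arrangement, each multiplication by the Schur complement matrix decomposes into a product q = H_B s, a block-diagonal solve z = D_R^{-1} q, and a product t = H_R z; these are the operations parallelized in the next section.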

The parallel algorithm
The parallel algorithm for shared memory architectures in [14] is based on a specific mapping of unknowns onto the CPU/GPU computational threads, permitting uniform load balancing of the computation among cores, minimizing the communication cost between threads and shared memory, and eliminating idle core threads during the parallel procedures. Following the same mapping of unknowns, in the case of N available GPUs and for even k = n_s/N, the parallel procedures for the GPU matrix-vector multiplications t = H_R z and q = H_B s have to employ both GPU and CPU core threads in order to avoid transferring multiple copies of data from the CPU memory to each GPU's local memory. The core threads associated with GPU j = 1, ..., N calculate the vector parts t_l and q_l for l = (j-1)k+1, ..., jk, each being a subvector of length 2n_s. The evaluation on the jth GPU of the subvectors t_{(j-1)k+1}, t_{(j-1)k+2}, t_{jk-1}, t_{jk}, q_{(j-1)k+1} and q_{jk} requires the subvectors z_{(j-1)k}, z_{jk+1}, s_{(j-1)k-1}, s_{(j-1)k}, s_{jk+1} and s_{jk+2}, which are stored in the local memories of the (j-1)th and (j+1)th GPUs. For this reason these boundary subvectors are computed by CPU core threads, since the required data vectors are also stored in the CPU's shared memory; a concrete instance of this partitioning is sketched below. The above is implemented in the parallel algorithm that follows, which describes the t = H_R z GPU/CPU matrix-vector multiplication; in the analogous procedure for q = H_B s, the jth CPU core thread computes q_{(j-1)k+1} and q_{jk}.
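As a concrete illustration (the numbers are chosen here for exposition and are not taken from the paper), consider the coarsest test case reported below, n_s = 256, with N = 2 GPUs, so that k = 128. GPU 1 is responsible for the subvectors t_l, q_l with l = 1, ..., 128 and GPU 2 for l = 129, ..., 256. Evaluating t_{129} and t_{130} on GPU 2 would require z_{128}, which resides in the local memory of GPU 1; these boundary subvectors are therefore assigned to CPU core threads, which read the complete vectors directly from the shared memory.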

[Parallel algorithm listing: the t = H_R z GPU/CPU matrix-vector multiplication, organized as an outer OMP PARALLEL region with one OMP SECTION per GPU and closed with !$OMP END SECTION and !$OMP END PARALLEL.]
We point out that, for an efficient implementation of the above algorithm, at least N CPU core threads have to be available. The CPU threads for j = 1, . . . , N manage the GPU processes. More specifically, each one loads the required data from the CPU shared memory and transfers it to the corresponding GPU's local memory. When all GPU calculations are completed, the same CPU thread transfers the data from the GPU's memory back to the CPU's memory. This is expressed in the above algorithm by the outer OMP PARALLEL region. The OpenACC subroutine acc_set_device_num assigns each GPU card to a CPU core thread created by the OMP PARALLEL region.
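To make this structure concrete, the following Fortran sketch shows one possible realization of the t = H_R z multiplication along the lines described above: an OMP PARALLEL region spawns one CPU thread per GPU, each thread selects its device with acc_set_device_num, moves its slice of the data with an OpenACC data region, launches the device computation, and then evaluates its boundary subvectors on the CPU. The array layout, the routine names hr_times_z_dev and hr_times_z_cpu, and the 0-based device numbering are assumptions for illustration only, not the authors' implementation.

subroutine hr_matvec_multi_gpu(ns, ngpu, z, t)
  ! Illustrative multi-GPU driver for t = H_R z (not the authors' code).
  ! Assumed (hypothetical) data layout: the ns subvectors of z and t, each of
  ! length 2*ns, are stored as the columns of z(2*ns, ns) and t(2*ns, ns).
  use omp_lib
  use openacc
  implicit none
  integer, intent(in)  :: ns, ngpu
  real(8), intent(in)  :: z(2*ns, ns)
  real(8), intent(out) :: t(2*ns, ns)
  integer :: j, k, lo, hi

  k = ns / ngpu                        ! subvectors per GPU (ns assumed divisible by ngpu)

!$omp parallel num_threads(ngpu) default(shared) private(j, lo, hi)
  j  = omp_get_thread_num() + 1        ! the jth CPU thread drives the jth GPU
  lo = (j - 1)*k + 1
  hi = j*k
  call acc_set_device_num(j - 1, acc_device_nvidia)   ! 0-based device numbering assumed

  ! Transfer only this GPU's slice of z, compute its interior subvectors of t
  ! on the device, and copy the result back to the shared CPU memory.
!$acc data copyin(z(:, lo:hi)) copyout(t(:, lo+2:hi-2))
  call hr_times_z_dev(ns, lo+2, hi-2, z, t)   ! hypothetical OpenACC device kernel
!$acc end data

  ! The four boundary subvectors t_lo, t_{lo+1}, t_{hi-1}, t_hi need z data
  ! owned by the neighbouring GPUs, so they are evaluated by the CPU thread,
  ! which can read the complete z from the shared memory.
  call hr_times_z_cpu(ns, lo, lo+1, z, t)     ! hypothetical CPU routine
  call hr_times_z_cpu(ns, hi-1, hi, z, t)
!$omp end parallel
end subroutine hr_matvec_multi_gpu

The analogous q = H_B s multiplication follows the same pattern, with s in place of z and the boundary subvectors q_{(j-1)k+1} and q_{jk} handled by the CPU threads.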

Realization on a shared-memory parallel computer with GPUs
The shared memory machine HP SL390s G7 is equipped with a 6-core Xeon X5660 processor at 2.8 GHz with 12 MB of Level 3 cache memory. The total memory is 24 GB and the operating system is Oracle Linux version 6.2. The machine also hosts two Fermi-generation Tesla M2070 GPUs [15] connected via PCI-e gen2 slots. Each GPU has 6 GB of memory and 448 cores on 14 multiprocessors. The application is developed in double-precision Fortran using the OpenMP [16,17] and OpenACC [18] standards with PGI's compilers [19] version 12.9. Basic linear algebra operations use subroutines from the BLAS and LAPACK scientific libraries [20].
For the implementation of the above parallel algorithm, a Dirichlet test problem for the modified Helmholtz equation with parameter λ = 1, admitting a known closed-form exact solution, was solved.
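For reference, a common statement of this type of model problem on a square domain is (the particular right-hand side and exact solution used in the experiments are not reproduced here):

\[
-\Delta u + \lambda\, u = f \quad \text{in } \Omega, \qquad u = g \quad \text{on } \partial\Omega,
\]

with Ω a square domain and λ = 1, where f and g are chosen so that the prescribed closed-form function solves the problem exactly.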
For the investigation of the algorithm's performance, a single CPU core thread is used for the CPU-only implementation, while for the CPU/GPU implementations the number of CPU core threads equals the number of GPU cards. Several problem sizes are solved, from n_s = 256 up to 2048 finite elements in each spatial direction. Since every Hermite collocation finite element has 16 degrees of freedom, the total number of degrees of freedom for each problem size is 16 n_s^2; for example, for the finest problem (n_s = 2048) the total number of degrees of freedom is 16 · 2048^2 ≈ 67.1 million. Table 1 presents, for all problem sizes, the total computation time in seconds and the speedup measured using only the CPU, the CPU with one GPU card and, finally, the CPU with both GPUs. As expected, both the problem size and the number of GPUs affect the algorithm's performance. An acceleration of almost 50% is observed when all available GPU cores are used for the finer discretizations.