Performance of parallel computation using CUDA for solving the one-dimensional elasticity equations

In this paper, we investigate the performance of parallel computation in solving the one-dimensional elasticity equations. Elasticity equations arise frequently in engineering science, so solving them quickly and efficiently is desirable. Therefore, we propose the use of parallel computation. Our parallel computation uses NVIDIA's CUDA. Our research results show that parallel computation using CUDA has a great advantage and is powerful when the computation is of large scale.


Introduction
A large number of real-world problems can be modelled mathematically. Solutions to mathematical models represent solutions to the real problems. Some mathematical models take the form of differential equations.
In this paper, we solve a system of partial differential equations. In particular, we consider the nonlinear elasticity equations. These equations were derived from engineering problems relating to elastic wave propagation through heterogeneous media [1]. Some work has been done previously on the nonlinear elasticity equations. Supriyadi and Mungkasi [2] solved the nonlinear elasticity equations using sequential computation. Solving the problem sequentially is not appropriate for a large-scale problem, as the computation becomes tedious at large scale. Darmawan and Mungkasi [3] solved the nonlinear elasticity equations in parallel using MPI on a cluster of workstations, applying the MPI_Send and MPI_Recv techniques. However, parallel programming with the MPI_Send and MPI_Recv techniques of the MPI standard is not appropriate for the one-dimensional elasticity equations, because the total time needed for data exchange is more dominant than the total time needed to conduct the basic elasticity computation using the finite volume method [4].
Nowadays, Graphics Processing Units (GPUs) have better performance than CPUs in both floating-point operation and memory bandwidth. One of the computing platforms and programming models for GPUs is the Compute Unified Device Architecture (CUDA) [5]. GPUs with CUDA offer high performance at a very low cost, and they can also be integrated into high-performance computer systems [6][7]. This paper investigates whether we obtain high speed in parallel computations using CUDA to solve elasticity problems. This paper is organized as follows. We recall the mathematical model concerning elasticity problems and present the numerical scheme to solve the model in Section 2. We describe the parallel numerical method for solving elasticity problems in Section 3. Computational results and discussion are provided in Section 4. Finally, we draw some concluding remarks in Section 5.

Mathematical model and numerical scheme
In this section, we present the mathematical model to be solved and the numerical scheme used to solve the model. The nonlinear elasticity equations are given by

$$ \epsilon_t(x,t) - u_x(x,t) = 0, \qquad (1) $$
$$ \big(\rho(x)\,u(x,t)\big)_t - \sigma\big(\epsilon(x,t), x\big)_x = 0. \qquad (2) $$

In this model, the free variables are the time $t$ and the space $x$. In addition, the notation $u(x,t)$ represents the velocity, $\epsilon(x,t)$ is the strain, $\rho(x)$ is the density, and $\sigma(\epsilon, x)$ denotes the stress. The space domain that we consider in this paper, together with the initial condition and the boundary conditions for our test problem, completes the specification of the model.

The system of equations (1)-(2) has the form of a conservation law

$$ q_t + f(q)_x = 0, \qquad (9) $$

where $q = (\epsilon, \rho u)^{T}$ is the conserved quantity and $f(q) = (-u, -\sigma(\epsilon, x))^{T}$ is the flux function. One of the numerical methods that can be used to solve conservation laws is the finite volume method. The finite volume method itself is conservative, as the numerical quantity is conserved at any time. The finite volume scheme for equation (9) in the fully discrete version is

$$ Q_i^{n+1} = Q_i^{n} - \frac{\Delta t}{\Delta x}\left( F_{i+1/2}^{n} - F_{i-1/2}^{n} \right), \qquad (10) $$

where $Q_i^{n}$ is the approximate quantity computed in the finite volume framework at the $i$-th cell and the $n$-th time step, and $\Delta t$ and $\Delta x$ are the time step and the spatial cell width, respectively. We use a uniform time step as well as a uniform spatial cell width. All fluxes are computed using the Lax-Friedrichs formulation. The Lax-Friedrichs fluxes for equation (9) relating to the finite volume scheme (10) are given by

$$ F_{i-1/2}^{n} = \frac{1}{2}\left[ f(Q_{i-1}^{n}) + f(Q_i^{n}) \right] - \frac{\Delta x}{2\,\Delta t}\left( Q_i^{n} - Q_{i-1}^{n} \right). $$

Method for parallel computations
The method for our parallel computations is described as follows.
NVIDIA developed the CUDA programming model and computing platform to let programmers write scalable parallel codes [5]. CUDA is an extension of the C and C++ programming languages. The programmer writes a serial program that calls parallel kernels, which may be functions or full programs. A kernel executes a set of parallel threads. The threads are organized into a hierarchy of grids of thread blocks. A thread block is a set of concurrent threads that share access to a memory space private to the block and that cooperate among themselves through barrier synchronization. A grid is a set of thread blocks, each of which may be executed independently in parallel [8]. In this work, we use CUDA on a Personal Computer with a GPU for parallel programming in the C programming language.
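The kernel, thread, block, and grid concepts above can be sketched in CUDA C as follows. This is a hypothetical sketch, not the paper's actual source code: the names (`update_cells`, `flux`, `d_q_old`, `d_q_new`) are illustrative, and the flux is a placeholder rather than the full elasticity flux.

```cuda
// Placeholder flux for a scalar conserved quantity; the elasticity
// flux f(q) of equation (9) would go here in the real program.
__device__ double flux(double q)
{
    return -q;
}

// Kernel: each thread updates one finite volume cell using the
// classical Lax-Friedrichs form of scheme (10).
__global__ void update_cells(const double *q_old, double *q_new,
                             int n_cells, double dt, double dx)
{
    // Global thread index within the grid-of-thread-blocks hierarchy.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n_cells - 1) {
        q_new[i] = 0.5 * (q_old[i - 1] + q_old[i + 1])
                 - 0.5 * dt / dx * (flux(q_old[i + 1]) - flux(q_old[i - 1]));
    }
}

// Host-side launch: choose enough blocks so the threads cover all cells.
//   int threads = 1001;                              // threads per block
//   int blocks  = (n_cells + threads - 1) / threads; // ceiling division
//   update_cells<<<blocks, threads>>>(d_q_old, d_q_new, n_cells, dt, dx);
```

Because each cell update reads only its two neighbours from the previous time level, all cells can be updated concurrently, which is what makes the scheme well suited to the thread-per-cell mapping.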
Foster [9] mentioned that the execution time (which varies with problem size) as well as the efficiency (which is independent of problem size) can be used as metrics to evaluate parallel algorithm performance. The relative speedup $S_{\text{relative}}$ is calculated as

$$ S_{\text{relative}} = \frac{T_1}{T_p}, \qquad (13) $$

where $T_1$ is the execution time on one processor and $T_p$ is the execution time on $p$ processors. The relative speedup is the factor by which the execution time is reduced on $p$ processors. The relative efficiency $E_{\text{relative}}$ is calculated as

$$ E_{\text{relative}} = \frac{S_{\text{relative}}}{p} = \frac{T_1}{p\,T_p}. \qquad (14) $$

Notice from equations (13)-(14) that the relative speedup is directly related to the relative efficiency. However, according to Tan [6], in GPU parallel computing the execution times are measured on hardware platforms with totally different architectures, namely the CPU and the GPU, so the efficiency is not as useful a metric as in CPU parallel analysis. Therefore, in this paper we evaluate computational results using the execution time and the speedup defined by equation (13), with the processor count $p$ fixed at the number of GPU cores plus the number of CPU cores.

Computational results
In this section we present the main results of our research.
Computations are conducted on a Personal Computer with one Core 2 Quad processor, 8 GB of RAM, and an NVIDIA GT 730 GPU with 96 cores, 2 GB of memory, and a maximum of 1024 threads per block. We use the global memory of the GPU to share data between threads. The operating system that we use is 64-bit Windows 7. The parallel computing environment is CUDA Toolkit 7.5. We have conducted 16 simulations of parallel computations and 16 simulations of sequential computations in order to solve the elasticity problem described in Section 2. The elasticity problem is based on the one-dimensional elasticity equations implemented with one-dimensional arrays. We use scenarios with array sizes increasing from 10001 to 160001 in steps of 10000. We choose 1001 threads per block and calculate the number of blocks so that the total number of threads is at least equal to the dimension of the arrays.
For a basic elasticity problem, we record the total time of each simulation in Table 1. As shown in Figure 1, which plots the total time of both sequential and parallel executions, larger array sizes lead to longer computation times. For array sizes between 10001 and 20001, sequential execution is faster than parallel execution. Surprisingly, for array sizes between 30001 and 40001, parallel execution overtakes sequential execution, and beyond this range the sequential execution time continues to increase linearly, exceeding the parallel execution time. As shown in Figure 2, while the speedup is less than 1, the speedup increases linearly and sharply; once the speedup is more than 1, it increases linearly but gradually. This indicates that larger arrays gain better speedup, the limitation being the amount of memory available. That is, for one-dimensional elasticity computations, larger arrays in CUDA parallel programming gain more speedup.

Conclusion
We have simulated several scenarios of parallel computations for elasticity problems. We find that parallel programming with CUDA can be used to improve the execution time of computations solving the one-dimensional elasticity equations, given an appropriate array dimension. An improvement in speedup is obtained when the array size is more than 40001, and the speedup continues to increase linearly up to size 160001. We therefore recommend that the array sizes in one-dimensional elasticity computations be sufficiently large to obtain an improvement in speedup in parallel computations using CUDA.