A new parallelization algorithm for ocean models with an explicit scheme

This paper focuses on the parallelization of ocean models with an explicit scheme, one of the most commonly used schemes in the discretization of the governing equations of ocean models. The characteristics of the explicit scheme are that the calculation is simple and that the value at a given grid point depends only on grid points at the previous time step, which means that no sparse linear system needs to be solved in the process of integrating the governing equations of the ocean model. Exploiting these characteristics, this paper designs a parallel algorithm named halo cell update, which requires only minor modification of the original ocean model and little change to its spatial and time steps, and which parallelizes the model by adding a transmission module between sub-domains. Taking the GRGO (Global Reduced Gravity Ocean model) as an example, this paper implements its parallelization with the halo cell update. The results demonstrate that high speedups can be achieved at different problem sizes.


Introduction
Oceanic and atmospheric science, a branch of modern natural science, involves many ocean-atmosphere coupling models, such as POM (Princeton Ocean Model), ROMS (Regional Ocean Modeling System), FVCOM (Finite Volume Coastal Ocean Model) and CAM (Community Atmosphere Model). With the progress of computer hardware and software architecture, high-performance parallel computing technology has brought opportunities to improve the efficiency of these models, and more and more parallel ocean models have been developed, such as parallel versions of POM [1,2], ROMS [3,4], FVCOM [5] and CAM [6].
There are various software tools and libraries for developing parallel ocean models, such as HPF (High Performance Fortran), MPI (Message Passing Interface), OpenMP (Open Multi-Processing) and PVM (Parallel Virtual Machine). HPF is an extension of Fortran and was chosen for the parallelization of FRAME (Fine Resolution AMmonia Exchange) [7]. MPI is a parallel software library that uses message passing to communicate between parallel processes; it is one of the most commonly used parallel tools for the parallelization of ocean models, for example the parallelization of FVCOM [5]. OpenMP is a directive-based parallel tool for shared-memory platforms; its advantage is that one can quickly translate a serial ocean model into a parallel version by inserting OpenMP directives into the original serial code [8,9]. To combine the merits of MPI and OpenMP, Osthoff [10] adopted a hybrid MPI/OpenMP approach. PVM maps the computing nodes of a heterogeneous environment onto a parallel virtual machine, which masks the differences between computing nodes; its application to a parallel ocean model was demonstrated in reference [11]. However, all these parallel techniques are chosen on the basis of the development environment rather than the characteristics of the discretization of the governing equations of the ocean model. This paper develops a common parallel algorithm for ocean models whose governing equations are discretized with an explicit scheme. The time discretization of finite difference schemes for the governing equations of ocean models is divided into explicit and implicit schemes. The selection of a difference scheme is related to the stability and convergence of the governing equations, and to a certain extent it also determines the choice of parallel technique in the parallelization of the ocean model.
When the governing equations are discretized with an implicit scheme, a large sparse linear system is produced; in that case one can design a parallel algorithm aimed at the characteristics of the linear system [12]. For example, when the governing equations are discretized with the ADI scheme, which produces tri-diagonal linear systems, one can design a parallel algorithm for solving the tri-diagonal systems in parallel [13,14]. This paper focuses instead on the parallelization of ocean models with an explicit scheme. Aimed at the characteristics of the explicit scheme, this paper proposes a new parallel algorithm named halo cell update, which requires only minor modification of the original ocean model. To test the performance of the halo cell update algorithm, this paper applies it to parallelize the Global Reduced Gravity Ocean model at different problem sizes.

Halo cell update
Applying the halo cell update to parallelize an ocean model follows the main concept of "divide and conquer" and consists of three stages.
At the first stage, we divide the whole domain of the ocean model into sub-domains, each with its own boundary. In this paper, the inner grid points close to the boundary of a sub-domain are called the inner boundary. Each sub-domain then expands its boundary outward, forming an outer boundary called the halo cells. The halo cells hold the boundary condition used to compute the inner grid points.
At the second stage, at each integration step, the inner grid points of each sub-domain are computed using the boundary condition saved in its halo cells. In this way each sub-domain can be computed by its own processor in parallel, and the results at the inner grid points are correct, because the inner grid points are calculated with the correct boundary condition saved in the halo cells.
At the third stage, in order to compute correctly at the next time step, each sub-domain sends its inner boundary to, and receives its halo cells from, the neighbouring sub-domains; this is called the update step. After the update step the results of the parallel algorithm are identical to those of the serial algorithm, and the same solver can be applied to every sub-domain.
To understand the halo cell update concretely, let us consider the horizontal velocity u in one dimension, updated by an explicit scheme of the form u_i^{n+1} = F(u_{i-1}^n, u_i^n, u_{i+1}^n), i = 1, …, m,
where u is the horizontal velocity, i indexes the grid points of the total domain, n represents the time step, u_0 and u_{m+1} are the boundary conditions, and F() is the discretized form of the governing equation of the ocean model. We divide the entire domain into two sub-domains, u_{1,i} (i = 1, …, p) and u_{2,i} (i = p+1, …, m), and apply the halo cell update algorithm to them. The two sub-domains expand their boundaries, forming halo regions, and the halo regions are set with the boundary condition u_{1,p+1} = u_{2,p+1} and u_{2,p} = u_{1,p}. Now each sub-domain has its own complete boundary and can be computed independently by a different processor. At the update stage, the halo regions of the two sub-domains must be refreshed, because the inner grid points at the next time step must be computed with updated halo values. The update function can be implemented with basic MPI point-to-point communication operations, such as the mpi_send and mpi_recv subroutines. The advantages of the halo cell update are as follows. First, the parallel algorithm is simple: only an update function written in terms of basic message passing subroutines is required to refresh the halo cells, and the result of the parallel algorithm is identical to that of the serial algorithm. Second, the algorithm is extensible: we can expand the halo to two cells to reduce the number of halo cell updates, in which case the number of updates is halved at the cost of redundant computation in each sub-domain, as shown in figure 1.
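The three stages above can be sketched in a small, self-contained example. The sketch below is a hypothetical 1-D illustration in Python: both sub-domains live in one process and the halo exchange is a direct copy, where the real model would use mpi_send/mpi_recv between ranks, and the stencil step is a stand-in for F(), not the GRGO equations.

```python
# Hypothetical 1-D halo cell update: two sub-domains of an explicit
# three-point stencil, each with one halo cell on the interior side.
# All names (step, serial_run, parallel_run) are illustrative.

def step(u):
    """One explicit time step on the interior of u; u[0] and u[-1]
    act as boundary/halo values and are left unchanged."""
    return [u[0]] + [u[i] + 0.1 * (u[i-1] - 2*u[i] + u[i+1])
                     for i in range(1, len(u) - 1)] + [u[-1]]

def serial_run(u, nsteps):
    """Reference serial integration of the whole domain."""
    for _ in range(nsteps):
        u = step(u)
    return u

def parallel_run(u, p, nsteps):
    """Stage 1: split u at index p; each part gains one halo cell.
    Stage 2: step both parts independently (would run on two ranks).
    Stage 3: refresh halos from the neighbour's inner boundary."""
    left = u[:p+1] + [u[p+1]]    # interior 0..p  + halo holding u[p+1]
    right = [u[p]] + u[p+1:]     # halo holding u[p] + interior p+1..m-1
    for _ in range(nsteps):
        left, right = step(left), step(right)
        left[-1] = right[1]      # would be mpi_send/mpi_recv between ranks
        right[0] = left[-2]
    return left[:-1] + right[1:]

u0 = [0.0] * 4 + [1.0] + [0.0] * 4   # 9 grid points, fixed boundaries
assert serial_run(u0, 5) == parallel_run(u0, 4, 5)
```

Because the halos always hold the neighbour's time-n values when the interior is stepped, the parallel result matches the serial one bit for bit, which is the identity property claimed above.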

Ocean model
The ocean model used in this paper is an extended 1.5-layer Global Reduced Gravity Ocean model (GRGO) [15,16], which was first introduced by Cane to study ENSO. In this model, mass, momentum and heat obey the conservation laws in the upper layer, while the lower layer is assumed to be infinitely deep and motionless so that the kinetic energy remains finite. The momentum, continuity and thermodynamic equations for the upper layer can then be written in terms of U = U(x, y, t), h = h(x, y, t) and T = T(x, y, t), which represent the horizontal velocity vector, the thickness of the upper layer and the surface temperature respectively; x and y are the horizontal coordinates; w_e is the entrainment velocity at the base of the mixed layer; τ is the wind stress vector; b is the buoyancy in the upper layer; T_r is the reference temperature and T the surface temperature; Q is the heat flux; κ is the diffusivity coefficient; ρ is the density; γ_h is the thermal expansion coefficient; and μ is the horizontal viscosity coefficient. The GRGO solves for u, v, h and T by time integration subject to the boundary conditions, so in these equations we mainly solve for the time-derivative terms ∂/∂t. Since the main idea of this paper is domain decomposition in the horizontal directions, we are interested in the discretization of the terms containing horizontal derivatives.

Discretization
This section deals with the discretization of the governing equations of the ocean model. We discretize the governing equations with an explicit scheme, which is relatively simple to set up and to program, in order to find the numerical solutions of the equations. To keep the numerical solution stable and avoid generating oscillations, we must choose the time step and the spatial step so as to satisfy the CFL (Courant-Friedrichs-Lewy) condition. In computational fluid dynamics, the CFL condition is a necessary condition for stability when solving certain partial differential equations; in this paper we choose dt/dx < 0.1 (the system considered here is two dimensional). Taking the momentum equation (1) and the continuity equation (2) as examples, the finite difference equations are given as equations (4) and (5), where n represents the time step and i and j index the elements of the domain. From the finite difference equations we can derive the relationships between grid points shown in figure 2. For example, from equation (4) and figure 2(a), computing u at time n+1 requires u and the four points around u at time n. From equation (5) and figure 2(b), computing h at time n+1 requires h_{i,j}, h_{i-1,j} and h_{i,j-1} at time n. We can decide the scheme of the domain decomposition according to these relationships between grid points.
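The stencil dependencies above can be sketched as follows. The coefficients in this Python snippet are stand-ins, not the actual equations (4) and (5); it only mirrors the h-stencil of figure 2(b), where the update at (i, j) touches only the (i-1, j) and (i, j-1) neighbours, so under this dependency pattern h needs halo cells on only two sides of each sub-domain.

```python
# Hypothetical 2-D explicit update mirroring the h-stencil of figure 2(b):
# h at (i, j) and time n+1 depends on h at (i, j), (i-1, j), (i, j-1) at time n.

def check_cfl(dt, dx, limit=0.1):
    """The paper's stability criterion dt/dx < 0.1."""
    return dt / dx < limit

def step_h(h, dt, dx):
    """One explicit step with one-sided (upwind-style) differences;
    row 0 and column 0 act as boundary/halo values and stay fixed."""
    nx, ny = len(h), len(h[0])
    new = [row[:] for row in h]
    for i in range(1, nx):
        for j in range(1, ny):
            new[i][j] = h[i][j] - (dt / dx) * (
                (h[i][j] - h[i-1][j]) + (h[i][j] - h[i][j-1]))
    return new

assert check_cfl(dt=0.01, dx=0.5)       # 0.02 < 0.1, stable choice
uniform = [[1.0] * 4 for _ in range(4)]
assert step_h(uniform, 0.01, 0.5) == uniform   # a constant field is preserved
```

The one-sided dependency is what makes the decomposition cheap for this variable: a sub-domain only receives halo data across its low-i and low-j edges, while a centred stencil like that of u in figure 2(a) needs halos on all four sides.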

Results and analysis
The aim of the parallelization of the ocean model is to reduce the runtime and increase the efficiency of the ocean model. This paper tests the parallel version of GRGO in the following environments: Platform 1, the number of CPU cores is 4, the memory size is 4 GB, and the operating system is Windows 7.
Platform 2, the number of CPU cores is 8, the memory size is 8 GB, and the operating system is Windows Server.
Platform 3, the number of CPU cores is 16, the memory size is 16 GB, and the operating system is Linux.
This paper uses different resolutions of the ocean model to test the performance of the parallel algorithm in the above parallel environments, which means the problem size varies.
The running times of the U equation at different problem sizes and on different platforms are shown in tables 1-3. The speedup is an important indicator of the performance of a parallel algorithm and is defined as the ratio of the serial execution time to the parallel execution time. To analyse the performance of the parallel algorithm, this paper calculates the speedup from the running times shown in these tables.
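The speedup defined above can be computed directly from measured running times. The helpers below are illustrative, and the timings in the example are hypothetical values, not the measurements of tables 1-3.

```python
def speedup(t_serial, t_parallel):
    """Speedup = serial execution time / parallel execution time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_cores):
    """Parallel efficiency: speedup divided by the number of CPU cores."""
    return speedup(t_serial, t_parallel) / n_cores

# Hypothetical timings: a run taking 100 s serially and 30 s on 4 cores
assert speedup(100.0, 30.0) == 100.0 / 30.0
assert efficiency(100.0, 30.0, 4) == (100.0 / 30.0) / 4
```

Efficiency is the natural companion metric for the scalability claim below: if the speedup grows with problem size at a fixed core count, the efficiency approaches 1 as communication is amortized over more computation.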
From tables 1-3, the results reveal that, for a fixed problem size, the running time decreases and the speedup increases as the number of CPU cores increases, as shown in figures 3-5. On platform 1, when the number of CPU cores is fixed, we obtain a higher speedup as the problem size increases, as shown in figure 3; i.e., the speedup at problem size 2700*1200 is higher than that at problem size 1350*600. On platform 2 and platform 3 we obtain similar results: the higher speedup is gained at the bigger problem size when the number of CPU cores is fixed, as shown in figures 4 and 5; i.e., the speedup at problem size 3000*1000 is higher than that at problem size 1440*480, and the speedup at problem size 3600*1600 is the highest among the problem sizes 3600*1600, 3000*1000 and 1350*600. The reason for these results is as follows: with the number of CPU cores fixed, the amount of calculation grows faster than the amount of communication as the problem size increases, which demonstrates that the halo cell update is scalable. Overall, the results in tables 1-3 and figures 3-5 reveal that the speedup increases as the number of CPU cores increases and that higher speedups are gained at bigger problem sizes, which demonstrates that the halo cell update algorithm achieves good performance in the parallelization of ocean models.

Conclusion
Aimed at the characteristics of the explicit scheme, this paper develops a parallel algorithm for parallelizing ocean circulation models whose governing equations are discretized explicitly. To measure the performance of the parallel algorithm, this paper applies it to parallelize the GRGO model. The process of parallelizing the ocean model is simple, requiring only minor modification of the original model; in addition, the parallel version of the ocean model is scalable. This paper tests the parallel version of GRGO on three platforms; the results reveal that the running time decreases as the number of CPU cores increases, and that with a fixed number of CPU cores the speedup increases as the problem size increases, which demonstrates that the halo cell update is scalable.