A study of parallelizing O(N) Green-function-based Monte Carlo method for many fermions coupled with classical degrees of freedom

Models of fermions interacting with classical degrees of freedom are applied to a large variety of systems in condensed matter physics. For this class of models, Weiße [Phys. Rev. Lett. 102, 150604 (2009)] has recently proposed a very efficient numerical method, called O(N) Green-Function-Based Monte Carlo (GFMC) method, where a kernel polynomial expansion technique is used to avoid the full numerical diagonalization of the fermion Hamiltonian matrix of size N, which usually costs O(N3) computational complexity. Motivated by this background, in this paper we apply the GFMC method to the double exchange model in three spatial dimensions. We mainly focus on the implementation of GFMC method using both MPI on a CPU-based cluster and Nvidia's Compute Unified Device Architecture (CUDA) programming techniques on a GPU-based (Graphics Processing Unit based) cluster. The time complexity of the algorithm and the parallel implementation details on the clusters are discussed. We also show the performance scaling for increasing Hamiltonian matrix size and increasing number of nodes, respectively. The performance evaluation indicates that for a 323 Hamiltonian a single GPU shows higher performance equivalent to more than 30 CPU cores parallelized using MPI.


Introduction
In solid state materials, we observe quite different physical properties such as superconductivity and magnetism.These different properties are explained by different behavior of electrons, which in turn is caused mostly by Coulomb interactions among electrons or/and interactions between electrons and other degrees of freedom such as phonon.Therefore, theoretical understanding of these interacting electrons is crucial to, e.g., design new materials with rich functionalities.However, precisely due to this many-body nature of these interactions, it is usually very difficult to treat these systems analytically and even numerically.
Among such interacting electron systems, models of electrons interacting with classical degrees of freedom can be applied to a large variety of systems.One of the most studied systems is the double exchange model for colossal magnetoregistive manganese oxides [1,2,3,4].In the double exchange model, electrons can move around a lattice site under the influence of the localized classical spins via Hund's rule coupling, the Hamiltonian being given in Section 2. To solve this model, the Monte Carlo method is commonly used to carry out the importance sampling for the classical degrees of freedom after treating electron degrees of freedom by numerically diagonalizing the electron Hamiltonian of dimension N (where N is a system size) [1,2,3].However, since at every Monte Carlo step one must fully diagonalize the Hamiltonian matrix to evaluate the partition function for a given configuration of the classical spins, it usually costs O(N 3 ) numerical complexity.Because of this O(N 3 ) scaling of the calculations, the accessible system size is strictly limited to up to hundreds to thousands sites, preventing us from studying important properties such as the Curie temperature of a ferromagnetic ordering [1,2,3].
In order to reduce the numerical complexity, Kernel Polynomial Method (KPM) is very advantageous [5].Indeed, Motome and Furukawa have applied a Chebyshev expansion of the electron density of states for evaluating the partition function, which can reduce the numerical complexity down to O(N 2 ) [6].They have also proposed that the numerical complexity can be further reduced to O(N ) by truncating the moment calculations for the Chebyshev expansion [7].Very recently, Weiße has pointed out that the same numerical complexity of O(N ) can be achieved by using a Green-function-based Monte Carlo (GFMC) method, instead of directly evaluating the partition function for a given spin configuration [8].The GFMC method dramatically decreases the computational time by using only a few local Green's functions estimated by the Chebyshev expansion to calculate the change of the electron density of states.Moreover, this O(N ) GFMC method is more attractive than the one for directly evaluating the partition function with the same complexity of O(N ) [7] because the GFMC method achieve the O(N ) complexity without any truncation [8].
Motivated by these recent progress, here we apply the O(N ) GFMC method to the double exchange model.In this paper, we mainly focus on the implementation of the GFMC method using both MPI on a Central Processing Unit (CPU) cluster and Nvidia's Compute Unified Device Architecture (CUDA) programming techniques on a Graphics Processing Unit (GPU) cluster [9].The time complexity of the algorithm is estimated.The implementation details on a GPU cluster is also described since programming on a GPU cluster requires more attention on hardware aspects such as memory copy and communication between CPU and GPU.The performance evaluation indicates that the GPU program can archive over 10× speedup as compared to the MPI parallelization on a CPU cluster.
This paper is organized as follows: Section 2 explains the double exchange model and formulates the GFMC method, Section 3 describes the implementation details and the parallelization techniques, which are tested in Section 4, Section 5 shows the performance on a single CPU processor as well as the whole cluster, and finally, Section 6 will give the conclusions.

Double Exchange Model
The double exchange (DE) model describes electrons interacting the classical spins via Hund's rule coupling.The Hamiltonian is given by where c † iσ is a creation operator of electron at site i and with spin σ = (↑, ↓), i, j runs over a pair of nearest neighbor sites i and j, S i is the classical spin at site i with its normalization | S i | = 1, and J H denotes the Hund's rule coupling with J H > 0. In the limit of infinite J H (J H → ∞), the Hamiltonian becomes with the hopping amplitude where θ i and φ j denote the polar and the azimuthal angles of the classical spin S i , respectively.The grand partition function of the DE model is written as where β = 1/T is inverse of temperature T , µ is the chemical potential, N is the total number operator of electrons, and Tr e [• • • ] indicates the trace over the electron degrees of freedom in the Fock space.The trace over classical spin degrees of freedom, i.e., N i d 3 S i P ({ S i }), is evaluated by the Monte Carlo importance sampling with its weight P ({ S i }) for a given spin configuration { S i }, where ǫ i is the i-th eigenvalue of Hamiltonian H({ S i }), and ρ(E) is the electron density of state (DOS) for a given spin configuration { S i }.Using the Metropolis algorithm, the possibility Therefore, the quantity ∆ seff define by is all we need to carry out the Monte Carlo calculation.Here, ρ ′ (E) is DOS for a given spin configuration { S ′ i }.As mentioned in Section 1, we can exactly diagonalize H({ S i }) with O(N 3 ) complexity to evaluate ∆ seff [1,2,3], or we can use Kernel Polynomial Method (KPM) with O(N 2 ) or O(N ) complexity to estimate the DOS directly [6,7].However, the latter one with O(N ) complexity must truncate the moment calculations.Here, we use the recently proposed GFMC method [8], as will be explained below for completeness.

Green-function-based Monte Carlo (GFMC) method
In this O(N ) GFMC method, we need to evaluate only several elements of the Green's function to calculate ∆ seff given in Eq. ( 9), provided that the change of spin configuration { S i } → { S ′ i } is local, i.e, only a single spin at site i is changed with the rest of spins unaltered.We first give the definition of the Green's function G(z) in the complex plane: where E is real and ǫ is a very small positive real.The Hamiltonian for a given spin configuration has a special role in the GFMC method, where ǫ i and ǫ ′ i are the i-th eigenvalues of H and H ′ , respectively.It is readily shown that Thus, this equation in the left hand side can be used in Eq. ( 9), and ∆ seff is now described using d(z): It is important to notice that we need only several elements of G(z) to evaluate As an example, here we consider the simple cubic lattice with the nearest neighbor hopping, and we assume that only one spin at site o is changed: In the cubic lattice, site o has 6 nearest neighbors (NN) denoted by {n, e, s, w, t, b}.Then, the ∆ matrix has a very simple form of Therefore, to evaluate d(z), only the following 7 × 7 Green's functions have to be calculated The computation can be further simplified by expanding d(z) as the following: where Moreover, using the following state |v : d(z) can be compactly expressed as Notice that we now need only a 2 × 2 Green's function to evaluate d(z).Now, a question is how to calculate efficiently the local Green's functions G(z).For this purpose, we use the KPM [5], which can be efficiently implemented in a GPU cluster [10].Using two types of Chebyshev polynomials (m: integer), the diagonal elements of the Green's function are expanded as Since the Chebyshev polynomials T m (x) and U m (x) requires that the argument x should be within [−1, 1], we must renormalize the energy spectrum E to w.The μm represents the m-th moment (defined below) after applying a kernel function, μm = µ m g m , where g m is the kernel function to eliminate Gibbs oscillations [5].Here, we apply Lorenz kernel function which is defined as with appropriate choice of λ [5].The m-th moment µ m is defined as In addition, the number M of coefficients is obtained with M/2 iterations if the following equation is used, When the coefficients µ m are all calculated, we can evaluate the Green's function using fast Fourier transform [5].In Eq. ( 24), if we choose the Green's function becomes where Let us now denote the summation part in Eq. ( 29) by χ k : It should be recalled that the following expression is required for the fast Fourier transformation [11]: where Using the following correspondence between χ j and γ j χ j can be evaluated using the fast Fourier transformation, and thus the time complexity of calculating Eq. (29) reduces to O(M log(M )), where M is the number of moments kept.
The off diagonal elements of the Green's function, G ov and G vo , can be evaluated similarly if we use the follow mixed elements of the Green's function (note that i below is the imaginary unit): G ov and G vo are now expressed by the diagonal elements of the Green's functions: Using these four elements of the 2 × 2 Green's function, we can readily calculate d(z) and thus ∆ seff can be evaluated.Finally, it should be noted that if Hamiltonian matrix H is stored in a compression format, the time complexity of Eq. ( 26) is O(N M ), where N is the dimension of H.As mentioned above, the time complexity to calculate Eq. ( 24) is O(M log(M )), and thus the total time complexity is expected to be O(N M ) + O(M log(M )).When M is fixed, the time complexity should scale linearly with the dimension of H. Indeed, we find, as shown in Fig. 1, that the execution time is approximately proportional to the Hamiltonian size N .), but for the GFMC method it is almost linear.

Algorithms
For a given temperature T , we use Algorithm 1 to calculate the magnetization M for the classical spins through the average over S Monte Carlo sweeps (the loop from line 1), where each sweep corresponds to N spin trail flips (the loop from line 2).Since the direction of the magnetization is trivial, the magnetization M is defined here as the length of the total spin vector.
In the implementation, we apply the Metropolis method (line 10 in the Algorithm 1) to determine whether a trail flip is accepted by comparing with a random number between 0 and 1.
Section 2 has illustrated how to calculate the ∆ seff (from line 4 to 9) using the GFMC method.Especially, the calculation of the expansion coefficients µ m plays a curial role (line 6).This part occupies most of the execution time since the recursion [denoted by Eq. ( 26)] involves intensive matrix-vector multiplication (MVM) with complexity of O(N ).Note, however, that GPU is an ideal platform to parallelize MVM, because the multiplications between the rows and the vector could be distributed to hundreds of streaming processors.Therefore, we focus on a GPU implementation with the highly parallelism and expect a large speedup factor as compared with the CPU one.

Parallelization Frameworks
The magnetization M as a function of temperature T is obtained through the average over a large number of Monte Carlo sweeps.If the system stays in the thermal equilibrium, the Monte Carlo sweeps are independent with each other.Therefore, we could divide the outside loop S (line 1) into many threads.However, the warm up sweeps, which are necessary to achieve the thermal equilibrium, should be abandoned for taking the average, and this condition may prevent us from dividing S to too many threads because some threads may have too few sweeps to reach thermal equilibrium.
In addition to this parallelism for the Monte Carlo sampling, the intensive matrix-vector and vector-vector multiplications in line 6 can be effectively parallelized into multi-core CPU processors using OpenMP or many-core GPU processors using CUDA.In this paper, we focus on the implementation based on GPU to archive higher parallelism than multi-core CPU.
In our algorithm, we combine two parallelizations to achieve high efficiency, i.e., the Monte Carlo sweeps are divided into several MPI threads while in each thread we employ GPU to calculate matrix-vector operations.Calculate the modification matrix ∆ ← H − H ′ using Eq. ( 3) Calculate 4 elements of the Green's function G oo , G vv , G o+iv,o+iv and G o−iv,o−iv using Eq. ( 24) Calculate d(z) using Eq. ( 22) Calculate ∆ seff using Eq. ( 14) Accept the new spin configuration. 12: Update H : H ← H + ∆ For the simple cubic lattice used in this paper, each site has 6 nearest neighbors that lead to 6 non-zero entries in each row.This case is very suitable for ELLPack format [12] because no extra data need to be padded to the end of the rows.ELLPack format needs two matrices to store the non-zero entries and column indices, respectively.On GPU device, the two matrices are stored in column major to trigger coalesced memory access.

Arrangement of CUDA Kernel Functions
One important issue that should be well addressed is the communication between GPU and CPU.Since the Hamiltonian matrix and the spin configurations may be updated after each iteration, if the Hamiltonian matrix are stored in CPU memory, the data transfer between CPU and GPU may significantly decrease the performance.This is especially true when the lattice size is small because the execution time for numerical calculation occupies a low percentage of the overall time consumption.According to our test, the simulation of 6 × 6 × 6 cubic lattice on a 8x PCIe Tesla C2050 is about 30% slower than on a 16x PCIe C2050.Therefore, the data transfer may not only decrease the performance, but also make the program to be more dependent on the bandwidth between the CPU and GPU.
To avoid the memory transfer between CPU and GPU as much as possible, in our implementation the matrix H and the spin configurations are kept in GPU memory.
With these considerations, we propose a multi-kernel GPU implemenation shown in the Figure 2, in which the kernel functions are indicated by "<<<>>>" and arranged as the Function "FlipSpinAndCalculateDelta" flips a spin with a random direction and calculates the Hamiltonian matrix difference ∆, which is used in the function "PrepareVector" to create four vectors {|o , |v , |o + i|v , |o − i|v }.For each vector, the function "CalculateCoefficient" calculates two expansion coefficients µ m in every iteration.After all the coefficients are ready, function "PrepareForFFT" prepares the data for fast Fourier transformation (FFT), including applying Lorentz kernel function and transform Eq. ( 24) to meet the requirement of FFT.After performing FFT by function "PerformsFFT", we obtain ∆ seff using function "CalculateDSEFF".If the flip is to be accepted, function "UpdateStatus" will update the spin configuration and the Hamiltonian matrix.After one MC sweep finishes, function "MeasureSpin" accumulates the spins to calculate the magnetization.

Implementation of the Recursion
The most important kernel function is "CalculateCoefficient", which involves the most intensive computation due to the recursive matrix-vector operations.Figure 3 demonstrates the implementation of this CUDA kernel function to calculate the coefficients µ 2m .In CUDA architecture, the parallel running threads are divided into several thread groups, called "Blocks".For example, in Figure 3 there are p thread blocks and each block has 4 threads, indexed from 0 to 3. Each thread performs the multiplication between a row and the vector |a m−1 that is stored in a 1D array R1.Since the produced vector |a m  could be placed on R0 overwriting the vector |a m−2 , we only need two 1D memory arrays, R0 and R1, to perform calculations by exchanging their pointers after each iteration.
After the vector |a m is ready, a production a m |a m is performed, shared memory is applied to store the product result.a m |a m is a dot product, for CUDA, it is a reduction problem, which will be carried out in two steps, first, the product result in each block is accumulated to produce a scalar and then the scalar in each block is further accumulated to get coefficient µ 2m using Eq.28.
This implementation could eliminate the most data transfers during the calculation.The bandwidth test shows the performance difference between a 8x and 16x PCIe based Tesla C2050 is very small (less than 6%) while calculating a 6 3 cubic lattice.

Validation
In order to check the availability of our implementation, The DE model on the simple cubic lattice is simulated.The magnetization M as a function of temperature T is examined and the results are shown in Figure 4.The simulations are performed with the following parameters: Chebyshev truncation number is 256, chemical potential µ is 0, the average is taken over 5000 MC sweeps.

Performance evaluation 5.1. Experimental Setup
The performance evaluation is performed on a cluster system that has 16 nodes of the NEC LX series connected with 40GByte/sec Infiniband network via the switch.Each node has two NVIDIA Tesla M2050 GPUs connected to the separate PCI Express buses and 12 CPU cores (Xeon E5645 2.4GHz).The OS of the node is CentOS 6 and the graphics driver with CUDA 4.0 is installed.The compiler for CPU is gcc-4.4.4.Two CPUs are tightly coupled with sharing the memory bus (totally 12 CPU cores in a node).Amount of total memory is 12GBytes per node.All experiments performed below use the compiler option "-O3" for CPU and GPU programs.A head node of the cluster is also connected to the cluster via the Infiniband network to serve the home file system in order to share the program resources.
We evaluate the performance in two ways.The first is to measure the performance of a single CPU core and a single GPU with different Hamiltonian matrix sizes.The second is to evaluate   !"#$%&'($"()*$+, -.,/0&$'()1$#,)23,/4)$()5'6"7$&8#)*/"%,) *9,,:09)2;<=>?<=4 mutliple MPI threads and there is very little communications among these threads, in an ideal case of parallelization, y should be a constant.However, we have noticed that when the threads number becomes larger than 4, y begins to increase fast, suggesting that the efficiency become to decrease rapidly.It indicates that the memory bandwidth becomes saturated when nprocs grows to be larger than 4. Therefore, we can conclude that the bottleneck of this method is memory bandwidth.In this paper we resort to solve the problem using Tesla C2050 GPU with maximum bandwidth of 144GB/s [9].

Performance Scaling for Increasing Hamiltonian Size on Cluster
Figure 6 shows the performance scaling on the cluster while the Hamiltonian size is increasing from 6 3 to 32 3 .38400 Monte Carlo trail flips are calculated while the number of Chebyshev moments are kept as 256.It can be observed that when the H matrix is small, e.g. 6 3 or 8 3 , 192 CPU cores are much faster than 32 GPUs.To explain the reason, let us divide the total time consumption of GPU program into two parts: a) the time for numerical calculation and b) the time for other works such as initializing GPU device, copying memory between CPU and GPU, and MPI communication, etc. Usually the latter part consumes very little time, but when the matrix size is small, comparing with the first part, time consumption of the latter part can not be neglected and significantly decrease the performance.However, the speedup increases rapidly with increasing H matrix size and in the best case, i.e. 32 3 , 32GPU is about over than 5 times faster than 192 CPU cores.If we define the speedup factor as the equivalent number of CPU cores comparing with one GPU.For Hamiltonian with size of 32 3 , this factor is about 30 (192/32 * 5).

Performance Considerations
As shown in Figure 5, the bottleneck of this algorithm is the memory bandwidth, GPU could solve the memory bandwidth limitation very well and shows high speedup ratio.Due to the nature that there is little communication among MPI threads, this algorithm is also capable to run on many compute nodes while keeping good parallelization efficiency.However, due to the thermal equilibrium problem explained in Section 3.2, the of number of MPI threads is limited, in this case, a combination of parallelization techniques such as MPI, OpenMP and CUDA should be introduced.For example, the workload is firstly distributed into different nodes using MPI, and then in each node we use OpenMP to calculate the MVM, if the CUDA is capable, the OpenMP thread may employ GPU to calculate the MVM.The combination can increase the parallelism and gain high performance.However, it should be pointed out that the process to combine these techniques may not be very straightforward since we have to take trade-offs in many occasions.

Conclusion and Future Work
In this paper, we have provided the detailed formulation of the GFMC method for the DE model.We have proposed implementation methods to parallelize the GFMC method using MPI on CPU as well as CUDA on GPU.In order to eliminate the data transfer between CPU and GPU as much as possible, the program is implemented on GPU using several kernel functions.The performance evaluation indicates that the GPU implementation could effectively overcome the CPU memory bandwidth limitation and shows more than 30 speedup factor when Hamiltonian size is 32 3 .The performance scaling test for increasing number of processors indicates that the MPI parallelization is very effective for this algorithm.Finally we discussed more considerations on the performance for future optimizations.
For the future work, we will implement this method as a GPU based library that can be used for other lattice models.
H is the renormalized Hamiltonian to fit the spectra within [−1, 1].Defining |a m = T m ( H)|i and thus µ m = i|a m , the Chebyshev series in Eq. (23) yeilds the recursive relation of these vectors |a m , namely,

Figure 1 .
Figure1.Hamiltonian size dependence of the execution time for (a) the exact diagonalization method[1,2,3] and (b)the GFMC method.Here, for simplicity, we performed only 10 Monte Carlo trial flips of spins, and M is fixed for different Hamiltonian sizes.We can clearly see that the time consumption for the exact diagnolization method roughly follow the complexity of O(N 3 ), but for the GFMC method it is almost linear.

Algorithm 1 2 : 3 :
Calculate the magnetization as a function of temperature Require: Integer S to represent number of Monte Carlo sampling sweeps Require: Hamiltonian matrix H of dimension N × N Require: Scalar M to store accumulated magnetization 1: for i = 1 → S do for j = 1 → N do Randomly choose site o and change the spin randomly 4:

3 . 3 .
Implementation on GPU 3.3.1.Compression Format of Hamiltonian Matrix Because the Hamiltonian matrix may be very large, e.g.20 3 × 20 3 , an effective compression technique must be introduced to reduce not only the memory consumption but also unnecessary memory access.The compression storage format of the Hamiltonian matrix may varies with different lattices and different physics models.

Figure 4 .Figure 5 .
Figure 4. Magnetization M as a function of temperature T for different sizes (indicated in the figure) of the simple cubic lattice.

5. 2 .
Performance Scaling on Multi-core CPU Section 3.1 has explained that the process of calculating coefficients μ involves intensive MVM operations.The performance of MVM largely depends on the memory access speed.Therefore, the memory bandwidth plays an essential role in the overall performance.This conclusion is consistent with our mutli-core CPU performance scaling test shown in Figure5.This test is performed on a single node that has 2 CPUs, totally 12 cores and 12GB DDR2 800 memory with a peak bandwidth of 12.5GB/s.In order to evaluate the parallelization efficiency, we examined the quantity expressed by y = execution time×nprocs, in which nprocs represents the number of MPI threads.As the outside loop [from line 1] in Algorithm 1 is divided into

Figure 7 .
Figure 7. Performance scaling for increasing number of nodes, each node has 2 Tesla C2050 and 12 Xeon 2.4GHz cores.8000 Monte Carlo trail flips of a 20 3 cubic lattice Hamiltonian are calculated with the Chebyshev truncation number kept as 256.The blue/green lines represent consumption of CPU cores/GPUs