High performance PIC plasma simulation with modern GPUs

With the recent Nvidia Tesla V100, a performance of 0.5 TFLOPS was achieved for 3D Particle-In-Cell plasma simulation. The paper includes a brief description of the simulation algorithm, the details of the GPU implementation and a performance analysis with different GPUs.


Introduction
The main idea of this work is to achieve high performance with supercomputers and GPUs in order to conduct high-resolution 3D simulations of the fusion plasma in different experimental facilities.
The physical motivation of this work is to investigate the effect of anomalous heat conductivity observed at the GOL-3 fusion facility in the Budker Institute of Nuclear Physics [1]. The GOL-3 facility is a long open trap where the dense plasma is heated in a strong magnetic field during the injection of a powerful relativistic electron beam of microsecond duration. The effect is the decrease of the plasma electron heat conductivity by a factor of 100 to 1000 compared to the classical value for a plasma with the temperature and density observed in the experiment. The anomalous heat conductivity arises because of the turbulence caused by the relaxation of the relativistic electron beam in the high-temperature Maxwellian plasma. At present, the effect has only a tentative theoretical explanation [2].
The goal of the theoretical and modelling studies is to determine the origin and mechanism of the heat conductivity decrease. This is of great importance for fusion devices, because the effect of anomalous heat conductivity helps both to heat the plasma and to confine it.
Like other plasma physics problems with a non-Maxwellian velocity distribution, this problem requires kinetic treatment: either a direct finite-difference solution of the Vlasov equation or the Particle-In-Cell (PIC) method [3]. While the PIC method is noisy and time-consuming, it is in many cases the only method able to treat the problem without non-physical simplifications.
There are many PIC packages, such as OSIRIS [4], LCODE [5] or VLPL [6], including some adapted for GPUs, e.g. PIConGPU [7] and ALaDyn [8], and also works devoted to particular questions of the GPU implementation of the PIC method, like [9] and [10]. Since they are all problem-specific, we developed our own implementation. The PIC method consists of two stages: field evaluation and particle pushing. Computational experiments with general-purpose computers (e.g. the Lomonosov supercomputer at MSU) show that most of the time (from 60% to 90%) is spent in particle pushing. Since the total time is large (about a week with 128 Intel Xeon cores for a 3D problem with just 100 × 4 × 4 mesh nodes and 10000 particles per cell [11]), and further study requires meshes of much bigger size, running on GPUs is required.

Basic equations
The basic equations are the Vlasov equations for the ion and electron components of the plasma together with the Maxwell equation system. In the usual notation these equations have the following form:
\[
\frac{\partial f_{i,e}}{\partial t} + \mathbf{v}\cdot\frac{\partial f_{i,e}}{\partial \mathbf{r}} + \mathbf{F}_{i,e}\cdot\frac{\partial f_{i,e}}{\partial \mathbf{p}} = 0, \qquad
\mathbf{F}_{i,e} = q_{i,e}\left(\mathbf{E} + \frac{1}{c}\,\mathbf{v}\times\mathbf{B}\right),
\]
\[
\frac{\partial \mathbf{B}}{\partial t} = -c\,\nabla\times\mathbf{E}, \qquad
\frac{\partial \mathbf{E}}{\partial t} = c\,\nabla\times\mathbf{B} - 4\pi\mathbf{j}, \qquad
\nabla\cdot\mathbf{E} = 4\pi\rho, \qquad
\nabla\cdot\mathbf{B} = 0.
\]
Here f_{i,e} is the distribution function for ions and electrons, respectively, t is time, v is the velocity vector, r is the coordinate vector, F is the Lorentz force, p is the momentum, E is the electric field, B is the magnetic field, ρ is the charge density and j is the current density.
All the equations below are given in non-dimensional units. The following quantities are used for the transition to the non-dimensional form:
• velocity of light c = 3 × 10^10 cm/s;
• electron mass m_e = 9.1 × 10^−28 g;
• plasma density n_0 = 10^14 cm^−3;
• time t = ω_pe^−1, where the electron plasma frequency ω_pe = 5.6 × 10^11 s^−1.
The 3D computational domain has the shape of a cube with the following dimensions: Within this domain there is the model plasma, which consists of electrons and hydrogen ions. The model plasma particles are distributed uniformly within the domain. The plasma density is set by the user, as well as the electron temperature; the temperature of the ions is considered to be zero. The initial velocity distribution of the particles is Maxwellian:
\[
f(\mathbf{v}) = \frac{1}{(2\pi)^{3/2}\,\Delta v^{3}}\,\exp\!\left(-\frac{(\mathbf{v}-\mathbf{v}_0)^2}{2\,\Delta v^{2}}\right),
\]
here Δv is the electron velocity dispersion and v_0 is the average velocity (v_0 = 0 for plasma particles). Beam electrons are also uniformly distributed over the domain. Thus the beam is considered to be already present in the plasma, and the effects that occur while the beam is entering the plasma are beyond the scope of this study.
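The dimensional scales listed above can be checked numerically. Below is a minimal sketch in Python; the constants are the CGS values quoted in the text, while the elementary charge value and the sampler name `sample_maxwellian` are assumptions introduced for illustration only:

```python
import math
import random

# CGS constants from the text
c = 3.0e10        # speed of light, cm/s
m_e = 9.1e-28     # electron mass, g
n0 = 1.0e14       # plasma density, cm^-3
e = 4.8032e-10    # elementary charge, statcoulomb (standard CGS value, assumed)

# Electron plasma frequency: omega_pe = sqrt(4*pi*n0*e^2 / m_e);
# this reproduces the quoted value omega_pe ~ 5.6e11 s^-1,
# whose inverse serves as the unit of time.
omega_pe = math.sqrt(4.0 * math.pi * n0 * e * e / m_e)

def sample_maxwellian(dv, v0=(0.0, 0.0, 0.0)):
    """Draw one velocity from an isotropic Maxwellian with dispersion dv
    around the average velocity v0 (v0 = 0 for plasma particles)."""
    return tuple(random.gauss(v0_i, dv) for v0_i in v0)
```

Each velocity component is drawn independently as a Gaussian with standard deviation Δv, which is equivalent to sampling the isotropic Maxwellian above.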

Details on Physical Statement of the Problem
The physical statement of the problem exactly follows the simulations in [11]; the difference is in the simulation time, which in the present case is orders of magnitude smaller. As stated in [11], beam relaxation may proceed in three modes, depending on the instability increment γ and the beam velocity dispersion Δv: kinetic (kΔv ≫ γ), hydrodynamic (kΔv ≪ γ) and transit (kΔv ∼ γ). Here k is the wavenumber. The goal of the computations is not to simulate the facility as a whole, since no supercomputer is able to do that now or in the foreseeable future. Instead, we are going to reveal the underlying instabilities and reproduce the turbulence observed in the experiments.
Thus the computational domain corresponds to a very small (characteristic size ∼ 10^−3 cm) volume of plasma.
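The mode classification above can be expressed as a small helper. The sketch below assumes the comparison of kΔv with γ is made up to an order-of-magnitude factor; the function name and the threshold `ratio_margin` are illustrative choices, not taken from the paper:

```python
def relaxation_mode(k, dv, gamma, ratio_margin=10.0):
    """Classify beam relaxation by comparing k*dv with the increment gamma:
    kinetic (k*dv >> gamma), hydrodynamic (k*dv << gamma),
    transit (k*dv ~ gamma). ratio_margin sets what counts as 'much larger'
    (a hypothetical order-of-magnitude threshold)."""
    ratio = k * dv / gamma
    if ratio > ratio_margin:
        return "kinetic"
    if ratio < 1.0 / ratio_margin:
        return "hydrodynamic"
    return "transit"
```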

Description of the GPU implementation
The Vlasov equation is solved by the Particle-In-Cell method [3]. In order to understand the ideas and bottlenecks of the GPU implementation of the PIC method, let us first consider the underlying numerical techniques.

Numerical method
The essence of the PIC method is the substitution of the Vlasov equation by a set of characteristic equations. The characteristic equations exactly resemble the Newton motion equations for plasma particles, and for this reason they are called the motion equations for model particles.
One must note that these model particles are just a mathematical tool for solving the Vlasov equation. This is not a replacement of the very large number (N ∼ 10^23) of small particles in the real plasma by a small number (N ∼ 10^10) of big particles in the model plasma.

Particle push
In such a way, the equations of motion for the model particles are the following:
\[
\frac{d\mathbf{p}_{i,e}}{dt} = q_{i,e}\left(\mathbf{E} + \mathbf{v}_{i,e}\times\mathbf{B}\right), \qquad
\frac{d\mathbf{r}_{i,e}}{dt} = \mathbf{v}_{i,e},
\tag{2}
\]
here i, e are the indices for ions and electrons, respectively, p is the momentum, r_{i,e} is the coordinate vector, E is the electric field, v is the velocity and B is the magnetic field. Equations (2) are solved by the leapfrog scheme:
\[
\frac{\mathbf{p}^{\,m+1/2} - \mathbf{p}^{\,m-1/2}}{\tau} = q\left(\mathbf{E}^{\,m} + \mathbf{v}^{\,m}\times\mathbf{B}^{\,m}\right), \qquad
\frac{\mathbf{r}^{\,m+1} - \mathbf{r}^{\,m}}{\tau} = \mathbf{v}^{\,m+1/2},
\]
here τ is the timestep and m is the timestep number.
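A minimal single-particle sketch of the leapfrog update in Python is given below. It is non-relativistic and uses a simple explicit approximation for the v × B term; the production pusher resolves the implicit velocity dependence (e.g. by a Boris-type rotation), which is omitted here for brevity, and the function name is illustrative:

```python
def leapfrog_step(r, p, E, B, q, m, tau):
    """One leapfrog step: p is staggered half a step relative to r.
    Advances p^(m-1/2) -> p^(m+1/2) with the Lorentz force at step m,
    then r^m -> r^(m+1) with the new momentum (v = p/m, non-relativistic)."""
    v = [pi / m for pi in p]  # velocity at step m (explicit approximation)
    # Lorentz force q*(E + v x B), component by component
    force = [q * (E[i]
                  + v[(i + 1) % 3] * B[(i + 2) % 3]
                  - v[(i + 2) % 3] * B[(i + 1) % 3]) for i in range(3)]
    p_new = [p[i] + tau * force[i] for i in range(3)]
    v_half = [pi / m for pi in p_new]  # velocity at step m+1/2
    r_new = [r[i] + tau * v_half[i] for i in range(3)]
    return r_new, p_new
```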

GPU implementation principles
The following main principles are used to achieve high GPU performance:
• shared memory is used for computations with particles;
• particles are stored in arrays attached to cells in order to use the cache more efficiently;
• field and current values (8 of them for each field and current component) are computed and stored in cells;
• particle reordering is used to speed up the particle pusher.
The computational algorithm consists of the following stages performed at each time step:
• particle push, that is, the computation of the new coordinates and momenta of each particle (at present, the current evaluation is included in the push) [12, 13];
• electromagnetic field evaluation [14];
• particle reordering, which means the transfer of particles to the cells they belong to [15];
• copying of currents from cells to the global arrays [14];
• copying of fields from global arrays to cells [14].
For each of these stages a dedicated kernel is used in the GPU implementation. The implementation of each kernel is described in more detail in the above-listed papers.
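The reordering stage can be illustrated with a minimal host-side sketch in Python. The cell-attached storage is modeled by plain per-cell lists, and the names `cell_index` and `reorder` are illustrative, not the actual kernel API:

```python
def cell_index(pos, cell_size, ncells):
    """Map a coordinate triple to the (ix, iy, iz) index of its cell."""
    return tuple(min(int(pos[d] / cell_size[d]), ncells[d] - 1)
                 for d in range(3))

def reorder(cells, cell_size, ncells):
    """Move every particle to the list of the cell it now belongs to.
    cells: dict mapping (ix, iy, iz) -> list of particle positions."""
    moved = []
    for idx, plist in cells.items():
        stay = []
        for pos in plist:
            home = cell_index(pos, cell_size, ncells)
            if home == idx:
                stay.append(pos)
            else:
                moved.append((home, pos))
        cells[idx] = stay
    for home, pos in moved:
        cells.setdefault(home, []).append(pos)
    return cells
```

Keeping particles grouped by cell in this way is what lets the pusher read the 8 per-cell field values from fast memory instead of scattered global loads.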

Performance measurements with different GPUs
The results of profiling the code execution are presented in table 1.
Performance is evaluated in the following way: the computational domain has 6.4 million particles, and about 500 floating point operations are spent per particle per time step. The total number of operations is then divided by the measured pushing time. The other parts of the numerical algorithm, such as field evaluation, are not considered, since they play almost no role both in terms of time and of flops. A fragment of the profiler output is shown in figure 1. One can see the device name (Tesla V100 SXM2) as well as the kernel duration (6.58132 ms). It can also be noticed that there are no big pauses between the kernels, so the device is busy most of the time.
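The arithmetic behind the quoted figure can be reproduced directly. The sketch below uses the numbers from the text, assuming the 6.58132 ms kernel duration from figure 1 corresponds to the push stage:

```python
n_particles = 6.4e6        # particles in the computational domain
flops_per_particle = 500   # floating point operations per particle per step
push_time = 6.58132e-3     # measured pusher kernel duration, s (figure 1)

total_ops = n_particles * flops_per_particle  # 3.2e9 operations per step
performance = total_ops / push_time           # operations per second
tflops = performance / 1.0e12                 # ~0.49, i.e. near 0.5 TFLOPS
```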

Conclusion
The achievement of nearly 0.5 TFLOPS performance by a single NVidia Tesla V100 for 3D Particle-In-Cell simulation is presented. It means that it is now possible to perform a simulation of the size mentioned in the Introduction in 4 hours with one V100 instead of a week with an ordinary cluster of 128 cores (the cluster at the Siberian Supercomputer Center, based on rather old Intel Xeon 5540 4-core processors).
The performance shown in the paper is lower than, though comparable to, the top Particle-In-Cell simulation speed of 1 TFLOPS shown in [16] for a different architecture (a single Intel Xeon Phi). Future plans are to reduce the current accumulation time by eliminating atomic operations and thus shorten the pushing time.