Tackling exascale systems for astrophysics

Exascale systems are now appearing in the US and Japan, in Europe (with two exascale procurements planned in the coming years) and very likely in China. These systems offer great opportunities for the numerical astrophysics community but also raise serious technical challenges. Many legacy codes need to be fully rewritten to benefit from exascale architectures. Performance portability is a key issue, as is the management of the huge amount of data generated by exascale simulations. In this paper we discuss the path we have chosen towards exascale and the software engineering solutions adopted to address long-term performance portability and data management. A new radiation hydrodynamics code based on the Kokkos library is presented, together with test runs and a preliminary performance analysis. The first target of this new code is the study of supernova ejecta. These are multidimensional fluids, structured on both small and large scales by the non-linear development of complex fluid instabilities during and after the explosion. They contain a wealth of information about stellar composition and the evolution of elements in the interstellar medium. Their study requires significant computing resources that can be met through the use of exascale machines and suitable software.


Introduction
The life of a massive star is marked by stages of nuclear fusion in its core, from hydrogen to iron. When a star reaches its final stage, its life ends in a cataclysmic collapse followed by a violent explosion, known as a core-collapse supernova [1], leaving a neutron star at its core. The ejected envelope, originally stratified in composition, undergoes multidimensional fluid instabilities. These instabilities are driven by factors such as the original asymmetry of the explosion engine, the development of Rayleigh-Taylor instabilities around density inversions, or the ⁵⁶Ni bubble effect [2]. To accurately describe the behavior of this multidimensional fluid, it is necessary to employ large-scale simulations [3]. While the explosion of the star occurs on a large spatial scale, the instabilities act on small scales. We thus need to resolve short time scales, of the order of a second, while the evolution of the ejecta must be simulated over the course of days. This results in extremely computationally intensive three-dimensional simulations. To produce numerical results efficiently, it is essential to have suitable parallel algorithms along with adequate hardware.
Luckily, we have entered the exascale era with a significant leap in the advances of hardware and heterogeneous architectures. Computational resources are no longer limited in the use of

Exascale context
In the past decade, HPC has experienced a major shift both in terms of hardware and of usage. The emergence of GPUs and data science has deep consequences on the way HPC is used in science. The exascale era will offer the ability to solve large and complex problems, opening up new possibilities for research by combining huge computing power and data analytics. However, to fully exploit these possibilities it becomes essential for scientists to develop software and numerical approaches that can fully harness the benefits of exascale hardware and data science. This includes optimizing applications and developing new programming models to improve performance, long-lasting usability and data management.
The TOP500 ranking [4] is a widely recognized list of the world's top supercomputers, updated every six months. It presents benchmark results such as LINPACK or HPCG. In the latest ranking, updated in June 2023, the Frontier system at Oak Ridge National Laboratory (ORNL) in the US appears as the first true exascale machine, with a peak performance of 1.102 Exaflop/s. The Jupiter system, just announced for 2024, will be the first European supercomputer to surpass the threshold of one exaflop/s, i.e. 10^18 calculations per second. Both these systems, and 7 out of the top 10 machines of the TOP500, rely on GPUs for most of their computing power.
GPU-based supercomputers are equipped with chips from two main vendors, namely AMD and NVIDIA, with Intel soon to join. Each of them supports different programming languages: CUDA for NVIDIA, HIP/ROCm for AMD and oneAPI for Intel. They are often supplemented with additional models, typically offering a higher-level abstraction, such as OpenMP, OpenACC, Kokkos, or SYCL. To answer this challenge of developing software that can perform efficiently on heterogeneous architectures, we chose to work with the Kokkos library [7]. The Kokkos EcoSystem offers a performance portability solution by providing a unified C++ programming model for different architectures. It is presented as a suite of libraries for fundamental computational algorithms along with tools for profiling and debugging. The library is developed and maintained by a multi-institutional team of HPC experts, scientists, and engineers.
The implementation of Kokkos relies on advanced C++ features such as template programming and function objects. It interfaces with existing shared-memory programming models like OpenMP, CUDA, and HIP, offering a level of abstraction that enhances user-friendliness. It utilizes abstract concepts such as execution spaces, which determine where a computation is executed, and execution policies, which specify how the computation is executed. Similarly, for memory management, it addresses memory spaces, where data resides, and memory layouts, which define how multidimensional data is organized in memory. These concepts help manage how parallel operations are scheduled, executed, and synchronized, providing essential flexibility and control for HPC.
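As a minimal illustration of these concepts (a generic sketch, not code taken from HERACLES++), the following C++ fragment allocates a Kokkos view in the default memory space and launches parallel kernels through the default execution space; the array name and sizes are arbitrary:

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {
            // 2-D view allocated in the default memory space (device memory when
            // a GPU backend is enabled); the layout is chosen to match the backend.
            Kokkos::View<double**> rho("rho", 128, 128);

            // The same parallel pattern compiles for OpenMP, CUDA or HIP backends.
            Kokkos::parallel_for(
                "FillDensity",
                Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {128, 128}),
                KOKKOS_LAMBDA(int i, int j) { rho(i, j) = 1.0; });

            // Reductions follow the same pattern with parallel_reduce.
            double total = 0.0;
            Kokkos::parallel_reduce(
                "SumDensity",
                Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {128, 128}),
                KOKKOS_LAMBDA(int i, int j, double& local) { local += rho(i, j); },
                total);
        }
        Kokkos::finalize();
        return 0;
    }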

Physical model
The basic equations solved by the code are the Euler equations expressing the conservation of mass, momentum, and energy. Additional components such as gravity or heating can be added as source terms. Radiative transfer shall be considered next as it is a key element for supernova ejecta. These equations can be reformulated using the generalized vector of conserved quantities U = (ρ, ρu, E) as

∂U/∂t + ∇ · F(U) = S,

where ρ is the density, u the velocity, E the total energy per unit volume, F the flux of the conserved quantity U, such that F(U) = (ρu, ρu ⊗ u + P I, u[E + P]) with P = P_eos(ρ, e, T) the pressure and e the specific internal energy, and S a source term.
Concerning the equation of state (EOS), the code is designed to be able to handle any EOS either analytical or tabulated.In the present paper, we are considering two specific equations of state: the perfect gas equation and the perfect gas equation supplemented with radiation which represents a simplified implementation of radiation in a purely hydrodynamic code.
P_gas = ρe(γ − 1) and P_gas+rad = ρe(γ − 1) + a_r T⁴/3, where γ is the adiabatic index, T represents the gas temperature and a_r the radiative constant.
The equation for P_gas+rad applies to gases when the mean free path of photons is significantly smaller than the size of the system. The radiation is then at local thermal equilibrium with the gas and the temperatures of the gas and of the radiation are equal. This flexibility in the choice of the equation of state allows the modeling to be adapted to the specific characteristics of the system under study, whether it involves the standard conditions of a perfect gas or situations involving radiative processes.
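As an illustration, a minimal sketch of these two analytic pressure laws could look as follows (hypothetical function names, not the HERACLES++ EOS interface; the radiative constant is given in SI units):

    #include <cmath>

    // Radiative constant a_r [J m^-3 K^-4].
    constexpr double a_r = 7.5657e-16;

    // Perfect gas: P = rho * e * (gamma - 1).
    double pressure_gas(double rho, double e, double gamma) {
        return rho * e * (gamma - 1.0);
    }

    // Perfect gas plus radiation at local thermal equilibrium:
    // the radiation pressure a_r T^4 / 3 is added to the gas pressure.
    double pressure_gas_rad(double rho, double e, double gamma, double T) {
        return pressure_gas(rho, e, gamma) + a_r * std::pow(T, 4) / 3.0;
    }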
The code supports the treatment of passive scalars, which is valuable for understanding mixing in heterogeneous media like supernova ejecta. The passive scalars, f_x, behave like tracers, facilitating the observation and analysis of fluid behavior and evolution within the medium without altering the fundamental properties of the fluid itself. They are also necessary when including processes like chemistry or heating from the radioactive decay of ⁵⁶Ni.
Their evolution is given by

∂(ρ f_x)/∂t + ∇ · (ρ f_x u) = 0,

with x representing the different species followed by the passive scalars. The code is also adapted for various gravitational source terms such as a point mass or a constant gravitational field. Self-gravitating fluids will only be considered at a later stage since they are not really necessary for supernova ejecta.

Numerical scheme
In this section, we introduce the notation for the space-time discretization. For simplicity, we concentrate on a one-dimensional problem. The positions of the mesh nodes are denoted x_{i+1/2} and the center of the cell (x_{i−1/2}, x_{i+1/2}) is denoted x_i. The size of the cell is δx_i. The time interval between t^n and t^{n+1} is denoted δt. The notation U^n_i represents the average quantity at time t^n over the cell x_i.
The MUSCL-Hancock (Monotone Upstream-centered Schemes for Conservation Laws) scheme [8] is a commonly used numerical method in fluid mechanics to solve partial differential equations. It is a practical choice for achieving second-order accuracy in numerical simulations. Its ability to accurately represent gradients and discontinuities, along with its adaptability and conservation properties, makes it a reliable tool for various applications, such as fluid dynamics and compressible flows. The main idea behind this scheme is to achieve a better approximation of solution gradients in order to prevent undesirable oscillations near discontinuities. The method consists of two steps before advancing to the next time step. The first step is a reconstruction of the quantities on each side of the interface with a slope limiter of our choice, as referenced in [9]. These quantities are then evolved over a half time step, giving the interface states U^{n+1/2}_{i±1/2}. Finally, the Godunov scheme [10] is used to advance the values to the next time step:

U^{n+1}_i = U^n_i − (δt/δx_i) (F_{i+1/2} − F_{i−1/2}) + δt S_i,     (4)

with F_{i±1/2} the fluxes returned by a Riemann solver describing the flow at the cell interfaces, evaluated from the reconstructed states U^{n+1/2}_{i±1/2}, and S_i the source term.
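As an illustration of the reconstruction step, a minimal one-dimensional sketch of a minmod-limited slope and the resulting interface states is given below (illustrative only; as described in the next paragraph, the code is modular and other limiters can be chosen):

    #include <cmath>

    // Minmod limiter: returns 0 when the two slopes have opposite signs,
    // otherwise the one with the smallest magnitude.
    double minmod(double a, double b) {
        if (a * b <= 0.0) return 0.0;
        return (std::abs(a) < std::abs(b)) ? a : b;
    }

    // Limited linear reconstruction of a cell-averaged quantity U in cell i:
    // values at the left (i-1/2) and right (i+1/2) faces of the cell.
    // The caller must provide ghost cells so that U[i-1] and U[i+1] exist.
    void reconstruct(const double* U, int i, double& U_im, double& U_ip) {
        double slope = minmod(U[i] - U[i - 1], U[i + 1] - U[i]);
        U_im = U[i] - 0.5 * slope;  // state at the i-1/2 interface
        U_ip = U[i] + 0.5 * slope;  // state at the i+1/2 interface
    }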
The code is built in a modular way such that various slope limiters and Riemann solvers can be used. In the test cases presented in this paper, the HLL solver and either a minmod or a van Leer slope limiter were used [8].
The code is designed for a logically Cartesian grid. By default the grid is regular, but the user can implement any grid spacing needed for a particular problem, such as a logarithmic grid or another specialized grid. This applies to Cartesian, cylindrical or spherical coordinate systems in one, two or three spatial dimensions. The same flexibility is allowed for the boundary conditions. Free-flow, reflective and periodic boundary conditions are already implemented, and user-defined boundary conditions can easily be added.
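For example, a user-supplied logarithmic node distribution could be written along the following lines (the interface is hypothetical; only the idea of prescribing the node positions x_{i+1/2} is taken from the text):

    #include <cmath>
    #include <vector>

    // Node positions x_{i+1/2} of a logarithmic radial grid with n cells
    // between r_min and r_max.
    std::vector<double> log_grid(double r_min, double r_max, int n) {
        std::vector<double> nodes(n + 1);
        const double step = std::log(r_max / r_min) / n;
        for (int i = 0; i <= n; ++i) {
            nodes[i] = r_min * std::exp(i * step);
        }
        return nodes;
    }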
The numerical scheme in three dimensions applies the one-dimensional scheme in each of the three directions, taking into account geometrical terms where necessary. The global scheme fully couples all directions. See [6] for more details.

Implementation
We have chosen to design our code using MPI for distributed-memory parallelism and Kokkos for shared-memory computations. The global computational domain is divided equally into subdomains and each subdomain is assigned to one MPI (Message Passing Interface) process [11]. Within each subdomain, we use the Kokkos library [7] to exploit shared-memory parallelism.
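A minimal sketch of such a decomposition, using standard MPI calls to build a balanced 3-D process grid (illustrative only, not the HERACLES++ implementation), is shown below:

    #include <mpi.h>

    int main(int argc, char* argv[]) {
        MPI_Init(&argc, &argv);

        int nprocs, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Let MPI choose a balanced 3-D process grid.
        int dims[3] = {0, 0, 0};
        MPI_Dims_create(nprocs, 3, dims);

        // Build a Cartesian communicator; periodicity is handled separately
        // through the boundary conditions.
        int periods[3] = {0, 0, 0};
        MPI_Comm cart_comm;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart_comm);

        // The coordinates of this rank in the process grid identify the
        // subdomain it owns; Kokkos kernels then run inside that subdomain.
        int coords[3];
        MPI_Cart_coords(cart_comm, rank, 3, coords);

        MPI_Finalize();
        return 0;
    }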
Our code implementation has a modular structure where each module represents a given step of the hydrodynamic scheme. For example, we have a module for the face reconstruction step, a module for the Riemann solver, etc. More specifically, a module is composed of one or several Kokkos kernels. A kernel is a C++ functor, which is executed following the pattern Kokkos::parallel_for or Kokkos::parallel_reduce. Equation 4, for example, is implemented as such a kernel, in which U_new, U and S are Kokkos::View objects, the core data type of the Kokkos library, and an MDRangePolicy defines how to iterate over a multi-dimensional container; a sketch of this kernel is given below.
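In this sketch the view names U, U_new and S follow the text, while FluxX, the function signature and the one-dimensional flux differencing are illustrative assumptions rather than the actual HERACLES++ listing:

    #include <Kokkos_Core.hpp>

    using View3D = Kokkos::View<double***>;

    // Godunov update of Equation 4 over the whole subdomain.
    // FluxX holds the time-centred fluxes at the nx+1 cell interfaces.
    void godunov_update(View3D U_new, View3D U, View3D FluxX, View3D S,
                        double dt, double dx, int nx, int ny, int nz) {
        Kokkos::parallel_for(
            "GodunovUpdate",
            Kokkos::MDRangePolicy<Kokkos::Rank<3>>({0, 0, 0}, {nx, ny, nz}),
            KOKKOS_LAMBDA(int i, int j, int k) {
                // U^{n+1} = U^n - dt/dx * (F_{i+1/2} - F_{i-1/2}) + dt * S
                U_new(i, j, k) = U(i, j, k)
                    - dt / dx * (FluxX(i + 1, j, k) - FluxX(i, j, k))
                    + dt * S(i, j, k);
            });
    }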
Another aspect of our code worth mentioning is the use of the PDI library [12], which aims to decouple high-performance simulation codes from Input/Output (I/O) concerns. With PDI, we describe the I/O operations in a dedicated YAML file instead of interleaving them with the simulation code, so the code does not need to be recompiled when the I/O parameters change. For the configuration of a simulation, we load a file with all the necessary parameters for the numerical scheme and the domain decomposition. Then, for the initialization of the physical values, we can either use PDI to read them from an HDF5 file or initialise them inside HERACLES++. The output values are written with PDI in HDF5 and XMF files. PDI is also designed to easily couple simulation codes with (in-situ) data-analytics tools, which is a great asset for future usage of the code.

Hydrodynamics tests
The code has undergone extensive testing in one, two and three dimensions, covering a variety of hydrodynamical tests. We provide an overview of the tests conducted. For simplicity and speed, we assume an ideal gas equation of state. The two- and three-dimensional cases have been run on GPUs. The numerical scheme employed is the one previously described: the MUSCL scheme with the HLL Riemann solver and second-order discretizations. The slope limiter is applied to the primitive variables, using a van Leer slope limiter for the one- and two-dimensional tests and a minmod slope limiter for the three-dimensional cases. The tests are conducted using a regular Cartesian grid.

Sod shock tube
The classical shock-tube test is designed to demonstrate the code's shock-capturing capabilities. We consider the initial condition proposed in [8] for a perfect gas with adiabatic index γ = 1.4. The initial condition consists of two constant states in a domain x from 0 to 1, a high-pressure, high-density state on the left and a low-pressure, low-density state on the right, where ρ is the density, u the velocity and p the pressure. The boundary conditions are transmissive. This test is not particularly challenging for a modern HPC code but it is interesting because it exhibits all the fundamental hydrodynamical waves and is well documented. The exact solution consists of a leftward-propagating rarefaction wave, a rightward-propagating contact discontinuity and a rightward-propagating shock.
Figure 1 shows the analytic solution (solid line) and the result from our code at a resolution of 100 grid points (dots). The code correctly captures all of the different waves. The contact wave is the most diffuse, captured within approximately 4 grid points, while the shock wave spans around 2 grid points. The rarefaction wave is well described by our scheme.

Kelvin-Helmholtz instability
The Kelvin-Helmholtz instability is a multi-dimensional instability arising from a velocity shear within a fluid. We consider a two-dimensional perfect fluid in a rectangular domain, the same as in [13], with x ∈ [0, 4] and z ∈ [0, 2]. The initial conditions include a density jump, with a = 0.05, σ = 0.2, z_1 = 0.5 and z_2 = 1.5. The initial amplitude of the perturbation in the vertical velocity is A = 0.01 and the initial density perturbation is δρ/ρ_0 = 1.0. The initial lateral velocity constant u_0 is 1.0 and the adiabatic index is γ = 5/3 for an adiabatic gas. We ignore gravity for this test. The boundary conditions are periodic in the x direction and transmissive in the z direction. The instability results in the creation of small symmetrical vortices, which cascade into turbulence. This is a critical test: the Kelvin-Helmholtz instability is a fundamental instability in fluids with shear, especially in astrophysics, as is the Rayleigh-Taylor instability in the next section.
Figure 2 shows colormaps of the density at different times. The initial density perturbations evolve to form the characteristic vortices. At a qualitative level, these results compare satisfactorily with those shown in [5].

Rayleigh-Taylor instability
The top and bottom densities are set by the Atwood number A = (ρ_top − ρ_bot)/(ρ_top + ρ_bot). The initial pressure is set to maintain hydrostatic equilibrium. We track the evolution of the fluid using a passive scalar field f_x, initialized to 1 for positive z and 0 otherwise. For the boundary conditions, we assume periodicity in the x and y directions and reflectivity in the z direction. At time t = 0, the interface between the two fluids is perturbed by a multi-mode displacement built from products of sines and cosines of k_x X and k_y Y with random coefficients a_k, b_k, c_k and d_k, where X = 2πx/L, Y = 2πy/L and the wave numbers k_x and k_y take the values [0, 1, 2, 3, 4]. The coefficients a_k, b_k, c_k and d_k are sampled from a uniform distribution with values ranging from −1 to 1, excluding the boundaries. The normalization coefficient H is adjusted after the coefficients have been drawn so that h_rms = 3 × 10⁻⁴ L. The initial perturbation generates small buoyant bubbles that interact and grow into larger bubbles as the system evolves, as shown in Figure 3.

Off-center Sedov blast wave
Our final test is the classical Sedov self-similar blast wave [14] in spherical coordinates and in three dimensions, but here with an off-center placement of the initial blast. This test allows us to validate our geometrical source terms and is a relevant representation of the initial phase of a supernova blast. The initial conditions are similar to [15], with some adjustments. The domain is a portion of a sphere around the equator: r ∈ [0.5, 1.5], (θ, ϕ) ∈ [π/4, 3π/4]². Initially, we consider a gas at rest with ρ_0 = 1, P_0 = 10⁻² and γ = 5/3. The blast is off-center and occurs at the coordinates r_e = 1, (θ_e, ϕ_e) = (π/2, π/2). To initiate this explosion, we introduce a total energy E_1 = 1 in a single cell. We perform the test on a regular grid, and the boundary conditions are transmissive in all directions.
From [14], the analytical position of the shock is given by

r_sh(t) = ξ_0 (E_1 t² / ρ_0)^{1/5},

with ξ_0 ≈ 1.15 for an adiabatic equation of state with γ = 5/3. Figure 4 shows the result and the comparison with the analytical solution. As explained in [13], small-scale axis-related artifacts appear in the density of off-axis blasts. However, the evolution of the blast remains spherical.

Performance
The tests are performed on the French supercomputer Adastra [4], hosted at CINES in Montpellier. This cluster is equipped with AMD MI250X GPUs, with 4 GPUs per node. Each GPU has 128 GB of memory and a theoretical peak memory bandwidth of 3276 GB/s. The MI250X GPU is composed of 2 GCDs (Graphics Compute Dies) and we assign one MPI process to each GCD. For the scaling tests, we use version 5.5.1 of ROCm with the HIPCC compiler driver. The Kokkos version is 4.1.00 and we enable the Kokkos option Kokkos_ARCH_VEGA90A.
We use the 3-D Rayleigh-Taylor simulation with size (n_x, n_y, n_z) and the initial conditions presented in the previous section. We run the test for 50 iterations. The performance is evaluated for the temporal loop, excluding initialization and I/O. It gives the number of cells updated per second:

perf = n_x n_y n_z n_itr / t_elapsed,     (11)

with n_itr the number of iterations and t_elapsed the elapsed time.

Performance for one MPI process
For the first test, we evaluate how the performance evolves with the size of the simulation using a single GCD. For maximum efficiency, the size of the problem must be large enough to expose sufficient parallelism. In our case, we can see in Figure 5 that the performance increases rapidly until the size of the problem reaches 20M cells. For problems of larger size the performance remains roughly constant. We have done several runs for each point and the variation in performance is less than 1%.

Weak scaling
The weak scaling test shows the ability of the code to increase its performance as the workload grows in proportion to the available computing resources. The problem size is fixed at 352 × 352 × 384 per GCD. When the number of GCDs is doubled, the total size of the problem doubles as well. Figure 6 shows an efficiency of approximately 96% up to 512 GPUs.

Strong scaling
In the strong scaling test, we fix the total problem size to 416 × 416 × 448 and then increase the number of GPUs. Figure 7 shows that the acceleration is close to ideal for a small number of GPUs. As expected in most strong scaling tests, the speedup degrades as the number of GCDs increases: the problem size per GCD becomes too small to saturate the bandwidth, and the MPI exchanges take more time due to the increased relative number of ghost cells.

Conclusion
With the results obtained so far, we believe that the technical choices made for building a sustainable and user-friendly version of HERACLES++ were relevant. Keeping the astrophysicist community strongly on board while making a radical change to C++ was a challenge. However, the use of multi-dimensional Kokkos views and the modularity of the code design make it quite accessible to (astro)physicists with mainly a Fortran background. Performance portability was also a key issue and the preliminary results are encouraging. Without any specific optimisation work, the same source code already achieves very acceptable performance on various architectures. Several improvements are underway to further improve both the performance and the memory footprint. The first results concerning supernova ejecta are in preparation.

Figure 3: 3-D volume rendering of the passive tracer concentration in the 3-D Rayleigh-Taylor test at time A g_z t²/L ∼ 3. Red values indicate higher concentration and blue lower. The number of zones in (x, y, z) is (256, 256, 512).

Figure 4: 3-D spherical Sedov blast wave at t = 0.08. Left: 2-D density slice at ϕ = π/2 in the Cartesian frame. The analytic shock radius (r_sh) is overplotted in black. Right: pressure distribution as a function of the distance from the blast center, with the analytic solution in black. The (x, y, z) domain contains (256, 256, 512) zones.

Figure 5: Performance in Mcells per second as a function of the size of the problem, running with one MPI process (one GCD). The sizes n_x, n_y and n_z are roughly equal.

Figure 7: Strong scaling: acceleration of the performance, compared to one GCD, when the size of the problem remains constant and the number of MPI processes increases.
Figure 6: Weak scaling: efficiency of the code, compared to one GCD, when the size of the problem and the number of MPI processes double.