A note on the parallel efficiency of fire simulation on a cluster

Current HPC clusters can significantly reduce the execution time of parallelized tasks. This paper discusses two selected strategies for allocating cluster computational resources and their impact on the parallel efficiency of fire simulation. A simple corridor fire scenario is simulated with the Fire Dynamics Simulator, parallelized using the MPI programming model, on the HPC cluster at the Institute of Informatics of the Slovak Academy of Sciences in Bratislava (Slovakia). The tests confirm that parallelization has great potential to reduce execution times, achieving promising values of parallel efficiency. However, the results also show that using more computational meshes, and hence more computational cores, does not necessarily decrease either the execution time or the parallel efficiency of the simulation. The results obtained indicate that the simulation achieves different execution times and parallel efficiencies depending on the strategy used for allocating cluster computational resources.


Introduction
Current fire simulation systems are able to exploit CFD (Computational Fluid Dynamics) theory and to model fires and their consequences, involving many physical and chemical processes related to fire such as combustion, pyrolysis, heat transfer, thermal radiation, turbulence and fluid dynamics. However, simulation of fires in large areas generally requires carrying out the calculation in parallel on HPC (high-performance computing) systems. The FDS (Fire Dynamics Simulator) system [1] is an advanced CFD-based fire simulator intended for fires in various environments. It was developed by NIST (National Institute of Standards and Technology, U.S. Department of Commerce) in cooperation with VTT (Technical Research Centre of Finland). FDS numerically solves a form of the Navier–Stokes equations for low-speed, fire-induced flows, with emphasis on the transport of smoke and heat from the fire. To take advantage of the available computational resources, FDS supports four programming models: the sequential model designed for sequential computers, the parallel MPI (Message Passing Interface) model designed for systems with distributed memory, the multithreaded OpenMP (Open Multi-Processing) model designed for systems with shared memory, and the hybrid MPI&OpenMP model designed for systems with distributed shared memory. In previous research we investigated the applicability of FDS to simulating fires in various environments [2][3][4]. In this paper, we focus on the MPI model of parallelization of fire simulation on the HPC cluster at the Institute of Informatics of the Slovak Academy of Sciences in Bratislava (Slovakia).
There were 52 IBM dx360 M3 computational nodes available for testing; each node consisted of two sockets, each with a 6-core Intel Xeon E5645 processor running at 2.4 GHz (12 cores per node). Each node had 48 GB of RAM (24 GB per processor). In the MPI model, the computational domain is divided into several computational meshes, and the computation on each mesh is treated as a single MPI process assigned to a single computational core; communication between the MPI processes is handled by the MPI library. We use the 64-bit MPI version of FDS 6.3.2 for Linux and the open-source Open MPI 1.10.0 implementation of MPI. The mpirun script (part of Open MPI) allows determining exactly how the MPI processes of a task are mapped to individual computational nodes and then bound to individual sockets or cores. The aim of this paper is to compare the impact of two selected strategies for allocating computational resources on the parallel efficiency of a corridor fire simulation.
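For illustration, launching a 12-mesh simulation with one MPI rank per mesh and an explicit placement policy might look as follows. This is a hedged sketch: the input file name corridor.fds is a hypothetical example, not the tested configuration.

```shell
# Hypothetical launch of a 12-mesh FDS simulation: one MPI rank per mesh,
# each rank mapped to and bound to its own core.
# --report-bindings makes mpirun print the resulting rank-to-core placement.
mpirun -np 12 --map-by core --bind-to core --report-bindings fds corridor.fds
```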

Simulation of a corridor fire
We consider a 10-second fire in a corridor with the dimensions of 7. […] The decomposition for m = 7 and m = 11 does not fulfil the conditions required for solving the Poisson equation by FFT; therefore, we do not consider the simulations 7M and 11M. The tested simulations use the same 2 cm resolution, with a total of 2916000 cells (2916000/m cells per mesh). Since the chosen mesh resolution is the same as in the sequential calculation and the same fire scenario is simulated, the mesh sensitivity study performed for the sequential calculation, together with an analysis of the HRR curves of the executed parallel simulations, indicates that the computational meshes in the parallel simulations also correspond to a "fine mesh" in the sense of the study [5]. […]
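As a side note on why 7M and 11M are excluded: one necessary condition for a uniform decomposition into 2916000/m cells per mesh is that m divide the total cell count evenly (the FFT-based Poisson solver imposes further constraints on per-mesh cell counts, not checked here). A quick illustrative check, assuming the candidate mesh counts run from 2 to 12:

```shell
# Which mesh counts m split the 2916000-cell domain into equal integer-sized
# meshes? A nonzero remainder rules the uniform decomposition out.
for m in 2 3 4 5 6 7 8 9 10 11 12; do
  awk -v n=2916000 -v m="$m" 'BEGIN { printf "m=%d remainder=%d\n", m, n % m }'
done
```

Only m = 7 (remainder 3) and m = 11 (remainder 10) fail this check, consistent with the exclusion of 7M and 11M.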

Figure 2. Decomposition of the computational domain for the chosen simulations
In the MPI model, each simulation mM is represented by m MPI processes, each carried out on a single computational core. Placing the processes of a given simulation on individual computational nodes of the HPC cluster and binding them to individual sockets or computational cores consists of three phases: mapping, ranking and binding [6]. In this paper, we use two allocation strategies: --map-by core, --bind-to core (denoted CC) and --map-by socket, --bind-to socket (denoted SS). The CC strategy binds MPI processes to individual computational cores sequentially: it first fills the cores of socket 0 and then the cores of socket 1. The SS strategy assigns MPI processes to sockets alternately: even-ranked processes (including process 0) are bound to the cores of socket 0 and odd-ranked processes to the cores of socket 1. We reserved the whole computational node for each simulation in order to eliminate interference with other tasks running on the HPC cluster, which would increase the execution time of the simulation. Table 1 shows the execution times (in hours, rows 1 and 2) and parallel efficiencies (in %, rows 3 and 4) of the considered simulations carried out with the CC and SS strategies, together with their differences (in %, row 5) (see figure 3).
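For reference, parallel efficiency here can be understood in the usual sense as E_m = 100 · T_1 / (m · T_m), where T_1 is the execution time of the sequential simulation and T_m that of the simulation mM; this standard definition is assumed, and the timings below are hypothetical placeholders, not the measured values of Table 1.

```shell
# Parallel efficiency E_m = 100 * T1 / (m * Tm), with hypothetical timings
# (illustrative values only, not the measured results from Table 1).
T1=10.0   # assumed execution time of the sequential simulation (hours)
Tm=1.0    # assumed execution time of the 12-mesh simulation (hours)
m=12
awk -v t1="$T1" -v tm="$Tm" -v m="$m" \
  'BEGIN { printf "E_%d = %.1f %%\n", m, 100 * t1 / (m * tm) }'
```

With these placeholder timings the script prints "E_12 = 83.3 %", i.e. a 10x speedup on 12 cores.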

Conclusion
In this paper, we examined the effect on the parallel efficiency of fire simulation of two strategies for allocating the MPI processes of a given simulation, carried out on a computer cluster, to individual cores or sockets. A series of one sequential and 9 parallel simulations of a simple corridor fire scenario was carried out on the HPC cluster at the Institute of Informatics of the Slovak Academy of Sciences in Bratislava using the parallel MPI programming model of FDS and the strategies --map-by core, --bind-to core and --map-by socket, --bind-to socket, and the parallel efficiencies of the parallel simulations were compared. The tests confirm the great potential of parallelization to reduce execution times and to achieve promising values of parallel efficiency. The results indicate that the simulations carried out with the CC strategy reach markedly lower parallel efficiency than the simulations carried out with the SS strategy. The analysis also shows that increasing the number of computational meshes does not necessarily decrease either the execution time or the parallel efficiency of the simulation. The tests indicate that the problem of efficient parallel realization of fire simulation on computer clusters using FDS and the MPI programming model must be considered carefully.