
An Efficient Algorithm for Astrochemical Systems Using Stoichiometry Matrices


Published 2024 January 17 © 2024. The Author(s). Published by the American Astronomical Society.
Citation: Kazutaka Motoyama et al. 2024, ApJS, 270, 19. DOI: 10.3847/1538-4365/ad12c8


Abstract

Astrochemical simulations are a powerful tool for revealing chemical evolution in the interstellar medium. Astrochemical calculations require efficient processing of the large matrices that describe the chemical networks. Large chemical reaction networks often become computational bottlenecks because evaluating the time derivatives of the chemical abundances is expensive. We propose an efficient algorithm using a stoichiometry-matrix approach in which this time-consuming part is expressed as a loop, unlike the algorithm used in previous studies. Since stoichiometry matrices are generally sparse, the performance of simulations with our algorithm depends on which sparse-matrix storage format is used. We conducted a performance comparison experiment using common storage formats, including the coordinate format, the compressed column storage format, the compressed row storage (CRS) format, and the sliced ELLPACK format. The experimental results showed that simulations with the CRS format are the most suitable for astrochemical simulations and are about a factor of 2 faster than those with the algorithm used in previous studies. In addition, our algorithm significantly reduces not only the computation time but also the compilation time. We also explore the beneficial effects of parallelization and sparse-matrix reordering in these algorithms.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Interstellar molecules are observed in a wide range of environments in interstellar space, and their emissions are used as probes of the physical conditions of astronomical objects. More than 240 different molecules have been detected in the interstellar medium (ISM) and circumstellar medium (McGuire 2022). In particular, a wide variety of molecules, including complex organic molecules, are observed in star-forming regions (see Jørgensen et al. 2020, and references therein). During the gravitational collapse of the molecular core, chemical reactions in the gas phase and on the dust–grain surface produce complex molecules from simpler molecules. Recent high-resolution observations with the Atacama Large Millimeter/submillimeter Array have revealed chemical structures of protoplanetary disks (Öberg et al. 2021). Molecules in protoplanetary disks provide material for the atmospheres of newly forming planets and are related to the origin of life.

Astrochemists model the chemical evolution through reaction networks to understand the formation and destruction of molecules observed in various sources in the ISM. In these models, the chemistry is described by processes of various types: ion–neutral and neutral–neutral reactions, dissociation and ionization by UV photons and cosmic-ray particles, dissociative recombination, etc. The models compute the individual abundances as a function of time starting from an initial composition. Rate coefficients for chemical reactions in the ISM can be taken from publicly available astrochemical databases such as UdfA (Millar et al. 1991, 1997; Le Teuff et al. 2000; Woodall et al. 2007) and KIDA (Wakelam et al. 2012, 2015). The UdfA database, also called the UMIST database, was first released in 1991, followed by subsequent releases in 1995, 1999, and 2006. The KIDA database was first released in 2012, and an updated version was released in 2015.

The hydrochemical code KM2 (Motoyama et al. 2015) has chemistry and hydrodynamics modules, allowing us to deal with astrophysical problems requiring coupled evolution. KM2 belongs to the Kinetic Modules suite, developed by the ASIAA CHARMS group. Its hydrodynamics module is finite-volume conservative and treats the chemical species as passively advected scalars, flowing inside an axisymmetric grid in cylindrical coordinates. The chemistry module solves nonequilibrium chemistry and energy changes due to thermal processes by transferring external UV radiation, including self-shielding effects on the photodissociation of CO and H2. Other than KM2, hydrodynamic simulations coupled with nonequilibrium chemistry are used in studies such as primordial star formation (Hosokawa et al. 2016; Sharda et al. 2019) and photoevaporation of protoplanetary disks (Nakatani et al. 2018). The computational cost of the chemical module grows with the number of reactions and species considered in the reaction network, and it can be dominant in a hybrid hydrochemical simulation because each cell of the hydrodynamic grid must follow a separate chemical evolution of its many species undergoing numerous reactions. The matrix describing an extensive reaction network is very sparse, allowing a substantial reduction in runtime. Chemical reaction systems that include reactions with widely different reaction rates are represented by stiff differential equations, and the stiffness of the equations is also an important factor affecting computational cost. The larger the chemical reaction network, the stiffer the system, requiring smaller time steps to obtain a numerically stable solution. These stiff differential equations are solved with an ordinary differential equation (ODE) solver based on a backward differentiation formula method, such as LSODE and its variants included in ODEPACK (Hindmarsh 1983; Radhakrishnan & Hindmarsh 1993).

In this work, we focus on the efficient computation of the reaction network, exploring various sparse-matrix storage methods and computer settings to experimentally identify the most efficient ones. Sparse matrices have only a small fraction of nonzero elements; they can be stored compactly and allow efficient algorithms. Algorithms traversing an efficiently stored sparse matrix benefit from the concentration of nonzero elements, which reduces the operation count by reducing or eliminating operations on zeros. Additionally, an algorithm that traverses the sparse matrix in storage order benefits from increased data locality, an important factor in improving memory access speed. The choice of storage order and the path of the algorithm through the data can, therefore, influence computational efficiency.

We measure these beneficial effects by comparing the efficiency of different storage mechanisms, each traversed in its natural order, in computing the evolution of the chemical system defined by a very sparse stoichiometry matrix.

The paper is organized as follows. In Section 2, we describe the algorithm of our new method and the sparse-matrix storage formats used in the numerical experiments in detail. In Section 3, we describe the setup of the numerical experiments. In Section 4, we present and analyze our experimental results. In Sections 5 and 6, we discuss our results and summarize our main conclusions.

2. Numerical Algorithm

This section describes our new algorithm to solve chemical evolution in the ISM.

2.1. Formulation

Astrochemical simulations reveal the time evolution of chemical abundances in the ISM. The production rate of the $i$th chemical species is written as

Equation (1)

$$\frac{dn_i}{dt} = \sum_{j,m} k_{jmi}\, n_j n_m + \sum_{l} \zeta_{li}\, n_l,$$

where $n_i$ is the number density of the $i$th chemical species, $k_{jmi}$ is the rate coefficient of the two-body reaction in which the $j$th and $m$th chemical species react to produce the $i$th chemical species, and $\zeta_{li}$ is the rate coefficient of the photoreaction or cosmic-ray-induced reaction in which the $l$th chemical species produces the $i$th chemical species.

We introduce a matrix formulation to solve simultaneous ODEs expressed by Equation (1). As an example, let us consider a simple chemical reaction network consisting of the following two reactions:

Equation (2)

$$\mathrm{A} + \mathrm{A} \longrightarrow \mathrm{B},$$

Equation (3)

$$\mathrm{B} \longrightarrow \mathrm{A} + \mathrm{A}.$$

For simplicity, we denote $k_{\rm AAB}$ as $k$ and $\zeta_{\rm BA}$ as $\zeta$. The rates of these reactions are expressed as $k\,n_{\rm A}^{2}$ and $\zeta\,n_{\rm B}$, respectively. The production rates of each species are given by

Equation (4)

$$\frac{dn_{\rm A}}{dt} = -2k\,n_{\rm A}^{2} + 2\zeta\,n_{\rm B},$$

Equation (5)

$$\frac{dn_{\rm B}}{dt} = k\,n_{\rm A}^{2} - \zeta\,n_{\rm B}.$$

This set of equations can be expressed equivalently in matrix form:

Equation (6)

$$\frac{d}{dt}\begin{pmatrix} n_{\rm A} \\ n_{\rm B} \end{pmatrix} = \begin{pmatrix} -2 & 2 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} k\,n_{\rm A}^{2} \\ \zeta\,n_{\rm B} \end{pmatrix}.$$

Generalization of this equation for a chemical reaction network with m reactions and n reacting species yields

Equation (7)

$$\frac{d}{dt}\begin{pmatrix} n_{1} \\ n_{2} \\ \vdots \\ n_{n} \end{pmatrix} = \begin{pmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1m} \\ \alpha_{21} & \alpha_{22} & \cdots & \alpha_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{n1} & \alpha_{n2} & \cdots & \alpha_{nm} \end{pmatrix} \begin{pmatrix} R_{1} \\ R_{2} \\ \vdots \\ R_{m} \end{pmatrix},$$

where $R_j$ is the reaction rate of the $j$th reaction. The $\alpha_{ij}$ value represents how much the $i$th species increases or decreases in the $j$th reaction: it is negative for reactants, positive for products, and zero for species not involved in the reaction. The matrix on the right-hand side of this equation is called the stoichiometry matrix.
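
Evaluating the right-hand side of Equation (7) is an ordinary matrix–vector product. As a point of reference for the sparse-storage formats introduced in Section 2.2, a minimal dense-storage sketch in Fortran is given below; it is our illustration, not code from the article, and the array names alpha, rates, and dndt are ours.

    subroutine prod_rate_dense(n, m, alpha, rates, dndt)
      implicit none
      integer, intent(in)  :: n, m          ! numbers of species and reactions
      real(8), intent(in)  :: alpha(n, m)   ! stoichiometry matrix
      real(8), intent(in)  :: rates(m)      ! reaction rates R_j
      real(8), intent(out) :: dndt(n)       ! production rates dn_i/dt
      integer :: i, j

      dndt = 0.0d0
      do j = 1, m
         do i = 1, n
            dndt(i) = dndt(i) + alpha(i, j) * rates(j)
         end do
      end do
    end subroutine prod_rate_dense

Because almost all entries of the stoichiometry matrix are zero, the sparse-storage variants sketched in Section 2.2 compute the same sums while visiting only the nonzero elements.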

2.2. Sparse-matrix Storage Formats

Matrix–vector multiplication in Equation (7) is the most computationally expensive part of simulations with the matrix formulation. Efficient sparse-matrix–vector multiplication is therefore key to performing astrochemical simulations with our algorithm. The stoichiometry matrix of a large reaction network is generally sparse, i.e., most of its elements are zero. Various sparse-matrix storage formats have been proposed to improve computational efficiency by avoiding operations on zero elements as much as possible, but which one is suitable depends on the number and distribution of nonzero elements in the matrix. In this paper, to find a suitable sparse-matrix storage format for astrochemical simulations, we compare the performance of astrochemical simulations with four common storage formats: the coordinate (COO) format, the compressed column storage (CCS) format, the compressed row storage (CRS) format, and the sliced ELLPACK (SELL) format.

Let us consider the following sparse-matrix–vector multiplication to illustrate these formats:

Equation (8)

Listing 1 shows example Fortran code and the data structure for the COO format. Values of nonzero elements are stored in the one-dimensional array A in arbitrary order. Row and column indices of the nonzero elements are stored in row_ind and col_ind, respectively. Note that Fortran array indices start at 1, not at 0. The total number of nonzero elements is denoted by nnz.
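
The production-rate loop in COO storage can be sketched as follows; this is an illustration consistent with the description above, not a reproduction of Listing 1, and the subroutine name and the rates and dndt arrays are ours.

    subroutine prod_rate_coo(nnz, nspec, nreac, A, row_ind, col_ind, rates, dndt)
      implicit none
      integer, intent(in)  :: nnz, nspec, nreac
      real(8), intent(in)  :: A(nnz), rates(nreac)
      integer, intent(in)  :: row_ind(nnz), col_ind(nnz)
      real(8), intent(out) :: dndt(nspec)
      integer :: k

      dndt = 0.0d0
      ! single pass over the nonzero elements, stored in arbitrary order;
      ! both the row and the column index are read indirectly
      do k = 1, nnz
         dndt(row_ind(k)) = dndt(row_ind(k)) + A(k) * rates(col_ind(k))
      end do
    end subroutine prod_rate_coo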

Figure 1(a) schematically illustrates the order in which nonzero elements are calculated in CCS format, and Listing 2 shows an example of the Fortran code. Values of nonzero elements are stored in the one-dimensional array A in column-wise order. Row indices of nonzero elements are stored in row_ind. Index pointers to the first nonzero element of each column in A are stored in col_ptr. The total number of columns is denoted by ncol.
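
A corresponding sketch for CCS storage (again our illustration, not the article's Listing 2) traverses one column, i.e., one reaction, at a time, so the innermost loop runs only over the species taking part in that reaction.

    subroutine prod_rate_ccs(ncol, nnz, nspec, A, row_ind, col_ptr, rates, dndt)
      implicit none
      integer, intent(in)  :: ncol, nnz, nspec
      real(8), intent(in)  :: A(nnz), rates(ncol)
      integer, intent(in)  :: row_ind(nnz), col_ptr(ncol + 1)
      real(8), intent(out) :: dndt(nspec)
      integer :: j, k

      dndt = 0.0d0
      do j = 1, ncol                          ! loop over reactions (columns)
         do k = col_ptr(j), col_ptr(j + 1) - 1
            dndt(row_ind(k)) = dndt(row_ind(k)) + A(k) * rates(j)
         end do
      end do
    end subroutine prod_rate_ccs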

Figure 1.

Figure 1. Orders in which nonzero elements are calculated in CCS, CRS, and SELL formats.


Figure 1(b) schematically illustrates the order in which nonzero elements are calculated in the CRS format, and Listing 3 shows example Fortran code. Values of nonzero elements are stored in the array A in row-wise order and are processed in that order. Column indices of the nonzero elements are stored in the array col_ind. Index pointers to the first nonzero element of each row in A are stored in the array row_ptr. The total number of rows is denoted by nrow.
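
A sketch of the same operation in CRS storage (our illustration, not the article's Listing 3): the innermost loop now runs over all reactions that involve species i, which can be long for abundant species such as atomic hydrogen.

    subroutine prod_rate_crs(nrow, nnz, nreac, A, col_ind, row_ptr, rates, dndt)
      implicit none
      integer, intent(in)  :: nrow, nnz, nreac
      real(8), intent(in)  :: A(nnz), rates(nreac)
      integer, intent(in)  :: col_ind(nnz), row_ptr(nrow + 1)
      real(8), intent(out) :: dndt(nrow)
      integer :: i, k
      real(8) :: tmp

      do i = 1, nrow                          ! loop over species (rows)
         tmp = 0.0d0
         do k = row_ptr(i), row_ptr(i + 1) - 1
            tmp = tmp + A(k) * rates(col_ind(k))
         end do
         dndt(i) = tmp
      end do
    end subroutine prod_rate_crs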

Figure 1(c) schematically illustrates the order in which nonzero elements are calculated in the SELL format (Monakov et al. 2010), and Listing 4 shows example Fortran code. Nonzero elements are shifted to the left side of the matrix, and the rows are divided into slices of width wid_sl in the column direction. The row length of a slice is the length of its row with the largest number of nonzero elements; rows with fewer nonzero elements are padded with zeros so that all rows in the slice have a uniform length. The lengths of the slices are stored in the array len_sl. Within each slice, values of nonzero elements are stored in the array A in column-wise order. Column indices of the nonzero elements are stored in the array col_ind. The total number of slices is denoted by n_sl.
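
The SELL traversal can be sketched as follows (our illustration, not the article's Listing 4). The layout assumptions are those described above: within each slice the values are stored column-wise, padded entries hold a zero value and a valid dummy column index, and nstore denotes the padded length of A.

    subroutine prod_rate_sell(n_sl, wid_sl, nstore, nreac, nspec, A, col_ind, len_sl, rates, dndt)
      implicit none
      integer, intent(in)  :: n_sl, wid_sl, nstore, nreac, nspec
      real(8), intent(in)  :: A(nstore), rates(nreac)
      integer, intent(in)  :: col_ind(nstore), len_sl(n_sl)
      real(8), intent(out) :: dndt(nspec)
      integer :: s, c, r, i, k, offset

      dndt = 0.0d0
      offset = 0
      do s = 1, n_sl                          ! loop over slices of wid_sl rows
         do c = 1, len_sl(s)                  ! padded row length of this slice
            do r = 1, wid_sl                  ! short, SIMD-friendly inner loop
               i = (s - 1) * wid_sl + r
               k = offset + (c - 1) * wid_sl + r
               if (i <= nspec) dndt(i) = dndt(i) + A(k) * rates(col_ind(k))
            end do
         end do
         offset = offset + len_sl(s) * wid_sl
      end do
    end subroutine prod_rate_sell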

3. Experimental Settings

In this section, we describe the experimental settings for a performance comparison study of different sparse-matrix storage formats. We implemented our proposed algorithm in the hydrochemical hybrid code KM2 (Motoyama et al. 2015) and simulated chemical evolution in a photon-dominated region as a benchmark test. The hydro module of KM2 was temporarily turned off for this study. Physical parameters for the simulation were adopted from model F2 in Röllig et al. (2007). A semi-infinite plane-parallel cloud with a number density of $n_{\rm H} = 10^{3}$ cm$^{-3}$ was illuminated by far-UV (FUV) radiation with an intensity of $\chi = 10^{5}$ in units of the Draine field. The gas and dust temperatures were fixed at 50 K and 20 K, respectively. The visual extinction $A_V$ was assumed to be related to the hydrogen column density $N_{\rm H,tot}$ by $A_V = 6.289 \times 10^{-22}\,N_{\rm H,tot}$. All elements in the gas were initially in atomic form. The elemental abundances adopted in this study are summarized in Table 1. For the elements included in model F2 of Röllig et al. (2007), i.e., H, He, C, and O, the same abundances as in Röllig et al. (2007) were used; the abundances of the other elements were taken from Garrod et al. (2008). The chemical evolution was solved for 30 Myr. A logarithmic grid was used to obtain high resolution near the cloud surface, where chemical reactions are strongly affected by FUV radiation. The computational domain covers the range $0 < A_V < 20$ with 128 cells. For further details on the benchmark, refer to Röllig et al. (2007) and Motoyama et al. (2015).

Table 1. Elemental Abundances Relative to Hydrogen Nuclei

Element    Abundance       Element    Abundance
H          1.0             Si         8.0 × 10^-9
He         0.1             Mg         7.0 × 10^-9
O          3.0 × 10^-4     Cl         4.0 × 10^-9
C          1.0 × 10^-4     Fe         3.0 × 10^-9
N          7.5 × 10^-5     P          3.0 × 10^-9
S          8.0 × 10^-8     Na         2.0 × 10^-9
F          2.0 × 10^-8


The chemical reaction network for this performance comparison experiment consists of 6175 reactions among 468 species involving 13 elements, taken from the UMIST rate12 database (McElroy et al. 2013). The stoichiometry matrix of the chemical reaction network is a 468 × 6175 matrix. The total number of nonzero elements is 24,580, which is about 0.85% of the total number of matrix elements. Each row corresponds to a chemical species, and each column corresponds to a reaction. The order of rows and columns can be arbitrary. Two stoichiometry matrices with different row orders were used to see how the distribution of nonzero elements affects computing performance. One stoichiometry matrix was created by using the same order of chemical species and chemical reactions as in the UMIST rate12 database, that is, chemical species are in alphabetical order and chemical reactions are grouped by reaction type, such as charge exchange, dissociative recombination, and photoprocesses (hereinafter referred to as the original matrix). The other stoichiometry matrix was created by reordering the rows of the above matrix by the number of nonzero elements per row (hereinafter referred to as the reordered matrix). Figures 2(a) and (b) show the distributions of nonzero elements in the original matrix and the reordered matrix, respectively.
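
The row reordering can be sketched as follows; this is our illustration rather than the procedure actually used in the article, and it assumes a descending sort by nonzero count, which the text does not specify.

    subroutine reorder_rows(nrow, ncol, alpha, perm)
      ! build a row permutation perm such that rows are sorted by their
      ! number of nonzero entries (descending), given a dense stoichiometry
      ! matrix alpha(nrow, ncol)
      implicit none
      integer, intent(in)  :: nrow, ncol
      real(8), intent(in)  :: alpha(nrow, ncol)
      integer, intent(out) :: perm(nrow)
      integer :: cnt(nrow), i, j, p, tmp

      do i = 1, nrow
         cnt(i) = count(alpha(i, :) /= 0.0d0)
         perm(i) = i
      end do

      ! simple selection sort on the permutation (nrow is only a few hundred)
      do i = 1, nrow - 1
         p = i
         do j = i + 1, nrow
            if (cnt(perm(j)) > cnt(perm(p))) p = j
         end do
         tmp = perm(i); perm(i) = perm(p); perm(p) = tmp
      end do
    end subroutine reorder_rows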

Figure 2.

Figure 2. Distributions of nonzero elements in a stoichiometry matrix without row reordering (a) and with row reordering (b).


Recent CPUs achieve high performance by incorporating single instruction, multiple data (SIMD) units that can perform the same operation on multiple data elements concurrently. In addition to the effect of the distribution of nonzero elements, simulations using each sparse-matrix storage format were performed with three different settings to investigate the effect of SIMD operations. In the first setting, the reordered matrix was used and the SIMD operations of Advanced Vector Extensions 2 (AVX2) were enabled with the compilation option -O3 -xcore-avx2. AVX2 instructions enable CPUs to operate on four double-precision floating-point numbers concurrently. In the second setting, the original matrix was used and SIMD operations were enabled with the compilation option -O3 -xcore-avx2. In the third setting, the reordered matrix was used and SIMD operations were partly disabled with the compilation options -no-simd -no-vec in the critical routines. These simulations were run in serial.

The specification of the computing node that we used is summarized in Table 2. Our computing node has dual CPUs, and each CPU has 10 physical cores. We measured not only the execution times of the simulations in serial but also the execution times of the simulations parallelized with OpenMP. Parallelization using OpenMP was done with different levels of task granularity. One is parallelization at the loop that computes the production rates of chemical species, as shown in Listings 1–4. The other is a more coarse-grained parallelization by a domain-decomposition approach, that is, the entire computational domain is divided into smaller subdomains. In these parallelized simulations, the reordered matrix was used and SIMD operations were enabled. The thread affinity policy was specified by setting the environment variables OMP_PLACES=cores and OMP_PROC_BIND=close. To account for the influence of background processes, each simulation was run five times, and the averages and standard errors of the execution times were obtained.
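
The two levels of OpenMP parallelization can be sketched as follows; this is our illustration, not code from KM2, and the per-cell driver solve_cell_chemistry is a hypothetical routine standing in for the chemistry update of one grid cell.

    subroutine two_level_omp_sketch(nrow, nnz, nreac, ncell, A, col_ind, row_ptr, rates, dndt, dt)
      implicit none
      integer, intent(in)  :: nrow, nnz, nreac, ncell
      real(8), intent(in)  :: A(nnz), rates(nreac), dt
      integer, intent(in)  :: col_ind(nnz), row_ptr(nrow + 1)
      real(8), intent(out) :: dndt(nrow)
      integer :: i, k, icell
      real(8) :: tmp
      external :: solve_cell_chemistry        ! hypothetical per-cell chemistry driver

      ! (a) fine-grained parallelization of the production-rate loop (CRS layout)
      !$omp parallel do private(i, k, tmp)
      do i = 1, nrow
         tmp = 0.0d0
         do k = row_ptr(i), row_ptr(i + 1) - 1
            tmp = tmp + A(k) * rates(col_ind(k))
         end do
         dndt(i) = tmp
      end do
      !$omp end parallel do

      ! (b) coarse-grained domain decomposition: each thread advances whole cells
      !$omp parallel do
      do icell = 1, ncell
         call solve_cell_chemistry(icell, dt)
      end do
      !$omp end parallel do
    end subroutine two_level_omp_sketch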

Table 2. Specification of Computing Node

Component    Specification
CPU          Intel Xeon E5-2650 v3, 2.3 GHz × 2
             (Intel Haswell microarchitecture)
Memory       DDR4-2133 dual-channel, 32 GB
OS           Ubuntu 16
Compiler     Intel Fortran compiler 16.0


As a reference, simulations with a conventional algorithm were also run by using the KROME package (Grassi et al. 2014). In the KROME package, there are two options to compute the right-hand side of Equation (1): an explicit method and an implicit method. We used the explicit method, which performs better. KM2 and KROME use the same stiff ODE solver, DLSODES, which is a double-precision version of LSODES, a variant of the basic solver LSODE in ODEPACK. DLSODES is intended for problems in which the Jacobian matrix is sparse. The relative and absolute tolerances controlling the magnitude of acceptable errors were set to $1.0 \times 10^{-4}$ and $1.0 \times 10^{-20}$, respectively, for both KM2 and KROME. The reaction coefficients in Equation (1) depend on local physical properties such as the temperature, cosmic-ray ionization rate, and FUV intensity. In the KM2 code, these reaction coefficients were precomputed outside the stiff ODE solver. KROME provides an option, -useCustomCoe, for computing reaction coefficients with a user-defined function; by calling this function outside the stiff ODE solver, a computational approach similar to that of KM2 can be achieved, which is faster for this purpose than the default KROME settings. In the conventional algorithm, the computation of the production rates of chemical species is not expressed as loop iterations and cannot be parallelized at the chemical-loop level with OpenMP.

4. Results

In this section, we describe the results of the performance comparison experiments. Figure 3 shows the execution times of simulations run in serial. The execution time of the reference model using KROME with SIMD operations and the -useCustomCoe option was 427 s, much faster than the default KROME at 1604 s. All simulations with our algorithm showed better performance than the default KROME, and the CRS and CCS formats were also faster than the reference. The CRS format with row reordering and SIMD operations showed the best performance. Its execution time was 232 s, which is 6.91 times faster than the default KROME and 1.84 times faster than the reference. In all settings, the CRS format showed better computing performance than the other formats.

Figure 3.

Figure 3. Execution times of simulations with COO, CCS, CRS, and SELL formats in serial. Error bars represent the standard errors of execution times. The dashed and dotted lines represent execution times of a reference model using the KROME code with SIMD operations and the -useCustomCoe option, and a simulation using the default KROME settings, respectively.


Row reordering improved computational performance by 10% or more for the CRS and SELL sparse-matrix storage formats. The execution times with row reordering were shorter than those without it for all formats. The performance improvement rates, calculated by dividing the execution time without row reordering by the execution time with row reordering, were 1.04, 1.04, 1.10, and 1.69 for the COO, CCS, CRS, and SELL formats, respectively. The performance improvement was especially noticeable for the SELL format.

SIMD operations were effective for the CRS format: the simulation with SIMD operations was 1.24 times faster than that without them. On the other hand, the execution time with SIMD operations for the CCS format was almost the same as that without them; the performance improvement rate was only 1.05. SIMD operations caused performance degradation for the COO and SELL formats, with performance improvement rates of 0.86 and 0.84, respectively. The effect of SIMD operations was also investigated for the reference model. Its execution time without SIMD operations was 1590 s. The performance improvement rate due to SIMD operations was only 1.01, and the difference was within the standard error of the measured execution times.

Figure 4 shows the execution times of simulations parallelized at the loop that computes the production rates of chemical species, as a function of the number of threads. The trend of speedup by parallel computation differed depending on whether the number of threads exceeded 10, the number of cores in one CPU. The execution times of all formats decreased monotonically up to 10 threads within the margin of error. On the other hand, the execution times of all formats increased when the number of threads exceeded 10, and almost no further speedup was observed for a higher number of threads. At the numbers of threads with the highest performance, the speedups for the COO, CRS, CCS, and SELL formats compared to the simulations in serial are 2.34, 1.41, 1.81, and 2.10, respectively. The simulation with the CRS format showed the best performance among all formats.

Figure 4.

Figure 4. Execution times of simulations as a function of OpenMP threads, for a chemical-loop parallelization. The solid, dashed, dashed–dotted, and dotted lines represent COO, CCS, CRS, and SELL formats, respectively. Error bars represent standard errors.


Figure 5 shows the execution times of simulations parallelized by domain decomposition as a function of the number of OpenMP threads. Unlike parallelization at the chemical-loop level, the execution times decreased monotonically even when the number of threads exceeded 10 and the threads spanned multiple CPUs. The speedups for the COO, CRS, CCS, and SELL formats compared to the simulations in serial are 10.7, 9.16, 10.1, and 10.7, respectively.

Figure 5.

Figure 5. Execution times of simulations parallelized by domain decomposition as a function of OpenMP threads. The solid, dashed, dashed–dotted, and dotted lines represent the COO, CCS, CRS, and SELL formats, respectively.


In addition to execution times, compilation times were also measured. The average compilation time of the default KROME code was 6866 s with a standard error of 19.2 s. The compilation time of our code was almost independent of the compilation options. The mean compilation time of our code for all compilation options was 53.6 s with a standard error of 0.4 s.

We then performed domain-decomposition OpenMP experiments on the large nodes of the machine Kawas available to us at ASIAA. Each node of this machine is described in Table 3. We ran tests similar to those of Figure 5, but now also including 32, 64, and 128 threads. Figure 6 shows the results, which confirm the efficiency of domain decomposition up to a large number of threads. They also confirm the higher efficiency of the CRS algorithm over the other storage methods.

Figure 6.

Figure 6. Execution times of simulations parallelized by domain decomposition as a function of OpenMP threads. The solid, dashed, dashed–dotted, and dotted lines represent the COO, CCS, CRS, and SELL formats, respectively. Simulations performed on the machine Kawas were able to efficiently run 128 threads on each of its nodes.


Table 3. Specification of Kawas Computing Nodes

Component    Specification
CPU          AMD EPYC 7763 @ 2.45 GHz
             64-core processor × 2 (128 cores)
Memory       512 GB
OS           Rocky Linux x86_64 8.5
Compiler     Intel Fortran compiler 22.1


5. Discussion

The results of the performance comparison study show that our new algorithm significantly improves the performance of astrochemical simulations compared to the conventional algorithm. The new algorithm using the CRS format is particularly suited to astrochemical simulations. It also has the advantage of greatly reducing compilation time. Our algorithm is easy to implement and can improve the performance of existing simulation codes. While performance was greatly improved for simulations in serial, and also in OpenMP parallel by applying domain decomposition, high parallel efficiency was not achieved for simulations with OpenMP applied to the chemical loops.

A roofline performance model (Williams et al. 2009) is useful for seeing which hardware component limits application performance. Figure 7 shows the results of a roofline analysis obtained by using Intel® Advisor for the simulations in serial. Computational performances are plotted as a function of arithmetic intensity, which is the ratio of total floating-point operations (FLOPs) to the number of bytes accessed in main memory. On our computing node, applications with arithmetic intensity greater than 0.166 FLOP byte$^{-1}$ can be considered CPU bound, while those below this threshold can be considered memory bound. All plotted points are within the memory-bound area, indicating that astrochemical simulations are memory bound regardless of whether the conventional algorithm or our new algorithm is employed. Effective utilization of the cache and increased arithmetic intensity are important factors for high computational performance. The performances of the reference model using KROME with the -useCustomCoe option and of the simulation with the COO format are lower than the limit imposed by the memory bandwidth; these simulations hardly take advantage of the cache. On the other hand, simulations with the CRS, CCS, and SELL formats exhibit performance higher than this limit, leveraging the cache to some extent.

We conducted performance profiling with Intel® VTune for simulations employing the reordered matrix with SIMD operations enabled. The computational time taken by the stiff ODE solver exceeded 97.9% of the total computation time for simulations using any sparse-matrix format. Sparse-matrix operations constitute the most time-consuming part within the stiff ODE solver, accounting for 85.3%, 75.7%, 60.5%, and 79.1% of the total computation time with the COO, CCS, CRS, and SELL formats, respectively. The CRS format shows the best performance for sparse-matrix operations, and the profiling clearly shows the corresponding reduction in its fraction of the computation time.
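
For reference, the roofline bound underlying the analysis above can be written, in standard form and in our notation, as

$$P_{\rm attainable} = \min\left(P_{\rm peak},\ {\rm AI}\times B_{\rm mem}\right), \qquad {\rm AI} = \frac{\text{floating-point operations}}{\text{bytes moved to and from main memory}},$$

so an application is memory bound whenever ${\rm AI} < P_{\rm peak}/B_{\rm mem}$; on our computing node this ridge point corresponds to the 0.166 FLOP byte$^{-1}$ threshold quoted above.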

Figure 7.

Figure 7. The computational performances of KM2 simulations using the reordered matrix with SIMD operations enabled are plotted as a function of arithmetic intensity. Additionally, the plus symbol represents the computational performance of a reference model using KROME with the -useCustomCoe option. The horizontal line represents the peak performance of double-precision floating-point operations, and the diagonal lines represent the bandwidth of memory and the first-level, second-level, and third-level caches on our computing node.


The differences in computation time upon reordering the rows of the stoichiometry matrix indicate that the performance of our algorithm depends on the distribution of nonzero elements. This is to be expected, since the amount of random memory access caused by an irregular distribution of nonzero elements is an important factor in the performance of sparse-matrix–vector multiplication. Reordering the rows by the number of nonzero elements increases the locality of the nonzero elements. This improves the cache hit ratio because the data used in the computation are more likely to lie in contiguous locations in memory. In addition, in the SELL format, zero padding is reduced when rows with similar numbers of nonzero elements share a slice, resulting in some performance improvement.

The degree of simulation speedup from SIMD operations depends on the sparse-matrix storage format, which can be explained by the characteristics of the matrix used in the simulations and by the effects of indirect memory references. As described in Section 3, the stoichiometry matrices used in astrochemical simulations are much longer in the row direction than in the column direction and therefore have more nonzero elements per row. Atomic hydrogen is involved in 1794 reactions, and the row representing atomic hydrogen has the largest number of nonzero elements; molecular hydrogen is involved in 980 reactions and electrons in 891 reactions. Since the CRS format stores nonzero elements in row-wise order, simulations with the CRS format have a longer innermost loop than simulations with the other storage formats. Thanks to this long loop length, the speedup from SIMD operations outweighs their overhead. Simulations using the CCS or SELL formats have short loop lengths, which result in a large overhead relative to the speedup from SIMD operations: the innermost loop length with the CCS format is the number of chemical species in a reaction, which is at most six in the UMIST rate12 database, and the innermost loop length with the SELL format is the width of the slice, which is four in this experiment. In simulations with formats other than COO, there is only one indirect memory reference in the loop, but with the COO format there are two, i.e., the column and row indices. Indirect memory references in loops make it difficult to realize the speedup from SIMD operations. The two indirect memory references probably incur a larger overhead than the gain from SIMD operations, which is why the simulation with the COO format degrades in performance when SIMD operations are enabled. In the model using the KROME code (with default options), the simulation is not sped up at all even with SIMD operations enabled.

Simulations parallelized at the chemical loop showed performance degradation for larger numbers of threads (larger than 10 in our experiment due to machine architecture). This is because the chemical rate computation is memory bound, not CPU bound. On the other hand, the domain-decomposition parallelization achieves good speedups essentially for all numbers of threads. This is because (1) a larger fraction of the problem is parallelized, not only the chemical rate calculation, and (2) there are fewer issues of contention for the usage of memory because each thread is now operating on largely independent sets of data. We have shown that this hybrid code can be parallelized at both levels, chemical loop and domain decomposition, each one to be preferred depending on the size of the domain and number of threads. The study of this domain-decomposition parallelization, natural to KM2, also serves as an example of the application of sparse-matrix techniques to a chemical code that belongs to an already established parallel framework and thus is not in immediate need of a more internal parallelization.

The temperature was assumed to be constant in our performance comparison experiments, which is a reasonable approximation for star formation processes, one of the key targets of astrochemical simulations. As described in Motoyama et al. (2015), the chemistry module of the KM2 determines the time step so that the temperature change from the previous step is small enough, and updates the chemical abundances separately from the temperature in the usual operator-split manner. The results of our performance comparison experiments are directly applicable to cases of variable temperature. For other chemistry codes that solve the temperature and chemical abundances together within a stiff ODE solver, the computational cost of updating the chemical abundances is significantly higher than that of updating the temperature. Therefore a large performance improvement is expected by using our new algorithm as well.

A more general conclusion of this work is that sparse-matrix techniques are efficient and effective. They are not extremely difficult to implement, although their most efficient usage on a given computer architecture often requires attention to memory distribution and CPU usage. We have observed that sparse-matrix methods are already widely used in many fields of computing; however, they are still underutilized relative to their full range of potential applications. That may be because using full or nearly full matrices instead of sparse matrices can also be effective, even if it is not the most efficient method, and it can be easier to implement. Within astrophysics, we observe that sparse-matrix methods can be applied to astrochemistry, to models of grain coagulation, and to problems in nucleosynthesis, without exhausting the list of potential applications. Matrices describing such models are often sparse, a fact not always fully exploited for computational advantage.

6. Summary

We have demonstrated a substantial speedup of astrochemical computations by exploiting their natural sparse-matrix structure. A speedup of up to 1.84× was found when compared to a conventional approach (KROME with the -useCustomCoe option), and up to 6.9× when compared with the default KROME options. Several variations of the sparse-matrix algorithm were tried, showing the advantages of (1) reordering the matrix to improve locality, (2) using a sparse-matrix storage format that compresses along the less sparse row direction of this matrix, and (3) using SIMD operations. OpenMP parallelization was explored with two techniques (chemical-loop parallelization and domain decomposition), each having its advantages for problems with different sizes of chemistry or geometry.

Acknowledgments

This work is supported by the Astrobiology Center Program of the National Institutes of Natural Sciences (NINS) grant No. AB271010, JSPS KAKENHI grant No. JP23K03457, and a University Research Support Grant from the National Astronomical Observatory of Japan (NAOJ). Numerical analyses were in part carried out on computers at the Center for Computational Astrophysics, National Astronomical Observatory of Japan. We acknowledge previous support by the National Institute of Informatics, Japan. The authors acknowledge grant support for the CHARMS group from the Institute of Astronomy and Astrophysics, Academia Sinica (ASIAA), and the National Science and Technology Council (NSTC) in Taiwan through grants 111-2112-M-001-074- and 112-2112-M-001-030-. The authors acknowledge the access to high-performance facilities (TIARA cluster and storage) in ASIAA and thank the National Center for High-performance Computing (NCHC) of National Applied Research Laboratories (NARLabs) in Taiwan for providing computational and storage resources. This work utilized tools (Kinetic Modules suite of codes) developed and maintained by the ASIAA CHARMS group. This research has made use of SAO/NASA Astrophysics Data System. We also thank the referees for their helpful comments that improved the manuscript.
