SCF-FDPS: A Fast N-body Code for Simulating Disk–Halo Systems

Shunsuke Hozumi; Keigo Nitadori; Masaki Iwasawa

doi:10.3847/1538-4357/acbea5

1. Introduction

The number of particles in N-body simulations of astronomical objects like galaxies has been increasing, in step with the progress in parallel computing technology. This remarkable development has brought a great benefit to disk-galaxy simulations, because galactic disks are rotation-supported, cold systems, so that a sufficiently large number of particles are needed for the disk to sidestep the heating originating from Poisson noise. In fact, Fujii et al. (2011) demonstrated that a spiral feature emerging in a disk surrounded by an unresponsive halo is fading away gradually over time for a million-particle simulation, while it persists until late times for a three-million-particle simulation. On the other hand, Athanassoula (2002) revealed that for a given disk–halo model, the disk is stabilized against bar formation when the halo is rigid, while a large-amplitude bar is excited through a wave–particle resonance between a bar mode in the disk and halo particles when the halo is live. This fact coerces us to deal with a halo as self-gravitating. In making a halo live for a disk–halo system, the mass of a halo particle has to be made equal to that of a disk particle to avoid the shot noise generated by halo particles when they pass through the disk. Unfortunately, a halo mass is estimated to be at least around an order of magnitude larger than a disk mass, because a halo is considered to extend far beyond the optical edge of the disk on the basis of the observed rotation curves of disk galaxies that are, in general, flat out to large radii (e.g., Sofue & Rubin 2001). Consequently, the number of halo particles becomes larger than that of disk particles by an order of magnitude or many more. It thus follows that disk-galaxy simulations inevitably demand a large number of particles.

As the number of particles in N-body simulation increases, the number of force calculations increases accordingly. Because a given particle receives the gravitational force from all other particles in an N-particle system, the number of the total force calculation reaches ${ \mathcal O }({N}^{2})$ at every time step in the simplest way. This explosive nature in force calculation is alleviated down to ${ \mathcal O }(N\mathrm{log}N)$ by the introduction of a tree algorithm developed by Barnes & Hut (1986). Indeed, recent large N-body simulations of disk galaxies are based on a tree code. For example, Dubinski et al. (2009) adopted a parallelized tree code to investigate the bar instability in galactic disks using 1.8 × 10⁷ particles for a disk and 10⁸ particles for a halo, while D'Onghia et al. (2013) used a tree-based gravity solver to examine the origin of spiral structure in disk galaxies with 10⁸ particles for a disk immersed in a rigid halo. Furthermore, Fujii et al. (2018) employed a tree-based code called BONSAI (Bédorf et al. 2012) optimized for graphics processing units to scrutinize the dynamics of disk galaxies that consist of live disk, bulge, and halo components with the total number of particles being increased up to 5 × 10⁸. In their subsequent work, Fujii et al. (2019) boosted the total number of particles up to 8 × 10⁹ to construct the Milky Way model that reproduces the observed properties.

As mentioned above, a tree algorithm is commonly used to study disk galaxies with a huge number of particles. In such a situation, a faster tree code is understandably desirable from various aspects of numerical studies. As computer architecture is shifted to a parallelized one, a tree code has been adjusted to a parallel computer. Above all, a numerical library termed Framework for Developing Particle Simulators (FDPS; Iwasawa et al. 2016; Namekata et al. 2018) has tuned a tree code to the utmost limit of a massively memory-distributed parallel computer. Therefore, no further speedup of existing tree codes is expected on their own.

We then try to incorporate a self-consistent field (SCF) code into a tree code. Of course, the FDPS library is implemented in the tree part of the resulting hybrid code for the efficient parallelization. In an SCF approach, Poisson's equation is solved by expanding the density and potential of the system being studied in a set of basis functions. In particular, owing to the expansion of the full spatial dependence, the central processing unit (CPU) cost becomes ${ \mathcal O }(N)$ . Moreover, because the perfect scalability is inherent in the SCF approach, it is suitable for parallel computing. By taking advantage of these characteristics, we will be able to accelerate N-body simulations of disk galaxies using a hybrid code named SCF-FDPS in which an SCF code is incorporated into an FDPS-implemented tree code (Hozumi et al. 2023).

In this paper, we describe how an SCF code is incorporated into a tree code, and show how well the resulting SCF-FDPS code works. In Section 2, we present the details of the SCF-FDPS code, including how an SCF approach is applied to a disk–halo system. In Section 3, along with the determination of the parameters inherent in the code, the performance of the code is shown. In Section 4, we discuss the extension of the present code to cope with a wide variety of disk-galaxy simulations. Conclusions are given in Section 5.

2. Details of the SCF-FDPS Code

We develop a fast N-body code that is based on both SCF and tree approaches. First, we explain the SCF method briefly, and then, describe the details of the SCF-FDPS code.

2.1. SCF Method

An SCF method requires a biorthonormal basis set that satisfies Poisson's equation written by

$\begin{eqnarray}&&{{\rm{\nabla }}}^{2}{{\rm{\Phi }}}_{{nlm}}({\boldsymbol{r}})=4\pi G{\rho }_{{nlm}}({\boldsymbol{r}}),\end{eqnarray} \tag{ 1 }$

where ρ_nlm( r ) and Φ_nlm( r ) are, respectively, the density and potential basis functions at the position vector of a particle, r , with n being the "quantum" number in the radial direction and with l and m being the corresponding quantities in the angular directions. Here, the biorthonormality is represented by

$\begin{eqnarray}&&\int {\rho }_{{nlm}}({\boldsymbol{r}}){\left[{{\rm{\Phi }}}_{n^{\prime} l^{\prime} m^{\prime} }({\boldsymbol{r}})\right]}^{* }d{\boldsymbol{r}}={\delta }_{{nn}^{\prime} }{\delta }_{{ll}^{\prime} }{\delta }_{{mm}^{\prime} },\end{eqnarray} \tag{ 2 }$

where ${\delta }_{{kk}^{\prime} }$ is the Kronecker delta defined by ${\delta }_{{kk}^{\prime} }=0$ for $k\ne k^{\prime}$ and ${\delta }_{{kk}^{\prime} }=1$ for $k=k^{\prime}$ .

With the help of such a biorthonormal basis set as is noted above, the density and potential of the system are expanded, respectively, by the corresponding basis functions as

$\begin{eqnarray}&&\rho ({\boldsymbol{r}})=\displaystyle \sum _{n,l,m}\,{A}_{{nlm}}\,{\rho }_{{nlm}}({\boldsymbol{r}})\end{eqnarray} \tag{ 3 }$

and

$\begin{eqnarray}&&{\rm{\Phi }}({\boldsymbol{r}})=\displaystyle \sum _{n,l,m}{A}_{{nlm}}\,{{\rm{\Phi }}}_{{nlm}}({\boldsymbol{r}}),\end{eqnarray} \tag{ 4 }$

where A_nlm are the expansion coefficients at time t. When the potential basis functions are operated on the density field that is expanded as Equation (3), A_nlm are given, via the biorthonormality relation of Equation (2), by

$\begin{eqnarray}&&{A}_{{nlm}}=\int \rho ({\boldsymbol{r}}){\left[{{\rm{\Phi }}}_{{nlm}}({\boldsymbol{r}})\right]}^{* }\,d{\boldsymbol{r}}.\end{eqnarray} \tag{ 5 }$

If a system consists of a collection of N discrete mass-points, the density is represented by

$\begin{eqnarray}&&\rho ({\boldsymbol{r}})=\displaystyle \sum _{k=1}^{N}{m}_{k}\,\delta ({\boldsymbol{r}}-{{\boldsymbol{r}}}_{k}),\end{eqnarray} \tag{ 6 }$

so that by substituting Equation (6) into Equation (5), A_nlm result in

$\begin{eqnarray}\begin{array}{rcl}{A}_{{nlm}} & = & \displaystyle \int \displaystyle \sum _{k=1}^{N}{m}_{k}\,\delta ({\boldsymbol{r}}-{{\boldsymbol{r}}}_{k}){\left[{{\rm{\Phi }}}_{{nlm}}({\boldsymbol{r}})\right]}^{* }\,d{\boldsymbol{r}}\\ & = & \displaystyle \sum _{k=1}^{N}\,{m}_{k}{\left[{{\rm{\Phi }}}_{{nlm}}({{\boldsymbol{r}}}_{k})\right]}^{* },\end{array}\end{eqnarray} \tag{ 7 }$

where m_k and r _k are the mass and position vector of the kth particle in the system, respectively, and δ( r ) is Dirac's delta function. After obtaining A_n _lm, we can derive the acceleration, a ( r ), by differentiating Equation (4) with respect to r , finding

$\begin{eqnarray}&&{\boldsymbol{a}}({\boldsymbol{r}})=-\displaystyle \sum _{n,l,m}{A}_{{nlm}}\,{\rm{\nabla }}{{\rm{\Phi }}}_{{nlm}}({\boldsymbol{r}}),\end{eqnarray} \tag{ 8 }$

where ∇Φ_nlm( r ) can be analytically calculated beforehand, once the basis set is specified.

As was found from Equation (7), this form of summation can be conveniently parallelized, so that an SCF code realizes the perfect scalability (Hernquist et al. 1995), which leads to ideal load balancing on a massively parallel computer. In addition, the CPU time is proportional to N × $({n}_{\max }+1)$ × ${({l}_{\max }+1)}^{2}$ , where ${n}_{\max }$ and ${l}_{\max }$ are the maximum numbers of expansion terms in the radial and angular directions, respectively. Therefore, an SCF code is fast and suitable for modern parallel computers. Accordingly, a fast N-body code is feasible by incorporating an SCF code into a tree code.

2.2. The SCF-FDPS Code

For a disk–halo system, the acceleration of the kth disk particle, a _d( r _d,k), at the position vector, r _d,k, and the acceleration of the kth halo particle, a _h( r _h,k), at the position vector, r _h,k, are, respectively, represented by

$\begin{eqnarray}&&{{\boldsymbol{a}}}_{{\rm{d}}}({{\boldsymbol{r}}}_{{\rm{d}},k})={{\boldsymbol{a}}}_{{\rm{d}}\to {\rm{d}}}({{\boldsymbol{r}}}_{{\rm{d}},k})+{{\boldsymbol{a}}}_{{\rm{h}}\to {\rm{d}}}({{\boldsymbol{r}}}_{{\rm{d}},k})\end{eqnarray} \tag{ 9 }$

and

$\begin{eqnarray}&&{{\boldsymbol{a}}}_{{\rm{h}}}({{\boldsymbol{r}}}_{{\rm{h}},k})={{\boldsymbol{a}}}_{{\rm{h}}\to {\rm{h}}}({{\boldsymbol{r}}}_{{\rm{h}},k})+{{\boldsymbol{a}}}_{{\rm{d}}\to {\rm{h}}}({{\boldsymbol{r}}}_{{\rm{h}},k}),\end{eqnarray} \tag{ 10 }$

where a _d→d( r _d,k) and a _h→d( r _d,k) denote the acceleration due to the gravitational force from other disk particles to the kth disk particle and that from halo particles to the kth disk particle, respectively, while a _h→h( r _h,k) and a _d→h( r _h,k) stand for the acceleration due to the gravitational force from other halo particles to the kth halo particle and that from disk particles to the kth halo particle, respectively.

Vine & Sigurdsson (1998) already developed a code named scftree in which an SCF code is incorporated into a tree code. In their code, a _h→h( r _h,k) and a _h→d( r _d,k) are calculated with an SCF method, while a _d→d( r _d,k) and a _d→h( r _h,k) are manipulated with a tree method. However, as explained in Section 1, the number of halo particles is at least about an order of magnitude larger than that of disk particles, so that the calculation of a _d→h( r _h,k) is extremely time-consuming, if a tree algorithm is used. Of course, local small-scale irregularities often generated in a rotation-supported disk can be well described with a tree code, which makes it reasonable to apply a tree method to the calculation of a _d→d( r _d,k). In contrast, in a halo that is supported by velocity dispersion, global features survive but small-scale ones are smoothed out to disappear, so that we can handle a halo using an SCF approach without so many expansion terms. In fact, there are suitable basis sets for spherical systems whose density and potential are reproduced with a small number of expansion terms. Then, we apply an SCF method to evaluate a _h→h( r _h,k). Furthermore, even though small-scale features exist in the disk, they do no serious harm to the overall structure of the halo, as we will show in Section 3. Therefore, we can apply an SCF method to the calculation of a _d→h( r _h,k) as well. After all, only a _d→d( r _d,k) is calculated with a tree method. For this part in the code, we implement a C++ version of the FDPS library (Iwasawa et al. 2016), which is publicly available, because it helps users parallelize a tree part easily with no efforts in tuning the code for parallelization. We then name the code developed here the SCF-FDPS code (Hozumi et al. 2023). This code will enable us to simulate disk–halo systems much faster than ever before for a fixed number of particles.

The actual procedure for calculating the accelerations of a _h→d( r _d,k), a _h→h( r _h,k), and a _d→h( r _h,k) is as follows. First, Equation (8) shows that a _h→d( r _d,k) is provided by

$\begin{eqnarray}&&{{\boldsymbol{a}}}_{{\rm{h}}\to {\rm{d}}}({{\boldsymbol{r}}}_{{\rm{d}},k})=-\displaystyle \sum _{n,l,m}{A}_{{\rm{h}},{nlm}}\,{\rm{\nabla }}{{\rm{\Phi }}}_{{nlm}}({{\boldsymbol{r}}}_{{\rm{d}},k}),\end{eqnarray} \tag{ 11 }$

where A_h,nlm are those expansion coefficients obtained from halo particles, which are given by

$\begin{eqnarray}&&{A}_{{\rm{h}},{nlm}}=\displaystyle \sum _{k=1}^{{N}_{{\rm{halo}}}}{m}_{{\rm{h}},k}{\left[{{\rm{\Phi }}}_{{nlm}}\left({{\boldsymbol{r}}}_{{\rm{h}},k}\right)\right]}^{\ast }.\end{eqnarray} \tag{ 12 }$

In Equation (12), N_halo is the number of halo particles, and m_h,k is the mass of the kth halo particle.

Next, as is shown by Equation (10), a _h→h( r _h,k) and a _d→h( r _h,k) are added up to generate a _h( r _h,k), and again Equation (8) indicates that a _h( r _h,k) is calculated as

$\begin{eqnarray}&&{{\boldsymbol{a}}}_{{\rm{h}}}({{\boldsymbol{r}}}_{{\rm{h}},k})=-\displaystyle \sum _{n,l,m}{A}_{{\rm{h}}+{\rm{d}},{nlm}}\,{\rm{\nabla }}{{\rm{\Phi }}}_{{nlm}}({{\boldsymbol{r}}}_{{\rm{h}},k}),\end{eqnarray} \tag{ 13 }$

where A_h+d,nlm are those expansion coefficients evaluated from disk and halo particles, which are written by

$\begin{eqnarray}&&{A}_{{\rm{h}}+{\rm{d}},{nlm}}={A}_{{\rm{h}},{nlm}}+{A}_{{\rm{d}},{nlm}}.\end{eqnarray} \tag{ 14 }$

Here, A_d,nlm are the expansion coefficients that are calculated from disk particles as

$\begin{eqnarray}&&{A}_{{\rm{d}},{nlm}}=\displaystyle \sum _{k=1}^{{N}_{{\rm{disk}}}}{m}_{{\rm{d}},k}{\left[{{\rm{\Phi }}}_{{nlm}}\left({{\boldsymbol{r}}}_{{\rm{d}},k}\right)\right]}^{\ast },\end{eqnarray} \tag{ 15 }$

where N_disk is the number of disk particles, and m_d,k is the mass of the kth disk particle.

In summary, the hybrid code is based on the following Hamiltonian of the system:

$\begin{eqnarray}&&\begin{array}{c}H=\displaystyle \sum _{k=1}^{{N}_{{\rm{d}}{\rm{i}}{\rm{s}}{\rm{k}}}}\displaystyle \frac{\left|{{\boldsymbol{p}}}_{{\rm{d}},k}{|}^{2}\right.}{2{m}_{{\rm{d}},k}}+\displaystyle \sum _{k=1}^{{N}_{{\rm{h}}{\rm{a}}{\rm{l}}{\rm{o}}}}\displaystyle \frac{\left|{{\boldsymbol{p}}}_{{\rm{h}},k}{|}^{2}\right.}{2{m}_{{\rm{h}},k}}\\ \,\,-\,\displaystyle \sum _{k=1}^{{N}_{{\rm{d}}{\rm{i}}{\rm{s}}{\rm{k}}}}\displaystyle \sum _{j=k+1}^{{N}_{{\rm{d}}{\rm{i}}{\rm{s}}{\rm{k}}}}\displaystyle \frac{{Gm}_{{\rm{d}},k}{m}_{{\rm{d}},j}}{\sqrt{|{{\boldsymbol{r}}}_{{\rm{d}},k}-{{\boldsymbol{r}}}_{{\rm{d}},j}{|}^{2}+{\varepsilon }^{2}}}\\ +\,\displaystyle \frac{1}{2}\sum _{n,l,m}\displaystyle \sum _{k=1}^{{N}_{{\rm{h}}{\rm{a}}{\rm{l}}{\rm{o}}}}\displaystyle \sum _{j=1}^{{N}_{{\rm{h}}{\rm{a}}{\rm{l}}{\rm{o}}}}\\ \,\,\qquad {m}_{{\rm{h}},k}{m}_{{\rm{h}},j}{{\rm{\Phi }}}_{nlm}({{\boldsymbol{r}}}_{{\rm{h}},k})\,{\left[{{\rm{\Phi }}}_{nlm}\left({{\boldsymbol{r}}}_{{\rm{h}},j}\right)\right]}^{\ast }\\ +\displaystyle \sum _{n,l,m}\displaystyle \sum _{k=1}^{{N}_{{\rm{d}}{\rm{i}}{\rm{s}}{\rm{k}}}}\displaystyle \sum _{j=1}^{{N}_{{\rm{h}}{\rm{a}}{\rm{l}}{\rm{o}}}}\\ \,\,\,{m}_{{\rm{d}},k}{m}_{{\rm{h}},j}{\rm{R}}{\rm{e}}\,\left({{\rm{\Phi }}}_{nlm}({{\boldsymbol{r}}}_{{\rm{d}},k})\,{\left[{{\rm{\Phi }}}_{nlm}\left({{\boldsymbol{r}}}_{{\rm{h}},j}\right)\right]}^{\ast }\right),\end{array}\end{eqnarray} \tag{ 16 }$

where ${{\boldsymbol{p}}}_{{\rm{d}},k}={m}_{{\rm{d}},k}{\dot{{\boldsymbol{r}}}}_{{\rm{d}},k}$ and ${{\boldsymbol{p}}}_{{\rm{h}},k}={m}_{{\rm{h}},k}{\dot{{\boldsymbol{r}}}}_{{\rm{h}},k}$ are the momentum of the kth disk particle and that of the kth halo particle, respectively. The first two terms are kinetic ones. The third term is the self-gravity of the disk, which is calculated with a tree method based on the softened gravity of the Plummer type using a softening length, ε. Notice that this expression is used for convenience. That is, it is incorrect in a strict sense, because we cannot exactly construct the Hamiltonian owing to the way of calculating the gravitational force in the tree algorithm. The fourth term is the self-gravity of the halo expressed by the expansions due to the basis functions introduced into the SCF method. The last term represents the disk–halo interactions that are also expanded with the basis functions.

We have postulated above that each particle in a disk–halo system has a different mass. In fact, the SCF-FDPS code supports individually different masses for constituent particles in such a system. However, the mass of a halo particle should be made identical to that of a disk particle so as to prevent the shot noise caused by the halo particles that pass through the disk. Consequently, in a practical sense, it is appropriate to assign an identical mass to each particle in a disk–halo system.

Now that the left-hand sides of Equations (9) and (10) are obtained as explained above, we can simulate a disk–halo system with the code developed here. As a cautionary remark, we need a relatively large number of the angular expansion terms to capture the gravitational contribution from disk particles to halo particles properly, because the disk geometry deviates from a spheroidal shape to a considerable degree.

2.3. Parallelization

All simulations of the disk–halo system are run on a machine with an AMD Ryzen Threadripper 3990X 64-core processor. Although all 64 cores of this processor share the main memory, we apply not the thread parallelization but the Message Passing Interface (MPI) parallelization to the SCF-FDPS code, and execute simulations on up to 64 processes.

The MPI parallelization of the SCF part in the SCF-FDPS code is straightforward: once the particles are equally distributed to each process, only one API call, MPI_Allreduce(), is needed for the summation of those expansion coefficients, which are calculated on each process. Regarding the SCF part, we do not need to move particles across MPI processes. On the other hand, the parallelization of the tree part is more formidable than that of the SCF part, because we have to take into consideration the spatial decomposition and exchange of both particles and tree information between domains. Fortunately, the FDPS library copes with this complexity so as to be hidden from the programmers.

2.4. Hardware-specific Tuning

The processor mentioned in Section 2.3 supports up to 256 bits width SIMD instructions known as AVX. This corresponds to four words of double-precision numbers, or eight words of single-precision numbers as the word length that can be processed at once. We conservatively adopt double-precision arithmetic in the SCF-FDPS code to establish a reliable calculation method. A further speedup by using the single-precision is the subject of future work. Thus, a speedup to a fourfold increase is expected if the SIMD instructions are available.

In general, compiler's vectorization is applied to the innermost loops. However, this is not always the optimal way to exploit SIMD instructions. In the SCF-FDPS code, the compute kernel of the SCF part consists of the outermost loop for the particle index k and several inner loops for the indices n, l, and m that accompany the basis functions. Some of the inner loops can hardly be vectorized because of their recurrence properties. Thus, the maximal SIMD instruction rate is achieved when the vectorization is applied to the particle index k. To this end, we write the compute kernel of the SCF part in the SCF-FDPS code by the intrinsic functions of AVX to manually vectorize the outermost loop. In this way, the positions and masses of four particles are fetched at once, and the values of the basis functions are computed in parallel.

For the tree part, the compute kernel takes a double-loop form composed of an outer loop for the sink particles that feel the gravitational force and an inner loop for the source particles that attract others. Of the two loops, the SIMD conversion is applied to the outer loop through the intrinsic functions. The benefit of the outer-loop parallelization is the reduction of memory access, because fetching the coordinates and mass of one source particle to accumulate the gravitational forces for four sink particles is more efficient than fetching four source particles to accumulate the gravitational forces to one sink particle.

2.5. Portability

As we have mentioned, the compute kernels of the SCF part and the tree part in the SCF-FDPS code are written using the intrinsic functions of AVX. However, that code can be compiled not only by the Intel compiler but also by GCC and LLVM Clang. At the same time, it can run on the other x86 processors that support AVX/AVX2. Except in the SIMD intrinsics, the SCF-FDPS code is written in standard C++17 and MPI, so that it runs on the arbitrary number of processors as well as on the 64-core processors used here, regardless of whether processors are configured within a node or shared over multiple nodes. In fact, we have confirmed that the SCF-FDPS code can run properly using 512 cores on a Cray XC50 system.

3. Tests of the SCF-FDPS Code

3.1. Disk–halo Model

We use a disk–halo model to examine the performance of the SCF-FDPS code. The disk model is an exponential disk that is locally isothermal in the vertical direction. The volume density distribution, ρ_d, is given by

$\begin{eqnarray}&&{\rho }_{{\rm{d}}}(R,z)=\displaystyle \frac{{M}_{{\rm{d}}}}{4\pi {h}^{2}{z}_{0}}\exp (-R/h){{\rm{sech}} }^{2}(z/{z}_{0}),\end{eqnarray} \tag{ 17 }$

where R is the cylindrical radius, z is the vertical coordinate with respect to the midplane of the disk, M_d is the disk mass, h is the radial scale length, and z₀ is the vertical scale length being set to be 0.2 h. The disk is truncated explicitly at R = 15 h in the radial direction. On the other hand, the halo model is described by a Navarro–Frenk–White (NFW) profile (Navarro et al. 1996, 1997), whose density distribution, ρ_h, is written by

$\begin{eqnarray}&&{\rho }_{{\rm{h}}}(r)=\displaystyle \frac{{\rho }_{0}}{(r/{r}_{{\rm{s}}}){\left(1+r/{r}_{{\rm{s}}}\right)}^{2}},\end{eqnarray} \tag{ 18 }$

where r is the spherical radius, r_s is the radial scale length, and ρ₀ is provided by

$\begin{eqnarray}&&{\rho }_{0}=\displaystyle \frac{{M}_{{\rm{h}}}}{4\pi {{R}_{{\rm{h}}}}^{3}}\displaystyle \frac{{{C}_{\mathrm{NFW}}}^{3}}{\mathrm{ln}(1+{C}_{\mathrm{NFW}})-{C}_{\mathrm{NFW}}/(1+{C}_{\mathrm{NFW}})}.\end{eqnarray} \tag{ 19 }$

In Equation (19), R_h is the cutoff radius of the halo, M_h is the halo mass within R_h, and C_NFW is the concentration parameter defined by

$\begin{eqnarray}&&{C}_{\mathrm{NFW}}={R}_{{\rm{h}}}/{r}_{{\rm{s}}}.\end{eqnarray} \tag{ 20 }$

As a basic model, we choose M_h = 5 M_d, R_h = 30 h, and C_NFW = 5 for the halo model. These choices lead to r_s = 6. Concerning a specific performance test, the halo mass is changed with the other quantities being left intact.

We construct the equilibrium disk–halo model described above using a software tool called Many-component Galaxy Initializer (MAGI; Miki & Umemura 2018). Retrograde stars are introduced in the same way as that adopted by Zang & Hohl (1978), and the parameter η, which specifies the fraction of retrograde stars, is set to be 0.25. We choose the Toomre's Q parameter (Toomre 1964) to be 1.2 at R = h. In our simulations, the gravitational constant, G, and the units of mass and scale length are taken such that G = 1, M_d = 1, and h = 1.

We find from Equation (18) that the NFW halo shows a cuspy density distribution like r⁻¹ down to the center. In accordance with this characteristic, we adopt Hernquist–Ostriker's basis set (Hernquist & Ostriker 1992). Because the lowest-order members of this basis set are based on the Hernquist model (Hernquist 1990) whose density behaves like an r⁻¹ cusp at small radii, that basis set is suitable to represent the NFW halo with a small number of expansion terms. The exact functional forms of the basis set are shown in the Appendix.

3.2. Convergence Tests

For the SCF part in the SCF-FDPS code, we need to specify ${n}_{\max }$ and ${l}_{\max }$ . We determine ${n}_{\max }$ by comparing the radial acceleration calculated analytically with that derived from the expanded potential of the spherically symmetric NFW halo shown in Equation (18), which is realized by retaining ${l}_{\max }=0$ . In Figure 1, the radial acceleration obtained from the expanded potential for ${n}_{\max }=10$ , 16, and 20 is compared with the exact one. The scale length of the basis functions, a, is set to be a = 6. This figure indicates that the radial acceleration obtained with ${n}_{\max }=10$ shows some relatively large deviation from the exact one, while the radial acceleration with ${n}_{\max }=16$ is almost comparable to that with ${n}_{\max }=20$ . From this consideration, we adopt ${n}_{\max }=16$ . On the other hand, there is no way to estimate ${l}_{\max }$ for a spherical halo model. To search for an appropriate value of ${l}_{\max }$ , we carry out convergence tests in which ${l}_{\max }=12$ , 16, and 20 are examined, with ${n}_{\max }=16$ being retained. We found that the disk–halo model constructed in Section 3.1 forms a bar via the bar instability (see Figure 7). Then, we use the time evolution of the bar amplitude as a measure to determine ${l}_{\max }$ .

**Figure 1.** Relative radial-acceleration error of the spherically symmetric NFW halo model as a function of radius. In this plot, a_r is the exact acceleration of the NFW halo, while ${a}_{r,\exp }$ is the radial acceleration derived from the expanded potential using Hernquist–Ostriker's basis functions with a scale length of a = 6. The three curves show the effect of the maximum number of the radial expansion terms, ${n}_{\max }$ , on the resulting radial acceleration with the maximum number of the angular expansion terms, ${l}_{\max }=0$ , being retained. Note that the scaling of the abscissa is changed at r = 1 from the left to the right panels, whereby the ordinate is also rescaled accordingly.
Download figure:
Standard image High-resolution image

**Figure 1.** Relative radial-acceleration error of the spherically symmetric NFW halo model as a function of radius. In this plot, a_r is the exact acceleration of the NFW halo, while ${a}_{r,\exp }$ is the radial acceleration derived from the expanded potential using Hernquist–Ostriker's basis functions with a scale length of a = 6. The three curves show the effect of the maximum number of the radial expansion terms, ${n}_{\max }$ , on the resulting radial acceleration with the maximum number of the angular expansion terms, ${l}_{\max }=0$ , being retained. Note that the scaling of the abscissa is changed at r = 1 from the left to the right panels, whereby the ordinate is also rescaled accordingly.
Download figure:
Standard image High-resolution image

Regarding the parameters related to the tree part, we use θ = 0.3 and 0.5 as an opening angle, and ε = 0.006 as a softening length of the Plummer type. Gravitational forces are expanded up to quadrupole order.

We assign N_disk = 6,400,000 to the disk, and N_halo = 32,000,000 to the halo. A time-centered leapfrog algorithm (Press et al. 1986) is employed with a fixed time step of Δt = 0.1.

For comparison, the same disk–halo model is simulated with a tree code on which the FDPS library is implemented. Hereafter, we call this code the FDPS tree code, which is also applied to the SIMD instructions as has been done to the SCF-FDPS code. All of the tree parameters are the same as those employed for the convergence tests.

In Figure 2, we show the time evolution of the bar amplitude for θ = 0.3 and 0.5 in each of which ${l}_{\max }=12,16$ , and 20 are employed, while ${n}_{\max }=16$ is held fixed. Furthermore, the results with the FDPS tree code are also plotted. On the basis of these results, in particular, paying attention to the behavior of the exponentially growing phase of the bar amplitude from t = 0 to t ∼ 300, we select ${l}_{\max }=16$ .

3.3. Performance Tests

We carry out performance tests to examine how fast the SCF-FDPS code is as compared to the FDPS tree code. We measure the CPU time in the cases of θ = 0.3 and 0.5. For each value of θ, the Plummer type softening is used with ε = 0.006, and forces are expanded up to quadrupole order. Again, we use a time-centered leapfrog method (Press et al. 1986) with a fixed time step of Δt = 0.1.

In Figure 3, the CPU time using 64 cores per step is plotted as a function of the total number of particles, N = N_disk + N_halo, with the ratio of N_halo/N_disk = 5 being fixed. We can see that the CPU time is nearly proportional to N for both codes, but that the SCF-FDPS code is at least 3 times faster than the FDPS tree code for θ = 0.5, while the former is about 5–6 times faster than the latter for θ = 0.3. As N_disk increases, the ratio of the CPU time measured with the FDPS tree code to that with the SCF-FDPS code decreases for both values of θ. For example, the ratio is 3.3 for N_disk = 640,000, while it is 3.1 for N_disk = 20,480,000 when θ = 0.5 is used. If θ = 0.3 is used, the ratio decreases from 5.9 for N_disk = 640,000 to 4.8 for N_disk = 20,480,000. As Figure 4 demonstrates, the fraction of the CPU time exhausted by the tree part in the SCF-FDPS code increases as N_disk increases, while the CPU time consumed by the SCF part is basically proportional to N_halo. As a result, that ratio of the CPU time decreases with increasing N_disk.

**Figure 4.** Fraction of the CPU time occupied by the tree part in the SCF-FDPS code as a function of N_disk, which is calculated from the simulations shown in Figure 3.
Download figure:
Standard image High-resolution image

Next, in Figure 5, the CPU time per step is plotted as a function of the number of cores, ${N}_{\mathrm{core}}$ , used on the computer with N_disk = 6,400,000 and N_halo = 32,000,000 being unchanged. Irrespective of the value of θ, the CPU time scales as $\sim {N}_{\mathrm{core}}^{-0.8}$ , which means that the CPU time is almost inversely proportional to ${N}_{\mathrm{core}}$ for both codes. However, the SCF-FDPS code is about 3.6 times faster than the FDPS tree code for θ = 0.5, while the former is approximately 6.4 times faster than the latter for θ = 0.3. In the right panel of Figure 5, we can see that as ${N}_{\mathrm{core}}$ increases, the decrease rate in the CPU time becomes smaller. This is because the CPU clock is made to be lowered as ${N}_{\mathrm{core}}$ increases.

Last, in Figure 6, the CPU time using 64 cores per step is plotted as a function of the fraction of disk particles, f = N_disk/N, where N = N_disk + N_halo, and we use f = 1/16, 1/12, 1/10, 1/8, and 1/6. In this performance test, we change the ratio of N_halo/N_disk, while making the total number of particles unchanged as N = 30,720,000. As a result, the mass ratio of M_halo/M_disk is not constant but changes identically to the ratio of N_halo/N_disk. The other parameters such as R_h and C_NFW are left unchanged. After all, each halo model specified by the value of f is constructed by adjusting the value of ρ₀ in Equation (19) to the given M_halo. Figure 6 indicates how the fraction of the tree part in the SCF-FDPS code affects the CPU time. As a reference, we plot the results using the FDPS tree code. For these tree-code simulations, all particles are obviously calculated with a tree algorithm, so that the CPU time may be expected to be independent of f. In reality, the CPU time depends weakly on f, and it is proportional to f^0.043 for θ = 0.3, and to f^0.031 for θ = 0.5. On the other hand, the CPU time increases with f if the SCF-FDPS code is used for both values of θ. However, for θ = 0.5, the SCF-FDPS code is about 4.5 times faster at f = 1/16 and about 3.1 times faster at f = 1/6 than the FDPS tree code, while for θ = 0.3, the former is about an order of magnitude faster at f = 1/16 and about 5.1 times faster at f = 1/6 than the latter.

3.4. Simulation Results

We carry out simulations of the disk–halo system described by Equations (17) and (18) to examine to what degree the simulation results obtained with the SCF-FDPS code are similar to those with the FDPS tree code. The simulation details are taken over from those adopted for the performance tests. For each value of θ, the energy was conserved to better than 0.028% using the SCF-FDPS code, while it was conserved to better than 0.037% using the FDPS tree code. Figure 7 shows the time evolution of the surface densities of the disk projected on to the xy-, yz-, and zx-planes for θ = 0.3 and 0.5. We find from this figure that the time evolution of the disk surface densities obtained with the SCF-FDPS code is in excellent agreement with that using the FDPS tree code for both values of θ at least until t = 500. At later times, owing to the difference in the bar pattern speed from simulation to simulation, the bar phase differs accordingly. Even though a difference in the bar pattern speed is only slight at the bar formation epoch, it accumulates with time, so that the difference in the bar phase becomes larger and larger as time progresses. At any rate, the time evolution of the disk is satisfactorily similar between the two codes.

**Figure 7.** Time evolution of the surface densities of the disk projected on to the xy-, yz-, and zx-planes for the opening angle of θ = 0.3 (a), and that of θ = 0.5 (b). For each value of θ, the top panels show the results with the SCF-FDPS code, while the bottom panels exhibit those with the tree code into which the FDPS library is implemented. The softening length is set to be ε = 0.006 for all simulations. Regarding the SCF-FDPS simulations, ${n}_{\max }=16$ and ${l}_{\max }=16$ are used. Note that the drift motion along the vertically upward direction is seen from t = 500 to t = 800 for the θ = 0.5 simulation with the tree code.
Download figure:
Standard image High-resolution image

**Figure 7.** Time evolution of the surface densities of the disk projected on to the xy-, yz-, and zx-planes for the opening angle of θ = 0.3 (a), and that of θ = 0.5 (b). For each value of θ, the top panels show the results with the SCF-FDPS code, while the bottom panels exhibit those with the tree code into which the FDPS library is implemented. The softening length is set to be ε = 0.006 for all simulations. Regarding the SCF-FDPS simulations, ${n}_{\max }=16$ and ${l}_{\max }=16$ are used. Note that the drift motion along the vertically upward direction is seen from t = 500 to t = 800 for the θ = 0.5 simulation with the tree code.
Download figure:
Standard image High-resolution image

4. Discussion

We have shown in Figure 6 that the CPU time taken with the SCF-FDPS code increases as the fraction of N_disk increases. In that figure, the mass of each halo is assigned to that included within r = 30. However, if the optical edge of the disk is about 15 kpc, this radius corresponds to r = 6.25, because the disk scale length is estimated to be 2.4 kpc (Bland-Hawthorn & Gerhard 2016). In this case, Figure 8 indicates that the halo mass within r = 6.25 is at most about 1.7 times the disk mass even for the largest ratio of M_h/M_d = 15. Since the halo mass within the optical edge of the disk is at least comparable to the disk mass, we may be allowed to regard f = 1/16 in Figure 6 as a reference value of f. Thus, on the basis of the results obtained from the simulations with the value of f = 1/16 for θ = 0.3 and θ = 0.5, it follows that in a practical sense, the SCF-FDPS code is about an order of magnitude faster than the FDPS tree code for θ = 0.3, and that the former is about 4.5 times faster than the latter for θ = 0.5.

**Figure 8.** Cumulative mass of each halo used in Figure 6 as a function of radius. The halo masses are normalized by the disk mass, and M_h = 5, 7, 9, 11, and 15 correspond to f = 1/6, 1/8, 1/12, and 1/16, respectively, where f is the fraction of disk particles as in Figure 6. The vertical dashed line indicates the radius of 15 kpc when the radial scale length of the disk is assumed to be h = 2.4 kpc.
Download figure:
Standard image High-resolution image

We notice from Figure 7(b) that in the simulation for θ = 0.5 executed with the FDPS tree code, the disk begins to drift upward along the z-axis at t ∼ 300, which continues to the end of the run, while in the corresponding simulation with the SCF-FDPS code, no upward drift occurs during the run. As found from Figure 7(a), such an upward drift does not arise in the simulation for θ = 0.3 with both codes. Thus, in general, tree-code simulations of a disk–halo system do not necessarily lead to a vertical drift motion of the disk. Indeed, in general, linear momentum is not conserved intrinsically in an exact sense for numerical codes based on expansion techniques such as tree and SCF codes. However, our results may suggest that owing to the small fraction of the tree-based calculation, the SCF-FDPS code can easily conserve the linear momentum of each component better than the FDPS tree code to some satisfactory degree.

In our test simulations, we have adopted the softened gravity due to the softening of the Plummer type because it is easily implemented in the SCF-FDPS code. However, in some situations, spline softening (Hernquist & Katz 1989) may be useful because the force law turns into the pure Newton's law of universal gravitation at interparticle distances larger than twice the softening length. Then, we have also implemented the spline softening in the SCF-FDPS code.

For the SCF part in the SCF-FDPS code, we have used Hernquist–Ostriker's basis set on the grounds that it well describes a cuspy density distribution that the halo model chosen here shows. In addition, we have also implemented Clutton-Brock's basis set (Clutton-Brock 1973). This is suitable for cored density distributions, because the lowest-order members of the basis functions are based on the Plummer model (Plummer 1911). Therefore, the SCF-FDPS code can accommodate a wide variety of halo profiles.

In the SCF-FDPS code, disk particles are treated with a tree algorithm, so that the gas component can easily be included by implementing a smoothed particle hydrodynamics (SPH) method (Gingold & Monaghan 1977; Lucy 1977), as was done by Hernquist & Katz (1989), who named the code TREESPH. Fortunately, the FDPS library supports the implementation of an SPH method by supplying its sample code. Furthermore, an individual time-step method (e.g., McMillan 1986; Hernquist & Katz 1989; Makino 1991) can also be set in the SCF-FDPS code, which enables us, for example, to properly trace particles moving closely around a supermassive black hole residing at the disk center. Accordingly, we will be able to cope with various problems involved in disk galaxies by equipping additional functions such as SPH and individual time-step methods with the current SCF-FDPS code.

5. Conclusions

We have developed a fast N-body code for simulating disk–halo systems by incorporating an SCF code into a tree code. In particular, the success in achieving the high performance consists in reducing the time-consuming tree-dependent force calculation only to the self-gravity of disk particles by applying an SCF method to the calculation of the gravitational forces between disk and halo particles as well as that of the self-gravity of halo particles. In addition, the SCF-FDPS code has the characteristics that the CPU time is almost proportional to the total number of particles for the fixed number of cores and almost inversely proportional to the number of cores equipped on a computer for the fixed number of particles. As a result, for a disk–halo system, the SCF-FDPS code developed here is at minimum about 3 times faster and in one case, up to an order of magnitude faster, depending on the opening angle, θ, used in the tree method, and on the fraction of tree particles, f = N_disk/(N_disk + N_halo), than a highly tuned tree code like the FDPS tree code. Of course, the SCF-FDPS code leads to the time evolution of a disk–halo system similarly to that with the FDPS tree code.

We have implemented Clutton-Brock's basis set suitable for cored density distributions as well as Hernquist–Ostriker's basis set appropriate for cuspy density distributions on the SCF-FDPS code, so that it is capable of coping with a wide variety of halo profiles. Furthermore, because the spline softening as well as the Plummer softening have been implemented on that code, it will be able to be applied to the investigation of extensive dynamical problems of disk–halo systems.

We can easily incorporate both SPH and individual time-step methods into the tree part in the SCF-FDPS code. Therefore, the SCF-FDPS code will be able to be extended so that we can tackle central issues of disk-galaxy simulations like the evolution of a disk galaxy harboring a central supermassive black hole including a gas component with a huge number of particles by utilizing its high performance.

We are grateful to Dr. Yohei Miki for his advice about the usage of MAGI. S.H. thanks Prof. Lars Hernquist for his comments on the manuscript. This work was supported by JSPS KAKENHI grant No. JP21K03626. Some of the SCF and tree-code simulations were carried out on the Cray XC50 system at the Center for Computational Astrophysics at the National Astronomical Observatory of Japan.

Appendix: The Density and Potential Basis Functions

The basis set adopted here is that constructed by Hernquist & Ostriker (1992). The density and potential basis functions, expressed by ρ_nlm( r ) and Φ_nlm( r ), respectively, are represented by

$\begin{eqnarray}\begin{array}{rcl}{\rho }_{{nlm}}({\boldsymbol{r}}) & = & {K}_{{nl}}\displaystyle \frac{M}{2\pi {a}^{3}}\displaystyle \frac{{\left(r/a\right)}^{l}}{\left(r/a\right){\left(1+r/a\right)}^{2l+3}}\\ & & \times \,{C}_{n}^{\left(2l+3/2\right)}(\xi )\sqrt{4\pi }\,{Y}_{{lm}}(\theta ,\phi )\end{array}\end{eqnarray} \tag{ A1 }$

and

$\begin{eqnarray}\begin{array}{rcl}{{\rm{\Phi }}}_{{nlm}}({\boldsymbol{r}}) & = & -\displaystyle \frac{{GM}}{a}\displaystyle \frac{{\left(r/a\right)}^{l}}{{\left(1+r/a\right)}^{2l+1}}\\ & & \times \,{C}_{n}^{\left(2l+3/2\right)}(\xi )\sqrt{4\pi }\,{Y}_{{lm}}(\theta ,\phi ),\end{array}\end{eqnarray} \tag{ A2 }$

where M is the mass of the system, a is the scale length, ${C}_{n}^{(\alpha )}(\xi )$ are the ultraspherical, or Gegenbauer polynomials (Abramowitz & Stegun 1972) with ξ being the radial transformation defined by

$\begin{eqnarray}&&\xi =\displaystyle \frac{r-a}{r+a},\end{eqnarray} \tag{ A3 }$

and Y_lm(θ, ϕ) are spherical harmonics that are related to associated Legendre polynomials, P_lm(x), by

$\begin{eqnarray}&&{Y}_{{lm}}(\theta ,\phi )=\sqrt{\displaystyle \frac{2l+1}{4\pi }\displaystyle \frac{(l-m)!}{(l+m)!}}{P}_{{lm}}(\cos \theta )\exp ({im}\phi ),\end{eqnarray} \tag{ A4 }$

where i is the imaginary unit.

In Equation (A1), the normalization factor, K_nl, is provided by

$\begin{eqnarray}&&{K}_{{nl}}=\displaystyle \frac{1}{2}n(n+4l+3)+(l+1)(2l+1).\end{eqnarray} \tag{ A5 }$

SCF-FDPS: A Fast N-body Code for Simulating Disk–Halo Systems

Article metrics

Author e-mails

Author affiliations

ORCID iDs

Dates

Abstract

1. Introduction