SCF-FDPS: A Fast $N$-body Code for Simulating Disk-halo Systems

A fast $N$-body code has been developed for simulating a stellar disk embedded in a live dark matter halo. In generating its Poisson solver, a self-consistent field (SCF) code which inherently possesses perfect scalability is incorporated into a tree code which is parallelized using a library termed Framework for Developing Particle Simulators (FDPS). Thus, the code developed here is called SCF-FDPS. This code has realized the speedup of a conventional tree code by applying an SCF method not only to the calculation of the self-gravity of the halo but also to that of the gravitational interactions between the disk and halo particles. Consequently, in the SCF-FDPS code, a tree algorithm is applied only to calculate the self-gravity of the disk. On a many-core parallel computer, the SCF-FDPS code has performed at least three times, in some case nearly an order of magnitude, faster than an extremely-tuned tree code on it, if the numbers of disk and halo particles are, respectively, fixed for both codes. In addition, the SCF-FDPS code shows that the cpu cost scales almost linearly with the total number of particles and almost inversely with the number of cores. We find that the time evolution of a disk-halo system simulated with the SCF-FDPS code is, in large measure, similar to that obtained using the tree code. We suggest how the present code can be extended to cope with a wide variety of disk-galaxy simulations.


INTRODUCTION
The number of particles in N -body simulations of astronomical objects like galaxies has been increasing, in step with the progress in parallel computing technology.This remarkable development has brought a great benefit to disk-galaxy simulations, because galactic disks are rotation-supported, cold systems, so that a sufficiently large number of particles are needed for the disk to sidestep the heating originating from Poisson noise.In fact, Fujii et al. (2011) have demonstrated that a spiral feature emerging in a disk surrounded by an unresponsive halo is fading away gradually over time for a millionparticle simulation, while it persists until late times for a three-million-particle simulation.On the other hand, Athanassoula (2002) has revealed that for a given disk-halo model, the disk is stabilized against bar formation when the halo is rigid, while a large-amplitude bar is excited through a wave-particle resonance between a bar mode in the disk and halo particles when the halo is live.This fact coerces us to deal with a halo as selfgravitating.In making a halo live for a disk-halo system, the mass of a halo particle has to be made equal to that of a disk particle to avoid the shot noise generated by halo particles when they pass through the disk.Unfortunately, a halo mass is estimated to be at least around an order of magnitude larger than a disk mass, because a halo is considered to extend far beyond the optical edge of the disk on the basis of the observed rotation curves of disk galaxies that are, in general, flat out to large radii (e.g., Sofue & Rubin 2001).Consequently, the number of halo particles becomes larger than that of disk particles by an order of magnitude or many more.It thus follows that disk-galaxy simulations inevitably demand a large number of particles.
As the number of particles in N -body simulation increases, the number of force calculation increases accordingly.Because a given particle receives the gravitational force from all other particles in an N -particle system, the number of the total force calculation reaches O(N 2 ) at every time step in the simplest way.This explosive nature in force calculation is alleviated down to O(N log N ) by the introduction of a tree algorithm developed by Barnes & Hut (1986).Indeed, recent large N -body simulations of disk galaxies are based on a tree code.For example, Dubinski et al. (2009) adopted a parallelized tree code to investigate the bar instability in galactic disks using 1.8 × 10 7 particles for a disk and 10 8 particles for a halo, while D'Onghia et al. ( 2013) used a tree-based gravity solver to examine the origin of spiral structure in disk galaxies with 10 8 particles for a disk immersed in a rigid halo.Furthermore, Fujii et al. (2018) have employed a tree-based code called BONSAI (Bédorf et al. 2012) optimized for Graphics Processing Units to scrutinize the dynamics of disk galaxies which consist of live disk, bulge, and halo components with the total number of particles being increased up to 5 × 10 8 .In their subsequent work, Fujii et al. (2019) have boosted the total number of particles up to 8 × 10 9 to construct the Milky Way Galaxy model that reproduces the observed properties.
As mentioned above, a tree algorithm is commonly used to study disk galaxies with a huge number of particles.In such a situation, a faster tree code is understandably desirable from various aspects of numerical studies.As computer architecture is shifted to parallelized one, a tree code has been adjusted to a parallel computer.Above all, a numerical library termed Framework for Developing Particle Simulators (FDPS) (Iwasawa et al. 2016;Namekata et al. 2018) has tuned a tree code to the utmost limit of a massively memory-distributed parallel computer.Therefore, no further speedup of existing tree codes is expected on their own.
We then try to incorporate a self-consistent field (SCF) code into a tree code.Of course, the FDPS library is implemented in the tree part of the resulting hybrid code for the efficient parallelization.In an SCF approach, Poisson's equation is solved by expanding the density and potential of the system being studied in a set of basis functions.In particular, owing to the expansion of the full spatial dependence, the cpu cost becomes O(N ).Moreover, because the perfect scalability is inherent in the SCF approach, it is suitable for parallel computing.By taking advantage of these characteristics, we will be able to accelerate N -body simulations of disk galaxies using a hybrid code named SCF-FDPS in which an SCF code is incorporated into an FDPSimplemented tree code (Hozumi et al. 2023).
In this paper, we describe how an SCF code is incorporated into a tree code, and show how well the resulting SCF-FDPS code works.In Section 2, we present the details of the SCF-FDPS code, including how an SCF approach is applied to a disk-halo system.In Section 3, along with the determination of the parameters inherent in the code, the performance of the code is shown.In Section 4, we discuss the extension of the present code to cope with a wide variety of disk-galaxy simulations.Conclusions are given in Section 5.

DETAILS OF THE SCF-FDPS CODE
We develop a fast N -body code which is based on both SCF and tree approaches.First, we explain the SCF method briefly, and then, describe the details of the SCF-FDPS code.

SCF Method
An SCF method requires a biorthonormal basis set which satisfies Poisson's equation written by where ρ nlm (r) and Φ nlm (r) are, respectively, the density and potential basis functions at the position vector of a particle, r, with n being the 'quantum' number in the radial direction and with l and m being corresponding quantities in the angular directions.Here, the biorthonormality is represented by where δ kk is the Kronecker delta defined by δ kk = 0 for k = k and δ kk = 1 for k = k .With the help of such a biorthonormal basis set as is noted above, the density and potential of the system are expanded, respectively, by the corresponding basis functions as where A nlm are the expansion coefficients at time t.
When the potential basis functions are operated on the density field that is expanded as Equation (3), A nlm are given, via the biorthonormality relation of Equation (2), by (5) If a system consists of a collection of N discrete masspoints, the density is represented by so that by substituting Equation (6) into Equation ( 5), A nlm result in where m k and r k are the mass and position vector of the kth particle in the system, respectively, and δ(r) is Dirac's delta function.After obtaining A nlm , we can derive the acceleration, a(r), by differentiating Equation (4) with respect to r, finding where ∇Φ nlm (r) can be analytically calculated beforehand, once the basis set is specified.
As found from Equation ( 7), this form of summation can be conveniently parallelized, so that an SCF code realizes the perfect scalability (Hernquist et al. 1995), which leads to ideal load balancing on a massively parallel computer.In addition, the cpu time is proportional to N ×(n max + 1)×(l max + 1) 2 , where n max and l max are the maximum numbers of expansion terms in the radial and angular directions, respectively.Therefore, an SCF code is fast and suitable for modern parallel computers.Accordingly, a fast N -body code is feasible by incorporating an SCF code into a tree code.

The SCF-FDPS Code
For a disk-halo system, the acceleration of the kth disk particle, a d (r d,k ), at the position vector, r d,k , and the acceleration of the kth halo particle, a h (r h,k ), at the position vector, r h,k , are, respectively, represented by where a d→d (r d,k ) and a h→d (r d,k ) denote the acceleration due to the gravitational force from other disk particles to the kth disk particle and that from halo particles to the kth disk particle, respectively, while a h→h (r h,k ) and a d→h (r h,k ) stand for the acceleration due to the gravitational force from other halo particles to the kth halo particle and that from disk particles to the kth halo particle, respectively.Vine & Sigurdsson (1998) have already developed a code named scftree in which an SCF code is incorporated into a tree code.In their code, a h→h (r h,k ) and a h→d (r d,k ) are calculated with an SCF method, while a d→d (r d,k ) and a d→h (r h,k ) are manipulated with a tree method.However, as explained in Section 1, the number of halo particles is at least about an order of magnitude larger than that of disk particles, so that the calculation of a d→h (r h,k ) is extremely time-consuming, if a tree algorithm is used.Of course, local small-scale irregularities often generated in a rotation-supported disk can be well-described with a tree code, which makes it reasonable to apply a tree method to the calculation of a d→d (r d,k ).In contrast, in a halo which is supported by velocity dispersion, global features survive but smallscale ones are smoothed out to disappear, so that we can handle a halo using an SCF approach without so many expansion terms.In fact, there are suitable basis sets for spherical systems whose density and potential are reproduced with a small number of expansion terms.Then, we apply an SCF method to evaluate a h→h (r h,k ).Furthermore, even though small-scale features exist in the disk, they do no serious harm to the overall structure of the halo, as we will show in Section 3. Therefore, we can apply an SCF method to the calculation of a d→h (r h,k ) as well.After all, only a d→d (r d,k ) is calculated with a tree method.For this part in the code, we implement a C++ version of the FDPS library (Iwasawa et al. 2016) which is publicly available, because it helps users parallelize a tree part easily with no efforts in tuning the code for parallelization.We then name the code developed here the SCF-FDPS code (Hozumi et al. 2023).This code will enable us to simulate disk-halo systems much faster than ever for the fixed number of particles.
The actual procedure for calculating the accelerations of a h→d (r d,k ), a h→h (r h,k ), and a d→h (r h,k ) are as follows.First, Equation ( 8) shows that a h→d (r d,k ) is provided by where A h,nlm are those expansion coefficients obtained from halo particles which are given by In Equation ( 12), N halo is the number of halo particles, and m h,k is the mass of the kth halo particle.
Next, as is shown by Equation ( 10), a h→h (r h,k ) and a d→h (r h,k ) are added up to generate a h (r h,k ), and again Equation ( 8) indicates that a h (r h,k ) is calculated as where A h+d,nlm are those expansion coefficients evaluated from disk and halo particles which are written by Here, A d,nlm are the expansion coefficients that are calculated from disk particles as where N disk is the number of disk particles, and m d,k is the mass of the kth disk particle.In summary, the hybrid code is based on the following Hamiltonian of the system written by where p d,k = m d,k ṙd,k and p h,k = m h,k ṙh,k are the momentum of the kth disk particle and that of the kth halo particle, respectively.The first two terms are kinetic ones.The third term is the self-gravity of the disk that is calculated with a tree method based on the softened gravity of the Plummer type using a softening length, ε.Notice that this expression is used for convenience.
That is, it is incorrect in a strict sense, because we cannot exactly construct the Hamiltonian owing to the way of calculating the gravitational force in the tree algorithm.The fourth term is the self-gravity of the halo expressed by the expansions due to the basis functions introduced into the SCF method.The last term represents the disk-halo interactions that are also expanded with the basis functions.
We have postulated above that each particle in a diskhalo system has a different mass.In fact, the SCF-FDPS code supports individually different masses for constituent particles in such a system.However, the mass of a halo particle should be made identical to that of a disk particle so as to prevent the shot noise caused by the halo particles that pass through the disk.Consequently, in a practical sense, it is appropriate to assign an identical mass to each particle in a disk-halo system.Now that the left-hand sides of Equations ( 9) and ( 10) are obtained as explained above, we can simulate a diskhalo system with the code developed here.As a cautionary remark, we need a relatively large number of the angular expansion terms to capture the gravitational contribution from disk particles to halo particles properly, because the disk geometry deviates from a spheroidal shape to a considerable degree.

Parallelization
All simulations of the disk-halo system are run on a machine with an AMD Ryzen Threadripper 3990X 64core processor.Although all 64 cores of this processor share the main memory, we apply not the thread parallelization but the MPI parallelization to the SCF-FDPS code, and execute simulations on up to 64 processes.
The MPI parallelization of the SCF part in the SCF-FDPS code is straightforward: once the particles are equally distributed to each process, only one API call, MPI Allreduce(), is needed for the summation of those expansion coefficients which are calculated on each process.Regarding the SCF part, we do not need to move particles across MPI processes.On the other hand, the parallelization of the tree part is more formidable than that of the SCF part, because we have to take into consideration the spatial decomposition and exchange of both particles and tree information between domains.Fortunately, the FDPS library copes with this complexity so as to be hidden from the programmers.

Hardware-specific Tuning
The processor mentioned in Subsection 2.3 supports up to 256-bit width SIMD instructions known as AVX.This corresponds to four words of double-precision numbers, or eight words of single-precision numbers as the word length that can be processed at once.We conservatively adopt double-precision arithmetic in the SCF-FDPS code to establish a reliable calculation method.A further speedup by using the single-precision is the subject of future work.Thus, a speedup to a fourfold increase is expected if the SIMD instructions are available.
In general, compiler's vectorization is applied to the innermost loops.However, this is not always the optimal way to exploit SIMD instructions.In the SCF-FDPS code, the compute kernel of the SCF part consists of the outermost loop for the particle index k and several inner loops for the indices n, l, and m that accompany the basis functions.Some of the inner loops can hardly be vectorized because of their recurrence properties.Thus, the maximal SIMD instruction rate is achieved when the vectorization is applied to the particle index k.To this end, we write the compute kernel of the SCF part in the SCF-FDPS code by the intrinsic functions of AVX to manually vectorize the outermost loop.In this way, the positions and masses of four particles are fetched at once, and the values of the basis functions are computed in parallel.
For the tree part, the compute kernel takes a doubleloop form composed of an outer loop for the sink particles that feel the gravitational force and an inner loop for the source particles that attract others.Of the two loops, the SIMD conversion is applied to the outer loop through the intrinsic functions.The benefit of the outerloop parallelization is the reduction of memory access, because fetching the coordinates and mass of one source particle to accumulate the gravitational forces for four sink particles is more efficient than fetching four source particles to accumulate the gravitational forces to one sink particle.

Portability
As we have mentioned, the compute kernels of the SCF part and the tree part in the SCF-FDPS code are written using the intrinsic functions of AVX.However, that code can be compiled not only by the Intel compiler but also by GCC and LLVM Clang.At the same time, it can run on other x86 processors which support AVX/AVX2.Except in the SIMD intrinsics, the SCF-FDPS code is written in standard C++17 and MPI, so that it runs on the arbitrary number of processors as well as on the 64-core processors used here, regardless of whether processors are configured within a node or shared over multiple nodes.In fact, we have confirmed that the SCF-FDPS code can run properly using 512 cores on a Cray XC50 system.

Disk-halo Model
We use a disk-halo model to examine the performance of the SCF-FDPS code.The disk model is an exponential disk which is locally isothermal in the vertical direction.The volume density distribution, ρ d , is given by where R is the cylindrical radius, z is the vertical coordinate with respect to the mid-plane of the disk, M d is the disk mass, h is the radial scale length, and z 0 is the vertical scale length being set to be 0.2 h.The disk is truncated explicitly at R = 15 h in the radial direction.
On the other hand, the halo model is described by an NFW profile (Navarro et al. 1996(Navarro et al. , 1997)), whose density distribution, ρ h , is written by where r is the spherical radius, r s is the radial scale length, and ρ 0 is provided by In Equation ( 19), R h is the cut-off radius of the halo, M h is the halo mass within R h , and C NFW is the concentration parameter defined by As a basic model, we choose M h = 5 M d , R h = 30 h, and C NFW = 5 for the halo model.These choices lead to r s = 6.Concerning a specific performance test, the halo mass is changed with the other quantities being left intact.
We construct the equilibrium disk-halo model described above using a software tool called manycomponent galaxy initializer (MAGI) (Miki & Umemura 2018).Retrograde stars are introduced with the same way as that adopted by Zang & Hohl (1978) and the parameter η, which specifies the fraction of retrograde stars, is set to be 0.25.We choose the Toomre's Q parameter (Toomre 1964) to be 1.2 at R = h.In our simulations, the gravitational constant, G, and the units of mass and scale length are taken such that G = 1, M d = 1, and h = 1.
We find from Equation ( 18) that the NFW halo shows a cuspy density distribution like r −1 down to the center.In accordance with this characteristic, we adopt Hernquist-Ostriker's basis set (Hernquist & Ostriker 1992).Because the lowest order members of this basis set are based on the Hernquist model (Hernquist 1990) whose density behaves like an r −1 cusp at small radii, that basis set is suitable to represent the NFW halo with a small number of expansion terms.The exact functional forms of the basis set are shown in the Appendix A.

Convergence Tests
For the SCF part in the SCF-FDPS code, we need to specify n max and l max .We determine n max by comparing the radial acceleration calculated analytically with In this plot, ar is the exact acceleration of the NFW halo, while ar,exp is the radial acceleration derived from the expanded potential using Hernquist-Ostriker's basis functions with the scale length of a = 6.The three curves show the effect of the maximum number of the radial expansion terms, nmax, on the resulting radial acceleration with the maximum number of the angular expansion terms, lmax = 0, being retained.Note that the scaling of the abscissa is changed at r = 1 from the left to the right panel, whereby the ordinate is also re-scaled accordingly.
that derived from the expanded potential of the spherically symmetric NFW halo shown in Equation ( 18), which is realized by retaining l max = 0.In Figure 1, the radial acceleration obtained from the expanded potential for n max = 10, 16, and 20 is compared with the exact one.The scale length of the basis functions, a, is set to be a = 6.This figure indicates that the radial acceleration obtained with n max = 10 shows some relatively large deviation from the exact one, while the radial acceleration with n max = 16 is almost comparable to that with n max = 20.From this consideration, we adopt n max = 16.On the other hand, there is no way to estimate l max for a spherical halo model.To search for an appropriate value of l max , we carry out convergence tests in which l max = 12, 16, and 20 are examined with n max = 16 being retained.We found that the disk-halo model constructed in Subsection 3.1 forms a bar via the bar instability (see Figure 7).Then, we use the time evolution of the bar amplitude as a measure to determine l max .
Regarding the parameters related to the tree part, we use θ = 0.3 and 0.5 as an opening angle, and ε = 0.006 as a softening length of the Plummer type.Gravitational forces are expanded up to quadrupole order.
We assign N disk = 6,400,000 to the disk, and N halo = 32,000,000 to the halo.A time-centered leapfrog algo-rithm (Press et al. 1986) is employed with a fixed time step of ∆t = 0.1.
For comparison, the same disk-halo model is simulated with a tree code on which the FDPS library is implemented.Hereafter, we call this code the FDPS tree code, which is also applied to the SIMD instructions as has been done to the SCF-FDPS code.All the tree parameters are the same as those employed for the convergence tests.
In Figure 2, we show the time evolution of the bar amplitude for θ = 0.3 and 0.5 in each of which l max = 12, 16, and 20 are employed, while n max = 16 is held fixed.Furthermore, the results with the FDPS tree code are also plotted.On the basis of these results, in particular, paying attention to the behavior of the exponentially growing phase of the bar amplitude from t = 0 to t ∼ 300, we select l max = 16.

Performance Tests
We carry out performance tests to examine how fast the SCF-FDPS code is as compared to the FDPS tree code.We measure the cpu time in the cases of θ = 0.3 and 0.5.For each value of θ, the Plummer type softening is used with ε = 0.006, and forces are expanded up to quadrupole order.Again, we use a time-centered leapfrog method (Press et al. 1986) with a fixed time step of ∆t = 0.1.
In Figure 3, the cpu time using 64 cores per step is plotted as a function of the total number of particles, N = N disk + N halo , with the ratio of N halo /N disk = 5 being fixed.We can see that the cpu time is nearly proportional to N for both codes, but that the SCF-FDPS code is at least three times faster than the FDPS tree code for θ = 0.5, while the former is about five to six times faster than the latter for θ = 0.3.As N disk increases, the ratio of the cpu time measured with the FDPS tree code to that with the SCF-FDPS code decreases for both values of θ.For example, the ratio is 3.3 for N disk = 640,000, while it is 3.1 for N disk = 20,480,000 when θ = 0.5 is used.If θ = 0.3 is used, the ratio decreases from 5.9 for N disk = 640,000 to 4.8 for N disk = 20,480,000.As Figure 4 demonstrates, the fraction of the cpu time exhausted by the tree part in the SCF-FDPS code increases as N disk increases, while the cpu time consumed by the SCF part is basically proportional to N halo .As a result, that ratio of the cpu time decreases with increasing N disk .
Next, in Figure 5, the cpu time per step is plotted as a function of the number of cores, N core , used on the computer with N disk = 6,400,000 and N halo = 32,000,000 being unchanged.Irrespective of the value of θ, the cpu time scales as ∼ N core −0.8 , which means that the cpu For each value of θ, the softening length is ε = 0.006, and the maximum number of the radial expansion terms is nmax = 16.As a reference, the corresponding tree code simulation with ε = 0.006 is also plotted for each value of θ.where N disk and N halo are, respectively, the number of disk particles and that of halo particles with the ratio of N halo /N disk = 5.The left panel shows the cpu time on a linear scale, while the right panel stands for it on a logarithmic scale.The red symbols represent the results for θ = 0.5, while the blue symbols denote those for θ = 0.3.The circles display the results obtained using the SCF-FDPS code, while the triangles exhibit those using a tree code on which the FDPS library is implemented.The solid and dashed lines with red and blue colors provide power-law fits for corresponding data points.time is almost inversely proportional to N core for both codes.However, the SCF-FDPS code is about 3.6 times faster than the FDPS tree code for θ = 0.5, while the former is approximately 6.4 times faster than the latter for θ = 0.3.In the right panel of Figure 5, we can see that as N core increases, the decrease rate in the cpu time becomes smaller.This is because the cpu clock is made lowered as N core increases.Last, in Figure 6, the cpu time using 64 cores per step is plotted as a function of the fraction of disk particles, f = N disk /N , where N = N disk + N halo , and we use f = 1/16, 1/12, 1/10, 1/8, and 1/6.In this performance test, we change the ratio of N halo /N disk , while making the total number of particles unchanged as N = 30,720,000.As a result, the mass ratio of M halo /M disk is not constant but changes identically to the ratio of N halo /N disk .The other parameters such as R h and C NFW are left unchanged.After all, each halo model specified by the value of f is constructed by adjusting the value of ρ 0 in Equation ( 19) to the given M halo .Figure 6 indicates how the fraction of the tree part in the SCF-FDPS code affects the cpu time.As a reference, we plot the results using the FDPS tree code.For these tree-code simulations, all particles are obviously calculated with a tree algorithm, so that the cpu time may be expected to be independent of f .In reality, the cpu time depends weakly on f , and it is proportional to f 0.043 for θ = 0.3, and to f 0.031 for θ = 0.5.On the other hand, the cpu time increases with f if the SCF-FDPS code is used for both values of θ.However, for θ = 0.5, the SCF-FDPS code is about 4.5 times faster at f = 1/16 and about 3.1 times faster at f = 1/6 than the FDPS tree code, while for θ = 0.3, the former is about an order of magnitude faster at f = 1/16 and about 5.1 times faster at f = 1/6 than the latter.

Simulation Results
We carry out simulations of the disk-halo system described by Equations ( 17) and ( 18) to examine to what degree the simulation results obtained with the SCF-FDPS code are similar to those with the FDPS tree code.The simulation details are taken over from those adopted for the performance tests.For each value of θ, the energy was conserved to better than 0.028% using the SCF-FDPS code, while it was conserved to better than 0.037% using the FDPS tree code.Figure 7 shows the time evolution of the surface densities of the disk projected on to the xy-, yz-, and zx-planes for θ = 0.3 and 0.5.We find from this figure that the time evolution of the disk surface densities obtained with the SCF-FDPS code is in excellent agreement with that using the FDPS tree code for both values of θ at least until t = 500.At later times, owing to the difference in the bar pattern speed from simulation to simulation, the bar phase differs accordingly.Even though a difference in the bar pattern speed is slight at the bar formation epoch, it accumulates with time, so that the difference in the bar phase becomes larger and larger as time progresses.At any rate, the time evolution of the disk is satisfactorily similar between the two codes.

DISCUSSION
We have shown in Figure 6 that the cpu time taken with the SCF-FDPS code increases as the fraction of N disk increases.In that figure, the mass of each halo is assigned to that included within r = 30.However, if the optical edge of the disk is about 15 kpc, this radius corresponds to r = 6.25 because the disk scale length is estimated to be 2.4 kpc (Bland-Hawthorn & Gerhard 2016).In this case, Figure 8 indicates that the halo mass within r = 6.25 is at most about 1.7 times the disk mass even for the largest ratio of M h /M d = 15.Since the halo mass within the optical edge of the disk is at least comparable to the disk mass, we may be allowed to regard f = 1/16 in Figure 6 as a reference value of f .Thus, if we are based on the results obtained from the simulations with the value of f = 1/16 for θ = 0.3 and θ = 0.5, it follows that in a practical sense, the SCF-FDPS code is about an order of magnitude faster than the FDPS tree code for θ = 0.3, and that the former is about 4.5 times faster than the latter for θ = 0.5.
We notice from Figure 7(b) that in the simulation for θ = 0.5 executed with the FDPS tree code, the disk begins to drift upward along the z axis at t ∼ 300, which continues to the end of the run, while in the corresponding simulation with the SCF-FDPS code, no upward The number of disk particles is N disk = 6,400,000, while that of halo particles is N halo = 32,000,000.The meanings of the symbols and those of the solid and dashed lines with red and blue colors are the same as those in Figure 3. .The total number of particles is N = 30,720,000, and N disk and N halo are assigned according to the value of f .In this figure, the results with f = 1/16, 1/12, 1/10, 1/8, and 1/6 are plotted.The meanings of the symbols and those of the solid and dashed lines with red and blue colors are the same as those in Figure 3.For each value of θ, the top panels show the results with the SCF-FDPS code, while the bottom panels exhibit those with the tree code into which the FDPS library is implemented.The softening length is set to be ε = 0.006 for all simulations.Regarding the SCF-FDPS simulations, nmax = 16 and lmax = 16 are used.Note that the drift motion along the vertically upward direction is seen from t = 500 to t = 800 for the θ = 0.5 simulation with the tree code.
. Cumulative mass of each halo used in Figure 6 as a function of radius.The halo masses are normalized by the disk mass, and M h = 5, 7, 9, 11, and 15 correspond to f = 1/6, 1/8, 1/12, and 1/16, respectively, where f is the fraction of disk particles as in Figure 6.The vertical dashed line indicates the radius of 15 kpc when the radial scale length of the disk is assumed to be h = 2.4 kpc.
drift occurs during the run.As found from Figure 7(a), such an upward drift does not arise in the simulation for θ = 0.3 with both codes.Thus, in general, treecode simulations of a disk-halo system do not necessarily lead to a vertical drift motion of the disk.Indeed, in general, linear momentum is not conserved intrinsically in an exact sense for numerical codes based on expansion techniques such as tree and SCF codes.However, our results may suggest that owing to the small fraction of the tree-based calculation, the SCF-FDPS code can easily conserve the linear momentum of each component better than the FDPS tree code to some satisfactory degree.
In our test simulations, we have adopted the softened gravity due to the softening of the Plummer type because it is easily implemented in the SCF-FDPS code.However, in some situations, spline softening (Hernquist & Katz 1989) may be useful because the force law turns into the pure Newton's law of universal gravitation at inter-particle distances larger than twice the softening length.Then, we have also implemented the spline softening in the SCF-FDPS code.
For the SCF part in the SCF-FDPS code, we have used Hernquist-Ostriker's basis set on the ground that it well-describes a cuspy density distribution which the halo model chosen here shows.In addition, we have also implemented Clutton-Brock's basis set (Clutton-Brock 1973).This is suitable for cored density distributions, because the lowest order members of the basis functions are based on the Plummer model (Plummer 1911).Therefore, the SCF-FDPS code can accommodate a wide variety of halo profiles.
In the SCF-FDPS code, disk particles are treated with a tree algorithm, so that the gas component can easily be included by implementing an SPH method (Gingold & Monaghan 1977;Lucy 1977), as was done by Hernquist & Katz (1989) who named the code TREESPH.Fortunately, the FDPS library supports the implementation of an SPH method by supplying its sample code.Furthermore, an individual time step method (e.g., McMillan 1986;Hernquist & Katz 1989;Makino 1991) can also be set in the SCF-FDPS code, which enables us, for example, to properly trace particles moving closely around a super-massive black hole residing at the disk center.Accordingly, we will be able to cope with various problems involved in disk galaxies by equipping additional functions such as SPH and individual time step methods with the current SCF-FDPS code.

CONCLUSIONS
We have developed a fast N -body code for simulating disk-halo systems by incorporating an SCF code into a tree code.In particular, the success in achieving the high performance consists in reducing the timeconsuming tree-dependent force calculation only to the self-gravity of disk particles by applying an SCF method to the calculation of the gravitational forces between disk and halo particles as well as that of the self-gravity of halo particles.In addition, the SCF-FDPS code has the characteristics that the cpu time is almost proportional to the total number of particles for the fixed number of cores and almost inversely proportional to the number of cores equipped on a computer for the fixed number of particles.As a result, for a disk-halo system, the SCF-FDPS code developed here is at minimum about three times faster and in some case up to an order of magnitude faster, depending on the opening angle, θ, used in the tree method, and on the fraction of tree particles, f = N disk /(N disk + N halo ), than a highly tuned tree code like the FDPS tree code.Of course, the SCF-FDPS code leads to the time evolution of a disk-halo system similarly to that with the FDPS tree code.
We have implemented Clutton-Brock's basis set suitable for cored density distributions as well as Hernquist-Ostriker's basis set appropriate for cuspy density distributions on the SCF-FDPS code, so that it is capable of coping with a wide variety of halo profiles.Furthermore, because the spline softening as well as the Plummer softening have been implemented on that code, it will be able to be applied to the investigation of extensive dynamical problems of disk-halo systems.
We can easily incorporate both SPH and individual time step methods into the tree part in the SCF-FDPS code.Therefore, the SCF-FDPS code will be able to be extended so that we can tackle central issues of diskgalaxy simulations like the evolution of a disk galaxy harboring a central super-massive black hole including a gas component with a huge number of particles by utilizing its high performance.
We are grateful to Dr. Yohei Miki for his advice about the usage of MAGI.SH thanks Prof. Lars Hernquist for his comments on the manuscript.
This work was supported by JSPS KAKENHI Grant Number JP21K03626.Some of the SCF and tree-code simulations were carried out on the Cray XC50 system at the Center for Computational Astrophysics at the National Astronomical Observatory of Japan.

Figure 1 .
Figure1.Relative radial-acceleration error of the spherically symmetric NFW halo model as a function of radius.In this plot, ar is the exact acceleration of the NFW halo, while ar,exp is the radial acceleration derived from the expanded potential using Hernquist-Ostriker's basis functions with the scale length of a = 6.The three curves show the effect of the maximum number of the radial expansion terms, nmax, on the resulting radial acceleration with the maximum number of the angular expansion terms, lmax = 0, being retained.Note that the scaling of the abscissa is changed at r = 1 from the left to the right panel, whereby the ordinate is also re-scaled accordingly.

Figure 2 .
Figure2.Time evolution of the bar amplitude for lmax = 12, 16, and 20 obtained using the SCF-FDPS code with the opening angle of θ = 0.3 (a), and with that of θ = 0.5 (b).For each value of θ, the softening length is ε = 0.006, and the maximum number of the radial expansion terms is nmax = 16.As a reference, the corresponding tree code simulation with ε = 0.006 is also plotted for each value of θ.

Figure 3 .
Figure3.Measured cpu time using 64 cores per step in seconds as a function of the total number of particles, N = N disk +N halo , where N disk and N halo are, respectively, the number of disk particles and that of halo particles with the ratio of N halo /N disk = 5.The left panel shows the cpu time on a linear scale, while the right panel stands for it on a logarithmic scale.The red symbols represent the results for θ = 0.5, while the blue symbols denote those for θ = 0.3.The circles display the results obtained using the SCF-FDPS code, while the triangles exhibit those using a tree code on which the FDPS library is implemented.The solid and dashed lines with red and blue colors provide power-law fits for corresponding data points.

Figure 4 .
Figure 4. Fraction of the cpu time occupied by the tree part in the SCF-FDPS code as a function of N disk , which is calculated from the simulations shown in Figure 3.

Figure 5 .
Figure5.Measured cpu time per step in seconds as a function of the number of cores, Ncore.The number of disk particles is N disk = 6,400,000, while that of halo particles is N halo = 32,000,000.The meanings of the symbols and those of the solid and dashed lines with red and blue colors are the same as those in Figure3.

Figure 6 .
Figure6.Measured cpu time using 64 cores in seconds per step as a function of the fraction of disk particles, f = N disk /(N disk + N halo ).The total number of particles is N = 30,720,000, and N disk and N halo are assigned according to the value of f .In this figure, the results with f = 1/16, 1/12, 1/10, 1/8, and 1/6 are plotted.The meanings of the symbols and those of the solid and dashed lines with red and blue colors are the same as those in Figure3.

Figure 7 .
Figure 7. Time evolution of the surface densities of the disk projected on to the xy-, yz-, and zx-planes for the opening angle of θ = 0.3 (a), and that of θ = 0.5 (b).For each value of θ, the top panels show the results with the SCF-FDPS code, while the bottom panels exhibit those with the tree code into which the FDPS library is implemented.The softening length is set to be ε = 0.006 for all simulations.Regarding the SCF-FDPS simulations, nmax = 16 and lmax = 16 are used.Note that the drift motion along the vertically upward direction is seen from t = 500 to t = 800 for the θ = 0.5 simulation with the tree code.