OpenPFC: an open-source framework for high performance 3D phase field crystal simulations

We present OpenPFC (https://github.com/VTT-ProperTune/OpenPFC), a state-of-the-art phase field crystal (PFC) simulation platform designed to scale to massive high-performance computing environments. OpenPFC can efficiently handle large-scale simulations, as demonstrated by our strong and weak scaling analyses up to an 8192³ grid on 65 536 cores. Our results indicate that meaningful PFC simulations can be conducted on grids of size 2048³ or even 4096³, provided there is a sufficient number of cores and ample disk storage available. In addition, we introduce an efficient implementation of moving boundary conditions that eliminates the need for copying field values between MPI processes or adding an advection term to the evolution equations. This scheme enhances the computational efficiency of simulating large-scale processes such as long directional solidification. To showcase the robustness of OpenPFC, we apply it to simulations of rapid solidification in the regime of metal additive manufacturing, using a recently developed quantitative solid-liquid-vapor PFC model parametrized for pure tungsten (body-centered cubic) and aluminum (face-centered cubic).


Introduction
Phase field crystal (PFC) models belong to the class of phase field models capable of stabilizing order parameters in both constant and periodic states; the periodic states represent the coarse-grained atomic structure of crystalline phases over diffusive time scales [1][2][3]. PFC simulations enable the study of solidification processes as well as elastic-plastic material responses, encompassing a wide range of phenomena. These phenomena include the formation and coevolution of microstructural defects such as dislocations and stacking faults [4], voids [5][6][7], defect formation in epitaxial growth [8], displacive phase transitions [9], and electromigration [10]. The hybrid continuum-atomic nature of PFC models makes them particularly useful for investigating the influence of atomic-scale structures on meso-scale properties that control microstructure evolution. These properties range from interface energy anisotropy [11, 12] to sub-lattice ordering in order-disorder transformations [13] and atomic structures in eutectic growth [14]. More recently, PFC models have achieved increased quantitative accuracy in reproducing realistic phase diagrams in pure materials [6, 7, 15] and alloys [16], dislocation properties [4], and creep behavior [17]. Some of these advances have allowed explicit quantitative comparisons with experimental results [7, 16]. With ongoing improvements to the PFC theory, it is poised to become a valuable tool for investigating practical materials science problems on nano-to-micrometer length scales.
Research using PFC modeling is experiencing significant advances in both theory and applications within materials science. Concurrently, the computational resources available for high-performance computing (HPC) are expanding. Despite this progress, however, open-source PFC codes accessible to researchers remain scarce. One notable exception is SymPhas [18], which offers spectral solvers for phase field and PFC problems but is, at the time of writing this paper, limited to OpenMP-based shared-memory parallelization, restricting computations to a single machine or node. Another noteworthy project is MEUMAPPS, a spectral solver designed for phase field modeling problems [19]. The majority of PFC projects still rely on in-house solvers, which demand additional development effort from new PFC users and impede the widespread accessibility and adoption of high-performance PFC modeling by a broader global user base.
In this work, we introduce OpenPFC, a novel open-source PFC solver that is available to the wider research community through GitHub. Our primary objective is to offer a cutting-edge HPC platform for 3D PFC simulations, parallelized with the message passing interface (MPI) and, in the future, with graphics processing units (GPUs), aiming to approach exascale computations. OpenPFC is designed to let users concentrate on PFC model development while minimizing their involvement in technical aspects such as parallelization. We present a comprehensive evaluation of the OpenPFC platform, including analyses of weak and strong scaling for MPI-based parallelization. Furthermore, we investigate the practical implications of these scaling results for large-scale 3D PFC simulations in modern HPC environments, examining the achievable geometry sizes and time scales. Additionally, we introduce an innovative approach for implementing moving boundary conditions, enabling the simulation of long directional rapid solidification. As a case study, we demonstrate the capabilities of OpenPFC by considering the rapid solidification of tungsten and aluminum. We employ a recent class of single-component PFC models specifically developed to simulate vapor-liquid-solid interactions in solidification [5, 7]. It should be noted that while our focus lies on these particular models, the OpenPFC platform remains adaptable to any PFC-type model.

Vapor-liquid-solid phase field crystal model
We present here a brief overview of the vapor-liquid-solid PFC model (V-PFC for short). The reader is referred to [1][2][3] for further details about the PFC paradigm itself. The field of interest to be evolved in the V-PFC model is Ψ(r) := (ρ(r) − ρ̄)/ρ̄, where ρ(r) is the atomic density and ρ̄ is a constant reference density, usually taken as the density around which the system free energy density is expanded. The basic idea in constructing a PFC model is to build a free energy functional that contains local minima corresponding to the ordered (crystalline) and disordered phases of interest. The standard PFC free energy functional contains an expanded ideal free energy and a convolution of Ψ with a two-point correlation kernel, C^(2), which contains low- and high-frequency peaks (in k-space) that control, respectively, the bulk compressibility of the material and the crystalline ordering of the solid. Such formulations have, to date, been used to stabilize a single disordered phase (typically considered a 'liquid') and an ordered phase, the latter characterized by a density field containing spatial peaks corresponding to the time-averaged positions of individual atoms in some crystalline lattice.
More recently, a PFC formulation has been introduced [5] to allow for a second disordered phase, effectively enabling coexistence and transformations between vapor, liquid, and solid. A minimal path for achieving this is the addition of a set of excess terms to the free energy of the form Ψ Ψ_mf^n, where n is an integer and Ψ_mf := (χ * Ψ)(r), with * denoting the convolution operator and χ a long-range kernel. Here, χ acts as a high-wavelength filter: it decays for wavelengths shorter than that of the Bragg peak contained in the standard two-point correlation function C^(2), whose main role is to control the crystallography of the solid phases. In this work, we take χ to have the simple form of a Gaussian, which in Fourier space becomes χ̂(k) = exp(−k²/(2λ²)), where k := |k| is the Fourier wave vector magnitude and χ̂ denotes the Fourier transform of χ. The convolution Ψ_mf = χ * Ψ effectively creates a smoothed density field from Ψ(r). This convolution couples the density to long wavelengths (small k) and thereby controls the position of the liquid and vapor free energy wells relative to the solid. It is noted that by also including high-k peaks in χ, the free energy of the solid can be manipulated as well; however, this will not be considered here. Terms based on Ψ_mf, along with the long wavelength contributions of C^(2), also control the energy scale of density jumps across interfaces, which can be tuned through the coefficient λ, an inverse length scale on the order of the inverse density interface width. In terms of these definitions, the free energy functional for the V-PFC model used in this work is given by equation (1), where R is a length scale on the order of the lattice constant (discussed further below), T is the temperature, the prefactors {p_2, p_3, p_4} describe the deviation from the ideal free energy expansion, which can be seen as arising from higher-order correlations [20], and the coefficients {q_2, q_3, q_4} are prefactors of the Ψ_mf terms, discussed further in section 2.1.1. It is noted that more complex forms of χ in the Ψ_mf² term can be used, and each order can also have a different form of χ; this topic is left to a future publication.
The correlation function C^(2) appearing in our model's free energy is a modification of that commonly used in the so-called 'XPFC' approach to PFC modeling [21, 22]. Namely, the Fourier-transformed correlation Ĉ^(2)(k) is written as in equation (2), where B_x controls the energy of the solid phase, T_0 is a fixed reference temperature, and P_Lattice(k) contains the structural information relating to the periodic (solid) phase, with the subscript 'Lattice' denoting the crystal structure whose information is contained in the model.
Here, 'Lattice' denotes either body-centered cubic (BCC) or face-centered cubic (FCC) lattices. For BCC tungsten, the specific form of P_BCC(k) is given by equation (3), where the shape of the correlation's peak is controlled through the parameters α and r_tol, which affect the excess energy of interfaces separating bulk solid from other phases (interface energies), but not the thermodynamic equilibrium properties of the system. In equation (3), the chosen polynomial form of the function for k > 1 (i.e. beyond its peak value) ensures a stronger cut-off for sub-atomic wavelength perturbations than the usual XPFC correlation function. The form of the correlation kernel for FCC aluminum is largely the same, but with an additional peak and no polynomial decay in P(k), yielding equation (4). It is noted here that the correlation functions C^(2) in equation (2) do not have an explicit k = 0 peak, as one is already contained in the Ψ_mf Ψ terms in equation (1). In real space, the resulting PFC density evolution is governed by the dynamical equation (5), ∂Ψ/∂t = Γ ∇² (δF/δΨ), where Γ is related to the atomic mobility, analogous to other phase-field-type models. The simulations reported in this work are carried out in units of R, which here is taken to be the nearest neighbor distance. In terms of the lattice constant a_Lattice, R = a_BCC/√3 for BCC lattices and R = a_FCC/√2 for FCC. The edge length dx of a voxel in our numerical meshes is taken as dx = a_Lattice/8. We can estimate the PFC time scale τ_PFC by matching the PFC simulation vacancy diffusion coefficient D_v^PFC to the experimental/first-principles SI-unit vacancy diffusion coefficient D_v^Exp, following [1]. The PFC vacancy diffusion coefficient is D_v^PFC = A a²_Lattice/τ_PFC, where the prefactor A ≈ 1-1.5 depends on the PFC model variant and parameters; the prefactor A can be quantitatively estimated by simulating the gradual smearing of a single atom peak or by performing a linear perturbation analysis on the PFC equations [1]. For BCC tungsten, the experimental vacancy diffusion coefficient is D_v^Exp ≈ 10⁻¹¹ m² s⁻¹, extrapolated to 3300 K for W [23]. Assuming D_v^PFC ≈ 1.5 a²/τ_PFC [4], and taking the tungsten lattice parameter a_BCC ≈ 0.316 nm [24], we get τ_PFC^W ≈ 1.5 a²_BCC/D_v^W ≈ 78 ns. Similarly, for FCC aluminum, D_v^Exp ≈ 10⁻¹² m² s⁻¹ [25] and a_FCC ≈ 0.404 nm [26], resulting in τ_PFC^Al ≈ 74 ns. These and the other model parameters, discussed below, are summarized in table 1.

Phase diagram of V-PFC model.
The bulk equilibrium properties of each phase (vapor, liquid, and solid) are controlled by the parameters {B_x, p_2, p_3, p_4, q_2, q_3, q_4}, which are fitted to approximate experimental phase diagram data. While in general both {p_2, p_3, p_4} and {q_2, q_3, q_4} can be T-dependent, for simplicity we make {p_2, p_3, p_4} constant in T, and let only {q_2, q_3, q_4} depend quadratically on T, as given in equation (6). Phase diagrams for aluminum and tungsten are shown in figure 1, using the fitting parameters shown in table 1. For tungsten, there is a larger discrepancy between the experimental data (black scatter points) and the PFC data (solid line); the vapor coexistence region matches well only for the higher-density segment of the experimental phase line, and the solidus and liquidus phase lines close in too quickly. However, for the temperatures considered here (approximately 2500-3500 K), the solidus and liquidus lines are in reasonable agreement with the experimental data. It is noted that the apparent convergence of the solidus and liquidus lines to a triple point is fictitious, an artifact of the expanded nature of the model. This can be remedied by adding higher-k contributions to χ that would allow better control of the solid.

Exponential time integration in Fourier space
Dynamical simulations are performed using an exponential spectral method, which is described as follows. Noting the long-wavelength form of Ψ_mf = χ * Ψ, we split the right-hand side of the PFC equation (5) into linear and nonlinear parts (equation (7)), where L is a linear operator applied to Ψ (equation (8)) and N contains the non-linear terms (equation (9)). In Fourier space, equation (7) becomes ∂Ψ̂/∂t = L̂ Ψ̂ + N̂ (equation (10)), where N̂ is the Fourier transform of the full nonlinear term. We discretize equation (10) in time, taking the Ψ̂(k) acted on by L̂ to be evaluated at time step t + 1, while N̂ is evaluated at time step t. This allows a robust implementation based on the exponential integrator method, useful for stiff differential equations. Numerical simulations may become unstable for certain choices of model parameters, particularly when the linear operator's value approaches zero for some values of k (for further discussion, see [30]). This potential issue can be avoided by modifying the way the equation is split between linear and nonlinear terms, introducing a stability parameter 0 ⩽ P_stab ⩽ 1. This parameter can in principle be a function, but in this case it is a dimensionless constant, chosen so that the linear term is guaranteed to be greater than zero for all k > 0 (equations (11) and (12)). In principle, this addition should not change the physics of the system, and its magnitude could be found by linear perturbation analysis; here a suitable value is found by simple trial and error, such that the system is stable for reasonably large values of ∆t; its values are indicated in table 1. The exponential time integration update step can then be written as in equation (13). One can verify that by letting ∆t → 0, equation (13) approaches the simpler exponential Euler time integration.
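For reference, a generic first-order exponential time differencing step of this type is sketched below. This is a sketch of the scheme's structure only, not a verbatim reproduction of equation (13): the exact stabilized operators involving P_stab are those defined by equations (11) and (12).

```latex
% Generic first-order exponential integrator step in Fourier space;
% \hat{L} and \hat{N} denote the (stabilized) linear operator and the
% Fourier transform of the nonlinear term evaluated at the current step.
\hat{\Psi}^{\,t+1}(\mathbf{k})
  = e^{\hat{L}\,\Delta t}\,\hat{\Psi}^{\,t}(\mathbf{k})
  + \frac{e^{\hat{L}\,\Delta t} - 1}{\hat{L}}\,\hat{N}^{\,t}(\mathbf{k})
```

Since (e^(L̂∆t) − 1)/L̂ → ∆t as L̂ → 0, the prefactor of N̂ remains well defined for small |L̂|, which is precisely the regime where the P_stab modification matters.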

Adding a thermal gradient
To perform directional rapid solidification simulations, we impose a frozen temperature approximation, T(x, t) = T_const + T_var(x, t), where T_const is a reference temperature and T_var(x, t) = G (x − x_0 − V_pull t) is a spatially varying contribution. Here G is the thermal gradient, V_pull is the pulling speed of the thermal gradient, and x_0 is the position where T(x_0, 0) = T_const, typically set to the initial solidification front location. Because the spatially varying T introduces new non-linear contributions to the PFC evolution equation, the equation is reorganized by the splitting T = T_const + T_var so that, in the exponential time integration presented in section 2.2, terms containing T_const remain in the linear operator L, while terms containing T_var are moved to the nonlinear operator N. Specifically, the splitting T = T_const + T_var is applied to C^(2) in equation (2) and to q_2, q_3, and q_4 in equation (6). For the correlation function C^(2), this linear-nonlinear splitting separates the kernel into a part C^(2)_L containing only T_const and a part C^(2)_N proportional to T_var, where P(r) is the inverse Fourier transform of P(k) and * denotes the convolution operator.
Similarly, the splitting T = T_const + T_var gives rise to a re-grouping of the Landau bulk energy prefactor q_2, which sits inside the linear operator L in equation (8). This is expressed as q_2 = q_2,L + q_2,N, where q_2,N := q_21 T_var/T_0 carries the spatially varying contribution and q_2,L collects the remaining T_const-dependent terms. Expanding on the methodology of section 2.2, the linear operator used for the exponential spectral method thus retains only the T_const-dependent terms, while all T_var-dependent terms move to the non-linear operator. For the first term on the right-hand side of the non-linear operator (the non-linear correlation kernel contribution, −C^(2)_N * P(r) * Ψ), one needs to be careful to perform the Fourier transforms, convolutions, and multiplications in the correct order. First, we multiply P̂(k) with Ψ̂. Then we take an inverse Fourier transform, so that the real-space-defined, T_var(x, t)-dependent prefactor of the correlation kernel can be applied by direct multiplication, after which the term is handled with explicit time integration in real space. In terms of the split correlation kernels defined above, the exponential time integration operators L_s and N_s in equations (11) and (12) are modified for a moving thermal gradient accordingly: the linear operator becomes the T_const-dependent part (in which 2λ² enters through the mean field smoothing kernel χ), while the spatially varying temperature leads to a modified non-linear operator whose correlation contribution C^(2)_N * P(r) * Ψ is calculated via equation (20).
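As a minimal illustration of this ordering (with hypothetical helper and variable names, not the OpenPFC API), the T_var-dependent correlation contribution can be assembled as follows; the FFT is passed in as a callable so that any backend, e.g. a heFFTe-based wrapper, can be used.

```cpp
#include <complex>
#include <cstddef>
#include <functional>
#include <vector>

// Sketch: evaluate -c2n_prefactor(T_var) * IFFT[ P(k) * Psi(k) ] in the
// order described in the text. All names here are illustrative.
void correlation_term_Tvar(
    const std::vector<std::complex<double>>& psi_hat,   // FFT of density field
    const std::vector<std::complex<double>>& P_hat,     // precomputed kernel P(k)
    const std::vector<double>& Tvar,                    // T_var(x) on the real grid
    const std::function<void(const std::vector<std::complex<double>>&,
                             std::vector<double>&)>& ifft,
    const std::function<double(double)>& c2n_prefactor, // real-space prefactor of C2N
    std::vector<double>& out)                           // resulting real-space term
{
  std::vector<std::complex<double>> tmp_hat(psi_hat.size());
  for (std::size_t i = 0; i < psi_hat.size(); ++i)
    tmp_hat[i] = P_hat[i] * psi_hat[i];        // step 1: multiply by P(k) in k-space
  std::vector<double> tmp(Tvar.size());
  ifft(tmp_hat, tmp);                          // step 2: inverse FFT to real space
  out.resize(tmp.size());
  for (std::size_t i = 0; i < tmp.size(); ++i)
    out[i] = -c2n_prefactor(Tvar[i]) * tmp[i]; // step 3: real-space prefactor
}
```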

PFC pseudocode
The pseudocode for simulating the PFC model presented above (sections 2.1-2.3) is shown below, where looping through arrays (in real or Fourier space) is denoted with [:]. The current algorithm contains two forward FFT operations and three backward (inverse) FFTs.
It should be noted that for the auxiliary fields there is a trade-off between computation and memory: re-computing certain auxiliary fields locally (inside for-loops) reduces memory consumption, but leads to re-computations that could be avoided by storing them in memory. The current algorithm is designed for CPU-based parallelization. Therefore, before looping through time steps, we pre-compute the long wavelength operator χ, the linear operator L, and part of the non-linear operator N. Reducing the number of auxiliary fields can become beneficial especially for GPU-based calculations, where the bottleneck currently tends to be memory consumption.
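A minimal sketch of one time step consistent with the above description is given below (two forward and three inverse FFTs per step). The function and variable names are illustrative rather than the actual OpenPFC identifiers, and opL, opN stand for the precomputed exponential integrator factors of equation (13).

```cpp
// Pseudocode for one time step of the exponential spectral scheme.
// Precomputed before the time loop: chi_hat (long-wavelength kernel),
// P_hat (correlation kernel), and the integrator factors opL, opN.
for (int step = 0; step < n_steps; ++step) {
  fft(psi, psi_hat);                                    // forward FFT #1
  for (size_t i = 0; i < n_k; ++i) {
    psi_mf_hat[i] = chi_hat[i] * psi_hat[i];            // mean-field smoothing
    corr_hat[i]   = P_hat[i]   * psi_hat[i];            // correlation kernel
  }
  ifft(psi_mf_hat, psi_mf);                             // inverse FFT #1
  ifft(corr_hat, corr);                                 // inverse FFT #2
  for (size_t i = 0; i < n_real; ++i)                   // nonlinear term in real
    nonlin[i] = N(psi[i], psi_mf[i], corr[i], Tvar[i]); // space (section 2.3)
  fft(nonlin, nonlin_hat);                              // forward FFT #2
  for (size_t i = 0; i < n_k; ++i)                      // exponential update,
    psi_hat[i] = opL[i] * psi_hat[i] + opN[i] * nonlin_hat[i]; // equation (13)
  ifft(psi_hat, psi);                                   // inverse FFT #3
}
```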

Software architecture in OpenPFC
The numerical simulation scheme of the directional solidification PFC model presented in sections 2.1-2.3, algorithmically depicted in section 2.4, is implemented as an open source project, OpenPFC. To give a high-level understanding of the OpenPFC framework, written in C++, we briefly describe its main classes; a more detailed tutorial and descriptions are found on GitHub [31]. The framework consists of building blocks that can be stacked on top of each other from the bottom up; the class structure flowchart is given in figure 2 (for brevity, only the main classes are shown and discussed). Starting from the bottom, we first define the calculation domain (class World), decompose it into smaller parts (class Decomposition), distribute these between MPI processes, and set up the FFT computation (class FFT). The actual physics is introduced in class Model, which combines the FFT with fields that can be real or complex. RealField and ComplexField are collections of mathematical fields that serve as the underlying data structures manipulated by models and by initial and boundary conditions. Model is responsible for performing model initialization, which includes memory allocation for all necessary data structures and construction of the linear and non-linear operators, as well as defining the time integration scheme, which depends on the model physics. On a higher level, Simulator is responsible for putting all lower-level objects together, controlling initialization and boundary conditions, and writing results to disk at specific time instants. That is, Simulator puts together Model, FieldModifier, Time, and ResultsWriter. Here, FieldModifier can be seen as an object that modifies the contents of the data structures of the considered mathematical model: before the simulation starts, FieldModifier applies the initial conditions, and during the simulation it is called after each step to apply boundary conditions. The main responsibilities of Simulator are to (i) initialize the model, (ii) apply initial conditions, and (iii) while not done: (a) increment time, (b) apply boundary conditions, (c) perform a time step, and (d) save results to disk. The relations between the classes are of the 'has-a' type, meaning that one class contains an object of another class as a member variable; a schematic sketch is given below. On a practical level, the object of interest is often an extension of the class Model with one's own physics. Boundary conditions and initial conditions can be implemented with the help of the FieldModifier class.
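The following schematic illustrates this 'has-a' composition. It is an illustrative sketch only, mirroring the class names discussed above; the actual interfaces and signatures are documented in the OpenPFC repository.

```cpp
#include <memory>
#include <vector>

// Schematic of the bottom-up composition described in the text.
struct World { /* global simulation domain: size, origin, spacing */ };
struct Decomposition { /* splits the World between MPI processes */ };
struct FFT { /* heFFTe-backed transforms on this rank's sub-domain */ };

struct Model {                       // the physics: fields, operators, stepping
  FFT* fft = nullptr;                // has-a: transforms used by the model
  virtual void initialize(double dt) = 0;  // allocate fields, build L and N
  virtual void step(double t) = 0;         // one exponential integration step
  virtual ~Model() = default;
};

struct FieldModifier {               // initial and boundary conditions
  virtual void apply(Model& m, double t) = 0;
  virtual ~FieldModifier() = default;
};

struct Simulator {                   // composes Model, FieldModifiers, Time, I/O
  Model& model;
  std::vector<std::unique_ptr<FieldModifier>> modifiers;
  // run(): initialize model, apply initial conditions, then loop:
  // increment time, apply boundary conditions, step, write results.
};
```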
The responsibility of designing the user interface lies with the user of the software framework. The technical features of the user interface depend on the execution environment of the program code. Nonetheless, we provide a feasible rough-level user interface, denominated App<Model>, whereby the program's input is read from a JSON file; an illustrative input file is shown below. This methodology is appropriate for HPC computing for several reasons. Primarily, computing environments on compute nodes are typically limited and lack graphics libraries. Secondly, the simulation initiation process is usually non-interactive and is carried out by running the simulation through a queuing system (such as SLURM) using a separate startup script. This form of program execution is better suited to a command line user interface where the simulation settings are defined declaratively in a separate configuration file. As the simulation settings cannot be modified during the simulation, or interactively at the beginning of the simulation, the reproducibility of the results is guaranteed. It is worth emphasizing that OpenPFC is a software framework; it is not intended to work as a ready-made solution, but rather as a platform on which one can develop one's own scalable PFC code.
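By way of illustration, a declarative input of this kind could look as follows. The keys shown here are hypothetical, chosen to mirror quantities discussed in this paper; the actual schema should be checked from the OpenPFC repository.

```json
{
  "model": "tungsten",
  "domain": { "Lx": 512, "Ly": 512, "Lz": 512 },
  "time": { "t0": 0.0, "t1": 200.0, "dt": 1.0, "saveat": 10.0 },
  "initial_conditions": [
    { "type": "single_seed", "location": [0.0, 0.0, 0.0], "radius": 20.0 }
  ],
  "results": { "file": "data/psi_%04d.bin" }
}
```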

Aspects of high performance FFT computing
The FFT computations are carried out using an efficient, scalable library, heFFTe [32]. The FFT computation overhead becomes progressively more dominant as the system size increases, due to the superlinear scaling of the FFT algorithm, namely O(N_voxel log(N_voxel)), whereas the remaining overhead (arithmetic operations related to the explicit time integration step of nonlinear terms) scales simply as O(N_voxel). The majority (90%-95%) of the computing time in this work is spent in FFT-related computations (discussed also in section 3.1). Therefore, it is important to optimize the FFT computation settings.
In heFFTe, there are two different protocols for communicating the MPI traffic: point-to-point (p2p) or all-to-all. Furthermore, p2p can be either pipelined or not (denoted p2p_plined vs. p2p, corresponding to MPI_Isend/MPI_Irecv vs. MPI_Send/MPI_Recv). The pipelined communication is non-blocking, meaning that new data can be sent immediately once the previous send command has been issued. The all-to-all protocol likewise has an all-to-all(v) variant (MPI_Alltoall vs. MPI_Alltoallv), where the latter can contain messages of varying sizes and types, whereas in the 'standard' all-to-all protocol the messages must have a single size and shape. In our tests, we found that (1) the computational performance varies significantly among the protocols, and (2) all-to-all(v) performs best for smaller grids, while pipelined point-to-point performs best for very large grids. Therefore we advise users to perform numerical tests to determine the best protocol for their cluster setup (hardware and software) and the specific grid used. The strong and weak scaling tests are shown later in section 3.1, where each MPI communication protocol is considered and the most efficient protocol is chosen.
In 3D FFT computations, the problem is decomposed into a sequence of low-dimensional FFT problems. In heFFTe [32], the domain decomposition between MPI processes can be performed either as 2D slabs or as 1D pencils. For a large number of MPI processes and a small grid, the 1D decomposition should be used, as there is a limited number of 2D slabs that can be generated; if there are too many MPI processes for the 2D slab decomposition, the program crashes. Ayala et al analyzed the decomposition strategy more thoroughly in [33], where they found that, e.g. for a 1024³ grid, runs with more than 128 MPI processes should use the pencil decomposition. In our tests, we found no performance difference between the 2D slab and 1D pencil decompositions. Therefore, we recommend using the pencil decomposition as a default; a configuration sketch is given below.
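The sketch below shows how these choices are expressed through heFFTe's plan options, following the heFFTe documentation; the box extents are illustrative (in OpenPFC the local boxes would come from the Decomposition class).

```cpp
#include <complex>
#include <vector>
#include <mpi.h>
#include <heffte.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  // Local input/output boxes of this rank (inclusive index ranges; illustrative).
  heffte::box3d<> inbox({0, 0, 0}, {127, 127, 127});
  heffte::box3d<> outbox({0, 0, 0}, {127, 127, 127});

  heffte::plan_options options = heffte::default_options<heffte::backend::fftw>();
  options.algorithm   = heffte::reshape_algorithm::alltoallv; // or ::p2p_plined
  options.use_pencils = true;   // 1D pencil decomposition (recommended default)

  heffte::fft3d<heffte::backend::fftw> fft(inbox, outbox, MPI_COMM_WORLD, options);

  std::vector<std::complex<double>> in(fft.size_inbox()), out(fft.size_outbox());
  fft.forward(in.data(), out.data());   // forward transform with chosen options
  MPI_Finalize();
  return 0;
}
```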

Results and discussion
This section examines the computational performance of OpenPFC, and demonstrates its application to large scale solidification using aluminum and tungsten as relevant example materials.In all simulations presented in this section, we initialize the density field in the form of one or more spherical crystal seeds surrounded by a liquid.In the solid, the density field is initialized following the plane wave construction of the single mode expansion using the equilibrium amplitudes and average density, where each seed is given a unique 3D orientation.In the results, the PFC density peaks were converted to atomic positions, after which they were visualized in OVITO [34].
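For reference, the single mode (one-mode) plane wave construction mentioned above takes, for a BCC lattice, the standard textbook form sketched below; this is quoted as a generic illustration rather than from this paper's equations, with φ and Ψ̄ set to the equilibrium amplitude and average density, and the wave vectors rotated according to each seed's orientation.

```latex
% One-mode approximation of the BCC density field, with q = 2\pi/a the
% Cartesian component of the principal {110} reciprocal lattice vectors:
\Psi(\mathbf{r}) = \bar{\Psi}
  + 4\phi\left[\cos(qx)\cos(qy) + \cos(qy)\cos(qz) + \cos(qz)\cos(qx)\right]
```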

Computational efficiency and scaling in single crystal growth
We first considered the growth of a single spherical tungsten crystal seed at a uniform temperature. Here, the liquid density is set to a value below the equilibrium density, which provides the driving force for solidification of the spherical seed. The growth of a post-critical crystal seed of tungsten is shown in figure 3, where we used a 512³ grid. The growth eventually stops, after which the densities on the solid and liquid sides were measured and compared to the phase diagram, verifying that the densities reach the values predicted by the phase diagram in figure 1.
We analyzed the computational scaling of the OpenPFC platform on EuroHPC's LUMI cluster [35], employing the CPU-only partition, whose nodes use two 64-core 3rd-generation AMD EPYC™ CPUs (128 cores per node) with 256 GB of memory, connected by a 200 Gbit s⁻¹ Cray Slingshot interconnect. We used the same single-crystal set-up as in figure 3 and varied the box dimensions from 256³ to 8192³ and the number of nodes from 1 to 512 (i.e. 128 to 65 536 cores). First, we found that typically 90%-95% of the computation time is spent in the forward and backward FFT computations, which means that the non-linear explicit time integration has a negligibly small computational cost. We therefore expect our PFC solver's scaling to closely follow the scaling analyses of the FFT library used, heFFTe [32].
For the computational scaling analysis, we simulated the first 200 numerical time steps and considered the smallest time needed to perform a time step, i.e. the minimum step time. We chose the minimum because LUMI currently has hardware issues that slow down certain nodes, which bottleneck the computations; the minimum step time is expected to be representative of a hardware-issue-free cluster (this hardware issue is being fixed in LUMI at the time of writing this article).
The resulting computational efficiency, in terms of strong and weak scaling, is shown in figure 4, where the left-side Y axis shows the measured step time (seconds to compute one time step, t_step), and the right-side Y axis shows, in blue, the number of time steps calculated in a week; figures 4(a) and (b) share their Y axes. First, let us consider the strong scaling in figure 4(a), where each color corresponds to a fixed grid size, the scatter data are the t_step measured from OpenPFC, and the dashed lines show the ideal strong scaling law t_step^ideal, defined for each grid size as t_step^ideal(N_cores) = t_step(N_ref) × N_ref/N_cores, where N_ref is the smallest number of cores used for that grid size, i.e. the step time is ideally inversely proportional to the number of cores. For a 256³ grid, the scaling starts to saturate after 2048 cores, and a 512³ grid saturates after 8192 cores. Grids larger than 512³ voxels show no clear limit in strong scaling up to the 65 536 cores (512 nodes) considered. Weak scaling, i.e. a fixed ratio of grid size to number of cores, is shown in figure 4(b). For each of the three ratios considered, as the grid size increases the step time increases at first, then plateaus, and finally increases strongly for very large grids. Weak scaling is expected to have a slight positive slope even in the ideal case, because the computational complexity of the FFT algorithm increases superlinearly with grid size. It is notable that there is an abrupt drop in step time for the largest grids at 256³ and 512³ voxels per 128 cores; this is because, for the respective grid sizes, the optimal MPI communication protocol changes from all-to-all(v) to pipelined point-to-point (the protocols were discussed in section 2.6). The chosen protocol has a large influence on the computational performance (step time t_step) and, in general, we expect that the optimal MPI communication protocol depends on the grid dimensions and the cluster setup (hardware and software); therefore, as mentioned earlier, we advise users to perform numerical tests to determine the optimal protocol for their cluster and the grid dimensions considered.
In figures 4(a) and (b), the right-side Y axis shows the projected number of time steps calculated in a week; this helps the user design their calculations, taking into account the computational resources (number of cores available) and the desired grid dimensions. In our experience, simulations of solidification require between 10⁵ and 10⁷ time steps, depending on the model parameters, quench temperature or density, and system length in directional solidification. Therefore, in principle, reasonable PFC simulations on 2048³ or even 4096³ grids can be performed with OpenPFC, if one has access to sufficiently many cores. However, for practical simulations on such large grids, a significant bottleneck can be the disk space requirements, which are discussed in the next subsection.
While we do not perform a GPU scaling analysis in this work, our preliminary tests indicate that GPU runs are approximately twice as fast as equivalent CPU runs. Specifically, the raw calculation speed with GPUs is roughly 50 times that of CPUs within a single MPI process. However, MPI traffic causes a bottleneck that reduces the net speed-up to only a factor of two over CPUs; this is consistent with the findings of the heFFTe developers [32]. Another challenge is setting up the code to work on multiple GPU cards across multiple machines, where hardware-specific issues may need to be overcome.
There are several approaches to improving the computational performance of the code beyond simply increasing the number of cores. One approach is shared memory parallelization, e.g. using OpenMP, which can offer a notable (e.g. two-fold) speed-up in heFFTe-based FFT computations. On a lower level, vector instructions allow the same operation to be performed on multiple data elements in a single machine instruction, while instruction-level parallelism allows several independent instructions to be executed in one clock cycle within a single CPU core.

Consumption of memory and disk space.
For the PFC model under consideration, the memory consumption can be approximately calculated as memory ≈ (number of resident field and operator arrays) × (number of voxels) × 8 B per double-precision value, which for the present model works out to roughly 96 B per voxel, i.e. on the order of a dozen double-precision arrays. Thus, memory consumption for different grid sizes is as follows: 256³ grid → 1.6 GB, 1024³ → 103 GB, 4096³ → 6.6 TB, 8192³ → 53 TB. Note that this memory consumption is proportional to the number of (auxiliary) fields and operators considered. Memory consumption can be reduced by reducing the number of pre-computed auxiliary variables and computing them locally on the fly ('in for-loops'). This is particularly important when using GPUs, where memory consumption can become a bottleneck. Another way to reduce memory consumption is to use 32-bit floating point numbers, which would halve the memory requirements; 32-bit floating point arithmetic is most accurate close to zero, so if the fields are scaled close to the range (−1, 1), the accuracy losses can be insignificant. In OpenPFC, the uncompressed size of the result output file equals the number of output fields (a single field, the density, for one-component PFC) times the grid size (number of voxels) times the double-precision floating point size: file size = number of fields × grid size × 8 B/voxel.
Thus, for each recorded time step, the result output file sizes are: 256³ grid → 134 MB, 1024³ → 8.6 GB, 4096³ → 550 GB, and 8192³ → 4.4 TB. Result files are written with MPI parallel I/O (using MPI_File_write_all) into a single file per recorded time step, and in our experience the simulation is not bottlenecked by the writing of results. Efficient compression of the OpenPFC result files will be considered in the future.
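As a quick capacity-planning aid, these estimates can be scripted as below. The twelve-array count is our reading of the roughly 96 B/voxel implied by the figures quoted above, and should be adjusted for other models.

```cpp
#include <cstdint>
#include <cstdio>

// Back-of-the-envelope memory and disk estimates for a cubic grid.
int main() {
  const std::uint64_t n = 4096;                 // grid edge length in voxels
  const std::uint64_t voxels = n * n * n;
  const std::uint64_t n_arrays = 12;            // assumed resident arrays
  const double mem_TB  = voxels * n_arrays * 8.0 / 1e12; // runtime memory
  const double file_GB = voxels * 1 * 8.0 / 1e9;         // one density field
  std::printf("memory ~ %.1f TB, output ~ %.0f GB per recorded step\n",
              mem_TB, file_GB);
  return 0;
}
```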

Moving non-periodic boundaries for directional solidification
Fourier transform based solution methods naturally impose periodic boundary conditions (BCs), while in directional solidification one needs non-periodic boundaries at the front and back of the domain. We achieve this by appending two rectangular domains at either end of the simulation box. At every time step, these domains are fixed to uniform densities: the domain in front of the solidifying interface is set to the 'quench density' chosen for the initial metastable liquid (Ψ_L,far), representing the far-field liquid, while the domain behind the solidifying interface is set to the density of vapor in coexistence with solid (Ψ_V,eq) according to the phase diagram in figure 1, which allows the already-solidified material to equilibrate with the back domain. A PFC simulation of directional solidification seeded by four spherical grains is shown in figure 5(a), where the aforementioned fixed front domain is shown in red (right) and the fixed back domain in blue (left). We choose the back boundary to be vapor rather than solid because of the difficulty of imposing a constant solid boundary that does not introduce unphysical stresses into the simulated solid. It is also noted that the use of a V-PFC model makes the implementation of such 'soft' boundaries much easier than with two-phase (solid-liquid) PFC models, which would require abrupt temperature ramping at the boundary.
Once the furthest reaches of the solidification front approach the front domain (figure 5(a)), the front domain starts to act as a density sink, leading to a nonphysical speed-up of the solidification front. To avoid this boundary condition artifact, we implemented moving boundary conditions in a new simulation with identical initial conditions, but with a system that is shorter in the solidification direction (see top row in figure 5(b)). In this case, we move the front domain forward, wrapping it around to the left edge of the simulation box, as shown in the second and third rows of figure 5(b). As the FFT solver's 'natural' boundary conditions are periodic, wrapping the front domain around to the back end of the simulation box allows the solidification front to naturally re-appear on the left side of the simulation box, letting us continue to track the 'forward' evolution of the solidification front within a smaller, fixed-size simulation domain; a schematic sketch of this update is given below. Simultaneously with the wrapping of the front domain, we also move the back domain forward, which implies that we cut out the earliest-solidified regions.
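A minimal sketch of this wrap-around boundary update is given below, with hypothetical function and variable names; in OpenPFC such logic would naturally be implemented as a FieldModifier. The combined fixed region consists of a liquid slab (Ψ_L,far) ahead of the front and a vapor slab (Ψ_V,eq) behind the oldest solid, advancing together and wrapping periodically in the growth direction x.

```cpp
#include <cstddef>
#include <vector>

// Sketch: re-impose the fixed-density slabs each time step, wrapping
// periodically in x. psi is stored as a contiguous nx*ny*nz array.
void apply_moving_boundaries(std::vector<double>& psi,
                             int nx, int ny, int nz,
                             int slab_start,        // x-index where the slabs begin
                             int width_liquid, int width_vapor,
                             double psi_L_far, double psi_V_eq) {
  for (int d = 0; d < width_liquid + width_vapor; ++d) {
    const int i = (slab_start + d) % nx;            // periodic wrap-around
    const double value = (d < width_liquid) ? psi_L_far : psi_V_eq;
    for (int j = 0; j < ny; ++j)
      for (int k = 0; k < nz; ++k)
        psi[(static_cast<std::size_t>(i) * ny + j) * nz + k] = value;
  }
}
```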
Figure 5 compares a simulation result using the moving-boundary procedure described above to that in a fixed simulation domain, i.e. one with static boundaries (coined the 'ground truth' data). The static grid size in figure 5(a) is 4096 × 1024 × 512, while the grid with the moving boundaries in figure 5(b) has size 1024 × 1024 × 512. It can be seen that the two simulations are visually identical in a large domain straddling the solidification front up to 70 000 τ_PFC, verifying that the method works well for tracking the solidification behavior without having to compute and record all the frozen-in solid far behind the solidification front. In order to maintain accuracy, the above moving-boundary procedure requires that the fixed computational domain is chosen large enough that the front end of the solidification front does not 'feel' the back end. Our tests indicate that this buffer needs to be 60-120 grid points (7-15 lattice spacings) to avoid spurious interactions between the front and back boundaries. Moreover, the liquid buffer layer between the solidification front and the front boundary domain should be several times the diffusion length measured from the solidification front.
An alternative implementation of the moving simulation window can be achieved by simply shifting the density field values backwards once the front approaches the front end of the simulation box. However, this grid-value shifting approach can decrease computational efficiency, particularly for very large (3D) simulations, because the shifted field values would need to be communicated between MPI processes, and in large HPC calculations MPI traffic is often the main computational bottleneck. With our approach, the PFC density evolution equation (as well as any added coupled fields, such as concentration for alloys) can be solved without needing to copy field values. Another alternative would be to add a constant-velocity advection term to shift the reference frame, but this tends to cause numerical diffusion issues. A larger simulation of directional solidification of aluminum on a 2048 × 2048 × 1024 grid, using the moving non-periodic boundaries described above, is shown in figure 6. This simulation was initialized with 36 randomly oriented spherical grains. Competitive growth and coarsening are notable, with grains growing in unfavorable orientations being eliminated.

3D directional rapid solidification simulations
We close this work with a view of some preliminary calculations from ongoing work to study rapid directional solidification of tungsten and aluminum using the V-PFC model run on the OpenPFC platform. Both simulations discussed in this section use non-periodic boundaries, as explained in section 3.2, but without moving boundaries. Both simulations are initialized with arrays of uniformly spaced spherical seed crystals whose positions were lightly offset with random noise. Each seed was assigned a random initial crystal orientation. Figure 7 shows snapshots of some of the resulting microstructure evolution for tungsten (BCC) and aluminum (FCC), respectively.
In the tungsten simulation shown in figure 7(a), two rows of four seeds were initialized, and after solidification, growth and coarsening were observed to propagate grains throughout the simulation space, rapidly advancing to occupy the majority of the system volume. Using OVITO's surface finding filter [36], the resulting regions of lowered (vapor) density are shown in figure 7(c), where both spheroidal and elongated low-density regions are observed. This result is analogous to earlier two-dimensional numerical results for simulations of physical vapor deposition [37], and is caused by our choice of a low average number density for this simulation. We chose a low density, effectively emulating a low pressure environment, to demonstrate the void formation capabilities of the model, which we expect to be crucial for modeling voids in crystal growth by vapor deposition methods. Analogous results have been found in two dimensions [7], where voids and dislocations were collectively observed to mediate sub-grain orientation gradients. This topic will be the subject of a future work focusing on 3D simulations. OVITO's dislocation analysis filter does not find many dislocations in this case, most likely because the strain that would have resulted in dislocations ended up nucleating voids instead. For the grain boundaries with dislocations, the most common dislocation type has Burgers vector ½⟨111⟩ (green segments in figure 7(c)), which is the major dislocation type in BCC materials [4]. It is also notable that there are no dislocations in the grain interiors, most likely because the grains are only approximately 5 nm in size.
Grain growth in the aluminum sample is simulated in figure 7(b), initialized from a 2 × 4 array of seed crystals. Compared to tungsten, the aluminum was initialized at a higher average density and did not produce low-density pockets, producing instead webs of dislocations along some of its grain boundaries. The crystal interiors of the aluminum grains are also free of defects such as dislocations; we believe this is likewise due to the small size of the grains. The most common identified dislocation type is ½⟨110⟩ (blue segments), which is the major dislocation type in FCC materials [4]. The total dislocation density, approximately 2.5 × 10¹⁶ m⁻², is consistent with the PFC amplitude model simulations and MD simulations for aluminum conducted in [38]. It is also noted that for very rapid solidification rates, the incorporation of two-time PFC dynamics through the use of an inertial term (also called 'modified PFC', or MPFC for short [39]) can be used to improve the description of dislocation dynamics within the PFC framework. In principle, an MPFC-type model can be straightforwardly implemented by converting the second-order time derivative equation into two first-order equations, introducing the time derivative of the density field as a new field variable with its own evolution equation (see the schematic equations below and, e.g., [40, 41] for more details). Moreover, it is notable that the liquid regions at the grooves of the solidification front are very short, indicating that the absolute stability limit (planar solidification front conditions) is being approached. This is consistent with past rapid solidification simulations using the phase field method [42, 43] and amplitude-expansion PFC [7, 38].
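Schematically, with β a damping coefficient and α an effective sound speed (generic notation, not taken from this paper), the MPFC splitting described above reads:

```latex
% Two-time MPFC dynamics rewritten as two first-order equations by
% introducing the auxiliary field Q := \partial\Psi/\partial t:
\frac{\partial^2 \Psi}{\partial t^2} + \beta\,\frac{\partial \Psi}{\partial t}
   = \alpha^2 \nabla^2 \frac{\delta F}{\delta \Psi}
\quad\Longleftrightarrow\quad
\left\{
\begin{aligned}
\frac{\partial \Psi}{\partial t} &= Q,\\
\frac{\partial Q}{\partial t} &= -\beta\,Q
   + \alpha^2 \nabla^2 \frac{\delta F}{\delta \Psi}.
\end{aligned}
\right.
```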

Conclusion
In this paper we introduced a new open-source PFC framework for high performance computing environments. Coined 'OpenPFC' (https://github.com/VTT-ProperTune/OpenPFC), this platform is based on the FFT engine heFFTe and is scalable to modern HPC environments. The technical aspects related to the FFT solution and parallelization are hidden from users as much as possible, allowing them instead to focus on PFC model development and the production of simulation results. Considering up to 65 536 cores, the OpenPFC platform was shown to have near-ideal strong scaling for grids larger than 512³, in principle allowing practical PFC simulations on 2048³ or even 4096³ grids, provided that one has access to a sufficient number of cores and sufficient disk storage. As an application of OpenPFC, we demonstrated simulations of rapid solidification of tungsten and aluminum under a frozen thermal gradient, using a recent solid-liquid-vapor PFC model (V-PFC) parameterized to reproduce the phase diagrams of these materials. We also presented a robust approach for handling moving boundaries, which minimizes memory-intensive array copying as well as MPI traffic, and does not require an advection term to shift the reference frame. While the platform was demonstrated with a single-component PFC model, it is also adaptable to other continuum field theoretic models (such as complex amplitude models), as well as to traditional phase-field-type models with more than one evolving field. The user only needs to define the fields of interest and embed the appropriate FFT-based update scheme for each field under consideration. Going forward, we will be accumulating different PFC model 'flavors' in the OpenPFC project (including binary alloy PFC models and, later, complex amplitude models), making it increasingly accessible for a wide audience of new users to implement their specialized PFC models with relative ease. It is notable that non-linear gradient terms necessitate iterative solution strategies (instead of the straightforward exponential time integration), in analogy to the fixed-point FFT algorithm implemented for the elasticity problem [44]. As an open-source project, the code will continue to be improved and will eventually contain a menu-driven selection of tested models from previous users/contributors. At the time of writing this paper, the framework employs CPU-driven computations with distributed memory ('MPI-only') parallelism without parallel threading. Currently, OpenPFC is being extended to GPU-based computations.
Acknowledgments
The authors acknowledge the Digital Research Alliance of Canada (alliancecan.ca) for computational resources. CSC - IT Center for Science, Finland, is acknowledged for computational resources.

Figure 1. PFC phase diagrams for aluminum and tungsten. Experimental data are shown as black scatter points for tungsten [27] and aluminum [28].

Figure 2. Main classes and their relationships in OpenPFC.

Figure 3. Growth of a spherical tungsten crystal at various time steps, in a 512³ voxel (20³ nm³) grid.

Figure 4. Parallelization efficiency of OpenPFC in terms of (a) strong scaling, i.e. fixed grid size vs. number of cores, where the dashed lines represent ideal scaling; (b) weak scaling, i.e. fixed ratio of voxels to number of cores vs. grid size. The right Y axis shows the projected number of time steps calculated in a week. Panels (a) and (b) share the left and right Y axes.

Figure 5. Demonstration of non-periodic boundaries. (a) Static front and back boundaries in a 4096 × 1024 × 512 voxel grid; (b) the same simulation set-up as in (a), but with moving front and back boundaries using a 1024 × 1024 × 512 grid. Initial conditions, recorded time instants, and thermal gradients are identical in (a) and (b). Gray domains are grain boundaries.

Figure 6. Time evolution in directional solidification of aluminum for a 2048 × 2048 × 1024 voxel grid using the moving non-periodic boundaries described in the text. Grain growth is initialized with 36 randomly oriented spherical grains. The thermal gradient is G = 5.0 × 10⁸ K m⁻¹ and the pulling speed is V_pull = 3.5 × 10⁻⁵ m s⁻¹. Colors indicate different grain orientations.

Figure 7. Directional solidification of (a) tungsten (BCC) and (b) aluminum (FCC). The corresponding identified dislocations and generated surfaces are shown in (c) and (d). For tungsten in (c), only part of the generated surface for the grain is shown, to more clearly demonstrate the porosity that emerges in the final solid. The average density of the tungsten system is set close to the triple point in order to demonstrate the void-forming capacity of the PFC model. For tungsten in (a), G = 1.3 × 10⁹ K m⁻¹ and V_pull = 1.0 × 10⁻⁵ m s⁻¹; for aluminum in (b), G = 5.0 × 10⁸ K m⁻¹ and V_pull = 3.5 × 10⁻⁵ m s⁻¹.