Neural Networks for Nuclear Reactions in MAESTROeX

We demonstrate the use of neural networks to accelerate the reaction steps in the MAESTROeX stellar hydrodynamics code. A traditional MAESTROeX simulation uses a stiff ODE integrator for the reactions; here we employ a ResNet architecture and describe details relating to the architecture, training, and validation of our networks. Our customized approach includes options for the form of the loss functions, a demonstration that the use of parallel neural networks leads to increased accuracy, and a perturbational approach in the training step that makes the model more robust. We test our approach on millimeter-scale flames using a single-step, 3-isotope network describing the first stages of carbon fusion occurring in Type Ia supernovae. We train the neural networks using simulation data from a standard MAESTROeX simulation, and show that the resulting model can be effectively applied to different flame configurations. This work lays the groundwork for more complex networks and for iterative time-integration strategies that can leverage the efficiency of the neural networks.


INTRODUCTION
In stellar astrophysics simulations, nuclear reactions are often the most computationally demanding component. Even in moderately complex networks, the time scales of the stiffest reactions can be on the order of pico- or even femtoseconds. In explicit reaction integration schemes, this leads to a time step that is overly restrictive compared to the advective, acoustic, or diffusive scales, while implicit schemes can require hundreds or thousands of rate evaluations per time step. Thus, reactions can be orders of magnitude more expensive than advection and/or implicit global solvers such as those for self-gravity, momentum, or mass diffusion. Even with a relatively simple three-species, one-step reaction network, implicit treatment of burning can require nearly half of the total computational time.
One burgeoning approach in computational fluid dynamics (CFD) is the use of machine learning to replace various computational kernels such as advection (Papapicco et al. 2022), diffusion (Sirignano & Spiliopoulos 2018), and the Poisson equation (Tang et al. 2017) (for self-gravity, electrostatics, or projection-based decomposition techniques for incompressible flow). In particular, the use of Deep Neural Networks (DNNs) for surrogate modeling in the framework of partial differential equations (PDEs) and ordinary differential equations (ODEs) has been especially popular in recent years. However, while literature on using DNN models in areas of CFD such as turbulence modeling is growing rapidly (Echekki & Mirgolbabaei 2015; Duraisamy et al. 2019; Grimberg & Farhat 2020; Lye et al. 2020), there are notably fewer investigations into using them for chemical kinetics; see Meuwly (2021) for a review of current machine learning techniques for chemical reactions in a wide range of applications. Indeed, interest in using neural networks for computational chemistry kernels in reacting flow simulations has emerged only very recently, with several groups having applied this approach to moderate-sized systems (∼10 species, ∼20 reactions) in a terrestrial combustion context (Brown et al. 2021; Owoyele & Pal 2022; Sharma et al. 2020; Ji et al. 2021). Astrophysical nuclear reaction networks have much in common with terrestrial chemical networks, although the astrophysical rates tend to have a much stronger temperature dependence and hence can be stiffer.
In the context of using a DNN to accelerate or replace a computationally expensive PDE or ODE solve, the DNN model is trained to approximate the mapping from the input states of the differential equation solver to its solution states. The learning problem then aims at tuning the weights of the DNN model until it reaches a high degree of accuracy. Examples of DNNs for surrogate modeling include the Residual Neural Network (ResNet) (He et al. 2016), the physics-informed neural network (PINN) (Karniadakis et al. 2021; Raissi et al. 2019), and the fractional DNN (Antil et al. 2020). Among them, PINNs have achieved success in a wide range of applications by encoding physics constraints into the loss functions of the neural network such that the governing equations are satisfied. However, recent investigations by Ji et al. (2021) and Wang et al. (2020) have shown that the performance of PINNs on chemical kinetics problems governed by stiff ODEs is sub-optimal, often failing due to both numerical stiffness and physical stiffness. Hence, our approach to constructing a surrogate model for the reaction computational kernel is closer to that of Brown et al. (2021), where the goal is to learn from physics/chemistry. We still want to include physics-based constraints in the loss function whenever possible, but they will not be expressed in the form of the governing equations and will instead be specific to the problem (for example, conservation of mass).
In this paper we use the astrophysical hydrodynamics code, MAESTROeX (Nonaka et al. 2010;Fan et al. 2019b), to perform millimeter-scale nuclear flame simulations using a neural network to accelerate the reaction steps. We demonstrate that we can train neural networks using data from a traditional MAESTROeX simulation that utilizes stiff ODE integration for the reactions, and run new simulations utilizing this neural network at a reduced computational cost. This particular problem is extremely challenging due to the delicate balance between advection, thermal diffusion, and reactions. We have implemented a training scheme which is highly accurate and sensitive to this balance, and describe our design decisions below.

MODEL AND PRIOR NUMERICAL APPROACH
MAESTROeX is a finite-volume code that solves the equations of low Mach number reacting flow in astrophysical environments. Since the low Mach number model does not contain acoustic waves, the time step is limited by an advective CFL constraint, which is O(1/Ma) larger than the acoustic CFL constraint in compressible approaches, where Ma is the characteristic Mach number. The code is suitable for both stratified environments (planar regions or full stars) as well as small-scale simulations without stratification (where it reduces to the system described in Bell et al. (2004c)). Here we focus on the latter case (a millimeter-scale flame); thus we do not account for gravitational stratification, and the model is simpler than the full MAESTROeX equation set:

∂(ρX_k)/∂t = −∇·(ρX_k U) + ρ ω̇_k,
∂(ρh)/∂t = −∇·(ρh U) + ∇·(k_th ∇T) + ρ H_nuc,
ρ [∂U/∂t + (U·∇)U] = −∇π,
∇·U = S.

Here ρ is the fluid density, X_k is the mass fraction of species k with associated production rate ω̇_k, U is the fluid velocity, h is the enthalpy per unit mass, k_th is the thermal conductivity, T is the temperature, H_nuc is the nuclear energy generation rate per unit mass, and π is the perturbational (or dynamic) pressure. The divergence constraint is derived directly from the equation of state by taking the Lagrangian derivative and substituting in the equations of mass and energy evolution. The constraint represents a linearized approximation of the velocity field required so that the thermodynamic variables evolve such that p(ρ, h, X_k) = p_0, where p_0 is a constant ambient pressure. The expansion term, S, accounts for local compressibility effects resulting from nuclear burning, compositional changes, and thermal conduction; see Fan et al. (2019b). Millimeter-scale flame instabilities with this algorithm have previously been studied in Bell et al. (2004a,b); Zingale et al. (2005). The MAESTROeX algorithm utilizes a projection methodology for the velocity integration, and Strang splitting for the thermodynamic variable integration.
The projection algorithm involves an explicit hydrodynamic step followed by a global geometric multigrid Poisson solve to correct the solution so that it satisfies the divergence constraint. The Strang splitting algorithm alternates between a reaction half-step, an advection-diffusion full step, and a reaction half-step to achieve second-order accuracy. Advection is treated explicitly with a second-order accurate Godunov approach based on the corner transport upwind scheme of Colella (1990), and thermal diffusion is treated implicitly using a global Helmholtz linear solver using geometric multigrid. Reactions are integrated using the stiff ODE solver VODE (Brown et al. 1989).
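The Strang-split cadence of a single time step can be sketched as follows; `react` and `advect_diffuse` are hypothetical stand-ins for the VODE reaction half-steps and the combined explicit-advection/implicit-diffusion step.

```python
import numpy as np

def strang_step(state, dt, react, advect_diffuse):
    """One Strang-split step: react dt/2, advect-diffuse dt, react dt/2.

    `react` and `advect_diffuse` are placeholder operators; in MAESTROeX
    the reaction half-steps call the stiff ODE solver VODE and the middle
    step is the Godunov advection plus implicit thermal diffusion."""
    state = react(state, dt / 2)
    state = advect_diffuse(state, dt)
    state = react(state, dt / 2)
    return state

# Toy check on the scalar ODE y' = (a + b) y, split into the two
# exactly-solvable pieces y' = a y (the "reaction") and y' = b y (the
# "advection"); for commuting linear operators the split is exact.
a, b = -3.0, 1.5
react = lambda y, dt: y * np.exp(a * dt)
adv = lambda y, dt: y * np.exp(b * dt)
y = strang_step(1.0, 0.1, react, adv)
exact = np.exp((a + b) * 0.1)
```

For the toy linear operators the splitting is exact; for the real nonlinear system, Strang splitting is second-order accurate in ∆t.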
We use a publicly available equation of state (Timmes & Swesty 2000), which includes contributions from electrons, ions, and radiation. We select a simple 3-isotope network describing the first stages of carbon fusion occurring in Type Ia supernovae (SNIa). In the simmering and deflagration stages of common SNIa models, 12C nuclei fuse with other 12C nuclei to form predominantly either 4He + 20Ne or 1H + 23Na. Because both of these sets of reaction products can be thought to proceed from the decay of a short-lived 24Mg nucleus in an excited state, our network describes these reactions as a single 12C + 12C → 24Mg reaction to reduce the number of equations and isotopes we track. We evolve the 12C and 24Mg mass fractions using screening as described in Graboske et al. (1973); Weaver et al. (1978); Alastuey & Jancovici (1978); Itoh et al. (1979). This particular network contains time scales on the order of 10⁻¹⁴ s, so implicit integration with VODE is required when using the large time steps afforded by the MAESTROeX algorithm. Because common SNIa models also generally contain about 50% 16O by mass in the progenitor white dwarf, we include this quantity of 16O, comprising the third species in our network. However, we do not evolve 16O, as we are only interested in modeling the early stages of 12C fusion in SNIa models and have not extended the current study to account for later 16O burning.

Problem Formulation
As stated in Section 2, the MAESTROeX algorithm integrates the reactions using the stiff ODE solver VODE, which evolves the species and enthalpy from X_k^in → X_k^out and (ρh)^in → (ρh)^out by solving the following system of equations over a time interval of ∆t:

dX_k/dt = ω̇_k,    (5)
d(ρh)/dt = ρ H_nuc,    (6)

where, in the absence of weak interactions, the nuclear energy generation rate can be expressed in terms of the specific binding energies, q_k, as

H_nuc = −∑_k q_k ω̇_k.    (7)

We note that in Fan et al. (2019b) we integrated temperature rather than enthalpy; it was later shown in Zingale et al. (2021) that integrating energy is a more robust approach, which we have subsequently adopted. In the Strang-split time integration algorithm for MAESTROeX, we evolve only Equations (5) and (6) in our reaction step (not density as well). As a result, as implemented within our reactions module, we integrate internal energy, e, rather than enthalpy, h, since the time derivatives of these quantities from reactions alone are identical if the pressure is constant in time, which it is for small-scale flames in an open domain. We seek to replace the current reaction solver using deep neural networks (DNNs) such that the updated mass fractions and enthalpy are determined by "black boxes" that satisfy the following:

X_k^out = DNN_1(ρ, T^in, X_k^in),    (8)
(ρh)^out = DNN_2(ρ, T^in, X_k^in).    (9)

Note that the inputs to both DNNs are the same, which allows the two models to be combined during runtime. Also note that the density remains unchanged during this reaction step, hence the enthalpy update can be expressed simply as

h^out = h^in + e_nuc,    with    e_nuc = DNN_2(ρ, T^in, X_k^in).    (10)

Combining Equations (8) and (10) into a single neural network gives

(X_k^out, e_nuc) = DNN_full(ρ, T^in, X_k^in).    (11)
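A minimal sketch of the reaction ODE solve follows, using SciPy's implicit BDF method as a stand-in for VODE; the single-reaction rate law and the values of the rate constant K and energy release Q are illustrative assumptions, not the screened 12C + 12C rate.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative single-reaction network X_C12 -> X_Mg24 with a made-up
# rate constant; the real network uses a screened, strongly
# temperature-dependent 12C + 12C rate.
Q = 1.0e17   # illustrative specific energy release (erg/g)
K = 50.0     # illustrative rate constant (1/s)

def rhs(t, y):
    x_c12, x_mg24, e = y
    wdot = K * x_c12 ** 2           # d(X_C12)/dt = -wdot, d(X_Mg24)/dt = +wdot
    return [-wdot, wdot, Q * wdot]  # de/dt = H_nuc

y0 = [0.5, 0.0, 0.0]                # pure carbon/oxygen fuel, no ash
sol = solve_ivp(rhs, (0.0, 1.0), y0, method="BDF", rtol=1e-10, atol=1e-12)
x_c12, x_mg24, e = sol.y[:, -1]
```

Because the species right-hand sides sum to zero, X_C12 + X_Mg24 is conserved to solver precision, the same constraint that is later built into the loss function.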

Architecture
To accurately represent the functionality of the ODE solver VODE, the inputs of the DNNs are density, temperature, and mass fractions. Note that in this work we use a constant time step in each of our simulations, and hence ∆t is not used as an input; for more general problems, ∆t will also need to be considered. The input dimension is then M + 2 (density, temperature, and M mass fractions), where M = 2 for our reaction network as discussed in Section 2. Due to the sensitivity of the nuclear energy generation (expressed in erg/g), e_nuc, to the accuracy of the species mass fractions (refer to Appendix A for a detailed discussion), the output includes both the species mass fractions and the nuclear energy generation e_nuc, for a total output dimension of M + 1. We consider two different types of neural networks for this problem: a single neural network DNN_full satisfying Equation (11), and two parallel neural networks DNN_1 and DNN_2 satisfying Equations (8) and (10) respectively, which are combined during run time to approximate the entire solution. It is shown in Section 4 that the single neural network does not perform as well as the parallel networks. Additional advantages of training the networks in parallel include decreasing the total training time and avoiding potential memory limitations.
All DNNs are implemented using a ResNet (He et al. 2016) architecture with differing numbers of hidden layers and shortcut connections. Although ResNets are most commonly used in classification problems rather than as surrogate models, they have been shown to perform well for modeling problems with many inputs and few outputs. ResNets differ from fully connected neural networks in that they include additional connections between non-consecutive hidden layers, usually skipping two or more layers. In very deep networks, these "shortcut" connections help to alleviate the vanishing gradient problem, in which the gradients become increasingly small during backpropagation, causing sub-optimal convergence of the first few layers of the network. After experimenting with varying the number of hidden layers, hidden nodes, and shortcut connections, we chose the architectures specified in Figure 1 for the DNNs that we train and test. Note that the single and parallel neural networks contain the same total number of hidden nodes.
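A minimal PyTorch sketch of a ResNet of this flavor is shown below; the widths, depth, and skip spacing are illustrative placeholders, not the Figure 1 architectures.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two fully connected layers with a shortcut connection skipping
    both, using tanh activations as described in the text."""
    def __init__(self, width):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)

    def forward(self, x):
        h = torch.tanh(self.fc1(x))
        return torch.tanh(self.fc2(h) + x)   # shortcut connection

class SpeciesResNet(nn.Module):
    """Sketch of DNN_1: (rho, T, X_k) -> X_k, with a final ReLU so the
    predicted mass fractions are non-negative. Width and depth here are
    illustrative, not the Figure 1 values."""
    def __init__(self, n_species=2, width=64, n_blocks=3):
        super().__init__()
        self.inp = nn.Linear(n_species + 2, width)   # rho, T, and X_k inputs
        self.blocks = nn.Sequential(*[ResBlock(width) for _ in range(n_blocks)])
        self.out = nn.Linear(width, n_species)

    def forward(self, x):
        h = torch.tanh(self.inp(x))
        return torch.relu(self.out(self.blocks(h)))

model = SpeciesResNet()
y = model(torch.rand(8, 4))   # batch of 8 scaled (rho, T, X_C12, X_Mg24) states
```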
The activation function used in each hidden layer is the hyperbolic tangent, chosen because it is smooth with continuous derivatives. As a side note, the CELU activation function, which is also continuously differentiable, can be used to obtain faster convergence, but we found that it does not necessarily result in better accuracy in practice. Finally, a ReLU activation function is added to the output layer of DNN_1 to satisfy the physical constraint that mass fractions must be non-negative.

Loss Function and Scaling
Since one of our chosen configurations forms a parallel ResNet architecture, each parallel DNN uses a separate loss function: one for the mass fractions and one for the nuclear energy generation. In the case of the single DNN, the total loss function is the sum of the two parallel loss functions. Because these neural networks model systems of ODEs describing physical phenomena, there are many opportunities to incorporate physics constraints into the loss functions. In this particular problem, there are two such constraints we consider when constructing the loss function: 1) the sum of the mass fractions must be conserved in DNN_1, and 2) the nuclear energy generation must be of the same sign in DNN_2. Letting X̃_k and ẽ_nuc be the model-predicted values of the mass fractions and energy generation, and X_k and e_nuc the solution states from the ODE solver, the loss functions for the two neural networks take the form

L_DNN1 = MSE(X̃_k, X_k) + C_X MSE(∑_k X̃_k, ∑_k X_k),
L_DNN2 = MSE(ẽ_nuc, e_nuc) + C_e P_sign(ẽ_nuc, e_nuc),
L_DNNfull = L_DNN1 + L_DNN2,

evaluated over data points i = 1, . . . , N, where C_X and C_e are cost factors for the physical constraints of mass fractions and energy generation, respectively, and P_sign penalizes predicted energy generation whose sign differs from the ground truth. Here, MSE refers to the Mean Square Error between the ground truth solutions and the model-predicted values.
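The two constraint-augmented losses can be sketched as follows; the exact penalty forms and default cost factors are illustrative assumptions, not the paper's exact definitions.

```python
import torch

def loss_dnn1(X_pred, X_true, C_X=1.0):
    """MSE on the mass fractions plus a penalty on violating conservation
    of the summed mass fractions (a sketch; the exact form may differ)."""
    mse = torch.mean((X_pred - X_true) ** 2)
    conservation = torch.mean((X_pred.sum(dim=1) - X_true.sum(dim=1)) ** 2)
    return mse + C_X * conservation

def loss_dnn2(e_pred, e_true, C_e=1.0):
    """MSE on e_nuc plus a penalty wherever the predicted energy
    generation has the wrong sign (negative product of pred and truth)."""
    mse = torch.mean((e_pred - e_true) ** 2)
    sign_violation = torch.mean(torch.relu(-e_pred * e_true))
    return mse + C_e * sign_violation

X_true = torch.tensor([[0.3, 0.2], [0.1, 0.4]])
perfect = loss_dnn1(X_true, X_true)   # a perfect prediction gives zero loss
```

The total loss for the single network is simply `loss_dnn1(...) + loss_dnn2(...)`, mirroring the sum of the two parallel losses.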
In order for neural networks to perform optimally, the inputs and outputs should be scaled such that they are close to O(1). This is easily achievable for density, temperature, and nuclear energy generation, all of which are normalized by their respective maximum values. However, the mass fractions vary widely from O(10⁻³⁰) to O(10⁻¹) (see Figure 5) and hence cannot be normalized in a similar manner. Instead, we employ two potential approaches to alleviate this scaling problem. The first is to convert all mass fractions to their negative inverse log equivalents, which significantly reduces the range of scales. The second is to use a loss function with exponentially increasing weights that is symmetric around a mass fraction of 0.25, which is motivated by mass conservation. The results of these two approaches are discussed further in Section 4.
The first approach to addressing the mass fraction scaling problem is to transform both input and output mass fractions (in DNN_1) to their negative inverse log equivalent, X_k → −1/log(X_k). This reduces the mass fractions to a much narrower range of O(10⁻²) to O(1), as shown in Figure 2. This transformation can be applied to the input mass fractions in DNN_2 as well. The second method is to note that in this particular 3-species network, the sum of X_C12 and X_Mg24 is conserved. Therefore, when either mass fraction approaches 0, the other must approach 0.5. In the previous approach, where the inverse log form of the mass fraction was used, we considered the scaling problem only near 0. However, because the sum of the species is conserved, there is a scaling issue near 0.5 as well. In fact, we need a loss function that penalizes errors near 0 and 0.5 in a symmetric manner while also taking the scaling problem into account. This can be accomplished by using custom weights w(X_k) in the loss function,

L = (1/N) ∑_i ∑_k w(X_k^(i)) (X̃_k^(i) − X_k^(i))²,

where the weight is a function of the mass fraction that is symmetric about X_k = 0.25 and exponentially increases towards 0 and 0.5. More specifically, after some initial testing of the accuracy needed for the mass fractions during real-time simulation, the modified loss function must heavily penalize errors within the mass fraction ranges of (0, 0.1] and [0.4, 0.5). Therefore, the weight function that we selected is similar to a "double" sigmoid function centered at X_k = 0.05 and 0.45, with maximum value 10^p, where p is the order of precision. In other words, p dictates the precision at which the modified loss function starts to loosen penalties for any values less than order O(10⁻ᵖ). Figure 3 shows an example of the weight function with p = 8. During training, we set p = 10, but based on some preliminary testing, any value of p ≥ 8 gives stable results.
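Both scalings can be sketched in a few lines; the sigmoid steepness parameter below is an assumption, since the text fixes only the centers (0.05 and 0.45), the symmetry about 0.25, and the maximum 10^p.

```python
import numpy as np

def neg_inv_log(X):
    """Negative inverse log transform, X -> -1/log(X); compresses mass
    fractions spanning ~30 decades into roughly [0.014, 1.44]."""
    return -1.0 / np.log(X)

def weight(X, p=8, steep=200.0):
    """Double-sigmoid loss weight, symmetric about X = 0.25, rising
    toward 0 and 0.5 with maximum ~10**p. `steep` is an assumed
    steepness parameter not specified in the text."""
    s = lambda z: 1.0 / (1.0 + np.exp(-z))
    return 1.0 + 10.0 ** p * (s(steep * (0.05 - X)) + s(steep * (X - 0.45)))

# The transform maps [1e-30, 0.5] to roughly [0.014, 1.44]:
vals = neg_inv_log(np.array([1e-30, 1e-2, 0.5]))
```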
Using this approach to solve the mass fraction scaling problem, the modified loss function for DNN_1 then becomes

L_DNN1 = (1/N) ∑_i ∑_k w(X_k^(i)) (X̃_k^(i) − X_k^(i))² + C_X MSE(∑_k X̃_k, ∑_k X_k).

Training and Validation
We define initial conditions for our simulation as follows. In each simulation, the computational domain is L_x = 0.625 cm × L_y = 5 cm with 128 × 1024 grid cells, so the grid spacing is ∆x ≈ 49 µm. To initialize the problem, we define the fuel-state density (ρ_fuel = 5 × 10⁷ g/cm³), temperature (T_fuel = 10⁸ K), and composition (X_C12,fuel = X_O16,fuel = 0.5, X_Mg24,fuel = 0) and use the equation of state to compute the corresponding ambient pressure used throughout the domain, p_0. We define the ash-state temperature (T_ash = 3 × 10⁹ K) and composition (X_C12,ash = 0). We define profiles for the temperature and species using a smooth hyperbolic tangent, where α = 8 cm⁻¹ is a parameter that controls the steepness of the transition. Then we use the equation of state to compute ρ and h throughout the domain using the temperature, composition, and ambient pressure. The term ỹ(x) represents a spatial perturbation to the initial flame front. In our first tests, ỹ(x) = 0 so the flame front is on one plane (one-dimensional). The x-boundary conditions are periodic, and the y-boundary conditions are inflow and outflow. The inflow boundary conditions use a prescribed velocity of U = (0, 10⁵) cm/s with the fuel-state condition for the remaining variables. The initial profiles of the species, temperature, and density for the one-dimensional case are shown in the top two panels of Figure 4. Using the MAESTROeX algorithm with VODE, we model 12 µs of evolution over 500 time steps with ∆t = 24 ns, which corresponds to an advective CFL number of ∼0.5. The peak velocity in this simulation is very close to the inflow velocity, 10⁵ cm/s, which corresponds to a Mach number of O(10⁻⁴). The final configuration of the species is shown in the bottom two panels of Figure 4. Over this time, the burning front speed is much smaller than the inflow velocity, so the front travels ∼1 cm, which is roughly 20% of the length of the domain.
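A sketch of the tanh initialization follows; the exact form of the profile equations is not reproduced here, so the centering and orientation of the transition are assumptions for illustration.

```python
import numpy as np

# Fuel/ash states from the text
T_fuel, T_ash = 1.0e8, 3.0e9      # K
XC_fuel, XC_ash = 0.5, 0.0        # 12C mass fraction
alpha = 8.0                       # cm^-1, transition steepness

def profile(y, y_front, f_fuel, f_ash):
    """Smooth tanh blend from the fuel state (small y) to the ash state
    (large y); an assumed form, not the paper's exact profile equation."""
    frac = 0.5 * (1.0 + np.tanh(alpha * (y - y_front)))
    return f_fuel + (f_ash - f_fuel) * frac

y = np.linspace(0.0, 5.0, 1024)   # cm, L_y = 5 cm
T = profile(y, 2.5, T_fuel, T_ash)
XC = profile(y, 2.5, XC_fuel, XC_ash)
```

In the two-dimensional runs, the front location `y_front` would additionally be offset by the perturbation ỹ(x).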
The data we are using to train and validate the DNN models is based on reaction data from this simulation. The inputs and outputs of the ODE solver VODE are saved to plotfiles that are then used to train the neural networks offline. These plotfiles are generated every 10 time steps starting from the time t = 4.8µs to the final time of t = 12µs. Figure 5 shows the profile of the relevant variables at t = 4.8µs.
We found in preliminary testing that simulations using neural networks that are trained using data directly from the MAESTROeX simulation tend to perform poorly for inputs that are slightly perturbed from the training data. This indicates that the model is most likely overfitting since it only sees a small set of the possible solution space of the ODE solver. To improve robustness, stability, and overall performance of the DNNs, we present an approach to diversify the training data by slightly perturbing the fluid state just before using VODE to solve the system. The procedure to perturb the states is as follows.
1. Apply a random perturbation within ±0.005 to X C12 .
2. Subtract the same perturbation from X Mg24 to ensure mass conservation. If any of the resulting mass fractions is negative, set the perturbation to zero (i.e. do not perturb this cell).
4. Repeat steps 1 to 3 in 75% of the cells in the computational domain.

The overall process to generate the training data is shown by the gray arrows in Figure 6. Note that we still run the actual simulation using unperturbed data; the perturbed data is passed in through separate, additional calls to VODE that do not feed back into the main simulation. Also note that we complete training over a representative set of data over all time steps before using the trained model to start a DNN-based simulation. In future work we may explore training "on-the-fly" to further improve the efficiency of the workflow. In the one-dimensional case, the chosen data set is the solution along slices located at x = [0.1, 0.2, 0.3] cm, and in the two-dimensional case, the training data is taken from half of the domain (x ≤ 0.3125 cm). The validation data is a subset of the training data that is not used to update the neural network but is instead used to measure the performance of the neural network during training. The neural network makes predictions on this data to give an indicator of the error relative to the ground truth, but does not backpropagate gradients to optimize for this data. Usually, a different loss function is used for validation; for simplicity, we chose to use the MSE as the validation loss function. We then split the total data set into 10% for validation and 90% for actually training the DNNs.
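The mass-fraction perturbation steps can be sketched as follows; the amplitude (±0.005) and cell fraction (75%) come from the text, while the RNG handling is illustrative.

```python
import numpy as np

def perturb_mass_fractions(XC12, XMg24, amp=0.005, frac=0.75, rng=None):
    """Perturb X_C12 by a random amount in [-amp, amp] in ~`frac` of the
    cells, subtracting the same amount from X_Mg24 so their sum is
    conserved; perturbations that would drive either species negative
    are zeroed (that cell is left unperturbed)."""
    rng = np.random.default_rng() if rng is None else rng
    dX = rng.uniform(-amp, amp, size=XC12.shape)
    dX[rng.random(XC12.shape) > frac] = 0.0        # only ~75% of cells
    bad = (XC12 + dX < 0.0) | (XMg24 - dX < 0.0)   # would go negative
    dX[bad] = 0.0
    return XC12 + dX, XMg24 - dX

XC = np.full(1000, 0.3)
XMg = np.full(1000, 0.2)
XCp, XMgp = perturb_mass_fractions(XC, XMg, rng=np.random.default_rng(0))
```

The perturbed states are then fed through the extra VODE calls to generate additional input/output pairs covering more of the solver's solution space.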
In Figure 7 we show the training and validation losses during the first 500 epochs of training the full DNN model using negative inverse log scaling of the inputs. There are two important things to point out here. Firstly, the validation loss is slightly higher but follows a similar downward trend as the training loss during training. This is because the neural network has not optimized for the validation data, so it will never achieve quite the same loss as on the training data; nevertheless, this gives us confidence that the model is learning. Secondly, the validation loss of this neural network has not diverged from the training loss. If the validation loss were to increase while the training loss continued to decrease, this would be evidence of overfitting. On the other hand, if the validation loss were simply not decreasing in the same manner as the training loss (i.e., the validation loss plateaued at a higher cost), this would be evidence that our model is not complex enough to capture the intricacies of our data. This is a well-known phenomenon related to the bias-variance trade-off. Here, we show our model is simple enough to generalize well in addition to being sophisticated enough to accurately capture the dynamics of the problem.

Testing
The testing phase comes after training and validation of the neural networks. To determine the best scaling approach and loss functions, when running with the trained DNN in place of VODE, we re-run the same flame diffusion simulation of Section 3.4, starting from the configuration at t = 4.8 µs and running to the final time of t = 12 µs. After determining the best training approach, we train a new DNN on a perturbed flame with a V-shaped front, and test this model on three different configurations: the same V-shaped front, a one-dimensional flame, and a sinusoidally-varying flame front. The V-shaped front is defined through the perturbation ỹ(x), which is used in Equations (19) and (20); again, L_x = 0.625 cm is the width of the domain. For our simulations involving a sinusoidally-varying flame front, we use a domain that is twice as wide (with twice as many grid cells in x so that ∆x remains the same) and define the flame front using a sinusoidal form of ỹ(x). The results of these simulations are discussed in detail in Section 4.

Software Implementation
The initial simulations used to generate the plotfiles that contain the training data are run using MAESTROeX (Fan et al. 2019b,a; https://github.com/AMReX-Astro/MAESTROeX, hash 717e4bc), which uses the Starkiller Microphysics libraries (https://github.com/AMReX-Astro/Microphysics, tag 22.07) and the AMReX framework (Zhang et al. 2021). The training and validation of the neural networks are implemented in Python using the PyTorch library (Paszke et al. 2019), and yt (Turk et al. 2011) is used to extract the training data from the plotfiles. The main reason we chose PyTorch over other machine learning libraries is the simplicity of its C++ API, LibTorch, for integration into our existing C++ codes. By using PyTorch's tracing JIT and the LibTorch library, we are able to export the trained neural networks to a model file that can then be imported and used in C++. The integration of LibTorch into the AMReX framework allows the user to take advantage of AMReX's parallelism capability on both host (MPI+OpenMP) and MPI+GPU configurations. We have implemented the use of the LibTorch library in AMReX and, by extension, MAESTROeX in order to test the trained neural networks. The problem set-up is found in the Exec/science/flame_ml/ directory of MAESTROeX.
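The tracing-and-export handoff follows standard PyTorch TorchScript usage; the stand-in model and temporary file path here are placeholders for the trained DNNs and their model files.

```python
import os
import tempfile
import torch

# Any trained nn.Module can be traced; a stand-in two-layer net here.
model = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 3))
model.eval()

example = torch.rand(1, 4)                 # a scaled (rho, T, X_k) input
traced = torch.jit.trace(model, example)   # record the forward graph

path = os.path.join(tempfile.mkdtemp(), "dnn_model.pt")
torch.jit.save(traced, path)               # this file is what the C++ side
                                           # opens via torch::jit::load(path)
reloaded = torch.jit.load(path)
```

On the C++ side, LibTorch's `torch::jit::load` returns a module whose `forward` can be called per cell (or per batch of cells) inside the AMReX reaction kernel.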

RESULTS
As outlined in Section 3.5, we will first analyze the performance of the full DNN and parallel DNNs by training and testing with the same one-dimensional planar problem. Then after determining the best model to use, we will train a new model using data from a flame with a V-shaped front and test it on three different configurations: the same V-shaped front, a one-dimensional flame, and a sinusoidally-varying flame front. For each of the test cases, the final solution at t = 12µs is compared between using VODE and the trained DNN. In addition, plots of the model prediction versus the ground truth solution of the output variables are included to show how well the model performs on the testing data at the end of the training phase. The closer the points are to the y = x line, the closer the prediction is to the ground truth.

Full DNN
The full DNN was trained using data from the one-dimensional planar flame problem for 3000 epochs, and the results are shown in Figures 8 and 9. Of the two scaling strategies discussed in Section 3.3, the negative inverse log transformation of the X_k inputs seems to be the better choice, although both neural networks show similar shortcomings in performance. The first issue is the emergence of oscillations at the front of the flame, which suggests that the solution has become unstable. The oscillations are clearly more pronounced when using the symmetric mass fraction loss function. The second issue is that both full DNN models underestimate the peak nuclear energy generation rate by a noticeable margin: the log transformation of the X_k inputs underestimates the peak by ∼11%, while the symmetric loss function underestimates it by ∼4%.

Figure 10. (left) Final profiles of mass fractions and nuclear energy generation rate at t = 12 µs, and (right) parallel validation results using the negative inverse log form of X_k as input for both DNNs. Note that each point in the profile plot is every 5 data points on the grid in the y-direction.

Parallel DNN
Similar to the full DNN, each parallel DNN was trained using data from the one-dimensional planar flame problem for 3000 epochs, and the results are shown in Figures 10 and 11. From these results, we can easily conclude that the parallel DNNs performed much better than the full DNN. There are no longer any discernible oscillations at the front of the flame, and the profiles of all three neural network outputs (dotted lines) are very similar to the original Strang solution using VODE (solid lines). However, the scaling strategy using the negative inverse log transformation of the X_k inputs underestimated the peak nuclear energy generation rate by ∼6%, while the symmetric mass fraction loss function underestimated the peak by <1%.
The improvement in performance of the parallel DNNs over the full DNN is likely due to the fact that the loss function for the mass fractions can be derived independently from that of the nuclear energy generation. By defining completely separate neural networks for the mass fractions and e_nuc, the resulting parallel neural networks are able to model each output state with higher accuracy. Interestingly, the time it takes to train one epoch of data on the parallel neural network is only 4% shorter than on the full neural network, which is much less of a speedup than expected. This is most likely caused by the uneven distribution of nodes and hidden layers in DNN_1 and DNN_2, with the former containing two-thirds of the total number of hidden nodes and much wider hidden layers than the latter. See Appendix A for a more detailed discussion on the use of the parallel DNN for separating the mass fraction and e_nuc neural networks.

Two-dimensional Flame Problems
Based on the results in Sections 4.1 and 4.2, we chose to train the parallel neural network using the symmetric mass fraction loss function on the two-dimensional flame with a V-shaped front. The DNNs were trained using data from half of the domain (x ≤ 0.3125 cm) for 500 epochs, and the results at the final time are shown in Figure 12. The mass fraction solutions and the location of the flame front using the parallel DNN are comparable to those using VODE. However, the peak H_nuc prediction has an overshoot of approximately 28%. One possible explanation for this subpar performance is that the solution space of the ODE near the peak of the V-shaped flame front makes up only a small portion of the training data set. This is especially true near the end of the simulation, as the peak H_nuc increases over time and would only exist in the last couple of plotfiles used to generate the training data set. Hence, because the neural network does not see many instances of the solution near the peak H_nuc, it is possible that the model does not have enough information to make good predictions there. By contrast, when we tested this trained DNN on the one-dimensional planar problem, the solution at peak H_nuc is almost indistinguishable from the VODE solution, as shown in Figure 13. Again, this can be attributed to the fact that the training data set contains many more instances of solutions close to those encountered in the planar problem, and as a result the neural network is able to produce much more accurate predictions. Given these results, a potential way to improve the performance of the DNN for the V-shaped flame problem is to include more instances of the flame near the peak H_nuc, either by expanding the training data set to include data from two-thirds of the domain (x ≤ 0.4167 cm) or by outputting multiple plotfiles with varying solution perturbations (as described in Section 3.4) near the end of the simulation.
As a final test, we apply the trained parallel DNN to another two-dimensional problem that differs from the training set data: a flame with a sinusoidally-varying flame front. In this problem, the computational domain is 1.25 cm × 5 cm with 256 × 1024 grid cells, so that the grid spacing is the same as in the previous problems. Figure 14 shows that the neural network solutions for both the mass fractions and the nuclear energy generation rate are close to the VODE solutions, with a slight overshoot in peak H_nuc of approximately 7%.

Timing Comparisons
To compare the runtime performance of MAESTROeX when using a neural network instead of the ODE solver VODE for the reaction kernel, we run the three previously defined flame problems (planar, V-shaped, and sinusoidally-varying flame fronts) on four cores (pure MPI) of an AMD EPYC 7702P processor and record the time it takes to complete the reactions subroutine per time step. The left plots in Figure 15 show that in our simulations using VODE, the runtime is dominated by reactions, which take 59.5% of the total runtime, whereas using the DNN reduces the reactions share to 29.4%. In terms of the time spent in the reaction kernel per time step, using the neural network yields speedups of 3.14 to 3.45, depending on the test problem (see right plot in Figure 15).
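The per-step measurement can be sketched with a simple wall-clock harness (illustrative only; the numbers above come from MAESTROeX's own instrumentation, and the two kernels below are hypothetical stand-ins, not the actual VODE and DNN kernels):

```python
import time

def time_reactions(kernel, state, n_steps=50):
    # Average wall-clock time per call of a reaction kernel over n_steps,
    # mimicking "time per time step spent in the reactions subroutine".
    t0 = time.perf_counter()
    for _ in range(n_steps):
        kernel(state)
    return (time.perf_counter() - t0) / n_steps

# Hypothetical stand-ins for an expensive implicit solve and a cheap
# surrogate evaluation:
slow_kernel = lambda s: sum(x * x for x in s)
fast_kernel = lambda s: len(s)

state = list(range(1000))
speedup = time_reactions(slow_kernel, state) / time_reactions(fast_kernel, state)
```

The reported speedup is then simply the ratio of the two per-step averages on the same problem and core count.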

CONCLUSIONS AND FUTURE WORK
We have demonstrated that we can train deep neural networks with a ResNet architecture using data obtained from the ODE solver VODE, provided that the data have been perturbed slightly to represent a larger portion of the possible solution space of the solver. In terms of accuracy and stability during simulation with MAESTROeX, the parallel DNN showed much better performance than the full DNN. In addition, the trained neural network can be used successfully on a test case it has never seen before (i.e., a flame with a sinusoidally-varying front), given that the training data were obtained from a similar test (i.e., a flame with a V-shaped front). We also presented two strategies to mitigate the mass fraction scaling problem. Of the two, the one using the symmetric mass fraction loss function performed slightly better than transforming the mass fraction inputs to their negative inverse log form. One promising area of future work is using a neural network model to compute the reaction solution in the predictor steps of a high-order numerical algorithm such as the Spectral Deferred Corrections (SDC) method. Even in the second-order Strang splitting algorithm that MAESTROeX currently uses, a modest speedup of 1.32 over all reaction steps is obtained if we replace the reaction ODE solve in the predictor step with our trained DNN. We would expect larger speedups in high-order methods that consist of multiple predictor steps. In addition, when we applied this approach to the two-dimensional flame with a V-shaped front using the same parallel DNN that we previously trained in Section 4.3, the final solution is exceedingly close (< 0.2% error) to that of the original Strang splitting algorithm (see Figure 16). Another topic for future work is to develop and train neural networks with the time step as an additional input, which could lead to an implementation of physics-informed neural networks (PINNs).
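To make the splitting structure concrete, a minimal sketch of one second-order Strang step, R(Δt/2) A(Δt) R(Δt/2), is shown below. The operators are toy stand-ins (exponential decay for the stiff reaction and a no-op advection), chosen only so the example is self-contained; in practice the `react` argument is where either the ODE solve or a trained surrogate would be called.

```python
import numpy as np

def strang_step(u, dt, react, advect):
    # One second-order Strang split step: half reaction, full advection,
    # half reaction. Swapping `react` between an ODE solve and a trained
    # DNN surrogate changes only this argument, not the algorithm.
    u = react(u, 0.5 * dt)
    u = advect(u, dt)
    return react(u, 0.5 * dt)

# Toy operators (assumptions for illustration): linear decay du/dt = -k u,
# for which the exact reaction map is u * exp(-k dt), and identity advection.
k = 50.0
ode_react = lambda u, dt: u * np.exp(-k * dt)        # what the ODE solver returns
surrogate_react = lambda u, dt: u * np.exp(-k * dt)  # a trained DNN would go here
advect = lambda u, dt: u

u0 = np.array([1.0, 0.5])
u1 = strang_step(u0, 1e-3, surrogate_react, advect)
```

With identity advection the two half-steps compose exactly, so `u1` equals `u0 * exp(-k * dt)`, which makes the splitting easy to verify in isolation.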
However, in the case of stiff ODEs like those describing the nuclear reaction networks used in MAESTROeX, we would most likely need to consider a relaxation constraint instead of directly computing the gradients of the ODE (Ji et al. 2021). Finally, we could look to model more complex nuclear reaction networks with 13 or 19 isotopes in the future. Modeling more complex networks with neural networks should result in larger speedups on both CPUs and GPUs.

Facilities: NERSC

Software: Refer to Section 3.6.