Data efficiency and extrapolation trends in neural network interatomic potentials

Recently, key architectural advances have been proposed for neural network interatomic potentials (NNIPs), such as incorporating message-passing networks, equivariance, or many-body expansion terms. Although modern NNIP models exhibit small differences in test accuracy, this metric is still considered the main target when developing new NNIP architectures. In this work, we show how architectural and optimization choices influence the generalization of NNIPs, revealing trends in molecular dynamics (MD) stability, data efficiency, and loss landscapes. Using the 3BPA dataset, we uncover trends in NNIP errors and robustness to noise, showing these metrics are insufficient to predict MD stability in the high-accuracy regime. With a large-scale study on NequIP, MACE, and their optimizers, we show that our metric of loss entropy predicts out-of-distribution error and data efficiency despite being computed only on the training set. This work provides a deep learning justification for probing extrapolation and can inform the development of next-generation NNIPs.

Despite their successes, NNIPs still struggle with data efficiency and robust generalization. Over the last few years, several model architectures were proposed to reduce errors in PES fitting, decrease the amount of data required to train the models, and improve predictions for configurations beyond the training domain. In particular, NN architectures incorporating physics concepts such as directional representations and equivariance [19][20][21][22][23] or many-body interactions [24][25][26] have gained popularity due to their higher accuracy and data efficiency. Nevertheless, recent works show that accuracy metrics over datasets are insufficient to quantify the models' quality in production simulations and motivate the use of alternative metrics such as computational speed or simulation stability [27][28][29][30][31][32]. Different NNIP models may have similar test accuracy, but completely different extrapolation ability [33,34]. This begs the question: which metrics can distinguish between NNIPs with similar test error but different extrapolation behavior?

Background

NNIPs: NN-based force fields were first proposed using feedforward NNs and symmetry-based representations [14]. In these systems, improvements are achieved by designing new representations that better capture the atomic environment [35][36][37][38][39][40][41]. More recently, message-passing neural networks (MPNNs) [42] showed a remarkable ability to fit PESes using learned representations. In this area, most works compare models according to their accuracy on standardized datasets, such as QM9 [43,44], MD17 [8], and others. MPNN-based potentials often vary according to their interaction blocks, handling of symmetry operations, and general architectural choices.
NN representation capacity and generalization: although NNs have a large number of parameters, obtaining NN-based models with low generalization error is not uncommon. Preventing overfitting may require regularization techniques such as changing the loss function, e.g., with weight regularization or decay, augmenting the dataset, or using training protocols such as adaptive learning rates, dropout, and more [45][46][47]. However, these are not requirements for controlling the generalization error [48]. For example, NNs with good generalization capacity have been shown to overfit to random labels and input data [48]. This suggests that, given enough parameters, even architecturally regularized NNs exhibit wide representation capacity for arbitrary datasets. In some cases, however, training on noisy labels can only be overcome by adapting architectures, applying regularization techniques, and correcting the loss function [45].
Loss landscapes (LLs): the shape of the LL is strongly correlated with the trainability of NN architectures. The minimization of the empirical risk is easier when the (non-convex) LL is smoother, as it yields more predictive gradients [49,50]. Furthermore, LLs with several local minima exhibit lower trainability than their smoother counterparts [47,51]. Some works proposed that flatter LLs are also related to lower NN generalization error [51][52][53]. Although this relationship is often complicated by factors such as batch size [46], normalization [54], or weight decay, the sharpness of the LL is the most predictive of generalization error in NNs [55]. Whereas this sharpness/flatness can be quantified using the Hessian of the loss [46,47] or assumptions of a prior on the weights [56], visualizations of the LL have proven useful to illustrate these minima with respect to the weight space [57][58][59][60] without the cost associated with calculating the full Hessian. Analogies with potential energy landscapes have also been drawn to explore the cost functions of other machine learning methods and explain the shape of these optimization landscapes [61,62].

Visualizing loss landscapes
The loss landscape of a neural network can be plotted by evaluating the loss function L along a trajectory between two parameter sets θ and θ′. The simplest approach is to linearly interpolate the weights [57], choosing a scalar t ∈ [0, 1] such that θ(t) = (1 − t)θ + tθ′. Then, the loss landscape for a model becomes L(t) = L(θ(t)) = L((1 − t)θ + tθ′). In this work, the loss L is evaluated on the training set of the model with parameters θ.
In the absence of the reference weights θ′, the LL can be constructed by sampling a random vector δ in the parameter space and plotting the LL around θ as L(t) = L(θ + tδ), where the domain of t is appropriately chosen to span a neighborhood of θ. This notion can be extended to 2D LLs by taking two orthogonal vectors δ1, δ2 such that L(t1, t2) = L(θ + t1δ1 + t2δ2), with scalars t1, t2 chosen to span a (two-dimensional) neighborhood of θ. These approaches have been used to study the LLs of NN classifiers on image datasets, interpolate between sets of classifiers, and explore the loss function around degenerate minima [46,[63][64][65].
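The 1D sampling procedure above takes only a few lines. The sketch below uses a linear least-squares loss as a stand-in for an NNIP loss; the toy data, model, and direction are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta, X, y):
    """Mean-squared error of a linear model -- a stand-in for the NNIP loss."""
    return float(np.mean((X @ theta - y) ** 2))

# Toy data generated from an "optimized" parameter vector theta_star.
X = rng.normal(size=(200, 10))
theta_star = rng.normal(size=10)
y = X @ theta_star

# Sample the 1D loss landscape L(t) = L(theta + t * delta) along a random direction.
delta = rng.normal(size=theta_star.shape)
delta /= np.linalg.norm(delta)          # unit-norm direction
ts = np.linspace(-1.0, 1.0, 51)
landscape = np.array([loss(theta_star + t * delta, X, y) for t in ts])

# The loss is minimal at t = 0, where the parameters are optimal.
assert np.argmin(landscape) == len(ts) // 2
```

The 2D variant simply adds a second orthogonal direction and evaluates the loss on a grid of (t1, t2) pairs.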
One challenge when analyzing LLs is comparing different models according to their parameters. Because activation functions such as ReLU allow for scale-invariance of NN weights, especially when coupled with batch normalization techniques, the magnitude of the vector δ is not transferable from model to model. This prevents a fair comparison of LLs, curvatures, and sharpness metrics. To account for this, we use the filter normalization technique proposed by Li et al. [58]. Therein, each random vector δ is normalized by the scale of each filter i in each layer j of the NN, i.e., δ_{i,j} ← (δ_{i,j} / ‖δ_{i,j}‖) ‖θ_{i,j}‖, where ‖·‖ is the Frobenius norm. Then, the LL is plotted according to the filter-normalized vector δ, and analogously for 2D LLs.
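Filter normalization can be sketched as follows; the two-layer weight dictionary is a hypothetical stand-in for an NNIP's parameters, with the leading axis of each array treated as the filter index.

```python
import numpy as np

rng = np.random.default_rng(1)

def filter_normalize(delta, theta):
    """Rescale each filter of the random direction delta so its Frobenius norm
    matches that of the corresponding filter in the trained weights theta
    (the filter normalization of Li et al.)."""
    normalized = {}
    for name, d in delta.items():
        w = theta[name]
        d_out = np.empty_like(d)
        for i in range(d.shape[0]):       # normalize filter-by-filter
            d_out[i] = d[i] / (np.linalg.norm(d[i]) + 1e-12) * np.linalg.norm(w[i])
        normalized[name] = d_out
    return normalized

# Hypothetical two-layer weight dict standing in for a model's parameters.
theta = {"layer1": rng.normal(size=(4, 8)), "layer2": rng.normal(size=(2, 4))}
delta = {k: rng.normal(size=v.shape) for k, v in theta.items()}
delta_fn = filter_normalize(delta, theta)

# After normalization, every filter of delta has the same norm as theta's.
for k in theta:
    for i in range(theta[k].shape[0]):
        assert np.isclose(np.linalg.norm(delta_fn[k][i]), np.linalg.norm(theta[k][i]))
```

This makes perturbations of magnitude t comparable across models whose weights live at different scales.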
Although informative, sampling LLs can have a cost comparable to training the NN, since evaluating the loss for each interpolated weight in each direction is equivalent to one training epoch. Depending on the number of parameters and directions δn under analysis, the loss may be evaluated over the entire training dataset a large number of times.

Quantitative comparison of loss landscapes
Despite the usefulness of visualizing loss landscapes, comparing them beyond qualitative insights requires metrics to differentiate them. The most commonly used metric in the field is the curvature of the LL [55], which is related to the magnitude of the eigenvalues of the loss Hessian. Although informative, the Hessian is a local property and cannot fully capture "valley-like" degeneracies in loss landscapes. Furthermore, computing the full Hessian for a system with millions of training parameters would be intractable. Thus, alternative metrics to compare LLs are required.
In the case of NNIPs, which are often trained to both energies and forces, two LLs are obtained per model, one for each target value being trained to. Although they cannot be completely disentangled, both LLs should be compared simultaneously to assess model performance. To derive one such metric, we propose the use of "loss entropy" [51,66] to quantify loss flatness around optimized minima. Although variations of this quantity have been proposed, we use the formula S = log ∫ dt exp(−ℓ̄(t)/kT), where S is the entropy of the loss landscape ℓ̄(t), computed with respect to energy or forces and averaged over N orthogonal, filter-normalized weight displacements, and kT is a weighting parameter that quantifies the "flatness" of the LL with respect to a certain acceptable threshold of training loss. This is similar to the distribution of microstates accessible at a given "temperature," and ensures that low-loss states contribute a much larger entropy than high-loss states.
As the entropy of the LL of an NNIP takes into account both energy and force losses, we adopt a strategy similar to the one used during training of the NNs by balancing energy and force losses with a weighted sum, S = αS_E + (1 − α)S_F, where α is a dimensionless parameter between 0 and 1 that weights the entropy of the energy loss (S_E) and force loss (S_F). As these losses have different units, they have to be computed with different normalization parameters k_E T_E and k_F T_F, each of which takes into account the "thermal randomness" of the training loss. For simplicity, in this work, we adopt k = 1 as a dimensionless parameter, and assign the adequate units to the "thermal error" for better interpretability of the loss entropy.
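A minimal numpy sketch of this loss-entropy estimate on synthetic 1D landscapes is shown below; the grid, kT value, and quadratic toy losses are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def loss_entropy(ts, losses, kT):
    """S = log ∫ dt exp(-l(t)/kT), with l(t) the loss averaged over N sampled
    directions. Flat, low-loss landscapes integrate to larger values and hence
    yield a larger entropy."""
    l_bar = losses.mean(axis=0)                  # average over the N directions
    dt = ts[1] - ts[0]
    return float(np.log(np.sum(np.exp(-l_bar / kT)) * dt))

ts = np.linspace(-1.0, 1.0, 201)
flat = np.stack([0.01 * ts**2, 0.02 * ts**2])    # wide minima (N = 2 directions)
sharp = np.stack([10.0 * ts**2, 20.0 * ts**2])   # narrow minima

S_E = loss_entropy(ts, flat, kT=0.04)            # energy-like landscape
S_F = loss_entropy(ts, sharp, kT=0.04)           # force-like landscape
assert S_E > S_F                                 # flatter landscape -> higher entropy

# Balance the two entropies as in the weighted sum S = alpha*S_E + (1 - alpha)*S_F.
alpha = 0.2
S_total = alpha * S_E + (1 - alpha) * S_F
```

In practice the energy and force landscapes would be evaluated with their own kT normalizations, since the two losses carry different units.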

NNIPs and dataset
NequIP [23] is an equivariant NNIP that uses Clebsch-Gordan transformations and spherical harmonics to incorporate equivariance into the model. NequIP demonstrates state-of-the-art accuracy on several datasets, high data efficiency, and has been employed to simulate a variety of organic and inorganic systems.
MACE [25] is an equivariant NNIP that uses higher-order messages to efficiently embed information beyond two-body interactions into traditional MPNNs. The model demonstrates state-of-the-art performance on a variety of benchmarks, faster learning, and competitive computational cost.
Training details are provided in Appendix A.
The 3BPA dataset [28] under study in this work was chosen due to its previous use in the literature for benchmarking the extrapolation behavior of NNIP models [24][25][26]. The benchmark involves training models on low-temperature samples and evaluating their performance on held-out samples from high-temperature simulations. Distributions of energies and forces of the 3BPA dataset are shown in Appendix D.1, Fig. S2. Additional results using the ANI-Al dataset [74] can be found in Appendix D.2.

Robustness to noise and extrapolation trends in NNIPs
When comparing NNIPs, metrics of interest typically include errors in predicting forces and energies on a test dataset, which are appropriately used as baselines for assessing model quality. However, accuracy metrics are not necessarily predictive of extrapolation power [29]. A first hypothesis to consider when analyzing NNIP extrapolation is whether NNIPs can learn a PES despite being trained on noisy data. Although this test does not measure extrapolation to out-of-distribution data, it verifies whether models are expected to overfit to corrupted training data, which would lower their robust generalization power. For example, NN-based classifiers can overfit to random labels in image datasets or to completely random inputs [48], even in architectures with good generalization error that are designed to prevent overfitting. Most NNIPs have enough parameters to memorize the training data, but standard regularization and architectural choices can curb overfitting in NNIPs, leading to lower generalization errors.
To test this hypothesis, we trained four different NNIPs on the 3BPA dataset and analyzed their training error. Following the lead of the deep learning literature [48], we then gradually corrupted the labels of the training set by adding a random sample from N(0, σ · σ_DFT) to the true forces, where σ_DFT is the standard deviation of the forces predicted by density functional theory (DFT), and σ is a scalar ranging from 0.0 to 0.1 (see Appendix D.1). In principle, NN regressors with arbitrary levels of expressivity (or absent regularization) could achieve low training error even on these noisy PESes. Figure 1a shows the error for NNIPs trained on the 3BPA dataset with corrupted forces, and tested on the original, uncorrupted data. When the forces are not corrupted, models exhibit reasonable training errors lower than 40 meV/Å, as expected from their nominal performance [23,25]. However, even small amounts of noise in the forces prevent the noisy dataset from being memorized with high accuracy, with the training loss plateauing instead of tending to zero. The ability of the models to predict the noisy forces saturates at the limit of the noise, indicating that these NNIPs do not memorize the high-frequency labels. On the other hand, when the test error of the models trained with corrupted labels is computed with respect to the uncorrupted dataset, the error is substantially smaller than the noise baseline (see the "original" panel of Fig. 1a). Thus, the NNIPs under analysis are able to learn the underlying PES in the 3BPA dataset despite the added noise.
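The label-corruption procedure can be reproduced with a few lines of numpy; the synthetic force array below is a hypothetical stand-in for the DFT labels of the dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical DFT force labels with shape (n_frames, n_atoms, 3).
forces_dft = rng.normal(scale=1.2, size=(500, 27, 3))
sigma_dft = forces_dft.std()

def corrupt_forces(forces, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma * sigma_DFT
    to every force component, mimicking the label-corruption experiment."""
    noise_scale = sigma * forces.std()
    return forces + rng.normal(scale=noise_scale, size=forces.shape)

noisy = corrupt_forces(forces_dft, sigma=0.1, rng=rng)

# The RMSE between noisy and clean labels sits near the injected noise level,
# which is the dashed baseline drawn in Fig. 1a.
rmse = np.sqrt(np.mean((noisy - forces_dft) ** 2))
assert abs(rmse - 0.1 * sigma_dft) / (0.1 * sigma_dft) < 0.05
```

A model whose error on the clean labels falls below this baseline has, to that extent, "denoised" the training data.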
Contrary to the overfitting hypothesis, these results suggest that data redundancy in the training set may help NNIPs to "denoise" the data. To illustrate this effect, we show in Appendix C how data redundancy downplays the effect of external noise in a toy system. Fig. S1 shows how a large number of training points can counterbalance the effect of external noise when predicting the original, non-noisy data in the case of linear regression models. In this case, the model averages out the noise and recovers the true function even at high levels of added noise. On the other hand, in the low-data regime, the regression model is unable to recover the true function, and its error quickly grows. Although the results from the linear model may not directly translate to the case of NNs, Fig. 1a shows that, to an extent, NNIPs are able to "denoise" energies from the dataset due to architectural, training, or implicit data regularization.

Figure 1: a, Force RMSE values (eV/Å) for models trained to only the forces of increasingly noisy versions of the 3BPA dataset. The models trained on noisy data are then evaluated on their noisy training set ("noisy" panel) or on the original, uncorrupted dataset ("original"). The dashed lines correspond to the amount of noise that was added to the DFT forces in units of eV/Å. The legend denotes the maximum body order of the MACE models (v), the number of interaction layers in the NequIP models (n), or the maximum order of the model irreps (L). All MACE models in a used only 2 interaction layers. b, Relationship between extrapolation ability, quantified using the slope of the test errors on the 3BPA dataset with increasing temperature ("extrapolation slope"), and the test error at 300 K. Slopes were computed by performing a linear fit to values taken from the literature for MACE [25], BOTNet [24], and others [26] (see Fig. S3). The dashed line is a visual guide. c, Forces RMSE on the 300 K test set of several NequIP and MACE models with different architectures, but similar testing errors on the 300 K split. (top) Correlation between test errors on all splits (300, 600, and 1200 K) and test error on the 300 K split for models with similar test errors. (bottom) Average time-to-failure of MD simulations for each of the models with similar test errors.
To verify whether this behavior was specific to the 3BPA dataset or could generalize to other datasets, we corrupted the energies or forces of the ANI-Al dataset and trained different models on the noisy values (see Appendix D.2), obtaining similar results even when only energies were considered. As the ANI-Al example exhibited even better results than 3BPA, we tested whether a drastic increase in the injected noise could still be denoised by the NNIPs under study. Following the results of Fig. S1, we trained the two NNIP architectures under study on PESes with energy noise up to twenty times higher than in Fig. 1 (see Fig. S12), and up to twice the standard deviation of the original energy distribution of the ANI-Al dataset. Although the distribution of per-atom energies shows that the noisy PES is completely different from the original one (Fig. S12b,c), all models succeeded in modeling the underlying PES below the error baseline (Fig. S12). As also illustrated by the toy example, the performance of the models degrades as extremely large amounts of noise are added. Nonetheless, errors with respect to the non-noisy dataset are remarkably low considering the corruption baselines.
The toy example from Appendix C can be considered an upper bound of the "dataset denoising" ability, given that a functional form of the inputs is known and the model could, in principle, fit the data perfectly. As the trends of NNIPs trained on the 3BPA or ANI-Al potentials approach this behavior, it can be concluded that the two NNIP architectures are able to "denoise" the external noise added to the datasets, possibly due to data redundancy. As generalization tests assume the model is being tested on unseen data, it is not clear whether the accuracy reflects the quality of the model predictions or simply their ability to reproduce local environments existing in the training data. This is particularly important in the high-data regime, when the test dataset may be correlated with the training dataset in non-obvious ways.
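The data-redundancy argument can be illustrated with a linear-regression toy model; the coefficients and dataset sizes below are made up for illustration and do not reproduce the appendix's exact setup.

```python
import numpy as np

rng = np.random.default_rng(7)

def denoising_error(n_train, noise_std, rng):
    """Fit a linear regressor to noisy labels and measure its error against the
    clean underlying function -- a toy analogue of the redundancy argument."""
    w_true = np.array([2.0, -1.0, 0.5])
    X = rng.normal(size=(n_train, 3))
    y_noisy = X @ w_true + rng.normal(scale=noise_std, size=n_train)
    w_fit, *_ = np.linalg.lstsq(X, y_noisy, rcond=None)
    X_test = rng.normal(size=(2000, 3))
    return np.sqrt(np.mean((X_test @ w_fit - X_test @ w_true) ** 2))

# With abundant data the regressor averages the noise out; with scarce data it cannot.
err_big = denoising_error(n_train=5000, noise_std=1.0, rng=rng)
err_small = denoising_error(n_train=10, noise_std=1.0, rng=rng)
assert err_big < err_small
```

Because the model family contains the true function, this toy bounds from above how well any regressor could recover the clean PES from corrupted labels.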
To bypass this problem, alternative strategies have been proposed to measure generalization power, including separating train-test splits according to sampling temperature, as is the case for the 3BPA dataset [28]. While testing a model on held-out samples from high-temperature simulations is typically considered an independent evaluation of its performance, we have found that extrapolation errors are correlated with low-temperature test errors across various model architectures. This is shown by fitting a linear model to the errors on the 3BPA testing sets at 300, 600, and 1200 K from the literature [24][25][26] (see Fig. S3), then using the slope of the fitted line as the associated metric. Fig. 1b shows that all the models trained only on 3BPA frames at 300 K follow an approximate linear scaling relation between the extrapolation slope and the log of the low-temperature errors. This correlation between the low- and high-temperature data does not strictly preclude the use of the 3BPA dataset for assessing a fitted model's extrapolation abilities, as the extrapolation slope represents how much the accuracy of a given model degrades as the sampling temperature increases.
Nevertheless, it suggests that data and model regularization effects may be enforcing extrapolation trends over wide error ranges. For example, a model like ACE [24,41] is known to provide functional forms that aid extrapolation beyond the training data [28]. Despite this, the ACE model for 3BPA follows the same trends as more complex message-passing NNs. In fact, the generalization slopes are correlated with their low-temperature error regardless of significant architectural differences. This indicates that, for this benchmark, the root mean squared errors (RMSEs) at higher temperatures can be estimated from the test error at 300 K even without evaluating the model on the other test sets.
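Computing the extrapolation slope from per-split errors is a one-line fit; the RMSE values below are hypothetical, not taken from any model in the study.

```python
import numpy as np

# Hypothetical force RMSEs (meV/A) of one model on the three 3BPA test splits.
temperatures = np.array([300.0, 600.0, 1200.0])
rmse = np.array([25.0, 41.0, 73.0])

# The "extrapolation slope" is the slope of a linear fit of error vs. temperature;
# it quantifies how quickly accuracy degrades as the sampling temperature rises.
slope, intercept = np.polyfit(temperatures, rmse, deg=1)

# Errors grow with temperature, so the slope is positive.
assert slope > 0
```

Comparing these slopes against the (log of the) 300 K error produces the scaling relation of Fig. 1b.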
The exceptions to this rule are the two ANI models ("ANI-2x" and "ANI-pretrained"), which were pre-trained on the 8.9 million configurations of the ANI-2x dataset [75]. As seen in other fields of deep learning [76], the pre-trained models extrapolate better than all other models, though fine-tuning on 3BPA ("ANI-pretrained") leads to a slightly worse extrapolation slope. These results suggest that: (1) more diverse datasets may be required for assessing the extrapolation and generalization capacity of a model; and (2) pre-training on large datasets may be required to create universal NNIPs [77], given that pre-trained ANI models were able to escape the scaling relation seen in Fig. 1b.
Although the scaling relation can estimate the extrapolation slope within reasonable bounds, it is unable to recover trends within the same architecture in the low-error regime. As best-performing models often differ in force errors by less than 5 meV/Å on the 300 K test set, it is not clear whether their extrapolation behavior can be accurately recovered from the scaling relation. Indeed, Fig. 1c shows the correlation between the forces RMSE computed on the 300 K split and on all splits (300, 600, and 1200 K, "high T") for several NequIP and MACE models with different hyperparameters and training methods (see Tables S3 and S6), but similar testing errors on the 300 K split. Although there is a positive correlation between the two RMSE metrics, the dispersion of points indicates that models with nearly the same RMSE on the 300 K data split show discrepancies in error above 10 meV/Å when all splits are taken into account.
This scenario is aggravated when molecular dynamics (MD) simulations are used to test the extrapolation power of the NNIPs. When measuring the average simulation times for each of the models (see Appendix B for details on the MD simulations), no clear correlation is obtained between the error on the 300 K split of 3BPA and the average (physically meaningful) simulation length (Fig. 1c). This motivates the creation of a metric that captures robustness trends in NNIPs.

Training insights derived from loss landscapes
Beyond data regularization, architectural and training choices strongly influence the generalization ability of NN models. Relevant aspects include model initialization, hyperparameter optimization, choice of optimizer, batch size, and many others. Although the process of training NNs strongly affects their extrapolation behavior, good NN models systematically outperform their counterparts by optimizing to better local minima in the LL [54]. Thus, using these insights, we propose that loss landscapes of NNIP models can predict their generalization ability towards unseen data despite using only the training data. Correlations between robust generalization and loss sharpness have been observed for other NN models in the literature [51,55], but not yet explored in the context of NNIPs.
To first verify whether qualitative insights could be derived from LLs in the context of NNIPs, we investigated the behavior of the loss function around the optimized minima of the NNIP models trained on the non-noisy 3BPA dataset. To ensure the LL visualizations were statistically meaningful, we sampled 20 different orthogonal directions for each set of parameters and models, and interpolated them using the filter-normalized method described in Sec. 3.1 (see also Fig. 2a). Then, we compared the LLs of the NNIP models using 2D visualizations, as is often done for NN classifiers [58].
Qualitative inspection of the 2D LLs in Fig. 2b reveals the presence of weight degeneracies in the models' energy predictions (see also Fig. S14 for another example on the ANI-Al dataset). This "valley-like" landscape represents a subspace of weights leading to similar accuracy in energy [78], and reflects the interplay between energy and forces during training. These results agree with the literature regarding LLs of over-parameterized models [79], as well as with the notion that physical systems often result in so-called "sloppy" models [80,81], which can improve trainability and interpolation [82].
Qualitative analysis of LLs also explains other factors typically found as heuristics of NNIP training. For example, the energy/forces coefficients in the final loss are often defined from hyperparameter optimization [23,83] and have ratios varying from 1:10 to 1:100 for energy:forces RMSE. The success of these higher ratios can be justified from the perspective of the LL. In Fig. 2b, we show how a higher weight on force losses leads to LLs with fewer saddle points around the optimized minima, thus favoring training. Although energy and forces are related and completely disentangling their effects is not achievable, the interpolation in Fig. 2b shows how mixing both can lead to better optimization landscapes for NNIPs. Based on these results, an effective training regimen would start with a relatively large weight on the force loss for a fast (and smoother) optimization of forces. Once the force errors are reasonably converged, the weight can be decreased until a desired threshold in energy errors is achieved. Similar strategies with weight cycling and scheduling, e.g., starting with an energy:force loss weighting of 1:10 but then switching to 1000:1 in the later stages of training, can also be effective.
One limitation of visualizations of the high-dimensional LL is that different directions of weight perturbation may lead to similar losses. This results in only slight variations in landscape topography across different random directions, as noted in Fig. 2. This may be related to the dominance of specific layers of the model in the filter normalization technique. Figure S13 shows how the distribution of weights is non-uniform, suggesting some layers have higher sensitivity to weight perturbation than others. As filter normalization displaces the model weights along random directions with magnitude proportional to the norm of each filter and layer, parameters with higher weights may dominate certain regions of the loss landscapes. An exception to this point is when the parameters are intertwined with functions embedded in the architecture, such as the trainable Bessel functions in NequIP. As shown in Fig. S6, freezing certain high-magnitude weights when generating the LLs can help flatten the landscape and remove spurious minima, emphasizing the importance of a proper regularization and training regimen that takes these effects into account (e.g., separate learning rates for certain layers of the model).

Loss landscapes predict extrapolation trends in NequIP
In the context of NNIPs, robust generalization has important ramifications in two distinct areas: the stability of the model in production (e.g., MD simulations), and its data efficiency during training. To probe the first of these features, we trained NequIP models using various choices of model architecture and optimization techniques (see Table S1) on the low-temperature split of the 3BPA dataset. Then, we performed MD simulations in the NVT ensemble at a high temperature of 1600 K (Appendix B) to ensure all models were extrapolating well beyond their training split. Furthermore, as the 3BPA benchmark already provides geometries sampled from simulations up to 1200 K, the MD trajectories at 1600 K could provide information beyond the 1200 K splits provided by the dataset. For each model in the study, 30 MD trajectories were simulated to obtain reasonable statistics on the model behavior, with simulation lengths of up to 6 ps and a timestep of 1 fs. In principle, a model's ability to extrapolate beyond the training set can be assessed by the number of trajectories that behave in a physical way [34], as well as the average trajectory length until the model fails to extrapolate (Appendix B).
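The time-to-failure statistic can be sketched as below; the bond-length cutoff is one simple stability criterion chosen for illustration, and the paper's exact failure check (Appendix B) may differ.

```python
import numpy as np

def time_to_failure(max_bond_lengths, threshold=2.0, dt_ps=0.001):
    """Return the simulated time (ps) until the first unphysical frame, flagged
    here by any bond stretching past a cutoff (an assumed criterion)."""
    bad = np.nonzero(max_bond_lengths > threshold)[0]
    n_ok = bad[0] if bad.size else len(max_bond_lengths)
    return n_ok * dt_ps

# Hypothetical 6 ps trajectory (1 fs timestep): bonds stay near 1.5 A,
# then the simulation blows up at frame 4000.
traj = np.full(6000, 1.5)
traj[4000:] = 3.0
assert abs(time_to_failure(traj) - 4.0) < 1e-9

# Averaging this quantity over the 30 trajectories scores a model's MD stability.
mean_ttf = np.mean([time_to_failure(traj) for _ in range(30)])
```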
Using the results from the MD simulations, we investigated whether LLs can predict the extrapolation behavior and trajectory stability of models. While the test performance on low-temperature samples is unable to provide such predictions (Sec. 4.1), the test error on high-temperature samples is still expected to be a reasonable predictor of such stability. Nevertheless, whereas high-temperature samples are available in the 3BPA dataset, out-of-domain data is often unavailable for an arbitrary material system. In addition to being expensive to generate using ground-truth methods, sampling new configurations is rarely performed in an exhaustive way. Thus, the major advantage of extrapolation tests based on LLs is their reliance only on the training dataset. To quantify the sharpness of the LLs [55], we compute the loss entropy as described in Sec. 3.2, with T_E and T_F taken to be reasonable "room temperature" values of the energy/force RMSEs, respectively (T_E = 4 meV/atom, T_F = 40 meV/Å). The weight α was set to 0.2 to resemble the higher force weighting used during training without ignoring the contributions of energy LLs to model stability.
Nevertheless, a sensitivity analysis of S with respect to these parameters showed that results remained consistent around these error ranges, as long as unreasonable temperature values were not used (see Fig. S4).
The relationships between the model/training parameters, the MD stability, and the LL entropy are shown in Fig. 3 (see also Table S2). The analysis is grouped into four experiments (columns in Fig. 3) to isolate the effects of specific portions of the training procedure and model architecture, thus uncovering useful trends for NNIP practitioners. For example, considering only the distributions of time to failure from MD simulations in Fig. 3b, it can be seen that rescaling the energies predicted by the model ("no rescaling" vs. "rescaling" models) greatly improves the stability of the model, especially when the Bessel weights used by the radial basis are trainable. This behavior is reflected in the energy and force LLs (Fig. 3a), with the LL of the model without rescaling showing considerable sharpness compared to the models using rescaled energies. Quantitative trends are also obtained when the loss entropy and test RMSE on all splits of the 3BPA dataset are computed (Fig. 3c, first column, and Table S3). Models with rescaling showed higher average simulation time with physical trajectories, as well as increased entropy and lower forces RMSE.
When comparing how the training regime can change the extrapolation behavior of NequIP, the results show that the AMSGrad variant of the Adam optimizer [84] leads to consistent improvements in model extrapolation for both the 2-layer and the 5-layer models (Fig. 3b, second and third columns) compared to the baseline, which does not use AMSGrad. The improvements in simulation quality are particularly pronounced in the 5-layer models, where the model trained using AMSGrad shows significant improvements in simulation stability compared to the baseline despite not using trainable Bessel functions (Table S1). On the other hand, the exponential moving average (EMA) appears to degrade the extrapolation performance of the NNIPs, often leading to sharper LLs if used in isolation and with constant learning rates. These trends are reflected in the extrapolation metrics (Fig. 3c). Although the loss entropy underperforms compared to the forces RMSE in the case of the 2-layer models, the trend is correctly captured for the 5-layer model, where the model trained with AMSGrad showed both higher loss entropy and higher MD stability.
Finally, NequIP models can be compared according to the number of message-passing layers. Using a fixed training regime, a higher number of layers showed pronounced improvements both in the stable simulation time (Fig. 3b) and in the forces RMSE and loss entropy (Fig. 3c), with the trend remaining consistent across these models. Interestingly, the improvement in loss entropy for the 5-layer model is reflected mostly in the energy LL rather than the force LL (Fig. 3a). Although the energy is not required when integrating the equations of motion in the NVT ensemble during an MD simulation, this improvement in the LL may reflect the model's higher generalization capacity, and thus wider minima. This phenomenon can be explained by recognizing that out-of-domain data often results in a horizontally shifted version of the loss landscape [46]. To probe this effect, we computed the LL of NequIP models with respect to the test set instead of the training set (see Fig. S5). As expected, the test LL shows higher errors compared to the train LL, as seen in the vertical shift of the curves. However, while the force LLs only undergo a vertical shift, the test energy LLs also undergo a horizontal shift, indicating that the optimal model for the training set is not necessarily optimal for predicting energies of the testing set. Furthermore, as the force loss is often derived from the energy in NNIPs, the propagation of these mismatches may be responsible for degradation in extrapolation performance. Thus, in general, the results in Fig. 3 show that architectures and optimization strategies which result in loss landscapes with higher entropy (i.e., flatter landscapes) tend to demonstrate improved stability in MD simulations that sample out-of-domain configurations.

Loss landscapes predict extrapolation trends in MACE
To confirm that the results from Sec. 4.3 extend beyond NequIP, we performed a similar study using the MACE framework. Given the differences in architecture and available code, MACE has different hyperparameters and optimizer choices than NequIP (see Table S4). Nevertheless, the MACE study reproduces the trends seen for NequIP, as shown in Fig. 4.
As seen in Sec. 4.3, model stability can be greatly improved by using rescaling and AMSGrad. For MACE as well, EMA appears to lower the time-to-failure when used in isolation, with trends in out-of-domain RMSE, loss entropy, and MD stability similar to those seen for NequIP (Fig. 4). In addition, we explored the effects of stochastic weight averaging (SWA) [85] and weight cycling (WC) (column 2 of Fig. 4) implemented in the MACE code. Consistent with results from the use of SWA in image classification tasks [85], SWA + WC is shown to improve model generalizability, leading to longer simulation times and flatter loss landscapes. Interestingly, the model trained with SWA + WC exhibits a force RMSE on high-temperature samples similar to the baseline, but higher loss entropy and higher MD stability (Fig. 4c). As SWA flattens the loss landscape by design [85], implementing this strategy in other NNIP models may be a low-cost modification for improving the generalization capacity of different architectures. The use of WC can also aid the optimization process (Sec. 4.2) and lead to better energy landscapes overall.
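The parameter averaging behind SWA is simple to sketch. The snippet below is a minimal numpy illustration of the idea only, not the MACE or PyTorch implementation: weight snapshots collected along the tail of training are averaged, which tends to place the averaged weights in a flatter, more central region of the minimum.

```python
import numpy as np

def swa_average(snapshots):
    """Average parameter vectors collected along the training trajectory."""
    return np.mean(np.stack(snapshots), axis=0)

# Hypothetical snapshots of a single weight bouncing around a minimum
# near w = 2 under a cyclic learning rate.
snapshots = [np.array([1.6]), np.array([2.5]), np.array([1.9]), np.array([2.4])]
w_swa = swa_average(snapshots)  # lands near the center of the basin
```

In PyTorch, `torch.optim.swa_utils.AveragedModel` maintains the same running average over full model parameters.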

For the MACE study, we also analyzed the relationship between the body order of the model (column 4) and its stability in production MD simulations. In this case, the loss entropy recovers the trend in MD stability better than the forces RMSE (Fig. 4c, column 4). Interestingly, despite the higher errors of the v = 1, L = 3 model compared to its v = 2, L = 3 counterpart, the former still exhibits a higher average time-to-failure. This observation can be traced back to its flatter energy LL (Fig. 4a, column 4), as also seen in the case of NequIP.
Finally, the models were compared according to the symmetry order of the edge features (column 3). In general, increasing L from L = 0 to L = 2 led to more stable simulations, although the model with L = 3 shows degraded production behavior. Nevertheless, the LLs do not follow the trends described so far in this work. We believe the filter normalization technique discussed in Sec. 3.1 cannot provide comparable LLs upon changes of L in MACE. As this parameter affects the tensor order of some model components [25], the number of parameters in these particular filters grows as O(N^L), which may be affecting the filter normalization. In contrast, adding message-passing layers with similar numbers of parameters, as in NequIP, allows the filter normalization technique to create comparable loss landscapes. This result exposes some limitations of the loss entropy when comparing models with different architectures or hyperparameters, and computing it in such cases may require different strategies in the future.

Loss landscapes and data efficiency
As extrapolation power and data efficiency are conceptually related, a natural extension of Secs. 4.3 and 4.4 is to ask whether loss entropy can also predict the data efficiency of models. To test this, we computed learning curves for NequIP and MACE models with the hyperparameters selected in Secs. 4.3 and 4.4. This is achieved by training models with the specified training parameters and architectures on subsets of the 3BPA training set containing only 25, 125, 250, or 500 samples from the 300 K split (Tables S3 and S6). Following previous work [23], the slopes of the learning curves were computed by fitting the line log n = m log ε + b to the number of training samples n and the force RMSEs ε calculated on all test splits (300, 600, and 1200 K), and comparing the slopes m. Consistent with the results from Secs. 4.3 and 4.4 (combined in Figs. 5a,b), the high-temperature force RMSEs (i.e., the points from the learning curves with n = 500) are good predictors of the learning curve slope. In addition, Figs. 5c,d show that the loss entropy also correlates strongly with the slope of the learning curve despite being computed using only the training set. As discussed in Sec. 4.4, these correlations are less pronounced for the MACE models, showing that none of the metrics is truly universal for predicting stability in simulations and learning curve slopes. Nevertheless, these results further demonstrate that loss landscapes provide information on the extrapolation behavior of NNIPs despite being derived only from the training set.
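The slope extraction described above can be written in a few lines. The learning-curve points below are hypothetical values for illustration, not results from this work:

```python
import numpy as np

def learning_curve_slope(n_train, force_rmse):
    """Fit log n = m log eps + b and return the slope m."""
    m, _ = np.polyfit(np.log(force_rmse), np.log(n_train), 1)
    return m

# Hypothetical learning curve: force RMSE (meV/A) vs. training set size.
n = np.array([25, 125, 250, 500])
eps = np.array([120.0, 60.0, 42.0, 30.0])
slope = learning_curve_slope(n, eps)  # steeper (more negative) = more data-efficient
```

A more negative slope means the error decays faster with added data, i.e., the model is more data-efficient.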

Conclusion
In this work, we motivate the need for additional metrics beyond RMSE by showing that, in the high-accuracy regime, in-domain errors fail to predict model stability despite large-scale trends in extrapolation behavior and NNIP robustness to noise. We propose the loss entropy as a metric for quantifying extrapolation behavior, and demonstrate that it correlates well with out-of-distribution error and stability in production simulations. Using large-scale studies with NequIP and MACE, we show that models with flatter loss landscapes exhibit better extrapolation behavior, and how different training parameters can be used to achieve them. For example, rescaling, AMSGrad, and SWA were shown to increase loss entropy and MD stability, and may be important tools when training NNIP models. Similarly, models with similar test error can be distinguished by their energy loss landscapes, with models displaying broader minima performing better in extrapolation tasks. Future studies can address shortcomings of loss landscape visualizations for NNIPs by analyzing how filter normalization can be made more suitable for NNIP architectures, or by better isolating the effects of key architectural changes on the loss. Nevertheless, these results can better inform the development of new model architectures and optimization strategies for NNIPs, facilitating their use in general materials simulation.

A Training details
All models were trained on a single node of the Lassen supercomputer using one NVIDIA V100 (Volta) GPU. Most models were trained for a maximum walltime of 12 hours, though some models were trained for an additional 12 hours to ensure convergence. For all models, an energy:force loss weight of 1:1000 was used on the 3BPA dataset, and 1:10 on ANI-Al. A cutoff distance of 5.0 Å was used for all models, though the effective interaction distance may vary depending on the number of interaction blocks used, as described below.
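The energy:force weighting corresponds to a combined objective of the form L = w_E · MSE(E) + w_F · MSE(F). A minimal numpy sketch follows; the per-atom normalization in the actual NequIP and MACE codes may differ from this simplified version:

```python
import numpy as np

def weighted_loss(e_pred, e_ref, f_pred, f_ref, w_e=1.0, w_f=1000.0):
    """Combined energy/force MSE with a 1:1000 weight, as used for 3BPA."""
    loss_e = np.mean((np.asarray(e_pred) - np.asarray(e_ref)) ** 2)
    loss_f = np.mean((np.asarray(f_pred) - np.asarray(f_ref)) ** 2)
    return w_e * loss_e + w_f * loss_f

# Perfect forces and a 1 eV energy error: the loss reduces to the energy term.
val = weighted_loss([1.0], [0.0], np.zeros((2, 3)), np.zeros((2, 3)))
```

The large force weight reflects that forces provide 3N labels per configuration versus a single energy label.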

A.1 NequIP
The NequIP models were trained using the NequIP package [23], version 0.5.5 (https://github.com/mir-group/nequip), implemented in PyTorch. The NequIP model architecture and training parameters were chosen to match those used in [25], with modifications as specified in this paper. See Table S1 for an overview of key architecture and optimizer modifications used in this work. The Adam optimizer was used with an initial learning rate of 0.005, together with the ReduceLROnPlateau scheduler with a patience of 50 and a decay factor of 0.5. All other parameters used the defaults specified in https://github.com/mir-group/nequip/blob/main/configs/example.yaml, version 0.5.5.

A.2 MACE
The MACE models were trained using the public MACE package (https://github.com/ACEsuit/mace), version 0.1.0. The MACE code required a minor patch to allow the Bessel weights to be made trainable in one of the studies. MACE model training parameters were chosen to match those used in [25], with modifications as specified in this paper. See Table S4 for an overview of key architecture and optimizer modifications used in this work. The Adam optimizer was used with an initial learning rate of 0.005, together with the ReduceLROnPlateau scheduler with a patience of 50 and a decay factor of 0.8. All other parameters used the defaults specified by https://github.com/ACEsuit/mace, version 0.1.0.

B MD simulations
Molecular dynamics simulations were performed for the 3BPA molecule using each of the models under study. In total, 30 trajectories were performed per model to obtain better statistics on their production behavior. Simulations were performed in the NVT ensemble using the Berendsen thermostat [86] implemented in ASE [87]. The initial configuration was chosen to be the ground state of the 3BPA molecule. The simulations were performed at 1600 K with a timestep of 1 fs to force the models to evaluate configurations outside of their training set (300 K samples) and beyond those quantified by the test set errors (1200 K samples). The time constant of the Berendsen thermostat was 250 fs. Each simulation was run for 6 ps, which was sufficient to ensure the temperature equilibrated and remained at 1600 K for about 4 ps (see Fig. S8).
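The simulations use ASE's `NVTBerendsen` integrator; at its core, the Berendsen thermostat reduces to a weak-coupling velocity-rescaling step. A minimal numpy sketch of that step (an illustration of the algorithm, not the ASE code):

```python
import numpy as np

def berendsen_rescale(velocities, t_current, t_target=1600.0, dt=1.0, tau=250.0):
    """Berendsen weak-coupling velocity rescaling; dt and tau in fs.

    The factor lambda = sqrt(1 + (dt/tau) * (T0/T - 1)) nudges the
    instantaneous temperature T toward the target T0 with time constant tau.
    """
    lam = np.sqrt(1.0 + (dt / tau) * (t_target / t_current - 1.0))
    return lam * velocities

v = np.ones((3, 3))
v_hot = berendsen_rescale(v, t_current=3200.0)  # system too hot: velocities shrink
```

With tau = 250 fs and dt = 1 fs, each step corrects only a small fraction of the temperature deviation, giving gentle coupling.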
MD trajectories were considered unphysical if the distance between bonded atoms increased beyond 2 Å. This bond rupture was the main failure mode observed in the trajectories, and thus is the only one considered in this work.
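This failure criterion amounts to a per-frame check over the equilibrium bond list. The input layout below (a positions array plus a list of bonded index pairs) is an assumption for illustration:

```python
import numpy as np

def has_bond_rupture(positions, bonds, threshold=2.0):
    """Return True if any bonded pair is stretched beyond `threshold` (Angstrom)."""
    positions = np.asarray(positions)
    for i, j in bonds:
        if np.linalg.norm(positions[i] - positions[j]) > threshold:
            return True
    return False

# Toy frame: the 1-2 "bond" is stretched to 2.4 A, so the frame is unphysical.
frame = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [3.5, 0.0, 0.0]]
broken = has_bond_rupture(frame, bonds=[(0, 1), (1, 2)])
```

The time-to-failure is then the first frame at which this check returns True.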

C Toy example on linear regressor trained on noisy data
To illustrate how redundancy in data points can help recover the underlying signal despite noise, we provide an example of a linear regressor trained on noisy data. A linear function is determined by two parameters, thus requiring at least two data points to fit. When more data points are available, a least-squares method can be employed to obtain the unique solution that minimizes the fitting error over the dataset. Because linear models are not robust to outliers, this pedagogical example helps illustrate how redundant data helps recover the underlying generative function despite the noise. We consider the ground-truth linear function

y(x) = ax + b,    (9)

with fixed coefficients a and b. Using this definition, the "corrupted" data ỹ is given by

ỹ(x) = y(x) + σε,    (10)

where σ is a noise-scale parameter and ε is sampled from a normal distribution with mean zero and variance one.
A noisy dataset {(x_i, ỹ_i)} containing N elements is constructed by taking N linearly spaced values of x in the interval [0, 1] and computing ỹ using Eq. (10). This noisy dataset is then used to fit a linear regression model

ŷ(x) = âx + b̂,

where the coefficients â and b̂ are obtained with the least-squares method without regularization. The model error against the true data is then computed as the RMSE between the predictions ŷ and the true values y.
Figure S1 shows the results of fitting the linear model to the noisy data. At zero noise, the RMSE is always zero, as the linear fit exactly recovers the true function y. As noise is added, the average RMSE against the true dataset increases, particularly in the low-data regime. As more data is added, the effects of the noise are compensated by the redundancy of the data points in conveying the true function of the system. For datasets with more than 1000 points in this linear regression example, even high levels of noise (in the units of Eq. (9)) are compensated by sufficient redundancy in the data.
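The experiment behind Fig. S1 can be reproduced with a short script. The coefficients a = 2 and b = 1 below are arbitrary choices for illustration, not the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_rmse(n_points, sigma, a=2.0, b=1.0):
    """Fit a line to noisy samples of y = a*x + b and return the RMSE
    of the fit against the noiseless ground truth (Eqs. 9-10)."""
    x = np.linspace(0.0, 1.0, n_points)
    y_true = a * x + b
    y_noisy = y_true + sigma * rng.standard_normal(n_points)  # Eq. (10)
    coeffs = np.polyfit(x, y_noisy, 1)  # least squares, no regularization
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y_true) ** 2))

# Redundancy suppresses the noise: the large-N fit is far more accurate.
small, large = fit_rmse(10, 0.5), fit_rmse(10000, 0.5)
```

Averaging this over many random seeds and sweeping n_points and sigma reproduces the trend in Fig. S1.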

D Additional figures D.1 3BPA
The figures in this section were all generated using the 3BPA dataset, either with the raw data directly or with a model trained to it.

Figure S5: 1D loss landscapes for the 5-layer "rescaling + AMSGrad + EMA" NequIP model from Sec. 4.3 using the original 300 K training set, or the 600 K and 1200 K test sets. While the energy landscapes demonstrate a horizontal shift with increasing temperature, as theorized by [46], no such shift is observed in the force landscapes.
Figure S6: "Frozen" 1D loss landscapes for NequIP and MACE models with trainable Bessel weights. The "un-frozen" landscapes were generated as described in Sec. 3.1. The "frozen" landscapes were generated by fixing the Bessel weights to their optimized value (corresponding to the model at a filter-normalized distance of 0) when generating the loss landscapes. With the exception of this figure, all LLs shown in this work correspond to "frozen" LLs.
Figure S7: Plot shown in Fig. 1c, but including all models from this work. As many of the models used to generate the learning curves in the main text were trained on fewer than 500 configurations (the size of the full 3BPA training dataset), we scale the corresponding points in this plot by the number of 3BPA configurations used for training (ranging from 5 to 500). Despite the vertical dispersion, the models still follow the rough scaling relation proposed in Fig. 1c, with smaller training sets exacerbating the extrapolation slope in the high-error regime.

D.2 ANI-Al
In addition to the 3BPA dataset, part of this work also uses the aluminum dataset from Smith et al. [74] (ANI-Al). The ANI-Al dataset was generated using active learning intended to thoroughly sample configurational space, and was originally used to construct a model for running shock simulations. This additional dataset was chosen because it is a condensed-matter system with wide configurational diversity but restricted composition, providing a strong contrast to the molecular 3BPA dataset.
The figures in this section were all generated using the ANI-Al dataset, either with the raw data directly or with a model trained to it.

Figure 2:
Figure 2: a, A schematic describing the process of generating loss landscapes. Starting at the origin with a trained model, landscapes are constructed by grid-sampling along a random filter-normalized direction [58] in parameter space up to a chosen maximum distance. To ensure that the results were consistent regardless of the choice of random sampling direction, landscapes were averaged over multiple random directions δ_1, ..., δ_n, with n = 20 usually showing only slight variations in landscape topography. b, 2D LLs for a MACE model trained to the 3BPA dataset, generated by sampling a plane defined by two random directions. The model was trained using E_w = 1.0 and F_w = 1000.0, but the final landscape was re-weighted for the purpose of this figure using a fixed value of E_w = 1.0 and F_w increasing from 0.0 to 10.0 to demonstrate the effects of different weights on optimization.
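The landscape-generation procedure in the caption can be sketched as follows. For simplicity, filter normalization is applied row-wise to a 2D weight matrix; the treatment of biases and higher-rank tensors in the actual analysis may differ:

```python
import numpy as np

def filter_normalize(direction, theta):
    """Rescale each 'filter' (row) of a random direction to the norm of the
    corresponding filter in the trained parameters theta (Li et al. [58])."""
    d = direction.reshape(direction.shape[0], -1).astype(float)
    t = theta.reshape(theta.shape[0], -1)
    d *= np.linalg.norm(t, axis=1, keepdims=True) / (
        np.linalg.norm(d, axis=1, keepdims=True) + 1e-12)
    return d.reshape(direction.shape)

def loss_landscape_1d(loss_fn, theta, alphas, n_dirs=20, seed=0):
    """Average the 1D landscape over n_dirs random filter-normalized directions."""
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_dirs):
        d = filter_normalize(rng.standard_normal(theta.shape), theta)
        curves.append([loss_fn(theta + a * d) for a in alphas])
    return np.mean(curves, axis=0)

theta = np.array([[3.0, 4.0], [0.0, 1.0]])
curve = loss_landscape_1d(lambda w: float(np.sum(w ** 2)), theta,
                          alphas=[-0.5, 0.0, 0.5], n_dirs=5)
```

The 2D landscapes in panel b follow the same recipe with two independent filter-normalized directions spanning a plane.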

Figure 3:
Figure 3: Analysis of LLs and MD simulation stability results for NequIP models while varying: (column 1) rescaling and Bessel weights; (column 2) optimizer settings using a model with two interaction blocks or (column 3) five interaction blocks; and (column 4) the number of interaction blocks. a, LLs for each model and b, distributions of time-to-failure computed over 30 MD simulations. c, Time-to-failure compared to (top) the entropy of the loss landscape (as defined in Sec. 3.2) or (bottom) the forces RMSE on the high-temperature test sets. The horizontal line, box, and whiskers in b depict the median, interquartile range, and points within 150% of the interquartile range of the distribution. Isolated diamonds are points outside of this range.

Figure 4:
Figure 4: Analysis of loss landscapes and MD simulation stability results for MACE models while varying: (column 1) the use of rescaling; (column 2) optimizer settings using a model with fixed body order v = 2 and symmetry order L = 3; (column 3) the symmetry order of the edge features with a fixed body order of 2; and (column 4) the body order with a fixed symmetry order of 3. a, Loss landscapes for each model. b, Distributions of time-to-failure computed over 30 MD simulations. c, Time-to-failure compared to (top) the entropy of the loss landscape (as defined in Sec. 3.2) or (bottom) the forces RMSE on all test sets of the 3BPA dataset. The horizontal line, box, and whiskers in b depict the median, interquartile range, and points within 150% of the interquartile range of the distribution. Isolated diamonds are points outside of this range.

Figure 5:
Figure 5: Loss landscape entropy and high-temperature force test RMSEs compared to a,b, MD time-to-failure or c,d, learning curve slope for a,c, NequIP and b,d, MACE models. Learning curve slopes were computed as described in Sec. 4.5. Colors for points in a and b were chosen to match the colors in Fig. 3 and Fig. 4, respectively. Lines in c and d represent linear fits to the data.

Figure S2:
Figure S2: Distributions of a, energies and b, forces for the 3BPA dataset at the three sampling temperatures. To corrupt the labels, the standard deviation of each distribution is computed, leading to σ_DFT = 6.12 meV/atom for energies and σ_DFT = 0.95 eV/Å for forces. Noise in the forces is added to each component of the force, for each atom in the system.

Figure S3: Figure S4:
Figure S3: Linear fits to the 3BPA testing set error, as described in Sec. 4.1.

Figure S8:
Figure S8: Average temperature of the 36 MD simulations performed with NequIP that did not exhibit unphysical behavior throughout the 6 ps of total simulation time. The blue line is the average temperature at each time step, and the shaded area is the standard deviation of temperatures across these 36 trajectories.

Figure S9: Figure S10:
Figure S9: Distributions of a, energies, b, energy noises, c, forces, and d-e, force errors for the ANI-Al dataset. In d, the distribution of deviations is computed for each coordinate of each force vector in the noisy datasets; e shows the distribution of errors in the norm of the noisy forces compared to the original ones.

Figure S11:
Figure S11: Parity plots for all four models trained to the ANI-Al dataset with increasing noise levels. The points for each noise level are plotted on top of each other, with colors based on the amount of noise as specified in the legend, emphasizing that the models learn predictions very similar to those they make when trained to the non-noisy dataset.

Figure S12:
Figure S12: a, RMSE of energies (eV/atom) for models trained to highly noisy versions of the ANI-Al dataset, using noise magnitudes up to 20× those in Fig. S10. The models trained on noisy data are then evaluated on their noisy training set ("noisy" chart) or on the original, uncorrupted dataset ("original"). The black dashed lines correspond to the amount of noise added to the DFT energies, in units of eV/atom. Distributions of b, energy and c, force noises for this high-noise ANI-Al dataset.

Figure S14:
Figure S14: 2D landscapes for the energy (top) and force (bottom) losses of NNIP models on the ANI-Al dataset. The landscapes are plotted with respect to two normalized directions d_1 and d_2, as described in Sec. 3.1. Colors correspond to the same loss values across models, with brighter colors corresponding to higher losses.
Figure S1: Average RMSE of linear regressor models trained on a noisy linear dataset. The RMSE of each model is computed against the true linear function, even though the model was trained on a noisy dataset. The average RMSE is obtained by repeating this experiment 100 times for each dataset size and noise level. The noise std corresponds to σ in Eq. (10).

Table S1:
Major hyperparameters used to build/train the NequIP models in this work.

Table S2:
Results of LL entropy and MD simulation stability for NequIP models. The entropy values of the energy loss (S_e) and forces loss (S_f) are computed with the parameters discussed in Sec. 3.2. The error bar of the time-to-failure is the standard deviation over all 30 trajectories performed for each model.

Table S3:
Test RMSE of energies and forces in the 3BPA dataset for most NequIP models in this study. All models were trained on different dataset sizes (N ∈ {25, 125, 250, 500}) containing samples from simulations at 300 K. Then, models were tested against held-out samples from simulations at 300, 600, and 1200 K.

Table S4:
Major hyperparameters used to build/train the MACE models in this work.The model "rescaling + Bessel" is the only one containing trainable Bessel functions, which were introduced in the MACE code for this ablation study.

Table S6:
Test RMSE of energies and forces in the 3BPA dataset for most MACE models in this study. All models were trained on different dataset sizes (N ∈ {25, 125, 250, 500}) containing samples from simulations at 300 K. Then, models were tested against held-out samples from simulations at 300, 600, and 1200 K. The entry v = 2, L = 3 is duplicated in this table to facilitate the visualization of the trends.