Synthetic pre-training for neural-network interatomic potentials

Machine learning (ML) based interatomic potentials have transformed the field of atomistic materials modelling. However, ML potentials depend critically on the quality and quantity of quantum-mechanical reference data with which they are trained, and therefore developing datasets and training pipelines is becoming an increasingly central challenge. Leveraging the idea of "synthetic" (artificial) data that is common in other areas of ML research, we here show that synthetic atomistic data, themselves obtained at scale with an existing ML potential, constitute a useful pre-training task for neural-network interatomic potential models. Once pre-trained with a large synthetic dataset, these models can be fine-tuned on a much smaller, quantum-mechanical one, improving numerical accuracy and stability in computational practice. We demonstrate feasibility for a series of equivariant graph-neural-network potentials for carbon, and we carry out initial experiments to test the limits of the approach.


Introduction
Machine-learning interatomic potential (MLIP) models are increasingly used to accelerate the simulation, discovery, and design of molecules and materials [1][2][3][4][5]. MLIPs approximate quantum-mechanical potential energies and forces acting on atoms, yet require orders of magnitude lower computational cost than the corresponding reference methods. As such, they have begun to enable real-world applications that would otherwise have been out of reach for quantum-mechanically accurate simulations: the behaviour of matter under extreme conditions [6]; the complex atomic structure of amorphous solids [7]; the discovery of unconventional reaction mechanisms [8].
To further increase their impact, and to enable more widespread adoption and application in the natural sciences, it is necessary that non-specialists are able to quickly, cheaply, and yet reliably train MLIPs for new systems. A major bottleneck in this endeavour is the computation of expensive quantum-mechanical reference data used in training. Significant research effort is therefore being spent on designing data-efficient MLIP fitting frameworks, and on architecture-agnostic strategies that reduce the amount of training data required to reach a given level of accuracy.
In the fields of computer vision and natural language processing in particular, one strategy for creating specialised ML models is to leverage existing "foundation" models that have been pre-trained on a general task over very large amounts of data. End users can then cheaply fine-tune such models for their desired domain and specific task using small amounts of data. However, those approaches rely on the availability of large corpora of existing data for pre-training; this is not often the case in atomistic ML if one is starting from quantum mechanics directly.
In the present work, we show that the prediction of synthetic energy and force labels, cheaply generated by existing MLIPs, is a useful pre-training task for neural-network (NN) interatomic potentials, enabling subsequent fine-tuning on quantum-mechanical data. Specifically:
• We show that starting from general, synthetically pre-trained models and fine-tuning them for specific use cases improves the physical robustness of the final potentials, as measured by the ability to run stable dynamics outside the scope of the fine-tuning dataset;
• We suggest that MLIPs can be repurposed as alchemical pre-training sources: we show that pre-training models on synthetic data for carbon, with proportionally scaled structures, leads to positive transfer when fine-tuning on data for silicon.

Related work
Synthetic data. The term "synthetic" refers to data that have been created with a surrogate model, for instance from simulations or generative modelling; this is in contrast to "real" data, which are obtained from physical measurements, computed using quantum mechanics (ab initio), or collected from other reliable observations of the ground truth. Increasingly, synthetic data are being used to (pre-)train ML models, most notably in the fields of image classification, segmentation, and generation [9,10], natural-language modelling [11], and speech detection [12].
Synthetic data are also beginning to be used in the physical sciences: Aty et al. studied synthetic data as a pre-training task for determining experimental lipid phase behaviour from small-angle X-ray scattering patterns [13]; Anker et al. explored the use of synthetic data in interpreting inelastic neutron-scattering data [14]; Schuetzke et al. used synthetic data to mimic the characteristic appearance of experimental measurements for several spectroscopy methods [15].
In the context of atomistic ML, we have recently shown that synthetic data allow one to distill reliable, but comparably expensive, MLIPs into faster and cheaper ones in a teacher-student manner [16], and that synthetic data provide a cheap means to explore atomistic energy models [17,18]. We showed initial proof-of-concept that synthetic data constitute a useful pre-training task, limited at that time to simple feed-forward NNs regressing scalar atomic quantities [17]. Very recently, Kelvinius et al. investigated the use of synthetic data as a means to aid knowledge distillation from slow to fast graph-neural-network potentials via intermediate learned representations [19].
Pre-training tasks for MLIPs. Fine-tuning existing pre-trained NNs can improve accuracy and data efficiency compared to training models directly. As such, much research effort is expended on exploring useful, effective, and architecture-agnostic pre-training tasks. In the field of MLIP development, Wang et al. recently built upon previous work to show that unsupervised denoising of non-equilibrium molecular structures leads to improved final NN potentials when fine-tuned on quantum-mechanical labels [20][21][22][23][24].
Supervised pre-training tasks also exist. One technique of particular relevance here is "domain knowledge injection". Shui et al. showed that learning to mimic existing empirical potentials can dramatically improve accuracies when fine-tuning on quantum-mechanical data [25]. Our proposed technique builds upon this approach by improving the generality and accuracy of the knowledge that we inject during pre-training.
Transfer learning. Transfer learning involves using knowledge learned in one setting to improve performance on some other target task [26]. Many strategies exist to perform this transfer of knowledge, and successful applications have been found in fields as varied as image classification [27] and captioning [28], gaming strategies [29], and social-network analysis [30].
In computational chemistry, transfer learning has been used to improve MLIP models [31][32][33] and molecular property prediction [34]. For example, Smith et al. have shown transfer learning from DFT- to coupled-cluster- [CCSD(T)-] level data for molecules, using relatively few of the latter expensive labels to "lift" the level of the final MLIP [31]. We note that our approach can be recast in a similar light, as transfer learning from MLIP- to DFT-level accuracy [17].
Alchemical learning. Alchemical learning (across chemical elements) seeks to improve model performance by training on data for elements with which the final model is not concerned. In previous work [35,36], alchemical learning was performed by creating local-environment descriptors that explicitly include elemental information. Faber et al. [35] and Fias et al. [36], respectively, have shown that models incorporating such descriptors can (i) use training data including additional elements to improve their performance and (ii) extrapolate at test time to elements not seen during training.
In this work, we seek to learn chemically transferable model weights, implicitly creating features instead of manually crafting them. To this end, we pre-train NN potentials on synthetic data for carbon, and show that this improves performance when fine-tuning on DFT data for silicon. We thus present alchemical pre-training as a form of domain-adaptation transfer learning.

Neural-network interatomic potentials
Neural networks are a powerful class of models used to construct ML potentials. Their main computational primitive, the multi-layer perceptron (MLP), can provably approximate any continuous function that maps from one fixed-dimensional space to another, given sufficient parameterisation [37]. Much research effort has gone into creating atomic-environment features [38][39][40][41] and model architectures [42][43][44][45][46][47][48][49], allowing one to use the fixed-dimensional mappings provided by MLPs to regress energies and forces onto atomic environments. The compounding of these efforts over time has established NN potentials as an accurate and data-efficient approach in materials modelling.
Herein, we use the Neural Equivariant Interatomic Potentials (NequIP) architecture introduced by Batzner et al. [46]. NequIP combines learnable embeddings with interaction blocks in a message-passing framework. E(3) equivariance is achieved by the use of geometric tensors as internal features: these are equivariant to rotation and reflection, a property that is conserved under the tensor products used within the interaction blocks. In the output layer, invariant, per-atom energies are predicted; forces are explicitly calculated as derivatives of the total energy to ensure energy conservation [46].
To train an NN model, a measure of its performance must be defined. Since NNs are fully differentiable, the gradient of this "loss function" can be calculated with respect to each parameter using backpropagation. Various optimisers exist that then use these gradients to update the parameter values.
As per the NequIP paper, we define the loss for a single structure with energy label E and force labels {F_j} as

$$\mathcal{L} = \lambda_E \,\big(\hat{E} - E\big)^2 + \frac{\lambda_F}{3N} \sum_{j=1}^{N} \sum_{\alpha=1}^{3} \left( -\frac{\partial \hat{E}}{\partial r_{j,\alpha}} - F_{j,\alpha} \right)^2,$$

where $\hat{E}$ is the model's predicted total energy, N is the number of atoms, and λ_E and λ_F weight the energy and force contributions to the total loss, respectively. We compute several such losses on a mini-batch of structures before performing an optimisation step. The order in which structures appear in these mini-batches is determined by the random seed used.
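To make the weighting concrete, the per-structure loss can be sketched in plain Python. This is a minimal, framework-free illustration only: the actual NequIP implementation obtains forces by automatic differentiation over batched tensors, and the exact normalisation conventions are configurable (here we assume per-atom normalisation of the energy term, matching the PerAtomMSE convention mentioned later).

```python
def structure_loss(E_pred, E_true, F_pred, F_true, n_atoms, lam_E=1.0, lam_F=1.0):
    """Weighted energy-plus-force loss for one structure.

    E_pred, E_true: predicted and reference total energies (eV).
    F_pred, F_true: per-atom force vectors (eV/Å), shape (n_atoms, 3).
    The energy term is normalised per atom; the force term averages
    over all 3 * n_atoms Cartesian components.
    """
    energy_term = lam_E * ((E_pred - E_true) / n_atoms) ** 2
    force_sq = sum(
        (fp - ft) ** 2
        for Fp, Ft in zip(F_pred, F_true)   # loop over atoms
        for fp, ft in zip(Fp, Ft)           # loop over x, y, z components
    )
    return energy_term + lam_F * force_sq / (3 * n_atoms)
```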

Pre-training and fine-tuning
Various advanced methods exist to fine-tune a pre-trained model, including early layer freezing, AutoFreeze [50], application of the lottery-ticket hypothesis [51], and discriminative fine-tuning [52]. Herein, we adopt the simplest possible approach and fine-tune all weights of the best pre-trained model by training as normal on the real data. Thus, apart from weight initialisation, our training procedures for direct training, pre-training, and fine-tuning are identical. An overview of the approach is given in Figure 1.
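Conceptually, this reduces to reusing one optimisation loop with a different starting point. The following is a deliberately toy, stdlib-only sketch using a one-parameter model (purely illustrative; the real models are NequIP networks trained with Adam):

```python
def train(w_init, data, lr=0.05, steps=200):
    """Plain gradient descent on a one-parameter model y = w * x (squared error).

    The same routine serves for pre-training, direct training, and
    fine-tuning; only the initial weight w_init differs.
    """
    w = w_init
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# "Synthetic" labels from a surrogate (slope 2.0) vs "real" labels (slope 2.1).
synthetic = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
real = [(x, 2.1 * x) for x in (1.0, 2.0, 3.0)]

w_pretrained = train(0.0, synthetic)                # pre-train from scratch
w_finetuned = train(w_pretrained, real, steps=20)   # fine-tune: same loop, warm start
```

Because the pre-trained weight already sits close to the target, the fine-tuning stage converges in far fewer steps than training from scratch would, which mirrors the wall-time savings reported below.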

Datasets and synthetic labels
The dataset of synthetic carbon structures used here, to which we refer as C-SYNTH-23M, is taken from Ref. 17. The dataset contains 546 molecular-dynamics (MD) trajectories for carbon structures with mass densities ranging from 1.0 to 3.5 g cm⁻³. These trajectories were generated through LAMMPS [53] melt-quench-anneal simulations with varying parameters, driven by the C-GAP-17 ML potential [54] in the Gaussian Approximation Potential (GAP) framework [55]. The dataset contains over 23 million unique atomic environments, including a variety of ordered and disordered local structures. This dataset had originally been labelled with local energies and forces using C-GAP-17. We here re-label it with the following models: the C-GAP-20U potential [56], the atomic cluster expansion (ACE) potential for carbon by Qamar et al. [41,57], the long-range carbon bond-order potential (LCBOP) [58], and the environment-dependent interaction potential (EDIP) fitted for carbon [59,60].
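For context, the quoted mass densities directly fix the simulation-cell size. A small sketch (illustrative only; the actual C-SYNTH-23M cells were set up within LAMMPS) converting a target density to the edge length of a cubic cell of carbon atoms:

```python
def cubic_cell_edge(n_atoms, density_g_cm3, atomic_mass_u=12.011):
    """Edge length (Å) of a cubic cell containing n_atoms at a given mass density."""
    N_A = 6.02214076e23                      # Avogadro constant, mol^-1
    mass_g = n_atoms * atomic_mass_u / N_A   # total mass in grams
    volume_cm3 = mass_g / density_g_cm3
    return volume_cm3 ** (1.0 / 3.0) * 1e8   # cm -> Å

# e.g. 216 C atoms at diamond density (3.515 g/cm^3) give roughly a 10.7 Å box
```

Sweeping the density from 1.0 to 3.5 g cm⁻³ thus spans cells from low-density, graphite-like networks to diamond-like packing.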
We also used "real" (quantum-mechanically labelled) datasets. One of these was the C-GAP-17 training dataset [54]. This dataset contains cells representing liquid and amorphous carbon, an isolated dimer, as well as randomly distorted unit cells of graphite and diamond, labelled with atomic forces and local energies from DFT computations. The C-GAP-20U training dataset was used to extract carbon nanotube structures [56]. The Si-GAP-18 training dataset was used to provide DFT-labelled silicon structures for the alchemical learning experiments [61].

Computational details
Compute resources. We used a single NVIDIA RTX A6000 GPU to train all NequIP models discussed in the present work, for a total of about 590 wall-clock hours. Running at full power (300 W), this corresponds to 177 kWh, which in the UK is estimated as equivalent to about 37 kg CO₂.
NequIP models. We performed an initial sweep over salient model and training hyperparameters.
Balancing speed and accuracy, we settled on 4 message-passing layers, each with 32 features and a message-passing cut-off of 4.0 Å. The Adam optimiser [62] was used in training, together with an exponential-moving-average decay rate of 0.99. Energy (PerAtomMSE in eV/atom) and force (ForceMSE in eV/Å) losses were weighted 4:1, and all other parameters were kept at their defaults.
Software and data. We used the NequIP library to train all models. The LCBOP [58,63] and EDIP [64] empirical force-field labels were generated using the openkim library and models [65,66]. The pacemaker [67,68] and quippy [69,70] Python packages were used to generate C-ACE and C-GAP-20U labels, respectively. LAMMPS was used to drive MD simulations [53]. Custom-written code, data, and configurations used to train the models in the present work are provided in a GitHub repository (see the data availability statement below).

Proof-of-concept
We show in Figure 2 that synthetic pre-training on C-ACE [57] labels leads to more accurate MLIPs than the widely established direct-training approach. Increasing the number of structures seen in pre-training systematically improves the performance of the fine-tuned model.
We directly trained a series of NequIP models on increasing amounts of the C-GAP-17 training set. We also synthetically pre-trained several models before fine-tuning on the same structures.
Analysing the results, we find that the gains in accuracy on the C-GAP-17 test set as a result of synthetic pre-training are largest when only small amounts of real data are available. Training on just 25 structures, we observe an improvement of 105 meV/atom (63%) in per-atom energy RMSE, and 0.39 eV/Å (32%) in force-component RMSE on the C-GAP-17 hold-out test set. We note that, to save time and compute, fine-tuning results are quoted for a single run, whereas the best model from 5 training runs is taken for direct training. Thus these quoted values provide a lower bound on the improvement that synthetic pre-training can bring.
Taking an orthogonal view, to achieve the same accuracy as a model directly trained on 800 structures, a synthetically pre-trained model requires no more than 25 structures. This amounts to a data-efficiency saving of at least 32×. This supports our original proposition in Ref. 17 that this sort of pre-training is most relevant and useful when aiming to fine-tune on small numbers of high-level quantum-mechanical (beyond-DFT) data. Adopting the philosophy of "pre-train once, fine-tune many times", we also find that the effective training time is dramatically reduced by synthetic pre-training: fine-tuning a model on 25 real structures took 35 s on the GPU used. In contrast, direct training on 800 structures took between 15 and 44 minutes. This amounts to a wall-time speed-up of between 25× and 75× to achieve the same numerical accuracy.

The synthetic source matters
We studied the effect of the source of the synthetic labels: that is, at what level are the pre-training data computed? Previous work has shown that empirical force fields can act as a source of synthetic labels for pre-training, improving energy errors by more than 50% in some settings [25]. We carried out similar experiments for the specific dataset (C-GAP-17) and model architecture (NequIP) used herein, making use of two popular empirical force fields for carbon, viz. EDIP and LCBOP. Learning curves are presented in Figure 3 (showing the best of 5 direct-training or fine-tuning attempts in each case), and further data may be found in Table 1 (including standard deviations).
We find that, in the low-data limit (25 DFT-labelled structures), empirical pre-training leads to a 25% improvement in energy RMSE. However, the absolute values are still well above "chemical accuracy" (∼40 meV/atom), and the force errors improve by at most 3% compared to the directly trained models. In contrast, we find that pre-training on C-GAP-20U or C-ACE labels leads to larger improvements in energy and force accuracy, and in data efficiency, than pre-training with empirical interatomic potentials. One explanation for this observed behaviour could be that the latter potentials have been parameterised to describe only limited regions of configurational space. Indeed, their predictions on the test set as a whole appear to be meaningless (force RMSEs > 5 eV/Å). Thus, the knowledge injected into the NequIP model during pre-training is useful for comparatively few structures. In contrast, the C-ACE and C-GAP-20U models aim to describe, at comparatively larger computational cost, much larger regions of configurational space, and so pre-training on them helps to bring down the error across the whole test set.

Pre-training can improve stability
Chemists using MLIPs for research are often interested in, and have small amounts of accurately labelled data for, highly specialised regions of chemical space. Fitting models directly to such restricted datasets can lead to models with poor quantitative accuracy and unphysical behaviour outside the scope of their training data. We propose general synthetic pre-training as a means to bypass both of these issues.
To demonstrate this, we sourced a set of structures consisting exclusively of carbon nanotubes (CNTs), previously labelled with DFT, from the C-GAP-20U dataset [56]. We take a generally pre-trained model, i.e., a model pre-trained on C-ACE labels for 10,000 carbon structures covering a wide range of densities and degrees of structural disorder taken from C-SYNTH-23M, and we compare fine-tuning from this to directly training on the small, specialised dataset (Figure 4a). As above, we see that synthetic pre-training significantly improves test-set accuracy (here, a ∼25% improvement in force RMSE).
In Figure 4c we compare force errors as measured on the general C-GAP-17 test set. Directly trained models, having never seen the vast majority of this configurational space, have no reason to perform well here, and indeed they do not. In contrast, the pre-trained model was trained on a large amount of data covering the vast majority of the chemical space spanned by this test set. It thus performs very well, approaching the level of the pre-training source on this dataset. The fine-tuning process leads to a deterioration of this general performance: it seems that the penalty for improving on the CNT structures is partial forgetting of the original pre-training data. However, compared to the directly trained models, the fine-tuned models still perform significantly better (only 1.4× the error of the pre-trained model, as compared to 2.7×). Thus, synthetic pre-training leads to models that are more robust when making predictions on atomic environments that are not present in the final training set.
To further emphasise this robustness, we perform MD simulations using the directly trained and fine-tuned models (Figure 4d). Taking an amorphous (1.5 g cm⁻³) carbon structure, we heat from 300 K to 3,000 K over 100 ps, before holding for a cumulative 1,000 ps. We find that the fine-tuned model creates stable trajectories that capture the qualitatively correct behaviour, viz. graphitisation [73,74]. In contrast, the directly trained models fail rapidly and dramatically by failing to handle rare events. A brief period (∼20 ps) of reasonable dynamics is interrupted by the creation of a 5-fold-connected carbon atom. After this, the simulation rapidly breaks down, leading to the simultaneous production of 0-coordinated free carbon atoms and clusters of highly (8+) coordinated atoms.
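The heating protocol amounts to a simple thermostat schedule; a sketch of the target-temperature profile only (the actual runs are MD simulations driven by the respective potentials):

```python
def target_temperature(t_ps, ramp_ps=100.0, T_start=300.0, T_end=3000.0):
    """Thermostat set-point (K) at simulation time t_ps (ps):
    linear ramp from T_start to T_end over ramp_ps, then hold at T_end."""
    if t_ps <= ramp_ps:
        return T_start + (T_end - T_start) * t_ps / ramp_ps
    return T_end
```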

Alchemical pre-training can be useful
We have shown that synthetic pre-training is useful when a general MLIP has already been trained for the system of interest. However, when exploring a new chemical system, this will often not be the case. We therefore explore using an MLIP trained on a different chemical system as a source of alchemical synthetic pre-training data (Figure 5).
We find that using the carbon structures and labels as they are yields no improvements, and in the low-data limit leads to significantly negative transfer. We note, however, that the characteristic length scales of carbon and silicon structures are very different: the typical minimum separation seen in carbon structures is ∼1.4 Å, whereas for silicon it is ∼2.1 Å. We therefore expand all carbon structures by a factor of 2.1/1.4 = 1.5, keeping the labels as they are, and repeat the experiment by pre-training with the scaled structures and fine-tuning on the same silicon dataset. In this case, we observe positive transfer, with a similar trend in the number of pre-training structures as seen previously.
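The scaling step is a plain isotropic expansion of each periodic cell. For illustration (a minimal sketch; any simulation toolkit offers an equivalent operation):

```python
def scale_structure(cell, positions, factor=2.1 / 1.4):
    """Isotropically expand a periodic structure by `factor`.

    Both the cell vectors and the Cartesian atomic positions are scaled,
    so fractional coordinates (and hence the bonding topology) are preserved.
    """
    new_cell = [[factor * c for c in vec] for vec in cell]
    new_pos = [[factor * x for x in p] for p in positions]
    return new_cell, new_pos
```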
Figure 5c illustrates the effect of this scaling procedure. We project environments from the Si dataset onto a UMAP embedding [71] of environments from the unscaled (turquoise) and scaled (magenta) carbon structures. A majority of Si structures lie in the "sp³-like", tetrahedral region of the scaled carbon structures; we posit that it is the knowledge of this region, gained during general pre-training, that leads to the positive transfer. Hence, we believe that a current limitation of this technique is that the alchemical structures used for pre-training need to share structural similarities with the fine-tuning target. Further work on this theme is ongoing.

Conclusions
We have shown that pre-training neural-network interatomic potentials on synthetic energy and force data can improve accuracy, training time, and data efficiency for end users. We have focused on a single architecture (NequIP) and a single chemical system (carbon), but we expect that the technique is in principle more general; we will explore this in future work.
We presented a series of experiments to test the capabilities (and limits) of the approach. We found that increasing the number of structures seen during pre-training improves the final model, but that these improvements decay rapidly in the limit of large fine-tuning datasets (Figure 2), consistent with our previous findings for regressing scalar quantities [17]. We showed that the synthetic source matters (the more accurate the pre-training labels, the larger the gain from the approach), and that synthetic pre-training can improve the robustness of NN potentials.
A possible limitation of homogeneous synthetic pre-training is that one requires an existing MLIP to have been trained on the particular system of interest; this is readily available for carbon, but will not always be the case for more complex chemistries. In an effort to bypass this restriction, we showed a proof-of-concept for alchemical pre-training: using an MLIP trained on one chemical system (here, carbon) to pre-train a model for use on a different system (here, silicon). We observe positive transfer, and further work is ongoing to increase the magnitude of this improvement.

Figure 1 :
Figure 1: Synthetic pre-training for neural-network interatomic potentials. This schematic compares the established direct-training approach (top) to synthetic pre-training and subsequent fine-tuning (bottom). Direct training starts from a randomly initialised NN model, which is then optimised on real, quantum-mechanically computed energy and force data. Our approach, instead, involves an initial pre-training step: directly training a model on a very large synthetic database generated by an existing ML potential model. The pre-trained model is then fine-tuned on the same real data as in the direct approach. (See also Ref. 17.)

Figure 2 :
Figure 2: Proof-of-concept for synthetic pre-training. Energy (left) and force (right) RMSE as a function of the number of DFT-labelled structures seen during training (x-axis) and the number of synthetic pre-training structures (colour coding), as measured on the C-GAP-17 test set. Direct-training points (black) are plotted as the best of 5 differently seeded runs. Fine-tuning points are plotted for a single random seed.

Figure 3 :
Figure 3: Effect of using different sources for pre-training. Per-atom energy (left) and component-wise force (right) RMSE are plotted as a function of the number of real structures seen during training, as measured on the C-GAP-17 hold-out test set [54]. Pre-training on empirical force-field labels leads to limited improvement over direct training, whereas pre-training on existing MLIP labels leads to significant improvements. We pre-train each model on the same 10,000 synthetic structures taken at random from C-SYNTH-23M, and report the best error from 5 direct-training or fine-tuning attempts.

Figure 4 :
Figure 4: Training directly on carbon nanotube (CNT) structures leads to potentials with very limited physical understanding. Fine-tuning from a potential pre-trained on a general carbon dataset leads to much better performance. (a) RMSE on a CNT test set for directly trained, pre-trained, and fine-tuned models. The fine-tuned model has significantly better numerical performance. (b) Projection of a random sample of carbon-nanotube environments onto an embedding of the chemical space spanned by the general pre-training set (created by describing each atomic environment using a SOAP vector [39] and performing dimensionality reduction using UMAP [71]). (c) RMSE on a general-purpose test set. The direct model performs very poorly, which is to be expected as it has not encountered environments other than CNT structures during training; the fine-tuning process leads to some degradation in the fine-tuned models' general performance. In panels (a) and (c), we plot results for the best model taken from 5 differently seeded training runs for the direct and fine-tuned models. (d) Visualisation of relevant structures from MD trajectories driven by a model directly trained on the CNT data only (top) and a fine-tuned model (bottom). Structural images were created using OVITO [72].

Figure 5 :
Figure 5: Alchemical pre-training. (a) Pre-training to mimic C-ACE labels on unscaled carbon structures (turquoise) leads to, at best, no significant improvement upon fine-tuning. (b) Scaling the carbon structures by 1.5× (magenta) leads to successful transfer. In both sets of learning curves, we plot the best model performance as taken over 5 separate fine-tuning attempts. (c) UMAP projections of the Si dataset (black) onto the structural space spanned by the unscaled and scaled structures show that the sp³-like region of the scaled carbon dataset overlaps well with the majority of the silicon structures. (d) Mean Si test-set force RMSE, taken over 5 separate training runs, as a function of the number of scaled C structures seen during pre-training before fine-tuning on 400 Si structures.

Table 1 :
Per-atom energy RMSE (meV/atom) as a function of the pre-training source and the number of DFT-labelled training structures. Mean values are provided together with standard deviations over 5 separate fine-tuning runs for each pre-trained model. Values marked with a * indicate large standard deviations; in these cases, one or more of the direct-training or fine-tuning runs experienced a sudden increase in loss and failed to converge.