Efficient generation of stable linear machine-learning force fields with uncertainty-aware active learning

Machine-learning (ML) force fields (FFs) enable an accurate and universal description of the potential energy surface of molecules and materials on the basis of a training set of ab initio data. However, large-scale applications of these methods rest on the possibility to train accurate ML models with a small number of ab initio data. In this respect, active-learning (AL) strategies, where the training set is self-generated by the model itself, combined with linear ML models are particularly promising. In this work, we explore an AL strategy based on linear regression and able to predict the model’s uncertainty on predictions for molecular configurations not sampled by the training set, thus providing a straightforward recipe for the extension of the latter. We apply this strategy to the spectral neighbor analysis potential and show that only tens of ab initio simulations of atomic forces are required to generate FFs for room-temperature molecular dynamics at or close to chemical accuracy and which stability can be systematically improved by the user at modest computational expenses. Moreover, the method does not necessitate any conformational pre-sampling, thus requiring minimal user intervention and parametrization.


INTRODUCTION
Machine learning (ML) models for the generation of force fields (FFs) are becoming a prominent aid for researchers in different fields, including drug discovery [1], prediction of metastable structures [2], heterogenous catalysis [3], and more [4][5][6][7].In all these fields, ML permits to speed up calculations or to manage larger datasets, largely overcoming the problem of the computational costs inherent to electronic structure simulations.In recent years, many ML models for the generation of FFs have been presented, e.g.sGDML [8], BP-NNP [9][10][11], GPR based models [12][13][14], PhysNet [15], SchNet [16], FCHL19 descriptors combined with different regressors [17], moment tensor potentials [18], message passage neural networks [19][20][21], and many more.All these methods have been shown to be able to reproduce the potential energy surface (PES) of complex chemical systems with chemical or near-to-chemical accuracy.However, such incredible results often come with the burden of requiring a lot of electronic structure simulations to generate the necessary training data to reach high accuracy, often in the range of 10 3 − 10 6 calculations [22][23][24].Such a scenario poses serious challenges to the widespread use of MLFFs.
Decreasing the size of the training set is a non-trivial challenge that depends on many different factors.Among the most crucial ones there is the complexity of the ML architecture used to map the PES and the approach used to select a training set.Although simple ML models, such as linear ones, achieve less accuracy than complex ones, they often perform better for small training sets in virtue of being less prone to over-fitting issues.In this work we will focus on this class of MLFFs and investigate the possibility to further optimize their generation in terms of accuracy and training set size.
A conventional way to learn the PES of a compound is to first perform ab initio molecular dynamics to sample a relevant number of configurations and their energy/forces [9,22].This approach can potentially achieve a good performance on both training and test sets, as the most statically relevant structures are automatically included.However, such approach does not guarantee that redundancies are not also included, potentially leading to large computational overheads.Moreover, the accurate representation of a molecular PES also requires the sampling of statistically-rare conformations, which by definition are not captured by small-size molecular dynamics samplings.Crucially, when such rare conformations are encountered during a molecular dynamics run, the MLFF must be able to correctly predict their energies and forces in order to avoid leading to unphysical scenarios and a breakdown of the system stability.This serious issue thus often requires a second step where additional configurations are sampled from a MLFF-driven MD run to achieve the desired stability.
Active learning (AL) strategies have a big potential to overcome these issues and lead to the generation of optimal training sets.Active learning is the process of iteratively selecting data to add to the training set, according to a user-determined criterion.Ideally, such criterion must be chosen in order to i) iteratively add configurations to the training set only if they significantly differ from the ones already included in the training set, thus avoiding unnecessary overheads and ii) include all and only configurations required to training the model.Even if conceptually simple, achieving an optimal active learning strategy is far from straightforward.
One of the most used AL approaches is the query by committee [15,[25][26][27][28][29][30].In this method, multiple models are trained to learn the same training set, but with different sets of initial parameters, e.g.biases and weights in a neural network.For the same ML architecture, different models will generally perform similar predictions of energy and forces for molecular configurations similar to those sampled in the training set, but will widely differ if the information contained in the training set is not sufficient to extrapolate to new configurations.Therefore, the disagreement on the prediction of energies and/or forces among the committee of MLFFs is used to signal AL to stop and extend the training with a new configuration.
Another common approach to AL is based on Bayesian uncertainty prediction and Gaussian Process Regression ML models [14,[31][32][33][34]. Bayesian models are generally based on the idea of combining our prior beliefs on the phenomenon under study and observations to achieve predictive power for unlabeled inputs.One of the strengths of this class of methods is the built-in possibility to estimate uncertainties on predictions, thus leading to a straightforward implementation of active learning.
At the best of our knowledge, only three implementations of AL methods have been proposed for linear MLFFs.
Podryabinkin et al. [35] provided a mathematically rigorous definition of interpolation and extrapolation with respect to a given training set and proposed an AL strategy specifically tailored for linear ML models.This method requires defining a maximum degree of extrapolation that the regression can attempt without triggering the AL algorithm to act and has been successfully used to find new stable alloys and crystal structures [36,37].Some of the present authors instead tested a Gaussian metric over atomic environments' fingerprints to measure the similarity of newly encountered environments with respect to the structures spanned by the training set and trigger AL accordingly when the dissimilarity is above a certain threshold [38].Very recently, a linear ML model based on Atomic Cluster Expansion (ACE) [39] has been implemented together with an active learning process that combines elements of query by committee and Bayesian uncertainty prediction [40,41].Despite the successful use of Bayesian regression to perform AL in the latter work, no details on its robustness and implementation details were provided.
Similarly to the philosophy of Gaussian Process Regression, here we use the theory of linear regression to estimate the uncertainty of a model over predictions, provide a unified picture of all these approaches recently appeared in literature, and benchmark the capability of these principles to form the basis of an AL method for linear MLFFs.
We assess the validity of this AL workflow by benchmarking the performance of the spectral neighbour analysis potential (SNAP) [42] over learning the revised MD17 data set [43].Moreover, we apply our method to four molecules of growing complexity, including coordination compounds and open-shell systems, and demonstrate that the proposed protocol generates MLFFs able to withstand stable molecular dynamics at room temperature starting from only one configuration in the training set and requiring a small amount of ab initio training data.This strategy can be readily applied to other linear MLFFs and used to tackle a wide range of chemical systems.

Spectral neighbor analysis potential
The MLFF used in this work is SNAP [42].This method is based on the expansion of the total energy of the system in a sum of single atomic contributions, which are further expanded in a linear combination of bispectrum components where B k (i) is the k-th bispectrum component of atom i, and provides a geometrical description of its atomic environment within a cutoff radius R cut .N k and N i are the number of bispectrum components in the expansion and the number of atoms in the system, respectively.The coefficients c k (α i ) depend on the atom species identified by the index α i , which can take an integer value between 1 and N species , where N species is the number of atomic species in the system.A corresponding definition of forces in terms of bispectrum components can be easily obtained by taking the derivative of Eq. 1 with respect to the atomic positions.The terms B k (i) and their derivatives with respect to atomic coordinates are calculated using LAMMPS [44].For a dataset of geometries and energies/forces, Eq. 1 can be written as where Y is a N data ×1 vector containing the target quantities to reproduce, either values of forces or energies.Defining M = N kinds × N k with N kinds being the number of atomic species in the system, X is a N data × M matrix encoding Eq. 1, whilst the vector c assembles the coefficients c k (α i ).
The training of SNAP requires the minimization of the loss function where λ is a regularization parameter.The coefficients that minimize the loss function are thus given by [45] c Uncertainty-driven active learning The AL workflow requires the following steps: • generate SNAP with a starting training set; • run molecular dynamics and evaluate the uncertainty on the target quantity of the FF (energy and/or forces) at each step.If the uncertainty on the structure is higher than a certain threshold, an ab initio calculation is performed on the new structure and the newly available information is included in the training set and the model retrained.If the uncertainty is low enough, MD keeps running; • repeat the first two steps until the model can terminate a full MD of the desired duration without finding new structures.
The design of a method able to estimate the uncertainty and the definition of a stopping criterion are the key aspects of this method.In the following we detail the proposed protocol for such quantities.
The method for the estimation of the uncertainty is based on the classical theory of statistics of the linear least squares method.Let us first address two variables in order to easily visual the method's working principle.In this case Y and x are related by a linear mapping [46] where a and b are coefficients to be determined and Z captures a random component to Y for which we have chosen a capital letter in order to stress its statistical nature.Once the coefficients are determined, we can obtain a value of the prediction ŷ for every value of x.Crucial to our study, we can associate an error to the fit parameters and by propagation an uncertainty in predictions.This represents the core concept by which we predict uncertainty.In the 2-variable case, the estimation of the variance on the prediction ŷ is given by [46] where s z is the standard deviation of Z which is here approximated as the difference between training data and corresponding predictions, n is the cardinality of the training set, x is the mean of the distribution of x and s2 x is the variance of the values of x.Assuming the variables Z and Y to have a Gaussian distribution, it is possible to show that the interval with confidence level CL = 1 − α is given by where t 1−α is the quantile of the t-Student distribution with n − 2 degrees of freedom.• the variance associated with a prediction increases as we move away from the centroid of the distribution of data as compared to the variance of its distribution, due to the presence of (x − x ) 2 /s 2 x ; • it is bounded from below from how well the linear regression model fits the preexisting data.
In general, the input x has arbitrary dimension p × 1 and Eq. 5 has to be written as where c is a vector of coefficients as in Eq. 2.
We report here the generalization of Eq. 6 for the variance of the prediction for a multidimensional input where X is the matrix defined in Eq. 3 and The quantity n appearing in Eq. 10 is the number of labelled data in the dataset plus the number of equations corresponding to regularization and other constraints, while Y is the same as in Eq. 3. The relation in Eq. 7 is still valid, but now t 1−α is the quantile of the t-Student distribution with n − p − 1 degrees of freedom, where p is the number of parameters to be estimated.
If we want to weight differently specific subsets of data, e.g. in case different weights were given to forces and energies when both are used to train the model, we can introduce the transformed variables Ỹ = W 1 2 Y and x = W 1 2 x, where W is a N data × N data diagonal matrix with the square roots of the weights on the diagonal, i.e.W = diag(1, 1, . . ., w, w, . . .).By definition we fix the weights for the forces equal to 1 and energies are weigthed by the factor w. The linear regression then takes the form Defining F j as the force acting on an atom in the system along a certain Cartesian direction (index j runs both on atoms and Cartesian coordinates, e.g.F 1 acts on atom 1 along the x-axis, F 2 acts on atom 1 along the y-axis and so on) and E as the energy of a given configuration, the loss function takes the following form Eqs. 9 and 10 then become and Now that we have defined a rigorous way to estimate uncertainties on the ML model predictions through Eqs. 9 and 13, we are ready to discuss how these quantities inform the AL protocol.
A conventional approach to AL would require running a new electronic structure calculation every time the condition s > k thresh (15) is achieved, where s 2 is the predicted variance of residuals and k thresh is a static user-defined threshold corresponding to the desired accuracy.In this work we explore a dynamical definition of k thresh , such as where s z is the square root of the variance of the residuals for the training set calculated as in Eq. 10, and δ is set by the user.Setting such a dynamic threshold effectively allows to decouple the definition of stopping criterion for AL from the error on the training set.Indeed, k thresh then becomes identical to the square root of the quantities in square bracket in Eqs. 9 and 13, which are independent on s z and are bounded from below to the value of 1.This approach has the advantage to avoid that AL stops too frequently in case the error on the training set increases as new structures are included in it, and to make the definition of k thresh exportable across different systems.
The implementation of Eq. 15 is trivial when the uncertainty on energies is the only quantity evaluated.In such case Eq. 9 simply output a scalar quantity.However, in the case of training on forces, x in Eq. 9 and Eq. 13 is a 3N at × 3N at matrix, because we are simultaneously predicting 3N at forces components.The output matrix is the covariance matrix for the new prediction, where the diagonal elements represent the variances of the predictions on the single forces.In such case Eq. 15 is implemented by taking the largest value of the diagonal elements of the matrix s 2 .

Connections with Bayesian uncertainty prediction
Let us now briefly show the similarities between the method just outlined and the Bayesian approach reported in ref. [45].
In a Bayesian framework, a distribution a priori for the parameters must be defined, often taken as an isotropic Gaussian In Eq. 17, α measures the spread of the parameters around the mean and it is assumed to be equal to the identity matrix I for all the parameters.An a posteriori distribution can thus be obtained by combining the a priori distribution with the likelihood function.The a posteriori distribution is again a Gaussian function with mean m N and covariance where β is the inverse of the variance of Z.It can be shown [45] that the maximization of the logarithm of the a posteriori distribution in Eq. 18 is equivalent to the problem of minimizing the loss function in 3 with λ = α/β and that Eq. 19 is equivalent to Eq. 4. Given the a posteriori distribution, we can finally obtain the predictive distribution to make predictions y * on unlabeled input where σ 2 N (x * ) is given by By setting λ = α/β, the expression of the variance in Eq. 9 becomes equivalent to the variance of the prediction in the Bayesian framework in Eq. 22.In the latter approach the values of α and β are obtained by maximizing the evidence function [45].
An AL method for linear MLFFs exploiting Eq. 22 has recently be reported by Oord et al. [40].The key difference with our implementation lies in the fact that SNAP makes it feasable to use Eq. 9 as is, while in the work of Oord et al. [40] Eq. 22 had to be approximatively estimated due to the large number of unknown parameters in their model.

Connections with D-optimality design
As done in the last section, we here want to briefly unravel the similarities between the approach presented in this work and the one proposed by Podryabinkin et al. [35] and based on the concept of D-optimality.
Given a pool of unlabeled data, the D-optimality criterion states that the optimal selection of points to label is the one that maximizes the determinant of X T X [47].
To make this principle appealing for an on-the-fly AL procedure we have to quantify how much the determinant of X T X changes when a new unlabeled configuration is added to the training set.If we indicate with X the matrix including the preexisting training set with the addition of the new point x * , then det(X T X ) = det(X T X) where The term in Eq. 24 can be suitably rewritten in case of regularization or in presence of a weight matrix and shown to be equivalent to the second term in Eqs. 9 and 13.If we set a dynamic threshold an ab initio calculation is triggered when It can be easily shown that the criterion to stop AL in Eq. 26 is equivalent to the one in Eq. 15.At the best of our knowledge, this connection between linear regression uncertainty prediction and D-optimality has never been established before in the context of AL for linear MLFFs.

Learning the set rMD17 of ab initio molecular dynamics trajectories
We assess the validity of our method by benchmarking it on the rMD17 dataset.The latter comprises 100k configurations (geometries, energies and forces) sampled from a single trajectory of ab initio MD at 500 K for 10 organic molecules of size between 9 and 24 atoms.For all of them, we train one SNAP potential over either energy data (TE) or over forces (TF).For TE, the initial training set, namely the training set before AL starts, includes the first three configurations in the dataset, while for TF only the first structure of the AIMD trajectory is used.
We compare results obtained with three different training sets: • training-AL built with the AL workflow presented in this work; • 1000-Random obtained by training the model on 1000 random configurations; • N-Random built by training the model on the same number of structures found with training-AL but selected randomly.
The parameters λ and N k (see Eq. 1) are kept fixed to the values of 0.1 and 56, respectively.We test different values of R cut in the range [3.0, 5.0] Å with a step 0.5 Å for TE and 0.25 Å for TF.In Tab.I we report only the value of R cut that minimizes the error on training and test sets.The test set is the same for the three MLFFs, and it is made of 1000 configurations randomly selected from each trajectory in the dataset.
Results are reported in Tab.I and show that SNAP achieves a good accuracy on the prediction of energy over both TE and TF, despite having orders of magnitude less degrees of freedom compared to other models [48].
Interestingly, we obtain comparable results for all the training sets, including 1000-random and N-Random.On the one hand, this demonstrates that a even a few configurations are enough to obtain converged results, proper of large training sets such as 1000-Random.On the other hand, it is not yet clear how AL would improve over a random selection of the training set.It is important to remark that a comparable level of accuracy does not imply that the FF generated with different datasets lead to MD simulations of comparable quality.The main advantage of using the present method is to guarantee that the uncertainty of energy and/or forces on the configurations sampled by MD is not exceeding a certain value and requiring a minimum number of electronic structure simulations, thus minimizing the computational overheads and the instability of the MD trajectory at the same time.To prove this point we evaluate the uncertainty during the MD trajectory of aspirin by using the three different training sets and reports results in Fig. 2.
Fig. 2 nicely shows that the tail of the distribution of uncertainty on both energy and forces (top and bottom panel, respectively) has a much longer tail for the N-Random with respect to training-AL.This fact directly translates into a minimization of the probability that critical configurations are sampled during MD, where the error on the predicted forces is so large to potentially lead the simulation astray.As emphasized by the inset of the top panel of Fig. 2, training the model over a large values of energies, i.e. for the 1000-Random set, does not overcome this issue and the tail of the distribution of uncertainty during MD exceeds the one achieved with AL.
The same qualitative results are obtained for the training over forces (bottom panel of Fig. 2), with the only difference that in this case the uncertainty achieved over the set 1000-Random is dramatically suppressed.This is in agreement with the fact that such training set contains a large volume of information coming from 3N at values of forces for each MD frame, therefore largely exceeding the amount of data available in the other two sets.Interestingly, this result further demonstrates that the error over a training/test set is not an sufficient indicator of the robustness of a force field.Indeed, for the training with forces, all sets achieve similar RMSE values but perform quite differently in terms of uncertainty over predictions.

Boot-strapping of machine-learning force fields with active learning
The tests over the rMD17 dataset provides important insights on the ability of the proposed AL strategy to achieve a well-balanced training set with a minimal number of configurations.However, this analysis does not address some additional crucial challenges connected with the generation of MLFFs.Firstly, performing the training on pre-compiled datasets, such as rMD17, does not take into account the challenge of selecting realistic configurations to be added to the training set in the first place.Indeed, all the structures contained in the rMD17 set are realistic by construction, having been generated by AIMD.However, this situation does not correspond to common realistic scenarios, where the possibility to run AIMD, even if short, would at least partially defeat the purpose of generating a MLFF.Alternative ways to kick-start the generation of a MLFF with AL have been explored.One simple but often inefficient approach consists in generating random atomic distortions, while displacing molecules along normal mode coordinates has also been tested.Alternatively, if another force field is available, one can use it in place of ab initio methods to generate a first conformational sampling.[40].Although widely used, pre-sampling methods often lead to a biased training set, thus posing limits to either the accuracy or the stability of the final MLFF.
In this section, we show that the proposed scheme is able to achieve the efficient training of a robust MLFF starting from the sole equilibrium configuration of a molecular compound.We perform simulation for four molecules of growing complexity, whose structure is reported in Fig. 3: benzene, aspirin, VO(dmit) 2 (where dmit=1,3-dithiole-2-thione-4,5-dithiolate), and Cr(ppy) 3 (where Hppy = 2-phenylpyridine).Whilst benzene can be regarded as a toy system, the generation of a MLFF for aspirin already presents some real-scenario challenges connected to its flexibility.On the other hand, VO(dmit) 2 and Cr(ppy) 3 are two coordination complexes with an open shell configuration of interest for the communities of molecular magnetism [49] and photo-luminescence [50], therefore rightfully belonging to the class of realistic systems.There is very sparse literature for coordination or magnetic compounds compared to organic molecules and we here provide evidence that the proposed AL scheme is general enough to deal with the inherent complexity of such molecules.
The AL protocol is implemented as for the rMD17 set, with the crucial difference that the MLFF is used to propagate the MD from the very beginning, therefore not relying on a pre-compiled trajectory.This is particularly challenging during the early stages of the training, where only very limited information is available to the ML model and the prediction of forces will in general be very poor, potentially leading to catastrophic results.
We arbitrarily define the AL simulation converged when the algorithm has completed five consecutive MD trajectories of 100 ps without finding new structures.This criterion allows to reinitialize the velocity periodically and enforces an ergodic exploration of the configuration space.As for the rMD17 test, the initial training set is constructed with just three configurations for the training over sole energy values.The latter three configurations correspond to the equilibrium structure and two randomly displaced structures by 0.05 Å max.For the training with forces, we instead trained the first MLFF using only information coming from the equilibrium structure.Finally, the test set is constructed by taking 100 configurations sampled every 1 ps from the last MD trajectory explored during AL.MD is performed at 300 K using the thermostat by Bussi et al. [51].The cutoff radius of the N k = 56 bispectrum components is set to 4 Å for all chemical species and the regularization value λ is set to 0.1.All ab initio calculations are performed with the software ORCA [52].For all four systems we employ the PBE functional [53], with the basis set def2-TZVPP and def2/J auxiliary basis for the RI approximation.In the case of benzene, VO(dmit) 2 and aspirin D3 vdW corrections are employed.[54,55] The results on the root mean square errors (RMSE) and final training set size are shown in Tab.II.For benzene and VO(dmit) 2 , the model achieves chemical accuracy on both training and test set (RMSE < 1 kcal/mol) for every value of δ.Notably, the number of structures required to achieve the generation of a stable and accurate force field is dramatically reduced by training on forces instead of the sole energy values.Fig. 4 further emphasize this result by reporting the number of MD steps performed before a new DFT calculation is requested by the AL algorithm.We further test our model on the more challenging Cr(ppy) 3 and aspirin.The training with energy once again achieves very good results close to chemical accuracy.Given the higher structural complexity of these two compounds, more structures are needed to converge the AL simulations.However, differently from the previous two compounds, the training on sole forces this time leads to the generation of force fields that reach unphysical configurations during MD.Even using small values of δ, very close to the lowest limit of 1, does not fix the problem.We overcome this apparent limit of the AL algorithm by combining the benefits of training on energies (stability and accuracy) with the benefits of training on forces (convergence rate) by training the model on energies and forces at the same time.We have set w = 9 and w = 81 for aspirin and Cr(ppy) 3 , respectively, in Eq. 12 and w = 1 in Eq. 13.The results reported in Tab.III demonstrate the viability of this approach and show that it is possible to obtain comparable performances over the test set's RMSE by training with either sole energies or energies and forces for complex compounds.Crucially, in the latter case, only a small fraction of ab initio calculations are required.

DISCUSSION AND CONCLUSIONS
The use of machine learning to map the PES of chemical compounds has revolutionized the field of materials modelling, opening up the possibility to simulate nmsized systems over extended time scales and to sample extremely large portions of the chemical space [4].Since the inception of the field, several different approaches have guided the development of new MLFF frameworks.Important achievements have been reached in the development of elaborated ML models able to fit the PES of chemical compounds with extraordinary accuracy, including for instance long range and non-local interactions [15,16,56].Moreover, certain MLFF frameworks have been shown to be able to learn the PES of entire classes of compounds and to generalize to molecules not included in the training set [11,57,58].
In this contribution, instead, we focused on a different approach, where training robustness and efficiency are valued at the same level of accuracy, at the expense of transferability.We believe that this approach is also required to fulfil all the needs of the MLFFs community.Indeed, given the complexity of the chemical space, we are still far away from having a universally accurate and robust MLFF able to predict the PES of any molecular system, and the application of MLFFs to new chemical systems often requires the generation of de-novo dedicated training sets.Such a computationally-demanding task must be dealt with as efficiently as possible in order for MLFFs to become a standard computational tool.
Whilst advanced MLFF frameworks are able to accurately map the PES of relatively simple organic molecules, their training is quite nuanced and often computationally expensive.Moreover, no evidence is yet available on their application to complex compounds with many chemical species.On the other hand, transferable force fields able to predict the PES of general organic compounds are now available, but only for a small number of ethero-atoms [57], and with the important exclusion of coordination compounds of transition metals and rare earths.The latter are key for the simulation of bio-inorganic system, luminescent sensors, catalysts, etc.
Here we have shown that linear models, once combined with an uncertainty-aware active learning strategy, are able to accurately approximate the PES of complex chemical systems with only a handful of electronic structure calculations and without requiring a, often biased, pre-sampling of the conformational space.These key features make it possible to readily train a MLFF for a new compound in a very short amount of human and computational time.Importantly, we have demonstrated that the resulting MLFFs are able to withstand MD at room temperature, which we advocate it should be introduced as a key metric to assess the quality of a MLFF.
It is also important to remark that the method outlined in this work employs only three hyper-parameters, namely the cutoff radius of the bispectrum components, R cut , the relative weight of energies and forces, w, and the active learning threshold, δ.Chemical intuition naturally guides the choice of an optimal R cut , while tests suggest that excessive fine-tuning of the other two hyper-parameters is not required.Having just a few, not too sensitive hyper-parameters is a key aspect of an efficient and robust MLFF framework, as it makes the model more user friendly and potentially compatible with high-throughput and automated workflows.
Several avenues of future development can be envisioned.First and foremost, an in-depth study on the dependency of the MLFFs' accuracy and stability on the choice of the atomic environments' descriptors is required.Here, we implemented our linear ML model with bispectrum components as atomic environments' fingerprints, which we believe offer some advantages.For instance, bispectrum components provide quite a compact description of atomic environments and their number scales linearly with the number of atomic species.Throughout this work, we have used 55 bispectrum components per chemical element, thus never exceeding a total number of adjustable parameters of 224.On the one hand, this small number of descriptors allowed us to generate accurate MLFFs with only a small number of reference ab initio data.On the other hand, a descriptor with only a few degrees of freedom poses a limitation on the accuracy that can be reached by increasing the training set size.Other descriptors such as ACE, the ones used in Moment Tensor and Jacobi-Legendre potentials [59,60], have been used as building block for linear MLFFs, and might offer a better trade off between accuracy and robustness and merit further investigation.
Another aspect that will require further work concerns the exploration of the limits of linear ML models and AL to deal with a varied number of chemical systems.Although in this work we focused on gas-phase molecular systems, the method should readily apply to condensed matter systems just as well, provided long-range interactions are included in the model.The inclusion of electrostatic and dispersion interactions into MLFFs frameworks has recently received large interest and several promising schemes are now available [61][62][63].
The method explored in this work is also promising in terms of the types of chemical properties that can be predicted.Indeed, an equivariant version of linear MLFFs based on SNAP have been recently proposed for the mapping of tensorial properties [64], and the method discussed here readily applies to that scenario, thus further extending the scope of the present work.
In conclusion, we have here presented an AL protocol for linear machine learning models able to produce accurate results for complex molecular systems with a minimal number of ab initio data and requiring minimal human intervention.We applied this strategy together with the model SNAP and tested its performances over the rMD17 dataset and on the generation of FFs from scratch with no preexisting dataset.This method successfully leads to force fields able to withstand accurate MD at room temperature with only tens of training configurations, thus paving the way to the automatic and efficient generation of MLFFs for challenging chemical systems.

Fig. 1 ,
Fig.1, based on Eq. 6, is very instructive about how the error is estimated[46]

FIG. 1 .
FIG. 1. Prediction uncertainty for linear models.Black dots are the arbitrary data fitted with a linear model and generated by adding random Gaussian noise to 20 values of y sampled from the function y = 3x − 2. The best-fit line is reported in orange.The two blue lines corresponds to ŷ ± 5s (see Eq. 6).

FIG. 2 .FIG. 3 .
FIG. 2. Uncertainty distribution of predictions during MD.The top (bottom) panel reports the distribution of the uncertainty evaluated on the trajectory for aspirin taken from the rMD17 dataset using the TE (TF) method.Results are plotted for the three different training sets, namely training-AL in green, N-Random in violet and 1000-Random in cyan.The insets report a zoom over the tail of the distributions for large values of uncertainty.A vertical black line marks the value of k thresh used during AL.

FIG. 4 .
FIG. 4. Acquisition curve for VO(dmit)2.The plot shows the number of steps performed during MD in an active learning cycle before finding a configuration to include in the training set for δ = 1.5.Results fro the training on only energy values are reported in green and the ones for a training on forces are reported in violet.

TABLE I .
Root mean square error on training and test sets for trajectories in the rMD17 dataset.Results related to TE (TF) are reported out of (in) parentheses.The RMSE of energies and forces are reported respectively in kcal/mol and kcal/mol/ Å.Cutoff radius values are reported in Å and have been chosen to minimize the error on the test error of the energies (TE) or forces (TF).Results are obtained by setting the value of the threshold parameter δ to 1.5.For every compound, we report three rows corresponding in order to the results obtained with training AL, 1000-random and N-random.

TABLE II .
RMSE on training and test sets for four selected compounds.The RMSE of energies and forces are reported in kcal/mol and kcal/mol/ Å, respectively, for the different sets, i.e. training (Tr) and test (Te), and for training performed on either energy (out of parentheses) or force values (in parentheses).The training set size (TSS) selected by the active learning algorithm is also reported for different values of the threshold parameter δ.
TABLE III.RMSE on training and test sets for aspirin and Cr(ppy)3.The RMSE of energies and forces are reported in kcal/mol and kcal/mol/ Å, respectively, for the different sets, i.e. training (Tr) and test (Te), and for training performed on energy (out of parentheses) and force values (in parentheses).Training in this case has been done both on energies and forces.The relative weight (energies:forces) when solving the regression problem is 3:1 for aspirin and 9:1 for Cr(ppy)3.The training set size (TSS) selected by the active learning algorithm is also reported for different values of the threshold parameter δ.