Training models using forces computed by stochastic electronic structure methods

Quantum Monte Carlo (QMC) can play a very important role in generating accurate data needed for constructing potential energy surfaces. We argue that QMC has advantages in terms of a smaller systematic bias and an ability to cover phase space more completely. The stochastic noise can ease the training of the machine learning model. We discuss how stochastic errors affect the generation of effective models by analyzing the errors within a linear least squares procedure, finding that there is an advantage to having many relatively imprecise data points for constructing models. We then analyze the effect of noise on a model of many-body silicon finding that noise in some situations improves the resulting model. We then study the effect of QMC noise on two machine learning models of dense hydrogen used in a recent study of its phase diagram. The noise enables us to estimate the errors in the model. We conclude with a discussion of future research problems.


Introduction
A key technology in the today's materials science and related fields is the ability to perform simulations of atoms and molecules using Molecular Dynamics (MD) or Monte Carlo (MC) methods.Because electrons are more than a thousand times lighter than the nuclei, at room temperatures we can often assume that electrons are in their ground state.We can then "integrate them out" and deal with the motion of the nuclei alone by using the Born-Oppenheimer separation between electronic and nuclear degrees of freedom.Once the electrons have been eliminated, the resulting "potential energy surface" (PES), V (R), is a function of 3n I coordinates of the n I nuclei.Traditionally this potential has been approximated by an empirical function, such as a pair potential, or calculated 'on-the-fly' using density functional theory [1].
Much progress has been made in approximating the PES using a neural network or machine learning (ML) model by Behler [2], Bartok [3], Zhang [4,5,6,7], Chmiela [8] and many others.In this approach, the ML model is trained (or fit) to a set of energies and forces of typical configurations of the modeled system.The data often comes from a density function theory (DFT) or other electronic structure method.However, such data can be biased by approximations used in those methods.
We examine in this paper whether stochastic methods, such as variational and diffusion quantum Monte Carlo can be used to provide the training data so that the resulting model will be more accurate.Such an approach has been considered previously, for example in work by Alfe [9], Sorella [10,11], Niu [12], Huang [13], Nakano [14] and Cheng [15].But several aspects of this approach, such as the effects of the noise on the model, have not been discussed, to our knowledge.
In the following we introduce QMC methods, and discuss various errors that enter into the models and their relationships.We then discuss several examples such as those based on linear regression, a ML model for silicon and for dense hydrogen.We conclude with observations on future research questions.

Quantum Monte Carlo Methods for electronic structure
We briefly review some QMC algorithms that estimate electronic properties using a stochastic process.We write R = {s 1 , s 2 . . .s n I } for the n I nuclear coordinates with charge Z j on the j th ion and r = {r 1 , r 2 . . .r ne } for the n e electron coordinates.The non-relativistic electronic Hamiltonian in atomic units is: R α will refer to the α th sampled configuration (a 3n I vector) used for training or evaluation.By F j (R) = −∇ j V (R) we mean the 3 vector of forces on the j th ion for a configuration R.
We mention the QMC methods that we think will be useful for generating machine learning data.For a detailed discussion of the various methods see [16].
Variational Monte Carlo (VMC) Metropolis (i.e. Markov Chain MC or MCMC) [17] is used to sample electronic coordinates from the square of a trial wavefunction |Ψ T (r|R))| 2 and calculate its (variational) energy: The simplest correlated trial function used in QMC is the Slater-Jastrow form: where U (r|R) is a bosonic Jastrow factor, and ϕ k (r|R) is a set of single-electron orbitals.Both of these terms are varied to minimize E V (R).There has been recent work on machine learning trial wave functions [18,19,20,21,22,23] which achieve excellent variational bounds.However, the current methods are quite time consuming, requiring very long optimizations.We argue that ML needs "lightweight" QMC methods in order to provide sufficient training data.For example, in dense hydrogen we work with analytic forms requiring little optimization of trial functions [24].Forces are then calculated as the derivative with respect to the ionic coordinates, invoking the Hellman-Feynman theorem [25].
Projector MC takes a trial wavefunction and uses the Hamiltonian to project out errors of the trial function: Φ = lim t→∞ e −t Ĥ Ψ T .
There are several methods by which this can be accomplished.Since electrons are fermions, the methods must employ an ansatz to control the stochastic error for many-electron systems such as the fixed-node [26] or fixed-phase [27] approximations.Fixed-node and fixed-phase methods solve exactly for the modulus of the wavefunction within statistical errors but the bias from the ansatz can be significant [16] if only a single Slater determinant is used for antisymmetry.One can use signful methods [28] to correct this error.Diffusion MC (DMC) [26] accomplishes the projection by representing the system as an ensemble of many-body configurations that execute a drifting, branching random walk.DMC has a mixed estimator bias for forces; there are several procedures for correcting this bias.
Reptation Quantum Monte Carlo (RQMC) holds a segment of a many-body path in the computer memory and adds or subtracts from the path using a modified MCMC [29,30].If the projection time is large enough, forces taken in the middle of the path do not have the mixed estimator bias but they will still have the fixed-node error.
Path Integral methods [31] work at non-zero temperature and differ from RQMC in working with closed paths in configuration space.PIMC methods could address important problems for warm dense matter addressing a regime that is difficult for most electronic structure methods.However, PIMC has only been made practical for temperatures above some fraction (25%) of the Fermi temperature.Sampling is done with MCMC and must include sampling electron permutations.Early work calculated forces between protons in an electron bath [32].
For ML the ability to compute forces is crucial.There are several methods for computing forces with QMC [33,34].For generating ML data the efficiency, namely how much computer resources are needed to reach a given error on the forces is important.More research is needed on the relative efficiencies, the biases, and the appropriate domains of the QMC methods for the calculation of accurate forces.
In the past, many researchers concluded that the computer resource requirements to compute forces using QMC were excessive.However, available resources continue to grow year-by-year while the cost of human resources needed to come up with reliable predictions is not decreasing.A more reliable electronic structure method can reduce this human cost and thereby allow for more efficient simulations of materials.One of the main roadblocks in the past has been the availability of high quality QMC codes but this situation has improved with development of such packages as QMCPACK [35,36], TURBORVB [37], and CASINO [38].In addition to more plentiful computer resources, QMC techniques are improving.We believe their efficiency could be increased much further as we discuss below, especially for the problem of determining unbiased forces.
QMC, in general, is more accurate than other electronic structure methods.By that we mean that it has a lower bias than other methods such as those based on DFT where the choice of the functional is guided by external experimental knowledge.Experimental data of high quality is rarely available in many applications.Its low bias is the most important reason to use QMC to generate data.QMC is also a more general technique since it is able to handle highly correlated systems such as lattice models, superconductors, quantum crystals, and liquid helium.
The continuum basis set used in VMC, DMC, RQMC and PIMC is better suited for disordered systems, such as a liquid, since distortions caused by a finite basis set are absent.For deterministic methods, one needs to assure convergence in the basis set, e.g. the plane wave cutoff or the expansion in local basis functions.
A very important advantage of stochastic methods is that extra degrees of freedom that need to be integrated over do not necessarily increase the computational effort.For example, twist averaging [39], is used to eliminate some finite size effects by integrating over the Brillouin zone of the super-cell.If we compute QMC properties at P twist values, this does not increase the QMC computation effort since the P points all serve to reduce the statistical error.Instead, within deterministic methods such as DFT, this will increase the computer time by a factor of P .Another example of this advantage occurs in the path integral algorithm for quantum protons which requires the electronic energy of several replica of the protons to accept an update [24].The same logic applies to getting data to train a ML model.The forces at a large set of configurations can be obtained using the same computer budget as a small set of configurations although the stochastic errors of the large set will be larger.But as we show below, the large set is better for constructing the ML model.Hence there is less importance to selecting an optimal set of configurations to use in training the model.
We expect that important applications of machine learning will be to complex, multi-component, many-body systems, for example a ternary liquid.For those systems, one does not know how many configurations will be required, how to place them, and how to assure that the energies are unbiased.With QMC, one can oversample the configuration space.Even if many of the points are clustered, the computational effort serves to reduce the error in the fitted potential energy surface.We do not wish to imply that methods to place the QMC points in an efficient manner will not improve the efficiency but that optimal placement is not always necessary.
Noisy data is also good for training neural networks and is often added to deterministic methods to aid in optimization.It can help to find the best model more easily [40].With noisy data, one does not need to worry about data clustering: one will not have a problem with oversampling.
The QMC computer time scales well with the number of electrons compared to other methods such as coupled cluster.For a recent discussion of the scaling and accuracy of various electronic structure methods see ref. [41].For systems with many hundreds of delocalized electrons the computation of determinants to enforce antisymmetry will cost n 3 e operations to do a multiparticle move.This is the same scaling as in DFT algorithms.(Note than one is calculating 3n I forces during this move.)In practice, because computing the determinant is a simple linear algebra operation, QMC time is typically dominated by other operations such as calculation of the charge-charge interaction and scales more slowly than n 3 e .QMC can easily profit from massively parallel computer environments since memory demands can be kept modest.
We will examine some of these issues in the rest of the article.We focus on methods that do not have basis set errors (i.e.work in the continuum) and work in the electronic ground state, that is the VMC, DMC and RQMC methods.

Errors in Quantum Monte Carlo and Machine Learning Models
Let V (R) be the exact energy (i.e. the "ground truth") and F k (R) = −∇ k V (R) the corresponding force on the k th atom.The energy and forces will be estimated using a method for numerically solving the ground state Schrödinger equation for the nuclei in a given position obtaining for a set of M configurations {R α } (1 ≤ α ≤ M ) with estimated values for the potential ( Vα ) and forces ( Fk ).Using this data we then make a model V m (R).To help with further discussion we explicitly list and discuss the errors and their relations.Although the listed formulas below are for energies, the analogous errors in the forces are more important in ML.
The model error is the deviation of the model with respect to the ground truth: Figure 1.The left diagram shows the error notation used in this paper, with the defining equations on the right.The errors {⃗ σ V , ⃗ ρ V ,⃗ ϵ V } are vectors whose components are defined on each configuration R α .Their normalized moduli (σ V , ρ V ,ϵ V ) are defined as the root mean square of the corresponding vector (i.e.divided by the number of configurations).For the analogous force errors, (⃗ σ F , ⃗ ρ F ,⃗ ϵ F ), one sums over the 3n I Cartesian components of the mean squared differences as well as over the configurations and divides by 3n I M .
Even with good data and training, unless the model is a sufficiently general function of 3n I variables, the model error will be non-zero.
The stochastic error is the difference between the QMC value (with noise) and the exact value: For simplicity, we assume that QMC is unbiased so that ⟨σ α ⟩ = 0: after enough averaging one will recover the exact energies and forces.Estimates of the stochastic errors on energies and forces are determined from the fluctuations in the QMC run, from the same data as the mean values and will depend on the ionic coordinates.An example of QMC noise for hydrogen is shown in Fig. (4).
Fitting error tells us whether the model fits the noisy, but unbiased data: Chisquare is the negative logarithm of the likelihood that the model fits the QMC data: Selection error.Since the assumed model is not perfect, it matters how many configurations M are selected for training and where they are placed.We will assume that they are selected independently from a probability density F (R) by using a stochastic process such as MC or MD.How exactly they should be selected (e.g. the function F (R)) is an important question which we will not address in this paper.Note that probability distribution for assessing the fitting and model errors, the testing distribution, could be different from F (R).
Inter-model or cross-mode validation error] is the difference in the energies and forces between two fitted models (m1 and m2): It gives an indication of the importance of the stochastic error, the selection error, the model error and the fitting robustness on the models.
Bias of the data means that ⟨σ α ⟩ ̸ = 0.This could either be a result from controlled errors, those that can be studied and eliminated such as time step error, basis set error, convergence errors, or uncontrolled errors such as the fixed-node error of projection quantum Monte Carlo.
Finite-size errors result from data being generated for small systems but used in simulations for larger systems.The QMC calculations are on systems containing hundreds to thousands of electrons so that corrections are needed [39,34,42] to use the data of models for larger systems.
One of the goals of Uncertainty Quantification is to estimate how accurate the model is.We can easily estimate σ and ρ but ϵ is more difficult since we do not have precise estimates of the energy or force for many-body systems.We can consider these 3 errors to be vectors with N c components where N c = M for the energy errors and N c = 3n I M for the force errors.In this paper we define the norm of an error vector as σ = ).If we define the vectors on a common set of configurations, the energy and force norms will obey the triangle inequalities: The norms should converge to a limit as M → ∞ so that the inequalities will apply in that limit.In that limit we expect that ϵ V will converge to lim A finite number of configurations is a stochastic (i.e.Monte Carlo) evaluation of this integral.We can construct other triangle inequalities using different models.
We can make tighter bounds on the model error, ϵ, by considering how the fluctuations in the QMC data could be correlated with the errors of the model.Suppose we assume ⃗ ϵ = η⃗ σ + ⃗ ϵ 0 where ⃗ ϵ 0 is the model error vector with zero noise σ = 0.In the most extreme procedure the model could consist of taking the closest configuration so that η = 1.For a good model which is relatively unaffected by the QMC noise because the noise is filtered or averaged over, and for models such as those based on a DFT functional or an empirical potential, we expect η ≈ 0. Because the vector of QMC errors, ⃗ σ, is uncorrelated with the ideal model errors ⃗ ϵ 0 : In the case of plentiful data, we expect η → 0, so that ϵ → ρ 2 − σ 2 ± σ/ √ N c .In general, we expect that 0 ≤ η ≤ 1, leading to tighter bounds on ϵ in terms of σ and ρ: To estimate the statistical uncertainties within a model, we can use a K-fold crossvalidation, a variation of the jackknife procedure in statistics.One partitions the set of configurations into K "folds" possibly stratifying the selection of the folds so that each fold contains roughly the same number of configurations at a given density and temperature.The number of folds may be limited by computer resources since each fold requires a model optimization.From the folds we make K training sets by sequentially dropping a fold from the complete data set.Using those training sets one finds K models V mk (R) and an average model V m0 (R) = k V mk (R)/K.Summed over a test set of configurations, one finds the rms differences in the energies and forces between the models and the average model.The jackknife estimated model error for the energy is: with a similar relation for δ F but normalized by an additional factor 3n I .These errors estimate the overall effect of the stochastic QMC errors, the selection errors and optimization errors, i.e. how would the energies and forces change if one redid the complete analysis with M newly sampled configurations, new QMC calculations on those configurations and a new optimization.They do not include model errors.
The stochastic and selection errors come from the finite sampling and can be reduced by more sampling.The model error can be estimated by showing that the deviation of the model from the data is larger than expected on the basis of Eq. ( 14).To quantify the bias and finite size errors, one needs to perform systematic studies of convergence, for example by changing the trial wavefunction or varying the number of atoms.

Least Squares Analysis
The simplest procedure for finding a model to represent the QMC data is to perform a linear least squares fit (LLSF), i.e. a regression.Some leading ML models such as Gaussian Approximation Potentials (GAP) [43] are based on LLSF.Non-linear models can be approximated by linear models near their solution.LLSF is a standard numerical method with a guaranteed single solution and methods for assessing the resulting error bars on the model.
We illustrate how having stochastic error affects the results with a simple 1D quartic function, shown in figure (2), by using LLSF with a polynomial basis.We assume that the parameters of the quartic potential (the ground truth) are unknown.The   with error bars represent QMC results for locations chosen at random and uniformly in the range −3.5 < x < 3.5.To the "exact" but unknown potential, we added normally distributed noise with a standard deviation of σ V = 1.The noisy potential energies were then fit to a polynomial of degree n using LLSF.The fit is the red dotted curve for a quartic polynomial, i.e. n = 4.The QMC derived potential energy surface is more accurate than the individual QMC energies and is only different from the exact PES around x = 2 by a maximum of 0.4 even though the data had a standard error of 1.
The average error defined in Eq. ( 13), computed by propagating errors in the data to errors in the solution (see Eq. ( 17)) is found to be ϵ V = 0.326.Using 5-fold crossvalidation we found ϵ V = 0.346.By fitting to a quartic polynomial, we have assumed that the PES must be smooth.If we had fit to a higher degree polynomial as shown in table (2), the errors would not have decreased.Using the value of ρ, σ, ϵ we can compute the sensitivity parameter and find η = 0.11.This is reasonable since we have 5 free parameters in the fit with 50 data points so η ≈ 5/50 = 0.10.
For LLSF we want the amount of data to be greater than the number of free parameters.To determine a quartic polynomial, exact energies at five distinct points would have been enough.A deterministic method on this particular low-dimensional problem would, of course, be preferred for generating data since one can saturate the configuration space with accurate noise-free data.However, one cannot saturate the space for realistic models of bulk materials.
The table in Fig. 2 shows χ 2 V vs polynomial degree.One has extremely poor fits using quadratic or cubic polynomials, but using a quartic polynomial is excellent.Increasing the order beyond this does not improve the fit and is overfitting the data.The computed error, shown in the bottom graph, is excellent in the region −3 < x < 3 but it degrades outside of the region where data is present.Now consider a general least squares fit (LLSF) in a space of arbitrary dimension.Expand the model as where f k (R) are P basis functions.Suppose we have unbiased energy data consisting of M configurations {R α } sampled from a distribution F (R) with estimated potential energies Vα and standard errors σ α .We determine a k by minimizing the likelihood with respect to a k .The solution is determined by solving linear algebraic equations in the P unknowns [44].
The variance of the energy of the fitted model at a point R δ is given in terms of the inverse of the matrix: The above brackets ⟨. ..⟩ mean that one averages over both the electronic structure process that determines the noise σ α and the selection of points R α sampled from F (R).
Let us examine how δ(R) depends on the "running" computer time T (i.e. the computer time when you are actually generating forces and energies.We exclude from T the start-up time for the QMC process).We assume that the variance of the QMC noise σ α is independent of R α , and is given by for some coefficient C. T α is the computer time expended for configuration R α .This implies that the mean value of the matrix A, where ⟨. . .⟩ F means the average over the selection of points R α .We assume that T α is constant or not correlated with R α .Then it follows from Eq. ( 20) that the mean value of Âmn depends only on the total computer time, T = α T α .From Eq. ( 17), the error in the model will have the same property: the error in the fitted model is only proportional to T −1/2 : it does not depend on how many points we use to determine the data or how that time is allocated between the various configurations as long as the overall computer time is fixed.This conclusion holds when the configurations are independently sampled from some distribution, the model potential is determined by minimizing the mean squared deviation of the model energy from the noisy energy and the QMC errors are given by Eq. (19).
It is quite important to use forces in the training because they are much more plentiful than total energies by a factor of 3n I .Also forces give local information and integrate well with the ML models which are also local.The likelihood function for optimizing the model including both energies and forces is The likelihood function is the correct object to minimize if the noise is componentby-component independent and if the fluctuations are normally distributed.Both are reasonable assumptions for QMC data, though there can be deviations.The parameter λ should be set to one for maximum likelihood but it is often treated as a free parameter.
Including the contribution of the force data, the matrix determining the errors gets an extra contribution: where ∇ β,i is the derivative of the β th Cartesian component on the i th atom.The above discussion for the energy follows through.The values of the matrix elements in A have changed, becoming much larger, implying that the stochastic error on the model is much reduced, however the average matrix elements still only depend on the total computer time, not on how many configurations are present.
In the above discussion, we have made some assumptions that need to be tested to see if they apply to realistic problems.What follows are two examples that show this property.First, a model system of many-body silicon where we know the interatomic potential and, second, an actual QMC calculation of many-body hydrogen at high pressure.

Silicon example
In this section, we give an example to illustrate the effect of stochastic error in training a neural network potential where the exact potential is known but where the potential has more realistic features.We use the Tersoff potential [45] for Si, included in the LAMMPS program [46], as the ground truth.A MACE [47] model was trained on a set of configurations.We do not expect that the MACE model will be able to capture the Tersoff potential perfectly though it should not be too difficult.
The training set was generated by isothermally compressing 64 Si atoms at 3000 K from 0.467ρ 0 to 1.282ρ 0 , where ρ 0 = 4.994 × 10 −2 Å3 is the ambient density of Si solid.This training set provides a diverse set of local environments over a wide range of densities.We used the MACE model [47] an equivariant message passing model that uses higher body order messages with 32 invariant messages, angular resolution l max = 3, and a cutoff radius of 3.2 Å, the same as the range of the Tersoff potential.Radial features are generated using 8 Bessel basis functions and a 5 th order polynomial envelope for the cutoff.The loss function is a parameterized likelihood function with λ = 1.3 ‡ Models were trained with AMSGrad variant of Adam, with default parameters, a learning rate of 0.01 and a batch size of 4.
We trained the model on different sizes of training sets ranging from 660 configurations to 4620 configurations.We performed 3 different optimizations on the same data with those sizes.The first set used the clean data, data without noise, to get a baseline.The trained model got better as more data was added, presumably by covering phase space better.Using the same configurations, we then added normally distributed noise to both the energies and forces but using a larger amount of noise as M increased as detailed in Table (1).Because the statistical noise is inversely proportional to the square root of how long one samples in QMC, assuming a fixed total computation budget, the standard deviation of the noise on the energies and forces followed σ ∝ √ M .For each configuration we subtracted the mean value of the force.Without this modification, the potential energy would be inconsistent with the forces since any model that only contains differences of coordinates will have a zero mean force.
The values of σ, M and the resulting ϵ are shown in Tables (1,2) and in figure (3).Table (1) gives errors as evaluated on an independent test set of 100 configurations (set Ω 1 ).This test set was generated with an MD trajectory obtained by heating a silicon crystal at ambient density from 1000 to 3000 K in 5 picoseconds.This set was not used in the training.Table (2) shows the errors on a different test set (denoted as Ω 2 ) which was a subset of 400 configuration of the training set with densities within 6% of ρ 0 .Strikingly different errors are obtained depending on the number of configuration used for the training and the makeup of the test sets.
We see several things from these examples.The most accurate model was obtained for the largest data set (M = 4620) using the clean data for training, obtaining a model with ϵ 0 V = 13 meV and ϵ 0 F = 1.2 meV/ Å.Clearly, MACE is able to represent accurately the Tersoff potential near ambient conditions of density if given enough data.Note that the highest precision on the clean data was only achieved with relatively large data sets; the errors on the energy for smaller data sets were quite erratic.This could be a consequence of the value of λ and the preponderance of force data.On the same data with 60 meV noise, the errors of the model trained with noise were about three times as large.The force errors showed a steady decrease as more data was added and accurate force models were obtained with fewer than one thousand configurations.This illustrates how much more information is provided by forces versus total energies and the insensitivity of the model to noise.
We also see that for the test set (Ω 1 ) with densities close to ambient pressure of silicon, the force errors were 20 to 30 times smaller than for the subset of the training set (Ω 2 ) that had a much greater variation in density.The model is much more accurate when tested over a very narrow range of densities even though it has been trained over a much wider range of densities.
While the model energy was affected by the noise, except for the largest data set, the forces were not affected by the noise.Additional noisy data always improved the error in the forces even though the additional data had larger statistical errors.From the results of this example we conclude that one should always run with the maximum possible value of M even though the resulting statistical errors are increased.Our analysis in the previous section of LLSF remains valid for non-linear models.In practice, the best choice for M will be influenced by the startup time of the QMC relative to the overall computer budget and the time to train the model.
We see from Table (2), that the sensitivity parameter η F is small and decreases as M increases.This implies that we can estimate the model error directly as F based just on quantities that can be easily computed in a realistic model where we cannot compute the exact potential.However, the results for η V , the energy value, are erratic and reinforce our finding that the energy model evaluated on test set Ω 2 is not well converged.
In a third set of optimizations we did not subtract out the mean force.The results are shown in Figure (3), The force error was little affected but the energy error was increased when the training had fewer than 2000 configurations.It is possible that the choice of λ in the loss function did not put enough importance on the energy errors versus the force errors.1.Energy (in meV) and force errors (meV/ Å) as a function of the number of training configurations, M , for the Tersoff model.σ V is the standard deviation of the noise added to the energies and σ F to the forces.ϵ is the rms deviation of total energy and force components with respect to the exact values as evaluated on the test set described in the text.ϵ 0 V and ϵ 0 F are evaluated on the model trained with noiseless data.Errors in this table were computed using the 100 configurations in the test set  2. Errors on a subset of 400 training configurations within 6% of ambient density, test set Ω 2 .η is a measure of the sensitivity of the energies and forces to the noise.M is the number of configurations used in training.Numbers in (. ..) are the statistical errors in the last digits.Other notation and units as in Table (1).

Dense hydrogen example
We now discuss a recent calculation [12] of many-body hydrogen under extreme conditions of temperature and pressure.Bulk hydrogen is important both because of its ubiquitous presence in the universe and in technological applications [48].Even though it is the "holy grail" of high pressure studies, its phase diagram above 100 GPa is uncertain.Experiments at high pressure are very difficult so they cannot be used to validate the computational approaches.Hydrogen is a good test case for computer simulation, QMC and machine learning because one does not need to use a pseudopotential so that does not complicate the algorithm or cause errors.Although it does not have core electrons, its electronic structure has other complications such as a still elusive molecular-atomic transition that may coincide with its insulator-metal transition as well as effects caused by the very large quantum zero-point of the protons resulting in solid H 2 being disoriented at low temperatures.QMC has been used extensively on this system [48].Statistical errors are small with only a single electron per atom.There is no fixed-node error for an isolated H 2 molecule if the 2 electrons have different spins.Within the molecular phase, wavefunction nodes are primarily in the low density region external to the molecules.Hence we find the biases from the fixed-node approximation is very small [49].
Fixed-node DMC (FN-DMC) calculations were done with QMCPACK version 3.9.2[35,36] using a Slater-Jastrow trial wavefunction with PBE orbitals computed using a Troullier-Martins PBE pseudopotential with core radius r c = 0.37a bohr and a planewave cutoff of 200 Ry.The Jastrow functions contain two-body correlations as a sum of short-range and long-range contributions, both fully optimized for each configuration.The FN-DMC calculations used > 2000 walkers and were projected for 100 ha −1 with a timestep of 0.02 ha −1 .Canonical twist-averaging was performed using a 4 × 4 × 4 shifted twist grid.We used the Chiesa force estimator [50] correcting the mixed-estimator bias with linear extrapolation.Finite-size corrections were applied to the total energy [34,42] using the computed structure factor.Forces are obtained from the twist-averaged electronic densities.No further finite-size corrections were applied to the forces, as the leading order energy correction is a function only of the mean density and thus does not affect the force.
We generated a large set of hydrogen configurations, all containing 96 protons, using several different simulation methods: CEIMC, AIMD with DFT forces, both classical simulations and path integral simulations.The temperatures of the simulations ranged from 300K to 2200K, the pressures from 50GPa to 200GPa.The configurations included mostly molecular hydrogen, but for the higher temperatures and pressures configurations with dissociated molecules were present.The lower temperature simulations were of disoriented h.c.p. crystals.Above 1000K most configurations were of liquid H 2 or liquid H.The forces for 100,000 configurations were calculated using DFT (both PBE and vdW-DF1 functionals).We calculated QMC forces and energies for more than 16,000 configurations chosen over a range of temperatures and pressures.The data and information about the temperatures and pressures resides in a public data base [51].
Histograms of the potential and force errors are shown in Fig. ( 4) as determined using a standard error analysis of the QMC calculations on each ionic configuration.Statistical analysis shows that the stochastic errors of the estimated forces and energies are close to normal justifying the use of the χ 2 loss function.There were a few (≈ 2%) configurations which had estimated energy errors twice the average which we discarded pending investigation into the cause of the larger errors.The variation of the remaining errors is less than 10%.The error in the energy and force is only weakly dependent on the proton configuration within the range of pressures and temperatures studied and have mean standard errors of σ V =10.8 meV on the total energy and σ F =76.6 meV/ Å on each force component.The mean value of the ionic forces within each configuration was subtracted from the individual ionic forces as was done in the silicon example.Using the QMC energies and forces we can make an estimate of the accuracy of DFT functionals and their error correlation.We have previously benchmarked DFT functionals on dense hydrogen [52,53].In that work, other methods were used to quantify the errors such as the mean absolute error.Using the property that the QMC errors are uncorrelated with the DFT errors, we can find an estimate of ϵ, the rms difference between DFT functionals and the ground truth; see table (3).We agree with the conclusion in our previous study: for dense molecular hydrogen, the vdW-DF functional [54] has errors about one third of that of the PBE functional [55].The rms error of the total energy using the vdW functional is 1.24 eV.Because it is the variance of the energy fluctuations that is proportional to the size of the system, for a local energy error we divide by √ n I = 9.8.This gives a local energy error of 0.127eV= 1440 K, reinforcing our conclusion that even vdW-DF is not accurate enough to predict the phase diagram of dense molecular hydrogen when energy scales of 1440K are relevant.
A diagram illustrating the magnitudes of the force errors in these 2 functionals is shown in Fig. (5).It shows that the PBE force errors are anticorrelated with the vdW errors, though with three times their magnitude, agreeing with our experience that PBE is a better description of the metallic-atomic phase and vdW of the insulating-molecular phase.By taking a linear combination of the two functionals (77% vdW, 23% PBE), one can get a significantly lower rms force error (ϵ F =92 meV/ Å) for this regime of temperature and pressure though with this combination the rms energy error is slightly higher.The red dot represents the ground truth, the purple dot the QMC error, the blue dots the PBE and vdW errors and the star, the optimal linear combination of PBE and vdW forces.

DeePMD model
Our first ML model for dense hydrogen used a ∆-learning model containing three terms: where E pair is the contribution from an analytic potential using only the repulsive part of the exact BO energy of an isolated H 2 molecule [56].This allows an accurate description of the internal motion of a single H 2 without having to learn it from the training data.The next two terms were determined by a ML model.∆E DF T is a model trained on DFT energies, forces, and stresses (either PBE or vdW functionals) minus the pair potential contribution and was trained on roughly 100,000 hydrogen configurations.This model should capture the overall band structure and electron correlation effects as described by the density functional.Finally, ∆E DM C is determined by fitting E − E pair − ∆E DF T to the subset of 20,000 configurations for which we have computed QMC forces.We expect ∆E DM C to be a small correction: a smooth function of proton positions that can be described by a simpler ML model with fewer parameters, thus making good use of the computationally expensive and noisy QMC forces.
To model ∆E DF T and ∆E DM C we used the Deep Potential Molecular Dynamics (DeePMD) [6,5,4] version 1.3.3.Symmetry invariant feature vectors for atomic environments are constructed via the "smooth edition" [4] descriptors with angular and radial information.The angular features are smoothly cut-off from 3.4 to 4.0 Å, and radial features from 5.1 to 6.0 Å.We used 8 neurons in the embedding layer, and 3 layers of 8 hidden neurons each for the descriptor network.The DeepMD training consisted of 10 6 iterations using 8 configurations to update the neural network parameters at each iteration.The starting and ending learning rates were set to 5 × 10 −3 and 2 × 10 −7 .

MACE model
Our second optimization used the MACE model [47], an equivariant message passing model that uses higher-body order messages.We did not use ∆-learning model of Eq. ( 23) with MACE.Instead the QMC data were fit directly without a baseline or DFT layer.Optimization used 64 invariant messages, an angular resolution l max = 3, and a cutoff radius of 5 Å.The loss function is the parameterized likelihood function with forces weighted with λ = 3.5.

Errors in the DeePMD and MACE models
Results in Table (3) give the fitting error ρ (Eq.( 9)) and cross-validation error δ (Eq.( 16)) in the energies and forces for the two models used and for the clean DFT calculation.A visual depiction is shown in Fig. (6).We see for both models that ρ > σ.This means that the models have not exhausted the information in the QMC data, i.e. the values of χ 2 are large compared to the number of data points.In particular, the QMC energy uncertainty of 10meV is two orders of magnitude smaller than the fitting error of the models to the QMC energies.(We remind the reader that what is quoted for σ V is the standard deviation on the total energy of 96 protons in the supercell.)The energies are much worse in this regard than the forces.The MACE model force errors are only 46% larger than the force QMC errors, so the force information from the QMC data has been reasonably well fit by the models.This indicates that more research is needed to find a more accurate model for dense hydrogen in this regime of temperature and pressure.We estimated the accuracy of the models with respect to the exact energy and forces using Eq. ( 14) assuming η = 0.
The different performance of the energy and force errors is not understood.One might think that as a model approaches the exact potential, that the force and energy errors would decrease together.However, for the two DFT functionals ϵ V /ϵ F = 8 Å but for the 3 DeePMD models this ratio is about 4.5 Å while for the MACE model it is 13 Å.Even though the forces drive the trajectory in MD, it is the potential that determines the equilibrium properties, for example the free energy.
There is a large uncertainty, δ, in the ML models as determined by the 5-fold cross validation also shown in Fig. (6).The uncertainty of the MACE model is less than half that of the DeePMD model, but it is of the same magnitude as the errors in the model forces.This indicates that the training would profit from more data; the existing ML models are still highly sensitive to the QMC noise and the selection of configurations.
The QMC/MACE model has the smallest errors in both the energy and forces from the other models including the clean vdW DFT model.However, this particular trained model has not yet been tested with MD or PIMD simulations.It is not unusual for a MD or PIMD simulation to reveal hidden problems with an energy surface when a trajectory ventures into regions of configuration space that the model has not been trained for.It is clearly worthwhile to do such tests since the MACE scales better with the number of protons and is several orders of magnitude faster than using DFT functionals or Coupled Electron Ion MC (CEIMC) for studies of dense hydrogen.

Summary and Future Research
We have shown that it is feasible to use QMC forces to train state-of-the-art ML models.Such models trained on QMC energies and forces open the possibility to quantify the accuracy of models used in Monte Carlo or Molecular Dynamic studies.We have shown in the molecular hydrogen example, that the DeePMD and MACE models have not fully exploited the information contained in the QMC force and energy data.Our analysis of the errors in the two analytic examples give some guidance on how to use QMC forces to improve ML models.Clearly one needs to calculate forces, not just energies because QMC forces have much more information to be exploited.It is better to have forces on many configurations even though the force errors are larger than  if fewer configurations were examined.A model is well converged if all of the important parameters are well trained.This can be judged by a cross-validation analysis.Models with fewer parameters can become well-converged more easily.The ratios, ρ V /σ V and ρ F /σ F , can be used to judge whether models have extracted all the information from the data.These are some of the ways noisy data can help in validating machine learning models.They present great opportunities but there are many open research problems.It is not clear which type of QMC will be most useful to generate data for machine learning: VMC, DMC, or RQMC?VMC can generate uncorrelated forces more efficiently.However, unless extensive careful optimization is done for each configuration, it can have a much larger systematic bias.DMC and RQMC are automatic procedures for eliminating much of this bias and should be equally accurate for the energy.A systematic examination of their efficiency and bias for forces has not been reported.We can also ask whether PIMC will be useful in the warm dense matter regime.
There has been limited research on the efficiencies and biases of the suggested estimators for forces such as the Chiesa [50] estimator or zero-variance, zero-bias estimators [33].Is extrapolation of the forces with DMC adequate or do we need to do forward walking, reptation or use unbiased estimators?What is systematic bias of the forces coming from estimators and from the fixed-node or fixed-phase approximations?Can second derivatives of the energy computed by QMC be useful to train the models?At the moment we do not know their statistical errors and biases.
An important practical questions is how to speed up the QMC calculation so as to deliver data more efficiently.We found in the Silicon example that it is best to have as many points as possible even though the noise would be larger.If we minimize the start-up cost of a run, we can generate more data but we need robust but "lightweight" QMC methods for this.It would be best to avoid a lengthy optimization of the trial function.Typically, QMC uses DFT orbitals for the trial wavefunction, however there are drawbacks to tying the QMC run to a fully converged DFT calculation as that would decrease its efficiency.One possibility is to develop a simplified single-particle theory, for example by using optimized effective interactions [24].One gains with a compact expansion of the electron orbitals in terms of a basis, first, because determining the best expansion coefficients is rapid, and, second, the evaluation of the trial wavefunction and its derivatives within the QMC calculation is faster.Such a description is possible because the electron-ion Jastrow factor and backflow already contain the wavefunction cusps and the dynamic correlations of the electrons and do not need to be present in the orbitals.What is important is to find orbitals with a good representation of the nodal surfaces so that the fixed-node approximation is accurate.It has been found that analytic Jastrow functions, backflow functions and three-body functions can be used [57] and need little or no optimization.Alternative ways to antisymmetrize the trial function such using geminal orbitals or pfaffians might prove useful.In CEIMC calculations of dense hydrogen, we made minor tweaks to the analytic functions for a given density and did not reoptimize for small changes in density or temperature [24].
What is the trade-off between high quality data and covering configuration space better?In the Tersoff Silicon example discussed above, we found that the best models come from the largest data set, even though the noise in forces was larger.What is the best way to select configurations?Can we formulate in terms of a sampling function F(R) or do learning procedures improve on this?For example, should one keep the configurations spaced out in configuration space.
What type of model should we to fit to?We have tried fitting directly to the QMC data and to fitting to a 3 level ∆-model containing molecular forces, DFT forces and QMC forces.It seems useful to start with a simple model that prevents atoms from getting too close to each other.But the MACE example for molecular hydrogen shows that hierarchical model may not be necessary.Having a model that needs accurate DFT calculated forces will make the data generation less efficient.Perhaps an embedded atom or tight-binding model will be a useful alternative baseline model.There is also an advantage to a model with fewer free parameters since having more parameters requires more QMC data for convergence.
Incorporating known physical constraints on the models can be important.We have already used short-ranged interactions such as the molecular H 2 binding curve in our model.However, long-ranged interactions are essential for calculations of the electrostatic properties such as the dielectric constant.But since forces on atoms are dominated by short-ranged interactions and ML models typically have a cutoff at a small multiple of the interatomic distance, long-range interactions may not be correctly determined in the ML model.In many cases, one is able to calculate parameters of the long-range interactions from first principles.For example, the dispersion interaction only depends on the multipole moments of the atoms or molecules present and their polarizabilities, and can be determined using QMC methods on small supercells.On the other hand in metals, the bare Coulomb interaction is screened by the mobile charges; models to describe the magnitude of the long-range interaction are more complex.A challenging problem is to simulate a metal-insulator transition such as occurs when hydrogen changes from the molecular phase to the atomic phase.The character of the long-range interaction changes during the transition and is correlated with the changes in the short-ranged structure of the system.It is important to be able to model the interactions at all distance scales accurately and self-consistently in order to make precise predictions of this transition in the ML model.
We have demonstrated that QMC is useful in generating data for machine learning potentials and assessing the error of those potentials.In principle, the data should have a much smaller systematic bias.The noisy feature of QMC forces and energies can be an advantage both in allowing more complete sampling of phase space, in optimization of the model and assessing the model errors.There is much scope for improving the methodology to make QMC a useful tool for generating ML data.

Figure 2 .
Figure 2. Left top panel: Exact potential energy V (x) (black curve) and estimated model V m (x) (red dots).The data with error bars are the points used in the fit.Left lower panel: model error ϵ(x) = V m (x) − V (x) (red) and expected statistical error (black) from eq.(17).The estimated error from 5-fold cross-validation is the dotted curve.Right table: χ 2 versus polynomial degree, n.Fits are satisfactory when χ 2 ≈ 50 − n ± 10 since there are 50 data points.Fits for n ≥ 4 are good.Values of n > 4 have overfit the data since they have extra parameters without decreasing χ 2 .

Figure 3 .
Figure 3. On the left is an illustration of the force errors in the data and model.The numbers are for 2310 points in the training set evaluated on test set Ω 1 .The figure on the right shows how the errors depend on the number of sample points, whether noisy data or clean data was used for training and how the measured errors depend on the test set.The upper figure is the energy, lower figure the forces.Solid lines interpolate the noisy data, dashed lines the clean data.Black and red curves are evaluated on test set Ω 1 , blue curves on test set Ω 2 .The red curve shows the model errors when we do not require the total force on the ions to be zero: the convergence of the energy is worse.

Figure 4 .
Figure 4. Histograms of the estimated values of standard deviations of the total energy (σ V ) and force components (σ F ) from 16290 dense hydrogen configurations, each with 96 protons, within the pressure range 50GPa -200GPa and temperature range 300K -2200K.The standard deviations were estimated from the mean squared deviations of each individual DMC simulation.We obtained σ v = 10.8 meV and σ F = 76.7 meV/ Å.A few percent of configurations with σ V > 0.014 were discarded.

Figure 5 .
Figure 5. Tetrahedron showing the QMC, PBE and vdW rms force errors (in meV/ Å) for a subset of the hydrogen configurations for pressures 175GPa and 200 GPa and temperatures between 600 and 2000 K.The arrows represent an orthogonal set of axes.The red dot represents the ground truth, the purple dot the QMC error, the blue dots the PBE and vdW errors and the star, the optimal linear combination of PBE and vdW forces.

Figure 6 .
Figure 6.Diagram illustrating the magnitude of r.m.s.errors in forces (in meV/ Å) for the QMC forces and 2 ML models for dense hydrogen, a 3-level DeepMD trained on DFT forces and on QMC forces and a single level MACE model trained only on QMC forces.Circles indicate the magnitudes of the cross-validation uncertainty in the DeepMD and MACE models. Table

Table 3 .
(4) errors in total energies (in meV) and forces (in meV/ Å) with respect to QMC calculations using 16,290 configurations of 96 protons at pressures between 50GPa and 200 GPa.The first two columns abbreviate the construction of the models.For example "PBE" means that the errors wrt QMC energies are compared to PBE energies, "PBE/DeePMD" means that a DeePMD model is constructed using PBE forces and compared to QMC, and "QMC/DeePMD" means a second level DeePMD model trained using QMC energies is added to the first level model.Symbols (ρ, ϵ, δ) are defined in Fig.(1).To compute ϵ we assumed η = 0 and used σ from Fig.(4).δ was estimated using a 5-fold cross-validation (jackknife) procedure.