Neural network Gaussian processes as efficient models of potential energy surfaces for polyatomic molecules

Kernel models of potential energy surfaces (PESs) for polyatomic molecules are often restricted by a specific choice of the kernel function. This can be avoided by optimizing the complexity of the kernel function. For regression problems with very expensive data, the functional form of the model kernels can be optimized in the Gaussian process (GP) setting through compositional function search guided by the Bayesian information criterion. However, the compositional kernel search is computationally demanding and relies on greedy strategies, which may yield sub-optimal kernels. An alternative strategy of increasing complexity of GP kernels treats a GP as a Bayesian neural network (NN) with a variable number of hidden layers, which yields NNGP models. Here, we present a direct comparison of GP models with composite kernels and NNGP models for applications aiming at the construction of global PES for polyatomic molecules. We show that NNGP models of PES can be trained much more efficiently and yield better generalization accuracy without relying on any specific form of the kernel function. We illustrate that NNGP models trained by distributions of energy points at low energies produce accurate predictions of PES at high energies. We also illustrate that NNGP models can extrapolate in the input variable space by building the free energy surface of the Heisenberg model trained in the paramagnetic phase and validated in the ferromagnetic phase. By construction, composite kernels yield more accurate models than kernels with a fixed functional form. Therefore, by illustrating that NNGP models outperform GP models with composite kernels, our work suggests that NNGP models should be a preferred choice of kernel models for PES.

For example, it has been demonstrated that the efficiency and accuracy of GP models can be systematically improved by increasing the complexity of GP kernels through compositional search that maximizes the Bayesian information criterion (BIC) [45], which approximates the marginal likelihood. GP models with composite kernels can produce accurate PES at high energies using information about the potential energy at low energies [21,38] and yield global PES with up to 51 dimensions [38], or up to 57 dimensions when trained simultaneously by energies and energy gradients [36,41-44]. However, optimizing the kernels of kernel models generally requires iterative inversion of the kernel matrix, which scales as O(n^3) with the number n of training points and makes compositional kernel search computationally expensive. Although this scaling of the computational complexity can be reduced by various techniques, such as data sparsification [39], optimization of the sampling of energy points in the relevant configuration space [21-23,26,29,32-35,38], or building molecular symmetries into ML model kernels [43,44], it remains a major obstacle for applications of kernel models to high-dimensional PES.
NNs do not suffer from the O(n^3) scaling problem and can often be trained much more efficiently than accurate kernel models. However, can NNs provide more accurate models of PES than typical kernel models given the same number of potential energies? There is no clear answer to this question. The one study that directly compared GP regression with NN models of PES used for the calculations of ro-vibrational energy levels [25] indicated that GP models yield more accurate results with fewer energy points. However, neither the hyperparameters of the NNs nor the functional form of the kernels of GP regression were fully optimized in Ref. [25]. Previous work indicates that KRR and GP models are competitive with NN models in terms of the accuracy and data efficiency of PES models for polyatomic molecules. At the same time, unlike conventional NN and KRR models, GP regression offers not only models of PES, but also the Bayesian model uncertainty, which can be exploited for applications such as Bayesian optimization [16,34,46,47] or the analysis of the effects of PES uncertainties on the accuracy of calculations of molecular observables [40].
Since the original work on GP regression [48], it has been known that GP models can be viewed as (Bayesian) NNs with infinite width. This connection has recently been elucidated [49-51] and exploited [52,53] to develop NNGP models [49-53]. NNGP models can combine multiple layers of a Bayesian NN with different properties, yielding an algorithm to increase the complexity of a GP kernel. At the same time, NNGP models can potentially be trained much more efficiently than kernel models with complex, composite kernels. However, the application of NNGP models to fitting PES has not been explored. In particular, it is not known if NNGP models can extrapolate as well as interpolate target functions based on small data sets. The ability of a model to extrapolate is important for applications aiming to build high-dimensional PES with a small number of energy points, where the training point distributions are likely to miss some important parts of the PES landscape.
In this work, we present a direct comparison of NNGP models and GP models based on composite kernels with an optimized functional form. We build the global PES for a polyatomic molecule with six degrees of freedom by both interpolation and extrapolation in the energy domain (i.e., we build the PES at high energies using energy points at low energies). We illustrate that, in both cases, NNGP models yield more accurate results than GPs with composite kernels. At the same time, we show that training NNGP models is much more efficient than training GP models with composite kernels. We analyze the eigenvalue decomposition of the GP model kernels and show that the kernels of NNGP models align better with the target PES than the composite kernels, indicating the better learning ability of NNGP kernels.
To further illustrate the generalization power of NNGP regression, we use NNGP models to build the free energy surface for the Heisenberg spin model. The results illustrate that NNGP models can predict the onset of a phase transition by extrapolation of the free energy surface within one given phase. The results of this work illustrate that GP regression and NN modelling can be combined to yield efficient, yet accurate PES with a small number of energy points n.

II. GAUSSIAN PROCESS MODELS WITH COMPLEX KERNELS
A PES is represented by a function y = f(x), where x = [x_1, x_2, ..., x_p] is a vector of variables describing a molecule with p degrees of freedom. A supervised ML model of PES aims to infer this function from a finite set of potential energy points y = [y_1, y_2, ..., y_n] corresponding to n vectors x collected in the rectangular n × p matrix X.
A GP can be viewed as an infinite-dimensional multivariate normal distribution Y = [Y_1, Y_2, ..., Y_∞] ~ N(µ, Σ), with Y_i representing the output of the GP at a particular point x_i of the input space. The distribution is infinite-dimensional because each component of x is a continuous variable. Therefore, a GP is entirely determined by a mean function µ(x) and a covariance function Σ(x, x'). The covariance function of a GP is represented by the kernel function k(x, x'). Training a GP amounts to conditioning Y on the given energy points y, which changes the mean and covariance functions of the GP. In the present work, we distinguish two types of GP models: GP models with composite kernels and neural network Gaussian process (NNGP) models.
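The conditioning step described above can be sketched as follows (a minimal Python sketch assuming a zero prior mean and a squared-exponential kernel; the kernel choice, function names, and hyperparameter values are illustrative, not the kernels used in this work):

```python
import numpy as np

def rbf_kernel(X1, X2, length=0.2, amp=1.0):
    """Illustrative squared-exponential kernel k(x, x') = amp^2 exp(-|x-x'|^2 / (2 length^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return amp**2 * np.exp(-0.5 * d2 / length**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-6):
    """Condition a zero-mean GP on (X_train, y_train): the returned mean and
    variance at X_test are the conditioned mean and covariance functions."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)
    K_ss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)   # K^{-1} y
    mean = K_s @ alpha                    # conditioned mean function
    v = np.linalg.solve(K, K_s.T)
    var = np.diag(K_ss - K_s @ v)         # Bayesian model uncertainty
    return mean, var
```

The returned variance is the Bayesian model uncertainty mentioned above, which conventional NN and KRR models do not provide.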
As described below, both of these algorithms allow for kernels of varying function complexity.In both cases, the complexity of the kernel function can be specified by the number of layers in the kernel construction algorithm.We denote the number of layers in the kernel construction algorithm by L, for both types of models.It is important to note, however, that L has a different meaning for the two types of models.As explained below, L denotes the depth of the search tree for the composite kernels and the number of NN layers for the NNGP models.
As was shown previously [21,36,38], the mathematical form of the covariance function can be optimized to yield accurate PES with a small number of energy points. Introduced by Duvenaud et al. [54,55] and applied to building accurate PES in Refs. [21,36,38], this algorithm amounts to a search in an infinite space of kernel functions. To make the search feasible, Duvenaud et al. used a greedy strategy to incrementally grow the complexity of the kernel function [54,55]. The algorithm combines a set of simple kernel functions f_i, with i ∈ [1, ..., N_k], into products and linear combinations by an iterative process, choosing the optimal kernel function at each layer of kernel complexity. Here, N_k is the number of kernel functions f_i considered. Each function f_i is chosen to have the properties of a kernel function of a reproducing kernel Hilbert space, which guarantees that the resulting complex function can be used as the kernel function for a kernel ML model. As was previously demonstrated, this kernel construction algorithm yields GP models that produce accurate PES by both interpolation and extrapolation [21,36,38,56]. However, the iterative optimization of kernel complexity for identifying optimal kernel functions for a given PES is numerically expensive [21,36,38,39]. As illustrated in Figure 1, the iterative algorithm of Duvenaud et al. can be mapped onto a search tree, with the complexity of the kernel function increasing with the depth level (L) of the tree. The computational complexity arises from the requirement to train multiple GP models for a given depth layer L and from the increasing number of kernel parameters as L increases. In the present work, we show that accurate GP models of PES can be obtained using an alternative method for increasing kernel complexity, exploiting the connection between GPs and NNs. We show that the resulting NNGP models can be trained much faster than GP models with composite kernels, while yielding more data-efficient PES.
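The greedy, BIC-guided search can be sketched as follows (a toy Python sketch with a two-kernel base set and fixed hyperparameters; the actual algorithm of Refs. [54,55] uses a larger kernel library and re-optimizes the parameters of every candidate kernel at every layer):

```python
import numpy as np

# Illustrative base kernels with fixed hyperparameters.
def k_rbf(X1, X2):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

def k_lin(X1, X2):
    return X1 @ X2.T

BASE = {"RBF": k_rbf, "LIN": k_lin}

def log_marginal_likelihood(K, y, noise=1e-6):
    """Gaussian log marginal likelihood of y under a zero-mean GP with kernel matrix K."""
    n = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

def bic_score(K, y, n_params):
    """Complexity-penalized score (to be maximized); equals -BIC/2 in the usual sign convention."""
    return log_marginal_likelihood(K, y) - 0.5 * n_params * np.log(len(y))

def greedy_kernel_search(X, y, depth=2):
    """Grow a composite kernel layer by layer, keeping the BIC-best sum or product."""
    best_name, best_fn = max(BASE.items(), key=lambda nf: bic_score(nf[1](X, X), y, 1))
    n_params = 1
    for _ in range(depth - 1):
        candidates = []
        for name, fn in BASE.items():
            for op in ("+", "*"):
                def comp(X1, X2, f=best_fn, g=fn, op=op):  # capture by value
                    A, B = f(X1, X2), g(X1, X2)
                    return A + B if op == "+" else A * B
                candidates.append((f"({best_name} {op} {name})", comp))
        n_params += 1
        best_name, best_fn = max(candidates,
                                 key=lambda nf: bic_score(nf[1](X, X), y, n_params))
    return best_name, best_fn
```

Even in this toy form, each additional layer multiplies the number of candidate kernels that must be scored, which is the source of the computational cost discussed above.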
A fully connected feed-forward neural network with multiple hidden layers enumerated by l ∈ [1, L] is computed as

\[
y_i^l(\mathbf{x}) = b_i^l + \sum_{j=1}^{N_{l-1}} W_{ij}^l\, \phi\big(y_j^{l-1}(\mathbf{x})\big), \qquad l \ge 2,
\]

where W_{ij}^l and b_i^l are the weight and bias parameters for layer l, φ(y) is a non-linear activation function, and N_l is the number of nodes in layer l. The outputs y_i^L of layer L can then be collected into a linear combination to yield a scalar output y for a single-output regression problem. For a NN with a single hidden layer (L = 1), this reduces to

\[
y(\mathbf{x}) = b + \sum_{i=1}^{N_1} v_i\, \phi\big(y_i^1(\mathbf{x})\big),
\]

with

\[
y_i^1(\mathbf{x}) = b_i^1 + \sum_{j=1}^{p} W_{ij}^1\, x_j,
\]

where x_j is the jth component of the p-dimensional vector x.
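The forward pass above can be sketched in a few lines (a minimal sketch with tanh standing in for the activation function; the present work uses erf, and all shapes and names are illustrative):

```python
import numpy as np

def nn_forward(x, weights, biases, phi=np.tanh):
    """Forward pass of a fully connected feed-forward NN: each hidden layer
    computes phi(W^l y^{l-1} + b^l); a final linear read-out collects the
    outputs of the last hidden layer into a scalar y."""
    y = x
    for W, b in zip(weights[:-1], biases[:-1]):
        y = phi(W @ y + b)                          # hidden layer l
    return (weights[-1] @ y + biases[-1]).item()    # scalar output
```

For a molecule with p degrees of freedom, `x` would be the p-dimensional descriptor vector and the read-out would return the predicted potential energy.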
The connection between a single-layer NN and a GP is provided by the Central Limit Theorem (CLT) [48]. In the limit of infinite width N_1 → ∞, with independent Gaussian priors N(µ_w, σ_w²/N_1) and N(µ_b, σ_b²) on the weight and bias parameters respectively, the network becomes a Gaussian process with mean µ^1 and covariance K^1 functions. The means of the priors can be chosen as µ_w = µ_b = 0, yielding µ^1 = 0 and the covariance

\[
K^1(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}\big[\phi\big(y_i^1(\mathbf{x})\big)\,\phi\big(y_i^1(\mathbf{x}')\big)\big].
\]

It was recently demonstrated that this can be extended to networks with multiple hidden layers L > 1 [49-52,57]. Given that y_j^{l-1} is an independent Gaussian process for each j, as N_l → ∞, y_i^l is a Gaussian process GP(0, K^l) with the covariance

\[
K^l(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{y_j^{l-1} \sim GP(0,\, K^{l-1})}\big[\phi\big(y_j^{l-1}(\mathbf{x})\big)\,\phi\big(y_j^{l-1}(\mathbf{x}')\big)\big].
\]

Any two y_i^l and y_j^l in the same layer are independent and follow a joint Gaussian distribution with zero covariance when i ≠ j. The recursive form of K^l depends on the activation function φ.

A kernel function satisfies the eigenvalue equation

\[
\int k(\mathbf{x}, \mathbf{x}')\, \phi_i(\mathbf{x}')\, p(\mathbf{x}')\, d\mathbf{x}' = \eta_i\, \phi_i(\mathbf{x}),
\]

where {φ_i} form an orthogonal set and {η_i} are the kernel eigenvalues. If the target function f(x) is represented by the basis expansion

\[
f(\mathbf{x}) = \sum_i w_i\, \psi_i(\mathbf{x}),
\]

with ψ_i(x) ≡ √η_i φ_i(x), the data efficiency of a model can be quantified by the cumulative power distribution [58]

\[
C(i) = \frac{\sum_{j=1}^{i} w_j^2}{\sum_{j=1}^{\infty} w_j^2}.
\]

The rate of convergence of the cumulative distribution (CD) to one can be used to quantify the efficiency of a kernel for a particular problem. Ideally, the kernel function should be chosen such that the target function is completely represented by a single eigenfunction of k(x, x'), which would yield C(i) = 1 with just i = 1. We use the CD defined in Eq. (10) to compare kernel performance in this work, in addition to analyzing the errors of the models.
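For the erf activation used in this work, the Gaussian expectation in the layer-to-layer recursion has a known closed form for erf networks, E[erf(u) erf(v)] = (2/π) arcsin(2K(x, x') / √((1 + 2K(x, x))(1 + 2K(x', x')))), so the NNGP kernel can be evaluated layer by layer without sampling. A Python sketch (the initialization K^0 = σ_b² + σ_w² x·x'/p and the hyperparameter values are illustrative assumptions):

```python
import numpy as np

def nngp_kernel(X1, X2, L=3, sw2=1.5, sb2=0.1):
    """NNGP covariance of an infinitely wide erf network with L hidden layers,
    built by iterating the closed-form arcsine expectation for erf."""
    p = X1.shape[1]
    K = sb2 + sw2 * (X1 @ X2.T) / p            # K^0 cross-covariance
    Kxx = sb2 + sw2 * (X1 * X1).sum(1) / p     # K^0(x, x) diagonal terms
    Kyy = sb2 + sw2 * (X2 * X2).sum(1) / p
    for _ in range(L):
        denom = np.sqrt((1 + 2 * Kxx)[:, None] * (1 + 2 * Kyy)[None, :])
        K = sb2 + sw2 * (2 / np.pi) * np.arcsin(2 * K / denom)
        Kxx = sb2 + sw2 * (2 / np.pi) * np.arcsin(2 * Kxx / (1 + 2 * Kxx))
        Kyy = sb2 + sw2 * (2 / np.pi) * np.arcsin(2 * Kyy / (1 + 2 * Kyy))
    return K
```

Each layer is a cheap elementwise update of the covariance matrix, which is why increasing the NNGP kernel complexity with L is far less expensive than the compositional kernel search.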

III. RESULTS
We examine the generalization power of NNGP models by building the global six-dimensional (6D) PES for H₃O⁺, using the results of the ab initio calculations from Ref.
[59]. The molecular geometry is described by a 6D input vector, with elements given by the Morse variables defined as m_ij = exp(−r_ij/b), where r_ij is the distance between atoms i and j, and the range parameter b is fixed at 2.5. These molecular descriptors are the same as those used in previous work [21,31]. The performance of the models of PES is characterized by the root mean squared error (RMSE)

\[
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}_i - y_i\big)^2},
\]

where ŷ_i is the prediction of the model at x_i and the sum is over the N points that are not used for training the model, forming a hold-out test set of input-output pairs. Ref. [59] provides 31000 ab initio points that span the energy range [0, 21000] cm⁻¹. In all calculations presented here, the hold-out test set comprises all ab initio points that are not used for training the specific model under examination.
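The descriptor and the error metric can be sketched as follows (assuming Cartesian input geometries; the atom ordering and function names are illustrative):

```python
import numpy as np
from itertools import combinations

def morse_descriptor(R, b=2.5):
    """Map the Cartesian coordinates of the four atoms of H3O+ (array of shape
    (4, 3)) to the six Morse variables m_ij = exp(-r_ij / b) over all atom pairs."""
    return np.array([np.exp(-np.linalg.norm(R[i] - R[j]) / b)
                     for i, j in combinations(range(len(R)), 2)])

def rmse(y_pred, y_true):
    """Root mean squared error over a hold-out test set."""
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
```

Because each m_ij decays exponentially with the interatomic distance, the descriptor is bounded, which helps the kernel models treat dissociating geometries.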

A. Comparison of model accuracy
To demonstrate the extrapolation ability of the NNGP kernel, we trained a series of 6D NNGP models of PES for H₃O⁺ with 500, 1000, and 1500 training points, randomly selected from a range of energies up to a maximum of E_max. The results are compared with the predictions of a GP model with the most accurate composite kernel trained using 1500 ab initio points randomly selected from the same range of energies. Figure 2 illustrates that NNGP models trained with 1000 ab initio points outperform GP models with composite kernels trained by 1500 ab initio points. The advantage of the NNGP models is particularly noticeable for lower values of E_max, which indicates that NNGP models provide more accurate extrapolation in the energy domain than the GP models with composite kernels.

B. Comparison of kernel eigenvalues and training complexity
As discussed in Section II, the efficiency of a kernel regression model can be quantified by the CD C(i). The above results illustrate that NNGP models provide more accurate and data-efficient models of PES than GP models with composite kernels. At the same time, the computational complexity of training NNGP models is much lower than for GP models with composite kernels. To illustrate this, we plot in Figure 6 the computation time required to train both types of models. To obtain the CDs shown in Figure 5, the covariance K is computed using a distribution of 10000 randomly selected energy points excluding the training points. K_y = yᵀy is computed with the same distribution of 10000 points and is represented by the horizontal solid line.
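The un-normalized CD used here can be obtained from the eigendecomposition of the covariance matrix K evaluated on the sample distribution: projecting the target values y onto the eigenvectors of K, ordered by decreasing eigenvalue, and accumulating the squared projections gives a curve that converges to K_y = yᵀy. A minimal sketch:

```python
import numpy as np

def cumulative_power(K, y):
    """Un-normalized cumulative power distribution: squared projections of y
    onto the eigenvectors of K (sorted by decreasing eigenvalue), accumulated;
    the final value equals y^T y."""
    eigval, eigvec = np.linalg.eigh(K)    # returned in ascending order
    order = np.argsort(eigval)[::-1]      # reorder by decreasing eigenvalue
    w2 = (eigvec[:, order].T @ y) ** 2    # squared projection on each mode
    return np.cumsum(w2)
```

A kernel that aligns well with the target PES concentrates the power in the leading modes, so its cumulative curve approaches yᵀy after only a few terms.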

C. Free energy surface for Heisenberg model
In order to further demonstrate the generalization accuracy of NNGP models for extrapolation tasks, we consider a different problem: the evolution of the free energy with temperature (T) and magnetization (m) for the one-dimensional (1D) Heisenberg model

\[
\hat{H} = -J \sum_{\langle i, j \rangle} \hat{S}_i \cdot \hat{S}_j,
\]

where J is the strength of the coupling between spins in different sites of an infinite one-dimensional lattice, Ŝ_i is the spin-1/2 operator for lattice site i, and the angular brackets indicate summation over indices of adjacent sites only. The dimensionless magnetization m is defined as 2⟨S⟩, where ⟨S⟩ is the average over all spin orientations in the lattice. The free energy density f(m, T) for this model can be computed using the mean-field approximation [60-63], yielding Eq. (13): a combination of a quadratic term in m² and a quartic term in m⁴, which reflects the competition between the ferromagnetic interaction, inducing the alignment of spins, and the thermal energy, which tends to induce disorder in spin alignment. The system undergoes a phase transition from the paramagnetic phase to the ferromagnetic phase at the temperature T_c = J/4. When T > T_c, the thermal energy is sufficient to overcome the ferromagnetic interaction, the spins are randomly oriented, and the magnetization is zero (the paramagnetic phase). When T < T_c, the ferromagnetic interaction is stronger, the spins are aligned, and the magnetization is non-zero (the ferromagnetic phase). This is reflected by the evolution of the free energy density (13) from a single-well dependence on m at T > T_c to a double-well dependence on m at T < T_c. Our aim is to explore if a NNGP model can be trained by the free energy density evolution in the paramagnetic phase to predict the critical temperature T_c and the free energy density in the ferromagnetic phase by extrapolation along the temperature axis T.
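To make the structure of the extrapolation target concrete, the sketch below minimizes an illustrative Landau-type free energy f(m, T) = a(T − T_c)m² + bm⁴ over m. The coefficients a = b = 1 are assumptions for illustration only and are not the coefficients of the mean-field expression (13); the sketch reproduces the single-well/double-well crossover described above.

```python
import numpy as np

def landau_f(m, T, Tc=1.0, a=1.0, b=1.0):
    """Illustrative Landau-type free energy density with a quadratic and a
    quartic term: f(m, T) = a (T - Tc) m^2 + b m^4 (assumed coefficients)."""
    return a * (T - Tc) * m**2 + b * m**4

def order_parameter(T, m_grid=np.linspace(0.0, 1.0, 2001), **kw):
    """m0(T) = argmin_m f(m, T) on a grid: zero above Tc, non-zero below."""
    f = landau_f(m_grid, T, **kw)
    return m_grid[np.argmin(f)]
```

For this form, the analytic minimum below T_c is m0 = √(a(T_c − T)/(2b)), so the order parameter grows continuously from zero as the temperature drops through T_c, which is the onset the NNGP model is asked to locate by extrapolation.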
We train two-dimensional models with T and m as input variables, using 300 values of f(m, T) given by Eq. (13) at temperatures in the shaded region of Figure 7 (left). Figure 7 (left) compares the order parameter m_0 that minimizes f(m, T), as predicted by the NNGP model with L = 5 and by the GP model with the composite kernel with L = 5, with the mean-field result (13). Both GP models provide an accurate prediction of m_0 deep into the ferromagnetic phase without any information from the ferromagnetic phase.

IV. CONCLUSION
The generalization accuracy of GP models can be enhanced by optimizing the complexity of the GP kernel function. There are two general methods for building GP kernels with variable complexity. GP models with composite kernels can be constructed by combining simple kernel functions into products and linear combinations using an iterative algorithm guided by the Bayesian information criterion. Previous work showed that this algorithm of increasing kernel complexity can be used to enhance the accuracy of PES models for polyatomic molecules trained by a fixed distribution of a small number of energy points. However, building composite kernels by this iterative algorithm can be computationally expensive, for two reasons. First, the iterative kernel construction requires one to build multiple GP models with varying kernel function complexity. Second, the number of kernel parameters increases quickly with the depth of the kernel composition search (as illustrated in Figure 1), making model training extremely time-consuming.
In the present work, we consider an alternative algorithm of increasing kernel complexity for building GP models of PES with enhanced generalization accuracy. This algorithm exploits the connection between Bayesian NNs and Gaussian processes to yield NNGP models. We have shown that NNGP models yield PES with similar or better accuracy than GP models with composite kernels, at a fraction of the computation cost. To illustrate the generalization accuracy of the NNGP models, we consider two examples: a 6D PES for the molecular ion H₃O⁺ and a 2D free energy density surface of the Heisenberg spin model.
We demonstrate that 6D models trained by random samples from low-energy distributions produce accurate PES both at low energies and at high energies, illustrating the ability of NNGP models to extrapolate in the energy domain. The 2D models are used to illustrate extrapolation by NNGP models in the input variable space. We observe that the NNGP models are less sensitive to the distributions of training points and the model hyperparameters, yielding fast convergence with the number of NN layers. This makes NNGP models particularly suitable for use as surrogate models in Bayesian optimization for finding minima of PES that are expensive to compute. The improved computational efficiency of NNGP models, due in part to GPU acceleration, is expected to allow for modelling of complex high-dimensional systems.

ACKNOWLEDGMENT
This work was supported by NSERC of Canada.

FIG. 1. Schematic diagram of the algorithm for the construction of composite kernels for GP models. Each function f_i with i ∈ [1, ..., 4] represents a parametrized kernel function. The algorithm uses the Bayesian information criterion to select at each layer L either a product or a linear combination of f_i with the kernel function selected at the previous layer.

In the present work, we choose the error function erf(y) = (2/√π) ∫₀^y e^(−t²) dt as the activation function φ, and treat σ_w^(l) and σ_b^(l) as trainable parameters. The number of parameters scales linearly with the number of layers L.

FIG. 2.
Figure 3 compares explicitly the predictions of the NNGP model trained by 1500 ab initio points in the energy range [0, 10000] cm⁻¹ with the computed ab initio energies in the entire energy interval [0, 21000] cm⁻¹. We use a NN with five layers to build the NNGP model for this calculation. The results illustrate the remarkable generalization accuracy of NNGP models, yielding an accurate global 6D surface by both interpolation and extrapolation (in the energy domain) of 1500 energy points.

FIG. 3. Comparison of the NNGP model predictions with the potential energy points computed in Ref. [59] for the 6D PES of H₃O⁺. The NNGP models are trained by a random distribution of 1500 points in the energy range [0, 10000] cm⁻¹ shown by the shaded area.

FIG. 4. Root mean squared errors (RMSE) of NNGP interpolation models of the 6D PES for H₃O⁺ as functions of the number of training points used to build the models. The training points are selected randomly from the energy interval [0, 21000] cm⁻¹ spanning the entire PES. The different symbols correspond to the different numbers of NN layers used to build the kernels of the NNGP models: L = 1 (circles), L = 2 (down triangles), L = 3 (up triangles), L = 4 (squares) and L = 10 (pentagons).

FIG. 5. The cumulative power distributions for mode i for models with different kernels: solid curve, the NNGP model with L = 5; dashed curve, the composite kernel with L = 5; dotted curve, the dot-product kernel. All models are trained using 1000 randomly selected energy points.

FIG. 6. Comparison of the computation time required to train composite-kernel (circles) and NNGP (squares) models. For the composite kernel, L is the number of searched layers in the algorithm shown in Figure 1. For the NNGP kernel, the computation time is recorded for training an NNGP kernel with L layers. All models are trained using a fixed set of 1000 randomly selected energy points.

The training points are shown by the symbols in Figure 7 (right) and are purposely far removed from the critical temperature T_c. Figure 7 (left) compares the order parameter m_0(T) = arg min_m f(m, T) predicted by the models.

Figure 7 (right) depicts the surface of f(m, T) produced by interpolation (blue area including the distribution of training points shown by symbols) and extrapolation (the remaining red area) with the NNGP model. This figure illustrates that the NNGP model produces a physical, smooth surface by both interpolation and extrapolation in the input variable space.

FIG. 7. Left: The prediction of the order parameter m_0 that minimizes the free energy density f(m, T) of the mean-field solution of the Heisenberg model by the NNGP model (upper solid curve) and the GP model with the composite kernel (lower solid curve). The mean-field result is shown by the dotted curve. The critical temperature of the phase transition is T_c = 1, and the training dataset consists of 300 points shown by symbols in the right panel. The training data are in the range T ∈ [1.25, 3] and m ∈ [0, 1]. Right: The surface f(m, T) produced by the NNGP model with L = 5 by interpolation (blue area including the distribution of training points shown by symbols) and extrapolation (red area).