Easy representation of multivariate functions with low-dimensional terms via Gaussian process regression kernel design: applications to machine learning of potential energy surfaces and kinetic energy densities from sparse data

We show that Gaussian process regression (GPR) allows representing multivariate functions with low-dimensional terms via kernel design. When using a kernel built with high-dimensional model representation (HDMR), one obtains a similar type of representation as the previously proposed HDMR-GPR scheme while being faster and simpler to use. We tested the approach on cases where highly accurate machine learning is required from sparse data by fitting potential energy surfaces and kinetic energy densities.

We specifically consider RS (random sampling) HDMR,3,5 which allows obtaining all the terms of

f(\mathbf{x}) = f_0 + \sum_{i} f_i(x_i) + \sum_{i_1<i_2} f_{i_1 i_2}(x_{i_1}, x_{i_2}) + \dots + f_{12\dots D}(x_1, x_2, \dots, x_D)   (1)

from one and the same dataset with an arbitrary distribution of data in the full D-dimensional space.6-8 We previously introduced combinations of HDMR with neural networks (RS-HDMR-NN)9-11 and, recently, with Gaussian process regression (RS-HDMR-GPR),12,13 which allow dispensing with integrals and also allow combining terms of any dimensionality; e.g., one may use

f(\mathbf{x}) \approx \sum_{i_1<i_2<\dots<i_d} f_{i_1 i_2 \dots i_d}(x_{i_1}, x_{i_2}, \dots, x_{i_d}),   (2)

where the component functions f_{i_1 i_2 \dots i_d} are fitted with NNs or GPR one at a time, in cycles, until convergence to the known values of the function f^{(j)} = f(\mathbf{x}^{(j)}) at the points \mathbf{x}^{(j)}, j = 1, ..., M:

f_{i_1 \dots i_d}(x^{(j)}_{i_1}, \dots, x^{(j)}_{i_d}) \leftarrow a(c) \Big[ f(\mathbf{x}^{(j)}) - \sum_{(k_1 \dots k_d) \neq (i_1 \dots i_d)} f_{k_1 \dots k_d}(x^{(j)}_{k_1}, \dots, x^{(j)}_{k_d}) \Big],   (3)

where a fading parameter a(c) (with c indexing the fitting cycles) may be introduced to palliate local minima.12,13 In this way, existing NN or GPR engines can easily be used for the component functions, although custom code is needed to realize Eq. (3).10,13 In eq. (2), the lower-order terms of eq. (1) are lumped into the d-dimensional terms. This representation is particularly attractive with sparse data; we previously showed that at low data density there is a maximum d for which f_{i_1 i_2 \dots i_d} can be reliably constructed.9 Note that data density is always low in sufficiently high-dimensional spaces in any practical setting: due to the so-called "curse of dimensionality", adding more data (even if it were possible, which is often not the case) has little effect on data density in terms of the number of data points per dimension.14 The representation of eq. (2) was previously used to fit functions to data with densities as low as about 2 data points per degree of freedom or less.12,15 Eq. (3) does not enforce orthogonality of the component functions but gains much in simplicity. When the terms are built with GPR (RS-HDMR-GPR12,13), eq. (2) allows determining the relative importance of different combinations of variables, effectively extending the ARD (automatic relevance determination) capability of plain GPR and allowing for pruning of HDMR terms;13 this is important, as the number of terms scales combinatorially with D and d.
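The cyclic refitting of component functions in eq. (3) can be sketched as follows. This is a minimal illustration on a toy additive two-dimensional function, not the code of Refs. 12,13; the data, length scale, choice of 1D terms, and omission of the fading parameter a(c) are all assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sqexp(x, z, l=1.0):
    """Squared-exponential kernel matrix between 1D point sets x and z."""
    d2 = (x[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / (2 * l * l))

# Toy target: an exactly additive 2D function, so d = 1 HDMR terms suffice.
M = 60
X = rng.uniform(-2, 2, size=(M, 2))
y = np.sin(X[:, 0]) + np.cos(X[:, 1])

delta = 1e-6          # regularization parameter
terms = [0, 1]        # one 1D component function per input variable
alpha = {t: np.zeros(M) for t in terms}   # GPR weights of each component

def component(t, x):
    """Evaluate component function f_t at the 1D points x."""
    return sqexp(x, X[:, t]) @ alpha[t]

# Eq.-(3)-style cycles: refit each term to the residual left by the others.
for cycle in range(30):
    for t in terms:
        residual = y - sum(component(s, X[:, s]) for s in terms if s != t)
        K = sqexp(X[:, t], X[:, t])
        alpha[t] = np.linalg.solve(K + delta * np.eye(M), residual)

pred = sum(component(t, X[:, t]) for t in terms)
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(f"training rmse after cycles: {rmse:.2e}")
```

Each pass refits one term to the residual left by the others; because the toy target is exactly additive, the summed terms converge to the training values.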
In eqs. (2)-(3), the relative amplitudes of the terms are subsumed in the definition of f_{i_1 i_2 \dots i_d}; in what follows, it will be convenient to consider them explicitly:

f(\mathbf{x}) \approx \sum_{i_1<i_2<\dots<i_d} A_{i_1 i_2 \dots i_d} \tilde{f}_{i_1 i_2 \dots i_d}(x_{i_1}, x_{i_2}, \dots, x_{i_d}),   (4)

where the \tilde{f}_{i_1 i_2 \dots i_d} are considered to be normalized in some sense (e.g., to have a maximum value or an integral of 1). The amplitudes are fitted with eq. (3). The disadvantage of using eq. (3) is the need for repeated fits, as well as the need for separate code implementing the method. Another disadvantage is the loss of the ease of computing the variance of the estimate of f(\mathbf{x}) (see eq. (6) below), which then needs to be assembled from the variance estimates of all component functions.
A representation in the form of Eqs. (1)-(2) can also be obtained by GPR kernel design. Even though GPR is often considered a nonlinear machine learning method, at each particular value of \mathbf{x} it is equivalent to a regularized linear regression,

f(\mathbf{x}) = \sum_{n=1}^{M} c_n k(\mathbf{x}, \mathbf{x}^{(n)}),   (5)

with the variance of the estimate given by

\sigma^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}(\mathbf{x})^{T} (\mathbf{K} + \delta \mathbf{I})^{-1} \mathbf{k}(\mathbf{x}),   (6)

where K_{nm} = k(\mathbf{x}^{(n)}, \mathbf{x}^{(m)}) and k_n(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}^{(n)}). Eq. (5) has the form of a basis expansion, with basis functions b_n(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}^{(n)}) and with the coefficients \mathbf{c} = (\mathbf{K} + \delta \mathbf{I})^{-1} \mathbf{f} obtained by least squares.17 The covariance function is usually chosen as one of the Matérn family of functions,18

k(\mathbf{x}, \mathbf{x}') = A \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu}\, \frac{\lVert \mathbf{x} - \mathbf{x}' \rVert}{l} \right)^{\nu} K_{\nu}\!\left( \sqrt{2\nu}\, \frac{\lVert \mathbf{x} - \mathbf{x}' \rVert}{l} \right),

where \Gamma is the gamma function and K_\nu is the modified Bessel function of the second kind. At different values of \nu, this function becomes the squared exponential (\nu \to \infty), the simple exponential (\nu = 1/2), and various other widely used kernels (such as Matern3/2 and Matern5/2 for \nu = 3/2 and 5/2, respectively). \nu is typically preset by the choice of the kernel, and the length scale l and the prefactor A are hyperparameters (i.e. \boldsymbol{\theta} = (l, A)) that can be optimized. The particular value of GPR compared to a linear regression with a generic basis (which could in principle also be taken in an HDMR form) is the use of the covariance function, which imparts to Eqs. (5), (6) the meaning of the expectation value and the variance of a Gaussian distribution of f(\mathbf{x}) values.16 One can express k in the form of eq. (1) or (2); e.g., in the case of eq. (2),

k(\mathbf{x}, \mathbf{x}') = \sum_{i_1<i_2<\dots<i_d} a_{i_1 i_2 \dots i_d}\, k_{i_1 i_2 \dots i_d}\big(\mathbf{x}_{i_1 i_2 \dots i_d}, \mathbf{x}'_{i_1 i_2 \dots i_d}\big),   (11)

where \mathbf{x}_{i_1 i_2 \dots i_d} = (x_{i_1}, x_{i_2}, \dots, x_{i_d}), so that eq. (5) takes the form of the HDMR expansion of eq. (4),

f(\mathbf{x}) = \sum_{n=1}^{M} c_n k(\mathbf{x}, \mathbf{x}^{(n)}) = \sum_{i_1<i_2<\dots<i_d} A_{i_1 i_2 \dots i_d} \tilde{f}_{i_1 i_2 \dots i_d}(x_{i_1}, \dots, x_{i_d}).   (12)

Note that the elements of \mathbf{c}, and thereby the amplitudes A_{i_1 i_2 \dots i_d}, depend on products of many a_{i_1 i_2 \dots i_d} by virtue of the minors of the matrix \mathbf{K} forming its inverse (the dependence of \det(\mathbf{K}) on a_{i_1 i_2 \dots i_d} need not be considered, as it leads to a common scaling of all HDMR terms). That is, A_{i_1 i_2 \dots i_d} \neq a_{i_1 i_2 \dots i_d}, and the explicit dependence of A_{i_1 i_2 \dots i_d} on a_{i_1 i_2 \dots i_d} is impractically complex. However, the choice of a_{i_1 i_2 \dots i_d} is immaterial. One can even choose the amplitudes of the HDMR terms of the kernel randomly.
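As a quick numerical sanity check of the Matérn limits quoted above (a sketch using SciPy's modified Bessel function; the distance grid and unit length scale are arbitrary choices for the illustration), the ν = 1/2 and ν = 3/2 cases reduce to their well-known closed forms:

```python
import numpy as np
from scipy.special import kv, gamma

def matern(r, nu, l=1.0, A=1.0):
    """Matern covariance as a function of the distance r > 0."""
    s = np.sqrt(2 * nu) * r / l
    return A * (2 ** (1 - nu) / gamma(nu)) * s ** nu * kv(nu, s)

r = np.linspace(0.1, 3.0, 50)
# nu = 1/2 is the simple exponential exp(-r/l):
print(np.allclose(matern(r, 0.5), np.exp(-r)))
# nu = 3/2 (Matern3/2) is (1 + sqrt(3) r/l) exp(-sqrt(3) r/l):
print(np.allclose(matern(r, 1.5), (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)))
```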
Regardless of the choice of a_{i_1 i_2 \dots i_d}, the A_{i_1 i_2 \dots i_d} are the least-squares solutions and are in this sense optimal (one can think of any changes introduced in the relative magnitudes of the different a_{i_1 i_2 \dots i_d} as being compensated in Eq. (12) via \mathbf{c}). This is in contrast to Eq. (3), where the amplitudes of the component functions depend on the quality of the optimization and on local minima. Eq. (11) as written (which is the form we use in the tests below) uses all combinations of d variables. The approach is obviously not limited to this particular form; a generic HDMR expansion of the form of eq. (1) can also be used for the kernel and will result in a corresponding HDMR expansion of f(\mathbf{x}). Only selected combinations of d' \leq d variables can also be used, to decrease the number of terms.13 Individual component functions are computable as

f_{i_1 i_2 \dots i_d}(x_{i_1}, \dots, x_{i_d}) = a_{i_1 i_2 \dots i_d} \sum_{n=1}^{M} c_n\, k_{i_1 i_2 \dots i_d}\big(\mathbf{x}_{i_1 i_2 \dots i_d}, \mathbf{x}^{(n)}_{i_1 i_2 \dots i_d}\big),

where \mathbf{x}^{(n)}_{i_1 i_2 \dots i_d} = (x^{(n)}_{i_1}, \dots, x^{(n)}_{i_d}).
Rather than using dedicated code as in the case of Eq. (3),13 an HDMR-type kernel of Eq. (11) can be used with any existing GPR engine to obtain an HDMR representation of f(\mathbf{x}) directly. One simply needs to define a custom kernel function, which is easily doable in common machine learning libraries such as Matlab's Statistics and Machine Learning Toolbox used by us. The use of a single GPR approximation instead of Eq. (3) also makes the calculation of the variance of f(\mathbf{x}) with Eq. (6) easy; it is then also returned by the GPR engine. We caution that Eq. (6) should not automatically be used as an error bar; it allows computing the confidence interval on the expectation value of the function computed by GPR and might not be indicative of the quality of the fit (see notes and examples to this effect in Refs. 12,13).

Table 1. Test set rmse errors obtained with RS-HDMR-GPR (Eq. (3)) of different orders d in Refs. 12,13 for the potential energy surfaces of H2CO and UF6 and for the kinetic energy densities of Al, Mg, and Si, as well as those obtained with an HDMR-type kernel in this work for different numbers of training data points M. The M in the case of the RS-HDMR-GPR results are the largest among those used in Refs. 12,13. Test set sizes are 100,000 for the H2CO PES, 40,000 for the UF6 PES, and 400,000 for the KEDs. N is the number of HDMR component functions at each d. The numbers are in cm-1 for the PESs and in a.u. for the KEDs. The ranges of rmse from 10 runs are given (where available), reflecting the random nature of the training data selection from the overall dataset.
We compared the performance of an HDMR-type kernel to that of the RS-HDMR-GPR12,13 approach. We fitted the potential energy surfaces (PES) of the H2CO and UF6 molecules and the kinetic energy densities (KED) of aluminium, magnesium, and silicon crystals at equilibrium and strained geometries. For a description of the datasets and information about the applications of these functions, see Ref. 20 for H2CO, Ref. 21 for UF6, and Ref. 22 for the KED data. We chose these applications because in them high fitting accuracy is required (with errors much smaller than 1% of the data range for the model to be useful at all, and with a desirable accuracy of better than 0.01%23-25). We have 120,000 data points in six dimensions (representing molecular bonds and angles) for H2CO, with values of the potential energy ranging over 0-17,000 cm-1; 54,991 data points in 15 dimensions (representing 15 modes of vibration) for UF6, with values ranging over 0-6,629 cm-1; and 585,890 data points in seven dimensions (representing six terms of the 4th-order gradient expansion of the kinetic energy density26 and the effective potential27) for the KED, with values ranging over 0.0073-0.0398 a.u. The distributions of the data can be found in the original references. What matters for the purpose of the present work is not the details of these applications but a comparison of GPR with the HDMR-type kernel of Eq. (11) to RS-HDMR-GPR and to plain GPR (obtained when d = D in eq. (11)) using a standard kernel on the same data. For the k_{i_1 i_2 \dots i_d} of Eq. (11), we use the squared exponential kernel. The PES data were normalized before fitting; we therefore use isotropic kernels. The KED data were scaled to [0,1] for the same reason; as their distributions are extremely uneven (see Ref. 22), scaling is preferred over normalization in this case. The length parameter l is 5.47 (d = 4-6) to 8.17 (d = 1-3) for H2CO, 33.1 for UF6, and 1.22 for the KED. The corresponding \delta values are 1x10-6 (d = 4-6) to 1x10-5 (d = 1-3) for H2CO, 1x10-5 for UF6, and 5x10-4 for the KED. We set all a_{i_1 i_2 \dots i_d} to the same value, 1/N, where N = \binom{D}{d} is the number of HDMR terms and \binom{D}{d} is the binomial coefficient. We confirmed that they can be changed randomly without affecting the quality of the fit, just as the theory suggests. There is no effect of the setting of the relative amplitudes of the component functions in the kernel; there is, however, an effect via \delta, as the effect of \delta depends on the overall magnitude of the kernel. Setting all a_{i_1 i_2 \dots i_d} equal is sufficient; there is no need to optimize the magnitudes of the kernel's component functions.
The tests were run in Matlab 2021a using the fitrgp function with a custom kernel function implementing Eq. (11). In Table 1, we compare test set root mean square errors (rmse) obtained with the kernel of Eq. (11) to those reported in Refs. 12,13 with RS-HDMR-GPR (using similar kernels). We use test set sizes (also given in Table 1) which are much larger than the training sets, to grasp well the global quality of the approximation. Note that there is variability of the rmse values from run to run due to the random selection of training data from the overall dataset; it is within about +/-10% and does not affect the reported trends (we do not fix the random seed, precisely to monitor the effect of this source of uncertainty, and give rmse ranges from 10 runs). For all practical purposes, the fit quality is similar to that achieved with RS-HDMR-GPR.12,13 The obtainable error for given hyperparameters with the HDMR kernel is lower than with RS-HDMR-GPR by construction, as the coefficients are optimal in the least-squares sense. In the case of the KED data, the errors obtained here for d < D are slightly higher than those reported in Ref. 13, as in that work the length parameters were optimized for each component function (see below for tests with optimized l). When using an HDMR-type kernel, it is easier to use larger training sets, as a single GPR instance is fitted once. In Refs. 12,13, a maximum of 3600, 5000, and 5000 points were fitted for H2CO, UF6, and KED, respectively. Larger sets were not used in Refs. 12,13 in particular due to the relatively high scaling of the cost of GPR with the number of training data, compounded by the cost of applying Eq. (3) and wielding N = \binom{D}{d} GPR instances. With an HDMR-type kernel, the cost is still higher than that of a conventional Matern-type kernel, due to the higher cost of computing the kernel function, which has a larger number of terms (the number of terms is also given in Table 1 for each d), but it is more easily manageable. We also provide results with much larger training sets and larger d in the case of UF6, where the HDMR has more than a thousand terms. As expected, even lower global errors are obtainable for higher d with more training data, while lower-dimensional terms are well defined with few data9 (i.e. the test rmse is not improved by adding more data and is limited by the dimensionality of the HDMR terms). These results with the HDMR kernel highlight the advantages of the HDMR representation, namely, that with finite training data one can obtain a similar or better global rmse with d < D than with a conventional (full-dimensional) GPR. We show representative examples of correlation plots between exact values and HDMR-kernel-based GPR predictions for select d < D in Figure 1, to visually highlight the high accuracy of the regressions performed here. We mentioned above that the HDMR-type kernel allows estimating the variances of the component functions and therefore, similarly to RS-HDMR-GPR, the relative importance of different variables or combinations thereof. Taking the KED data as an example, we list the variances of the 7 component functions at d = 1, obtained at fixed and at optimized (to maximum likelihood) length parameters, in Table 2. Similarly to what was found with RS-HDMR-GPR in Ref. 13, the variables x_1, x_2, x_3, x_7 are seen as most important. Specifically, the dwindling of the variances of f_4(x_4) and f_5(x_5) corresponds to their optimized length parameters becoming large, indicating their low relevance, in a way similar to ARD.
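The ARD-like relevance analysis described here can be illustrated with a d = 1 HDMR-type kernel on synthetic data in which one input is irrelevant by construction (the data and all parameter values are illustrative assumptions, not the KED data):

```python
import numpy as np

rng = np.random.default_rng(2)

def sqexp(x, z, l=0.3):
    """1D squared-exponential kernel matrix between point sets x and z."""
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / (2 * l * l))

# Toy data: x0 and x1 determine the target; x2 is an irrelevant input.
M, D = 150, 3
X = rng.uniform(0, 1, size=(M, D))
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] ** 2

# d = 1 HDMR-type kernel: a sum of 1D kernels with equal amplitudes a = 1/D.
a = 1.0 / D
K = sum(a * sqexp(X[:, t], X[:, t]) for t in range(D))
c = np.linalg.solve(K + 1e-4 * np.eye(M), y)

# The variance of each 1D component function over the data gauges the
# relevance of the corresponding variable, similarly to ARD.
variances = [float(np.var(a * sqexp(X[:, t], X[:, t]) @ c)) for t in range(D)]
print("component variances:", variances)
```

The variance of the component function of the irrelevant variable comes out much smaller than those of the relevant ones, mirroring the behavior of f_4 and f_5 in Table 2.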

Conclusions
We showed that a kernel based on high-dimensional model representation for Gaussian process regression allows easily constructing a representation of a multivariate function as a sum of lower-dimensional terms. A similar kernel representation was previously introduced for additive Gaussian processes;19 here, we show that it allows obtaining accurate models in applications notorious for high accuracy requirements (potential energy surfaces and kinetic energy densities) and allows building HDMR representations of multivariate functions which are similar to the recently proposed RS-HDMR-GPR scheme12,13 while being much easier to use. One only needs to define a custom kernel for use with existing GPR libraries; no dedicated software is needed. There is no need to optimize the magnitudes of the HDMR terms in the kernel, and the magnitudes of the HDMR terms of the final function representation are optimal in the least-squares sense. The relative importance of different HDMR component functions and corresponding variables can also be easily evaluated.

Figure 1. Representative examples of correlation plots and correlation coefficients between exact values and HDMR-kernel-based GPR predictions for selected combinations of d and D (for the case of 10,000 training points).

Table 2. Relative variances of the seven component functions of the HDMR obtained with the kernel of Eq. (11) for d = 1 when fitting 10,000 KED data points with fixed and with optimized length parameters. The values are in a.u.