Efficient interpolation of molecular properties across chemical compound space with low-dimensional descriptors

We demonstrate accurate data-starved models of molecular properties for interpolation in chemical compound spaces with low-dimensional descriptors. Our starting point is based on three-dimensional, universal, physical descriptors derived from the properties of the distributions of the eigenvalues of Coulomb matrices. To account for the shape and composition of molecules, we combine these descriptors with six-dimensional features informed by the Gershgorin circle theorem. We use the nine-dimensional descriptors thus obtained for Gaussian process regression based on kernels with variable functional form, leading to extremely efficient, low-dimensional interpolation models. The resulting models trained with 100 molecules are able to predict the product of entropy and temperature ($S \times T$) and the zero point vibrational energy (ZPVE) with an absolute error under 1 kcal mol$^{-1}$ for $>78$ \% and under 1.3 kcal mol$^{-1}$ for $>92$ \% of molecules in the test data. The test data comprise 20,000 molecules ranging in size from three to 29 atoms, with $S \times T$ and ZPVE spanning ranges of 36 kcal mol$^{-1}$ and 161 kcal mol$^{-1}$, respectively. We also illustrate that the descriptors based on the Gershgorin circle theorem yield more accurate models of molecular entropy than those based on graph neural networks that explicitly account for the atomic connectivity of molecules.


INTRODUCTION
The number of molecules with properties suitable for medicinal applications is estimated to be between $10^{23}$ and $10^{60}$ [1]. This large number underscores the importance of developing efficient methods for predicting molecular properties for any given combination of chemically bonded atoms. As demonstrated by recent work [2,3], prediction of molecular properties can be achieved by interpolation in chemical compound space, defined as an abstract space with individual molecules as elements of the space, by machine learning (ML) models. A chemical space is often, although not always, restricted to a particular number and combination of atoms (for example, a space of molecules with chemical formula C$_7$H$_{10}$O$_2$). Interpolation of molecular properties across chemical space faces four main challenges: (i) the vast number of molecules; (ii) the requirement to represent molecules by numerical descriptors that are universal, efficient and accurate; (iii) the sensitivity of molecular properties to small structural changes in atomic arrangements; and (iv) the lack of structure in chemical compound space.
While interpolation of molecular properties in chemical spaces is often considered as a big data problem, requiring information from large sets of molecules, the focus of many recent studies has been on data-efficient models [4-7]. The efficiency of ML predictions of molecular properties can be enhanced by building underlying symmetries into ML models [8-10], by improving the molecular descriptors [6,11-16] and by improving the ML models through kernel optimization for kernel models [17-19] or optimization of neural network (NN) architecture and algorithm hyperparameters for NN-based models [20-22].
Key to any ML predictions of molecular properties are molecular descriptors that serve as inputs into supervised learning models. The molecular descriptors must be numerical, universal (i.e. describe any molecule), unique, and embody all of the relevant features, such as the shape of molecules, the arrangement of chemical bonds, bond strengths and bond order, vibrational frequencies and conformational degrees of freedom. Many variations of molecular descriptors have recently been proposed, including Coulomb matrices [23], bag of bonds (BoB) features [24], smooth overlap of atomic positions (SOAP) [25], and the Faber-Christensen-Huang-Lilienfeld (FCHL19) representation [6]. Another possibility is to represent molecules by a Graph Neural Network (GNN) [26,27] that explicitly accounts for the spatial arrangement and connectivity of atoms in a molecule. GNNs can be combined with other features embodying molecular properties, yielding accurate molecular descriptors [10,28-30]. However, for complex molecules, these descriptors are necessarily high-dimensional. While high-dimensional descriptors can be used as inputs into isotropic ML models (for example, ML models with kernels that do not discriminate between different input dimensions) [31], such models do not account for inherent differences in scales of different features. Willatt et al. [13] showed that reducing the dimensionality of descriptors may improve the accuracy of ML predictions. In addition, Musil et al. [11] argued that descriptors reflecting fundamental principles yield more robust, transferable and data-efficient ML models. In addition to efficiency, low-dimensional descriptors open the possibility of Bayesian optimization of molecular properties in chemical compound space.
The present work demonstrates that it is possible to design low-dimensional descriptors of large polyatomic molecules for extremely efficient interpolation of entropy and zero point vibrational energy in chemical compound space.
Different molecular properties exhibit vastly different variation across chemical compound spaces. Therefore, it can be argued that the most data-efficient predictions must be based on models and descriptors tailored for the specific molecular property. In the present work, we test this conjecture by first considering interpolation of molecular entropy across chemical compound spaces and developing low-dimensional descriptors tailored for the prediction of entropy. Our starting point is based on three-dimensional, universal, physical descriptors derived from the properties of the distributions of the eigenvalues of the Coulomb matrix. We show that the largest eigenvalue and the two leading moments of the eigenvalue distribution provide the physical descriptors for molecular entropy, as elucidated using the Gershgorin circle theorem [32,33]. We illustrate that the accuracy of these descriptors can be enhanced by combining them with either (1) the leading principal components of the output of a GNN tailored for the prediction of entropy, or (2) descriptors based on the Gershgorin circle theorem.
We show that the models of entropy using descriptors based on the Gershgorin circle theorem perform better than the models using descriptors based on GNN.
By using the resulting nine-dimensional descriptors in Gaussian process regression with kernels optimized for pattern recognition [17,18], we build ML models capable of predicting the entropy of 133,000 organic molecules with up to 29 atoms (HCNOF) using as few as 100 molecules for training. The predictions have chemical accuracy (with errors in entropy $\times$ temperature under 1 kcal mol$^{-1}$), illustrating, to the best of our knowledge, the most data-efficient models of entropy to date. We then illustrate that these descriptors, even though tailored for models of entropy, can be used to interpolate the zero point vibrational energy (ZPVE). The variation of ZPVE in the chemical space considered here is much greater than the range of values covered by entropy. However, we show that the nine-dimensional descriptors can be used to produce extremely data-starved yet accurate models of ZPVE, provided the kernels of the underlying models are properly optimized. We illustrate that models trained with only 100 molecules predict ZPVE with both the mean absolute error and the absolute error for $>92$ \% of predictions under 1.3 kcal mol$^{-1}$, in a chemical compound space including molecules with up to 29 atoms and covering a ZPVE range of 161 kcal mol$^{-1}$.

DATASET SUMMARY
In this work, we focus on molecules from the QM9 dataset [34], a chemical subspace consisting of 133,885 stable organic molecules made up of combinations of the atoms C, H, O, N and F. Within the QM9 dataset, there are 526 subsets of constitutional isomers, where constitutional isomers are molecules with the same chemical formula but different atom connectivity. Each subset includes at least 2 molecules, while the largest constitutional isomer subset includes 6059 molecules.
We reserve 20 % of these molecules for the hold-out test set, while samples from the remaining molecules are used to train the ML models. Since molecules in this chemical subspace range from 3 to 29 atoms in size, we bin all molecules according to their size before randomly selecting molecules for the training and test sets. This ensures the ensuing ML models sample the entire chemical subspace, both at the training and test stage. Furthermore, 20 % of the training dataset is used as a model validation set. The validation set is used for optimizing the hyperparameters of the ML models.
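The size-binned split described above can be sketched with scikit-learn's stratified splitting; the atom counts below are random stand-ins, not the actual QM9 sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical sketch of the size-stratified split described above.
# `n_atoms` would hold the atom count of each molecule (3..29 in QM9).
rng = np.random.default_rng(0)
n_atoms = rng.integers(3, 30, size=1000)          # stand-in for real QM9 sizes
indices = np.arange(len(n_atoms))

# Bin molecules by size, then split so that train and test sample all bins.
train_idx, test_idx = train_test_split(
    indices, test_size=0.20, stratify=n_atoms, random_state=0)
# 20% of the training set is further held out for hyperparameter validation.
train_idx, val_idx = train_test_split(
    train_idx, test_size=0.20, stratify=n_atoms[train_idx], random_state=0)
```

Stratifying on the atom count guarantees that every molecular size appears in the training, validation and test partitions in the same proportions.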
The prediction accuracy of the ML models is quantified, for both the test and validation sets, by the root-mean-square error (RMSE) or mean absolute error (MAE) evaluated over the number of samples $N$ in the test or validation set, as appropriate:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}, \qquad \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|.$$

Here, $y_i$ and $\hat{y}_i$ are the observed entropy and the ML-predicted entropy of the $i$-th molecule in the test/validation set, comprising $N$ molecules. Lower RMSE and MAE values indicate better accuracy of ML predictions.

METHOD
The main goal of the present work is to develop low-dimensional, general descriptors of molecules that can be applied to both small and large molecules and used for efficient interpolation of molecular properties across chemical compound spaces with a great variety of molecules. We build these descriptors incrementally, by starting with the properties of distributions describing the Coulomb matrices [23], considering the enhancement of the descriptors by Graph Neural Networks, and using physical insights offered by the Gershgorin circle theorem [33,35]. Once identified, the best low-dimensional descriptors are used as inputs into Gaussian process models with kernels optimized for a specific molecular property. The efficiency of the final models presented here is thus determined by both the efficiency of the molecular descriptors and the alignment of the model kernels with the prediction targets.

Descriptor I: Distributions of eigenvalues of Coulomb matrices
The Coulomb matrix is a molecular descriptor developed by Rupp et al. [23]. Each molecule is represented by an $n_{\max} \times n_{\max}$ symmetric matrix, where $n_{\max}$ is the number of atoms in the largest molecule in the chemical subspace. Each matrix element of $M$ is defined as

$$M_{ij} = \begin{cases} 0.5\, Z_i^{2.4} & i = j \\ \dfrac{Z_i Z_j}{R_{ij}} & i \neq j, \end{cases}$$

where $Z_i$ and $Z_j$ are the atomic numbers of atoms $i$ and $j$, respectively, and $R_{ij}$ is the separation between atoms $i$ and $j$. The diagonal elements of the Coulomb matrix represent a polynomial fit relating the atomic number to the total energies of the free atoms [23]. The off-diagonal elements describe the electrostatic interaction between atoms $i$ and $j$. If the number of atoms $n$ in a molecule is smaller than $n_{\max}$, the matrix elements with $i$ or $j > n$ are set to zero. The Coulomb matrix is invariant to molecular translations and rotations, but not to permutations of the atoms. This problem can be overcome by diagonalizing the Coulomb matrix [23,36].
In the present work, we begin by implementing basic three-dimensional (3D) descriptors of molecules using the largest eigenvalue $\epsilon_{\max}$, the mean $\mu(\epsilon > 1)$ and the standard deviation $\sigma(\epsilon > 1)$ of the distribution of the eigenvalues $\epsilon$ of the Coulomb matrix with values greater than 1. These 3D descriptors are used as inputs into the most basic ML models of entropy in this work. To enhance the expressiveness of the molecular descriptors and the accuracy of the ensuing models, we supplement these 3D vectors with features based on either the outputs of Graph Neural Networks, optimized to predict entropy as described in Descriptor II, or features based on the Gershgorin circle theorem as described in Descriptor III.
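A minimal sketch of this 3D descriptor follows, assuming atomic units for the coordinates (the paper does not restate the units here); the water geometry is only a toy input:

```python
import numpy as np

def coulomb_matrix(Z, R, n_max):
    """Coulomb matrix padded to n_max x n_max (Z: atomic numbers, R: coords)."""
    n = len(Z)
    M = np.zeros((n_max, n_max))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, i] = 0.5 * Z[i] ** 2.4        # free-atom diagonal term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

def cm_descriptor(Z, R, n_max=29):
    """3D descriptor: largest eigenvalue, mean and std of eigenvalues > 1."""
    eps = np.linalg.eigvalsh(coulomb_matrix(Z, R, n_max))
    big = eps[eps > 1.0]
    return np.array([eps.max(), big.mean(), big.std()])

# Water as a toy example (coordinates in bohr).  The descriptor is
# permutation-invariant because it uses only eigenvalues.
Z = np.array([8, 1, 1])
R = np.array([[0.0, 0.0, 0.0], [1.43, 1.11, 0.0], [-1.43, 1.11, 0.0]])
d = cm_descriptor(Z, R)
```

Because only the eigenvalue spectrum enters the descriptor, reordering the atoms in `Z` and `R` leaves `d` unchanged.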

Descriptor II: Graph Convolution Networks
To enhance the accuracy of the resulting ML models, as well as to explore the role of explicit graph descriptors for interpolation of molecular properties, we supplement the 3D descriptors derived from the Coulomb matrix with descriptors based on the outputs of graph convolution networks (GCN). A molecule can be mapped onto a graph $G = (\nu, e)$, where the atoms and bonds are treated as nodes ($\nu$) and edges ($e$), respectively.
Each graph is represented by an adjacency matrix $A$ with matrix elements

$$A_{ij} = \begin{cases} 1 & \text{if nodes } i \text{ and } j \text{ are connected by an edge } e_{ij} \\ 0 & \text{otherwise.} \end{cases}$$

With the adjacency matrix thus defined, the $(\lambda + 1)$-th layer of a multi-layer GCN is implemented as

$$H^{(\lambda+1)} = \phi\!\left( D^{-1/2} \hat{A} D^{-1/2} H^{(\lambda)} W^{(\lambda)} \right),$$

where $\hat{A} = A + I_{m_{\max}}$, $I_{m_{\max}}$ is the $m_{\max}$-dimensional identity matrix, $m_{\max}$ is the maximum number of heavy atoms (C, O, N, F) of a molecule in the dataset, $D$ is the diagonal matrix with elements $D_{ii} = \sum_j \hat{A}_{ij}$, $\phi$ is the activation function, and $W^{(\lambda)}$ is the matrix of weights for layer $\lambda \geq 1$, producing the molecular descriptors illustrated in Fig. 1. The weight matrices are optimized by training the GCN using the algorithm illustrated in Fig. 2.
The GCN is trained using the training and validation sets of molecules. For $\lambda = 1$, $H^1 = X$, where $X$ contains the node features in one-hot encoded format. The feature vector of each node includes the atomic number, hybridization, degree, and chirality of the atom. Each node of the graph aggregates information from its neighbouring node(s) as the graph passes through the layers of the GCN. The weight matrices are initialized randomly from a truncated normal distribution. The weight matrix of the last GCN layer is a one-dimensional vector, so that the final layer output is a vector instead of a matrix. In this work, we build a GCN with 3 + 1 layers, where $W_1$ is a $20 \times 9$ matrix, $W_2$ is a $9 \times 9$ matrix, and $W_3$ is a $9 \times 1$ matrix.
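A toy forward pass through such a GCN might look as follows; the symmetric normalization and the tanh activation are our assumptions, since the paper specifies the layer sizes but not these details:

```python
import numpy as np

def gcn_forward(A, X, weights, phi=np.tanh):
    """One forward pass of a multi-layer GCN of the kind sketched above.

    A: (m, m) adjacency matrix, X: (m, f) one-hot node features,
    weights: list of weight matrices, e.g. [W1 (f x 9), W2 (9 x 9), W3 (9 x 1)].
    The activation phi is an assumption; the paper does not specify it here.
    """
    m = A.shape[0]
    A_hat = A + np.eye(m)                     # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    P = D_inv_sqrt @ A_hat @ D_inv_sqrt       # symmetric normalization
    H = X
    for W in weights:
        H = phi(P @ H @ W)                    # aggregate neighbours, transform
    return H.ravel()                          # final layer: one value per node

# Toy example: a 3-node path graph with 20-dimensional node features.
rng = np.random.default_rng(1)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.standard_normal((3, 20))
Ws = [rng.standard_normal(s) * 0.1 for s in [(20, 9), (9, 9), (9, 1)]]
h = gcn_forward(A, X, Ws)
```

Each layer mixes a node's features with those of its graph neighbours, so after three layers every node has seen information up to three bonds away.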

The resulting $H^{\lambda+1}$ vector is treated as input into a Gaussian process regression (GPR) model, with the entropy of the corresponding molecule as output. The RMSE over the validation set is used as the loss function for the optimization of the GCN weight matrices. The outputs of the GCN are converted into a 9D vector by a densely connected NN layer, and the 9D vector is further reduced to $k$ dimensions by principal component analysis (PCA), yielding GCN-PC($k$) descriptors.
In the present work, the GCN model is trained with the cluster of constitutional isomers with the chemical formula C$_7$H$_{10}$O$_2$ from Ref. [34]. The trained weight matrices are then used to generate the GCN-based descriptors for the other molecules in the QM9 database of Ref. [34]. For smaller molecules with fewer than nine heavy atoms, the adjacency matrix is zero-padded to $9 \times 9$.

Descriptor III: Descriptors based on the Gershgorin circle theorem

The Gershgorin circle theorem identifies a region in the complex plane that contains all the eigenvalues of a complex square matrix [33,35]. The theorem states that for an $m \times m$ matrix $M$ with complex entries, every eigenvalue of $M$ lies in the union of discs $D_1 \cup D_2 \cup \dots \cup D_m$, where

$$D_i = \left\{ z \in \mathbb{C} : |z - M_{ii}| \leq \sum_{j \neq i} |M_{ij}| \right\}.$$

Schrier [36] analyzed the eigenvalues of Coulomb matrices using the Gershgorin circle theorem and showed that a highly substituted carbon in an alkane has larger off-diagonal values, leading to a higher upper bound for $D_i$ and consequently higher eigenvalues. We use the results of the Gershgorin circle theorem to define new descriptors of molecules, as follows.
Each Gershgorin disc $D_i$ of the Coulomb matrix $M$ is used to define a normal distribution function

$$f_i(x) = d_i\, \phi\!\left( \frac{x - \mu_i}{\sigma_i} \right),$$

where $\mu_i$ and $\sigma_i$ are the mean and the standard deviation of the normal distribution, and $\phi(\cdot)$ is the probability density function (PDF)

$$\phi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}.$$

The mean of the normal distribution is set to the center of each Gershgorin disc, $\mu_i = M_{ii}$. The standard deviation of the distribution is determined by the radius of each Gershgorin disc,

$$\sigma_i = \tau \sum_{j \neq i} |M_{ij}|,$$

where $d_i$ and $\tau$ are hyperparameters that allow adjustment of the weight and the standard deviation of $f_i$, respectively. In this work, we fix $d_i$ and $\tau$ to 1.
We further define an atomic reference PDF ($f_{\mathrm{atom}}$) centered at $M_{\mathrm{atom}} = 0.5\, Z_{\mathrm{atom}}^{2.4}$, where $d_{\mathrm{atom}}$ is a weight constant that is set to 10. By calculating the inner products between the atomic reference PDFs and the molecular PDF $f_{\mathrm{molecule}}$ (the sum of the $f_i$ over all atoms in the molecule), we reduce the high-dimensional continuous PDF curve to a five-dimensional descriptor, $\langle f_{\mathrm{HCNOF}}, f_{\mathrm{molecule}} \rangle$. We also use the area under each $f_{\mathrm{molecule}}$, denoted hereafter by AUC($f_{\mathrm{molecule}}$), as an additional molecular descriptor. The relative role of these descriptors is illustrated in the following section.
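The disc-based PDFs and the resulting six Gershgorin descriptors (AUC plus five overlaps) can be sketched as follows; the integration grid and the reference-PDF width `sigma_atom` are our assumptions:

```python
import numpy as np

def gershgorin_pdfs(M, tau=1.0, d_i=1.0):
    """Sum of per-atom normal PDFs built from the Gershgorin discs of a
    Coulomb matrix M: mean = disc center M_ii, std = tau * disc radius."""
    mu = np.diag(M)
    sigma = tau * (np.abs(M).sum(axis=1) - np.abs(mu))
    sigma = np.where(sigma > 0, sigma, 1e-6)      # guard isolated atoms
    def f_molecule(x):
        u = (np.asarray(x)[:, None] - mu) / sigma
        return (d_i * np.exp(-0.5 * u**2) / (np.sqrt(2*np.pi) * sigma)).sum(axis=1)
    return f_molecule

def gc_descriptor(M, d_atom=10.0, sigma_atom=1.0):
    """6D sketch: AUC(f_molecule) plus overlaps with atomic reference PDFs
    for H, C, N, O, F.  sigma_atom and the grid are our assumptions."""
    x = np.linspace(-50.0, 400.0, 4000)
    dx = x[1] - x[0]
    fm = gershgorin_pdfs(M)(x)
    desc = [(fm * dx).sum()]                      # AUC(f_molecule)
    for Z in (1, 6, 7, 8, 9):                     # H, C, N, O, F
        m_at = 0.5 * Z**2.4                       # diagonal CM entry of the atom
        f_at = d_atom * np.exp(-0.5 * ((x - m_at) / sigma_atom)**2) \
               / (np.sqrt(2*np.pi) * sigma_atom)
        desc.append((f_at * fm * dx).sum())       # <f_atom, f_molecule>
    return np.array(desc)

# Toy Coulomb matrix of a water-like molecule (O + 2 H, atomic units).
M = np.array([[73.52, 4.42, 4.42],
              [4.42, 0.50, 0.35],
              [4.42, 0.35, 0.50]])
desc = gc_descriptor(M)
```

With $d_i = 1$, each atomic PDF integrates to one, so the AUC counts the atoms covered by the grid; the overlaps then report how much disc weight sits near each atomic reference center.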

ML method: Gaussian Process Regression with variable kernels
Gaussian process regression (GPR) [37] is a supervised ML method that models an unknown target function $f$, given input data $X$ and output data $y$. The training data $D = \{(x_i, f_i)\}$ are noise-free observations of the function evaluated at $x_i$. GPR infers a distribution over functions, $p(f|D)$. In the context of property (e.g. entropy) interpolation across chemical space, a probability distribution of all possible entropy values for a molecule is inferred from the training dataset $D$. The inputs $x_i$ are the molecular descriptors and $f_i$ is the corresponding value of molecular entropy. GPR defines a joint distribution to predict $f_*$ at unseen inputs $X_*$:

$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu \\ \mu_* \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^\top & K_{**} \end{bmatrix} \right),$$

where $\mu = [m(x_1), \dots, m(x_{N_{\mathrm{train}}})]$, with $m(\cdot)$ representing the mean of a Gaussian distribution, and $K = \kappa(X, X)$, $K_* = \kappa(X, X_*)$, $K_{**} = \kappa(X_*, X_*)$ determine the covariance matrix, defined by the kernel function $\kappa$. The efficiency of the GP model is determined by the functional form of the kernel function.
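A minimal GPR sketch with scikit-learn follows; the synthetic 9D inputs stand in for molecular descriptors and the target is an arbitrary smooth function, not the paper's data:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Synthetic 9D inputs stand in for molecular descriptors; the target is an
# arbitrary smooth function, not real entropies.
rng = np.random.default_rng(2)
X_train = rng.uniform(-1.0, 1.0, size=(100, 9))
y_train = np.sin(X_train.sum(axis=1))

# Noise-free GPR: the posterior mean interpolates the training data.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                               normalize_y=True).fit(X_train, y_train)
X_test = rng.uniform(-1.0, 1.0, size=(20, 9))
y_pred, y_std = gpr.predict(X_test, return_std=True)   # mean and uncertainty
```

Besides the posterior mean, GPR returns a per-point predictive standard deviation, which is what makes the Bayesian optimization mentioned later in the paper possible.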
In the present work, we follow Refs. [17,18] to construct kernel functions with variable complexity by forming linear combinations and products of the following basis kernel functions: the dot product (DP) kernel

$$\kappa_{\mathrm{DP}}(x, x') = x \cdot x' + \sigma_0^2,$$

where $\sigma_0$ is a parameter which controls the inhomogeneity of the kernel; the radial basis function (RBF) kernel

$$\kappa_{\mathrm{RBF}}(x, x') = \exp\left( -\frac{d(x, x')^2}{2 l^2} \right),$$

where $l$ is the length scale and $d(\cdot, \cdot)$ is the Euclidean distance between two data points; the rational quadratic (RQ) kernel

$$\kappa_{\mathrm{RQ}}(x, x') = \left( 1 + \frac{d(x, x')^2}{2 \alpha l^2} \right)^{-\alpha},$$

where $\alpha$ is the scale mixture parameter and $l$ is the length scale of the kernel; the Matérn (MAT) kernel

$$\kappa_{\mathrm{MAT}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, d(x, x')}{l} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\, d(x, x')}{l} \right),$$

where $\nu$ and $l$ are positive parameters, $K_\nu(\cdot)$ is a modified Bessel function and $\Gamma(\cdot)$ is the gamma function; as well as the periodic (PER) kernel

$$\kappa_{\mathrm{PER}}(x, x') = \exp\left( -\frac{2 \sin^2\!\left( \pi\, d(x, x') / p \right)}{l^2} \right),$$

where $l$ is the length scale of the kernel and $p$ is the periodicity of the kernel.
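These base kernels map naturally onto scikit-learn's kernel classes, whose `+` and `*` operators build composite forms such as the paper's RQ + MAT × DP (an analogy, not the authors' implementation):

```python
import numpy as np
from sklearn.gaussian_process.kernels import (
    RationalQuadratic, Matern, DotProduct)

# scikit-learn analogues of three of the base kernels (RQ, MAT, DP).
rq = RationalQuadratic(length_scale=1.0, alpha=1.0)
mat = Matern(length_scale=1.0, nu=1.5)
dp = DotProduct(sigma_0=1.0)

# Sums and products of valid kernels are valid kernels, so composite forms
# such as RQ + MAT x DP are built by operator overloading.
complex_kernel = rq + mat * dp

# Evaluating the composite kernel yields a symmetric positive
# semidefinite covariance matrix.
X = np.random.default_rng(3).standard_normal((5, 9))
K = complex_kernel(X)
```

Closure under addition and multiplication is what lets the greedy search grow arbitrarily complex kernels while guaranteeing each candidate is still a valid covariance function.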
We optimize the functional form of the kernel function using a greedy search algorithm that combines simple functions into linear combinations and products, as described in Refs. [17,18] and used for various applications in Refs. [38-43]. The kernel construction algorithm is illustrated in Fig. 4. All possible linear combinations and products of the base kernels defined above are created to yield more complex kernels. The model is selected by the Bayesian information criterion (BIC), calculated as

$$\mathrm{BIC}(\kappa_i) = \log P(y | \kappa_i) - \frac{1}{2} |\kappa_i| \log N,$$

where $|\kappa_i|$ is the number of parameters of kernel $\kappa_i$, and $\log P(y|\kappa_i)$ is the maximum of the log-likelihood marginalized over the parameters of the functions in the Gaussian process, but not the kernel hyperparameters. This is the same function that is maximized to train a Gaussian process.

RESULTS

In order to analyze the relative performance of the different molecular descriptors, we begin by building GP models of entropy with various combinations of the molecular descriptors described in the previous section. Figure 5 compares the learning curves of the GP models thus obtained. As shown in Fig. 5, the best-performing GP model is obtained with the nine-dimensional CM-GC descriptor, which includes three descriptors stemming from the Coulomb matrix, denoted CM($\epsilon_{\max}$, $\mu(\epsilon > 1)$, $\sigma(\epsilon > 1)$), the area under the curve AUC($f_{\mathrm{molecule}}$), and the five inner products $\langle f_{\mathrm{HCNOF}}, f_{\mathrm{molecule}} \rangle$. The figure shows that models trained with the CM-GC descriptor yield the most accurate interpolation, reducing the MAE to below 1 kcal mol$^{-1}$ with fewer than 100 molecules. This figure illustrates that physical descriptors derived from combinations of CM-based quantities and properties based on the Gershgorin circle theorem improve the accuracy of the ensuing models much more significantly than descriptors based on graph neural networks.

Analysis of different low-dimensional descriptors
To understand the role of the descriptors contributing to the best-performing combination in Figure 5, we examine the correlations of molecular entropy with the largest eigenvalue of the Coulomb matrix and with the standard deviation $\sigma$ of the distribution of the eigenvalues $\epsilon > 1$, both shown in Fig. 6. While neither $\epsilon_{\max}$ nor $\sigma$ is a unique feature for entropy, the results in Fig. 6 illustrate that there exist strong correlations between entropy and both $\epsilon_{\max}$ and $\sigma$. These correlations can be rationalized using the Gershgorin circle theorem, which bounds all the eigenvalues of the Coulomb matrix by the union of the discs $D_i$, as discussed in the Method section. As an example, the more rigid isomer 2,4-dioxatricyclo[4.3.0.0$^{3,8}$]nonane (Fig. 7B) has lower entropy, higher $\epsilon_{\max} = 212.320$ and higher $\sigma(\epsilon > 1) = 61.221$, while its $\mu(\epsilon > 1) = 45.980$ is similar to that of 1-hydroxyhept-5-yn-3-one. The more rigid molecule has fewer accessible microstates and lower entropy. This is a manifestation of the fact that entropy is determined by the shape and rigidity of the molecule [44]. Rigid molecules tend to be restricted in motion, which leads to lower numbers of available microstates and lower entropy. Thus, a good representation of molecular rigidity should be a good descriptor for predicting entropy.
Given this premise, it is useful to analyze how the CM($\epsilon_{\max}$, $\mu(\epsilon > 1)$, $\sigma(\epsilon > 1)$) descriptor reflects the rigidity of molecules. The diagonal entries of the CM (Table I) for the molecules considered here are $M_{ii,\mathrm{H}} = 0.5$, $M_{ii,\mathrm{C}} = 36.858$ and $M_{ii,\mathrm{O}} = 73.517$, corresponding to hydrogen, carbon and oxygen. The off-diagonal elements $M_{ij}$ are directly proportional to the atomic numbers $Z_i$ and $Z_j$, and inversely proportional to the interatomic separation $R_{ij}$. This means that $M_{ij}$ is larger for bonded pairs of heavy atoms. Table I lists some examples of $M_{ij}$ values, where the C=O double-bond pair gives the highest $M_{ij}$ value. For rigid molecules, the atoms are spatially crowded, which leads to higher $M_{ij}$ values on average and thus to larger values of $\sum_{j \neq i} |M_{ij}|$. By the Gershgorin circle theorem, the range of eigenvalues defined by the union of the discs $D_i$ is therefore wider for rigid molecules. The opposite is true for linear and less rigid molecules. Since molecules from the same constitutional isomer cluster have the same atomic composition, the distributions of the disc centers $M_{ii}$ are identical. This explains the similarity of $\mu(\epsilon > 1)$ for molecules from the same cluster observed in our analysis (not shown) of the dataset.
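These trends are easy to verify numerically; the sketch below uses textbook bond lengths (in angstrom, converted to atomic units), which are our assumptions rather than the values of Table I:

```python
# Off-diagonal Coulomb-matrix entries M_ij = Z_i * Z_j / R_ij for typical
# bond lengths.  Values are in atomic units; the bond lengths below are
# textbook approximations, not taken from the paper's Table I.
BOHR_PER_ANGSTROM = 1.8897
bonds = {            # (Z_i, Z_j, typical bond length in angstrom)
    "C-H": (6, 1, 1.09),
    "C-C": (6, 6, 1.54),
    "C-O": (6, 8, 1.43),
    "C=O": (6, 8, 1.22),
}
m_ij = {name: Zi * Zj / (r * BOHR_PER_ANGSTROM)
        for name, (Zi, Zj, r) in bonds.items()}
# The C=O double bond gives the largest off-diagonal entry, consistent
# with the ordering discussed around Table I.
```

The ordering C-H < C-C < C-O < C=O falls out directly: heavier nuclei and shorter bonds both increase $Z_i Z_j / R_{ij}$.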
To illustrate the physical significance of the descriptors based on the PDFs derived from the Gershgorin circle theorem, as defined by Eq. (11), we examine in Figure 8 the correlation of entropy with each of the individual overlaps $\langle f_{\mathrm{atom}}, f_{\mathrm{molecule}} \rangle$ and with AUC($f_{\mathrm{molecule}}$).

All results in the present work show entropy multiplied by temperature at $T = 298.15$ K, in units of kcal mol$^{-1}$. Our goal is to build models with prediction errors below 1 kcal mol$^{-1}$ using as few molecules on input as possible. We aim to achieve this by exploiting the most performant descriptors identified in the previous section in combination with kernel models based on optimized kernels. When the dimensionality of the descriptors is low, it is possible to build anisotropic kernels, optimized for each dimension of the input space separately. In the present work, we have found that including anisotropy in the kernels for the best low-dimensional descriptors of molecules does not have any significant effect on the accuracy of the ensuing models.
We use GP regression to model entropy. The kernel functions of the GP models are chosen and optimized using a greedy search in the space of kernel functions guided by the BIC, as discussed in the ML method section. This aligns the kernels with the available data as much as possible, given the constraints of the greedy search, the descriptors and the uncertainties stemming from the extremely limited size of the datasets considered here.
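A simplified sketch of this BIC-guided greedy search follows, using scikit-learn and only two base kernels for speed; the actual algorithm of Refs. [17,18] additionally averages the BIC over multiple training sets:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic

def bic(gpr, y):
    """BIC = log marginal likelihood - 0.5 * n_params * log(N) (higher is better)."""
    n_params = len(gpr.kernel_.theta)
    return gpr.log_marginal_likelihood_value_ - 0.5 * n_params * np.log(len(y))

def greedy_kernel_search(X, y, base_kernels, levels=2):
    """At each level, extend the incumbent kernel by adding or multiplying
    every base kernel and keep the highest-BIC candidate."""
    best_kernel, best_bic = None, -np.inf
    for k in base_kernels:                        # level 1: base kernels only
        gpr = GaussianProcessRegressor(kernel=k, normalize_y=True).fit(X, y)
        b = bic(gpr, y)
        if b > best_bic:
            best_kernel, best_bic = gpr.kernel_, b
    for _ in range(levels - 1):                   # deeper levels: sums/products
        incumbent = best_kernel
        for k in base_kernels:
            for cand in (incumbent + k, incumbent * k):
                gpr = GaussianProcessRegressor(kernel=cand,
                                               normalize_y=True).fit(X, y)
                b = bic(gpr, y)
                if b > best_bic:
                    best_kernel, best_bic = gpr.kernel_, b
    return best_kernel, best_bic

# Tiny synthetic regression problem standing in for the entropy data.
rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
y = np.sin(2.0 * X[:, 0]) + X[:, 1]
best_kernel, best_bic = greedy_kernel_search(X, y, [RBF(), RationalQuadratic()])
```

The parameter-count penalty in the BIC is what makes the search prefer the simpler RQ + Matérn form over RQ + Matérn × DP when their likelihoods are nearly equal, as reported below.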
To reduce bias in the calculation of the BIC used for the greedy search of the optimal kernel function, the BIC scores are averaged over ten randomly sampled training sets. The effect of the kernel function on the accuracy of the resulting models of entropy is shown in Fig. 9, which compares the $\mu(\mathrm{BIC})$ values of base and complex kernels.
Although the RQ + Matern × DP kernel has the highest BIC score, its BIC score is only slightly higher than that of the RQ + Matern kernel. The RQ + Matern kernel is therefore chosen as the optimal kernel for the best model of entropy in this work, due to its high BIC score and lower functional complexity. The kernel search stops at level four, because the BIC scores plateau and then decrease with further increase of the kernel complexity.

Zero-point Vibrational Energy
The preceding sections illustrate that very efficient and accurate models of entropy can be built with low-dimensional descriptors. Our results shown in Fig. 11 by blue diamonds use the nine-dimensional descriptors with optimized, complex kernels given by the RQ + MAT × DP combination of base kernels.
These kernels are used for the GP regression models of ZPVE. The results illustrate that the 9D descriptors proposed here produce ZPVE interpolation models with a prediction MAE under 1 kcal mol$^{-1}$ when trained with as few as 100 molecules.
While it is common to quantify model performance by average errors, such metrics do not fully illustrate the quality of the models. A combination of average and largest errors is a better metric, which, however, can be significantly affected by a small number of outliers. To quantify unambiguously the performance of the best models of entropy and ZPVE developed in the present work, we show in Fig. 12 the distributions of prediction errors over a hold-out test set containing 20,000 molecules of different sizes. The shaded bars illustrate the performance of models trained with 100 molecules, and the open bars that of models trained with 1000 molecules. We note that the range of $S \times T$ for the molecules considered here covers 26 kcal mol$^{-1}$, while the range of ZPVE covers 160 kcal mol$^{-1}$. In both cases, we observe that our models trained with 100 molecules are able to predict $S \times T$ and ZPVE with an absolute error under 1 kcal mol$^{-1}$ for $>78$ \% of the test data and under 1.3 kcal mol$^{-1}$ for $>92.4$ \% of the test data.

SUMMARY
The goal of the present work is to build efficient, data-starved models of molecular properties for interpolation in chemical compound spaces comprising molecules of variable complexity, using low-dimensional molecular descriptors. The dataset considered here comprises 133,885 molecules ranging in size from 3 atoms to 29 atoms.
We consider two molecular properties: entropy and zero point vibrational energy. We begin by designing low-dimensional descriptors tailored for the prediction of entropy.
Our results show that entropy can be efficiently interpolated by nine-dimensional models with molecular descriptors comprising three physical parameters describing the distributions of eigenvalues of Coulomb matrices and six parameters accounting for the composition and shape of molecules through quantities based on the Gershgorin circle theorem.
In order to build efficient interpolation models of entropy, we combine the nine-dimensional descriptors thus obtained with Gaussian process regression based on kernels with variable functional form. The kernel functions are optimized using a greedy search algorithm guided by the Bayesian information criterion (BIC). In order to obtain meaningful, unbiased models based on extremely restricted datasets, we use the BIC averaged over multiple randomly sampled instances of the training data as the model selection metric. Nine-dimensional Gaussian process models thus optimized are capable of predicting molecular entropy across the entire dataset with an average error under 1 kcal mol$^{-1}$, using as few as 100 molecules on input. We present a careful analysis of the model prediction errors, demonstrating that entropy models trained with 100 molecules predict $S \times T$ with chemical accuracy (under 1 kcal mol$^{-1}$) for $>80$ \% of molecules in the test set, comprising 20,000 molecular species.
The nine-dimensional descriptors based on the distributions of the Coulomb matrix eigenvalues and the Gershgorin circle theorem are then shown to perform well for the prediction of a completely different molecular property: ZPVE. The error analysis illustrates that the Gaussian process models of ZPVE trained with 100 molecules and using optimized kernels predict ZPVE with an absolute error under 1 kcal mol$^{-1}$ for $>80$ \% of the test set and under 1.3 kcal mol$^{-1}$ for $>90$ \% of molecules in the test set. This is remarkable, given that the ZPVE of molecules in the test set covers the range from 10 to 171 kcal mol$^{-1}$.
There are three main implications of the results presented here. First, because the training sets of the demonstrated models are very small, accurate predictions of molecular properties for a wide range of molecules can be based on either experimental data or extremely accurate quantum chemistry calculations. Second, the possibility of using low-dimensional descriptors for accurate predictions with restricted datasets opens up (1) the possibility of building models with anisotropic kernels that account for the differences in scales of different features; and (2) the possibility of Bayesian optimization in chemical compound space, which can be used to predict and design molecules with desired properties. We will explore both of these possibilities in future work.

FIG. 1 :
FIG. 1: Schematic diagram illustrating the use of the GCN to create the molecular descriptor ($H^4$), with C$_2$H$_4$O as an example. The GCN used in this work, as demonstrated in this figure, has four layers ($\lambda + 1 = 4$). The resulting molecular descriptor is nine-dimensional, because the largest molecule in the QM9 dataset considered here has 9 heavy atoms.

FIG. 2 :
FIG. 2: Schematic diagram of the algorithm used in the present work for training the GCN by optimizing the weight matrices ($W_1$, $W_2$, $W_3$). Each molecule in A is processed as described in Fig. 1 to yield a unique GCN vector ($H^4_{\mathrm{molecule}}$), used as the molecular descriptor. A subset of molecules in A, yielding the set of vectors depicted in C, is used to train a Gaussian process (GP) model of entropy. The RMSE of the GP-predicted values is used in step E as the loss function to optimize the weight matrices ($W_1$, $W_2$, $W_3$) of the GCN by gradient descent.

FIG. 3 :
FIG. 3: $f_{\mathrm{molecule}}$ of two C$_7$H$_{10}$O$_2$ constitutional isomers, where the blue curve is the isomer with higher entropy and the orange curve is that with lower entropy.

FIG. 4 :
FIG. 4: Schematic diagram of the kernel construction algorithm used in this work to increase the complexity of the kernel function for optimal kernel models.

FIG. 5 :
FIG. 5: Test MAE of GP regression of entropy trained with different combinations of molecular descriptors: circles - the 3D CM($\epsilon_{\max}$, $\mu(\epsilon > 1)$, $\sigma(\epsilon > 1)$) with the RQ + RQ kernel; filled diamonds - CM + GCN-PC($k = 5$) with the RQ + RQ kernel; empty diamonds - CM + GCN-PC($k = 9$) with the RQ + RQ kernel; up-triangles - the 4D CM + AUC[$f_{\mathrm{molecule}}$] with the RQ + RQ + MAT kernel; down-triangles - the 9D CM + AUC[$f_{\mathrm{molecule}}$] + $\langle f_{\mathrm{HCNOF}}, f_{\mathrm{molecule}} \rangle$ descriptor trained with the RQ + MAT × DP kernel. The kernel of each model is optimized as described in the text, with the BIC as the model selection metric. The lines represent the average of ten sets of training using different distributions of molecules in the training set, and the shaded area is the standard deviation of the results.

FIG. 6 :
FIG. 6: Entropy as a function of the largest eigenvalue ($\epsilon_{\max}$) of the Coulomb matrix (left panel) and the standard deviation ($\sigma$) of the Coulomb matrix eigenvalues greater than 1 (right panel) for constitutional isomers with the formula C$_7$H$_{10}$O$_2$.

FIG. 7 :
FIG. 7: Visual demonstration of the Gershgorin circle theorem on the Coulomb matrices of two C$_7$H$_{10}$O$_2$ constitutional isomers, 1-hydroxyhept-5-yn-3-one (A, top) and 2,4-dioxatricyclo[4.3.0.0$^{3,8}$]nonane (B, bottom), where molecule A has the higher entropy. The entropy values are in units of kcal mol$^{-1}$. The Gershgorin discs of each isomer's Coulomb matrix are drawn as orange circles. The blue triangles are the eigenvalues of the Coulomb matrices. The red arrows emphasize the largest eigenvalue (CM-EV1) of each isomer's Coulomb matrix.

Fig. 7 visualizes the results of the Gershgorin circle theorem. The y- and x-axes are the imaginary and real axes of the complex plane containing the eigenvalues. The orange discs are the $D_i$ defined by Eq. (7). Fig. 7 shows that the union of discs for the more rigid molecule (B) is $D_1 \cup \dots \cup D_{19} = \{-152.77 \leq z \leq 292.16\}$, which covers a wider range than that for molecule A, $D_1 \cup \dots \cup D_{19} = \{-128.78 \leq z \leq 248.76\}$. Since the range of eigenvalues is wider for the more rigid molecule, it has a higher probability of having a larger $\epsilon_{\max}$ value and a bigger $\sigma(\epsilon > 1)$.

Figure 8 shows that entropy is particularly well correlated with $\langle f_{\mathrm{H}}, f_{\mathrm{molecule}} \rangle$ and $\langle f_{\mathrm{C}}, f_{\mathrm{molecule}} \rangle$, and strongly correlated with AUC($f_{\mathrm{molecule}}$).

FIG. 8 :
FIG. 8: Correlation of entropy with AUC (upper left panel) and $\langle f_{\mathrm{atom}}, f_{\mathrm{molecule}} \rangle$ for constitutional isomers with the chemical formula C$_7$H$_{10}$O$_2$.
For the kernel search, 1000 training data points are chosen randomly from the pool of 85,000 data points reserved for training. Each of the randomly chosen training datasets is passed through multiple levels of the BIC kernel selection algorithm. At each level of kernel selection, we calculate the average value of the BIC scores, $\mu(\mathrm{BIC})$, for the same kernel trained with ten different sets of randomly selected data points. The base or complex kernel with the highest average BIC score is retained for the next level of kernel function selection.

FIG. 10 :
FIG. 10: Zero point vibrational energy as a function of AUC($f_{\mathrm{molecule}}(x\,|\,\mu = M_{ii}, \sigma^2 = (\sum_{j \neq i} |M_{ij}|)^2)$) and $\langle f_{\mathrm{ref}}, f_{\mathrm{molecule}} \rangle$ for all molecules in the QM9 dataset (left panels), and for the molecules from the ten largest isomer groups in the QM9 dataset (right panels). The reference molecule is H$_3$C$_4$N$_2$OF, illustrated in Fig. 3.

Figure 11 shows the interpolation MAE for ZPVE, comparing the performance of the present 9D models with the models developed in Ref. [45]. The results in Ref. [45] use high-dimensional descriptors: the full CM with 435 features (orange circles in Fig. 11), BoB with 1128 features (green triangles), SLATM with 11,960 features (red squares), FCHL with 10,440 features (purple down-triangles), and MBDF with 145 features (brown diamonds).

TABLE I :
Off-diagonal entries ($M_{ij}$) of the Coulomb matrix for several bonded atom pairs.
Efficient and accurate models of entropy can be built with the 9D descriptors (CM + AUC[$f_{\mathrm{molecule}}$] + $\langle f_{\mathrm{HCNOF}}, f_{\mathrm{molecule}} \rangle$). It is instructive to explore the performance of these descriptors in models of other molecular properties. Here, we consider the zero point vibrational energy (ZPVE) of molecules. Figure 10 illustrates that ZPVE is strongly correlated with AUC($f_{\mathrm{molecule}}$) and $\langle f_{\mathrm{ref}}, f_{\mathrm{molecule}} \rangle$.