
Efficient interpolation of molecular properties across chemical compound space with low-dimensional descriptors

Yun-Wen Mao and Roman V Krems

Published 3 April 2024 © 2024 The Author(s). Published by IOP Publishing Ltd
Citation: Yun-Wen Mao and Roman V Krems 2024 Mach. Learn.: Sci. Technol. 5 015059. DOI: 10.1088/2632-2153/ad360e


Abstract

We demonstrate accurate data-starved models of molecular properties for interpolation in chemical compound spaces with low-dimensional descriptors. Our starting point is based on three-dimensional, universal, physical descriptors derived from the properties of the distributions of the eigenvalues of Coulomb matrices. To account for the shape and composition of molecules, we combine these descriptors with six-dimensional features informed by the Gershgorin circle theorem. We use the nine-dimensional descriptors thus obtained for Gaussian process regression based on kernels with variable functional form, leading to extremely efficient, low-dimensional interpolation models. The resulting models trained with 100 molecules are able to predict the product of entropy and temperature (S×T) and zero point vibrational energy (ZPVE) with the absolute error under 1 kcal mol−1 for ${\gt}78\%$ and under 1.3 kcal mol−1 for ${\gt}92\%$ of molecules in the test data. The test data comprise 20 000 molecules with sizes ranging from three to 29 atoms, with S×T and ZPVE spanning ranges of 36 kcal mol−1 and 161 kcal mol−1, respectively. We also illustrate that the descriptors based on the Gershgorin circle theorem yield more accurate models of molecular entropy than those based on graph neural networks that explicitly account for the atomic connectivity of molecules.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The number of molecules with properties suitable for medicinal applications is estimated to be between $10^{23}$ and $10^{60}$ [1]. This large number underscores the importance of developing efficient methods for predicting molecular properties for any given combination of chemically bonded atoms. As demonstrated by recent work [2, 3], prediction of molecular properties can be achieved by interpolation in chemical compound space, defined as an abstract space with individual molecules as elements of the space, by machine learning (ML) models. A chemical space is often, although not always, restricted to a particular number and combination of atoms (for example, a space of molecules with chemical formula C7H10O2). Interpolation of molecular properties across chemical space faces four main challenges: (i) the vast number of molecules; (ii) the requirement to represent molecules by numerical descriptors that are universal, efficient and accurate; (iii) the sensitivity of molecular properties to small structural changes in atomic arrangements; (iv) the lack of structure in chemical compound space. While interpolation of molecular properties in chemical spaces is often considered a big data problem, requiring information from large sets of molecules, the focus of many recent studies has been on data-efficient models [4–7]. The efficiency of ML predictions of molecular properties can be enhanced by building underlying symmetries into ML models [8–10], by improving the molecular descriptors [6, 11–16] and by improving the ML models through kernel optimization for kernel models [17–19] or optimization of neural network (NN) architecture and algorithm hyperparameters for NN-based models [20–22].

Key to any ML predictions of molecular properties are molecular descriptors that serve as inputs into supervised learning models. The molecular descriptors must be numerical, universal (i.e. describe any molecule), unique, and embody all of the relevant features, such as the shape of molecules, the arrangement of chemical bonds, bond strengths and bond order, vibrational frequencies and conformational degrees of freedom. Many variations of molecular descriptors have recently been proposed, including Coulomb matrices [23], bag of bonds (BoB) features [3], smooth overlap of atomic positions (SOAP) [24], and the Faber–Christensen–Huang–Lilienfeld (FCHL19) representation [6]. Another possibility is to represent molecules by a graph neural network (GNN) [25, 26] that explicitly accounts for the spatial arrangement and connectivity of atoms in a molecule. GNNs can be combined with other features embodying molecular properties, yielding accurate molecular descriptors [10, 27–29]. However, for complex molecules, these descriptors are necessarily high-dimensional. While high-dimensional descriptors can be used as inputs into isotropic ML models (for example, ML models with kernels that do not discriminate between different input dimensions) [30], such models do not account for inherent differences in the scales of different features. Willatt et al [13] showed that reducing the dimensionality of descriptors may improve the accuracy of ML predictions. In addition, Musil et al [11] argued that descriptors reflecting some fundamental principles yield more robust, transferable and data-efficient ML models. Beyond efficiency, low-dimensional descriptors enable the possibility of Bayesian optimization of molecular properties in chemical compound space. The present work demonstrates that it is possible to design low-dimensional descriptors of large polyatomic molecules for extremely efficient interpolation of entropy and zero point vibrational energy in chemical compound space.

Different molecular properties exhibit vastly different variation across chemical compound spaces. Therefore, it can be argued that the most data-efficient predictions must be based on models and descriptors tailored for the specific molecular property. In the present work, we test this conjecture by first considering interpolation of molecular entropy across chemical compound spaces and developing low-dimensional descriptors tailored for the prediction of entropy. Our starting point is based on three-dimensional, universal, physical descriptors derived from the properties of the distributions of the eigenvalues of the Coulomb matrix. We show that the largest eigenvalue and the two leading moments of the eigenvalue distribution provide the physical descriptors for molecular entropy, as elucidated using the Gershgorin circle theorem [31, 32]. We illustrate that the accuracy of these descriptors can be enhanced by combining them with either (1) the leading principal components of the output of a GNN tailored for the prediction of entropy, or (2) descriptors based on the Gershgorin circle theorem. We show that the models of entropy using descriptors based on the Gershgorin circle theorem perform better than the models using descriptors based on GNN.

By using the resulting nine-dimensional descriptors in Gaussian process regression with kernels optimized for pattern recognition [17, 18], we build ML models capable of predicting the entropy of 133 000 organic molecules with up to 29 atoms (HCNOF) using as few as 100 molecules for training. The predictions have chemical accuracy (with the error of predicted entropy × temperature under 1 kcal mol−1), yielding, to the best of our knowledge, the most data-efficient models of entropy to date. We then illustrate that these descriptors, even though tailored for models of entropy, can be used to interpolate zero point vibrational energy (ZPVE). The variation of ZPVE in the chemical space considered here is much greater than the range of values covered by entropy. However, we show that the nine-dimensional descriptors can be used to produce extremely data-starved yet accurate models of ZPVE, provided the kernels of the underlying models are properly optimized. We illustrate that models trained with only 100 molecules predict ZPVE with both the mean absolute error and the absolute error for ${\gt}92\%$ of predictions under 1.3 kcal mol−1, in a chemical compound space including molecules with up to 29 atoms and covering a ZPVE range of 161 kcal mol−1.

2. Background on molecular descriptors

There are three main applications of ML in chemistry that rely on specifically chosen molecular descriptors: ML modeling of potential energy surfaces (PES) and force fields for polyatomic molecules and molecular clusters, synthesis design, and screening of molecular properties in chemical compound spaces. Our current work falls into the last category. In this section, we briefly describe these applications to provide context for the present work. There is also an emerging field of cheminformatics [33–35], which we leave outside of the scope of the present discussion.

ML models of PES and force fields, needed for molecular dynamics simulations or for quantum calculations of molecular properties such as ro-vibrational spectra, aim to build accurate and efficient representations of ab initio data obtained from quantum chemistry calculations. The goal is usually to minimize the number of expensive quantum chemistry calculations. The models are based on both artificial neural networks (NNs) [36–45] and kernel methods [46–58], and often involve an interplay between ab initio calculations and ML [59–66]. The inputs into such models must generally include information on the positions of atoms within the molecules or molecular complexes considered and should, therefore, have at least the same dimensionality as the molecular configuration space. The following molecular descriptors have been used for the purposes of modelling PES and force fields: Coulomb matrices (CM) [23], atom-centered symmetry functions (ACSF) [67], the spectrum of London and Axilrod–Teller–Muto potentials (SLATM) [68], smooth overlap of atomic positions (SOAP) [24], and Faber–Christensen–Huang–Lilienfeld (FCHL) descriptors [6, 16].

CM is a global descriptor that includes the electrostatic interaction between atoms, whereas BoB can be viewed as the vectorized form of CM. The many-body tensor representation (MBTR) is a global descriptor that encodes molecular structure into tensors. ACSFs define local atomic environments through a fingerprint comprising the outcomes of multiple two- and three-body functions. A two-body component accounts for the radial distribution between a central atom and its neighboring atoms in the immediate vicinity, while a three-body component similarly considers factors such as the distribution of angles and distances between atoms within the central atom's local environment. Descriptors such as SOAP, SLATM, FCHL, DF, and MBDF are likewise based on multiple two- and three-body functions. For example, SOAP descriptors represent sections of atomic geometries through a local expansion of a Gaussian-blurred atomic density in orthonormal functions derived from spherical harmonics and radial basis functions. These functions are customizable to detect specific structural features within the vicinity of an atom.

Another type of ML application focuses on interactions between complex organic molecules for the purposes of synthesis design. The goal of synthesis design is to determine how to synthesize a desirable product given the available materials and reaction conditions. ML methods are instrumental in two distinct tasks within synthesis design: (i) reaction prediction and (ii) retrosynthesis. Existing ML methods for both tasks can be further categorized into template-based and template-free approaches. Template-based approaches involve the classification or ranking of candidate products using known reaction templates. Despite achieving up to 78% prediction accuracy in some cases [69], this approach has a notable limitation. Due to computational resource constraints, the algorithm can only search within the templates specified by either experts in the field or heuristic algorithms trained on databases. Consequently, the performance of the algorithm is dependent on and limited by the provided templates. To address this limitation, the template-free approach predicts a series of mechanistic steps using fingerprints or custom-designed descriptors as input. In both template-based and template-free approaches, various ML architectures have been explored, including feed-forward neural networks (FFNN) [69–71], recurrent neural networks (RNN) [72, 73], graph neural networks (GNN) [74–76], and transformers [77].

For reaction prediction, the goal is to train ML models to establish accurate one-to-one mappings between reactants and their corresponding products under specified reaction conditions. In contrast, retrosynthesis predictions aim to identify the optimal synthetic pathway to generate a target structure, potentially involving multiple steps. The development of traditional computer-aided retrosynthesis methods has been a heavily researched field since the 1960s [78, 79], predating the growing popularity of ML. Examples of such computer-assisted synthesis planning (CASP) include programs such as LHASA [80], Simulation and Evaluation of Chemical Synthesis (SECS) [81], and IBM RXN [82]. Unlike in the previous case, the molecular descriptors for this type of application do not always have to include explicitly the dependence on the vibrational motions of molecules, but must account for the structure and composition of molecules. This makes the descriptors for this application also high-dimensional. Examples of molecular descriptors used for the purposes of synthesis design include the simplified molecular-input line-entry system (SMILES) [83], the International Chemical Identifier (InChI) [84], Molecular ACCess System (MACCS) keys [85], and extended-connectivity fingerprints (ECFPs) [86].

The third type of ML approach aims to screen molecules in the chemical space in order to identify suitable candidates for particular applications, including drug, material and catalyst discovery and optimization. For these applications, conducting experiments or computations for all candidates in the vast chemical space is impractical. In contrast with atomic-scale models, where the nuclear charges and positions of atoms in the molecule(s) must be included in the model features, the inputs for these applications of ML need to be informative of the desirable property to be predicted. For example, for catalyst optimization, molecular descriptors must be carefully curated to include physical attributes such as conformer distributions to improve ML model predictions. Inspired by 4D-QSAR [87–89], descriptors such as the average steric occupancy (ASO) [89, 90] have been developed. The conformers for each catalyst are obtained, aligned and averaged to compute the ASO descriptors. Steric information is effectively captured by this method, which allows ML models to predict catalysts with an average enantioselectivity over 80% enantiomeric excess. Other descriptors have been developed for enantioselective catalyst design, including the electrostatic potential map (ESP) [91], the average electronic indicator field (AEIF) [92, 93], and electronic and steric molecular interaction fields (MIFs) [94].

Models for screening in chemical compound spaces are based on various ML approaches, including generative adversarial networks (GAN) [95–97], convolutional NNs [98, 99], and back-propagation NNs such as recurrent NNs [100–102]. However, NNs require large amounts of training data to produce accurate models. Given the high cost of experiments and theoretical computations, data availability on molecular properties can be significantly limited. On the other hand, non-parametric kernel-based algorithms, such as Gaussian process regression (GPR) or kernel ridge regression (KRR), have been shown to make accurate predictions based on very sparse data [17, 24, 52, 103]. One can use isotropic kernels to interpolate in chemical compound spaces with high-dimensional descriptors [3, 6, 16, 23, 24, 67, 68, 104, 105]. For this purpose, previous studies have utilized some of the same descriptors as for PES and force-field construction: Coulomb matrices (CM, 435 D) [23], bags of bonds (BoB, 1128 D) [3], two-body many-body tensor representations (MBTR, >2500 D) [106], atom-centered symmetry functions (ACSF, 4350 D) [67, 104], the spectrum of London and Axilrod–Teller–Muto potentials (SLATM, 11 960 D) [68], Faber–Christensen–Huang–Lilienfeld descriptors (FCHL, 10 440 D) [6], smooth overlap of atomic positions (SOAP, 13 920 D) [24], weighted density functions (DF, 2500 D) and many-body distribution functionals (MBDF, 145 D) [105]. The list above includes the dimensionality required for interpolating the QM9 dataset [107] considered in the present work.

As can be seen, all of these descriptors are high-dimensional. Optimization of molecular properties is one of the most important applications of interpolation models, and high-dimensional descriptors make such optimization difficult. Furthermore, omitting the anisotropy of the kernel function may restrict the inference power of kernel models. At the same time, for the purposes of interpolation in chemical compound spaces, it seems unnecessary to include fine details of intra-molecular structure in molecular descriptors. It may be possible instead to describe molecules by effective low-dimensional vectors. The purpose of the present work is to explore the possibility of accurate and data-efficient interpolation of molecular properties in chemical compound spaces with low-dimensional (<10 D) descriptors. Furthermore, it has been shown recently that kernel models of PES [108] and force fields [109] can be significantly improved by optimizing the functional form of the kernel function. It thus appears that data-efficient interpolation of molecular properties in chemical compound spaces can be achieved in two ways: (i) by improving the expressiveness of the molecular descriptors through increasing the descriptor dimensionality; (ii) by improving the expressiveness of the kernel functions by optimizing their functional form. In the present work, we explore the second approach and illustrate that models based on low-dimensional descriptors with optimized functional forms of the kernel functions can achieve chemical accuracy with extremely limited data.

3. Dataset summary

In this work, we focus on molecules from the QM9 dataset [107], a chemical subspace consisting of 133 885 stable organic molecules made up of combinations of the atoms C, H, O, N and F. Within the QM9 dataset, there are 526 subsets of constitutional isomers, where constitutional isomers are molecules with the same chemical formula but different atom connectivity. Each subset includes at least two molecules, while the largest constitutional isomer subset includes 6059 molecules.

We reserve 20% of these molecules for the hold-out test set, while the ML models are trained with molecules sampled from the remainder. Since molecules in this chemical subspace range from 3 to 29 atoms in size, we bin all molecules according to their size before randomly selecting the molecules for the training and test sets. This ensures that the ensuing ML models sample the entire chemical subspace, both at the training and test stage. Furthermore, 20% of the training dataset is used as a model validation set. The validation set is used for optimizing the hyperparameters of the ML models.
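For illustration, the size-binned train/test split described above can be sketched as follows. This is a minimal sketch in Python with NumPy; the function name `binned_split`, the fixed random seed and the per-bin rounding rule are our own illustrative choices, not part of the original workflow.

```python
import numpy as np

def binned_split(sizes, test_frac=0.2, seed=0):
    """Assign molecule indices to train/test sets, sampling each
    size bin separately so that both sets span the full range of
    molecule sizes. `sizes` holds one atom count per molecule."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for size in np.unique(sizes):
        idx = np.flatnonzero(sizes == size)  # molecules of this size
        rng.shuffle(idx)
        n_test = int(round(test_frac * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.array(train), np.array(test)

# Toy dataset: five 3-atom and five 29-atom molecules
sizes = np.array([3, 3, 3, 3, 3, 29, 29, 29, 29, 29])
train, test = binned_split(sizes)
```

Because the split is performed per bin, both the smallest and the largest molecules are guaranteed to appear in the training and test sets.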

The prediction accuracy of the ML models is quantified, for both the test and validation sets, by the root-mean-square error (RMSE)

$\textrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i = 1}^{N}\left(y_i - \hat{y}_{i}\right)^2}$ (1)

or mean absolute error (MAE)

$\textrm{MAE} = \frac{1}{N}\sum_{i = 1}^{N}\left|y_i - \hat{y}_{i}\right|$ (2)

evaluated over the number of samples N in the test or validation set, as appropriate. Here, yi and $\hat{y}_{i}$ are the observed and ML-predicted entropy of the ith molecule in the test/validation set, comprising N molecules. Lower RMSE and MAE values indicate better accuracy of ML predictions.
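The two error metrics of equations (1) and (2) can be computed directly; a minimal NumPy sketch:

```python
import numpy as np

def rmse(y, y_hat):
    # Root-mean-square error, equation (1)
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def mae(y, y_hat):
    # Mean absolute error, equation (2)
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))
```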

4. Method

The main goal of the present work is to develop low-dimensional, general descriptors of molecules that can be applied to both small and large molecules and used for efficient interpolation of molecular properties across chemical compound spaces with a great variety of molecules. We build these descriptors incrementally, starting with the properties of distributions describing the Coulomb matrices [23], considering the enhancement of the descriptors by graph neural networks, and using physical insights offered by the Gershgorin circle theorem [32, 110]. Once identified, the best low-dimensional descriptors are used as input features for Gaussian process models with kernels optimized for a specific molecular property. The efficiency of the final models presented here is thus determined by both the efficiency of the molecular descriptors and the alignment of model kernels with the prediction targets.

4.1. Descriptor I: distributions of eigenvalues of Coulomb matrices

The Coulomb matrix is a molecular descriptor developed by Rupp et al [23]. Each molecule is represented by an $n_\textrm{max}\times n_\textrm{max}$ symmetric matrix, where $n_\textrm{max}$ is the number of atoms in the largest molecule in the chemical subspace. Each matrix element Mij of matrix M is defined as

$M_{ij} = \begin{cases} 0.5\,Z_i^{2.4} & \textrm{for } i = j \\ \dfrac{Z_i Z_j}{R_{ij}} & \textrm{for } i \neq j \end{cases}$ (3)

where Zi and Zj are the atomic numbers of atoms i and j, respectively, and Rij is the separation between atoms i and j. The diagonal elements of the Coulomb matrix represent a polynomial fit relating the atomic number to the total energies of the free atoms [23]. The off-diagonal elements describe the electrostatic interaction between atoms i and j. If the size of a molecule is smaller than $n_\textrm{max}$, the matrix elements with i or j greater than the number of atoms in the molecule are set to zero. The Coulomb matrix is invariant to molecular translations and rotations, but not to permutations of the atoms. This problem can be overcome by diagonalizing the Coulomb matrix [23, 111].
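As a concrete illustration of equation (3), the following sketch builds the zero-padded Coulomb matrix from atomic numbers and Cartesian positions. The function name is our own, and the positions are assumed to be in the same length units as $R_{ij}$.

```python
import numpy as np

def coulomb_matrix(Z, R, n_max):
    """Coulomb matrix of equation (3): diagonal 0.5*Z**2.4,
    off-diagonal Z_i*Z_j/R_ij, zero-padded to n_max x n_max."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n_max, n_max))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M
```

For example, for H2 with a bond length of 0.74, the off-diagonal element is $1/0.74$ and the padded rows and columns remain zero.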

In the present work, we begin by implementing the basic three-dimensional (3D) descriptors of molecules using the largest eigenvalue εmax, the mean $\mu(\epsilon \gt 1)$ and the standard deviation $\sigma(\epsilon \gt 1)$ of the distribution of the eigenvalues of the Coulomb matrix (ε) with values greater than 1. These 3D descriptors are used as inputs into the most basic ML models of entropy in this work. To enhance the expressiveness of the molecular descriptors and the accuracy of the ensuing models, we supplement these 3D vectors with features based on either the outputs of graph neural networks, optimized to predict entropy as described in Descriptor II, or features based on the Gershgorin circle theorem as described in Descriptor III.
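The 3D descriptor can then be extracted from the eigenvalues of the Coulomb matrix; a minimal sketch, assuming at least one eigenvalue exceeds 1:

```python
import numpy as np

def cm_descriptor(M):
    """3D descriptor: largest eigenvalue, and the mean and standard
    deviation of the Coulomb-matrix eigenvalues greater than 1."""
    eps = np.linalg.eigvalsh(M)   # real eigenvalues (M is symmetric)
    big = eps[eps > 1]            # discard eigenvalues <= 1
    return np.array([eps.max(), big.mean(), big.std()])
```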

4.2. Descriptor II: graph convolution networks

To enhance the accuracy of the resulting ML models, as well as to explore the role of explicit graph descriptors for interpolation of molecular properties, we supplement the 3D descriptors derived from the Coulomb matrix with descriptors based on the outputs of graph convolution networks (GCN). A GCN is a specific type of graph neural network that uses convolutional operations to propagate information. A molecule can be mapped onto a molecular graph $\textrm{MG} = (\nu, e)$, where the atoms and bonds are treated as nodes (ν) and edges (e), respectively. Each graph is represented by an adjacency matrix $\hat{A}^\textrm{MG}$ with matrix elements defined as

$\hat{A}^{\textrm{MG}}_{ij} = \begin{cases} 1 & \textrm{if atoms } i \textrm{ and } j \textrm{ are connected by a bond} \\ 0 & \textrm{otherwise} \end{cases}$ (4)

Alternatively, a molecule can be mapped onto a fully connected graph $\textrm{FG} = (\nu, e)$, where all nodes are connected to each other by weighted edges (e). Each graph is represented by an adjacency matrix $\hat{A}^\textrm{FG}$ with matrix elements defined as

$\hat{A}^{\textrm{FG}}_{ij} = \begin{cases} e_{ij} & \textrm{for } i \neq j \\ 0 & \textrm{for } i = j \end{cases}$ (5)

where eij is the edge between the ith and jth node. Each eij is weighted by the electrostatic interaction between atoms i and j corresponding to the ith and jth nodes. With the adjacency matrix thus defined, the $(\lambda+1)$th layer of a multi-layer GCN is implemented using

$H^{\lambda+1} = \varphi\left(\hat{D}^{-1/2}\,\hat{A}\,\hat{D}^{-1/2}\,H^{\lambda}\,W^{\lambda}\right)$ (6)

where $\hat{A}$ is a general adjacency matrix which can be either $\hat{A}^\textrm{MG}$ or $\hat{A}^\textrm{FG}$. $\hat{D}$ is the diagonal matrix with elements

$\hat{D}_{ii} = \sum_{j} \hat{A}_{ij}$ (7)

$\hat{A} = A + I_{m_\textrm{max}}$, where $I_{m_\textrm{max}}$ is the $m_\textrm{max}$-dimensional identity matrix, $m_\textrm{max}$ is the maximum number of heavy atoms (CONF) in any molecule of the dataset, φ is the activation function, and Wλ is the matrix of weights for layer $\lambda \unicode{x2A7E} 1$, producing the molecular descriptors illustrated in figure 1. The weight matrices are optimized by training the GCN using the algorithm illustrated in figure 2.
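A single propagation step of equation (6) can be sketched as follows. The choice of `tanh` as the activation φ is an illustrative assumption; the original work does not specify the activation here.

```python
import numpy as np

def gcn_layer(A, H, W, phi=np.tanh):
    """One GCN propagation step, equation (6):
    H' = phi(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(len(A))           # add self-loops
    d = A_hat.sum(axis=1)                # node degrees, equation (7)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return phi(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```

Each row of the output mixes the features of a node with those of its neighbors, weighted by the symmetrically normalized adjacency matrix.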

Figure 1.

Figure 1. Schematic diagram illustrating the use of GCN to create a molecular descriptor ($H^4_{\mathrm{C}_{7}\mathrm{H}_{10}\mathrm{O}_{2}}$), with C7H10O2 as an example. The GCN used in this work, as demonstrated in this figure, has four layers ($\lambda +1 = 4$). The resulting molecular descriptor is nine-dimensional, because the largest molecule in the QM9 dataset considered here has 9 heavy atoms.

Figure 2.

Figure 2. Schematic diagram of the algorithm used in the present work for training GCN by optimizing weight matrices (W1, W2, W3). Each molecule in A is processed as described in figure 1 to yield a unique GCN vector ($H^4_\textrm{molecule}$), used as the molecular descriptor. A subset of molecules in A, yielding a set of vectors depicted in C, is used to train a Gaussian process (GP) model of entropy. The RMSE of the GP predicted values is used in step E as the loss function to optimize the weight matrices (W1, W2, W3) of GCN by gradient descent.


The GCN is trained using the training and validation sets of molecules. For λ = 1, $H^1 = X$, where X are the node features in the one-hot-encoded data format. The feature of each node is a vector including the atomic number, hybridization, degree, and chirality of each atom. Each node of the graph aggregates information from its neighboring node(s) as the graph passes through the layers of the GCN by equation (6). The weight matrices are initialized randomly with a truncated normal distribution. The weight matrix of the last GCN layer is set as a one-dimensional vector, so that the resulting final layer $H^{\lambda+1}_{i}$ is a vector instead of a matrix. In this work, we build a GCN with $3+1$ layers, where W1 is a $20 \times 9$ matrix, W2 is a $9 \times 9$ matrix, and W3 is a $9 \times 1$ matrix.

The resulting $H^{\lambda+1}_{i}$ vector is treated as input into a Gaussian process regression (GPR) model, with entropy of the corresponding molecule as output. The RMSE over the validation set is used as the loss function for the optimization of the GCN weight matrices. The outputs of GCN are converted into a 9D vector by a densely connected NN layer, and the 9D vector is further reduced to k dimensions by principal component analysis (PCA), yielding GCN-PC(k) descriptors.

In the present work, the GCN model is trained with the cluster of constitutional isomers with the chemical formula C7H10O2 from [107]. The trained weight matrices are then applied to the adjacency matrices of the other molecules in the QM9 database of [107] to generate the GCN-based descriptors. For smaller molecules with fewer than nine heavy atoms, the adjacency matrix retains the size $9\times 9$.

4.3. Descriptor III: Gershgorin circle theorem

The Gershgorin circle theorem identifies a region in the complex plane that contains all the eigenvalues of a complex square matrix [32, 110]. The theorem states that, for an m×m matrix M with entries in the complex plane $\mathbb{C}$, the eigenvalues of M lie in the union $D_1 \cup D_2 \cup \ldots\cup D_m$. Each disc Di is defined as

$D_i = \left\{ z \in \mathbb{C} : \left| z - M_{ii} \right| \leqslant \sum_{j \neq i} |M_{ij}| \right\}$ (8)

Schrier [111] analyzed the eigenvalues of Coulomb matrices using the Gershgorin circle theorem and showed that a highly substituted carbon in an alkane has larger off-diagonal values, leading to a higher upper bound for Di and consequently larger eigenvalues. We use the results of the Gershgorin circle theorem to define new descriptors of molecules, as follows.
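The disc centers and radii of equation (8) can be computed directly from a matrix; a minimal sketch, together with a numerical check that every eigenvalue of a symmetric matrix indeed falls within one of its discs:

```python
import numpy as np

def gershgorin_discs(M):
    """Centers and radii of the Gershgorin discs of a square matrix:
    disc i is centered at M_ii with radius sum_{j != i} |M_ij|."""
    M = np.asarray(M, dtype=float)
    centers = np.diag(M)
    radii = np.abs(M).sum(axis=1) - np.abs(centers)
    return centers, radii

# Every eigenvalue of a symmetric matrix lies in the union of its discs
M = np.array([[2.0, 1.0], [1.0, 3.0]])
centers, radii = gershgorin_discs(M)
for e in np.linalg.eigvalsh(M):
    assert np.any(np.abs(e - centers) <= radii + 1e-9)
```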

Each Gershgorin circle disc (Di ) of the Coulomb matrix (M) is used to define a normal distribution function fi

$f_i(x) = \phi\left(x; \mu_i, \sigma_i\right)$ (9)

where µ and σ are the mean and the standard deviation of the normal distribution, and $\phi(\cdot)$ is the probability density function (PDF) defined as

$\phi(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ (10)

The mean of the normal distribution is defined with the value of the center of each Gershgorin circle disc, Mii . The standard deviation of the distribution is determined by the radius of each Gershgorin circle disc, $(\sum_{j\neq i} |M_{ij}|)^\tau$. Each molecule is then described by $f_\textrm{molecule}$, defined as the sum of all fi

$f_\textrm{molecule}(x) = \sum_{i} d_i\, f_i(x)$ (11)

where di and τ are hyperparameters that adjust the weight and the standard deviation of fi , respectively. In this work, we fix both di and τ to 1.

We further define an atomic reference PDF ($f_\textrm{atom}$),

$f_\textrm{atom}(x) = d_\textrm{atom}\,\phi\left(x; M_\textrm{atom}, \sigma_\textrm{atom}\right)$ (12)

where $M_\textrm{atom} = 0.5 Z_\textrm{atom}^{2.4}$, and $d_\textrm{atom}$ is a weight constant that is set to 10. By calculating the inner products between the atomic reference PDFs (one for each of H, C, N, O and F) and the molecular PDF, we reduce the high-dimensional continuous PDF curve to a five-dimensional descriptor, $\langle f_\textrm{HCNOF}, f_\textrm{molecule} \rangle $. We also use the area under each $f_\textrm{molecule}$, denoted hereafter by $\textrm{AUC}(f_\textrm{molecule})$, as an additional molecular descriptor. The relative role of these descriptors is illustrated in the following section.
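The construction of $f_\textrm{molecule}$ and the six Gershgorin-based features can be sketched numerically as follows. The integration grid, the width $\sigma_\textrm{atom}$ of the atomic reference PDFs and the function names are illustrative assumptions not specified in the text; the sketch also assumes every disc has a nonzero radius.

```python
import numpy as np

def gauss(x, mu, sigma):
    # Normal probability density function, equation (10)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def f_molecule(x, M, d=1.0, tau=1.0):
    """Equation (11): one normal PDF per Gershgorin disc of the
    Coulomb matrix M, with mean M_ii, standard deviation
    (sum_{j != i} |M_ij|)**tau and weight d."""
    M = np.asarray(M, dtype=float)
    f = np.zeros_like(x)
    for i in range(len(M)):
        radius = np.abs(M[i]).sum() - abs(M[i, i])  # assumed > 0
        f += d * gauss(x, M[i, i], radius ** tau)
    return f

def gc_descriptor(M, x=None, sigma_atom=1.0):
    """AUC of f_molecule plus its inner products with reference
    PDFs for H, C, N, O, F (sigma_atom is an assumed width)."""
    if x is None:
        x = np.linspace(0.0, 400.0, 4001)
    dx = x[1] - x[0]
    fm = f_molecule(x, M)
    auc = fm.sum() * dx                       # area under the curve
    inner = [(10.0 * gauss(x, 0.5 * Z ** 2.4, sigma_atom) * fm).sum() * dx
             for Z in (1, 6, 7, 8, 9)]        # H, C, N, O, F references
    return np.array([auc] + inner)
```

For a two-disc matrix with unit weights, the area under $f_\textrm{molecule}$ approaches 2, since each normal PDF integrates to one.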

Figure 3 shows $f_\textrm{molecule}$ of two $\mathrm{C}_{7}\mathrm{H}_{10}\mathrm{O}_{2}$ constitutional isomers, where the blue curve is an isomer with higher entropy and the orange curve is that with lower entropy. The variances (σ2) of fi are $\sum_{j\neq i} |M_{ij}|$. The atomic reference curves are shown by black dotted lines.

Figure 3.

Figure 3. Left: molecular probability density function for two isomers of $\textrm{C}_{7}\mathrm{H}_{10}\mathrm{O}_{2}$ based on the Gershgorin circle theorem. The black dotted curves show the atomic reference curves (HCNOF). Right: molecular probability density function for an isomer of $\textrm{C}_{7}\mathrm{H}_{10}\mathrm{O}_{2}$ and the reference molecule based on the Gershgorin circle theorem. For all $f_{\textrm{C}_{7}\mathrm{H}_{10}\mathrm{O}_{2}}$ and $f_\textrm{ref}$, $d_i = M_{ii}$ and τ = 1.


4.4. ML method: Gaussian process regression with variable kernels

Gaussian process regression (GPR) [112] is a supervised ML method that models an unknown target function, f, given input data X and output data y. The training dataset comprises $ \mathcal{D} = \{(x_i,f_i), i \in [1,N_\textrm{train}]\}$, where $f_i = f(x_i)$ is the noise-free observation of the function evaluated at xi . GPR infers a distribution over functions, $p(f| \mathcal{D})$. In the context of interpolation of a property (e.g. entropy) across chemical space, a probability distribution of all possible entropy values for a molecule is inferred from the training dataset $\mathcal{D}$. The inputs xi are the molecular descriptors and fi is the corresponding value of molecular entropy. GPR defines a joint distribution to predict $f_*$ at the unseen inputs $(X_*)$ as

$\begin{bmatrix} f \\ f_{*} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu \\ \mu_{*} \end{bmatrix}, \begin{bmatrix} K & K_{*} \\ K_{*}^{\top} & K_{**} \end{bmatrix}\right)$ (13)

where $\mu = \left [ m(x_1), \ldots, m(x_{N_\textrm{train}}) \right ]$, with $m(\cdot)$ representing the mean of a Gaussian distribution, and $K = \kappa(X,X)$, $K_* = \kappa(X,X_*)$, $K_{**} = \kappa(X_*,X_*)$ determine the covariance matrix, defined by the kernel function κ. The efficiency of the GP model is determined by the functional form of the kernel function.
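From the joint distribution of equation (13), the posterior mean and variance at the test inputs follow by standard Gaussian conditioning. A minimal sketch with a zero prior mean and an RBF kernel; the small `noise` jitter is a numerical-stability assumption, not part of the formal model.

```python
import numpy as np

def rbf_kernel(A, B, l=1.0):
    # Squared-exponential kernel matrix between row-vector inputs
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / l ** 2)

def gp_predict(X, y, X_star, kernel=rbf_kernel, noise=1e-10):
    """Posterior mean and variance at X_star implied by the joint
    Gaussian of equation (13), assuming a zero prior mean."""
    K = kernel(X, X) + noise * np.eye(len(X))   # jitter for stability
    K_s = kernel(X, X_star)
    K_ss = kernel(X_star, X_star)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                        # posterior mean
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss - v.T @ v)               # posterior variance
    return mean, var
```

At the training inputs the posterior mean reproduces the observations and the posterior variance collapses, which is the noise-free interpolation regime used in this work.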

In the present work, we follow [17, 18] to construct kernel functions with variable complexity by forming linear combinations and products of the following basis kernel functions:

$\kappa(x, x^{\prime}) = \sigma_0^2 + x \cdot x^{\prime}$ (14)

where σ0 is a parameter which controls the inhomogeneity of the kernel,

$$\kappa_\textrm{RBF}(x_i, x_j) = \exp\left( - \frac{d(x_i, x_j)^2}{2 l^2} \right) \qquad (15)$$

where l is the length scale and $d(\cdot, \cdot)$ is the Euclidean distance between two data points,

$$\kappa_\textrm{RQ}(x_i, x_j) = \left( 1 + \frac{d(x_i, x_j)^2}{2 \alpha l^2} \right)^{-\alpha} \qquad (16)$$

where α is the scale mixture parameter and l is the length scale of the kernel,

$$\kappa_\textrm{MAT}(x_i, x_j) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, d(x_i, x_j)}{l} \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}\, d(x_i, x_j)}{l} \right) \qquad (17)$$

where ν and l are positive parameters, $K_{\nu}(\cdot)$ is a modified Bessel function and $\Gamma(\cdot)$ is the gamma function, as well as

$$\kappa_\textrm{PER}(x_i, x_j) = \exp\left( - \frac{2 \sin^2\!\left( \pi\, d(x_i, x_j)/p \right)}{l^2} \right) \qquad (18)$$

where l is the length scale of the kernel, and p is the periodicity of the kernel.
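Since the models in this work are built with the Sklearn GP module, the five base kernels of equations (14)–(18) correspond directly to scikit-learn kernel classes. The sketch below, with placeholder hyperparameter values, also shows how sums and products of kernel objects form composite kernels such as the RQ + MAT × DP combination that appears later in the paper.

```python
import numpy as np
from sklearn.gaussian_process.kernels import (
    DotProduct, RBF, RationalQuadratic, Matern, ExpSineSquared)

# Base kernels of equations (14)-(18); hyperparameter values are placeholders
base_kernels = {
    "DP":  DotProduct(sigma_0=1.0),                            # eq. (14)
    "RBF": RBF(length_scale=1.0),                              # eq. (15)
    "RQ":  RationalQuadratic(length_scale=1.0, alpha=1.0),     # eq. (16)
    "MAT": Matern(length_scale=1.0, nu=1.5),                   # eq. (17)
    "PER": ExpSineSquared(length_scale=1.0, periodicity=1.0),  # eq. (18)
}

# Sums and products of kernel objects are themselves valid kernels
combined = base_kernels["RQ"] + base_kernels["MAT"] * base_kernels["DP"]

X = np.random.RandomState(0).rand(4, 3)  # 4 molecules, 3 features (toy data)
K = combined(X)                          # 4 x 4 covariance matrix
```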

We optimize the functional form of the kernel function using a greedy search algorithm that combines simple functions into linear combinations and products, as described in [17, 18] and used for various applications in [103, 108, 109, 113–115]. The kernel construction algorithm is illustrated in figure 4.

Figure 4. Schematic diagram of the kernel construction algorithm used in this work to increase the complexity of the kernel function for optimal kernel models.

All possible linear combinations and products of the base kernels defined in equations (14)–(18) are created to yield more complex kernels. The model is selected by the Bayesian information criterion (BIC), calculated as

$$\textrm{BIC}(\kappa_i) = \log P(y | \kappa_i) - \frac{1}{2} |\kappa_i| \log N_\textrm{train} \qquad (19)$$

where $|\kappa_i|$ is the number of parameters of kernel κi , and $\log P(y|{\kappa}_i) $ is the maximum of log-likelihood marginalized over the parameters of the functions in the Gaussian process, but not the kernel hyperparameters. This is the same function that is maximized to train a Gaussian process.
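One level of the BIC-guided greedy search can be sketched as follows, assuming scikit-learn's `GaussianProcessRegressor` (whose `log_marginal_likelihood_value_` supplies $\log P(y|\kappa_i)$) and counting kernel hyperparameters via `kernel_.theta`. The candidate kernels and toy data below are placeholders, not the descriptors of this work.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, Matern

def bic(gpr, n_train):
    """Equation (19): log marginal likelihood penalized by kernel complexity."""
    n_params = gpr.kernel_.theta.size          # |kappa_i|
    return gpr.log_marginal_likelihood_value_ - 0.5 * n_params * np.log(n_train)

def greedy_step(X, y, current, candidates):
    """One level of the greedy search: extend `current` by each candidate
    as a sum or a product, and keep the extension with the highest BIC."""
    best, best_bic = None, -np.inf
    for cand in candidates:
        for combo in (current + cand, current * cand):
            gpr = GaussianProcessRegressor(kernel=combo, normalize_y=True)
            gpr.fit(X, y)
            score = bic(gpr, len(X))
            if score > best_bic:
                best, best_bic = combo, score
    return best, best_bic

rng = np.random.RandomState(0)
X = rng.rand(30, 2)                       # toy descriptors
y = np.sin(X[:, 0]) + X[:, 1] ** 2        # toy target
kernel, score = greedy_step(X, y, RBF(), [RationalQuadratic(), Matern()])
```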

5. Results

5.1. Analysis of different low-dimensional descriptors

In order to analyze the relative performance of the different molecular descriptors, we begin by building GP models of entropy with various combinations of the molecular descriptors described in the previous section. Figure 5 compares the learning curves of the GP models thus obtained. As shown in figure 5, the best performing GP model is obtained with the nine-dimensional CM-GC descriptor, which includes three descriptors stemming from the Coulomb matrix, denoted CM(εmax, $\mu(\epsilon\gt1)$, $\sigma(\epsilon\gt1)$), the area under the curve $\textrm{AUC}(f_\textrm{molecule})$, and five inner products $\langle f_\textrm{HCNOF}, f_\textrm{molecule} \rangle$. The figure shows that models trained with the CM-GC descriptor yield the most accurate interpolation, reducing the MAE to below 1 $\mathrm{kcal~mol^{-1}}$ with fewer than 100 training molecules. This figure illustrates that physical descriptors derived from combinations of CM-based quantities and properties based on the Gershgorin circle theorem improve the accuracy of the ensuing models much more significantly than descriptors based on graph neural networks.

Figure 5. Test MAE of GP regression of entropy trained with different combinations of molecular descriptors: circles—3D CM(εmax, $\mu(\epsilon\gt1)$, $\sigma(\epsilon\gt1)$) with $\mathrm{RQ+RQ}$ kernel; filled diamonds—CM + MG-GCN-PC (k = 5) with $\mathrm{RQ+RQ}$ kernel; empty diamonds—CM + MG-GCN-PC (k = 9) with $\mathrm{RQ+RQ}$ kernel; filled thin-diamonds—CM + FC-GCN-PC (k = 5) with $\mathrm{RQ+RQ}$ kernel; empty thin-diamonds—CM + FC-GCN-PC (k = 9) with $\mathrm{RQ+RQ}$ kernel; up-triangles—the 4D $\mathrm{CM}+\mathrm{AUC}[f_\textrm{molecule}]$ with $\mathrm{RQ + RQ + MAT}$ kernel; down-triangles—the 9D $\mathrm{CM}+\mathrm{AUC}[f_\textrm{molecule}]+\langle f_\textrm{HCNOF}, f_\textrm{molecule} \rangle$ descriptor trained with $\mathrm{RQ + MAT \times DP}$ kernel. The kernel of each model is optimized as described in the text with BIC as the model selection metric. The lines represent the average over ten sets of training using different distributions of molecules in the training set, and the shaded area is the standard deviation of the results.

To understand the role of the descriptors contributing to the best performing combination in figure 5, we examine correlations of molecular entropy with the largest eigenvalue of the Coulomb matrix and the standard deviation σ of the distribution of the eigenvalues ε > 1, both shown in figure 6. While neither εmax nor σ is a unique feature for entropy, the results in figure 6 illustrate that there exist strong correlations between entropy and both εmax and σ. These correlations can be rationalized using the Gershgorin circle theorem.

Figure 6. Entropy as a function of the largest eigenvalue (εmax) of the Coulomb matrix (left panel) and the standard deviation (σ) of the Coulomb matrix eigenvalues greater than 1 (right panel) for constitutional isomers with the formula C7H10O2.

The Gershgorin circle theorem identifies a region in the complex plane that contains all the eigenvalues of a complex square matrix [32, 110]. The theorem states that for an m×m matrix M with entries in the complex plane $\mathbb{C}$, the eigenvalues of M lie in the union $D_1 \cup D_2 \cup \ldots \cup D_m$, where each disc Di is defined in equation (8). Schrier [111] analyzed the eigenvalues of Coulomb matrices using the Gershgorin circle theorem and showed that a highly substituted carbon in an alkane has larger off-diagonal values, leading to a higher upper bound for Di and consequently higher eigenvalues.
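For a real symmetric matrix such as the Coulomb matrix, the Gershgorin discs reduce to intervals on the real axis, and the theorem's bound is easy to verify numerically. A small NumPy sketch with an arbitrary symmetric test matrix (not a molecular Coulomb matrix):

```python
import numpy as np

def gershgorin_range(M):
    """Union of the Gershgorin discs of a real symmetric matrix, as the
    real interval [min_i (M_ii - R_i), max_i (M_ii + R_i)] with disc
    radii R_i = sum_{j != i} |M_ij|."""
    radii = np.abs(M).sum(axis=1) - np.abs(np.diag(M))
    centers = np.diag(M)
    return (centers - radii).min(), (centers + radii).max()

# Arbitrary symmetric test matrix
M = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
lo, hi = gershgorin_range(M)
eigs = np.linalg.eigvalsh(M)
# The theorem guarantees that every eigenvalue lies inside [lo, hi]
assert lo <= eigs.min() and eigs.max() <= hi
```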

To elucidate the connection with entropy, consider two constitutional isomers with molecular formula C7H10O2 that have drastically different entropies. The molecules with the highest and lowest entropy in the C7H10O2 isomeric cluster are 1-hydroxyhept-5-yn-3-one (figure 7(A)) and 2,4-dioxatricyclo[$4.3.0.0^{3,8}$]nonane (figure 7(B)), respectively. For molecule A, the entropy is $1.130\times 10^{-1}$ kcal mol−1 K−1, $\epsilon_\textrm{max} = 175.437$, $\sigma(\epsilon\gt1) = 51.800$ and $\mu(\epsilon\gt1) = 45.987$. For molecule B, the entropy is $7.668\times 10^{-2}$ kcal mol−1 K−1 (lower), $\epsilon_\textrm{max} = 212.320$ (higher) and $\sigma(\epsilon\gt1) = 61.221$ (higher), while $\mu(\epsilon\gt1) = 45.980$ is similar to that of molecule A. Molecule B is more rigid, which leads to fewer microstates and lower entropy.

Figure 7. Visual demonstration of the Gershgorin circle theorem on the Coulomb matrices of two C7H10O2 constitutional isomers 1-hydroxyhept-5-yn-3-one (A, top) and 2,4-dioxatricyclo[$4.3.0.0^{3,8}$]nonane (B, bottom), where molecule A has the higher entropy. The entropy values are in units of kcal mol−1 K−1. The Gershgorin discs of each isomer's Coulomb matrix are drawn as orange circles. The blue triangles are the eigenvalues of the Coulomb matrices. The red arrows emphasize the CM-EV1 of each isomer's Coulomb matrix.

The more rigid molecule, 2,4-dioxatricyclo[$4.3.0.0^{3,8}$]nonane (figure 7(B)), has a lower entropy. This is a manifestation of the fact that entropy is determined by the shape and rigidity of the molecule [116]. Rigid molecules tend to be restricted in motion, which leads to lower numbers of available microstates and lower entropy. Thus, a good representation of molecular rigidity should be a good descriptor for predicting entropy. Given this premise, it is useful to analyze how the CM(εmax, $\mu(\epsilon\gt1)$, $\sigma(\epsilon\gt1)$) descriptor reflects the rigidity of molecules. The diagonal entries of the CM for the molecules considered here are $M_{ii,\textrm{H}} = 0.5$, $M_{ii,\textrm{C}} = 36.858$ and $M_{ii,\textrm{O}} = 73.517$, corresponding to hydrogen, carbon and oxygen. The off-diagonal elements Mij are directly proportional to the atomic numbers Zi and Zj , and inversely proportional to the interatomic separation Rij . This means that the Mij value is larger for bonded pairs of heavy atoms. Table 1 lists some examples of Mij values, with the highest Mij value, for the C=O double bond pair, highlighted. For rigid molecules, the atoms are spatially crowded, which leads to higher Mij values on average. We observe that the Gershgorin circle theorem yields higher values of $|A_{ij}|$ in equation (8) for rigid molecules on average, leading to larger values of $\sum_{j\neq i}|A_{ij}|$. Thus, the range of eigenvalues defined by the union of Di in the circle theorem is wider for rigid molecules. The opposite is true for linear and less rigid molecules.

Table 1. Off-diagonal entries (Mij ) of the Coulomb matrix for several bonded atoms.

Atom 1   Atom 2   $Z_1 Z_2$   R (Å)                      $M_{ij}$
H        H        1           0.74                       1.351
C        H        6           1.09                       5.505
O        H        8           0.971                      8.239
C        C        36          1.535 (single)             23.453
                              1.339 (double)             26.886
                              1.203 (triple)             29.925
C        O        48          1.431 (single, ethanol)    33.543
                              1.219 (double, ketone)     39.377 (highest)
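The table entries can be reproduced from the standard Coulomb-matrix definition, $M_{ii} = 0.5\,Z_i^{2.4}$ and $M_{ij} = Z_i Z_j / R_{ij}$. A minimal sketch, with $R_{ij}$ taken in Å (the convention implied by the values in table 1):

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Z: (n,) atomic numbers; R: (n, 3) Cartesian coordinates in angstrom."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4            # diagonal entries
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

# Two-atom check against the text and table 1: a C-H pair at 1.09 angstrom
M = coulomb_matrix([6, 1], [[0.0, 0.0, 0.0], [1.09, 0.0, 0.0]])
# M[0, 0] ~ 36.858 (carbon), M[1, 1] = 0.5 (hydrogen), M[0, 1] ~ 5.505 (C-H)
```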

Figure 7 visualizes the results of the Gershgorin circle theorem. The y- and x-axes are the imaginary and real axes of the complex plane containing the eigenvalues. The orange discs are Di defined by equation (8). Figure 7 shows that the union of discs for the more rigid molecule B is $D_1 \cup \ldots\cup D_{19} = \{-152.77 \unicode{x2A7D} z \unicode{x2A7D} 292.16\}$, which covers a wider range than that for molecule A, $D_1 \cup \ldots\cup D_{19} = \{-128.78 \unicode{x2A7D} z \unicode{x2A7D} 248.76\}$. Since the range of eigenvalues is wider for the more rigid molecule, it has a higher probability of having a larger εmax value and a larger $\sigma(\epsilon\gt1)$. Since molecules from the same constitutional isomer cluster have the same atomic composition, the distributions of the disc centers $D_i$, $i = 1, \ldots, m$, are identical. This explains the similarity of $\mu(\epsilon\gt1)$ for molecules from the same cluster observed in our analysis (not shown) of the dataset.

To illustrate the physical significance of the descriptors based on the PDF derived from the Gershgorin circle theorem as defined by equation (12), we examine in figure 8 the correlation of entropy with each of the individual overlaps $\langle f_{i}, f_\textrm{molecule} \rangle$ and with $\textrm{AUC}(f_\textrm{molecule}(x| \mu = M_{ii}, \sigma^2 = (\sum_{j\neq i}|M_{ij}|)^2))$. Figure 8 shows that entropy is particularly well correlated with $\langle f_\textrm{H}, f_\textrm{molecule} \rangle$ and $\langle f_\textrm{C}, f_\textrm{molecule} \rangle$ and strongly correlated with $\textrm{AUC}(f_\textrm{molecule}(x| \mu = M_{ii}, \sigma^2 = (\sum_{j\neq i}|M_{ij}|)^2))$.
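A rough sketch of these PDF-based descriptors, assuming (as the text suggests) that $f_\textrm{molecule}$ is a normalized sum of one Gaussian per Coulomb-matrix row, with mean $M_{ii}$ and standard deviation equal to the Gershgorin radius $\sum_{j\neq i}|M_{ij}|$. The grid, the normalization, the discretized inner product and the hydrogen-like reference curve are illustrative assumptions, not the paper's exact equation (12).

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def molecular_pdf(M, x):
    """Sum of one Gaussian per Coulomb-matrix row, mean M_ii and standard
    deviation sum_{j != i} |M_ij| (the Gershgorin disc radius)."""
    mu = np.diag(M)
    sigma = np.abs(M).sum(axis=1) - np.abs(mu)
    f = np.sum([gaussian(x, m, s) for m, s in zip(mu, sigma)], axis=0)
    dx = x[1] - x[0]
    return f / (f.sum() * dx)                       # normalize to unit area

x = np.linspace(-200.0, 400.0, 4000)
dx = x[1] - x[0]
M = np.array([[36.858, 5.505],
              [5.505, 0.5]])                        # C-H pair from table 1
f_mol = molecular_pdf(M, x)

# Descriptor ingredients: area under the curve and an overlap with an
# atomic reference curve (here a hydrogen-like Gaussian, an assumption)
auc = f_mol.sum() * dx
f_H = gaussian(x, 0.5, 1.0)
overlap = (f_H * f_mol).sum() * dx
```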

Figure 8. Correlation of entropy with AUC (upper left panel) and $\langle f_\textrm{atom}, f_\textrm{molecule} \rangle$ for constitutional isomers with the chemical formula $\mathrm{C}_{7}\mathrm{H}_{10}\mathrm{O}_{2}$.

5.2. Data-efficient models of entropy

All results in the present work show entropy multiplied by temperature at T = 298.15 K in the units of kcal mol−1. Our goal is to build models with prediction accuracy below 1 kcal mol−1 using as few molecules on input as possible. We aim to achieve this by exploiting the most performant descriptors identified in the previous section in combination with kernel models based on optimized kernels. When the dimensionality of descriptors is low, it is possible to build anisotropic kernels, optimized for each dimension of the input space separately. In the present work, we have found that including the anisotropy into the kernels for the best low-dimensional descriptors of molecules does not have any significant effect on the accuracy of the ensuing models.

We use GP regression to model entropy. The kernel functions of the GP models are chosen and optimized using greedy search in the space of kernel functions guided by BIC as discussed in the ML method section. This aligns kernels with available data as much as possible, given the constraints of the greedy search, the descriptors and the uncertainties stemming from the extremely limited size of the datasets considered here. To reduce bias in the calculation of BIC used for the greedy search of the optimal kernel function, 1000 training data points are chosen randomly from the pool of 85 000 data points reserved for training. Each of the randomly chosen training datasets is passed through multiple layers of the BIC kernel selection algorithm. At each level of kernel selection, we calculate the average value of BIC scores, µ(BIC), for the same kernel trained with ten different sets of randomly selected data points. The base or complex kernel with the highest average BIC score is retained for the next level of kernel function selection.

The effect of the kernel function on the accuracy of the resulting models of entropy is shown in figure 9, which compares the µ(BIC) values of base and complex kernels. Although the $\mathrm{RQ} + \mathrm{Matern} \times \mathrm{DP}$ kernel has the highest BIC score, it is only slightly higher than that of the $\mathrm{RQ}+\mathrm{Matern}$ kernel. The $\mathrm{RQ}+\mathrm{Matern}$ kernel is therefore chosen as the optimal kernel for the best model of entropy in this work due to its high BIC score and lower functional complexity. The kernel selection algorithm stops at level four because the BIC scores plateau and then decrease with further increases of kernel complexity.

Figure 9. Left: BIC values for models with different kernels. Right: Learning curves of GP models trained with the same 9D descriptor ($\mathrm{CM}+\mathrm{AUC}[f_\textrm{molecule}]+\langle f_\textrm{HCNOF}, f_\textrm{molecule} \rangle$), but with the different kernels illustrated in the left panel. The shaded area is the standard deviation computed with ten models trained with randomly sampled data points. For the lines not showing the shaded area, the standard deviation is too small to be visible on this scale.

5.3. Zero-point vibrational energy

The preceding sections illustrate that very efficient and accurate models of entropy can be built with 9D descriptors ($\mathrm{CM}+\mathrm{AUC}[f_\textrm{molecule}]+\langle f_\textrm{HCNOF}, f_\textrm{molecule} \rangle$). It is instructive to explore the performance of these descriptors in models of other molecular properties. Here, we consider zero point vibrational energy (ZPVE) of molecules. Figure 10 illustrates that ZPVE is strongly correlated with $\textrm{AUC}(f_\textrm{molecule})$ and $\langle f_\textrm{ref}, f_\textrm{molecule} \rangle$.

Figure 10. Zero point vibrational energy as a function of $\textrm{AUC}(f_\textrm{molecule}(x| \mu = M_{ii}, \sigma^2 = (\sum_{j\neq i}|M_{ij}|)^2 )$ and $\langle f_\textrm{ref}, f_\textrm{molecule} \rangle$ for all molecules in the QM9 dataset (left panels); and the molecules from ten largest isomer groups in the QM9 dataset (right panels). The reference molecule is $\textrm{H}_{3}\mathrm{C}_{4}\mathrm{N}_{2}\mathrm{OF}$ illustrated in figure 3.

Figure 11 shows the interpolation MAE for ZPVE, comparing the performance of the present 9D models with the models developed in [105]. The results in [105] use high-dimensional descriptors: the full CM with 435 features (orange circles in figure 11), BOB with 1128 features (green triangles), SLATM with 11 960 features (red squares), FCHL with 10 440 features (purple down triangles), and MBDF with 145 features (brown diamonds). These high-dimensional descriptors were used in [105] to compute simple isotropic kernels for kernel ridge regression models of ZPVE.

Figure 11. Comparison of the present results (squares) of ZPVE with the results of Lilienfeld et al (2023) [105]. The present results are obtained with the 9D molecular descriptors $\mathrm{CM} + \mathrm{AUC}[f_\textrm{molecule}] + \langle f_\textrm{HCNOF}, f_\textrm{molecule}\rangle$ and the GP model with the optimized $\textrm{RQ} + \mathrm{MAT} \times \mathrm{DP}$ kernel. The results of [105] are based on kernel ridge regression with the following descriptors: the full CM with 435 features (orange circles), BOB with 1128 features (green triangles), SLATM with 11 960 features (red squares), FCHL with 10 440 features (purple down triangles), and MBDF with 145 features (brown diamonds).

Our results shown in figure 11 by blue diamonds use nine-dimensional descriptors with optimized, complex kernels given by the $\textrm{RQ + MAT} \times \mathrm{DP}$ combination of base kernels. These kernels are used for GP regression models of ZPVE. The results illustrate that the 9D descriptors proposed here produce ZPVE interpolation models with the prediction MAE under 1 kcal mol−1, when trained by as few as 100 molecules.

While it is common to quantify model performance by average errors, such metrics do not fully illustrate the quality of the models. A combination of average errors and largest errors is a better metric, which, however, can be significantly affected by a small number of outliers. To quantify unambiguously the performance of the best models of entropy and ZPVE developed in the present work, we show in figure 12 the distribution of prediction errors over a hold-out test set containing 20 000 molecules of different sizes. The shaded bars illustrate the performance of models trained with 100 molecules and the open bars that of models trained with 1000 molecules. We note that the range of S×T for the molecules considered here covers 26 kcal mol−1, while the range of ZPVE covers 160 kcal mol−1. In both cases, we observe that our models trained with 100 molecules are able to predict S×T and ZPVE with absolute error under 1 kcal mol−1 for ${\gt}78$% of the test data and with error under 1.3 kcal mol−1 for ${\gt}92.4$% of the test data. The validation and test errors averaged over the entire relevant part of the dataset are summarized in table 2.
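The per-threshold error fractions quoted above are straightforward to compute from a vector of test-set errors. A sketch with synthetic data (the error magnitudes below are illustrative, not the paper's results):

```python
import numpy as np

def error_fractions(y_true, y_pred, thresholds=(1.0, 1.3)):
    """Fraction of test molecules with absolute prediction error under each
    threshold (in kcal/mol), the metric reported with figure 12."""
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return {t: float((err < t).mean()) for t in thresholds}

# Toy illustration with synthetic targets and predictions
rng = np.random.RandomState(1)
y_true = rng.rand(1000) * 36.0                       # synthetic S x T values
y_pred = y_true + rng.normal(scale=0.6, size=1000)   # synthetic predictions
fracs = error_fractions(y_true, y_pred)
```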

Figure 12. Distribution of errors for predictions of entropy (upper) and ZPVE (lower) with nine-dimensional descriptors of molecules identified as the best descriptors in figure 5. Full bars—predictions of models trained with 100 molecules; open bars—models trained with 1000 molecules.

Table 2. Validation and test errors of GP model predictions for S×T and ZPVE for different numbers of training points ($N_\textrm{train}$) are presented. The RMSEs and MAEs in the table represent the average over ten randomly sampled training sets.

             Property   $N_\textrm{train}$   RMSE (kcal mol−1)          MAE (kcal mol−1)
Test         S×T        100                  1.100                      $6.171 \times 10^{-1}$
Validation   S×T        100                  $9.131 \times 10^{-1}$     $5.870 \times 10^{-1}$
Test         ZPVE       100                  $9.602 \times 10^{-1}$     $6.647 \times 10^{-1}$
Validation   ZPVE       100                  $9.657 \times 10^{-1}$     $6.895 \times 10^{-1}$
Test         S×T        1000                 $6.545 \times 10^{-1}$     $4.939 \times 10^{-1}$
Validation   S×T        1000                 $6.522 \times 10^{-1}$     $4.9143 \times 10^{-1}$
Test         ZPVE       1000                 $7.511 \times 10^{-1}$     $6.003 \times 10^{-1}$
Validation   ZPVE       1000                 $7.469 \times 10^{-1}$     $5.949 \times 10^{-1}$

6. Summary

The goal of the present work is to build efficient, data-starved models of molecular properties for interpolation in chemical compound spaces comprising molecules of variable complexity with low-dimensional molecular descriptors. The dataset considered here comprises 133 885 molecules with complexity ranging from 3 atoms to 29 atoms. We consider two molecular properties: entropy and zero point vibrational energy. We begin by designing low-dimensional descriptors tailored for the prediction of entropy. Our results show that entropy can be efficiently interpolated by nine-dimensional models with molecular descriptors comprising three physical parameters describing the distributions of eigenvalues of Coulomb matrices and six parameters accounting for the composition and shape of molecules through quantities based on the Gershgorin circle theorem.

In order to build efficient interpolation models of entropy, we combine the nine-dimensional descriptors thus obtained with Gaussian process regression based on kernels with variable functional form. The kernel functions are optimized using a greedy search algorithm guided by the Bayesian information criterion (BIC). In order to obtain meaningful, unbiased models based on extremely restricted datasets, we use BIC averaged over multiple randomly sampled instances of training data as a model selection metric. Nine-dimensional Gaussian process models thus optimized are capable of predicting molecular entropy across the entire dataset with the average error under 1 kcal mol−1, using as few as 100 molecules on input. We present a careful analysis of model prediction errors, demonstrating that entropy models trained with 100 molecules predict S×T with chemical accuracy under 1 kcal mol−1 for ${\gt}80$% of molecules in the test set, comprising 20 000 molecular species.

The nine-dimensional descriptors based on the distributions of the Coulomb matrix eigenvalues and the Gershgorin circle theorem are then shown to perform well for the prediction of a completely different molecular property—ZPVE. The error analysis illustrates that the Gaussian process models of ZPVE trained with 100 molecules and using optimized kernels predict ZPVE with absolute error under 1 kcal mol−1 for $\gt 80$ % of the test set and with error under 1.3 kcal mol−1 for ${\gt}90$% of molecules in the test set. This is remarkable, given that ZPVE of molecules in the test set covers the range from 10 to 171 kcal mol−1.

There are three main implications of the results presented here. First, because the training sets of the demonstrated models are very small, accurate predictions of molecular properties for a wide range of molecules can be based on either experimental data or extremely accurate quantum chemistry calculations. Second, the possibility of using low-dimensional descriptors for accurate predictions with restricted datasets opens up (1) building models with anisotropic kernels that account for the different scales of different features; and (2) Bayesian optimization in chemical compound space, which can be used to predict and design molecules with desired properties. We will explore both of these possibilities in future work.

Acknowledgment

This work was supported by NSERC of Canada.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).

Code Availability

The coding language used for this work is Python. We used the Sklearn package for Gaussian process regression. Both GNN and BIC-based codes are original and use common Python packages for scientific computing, including Sklearn, Numpy, RDkit, torch and Scipy. All the calculations were performed using the open-source package MolDes [117]. The code will be available through the GitHub repository for MolDes-GCT at https://github.com/dmao1020/MolDes-GCT.git.
