Abstract
Graph-based machine learning (ML) models for material properties show great potential to accelerate virtual high-throughput screening of large chemical spaces. However, in their simplest forms, graph-based models do not include any 3D information and are unable to distinguish stereoisomers such as those arising from different orderings of ligands around a metal center in coordination complexes. In this work we present a modification to revised autocorrelation descriptors, a molecular graph featurization method, for predicting spin state dependent properties of octahedral transition metal complexes (TMCs). Inspired by analytical semi-empirical models for TMCs, the new modeling strategy is based on the many-body expansion (MBE) and allows one to tune the captured stereoisomer information by changing the truncation order of the MBE. We present the necessary modifications to include this approach in two commonly used ML methods, kernel ridge regression and feed-forward neural networks. On a test set composed of all possible isomers of binary TMCs, the best MBE models achieve mean absolute errors (MAEs) of 2.75 kcal mol−1 on spin-splitting energies and 0.26 eV on frontier orbital energy gaps, a 30%–40% reduction in error compared to models based on our previous approach. We also observe improved generalization to previously unseen ligands where the best-performing models exhibit MAEs of 4.00 kcal mol−1 (i.e. a 0.73 kcal mol−1 reduction) on the spin-splitting energies and 0.53 eV (i.e. a 0.10 eV reduction) on the frontier orbital energy gaps. Because the new approach incorporates insights from electronic structure theory, such as ligand additivity relationships, these models exhibit systematic generalization from homoleptic to heteroleptic complexes, allowing for efficient screening of TMC search spaces.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
Virtual high-throughput screening (VHTS) has emerged as a valuable tool for computational discovery in many fields of material science, such as the design of catalysts [1–5], metal–organic frameworks [6–10], and components for energy storage [11–15]. One particularly interesting class of materials for VHTS are transition metal complexes (TMCs) due to their highly tunable properties and large chemical design spaces with potential applications ranging from optoelectronic devices [16–18] to biomedicine [19–22]. The computational method of choice for these studies has long been density functional theory (DFT) due to its accuracy at reasonable cost. However, as researchers are tackling ever larger chemical design spaces, machine learning (ML) surrogate models trained on DFT or higher-accuracy data sets have started to establish themselves as a lower-cost alternative to DFT for VHTS [23–27]. Depending on the application and its requirements regarding accuracy and computational cost, these ML models can be either drop-in replacements for DFT, i.e. mappings from the atomic structure to material properties [28–32], or operate on an abstracted representation of the design space, such as the molecular graph [33–37] or the crystal graph [38–42]. The latter approach allows one to perform VHTS in the abstracted representation space, bypassing the costly steps of building and optimizing a 3D geometry and thereby enabling the screening of millions of candidates in a matter of minutes. Nevertheless, even for a moderately sized set of potential ligands, the combinatorial explosion of possible arrangements into coordination complexes gives rise to a vast design space [43–47]. On the other hand, automated DFT evaluation of open-shell transition metal systems is difficult due to potential convergence issues and the presence of multi-reference character, severely limiting the size of available data sets for ML model training.
The large number of theoretical complexes and relative paucity of studied complexes necessitates a trade-off between expressivity and data efficiency in the design of TMC descriptors for ML. That is, to enable model training on small data sets, the descriptors compress the structural information as much as possible. The process of encoding atomic structures using higher-level representations and featurization of those representations to obtain fixed-size descriptors for ML models necessarily comes with a loss of information. However, this loss of information limits the potential accuracy of the ML models. For the specific use case of TMCs, describing them only by their molecular graph in ML also gives rise to challenges. The central metal ion in a TMC allows for multiple coordination geometries, a fact that is not captured by the molecular graph representation. Even when the design space is limited to a single coordination geometry, certain combinations of ligands allow for multiple stereoisomers, such as the cis/trans and fac/mer isomers of binary octahedral TMCs. For many applications, this stereoisomerism results in significantly different TMC properties. Notable examples include the anticancer drug cisplatin, for which the trans isomer is clinically ineffective [48–51], or stereoisomer-dependent reactivities of TMC catalysts [52–54]. It is, therefore, crucial to encode this minimal amount of 3D information in the featurization of the molecular graph.
We have previously addressed these challenges with revised autocorrelations (RACs) [55], a set of molecular graph descriptors tailored to mononuclear octahedral TMCs. However, in its original implementation, this approach requires an additional step of assigning an equatorial plane of ligands, which results in discontinuities in the feature space for certain ligand compositions and an inability to distinguish stereoisomers in some edge cases. In analytical models for TMC properties such as ligand field parameterization schemes [56, 57], Bursten additivity [58], and ligand field molecular mechanics [59], which were derived from early semi-empirical electronic structure theory approaches [60, 61], these stereoisomer effects are modeled by dividing the metal–ligand interaction into nonadditive and additive ligand field contributions. More modern approaches such as ligand field DFT [62], d-block localized orbital corrected DFT (DBLOC) [63], and ab initio ligand field theory [64–66] combine quantum chemistry calculations with parametrized effective Hamiltonians to address systematic errors in DFT- and wavefunction-based descriptions of TMCs. A unifying mathematical framework for most of these methods is given by the MBE. Inspired by the success of the MBE in molecular modeling [67–72] and, more recently, in ML interatomic potentials [73–77], we present an MBE-based approach for featurizing the molecular graph of octahedral TMCs. Combined with appropriate modifications to the ML models, this method allows one to systematically address the challenges of encoding octahedral stereoisomerism and the accuracy/data-efficiency trade-off by varying the truncation order of the MBE. We derive the necessary expressions to include this approach into two commonly used ML models, kernel ridge regression (KRR) and neural networks (NN), and compare the performance of these new models to our previous approach using spin-splitting energies and frontier orbital energies as target properties.
2. MBE based ML models
2.1. MBE of metal-centric properties
The influence of ligands on a metal-centric property Q of an octahedral TMC can be expanded in a series of n-body interaction terms, qn:
where M encodes the identity, oxidation, and spin state of the central metal ion, and Li refers to the identity of the ligand i. Following recent examples [76–78], we investigate the use of ML to model the individual interaction terms in the MBE. Here, we limit our analysis to two model types corresponding to truncation of the MBE at the two-body and three-body interaction levels.
Truncation at the two-body interaction terms gives rise to the concept of ligand additivity, which has been shown to be useful for the description of many properties of TMCs [47, 56, 79, 80]. However, in the simplest models that are restricted to single-ligand–metal interactions, there is an inability to distinguish different structural isomers for the same ligand composition (e.g. cis versus trans orientations), thereby limiting the applicability of such models. A notable example of models of this complexity is the DBLOC scheme for correcting DFT (e.g. B3LYP) spin-splitting energy predictions with respect to experimental reference values [63]. In DBLOC, the strength of the two-body interaction term depends on the identity of the metal, the oxidation state, and the coordinating atom type of each ligand.
The inclusion of the three-body order yields 15 additional interaction terms for the case of a mononuclear octahedral TMC. Depending on the relative positioning of the two ligands involved, these interactions can be divided into 12 cis-type and 3 trans-type interactions. Neglecting the geometry dependence of the interaction terms beyond cis and trans positioning leads to linear dependence between the set of two-body and three-body interactions (appendix
2.2. Featurization
To simplify comparisons to our previous work, we base the featurization for the machine-learned interaction terms on the RACs featurization for TMCs [55]. RACs expand on the idea of autocorrelations (ACs) [81] of atomic properties on the molecular graph by calculating correlations between restricted subsets of atoms:
where start and scope refer to the two involved subsets of atoms, d is the depth of the descriptor, Pi is an atomic property of atom i, δ is the Kronecker delta, and dij is the path distance between atoms i and j on the molecular graph. The five most commonly computed heuristic properties are nuclear charge, Z, Pauling electronegativity, χ, topology, T, which is the atom's coordination number, identity, I, which is 1 for any atom, and covalent atomic radius, S. In this work, these correlation functions are evaluated for four different combinations of start/scope atom sets and up to a depth of three bonds, i.e. d = 3. The first two start/scope combinations correspond to standard ACs where both start and scope include all atoms (all). One may also compute metal-centered (mc) descriptors where the start set is restricted to just metal atoms while the scope still includes all atoms. Both of these sets of descriptors apply to the whole TMC. For the other two combinations, the scope set is always limited to all atoms of a single ligand (lig), and the start subset includes either all atoms of the ligand or just the connecting atom (lc). In octahedral systems, the descriptors of only the individual ligands and are separately averaged over the four equatorial ligands and the two axial ligands. The use of two distinct subsets of atoms for start and scope also allows the evaluation of different functions on the molecular graph, such as the differences of atomic properties,
which are evaluated for the mc/all and lc/lig start/scope combinations. Separate averages of the ligand charge and denticity over the equatorial and axial ligands complete the set of 155 nontrivial descriptors referred to as standard-RACs in this work.
Due to either the involvement of multiple ligands or their non-local character, most of the start/scope combinations of standard-RACs are incompatible with the concept of a MBE localized at the metal center. Thus, we only use the and sets to featurize the ligand dependence of the many-body interaction terms. As a consequence of this restriction to only ligand-connecting atom-centered descriptors, each connecting atom of a multidentate ligand is treated like a separate monodentate ligand. For closely related ligands such as bipyridine and pyridine, this treatment results in identical descriptors up to a depth of d =2 and a potential source of information loss for the ML models. Nevertheless, we choose not to add an additional feature to encode ligand denticity since, for all ligands in our data set, any ambiguity is resolved by the inclusion of descriptors for depth d = 3. Finally, the ligand charge, or a fraction thereof in the case of multidentate ligands, e.g. −1/2 for a bidentate ligand with charge −1, is added as a feature yielding a vector Li consisting of a total of 32 nontrivial features referred to as ligand-RACs throughout the remainder of this article. For both the standard-RACs and the ligand-RACs, the character of the metal ion is featurized using a vector M containing seven entries corresponding to separate one-hot encodings of the electron configuration (d3, d4, d5, d6, or d7), and the oxidation state (II or III).
3. Data sets
We curated a data set of octahedral TMCs from six of our previous studies on the prediction of electronic structure properties from empirical TMC featurizations [82], graph-based descriptors for inorganic chemistry [55], genetic algorithm optimization of spin crossover complexes [83], ML predictions of DFT geometry optimization outcomes [84], systematic enumeration of monodentate and bidentate ligand space [45], and a comparison of 3d and 4d TMCs [85]. For the present study, we selected complexes composed of four 3d metals in two oxidation states M(II)/M(III), where M = Cr, Mn, Fe, or Co. We further restricted our data set to complexes with computed DFT properties for both high-spin (HS) and low-spin (LS) states, where HS and LS are defined as quintet and singlet for both d4 Mn(III)/Cr(II) and d6 Co(III)/Fe(II), sextet and doublet for d5 Fe(III)/Mn(II), and quartet and doublet for both d3 Cr(III) and d7 Co(II). From this initial data set of 2350 TMCs, we removed 349 data points with positive highest occupied molecular orbital (HOMO) energies, which are common in complexes with a large number of negatively charged ligands. Following the protocol in prior studies [86], structures with spin expectation values deviating by more than 1.0 from the expected values (136 examples) and deviations from the expected octahedral geometry (59 examples) measured by previously established metrics [80, 84, 86] were eliminated (supplementary material tables S1 and S2). The final data set consists of 1806 pairs of HS and LS complexes comprising 107 unique ligands (72 monodentate, 34 bidentate, 1 tetradentate).
We trained ML models to predict seven target properties of these TMCs: the adiabatic spin-splitting energy, ΔEH-L, i.e. the difference in energies between the geometry-optimized HS and LS states, four frontier orbital energies corresponding to the highest occupied and lowest unoccupied molecular orbital (HOMO and LUMO) energies for each of the spin states, and the HOMO–LUMO gaps of each spin state. In open-shell systems these molecular orbital energies can be defined for both the majority α and minority β spin. To resolve this ambiguity, we used a purely energy-based convention for HOMO and LUMO, i.e. the HOMO and LUMO levels are defined as max(HOMOα, HOMOβ) and min(LUMOα, LUMOβ), respectively.
In addition to the data set curated from previous studies, we generated two new test sets to measure the performance of models with respect to their ability to generalize from homoleptic data to heteroleptic complexes and to new ligands. The first set, the composition test set, is composed of all binary combinations and structural isomers for a given composition of three monodentate ligands: methanol, hydrogen cyanide, and hydrogen isocyanide. These ligands were selected because they cover a large range of field strengths in the spectrochemical series. The training set from the data curated from previous studies contains the homoleptic complexes for all investigated metal centers but no heteroleptic complexes built from the three ligands. Combining the eight different metal and oxidation state pairings, the eight compositional and structural isomers of binary complexes, and the three possible ligands pairings yields 192 TMCs, all of which passed the checks for octahedral geometry and spin expectation value. The second set, the ligand test set, is composed of homoleptic complexes of ligands not present in the training data from previous studies. We selected a total of 16 monodentate and 5 bidentate ligands by identifying unique chemical motifs in the 56 neutral, closed-shell ligands from a prior screen of the Cambridge Structural Database [87] for Fe(II) complexes [47] and by selecting ligands from a prior enumeration of novel ligands [45] that were not part of the original DFT-evaluated subset. Combining the 8 different metal and oxidation state pairings with the 21 new ligands yields 168 complexes, 127 of which passed the checks for octahedral geometry and spin expectation value (supplementary material table S3). The purpose of this split test set is to separately measure model performance in the interpolative regime, i.e. using the composition test set, and the extrapolative regime, i.e. using the ligand test set. Both sets are further subdivided according to the identity of the central metal ion and the oxidation state or the identity of the ligands during various stages of the analysis to gain further insights into potential trends or systematic errors of the ML models.
The curated data set of previously studied complexes is split into a training data set and a validation set using a two-step approach. First, complexes containing any ligand belonging to a hand-selected set of 11 ligands representing the most common coordination environments are placed into the validation set (156 complexes, supplementary material text S1). This ensures that the validation set metrics measure not only the performance on different combinations and structural isomers of the training set ligands but also the generalization to previously unseen ligands. In the second step, randomly selected examples are added to the validation set until a final data split of roughly 80/20 is achieved, yielding training and validation set sizes of 1444 and 362 complexes, respectively. All target values are normalized by subtracting their respective training set mean and dividing by the training set standard deviation. The RAC-type features are rescaled to the interval [−1, 1] by dividing each vector component by the largest absolute value observed in the training set. This approach is chosen because it conserves zero values in the feature vectors, which take a special role in RACs descriptors (typically encoding the absence of an atom), and results in a consistent feature range when combined with the one-hot encoded metal center features. The data set and subesequent analysis code can be accessed in a Zenodo [88] or Github [89] repository.
4. Computational details
4.1. Electronic structure calculations
DFT properties were obtained for the newly generated test sets (see section 3) by first building the initial geometries using molSimplify [86, 90] followed by DFT geometry optimizations using a developer version of the TeraChem [91–93] graphical processing unit accelerated quantum chemistry package. We used the B3LYP [94–96] hybrid functional combined with the LANL2DZ effective core potential [97] for iodine atoms and all transition metals and the 6-31G* basis [98, 99] for the remaining atoms (i.e., a composite LACVP* basis set). This combination is chosen for consistency with the curated data set and allows for easier comparison and validation with previous studies conducted by our group. However, we have previously shown a high linear correlation across different density functional approximations (DFAs) for several TMC properties [100]. For example, we observed a Pearson correlation coefficient of 0.99 between spin-splitting energies evaluated with B3LYP and the meta-GGA hybrid functional TPSSh [101]. Similarly, B3LYP spin-splitting energies evaluated using the LACVP* basis set and the larger all-electron def2-TZVP basis set showed a correlation coefficient of 0.98. This suggests that the presented findings should be independent of the specific choice of DFA and basis set. Singlet spin states were evaluated with the spin-restricted formalism, while all other calculations were performed in the spin-unrestricted formalism. To aid convergence, level shifting [102, 103] of 0.25 hartree on both majority- and minority-spin virtual orbitals were employed for all calculations. Geometry optimizations were performed with the BFGS algorithm in translation-rotation-internal coordinates [104] using the default convergence criteria of 1 × 10−6 hartree for the energy change between steps and 4.5 × 10−4 hartree/bohr for the maximum gradient component.
4.2. ML models
In this work, we use feed-forward artificial NNs and KRR to model the individual MBE interaction terms. However, the presented MBE-based approach is compatible with any ML regression algorithm. Regardless of the specific choice of ML approach, the three-body interactions involving dissimilar ligands need to be symmetrized with respect to the exchange of the two ligands, i.e. to ensure that q3(M, Li, Lj) = q3(M, Lj, Li). We implement this permutational invariance by applying a feature-wise transformation, calculating the average ij = ½ (Li + Lj) and half the absolute difference ij = ½ |Li−Lj| of the two ligand feature vectors. Alternatively, the symmetry condition could be realized by evaluating the ML interaction term for both ligand permutations and subsequently averaging. This is equivalent to lifting the restriction on the second sum over ligands in the three-body contribution of equation (1) and multiplying the term by a factor of one-half. For models trained to predict the four frontier orbital energies, we employ multi-task learning throughout. In the ANN model, this is implemented using shared hidden layers between the four targets and separate linear output layers for each orbital energy. For the KRR model, the multi-task approach corresponds to a shared kernel matrix and, thereby, shared hyperparameters for the models of the four frontier orbital energies. This restricted architecture is motivated by a high correlation between the frontier orbital energies, i.e. the Pearson correlation coefficient is greater than 0.96 for all pairwise combinations (supplementary material figure S1).
4.2.1. Artificial Neural Networks
The implementation of the MBE-based NN models was simplified by automatic differentiation, a core feature of modern NN libraries. In this work, we use the Keras [105] and Tensorflow [106] Python packages. The individual interaction terms are modeled using fully-connected layers combined with the softplus activation function [107] for hidden layers and a linear activation on the output layer. To reduce overfitting, we apply L2 weight-regularization with strength λ and dropout [108, 109], i.e. random zeroing of nodes during training with a probability p, on all hidden layers (supplementary material figure S2).
The training was carried out using the Adam optimizer [110] and the mean squared error loss function for a maximum of 5000 epochs. Early stopping was performed by monitoring the mean squared error on the validation set and stopping the training if the error did not decrease for 100 epochs. Models containing three-body interaction terms were trained in three steps, each terminated by the early stopping procedure. First, a simplified model using only two-body terms was trained. In the second step, the three-body terms were fit as corrections to the simple model by keeping the weights for the two-body terms fixed. In the final training step, the weights of all interaction terms were allowed to vary without restrictions. The size of the NNs, the batch size, the regularization strength λ, and the dropout probability p are determined by minimizing the mean absolute error (MAE) on the validation set using the tree-structured Parzen estimator [111] approach with 100 evaluations as implemented in the HyperOpt [112] package (supplementary material tables S4 and S5). Using the optimized hyperparameters, we constructed an ensemble model by training 10 independent models from different random weight initializations. The presented ensemble predictions and uncertainties were evaluated by calculating the mean and standard deviation over predictions of the 10 independent models.
4.2.2. Kernel Ridge Regression
It can be shown that incorporating the MBE approach into a kernel-based method such as KRR results in a modified kernel function (appendix
where the feature vectors x and y comprise the metal center featurization M, and the six ligand featurizations Li, i.e. x = [Mx, Lx,1, Lx,2, Lx,3, Lx,4, Lx,5, Lx,6]. The one-body kernel term is a function of just the metal core featurizations M,
where k1 is any valid kernel function. Motivated by the one-hot encoding used to featurize the metal core, we use a linear dot product kernel k1(Mx, My) = Mx · My for the one-body kernel. The two-body kernel term is given by,
where c2 is a scalar weighting factor and the kernel k2 is a function of two vectors, each of which is a concatenation of the metal featurization and a single ligand featurization. The relative importance of the two-body contributions is tuned by the hyperparameter c2. Additionally, a factor of 1/36 is used to rescale the kernel to unity ktwo-body(x, x) = 1 for homoleptic complexes, c2 = 1, and normalized kernel functions k2. Note, however, that the two-body kernel itself is not normalized since ktwo-body(x, x) ≠ 1 for heteroleptic complexes. Finally, the three-body kernel term can be divided into cis-type and trans-type interactions,
where ij and ij denote the symmetrized ligand features, c3,cis and c3,trans are scalar weighting factors, and k3,cis and k3,trans are kernel functions of the concatenation of the metal featurization and the two symmetrized ligand feature vectors. The scaling factors of c3,cis/144 and c3,trans/9 again tune the relative strength of the respective interaction terms and ensure normalization of the individual terms for homoleptic complexes and c3,cis= c3,trans = 1. While it is difficult to assign a many-body order to standard-RACs ligand features, we follow a similar approach to construct a kernel function for the KRR model built from standard-RACs. The one-body kernel kone-body, equation (5), is combined with a single kernel kRACs intended to capture all higher-order many-body contributions,
where cRACs is a scalar weighting factor, and kRACs is a function of two vectors and obtained by concatenating the metal featurization with the standard-RACs ligand features.
While we have previously shown that feature selection can further improve the performance of RACs-based KRR models [55], the KRR models presented in this study use the full feature set of their respective featurization approaches. This allows an unbiased comparison of all presented ML methods and featurization approaches but also avoids the need for custom implementations of these algorithms, as standard feature selection methods would not be compatible with the MBE-based KRR approach. The presented kernel modifications were implemented as an extension of the Gaussian process regression (GPR) module in scikit-learn [113] due to its higher degree of modularity compared to the KRR module. This also allowed us to obtain predictions of the KRR model uncertainty, as measured by the variance σ2, by using the corresponding GPR expression [114],
where x* is the feature vector of a given complex, k is the total kernel function (equations (4) or ( 8 )), k* is a vector obtained by evaluating the kernel function for x* and all training examples, K is the matrix of kernel values of pairwise combinations of training examples, and λ is the regularization strength. Regardless of the reliance on GPR concepts and implementations, we refer to the presented approach as KRR because GPR usually also implies Bayesian selection of the hyperparameters by maximizing the log marginal likelihood, which we did not do.
We investigated two different choices for the kernel functions kRACs, k2, k3,cis, and k3,trans used in the ligand interaction terms, the radial basis function (RBF) kernel,
and the Matérn kernel (for ν = 3/2), [115, 116]
where, in both cases, l is a hyperparameter typically referred to as the kernel length scale. Analogously to the NN approach, the KRR hyperparameters, i.e. the regularization strength λ, the choice of kernels kRACs, k2, k3,cis, and k3,trans, and their respective weighting factors c and length scales l, are determined by minimizing the MAE on the validation set using 100 iterations of the tree-structured Parzen estimator approach. To reduce the hyperparameter search space for the three-body model, all hyperparameters of its two-body kernel are fixed to the values obtained during the hyperparameter search for the two-body model (supplementary material tables S6 and S7).
5. Results and discussion
5.1. ML prediction of spin-splitting energies
We first evaluate the fitting of the training set for the newly derived models using two-body and three-body terms in combination with KRR and NN models on the spin-splitting energy and compare them to the standard-RACs based approach. The MAEs of the many-body ML model predictions show similar trends with respect to the featurization for both the KRR and the NN approaches (table 1). When trained on the standard-RACs features, both of the ML methods achieve the lowest training set errors, of 0.9 kcal mol−1 and 2.3 kcal mol−1 for KRR and NN, respectively. The models based on the MBE up to two-body order are less expressive and, therefore, result in significantly higher MAEs on the training set. This limitation is lifted by the inclusion of three-body terms. The corresponding three-body models yield slightly larger training set MAEs than the best-performing standard-RACs models. We also observe significantly lower training errors for all three KRR-based models compared to their NN counterparts. This can be attributed to the fact that the KRR method reproduces the training data exactly in the limit of zero regularization under the assumption of a unique featurization. NNs, on the other hand, are also limited in training set accuracy by the finite-sized architecture and the stochastic training process, including early stopping, in addition to the L2 regularization on their weights.
Table 1. Mean absolute errors of the model predictions of the spin-splitting energies on all four data sets in kcal mol−1. The lowest error for each data set is indicated in bold.
Model | Training set | Validation set | Composition test set | Ligand test set |
---|---|---|---|---|
KRR standard-RACs | 0.87 | 3.67 | 4.10 | 4.84 |
KRR two-body | 2.63 | 3.96 | 3.12 | 5.05 |
KRR three-body | 1.03 | 3.30 | 2.75 | 4.93 |
NN standard-RACs | 2.33 | 3.51 | 4.16 | 4.73 |
NN two-body | 3.16 | 3.73 | 3.61 | 4.14 |
NN three-body | 2.78 | 3.48 | 3.41 | 4.00 |
While fitting to training data gives a sense of how well the model is capturing the data, the comparison of model performance can only be fairly assessed on errors for the validation set. On the validation set, the variation in MAEs between different featurization concepts and ML approaches is significantly smaller than on the training set. Nevertheless, we generally observe the trend that all validation MAEs are higher at around 3.0–3.5 kcal mol−1, and the standard-RACs based models are no longer the best-performing models (table 1). While there is no clear difference in performance between the two ML methods, both approaches exhibit the same relative ordering with respect to the featurization. The three-body MBE-based models achieve the lowest MAEs of 3.3 kcal mol−1 and 3.5 kcal mol−1 for the KRR and NN models, respectively (table 1). The models based on the MBE truncated at the two-body order again exhibit the highest MAEs of 4.0 kcal mol−1 for the KRR model and 3.7 kcal mol−1 for the NN model. We attribute this similarity of MAEs across all methods to the fact that the MAE on the validation set was used as the target of the hyperparameter optimization. Because both models trained on standard-RACs result in slightly larger validation errors and exhibit lower training set errors, this hints at a higher degree of overfitting. Common to all six models is a weak dependence of the error on the identity and oxidation state of the central metal ion (supplementary material table S8). With the exception of the KRR model using standard-RACs, the lowest MAEs are observed on Cr(III) complexes. We attribute this to the relatively low variance in the spin-splitting energies on the Cr(III) subset (supplementary material figure S3). This is corroborated by significantly lower R2 scores of between 0.2 and 0.3 on the Cr(III) subset compared to R2 scores of approximately 0.96 on the full validation set (supplementary material tables S9 and table S10). Conversely, the highest MAEs are obtained on the Co(III) and Fe(II) subsets, which exhibit the highest range in target values.
The distribution of errors on the validation set highlights several outliers that are shared among the different ML methods and featurization approaches (figure 1 and supplementary material figure S4). The standard-RACs KRR model exhibits the highest deviation from the DFT reference on Co(III)(pyr)4(CN)(CH2O) (where pyr = pyridine), for which the spin-splitting energy is underestimated by 23.4 kcal mol−1 (figure 1(a)). Using the kernel function k from equation (8) as distance metric in the standard-RACs feature space, we identify the three complexes from the training set that are most similar to this outlier, Co(III)(pyr)4(CO)2, Co(III)(pyr)4(H2O)(CO), and Co(III)(pyr)4(CO)(misc) (where misc = methylisocyanide), all of which have spin-splitting energies 10–20 kcal mol−1 lower than Co(III)(pyr)4(CN)(CH2O). Therefore, we attribute the large error on this complex to the fact that the standard-RACs feature space distance is dominated by the four axial pyridine ligands and does not sufficiently account for the presence of cyanide, one of the strongest ligands in our data sets. This hypothesis is corroborated by the fact that Co(III)(pyr)4(CN)(CH2O) also represents the second-largest outlier for the standard-RACs NN model with an underestimation of 19.4 kcal mol−1, while the MBE-based models show significantly lower errors on this complex. Another commonality is observed for the three NN models that exhibit their highest errors, an overestimation of roughly 20 kcal mol−1, on the complex Fe(III)(CHPH2)6 (figure 1(b)). The only occurrence of this ligand in the training set is in combination with the ion Cr(III). Since the spin-splitting energy of Cr(III) is largely independent of the ligands, the ligand strength of CHPH2 cannot be inferred from this complex (supplementary material figure S3). The fact that the NN models predict significantly higher spin-splitting energies for Fe(III)(CHPH2)6 than the KRR models is attributed to the different extrapolation behavior of the two ML methods.
Figure 1. Calculated versus predicted spin-splitting energies of the validation set complexes for the two ML methods (rows) and the three featurization approaches (columns), all in kcal mol−1. The three complexes with the largest prediction errors, Co(III)(pyridine)4(cyanide)(formaldehyde) (a), Fe(III)(CHPH2)6 (b), and Fe(III)(pisc)6 (c), are annotated in all panels and their respective low-spin structures are depicted on the right. Structures are colored as follows: pink for Co, brown for Fe, gray for C, blue for N, white for H, red for O, and orange for P.
Download figure:
Standard image High-resolution imageAs a final example of shared outliers, both MBE-based KRR models exhibit the highest deviation from the DFT reference on the heteroleptic 1-tert-butyl-4-phenyl-isocyanide (pisc) Fe(III) complex. We attribute this to the fact that the training set does not contain any combination of pisc ligands and Fe(III) (figure 1(c)). This results in two potential ways an ML model might leverage training data to inform the prediction of the spin-splitting energy of Fe(III)(pisc)6. One option is to construct the prediction from the metal–ligand interaction of Fe(III) with similar ligands. The two MBE-based KRR models follow this approach, as indicated by the fact that the three most similar training complexes as measured by their respective kernels are Fe(III)(CNH)6, Fe(III)(misc)6, and Fe(III)(misc)5(CO). Alternatively, the ligand strength could be inferred from the interaction of pisc with other metal ions. Exemplary of this, the closest complexes in the training set as measured by the standard-RACs kernel are Mn(II)(pisc)6, Fe(II)(pisc)6, and Mn(III)(ox)4(pisc)2, which results in a significantly lower prediction error for the standard-RACs KRR model.
On the composition test set, which consists of all binary combinations and structural isomers of a fixed ligand set (see section 3), we observe similar trends with regard to the featurization approach for both ML methods. That is, the three-body models achieve the lowest errors, especially in the KRR models (i.e. 2.8 kcal mol−1 vs 4.1 kcal mol−1 for standard RACs), while two-body models exhibit intermediate performance (table 1). We attribute the better performance of KRR over NN models to lower KRR errors on the homoleptic complexes in the training data set that are reference values for interpolation of the heteroleptic complexes. Despite these overall trends, outliers are evident when separately analyzing each of the eight metal/oxidation state combinations (supplementary material table S11). On the Mn(III) subset, the global trend with regard to the featurization approach is reversed, and the standard-RACs models consistently outperform their MBE-based counterparts. Conversely, for the Fe(II) subset in combination with the three-body models, the highest R2 scores (i.e. 0.98–0.99) are achieved (supplementary material table S12).
To gain a better understanding of these trends, we analyze the interpolation behavior of the three KRR models on the Mn(III) and Fe(II) subsets in detail (figure 2). The standard-RACs based model predictions show unsystematic deviations from an overall linear interpolation trend in the generalization to heteroleptic complexes. This results in potentially large differences in the predicted spin-splitting energy for the cis/trans and fac/mer isomers of the same composition. We attribute the overall linear trend in the predictions to the design of standard-RACs. In the RACs feature space, the heteroleptic complexes lie close to straight-line connections between their homoleptic counterparts (supplementary material figure S5). In the two-body model, linear additivity is strictly enforced, resulting in a roughly 2 kcal mol−1 lower MAE on the Fe(II) subset (supplementary material table S11). However, due to this simplification, the two-body model is unable to distinguish between the two isomers for a given composition. Furthermore, the limited model expressivity in the two-body model results in higher errors on the homoleptic complexes, which we attribute to a trade-off between homoleptic and heteroleptic complexes during model training. These shortcomings are addressed by the inclusion of three-body interaction terms, which introduce two additional degrees of freedom to the interpolation. Their effect can be described as curvature, i.e. systematic higher or lower predictions compared to the linear interpolation, and a splitting in the predicted spin-splitting energy of the cis/trans and fac/mer isomers (appendix
Figure 2. Plot of the spin-splitting energy (in kcal mol−1) interpolation curves for binary complexes of three ligands, methanol (MeOH), hydrogen cyanide (NCH), and hydrogen isocyanide (CNH), and two metals Mn(III) and Fe(II) showing the DFT reference values (scatter) and the corresponding ML predictions (lines) for KRR models using different featurization approaches (columns). For compositions with multiple structural isomers, values are plotted using half markers for the DFT reference and dashed/dotted lines for the ML predictions.
Download figure:
Standard image High-resolution imageFinally, we compare the generalization of the models to unseen ligands using the ligand test set. As expected, all models and featurizations yield slightly worse MAEs than for the validation set. All three KRR models show comparable performance, with the lowest MAE of 4.8 kcal mol−1 achieved by the standard-RACs models (table 1). However, both MBE-based NN models exhibit significantly lower generalization errors of 4.1 kcal mol−1 for the two-body model and 4.0 kcal mol−1 for the three-body model. While the NN model on standard-RACs results in a slightly lower MAE of 4.7 kcal mol−1 than that for the KRR model, the standard-RACs featurization has the worst performance of the three NN models. Analysis of subsets for the eight metal/oxidation state combinations reveals that the KRR models exhibit significantly higher MAEs on the Cr(III) subset compared to the NN models (supplementary material table S13). Given the low variation of spin-splitting energies on the Cr(III) subset, using the average value of the Cr(III) training set complexes as constant prediction results in a lower MAE of 2.5 kcal mol−1 than the KRR models with MAEs of roughly 4–5 kcal mol−1. We attribute this to the fact that the kernel-based approaches insufficiently separate the effect of the metal identity from the effect of the ligands. Addressing this issue would require a redesign of the kernel, for example, following the approach of Jørgensen in which the two-body interaction terms are modeled as a product of separate interaction strength functions for the metal and the ligand q2(M, Li) = g(M)f(Li) [60, 117]. With the exception of the Co(III) subset, the overall dependence of the model MAEs on the metal/oxidation state combinations for the ligand test set is consistent with results from the validation and composition test sets. Dividing the ligand test set into subsets corresponding to the 21 different ligands shows significantly higher MAEs for the subset of the ligands thiane and PHCHOH, which also represent outliers in an analysis of the feature space distributions (supplementary material table S14 and figure S5). Removal of the 11 complexes corresponding to these two ligands results in a decrease in the MAE of roughly 0.5 kcal mol−1 for all models while maintaining the significantly improved generalization errors for the MBE-based NN models.
For applications in chemical discovery, large prediction errors on certain complexes do not represent a problem as long as the ML models are able to correctly identify these as outliers. The NN ensemble approach allows one to evaluate the prediction uncertainty as measured by the standard deviation of the ten individual NNs [118, 119]. Both the standard-RACs and the two-body NN models severely underestimate the prediction uncertainty even on significant outliers (figure 3). For example, on the largest outlier Fe(III)(thiane)6, with a DFT calculated spin-splitting energy of −8.3 kcal mol−1, the standard-RACs and two-body NN models predict values of −27.6 ± 5.0 kcal mol−1 and −25.5 ± 5.8 kcal mol−1, respectively, where the uncertainty is estimated using two standard deviations of the NN ensemble predictions. We attribute this to the limited expressivity of these models caused by their respective featurization approaches that limits variation in parameters of the NN models in the ensemble. This limitation is less severe for the three-body model for which the predicted Fe(III)(thiane)6 spin-splitting energy of −24.9 ± 12.5 kcal mol−1 also represents a significant outlier, but the model correctly assigns a high uncertainty to this prediction. We evaluate the average negative log-likelihood (NLL) as a quantitative measure for this model uncertainty prediction, for which the three-body NN exhibits up to a three times lower average NLL score on all test sets than the two-body and standard-RACs NN models (supplementary material table S15). An analysis of the uncertainty prediction for KRR shows that the average predicted standard deviation on the ligand test set (calculated with equation (9)) ranges from 40 kcal mol−1 for the standard-RACs model to 160 kcal mol−1 for the three-body model, which produces low NLL scores but makes these models' predicted uncertainties too high to be useful in Bayesian optimization algorithms (supplementary material table S15). We attribute the large predicted uncertainties to our optimization of KRR hyperparameters using the validation set MAE instead of a fully Bayesian approach, which was done to ensure fair comparison to the NN results.
Figure 3. Calculated versus NN-predicted spin-splitting energies of the ligand test set for the three featurization approaches (columns). Error bars corresponding to two standard deviations depict the model uncertainty. The five complexes with the largest prediction errors, Fe(III)(thiane)6, Mn(II)(thiane)6, Co(II)(thiane)6, Fe(II)(PHCHOH)6, and Mn(II)(PHCHOH)6, are annotated in all panels using a 2D representation of the respective ligand in which the connecting atom is highlighted with a gray circle.
Download figure:
Standard image High-resolution image5.2. ML prediction of frontier orbital energies
To test the generalization of our suggested MBE ML approaches, we next considered the frontier orbital energies, which are less metal-centric properties, as a second class of target properties to evaluate the performance. We might expect the ML model accuracy on both the composition test set and the ligand test set to be significantly improved by the inclusion of the MBE over standard RACs. However, the spin-splitting energy represents a best-case scenario for these models because the MBE of this property is closely related to the concept of the spectrochemical series, whereas the frontier orbital energies are more sensitive reporters of the overall size and charge of the molecule. Of the frontier orbital energies, we first focus our analysis on the HOMO energies, as we have previously successfully applied additivity relations to predict the HOMO energies of heteroleptic singlet Fe(II) complexes from their homoleptic counterparts[47].
Given the high degree of linear correlation between the LS and HS HOMO energies, we do not distinguish between the two spin-multiplicities in our analysis (supplementary material figure S1). The presence of negatively charged ligands in our data sets, which shift the HOMO level towards higher energies, produces a large variance in the target data. In the training data set, the HOMO energies range from roughly −25 eV to 0 eV (supplementary material figure S8). For all four metal ions, the M(III) subset covers a larger range of values than the M(II) counterparts (average range of 25 eV for M(III) vs 16 eV for M(II)). We attribute this to the fact that, on average, the median M(III) HOMO energies are 4.2 eV lower than the corresponding M(II) energies. This lower median HOMO level, in turn, allows complexes with a larger total number of negative charges on the ligands to pass the check for negative spin-majority HOMO energies applied during the data set curation (see computational details in section 4). This effect is the result of more than just compensating for the difference in charge of metal ion because the M(III) subsets contain more overall negatively charged complexes (supplementary material table S16). Given the large variance in the target data due to charged ligands and the fact that this property is less metal-centric, we would expect that learning the frontier orbital energies to be significantly more challenging than the spin-splitting energy.
Consistent with observations for spin-splitting energies, the KRR models outperform the NN counterparts on the training set. The KRR models have MAEs of 0.15 eV for the standard-RACs model (vs 0.29 eV for the NN), 0.41 eV for the two-body model (vs 0.43 eV for the NN), and 0.22 eV for the three-body model (vs 0.32 eV for the NN, table 2). On the validation set, the difference in performance between the different ML methods and featurization approaches is again significantly smaller. We observe comparable MAEs of roughly 0.4 eV for all models, where the NN models exhibit, on average, 0.02 eV lower errors than the KRR models (table 2). Given the large range of target values, the MAEs on the training and validation set correspond to high R2 scores of over 0.97 for all models (supplementary material table S17).
Table 2. Mean absolute errors of the model predictions of the combined LS and HS HOMO energies on all four data sets in eV. The lowest error for each set is indicated in bold.
Model | Training set | Validation set | Composition test set | Ligand test set |
---|---|---|---|---|
KRR standard-RACs | 0.15 | 0.37 | 0.58 | 1.03 |
KRR two-body | 0.41 | 0.44 | 0.28 | 1.27 |
KRR three-body | 0.22 | 0.40 | 0.23 | 1.12 |
NN standard-RACs | 0.29 | 0.34 | 0.56 | 0.96 |
NN two-body | 0.43 | 0.43 | 0.30 | 1.05 |
NN three-body | 0.32 | 0.38 | 0.79 | 0.97 |
The composition test set exhibits the highest difference in MAEs between the models (table 2). For the KRR models, we observe the expected trend of a significant improvement by including the MBE, resulting in an MAE of 0.28 eV for the two-body model, almost half the 0.58 eV MAE of the standard-RACs model. The NN models show a similarly improved accuracy of the two-body model, with an MAE of 0.30 eV, over the standard-RACs model, with an MAE of 0.56 eV. However, while the three-body KRR model achieves the overall lowest MAE on the composition test set of 0.23 eV, the three-body NN model exhibits the highest MAE of 0.79 eV (see more discussion, next). Finally, on the ligand test set, we observe high errors for all models, where the standard-RACs NN achieves the lowest MAE of 0.96 eV and the two-body KRR model exhibits the highest MAE of 1.27 eV. This poor performance of all models on the generalization to previously unseen ligands again highlights the fact that the HOMO energies are a less metal-centric and, therefore, a more challenging property to predict when prior knowledge of a specific ligand chemistry is unavailable to the model.
The uncharacteristically high MAE of the three-body NN on the composition test set warrants closer analysis. As discussed previously, the effect of the three-body interactions on binary heteroleptic complexes can be described in terms of an isomer splitting, i.e. the difference in predictions for cis/trans and fac/mer isomers, and 'curvature', i.e. the deviation from linear interpolation for the 5 + 1 complexes (appendix
Figure 4. Plot of the LS HOMO energy (in eV) interpolation curves for binary complexes of three ligands, methanol (MeOH), hydrogen cyanide (NCH), and hydrogen isocyanide (CNH), and two metals Cr and Fe showing the DFT reference values (scatter) and the corresponding ML predictions (lines) for NN models using different featurization approaches (columns). For compositions with multiple structural isomers, values are plotted using half markers for the DFT reference and dashed/dotted lines for the ML predictions.
Download figure:
Standard image High-resolution imageWe next further analyzed the performance of all of the different ML methods and featurization approaches on the more stringent ligand test set. On average, each approach produces comparably worsened MAEs (ca. 0.9–1.1 eV), but there are some smaller differences (table 2). Similar to our finding on the spin-splitting energies, we observe that for all three featurization approaches the NN models yield lower MAEs than the KRR models in this extrapolative regime. Consistent with the results on the other data sets, all models achieve significantly lower MAEs on the M(II) subsets of the ligand test set than on the M(III) subsets (on average over 0.4 eV lower, supplementary material table S19). We attribute this to the fact that the M(III) portion of the training set covers a wider range of ligand charges, resulting in a more challenging generalization task. Analyzing the MAEs on subsets grouped by each of the 21 ligands (i.e. with varying metal or oxidation state), shows large variations ranging from 0.09 eV for the three-body KRR model on the M(DMF)6 (where DMF = dimethylformamide) subset to 2.98 eV for the two-body KRR model on the M(oxz)6 (where oxz = oxazoline) subset (supplementary material table S20). One of the few examples of a ligand subset on which all models underestimate the HOMO energies is the aforementioned set of N-coordinating oxz complexes. However, unlike for the spin-splitting energies, these trends rarely hold for all ML methods and featurization approaches, making it challenging to conclude any relationship among featurization, chemistry, and model performance. A final notable difference between the different models is their ability to capture the metal/oxidation state dependence and ligand dependence of the HOMO energies. For example, both the standard-RACs NN and the three-body NN show comparable MAEs on the M(oxz)6 subset of 1.9 eV and 2.1 eV, respectively. However, for the three-body NN, this error stems from a systematic offset that is almost constant for all metal and oxidation state combinations, while the errors of the standard-RACs NN vary from 0.9 eV to 2.7 eV (supplementary material figure S11). A quantitative measure of this is given by the standard deviations of the absolute error, which on this subset evaluate to 0.63 eV for the standard-RACs NN and 0.27 eV for the three-body NN (supplementary material table S21). With the exception of a lower composition test set MAE for the three-body NN, all the trends discussed for the HOMO energies similarly hold for the LUMO energies (supplementary material text S2, tables S22 and S23, figures S12 and S13). To summarize, the overall lack of trends for the MAEs on various subsets of the data and between the different ML methods and featurization approaches suggests that both the ligand-RACs and standard-RACs feature spaces do not sufficiently capture the similarity of ligands for the frontier orbital energies, resulting in high generalization errors. More data could address some of these limitations of all models, but the lack of trends between different models makes it challenging to identify what way the feature sets should be further improved.
An even more relevant property than the individual HOMO or LUMO energies is the difference between HOMO and LUMO levels, i.e. the HOMO–LUMO gap. Calculating this difference results in significant changes to the distribution of the DFT target values compared to the pure frontier orbital energies (supplementary material figure S14). First, the high degree of linear correlation between the HOMO and LUMO energies of the same spin state results in cancellation of parts of the ligand dependence and, therefore, a smaller range of target values of about 7 eV. Second, the bimodal character of the frontier orbital distributions, stemming from the presence of charged ligands, cancels out almost completely. Third, the linear correlation between the LS and HS target values for the gap is significantly lower than for the individual HOMO and LUMO energies (supplementary material figure S15 and table S24). This can be attributed to cancellation of the linear correlation between the HOMO and LUMO energies of different spin states (supplementary material figure S1). Finally, by definition, i.e. due to the Aufbau principle, the DFT reference values for the gap are strictly positive. We first examined ML-model predicted gaps obtained from independent model predictions of the HOMO and the LUMO. However, since the HOMO and LUMO energies are two largely independent model predictions, the described differences in the distribution of gap energies are not necessarily reflected in the ML predictions of the gap. For example, on a small number of complexes, the ML models predict a lower LUMO energy than HOMO energy, resulting in a negative predicted gap (supplementary material table S25).
As expected, the MAEs of the gap predictions for all six models follow the same overall trends as the results on the HOMO and LUMO energies (table 3). In fact, with the exception of the standard-RACs models, the MAEs on the training and validation set gap energies are within 0.01–0.04 eV of the MAEs on the HOMO energies and 0.08–0.12 eV higher than the MAEs on the LUMO energies. However, given the smaller range of the target values, these MAEs correspond to lower R2 scores than for the HOMO or LUMO model predictions (supplementary material table S26). On the composition test set, we observe that both standard-RACs models achieve slightly lower MAEs for the gap than on the frontier orbital energies, while all MBE-based models yield slightly higher MAEs. The significant outlier in the HOMO MAEs of 0.79 eV for the three-body NN on the composition set propagates into the predictions for the gap and results in an MAE of 0.94 eV, which we interpret as compounding of the errors on the HOMO and LUMO energies. Conversely, on the ligand test set, we observe cancelation of errors resulting in 0.2–0.5 eV lower MAEs compared to the HOMO energies. It is again important to note that these decreased MAEs do not correspond to higher R2 scores due to the lower range of target values (supplementary material table S26). In order to quantify the influence of the indirect modeling approach on ML performance, we trained additional models directly on the LS and HS gap energies as a point of comparison. This results in lower validation set MAEs for all models and significant reductions in both test set MAEs of the standard-RACs and two-body KRR models (supplementary material table S27). On the other hand, these direct models exhibit an increased tendency to overfit, as evidenced by higher MAEs for the two three-body models on both test sets. We, therefore, interpret the approach of calculating the gap as a difference of the frontier orbital predictions as another regularization technique that might limit the achievable accuracy for already heavily regularized models but becomes crucial to reduce overfitting in highly expressive models.
Table 3. Mean absolute errors of the model predictions of the combined LS and HS HOMO–LUMO gap energies on all four data sets in eV. The lowest error for each set is indicated in bold.
Model | Training set | Validation set | Composition test set | Ligand test set |
---|---|---|---|---|
KRR standard-RACs | 0.16 | 0.52 | 0.45 | 0.80 |
KRR two-body | 0.35 | 0.45 | 0.34 | 0.72 |
KRR three-body | 0.20 | 0.41 | 0.26 | 0.61 |
NN standard-RACs | 0.34 | 0.43 | 0.50 | 0.63 |
NN two-body | 0.39 | 0.44 | 0.42 | 0.54 |
NN three-body | 0.34 | 0.39 | 0.94 | 0.53 |
To further investigate the differences between the frontier orbital models and the spin-splitting models, we evaluated the feature importance of the respective two-body NN models on the ligand test set using the model-agnostic Kernel SHapley Additive exPlanations (SHAP) approach [120]. As expected, both metal-centered features, d-electron configuration and oxidation state, rank in the top 5 most important features for all three properties (figure 5). In fact, many of the other top 12 features are shared between the models. A notable difference is the fact that the spin-splitting model mostly relies on metal-local features, i.e. ligand RACs with depth d ⩽ 2, whereas for the prediction of the HOMO–LUMO gaps ligand RACs with the maximum depth considered in the feature set (i.e. d = 3) also play a significant role. This is in line with prior feature importance analysis across broader sets of TMCs [55, 100]. It is also consistent with our expectation that frontier orbital energies are a less metal-centric property than spin-splitting energies and, therefore, represent a more challenging target for all presented TMC models. Both the relative ordering of the feature importance and the individual SHAP values observed for the two-body NN model are largely consistent with a SHAP analysis of the three-body and standard-RACs NN models (supplementary material figures S16 and S17). This consistency suggests that the SHAP analysis represents a good starting point for KRR feature selection methods, which could improve the generalization performance of all KRR models.
Figure 5. Kernel SHAP feature importance analysis of the two-body NN models for the spin-splitting energy (SSE, left), and the LS (center) and HS (right) HOMO–LUMO gaps (gap). For each model the 12 most important features, as measured by the mean absolute SHAP value, are shown, where dn refers to the d-electron configuration, OS to the oxidation state, and and represent the ligand-centered product and difference RACs from equation (2) and (3). The model baseline is evaluated using a k-means clustering of the training set with k = 25 and SHAP feature importance values are calculated for all 127 TMCs in the ligand test set. The one-hot encoded metal ion features are combined into two separate integer features (dn and OS) and their respective SHAP values are summed. Similarly, the six different starting atoms for the ligand-centered RACs are combined into a single feature and their SHAP values are summed. In SHAP the model predictions are assumed to be a sum of SHAP values for each feature. The first row of the left panel can be interpreted as follows: small values of the dn feature, i.e. d3 and d4, feature indicated by a blue hue result in lower SSE predictions indicated by negative SHAP values. The magnitude of the SHAP values for each feature corresponds to the importance.
Download figure:
Standard image High-resolution imageIn summary, the results for model performance on the frontier orbital energies in large parts mirror our findings on the spin-splitting energies. The KRR-based models outperform their NN counterparts in the interpolative regime, i.e. on the composition test set, while the NN models show better generalization to previously unseen ligands. The inclusion of the MBE shows a systematic improvement in the accuracy on the composition test set for all frontier orbital energy targets. The only outlier from this trend, the three-body NN, highlights the increased risk of overfitting that comes with more flexible models. The standard-RACs based models achieved the lowest MAEs on the ligand test set for both HOMO and LUMO energies. However, for the HOMO–LUMO gaps, the MBE-based featurization approach results in improved error cancellation. The same systematic reduction of the MAEs with increasing MBE order was observed on the composition test set.
6. Conclusion
We introduced a novel approach based on the MBE to encode the stereochemistry of octahedral TMCs in molecular graph descriptors. This featurization approach enables more efficient VHTS for material properties using ML surrogate models by exploiting ligand additivity relations. We derived the necessary expressions to implement this approach in two commonly used ML methods, KRR models and NNs. The MBE-based approach allows for the tuning of the expressivity of the ML models by truncating the expansion at different interaction orders. ML models obtained by truncation at the two-body and three-body order were investigated, and their performance was compared to models based on standard-RACs, our previous approach for featurization of octahedral TMCs.
We trained these models on a data set of 1444 spin-splitting energies and observed improved model accuracy on test sets in the interpolative regime, i.e. TMCs composed of new combinations of ligands from the training set, and the extrapolative regime, i.e. TMCs consisting of previously unseen ligands. On the composition test set, the three-body KRR model achieved the lowest MAE of 2.75 kcal mol−1, significantly outperforming the reference KRR model using standard-RACs with an MAE of 4.10 kcal mol−1. However, all KRR models showed comparable generalization errors of roughly 4.8–5.0 kcal mol−1 on the ligand test set. For the three-body NN models, we observed a reduction in the MAE of about 0.7 kcal mol−1 compared to the reference model using standard-RACs on both the composition test set (3.41 kcal mol−1 vs 4.16 kcal mol−1) and the ligand test set (4.00 kcal mol−1 vs 4.73 kcal mol−1). Despite slightly higher MAEs of 3.61 kcal mol−1 on the composition test set and 4.14 kcal mol−1 on the ligand test set, the two-body NN model represents the best trade-off between accuracy, data efficiency, and ease of training, i.e. robustness to overfitting.
Next, we investigated the model performance on the frontier orbital energies, i.e. HOMO energy, LUMO energy, and the HOMO–LUMO gap. On the composition test set, the MBE-based models achieved significantly lower MAEs than the standard-RACs reference models for all frontier orbital energy targets. For example, on the HOMO energies, the two-body and three-body KRR models achieve MAEs of 0.28 eV and 0.23 eV, respectively, compared to a reference value of 0.58 eV for the standard-RACs KRR model. However, we also observed significant overfitting for the three-body NN, resulting in an increased composition test set MAE on HOMO energies that propagated to the gap predictions. On the ligand test set, all models exhibit generally high MAEs (0.8 eV to 1.2 eV) compared to the other data sets, and there was no significant difference between the featurization approaches. The only exception to this trend is a decrease of roughly 0.1–0.2 eV in the MAE for the gap predictions of the MBE-based models, which we attribute to improved error cancellation of the HOMO and LUMO predictions due to the more systematic featurization approach. The limited generalization behavior for frontier orbital energies is attributed to shortcomings in the handcrafted features used in both the standard-RACs and the MBE-based approach. One way to address this in future work would be by incorporating the presented MBE-based encoding of TMC stereochemistry into a deep learning approach such as graph convolutional NNs. The observed trade-off between limited accuracy of the two-body models and overparametrization in the three-body model could be addressed by partial inclusion of three-body interactions, e.g. using only cis-type interactions but not trans-type interactions or only curvature parameters but not isomer-splitting parameters (appendix
No single model outperforms the other models on all investigated properties or metrics. Therefore, for future screening studies, the choice model should be informed by the available resources and required accuracy. In the limit of small data sets, the two-body KRR model, trained on a purely homoleptic data set, represents the most efficient use of computational resources and yields robust generalization to heteroleptic complexes based on simple additivity rules. Given a sufficiently large and adequately balanced, NNs typically outperform KRR models on out-of-distribution predictions, and a higher MBE-expansion order allows for a more accurate description of heteroleptic complexes. Other considerations that were not the main focus of this study might be the fidelity of the model's prediction uncertainty, which becomes crucial in Bayesian screening techniques, or the complexity of model training, which is a crucial factor in active learning loops. Our results suggest that an MBE-based model would outperform the corresponding standard-RACs model in all these use cases.
Acknowledgment
Initial support for this work, including data curation and initial model testing, was provided by the Office of Naval Research under Grant Number. N00014-20-1-2150. Subsequent support for final model training and development was provided by the United States Department of Energy under Grant Numbers. DE-NA0003965. D B K C was partially supported by a National Science Foundation Graduate Research Fellowship Program under grant #1745302. H J K acknowledges a Sloan Foundation Fellowship in Chemistry and a Simon Family Faculty Research Innovation Fund grant. The authors thank Adam Steeves, Aditya Nandy, and Vyshnavi Vennelakanti for helpful discussions.
Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://doi.org/10.5281/zenodo.13331586 https://github.com/hjkgrp/many_body_ml.
Conflict of interest
The authors declare no competing financial interest.
Appendix A: Linear dependence of three-body interaction terms.
Because the presented approach does not include explicit dependence of the interaction terms on geometric information, such as bond lengths or bond angles, the only geometric encoding in the truncated MBE is the distinction between cis-type and trans-type three-body interactions. Truncating equation (1) from the main text at the three-body order and separating the sum over three-body interactions into cis-type and trans-type yields,
This summation over the MBE terms can be rewritten as a vector product of an interaction vector q that contains all possible interaction terms and a coefficient vector n that counts the number of interactions of each type for a given TMC.
For example, all ten complexes that can be built from the combination of a single metal center and two ligands A and B can be expressed as,
where the constant one-body term and other unused interaction terms are not listed explicitly for the sake of clarity. In this example, the (10 × 8) matrix of coefficient vectors has rank 4, indicating linear dependence between the coefficient vectors. This linear dependence is a consequence of the restriction to octahedral complexes that imposes additional constraints on the coefficient vector, such as the restriction that the number of two-body interactions sums to a total of six.
To illustrate how this finding generalizes beyond a single metal center and the combination of two ligands, we give examples of how these constraints allow one to calculate the coefficients from a 'minimal' basis. Note, however, that this choice of basis is not unique. In octahedral complexes, the number of cis-type three-body interactions involving two ligands of the same type X can be calculated from the number of two-body interactions for that ligand and the number of cis-type three-body interactions between X and all other ligand types,
A simple interpretation of this relationship is that every ligand of type X in a given complex, quantified by , could potentially form cis-type interactions with all four nearest neighbor ligands. The actual number of cis-type interactions is reduced by one for every occurrence of an interaction between a X-type ligand and a ligand of any other type. Finally, the factor one-half accounts for double counting of interactions. Similarly, , the number of interactions, is given by,
As a consequence, the four interaction terms , , , and form a complete basis for the example from equation A2, i.e. restriction to these four terms results in a model of the same expressive power as the full model.
Another interesting finding from equation A2 is that the difference between expressions for the cis and trans isomers is exactly the same as the difference between the fac and mer isomer expressions,
In combination with the 'curvature' of the interpolation curve, i.e. the difference between the expression for the 5 + 1 complexes and linear interpolation,
and the expressions for the two homoleptic complexes,
and
this isomer-splitting term forms a basis that allows for easier interpretation of the effect of each interaction term.
Appendix B: Derivation of MBE-based kernels.
KRR is usually presented as an extension of linear ridge regression in which the input vector xi is mapped to a D-dimensional feature space φi = φ(xi), where D is typically larger than the dimension of the original input space [114, 122]. A linear model in this feature space is given by the expression,
where w⊤ = [w(1), ..., w(D)] is the vector of weights for the linear regression. This weight vector is determined by minimizing the ridge regression cost function,
where the pairs (x1, Q1), ..., (xN, QN) are the training data, and λ is a hyperparameter tuning the strength of the L2 regularization term. Setting the derivative of the cost function with respect to the weight vector to zero yields,
which can be rearranged into a closed-form expression for the weight vector,
where ID is an identity matrix of dimension (D × D). Using the definitions Ф = [φ1, ..., φN] and Q = [Q1, ..., QN]⊤, the sums can be rewritten in terms of vector/matrix multiplications,
The expression for the weights can be further rearranged by applying the following matrix identity, often referred to as the push-through identity [123–125],
where setting A = λID, P⊤ = Ф, and R = IN yields,
This allows one to perform the costly matrix inversion either in the (D × D)-dimensional feature space or the (N × N)-dimensional space of training examples. This switch from the outer product in feature space to the inner product also gives rise to the so-called kernel trick, where the inner product in feature space is replaced with a kernel function φ(x)⊤φ(y) = k(x, y). Plugging the expression for the weights in training example space (eq. B7) back into the definition of the linear model (eq. B1) yields,
Finally, applying the kernel trick gives,
where k(x) is a vector obtained by evaluating the kernel function for x and all training examples xn,
and K is the matrix of kernel values of pairwise combinations of training examples,
To derive a KRR expression for the MBE-based models, we start with a two-body model in weight-space view given by,
where w1 and w2 are weight vectors and φ1 and φ2 are transformations from the respective vectorial input space to the feature space. Importantly, this expression can be rearranged into the standard KRR weight space view by stacking the weight and transformation vectors,
and, therefore, results in exactly the same prediction expression (eq. B8). However, special care needs to be taken when applying the kernel trick. Since the kernel function k(x, y) is used to replace the inner product φ(x)⊤φ(y), all properties of this inner product must be conserved. In our specific case, the important properties are the separation into the individual MBE contributions and the summation over two-body terms,
Applying the kernel trick separately to each of these inner products of transformation vectors yields,
The same arguments can be used to derive a KRR expression for the three-body model by adding the cis-type and trans-type interaction terms.