Revving up 13C NMR shielding predictions across chemical space: Benchmarks for atoms-in-molecules kernel machine learning with new data for 134 kilo molecules

The requirement for accelerated and quantitatively accurate screening of NMR spectra across the small molecules chemical compound space (CCS), is two-fold: (1) a robust `local' machine learning (ML) strategy that captures the correlation between an atom's neighbourhood in a molecule and its `near-sighted' property---chemical shielding of its nuclear magnetic moment by the bonding electron density; (2) an accurate reference dataset generated with a state-of-the-art first principles method that can be used for training the ML model. Herein we report the QM9-NMR dataset comprising isotropic shielding of $^{13}$C nucleus of over 0.8 Million C atoms in 134k molecules of the QM9 dataset calculated at the mPW1PW91/6-311+G(2{\it d},{\it p}) level in gas phase and with implicit models of five common solvents. Using these accurate data for training, we present benchmark results for the prediction transferability of kernel-ridge regression models with popular local descriptors. When using the FCHL descriptor, our models based on 100k training samples, accurately predict the NMR shielding of 50k `hold-out' atoms with an MAE of $<1.9$ ppm. For rapid prediction of new query molecules, the ML models have been trained with molecular geometries calculated with the semi-empirical method PM7. Furthermore, we show by using $^{13}$C shifts from single point minimal basis set DFT calculations at the PM7 geometries as a baseline, one can apply a $\Delta$-ML strategy to quench the average prediction error to $<1.4$ ppm. We test the transferability of the local ML models trained on small molecules---with up to 9 heavy atoms---to non-trivial benchmark sets that include several large drug molecules, a small subset of the GDB17 dataset with molecules comprising 10 to 17 heavy atoms, and linear polycyclic aromatic hydrocarbons with 10--26 C atoms.


I. INTRODUCTION
Nuclear magnetic resonance (NMR) has become an indispensable tool in several aspects of chemistry, biochemistry and biophysics research. It is fast, accurate, information rich and non-destructive, making it ideal for detecting or describing chemical bonding scenarios. As easy and trivial have most NMR experiments become, it is still a computationally expensive task to estimate NMR shielding tensors or coupling constants for large molecular datasets 1,2 . In the case of molecules containing heavy atoms, attaining quantitative accuracy becomes a reality only when effects as subtle as relativistic corrections are incorporated 3,4 . On the other hand, when dealing with large organic molecules-so long as these are too fewcomputational NMR spectroscopy has significantly progressed to a state that it can be routinely employed to aid in the assignment of experimental results [5][6][7][8][9][10][11] . Grimme et al. 12 have discussed the automated prediction of spinspin coupled 1 H NMR in various solvents by accessing relevant conformers, in order to generate experimentally relevant NMR spectra, while Buevich et al. 13 have employed computer-assisted structure elucidation (CASE) algorithms along with predicted NMR results for moleca) Electronic mail: ramakrishnan@tifrh.res.in ular structure elucidation. Lauro et al. 14 have designed a protocol to identify stereoisomers using experimental and predicted NMR data.
Among the many ab initio quantum chemistry frameworks [15][16][17][18][19] available for computing chemical shifts, gauge-independent atomic orbital (GIAO) 20 is the most popular. Within the GIAO framework, Cartesian components of the NMR shielding tensor, σ q ij , of a nucleus q is calculated as the second-order magnetic response property 21,22 where E is the electronic energy of the molecule, B i is a component of the external magnetic field, and µ q j is the j−th component of the magnetic moment of the nucleus q. The isotropic shielding is defined as one-third of the trace of the shielding tensor, σ iso = (σ 11 + σ 22 + σ 33 )/3. Comparison of predicted values of σ iso with experimental results is done by calculating its 'shift' from the shielding of the same nucleus under consideration in a standard reference compound δ q iso = σ ref.
1 H and 13 C are amongst the most commonly studied NMR active nuclei. Accurate ab initio computation of δ 13 C requires methods such as coupled-cluster singles doubles (CCSD) 24 , or spin-component-scaled MP2 (SCS-MP2) with a triple-zeta quality basis set to reach a mean error of < 1.5 ppm 1,25 , albeit incurring a cost which prohibits the method's applicability in high-throughput studies. Composite methods have been proposedanalogous to Gn thermochemistry methods 26 -that exploit the additivity in basis set and correlation corrections to reach a greater accuracy 24,27 . Another strategy, that has been tested before, involves tailoring the performance of the exchange-correlation functionals (XCs) of Kohn-Sham density functional approximations (KS-DFAs) for better modeling of chemical shifts, as in the cases of weighted carbon 04 (WC04) and weighted proton 04 (WP04) DFAs 28 .
When relaxing the accuracy requirement-while retaining the generality-a DFA that has received wide attention over the years, particularly for NMR calculations of both 1 H & 13 C nuclei, is based on the XC: mPW1PW91 29 . This XC has been shown to provide good results for acetals 30 , pyramidalized alkenes 31 , acetylenes, allenes, cumulenes 32,33 and even natural products 5,6 .
The aforestated XC has also been used to model the 2D-NMR spectrum of exo-2norbornanecarbamic acid 34 . Further, a multi-reference standard approach (MSTD) 35 have shown consistent estimations of chemical shifts in solutions with the same XC and a triple-zeta basis set 36 . Thus, even though Flaig et al.'s 25 benchmark study ranked the B97-2 functional high, next only to the MP2 method, the consistency of mPW1PW91 has motivated several works including a recent effort by Gerrard et al. 37 , where the authors applied mPW1PW91 with the 6-311G(d,p) basis set to generate NMR chemical shielding and heteronuclear coupling constants of molecular components in experimentally characterized organic solids.
While direct application of DFT is feasible for any query molecule, the questions that arise in chemical compound space (CCS) explorations often concern property trends across large datasets, demanding realistically rapid evaluation of the desired property. To this end, machine learning (ML) based statistical inference, in combination with high-throughput ab initio computing, offers a viable alternative. This approach has received such widespread attention that a recent competition on the world-wide web (www), Kaggle, for MLaided prediction of NMR spectra 38 saw a participation of 2,700 teams across the world. An earlier proof-ofconcept study had discussed the feasibility of exploiting the local behavior of NMR chemical shifts with ML to achieve transferability to systems that are larger than those used to train the model 39 . That work depended on a cut-off based local version of the Coulomb matrix (CM) descriptor 40 . As for descriptors, successive improvements have been made by projecting the threedimensional (3D) molecular chemical structure into multidimensional tensors 41 , four dimensional hyper-spherical harmonics 42 , or a continuous representation such as the variant smooth overlap of atomic positions-the SOAP descriptor 43,44 . The latest version of SOAP had been used in combination with Gaussian process regression (GPR) to model chemical shifts of 2000 molecular solids with experimentally determined crystal structures collected in the CSD-2K database, achieving a root-meansquares-error (RMSE) of 0.5 ppm/4.3 ppm for 1 H/ 13 C nuclei, respectively 45 . Such accuracies are comparable to that achieved in fragment-based estimations 46 . SOAP combined with the ML approach, kernel ridge regression (KRR), has been successful for predicting 29 Si and 17 O NMR shifts in glassy aluminosilicates in a wide temperature range 47 . The joint descriptor-kernel formalism of Faber, Christensen, Huang, and Lilienfeld (FCHL) uses an integrated Gaussian kernel function accounting for three-body interactions in atomic environment yielding highly accurate results for global molecular properties such as atomization energies 48 . Recently, FCHL-based KRR has been applied to model 1 H, 13 C shifts and Jcoupling constants between these two nuclei for over 75k structures in the CSD 37 . For a test set, which was not part of training, the same study noted mean absolute errors (MAE) of 0.23 ppm / 2.45 ppm / 0.87 Hz (RMSE: 0.35 ppm / 3.88 ppm / 1.39 Hz) for δ 1 H / δ 13 C / 1 J CH , respectively.
In this work, we present gas and (implicit) solvent phase mPW1PW91/6-311+G(2d,p)-level chemical shielding for all atoms in the QM9 dataset 49 comprising 130,831 stable, synthetically feasible small organic molecules with up to 9 heavy atoms C, N, O and Fhenceforth denoted as the QM9-NMR dataset. We apply KRR-ML using training sets drawn from QM9-NMR, benchmark the control settings that govern the prediction accuracy, and rationalize their influence on the performance of large ML models using up to 100k training examples. With converged settings, we provide benchmark learning curves for ML and ∆-ML methods based on three local descriptors-CM, SOAP and FCHL. Finally, we evaluate the prediction transferability of the local ML models-trained only on small moleculesto larger systems using non-trivial benchmark sets that include several drug molecules, a small subset of the GDB17 dataset 50 with molecules comprising 10 to 17 heavy atoms and linear polycyclic aromatic hydrocarbons, the smallest of which is naphthalene-already with one heavy atom more than the largest QM9 molecule.

II. METHODOLOGY
A. Machine Learning of NMR Shielding: Single Kernel and ∆-ML Ansätze Among popular ML frameworks for quantitative modeling of molecular and materials properties, KRR has been one of the most consistent and accurate 51 . In the KRR formalism, for a query entity (molecule or atom), q, a generic property p from a reference (experiment or theory), is estimated as a linear combination of radial basis functions (RBFs a.k.a. kernel functions)-each centered at one training entity. Values of these RBFs are calcu-lated at q, then the distances between q and N training molecules defined via their descriptors d is given as The coefficients, c t , one per training datum, are obtained through ridge-regression by minimizing the least-squares prediction error The size of the kernel matrix is N × N , each element defined in close analogy to the right side of Eq. 2, K ij = k(||d i − d j ||), i and j going over N training elements, with || · || denoting a vector norm. In this work, for the choice of CM and SOAP descriptors, we used the Laplacian kernel depending on an L 1 norm defined as where ω defines the length scale of the exponential RBF. As shown in Ref. 52, optimal solution to Eq. 3 amounts to solving the linear system The purpose of the second term in the r.h.s. of Eq. 3 is apparent in Eq. 4; if the definition of the descriptor does not differentiate any two training entries, then K becomes singular and a unique solution to Eq. 4 can only be found with a non-zero value for the regularization strength, λ. Both ω and λ constitute hyperparameters in the model, that require a cross-validated optimization before out-of-sample predictions. Any non-zero value of λ determined by cross-validation is an indication of the presence of redundant training entries, either due to dataduplication or poor quality of the descriptor. As shown in Ref. 53, in the absence of redundant training entries, λ can be set to zero and the learning problem translates to solving Kc = p ref. . Alternatively, when linear dependencies may be anticipated-due to numerically similar descriptor differences-rendering an off-diagonal element of K to be ≈ 1, a fixed, numerically finite λ = may be used to shift the diagonal elements of K away from 1.0 and the lowest eigenvalue away from 0.0-making K nonsingular.
When dealing with large number of training samples, the selection of ω, the kernel-width or the length scale of the RBF is often accomplished with cross-validated values found for smaller training set sizes. Ref. 53 had discussed how an optimal value of ω is closely coupled to the condition number of K. A nonsingular kernel matrix K, is said to be well-conditioned if, for any matrix norm G, the condition number is small where G is a matrix norm and κ is the condition number with respect to the norm G. If the value of κ(K) is large, say 10 5 , then K and the linear system are ill-conditioned.
A simple definition of G is the spectral norm: Similarly, the spectral norm of the inverse of the kernel matrix is where spectral condition number κ 2 of the kernel matrix K is given by κ 2 (K) = λ max (K)/λ min (K) quantifying the stiffness of the solution. The intuitive meaning of the condition number is the lower bound of the ratio between the relative error in the solution to the relative deviation in an arbitrary right side of Eq. 4.
K is a covariance or a dispersion matrix with all of its off-diagonal elements bound strictly in the closed interval [0, 1] with unit diagonal elements. For a Laplacian kernel, a large value of ω results in increasing the density of the off-diagonal elements around 1-in turn increasing κ 2 . Ref. 53 showed how ω can be estimated independent of the property being modeled by restricting K ij corresponding to the largest descriptor difference, D max ij , to 0.5, resulting in In the present study, we also explore the performances of the choices of ω based on D mean Later we show how these choices are in close agreement with ω opt values found by a scan to minimize the error for a large hold out set. We also discuss how the kernel matrix constructed with ω median opt can be applied to model NMR shieldings from gas and different solvent phases.
The prediction error of an ML model given by Eq. 2 can be unconditionally quenched with increasing training set size for a good choice of the descriptor; however, the exponential nature of the learning rate often necessitates an increase in the model's size by orders of magnitude. While the resulting surge in the computational cost associated with the ML model's execution speed is seldom prohibitive, training with examples of the order of 10 6 places too severe hardware restrictions. When such hardware limit is reached for training, further drop in an ML model's error can be attained by training on the deviation of the property from inexpensive, yet qualitatively accurate baseline values in a ∆-ML fashion 54 .
The ML problem now involves solving for Kc = ∆p. For ∆-ML of NMR chemical shifts exemplified by benzene; the model is trained to predict property from a target-level theory using as inputs atomic coordinates and the property from a baseline theory. The targetline and baseline shifts of benzene were computed at mPW1PW91/6-311+G(2d,p)@B3LYP/6-31G(2df,p) and B3LYP/STO-3G@PM7 levels, respectively.
any new prediction, ML-predicted ∆ is augmented with the baseline value Fig. 1 illustrates ∆-ML for the modeling of NMR shifts with an example molecule. Often, for any given molecule, the determination of minimum energy geometry at the target level incurs a greater computational requirement than that is needed for the estimation of NMR shielding.
In the ∆-ML framework, this problem can be alleviated by using atomic coordinates calculated at the same or a different baseline level for the construction of descriptors. Hence, new predictions can be rapidly made using structural information calculated at the baseline-level.

B. Local Descriptors for Atoms-In-Molecules ML
Formal requirements for a chemical descriptor to be an accurate mathematical representation, mapping molecular similarity and dissimilarity have been discussed before 43,55-60 . Design of structure-based molecular descriptors for ML has drawn inspiration from the success of generic coordinates such as atom-centered symmetry functions 61,62 for mapping out molecular potential energy surfaces. CM, which is a popular structure-based descriptor that is easy to implement is given by 40 , For modeling local properties such as NMR shielding, a cutoff radius, r cut , can be applied to collect the subsets of the CM elements for each atom 39 . While CM is invariant to rotations and translations it is not invariant with respect to the indices of atoms in a molecule. A simple remedy is to diagonalize the CM and build a spectral representation; while the resulting features are index-invariant, they may fail to establish an injective mapping between three dimensional molecular structure and the query property 63 . More robust approaches include either permuting the rows (and columns) of the CM until the row norms are sorted in descending order (row-norm sorted CM) or to collect the elements of CM separately for each atom pair types forming a "bag of bonds" (BoB) 64 . The BoB descriptor has also been extended to include three-body interactions via angles, and many-body terms resulting in highly accurate ML modeling of molecular energetics 65,66 .
Local bonding environment can also be encoded as combination of spherical harmonics, as in the SOAP approach 43,44 . SOAP has been demonstrated as a mathematically sound and one of the better performing descriptors, especially for predicting NMR properties of organic molecular solids 67 . SOAP, by design is local; here, for every atom I with atomic number Z I , a radial distribution function is constructed by summing over all its neighbors, J, decided by a cutoff radius, r cut This is a formally exact representation, which can be approximated to arbitrary precision by projecting the density on a finite set of orthonormal basis functions.
Here, g Z I n is the radial basis function, taken as a Gaussian function, for which the variance is modeled as a hyperparameter. The resulting representation is made rotationally invariant by a power spectrum The coefficients c Z I nlm (r) are defined as an inner product of ρ Z I I (r) with spherical harmonics The computational complexity of the SOAP descriptor can be balanced by limiting the number of radial and angular functions via n max and l max .
The Faber-Christensen-Huang-Lilienfeld (FCHL) 48 formalism describes the kernel function for two query atoms I and J belonging to two query molecules A and B, respectively, as where ∆ [·] refers to the distance between the descriptors A M (I) and A M (J), and k(·) represents the kernel function. For modeling global properties, where each training element is a molecule, rather than an atom, the kernel elements are given by The descriptor is a set of interatomic M -body expansions defined recursively as The individual terms A M are also recursively defined as the product of the term of a lower order term A M −1 and an M − 1 fold summation.
where N is a standard Gaussian function and ξ M is an M -body scaling function.  48. The FCHL approach, although computationally intensive compared to those using CM or SOAP, results in a more accurate and convergent chemical representation, and has consistently performed superior to other approaches.

III. COMPUTATIONAL DETAILS
To use as training data in ML, we collected B3LYP/6-31G(2df,p)-level minimum energy geometries of 134k molecules in the QM9 dataset from Ref. 49. Those structures that have been reported to fragment during the geometry relaxation (3,054 in total) were excluded in this study. NMR shielding tensors of selected stable nuclei were calculated at the mPW1PW91/6-311+G(2d,p)-level in a single-point fashion within the GIAO framework [68][69][70] using Gaussian-16 suite of programs 71 . In all DFT calculations, integration grid was set to Ultrafine with a VeryTight SCF threshold. To use as a baseline property in ∆-ML, we used NMR shielding calculated at the B3LYP/STO-3G level with geometries optimized at the PM7 level, the latter done with the code MOPAC 72 . 13 C anisotropic shielding tensors, σ, were converted to 13 C chemical shifts, δ, using a reference value for σ corresponding to that of tetramethylsilane (TMS), which was calculated in gas phase to be 186.97 ppm. We have also computed shielding tensors for the entire 131k set with implicit modeling of the solventscarbon tetrachloride, tetrahydrofuran, acetone, dimethyl sulfoxide, and methanol-with the polarizable continuum model (PCM) 73 . This was achieved by invoking SCRF in Gaussian-16 and specifying the solvent name and retaining default settings.
We have retained the same settings-mPW1PW91/6-311+G(2d,p)@B3LYP/6-31G(2df,p)-to calculate the NMR shielding of benchmark molecules that we selected for validating QM9-based ML models. Initial unrelaxed structures of the linear PAH molecules studied here have been taken from Ref. 74. From the GDB17 dataset, we have randomly selected 8 subsets of molecules, each with 25 molecules comprising 10-17 heavy atoms (200 in total). Further, we collected drug molecules present in the GDB17 dataset identified in Refs. 50, 75-77. In addition, we also collected 12 somewhat larger drug molecules from Ref. 78. The corresponding SMILES strings of all these 'validation' molecules, when available, were converted to initial Cartesian coordinates using the program Openbabel 79 . Initial Cartesian coordinates of the 12 large drug molecules were created using the program Avogadro 80 . All molecules have been subjected to preliminary geometry relaxation performed with the force field MMFF94 81 . We used the default settings in DScribe 82 and QML 83 to calculate the SOAP descriptor and the FCHL kernel matrix, respectively. All ML calculations have been performed using codes written in Fortran90 with interfaces to the SCALAPACK 84 numerical library.

A. QM9-NMR dataset
For a systematic exploration of NMR properties across the QM9 CCS, QM9-NMR dataset was created as per the procedures outlined in Section III. This dataset consists of data for stable 130,831 molecules amounting to 1,208,486 (1.3 M), 831,925 (832 k), 132,498 (132 k), 183,265 (183 k), 3,036 (3 k), NMR values for H, C, N, O, and F nuclei, respectively. DFT-level NMR shielding of these elements (Fig. 2a) demonstrate the expected range of values. In case of H, the most deshielded nucleus corresponds to the one from the cationic ammonium ion (encountered in zwitterionic molecules), while the most shielded proton belongs to a highly-strained secondary amine bonded to N. Methane offers the most shielded environment for 13 C in QM9, whereas the most deshielded C features in a highly strained multiply-fusedring molecule. Similarly, for N, the most shielded nuclei comes from a strained tertiary amine, whereas for O it features in a strained ring. Most deshielded N and O nuclei belong to a zwitterionic molecule.
Besides extrema, Fig. 2a also highlights the chemical diversity of the QM9 dataset. For C and H atoms, majority of the NMR shielding parameters come from C(sp 3 )-H bonds, indicating QM9 to largely comprise saturated organic molecules. Unsaturated molecules form a relatively smaller fraction of QM9 as can be seen in its shielding distribution function (between 0-75 ppm for C, Fig. 2b). N atom distribution shows two sharp distribution peaks at about 200 ppm and −35 ppm belonging to primary amine and cyano groups, respectively, which could be considered as among the most synthetically feasible Ncontaining functional groups. Most frequent O atoms consist of ether linkages, while F atoms show a characteristic broad distribution around 250 ppm.
NMR shielding values for the entire dataset have also been calculated with continuum models of five commonly used polar and non-polar organic solvents: acetone, CCl 4 , DMSO, methanol, and THF. Formally, a polar solvent will result in a more deshielded environment. However, the influence of the solvent is nonuniform across various C atoms in a molecule depending on the local environment of an atom in the molecule. Other effects such as hydrogen-bonding, halogen bonding may further influence the chemistry of the molecule resulting in unexpected chemical shifts. Thus, it is necessary to build a database comprising NMR shielding tensors calculated at various solvent media. Here as a first step, we computed the shielding values of the molecules in different solvents under a PCM framework. For a better description of the solvent environment it is essential to go beyond continuum modeling by using micro-solvation models that account for explicit solute-solvent interactions. The solvents were chosen to represent diverse environments: non-polar, polar aprotic and polar protic. For any given 13 C nucleus, the spread in the shielding values due to the choice of the medium is at the most ±4 ppm (see Fig. 3). Hence, for ML predictions to differentiate the results from various phases, it is necessary that the models' prediction accuracy is much less than ±4 ppm.
QM9-NMR dataset also contains B3LYP/STO-3G NMR shielding constants for all the 130,831 molecules. Although the current ML study concerns itself with NMR shielding of the C-atom, the QM9-NMR dataset can be used to model other nuclei as well.
To facilitate such and other ab initio benchmark efforts, the entire QM9-NMR dataset, comprising gas and solvent phase results, is now a part of the openly accessible MolDis big data analytics platform 85 , http://moldis.tifrh.res.in:3000/QM9NMR.

B. AIM-ML modeling of NMR Shielding
Following the generation of the QM9-NMR dataset with 812k 13 C nuclei, we have selected a random set of 100k entries for training the ML models. Further, a separate subset of 50k nuclei-not overlapping with the 100k training entries-was kept for validating the ML models. Additionally, we have compared the distribution of the training and validation subsets with the total set, and found the normalized density distributions to be similar (see Fig. S1). Therefore, we believe that the ML models based on large training sets presented in this work do not suffer from a selection bias.
All hyperparameters employed in the ML models have been optimized via cross validation within the training set. Through a logarithmic grid search, a value of 10 −3 for λ was considered appropriate, and was kept constant throughout. NMR shifts are a local property, i.e., for any given atom, only neighboring atoms within a certain 'sphere of influence' contribute towards the chemical shift. Hence, the effective radius of such sphere, r cut , has to be determined empirically. Fig. S2 lists the prediction errors for different descriptors at various cutoff values, in gas and solvent phases. It highlights one clear distinguishing feature between CM/SOAP descriptors, and FCHL; as the cutoff is increased to include more information in the kernel, CM's and SOAP's accuracies show best performances at about 2.3 and 2.0Å respectively, beyond which the accuracy drops. FCHL, on the other hand, due to the presence of higher-order damping terms, shows a convergent behaviour with the accuracy saturating at 4.0Å.
The kernel width, ω, was chosen separately for each descriptor; optimal value of this parameter for a given descriptor is inherently coupled to the dimensionality, completeness and the metric of the "feature spaces". As the numerical values of the descriptor differences need not be same across descriptor definitions (Fig. S3), a single ω cannot scale two different kernels with same efficiency. This fact also sheds light on a heuristic approach for determining ω-from the descriptor differences, D ij , independent of the property being modeled. We have tested the performance of optimal ω derived using Eq. 9 and found the values derived using the median of {D ij } to perform better than those based on the maximal (Eq. 8) or mean values (Eq. 9), see Table S1. For CM and SOAP descriptors, we found these values to be 422. 78 and 18.85, respectively that we have adapted throughout this study. For these two choices of descriptors, we have also performed a cross validation to find out the best values of ω coinciding with ω median opt (vertical lines in Fig. 4). Due to the fact that the FCHL implementation in QML does not provide D ij values, a grid search showed the best ω to be 0.3. Fig. S4 features the distribution of the kernel matrix elements based on 10k training examples for all three descriptors. While the distributions for CM and SOAP are rather univariate, FCHL's K ij values show a multivariate distribution, implying the latter model to be more sensitive to the choice of the kernel width. After determining the most appropriate hyperparameters for various choices of descriptors, we collected 10 training sets of sizes: 100, 200, 500, 1k, 2k, 5k, 10k, 20k, 50k, and 100k. We ensured that each smaller dataset is a subset of a larger one making the learning monotonous. We solved the linear equations of ML (Eq. 4) using Cholesky decomposition, and the trained machine was used to predict NMR shifts of 50k out-of-sample validation set. Mean absolute error (in ppm) for these 50k predictions was admitted as the sole performance metric of the training accuracy (see Fig. 5). Fig. 5 also shows the performances of ∆-ML carried out using B3LYP/STO-3G NMR parameters at PM7 structures.
Overall, one notes from Fig. 5 that for all descriptors ∆-ML models converging by more than an order of magnitude faster (in terms of training set size) than direct ML ones. From Fig. 5 it is evident that among all three descriptors, FCHL delivers the best performance with an average prediction error of < 2 ppm; the error drops below 1.4 ppm for ∆-ML modeling. However, it may be  The origin of accuracy limiting factors in ML was further investigated by categorizing errors based on the NMR shielding range and the representation of the categorized region in the training dataset (Fig. S6). We found that the errors were not uniformly distributed and for certain shielding constant ranges, mean and variance of predictions were more erratic. The anomalous error in the -25 to 25 ppm region can be explained by the combined effects of i) under-representation of the aromatic or unsaturated systems in QM9, and ii) larger chemical diversity in the shifts of the unsaturated regions. In the 150-175 ppm region, where we found majority of C shielding values of the QM9-NMR dataset to lie, the prediction errors were rather low and less spread out. We note that as more data is added in the erroneous regions (such as the -25 to 25 ppm region), the accuracy of the NMR machine improves.
We have probed if the baseline 13 C shielding values computed in gas phase can be utilized also for modeling DFT-level values in various solvents. While it is possible to simultaneously model on multiple property vectors by feeding in a rectangular matrix-row of column vectors-to the Cholesky procedure, the cost of training can be slightly minimized by inverting the kernel matrix once and multiplied with any arbitrary property vector to get new training coefficients 53 . Table I demonstrates the versatility of this approach. Inverted FCHL-100k kernel instantly yielded trained machines for all solvents, with sub 2 ppm accuracy. Going from the gas phase values to DMSO ( = 46.8), we note the model's performance to deteriorate only by 0.11 ppm. The magnitude of NMR chemical shift/shielding of a 13 C nucleus in a molecule is governed by its local environment, experimentally this aids in the prediction of molecular structural features based on the chemical shifts. The inherent locality of this property implicitly suggests the shielding effect to drop with increasing distance. Subsequently, the information gained from a local moeity of a small test molecule can be reasonably transferred to the same local environment in a large molecule, provided the moeity is not perturbed by chemical interactions alien to the training molecule.
Using the 100k ML and ∆−ML machines, we investigate how well these properties can be estimated for larger molecules. In Fig. 6, we explore our ML and ∆-ML models' performances across GDBn datasets (where n = 10, 11 . . . , 17) using 25 randomly chosen molecules per n. Each of these 200 molecules were relaxed at the B3LYP/6-31G(2df ,p) level with reference NMR shielding tensors calculated at the mPW1PW91/6-311+G(2d,p) level (see Section III).
As expected, ∆-ML generally improves upon ML consistently yielding lower MAE and RMSD. Further, we note maximum average error per molecule (MAX) to overall improve with ∆-ML except for GDB15 and GDB11 sets. This is possibly due to systems showing chemical interactions alien to GDB9 . In Fig. 6, ML provides an MAE of < 4 ppm across the datasets while it is usually below 3 ppm for ∆-ML. Arguably, the choice of 25 random molecules is not an accurate representation of the entire dataset in question and hence trends instrinsic to these subsets are not transferable across sets. Still, a general observation can be made: increasing number of heavy atoms introduces long-range influences on moieties rendering our machines somewhat inefficient-MAE of both ML and ∆-ML generally increase as we explore larger systems. However, we find the accuracy trends to be somewhat inconsistent across GDB10-GDB17 sets. For instance, going from GDB11/GDB15 to GDB12/GDB17, we see a modest improvement in MAE and RMSD. Such irregularities can be attributed to sampling bias in the molecular selection. Overall, the success of our ML models based on GDB9 molecules to predict the NMR properties of GDB10 to GDB17 sets is nonnegligible, suggesting that future endeavors would benefit from including in the training set, systems with a wider chemical diversity.

D. Validation of AIM-ML for Drugs
The graphical framework upon which GDB datasets were developed, allows one to explore CCS in an unbiased fashion. GDBn universe is currently limited to molecules containing up to 17 heavy atoms amounting to 166.4 billion unique molecules (stereoisomers excluded). In this set, Reymond et al. 50,[75][76][77] have encountered many important small drug molecules and discussed their struc-tural properties. Interestingly, synthesis efforts in drug design begin by exploring previously known candidate molecules and by introducing modifications. For this reason, the local environment of the 13 C nuclei encountered in drugs are expected to be found in the QM9 small molecules. Thus, the machines developed in this work should be capable of predicting the NMR shielding in common drug molecules with a reasonable degree of accuracy.
In Fig. 7, we present 40 commonly used drug molecules previously observed in GDB datasets 50,[75][76][77] . We tested our machines' efficiency in predicting 13 C NMR shielding values for "40 drugs", and noted their error metrics across direct and ∆-ML machines for each molecule. Although the MAE highlights the overall efficiency of a method, for molecule-specific analysis, it is imperative to verify whether the trends in atom-specific properties are preserved-a fact desirable when assigning shifts to moeities present in unknown molecules 86 . Hence, we collect MAE and Spearman rank-correlation (ρ) for these drugs from both models in parenthesis (see Fig. 7). As Spearman coefficients are sensitive to numerical precision, we have utilized a modified version by mapping the stick spectrum of the NMR shielding by a step function of height 1 and a width of 1 ppm. Such coarsening provides a better comparison of NMR shielding values with in a threshold limit. The largest error encountered in this set is for Desflurane (C 3 H 2 F 6 O), a GDB10 molecule, with an average deviation of 14.8 ppm for ∆-ML. The unusually large error could be attributed to the presence of di-and tri-fluoro methyl groups that are under-represented in the training set. The second largest MAE noted is for Diethylcarbamazine for which ∆-ML model's error stands at 5.5 ppm, stemming from the deficiencies of the baseline data. We note a total of 25 and 5 systems to show MAE higher than 3.0 ppm in ML and ∆-ML modeling, respectively. Barring 6 systems, ∆-ML improves upon direct ML's MAE, a trend previously noted in Fig. 5 and Fig. 6; not only does it improve MAE, but for 14 systems it also improves ρ. Evidently, ∆-ML modeling is required to reach semi-quantitative predictions due to the qualitative accuracy of the baseline B3LYP/STO-3G results.
Having probed the transferability of the ML models to drug molecules present in the GDB17 universe, we embarked upon predicting the same for larger drug molecules containing several heavy atoms. In Fig. 8, we present 12 such important molecules along with their error metrics and mPW1PW91/6-311+G(2d,p) chemical shifts. The deviations of the same in ML and ∆-ML predicted values for every C atom are also reported. As expected, ∆-ML outperforms direct-ML consistently across all 12 molecules. The highest deviations is noted for Morphine with ∆-ML presenting an MAE > 3.0 ppm. A closer inspection of their structures reveal unique and large chemically relevant moeities that are underrepresented in the QM9 set. However, for other systems the local nature of the NMR shifts does help in improved prediction. Interestingly, while the benzenoid moeity is common to 11 of the 12 molecules ML predictions show a spread in the property. For instance, in Fluconazole, for C atoms connected to F, the overall accuracy for the ring decreases, an observation common to -CH 3 and -OH substitutions in Lidocaine and Morphine. Phenytoin's C nuclei for both rings have been modeled inaccurately-due to deficiencies in the baseline PM7 structures. Others present extended conjugation, which is not adequately captured in the ML-models that use small cutoff radii for the descriptors.
FIG. 8. Accuracies of ML and ∆-ML predicted mPW1PW91/6-311+G(2d,p)-level 13 C chemical shifts of 12 large drug molecules. Mean absolute error (in ppm) and Spearman rank correlation coefficient (ρ) are collected in parenthesis. Reference DFT results are provided next to C atoms along the deviations of ML and ∆-ML predictions from the DFT values. For clarity, unsigned deviations less than 2 ppm are shown in blue while larger ones are shown in red.

E. Challenging cases for AIM-ML
As noted in Section IV C and Section IV D, our ML models will fail when encountering chemical environments not related to those in the training set-common scenario being a double bond network with electron delocalization. For a series of linear polycyclic aromatic hydrocarbons (PAHs) we can expect delocalization to increase with increasing number of benzenoid rings. This presents a unique challenge as one does not necessarily know the least number of systems that adequately represent the entire PAH CCS for ML endeavors, since the addition of a benzenoid ring influences the shifts of all C atoms in the system.
To understand this problem, we collected linear PAHs from the PAH77 dataset 74 and analyzed the performance of ML and ∆-ML models (see Fig. 9). Even the smallest system, naphthalene has one atom more than the QM9limit. The edge atoms are similar to that encountered in benzene derivatives, so these have been modeled well. Although, as before, ∆-ML improves upon ML consistently, deviation between ML and DFT rises with increasing system size or the degree of delocalization. Particularly, the interstitial C atoms show the largest deviations with increasing number of benzenoid rings. This can be understood qualitatively when we note that these C atoms are the ones mostly influenced by delocalization. The current approach does not adequately model the chemical shifts of these systems with extended delocalization. Yet, ∆-ML picks up on this trend from baseline values and reduces the errors significantly for most atoms, in- terstitial atoms still showing large deviations. Unfortunately, the magnitude of errors imply that neither ML nor ∆-ML can unambiguously distinguish between 13 C centers in PAHs. Future endeavors can benefit significantly from descriptors capable of accurately modeling extended π-conjugation or by enriching the training data with representative examples.

V. CONCLUSION
We present the QM9-NMR dataset that augments the QM9 set 49 -containing DFT-level structures and properties of 134k organic molecules-with NMR shielding values computed at the mPW1PW91/6-311+G(2d,p)-level for about 2.4 million atoms constituting the molecules in this dataset. The choice of the DFT method was reinforced by its previously established advantageous costaccuracy trade-off. The presented dataset may be further extended by including J-coupling between 1 H and other nuclei so that the diverse array of nuclei and properties present in QM9-NMR may aid seamless data-mining or ML studies in future. The impressive size of the dataset compelled us to explore solvent-phase values using an implicit solvation model, which however may not be adequate to describe effects due to the solute-solvent explicit interactions as addressed in Ref. 87. We focus on predicting the anisotropic shielding values of 13 C nuclei in QM9 entries through KRR-ML models with Laplacian kernels. Upon benchmarking the performance of ML models across 3 descriptors: CM, SOAP and FCHL, we note a monotonous improvement in learning with increasing training set size up to 100k, with respect to predictions for a 50k hold-out set, where an FCHL-based (without alchemical corrections) ML-model showed the least MAE of 1.88 ppm. ∆-ML, using PM7 geometries and B3LYP/STO-3G baseline values, improves upon this accuracy to yield an MAE of 1.36 ppm. To the best of our knowledge, this is an improvement over the current record in out-of-sample prediction error in datadriven 13 C nuclei NMR shielding modeling 37 . SOAPbased ML model's under-performance could be speculated to the use of Laplacian kernel-based KRR when GPR have shown to be more effective. As expected, the performance drops with increasing diversity of validation molecules but the target being of local nature benefits from our models and aids in the prediction of 13 C shielding in molecules much larger than those in training set. Such a trend has been noted during the validation of 13 C shielding for a random subset of 25 molecules collected from GDB10 to GDB17 sets. Although, the prediction accuracy decreased with increasing molecular sizes, the MAE reported across datasets remained within 4.0 ppm for ML and 3.0 ppm for ∆-ML. When predicting 13 C shielding for 2 datasets of drug molecules-one containing 40 drug molecules found in the GDB17 universe, and the other containing 12 common drug molecules with 17 or more heavy atoms-∆-ML improves upon ML's performance with the MAE decreasing from 3.7/4.2 ppm to 2.3/2.6 ppm for 40-drug/12-drug datasets, respectively. However, extended delocalization in linear PAHs proves challenging because of the irrelevance of the small cutoff values decided based on cross validation in molecules lacking extended conjugations.
While we do not anticipate the deficiency in our models to fade when using other local descriptors 88 , augmenting the training set with systems displaying extended conjugation such as PAHs, fullerenes etc., or improving upon the current baseline for ∆-ML should lead to better accuracies. This opens exciting possibilities of ML-guided analysis into nucleus independent chemical shifts complimenting the latest tight-binding model for PAHs 89 . Although our 100k training set is an adequate representation of the QM9 dataset (see Fig. S1), adaptive sampling method employed in Ref. 37 might be useful when using smaller training sets. Given the locality of the shielding property, it may be helpful to employ different machines trained on sp, sp 2 and sp 3 C-to account for systematic deviations in each groups. Such separate ML modeling had previously resulted in superior performance for predicting the electronic spectra of GDB8 molecules 90 . If high-fidelity wavefunction theories-such as MP2 and CCSD(T)-are explored for subsets in QM9 containing up to 6 or 7 heavy atoms, then suitably calibrated DFT results based on these references will improve upon the current results for comparison with experiments. Finally, there is always the scope to improve the QM9-NMR dataset by estimating the effects due to improving the geometries of the QM9 molecules when going from B3LYP/6-31G(2df,p) to ωB97XD with a triple-zeta quality basis set.

VI. ACKNOWLEDGMENTS
The authors thank Vipin Agarwal, Kaustubh R. Mote and O. Anatole von Lilienfeld for fruitful discussions. This project was funded by intramural funds at TIFR Hyderabad from the Department of Atomic Energy (DAE). All calculations have been performed using the Helios computer cluster, which is an integral part of the MolDis Big Data facility, TIFR Hyderabad (https://moldis.tifrh.res.in/).

VII. DATA AVAILABILITY
The data that support the findings of this study are openly available in the MolDis repository, http://moldis.tifrh.res.in:3000/QM9NMR.
Additional information are available in the supplementary material.