Bridging the gap between high-level quantum chemical methods and deep learning models

Supervised deep learning (DL) models are becoming ubiquitous in computational chemistry because they can efficiently learn complex input-output relationships and predict chemical properties at a cost significantly lower than methods based on quantum mechanics. The central challenge in many DL applications is the need to invest considerable computational resources in generating large ( N>1×105 ) training sets such that the resulting DL model can be generalized reliably to unseen systems. The lack of better alternatives has encouraged the use of low-cost and relatively inaccurate density-functional theory (DFT) methods to generate training data, leading to DL models that lack accuracy and reliability. In this article, we describe a robust and easily implemented approach based on property-specific atom-centered potentials (ACPs) that resolves this central challenge in DL model development. ACPs are one-electron potentials that are applied in combination with a computationally inexpensive but inaccurate quantum mechanical method (e.g. double-ζ DFT) and fitted against relatively few high-level data ( N≈1×103 – 1×104 ), possibly obtained from the literature. The resulting ACP-corrected methods retain the low cost of the double-ζ DFT approach, while generating high-level-quality data in unseen systems for the specific property for which they were designed. With this approach, we demonstrate that ACPs can be used as an intermediate method between high-level approaches and DL model development, enabling the calculation of large and accurate DL training sets for the chemical property of interest. We demonstrate the effectiveness of the proposed approach by predicting bond dissociation enthalpies, reaction barrier heights, and reaction energies with chemical accuracy at a computational cost lower than the DFT methods routinely used for DL training data set generation.


Introduction
The application of machine learning (ML) and deep learning (DL) techniques in chemistry has led to the rapid development of supervised models that show great promise for the accurate prediction of important thermochemical properties.To provide just a few examples, recent work by St John et al demonstrated the power of a DL model to predict bond dissociation enthalpies (BDEs) to within 0.58 kcal mol −1 of density functional theory (DFT) reference values [1].Likewise, Roitberg's group developed a general-purpose neural network that approaches the so-called 'gold-standard' CCSD(T)/CBS accuracy (i.e.coupled cluster with single and double excitations plus perturbative triples (CCSD(T)) at the complete basis set (CBS) limit) for reaction thermochemistry, isomerization, and molecular torsions using transfer learning on DFT computed data [2].Grambow et al developed a DL model that is able to predict barrier heights of organic reactions within 1.7 kcal mol −1 of those determined using DFT [3].DL models such as these promise to revolutionize a breadth of scientific endeavours ranging from the development of new functional materials to catalysts and pharmaceutical discovery [4].
A central challenge in the development of supervised DL models for chemistry is the need for a large set of accurate training data (typically hundreds of thousands of data points or more) for the chemical properties of interest.Large training sets enable developing models that can generalize well to diverse chemical space (which explodes combinatorially with increasing molecular size and types of elements) and can be used to obtain high-quality predictions.For example, Sun et al [5] reported that the performance of DL models for computer vision tasks improved with the number of training data.In the studies cited above, the number of data points used for the DL model development were 276 717 (at the M06-2X/def2-TZVP level [1]), 5.2 million (ωB97X/6-31G * for the ANI-1x DFT model [2]), and 57 000 (B97-D3/def2-mSVP and ωB97X-D3/def2-TZVP [3]).Although these works highlight the potential of DL for chemistry, such developments have been severely limited by the size and quality of the training data available.
Clearly, the bottleneck in the production of databases large enough to support the development of new and effective DL models for chemistry is the high cost of using accurate quantum mechanical methods in combination with the large number of data points required for DL training [2].This is because the computational cost of accurate wavefunction-theory methods like CCSD(T)/CBS scale steeply with system size, which places severe limits on throughput (some illustrative examples of the cost of accurate wavefunction methods are provided in sections 3.1 and 3.2).DFT methods, on the other hand, scale more mildly and further increases in data production rates using DFT can be achieved by reducing the size of the basis sets employed.However, the price that is paid by making use of basis-set-incomplete DFT methods, like the ones used in the studies cited above, is a substantial loss in the accuracy of the predicted quantities, as illustrated in table 1 for a few selected properties.This, in turn, constitutes a significant limit on the accuracy of the predictions made by the eventual DL models trained on data generated with incomplete-basis-set DFT methods.
Regarding the prediction of thermochemical properties specifically, while density functional approximations continue to improve, there remains a wide gulf between their accuracy and that of high-level methods like CCSD(T)/CBS.In Yu et al's analysis of the performance of 49 density functionals, the best results were obtained using the MN15 density functional with triple-and quadruple-ζ basis sets [6], but its performance is still poor relative to high-level wavefunction methods.For instance, MN15 predicts hydrocarbon thermochemistry with a mean absolute error (MAE) of more than 4.5 kcal mol −1 , far too large to support the development of new DL models that can be used in any application for which chemical accuracy (ca. 1 kcal mol −1 ) is required.This illustrates the unaddressed 'catch-22' associated with the development of DL models for thermochemistry.On the one hand, the development of DL models that predict accurate thermochemical properties requires 10 5 -10 6 accurate data points obtained using very costly methods like CCSD(T)/CBS, and are therefore limited by the throughput of training data generation.On the other hand, a large number of training data points can be generated on much shorter timescales using computationally inexpensive DFT/DZ methods but the resulting DL models trained on these data generally cannot produce chemically accurate thermochemical properties.
Here, we outline a robust quantum chemical approach based on atom-centered potentials (ACPs) that bridges the accuracy gap between small-basis-set DFT methods and CCSD(T)/CBS.ACP-corrected methods allow calculating CCSD(T)/CBS quality data for specific properties at a computational cost that is on par with small-basis-set DFT, and therefore constitute an ideal solution to the problem with training data generation in DL model development.Crucially, far fewer fitting data (N = 100-1000) are required to generate robust and accurate ACPs than DL models.This means that benchmark sets available in the literature, which often have sizes in this range, can be used directly for ACP development.We demonstrate the efficacy of this approach through application to three properties that are central to chemistry (bond energies, reaction barrier heights, and reaction energies) using data from two of our recently published benchmark sets (BSE49 [7] for bond separation energies and BH9 [8] for reaction energies and barrier heights).
Previous efforts in ACP development were centered on the generation of general-purpose ACPs for Hartree-Fock (HF) and DFT methods with large fitting data sets (≈10 5 ) for non-covalent interactions and thermochemistry [9][10][11][12][13].In contrast with our past work, this article focuses on demonstrating how ACPs developed for specific properties (BDEs, reaction energies, barrier heights, . ..) can be fitted with minimal high-level data, and how the resulting property-specific ACPs can be used to generate large training sets for the development of DL models, thus combining the accuracy of the high-level method used to develop the ACPs with the low cost of a DFT/DZ method.Because the accuracy of the resulting DL model is only as good as that of its training data, the proposed approach enables the creation of highly accurate DL models by using ACPs as intermediaries for the construction of large and accurate DL model training sets.

Methods
ACP corrections are similar in spirit to the Hubbard U approach in conventional DFT ('DFT+U'), which was originally developed to address shortcomings in the description of the band gaps of Mott insulators and introduces energy corrections based on orbital occupations [14].The development of ACPs also bears a strong resemblance to an ML technique, sometimes described as ∆-ML [15][16][17] or ∆-DFT [18], where the goal is to reduce the difference between a specific method and a suitable target method for a particular chemical property.ACPs are one-electron operators that have the same form as effective core potentials (ECPs), which have been used for decades in molecular quantum chemistry programs to reduce the number of electrons that must be explicitly treated and incorporate relativistic effects.The general form of an ACP is [19]: where U local is a local potential and U l are potentials associated with the angular momenta (l) of the occupied orbitals of each atom (A) for which the ACP is developed.The value of l max is taken to be one larger than the maximum l of the occupied atomic orbitals; this is the choice used in previous works [9][10][11][12][13].The U l potentials are projected onto orbitals of specific angular momentum via the |l⟩⟨l| operators.Both potentials, U l and U local , have the form: where the sum is indexed from i to some arbitrary number, c i is a coefficient, n i is the power of r, and ζ i is an exponent.The value of n i is fixed at 2 in this work because it leads to simple exponential ACP terms; this choice is also common in ECP development [19].The V A ACP potential introduces a correction to the energy obtained from the wavefunction of the underlying quantum mechanical method by adding this operator to the Hamiltonian.Unlike ECPs, ACPs do not replace any electrons.Instead, ACPs produce an energy correction that is sensitive to the chemical environment, similar to the Hubbard U correction.The energy corrections obtained with ACPs reflect changes that occur in molecular systems, for example, with chemical group substitution or over the course of chemical reactions.The energy corrections generated by applying the ACPs are derived from the molecular wavefunction and so ACPs do not fail catastrophically and unpredictably, beyond what might be expected from the underlying method.Importantly, as demonstrated below, the computational cost associated with applying ACPs is close to that of the underlying method, provided that ECPs are efficiently implemented in the computational package with which they are used.
The general workflow for ACP development is illustrated in figure 1.It is necessary to select the atoms for which ACPs are to be developed, the range of exponents used in equation ( 2), and the underlying quantum mechanical method/basis set.A database of chemical properties for which the ACPs will be developed is assembled.ACPs are developed by optimizing the terms associated with a preselected range of values in equation ( 2), that is, by minimizing the difference between the reference value of a property and the value of the property obtained using the chosen method and basis set.Because the ACP contribution to the target molecular properties is linearly dependent on the ACP coefficients, the optimization of the ACP coefficients (c α in figure 1) is done by applying the least-absolute shrinkage and selection operator (LASSO) method [20,21].The LASSO method carries out a linear least-squares regression of the ACP coefficients (c i ) constrained by the one-norm of the coefficient vector: This serves the dual purpose of limiting the magnitude of the ACP coefficients as well as reducing the number of terms in the ACP.Limiting the magnitude of the ACP coefficients is important: If the coefficients are too large, a substantial perturbation is introduced in the ground-state molecular wavefunction.When the ACPs introduce large perturbations in the wavefunction, the performance of the ACP in self-consistent calculations diverges considerably from the ACP linear model predictions.The latter approach is our preferred method for developing ACPs because it is orders of magnitude faster than generating ACPs self-consistently.The deviation between the self-consitent ACP energy correction and the linear model prediction is called non-linearity error [22].Hence, in our procedure, the LASSO constraint (λ) is determined not by cross-validation or any similar method, but by comparing the predictions of the ACP linear model, which assumes the ground-state wavefunction from the uncorrected method is unperturbed, to the predictions of the ACP when it is used in a self-consistent calculation.A detailed description of the ACP development process can be found in [22].The ACPs generated by the procedure described above are used in the Gaussian16 [23] program by including them directly in the input files.The ACPs are read, along with the basis sets, using the 'GenECP' keyword.
In this work, the ωB97XD functional [24] was used in combination with Jensen's pcseg-1 double-ζ basis set [25] for all ACP development.Our choice is motivated by the good performance of ωB97XD for reaction energies and barrier heights [8,26,27], and by cursory calculations on the BSE49 and BH9 sets with various functional and basis set combinations, as well as our own experience that ωB97XD performs well across a range of chemical properties.The choice of this particular functional is not very critical because ACP improves performance regardless of the choice of functional, basis set, and target molecular property [9][10][11].The pcseg-1 basis set [28] was chosen because it was optimized for DFT calculations and is computationally efficient.Gaussian16 [23] was used for all the calculations in this work, except for the calculations using 3c-corrected methods, for which ORCA (version 5.0.4) was used [29].Bond dissociation energy calculations used full optimization and unscaled frequencies, and all equilibrium structures were verified as minima.For reaction energies and barrier heights, single-point calculations were carried out at the BH9 set structures.All Gaussian calculations were run with the 'ultrafine' integration grid.The acpdb program [30] was used to assemble the database and develop the ACPs.

BDEs
BDEs (∆H for the process AB −→ A • + B • ) are a central property in chemistry used to predict and understand reaction outcomes.St John et al recently developed a DL model to predict homolytic BDEs in organic molecules containing the H, C, N, and O atoms.Their DL model was trained on more than 2.7 × 10 5 unique BDEs calculated using the M06-2X [31] density functional with the def2-TZVP [32,33] basis set, with 1.06 × 10 6 parameters in the final model.The resulting DL model, which has been incorporated into the online BDE tool called ALFABET [34], predicts BDEs for molecules containing the target atoms to within a MAE of the M06-2X/def2-TZVP values of 0.58 kcal mol −1 .St John et al's model outperforms a previously developed DL model that was trained on ca.1.2 × 10 4 DFT calculated BDEs by a factor of five [35].The improved performance achieved by ALFABET underscores the importance of using very large data sets for the development of DL models.Despite ALFABET's excellent performance, the M06-2X/def2-TZVP method used to generate the training data has shortcomings that limit the efficacy of the model.For instance, we found an MAE of 2.2 kcal mol −1 in the bond separation energies (BSEs-bond energies that do not include zero-point energy and enthalpy corrections) predicted by M06-2X/def2-TZVP for the BSE49 data set [7] (see table 1), which contains 4374 BSEs with (RO)CBS-QB3 [36] reference values.Almost 47% of the BSEs predicted by M06-2X/def2-TZVP have errors outside the 2.2 kcal mol −1 range, and 245 BSEs have errors larger than ±5 kcal mol −1 .
To illustrate the usefulness of ACPs in studies similar to St John et al's, we developed property-specific ACPs designed to predict accurate BSEs/BDEs for molecules containing the following 10 elements: H, B-F, and Si-Cl.As noted above, we selected for this purpose the ωB97XD functional [24], along with the pcseg-1 double-ζ basis set of Jensen [25].The use of ωB97XD/pcseg-1 for the prediction of BSEs/BDEs provides a six-to eight-fold increase in throughput relative to the use of M06-2X/def2-TZVP used in the development of ALFABET and achieves sub-kcal/mol (chemical) accuracy (see table 1), enabling the creation of large and accurate DL training sets for BSEs/BDEs using the ACP-corrected method.
To demonstrate how few training data are required to develop an accurate ACP for this method, we carried out the following experiment.First, we selected 100 BSEs randomly from the BSE49 set (4374 data in total) and generated the corresponding optimal ACP using this 100-point training set.Then, we assessed the performance of the resulting ACP-corrected method on the remaining 4274 data (the validation set), and on the whole BSE49 set.Then, we repeated the same process with 200 randomly selected points in the training set and a validation set containing the remaining 4174 points.We increased the size of the training set in this way, always in steps of 100 and always using randomly selected data, until the complete BSE49 set was included in the training set.The MAEs for the training and validation sets as well as for the whole BSE49 set are shown in figure 2. Repeating the same test several times with different randomly selected training and validation sets does not alter the results significantly.
The figure demonstrates a number of important features.First, with the use of as few as 200 fitting data points, the MAEs of the predicted BSEs decrease from ca. 4 kcal mol −1 for the uncorrected ωB97XD/pcseg-1 method to about 1 kcal mol −1 (i.e. chemical accuracy) with ACPs, or less than half the MAE obtained with M06-2X/def2-TZVP for the BSE49 set.This result indicates that very few fitting data are required to develop ACPs that can significantly improve the performance of ωB97XD/pcseg-1 for the prediction of accurate BSEs over a broad range of systems.With 1000 fitting points (less than 25% of the BSE49 set and almost 300 times less data than used in [1]), the improvement brought about by ACPs effectively converges to an MAE of about 0.7 kcal mol −1 , which is nearly the value obtained by using the full BSE49 set for ACP development.In addition, as few as 100 data points randomly selected from the BSE49 set are sufficient to produce an ACP that gives an MAE of ca.1.5 kcal mol −1 , well below that of M06-2X/def2-TZVP and more than a factor of three lower than uncorrected ωB97XD/pcseg-1.This indicates that ACPs generated using small data sets maintain their performance for bond breaking reactions not present in the fitting set.We used ACPs fitted with 1000 randomly selected data points for BSE prediction in the rest of this work.A LASSO constraint of λ = 25 a.u. was chosen based on the examination of non-linearity error from a set of ACPs generated with different constraints.The final ACP contains 73 terms out of the possible 290 terms calculated.Figure 3 provides a visual presentation of the performance of ωB97XD/pcseg-1-ACP on the 4374 data points in the BSE49 set.Given the nearly identical performance of the ACPs on the fitting set and the unseen data (see the SI), we combined the results of the two for ease of presentation.Included in the figure are BSEs obtained with uncorrected ωB97XD/pcseg-1, and with M06-2X/def2-TZVP.Uncorrected ωB97XD/pcseg-1 predicts BSEs that span an error range of −22.2 to +9.4 kcal mol −1 , with a mean signed error (MSE) of −3.8 kcal mol −1 and 1.5 times the interquartile range (IQR) spans ca.18 kcal mol −1 .Clearly, as is the case for most DFT-based methods, ωB97XD with a double-ζ basis set is unusable for the generation of accurate bond energies and enthalpies.M06-2X/def2-TZVP provides somewhat better results, with a MSE of −1.3 kcal mol −1 and a spread in errors ranging more than 19 kcal mol −1 .
The performance of our ωB97XD/pcseg-1-ACP approach is striking: More than 67% of the calculated BSEs in the BSE49 set fall within an error window of ±0.5 kcal mol −1 , and more than 83% of the BSEs are within ±1.0 kcal mol −1 .Furthermore, 96.6% of the BSEs have errors in the range of ±2.2 kcal mol −1 -that is, within the MAE for the full BSE49 set generated by M06-2X/def2-TZVP.The ωB97XD/pcseg-1-ACP approach produces an IQR less than 0.5 kcal mol −1 and a 1.5 IQR spanning 2.3 to −2 kcal mol −1 .Only 9 BSEs (0.2% of the set) predicted using ACPs have errors larger than ±5 kcal mol −1 .Overall, BSEs obtained with ωB97XD/pcseg-1-ACP have an MAE of 0.76 kcal mol −1 , and an MSE of 0.18 kcal mol −1 , relative to the reference data.
A further demonstration of the utility of the ωB97XD/pcseg-1-ACP approach for unseen data is provided in figure 4, where we applied the method to compute the BDEs for a subset taken from the iBonD databank [37].We extracted 50 entries that were not part of the ACP fitting set, were of a size that could be calculated using our reference method ((RO)CBS-QB3), and for which St John et al [1] reported M06-2X/def2-TZVP or ALFABET BDEs with errors relative to experiment that are larger than ±5 kcal mol −1 (see the SI for full details).For this subset, compared with our calculated (RO)CBS-QB3 data, the ACP approach gives an MAE of 0.84 kcal mol −1 , which is quite close to that obtained for the BSE49 set, and performs more reliably than M06-2X/def2-TZVP (MAE = 1.78 kcal mol −1 ) and ALFABET (MAE = 2.46 kcal mol −1 ).
Lastly, table 1 compares the accuracy and computational cost of uncorrected and ACP-corrected ωB97XD/def2-TZVP with other double-ζ and triple-ζ methods typically used for DL training set development [1][2][3] as well as with the 3c-correction-based methods developed by Grimme and collaborators [39][40][41][42] in the description of bond separation energies, reaction energies, and barrier heights.The 3c-based methods are similar in spirit to ACPs in that they use an empirical correction to alleviate the shortcomings of a computationally inexpensive DFT method (e.g.basis-set incompleteness).Unlike the ACPs presented in this work, they are not property-specific, but could in principle be considered as viable alternatives for DL training set development given their emphasis on accuracy and low computational cost.Regarding BSEs, the table shows that ωB97XD/pcseg-1-ACP achieves chemical accuracy with an MAE of 0.76 kcal mol −1 and far outperforms its uncorrected version (ωB97XD/pcseg-1) as well as the triple-ζ functionals (ωB97XD/def2-TZVP and M06-2X/def2-TZVP), despite the fact that the latter are about 7-8 times more expensive.For comparison, the (RO)CBS-QB3 calculation for the reference energy in the BSE49 set [7], scaled for the number of cores, was about 50 times more expensive than ωB97XD/def2-TZVP, and about 330 times more expensive than ωB97XD/pcseg-1-ACP.[1] and ωB97XD/pcseg-1-ACP versus (RO)CBS-QB3 reference values for 50 entries in the iBonD databank [37] chosen as described in the text (not included in the ACP fitting set, size calculable with (RO)CBS-QB3 and significant deviation between M06-2X/def2-TZVP and experiment).Data are in kcal mol −1 .The blue line represents exact agreement between DFT/ML and reference BDEs.
Table 1.Mean absolute errors (MAE), mean signed errors (MSE), and root-mean-squares (RMS) for the bond separation energies (BSEs) in the BSE49 set [7] and the reaction energies (REs) and barrier heights (BHs) in the BH9 set [8] using various functionals (in kcal mol −1 ).An estimate of the computational speed-up relative to ωB97XD/def2-TZVP is shown in the last two columns a .The results from our ACP-corrected method are shown in bold in the first row.a We used Gaussian 16 [23] for the first five functionals and ORCA [29] for the 3c-corrected methods.b Ratio between the wall time of the corresponding method and ωB97XD/def2-TZVP, both calculated with the same software, for dicumylperoxide from the BSE49 set (t 1 scal , 42 atoms) and the transition state for the intramolecular pericyclic reaction 02-7 from the BH9 set (t 2 scal , 63 atoms).All calculations were ran on an Intel Xeon Platinum 8280 processor (4 cores for t 1 scal , 8 cores for t 2 scal ).

BSE
The performance of the 3c-corrected methods in the BSE49 set (table 1) is underwhelming, with MAEs higher than any of the other examined methods, including the double-ζ uncorrected ωB97XD/pcseg-1, and computational cost which for the double-ζ functionals PBEh-3c and r2SCAN-3c is higher than ωB97XD/pcseg-1 and ωB97XD/pcseg-1-ACP.This suggests that corrected general-purpose double-ζ functionals are not usable for accurate DL training set development.(A recent study has proposed a promising ωB97X-3c functional [43], but the development version of the code is at present not available to us.) From the preceding discussion, we conclude that ACPs generated for use with ωB97XD/pcseg-1 using 1000 fitting data points produce BDEs that achieve chemical accuracy, and are more reliable than those that can be obtained using uncorrected DFT methods.The computational throughput of ωB97XD/pcseg-1-ACP is estimated to be 7 to 8 times higher than ωB97XD/def2-TZVP and 6 to 8 times higher than M06-2X/def2-TZVP, so great savings in computing time can be expected with the application of ACPs for the generation of training data for DL model development.Importantly, the limiting step in the development of the property-specific ACPs is the calculation of high-level reference data, but since ACPs require on the order of 100-1000 data points, these can often be obtained from published benchmark sets, as we have done here with the BSE49.

Reaction energies and barrier heights
DL models for the accurate prediction of reaction barrier heights remain elusive, despite some recent efforts in this connection [3].Given the performance of ACPs on BSEs, we now examine the generation of property-specific ACPs for reaction energies (REs) and barrier heights (BHs).
Since barrier heights for a given reaction are related to the energy difference between the reactant(s) or product(s) and the transition state structure, we undertook the development of ACPs that would improve both BHs and REs simultaneously.The main challenge associated with this effort is the scarcity of accurate reference data in the literature for reactions that are relevant to (bio)organic chemistry.This is particularly problematic for DL model development but less of an issue for the generation of good quality ACPs since, as seen for the BDEs, they require much fewer training data.In a previous work, we assembled a set of RE and BH data for well-known organic reactions from the literature, called the BH9 dataset [8].The BH9 set contains RE and BH data for radical rearrangements/additions, pericyclic reactions, halogen/hydrogen atom transfers, and several other reaction types.In this work, reactions involving charged systems were removed, resulting in 331 REs and 663 BHs.From these data, we randomly selected 232 REs and 434 BHs, about 70% of the data, as training set to fit the ACPs.The REs and BHs removed at this stage are later used as unseen data for testing.The reference RE and BH data in the BH9 set were obtained using DLPNO-CCSD(T) values extrapolated to the CBS limit [8].We used the same LASSO constraint as in the previous section (λ = 25 a.u.) and the resulting ACPs contained 66 terms in total out of the possible 290 terms that were used in the LASSO method.
The detailed results obtained from the property-specific ACPs optimized using the REs and BHs for the reactions in the BH9 data set are provided in the SI.Analysis of the data shows that the ACPs perform quite similarly for the fitting set and the unseen data, as was the case for BSEs.Specifically, the MAEs and MSEs for the fitting and unseen sets are within 0.1 kcal mol −1 of each other for both BHs and REs.In the particular case of BHs, this is despite the fact that uncorrected ωB97XD/pcseg-1 and M06-2X/def2-TZVP display worse performance on the unseen set as compared to the fitting set.This suggests that our BH test set contains systems that are more challenging for DFT-based methods and, more interestingly, that the ACPs are capable of modeling these challenging systems with a similar level of accuracy as those in the fitting set.We present the performance of the ACP-corrected method on the fitting and unseen data collectively for the RE (figure 5, top) and BH (figure 5, bottom) datasets, along with results obtained using uncorrected ωB97XD/pcseg-1 and M06-2X/def2-TZVP.
Figure 5(top) shows that the ACP approach significantly outperforms uncorrected ωB97XD/pcseg-1 and performs much better than M06-2X/def2-TZVP for reaction energies.The ACP approach predicts 61% of the REs with errors below 1 kcal mol −1 , an improvement of a factor of three over uncorrected ωB97XD/pcseg-1 and a factor of 1.5 over M06-2X/def2-TZVP.The improvement factor achieved by the ACP approach is further illustrated by the MAEs generated for the full RE data set: For uncorrected ωB97XD/pcseg-1, M06-2X/def2-TZVP, and ωB97XD/pcseg-1-ACP, the calculated MAEs are 2.95, 1.68, and 0.94 kcal mol −1 , respectively.It is worth noting that only the errors in REs predicted using ACPs resemble a normal distribution centered at zero error.
The results for the barrier height predictions mirror those obtained for the reaction energies: Uncorrected ωB97XD/pcseg-1 performs most poorly (MAE = 3.20 kcal mol −1 ) and is effectively not usable as a method for the prediction of accurate barrier heights, while M06-2X/def2-TZVP gives an MAE of 1.75 kcal mol −1 and ωB97XD/pcseg-1-ACP gives an MAE of 0.91 kcal mol −1 for the barrier heights.The ACP-corrected approach significantly outperforms M06-2X/def2-TZVP in a manner that parallels the RE results.The MAE values may be compared to an MAE of 1.37 kcal mol −1 obtained for the entire BH9 set using the double-hybrid functional ωB97X-2 with def2-QZVPP basis sets, which is several orders of magnitude more computationally intensive than ωB97XD/pcseg-1-ACP [28].
Table 1 compares ωB97XD/pcseg-1-ACP to the 3c-corrected methods and other functionals typically used for DL model development in the literature regarding barrier heights and reaction energies.As in the case of BSEs, ωB97XD/pcseg-1-ACP is the only functional that achieves chemical accuracy both for BHs and REs, outperforming the double-ζ and triple-ζ functionals and at a cost 6-8 times lower than the latter.For comparison, the DLPNO-CCSD(T) calculations used in the generation of the BH9 set reference data [8] were about 520 times more expensive than ωB97XD/def2-TZVP and 3700 times more expensive than ωB97XD/pcseg-1-ACP.Also as in the case of BSEs, the 3c-methods, which could be considered as a potential alternative to double-ζ methods like ωB97XD/pcseg-1, have average errors that are higher than any of the other methods for REs, and, for BHs, only PBEh-3c has an MAE (4.29 kcal mol −1 ) that is comparable with In summary, as was the case for BDEs, the results in this section demonstrate that ωB97XD/pcseg-1 in combination with property-specific ACPs can generate reaction energy and barrier height data that are far more accurate and reliable than those obtained using uncorrected ωB97XD/pcseg-1, M06-2X/def2-TZVP, or other alternative methods currently available in the literature.Therefore, developing an intermediate property-specific ACP for reaction energies and barrier heights can be relied upon to increase throughput for the calculation of training data in DL model development.

Conclusions
In this work, we posit that the generation of training set data for the development of supervised DL models is, at the moment, a serious obstacle for the creation of truly accurate and predictive models for thermochemistry.The crux of the problem is that, typically, hundreds of thousands of data points are required for robust fitting of DL models, which forces the developers to use DFT methods with incomplete basis sets.DFT methods, particularly if affected by basis set incompleteness errors, are quite inaccurate for the prediction of thermochemical properties, and therefore the resulting DL models are flawed because they can only be as good as the reference data used to train them.
To solve this problem, we propose in this article the development and use of property-specific ACPs to act as an intermediate method for the generation of training sets aimed at the development of thermochemical DL models.ACPs are one-electron potentials designed to mitigate the errors associated with basis set incompleteness and the underlying deficiencies of a computationally inexpensive density functional and basis set combination (in this work, ωB97XD/pcseg-1).By training the ACPs against high-level CCSD(T)/CBS data, we obtain a method that has the cost of a DFT/DZ approach and yields the desired properties with an accuracy similar to the reference CCSD(T)/CBS level.As shown in this work, it is common that the ACP-corrected DFT/DZ method dramatically outperforms other density functionals, even those close to the complete-basis-set limit.Crucially, only a few hundreds to a thousand high-level data points are required to train the ACPs, which means that the data required for developing the ACP can be drawn from already-published benchmark sets, and the resulting ACP-corrected method typically retains its accuracy when applied to unseen systems.
In this work we examined the design of property-specific ACPs for training set generation regarding three thermochemical properties of interest: BDE, reaction energies (RE), and barrier heights (BH).The use of BDEs as an illustrative example was motivated by the recent work by St John et al [1] in which a DL model was developed for BDEs using a training set with 276 717 unique data points calculated at the M06-2X/def2-TZVP level.We show that BDE-specific ACPs can be developed with only a few hundred to a thousand high-level data points for the ωB97XD/pcseg-1 functional that yield a mean average error (MAE) in the BSE49 set of 0.76 kcal mol −1 , about three times smaller than M06-2X/def2-TZVP (2.21 kcal mol −1 ).At the same time, the resulting ωB97XD/pcseg-1-ACP method is substantially cheaper (7-8 times) than M06-2X/def2-TZVP.The fact that the ACPs are specifically designed for BDEs also allows ωB97XD/pcseg-1-ACP to vastly outperform the general-purpose '3c-corrected' methods proposed by Grimme and co-workers in the calculation of these properties: r2SCAN-3c (MAE = 7.60 kcal mol −1 ), PBEh-3c (5.50 kcal mol −1 ), and B97-3c (6.25 kcal mol −1 ).As a further illustrative example, we developed property-specific ACPs for the calculation of reaction energies (RE) and barrier heights (BH), also in combination with the ωB97XD/pcseg-1 functional.As in the case of the BDE-specific ACPs, only 666 training data were used (232 REs and 434 BHs).The resulting ωB97XD/pcseg-1-ACP method predicts REs and BHs with chemical accuracy (MAEs of 0.94 kcal mol −1 and 0.91 kcal mol −1 , respectively) and outperforms M06-2X/def2-TZVP (1.68 kcal mol −1 and 1.75 kcal mol −1 ) as well as the 3c-corrected methods: r2SCAN-3c (2.34 kcal mol −1 and 5.33 kcal mol −1 ), PBEh-3c (4.59 kcal mol −1 and 2.85 kcal mol −1 ), and B97-3c (4.00 kcal mol −1 and 8.42 kcal mol −1 ).Importantly, the mean average errors of the ACP corrected method in the unseen data set is within 0.1 kcal mol −1 of the MAEs for the training set.
This work demonstrates that ACPs tailored to the prediction of specific properties can be developed with fitting sets containing only 100-1000 data points.For many chemical properties of interest, this means that the training data can even be drawn directly from the literature, foregoing any costly wavefunction-theory calculation, since numerous recently developed high-quality benchmark sets fall within this size range.The property-specific ACPs can then be used to generate training data for the development of DL models with both high accuracy and high throughput, thus addressing one of the most significant challenges faced by chemistry-oriented ML scientists: generating sufficient and accurate training data.We expect ACPs to contribute to other areas of research in which large and accurate datasets are required, for example, in the development of accurate potential energy surfaces for quantum dynamics and vibrational spectroscopy [44,45], catalyst reaction mechanism elucidation [22], and force field development [46].

Figure 1 .
Figure 1.Workflow for the development of ACPs in this work.The energy corrections associated with individual ACP terms are determined and the fully optimized potentials are calculated using the LASSO approach.

Figure 2 .
Figure 2. Mean absolute errors (in kcal mol −1 ) predicted by the ACP development linear model for ωB97XD/pcseg-1-ACP for the BSE49 set using ACPs developed with training set sizes given by the x-axis.(The performance of the actual ωB97XD/pcseg-1-ACP is very similar, and not shown.)The MAE for the training and validation sets as well as for the whole BSE49 are shown.

Figure 3 .
Figure 3. Box and whisker plot showing the error distribution in bond separation energies (BSEs) predicted for the 4394 values in the BSE49 set using ωB97XD/pcseg-1, M06-2X/def2-TZVP, and ωB97XD/pcseg-1-ACP relative to (RO)CBS-QB3.The shaded regions encompass the upper and lower quartile in which 50% of the BSEs are contained.The 'X' shows the median values for each method and the whiskers indicate 1.5 times the inter-quartile range.