Improved decision making with similarity based machine learning

Implicit or explicit decision making pervades all branches of human and societal endeavours, including scientific efforts. We have studied the applicability of modern statistical learning methods to assist in relevant decision making problems where small sets of synthetic or experimental data are available. It is a hallmark of supervised learning that prediction errors decay with training data size (Big Data paradigm). By contrast, the 'similar structure-similar property' principle (popular in cheminformatics) hinges on the importance of similarity, rather than sheer data size. We discovered similarity based machine learning (SML) to exhibit favorable performance for certain conditional (Bayesian) decision problems. We apply and analyse the SML approach for the harmonic oscillator and the Rosenbrock function. Real-world demonstrations include improved decision making in (i) quantum mechanics based molecular design, (ii) experimental design in organic synthesis planning, and (iii) real estate investment decisions in the city of Berlin, Germany. Our numerical evidence suggests that SML's superior data-efficiency enables rational decision making even in very scarce data limits.


I. INTRODUCTION
Useful answers to experimental design [1-3] questions are crucial components of successful decision making [4-8] under finite budget and time constraints. Conventionally, optimal or near-optimal solutions are proposed by a few expert scientists, which severely limits humanity's 'experimental reach' as high-dimensional combinatorial solution spaces pose a dramatic bottleneck. Breakthroughs in machine learning and advances in data availability through high performance computing or high-throughput experimentation are accelerating the pace of scientific discovery and are considered by some to even represent a paradigm shift [9-11]. In particular, autonomous robotic experimentation in the chemical and biological sciences [12-17] revolutionizes humanity's experimental reach through the capacity at which experiments can be executed. The synergy between machine learning, first principles simulations, and experimental design facilitates data driven exploration in molecular and materials discovery through an accelerated feedback loop [18]. However, rigorous exploration algorithms are direly needed to bypass the combinatorial wall in the solution space.
With recent interest in data driven machine learning approaches surging, we have set out to investigate how to exploit modern statistical learning in order to assist with such experimental design exploration challenges [19,20] and decision making problems [4,21]. A crucial success factor is the availability of accurate machine learning models, often relying on the increasingly prevalent paradigm 'the bigger the data the better' as conveyed through large language model breakthroughs such as OpenAI's impressive Generative Pre-trained Transformer 4 (GPT-4). While GPT-4 is undoubtedly of value through its multimodal capabilities, including novel applications in chemical research [22-24] or few shot learning use cases, the overwhelming majority of human endeavours in research are still plagued by intense data scarcity [25], often with only hundreds if not merely dozens of data points available, and without access to pre-trained all purpose models. Typically, scarcity is due to the high costs imposed by the acquisition of the necessary amount of data. Common examples include long simulation times in quantum chemistry [26], the cost of compute to perform simulations, or the money and time needed to perform laboratory experiments. Consequently, the development of data-efficient machine learning models and training set selection schemes is crucial [27-32]. Within the chemical sciences, a concept frequently referenced in drug discovery research [33-35], the 'similar structure-similar property principle' (SPP), suggests that when the query is known, machine learning models should be trained on instances similar to the query, rather than on increasingly many diverse ones (Fig. 1A). Thus, within statistical learning, the training-query similarity emerges as a crucial measure of the predictive power of a model, possibly more relevant than the total quantity of data available.
Here we exploit the SPP concept in quantum chemical machine learning through the introduction of similarity based machine learning (SML), which generates tailored models for arbitrary queries, reaching meaningful prediction accuracy with only a fraction of the total training pool. Ranking prospective data by similarity to a query, only the N_q nearest neighbours are chosen for training SML models (see Fig. 1B for kernel ridge regression based SML). Note that, if not mentioned otherwise, the N_q neighbours are selected by proximity rather than as all neighbours within a fixed distance radius. While conceptually similar to local learning algorithms [36] such as k-nearest-neighbours or moving least squares, SML models achieve local neighbourhood weighting through kernel or bagged decision tree based methods. Within this work, we rely on kernel ridge regression for SML, which in combination with a Gaussian or Laplacian kernel (Eqs. 6 and 5, respectively) can be seen as a form of weighted nearest neighbour regression (see SI for further discussion). SML is preferable for applications in which a) prior knowledge about the query pool is available, b) only few queries are of interest, with properties that are expensive to measure in time or money, and c) the property of interest is relatively easy to obtain for similar candidates.
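As a concrete illustration, the following minimal sketch implements the SML procedure of Fig. 1B in Python with a Laplacian kernel. It assumes a precomputed feature matrix X_pool with labels y_pool; all names and hyperparameter values are illustrative, not those of our production code.

```python
import numpy as np

def sml_predict(x_query, X_pool, y_pool, n_q=32, sigma=100.0, lam=1e-8):
    """Train a query-specific KRR model on the n_q nearest neighbours.

    Illustrative sketch: Laplacian kernel on precomputed feature vectors;
    sigma and lam would be determined by cross-validation in practice.
    """
    n_q = min(n_q, len(X_pool))
    # Step 1: rank the pool by L1 distance to the query, keep the n_q closest.
    d = np.abs(X_pool - x_query).sum(axis=1)
    idx = np.argsort(d)[:n_q]
    X_train, y_train = X_pool[idx], y_pool[idx]

    # Step 2: Laplacian kernel matrix over the selected training points,
    # solved in closed form for the regression coefficients (Eq. 4).
    D_train = np.abs(X_train[:, None, :] - X_train[None, :, :]).sum(axis=2)
    K = np.exp(-D_train / sigma)
    alpha = np.linalg.solve(K + lam * np.eye(n_q), y_train)

    # Step 3: predict the query label from its kernel similarities (Eq. 3).
    return np.exp(-d[idx] / sigma) @ alpha
```

For each new query, the ranking and the closed-form solve are repeated on the fly, which remains cheap as long as N_q stays small.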
In contrast to other data efficient learning schemes such as active learning, SML focuses on few queries at a time rather than training an all purpose model in an iterative fashion. Note that the smaller the number of relevant query instances, the more advantageous SML will be. However, SML can be extended to iterative active learning by using distance or density based training point selection if a multitude of queries is expected. Other data efficient learning algorithms such as transfer or few shot learning require pre-trained models, whereas SML relies solely on training points similar to a query, thereby opening the possibility of query aware few shot learning in domains with limited access to pre-trained models.
For exhaustive screening of chemical compound space, by contrast, conventionally trained random machine learning (RML) models are more appropriate. Note that within this work SML and RML are both based on kernel ridge regression; however, they differ in their training data selection and applicability. As such, consider multi-objective compound design in which an RML model is used as a filter to scan vast domains of chemical space [37], thereby requiring a large and diverse training set in order to answer any query [38,39]. Once a candidate compound has been identified, i.e. we are aware of the query, SML can subsequently be used to provide highly accurate estimates for properties not covered by the highly general RML model.
After introducing SML in this article, we demonstrate its usefulness for two distinct applications: i) Quantum mechanics based molecular design: statistical learning of quantum properties holds the promise to accelerate the computational materials design process via fast and accurate machine learning models [40,41]. Due to the size of chemical compound space (encompassing at least 10^60 compounds [37]) and the computational cost of atomistic simulations (ranging from hours to months [26]), SML can dramatically reduce training data needs if target query compounds are known. ii) Organic synthesis planning: the bottleneck of molecular and materials discovery is amplified in experimental settings due to large costs and the required manual labour. Additionally, compounds have to be synthesised or purchased before properties can be measured, adding to the cost-time complexity of molecular and materials design. We showcase how SML can be used to efficiently predict relevant properties of target boutique compounds without having to make the expenditure necessary to acquire them.
Finally, we analyse the link between query and training data proximity, allowing us to derive a relationship between the intrinsic dimensionality and the volume of the feature space spanned by the training data. The results further corroborate the idea that, with increasing local nearest neighbour density, a decreasing fraction of the total data pool is required to converge to competitive performance.

A. Similarity Machine Learning
To assess the effect of the SML approach on learning, we compare learning curves for random and similarity based machine learning models applied to the analytically known harmonic oscillator and Rosenbrock functions (Fig. 2A and B). The SML learning curves exhibit dramatically lower offsets and faster convergence when compared to randomly selected ML (RML) models. See Methods for more technical details on training and testing.
On average, only very few yet similar training points are necessary for SML to be competitive with an RML model trained on an order of magnitude more training points. In comparison with k-nearest neighbour (k-NN) regression, SML performs on par when the training pool is grid sampled (Fig. 2B). However, SML outperforms k-NN in randomly sampled training pool settings, suggesting robust performance even in sparsely sampled regions. As mentioned before, since SML is trained on the fly for each query, its use is only advantageous if few queries are of concern. For exhaustive screening and enumeration studies, the conventional RML approach is more appropriate. We note the difference in shape of an SML learning curve when compared to learning curves reported in the literature: the rapid onset of convergence indicates that only a fraction of the available nearest neighbours is necessary to reach the maximum predictive accuracy. Note that in the limit of maximum dataset size, the SML model's prediction error will always converge towards the corresponding RML prediction error.
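A toy comparison in the spirit of Fig. 2, reusing sml_predict from the sketch above (interval, query location, kernel width, and training set size are arbitrary illustrative choices):

```python
import numpy as np

# Toy comparison on the 1D harmonic potential V(x) = x^2 (cf. Fig. 2).
rng = np.random.default_rng(0)
X_pool = rng.uniform(-5.0, 5.0, size=(1000, 1))   # randomly sampled pool
y_pool = (X_pool ** 2).ravel()
x_q = np.array([1.3])                             # single query point

# SML: train on the 16 nearest neighbours of the query.
pred_sml = sml_predict(x_q, X_pool, y_pool, n_q=16, sigma=1.0)

# RML baseline: the same KRR trained on 16 randomly drawn points.
idx = rng.choice(len(X_pool), size=16, replace=False)
pred_rml = sml_predict(x_q, X_pool[idx], y_pool[idx], n_q=16, sigma=1.0)

print(abs(pred_sml - 1.3 ** 2), abs(pred_rml - 1.3 ** 2))
```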

B. Dimensionality analysis
We have analysed the SML models throughout this paper using learning curves. The relationship between the predictive accuracy of a model and the number of training points is a fundamental concept in statistical learning. If trained properly, the learning curve [42] must reflect the inverse relationship between the out-of-sample prediction (test) error and the number of training points N; in the case of SML based learning curves, N corresponds to the number of nearest neighbours N_q chosen for training. The logarithmized version of the leading error term is

log(Error) ≈ log(a) − b log(N),     (1)

with the slope b being related to the intrinsic dimensionality (d_id) and a being related to the target similarity [43]; both show an inverse correlation with high model performance, implying that low d_id data is generally easier to learn [44-46], and that compact representations suffice. In the limit of large N and randomly sampled data, Eq. 1 must become linear for Gaussian process regression [42] as well as for neural network regressors [47].
Calculating the distances from all training points in a pool N_max to all queries, respectively, the average number of nearest neighbours N_q within a certain distance radius r can be determined. Fig. 2C confirms the intuition that by increasing the total data pool N_max, the local density, i.e. the number of neighbours within a fixed radius, grows as well. To quantify the relationship between N_max and distance radius r, we use the definition of density ρ, i.e. ρ = N/V = N/r^d_id, with volume V, distance radius r, and intrinsic dimensionality d_id. To obtain values for ρ and d_id, we transform the density equation to fit the steepest slope of N_q as a function of distance r on a double logarithmic scale (Fig. 2C), yielding

log(N_q) ≈ d_id log(r) + log(ρ).     (2)

Fig. 2D shows the resulting density ρ to increase linearly with larger pool sizes N_max. Interestingly, our calculated d_id effectively underestimates the respective formal dimensionality of 1D and 2D for the harmonic oscillator and Rosenbrock function (Fig. 2D). Our analysis indicates that edge artifacts introduced through the fixed function boundaries, as depicted in Fig. 2A, contribute to this effect: queries that are close to the boundary will have fewer neighbours when increasing the distance radius than queries in the center, leading to an underestimated d_id (see SI, Fig. S1). Given an infinite space and perfectly centered points, the formal d_id can be recovered. We note that other nearest neighbour or maximum likelihood d_id estimators [52-55] are possible, but can suffer from negative bias through edge effects in higher dimensions [56,57].
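A possible implementation of this estimate is sketched below; for simplicity it fits all supplied radii by least squares, whereas the analysis above fits the steepest-slope region of the curve.

```python
import numpy as np
from scipy.spatial.distance import cdist

def fit_intrinsic_dim(X_pool, X_queries, radii):
    """Estimate d_id and density rho from the log-log form of Eq. 2.

    Averages neighbour counts over all queries and fits
    log(N_q) = d_id * log(r) + log(rho) by least squares.
    radii: 1D numpy array of increasing distance radii.
    """
    D = cdist(X_queries, X_pool)  # all query-to-pool distances
    n_q = np.array([(D < r).sum(axis=1).mean() for r in radii])
    mask = n_q > 0                # avoid log(0) at tiny radii
    d_id, log_rho = np.polyfit(np.log(radii[mask]), np.log(n_q[mask]), 1)
    return d_id, np.exp(log_rho)
```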
Having introduced the concept of SML and learning curves on mathematical toy functions, we will now apply SML to learning quantum properties of chemical compounds.

C. Quantum mechanics based virtual compound design
To illustrate a meaningful use case of SML, consider a typical problem of computational molecular and materials design: organic light-emitting diodes. Consider furthermore that a promising compound has been identified through sophisticated screening of a very large set of potential candidate compounds. Suppose that the screening imposes all necessary constraints on the candidate compound, except for quantum mechanics based excited state calculations, which are not only difficult to model but also computationally expensive for larger compounds [58,59]. In this scenario, it is difficult to make a rational decision due to the high computational cost and time required to perform the quantum calculation. Furthermore, given the scarcity of excited state literature data throughout chemical compound space, RML models such as state of the art neural network architectures [60-63] in quantum chemistry would potentially require larger volumes of data to yield accurate estimates. By contrast, due to its data-efficiency, SML is able to produce substantially more accurate predictions after training on substantially less data.
Obviously, this molecular design scenario is not restricted to organic electronics. As a numerical illustration, we report the application of SML to learning quantum mechanically calculated atomization energies (rather than excited states) of organic molecules, as reported in the QM9 dataset [48], rather than of organic materials candidates. As for the harmonic oscillator and Rosenbrock function discussed above, Fig. 3A displays overwhelming numerical evidence that SML models also exhibit superior data-efficiency when it comes to molecules and their properties. More specifically, for all tested N_max, SML learning curves exhibit a much reduced offset and converge to the accuracy of the RML learning curve at N_max for training set sizes that are smaller by orders of magnitude. In contrast, k-NN did not improve beyond the first neighbour. Note also the robustness of SML: as the number of nearest neighbours N_q grows, its predictive variance decreases systematically (Fig. 3A inset). Furthermore, SML performs consistently when applied to quantum properties in chemical compound space as datasets and molecular representations are varied (for further examples see SI and Fig. S3).
At increasingly large N_max, a shoulder becomes apparent for small training set sizes with fewer than N_q = 30 neighbours. Moreover, the larger N_max, the more data-efficient the SML model (earlier flattening), suggesting that the increase in local neighbour density leads to faster saturation in learning (see Sec. II B for an in-depth dimensionality analysis in analogy to the harmonic oscillator and Rosenbrock function). At the largest N_max possible within the QM9 dataset (∼134k molecules for training), SML reaches the coveted chemical accuracy of on average 1 kcal/mol at only 100 training points for any given query. For RML models, reaching this accuracy with so little data would have solved the 'QM9-IPAM-Challenge' in quantum chemical machine learning [64,65]. Note that many applications in quantum mechanics based simulation of molecules and materials are concerned with few queries rather than many. This indicates that SML might be a superior choice compared to learning a general purpose property throughout chemical compound space, such as a universal force field.
To illustrate the effect of the query-specific density on the SML model, we compare two query compounds with respectively low and high density of neighbours in chemical space.
More specifically, we consider the caged ether-bridged C6H8O2 compound 3,8-dioxatricyclo[4.2.1.0²,⁵]nonane [66,67] and the fused C7H10O aldehyde 4-methylbicyclo[2.1.0]pentane-2-carbaldehyde [66,67], encoding motifs that are respectively rather rare or frequent within the chemical space spanned by QM9 [48]. SML learning curves (Fig. 3B) confirm the expectation: the higher the density of neighbours, the better the learning curve's offset and slope in the low-data limit. Note that for query specific training, resulting learning curves can also increase with training set size. This is in stark contrast to the statistical learning tenet that averaged prediction errors must decay systematically [42,47]. Visual inspection of the closest nearest neighbours in terms of FCHL based distances (Fig. 3C) reveals barely any relative chemical trends of similarity. Using BoB based distances, however, the visual inspection is more straightforward: Fig. S5 reveals, for example, that for a high-density query a tautomer in QM9 was selected by SML, resulting in near-perfect predictions.

D. Dimensionality analysis of QM9
To gain a better understanding of the relation between query and training data proximity, SML models have been trained, analogously to the harmonic oscillator and Rosenbrock function analysis, using an increasing distance radius as the criterion to select training data. Fig. 4A indicates that with increasing N_max the distance radius required to reach the final accuracy becomes smaller. Concurrently, the average number of nearest neighbours N_q within a smaller radius also increases (Fig. 4B). We define a critical radius r_crit as the distance cutoff at which 70, 80, 90, or 95% of the maximum accuracy is reached at each N_max, respectively (Fig. 4A). We observe that with increasing N_max, r_crit becomes smaller (Fig. 4D). Calculating the average number of neighbours N_q within r_crit (Fig. 4B), a Pareto front becomes apparent, confirming that at increasing N_max a fraction of the data is sufficient (Fig. 4C). We apply Eq. 2 to the data of Fig. 4B to obtain the density ρ and d_id. As expected, the density ρ increases with N_max (Fig. 4E). However, in consideration of the size of chemical compound space and the ways it can be systematically enumerated, it is unclear how and if the density will converge, and how this will influence the learning of properties in chemical compound space. The estimated d_id converges to a value of 2.4 for N_max > ∼100 using a Gaussian kernel (Fig. 4F) and 1.7 using a Laplacian kernel (see SI Fig. S11). While estimates of the d_id of popular chemical datasets have not been rigorously reported yet, we note that the d_id of a fixed segment of chemical compound space is bound by an upper limit of 4N−6 when using a roto-translationally invariant representation. Reasons for potential underestimates can be finite boundaries in terms of representation or even chemical space; points in high dimensions tend to be localized at the surface of the hypersphere. The effect of edge artifacts can be further amplified through the high dimensionality of molecular representations, which typically are vectors containing thousands of entries. However, we report that even a high dimensional representation such as BoB (∼1'000 entries for QM9) shows high collinearity compared to uniform random vectors of the same size (see SI Fig. S12).
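One plausible reading of the r_crit definition above, as a sketch (the exact thresholding used to produce Fig. 4 may differ in detail):

```python
import numpy as np

def critical_radius(radii, mae, fraction=0.9):
    """Smallest training-selection radius reaching `fraction` of the total
    accuracy gain; assumes mae decreases towards its final value.

    radii: increasing radii used to select training data;
    mae: SML prediction error when trained on all points within each radius.
    """
    target = mae[0] - fraction * (mae[0] - mae[-1])  # error level to reach
    return radii[np.argmax(mae <= target)]           # first radius at/below it
```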

E. Organic synthesis planning
An SML use case in the context of managing synthesis planning [68-71] within industrial chemistry could be as follows: consider the problem of acquiring non-trivial molecular properties of one boutique compound through experimental measurements. Synthesis of the compound prior to measurements is also assumed to first require substantial experimental efforts, resulting in very long delivery times (also known as 'lead times') [72]. Note that this is not an unusual 'academic' situation, as the latter can nowadays be performed commercially through synthesis service companies such as Enamine Ltd. While such services have reached considerable reliability, reporting success rates of 60-80% and higher, challenging boutique target compounds, for example cyclopropylmethyl 2-(2-oxo-3,4-dihydro-2H-1,3-benzoxazin-3-yl)acetate [66,67] on display in Fig. 5A, can impose substantial time delays until the compound has been made, characterized, and shipped, increasing lead times to several weeks or more. Alternatively, SML can be used to make accurate property predictions, rather than measurements, without having to wait out the lead time before measuring the target compound's properties: first ordering and measuring the most similar compounds with much reduced lead times, for example readily deliverable (3-5 days) building blocks from the Enamine REAL library [50,51] used for the synthesis of our exemplary boutique compound, would allow one to rapidly make the measurements that generate the training data for subsequent SML model generation. The resulting SML models can then be used to accurately infer the property of the original boutique target compound. In comparison to the naive approach of simply waiting for the compound to be delivered, followed by measurements of the properties, this procedure would effectively enable a dramatic cut in the time to decision while keeping the financial burden of data acquisition at a minimum.
To numerically illustrate the application of SML to such use cases, we have relied on the aforementioned specific example product (cyclopropylmethyl 2-(2-oxo-3,4-dihydro-2H-1,3-benzoxazin-3-yl)acetate), as well as on 999 other boutique molecules (see the data availability section for a complete list) with lead times of 3 to 4 weeks on average. As a relevant property, we have selected calculated free energy of solvation estimates [73,74] as a proxy for the measurement. As to be expected from the discussion in the preceding sections, the resulting learning curves suggest that on average SML requires an order of magnitude less data than RML and, similarly, outperforms k-NN (Fig. 5B). Assuming a 500 US$/compound price level on average, estimated purchasing costs for training data acquisition are also shown, indicating substantial potential for savings, i.e. time is money.
Given that there are multiple queries (up to 1'000), we have also assessed to which degree SML can benefit from synergies due to overlapping nearest neighbours of distinct queries. This possibility would reduce the scaling from the worst case upper bound imposed by RML, rendering SML preferable in terms of time and cost for larger numbers of target queries. To illustrate this scenario, and without loss of generality, we define the target accuracy to correspond to chemical accuracy (1 kcal/mol). Assuming up to 1'000 queries, we have applied SML to select a jointly combined training set containing the closest nearest neighbours of the selected queries. Analysis of the number of neighbours N_neighs required per query to reach 1 kcal/mol (see SI) indicates that fewer N_neighs are required when considering shared neighbourhood SML models rather than single query SML. Note that with increasing N_queries, an increasingly large SML training set is required, ultimately converging to the upper bound used within RML models. Thus, variety in the query set also requires variety in the training set, whereas few queries are more efficiently predicted via SML.
Additionally, time and cost are constraints to be considered during the design process (Fig. 5C). Although the data acquisition costs of machine learning approaches can exceed those of directly ordering the target compounds (we assume a 2'000 US$/target compound price in Fig. 5C), SML and RML can become financially advantageous for large sets of N_queries. For a modest number of expected queries, SML will be preferable, whereas for many queries RML is the method of choice. In our example, however, the cost per query has not been included as an additional constraint when selecting nearest neighbours. This could lower the total cost even further. In terms of time to decision, SML will always be preferable. Note that the specific variables cost and time strongly depend on each scenario. In conclusion, given additional constraints and a large chemical space to choose from, approximate nearest neighbour selection [75] is beneficial for reducing cost and time to decision when planning the management of chemical synthesis.
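The trade-off can be made explicit with a toy cost/time model; all prices, lead times, and the 25 compounds/day throughput below are the illustrative ball-park assumptions stated in the caption of Fig. 5C, not measured values.

```python
def decision_costs(n_queries, n_train, price_train=500, price_query=2000,
                   lead_train_days=5, lead_query_days=30, throughput=25):
    """Toy model of the two routes compared in Fig. 5C (ball-park figures).

    SML route: buy n_train similar compounds once, measure, train, predict.
    Direct route: custom-synthesize and measure every boutique query.
    Returns (cost_usd, days) for each route.
    """
    sml = (n_train * price_train,
           lead_train_days + n_train / throughput)
    direct = (n_queries * price_query,
              lead_query_days + n_queries / throughput)
    return sml, direct
```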

F. Limitations
We want to note that any limitations that apply to an RML model will also apply to SML, since both differ only in their training data, not in their underlying machine learning algorithm, e.g. kernel ridge regression. Since the choice of representation and distance metric influences the choice of nearest neighbours, an improvement in either, e.g. by including more physical knowledge or metric learning [76], will result in lower learning curve offsets (see SI Fig. S3). Conversely, low local nearest neighbour density (Fig. 3B) will require more data in order to reach optimal accuracy. Additionally, the compute time for an exact nearest neighbour search grows linearly with pool size, O(n). However, since the local nearest neighbour density is proportional to the size of the search space (see Fig. 4), the O(log n) scaling of approximate nearest neighbour algorithms, commonly used with millions of data points, becomes more effective. Further strategies could encompass using efficient low dimensional representations for approximate nearest neighbour searches while using higher quality representations for regression, resulting in a time vs. accuracy trade-off. To make the most of SML within multi query scenarios, it is recommended to maximize the overlap of shared SML training sets as described in the supplementary information. Methods such as Auto3D [77], OQML [78], Graph-To-Structure [79] or generative models [80-82] could further facilitate the generation of 3D input structures for quantum chemistry based models, avoiding the performance loss incurred by geometries from lower levels of theory such as force fields [83]. Additionally, the atoms-in-molecule (AMONs) approach [84] could further aid the applicability of SML by using smaller molecular building blocks to predict significantly larger query compounds.
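As an illustration of the neighbour-search scaling argument, a tree-based search via scikit-learn (exact, with sublinear query times in low dimensions; pool size and dimensionality here are arbitrary) might look as follows. For millions of points, approximate methods such as HNSW-based libraries trade a small loss in recall for near-O(log n) query cost.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Two-stage strategy sketched above: search for neighbours in a cheap,
# low dimensional representation, then regress on a higher quality one.
X_search = np.random.rand(100_000, 16)   # illustrative low-dim features
nn = NearestNeighbors(n_neighbors=100, algorithm="ball_tree").fit(X_search)
dist, idx = nn.kneighbors(np.random.rand(1, 16))  # neighbours of one query
# idx indexes the pool; an SML model would then be trained on these points
# using the higher quality representation and kernel of choice.
```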

III. DISCUSSION
We have presented numerical evidence that similarity based machine learning (SML), an approach that selects data and trains a model on the fly for specific queries, can offer substantial improvements in data efficiency for certain use cases. Applying SML to problems in quantum mechanics and organic synthesis planning, we have found dramatically lower offsets in learning curves [42], enabling meaningful decision making at a fraction of the total training set size. Comparisons of SML with other similarity based methods indicate robust performance, even in high dimensional feature spaces. Large training set sizes and the associated costs of data acquisition render the use of one-size-fits-all ML models infeasible given the size of chemical compound space. The shape of SML learning curves suggests meaningful deviations from the 'the bigger the better' paradigm, indicating that learning is rather dominated by locality effects and even saturates upon inclusion of sufficiently many nearest neighbours in training.
After introducing and demonstrating SML for the harmonic oscillator and Rosenbrock functions, we have demonstrated its usefulness for quantum mechanics based molecular design, where we can reach the coveted 1 kcal/mol chemical accuracy already after just 100 nearest neighbours, a task which requires conventional RML models to be trained on thousands of data points. More fundamentally, however, we have derived a relationship between the intrinsic dimensionality and the volume of the feature space spanned by the training data which governs the overall model accuracy. In managing the planning of experimental organic synthesis projects, we have shown that SML can bypass time intensive custom synthesis by using readily available compounds similar to the boutique compounds of interest.
We believe that for certain decision problems in chemistry the paradigm of 'the bigger the data the better' is not optimal, and that tailored, query specific SML models reach predictive power already for training set sizes that are orders of magnitude smaller than those of conventional RML models. More generally, we expect SML to be advantageous for any problem with a) prior knowledge of and access to molecular features in the query pool (to easily calculate similarity), b) few queries of interest with labels that are expensive to measure (e.g. long computation times or high costs), and c) labels for similar candidates in the pool that are less difficult to obtain. Moreover, the implicitly defined domain of applicability of the model provides an indication of the reliability of a machine learning model for unseen queries and inherently limits its usefulness outside of this domain (extrapolation [85]). Potential applications beyond chemistry could include predictive maintenance tasks in computer aided infrastructure [86] or aerospace engineering [87,88]. Future work will address automated experimental design, deriving ready-made models in quantum chemistry or multilevel learning, improving similarity measurements, as well as exploring fast and approximate neighbour searches (cf. turbo similarity fusion) [76,89-91].

IV. METHODS

A. Kernel Ridge Regression (KRR)
We rely on kernel ridge regression (KRR) [92] due to its successes in small data scenarios; however, other machine learning regressors are possible. Kernel based methods belong to the class of supervised machine learning models and are, despite their simplicity, a powerful approach for learning molecular and materials properties. In ridge regression, a mapping function is learned, relating a representation vector x to a label y [92,93]. The 'kernel trick' renders the problem linear through the use of kernel functions that yield the inner product of high dimensional representations x. In practice, KRR is formulated as

y(x_q) = Σ_i α_i K(x_i, x_q),     (3)

with α_i being the regression coefficients, x_i the i-th representation vector of the training set, x_q the query representation, and K the kernel function. The regression coefficients α are calculated through a closed-form solution,

α = (K + λI)^(−1) y,     (4)

with the identity matrix I and a regularization coefficient λ. The latter depends on the anticipated noise in the inputs and labels and has to be determined through hyperparameter optimization.

B. Kernel Functions
As a similarity measure, the Laplacian (Eq. 5) and Gaussian (Eq. 6) kernels are the most common choices:

K(x_i, x_j) = exp(−‖x_i − x_j‖_1 / σ),     (5)

K(x_i, x_j) = exp(−‖x_i − x_j‖_2^2 / (2σ^2)),     (6)

with σ being the kernel width, an additional hyperparameter to optimize. The kernel function for local atomic representations can be rewritten as a sum of pair-wise kernels,

K(M_a, M_b) = Σ_{I∈a} Σ_{J∈b} k(x_I, x_J),     (7)

with I and J being atoms in the training and query compounds, respectively. Kernels for local atomic representations of molecules are often augmented with an additional elemental screening function to compare only atoms of the same nuclear charge,

k(x_I, x_J) = δ_{Z_I Z_J} exp(−‖x_I − x_J‖_2^2 / (2σ^2)),     (8)

with Kronecker delta δ, nuclear charges Z, and atomic representations x. In order to use local representations as a normalized molecular similarity measure, Eq. 8 can be expressed as a normalized sum of local atomic comparisons, e.g. K̃(M_a, M_b) = K(M_a, M_b) / √(K(M_a, M_a) K(M_b, M_b)).
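A direct, unvectorized sketch of Eqs. 7 and 8 (per-atom representation vectors and nuclear charges are assumed as inputs; production codes vectorize this double loop):

```python
import numpy as np

def local_gaussian_kernel(X_a, Z_a, X_b, Z_b, sigma=10.0):
    """Molecular kernel as a sum of screened atomic pair kernels (Eqs. 7-8).

    X_a, X_b: per-atom representation vectors of two molecules;
    Z_a, Z_b: nuclear charges used for elemental screening.
    """
    k = 0.0
    for x_i, z_i in zip(X_a, Z_a):
        for x_j, z_j in zip(X_b, Z_b):
            if z_i == z_j:  # Kronecker delta: same-element atoms only
                d2 = np.sum((x_i - x_j) ** 2)
                k += np.exp(-d2 / (2.0 * sigma ** 2))
    return k
```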

C. Representations
We use standard representations which have been developed, benchmarked, and continuously improved over the past years for learning quantum properties. The Coulomb Matrix (CM) [94] contains a 1-body self interaction term and a 2-body Coulomb repulsion term, and was one of the first representations developed for machine learning quantum properties. The bag-of-bonds (BoB) [95] representation, the direct successor of the CM, vectorizes the CM terms into fixed size bins, thus enabling the strict comparison between chemically similar environments. The spectrum of London and Axilrod-Teller-Muto (SLATM) [84] representation uses London dispersion contributions as the 2-body term and the Axilrod-Teller-Muto potential as the 3-body term, and has shown improved performance over BoB for learning quantum properties. The Faber-Christensen-Huang-Lilienfeld (FCHL19) [49] representation describes a local atomic environment per atom and contains 2- and 3-body terms similar to Behler and Parrinello's atom-centered symmetry functions (ACSF) [96], accounting for distance and angular information. We used the optimized FCHL19 parameters as recommended in Ref. 49. The Smooth Overlap of Atomic Positions (SOAP) [97] representation is, similar to FCHL19, a local atomic representation that encodes local geometries through spherical harmonics and radial basis functions. We used n_max = 3 and l_max = 5 with σ = 0.1 and a cutoff of 3.5 Å as SOAP settings. For learning the free energy of solvation, we used FCHL19 and the free-energy-machine-learning (FML) scheme [98].

D. Training
We optimize the hyperparameters σ and λ through cross-validation on the largest training set size. Both parameters are held constant for all other training set sizes. To train an SML model, the pair-wise distances of a query to all available data points defined by N_max are calculated. An overview of training/query set sizes and distance metrics for each representation and dataset is given in Tab. S1 in the SI. After sorting all points by distance, only the N_q closest neighbours are considered for training. SML learning curves are obtained by averaging across all single query runs. Similarly, k-NN baselines are calculated by averaging the labels of the closest N_q neighbours, weighted by their distance to the query. RML models are trained by choosing N data points at random. The regression coefficients are obtained through Eq. 4 and the query label is predicted via Eq. 3 (Fig. 1B). Note that RML models are only trained once to predict the labels of all queries, whereas SML and k-NN models are trained on the fly for each query.
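The k-NN baseline described above could be implemented as follows; inverse-distance weighting is one common choice and is assumed here.

```python
import numpy as np

def knn_baseline(x_query, X_pool, y_pool, n_q=10, eps=1e-12):
    """Distance-weighted k-NN prediction: the inverse-distance-weighted
    average of the n_q nearest neighbours' labels (illustrative sketch)."""
    d = np.linalg.norm(X_pool - x_query, axis=1)
    idx = np.argsort(d)[:n_q]
    weights = 1.0 / (d[idx] + eps)  # eps guards against zero distance
    return np.average(y_pool[idx], weights=weights)
```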

E. Data
To assess SML, several quantum based datasets have been considered. The QM7 dataset totals 7'165 organic molecules with up to seven heavy atoms (C, O, N, S) [94,99]. Atomization energies and structures were calculated at the Perdew-Burke-Ernzerhof hybrid functional (PBE0) level of quantum chemistry [94,99]. The QM9 dataset [48] is a hallmark benchmark in quantum machine learning. It contains 133'885 small organic molecules with up to nine heavy atoms (C, O, N, F), with properties calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry [48]. Additionally, a subset of 6'095 C7O2H10 constitutional isomers has been used as a toy case.
The Enamine REAL database is the largest enumerated database of synthetically accessible compounds, containing over 4.5 billion molecules [50,51]. We have chosen 1'000 random query compounds of the REAL library containing 6-21 heavy atoms. For training, we used the Enamine REAL building blocks, which encompass 128'856 compounds that are used to synthesize the Enamine REAL dataset. We applied a SMILES [100] standardization protocol and filtered out all compounds containing B, Si or Sn, yielding a total of 124'440 compounds. 3D coordinates and conformers for all compounds were generated with ETKDG [101]. A maximum of 25 conformers were further relaxed using GFN2-xTB [102], and the corresponding Boltzmann weights were obtained through energy rankings. As training labels, free energies of solvation were computed using the Reaction-Mechanism-Generator [73,74].
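The ETKDG conformer generation step can be sketched with RDKit as follows; the subsequent GFN2-xTB relaxation and Boltzmann weighting are omitted, and the random seed is an illustrative choice.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def embed_conformers(smiles, n_conf=25):
    """Generate up to n_conf 3D conformers with ETKDG (sketch)."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = 42  # reproducible embedding
    AllChem.EmbedMultipleConfs(mol, numConfs=n_conf, params=params)
    return mol
```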

F. Shared SML Models for Multiple Queries
To determine the data efficiency of SML in consideration of multiple queries, we train shared neighbourhood SML models by combining the local nearest neighbours of multiple queries to compile the total training set (Fig. S13). Thus, the number of nearest neighbours per query, N_neighs, becomes a hyperparameter. Moreover, we introduce an additional hyperparameter N_filter that only considers nearest neighbours which occur at least 1, 2, 3 or 4 times within a finite radius of multiple query neighbourhoods. We apply this scheme to train shared SML models tailored for up to 1'000 boutique compound queries of the Enamine REAL library (Fig. S14). Results suggest that for all considered N_queries and N_neighs per query, a combination can be found that reaches 1 kcal/mol chemical accuracy with fewer than 2'000 training points (Fig. S15). Furthermore, optimal hyperparameter combinations are mostly found for N_filter = 1, underlining the importance of neighbour proximity. For example, given two queries, a neighbour within a large radius of both queries could be found (N_filter = 2) which might be dissimilar to both queries and, therefore, less impactful for learning than a direct neighbour (N_filter = 1). Considering the optimal training set sizes for multiple queries (Fig. S15), the time and cost estimate depicted in Fig. 5C can be derived.
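A minimal sketch of the shared-neighbourhood training set construction with the N_filter criterion (the per-query neighbour lists assumed as input format are illustrative):

```python
from collections import Counter

def shared_training_set(neighbour_lists, n_filter=1):
    """Combine per-query neighbour lists into one shared SML training set.

    neighbour_lists: one list of pool indices per query (its neighbours
    within some radius).  Keeps points occurring in at least n_filter
    query neighbourhoods; n_filter = 1 is the plain union.
    """
    counts = Counter(i for nbrs in neighbour_lists for i in set(nbrs))
    return sorted(i for i, c in counts.items() if c >= n_filter)
```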

FIG. 1. Schematic of similarity based machine learning (SML). (A) As a data acquisition strategy, consider similarity based vs. random selection of costly labels (y) corresponding to quantum observables and molecules (represented by wave functions Ψ) from chemical space, resulting in either SML or random machine learning (RML) models, respectively. Due to SML's exclusive focus on the region most relevant to the query, improved data-efficiency is achieved, as evinced by learning curves. (B) Training procedure of SML with kernel ridge regression: given a vast pool of potentially interesting queries (rectangle), typically only certain queries are of interest. SML selects the corresponding relevant training points (N_q) from feature space (banana) as ranked by similarity in representation features (Step 1). Consequently, training SML models (Step 2) requires costly labels only for the N_q nearest neighbours (shown in circles) which do not yet overlap with any other previously selected query. Resulting query specific SML models yield similar prediction accuracy for less training data (Step 3).

FIG. 2. Left column: SML of a harmonic potential. Right column: SML of the Rosenbrock function. (A) Illustration of the target functions and some random query points. (B) Learning curves showing systematic improvement with increasing training set size using RML and SML compared to k-NN baselines. Random (dashed) and grid (solid) indicate the initial training pool sampling of the function, respectively. The null model corresponds to the averaged function value in the training set. (C) Average number of nearest neighbours N_q as a function of a radius around a query for increasing total training set size N_max, drawn from the respective ranges at random. (D) Intrinsic dimensionality d_id and density of points ρ for increasing N_max (see Eq. 2).

FIG. 3. Training and analysis of SML models of atomization energies of QM9 [48] compounds using the FCHL19 [49] representation. (A) Learning curves for RML, SML and k-NN for increasing maximal training set sizes, drawn at random. The inset depicts the error distribution of SML models at varying training set sizes for the largest N_max. The chemical accuracy threshold is shown as a horizontal line. The null model corresponds to the averaged function value in the training set. (B) Learning curves of RML and SML models for predicting 4-methylbicyclo[2.1.0]pentane-2-carbaldehyde (https://tinyurl.com/2p9dtueh) and 3,8-dioxatricyclo[4.2.1.0²,⁵]nonane (https://tinyurl.com/2p8e4d3h), with either high or low nearest neighbour density, respectively. (C) Display of the first shell (closest ten molecules) and second shell (next closest fifteen) neighbours for the two query compounds.

FIG. 5. Example of how to use SML within a virtual custom synthesis scenario. (A) Suppose the decision to make an investment in a certain desired boutique query compound (exemplary molecular graph and string shown as inset, https://tinyurl.com/5yvxxzst) depends on a certain property. Rather than ordering a costly and time-intensive custom synthesis, one can instead predict the property by first ordering sufficiently many more easily available compounds which are similar to the query, and by subsequently measuring their properties in order to train an accurate SML model. The null model corresponds to the averaged function value in the training set. (B) Respective learning curves and training data acquisition costs for RML and SML models of an exemplary property (free energies of solvation) using molecules drawn at random from the Enamine REAL libraries [50,51]. The dashed horizontal line indicates the chemical accuracy threshold (1 kcal/mol, corresponding to experimental uncertainties in thermochemical combustion). (C) Cost vs. time for the generation of an SML or RML dataset (∼500 US$/compound) in comparison to a ball-park synthesis cost of 2'000 US$/query for different numbers of expected queries N_queries. The time estimate includes approximate delivery times of 5 days for SML/RML compounds and 1 month for boutique compounds, respectively, as well as an experimental throughput of 25 compounds per day.
Figures 1, 5 and S16 were created by Freepik, photo3idea_studio, Victoruler, Maxim Basinski Premium and Monkik of flaticon.com. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 772834). This research is part of the University of Toronto's Acceleration Consortium, which receives funding from the Canada First Research Excellence Fund (CFREF). Part of this research was performed while GFvR was visiting the Institute for Pure and Applied Mathematics (IPAM), which is supported by the National Science Foundation (Grant No. DMS-1925919).
VI. AUTHOR CONTRIBUTIONS

DL wrote new software used in the work, produced all figures, performed the literature search, and compiled the references. DL and GvR acquired the data. DL, GvR and OAvL conceived and planned the project, analyzed and interpreted the results, and wrote the manuscript.
FIG. S2. Randomly chosen atomization energy learning curves of 100 QM9 [3] queries predicted via RML and SML using FCHL19 [4], respectively. The FCHL19 accuracy represents the MAE of an RML model at maximum training set size.
FIG. S5. Analysis of the influence of neighbourhood density on learning atomization energies of two QM9 compounds with either low or high nearest neighbour density, using the BoB representation. (A-B) Learning curves of SML models for predicting the two compounds with either low or high nearest neighbour density, respectively. (C) Number of nearest neighbours found within an increasing distance radius. (D-E) Display of the first shell (closest ten molecules) and second shell (next closest fifteen) neighbours for the two query compounds.

TABLE S1. Overview of maximum training set size N_max, number of queries N_queries, and the representation/similarity metric combination used to identify nearest neighbours for each dataset.