Building robust machine learning models for small chemical science data: the case of shear viscosity of fluids

Shear viscosity, though a fundamental property of all fluids, is computationally expensive to calculate from equilibrium molecular dynamics simulations. Recently, machine learning (ML) methods have been used to augment molecular simulations in many contexts, and thus show promise for estimating viscosity, too, in a relatively inexpensive manner. However, ML methods face significant challenges, such as overfitting, when the size of the data set is small, as is the case with viscosity. In this work, we train seven ML models to predict the shear viscosity of a Lennard-Jones fluid, with particular emphasis on addressing issues arising from a small data set. Specifically, the issues related to model selection, performance estimation and uncertainty quantification were investigated. First, we show that the widely used performance estimation procedure of using a single unseen data set shows wide variability in the estimated errors on small data sets. In this context, the common practice of using cross validation (CV) to select the hyperparameters (model selection) can be adapted to estimate the generalization error (performance estimation) as well. We compare two simple CV procedures for their ability to do both model selection and performance estimation, and find that the k-fold CV based procedure shows a lower variance of error estimates. Also, these CV procedures naturally lead to an ensemble of trained ML models. We discuss the role of performance metrics in training and evaluation and propose a method to rank the ML models based on multiple metrics. Finally, two methods for uncertainty quantification, Gaussian process regression (GPR) and an ensemble method, were used to estimate the uncertainty on individual predictions. The uncertainty estimates from GPR were also used to construct an applicability domain, using which the ML models provided even more reliable predictions on an independent viscosity data set generated in this work. Overall, the procedures prescribed in this work together lead to robust ML models for small data sets.


I. INTRODUCTION
Shear viscosity is a fundamental transport property of all fluids. 1 Understanding its molecular underpinnings would advance our scientific understanding of supercooled liquids, 2 magma transport, 3 mixing of fluids, etc. For example, a good estimate of shear viscosity is crucial to model the earth's outer core, which is believed to be a liquid form of iron-based alloys. 4,5 However, in the absence of direct measurements, its estimates from different methods differ by about fourteen orders of magnitude. 6 Hence, a better understanding of the behavior of viscosity from simulations can help address some of these issues. Further, from the point of view of applications, predicting the viscosity of industrially relevant fluids (such as hydrocarbons and carbonates) in silico would accelerate progress in energy storage, petroleum, lubricants, chemical processing, pharmaceutical, and many other sectors. 7,8

A. Viscosity from Molecular Dynamics simulations

Atomistic Molecular Dynamics (MD) simulations with ab initio or empirical force fields can be used to estimate the viscosity of any liquid, however complex, in silico. [9][10][11][12][13][14][15][16][17] While there exist many methods to estimate viscosity from MD simulations, they largely fall into two categories: Equilibrium MD (EMD) 9,18 and Non-Equilibrium MD (NEMD) based methods. [19][20][21] A comparison between them is beyond the scope of this work and the readers are directed to some excellent works in this area. 9,10 Despite the progress in this area, 9,10,14,18,[22][23][24][25][26][27][28][29] the state-of-the-art methods to estimate viscosity accurately from MD simulations require huge computing time, especially for viscous fluids, 18,[30][31][32] since viscosity is a collective quantity. This drawback precludes the use of MD simulations in viscosity-based high throughput screening processes in the industry. 7 Also, force field refinement strategies which use the experimental viscosity as a benchmark require significant effort for the same reason. 33 Another important difficulty in estimating viscosity from MD is its sensitive dependence on primary and ancillary simulation setup parameters. Hess showed that ancillary MD run parameters such as the number of independent replicas, numerical precision of the MD engine, neighbor list cutoff, and Ewald sum related parameters have a significant effect on the estimated viscosity value. 10 Hence, reporting meaningful confidence intervals of viscosity estimates is crucial and is a topic of ongoing research. 9,10,23,24,34 To address these problems related to estimating shear viscosity from MD simulations, we look at alternate approaches using Machine Learning (ML) methods.
B. Machine Learning methods

Some initial advances have been made in using ML methods to predict slowly converging properties of liquids, of which shear viscosity is one. Alam et al. used Neural Networks (NN) and Random Forests (RF) to predict the self-diffusion coefficients of particles interacting via the Lennard-Jones (LJ) potential and showed them to perform better than empirical models. 40 They also used ML methods to predict the finite size corrections to self-diffusion coefficients in binary LJ fluids. 41 Further, several ML models were used to predict the experimental viscosity and ionic conductivity of ionic liquids using only molecular features. [42][43][44][45][46][47][48][49][50] However, to our knowledge, ML methods have not been used to predict shear viscosity derived purely from MD simulations.
The protocols for developing and comparing ML models -especially for chemical science applications -are not yet completely established. 51 There exist a plethora of supervised learning algorithms, 52,53 model selection rules, [54][55][56][57][58] performance metrics, 51,59,60 and uncertainty quantification methods [61][62][63][64][65] which are used to create ML models; yet, there are no clear protocols on which combination should be chosen. For example, among the many performance metrics that are used to train and compare ML models, the best choice is still a matter of debate. 51,66,67 Similar conclusions hold true for model selection, 58 and uncertainty quantification 68 as well.
In this context, the 'No-Free-Lunch' theorems by Wolpert and Macready imply that there is no single algorithm that has the best performance across all possible optimization problems. 69 These theorems, when applied to ML, indicate that any single ML algorithm cannot be expected to perform well across all possible ML tasks. Though there is still a debate on the applicability of these theorems to practical ML problems, it is considered common knowledge that the ML algorithm should be tailored to the specific task at hand. 53 Exploiting the idiosyncrasies of the data set can lead to significant improvements in the performance of ML models. For example, Müller et al. were able to improve the performance (MAE) of ML models used to predict atomization energies from 10 kcal/mol to 3 kcal/mol (70% improvement) by exploring a number of ML techniques and molecular representations. 70 In this work, we use this approach to develop ML models tailored to the viscosity data set generated from equilibrium MD simulations.

C. ML models for small data
As it is computationally costly to get reliable estimates of viscosity (including standard errors) from atomistic MD, generating large data sets (like GDB-17 with nearly 166 billion data points 71 ) is practically infeasible with the current computing resources. The largest MD-derived viscosity data set (with 1061 data points) we could find was the work of Vlugt et al., in which the MD-computed viscosity was used to predict the box size corrections to diffusion coefficients of particles in binary Lennard-Jones (LJ) systems. 72 Such small data sets pose unique challenges to ML methods 73 : (1) they are hard to generalize from, (2) they are susceptible to overfitting, and (3) they tend to underestimate the generalization error. 55,74 For example, Neural Networks (NN), which generally outperform other ML models, struggle in the low data regime. 70,75,76 Also, Casson et al., on surveying over 50 articles on machine learning for autism, have shown that ML models tended to produce overoptimistic results when the sample sizes are small. 77 All these issues exacerbate the problem of reproducibility, which is already a fast-growing challenge in all fields using ML methods. [78][79][80]

In this work, we train seven ML models to predict the shear viscosity of binary LJ fluids, with particular emphasis on addressing issues arising from a small data set. Specifically, the issues related to model selection, performance estimation, performance metrics, uncertainty quantification and applicability domain were investigated. First, we show that the common practice of estimating the performance of ML models on a single unseen data set shows wide variability for small data sets. The consequences of using an individual unseen data set on the hyperparameter optimization landscape are demonstrated. Then, we compare two simple CV procedures for their ability to do both model selection and performance estimation together. We discuss the role of performance metrics in selecting and evaluating ML models, outline some general principles for comparing different metrics, and use them to choose a suitable set of metrics relevant to the viscosity data set. We propose a holistic ranking method based on multiple metrics to choose the best performing ML algorithms. To complement the traditional ML models, we train a probabilistic model to capture the inherent uncertainty in the data set and compare its performance with that of the other models. The performance of the ML models developed here is shown to be better than that of empirical models for viscosity. Finally, the applicability domain of the ML models is also constructed (and tested) to assist the decision making of the end user. We believe that the techniques adopted herein to train the ML models, combined with the uncertainty quantification and applicability domain, can lead to accurate, reliable and reproducible ML models to predict the shear viscosity of binary LJ mixtures, and can help researchers develop ML models for small data sets in general. Also, we hope that the detailed descriptions and the codes attached in the supporting information will help with the reproducibility of the results presented in this work.
The paper is organized as follows: the next section describes the background theory and empirical evidence that aids in the discussion on model selection and performance metrics. It is followed by sections on computational details and results. The final section presents our conclusions and suggestions for developing ML models for small data sets.

II. BACKGROUND
A. The structure of the problem

We assume that there exists a joint probability density p(x, y) that generated the data set. 52,53 Here, x is a vector of input features and y is the target variable, also called the label. In the context of shear viscosity prediction, the feature vector can be constructed from quantities like x 1 , σ 2 , ε 2 , k 12 , ζ, ρ* and the target variable is the shear viscosity η (see section III A). We focus on the regression task, which aims to determine a function f*(x) that is an optimal representation of the data set. The sense in which the function f*(x) is optimal is often taken to be the one that minimizes the expected loss (also called the risk) E[L]:

$$E[L] = \iint L\big(y, f(\mathbf{x})\big)\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy \qquad (1)$$

where L(y, f(x)) is the user-defined loss function. The choice of the loss function has a direct relation to the kind of function f*(x) obtained. 66 The most common loss function, the squared loss, where L(y, f(x)) = (y − f(x))², yields the conditional mean E_y[y|x] (Eq S2) as f*(x). 52 Discussion on other loss functions and the consequent effect on the properties of f*(x) is presented in section S1.3.
However, to compute the expected loss/risk, the underlying joint probability density p(x, y) has to be known, which is hard to do in practice. Hence, the expected loss is approximated by the empirical loss,

$$E_{\mathrm{emp}}[L] = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i^{\mathrm{true}}, f(\mathbf{x}_i)\big) \qquad (2)$$

where N is the number of data points and y_i^true are the target values corresponding to the feature vectors x_i. Generally, the empirical loss tends to be much lower on the data set used to infer f*(x) (called the training set) than on new/unseen data set(s). This is because the minimization of the empirical loss (per se) incentivizes the learning machine to learn the peculiarities (like noise) of the particular training data sample rather than the trends in the underlying model that generated that data set. 52,53,81 Hence, the goal of the learning protocol should be to minimize the error on new/unseen data set(s), called the generalization error. This phenomenon of ML methods having a significantly lower training error than generalization error is called overfitting and is especially relevant for models on small data sets. 52,53,81 The most common way to alleviate the problem of overfitting is to reduce the complexity/capacity of the learning machine, thereby reducing its ability to learn the noise associated with the training data sample. However, the complexity should not be reduced to such an extent that the general trends in the data are lost, resulting in underfitting.
Hence, the ML model should choose an optimal complexity corresponding to the general trends in the data. A popular method to control the complexity of the models is called regularization, in which a penalty term (called the regularizer, Eq 3) which penalises complex models is added to the empirical loss: 53,81

$$E_{\mathrm{reg}}[L] = E_{\mathrm{emp}}[L] + \sum_j \lambda_j \Omega_j(f) \qquad (3)$$

where Ω_j(f) are the regularizers and λ_j are the parameters that control the amount of regularization. The common forms of the regularizer are based on the norm of the weights (w) of the model: the L 2 norm (as in ridge regression or Tikhonov regularization), the L 1 norm (as in the LASSO model), or a combination of both. 53 We also note that there are many other regularization techniques that are specific to Deep Learning (DL) methods, such as early stopping, dropout, soft weight sharing, etc. 82 Now that the ML models have a mechanism to control the complexity through regularization, the natural next step is to choose the values of the regularization parameters. This task falls under the purview of model selection. 53 In the following section, we discuss various model selection criteria and the closely related topic of performance evaluation.
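Before moving on, the following minimal sketch shows Eq 3 in practice: scikit-learn's ridge (L 2) and LASSO (L 1) models expose the regularization strength λ as `alpha`. The data here is a synthetic placeholder, not the viscosity data set.

```python
# A minimal sketch of Eq 3 in practice: scikit-learn's Ridge (L2) and
# Lasso (L1) expose the regularization strength lambda as `alpha`.
# The data below is a synthetic placeholder, not the viscosity set.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 4))                 # stand-in for scaled features
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=100)    # noisy linear target

for alpha in (1e-4, 1e-2, 1.0):
    ridge = Ridge(alpha=alpha).fit(X, y)       # penalizes ||w||_2^2
    lasso = Lasso(alpha=alpha).fit(X, y)       # penalizes ||w||_1
    print(f"alpha={alpha:g}  ridge w={ridge.coef_.round(2)}  "
          f"lasso w={lasso.coef_.round(2)}")   # larger alpha shrinks w
```

As `alpha` grows, the ridge weights shrink smoothly towards zero while the LASSO weights are driven exactly to zero, which is the complexity control described above.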

B. Model Selection and Performance Evaluation
It is a common practice to divide the parameters of ML and DL models into model parameters and hyperparameters. 52,53 The model parameters are learnt during the training phase on the training data. Examples of model parameters include the slope and intercept in linear regression, the coefficients of the kernel expansion in kernel methods (like KRR), the weights of neurons in Neural Networks (NNs), etc. The hyperparameters are generally the high level settings of ML algorithms which are either set by the user or inferred during the model selection procedure. Examples of hyperparameters include regularization parameters, the degree of the polynomial in polynomial regression, the choice of kernel in kernel methods, the choice of activation function in NNs, the number of neurons in NNs, etc.
The task of selecting the model with the optimal complexity is reduced to the estimation of values of hyperparameters; the criteria used for such selection are called model selection criteria. As stated earlier, the goal of ML models is to minimize the generalization error which is the average error over all unseen data. However, generalization error cannot be obtained in most practical situations and hence estimators on finite data sets are constructed to approximate it. The process of estimating the generalization error by using estimators on finite data sets is called performance evaluation and is a prerequisite for model selection. It is crucial to note that the error estimates are obtained over finite data sets and hence depend on the size of the data set, especially for small data sets. A simple example of such an estimator is the split sample estimator where the whole data set is split into two parts (generally unequal) and the error is computed on the split that was not used for training. 55 Split sample estimator is known to be unbiased i.e., the average split sample error over multiple independent realisations of unseen data asymptotically converges to the generalization error. Hence, minimizing the split sample error can in principle reduce the generalization error. However, it was recently shown that the unbiasedness per se is not as important as the variance of the estimator when it is used for model selection. 55 When an estimator has high variance (occurs with small data sets 55 ) the value of the estimated error on any one particular unseen data sample can be very different from the generalization error; hence the hyperparameters that minimize the estimated error can be far off from the optimal ones. Cawley et al. showed (on a synthetic data set) that hyperparameters selected based on split sample estimators can severely overfit or underfit the data. 55 In practice, the users rarely have the capability of generating multiple independent realizations of the data and hence the variance of the estimator plays a major role. Therefore, for small data sets it is not considered a good practice to estimate the error on a single realisation of the data set. 55 In order to mitigate this problem, various cross-validation (CV) schemes are generally used.
The core idea of k-fold cross-validation (CV) is to split the entire data set into k equal disjoint sets, train the ML models on k-1 sets and estimate the error on the remaining set. This process is repeated k times, each time with a different hold-out set. 52,53 The average error over the k folds is used as the estimate for the generalization error. It is a common practice to use 5 or 10 folds during CV. 53 The error estimates from k-fold CV are often used for model selection by searching over the space of hyperparameters and choosing the one that yields the minimum CV error. But once the k-fold CV error is used to optimize the hyperparameters, it is no longer unbiased. 53,83,84 Typically, another unseen data set (called the test set) is used to estimate the generalization error of the models with optimized hyperparameters. 53 Using a single realization of the test set, however, suffers from the high variance issue discussed above. Nested cross-validation or double cross-validation improves upon k-fold CV by doing performance evaluation and model selection in two nested loops (see the sketch at the end of this subsection). 55,70,77,83,84 The outer loop is used to estimate the generalization error and the inner loop is used to select the hyperparameters. Also, we note that there are many methods of splitting the data set into train/validation/test sets, such as Monte-Carlo CV, bootstrapping, Kennard-Stone splitting, and combinations thereof. 58,85 Xu and Goodacre compared the performance (in terms of the ability to predict the generalization error) of various data splitting methods including k-fold CV, Monte-Carlo CV, bootstrapping, etc., found that a single best method could not be identified a priori, and suggested that the choice of method should be tuned to the kind of data (No Free Lunch again). 58 Finally, we note that model selection and performance evaluation remain major unsolved challenges for small data sets. 53,55,74,[83][84][85][86] Guyon et al. organized a performance prediction challenge in which the participants (more than 100) were asked to predict the generalization error on finite data sets of real world importance like medical diagnosis, speech recognition, text categorization, etc. 85 They observed that most submissions were overconfident about their ML models i.e., their prediction of the generalization error was less than the true generalization error. They also noted that the performance of the ML models truly improved in the first 45 days of the 180 day challenge, after which overfitting set in. It is now a common belief that when a data set is worked upon repeatedly, even careful performance prediction protocols can result in optimistic performance predictions over time. 53
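As a concrete illustration of the nested CV idea referenced above, the sketch below uses scikit-learn with a KRR model and synthetic data; the hyperparameter grid and fold counts are illustrative choices, not those used in this work.

```python
# A minimal sketch of nested (double) cross-validation: GridSearchCV in
# the inner loop performs model selection, while cross_val_score in the
# outer loop performs performance evaluation on data unseen by the
# inner loop. Data and grids are synthetic placeholders.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(size=(100, 4))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=100)

inner = KFold(n_splits=5, shuffle=True, random_state=0)   # model selection
outer = KFold(n_splits=5, shuffle=True, random_state=1)   # performance est.

search = GridSearchCV(KernelRidge(kernel="rbf"),
                      {"alpha": np.logspace(-6, 0, 7)},
                      cv=inner, scoring="neg_mean_squared_error")
scores = cross_val_score(search, X, y, cv=outer,
                         scoring="neg_mean_squared_error")
print(f"nested-CV MSE estimate: {-scores.mean():.4f}")
```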

C. Performance Metrics for Regression
In this section, we summarize some of the principles that can be used to choose a metric relevant to the particular ML task at hand, and also consider the particular case of the viscosity data set. Performance metrics are generally used in two critical areas of the ML model development workflow: model training and model comparison. Though the choice of metric can significantly alter the kind of ML model developed, and consequently its real-world performance, there is no clear consensus on this topic. 51,59,66 As is the case with the model selection criterion, there is no single best metric for model training that can be used across all ML tasks. 59 Further discussion on this topic can be found in section S1.3.
Another area in which loss functions are used in the ML workflow is model comparison, in which models are ranked based on their generalization performance. Ideally, the generalization performance of ML models should be measured using the same metric used in their training phase. 66 For example, an ML model trained by minimizing MSE should be compared to other models using the MSE generalization error. However, in many cases, the choice of the loss function cannot be controlled by the model developers and hence it is difficult to choose just one metric to compare such models. For example, Makridakis et al. used a weighted average of sMAPE and MASE to compare the models in the M4 forecasting competition, citing a lack of agreement on the advantages and drawbacks of various metrics. 67 Hence, it is generally recommended to report the estimates of generalization error using multiple metrics. 51,59,70,76 Also, given the proliferation of metrics, it is important to choose a set of metrics that are relevant to the ML task at hand and that preferably contain complementary information. Armstrong and Collopy compared six commonly used metrics and ranked them qualitatively (good, fair, poor) according to five characteristics: reliability, construct validity, sensitivity, outlier protection, and their relationship to decision making. 59 They conclude that there is no single metric for model comparison that can be considered the best in all situations and that metrics should be selected based on the kind of data set.
We use some of the arguments presented in their work to identify metrics suitable to the viscosity data set. See section III A for a discussion of the characteristics of the viscosity data set used in this study. First, we look at the compatibility of metrics with a data set that spans many orders of magnitude. All metrics that have units, i.e., are not scaled, tend to be dominated by the error from the highest order of magnitude and hence do not give information about the contributions of the errors from lower orders of magnitude. 59 Metrics based on scaled error, like MAPE, are more suited to such a situation. Next, we look at the level of outlier protection of various metrics. All metrics that take an average of individual errors suffer from the outlier problem because the mean itself is sensitive to large outliers. Median-based metrics like MedAE are better suited to such a situation. However, median-based metrics are not sensitive to small changes in the errors and also do not have clearly defined gradients with respect to model parameters. Finally, we look at metrics that can capture systematic biases (over- or underestimation) in the ML models. Metrics based on error functions with a strictly positive range, like SE, AE, APE, etc., cannot distinguish between systematic over- and under-prediction by the ML models. Metrics based on the Mean Error (ME) or Mean Percentage Error (MPE) can be used to gauge the bias in the models. Therefore, we rank the ML models developed in this work based on the following metrics: MSE, MAE, MAPE, MedSE, MedAE, MedAPE, ME, MPE, MedE, MedPE, and R 2 (coefficient of determination).
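For reference, the chosen metrics can be collected into a single helper, as in the following minimal sketch; the sign convention (prediction minus true value, so a positive ME indicates overprediction) is our choice for illustration.

```python
# A minimal sketch of the performance metrics (mean/median of SE, AE,
# APE, plus R^2) and bias metrics (mean/median of E and PE) used for
# ranking; computed on untransformed viscosity values.
import numpy as np

def regression_metrics(y_true, y_pred):
    err = y_pred - y_true                       # signed error (bias sign)
    pe = err / y_true * 100                     # percentage error
    return {
        "MSE": np.mean(err**2),      "MedSE": np.median(err**2),
        "MAE": np.mean(np.abs(err)), "MedAE": np.median(np.abs(err)),
        "MAPE": np.mean(np.abs(pe)), "MedAPE": np.median(np.abs(pe)),
        "ME": np.mean(err),          "MedE": np.median(err),
        "MPE": np.mean(pe),          "MedPE": np.median(pe),
        "R2": 1 - np.sum(err**2) / np.sum((y_true - np.mean(y_true))**2),
    }
```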

III. COMPUTATIONAL DETAILS

A. Data
Viscosity is one of the few properties that can span many orders of magnitude (>10), depending on the complexity of the system and the thermodynamic conditions. In this work, we restrict ourselves to studying systems with simple interaction parameters (Lennard-Jones only). This has the twin advantages that the liquid part of the phase diagram is well understood and that, for such a simple fluid, the viscosity computation is relatively easy. However, even for such simple systems, a consistent data set with a large number (several thousand) of systems is not yet available in the literature. In the absence of a coherent data set, smaller data sets from multiple sources are generally collated to build a larger data set. However, due to the sensitivity of viscosity (especially its confidence interval) to the ancillary MD parameters, this procedure can result in unreliable models.

Vlugt data set
Recently, Vlugt and co-workers simulated 250 binary Lennard-Jones fluids to study the system size dependence of the self-diffusion coefficients. 72 In order to test the analytical expression for the system size corrections to the self-diffusion coefficients, they also computed the viscosity using the Einstein-Helfand equation. 9,87,88 To our knowledge, their work reported the largest consistent data set that used multiple long independent runs to compute viscosity and, crucially, its confidence interval. In view of these attributes, our ML models were built using this data set.
The data set contains a total of 1061 points, all of them at the same temperature and pressure of 0.65 and 0.05, respectively. All the quantities are reported in dimensionless units with the interaction parameters of the first component as the base units, i.e., σ 1 = 1, ε 1 = 1, m 1 = 1. The state space is spanned by varying three interaction parameters (σ 2 , ε 2 , k 12 ) and one compositional parameter (X 1 ). We call these parameters 'pre-MD' features, to be consistent with ML nomenclature where the independent variables are called features. Further, each state point was studied at four different system sizes (quantified in terms of the total number of particles in the simulation box). In sum, about 250 state points were simulated, each at four different system sizes, giving a total of about 1000 data points. Unlike the self-diffusion coefficient, shear viscosity does not have a strong dependence on system size. 12,15,16,34,[89][90][91][92] Hence, we use only the data points at the system size of 2000 particles (273 out of about 1000 data points) to develop the ML models in this work.
In the raw data, viscosity values span four decades, from 10^-1 to 10^3, but only two data points had a value greater than 20. These two data points (η_true > 20) were identified as outliers and were not considered during the ML modeling. Figure 1 shows the distribution of data points by their viscosity values, indicating that most of the data points are populated around the mean viscosity value of around 3 and the extremal decades are sparsely populated. Due to the uneven distribution of viscosity values across decades, the models trained subsequently can be biased towards values around the mean. The distribution of the standard error relative to the corresponding mean is shown in Figure S2. The relative standard error seems to be uncorrelated with the viscosity value itself, indicating that the data across decades is of similar quality. The standard error on the mean was used to calculate the irreducible minimum value of various loss functions 52 (also called the Bayes error 53 ). The irreducible errors are incurred by all non-probabilistic ML models because of their approximation of the conditional density E_y[y|x] by point estimates. 52 The irreducible MSE is the average variance in the data. The irreducible MAE and MAPE were estimated by sampling from a Gaussian conditional density. 52 We also note that, in general, the metric value obtained from the average standard error would be different from the irreducible loss. For example, the MAPE value obtained from the average standard error (%) of the data is about 2%, whereas the irreducible MAPE is 0.8%. Unless explicitly mentioned, the metric values obtained from the average standard errors are used to compare against the corresponding metrics from the ML models. Hence, we consider ML models with a MAPE lower than 2% to be successful models.
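The irreducible-error estimates above can be sketched as follows, under the assumption that each viscosity value is Gaussian with its reported standard error; the means and standard errors below are placeholders, not the Vlugt data.

```python
# A minimal sketch of estimating the irreducible (Bayes) errors by
# sampling from a Gaussian conditional density: the best point estimate
# is the mean, and the residual error against noisy observations is the
# irreducible loss. `eta` and `se` are placeholder arrays.
import numpy as np

rng = np.random.default_rng(3)
eta = np.array([0.5, 3.0, 8.0, 15.0])    # mean viscosities (placeholder)
se = 0.01 * eta                          # ~1% standard errors (placeholder)

samples = rng.normal(eta, se, size=(100_000, eta.size))  # y ~ N(eta, se^2)
irr_mse = np.mean((samples - eta) ** 2)                  # = average variance
irr_mae = np.mean(np.abs(samples - eta))
irr_mape = np.mean(np.abs(samples - eta) / np.abs(samples)) * 100
print(irr_mse, irr_mae, irr_mape)
```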
Furthermore, to get a preliminary understanding of the underlying correlations in the data, viscosity is plotted against the other features: X 1 , σ 2 , ε 2 , k 12 , box length, packing fraction (ζ), and self-diffusion coefficients (D1 and D2). These features can be divided into two sets: preMD and postMD features. As their names suggest, the preMD feature set consists of all those features that are fixed before running the MD simulation, whereas postMD features are obtained only after the MD simulations. In this case, there are four preMD features (X 1 , σ 2 , ε 2 , k 12 ) and six postMD features (the number density, ζ, and the four preMD features). As expected, the self-diffusion coefficients are inversely correlated with viscosity, consistent with the Stokes-Einstein relation. Apart from the self-diffusion coefficients, only the packing fraction seems to be well correlated with viscosity, with higher packing fractions corresponding to higher viscosity. The rest of the plots show a wide spread of viscosity values at any given feature value, indicating that no single feature can predict viscosity accurately. See section S3.1 for more details.
Two different sets of models, using postMD and preMD features respectively, were developed for each ML algorithm. Unless otherwise mentioned, the results presented are from the models developed using postMD features. All the features were scaled using Min-Max scaler before training the ML models. Also, the logarithm of viscosity was used as the label. However, all the metrics presented in the subsequent sections were computed on the untransformed viscosity values.
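A minimal sketch of this preprocessing pipeline is shown below, with synthetic placeholder arrays standing in for the Vlugt features and viscosities, and KRR standing in for any of the seven models.

```python
# A minimal sketch of the preprocessing described above: min-max scale
# the features, train on log-viscosity, and invert the transform before
# computing metrics on the raw viscosity scale.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)
X_raw = rng.uniform(low=[0.1, 1.0, 0.5, -0.6],
                    high=[0.9, 1.6, 1.0, 0.05], size=(50, 4))
eta = np.exp(rng.normal(1.0, 0.8, size=50))   # viscosities spanning decades

X = MinMaxScaler().fit_transform(X_raw)       # features scaled to [0, 1]
y = np.log(eta)                               # log-viscosity as the label

model = KernelRidge(kernel="rbf", alpha=1e-3).fit(X, y)
eta_pred = np.exp(model.predict(X))           # back-transform predictions
mape = np.mean(np.abs(eta_pred - eta) / eta) * 100   # metric on raw scale
```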

Interpolation data set
In order to test the predictive performance of the ML models away from the Vlugt data grid, a complementary data set called the interpolation data set was created. As the name suggests, the data set is created in the interpolation region of the preMD feature space of the Vlugt data set. The interpolation set consists of a total of 17 points at several interpolation distances. We note that the interpolation space is not entirely in the equilibrium liquid region of the binary LJ phase diagram at the thermodynamic conditions studied by Vlugt et al. 72 Hence, it is difficult to generate a "representative sample" of the interpolation space, which is required to obtain quantitative estimates of predictive performance. In this context, the current data set of 17 points (though small) can be used to understand the predictive performance in a qualitative sense. Details of the MD simulation procedure used to create the interpolation data set are given in the Supporting Information.

Applicability Domain
Given that the interpolation space is not entirely in the liquid region, the ML models cannot be expected to perform well over the entire interpolation space, especially far away from the Vlugt data grid. One way to tackle this issue is to define an Applicability Domain (AD) within which the ML models are expected to perform well. [93][94][95][96][97] There are many methods to construct an AD and a detailed comparison is beyond the scope of this work. [95][96][97] The Applicability Domain (AD) used in this work is described in section IV C. The interpolation data set is divided into two parts called In-AD and Out-AD based on whether the points fall within or outside the AD respectively.

B. Machine Learning Models
A total of seven ML models were tested for their ability to predict shear viscosity: Kernel Ridge Regression (KRR), Artificial Neural Network (ANN), Gaussian Process Regression (GPR), Support Vector Regression (SVR), Random Forest (RF), k-Nearest Neighbors (KNN), and Least Absolute Shrinkage and Selection Operator (LASSO). In the current work, GPR is the probabilistic model (level 2 in section S1.1) and all the others are non-probabilistic in nature (level 3 in section S1.1). Except for ANN, all the ML models used in this work were from the scikit-learn implementation. 98 The ANN models were built using the Keras library in a Python environment. 99 Many helper functions from numpy, 100 scipy, 101 pandas 102 and scikit-learn 98 were also used in the model construction, model selection and performance estimation steps.

C. Model Selection and Performance Estimation
In this work, we compare two popular model selection and performance estimation methods: Shuffle Split Cross Validation (SS-CV) and K-fold Split Cross Validation (KFS-CV). A precise algorithmic description of these methods is given in section S4.1. Briefly, SS-CV splits the data set into three parts (named train/validation (val)/test) multiple times. Each time the data is shuffled, and hence independent random realisations of the data can be obtained by SS-CV. KFS-CV is a two step procedure in which first the entire data set is split into two parts (named train/test) and then the train part is further split into k folds of roughly the same size. The k folds (obtained in the second step) are used to obtain the validation score and the test sets are used for performance evaluation. Figure 2 summarizes the two procedures. Finally, we note that the procedures outlined above are referred to by slightly different names in the literature 42,52,55,58 and in the ML software packages. 98 Hence we recommend using the algorithmic description of these methods given in section S4.1. See section S5.1 for more details.
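A minimal sketch of the two splitting schemes, using scikit-learn index generators, is given below; the 60/20/20 split for SS-CV and k = 5 for KFS-CV are illustrative, and the precise algorithms are those in section S4.1.

```python
# A minimal sketch of the two splitting schemes. SS-CV: repeated random
# train/val/test splits. KFS-CV: one train/test split, then k folds
# within the train part. Only the index logic is shown.
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit, train_test_split

n = 271                                # data set size (placeholder)
idx = np.arange(n)

# SS-CV: repeated random 60/20/20 train/val/test splits
ss = ShuffleSplit(n_splits=20, train_size=0.6, random_state=0)
for train_idx, rest in ss.split(idx):
    val_idx, test_idx = np.array_split(rest, 2)   # 20% val / 20% test
    # train on train_idx, tune on val_idx, evaluate on test_idx

# KFS-CV: a single train/test split, then k folds on the train part
train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_pos, val_pos in kf.split(train_idx):
    fit_idx, val_idx = train_idx[fit_pos], train_idx[val_pos]
    # train on fit_idx, validate on val_idx; test_idx is reserved
    # for performance estimation only
```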

D. Interpolation Grid
The interpolation capabilities of the models can be qualitatively tested by plotting the predicted viscosity values on a grid of interpolated feature values. As the feature space is generally high dimensional (four in this case), only projections onto 1D/2D sub-spaces can be visualized. To keep the visualization uniform across the features, interpolation was done in the scaled feature space, i.e., after the min-max scaler is applied. The Vlugt data set was generated at discrete values of each feature: X 1 : (0.1, 0.3, 0.5, 0.7, 0.9), σ 2 : (1.0, 1.2, 1.4, 1.6), ε 2 : (1.0, 0.8, 0.6, 0.5), k 12 : (0.05, 0.0, -0.3, -0.6). However, the final data set consists of only 250 unique combinations of preMD features, as opposed to 320, had all the combinations been studied.
We have constructed four different interpolation grids to test the interpolation capability across each feature individually. The interpolation grid for a particular feature is generated at values uniformly spaced in the range of 0 to 1 while holding all other features at the values from the Vlugt data set. For example, in order to generate the X 1 interpolation grid, 19 uniformly distributed values between 0 to 1 were used for X 1 , while the values for σ 2 , ε 2 and k 12 were taken from their scaled values in the Vlugt data set. We also define the interpolation distance of any interpolated point as its Euclidean distance from the nearest training data point.
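A minimal sketch of constructing the X 1 interpolation grid and the interpolation distance in the scaled feature space is shown below; `X_train_scaled` is a random placeholder for the scaled Vlugt grid.

```python
# A minimal sketch: build the X1 interpolation grid by sweeping X1 over
# [0, 1] while holding (sigma2, eps2, k12) at their training values,
# then compute each grid point's Euclidean distance to the nearest
# training point (the interpolation distance).
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(6)
X_train_scaled = rng.uniform(size=(250, 4))   # placeholder for Vlugt grid

x1_values = np.linspace(0.0, 1.0, 19)         # 19 uniformly spaced X1 values
others = np.unique(X_train_scaled[:, 1:], axis=0)   # (sigma2, eps2, k12)
grid = np.array([[x1, *rest] for x1 in x1_values for rest in others])

dist = cdist(grid, X_train_scaled).min(axis=1)      # interpolation distance
```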

IV. RESULTS AND DISCUSSION
In this section, we first present evidence related to aspects of model selection and performance evaluation that are pertinent to ML models at the low data regime. We then compare different ML models based on their performance metrics and interpolation behavior. We finally compare the predicted uncertainties of ensemble models and that of a probabilistic ML model (GPR).

A. Model Selection and Performance Estimation

Understanding the Hyperparameter Optimization Landscapes
In order to understand the optimization landscape of the hyperparameters, we use the LASSO and KRR models. They were chosen because they have fewer than three hyperparameters and are hence conducive to visualization of the optimization landscape in 2D plots. Moreover, both models have analytic solutions and are hence much faster to train on small data sets. Both models were also studied previously in the context of model selection (albeit on synthetic data sets) and hence allow a close comparison wherever possible. 55 In this section, we elucidate the dependence of the performance of these ML models on the particulars of the data splitting procedure, which is an essential step for model selection and performance evaluation. The entire data set is randomly split into three parts, train, validation (val) and test sets, in a 60/20/20 ratio. This procedure is repeated N_split times, thereby creating multiple random realisations of the train, validation and test splits. These N_split train sets are used to train N_split ML models at each hyperparameter value. The trained models are then evaluated on their corresponding validation and test sets.

Figure 3 (a) shows the average test MSE of the LASSO model across a wide hyperparameter (α) range. Clearly, α values less than 10^-3 are suited to the viscosity data set. However, no single optimal hyperparameter can be selected, as there is no discernible change in the MSE values for α less than 10^-3. A similar "flat-minima" hyperparameter landscape was also observed by Pfaendtner et al. on an experimental ionic liquid viscosity data set. 42 Hence, strictly applying the common model selection criterion of selecting the hyperparameter with the best performance on the validation set (in this case, minimizing MSE) belies the flat-minima nature of the hyperparameter landscape. Also, the wide confidence interval around the average test MSE indicates that there is significant variability in the performance (MSE) of the ML models across different data splits.

[Figure caption fragment: the marked points indicate the optimal hyperparameters that minimize the individual validation and test MSE, respectively; the pink triangles in (c) mark the optimal hyperparameters that minimize the individual Interpolation MSE. The validation, test, and Interpolation MSE were obtained using the KRR model and the shuffle-split data sampling scheme (see Computational Methods).]
Interestingly, the wide scatter of points in Figure 3 (b) indicates that the individual test MSEs are not correlated with the individual validation MSEs. Hence, model selection criteria based on optimizing the performance on individual validation sets need not necessarily result in good generalization performance. However, the average validation MSE (over N_split splits) is perfectly correlated with the average test MSE, as shown in Figure 3 (c). Consequently, data splitting procedures that reserve a single "unseen" data set (often called the test set 46,76 ) for evaluating the generalization performance should be discouraged, as they suffer from wide variability. We note that this problem is unique to small data sets and that the variability decreases rapidly with increasing data set size. 55

The same procedure was applied to the KRR model to elucidate its hyperparameter optimization landscape. In addition to the validation and test sets, the performance of the KRR models was also evaluated on a single Interpolation set containing 17 data points. Figure 4 shows the 2D contour plots of the average validation, test and Interpolation MSE over a wide range of the KRR hyperparameters α and γ.
The landscapes of the average test and validation MSE are similar, consistent with the LASSO results (Figure 4). The Interpolation MSE landscape is slightly different from the others near the minima while still retaining the overall features. Importantly, it is more rugged than the validation and test MSE landscapes, possibly because the same Interpolation set was used across all the N_split splits. The wide scatter of points in Figure 4 (a, b, c), which mark the optimal hyperparameters selected by minimizing the individual validation, test and Interpolation MSE, respectively, again reflects the variability discussed above. 55 Further, while the average MSE landscapes are smooth, those corresponding to individual realisations of the data splits are rugged, as shown in Figure 5. The validation and test landscapes of these individual splits also show significant differences. For example, Figure 5 (d-f) corresponds to the validation, test and Interpolation landscapes of a randomly chosen realisation, and their optimal hyperparameters vary by more than seven orders of magnitude.
These results demonstrate that there is a wide variability in choosing the optimal hyperparameter values (model selection) and also in estimating the generalization performance (performance evaluation) of the ML models on small data sets. Hence, it is crucial to do both the model selection and performance evaluation tasks by training an ensemble of models on different random splits of the data set.
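A minimal sketch of this splitting-and-averaging procedure is given below, with synthetic data, LASSO as the model and N_split = 50; the mean over the rows of `val_mse`/`test_mse` gives the smooth average landscapes, while individual rows give the rugged ones.

```python
# A minimal sketch of the landscape procedure: N_split random 60/20/20
# splits, a LASSO model trained at each alpha on each split, and the
# val/test MSE recorded per split and per alpha. Data is synthetic.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.uniform(size=(100, 4))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.normal(size=100)

alphas = np.logspace(-8, 0, 17)
n_split = 50
val_mse = np.zeros((n_split, alphas.size))
test_mse = np.zeros_like(val_mse)
for s in range(n_split):
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4,
                                                  random_state=s)
    X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest,
                                                test_size=0.5,
                                                random_state=s)
    for j, a in enumerate(alphas):
        m = Lasso(alpha=a, max_iter=100_000).fit(X_tr, y_tr)
        val_mse[s, j] = mean_squared_error(y_val, m.predict(X_val))
        test_mse[s, j] = mean_squared_error(y_te, m.predict(X_te))

print(val_mse.mean(axis=0))   # smooth average landscape
print(val_mse[0])             # rugged individual-split landscape
```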

Comparison of CV Procedures
The model selection criterion is based on the average validation score obtained from the CV procedures. We observe that the average validation score landscapes are almost identical in both SS-CV and KFS-CV, irrespective of the kind of metric used to construct the landscape (Figure S8). However, the two procedures differ in the variance of the validation landscape, with KFS-CV yielding a lower variance than SS-CV (Figure S9). Cawley et al. show that estimators with lower variance can do a better job of selecting the optimal hyperparameters. 55 Hence, we use KFS-CV to do model selection and performance evaluation on the rest of the ML models: SVR, RF, KNN, and ANN.
Also, in both CV procedures, the variance of the validation landscapes was strongly dependent on the error metric used to construct them. In general, we observe that MSE shows the highest variance, followed by MAE, MAPE and R 2. The variance in the MSE validation landscape is often so high (>100%) that an unambiguous selection of the optimal hyperparameters is difficult. Hence, the use of other metrics that have lower variance, such as MAE, MAPE and R 2, can help rectify the issue. In this work, we use the MAE validation landscape to choose the optimal hyperparameter values. Tables in section S5.1.2 list the optimal hyperparameters and the corresponding values of the metrics for all the ML models studied in this work.
Finally, the performance estimation is done by evaluating the trained models (with the optimal hyperparameters) on the test sets. As both CV procedures result in nearly identical performance scores, we use the KFS-CV method to do the performance estimation for the other ML models: SVR, RF, KNN, and ANN. In the following section, we compare the performance of the various ML methods and rank them using multiple metrics.

B. Model Comparison and Ranking
The predicted viscosity values from all the models, except KNN and LASSO, agree well with the true viscosity values for both the test and interpolation data sets, as shown in Figure 6. The agreement is seen to be good across decades of viscosity values, indicating, at least to the naked eye (Figure 6 (a)), that the models do not bias any particular decade of viscosity values. A detailed discussion of the model bias is presented in the Supporting Information (section S5.2). Figure 6 (c) compares the MAE and MAPE of all the models evaluated using the KFS-CV performance estimation procedure (section III C). The MAPE of the KRR, GPR, ANN and SVR models is below the average standard error (%) of the data (called the threshold henceforth) and hence these can be considered successful models. On the other hand, the KNN and LASSO models have both their test and train MAPE much above the threshold and can hence be considered unsuccessful models.

[Figure 7 caption fragment: each row shows the ranks of the model labeled on the y-axis and each column corresponds to the metric based on which the ranking was done. The metric values are computed using the ensemble predictions of the ML models on the entire Vlugt data set. 72 A lower model rank is better.]
The MAE and MAPE of the successful models, i.e., KRR, GPR, ANN, and SVR, are very close to each other, as seen in Figure 6 (c), and hence more information is required to unambiguously rank them. In Figure 7, we show the relative ranks of all the models based on seven performance-based metrics (MAE, MSE, MAPE, MedAE, MedSE, MedAPE and R 2) and four bias-based metrics (ME, MPE, MedE, MedPE). The mean and median values of the Absolute Error (AE), Squared Error (SE), and Absolute Percentage Error (APE), along with the coefficient of determination (R 2), are the performance metrics, while the mean and median of the Error (E) and the Percentage Error (PE) constitute the bias metrics. An average rank (averaged across the metrics with uniform weights) is also shown for each model; a minimal sketch of this ranking scheme is given at the end of this subsection. The average rank follows the MAE rank closely, with the four successful models, KRR, GPR, SVR, and ANN, having an average rank less than four and the three unsuccessful models having a rank greater than four. According to the average rank, GPR is the best performing model, followed closely by SVR and KRR. These three models are followed by ANN, RF, KNN, and LASSO, respectively, with the last two consistently ranked sixth and seventh. Interestingly, there is considerable mixing of the ranks based on MAE vs MSE, indicating that a holistic approach using a combination of metrics needs to be used to objectively evaluate the models. Another noticeable trend is the disconnect between the rankings based on performance metrics and those based on bias metrics. These findings highlight the need to evaluate models based on metrics beyond the simple loss functions used to train the models themselves, to get a complete picture of the models' accuracy, bias and generalizability. Now that the models have been validated for performance and bias, we discuss their interpolation capabilities next.

[Figure 8 caption fragment: the shading indicates the distance from the nearest training data point that the models have seen, with darker shades being farther away. The viscosity values from the Vlugt data set are also shown (as red diamond symbols) for comparison.]

KRR, SVR, and ANN models show a smooth variation as the feature values move farther away from their corresponding values in the Vlugt data set. On the other hand, the RF and KNN models show sudden discontinuities in the predicted viscosity values at some specific X 1 values. These discontinuities are probably due to the presence of decision boundaries in RF and a sudden change of nearest neighbors in the case of KNN. Such sudden discontinuities are incompatible with viscosity, which is expected to be continuous (at least as long as there is no phase transition).
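The multi-metric ranking referenced above can be sketched as follows, with a placeholder metric table; for error metrics a lower value is better, so rank 1 is best (an R 2 column would need descending ranking instead).

```python
# A minimal sketch of the multi-metric ranking: rank each model per
# metric, then average the ranks with uniform weights. The table is a
# placeholder, not the values reported in this work.
import pandas as pd

scores = pd.DataFrame(
    {"MAE": [0.11, 0.12, 0.13, 0.30],
     "MSE": [0.05, 0.04, 0.06, 0.40],
     "MAPE": [1.5, 1.6, 1.9, 5.0]},
    index=["GPR", "SVR", "KRR", "LASSO"])

ranks = scores.rank(axis=0, method="min")    # rank 1 = lowest error
ranks["average rank"] = ranks.mean(axis=1)   # uniform weights over metrics
print(ranks.sort_values("average rank"))
```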

C. Uncertainty Quantification
As demonstrated in the previous sections, the performance of ML models trained on small data sets can have wide variations. In this regard, models that can estimate the uncertainty on individual predictions can help alleviate this issue. The uncertainty estimates can be used as a guide to end-users about the reliability of a given prediction and thus of the models. More generally, uncertainty quantification has many other applications, 61,63-65,104 such as setting the applicability domain of the ML models, 95 active learning for generating data on the fly, 105,106 etc. In this work, we used two approaches to estimate the uncertainty: a probabilistic ML method and an ensemble ML method. The probabilistic ML methods inherently capture the uncertainty through their model architecture, whereas the ensemble ML approach uses an ensemble of ML models trained on several random realisations of the data. Gaussian Process Regression (GPR) was the choice of probabilistic ML method due to its simplicity (few hyperparameters), wide applicability and near universality. KRR was used to test the ensemble approach, again due to its simplicity (two hyperparameters), training speed (analytical solution) and wide applicability.

Figure 9 compares the standard errors estimated from the ML models and the true standard errors of the data. Both the GPR and ensemble KRR methods show good agreement with the true data. The standard errors predicted by GPR show a slightly better agreement with those of the data than the KRR ensemble does, especially at high error values. This is because the ensemble methods capture the uncertainty in the data indirectly, by repeated sampling from the training data set. On the other hand, the standard error values of the data are directly fed into the GPR training. A similar observation was made by Müller et al. in their comparison of GPR and ensemble methods to predict error bars on solubility data. 95

Information about the closeness of a new query point to the training data would be useful to decide whether or not to trust the values predicted by the model. This information can be naturally encoded into GPR in the form of epistemic uncertainty. 105 Ideally, the predicted standard error should increase beyond the natural uncertainty in the data as the query point moves away from the training data set. Figure 10 (a), (b) show the predicted relative standard error from the GPR model plotted against the interpolation distance, with the colors indicating the predicted mean viscosity value. The predicted relative standard errors clearly increase with increasing interpolation distance. However, this behavior could not be seen in the case of the relative standard errors estimated by the ensemble method, as seen in Figure 10 (c), (d). Hence, the uncertainties estimated by GPR represent the true uncertainties better and also systematically increase when the query points move far from the training set. Additionally, GPR needs to be trained only once, compared to ensemble methods which need to be trained over multiple realisations of the training set.
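A minimal sketch of the two uncertainty routes is given below, with synthetic data: the per-point data variances enter GPR through the `alpha` argument of scikit-learn's GaussianProcessRegressor, while the ensemble spread comes from KRR models trained on random subsets of the training data.

```python
# A minimal sketch of the two uncertainty estimates: (1) GPR with the
# squared standard errors of the data fed in via `alpha`, and (2) the
# spread of an ensemble of KRR models over resampled training sets.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(8)
X = rng.uniform(size=(100, 4))
se = 0.02 + 0.02 * rng.uniform(size=100)         # per-point std errors
y = np.sin(X.sum(axis=1)) + se * rng.normal(size=100)

# probabilistic route: data noise variance enters the GPR directly
gpr = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(),
                               alpha=se**2).fit(X, y)
X_new = rng.uniform(size=(5, 4))
mean, std = gpr.predict(X_new, return_std=True)  # per-point uncertainty

# ensemble route: spread of predictions over resampled training sets
preds = []
for seed in range(30):
    sub = np.random.default_rng(seed).choice(len(X), size=80, replace=False)
    krr = KernelRidge(kernel="rbf", alpha=1e-3).fit(X[sub], y[sub])
    preds.append(krr.predict(X_new))
ens_std = np.std(preds, axis=0)                  # ensemble uncertainty
```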
Finally, the standard errors predicted by the GPR can be used to construct the applicability domain (AD) of the ML models. 95 For example, for queries which have a distance less than 0.2 (in the scaled feature space) from the nearest training point, a relative error of less than 10% can be expected (Figure 10). Hence, the scaled distance of 0.2 can be set as a limit for the Applicability Domain, beyond which the predictions from the ML models need to be treated with caution. Though such a distance based approach is simple, it can be applied to any ML model, thereby justifying its use. 95 Also, such distance based AD methods have been successfully implemented for QSAR 93 and ML models. 95 The interpolation data set was split into In-AD and Out-AD based on whether the points fell within the AD or outside it. The performance of all the ML models on the In-AD data set is much better than that on the Out-AD set, demonstrating that the application of the AD can be used to detect and remove outliers (Figure 11). Also, the improvements from the AD (though it was constructed from GPR only) were observed across all the ML models and performance metrics.
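A minimal sketch of this distance-based AD, assuming the 0.2 cutoff in the scaled feature space, is given below; the training and query arrays are placeholders.

```python
# A minimal sketch of the distance-based applicability domain: query
# points whose scaled distance to the nearest training point is below
# 0.2 are In-AD, the rest Out-AD.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(9)
X_train = rng.uniform(size=(250, 4))    # placeholder scaled training set
X_query = rng.uniform(size=(17, 4))     # placeholder interpolation set

dist = cdist(X_query, X_train).min(axis=1)   # nearest-neighbor distance
in_ad = dist < 0.2                           # AD cutoff (Figure 10)
print("In-AD:", int(in_ad.sum()), "Out-AD:", int((~in_ad).sum()))
```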

V. CONCLUSIONS
In this work, we trained and evaluated several ML models to predict the shear viscosity of binary LJ fluids. Being a collective property, shear viscosity is expensive to compute from equilibrium atomistic MD simulations and hence only small data sets can be found in the literature. The major challenges posed by small data sets to ML methods are discussed. Specifically, we focus on: 1. model selection and performance estimation, 2. performance metrics, and 3. uncertainty quantification.
ML models are prone to overfitting on small data sets at both the model parameter and hyperparameter levels. We discuss various model selection methods (K-fold CV, nested K-fold CV, Monte Carlo CV, etc.) that are generally used to address this issue. While these methods are commonly used for selecting hyperparameters, the generalization error (performance evaluation) is usually estimated on a single unseen data set (test set). We demonstrate that such estimates are prone to wide variability because of the small data set size. The hyperparameter optimization landscapes of the LASSO and KRR models are shown to have "flat minima", thereby making it difficult to unambiguously select the optimal hyperparameters. We compared two simple CV procedures, SS-CV and KFS-CV, and found that while their mean error estimates were almost identical, KFS-CV showed a lower variance. Hence, it was chosen to perform both the model selection and performance estimation tasks simultaneously.
We discuss the role of performance metrics in model training and model evaluation, from both theoretical and empirical standpoints. We compare several commonly used metrics like MSE, MAE, MAPE and R 2 and discuss their relevance to the viscosity data set. We propose a holistic model ranking procedure based on inputs from multiple complementary metrics. The interpolation behavior of the ML models is compared qualitatively. While the KRR, ANN and SVR models showed smooth interpolation behavior, the RF and KNN models showed sudden discontinuities and are hence considered unsuccessful. The successful models are also shown to outperform the best-in-class empirical model of Hasse et al. 103 We present two methods to estimate the uncertainty in individual predictions from the ML models: 1. Gaussian Process Regression (GPR) and 2. an ensemble of KRR ML models. The uncertainty (in terms of standard error) estimated by both methods showed overall agreement with the true uncertainty of the data, with GPR faring slightly better. The behavior of the estimated uncertainty from both methods in the interpolation feature range is also compared. The GPR uncertainty steadily increased as the query data points moved away from the training data, while no discernible pattern could be identified in the uncertainty from the ensemble method. The relative standard error estimated from the GPR model can be used to set distance limits for the query points, thereby defining the applicability domain (AD) in which the results from the ML models are reliable. We found that the points of the interpolation set that fell within the AD were better estimated by the ML models than the ones that fell outside, thereby demonstrating the utility of the AD. Finally, the principles discussed in this work can be applied to develop ML models of viscosity for more complex fluids. However, for such fluids, the identification of the features that are most relevant to shear viscosity would also be non-trivial and constitutes a part of our future work.

SUPPLEMENTARY INFORMATION
See the Supplementary Information for additional information on theoretical background, model description, Vlugt data set, empirical model description, computational details of model selection, model performance, bias, etc.

Conflict of Interest
The authors have no conflicts to disclose.

Data Availability

The data that support the findings of this study are available within the article and its supplementary material.

S1.1 The structure of the problem
We assume that there exists a joint probability density p(x, y) that generated the data set. 1,2 Here, x is a vector of input features and y is the target variable, also called the label. In the context of shear viscosity prediction, the feature vector can be constructed from quantities like x 1 , σ 2 , ε 2 , k 12 , ζ, ρ* and the target variable is the shear viscosity η (see section S3.1).
The task of the ML algorithm is to infer the joint probability density (or properties thereof) from the finite data set generated by running the MD simulations. This inference task can be classified into three levels of progressively lesser complexity: 1

1. Determine the joint density p(x, y). It is the most demanding of the three and generally needs a huge amount of data, especially when x is of high dimensionality. But if this task is achieved, p(x, y) can be used to generate new data. We do not attempt this task in this work due to the sparsity of the viscosity data set.
2. Determine the conditional density p(y|x). It is much simpler than the above because we bypass the difficult task of estimating p(x). Gaussian Process Regression (GPR) method falls under this category.
3. Determine a function f*(x) that is an optimal representation of the data set. It is the simplest and the most common ML approach of the three. The sense in which the function f*(x) is optimal is often taken to be the one that minimizes the expected loss (also called the risk) E[L] (Eq S1). Most common ML models, like Kernel Ridge Regression (KRR), fall under this category.

$$E[L] = \iint L\big(y, f(\mathbf{x})\big)\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy \qquad (S1)$$

where L(y, f(x)) is the user-defined loss function. The choice of the loss function has a direct relation to the kind of function obtained. 3 The most common loss function, the squared loss, where L(y, f(x)) = (y − f(x))², yields the conditional mean E_y[y|x] (Eq S2) as f*(x): 1

$$f^*(\mathbf{x}) = E_y[y|\mathbf{x}] = \int y\, p(y|\mathbf{x})\, dy \qquad (S2)$$

Discussion on other loss functions and the consequent effect on the properties of f*(x) is presented in section S1.3.
However, to compute the expected loss/risk, the underlying joint probability density p(x, y) has to be known, which is hard to do in practice. Hence, the expected loss is approximated by the empirical loss computed over the finite data set (Eq S3):

L_emp = (1/N) Σ_{i=1}^{N} L(y_i^true, f(x_i)),    (S3)

where N is the number of data points and y_i^true is the true value of the target for the i-th data point. Directly minimizing the empirical loss with a flexible model can, however, fit the noise in the finite training sample, a problem known as overfitting.

The most common way to alleviate the problem of overfitting is to reduce the complexity/capacity of the learning machine, thereby reducing its ability to learn the noise associated with the training data sample. However, the complexity should not be reduced to such an extent that the general trends in the data are lost, resulting in underfitting. Hence, the ML model should have an optimal complexity corresponding to the general trends in the data. A popular method to control the complexity of the models is regularization, in which a penalty term (called the regularizer) that penalizes complex models is added to the empirical loss (Eq S4): 2,4

L_reg = L_emp + Σ_j λ_j Ω_j(f),    (S4)

where Ω_j(f) are the regularizers and λ_j are the parameters that control the amount of regularization. The common forms of the regularizer are based on the norm of the weights (w) of the model, such as the L2 norm (ridge regression or Tikhonov regularization), the L1 norm (as in the LASSO model), or a combination of both (as in the elastic net). 2 We also note that there are many other regularization techniques specific to Deep Learning (DL) methods, such as early stopping, dropout, and soft weight sharing. 5

Now that the ML models have a mechanism to control the complexity through regularization, the natural next step is to choose the values of the regularization parameters. This task falls under the purview of model selection. 2 In the following section, we discuss various model selection criteria and the closely related topic of performance evaluation.
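As a concrete illustration of Eq S4, the sketch below fits a ridge (L2-regularized) model at several regularization strengths. It is a minimal example on synthetic data; scikit-learn's `alpha` plays the role of λ, and the values shown are illustrative only.

```python
# Minimal sketch on synthetic data: L2 (ridge/Tikhonov) regularization.
# The penalty lambda * ||w||^2 is added to the empirical squared loss (Eq S4).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))             # 30 points, 4 stand-in features
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

for lam in (1e-4, 1e-1, 1e2):            # weak -> strong regularization
    model = Ridge(alpha=lam).fit(X, y)
    # stronger regularization shrinks the weight norm, reducing complexity
    print(lam, np.linalg.norm(model.coef_))
```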

S1.2 Model Selection and Performance Evaluation
It is common practice to divide the parameters of ML and DL models into model parameters and hyperparameters. The task of selecting the model with the optimal complexity then reduces to estimating the values of the hyperparameters, and the criteria used for such selection are called model selection criteria. As stated earlier, the goal of ML models is to minimize the generalization error, which is the average error over all unseen data. However, the generalization error cannot be obtained in most practical situations, and hence estimators on finite data sets are constructed to approximate it. The process of estimating the generalization error by using estimators on finite data sets is called performance evaluation and is a prerequisite for model selection. It is crucial to note that the error estimates are obtained over finite data sets and hence depend on the size of the data set, especially for small data sets.

A simple example of such an estimator is the split sample estimator, where the whole data set is split into two parts (generally unequal) and the error is computed on the split that was not used for training. 6 The split sample estimator is known to be unbiased, i.e., the average split sample error over multiple independent realizations of unseen data asymptotically converges to the generalization error. Hence, minimizing the split sample error can, in principle, reduce the generalization error. However, it was recently shown that unbiasedness per se is not as important as the variance of the estimator when it is used for model selection. 6 When an estimator has high variance (as occurs with small data sets 6 ), the value of the estimated error on any one particular unseen data sample can be very different from the generalization error; hence, the hyperparameters that minimize the estimated error can be far from the optimal ones. Cawley et al. showed (on a synthetic data set) that hyperparameters selected based on split sample estimators can severely overfit or underfit the data. 6 In practice, users rarely have the capability of generating multiple independent realizations of the data, and hence the variance of the estimator plays a major role. Therefore, for small data sets, it is not considered good practice to estimate the error on a single realization of the data set. 6 To mitigate this problem, various cross-validation schemes are generally used.
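The variance of the split sample estimator is easy to demonstrate numerically. The sketch below (synthetic data; the model and split sizes are illustrative assumptions) trains the same model on many random train/test splits of a small data set and reports the spread of the held-out error estimates.

```python
# Minimal sketch on synthetic data: the spread of split-sample error
# estimates across random splits of a small data set illustrates the
# high variance that makes a single held-out split unreliable.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(size=(50, 2))                     # small data set
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=50)

errors = []
for seed in range(100):                           # many independent splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = KernelRidge(kernel="rbf", alpha=1e-2).fit(X_tr, y_tr)
    errors.append(mean_squared_error(y_te, model.predict(X_te)))

print(np.mean(errors), np.std(errors))            # wide spread for small N
```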
The core idea of k-fold cross-validation (CV) is to split the entire data set into k equal disjoint sets, train the ML models on k − 1 sets, and estimate the error on the remaining set. This process is repeated k times, each time with a different hold-out set. 1,2 The average error over the k folds is used as the estimate of the generalization error. It is common practice to use 5 or 10 folds during CV. 2 The extreme case, when the number of folds equals the number of data points (k = N), is called the Leave One Out (LOO) method. LOO is also popular because, for some ML models, the LOO error can be computed without training the ML model k times, thereby reducing the computational cost. 6,7 The error estimates from k-fold CV are often used for model selection by searching over the space of hyperparameters and choosing the set that yields the minimum CV error. But once the k-fold CV error is used to optimize the hyperparameters, it is no longer unbiased. 2,8,9 Typically, another unseen data set (called the test set) is used to estimate the generalization error of the models with optimized hyperparameters. 2 Using a single realization of the test set, however, suffers from the high variance issue discussed above; a minimal sketch of this model selection workflow is given below.
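The sketch below assumes scikit-learn and a synthetic data set (the hyperparameter grid and fold counts are illustrative): GridSearchCV implements the k-fold search over hyperparameters, and LeaveOneOut recovers the k = N limit.

```python
# Minimal sketch on synthetic data: k-fold CV for hyperparameter (model)
# selection, with LOO as the k = N limit. GridSearchCV picks the alpha
# minimizing the mean CV error; that selected score is then biased.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, KFold, LeaveOneOut

rng = np.random.default_rng(3)
X = rng.uniform(size=(50, 2))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=50)

grid = {"alpha": np.logspace(-4, 1, 6)}
search = GridSearchCV(KernelRidge(kernel="rbf"), grid,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0),
                      scoring="neg_mean_squared_error").fit(X, y)
print(search.best_params_)

loo = GridSearchCV(KernelRidge(kernel="rbf"), grid, cv=LeaveOneOut(),
                   scoring="neg_mean_squared_error").fit(X, y)  # k = N
print(loo.best_params_)
```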
S1.3 Performance Metrics

On the choice of performance metrics, there is no clear consensus in the literature. 3,14,15 As is the case with model selection criteria, there is no single best metric that can be used across all ML tasks. 15 In this section, we summarize some principles that can be used to choose a metric relevant to the particular ML task at hand, and also consider the particular case of the viscosity data set.

As discussed in section S1.1, models that learn a function f(x) (inference level 3) lose some of the information contained in the joint density p(x, y) from which the data is assumed to be generated. Such models generally aim to predict point estimates related to a central tendency of the conditional density p(y|x), like the conditional mean or the median. 3 The choice of the loss function thus decides the kind of central tendency of p(y|x) captured by f(x). 3,16 For example, the f(x) that minimizes the expected value of the squared loss is the mean of p(y|x). Hence, f*(x) can be thought of as a functional of the conditional density p(y|x), which can be chosen by choosing the loss function L. 16 Table S1 summarizes some popular loss functions and their corresponding functionals.

Table S1. Loss functions and the functionals of p(y|x) they yield.

Loss       L(y_true, y_pred)                                   Functional
APE        |(y_pred − y_true)/y_true|                          β-median 16
RE         |(y_pred − y_true)/y_pred|                          β-median 16
quantile   τ(y_true − y_pred) if y_true − y_pred > 0;          quantile 17-19
           (τ − 1)(y_true − y_pred) if y_true − y_pred < 0

From the perspective of loss functions in model training, model developers have two approaches: (1) choose the loss function first and work out the kind of functional (if possible) from it, or (2) choose the desired functional and work out the corresponding loss function. 16 Most ML practitioners use the first approach because of its simplicity and the availability of efficient numerical optimizers for minimizing the common loss functions. The latter approach is generally used in quantile regression models, which can be used to predict confidence intervals around the mean (as sketched below). 19 However, it must be noted that the connection between the loss function and the corresponding functional is clear only when the model minimizes the expected loss without any additional constraints. As most ML methods rely on regularization to avoid overfitting, the resulting functional can differ from that of pure expected loss minimization. Also, not all ML models use expected loss minimization, and hence it might not be straightforward to work out their corresponding functionals.
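As a minimal sketch of the second approach (choose the functional, then the loss), the example below uses the pinball (quantile) loss of Table S1 via scikit-learn's gradient boosting regressor on synthetic data; the quantile levels and model settings are illustrative assumptions, not those of this work.

```python
# Minimal sketch on synthetic data: the quantile (pinball) loss makes the
# model predict a conditional quantile instead of the conditional mean.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
X = rng.uniform(size=(200, 1))
y = 1.0 / (X.ravel() + 0.2) + 0.2 * rng.normal(size=200)

models = {tau: GradientBoostingRegressor(loss="quantile", alpha=tau).fit(X, y)
          for tau in (0.1, 0.5, 0.9)}       # lower band, median, upper band
X_q = np.array([[0.5]])
lo, med, hi = (models[t].predict(X_q)[0] for t in (0.1, 0.5, 0.9))
print(lo, med, hi)                          # an ~80% predictive interval
```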
Another area in which loss functions are used in the ML workflow is model comparison, in which models are ranked based on their generalization performance. Ideally, the generalization performance of ML models should be measured using the same metric that was used in their training phase. [Kernel definitions: linear kernel, k(x, x') = x^T x'; Gaussian (radial basis function) kernel, k(x, x') = exp(−‖x − x'‖^2 / 2σ^2). Pseudocode of the CV procedure, which shuffles the points of the data set D at each repetition and returns the validation and test errors (ve, te).]
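The sketch below illustrates such a multi-metric comparison on synthetic data: two models are evaluated with MSE, MAE, MAPE and R2, and their relative ranking need not be the same under every metric. The models, data, and hyperparameters are illustrative stand-ins (MAPE requires scikit-learn >= 0.24).

```python
# Minimal sketch on synthetic data: evaluating two trained models with
# several complementary metrics; a model can rank first on one metric
# and lower on another, motivating a multi-metric ranking.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X = rng.uniform(size=(60, 2))
y = np.exp(2 * X[:, 0]) + 0.1 * rng.normal(size=60)
X_tr, y_tr, X_te, y_te = X[:45], y[:45], X[45:], y[45:]

for name, model in [("KRR", KernelRidge(kernel="rbf", alpha=1e-2)),
                    ("SVR", SVR(C=10.0))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name,
          mean_squared_error(y_te, pred),
          mean_absolute_error(y_te, pred),
          mean_absolute_percentage_error(y_te, pred),
          r2_score(y_te, pred))
```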

S4.2 Empirical Correlation
We could not find an empirical model (also called an empirical correlation) which is applicable [...]. [Figure: comparison between the empirical correlation 24 and Vlugt's simulation data. 22 The blue dashed line indicates y = x and is drawn as a guide to the eye.]

S4.3 Interpolation Data Set
A small data set, called the interpolation data set, was created in this work to test the predictive performance of the ML models away from the Vlugt data grid. This data set consists of 17 randomly chosen points in the interpolation region of the Vlugt data grid. We note that the entire interpolation space is not in a single liquid phase and some data points tend to phase separate. For example, we started out by simulating 20 interpolation data points, of which three showed signs of phase separation and were discarded, leaving the 17 points used here.

[Figure: Predictions of the GPR models trained using the four preMD features (see the Computational Details section). Black squares: predictions on the entire Vlugt data; orange triangles: predictions on interpolation data points within the Applicability Domain; red stars: predictions on interpolation data points outside the Applicability Domain.]

[Figure: Same as above, for the KRR models trained using the four preMD features.]

[Figure: Same as above, for the RF models trained using the four preMD features.]
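A minimal sketch of how an applicability domain based on the GPR relative standard error might be implemented is given below; the data, kernel, and threshold value are illustrative assumptions, not the settings used in this work. Query points flagged as outside the AD would be reported with a warning rather than a bare prediction.

```python
# Minimal sketch on synthetic data: the relative standard error sigma/|mu|
# from a trained GPR defines an applicability domain (AD); query points
# above a user-chosen threshold are flagged as outside the AD.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(6)
X_train = rng.uniform(0.2, 0.8, size=(40, 1))
y_train = 1.0 / X_train.ravel() + 0.05 * rng.normal(size=40)
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(X_train, y_train)

X_query = np.linspace(0.0, 1.2, 13).reshape(-1, 1)   # some points far away
mu, sigma = gpr.predict(X_query, return_std=True)
rel_err = sigma / np.abs(mu)

threshold = 0.10                        # hypothetical AD cutoff
inside_ad = rel_err < threshold         # True -> prediction deemed reliable
print(np.column_stack([X_query.ravel(), rel_err, inside_ad]))
```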