
Analysis of machine learning prediction reliability based on sampling distance evaluation with feature decorrelation

Evan Askanazi and Ilya Grinberg

Published 3 May 2024. © 2024 The Author(s). Published by IOP Publishing Ltd.

Citation: Evan Askanazi and Ilya Grinberg 2024 Mach. Learn.: Sci. Technol. 5 025030. DOI: 10.1088/2632-2153/ad4231

Abstract

Despite successful use in a wide variety of disciplines for data analysis and prediction, machine learning (ML) methods suffer from a lack of understanding of the reliability of predictions due to the lack of transparency and the black-box nature of ML models. In materials science and other fields, typical ML model results include a significant number of low-quality predictions. This problem is known to be particularly acute for target systems which differ significantly from the data used for ML model training. However, to date, a general method for uncertainty quantification (UQ) of ML predictions has not been available. Focusing on intuitive and computationally efficient similarity-based UQ, we show that a simple metric based on Euclidean feature space distance and sampling density, together with the decorrelation of the features using Gram–Schmidt orthogonalization, allows effective separation of accurately predicted data points from data points with poor prediction accuracy. To demonstrate the generality of the method, we apply it to support vector regression models for various small data sets in materials science and other fields. We also show that this metric is a more effective UQ tool than the standard approach of using the average distance of k nearest neighbors (k = 1–10) in feature space for similarity evaluation. Our method is computationally simple, can be used with any ML method and enables analysis of the sources of the ML prediction errors. It is therefore suitable for use as a standard technique for the estimation of ML prediction reliability for small data sets and as a tool for data set design.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The use of machine learning (ML) has become increasingly popular in materials science due to the capability of ML methods to capture the main trends in a data set by fitting complex nonlinear models. Machine learning methods have similarly been applied in a wide variety of other disciplines for data analysis and prediction. While deep learning methods have seen rapid development in the past decade, they cannot be applied to the relatively small data sets that are common in materials science and other fields. For such data sets, traditional ML methods such as support vector regression, random forest and XGBoost [1–3] must be used. Despite their success, ML methods have several disadvantages, such as a lack of interpretability and an inability to accurately estimate the reliability of ML predictions, in contrast to methods based on fundamental scientific principles (e.g. quantum mechanics) where the accuracy of a prediction can be estimated based on the approximations of the method [4]. For example, for a data set of formation energies of transparent conducting oxide (TCO) materials used in a recent ML study [5], while the ML predictions are quite accurate in most cases, they still show some error in ∼10% of the cases and severe failures in about 2% of the cases (figure 1). To estimate the reliability of the prediction for a given system, an ensemble of prediction models can be used to evaluate the standard deviation of the predicted values. However, this method is not always accurate and may not predict the failure of the model for data that are out of the distribution of the training data.


Figure 1. Reliability of predictions by standard methods. (a) Predicted versus actual formation energies for the TCO formation energy dataset. (b) Distribution of the errors in the formation energy values predicted by SVR.


This has motivated a research effort focusing on uncertainty quantification (UQ) for machine learning models [1–13]. Among the methods used for UQ, approaches based on similarity evaluation, i.e. on the distances between the point to be predicted and the points in the training set, are advantageous due to their low computational cost and simplicity of interpretation. Intuitively, the similarity of the data point to be predicted to the data points used in the training set should be related to the reliability of the prediction. Thus, the distance in feature space between the data points should provide information regarding the reliability of the prediction.

However, the application of this intuitive concept in practice is not straightforward. For example, it is unclear how the similarity metric should be defined. Most commonly, the average distance between the point i to be predicted and the k points in the training set nearest to point i has been used. However, one can imagine that a predicted data point located a moderate distance away from several training data points will be predicted more reliably than a data point that has only one close neighbor and many distant neighbor training points. Thus, a method for properly weighting the effects of different data points located at different distances must be identified. Furthermore, since the features have different weights and are often strongly correlated with each other, it is also unclear what the proper method for evaluating the distance is. In practice, while in some cases the approach of using k nearest neighbors has proven to be effective [13], in other cases, as demonstrated below and in previous work, this approach does not provide sufficient separation between the low- and high-error predictions. An alternative method based on the average of the training-set errors weighted by the normalized dot products of the feature vectors of point i and points j was proposed by Korolev et al [6]. This average, called the delta metric, is defined for data point i as

$$\Delta_i = \frac{\sum_j K_{ij}\,\epsilon_j}{\sum_j K_{ij}} \qquad (1)$$

where j runs over all of the data points in the training set, $\epsilon_j$ is the error of the model for point j in the training set, and $K_{ij}$ is the weight of training set point j for the predicted point i, given by

$$K_{ij} = \frac{\langle \mathbf{x}_i, \mathbf{x}_j \rangle}{\left\lVert \mathbf{x}_i \right\rVert \, \left\lVert \mathbf{x}_j \right\rVert} \qquad (2)$$

where $\mathbf{x}_i$ and $\mathbf{x}_j$ are the feature vectors of data points i and j. However, despite some successes, the delta metric was found to be only a moderately effective UQ tool for several chemistry datasets [11, 12]. Thus, while numerous studies have probed the relationships between predictability and domains of applicability, feature space and pointwise distance [1–8, 14–19], in some cases using complex models, a generally reliable method for predicting the errors of ML models based on distance in feature space is still unavailable.
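For reference in the comparisons below, a minimal Python sketch of the delta metric of equations (1) and (2) is given here; the variable names (X_train, eps_train) are ours, and the code is a sketch rather than the reference implementation of [6].

```python
import numpy as np

def delta_metric(x_i, X_train, eps_train):
    """Delta metric of equations (1) and (2): average of the training-set
    errors eps_j, weighted by the normalized dot product (cosine similarity)
    K_ij between the feature vectors of point i and training points j."""
    K = X_train @ x_i / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_i))
    return np.sum(K * eps_train) / np.sum(K)
```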

In this work, we focus on the development of a simple similarity-based metric for UQ that does not require training of multiple models and allows single-shot estimation of uncertainty by estimating how densely the known data points sample the region of feature space relevant to the predicted data point, using decorrelated features. The inverse of the average sampling density is then defined as the average sampling distance. Our method offers two advances over the baseline method of using the average distance of k nearest neighbor points. First, rather than using an arbitrary number of closest data points in feature space, we introduce a formula for calculating the average sampling distance that takes into account all data points without any arbitrary adjustable parameters; this formula is justified theoretically based on the concept of sampling density. Second, we use the Gram–Schmidt procedure to decorrelate the features, enabling a more accurate evaluation of the feature space distance. Compared to the delta metric, our method has the advantage of not using the values of the errors of the training data, which may be underestimated relative to the errors of the test data due to overfitting.

Using 15 different data sets taken from previous applications of ML in materials science and other disciplines, we show that the ML prediction error systematically increases with increasing average sampling distance (lower sampling density), allowing the classification of the predicted data as high-, medium- and low-reliability based on the evaluated average sampling density or sampling distance. This analysis also allows us to identify data points for which the features used in the data sets are insufficiently descriptive, such that the prediction error is due to poor features rather than to an insufficiently flexible model. We show that our method is more effective for UQ than both the standard approach of using the average distance of k nearest neighbor points and the delta metric of Korolev et al [6]. The proposed analysis is simple and computationally inexpensive, and can be applied to a wide variety of datasets and ML methods used in various disciplines. It may also serve as a basis for further development and application for reliability analysis of deep learning predictions, for future improvement of ML prediction accuracy through machine learning targeted to specific regions of feature space, and as a tool for data set design.

2. Methods

We follow the approach of using the distances in feature space between the data point to be predicted (for which the features are known but the target property is not) and the known data points (for which both the features and the target property values are known). To determine the strength of the relationship between machine learning prediction ability and feature space distances, we first need a reliable determination of a data point's location within the feature space. This entails a coordinate transformation of the feature space such that the distance between any two data points can be evaluated in the same manner as in, for instance, a Cartesian, cylindrical or polar coordinate system. In such coordinate systems, the coordinate vectors are orthogonal, making distance evaluation straightforward, for example using the standard Euclidean distance formula in Cartesian space. However, the features used as the coordinates of different dimensions in machine learning are almost always correlated to some degree and can be highly correlated, making the standard Euclidean formula inaccurate for the evaluation of the true distance. As illustrated in figure 2, for a system with highly correlated features, some distances evaluated by the Euclidean formula to be large are in fact small. Therefore, it is necessary to decorrelate or orthogonalize the features to make the standard Euclidean formula applicable.


Figure 2. The underlying logic of the distance metric formula. (a) Illustration of the difference between the true distance and the distance estimated by the simple Euclidean formula for non-orthogonal coordinate systems. (b) Illustration of the derivation of the distance metric formula from the local sampling density.


To orthogonalize the features, we apply the well-known Gram–Schmidt (GS) procedure; by applying this technique to the input data features, we obtain an orthonormal basis from which an interpretable metric for the distance between data points in feature space can be constructed.

For each data set with N points, where each point is described by n features, we start with target properties $y_i$ and features $x_{k,i}$, where i = 1, ..., N is the index of the data points and k = 1, ..., n is the index of the features. The linear correlation coefficient between features k and l for the dataset can be calculated as

$$r_{kl} = \frac{\sum_{i=1}^{N} \left(x_{k,i}-\bar{x}_k\right)\left(x_{l,i}-\bar{x}_l\right)}{\sqrt{\sum_{i=1}^{N} \left(x_{k,i}-\bar{x}_k\right)^2}\,\sqrt{\sum_{i=1}^{N} \left(x_{l,i}-\bar{x}_l\right)^2}} \qquad (3)$$
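These correlations can be inspected directly; a minimal sketch follows, in which the feature matrix X and its construction are stand-ins of ours rather than data from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # stand-in (N, n) feature matrix
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]   # introduce a strong correlation

# Pairwise linear correlation coefficients r_kl (equation (3)); with
# rowvar=False, np.corrcoef treats each column (feature) as a variable.
r = np.corrcoef(X, rowvar=False)          # (n, n) correlation matrix
```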

Thus, we can consider each feature k as a vector $X_k$ of length N given by $X_k = (x_{k,1}, x_{k,2}, x_{k,3}, \ldots, x_{k,N})$, where, as explained above, these vectors are not orthogonal, so that $\langle X_k, X_l \rangle \neq 0$. The Gram–Schmidt procedure is applied to the n vectors $X_k$ to obtain orthogonal vectors $X'_k$

$$X'_k = X_k - \sum_{l=1}^{k-1} \operatorname{proj}_{X'_l}\!\left(X_k\right) \qquad (4)$$

where $\operatorname{proj}_{X'_l}(X_k)$ is the projection of the vector $X_k$ onto the vector $X'_l$, defined by

$$\operatorname{proj}_{X'_l}\!\left(X_k\right) = \frac{\langle X_k, X'_l \rangle}{\langle X'_l, X'_l \rangle}\, X'_l \qquad (5)$$

We use the QR decomposition function in the NumPy package to carry out the Gram–Schmidt orthogonalization of X to obtain X'.
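A minimal sketch of this step is given below; the function name and the treatment of the test-set features are our assumptions, as the article does not specify how test points are transformed into the orthogonalized basis:

```python
import numpy as np

def orthogonalize_features(X_train, X_test):
    """Decorrelate features via the reduced QR decomposition: the columns of
    Q are orthonormal and span the same space as the Gram-Schmidt vectors."""
    Q, R = np.linalg.qr(X_train, mode='reduced')
    # Q = X_train @ inv(R); apply the same linear map x' = x @ inv(R)
    # to the test features so that both sets live in the same basis.
    Xp_test = np.linalg.solve(R.T, X_test.T).T
    return Q, Xp_test
```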

Once the orthogonal feature vectors X' are obtained, where $X'_k = (x'_{k,1}, x'_{k,2}, x'_{k,3}, \ldots, x'_{k,N})$, we obtain the dataset in terms of orthogonal features, where each data point i is characterized by $(x'_{1,i}, x'_{2,i}, x'_{3,i}, \ldots, x'_{n,i})$. These features are then used to construct the SVR machine learning model for predicting the target variables $y_i$. We use the scikit-learn package to construct the SVR models.
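Continuing the sketch above, the model construction might look as follows; the kernel and hyperparameters are placeholders of ours, since the article does not report them:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Orthogonalized training and test features from the previous sketch.
Xp_train, Xp_test = orthogonalize_features(X_train, X_test)

# SVR on the decorrelated features x'_k; kernel and hyperparameters are
# placeholders, not values reported in the article.
model = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0, epsilon=0.1))
model.fit(Xp_train, y_train)
y_pred = model.predict(Xp_test)
```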

It is intuitively obvious that features that are unimportant (i.e. not correlated with the predicted target property) should not influence prediction accuracy and therefore should not contribute to the distance metric, as demonstrated in a recent deep learning study [20]. Therefore, in the distance metric, the features should be weighted by their relative importance. The extent to which the Gram–Schmidt procedure and the weighting by feature importance improve the utility of the distance metric depends on the extent to which the dataset contains features that are strongly correlated with each other and/or features that are dominant in relating the input data to the target variable. Thus, we evaluate the distance in feature space with n features $x'_k$ (k = 1, ..., n) between data points i and j as

$$d_{ij} = \sqrt{\sum_{k=1}^{n} w'_k \left(x'_{k,i} - x'_{k,j}\right)^2} \qquad (6)$$

where $x'_k$ are the features constructed from the features $x_k$ using Gram–Schmidt orthogonalization and $w'_k$ are the weights of these features obtained for the ML model trained using $x'_k$.
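A minimal sketch of equation (6) follows. Since the article does not specify how the weights $w'_k$ are extracted from the trained model, permutation importance is used here purely as an illustrative assumption:

```python
import numpy as np
from sklearn.inspection import permutation_importance

# Feature weights w'_k for the decorrelated features (assumption: derived
# from permutation importance of the trained model on the training split).
imp = permutation_importance(model, Xp_train, y_train,
                             n_repeats=10, random_state=0).importances_mean
w = np.clip(imp, 0.0, None)   # keep the weights non-negative
w = w / w.sum()               # normalize to unit sum

def weighted_distance(xp_i, xp_j, w):
    """Weighted Euclidean distance of equation (6) in the decorrelated space."""
    return np.sqrt(np.sum(w * (xp_i - xp_j) ** 2))
```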

We now consider the question of how to combine the different distances $d_{ij}$ into a single distance metric. We first consider the simple case of a data point i surrounded by N equally spaced points j such that all $d_{ij}$ are the same and equal to d. This is illustrated for N = 4 by the points on the left of figure 2(b). In this case, the sampling density of the target property (function) in the vicinity of point i ($S_i$) is

$$S_i = \frac{N}{d} \qquad (7)$$

Then, if we consider a data point i surrounded by N points j at different distances (figure 2(b), points on the right, N = 4), the average sampling density of the target property (function) in the vicinity of point i ($S_i$) will be given by

$$S_i = \sum_{j=1}^{N} \frac{1}{d_{ij}} \qquad (8)$$

We can therefore define the distance metric $D_i$ (the average sampling distance) that measures the quality or density of the sampling in the vicinity of point i for a dataset with N data points with known target property values as

$$D_i = \frac{N}{S_i} = \frac{N}{\sum_{j=1}^{N} 1/d_{ij}} \qquad (9)$$

$D_i$ is thus the harmonic mean of the distances $d_{ij}$, which reduces to d for the equally spaced case considered above. We expect that larger errors will tend to be obtained for predicted data points with larger $D_i$ values. While even for large values of $D_i$ (small sampling density around data point i) some predictions will be accurate simply by chance, the errors will be distributed over a larger range, so that a larger MAE will be obtained. By contrast, for small values of $D_i$, the error distributions should be very narrow and the MAE values will be small.
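A minimal sketch of equation (9), continuing the code above:

```python
import numpy as np

def sampling_distance(xp_i, Xp_known, w):
    """Average sampling distance D_i of equation (9): the harmonic mean of
    the weighted distances d_ij from point i to all N known data points."""
    d = np.sqrt((w * (Xp_known - xp_i) ** 2).sum(axis=1))  # d_ij, equation (6)
    return len(d) / np.sum(1.0 / d)                        # N / sum_j (1/d_ij)
```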

3. Results

We examine this hypothesis for 15 data sets used in previous ML studies in materials science and other disciplines. The seven materials science datasets are the formation energy [21] and band gap energy [21] of TCOs, the activation energy for dilute solutes in crystals [22], the reduced glass transition temperature for alloys [23], the formation energy and band gap of perovskites [24], and perovskite stability [25]. The eight data sets from other disciplines are daily bike count [26], frequency variation of Parkinson's patients [27], game actions [28], energy use [29], forest fire area [30], productivity of garment workers [31], carbon nanotubes [32] and bias correction of weather forecasts [33]. For each of these data sets, we divide the data set into training and test sets with 10-fold cross-validation. We then orthogonalize the features of the training set using the Gram–Schmidt procedure to obtain orthogonal features $x'_k$ and use these features to train an SVR model as described in the methods. Next, we predict the target properties $y_i$ for the test set using the trained SVR models and evaluate the errors of the SVR predictions. Then, we evaluate the distance metric $D_i$ for each of the test points and classify all test set data points based on their distances into 10 groups with equal numbers of data points (deciles). Finally, we evaluate the MAE of ML prediction for each group and plot the obtained MAE values for the different deciles in figure 3. To demonstrate the effect of the Gram–Schmidt feature orthogonalization and feature weighting on this analysis, we also present the results of the MAE of SVR prediction versus $D_i$ evaluated using the original non-weighted and non-orthogonal features. The MAE results of SVR prediction versus $D_i$ evaluated using non-orthogonal but weighted features and orthogonal but non-weighted features are shown in the SI, together with the results obtained with non-weighted and non-orthogonal features and with weighted and orthogonal features. Examination of the results presented in figure 3 shows that, with the exception of the forest fire data set (figure 3(l)), in all cases there is a clear trend of MAE increasing with increasing decile number. For about half of the datasets (figures 3(a), (e), (f), (g) and (i)) the trend is quite smooth, while for the other half there are strong fluctuations in the MAE superimposed on the overall trend.
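A minimal sketch of this decile analysis for a single train/test split, continuing the code above:

```python
import numpy as np

# Distance metric D_i and absolute SVR errors for all test points.
D = np.array([sampling_distance(x, Xp_train, w) for x in Xp_test])
abs_err = np.abs(y_pred - y_test)

order = np.argsort(D)                # sort test points by D_i
deciles = np.array_split(order, 10)  # 10 groups of (nearly) equal size
mae_per_decile = np.array([abs_err[idx].mean() for idx in deciles])
```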


Figure 3. Relationship between the distance metric and prediction error. Mean absolute prediction errors as a function of the distances of the predicted data points from all known data points. The distances are grouped into deciles, with error standard deviations computed for each decile. The computations are done for the (a) TCO formation energy [21], (b) TCO band gap energy [21], (c) activation energy for dilute solute in crystal [22], (d) reduced glass transition temperature for alloys [23], (e) daily bike count [24], (f) frequency variation of Parkinson's patients [25], (g) perovskite formation energy [26], (h) perovskite band gap energy [26], (i) game actions [27], (j) perovskite stability [28], (k) energy use [29], (l) forest fire area [30], (m) garment production [31], (n) carbon nanotube [32] and (o) bias in temperature [33] datasets.


To isolate the effect of the orthogonalization and feature weighting from that of the distance metric formula, we compare with the plots of MAE vs $D_i$ calculated without GS orthogonalization and feature weighting. It is observed that for five datasets (figures 3(a), (b), (e), (g) and (l)) the benefit of GS orthogonalization and feature weighting is either small or non-existent, with essentially equally smooth trends obtained for both $D_i$ evaluation methods. However, for seven data sets (figures 3(c), (d), (f), (h), (i), (j) and (k)), GS orthogonalization and feature weighting clearly improve the smoothness of the trend; in particular, for four data sets (figures 3(f), (h), (i) and (j)), the use of GS orthogonalization and feature weighting changes a very weak or absent correlation of the MAE with $D_i$ into a strong dependence. Thus, for these datasets orthogonalization is a crucial step for obtaining the correct distance evaluation. This suggests that previous attempts to use distance metrics for reliability characterization that found no or weak relationships between distances in feature space and prediction accuracy may have been affected by feature space distance evaluation that did not take into account the correlations between the features. Comparison of the plots of MAE vs $D_i$ evaluated using GS orthogonalization and feature weighting, GS orthogonalization without feature weighting, feature weighting without GS orthogonalization, and neither GS orthogonalization nor feature weighting (see SI) shows that GS orthogonalization provides the main improvement in the MAE vs $D_i$ trends, while feature weighting has a minor effect.

Even with the use of GS orthogonalization and feature weighting, no trend with $D_i$ is found for the forest fire data set. This is most likely due to the poor features used in this data set, which lead to larger errors of the SVR prediction. For data sets where the features are not well-correlated with the target properties, the suggested analysis procedure will not be useful because the error is not controlled by the sampling density but rather by the fact that the target property is controlled by hidden features that are omitted from the dataset. In that case, even a high sampling density of the known features will not lead to accurate prediction. Thus, our analysis can identify datasets for which better features are necessary. Furthermore, for datasets for which predictions are accurate overall but some outliers with poor accuracy are obtained, our method can identify whether the poor accuracy for data point i is due to weak sampling (as indicated by a high $D_i$) of the region of feature space in which the outlier is located, or due to the importance of hidden features for this data point, as indicated by a high sampling density and a low $D_i$.

To demonstrate the effectiveness of our analysis method in more detail, we present plots of the error vs $D_i$ for individual data points (figure 4) and plots of predicted versus actual values (figure 5), with different deciles shown in different colors, for the TCO formation and band gap energy [21], daily bike count [26] and game action [28] datasets. It is clear that the data in deciles 1–2, corresponding to small distances $D_i$, are predicted accurately, with the data points for these deciles shown in red falling on the y = x line in the plots in figure 5. With increasing decile number, corresponding to increasing sampling distance and decreasing sampling density, increased deviation from the y = x line is observed, with all outliers corresponding to deciles 9–10. We note that the ratio between the distances for deciles 1–2 and deciles 9–10 is between 2 and 3. Thus, even a relatively small decrease in the sampling density has a strong impact on the prediction accuracy.


Figure 4. Error distributions as a function of distance. Errors of individual ML predictions plotted as a function of the distance metric D for the (a) TCO formation energy, (b) TCO band gap energy, (c) daily bike count and (d) game action datasets, with data points for different deciles shown by different colors.


Figure 5. Prediction accuracy for different distances. Real versus ML-predicted values for the (a) TCO formation energy, (b) TCO band gap energy, (c) daily bike count and (d) game action data sets, with data points for different deciles shown by different colors.


Finally, we compare the results obtained using our method to those of the standard approach of using the average distance of the k nearest neighbors. As above, we use the plots of the MAE for different deciles and the Spearman correlation coefficient to evaluate the UQ accuracy of the different methods for the 12 data sets (figure 6). We find that for three datasets, namely the perovskite band gap energy, frequency variation of Parkinson's patients and game actions datasets, our method shows a much better ability to separate the well-predicted and poorly predicted data points than the baseline metric based on the average distance of the 10 nearest neighbors. For these three data sets, the baseline method fails to separate the well- and poorly-predicted data points, as evidenced by the very low slopes in the plots of MAE vs decile for the baseline method. For all three of these datasets, the use of GS orthogonalization achieves a clear separation even when only the average distance of the 10 nearest points is used as a metric. The use of the average sampling distance instead of the average distance of the 10 nearest points for the GS-orthogonalized data then leads to a further improvement and smoother trends for the perovskite band gap energy and frequency variation of Parkinson's patients datasets, while changing to the average sampling distance metric does not improve the smoothness of the MAE vs decile trend for the game action dataset.
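A minimal sketch of the baseline metric (the variable names are ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Baseline UQ metric: average distance of the k = 10 nearest training points
# in the original (non-orthogonalized, non-weighted) feature space.
nn = NearestNeighbors(n_neighbors=10).fit(X_train)
dist, _ = nn.kneighbors(X_test)      # distances, shape (n_test, 10)
baseline_metric = dist.mean(axis=1)
```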


Figure 6. Comparison of the relationship between the distance metric and prediction error between our method and the baseline average distance of the 10 nearest points. Mean absolute prediction errors as a function of decile number for the various distance metrics. The data for the baseline metric of the average distance of the 10 nearest points (black), the average sampling distance calculated according to equation (9) (green), the average distance of the 10 nearest points after feature orthogonalization (red) and the average sampling distance calculated according to equation (9) after feature orthogonalization (blue) are presented. The computations are done for the (a) perovskite band gap energy [26], (b) frequency variation of Parkinson's patients [25], (c) game actions [27, 34], (d) reduced glass transition temperature for alloys [23], (e) perovskite formation energy [26], (f) daily bike count [24], (g) TCO formation energy [21], (h) perovskite stability [28], (i) TCO band gap energy [21], (j) activation energy for dilute solute in crystal [34], (k) forest fire area [30], (l) energy use [29], (m) garment production [31], (n) carbon nanotube [32] and (o) bias in temperature [33] datasets.


For another five data sets, namely the reduced glass transition temperature for alloys, perovskite formation energy, daily bike count, TCO formation energy, and perovskite stability datasets, both the baseline average distance of the 10 nearest points metric and our method achieve the basic goal of separating the well- and poorly-predicted data points, but better results are obtained by our method. A particularly strong improvement is obtained for the low decile numbers, where the use of the average sampling distance metric and GS orthogonalization of the feature data decreases the MAE of the first two deciles by 45%, 40%, 55%, 33% and 30% for the reduced glass transition temperature for alloys, perovskite formation energy, daily bike count, TCO formation energy, and perovskite stability datasets, respectively. For the reduced glass transition temperature for alloys, perovskite formation energy, and perovskite stability datasets, the improvement is solely due to the use of GS orthogonalization of the data, whereas for the daily bike count and TCO formation energy data sets, the UQ performance is also improved by the use of the average sampling distance instead of the average distance of the nearest 10 points. For the TCO band gap energy, activation energy for dilute solute in crystal and forest fire data sets, there is no improvement by our method relative to the baseline, and for the energy use dataset, our method is slightly worse than the baseline at separating the well- and poorly-predicted data points.

To demonstrate mathematically that the ML prediction error tends to increase with an increasing distance metric, we calculated the Spearman rank correlation coefficient for the data used to plot figure 3, which shows the mean absolute prediction errors as a function of the decile number of the predicted data points based on the distance metric. The Spearman rank correlation coefficient ρ evaluates how well the relationship between variables y and x can be described by a monotonic function. A value of 1 means that the y variable is a fully monotonic function of the x variable, while a value of 0 means that the two variables do not show any consistent dependence. Thus, if ρ for the mean absolute error of a decile as the y variable and the decile number as the x variable is close to 1, the error tends to increase with an increasing distance metric. As shown in figure 7(a), the ρ values obtained using our method are generally above 0.8. Of the 15 examined data sets, eight have ρ ⩾ 0.9, five have 0.8 ⩽ ρ ⩽ 0.9, and two have ρ ⩽ 0.8. As discussed above, for the two data sets with ρ ⩽ 0.8, the features are not good predictors of the target property, making it impossible to perform uncertainty quantification. Comparison to the ρ values for the baseline method of using the 10 nearest distances shows that the ρ values for our method are consistently higher, with a very large increase in ρ in four cases, a significant increase in five cases and no significant increase in six cases, verifying the improvement provided by our method.
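This can be computed directly from the per-decile MAE values of the sketch above:

```python
import numpy as np
from scipy.stats import spearmanr

# Spearman rank correlation between the decile number (1..10) and the
# decile MAE; rho near 1 means the error grows monotonically with D_i.
rho, p_value = spearmanr(np.arange(1, 11), mae_per_decile)
```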


Figure 7. Comparison of the Spearman rank correlation and the integrated area between the distance metric and the prediction error for our method and the baseline average distance of the 10 nearest points. (a) Rank correlation values, (b) visual example of the area above and below the mean error for all points and (c) integrated area of the mean absolute prediction errors for the different deciles. For (a) and (c), the values are given for the baseline metric of the average distance of the 10 nearest points (green) and the average sampling distance calculated according to equation (9) after feature orthogonalization (blue).


We also introduce another measure, namely the area under the curve (AUC), to quantify and mathematically demonstrate the improved uncertainty quantification provided by our method. In the absence of distance-based uncertainty quantification, the uncertainty of an ML model prediction for a given data set can be estimated as the mean absolute error of prediction for the entire dataset. This provides a baseline for any metric-based UQ model. If a UQ metric, e.g. the average distance of the 10 nearest points or the average sampling distance proposed in our work, is useful, the MAE for low values of the metric should be less than the MAE for the entire dataset ($\mathrm{MAE_{all}}$), and the MAE for high values of the metric should be greater than $\mathrm{MAE_{all}}$. Therefore, if we consider a plot of the MAE of the different deciles of the data points ($\mathrm{MAE_{decile}}$, where the data points are divided into deciles based on the value of their distance metric) versus decile number, the curve with the greatest difference from the horizontal line at $\mathrm{MAE_{all}}$ shows the best error separation. For example, as shown in figure 7(b) for the perovskite band gap dataset, the curve of $\mathrm{MAE_{decile}}$ versus the distance metric shows only a slight deviation from the horizontal line at $\mathrm{MAE_{all}}$ for the baseline nearest 10 distances method, whereas it shows a very strong difference from the horizontal line at $\mathrm{MAE_{all}}$ for our method. To quantify the error separation, we calculate the area between the curve and the horizontal $\mathrm{MAE_{all}}$ line, as shown by the shaded area in figure 7(b). For comparison across all data sets, we normalize the area calculation by dividing $\mathrm{MAE_{decile}}$ by $\mathrm{MAE_{all}}$. Thus, the formula for the AUC is

$$\mathrm{AUC} = \sum_{m=1}^{10} \left| \frac{\mathrm{MAE}_m}{\mathrm{MAE_{all}}} - 1 \right| \qquad (10)$$

An AUC approaching zero indicates that the metric provides very little additional benefit for the separation of smaller and larger errors, while a larger AUC indicates a greater benefit.
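A minimal sketch, continuing the code above and assuming the discrete per-decile form of equation (10):

```python
import numpy as np

# AUC of equation (10): total deviation of the normalized decile-MAE curve
# from the horizontal line MAE_decile / MAE_all = 1.
mae_all = abs_err.mean()
auc = np.abs(mae_per_decile / mae_all - 1.0).sum()
```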

As can be seen from figure 7(c), for 13 out of 15 datasets, our method obtains greater AUC values than the baseline 10 nearest points method, with large improvements for six datasets and small improvements for seven datasets.

We have also included a comparison with the delta metric introduced in [6] that is given by equations (1) and (2). Examination of figures 7(a) and (c) clearly shows the superiority of our distance method to the delta metric, with the Spearman coefficients of the delta metric significantly lower than those for our method. A direct comparison of the plots of the prediction error versus decile number for the three methods for the TCO formation energy and band gap datasets (figure 8) further confirms the superior performance of our method.


Figure 8. Comparison of the relationship between the distance metric and prediction error for our method, the baseline average distance of the 10 nearest points and the delta metric. Mean absolute prediction errors as a function of decile number for the various distance metrics. The computations are done for the (a) TCO formation energy [21] and (b) TCO band gap energy [21] datasets.


The shortcoming of the delta metric is due to its use of the training-set data point errors to estimate the errors of the test set. Since the training set data are usually predicted more accurately than the test data, the delta metric will tend to severely underestimate the test errors in cases where the model overfits the training set data. For example, consider a test data point in a low-sampling region for which the nearest neighbor points are located at a large distance in feature space but their target function values are given accurately by the constructed model. In this case, our distance method will suggest that the error for this test point will be large because of the difficulty of interpolating from the nearest neighbor data points located far away; by contrast, the delta metric will suggest a small error for this test point due to the small errors obtained by the constructed model for the nearest neighbor training set data points.

4. Discussion

We now discuss the implications of our results for small-data machine learning. Our results provide a clear confirmation that ML methods are generally not suitable for extrapolation to regions in feature space where little data are available, as can be seen from the prevalence of large errors for decile 9–10 data. We also show that correlations between features often strongly hinder the use of similarity or distance in feature space as a metric for UQ. Thus, decorrelation of the features is likely to be generally important for any similarity-based UQ metric. Additionally, our method provides guidance for targeted exploration of feature space, experimentally or computationally, by identifying regions of feature space with low sampling density. These undersampled regions should be examined in order to obtain improved data sets, whereas obtaining additional data in the high-density sampling region will not be beneficial for improving the prediction accuracy.

Furthermore, our method enables investigation of the sources of ML prediction error and reliability. A given ML model has three sources of error, namely insufficient model flexibility, insufficiently descriptive features and insufficient sampling of the feature space by the data set used for model training. Previously, it was difficult to deconvolute the effects of these error sources. Using our method, we can separate out the effect of the sampling density, such that for low $D_i$ there is little sampling error, while for high $D_i$ the error is clearly dominated by insufficient sampling. If a high error is found for data points in regions of high sampling (low $D_i$), this indicates insufficiently suitable features or an insufficiently flexible model. To distinguish these two error sources, it may be possible to use a series of different ML methods. If a clear improvement is obtained for one method compared to the others, this indicates that model flexibility is important. If all methods obtain the same results, this suggests that model flexibility is not the limiting factor in achieving the desired prediction accuracy and that feature selection must instead be improved.

In future work, our local sampling density analysis method may serve as a basis for addressing several other problems in ML. First, the method is quite simple and yet it can predict the overall error trend (reliability) of ML prediction for a variety of data sets. It is likely that more sophisticated analyses of the feature space and of the dependence of the target properties on the features will provide a more granular and accurate estimate of prediction reliability (e.g. explaining why, for deciles 1–2 in the daily bike count [26] and game action [28] datasets, some of the data points are predicted with very low error while others are predicted with moderate error). For example, a distance metric function can be suggested that uses a weighting scheme different from that expressed by equation (9). Such more complex approaches may result in improved separation ability.

Additionally, it may be possible to develop more accurate ML prediction methods by designing separate methods for application to the high-sampling regions, focusing on accurate interpolation of the densely sampled data, and to the low-sampling feature space regions, focusing on accurately capturing the overall trends. Finally, our method relies on the orthogonalization of the features using the GS procedure, which scales as ${N^3}$, where N is the number of data points. Due to this ${N^3}$ scaling, GS orthogonalization cannot be applied to big data problems and therefore cannot be used to evaluate the accuracy of deep learning predictions. Therefore, another future direction is to investigate how a decorrelated-feature distance metric $D_i$ can be evaluated efficiently for big data sets, and whether such distance-metric-based analysis of reliability is useful for problems addressed by deep learning methods.

5. Conclusion

We have demonstrated that the errors of ML prediction for small data sets generally show a well-defined and systematic dependence on the separation of the predicted data points in feature space from the other data points. For various data sets and multiple independently generated data points, we find that the ML prediction error tends to increase with an increasing distance metric, defined as the inverse of the average local sampling density. We also find that the use of decorrelated features, created by the application of the Gram–Schmidt orthogonalization procedure to the features of the data used in the ML model, strongly increases the accuracy of ML reliability prediction based on the distance metric. Our method is computationally simple, can be used with any ML method and enables analysis of the sources of the ML prediction errors. Therefore, it is suitable for use as a standard technique for the estimation of ML prediction reliability and for the design of improved datasets for ML.

With regard to the limitation of the proposed UQ method, our findings show (not surprisingly) that UQ methods based on feature-space distance are ineffective when the features provide a poor description of the target property. Therefore, the UQ method proposed here is always limited by the quality of the data in the dataset. Development of methods for identifying descriptive features of desired target properties is therefore necessary to make our UQ procedure more generally applicable.

Several additional limitations remain to be addressed in future work. First, the effectiveness of the proposed UQ method should be examined for other ML methods used for small data sets, such as random forest and XGBoost, and tested on a much larger number of datasets to demonstrate its generality. Second, the method should be applied to UQ of deep learning model predictions for large data sets and compared to methods such as bootstrap ensembles and Bayesian neural networks, which incur a large model training cost compared to the training of a single model. Due to the ${N^3}$ scaling of the Gram–Schmidt orthogonalization, this may require the use of approximate orthogonalization procedures, which may affect the effectiveness of the method. Third, extension of the decorrelation method to the removal of non-linear correlations, beyond the linear correlations removed by the GS procedure, should be examined. These research directions will be pursued in our future work. Furthermore, to enable wider application of this method, it would be useful to integrate it with other UQ methods as well as with existing publicly available ML workflows and tools. This will be done in future work as part of the wider testing of our UQ method described above.

Acknowledgments

This work was supported by the U.S. Department of Defense and the U.S. Army through Grant W911NF-19-2-0119, and by the Israel Science Foundation through Grant 1479/21.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).


Supplementary data (0.3 MB PDF)