Classifying rock types by geostatistics and random forests in tandem

Rock type classification is crucial for evaluating mineral resources in ore deposits and for rock mechanics. Mineral deposits are formed in a variety of rock bodies and rock types. However, the rock type identification in drill core samples is often complicated by overprinting and weathering processes. An approach to classifying rock types from drill core data relies on whole-rock geochemical assays as features. There are few studies on rock type classification from a limited number of metal grades and dry bulk density as features. The novelty in our approach is the introduction of two sets of feature variables (proxies) at sampled data points, generated by geostatistical leave-one-out cross-validation and by kriging for removing short-scale spatial variation of the measured features. We applied our proposal to a dataset from a porphyry Cu–Au deposit in Mongolia. The model performances on a testing data subset indicate that, when the training dataset is not large, the performance of the classifier (a random forest) substantially improves by incorporating the proxy features as a complement to the original measured features. At each training data point, these proxy features throw light based on the underlying spatial data correlation structure, scales of variations, sampling design, and values of features observed at neighboring points, and show the benefits of combining geostatistics with machine learning.


Introduction
An economic mineral deposit that is termed an 'ore' deposit, is an unusually high concentration of metallic or non-metallic minerals in a part of the Earth's crust that can be extracted economically or strategically [1].The identification of the rock types (lithology), ore-bearing geologic structures, and the geological properties of those rocks, are fundamental in defining ore deposits.Exploration geologists are initially concerned with the lithology of the country rocks, and subsequently move toward the identification of the host rocks (that host the ore body) and wall rocks (that surround the host rocks).Geological properties that characterize the different rock types in an ore deposit include the basic physical attributes that can be observed with the naked eye, and attributes pertaining to geochemistry, petrography, mineralogy, alteration, material science and rock mechanics which can be measured only in the laboratory.The classification of rock types is helpful to many modeling and engineering decisions in downstream stages of a mining project, such as: (1) generating geologic maps and 3D geological models, (2) estimating the probability of occurrence of economic minerals in a particular rock type, (3) estimating the cost of blasting and extracting the raw material to be mined, (4) adjusting drilling parameters for a given rock type to maximize core recovery, (5) estimating recoverable resources, and (6) estimating the rock strength, among others.
In practice, the visual determination of the rock type on the basis of its mineralogy (when distinct), color, grain size and texture, structure, contacts with other rock types, or other visual attributes, is the regular method of identifying and classifying a lithology.However, two rocks may look very much the same physically, but be geochemically different.When the visual procedure fails to exactly determine or confirm the lithology class, chemical assays of the rock samples reveal their exact type.However, chemical assays of rocks involve service costs that have to be planned judiciously.
Porphyry Cu deposits, including porphyry Cu-Au-Mo and porphyry Cu-Au deposits, and Cu-(Mo) breccia pipes are the world's largest source of copper.These are generally large-tonnage-low-grade metal deposits.Ore occurs in bulk-mineable mineralization that is spatially, temporally and genetically associated with intrusive rock plutons of the granite family with the metals transported by hydrothermal fluids [2].The intrusive plutons, host or wall rocks, and other country rocks are commonly subjected to hydrothermal alteration overprinting, resulting in a loss of the original physical signature of the lithology types, thereby posing a difficulty in their visual identification and classification.Alteration caused by weathering and meteoric waters acidified after percolating through iron sulfides (pyrite, FeS 2 ) in the deposit, further adds to the problem of labelling the exact rock type.This is the first dimension to the problem addressed here.
Where chemical assay of the rock sample is the available solution to the identification of the lithology, there lies another problem, related to the cost-benefit paradigm of the mining and metals industry.Why would a mining company invest money in assaying a population of rocks sampled from a lithologic domain if a few initial sample assays indicate a poor metal grade of the domain?This economic logic leads to the resultant problem of missing values of some or all assay variables pertaining to the rock sample at a data point in underground 3D space.This is the second dimension of the problem.
A more pessimistic corollary of the same logic adds a third and more acute dimension: complete absence of information.Why would the company invest at all in additional drill holes for extracting those cylindrical rock core samples from 'that' low-grade lithologic domain?A core loss or poor core recovery during drilling is an additional risk.The complete absence of information leads to 'no data' patches of underground zones among zones of 'data-rich' and 'data-poor' irregularities.
There are two perspectives on handling missing data: avoiding the rows in which some or all variables have missing values, versus filling the missing values by statistically calculated imputations.The former approach is naive, conservative, faithful to the missing values, but implies discarding a data point location along with grade values that are costly to acquire.The latter approach appears reasonable for most investigators but is considered as inappropriate practice by mineral resource reporting systems belonging to CRIRSCO [3].It is pertinent to note that data imputation in the blank cells of the assay data table means imputing values of the components of a spatial vector at different data point locations, which implies changing the original spatial data configuration from 'partially heterotopic' (i.e. with under-sampled variables) to 'isotopic' (with equally-sampled variables).
One typically characterizes mineralization by sampling less than 0.001% of the deposit.Therefore, the scale of uncertainty combined with high levels of geological variability create unique risks for the profitable exploitation of mineral resources.The ultimate problem is that given a sparse population of rock core data at scattered locations, we are required to predict the spatial distribution of the different rock types and metal grades comprising the 3D subsurface of the deposit [4][5][6].The traditional approach of classifying the rock type of drill core samples relies on whole-rock geochemical assays, comprising concentrations, and their transformed values, of a series of major oxides, alkaline and alkaline-earth elements, trace elements, and other elements to as much as 55-60 feature variables.While several machine learning case studies exist on rock type classification from whole-rock geochemistry (e.g.[7,8]), there are perhaps just a few on rock type classification from a limited number of feature variables comprising essentially of assays of metal grades.This is primarily due to the general unavailability of proprietary drill hole datasets of ore deposits in the public domain for scientific research.
The prime novelty of our study is the ingenious introduction of two complementary sets of feature variables ('proxies') at sampled data points, by means of geostatistical techniques: (1) leave-one-out cross-validation (LOOCV), and (2) filtered kriging for removing short-scale spatial variation ('noise') of laboratory measured values.Relying on the different sets of feature variables-the original measured set and the two sets of proxies, we tested a machine learning technique, namely the Random Forests (RF) [9], in tandem for classifying a total of 13 rock types occurring within the ore deposit.
Our principal scientific questions in the current research investigation are: 1. Is it possible to predict (classify) the different rock types with reliable accuracy from a given set of assayed grade values at sampled data locations?Our intuition is based on that it could be possible using a machine learning classifier provided the classification algorithm can recognize a decisive pattern of numerical features (metal grades) associated with each individual rock type.The input 'features' are assay values of geochemical variables and bulk density (BD) measured in a drill hole dataset in compliance with industry quality assurance and quality control standards.We will assess the accuracy of predictions by evaluation with standard model performance metrics for classification.2. Could the different rock types be predicted with more accuracy if we add complementary sets of numerical features (proxy grade values generated by geostatistical techniques) to the original assayed grade values?In this case, our intuition is based on that more accurate prediction is plausible as a more distinctive pattern of features could be likely generated corresponding to each individual rock type (thus, rendering the classifier as more efficient in recognizing the pattern).The underlying merit of the complementary sets of grade values is that they are spatially correlated with the original assayed grade values.3. What could be the prediction performances of the classifier if we were to substitute the assayed grade values at the sampled data locations and the missing values by the complementary sets of features (proxy grade values) generated by geostatistical techniques?
While trying to answer the above questions, we will analyze two cases, depending on whether the feature variables are equally sampled or not; in the latter case, the missing data will be filled with the median value per lithology class for each corresponding input feature variable to apply the traditional classification approaches.
The paper is outlined as follows: section 2 provides a geological description of the application case study under consideration.The dataset of this case study is introduced in section 3, together with the proposed tools and methodology.Section 4 presents and discusses the results of the classification problem.Section 5 concludes the paper, while detailed statistics on the classification results are provided in Appendices.

Geologic setting of the deposit
Porphyry deposits are regionally associated with convergent plate boundaries, along island arcs and continental margins where subduction of oceanic crust takes place beneath continental crust.The mineralization process is characterized by intrusions of large volumes of granitic magma accompanied by the availability of metal-rich hydrothermal fluids, with the sites of ore localization controlled by structural features, rock types, and hydrothermal alteration styles.Copper and gold are typically the main economic minerals in these deposits, with molybdenum and silver being common as additional.A good summary of porphyry deposits, their types of metal associations, host rocks, and alteration can be found in [10].The details of the ore-forming process can be found in [2,11].
The Central Asian Orogenic Belt extends for 5000 km from the Urals in the west to the Pacific coast in the east.Within this belt, a chain of porphyry mineralization occurs in Uzbekistan, Kazakhstan, Russia, China, and Mongolia, ranging in age from the late Neoproterozoic till the Jurassic.The most prolific geologic period of ore formation, including the largest deposits, was during the Late Devonian and Early Carboniferous [12].
In Mongolia, porphyry mineralization occurs in two places: Erdenet and Oyu Tolgoi.Erdenet is located in Central Mongolia, whereas Oyu Tolgoi is located within the South Gobi Desert, approximately 650 km due south of the capital, Ulaanbaatar.The latter constitutes a cluster of seven porphyry Cu-Au-Mo-(+Ag) deposits, the largest high-grade group of Paleozoic porphyry deposits known in the world.The rock sequence surrounding this cluster of deposits is predominantly composed of basaltic to intermediate volcanic rocks of Devonian age, overlain by layered dacitic pyroclastic and sedimentary rocks intruded by basaltic and dolerite sills.The southern deposits (Southwest, Heruga North, and Heruga) are characterized by high Au (g/t)/ Cu (%) ratios (0.8-3:1), and hosted by biotite-magnetite-altered augite basaltic rocks.In contrast, the Hugo Dummett (North and South) deposits are characterized by lower Au (g/t)/ Cu (%) ratios (0.1-1:1), and hosted mainly by volcanic and plutonic rocks with extensive sericitic and advanced argillic hydrothermal alteration.The detailed geology and mineral exploration history in the Oyu Tolgoi district can be found in [13,14].
The deposit concerned in this study is the Hugo Dummett South deposit (figure 1).It is a polymetallic porphyry with economic grades of Cu, Au, Mo and Ag and potentially deleterious grades of As, F, S and Fe.It is hosted by an east-dipping basalt rock sequence intruded by porphyritic quartz-monzodiorite plutons.The deposit is both disrupted by, and bounded by fault displacements in four directional orientations.The mineralization occurs in an anastomosing network of stockwork veinlets of sulfides ± quartz with Cu grades typically greater than 2%.There are potentially multiple overlapping mineralizing phases.Hydrothermal alteration exhibits a strong lithologic control with a number of assemblages including advanced argillic, muscovite/sericite, and intermediate argillic styles with minor K-silicate alteration.The stockwork is mainly localized within the Late Devonian quartz-monzodiorite intrusive; however, it also extends into the basaltic wall rock units.

Data description
In the present study, we have considered an exploration dataset comprising geological, geochemical, and survey information of 360 drill holes representing 178 km of drilling.The dataset includes separate spreadsheets: assay values of thirteen chemical elements (Cu, Au, Ag, Mo, As, Fe, S, C, F, Pb, Zn, Cd, Hg), dry BD values, lithology classes, alteration types, mineral percentages, and survey data in the form of collar and down-the-hole survey files.In the original assay table, core sample lengths vary from 0.1 m to 1908.8 m (excluding the core loss intervals).Sample lengths in the range of 1 m -3 m comprise 98.4% of the total samples.As per standard practice, samples were optimally composited to a mean length of 5.0 m without any left-overs, in lieu of fixed-length compositing.Decomposition of 12 large samples into smaller length composites has been done due to the fact that samples with lengths >5.0 m have negligible grade values for all the thirteen assay variables.Basic statistics on the target and feature variables are provided in tables 1 and 2, respectively.An interpretive lithological model (3D) with the trajectories of drill holes is shown in figure 2.

Decision trees
A decision tree is a set of questions organized in a hierarchical manner and represented graphically as a tree, with a data structure made of a collection of nodes and edges, and devoid of loops.Nodes are divided into internal (or split) nodes and terminal (or leaf) nodes.All nodes have exactly one incoming edge.For a given input object, a decision tree predicts an unknown property of the object by asking successive questions about its known properties.Depending on the answer of a binary test, the data is sent to the right child node or to the left child node.This process is repeated until the data point reaches a leaf node.The leaf nodes contain a classifier or regressor which associates an output (a class label or a continuous value) with the input, and the decision is then made based on the output at the leaf node.The more the questions asked (the number of branches), the higher the confidence is in the response.

RFs
RFs [9] is an ensemble learning method that combines multiple decision trees to make predictions.It operates by constructing a collection (or forest) of decision trees, where each tree is trained on a subset of the training data.Randomness is introduced at two levels: 1. Bagging or bootstrap aggregating involves sampling the training data with replacement to create multiple bootstrap samples.Each decision tree is trained on a different bootstrap sample.This duplicates some samples and does not select others.An average of approximately 63.2% of cases is included in each training subset, whereas the remaining, or 'out-of-bag' (OOB) samples (approximately 37.8%) are used for validation.2. Random feature selection: The second form of randomness involves selecting a random subset of features at each split node during tree construction.The number of variables in this subset is predefined and consistent across the forest.This feature subsampling helps to decorrelate the trees and to reduce overfitting.At every node, the randomly selected variables are then ranked by their ability to produce a split threshold that maximizes the homogeneity of child nodes relative to the parent node.
The class assigned by RF to each sample is determined by a majority vote combined from the prediction output of all classification trees.The reader is referred to [9,15,16] for the theory, mathematical rigor and consistency proofs of the algorithm, parameter selection, resampling mechanisms, variable importance measures, and the theoretical and methodological progress made since its initial development.Although RF has been extremely successful as a generic classification and regression method, it is routinely used for classification problems (e.g.[17][18][19]).

Kriging
Kriging is a geostatistical predictor that applies to a regionalized variable, i.e. a quantitative variable indexed by spatial coordinates.It relies on the modeling of the spatial correlation structure of this variable, via mathematical tools such as the (auto)covariance function or the variogram that quantify the similarity or the dissimilarity between paired observations of the variable, as a function of their spatial separation.Unlike machine learning techniques (e.g.decision trees or RFs) that are used to predict one variable (target) from another (feature), kriging predicts the same variable as the input, but at different spatial locations, with the unbiased prediction leading to the smallest mean-square error, i.e. the error with zero mean and the smallest variance.The reader is referred to [20] for details on the representation of regionalized variables by spatial random fields, inference and modeling of covariance functions or variograms, as well as on the mathematical background and the most common variants of kriging.
In the following section, two kriging-based approaches will be used to complement the original feature variables inputted in machine learning algorithms.The motivation is to account not only for the feature variables observed at a sampling location when predicting a target variable at this particular location, but also for the information brought by the feature variables observed at surrounding locations.These are: 1. LOOCV.The value of a feature variable at a given location is predicted by kriging, using as input the values of the same feature variable at surrounding locations and a modeled covariance function or variogram.The cross-validated prediction therefore includes the information of the neighboring data, but excludes the information of the data itself.2. Kriging with filtering.This variant applies when the modeled covariance function or variogram of the feature variable is decomposed into two components; typically, the first component is the nugget effect variance, which stands for the discontinuity of the covariance function or variogram at the origin, while the second component is the continuous part of the covariance function or variogram.In such a situation, the feature variable can also be interpreted as the sum of two components, associated with the components of its covariance function or variogram.In particular, the nugget effect component represents the short-scale variability and noise associated with measurement errors [21], while the continuous component represents the spatial variations at larger scales.Kriging with filtering [22, chapter 15] allows predicting the continuous component of a feature variable (viewed as the 'error-free' variable) at a given location, using as input the values of the observed variable (including the nugget effect component) at this location and surrounding locations, as well as the modeled covariance function or variogram of the variable.Unlike LOOCV, here the prediction not only includes the information of the neighboring data, but also the information of the data itself.

Proposed methodology
The proposed methodology is illustrated in figure 3 and detailed in the next subsections.The objectives of this methodology are the following: 1. To test whether it is possible to predict different classes of intrusive lithology ('target variable') with assay variables and BD as input 'feature variables' .In doing so, our emphasis is not to test multiple machine learning algorithms for evaluating their individual predictive powers, but to, primarily, find out whether a limited number of assayed metal grades exhibit extractable patterns of variations for correlation with different rock types via supervised learning.2. To compare classification performances by the traditional approach versus our ingenious approach of creating proxy variables as input features, as explained below.In the case where feature variables are not equally sampled, the traditional approach relies on imputing missing values.Although we had several statistical options of imputation, we opted for a geochemically sound procedure, i.e. the missing values were filled with the median value per rock type for each corresponding measured variable, to apply the traditional classification approach.The principal objective was to test whether the proxy variables created by us are: (a) indispensable, or (b) somewhat useful, or (c) useless, for the purpose of the said classification.

Geostatistical modeling of feature variables
The first step of the methodology is the calculation of experimental variograms to identify the spatial correlation structure of the feature variables (chemical element assays and BD), which encompasses their spatial variability at both short and large scales and can depend on the spatial direction.To avoid over-simplifications and to get a realistic geostatistical model, we considered two directions for variogram calculation: the horizontal plane (omni-horizontal calculations) and the vertical direction (table 3).The fitting of theoretical variogram models considers nested structures of nugget, spherical and exponential types.The fit is performed with a least-square procedure (table 4 and figure 4).

Geostatistical prediction by LOOCV and kriging with filtering
For both types of prediction, the feature variables were kriged separately.In each case, ordinary kriging was used, with a moving neighborhood implementation.The shape of the neighborhood was ellipsoidal, and its anisotropy was chosen as a function of the anisotropy of the variogram model (e.g. the neighborhood was more elongated in case of a strong anisotropy, or closer to a sphere in case of isotropy) (table 5).To avoid getting predictions that are locally very high, a top-cut value was applied to feature variables with long-tailed histograms.Finally, to avoid inconsistent results, negative predictions were post-processed and set to zero.

RF prediction (classification) of lithology classes
The following steps were performed sequentially.

Classification based on the dataset with imputed values (Models M1, M2 and M3)
1.The missing values of the first set of feature variables: assayed grade variables, viz., Cu, Au, Ag, Mo, As, Fe, S, C, F, Pb, Zn, Cd, and Hg, plus BD, were filled with the median value corresponding to each lithology class.2. Our first task was to test the prediction of the lithology types with the first set of feature variables (14).
Then, the next is to test the same combining the first, second, and third set of feature variables (14 × 3 = 42 feature variables).The last task is to test the same combining only the second and third set of feature variables (14 × 2 = 28 feature variables).Statistics on the target variable (lithology) in the composited dataset is given in table 1.The dataset has 67 323 observation points, including imputed values for the first set of feature variables (14).We split the dataset into subsets comprising training and testing in a 70:30 ratio.Since lithology has been sampled in an unbalanced way, subsets comprising training and testing data were split through stratified random sampling.The prior probabilities are considered so as to match total sample frequencies (table 6).The purpose of sub setting the training rows is for model training and tuning, and the testing rows for model evaluation.Optimizing RF for modeling involves tuning some parameters controlling the structure of each individual tree, the structure and size of the forest as well as its randomness.The methodology and software that we have adopted for tuning hyperparameters for the RF classification models are described later in this section.

Classification based on the dataset without missing values (Models M4, M5 and M6)
2. Next, we repeat the exercise (prediction of lithology types with the RF classifier), but with all missing data (i.e.imputed values) removed.A reduction in the number of data points from 67, 323 to 899, and the number of lithology classes from 13 to 12 (table 7) was implicit due to the fact that missing values for different grade variables in the original dataset were not uniformly distributed among the data points.Dry BD (t m −3 ), a physical variable, in the dataset was discarded due to missing values at a huge proportion of these 899 data points.Hence, this time, the sets of feature variables consisted of only chemical grade variables: the first set consisted of 13 assayed chemical grade variables; the second set  Following the model runs for the above cases, we analyzed the model performances from each confusion matrix after deriving the accuracy rates, kappa statistics, and other statistics that include the sensitivity, specificity, precision, recall, prevalence, detection rate, and balanced accuracy.We also investigated the relative feature importance according to statistical criteria.Finally, we did some model tuning to make the models run on the testing subsets, for each of the three sets of feature variables, and obtain the final sets of results, which are reported in the Results section and in Appendices.
The randomForest R package version 4.7-1.1 [23], which implements Breiman and Cutler's RFs for classification and regression supporting the original Fortran code [9], was the fundamental package used to obtain the RF models.We, by and large, followed [24], for the parameter tuning strategies for the RF classification.The specific hyperparameter values were determined using the tuneRanger R package version 0.14. 1 [25] that tunes the RF classifier using the OOB observations.

Hyperparameter tuning
We considered the following hyperparameters for tuning: (a) the number of candidate feature variables considered at each split (denoted as mtry), (b) number of observations that are drawn for each tree (i.e.sample size, denoted by n), and (c) the minimum number of observations in a leaf node (nodesize).An optimal compromise between low correlation and reasonable strength of the trees can be controlled by tuning the parameters mtry, n, and nodesize [9].However, prior to tuning these three hyperparameters, we also estimated the hyperparameter ntree (number of trees to grow) by finding the optimal number of trees to grow using the R package of [23].Resampling during model tuning was performed by 5-fold cross-validation.We used the tuneRanger R package, to choose the optimal hyperparameter values for our models.A default nodesize parameter value of 1 was used for all the models.The default splitting rule used was that minimizing the Gini impurity.Research on tuning hyperparameters for variable importance is still in its infancy [24].Therefore, we report (see appendices) both measures of variable importance: the Gini variable importance (mean decrease Gini) as well as the permutation variable importance (mean decrease accuracy).

Performances of classification models
The optimally tuned model was selected based on the highest accuracy of candidate models with different hyperparameter values.The accuracy rate was computed using predictions made using the optimized model on the training dataset.The final hyperparameter values used for models M1, M2, M3, M4, M5, M6 and their prediction performances are indicated below (tables 10 and 11).Additional results are tabulated in appendix A. See [26] for technical details and definitions of the metrics.

Discussion
Need for spatial machine learning RF is a machine learning tool that has proved to be useful in doing regression and classification analyses.However, RF was not developed for modeling spatial stochastic processes because of its inherent inability to take into account the nature of spatial dependency (the spatial autocorrelation structure) intrinsic to such processes.However, there have been attempts made for spatial prediction (regression) using alternate models or even integrating RF with geostatistical algorithms.Notable are that of spatial linear mixed-models [27],  RF-GLS [28], RF-Residual Kriging [29,30], spatial RF [31] and geographical RF [32].The spatial structure intrinsic to mineralization can extend to nested structures with additional difficulties of modeling spatial drifts with estimation complexities [20,22], in the events of which the above-cited modeling frameworks, unfortunately, fall short in theory.In particular, when grades of core-samples are the input data to obtain predictions of average grades of unit blocks with volumes different from that of the core-samples, the problem of 'change of support' [20] is invoked, which a unique contribution of geostatistics in the realm of stochastic processes.
The scientific problem in this investigation does not necessitate to merge the algorithms of RF with those of geostatistics.At first glance, the problem is simply one of classifying the likely rock type associated with a given spatial pattern of grade variations measured on drill cores.However, there are additional dimensions to the problem, as cited in section 1, making it more complex.The traditional method for missing values in a dataset is to impute with realistic values.Then arises, which value is realistic and which is not?This is where expert subject knowledge comes in.Notwithstanding this, we approached the missing data problem with the simplest technique, that is, imputing the missing values by the median grade per lithology type.On the other hand, we have the option of incorporating proxy variables with values at each data point generated by geostatistics.Hence, the task was to evaluate which technique-traditional versus geostatistical-is superior.

Incorporation of spatial information in classification by geostatistics
The large amount of data and feature variables used in the first exercise that considers the dataset with imputed values leads to accuracy rates close to 100% (table 10).However, this performance significantly decreases when reducing the number of data and feature variables to the subset of data with no missing values (table 11), in which case the traditional RF classifier (model M4) applied to the testing dataset reaches an accuracy of 61.2%only and a kappa statistic less than 0.49.In this case, the prediction performances of model M5, for which the accuracy raises to 67.5% and the kappa statistic to 0.56, show the importance of incorporating proxy variables obtained by a spatial processing of the input information: • The LOOCV predictors convey information on the expected value of the feature variables at an observation point, based on the spatial correlation structure of these variables, on their sampling design and on their values observed at neighboring points.Because they ignore the values at the target observation point, these LOOCV predictors can be seen as a 'summary' of the neighboring information at the point under consideration.• The filtered kriging predictors provide another complementary information, consisting of the expected values of the feature variables if there were no short-scale variability or measurement errors.If one decomposes the feature variables into a 'signal' (associated with the continuous component of the covariance function or variogram models) and a 'noise' (associated with the nugget effect), the kriging with filtering provides a predictor of the former, i.e. the noise-free feature variables.As the LOOCV, this predictor accounts for the spatial correlation structure of the feature variables, their sampling design and their values observed at neighboring points, but also for the value observed at the target point, which is not 'wasted' as in LOOCV, and for the basic nested structures used in the spatial correlation model, which are associated with different scales of variations [22].The rationale of using the filtered kriging predictors is that the relationship between the lithology (output) and the feature variables (input) is better explained by considering the large-scale variations of the inputs that are not contaminated by short-scale variations.This is analogous to the denoising of images with convolutional neural networks [33], except that it is applicable even when the observations are not arranged in a regular sampling design.Note that, if the correlation model does not have a nugget effect, then the filtered kriging predictor matches the original predictor at any point where the latter is observed.
Another significant advantage of using the LOOCV and filtered kriging predictors in the classification problem is the handling of missing data, insofar as they provide a 'clever' alternative to the imputation of missing values, based on the spatial correlation structure and neighboring information, instead of (in complement of) a simple median value by lithological class.The use of geostatistical proxies can be decisive in the presence of highly heterotopic datasets, for which discarding the missing data would imply a considerable loss of information.Note that the LOOCV and filtered kriging give the same predicted value for a feature variable at a point with missing observation.

Associations between lithology and chemical assays
Minerals and metal enrichments in porphyry ore deposits are formed by the precipitation of sulfides from a hydrothermal system accompanying magma intrusion through pre-existing host rocks.Several interrelated phenomena get activated during such ore-forming processes, which depend on the geochemical, petrophysical (e.g.porosity, permeability) and thermodynamical (pH, redox potential, thermal conductivity, etc) rock properties, and on the environment of magma generation and its chemical characteristics.Some chemical elements are more likely to be found in specific rock types than in others in the Earth's crust [34].However, this usual metal-rock associations based on geochemical affinity may change once a mineralization event takes place.As in our case, elements such as Cu, Zn, and Au (both chalcophile and siderophile) have got hosted both in an intrusive acidic igneous rock such as quartz-monzodiorite (QMD), as well as in the basic igneous wall rock such as Augite Basalt (VA), after the hydrothermal mineralization event.
Accordingly, an explicit quantitative assessment of the lithology-grade relationship is a very arduous task.To top it all, grade assays are affected by short-scale geological variability and measurement errors, which renders their relationship with the lithology 'noisier' and more difficult to display.
Our work suggests that machine learning classification can successfully model this complex lithology-grade relationship and predict the lithology from the grade assays, and even better when the spatial information is taken into consideration, in particular when different scales of spatial variations are identified and the short-scale variability of the grade assays is filtered out.

Variable importance in the classification models
Bootstrap aggregating (bagging) is a general-purpose procedure for reducing the variance of a statistical learning method.When we perform bagging involving many trees, it is no longer possible to represent the resulting statistical learning method using a single tree, and it is no longer clear which variables are most important to the procedure.Thus, bagging improves prediction accuracy at the expense of interpretability.However, we can obtain an overall summary of the importance of each predictor using the Gini index.This index is a measure of node purity-a small value indicates that a node contains predominantly observations from a single class.We add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all bagged trees.
The RF algorithm estimates the importance of a variable by looking at how much prediction error increases when (OOB) data for that variable is permuted while all others are left unchanged.The necessary calculations are carried out tree by tree as the RF is constructed [35].There are actually four different measures of variable importance implemented in the original code.However, we report only two measures of variable importance in appendix B. The first is the mean decrease of accuracy in predictions on the OOB samples when a given variable is permuted.The second is the mean decrease Gini, a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees.
In the results, models M2 and M5 comprise the three sets of predictors: (1) original assayed grades (denoted with the index 1 in the Appendices, e.g.Cu1, Au1, Ag1), (2) LOOCV predictors (denoted with the index 2), and (3) filtered kriging predictors (denoted with the index 3).Model M2 constitutes the data table when the original assayed grades are imputed by the median value per lithology at the missing data points.Model M5 constitutes the data table with those data points, at which missing data occurs, completely removed.We can see in appendix B that, in model M2, the top 10 most important predictors belong exclusively to the first set, whereas, in model M5, the top 15 most important predictors (barring F from the first set) belong wholly to the second and third sets of predictors.
Despite this fact, the lower performance of model M3 in comparison with models M1 and M2, and the marginally lower performance of model M6 in comparison with model M5, show that, even if they are the most important, the LOOCV and filtered kriging predictors are a complement to, not a substitute of, the original assayed predictors.

Concluding remarks
This work fits into the context of mineral resources modeling and dealt with the use of machine learning to predict lithological classes from a set of feature variables (geochemical elements and dry BD), through a case study of a polymetallic porphyry deposit located in Mongolia.
Classification has been performed with the RF algorithm.Our main finding is that, when the training set is not so large, the predictive performance of RF can be substantially improved by incorporating new feature variables corresponding to proxy variables calculated by the geostatistical technique of kriging.Specifically, two proxies have been introduced, consisting of (1) LOOCV predictors that summarize, at each data location, the information of its neighboring data, and (2) the filtered kriging predictors that provide a denoised version of the feature variables, which carries out information on the continuous components of the feature variables, free of measurement errors.The proxy variables calculated at a data point account for the spatial data correlation structure, scales of variations, sampling design, and values of the features observed at neighboring points.In the second classification exercise (with 631 data in the training set and 13 feature variables), the accuracy rate on the testing dataset raised from 61.2% without the proxy variables, to 67.5% with these variables.
Another advantage of using geostatistical proxy variables is the possibility to handle missing data in a clever way than in traditional approaches where missing values are imputed with median values per lithology class.In the presence of a highly heterotopic dataset where data imputation becomes adventurous, one could think of using model M3, with no need to discard any data (as in exercise n • 2) or to impute any missing value (as in models M1 and M2 in exercise n • 1), just by substituting the under-sampled variables with the geostatistical proxy variables.This also opens the door to the possibility to predict the lithology class at any unsampled location, as the proxy variables can be calculated at any location of interest, even if no measurement is available (note that, in such a situation, the LOOCV and the kriging with filtering lead to the same values, because the nugget effect is automatically filtered when making the kriging prediction at an unsampled location).
Even better, as a perspective for future work, one could think of simulating the feature variables at any (sampled or unsampled) location, via geostatistical approaches [20], instead of using LOOCV or kriging with filtering.By having multiple sets of simulated variables, it becomes possible to think of a probabilistic classification at the target location [36,37].

Appendix B. Variable importance tables for models M1 to M6
Assay predictors are denoted with index 1, LOOCV predictors with index 2, and filtered kriging predictors with index 3.

Figure 1 .
Figure 1.Simplified geological map of the Hugo Dummett South porphyry deposit considered in this study and its surrounding area.

Figure 2 .
Figure 2. Interpretive lithological model (3D) used for mineral resource estimation with the drill hole intercepts superimposed.Coordinate values are omitted for confidentiality reasons.

Figure 3 .
Figure 3. Schematic workflow of the proposed methodology, combining geostatistical analyses (yellow panel) and machine learning (green panels) for enhanced classification in a spatial context.

Figure 4 .
Figure 4.Examples of variograms for chemical assays (Cu, Ag, As, Fe and S) and bulk density (BD).Crosses represent experimental variograms and solid lines represent fitted models, along the horizontal (black) and vertical (blue) directions.

Table 1 .
Statistics on target variable (lithology) in composited data table.

Table 4 .
Parameters for theoretical variogram models.

Table 5 .
Parameters for kriging.Feature variable Horizontal search radius (m) Vertical search radius (m) Optimal number of data Top-cut value To compare the model performances corresponding to the traditional machine learning approaches versus our approach, we have created an extended dataset comprising of three sets of feature variables.The first set consists of the assay variables of the original dataset, i.
e. Cu, Au, Ag, Mo, As, Fe, S, C, F, Pb, Zn, Cd, and Hg, plus BD.The second set consists of leave-one-out cross-validated ('LOOCV') values, and the third set consists of filtered (after removing the 'nugget variance' from the variogram) kriging values for each of the above feature variables.Only the first set contains missing values.In the third set, if the variable has no nugget, then the filtered value is equal to the original value; if the value of a variable is missing, then the filtered value is its kriging prediction from the surrounding values and matches the LOOCV value.

Table 6 .
Multi-class response information in the training and testing subsets.

Table 7 .
Multi-class response information (before merging) in the training and testing subsets.

Table 8 .
Multi-class response information (after merging) in the training and testing subsets.Due to the similarity in age and mode of formation, we merged the AND, RHY, and INTR lithology classes into a single class: INTR.Hence, the total number of lithology classes was reduced to 10 for this set of classification models.The final statistics of the lithology variable used for the prediction models are shown in table 8.For this set of models, we split the 899-point dataset into training/testing subsets in a ratio 70:30.Table9shows the number and percentages of lithology class records in the training and testing subsets.

Table 9 .
Multi-class response information of table 8 separated into training and testing subsets.

Table 10 .
Prediction performance for models based on dataset with imputed values.

Table A2 .
Performance statistics for model M2.

Table A3 .
Performance statistics for model M3.

Table A4 .
Performance statistics for model M4.

Table A5 .
Performance statistics for model M5.

Table A6 .
Performance statistics for model M6.

Table B1 .
Variable importance table (sorted as per permutation variable importance) for model M1.

Table B2 .
Variable importance table (sorted as per permutation variable importance) for model M2.

Table B3 .
Variable importance table (sorted as per permutation variable importance) for model M3.

Table B4 .
Variable importance table (sorted as per permutation variable importance) for model M4.

Table B5 .
Variable importance table (sorted as per permutation variable importance) for model M5.

Table B6 .
Variable importance table (sorted as per permutation variable importance) for model M6.