Validating spatio-temporal environmental machine learning models: Simpson’s paradox and data splits

Machine learning has revolutionized environmental sciences by estimating scarce environmental data, such as air quality, land cover type, wildlife population counts, and disease risk. However, current methods for validating these models often ignore the spatial or temporal structure commonly found in environmental data, leading to inaccurate evaluations of model quality. This paper outlines the problems that can arise from such validation methods and describes how to avoid erroneous assumptions about training data structure. In an example on air quality estimation, we show that a poor model with an r² of 0.09 can falsely appear to achieve an r² value of 0.73 by failing to account for Simpson's paradox. This same model's r² can further inflate to 0.82 when improperly splitting data. To ensure high-quality synthetic data for research in environmental science, justice, and health, researchers must use validation procedures that reflect the structure of their training data.


Introduction
Recent advances in computing, machine learning, and remote sensing have led to a revolution in using data generated by machine learning models in environmental science research. Variables that have historically been painstakingly time- and resource-intensive to gather, including air quality (Rybarczyk and Zalakeviciute 2018, Hammer et al 2020, Stowell et al 2020), land use and cover type (Hansen et al 2013, Potapov et al 2022), wildlife population counts (Tuia et al 2022), disease risk (Coroian et al 2020), and many more (Sheffield et al 2006, Michel et al 2020), have all been estimated using machine learning. This has greatly improved the resolution of available environmental datasets by relying on high resolution remote sensing, administrative, or climate reanalysis data as model inputs. These modeled estimates have enabled research in areas that would otherwise lack the necessary monitoring, and have therefore been especially useful for environmental justice research (Banzhaf et al 2019) and research in developing country contexts (Turner et al 2015, Kussul et al 2020).
Despite the popularity of these new synthetic data, the machine learning models that generate them are most commonly validated in ways that do not reflect a model's suitability for environmental research. The first step in validating a model is to separate, or split, the data used for training from the data used for testing in order to avoid data leakage. Data leakage happens when the model is evaluated, or tested, on data that are not independent from the data used for model training (appendix C). Avoiding data leakage therefore prevents unobservable overfitting, wherein similarities between the training and test data lead to misleadingly high performance metrics that do not reflect the model's true ability to generalize. While many studies split data randomly (Stowell et al 2020, Karimian et al 2023, Zhang et al 2023), a random split ignores spatio-temporal or other structure that is common in environmental data (Neumann and Gujer 2008, Baumann 2021, Zhong et al 2021). Several studies have shown this can lead to data leakage and overly optimistic results, and advocate for grouping data in space or time when creating splits (Meyer et al 2018, Schratz et al 2019, Li et al 2020, Ploton et al 2020).
Another issue with common validation methods stems from Simpson's paradox. Simpson's paradox dictates that a trend between two variables can change when groups within the data are separated (figure 1) (Simpson 1951). As a result, when comparing the trend between observed measures and estimates predicted from a machine learning model, trends demonstrating a good fit overall may not hold when the data are grouped in space or time to assess temporal or spatial prediction ability. While most environmental machine learning models are trained with the goal of interpolating new values in a specific dimension, such as space or time (figure 2), metrics are generally calculated over the entire dataset and therefore may not accurately reflect a model's performance on its task. The isolated prediction ability over the dimension of interest must therefore be calculated (Xiao et al 2017, Sousa et al 2021), but it is not always reported, and studies have yet to evaluate its effect on the evaluation of environmental models. Failing to properly split data and account for Simpson's paradox makes validation metrics uninterpretable and difficult to compare across studies.
The goals of this paper are twofold. First, we describe common issues in environmental machine learning model validation that arise as a result of the spatio-temporal structure inherent in much environmental data, and we describe steps to take to avoid these common pitfalls (figure 2). Second, using a case study on air quality prediction, we demonstrate how failing to adhere to these steps through the use of more traditional validation methods can lead to erroneous conclusions about model quality. While similar work has provided empirical evidence on the importance of properly splitting data in environmental modeling (Schratz et al 2019, Li et al 2020, Ploton et al 2020), the same has not been done for Simpson's paradox. Here, we demonstrate that Simpson's paradox can skew validation results even more heavily than improperly splitting data. Additionally, we show that the very flexible nonparametric machine learning models that are increasing in popularity are more prone to unobservable overfitting than more traditional linear models. These insights highlight the urgency of establishing standardized validation practices that accurately convey model performance as machine learning becomes more accessible and widely used (Lary et al 2015, Huntingford et al 2019, Zhong et al 2021). This framework can help ensure that environmental researchers have confidence in data produced using machine learning.

Air quality prediction example
In this paper, we demonstrate how to validate a model that spatially interpolates air quality, specifically fine particulate matter (PM2.5). Government maintained and calibrated air quality monitoring stations are few and far between despite recent increases in smoke exposure from catastrophic wildfires (Williams et al 2019, Balmes 2020), as well as a growing recognition of the dangers of poor urban air quality (Apte et al 2018). To aid public health and environmental justice studies that demand high resolution data, there has been extensive research on increasing the spatial resolution of PM2.5 data using machine learning (Liu et al 2005, Ma et al 2016). Our example is situated in the San Francisco Bay Area of California, a highly populated urban center that has recently suffered several days with very high PM2.5 levels as a result of catastrophic wildfires devastating California (Balmes 2020). We study 2017, the first in a string of recent years in which smoke from catastrophic wildfires caused public health advisories and halted the functioning of several institutions in the Bay Area, including schools and universities (Nauslar et al 2018, Mass and Ovens 2019).
The goal of the model is to generate a 1 km² map of PM2.5 over the San Francisco Bay Area, a region with only 14 EPA air quality monitors that gather daily PM2.5 values (figure 3, red box). To improve prediction accuracy, we allow the model to train over a larger surrounding area with an additional 20 monitors (figure 3, orange box). We train both a linear statistical model and a nonparametric machine learning ensemble model. The linear statistical model is a two-stage model shown to achieve an r² (reported as R²) of 0.80 in southern California using conventional validation techniques (Stowell et al 2020). The nonparametric machine learning model is the ensemble model generated using the AutoGluon Automatic Machine Learning (AutoML) package and is based on several popular machine learning models (Erickson et al 2020). Like most studies in this domain, we use AOD, land cover, elevation, and climatological data as predictive variables. See appendix A for more details on the models and data used for this example.

Validation metrics
The performance of a model is evaluated using metrics that summarize how close the model's predictions are to ground truth. The most common validation metrics in environmental science are the Root Mean Squared Error (RMSE) and the squared Pearson correlation coefficient (r²), often referred to as R². We use these metrics to evaluate the models in the example study, as well as the coefficient of determination, which is also commonly referred to as R². To avoid confusion between r² and the coefficient of determination, we refer to the coefficient of determination as the Nash-Sutcliffe Efficiency (NSE) (Nash and Sutcliffe 1970). Though historically used in different contexts, the NSE and the coefficient of determination have the same formula (appendix B) (Kvålseth 1985). The NSE is similar to r², but has the advantage of an explicit interpretation as the percent of variation explained by the model. It is also sensitive to negative relationships between predictions and observations, and can indicate when the predictions are further from the truth than the mean of the true observations (Kvålseth 1985, Poldrack et al 2020). For a more detailed discussion of these metrics and their different names, refer to appendix B.
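As a concrete reference, the three metrics can be computed as follows (a minimal sketch using NumPy; `obs` and `pred` are arrays of observed and predicted values):

```python
import numpy as np

def rmse(obs, pred):
    # Root Mean Squared Error, in the units of the observations.
    return np.sqrt(np.mean((obs - pred) ** 2))

def r2(obs, pred):
    # Squared Pearson correlation coefficient (r2).
    return np.corrcoef(obs, pred)[0, 1] ** 2

def nse(obs, pred):
    # Nash-Sutcliffe Efficiency (identical in form to the coefficient of
    # determination). 1 is a perfect fit; values below 0 indicate the
    # predictions are worse than simply using the mean of the observations.
    return 1 - np.sum((obs - pred) ** 2) / np.sum((obs - np.mean(obs)) ** 2)
```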
Step 1: Splitting the data

The first step to accounting for spatio-temporal structure during validation is to create a test set that mimics the gaps in data that the model will encounter when it is used to generalize to new settings (appendix C) (Roberts et al 2017, Meyer et al 2018, Schratz et al 2019, Li et al 2020, Wadoux et al 2021). This step avoids data leakage, which happens when training and test data contain observations that are not fully independent. This can lead to unobservable overfitting, wherein the model learns relationships that are not generalizable to new data, but still appears to perform well since the training and test data are related.
This problem can be solved by splitting the data into appropriate training and test sets (Roberts et al 2017, Meyer et al 2018, Schratz et al 2019, Li et al 2020, Wadoux et al 2021). The independence between the training and test data should reflect the level of independence between the training data and the data the model will be used to generate predictions over.
If the goal of a model is to generate new predictions in space, spatially grouped hold out sets should be used; that is, observations from the same location should all be in the same split. If instead the goal is to interpolate in time, the hold out sets must be grouped in time. If the model is intended to interpolate in both time and space, the test set must be made up of space-time pairs where no observations from that time or location are present in the training set.
How close two observations must be to be considered part of the same location or time depends on the nature of the data the model will extrapolate over (Li et al 2020). Some modelers consider several spatial distances and time horizons in their validation procedure, in order to evaluate how much model error depends on the considered distance and/or time horizon (Guilpart et al 2022).
In our air quality prediction example, we are interested in training a model that can generate predictions at new locations in space where no air quality measurements are available. A random split would lead to data leakage since data collected from the same location are related. Therefore, we perform a spatial split and choose to group data by station, sometimes called a site-based cross-validation (CV), as sketched below. Because we have 34 sites in our data training area, our site-based CV has 34 splits. This choice assumes the distance between monitors effectively mimics the distance to the new locations the model will extrapolate to (Li et al 2020).
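A site-based CV of this kind can be expressed with scikit-learn's grouped splitters (a minimal sketch on toy data; the array names and sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n_obs, n_feat, n_sites = 500, 6, 34           # toy sizes for illustration
X = rng.normal(size=(n_obs, n_feat))          # predictors (AOD, weather, ...)
y = rng.normal(size=n_obs)                    # PM2.5 observations
station_id = rng.integers(0, n_sites, n_obs)  # which monitor each row came from

# Site-based CV: all observations from a station land in the same split,
# so held-out monitors are never seen during training. With 34 stations,
# LeaveOneGroupOut yields 34 splits.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=station_id):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ...fit on the training monitors, predict the held-out monitor...
```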
Note that data leakage is only a problem if the model is to be applied to data that are not correspondingly similar to the training set (Wadoux et al 2021). In our air quality example, even though the PM2.5 observations are also correlated in time, we do not use temporally grouped hold out sets since we expect the model to use information from nearby sensors to inform predictions on the same day. Similarly, though some PM2.5 stations are close to one another and are likely related through spatial autocorrelation, this too is legitimate information for the model, since the model has access to information from nearby stations when predicting continuous PM2.5 maps (Wadoux et al 2021). It is nonetheless interesting to compare the difference in performance relative to the remoteness of the station (appendix E, figure S3). Additionally, if the stations were very clustered together, the level of autocorrelation between the stations could become problematic, and a region-based CV could be used to help avoid data leakage (Li et al 2020).
To test the effect of using a spatial versus a random split, we compare the overall prediction abilities of both the linear model and the nonparametric model under each type of split. We expect that a random split will report overly optimistic metrics, with a greater effect for the nonparametric model since its flexibility makes it more prone to overfitting. As in many environmental science applications where data availability is a concern, we use cross-validation to separate the data that we train on from test data (appendix C). However, it is important to recognize that this methodology can lead to data leakage during hyperparameter tuning, because it allows information from the test data to influence the tuning process. Consequently, we advocate for this method primarily in scenarios where extensive hyperparameter tuning or model selection is not a critical concern. In the case of our machine learning model, which does necessitate significant hyperparameter adjustments, the risk of data leakage is mitigated because AutoGluon automates both hyperparameter tuning and model weighting during the training phase (appendix A). We calculate confidence intervals using bootstrapping (appendix D). For this test, we calculate overall prediction ability as opposed to spatial prediction ability, which we calculate only under Step 2.
Step 2: Calculating metrics

After adequately splitting the data and training the model, the ground truthed observations from the test set are compared to model predictions. Simpson's paradox predicts that overall, spatial, and temporal prediction abilities are not necessarily the same (Simpson 1951), so it is important to isolate the prediction ability over the dimension of interest.
Simpson's paradox occurs when a trend between two variables changes when groups within the data are separated (figure 1). This happens when the means of the x and/or y variable are not stationary across groups, as is common across space or time. Simpson's paradox therefore predicts that unitless metrics that quantify trends, such as r² and NSE, can differ when calculating overall, spatial, and temporal model performance. Note that this is an issue only for unitless metrics such as r² and NSE (appendix B). Metrics such as RMSE depend on the units of the observations and predictions and are therefore unchanged by Simpson's paradox. Since the issue is caused by non-stationary means in dimensions other than the one the model is interpolating over, calculating unitless metrics separately over groups in these additional dimensions effectively de-means them and isolates the variation in the dimension of interest (equations (1) and (2)), as the toy example below illustrates.
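To make the mechanism concrete, consider a toy numeric illustration (a sketch with synthetic data, not our case study): a model whose predictions carry no information about the within-day spatial pattern can still achieve a near-perfect overall r² simply because the daily means differ strongly:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two "days" with very different mean PM2.5 (e.g. a smoky day and a clear
# day). Within each day, predictions are pure noise around the day mean.
obs_smoky, pred_smoky = 50 + rng.normal(size=100), 50 + rng.normal(size=100)
obs_clear, pred_clear = 5 + rng.normal(size=100), 5 + rng.normal(size=100)

obs = np.concatenate([obs_smoky, obs_clear])
pred = np.concatenate([pred_smoky, pred_clear])

overall = np.corrcoef(obs, pred)[0, 1] ** 2      # ~1.0, driven by the day means
within = np.mean([np.corrcoef(obs_smoky, pred_smoky)[0, 1] ** 2,
                  np.corrcoef(obs_clear, pred_clear)[0, 1] ** 2])  # ~0
print(f"overall r2 = {overall:.2f}, mean within-day r2 = {within:.2f}")
```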
Since we are interested in our models' abilities to predict new PM2.5 values across space, we are interested in spatial prediction ability. To isolate the PM2.5 model's ability to capture the variability in values across space, we must hold time constant to avoid also including its ability to predict temporal variations. Therefore, rather than calculating the overall performance, we first hold time constant by grouping the test data by temporal group before calculating the metrics, and then averaging them (equation (1)). If instead we were interested in calculating temporal prediction ability, we would calculate the metrics on spatial groups before calculating the average performance across locations (equation (2)). This is conceptually equivalent to controlling for variation in time or space using fixed effects, also known as dummy variables.
$$R^2_{\mathrm{spatial}} = \frac{1}{T}\sum_{t=1}^{T} R^2_t \quad (1)$$

where $R^2_{\mathrm{spatial}}$ is the isolated spatial prediction ability, $T$ is the number of times, and $R^2_t$ is the $R^2$ value for time $t$ computed over all space samples.

$$R^2_{\mathrm{temporal}} = \frac{1}{S}\sum_{s=1}^{S} R^2_s \quad (2)$$

where $R^2_{\mathrm{temporal}}$ is the isolated temporal prediction ability, $S$ is the number of locations, and $R^2_s$ is the $R^2$ value for location $s$, computed over all time samples.
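In code, the isolation amounts to a group-by: score each temporal group across stations (equation (1)) or each station across days (equation (2)), then average. A minimal sketch, assuming a DataFrame `df` with one row per station-day and columns 'day', 'station', 'obs', and cross-validated 'pred' (names are illustrative):

```python
import numpy as np
import pandas as pd

def nse(obs, pred):
    # Nash-Sutcliffe Efficiency (as defined earlier); r2 or RMSE also work.
    return 1 - np.sum((obs - pred) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def spatial_ability(df, metric=nse):
    # Equation (1): hold time fixed, score across stations, then average.
    return df.groupby("day").apply(lambda g: metric(g["obs"], g["pred"])).mean()

def temporal_ability(df, metric=nse):
    # Equation (2): hold location fixed, score across days, then average.
    return df.groupby("station").apply(lambda g: metric(g["obs"], g["pred"])).mean()
```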
We test the effect of spatio-temporal structure and Simpson's paradox on both models' overall, temporal, and spatial prediction abilities by calculating and comparing each of these. We evaluate whether erroneous conclusions about model quality could be made if the overall or temporal prediction abilities were assumed to reflect the model's ability to predict PM2.5 across space. The 95% confidence intervals are calculated using bootstrapping, by randomly removing 10% of the data for the unisolated metrics and across the spatial/temporal groups for the isolated metrics (appendix D).
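For the unisolated metrics, the interval construction described above could look like the following (a sketch of our reading of the procedure in appendix D; `metric` is any of the functions defined earlier, and `obs` and `pred` are NumPy arrays):

```python
import numpy as np

def subsample_ci(obs, pred, metric, n_rep=1000, drop_frac=0.10, seed=0):
    # Recompute the metric many times with a random 10% of the data
    # removed, and report the 2.5th and 97.5th percentiles.
    rng = np.random.default_rng(seed)
    keep = int(round(len(obs) * (1 - drop_frac)))
    stats = []
    for _ in range(n_rep):
        idx = rng.choice(len(obs), size=keep, replace=False)
        stats.append(metric(obs[idx], pred[idx]))
    return np.percentile(stats, [2.5, 97.5])
```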

Results and discussion
Step 1: Splitting the data

To assess the effect of inappropriately using a random split, we compare the overall prediction abilities reported for the same models using both random and spatially grouped hold out sets. As predicted, the random split consistently reports better evaluation metrics than the spatial split (figure 4). Also in line with predictions, this bias is larger and significant over all metrics for the nonparametric model, due to its increased flexibility, which can more easily lead to overfitting. While the r² of the nonparametric model drops from 0.82 to 0.73, the linear statistical model only drops from an r² of 0.74 to 0.72. The increased risk of overfitting presented by modern, flexible models highlights the importance of adopting careful hold out sets in environmental science as such models become more common.
Step 2: Calculating metrics

Next, we test the effect of Simpson's paradox and evaluate how different the spatial, temporal, and overall prediction abilities are. In this case, we are specifically interested in spatial prediction ability, and we assess how our results would be affected if we erroneously evaluated the models based on overall or temporal prediction ability instead. We find that the spatial prediction ability of the models is starkly lower than their overall and temporal prediction abilities (figure 5). The linear statistical model falls from an overall r² of 0.725 to a spatial r² of 0.101, and the nonparametric ensemble model falls from 0.732 to 0.093. The negative NSE informs us that the predictions are worse than if we had simply used the average PM2.5 across locations on each day as the estimates. In other words, neither model effectively explains spatial variability despite achieving high overall prediction ability. The outputs of these models would not be useful for public health or environmental justice research interested in understanding spatial heterogeneity in exposure, even if the resulting maps appear to align with domain knowledge (appendix F). The added noise from the data would bias results towards the null hypothesis, and any added power would be an illusion.
It is easiest to visualize the effect of Simpson's paradox by plotting relationships in the data by group (figures 1 and 6). When calculating spatial prediction ability, we investigate the relationship between observation and prediction separately for each day in order to isolate variation across space (equation (1)). Figure 6(a) plots the predictions from the nonparametric model across the various stations in our sample for a subsample of fifteen days, with each color representing a different day. Visually, there is no clear relationship between predicted and observed PM2.5 between stations within a given day, in line with the low spatial prediction ability found in figure 5. Conversely, figure 6(b) shows predictions for a subsample of three monitoring stations, and it is clear that the model is able to capture the variations in PM2.5 levels across time at a fixed location. This echoes the high temporal prediction ability from figure 5: when holding the location fixed, the model captures temporal variation with exceptional accuracy.
Not all environmental machine learning models are necessarily prone to such large differences between overall and spatial or temporal prediction ability. In this case, the large effect of Simpson's paradox is partially explained by the relatively high variance of PM2.5 across daily means: the variance over time, 34.74 μg m−3, is an order of magnitude greater than the variance across space, 3.96 μg m−3. Intuitively, this is because on any given day, PM2.5 is generally either high or low everywhere across the entire Bay Area, since wildfire smoke is usually either present or not. If the model were trained to predict PM2.5 over a larger area, such as the United States, there would be greater variance across space, since the presence of wildfires on a given day would generally not affect the entire landscape. Given the large variation across time, it makes sense that the model would be more successful in distinguishing between these very different events than the subtle spatial differences within one day. In fact, we find that the models have significantly higher spatial prediction abilities over the entire data training area as opposed to the data prediction area (figure 3), suggesting that the models can better explain the larger variations over this greater area.
One strategy to enhance the model's spatial prediction ability is to remove the temporal variability from the predicted value. Specifically, we calculate a daily mean PM2.5 from the training data, remove this value from the original PM2.5 observations, and train the model to predict this residual difference. The daily means are also included as a predictor and can be added back to the model's residual predictions to retrieve PM2.5 estimates. Equivalently, one could use the same strategy to remove spatial variability if one is focused on temporal prediction ability. We test this strategy on the nonparametric machine learning model. However, while we do find statistically significant improvements in NSE, the model still performs poorly with these changes (r² = 0.01, NSE = −0.34) (appendix G). This suggests that this model and the predictors we use are not powerful enough to explain the nuanced spatial variation in PM2.5 across the Bay Area, even when this is the model's sole focus.
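The de-meaning step could be implemented along these lines (a minimal pandas sketch on a toy table; the column names are illustrative, and the daily means must come from the training stations only to avoid leakage):

```python
import pandas as pd

# Toy training table: one row per station-day with observed PM2.5.
df_train = pd.DataFrame({
    "day":     ["d1", "d1", "d2", "d2"],
    "station": ["s1", "s2", "s1", "s2"],
    "pm25":    [48.0, 52.0, 4.0, 6.0],
    "aod":     [0.9, 1.1, 0.1, 0.2],   # stand-in predictor
})

# Daily mean PM2.5 computed from the training stations only.
daily_mean = df_train.groupby("day")["pm25"].mean().rename("daily_mean")
df_train = df_train.join(daily_mean, on="day")

# Train on the residual, with the daily mean also supplied as a predictor;
# at prediction time, add the daily mean back to recover PM2.5 estimates:
# pm25_hat = model.predict(features) + features["daily_mean"]
df_train["pm25_residual"] = df_train["pm25"] - df_train["daily_mean"]
```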
There is therefore no way to judge the effect that Simpson's paradox can have without calculating the appropriate prediction ability in time or space. Researchers should always calculate model performance in accordance with the model goals, or simply forgo using unitless metrics (Kang et al 2020).

Conclusion
There is an increasing opportunity for environmental science, health, and justice researchers to not only perform their own validation on the data used in their studies, but also to generate their own study-specific models and data. With proper validation strategies, this new wave of data democratization will allow researchers to create datasets specific to the study areas and times of interest at the resolution required (DeFelice et al 2017, Xiao et al 2018, Coroian et al 2020). The technical and computational barriers of the past are dissolving as AutoML packages and cloud computing platforms become increasingly available and easy to use. In this paper, we employed the user-friendly AutoGluon AutoML package to generate the nonparametric ensemble model without needing any input on model design or hyperparameters (Erickson et al 2020). We also used free cloud computing made available by Google Colab to train this model. Given the comparable performance of the nonparametric model to the highly engineered statistical model in this study, such packages and platforms can be useful to researchers interested in getting started with machine learning in environmental science.
However, using these tools and models opens up the possibility for errors in evaluation. The environmental ground truthed data used to train environmental machine learning models are commonly structured in space and time, rendering conventional validation strategies inappropriate (Neumann and Gujer 2008). This paper outlines steps that ensure proper validation of models with spatio-temporal dimensions. First, hold out sets must effectively avoid data leakage by grouping splits in either space or time (Roberts et al 2017, Meyer et al 2018, Schratz et al 2019). Our example on PM2.5 prediction shows how data leakage from improperly designed hold out sets can lead to overfitting and biased evaluations, especially when using highly flexible models such as a nonparametric ensemble model. Second, isolating the prediction ability in the relevant dimension is important, as it can be strikingly different from the overall prediction ability (Xiao et al 2017, Sousa et al 2021). The combination of these steps took a model from appearing promising with an r² of 0.82 to revealing its unfitness for research with an r² of 0.09. With the data and machine learning revolution afoot in the environmental sciences, this paper proposes a standard and relevant validation workflow. The guide outlined in this paper can help ensure robust environmental science, health, and justice research as we enter a new era of data generation and availability.
Acknowledgments

Funding for this work comes from the National Science Foundation Graduate Research Fellowship (1650114). Use was made of computational facilities purchased with funds from the National Science Foundation (CNS-1725797) and administered by the Center for Scientific Computing (CSC). The CSC is supported by the California NanoSystems Institute and the Materials Research Science and Engineering Center (MRSEC, NSF DMR 1720256) at UC Santa Barbara.

Appendices
Appendix A. Data and models used in the air quality example

Data

All data used in the training and testing of the models in this study are described in table A1.

Models
We test the effects of conventional and best practice validation approaches on two types of models: a linear statistical model and a nonparametric machine learning ensemble model. The statistical model is a two-stage linear model previously shown to achieve an r² (reported as R²) of 0.80 in southern California using conventional validation techniques (Stowell et al 2020). The stages of the algorithm are a linear mixed effects model followed by a geographically weighted regression (equations (A1), (A2)). The nonparametric machine learning model is the ensemble model generated using the AutoGluon AutoML package. It is based on several popular machine learning models, including random forests, boosted trees, neural networks, and k-nearest neighbors. Using these two types of models allows us to evaluate the importance of the proposed validation techniques across model types.

Statistical two-stage model
The statistical model consists of two sequential steps which leverage first the temporal and then the spatial information (Stowell et al 2020). The first regression is a linear mixed effects model with the variable 'Day' acting as a random effect (equation (A1)). The second step consists of predicting the error of the first model using a geographically weighted regression model (equation (A2)). Following the process of the original authors, we perform feature selection on the linear mixed effects model and only include AOD as an input to the geographically weighted regression (Stowell et al 2020). We use the Akaike Information Criterion (AIC) for feature selection. This feature selection introduces some data leakage and potential for overfitting since we do not use a validation set. If this model were to be used for generating data used in research, a validation set would need to be used for the feature selection.

$$\hat{\varepsilon}_{sd} = b_{0,s} + b_{1,s}\,\mathrm{AOD}_{sd} \quad (A2)$$

where $\hat{\varepsilon}_{sd}$ is the predicted first-stage error, and $b_{0,s}$ and $b_{1,s}$ are location-specific intercepts and slopes calculated as a weighted sum of the least squares regression coefficients at each monitoring location using a Gaussian kernel. $\mathrm{AOD}_{sd}$ represents either the retrieved AOD value or, if no such value is available due to cloud cover, an indicator value representing whether AOD is measured. Two separate models were trained depending on whether AOD is available. The model for when AOD is available was trained on a dataset which removed instances with missing AOD, while the model for missing AOD values was trained on all data with an indicator in place of the AOD data.
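A sketch of how the two stages could be assembled in Python, assuming a DataFrame `df` with illustrative column names ('pm25', 'aod', 'relhumidity', 'windspeed', 'day', 'lon', 'lat'); statsmodels and the mgwr package are stand-ins here, not the original authors' implementation:

```python
import statsmodels.formula.api as smf
from mgwr.sel_bw import Sel_BW
from mgwr.gwr import GWR

# Stage 1: linear mixed effects model with day-specific random intercept
# and AOD slope (a simplified form of equation (A1)).
stage1 = smf.mixedlm("pm25 ~ aod + relhumidity + windspeed",
                     df, groups=df["day"], re_formula="~aod").fit()
df["resid"] = df["pm25"] - stage1.fittedvalues

# Stage 2: geographically weighted regression of the stage-1 error on AOD
# with a Gaussian kernel (equation (A2)).
coords = list(zip(df["lon"], df["lat"]))
y = df["resid"].to_numpy().reshape(-1, 1)
X = df[["aod"]].to_numpy()
bw = Sel_BW(coords, y, X, kernel="gaussian").search()
stage2 = GWR(coords, y, X, bw, kernel="gaussian").fit()
```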

AutoGluon AutoML package
AutoGluon is an AutoML package available in Python which allows the user to easily train a variety of popular nonparametric machine learning models with little required knowledge of each method or the parameters which govern its training and performance (Erickson et al 2020). Though the user may manually choose to set these parameters, by default the package tunes them on its own. Most models included in the package are drawn from other popular Python machine learning libraries. The included models are: CatBoost boosted trees, LightGBM boosted trees, scikit-learn nearest neighbors, AutoGluon tabular deep neural networks, scikit-learn random forests, and scikit-learn extremely randomized trees. For the case study, we use the ensemble method, which trains all available models and weights their outputs to optimize performance. Because AutoGluon automatically performs hyperparameter tuning and model weighting during training, this avoids the data leakage issue introduced by not using a validation set with the statistical two-stage model, since the hyperparameters are tuned on the training data alone. Additionally, unlike with the two-stage model, we do not perform any feature selection.
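For reference, training the ensemble takes only a few lines (a minimal sketch; the table and column names are illustrative):

```python
from autogluon.tabular import TabularPredictor

# train_df / test_df: pandas DataFrames of station-day rows, with the
# observed 'pm25' label plus predictor columns (AOD, weather, land cover, ...).
predictor = TabularPredictor(label="pm25",
                             eval_metric="root_mean_squared_error")
predictor.fit(train_df)            # tunes and ensembles the base models
pm25_hat = predictor.predict(test_df)
```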
Table A1. Included data and their sources.

In order to train these models, we follow the lead of previous studies and use PM2.5 measurements from the Environmental Protection Agency's (EPA) ground level air quality monitors and predictive variables relating to Aerosol Optical Depth (AOD), meteorological conditions, land cover/land use, and PM2.5 point emissions (Di et al 2016, Hu et al 2017, Stowell et al 2020). Specifically, we predict PM2.5 using AOD, AOD-derived plume boundaries, maximum temperature, wind speed, wind direction, precipitation, relative humidity, boundary layer height, elevation, forest cover, length of roads and streets within a pixel, and point PM2.5 emissions data. We include all days in which all data sources used for the study are available, a total of 327 days in 2017.
The mapped predictions from both models (figure B1) contain patterns that could possibly lead to significant spatial prediction ability. Notably, both appear to show better air quality at higher elevations and over natural areas. However, one might also notice that nearly all of the monitors are located in urban areas, which also happen to mostly be at low elevations. The models would therefore need to successfully differentiate between higher and lower PM2.5 stations within urban areas in order to achieve high spatial prediction ability. While this becomes a more difficult task to validate using domain knowledge, it remains in line with what is needed to study fine differences in exposure for people residing in the Bay Area.
Calculating the spatial prediction ability therefore becomes even more important for accurately gauging model quality.
A modeler simply using the mapped predictions to evaluate model performance may find solutions to remove the visible artifacts from the linear model and be satisfied with the result. The jagged lines result from using the outline of smoke plumes in the model. Because plumes are present during very high PM2.5 days and the linear model can have difficulty extrapolating to such high values accurately, this can leave outlines of the plumes visible even when averaging over all days. Additionally, the small dots across the landscape with unrealistically high PM2.5 values represent locations of PM2.5 point sources, for which, again, the linear model was unable to properly extrapolate. Therefore, one could simply remove the plumes and point sources from the model and likely create a map that aligns much more closely with domain knowledge. Indeed, other studies using the same linear model and the same inputs but without plumes and point sources have produced intuitive maps with high overall prediction abilities (Stowell et al 2020).
However, studying the map resulting from the nonparametric model warns us that even if these artifacts were no longer visible, the spatial prediction ability would not necessarily improve. Indeed, it is difficult to imagine why removing information about the location of a smoke plume would improve model accuracy. It is therefore important that the spatial prediction ability be calculated regardless of whether the maps appear to align with domain knowledge.

Figure 1. Illustration of Simpson's paradox. Simpson's paradox dictates that a trend between two variables may change when groups within the data are separated. This occurs when the means of the x and/or y variable are not stationary across groups. For example, this figure shows how the very different means in predicted and observed values for groups 1 and 2 can lead to differences between grouped and overall trends. This figure demonstrates how a model may have poor predictive ability over a particular dimension, such as space or time, while maintaining a high overall prediction ability.

Figure 2. Appropriate validation approaches are governed by the goal of the environmental machine learning model. One of the most common goals of machine learning in the environmental sciences is to generate artificial data that fills in the gaps where ground truthed data are missing in space or time. (a) Predicting in space tends to be necessary when observations are repeatedly taken in the same location, such as by a monitoring station, but information is needed in other areas as well. (b) Predicting in time is helpful to fill in irregular or sparse time series.

Figure 3. Study area: San Francisco Bay Area, California. The target prediction area (red box) encompasses 12,096 km² and 14 EPA PM2.5 monitors (black dots). In order to improve prediction accuracy, we train the data on a larger area which includes a buffer (orange box) and a total of 34 EPA PM2.5 monitors, as has been shown in previous studies to improve results (Hu et al 2014).

Figure 4. Random hold out sets ineffectively guard against data leakage and unobservable overfitting. Because the models predict PM2.5 values in space rather than time, spatially grouped hold out sets avoid data leakage. Random hold out sets lead to a biased evaluation with overly optimistic RMSE, NSE, and r² values. This effect is more pronounced for highly flexible models such as the nonparametric ensemble models, since their flexibility makes them more prone to overfitting. The parentheses represent 95% confidence intervals.

Figure 5. The effects of isolating spatial and temporal prediction ability. After performing spatial cross-validation, we use equations (1) and (2) to isolate the ability of the models to predict variations in space and time. The parentheses represent 95% confidence intervals.

Figure 6. Simpson's paradox drives radically different spatial and temporal prediction abilities. The predicted versus observed PM2.5 values are plotted for the nonparametric model for a subset of (a) 15 days and (b) 3 stations. (a) Each color represents a separate day, with points within a day capturing variation across space. Spatial prediction ability is visibly poor, since there is no apparent relationship between predicted and observed PM2.5 values across space. (b) Each color represents a different station, and points from each station represent a different day. This shows that the temporal prediction ability of the same model is strong, since the relationship across days appears strong. The x and y ranges are cut off from 0 to 50 for ease of visualization.
$$\begin{aligned} \mathrm{PM}_{2.5,sd} = {} & (b_0 + b_{0,d}) + (b_1 + b_{1,d})\,\mathrm{AOD}_{sd} + (b_2 + b_{2,d})\,\mathrm{RelHumidity}_{sd} + (b_3 + b_{3,d})\,\mathrm{WindSpeed}_{sd} \\ & + (b_4 + b_{4,d})\,\mathrm{Temp}_{sd} + (b_5 + b_{5,d})\,\mathrm{PlumesLow}_{sd} + (b_6 + b_{6,d})\,\mathrm{PlumesMed}_{sd} \\ & + (b_7 + b_{7,d})\,\mathrm{PlumesHigh}_{sd} + (b_8 + b_{8,d})\,\mathrm{BoundaryLayerHeight}_{sd} \\ & + b_9\,\mathrm{Forest}_{s} + b_{10}\,\mathrm{Street}_{s} + b_{11}\,\mathrm{Elevation}_{s} + b_{12}\,\mathrm{Emissions}_{s} + \varepsilon_{sd}, \\ & (b_{0,d}, \ldots, b_{8,d}) \sim N\big((0, \ldots, 0),\, \Psi\big) \quad (A1) \end{aligned}$$

where $\mathrm{PM}_{2.5,sd}$ represents ground-level PM2.5 concentrations in μg m−3 at each site (s) for each day (d); $b_0$ and $b_{0,d}$ are respectively the fixed and random intercepts for the model; $\mathrm{AOD}_{sd}$ denotes the AOD values at site s and day d with fixed and random day-specific slopes $b_1$ and $b_{1,d}$ respectively; $\mathrm{RelHumidity}_{sd}$ represents the measured relative humidity at site s and day d with fixed and random day-specific slopes $b_2$ and $b_{2,d}$ respectively; $\mathrm{WindSpeed}_{sd}$ is the maximum wind speed at site s and day d with fixed and random day-specific slopes $b_3$ and $b_{3,d}$ respectively; $\mathrm{Temp}_{sd}$ is the maximum daily temperature at each site; $\mathrm{PlumesLow}_{sd}$, $\mathrm{PlumesMed}_{sd}$, and $\mathrm{PlumesHigh}_{sd}$ are indicators for the presence of remotely sensed low, medium, and high density plumes; $\mathrm{BoundaryLayerHeight}_{sd}$ is the boundary layer height at site s and day d; $\mathrm{Forest}_{s}$ is an indicator variable denoting pixel forest cover; $\mathrm{Street}_{s}$ is the sum of the length of all streets in the pixel; $\mathrm{Elevation}_{s}$ is the elevation; and $\mathrm{Emissions}_{s}$ are the yearly point emissions per pixel as reported by the EPA. $\varepsilon_{sd}$ is the error term. $b_{0,d}$ through $b_{8,d}$ are multivariate normally distributed and $\Psi$ represents the variance-covariance matrix for all random effects.

Figure G1. Improvement in the spatial prediction ability of the nonparametric machine learning model is possible by removing daily means from the predicted values. The spatial prediction ability sees a statistically significant improvement in NSE relative to the model trained using the regular protocol. The temporal and unisolated prediction abilities also improve since the daily means are included in the model's predictors.