Multivariate spatial prediction of air pollutant concentrations with INLA

Estimates of daily air pollution concentrations with complete spatial and temporal coverage are important for supporting epidemiologic studies and health impact assessments. While numerous approaches have been developed for modeling air pollution, they typically only consider each pollutant separately. We describe a spatial multipollutant data fusion model that combines monitoring measurements and chemical transport model simulations that leverages dependence between pollutants to improve spatial prediction. For the contiguous United States, we created a data product of daily concentration for 12 pollutants (CO, NOx, NO2, SO2, O3, PM10, and PM2.5 species EC, OC, NO3, NH4, SO4) during the period 2005 to 2014. Out-of-sample prediction showed good performance, particularly for daily PM2.5 species EC (R2 = 0.64), OC (R2 = 0.75), NH4 (R2 = 0.84), NO3 (R2 = 0.73), and SO4 (R2 = 0.80). By employing the integrated nested Laplace approximation (INLA) for Bayesian inference, our approach also provides model-based prediction error estimates. The daily data product at 12 km spatial resolution will be publicly available immediately upon publication. To our knowledge this is the first publicly available data product for major PM2.5 species and several gases at this spatial and temporal resolution.


Introduction
Ambient air pollution measurements from regulatory monitoring networks are routinely used to support large population-based epidemiologic studies globally (Dai et al 2014, Chen et al 2018, Vicedo-Cabrera et al 2020. However, regulatory monitors are spatially sparse, temporally incomplete, and preferentially located in areas with large populations. There has been considerable effort to estimate air quality by supplementing monitoring measurements with additional data sources, such as numerical model simulations, dispersion model outputs, satellite imagery, portable monitors and low-cost sensors (Özkaynak et  The main objective of this study is to create a publicly available data product of daily pollutant concentrations over the contiguous United States for the period 2005 to 2014. This is accomplished by combining measurements from monitoring networks and outputs from the Community Multiscale Air Quality (CMAQ) model (Binkowski et al 2003, Appel et al 2007, Appel et al 2008. CMAQ is a 3-dimensional chemical transport model (CTM) that simulates pollution concentrations using emission inventory data, meteorology, atmospheric chemistry and physical principles. CMAQ outputs have been used extensively in previous applications to estimate pollutant concentrations in the US (Berrocal et al 2012, Friberg et al 2016, Senthilkumar et al 2019 and China (Liang et al 2017, Lyu et al 2019. One key advantage of using CTM outputs for modeling air quality is its complete spatial-temporal coverage, which has also been leveraged to address missing data in remotely-sensing aerosol optical depth when estimating air quality , Shaddick et al 2018. Our approach to utilizing CTM outputs have several unique features. First, CTM simulations are conducted under a multipollutant framework because concentrations of pollutants are often correlated due to shared emission sources and meteorological drivers. Exploiting cross-pollutant dependence may lead to improved prediction performance, as demonstrated by Berrocal et al (2010) in a bi-pollutant model for CMAQ PM 2.5 and ozone, and by Crooks et al (2014) that imposed a sum-constraint for modeling PM 2.5 species. Here we take a joint modeling approach that utilizes multipollutant CMAQ outputs as predictors of each other and incorporates spatial cross-pollutant correlation in monitoring measurements. We also expand on previous joint modeling work by considering a larger set of 12 pollutants, including fine and coarse particulate matter (PM 2.5 and PM 10 ), carbon monoxide (CO), nitrogen dioxide (NO 2 ), nitrogen oxides (NOx), sulfur dioxide (SO 2 ), ozone (O 3 ), and 5 major PM 2.5 species: including nitrate (NO 3 ), ammonium (NH 4 ), sulfate (SO 4 ), elemental carbon (EC) and organic carbon (OC).
A second feature of our approach is the use of Bayesian geostatistical models that allow for complex multipollutant dependence based on spatial locations. Geostatistical models are widely used for modeling air pollution levels (Bergen et al 2013, Hu et al 2013, Keller et al 2015, Berrocal et al 2020, and studies have shown that considering spatial autocorrelation can be particularly beneficial for large study regions. The Bayesian geostatistical approach also provides straight-forward uncertainty quantification for each prediction (e.g., prediction standard error) that can be difficult to obtain from machine learning algorithms. Prediction uncertainty can be incorporated as exposure measurement error in subsequent health impact assessments or health effect analyses (Rushworth et al 2014, Blangiardo et al 2016, Butland et al 2019. Consideration for prediction uncertainty may be particularly important when the modeling domain includes regions with sparse monitors. Finally, to address computational burden typically associated with Bayesian computation, we also describe how to employ integrated nested Laplace approximation (INLA) for model fitting and prediction with large spatial-temporal datasets.

Air quality data
Daily air pollution concentration measurements are acquired from the US Environmental Protection Agency (EPA) covering the continental US for the period 2005-2014. We include 7 particulate pollutant (PM 2.5 , PM 10 , NO 3 − , NH 4 + , EC, OC, and SO 4 2− ), and 5 gases (CO, NOx, NO 2 , SO 2 and O 3 ). Following daily air quality metrics most often used in health studies, particulate pollutants are recorded as daily averages, ozone is recorded as the daily maximum 8 h averages, and other gases are recorded as the maximum 1 h average. The daily AQS dataset (US EPA, 2017) includes a total of 2,644 stations measuring at least one pollutant, as shown in figure 1. Some pollutants are available every day, while others are only measured every 3 or 6 days. The pollutant measurement time and unites are shown in table A1 in supplemental materials. For each pollutant, numerical model outputs from CMAQ version 5.0.2 are acquired from the US EPA. CMAQ simulations were run at a 12 km resolution with 299×459 grid cells over the contiguous US. Hourly CMAQ concentration simulations were used to calculate daily metrics. For developing the prediction model, each AQS monitor is linked to a CMAQ grid cell. The sample correlation between CMAQ and AQS for the same pollutant ranges from 0.12 for PM 10 to 0.74 for O 3 . Table 1 provides the number of observations for each pollutant and the number of station-days where each pair of pollutants are measured at the same location and same date; additional summary statistics are provided in Senthilkumar et al (2019). The most commonly measured pollutant across days and locations is O 3 , and the least commonly measured pollutants are the PM 2.5 species EC and OC. There are strong correlations between several pairs of pollutants that motivate using a multivariate model for prediction. For example, more intenselysampled NO 3 and sparsely-sampled EC have sample correlation 0.79 at collocated monitoring sites.

Multi-pollutant geostatistical models
For each of the = p 12 pollutants, let Y s t , k ( ) be the AQS measurement of pollutant = k p 1,..., at spatial location s and day t. To avoid prediction of negative concentrations, we transform the responses as + log Y s t , 1 . k { ( ) } Further, to put all pollutants on the same scale for multivariate modeling, we standardize across s and t so that each pollutant has mean zero and variance one. Denote y s t , k ( )as the log-transformed and standardized pollution concentration. Similarly, denote x s t , k ( ) as the log-transformed and standardized of value of CMAQ for pollutant k for the grid cell containing location s on day t.
A hierarchical framework is used to jointly model the spatial variation of multiple air pollutants while accounting for bias and error in the deterministic CMAQ model outputs. For each pollutant interest, we assume the model where b k0 is the intercept for pollutant k, b t kl ( ) is the association between pollutant l from CMAQ and the pollutant of interest, w s t , k ( ) is the spatial random effect for pollutant k assumed to be independent over time after accounting for the CMAQ covariates, and the residual error component e s s t N , 0 , k k 2 ( ) ( )are independent over space and time. In the above model, we borrow information across pollutants by (1) using multiple CMAQ pollutant as predictors, and (2) imposing between-pollutant dependence in the residual spatial random effects w s t , . k ( ) The regression coefficients for pollutant k are allowed to vary over time via a random slopes model where m k and S k are the random effect mean vector and covariance matrix, respectively, and the slopes are independent over time. The multivariate spatial random effects w s t , k ( ) explains the residual correlation not addressed by CMAQ and is modelled using the Linear Model of Coregionalization (LMC; Gelfand et al 2004). The LMC captures both spatial correlation between nearby locations and cross-correlation between the p pollutants. The LMC is a spatial extension of factor analysis (Everitt and Hothorn, 2011) with the random effect for each pollution modeled as a linear combination of spatially-varying factors. That is, where L kj are the factor loadings and z s t , j ( ) are the spatial factors (independent over j and t ). The spatial factors z s t , j ( ) are modeled as Gaussian processes with mean zero, variance one, and an isotropic Matérn spatial correlation function (Stein, 2012). The factor loadings determine the cross-covariance with

Estimation procedure
Fitting the model described in section 3 using all of the data pooled over days, sites and pollutants is computationally infeasible. The analysis thus proceeds in two stages: we first estimate the random slopes b t k ( ) using a non-spatial method and then analyze the residuals using a multivariate spatial model. Predictions from these two stages are then combined to give the final predictions.
Estimation of the random slopes begins by performing a least squares regression separately for each day and each pollutant. For pollutant k on day t, , ..., ) and the associatedṕ p covariance matrix (of the sampling distribution) be V t . k ( ) Using these estimates we then fit the Bayesian random effect model ( )This stage is fit separately by pollutant using JAGS (Plummer 2003) with 10,000 Markov Chain Monte Carlo (MCMC) iterations after a burn-in of 1,000 iterations. Let b t k ̲ ( ) denote the mean from the posterior samples. Since these regression coefficients are estimated without accounting for correlation between observations, they are suboptimal estimates of covariate effects and their posterior variances likely underestimate their true uncertainty. However, since our main goal is spatial prediction and not inference on the effects of the CMAQ predictors, this non-spatial first step is a reasonable computational compromise.
In the second stage, we fit a multivariate spatial model in equation (1) separately by day with b t k ( ) fixed at its first-stage estimate, b t .  (2019)) for the spatial parameters. The PC prior is weakly informative and penalizes complexity by shrinking the Matérn spatial range parameter toward infinity, and the marginal standard deviation and error precision toward zero. The smoothness parameter is fixed at 1. The prior distributions for the coefficients l j 's are N(0, 0.1). The model is fit using the R-INLA package (Lindgren and Rue, 2015).
As mentioned in section 3, the temporal trend is captured by the random slope model in the first stage. The final prediction will use its first-stage posterior estimate, b t k ̲ ( ) with CMAQ covariates and the multivariate spatial prediction results. The spatial prediction is performed in INLA at the mesh locations (constructed by Delaunay triangles) and then projected on a set of dense regular prediction grids or nodes covering the US domain. As it is quite computationally expensive for dense mesh over a large domain, we employ a moderate number of mesh nodes (varying depending on the number of observed locations for all pollutants in a day) to achieve computation efficiency. Figure 2 shows the comparison of observed AQS locations for PM 2.5 species in Exploratory analysis suggests that many of the pollutants are uncorrelated after accounting for the CMAQ outputs in the mean structure. Therefore, as a final computational simplification, we fit the multivariate spatial model separately to correlated groups of pollutants. Since not all monitors are collocated, to determine which pollutants are strongly correlated, we first calculate the cross-covariogram (CCG; Nychka et al 2017) for each pair of pollutants using the residuals after removing the first-stage CMAQ mean estimates. The strength of association between each pair of pollutants is measured using the CCG estimate (averaged over days) at distance 0.5 degrees (about 50 kilometers). The number of pollutants to be included in a multivariate model is determined by cross-validation; the remaining pollutants are analyzed using a univariate spatial model in INLA. In the cross-validation study, data with all 12 pollutants measured on the same day in the first 9 years are used to calculate and rank the averaged pairwise CCG. We use the final year of data (all days included) to select models. The training set consists 90% of the data which is randomly separated by day across sites and pollutants and we use it to fit the multivariate models and the remaining 10% treated as the testing set estimates root mean square error (RMSE) and R-square between predicted and the true values. The comparison is summarized in section 5.2. And figure 3 shows a flowchart of the modeling and estimation procedures with the final selected model. Table 2 presents mean regression coefficient m k in equation (2) as a measure of the impact of CMAQ output for different pollutants on predicting a specific pollutant concentration. For most pollutants, their corresponding CMAQ output is the strongest predictor. For example, for O 3 , the CMAQ O 3 output has average coefficient 0.75 while the other 11 CMAQ pollutants have average coefficients less than 0.29. However, many of the crosspollutant relationships are statistically significant with the 95% interval of m k excluding zero. To support this conclusion, we also considered the simple linear regression model with only the corresponding CMAQ pollutant and found the multivariate model with all CMAQ pollutant predictors gave better fit (table A2 in supplemental materials).    Table 3 shows the prediction RMSE and R 2 for the multivariate models with 4 (EC, OC, NH 4 , NO 3 ), 5 (EC, OC, NH 4 , NO 3 , SO 4 ) and 6 (EC, OC, NH 4 , NO 3 , SO 4 , CO) highly-correlated pollutants defined by the CCG ranking. Almost all multivariate models have smaller RMSE than univariate models and the model with 5 pollutants: EC, OC, NH 4 , NO 3 , and SO 4 performs the best with the lower prediction error and higher R 2 than other models; therefore it is selected as our final model for multivariate INLA fitting. Results for multivariate models with 7 or more pollutants are not shown in the table due to poor performance with over-parameterization.

Multi-pollutant cross-correlations
Relationships between pollutants not explained by CMAQ can be captured in the residual correlation matrix, W = Cor{w s , 1 ( ) K, w s p ( )}, derived from the covariance in equation (4) (where p=5 with the chosen model in section 4). The cross-correlation is allowed to vary by day, and table 4 shows the mean (standard deviation) across days. The strongest residual cross-correlations are between EC/OC, NH 4 /NO 3 and NH 4 /SO 4 , which is consistent with the ranking of CCG in section 4.

Spatial predictions and uncertainty quantification
As described in section 4.1, predictions are made at mesh locations and then interpolated to CMAQ grid which is of 12 km resolution for all pollutants. The time series plots in figure A1 show that all pollutants have a seasonal periodical trend. For example, EC and NO 3 tend to be higher in winter and lower in summer, while others behave the opposite. Beside the seasonality, most of the pollutant concentrations have decreased over the years, except that OC remains at a flat level. Figure 4 gives season-specific maps (median concentrations) and corresponding standard error for EC, OC and NO 3 in 2014. These three pollutants show clear spatial patterns with higher levels in the central valley of California and Midwest and Eastern regions of the US. As a primary pollutant, EC shows less spatial smoothness compared to OC and NO 3 . The high level of NO 3 in the Midwest during winter has been attributed to high emission of ammonia associated with animal agriculture (Pitchford et al 2009). The prediction standard error is plotted for comparison across seasons and their variation is mainly driven by the number of AQS observations associated with the model. For example, NO 3 is measured slightly more sparsely in winter and spring, which leads to an elevated uncertainty for the prediction. The regions with Table 3. Spatial prediction performance from cross-validation experiments, comparing root-mean square error (RMSE) and R 2 of univariate model (UV) and multivariate models (MV) with 6, 5 and 4 pollutants. Outcomes are standardized to have variance of 1 for ease of comparison across pollutants. fewer monitors in the west and mountainous areas also have higher standard prediction error. Similar season concentration maps for NH 4 and SO 4 are given in supplementary materials figure A2. Figure 5 compares seasonal predictions of NO 3 and SO 4 from the multi-pollutant model with raw CMAQ outputs, showing that predictions from the multivariate model are slightly higher overall. while EC and NH 4 are very similar with CMAQ. NO 3 predictions in the Fall from multivariate models show less variation compared to CMAQ; Spring and Summer predictions are slightly higher overall. The comparisons for other pollutants are included in appendix figure A3.

Discussion
Our prediction accuracy compares favorably to previous efforts to map air pollution. Senthilkumar et al (2019) also conducted national data fusions of CMAQ outputs and AQS monitor measurements but using one pollutant at a time. While it is difficult to compare prediction performance metrics due to different cross-validation designs, there is evidence that our multipollutant approach results in improved performance, especially for pollutants that have sparse monitoring networks when we are able to leverage cross-pollutant spatial dependence. For example, in Senthilkumar et al (2019) (Rundel et al 2015).
There are several previous efforts to estimate national PM 2.5 and O 3 concentrations and our univariate model approach under-performs. For example, Berrocal et al (2020) compared various geostatistical and machine learning methods to estimate national daily PM 2.5 levels in 2011. The best performing model (universal kriging with CMAQ output) has an out-of-sample R 2 value of 0.76. Other studies have shown that this may be further improved by incorporating other predictions such as land use, meteorology, and satellite-derived aerosol optical depth (AOD). For example, Hu et al (2017) reported an out-of-sample R 2 of 0.80 for PM 2.5 prediction from a random forest model for the US; Di et al (2019) reported an R 2 of 0.86 for PM 2.5 at 1km spatial resolution using an ensemble of machine learning algorithms; more recently, Kianian et al (2021) incorporated gap-filled satellite AOD and reported an R 2 of 0.80 for PM 2.5 . For national ozone predictions, Senthilkumar et al (2019) reported an R 2 of 0.83 and achieved similar prediction performance (R 2 of 0.76) in their US model. Our lower R2 values for PM 2.5 (R 2 =0.67) and O 3 (R 2 =0.77) are likely due to the lack of extensive additional predictors, especially at fine-scale spatial resolutions. The use of multiple CMAQ predictors might also result in model overfit. Moreover, previous studies have demonstrated the usefulness of borrowing spatial information across nearby CMAQ grid cells (Berrocal et al 2012, Reich et al 2014 and allowing the regression coefficients to be spatially-varying (Berrocal et al 2012 that are not considered here. One of the main contributions of this work is the creation of a US national daily pollution product for PM 2.5 components. The growing studies on PM 2.5 components have typically been restricted to specific regions, e.g., northeastern US (Di et al 2016) and California (Geng et al 2020), as well as for long-term averages (Van Donkelaar et al 2019). Similar to previous studies, among the PM 2.5 components examined, we found that prediction is more challenging for EC, likely due to its high spatial heterogeneity and limited monitoring stations. One important distinction of our modeling approach is the ability to provide uncertainty estimates for each prediction.
In summary, we describe a multi-pollutant modeling framework and computational approaches to estimate daily ambient pollution concentrations. We utilize numerical model simulations of air quality, observed monitoring measurements, and correlations between pollutants to create a daily data product at 12 km spatial resolutions for the contiguous US. Model evaluations have demonstrated the benefits of considering multiple pollutants simultaneously to improve predictive performance. Pollutant concentration estimates and their associated predictions uncertainties have been made accessible at zenodo.org. To our knowledge this is the first publicly available data product for major PM 2.5 species and several gases (e.g., CO, NO x ) at this spatial and temporal resolution.