Long-term satellite-based estimates of air quality and premature mortality in Equatorial Asia through deep neural networks

N Bruni Zani; G Lonati; M I Mead; M T Latif; P Crippa

doi:10.1088/1748-9326/abb733

1. Introduction

Atmospheric particles (i.e. aerosols) have been shown to be responsible for increased morbidity and premature mortality (Cohen et al 2017, Burnett et al 2018), particularly in developing countries, where extreme aerosol episodes are more frequent and intense than in high-income countries (WHO 2018). In recent decades, many countries worldwide have experienced rapid development, with fast economic growth, industrialization and urbanization (Muntean 2018, United Nations 2018), which have led to increased primary emissions and enhanced secondary formation of aerosols in the atmosphere.

To improve understanding of atmospheric pollution impacts and inform policymakers on effective mitigation strategies, there is a strong need to assess aerosols' properties at high spatio-temporal resolution. This includes information on the distribution of particulate matter (PM) with aerodynamic diameter smaller than 10 μm and 2.5 μm (PM₁₀ and PM_2.5, respectively). Large uncertainties in the estimates of PM environmental and societal impacts arise from the incomplete understanding of the key controls dictating the spatio-temporal variability of degraded air quality conditions and extreme aerosol events. Multiple factors, including climate variability, meteorological conditions and land-use change, potentially play different and changing roles in increased occurrence of extreme pollution episodes globally (Jacob and Winner 2009, Fiore et al 2012, Hong et al 2019, Turnock et al 2020).

Data from monitoring networks are frequently used to produce localized assessments, but their spatio-temporal coverage degrades dramatically in developing countries, where air quality is generally worse and in need of monitoring. Advances in satellite‐borne instrumentation and data retrieval protocols now allow identification of pollution sources and detailed monitoring of atmospheric properties and thus of air quality conditions with almost global and near real‐time coverage. Multiple mathematical approaches have been proposed to infer ground-level PM concentrations from satellite retrievals of column integrated aerosol optical depth (AOD). Simple statistical linear models (LMs) have demonstrated high potential for global mapping (Donkelaar et al 2010, Reid et al 2012) and more sophisticated proxies have been also successfully proposed to predict ultrafine particle concentrations (Kulmala et al 2011, Crippa et al 2013) and to account for aerosol dynamics (Sullivan et al 2016, Crippa et al 2017). However, LMs present limited skills in predicting PM₁₀ spatio-temporal distribution and cannot capture mechanisms involved in aerosol dynamics, chemistry and transport processes which are characterized by a strong non-linearity and interactions between variables (Seinfield and Pandis 2016). To overcome major limitations of prior studies, machine learning approaches represent a unique opportunity given their high predictive skills (Grgurić et al 2014, Li et al 2017, Chen et al 2018, Di et al 2019, Shtein et al 2019) and low computational expense compared to the widely used Earth System Models (Huntingford et al 2019, Reichstein et al 2019). Specifically, artificial neural networks have shown to be one of the most effective and low-demanding tools in predicting the spatio-temporal distribution of both gaseous pollutants and atmospheric PM (Feng et al 2019, Lautenschlager et al 2020), especially over areas with sparse monitoring sites (Alimissis et al, 2018).

The present study develops a novel and general Deep Learning approach for spatially and temporally continuous air pollution mapping based on a suite of satellite retrievals. As previous studies mainly rely on AOD, meteorological and land-use data to infer ground-level PM (Ma et al 2014, Wei et al 2019), this machine learning application aims to account for chemical processes connected to primary and secondary aerosol formation/evolution and for different emission sources by exclusively relying on satellite-retrieved data. Moreover, the presented application targets an entire decade (2005–2015), with the purpose to quantify and capture spatio-temporal patterns and long-term trends in PM₁₀ and PM_2.5 concentrations and their epidemiological impacts. Specifically, this work aims to (i) investigate the predictive skills of a set of satellite-based proxies in reproducing ground-level PM₁₀ through deep neural networks (DNNs), (ii) explore the predicted seasonal and inter-annual changes in air quality over the period 2005–2015 in response to variable atmospheric composition, land use and emissions, and (iii) analyze the chronic health impacts due to PM_2.5 exposure. The focus is on Equatorial Asia, a tropical region particularly sensitive to changes in climate that in recent decades has experienced significant urbanization and land-use/land-cover change (Field et al 2009, Gaveau et al 2014). Recent studies have shown that haze episodes and extreme air pollution concentrations have become more frequent due to both increased local/urban emissions and transboundary pollution (Aouizerats et al 2015, Lee et al 2018, Hansen et al 2019, Alifa et al 2020). Equatorial Asia is also currently one of the most densely populated regions in the world, thus the need of improving air quality to reduce harmful impacts on human health is particularly pressing.

2. Data

2.1. PM₁₀ observations

Long-term observations of PM₁₀ concentrations from a network comprising 52 ground-level monitoring stations across Peninsular Malaysia and Malaysian Borneo are analyzed (figure 1). The sites have been active during the period 1997–2015 and monitored PM₁₀ concentration through beta attenuation or tapered element oscillating microbalance instruments, as part of the continuous air quality monitoring program of Malaysia. Measurements have been standardized using universal calibration approaches. In this work, daily mean PM₁₀ values are used to investigate air pollution variability at multiple time scales, including monthly, seasonal, annual, and inter-annual.

**Figure 1.** Analyzed region and location of the 52 ground-based stations monitoring PM₁₀ (blue triangles). The color shading indicates satellite-retrieved tropospheric column NO₂ [mmol m⁻²] averaged during 2005–2015.
Download figure:
Standard image High-resolution image

2.2. Satellite observations of aerosols, atmospheric trace gases and land use

Satellite retrievals of aerosol properties, trace gases and land use are used to develop a satellite-based proxy able to capture the variability of ground-level PM₁₀. Key features of the analyzed satellite retrievals are summarized in table S1. Specifically, our proxy is based on:

Aerosol optical depth (AOD) data from MODIS (Moderate-resolution Imaging Spectroradiometer) Collection 6 deployed onboard the NASA Terra and Aqua satellites. Level 2 (L2) daily AOD (at the λ = 550 nm wavelength) at 1 km × 1 km resolution from the multi-angle implementation of atmospheric correction (MAIAC) algorithm (Lyapustin et al 2018) is used. MAIAC is chosen because characterized by a wider spatial coverage and higher retrieval accuracy, compared to other algorithms applied in neighboring regions (Mhawish et al 2019). AOD is chosen as a proxy for suspended aerosols in the atmosphere, including fine solid PM. Over land, 95.5% of the AOD data used have the highest quality in the MAIAC product.

Column water vapor (CWV), retrieved as a daily, 1 km × 1 km resolution data from MODIS on Terra and Aqua, and corrected through MAIAC. CWV is considered as indicator of liquid suspended particles/droplets, which affect AOD measurements as absorbing part of the radiation detected by MODIS.
Normalized difference vegetation index (NDVI), retrieved as monthly Level 3 (L3) 1 km × 1 km resolution quantity from MODIS onboard Aqua. NDVI denotes the vegetation surface coverage and is used as a proxy of natural emissions of PM precursors (e.g. volatile organic compounds (VOCs), mainly isoprene), as well as an important absorber of both atmospheric PM₁₀ and its gaseous precursors (Nowak et al 2014, 2018).
Carbon monoxide (CO), tropospheric amount derived from the Measurements of Pollution in the Troposphere (MOPITT) sensor onboard Terra (Deeter et al 2017), as gridded L3 (Version 8) monthly averages at 1° × 1° latitude × longitude resolution. CO is taken into account as an indicator of primary PM emissions derived from both anthropogenic (e.g. traffic) and natural (e.g. wildfires) combustion processes.
Urban fraction (UF), from the Consensus Landcover dataset (Tuanmu and Jetz 2014), as a single satellite image with 30-arc-second spatial resolution (∼1 km at the equator). UF is included, as expected to be positively associated with anthropogenic emissions of both primary PM₁₀ and gaseous precursors of secondary aerosol.

Tropospheric amounts of trace gases and ultra-violet (UV) irradiance, measured by the ozone monitoring instrument (OMI) onboard the NASA's Aura spacecraft, are also considered in this study as key precursors of both inorganic and organic secondary atmospheric PM:

Nitrogen dioxide (NO₂), daily tropospheric column, cloud-screened at 30%, with 0.25 × 0.25° latitude × longitude resolution from the OMNO2d (V3) L3 product (Duncan et al 2018).
Sulfur dioxide (SO₂), daily L3 column amount within the planetary boundary layer (OMSO2e, V3) at 0.25° × 0.25° resolution (Krotkov et al 2008).
Formaldehyde (HCHO), daily L3 weighted mean global V3 (OMHCHOd) HCHO column amount, gridded at 0.1° × 0.1° resolution. HCHO is included, similarly to (Sullivan et al 2016), as a proxy for the availability of secondary organic aerosol precursors (e.g. VOCs).
Ultra-violet irradiance (UV), daily gridded L2 retrieval (OMUVBG, V3) at 0.25° × 0.25° resolution, measured at λ = 310 nm. UV is considered as the main energy source of the photochemical reactions that lead to secondary aerosol formation.

2.3. Population data

Population data are retrieved from the Socioeconomic Data and Application Center (SEDAC) census archived in the NASA Earth Observing System Data and Information System. Population counts, available at 1 km × 1 km resolution for the years 2005, 2010 and 2015, are upscaled to the reference grid by summing all the cells falling inside each 0.25° × 0.25° square unit. A linear regression across the 3 available years is performed for each grid cell i to account for the different demographic growth rate across the whole region for all other intermediate years.

3. Methods

3.1. Data pre-processing

To test models skills in predicting ground-level PM₁₀, satellite data are extracted by averaging the values of the pixels falling within a 20 km radius around each measuring station. The radius choice derives from our sensitivity analysis that identifies the minimum radius maximizing the correlation between daily PM₁₀ and AOD while retaining an appreciable number of non-missing data over the entire period 2005–2015. AOD averages are computed if at least 5% of the total number of pixels inside the 20 km radius are non-missing data.

An autocorrelation analysis is performed at each site to quantify the actual scales of PM₁₀ temporal variability. As the PM₁₀ autocovariance function displays an exponential decay (also shown by (Alifa et al 2020)), the mean autocovariance among sites reaches the value of 1/e (∼0.37) at a lag equal to 7 d (figure S1 (https://stacks.iop.org/ERL/15/104088/mmedia)), thus we average daily PM₁₀ concentration using a 7-d moving average without discarding significant temporal variability. This moving average is applied only when at least 3/7 values are non-missing. An analogous moving average is applied to daily values of AOD, CWV, NO₂, SO₂, HCHO and UV. Monthly satellite-retrievals of CO and NDVI are replicated on a daily basis for each month and a 7-d moving average is applied to smooth the transition between consecutive months.

Satellite retrievals are homogenized to the reference OMI 0.25° × 0.25° (latitude × longitude) grid when aiming to predict PM₁₀ maps over Equatorial Asia. As CO is available at 1° × 1° resolution, each grid cell is divided into 16 sub-cells containing the same value of the initial one to match the reference grid.

3.2. Deep neural networks

DNNs are powerful non-parametric approaches to explain highly non-linear relationships between input and output (Goodfellow et al 2016) and hence appropriate to explain atmospheric chemistry processes in the Earth system (Reichstein et al 2019). Here DNNs are trained using satellite data extracted with a 20 km averaging radius around each station and aggregated with a 7-d moving average. Based on our sensitivity analysis on model performance (figure S2), we define our DNN to comprise two subsequent hidden layers of ten and nine nodes, respectively. Given the 9 inputs and 1 output, the total number of model's parameters is 209 (figure S3). DNNs are trained on a randomly extracted 80% subset of all available data; then, validation is performed on the remaining 20%. The data for DNN are selected when all the nine input variables are non-missing at the same time-step. One hundred trials are performed to eliminate the dependence of model performance on individual random sampling of training data. The overall performance of DNN is evaluated with the Pearson (r) and Spearman (ρ) correlation coefficients between observed and modeled values. The model bias and error are quantified on a seasonal basis using the normalized mean bias factor (NMBF) and the normalized mean absolute error factor (NMAEF) (Yu et al 2006), defined as:

where m_i represents the estimated PM₁₀ and o_i the observed one, while $\bar m$ and $\bar o$ their associated means and n the number of samples of the entire dataset. DNN predictive skills are also compared against the performance of a LM having the same input and output variables. Moreover, as seasonal phenomena (mainly monsoons and wildfires) are present in the analyzed area, model evaluation is also performed over distinct seasons: winter, spring, summer and fall (DJF, MAM, JJA and SON, respectively).

To predict annual mean PM₁₀ spatial fields, other DNNs are trained on monthly aggregated data. PM₁₀ patterns are thus predicted from the monthly aggregated satellite variables homogenized to the reference grid at 0.25° × 0.25° resolution. Due to the presence of several missing values in satellite-retrieved CO, the aforementioned monthly based DNNs are integrated with an additional DNN trained by excluding this variable and used on grid cells where CO is missing. Monthly PM₁₀ maps at 0.25° × 0.25° resolution are finally averaged on a yearly basis, to obtain annual maps during 2005–2015.

A sensitivity analysis performed on a set of meteorological parameters obtained from the Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA2, (Gelaro et al 2017)) indicates that meteorological variables do not enhance DNN performance on a monthly basis (see table S2), while they are more relevant over shorter time scales and when more data are available (table S3). Thus when aiming to predict annual mean PM₁₀ fields, meteorological data are not incorporated into DNN.

3.3. Premature mortality estimates

We apply the global exposure mortality models (GEMMs) (Burnett et al 2018) to estimate the mortality burden associated with the satellite-derived PM₁₀ concentrations. As GEMMs require PM_2.5 data, we estimate the PM_2.5/PM₁₀ ratio from simulations of the Weather Research and Forecasting model with Chemistry (WRF-Chem) presented in (Crippa et al 2016) for September, October and November 2015. The annual domain average PM_2.5/PM₁₀, retrieved by assuming September and October to be wildfire months and November being representative of the other 10 'ordinary' months, is estimated by averaging all the on-land grid cells. A value of 0.622 ± 0.036 is found, consistent with (Amil et al 2016).

The relative risk (RR), which is the probability of a fatal outcome from a specific disease due to PM_2.5 chronic exposure divided by the risk of the same outcome in the case of no-exposure, is calculated for each grid cell i and year, following (Burnett et al 2018):

$\begin{equation}R{R_i} = {\text{exp}}\left( {\frac{{\theta \cdot {\text{log}}\left( {\frac{{{{\tilde z}_i}}}{\alpha } + 1} \right)}}{{1 + {\text{exp}}\left( { - \frac{{{{\tilde z}_i} - u}}{\nu }} \right)}}} \right)\end{equation} \tag{ 3 }$

where θ, α, μ, μ are age and disease-specific parameters and ${\tilde z_i}$ the maximum between zero and the difference between PM_2.5 concentration in grid cell i and the no-observed-effect concentration (2.4 μg m⁻³). RR is calculated by referring to 5-year age groups >25 years old (i.e. 25–30, 31–35, 36–40... 80 plus) and to four chronic diseases: chronic obstructive pulmonary disease (COPD), lung cancer (LC), ischemic heart disease (IHD) and stroke (S).

The premature deaths in each year and grid cell i are computed as:

$\begin{equation}{\text{Death}}{{\text{s}}_{\text{i}}} = \frac{{{\text{R}}{{\text{R}}_{\text{i}}} - 1}}{{{\text{R}}{{\text{R}}_{\text{i}}}}} \cdot {\text{B}} \cdot {{\text{P}}_{\text{i}}}\end{equation} \tag{ 4 }$

where P_i is the total population living in cell i for a specific year and B the yearly baseline mortality for a specific chronic disease, derived from the Global Burden of Disease (Naghavi et al, 2017). The total population living in the region in figure 1 is considered for the health impact assessment. The uncertainty of (i) PM_2.5/PM₁₀ ratio, (ii) GEMMs parameter (θ) and (iii) B, is propagated into the estimates of yearly premature deaths with a Monte Carlo approach on 10 000 simulations, assuming a Gaussian distribution of each of the parameters, similarly to (Giani et al 2020). Estimation of the uncertainty associated with DNNs parameters remains an open question in computer science (Goodfellow et al 2016), thus the uncertainty from PM₁₀ predictions cannot be included in our assessment.

4. Results

4.1. Sources of seasonal and inter-annual variability of air quality in Equatorial Asia

During 2005–2015, significant seasonal patterns are observed in ground-measured PM₁₀, with highest concentrations during summer and lowest in winter (figure S4). The mean PM₁₀ averaged across all stations is higher during Jun–Aug (50.90 ± 23.63 µg m⁻³) and lower during Dec–Feb (45.63 ± 15.71 µg m⁻³). This is related to the occurrence of monsoon cycles characterized by a dry period during May–September (Southwest Monsoon) and a wet period during November–March (Northeast Monsoon). Higher PM₁₀ levels are likely during the Southwest Monsoon, as humidity is lower, rainfall is less frequent (Tan et al 2015) and widespread wildfires may occur, particularly during drought years enhanced by intense El-Niño Southern Oscillation (ENSO) conditions (Marlier et al 2012, Field et al 2016). PM₁₀ peaks appear in fall 2006 and 2015 (figure S4), due to the dry conditions brought by ENSO and associated wildfires. During these years, central Sumatra and southern Borneo were hotspots of intense and widespread wildfires, which degraded air quality in surrounding areas (Crippa et al 2016, Field et al 2016). In addition to PM₁₀ seasonal variability due to climate and meteorological conditions, satellite retrievals also display a strong seasonality which may also explain the PM₁₀ variability. AOD and CO show a strong seasonality, with highest values over southern Borneo and central Sumatra during fall (figures S5 and S6), as a result of the Southwest Monsoon and possible wildfires occurrence. A seasonal pattern is also noticeable in SO₂, which highlights the impact of volcanic emissions (Carn et al 2017), especially over Java island and during winter (figure S7). Conversely, no clear seasonal pattern is observed for NO₂ and HCHO (figures S8 and S9), which however clearly identify key anthropogenic sources. Given the significant seasonal variability of the predictors, our DNNs are developed on a seasonal basis to capture seasonal phenomena key to explain PM₁₀ variability and ultimately improve satellite-based PM₁₀ predictions.

4.2. Model evaluation

The evaluation of seasonal DNN performance and comparison to LM, as described in section 3.2, are here presented. Significantly higher r and ρ are found for DNN compared to LM (table 1), which indicate that DNNs present higher performance in predicting ground-level PM₁₀ values. This is likely due to the presence of non-linear mechanisms and interactions between variables, typical of environmental pollution systems, which are not captured by the LM. The accuracy of DNN predictions is also higher, as both model bias (NMBF) and absolute error (NMAEF) are significantly closer to 0. Moreover, while LM underestimates PM₁₀ observations in every season, DNN presents a negative NMBF in summer and fall, when PM₁₀ peaks generally occur, and a slightly overestimation in winter and spring. NMBF remains anyway modest and lower compared to prior chemical transport model simulations performed over the area (Gao et al 2014, Crippa et al 2016).

Table 1. DNN and LM performance on 7-d moving average PM₁₀ data. Statistical metrics include correlation coefficients (r and ρ) as well as NMBF and NMAEF (see definitions in Methods) computed on validation data (i.e. 20% of the sample size n).

Seasons	Model	r	ρ	NMBF	NMAEF	n
Winter	DNN	0.694	0.628	0.0001	0.1891	11 775
Winter	LM	0.453	0.372	−0.0438	0.2442	11 775
Spring	DNN	0.643	0.544	0.0007	0.2018	15 830
Spring	LM	0.488	0.379	−0.0421	0.2369	15 830
Summer	DNN	0.744	0.628	−0.0025	0.2208	15 845
Summer	LM	0.598	0.553	−0.0482	0.2589	15 845
Fall	DNN	0.777	0.656	−0.0004	0.2018	8335
Fall	LM	0.684	0.565	−0.0403	0.2420	8335

The seasonal variability of model performance reflects the complexity of the relationship between dependent (measured PM₁₀) and independent variables (satellite-retrieved). DNNs skill on relatively low PM₁₀ (∼40 µg m⁻³) remains similar among seasons; however, the occurrence of higher PM₁₀ in summer and fall favors an improved fit, with the model being able to estimate a wider range of values (figure S10). Such model behavior suggests the presence of a baseline PM₁₀ level that cannot be fully explained by the predictors included in the model. Some satellite retrievals are also subject to higher uncertainty when aiming to detect low levels of trace gases, thus other predictors, such as meteorological variables, may be included to provide additional information on PM₁₀ variability.

DNN seasonal performance is generally higher when input data are aggregated on a monthly basis. The monthly averaging reduces some short-term variability, while still capturing seasonal patterns, and produces a non-negligible increase in the linear correlation coefficients r and ρ, compared to DNN trained on 7-d moving averages (table 2 and figure S11). In this case, training is performed on all the available data, as the monthly temporal aggregation reduces the sample size by an order of magnitude and precludes generating random samples, representative of most of data variability, for training and validation.

Table 2. DNN statistic metrics for model evaluation. Comparison between DNN trained on data aggregated with 7-d moving averages (MA) and monthly means (MM). Statistical metrics of model performance are computed using the entire sample size n, differently from the results shown in table 1, which refer to the validation on 20% of all available data.

Season	Model	r	ρ	NMBF	NMAEF	n
Winter	MA	0.731	0.666	−0.0004	0.1779	11 775
Winter	MM	0.712	0.668	−0.0004	0.1594	1112
Spring	MA	0.694	0.578	0.0002	0.1921	15 830
Spring	MM	0.724	0.679	−0.0009	0.1491	1393
Summer	MA	0.786	0.644	−8.625e–05	0.2087	15 845
Summer	MM	0.805	0.759	0.0009	0.1474	1466
Fall	MA	0.825	0.690	−0.0003	0.1873	8335
Fall	MM	0.905	0.741	−0.0009	0.1423	1081

4.3. Spatial variability in annual PM₁₀ and PM_2.5 concentrations

DNN trained on monthly aggregated data are applied to predict annual mean PM₁₀ at 0.25° × 0.25° resolution. The estimated annual PM₁₀ means are slightly underestimated compared to the observed values (mean bias of −3.15 µg m⁻³ over the entire period). Figure 2 compares two ordinary years (i.e. 2008 and 2014) against 2006 and 2015, which instead experienced widespread wildfires and, consequently, intense haze phenomena and extreme concentrations particularly in southern Borneo and central Sumatra.

**Figure 2.** Annual mean PM₁₀ concentrations (0.25° × 0.25° resolution) estimated with satellite-based DNN during ordinary (2008 and 2014) and wildfire (2006 and 2015) years.
Download figure:
Standard image High-resolution image

Diffusion and dispersion phenomena are also captured by the model, as wildfire emissions appear to have spread towards densely populated areas (mostly Singapore and Kuala Lumpur), as also identified in prior modeling studies (Crippa et al 2016, Lee et al 2018, Mead et al 2018).

High values of PM₁₀ (>50 µg m⁻³) also occur over Peninsular Malaysia, central/eastern Sumatra and part of Java every year (see figure 1 for the locations of these islands), due to the combined effect of local emission sources and transnational pollution transport (Lee et al 2016). Urban scale pollution is also captured by the model, as localized pollution peaks are present over metropolitan areas including Singapore, Jakarta and Kuala Lumpur. The yearly average PM₁₀ over these areas always exceeds the World Health Organization threshold of 50 µg m⁻³, thus suggesting that both wildfires and large anthropogenic emissions are critical in deteriorating the regional air quality and lead to potentially severe impacts on human health. Analogous results are found for yearly maps of PM_2.5 where most of the analyzed domain exceeds the yearly average WHO standard of 10 µg m⁻³ (figure S12).

4.4. Trends in human health impacts

Yearly satellite-based PM₁₀ spatial fields, generated with DNNs, are converted to PM_2.5 maps (see example in figure S12) using a PM_2.5/PM₁₀ ratio estimated from WRF-Chem (see Methods) and fed to the GEMMs. Premature deaths are computed for each year by integrating the estimated RR with population distribution maps. The total premature mortality burden and the associated 95% confidence interval (C.I.) are reported in table 3.

Table 3. Total estimated premature deaths associated with PM_2.5 and 95% confidence interval (C.I.) (columns 2 and 3, respectively) over the analyzed domain for each year during 2005–2015. Columns 4–7 indicate the percentage of premature deaths associated with each of the diseases analyzed (see section 3.3 for their definition). Column 8 contains the total exposed population (in millions) over the analyzed domain.

Year	Total Deaths	95% C.I.	%COPD	%LC	%IHD	%S	Population
2005	149 500	108 400–193 400	10.51	6.97	49.20	33.31	284
2006	158 200	111 100–202 300	10.84	6.95	48.51	33.70	289
2007	149 500	106 000–193 500	10.69	7.02	49.57	32.72	294
2008	159 800	112 200–205 600	10.35	7.07	49.54	33.05	298
2009	162 100	115 300–210 100	10.41	6.87	49.66	33.06	303
2010	160 700	113 600–206 800	10.07	7.04	50.42	32.47	308
2011	173 900	124 200–223 400	10.00	7.08	49.97	32.95	313
2012	170 900	122 800–218 400	9.91	6.99	50.40	32.71	317
2013	173 500	122 800–224 400	9.71	7.08	50.56	32.65	322
2014	191 000	137 100–245 000	9.60	7.22	49.50	33.68	327
2015	203 900	145 200–260 100	9.50	7.24	49.22	34.04	332

The most relevant diseases over the analyzed decade are IHD and stroke, which on average contribute to 49.69% and 33.12% of the premature deaths, while COPD and LC are responsible for the 10.14% and 7.05%, respectively. A positive trend is seen in the absolute number of estimated deaths, partially due to the significant population growth over the area: from ∼284 M in 2005 to ∼332 M in 2015. A rise through the years is evident also when the number of deaths is normalized by the exposed population: the mean trend, calculated by excluding 2006 and 2015 as 'extraordinary' wildfire years, is significantly increasing (+5.19 deaths/Mpop/year, p-value = 0.033, figure 3). The trend including all years would be +6.38 deaths/Mpop/year. This suggests that regional air quality has deteriorated and its effects enhanced during the past decades, especially over big cities, where the majority of people lives and population growth rates are higher. The effect of wildfire occurrence is clear, particularly in 2015, and to a lower extent in 2006. The number of deaths per million inhabitants in these years is in fact higher than the mean trend. The same happens in 2014, although less affected by wildfires, as it presented higher mean concentrations than other years (figure 3). Our mortality rate of ∼570 deaths/Mpop/year for the four analyzed diseases, quantified as the mean trend in figure 3, moderately underestimates the World Health Organization value of 676.4 deaths/Mpop/year for 2016 (computed as the mean rate for Indonesia and Malaysia) (WHO 2020).

**Figure 3.** Annual median premature deaths (white line) per million people due to PM_2.5 chronic exposure. The blue shading indicates the standard deviation computed through Monte Carlo approach, assuming a Gaussian distribution of death estimates. The trend over 2005–2015 is indicated by the black dashed line.
Download figure:
Standard image High-resolution image

The total burden of PM_2.5-related deaths during 2005–2015 is mapped to highlight the most impacted regions (figure 4). Big metropolitan areas, including Jakarta, Singapore and Kuala Lumpur, stand out clearly as the most affected areas. This is certainly due to the large population, but PM_2.5 also plays a crucial role, as it peaks over those locations (figure 2). Other PM_2.5-related health effects, beyond to big cities, are prominent in highly urbanized areas, such as Java and the west coast of Peninsular Malaysia. Health effects of wildfires are instead moderate over southern Borneo and central Sumatra, which are sparsely populated (figure 4). Conversely, wildfire contribution to premature mortality is most likely determined by transport phenomena from burnt areas to densely populated areas, thus impacting the total amount of victims (table 3).

5. Conclusions

In this study, we develop a novel DNN approach trained on a suite of satellite-retrieved variables related to atmospheric physics and chemistry, and land use, to remotely predict ground-level PM₁₀ concentrations. The model is developed for Equatorial Asia but its applicability extends to any region with a sparse monitoring network. DNNs generally show enhanced predictive skills and lower bias compared to the more classical LM approach, as able to capture significant non-linear mechanisms and variable interactions dictating PM₁₀ concentrations. On a seasonal basis, the dry period brought by the Southwest monsoon, possibly enhanced by ENSO, is found to be associated with higher PM₁₀ that lead to an improved model performance during summer and fall. Higher PM₁₀ in the fall is associated with the less frequent wet deposition processes and enhanced wildfires occurrence, which determined the extreme haze events recorded in 2006 and 2015. As significant spatio-temporal variability remains poorly explained for relatively low PM₁₀, future research should focus on including additional meteorological variables (e.g. wind speed/direction, ground temperature and planetary boundary layer height), which may enable description of aerosol vertical profiles and transport and dispersion processes on the local scale. Further, while this study shows high skill of DNN in predicting surface PM concentrations, future investigations could be directed to quantify the predictive skills of a different machine learning approaches (e.g. random forests, gradient boosting machine or mixed models).

The annual PM₁₀ and PM_2.5 maps reveal significant spatial and inter-annual patterns related to both anthropogenic drivers and wildfires. The estimated health impacts indicate that metropolitan areas remain the most affected, due to the combined effect of numerous anthropogenic emissions and high population density. Conversely, the effect of wildfires dominates on the regional scale, as indicated by the strong inter-annual variability in the number of premature deaths over the region, which are significantly higher during fire years than adjacent non-fire years. We also found a significant increasing trend of PM_2.5-related mortality of +1600 additional deaths per year on average over 2005–2015. In addition to the population growth, a possible explanation for this include the ongoing urbanization and land-use and land-cover changes, as well as large-scale climatic changes, such as the enhanced intensity of ENSO and more frequent wildfire events. Future research will be directed to attribute the role of these drivers through numerical model simulations including different climate conditions and emission scenarios.

Acknowledgments

The authors gratefully acknowledge the National University of Malaysia (Universiti Kebangsaan Malaysia, UKM), METMalaysia (Malaysian Meteorological Department) and the Malaysian Department of Environment (DOE) for providing access to the observational data used in this study. The authors acknowledge NASA for accessibility to MODIS and OMI data (downloaded from open-access website https://earthdata.nasa.gov), MERRA-2 output (https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/data_access/) and population/demographic data (obtained from NASA SEDAC (https://sedac.ciesin.columbia.edu). Global Burden of Disease used in this study have been accessed from the Institute for Health Metric and Evaluation website: http://ghdx.healthdata.org/ihme_data.

Data availability statement

The data that support the findings of this study are available upon reasonable request from the authors.

Long-term satellite-based estimates of air quality and premature mortality in Equatorial Asia through deep neural networks

Article metrics

Submit

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction