Can we use a machine learning approach to predict the impact of heatwaves on emergency department attendance?

Global warming has contributed to more frequent and severe extreme weather events, which has led to increased research on the health impacts of extreme heat. However, research on heatwaves, air quality, and their spatial impact on health service demand is limited. This study used machine learning (ML) approaches to obtain the optimised model to predict health service demand associated with those risk factors for an all-age model and compared it with young children (0–4 years) model in Perth. Ten years’ data (2006–2015) on emergency department attendances (EDA), socioeconomic status (SES), heatwaves, landscape fires, and gaseous and particulate air pollutants were collected. ML approaches, including decision tree, random forest (RF), and geographical random forest (GRF) models, were used to compare and select the best model for predicting EDA and identify important risk factors. Five-hundred cross validations were performed using the testing data, and a construct validation was performed by comparing actual and predicted EDA data. The results showed that the RF model outperformed other models, and SES, air quality, and heatwaves were among the important risk factors to predict EDA. The GRF model was fitted well to the data (R2 = 0.975) and further showed that heatwaves had significant geographic variations and a joint effect with PM2.5 in the southern suburbs of the study area for young children. The RF and GRF models have satisfactory performance in predicting the impact of heatwaves, air quality, and SES on EDA. Heatwaves and air quality have great spatial heterogeneity. Spatial interactions between heatwaves, SES, and air quality measures were the most important predictive risk factors of EDA for young children in the Perth southern suburbs. Future studies are warranted to confirm the findings from this study on a wider scale.

One important aspect of environmental studies concerns geographical patterns of environmental hazards. In recent decades, ML methods have been used to predict rainfall and develop spatial models for air pollutants Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. [30,31] and to define the heatwave threshold [32]. However, there is a lack of studies using the ML approach to predict health effects from heatwaves and air quality.
This study aimed to apply ML approaches to obtain the best model for predicting the effects of heatwaves, air quality, and other important risk factors on EDA demands for Perth residents, and their spatial variations for young children as a special stratified group.

Research setting and study period
Perth is the capital city of Western Australia (WA) (figure 1). It has a Mediterranean climate with hot and dry summers and cool, wet winters because of the domination of subtropical high-pressure systems [33]. The study was conducted in the Perth metropolitan area and covered 10 years (2006-2015) for the warm months (November-April).

EDA data
The EDA data was collected from the WA Emergency Department Data Collection and aggregated by statistical area level 3 (SA3) and age groups (0-4, 5-9, 10-14, 15-59, and 60 + years) to obtain daily counts. SA3 is a relatively large geographical area defined by the ABS. One SA3 generally has a population of between 30,000 and 130,000 people. A total of 21 SA3s were identified in the Perth metropolitan area and included in the analysis.

Heatwave data
Heatwave was measured as Excess Heat Factor (EHF). It was initially available in gridded data with 5 × 5 kilometre pixels in a Network Common Data Form format provided by the Australian Bureau of Meteorology [5]. The data was converted into daily raster layers then was extracted from the layers and aggregated to (SA3) [35] using ESRI ArcMap software (Version 10.5).
An EHF value was calculated by combining the effects of excess heat (the long-term temperature difference categorized by every locality's climatology of heat, EHI sig ) and heat stress (the short-term (preceding 30 days) temperature difference based on recent thermal acclimatization, EHI accl ). EHF formulated as Average daily temperature (DAT) is the average of the maximum temperature and the next day's minimum temperature [5]. Thus, EHF provided a relative measure of the load, intensity, spatial distribution, and duration of a heatwave event.
In this study, a day was deemed as a heatwave day if the EHF value was greater than zero. Then the possible lag effects of heatwave on health were assessed. A cut-off value of 80% was applied to define a heatwave day for Perth, which means that if 80% of the study area (SA3s) had an EHF value >0 on a day, the whole Perth metropolitan area would be counted as a heatwave day.

Air quality data
Air quality data (carbon monoxide (CO), nitrogen dioxide (NO 2 ), sulphur dioxide (SO 2 ), ozone (O 3 ), particulate matters with an aerodynamic diameter 10 micrometres (PM 10 ), and aerodynamic diameter 2.5 micrometres (PM 2.5 )) were sourced from the Department of Environmental Regulation, WA from eight air quality monitoring stations (figure 1). The daily average of the top eight hourly values of each air pollutant was calculated to derive a daily measure. Due to the limited number of monitoring stations available in the study area, the air pollutant values for each SA3 were estimated using the Inverse Distance Weighted (IDW) method [36,37]. The IDW is an interpolation method to estimate values for unsampled locations based on their distance and radius to sampled locations in geographic information systems. IDW assumes closer values are more related than further values with its functions. Finally, the values for each air pollutant were categorized into three levels based on their distribution percentiles, low (<25th percentile), middle (25th-75th percentile), and high (>75th percentile).
Daily data on the size and location of landscape fires (including wildfires and government-managed prescribed burns) was sourced from the Department of Biodiversity, Conservation, and Attractions, WA.

ML modelling
ML was implemented using RStudio (Version 1.1.463). Several predictive models described below were evaluated for their accuracy. The whole data set was randomly divided into a 70:30 split (70% for training models and 30% for testing cross validation). Further validation of EDA was conducted by training the optimised model using 2006 to 2014 (nine years) data to predict 2015 (one year) data. The predicted values were then compared with the observed 2015 data. Finally, the optimised model was applied to the data for young children.
To assess the effect of heatwaves and air quality on EDA, all potential risk factors were included in the models, including all-age group, socioeconomic status (SES) measured by the socioeconomic index for area (SEIFA) categories (1 = disadvantaged and most disadvantaged area, 2 = middle area, 3 = second least disadvantaged and the least disadvantaged area) [38], daily heatwaves events (heatwave/non-heatwave), air pollutants (low/middle/high levels), landscape fire events, week periods (weekend/weekday) and public holidays (yes/no).
The Baseline Model does not include independent variables (predictors or risk factors). It only includes dependent variable (EDAs) in the training data to predict the number of EDAs. The Multiple Linear Regression model assumes that EDA has a linear function of risk factors. The Decision Tree Models [39] make no assumptions about the relations between the dependent and independent variables. Since the internal crossvalidation is done at the same time as growing a tree, the result has an error measurement that is used to find the optimal number of splits for decision nodes and terminal nodes. The Full Decision Tree includes all nodes and splits, while the Pruned Tree removes trees that are non-critical, and only keeps the optimal number of splits in the model to enhance efficiency and predictive accuracy by reducing overfitting.
A Random Forest (RF) Model is an ensemble-learning method for which a multitude of decision trees are constructed at the training time and outputting the mean prediction of the individual trees to explain the relationships between EDA and risk factors [40]. A maximum of four variables (mtry, number of variables randomly sampled at each split) were selected for each node of a decision tree based on the default rule of 1/3 variables set in the RandomForest package. A total of 500 decision trees (ntree) were modeled [40,41]. The best model was determined based on the maximum model coefficient of determination (R 2 ) and the lowest error values (the root mean square error (RMSE) and the mean absolute error (MAE)). Finally, the actual and predicted values were compared to obtain the R 2 value.
The percentage increase in mean square error (%IncMSE) is a relative measure for ranking the importance of all risk factors. This measure was derived based on how much the accuracy decreased when a variable was excluded from the model [41]. Variables with larger positive %IncMSE values would have a significant impact on predicted EDA. The higher the score, the greater the contribution of a variable to predicting the EDA.
The Geographical Random Forest (GRF) Model is an extension of a RF that can address the spatial heterogeneity of the model. A newly introduced package, SpatialML [42,43] was used to develop the GRF models with the local weight value set to one and the global weight value set to zero. An adaptive kernel for the GRF models was used. There were 21 SA3s in the study area, so 20 bandwidths were selected to consider 20 neighbors' weights for each SA3 modeling.
The GRF model was applied to young children as an exploration. The sum of air pollutants and EHF values for the years 2006 to 2015 was computed to derive a single value for each SA3. For example, to derive a PM 2.5 value for one SA3, the sum of 10 years of PM 2.5 concentrations for this SA3 was computed. The prediction for EDA was performed on the validation datasets with 30% of randomly selected data.

Results
During the study period, the total number of heatwave days was 163 (table 1). The years 2011 and 2012 had 25 heatwave days each, and the majority of them occurred in January and February. The most significant lag effect of heatwaves on EDA was identified on Day 3 of the heatwave events based on the coefficient of the Poisson regression. A 3-day lag was then used in all subsequent analyses for the association between heatwaves, air quality, and EDA.
The overall EDA rate was higher in heatwave days (76.43/100,000 populations/day) than in non-heatwave days (73.66/100,000 populations/day).
Concentrations of all air pollutants had significant statistical difference between heatwave days and nonheatwave days (p < 0.0001). Overall, the air quality was worse during heatwave days than it was in non-heatwave days except for CO (table 2).

RF model
Results from table 4 display the importance ranking of risk factors for the all-age (model 1) and young children (model 2) groups, respectively. In model 1, age was the most important risk factor while in model 2, SES was the most important risk factor for predicting EDA.  For air quality measures, the top risk factors contributing to EDA were PM 2.5 , PM 10 and fires in both models. Heatwaves (EHF) also contributed positively to predicting EDA in both models.

Validation of EDA by RF model
Further validation was performed using the RF model with the 2006-2014 EDA data and predicted EDA for 2015. Construct validation was then conducted by comparing results between the predicted and actual 2015 EDA data (figure 2). Adjusted by the population, the goodness of fit measure R 2 for EDA counts was 0.953, indicating the model fitted the data extremely well. Overall, SES (SEIFA) remained the most important risk factor with the highest ranking, followed by heatwaves (EHF), and air pollutants.

Geographical variation of RF for young children
The impacts of SEIFA, EHF, and PM 2.5 for predicting EDA in different SA3 areas were presented in figure 3. Overall, these three risk factors were relatively more important in the Perth southern suburbs than in other areas for young children, indicating a spatial interaction between them.

Discussion
This study was the first in which a ML approach was used to optimise models for predicting the impact of heatwaves and air quality on EDA and to identify major risk factors using 10 years of large datasets. It learns the pattern from the training data and generalizes models to predict EDA with cross-validations. The optimised model (RF) for EDA prediction was selected from 5 models and a construct validation was performed to verify the findings.

Prediction of EDA by RF model
The results from the RF model indicated that demographic factors such as age and SES were the most important predictors of EDA. The air quality measures were also important predictors, including particulate matters (PM 2.5 and PM 10 ). Fires are an important source of short-term particulate matters air pollution [44][45][46]. Heatwaves positively contributed to the prediction model for EDA.
In the past decade, very few studies used ML approaches to predict environmental factors such as heatwaves [47,48] or air quality [49,50]. Up until now, only one study in China [48] used an RF model and 3 years of data to predict the effect of heatwaves on heatstroke (R 2 = 0.70). But it did not compare different ML algorithms and didn't include air quality as a risk factor. Previous heatwave-related studies mainly used traditional epidemiological methods for prediction such as generalized estimating equations [28], Poisson regression [51,52], and logistic regression models [53]. When there are complexes of independent variables or the method needs to rely on distribution patterns (e.g., autocorrelation or multicollinearity), the methods are limited. Such autocorrelations often exist for daily data on weather and air pollutants. As the RF modeling does not require a specific data distribution pattern, it can be applied to any type of data.
Another advantage of using the RF model is that it can determine the importance ranking of all risk factors in the prediction model. The RF model was built on the outcomes from multiple decision tree models, which can learn the patterns from the training data to predict future events more accurately. In this study, 500 crossvalidations were performed to achieve the best outcome. In traditional epidemiological studies, age was identified as one of the important demographic factors [13,54]. Using RF models, the current study demonstrated that age was the most important risk factor for predicting EDA. Apart from the elderly, children, especially young children are vulnerable to the effects of heatwaves due to their immature organ systems, poor adaptive capacities, and ability to produce more metabolic heat as compared to adults [55,56]. In addition, children have a reduced ability to lose heat through sweating, and this can cause some severe outcomes such as convulsions and disorientation [57].
The SES was identified in this study as the second most important risk factor for predicting EDA in the allage group model and the most important risk factor in the stratified young child model. A few studies observed that SES was a risk factor for heatwave-related mortality [58][59][60][61][62]. For example, a study in Hong Kong [58] used a Poisson model and showed that SES was an important risk factor for estimating the effect of heatwaves on mortality. People living in low socioeconomic districts might influence by the inequalities in intracity urban ambient air ventilation caused by the affordability of air conditioning and building infrastructure conditions. A case crossover study conducted in Adelaide [63] used multivariate logistic regression models and found that living alone (OR = 2.41, 95% CI: 1.32-4.40), with a lower socioeconomic status (OR = 2.10, 95% CI: 1.09-4.04) and no private health insurance (OR = 1.82, 95% CI: 1.05-3.16) increased the risk of heat related hospitalizations. Very few studies assessed the impact of SES on EDA. Using a generalised additive model, one study conducted in Brisbane [64] found that heat-related EDA was increased in most disadvantaged areas compared with the least disadvantaged areas for the 65-74 year age group during heatwaves. The Chinese study [48] used the RF method also found that SES was important for heatstroke prediction.
In the global models, air quality measures ranked high among the risk factors as compared to heatwaves, indicating the importance or sensitivity of air quality to EDA. Among all air pollutants, particulate matter was the most important risk factor. Currently, nearly no studies have used ML approaches to assess the impact of air quality and heatwaves on EDA for vulnerable populations, especially for children. A few epidemiological studies observed the significant adverse effects of heatwaves and poor air quality on health by other methods such as cross-correlation, generalised linear model, and generalised estimating equation model [12,28,65,66]. Our study demonstrated that air quality should be included when estimating the effect of heatwaves on health as air pollution is often worse during a heat-wave event, especially the effect of particulate matter [12,67,68]. Moreover, our study showed that PM 2.5 from landscape fires was an important risk factor for the increased air pollution and EDA. This is consistent with a study conducted in Sydney, which found that air pollution arising from fire led to increased EDA [46]. A recent study by our team on landscape files in Perth also demonstrated that exposure to PM 2.5 during files was associated with increased overall EDA, EDA for cardiovascular diseases, and acute respiratory infection. Our study also found that there were more wildfire events during the summer season, indicating that heatwave was a potential contributing factor to wildfires [44]. During fires, more particulates can be generated. This creates a smog effect, making the air uncomfortable for breathing, which in turn leads to an increase in the use of emergency health services due to an increase in respiratory distress [69].

Prediction of EDA by the GRF model
It is important to consider if risk factors such as climate conditions, social and environmental factors have any geographic variations for targeted intervention [70]. However, including spatial components in the modeling for estimating heatwave impacts on EDA was rarely reported. Although RF is a highly flexible algorithm, it does not account for spatial heterogeneity. The GRF, an extension of the RF algorithm, was recently introduced by Georganos, et al [43] as a novel geographic weighted RF approach to assess local variations of outcomes. To our knowledge, this is the first study using GRF to assess the impact of heatwaves, air quality, and other risk factors on health and to predict geographic variations in the EDA service demand for children in WA.
The goodness of fit test revealed that the GRF model fitted the data relatively well (R 2 = 0.975). The SES and EHF were the two most important risk factors for predicting EDA demand for young children living in Perth's southern suburbs compared to other areas. While the SES remained the most important risk factor, heatwave was the second important risk factor. This is different from the global RF model and indicates that heatwave has strong spatial variations across the study area (figure 3). We also found that there were spatial interactions between EHF, SES, and PM 2.5 for young children. This means the afore-mentioned risk factors contributed more to EDA in the similar locations. Using a Poisson regression model, a study conducted in Shanghai reported that with a 10 μg m −3 increase in concentrations of PM 2.5 (56.3 μg m −3 average daily concentration), the risk of emergency and outpatient visits for cardiovascular disease increased by 0.50% during the warm seasons [71]. However, a study conducted in Sydney did not observe a significant positive relationship between maximum temperatures, PM 2.5 (9.34 μg m −3 average hourly median concentration) and illness (fever and respiratory illness) for children younger than 6 years when the ARIMA time series model was applied [72]. The different outcomes from these two studies may be related to different levels of air pollutants exposure and possibly the sensitivity of their analytic methods.

Strengths and limitations
This study has several strengths. Firstly, EDA was used to assess the acute health effects of heatwaves, which was more sensitive than a mortality indicator. Secondly, a ML approach was used to identify the optimised models ( i.e., RF and GRF) to predict EDA during anomalies in weather and air quality with satisfying results. Thirdly, the gridded method was used for heatwave measurement instead of averaging station-based temperature data or using single-station data estimation. It makes the estimation of heatwave intensities (EHF) more accurate than those traditional measures. Moreover, the ML approach and RF model applied in this study are capable of modeling nonlinearity and interactions among different risk factors, which improves the model's goodness of fit. It also provides useful information on the ranking of the importance of all risk factors. Finally, the GRF model can provide information on local variations of risk factors. Such information can be of great value in prioritizing and allocating resources to communities in at-risk areas more accurately.
There are some limitations of this study. Firstly, this study only explored a few models. Future studies can include more models for assessment. Secondly, the current GRF method is not able to allocate large amounts of time series data directly into the model to assess temporal and spatial modeling simultaneously. Improving such an advanced method will be of great value for future studies. There is no doubt that more studies are needed to confirm the accuracy and reproducibility of the models produced in this study.

Conclusions
This study used ML approaches and confirmed that the RF was the most optimised prediction model for EDA among all models tested. In the GRF model, age, SES, and heatwaves were the top three important risk factors.
Heatwave and air quality have high spatial heterogeneity. Overall, there was a higher spatial interaction between heatwave, air quality and SES in the southern areas of Perth.
Future large-scale studies on heatwave related health assessment are warranted and can apply the ML approach to explore more models and assess geographical variations. As humidity, wind direction and speed can affect the heat load and air quality locally, there is a need to include those measures (if the data is available) in the modeling to identify the importance ranking of them. In addition, the spatial-temporal impact of heatwaves and air quality on EDA should be evaluated when such a method is well established for large datasets.
As heatwaves will continuously increase, public health authorities have identified the emerging need for timely planning to implement interventions and mitigation strategies to reduce heat-related burden on the healthcare system in future. An early and sensitive heatwave warning system should include emergency morbidity indicators such as EDA for the execution of effective public health interventions. The research methods used in this study have the potential to more accurately and timely predict health service demands related to heatwave exposure and prioritize activities for vulnerable populations, including young children in high-risk areas, for managing and mitigating adverse effects from environmental exposure (e.g., heatwaves and air pollutants) and protecting vulnerable populations in high-risk areas. ML approaches and GRF models will have more use in the future for governments as advanced methods to provide evidence for developing and improving policies for better population health.