Can socio-economic indicators of vulnerability help predict spatial variations in the duration and severity of power outages due to tropical cyclones?

Tropical cyclones are the leading cause of major power outages in the U.S., and their effects can be devastating for communities. However, few studies have holistically examined the degree to which socio-economic variables can explain spatial variations in disruptions and reveal potential inequities thereof. Here, we apply machine learning techniques to analyze 20 tropical cyclones and predict county-level outage duration and percentage of customers losing power using a comprehensive set of weather, environmental, and socio-economic factors. Our models are able to accurately predict these outage response variables, but after controlling for the effects of weather conditions and environmental factors in the models, we find the effects of socio-economic variables to be largely immaterial. However, county-level data could be overlooking effects of socio-economic disparities taking place at more granular spatial scales, and we must remain aware of the fact that when faced with similar outage events, socio-economically vulnerable communities will still find it more difficult to cope with disruptions compared to less vulnerable ones.


Introduction
Power outages resulting from tropical cyclones (e.g. tropical storms, tropical depressions, and hurricanes) can be highly disruptive to households, communities, and regional economies (Guikema and Nateghi 2018, Feng et al 2022). Perhaps more importantly, socio-economically vulnerable populations tend to suffer disproportionately from natural disasters (Moreno and Shaw 2019). While much research has been dedicated to predicting power outage disruptions due to tropical cyclones based on storm characteristics, comparatively less work has been devoted to analyzing how socio-economic conditions can explain spatial variations in disruptions (Best et al 2022).
In this context, power outage risks (e.g. duration and severity of outages) are determined by the tropical cyclone hazard itself and by a community's resilience (or vulnerability) to such an event (Best et al 2022). The term hazard pertains to the characteristics of the tropical cyclone itself (e.g. wind speeds, trajectory, etc), and we quantify such characteristics with weather-related variables (Birkmann 2007, IPCC 2022). Generally, resilience describes a system's ability to mitigate, resist, and/or recover from harm, and vulnerability refers to a system's susceptibility to harm (Cutter and Emrich 2006). Barring minor theoretical distinctions between the two concepts, from a practical standpoint and in our context, both describe a community's ability to reduce the impacts of power outages (Bakkensen and Mendelsohn 2016). As is typical in the risk analysis literature, we use socio-economic variables to measure an area's resilience (and vulnerability) to these disasters (Cutter et al 2010).
Anecdotal evidence and recent case studies suggest that after controlling for weather and environmental (i.e. geographical) factors, socio-economic variables can reveal inequities in the outage duration and severity experienced by communities as a result of tropical cyclones (Mitsova et al 2021). Our goal was to holistically examine such trends on a large scale and attempt to quantify the effects of socio-economic factors when predicting outage duration and severity via statistical models. We accomplish this task by applying various machine learning techniques to a comprehensive set of historical weather data, environmental factors, and socio-economic variables that span a range of tropical cyclones impacting the continental U.S. from 2015 to 2019. Our analyses allow us to quantify complex interactions that take place between these factors and better understand dynamics contributing to power outage risks for communities.

Background
Tropical cyclones, hurricanes in particular, are the leading cause of major power outages across the U.S. (Alemazkoor et al 2020, Feng et al 2022). For example, Hurricane Sandy (2012) left 8.5 million customers along the U.S. eastern seaboard without power (Guikema and Nateghi 2018). Additionally, Hurricanes Katrina (2005) and Gustav (2008) resulted in 90% of customers losing power along portions of the Atlantic Gulf Coast (Alemazkoor et al 2020, Best et al 2022). Major power outages such as these can cause irrevocable social and economic harm to communities (Moreno and Shaw 2019, Anderson et al 2020). Exacerbating concerns, the frequency and intensity of disruptions are expected to worsen under future climate scenarios (Do et al 2023).
In an attempt to mitigate impacts of power outages from tropical cyclones, much research has been dedicated to developing statistical models that accurately predict the extent and duration of disruptions prior to landfall (Nateghi et al 2011, McRoberts et al 2018, Alemazkoor et al 2020). Such models can help utility companies prepare for restoration efforts and assist government agencies with relief and evacuation planning (Guikema and Nateghi 2018, Alemazkoor et al 2020). Prominent early studies include Liu et al (2005) and Han et al (2009), who used various forms of parametric regression to forecast the expected number of power outages using weather and environmental factors. Later, Guikema et al (2014), Nateghi et al (2014), and McRoberts et al (2018) extended these approaches using non-parametric techniques; we rely heavily on these studies as baselines for our analysis.
Until recently, few studies have tried to quantify the effects that socio-economic factors have on predicting power outage durations or severities as a result of tropical cyclones. This notion is important because during outages, utility companies are presumed to focus restoration efforts in areas that reach the greatest number of people and/or contain vital services (Xu et al 2007). However, there is ultimately a degree of subjectivity in these decisions, so conscious or unconscious biases could manifest as prejudicial treatment for different communities (Best et al 2022). Moreover, as socio-economically vulnerable populations tend to have fewer and lower-quality resources, these communities may be more susceptible to outages simply because they have outdated and/or inferior infrastructure (Xu et al 2007).
Recent case studies suggest as much. Mitsova et al (2018) found through regression analysis that higher percentages of minority groups, disabled populations, and unemployed residents in Florida counties predicted longer disruptions after controlling for proximity to Hurricane Irma (2017). Their work was later extended via multi-level modeling approaches to incorporate household-level data and revealed that a lack of insurance and reduced access to healthcare were also significant predictors of outage duration (Mitsova et al 2021). Similarly, Ulak et al (2018) determined that census tracts in Tallahassee with higher percentages of elderly residents (65+) experienced a greater relative number of customers without power following Hurricane Hermine (2016). Lastly, Best et al (2022) used spatial autoregressive models to show that after controlling for weather conditions and percentage of customers without power, lower levels of median income predicted longer recovery times for census tracts in Louisiana following Hurricane Isaac (2012).

Contributions
Tropical cyclones are the leading cause of major power outages in the U.S., and their effects can be devastating for communities (Alemazkoor et al 2020). Researchers have developed statistical models that accurately predict outage duration and extent by considering storm characteristics and environmental factors of impacted regions (Nateghi et al 2011, McRoberts et al 2018). However, only a few case studies have tried to measure the degree to which socio-economic variables can explain spatial variations in outcomes and reveal potential inequities thereof (Mitsova et al 2021, Best et al 2022). Our goal was to quantify trends from these studies on a more generalized dataset that introduces variability in tropical cyclone characteristics, environmental factors, and socio-economic conditions. In doing so, we make several contributions to the literature.
First, while prior studies include only a handful of weather and socio-economic variables, we feature extensive sets of both types of factors. We use spatially and temporally granular data to understand and account for exact weather conditions that were present during the outages. Similarly, we include over 130 established socio-economic indicators of community vulnerability in our analysis to help ensure we capture any effects due to socio-economic disparities (Johnson et al 2020).
Second and related, we feature a wide range of tropical cyclones. Previous studies examining effects of socio-economic variables predicting spatial variations in disruptions focus on individual hurricanes (i.e. they are case studies). Here, our data include 20 tropical cyclones that span five years (2015-2019), 397 counties, and a variety of storm severities (i.e. tropical storms, depressions, and hurricanes). This broad range of data allows us to develop models and insights that are more generalizable.
Lastly, we utilize machine learning techniques that account for both linear and non-linear interactions between factors. The aforementioned case studies assumed linear relationships between weather conditions, social factors, and outage responses. Due to the complex nature of natural hazards interacting with communities and power outage risks, we surmise that linear assumptions do not sufficiently capture these effects. To explore this notion, we compare results and insights from linear regression analyses with those from more sophisticated, non-parametric approaches.

Methods
Our study consists of two main tasks. First, we compile a county-level repository of data, consisting of power outage information, weather conditions, environmental factors, and socio-economic indicators of vulnerability. Second, we use this dataset to develop statistical models that predict outage duration and maximum percentage of customers without power. We use these models to examine the effects that socio-economic factors have in explaining outcomes after controlling for weather conditions and environmental factors.

Data
The primary dataset in our analyses features spatiotemporally joined power outages and weather conditions present during a tropical cyclone. We obtain power outage data from the U.S. Department of Energy's EAGLE-I repository (US Department of Energy 2023). Data for the number of customers without power are available at fifteen-minute intervals at the county level. We obtain weather-related data from the Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2) (NASA 2023). The MERRA-2 data contain hourly atmospheric conditions at a spatial resolution of approximately 50 km. Table 1 shows a list of the exact weather variables used in our analysis. We then interpolate these gridded variables to the county level and spatiotemporally merge them with the outages.
Next, we use the National Oceanic and Atmospheric Administration (NOAA) hazards database to identify and retain only those outages that coincide with tropical cyclones (NOAA 2023). Then, for each county experiencing outages during a tropical cyclone, we tabulate the maximum number of customers impacted, normalize this metric on a per capita basis, and sum the total length of the disruption. Similarly, we calculate summary statistics for the weather variables present in each county during the tropical cyclone (table 1). The result is a processed outage and weather dataset defined at the county-cyclone level, similar to prior studies (McRoberts et al 2018). In other words, each row of data corresponds to a county that experienced a power outage event due to a given tropical cyclone. Ultimately, our power outage and weather data include over 830 county-cyclone data points that feature 20 different tropical cyclones from 2015 to 2019.
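The aggregation from fifteen-minute records to county-cyclone rows can be illustrated with a minimal sketch. This is not the authors' processing script: the function name, the tuple layout of the records, and the `denominators` lookup used for normalization are hypothetical, and duration is assumed here to be the count of intervals with at least one customer out.

```python
from collections import defaultdict

INTERVAL_HOURS = 0.25  # one record every fifteen minutes


def summarize_outages(records, denominators):
    """Collapse interval records into one row per (county, cyclone).

    records: iterable of (county, cyclone, customers_out) tuples (hypothetical layout).
    denominators: dict mapping county -> normalizing count (assumed lookup).
    """
    peak = defaultdict(int)           # maximum customers out per county-cyclone
    intervals_out = defaultdict(int)  # number of intervals with any outage
    for county, cyclone, customers_out in records:
        key = (county, cyclone)
        peak[key] = max(peak[key], customers_out)
        if customers_out > 0:
            intervals_out[key] += 1

    summary = {}
    for key, max_out in peak.items():
        county = key[0]
        summary[key] = {
            "max_pct_out": 100.0 * max_out / denominators[county],
            "duration_hours": intervals_out[key] * INTERVAL_HOURS,
        }
    return summary
```

Each key of the returned dictionary corresponds to one county-cyclone data point, carrying the two response variables modeled later.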
Environmental factors used in our analysis include topographic information, root-zone growing depths, land-use characteristics, and drought indices; we base these factors on spatially generalizable hurricane outage prediction models discussed in the Background section (Nateghi et al 2014, McRoberts et al 2018). For topographic variables, we transform digital elevation maps (DEMs) into county-level statistics for mean, median, standard deviation, maximum, and minimum elevation (Earth Resources Observation and Science (EROS) Center 2017). Similarly, we use the Gridded Soil Survey Geographic (gSSURGO) Database to derive the mean, median, and mode of root-zone growing depths for each county (USDA 2022). For land-use characteristics, we convert National Land Cover Database (NLCD) raster images into percentages by area that each land class occupies for a county (e.g. percent 'Forest' or 'Developed') (Wickham et al 2021). Lastly, we convert gridded Standardized Precipitation Index (SPI) data into county-level averages of 3-month, 6-month, 12-month, and 24-month measures (NOAA 2022).
We also feature an exhaustive set of socio-economic factors in our analysis. Johnson et al (2020) previously compiled and analyzed 130 socio-economic variables derived from an established list of community vulnerability and resilience indices pertaining to natural disasters: Baseline Resilience Index for Communities (BRIC) (Cutter et al 2010), Community Disaster Resilience Index (CDRI) (Peacock et al 2010), Community Resilience Index (CRI) (Sherrieb et al 2010), Resilience Capacity Index (RCI) (Foster 2012), Social Vulnerability Index (SoVI) (Cutter et al 2003), and Social Vulnerability Index (SVI) (Flanagan et al 2011). Example indicators include median housing value, percent of population below the poverty line, percent elderly populations, and percent female populations. Because the Johnson et al (2020) data are available at the county level, we directly incorporate these metrics in our analysis.
Finally, we combine all the data at the county level and end up with 830 data points (county-cyclone level) and 165 explanatory variables (table 1). The supplementary materials include the aggregated dataset and all scripts necessary for processing the data.

Machine learning
From this dataset, we develop statistical models to predict outage duration and maximum percentage of customers without power for each county-by-cyclone data point. We use these two metrics as response variables because they are regularly used to evaluate power outage risks due to tropical cyclones (Guikema et al 2014, McRoberts et al 2018).
As shown in table 1, explanatory factors consist of 165 variables grouped into three main categories: weather, environmental, and socio-economic. As expected, several of these variables are highly correlated with others, so we process the data via correlation analyses to reduce multicollinearity and end up with 149 predictors whose pairwise correlations fall below our threshold (r < 0.875).
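One way such a correlation filter could work is sketched below. The greedy keep-first rule, the use of the absolute Pearson coefficient, and the function names are assumptions made for illustration; the paper does not state which variable of a correlated pair is retained.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.875):
    """Greedy filter over a dict of name -> values.

    A column is kept only if |r| with every previously kept column
    is below the threshold (keep-first rule, an assumption).
    """
    kept = []
    for name in columns:  # dicts preserve insertion order
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Because the rule is order-dependent, listing the variables one most wants to retain first (for example, the primary weather variables) determines which member of a correlated pair survives.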
With this reduced dataset, we then employ multiple linear regression (MLR), eXtreme gradient boosting (XGBoost), and Bayesian additive regression tree (BART) models to predict power outage duration and maximum percentage of customers impacted (Chipman et al 2010, Chen and Guestrin 2016). We use these particular approaches for several reasons. First, we want to include a mix of both linear and non-parametric models to see how the effects of predictors, especially those of socio-economic variables, differ based on assumptions of linearity. We use MLR for the linear models because previous studies examining effects of socio-economic variables predicting outages featured some form of MLR. Second, BART models have historically produced the most accurate predictions of power outage risks due to tropical cyclones when hazard characteristics and environmental factors are included as explanatory factors (Nateghi et al 2011). Lastly, XGBoost algorithms tend to perform very well in other applications of risk analysis (Chen and Guestrin 2016). We train our models on 80% of the data, using standard 5-fold cross-validation to tune the non-parametric approaches, and then use the remaining 20% of the data for testing.
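The 80/20 split with 5-fold cross-validation on the training portion can be sketched as index bookkeeping. The random seed, the shuffling scheme, and the function names below are illustrative assumptions, not details taken from the supplementary scripts.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a fraction for final testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation
```

With 830 rows this leaves 664 rows for training and tuning and 166 for testing; hyperparameters would be chosen by averaging validation error over the five folds before a single evaluation on the held-out rows.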
We also fit random forest models because they had been prominently featured in previous studies (Guikema et al 2014, McRoberts et al 2018). However, the random forest models consistently under-performed relative to the BART and XGBoost models and did not provide any additional insights. As such, for the sake of brevity and simplicity, we do not discuss the results of the random forest models, although they are included in the various R scripts within the supplementary materials.
Additionally, we did not explicitly account for spatial autocorrelation in any of our models. This choice was due to our belief that a comprehensive set of explanatory predictors would sufficiently account for all spatial variation in outages not attributable to random noise. To confirm this notion, we used semivariograms to examine the spatial autocorrelation of residuals and ensure our models satisfied assumptions of spatial independence; these diagnostics are also available in the supplementary materials (S1-assumptions.pdf).
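For readers unfamiliar with the diagnostic, a minimal sketch of an empirical semivariogram of residuals follows. It assumes planar (projected) coordinates and fixed-width distance bins; both choices, and the function name, are illustrative rather than taken from the authors' scripts.

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, residuals, bin_width):
    """Return {bin_index: semivariance}.

    Bin i covers pair distances in [i * bin_width, (i + 1) * bin_width).
    coords are (x, y) tuples in a planar projection (assumption).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(coords[i], coords[j])
            b = int(distance // bin_width)
            sums[b] += (residuals[i] - residuals[j]) ** 2
            counts[b] += 1
    # Semivariance is half the mean squared difference within each bin.
    return {b: sums[b] / (2 * counts[b]) for b in sorted(counts)}
```

A semivariogram that is roughly flat across distance bins is consistent with spatially independent residuals, whereas semivariance that rises with distance before levelling off would indicate residual spatial autocorrelation.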

Results
As seen in figure 1, our models, in particular the BART and XGBoost algorithms, are able to accurately predict county-level outage duration and maximum percentage of customers without power. Moreover, residuals satisfy assumptions of normality, homoscedasticity, and spatial independence (supplementary materials, S1-assumptions.pdf). Note that we log-transformed the response variables because residuals of the pre-transformed models did not satisfy assumptions of normality or homoscedasticity.
Overall, the weather-related variables are the most important factors across linear and non-parametric approaches (figures 2 and 3). Here, variable importance is defined as the relative gain in accuracy a feature contributes to the model (XGBoost) and as the absolute t-value corresponding to a feature (MLR) (Friedman 2001, Kapelner and Bleich 2016). We do not depict the variable importance of the BART models because it is similar to that of the XGBoost models.
As seen in figures 2 and 3, the socio-economic effects are much more prominent in the linear models compared to the non-parametric approaches. However, there are two items of note here. First, the linear models are less accurate than the non-parametric approaches, so we have commensurately less confidence in insights stemming from the former. Second, we believe the presence of multiple socio-economic factors in the MLR models is primarily due to these models needing more and potentially extraneous terms to account for complex, non-linear interactions between variables (Friedman 2001).
To investigate this notion, we extended the MLR models using elastic net (eNet) regression to incorporate L1 and L2 regularizing terms. If certain socio-economic effects were truly important, the regularization techniques would help distinguish them. However, we found that the eNet models still produced a multitude of non-trivial socio-economic effects, and the magnitude and sign of the coefficients as well as the features themselves were highly unstable and dependent on the L1 and L2 parameters. This analysis can be found within the R scripts in the supplementary materials.
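For reference, one common parameterization of the elastic net objective is given below. It follows the convention of the glmnet R package; whether the supplementary scripts use this exact form is an assumption. Here $\lambda \ge 0$ sets the overall penalty strength and $\alpha \in [0, 1]$ mixes the L1 (lasso, $\alpha = 1$) and L2 (ridge, $\alpha = 0$) terms.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
  + \lambda \left[ \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right]
```

Under this form, the instability described above would appear as the set of non-zero coefficients, and their signs, changing as $\lambda$ and $\alpha$ are varied.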
Moreover, partial dependence plots (PDPs) of the most important factors in the non-parametric approaches reveal fundamentally non-linear relationships between predictors and responses (figure 4). PDPs depict the marginal effect a feature has on the response while averaging over the effects of the other explanatory variables (Friedman 2001). Based on the results in figure 4, these behaviors would need to be modeled as non-linear functions (i.e. defined as splines or polynomial functions in parametric models or learned from decision-tree ensembles or neural networks) and should not simply be assumed to be linear, as is the case with MLR.
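The computation behind a PDP can be sketched in a few lines: for each grid value, the feature of interest is overwritten in every row and the model's predictions are averaged. The `toy_model` below is a hypothetical stand-in with a threshold response to wind speed, not a fitted model from this study, and the feature names are illustrative.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value.

    predict: callable taking a dict of feature values and returning a number.
    rows: list of dicts, one per observation.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the data are not mutated
            modified[feature] = value  # override the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical response: no effect of wind below 20, linear above it."""
    return max(0.0, row["wind"] - 20.0) + 0.1 * row["elev"]
```

For the toy model, the resulting curve is flat below the threshold and rises above it, which is the kind of shape a single linear coefficient cannot represent.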
Lastly, we retrained all models on a simplified set of socio-economic variables. Johnson et al (2020) performed exploratory factor analysis on the 130 socio-economic variables and found that 50 of them were highly correlated and could be loaded onto five main factors of vulnerability, which they referred to as (1) wealth, (2) poverty, (3) agencies per capita, (4) elderly populations, and (5) non-English speaking populations. If the socio-economic effects revealed in the previous MLR models were truly important (figures 2 and 3), then either model predictions using this simplified set of variables (i.e. five factors instead of 130 variables) should suffer, or these five factors should be prominently featured in results and map to previously important variables.
However, as shown in figure 5, neither occurs. Both the linear and non-linear models actually perform better: there are fewer extraneous terms to sift through in the case of the non-parametric approaches and fewer total factors to potentially over-fit the training data in the case of MLR. Additionally, the effects of all but one socio-economic factor, Factor4: Elderly Populations, are now effectively immaterial in the MLR models; based on Johnson et al (2020), only one variable featured toward the bottom of figure 2, percentage of population receiving social security benefits (Qssben), loads onto this factor.

Discussion
Contrary to expectations, we find that socio-economic indicators of vulnerability do not meaningfully predict county-level outage duration and maximum percentage of customers without power due to tropical cyclones after sufficiently controlling for weather conditions and environmental factors. This finding is anecdotally counter-intuitive and ostensibly contradicts previous case studies that have investigated relationships between socio-economic variables and tropical cyclone induced power outages (Mitsova et al 2018, 2021, Ulak et al 2018, Best et al 2022). However, upon closer examination, we are able to reconcile many of these discrepancies.
First, as discussed in the Introduction, it has been shown that effects of socio-economic factors tend to diminish as more information about a tropical cyclone hazard is incorporated in statistical models (Mitsova et al 2018, Best et al 2022). Thanks to MERRA-2, we have granular weather data that allow us to account for detailed weather conditions present during the outages and explain most of the spatial variability in outcomes. As such, weather variables are still the most important factors in predicting outage duration and extent in our analysis, but there is less remaining variability to be attributed to other, less relevant factors.
Second and related, the XGBoost and BART models allow us to incorporate non-linear interactions between all the variables. To the best of our knowledge, all previous studies that have examined effects of socio-economic variables in predicting tropical cyclone power outage risks have assumed linear relationships between predictors and responses. As seen in figures 2 and 3, socio-economic factors appear to be important in the original MLR models, but this occurrence is likely due to the MLR models needing extraneous terms to account for fundamentally non-linear dynamics. As such, we recommend that researchers use a mix of both parametric and non-parametric approaches in related studies to obtain a more holistic view of key drivers of power outage risks.
Additionally, several of the studies that have found socio-economic factors to be significant predictors of outage risks have done so at more granular spatial scales than the county-level scale featured in our analysis. These studies explored outages at the zip-code, census-tract, or even neighborhood level (Azad and Ghandehari 2021, Best et al 2022, Lee et al 2022). Nelson et al (2015) demonstrated how socio-economic indicators of vulnerability measured at the county level can often fail to capture disparities taking place at more granular levels. As such, we believe that our county-level analyses may be overlooking some of these more granular details. The obvious problem is that extensive outage and weather data for a similar, holistic machine learning exercise at a smaller scale are difficult to find. We see a clear need and opportunity for developing comprehensive datasets pertaining to power outage risks at more granular spatial scales. Other related enhancements could involve incorporating physics-based outage risk models for more granular insights and/or including power-grid related factors as predictive variables themselves in the statistical models (Feng et al 2022).
Perhaps most important, we should also be aware of the fact that our models do not account for the subjective experiences of communities impacted by power outages. Our models show that after controlling for weather and environmental factors, socio-economic variables are largely irrelevant when explaining spatial variations in outage duration and percent of customers impacted at the county level. In other words, given a similar storm, rich counties and poor counties will be subjected to similar outages. However, the poor counties will still subjectively experience worse outcomes; for example, they will have a more difficult time replacing food or medicine that spoiled during the outages and/or relocating to other areas (Moreno and Shaw 2019). This phenomenon is a blind spot in our statistical analysis, and as such, we recommend that more research be done to help quantify the subjective impacts on communities in hopes of achieving more equitable distributions of power outage risks.

Conclusion
Few studies have holistically examined the degree to which socio-economic variables can explain spatial variations in outage duration and severity and reveal potential inequities thereof (Mitsova et al 2018, Best et al 2022). Here, we use machine learning techniques to predict county-level outage duration and maximum percentage of customers losing power as a result of 20 different tropical cyclones impacting the continental U.S. between the years 2015 and 2019. Our models, in particular BART and XGBoost algorithms, are able to accurately predict these outcomes. However, contrary to expectations, we find the effects of socio-economic variables to be largely immaterial after controlling for weather and environmental factors. That said, we must remain cognizant of the fact that even if communities suffer similar disruptions from a given tropical cyclone, socio-economically vulnerable communities will still find it more difficult to cope with impacts compared to less vulnerable ones.

Table 1. Summary of explanatory variables.