Downscaling ERA5 wind speed data: a machine learning approach considering topographic influences

Energy system modeling and analysis can provide comprehensive guidelines to integrate renewable energy sources into the energy system. Modeling renewable energy potential, such as wind energy, typically involves the use of wind speed time series in the modeling process. One of the most widely utilized datasets in this regard is ERA5, which provides global meteorological information. Despite its broad coverage, the coarse spatial resolution of ERA5 data presents challenges in examining local-scale effects on energy systems, such as battery storage for small-scale wind farms or community energy systems. In this study, we introduce a robust statistical downscaling approach that utilizes a machine learning approach to improve the resolution of ERA5 wind speed data from around 31 km × 31 km to 1 km × 1 km. To ensure optimal results, a comprehensive preprocessing step is performed to classify regions into three classes based on the quality of ERA5 wind speed estimates. Subsequently, a regression method is applied to each class to downscale the ERA5 wind speed time series by considering the relationship between ERA5 data, observations from weather stations, and topographic metrics. Our results indicate that this approach significantly improves the performance of ERA5 wind speed data in complex terrain. To ensure the effectiveness and robustness of our approach, we also perform thorough evaluations by comparing our results with the reference dataset COSMO-REA6 and validating with independent datasets.


Introduction
Renewable energy technologies play a crucial role in mitigating climate change impacts. As the world shifts towards a more sustainable future, many nations have already started the transition toward renewable energy sources [1,2]. In energy system models, values such as installable capacity and power generation potentials can be calculated from meteorological data, such as wind speed [3,4]. The most frequently used meteorological dataset is reanalysis data [5,6]. Reanalysis data can provide meteorological data in a spatially and temporally consistent way [7]. However, its low spatial resolution impedes its ability to capture small-scale details, which are crucial for the accurate simulation of local-scale energy systems [8]. Over the years, in-depth studies of local effects and the development of local storage models have drawn more and more attention in the energy system modeling community, which cannot be achieved through the use of coarse-resolution reanalysis data. Therefore, to capture local effects in energy system models, a dataset with high spatial resolution is necessary.
Many efforts have contributed to the improvement of the spatial resolution of reanalysis data, with downscaling being the most widely adopted technique. Downscaling is a method for obtaining highresolution climate or climate change information from global climate models [9]. Downscaling can be categorized into dynamic downscaling and statistical downscaling. Dynamic downscaling refers to the use of high-resolution regional climate models to dynamically extrapolate the impacts of large-scale climate processes to a regional or local scale of interest [10]. It provides individual variables that are physically consistent in time and space and is internally consistent across different variables [11]. However, the complexity of performing large-scale calculations of various physical and chemical equations can be limiting and challenging due to the large amount of computational resources required.
On the other hand, statistical downscaling offers a computationally efficient solution by analyzing statistical relationships without considering complex physical and chemical processes. For example, Curry et al [12] investigated the statistical correlations between climate forecast variables and reanalysis data to derive monthly Weibull distribution parameters. Kirchmeier et al [13] and González-Aparicio et al [14] used vector generalized linear model to predict a daily-varying probability density function of local wind speeds conditioned on large-scale daily wind speed predictors. Other notable works in this field include [15][16][17]. These studies are focused on downscaling the probability distribution parameters of wind speed, which proves beneficial in quantifying the range of local-scale wind speeds and calculating energy output. However, for the purpose of energy system modeling, time series data are often required and cannot be derived from wind speed probability distributions.
Meanwhile, other studies developed statistical downscaling methods in different directions. For instance, Monahan [18] downscaled monthly wind speed time series at the buoy locations by using multiple linear regression. Jung et al [19] employed a least square boosting approach to downscale the monthly extreme wind speed value for North America and Europe. Winstral et al [20] developed an optimization scheme to downscale the wind speed time series in Switzerland taking the local terrain structure into consideration. These and other studies such as [21][22][23] are explicitly focused on the acquisition of high spatial resolution time series data, rather than the probability distribution parameters. However, many studies only developed site-specific corrections or aimed at obtaining daily or even monthly wind speed time series, limiting spatial and temporal capabilities when applied to an energy system model.
In recent years, the emergence of wind atlas platforms has significantly enhanced the acquisition of high spatial resolution wind speed data. Alongside various national and regional wind atlases in Europe [24], two prominent and widely recognized atlases, namely Global Wind Atlas [25] and the New European Wind Atlas [26], have gained recognition for their ability to provide high spatial resolution and open-access estimations of wind characteristics at specific locations. While these wind atlases offer highresolution maps of the wind climate, such as longterm averaged wind speed and variability, they do not provide wind speed time series data for specific time spans. To address this limitation, researchers have explored the use of wind atlases to bias-adjust reanalysis data. One common approach involves scaling the reanalysis time series data to align with the longterm averaged wind speed provided by the wind atlas [27][28][29]. However, this method has a notable drawback: the resulting time series tends to be excessively smooth when compared to point measurements. As a consequence, it fails to accurately capture the significant fluctuations in wind speed commonly observed in measurements.
To overcome these limitations, we propose a machine learning-based statistical downscaling method to improve the spatial resolution of ERA5 wind speed time series data to around 1 km × 1 km. This method offers robust and computationally convenient solutions with simple input requirements by considering the importance of topographic conditions in local wind speed estimates. By investigating the relationship between large-scale wind speed, local-scale wind speed, and topographical metrics, our approach can improve the quality of reanalysis data for specific regions, thereby providing wind speed time series data at high spatial resolution. This enables us the capability to perform a close examination of local-scale energy systems, such as battery storage systems for small-scale wind farms or decentralized community energy systems.

Local-scale observation data
In this study, we use two observation datasets: MeteoSwiss observations for Switzerland and German Meteorological Service (DWD) observations for Germany. MeteoSwiss observations are used for training, testing, and cross-dataset validation of our machine learning model, while the DWD observations representing an independent dataset are exclusively used for cross-dataset validation.

MeteoSwiss observation data
The MeteoSwiss observations are collected from the website of the Federal Office of Meteorology and Climatology of Switzerland [30]. This dataset comprises 10-meter hourly wind speed measurements gathered from weather stations distributed throughout Switzerland. In total, there are more than 150 measurement stations. However, stations with missing values accounting for over 10% of the data are excluded. The final dataset contains measurements from 116 weather stations for the years 2017, 2018, and 2019. Of these years, only observations from 2018 are used to develop our model, while measurements from 2017 and 2019 are used for cross-dataset validation. The distribution of the MeteoSwiss weather stations used in this study is presented in figure 1(a).

DWD observation data
The DWD (German Meteorological Service) observations, accessible from the DWD Open Data Server, also provide 10-meter hourly wind speed observations from over 500 weather stations [31]. As the focus of this study is on topographic influence, offshore weather stations are excluded. In addition, stations with missing values exceeding 10% of the data are also excluded. The final dataset contains measurements from 272 stations in 2018. Figure 1(b) shows the distribution of these DWD weather stations.

Large-scale reanalysis data
Reanalysis data are gridded datasets that represent the atmosphere state, incorporating observations and outputs of numerical weather prediction models from past to present-day [32]. ERA5 is a widely used reanalysis dataset due to its extensive temporal and spatial coverage [8,[33][34][35]. Moreover, it also provides more than 200 other variables, some of which are topography-related and have significant impacts on wind speeds. Leveraging the advantages of ERA5, we employ ERA5 as the source datasets for our downscaling model development, where the wind speed and wind direction time series are calculated from the '10 meters U-component of wind' and '10 meters Vcomponent of wind' in ERA5. In addition to ERA5, we also utilize COSMO-REA6 [36], another reanalysis dataset, as the reference dataset for result comparison. Compared to ERA5, COSMO-REA6 has a higher spatial resolution, approximately 6 km × 6 km, offering valuable insights when assessing the performance of our results. We then employ nearestneighbor interpolation to identify the nearest grid point based on the latitude and longitude of each weather station to obtain continuous time series data.

Topographic metrics
To thoroughly examine the effect of topography on local-scale wind speed, we calculate and analyze six crucial topographic metrics: elevation, slope, aspect, small and large-scale topographic position index (TPI), and terrain diversity index (TDI). These metrics offer insight into the complexities of the terrain and provide a comprehensive picture of its impact on wind dynamics [20,34,37].
These topographic metrics can all be derived from a Digital Elevation Model (DEM). The DEM used in this study is Global Land One-Kilometer Base Elevation (GLOBE), which has a spatial resolution of 0.0083 degrees (around 1 km × 1 km) [38]. Among these topographic metrics, elevation data can be directly retrieved from DEM. Slope can be calculated by dividing the vertical change in elevation by the horizontal distance [39]. Aspect is the orientation of slope, measured clockwise in degrees from 0 to 360 [40]. TPI, first proposed by Weiss in 2001 [41], provides a broad description of the elevation of a particular location relative to its surroundings. The TPI value of a point is a measure of its relative elevation, determined by subtracting the average elevation of all pixels within a specified radius from the elevation of the point. In this study, a radius of 75 km and 5 km TPI are calculated respectively to better capture both major landscape units and smaller local features.
In addition to these widely recognized and established topographic metrics, we introduce an index called TDI. It quantifies the topographic variety of an area by computing the ratio of the range of elevations to the mean elevation as indicated in equation (1), thus reflecting the diversity of the terrain. In our study, we employ an 11 km radius window for TDI calculation. The value of TDI serves as an indicator of the degree of topographic diversity in a given region. A higher TDI value signifies a greater range of elevations, thereby implying a more intricate topography. Conversely, a lower TDI implies a more uniform landscape. The maps of these topographic metrics for Switzerland are provided in figure 2.

Methods
Our statistical downscaling approach involves a regression analysis aimed at establishing the relationship between ERA5, observed data, and topographic metrics. Before this process, we conduct a data preprocessing step that classifies the study region into various categories based on the aforementioned topographic metrics, thereby ensuring a thorough consideration of the terrain impact.

Data preprocessing
When comparing ERA5 with observations, we discover that the biases of ERA5 vary based on the site location. More specifically, in mountainous areas or valleys, ERA5 tends to have large biases, whereas, in plain areas, the biases are usually small. This is consistent with other studies [8,20,42]. To examine the influence of topography on the quality of ERA5, we first employ the method proposed by Winstral et al [20] who utilized TPI to assess the performance of COSMO-REA6. They observed that COSMO-REA6 overestimates wind speeds in regions with low TPI values and underestimates wind speeds in regions with high TPI values. However, our findings differ in the case of ERA5, where we observe that ERA5 often underestimates wind speed even for regions with low TPI values. This discrepancy could be due to the coarser spatial resolution of ERA5 compared to COSMO-REA6 and the insufficiency of relying on a single topographic metric such as TPI.
To investigate the impact of various topographic metrics on the accuracy of ERA5 and predict the potential quality of ERA5 for any given region solely based on the topographic conditions, we propose a preprocessing step. This involves integrating multiple topographic metrics, as outlined in section 2.3, into a random forest classification model implemented using the scikit-learn python package [43].
Initially, all weather stations are classified into three classes based on the root mean square error (RMSE) value between observations and ERA5, as indicated in table 1. These classes are used as the target variables in the classification process. Additionally, we calculate elevation, slope, aspect, TPI (with 5 km and 75 km radii), and TDI for each station and include them as input features for the classification process. It is worth noting that due to the limited sample size of MeteoSwiss stations, the classification process is performed for both MeteoSwiss and DWD weather stations. The distribution of the stations across different classes is illustrated in figure 3. To assess the model performance, a data split of 80% training data (310 stations) and 20% testing data (78 stations) is applied.

Regression of wind speed time series
After the preprocessing step, a regression analysis is performed for each class to investigate the correlations between input features and the target variable. In our case, input features include both large-scale ERA5 data and local-scale topographic metrics. These large-scale ERA5 data are timedependent and provided as time series, including ERA5 wind speed, ERA5 wind direction, and gravity wave dissipation (GWD). The local-scale topographic metrics in our study are time-independent and provided as constants, including 5 km and 75 km Radius TPI. And the target variable is observed wind speed, which is also a time-dependent variable and provided as time series. Including GWD as an input feature in our analysis serves a specific purpose. GWD is the cumulative conversion of kinetic energy into thermal energy in the mean flow over the entire atmospheric column per unit area, which is due to the effects of stress associated with lowlevel, orographic blocking and orographic gravity waves [7]. Incorporating GWD can provide additional insights into the impact of unresolved valleys, hills, and mountains at a scale between 5 km and the ERA5 grid. This added dimension provides a better understanding of the topography influences on local-scale wind speed.
Before being introduced to the machine learning model, all predictors are normalized to set the features on a common scale. The regression process is then implemented using the eXtreme gradient boosting (XGBoost) algorithm, a scalable and optimized tree-boosting framework [44]. XGBoost offers great accuracy, scalability, and efficient handling of missing data, while its regularization capabilities can prevent overfitting [44]. Data for both input features and target variables are collected for all MeteoSwiss stations in 2018. Similar to the preprocessing step, all time series data are split into 70% for training and 30% for testing. To estimate the performance of the regression model, we use four crucial statistical metrics: RMSE, Pearson Correlation Coefficient (PCC), R 2 score, and Kolmogorov-Smirnov D statistic (KSD). KSD can quantify the degree of matching between two distributions, with D = 0 indicating a perfect match.

Preprocessing results
In the preprocessing phase, our random forest classification model yields a model accuracy of 0.76 for the entire testing data set. However, when specifically considering Class3 stations, the model accuracy increased to 0.90. Figure 4 compares the relative significance of each input feature. A comprehensive map of the region class predictions in Europe is presented in figure 5. This map is generated by applying our random forest classification model to regions where no observation data is available.

Regression results
To determine if a machine learning model is prone to overfitting, it is important to compare its performance on training and testing datasets, as a comparable error in both datasets suggests that the model is generalized. Therefore, we compare the statistical metrics for both training and testing datasets as summarized in table 2. Meanwhile,  table 3 showcases the improvements resulting from the regression process. To visually demonstrate these improvements, figure 6 presents scatter plots and histograms for a randomly selected station in each class. Meanwhile, the extent to which various input features impact the target variables is demonstrated in figure 7.

Cross-dataset validation
To verify the robustness of our regression model, we conduct a cross-dataset validation scheme in two ways. Firstly, we apply the regression model to all MeteoSwiss weather stations but using data from different years. Secondly, we test the model for DWD observations.

Cross-dataset validation across different years
To acquire the downscaled wind speed time series for MeteoSwiss observation stations across multiple years, the time-dependent input features are first adjusted for the current year prior to being fed into the regression model. The results of this crossvalidation are summarized in table 4. To provide a visual representation of these results, scatter plots and histograms for three representative stations are shown in figure 8, and time series plots are displayed in figure 9. Due to limited data availability in COSMO-REA6 after August 2019, the comparison for the year 2019 is focused on the first 8 months.

Cross-dataset validation across different locations
To perform cross-validation with DWD observation stations, the input features are first computed at DWD station locations before being fed into the regression model. The results are presented in table 5. To delve deeper into the improvement observed in Class3 stations, table 6 presents a comparison of statistic metrics for all Class3 DWD stations. For a more visual representation of the results, figures 10 and 11 present scatter plots, histograms, and time series plots for selected Class 3 stations.

Discussion
In the preprocessing step, we investigated the impact of various topographic metrics on the accuracy of ERA5. Our findings reveal that all of the topographic metrics considered in this study play a crucial role in determining the performance of ERA5.        The model accuracy of the random forest classification indicates a strong performance overall, especially in Class3 region. Additionally, we present a map of the region classes in Europe, which reveals that most of Europe falls under Class1 regions with relatively highquality ERA5 data. Conversely, Class3 regions, which indicate complex topographic conditions, occupy a small fraction of the total area, primarily surrounding  In the regression process, our results indicate that the regression model is well-fitted, as evidenced by the small differences in statistic metrics between training and testing datasets. The regression results reveal that our model can significantly improve the quality of ERA5, particularly in Class3 regions. Furthermore, the improvement in KSD suggests that this downscaling process not only decreases statistical error but also maintains a high degree of wind speed variability. The feature importance analysis indicates that TPI and GWD play a crucial role in estimating local-scale wind speed in Class3 regions. These topographic metrics serve as important indicators of the topographic conditions. Conversely, Class2 and Class1 regions, which are predominantly composed of flat terrain, are less affected by topographic conditions, making the topographic metrics less significant. From the cross-dataset validation with MeteoSwiss observations across various years, we can observe a large degree of improvement, particularly in Class3 stations. This demonstration of generalizability suggests that our model can effectively apply to unseen datasets and is applicable across different years. However, when extending our model to Germany, the results reveal a marked improvement only for Class3 stations. The corrected wind speeds in Class1 and Class2 stations show a larger bias compared to ERA5 and COSMO-REA6. One possible explanation is that the topographic influences in these regions are so minimal that considering topographic-related features in a machine-learning regression model would result in inaccurate predictions. Therefore, we strongly recommend limiting the use of our regression process to Class3 regions alone. This will prevent potential errors in Class1 or  For all these stations, the stronger overlap between the corrected wind speed and observation data also suggests a significant reduction in biases and a better agreement with the observations for the corrected wind speed, highlighting the robustness of our model when applied to independent datasets. Class2 regions while significantly reducing computational efforts. However, to better correct wind speeds in Class1 and Class2 regions, one possible approach is to increase the amount of training data for the regression model. Given that our regression model is solely trained for MeteoSwiss data, incorporating more observations from Class1 and Class2 stations in diverse regions could lead to improved outcomes. Another possibility is to incorporate topographyindependent features into the model, such as weather regimes and air pressure. These features, as previously noted in studies such as [45,46], play a more significant role in affecting the quality of reanalysis data in Class1 and Class2 regions, and should be considered accordingly.
Furthermore, it is important to note that our study focuses on correcting wind speeds at 10 meters height, as the availability of wind speed observations at higher heights is limited. However, the same model can be applied to downscale wind speeds at higher heights once a sufficient amount of observational data becomes available.

Conclusion
In summary, this study highlights the crucial role played by topographic conditions in determining the spatially disparate quality of ERA5 wind speed data. Complex terrain regions, as characterized as Class3 regions through preprocessing step, show the highest degree of inaccuracies in the ERA5 wind speed data. To downscale ERA5 to higher spatial resolution, we apply a machine learning-based regression model to interpret the relationship between the large-scale ERA5 data, local-scale observations, and topographic metrics. The robustness of the regression model is evaluated by comparing its output against measurement data across different years and locations. The results demonstrate that while the method results in considerable improvement in ERA5 data quality in Class3 regions, the improvement is not as pronounced in Class1 and Class2 regions, which already have better ERA5 quality. By utilizing this method, ERA5 wind speed data can be downscaled from a spatial resolution of 31 km × 31 km to as fine as 1 km × 1 km, depending on the resolution of DEM used.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https:// 10.5281/zenodo.8100208.