Comparison of a novel machine learning approach with dynamical downscaling for Australian precipitation

Dynamical downscaling (DD) and machine learning (ML) based techniques have been widely applied to downscale global climate models and reanalyses to a finer spatiotemporal scale, but the relative performance of these two methods remains unclear. We implement an ML regression approach using a multi-layer perceptron (MLP) with a novel loss function to downscale coarse-resolution precipitation from the Bureau of Meteorology Atmospheric high-resolution Regional Reanalysis for Australia from grids of 12–48 km to 5 km, using the Australia Gridded Climate Data observations as the target. A separate MLP is developed for each coarse grid to predict the fine grid values within it, by combining coarse-scale time-varying meteorological variables with fine-scale static surface properties as predictors. The resulting predictions (on out-of-sample test periods) are more accurate than DD in capturing the rainfall climatology, as well as the frequency distribution and spatiotemporal variability of daily precipitation, reducing biases in daily extremes by 15%–85% with 12 km prediction fields. When prediction fields are coarsened, the skill of the MLP decreases—at 24 km relative bias increases by ∼10%, and at 48 km it increases by another ∼4%—but skill remains comparable to or, for some metrics, much better than DD. These results show that ML-based downscaling benefits from higher-resolution driving data but can still improve on DD (and at far less computational cost) when downscaling from a global climate model grid of ∼50 km.


Introduction
Precipitation and its extremes are among the most impact-relevant consequences of a warmer climate (Field et al 2012, Allan et al 2021). General circulation models (GCMs) are the most advanced tools for obtaining global-scale climate change projections of hydroclimate (Giorgi et al 2019), and provide precipitation projections on global and regional scales. Climate projections help improve our understanding of future changes and allow better planning to help mitigate impacts at regional scales. However, while GCMs capture many aspects of large-scale climate change well (Allan et al 2021), simulations of changes in precipitation and precipitation extremes benefit from much finer spatial resolution than GCMs currently employ. Dynamical and statistical downscaling have been used to bridge this resolution gap and provide more spatially detailed projections of future rainfall and rainfall extremes (Laprise 2008, Ekström et al 2015, Giorgi and Gutowski 2015). Dynamical downscaling (DD) uses outputs from GCMs or quasi-observational reanalysis products to drive a higher-resolution regional climate model (RCM). The RCM solves the governing equations of the atmosphere over a limited spatial domain, nudged toward the GCM or reanalysis fields. It is commonly argued that RCMs improve the accuracy of the GCM simulation when compared to in-situ observations (Di Luca et al 2012, 2016, Torma et al 2015, Dosio et al 2019, Di Virgilio et al 2020). However, recent studies have shown that RCM precipitation statistics are highly predictable from those of the driving GCMs (Nishant and Sherwood 2021) and that RCMs generally show larger precipitation biases than the driving GCMs (Nguyen et al 2022). DD is also computationally intensive, particularly as resolutions approach those of convection-permitting simulations (Jacob et al 2020).
Given the expense and questions around the added value from DD, particularly when used for large ensembles of high-resolution downscaling, examining whether other approaches can be employed is timely.
Statistical downscaling is based on data-driven approaches that use empirical relationships to map GCM- or reanalysis-simulated climate variables (predictors) onto finer-scale variables (predictands) (Benestad et al 2008). Statistical downscaling has been implemented in both weather forecasting and climate applications. For example, studies have used regression models to downscale climate predictions of precipitation and temperature from coarse to high resolutions (Stoner et al 2013, Tareghian et al 2013, Thrasher et al 2022) and employed weather generators to simulate an ensemble of different realizations of future climate (Fatichi et al 2013, Keller et al 2017). Traditional statistical downscaling techniques also include perfect prognosis (Schubert 1998) and model output statistics with bias correction (Eden and Widmann 2014).
There have been multiple studies comparing dynamical and statistical downscaling. For example, Schmidli et al (2007) compared six standard statistical (i.e., regression based) downscaling models with three RCMs for predicting historical precipitation over the European Alps. They found similar biases in the statistical models and RCMs, except that the former strongly underestimated the interannual variations and pattern correlations. Alessi and DeGaetano (2021) compared regression-based downscaling and DD for short-term forecasts of temperature, precipitation, and humidity in the northeast United States; RCMs outperformed the statistical models in capturing probability distributions. The most comprehensive intercomparison studies report no overall best approach between statistical downscaling and DD (Casanueva et al 2016, Vaittinada Ayar et al 2016, Gutiérrez et al 2019). These studies suggest that the performance of the downscaling approaches varies with location, time period (historical versus warmer climate), complexity of the landscape, predictors used, variable to be downscaled, scale ratio (i.e., the ratio between the coarse and fine resolutions), evaluation metrics, and application.
Besides the standard statistical methods, machine learning (ML) techniques (Okkan and Inan 2015), genetic programming, and gene expression programming (Hadi et al 2014, Hassanzadeh et al 2014) have been used for downscaling global and regional climate projections. The few intercomparison studies comparing statistical and ML techniques similarly find that neither clearly outperforms the other. For example, Li et al (2020b) compared two ML techniques (support vector machine (SVM) and long short-term memory) with two standard statistical techniques (ensemble mean and multiple linear regression) for downscaling daily temperature over Canada, finding that all four techniques performed similarly with respect to root mean square error (RMSE). Vandal et al (2019) also found no large differences between several statistical and ML methods.
To our knowledge there has not yet been any study comparing ML-based downscaling with DD. ML can handle more complex, nonlinear, and time-varying input-output mappings than standard statistical techniques (von Storch 2000), but can be very sensitive to the choices of implementation and predictors. Precipitation is a highly heterogeneous variable in both time and space and occurs because of complex non-linear interactions between multiple atmospheric, ocean and land components. ML has shown strong prediction potential in some studies (Baño-Medina et al 2021, Rampal et al 2022, Hobeichi et al 2023). Comparing ML-based downscaling and DD for predicting regional-scale precipitation is important, since ML is computationally much cheaper. ML therefore has the potential to allow more global climate predictions, or large ensembles of GCMs, to be downscaled, facilitating improved understanding of how model uncertainty influences the projection of precipitation at finer spatial scales. Ultimately, if ML enables the development of many thousands of downscaled scenarios, the potential exists to bridge climate projections with catastrophe modelling and help inform risk assessments undertaken by the reinsurance industry. Accordingly, using the Australian sub-region as a case study, we examine two important research questions for the projection of precipitation and its extremes. First, to what extent can ML-based downscaling perform as well as, or outperform, DD in capturing precipitation characteristics? Second, how does ML downscaling performance change as the span between coarse and fine scales increases? To do this, we develop ML models to downscale the Bureau of Meteorology Atmospheric high-resolution Regional Reanalysis for Australia (BARRA; Su et al 2019) from 12 km (coarse) resolution to the 5 km (fine) resolution of the gridded observations, i.e., the Australia Gridded Climate Data (AGCD; Jones et al 2009).
We then compare the ML downscaled results with a dynamically downscaled BARRA-SY simulation, which is obtained by running a convection-permitting (1.5 km) downscaling model over a smaller subdomain centred on the Australian capital city of Sydney. To test the ML model performance as the span between coarse and fine scales increases, we coarsen the 12 km BARRA-R to 24 and 48 km respectively and then downscale each of them to the 5 km AGCD grid.

BARRA
BARRA is an atmospheric regional reanalysis over Australia, New Zealand, and Southeast Asia available for 1990-2019, which provides both the driving fields (BARRA-R) and the DD benchmark (BARRA-SY) used here. BARRA-R, at a 12 km resolution over Australia, New Zealand, and the maritime continent, is produced using version 10.2 of the Unified Model (Davies et al 2005). The atmospheric model uses a non-hydrostatic, fully compressible, deep-atmosphere formulation. Its dynamical core solves the equations of motion using mass-conserving, semi-implicit, semi-Lagrangian time-integration methods. BARRA-R is configured with 70 vertical levels extending from near the surface to 80 km above sea level: 50 model levels below 18 km and 20 levels above this. BARRA-R uses the Joint UK Land Environment Simulator (Best et al 2011) and the mass flux convective parameterization scheme of Gregory and Rowntree (1990).
The BARRA-R sequential data assimilation process is initialised using ERA-Interim reanalysis fields (Dee et al 2011). After the initialisation, ERA-Interim is used only for the lateral boundary conditions. Hourly lateral boundary conditions for BARRA-R are interpolated from ERA-Interim's six-hourly analysis fields at 0.75° × 0.75° resolution. BARRA-R assimilates observations from land-surface stations, ships, drifting buoys, aircraft, radiosondes, wind profilers, and satellites, namely retrieved winds, radiances, and bending angles. Before being assimilated, observations are screened to select the best-quality observations, remove duplicates, and reduce data redundancy (Rawlins et al 2007).
BARRA-R is used to drive convection-permitting (1.5 km) downscaling models over smaller subdomains centred over five Australian capital cities. In particular, we use the BARRA-SY domain centred on Sydney as our DD benchmark. BARRA-SY is constrained by BARRA-R at the lateral boundaries using the method of relaxation and blending (Davies et al 2005, Bush et al 2020). The boundary conditions force the development of larger-scale features within the BARRA-SY domain and are meant to ensure the benefits of the BARRA-R reanalysis (i.e., incorporating equivalent-resolution observational data) are inherited by BARRA-SY, wherein the nested model is treated as a physically consistent interpolator of the driving model (Su et al 2021). BARRA-SY is run without convection parameterization and relies on the model dynamics to represent convective motions. (Table 1, not reproduced here, lists the predictors, including 10 m specific humidity and maximum, minimum, and mean temperature at 10 m from BARRA-R, and canopy cover from BARRA-SY.)

AGCD
We use observational precipitation estimates from the AGCD to train the ML models during 1990-2009, and to evaluate the simulated precipitation from both BARRA-SY and the ML models during 2010-2018, thus ensuring out-of-sample evaluation of the ML. The AGCD has a spatial resolution of 5 km and is obtained by applying topography-resolving analysis methods to station data across the Australian continent (Jones et al 2009). Most of these stations are in the more heavily populated coastal areas, with sparser representation inland.

ML downscaling model
We built a separate ML downscaling model for each coarse grid box, which predicts the fine-scale precipitation field within the grid box as observed by AGCD. A multilayer perceptron (MLP) neural network (Gardner et al 1998) was trained to predict the set of collocated fine-scale values based on 25 predictors (table 1). The choice of MLP over other ML techniques (for example, convolutional neural networks and diffusion models) is due to its cost efficiency, speed of training and inference, ability to handle both linear and non-linear data, fewer parameters, and lower risk of overfitting compared to the alternatives. Initially, we developed the MLP with 13 coarse time-varying predictors and 1 fine-scale static predictor. These 13 predictors were selected based on their variable importance in predicting the output of the MLP. We found that increasing the number of fine-scale static predictors from 1 to 4 helps to improve the spatial variability of the predicted grid. A further addition of coarse precipitation from the eight neighbouring grid cells helps to improve the accuracy of the predicted grid. All MLP models were based on the same architecture, consisting of two hidden layers and one output layer (figure 1). The first hidden layer contains 64 perceptrons and takes as input the training data in batches of 3000. The second layer contains 32 perceptrons and is fed by the output of the first hidden layer.
Two transformations occur at each perceptron: a weight is assigned to each input and a weighted sum is calculated, followed by an activation function, which introduces nonlinearity and determines whether the perceptron should be activated, i.e., whether its output should be passed on to the following layer. We used the rectified linear unit (ReLU), where ReLU(x) = max(x, 0). The output layer of the network, which directly outputs a prediction, consists of a single perceptron without an activation function.
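This architecture can be sketched in PyTorch (the package used later to build the models); the layer sizes, ReLU activations, and linear output perceptron follow the description above, while the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

# Per-grid-box MLP sketch: 25 predictors in, hidden layers of 64 and 32
# perceptrons with ReLU activations, and a single output perceptron
# without an activation function.
class DownscalingMLP(nn.Module):
    def __init__(self, n_predictors: int = 25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_predictors, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),  # output layer: one perceptron, no activation
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = DownscalingMLP()
batch = torch.randn(3000, 25)      # one training batch of 3000 samples
print(tuple(model(batch).shape))   # (3000, 1)
```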
The MLP was optimized by minimizing a loss function chosen after testing multiple options. Initially RMSE was tested as the loss function, but this led to underestimation of extreme values. To improve the prediction of both mean and extreme values, a custom combination of RMSE and minimum and maximum error was used to calculate the loss. Multiple tests revealed that coefficients of 0.7, 0.2, and 0.1 for minimum error, maximum error, and RMSE, respectively, provided the best prediction of both mean and extreme precipitation on the test sample. We therefore define the loss between the prediction (pred) and the target (targ) values as given by equation (1):

Loss = 0.7 × MinError(pred, targ) + 0.2 × MaxError(pred, targ) + 0.1 × RMSE(pred, targ). (1)

After each forward pass through the network, the weights and the other parameters of the neural network are adjusted through backpropagation. The learning rate was set to 10^-4 and the adaptive moment estimation (ADAM; Kingma and Ba 2014) algorithm was used to adapt the learning rate for each weight of the MLP during backpropagation. The neural network was optimized through 1000 iterations or epochs. We used the GPU-enabled PyTorch package in Python to build the models. It took ∼16 h to train the 6808 MLPs over the domain using sub-parallel programming on a single GPU, and ∼7 min to predict the fine-scale daily precipitation from the final trained models over a 9 year period throughout the domain.
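A minimal PyTorch sketch of such a loss follows; it assumes the minimum and maximum error terms are the absolute differences between the batch minima and maxima of prediction and target, since the text does not spell out their exact form:

```python
import torch

def custom_loss(pred: torch.Tensor, targ: torch.Tensor) -> torch.Tensor:
    # Weighted combination from equation (1): 0.7 minimum error,
    # 0.2 maximum error, 0.1 RMSE. The min/max error terms here are
    # an assumed formulation (batch-wise absolute differences).
    min_err = torch.abs(pred.min() - targ.min())
    max_err = torch.abs(pred.max() - targ.max())
    rmse = torch.sqrt(torch.mean((pred - targ) ** 2))
    return 0.7 * min_err + 0.2 * max_err + 0.1 * rmse
```

With such a loss, the optimiser step is standard, e.g. `torch.optim.Adam(model.parameters(), lr=1e-4)` as described above.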

Methodology
The ML models ML-12, ML-24 and ML-48 were used to downscale the 12 km, coarsened 24 km, and coarsened 48 km BARRA-R data, respectively, to the 5 km target. For the DD benchmark, BARRA-SY, which is provided at 1.5 km resolution, we re-gridded the data to 5 km using the CDO conservative re-gridding function.
We evaluate the models first by examining their ability to simulate the observed probability density function (PDF) of daily precipitation, quantified using the Perkins skill score (PSS), which measures the overlap between the modelled and observed PDFs:

PSS = Σ_b min(Z_model,b, Z_obs,b), (2)

where Z_model,b and Z_obs,b are the normalised frequencies of modelled and observed daily precipitation in bin b. Next, we examine mean annual precipitation, and four annual extreme precipitation indices based on the daily data: the annual 99th percentile (R99p [mm]), annual maximum (Rx1day [mm]), annual number of days when precipitation is greater than 10 mm (R10mm [days]) and annual maximum number of consecutive days with precipitation less than 0.1 mm (CDD [days]). Each index is computed for each year and grid cell within the domain using all observing times.
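A PDF-overlap score of this kind can be sketched in NumPy, assuming the standard Perkins formulation (the summed bin-wise minimum of the two normalised histograms); the bin choices here are illustrative:

```python
import numpy as np

def perkins_skill_score(model: np.ndarray, obs: np.ndarray,
                        bins: np.ndarray) -> float:
    # Bin both samples on common edges and normalise to frequencies.
    zm, _ = np.histogram(model, bins=bins)
    zo, _ = np.histogram(obs, bins=bins)
    zm = zm / zm.sum()
    zo = zo / zo.sum()
    # PSS is the overlap of the two PDFs: 1 = identical, 0 = disjoint.
    return float(np.minimum(zm, zo).sum())

rng = np.random.default_rng(0)
daily_precip = rng.gamma(shape=0.5, scale=5.0, size=10000)  # synthetic data
bins = np.arange(0.0, 151.0, 1.0)                           # 1 mm bins
print(round(perkins_skill_score(daily_precip, daily_precip, bins), 6))  # 1.0
```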
The main skill metric used here is the relative bias (Nishant et al 2022), which measures the absolute bias relative to the magnitude of the observable. Here, X_ij and X^Observation_ij are the values of one of the above-listed statistics at a single grid point 'ij' from ML/DD and the AGCD observation, respectively:

Relative Bias_ij = |X_ij − X^Observation_ij| / X^Observation_ij × 100. (3)

We also evaluate the models' ability to capture the temporal variability of precipitation at each grid point. For this, we calculate the Pearson correlation coefficient between the time series of the daily mean precipitation, and of the annual mean indices, from a model and observations.
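Both metrics are straightforward to compute; a NumPy sketch with illustrative values:

```python
import numpy as np

def relative_bias(x: np.ndarray, x_obs: np.ndarray) -> np.ndarray:
    # Equation (3): absolute bias as a percentage of the observed
    # magnitude, computed per grid point.
    return np.abs(x - x_obs) / x_obs * 100.0

def temporal_correlation(model_ts: np.ndarray, obs_ts: np.ndarray) -> float:
    # Pearson correlation between model and observed time series.
    return float(np.corrcoef(model_ts, obs_ts)[0, 1])

obs = np.array([100.0, 200.0, 400.0])   # e.g. annual precipitation totals [mm]
mod = np.array([110.0, 180.0, 400.0])
print(relative_bias(mod, obs))          # [10. 10.  0.]
```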

Daily precipitation
Both ML and DD show reasonably good skill (PSS > 0.9) in capturing the daily PDF of precipitation, and the PSS score is similar across DD and all the ML models (figure 2(a)). A slight deterioration in skill is observed with an increase in the span between the coarse and fine scales. For example, ML-24 and ML-48 show slightly smaller PSS values than ML-12 but remain comparable to DD (figure 2(a)).
DD struggles to capture the observed daily temporal variability, with a domain-mean correlation coefficient below 0.3 (figure 2(a)). The ML models perform much better: ML-12 achieves the strongest temporal correlation (∼0.91), while ML-24 and ML-48 show lower skill (∼0.72) (figure 2(a)). The least skilled ML models still perform notably better than DD. Thus, while DD simulates a similarly good PDF of daily precipitation as ML, it does not produce high precipitation on the same days as observed, which ML is largely able to do.

Annual means and extremes of precipitation
DD overestimates mean and extreme precipitation nearly everywhere in the domain, especially in the topographically complex regions along the eastern coastlines (figures 3(a)-(e)). For example, the relative biases in DD are +30%-80% in most regions for Rx1day (figure 3(c)) and +40%-60% for R99p (figure 3(b)). For the other indices, DD shows typical biases of +10%-20%. ML-12 consistently performs better than DD in capturing both means and extremes of precipitation (figures 3(f)-(j)). However, ML shows a typical tendency to predict too many light rain days. This results in a fairly constant negative bias in ML-simulated CDD.
Both ML-24 (figures 3(k)-(o)) and ML-48 (figures 3(p)-(t)) show stronger biases than ML-12 for both means and extremes of precipitation. For mean precipitation the domain-mean biases in ML-24 and ML-48 are typically stronger (∼4%-6%) than DD; however, for extremes like R99p and Rx1day they still have lower biases than DD (figure 2(b)). On average, there is a ∼10% increase in relative bias between ML-12 and ML-24 and a further 2%-3% increase between ML-24 and ML-48 for the majority of the indices (figure 2(b)).

Interannual variability
For the in-sample years, all the ML models closely capture the observed interannual variability of all indices except CDD. For CDD, the ML models struggle to capture extreme years such as 1990, 1995 and 2005 (figure 4), owing to their typical tendency to predict too many light rain days. DD also performs well in capturing the interannual variability of mean precipitation, R10mm and CDD; however, it fails to capture the observed variability of extremes such as R99p and Rx1day. For R99p and R10mm, the performance of DD in the out-of-sample period is also typically worse than in the in-sample period. We note that this must be coincidental, as DD is not trained on any data. ML-12 typically shows better skill than DD in capturing the interannual variability of both means and extremes of precipitation for all out-of-sample years. ML-24 and ML-48 also show similar performance to ML-12 for all the indices except mean precipitation and R10mm.
The interannual correlation of DD for the annual mean (figure 2(c)) shows a marked improvement compared to daily precipitation; however, the correlations for extremes (R99p, CDD and R10mm) remain weak (figure 2(c)). All the ML models show strong temporal correlations for all the indices, with a gradual decrease in skill for ML-24 and ML-48 as before. There is a ∼11% decrease in temporal correlation between ML-12 and ML-24 and a further ∼4% decrease between ML-24 and ML-48 for most indices (figure 2(c)). Although these decreases narrow the gap between the ML models and DD, the temporal correlations in ML remain stronger than in DD for all the indices.

Conclusion and discussion
We developed an ML-based approach for downscaling precipitation to match an observational product (AGCD) as closely as possible. A separate ML model is developed for each coarse grid to predict the fine-scale variability within that grid. Each ML model combines local coarse time-varying climate variables and fine-scale static land-surface properties as predictors, thus enabling it to learn the specific relationship between coarse and fine scales for each coarse grid. We then compared the results with DD over the same domain, testing only in an out-of-sample period (not used to train the ML models). ML outperforms the BARRA-SY Australian DD product in capturing the AGCD daily precipitation distribution, and the spatial and temporal variability of mean and extreme precipitation. ML performs better with input data of higher resolution (here 12 km), and the performance gradually decreases as the input data become coarser (here 24 and 48 km). However, the deterioration in skill is relatively minor: the coarsest-resolution input data exhibit only an additional ∼12%-13% bias over the finest-resolution input data for both means and extremes of precipitation. As the grid becomes coarser, the information provided by the static variables becomes increasingly insufficient to reflect the heterogeneity within the coarse grid box, suggesting that additional static variables may be required when training on coarser-resolution input data.
These results have important implications for future global and regional climate modelling, as they suggest that downscaling high-resolution (typically 50 km) GCMs can be accomplished directly using ML techniques. Our results also highlight that some problems in RCMs can be overcome, such as their tendency to overestimate the intensity and temporal persistence of precipitation. These problems have been reported in past studies, which have shown that DD models amplify the systematic biases of the driving data (Xu et al 2018, Nishant et al 2022). For example, Rauscher et al (2010) evaluated RCMs at 25 and 50 km resolution over Europe and found that precipitation intensity and the number of dry days increase with resolution. Recent studies using convection-permitting RCMs run at 3-5 km resolution have also shown that high-resolution models can amplify biases in precipitation (Li et al 2020a, Kim et al 2022). The ML models developed in this study were consistently closer to observations and used significantly less computational resources than DD approaches. However, there are limitations to these ML models. First, the ML approach only allows the downscaling of one climate variable at a time, unless multiple variables can be predicted using the same predictors. So, for downscaling a new climate variable that requires a different set of predictors, a new ML model needs to be developed and trained. Second, it is assumed that the relationship between the predictands and predictors is consistent over time (Sunyer et al 2012), so system nonstationarity could become problematic. While we have not tested ML predictions for a warmer climate, recent work has shown that ML can accurately reproduce the observed climate in the historical period and show a similar trajectory into the future (Baño-Medina et al 2022). The ability of ML to closely reproduce interannual variations in extreme indices is also encouraging.
Third, the observational data used in this study have uncertainties, especially in topographically complex regions. Chubb et al (2016) compared AGCD data against a spatially dense, independent gauge network in the Snowy Mountains region of the BARRA-SY domain. Their results suggested that the AGCD data underestimated the precipitation amount by about 15% over these regions. They attributed this dry bias to a lack of stations in the area needed to represent the precipitation climatology empirically, and to the inability of the AGCD gridding to account for the steep topography exposed to the prevailing winds. Since our ML model is trained on the observed data, it is likely to inherit any biases in those data.
Nonetheless, the greater accuracy of the ML model relative to the DD model shows that ML techniques are an efficient alternative to DD for coarse-to-high-resolution downscaling of important climate variables. This study examined only one DD product; additional comparisons with other DD methods would be warranted. With ML downscaled projections being computationally efficient and more accurate than their dynamically downscaled counterparts, it may be possible to better meet societal needs for high-resolution, high-quality climate information for impact assessments.

Data availability statements
BARRA data that support the findings of this study can be accessed by emailing helpdesk.reanalysis@bom.gov.au and requesting access to NCI (National Computational Infrastructure) Gadi project data cj37. More details about the BARRA dataset can be found in the 'How to get access' section of the following website: www.bom.gov.au/research/projects/reanalysis/. Data from AGCD are available freely at the Bureau of Meteorology website www.bom.gov.au/climate/data/. The dataset is also available in the NCI Gadi project zv2. Details on how to access the data can be found at http://climate-cms.wikis.unsw.edu.au/AGCD.

Author contributions
Nidhi Nishant conducted all the analyses and led the writing of the paper. All the other co-authors assisted in interpretation and writing the paper.