Uncertainty estimation of machine learning spatial precipitation predictions from satellite data

Merging satellite and gauge data with machine learning produces high-resolution precipitation datasets, but uncertainty estimates are often missing. We addressed the gap of how to optimally provide such estimates by benchmarking six algorithms, mostly novel even for the more general task of quantifying predictive uncertainty in spatial prediction settings. On 15 years of monthly data from over the contiguous United States (CONUS), we compared quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machine (LightGBM), and quantile regression neural networks (QRNN). Their ability to issue predictive precipitation quantiles at nine quantile levels (0.025, 0.050, 0.100, 0.250, 0.500, 0.750, 0.900, 0.950, 0.975), approximating the full probability distribution, was evaluated using quantile scoring functions and the quantile scoring rule. Predictors at a site were nearby values from two satellite precipitation retrievals, namely PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals), and the site's elevation. The dependent variable was the monthly mean gauge precipitation. With respect to QR, LightGBM showed improved performance in terms of the quantile scoring rule by 11.10%, also surpassing QRF (7.96%), GRF (7.44%), GBM (4.64%) and QRNN (1.73%). Notably, LightGBM outperformed all random forest variants, the current standard in spatial prediction with machine learning. To conclude, we propose a suite of machine learning algorithms for estimating uncertainty in spatial data prediction, supported with a formal evaluation framework based on scoring functions and scoring rules.


Introduction
Merging satellite-based data and gauge observations can produce accurate precipitation datasets with fine spatial resolution (Hu et al. 2019, Abdollahipour et al. 2022).This is necessary because (a) gauges offer high accuracy, but spatially dense gauge networks are expensive, and (b) satellites provide affordable fine-resolution data but with lower accuracy.
Machine learning algorithms (Hastie et al. 2009, James et al. 2013, Efron and Hastie 2016) often serve as tools for merging satellite-based and gauge precipitation data.This procedure, essentially a regression problem, utilizes gauge measurements as the dependent variable and satellite data as predictor variables.After training a regression algorithm on these paired data points, precipitation can be predicted at any location within the study area and, in this way, a more accurate gridded dataset can be produced A characteristic of most of the relevant studies is that they issue point predictions of precipitation at any point in space (Hengl et al. 2018).While this approach corrects the inherent inaccuracies of satellite data using sparse gauge information, the resulting predictions still hold some degree of uncertainty that needs to be quantified.Therefore, entire predictive probability distributions or adequate approximations of them should be preferred when possible (Gneiting and Raftery 2007) due to the larger amount of information that they offer, thereby allowing better decision making than point predictions alone.Such predictive uncertainty estimates (i.e., probabilistic predictions) appear more frequently in other fields (e.g., in Rodrigues and Pereira 2020, Kasraei et  Quantile regression algorithms can provide quantile predictions, allowing them to approximate a predictive probability distribution.Comparisons between quantile regression algorithms rely on quantile scoring (loss) functions and quantile scores, which are well-suited metrics for evaluating quantile predictions.Quantile scoring functions offer a consistent assessment of quantile predictions (Gneiting 2011), while quantile scores serve as proper scoring rules for assessing probabilistic predictions in the form of quantile predictions at multiple quantile levels (Gneiting and Raftery 2007).The formal definitions of consistency and propriety for these metrics are provided in Section 4.3.For now, it suffices to understand that a consistent scoring (loss) function for a point prediction (e.g., a quantile prediction) incentivizes modellers to make attentive and truthful assessments of those predictions (Gneiting 2011).Similarly, a proper scoring rule for a probabilistic prediction encourages modellers to make accurate and honest assessments of the predicted probability distribution (Gneiting and Raftery 2007).
The comparison had to be of large scale according to the guidelines by Boulesteix et al. (2018) to ensure that its output would be as general as possible for the problem of interest.This translates to using data that cover a large spatial region and a large time period, and to including several learners and a large number of predictors in the experiments.Notably, such large-scale comparisons are rare in the literature that merges data from satellites and ground-based gauges (Papacharalampous et al. 2023a).
The remaining article is arranged as follows: Section 2 defines the problem setting.
Section 3 provides important information about the learners that were extensively compared by this work for the general task of estimating predictive uncertainty in blending precipitation observations from satellites and ground-based gauges.Section 4 presents the dataset and procedures used for the comparison.Section 5 presents the large-scale results.Section 6 discusses several outputs of the comparison under the spectrum that the current literature provides and proposes ideas for future investigations.Section 7 presents the concluding remarks.To ensure the reproducibility of the methods and tests of this work, Appendix A is devoted to statistical software information.This latter information could be used alongside with the parameter settings described in Section 3.

Spatial prediction setting and validation method
In this section, we define the problem solved, while detailed descriptions of its components follow in Sections 3 and 4.

Spatial prediction setting
Figure 1 illustrates how the spatial prediction problem was formulated for facilitating the experiments of this work.The gauge-measured total monthly precipitation at a geographical location of interest (e.g., the location of the station represented in Figure 1) was the dependent variable.Grid points are locations of satellite-based precipitation data.to the station (8 predictor variables), as well as 8 predictor variables from the IMERG dataset and the station's elevation (1 predictor variable).In total, each sample included the dependent variable value and the values of 17 predictor variables.

Predictions at spatial points in the form of quantiles
In a regression setting, a trained quantile regression algorithm can issue a prediction of a quantile ( ) of the probability distribution of a predicted random variable at a prespecified level , where ( ) is defined by: Assume that one is interested in predicting the median of the prediction, i.e., the (0.5) quantile.Then, the assessment of two algorithms that can predict medians can be done with the absolute error scoring function averaged over the test set (i.e., using the mean absolute error, MAE).
Returning to our problem, let us assume that we receive the directive to predict monthly precipitation quantiles at multiple levels other than 0.500 at a point in space.By doing so, it is possible to approximate the probability distribution of the prediction at any point, because we can compute the predictor variables (satellite precipitation of four closest points, distances of the four points and the point's elevation) at any point in space.
Given that quantile prediction can be done using multiple quantile regression algorithms, the natural question is which algorithm to choose.That requires a formal assessment which is based on three components: -Existence of ground-based measurements for training and testing the algorithms.
-Use of quantile scoring functions as metrics for the assessment of quantile predictions.
Quantile scoring functions are generalizations of the absolute error scoring function and are suitable for assessing quantile predictions (details on quantile scoring functions can be found in Section 4.3).
-If the full probabilistic predictions need to be assessed, then the comparison should be based on quantile scores that are proper scoring rules, i.e., suitable to assess probabilistic predictions in the form of multiple quantiles (details on quantile scores can be found in Section 4.3).
Here, it is important to note that: -The proposed framework does not study the problem of the discrepancies between satellite-based datasets and gauge-based measurements.It assumes that gauge-based measurements are correct (free of measurement errors) and represent the ground truth.
-It estimates a probability distribution of a prediction, in the sense that predictions should be provided in the form of a probability distribution.
-Since multiple algorithms can do this, we can select the most appropriate based on a metric.In our case, the implemented metrics were quantile scoring functions and quantile scoring rules.
-If one algorithm outperforms the others in the test set, it is natural to prefer this algorithm for predictions at any point in space for the specific dataset.

Problem definition
Based on the previous descriptions, the problem was defined as follows: a. Let PRi,j denote the monthly precipitation at the gauge i in the time point (i.e., month) j.
b. Compute the elevation ELi of the gauge i. f.Define a quantile regression problem at quantile level α, where PRi,j is the dependent variable and the remaining 17 variables of SAMPLEi,j are the predictor variables.
g. Test multiple quantile regression algorithms in a five-fold cross validation setting using the 91 623 samples.
h.The best algorithm, according to the pre-specified metric (quantile scoring function or quantile score), can be retrained in the full sample and, then, it will be able to predict quantiles at level α, at any point in space.Please note that those predictions are quantiles of the probability distribution of the prediction.Consequently, if the procedure is repeated for multiple quantile levels, we will have an approximation of the full probability distribution of the prediction at any point.

Learners for uncertainty estimation
We recall from Section 2 that the dataset comprised 91 623 samples, with each sample encompassing the values of the dependent variable and 17 predictor variables, and that the assessment is a five-fold cross-validation setting.The six quantile regression algorithms described in this section were assessed in the study.Recall also that, once a quantile regression algorithm is trained, then it can predict quantiles at multiple prespecified quantile levels.Default software implementation parameters (where applicable) were selected.It is well known that random forests (and their variants) perform well with default hyperparameter values.For the remaining algorithms (boosting and neural networks), optimization to force performance to the limits was out of scope, while those algorithms perform also well with default software hyperparameters.
3  ).In this work, it was implemented with decision trees as base learners.The maximum tree depth, the maximum number of leaves in one tree, the minimal number of data in one leaf, the shrinkage rate, the number of boosting iterations (number of trees), the feature fraction that is randomly selected on each iteration (tree), the data fraction that is randomly selected without resampling on each iteration (tree), the minimal gain to perform split and the number of threads were equal to 10, 500, 200, 0.05, 400, 0.75, 0.75, 0 and 10, respectively.These are parameter values that were previously identified as optimal in  Bilinear interpolation was applied to the IMERG grid extracted to transform it to the CMORPH0.25 grid with a spatial resolution of 0.25 degree x 0.25 degree.The latter IMERG grid was used in the experiments of this work, together with the PERSIANN grid extracted.
Additionally, total monthly precipitation data was obtained by aggregating the daily data for conforming to requirements of the same experiments.

Elevation data
Because of its well-recognised usefulness as a predictor for many hydrometeorological variables (Xiong et al. 2022), elevation was estimated at the locations of the ground-based gauges (which are depicted in Figure 2).The estimation was based on the Amazon Web Services (AWS) Terrain Tiles.The latter are available online in the following address (accessed on 2022-09-25): https://registry.opendata.aws/terrain-tiles.

Quantile levels and quantile crossing handling
A good approximation of the predictive probability distribution can be acquired by predicting quantiles of this distribution at multiple levels.In this study, the quantile levels α ∈ {0.025, 0.050, 0.100, 0.250, 0.500, 0.750, 0.900, 0.950, 0.975} were investigated.We preferred to predict quantiles at the tails and, consequently, most of the results refer to the ability of the algorithms to model the probability mass at the tails of the probability density function.Results could be different if we were interested on the central mass of the probability density function.Negative predicted quantiles at the lowest quantile level were censored to zero.Additionally, quantile crossing was handled by setting, separately for each pair {sample, learner}, any predictive quantile that was smaller than the predictive quantile of the immediate lower quantile level equal to this latter predictive quantile.

Performance comparison
To facilitate the comparison of the six learners (see Section 3) with respect to their performance at the nine quantile levels (see Section 4.2), the quantile scoring function was first computed.This function is defined as where is the realization of the precipitation process, is the respective prediction (predictive quantile) at level and ( ) is defined as where ( ) is the indicator function.This function is equal to 1 when the event realizes.
Otherwise, it is equal to 0.
The quantile scoring function is asymmetric around the realization , thereby allowing to predict quantiles (an illustration of the quantile scoring function for the case of = 0 can be found in Tyralis and Papacharalampous 2024).Setting = 1/2 in Equation ( 2), we get / ( , ) which is the absolute error scoring function up to a multiplicative constant.
The absolute error scoring function is symmetric with respect to the realization y and can be used to rank predictions of the median.The quantile scoring function is strictly consistent for the quantile at level (Gneiting 2011), i.e., it has the following property: where ℱ is the family of probability distributions of the random variable and $ is the actual value of the predictive quantile.Strict consistency for the quantile scoring functions means that its expectation is minimized when the prediction attains the true quantile value $.
A performance criterion takes the form where , and , , / ∈ {1, … , *} are, respectively, the prediction and observation of the i th data sample and * is the number of data samples included in the test dataset.Obviously, this performance criterion was computed separately for each learning algorithm.
Quantile prediction skills ) ,34566 were also computed for each learning algorithm and the entire dataset according to the equation ) ,34566 ∶= 1 − ) ,6789:79 / ) ,;7:<=>894 , where the simplest learner (i.e., QR) is used as benchmark.The quantile prediction skill takes values between −∞ and 1.The larger it is, the better the predictive quantiles of a learner on average compared to the predictive quantiles of the benchmark.The benchmark method and the learner are equivalent with respect to their performance when ) ,34566 = 0.
Quantile prediction skills were additionally computed for each pair {learner, station} to examine how the relative performance of the learners varies from station to station for each quantile level.
Moreover, the quantile scoring functions at the various quantile levels (computed as stated above using Equation ( 2)) were summed up for each pair {case, learner}.The sum of quantile scoring functions for quantile predictions , … , -at levels , … , -defines the quantile scoring rule @: @( , … , -; ) ∶= ∑ B ( , which is proper (Gneiting and Raftery 2007).In particular, the expectation of a proper scoring rule is minimized when the prediction of the probability distribution attains its true value.Thus, proper scoring rules incentivize modellers to report the true predictive probability distribution.
The values of the quantile scoring rule at each point acquired through this procedure were subsequently averaged across all the cases (points) in the dataset, separately for each learner, thereby giving average scores.Then, following an equation similar to Equation ( 7), the latter scores were used to compute quantile scoring rule skills (Gneiting and Raftery 2007).Similar to the quantile prediction skill, the quantile scoring rule skill takes values from −∞ to 1 and the larger it is, the better the predictive performance with respect to the benchmark.Quantile scoring rule skills were also computed for each pair {learner, station} to examine how the relative performance of the learners varies from station to station for the entire approximation of the predictive probability distribution.
Lastly, the reliability of the learners was examined by computing sample coverages.
For this test, the frequency with which each predictive quantile is larger or equal to its corresponding observation was computed for each pair {learner, quantile level}.The closer a frequency to its nominal value, the more reliable the predictive quantiles.

Predictor variable importance comparison
Explainable machine learning can support, among others, predictor variable importance comparisons.For conducting such a comparison for the problem of interest in this study, LightGBM (see Section 3.5) was used.More precisely, the total gain in splits of each predictor variable (Shi et al. 2023) was computed through this learner for the entire dataset.Based on the respective results, the predictors were ranked from the most to the least important ones according to the following rule: The larger the total gain, the larger the importance.

Performance comparison
Figure 3 presents statistics that allow us to evaluate and compare the performance of the learners in the task of estimating predictive uncertainty while blending precipitation data from satellites and ground-based gauges.In terms of sample coverage (see Figure 3a), all the learners performed at least adequately well in the large-scale experiment, with the best of them being QRF and GRF for quantile levels {0.025, 0.050, 0.100, 0.250}, QR and QRNN for the quantile level 0.500, and QR, GBM and QRNN for the remaining quantile levels.
The goal of probabilistic prediction is to maximize sharpness subject to calibration (Gneiting and Raftery 2007).Nice coverage can indicate nice calibration, thereby offering some insight into the absolute performance of the learners.Given that all algorithms seem well calibrated (i.e., they have good coverages), the one that maximized sharpness should be preferred.Sharpness is a property of the predictions.As a representative intuitive example, sharpness of prediction intervals can be characterized by their width, where narrower intervals are preferable.Scoring rules provide a single metric encompassing both calibration and sharpness, enabling the honest ranking of probabilistic predictions.
According to the quantile prediction skill (see Figure 3b), the best-performing learner is LightGBM at all quantile levels.According to the same skill, QRF, GRF, GBM, QRNN and QR are, respectively, ranked second, third, fourth, fifth and sixth for each of the quantile levels.The same ranking is obtained based on the quantile prediction skill (see Figure 3c), which is appropriate for evaluating the quality of an ensemble of predictions at multiple quantile levels (with these predictions approximating an entire predictive probability distribution).Compared to QR, LightGBM showed improved performance in terms of the quantile scoring rule by 11.10%, surpassing QRF (7.96%), GRF (7.44%), GBM (4.64%), and QRNN (1.73%).The larger the quantile prediction skill (and the darker the colour on the respective heatmap), the better the quantile predictions on average compared to the quantile predictions of the benchmark learner (i.e., quantile regression).For interpreting (c): The larger the quantile scoring rule skill, the better the probabilistic predictions on average compared to the probabilistic predictions of the benchmark learner (i.e., quantile regression).
Figure 4 allows us to assess to which degree the quantile prediction skill can vary in space for the various learners and the various quantile levels, while Figure 5 allows a similar assessment but for the quantile scoring rule skill (which provides a proper assessment for the total of the quantile levels).Both scores were found to vary a lot from station to station.Also notably, good and bad performances with respect to the benchmark tended to appear at the same stations for multiple methods.Therefore, although on average LightGBM outperformed the other algorithms, that does not imply that it was uniformly best, i.e., the best at all individual stations.(e) quantile regression neural networks at the various quantile levels for 200 arbitrary stations.The larger the quantile prediction skill (and the darker blue the colour on the heatmaps), the better the quantile predictions on average compared to the quantile predictions of the benchmark learner (i.e., quantile regression).
Figure 5. Quantile scoring rule skill of the learners for the same stations as in Figure 4.
The larger the quantile scoring rule skill (and the darker blue the colour on the heatmap), the better the probabilistic predictions on average compared to the probabilistic predictions of the benchmark learner (i.e., quantile regression).

Predictor variable importance comparison
Figure 6 facilitates a comparison of the predictors with respect to their importance for the various quantile levels in the task of estimating predictive uncertainty while blending precipitation data from satellites and ground-based gauges.The respective results are largely homogenous across quantile levels, with IMERG value 1 and 2 having been identified as the most and the second most important predictors by the LightGBM algorithm.The remaining IMERG values and the PERSIANN values were most often ranked before the distances with few exceptions, and the elevation of the station is also ranked third or fourth for the quantile levels {0.250, 0.500, 0.750, 0.900, 0.950, 0.975} and eighth or ninth for the remaining quantile levels.
Figure 6.Ranking of the predictors from the most (1 st ) to the least (17 th ) important ones at the various quantile levels based on the gain statistic computed through the light gradient boosting machine algorithm by exploiting the entire dataset.The darker the colour on the heatmap, the more important the predictor variable.

Discussion
The large-scale benchmark experiment of this work suggests that LightGBM is the bestperforming learner for uncertainty quantification in blending precipitation data from satellites and ground-based gauges at the monthly temporal scale, leaving QRF, GRF, GBM, QRNN and QR behind, for the case of CONUS.In our view, this is an important finding, given also that LightGBM has at least similar or considerably lower computation requirements than other sophisticated learners for uncertainty quantification and has been proven useful when the interest is in extreme events (Tyralis et al. 2023a).Existing research favours random forests for spatial prediction, but we expect a shift towards boosting methods.
Most algorithms outperformed the linear QR, likely due to their increased flexibility, which leads to better performance on large datasets.A potential drawback limiting the performance of QRNN is the redundancy of predictor variables, particularly the repeated distance variables throughout the sample.This redundancy can be largely mitigated by using tree-based algorithms like QRF and GRF.However, these methods face limitations in extrapolation beyond the training data and tend to focus on the central region of the predictive density function.Consequently, the ability of LightGBM to handle these aspects naturally explains its superior performance.Notably, the results might differ in different spatial settings or with smaller datasets, where simpler methods like linear approaches may be more favourable.
Furthermore, the feature importance tests showed that satellite precipitation variables are more important predictors than the elevation at a point of interest, which is a more important predictor than distances between a point of interest and satellite grid points.
The same tests indicated that IMERG provides more useful precipitation information compared to PERSIANN.This outcome agrees with results by Papacharalampous et al.
(2023c), a study on satellite precipitation product correction that did not involve uncertainty quantification.
Future research could focus on potential benefits that could stem from the application

Summary and conclusions
This is the first general study on the use of machine learning for the important, yet rarely performed, task of estimating predictive uncertainty in merging precipitation data from satellites and ground-based gauges.The concept of quantile regression was extensively investigated for the first time for fulfilling this task.To provide useful guidance on how this concept can be exploited, a large-scale comparison was conducted between quantile regression (QR), quantile regression forests (QRF), generalized random forests (GRF), gradient boosting machines (GBM), light gradient boosting machine (LightGBM) and quantile regression neural networks (QRNN).The comparison took place at the monthly temporal scale using 15-year-long data that span across the contiguous United States.
al. 2021, Cui et al. 2022) and in other hydrological disciplines (e.g., in Weerts et al. 2011, Tareghian and Rasmussen 2013, Papacharalampous et al. 2020, Tyralis and Papacharalampous 2021b, Tyralis et al. 2023b), but have only been provided by a few studies for the topic of remote sensing of precipitation (Bhuiyan et al. 2018, Zhang et al. 2022, Glawion et al. 2023 and Tyralis et al. 2023a).Still, each of these studies uses a single algorithm and the one by Tyralis et al. (2023a) additionally focuses on extreme quantiles of the predictive probability distribution.Therefore, the benefits that machine learning can bring to the general task of predictive uncertainty quantification in merging satellite and gauge-measured precipitation data have not been sufficiently explored so far.The aim of this work was to fill in this specific gap for advancing the state of the machine learning-driven applications of predictive uncertainty quantification in merging precipitation observations from satellites and ground-based gauges.This was achieved though extensively investigating the machine learning concept of quantile regression for fulfilling this task.Indeed, as the various quantile regression algorithms exhibit different relative performances at different prediction problems (see diverging results from other fields, for example, in Zhang et al. 2018, He et al. 2019, Sesia and Candès 2020), a comparison between alternative implementations in the field of interest was necessary.

Following
procedures proposed in previous works (Papacharalampous et al. 2023a, b, c and Tyralis et al. 2023a), the four closest grid points from each grid to each ground-based station were identified and indexed according to their distance from this station, which was computed in meters.The monthly precipitation totals at the four closest grid points 1−4 are hereinafter called "PERSIANN values 1−4" or "IMERG values 1−4", corresponding to the satellite precipitation dataset [PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals) are two satellite datasets, that are described in detail later in Section 4.1.2].Similarly, the distances di, i = 1, 2, 3, 4, where d1 < d2 < d3 < d4, are hereinafter called "PERSIANN distances 1−4" or "IMERG distances 1−4".These total monthly precipitation values and distances were the predictor variables, together with the elevation at the station.Practically, the spatial prediction setting is a regression problem.

Figure 1 .
Figure 1.Illustration of how the spatial prediction problem was formulated.The satellite precipitation data are available at grid points.The distances of a given station (station 1) from its four closest grid points (grid points 1−4) are denoted with di, i = 1, 2, 3 and 4.These distances and the monthly precipitation data from satellites at the same grid points were used as predictors for precipitation at the given station.

3. 4
Gradient boosting machines Gradient boosting machines (GBM; Friedman 2001) are a boosting (Mayr et al. 2014, Tyralis and Papacharalampous 2021a) variant that can be used, among others, for predictive uncertainty quantification.A boosting algorithm is an ensemble learning algorithm that composes a strong learner by sequentially adding weak base learners to the ensemble.Specifically, the training of every new base learner targets at minimizing the error of the current ensemble.This error is the quantile loss (Gneiting 2011) for the case of GBM.Notably, the number of times that a new base learner is added to the ensemble (which is the same with the number of base learners) should be that large that it ensures proper fitting, simultaneously avoiding overfitting.In this work, GBM employed 500 decision trees as their base learners.The other parameters of GBM were set to their defaults in Greenwell et al. (2022).3.5 Light gradient boosting machine Light gradient boosting machine (LightGBM; Ke et al. 2017) is the second boosting variant used in this work.It is known to have some better properties (e.g., smaller training time) compared to GBM (Tyralis and Papacharalampous 2021a

4 .
Tyralis et al. (2023a) for a similar problem.The other parameters of LightGBM were set to their defaults in Shi et al. (2023).3.6 Quantile regression neural networks Quantile regression neural networks (QRNN; Taylor 2000, Cannon 2011) are a neural network (Hastie et al. 2009, chapter 11) algorithm that can, by its construction, facilitate uncertainty quantification by minimizing a quantile scoring function.In modelling with neural networks, linear combinations of the predictors are first defined.Then, the form of a non-linear function of these combinations is given to the dependent variable.In this work, QRNN were run with a number of repeated trials equal to 1.The other parameters of QRNN were set to their defaults in Cannon (2023).Data, application and comparison 4.1 Data To evaluate and compare the learners, three types of data were used.These were gaugemeasured precipitation (see Section 4.1.1),satellite precipitation (see Section 4.1.2) and elevation (see Section 4.1.3)data.Note that the same data compilation was previously used for different benchmarking purposes in Papacharalampous et al. (2023c).4.1.1Gauge-measured precipitation dataMonthly precipitation totals from ground-based gauges were extracted from the Global Historical Climatology Network monthly database, version 2 (GHCNm)(Peterson and Vose 1997).The extraction took place through the repository of the National Oceanic and Atmospheric Administration (NOAA), which is available online in the following address (accessed on 2022-09-24): https://www.ncei.noaa.gov/pub/data/ghcn/v2.The gaugemeasured data refer to 1 421 locations (see Figure2) in the CONUS and to the time period starting from 2001 and ending with 2015.

Figure 2 .
Figure 2. Locations of the ground-based gauges that recorded data used in this study.

Figure 3 .
Figure 3. Statistics for facilitating the evaluation and comparison of the learners based on the entire dataset.Specifically: (a) Sample coverage of the learners at the various quantile levels; (b) quantile prediction skill of the learners at the various quantile levels; and (c) quantile scoring rule skill of the learners.For interpreting (a):The closest the sample coverage to the quantile level, the more reliable the quantile predictions.For interpreting (b): The larger the quantile prediction skill (and the darker the colour on the respective heatmap), the better the quantile predictions on average compared to the quantile predictions of the benchmark learner (i.e., quantile regression).For interpreting (c): The larger the quantile scoring rule skill, the better the probabilistic predictions on average compared to the probabilistic predictions of the benchmark learner (i.e., quantile regression).

Figure 4 .
Figure 4. Quantile prediction skill of (a) quantile regression forests, (b) generalized random forests, (c) gradient boosting machines, (d) light gradient boosting machine, and(e) quantile regression neural networks at the various quantile levels for 200 arbitrary stations.The larger the quantile prediction skill (and the darker blue the colour on the heatmaps), the better the quantile predictions on average compared to the quantile predictions of the benchmark learner (i.e., quantile regression).
of the ensemble learning concept (Wang et al. 2022) for dealing with the general task of interest.Such investigations could focus both on simple ways to combine algorithms (e.g., the one in Petropoulos and Svetunkov 2020) and on advanced stacked generalization procedures (e.g., those in Wolpert 1992), and could use the learners implemented and compared in this work as base learners in the ensemble learning.They could also benefit from the comparative framework of this work.Another idea for future research is to extend the comparison to other families of machine learning algorithms for predictive uncertainty quantification.Summaries of such families and their known properties are available in the reviews by Papacharalampous and Tyralis (2022) and Tyralis and Papacharalampous (2023).Lastly, potential benefits from incorporating time series features, such as those computed in Kang et al. (2017), into the predictive uncertainty quantification frameworks in the field of satellite precipitation product correction could be investigated.
Tibshirani and Athey (2023)n additional randomization procedure.While, in standard regression, the output prediction is the mean of the predictions of all the decision trees (with this mean approximating the conditional mean of the response variable), for QRF it is a resemblance of the conditional distribution of the response variable.In this work, QRF were implemented with 500 decision trees.The other parameters of QRF were set to their defaults inTibshirani and Athey (2023).In this work, GRF were implemented with 500 decision trees.The other parameters of GRF were set to their defaults in Tibshirani and Athey (2023).