The observation range adjusted method: a novel approach to accounting for observation uncertainty in model evaluation

Model evaluations are performed by comparing a modelled quantity with an observation of that quantity, and any deviation from the observed quantity is considered an error. We know that all observing systems have uncertainties, and multiple observational products for the same quantity can provide equally plausible ‘truths’. Thus, model errors depend on the choice of observation used in the evaluation exercise. We propose a method that considers models to be indistinguishable from observations when they lie within the range of observations, in which case they are not assigned any error. Errors are assigned only when models are outside the observational range. Errors calculated in this way can be used within traditional statistics to calculate the Observation Range Adjusted (ORA) version of that statistic. The ORA statistics highlight the measurable errors of models, provide more robust model performance rankings, and identify the areas of a model where further development is likely to lead to consistent improvements.


Introduction
This paper presents a method to include observational uncertainty in model evaluation, using climate models as an example. Climate models are our best tools to understand the climate system and how it is changing as we change the atmospheric composition. These models simulate an extremely complex system with processes that span a wide range of time and space scales. Output from climate models informs a wide range of issues, from global CO2 mitigation strategies (IPCC 2022), to national climate change risk assessments, to local adaptation planning. These models are not perfect and are continuously being evaluated and improved.
Climate model evaluation is quantified through comparison with observations. Hence, evaluation results depend on both the model being evaluated and the observational dataset used to do the evaluation. All observations contain errors and uncertainties. The main sources of these errors include:
• Limitations of the observing sensors (Thorne et al 2011)
• Uncertainties related to the temporal and spatial density of observations (Cowtan and Way 2014, Herrera et al 2019, Grainger et al 2022)
• Approximations within the interpolation techniques used to put observations on the grids used for model evaluation (Avila et al 2015, Timmermans et al 2019, Bador et al 2020)
There are many gridded observational products that have dealt with these uncertainties in different ways, leading to several equally plausible gridded estimates of each climate variable (Roca et al 2019). When models are evaluated against a single observational dataset it is not possible to quantify how much of the result is due to model errors versus observational uncertainty.
Most previous studies did not consider observational uncertainty but evaluated the model using a single observation dataset, leaving the observational uncertainty implicitly included in their results. Other studies have adopted various approaches to deal with observational uncertainty in climate model evaluation. These range in complexity from simply reporting the evaluation results against more than one observation dataset (Gleckler et al 2008, Ahn et al 2022, Lee et al 2024), to combining multiple evaluations using a weighted mean (Collier et al 2018), to probabilistic frameworks that combine multiple model simulations and multiple observation datasets into a distribution of outcomes (Annan et al 2011). While these probabilistic methods can be valuable tools, they tend to be complex to calculate and understand, and can require subjective choices to be made in their interpretation. Below we propose a method for model evaluation that accounts for observational uncertainty, can be applied to well-known statistical measures, and maintains the interpretation of their outcomes.
Here we calculate the ORA statistics for the Coupled Model Intercomparison Project phase 6 (CMIP6) models using multiple precipitation and temperature observation datasets.

Methods
2.1. Observation range adjusted (ORA) statistics
ORA statistics were first introduced by Evans et al (2016). The basic premise of ORA statistics relies on comparing a model estimate to multiple observation estimates, or to observational estimates with quantified uncertainty, such that:
• When a model is within the range of observations (and their uncertainties) it is considered to have no error
• When a model is outside the observation range, the error is the distance between the model and the nearest observation
That is, while the model is within the observation range it is considered to be indistinguishable from the observations and hence has no error. Model errors only exist when the model is outside this observation range. This is shown schematically in figure 1.
This set of errors can then be used within standard statistical metrics, such as bias and root mean square error (RMSE), to calculate ORA versions of those metrics. In practice, this is achieved by creating a pseudo-observation dataset that is equal to the model when the model is within the observation range and equal to the nearest observation when the model is outside the observation range (equation (1)). We note that the observation range can be derived from errors reported for individual observation datasets as well as from the range spanned by multiple observation datasets, as indicated in figure 1.

$$O_{\mathrm{pseudo}} = \begin{cases} \min(O) & \text{if } M < \min(O) \\ M & \text{if } \min(O) \le M \le \max(O) \\ \max(O) & \text{if } M > \max(O) \end{cases} \qquad (1)$$

where $O_{\mathrm{pseudo}}$ is the pseudo observation, $M$ is the model, and $\max(O)$ and $\min(O)$ are the maximum and minimum of the observations respectively. Statistics such as the RMSE are then calculated as normal using the pseudo observations, for example

$$\mathrm{ORA\text{-}RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(M_i - O_{\mathrm{pseudo},i}\right)^2} \qquad (2)$$

where $N$ is the number of points being compared.
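In practice the pseudo-observation construction in equation (1) is simply a clip of the model to the observation range. The following minimal sketch in Python/NumPy illustrates equations (1) and (2); the function names and example values are our own illustrative choices, not from the authors' code:

```python
import numpy as np

def ora_pseudo_obs(model, obs_min, obs_max):
    """Clip the model to the observation range (equation (1)).

    Where the model lies inside [obs_min, obs_max] the pseudo
    observation equals the model, so the error is zero; outside
    the range it equals the nearest observation bound.
    """
    return np.clip(model, obs_min, obs_max)

def ora_rmse(model, obs_stack):
    """ORA-RMSE (equation (2)) for a stack of observation datasets
    along axis 0."""
    pseudo = ora_pseudo_obs(model, obs_stack.min(axis=0), obs_stack.max(axis=0))
    return np.sqrt(np.mean((model - pseudo) ** 2))

# Example: one model value per point, three observation datasets
model = np.array([1.0, 4.0, 2.5])
obs = np.array([[0.5, 2.0, 2.0],
                [1.5, 3.0, 3.0],
                [1.2, 2.5, 2.8]])
print(ora_rmse(model, obs))  # only the second point (4.0 > 3.0) contributes
```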

Results
We used the ensemble of 59 CMIP6 models and three different observational temperature datasets to calculate the average root mean square error (RMSE) of monthly mean temperature across global land grid points. The heatmap in figure 2 is coloured by the RMSE rank of the models as measured against each observation dataset separately, against the mean of these separate dataset metrics, and against all datasets together using the ORA-RMSE. The colours represent the ranks and the values in the cells show the average RMSE.
If the observational dataset were not important for determining the relative performance of the models, then each row would be a single colour. In the figure, the colours are quite jumbled, clearly indicating that the choice of observational dataset plays an important role in deciding how well a model performs (ranks) within this ensemble. Focusing on the worst performing model (red cells with black borders), we see that the same model is the worst performer regardless of the observational dataset used, and this remains true for the ORA-RMSE as well. The best performing model (purple cells with thick white borders) is not always the same model, showing that the best performing model depends on the observational dataset chosen.
The ORA-RMSE performance ranking is not the same as the ranking for any individual observational dataset, nor is it the same as the mean of the rankings across all these observational datasets. This demonstrates the value of explicitly considering the observational range when assessing model errors. We note that the ORA-RMSE values are lower than the RMSE values against any individual observation dataset. This indicates that all models, even the worst performers, spend some time, in some locations, within the observation range.
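For readers who want to reproduce the style of ranking used in figure 2, the sketch below ranks a set of models by their error against each dataset, against the mean across datasets, and by an ORA statistic. The error values here are randomly generated placeholders; only the ranking logic is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_datasets = 59, 3

# Placeholder RMSE of each model against each observation dataset
rmse = rng.uniform(1.0, 3.0, size=(n_models, n_datasets))
# Placeholder ORA-RMSE; by construction the ORA-RMSE cannot exceed
# the RMSE against any single dataset
ora = rmse.min(axis=1) * rng.uniform(0.5, 1.0, size=n_models)

def rank(values):
    """Rank 1 = best, i.e. lowest error."""
    return values.argsort().argsort() + 1

per_dataset_ranks = np.column_stack([rank(rmse[:, j]) for j in range(n_datasets)])
mean_metric_rank = rank(rmse.mean(axis=1))
ora_rank = rank(ora)
```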
The spatial distribution of RMSE is shown in figure 3. For much of the global land area the RMSE is similar regardless of the observations used. However, there are some locations where the RMSE calculated against individual observation datasets differs substantially. In some of these locations, such as Greenland and the Tibetan plateau, the ORA-RMSE is notably lower than the RMSE against any individual observation dataset. This highlights areas where the observation datasets differ from each other, creating a wide observation range. In this case it is easier for models to fall within the observation range, not because the models are performing well, but because the observations are relatively poorly constrained. Relying on a single observation dataset in these locations may produce a high RMSE for a given model, from which one may conclude that improving the representation of the model processes important in these regions would yield good overall model improvement. However, such a conclusion would be erroneous: the observation uncertainty is large and it is not clear that model performance in these regions is worse than elsewhere. It does suggest that work toward reducing observational uncertainty in these regions is required.
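The maps in figure 3 take the RMSE over the time dimension at each grid point, rather than over all points at once. A sketch of the grid-point-wise calculation, assuming plain NumPy arrays with the axis layout noted in the comments (the paper does not specify the implementation, so this layout is our assumption):

```python
import numpy as np

def ora_rmse_map(model, obs_stack):
    """Grid-point-wise ORA-RMSE over the time axis.

    model:     array of shape (time, lat, lon)
    obs_stack: array of shape (dataset, time, lat, lon)
    returns:   array of shape (lat, lon)
    """
    pseudo = np.clip(model, obs_stack.min(axis=0), obs_stack.max(axis=0))
    return np.sqrt(np.mean((model - pseudo) ** 2, axis=0))
```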
An example for another metric and variable is shown for the precipitation bias in figure 4. The heatmap in figure 4 is coloured by the bias rank of the models as measured against each observation dataset separately, against the mean of these separate dataset metrics, and against all datasets together using the ORA-Bias. The colours represent the ranks and the values in the cells show the average bias. Equivalent maps of this bias can be found in supplementary figure 3.
As we saw for the temperature RMSE, the colours are quite jumbled, clearly indicating that the choice of observational dataset plays an important role in deciding how well a model performs (ranks) within this ensemble. Focusing on the worst performing model (red cells with black borders), we see that the same model is the worst performer regardless of the observational dataset used, and this remains true for the ORA-Bias as well. The worst performing models are consistently outside the observation range and hence have consistently poor performance regardless of the observation dataset.
The best performing model (purple cells with thick white borders) is not always the same model, showing that the best performing model depends on the observational dataset chosen. The inclusion of GPCC and GPCC_nucc allows some explanation of the ORA-Bias numbers. By having no undercatch correction, GPCC_nucc consistently provides lower estimates of the observed precipitation than GPCC. Thus, at a given grid cell, models that overestimate GPCC_nucc and underestimate GPCC fall within the observation range and accrue no ORA-Bias. In cases such as the best performing ORA-Bias model (CAS-ESM-2-0), we can see that on average the model overestimates GPCC_nucc and underestimates GPCC and hence is often within the observation range. This leads to an ORA-Bias that is much lower than the bias against any individual dataset.
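As a worked illustration of this point (with made-up numbers, not values from the paper): at a grid cell where GPCC_nucc reports 2.0 mm/day and GPCC reports 2.6 mm/day, a model simulating 2.3 mm/day has a bias of roughly +0.3 or −0.3 mm/day against either dataset alone, yet contributes zero to the ORA-Bias because it lies inside the observation range:

```python
import numpy as np

def ora_bias(model, obs_stack):
    """Mean of (model - pseudo observation); zero wherever the
    model lies within the observation range."""
    pseudo = np.clip(model, obs_stack.min(axis=0), obs_stack.max(axis=0))
    return np.mean(model - pseudo)

model = np.array([2.3])            # mm/day, hypothetical grid cell
obs = np.array([[2.0],             # GPCC_nucc (no undercatch correction)
                [2.6]])            # GPCC (with undercatch correction)
print(model - obs)                 # approx +0.3 and -0.3: bias vs each dataset
print(ora_bias(model, obs))        # 0.0: the model falls within the range
```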
The ORA-Bias performance ranking is not the same as the ranking for any individual observational dataset, nor is it the same as the mean of the rankings across all these observational datasets. This again demonstrates the value of explicitly considering the observational range when assessing model errors.
Similar heatmaps for other metrics are provided in the supplementary material and reinforce the findings presented here.

Conclusions
Not accounting for observation uncertainty in climate model evaluation means that identified errors cannot be unambiguously attributed to model inadequacies. In this case, regions with large observation uncertainties may be inappropriately identified as regions with large model errors. Here we have introduced the observation range adjusted (ORA) method to explicitly include the observation uncertainty when calculating performance metrics. The method relies on using observational estimates of acceptable quality and can use error or uncertainty ranges provided with the observational dataset. So long as an observation range can be defined, using multiple plausible observation datasets and/or well-defined errors associated with the observational dataset itself, ORA statistics can be calculated to overcome this misidentification issue. Hence, ORA statistics highlight unambiguous model errors (those not caused by observational uncertainty) that should be the primary targets for model improvements. ORA statistics are relatively simple to calculate and maintain the interpretation of the standard equivalent statistics.

Figure 1. Schematic showing the observation range adjusted model error.

Figure 2. Heatmap showing the performance ranking of the CMIP6 models in terms of the average climatological monthly temperature RMSE over land against each observation dataset, the mean RMSE across observation datasets, and the ORA-RMSE ranking. Best (worst) performing models in each column are outlined in white (black).

Figure 3. The top-left map shows the observational range for annual temperature from the three observation datasets. The remaining maps show the ensemble mean of the climatological monthly mean temperature RMSE calculated against each dataset and the ORA-RMSE.

Figure 4. Heatmap showing the performance ranking of the CMIP6 models in terms of the average climatological precipitation bias over land against each observation dataset, the mean bias across observation datasets, and the ORA-Bias ranking. Best (worst) performing models in each column are outlined in white (black).
3.1. CMIP6 model ensemble
Global Climate Model (GCM) simulations from the CMIP6 (Eyring et al 2016) historical scenario are used in this study. Monthly mean temperature and precipitation totals for the period 1982 to 2014 are used in the evaluations. All GCM data are interpolated from their native grids to a common 1 × 1 degree grid for the evaluation. A total of 59 CMIP6 simulations are included in this analysis (see supplementary table 1).
3.2. Observation datasets
3.2.1. Temperature
We use three global temperature datasets: the Climatic Research Unit Timeseries (CRU TS) v4.07 (Harris et al 2020), Berkeley Earth (Rohde and Hausfather 2020), and TerraClimate (Abatzoglou et al 2018). All three datasets are underpinned by observations taken by the global network of meteorological stations, which are quality controlled and interpolated to a grid. If required, we regrid the data to the common 1 × 1 degree grid used in this study.
3.2.2. Precipitation
The precipitation datasets include the Rainfall Estimates on a Gridded Network (REGEN; Contractor et al 2020), Multi-Source Weighted-Ensemble Precipitation (MSWEP; Beck et al 2018), and the Global Precipitation Climatology Centre product (GPCC; Becker et al 2013) with undercatch correction and without undercatch correction (GPCC_nucc). Since most precipitation datasets do not include an explicit undercatch correction, we include both versions of GPCC here to ensure that our observation range includes known errors associated with precipitation undercatch. All datasets are underpinned by observations taken by the global network of meteorological stations, which are quality controlled and interpolated to a grid. In addition, MSWEP incorporates satellite-based rainfall estimates. If required, we regrid the data to the common 1 × 1 degree grid used in this study.
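The regridding step is not specified in detail in the text; the following is a minimal sketch of one plausible implementation, assuming the data are held in an xarray Dataset with `lat` and `lon` coordinates and that bilinear interpolation is acceptable (both are our assumptions, not statements about the authors' workflow):

```python
import numpy as np
import xarray as xr

def to_one_degree(ds: xr.Dataset) -> xr.Dataset:
    """Interpolate a dataset onto a common 1 x 1 degree grid."""
    target_lat = np.arange(-89.5, 90.0, 1.0)    # grid-cell centres
    target_lon = np.arange(-179.5, 180.0, 1.0)
    return ds.interp(lat=target_lat, lon=target_lon, method="linear")
```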