Constraining decadal variability regionally improves near-term projections of hot, cold and dry extremes

Hot, cold and dry meteorological extremes are often linked with severe impacts on the public health, agricultural, energy and environmental sectors. Skillful predictions of such extremes could therefore enable stakeholders to better plan for and adapt to future impacts of these events. The intensity, duration and frequency of such extremes are affected by anthropogenic climate change and modulated by different modes of climate variability. Here we use a large multi-model ensemble from the Coupled Model Intercomparison Project Phase 6 (CMIP6) and constrain these simulations by sub-selecting those members whose global sea surface temperature (SST) anomaly patterns are most similar to observations at a given point in time, thereby phasing in the decadal climate variability with observations. Hot and cold extremes are skillfully predicted over most of the globe, with widespread added value from using the constrained ensemble compared to the unconstrained full CMIP6 ensemble. Dry extremes, on the other hand, show skill only in some regions, with results sensitive to the index used. Still, we find skillful predictions and added skill for dry extremes in some regions such as western North America, southern central and eastern Europe, southeastern Australia, southern Africa and the Arabian Peninsula. We also find that the added skill in the constrained ensemble is due to a combination of improved multi-decadal variations in phase with observed climate extremes and improved representation of long-term changes. Our results demonstrate that constraining decadal variability in climate projections can provide improved estimates of temperature extremes and drought in the next twenty years, which can inform targeted adaptation strategies to near-term climate change.

[...] ensembles of climate projections (Befort et al 2020, Mahmood et al 2021, 2022) [...] conceptually similar to initialisation of climate predictions (Meehl et al 2021) and has the main advantage of exploiting initialisation information beyond the 10 years of decadal prediction, without much computational cost since it uses existing climate projections, while also providing seamless information until the end of the century. These constrained climate projections are consistent with the model-specific climate attractors and are therefore not affected by shock, drift and related artefacts (Hazeleger et al 2013, Bilbao et al 2021, Smith et al 2013).

Here we follow the approach of Mahmood et al (2022) for constraining decadal climate variability in a large multi-model ensemble, and we assess the prediction skill of hot, cold and dry extremes in these constrained projections over global land areas. With this method, we constrain climate variability based on the similarities, at a given point in time, between a large CMIP6 multi-model ensemble (MME) and multi-annual averages of observed sea surface temperature (SST) anomaly patterns. For each year, the method sub-selects only those ensemble members that agree best with the observed SST patterns. For the skill assessment we focus on the 20-year period after applying the constraint, a time scale for which a previous study (Mahmood et al 2022) showed added value for some annual mean variables and on which the role of internal variability is still large.

2. Data

We use 149 ensemble members coming from an MME of 19 CMIP6 models (Table S1) [...] The data we analyse for calculating the extreme indices are monthly total precipitation (mm) and daily and monthly minimum and maximum surface temperatures (°C). At the time of the analysis, these 149 members were all the members of the MME used in Mahmood et al (2022) that provided the daily data required for computing the extreme indices. We evaluate these simulated extremes against observation-based datasets; to address sensitivity to the choice of reference dataset we use one observational and one reanalysis dataset for temperatures and two observational datasets for precipitation.

[...] (2010)) with accumulation periods of 3, 6 and 12 months.

The SPI is computed solely from monthly total precipitation and is often used to measure meteorological drought, with lack of precipitation indicated by negative values. The SPEI, on the other hand, is computed from monthly total precipitation and the monthly means of daily maximum and minimum temperatures, the latter two being used to compute potential evapotranspiration following the Hargreaves (1994) approximation; SPEI therefore represents drought in terms of [...]

[...] (2017)) from the National Oceanic and Atmospheric Administration (NOAA). The monthly mean model and observed SST data were regridded to a common 3°x3° grid and the climatological mean was removed to compute the anomalies.
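The anomaly step described above can be sketched for a single grid cell as follows. This is a minimal illustration, not the authors' code: the function name `monthly_anomalies`, the use of a separate climatology for each calendar month, and the reference period passed as `clim_start` and `clim_end` are all assumptions, since the climatological period is not recoverable from the text here.

```python
from statistics import fmean


def monthly_anomalies(series, start_year, clim_start, clim_end):
    """Remove a per-calendar-month climatology from a monthly time series.

    series     : flat list of monthly means, beginning in January of start_year
    clim_start : first year of the reference period (hypothetical parameter)
    clim_end   : last year of the reference period, inclusive (hypothetical)
    """
    climatology = []
    for month in range(12):
        # Collect this calendar month from every year of the reference period.
        values = [
            series[(year - start_year) * 12 + month]
            for year in range(clim_start, clim_end + 1)
        ]
        climatology.append(fmean(values))
    # Subtract the matching calendar-month mean from every time step.
    return [value - climatology[i % 12] for i, value in enumerate(series)]
```

In a gridded setting the same operation would be applied independently to every cell of the common 3°x3° grid.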

Internal climate variability is constrained by comparing spatial distributions of global SST anomaly patterns between each of the 149 CMIP6 ensemble members and the observed anomaly averaged over a 9-year period preceding the start of the prediction. This comparison is performed via area-weighted spatial pattern correlation. Similar to Mahmood et al (2022), we choose the 30 top-ranking members (referred to as "Best30") for hindcasting up to 20 years after the initialization. The unconstrained ensemble consists of all 149 members (referred to as "All ensemble").
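A minimal sketch of this selection step is given below. It is illustrative only: the function names, the cosine-of-latitude weights, the use of a centred (mean-removed) correlation and the masking of missing cells with `None` are assumptions rather than details confirmed by the text.

```python
import math


def area_weights(lats_deg, n_lon):
    """cos(latitude) weight per cell of a regular grid, flattened row by row."""
    return [math.cos(math.radians(lat)) for lat in lats_deg for _ in range(n_lon)]


def pattern_correlation(a, b, w):
    """Area-weighted, centred correlation between two flattened fields.

    Cells where either field is None (e.g. land points) are skipped.
    """
    cells = [(x, y, wi) for x, y, wi in zip(a, b, w)
             if x is not None and y is not None]
    total_w = sum(wi for _, _, wi in cells)
    mean_a = sum(wi * x for x, _, wi in cells) / total_w
    mean_b = sum(wi * y for _, y, wi in cells) / total_w
    cov = sum(wi * (x - mean_a) * (y - mean_b) for x, y, wi in cells)
    var_a = sum(wi * (x - mean_a) ** 2 for x, _, wi in cells)
    var_b = sum(wi * (y - mean_b) ** 2 for _, y, wi in cells)
    return cov / math.sqrt(var_a * var_b)


def select_best(member_fields, obs_field, w, n_best=30):
    """Return the ids of the n_best members ranked by pattern correlation.

    member_fields maps a member id to its flattened 9-year mean SST anomaly.
    """
    scores = {member: pattern_correlation(field, obs_field, w)
              for member, field in member_fields.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_best]


def constraint_window(start_year, length=9):
    """First and last year of the averaging window preceding start_year."""
    return start_year - length, start_year - 1
```

For a prediction starting in 1961, `constraint_window(1961)` gives the years 1952 and 1960, and `select_best` would then be called on the member and observed anomaly fields averaged over that window.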

We use 9-year averages of SST anomalies since constraining based on this period showed high skill in constrained projections as shown by Mahmood et al (2022), who also tested sensitivity to using other averaging periods. To start a constrained prediction from January 1961, we use the 9-year mean SST anomalies from January 1952 to December 1960 to select the Best30 members. This procedure is repeated every year: the Best30 members selected based on the SST anomaly comparison from 1953 to 1961 are used for predictions starting in 1962, those from 1954 to 1962 for predictions starting in 1963, etc. Here we focus on the hindcast period of 1 to 20 years after the [...]

The Spearman Correlation Coefficient (Spearman 1904) estimates the strength of the monotonic relationship between the observational reference and the CMIP6 MME mean. It ranges between -1 (worst agreement) and 1 (best agreement). We use the Spearman rank correlation to avoid assumptions about distributional properties (e.g. normality). The Spearman correlation coefficient is defined as:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}$$

where $i$ corresponds to each time step (from 1 to $n$), and $d_i$ is the difference between the ranks of $x_i$ and $o_i$ (simulated and observed value, respectively, for time step $i$).

In order to assess whether the Best30 ensemble captures more observed variability than the All ensemble, we use the residual correlation [...], which indicates whether we can predict the variations around the forced signal and therefore quantifies the added skill from aligning variability phases or "initialising" the predictions. We therefore remove the forced signal (using the All ensemble mean as the best estimate of the forced response) from the observed and Best30 mean time series by subtracting their corresponding linear fits to the All ensemble mean [...]

The Root Mean Squared Skill Score (RMSSS; Murphy (1988)) is also a deterministic skill measure computed from the MME mean and is used to assess whether the Best30 ensemble is more skillful than a reference hindcast. The RMSSS is based on the Root Mean Squared Error:

$$\mathrm{RMSSS} = 1 - \frac{\mathrm{RMS}_{exp}}{\mathrm{RMS}_{ref}}$$

where $\mathrm{RMS}_{exp}$ and $\mathrm{RMS}_{ref}$ correspond to the Root Mean Square (RMS) difference of the hindcasts and reference hindcast, respectively, from the observed value $o_i$, which is computed as:

$$\mathrm{RMS} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - o_i)^2}$$

The Ranked Probability Skill Score (RPSS; Wilks (2011)) is used to estimate the skill of probabilistic products from all members of the MME. The RPSS is based on the Ranked Probability Score (RPS), which evaluates the skill in terms of probabilities (computed as the percentage of members that fall into each equiprobable tercile category, with the three categories indicating below average, approximately average and above average conditions). [...] In the RPS, $j$ corresponds to the probabilistic category (from 1 to $J = 3$), and $p_{x_j}$ and $p_{o_j}$ are the hindcasted and observed probabilities, respectively, for the probabilistic category $j$.
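The skill measures described above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the residual correlation is assumed to use the same Spearman rank correlation as the plain correlation, and the RPS is written in the standard cumulative-probability form described by Wilks (2011), averaged over time steps.

```python
import math


def ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, o):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(o))


def residuals(y, forced):
    """y minus its least-squares linear fit on the forced-signal estimate."""
    mf, my = sum(forced) / len(forced), sum(y) / len(y)
    slope = (sum((f - mf) * (v - my) for f, v in zip(forced, y))
             / sum((f - mf) ** 2 for f in forced))
    return [v - (my + slope * (f - mf)) for f, v in zip(forced, y)]


def residual_correlation(best_mean, obs, all_mean):
    """Correlation of Best30 and observed residuals about the All mean."""
    return spearman(residuals(best_mean, all_mean), residuals(obs, all_mean))


def rms(x, o):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, o)) / len(o))


def rmsss(exp, ref, obs):
    return 1 - rms(exp, obs) / rms(ref, obs)


def rps(p_fc, p_ob):
    """Mean squared difference of cumulative category probabilities."""
    total = 0.0
    for fc, ob in zip(p_fc, p_ob):
        cum_fc = cum_ob = 0.0
        for a, b in zip(fc, ob):
            cum_fc += a
            cum_ob += b
            total += (cum_fc - cum_ob) ** 2
    return total / len(p_fc)


def rpss(p_exp, p_ref, p_ob):
    return 1 - rps(p_exp, p_ob) / rps(p_ref, p_ob)
```

In this sketch, `rmsss` and `rpss` are positive when the evaluated hindcast (e.g. Best30) has a smaller error than the reference (e.g. the All ensemble), and `residual_correlation` is undefined if either residual series is constant.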

We estimate the statistical significance of the correlation and residual correlation with a two-sided t-test (Wilks 2011) [...]

Best30 TX90p shows high skill in most global land regions, with correlations exceeding 0.9 in the majority of grid cells, and RMSSS > 0.8 and RPSS > 0.6 in large areas (Figure 1(a)-(c)). Improved skill from the constraint in Best30 in comparison to All ensemble, as measured by positive residual correlations, is found in the western USA, South and eastern North America, Africa, the Arabian Peninsula, Europe, most of Asia and northern Australia (Figure 1(d)), meaning that in these regions observed variability is captured better by Best30 than by All ensemble. Improved skill based on the RMSSS is found over central and northern South America, Greenland, most of the African continent, southeastern Europe, the Arabian Peninsula and most of central and southern Asia (Figure 1(e)), indicating closer agreement between Best30 and the reference dataset. Improved skill measured by the RPSS is widespread and similar to that of the residual correlation (Figure 1(f)), and indicates that Best30 is more skillful than All ensemble when evaluating the skill in terms of probabilities.

Best30 TXx often shows weaker skill than TX90p, as also found for multi-annual predictions by Delgado-Torres et al (2023), but 20-year projections are still skillful over large areas of the globe for the three metrics. Lack of skill is found in some parts of North and South America, Scandinavia, western and southern Africa, central parts of Asia and northern Australia (Figure 1(g)-(i)). Improved skill as measured by residual correlation is found over Alaska, Canada, eastern North America, southwestern USA, Mexico, northern South America, eastern Europe, India, eastern Russia, southeastern Asia and western Australia (Figure 1(j)). RMSSS shows improved skill mainly over central Africa (Figure 1(k)), whereas the negative RMSSS [...] increased mean bias in Best30 compared to All ensemble. Improved RPSS skill is found in western North America, South America, central and eastern Europe, central Africa and in some localised parts of Asia (Figure 1(l)).

Similarly to hot extremes, we also find high hindcast skill for indices of cold extremes (Figure S2). TN10p and TNn show high Best30 skill over most of the globe, with the former having larger areas with significant skill than the latter (Figure S2(a)-(c), (g)-(i)). For both indices, we find added skill compared to All ensemble over southeastern North America, eastern Brazil, equatorial Africa, southeastern China and northern Australia (Figure S2(d)-(f), (j)-(l)). When using ERA5 as the reference dataset we find similar spatial patterns for hot and cold extremes in both the Best30 skill and the skill improvement (Figures S3-S4).

The observational reference dataset is BEST.

We next inspect the regional average time series for three regions in which the Best30 ensemble shows improved skill over the All ensemble, namely western North America, southern central and eastern Europe, and southeastern China (Figure 2). We use these time series plots to illustrate some of the characteristics that help explain the improved skill in Best30 compared to the unconstrained ensemble. While all time series indicate a long-term warming over the analysis period for both TX90p and TXx in all three regions, there are also some noteworthy differences.

In SCE Europe and SE China (Figure S1) the Best30 ensemble mean has lower values than the All ensemble mean for both TX90p and TXx in the first two decades of the investigation period. These values are closer to the observed temperature values, contributing to the improved skill. Overall this leads to a stronger long-term warming of hot extremes in these regions in Best30 compared to All, which is more similar to observations. In addition, the Best30 ensemble also captures some of the observed decadal-scale variations, with accelerated warming in the 1980s and early 1990s and reduced warming rates from the mid 1990s, whereas the All ensemble mean features temporally more homogeneous increases. In WN America (Figure S1) the Best30 ensemble mean also has lower values than the All ensemble mean during the first two decades of the investigation period. In this case this makes it differ more from the observed time series, as also reflected by the negative RMSSS values when using the All ensemble as reference. However, the positive residual correlation (and positive RPSS for the TXx index) indicates some added skill in Best30 over the All ensemble, which is indicative of correctly predicting some aspects of the decadal-scale variations in the warming rates (such as the reduced warming rates in the 1990s). Similar time series are also found when using the ERA5 reference datasets, as shown in Figure S5.

AUTHOR SUBMITTED MANUSCRIPT - ERL-116054.R2

Skill for dry extremes is overall spatially more limited than the skill for hot extremes. However, there are some areas where near-term projections of dry extremes are skillful, and where our constraint adds skill.

Best30 SPI3_dry correlations are locally significant over the southwestern USA, central and southern South America, Greenland, northern Europe, central Africa, parts of Asia and southeastern Australia (Figure 3(a)). RMSSS and RPSS show similar patterns of positive skill over central South America, northern Europe, central Africa and central and northern Asia (Figure 3(b),(c)). Residual correlations indicate skill improvements from the constraint for SPI3_dry in a few regions, e.g. over the southern USA, central Africa and in other localised areas of the globe (Figure 3(d)). RMSSS and RPSS also indicate some added value for the constrained ensemble in similar regions, e.g. the southern USA, some scattered areas in South America, the Arabian Peninsula and southern Australia (Figure 3(e)-(f)).

In the following, we focus on regional time series of the drought measures in three regions where the constraint adds skill (i.e. WN America, SCE Europe and SE Australia), to better understand the characteristics of the improved hindcasts (Figure 4). Here, Best30 and All ensemble correctly capture both the stationarity (Figure 4(a),(c)) and long-term changes in the observations (Figure 4(b),(d)-(f)) for both indices. There is also added value in Best30 compared to All, especially for WN America, where Best30 captures some of the observed decadal-scale variations around the CMIP6 (All) mean, although with smaller magnitude. We obtain similar results with SPI6_dry, SPEI6_dry, SPI12_dry and SPEI12_dry (Figure S11), or when using the ERA5-REGEN reference datasets (Figures S12-S13). In summary, these results illustrate how the constraint can improve near-term projections of drought by enhancing the representation of both decadal-scale variations and long-term changes in WN America, SCE Europe and SE Australia.

When comparing the skill for these drought indices (i.e. SPIn_dry or SPEIn_dry) against the skill in predicting the entire distributions of SPI or SPEI (i.e. including dry and wet conditions), we note some interesting differences (Figure S14). [...]

[...] 1961-1980 to 2000-2019). We showed that the constrained ensemble (Best30) has high skill for hot and cold extremes over large parts of the globe, with added value compared to the unconstrained ensemble (All) in several regions. Dry extremes, on the other hand, showed lower skill than temperature extremes, but drought predictions are skillful in some regions. These regions include e.g. western North America, southeastern Europe and southeastern Australia, which were affected by prominent dry and hot extremes in recent decades.