CMIP6 skill at predicting interannual to multi-decadal summer monsoon precipitation variability

Monsoons affect the economy, agriculture, and human health of two thirds of the world’s population. Therefore, predicting variations in monsoon precipitation is societally important. We explore the ability of climate models from the sixth phase of the Coupled Model Intercomparison Project (CMIP6) to predict summer monsoon precipitation variability by using hindcasts from the Decadal Climate Prediction Project (Component A). The multi-model ensemble-mean shows significant skill at predicting summer monsoon precipitation from one year to 6–9 years ahead. However, this skill is dependent on the model, monsoon domain, and lead-time. In general, the skill of the multi-model ensemble-mean prediction is low in year 1 but increases for longer lead times and is largely consistent with externally forced changes. The best captured region is northern Africa for the 2–5 and 6–9 year forecast lead times. In contrast, there is no significant skill using the ensemble-mean over East and South Asia and, furthermore, there is significant spread in skill among models for these domains. By sub-sampling the ensemble we show that the difference in skill between models is tied to the simulation of the externally forced response over East and South Asia, with models with a more skilful forced response capable of better predictions. A further contribution is from skilful prediction of Pacific Ocean temperatures for the South Asian summer monsoon at longer lead-times. Therefore, these results indicate that predictions of the East and South Asian monsoons could be significantly improved.


Introduction
Two thirds of the world's population lives in areas where there is a monsoon in summer (Wang and Ding 2006). Monsoon precipitation variability affects economies, agriculture, and human health, among other sectors. Therefore, predicting the future evolution of monsoon precipitation is important for adaptation strategies (e.g., infrastructure planning).
Individual prediction systems have shown skill at predicting monsoon precipitation on a large range of time scales (Dunstone et al 2020, Monerie et al 2021). Regionally, some skill has been found over East and southern Africa (Beraki et  There are multiple sources of skill for predicting summer monsoon precipitation. The role of sea surface temperatures (SSTs), among other slowly varying lower boundary conditions, in predicting monsoon precipitation variations was theorized by Charney and Shukla (1981). On seasonal time scales, it was shown that the El Niño Southern Oscillation is key for providing skill at predicting precipitation over the tropics (Shukla and Paolino 1983, Wang et al 2018, Sohn et al 2019, Dunstone et al 2020). On decadal time scales, the North Atlantic and Indian Ocean SSTs also yield a certain amount of predictability for monsoon precipitation (Wang et al 2018), due to the high prediction skill of prediction systems for Atlantic and Indian Ocean SST (Guemas et al 2013, García-Serrano et al 2015) and to the effects of these oceanic basins on monsoon precipitation.
Anthropogenic forcing is a source of prediction skill for global mean surface air temperature (Boer et al 2016) and SST (e.g., Guemas et al 2013) and has known effects on summer monsoon precipitation worldwide (Marvel et al 2020, Monerie et al 2022). Previous studies have quantified skill at predicting monsoon precipitation on multi-year time scales with a small number of climate models and ensemble members (e.g., Bellucci et al 2015). However, prediction skill increases with ensemble size (Smith et al 2019) and we therefore use the large ensemble of the Decadal Climate Prediction Project (DCPP; Boer et al 2016), reducing unpredictable noise and providing a better estimate of prediction skill. The large ensemble facilitates understanding of the causes of differences between prediction systems in predicting monsoon precipitation, including structural differences between prediction systems. No robust evaluation across a range of models, monsoon domains and timescales has been provided so far. We thus provide, for the first time, a quantification of the ability of prediction systems from the sixth phase of the Coupled Model Intercomparison Project (CMIP6) to predict interannual to decadal summer monsoon precipitation variability in a global monsoon framework. We expect skill at predicting summer monsoon precipitation to be model dependent (as shown by Delgado-Torres et al (2022) for the surface air temperature), area dependent and lead-time dependent.
We address the following questions:
• Are CMIP6 initialized prediction systems skilful at predicting summer monsoon precipitation on interannual-to-decadal time scales?
• How model dependent is the skill at predicting summer monsoon precipitation?
• Can we identify the sources of skill?
The paper is organized as follows: section 2 describes the simulations and the methodologies used. In section 3 we quantify skill at predicting monsoon precipitation for the multi-model mean and for each individual prediction system. Sources of skill are shown in section 4. We discuss results in section 5, and section 6 summarizes the main findings of the study.

Data
We use hindcasts of nine climate models from DCPP Component A (Boer et al 2016) (DCPPA hereafter). These climate predictions are initialized from observationally constrained datasets every year from 1960 to 2019 and 5-10 ensemble members are used depending on the climate model (table 1). We assume 5-10 ensemble members to be large enough to allow considerably increased prediction skill of monsoon precipitation (Jain et al 2019, Monerie et al 2021). DCPPA simulations are initialized in November each year and last for 5-10 years after initialization and are forced with historical external forcing.
We assess the impact of initialization by comparing DCPPA simulations to the uninitialized CMIP6 historical simulations (Eyring et al 2016; table S1), using the same climate models. These historical simulations cover 1850-2014 (165 years) and use the same external forcings as the DCPPA simulations. Prior to analysis, observations and simulations are interpolated onto a common 1° horizontal grid.
We assess skill at predicting precipitation using the data of the Climate Research Unit (CRU; Harris et al 2014), available from 1901 to present. Skill at predicting surface air temperature is quantified using the NCEP reanalysis (Kanamitsu et al 2002), which is provided on a 2.5° × 2.5° horizontal grid from 1948 to present.

Assessing skill
Prediction skill is estimated using the anomaly correlation coefficient (ACC), computed between observed and simulated time series. We assess skill at three lead times. The 1 year lead time allows determination of skill at predicting interannual variability in summer monsoon precipitation and corresponds to months 14-17 (8-11) for the southern (northern) hemisphere in DJFM (JJAS). Years 2-5 and years 6-9 are 4 year averages over forecast years 2-5 and 6-9, respectively, and document predictability of the summer monsoon precipitation on longer time scales. Prediction skill is assessed over the period 1960-2020.
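As an illustration of the metric, the following minimal Python sketch computes an ACC between two time series and forms the multi-year lead-time averages. The function names, the array layout (one row per start date, one column per forecast year) and the use of NumPy are assumptions made for this example and are not taken from the study.

```python
import numpy as np

def anomaly_correlation(forecast, observed):
    """Correlation between forecast and observed anomalies.

    Anomalies are taken relative to each series' own time mean
    (an assumption of this sketch).
    """
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    f_anom = f - f.mean()
    o_anom = o - o.mean()
    denom = np.sqrt(np.sum(f_anom**2) * np.sum(o_anom**2))
    return float(np.sum(f_anom * o_anom) / denom)

def lead_time_mean(hindcast, first_year, last_year):
    """Average forecast years first_year..last_year (1-based, inclusive).

    hindcast has shape (n_start_dates, n_forecast_years); the result has
    one value per start date, e.g. the years 2-5 mean.
    """
    h = np.asarray(hindcast, dtype=float)
    return h[:, first_year - 1:last_year].mean(axis=1)
```

For instance, `lead_time_mean(hindcast, 2, 5)` would give the years 2-5 series, which can then be passed to `anomaly_correlation` together with the matching 4 year observed averages.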
We estimate the significance of the ACC by randomly resampling time series of the ensemble means. We use a 5 year block bootstrap to conserve low-frequency variability in precipitation and temperature, using 5000 permutations in a Monte Carlo framework. The ACC values are judged significant at the p < 0.05 level if the correlations are stronger than 97.5% of the randomly obtained correlation values, using a two-sided test.
We acknowledge that ACC scores are sensitive to the existence of linear trends (e.g., in precipitation, figure S1). However, we note that removing a linear trend can artificially improve skill (figure S2). Therefore, we document the skill at predicting the total summer monsoon variability (internal variability + variability induced by the externally forced response).
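One possible reading of this test is sketched below in Python. The text does not spell out exactly which series is resampled or how blocks are drawn, so the choices here are assumptions: the forecast series is rebuilt from randomly chosen contiguous 5 year blocks (with replacement), each surrogate is correlated with the observations, and the actual ACC is compared with the 2.5% and 97.5% quantiles of the surrogate correlations. The function names are hypothetical.

```python
import numpy as np

def block_bootstrap_null(forecast, observed, block=5, n_iter=5000, seed=0):
    """Null distribution of correlations from block-resampled forecasts.

    Assumption of this sketch: only the forecast series is resampled,
    using contiguous blocks drawn with replacement.
    """
    rng = np.random.default_rng(seed)
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    n = f.size
    n_blocks = int(np.ceil(n / block))
    offsets = np.arange(block)
    null = np.empty(n_iter)
    for k in range(n_iter):
        # Random block start indices; each block keeps `block` consecutive years
        starts = rng.integers(0, n - block + 1, size=n_blocks)
        idx = (starts[:, None] + offsets[None, :]).ravel()[:n]
        null[k] = np.corrcoef(f[idx], o)[0, 1]
    return null

def is_significant(acc, null, tail=0.975):
    """Two-sided check against the 2.5% and 97.5% quantiles of the null."""
    lo, hi = np.quantile(null, [1.0 - tail, tail])
    return bool(acc > hi or acc < lo)
```

Keeping consecutive years together inside each block is what preserves part of the low-frequency variability in the surrogate series.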

Persistence
The n-year persistence is computed based on the observed values in the n years prior to the start date.
We computed a 1 year and a 4 year persistence.
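A persistence forecast of this kind could be sketched as follows. This is a hedged illustration only: it assumes the n-year persistence is the mean of the n most recent observed seasonal values available before each start date, and the argument `ref_years` (the last observed year usable for each start date) and the function name are hypothetical.

```python
import numpy as np

def persistence_forecast(obs, obs_years, ref_years, n):
    """n-year persistence: mean of the n observed years ending at ref_year.

    obs       : observed seasonal values, one per year
    obs_years : calendar year of each observed value
    ref_years : for each start date, last observed year available before it
                (alignment with the start date is an assumption here)
    Returns NaN where fewer than n observed years are available.
    """
    obs = np.asarray(obs, dtype=float)
    obs_years = np.asarray(obs_years)
    out = np.full(len(ref_years), np.nan)
    for k, y in enumerate(ref_years):
        mask = (obs_years > y - n) & (obs_years <= y)
        if mask.sum() == n:
            out[k] = obs[mask].mean()
    return out
```

With `n=1` this reduces to repeating the last observed value, and with `n=4` it gives the 4 year persistence used as a baseline for the multi-year lead times.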

Defining ensembles
We define ensembles to explore the spread in model skill and to understand sources of prediction skill for summer monsoon precipitation.

Ensemble mean (ENSM and HIST)
We assess the ability of DCPPA simulations to predict monsoon precipitation by defining the ensemble mean across models and ensemble members, hereafter called ENSM, as:

$$\mathrm{ENSM}_j = \frac{1}{m}\sum_{i=1}^{m}\left(P_{i,j} - \bar{P}\right)$$

with $P_{i,j}$ the precipitation of ensemble member $i$ for start date $j$, and $\bar{P}$ the precipitation averaged across all ensemble members and start dates. $m$ is the total number of ensemble members across all models. HIST is defined in the same way as ENSM but using the uninitialized simulations. Uninitialized ensemble members simulate internal climate variability, but ensemble members would not be expected to be in-phase and the ensemble mean is an estimate of the forced response to external drivers (e.g., Deser et al 2012). Therefore, the comparison of ENSM and HIST allows for an exploration of the importance of initialization for the prediction skill.
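The pooled ensemble mean described above can be sketched in a few lines of Python. The sketch assumes that all members from all models are pooled with equal weight (so models with more members contribute more) and that anomalies are taken relative to the mean over all members and start dates; the function name and input layout are hypothetical.

```python
import numpy as np

def ensemble_mean_anomaly(members_by_model):
    """Pooled ensemble-mean anomaly, one value per start date.

    members_by_model: list of arrays, one per model, each with shape
    (n_members, n_start_dates). Members are pooled across models and
    weighted equally (an assumption of this sketch).
    """
    pooled = np.concatenate(
        [np.asarray(a, dtype=float) for a in members_by_model], axis=0
    )
    p_bar = pooled.mean()  # mean over all members and start dates
    return (pooled - p_bar).mean(axis=0)
```

The same function applied to uninitialized members would give the HIST counterpart, so that the two series can be compared on an equal footing.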

Best model (BEST)
The prediction system that performs best is selected, according to the ACC values, with the BEST ensemble consisting of only one individual model, for each monsoon domain and each lead-time.

A subset of models (SUBSET and WORST SUBSET)
The SUBSET approach follows the ENSM approach, computing the ensemble mean with only the three prediction systems that have the highest ACC values over a given monsoon domain and for a given lead time. The composition of the SUBSET ensemble is, thus, monsoon domain and lead time dependent.
The WORST SUBSET is defined in the same way as SUBSET but selecting the three prediction models that have the lowest ACC values. We expect a comparison of SUBSET against WORST SUBSET and ENSM to provide information on sources of prediction skill. Finally, the effect of initialization is here estimated by comparing SUBSET with HIST SUBSET, which is composed of the same models as SUBSET but using historical uninitialized simulations only.
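The selection step can be illustrated with a short Python sketch. It assumes the ACC of each prediction system has already been computed for one monsoon domain and one lead time and is stored in a dictionary; the function name and the model labels in the usage note are placeholders, not results from the study.

```python
def select_subsets(acc_by_model, k=3):
    """Return (best_k, worst_k) model names ranked by ACC.

    acc_by_model: dict mapping a model name to its ACC for a given
    monsoon domain and lead time. The first list is ordered from the
    highest ACC downwards; the second holds the k lowest-ranked models.
    """
    ranked = sorted(acc_by_model, key=acc_by_model.get, reverse=True)
    return ranked[:k], ranked[-k:]
```

For example, `select_subsets({"m1": 0.1, "m2": 0.5, "m3": -0.2, "m4": 0.7, "m5": 0.3, "m6": 0.0})` picks `m4`, `m2` and `m5` as the high-skill group and `m1`, `m6` and `m3` as the low-skill group. Because the ranking is redone for each domain and lead time, the membership of both groups changes accordingly.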

Quantifying DCPPA ensemble-mean prediction skill
We assess prediction skill of ENSM for summer monsoon precipitation at each grid point, and when averaged over each monsoon domain.
We find significant skill in predicting summer monsoon precipitation in ENSM, and the skill appears to increase with lead time. Figure 1(a) shows that skill at predicting precipitation at the 1 year forecast lead time is relatively low over much of the globe, although there are regions with statistically significant prediction skill. For example, over the tropics, prediction skill is significant over northern South America, Argentina, and the western Sahel. Nevertheless, relative to the 1 year predictions, we find an increase in skill for the 2-5 and 6-9 year forecast lead times. This increase in skill stands out over the Sahel, western India and Southeast Asia, and northern South America (figures 1(b) and (c)). Figure 2 shows the skill at predicting summer monsoon precipitation when averaged over all monsoon domains. At the 1 year forecast lead time, ENSM is skilful at predicting NAM and AUS precipitation, as well as the hemisphere-wide quantities (NHM and SHM) (figure 2(a)). However, there is no significant skill over the NAF, SAS, EAS, SAM and SAF monsoon domains. For the 2-5 and 6-9 year forecast lead times, skill remains statistically significant for NHM, SHM, and NAM precipitation (figures 2(b) and (c)) and increases substantially for NAF and SAM precipitation. In contrast, ENSM does not show significant skill for the SAS, EAS and SAF monsoon domains for any lead-time. Results show higher skill for ENSM than for the CMIP5 decadal prediction systems (Bellucci et al 2015).
We assess the sources of model skill at predicting summer monsoon precipitation variability compared with persistence forecasts and with uninitialized simulations. Figure 2 shows that ENSM prediction skill generally exceeds persistence, implying that the skill does not only depend on the inertia of the climate system. We note, however, that persistence is more skilful than ENSM for the 1 year and 2-5 year forecast lead times for NAF summer monsoon precipitation (figures 2(a) and (b)). The effect of initialization (defined as the difference between initialized and uninitialized hindcasts) only emerges for a limited number of monsoon domains, especially at the longer lead times (e.g., 6-9 years), indicating that changes in external forcing are an important source of prediction skill on these time scales.

Understanding the range of prediction skill
So far, we have only explored the multi-model mean skill, but modelling systems will likely exhibit different levels of skill. Figure 2 also shows the range of skill for each model in the DCPPA ensemble and there is a significant diversity of model skill for all lead times (purple vertical lines). There is a consensus for some monsoon domains and lead-times, with all models exhibiting positive skill (e.g., NAM summer monsoon precipitation for the 1 year forecast lead-time and NAF summer monsoon precipitation for the 2-5 and 6-9 year forecast lead-times). However, there is more diversity in prediction skill for the SAS and EAS domains, with individual models performing much better or worse than ENSM (figures 2(b) and (c)), as also shown with seasonal hindcasts (Mishra et al 2018,

Sources of prediction skill
We explore the source of skill by selecting models according to their prediction skill. As expected, the BEST and SUBSET ensembles generally show improved skill relative to ENSM for all forecast lead times (figure 3).
The ACC is roughly tripled in SUBSET (ACC = 0.61) compared to ENSM (ACC = 0.18) for EAS summer monsoon precipitation and the 2-5 year forecast lead time (figure 3(b) and table S2). The ACC is approximately quadrupled in SUBSET (ACC = 0.40) relative to ENSM (ACC = 0.09) for SAS summer monsoon precipitation and the 6-9 year forecast lead time (figure 3(c) and table S3). This is a consequence of the large diversity in prediction skill over South and East Asia, with prediction systems exhibiting either high or low skill. Thus, skilful predictions can be obtained in regions where ENSM is not skilful. We used another observational dataset (GPCC) and show that results are robust across observations (not shown).

Source of prediction skill for EAS summer monsoon precipitation
We focus on prediction of EAS summer monsoon precipitation for the 2-5 year forecast lead time, for which the SUBSET-ENSM difference in skill is the largest. The improved skill in SUBSET, relative to ENSM, is largely due to the multi-decadal variation in EAS summer monsoon precipitation. After applying a 7 year running mean to the 2-5 year forecasts we find the ACC is 0.70 in SUBSET but only 0.14 in ENSM. This is further confirmed using a 21 year running mean to only capture the slow variation of the EAS summer monsoon precipitation (figure S5). In contrast, the difference in skill between SUBSET (ACC = 0.18) and ENSM (ACC = 0.07) is low when considering higher frequency variability (defined as the residual relative to the 7 year running mean). Figure 4(a) shows the smoothed time series in EAS summer monsoon precipitation. Although there is significant skill in SUBSET, both the ENSM and SUBSET ensembles underestimate the observed variability. The difference between the SUBSET and ENSM ensembles is that there is a long-lasting drying trend in ENSM while SUBSET simulates a small decrease in precipitation from 1960 to the 1980s and an increase in precipitation afterwards, hence better following the observations (figure 4(a)). In contrast, the WORST SUBSET shows a strong drying trend. Therefore, the difference in trends appears to be key to understanding the differences in monsoon precipitation skill.
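The separation into slow and higher frequency components can be sketched as below. This is an illustrative Python example only: it assumes a centred running mean with an odd window, drops the end points where the full window is not available, and defines the higher frequency part as the residual from the smoothed series. How the study treats the ends of the series is not stated, and the function names are hypothetical.

```python
import numpy as np

def running_mean(x, window=7):
    """Centred running mean; only points with a full window are kept."""
    x = np.asarray(x, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

def split_frequencies(x, window=7):
    """Return (low, high): running mean and residual on the same time axis.

    Assumes an odd window so that the smoothed series is centred.
    """
    x = np.asarray(x, dtype=float)
    low = running_mean(x, window)
    half = window // 2
    # Trim the raw series to the years covered by the running mean
    high = x[half:x.size - half] - low
    return low, high
```

Computing the ACC separately on the `low` and `high` parts of the forecast and observed series gives skill estimates for the slow and the higher frequency variability, respectively.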
There is a large effect of the externally forced response on the multi-annual variation in EAS summer monsoon precipitation, as evidenced by the high correlation coefficient between the uninitialized and initialized simulations (r* = 0.94 between ENSM and HIST; figure 4(a)).
Hence, we hypothesise that the range in skill of the DCPPA ensemble is due to differences in the response to external forcing. This is assessed by comparing maps of the SUBSET-WORST SUBSET difference in skill (figure 5(a)) to the HIST SUBSET-HIST WORST SUBSET difference in skill (figure 5(b)). As the skill of uninitialized simulations is due to the response to the external forcing, the strong similarity between figures 5(a) and (b) confirms a strong role of the simulation of the externally forced response in the spread in prediction skill over EAS. These results have a strong societal importance because the increase in skill is the highest over eastern China, a heavily populated region where precipitation variability is high (figure 5(a)).

Sources of prediction skill for SAS summer monsoon precipitation
We focus on prediction of SAS summer monsoon precipitation for the 6-9 year forecast lead time, for which the SUBSET-ENSM difference in skill is the greatest. As for EAS summer monsoon precipitation, the externally forced response has strong effects on the long-term variation in simulated SAS summer monsoon, as shown by the high correlation coefficient between uninitialized and initialized simulations (r* = 0.98 between ENSM and HIST; figure 4(b)). The spread in SAS summer monsoon prediction skill is also associated with the ability of prediction systems to simulate the multi-decadal variation in SAS summer monsoon precipitation. This is evidenced by the absence of pre-1990 drying in WORST SUBSET, while SUBSET shows a multidecadal variation in SAS summer monsoon precipitation, in better agreement with the observations (figure 4(b)).
An effect of the response to external forcing on the spread of South Asian summer monsoon prediction skill is confirmed by the similarity between patterns of difference in prediction skill (SUBSET-WORST SUBSET, figure 5(c); HIST SUBSET-HIST WORST SUBSET, figure 5(d)). However, the externally forced response does not fully explain the SUBSET-WORST SUBSET difference in skill. We thus also expect other drivers of South Asian summer monsoon precipitation variability to contribute to the spread in SAS summer monsoon precipitation skill.
The multi-decadal variability in South Asian precipitation has been linked to the Interdecadal Pacific Oscillation (IPO) (Zhang et al 2018, Huang et al 2020). We show a strong relationship between skill at predicting the IPO and that of the SAS summer monsoon precipitation at the 6-9 year forecast lead time (figure 6(b); r = 0.88). The spread in skill at predicting the IPO thus also contributes to the spread in skill at predicting the SAS summer monsoon precipitation. We performed the same analysis with the uninitialized simulations and show that the result of figure 6(b) is due to initialization (r = 0.01 with the uninitialized simulations), and thus to the simulation of internal climate variability and to the correction of an incorrect forced response.

Figure 5. SUBSET minus WORST SUBSET difference in anomaly correlation coefficient skill score for predictions of precipitation over (a) East Asia for the 2-5 year forecast lead time and (c) South Asia for the 6-9 year forecast lead time. Skill at predicting precipitation is computed in comparison to CRU. Green contours indicate the precipitation variance, in mm² d⁻². (b) and (d), as in (a) and (c) but for the HIST SUBSET-HIST WORST SUBSET difference in anomaly correlation coefficient skill score for precipitation. Stippling indicates that the difference in ACC is significantly different from zero according to a Monte-Carlo procedure, resampling both BEST and ENSM and computing difference in ACC values. We use 5000 permutations and a 95% confidence level. The same number of ensemble members are used for the ensembles of initialized and uninitialized simulations (19 ensemble members for HIST SUBSET and 26 ensemble members for WORST SUBSET for EAS summer monsoon precipitation; 30 ensemble members for HIST SUBSET and 15 ensemble members for WORST SUBSET for the SAS summer monsoon precipitation).

Discussion
Although we show improved skill over EAS and SAS summer monsoon precipitation in SUBSET, which we attribute to the impact of external forcing and to the simulation of the IPO, the exact mechanisms that explain the higher skill are unclear. For example, we explored mechanisms focusing on known drivers of the monsoon circulation, such as the large-scale gradients in surface air temperature and of surface air temperature over the oceans. However, differences in skill at predicting surface air temperature between the SUBSET and ENSM ensembles are low (figures S6 and S7). Further work could focus on understanding differences in atmospheric circulation, and regional changes between SUBSET and ENSM. We also acknowledge that different estimations of the internal components of the IPO could lead to different conclusions and future work could be devoted to understanding what leads to better prediction skill of the IPO and its role for predicting summer monsoon precipitation at multi-annual forecast lead times. In addition, the results are expected to be sensitive to the estimate of the IPO (e.g., Parker et al 2007, Henley et al 2015) and to the use of different observations/reanalysis. However, we show that skill at predicting South Asian summer monsoon precipitation is sensitive to the skill at predicting the Pacific Ocean SSTs for the 6-9 year forecast lead-time.

Figure 6. … 5-MPI-ESM1-2-HR; 6-CanESM5; 7-EcEarth3; 8-NorCPM1) and the 2-5 and 6-9 year forecast lead times, respectively. A 7 year running mean was applied to both precipitation and IPO time series before computing anomaly correlation coefficients. SUBSET (WORST SUBSET) models are shown with a green (magenta) circle.
We explored further the role of the IPO indices on summer monsoon prediction skill, correcting IPO indices and effects on summer monsoon precipitation, using observations. We show that an improved prediction skill for the IPO leads to a better prediction skill for the SAS summer monsoon precipitation (figure S8 and text in the supplementary material). Better predicting the IPO can allow improved prediction skill over South Asia. In addition to the IPO, we found a moderate relationship between skill at predicting North Atlantic temperature and SAS summer monsoon precipitation (r = 0.35) for the 6-9 year forecast lead-time. In contrast, prediction skill of the IPO has no effect on prediction skill for EAS summer monsoon precipitation (figure 6(a); r = −0.18) and we found no relationship between skill at predicting North Atlantic, Indian Ocean, and equatorial Pacific Ocean temperature and prediction skill for EAS summer monsoon precipitation (not shown) for the 2-5 year forecast lead-time.
We do not suggest that the full skill of the prediction systems arises only from the externally forced response. Instead, we suggest that differences in skill in initialized predictions are partly due to differences in the simulation of the externally forced response. These differences in skill could be due to model biases. However, we found no relationships between biases in seasonal mean precipitation (or variability) and prediction skill, when using monsoon domain averages (not shown). Yet, further work might identify the importance of model biases for prediction skill. A focus could be given to the biases in simulation of the mean state tropical SSTs (e.g., Turner et al 2005). We also highlight that an increased number of models could increase the robustness of the results.

Conclusions
We quantify the ability of CMIP6 initialized decadal prediction systems (Boer et al 2016) to predict summer monsoon precipitation in a global monsoon framework and focus on three forecast lead times (1 year, 2-5 years, and 6-9 years). Overall, skill is low for the 1 year forecast lead time but increases for the 2-5 and 6-9 year horizons. Furthermore, the skill is model dependent, monsoon-domain dependent and lead-time dependent.
We explore sources of skill for predicting summer monsoon precipitation. In particular, the impact of initialization is rather small when focusing on the 2-5 and 6-9 year forecast lead times. Therefore, the results highlight the importance of the externally forced response for providing skill at predicting summer monsoon precipitation. By selecting models based on their prediction skill, we suggest that differences between models in simulating the externally forced response explain a large proportion of the diversity in skill of the CMIP6 model ensemble over South and East Asia.
Nevertheless, differences in skill at predicting the IPO also contribute to differences in skill between models for predictions of the South Asian summer monsoon precipitation at the 6-9 year forecast lead time. We show that initialization and improved prediction of the Pacific SSTs are important for prediction of South Asian summer monsoon precipitation, but it is unclear if this is due to improved prediction of internal variability or to a correction of an incorrect forced response or mean state. Besides, we acknowledge that skill at predicting the IPO can be a manifestation of an effect of the externally forced response on temperature over the Pacific Ocean. Therefore, improving our understanding of the differences between how models simulate the effects of external forcing and of the IPO on South and East Asian summer monsoon precipitation could be an important avenue for improving prediction skill on a multi-annual time scale. The mechanism (e.g., anomalies in atmospheric circulation, in temperature gradients) that explains model skill diversity remains unclear. Further work is needed, focusing on, for instance, atmospheric dynamics or model biases.
We do not argue here that selecting models based on their prediction skill should be used for predicting the future evolution of the East and South Asian summer monsoon precipitation up to ten years ahead. One reason is that prediction skill depends on the period used as reference (figure S9) and the ensembles might thus not provide the best prediction for the coming decade.