Multi-model ensembles for regional and national wheat yield forecasts in Argentina

While multi-model ensembles (MMEs) of seasonal climate models (SCMs) have been used for crop yield forecasting, there has not been a systematic attempt to select the most skillful SCMs to optimize the performance of a MME and improve in-season yield forecasts. Here, we propose a statistical model to forecast regional and national wheat yield variability from 1993–2016 over the main wheat production area in Argentina. Monthly mean temperature and precipitation from the four months (August–November) before harvest were used as features. The model was validated for end-of-season estimation in December using reanalysis data (ERA) from the European Centre for Medium-Range Weather Forecasts (ECMWF) as well as for in-season forecasts from June to November using a MME of three SCMs from 10 SCMs analyzed. A benchmark model for end-of-season yield estimation using ERA data achieved a R 2 of 0.33, a root-mean-square error (RMSE) of 9.8% and a receiver operating characteristic (ROC) score of 0.8 on national level. On regional level, the model demonstrated the best estimation accuracy in the northern sub-humid Pampas with a R 2 of 0.5, a RMSE of 12.6% and a ROC score of 0.9. Across all months of initialization, SCMs from the National Centers for Environmental Prediction, the National Center for Atmospheric Research and the Geophysical Fluid Dynamics Laboratory had the highest mean absolute error of forecasted features compared to ERA data. The most skillful in-season wheat yield forecasts were possible with a 3-member-MME, combining data from the SCMs of the ECMWF, the National Aeronautics and Space Administration and the French national meteorological service. This MME forecasted wheat yield on national level at the beginning of November, one month before harvest, with a R 2 of 0.32, a RMSE of 9.9% and a ROC score of 0.7. This approach can be applied to other crops and regions.


Introduction
Crop yield is expected to become more variable [1,2] under advancing climate change [3,4] with possible consequences for food security [5,6].To counteract these disturbances, crop yield forecast models are used [7].These models assist to prepare humanitarian aid efforts [8] or to negotiate trade agreements [9].However, yield forecasts that are provided before harvest face unknown weather conditions between forecast issue date and harvest.When meteorological and vegetation indices are available for machine learning models, this gap of unknown weather is usually ignored [10].Alternatively, statistical [11,12] or dynamical seasonal climate models (SCMs) [13][14][15] are used.SCMs with prediction ranges of several months have become increasingly skillful [16] and can help anticipating unexpected weather.In addition, Multi-Model-Ensembles (MMEs) of individual SCMs have improved weather forecasts [17,18] and crop yield forecast [15,[19][20][21].The MMEs used are usually provided 'out-of-the-box' from climate centers such as the APCC-MME [20,21] or the MMEs are built without analyzing the accuracy of the participating SCMs [15].In the field of crop yield forecasting, no attempt has been made yet to systematically analyze SCMs to build a more accurate MME.SCM forecast skill varies by region, forecasted quantity, lead time, spatial and temporal aggregation and forecast issue month [22,23].Here, we propose a linear regression model to forecast regional and national crop yield variability with a MME that is built specifically for a model use case by analyzing ten SCMs [18].Other assessment studies on the seasonal forecast performance over South America indicated moderate to poor skill for the Argentinian Pampas [23][24][25], one of the most important agricultural region in the world.Given that Argentina is a major wheat producer and exporter in South America, contributing around 60%-70% of total South American wheat grain output [26,27], we use Argentina as a test country and wheat as the model crop to analyze if seasonal forecast performance can be improved and used within a crop yield model when a MME is built systematically.Like previous studies [6,10,[28][29][30], we assess the approach on regional level across a wide range of climate conditions and on national level.Like several other countries in the Southern Cone, Argentina's cereal crops have been negatively affected by climate change.The rise in average daily temperature, mostly due to higher nighttime temperatures, accelerates crop phenology and diminishes yields [31].In that context, wheat yield forecasts are crucial for Argentina.In our proposed approach, we have validated both the end-of-season estimation case using the European Centre for Medium-Range Weather Forecasts Reanalysis data (ERA) and the inseason forecast case using a MME.

Material and methods
The data flow of this study is shown in figure 1, including indications on the corresponding sections with further information.Section 2.1 explains the preparation of historical wheat data from Argentina.Sections 2.2 and 2.3 summarize the preprocessing steps of ERA and SCM data.Finally, section 2.4 explains the underlying model training and the validation procedure for both end-of-season (ERA) and in-season (MME) applications.

Region of study and historical wheat data
In Argentina, wheat cultivation is mainly rainfed [32] and the season spans from June to December [33].Around 95% of wheat is produced in the five provinces Santa Fe, Córdoba, Entre Rios, Buenos Aires and La Pampa [34].From these provinces, historical wheat data was collected from 182 municipalities through the Ministry of Agriculture of Argentina [34].During data preparation, municipalities with incomplete time series from 1993-2016 were dropped (142 municipalities remaining) and assigned to regions.The regions used in this study were obtained from the 'Map of Argentinian wheat sub-regions and other winter cereals' [35].In this map, municipalities form a region based on similar soil properties as well as temperature and precipitation during the winter cropping season from April to December.The twelve regions analyzed in this study are depicted in figure 2(a).Municipality yield was aggregated to regional (supplementary figure S1) and national level (black line, figure 2(b)) using a weighted average with weights being the harvested area.
Yield anomalies for each region and on national level is computed as follows using absolute yield: Yield anomaly for year t (anomaly t ) was calculated by subtracting the 5 year-moving average of absolute yields from years t − 2 to t + 2 from the current year's absolute yield (absolute t ) and dividing the difference by the 5 year moving average and multiply the result by 100 to obtain a percentage.The moving average approach to estimate expected yield has been proposed by Iizumi et al on crop yield variability forecasts [21].National wheat yield anomalies are depicted in figure 2(c).Regional absolute yield as well as the trend and anomalies can be found in the supplementary materials (supplementary figures S1 and S2).

Temperature and precipitation data
Wheat yield (grains per unit area) is determined by two numerical components: Spikes per unit area and grains per spike.There is an important plasticity of these numerical yield components depending on the environmental conditions to which wheat is exposed.For example, a greater initiation of tillers prior to stem elongation (August-September) that results in greater establishment of spikes will in part be offset by a reduction in the number of fertile florets (grains) per spike at anthesis (October-November) [36].To account for the climate impact on yield formation, we used monthly mean temperature and precipitation from the last four months before harvest (August-November) [31].Higher resolution climate data were not available for SCMs coming from the North-American-Multi-Model Ensemble (NMME) (see section 2.3).In addition, Asseng et al [37] and Wall et al [38] have shown a robust linear relationship between wheat yield and seasonal mean temperature.Other crop yield models for Argentina used additional variables like soil moisture [30], large-scale climate indices like the El Niño-Southern Oscillation index [21,39] or remotely sensed quantities like land   [35].Wheat yield on municipality level was aggregated to regional and national level by calculating the weighted mean yield with weights being the harvested area.(b) National absolute wheat yield (solid line) and yield trend (dashed line) estimated using a 5 year rolling average.(c) Trend-corrected national yield variability.surface temperature and vegetation indices [28,40].Our objective here is an isolated analysis of the inseason forecast accuracy coming from a MME (see section 2.3).Therefore, we decided to solely use meteorological indices [13,20].The in-season forecasts are benchmarked with the end-of season estimation, when all indices are available as observations and crop yield can be estimated under full knowledge.The data for the end-of-season estimation was gathered from the ERA5-Land reanalysis dataset (ERA) for August to November from 1993-2016 [41], which was available at 0.1 • spatial resolution and was subsequently aggregated to our study regions.The monthly mean climate conditions from 1993-2016 per variable and region are indicated in figure 3.In addition to providing the end-of-season benchmark, ERA data was used for model training (finding the coefficients) for the in-season forecast validation (see section 2.4) using MME data.

Seasonal climate forecasts
To provide wheat yield forecasts before harvest, forecasted monthly temperature and precipitation data were obtained as hindcasts (retrospective forecasts) from SCMs.Three platforms that offer access to hindcasts of multiple SCMs are: The Copernicus Climate Data Store from the Copernicus Climate Change Service (C3S) [42], the Data Library of the NMME [43] and the Climate Information Toolkit of the Asia-Pacific Climate Center (APCC) [44].Hindcasts from APCC were generally not considered, because of their sparse spatial resolution of 2.5 • .The NMME consists of six SCMs and the C3S offers access to eight SCMs, however some SCMs were available on both platforms and others had missing data for some years.Hence, we chose 10 SCMs (table 1).For each SCM, hindcasts from 1993-2016 across the study region at a resolution of 1.0 • were collected.The study period was chosen, as it was the largest common period where hindcasts were available for all 10 SCMs.To validate the in-season performance of the wheat yield model, we used hindcasts that were initialized at the beginning (within the first week) of each month from June to November throughout the cropping season with forecasts for the months August-November.The highest temporal resolution across all SCMs were monthly mean data and therefore used across all applications.At each month of initialization (from June to November), hindcasts of monthly mean temperature and precipitation for August, September, October and November were obtained, when these months were in the future.Each SCM provides an ensemble of hindcasts that are run with different initial conditions [45].Here, we took the ensemble mean to only have one output per SCM.
The gridded hindcasts were then aggregated to the twelve study regions.Next, regional hindcasts were bias-adjusted with ERA data.For temperature data, scaled distribution mapping [53] was used for biasadjustment and for precipitation data, linear scaling was applied [54].The structural errors of individual SCMs in forecasting weather conditions can be Table 1.Seasonal climate models used in this study.From each model, monthly hindcasts were generated from June to November to forecast monthly mean temperature and rainfall for August, September, October and November during 1993-2016.In addition to these seasonal climate models, reanalysis data from the European for Medium-Range Weather Forecasts (ERA) provided by the Copernicus Climate Data Store was used for model training, end-of-season estimation and to complete the feature vector for in-season forecasts e.g. to provide August and September data for forecasts made in beginning of October.The seasonal climate models used in this study were all obtained in grid cell resolutions of 1  [52] reduced by combining their outputs to MMEs [17].Smaller MMEs with data from the best (maximum five) available SCMs outperform larger MMEs [18].
To identify the best SCMs, we calculated the mean absolute error (MAE) between SCM and ERA predictors across all variables and regions, separately for each month of initialization.Then, the ten SCMs are filtered and only those are kept that are at least in one month within the top five SCMs [18] with the lowest MAE.Those SCMs that are not ranked within the top five SCMs in any month are discarded.With the remaining SCMs all possible MMEs of sizes 2-5 were built, e.g. for MME size 2, all possible SCMpairs out of the best SCMs were built, for MME size 3, all possible MMEs with three SCMs were built and so on.To avoid spatial and temporal overfitting, we decided to validate the same MME for all regions and months of initialization.The final MME was selected based on the coefficient of determination (R 2 ) of national yield forecasts.If R 2 were the same, the selection was made using root-mean-square error (RMSE) (see section 2.4).Since hindcasts are retrospective forecasts and our ultimate technical objective are inseason yield forecasts, we will align the terminology and refer to hindcasts as forecasts from now on to make the text clearer and easier to read.

Statistical model and analysis
Regional wheat yield variability ŷregional was modeled as a linear combination of the eight available predictors x 1 , . . ., x 8 (monthly mean temperature and precipitation from August to November) multiplied with their corresponding learned coefficients β 1 , . . ., β 8 [2] plus an intercept β 0 .We determined the coefficients using a ridge estimator [3] to avoid overfitting [55].
In ridge regression, coefficients are found by minimizing the sum of squared errors between estimated and actual yield anomaly ( and the sum of squared coefficients (α ∑ 8 j =1 β 2 j ), with α being a parameter that controls the strength of the coefficient regularization.To find the optimal α, we used Leave-One-Year-Out Cross-Validation (LOOCV).For each α, regional models were trained and validated (using ERA data) and the average performance metrics across all regions were calculated.The statistics used in this study are the RMSE, R 2 and the MAE [56].The model with the lowest RMSE across all regions was selected as the final model.In addition to the regression analysis, we validated how well our model discriminates between positive and negative yield anomalies using the Receiver Operating Characteristic (ROC) [57].ROC scores above 0.6 indicate skill that is better than random guessing with 1 being a perfect model [58].For the in-season forecasts, the final model was validated with MME data.To complete the feature set for yield forecasts made at, e.g. the beginning of October, ERA data was used for August and September and MME data was used for October and November.Yield estimations and forecasts on regional level were aggregated to national level by using estimated harvested area as weights.Actual harvested area is unknown before harvest, so harvested area for each year was estimated as the mean harvested area from all other years.

Results
The best performing end-of-season model, trained and validated with ERA features, was found with a ridge regression regularization term α of 8.The average regional and national estimation performances of the final model are given in figure 4. The endof-season model achieves a R 2 of 0.24, a RMSE of 17.7% and a ROC score of 0.65 on regional level (figure 4 5(k)) with R 2 of 0.56 and 0.59, RMSE of 20.1% and 11.9% and ROC scores of 0.8 and 0.9, respectively.In terms of spatial distribution, wheat yield estimates are generally more skillful in the western regions (supplementary figure S3).
To forecast wheat yield variability in-season before the predictors are available as observations, forecasted features derived from SCM data were used.For each month of initialization from June [6] to November [11], the MAE across all years and regions between forecasted and observed ERA features is shown in figure 6.A spatial analysis of the regional forecast skill by SCM and month of initialization can be found in supplementary figure S4.Up until September, the performance among the SCMs is mixed with no clear trend of improvement (figure 6).In October and November, most SCMs become more accurate.In November, all SCMs perform best across all months of initialization.The five best SCMs with the lowest MAE per month of initialization are indicated with a superscript asterisk.The best SCMs in June are ECCC, NASA, CMCC, UKMO and DWD.From July to October, the five best SCMs are always CMCC, DWD, ECCC, ECMWF and ECMWF.In November, the best SCMs are CMCC, ECCC, ECMWF, METFR and UKMO.The count of how often each SCM is within the top five is indicated at the right next to the heatmap.The three SCMs from NCAR, GFDL and NCEP are never within the best SCMs and are not considered for the MME analysis.The remaining seven SCMs were used to build all possible MMEs of sizes 2-5.None of these 120 MMEs makes skillful yield forecasts on national level before November (supplementary table S1).The best forecasts (measured as R 2 and RMSE) were achieved with a 3-member-MME using the unweighted average of the forecasted features from ECMWF, NASA and METFR (supplementary table S1 and S2).This MME also outperforms the yield forecasts made with individual SCM data (supplementary table S1 and S2).Average regional in-season wheat yield forecast quality of this 3-member-MME is shown in figure 7.Although forecasts improve from June (figure 7(a)) to November (figure 7(f)) with lower RMSE, higher R 2 and higher ROC scores, skillful yield forecasts are never possible, indicated through a R 2 ⩽ 0 and a ROC score ⩽ 0.6 for all months (figures 7(a)-(f)).Forecasts generally fail to anticipate the extend of extreme years, depicted through the gray dashed slopes that are more flat than the 1:1 identity lines.A regional analysis of the forecasts quality reveals that skillful forecasts are possible in November for the two regions Llanos nordpatagónicos and Pampa semiárida sur and in October for Pampa semiárida central (supplementary figure S5, supplementary table S3).
In-season wheat yield forecast quality on national level is shown in figure 8, where forecasts are compared with observed yield anomalies throughout the cropping season.From June (figure 8(a)) until October (figure 8(e)), national wheat yield forecasts are not skillful, depicted through a R 2 ⩽ 0 and a ROC score ⩽ 0.6.Both R 2 and RMSE improve throughout the season.In November, the 3-member MME (with ECMWF, NASA and METFR data) makes skillful national wheat yield forecasts with a R 2 of 0.32, RMSE of 9.9% and a ROC score of 0.7 (figure 8(f)).In November, forecast skill has approached the endof-season estimation skill (figure 4(a)).The approach could not forecast the extent of extreme low yields (e.g.2002, 2008) but anticipated correctly a negative yield event in these years.

Discussion
We proposed an approach of forecasting regional and national wheat yield variability in Argentina before harvest using monthly meteorological indices from the last four months before harvest (August-November) with a 3-member-MME consisting of ECMWF, NASA and METFR data.This is the first crop model study that compared as many as 10 SCMs to build a more accurate MME for in-season yield forecasts.On national level, the suggested approach performed poorly from the beginning of the season in June until October, with R 2 ⩽ 0 or a ROC score ⩽ 0.6.However, at the beginning of November, one month before harvest begins, national yield forecasts were possible with R 2 of 0.32, RMSE of 9.9% and a ROC score of 0.7.On regional level, the approach performed poorly with skillful forecasts in only two regions in November and one region for October and November.
In Argentina, grains per unit area are mostly defined during the critical period from October to November, through the establishment of the number of spikes per unit area and the definition of the number of fertile florets per spike at anthesis.As the meteorological indices for November are more accurately forecasted from the MME towards the end of the season, the crop yield model makes better forecasts with an increased R 2 and a lower RMSE.The end-of-season model that uses ERA features instead of forecasted features from SCMs achieved a R 2 of 0.33, a RMSE of 9.8% and a ROC score of 0.83 on national level and a R 2 of 0.24, a RMSE of 17.7% and ROC score of 0.65 on regional level.Our approach suggests that large MMEs are not needed to provide more accurate national in-season wheat yield forecasts and is in alignment with the results from climate studies [18].One possible explanation on why the best MME found in our study was a three-member-MME with data from ECMWF, NASA and METFR is that their forecast errors have lower model-to-model correlations than other SCMs (supplementary figure S6).In addition, only using weather conditions from the last four months before harvest as predictors were sufficient to discriminate between positive and negative yield anomalies one month before harvest.As outlined above, this could be because the most important numerical yield component, the number of grains per unit area, is defined during the final months before harvest.It is important to clarify that another numerical yield component, grain weight, is also defined during the last part of the critical period, however, the ability to explain the variations in yield of this component is much more limited than that of the number of grains per unit area [36,59].Wheat yield estimation and forecast performance increased when aggregated from regional to national level.This has been observed in similar studies [10] with a possible explanation being that yield on regional level is overestimated in some regions and underestimating in others and through the aggregation the effects were reduced and averaged out.For national November forecasts, we analyzed model uncertainty, which varies among seasons, but is in general relatively small based on the three MME members used in the study (figure S7(a)).However, little uncertainty, i.e. small differences between the forecasts of the two models, does not necessarily indicate high precision, i.e. a small error in yield forecasting (figure S7(b)).Based on hindcasts, our results of 9.8% RMSE for national end-ofseason-estimates, 9.9% RMSE for national in-season November forecasts and 19.8% RMSE for regional inseason November forecasts are comparable to another study by Franch et al for Argentinian wheat yield with 11%, 12.4% and 22.5% RMSE, respectively [28].The performance difference may be explained by the fact that Franch et al [28] employed a statistical method to forecast future vegetation indices for inseason yield forecasts, whereas our approach is based on forecasted weather predictors from dynamical SCMs.Furthermore, their subnational approach uses administrative units, while we use the regions from the 'Map of Argentinian wheat sub-regions and other winter cereals' [35] that are based on climate and soil conditions.
While the results are promising and encourage an application to other crops and regions, there are some limitations and obstacles that require further attention.To have a consistent dataset, we decided to use reanalysis data (ERA) which is available in the same gridded format like the SCM data [13,21].It has been shown that reanalysis data can differ from in situ observations, especially in mountainous regions [60,61].Although our study region is defined by flat terrain across the Argentinian Pampas, differences between ERA data and actual ground conditions may have affected the model performance.The MAE was calculated over the entire 24 years of the study period.In the validation setting, however, when forecasting for a specific year, the MAE would only be known for all other 23 years except the current year.For example, in September, the error between forecasted and ERA precipitation for November is not known because the ERA data is not available yet and hence, SCMs cannot be selected on the basis of this year.We chose this approach to show the potential in case the true accuracy of all SCMs would be known for the target use case.In an operational setting, we recommend to calculate the MAE and to identify the best SCMs using all available past years, where a true performance ranking is likely to converge to the same SCMs.Furthermore, the SCMs from the NMME were only available at monthly resolution, which restricted the introduction of additional day-based indices, such as killing degree days [62].Future studies may attempt to include even more SCMs that are currently certified by the World Meteorological Organization [63] than the ones studied here.Future work may use our findings to propose a probabilistic yield forecast model [64] instead of the deterministic regression approach chosen here.This would facilitate a quantification of reliability, which is crucial for transparent communication and for the interpretation of the outputs [65].One possible approach to achieve this would be the calculation of yield forecasts with each ensemble member from the individual SCMs to obtain a yield forecast range [64].We used the ensembles mean of the SCMs to obtain one output per SCMs which may have caused a loss of signal [18].In addition, since each ensemble member is a physically plausible forecast, averaging these forecasts leads to an output that is not based on the underlying physical processes anymore [18].Better yield forecasts may also be possible if non-linear machine learning models are used, such as Random Forest [66].Lastly, vegetation indices can be used as additional features to weather indices and might lead to improved forecasts [67].

Conclusion
In-season wheat yield forecasts can help various stakeholders to optimize their strategies and stabilize food supply [7,8].To forecast wheat yield variability in Argentina before harvest, we have shown how a 3-member-MME can be applied.As medium and long range weather forecasts are expected to further improve in the near future [68,69], in-season crop yield forecast models that employ weather forecasts can play a crucial role for food security.

Figure 1 .
Figure 1.Data flow used in this study.The chart is divided into three parts (dashed rectangles).The section with further information on the steps carried out are indicated on top of the dashed rectangle.Actions are denoted in rectangles with solid rounded corners, data is indicated in rectangles with solid sharp corners.

Figure 2 .
Figure 2. Study area and wheat yield data.(a) Major wheat-cultivating municipalities and their corresponding regions from the 'Map of Argentinian wheat sub-regions and other winter cereals'[35].Wheat yield on municipality level was aggregated to regional and national level by calculating the weighted mean yield with weights being the harvested area.(b) National absolute wheat yield (solid line) and yield trend (dashed line) estimated using a 5 year rolling average.(c) Trend-corrected national yield variability.

Figure 3 .
Figure 3. Climate conditions over study area.Observed mean daily temperature (a)-(d) and mean daily precipitation (e)-(h) for August, September, October and November between 1993 and 2016 is indicated.The borders of the 12 study regions are marked as black.

Figure 4 .
Figure 4. End-of-season wheat yield estimation.Estimated yield variability against the corresponding observed yield values on regional (a) and national (b) level, including the average performance metrics.The gray dashed line corresponds to the fit of the linear function and the black line represents the 1:1 relationship.(c) Annual estimated wheat yield variability (dashed line) and performance metrics (legend) using Leave-One-Year-Out Cross Validation over 24 years from 1993-2016.

Figure 5 .
Figure 5. Regional end-of-season wheat yield estimations.Annual estimated wheat yield variability (dashed line) and performance metrics (legend) using Leave-One-Year-Out Cross Validation over 24 years from 1993-2016 are indicated for each of the 12 regions (a)-(l), sorted alphabetically by name.

Figure 6 .
Figure 6.SCM feature forecast skill.Mean absolute error (MAE) between forecasted and reanalysis features is shown per seasonal climate model (SCM) and month of initialization.Column-wise for each month of initialization, the MAE is computed for all features that are forecasted at this month, ignoring features from past months.The MAE values are presented in units of standard deviation (σ), given that all features underwent standardization during preprocessing.SCMs were sorted in ascending order by the November MAE.Each month, an asterisk indicates the five best SCMs with the lowest MAE.The number of months in which a SCM was within the top five with the lowest MAE are shown on the right hand side.

Figure 7 .
Figure 7. Regional in-season forecast analysis.Forecasted yield anomaly against the corresponding observed yield anomaly values on regional level is shown for each month of initialization from June (a) to November (f).The gray dashed line corresponds to the fit of the linear function and the black line represents the 1:1 relationship.The average forecast statistics across all regions are indicated in the legend.

Figure 8 .
Figure 8. Seasonal national wheat yield forecasts for Argentina.Forecast of national wheat yield anomalies from June (a) to November (f).The observed (solid black lines) and the forecasted (dashed gray lines) wheat yield anomalies with multi-model ensemble (MME) data for each year is shown.In each plot, the R 2 , the root-mean-square error (RMSE) and the receiver operating characteristic (ROC) score of the MME approach is shown.