Assessing the land-use harmonization (LUH) 2 dataset in Central Asia for regional climate model projection

Although the land-use harmonization (LUH) datasets have been widely applied in regional climate model (RCM) projections to investigate the role of land-use forcing in future climate change, few studies have thoroughly assessed them at the local scale, which may introduce large uncertainties into the resulting climate information used for designing climate change adaptation and mitigation measures. The authors use a local land-use dataset (referred to as Li-LU) as the benchmark to assess the latest version of the LUH datasets, LUH2, in Central Asia (CA), which has undergone extensive land-use changes (LUCs) and might undergo extensive LUCs in the future. The results show that LUH2 has large biases in depicting the historical land-use states in CA for 1995–2015. For instance, the area of grassland (cropland) in LUH2 is about 1.4–1.5 (0.4–0.5) times that in Li-LU. Moreover, the future LUCs predicted by LUH2 for 2045 (relative to 2005) are much smaller than those of Li-LU, and the two datasets generally show changes of opposite sign. In addition, the LUCs predicted by LUH2 do not follow the causal mechanisms [the causal connections between the key drivers (e.g. population, economy, and environment) and land use] behind the LUCs in the past. If the future scenario of LUH2 is used for RCM projection in CA together with the historical land-use information from Li-LU, the simulation results could be misleading for understanding the impacts of LUCs on future climate change there. This study suggests that the LUH datasets should be carefully assessed before they are used for regional studies and provides practical notes for selecting an appropriate land-use dataset for RCM projections in other areas around the world.


Introduction
Regional climate model (RCM) projection is the process of using the outputs of low-spatial-resolution global climate models (GCMs) to drive RCMs with higher spatial resolution over a limited area, so as to provide climate information at scales appropriate for designing climate change adaptation and mitigation measures for the coming decades and up to the end of the century. GCMs can simulate the response of the general circulation of the atmosphere and oceans to elevated greenhouse gases (GHGs) under plausible future scenarios. Driven by the surface and lateral boundary conditions derived from GCMs, RCMs can thus simulate the response of local-scale processes to the GHG forcing. Accordingly, one of the central goals of RCM projection is to assess the effects of the GHG forcing on the local climate.
Studies based on modeled and observed data have documented that historical land-use changes (LUCs) have significant impacts on mean climate (e.g. surface air temperature and precipitation) (Oleson et al 2004, Mahmood et al 2010, Pielke Sr et al 2011, Cao et al 2015) and climate extremes (Findell et al 2017, Chen and Dirmeyer 2019) over regions that have undergone extensive landscape changes, by altering the biophysical (e.g. albedo, evapotranspiration and roughness) and biogeochemical (e.g. the net flux of GHGs) characteristics of the land surface (Deng et al 2013, Ward et al 2014, Perugini et al 2017). Inspired by these studies, numerous RCM projections have taken LUC scenarios into account to understand the role of the land-use forcing in future climate change, complementary to the GHG forcing, such as those over North America (Bukovsky et al 2021), South America (Lejeune et al 2015), East Asia (Zhang et al 2013, Hua et al 2015, Niu et al 2019, Zheng et al 2022), and South Asia (Shastri et al 2019). Generally, three experiments are designed in these studies: a historical simulation and a future simulation, both using a historical land-use map derived from satellite images, and a future simulation using a land-use scenario predicted with different approaches or models. Moreover, many studies have used the land-use scenarios developed as input for the World Climate Research Program Coupled Model Intercomparison Project (CMIP), namely the land-use harmonization (LUH) 1 dataset (Hurtt et al 2011) for CMIP5 and LUH2 (Hurtt et al 2020) for CMIP6.
Although the LUH datasets have been widely applied in RCM projections, to our knowledge, previous studies have generally focused on one or two land-use types, such as cropland, urban land and forest. Few have thoroughly assessed the LUH datasets at the local scale, both in depicting all the land-use types (also including grassland, barren land, water bodies, etc) during the historical period and in predicting their alterations in the future. Directly using the LUH datasets for RCM projection may yield misleading simulation results for understanding the impacts of LUCs on regional climate change in the future, as found in this study. Some inadvertent errors have been found in the LUH2 dataset, such as the overestimation of grazing land area and anomalous increases or decreases in cropland area in Brazil between 1990 and 2020 (Chini et al 2021).
The authors use a local land-use dataset produced with a geostatistical model to thoroughly assess the latest version of the LUH datasets, LUH2, for RCM projection in Central Asia (CA), which has undergone extensive LUCs and might undergo extensive LUCs in the future. The assessment will help in selecting an appropriate land-use dataset for RCM projections in other areas around the world.
The objective of this study is to evaluate LUH2 in describing the historical land-use states and predicting the future scenarios in CA. The remainder of this paper is organized as follows: sections 2 and 3 describe the study area and data, respectively; section 4 presents the evaluation of LUH2 in two parts, first in depicting the historical land-use states for 1995-2015 and then in predicting the future scenarios for 2015-2045; section 5 provides some discussion; and section 6 summarizes the main results.

Study area
Central Asia (referred to as CA, figure 1) is located in the center of Eurasia and consists of the former Soviet republics of Kazakhstan, Uzbekistan, Tajikistan, Kyrgyzstan, and Turkmenistan. This region has undergone significant landscape changes in the past decades. The cropland area increased at an annual rate of 3.00% during 2000-09 while the area of natural vegetation decreased (Chen et al 2013). Population growth and economic development have led to the expansion of urban land (Hu and Hu 2019). Due to increased water withdrawals for irrigation (Micklin 2014), the surface of the Aral Sea shrank by 63.80% during 1990-2009 while other water bodies (e.g. Balkhash Lake) did not change significantly in surface area (Chen et al 2013). Li et al (2019) predicted LUCs over CA for 2035 and found remarkable expansion of cropland (22.10%) and urban land (322.40%) and shrinking of water bodies (−38.43%) and barren land (−9.42%) relative to 1995. Given the dramatic changes in various land-use types (e.g. cropland, urban land, and water bodies) there in the past and the potential changes in the future, CA is an important area in which to assess the feasibility of using the LUH datasets for RCM projection.

The local land-use dataset
The local (not global) land-use dataset (referred to as Li-LU) was produced by Li et al (2019), originally for assessing the effects of LUCs on ecosystem service values in CA in the future. It has a 300 m spatial resolution and was developed with the Cellular Automata-Markov Chain (CA-Markov) model, which has been widely used to predict the spatio-temporal changes of land use (Gidey et al 2017, Fu et al 2018, Mathanraj et al 2021). The historical time series (1995, 2005, and 2015) from the European Space Agency Climate Change Initiative Land Cover project (CCI-LC) (Bontemps et al 2013) are used as the model input to predict the future time series (2025, 2035, and 2045). The CCI-LC dataset is generated using global time series acquired by the Envisat MERIS Full and Reduced Resolution and SPOT-Vegetation sensors (Bontemps et al 2013) and has been validated to have high accuracy in depicting the global land-use states, with the overall accuracy reaching 96%.
For the land-use prediction over CA, the original land-use types from the CCI-LC dataset were reclassified into the main land-use types there, including cropland, forest, grassland, wetland, urban land, barren land, and water bodies. To evaluate the CA-Markov model in predicting the LUCs in CA, the Note that the spatial resolution and proportional errors of the input maps and the model parameters (e.g. iteration number) bring uncertainties to the LUC modeling, which may significantly affect the accuracy of the model results (Palmate et al 2022).
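The Markov-chain component of a CA-Markov model can be illustrated with a minimal sketch. This is not the configuration used by Li et al (2019): the function names and the toy 3 × 3 maps are hypothetical, and the cellular-automata neighbourhood rules that allocate changes in space are omitted. The sketch only estimates class-to-class transition probabilities from two categorical maps and applies them to project class shares one step ahead.

```python
import numpy as np

def transition_matrix(map_t0, map_t1, n_classes):
    """Row-normalised transition probabilities between two categorical maps."""
    counts = np.zeros((n_classes, n_classes))
    # Count the cells that moved from class i (earlier map) to class j (later map).
    np.add.at(counts, (map_t0.ravel(), map_t1.ravel()), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    # Assumption: a class absent from the earlier map simply persists.
    for i in range(n_classes):
        if row_sums[i, 0] == 0:
            probs[i, i] = 1.0
    return probs

def project_shares(shares, probs, steps=1):
    """Apply the transition matrix repeatedly to a vector of class shares."""
    for _ in range(steps):
        shares = shares @ probs
    return shares

# Hypothetical toy maps: 0 = grassland, 1 = cropland, 2 = barren land
map_1995 = np.array([[0, 0, 2], [0, 1, 2], [2, 2, 2]])
map_2005 = np.array([[0, 1, 2], [1, 1, 0], [2, 2, 0]])

P = transition_matrix(map_1995, map_2005, n_classes=3)
shares_2005 = np.bincount(map_2005.ravel(), minlength=3) / map_2005.size
shares_2015 = project_shares(shares_2005, P)  # one 10-year step ahead
```

Because each step reuses the transitions observed in the past, a projection of this kind extends the historical tendencies by construction.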

LUH2
As the updated version of LUH1, LUH2 is a new harmonized set of land-use scenarios that smoothly connects the historical reconstructions of land use with eight future projections in the format for CMIP6 models. It covers the time period 850-2100 at 0.25° resolution, with a set of historical data (850-2015) based on the History of the Global Environment database (HYDE) 3.2 (Klein Goldewijk et al 2017) and multiple shared socioeconomic pathway (SSP) and representative concentration pathway (RCP) scenarios of the future (2015-2100) developed by integrated assessment model (IAM) teams (Riahi et al 2017).
In detail, data from HYDE 3.2 were at 5′ resolution and available every 100 years from 800 to 1700, every 10 years from 1700 to 2000, and annually from 2000 to 2015. These data were used to calculate the grid cell area fraction of the land-use types (cropland, managed pasture, rangeland, urban land, ice and water) at 0.25° resolution. The ice and water fraction was assumed constant over time. Data were then linearly interpolated in time to produce annual maps. By subtracting the aforementioned land-use types from the grid cell area, the grid cell area fraction of natural vegetation was determined. The natural vegetation was then subdivided into primary or secondary forest or non-forest based on two assumptions: (1) primary land is land that has not been harvested, cut, or converted since 850 CE; (2) the separation of forest and non-forest areas is based on a potential aboveground standing stock (estimated by the MIAMI-LU ecosystem model) of 2 kg C m⁻² (Hurtt et al 2011). The cropland was further subdivided into five crop functional types: C3 annuals, C4 annuals, C3 perennials, C4 perennials, and C3 nitrogen fixers. Table S1 summarizes the 13 land-use types in the LUH2 dataset.
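The interpolation, residual, and thresholding steps described above can be sketched for a single grid cell as follows. This is an illustrative sketch only, not the LUH2 processing code: the function names and all numerical inputs are hypothetical, and treating a stock exactly at the threshold as forest is an assumption.

```python
import numpy as np

FOREST_THRESHOLD = 2.0  # kg C m^-2, potential aboveground standing stock (from the text)

def interpolate_annual(years, fractions, target_years):
    """Linearly interpolate a land-use fraction series in time."""
    return np.interp(target_years, years, fractions)

def natural_vegetation_fraction(managed_fractions, ice_water_fraction):
    """Natural vegetation is the residual after subtracting the mapped types."""
    residual = 1.0 - sum(managed_fractions.values()) - ice_water_fraction
    return max(residual, 0.0)  # guard against small negative residuals

def is_forest(potential_biomass):
    """Assumption: stocks at or above the threshold count as forest."""
    return potential_biomass >= FOREST_THRESHOLD

# Hypothetical cell: decadal cropland fractions interpolated to annual values
annual_years = np.arange(1990, 2001)
cropland_annual = interpolate_annual([1990, 2000], [0.10, 0.12], annual_years)

managed_1995 = {
    "cropland": float(cropland_annual[5]),  # value for 1995
    "managed_pasture": 0.05,
    "rangeland": 0.40,
    "urban": 0.01,
}
natural_1995 = natural_vegetation_fraction(managed_1995, ice_water_fraction=0.03)
forest_flag = is_forest(1.2)  # hypothetical potential biomass, below the threshold
```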
For 2015-2100, the land-use information was from eight SSP-RCP scenarios (SSP5-8.5, SSP3-7.0, SSP2-4.5, SSP1-2.6, SSP4-6.0, SSP4-3.4, SSP5-3.4OS, and SSP1-1.9) derived from five different IAMs. This study focuses on assessing the 'middle of the road' pathway, SSP2-4.5 (with an additional radiative forcing of 4.5 W m⁻² by the year 2100). It was simulated in a structure of interlinked disciplinary and sectoral models referred to as the IIASA IAM framework (Riahi et al 2007, Fricko et al 2017). Within the framework, land-use alterations were modeled by the Global Biosphere Management Model (GLOBIOM), which is a recursive-dynamic partial-equilibrium model (Havlík et al 2011). GLOBIOM considers detailed grid cell information on biophysical constraints and technological cost as well as a rich set of environmental parameters, including comprehensive agriculture, forest, and other land use GHG emission accounts and irrigation water use.

Reclassification from the LUH2 land-use types to the Li-LU types
For the assessment, data from the Li-LU dataset are converted to the grid cell area fraction of the land-use types at 0.25° resolution and the LUH2 land-use types are reclassified to the Li-LU types. Table S1 shows the reclassification method. The LUH1 dataset does not separate forest from non-forest types. The LUH2 dataset is updated to divide the primary and secondary land into forest and non-forest types. The non-forest primary and secondary land can be further divided into grassland and barren land according to the biomass density of natural vegetation, with a threshold of 0.5 kg C m⁻² as suggested by Hou et al (2022).
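Both steps can be sketched as below, under simplifying assumptions. The function names are hypothetical; the aggregation assumes that the fine map divides evenly into square blocks, which is a simplification of a real 300 m to 0.25° regridding on a latitude-longitude grid; and counting a biomass density exactly at the threshold as grassland is also an assumption.

```python
import numpy as np

GRASS_THRESHOLD = 0.5  # kg C m^-2, as suggested by Hou et al (2022) in the text

def class_fractions(fine_map, block, n_classes):
    """Fraction of each class inside every block x block window of a categorical map."""
    rows, cols = fine_map.shape
    if rows % block or cols % block:
        raise ValueError("map dimensions must be multiples of the block size")
    windows = fine_map.reshape(rows // block, block, cols // block, block)
    return np.stack([(windows == k).mean(axis=(1, 3)) for k in range(n_classes)])

def split_non_forest(non_forest_fraction, biomass_density, threshold=GRASS_THRESHOLD):
    """Assign non-forest land to grassland or barren land by biomass density."""
    is_grass = biomass_density >= threshold
    grassland = np.where(is_grass, non_forest_fraction, 0.0)
    barren = np.where(is_grass, 0.0, non_forest_fraction)
    return grassland, barren

# Hypothetical 4 x 4 categorical map aggregated to a 2 x 2 coarse grid
fine = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1],
                 [2, 2, 0, 0],
                 [2, 2, 0, 1]])
fractions = class_fractions(fine, block=2, n_classes=3)

# Hypothetical coarse cells: same non-forest fraction, different biomass density
grassland, barren = split_non_forest(np.array([0.6, 0.6]), np.array([0.7, 0.2]))
```

Keeping the threshold as a parameter makes it straightforward to repeat the reclassification with alternative values.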

Results
In this section, the LUH2 dataset with the future scenario SSP2-4.5 is compared with the Li-LU dataset for the historical (1995-2015) and future (2015-2045) periods, respectively. Figure 2 shows the time series of the fraction of the main land-use types over CA. The data for the years 1995, 2005, and 2015 in the Li-LU dataset are produced from satellite images; they represent the realistic land-use states over CA and can be used as a benchmark to evaluate the LUH2 dataset. The Li-LU dataset shows that the main land-use types over CA are grassland (49.33%-51.04%, the range depends on time), barren land (24.54%-25.48%), and cropland (18.32%-21.17%) during 1995-2015, together accounting for 94.84%-95.40% of the study area. The rest is shared by water bodies (2.13%-2.86%), forest (1.94%-1.96%), and urban land (0.07%-0.23%). Although the LUH2 dataset also identifies grassland, barren land, and cropland as the main land-use types over CA, the fractions of these three land-use types differ greatly between LUH2 and Li-LU. The fraction of grassland (70.29%-72.43%) in LUH2 is about 1.4-1.5 times that in Li-LU, and the fractions of barren land (15.42%-17.08%) and cropland (7.95%-8.56%) in LUH2 are only 0.6-0.7 and 0.4-0.5 times those in Li-LU, respectively. In addition, water bodies are assumed static (with a fraction of 2.84%) in LUH2 while Li-LU shows a decreasing trend in the fraction of water bodies during 1995-2015 (figure 2(d)). Figure 3 shows the spatial distribution of the fraction of the main land-use types in each grid cell (LANDUSEF, units: %) in 2005 depicted by the Li-LU (a)-(f) and LUH2 (g)-(l) datasets and the bias of LUH2 (m)-(r) relative to Li-LU. The Li-LU dataset depicts that the grasslands are mainly in central and southern Kazakhstan and parts of southeastern CA (figure 3(a)). However, the grasslands in the LUH2 dataset cover almost the entire study area (figure 3(g)).
The Li-LU dataset shows that the barren lands cover most of southwestern CA (figure 3(b)) while the LUH2 dataset indicates that the barren lands with high LANDUSEF are only over parts of southwestern CA (figure 3(h)). The Li-LU dataset shows that the croplands are mainly in northern Kazakhstan and along the Amu Darya and Syr Darya rivers (figure 3(c)). Although the LUH2 dataset captures the spatial distribution of the croplands well (figure 3(i)), it largely underestimates the LANDUSEF (figure 3(o)). The Li-LU dataset shows a shrinking Aral Sea (figure 3(d)) while the lake has a large surface area in the LUH2 dataset (figure 3(j)). In addition, the LUH2 dataset depicts low-to-high LANDUSEF of water bodies (figure 3(j)) and forest (figure 3(k)) in Tajikistan, which are not present in the Li-LU dataset (figures 3(d) and (e)).
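The ratios quoted in this subsection can be checked against the fraction ranges given in the text. Because the text reports only the 1995-2015 range for each dataset and not the year-by-year pairing, the sketch below computes the outer bounds of the LUH2 to Li-LU ratio (smallest over largest, largest over smallest); the function name is hypothetical.

```python
def ratio_bounds(luh2_range, li_lu_range):
    """Smallest and largest possible LUH2 / Li-LU ratio given each dataset's range."""
    luh2_lo, luh2_hi = luh2_range
    li_lo, li_hi = li_lu_range
    return luh2_lo / li_hi, luh2_hi / li_lo

# Fractions (%) of the study area during 1995-2015, as quoted in the text
ranges = {
    "grassland": ((70.29, 72.43), (49.33, 51.04)),
    "barren land": ((15.42, 17.08), (24.54, 25.48)),
    "cropland": ((7.95, 8.56), (18.32, 21.17)),
}

bounds = {name: ratio_bounds(luh2, li_lu) for name, (luh2, li_lu) in ranges.items()}
for name, (low, high) in bounds.items():
    print(f"{name}: {low:.2f}-{high:.2f}")
```

The resulting bounds are consistent with the rounded factors of about 1.4-1.5, 0.6-0.7, and 0.4-0.5 stated above.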

The evaluation of LUH2 in predicting the future scenarios for 2015-2045

The predicted LUCs by the Li-LU dataset
The Li-LU dataset shows remarkable expansion of cropland (figure 2(c)) and urban land (figure 2(f)) in the coming decades and significant shrinking of barren land (figure 2(b)), water bodies (figure 2(d)), and forest (figure 2(e)). The area of grassland is predicted to be relatively stable ( figure 2(a)).
Take the predicted LUCs in 2045 relative to 2005 as an example. The area of grassland is predicted to increase from 189.44 to 198.08 million km² (by 4.56%, table 1). The net grassland gain results from an imbalance between grassland loss, mostly former grasslands replaced by croplands (see figures 4(a) and (c)), and grassland gain, mostly grasslands established over former barren lands (see figures 4(a) and (b)).
The area of barren land is predicted to decrease from 96.76 to 50.19 million km² (by 48.12%). The barren land loss is primarily driven by conversion to grassland (as stated above) and agricultural expansion (see figures 4(b) and (c)). The barren land gain is mainly due to the shrinkage of the water bodies (see figures 4(b) and (d)).
The area of cropland is predicted to increase by 55.96%, from 79.32 to 123.71 million km². The cropland expansion is primarily over former grasslands and barren lands (as stated above) and the net crop loss is very small (figure 4(c)).
The area of water bodies is predicted to decrease by 70.18%, from 9.32 to 2.78 million km². By 2045, the Aral Sea is predicted to almost dry up (see figures 3(d) and 4(d)) and the exposed lakebed to become desert. The dynamics in other water bodies are minor. The area of urban land is predicted to increase by 380.53%, from 0.59 to 2.85 million km², mainly over the border areas of Kazakhstan, Kyrgyzstan and Uzbekistan (figure 4(f)).
In addition, the area of forest is predicted to decrease from 7.51 to 5.31 million km² (by 29.30%).
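The percentage changes above follow from the quoted 2005 and 2045 areas as (end − start)/start. The short sketch below recomputes them from the rounded areas in the text; the function and variable names are hypothetical. Since the inputs are rounded to two decimals, the recomputed values can differ slightly from the quoted percentages, most visibly for urban land, whose base area is very small.

```python
# Areas in 2005 and 2045 as quoted in the text (units as given there)
areas = {
    "grassland": (189.44, 198.08),
    "barren land": (96.76, 50.19),
    "cropland": (79.32, 123.71),
    "water bodies": (9.32, 2.78),
    "urban land": (0.59, 2.85),
    "forest": (7.51, 5.31),
}

def relative_change(start, end):
    """Percentage change of `end` relative to `start`."""
    return 100.0 * (end - start) / start

changes = {name: relative_change(a_2005, a_2045)
           for name, (a_2005, a_2045) in areas.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```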

The predicted LUCs by the LUH2 dataset
The predicted LUCs are much smaller in LUH2 than in Li-LU. The areas of all the land-use types except urban land are predicted to increase or decrease by less than or close to 10% in 2045 relative to 2005 (table 1). Although the area of urban land is predicted to increase by 108.68%, this is still much smaller than the increase (380.53%) in the Li-LU dataset. Moreover, changes in LANDUSEF are small and scattered over CA (figures 4(g)-(l)). In addition, the two datasets generally show alterations of opposite sign. For instance, the LUH2 dataset shows decreases in LANDUSEF of the croplands over parts of northeastern CA (figure 4(i)) while the Li-LU dataset indicates increases there (figure 4(c)). The opposite signs are also obvious for the changes of grassland and barren land (figure 4(a) vs figure 4(g), figure 4(b) vs figure 4(h)).
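One simple way to quantify how often two change maps disagree in sign is the share of grid cells with opposite-sign changes among the cells where both datasets change. The sketch below illustrates this under stated assumptions: the metric, the function name, the optional `min_abs` cutoff, and the toy values are hypothetical and are not part of the analysis reported here.

```python
import numpy as np

def opposite_sign_share(change_a, change_b, min_abs=0.0):
    """Share of cells with opposite-sign changes, among cells where both change."""
    a = np.asarray(change_a, dtype=float)
    b = np.asarray(change_b, dtype=float)
    both_change = (np.abs(a) > min_abs) & (np.abs(b) > min_abs)
    if not both_change.any():
        return float("nan")
    opposite = np.sign(a) != np.sign(b)
    return float((opposite & both_change).sum() / both_change.sum())

# Hypothetical LANDUSEF changes (percentage points) in five grid cells
li_lu_change = np.array([12.0, 8.0, -3.0, 0.0, 5.0])
luh2_change = np.array([-1.0, -0.5, -0.2, 0.3, 0.4])

share = opposite_sign_share(li_lu_change, luh2_change)
```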
There are no observations with which to evaluate the Li-LU and LUH2 datasets in predicting the future scenarios. However, the causal mechanisms [the causal connections between population, economic drivers (e.g. gross domestic product), environmental drivers (e.g. precipitation) and land use] found during the historical period can help assess the reliability of the predicted LUCs, under the assumption that the future LUCs have a high probability of being an extension of the historical experience. Hu and Hu (2019) used the multiple stepwise regression method to identify the key factors that drove the historical LUCs in CA. They found that wetter conditions in southern CA were the key factor driving the conversion from barren land to grassland there. The increase in precipitation and population growth led to the cropland expansion. The urbanization during the historical period was driven by both population growth and economic development. Both the CMIP5 and CMIP6 models show that annual precipitation is expected to increase over CA in the future (Huang et al 2014, Jiang et al 2020). Researchers predict that the CA population will climb in the coming decades, with a high birth rate and a low mortality rate (CA Bureau for Analytical Reporting 2020). Given the increase in precipitation and population in the future, the LUCs predicted by the Li-LU dataset seem plausible, such as the expansion of cropland and urban land and the shrinking of barren land. However, the LUCs predicted by the LUH2 dataset do not follow the causal mechanisms based on the empirical data. The Li-LU dataset predicts that the Aral Sea will almost dry up in the future, which is a reliable land-use scenario for this lake. In contrast, water bodies are assumed constant over time in the LUH2 dataset, which limits the use of this dataset in regions with potentially drastic changes in water bodies in the future.

Discussion
The reasons why the LUH2 dataset has large biases in depicting the historical land-use states in CA are discussed here. First, the biases may partially come from some assumptions made in producing the HYDE 3.2 dataset, which is the source data of LUH2. For instance, mosaic cropland land cover types are assumed to have 60% and 40% of cropland or pasture following the managed grass definition of Poulter et al (2015). In addition, cropland and grassland classes are counted at 90% of their area to account for small areas of infrastructure, wetlands, unsuitable terrain, steep slopes or small patches of vegetation that are not explicitly identified in the original land cover product (Verburg et al 2009). These assumptions may result in low LANDUSEF of the cropland throughout CA. Further studies are needed to investigate whether these assumptions are appropriate for this region. Second, the biomass density of natural vegetation in LUH2 is overestimated over CA, which causes too many areas to be identified as grassland rather than barren land. Third, the biases may be partially contributed by the threshold of the biomass density used for separating grassland and barren land in the LUH2 dataset. This study adopts a threshold of 0.5 kg C m⁻² as suggested by Hou et al (2022). Li et al (2015) found that the biomass density of grassland is in a range of about 0.3-0.6 kg C m⁻² over CA. New reclassifications with the upper and lower bounds of this range were carried out and the results (not shown) are similar to those with a threshold of 0.5 kg C m⁻², which indicates that setting the threshold to 0.5 kg C m⁻² contributes little to the biases. Thus, the large biases in LUH2 are mainly related to the quality of the source data (i.e. HYDE 3.2) and the biomass density. The large discrepancies in the predicted LUCs between the Li-LU and LUH2 datasets are related to the different models and assumptions or scenarios utilized.
For instance, in the SSP2-4.5 scenario, the world remains to a certain degree fragmented economically and crop yields grow relatively faster in the global South than in the global North, which may explain why the changes in cropland are small in the LUH2 dataset (Riahi et al 2017). However, in the CA-Markov model, future LUCs are modeled on the basis of the preceding state. Therefore, the future LUCs in the Li-LU dataset reflect an extension of the historical changes, such as the expansion of cropland. Objectively speaking, it is difficult to say which dataset is more plausible in predicting the LUCs in CA. Under the assumption that the future LUCs have a high probability of being an extension of the historical experience, the Li-LU dataset seems more plausible.
As stated in section 1, previous RCM projections generally design three experiments to understand the role that the land-use forcing plays in future climate change: a historical simulation and a future simulation with the historical land-use map, and a future simulation with the predicted land-use scenario. For the RCM projection in CA, following this usual practice, the land-use map of 2005 from the Li-LU dataset (referred to as Li-LU-05) would be used for the first and second experiments and the land-use map of 2045 from the LUH2 dataset (referred to as LUH2-45) for the third experiment. With this experiment design, the spatial pattern of the changes between LUH2-45 and Li-LU-05 generally reflects the great discrepancies between LUH2-05 and Li-LU-05, that is, the large biases of LUH2-05 (compare figure S1 with figures 3(m)-(r)), because the changes between LUH2-45 and LUH2-05 are very small. Evidently, this experiment design will yield misleading information if the simulation results are used to assess the impacts of the land-use forcing on future climate change in CA.
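This argument can be written as a simple decomposition. The symbol F, denoting the grid cell fraction of a given land-use type in the indicated dataset and year, is introduced here only for illustration:

```latex
\underbrace{F^{\mathrm{LUH2}}_{2045} - F^{\mathrm{Li\,LU}}_{2005}}_{\text{land-use change seen by the RCM}}
= \underbrace{\left(F^{\mathrm{LUH2}}_{2045} - F^{\mathrm{LUH2}}_{2005}\right)}_{\text{LUC predicted by LUH2 (small)}}
+ \underbrace{\left(F^{\mathrm{LUH2}}_{2005} - F^{\mathrm{Li\,LU}}_{2005}\right)}_{\text{bias of LUH2-05 (large)}}
```

When the first term on the right-hand side is small, the land-use change imposed in the third experiment is dominated by the second term, the historical bias of LUH2, rather than by any predicted LUC.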

Summary
In this study, a local land-use dataset, Li-LU, is used as the benchmark to thoroughly assess the feasibility of using the LUH2 dataset for RCM projection in CA to understand the impact of LUCs on future climate change. The historical data of Li-LU are derived from satellite images and can represent the realistic land-use states in CA during the historical period. The potential LUCs predicted by Li-LU follow the causal mechanisms (the causal connections between the key drivers and land use) behind the LUCs in the past, which indicates that the future data of Li-LU are fairly reliable.
Comparison between LUH2 and Li-LU shows that the LUH2 dataset has large biases in depicting the historical land-use states in CA. For instance, the area of grassland (cropland) in LUH2 is 1.4-1.5 (0.4-0.5) times that in Li-LU. Moreover, the future LUCs predicted by LUH2 are much smaller than those of Li-LU, and the two datasets generally show changes of opposite sign. In addition, the predicted LUCs in LUH2 do not follow the causal mechanisms based on empirical data. If the future scenario of LUH2 is used for RCM projections in CA together with the historical land-use information from Li-LU, the simulation results could be misleading for understanding the impacts of LUCs on future climate change there. This study suggests that the LUH datasets should be carefully assessed before they are used for RCM projections and gives practical notes for the assessment.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://doi.org/10.11888/HumanNat.tpdc.273028 (Qiu 2022).