Annual runoff prediction of the upstream of Heihe River Basin, China

A prediction method of annual runoff is proposed for the upstream Heihe River Basin. The Mann-Kendall test, order cluster analysis, variance analysis and R/S analysis methods were employed to investigate trends, mutations, periods and persistence of annual runoff time series, respectively. Based on various components, the linear superposition method was applied to forecast annual runoff. The Nash-Sutcliffe efficiency coefficient (NSE), the mean absolute percentage error (MAPE) and the percentage of pass (POP) were utilized to evaluate the simulation and prediction effects of annual runoff. Moreover, the calibration period was between 1958 and 2010, while the validation period was from 2011 to 2015. The results show that i) the annual runoff time series had a significant increasing trend (0.65×108 m3/10a), a probable mutation at 2006 and two significant periods (6 years and 22 years) in the calibration period; ii) the randomization of annual runoff time series was improved after removing the trend, mutation and periodic components, with the Hurst exponents 0.52; and iii) the simulation effects of calibration period (NSE=0.74, MAPE=6.6% and POP=90.6%) and the prediction effects of validation period (MAPE=6.5%, POP=80%) jointly indicate that the annual runoff prediction method is suitable in this river basin.


Introduction
Heihe River, the second largest inland river of China, is located in the northwest of China with the arid or semi-arid climate. It supplies the important water resource to the socioeconomic development and ecological environmental protection of northwestern China. The upstream of the river basin is the main runoff-yield region which is fed by rainfall and ice-snow melt water [1]. In the middle of the basin, there distribute many agricultural irrigation districts which consume a large amount of water each year [2]. The downstream of the river flows through desert and Gobi area, where the evaporation is intensive and the ecosystem is fragile [3].
In Heihe River Basin, it is very important to correctly handle the relations among irrigation, ecologic environment and hydroelectric generation under the condition of limited water resources. From 1960 th , the annual runoff in the downstream river basin reduced sharply because of the rapid population growth and extensive irrigation area in the middle river basin, causing a serious degradation of ecological environment of the middle and lower river basin [4][5][6]. Since 2000, the government of China has taken many water allocation measures to relieve the water contradiction between irrigation and ecological environment in the river basin, achieving some success [7][8][9]. However, the water contradiction is still serious currently, and the downstream ecological environment has not yet restored to the level of the 1980 th [10,11]. Besides, the upstream of the river has rich waterpower resources, and many hydropower plants were built there in the first decade of 21 st century. These hydropower stations provide reliable electric power for the local socioeconomic development, so the requirements of the hydropower plants should be properly taken into account when allocating the surface water of Heihe River [12].
The annual runoff at the upper stream could directly influence the water allocation between different users of Heihe River Basin, since the upstream basin is the main runoff generation area. In order to better realize the comprehensive benefits of irrigation, ecologic environment and hydroelectric generation, the decision makers always make the water distribution schemes based on the next year's runoff volume in advance. Therefore, it is of great significance to predict the annual runoff at the upstream of the river basin.
The prediction of annual runoff belongs to long-term runoff forecast, which could be solved by three kinds of common methods. The first one is the time series analysis method, which acquires the future runoff based on the internal relations (autocorrelation and relations with time) of the historical runoff series [13,14]. The second one is the relevant factor analysis method, which forecasts the runoff based on the statistical relations between runoff and the other factors such as precipitation, atmospheric temperature and wind speed [15][16][17]. The last one is runoff generation and concentration model, which predicts the runoff according to the physical mechanism of runoff formation [18][19][20]. Among the three methods, the first one is the simplest because it only needs the historical runoff data, but it lacks physical mechanism and excessively depends on the constant statistical characteristics of runoff series. The second one could demonstrate the explicit causes and course of runoff formation, but it is the most complicated approach which needs a great mass of data (hydrologic data, meteorological data, vegetation data, soil data, land use data, etc.). As to the third one, its advantages and disadvantages are between the first and second methods. On the whole, it is difficult to prove which way is best for forecasting the annual runoff due to many reasons (e.g. assumed conditions, empiric formulas, data deficiencies and error propagation) without considering specific river basins.
The time series analysis method is often employed to forecast the annual runoff of Heihe River because of the insufficient observation data and the unrevealed ice-snow melting mechanism of the river basin. Chen et al [17] adopted grey model to the extract tendency part of runoff time series, employed the wavelet analysis method to extract the period part, utilized the autoregressive method to deal with the stochastic part, and then built the runoff prediction model of Heihe River. Jiang and Liu [21] combined the wavelet analysis method with radial basis function network to forecast the annual runoff of Heihe River. Chu et al [22] applied Mann-Kendall method to study the trend of runoff time series of Heihe River, used the coupling method of periodic wave extension and stepwise regression to determine the period part. Zhang and Ding [23] employed the grey topological method to predict runoff volume at the valley outlet of Heihe River. Hou et al [24] established the annual runoff forecast model of Heihe River through the periodic mean superposition method. Li et al [25] proposed a new runoff prediction model of Hiehe River based on R/S analysis method and grey system theory. It can be seen that the applications of the time series analysis method to runoff prediction of Heihe River in the past mainly focused on three components of runoff time series (trend, periodic and stochastic components), and merely provided the deterministic forecasting results to the decision-makers. However, the annual runoff series of the upstream Heihe River probably has mutation and becomes more uncertain in the background of regional climate change [26,27], which may bring to the decision of water allocation certain risks in future.
Specific to the upstream Heihe River, this study still adopts the time series analysis method to build the annual runoff prediction model, but it is different from the prior studies. Firstly, the mutation of annual runoff series will be investigated before establishing the model. Secondly, the model not only provides the deterministic prediction results but shows the main variation range of runoff, which could help the decision-makers make adequate preparation against the risks of uncertainty. Moreover, the effect of the extraction of deterministic components on annual runoff series will be discussed. Therefore, this study is organized as follows: Section 2 introduces the general situation of the Heihe River; Section 3 displays the data and methods for trend analysis, mutation test, period detection, persistence analysis and prediction of annual runoff; in Section 4, the results are presented and discussed; and the conclusions are drawn in Section 5.

Study area
Heihe River is a typical inland river of China, flowing through Qinghai, Gansu and Inner Mongolia provinces from south to north (figure 1). The river originates from the north foot of Qilian Mountain and feeds into the West Juyan Lake and the East Juyan Lake, with a core drainage area of 130,000 km 2 and a mainstream length of 821 km. The upstream Heihe River (UHR), the study area, is between the river sources and Yingluoxia canyon with the drainage area of 10,009 km 2 , the river length of 303 km and the average channel gradient of 9.1‰. UHR is characterized by the mountainous terrains with the average elevation of 3737.7 m. The climate of UHR is cold and moist, with the annual average air temperature of 2°C and the annual average precipitation of 350 mm. Many mountain areas of UHR with high elevation (more than 4,000 m) are annually snow-covered. Moreover, a number of modern glaciers distribute in UHR, which is the natural reservoir of northwestern China. Yingluoxia canyon is the outlet of UHR with the annual average runoff volume of 1.6 billion m 3 , which will be mainly used for irrigating crops, recharging groundwater, improving ecological environment in the middle and lower Heihe River.

Materials
The average daily flow data of Yingluoxia canyon is observed at Yingluoxia hydrologic station (figure 1), which was built in 1943. Due to the poor field monitoring condition and the backward data reorganization technology, the reliability of average daily flow data before 1958 is much lower. Therefore, the annual runoff data of Yingluoxia station between 1958 and 2015 are employed in this study. Moreover, UHR annual runoff data between 1958 and 2010 are applied to calibrate the prediction model, and the remaining data (2011)(2012)(2013)(2014)(2015) are utilized to validate the model.

Trend analysis method.
Mann-Kendall test is widely applied to detect the trends of hydrologic time series because it is a nonparametric test and is not sensitive to the mutations of nonstationary time series. The statistic (U) is built as below [28,29]: where N is the length of hydrologic time series; Xi (i=1, 2, …, N) is the i th hydrologic variable (e.g. annual runoff); kij is the paired value between Xi and Xj; and K is the summation of all the paired values.
When N is large enough (N≥10), the statistic U approximately obeys the standard normal distribution. The original hypothesis is that the hydrologic time series has not any obvious trends. The significance level α is given before the trend test. If |U|≤Uα/2, the original hypothesis is accepted, i.e. the series has no obvious trends; otherwise, the original hypothesis is refused, which shows that there exists evident tendency in the series. Besides, the sign of U represents the trend direction ('+' represents the growth trend while '-' indicates the decreasing tendency), and the absolute value of U shows the obvious degree of trend (the degree will become more and more obvious with the increasing of |U|).
In addition to the overall trend analysis, the local trends are also investigated in the trend analysis of hydrologic time series. The turning points and stages of various local trends are distinguished by the trend variation index (T), which is defined as follows: where τ (10<τ≤N-10) is the division point of series; tτ,1 and tτ,2 represent the linear trend slopes of the former and latter subseries divided by τ, respectively. In equation (4), when Tτ reaches the maximum value, the division point τ is considered as one turning point, which divides the series into two obvious different stages of local trends. Moreover, if a hydrologic time series has l (l=1, 2, …) turning points, it has l+1 different trend stages.

Mutation analysis method.
The mutation represents an irreversible jump of hydrologic time series at a certain time. The order cluster analysis method could effectively find the most probable mutation point of hydrologic time series [30]. Its principle is to find out an optimal segment point dividing the series into two subsequences, where the variable difference in the same subsequence is the minimum and the variable difference between two subsequences is the maximized.
The mutation index (Z) of hydrologic time series is computed as follows [30]: where τ is possible mutation point of series; X  and S  are the mean value and the standard deviation of the subsequence before τ, respectively; N X  and S   represent the mean and the standard deviation of the subsequence after τ, respectively. In equation (5), the square of mean difference (the numerator part) represents the variable difference between two subsequences, and the variance sum (the denominator part) reflects the variable difference in the same subsequence. According to the principle of the order cluster analysis method, the point where Z reaches maximum is the most probable mutation point of time series.
After identifying the most probable mutation point, the next necessary step is to test its significance. Rank sum test is also a common nonparametric method in examining the obvious degree of mutation of hydrologic time series [30]. Firstly, the figures of the original series are sorted from small (large) to large (small). Secondly, the rank of each figure of the short subsequence in the sorted series is recorded. Finally, the summation of these the ranks (rank sum) is calculated and employed to establish the statistic of rank sum test.
The statistic (U) of rank sum test is built as below [30]: where W is the rank sum of short subsequence; n1 is the length of the short subsequence; and n2 is the length of the long subsequence. When n1 is large enough (n1≥10), the statistic U approximately obeys the standard normal distribution. The original hypothesis is that the series has no mutation points, and the significance level α is given before rank test. If |U|≤Uα/2, the original hypothesis is accepted, i.e. the series has no mutation; otherwise, the original hypothesis is refused, i.e. there is a remarkable mutation in the series. Moreover, the difference between the means of the series before mutation and after mutation is the jump height, which could also be understood as the mutation component of series.

Period analysis method.
Variance analysis is a very useful period identification method, and its obvious advantage is that it is not confined to the exact form of periodic wave function [31,32]. The period detection process of variance analysis is described in the following [32].
 Divide the time series into an equal number of groups based on the length of the assumed period.
where P (2≤P≤N/2) is the assumed period; r is the remainder when dividing N by P; and mj (j=1, 2, …, P) is the size of the j th group.  Calculate the mean values of original time series and each group.
where Xi (i=1, 2, …, N) is the i th figure of original time series; X is the mean of original series; j X (j=1, 2, …, P) is the mean of the j th group; and Xkj (k=1, 2, …, mj) is the k th figure of the j th group.  Compute the intra-group and inter-group sum of deviation square.
where 2 1 S is the intra-group sum of deviation square; and 2 2 S is the inter-group sum of deviation square.
 Calculate the value of statistic F, give the significance level α and test the significance of the supposed period.
In the definition formula (14), the statistic F follows F distribution with the degree of freedom (P-1, N-P). The significance level α is given before F test. If F>Fα(P-1, N-P), the assumed period P is significant; otherwise, it is not obvious. After finding the first significant period, its periodic wave could be constructed according to the mean values of corresponding groups. Generally, there often exist more than one obvious period in the hydrologic time series. Hence, it is essential to continue to search an additional number of significant periods after extracting the first periodic process. Meanwhile, the number of prominent periods of series is not more than three considering the negative impact of some pseudo periods [32].

Persistence analysis method.
The persistence analysis of runoff series has been initiated since the middle of 20 th century, which was rapidly expanded to many other geophysical records including temperature and precipitation data [33]. Currently, it is proved that the time series with long range dependence property (persistence) widely exist in the natural world [34]. Hurst exponent H (0<H<1), presented by Hurst in 1951, could quantitatively describe the persistence of hydrologic time series very well [35]. Specific to Hurst exponent, there are three kinds of persistence condition in the following.
 If H=0.5, the hydrologic time series is an independent random sequence with a limited variance.  If 0.5<H<1, the series has long range dependence property (persistence), which indicates that the future tendency is consistent with the past trend. Moreover, the persistence of series will endlessly strengthen when H approaches 1.  If 0<H<0.5, the series still has long run memory property (anti-persistence), which shows that the future tendency is contrary to the past trend. Besides, the anti-persistence of series will continually enhance when H is close to 0. At present, there are many methods for estimating Hurst exponent [34]. Among them, R/S analysis is the most universal method because of its robustness and nonparametric analysis feature (i.e. it has no special demands on the distribution type of time series). The calculation steps of Hurst exponent by R/S analysis method are expressed as below [33,34].
 Create an accumulated deviation series (ADS) where R(τ) is the range of ADS; and S(τ) is the standard deviation of ADS.  Estimate Hurst exponent where C is a constant.
In equation (19) In equation (20), the different components of X series should be estimated and extracted in turn. XC is considered as the mean value X'C of X series; XT is estimated by the linear trend part X'T of X-X'C series based on Mann-Kendall test result; XM is estimated by the jump height X'M at the mutation point of X-X'C-X'T series according to the consequence of mutation analysis; XP is estimated by the periodic part X'P of X-X'C-X'T-X'M series through the variance analysis method; and the remaining part X-X'C-X'T-X'M-X'P (X'S) is thought as the estimation value of XS. Moreover, X'S could be further subdivided into a dependent random component X'S1 and an independent random component X'S2. Hurst exponent H of X'S calculated by R/S analysis method determines the proportions of two components in X'S series. If H≤0.4 or H≥0.6, the proportion of X'S1 could not be neglected. On this occasion, X'S1 should be estimated through autoregressive model [17] and stripped from X'S. Then, the remaining part of X'S is X'S2. If 0.4<H<0.6, X'S could be regarded as an independent random series, i.e. X'S2= X'S. Without taking X'S2 into account, the medium estimation value X M of the X series could be calculated as below.
In order to reduce the uncertainty of prediction, it is essential to master the main variation range of X'S2. According to the probability distribution type of X'S2, its lower and upper bounds could be defined as where p is the occurrence probability of event X'S2≥X'S2(p); X L and X U are the lower and upper estimation values of X, respectively; q (0.5<q<1) is the appointed probability; X'S2(q) and X'S2 (1-q) represent the lower and upper bounds of X'S2, respectively. In this study, Nash-Sutcliffe efficiency coefficient (NSEC) and the mean absolute percentage error (MAPE) are employed to evaluate hydrological forecast effects [37,38]. NSE (NSE≤1) and MAPE (0≤MAPE≤1) are two commonly used indexes in evaluating hydrologic prediction model. The forecast accuracy improves with the increasing of NSE and the reducing of MAPE. Instead of measuring the forecast accuracy, and the percentage of pass (POP) is defined to evaluate the reliability of prediction variation range.
where R i X and M i X are the i th (i=1, 2, …, N) observed and medium estimation values of hydrologic series, respectively; R X is the mean value of observed hydrologic series; and M is the number of events that X R is located between X L and X U .

Results and discussion
The average annual runoff of Yingluoxia station was 16.04×10 8 m 3 between 1958 and 2010 (calibration phase). Thus, the constant component X'C of UHR annual runoff series X was equal to 16.04×10 8 m 3 , which would be stripped from X before extracting the other components.

Trend component
Based on Mann-Kendall test with the significant level α=0.05 (Uα/2=1.96), the trend test results of X-X'C series were shown in figure 2. In graph (a), Max_s curve was above Min_e curve from 1 to 26 consecutive years, which directly indicated that X-X'C series had a clear increasing tendency between 1958 and 2010. In addition, Min_s and Min_e curves had obvious abrupt turns between 13 and 14 consecutive years, indicating a trough (1968)(1969)(1970)(1971)(1972)(1973)(1974)) of X-X'C series. In graph (b), the overall trend of X-X'C series was significant with a linear growth speed 0.65×10 8 m 3 /10a. The expression of X'T was the trend equation of X-X'C series. In graph (c), it showed that X-X'C series had two remarkable turning points (1970 and 2000). Thus, the series had three stages of local trends, which were displayed in graph (d). The first stage was between 1958 and 1970, with a decreasing speed 2.50×10 8 m 3 /10a; the second stage ranged from 1971 to 2000, with a slow growth speed 0.75×10 8 m 3 /10a; and the third stage started from 2001 and ends at 2010, with an increasing speed 5.60×10 8 m 3 /10a. However, none of the three local trends were significant through Mann-Kendall test. The overall trend and local trends of X-X'C series were closely related to the variation processes of precipitation and atmospheric temperature in UHR during the past 50 years before 2010 [1,3,39].  The period analysis results were shown in figure 4. The left graph displayed two significant simple periods (P1=22 years and P2=6 years) of X-X'C-X'T-X'M series, and the right one presents their respective variation processes. Through the variance analysis method, the 22 years' period was found in the first F test when given the significance level α=0.05. After extracting the 22 years' period component, there was no obvious period component in the remaining part in the second F test when α=0.05. Generally, the potential periods in the second test were less significant than the periods detected in the first test. Based on this, it was imperative to lower the significant level to search the potential periods in the second test. When given the significance level α=0.1, the 6 year' period was found in the second test, but no remarkable periods were found in the third test. So far, the period detection work of X-X'C-X'T-X'M series by variance analysis method was finished. The two significant simple period components of UHR annual runoff series mainly result from the sun activity [17]. At last, the synthesized period component X'P was acquired through linear superposition of two simple period components.

Stochastic component
After extracting all the deterministic components from the UHR annual runoff series X, the remaining part was the stochastic component X'S which was displayed in the left graph of figure 5. Further, Hurst exponents of X and X'S were also calculated through R/S analysis method, and the results were presented in the right graph of figure 5. Seen from the right graph, Hurst exponents of X and X'S were 0.78 and 0.52. The two Hurst exponents showed that X had a remarkable long run dependence property and the randomness of X'S was very strong. Meanwhile, the great change of Hurst exponent also proved that the persistence of UHR annual runoff series was mainly caused by its deterministic components. Since Hurst exponent of X'S was very close to 0.5, it was believed that X'S did not contain any dependent random components. Hence, the following work was to investigate the main distribution range of X'S as an independent random series. Numerous studies had shown that the hydrologic time series of various rivers in China could be well fitted by Pearson III (P-III) type distribution [40]. Considering many negative values in X'S series, P-III distribution type was employed to fit X'C+X'S series. The theoretical curve-fitting result was shown in the left graph of figure 6, and it can be seen that the curve fitting (R 2 =0.9888) was very successful. Further, the distribution range of X'S was also obtained based on theoretical frequency curve. In the right graph, according to equations (22)- (24), the accuracy rate of prediction was equal to q-(1-q) (0.5<q<1), and the forecast range is equal to X'S(1-q)-X'S(q). The forecast range rose faster and faster with the growth of accuracy rate, which exactly brought a dilemma to the prediction of annual runoff. Even so, the accuracy rate of prediction still attracted more attention. In this study, the appointed probability q was assigned 0.95, and the forecast range was 4.66×10 8 m 3 with the accuracy rate 90%.

Prediction and evaluation
Based on the deterministic and stochastic components, the estimation results of UHR annual runoff series in the calibration phase (1958-2010) were exhibited in figure 7. The evaluation indexes (NSE=0.74, MAPE=6.6%, POP=90.6%) showed that the simulation effects were satisfactory in calibration phase. Moreover, table 1 displayed the medium, lower and upper estimation values of UHR annual runoff series in the validation phase (2011)(2012)(2013)(2014)(2015). Seen from this table, the estimation range failed to cover the measured value at 2011, and the medium estimation values were always smaller than the measured values in four consecutive years (2012)(2013)(2014)(2015). Nonetheless, the estimation effects in validation phase were still acceptable on the basis of the evaluation indexes (MAPE=6.5%, POP=80%). Therefore, the model based on the time series analysis method was feasible in the prediction of UHR annual runoff. In order to provide the basis for the water allocation of Heihe river basin, the annual runoff series in future five years (2016-2020) were calculated and shown in table 2. In addition, the future UHR annual runoff series probably had some unexpected variations under the double impact of regional climate change and human activities. Hence, it was of great significance to update the current UHR annual runoff prediction model utilizing the latest observation data regularly.

Conclusions
An annual runoff prediction method was adopted to forecast annual runoff of UHR based on the linear superposition method. In other words, various components were extracted and investigated by decomposing the historical annual runoff series, hence constructing the forecast model through the linear superposition of these components. The main conclusions are drawn as below:  During 1958 and 2010, the annual runoff of the upstream Heihe River had an obvious increasing trend with a speed 0.65×10 8 m 3 /10a, a probable mutation at 2006, two significant periods (22 years and 6 years) and a persistence property.  The adopted method based on the linear superposition method is suitable to forecast the annual runoff in the Heihe River.