Comparison of three recurrent neural networks for rainfall-runoff modelling at a snow-dominated watershed

In recent years, rainfall-runoff modelling using LSTM networks has shown high adaptability. However, LSTM incurs a far higher computational cost than the traditional RNN, and another type of RNN, GRU, was developed to address this issue. This study therefore compares the accuracy of three deep learning methods, the traditional RNN, LSTM, and GRU, for rainfall-runoff modelling in a snow-dominated area. Because the setting of hyperparameters may also affect accuracy, the three methods were evaluated with multiple combinations of hyperparameters. The input data were daily air temperature and precipitation. The results show that GRU gives the highest accuracy with most of the combinations.


Introduction
The application of deep learning is a hot topic in various research fields. Deep learning is nowadays also applied to many kinds of problems in geoscience, as summarised in review papers in high-impact journals [1][2][3]. These applications have shown the high potential of deep learning in geoscience.
Rainfall-runoff modelling, including flow discharge forecasting, is one of the problems to which deep learning is frequently applied. A type of deep learning suitable for time series modelling is known as the recurrent neural network (RNN). A method categorized as an RNN, the Long Short-Term Memory (LSTM) network, has large potential to model time series with long-term dependencies. Due to this feature, LSTM has been applied to time series modelling in geoscience, especially rainfall-runoff modelling. For instance, Kratzert et al. (2019) used meteorological data such as precipitation, air temperature, and radiation as input and implemented flow discharge models at multiple watersheds in the United States [4]. Other research groups applied LSTM for flow forecasting [5][6][7]. Furthermore, Kao et al. (2020), Li et al. (2020), and Xiang et al. (2020) applied the encoder-decoder version of LSTM for flow forecasting [8][9][10]. The results of these previous studies show the high applicability of LSTM for rainfall-runoff modelling.
Although LSTM has an advantage in accuracy, it has a disadvantage compared with the traditional RNN: LSTM requires far more computational resources because of its complex structure. To address this issue, another type of RNN with a simpler structure, the Gated Recurrent Unit (GRU), was developed by Cho et al. (2014) [11]. Jeong and Park (2019) [12] applied GRU and LSTM to groundwater level modelling and found that the accuracy of GRU is comparable to that of LSTM. The authors of [13] applied GRU for flow discharge forecasting in a sewer and showed similar model accuracy between GRU and LSTM. Thus, GRU may also be a good option for rainfall-runoff modelling.
On the other hand, comparisons among the deep learning methods in the above studies were generally conducted with a single trained result. In the training process of a deep learning method, the initial states of its learnable parameters, such as weights and biases, are usually set with random values, and this randomness may affect the accuracy of the trained results. In addition, a deep learning method such as the traditional RNN, LSTM, or GRU has model options to be set, called hyperparameters, whose settings may also affect accuracy. The objective of this study is to compare the model accuracy of the traditional RNN, LSTM, and GRU for rainfall-runoff modelling. To consider the effects of the initial states of the learnable parameters, the comparisons are conducted with multiple randomly generated initial states. The comparisons are also conducted with multiple combinations of hyperparameters. As the study watershed, the Ishikari River watershed is selected, a snow-dominated watershed located in the Hokkaido region, Japan. Because there are long-term dependencies between the meteorological variables and the flow discharge at a snow-dominated watershed, it is a suitable problem with which to compare the three methods.

Long Short-Term Memory (LSTM) Network
Following [14], LSTM is described as follows:

i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

where i_t, f_t, g_t, and o_t are the input, forget, cell, and output gates at time t; h_t and c_t are the hidden state and cell state at time t; h_{t-1} is the hidden state at the previous time step; W_ii, W_if, W_ig, W_io, W_hi, W_hf, W_hg, and W_ho are weights; b_ii, b_if, b_ig, b_io, b_hi, b_hf, b_hg, and b_ho are biases; σ is the sigmoid function; and ⊙ denotes elementwise multiplication. When x_t is input, the internal information of c_{t-1} is updated using x_t and h_{t-1}. Finally, h_t of each LSTM block is linearly transformed, and the total value becomes the output.
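As a minimal sketch, one LSTM time step can be written in NumPy following the eight input/hidden weight and bias pairs described above; the weight shapes and dictionary keys here are illustrative, not part of the original study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W and b hold the eight weight matrices and
    bias vectors, keyed by the subscripts in the text (ii, if, ..., ho)."""
    i_t = sigmoid(W["ii"] @ x_t + b["ii"] + W["hi"] @ h_prev + b["hi"])  # input gate
    f_t = sigmoid(W["if"] @ x_t + b["if"] + W["hf"] @ h_prev + b["hf"])  # forget gate
    g_t = np.tanh(W["ig"] @ x_t + b["ig"] + W["hg"] @ h_prev + b["hg"])  # cell gate
    o_t = sigmoid(W["io"] @ x_t + b["io"] + W["ho"] @ h_prev + b["ho"])  # output gate
    c_t = f_t * c_prev + i_t * g_t   # update the cell state
    h_t = o_t * np.tanh(c_t)         # update the hidden state
    return h_t, c_t
```

Because h_t is the product of a sigmoid output and a tanh output, every component of the hidden state stays strictly inside (-1, 1).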

Traditional Recurrent Neural Network (RNN)
The traditional RNN can be written as the following equation:

h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)

where h_t and x_t are the hidden state and the input at time t; h_{t-1} is the hidden state at the previous time step (t - 1); W_ih and W_hh are the input and hidden weights; and b_ih and b_hh are biases. The hidden state is updated at each time step, and the value obtained by linearly transforming the hidden state is given as the output.

Gated Recurrent Unit (GRU)
Following [15], GRU consists of the following equations:

r_t = σ(W_ir x_t + b_ir + W_hr h_{t-1} + b_hr)
z_t = σ(W_iz x_t + b_iz + W_hz h_{t-1} + b_hz)
n_t = tanh(W_in x_t + b_in + r_t ⊙ (W_hn h_{t-1} + b_hn))
h_t = (1 - z_t) ⊙ n_t + z_t ⊙ h_{t-1}

where r_t, z_t, and n_t are the reset, update, and new gates at time t, and h_t is the hidden state at time t. GRU has a simpler structure than LSTM.
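The GRU update above can be sketched as a single NumPy step in the same style; the dictionary keys (ir, iz, in, hr, hz, hn) mirror the equation subscripts and are an illustrative naming choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, b):
    """One GRU time step; '⊙' in the equations becomes elementwise '*'."""
    r_t = sigmoid(W["ir"] @ x_t + b["ir"] + W["hr"] @ h_prev + b["hr"])          # reset gate
    z_t = sigmoid(W["iz"] @ x_t + b["iz"] + W["hz"] @ h_prev + b["hz"])          # update gate
    n_t = np.tanh(W["in"] @ x_t + b["in"] + r_t * (W["hn"] @ h_prev + b["hn"]))  # new gate
    h_t = (1.0 - z_t) * n_t + z_t * h_prev   # blend new candidate with previous state
    return h_t
```

Note that the GRU keeps a single hidden state (no separate cell state) and uses three gates instead of four, which is where its computational saving over LSTM comes from.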

Hyperparameters
The above three methods have various hyperparameters, which are model options to be set. The input data length (IDL) may be an important hyperparameter: rainfall-runoff modelling at a snow-dominated watershed is affected by the snow accumulation and melting processes, which lead to long-term dependencies between the meteorological variables and flow discharge. The hidden state length (HSL) may also be important; HSL is equivalent to the number of neurons of a traditional neural network. A deep learning method is generally trained by the mini-batch gradient descent method, which this study also utilized; in this case, the batch size needs to be set. The mini-batch gradient descent method also requires a loss function to update the learnable parameters and an optimization algorithm to adjust the learning rate. Among these hyperparameters, this study focused on IDL, HSL, and the optimization algorithm. The combinations of hyperparameters used in this study are tabulated in Tables 1-9.
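The one-at-a-time sweep design used later in the comparisons can be sketched as follows. Only the hyperparameter values explicitly mentioned in the text are listed here; the full lists are given in Tables 1-9, so these value lists are partial placeholders.

```python
# Baseline configuration used in the comparisons (IDL=365, HSL=50, Adam);
# each comparison fixes two hyperparameters and sweeps the third.
base = {"idl": 365, "hsl": 50, "optimizer": "Adam"}
sweeps = {
    "hsl": [50, 100],                                   # hidden state length (subset)
    "idl": [10, 100, 180, 365],                         # input data length in days (subset)
    "optimizer": ["Adam", "Adagrad", "RMSprop", "Adadelta"],
}

runs = []
for name, values in sweeps.items():
    for v in values:
        cfg = dict(base)   # copy the baseline
        cfg[name] = v      # override the swept hyperparameter
        runs.append(cfg)

print(len(runs))  # 10 training setups (2 + 4 + 4)
```

Each entry in `runs` is then one training configuration; in the study, every configuration is additionally repeated 100 times with different random initial states.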

Study area
The Ishikari River originates from Mt. Ishikari, whose altitude is 1,967 meters above sea level. Its channel length is 268 km, the third longest in Japan, and its basin area is 14,330 km², the second largest in Japan. The basin is located at E141°20'45" N43°14'14" (Figure 1). The basin accumulates a large amount of snow in winter: the snow-accumulation period is from November to March, and the snow-melting period is from April to May. During the snow-melting period, the flow discharge in this basin increases significantly compared with other periods.

Input and target data
This study utilizes precipitation and air temperature data as input. The target data are flow discharge data at the study watershed. These data were obtained from different data sources as follows.

Precipitation data.
The precipitation data were extracted from the APHRODITE dataset [16]. APHRODITE is a gridded observed precipitation dataset at the daily scale. Its spatial coverage includes the whole of Japan, its spatial resolution is approximately 5 km, and its temporal coverage is from 1900 to 2015.
Air temperature data.
The air temperature data were extracted from an atmospheric reanalysis dataset, the fifth generation of the European Centre for Medium-Range Weather Forecasts Reanalysis (ERA5) [17]. The horizontal resolution of ERA5 is approximately 25 km x 25 km (0.25° x 0.25°), and the temporal resolution of the provided data is hourly. The ERA5 dataset contains many variables; this study obtained air temperature from the single-level data. The basin-average value at the study watershed was calculated from the gridded information at each hour. The hourly data were then converted to the daily scale after adjusting for the time difference between Coordinated Universal Time (UTC) and Japan Standard Time.
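The hourly-to-daily conversion with the UTC-to-JST shift could look like the following pandas sketch; the hourly series here is synthetic, standing in for the ERA5-derived basin-mean temperature.

```python
import numpy as np
import pandas as pd

# Synthetic hourly basin-mean air temperature stamped in UTC
# (a stand-in for the ERA5-derived series).
idx = pd.date_range("2013-01-01 00:00", periods=72, freq="h", tz="UTC")
hourly = pd.Series(
    np.random.default_rng(1).normal(-5.0, 3.0, len(idx)), index=idx
)

# Shift to Japan Standard Time (UTC+9), then average to the daily scale.
daily = hourly.tz_convert("Asia/Tokyo").resample("D").mean()
```

Because 00:00 UTC is 09:00 JST, the first JST calendar day covers only 15 of the 24 hours, which is exactly the edge effect the time-zone adjustment in the text accounts for.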

Flow Discharge data.
The flow discharge data used as the target data in this study were obtained at the Ishikari Ohashi station and are provided by the Water Information System (WIS), which is operated by the Ministry of Land, Infrastructure, Transport and Tourism, Japan (http://www1.river.go.jp/). The Ishikari Ohashi station is located at E141°32'32" N43°07'20", 26.60 km upstream from the river mouth. The original data were obtained at the hourly scale and were converted to the daily scale.

Comparison procedure
This study utilized a deep learning framework, PyTorch [18], to develop rainfall-runoff models with LSTM, RNN, and GRU. Daily precipitation and daily mean air temperature were given together to the model as input; the target data are the daily flow discharge at the Ishikari Ohashi station. The dataset was divided into three subsets: the training dataset (1998-2009), the validation dataset (2010-2012), and the test dataset (2013-2015). The models of each deep learning method were trained with each combination of the hyperparameters (Tables 1-9). To account for the randomness contained in the initial states of the learnable parameters, each method with each hyperparameter combination was trained 100 times. The result with the best loss for the validation dataset was selected and then evaluated with the test dataset. The three methods were then compared with respect to statistical values for the test dataset: the correlation coefficient (R), root mean square error (RMSE), and Nash-Sutcliffe efficiency (NSE). RMSE decreases as the predicted values approach the observed values; R and NSE indicate higher accuracy as they approach 1.0, and accuracy is generally regarded as high when NSE is 0.5 or more. These statistical values for the training and validation datasets were also investigated to check the fluctuations in the results due to the randomness of the initial states.
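The two error statistics can be written compactly as follows; this is a generic sketch of the standard RMSE and NSE definitions (R would be the Pearson correlation, e.g. via `np.corrcoef`), not code from the study itself.

```python
import numpy as np

def rmse(obs, sim):
    """Root mean square error: decreases as simulated values approach observations."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1.0 is a perfect fit; values of 0.5 or
    more are commonly read as acceptable accuracy."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))
```

NSE compares the model's squared error against that of the trivial predictor "always forecast the observed mean", so NSE = 0 means the model is no better than the mean flow.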

Results and Discussions
First, IDL and the optimization algorithm were set to 365 and Adam, respectively, and the five HSLs were tried with all the methods. The worst results were found with HSL=100 for LSTM, HSL=100 for RNN, and HSL=50 for GRU. On the other hand, GRU shows better results than LSTM and RNN with all the HSLs. The R value for GRU is 0.017-0.041 higher than that for LSTM and 0.043-0.083 higher than that for RNN. The NSE value for GRU is 0.035-0.085 higher than that for LSTM and 0.078-0.179 higher than that for RNN. The RMSE value for GRU is 15.6-36.5 m³/s smaller than that for LSTM and 33.9-69.8 m³/s smaller than that for RNN. Figures 2-4 show R, NSE, and RMSE, respectively, for the training and validation datasets obtained by the three methods with the five HSLs. LSTM and GRU clearly show better results than RNN with respect to all three statistical values, and GRU is slightly better than LSTM with respect to the median values. Meanwhile, LSTM and GRU obtained better accuracy with a longer HSL, although no such relation between model accuracy and HSL appears for the test dataset. The simulation for the test dataset was obtained with the trained model that showed the best losses for the validation dataset among the 100 trained results, as explained above. This indicates that the best result for the validation dataset may not be the best for the test dataset, although there is some consistency between them.
Next, HSL and the optimization algorithm were set to 50 and Adam, respectively, and the five IDLs were tried with all the methods. Tables 4-6 show the three statistical values (R, NSE, and RMSE) for the test dataset. For LSTM, the best results were obtained with IDL=180: R, NSE, and RMSE are 0.894, 0.796, and 176.1 m³/s, respectively. RNN obtained the best accuracy with IDL=100: R, NSE, and RMSE are 0.877, 0.766, and 188.3 m³/s, respectively. GRU showed the best results with IDL=365: R, NSE, and RMSE are 0.909, 0.813, and 168.5 m³/s, respectively. LSTM and GRU tend to show better accuracy with a longer IDL. Similar to the above comparison, GRU shows better results than LSTM and RNN with all the IDLs. The R value for GRU is 0.004-0.031 higher than that for LSTM and 0.004-0.059 higher than that for RNN. The NSE value for GRU is 0.005-0.044 higher than that for LSTM and 0.005-0.100 higher than that for RNN. The RMSE value for GRU is 2.42-18.7 m³/s smaller than that for LSTM and 1.6-39.1 m³/s smaller than that for RNN. Figures 5-7 show the corresponding R, NSE, and RMSE values, respectively, for the training and validation datasets. All three statistical values are worse with IDL=10 for all three methods; with the other IDLs, the results show features similar to those of the HSL comparison above.
For the last comparison, HSL and IDL were set to 50 and 365, respectively, and four optimization algorithms were tried with all three methods. With respect to the statistical values (R, NSE, and RMSE) for the test period (Tables 7-9), LSTM and GRU show the highest accuracy with Adagrad, while RNN shows the best with Adam. R, NSE, and RMSE for LSTM with Adagrad are 0.903, 0.808, and 170.8 m³/s, respectively; those for GRU with Adagrad are 0.912, 0.823, and 164.0 m³/s, respectively. Unlike the above two comparisons, GRU is not always the best among the three methods: LSTM is the best with RMSprop and Adadelta. Figures 8-10 show R, NSE, and RMSE, respectively, for the training and validation datasets obtained by the three methods with the four optimization algorithms. The figures show that RNN was stably trained only with Adagrad and Adam; in particular, RNN could not be properly trained with RMSprop. The training process of GRU was also very unstable with RMSprop. Only LSTM is relatively stable with all the optimization algorithms.
In previous studies, the three methods have been compared under individual optimizers, but they have rarely been compared simultaneously, and comparisons across HSL or IDL are also scarce. The comparisons in this study therefore clarify which method is suitable under each hyperparameter setting.
The results of this study show that GRU has a high potential to achieve high accuracy for rainfall-runoff modelling at a snow-dominated watershed, even though LSTM has the most complex structure among the three methods. GRU showed the best accuracy with most combinations of hyperparameters. However, GRU was unstable when RMSprop was used as the optimization algorithm. These results indicate that comparisons among the methods should be conducted with several combinations of hyperparameters.

Conclusions
This study compared three deep learning methods, the traditional RNN, LSTM, and GRU, for rainfall-runoff modelling at a snow-dominated watershed, which has long-term dependencies between the input data (meteorological data) and the target data (flow discharge). To consider the effects of the selection of hyperparameters, the comparisons were conducted with several combinations of hyperparameters. The results show that GRU achieved the best accuracy among the three methods with most of the combinations of hyperparameters. On the other hand, GRU was worse than LSTM with a few combinations. This indicates that comparisons among the methods should be conducted with various combinations of hyperparameters.