Daily rainfall modeling using Neural Network

In the early 2020, Indonesia experienced flooding in several areas. This disaster caused a lot of damage and losses. One of the causes of flooding in Indonesia is due to high rainfall. This was not anticipated beforehand so there was a flood. Therefore, research on rainfall in Indonesia is very important to anticipate floods. If it is predicted that rainfall is very high and conditions do not allow it to accommodate, the government can prepare watersheds so that rainwater can flow and not be trapped. In this research, the rainfall data were obtained from Meteorological, Climatological, and Geophysical Agency (BMKG Indonesia), then the analysis of rainfall data in Indonesia was carried out. There are several statistical methods that can be used. There are ARIMA and Neural Network. In this research, the results of ARIMA model are used as input variables in the Neural Network model. Then there are several numbers of hidden layer in the Neural Network model that are compared. The results of ARIMA model and Neural Network model showed that Neural Network model is better than ARIMA model, because the mean square error (MSE) value of Neural Network model is smaller than ARIMA model.


1.
Introduction Floods are events that have occurred in Jakarta for a long time, even before Indonesia's independence. Some actions were even taken to overcome it, but it seems that it has not been successful. The construction of several canals could not prevent the flood at that time. The Governor of DKI Jakarta, Mr Ali Sadikin, had worked with foreign parties to build reservoirs in the city and the construction of new waterways. However, major floods still occurred in early 1976. During the leadership of Mr Sutiyoso, Jakarta experienced severe floods that caused casualties. There are 80 people died, 320,000 were displaced, as well as substantial material losses. Major floods then occurred again in 2015 and 2020. For rainfall, Meteorological, Climatological and Geophysical Agency (BMKG) noted that the rainfall that occurred in early 2020 was the highest rainfall since 1996 [1].
According to the National Disaster Management Agency (BNPB), floods in Jakarta in early 2020 submerged 308 urban communities with a maximum water level reaching six meters. While the death toll reached 60 people, with a total of 92,621 refugees scattered in 189 points. Such a large flood is not new in Jakarta. Prior to this, there were at least five major floods at DKI Jakarta in 2002, 2007, 2013 and 2014. The impacts of this such as death tolls, the distribution of flood points to the number of refugees, then it can be called 2007 to be the worst flooding in Jakarta [2].
Meteorological, Climatological and Geophysical Agency (BMKG) Indonesia explained that the occurrence of climate change increases the risk and opportunities of extreme rainfall so that it triggers floods in Jakarta. In addition, there was a repetition of high rainfall for several years. The cause of flooding in Jakarta is not only a matter of extreme rainfall and meteorological phenomena. However, there are several other factors such as the large amount of water runoff from the upstream area, the reduction in reservoirs and lakes where flood waters are stored. In addition, the problem of narrowing and the river's shallowness due to sedimentation and full of garbage, tidal soaking due to sea level and ground subsidence factors that increase the risk of standing water [3].
The negative impact caused by the flood needs to be done in prevent flooding. The problem is that there is no rainfall prediction that can be used to anticipate flooding. One of the efforts that can be done is to forecast rainfall data. Because if high rainfall can be predicted, action can be taken to overcome it. Therefore, in this research rainfall forecasting is done in Indonesia. There are several time series analysis methods that can be used for forecasting. The most commonly used method is ARIMA. But along with the development of technology and science, several new methods have emerged, one of which is Machine Learning. Because of this rainfall forecasting is done using the Neural Network. This study aims to perform daily rainfall modelling in Indonesia using several methods, namely ARIMA and Neural Network. Then the results of rainfall forecasting using ARIMA and Neural Network will be compared.

2.
Rainfall Indonesia has two types of seasons, that are rainy season and dry season. The problem that occurs in Indonesia during the rainy season is the occurrence of floods in several locations. One of the reasons is due to high rainfall in several locations, in addition to high altitude conditions in the area. Rainfall measurement is a measurement of the amount of rainwater that fall on a flat surface at a certain period in millimeters (mm) with the assumption that the collected rainwater in the flat does not evaporate, does not sink in and does not flow [4]. The rainfall of one mm means that a flat area of one square meter is accommodated by one millimeter or one liter of rainwater [5].

3.
Autoregressive Integrated Moving Average (ARIMA) Model Autoregressive Integrated Moving Average (ARIMA) is one of the time series methods that can use to forecast the value for next period. ARIMA models also known as Box-Jenkins models [6]. The stationarity of a time series is related to its properties in time. The definition of stationarity or weal stationarity are as follows.
• The expected value of the time series does not depend on time.
• The autocovariance function defined as ( , + ) for any lag k is only a function of k is only a function of k and not time: that is ( ) = ( , + ). For a time-invariant and stable linear filter and a stationary input time series with = ( ) and ( )= ( , + ), the output time series is also a stationary time series (2) General model of ARIMA(p,d,q)×(P,D,Q,) S as follows.
where is the backshift operator, p is the order of non-seasonal AR, d is the differencing of non-seasonal and q is the order of non-seasonal MA. While S is time span of repeating seasonal pattern, P is the order of seasonal AR, D is the differencing of seasonal and Q is the order of seasonal MA.
where ( ) and ( ) are the non-seasonal component of ARIMA model. Then ( ) and ( ) are the seasonal component of ARIMA model. 1 , … , and 1 , … , are the parameters of AR for non-seasonal and seasonal component that must be estimated. While 1 , … , and 1 , … , are the parameters of MA for non-seasonal and seasonal component that must be estimated.

4.
Artificial Neural Network Artificial Neural Network (ANN) is a nonlinear data processing that is built from an interconnected or connected which called a neuron and there is a weight for each connection. ANN was applied for data pattern identification analysis or data classification [7]. In this research, the data used is time series data, then the input layer in the Neural Network model is the lag of the time series at the target output.  Figure 1 shows a single hidden layer of Neural Network. There are three layers namely input layer, hidden layer, and output layer [8].
where is the number of nodes in input layer, is the threshold value of hidden layer, is the connection weight of input layer and hidden layer, and is the lag of the time series in the input layer. Then using the activation function that has been determined in the hidden layer, the values in the hidden layer are obtained as follows.
The function for is called an activation function of hidden layer. There are linear and nonlinear activation functions. Some activation functions in ANN are follows [9].
• An identity or linear function • The binary step function • The binary sigmoid function or S-shaped curve • The bipolar sigmoid function Furthermore, the value of output layer can be calculate based on the activation function that has been determined in the output layer as follow [8].
where ℎ is the number of nodes in hidden layer, is the threshold value of output layer, is the connection weight of hidden layer and output layer, and is the forecast value of point t. To get the thresholds and weights in the Neural Network model, the backpropagation algorithm is used to update these values. Backpropagation architecture is presented in the figure 2. In the backpropagation algorithm there are forward and backward phases. The straight arrow line in figure 2 shows the forward phase, while the dashed arrow line shows the backward phase. In the forward phase, weights are calculated starting from the input layer to the output layer using the specified activation function. Then the error value is obtained which is the difference between the forecast value with the target value. While in the backward phase, the error is propagated backward starting from the output layer to the input layer, then get new weights that minimize the error [9].

5.
Criteria for selecting the best model In this research, ARIMA and NN are used to model rainfall. Based on the two models, the best model will be chosen to the most appropriate model. In determining the selection of the best model, several criteria are needed. There are several criteria for selecting the best model including AIC, RMSE and MAE. AIC (Akaike Information Criteria) is one of the methods that used to select the best regression model found by Akaike and Schwarz [10]. According to Wei (2018) Enders (2004), it is necessary to determine the appropriate number of lag dependent variables [12]. In determining the optimal lag there are several fairly efficient ways, such as Akaike Information Criterion (AIC) and Schwarz 'Bayesian Information Criterion (SBC). The best model is the one that gives the smallest residual or error rate.
Root Mean Square Error (RMSE) is a popular method for assessing machine learning algorithms, including algorithms that are far more sophisticated than linear regression [13]. The RMSE value is used to distinguish the performance of the model in the trial period and the validation period [14]. Besides that, the Mean Absolute Error (MAE) can also be used to choose the best model. The MAE is the average of the absolute values of the difference between the forecast value and the observed value [15]. The RMSE and MAE value defined by the following formula.
where n : number of observations : observed value ̂ : forecasting value

Data and Methodology
The data that used in this research were obtained from the Climatology, Meteorology and Geophysics Agency (BMKG) Indonesia [16]. The variable that be used in this research is rainfall data at Region I, which is the western region of Indonesia. The rainfall data used is the daily rainfall from January 2008 to December 2019. This data is time series data, then the analysis carried out is time series analysis. The stages of analysis to achieve the research objectives are as follows.

7.
Results and Discussion Before the modelling was performed to the rainfall dataset, data exploration was carried out using the autocorrelation function (ACF) plot and the partial autocorrelation function (PACF) plot. Figure 3 is the ACF and PACF plot of the rainfall data. The identification results from the ACF and PACF plots showed that visually the data is stationary because the data is cut off at a certain lag. But to ensure stationarity in the rainfall data, the test was carried out using the Augmented Dickey-Fuller (ADF) test. The result of test statistics is -8.9025 and pvalue is 0.01. The hypothesis of stationary test as follows. H0 : Data is not stationary H1 : Data is stationary Because the p-value less than = 10% then H0 is rejected. It means that the data is stationary. Then there is no need for transformations or differencing on the data. Furthermore, ARIMA modeling is carried out on rainfall data because the data is stationary. Then the ARIMA model does not use a difference (d = 0) because the data is stationary.
Based on the significant lag in the ACF and PACF plots, there are several ARIMA models as p and q value that can be tried. Because there is no indication of seasonal pattern then s = 0. Then ARIMA models only uses regular ARIMA which is ARIMA(p,d,q). There are several ARIMA models that have been applied on rainfall data that have significant parameters, including those presented in table 1. If the orders of p and q increase, then the parameters are not significant in the model. Based on the result in table 1, it can be concluded that the best model is ARIMA(2,0,1) because this model has The hypothesis of testing the significance of the parameters. H0 : = 0 H1 : ≠ 0 Table 2 showed that the p-values of all parameters less than = 10%. It means that all of parameters are significant in the ARIMA model. Furthermore, the residuals in the ARIMA model must meet the assumptions. Residuals have to follow the white noise process and residuals must be normally distributed. Ljung-box test is used to test the residuals follow white noise process and Kolmogrov Smirnov test is used to test the residuals following normally distribution. The hypothesis of the Ljung-box test as follows. H0 : Residuals are white noise. H1 : Residuals are not white noise. The result of Ljung-Box test showed that test statistics is 0.0013502 and p-value is 0.9707. Because the p-value greater than = 10% then do not reject H0. It means that the residuals are white noise. Then the assumption of ARIMA model is fulfilled. The hypothesis of the Kolmogorov Smirnov test as follows. H0 : Residuals are normally distributed. H1 : Residuals are not normally distributed. The result of Kolmogorov Smirnov test showed that test statistics is 0.24235 and p-value = 2.2 x 10 -16 . Because the p-value less than = 10% then reject H0. It means that the residuals are not normally distributed. Then the assumption of ARIMA model is not fulfilled.
Since the ARIMA model did not fulfilled the assumption that residuals are normally distributed, then as an alternative to the ARIMA model, the Neural Network model can be used. In this research, the NN model uses a non-linear model, then the activation function used is logistic function or binary sigmoid function. Neural network modelling is very important in determining the number of inputs in the NN model. The input variables on NN model based on the significant lag in the ARIMA model. In this NN model, the input variables are Yt-1 and Yt-2. This is based on the AR parameters in ARIMA model showing that the best ARIMA model is ARIMA(2,0,1). Then the next step is to determine the number of nodes in hidden layer. There are several numbers of nodes in hidden layer that have been tried in the Neural Network model to get the optimum number of nodes in the hidden layer. The comparison of Neural Network model for different number of nodes in the hidden layer can be seen in table 3.  The others number of nodes in hidden layer does not converge then the parameters in the NN model are not obtained. Based on the optimum nodes in hidden layer that show in table 3, there are 6 nodes in hidden layer. Figure 4 show the best Neural Network model with 2 nodes in input layer, 6 nodes in hidden layer and 1 node in output layer. It can be written as NN(2,6,1).

Conclusion
There are several methods that can be used to analyze time series data. The Autoregressive Integrated Moving Average (ARIMA) model is one of the most frequently used methods for analyzing time series data. But this method can be use if the data is stationary and the model have some assumptions. The result of ARIMA model for rainfall data in this research showed that data is stationary but the assumption of residuals are normally distributed do not met in the ARIMA model. Therefore, the analysis is carried out using the neural network method as an alternative to the ARIMA model. The input variable of Neural Network model is determined from the significant lag in the ARIMA model. Then the input variables of NN model are Yt-1 and Yt-2. Based on the experimental results, the number of nodes in the hidden layer concluded that the optimum number of nodes is 6 nodes in the hidden layer. Then in the NN model are determined 2 nodes in the input layer, 6 nodes in the hidden layer and 1 node in the output layer. The comparison between ARIMA model and NN model showed that NN model is better than ARIMA model. Because the RMSE and MAE value of NN model is smaller than ARIMA model. Based on this analysis, if we want to forecast the daily rainfall for today, we must consider the rainfall of yesterday and 2 days ago. There are some limitations in this research such as the number of nodes in the input layer, the activation function which only uses a logistic function, and the data is univariate time series. Then in the future research, we can use more nodes in the input layer, use other activation functions and try to use multivariate time series analysis.