Predicting COVID-19 Confirmed Case in Surabaya using Autoregressive Integrated Moving Average, Bivariate and Multivariate Transfer Function

In March 2020, the first case of COVID-19 was found in Indonesia. The numbers of confirmed, suspected, and exposed cases in Surabaya have also increased significantly. Some studies show a relation between temperature, humidity, and the numbers of suspected and exposed patients in an area and the number of confirmed COVID-19 cases. Statistical techniques that can be used to analyze this relationship and predict confirmed cases include ARIMA and the bivariate and multivariate transfer functions. The aim of this study is to compare the performance of these three models and determine the best one. The performance on the training data for ARIMA is 0.376, which shows that the accuracy of the model is 37.6%. The accuracy of the bivariate transfer function is 0.409, and the accuracy of the multivariate transfer function is 0.478. On the testing data, the performance of ARIMA is 0.074, of the bivariate transfer function 0.055, and of the multivariate transfer function 0.108. The multivariate transfer function is the forecasting model with the best performance in this case.


Introduction
Recently, coronavirus disease (COVID-19) has become a serious public health concern around the world. Outbreaks of infectious diseases such as COVID-19 can also significantly affect many aspects of each region. On March 2, 2020, the first case of COVID-19 was found in Indonesia, and the number of confirmed, suspected, and exposed cases has continued to increase since then. From the beginning of the outbreak until the last data were obtained in August 2020, Surabaya was the city with the second-highest number of confirmed, suspected, and exposed COVID-19 cases in Indonesia, based on government data on daily new cases of COVID-19 [1].
Confirmed COVID-19 patients are patients infected with COVID-19 whose PCR swab test results are positive. A suspected patient is a person with an acute respiratory infection, i.e. fever (> 38°C) or a history of fever, accompanied by any of the symptoms or signs of respiratory disease, namely cough, shortness of breath, sore throat, runny nose, or mild to severe pneumonia, with no other cause based on a conclusive clinical picture. In addition, in the 14 days before symptoms appear, a suspected patient has a history of travelling to or living in a country or region that has reported COVID-19 transmission. Exposed patients are people who have a fever (≥38°C) or a history of fever, or respiratory system disorders. In the 14 days before symptoms appeared, exposed patients had a history of travelling to or living in a country or region that had reported COVID-19 transmission.
IOP Publishing doi: 10.1088/1757-899X/1077/1/012055
Based on government data, the number of confirmed, suspected, and exposed patients in Surabaya has also increased significantly every day. Coronavirus transmission can be influenced by many factors, including climatic conditions such as temperature (temp) and relative humidity (RH) [2].
Several previous studies have examined the influence of climatic conditions on the spread of COVID-19 in a region. Fabiola studied the influence of temperature, evaporation, rainfall, and regional climate on the local transmission of the SARS-CoV-2 coronavirus in 31 states and the capital of Mexico [3]. The statistical analysis used the Local Transmission Ratio (LTR), calculated by region, and the number of days of effective infection since the regional onset in each state. The results indicate that the regional onset appeared earlier in dry climates, which is attributed to lower temperatures than in other regions and higher rainfall than in tropical climates [3].
Research by Mazhar et al. analyzes the relation between COVID-19 and local climate parameters using existing global climate scales [4]. The average daytime working hours correlate with the total cases of COVID-19 with a coefficient of determination of 0.42, and the average high temperature shows coefficients of 0.59 and 0.42 with the total number of COVID-19 cases and the number of deaths, respectively [4]. These studies show that a variety of methods can be used to establish the relation between regional climate variables and COVID-19 transmission in a region. In this research, we predict the number of COVID-19 transmissions based on regional climate variables, i.e. temp and RH.
Previous research predicted the transmission of COVID-19 in Canada using a Long Short-Term Memory (LSTM) network and found that the transmission rate in Canada follows a linear trend, while in the USA it grows exponentially. Another popular and frequently used statistical technique is the Autoregressive Integrated Moving Average (ARIMA) model [5]. Farajzadeh et al. used the ARIMA model to forecast monthly rainfall in the northwest of Iran [6]. ARIMA has also become a valuable basis for modelling multivariate phenomena, for example Vector Autoregression (VAR) for rainfall forecasting [6], vector transfer functions [7], and spatio-temporal models [8].
Based on these previous studies, this research compares the performance of three methods for predicting the number of confirmed COVID-19 cases, i.e. ARIMA and the bivariate and multivariate transfer function methods, which have also shown good performance in predicting time series data in research on other topics. We use the bivariate and multivariate transfer function methods to evaluate the effectiveness of adding external (independent) variables compared with ARIMA, which uses only the confirmed-case variable. The dependent variable in this research is the number of confirmed cases, and the independent variables are exposed COVID-19 cases, suspected COVID-19 cases, temp, and RH.

Case Study
The daily observations run from March 23 until August 2, 2020, giving 131 records. The numbers of confirmed, suspected, and exposed COVID-19 cases in Surabaya were provided by the COVID-19 Response Acceleration Task Force and are available online at https://covid19.go.id/peta-sebaran.
The climate data used in this research were obtained from the Meteorology, Climatology, and Geophysics Agency for the cities of Surabaya and Sidoarjo, Class I Juanda Meteorological Station, WMO-ID: 96935 (http://dataonline.bmkg.go.id/). The data are daily. This weather station observes a 30 km² area covering Surabaya, Sidoarjo, and the surrounding area. The daily average temperature and humidity were used in our research to forecast confirmed cases.
The variables are defined as follows: confirmed is anyone who has tested positive for COVID-19 in a laboratory; suspected is anyone who has symptoms like those of a person positive for COVID-19 but has not tested positive for COVID-19 in a laboratory; and exposed is anyone who has a history of contact with probable or confirmed cases of COVID-19 [9]. Figure 1 shows the fluctuation in the number of confirmed COVID-19 cases. Based on the ACF and PACF correlograms in Figure 2 and Figure 3, the time series can be identified as already stationary, because the ACF decreases exponentially and the PACF cuts off after the 2nd lag [10].

Data Preprocessing
Stationarity can also be checked with a formal test, namely the augmented Dickey-Fuller test (adf.test). The alternative hypothesis of the test is that the time series is stationary. The "adf.test" statistic is -3.5217 with a lag order of 15 and a p-value of 0.0432. Because the p-value is below the conventional 0.05 level, the results indicate that the confirmed COVID-19 data are stationary. The pre-processing required is to divide the data into two subsets, namely the training subset and the testing subset. The data are divided in a ratio of 70% for training and 30% for testing. The first subset consists of 92 records and the second consists of 39 records.
In the ARIMA(p, d, q) model, p is the AR model order, q is the MA model order, d is the order of differencing, and $\phi(B)$ and $\theta(B)$ are defined in equations (2) and (3).
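For reference, the standard Box-Jenkins form of an ARIMA(p, d, q) model and its AR and MA polynomials can be written as below. This is a sketch of the textbook notation only: the symbols $Z_t$ for the observed series, $a_t$ for the white-noise error, and $B$ for the backshift operator ($B Z_t = Z_{t-1}$) are assumptions and may differ from the notation used in the original equations.

```latex
\begin{align}
\phi(B)\,(1 - B)^{d} Z_t &= \theta(B)\, a_t \\
\phi(B) &= 1 - \phi_1 B - \phi_2 B^{2} - \dots - \phi_p B^{p} \\
\theta(B) &= 1 - \theta_1 B - \theta_2 B^{2} - \dots - \theta_q B^{q}
\end{align}
```

With d = 0, as for a series that is already stationary, the factor $(1 - B)^{d}$ equals 1 and the model reduces to an ARMA(p, q) model.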
The flowchart for the ARIMA model is shown in Figure 4.
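The 70%/30% division of the 131 daily records described in the preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `chronological_split` is hypothetical, the data are placeholder values, and rounding to the nearest record is an assumption chosen because it reproduces the 92/39 split reported above.

```python
def chronological_split(series, train_fraction=0.7):
    """Split a time series into consecutive training and testing parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    # Assumption: round to the nearest whole record.
    # The order of the observations is preserved (no shuffling).
    n_train = round(len(series) * train_fraction)
    return series[:n_train], series[n_train:]


# Placeholder values standing in for the 131 daily confirmed-case counts.
daily_cases = list(range(131))
train, test = chronological_split(daily_cases)
print(len(train), len(test))
```

The split is chronological because shuffling a time series would leak future observations into the training subset.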

Bivariate Transfer Function (BTF)
The transfer function model differs from the ARIMA model. Among time series models, ARIMA is a univariate model, whereas the transfer function is a multivariate model. The ARIMA model therefore relates the series only to its own past. The transfer function model relates the series not only to its own past but also to other time series, as described in Figure 5 and Figure 6. These other time series are used as predictor variables in the transfer function model.
Figure 6. Transfer function modelling.
Transfer function models can be used for single-output and multiple-output systems [6]. The first stage is pre-whitening of the input series $X_t$, which removes its existing pattern so that only a white-noise input series remains. The second stage applies the same pre-whitening transformation to the output series $Z_t$, which preserves the functional relationship between the transfer function and the noise; this transformation does not have to convert $Z_t$ to white noise. The next stage is to detect and measure the strength of the relationship using the Cross-Correlation Function [12].
In the model with one output, only one equation is required to describe the model. It is referred to as a single-input transfer function model, as in equation (7), where b is the delay, s is the order of the numerator, and r is the order of the denominator.
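The cross-correlation step can be illustrated with a small sketch. It assumes the two series have already been pre-whitened, and it uses synthetic data in which the output is simply the input delayed by two periods; the function name `cross_correlation` and all values are hypothetical and are not taken from the study.

```python
import math
import random


def cross_correlation(x, y, max_lag):
    """Sample cross-correlation between x_t and y_{t+k} for k = 0..max_lag."""
    n = len(x)
    if n != len(y):
        raise ValueError("series must have equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    # Normalise by the product of the two standard deviations (times n).
    scale = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    ccf = {}
    for k in range(max_lag + 1):
        ccf[k] = sum(dx[t] * dy[t + k] for t in range(n - k)) / scale
    return ccf


# Synthetic example: the output is the input delayed by two periods.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [0.0, 0.0] + x[:-2]
ccf = cross_correlation(x, y, max_lag=5)
best_lag = max(ccf, key=lambda k: abs(ccf[k]))
print(best_lag, round(ccf[best_lag], 3))
```

In a transfer function analysis, the first lag at which the cross-correlation becomes clearly non-zero is commonly read as the delay b.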
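As a reference for the roles of b, s, and r, the single-input transfer function model is commonly written in the Box-Jenkins form below. The notation is an assumption: $Z_t$ is taken as the output series, $X_t$ as the input series, $n_t$ as the noise series, and the sign convention of the polynomials follows the usual textbook form rather than the original equation.

```latex
\begin{align}
Z_t &= \frac{\omega_s(B)}{\delta_r(B)}\, B^{b} X_t + n_t \\
\omega_s(B) &= \omega_0 - \omega_1 B - \dots - \omega_s B^{s} \\
\delta_r(B) &= 1 - \delta_1 B - \dots - \delta_r B^{r}
\end{align}
```

For example, with b = 1, s = 0, and r = 0 the model reduces to $Z_t = \omega_0 X_{t-1} + n_t$, a regression on the input lagged by one period.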

Multivariate Transfer Function (MTF)
Like multiple regression models, transfer function models can also contain more than one input variable. In this paper, the single-input and multi-input transfer functions are applied under the assumption that all input and output series are stationary; the general form of the model is given in equation (9). The differencing technique can be applied if those variables are non-stationary.
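A common textbook way to write a transfer function model with several inputs is shown below as a sketch. The notation is assumed rather than taken from the original equation: $X_{j,t}$ is the j-th input series, m is the number of inputs, each input has its own delay $b_j$ and polynomials $\omega_j(B)$ and $\delta_j(B)$, and the noise $n_t$ is modelled as an ARMA process driven by white noise $a_t$.

```latex
\begin{align}
Z_t &= \sum_{j=1}^{m} \frac{\omega_j(B)}{\delta_j(B)}\, B^{b_j} X_{j,t} + n_t \\
n_t &= \frac{\theta(B)}{\phi(B)}\, a_t
\end{align}
```

With m = 1 this reduces to the single-input form, which is why the bivariate model can be seen as a special case of the multivariate one.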

Performance Model
The best model is evaluated using four performance measures, i.e. Root Mean Square Error (RMSE), Mean Absolute Error (MAE), R square (Rsq), and Correlation (r) [13]. A good model has small RMSE and MAE and large Rsq and r. The four performance measures are given in equations (10), (11), (12), and (13).
Following the flowchart of the ARIMA model (Figure 4), the confirmed COVID-19 data are first checked for stationarity using the time-series plot. According to figure 1, the data are stationary because the time-series plot fluctuates around a constant average value. Secondly, we identified the stationarity of the time series from the ACF and PACF patterns. The ACF correlogram is shown in Figure 2. It decreases exponentially, which suggests a nonseasonal AR model. Figure 2 and Figure 3 show that the confirmed COVID-19 data are stationary because the PACF cuts off after the 2nd lag. Neither transformation nor differencing is needed.
The confirmed data are identified as AR(2), ARMA(1,1), MA(2), etc. Based on the ACF and the PACF, several ARIMA models of the form $(p, d, q)(P, D, Q)_s$ are proposed using the training data. Table 1 shows the six significant models, of which the best is the third model, given as equation (14). This model has the lowest RMSE and the highest Rsq and was therefore selected as the best model. The PACF correlogram of the residuals of the (2 0 0) model has a white-noise pattern, so the model is acceptable. White noise means that the partial correlations of the residuals are not significant; this can be seen from the PACF values, which do not exceed the UCL-LCL limits in figure 7. In figure 8, the forecast plot does not resemble the actual testing data, which results in low accuracy values. Table 2 shows that the Rsq-value is low.
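The four performance measures named in this section can be sketched as follows. The definitions are the common textbook ones and are an assumption about the exact form of the original equations; in particular, Rsq is computed here as 1 - SSE/SST, which need not equal the square of r. The example values are illustrative, not data from the study.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination, assumed here to be 1 - SSE/SST."""
    mean_a = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - sse / sst


def correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Illustrative values only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 3.0, 7.0, 7.0]
print(rmse(actual, predicted), mae(actual, predicted))
print(r_squared(actual, predicted), correlation(actual, predicted))
```

RMSE and MAE are in the units of the data (cases per day), whereas Rsq and r are unit-free, which is why the text treats small values of the first pair and large values of the second pair as signs of a good model.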
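To illustrate what fitting and checking an AR(2) model, i.e. ARIMA (2 0 0), involves, the sketch below estimates the two coefficients from the sample autocorrelations using the Yule-Walker equations and then checks the residual autocorrelations against an approximate 2/sqrt(n) bound. This is a swapped-in, simplified technique on simulated data: it is not the estimation method or software used in the study, it does not reproduce the coefficients of equation (14), and all names and values are hypothetical.

```python
import math
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    return sum(dev[t] * dev[t + lag] for t in range(n - lag)) / sum(d * d for d in dev)


def fit_ar2_yule_walker(series):
    """Solve the Yule-Walker equations of an AR(2) model for phi1 and phi2."""
    r1 = autocorrelation(series, 1)
    r2 = autocorrelation(series, 2)
    phi1 = r1 * (1.0 - r2) / (1.0 - r1 ** 2)
    phi2 = (r2 - r1 ** 2) / (1.0 - r1 ** 2)
    return phi1, phi2


def ar2_residuals(series, phi1, phi2):
    """One-step-ahead residuals of the fitted AR(2) model on the mean-centred series."""
    mean = sum(series) / len(series)
    dev = [v - mean for v in series]
    return [dev[t] - phi1 * dev[t - 1] - phi2 * dev[t - 2] for t in range(2, len(dev))]


# Simulated AR(2) series with known coefficients (0.5 and 0.3), not study data.
rng = random.Random(0)
z = [0.0, 0.0]
for _ in range(5000):
    z.append(0.5 * z[-1] + 0.3 * z[-2] + rng.gauss(0.0, 1.0))

phi1, phi2 = fit_ar2_yule_walker(z)
residuals = ar2_residuals(z, phi1, phi2)
bound = 2.0 / math.sqrt(len(residuals))  # approximate white-noise limit
residual_acf = [autocorrelation(residuals, k) for k in range(1, 6)]
print(round(phi1, 2), round(phi2, 2))
print([abs(r) < bound for r in residual_acf])
```

Residual correlations that stay inside the bound correspond to the white-noise pattern described above for the residual correlogram.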

Bivariate Transfer Function (BTF)
The prediction of the model is improved by adding one predictor variable. This model is called the bivariate input transfer function. The independent variables are used as predictors one at a time to find the best BTF model. According to figure 10, the number of exposed does not affect the number of confirmations, so it is not suitable for BTF. The Cross-Correlation Function (CCF) between the suspected variable and confirmed COVID-19 cases shows a relationship, which is supported by the significance of the BTF (suspected and confirmed) model. Taken individually, the other three variables influence the number of confirmed cases.
The best predictor is the suspected variable. BTF performs better than ARIMA, so it can be concluded that the BTF model is better. In figure 9, the "training - testing" plot of actual against predicted values for the BTF (suspected - confirmed) model is better than that of ARIMA, and the graph can follow the fluctuations in the observed data. Table 3 shows that the suspected variable is the best predictor of confirmed COVID-19 cases in the BTF model. However, the testing accuracy is still low, as shown in table 4.

Multivariate Transfer Function (MTF)
The third model is the MTF, in which four predictor variables are used as input. The results of the training process for the MTF model are shown in table 5. As in the BTF model, the exposed variable is not significant as a predictor. Meanwhile, the suspected, temperature, and humidity variables are significant as predictors. The training results show that the performance of the MTF model is quite good. In comparison, the performance of the testing process, shown in table 6, has a low Rsq-value. Figure 12 shows that the training results are close to the actual values, but the test results still cannot follow the fluctuation of the real data.
Table 7 summarizes the performance comparison of the ARIMA, BTF, and MTF models. These results indicate that the MTF has better performance than the other two models.

Conclusion
The ARIMA model has the worst training and testing performance because it cannot follow the actual data pattern. This is also evidenced by its accuracy, which is lower than that of the other two methods: the Rsq value is 0.376 for training and 0.074 for testing. The BTF model has better accuracy than ARIMA because its test results can, relatively, follow the real data pattern. However, its correlation is small: the Rsq value is 0.409 for training and 0.055 for testing. The best model is MTF because it has better accuracy than the other two models, with values of 0.478 for training and 0.108 for testing. For future research, we recommend using more data and trying other prediction methods such as Recurrent Neural Network (RNN) or Long Short-term Memory (LSTM).