Prediction Modelling of COVID-19 on Provinces in Indonesia using Long Short-Term Memory Machine Learning

COVID-19 is a dangerous disease that the World Health Organization (WHO) has declared a pandemic. Many countries, including Indonesia, have adopted policies to control the spread of the virus and have played an active role in overcoming this global pandemic. Indonesia consists of many islands, so the level of spread varies between regions. Although the mortality rate is lower than the recovery rate, the spread of the virus must still be controlled. This paper aims to model the prediction of infected cases, recoveries, and deaths from COVID-19 for each province in Indonesia using the Long Short-Term Memory (LSTM) machine learning method. The models are evaluated using the root mean squared error (RMSE).


Introduction
World activities changed significantly when the World Health Organization (WHO) declared coronavirus disease 2019 (COVID-19) a global pandemic on March 11, 2020, creating a global health emergency. COVID-19 is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) and has become one of the most widespread potentially fatal diseases. The virus is transmitted from person to person by droplets, so the government introduced a health protocol of maintaining distance, wearing masks, and washing hands. The virus attaches to the host cell and can cause an inflammatory reaction. After an incubation period of up to 14 days, patients can show fever, leucopenia, respiratory syndrome, thrombocytopenia, and multi-organ failure, and the disease can cause death. An understanding of the characteristics of this virus is needed to control the mortality rate [1]. A high mortality rate is usually influenced by comorbidities in infected patients, vulnerable age, and inadequate health facilities [2]. The pandemic has drawn attention to global connectivity, which has brought about changes worldwide, including in Indonesia.
When cases spiked in Indonesia after the announcement, several regions imposed large-scale social restrictions, and agencies and schools were closed. On the other hand, air quality was notably good during this period because there was little air pollution [3]. The number of cases keeps increasing over time, so various measures are taken to control the spread of the virus. This paper aims to model the prediction of the spread of the virus, the number of recoveries, and the number of deaths in Indonesia's provinces. The modelling uses a long short-term memory (LSTM) machine learning model. The model for each case type in each province is evaluated using the root mean squared error (RMSE). The resulting prediction models can be used to examine broader implications for a region and trends in the short and long term [4]. These predictions are used to evaluate the intensity of the pandemic. Public policy and individual awareness are interrelated in overcoming this pandemic. Computational intelligence techniques are widely applied in prediction models because their performance can reach an acceptable level under measurable evaluation. This makes such techniques, and the machine learning algorithms built on them, a candidate solution to future problems.

Related Works
COVID-19 has caught the attention of many countries. Scientists are trying to predict its spread as a means of mitigating the pandemic. Such predictions serve as a strategy for the public health system and an effort to minimize the death rate of people with COVID-19. The study in [5] ranked the models it compared, from best to worst, as Bi-directional Long Short-Term Memory (Bi-LSTM), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Support Vector Regression (SVR), and Auto-Regressive Integrated Moving Average (ARIMA). Deep learning applied to time series can capture short- and medium-term dependence with adaptive learning. For long-term prediction, the researchers in [6] implemented multivariate LSTM. Their comparison, aimed at obtaining a reliable model, shows that the stacked LSTM algorithm has a higher accuracy. The research in [7] modeled the pandemic using recurrent neural network (RNN) based LSTM variants, namely bidirectional LSTM, convolutional LSTM, and stacked LSTM, to predict one month ahead. The best prediction model was the convolutional LSTM, with high accuracy and fewer errors.
Building on these studies, which applied LSTM methods to COVID-19 cases, the authors use the LSTM model to predict the spread of COVID-19 infection, the number of patients who recovered from COVID-19, and the number of COVID-19 patients who died. The motivation is the high level of COVID-19 disease in Indonesia [8].

Research Methodology
This section describes the dataset and the method used in the study. The dataset is open data, so it is easy to obtain from the website. The data consist of records per region/province, each marked by its longitude and latitude in Indonesia. The dataset is described by the amount of data used in the study, the mean, the standard deviation (std), and the daily maximum number of cases (max) for each subject. The case data are the daily numbers of cases infected with COVID-19, patients who have recovered from COVID-19 infection, and patients who have died after being infected with COVID-19. The dataset is divided into training data and test data with a ratio of 80:20. The method employed in this paper is a nested ensemble model or machine learning model using deep learning methods based on the Long Short-Term Memory (LSTM).
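As an illustration of the data preparation described in this paper, the following minimal sketch replaces blank entries with zero, computes the mean, standard deviation, and daily maximum of one series, and splits it 80:20 into training and test data. The function names, the sample values, and the chronological (unshuffled) split are assumptions made for illustration; the paper does not state how the split was ordered.

```python
from statistics import mean, stdev


def fill_blanks(series):
    """Replace missing reports (None) with zero, i.e. no cases reported."""
    return [0 if value is None else value for value in series]


def describe(series):
    """Return the count, mean, sample standard deviation, and daily maximum."""
    return {
        "count": len(series),
        "mean": mean(series),
        "std": stdev(series),
        "max": max(series),
    }


def split_train_test(series, train_ratio=0.8):
    """Split a time series in order: first part for training, rest for testing."""
    cut = int(len(series) * train_ratio)
    return series[:cut], series[cut:]


# Hypothetical daily case counts for one province (not taken from the dataset).
daily_cases = fill_blanks([0, None, 2, 5, 3, None, 8, 13, 9, 21])
summary = describe(daily_cases)
train, test = split_train_test(daily_cases)
```

Keeping the split in time order means the test data always come after the training data, which is one common choice for forecasting tasks.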

Datasets
The COVID-19 datasets for this research were taken from https://data.humdata.org/dataset/indonesia-covid-19-cases-recoveries-and-deaths-per-province [9]. The dataset covers 34 provinces in Indonesia. The longitude and latitude that represent each province are listed in Table 1, and these positions are plotted on a map of Indonesia in Figure 1. Indonesia consists of thousands of islands and is divided into 34 provinces. Some of these islands are separated by vast stretches of ocean. The spread of the virus in big cities is relatively high because human interaction there is intense. As a result, the policies adopted differ somewhat from one city to another. The distribution of confirmed COVID-19 cases, recovered patients, and deaths is shown in Figure 2.
Figure 2. Plotting spatial data on 34 provinces in Indonesia.
The data used in this paper are daily counts rather than cumulative sums. The data for COVID-19 infected cases were taken from March 15, 2020, to October 29, 2020. The data for recovered COVID-19 patients were taken from March 21, 2020, to October 29, 2020, and the data for COVID-19 patients who died cover the same period. Blank entries in the dataset, which mean that no cases were reported, are replaced with zero. This research is based only on the dataset obtained; unreported cases are not included. The true number of people infected with the virus may therefore be much larger, because people who do not report themselves to the authorized officers are not recorded [9]. This paper discusses only the predictions obtained from the LSTM model, so that the proposed prediction model can be used to anticipate future pandemics. The data for each province are listed in Table 2.

Long Short-Term Memory
The first step of the LSTM is to decide which information will be removed from the cell state $C_{t-1}$, using a sigmoid gate called the forget gate $f_t$. This gate reads the values of $h_{t-1}$ and $x_t$ and produces, for each component of $C_{t-1}$, a value between '0', which stops the element, and '1', which forwards it, as depicted in Figure 4. The symbol $[h_{t-1}, x_t]$ denotes a concatenation operation, $W$ is the weight of the gate input, and $b$ is the bias of the gate.
The next step is to decide which new information to store in the cell state $C_t$. This process has two parts: a sigmoid gate, known as the input gate $i_t$, which determines the information to be updated, and a tanh layer that generates a new candidate vector $\tilde{C}_t$. The two parts are combined to produce the updated information, as shown in Figure 5.
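In the conventional LSTM notation, the forget gate and the two parts of the input step can be written as follows. Here $h_{t-1}$ is the previous hidden output, $x_t$ is the current input, $f_t$ is the forget gate, $i_t$ is the input gate, and $\tilde{C}_t$ is the candidate vector. The subscripts on the weights $W$ and biases $b$, and the symbol $\sigma$ for the sigmoid function, are notational assumptions that follow the standard formulation rather than symbols confirmed by the paper.

```latex
\begin{align}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
\end{align}
```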
This step adds new information to the cell state to replace the elements that were forgotten. To update the old cell state $C_{t-1}$ to the new cell state $C_t$, the old cell state is multiplied by $f_t$ to forget the information that was chosen to be discarded. The candidate $\tilde{C}_t$ is multiplied by $i_t$ to determine how much of each candidate value to include, and the two products are added, as shown in Figure 6.
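The cell state update can be written as a single equation in the standard notation, where $C_{t-1}$ and $C_t$ are the old and new cell states, $f_t$ is the forget gate, $i_t$ is the input gate, and $\tilde{C}_t$ is the candidate vector. The symbol $\odot$ for element-wise multiplication is a notational assumption.

```latex
\begin{equation}
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
\end{equation}
```

As a small worked example with illustrative values, if $f_t = 0.5$, $C_{t-1} = 2$, $i_t = 0.25$, and $\tilde{C}_t = 0.8$ for one component, then $C_t = 0.5 \times 2 + 0.25 \times 0.8 = 1.2$.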
This completes the dropping of old information and the addition of new information decided in the previous steps. The output is based on a filtered version of the cell state. A sigmoid gate called the output gate $o_t$ decides which parts of the cell state will be output. The cell state is then passed through tanh to bring its values between −1 and 1 and multiplied by the output of the sigmoid gate, so that only the selected parts are output, as shown in Figure 7.
The LSTM model applied in this paper uses five hidden layers, 1000 epochs, and a learning rate of 0.001. The model is trained with the Adam optimizer and evaluated using the root mean squared error (RMSE).
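The gate computations described in this section can be combined into a single time step. The following is a minimal sketch in plain Python of the standard LSTM step, not the implementation used in the paper. The function names, the parameter dictionary layout, and the zero-valued weights are assumptions chosen only to keep the example small and easy to check by hand.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Compute W . v + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """Run one LSTM time step and return the new hidden and cell states."""
    concat = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], concat)]
    i_t = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], concat)]
    c_hat = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], concat)]
    o_t = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], concat)]
    # Forget part of the old cell state, then add the gated candidate values.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # Filter the squashed cell state through the output gate.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hypothetical parameters: 2 hidden units, 1 input feature, all weights zero.
hidden, inputs = 2, 1
zero_w = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: zero_w for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_c", "b_o")})

h_t, c_t = lstm_step([3.0], [0.0, 0.0], [1.0, -1.0], params)
```

With all weights set to zero, every sigmoid gate outputs 0.5 and the candidate vector is 0, so the new cell state is half of the old one and the hidden output is half of its tanh.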

Result and Discussion
This section discusses the results obtained from the LSTM prediction model. The models for infected cases, patients who recovered from COVID-19, and patients who died were evaluated using the RMSE for each province in Indonesia, as shown in Table 3.
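For reference, the RMSE is the square root of the mean of the squared differences between observed and predicted values. The sketch below shows this calculation; the function name and the sample values are illustrative assumptions and are not results from the paper.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical daily case counts and model predictions (illustrative values).
actual = [10, 12, 15, 20]
predicted = [12, 12, 13, 22]
error = rmse(actual, predicted)
```

Because the errors are squared before averaging, the RMSE is expressed in the same unit as the data (cases per day) and penalizes large deviations more heavily than small ones.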

Conclusion
This paper has modeled predictions of COVID-19 infections, recoveries, and deaths for each province in Indonesia. With an LSTM model that uses five hidden neurons, 1000 epochs, a learning rate of 0.001, and the Adam optimizer, the smallest RMSE values for infection, recovery, and death cases are found in the provinces of Jawa Barat, DKI Jakarta, and Jawa Tengah, respectively. There is a tendency for a relationship between the mean and standard deviation of the dataset and the RMSE obtained from LSTM modeling. This relationship needs to be tested further. For a rough idea of the expected RMSE of the LSTM model, the mean and standard deviation of the input dataset can be compared with the RMSE of the LSTM model for infected, recovered, and death cases, as shown in Figures 8, 9, and 10, respectively.