Classification of rainfall data using support vector machine

Classification problems can be based on cross-sectional or time-series data. Time-series data are particularly susceptible to noise and outliers. In this paper, time-series classification was carried out using the support vector machine (SVM) method, one of the most popular binary classification techniques in machine learning. The advantages of this method are that it finds a globally optimal solution and reaches the same solution in every run, and that it addresses the over-fitting problem by minimizing the upper bound of the generalization error. The performance of an SVM classification model can be assessed through classification accuracy together with sensitivity and specificity tests. The results show that the SVM method achieved a prediction accuracy of 96.3% on the training data and 90.08% on the testing data. Therefore, the classification of rainfall data using the SVM method performs very well.


Introduction
One area of study within machine learning is classification. Classification is the process of finding a function that describes and distinguishes data classes, so that the function can be used to predict the class of an object or data point whose class is unknown. More formally, classification is characterized as a method for determining a collection of models or functions that describe and distinguish concepts or data classes in order to predict the classes of specific objects or to assign labels to unlabelled objects [1].
Several machine learning-based modeling techniques have been developed for classification problems, including artificial neural networks (McCulloch and Pitts, 1943), classification and regression trees (Breiman, Friedman, Olshen and Stone, 1984), multivariate adaptive regression splines (Friedman, 1991), k-nearest neighbors (Fix and Hodges, 1951), and support vector machines (Cortes and Vapnik, 1995), among others.
The support vector machine (SVM) is one such machine learning technique for data classification. Vapnik [2] first introduced the SVM method as a prediction tool for classification and regression problems, with great success. Because the approach can be derived analytically (mathematically), the SVM method works effectively for pattern recognition and has become very common in machine learning: it finds the globally optimal solution and reaches the same solution in every run [3]. In addition, SVM is an effective tool against overfitting because, following the theory of structural risk minimization, it minimizes the upper bound of the generalization error. It should be noted, however, that SVM was originally built, both in theory and in practice, for two-class classification problems [4]. Simply put, SVM works by determining the optimal hyperplane that separates the two groups [5].
Originally, SVM was used for binary data classification, but real-world problems are often multi-class rather than binary. In practice, a multi-class classification problem is decomposed into a set of binary problems so that standard SVM can be applied directly. The main aim of this study is to show how the support vector machine method can be used to classify time series data.

Support vector machine
The support vector machine (SVM) method was first introduced by Vapnik et al. in 1995 [2] and has been very successful for prediction, in both classification and regression cases. The SVM method seeks the globally optimal solution and reaches the same solution in every run. SVM works by mapping the training data into a high-dimensional space, in which it seeks a classifier that maximizes the margin between the two data classes. The method tries to find the optimal separating function, i.e. the best hyperplane among infinitely many candidates, that separates the data of the two classes, labelled -1 and +1.
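The maximal-margin hyperplane can be illustrated with a minimal sketch on two linearly separable classes labelled -1 and +1. The use of scikit-learn's `SVC` and the toy data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes labelled -1 and +1
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 1.6],
              [4.0, 4.0], [4.5, 4.2], [4.2, 4.6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The fitted hyperplane is (w . x) + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
print(clf.predict([[1.1, 1.3], [4.3, 4.1]]))  # -> [-1  1]
```

Points on either side of the hyperplane receive the label -1 or +1 according to the sign of (w . x) + b.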
The basic concept of SVM is a linear classifier, i.e. classification cases that can be separated linearly. Linear SVM classification is divided into two cases: separable and non-separable.
Given a training set X = {x_1, x_2, ⋯, x_n}, with x_i ∈ ℝ^d, i = 1, ⋯, n, each x_i is a specific pattern: if x_i is a member of a certain class it is given the label (target) y_i = +1, otherwise y_i = −1. The data are therefore given as pairs (x_1, y_1), (x_2, y_2), …, (x_n, y_n), a training vector set of two groups to be categorized by SVM [19][20]. A separating hyperplane is defined by a normal vector, called w, and a parameter b that fixes the position of the plane relative to the coordinate origin:

(w · x) + b = 0.   (2)

A canonical separating hyperplane must satisfy the constraint [19][20]

y_i ((w · x_i) + b) ≥ 1,   i = 1, ⋯, n,

and the optimal hyperplane is obtained by maximizing the margin 2/‖w‖, or equivalently by minimizing the function

min (1/2) ‖w‖².

This optimization problem can then be solved with the Lagrange function, where α_i is the Lagrange multiplier. The Lagrange function lives in primal space, so it is converted into dual space so that it can be solved more simply and effectively. The dual-space solution is

max Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j),   subject to Σ_i α_i y_i = 0, α_i ≥ 0,

and the classification then uses the decision function [20]

f(x) = sign( Σ_i α_i y_i (x_i · x) + b ).

In fact, not all data can be considered linearly separable, so defining a single linear hyperplane can be very hard. This issue is solved by converting the data into a higher-dimensional feature space, where the data can be separated linearly; in this way SVM also works on nonlinear data. For nonlinear classification, the dot product (x_i · x_j) is replaced by a kernel function K(x_i, x_j) and the corresponding dual optimization problem [20] is solved.
Widely used kernel functions include the linear kernel, the polynomial kernel, and the Gaussian radial basis function (RBF) kernel.
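The three kernels named above can be written down directly. This is a minimal sketch; the degree, coef0, and σ values (σ = 0.33 mirrors the value used later in the paper) and the exact RBF parameterization K(x, z) = exp(−‖x − z‖² / (2σ²)) are assumptions, since the paper does not spell out its kernel formulas.

```python
import numpy as np

def linear_kernel(x, z):
    # K(x, z) = x . z
    return float(np.dot(x, z))

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    # K(x, z) = (x . z + coef0)^degree
    return float((np.dot(x, z) + coef0) ** degree)

def rbf_kernel(x, z, sigma=0.33):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2)))

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(linear_kernel(x, z), rbf_kernel(x, x))  # K(x, x) = 1 for the RBF kernel
```

Note that libraries often parameterize the RBF kernel with γ = 1/(2σ²) instead of σ, so the two conventions must not be mixed when reproducing results.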

Evaluation of model performance
The overall performance of a classification model is demonstrated by its classification accuracy: the higher the classification accuracy, the better the performance of the model. In terms of the confusion-matrix counts (true positives TP, true negatives TN, false positives FP, and false negatives FN),

accuracy = (TP + TN) / (TP + TN + FP + FN).   (13)

To obtain a more specific evaluation of the classifier, sensitivity (TP / (TP + FN)) and specificity (TN / (TN + FP)) can also be tested.
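The three performance measures can be computed together from the confusion-matrix counts. The counts in the usage line are illustrative numbers, not the paper's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Illustrative counts only
acc, sens, spec = classification_metrics(tp=90, tn=85, fp=5, fn=10)
print(round(acc, 3), round(sens, 3), round(spec, 3))
```

A model can score high accuracy while missing most positives when classes are imbalanced, which is why sensitivity and specificity are reported alongside accuracy.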

Data source and description
The data source used in this study is secondary data in the form of monthly rainfall data for the period 1999 to 2015, obtained from the National Centers for Environmental Prediction (NOAA) on the website http://www.esrl.noaa.gov. Only one observation station in East Nusa Tenggara Province was selected, namely the Komodo station; the data are described in Table 1.
In 1993, McKee developed the first drought index calculation using the standardized precipitation index (SPI) method. Its purpose is to detect and mitigate drought. Meteorological drought analysis using the SPI method can be carried out over a period of one month, three months, six months, twelve months, and so on, according to the purpose of the analysis. In this study, the SPI method was used with a 3-month period; the SPI categories are given in Table 2.
Figure 1 shows the time series plot of the rainfall data: the pattern formed by the monthly rainfall data over the 16-year period. The rainfall data plotted by SPI classification are shown in Figure 2. The data used consist of monthly rainfall (X1), the SPI value (X2), and the weather nature (Y).
Classification of the rainfall data with the SVM method was carried out using only the radial kernel, with parameter values C = 1 and σ = 0.33. The data were partitioned into training and testing sets at a ratio of 40%:60%, giving 81 training observations and 121 testing observations. The accuracy results are presented in Table 3, which shows that the classification accuracy on the training data is higher than the accuracy on the testing data.
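The experimental setup above (radial kernel, C = 1, σ = 0.33, 40%:60% train/test split) can be sketched as follows. The rainfall data are not available here, so a synthetic stand-in is used; mapping σ to scikit-learn's γ via γ = 1/(2σ²) is also an assumption about the paper's parameterization.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the 202 monthly observations of (rainfall X1, SPI X2)
X = rng.normal(size=(202, 2))
y = (X[:, 1] > 0).astype(int)  # placeholder labels standing in for weather nature Y

# 40% training : 60% testing split, as in the study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.40, random_state=1)

# Radial (RBF) kernel with C = 1; gamma = 1/(2*sigma^2) for sigma = 0.33 (assumed mapping)
clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * 0.33 ** 2))
clf.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, clf.predict(X_tr))
test_acc = accuracy_score(y_te, clf.predict(X_te))
print(round(train_acc, 3), round(test_acc, 3))
```

As in Table 3, training accuracy is typically higher than testing accuracy, since the model is evaluated on the same data it was fitted to.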

Conclusion
The results present the performance evaluation of the time-series classification model, in this case monthly rainfall data classified with the SVM method. The accuracy of the predictions on the training data is 96.3%, with a sensitivity of 95.15% and a specificity of 97.4%. The accuracy of the predictions on the testing data is 90.08%, with a sensitivity of 87.81% and a specificity of 92.97%. Therefore, the classification of rainfall data using the SVM method performs very well.