Evaluation of Decision Tree, K-NN, Naive Bayes and SVM with MWMOTE on UCI Dataset

Imbalanced data causes misclassification because classifiers favour the majority class over the minority class, which lowers accuracy. The UCI dataset is a public dataset that can be used in machine learning. This study evaluates the Decision Tree, K-NN, Naive Bayes, and Support Vector Machine classification methods on imbalanced data that are balanced with MWMOTE. MWMOTE resolves imbalanced cases through weighting and clustering, with the aim of producing more representative synthetic data and increasing accuracy. The results indicate that the Decision Tree has higher recall, precision, F-measure, and accuracy than K-NN, Naive Bayes, and Support Vector Machine for data that are balanced with MWMOTE.


Introduction
Misclassification often occurs when classifying imbalanced data because classifiers lean towards the majority data, so accuracy on the minority data is low [1]. To handle imbalance, some studies manipulate the data samples (creating synthetic data) and others modify the algorithms [2]. Classification methods report accuracy over all data, which in effect ignores the minority classes and treats all data as the majority class. They assume that the dataset has a balanced distribution, so minority classes are treated as noise or outliers [3] [4]. The imbalance between minority and majority data causes the accuracy on minority data to be low [5]. An imbalanced distribution results in classifications that lean towards the majority (negative) data rather than the minority (positive) data [6].
Misclassification has been attributed to the imbalanced dataset [7]. In imbalanced cases the data fall into two groups, namely minority and majority data [2]. Imbalance can also lead to poor models [8] as well as overfitting and decreased classification accuracy [9]. Oversampling is one way to handle imbalance: it balances the class distribution by iteratively replicating minority data at random (synthetic data). Its disadvantage is that the synthetic data can cause overfitting, because this mechanism makes the synthetic data less precise. The Majority Weighted Minority Oversampling Technique (MWMOTE) can handle overfitting. Making synthetic data in MWMOTE has three stages, namely identification of the minority class and majority class samples in the dataset, minority class weighting, and clustering. This approach was able to reduce the degree of bias or noise and to produce synthetic data with better accuracy [2]. This study uses Decision Tree, K-NN, Naive Bayes, and Support Vector Machine to classify imbalanced data that are balanced with MWMOTE on the UCI dataset, especially in the pre-processing and testing phases.

Decision Tree
The Decision Tree method is an algorithm developed from ID3 that turns a large body of data or facts into a decision tree used for classification, segmentation, or prediction [10]. The attribute with the highest gain value (1) among the existing attributes is selected as the root. Computing the gain also requires the entropy value (2). Entropy is used to determine how informative an input attribute is for producing an output attribute [11].
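Equations (1) and (2) are not reproduced in this text. As a minimal sketch, assuming the standard ID3 definitions (Shannon entropy, and gain as the entropy of the labels minus the weighted entropy after a split), root selection could look as follows. The toy attributes and labels are hypothetical and are not taken from the UCI data used in this study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute with the highest gain becomes the root
print(gains, root)
```

In this toy case the attribute that splits the labels into pure subsets receives the full entropy as gain and is chosen as the root.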

K-Nearest Neighbors
K-Nearest Neighbors (K-NN) is one of the simplest learning algorithms for predicting a class in a dataset [12]. K-NN assigns a class based on the distance to the closest neighbours, measured with Euclidean distance, City block distance, Cosine distance, Correlation, or Hamming distance [13]. The distance between neighbours is an important part of optimizing the K-NN algorithm, so the authors use Euclidean distance (3) [14]. The drawback of the K-NN algorithm is that it requires considerable memory and computation [15].

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{3}$$

where x and y are two samples with n attributes.
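The following is a minimal sketch of K-NN with Euclidean distance, assuming a simple majority vote among the k nearest training samples. The function names, the value k = 3, and the toy points are illustrative assumptions and not the configuration used in this study.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data with three majority and two minority samples.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0)]
train_labels = ["majority", "majority", "majority", "minority", "minority"]

prediction = knn_predict(train_points, train_labels, (6.5, 6.0), k=3)
print(prediction)
```

Every prediction scans all stored training samples, which illustrates the memory and computation cost noted above.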

Naïve Bayes
To classify with Naïve Bayes, the data provided must have defined values for each attribute and a defined class [16]. The classification is calculated from the probability value of each class given the variables (4).

$$p(H \mid E) = \frac{p(E \mid H)\, p(H)}{p(E)} \tag{4}$$

where p(H | E) is the probability of the hypothesis given the evidence, p(H) is the probability of the hypothesis, p(E | H) is the probability of the evidence given the hypothesis, and p(E) is the probability of the evidence.
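As a worked illustration of the quantities defined above, the sketch below applies Bayes' theorem to one hypothesis and one piece of evidence. The probability values are hypothetical, and p(E) is obtained from the law of total probability over H and not-H.

```python
def posterior(prior_h, likelihood_e_given_h, evidence_e):
    """Bayes' theorem: p(H | E) = p(E | H) * p(H) / p(E)."""
    return likelihood_e_given_h * prior_h / evidence_e


# Hypothetical values: H is "sample belongs to the minority class".
p_h = 0.2              # p(H)
p_e_given_h = 0.9      # p(E | H)
p_e_given_not_h = 0.1  # p(E | not H)

# p(E) by total probability over H and not H.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

p_h_given_e = posterior(p_h, p_e_given_h, p_e)
print(round(p_h_given_e, 4))
```

A Naïve Bayes classifier repeats this calculation for every class, multiplying the per-attribute likelihoods under the independence assumption, and predicts the class with the largest posterior.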

Support Vector Machine
Support Vector Machine is a machine learning algorithm that works on the principle of Structural Risk Minimization (SRM) to find the best hyperplane (Figure 1) that separates two classes in the input space [10].
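To make the role of the hyperplane concrete, the sketch below assumes a linear Support Vector Machine whose weight vector w and bias b have already been learned. The values of w and b are hypothetical. A sample is classified by the sign of w·x + b, and the margin between the two classes is 2/||w||.

```python
from math import sqrt


def decision_value(w, b, x):
    """Signed score w.x + b of sample x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Width of the margin between the two supporting hyperplanes: 2 / ||w||."""
    return 2 / sqrt(sum(wi * wi for wi in w))


# Hypothetical learned parameters (not taken from this study).
w, b = (3.0, 4.0), -10.0

print(classify(w, b, (3.0, 4.0)), classify(w, b, (1.0, 1.0)), margin_width(w))
```

Training searches for the w and b that maximize this margin while keeping the training samples on the correct side; the sketch covers only the prediction step.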

Imbalanced Data
Data are said to be imbalanced when the ratio between one class and another is unequal. In data mining, imbalance means that the majority class contains more data than the minority class. Imbalance problems occur in machine learning and often result in misclassification, which decreases the accuracy of class predictions [17]. The decrease in accuracy under imbalance is due to the presence of noise or outliers from minority classes in the test datasets [18]. One way to deal with imbalance is to compare classification methods with added algorithms or modified methods [19]. Imbalance can also be addressed by oversampling, which adds synthetic data to the minority class, or by under-sampling. Imbalance has quite high complexity and is differentiated into three cases (Figure 2) [20].
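The sketch below illustrates why overall accuracy can hide misclassification of the minority class. It assumes a hypothetical 90:10 class distribution and a trivial classifier that always predicts the majority class; neither comes from the datasets in this study.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Number of majority-class samples divided by number of minority-class samples."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical distribution: 90 negative (majority), 10 positive (minority).
labels = ["negative"] * 90 + ["positive"] * 10
ratio = imbalance_ratio(labels)

# A classifier that ignores the minority class entirely.
predictions = ["negative"] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
minority_hits = sum(p == y == "positive" for p, y in zip(predictions, labels))
minority_recall = minority_hits / labels.count("positive")

print(ratio, accuracy, minority_recall)
```

The overall accuracy looks high even though no minority sample is recognised, which is why precision, recall, and F-measure are reported alongside accuracy.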

UCI Dataset
An imbalanced dataset is a special case of the classification problem in which the class distribution is not uniform among the classes. Such datasets are usually organized into two classes: the majority (negative) and minority (positive) classes [21] (Table 1).

Oversampling
The oversampling method for making synthetic data may produce inaccurate samples in many scenarios.
To overcome this problem, a new method, the Majority Weighted Minority Oversampling Technique (MWMOTE), was proposed [2]. The purpose of MWMOTE is twofold: to improve the process of sample selection and to improve the process of making synthetic samples. MWMOTE has three stages (Figure 3). First, MWMOTE identifies the minority data that are difficult to learn: minority data that lie inside the majority data, minority data adjacent to the majority data (borderline), and minority data that carry information on the borderline. Second, each member of this informative minority sample set is assigned a selection weight (Sw). Third, the synthetic samples are generated from the weighted minority samples using clustering.
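The sketch below is a deliberately simplified illustration of weighted oversampling and is not the full MWMOTE algorithm of [2]. It assumes the selection weights are already given (here hypothetical values), and it omits the clustering step by drawing the interpolation partner from the whole minority set. The function name `weighted_oversample` is an assumption made for this example.

```python
import random


def weighted_oversample(minority, weights, n_new, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    The base sample is drawn in proportion to its selection weight; the partner
    is drawn uniformly from the minority set (MWMOTE itself would restrict the
    partner to the same cluster as the base sample).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choices(minority, weights=weights, k=1)[0]
        partner = rng.choice(minority)
        alpha = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + alpha * (p - b) for b, p in zip(base, partner)))
    return synthetic


# Hypothetical minority samples and selection weights (higher = harder to learn).
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0)]
weights = [0.6, 0.3, 0.1]

synthetic = weighted_oversample(minority, weights, n_new=5)
print(synthetic)
```

Because each synthetic point lies on a segment between two minority samples, the new data stay inside the region spanned by the minority class, and samples with larger weights seed more synthetic points.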

Classification
Classification is the process of building and testing a model on data and separating the dataset into categories by labelling each class [22]. Classification can be used to predict future data trends [10]. Classification consists of three stages [23]. In the model development stage, a model is created to solve the problem of classifying attributes or classes in a dataset; the model is built from training data that represent the problem and carry good information. In the model application stage, the built model is used to determine the attribute or class of testing data whose attribute or class is not yet known. In the evaluation stage, the applied model is evaluated using measured parameters to determine whether the model is acceptable or not.

Evaluation
Precision is the level of agreement between the information requested by the user and the answers provided by the system. Recall is the level of success of the system in retrieving information. F-measure is an evaluation measure from information retrieval that combines recall and precision. Accuracy is defined as the level of closeness between the predicted value and the actual value.
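The four measures can be sketched from a binary confusion matrix as below, assuming the usual definitions in terms of true positives (tp), false positives (fp), false negatives (fn), and true negatives (tn). The counts are hypothetical, and the per-class averaging used for the tables in this study is not specified here.

```python
def evaluate(tp, fp, fn, tn):
    """Return precision, recall, F-measure, and accuracy for one positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts for a minority (positive) class of 12 samples out of 100.
metrics = evaluate(tp=8, fp=2, fn=4, tn=86)
print(metrics)
```

With these counts the accuracy is far higher than the recall, which shows how the measures diverge when the positive class is rare.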

Results and Discussion
In this study, the Decision Tree, Naïve Bayes, K-NN, and Support Vector Machine algorithms are evaluated with four measures: precision, recall, F-measure, and accuracy. The imbalanced dataset is divided into training data and testing data with a composition of 80:20. Tables 2-3 and Tables 4-5 report the precision, recall, F-measure, and accuracy for the training data. Oversampling is applied to the training dataset to add synthetic data so that the data become balanced (Figures 4 and 5). The imbalanced dataset obtained an accuracy (Figure 6) of 96.30% with the Decision Tree, 92.95% with K-NN, 82% with Support Vector Machine, and 78.74% with the Naïve Bayes classifier. The precision is 96.57% for the Decision Tree classifier, 80.32% for Naïve Bayes, 93.38% for K-NN, and 84.36% for Support Vector Machine (Figure 6). Figure 7 shows the result of the classifier evaluation conducted after the imbalanced dataset becomes balanced by adding synthetic data to the minority class. Recall and F-measure for the Decision Tree classifier are 96.31% and 93.30%. Naïve Bayes obtained 78.74% and 78.38%, K-NN obtained 92.94% and 92.92%, and Support Vector Machine obtained 82% and 81.45%.
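The 80:20 division can be sketched as follows. Whether the split in this study was stratified is not stated, so stratification by class is an assumption here, as are the function name `stratified_split`, the seed, and the toy labels. Oversampling would then be applied to the training indices only, leaving the testing data untouched.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train and test, keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical imbalanced labels: 90 negative, 10 positive.
labels = ["negative"] * 90 + ["positive"] * 10
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
```

Splitting per class keeps minority samples in both parts, so the minority class can still be evaluated on the testing data.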

Conclusion
In this research, on the training data, K-NN obtained an accuracy of 93.73% and Naïve Bayes obtained an accuracy of 79.90%. On the test data, the Decision Tree has an accuracy of 94.32%, K-NN 92.67%, Support Vector Machine 85.61%, and Naïve Bayes 84.30%. After oversampling the imbalanced data with MWMOTE, the accuracy is 96.30% for the Decision Tree, 92.95% for K-NN, 82.00% for Support Vector Machine, and 78.74% for Naïve Bayes. The accuracy on the balanced data differs from that on the imbalanced data by 2-4% on average.