Weighting Dataset Features Using Gain Ratio to Improve the Accuracy of the Naïve Bayesian Classification Method

The Naïve Bayes method is proven to be fast when applied to large datasets, but it is weak in attribute selection: because Naïve Bayes is a statistical classification method based only on the Bayes theorem, it can only predict the probability of class membership under the assumption that attributes are independent, and it cannot select attributes that are highly correlated with the class or with one another, which affects the accuracy value. The Weighted Naïve Bayesian model has been shown to provide better accuracy than conventional Naïve Bayesian. The highest accuracy value, 88.57%, was obtained on the Water Quality dataset with the Weighted Naïve Bayesian classification model, while the lowest accuracy value, 78.95%, was obtained on the Haberman dataset with the conventional Naïve Bayesian classification model. The accuracy improvement of the Weighted Naïve Bayesian classification model is 2.9% on the Water Quality dataset and 1.8% on the Haberman dataset; averaged over the datasets, the improvement of the Weighted Naïve Bayesian classification model is 2.35%. Based on the testing performed on all test data, the Weighted Naïve Bayesian classification model provides better accuracy values than the conventional Naïve Bayesian classification model.


1. Introduction
The Naïve Bayesian Classifier (NBC) is a classification technique that calculates a set of probabilities by summing the frequencies and value combinations in a given dataset, under a strong (naïve) independence assumption. Data classification is done by analyzing training samples and assigning a category (class) to the data based on the predicted value of an attribute. Attributes are the characteristics or features of the data. The expected prediction of an attribute is a value that produces an accurate category description [1][2][3]. To improve the classification accuracy of a predictor, one way to develop the algorithm is to weight the attributes before the data classification stage [4].
One approach weights the attributes of the data samples before the classification stage. This aims to increase the accuracy of the conventional Naïve Bayesian method by applying attribute reduction through weighting the attributes of the data sample with the Information Gain method. The Information Gain Weighted Naïve Bayesian Classifier (IGWNBC) method produces a significantly better correctness rate than conventional Naïve Bayesian methods; the correctness rates obtained on the car, zoo, and mushroom datasets average over 97% [5][6].

Another approach [7] proposes weighting the attributes of the data sample using a Multivariable Linear Regression model to produce weight coefficients, followed by classification with the Naïve Bayesian Classifier (NBC). The combination of the MLRM and NBC models produces a correctness rate significantly better than conventional Naïve Bayesian methods; the correctness rate obtained on the UCI Machine Learning dataset is 80% [7].
A further approach [8] proposes local attribute weighting using the K-Nearest Neighbors (KNN) method: the K nearest neighbors of each data sample are found, the probability value of each attribute is calculated and used to weight that attribute, and the result is classified with the Naïve Bayesian Classifier (NBC). The sample data come from historical bus route data and weather condition data from August to December 2014. The accuracy obtained by the locally attribute-weighted KNN Naïve Bayesian method was 89% [8].

2.1. Gain Ratio
The C4.5 algorithm is a decision tree method in which attribute selection is based on the Gain Ratio. The Gain Ratio (GR) is a modification of Information Gain that reduces its bias toward many-valued attributes. The Gain Ratio is determined as follows:
1. Calculate the Entropy value for each attribute.
2. Calculate the Information Gain value for each attribute.
3. Divide the Information Gain by the Split Information of the attribute to obtain the Gain Ratio.
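The three steps above can be sketched in code. The following is a minimal illustration (function names are ours, not from the paper) of computing the Gain Ratio of a single categorical attribute from entropy, information gain, and split information:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain Ratio of one attribute: Information Gain / Split Information."""
    n = len(labels)
    # Partition the class labels by attribute value.
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Expected entropy remaining after splitting on this attribute.
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    info_gain = entropy(labels) - remainder
    # Split Information penalises attributes with many distinct values.
    split_info = -sum((len(part) / n) * math.log2(len(part) / n)
                      for part in partitions.values())
    return info_gain / split_info if split_info > 0 else 0.0
```

An attribute that perfectly separates the classes receives a Gain Ratio of 1, while an attribute with a single constant value receives 0, matching the weighting scale used later in this paper.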

2.2. Naïve Bayesian Method (Naïve Bayesian Classifier)
Naïve Bayes is a classifier that calculates a set of probabilities by summing the frequencies and value combinations in a given dataset. Naïve Bayes is based on the simplifying assumption that attribute values are conditionally independent of one another given the output value. The advantage of Naïve Bayes is that it requires only a small amount of training data to estimate the parameters needed in the classification process, and it often works far better in complex real-world situations than expected. The basic equation of the Bayes theorem is [9]:

P(H|X) = P(X|H) · P(H) / P(X)

where H is a hypothesis (class) and X is the observed data (attribute values). The Naïve Bayes method still has a weakness in attribute selection: because it is a statistical classification method based only on the Bayes theorem, it can only be used to predict the probability of membership in a group or class. Therefore, attribute weighting is needed to increase accuracy more effectively.

3. Results and Discussion
To evaluate the performance of the proposed method, two datasets are used: the Water Quality dataset and the Haberman dataset. The Water Quality dataset originates from the research in [3], where the data were collected by the Ministry of Environment under the Water Quality Provisions and are classified into four categories. The Haberman dataset comes from the KEEL-Dataset Repository at the URL https://sci2s.ugr.es/keel/dataset.php?cod=62; it describes breast cancer patients for whom it is predicted whether the patient will survive for 5 years or more after surgery (positive) or die within 5 years (negative) [3].
The cleaning process removes duplicate data, reducing the Water Quality Status dataset from its original 120 instances to 117 instances; likewise, the Haberman dataset is reduced from its original 306 instances to 289 instances.
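The duplicate-removal step can be sketched as follows (a generic order-preserving deduplication; the paper does not specify its exact procedure):

```python
def remove_duplicates(rows):
    """Drop exact duplicate instances, keeping the first occurrence in order."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(row)  # rows must be hashable to detect duplicates
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned
```

Applied to the raw datasets, any instance whose attribute values exactly match an earlier instance is dropped, which is how 120 instances can shrink to 117 and 306 to 289.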

Gain Ratio Weight Acquisition Results
The highest Gain Ratio, 1, is obtained for the Total Coliform and Pij attributes; the lowest, 0.257, is obtained for the DO attribute.
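Once obtained, these Gain Ratio values can serve as per-attribute weights in the classifier. A common formulation of weighted Naïve Bayes raises each conditional probability to the power of its attribute weight, P(C) · ∏_j P(x_j | C)^w_j; the sketch below uses that form under the assumption that the paper's scheme is similar (function names and the illustrative probabilities are ours):

```python
def weighted_nb_score(prior, cond_probs, weights):
    """Class score with attribute weights applied as exponents:
    P(C) * prod_j P(x_j | C) ** w_j."""
    score = prior
    for p, w in zip(cond_probs, weights):
        score *= p ** w
    return score

def classify_weighted(priors, cond, weights):
    """priors: {class: P(C)}; cond: {class: [P(x_j|C), ...]};
    weights: [w_j, ...], e.g. the Gain Ratios from section 2.1."""
    return max(priors, key=lambda c: weighted_nb_score(priors[c], cond[c], weights))
```

With this form, an attribute weighted 1 (such as Total Coliform above) contributes its full conditional probability, while an attribute weighted 0.257 (such as DO) is flattened toward 1 and thus influences the class score far less.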

Accuracy Results of the Conventional Naïve Bayesian Model (Water Quality Dataset)
The Water Quality Status dataset has 8 attributes, 4 classes, and 120 instances, with the class distribution: good condition (30 instances), lightly polluted (30 instances), moderately polluted (30 instances), and heavily polluted (30 instances). The data are split randomly: 70% of the data is used as training data and 30% as test data.
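The random 70/30 split can be sketched as follows (the seed and function name are ours for reproducibility; the paper only states that the split is random):

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the instances and split them into training and test sets
    (70/30 by default, as used for the Water Quality dataset)."""
    rng = random.Random(seed)
    shuffled = rows[:]           # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

For 120 instances this yields 84 training and 36 test instances; the Haberman experiment below uses the same procedure with test_fraction=0.2.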

Accuracy Results of the Conventional Naïve Bayesian Model (Haberman Dataset)
The Haberman dataset has 3 attributes, 2 classes, and 289 instances. The data are split so that 80% is used as training data and 20% as test data. The accuracy results of the conventional Naïve Bayesian model on the Haberman dataset are as follows: