Attribute Weighting Based K-Nearest Neighbor Using Gain Ratio

K-Nearest Neighbor (KNN) is a good classifier, but several studies have shown that its accuracy is often lower than that of other methods. One cause of this low accuracy is that each attribute has the same effect on the classification process, while less relevant attributes lead to misclassification of new data. In this research, we propose Attribute Weighting Based K-Nearest Neighbor Using Gain Ratio, where the Gain Ratio serves as a parameter to measure the correlation between each attribute and the data class, and is also used as the basis for weighting each attribute of the dataset. The resulting accuracy is compared with that of the original KNN method using 10-fold cross-validation on several datasets from the UCI Machine Learning Repository and the KEEL-Dataset Repository: abalone, glass identification, haberman, hayes-roth, and water quality status. The test results show that the proposed method increases the classification accuracy of KNN; the highest accuracy gain, 12.73%, was obtained on the hayes-roth dataset, and the lowest, 0.07%, on the abalone dataset. On average over all datasets, accuracy increased by 5.33%.


Introduction
K-Nearest Neighbor (KNN) is an effective, simple, and well-performing classification method [1][2][3], but several studies have shown that its accuracy is lower than that of other methods. One such study, by [4], compared the performance of the support vector machine (SVM) with KNN; SVM performed better, obtaining an accuracy of 82.54% versus 79.22% for KNN. Another study, by [5], compared KNN with an Artificial Neural Network (ANN); the ANN (with 5 hidden layers) performed better than KNN, with an accuracy of 90.5%. Research by [6] compared the performance of Naïve Bayes, Decision Tree, and KNN. The results show that Naïve Bayes achieved the best classification accuracy, averaging 73.7%, while the average accuracies of Decision Tree and KNN were 58.9% and 56.7%, respectively. A study by [7] compared KNN and Naïve Bayes for diagnosing heart disease; Naïve Bayes obtained an accuracy of 79.62%, while the average accuracy of KNN was only 64.85% when k = 10. These studies indicate that the accuracy of the KNN method is still lower than that of other methods. One cause of this low accuracy is that each attribute has the same effect on the classification process. The solution to this problem is to assign a weight to each attribute [8]. In this research, we propose attribute weighting based K-Nearest Neighbor using Gain Ratio to increase the accuracy of the KNN method by weighting each attribute, where the weights are rescaled using the min-max normalization equation, and data similarity is then calculated using the distance model proposed by [1]. The rest of this paper is structured as follows.
Section 2 summarizes previous studies and the theoretical foundation of the topic, Section 3 presents the results and discussion, and Section 4 concludes.

Feature Weight K-Nearest Neighbor (FWKNN)
K-Nearest Neighbor (KNN) is a nonparametric learning method that is sensitive to the distance function because of its inherent sensitivity to irrelevant attributes [1]. FWKNN addresses this by modeling attribute weighting. FWKNN identifies the k nearest neighbors that have the highest similarity to the test data, and therefore requires a model to measure the distance between the test data and the training samples.
To measure the distance between data points, one of the most popular options is the Euclidean distance model [1]. FWKNN modifies this by expanding the standard Euclidean distance with feature weights. The equation used in FWKNN is as follows:

d_w(X', X) = sqrt( Σ_{i=1}^{y} w_i · (x_i' − x_i)² )    (1)

where d_w(X', X) is the weighted Euclidean distance between the test data and the training data, y is the number of attributes, w_i is the weight of attribute i, and x_i' and x_i are the attribute values of the test data and the training data. The detail of the FWKNN algorithm is as follows:
Step 1: Determine the weight of each feature.
Step 2: Determine the value of parameter k.
Step 3: Calculate the distance using equation (1).
Step 4: Sort the results in ascending order (from lowest to highest distance).
Step 5: Calculate the number of each class based on the k-nearest neighbor.
Step 6: The majority class among the neighbors is assigned to the test data. FWKNN differentiates between attributes by assigning them weights, so that more important attributes have a greater effect on the distance calculation [8]. In this way, errors in the classification process can be reduced [1].
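The steps above can be sketched in Python. This is a minimal illustration assuming numeric attributes and precomputed attribute weights, not the authors' implementation:

```python
import math
from collections import Counter

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance of equation (1): sqrt(sum_i w_i * (a_i - b_i)^2)."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def fwknn_predict(train_X, train_y, test_point, weights, k):
    """Steps 2-6: rank training points by weighted distance (ascending) and vote."""
    ranked = sorted(
        (weighted_distance(x, test_point, weights), label)
        for x, label in zip(train_X, train_y)
    )
    neighbors = [label for _, label in ranked[:k]]
    return Counter(neighbors).most_common(1)[0][0]

# toy data: two attributes, the second one weighted as more relevant
train_X = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8)]
train_y = ["a", "a", "b", "b"]
print(fwknn_predict(train_X, train_y, (1.1, 1.0), weights=[0.3, 1.0], k=3))  # -> a
```

With unit weights this reduces to the original KNN; the weights only change how strongly each attribute contributes to the distance.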

Gain Ratio
The C4.5 algorithm is a decision tree method in which attribute selection is based on the Gain Ratio. Gain Ratio (GR) is a modification of Information Gain that reduces its bias; it improves on information gain by taking the intrinsic information of each attribute into account [9]. The steps for determining the Gain Ratio are as follows:
Step 1: Calculate the entropy of the set of cases:

Entropy(S) = Σ_{i=1}^{n} −p_i log2 p_i

where S is the set of cases, n is the number of partitions, and p_i is the proportion of S_i to S.
Step 2: Calculate the information gain of each attribute A:

Gain(S, A) = Entropy(S) − Σ_{i=1}^{n} (|S_i| / |S|) · Entropy(S_i)

where S_i is the subset of S having the i-th value of attribute A.
Step 3: Calculate the Split Information of each attribute:

SplitInfo(S, A) = − Σ_{i=1}^{n} (|S_i| / |S|) log2 (|S_i| / |S|)

Step 4: Divide the information gain by the split information to obtain the Gain Ratio:

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)

The Gain Ratio is found in the C4.5 algorithm, where it is used to calculate the effect of an attribute on the target class of a dataset [11]. As a development of information gain, the gain ratio eliminates the bias of each attribute.
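The four steps above can be sketched in Python. This is a minimal illustration for a single discrete attribute, assuming the attribute values and class labels are given as parallel lists:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A) for one discrete attribute A."""
    n = len(labels)
    # partition the cases S by the attribute value (the subsets S_i)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in partitions.values())
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in partitions.values())
    return gain / split_info if split_info else 0.0

# an attribute that perfectly separates the classes has gain ratio 1.0
values = ["x", "x", "y", "y"]
labels = ["pos", "pos", "neg", "neg"]
print(gain_ratio(values, labels))  # -> 1.0
```

The guard on a zero split information avoids division by zero when an attribute takes a single value.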

Previous Research
Research by [12] compared K-Nearest Neighbor (KNN) with the Support Vector Machine (SVM) for water quality status classification. The test results show that SVM achieved the highest accuracy, 92.40%, while the average accuracy of KNN was only 71.28% with k = 7. The accuracy of the KNN method is thus still lower than that of other methods; one cause of the low accuracy is that each attribute has the same effect on the classification process.
Attribute weighting was proposed by [8]. However, that weighting is less effective because it assigns the same weight to each of the characteristics, whereas the research by [13] suggests extending the KNN method with different weighting functions.
In addition, attribute weighting was performed in the research by [1], which proposed Information Gain as the basis for weighting attributes in the KNN algorithm, known as Feature Weight K-Nearest Neighbor (FWKNN). This method is considered more effective than the original KNN.
Finally, [14] used the Gain Ratio as the basis for attribute weighting in KNN. Their research shows that KNN using the Gain Ratio is considered more intuitive and easier to understand. In their results, weighting attributes with the Gain Ratio increased accuracy by up to 5%.

The Proposed Method
The proposed method is explained step by step in this chapter; its stages can be seen in figure 1. As figure 1 shows, the proposed method has the following stages:
Step 1: Weighting attributes by Gain Ratio. The weights are rescaled using the min-max normalization equation [10], so that the lowest normalized weight is 0.1 and the highest is 1:

w_i' = ((w_i − min) / (max − min)) × (1 − 0.1) + 0.1

where w_i is the gain ratio of attribute i and min and max are the smallest and largest gain ratios over all attributes.
Step 2: Classification with KNN. After the weight of each attribute has been obtained, the data are classified with KNN. In determining the proximity between data points, each attribute has a different effect according to its weight, using equation (1). This research uses the gain ratio as a parameter to measure the correlation between each attribute and the data class, and also as the basis for weighting each attribute of the dataset: the higher the gain ratio of an attribute, the greater its correlation with the data class, and therefore the higher its weight. This research analyzes the performance of the KNN method with gain-ratio attribute weighting against the performance of the KNN method in [11].
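Step 1 can be sketched as follows. This is a minimal illustration of rescaling a list of gain ratios into the interval [0.1, 1] with min-max normalization; the sample values are hypothetical:

```python
def normalize_weights(gain_ratios, new_min=0.1, new_max=1.0):
    """Min-max rescaling of gain ratios into [new_min, new_max]:
    w' = (w - min) / (max - min) * (new_max - new_min) + new_min."""
    lo, hi = min(gain_ratios), max(gain_ratios)
    if hi == lo:
        # all attributes are equally informative; give them all the top weight
        return [new_max] * len(gain_ratios)
    return [(g - lo) / (hi - lo) * (new_max - new_min) + new_min for g in gain_ratios]

print(normalize_weights([0.05, 0.30, 0.55]))  # approximately [0.1, 0.55, 1.0]
```

Keeping the lower bound at 0.1 rather than 0 means the least informative attribute still contributes slightly to the distance in equation (1) instead of being discarded entirely.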

Result and Discussion
This research uses datasets obtained from the UCI Machine Learning Repository and the KEEL-Dataset Repository, namely abalone, glass identification, haberman, and hayes-roth, plus one water quality status dataset derived from research conducted by [11]. The details of the data used can be seen in table 1.

Testing Using Data Set
This study used 10-fold cross-validation on the datasets: each dataset was randomly divided into 10 partitions of equal size, and 10 trials were run, where each trial uses the k-th partition as test data and the remaining nine partitions as training data. The average accuracy on each dataset can be seen in figure 2.
Figure 2. The average result of the accuracy of the dataset.
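The 10-fold protocol can be sketched as follows. This is a minimal illustration with a toy 1-nearest-neighbor predictor standing in for the classifiers under test; the data and fold seed are hypothetical:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Randomly partition indices 0..n-1 into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, predict, k=10, seed=0):
    """Each fold serves as the test set once, the rest as training data;
    returns the mean accuracy over the k trials."""
    folds = [f for f in k_fold_indices(len(X), k, seed) if f]
    accuracies = []
    for fold in folds:
        train = [i for f in folds if f is not fold for i in f]
        train_X, train_y = [X[i] for i in train], [y[i] for i in train]
        correct = sum(predict(train_X, train_y, X[j]) == y[j] for j in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)

def nn1(train_X, train_y, point):
    """Toy 1-nearest-neighbor classifier on the first attribute."""
    return min(zip(train_X, train_y), key=lambda t: abs(t[0][0] - point[0]))[1]

# 20 one-attribute points with the class boundary at the value 10
X = [(float(i),) for i in range(20)]
y = ["low"] * 10 + ["high"] * 10
print(cross_validate(X, y, nn1, k=10))
```

Every point is scored exactly once as test data, so the mean over folds estimates accuracy on unseen data, which is how the figures in this section were obtained.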
Figure 2 shows that KNN using the Gain Ratio has a higher accuracy than the original KNN, where the highest accuracy gain, 12.73%, was obtained on the hayes-roth dataset, and the lowest, 0.07%, on the abalone dataset. On average over all datasets, accuracy increased by 3.79%.

Testing Using Water Quality Status
To verify that attribute weighting using the gain ratio can improve the accuracy of KNN, a further test was conducted using the water quality status dataset derived from the research by [11]. The comparison between the original KNN and KNN with Gain Ratio can be seen in table 2. The table shows that the original KNN achieves its highest accuracy, 87.5%, at k = 1, 2, and 6, and its lowest, 82.5%, at k = 7, while KNN with Gain Ratio achieves its highest accuracy, 95%, at k = 3 and its lowest, 85.8%, at k = 10. The comparison shows that, averaged over all values of k, KNN with Gain Ratio increases accuracy by 5.35% over the original KNN.

Conclusion
Based on the findings and discussion described above, it can be concluded that:
1. The test using 10-fold cross-validation proved that Attribute Weighting Based K-Nearest Neighbor using Gain Ratio has a higher accuracy than the original KNN, where the highest accuracy gain, 12.73%, was obtained on the hayes-roth dataset and the lowest, 0.07%, on the abalone dataset. On average over all datasets, accuracy increased by 4.09%.
2. The test on the water quality dataset shows that weighting attributes using the gain ratio can improve the accuracy of KNN, with the highest accuracy gain of 9.2% at k = 3 and an average gain over all k of 5.35% against the original KNN.
3. KNN is a good classifier, but when the method is applied to textual (nominal) data, its performance varies with the size of the dataset: KNN performs poorly as the dataset grows and is best suited to small datasets.