Analysis of the C4.5 Algorithm on a Water Quality Dataset

The C4.5 algorithm still has weaknesses when predicting or classifying data with a large number of classes, which can increase decision-making time. An approach is therefore needed to improve the performance of the C4.5 algorithm by selecting split attributes using the average gain value to support prediction. The C4.5 algorithm is a Decision Tree method whose classification process uses the concept of information entropy. C4.5 inherits its split criteria from ID3; the Gain Ratio is a modification of that method. The ID3 algorithm uses Information Gain (IG) as its split-attribute criterion, whereas the C4.5 algorithm uses the Gain Ratio (GR), with the root taken from the attribute with the highest gain. Tests carried out on the Water Quality dataset show that the C4.5 method achieves an accuracy rate of 91.30%, with a classification error rate of 8.70%, demonstrating a successful implementation of the C4.5 method for predicting the Water Quality dataset.


Introduction
In classification, the data requires objects that have been defined in the data. Classification algorithms are widely used in decision making [1]. The Decision Tree algorithm performs grouping in a flowchart-like, tree-shaped structure, with classes divided into nodes. In a Decision Tree, the classification model tests attributes from the root down to the leaves, which hold the classification results [2]. The C4.5 algorithm is a data-mining classification method used to construct a decision tree. Decision trees serve as a reasoning procedure for obtaining answers to the problems that are entered. The C4.5 algorithm is the Decision Tree variant whose classification process uses the concept of entropy. Its split criteria are derived from ID3; the Gain Ratio is a modification of that earlier method. ID3 applies Information Gain as its split-attribute criterion, while C4.5 applies the Gain Ratio, where the root starts from the Gain with the highest value. The classification process with the C4.5 algorithm begins by calculating the entropy value. The Gain Ratio value is then calculated; the attribute with the highest Gain Ratio is selected as the root, lower-valued attributes become branches, and the Gain Ratio of each remaining attribute is recalculated without the attribute already selected as the root. This process is repeated until the Gain of the remaining attributes reaches 0.
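The root-selection and recursion procedure described above can be sketched as follows. This is a minimal illustrative sketch for small categorical data, not the authors' implementation; all function and variable names are my own.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, a):
    """Gain Ratio of attribute index a: Gain(S, A) / SplitInfo(S, A)."""
    n = len(labels)
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[a], []).append(y)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    split = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
    return gain / split if split else 0.0

def build_tree(rows, labels, attrs):
    """Recursively build a C4.5-style tree over categorical attributes."""
    # Stop when the node is pure or no attributes remain: emit a leaf label.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute with the highest Gain Ratio as the split (root).
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    rest = [a for a in attrs if a != best]  # recalc without the chosen root
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[best], ([], []))
        groups[row[best]][0].append(row)
        groups[row[best]][1].append(y)
    return {'attr': best,
            'branches': {v: build_tree(r, ys, rest)
                         for v, (r, ys) in groups.items()}}
```

For example, on a toy dataset where attribute 0 perfectly separates the classes, `build_tree` selects attribute 0 as the root and each branch becomes a pure leaf.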
In a decision tree approach, pruning is the process of cutting or removing unnecessary branches (nodes). Unnecessary nodes introduce noisy, less relevant data [3] and lead to an overly large decision tree, a condition known as overfitting [4]. A second pruning method uses threshold pruning [2]. The two pruning methods were tested with Decision Tree classification models using split criteria from the ID3 and C4.5 algorithms (heterogeneous cost) on six datasets, and it was concluded that both proposed pruning methods can be used for the decision tree classification model. In this study, we propose a framework for eliminating irrelevant features, without discarding information contained in the original data, that uses the C4.5 decision tree algorithm to classify the dataset. The aim is to predict the classification model with the C4.5 decision tree in order to obtain relevant features and improve C4.5 decision tree accuracy. To evaluate the proposed approach, we used the Water Quality dataset.

Research Method
The C4.5 heuristic uses two approaches in the testing, rated by probability, as follows: 1. Information Gain, which minimizes the total entropy of the subsets but is biased when testing with numerical data. 2. The Gain Ratio, obtained from the Information Entropy of each attribute of the data used in the test. The C4.5 algorithm is a decision tree method with a flowchart-like structure, where each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node determines a class. The topmost node of the tree is the root node. The stages of the C4.5 algorithm are as follows [5]:
a. Calculate the Entropy value for each attribute:
   Entropy(S) = Σ_{i=1}^{n} -p_i log2(p_i)
   where S is the data set, n is the number of partitions of S, and p_i is the proportion of subset S_i within S.
b. Calculate the Information Gain value for each attribute:
   Gain(S, A) = Entropy(S) - Σ_{i=1}^{n} (|S_i| / |S|) · Entropy(S_i)
   where A is the attribute being tested, |S_i| is the amount of data in partition i of attribute A, and |S| is the total amount of data tested.
c. Calculate the Split Information value for each attribute:
   SplitInfo(S, A) = -Σ_{i=1}^{n} (|S_i| / |S|) log2(|S_i| / |S|)
   The split attribute is then selected by the Gain Ratio, GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A).
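Stages a–c above can be written directly as functions. This is a minimal Python sketch of the standard formulas for categorical attributes, not the authors' code; the function names are my own.

```python
import math
from collections import Counter

def entropy(labels):
    """Stage a: Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Stage b: Gain(S, A) = Entropy(S) - sum(|S_i|/|S| * Entropy(S_i))."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

def split_info(rows, attr_index):
    """Stage c: SplitInfo(S, A) = -sum(|S_i|/|S| * log2(|S_i|/|S|))."""
    total = len(rows)
    counts = Counter(row[attr_index] for row in rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(rows, labels, attr_index):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)."""
    si = split_info(rows, attr_index)
    return information_gain(rows, labels, attr_index) / si if si > 0 else 0.0
```

For a two-class set split evenly by an attribute value, Entropy(S) = 1 and a perfectly separating attribute yields Gain = SplitInfo = 1, so its Gain Ratio is 1.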

Result And Discussion
The preprocessing results are represented in tabular form (x1, x2, x3 ... x32) with the same ordering as the data being tested, as shown in Table 1. The stages of the C4.5 algorithm are as follows:
a. Calculation of the Entropy value for each attribute. The total entropy calculation for each class, together with the results obtained for each attribute, is shown in Table 3. To classify the decision tree from the tested dataset, the calculations from the test are summarized in a table called the confusion matrix [6], which records the correct and incorrect classifications of the test data across the two classes, negative and positive, as shown in Table 4. The level of closeness between the predicted class and the actual class, i.e. the proportion of correct class predictions from the C4.5 classification model, is:
   Accuracy = (TP + TN) / (TP + TN + FP + FN) = (8 + 13) / (8 + 13 + 1 + 1) = 21/23 = 0.9130 × 100% = 91.30%
b. Classification Error = (FP + FN) / (TP + TN + FP + FN) = (1 + 1) / (8 + 13 + 1 + 1) = 2/23 = 0.0870 × 100% = 8.70%
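The accuracy and error figures can be reproduced directly from the confusion-matrix counts; a minimal sketch, taking TP = 8, TN = 13, FP = 1, FN = 1 from the classification-error calculation above:

```python
# Confusion-matrix counts from the C4.5 test on the Water Quality dataset.
TP, TN, FP, FN = 8, 13, 1, 1
total = TP + TN + FP + FN  # 23 test records

accuracy = (TP + TN) / total  # proportion of correct predictions
error = (FP + FN) / total     # proportion of misclassifications

print(f"Accuracy: {accuracy:.2%}")  # 91.30%
print(f"Error:    {error:.2%}")     # 8.70%
```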

Conclusion
The Water Quality Status dataset is obtained from the results of research [7], where the data were collected by the Ministry of Environment under the Water Quality Provisions, which are classified into four categories. The Water Quality Status dataset contains 117 records with 8 attributes and consists of 2 classes. The conclusion of the tests carried out on the Water Quality dataset with the C4.5 method is an accuracy rate of 91.30%, with a classification error rate of 8.70%, a successful implementation of the C4.5 method for predicting the Water Quality dataset.