Optimization of C4.5 algorithm-based particle swarm optimization for breast cancer diagnosis

Data mining has become a basic methodology for computational applications in the medical domain. It can be applied in the health field, for example to diagnose breast cancer, heart disease, diabetes, and other conditions. Breast cancer is the most common cancer in women, with more than one million cases and nearly 600,000 deaths occurring worldwide each year. The most effective way to reduce breast cancer deaths is early diagnosis. This study aims to determine the level of accuracy of breast cancer diagnosis. The research data are the Wisconsin Breast Cancer (WBC) dataset from the UCI machine learning repository. The method used is the C4.5 algorithm, with Particle Swarm Optimization (PSO) used for feature selection and to optimize the C4.5 algorithm. Ten-fold cross-validation is used as the validation method, together with a confusion matrix. The results show that the accuracy of the C4.5 algorithm with particle swarm optimization increased by 0.88% compared with the C4.5 algorithm alone.


Introduction
Data mining has become a basic methodology for computational applications in the medical domain. It can be applied in the health field, for example to diagnose breast cancer, heart disease, diabetes, and other conditions [1]. Data mining offers various techniques, such as estimation, classification, association, and clustering. Among these, classification algorithms play an important role in predictive analysis. Classification aims to assign each object to exactly one of a set of categories, called classes [2].
Data mining can be used in various fields, for example for clustering student scholarship applicants [3] and for optimizing the classification of student final projects [4]. In the health field, examples include predicting pregnancy hypertension with a decision tree technique [5] and identifying tuberculosis (TB) in humans using the Naïve Bayesian method [6].
One of the most powerful and widely used techniques for classification and prediction is the decision tree [7]. The decision tree is a frequently used classification algorithm that has a simple structure and is easy to interpret [8]. A decision tree transforms a very large set of facts into a tree that represents the rules [9]. The C4.5 algorithm has been shown to give the best prediction results in terms of accuracy and minimum execution time [10]. Many researchers have tried to apply machine learning algorithms to diagnose breast cancer. Breast cancer is the most common cancer among women in both developed and developing countries. It is a disease in which breast tissue cells grow excessively or develop in an uncontrolled way. Breast cancer is considered the most common invasive cancer in women, with more than one million cases and nearly 600,000 deaths occurring around the world each year [12]. The most effective way to reduce deaths from breast cancer is early diagnosis [13].
The C4.5 algorithm has weaknesses in handling large data, including: (1) empty branches: nodes with zero or near-zero values do not contribute to generating rules or to building classes for classification tasks, but they make the tree larger and more complex; (2) insignificant branches: these not only reduce the usefulness of the decision tree but also cause overfitting problems; and (3) overfitting: this occurs when the model fits data with unusual characteristics (noise) [5].
Data quality problems such as noise and overfitting can affect the performance of classification algorithms. Feature selection is commonly used in machine learning when datasets are high-dimensional and noisy. Feature selection is the process of selecting relevant features, that is, a subset of the candidate features [13]. Feature selection searches locally. Metaheuristic optimization, by contrast, can search the full solution space and use global search capabilities that significantly improve the ability to find high-quality solutions within a reasonable timeframe [14]. Improved algorithmic accuracy is required, for example through the application of discretization and bagging techniques to improve classification accuracy in the C4.5 algorithm [15].
One metaheuristic optimization method for feature selection is Particle Swarm Optimization (PSO). PSO has proven to be more competitive than genetic algorithms in some cases, especially in the area of optimization [16]. In this study, a PSO-based C4.5 algorithm is proposed to improve the accuracy of breast cancer diagnosis and to overcome the weaknesses of the C4.5 algorithm, using PSO metaheuristic optimization for feature selection and to optimize the accuracy of the C4.5 algorithm. Based on the description above, a more accurate method of diagnosing breast cancer is needed.
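To make the idea of PSO as a feature selector concrete, the following is a minimal sketch of a binary PSO in which each particle is a 0/1 mask over the features. It is an illustration under stated assumptions, not the configuration used in this study: the parameter values (`w`, `c1`, `c2`, `v_max`, swarm size, iterations) are arbitrary, and `toy_fitness` with its `RELEVANT` set is a hypothetical stand-in for the real fitness, which would be the cross-validated accuracy of C4.5 trained on the selected features.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=10, n_iter=30,
               w=0.7, c1=1.5, c2=1.5, v_max=6.0, seed=0):
    """Return (best_mask, best_score) found by a basic binary PSO."""
    rng = random.Random(seed)
    # Each particle holds a 0/1 feature mask and a real-valued velocity.
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_score = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[g][:], pbest_score[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the velocity so the sigmoid does not saturate.
                vel[i][d] = max(-v_max, min(v_max, v))
                # The sigmoid turns velocity into a selection probability.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            score = fitness(pos[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score > gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score


# Hypothetical stand-in fitness: rewards a made-up set of "useful" features
# and penalizes every other selected feature. A real run would instead
# return the accuracy of the classifier trained on the masked features.
RELEVANT = {0, 2, 5}


def toy_fitness(mask):
    hits = sum(mask[i] for i in RELEVANT)
    extras = sum(mask) - hits
    return hits - 0.5 * extras


mask, score = binary_pso(toy_fitness, n_features=9)
print("selected features:", [i for i, bit in enumerate(mask) if bit])
print("fitness:", score)
```

Because the fitness function is passed in as an argument, the same loop could be reused with any classifier; only the cost of each fitness evaluation changes.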

Methods
This research analyzes the comparison and combination of two data mining methods: the C4.5 algorithm and particle swarm optimization. The first step was to measure the accuracy of the C4.5 algorithm. The next step was to measure the accuracy of the C4.5 algorithm based on particle swarm optimization, in which particle swarm optimization serves as feature selection and optimizes the accuracy of the C4.5 algorithm. The two results were then compared to determine which algorithm gives better accuracy. The steps of the method were carried out in the following stages. The flowchart of the C4.5 algorithm optimized using particle swarm optimization is shown in Figure 1.
In the preprocessing stage, initial processing of the data was carried out. The Wisconsin breast cancer data contained 699 records with 11 attributes, of which 10 were numerical and 1 was categorical. Preprocessing in this research followed the KDD process, namely data cleaning, data selection, and data transformation.
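The comparison of the two variants can be sketched as follows, assuming that both are evaluated with ten-fold cross-validation and that accuracy is read off a confusion matrix pooled over the folds. All function names are hypothetical, the `"malignant"` and `"benign"` labels are illustrative, and `fit` and `predict` are placeholders for any classifier, such as C4.5 with or without PSO-selected features. A trivial majority-class model is used here only so that the example runs on its own.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices once and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def confusion_matrix(y_true, y_pred, positive="malignant"):
    """Return (tp, fp, fn, tn) for a two-class problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of correctly classified records."""
    return (tp + tn) / (tp + fp + fn + tn)


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Pool the confusion matrices of all k folds; return (accuracy, matrix)."""
    totals = [0, 0, 0, 0]
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(model, X[i]) for i in fold]
        cm = confusion_matrix([y[i] for i in fold], preds)
        totals = [a + b for a, b in zip(totals, cm)]
    return accuracy(*totals), tuple(totals)


# Placeholder classifier: always predicts the most frequent training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


# Made-up labels, only to exercise the functions.
X_demo = [[i] for i in range(20)]
y_demo = ["benign"] * 14 + ["malignant"] * 6
acc, cm = cross_validate(X_demo, y_demo, fit_majority, predict_majority)
print("accuracy:", acc, "(tp, fp, fn, tn):", cm)
```

Running `cross_validate` once per variant with the same `seed` keeps the folds identical, so the difference between the two accuracies reflects the method rather than the split.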

a. Data cleaning
At this stage, incomplete, empty, or null data, data containing noise, and inconsistent data were cleaned. There were 16 missing values in the bare nuclei attribute. There are several ways to handle missing values, including ignoring the tuple, filling in the missing value manually, using a global constant to fill in the missing value, using a measure of central tendency for the attribute (e.g., mean or median), using the attribute mean or median of all samples belonging to the same class as the given tuple, and using the most probable value to fill in the missing value [16]. Handling missing values with the average reduced the level of accuracy in this study. Therefore, missing values were handled by removing the affected data objects, so that the Wisconsin breast cancer dataset, originally 699 records, became 683 records. The details of the data to be cleaned are shown in Table 1.

b. Data selection
At this stage, data selection was carried out to reduce irrelevant and redundant data. In the Wisconsin breast cancer dataset, the sample code number attribute was eliminated because it is a nominal or ordinal feature, that is, a categorical type with a qualitative value. Its value is purely symbolic, so arithmetic operations cannot be performed on it as on a numerical type. Consequently, only 10 attributes were used, with 9 attributes as predictor variables and 1 attribute as the target variable. The attribute details are shown in Table 2.
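The cleaning and selection steps described above can be sketched in a few lines. This is an illustration under assumptions rather than the processing script of this study: the column names follow the public UCI description of the dataset, missing values are assumed to be marked with `?`, the class is assumed to be coded as 2 or 4, and the four rows in `RAW` are made-up records in the same layout.

```python
import csv
import io

# Column names assumed from the public UCI description of the dataset.
COLUMNS = ["sample_code_number", "clump_thickness", "cell_size_uniformity",
           "cell_shape_uniformity", "marginal_adhesion",
           "epithelial_cell_size", "bare_nuclei", "bland_chromatin",
           "normal_nucleoli", "mitoses", "class"]

# Made-up rows for illustration; the second one has a missing bare_nuclei.
RAW = """\
1001,5,1,1,1,2,1,3,1,1,2
1002,8,4,5,1,2,?,7,3,1,4
1003,3,1,1,1,2,2,3,1,1,2
1004,6,8,8,1,3,4,3,7,1,4
"""


def load_rows(text):
    """Parse comma-separated text into one dict per record."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def drop_missing(rows, marker="?"):
    """Data cleaning: discard every record that contains the marker."""
    return [r for r in rows if marker not in r.values()]


def select_attributes(rows, drop=("sample_code_number",)):
    """Data selection: remove identifier columns, convert the rest to int."""
    return [{k: int(v) for k, v in r.items() if k not in drop} for r in rows]


rows = load_rows(RAW)
clean = drop_missing(rows)
data = select_attributes(clean)
print(len(rows), "records loaded,", len(clean), "kept,",
      len(data[0]), "attributes per record")
```

Applied to the full dataset, the same two steps would be expected to reduce 699 records to 683 and 11 attributes to 10, matching the counts reported above.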