Bagging Techniques to Reduce Misclassification of Breast Cancer Prediction Based on Gradient Boosted Trees (GBT) Algorithm

Breast cancer is a malignant tumour that grows in breast cells and carries a risk of death. Breast cancer is staged from stage 0 to stage 4. The higher the stage, the higher the risk of death and the more difficult the cancer is to treat. Machine learning algorithms have been proposed to help predict breast cancer. Predictions are made by classifying patients as likely to have breast cancer or not. This research proposes implementing bagging techniques to reduce misclassification in the Gradient Boosted Trees (GBT) algorithm. The experimental results show that applying bagging techniques can reduce misclassification and improve prediction accuracy.


Introduction
Breast cancer is a leading cause of death, especially among women, throughout the world [1]. Breast cancer forms when cells in the breast grow abnormally and out of control [2]. These cells generally form tumours that can be felt as lumps and may be benign or malignant [3]. The collection of cells can spread throughout the body and grow in the tissue around the chest. Early symptoms of breast cancer are often undetected or difficult to recognize. Therefore, it is very important for women to follow the recommended screening guidelines for detecting breast cancer early [2]. Technologies for diagnosing or detecting breast cancer continue to advance and aim to provide less invasive options for more accurate diagnoses [1].
Many studies have applied machine learning specifically to predicting breast cancer, drawing on many sources [2]. A Gradient Boosted Trees (GBT) algorithm can be used to diagnose breast cancer.
The Gradient Boosted Trees (GBT) algorithm is an information-theoretical discriminative predictor for boosting regression accuracy. GBT classification works well on a broad range of data and on combinations of many features that are not normalized.
Some hyperparameters apply to each tree that is built (e.g., maximum tree depth), while others configure the model as a whole (e.g., the number of trees to build) [3]. However, the accuracy obtained from the GBT algorithm is still low, at 0.58%. Bagging techniques are therefore used to increase the prediction accuracy of the GBT algorithm. Based on the breast cancer prediction problem described above, the machine learning algorithm will be implemented with a feature selection technique using a dataset downloaded from the UCI Machine Learning Repository, and a bagging technique will be applied to reduce misclassification in the GBT algorithm.
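To make the role of these hyperparameters concrete, the following is a minimal sketch of gradient boosting, not the implementation used in this study. It assumes a squared-error loss, binary labels coded as 0 and 1, and trees fixed at depth 1 (stumps); the names `fit_gbt`, `predict_gbt`, `n_trees` and `learning_rate` are hypothetical and chosen only for illustration.

```python
# Minimal gradient boosting sketch: squared-error loss, depth-1 regression stumps.
# Assumption: this illustrates the general technique, not the paper's GBT setup.

def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimises squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:  # no valid split: predict the mean residual everywhere
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_gbt(X, y, n_trees=20, learning_rate=0.5):
    """n_trees is a model-wide hyperparameter; tree depth is fixed at 1 here."""
    base = sum(y) / len(y)
    scores = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is the plain residual.
        residuals = [yi - si for yi, si in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return base, learning_rate, stumps


def predict_gbt(model, row):
    base, learning_rate, stumps = model
    score = base + learning_rate * sum(stump_predict(s, row) for s in stumps)
    return 1 if score >= 0.5 else 0
```

Each new stump is fitted to the residuals left by the trees built before it, which is why the number of trees and the depth of each tree jointly control how closely the model fits the training data.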

Methodology
This study uses a secondary dataset. Secondary data are data that are not obtained directly from the object of research but have been collected by other parties. The secondary data used in this study are a collection of biomedical data taken from the UCI Machine Learning Repository, from which the dataset can be downloaded. The breast cancer collection in the UCI repository comprises four datasets: Breast Cancer, Wisconsin Breast Cancer (Original) or WBCD, Wisconsin Prognostic Breast Cancer or WPBC, and Coimbra Breast Cancer. The specifications of the dataset used in this study are shown in Table 1. The purpose of this study is to improve accuracy and AUC by applying bagging techniques to the GBT algorithm. This research is expected to yield a better breast cancer diagnosis than the GBT algorithm alone. The proposed framework of the prediction model is shown in Figure 1. The new GBT-based algorithm adds a novel component to the regular regression error component, which is optimized through dataset testing [3]. In software defect prediction, an ensemble algorithm (Bagging/AdaBoost) is applied to reduce misclassification because it can improve classification accuracy [5].
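The bagging idea referred to above can be sketched as follows: train several copies of a base learner on bootstrap samples of the training data and combine their predictions by majority vote. This is a minimal illustration under stated assumptions, not the configuration used in this study. In particular, `fit_mean_threshold` is a hypothetical placeholder base learner standing in for GBT, and all function names and the parameters `n_estimators` and `seed` are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) training examples with replacement."""
    indices = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in indices], [y[i] for i in indices]


def fit_bagging(X, y, fit_base, n_estimators=11, seed=0):
    """Fit one base model per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_base(*bootstrap_sample(X, y, rng)) for _ in range(n_estimators)]


def predict_bagging(models, predict_base, row):
    """Combine the base models' predictions by majority vote."""
    votes = Counter(predict_base(model, row) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical placeholder base learner (NOT GBT): thresholds the first
# feature midway between the two class means, assuming class 1 has the
# larger values.
def fit_mean_threshold(X, y):
    pos = [row[0] for row, label in zip(X, y) if label == 1]
    neg = [row[0] for row, label in zip(X, y) if label == 0]
    if not pos or not neg:  # the bootstrap sample contained a single class
        return None, (1 if pos else 0)
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2, None


def predict_mean_threshold(model, row):
    threshold, constant = model
    if threshold is None:
        return constant
    return 1 if row[0] > threshold else 0
```

An odd number of estimators avoids tied votes in the two-class case. In the setting of this study, the placeholder base learner would be replaced by a GBT model.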
The Gradient Boosted Trees (GBT) algorithm is an information-theoretical discriminative predictor used here to improve accuracy and AUC.
The algorithm proceeds in steps, the last of which is the output of the final classifier. The purpose of this model is to achieve accuracy significantly greater than that of the individual models and to be more robust against the effects of noise and overfitting of the original training data. The proposed model is applied using one dataset from the UCI Machine Learning Repository. The dataset is divided into five subsets, and each subset is used in turn as testing data while the others serve as training data, until every subset has been used for testing. The division into training data and testing data is shown in Figure 2. In the first validation, the first subset is used as the testing data, while the second to fifth subsets are the training data. In the second validation, the second subset is used as testing data and the others as training data. Validation is repeated until all subsets have been used as testing data.
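The rotation of testing and training data described above corresponds to k-fold cross-validation with five folds. A minimal sketch of how the index sets could be generated is given below; the function names are hypothetical, and the folds are assumed to be contiguous blocks of near-equal size.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_splits(n_samples, k=5):
    """Yield (train_indices, test_indices), using each fold once for testing."""
    folds = k_fold_indices(n_samples, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

In practice the sample order is usually shuffled before splitting, so that each fold contains a mix of both classes; whether this study shuffled the data is not stated.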
Validation results are used to measure model performance. Model performance is usually measured using a confusion matrix, a useful tool for analyzing how well classifiers can recognize tuples/features of different classes [8]. The confusion matrix also supports performance appraisal of classification models based on the number of objects predicted correctly and incorrectly [9]. The confusion matrix is a 2-dimensional matrix, shown in Table 2. The performance of the model can be seen from its Accuracy and AUC values. Accuracy is calculated from the entries of the confusion matrix.
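As an illustration of these measures, the sketch below uses the standard definitions: accuracy is (TP + TN) / (TP + TN + FP + FN), and AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The cell labels TP, TN, FP and FN and the function names are assumptions, since the layout of Table 2 is not reproduced here.

```python
def confusion_matrix(y_true, y_pred):
    """Count the four cells of a 2x2 confusion matrix for labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


def accuracy(cm):
    """Share of all predictions that are correct."""
    return (cm["TP"] + cm["TN"]) / sum(cm.values())


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that accuracy depends only on the predicted labels, whereas AUC requires a continuous score for each example, such as a predicted probability.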

Results and Discussion
Based on the model proposed in Figure 1, the first model measures the performance of the basic Gradient Boosted Trees algorithm as a classifier without optimization. The second model integrates Gradient Boosted Trees with bagging techniques. The graph in Figure 3 shows that predicting breast cancer using the Gradient Boosted Trees algorithm with bagging techniques reduces misclassification and improves prediction accuracy relative to the Gradient Boosted Trees algorithm without optimization. The accuracy and AUC of the Gradient Boosted Trees algorithm with the bagging technique are both high. Across the validation results, the model that applies bagging techniques to the Gradient Boosted Trees algorithm has higher accuracy and AUC values than the Gradient Boosted Trees algorithm without optimization.

Conclusion
Breast cancer prediction is an important research topic because breast cancer can be deadly if sufferers only become aware of it after it has reached the final stage. The results show that no model produces very good performance. Nevertheless, the experimental results show that applying bagging techniques can reduce misclassification and improve prediction accuracy. The proposed model can help predict breast cancer accurately.