Feature Selection Based on Naive Bayes for Caesarean Section Prediction

Data mining with machine learning algorithms can help analyze historical data to predict the need for a caesarean section. The dataset used for predicting caesarean section has many features, but some of them may be redundant or irrelevant, which can decrease classifier performance. This research proposes a model that applies feature selection to choose the relevant features and improve prediction performance for caesarean section. The feature selection techniques considered are Sequential Forward Selection (SFS), Sequential Backward Selection (SBS), Sequential Forward Floating Selection (SFFS), Sequential Backward Floating Selection (SBFS), and SelectKBest. The classification algorithm used is Naive Bayes. The model that gives the best performance is the one that applies SelectKBest for feature selection.


Introduction
Excessive interpretation of cardiotocography (CTG) is a common and direct cause of unnecessary caesarean section (CS) [1]. CTG is the most common method used to monitor the fetus during the early stages of delivery, and clinical decisions are made by visual inspection of CTG traces. Caesarean section is a surgical procedure that is usually performed during childbirth to prevent maternal and child deaths due to certain medical conditions [2] [3]. Caesarean section has improved, but whether it is the right decision in a given case remains an open question. Prediction and judgment about the need for a caesarean are very important because caesarean section carries long-term risks for mothers who have had one. The short-term adverse associations of caesarean delivery for the mother include infection, haemorrhage, visceral injury, and venous thromboembolism [4].
Before performing a caesarean section, many factors must be considered. In certain conditions the decision must be made quickly. Not all doctors have the same knowledge and experience, so wrong decisions can occur.
Data mining is a method by which computers discover patterns in a set of data, and it can be used to assist decision making. One data mining method suited to this purpose is classification, which assigns new data to one of several groups based on a previously provided dataset. This method can therefore be used to classify, based on her condition, whether a mother is likely to deliver by caesarean section or not.
The dataset used for predicting caesarean section has many features, but some of these features may be redundant or irrelevant, which can decrease the performance of classifiers [5]. This research proposes to apply feature selection to choose the relevant features and improve model performance for caesarean section prediction. The feature selection techniques considered are Sequential Forward Selection (SFS), Sequential Backward Selection (SBS) [6], Sequential Forward Floating Selection (SFFS), Sequential Backward Floating Selection (SBFS) [7], and SelectKBest. The classification algorithm used is Naive Bayes.

Method
This work is experimental research. The experiment was carried out by writing a program in Python that implements the feature selection algorithms and the classification algorithm. The proposed caesarean prediction models were then applied to the caesarean dataset.
In this experiment, a public dataset was obtained from https://archive.ics.uci.edu/ml/datasets/Caesarian+Section+Classification+Dataset. The dataset consists of 6 attributes, shown in Table 1. This dataset was collected and used in the research of Gharehchopogh, Mohammadi, and Hakimi [8] as well as in Amin and Ali's research [9]. The performance measurements of the models are compared to find the best model. The proposed caesarean prediction model is shown in Figure 1. First, the caesarean dataset is standardized. Then the features are selected using the proposed feature selection algorithm. The dataset that results from feature selection is used to train and test the model with 10-fold cross validation. The test results are entered into a confusion matrix to measure the performance of the model. The Naïve Bayes algorithm is used as the base classifier because previous studies have shown it to be efficient [10], effective [11], and well performing [12].
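The standardization and 10-fold cross validation steps described above can be sketched in plain Python as follows. This is only an illustrative sketch, not the program used in the experiment: the function names `standardize` and `k_fold_indices`, the shuffling seed, and the choice of population standard deviation are all assumptions.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Constant columns have zero standard deviation, so fall back to 1.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each sample appears in exactly one test fold, so every row of the dataset is used for testing once and for training in the remaining nine rounds.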
Based on the proposed model, there are six models, namely NB, SFS, SBS, SFFS, SBFS, and SelectKBest. The performance of the six models was compared to find the best one. SFS is a deterministic feature selection method that uses hill-climbing search to add and assess all possible single-attribute extensions of the current subset [13]. SBS works in the opposite direction to SFS [14]. SFS and SBS select features in one direction only, so a feature that has already been added or removed cannot be reconsidered; SFFS and SBFS avoid this weakness [15]. SelectKBest is a class in the scikit-learn library that selects the k features with the highest scores. The score is calculated using univariate statistical analysis, that is, an analysis of the variables one at a time. The test results are entered into a confusion matrix, and classifier performance is calculated in the form of accuracy and AUC (Area Under the Curve). The confusion matrix is a very useful tool for analysing the performance of classification models and for recognizing tuples and features from different classes [13]. Analysis with a confusion matrix is done by counting the number of objects that are predicted correctly and incorrectly to determine the performance of the model [14]. The validation results entered into the confusion matrix are used to calculate the accuracy and AUC of each model. This study uses the Python programming language, which provides many libraries, including ones for computing the confusion matrix, accuracy, and AUC.
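The greedy search behind SFS can be illustrated with a short sketch. It assumes a caller-supplied `score` function (for example, a cross-validated accuracy); the function name, the feature names "a", "b", "c", and the toy gain values are hypothetical and serve only to make the example runnable.

```python
def sequential_forward_selection(features, score, k):
    """Greedily grow a feature subset until it holds k features.

    `score` maps a list of feature names to a number; higher is better.
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Assess every single-feature extension of the current subset.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical per-feature gains standing in for a real model score.
TOY_GAIN = {"a": 0.10, "b": 0.30, "c": 0.20}


def toy_score(subset):
    return sum(TOY_GAIN[f] for f in subset)


chosen = sequential_forward_selection(["a", "b", "c"], toy_score, k=2)
```

SBS would start from the full feature set and remove one feature per step instead. Because a feature is never revisited once chosen, this sketch shows the one-directional behaviour that the floating variants are designed to relax.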
The Naïve Bayes algorithm is a machine learning method that analyzes data by calculating probabilities. When used to analyze large datasets, Naïve Bayes can provide high accuracy and speed. Naïve Bayes can also give accurate results when only a small dataset is available [15]. The Naïve Bayes equation is based on Bayes' theorem as follows:
$$P(C \mid x) = \frac{P(x \mid C)\,P(C)}{P(x)} \qquad (1)$$
In the above equation, C is the class whose probability is calculated given the value of feature x. When feature x has a continuous value, it is assumed to follow a Gaussian distribution with mean (μ) and standard deviation (σ) [16]. So, the equation is as follows:
$$P(x \mid C) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right) \qquad (2)$$
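To make the Gaussian assumption concrete, the following is a minimal from-scratch sketch of a Gaussian Naïve Bayes classifier: each class stores a prior and a per-feature mean and standard deviation, and prediction picks the class with the largest log posterior. The function names, the small floor added to σ, and the constant added inside the logarithm are assumptions made for numerical safety; this is not the implementation used in the experiment.

```python
import math
from collections import defaultdict


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    )


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature (mu, sigma) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col)
            # A small floor keeps sigma positive for constant columns.
            stats.append((mu, math.sqrt(var) + 1e-9))
        model[label] = (len(rows) / len(X), stats)
    return model


def predict(model, row):
    """Return the class with the largest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, stats = model[label]
        # Summing logs applies the naive independence assumption.
        return math.log(prior) + sum(
            math.log(gaussian_pdf(x, mu, sigma) + 1e-300)
            for x, (mu, sigma) in zip(row, stats)
        )
    return max(model, key=log_posterior)
```

The evidence term P(x) is the same for every class, so it is omitted when comparing posteriors.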

Results and Discussion
Following the model proposed in Figure 1, the basic model applies the Naïve Bayes algorithm as a classifier without any optimization, in order to establish its baseline performance. The other models integrate Naïve Bayes with a feature selection method, namely SFS, SBS, SFFS, SBFS, or SelectKBest.
Based on the experiment, the selected features are shown in Table 2. The performance is shown in Table 3 for accuracy and Table 4 for AUC. Based on the performance in Table 3 and Table 4, the highest performance is obtained when using one feature, namely Heart Problem. The highest accuracy is 65% and the highest AUC is 0.673. The performance results are visualized in the graphs shown in Figure 2 and Figure 3.
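For reference, accuracy and AUC of the kind reported above can be computed from predictions as in the sketch below. It assumes binary labels with 1 as the positive class and uses the rank-based definition of AUC (the probability that a positive sample receives a higher score than a negative one). The function names are illustrative and no values from the experiment are reproduced here.

```python
def confusion_matrix(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def auc(y_true, scores):
    """Probability that a positive outranks a negative; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy depends on the hard class predictions, while AUC depends only on how the classifier's scores rank positives against negatives, which is why the two measures are reported separately.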

Conclusion
The decision to perform a caesarean section or to deliver normally is very important, because the main goal is to prevent maternal and child mortality. Caesarean section also has effects that can harm the mother and child, so it should be avoided except in conditions that endanger the health of the mother or child. Analyzing a previously collected dataset with machine learning can help in making decisions about caesarean section. The experimental results show that applying feature selection helps choose relevant features and can improve the performance of the model. The best feature selection algorithm for predicting the need for a caesarean is SelectKBest.