Handling leukaemia imbalanced data using synthetic minority oversampling technique (SMOTE)

High dimensional data often leads to overfitting in the prediction model. Many feature selection methods have been developed to reduce dimensionality. However, previous studies in this area have reported that an imbalanced class raises another issue for the prediction model: its presence can lead to low accuracy on the minority class. High dimensional data with an imbalanced class therefore not only increases the computational cost but also reduces the accuracy of the prediction model. Handling an imbalanced class in high dimensional data is still not widely reported in the literature. The objective of this study is to improve the performance of the prediction model. We increased the sample size of the minority class using the Synthetic Minority Oversampling Technique (SMOTE) and performed dimension reduction using the minimum redundancy maximum relevance (mRMR) criterion. A support vector machine (SVM) classifier was used to build the prediction model. The leukaemia dataset was used in this study because of its high dimensionality and imbalanced class. Consistent with the literature, the results show that the shortlisted features perform better with SMOTE than without it. In conclusion, a better classification result can be achieved when high dimensional feature selection is coupled with an oversampling method. However, this study used a fixed amount of SMOTE synthesis; further study on different amounts of synthesis might yield different performance.


Introduction
In classification, one of the crucial challenges is the high dimensionality of the dataset [1]. The excessive features in high dimensional data are not always beneficial when building a predictive model because many of them are likely to be irrelevant or redundant [2]. High dimensional data can be tackled using feature selection, which is capable of selecting the best subset of features for the construction of the classification model [3]. However, feature selection still struggles when dealing with an imbalanced dataset [4,5]. An imbalanced dataset is one with a skewed distribution in the label class. The skewness results in biased predictions and misleading accuracy [6]. Consequently, the predictive model may only be able to predict the majority class because it does not learn enough about the minority class. There is therefore a high chance of misclassification due to the large difference between the label classes [7].
On an imbalanced dataset, most machine learning algorithms perform inefficiently and may even focus only on predicting the majority class. Some performance evaluation methods, such as the receiver operating characteristic (ROC) curve and the area under the curve, lose their ability to evaluate the performance of algorithms [8]. Moreover, the recall of a predictive model trained on imbalanced data tends to be low. There is therefore a need to balance the dataset before building any classification model or doing any further analysis. Previous studies have shown that feature selection plays an important role for high dimensional datasets, and several feature selection techniques have been developed [7,9,10]. However, these studies focused only on dimension reduction for high dimensional data and paid little attention to imbalanced high dimensional data.
This study focuses on improving the performance of the prediction model by increasing the sample size of the minority class in an imbalanced dataset before dimension reduction. An oversampling technique known as the Synthetic Minority Oversampling Technique (SMOTE) was implemented to increase the sample size of the minority class [10,11,12]. A mutual information-based feature selection method was then applied to the 'balanced' data to select the optimal number of significant features [13,14,15]. By hybridising these two methods on the imbalanced high dimensional dataset, better performance was achieved compared with past studies that used only feature selection [16,17].

SMOTE
The SMOTE method is a widely used oversampling technique for the imbalanced problem. It creates synthetic data points from the minority class [18,19]. Consider a training dataset with n samples and m features in the feature space. First, a sample x from the minority class is randomly selected. Next, its k nearest neighbours in the feature space (those with the smallest Euclidean distance to x) are considered. One of these nearest neighbours is then randomly chosen and denoted x_NN. The new synthetic data point is calculated using the following formula:

x_new = x + u (x_NN − x)   (1)

where u is a random number drawn from the Uniform(0, 1) distribution. The process is repeated until the desired number of synthetic samples is reached.
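The sampling step above can be sketched as follows (a minimal NumPy illustration of equation (1), not the original implementation; the function name and defaults are assumptions):

```python
import numpy as np

def smote_sample(X_minority, k=5, n_synthetic=10, seed=0):
    """Generate synthetic minority samples via the SMOTE formula
    x_new = x + u * (x_NN - x), with u drawn from Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                 # randomly select a minority sample x
        x = X_minority[i]
        # k nearest neighbours of x by Euclidean distance (excluding x itself)
        d = np.linalg.norm(X_minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        x_nn = X_minority[rng.choice(neighbours)]  # randomly chosen neighbour x_NN
        u = rng.uniform(0.0, 1.0)
        synthetic.append(x + u * (x_nn - x))       # equation (1)
    return np.array(synthetic)
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbours, so the new points stay inside the minority-class region.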

Feature Selection
The mRMR feature selection method is used to rank the importance of the features: for a given set of features it measures the relevance between the features and the class vector and the redundancy among the features [20,21]. It uses mutual information to estimate how one vector is related to another. The first mRMR condition is maximum relevance [22], whose idea is to select the features that are relevant to the class label:

max D(S, h),   D = (1/|S|) Σ_{i∈S} I(h, i)   (2)

where S denotes the subset of features we are looking for, I(h, i) represents the mutual information between the target class h and the feature i, and |S| is the number of features in S. Features that satisfy the max-relevance condition could still be redundant [23]. Therefore, the next step is to select features which are mutually exclusive of each other. The minimum redundancy condition is:

min R(S),   R = (1/|S|²) Σ_{i,j∈S} I(i, j)   (3)

where I(i, j) represents the mutual information between two features i and j. Features with high relevance to the class label and low redundancy among the features receive higher scores. The feature with the highest importance score is added to the selected feature set S at each step of the mRMR process.
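A minimal greedy sketch of this selection process for discrete features, using a hand-rolled histogram-based mutual information estimate (the helper names and the estimator are illustrative assumptions; the original work may use a different estimator):

```python
import numpy as np

def mutual_info(a, b):
    """Mutual information I(a, b) for two discrete vectors (natural log)."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((len(a_vals), len(b_vals)))
    for i, j in zip(a_idx, b_idx):
        joint[i, j] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / np.outer(pa, pb)[mask])).sum())

def mrmr_select(X, y, n_features):
    """Greedy mRMR: at each step add the feature maximising
    relevance I(y, f) minus mean redundancy with the selected set."""
    remaining = list(range(X.shape[1]))
    relevance = [mutual_info(y, X[:, f]) for f in remaining]
    selected = [int(np.argmax(relevance))]      # start with the most relevant feature
    remaining.remove(selected[0])
    while len(selected) < n_features and remaining:
        scores = [relevance[f]
                  - np.mean([mutual_info(X[:, f], X[:, s]) for s in selected])
                  for f in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

The score subtracts the average redundancy with already-selected features from the relevance to the class, which is the usual greedy approximation of conditions (2) and (3).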

The proposed method
The general flow of the proposed method is indicated in figure 1. The leukaemia data was first divided into two subsets, a training set and a test set, in the ratio 7:3. The training set underwent SMOTE to increase the number of samples in the minority class so that the classes became balanced. The mRMR algorithm was then applied to this balanced high dimensional training data to select the important features. A support vector machine (SVM) classifier was used to build the predictive model.

The performance indexes
The performance of the predictive model of the SMOTE-mRMR hybrid method was compared with three methods taken as a control group: (i) random selection, (ii) SMOTE with random selection, and (iii) mRMR (without SMOTE). The cross-validation accuracy on the training set and the accuracy on the test set were plotted to compare the performance of the proposed SMOTE-mRMR model with the control groups; the model with better accuracy on both was considered the better predictive model. The performance was also evaluated with other commonly used indexes such as the ROC curve and the confusion matrix (CM). The accuracy of the predictive model is calculated using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (4)

where TN denotes true negatives, TP true positives, FP false positives, and FN false negatives. The performance was further assessed using the recall, false-positive rate (FPR), false-negative rate (FNR) and the area under the curve (AUC). The recall lies between 0% and 100%, with a higher percentage indicating better predictive performance. The FPR and FNR fall within the same range but, on the contrary, the smaller the percentage, the better the predictive performance. The AUC ranges from 0 to 1, with a value closer to 1 implying higher accuracy.
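The indexes above follow directly from the confusion-matrix counts; a small sketch (the function name is illustrative):

```python
def performance_indexes(tp, tn, fp, fn):
    """Accuracy (equation (4)), recall, FPR and FNR from the
    confusion-matrix counts, expressed as percentages."""
    accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)
    recall   = 100 * tp / (tp + fn)   # higher is better
    fpr      = 100 * fp / (fp + tn)   # lower is better
    fnr      = 100 * fn / (fn + tp)   # lower is better
    return accuracy, recall, fpr, fnr
```

For example, a confusion matrix with TP = 45, TN = 50, FP = 2 and FN = 3 gives an accuracy of 95% and a recall of 93.75%.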

Dataset
[Table: composition of the leukaemia training and test sets — details lost in extraction]

Experimental Result and Discussion
The comparison between the SMOTE-mRMR method and the control methods reveals that SMOTE-mRMR outperforms the others both in the cross-validation accuracy of the training set (figure 2) and in the test accuracy (figure 3). The random selection method ranks lowest because the randomly selected features may be irrelevant. Because the data is high dimensional, SMOTE on its own does not perform well in the prediction model. Although mRMR is designed to handle high dimensional data, it is weak on imbalanced data; hence the hybrid SMOTE-mRMR method can improve the performance of the predictive model. The poor performance of randomly selected features is mainly due to the fact that relevance and redundancy are not considered during selection. This suggests that feature selection ought to be applied before building the predictive model. The mRMR method selects the features that have a strong relationship with the label class, but when working on imbalanced data its prediction power is still lower than that of the proposed hybrid method.
The performance evaluation methods used are the receiver operating characteristic (ROC) curve and the confusion matrix (CM). The average AUC value and the other performance measurements are summarised in table 1, which shows the average performance for each of the four method combinations. The highest average accuracy was obtained by mRMR+SMOTE (98.193%), whereas the lowest was obtained by Random Features (55.067%). When the dataset is imbalanced, accuracy alone is no longer a reliable performance index; the recall is a better measure for an imbalanced dataset. Notably, the recall achieved by the hybrid mRMR+SMOTE predictive model was also the highest.
Based on the measurements listed in table 1, it is interesting to note that the accuracy of the four methods can be arranged in ascending order as Random Features → Random Features + SMOTE → mRMR → mRMR+SMOTE. SMOTE improves Random Features by balancing the data, but mRMR still outperforms both of these methods. This implies that while data balancing is important, feature selection that considers relevance and redundancy is even more crucial. When the data is both imbalanced and high dimensional, mRMR+SMOTE is the best choice.

Conclusions
Imbalance in the data is a very common issue that degrades the performance of classifiers. On a set of leukaemia data, we applied the SMOTE oversampling technique followed by the mRMR feature selection method to tackle both the imbalance and the high dimensionality. This SMOTE-mRMR coupling gave better performance in terms of accuracy, recall and area under the curve. On this basis, the SMOTE-mRMR method seems to offer a better alternative for classifying imbalanced high dimensional data. Nevertheless, more investigation into the influence of the degree of imbalance between the label classes on the predictive model is necessary to validate the conclusions drawn here.