KNN-SVM Classifiers in Complex Diagnosis

In many applications, classification plays an indispensable role due to its powerful detection and diagnosis function. Especially in real disease data, detecting important factors and diagnosing outcomes can bring great benefits to patients. At the same time, complications in real data, such as imbalance and missingness, lead to additional challenges and difficulties. Ignoring missing data will undermine study efficiency and sometimes introduce substantial bias, while on imbalanced data classifiers tend to be overwhelmed by the majority classes and to ignore the minority ones. This paper develops new support vector machine classifiers using k-nearest neighbours' information (KNN-SVM), which impute missing data by calculating k-nearest neighbours' statistical characteristic values and interpolate new samples between k-nearest minority class examples. As comparisons, the paper uses different kernel functions in the KNN-SVM classifiers to show the different performances in disease diagnosis accuracy.


Introduction
Classification is one of the most widely discussed topics in many fields. From the perspective of detection and diagnosis, many classifiers have been well developed under the assumptions that the training sets are well balanced and completely observed and that all misclassification errors cost equally. Well-balanced data means that all classes have relatively equal numbers of samples, rather than some classes having many more samples than others. Completely observed data means that every variable and every sample is fully observed, without missing values. However, real-world data is usually imbalanced and partly missing, causing challenges and difficulties for standard classifiers. More specifically, with imbalanced data, classifiers tend to be overwhelmed by the majority classes and to ignore the minority ones, and missing data may lead to less efficiency and biased findings that do not represent all the subjects.
In imbalanced data sets, the classes having more examples are defined as the majority classes and the ones having fewer examples as the minority classes. In various real-world settings, the class imbalance problem often occurs in classification investigations for different reasons [1]. In medical record databases, such as records of whether patients are dead or not after an operation, a large number of patients will belong to the class "live" rather than "die", which is often treated as the minority class. In practical applications, the ratio of the minority classes to the majority classes can vary drastically from 0 to 1. Here 0 represents that the minority classes do not have any sample at all, and 1 represents that the minority classes have the same number of samples as the majority classes.
The basic idea of dealing with imbalanced data is to generate new minority class samples based on k-nearest neighbours' information, and thus obtain relatively equal numbers of samples in the majority and minority classes. Several methods already exist to accomplish this balance through under-sampling the majority class, over-sampling the minority class, or both [2][3][4]. Considering the information loss caused by under-sampling methods, the paper adopts the existing synthetic minority over-sampling technique (SMOTE). Its core idea is to generate new minority class samples by interpolating between existing minority class samples that lie close together [5].
Another common data problem that we mentioned above is missing data. Ignoring missing data will undermine study efficiency, and sometimes introduce substantial bias. In the case of categorical variables, relatively few methods can be used to impute new values. In this paper, we consider the k-nearest neighbours imputation method (KNN) based on the most frequent value, owing to the categorical variables in the real data [6].
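The mode-based KNN imputation described above can be sketched in a few lines. This is a minimal illustration on made-up data, not the paper's implementation: distances are computed only over the columns observed in the incomplete row, and the missing entry is filled with the most frequent value among the k nearest rows.

```python
import numpy as np
from collections import Counter

def knn_impute_mode(X, k=5, missing=-1):
    """Fill each missing categorical entry with the most frequent
    value among the k nearest rows (distance over observed columns)."""
    X = X.copy()
    for i, row in enumerate(X):
        miss = np.where(row == missing)[0]
        if miss.size == 0:
            continue
        obs = np.where(row != missing)[0]
        # mismatch count to every other row, using only columns observed in row i
        d = np.array([np.sum(X[j, obs] != row[obs]) if j != i else np.inf
                      for j in range(len(X))])
        neighbours = np.argsort(d)[:k]
        for col in miss:
            vals = [X[j, col] for j in neighbours if X[j, col] != missing]
            if vals:
                X[i, col] = Counter(vals).most_common(1)[0][0]
    return X

X = np.array([[1, 0, 2],
              [1, 0, -1],   # third feature missing
              [1, 0, 2],
              [0, 1, 3]])
print(knn_impute_mode(X, k=2))  # row 1 is filled with the mode value 2
```

The Hamming-style distance here is one simple choice for categorical covariates; any similarity measure over the observed components would fit the same scheme.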
As a widely used tool for classification, the support vector machine (SVM) was originally motivated by the geometric consideration of maximizing the margin [7]. The basic principle of SVM is to find a hyperplane that separates the two classes of data points. Many other classifiers also exist in the data mining and machine learning domains, such as random forests, neural networks and the lasso [8]. In our paper, we focus on the classical SVM classifier as the basic model and develop two kinds of new support vector machine classifiers using k-nearest neighbours' information (KNN-SVM), which simultaneously deal with imbalanced data and missing data in classification investigations.
More specifically, the paper develops the first kind of KNN-SVM classifiers to impute missing data by calculating k-nearest neighbours' means or modes and to interpolate some new samples between k-nearest minority class examples, and then to classify the patients into two groups according to the relatively balanced and completely observed data, which constitutes the main contribution of our paper [9]. As comparisons, the paper uses different kernel functions in the KNN-SVM classifiers to show the different performances in disease diagnosis accuracy. It should be noted that the paper does not develop KNN-SVM diagnosis tools and methods in both static and dynamic settings [10,11].
The remainder of the paper is organized as follows. We describe our theoretical investigation in Section 2, including the response variable, covariates and settings in Section 2.1, and the missingness in Section 2.2. In Section 3, we introduce the KNN-SVM classifiers. Section 4 then applies our KNN-SVM classifiers to real data analysis and compares their performances in disease diagnosis accuracy. Some final discussions are placed in Section 5. Figure 1 presents our investigation flowchart.

Response, Covariates and Settings
The real data is part of a surgery data set from one medical institute, containing 552 patients and 38 variables. The response variable, which is binary (equal to 1 or 0), represents whether the patients are currently dead or not: 1 represents the minority class "die" with 101 samples, and 0 represents the majority class "live" with 451 samples. The imbalance rate is therefore 451/101 (4.47:1). The covariates represent 36 physical and chemical indicators from the 552 patients. For brevity, we denote the setting with imbalance rate 4.47:1 from 552 patients as Setting One.
To further investigate the performance of our proposed KNN-SVM, we randomly delete 50 samples belonging to the minority class, which yields another sub-dataset containing 502 patients. Obviously, the imbalance rate here is 451/51 (8.84:1), and we denote this setting as Setting Two.
In both Setting One and Setting Two, the response Y and the 36 covariates are unchanged. All the covariates and their abbreviations can be seen in table 1. As mentioned in the former section, ignoring missingness in the input features may lead to potential bias or less efficiency. The common approach is to impute the missing values in some way. In the paper, we assume that the missingness is missing at random (MAR). MAR can be characterized by the conditional distribution of the missing data indicator matrix $M$ given $X$, say $f(M \mid X, \phi)$. More specifically, the missingness depends only on the observed components and not on the components that are missing [9]. That is,

$$f(M \mid X, \phi) = f(M \mid X_{obs}, \phi),$$

where $X = (X_{obs}, X_{mis})$ represents the full covariate vector, $X_{obs}$ the observed components and $X_{mis}$ the missing components, $\phi$ represents an unknown parameter, and $M$ represents the missing data indicator matrix, whose elements equal 1 when the corresponding entry is missing and 0 otherwise.

KNN-SVM Classifiers
Support vector machine classifiers using k-nearest neighbours' information (KNN-SVM) can be summarized as an optimization problem for imbalanced and partly missing data. The main idea of KNN-SVM consists of three steps: Step 1 Impute missing data by calculating k-nearest neighbours' characteristic values, following the existing k-nearest neighbours imputation method. That is, KNN-SVM uses the k nearest neighbours to fill in the missing data in a data set. For each case with any missing data, it searches for its k most similar cases and uses these cases' values to fill in the missing parts. In our paper, we choose k = 5 and 10.
Step 2 Interpolate some new samples between k-nearest minority class examples, following the existing synthetic minority over-sampling technique (SMOTE). In other words, SMOTE generates new samples in the following way: calculate the difference between the target feature vector and its nearest neighbour, multiply this difference by a random number between 0 and 1, and add it to the target feature vector. This selects a random point along the line segment between two specific samples. The newly generated samples increase the number of samples belonging to the minority class, which helps to force the corresponding decision region to become more general. In our paper, we choose k = 5.
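The interpolation in Step 2 can be illustrated with a short sketch. This is a simplified stand-in for SMOTE, not the original implementation; it assumes a numeric matrix `X_min` holding only the minority class samples.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between a random minority sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                       # exclude the sample itself
        nn = np.argsort(d)[:k]              # indices of the k nearest neighbours
        j = rng.choice(nn)
        gap = rng.random()                  # random number in [0, 1)
        # point on the segment between sample i and neighbour j
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_sample(X_min, k=2, n_new=5)
print(synthetic.shape)  # (5, 2)
```

Each synthetic point lies on a line segment between two existing minority samples, which is exactly the "random point along the line segment" described above.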
Step 3 Run the support vector machine procedure on the relatively balanced and completely observed data. Therefore, KNN-SVM can be written in its dual form as the following optimization problem:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j)$$

$$\text{subject to}\quad \sum_{i=1}^{n}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C,\ i = 1, \dots, n,$$

where $K(\cdot,\cdot)$ denotes the kernel function and $C$ the cost parameter.
Based on all the above, the KNN-SVM classifiers enlarge the feature space in specific ways using kernels. In our paper, we consider three kernels: linear, polynomial and radial. Algorithm 1 displays the details of the KNN-SVM classifiers.
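Fitting the SVM with these three kernels can be done directly in scikit-learn. This is a generic sketch on synthetic data, not the surgery data set; note that scikit-learn names the radial kernel `"rbf"`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic two-class data standing in for the real covariates
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):   # "rbf" is the radial kernel
    clf = SVC(kernel=kernel, C=1.0).fit(X_tr, y_tr)
    print(kernel, round(clf.score(X_te, y_te), 3))
```

The cost parameter `C` here corresponds to the upper bound on the dual coefficients in the optimization problem above.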

Algorithm 1. KNN-SVM classifiers
Step 1: Impute missing data according to the existing k-nearest neighbours imputation method.
Step 2: Interpolate some new samples between k-nearest minority class examples.
Step 3: Run support vector machine procedure on relatively balanced and completely observed data.
Step 4: Compare the accuracies, computing times and choose the most appropriate KNN-SVM classifier.
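The first three steps of Algorithm 1 can be sketched end to end. This is an illustrative pipeline on toy data, not the paper's implementation: scikit-learn's KNNImputer (which averages neighbours, so it is a numeric-feature stand-in for the mode-based imputation) handles Step 1, and a simple nearest-neighbour interpolation handles Step 2.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# toy imbalanced data (50 majority, 10 minority) with some missing entries
X = rng.normal(size=(60, 4))
y = np.array([0] * 50 + [1] * 10)
X[rng.integers(60, size=8), rng.integers(4, size=8)] = np.nan

# Step 1: impute each missing value from its 5 nearest neighbours
X = KNNImputer(n_neighbors=5).fit_transform(X)

# Step 2: interpolate new minority samples between nearest minority pairs
X_min = X[y == 1]
new = []
for i in range(len(X_min)):
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    d[i] = np.inf
    j = np.argmin(d)                       # nearest minority neighbour
    for _ in range(4):                     # 4 synthetic points per sample
        new.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
X_bal = np.vstack([X, new])
y_bal = np.concatenate([y, np.ones(len(new), dtype=int)])

# Step 3: fit the SVM on relatively balanced, completely observed data
clf = SVC(kernel="rbf").fit(X_bal, y_bal)
print(np.bincount(y_bal))  # classes now 50 vs 50
```

Step 4 would then compare accuracies and computing times across kernels and K values, as done in the next sections.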

Accuracy Comparisons
Traditionally, accuracy is one of the most commonly used measures and plays a crucial role both in assessing classification performance and in guiding classifier modelling. Because the response variable is binary and categorical, the paper investigates the two-class problem. In imbalanced domains, most studies mainly concentrate on the two-class problem, because a multi-class problem can be reduced to a collection of two-class problems. By convention, the accuracy measure is calculated from the confusion matrix, which can be seen in table 3.

Table 3. A confusion matrix for a two-class classification
                        Predicted as positive    Predicted as negative
Actual positive class   True Positive (TP)       False Negative (FN)
Actual negative class   False Positive (FP)      True Negative (TN)

Table 3 displays a confusion matrix for the two-class problem. The first column gives the actual class labels of the samples: actual positive class and actual negative class. The first row gives the predicted class labels: predicted as positive and predicted as negative. True Positive (TP) and True Negative (TN) denote the numbers of positive and negative samples that are classified correctly, while False Negative (FN) and False Positive (FP) denote the numbers of misclassified positive and negative samples, respectively. The accuracy is then

Accuracy = (TP + TN) / (TP + TN + FP + FN).

Table 4 reports the diagnosis accuracies calculated by the above equation using the KNN-SVM classifiers. As a comparison, the paper directly deletes the samples with missing data before running the SVM classifier, which is similar to complete case analysis (CC). We therefore denote this competitor as CC-SVM.
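The accuracy computation from the confusion matrix is a one-liner; the counts below are purely illustrative, not the paper's results.

```python
# illustrative confusion-matrix counts (not the paper's results)
TP, FN, FP, TN = 80, 20, 10, 90

# accuracy = correctly classified samples over all samples
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.85
```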
From table 4, we can draw the following conclusions: (1) The classifiers with the "radial" kernel always perform relatively better in accuracy than the classifiers with the other two kernels, "linear" and "polynomial". (2) KNN-SVM classifiers with the "polynomial" or "radial" kernel perform better than CC-SVM with the corresponding kernels. Unlike the "linear" kernel, both "polynomial" and "radial" are suitable for nonlinear classification problems. This illustrates that nonlinear KNN-SVM classifiers are more powerful than nonlinear CC-SVM, owing to the advantages of the "polynomial" and "radial" kernels. (3) In Setting One, the KNN-SVM classifiers with K = 10 are slightly more accurate than the classifiers with K = 5. However, increasing K does not bring an obvious and stable improvement in diagnosis accuracy.

Computing Time
In this section, we investigate the computing time of the different KNN-SVM classifiers and their competitors, reported in table 5. As expected, the computing time of the KNN-SVM classifiers is three times that of CC-SVM. However, every classifier's computing time is less than 1.2 seconds, and as K increases from 5 to 10, the KNN-SVM classifiers remain comparable. For brevity, table 5 displays the computing time in Setting One only and does not report the corresponding results in Setting Two.

Discussions and Conclusion
In this paper, we develop new KNN-SVM classifiers for complex real data on disease diagnosis. Essentially, the KNN-SVM classifiers are extensions of the well-known SVM with more "powerful functions" for dealing with complex imbalanced and missing data simultaneously. We call them KNN-SVM classifiers because they use k-nearest neighbours' information in the following two ways to accomplish these "powerful functions": (1) imputing missing data by calculating k-nearest neighbours' statistical characteristic values, and (2) interpolating some new samples between k-nearest minority class examples.
To further investigate the KNN-SVM classifiers' accuracy, the paper chooses three types of kernels and two values of K for calculating the imputed values of the missing data. Through the real data, we find that the classifiers with the "radial" kernel always perform relatively better in accuracy than the classifiers with the other two kernels, "linear" and "polynomial".
In the future, we will continue to develop more valuable diagnosis tools and methods in both static and dynamic settings. From the perspective of dealing with imbalanced data, the classifiers in our paper belong to the category of over-sampling methods. More specifically, the proposed KNN-SVM classifiers generate new minority class samples to obtain relatively equal numbers of samples in the majority and minority classes. In the future, we will carry out more investigations on developing classifiers with under-sampling techniques, which also reflect the idea of "k-nearest neighbours" and enrich our investigations into classifiers for more complex diagnosis.

Declaration of Competing Interest
The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

Table 1 .
All the covariates and their corresponding abbreviations. XP1 and XP2 have observations with missing values. Table 2 displays the missing rates of the above nine covariates and their corresponding types in the two settings, Setting One and Setting Two.

Table 4 .
Diagnosis accuracies using KNN-SVM classifiers and their competitors in Setting One and Setting Two.

Table 5 .
Computing time (seconds) using KNN-SVM classifiers and their competitors in Setting One.