Aided diagnosis methods of breast cancer based on machine learning

In medicine, quickly and accurately determining whether a patient's tumor is malignant or benign is key to treatment. In this paper, K-Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA) and Logistic Regression were applied to classify seven indicators in breast cancer (thyroid, Her-2, PR, ER, Ki67, metastasis and lymph nodes), in order to distinguish benign from malignant breast tumors and support the aided diagnosis of breast cancer. The results showed that the highest classification accuracy of LDA was 88.56%, while KNN and Logistic Regression both outperformed LDA, with the best accuracy reaching 96.30%.


Introduction
Breast cancer, which ranks first in incidence among malignancies in women, is a major malignant tumor that seriously affects women's physical and mental health and can be life-threatening. It is the leading cause of death among women between 40 and 55 years of age and the second overall cause of death among women, exceeded only by lung cancer. Fortunately, the mortality rate from breast cancer has decreased in recent years with an increased emphasis on diagnostic techniques and more effective treatments. A key factor in this trend is the early detection and accurate diagnosis of the disease. Computer-aided diagnosis (CAD) systems are therefore designed to support radiologists in visually screening mammograms and to avoid misdiagnosis. Classification systems can help minimize errors made by inexperienced experts and allow medical data to be examined in less time and in more detail.
There has been a lot of research on the aided diagnosis of breast cancer, and most studies have reported high classification accuracy. Quinlan reached 94.74% classification accuracy using 10-fold cross validation with the C4.5 decision tree method [1]. H. J. Hamilton and N. Shan reached 94.99% accuracy with the RIAC technique [2], while B. Ster and A. Dobnikar obtained 96.8% with linear discriminant analysis [3]. Using neuro-fuzzy techniques, the method proposed by D. Nauck and R. Kruse achieved an accuracy of 95.06% [4]. Abonyi and Szeifert obtained an accuracy of 95.57% with a supervised fuzzy clustering technique [5]. A fuzzy-GA method was introduced that achieved a classification accuracy of 97.36% [6]. Three different methods, optimized learning vector quantization (LVQ), big LVQ and the artificial immune recognition system (AIRS), were applied and obtained accuracies of 96.7%, 96.8% and 97.2%, respectively [7]. Ubeyli compared a multilayer perceptron neural network, a combined neural network, a probabilistic neural network, a recurrent neural network and SVM; the highest classification accuracy, 97.36%, was achieved by SVM [8]. Two further methods, Bayesian classifiers and artificial neural networks, obtained accuracies of 92.80% and 97.90%, respectively [9]. Murat Karabatak and M. Cevdet Ince, combining association rules with a neural network, obtained an accuracy of 95.60% [10].
In this paper, K-Nearest Neighbor, Linear Discriminant Analysis and Logistic Regression, in combination with normalization, principal component analysis and 5-fold cross validation, are applied to classify thyroid, Her-2, PR, ER, Ki67, metastasis and lymph nodes in breast cancer, in order to distinguish benign from malignant breast tumors and support the auxiliary diagnosis of breast cancer.

K-Nearest Neighbor
K-Nearest Neighbor (KNN) is a simple and effective non-parametric classifier in the field of pattern recognition. Its core idea is that if the majority of the k samples nearest to a given sample in the feature space belong to a certain category, the sample is assigned to that category as well. In the KNN algorithm, the similarity between objects is usually measured by the distance between them, generally the Euclidean distance or the Manhattan distance, as shown in equation (1) and equation (2), respectively.
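The voting rule and the two distance measures can be illustrated with a minimal sketch. It uses the standard definitions of the Euclidean distance, sqrt(sum_i (a_i - b_i)^2), and the Manhattan distance, sum_i |a_i - b_i|. The function names, the value k = 3 and the toy data are assumptions made for illustration and are not taken from the experiment.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance: sqrt(sum_i (a_i - b_i)^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan distance: sum_i |a_i - b_i|."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Return the majority label among the k training samples nearest to query."""
    # Rank training samples by their distance to the query point.
    order = sorted(range(len(train_X)),
                   key=lambda i: distance(train_X[i], query))
    # Vote among the labels of the k closest samples.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration: two features per sample,
# label 0 = negative class, label 1 = positive class.
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
train_y = [0, 0, 0, 1, 1, 1]

label_near_first_cluster = knn_predict(train_X, train_y, [1.1, 1.0])
label_near_second_cluster = knn_predict(train_X, train_y, [3.0, 3.0],
                                        distance=manhattan)
```

An odd k is used here so that a two-class vote cannot end in a tie.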

Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is one of the commonly used classification methods in statistics. The basic idea of the algorithm is to project high-dimensional samples onto the best discriminant vector space, so that the projected samples have the largest between-class distance and the smallest within-class distance.
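For the two-class case, this projection idea can be sketched with Fisher's linear discriminant, where the direction is w = Sw^{-1}(m1 - m0), Sw is the within-class scatter matrix and m0, m1 are the class means. The sketch below is an illustration under stated assumptions, not the implementation used in the experiment: the helper names and the toy data are hypothetical, and the decision threshold is placed at the midpoint of the projected class means, which assumes equal class priors.

```python
def mean_vector(rows):
    """Column-wise mean of a list of samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def within_class_scatter(rows, mean):
    """Sum of outer products (x - mean)(x - mean)^T over one class."""
    d = len(mean)
    S = [[0.0] * d for _ in range(d)]
    for x in rows:
        diff = [xi - mi for xi, mi in zip(x, mean)]
        for i in range(d):
            for j in range(d):
                S[i][j] += diff[i] * diff[j]
    return S


def solve(A, b):
    """Solve A w = b by Gaussian elimination (assumes A is non-singular)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w


def fit_lda(class0, class1):
    """Return the Fisher direction w and a midpoint decision threshold."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    S0 = within_class_scatter(class0, m0)
    S1 = within_class_scatter(class1, m1)
    Sw = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(S0, S1)]
    w = solve(Sw, [b - a for a, b in zip(m0, m1)])
    # ASSUMPTION: equal priors, so the threshold is the projected midpoint.
    threshold = sum(wi * (a + b) / 2 for wi, a, b in zip(w, m0, m1))
    return w, threshold


def predict_lda(w, threshold, x):
    """Project x onto w and compare with the threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else 0


# Toy data, invented for illustration only.
class0 = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [2.0, 2.5]]
class1 = [[5.0, 6.0], [6.0, 5.0], [5.5, 5.5], [6.0, 6.5]]
w, threshold = fit_lda(class0, class1)
```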

Logistic Regression
Logistic Regression is a classification method for two-class problems. Its basic flow is first to find a suitable predictor function, generally denoted h, which is the classification function used to predict the result for the input data, as in equation (3).
The next step is to construct a Cost function that represents the deviation between the predicted output (h) and the training data class (y). Taking the "loss" over all training data into consideration, the Cost is summed or averaged into the function J(θ), which indicates the deviation of the predicted values of all training data from the actual categories, as in equation (4). The smaller the value of J(θ), the more accurate the predictor function, so the final step is to find the minimum of J(θ).
... were also very important in deciding the patient's treatment schedule.
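The three steps of logistic regression (predictor function h, cost J(θ), minimization) can be sketched as follows. Since equations (3) and (4) are not reproduced here, the sketch assumes the commonly used forms: a sigmoid predictor h(x) = 1 / (1 + exp(-θ·x)) and the averaged cross-entropy cost, minimized by plain gradient descent. The function names, learning rate, number of epochs and toy data are all illustrative assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """Predictor h(x): sigmoid of theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)


def cost(theta, X, y):
    """J(theta): cross-entropy averaged over all training samples (assumed form)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total -= yi * math.log(h + eps) + (1 - yi) * math.log(1 - h + eps)
    return total / len(y)


def fit(X, y, lr=0.1, epochs=2000):
    """Minimize J(theta) by batch gradient descent, starting from zeros."""
    theta = [0.0] * (len(X[0]) + 1)
    m = len(y)
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = predict_proba(theta, xi) - yi
            grad[0] += err  # gradient for the intercept
            for j, xij in enumerate(xi):
                grad[j + 1] += err * xij
        theta = [t - lr * g / m for t, g in zip(theta, grad)]
    return theta


# Toy one-feature data, invented for illustration only.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
theta = fit(X, y)
```

A sample is then assigned to the positive class when `predict_proba(theta, x)` exceeds 0.5.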

Data Processing
Data normalization is a basic step of data mining. Different evaluation indicators often have different dimensions and units, which affects the results of data analysis. To eliminate this effect, the data must be standardized. After the raw data are normalized, the indicators are on the same order of magnitude and suitable for comprehensive evaluation. In this paper, data normalization is defined as in equation (5).
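As an illustration of bringing indicators onto the same scale, the sketch below applies min-max normalization, x' = (x - min) / (max - min), to each feature column. This is an assumption: equation (5) is not reproduced here, and the paper may have used a different scheme, such as z-score standardization. The function names and sample values are hypothetical.

```python
def min_max_normalize(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def normalize_columns(rows):
    """Apply min-max normalization to every feature column of a data set."""
    columns = [min_max_normalize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Two indicators with very different units (values invented for illustration).
raw = [[1, 100], [3, 300], [2, 200]]
scaled = normalize_columns(raw)
```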
Principal component analysis is also an important step in building classification systems. It derives a few principal components from the original variables so that they retain as much of the information in the original variables as possible and predict as well as possible. K-fold cross validation has been shown to be efficient for small data samples and gives more reliable results than ordinary test-sample estimates. In this procedure, the whole set is divided randomly into k subsets of nearly equal size. Each time, k-1 of the subsets are used for training and the remaining one is used for testing, which guarantees that each subset is involved in testing exactly once. The k results obtained are then averaged to give the accuracy of the algorithm.
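The k-fold procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the function names, the fixed random seed and the `fit_predict` callback signature are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives fold sizes that differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(X, y, fit_predict, k=5, seed=0):
    """Average test accuracy over k folds.

    fit_predict(train_X, train_y, test_X) is a hypothetical callback that
    trains a classifier and returns the predicted labels for test_X.
    """
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    # Each fold is used for testing exactly once; average the k results.
    return sum(scores) / k


folds = k_fold_indices(77, k=5)
```

With 77 samples and k = 5, the folds contain 15 or 16 samples each.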

Experimental Scheme
In the experiment, all 77 samples were used for training to construct a classification model, and 20 samples were then randomly chosen for testing. Normalization, principal component analysis and 5-fold cross validation were each applied to the original data. Based on the processed data and the raw data, the three methods KNN, LDA and Logistic Regression were applied to classify thyroid, Her-2, PR, ER, Ki67, metastasis and lymph nodes in breast cancer.
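The performance of the classifiers was evaluated using sensitivity (Sen), specificity (Spe) and accuracy. Sensitivity is the proportion of positive samples identified correctly, specificity is the proportion of negative samples identified correctly, and accuracy is the percentage of all samples identified correctly. They can be written as in equation (6).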
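The three measures can be computed from the confusion counts using their standard definitions: Sen = TP / (TP + FN), Spe = TN / (TN + FP) and Accuracy = (TP + TN) / (TP + TN + FP + FN). The sketch below assumes these standard forms, since equation (6) is not reproduced here. The function names and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Share of positive samples identified correctly (0.0 if there are none)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Share of negative samples identified correctly (0.0 if there are none)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def accuracy(tp, fp, tn, fn):
    """Share of all samples identified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Labels invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
```

Note that a classifier that never predicts the positive class has a sensitivity of zero even when its accuracy is high, which is one way unevenly distributed classes can show up in such measures.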

Results and Discussion
Based on the original features, the three methods KNN, LDA and Logistic Regression were used to obtain the sensitivity, specificity and classification accuracy values shown in Table 1.
From Table 1, we can see clearly that with logistic regression, metastasis, thyroid and Ki-67 had relatively high accuracies of 92.60%, 88.89% and 85.19%, respectively. With LDA, the classification accuracies of all seven indicators were relatively low. With KNN, however, the classification accuracy of metastasis was the highest, at about 96.30%. These results show that, for the same data set, the classification performance varies from classifier to classifier, so the choice of classifier is key to obtaining good classification performance.
In addition, the sensitivity of thyroid, Her-2 and metastasis in KNN and of Her-2 and metastasis in logistic regression, as well as the specificity of ER and Ki-67 in KNN and logistic regression, were low, and some were even zero. This may be because the sample data set was not large enough and the positive and negative samples were unevenly distributed, so the characteristics of the data were not learned sufficiently. More sample data will therefore be collected in future work.
After the data were processed by normalization, principal component analysis and 5-fold cross validation, respectively, the three methods KNN, LDA and Logistic Regression were used to obtain the classification accuracy values again. These results are compared with the previous results in Table 2.
As can be seen from Table 2, with KNN, the results for thyroid, PR and ER after data normalization were higher than the original results. With LDA, the classification accuracies of all indicators improved slightly with 5-fold cross validation compared with the original results. With logistic regression, however, the original features without any processing achieved the highest accuracy. The experimental results show that, for the same data set, the classification performance differs depending on how the input features are processed, so the choice of data processing method is also key to obtaining better classification performance.
As seen from Table 1 and Table 2, KNN gave the better classification performance, with a highest accuracy of 96.30%, while for LDA and SVM the highest accuracies were 92.60% and 85.19%. For different classification samples, choosing a suitable classifier to obtain higher accuracy is therefore essential for the aided diagnosis of breast cancer.

Conclusion
Diagnosis of breast cancer requires the comprehensive use of clinical, imaging, pathological and computer-aided diagnosis knowledge and other technical means. Early detection and early diagnosis of breast cancer are the key to improving the curative effect. In this paper, three methods, KNN, LDA and logistic regression, were used to identify breast cancer pathology indicators, to help doctors make a comprehensive assessment when deciding on specific treatment options and drug selection. The experimental results showed that KNN achieved better classification accuracy than LDA and logistic regression and is expected to become an effective and practical computer-assisted diagnostic tool for breast cancer that could reduce the rate of subjective human misdiagnosis and benefit breast cancer patients.