Efficient multi-level lung cancer prediction model using support vector machine classifier

This paper addresses the need for an interactive learning framework that supports effective screening for disease in a patient. Principal component analysis (PCA) is used to prepare the data for classification: it combines correlated features and spreads the samples along a small number of components. Scree plot examination indicates how many principal components are to be retained. Support Vector Machine (SVM) is a fast and dependable classification algorithm that can outperform other techniques when only a limited amount of data is available. The retained components are fed to the SVM for classification. Identifying a pre-malignant stage alerts clinical experts to give additional attention to those patients. Predictive ability is measured in terms of the confusion matrix. The developed model gives high accuracy in the early detection of different levels of malignancy.


Introduction
At present, cancer (malignant growth, or carcinoma) is a leading cause of death around the globe. Among the different types of cancer, such as breast, liver, stomach and colorectal, the largest number of deaths is caused by lung carcinoma (1.6 million) [1]. According to global data for 2012, lung cancer accounts for 13% of all cancers considered. Lung carcinoma appears to develop through a progression of morphological changes [8]. The two varieties of lung cancer are Small Cell and Non-Small Cell Lung Cancer (SCLC and NSCLC) [2]. The causes of lung cancer include hereditary and non-hereditary factors [3]. Tobacco smoking is by far the leading cause of lung cancer. It can severely affect the health of chain smokers as well as people in contact with smokers [4]. The lungs of non-smokers can be affected by this malignancy through air pollution or a working environment with radon, asbestos or other chemical exposure [4]. Hereditary causes include gene mutations in the lungs. Individuals who have received radiation therapy for other diseases can also develop this malignancy [4]. The condition is mostly recognized at an advanced stage, which can limit the effectiveness of therapeutic intervention [5]. The different stages of lung cancer [27] are tabulated below, with explanations. Prompt attention from medical practitioners is needed to improve survival. Identifying the lesion type gives the signal for further treatment. The malignant stage is a red signal, the benign stage counts as a green signal, and an intermediate pre-malignant stage can be introduced as a yellow signal. If a yellow signal appears, doctors should monitor those patients continuously, which helps early treatment. Examples of pre-malignant lesions listed by the World Health Organization (WHO) are atypical adenomatous hyperplasia, diffuse idiopathic pulmonary neuroendocrine cell hyperplasia, squamous metaplasia with dysplasia, and carcinoma in situ [6].
These conditions can advance to harmful stages such as carcinoid tumors, squamous cell carcinoma and adenocarcinoma [7]. Manual evaluation of lung features for a qualitative result is described in [10]. Ground Glass Opacity (GGO) is considered to be pre-malignant, but this shadow cannot be reliably differentiated by radiologists [9]. Awareness of the pre-malignant stage has now grown.
Screening procedures for detection include Computed Tomography (CT) and bronchoscopic techniques [6]. CT visualizes the spongy nodules of the windpipe within a single breath-hold [12]. These techniques can capture lesions of 2-3 mm, which can create confusion in determining whether a lesion is malignant, benign or at the intermediate stage. To date, screening trials have had no significant effect on survival [7]. Machine learning and feature selection now play a prominent role in lung cancer detection. A numerical measurement fed to an accurate machine learning system is preferable to a distorted image obtained by radiologists. In [9], GGO is extracted for study using deep learning, an approach loosely modeled on the human brain. An automated system for viewing histopathological sections of the respiratory tract is explained in [10]. The state of the art in computer-aided lung cancer diagnosis is explained in [11], covering technical concerns and validation. The computer-aided diagnosis (CAD) procedure is regarded as a gold standard in terms of speed and accuracy when compared with other methods [12].
A large share of the world's datasets is available through web-based networks, and from this large body of information, features useful for tasks such as classification and prediction can be extracted [13]. Such feature vectors are classified into target classes using a Random Forest classifier in [14]. Building on existing classifiers, weighted models of SVM, ANN and NB are combined into a sequential diagnostic model for lung cancer patients in [15].
Some of the reported studies on machine learning applications in lung cancer are tabulated in Table 2. This paper introduces an automated platform for lung cancer patients who belong to the pre-malignant stage. The application is built on a combined PCA-SVM model, and the statistical parameters are reported for further analysis.

Table 2. Reported studies on Lung Cancer Detection

Methodology
The strategy followed in this work is shown in Fig. 1.

Data Visualization
Data visualization converts data into graphical images that reveal patterns or trends. It helps analysts consider a wide range of choices at the application level. Python offers various powerful plotting libraries such as pandas and Matplotlib, which create histograms, scatter plots and density curves [28]. The features that are skewed in Table 2 are plotted as combined histograms and density curves in Fig. 2. The histograms show the pattern that each feature follows with respect to the target class. A normal or exponential distribution of a feature supports parametric analysis in terms of maximum value, minimum value, mean, standard deviation, etc.

Data Skewness
Skewness measures the asymmetry of a statistical distribution. A skewed curve leans to either the left or the right [29]. The result can be quantified to analyse the difference between the observed distribution and the normal distribution.
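One common way to quantify this asymmetry is the Fisher-Pearson coefficient, the third central moment divided by the 1.5 power of the second. The sketch below is only an illustration: the paper does not state which skewness estimator was used, and the function name and sample values are hypothetical.

```python
def sample_skewness(values):
    """Fisher-Pearson coefficient of skewness: g1 = m3 / m2**1.5."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant feature: there is no asymmetry to measure
    return m3 / m2 ** 1.5


# Hypothetical feature columns, not taken from the paper's dataset.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

A value near zero suggests a roughly symmetric feature, a positive value a tail to the right, and a negative value a tail to the left.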

Data Preprocessing
Preprocessing is an important prerequisite for any data analysis. It is generally a good idea to prepare the data in a way that exposes its structure to the machine learning algorithms that will use it. Data preprocessing techniques are well known for enhancing the capability of classification systems [31]. They include several steps [30]: assigning numerical values to the target, handling missing numerical values, and normalizing the features. Some unimportant features such as patient ID, Age and Gender are removed to improve the efficiency of the model. The data now contains 3 target classes, 21 attributes and 599 instances. Preprocessing is completed by reducing the dimensionality of the cleaned CSV dataset.
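The three preprocessing steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the rows, the label-to-number mapping, mean imputation and min-max scaling are all hypothetical choices, since the paper does not specify how missing values were filled or how features were normalized.

```python
# Hypothetical rows: (stage label, feature_1, feature_2); None marks a missing value.
rows = [
    ("Benign", 2.0, 10.0),
    ("Pre-malignant", None, 30.0),
    ("Malignant", 6.0, 50.0),
    ("Benign", 4.0, None),
]

# Step 1: assign numerical values to the target classes (mapping is an assumption).
LABELS = {"Benign": 0, "Pre-malignant": 1, "Malignant": 2}
y = [LABELS[label] for label, *_ in rows]


def impute_mean(column):
    """Step 2: replace missing entries with the mean of the observed ones."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max(column):
    """Step 3: rescale a column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]


# Turn the feature part of the rows into columns, clean each column, and rebuild the rows.
columns = list(zip(*[features for _, *features in rows]))
clean_columns = [min_max(impute_mean(list(col))) for col in columns]
X = [list(sample) for sample in zip(*clean_columns)]

print(y)
print(X)
```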

Dimensionality Reduction - Principal Component Analysis (PCA)
PCA addresses feature selection from the viewpoint of reduced dimensionality. Irrelevant features can sharply degrade the performance of prediction models. Reducing the dimensionality of the data [38] by picking the significant features present in the underlying dataset is termed feature selection. The reduced datasets are easier to explore and analyse, since algorithms no longer have to process superfluous variables. PCA is described as a powerful data representation tool in [37]. PCA rests on linear algebra and analyses the correlation between features [39]. In Fig. 3, the PCA feature space transformation is shown for two target classes, Malignant (M) and Benign (B), with the samples spread along the axes. The same transformation applies to the Pre-malignant class, which is displayed along with the conventional classes in the corresponding figure. For distinguishing normal and abnormal prostate cells, PCA is used along with signal processing in [32]. Irrelevant points are removed from the original feature space using PCA in breast cancer detection [33]. PCA preprocesses the data and concentrates on relevant features for training the model [34]. A PCA-based algorithm achieved improved accuracy without overfitting or outliers in [35]. In [36], a PCA-FNN system is proposed to tune parameters for categorizing liver cancer. The axes in Fig. 3 are PCA_1 and PCA_2, the two components used to transform the feature space. The importance of choosing the number of principal components can be seen in the scree plot in Fig. 4. A scree plot is a diagnostic tool for checking whether the principal components are good indicators or not [40]. The components are ordered by the amount of variance they capture: PCA_1 captures the most variance, PCA_2 the second most, and so on.
Fig. 4 shows an elbow at component 2, which indicates that two components, PCA_1 and PCA_2, should be generated for the input data.
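The quantities plotted in a scree plot are the explained-variance ratios of the components. The following NumPy sketch shows how they can be obtained from the eigenvalues of the covariance matrix and how the data is projected onto the leading components. It is an illustration only: the synthetic data, the function name `pca` and the random seed are assumptions, and the paper does not say which PCA implementation was used.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its leading principal components.

    Returns the projected data and the explained-variance ratio of every
    component, sorted from largest to smallest (the values a scree plot shows).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # centre each feature
    cov = np.cov(Xc, rowvar=False)             # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()
    return Xc @ eigvecs[:, :n_components], ratio


# Synthetic stand-in data: 6 correlated features driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

Z, ratio = pca(X, n_components=2)
print(np.round(ratio, 3))   # a sharp drop after the second value is the "elbow"
print(Z.shape)
```

Plotting `ratio` against the component index would give a scree plot comparable in kind to Fig. 4; the columns of `Z` play the role of PCA_1 and PCA_2.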

Classification
Classification is the logical grouping of data. It can make decisions on real and unstructured data, and it plays a major role in health care and data security. Cardiovascular diseases (CVDs) and diabetes are classified using ANN in [43]. A neural network is combined with a fuzzy system to detect asthma in [44]. Diagnosis of epilepsy using signal processing techniques and machine learning is presented in [45]. The likelihood of Alzheimer's disease is explored as a classification task in [46]. In the field of data security, variable indexes are used for classification in [47], and the mobility of big data is enhanced to improve data security in [48]. In [49], data mining is explored to counter malware and DoS attacks. Texts are pruned and classified to ensure security in an automated way [50]. Conventionally, cancer datasets have been classified as Malignant (M) and Benign (B), but this paper highlights an intermediate condition called "Pre-malignant", which allows extra care for such patients. In data security, categories such as copied, transmitted and retrieved data are continuously monitored. Classification involves labeling information to make it easily accessible and identifiable. It can eliminate duplication, which reduces storage and backup costs, and it can greatly decrease processing time.

Support Vector Machine (SVM)
SVM is one of the most popular classification algorithms and offers an elegant way of transforming nonlinear data. The classification strategy of SVM is well explained in [52]. The hyperplane is the central tool of SVM: it separates the data points so that the margin between the two classes is as wide as possible and the data points lie as far from the boundary as possible. The hyperplane thus forms a decision boundary, with the support vectors being the points closest to it on either side. A linear SVM model is used for this lung cancer prediction study. A sample SVM classification is shown in Fig. 7, where class 1 corresponds to Malignant (M) and class -1 to Benign (B) [53]. A chip-based system built around SVM is designed in [54]. Various SVM strategies for accurate breast cancer detection are explained in [55] [56] [57].
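To make the role of the hyperplane concrete, the sketch below fits a linear decision boundary w·x + b = 0 for the two-class case (+1 and -1) by sub-gradient descent on the regularised hinge loss. This is a swapped-in teaching solver, not the training procedure of the paper, which does not describe its optimizer; the toy points, the function names and the hyperparameters `lam`, `lr` and `epochs` are all assumptions. Handling the three classes of this study would need an additional scheme such as one-vs-rest, which is not shown.

```python
import numpy as np


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w and b by sub-gradient descent on the regularised hinge loss.

    Labels must be +1 or -1. Only samples with margin below 1 contribute
    to the hinge term of the gradient.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                      # inside the margin or misclassified
        hinge_w = (y[active][:, None] * X[active]).sum(axis=0) / len(y)
        hinge_b = y[active].sum() / len(y)
        w -= lr * (lam * w - hinge_w)             # regulariser pulls w toward zero
        b -= lr * (-hinge_b)
    return w, b


def predict(X, w, b):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return np.where(np.asarray(X, dtype=float) @ w + b >= 0, 1, -1)


# Hypothetical, linearly separable toy points (e.g. two PCA components per sample).
X_toy = np.array([[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]], dtype=float)
y_toy = np.array([1, 1, 1, -1, -1, -1])

w, b = train_linear_svm(X_toy, y_toy)
print(predict(X_toy, w, b))
print(2 / np.linalg.norm(w))   # width of the margin between the two classes
```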

Results - Performance Measurement
Results are obtained through the confusion matrix and the classification report. A confusion matrix is generated from the classification outcomes [58]. Parameters such as accuracy, error rate, true positive rate (TPR), false positive rate (FPR), true negative rate (TNR) and false negative rate (FNR) can be calculated from this matrix. A highly accurate model is essential for detection systems. GPS-based map matching systems [59] and processors with frequency clocks [60] likewise call accuracy into question and demand improvement, so data scientists try to minimize the error rate as far as possible. The accuracy score and error rate of this model are given as Eq. (1) and Eq. (2): accuracy is the proportion of correctly classified samples, and the error rate is one minus the accuracy. Commonly, cancer prediction is evaluated with a 2 x 2 matrix. Here, it is extended to a 3 x 3 matrix, shown in Fig. 8 along with the classification report. The classification report gives the precision, recall, f1-score and support of this classification system. Parameter definitions are given in Table 4.
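For a three-class problem, the same measures follow directly from the 3 x 3 confusion matrix: accuracy is the sum of the diagonal divided by the total count, and precision, recall and f1-score are computed per class. The sketch below is a minimal illustration; the label abbreviation "P" for Pre-malignant and the example predictions are hypothetical and are not results from the paper.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_report(matrix, labels):
    """Precision, recall, f1-score and support for each class."""
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # true class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp   # predicted class i, true otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report


# Hypothetical labels: B = Benign, P = Pre-malignant, M = Malignant.
labels = ["B", "P", "M"]
y_true = ["B", "B", "B", "P", "P", "P", "M", "M", "M", "M"]
y_pred = ["B", "B", "P", "P", "P", "M", "M", "M", "M", "P"]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(cm)
report = per_class_report(cm, labels)
print(cm)
print(acc, 1 - acc)   # accuracy and error rate
print(report)
```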

Conclusion and Future Work
This investigation draws attention to the significance of the pre-malignant stage in the early detection of lung carcinoma. The real, inconsistent data is processed and cleansed. The SVM classifier predicts the classes with high accuracy and clear separation. The confusion matrix obtained effectively labels and identifies the information hidden in the dataset.
Here, performance is expressed through the classification report, which is based on the confusion matrix and its calculations. The accuracy score and error rate of this model are derived from a 3 x 3 matrix, in contrast to the conventional 2 x 2 matrix. The classification report gives the precision, recall, f1-score and support of this classification system. The work leaves scope for extension to the optimization level, which could further improve accuracy in machine learning diagnostics.