Variant component principal linear reduction for prediction of hypothyroid disease using machine learning

With rapid technological growth, people have adopted different food habits and lifestyles, which can lead to improper functioning of the body's organs. One major problem associated with these changes is hypothyroidism. Predicting hypothyroidism remains challenging because its symptoms are uncertain. Against this background, machine learning can be applied in health care to predict disease from a patient's past history. This paper focuses on predicting the presence of hypothyroidism from patients' medical parameters. The hypothyroid patient dataset, taken from the UCI Machine Learning Repository, has 24 columns and 3163 unique patient records, and is used for experimentation with the following contributions. First, the dataset is subjected to data processing and exploratory analysis. Second, the raw dataset is fitted with different classifier algorithms to detect the presence of hypothyroidism, and the performance metrics are examined before and after feature scaling. Third, the data is reduced with principal component analysis (PCA) to 5, 7 and 10 components and fitted with the same classifiers, and the performance metrics are again examined before and after feature scaling. Fourth, the data is reduced with linear discriminant analysis (LDA) to 5, 7 and 10 components and evaluated in the same way. Experimental results show that the Kernel Support Vector Machine classifier achieves an accuracy of 99.52% on each of the 10-, 7- and 5-component PCA-reduced datasets. Similarly, the Logistic Regression, Kernel Support Vector Machine and Gaussian Naïve Bayes classifiers each achieve an accuracy of 99.52% on the 10-, 7- and 5-component LDA-reduced datasets.


Introduction
Hypothyroid disease is a condition in which the thyroid gland underperforms and fails to secrete the required quantity of important hormones in the body. It is a difficult disease associated with abnormal levels of thyroid-stimulating hormone (TSH) in the body. It can also be caused by a structural problem in the development of the thyroid gland itself. The condition can be stabilized when the body produces antibodies that counter the unwanted hormones. In recent years, machine learning has been used in health care applications for the prediction of diseases.

Literature review
One study reports an experiment on 3710 instances and 29 features of thyroid patients. The prediction of hypothyroid disease is treated as a two-class problem with a positive and a negative class. Performance was evaluated by using different combinations of clinical values to classify the thyroid dataset [1]. Another work assumes that hypothyroidism can be predicted with logistic regression and decision trees to improve prediction performance [2]. A comparative study of thyroid disease diagnosis used Support Vector Machine, Multiple Linear Regression, Naïve Bayes and Decision Trees; the results were compared, and Decision Trees were found to be useful in supporting the diagnosis of thyroid disease [3]. An attempt has also been made to analyze Logistic Regression and Support Vector Machine for multiclass classification of a thyroid dataset [4].
Artificial neural network models can be used to model the functioning of the thyroid gland, with various input levels fed into the network for the prediction of hypothyroidism. The outcome of thyroid disease prediction can be considerably improved by including the biological processes that relate the thyroid gland to the disease [5]. Simulation results show that One-Against-All Support Vector Machines (OAASVM) are superior to One-Against-One Support Vector Machines with polynomial kernels. The accuracy of OAASVM is also higher than that of AdaBoost and Decision Tree classifiers on hypothyroid disease datasets from the UCI Machine Learning Repository [6]. A computer-aided diagnostic (CAD) system comprising three stages has also been proposed. Focusing on dimension reduction, the first stage applies PCA to construct the most discriminative new feature set [7]. The system then switches to the second stage, whose target is model construction: an ELM classifier is explored to train an optimal predictive model whose parameters are optimized [8].
In another research work, various classification models are used to classify thyroid disease based on parameters such as TSH, T4U and goiter [9]. A further study aims to determine whether hyperthyroidism, hypothyroidism and the thyroid's participation in hormones can be good predictors of the final laboratory results, and to examine whether the proposed ensemble approach can reach accuracy similar to that of single classification algorithms [10]. An enhanced fuzzy k-nearest neighbor (FKNN) classifier based computer-aided diagnostic system for thyroid disease has been presented [11]. A framework called Thyroid Disease Types Diagnostics has been proposed to assist physicians during the diagnostic process of thyroid diseases in a structured and transparent manner [12]. Another study predicts thyroid disease using different classification techniques and examines the correlation of TSH, T3 and T4 with hyperthyroidism and hypothyroidism, both overall and with respect to gender [13]. The performance of classification algorithms has been analyzed on breast cancer and hypothyroid datasets [14]. A further model categorizes the hormone secretion activity of the thyroid gland with the help of feature reduction and classification methods [15].

Our contributions
The main aim of this paper is to analyze how well classification algorithms predict and maintain their accuracy as the number of components changes under principal component analysis (PCA) and linear discriminant analysis (LDA). The overall architecture of this work is shown in Figure 1. The presence of hypothyroidism is predicted using machine learning classification algorithms with the following contributions.
(i) First, the hypothyroid dataset from the UCI Machine Learning Repository is subjected to data processing and exploratory analysis. (ii) Second, the raw dataset is fitted with different classifier algorithms to detect the presence of hypothyroidism, and the performance metrics are examined before and after feature scaling. (iii) Third, the data is reduced with PCA to 5, 7 and 10 components and fitted with the same classifiers, and the performance metrics are examined before and after feature scaling. (iv) Fourth, the data is reduced with LDA to 5, 7 and 10 components and fitted with the same classifiers, and the performance metrics are examined before and after feature scaling.
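The experimental loop described in contributions (ii) and (iii) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: it assumes scikit-learn, replaces the hypothyroid data with a synthetic two-class dataset from `make_classification` (3163 rows; the count of 23 numeric features is an assumption), and uses an assumed 80/20 train/test split. The helper name `pca_accuracy` and the classifier settings are hypothetical. The real dataset would also need its categorical and missing values handled first.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the hypothyroid data (assumption, not the real records).
X, y = make_classification(n_samples=3163, n_features=23, n_informative=8,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Factories so that every run starts from a fresh, unfitted classifier.
classifiers = {
    "Logistic Regression": lambda: LogisticRegression(max_iter=1000),
    "Kernel SVM": lambda: SVC(kernel="rbf"),
    "Gaussian Naive Bayes": lambda: GaussianNB(),
}

def pca_accuracy(n_components, make_clf, scale):
    """Fit (optional scaler) -> PCA -> classifier and return test accuracy."""
    steps = [StandardScaler()] if scale else []
    steps += [PCA(n_components=n_components), make_clf()]
    model = make_pipeline(*steps)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Accuracy for every (components, classifier, scaled?) combination.
results = {}
for n in (5, 7, 10):
    for name, make_clf in classifiers.items():
        for scale in (False, True):
            results[(n, name, scale)] = pca_accuracy(n, make_clf, scale)

for key, acc in sorted(results.items()):
    print(key, round(acc, 4))
```

Placing the scaler and PCA inside one pipeline means both are fitted on the training split only, which keeps the before/after scaling comparison free of test-set leakage. The accuracies printed for synthetic data are not comparable to the figures reported in this paper.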

Dataset Exploratory Analysis
The implementation is written in Python using the Spyder editor from Anaconda Navigator. The dataset information is shown in Figure 2. The feature analysis of the hypothyroid dataset is depicted as density plots and a correlation matrix in Figures 3 to 6.
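As a rough illustration of this kind of exploratory analysis, the sketch below computes a correlation matrix and a histogram-based density estimate with pandas and NumPy. The data is synthetic and the column names (`age`, `TSH`, `T4U`) and their distributions are assumptions chosen only for illustration; plotting is omitted so the snippet runs without a display.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic columns (assumption): they only mimic the shape of a numeric table.
tsh = rng.lognormal(mean=0.5, sigma=1.0, size=n)
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "TSH": tsh,
    "T4U": 1.0 - 0.05 * np.log(tsh) + rng.normal(0.0, 0.1, size=n),
})

# Pairwise Pearson correlation between the numeric features.
corr = df.corr()
print(corr.round(2))

# Histogram-based density estimate for one feature; the bar areas sum to 1.
density, edges = np.histogram(df["TSH"], bins=20, density=True)
print(float(np.sum(density * np.diff(edges))))
```

With matplotlib installed, `df.plot.kde()` or a heatmap of `corr` would give plots comparable in kind to the density plots and correlation matrix referred to above.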

Results and discussions
The hypothyroid dataset from the UCI Machine Learning Repository is subjected to data processing and exploratory analysis. The raw dataset is fitted with different classifier algorithms to detect the presence of hypothyroidism, and the performance metrics before and after feature scaling are shown in Figures 7 and 8.
The data is then reduced with PCA to 5, 7 and 10 components and fitted with the same classifiers, and the performance metrics are examined before and after feature scaling. For the 10-component PCA, the cumulative variance versus explained variance is shown in Figures 9 and 10, and the classifier performance before and after feature scaling is shown in Figures 11 and 12. For the 7-component PCA, the cumulative variance versus explained variance is shown in Figures 13 and 14, and the classifier performance before and after feature scaling is shown in Figures 15 and 16. For the 5-component PCA, the cumulative variance versus explained variance is shown in Figures 17 and 18, and the classifier performance before and after feature scaling is shown in Figures 19 and 20.
Finally, the data is reduced with LDA to 5, 7 and 10 components and fitted with the same classifiers, and the performance metrics are examined before and after feature scaling. The classifier performance for the 10-component LDA before and after feature scaling is shown in Figures 21 and 22.
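The two reduction steps discussed above can be sketched as follows, again on synthetic stand-in data and assuming scikit-learn. The first part computes the per-component explained variance ratio and its cumulative sum for 5, 7 and 10 PCA components. The second part fits LDA. scikit-learn limits LDA to at most (number of classes - 1) components, so with the two-class target assumed here the sketch requests a single component. It does not attempt to reproduce the 5-, 7- and 10-component LDA settings reported in this paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic two-class stand-in (assumption: 3163 rows, 23 numeric features).
X, y = make_classification(n_samples=3163, n_features=23, n_informative=8,
                           random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# PCA: explained variance per component and its running total.
cumulative = {}
for n in (5, 7, 10):
    pca = PCA(n_components=n, svd_solver="full").fit(X_scaled)
    cumulative[n] = np.cumsum(pca.explained_variance_ratio_)
    print(n, "components retain", round(float(cumulative[n][-1]), 4),
          "of the total variance")

# LDA: supervised projection; one discriminant axis for a two-class target.
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_scaled, y)
print(X_lda.shape)
```

The last entry of each cumulative array is the share of total variance retained by that PCA setting, which is the quantity a cumulative variance plot displays.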

Conclusion
This paper analyzes how well classification algorithms predict and maintain their accuracy as the number of components changes under principal component analysis and linear discriminant analysis. Experimental results show that the Kernel Support Vector Machine classifier achieves an accuracy of 99.52% on each of the 10-, 7- and 5-component PCA-reduced datasets. Similarly, the Logistic Regression, Kernel Support Vector Machine and Gaussian Naïve Bayes classifiers each achieve an accuracy of 99.52% on the 10-, 7- and 5-component LDA-reduced datasets.