Heart disease prediction using machine learning techniques

Machine learning (ML), one of the most prominent applications of artificial intelligence, is widely used in research. In this paper, machine learning is used to detect whether a person has heart disease. Cardiovascular diseases (CVDs) affect many people around the world and are often fatal. Machine learning can detect whether a person is suffering from a cardiovascular disease by considering attributes such as chest pain, cholesterol level and age. Classification algorithms based on supervised learning, a type of machine learning, can make the diagnosis of cardiovascular diseases easier. Two supervised machine learning algorithms, K-Nearest Neighbor (K-NN) and Random Forest, are used in this paper to separate people who have heart disease from people who do not. The prediction accuracy obtained by K-NN is 86.885%, and the prediction accuracy obtained by Random Forest is 81.967%.


Keywords
Heart Disease; Machine Learning; K Nearest Neighbor (K-NN); Random Forest

Introduction
The human body is made up of various organs, each with its own function. The heart is the organ that pumps blood throughout the body, and if it fails to do so the consequences can be fatal. Heart disease is one of the main causes of mortality today [1]. It is therefore necessary to keep the cardiovascular system, like every other system in the human body, healthy. Unfortunately, people all around the world suffer from cardiovascular diseases. Any technology that helps diagnose these diseases before much damage is done can save people's money and, more importantly, their lives. Data mining techniques can be useful in predicting heart disease. Data mining means extracting knowledge from large amounts of data [3]. Predictive models can be built by finding previously unknown patterns and trends in databases and using the information obtained [2]. Machine learning is a technology that can help diagnose heart disease before much damage happens to a person. As an emerging field in science and technology, machine learning can classify whether a person might be suffering from heart disease or not.
IOP Publishing, doi:10.1088/1757-899X/1022/1/012046

Literature Review
Several research papers have proposed methods to predict cardiovascular disease using supervised machine learning algorithms. One survey analyzes the performance of various models based on machine learning algorithms and techniques [4]. Another work creates a Graphical User Interface (GUI) that predicts whether a person is suffering from heart disease, using a Weighted Association rule based Classifier [5]. A further paper presents a new approach to heart disease prediction based on the coactive neuro-fuzzy inference system (CANFIS) [6]. A summary of commonly used techniques for heart disease prediction and their complexities is given in [7]. A classifier approach for heart disease detection, showing how Naive Bayes can be used for classification, is presented in [8]. Finally, a survey covers different papers in which one or more data mining algorithms have been used for heart disease prediction [9].

K-Nearest Neighbor (K-NN)
In the K-NN algorithm, a data point whose class is not known is taken, and the number of neighbors, k, is defined. The k training points with the smallest Euclidean distance to the selected data point are then chosen as its neighbors. The selected data point is assigned to the category that holds the majority among these k neighbors.
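The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiment: the function names and the two-feature toy data are hypothetical, and ties between classes are resolved by whichever label is encountered first.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair the distance from each training point to the query with its label.
    distances = [
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k training points with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The predicted class is the most common label among those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per person, label 1 = disease, 0 = no disease.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_labels = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_points, train_labels, query=(2.9, 3.1), k=3)
print(prediction)
```

Because the distance is computed on raw feature values, attributes with large numeric ranges dominate the result unless the features are scaled first.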

Random Forest
Random Forest works by constructing multiple decision trees from the training data. Each tree predicts a class, and in the case of classification the class predicted by the greatest number of trees is taken as the final result. The number of trees to create must be specified in advance. Random Forest is a bootstrap aggregating (bagging) technique, which is used to decrease the variance of the results.
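The bagging and majority-vote mechanism can be illustrated with a simplified sketch. To keep it short, each "tree" below is a one-split decision stump, a deliberate stand-in for a full decision tree, and each stump sees a bootstrap sample of the rows and a random subset of the features. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_indices):
    """Fit a one-split tree: choose the (feature, threshold, label above) with fewest errors."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            for above_label in (0, 1):
                below_label = 1 - above_label
                errors = sum(
                    (above_label if row[f] > threshold else below_label) != label
                    for row, label in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above_label)
    _, f, threshold, above_label = best
    return f, threshold, above_label


def predict_stump(stump, row):
    """Predict 0 or 1 for one row using a fitted stump."""
    f, threshold, above_label = stump
    return above_label if row[f] > threshold else 1 - above_label


def fit_forest(rows, labels, n_trees, rng):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    n_rows, n_features = len(rows), len(rows[0])
    subset_size = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw n_rows row indices with replacement.
        sample = [rng.randrange(n_rows) for _ in range(n_rows)]
        features = rng.sample(range(n_features), subset_size)
        forest.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], features)
        )
    return forest


def predict_forest(forest, row):
    """Final class is the class predicted by the largest number of trees."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, label 1 = disease, 0 = no disease.
rows = [(1.0, 10.0), (1.5, 12.0), (2.0, 11.0), (6.0, 30.0), (6.5, 28.0), (7.0, 31.0)]
labels = [0, 0, 0, 1, 1, 1]

forest = fit_forest(rows, labels, n_trees=10, rng=random.Random(0))
print(predict_forest(forest, (0.5, 9.0)), predict_forest(forest, (8.0, 35.0)))
```

Averaging the votes of many trees trained on different resamples is what reduces the variance of a single tree's prediction.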

Experimental Setup
The first step of the setup is to obtain a data set containing the features of people who are and are not suffering from heart disease, together with a label stating whether each person has the disease. The data set used in this experiment is taken from Kaggle (https://www.kaggle.com/ronitf/heart-disease-uci). The programming language used for the experiment is Python. The thirteen attributes available in the data set are used; their descriptions are available on Kaggle.
The next step is to analyze the data. The info() function of the Pandas library is used to obtain a concise summary of the DataFrame, and the describe() function is used to retrieve statistical information such as the mean of each attribute. The attribute named target has the value 1 if the patient is suffering from heart disease and 0 otherwise. The data set is then checked for class balance using countplot, provided by the Seaborn library, on the target attribute. The plot shows that the data is reasonably balanced. countplot can also be applied to other attributes of the data set, such as sex, which has the values 1 (male) and 0 (female), and cp (chest pain), which gives the type of chest pain as a value from 0 to 3.
After the balance check, the correlation between the attributes is computed and plotted as a heat map using the Seaborn library. The heat map shows that attributes such as cp (chest pain) and thalach (maximum heart rate achieved) are positively correlated with the target attribute.
Next, the categorical variables sex, cp, fbs, restecg, exang, slope, ca and thal are converted into dummy variables using the get_dummies method of the Pandas library. After the dummy variables are created, the columns age, trestbps, chol, thalach and oldpeak are standardized, because their values differ widely in magnitude and units. This can be done using the Scikit-learn library in Python.
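The preparation steps can be sketched as follows. The small DataFrame is invented for illustration and contains only a handful of the columns; in the experiment the data would be loaded from the Kaggle file instead. The use of StandardScaler is an assumption, since the text only states that Scikit-learn was used for scaling.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical rows with a few of the data set's columns (values are made up).
df = pd.DataFrame(
    {
        "age": [63.0, 37.0, 41.0, 56.0, 57.0, 67.0],
        "chol": [233.0, 250.0, 204.0, 236.0, 354.0, 286.0],
        "sex": [1, 1, 0, 1, 0, 1],
        "cp": [3, 2, 1, 1, 0, 0],
        "target": [1, 1, 1, 1, 0, 0],
    }
)

# Summary statistics such as the mean of each attribute.
summary = df.describe()

# Class balance: number of rows per target value (what countplot visualizes).
class_counts = df["target"].value_counts()

# Pairwise correlations (what the heat map visualizes).
correlations = df.corr()

# Convert the categorical columns into dummy (one-hot) variables.
categorical_columns = ["sex", "cp"]
encoded = pd.get_dummies(df, columns=categorical_columns)

# Standardize the continuous columns to zero mean and unit variance.
continuous_columns = ["age", "chol"]
scaler = StandardScaler()
encoded[continuous_columns] = scaler.fit_transform(encoded[continuous_columns])

print(encoded.columns.tolist())
```

Each categorical column is replaced by one indicator column per category (for example cp_0 to cp_3), while the scaled columns keep their names.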
The data set is divided into two parts: training data, which is 80% of the whole data set, and testing data, which is the remaining 20%. After the data is prepared, each algorithm is applied and its confusion matrix is computed. The results are reported in terms of accuracy, which is calculated from the confusion matrix.
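The split and the accuracy calculation can be sketched in plain Python. The helper names are hypothetical, and so are the numbers: the row count of 303 is an assumption about the size of the Kaggle file, and the four confusion-matrix cells are invented. Only their diagonal sum was chosen so that 53 correct predictions out of 61 reproduce the 86.885% figure quoted in the abstract.

```python
import random


def split_indices(n_rows, test_fraction, seed):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def accuracy(matrix):
    """Accuracy = correct predictions (the diagonal) / all predictions."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Assumed data set size of 303 rows; 20% of them go to the test set.
train_idx, test_idx = split_indices(303, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))

# Hypothetical confusion matrix for a 61-sample test set: [[TN, FP], [FN, TP]].
example_matrix = [[24, 5], [3, 29]]
example_accuracy = accuracy(example_matrix)
print(f"{example_accuracy:.3%}")
```

In the same way, 50 correct predictions out of 61 would give 81.967%.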

Results
After applying the algorithms, the results obtained are as follows:

K-Nearest Neighbor (K-NN)
The value of k is set to 12, as 12 was one of the values that gave the highest accuracy. The accuracy calculated from the resulting confusion matrix is 86.885%.
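One way to find the values of k that give the highest accuracy is to sweep over a range of candidates, as in the sketch below. It uses synthetic data from make_classification as a stand-in for the prepared heart-disease features, and the candidate range of 1 to 20 is an assumption, since the values that were tried are not stated.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared features and target (an assumption).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Test accuracy for each candidate k (range chosen for illustration).
scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)

# Several values of k can share the highest accuracy.
best_accuracy = max(scores.values())
best_ks = [k for k, score in scores.items() if score == best_accuracy]
print(best_ks, best_accuracy)
```

Choosing k by test-set accuracy, as done here, can make the reported accuracy optimistic; a separate validation set or cross-validation is a common alternative.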

Random Forest
The number of trees is set to 10. The accuracy calculated from the resulting confusion matrix is 81.967%.
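A configuration with 10 trees can be sketched with Scikit-learn's RandomForestClassifier. The data below is synthetic and the random_state values are arbitrary, so the accuracy it prints is unrelated to the figure reported above; the sketch only shows how the confusion matrix and the accuracy are related.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target (an assumption).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# n_estimators is the number of trees in the forest.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

# Rows are actual classes, columns are predicted classes.
matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(matrix)
print(accuracy)
```

The accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test samples.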

Conclusion
After applying the two algorithms, it can be said that machine learning is proving valuable in predicting heart disease, which is one of the most prominent health problems in today's society. As more work is done in the field of machine learning, new methods may soon make it even more helpful in healthcare. The algorithms used in this experiment performed well on the available attributes, with K-NN reaching an accuracy of 86.885% and Random Forest 81.967%. It can be concluded that machine learning, by predicting heart disease, can help reduce the physical and mental damage done to a person.