Heart Failure Prediction with Machine Learning: A Comparative Study

Heart failure is a worldwide health problem affecting more than 550,000 people every year. Better prediction of this disease is one of the key approaches to decreasing its impact. Both linear and machine learning models have been used to predict heart failure from various inputs, e.g., clinical features. In this paper, we present a comparative study of 18 popular machine learning models for heart failure prediction, combined with z-score or min-max normalization and the Synthetic Minority Oversampling Technique (SMOTE) for the class imbalance often seen in this problem. Our results demonstrate the superiority of z-score normalization and the benefit of SMOTE for heart failure prediction.


Introduction
Heart failure is a serious problem with a huge impact on people's lives. With the accelerated pace of life, increased portion sizes, and physical inactivity, many people neglect their health. Combined with environmental deterioration, these factors may make heart failure more and more common in the future. If people do not pay attention to it, heart failure can ultimately cause death.
In past years, different researchers have used different methods to collect and analyze data with the aim of predicting heart failure. These data include electronic health record (EHR) data of patients with heart failure from hospitals in different countries, the Cleveland heart disease dataset, biomedical datasets from UCI, etc. Based on these data, various methods have been applied, e.g., predicting patient survival with machine learning classifiers, using supervised deep learning and machine learning algorithms, training a boosted decision tree algorithm, utilizing a machine intelligence-based statistical model with random under-sampling and deep neural network models, and using a bioinformatic explainable deep neural network (BioExpDNN).
Although there is some work based on machine learning, comprehensive comparisons between different models have not been made, nor have newer models such as Deep Forest been attempted. This paper fills these gaps. We use F1 score and accuracy as the evaluation metrics: the F1 score reflects the robustness of a model under class imbalance, and the accuracy reflects the overall ratio of correct predictions. With min-max normalization across the 18 models, applying SMOTE (Synthetic Minority Oversampling Technique) yields both higher accuracy and higher F1 score than not applying it; the same holds with z-score normalization. Overall, the z-score normalization method is better than the min-max normalization method.

Related Work
In this section, we give a short review of recent related works.
In [1], the authors used data from 299 heart failure patients collected in 2015. The data have 13 features, for example high blood pressure, sex, and smoking. The authors utilized several machine learning classifiers to predict patient survival and ranked the features corresponding to the most important risk factors. They found serum creatinine and ejection fraction to be the most important factors for predicting survival.
In [2], the authors used data from electronic health records (EHR) and applied several machine learning and deep learning models. The results show that novel machine learning models can improve prediction accuracy.
In [3], the authors used an EHR dataset of 26,575 heart failure patients from 2018. The results show that patient age, creatinine, body mass index, and blood pressure levels were significant factors in predicting one-year mortality among heart failure patients.
In [4], the authors collected data from 5,822 hospitalized and ambulatory patients with heart failure, including eight variables, for example diastolic blood pressure, white blood cell count, platelets, albumin, and red blood cell distribution width. By training a boosted decision tree algorithm, they developed a model that identifies subsets of these inpatient and outpatient heart failure patients at very high or very low risk of mortality. Using machine learning and readily available variables, the authors generated and validated a risk score for death in heart failure patients that was more accurate than the other risk scores it was compared against.
In [6], the authors used the Cleveland heart disease dataset and applied a machine intelligence-based statistical model, a random under-sampling method, and a DNN model. They found that the resulting predictive model is helpful for doctors in diagnosing heart failure.
In [7], the authors used three biomedical datasets from UCI and a bioinformatic explainable deep neural network. The results showed classification accuracies of 92.59%, 100%, and 78.9% on the CDS, CCBRDS, and HFCRDS datasets, respectively.
In [8], the authors used the medical records of 299 heart failure patients, containing clinical and lifestyle data, and applied different machine learning algorithms. The results showed that machine learning algorithms are useful and effective tools for classifying the medical records of heart failure patients.
In [9], the authors used the EHR data of 299 heart failure patients recorded at the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad from April to December 2015. They used oversampling to address the class imbalance problem by adding synthetic samples to the minority class, and random forest (RF) for classification. They concluded that SMOTE is the most suitable way to oversample EHR data for predicting mortality in heart failure patients.
In [10], the authors analyzed data of heart failure patients from the Faisalabad Institute of Cardiology and the Faisalabad United Hospital using a multilayer perceptron neural network. Their model achieved 88% accuracy on the heart failure dataset, outperforming previous studies in predicting heart failure.

Dataset Description
We utilize a publicly available dataset: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data. The dataset contains the following input features and prediction target. The target we want to predict is the death event (DEATH_EVENT). The input features are age, anaemia, high blood pressure, creatinine phosphokinase, diabetes, ejection fraction, platelets, serum creatinine, serum sodium, sex, smoking, and time. Since the numbers of positive and negative samples differ, we later apply the SMOTE method to obtain a uniform class distribution.
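Separating the features from the target is straightforward with pandas. The sketch below uses a three-row toy DataFrame with the same column names as the Kaggle file (the real dataset ships as heart_failure_clinical_records_dataset.csv and has 299 rows); the numeric values here are illustrative only.

```python
import pandas as pd

# Toy rows mimicking the schema of the Kaggle heart-failure dataset.
df = pd.DataFrame({
    "age": [75.0, 55.0, 65.0],
    "anaemia": [0, 0, 1],
    "creatinine_phosphokinase": [582, 7861, 146],
    "diabetes": [0, 0, 0],
    "ejection_fraction": [20, 38, 20],
    "high_blood_pressure": [1, 0, 0],
    "platelets": [265000.0, 263358.03, 162000.0],
    "serum_creatinine": [1.9, 1.1, 1.3],
    "serum_sodium": [130, 136, 129],
    "sex": [1, 1, 1],
    "smoking": [0, 0, 1],
    "time": [4, 6, 7],
    "DEATH_EVENT": [1, 1, 1],
})

X = df.drop(columns="DEATH_EVENT")  # the 12 clinical input features
y = df["DEATH_EVENT"]               # binary prediction target
print(X.shape, y.shape)
```

For the real dataset, the same two lines apply after `pd.read_csv("heart_failure_clinical_records_dataset.csv")`.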

Models
In this section, we introduce all the machine learning models used in this paper. With the development of big data and artificial intelligence, machine learning has succeeded in a wide range of problems. For heart failure prediction, there are also some relevant studies, as introduced in the related work section; however, a comprehensive comparison between different machine learning models has been lacking.
Logistic Regression (LR): Logistic regression is a classification model often used for dichotomous problems. It is a machine learning algorithm for binary (0 or 1) problems that estimates the probability of an outcome.
K Nearest Neighbour (kNN): kNN is one of the simplest algorithms in data analysis; it is a nonparametric statistical method for classification and regression.
Naive Bayes (NB): The Naive Bayes classifier is a family of simple probabilistic classifiers based on Bayes' theorem under a strong (naive) independence assumption between features.
Decision Tree (DT): Decision Tree is a tree structure which is binary or non-binary. Each non-leaf node represents a test on a feature attribute, each branch represents the output of this feature attribute on a range of values, and each leaf node stores a category.
Linear SVM (SVM-Linear) and RBF SVM (SVM-RBF): SVM is a supervised learning method widely used in statistical classification and regression analysis. SVMs belong to the family of generalized linear classifiers and are characterized by simultaneously minimizing empirical error and maximizing the geometric margin; hence, SVM is also called a maximum-margin classifier.
Multi-Layer Perceptron (MLP): MLP is a kind of feed-forward artificial neural network (ANN). The term MLP is ambiguous: it is sometimes loosely applied to any feed-forward ANN, and sometimes strictly refers to a network consisting of multiple layers of perceptrons.
Ridge Classifier (RC): The ordinary least squares solution of linear regression is valid only when the design matrix is nonsingular, but in many cases the matrix is not full rank; ridge regression solves this problem by adding an L2 penalty to the coefficients.
Random Forest (RF): Random forest is a classifier consisting of many decision trees; its output class is the mode of the classes output by the individual trees.
Quadratic Disc. Analysis (QDA): Quadratic Discriminant Analysis extends Linear Discriminant Analysis by allowing each class its own covariance matrix, yielding quadratic decision boundaries.
AdaBoost (AdaBoost): AdaBoost is an iterative algorithm whose core idea is to train different weak classifiers on the same training set and then combine these weak classifiers into a strong classifier. AdaBoost can handle both classification and regression problems.
Gradient Boosting (GB): Gradient boosting builds an ensemble of weak learners sequentially, with each new learner fitted to the residual errors of the current ensemble.
Linear Disc. Analysis (LDA): LDA is a supervised dimensionality reduction technique; that is, each sample in its dataset comes with a class label.
Extra Trees (ET): Extra Trees (Extremely Randomized Trees) is similar to Random Forest and is built from many decision trees, but with additional randomization in the choice of split thresholds.
Extreme Gradient Boosting (XGBoost): XGBoost is an open-source library providing an efficient and effective implementation of gradient boosting.
Light Gradient Boosting (LGB): Light Gradient Boosting Machine is a framework implementing the GBDT algorithm; it supports efficient parallel training, faster training speed, lower memory consumption, better accuracy, distributed computation, and fast processing of massive data.
CatBoost Classifier (CatBoost): CatBoost is an open-source gradient boosting library based on decision trees that provides off-the-shelf handling of categorical features and supports Python and R.
Deep Forest (DF): Deep Forest integrates traditional forests in both depth and breadth: the integration in depth (a cascade of forest layers) improves classification capability, while the integration in breadth reflects differences in the input representation.
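Most of the models above share scikit-learn's common fit/predict interface, which makes a uniform comparison easy to set up. The sketch below instantiates the scikit-learn subset under default or near-default hyperparameters (the paper does not report its exact settings, so these are assumptions); XGBoost, LightGBM, CatBoost, and Deep Forest come from separate packages (xgboost, lightgbm, catboost, deep-forest) and are omitted here.

```python
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.neural_network import MLPClassifier

# Name -> estimator map; every entry exposes .fit(X, y) / .predict(X).
models = {
    "LR": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM-Linear": SVC(kernel="linear"),
    "SVM-RBF": SVC(kernel="rbf"),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "RC": RidgeClassifier(),
    "RF": RandomForestClassifier(random_state=0),
    "QDA": QuadraticDiscriminantAnalysis(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
    "ET": ExtraTreesClassifier(random_state=0),
}
print(len(models))
```

With such a map, evaluating all models is a single loop over `models.items()` calling `fit` on the training set and scoring on the test set.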

Experiments
In this section, we describe the experimental details. First, we split the dataset: 10% of the dataset, namely 30 sample points, serves as the test set, and the remaining 90% as the training set. We then compare the following settings: min-max normalization without SMOTE, min-max normalization with SMOTE, z-score normalization without SMOTE, and z-score normalization with SMOTE. The accuracy of the different machine learning models is shown in Table 1, and the F1 score in Table 2. We choose F1 score and accuracy as our evaluation metrics: the F1 score reflects the robustness of a model, and the accuracy reflects the overall ratio of correct predictions. Under min-max normalization across the 18 models, both accuracy and F1 score are higher with SMOTE than without it; the same holds under z-score normalization. Overall, the z-score normalization method is better than the min-max normalization method.
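The two normalizations and the oversampling step can be sketched in plain numpy. The data below are synthetic and the oversampling is a simplified SMOTE-style interpolation between random minority pairs (real SMOTE interpolates toward k-nearest minority neighbors, as in the imbalanced-learn package); this is a sketch of the idea, not the paper's exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(50.0, 10.0, size=(100, 3))  # synthetic feature matrix
y = np.array([0] * 80 + [1] * 20)          # imbalanced labels (80 vs 20)

# z-score normalization: zero mean, unit variance per feature.
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

# min-max normalization: rescale each feature into [0, 1].
X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Simplified SMOTE-style oversampling: synthesize minority samples by
# interpolating between random pairs of minority-class points.
minority = X_z[y == 1]
need = int((y == 0).sum() - (y == 1).sum())
i = rng.integers(0, len(minority), size=need)
j = rng.integers(0, len(minority), size=need)
lam = rng.random((need, 1))
synthetic = minority[i] + lam * (minority[j] - minority[i])

X_bal = np.vstack([X_z, synthetic])
y_bal = np.concatenate([y, np.ones(need, dtype=int)])
print(np.bincount(y_bal))  # both classes now have equal counts
```

In the actual experiments, normalization statistics should be computed on the training split only, and oversampling applied only to the training data, so the test set stays untouched.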
As the accuracy values show, SMOTE improved the accuracy of 10 of the 18 models under min-max normalization, and the F1 score of 13 of the 18 models. Under z-score normalization, SMOTE improved the accuracy of 9 of the 17 models and the F1 score of 11 of the 17 models. Thus, the SMOTE approach is helpful for most models used in this study.
We further evaluate the feature importance of these machine learning models, especially the tree-based ones. The feature importance graphs of the Random Forest, AdaBoost, Light Gradient Boosting, Extra Trees, XGBoost, and CatBoost models are shown in Figures 3-8. As these figures show, feature 'time' is the most important input feature and feature 'serum_creatinine' the second most important across these models. However, different models may assign different importance scores to the same feature.
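For tree-based scikit-learn models, such importance scores come from the fitted estimator's `feature_importances_` attribute (impurity-based importances that sum to 1). Below is a minimal sketch on synthetic data; the column names are illustrative stand-ins for the clinical features, and the label is constructed to depend mostly on the first column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))  # 4 synthetic feature columns
# Label depends strongly on column 0, weakly on column 1, not on 2-3.
y = (X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.normal(size=n) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

names = ["time", "serum_creatinine", "noise_1", "noise_2"]
for name, imp in sorted(zip(names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:>18s}: {imp:.3f}")
```

The same attribute exists on AdaBoost, Gradient Boosting, and Extra Trees estimators, while XGBoost, LightGBM, and CatBoost expose analogous per-feature importance accessors in their own APIs.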

Conclusion
In this paper, we analyzed and compared the performance of 18 different machine learning models for heart failure prediction based on 12 clinical features. We utilize F1 score and accuracy as the evaluation metrics: the F1 score reflects the robustness of a machine learning model, and the accuracy reflects the ratio of correctly classified cases. Our results demonstrate that z-score normalization is better than min-max normalization and that using SMOTE for class imbalance is helpful. We also examined the feature importance of the different models: input feature 'time' is the most significant across most models. Heart failure is a serious problem that negatively affects people's lives, with more than 550,000 people affected by the disease every year; better prediction can help reduce its impact on the human body.
In the future, we will continue to search for approaches that can further improve heart failure prediction performance, for example combining different machine learning models.