Application of Data Mining techniques to relate Cardiovascular Risk and Coronary Calcium

Introduction: Knowledge Discovery in Databases (KDD) is a process that allows data sets to be modeled and analyzed in an automated and exploratory manner. Data mining can be considered the core of this process. Objective: In this study, clinical subjects were grouped into clusters by comparing parameters associated with cardiovascular risk factors, using KDD-based algorithms. Materials and Methods: The k-means algorithm, Hierarchical Agglomerative Clustering and Kohonen’s Self-organizing Maps were applied to the database in order to obtain relationships based on the dissimilarity of its constitutive fields. Results: Four different clusters were obtained, each represented by a group of well-defined clustering rules. Conclusion: KDD can be used to extract relevant information from clinical databases, and this information is strongly correlated with well-known cardiovascular risk markers.


Introduction
Knowledge Discovery in Databases (KDD) consists of the automated and exploratory analysis and modeling of large data repositories. KDD is an organized process aimed at identifying useful and novel patterns in complex datasets. Data Mining (DM) constitutes the core of this process, since it includes the inference algorithms used to explore data and to understand, analyze and make predictions about the phenomena involved [1]. A subset of these algorithms may be defined as Clustering Algorithms. They have various applications, ranging from data compression and vector quantization [2] to pattern discovery and recognition [3], among others. Their implementation yields a differentiated set of clusters (resulting from the application of specific rules). The notion of what constitutes a proper cluster depends on the application, and there are various methods to obtain clusters. Such methods, in turn, must meet different criteria, both ad hoc and systematic [4].
In biomedical terms, coronary (ischemic) heart disease is the leading cause of premature death worldwide, according to the World Health Statistics 2014 published by the World Health Organization (WHO) [5]. These diseases affect the vessels that supply the cardiac muscle (myocardium), where the ischemic event is produced by atherosclerotic obstruction of the arteries. The obstruction interrupts the blood supply to the cardiac muscle and, if it persists, causes cell death [6].
In cardiovascular health, the need to obtain information for decision-making has become critical [7], together with the increasing importance of having accurate markers from a Digital Clinical History System [8]. In this regard, DM algorithms have been successfully applied in the prediction of clinical events in patients with chronic diseases [9], in the evaluation of the effectiveness of specific treatments for certain types of cancer [10], and in increasing the accuracy of medical evaluations (making differential diagnoses) [11], among other applications.
The main aim of this work was to obtain a series of clusters from a database (generated for the prediction of atherosclerotic events) with a specific set of clinical parameters and risk factors, by applying DM techniques. The clusters were then related to more sophisticated parameters of cardiovascular evaluation, and the analysis was based on inclusion rules, so it was not influenced by subjective evaluations from healthcare professionals.

Clustering Algorithms
The term clustering refers to the set of techniques and tools used for partitioning the data in a database. Each cluster groups the elements that are most alike among themselves and, at the same time, most different from the elements of the other clusters. In turn, the term centroid refers to the central point of a cluster, that is, the mean (in Euclidean terms) of all the objects belonging to that cluster [12]. In this work, this type of algorithm was used to establish a taxonomy of subjects from clinical parameters obtained non-invasively.

k-means algorithm
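Let us consider a dataset $X = \{x_1, \ldots, x_N\}$, $x_n \in \mathbb{R}^d$, where the objective is to partition the dataset into M disjoint clusters $C_1, \ldots, C_M$. The k-means algorithm finds a locally optimal solution with respect to the clustering error, defined as the sum of the squared Euclidean distances between each data point and the center of the cluster to which it belongs. Analytically, the clustering error is defined as:
$$E(m_1, \ldots, m_M) = \sum_{i=1}^{N} \sum_{k=1}^{M} I(x_i \in C_k)\, \lVert x_i - m_k \rVert^2$$
where $m_k$ is the center of cluster $C_k$, and $I(Y) = 1$ if Y is true and 0 otherwise.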
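The clustering error and the iteration that reduces it can be illustrated with a minimal sketch. It assumes the squared Euclidean distance (the usual k-means criterion) and a batch, Lloyd-style update, which may differ from the variant run in the study; the function names, the `max_iter` default and the `seed` parameter are hypothetical choices made for this illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clustering_error(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(squared_distance(p, centers[k]) for p, k in zip(points, labels))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points]


def update(points, labels, centers):
    """Move each center to the mean of its members (empty clusters stay put)."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def kmeans(points, m, max_iter=40, seed=0):
    """Batch k-means: alternate update and assignment until labels stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, m)      # initial centers drawn from the data
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:         # a local minimum has been reached
            break
        labels = new_labels
    return centers, labels
```

Because the result is only a local minimum, it depends on the initial centers; running the procedure from several starting points and keeping the solution with the lowest clustering error is a common safeguard.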

Hierarchical Agglomerative Clustering
The idea behind Hierarchical Agglomerative Clustering (HAC) is to start with each object in a cluster of its own and then repeatedly merge the closest pair of clusters until we end up with just one cluster containing everything [13]. The basic algorithm is given in Table 1.
Table 1. Hierarchical Agglomerative Clustering Basic Algorithm
1. Assign each object to its own single-object cluster. Calculate the distance between each pair of clusters.
2. Choose the closest pair of clusters and merge them into a single cluster (so reducing the total number of clusters by one).
3. Calculate the distance between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all the objects are in a single cluster.
If there are N objects, N − 1 mergers will be needed at Step 2 to produce a single cluster. However, the method does not only produce a single large cluster; it yields a hierarchy of clusters. In this case, the linkage criterion used was the decrease in variance for the cluster being merged [14].
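The four steps of the basic algorithm can be sketched as follows. The merge cost used here is a Ward-style criterion (the increase in within-cluster sum of squares caused by a merge); this is an assumption about how the variance-based linkage in the text is computed, and the function names are hypothetical. The naive pairwise search is meant for illustration, not for large databases.

```python
def centroid(cluster):
    """Mean point of a cluster given as a list of equal-length tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap


def agglomerate(points):
    """Merge the cheapest pair repeatedly; return a list of (cost, merged cluster)."""
    clusters = [[p] for p in points]          # step 1: one cluster per object
    history = []
    while len(clusters) > 1:                  # step 4: repeat until one cluster
        # step 2: find the closest pair under the chosen criterion
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        cost = ward_cost(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        # step 3: costs involving the new cluster are recomputed on the next pass
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append((cost, merged))
    return history
```

The returned history holds the N − 1 mergers in order, so it carries the same information as a dendrogram: each entry records which objects were joined and at what cost.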

Kohonen's Self-organizing Maps
Kohonen's self-organizing maps (SOM) are neural network models widely used for dimension reduction and data clustering. They are "self-organizing" because no supervision is required: SOMs learn on their own through unsupervised competitive learning. They are "maps" because they attempt to adapt their weights to conform to the given input data. The nodes in a SOM network attempt to become like the inputs presented to them, and this is how they learn. A SOM can learn from complex, multidimensional data and transform them into a topological map of far fewer dimensions, typically one or two.
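The competitive learning described above can be sketched for a one-dimensional map (a chain of nodes). The linear decay of the learning rate, the Gaussian neighbourhood and all parameter names and defaults are assumptions made for this illustration; they are not taken from the study or from any particular software package.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to input x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))


def train_som(data, n_nodes, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a one-dimensional SOM by unsupervised competitive learning."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 if radius0 is not None else n_nodes / 2
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                      # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)    # neighbourhood shrinks over time
        for x in rng.sample(data, len(data)):      # present inputs in random order
            bmu = best_matching_unit(weights, x)   # the winning node
            for i, w in enumerate(weights):
                # Gaussian neighbourhood on the map, centred on the winner
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])  # pull the node toward the input
    return weights
```

After training, each record can be assigned to the cluster of its best matching unit, so the number of nodes plays the role of the number of clusters.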

Determining the number of clusters
As is known, determining the number of clusters is one of the major problems in the use of clustering algorithms. In this work, the problem was addressed in two ways. On the one hand, we used the "elbow method" to determine k, the number of clusters that the k-means algorithm is asked to find. The elbow method is based on the observation that increasing the number of clusters helps to reduce the sum of the within-cluster variances. This is because having more clusters allows one to capture finer groups of data objects that are more similar to each other. However, the marginal reduction in the sum of within-cluster variances may drop if too many clusters are formed, because splitting a cohesive cluster into two gives only a small reduction. Consequently, a heuristic for selecting the right number of clusters is to use the turning point in the curve of the sum of within-cluster variances with respect to the number of clusters. Technically, given a number k > 0, we can form k clusters on the data set in question using a clustering algorithm such as k-means, and calculate the sum of within-cluster variances, var(k). We can then plot the curve of var with respect to k. The first (or most significant) turning point of the curve suggests the "right" number [15].
As seen in Figure 1, the first (and most important) turning point corresponds to four clusters. Thus, k = 4 was used for the k-means algorithm.
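In the study the turning point is read from the plot. As an illustration only, the reading can be automated by taking the k at which the curve var(k) bends most sharply, measured by its second difference. This rule, the function name `elbow` and the numbers in the usage note are assumptions of the sketch, not the procedure or the data of the study.

```python
def elbow(wss):
    """Return the k (1-based) at which the curve wss[k-1] bends most sharply.

    wss[0] is the within-cluster sum for k = 1, wss[1] for k = 2, and so on.
    The bend at k is measured by the second difference
    wss(k-1) - 2*wss(k) + wss(k+1); using it as the elbow is a heuristic.
    """
    if len(wss) < 3:
        raise ValueError("need the curve for at least three values of k")
    bends = {k: wss[k - 2] - 2 * wss[k - 1] + wss[k] for k in range(2, len(wss))}
    return max(bends, key=bends.get)
```

For a made-up curve such as `[1000, 600, 350, 120, 100, 90, 85]`, the largest bend occurs at k = 4: the drop from k = 3 to k = 4 is large, while adding a fifth cluster reduces the sum only slightly.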

Figure 1. Within-groups sum of squares as a function of the number of clusters
On the other hand, for the HAC algorithm the number of clusters was determined as part of the algorithm itself, by taking into consideration the biggest jump in the dendrogram.
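The "biggest jump" rule can be sketched as follows, assuming the dendrogram is summarized by the list of heights (merge costs) at which successive mergers occurred. The function name and the input format are hypothetical; software packages may apply additional constraints when choosing the cut.

```python
def clusters_from_largest_jump(heights):
    """Number of clusters left when the dendrogram is cut inside its largest gap.

    heights: merge heights in the order the mergers happened (non-decreasing),
    so N objects give N - 1 heights.
    """
    if len(heights) < 2:
        raise ValueError("need at least two mergers to find a jump")
    n_objects = len(heights) + 1
    gaps = [heights[t + 1] - heights[t] for t in range(len(heights) - 1)]
    t = max(range(len(gaps)), key=gaps.__getitem__)  # largest gap follows merger t + 1
    return n_objects - (t + 1)                       # clusters remaining after t + 1 mergers
```

Cutting inside the largest gap keeps all the cheap mergers and rejects the first expensive one, which is the point at which clearly dissimilar groups would be joined.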

C4.5 algorithm
Once every cluster had been defined, the C4.5 algorithm was applied in order to obtain inclusion rules. This algorithm, developed by Ross Quinlan [16], is an extension of his earlier ID3 algorithm. C4.5 generates a decision tree, which takes as input an object or situation described by a set of attributes and returns a decision, here the cluster to which the object belongs. The paths of the tree therefore form a set of rules that describe each cluster. In this study, the classification field obtained from the application of the k-means algorithm was defined as the class attribute to be predicted, and the remaining attributes were defined as "descriptors".
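C4.5 chooses the attribute test at each node of the tree by its gain ratio, that is, the information gain of the split divided by the entropy of the split itself. The sketch below evaluates a single binary threshold test on a numeric descriptor. The function names and the blood pressure values in the usage note are hypothetical, and the full algorithm (threshold search, recursion, pruning) is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` on a numeric attribute."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0                      # a split that separates nothing is useless
    total = len(labels)
    remainder = sum(len(part) / total * entropy(part) for part in (left, right))
    info_gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return info_gain / split_info
```

With made-up systolic pressures `[110, 120, 125, 160, 170, 180]` and cluster labels `["other"] * 3 + ["hypertensive"] * 3`, the test `value <= 140` separates the two labels perfectly and obtains a gain ratio of 1. Such a test would then read as an inclusion rule of the form "pressure above 140 implies the hypertensive cluster".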

TANAGRA Software
Within the currently available options, the Tanagra platform (Entrepôts, Représentation et Ingénierie des Connaissances, Lyon, France) was chosen because it is a free software tool used for academic and DM specialized research purposes. This platform allows multiple supervised learning algorithms to be implemented. The main objective of the Tanagra project is to provide students and researchers with an entirely accessible DM algorithmic structure, which makes it easy to overcome the intrinsic programming problems in this domain.

Database
We used a database belonging to the French PCV-METRA cholesterol follow-up program (Prévention Cardio-Vasculaire en Médecine du Travail), which comprises a comprehensive evaluation of cardiovascular risk factors and non-invasive detection of infra-clinical atherosclerosis during a day of hospitalization. The subjects involved in the study were contacted between 1995 and 1996. None of them had a history (or symptoms) of cardiovascular disease. The database was composed of 618 records, and the clinical parameters evaluated by the algorithms are detailed in Table 2. In addition, the existing data include a quantification of cardiovascular risk based on the Framingham model (CVR) and the determination of coronary arterial calcium (CAC) [17]. The mean values of both parameters were obtained for the group of subjects belonging to each cluster. All the attributes described above were converted to the numeric-continuous type so that they could be properly processed by the data mining algorithms. Furthermore, in order to standardize the attributes, the following function was applied to them: (1)

Results
After the k-means algorithm detailed in the previous sections was applied (MacQueen method [18]), 4 clusters composed of 187, 153, 32 and 246 data records, respectively, were obtained. They are characterized by the calculated centroids, whose values are shown in Table 3. The algorithm was executed with 5 trials and a maximum of 40 iterations. The 5 most significant inclusion rules for each cluster are shown in Table 4. The mean values of CVR and CAC corresponding to each cluster are given in Table 5.
As can be observed, the 4 clusters are clearly defined and are represented by their inclusion rules, which are met by more than 80% of the cases studied. In addition, the HAC algorithm was executed as a second entry point into the data analysis process. The resulting output was four clusters, whose centroids can be seen in Table 6. Furthermore, in order to compare the results of k-means and HAC, a comparative table of the classifications obtained by each method was prepared (see Table 7). Finally, from the clusters obtained, a hierarchy of attributes was generated, which determines the inclusion of each data record in its corresponding cluster (Table 8).
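A comparison between two clusterings of the same records, such as the one between k-means and HAC, amounts to a cross-tabulation of the two sets of labels. The sketch below builds such a table and a simple agreement score; the function names and the score are hypothetical additions for illustration and are not the statistics reported in the study.

```python
from collections import Counter


def cross_tabulate(labels_a, labels_b):
    """Count how many records fall in each (cluster in A, cluster in B) pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same records")
    return Counter(zip(labels_a, labels_b))


def row_purity(table):
    """Share of records lying in the largest cell of their A cluster.

    A value of 1.0 means every cluster of A lies inside a single cluster of B.
    """
    best = {}
    for (a, _b), n in table.items():
        best[a] = max(best.get(a, 0), n)
    return sum(best.values()) / sum(table.values())
```

Since cluster numbers are arbitrary in both methods, the comparison looks at how the records of each cluster of one method are distributed over the clusters of the other, not at whether the numbers coincide.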

Discussion
CVR estimation is mainly used for raising public awareness of the occurrence, over a 5 to 10-year timespan, of diseases with high morbidity and mortality rates, as is the case of cardiovascular diseases. In current clinical practice, CVR is estimated by means of stratified tables in which cardiovascular risk factors are evaluated through a scoring system, for men and women separately. Equations based on the Framingham model, the SCORE algorithm and the PROCAM model, among others, constitute some of the standardized predictors for the evaluation of CVR [19]. Coronary and cerebrovascular events, which result from the development of atherosclerosis, occur suddenly and without prior symptoms. Reducing their occurrence depends on early modification of factors such as tobacco use, diet and sedentary lifestyle, together with periodic monitoring of cholesterol, glycemia and blood pressure [20]. In this work, DM techniques were applied with the aim of clustering subjects from a clinical database (patients who attended a cardiovascular check-up) without the direct intervention of healthcare professionals. To this end, three different clustering algorithms were implemented in order to assess the existence of a classification or hierarchy in the data, and their results were then compared to identify any relationship between them. Looking at the centroid tables for the k-means, HAC and SOM algorithms (Table 3, Table 6 and Table 9, respectively) and at the comparison tables between these algorithms (Table 7 and Table 10), one can observe a strong relationship between the clusters.
In clinical terms, it may be inferred that there are 4 clearly differentiated clusters. Cluster 4 corresponds to healthy subjects (non-smokers, non-diabetics, with normal blood pressure). Cluster 1 is composed mainly of smokers. Cluster 3 is identified with diabetic subjects and, finally, Cluster 2 is made up of hypertensive subjects.
These results are in line with the clinical diagnoses that any general practitioner would provide [21]. However, this study shows that the subjects in cluster 4, characterized by low risk (see Table 5, CVR 10.9%), also have low coronary artery calcium levels, a specific marker for the presence of atherosclerosis. The hypertensive group (cluster 2), mostly composed of non-smokers (70%), has a higher CVR (17.9%) as a result of high blood pressure, together with a CAC 2.5 times higher than that of normal subjects. Cluster 1, composed of 100% smokers, has a CVR similar to that of the hypertensive group. Nonetheless, it shows only a slight increase in CAC, which would imply a low predisposition for coronary disease. Finally, cluster 3, composed of 100% diabetics, has the highest CVR (28.4%) as well as a high marker for atherosclerosis (CAC 100).
It was observed that, when using 3 or more clusters, there is always a group composed exclusively of diabetics. In this sense, the analysis of the results shows that diabetic subjects have pathophysiological characteristics significantly different from those of non-diabetic subjects. Finally, the methodology selected has the advantage that it is not biased by the opinion of the healthcare professional and, as a result, is free from the subjectivity that could be present in the analysis and examination performed by the physician [22]. As a disadvantage, it must be acknowledged that this analysis requires the number of groups to be identified by the algorithm to be defined beforehand; therefore, the ideal number of clusters is obtained through successive runs.

Conclusion
This work shows the feasibility of an original way to obtain relevant information from clinical databases. Such information is strongly correlated with frequently used cardiovascular risk markers, which served as a comparison. The basic clustering analysis done here is sufficient to point to future lines of research. However, further studies are required in order to optimize the determination of the number of groups and to achieve a better correlation between the attributes (cardiovascular risk factors) and the evaluation of such risk. Among other improvements, there exist more sophisticated clustering algorithms that could help classify the clinical cases better.