An Automatic Labeling of K-means Clusters based on Chi-Square Value

Automatic labeling methods are widely used in text clustering. However, few studies address automatic cluster labeling for numeric data points. The aim of this study is therefore to develop a novel automatic cluster labeling method for numeric data points that uses the result of a Chi-Square test as the cluster label. We used K-means as the clustering method and the disparity of Health Human Resources as a case study. The results show that the accuracy of the cluster labeling is about 89.14%.


Introduction
Clustering is an unsupervised method that is widely used in several domains to group data into clusters based on similarity. In the image domain, clustering can be used for image segmentation [1], image database categorization [2], image quality verification [3], etc. Clustering is also widely used in the text domain, for example in [4], [5], [6].
In general, studies of clustering only group the data into several clusters and do not annotate (automatically label) the resulting clusters. Some studies in the text domain have implemented cluster labeling, such as Wikipedia-based cluster labeling [7] and hierarchical cluster labeling [8], [9]. However, there are few studies of automatic cluster labeling in other domains. The existing cluster labeling method for numeric data points is the one for Support Vector Clustering (SVC), which was developed based on some invariant topological properties of a trained kernel radius function [10]. Its labels are support vectors, data points, and stable equilibrium points.
On the other hand, the Chi-Square test is a statistical method that can be used to detect differences between an actual condition and an ideal condition. Therefore, we propose a novel method for cluster labeling of numeric data points that uses the result of a Chi-Square test as the cluster label. In this study we used K-means as the clustering method and the disparity of Health Human Resources (HHR) as a case study.

Related Works
Because clustering is an unsupervised method that groups data into clusters based on similarity, the resulting clusters carry no labels. This raises the challenge of labeling each cluster according to the characteristics of its data. Cluster labeling has been widely implemented in the text domain, but it is still rarely implemented for numeric data. One of the methods used in the text domain is the chi-square method. It is used to test each word at each node in a hierarchy, starting at the root and recursively moving down the hierarchy [11]. Since the chi-square method can be used to find the association between two variables, i.e. the observed value and the expected value, it can also be applied to numeric data for cluster labeling.

Clustering
Several clustering methods are widely implemented, such as K-means, K-Medoids, hierarchical clustering, DBSCAN, etc. In this study, we apply K-means clustering because of its simplicity. We implemented K-means with the standard Euclidean distance as the similarity measure and the rule-of-thumb formula for determining the number of clusters:

$$k \approx \sqrt{n/2} \tag{1}$$

where $k$ is the number of clusters and $n$ is the number of data points. The K-means clustering algorithm is as follows [12]:
1. Select k initial cluster centroids.
2. Iteratively refine them as follows:
   a. Assign each instance to its closest cluster centroid.
   b. Update each cluster centroid to be the mean of its constituent instances.
3. Stop iterating when there is no further change in the assignment of instances to clusters.
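The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the study: the function names, the random choice of initial centroids, the `seed` and `max_iter` parameters, and the rounding of the rule-of-thumb value are all assumptions.

```python
import math
import random


def rule_of_thumb_k(n):
    """Number of clusters from the rule of thumb k ~ sqrt(n / 2), at least 1."""
    return max(1, round(math.sqrt(n / 2)))


def euclidean(p, q):
    """Standard Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster a list of tuples; returns (assignment, centroids)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as initial centroids (assumed strategy).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2a: assign each instance to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 3: stop when the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2b: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids
```

For example, for an assumed data set of n = 35 points, `rule_of_thumb_k(35)` gives 4 clusters.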

Cluster Homogeneity
Cluster homogeneity can be determined from the silhouette coefficient, which is obtained through the following stages [13]:
1. Calculate the average distance from a document i to all other documents located in the same cluster A:

$$a(i) = \frac{1}{|A| - 1} \sum_{j \in A,\, j \neq i} d(i, j) \tag{2}$$

where $|A|$ is the number of data points in cluster A, $d(i, j)$ is the distance between the i-th document and the j-th document, and $i$, $j$ are document indices.
2. Calculate the average distance from the i-th document to all the documents in each other cluster, and take the smallest value:

$$b(i) = \min_{C \neq A} d(i, C) \tag{3}$$

where $d(i, C)$ is the average distance from the i-th document to all objects in another cluster C, with A ≠ C.
3. The value of the silhouette coefficient is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{4}$$
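The three stages can be illustrated with a short Python sketch that averages the silhouette value over all data points. The function names are hypothetical. The handling of singleton clusters (score 0) and of the case where both averages are zero are assumed conventions that the text does not specify, and the sketch expects at least two clusters.

```python
import math


def euclidean(p, q):
    """Standard Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (requires at least two clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # Stage 1: average distance to the other members of the same cluster.
        # The point's distance to itself is 0, so dividing by |A| - 1 is correct.
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # Stage 2: smallest average distance to the members of any other cluster.
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        # Stage 3: silhouette value of this point.
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlapping or misassigned points.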

The Standard Analysis of Requirement Status of HHR
According to Regulation Number 32 of 1996, health professionals are all persons who devote themselves to the field of health and who have knowledge and/or skills acquired through education in the field of health which, for certain types, require authority to carry out health efforts. One method that can be applied to plan the HHR requirement is the "population ratio" method, namely the ratio of health professionals to the total population of a region. The standard population ratio is determined based on the population per regency, the population growth per regency, and the number of health professionals.
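As a minimal sketch of the population ratio method, the required number of health professionals in a region can be derived from a standard ratio expressed per 100,000 population. The function name and the figures in the usage note are illustrative assumptions, not values taken from the study.

```python
def required_professionals(population, ratio_per_100k):
    """Normative number of professionals for a region under the population ratio method."""
    return ratio_per_100k * population / 100_000
```

For a hypothetical standard of 40 professionals per 100,000 population and a region of 250,000 inhabitants, `required_professionals(250_000, 40)` gives 100.0.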
On the other hand, a disparity analysis can be used to determine how well the availability of HHR matches the "standards of the population ratio". A Chi-Square analysis is one method for performing the disparity analysis, using the following formula:

$$\chi^2 = \sum \frac{(f_o - f_h)^2}{f_h} \tag{5}$$

where $\chi^2$ is the Chi-Square value, $f_o$ is the observed value, and $f_h$ is the predicted (expected) value. The Chi-Square value is interpreted by comparing the computed Chi-Square value with the Chi-Square table value. A computed value greater than the table value means there is a significant difference between the availability of health professionals and the normative need for them, and vice versa. Subsequently, this framework is applied as the method for labeling clusters.
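The Chi-Square statistic itself is straightforward to compute. The sketch below assumes that the observed values are the available numbers of health professionals and the expected values are the normative requirements. The function name and the sample figures are hypothetical.

```python
def chi_square(observed, expected):
    """Chi-Square statistic: sum of (f_o - f_h)^2 / f_h over paired values."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((f_o - f_h) ** 2 / f_h for f_o, f_h in zip(observed, expected))
```

For instance, 50 available professionals against a requirement of 40 gives (50 - 40)^2 / 40 = 2.5, and a perfect match gives 0.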

Research Procedures
The research procedure is illustrated in Figure 4, and each step is explained in detail in the following subsections.

Data Collection
As explained before, the collected data include:
- Population data obtained from the Central Bureau of Statistics, Central Java Province.
- HHR data (doctors, nurses, midwives) acquired through an online medium, i.e. the dashboard of the Health Human Resources Information System, version 2016. This system is published by the PPSDM, Ministry of Health, Republic of Indonesia.
In addition, this research uses the HHR requirement ratio per 100,000 population, as described in Table 1.

Pre-processing
The pre-processing step normalizes the obtained data into the range [0,1] using min-max normalization. This step is important because the range of the obtained data differs for each type, which would affect the performance of the clustering process, since clustering is based on the level of similarity between the data.
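Min-max normalization rescales each attribute independently so that its smallest value maps to 0 and its largest to 1. The following sketch shows the idea for a single attribute. The function name is hypothetical, and mapping a constant column to 0 is an assumed convention.

```python
def min_max_normalize(values):
    """Rescale a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Applied separately to the population column and to each HHR column, this prevents attributes with large magnitudes from dominating the Euclidean distance.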

Clustering
The clustering process is carried out with the K-means clustering method, with the Euclidean distance as the measure of similarity. The details of the clustering method are described in Subsection 2.1.

Cluster Homogeneity Test
The aim of the cluster homogeneity test is to determine the quality and strength of the clusters, i.e. how well each object is placed in its cluster. The method used for the cluster homogeneity test in this research is the silhouette coefficient. The details of the cluster homogeneity test can be found in Subsection 2.2.

Chi-Square Test
As described previously, the Chi-Square test is performed to determine the disparity between the HHR requirements according to Government Regulation No. 32 of 1996 and the actual availability of HHR in the field.

Evaluation
The evaluation measures the accuracy of the Chi-Square based labeling.

Result and Analysis
The results of the clustering process using K-means can be seen in Figure 2, and the centroid values for each cluster can be seen in Table 2. The number of clusters based on the rule-of-thumb formula is 4.

Table 2. Centroid values for each cluster
Cluster   Centroid values
1          9363060   1160    990    310    6100   306
2          2101091    841   1231    231    5891    96
3         14143392   1482   1692    442    9512   662
4         10915493   3733   9283   1343   31043   404

As mentioned in the previous section, the cluster homogeneity test was performed using the silhouette coefficient. The resulting value is about 0.541, which means the clustering results have a good structure.
Subsequently, the Chi-Square test was calculated for automatic cluster labeling in general, and specifically to identify the HHR disparity status (the difference between the availability value and the requirement value) for each regency in Central Java Province. Since we used 2 categories (availability and requirement), the degree of freedom is (2-1) = 1. With this degree of freedom and a significance level (fault tolerance) of 0.05, the Chi-Square table value is 3.841. We therefore performed the automatic labeling process based on the rule in eq. (6), and the resulting labels are given in Table 4. Based on Figure 2 and Table 4, there are four cities in Central Java Province that have a good disparity label (fair status for both general practitioner and dentist availability), namely Magelang City, Salatiga City, Pekalongan City, and Tegal City. Those cities are included in cluster 2. The final evaluation measured the accuracy of the labeling process as the ratio between the number of correctly predicted labels and the total number of labels. The result shows that the labeling accuracy is about 89.14%.
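The decision step and the accuracy measure can be sketched as follows. The labeling rule of eq. (6) is not reproduced in this text, so the threshold rule below is only an assumed reading of it: a computed Chi-Square value at or below the table value is labeled as fair, and a larger value as a disparity. The label strings and function names are hypothetical. The threshold 3.841 is the tabulated critical value for one degree of freedom at the 0.05 significance level.

```python
CHI_SQUARE_TABLE_VALUE = 3.841  # degree of freedom 1, significance level 0.05


def label_from_chi_square(chi2, table_value=CHI_SQUARE_TABLE_VALUE):
    """Assumed rule: no significant difference means 'fair', otherwise 'disparity'."""
    return "fair" if chi2 <= table_value else "disparity"


def labeling_accuracy(predicted, actual):
    """Ratio of correctly predicted labels to the total number of labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With this rule, a computed value of 2.5 is labeled "fair" and a value of 10.0 is labeled "disparity". Three correct labels out of four give an accuracy of 0.75.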

Conclusions and Future Works
The automatic labeling of K-means clusters based on the Chi-Square test was successfully applied, with an accuracy of about 89.14%. Based on the resulting labels, there are four cities in Central Java Province that have a good disparity label (fair status for both general practitioner and dentist availability), namely Magelang City, Salatiga City, Pekalongan City, and Tegal City.
The same strategy can be applied to all regencies and cities in Indonesia and to assess the availability of other HHR such as pharmacists, public health officers, sanitarians, etc. The accuracy could be improved by optimizing the K-means clustering algorithm so that the clusters have a better silhouette coefficient.