Principal component analysis with successive interval in K-Means cluster analysis (Case study: 2013 poverty data in East Nusa Tenggara)

K-Means clustering is a cluster analysis method for continuous variables that uses Euclidean distance, which assumes that the observed variables are uncorrelated with each other. Correlated categorical data can be handled by first converting the categorical data into numerical data with the method called successive interval and then applying Principal Component Analysis. Applying this approach to poverty data for East Nusa Tenggara Province and clustering with K-Means, we found that the variables that influence cluster formation are toilet, fuel, and job.


Introduction
In the social and behavioral sciences, researchers are often confronted with a large number of variables. These variables often have different measurement scales, some numerical and some categorical. Cluster analysis is a multivariate method that groups a set of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters). The variables used as the basis for clustering are of two types: categorical variables (nominal and ordinal) and numerical variables (interval and ratio). There are two broad families of methods in cluster analysis, hierarchical and non-hierarchical. The hierarchical method is used when the number of clusters is not known in advance, while the non-hierarchical method is used when the number of clusters is already known. A commonly used non-hierarchical method is K-Means cluster analysis, which is applied to numerical data and uses Euclidean distance as the measure of similarity and dissimilarity. Euclidean distance is appropriate when the observed variables are independent or uncorrelated with one another. One of the assumptions in cluster analysis is that there is no multicollinearity between variables [1]. The Euclidean distance concept can be used for numerical data after transforming the variables with Principal Component Analysis [2]. Nominal and ordinal data can be converted to interval data with the method proposed by Hays [3], called successive interval. Principal Component Analysis can then be applied to the results of the successive interval method to reduce the data and remove multicollinearity, so that the data can be analyzed by cluster analysis. Previous research on cluster analysis for categorical data by Yuniato [4] compared cluster analysis with nonlinear principal component analysis against two-step clustering on mixed data, and found that the two-step cluster method is better and more specific than the centroid linkage method [5].
This cluster analysis aims to identify the significant factors of poverty in East Nusa Tenggara, so that the results can serve as input for the government in reducing and even overcoming poverty in the province.

Categorical data
Data can be divided by type into numerical (quantitative) data and categorical (qualitative) data. Numerical data are expressed as numerical magnitudes, for example per capita income, expenditure, or price, whereas categorical data are classified by category or class. Categorical data consist of nominal data and ordinal data. Nominal data are data whose values have no order and do not indicate a level; they serve merely as labels, such as religion, gender, ethnicity, or race. Ordinal data are data whose categories have an order that indicates a level or rank, such as education level or smoking habits. Categorical data can also be obtained by grouping continuous data, at the risk of losing information. In practice, categorical data are easier to record than continuous data, respondents find it easier to answer sensitive questions, and the categories have more practical significance. Categorical data can be presented as frequencies, frequency tables, and contingency tables [6].
When numerical variables are linked, correlation analysis is one option for examining the relation between them. When two categorical variables are linked, however, correlation analysis cannot be used, because the numbers in a category code are only labels and not actual values. Another reason is that one type of categorical variable, the nominal category, cannot be ordered. Assigning a different order gives a different correlation value, so two people who calculate the correlation are likely to obtain different results. For this reason, chi-squared analysis is used to find relationships (associations) between categorical variables. The analysis is based on the contingency table (often called a cross-tabulation), a table whose cells contain the frequencies at the intersection of rows and columns. In its general form, the first variable has m categories and the second variable has k categories. The hypotheses are $H_0$: there is no association between the two variables, against $H_1$: there is an association between the two variables. The test statistic is $\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{k} (O_{ij} - E_{ij})^2 / E_{ij}$, where $O_{ij}$ is the observed frequency in row i and column j and $E_{ij}$ is the frequency expected under $H_0$. Whether an association is strong or weak can be seen from the association value, which lies in the interval from -1 to 1.
An association value of 1 means there is a strong association between the variables; a value of zero means there is no association between the variables.
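As an illustration of the contingency-table analysis described above, the following is a minimal sketch in Python that computes the chi-squared statistic for an m x k table of counts. The function name, the example counts, and the choice of Cramér's V as the association measure are assumptions made for illustration; the text does not name the coefficient it uses, and Cramér's V in particular ranges only from 0 to 1.

```python
from math import sqrt


def chi_square_association(table):
    """Return (chi2, dof, cramers_v) for an m x k table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    m, k = len(table), len(table[0])
    dof = (m - 1) * (k - 1)
    # Cramér's V (assumed measure): 0 = no association, 1 = perfect association
    cramers_v = sqrt(chi2 / (n * (min(m, k) - 1)))
    return chi2, dof, cramers_v


# Hypothetical 2 x 2 counts, not taken from the SUSENAS data
table = [[30, 10], [10, 30]]
chi2, dof, v = chi_square_association(table)
print(chi2, dof, v)
```

For this hypothetical table every expected count is 20, so the statistic is 20 with one degree of freedom and Cramér's V is 0.5.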

K-Means cluster
K-Means cluster analysis was developed by MacQueen in 1967 and is the best-known non-hierarchical data grouping method, widely used in many fields because it is simple and easy to implement [7]. K-Means is a partitioning method that separates data into different groups. The purpose of the grouping is to minimize the variance within a group and maximize the variance between groups. The basic K-Means algorithm is as follows:
1. Determine the value k, the number of clusters to be formed.
2. Generate k initial cluster centers at random.
3. Calculate the distance from each data point to each cluster center using Euclidean distance.
4. Assign each data point to the cluster whose center is nearest.
5. Determine the new position of each cluster center by averaging the data points assigned to that cluster.
Steps 3 to 5 are repeated until the cluster memberships no longer change.
The percentage of total variance is considered sufficiently representative of the variance in the data if it is 75% or more.
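The steps of the basic algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name, the fixed random seed, the iteration cap, the stopping rule (stop when assignments no longer change), and the toy points are all hypothetical.

```python
import random
from math import dist


def k_means(points, k, max_iter=100, seed=0):
    """Basic K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2: random initial centers
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 4: assign each point to its nearest center (Euclidean)
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assumed stopping rule: no change
            break
        labels = new_labels
        # Step 5: move each center to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# Hypothetical toy data: two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, k=2)
print(labels, centers)
```

In the study the inputs would be the principal component scores rather than raw coordinates. Because the initial centers are random, the cluster numbering can differ between runs even when the grouping is the same.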

Successive interval
The successive interval method converts nominal or ordinal data to interval data [8].
The steps to be carried out are:
a. Calculate the frequency of each data category.
b. Calculate the proportion based on the frequency of each category.
c. Calculate the cumulative proportion.
d. Calculate the z value for each cumulative proportion.
e. Calculate the value of the density function.
f. Calculate the scale value (the average of the interval) for each category.
g. Calculate the scaling.
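The steps above can be sketched as follows. This is a hedged illustration that assumes a common formulation of the method: the value in step d is taken to be the standard normal quantile of the cumulative proportion, the scale value in step f is the difference in density at the lower and upper limits divided by the category proportion, and the scaling in step g shifts the values so that the smallest equals 1. The text does not spell out these formulas, and the function name and example responses are hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def successive_interval(values, categories):
    """Map ordered categories to interval-scale values.

    `categories` lists the category labels in their assumed order.
    """
    std_normal = NormalDist()
    counts = Counter(values)  # step a: frequencies
    n = len(values)

    scale = {}
    cum_count, dens_prev = 0, 0.0
    for cat in categories:
        if counts[cat] == 0:
            continue  # unobserved categories receive no scale value
        p = counts[cat] / n  # step b: proportion
        cum_count += counts[cat]
        cum = cum_count / n  # step c: cumulative proportion
        if cum < 1.0:
            z = std_normal.inv_cdf(cum)  # step d: z value
            dens = std_normal.pdf(z)  # step e: density at z
        else:
            dens = 0.0  # density vanishes at the upper tail
        scale[cat] = (dens_prev - dens) / p  # step f: scale value
        dens_prev = dens

    shift = 1 - min(scale.values())  # step g: smallest value becomes 1
    return {cat: sv + shift for cat, sv in scale.items()}


# Hypothetical ordinal responses, not SUSENAS data
responses = ["low"] * 2 + ["mid"] * 6 + ["high"] * 2
scaled = successive_interval(responses, ["low", "mid", "high"])
print(scaled)
```

The method relies on an ordering of the categories, so for a nominal variable the order passed in `categories` is itself an assumption that affects the result.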

Data
The data used is the core data SUSENAS (

Method
In this research, the steps of the analysis are:
1. Explore the data.
2. Check multicollinearity among the variables; apply NLPCA to the data.
3. Carry out the successive interval method on the data.
4. Apply PCA to the successive interval data.
5. Apply K-Means cluster analysis to the principal component scores of PCA with successive interval.

Data description
The data described here are the 2013 SUSENAS core data for East Nusa Tenggara Province, covering 10422 households. The following three variables were taken to describe the data. From Figure 1, we can see that 64% of household heads in East Nusa Tenggara Province in 2013 had their own toilet, 21% did not have a toilet, and 15% had a toilet that was not their own. From Figure 2, we see that household fuel is still dominated by charcoal/briquettes/wood at 83.4%, followed by kerosene at 15.7% and electricity/gas at 0.9%. Figure 3 shows that 91% of the households had a job while 9% did not.

Checking multicollinearity
The multicollinearity assumption was tested on the 15 independent variables of nominal and ordinal scale by looking at the association between variables in the contingency tables. The test showed sufficiently large associations between literacy and education (0.815), between jobs and education (0.656), between working status and education (0.567), and between jobs and working status (0.813).

Principal component analysis with successive interval
The results of the principal component analysis showed that the first two principal components explain only 35.2% of the variance of the initial data. The principal components to be used in the K-Means clustering are selected so that the cumulative percentage of variance is about 75% or more [9]. To represent the variance of the data in this study, 9 principal components were taken, with a total variance of 79.1%. The final results of the principal component analysis are the principal component scores, which are used in the K-Means cluster analysis.
In Table 3, the value of variable X1 on the first principal component is 0.372, meaning a contribution of 0.372 with a positive influence on the observation unit; on the second component the value is 0.040, with a negative influence on the observation unit. The same reading applies to the other principal components and variables.
From Table 4, it can be seen that the ratio of the within-cluster variance to the between-cluster variance for Principal Component Analysis with successive interval is 0.00039, so the variance within clusters is smaller than the variance between clusters. This ratio can be used to compare two clustering methods: the method with the smaller ratio is the better one. The ANOVA results show that the variables that differentiate the clusters are those with a p-value less than 0.01. In this case, the significant variables are X6 (toilet), X8 (fuel), and X11 (job). These are therefore the factors that the East Nusa Tenggara government needs to consider in order to reduce these problems. Toilets are related to public health and cleanliness, so every household should have one. The SUSENAS results show that about 36% of households did not have their own toilet. As for fuel, about 83.4% of households still use charcoal/briquettes/wood; the government must address this, and one way is to distribute electricity to the villages.
According to SUSENAS, about 91% of households had a job, but most of these jobs are not proper jobs with a good salary, so many households are still classified as poor.
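The rule used in this section to choose the number of principal components, namely keeping components until the cumulative share of variance reaches about 75%, can be sketched as below. The function name and the eigenvalues are hypothetical and serve only to illustrate the rule; they are not the eigenvalues obtained in this study.

```python
from itertools import accumulate


def components_needed(eigenvalues, threshold=0.75):
    """Return (count, cumulative share) of the leading components
    needed to reach the given share of total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for count, running in enumerate(accumulate(ordered), start=1):
        share = running / total
        if share >= threshold:
            return count, share
    return len(ordered), 1.0


# Hypothetical eigenvalues of a correlation matrix, for illustration only
eigenvalues = [4.0, 2.0, 1.0, 1.0, 1.0, 1.0]
count, share = components_needed(eigenvalues)
print(count, share)
```

With these illustrative eigenvalues the first three components explain 70% of the variance, so a fourth is needed to pass the 75% threshold.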