Application of the K-Means algorithm to determine poverty status in Hulu Sungai Tengah

Poverty is the inability to meet the minimum or basic needs of life. In Indonesia, poverty remains one of the main problems still awaiting an optimal solution. The government has carried out several programs to address poverty, but their implementation often misses the intended targets. This assistance is expected to improve community welfare, so it is unfortunate when it does not reach the right recipients. This study aims to determine poverty status in Hulu Sungai Tengah (HST) Regency. Addressing the problem above requires a grouping method for determining poverty status, so this study applies a clustering method, namely K-Means, to population data. Based on an analysis of population data for 353 heads of family in HST Regency, it can be concluded that there are three poverty status clusters: low-level poverty (cluster 3) with 111 heads of family, medium-level poverty (cluster 2) with 130 heads of family, and high-level poverty (cluster 1) with 112 heads of family.


Introduction
Poverty is the inability to meet the minimum or basic needs of life. Poverty arises not only from insufficient income but also from limited household facilities and infrastructure. In Indonesia, poverty remains one of the main problems still awaiting an optimal solution [1].
Several government programs to overcome poverty have been carried out; however, their implementation often misses the intended targets. The main factors behind this mistargeting are the accuracy of the data and the accuracy of the data analysis used to determine the poverty status of the population [1]. For example, a 2020 article in Mata Banua reported that the delivery of social assistance in Hulu Sungai Tengah (HST) Regency was still subject to delays. One reason given for these delays was the non-optimal process of data collection and distribution at the local government level [2]. Given these problems, a grouping method is needed so that the data can be read more quickly. Clustering is a method that groups data objects with the same characteristics into one cluster, while data with different characteristics are placed in other clusters. In a previous study by Aras in 2016, clustering with the K-Means algorithm was used to prioritize the recipients of home renovation assistance. In 2019 and 2020, the method was also used to determine poverty status clusters based on population data in South Jambi District, West Java Province, and Banten Province. These studies concluded that the K-Means algorithm is suitable for poverty data [1,3,4].
Based on the description above, this research applies a clustering method, namely the K-Means algorithm, to determine poverty status clusters based on population data in HST Regency, under the title "Application of the K-Means Algorithm to Determine Poverty Status in Hulu Sungai Tengah Regency".

Descriptive Statistics
Descriptive statistics is a set of methods for collecting, processing, and presenting data and for calculating summary measures. To make the data easier to understand, the results can be presented as tabulations, diagrams, or graphs [5].

Data Mining
Data mining is a process that uses statistical, mathematical, and machine learning techniques to extract and identify useful information and related knowledge from various data [6]. Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns [7].

Clustering
Cluster analysis aims to group data objects so that objects with similar characteristics fall into one cluster, while objects with different characteristics fall into different clusters [8]. Clustering methods are divided into two types: hierarchical and non-hierarchical. Hierarchical clustering builds structured levels of data or objects based on the similarity of their properties and is used when the desired number of clusters is not known in advance, while non-hierarchical clustering groups data or objects when the number of clusters to be formed can be determined in advance [9].

K-Means Clustering
The K-Means clustering algorithm is a non-hierarchical method that partitions existing data into one or more clusters, so that data with the same characteristics are grouped into one cluster and data with different characteristics are grouped into other clusters [10]. In K-Means, every data point must be assigned to a cluster, but a data point assigned to one cluster at one stage of the process may move to another cluster at the next stage [11]. The procedure of the K-Means algorithm is as follows:
1) Determine k, the number of clusters to be formed.
2) Choose k initial centroids at random.
3) Calculate the distance from each data point to the centroid of each cluster, for example using the Euclidean distance.
4) Assign each data point to the cluster with the nearest centroid.
5) Recalculate each centroid as the mean of the data assigned to its cluster.
6) Repeat steps 3 to 5 until convergence is reached, i.e. no data point moves to another cluster, or the cluster positions in the last iteration are the same as in the previous iteration.
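The centroid update and the distance used in the procedure above can be written out explicitly. The notation is an assumption introduced here for illustration, since the source does not define symbols: x_i is the i-th data point with p variables, C_j is the j-th cluster with n_j members, and c_j is its centroid.

```latex
% Centroid of cluster C_j: the mean of its n_j members, variable by variable
c_{jm} = \frac{1}{n_j} \sum_{x_i \in C_j} x_{im}, \qquad m = 1, \dots, p

% Euclidean distance between data point x_i and centroid c_j
d(x_i, c_j) = \sqrt{\sum_{m=1}^{p} \left( x_{im} - c_{jm} \right)^{2}}
```

With the 10 research variables used in this study, p = 10.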

Elbow Method
The Elbow method determines the best number of clusters by comparing results across different numbers of clusters and looking for the point at which the curve bends like an elbow [12]. The comparison is based on the SSE (Sum of Squared Errors), the sum of the squared distances between each data point and the centroid of its cluster. The SSE is calculated for each number of clusters, and it decreases as the number of clusters grows. The elbow is the point at which a drastic decrease in SSE is followed by a slow decrease. When the SSE values are plotted against the number of clusters, this point appears as an elbow-like angle, here at 3 clusters. Figure 1 shows the SSE value for each number of clusters.
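The SSE described above can be written as a formula. The symbols are assumptions introduced for illustration (the source does not give its notation): k is the number of clusters, C_j the j-th cluster, c_j its centroid, and x_i a data point.

```latex
% Sum of squared Euclidean distances from every point to its own cluster centroid
\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^{2}
```

Adding clusters can only keep each point as close or closer to its nearest centroid, which is why the SSE tends to fall as k grows.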

Decision Tree
A decision tree is a flowchart structure that resembles a tree, where each internal node represents a test on a variable, each branch represents a test result, and each leaf node represents a class [13]. The basic concept of a decision tree is to turn data into a tree and a set of decision rules, where each node represents an attribute, each branch represents a value of that attribute, and each leaf represents a class [14]. The general form of a decision tree can be seen in Figure 2 below.

Figure 2. General Decision Tree Forms
In building a decision tree, the root is determined first: the gain value of each variable is calculated, and the variable with the highest gain becomes the root [15].
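The source does not state which gain measure is used. A common choice, assumed here, is the information gain based on Shannon entropy, as in the ID3 family of decision tree algorithms. In the sketch below, S is the data set, p_c the proportion of S belonging to class c, A a candidate variable, and S_v the subset of S for which A takes the value v; all of these symbols are introduced for illustration.

```latex
% Entropy of a data set S with n classes
\mathrm{Entropy}(S) = -\sum_{c=1}^{n} p_c \log_2 p_c

% Information gain of splitting S on variable A
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S)
  - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)
```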

Data Sources
The research material used in this study is 2019 population data with a total sample of 353 heads of household in Hulu Sungai Tengah Regency. The data were obtained from the Office of Social Affairs, Family Planning Population Control, Women Empowerment and Child Protection, Hulu Sungai Tengah (Dinas Sosial, PPKB, PPPA).

Research Variables
This study uses 10 variables: building status, number of household members, floor type, floor area, wall type, drinking water source, electrical power, cooking fuel, facilities for final disposal of feces, and highest education level.

Analysis Steps
(1) Data preprocessing
(2) Descriptive statistical analysis
(3) Determining poverty status using the K-Means algorithm
(4) Determining the optimal number of poverty statuses using the elbow method

Descriptive Analysis
This section describes the variables number of household members, building status, floor area, floor type, wall type, drinking water source, electrical power, cooking fuel, and defecation facilities for the 353 heads of family. The following graph represents the distribution of the data for each variable used in this study.
2) Building status: 134 families (38.1%) live in leased or contracted buildings.
3) Floor area: 119 families (33.8%) have a floor area of 15-26 m².
4) Floor type: 290 families (82.4%) use low-quality wood.
5) Wall type: 328 families (93.2%) use wood.
6) Drinking water source: 121 families (34.4%) use drilled wells or pumps.
7) Electrical power: 208 families (59.1%) use 900 watts.
8) Cooking fuel: 217 families (61.6%) use firewood.
9) Defecation facilities: 142 families (40.3%) have their own.

Determining Poverty Status Using the K-Means Algorithm
Before poverty status is determined, the best number of clusters, that is, the best number of poverty statuses, must be chosen. The elbow method was applied by calculating the SSE value for 2 to 9 clusters; the results are shown in Table 1. Figure 4 presents the SSE values from Table 1 as an elbow graph.

Figure 4. Elbow chart of SSE values
Table 1 and Figure 4 show that the largest decrease in SSE occurs at 3 clusters, which produces an elbow-like angle in the graph, after which the SSE decreases steadily. In accordance with the concept of the elbow method, 3 clusters are used.
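The selection rule described above (take the number of clusters at which the SSE drops the most) can be sketched in a few lines. The function names and the SSE values below are hypothetical and chosen only for illustration; they are not the values from Table 1.

```python
def sse_drops(sse_by_k):
    """Return {k: SSE(k-1) - SSE(k)} for every k whose predecessor is present."""
    return {
        k: sse_by_k[k - 1] - sse_by_k[k]
        for k in sorted(sse_by_k)
        if k - 1 in sse_by_k
    }


def elbow_by_largest_drop(sse_by_k):
    """Pick the k at which the SSE decreases the most relative to k - 1."""
    drops = sse_drops(sse_by_k)
    return max(drops, key=drops.get)


# Hypothetical SSE values for k = 2..9 (NOT the study's Table 1).
example_sse = {2: 900.0, 3: 520.0, 4: 450.0, 5: 400.0,
               6: 365.0, 7: 340.0, 8: 320.0, 9: 305.0}
chosen_k = elbow_by_largest_drop(example_sse)
print(chosen_k)
```

The largest-drop rule is only one way to read an elbow chart; visual inspection of the plotted curve, as done in this study, remains the usual check.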
In the K-Means algorithm process, the first step is to set the number of clusters to 3 for the 353 heads of family. After the number of clusters is determined, a centroid value is determined for each cluster and each variable. In the initial calculation, the centroid values are assigned randomly. In subsequent iterations (from the 1st iteration until convergence or the maximum iteration), the centroid values are obtained by calculating the average of the data in each cluster. If the old centroid values differ from the new centroid values, the iteration continues until the values are the same or the previously set maximum number of iterations (e.g. 50) is reached. For example, if the second centroid is the same as the first centroid, the grouping process stops.
The centroid values in Table 2 are used to calculate the distance between the data and the centroids, using the Euclidean distance equation. After the second centroid values are obtained, they are compared with the first centroid values. If the two sets of values differ, the distance calculation for each data point continues using the second centroid values. Because the calculation results show a difference between the second and first centroid values, the distance calculation continues with the second centroid values. Then the third step, determining the distance between the data and the cluster centers, is repeated. In this study, there was no difference between the sixth and seventh centroid values, so it can be concluded that the K-Means process was completed in the 6th iteration, which gives the final (6th iteration) centroids.
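The iteration described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the `seed` parameter, the choice of initial centroids from among the data points, and the handling of empty clusters are all assumptions, while `max_iter=50` follows the example value mentioned in the text. The toy data are made up.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(rows):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def k_means(data, k, max_iter=50, seed=0):
    """Return (labels, centroids, iterations_used) for a list of numeric vectors."""
    rng = random.Random(seed)
    # Initial centroids: k distinct data points chosen at random (an assumption).
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = [0] * len(data)
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in data
        ]
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            new_centroids.append(mean_vector(members) if members else centroids[j])
        # Stop when the new centroids equal the old ones.
        if new_centroids == centroids:
            return labels, centroids, iteration
        centroids = new_centroids
    return labels, centroids, max_iter


# Made-up two-variable toy data, only to show the call.
toy = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [1, 8], [1, 9], [2, 9]]
labels, centroids, iterations = k_means(toy, k=3)
print(labels, iterations)
```

Because the initial centroids are random, different seeds can lead to different final clusters; running the sketch with several seeds and comparing the results is a common precaution.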

Decision Tree Process
A decision tree is a flowchart structure that resembles a tree and is used here to represent the data resulting from the K-Means clustering process. The decision tree process has several stages:
1) Calculate the entropy value, which is needed to determine the gain value of each variable. The entropy value also serves as a condition for branching from the root of the decision tree. The poverty status clusters contain 112 families (cluster 1), 130 families (cluster 2), and 111 families (cluster 3). The total entropy is calculated from these three cluster sizes.
3) Calculate the entropy and gain values for all variables to obtain the highest gain value, which is used as the root. Table 4 provides the gain values for all variables.
4) Based on the calculation, the analysis shows that there are 3 clusters, namely poor, medium, and rich, for determining the poverty status of the community in HST Regency. The decision tree shows that the significant variables in this research are the source of drinking water, cooking fuel, and floor area. From the root, rules are obtained that help categorize a KKT into 3 clusters, namely cluster 1 (Poor), cluster 2 (Medium), and cluster 3 (Rich):
a. Cluster 1: low-level drinking water sources such as rainwater, rivers, and earthen wells; low-level cooking fuels such as firewood and kerosene; and 3 household members.
b. Cluster 2: medium-level drinking water sources such as protected springs; cooking fuel other than low-level fuels such as firewood and kerosene.
c. Cluster 3: high-level drinking water sources such as protected springs and bottled drinking water; cooking fuel ranging from gas cylinders of more than 3 kg up to electricity; and a small number of household members (2).
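The total entropy for the three cluster sizes reported above (112, 130, and 111 families) can be checked with a short calculation. The sketch assumes the standard Shannon entropy with a base-2 logarithm, which the source does not state explicitly; the function name `entropy` is introduced only for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Cluster sizes for cluster 1, cluster 2, and cluster 3 as reported in the text.
cluster_sizes = [112, 130, 111]
total_entropy = entropy(cluster_sizes)
print(round(total_entropy, 3))
```

Because the three clusters are close to equal in size, the result lies just below the maximum possible entropy for three classes, log2(3).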

Conclusion
K-Means clustering is one method that can assist in determining poverty status, and it can therefore help the government reduce delays in the process of determining aid. The analysis formed 3 poverty statuses in Hulu Sungai Tengah Regency: the high poverty level consists of the members of cluster 1, the medium level of the members of cluster 2, and the low level of the members of cluster 3.