Integration K-Means Clustering Method and Elbow Method For Identification of The Best Customer Profile Cluster

Clustering is a data mining technique for analysing large and varied data sets. It groups data into clusters so that the members of each cluster are as similar as possible to one another and as different as possible from the members of other clusters. SMEs in Indonesia serve a wide variety of customers but have no mapping of them, so they do not know which customers are loyal and which are not. Customer mapping groups customer profiles to support SME analysis and policy on the production of goods, especially batik sales. This research combines the K-Means method with the elbow method to make K-Means more efficient and effective when processing large amounts of data. K-Means Clustering is a local optimization method that is sensitive to the choice of the initial cluster midpoints, so a poor choice of initial midpoints leads to high error and poor cluster results. K-Means also has difficulty determining the best number of clusters, so the elbow method is used to find it. The results show that the elbow method can produce the same number of clusters K for different amounts of data. The number of clusters determined by the elbow method then becomes the default for the characterization process in the case study. The best number of clusters was selected from the SSE values computed on data from 500 batik visitors. The sharpest decrease in SSE occurs at K = 3, so K = 3 is taken as the cut-off point and the best number of clusters.


Introduction
Clustering is part of daily life, which cannot be separated from the data that produce the information needed to meet everyday needs. One of the most important ways of handling data is to classify or group it into a set of categories or clusters [1]. One clustering technique is the K-Means algorithm, which works iteratively. The K-Means method is the simplest and most common clustering method. It can group large amounts of data with relatively fast and efficient computation [2]. However, K-Means has the disadvantage of depending on how the initial cluster centres are determined, and the solutions it produces are only locally optimal. The clustering process is expected to find similarity or closeness between data so that they can be grouped into several clusters whose members have a high level of similarity [3]. According to Celebi et al. (2013), the K-Means algorithm is also versatile: it is easy to modify at every stage of the process, such as the distance function and the iteration termination criterion. K-Means Clustering is a local optimization method that is sensitive to the choice of the initial cluster midpoints, so a poor choice of initial midpoints leaves the algorithm trapped in a local optimum [4]. The K-Means method chooses k patterns at random as the initial centroids. The number of iterations needed to reach the final centroids is affected by this random initial choice. This can be addressed by determining the initial centroids from the data rather than at random, which gives higher performance [5].

Consumer segmentation is based on consumer behaviour, which is measured through customer profiling. Customer profiling is built from information on criteria such as age, gender, and residential area.
Customer profiling can be processed with data mining by segmenting consumers according to their profiles; a clustering algorithm has been used in this way for bank customer segmentation [6]. Kaur et al. (2013) proposed improvements to the classic K-Means algorithm to produce more accurate clusters. The proposed algorithm is based on data separation and finds the initial centroids according to the data distribution. That study obtained better clusters in a short calculation time. The main objective of the research in [14] is the segmentation of bank customers to find the transactional relationship between customer and company and so provide a mutual solution. In other words, it aims to improve customer relationships and evaluate customer segments by predicting credentials for each group of customers, which allows a more appropriate type of transaction model to be offered to the customer [6][7]. Clustering results vary with changes in the clustering parameters. The K-Means method is a simple and fast clustering technique, but it leaves open the problem of determining the right number of clusters in the data set. Customer segmentation research has been proposed so that different regulations can be applied to different customers who pose a high risk to the bank [8]. Several ways have been proposed to determine k dynamically as the number of clusters formed, one of which is the elbow method [9].

This research combines K-Means with the elbow method to determine the actual number of clusters for customer profiling segmentation in SMEs. The value of k is increased at each run, and the error decreases, by a large amount at first. Plotting the results for all k values obtained shows an elbow. This research finds the best value of k by using the elbow method. The elbow method is easy to implement: the ideal k is read from the graph at the position of the elbow, together with an SSE (Sum of Square Error) value of less than 1. The best k then becomes the basis for clustering. The smaller the SSE value and the further the elbow graph falls, the better the cluster results.

Literature Review

Customer Segmentation
One way to learn about customer characteristics is customer segmentation. One way to manage the relationship between customers and a company is to treat each segment differently according to its customer characteristics. Customer segmentation is done with data mining to uncover the customer characteristics hidden in the data. The way to find a company's customer segments is clustering analysis. Clustering is the process of forming segments from a set of data by measuring the similarity between data items [10]. The purpose of segmentation is to customize products, services, and marketing messages for each segment. The segmentation process places each customer with the group of customers that shares similar characteristics. Customer segmentation is a preparatory step for classifying each customer according to a defined customer group. To be effective, customer segmentation based on market research and demography requires an understanding of the characteristics of all customers. Customer segmentation also identifies segments in customer behaviour. In addition, customer payment transaction data are used to gain insight into customer behaviour. Customers can be segmented into groups based on their income and expenses, which can be used to identify high-value customers and prioritize service [11].

The K-Means algorithm is a partitioning algorithm, since it is based on fixing the number of groups in advance and defining initial centroid values [12]. The K-Means algorithm requires the number of clusters k to be determined precisely, and because the initial cluster centres may change, the grouping of data may be unstable [13]. The output of K-Means depends on the centre values selected for clustering: in this algorithm, the initial values of the cluster centre points are the basis for determining the clusters. Assigning the initial cluster centroids at random affects clustering performance [14-16]. The K-Means Clustering algorithm is a clustering method that partitions a data set into K clusters. It is a distance-based clustering algorithm that divides data with numerical attributes into a number of clusters.

K-Means Clustering
1. Determine the number of clusters K and the maximum number of iterations.
2. Initialize the K cluster midpoints, then compute each feature of a cluster centroid with equation 1:

$$c_{lj} = \frac{1}{n_l} \sum_{i=1}^{n_l} x_{ij} \qquad (1)$$

where $n_l$ is the number of data points in group $l$. Equation 1 is applied to each of the p dimensions, from j = 1 to j = p.
3. Assign each observation to the nearest cluster. The Euclidean distance can be found using equation 2:

$$D(x_i, c_l) = \sqrt{\sum_{j=1}^{p} (x_{ij} - c_{lj})^2} \qquad (2)$$

4. Reallocate the data to each group based on a comparison of the distances between each data point and each group's centroid [9].
5. Recalculate the cluster midpoint positions. Here $a_{il}$ is the membership value of point $x_i$ in the group with centre $c_l$, $d$ is the shortest distance from the data point $x_i$ to the K groups after comparison, and $c_l$ is the centre of the $l$-th group. The objective function used by this method is based on the distance and on the membership value of the data in the group. According to MacQueen (1967), the objective function can be determined using the equation

$$J = \sum_{i=1}^{n} \sum_{l=1}^{k} a_{il} \, D(x_i, c_l)^2$$

where n is the amount of data, k is the number of groups, and $a_{il}$ is the membership value of the data point $x_i$ in the group $c_l$, which takes the value 0 or 1. If the data point is a member of the group, $a_{il} = 1$; if not, $a_{il} = 0$.
6. If the cluster midpoint positions have changed and the number of iterations is below the maximum number of iterations, return to step 3. If not, return the clustering result.
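The six steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the study: the function names `euclidean`, `assign`, and `kmeans`, the `seed` parameter, and the rule that an empty cluster keeps its previous midpoint are all assumptions made for the example.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension (step 3)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def assign(points, centroids):
    """Index of the nearest centroid for every point (steps 3 and 4)."""
    return [
        min(range(len(centroids)), key=lambda l: euclidean(p, centroids[l]))
        for p in points
    ]

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels, objective) for a list of numeric points."""
    rng = random.Random(seed)
    # Step 2: choose k distinct observations as the initial midpoints.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        # Step 5: each new midpoint is the per-dimension mean of its members.
        new_centroids = []
        for l in range(k):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # Assumption: an empty cluster keeps its previous midpoint.
                new_centroids.append(centroids[l])
        # Step 6: stop as soon as the midpoints no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = assign(points, centroids)
    # Objective: squared distance of every point to its own cluster centre.
    objective = sum(
        euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
    )
    return centroids, labels, objective
```

Because the initial midpoints are drawn at random, different values of `seed` can lead to different local optima, which is the sensitivity to initialization discussed in the introduction.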

Elbow Criterion
In the combination of the elbow method with K-Means, the value of K is illustrated by a graph relating the number of clusters to the error: as K increases the error decreases, and the curve then falls slowly until the result stabilizes. For example, if the error drops drastically from K = 2 to K = 3 and only slightly from K = 3 to K = 4, the curve forms an elbow at the point K = 3, and the ideal number of clusters is K = 3 [11]. Combined, the elbow and K-Means methods can determine the value of K that gives the best clustering.

$$SSE = \sum_{l=1}^{k} \sum_{x_i \in S_l} \| x_i - c_l \|^2$$

where k is the number of clusters formed, $S_l$ is the $l$-th cluster with centroid $c_l$, and $x_i$ is the data present in each cluster.

2. Determine the cluster centre points at random at the beginning. The initial centroids are chosen at random from the available objects, one for each of the k clusters; the next centroid of each cluster is then calculated as the mean of the objects assigned to it.
3. Calculate the distance of each object to each centroid using the Euclidean distance.
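The SSE criterion and the reading of the elbow can be illustrated with a short Python sketch. The helper names `sse` and `elbow_k` are hypothetical. Scoring the bend by the second difference of the SSE curve is an assumption made for this example, since the paper reads the elbow off a graph. The SSE values in `toy_curve` are invented for illustration and are not the measurements of this study.

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

def elbow_k(sse_by_k):
    """Return the k at which the SSE curve bends most sharply.

    Assumption: the bend at k is scored as
    (drop in SSE going into k) - (drop in SSE going out of k).
    The first and last k cannot be chosen because they lack a neighbour.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev] - sse_by_k[cur]) - (sse_by_k[cur] - sse_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = cur, bend
    return best_k

# Illustrative SSE values only (not taken from the paper's tables).
toy_curve = {2: 900.0, 3: 320.0, 4: 290.0, 5: 270.0, 6: 255.0}
print(elbow_k(toy_curve))  # prints 3
```

In this toy curve the SSE falls by 580 between K = 2 and K = 3 but by only 30 between K = 3 and K = 4, so the bend is sharpest at K = 3, the same pattern the elbow criterion describes.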

Results and Discussion
The K-Means clustering research was carried out on data collected during the last month for customer profiling, with customer criteria as the attributes. Each criterion has a predetermined category range, and the gender and activity criteria have sub-criteria. Customers filled out a questionnaire covering 6 criteria. The data contained in Table 1 are not processed directly, because the variables differ greatly in numerical range. Such differences in distance or magnitude can make the grouping process difficult. One solution is to reduce the differences in range between the variables by normalizing them with equation 5 [11].

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (5)$$

The value of each variable is normalized to the range 0 to 1. Each variable is normalized before the calculation so that no parameter dominates the centroid values when the distance between data is calculated [11]. The data used are data on visits to batik sales with the criteria: Profession, Income, Quality of Batik, and Gender Education. The results of the combined system test of the K-Means method with the elbow method are given in Table 3.

The test data were clustered with K-Means and the elbow method on 100 and 300 customer profiles. The K-Means Clustering process uses the elbow method to determine the value of k. The resulting clusters are labelled or named to help the company consider the characteristics of its customers. The performance tests used purchase data from 100 and 300 customers. The Sum of Square Error calculated for each number of clusters shows its greatest decrease at K = 3, as can be seen in Table 4, Figure 1 and Figure 2. This test evaluates the performance of each number of clusters within the range of values used by the elbow method. The graph shows the SSE value for the experimental numbers of clusters between 2 and 8. With 3 segments the SSE value is 313.29, which is not the lowest SSE value but is where the sharp decrease occurs.
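The normalization of equation 5, applied before the distance calculation, can be sketched as follows. The function name `min_max_normalize` is hypothetical, and mapping a constant column to 0 is an assumption made here to avoid division by zero; the paper does not discuss that case.

```python
def min_max_normalize(rows):
    """Scale every column of a list of numeric rows to the range 0 to 1."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            # Assumption: a column with a single repeated value maps to 0.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]
```

After this step every variable contributes on the same scale, so a variable with a large numerical range, such as income, does not dominate the Euclidean distance.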