A Multi-angle Improved Small Sample Clustering Algorithm

The random selection of initial cluster centres, the presence of outliers, and the differences between attributes all affect the clustering quality of k-means. This article first uses the elbow method to determine the number of categories, then uses a variance-radius method to select the seed cluster centres and improves the re-selection of cluster centres, and finally uses the entropy method to weight the differences between attributes. The results show that, when the number of categories remains the same and abnormal data are added, the multi-angle improved clustering algorithm is more accurate and stable for small-sample data with few dimensions and large differences between categories.


Introduction
Cluster analysis is a method of grouping samples according to the sample data itself. It originated in taxonomy [1], and the k-means method was first proposed by the American scholar MacQueen [2]. It is currently used in many fields. Its general function is to mine the deep information in the data, summarize the characteristics of each category, and support corresponding decisions. For example, in the business field, companies divide customers into categories by studying their consumption characteristics and develop a different marketing mix for each category.
K-means is a distance-based, non-hierarchical clustering algorithm [3]. Its main drawbacks are the following four points: ① the value of k must be determined in advance; ② the initial centres affect the final result; ③ ignoring the differences between attributes can lead to a local optimum; ④ outliers cause interference. To this end, scholars have proposed a variety of improvements. For example, the literature [4] proposed hierarchical clustering with k* (k > k*) seed initialization. The literature [5] uses genetic operators to improve clustering. Another study adopted the k-modes method, which replaces the mean with the mode. The literature [6] proposed the CURE method to filter outliers. Another study classifies data on Hadoop and offers a new STING-based model. However, these works each optimize only a single one of the four shortcomings of the k-means algorithm to a certain extent; they do not consider all aspects for small-sample data sets.
Given the four deficiencies of the k-means algorithm, this paper uses the elbow method to determine the number of categories k, solving the defect that the number of classes cannot be given in advance; optimizes the initial cluster centres to avoid falling into local minima; improves the re-selection of cluster centres to reduce the algorithm's sensitivity to outliers; and finally weights the attributes to correct the Euclidean distance between data objects, so that the algorithm achieves a more accurate and significant clustering effect.

Attribute weighting by the entropy method
Step 1 Construct the data matrix $X = (x_{ij})_{n \times m}$, where $n$ is the number of objects and $m$ is the number of object dimensions.
Step 2 The attribute units and magnitudes differ, so the data are standardized to eliminate the dimension. During standardization, it must be judged whether each index affects the classification positively or negatively.
Positive indicator: $x'_{ij} = \frac{x_{ij} - x_{\min}}{x_{\max} - x_{\min}}$; negative indicator: $x'_{ij} = \frac{x_{\max} - x_{ij}}{x_{\max} - x_{\min}}$ (1). In formula (1), $x_{\max}$ is the maximum value of an attribute and $x_{\min}$ is the minimum value of an attribute.
Step 3 The proportion of the j-th dimension attribute of the i-th object: $M_{ij} = x'_{ij} \big/ \sum_{i=1}^{n} x'_{ij}$ (2). In formula (2), $M_{ij}$ represents the proportion and $x'_{ij}$ represents the standardized attribute value.
Step 4 The entropy value of the j-th dimension attribute: $H_j = -\frac{1}{\ln n} \sum_{i=1}^{n} M_{ij} \ln M_{ij}$ (3). Step 5 The difference coefficient of the j-th dimension attribute: $q_j = 1 - H_j$. For a given $j$, the smaller $H_j$ and the larger $q_j$, the more critical the attribute.
Step 6 The weight of the j-th dimension attribute: $w_j = q_j \big/ \sum_{j=1}^{m} q_j$ (8). In formula (8), $0 \le w_j \le 1$ and $\sum_{j=1}^{m} w_j = 1$.
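As a concrete illustration, the entropy-weight calculation of Steps 1–6 can be sketched in Python (a minimal sketch: the function name is illustrative, and min-max standardization of positive indicators is assumed for Step 2):

```python
import numpy as np

def entropy_weights(X):
    """Entropy-weight method: compute one weight per column (attribute) of X.

    X: (n, m) array of raw attribute values (positive indicators assumed).
    Returns an array of m weights that sum to 1.
    """
    n, m = X.shape
    # Step 2: min-max standardization of positive indicators.
    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # Step 3: proportion M_ij of object i under attribute j.
    M = X_std / X_std.sum(axis=0)
    # Step 4: entropy H_j of attribute j; 0 * ln(0) is taken as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        logM = np.where(M > 0, np.log(M), 0.0)
    H = -(M * logM).sum(axis=0) / np.log(n)
    # Steps 5-6: difference coefficient and normalized weights.
    q = 1.0 - H
    return q / q.sum()
```

Note that a constant column (max equal to min) would cause a division by zero; real data should be checked for that case before weighting.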

Determination of the category number k
To address the defect that the number of categories of the k-means algorithm cannot be given in advance, the usual solution is the silhouette method, proposed by Peter J. Rousseeuw in 1987, which can evaluate different algorithms on the same data [9]. However, a larger average silhouette value computed by this method does not necessarily mean better clustering: the silhouette of an object compares the average distance b to objects in other clusters with the average distance a to objects in its own cluster, and the value can be inflated simply because b is much larger than a. In small-sample clustering there may be individual abnormal data that make b much larger than a, in which case it may not be reasonable to rely on the silhouette method to determine the value of k. Unlike the silhouette method, the elbow method determines k based on the sum of squared errors (SSE) of all objects. The core idea is that when k is less than the actual number of categories, increasing k significantly reduces the SSE; once k reaches the actual number of categories, further increases bring only small reductions, so the rate of decrease slows. The SSE curve then looks like an elbow, and the k value corresponding to the elbow is taken as the actual number of categories [10].
Input: a data set D of n data objects with m features and the maximum number of clusters to try [11].
Output: the sum of squared errors corresponding to each k value. The specific steps of the elbow method are:
Step 1 Compute the Euclidean distance between each object $a$ in the current cluster $C_i$ and the cluster centre $c_i$.
Step 2 Intra-cluster sum of squared errors: square and sum the Euclidean distances from every object in cluster $C_i$ to the cluster centre, $SSE_i = \sum_{a \in C_i} \lVert a - c_i \rVert^2$.
Step 3 Compute the total sum of squared errors of the current classification from the result of Step 2: for the given k, add the sums of squared errors of all clusters, $SSE = \sum_{i=1}^{k} SSE_i$.
Step 4 Compute the SSE of the data set for each k value in turn ($k = 1, 2, \ldots, k_{\max}$). Step 5 Determine the optimal k value: store the SSE values calculated in the loop and their corresponding k values in a two-dimensional array, traverse the array to draw the points in a rectangular coordinate system, and read off the k at the elbow.
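The steps above can be sketched as follows, assuming a basic Lloyd-style k-means produces the partition for each k (`sse_for_k` and `elbow_curve` are illustrative helper names, not the paper's):

```python
import numpy as np

def sse_for_k(X, k, n_iter=100, seed=0):
    """Sum of squared errors after a basic k-means (Lloyd's algorithm) run."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each object to its nearest centre (squared Euclidean distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centre as the mean of its cluster (keep it if empty).
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return ((X - centers[labels]) ** 2).sum()

def elbow_curve(X, k_max):
    """SSE for k = 1..k_max; the k where the decrease flattens is the elbow."""
    return [(k, sse_for_k(X, k)) for k in range(1, k_max + 1)]
```

Plotting the returned (k, SSE) pairs reproduces the elbow diagram described in Step 5.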

Identify high-quality cluster centers
The weighted Euclidean distance is $d_w(x, y) = \sqrt{\sum_{d=1}^{m} w_d (x_d - y_d)^2}$ (9). In formula (9), $w_d$ is the weight of the d-th dimension attribute. Taking the standard deviation as the measurement function, the weighted objective value of a category is $\sigma_j = \sqrt{\frac{1}{|T_j|} \sum_{x \in T_j} d_w(x, c_j)^2}$ (10). In formula (10), $\sigma_j$ is the entropy-weighted standard deviation of category $T_j$, $c_j$ is its centre, and $|T_j|$ is the number of objects contained in $T_j$.
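Formula (9) translates directly into code; this small helper is a sketch with an assumed function name:

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance of formula (9): sqrt(sum_d w_d (x_d - y_d)^2)."""
    x, y, w = map(np.asarray, (x, y, w))
    return float(np.sqrt((w * (x - y) ** 2).sum()))
```

With unit weights this reduces to the ordinary Euclidean distance; a zero weight removes an attribute from the comparison entirely.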
The initial cluster centre of k-means should lie at the centre of a category. Based on this, k samples are selected from different dense regions, far from one another. The process is: take the sample with the smallest variance as the initial centre of the first category and draw a circle around it; the sample with the smallest variance outside the circle becomes the initial centre of the next category, and so on until k initial centres are found [12].
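A sketch of this seeding rule, under two stated assumptions: "the sample with the smallest variance" is read as the sample whose distances to the other samples have the smallest variance, and the mean pairwise distance serves as the radius S (as in Step 3 of the algorithm description):

```python
import numpy as np

def initial_centers(X, k, w=None):
    """Variance-radius seeding sketch: pick the minimum-variance sample,
    exclude everything inside a circle of radius S around chosen centres,
    and repeat until k centres are found."""
    n = len(X)
    if w is None:
        w = np.ones(X.shape[1])
    # Pairwise weighted Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((w * diff ** 2).sum(axis=2))
    S = D.sum() / (n * (n - 1))        # mean pairwise distance as the radius
    score = D.var(axis=1)              # per-sample distance variance criterion
    available = np.ones(n, dtype=bool)
    centers = []
    for _ in range(k):
        if not available.any():        # fall back if the radius excludes all
            available = np.ones(n, dtype=bool)
        i = int(np.argmin(np.where(available, score, np.inf)))
        centers.append(i)
        available &= D[i] > S          # keep only samples outside the circle
    return X[centers]
```

Because each new centre must lie outside the circles of all previous centres, the k seeds end up in different dense regions rather than clumped together.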

Improve the reselection method of cluster centers
K-means takes the mean of the points in a cluster as the new centroid; if the cluster contains abnormal points, this centroid deviates seriously. To reduce the sensitivity to outliers, one actual object in each class is taken as a representative: the other objects are assigned to categories according to their similarity to the representative points, and the process is iterated until each representative point becomes the actual centre of its class, thereby completing the clustering. Compared with taking the mean of the objects in a cluster as the centre, the advantage is insensitivity to noisy data; the disadvantage is a longer running time, which makes the method suitable for clustering small and medium samples.
Assume that after initial clustering a cluster $T_j$ contains $n_1$ objects $x_1, x_2, \ldots, x_{n_1}$. The sum of the distances between the i-th object $x_i$ and the other objects in the cluster is $D_i = \sum_{l=1}^{n_1} d(x_i, x_l)$. The new cluster centre is defined as the object with the smallest total distance to the other objects, thus completing the determination of the cluster centre.
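This re-selection rule can be written compactly; `recenter_medoid` is an illustrative name for choosing the object with the smallest total distance (a medoid) instead of the mean:

```python
import numpy as np

def recenter_medoid(cluster):
    """Re-select the cluster centre as the object whose total Euclidean
    distance to the other objects in the cluster is smallest, so that an
    outlier cannot drag the centre away as it would drag the mean."""
    D = np.sqrt(((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(axis=2))
    return cluster[D.sum(axis=1).argmin()]
```

For a cluster containing one extreme outlier, the mean lands far from every member, while the medoid stays inside the dense part of the cluster.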

Algorithm description
The steps of the improved algorithm are described as follows. Input: data set A and data set A' with abnormal data added. Output: k optimal clusters.
Step 1 Determine a candidate range for the number of categories k based on experience, where N is the number of data objects: take k one by one in $[2, \lfloor\sqrt{N}\rfloor]$ and use the elbow method to determine the optimal k value.
Step 2 Standardize the data and calculate the weight of each object attribute with the entropy weight method.
Step 3 Determine k initial cluster centres. Find the sample $x_{a1}$ with the smallest variance as the first initial centre $C_1$; with the average distance S between samples of the data set as the radius, select the sample $x_{a2}$ with the smallest variance outside the circle as the second initial centre $C_2$, and find the k initial centres in turn.
Step 4 Scan all data and assign each object to the most similar category according to its similarity to the cluster centres.
Step 5 Choose an actual object in each category as its representative, assign the other objects to categories according to their similarity to the representative points, and iterate repeatedly until each representative point is the actual centre of its category, thus completing the clustering.
Step 6 Repeat Steps 4 and 5 until the cluster centres no longer change or the maximum number of iterations is reached; the loop then terminates.
Step 7 Calculate the standard deviation of each category; if abnormal results appear, perform the clustering again.
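Steps 3–6 can be sketched end to end. For brevity this illustration replaces the variance-radius seeding with a densest-point-plus-farthest-point rule, which is an assumption made here for compactness, while the nearest-centre assignment and representative-point (medoid) re-selection follow the steps above; the function name is illustrative:

```python
import numpy as np

def improved_kmeans(X, k, w=None, max_iter=100):
    """Sketch of the clustering loop: weighted Euclidean distances, simplified
    seeding (densest sample, then repeatedly the farthest sample), nearest-
    centre assignment, and medoid re-selection instead of means."""
    n, m = X.shape
    if w is None:
        w = np.ones(m)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((w * diff ** 2).sum(axis=2))   # weighted pairwise distances
    # Simplified seeding: start from the sample with the smallest total
    # distance, then add the sample farthest from all chosen centres.
    centers = [int(D.sum(axis=1).argmin())]
    for _ in range(k - 1):
        centers.append(int(D[:, centers].min(axis=1).argmax()))
    for _ in range(max_iter):
        labels = D[:, centers].argmin(axis=1)  # Step 4: nearest centre
        new_centers = []
        for j in range(k):                     # Step 5: medoid re-selection
            idx = np.where(labels == j)[0]
            sub = D[np.ix_(idx, idx)]
            new_centers.append(int(idx[sub.sum(axis=1).argmin()]))
        if new_centers == centers:             # Step 6: stop when stable
            break
        centers = new_centers
    return labels, X[centers]
```

Because every centre is always an actual data object, each centre belongs to its own cluster and no cluster can become empty during the iteration.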

Experiment and result analysis
To verify the effect of the algorithm, the experiments were run on an Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz with 16 GB RAM under Windows 7, with Matlab R2019a as the programming tool, on small-sample data sets from the UCI database. We can see from Figure 3 and Figure 5 that each attribute plays a different role; if the attribute weights are ignored, the results differ.

Clustering accuracy
Clustering accuracy, the quotient of the number of objects assigned to their predefined category and the total number of objects, is one indicator for evaluating the quality of clustering [13]. The accuracy is used to assess the clustering effect on the Iris and Wine data sets; see Table 1 and Table 2.
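Since cluster ids are arbitrary, accuracy as defined here can be computed by trying every mapping between cluster labels and predefined categories and keeping the best match (practical for small k; the function name is illustrative):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels):
    """Fraction of objects whose cluster matches the predefined category,
    maximized over all relabelings of the clusters."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    ids = sorted(set(pred_labels))
    cats = sorted(set(true_labels))
    best = 0
    for perm in permutations(cats, len(ids)):
        mapping = dict(zip(ids, perm))
        best = max(best, sum(mapping[p] == t
                             for p, t in zip(pred_labels, true_labels)))
    return best / len(true_labels)
```

The exhaustive search over label mappings grows factorially in k, which is acceptable for the three-class Iris and Wine data sets but would need a Hungarian-style assignment for large k.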