A comparative analysis of clustering algorithms

This article summarizes and evaluates the clustering performance of commonly used clustering algorithms on data sets with different density distributions. Three typical datasets were designed: a circle dataset, a dataset with clusters of different sizes, and a Gaussian mixture dataset. K-means, Gaussian mixture clustering, DBSCAN, and Agglomerative clustering were then applied to evaluate clustering performance on these datasets. The results show that DBSCAN is the most stable choice when the density distribution of a data set is unclear, and that Agglomerative clustering with the shortest-distance (single-linkage) criterion can help determine the type of data set. Moreover, it is not appropriate to analyze a Gaussian mixture dataset with a single clustering algorithm alone; combining multiple clustering algorithms after preprocessing the dataset is recommended.


Introduction
In the data science era, almost every field has automated its data to support the high-speed operation of daily life and work [1]. To explore and understand such data effectively, unsupervised machine learning analyzes the data knowing only the domain and the data themselves, without labelled training samples. These unlabelled samples contain the spatial distribution and structural features of unknown data, such as volume, density, shape, and direction [1]. As a form of unsupervised learning, clustering is a grouping method that extracts the "similarity" features of physical or abstract objects [2]. The similarity measure between these objects is usually a distance or density function. Overall, clustering algorithms have a solid theoretical foundation and have become an active research object in different research communities [3,4]. The most common algorithms are K-means, Gaussian mixture models [5,6], DBSCAN, and hierarchical clustering methods.
K-means is a partition-based method that minimizes the sum of squared Euclidean distances to the mean of each group as its objective function, pursuing homogeneity within clusters and qualitative differences between clusters [7]. The algorithm has been widely used, for example, to classify cloud-top pressure/optical-thickness joint histograms to obtain cloud systems [8]. K-means classification combined with IR information can also be applied to grid data, which facilitates continuous, pixel-by-pixel image analysis of weather and sub-weather processes [9]. The Gaussian mixture model is a model-based method. Using a Gaussian mixture model for cluster analysis is similar in goal to K-means: maximizing the similarity of observations in a shared space [2]. However, the key difference from K-means is that cluster assignment in a Gaussian mixture model is probabilistic rather than definite [10]. Observations with similar probabilities of belonging to the same component (or "cluster") are grouped together, but with a certain degree of uncertainty. Another difference is the explicit use of probabilities, which makes most mixture models parameterized; the Expectation-Maximization algorithm, an iterative procedure, is most commonly used to estimate these parameters and drives the clustering process [10].
As one of the best-known density-based methods, DBSCAN uses a scan radius (eps) and a minimum number of points (MinPts) to find neighbours and estimate the density around each data point. In application scenarios such as user communities linked by friendship in social networks, or ecosystems that appear as regions of homogeneous feature values in satellite images [11], density is a natural grouping criterion, so DBSCAN is widely used in many fields. In aviation, DBSCAN can analyze and identify unknown events in conventional flight data without prior knowledge of historical flights [12]. In biology, DBSCAN plays a role in detecting protein structure [13]. In the energy field, an early-warning method based on DBSCAN clustering can effectively predict oil spill accidents [14].
The hierarchical method builds an intuitive tree structure, which makes the algorithm more interpretable and persuasive for data partitioning [15]. Agglomerative clustering is a bottom-up hierarchical algorithm. Its principle is to start from an initial partition in which each object corresponds to its own group and then search for objects and clusters to merge [16]. In its simplest form, the algorithm merges the two most similar sets, and this process continues until the desired partition is reached. It is therefore a method often chosen for exploratory analysis. For example, by integrating diversified investment portfolios, trading stocks with an agglomerative clustering model has achieved a higher rate of return [17].
Since these four clustering methods are widely used in different fields, this article discusses their clustering performance on three data sets with different density distributions. It reports which types of data sets suit each clustering algorithm, covering the accuracy of the algorithms as well as the cohesion and separation of the different data sets, and provides a reference for future researchers choosing a clustering method in various fields.

Clustering
2.1.1. K-means
Given a sample set $D = \{x_1, x_2, \cdots, x_m\}$, the K-means algorithm divides it into clusters $C = \{C_1, C_2, \cdots, C_k\}$ so as to minimize the square error

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - \mu_i \|_2^2, \qquad (1)$$

where $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ is the mean vector of cluster $C_i$. Equation (1) shows how closely the samples in a cluster surround its mean vector; the similarity of the samples in a cluster increases as $E$ decreases. However, minimizing $E$ is not easy: finding the optimal solution requires examining all possible cluster divisions of the sample set $D$. Therefore, K-means approximates the solution through iterative optimization. Its algorithm flow is shown in figure 1. First, the mean vectors are initialized; then the current cluster division and the mean vectors are updated iteratively. When the cluster memberships no longer change, the current cluster division is returned.
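The iteration just described (assignment step, mean-update step, stop when memberships no longer change) can be sketched in a few lines of Python with NumPy. This is an illustrative toy implementation, not the paper's own code; the first-$k$ initialization is a simplifying assumption (k-means++ or random restarts are more common in practice), and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal K-means: alternate assignment and mean-update steps."""
    # Simple init: take the first k samples as initial mean vectors (toy choice).
    centers = X[:k].astype(float)
    for _ in range(n_iter):
        # Assignment step: each sample joins the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each cluster's mean vector (equation (1)'s mu_i).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # memberships stable -> stop
            break
        centers = new_centers
    return labels, centers
```

For two well-separated groups the loop converges in a couple of iterations, returning one label per sample and the final mean vectors.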
2.1.2. Gaussian mixture clustering
Unlike K-means, which clusters with mean vectors, the GMM uses a probability model to construct the cluster structure. First, define the (multivariate) Gaussian distribution. For a random vector $x$ in the $n$-dimensional sample space $\mathcal{X}$, if $x$ follows a Gaussian distribution, its probability density function is

$$p(x) = \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} \exp\!\left( -\frac{1}{2} (x-\mu)^{\mathrm{T}} \Sigma^{-1} (x-\mu) \right), \qquad (2)$$

where $\mu$ is the $n$-dimensional mean vector and $\Sigma$ is the $n \times n$ covariance matrix. According to (2), the mean vector $\mu$ and the covariance matrix $\Sigma$ completely determine the Gaussian distribution. From this, define the Gaussian mixture distribution

$$p_{\mathcal{M}}(x) = \sum_{i=1}^{k} \alpha_i \, p(x \mid \mu_i, \Sigma_i). \qquad (3)$$

This distribution consists of $k$ mixture components, each corresponding to a Gaussian distribution, where $\mu_i$ and $\Sigma_i$ are the parameters of the $i$-th Gaussian mixture component, $p(x \mid \mu_i, \Sigma_i)$ is its probability density function, and $\alpha_i > 0$ is the corresponding mixing coefficient with $\sum_{i=1}^{k} \alpha_i = 1$. Assume the samples are generated by the Gaussian mixture distribution as follows: first, select a Gaussian mixture component according to the prior distribution defined by $\alpha_1, \alpha_2, \cdots, \alpha_k$, where $\alpha_i$ is the probability of selecting the $i$-th component; then sample from the probability density function of the selected component to generate the corresponding sample.

If the above process generates the training set $D = \{x_1, x_2, \cdots, x_m\}$, let the random variable $z_j \in \{1, 2, \cdots, k\}$ denote the (unknown) Gaussian mixture component that generated sample $x_j$; its prior probability is $P(z_j = i) = \alpha_i$ for $i = 1, 2, \cdots, k$. According to Bayes' theorem, the posterior distribution of $z_j$ is

$$\gamma_{ji} = p_{\mathcal{M}}(z_j = i \mid x_j) = \frac{\alpha_i \, p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)}. \qquad (4)$$

The model parameters $\{(\alpha_i, \mu_i, \Sigma_i) \mid 1 \le i \le k\}$ appearing in (3) are solved with the Expectation-Maximization method by maximizing the log-likelihood

$$LL(D) = \ln \prod_{j=1}^{m} p_{\mathcal{M}}(x_j) \qquad (5)$$

$$= \sum_{j=1}^{m} \ln \left( \sum_{i=1}^{k} \alpha_i \, p(x_j \mid \mu_i, \Sigma_i) \right). \qquad (6)$$

The iterative optimization proceeds as follows. If the parameters $\{(\alpha_i, \mu_i, \Sigma_i) \mid 1 \le i \le k\}$ maximize (6), then $\frac{\partial LL(D)}{\partial \mu_i} = 0$ gives

$$\sum_{j=1}^{m} \frac{\alpha_i \, p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)} \, (x_j - \mu_i) = 0. \qquad (7)$$

From (4) and (7),

$$\mu_i = \frac{\sum_{j=1}^{m} \gamma_{ji} \, x_j}{\sum_{j=1}^{m} \gamma_{ji}}, \qquad (8)$$

so the mean of each mixture component can be estimated by a weighted average of the samples, where each sample is weighted by the posterior probability that it belongs to that component. Similarly, from $\frac{\partial LL(D)}{\partial \Sigma_i} = 0$ we get

$$\Sigma_i = \frac{\sum_{j=1}^{m} \gamma_{ji} \, (x_j - \mu_i)(x_j - \mu_i)^{\mathrm{T}}}{\sum_{j=1}^{m} \gamma_{ji}}. \qquad (9)$$

For the mixing coefficients $\alpha_i$, in addition to maximizing $LL(D)$, the constraints $\alpha_i \ge 0$ and $\sum_{i=1}^{k} \alpha_i = 1$ must be satisfied. Consider the Lagrangian form of $LL(D)$,

$$LL(D) + \lambda \left( \sum_{i=1}^{k} \alpha_i - 1 \right), \qquad (10)$$

where $\lambda$ is the Lagrange multiplier. Setting the derivative of (10) with respect to $\alpha_i$ to zero gives

$$\sum_{j=1}^{m} \frac{p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)} + \lambda = 0. \qquad (11)$$

Multiplying both sides by $\alpha_i$ and summing over all mixture components yields $\lambda = -m$, and hence

$$\alpha_i = \frac{1}{m} \sum_{j=1}^{m} \gamma_{ji}. \qquad (12)$$

That is, the mixing coefficient of each Gaussian component is the average posterior probability of the samples belonging to that component.
The EM algorithm for the GMM follows from the above derivation: in each iteration, the posterior probability $\gamma_{ji}$ of each sample belonging to each Gaussian component is calculated from the current parameters (E-step), and then the model parameters $\{(\alpha_i, \mu_i, \Sigma_i) \mid 1 \le i \le k\}$ are updated according to (8), (9) and (12) (M-step).
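The E- and M-steps above can be sketched with NumPy. This is an illustrative toy implementation under simplifying assumptions, not the paper's code: the initialization is naive, a small ridge is added to each covariance for numerical stability, and no log-likelihood convergence check is performed.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate Gaussian density p(x | mu, Sigma) for each row of X, cf. (2)."""
    n = X.shape[1]
    diff = X - mu
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)) / norm

def gmm_em(X, k, n_iter=50):
    """EM for a Gaussian mixture: E-step posteriors, M-step updates (8), (9), (12)."""
    m, n = X.shape
    alpha = np.full(k, 1.0 / k)                                # mixing coefficients
    mu = X[np.linspace(0, m - 1, k, dtype=int)].astype(float)  # naive spread-out init
    Sigma = np.array([np.eye(n) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: posterior gamma[j, i] that sample j came from component i (Bayes).
        dens = np.stack([alpha[i] * gaussian_pdf(X, mu[i], Sigma[i])
                         for i in range(k)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted-average updates of mu, Sigma and alpha.
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            Sigma[i] = (gamma[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(n)
        alpha = Nk / m
    # Hard assignment: each sample goes to its most probable component.
    return gamma.argmax(axis=1), mu, alpha
```

On two well-separated Gaussian blobs the posteriors become nearly one-hot within a few iterations, and the mixing coefficients sum to one by construction.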

2.1.3. DBSCAN
Density clustering algorithms usually examine the connectivity between samples from the perspective of sample density and continuously expand clusters from connectable samples to obtain the final clustering result. DBSCAN characterizes the tightness of the sample distribution with a pair of neighbourhood parameters $(\epsilon, MinPts)$. If $x$ is a core object, the set of all samples density-reachable from it,

$$X = \{ x' \in D \mid x' \text{ is density-reachable from } x \}, \qquad (14)$$

is not difficult to prove to be a cluster satisfying both connectivity and maximality.
Therefore, the DBSCAN algorithm first selects a core object in the data set as a seed and then determines the corresponding cluster. The algorithm flow is as follows: first, find all core objects according to the given neighbourhood parameters $(\epsilon, MinPts)$; then take any core object as a starting point and collect all samples density-reachable from it to generate a cluster, until every core object has been visited.
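As an illustration of this flow, scikit-learn's DBSCAN can be run on two concentric circles, the kind of density-separable shape used later in this paper. The library choice and the dataset parameters (noise level, radius ratio, eps, MinPts values) are assumptions for this sketch, not settings stated by the paper.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles

# Two noisy concentric circles: a density-separable shape.
X, y_true = make_circles(n_samples=600, factor=0.5, noise=0.04, random_state=0)

# eps is the scan radius and min_samples the MinPts threshold for core objects.
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# Points labelled -1 are noise; the rest form density-connected clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

With these settings the points along each ring are mutually density-reachable while the two rings stay farther apart than eps, so two clusters emerge without specifying their number in advance.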

2.1.4. Agglomerative clustering
Hierarchical clustering attempts to divide the data set at different levels to form a tree-shaped cluster structure. Agglomerative clustering is a bottom-up hierarchical clustering algorithm: it first regards each sample in the data set as an initial cluster, then at each step finds the two closest clusters and merges them, until the preset number of clusters is reached. The distance between clusters $C_i$ and $C_j$ can be calculated as

$$d_{\min}(C_i, C_j) = \min_{x \in C_i,\, z \in C_j} \operatorname{dist}(x, z), \qquad (15)$$

$$d_{\max}(C_i, C_j) = \max_{x \in C_i,\, z \in C_j} \operatorname{dist}(x, z), \qquad (16)$$

$$d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{z \in C_j} \operatorname{dist}(x, z). \qquad (17)$$

The minimum distance is determined by the closest samples of the two clusters, the maximum distance by the farthest samples, and the average distance by all pairs of samples across the two clusters. When the cluster distance is computed with $d_{\min}$, $d_{\max}$, or $d_{\mathrm{avg}}$, the algorithm is called single-linkage, complete-linkage, or average-linkage, respectively.

2.2.1. Adjusted Rand index
Given the true partition and the predicted clustering, count the sample pairs as follows: $a$ is the number of pairs that belong to the same category in both the true partition and the predicted clustering; $b$ is the number of pairs that belong to the same category in the true partition but to different categories in the predicted clustering; $c$ is the number of pairs that belong to different categories in the true partition but to the same category in the predicted clustering; and $d$ is the number of pairs that belong to different categories in both. The Rand index is then

$$RI = \frac{a + d}{a + b + c + d}, \qquad (18)$$

and the adjusted Rand index is

$$ARI = \frac{RI - \mathbb{E}[RI]}{\max(RI) - \mathbb{E}[RI]}. \qquad (19)$$

Its value range is $[-1, 1]$: a value of 1 indicates the best clustering effect, and a value of $-1$ indicates the worst.
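As a small worked example of these pair counts (using scikit-learn's `rand_score` and `adjusted_rand_score`; the library is an assumption, since the paper does not name its tooling):

```python
from sklearn.metrics import adjusted_rand_score, rand_score

y_true = [0, 0, 0, 1, 1, 1]   # ground-truth partition
y_pred = [0, 0, 1, 1, 1, 1]   # predicted clustering: sample 2 misassigned

# Counting the C(6,2) = 15 pairs by hand: a = 4, b = 2, c = 3, d = 6,
# so RI = (a + d) / 15 = 10/15, and the chance-corrected ARI is lower.
ri = rand_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
```

Note that a single misassigned sample already moves the ARI well below the RI, which is exactly the chance correction at work.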

2.2.2. V-measure
V-measure is the harmonic mean of the homogeneity and completeness of a clustering; it measures the degree of uncertainty remaining about one partition once the other is known. Its value range is $[0, 1]$: a value of 1 indicates the best clustering effect, and a value of 0 indicates the worst. The formula for homogeneity is

$$h = 1 - \frac{H(C \mid K)}{H(C)}, \qquad (20)$$

where $H(C \mid K)$ is the conditional entropy of the class partition $C$ given the cluster assignment $K$, and $H(C)$ is the entropy of the class partition. The formula for completeness is

$$c = 1 - \frac{H(K \mid C)}{H(K)}, \qquad (21)$$

and the formula for V-measure is

$$V = \frac{2hc}{h + c}. \qquad (22)$$
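A small example of how homogeneity and completeness trade off (scikit-learn assumed, as elsewhere): putting every sample in its own cluster is perfectly homogeneous but incomplete, and the V-measure averages the two harmonically.

```python
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 2, 3]   # every sample in its own cluster

h = homogeneity_score(y_true, y_pred)   # 1.0: each cluster contains a single class
c = completeness_score(y_true, y_pred)  # 0.5: each class is split across clusters
v = v_measure_score(y_true, y_pred)     # harmonic mean: 2*1*0.5 / 1.5 = 2/3
```

This asymmetry is why both components are reported: a trivial over-segmentation scores perfectly on homogeneity alone.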

Circle dataset
The original data set consists of two concentric circles generated from 1500 samples, as shown in figure 2; the colours in the figure reflect the correct clustering results. A preliminary conclusion can be drawn from figure 3: the clustering effect of DBSCAN on the circle dataset is better than that of K-means and the Gaussian mixture model.
For hierarchical clustering, the effect varies with the choice of distance. Three distances are assessed here: the minimum distance (AC-single), the maximum distance (AC-complete), and the average distance (AC-average). Their clustering effects are shown in figure 4 (the results of AC-single, AC-complete, and AC-average on dataset 1). A preliminary conclusion can be drawn from figure 4: hierarchical clustering with the minimum distance performs better than with the maximum or average distance.
The evaluation results of the adjusted Rand index and V-measure show the effect of each clustering method more clearly. As shown in Table 1, the clustering results of DBSCAN and AC-single agree perfectly with the correct clustering. The remaining methods, from best to worst, are AC-complete, AC-average, GM, and K-means.
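The comparison in this section can be reproduced along the following lines. The dataset parameters (radius ratio, noise level) and the DBSCAN settings are assumptions for illustration, since the paper specifies only the sample count.

```python
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric circles from 1500 samples, as in this section.
X, y = make_circles(n_samples=1500, factor=0.5, noise=0.04, random_state=0)

models = {
    "K-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.15, min_samples=5),
    "AC-single": AgglomerativeClustering(n_clusters=2, linkage="single"),
}
# Adjusted Rand index of each model against the true two-ring labelling.
scores = {name: adjusted_rand_score(y, m.fit_predict(X)) for name, m in models.items()}
```

Consistent with Table 1, the density-based and single-linkage methods follow the rings, while K-means cuts the plane into two half-discs and scores near zero.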

Different sized dataset
The original data set consists of three clusters of different sizes generated from 1500 samples, as shown in figure 5; the colours in the figure reflect the correct clustering results. For hierarchical clustering, the effects of the minimum distance (AC-single), the maximum distance (AC-complete), and the average distance (AC-average) are shown in figure 7 (the results of AC-single, AC-complete, and AC-average on dataset 2). A preliminary conclusion from figure 7 is that the clustering effect of hierarchical clustering on this data set is not ideal.
The six clustering models were evaluated with the adjusted Rand index and V-measure. As shown in Table 2, the accuracy of the clustering results on the different sized dataset, from high to low, is: GM, DBSCAN, K-means, AC-average, AC-complete, AC-single.
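A sketch of this experiment, showing why the GMM ranks first here: fitting one covariance per component handles clusters with unequal spreads. The centre positions and standard deviations are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Three clusters of different sizes (spreads) from 1500 samples, as in this section.
X, y = make_blobs(n_samples=1500, centers=[[-6.0, 0.0], [0.0, 0.0], [9.0, 0.0]],
                  cluster_std=[1.0, 0.5, 2.0], random_state=0)

# The GMM fits one covariance per component, so unequal spreads are modelled well.
labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
ari = adjusted_rand_score(y, labels)
```

With well-separated centres the per-component covariances recover the three different spreads and the ARI stays close to 1.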

Gaussian mixture dataset
The original data set consists of ten Gaussian clusters generated from 1500 samples, with the centre of each cluster chosen at random; as shown in the figure below, the colours reflect the correct clustering result. It can be seen from Table 3 that the evaluation scores of all six clustering methods on the Gaussian mixture dataset are very low.
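The low scores are unsurprising: with randomly placed centres, some clusters overlap, which caps the achievable score for any algorithm. A sketch of the setup (the spread and the centre box are assumptions, since the paper states only the sample and cluster counts):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Ten Gaussian clusters with randomly placed centres, as in this section;
# random placement lets clusters overlap, so no method can score near 1.
X, y = make_blobs(n_samples=1500, centers=10, cluster_std=2.0,
                  center_box=(-10.0, 10.0), random_state=0)

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(y, labels)
```

The ARI lands strictly between 0 and 1: better than chance, but far from the near-perfect scores on the separable datasets above.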

Conclusion
By evaluating the effects of six clustering models on three different types of data sets (a circle dataset, a dataset with clusters of different sizes, and a Gaussian mixture dataset), we can draw the following conclusions:
• Since DBSCAN classifies small data sets well, researchers can give it priority when the structure of the data set is uncertain.
• Agglomerative clustering with the shortest distance (AC-single) has an excellent clustering effect only on circle datasets. When a data set is complex, AC-single can therefore be used to determine whether it is a circle dataset.
• It is not appropriate to analyze Gaussian mixture data sets with a single conventional clustering algorithm alone. It would help to combine multiple clustering algorithms, or to preprocess the data set before clustering.
These findings may serve as a reference for future research on basic clustering models with small-sample data sets, and the analysis may also help practitioners in other fields who need unsupervised learning to support their results.