K-means Algorithm Based on Flower Pollination Algorithm and Calinski-Harabasz Index

The Flower Pollination (FP) algorithm is prone to falling into local optima and has weak search ability, while the K-means algorithm is easily affected by the selection of the initial cluster centres. To address these problems, a K-means algorithm based on an improved FP algorithm is proposed. The improved FP algorithm is tested on six benchmark functions, and the effectiveness of the resulting K-means algorithm is verified on UCI machine learning and artificial datasets. The results show that the improved FP algorithm achieves better optimisation while maintaining fast convergence, and that, compared with other algorithms, the proposed algorithm performs significantly better in all aspects.


Introduction
Clustering is a commonly used unsupervised learning method [1,2], widely applied in intrusion detection, machine learning, image processing, data mining and other fields. Clustering divides the samples in a dataset into several categories so that samples in the same category are as similar as possible and samples in different categories are as different as possible. Learning from unlabelled training samples reveals the inherent properties and laws of the data, providing a basis for further analysis. As a classic centre-based clustering algorithm, K-means has a simple structure and fast convergence. However, it is easily affected by the selection of the initial cluster centres, which can cause it to fall into a local optimum and make the clustering results unstable [3,4]. The initial number of categories also directly affects the quality of the clustering results; in practice, the optimal number of clusters is difficult to predict, so the algorithm should obtain it automatically. Swarm intelligence optimisation algorithms are computational intelligence methods that simulate biological activity and collective behaviour in nature; commonly used examples include the particle swarm, ant colony, genetic and artificial bee colony algorithms. Lukasik et al.
[5] studied flower pollination in nature, from which a new swarm intelligence optimisation algorithm, the FP algorithm [6], was proposed. This algorithm has a simple structure, few control parameters and fast convergence. Swarm intelligence optimisation algorithms are often used in clustering because of their strong global optimisation ability: the literature [7,8] applies particle swarm optimisation to the clustering algorithm, [9-11] propose clustering algorithms based on the genetic algorithm, and [12,13] use the artificial bee colony algorithm to solve the clustering problem.

To address the FP algorithm's tendency to converge to a local optimum and its weak search ability, as well as the K-means algorithm's sensitivity to the selection of the initial cluster centres, a K-means algorithm based on an improved FP algorithm is proposed. The FP algorithm is first improved to enhance its performance. The improved FP algorithm is then used to optimise the cluster centres of the K-means algorithm, reducing the influence of the initial cluster centres, avoiding local optima and improving the algorithm's stability. A comprehensive clustering validity evaluation function automatically obtains the optimal number of clusters.

Clustering Overview
A clustering problem is usually given a sample set $D = \{x_1, x_2, \ldots, x_m\}$ of $m$ samples, to be divided into $k$ different clusters $C = \{C_1, C_2, \ldots, C_k\}$. The clustering objective function is the mean square error:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \qquad (1)$$

In Equation (1), $\mu_i$ is the centre of cluster $C_i$; the smaller the value of $E$, the better the clustering effect.
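Equation (1) can be evaluated directly from a labelled partition. The following is a minimal sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def clustering_sse(X, labels, centers):
    """Mean-square-error objective E of Equation (1): the sum of
    squared Euclidean distances from each sample to the centre of
    the cluster it is assigned to."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    return float(sum(
        np.sum((X[labels == i] - c) ** 2)
        for i, c in enumerate(np.asarray(centers, dtype=float))
    ))
```

A smaller return value indicates a tighter clustering, matching the criterion stated after Equation (1).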

K-Means algorithm
The main process of the K-means algorithm is as follows: (1) randomly select $k$ samples from the sample set $D$ as the initial cluster centres; (2) calculate the distance $d_{ij} = \lVert x_i - \mu_j \rVert_2$ between each sample and each cluster centre, and assign each sample to the cluster of its nearest centre; (3) recompute and update the cluster centres, then return to (2) until the centres no longer change.
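Steps (1)-(3) above can be sketched as follows; this is a plain reference implementation, not the paper's improved version (names and the convergence check are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means following steps (1)-(3) above."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # (1) randomly pick k distinct samples as initial centres
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # (2) assign each sample to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # (3) recompute the centres; stop when they no longer move
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

The random initial choice in step (1) is exactly the source of the instability discussed above: different seeds can converge to different local optima.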

Clustering Validity Evaluation
To minimise the similarity between clusters and maximise the similarity within each cluster, this paper uses the natural properties of the distribution of clustering results to evaluate the separation between clusters and the compactness within clusters, and constructs a comprehensive clustering validity evaluation function. This function is used to analyse the clustering effect under different numbers of clusters and to obtain the optimal number of clusters [14].
The cluster density is related to the within-cluster variance. Given a sample set $C_i = \{x_1, x_2, \ldots, x_{n_i}\}$, the within-cluster variance is defined as

$$s_i^2 = \frac{1}{n_i} \sum_{x \in C_i} d(x, \bar{x}_i)^2 \qquad (3)$$

where $d(x, \bar{x}_i)$ is the Euclidean distance between $x$ and $\bar{x}_i$, and $\bar{x}_i$ is the mean of the sample set $C_i$. Based on the distribution of the clustering results, the clustering density is then defined from the within-cluster variance (Equation (4)): the smaller the within-cluster variance, the better the cluster density and the higher the identity of the data samples. If a single special sample is divided into a category of its own, the cluster density takes the value 0. The cluster proximity is defined as

$$\mathrm{Prox}(z_i, z_j) = \exp\!\left(-\frac{d(z_i, z_j)^2}{2\sigma^2}\right) \qquad (5)$$

In Equation (5), $d(z_i, z_j)$ is the Euclidean distance between cluster centres $z_i$ and $z_j$, and $\sigma$ is a Gaussian constant. The proximity is inversely related to the distance between clusters, so smaller proximity is better; if a single special sample is divided into a category of its own, the cluster proximity takes the value 0. The comprehensive clustering validity evaluation function (Equation (6)) is constructed by combining the above cluster density and cluster proximity.
In Equation (6), the value of the comprehensive validity evaluation function lies in $[0,1]$; the larger the value, the better the clustering effect.
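The two ingredients of the evaluation function can be sketched as follows. The exact combination in Equation (6) is not fully specified above, so only the within-cluster variance (Equation (3)) and the Gaussian proximity (Equation (5)) are shown; function names and the singleton-cluster handling follow the conventions stated in the text:

```python
import numpy as np

def within_cluster_variance(X, labels):
    """Equation (3), averaged over clusters: mean squared distance of
    each sample to its cluster mean. A singleton cluster contributes 0,
    as stated in the text."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    vars_ = []
    for i in np.unique(labels):
        pts = X[labels == i]
        vars_.append(0.0 if len(pts) < 2
                     else float(np.mean(np.sum((pts - pts.mean(axis=0)) ** 2,
                                               axis=1))))
    return float(np.mean(vars_))

def cluster_proximity(centers, sigma=1.0):
    """Equation (5): Gaussian similarity between cluster centres.
    Returns the largest pairwise proximity; smaller values mean
    better-separated clusters."""
    centers = np.asarray(centers, dtype=float)
    k = len(centers)
    prox = [np.exp(-np.sum((centers[i] - centers[j]) ** 2) / (2 * sigma ** 2))
            for i in range(k) for j in range(i + 1, k)]
    return float(max(prox)) if prox else 0.0
```

Coincident centres give proximity 1, while widely separated centres give proximity near 0, consistent with "the smaller the proximity, the better".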

Flower Pollination (FP) Algorithm
The FP algorithm is an iterative, population-based, nature-inspired optimisation technique for continuous optimisation problems. Solving such a problem is equivalent to finding $x^*$ satisfying

$$f(x^*) = \min_{x \in S} f(x)$$

where $S \subset \mathbb{R}^n$ and $f(x)$ is the cost function value of solution $x$. The task of the optimiser is therefore to find the argument minimising $f$. The FP algorithm is described in Figure 1.
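A minimal sketch of the basic (unimproved) FP algorithm follows, assuming the standard formulation: global pollination moves a flower toward the current best solution with a Lévy-flight step (drawn via Mantegna's algorithm), and local pollination mixes two random flowers; the switch probability `p` and all names are conventional choices, not taken from the paper:

```python
import numpy as np
from math import gamma, sin, pi

def levy(dim, beta=1.5, rng=None):
    """Levy-stable step lengths via Mantegna's algorithm."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    rng = rng or np.random.default_rng()
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def flower_pollination(f, bounds, n=20, iters=200, p=0.8, seed=0):
    """Basic FP algorithm: with probability p, global pollination via a
    Levy flight toward the best flower; otherwise local pollination
    between two random flowers. Greedy replacement keeps improvements."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    dim = len(lo)
    pop = rng.uniform(lo, hi, (n, dim))
    fit = np.array([f(x) for x in pop])
    b = int(fit.argmin())
    best, best_fit = pop[b].copy(), float(fit[b])
    for _ in range(iters):
        for i in range(n):
            if rng.random() < p:   # global pollination (Levy flight)
                cand = pop[i] + levy(dim, rng=rng) * (best - pop[i])
            else:                  # local pollination
                j, l = rng.choice(n, 2, replace=False)
                cand = pop[i] + rng.random() * (pop[j] - pop[l])
            cand = np.clip(cand, lo, hi)
            fc = f(cand)
            if fc < fit[i]:
                pop[i], fit[i] = cand, fc
                if fc < best_fit:
                    best, best_fit = cand.copy(), float(fc)
    return best, best_fit
```

Because acceptance is greedy and global steps contract toward the incumbent best, the population tends to converge quickly, which is also why the basic algorithm can stagnate in a local optimum, the weakness the paper's improvement targets.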

K-means Algorithm of Flower Pollination (FP) Algorithm
Following the basic clustering criterion of the K-means algorithm, the FP algorithm is used to optimise the cluster centres of K-means and reduce the impact of the initial cluster centres. Here, the mean square error $E$ of Equation (1) is chosen as the fitness function $f$. The FP algorithm is used to optimise the cluster centres mainly because it is a stochastic optimisation algorithm that is not affected by the initial values and combines global optimisation with local exploration, which helps avoid falling into a local optimum, so that cluster centres with a small mean square error can be found in the global search space. The improved FP algorithm brings the result as close to the optimum as possible. Moreover, since the data are partitioned by the minimum-Euclidean-distance criterion with respect to the cluster centres, small differences in the centres have little influence on the result, so the cluster centres obtained by optimising the K-means algorithm with the FP algorithm can be regarded as yielding the best clustering result. The optimal number of clusters for each case is then obtained automatically through the comprehensive clustering validity evaluation function.
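The fitness function described above can be sketched as follows: each candidate solution (flower) is a flat vector that is decoded into $k$ cluster centres, samples are assigned by minimum Euclidean distance, and Equation (1) is returned. The decoding scheme and names are illustrative assumptions:

```python
import numpy as np

def centers_fitness(flat, X, k):
    """Fitness for the FP search: decode a flat vector into k cluster
    centres, assign each sample to its nearest centre, and return the
    mean-square-error objective E of Equation (1)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(flat, dtype=float).reshape(k, -1)
    # distance of every sample to every centre, then nearest assignment
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float(np.sum(np.min(d, axis=1) ** 2))
```

Passing `lambda v: centers_fitness(v, X, k)` to an FP optimiser searches the space of centre configurations directly, which is what makes the result independent of any particular initial centre choice.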
The main flow of the algorithm is: (1) specify the range of the number of clusters and let the initial number of clusters be $k = 2$; (2) initialise the sample data according to the current number of clusters, then calculate the clustering objective function and record the current optimal solution; (3) carry out the three pollination moves of the FP algorithm in a loop; at the end of the iteration, obtain the optimal clustering result for this run and calculate the comprehensive validity evaluation function; (4) let $k = k + 1$; if $k < k_{\max}$, jump to (2), otherwise go to (5); (5) determine the optimal number of clusters from the comprehensive validity evaluation function and output the corresponding clustering result.

Special attention should be paid to the initialisation. Chaotic initialisation is performed on the data set, the range of the cluster number is specified as $[2, k_{\max}]$, and the search source is $Z = (z_1, z_2, \ldots, z_k)$, a division of the sample data set $D$, where $k_{\max} \le \sqrt{m}$ and the $z_i$ are the initial cluster centres. When calculating the objective function, first compute the distance between each sample and each cluster centre, $d_{ij} = \lVert x_i - z_j \rVert_2$, assign each sample to a cluster by the minimum-Euclidean-distance criterion, and then evaluate the clustering objective function according to Equation (1).

Identifying a partition that aligns with the structure of a dataset remains a formidable challenge even when the number of clusters $k$ is known. Quantitatively assessing how well a clustering algorithm captures the inherent groups within a dataset is commonly referred to as cluster validation [15]. When an accurate ground-truth solution is unavailable, internal validation techniques, which rely exclusively on the information contained in the partitioned data, are employed. One such technique is the Calinski-Harabasz index, computed as follows [15]:

Step 1: Calculate the between-group sum of squares (BGSS), the weighted sum of squared distances between the cluster centroids and the centroid of the whole dataset:

$$\mathrm{BGSS} = \sum_{k=1}^{K} n_k \, \lVert C_k - C \rVert^2$$

where $n_k$ is the number of observations in cluster $k$, $C_k$ is the centroid of cluster $k$, $C$ is the centroid of the dataset, and $K$ is the number of clusters.

Step 2: Calculate the within-group sum of squares (WGSS), the sum of squared distances between each observation and the centroid of its cluster:

$$\mathrm{WGSS} = \sum_{k=1}^{K} \sum_{i=1}^{n_k} \lVert x_i^{(k)} - C_k \rVert^2$$

where $x_i^{(k)}$ is the $i$-th observation in cluster $k$.

Step 3: The Calinski-Harabasz index is the ratio of the between-group to the within-group dispersion, each normalised by its degrees of freedom:

$$\mathrm{CH} = \frac{\mathrm{BGSS} / (K - 1)}{\mathrm{WGSS} / (N - K)}$$

where $N$ is the total number of observations; larger values indicate a better partition.
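The three steps of the Calinski-Harabasz index translate directly into code; this is a straightforward sketch of the standard formula (names illustrative):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index: (BGSS/(K-1)) / (WGSS/(N-K)).
    Larger values indicate a better partition."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    K = len(clusters)
    C = X.mean(axis=0)            # centroid of the whole dataset
    bgss = wgss = 0.0
    for c in clusters:
        pts = X[labels == c]
        ck = pts.mean(axis=0)     # centroid of cluster c
        # Step 1: weighted between-group sum of squares
        bgss += len(pts) * np.sum((ck - C) ** 2)
        # Step 2: within-group sum of squares
        wgss += np.sum((pts - ck) ** 2)
    # Step 3: ratio normalised by degrees of freedom
    return float((bgss / (K - 1)) / (wgss / (n - K)))
```

The index rewards partitions whose clusters are far from the global centroid (large BGSS) yet internally tight (small WGSS), which is why it can be swept over $k \in [2, k_{\max}]$ to pick the number of clusters.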
Compared with the original FP algorithm, the improved FP algorithm performs better on the four benchmark functions. Overall, the improved FP algorithm better avoids falling into local optima in the experiments, and its overall performance is the best.

Optimal number of clusters finding performance analysis
The experiments use artificial data and UCI machine learning data sets [18] to test the algorithm's ability to find the optimal number of clusters. The experimental data include the artificial data set S and the Iris, Wine, Glass and Synthetic data sets, as described in Table 2. When verifying clustering accuracy, the standard number of clusters is given first according to the data set information; when verifying the search for the optimal number of clusters, the algorithm is run 20 times and the number of runs that obtain the correct optimal number of clusters is recorded, as detailed in Table 3. Table 3 shows that the proposed algorithm achieves a good clustering accuracy rate on both the artificial data set and the real UCI data. In searching for the optimal number of clusters, the algorithm performs particularly well on the well-separated artificial data set S and the real data set Iris; the standard optimal number of clusters can also be found for the other data sets, although the success rate there still needs further improvement. Overall, the algorithm performs well and can be applied to various data sets.

Conclusion
This paper proposes a K-means algorithm based on the FP algorithm. The FP algorithm is improved by coordinating its global optimisation and local exploration abilities, which enhances its comprehensive optimisation capability. Experimental verification shows that the improved FP algorithm strengthens global optimisation while maintaining fast convergence, and performance comparison with other algorithms shows clear advantages in all aspects.
To address the shortcomings of the K-means algorithm, an improved K-means algorithm based on the FP algorithm is proposed, an adaptive mechanism for the optimal number of clusters is introduced, and experimental tests verify the algorithm's effectiveness.


Table 2.
Data Set Description