An improved K-Means clustering based on differential evolution

Randomly selecting the initial cluster centers in the K-means clustering algorithm can lead to the problem of local optima. This paper introduces an enhanced differential evolution algorithm for optimizing K-means clustering. Experimental results demonstrate that the improved K-means clustering based on the proposed differential evolution outperforms the comparison algorithms, performing particularly well on the Iris, Wine, and Glass datasets. Optimizing the initial cluster centers improves the clustering accuracy of the subsequent stage, so the proposed algorithm is superior.


Differential evolution
The differential evolution algorithm, proposed by Storn and Price, is a global search optimization technique. It employs real-number coding and is effective for solving problems in the real-number domain. The core principle of the algorithm is to randomly select two individuals from the population; their scaled difference vector perturbs a third randomly chosen reference vector, creating a mutation vector. This mutation vector is then combined with the target vector through crossover to produce a trial vector. Finally, the target vector competes with the trial vector, and the superior one is retained in the next generation. Iterating in this way for a certain number of generations gradually improves the quality of the population, so that it converges toward the optimal solution. The differential evolution algorithm has three main steps: mutation, crossover, and selection.
Mutation operator: Mutation involves altering certain genes on a chromosome. In the differential evolution algorithm, the mutation operator is denoted DE/x/y, where DE stands for differential evolution, x denotes how the reference (base) vector is selected, and y denotes the number of difference vectors used. Two selection methods, "rand" and "best", are commonly used: "rand" chooses a random individual as the base vector for the mutation operator, while "best" chooses the current optimal individual. A commonly used mutation strategy (DE/rand/1) is V_i = X_r1 + F * (X_r2 - X_r3), where V_i is the mutation vector, X_i is the target vector, r1, r2, r3 are distinct randomly selected indices, and F is the scaling factor.
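As a concrete illustration, the DE/rand/1 mutation strategy above can be sketched as follows (a minimal NumPy sketch; the population layout and default F value are illustrative, not taken from the paper):

```python
import numpy as np

def de_rand_1_mutation(pop, i, F=0.5, rng=None):
    """DE/rand/1 mutation: V_i = X_r1 + F * (X_r2 - X_r3), where
    r1, r2, r3 are distinct random indices, all different from the
    target index i, and F is the scaling factor."""
    rng = rng or np.random.default_rng()
    candidates = [j for j in range(len(pop)) if j != i]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
    return pop[r1] + F * (pop[r2] - pop[r3])
```

With a population stored as an (NP, D) array, calling `de_rand_1_mutation(pop, i)` produces the D-dimensional mutation vector for target index i.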
The crossover operator recombines the current (target) vector and the mutation vector within the population, producing a new trial vector. By comparing the fitness values of the target vector and the trial vector, the individual with the better fitness is chosen to proceed to the next generation. The commonly used crossover method is binomial crossover.
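Binomial crossover can be sketched as follows (a minimal NumPy sketch; the forced j_rand gene is the standard convention ensuring the trial vector differs from the target):

```python
import numpy as np

def binomial_crossover(target, mutant, CR=0.9, rng=None):
    """Binomial crossover: each gene of the trial vector is taken from
    the mutant with probability CR; one randomly chosen gene (j_rand)
    always comes from the mutant so the trial differs from the target."""
    rng = rng or np.random.default_rng()
    D = len(target)
    mask = rng.random(D) < CR
    mask[rng.integers(D)] = True  # guarantee at least one mutant gene
    return np.where(mask, mutant, target)
```

The default CR value here is illustrative; as noted below, the appropriate crossover probability is problem-dependent.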
Selection operation: A greedy selection strategy is generally used. For minimization, the trial vector U_i replaces the target vector X_i in the next generation only if f(U_i) <= f(X_i); otherwise X_i is retained. The differential evolution algorithm is primarily controlled by two parameters: the scaling factor F and the crossover probability CR. F governs the search step size, influencing the algorithm's ability to find the optimal solution. A larger F increases population diversity, enhances global search capability, and improves the chances of obtaining the optimal solution, but may reduce convergence speed. Conversely, a smaller F reduces population diversity and search space, decreasing the likelihood of finding the optimal solution; although it accelerates convergence, it may lead to local convergence. The appropriate value of the crossover probability CR is problem-dependent.
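The three steps above fit together into the classic DE/rand/1/bin loop, sketched below for minimization (a minimal sketch on a toy objective; population size, F, CR, and generation count are illustrative defaults, not the paper's settings):

```python
import numpy as np

def differential_evolution(f, bounds, NP=20, F=0.5, CR=0.9,
                           generations=200, seed=0):
    """Minimal DE/rand/1/bin minimizer illustrating mutation, binomial
    crossover, and greedy selection. `bounds` is a (lower, upper) pair
    of D-dimensional vectors."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    D = len(lo)
    pop = rng.uniform(lo, hi, size=(NP, D))
    fit = np.array([f(x) for x in pop])
    for _ in range(generations):
        for i in range(NP):
            # mutation: DE/rand/1
            r1, r2, r3 = rng.choice([j for j in range(NP) if j != i],
                                    size=3, replace=False)
            v = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), lo, hi)
            # binomial crossover with a guaranteed mutant gene
            mask = rng.random(D) < CR
            mask[rng.integers(D)] = True
            u = np.where(mask, v, pop[i])
            # greedy selection: keep the better of target and trial
            fu = f(u)
            if fu <= fit[i]:
                pop[i], fit[i] = u, fu
    best = np.argmin(fit)
    return pop[best], fit[best]
```

On a simple sphere function the loop converges quickly, which is the behavior the parameter discussion above trades off: larger F explores more, smaller F converges faster.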

K-means clustering algorithm
The K-means clustering algorithm was proposed by MacQueen in 1967 and is an unsupervised, partition-based learning algorithm. Generally speaking, K-means clustering should satisfy two conditions: (1) high similarity within classes; (2) high difference between classes. The basic idea of the algorithm is to divide a set of n individuals into k subsets, where k must be less than or equal to n. The basic flow of the K-means clustering algorithm is as follows. Input: the number of classes k, a data set of n individuals, the number of iterations t, and the iteration termination condition. Output: k classes that satisfy the termination condition.
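The flow described above can be sketched as follows (a minimal NumPy sketch using Euclidean distance and random initialization; the random choice of initial centers is exactly the weakness the rest of the paper addresses):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Basic K-means: pick k random samples as the initial centers,
    then alternate assignment and center-update steps until the
    centers stop moving or the iteration limit t is reached."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: each sample goes to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each center becomes its cluster mean
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

On well-separated data the loop converges in a few iterations; with unlucky initial centers it can instead settle into a local optimum, motivating the DE-based initialization.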
The K-means clustering algorithm is a classic partition-based clustering method. Many scholars have improved the traditional K-means clustering algorithm to enhance its performance and expand its range of application, and this work has achieved remarkable results. The main advantages of the algorithm are its simple structure and low time complexity. Its disadvantages are that the number of classes k must be predetermined, the randomly selected initial center points make it easy to fall into a local optimum, and isolated and noisy data affect the clustering result [7][8].

Improved differential evolution algorithm
Adaptive adjustment of the scaling factor allows the population diversity to be controlled according to the current state of the search. Scholars have proposed various scaling factors tailored to specific problems, enabling real-time adjustment of the convergence behavior of the differential evolution algorithm. A smaller scaling factor F generally improves the convergence speed but may lead to local optima; conversely, a larger F can help escape local optima but decreases convergence speed. In this paper, real-time adjustment of the scaling factor is therefore crucial. When the algorithm prematurely converges to a local optimum, or the fitness differences between individuals in the population are too small, the scaling factor F is increased; conversely, when the fitness differences are too large, F is reduced. The proposed algorithm adopts formula (4) for adaptive scaling-factor adjustment, where F_i is set to 0.4, f_avg,i is the average fitness value of the i-th generation, f_best,i is the best fitness value of the i-th generation, and f_i is the fitness value of the i-th-generation individual to be mutated. Adjusting F during the mutation of each individual reduces the possibility of the differential evolution algorithm falling into premature local convergence.
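Formula (4) itself is not reproduced here; the sketch below only illustrates the described behavior (F grows toward an upper bound when the fitness values in the population are close together, and stays near the base value 0.4 when they spread out). The specific rule, and the F_max bound, are illustrative assumptions of this sketch, not the paper's formula:

```python
def adaptive_F(f_i, f_avg, f_best, F0=0.4, F_max=0.9):
    """Hypothetical adaptive scaling factor (NOT formula (4)): when the
    individual's fitness is close to the population average, i.e. the
    fitness differences are small, raise F toward F_max to escape
    premature convergence; when it is far from the average, keep F
    near the base value F0."""
    spread = abs(f_avg - f_best)
    if spread == 0:                      # population has collapsed
        return F_max
    closeness = max(0.0, 1.0 - abs(f_i - f_avg) / spread)
    return F0 + (F_max - F0) * closeness
```

The rule is monotone in the sense the paragraph describes: small fitness differences give a large F, large differences give a small F.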
The crossover probability is a crucial factor influencing the convergence speed and the balance between global and local search capability. Different problems may require different crossover probabilities. In the differential evolution algorithm, the crossover operation determines the likelihood that a gene of a new individual is taken from the mutated individual. As with the scaling factor, adaptively adjusting the crossover probability improves the selection of high-quality new individuals, leading to better algorithm performance. This dynamic adjustment ensures that the algorithm adapts to the characteristics of different problems during optimization.
Therefore, in order to preserve the parameter settings that generate high-quality individuals and to adjust those that do not, the crossover probability is updated according to formula (5), where N(0.5, 0.5) denotes a random number drawn from a normal distribution with mean 0.5 and variance 0.5. The flow chart of the algorithm is shown in Figure 1.
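Formula (5) is not reproduced here; the sketch below is one hypothetical reading of the described rule: a CR value that produced an improving trial vector is kept, and one that did not is redrawn from N(0.5, 0.5). Both the keep/redraw criterion and the clipping to [0, 1] are assumptions of this sketch, not the paper's formula:

```python
import numpy as np

def regenerate_CR(success, CR_old, rng=None):
    """Hypothetical adaptive crossover probability (NOT formula (5)):
    keep CR_old when it generated a high-quality individual, otherwise
    redraw from N(0.5, 0.5) -- mean 0.5, variance 0.5 -- clipped to
    the valid probability range [0, 1]."""
    rng = rng or np.random.default_rng()
    if success:
        return CR_old                      # parameter setting preserved
    new = rng.normal(0.5, np.sqrt(0.5))    # variance 0.5 -> std sqrt(0.5)
    return float(np.clip(new, 0.0, 1.0))
```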

Test dataset
This paper employs the well-known UCI database datasets Iris, Wine, and Glass for experimentation.
The Iris dataset consists of 150 samples representing three flower types: Setosa, Versicolor, and Virginica. Each category comprises 50 samples, and each sample has four attributes: sepal length, sepal width, petal length, and petal width. The specific classification is shown in Table 1.
The Wine dataset encompasses three distinct varieties of wine produced in a particular region of Italy. It comprises 178 samples divided into three categories. Each sample has 13 attributes: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. The specific classification is shown in Table 2.
The Glass dataset comprises 214 samples representing 6 categories. Each sample contains 9 attributes: Refractive index (Ri), Sodium (Na), Magnesium (Mg), Aluminum (Al), Silicon (Si), Potassium (K), Calcium (Ca), Barium (Ba), and Iron (Fe). The specific classification is shown in Table 3.
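For readers who want to reproduce the setup, the Iris and Wine datasets ship with scikit-learn (an assumption about the reader's environment; the paper does not specify tooling). Glass must be fetched from the UCI repository separately, so only its expected shape is noted in a comment:

```python
from sklearn.datasets import load_iris, load_wine

iris = load_iris()
wine = load_wine()
print(iris.data.shape)   # (150, 4): 150 samples, 4 attributes, 3 classes
print(wine.data.shape)   # (178, 13): 178 samples, 13 attributes, 3 classes
# Glass (from the UCI repository): 214 samples, 9 attributes, 6 classes
```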

Experimental results and analysis
In this experiment, the traditional K-means algorithm, the PKM algorithm, the K-DE algorithm, and the algorithm proposed in this paper are compared. Tables 4, 5, and 6 show the clustering accuracy of these algorithms on the different test datasets. Table 4 shows that the proposed algorithm reaches a clustering accuracy of 94.52% on the Iris test dataset, approximately 6% higher than the traditional K-means clustering algorithm. Tables 5 and 6 further show that the proposed algorithm produces better clustering results than the traditional K-means, PKM, and K-DE algorithms. Overall, the algorithm's notable improvement in clustering accuracy is of significant importance.
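The paper does not state how its clustering accuracy is computed; a common scoring rule, assumed here, maps each cluster to the majority true label among its members and scores the fraction of correctly mapped samples:

```python
import numpy as np

def clustering_accuracy(true_labels, cluster_labels):
    """Assumed accuracy measure: assign each cluster the majority true
    label of its members, then return the fraction of samples whose
    assigned label matches the ground truth."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        correct += np.bincount(members).max()  # majority vote in cluster c
    return correct / len(true_labels)
```

This measure is invariant to how the clusters are numbered, which matters because K-means labels carry no intrinsic class meaning.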

Conclusions
This paper introduces an enhanced differential evolution algorithm to address the slow convergence caused by the random initialization of cluster centers in the K-means algorithm. Experimental results on three datasets demonstrate that the proposed algorithm resolves the cluster-center initialization problem and significantly improves classification accuracy.

Figure 1. The flow chart of the improved algorithm.

Table 2. Classification of the Wine dataset.

Table 4. Comparison of experimental results on the Iris dataset.

Table 5. Comparison of experimental results on the Wine dataset.

Table 6. Comparison of experimental results on the Glass dataset.