Improving K-means clustering based on firefly algorithm

Data clustering identifies groups of patterns in a dataset that are homogeneous in nature. The objective is to develop an automatic algorithm that can accurately partition an unlabeled dataset into groups. The K-means method is the most fundamental partitional clustering method. However, its performance depends heavily on the chosen number of clusters, K, and on finding good centroids for the clustering process. In this paper, an adaptive firefly optimization algorithm, a nature-inspired algorithm, is employed to improve K-means clustering. The experimental results of clustering two real datasets show that the proposed method outperforms the alternative methods considered.


Introduction
Data clustering is the process of grouping similar multi-dimensional data vectors into a number of clusters. Clustering algorithms have been applied to a wide range of problems, including exploratory data analysis, data mining [1], image segmentation [2] and mathematical programming [3]. Clustering techniques have been used successfully to address the scalability problem of machine learning and data mining algorithms: prior to and during training, the training data are clustered and samples from these clusters are selected for training, thereby reducing the computational complexity of the training process and even improving generalization performance [4,5]. Clustering algorithms can be grouped into two main classes, namely hierarchical and partitional. The k-means clustering method [6] is one of the most commonly used partitional methods. However, the results of k-means depend strongly on the initial solution, and the method easily falls into local optima. To overcome this problem, many researchers have turned to meta-heuristic algorithms for the clustering problem. Nikham et al. proposed an efficient hybrid evolutionary algorithm for the clustering problem that combines ACO and SA (simulated annealing, 1989 [7]) [8,9]. In 1991, A. Colorni et al. presented the ant colony optimization (ACO) algorithm, which is based on the behavior of ants seeking a path between their colony and a source of food. P.S. Shelokar and Y. Kao then solved the clustering problem using the ACO algorithm [10,11]. In 1995, J. Kennedy and R.C. Eberhart proposed the particle swarm optimization (PSO) algorithm, which simulates the movement of organisms in a bird flock or fish school [12]. This algorithm has also been adopted for the clustering problem by M. Omran and V.D. Merwe [13,14]. Kao

The principle of data clustering
In the clustering process, the given data set $D$ is divided into $k$ clusters, written here as $(C_1, C_2, \ldots, C_k)$. The main idea of clustering is to define $k$ centers, one for each cluster. These centers should be placed carefully, because different locations lead to different results. Therefore, the better choice is to place them as far away from each other as possible. In this paper, the Euclidean metric is used as the distance metric. For two points $x = (x_1, x_2, \ldots, x_m)$ and $y = (y_1, y_2, \ldots, y_m)$, it is given by $d(x, y) = \sqrt{\sum_{l=1}^{m} (x_l - y_l)^2}$.
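The role of the centers and the Euclidean metric can be illustrated with a minimal sketch of plain K-means. It is not the paper's implementation: the function names (`euclidean`, `assign`, `update_centers`, `kmeans`) and the stopping rule are assumptions made for illustration, and the initial centers are supplied by the caller, which is exactly the sensitivity the paper aims to address.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def assign(points, centers):
    """Return, for every point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
            for p in points]

def update_centers(points, labels, centers):
    """Move each center to the mean of its assigned points.

    A center with no assigned points is left where it is (an assumption;
    other strategies, such as re-seeding, are equally possible).
    """
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(list(old))
    return new_centers

def kmeans(points, centers, iterations=100):
    """Plain K-means started from the given initial centers."""
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        labels = assign(points, centers)
        updated = update_centers(points, labels, centers)
        if updated == centers:  # centers no longer move: converged
            break
        centers = updated
    return centers, assign(points, centers)
```

Running `kmeans` from different initial centers on the same data can give different final partitions, which is why a global search over the center coordinates is attractive.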

The proposed algorithm
Nature has inspired many meta-heuristic algorithms. Swarm intelligence is an important tool for solving complex problems in scientific research, and swarm intelligence algorithms have been widely studied and successfully applied to a variety of complex optimization problems. The firefly algorithm (FFA), developed by Yang [17], is a recent swarm intelligence method and a powerful optimization algorithm.
The firefly algorithm has been shown to perform well and to be effective on various optimization problems [18]. It is inspired by the social behavior of fireflies, in particular their flashing lights and the attractiveness of the flash: the flash is a signaling system used to attract other fireflies [19]. Mathematically, each firefly has its own light intensity, or brightness. The brightness is used to evaluate the quality of a firefly and is determined by the landscape of the optimization problem [20][21][22][23]. The brightness of firefly $i$ at its current position $x_i$ can be denoted by the objective function value as follows: $I_i = f(x_i)$.
The light intensity of a firefly is directly proportional to its brightness and is related to the objective value. When two fireflies are compared, the one with the lower light intensity is attracted toward the one with the higher light intensity. The light intensity at distance $r$ depends on the initial intensity $I_0$: $I(r) = I_0 e^{-\gamma r^2}$, where $\gamma$ controls the decrease of the light intensity, or brightness, and can be taken as a constant.
Each firefly has its own attractiveness, which indicates how strongly it attracts other members of the swarm. The attractiveness, $\beta$, is relative: it must be judged by the other fireflies and therefore varies with the distance $r_{ij}$ between fireflies $i$ and $j$. As mentioned earlier, the brightness decreases with the distance from the source and the light is also absorbed by the air; therefore the attractiveness is $\beta(r) = \beta_0 e^{-\gamma r^2}$, where $\beta(r)$ is the attractiveness of a firefly at distance $r$, and $\beta_0$ denotes the initial attractiveness at distance $r = 0$, which can be a constant. In implementations, $\beta_0$ is usually set to 1 for most problems.
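As a small illustration of how distance affects light intensity and attractiveness, the sketch below assumes the exponential-decay forms commonly used for the firefly algorithm, $I(r) = I_0 e^{-\gamma r^2}$ and $\beta(r) = \beta_0 e^{-\gamma r^2}$. The function names and default parameter values are assumptions chosen for illustration, not values taken from the paper.

```python
import math

def light_intensity(i0, gamma, r):
    """Light intensity seen at distance r from a source of intensity i0.

    Assumes the form I(r) = I0 * exp(-gamma * r^2).
    """
    return i0 * math.exp(-gamma * r ** 2)

def attractiveness(r, beta0=1.0, gamma=1.0):
    """Attractiveness of a firefly at distance r.

    Assumes the form beta(r) = beta0 * exp(-gamma * r^2); beta0 = 1 follows
    the usual choice mentioned in the text, gamma = 1 is an arbitrary default.
    """
    return beta0 * math.exp(-gamma * r ** 2)
```

At `r = 0` the attractiveness equals `beta0`, and it decays toward zero as the distance grows; a larger `gamma` makes each firefly visible only to its near neighbors.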
The fireflies try to move to the best position: a firefly with lower light intensity is attracted by a brighter one. The location is updated for each pair of fireflies $i$ and $j$. The quantity in Equation (7) decreases quickly, and the random movement of the firefly almost vanishes within a small number of iterations [25].
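The pairwise movement can be sketched as follows. Because the paper's own update equation and its adaptive modification are not available here, this sketch assumes the standard firefly move, $x_i \leftarrow x_i + \beta_0 e^{-\gamma r_{ij}^2}(x_j - x_i) + \alpha\,\epsilon$, with $\epsilon$ drawn uniformly from $[-0.5, 0.5]$ per coordinate. The names `move_firefly` and `firefly_step`, the randomization weight `alpha`, and all default values are hypothetical, and the problem is treated as minimization, so a lower objective value means a brighter firefly.

```python
import math
import random

def move_firefly(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.2, rng=random):
    """Move firefly x_i toward the brighter firefly x_j (standard rule, assumed)."""
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_j))  # squared distance r_ij^2
    beta = beta0 * math.exp(-gamma * r2)              # attractiveness at r_ij
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(x_i, x_j)]

def firefly_step(positions, objective, **params):
    """One sweep over all pairs: each firefly moves toward every brighter one.

    Brightness is taken from the objective values at the start of the sweep,
    a simplification made to keep the sketch short.
    """
    costs = [objective(p) for p in positions]
    new_positions = [list(p) for p in positions]
    for i in range(len(positions)):
        for j in range(len(positions)):
            if costs[j] < costs[i]:  # firefly j is brighter than firefly i
                new_positions[i] = move_firefly(new_positions[i], positions[j], **params)
    return new_positions
```

The `alpha` term is the random movement referred to above; how it is scheduled over the iterations is left to the caller, since the paper's adaptive rule is not reproduced here.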

The performance evaluation function of data clustering
To explain the evaluation process explicitly, suppose that the given data set $D$ is to be divided into $k$ subsets and that each individual data point in $D$ has dimension $m$. Because the coordinates of the centers of all $k$ subsets are optimized, the dimension of a solution is $k \times m$. Each individual in the population can therefore be described as a vector of length $k \times m$ that holds the coordinates of the $k$ centers.
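The encoding described above can be sketched as a flat vector that is split back into $k$ centers before evaluation. The paper's exact fitness function is not available here, so the sketch assumes a common choice: the sum of Euclidean distances from every data point to its nearest center, to be minimized. The names `decode` and `fitness` are hypothetical.

```python
import math

def decode(solution, k, m):
    """Split a flat solution vector of length k*m into k centers of dimension m."""
    if len(solution) != k * m:
        raise ValueError("solution must have length k * m")
    return [list(solution[c * m:(c + 1) * m]) for c in range(k)]

def fitness(solution, data, k, m):
    """Sum of distances from each point to its nearest center (assumed objective)."""
    centers = decode(solution, k, m)
    total = 0.0
    for point in data:
        total += min(
            math.sqrt(sum((a - b) ** 2 for a, b in zip(point, center)))
            for center in centers
        )
    return total
```

With this encoding, each firefly position is one candidate set of centers, and its brightness is obtained by calling `fitness` on that position.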

Real data results
To test the proposed algorithm, two real datasets are used. The comparison is conducted between the proposed algorithm (PFFA), the original FFA, and the standard K-means algorithm. The parameter configuration of the proposed method is as follows: the number of fireflies is 50 [26,27]. The comparisons for the Iris and Wisconsin breast cancer datasets are listed in Tables 1 and 2, respectively. Table 1 shows that the best value, worst value, mean value and standard deviation of PFFA are all better than those of the original FFA. The convergence curves of the two algorithms for the Iris dataset are shown in Figure 1. They show that PFFA converges faster and that its convergence curve is smoother.
For the Wisconsin breast cancer dataset, the comparison is listed in Table 2 and the convergence curves are shown in Figure 2. The best value, worst value, mean value and standard deviation of PFFA are all better than those of FFA. The convergence curves for the Wisconsin breast cancer dataset shown in Figure 3 clearly show that PFFA converges faster.
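The four summary statistics reported in the tables can be computed from the final objective values of repeated independent runs. The helper below is only a sketch: the name `summarize_runs` is hypothetical, it assumes a minimization objective (so the best value is the smallest), and it uses the sample standard deviation, which may differ from the convention used in the paper.

```python
import statistics

def summarize_runs(final_values):
    """Best, worst, mean and standard deviation over repeated runs.

    Assumes minimization and at least two runs (required by statistics.stdev).
    """
    return {
        "best": min(final_values),
        "worst": max(final_values),
        "mean": statistics.mean(final_values),
        "std": statistics.stdev(final_values),
    }
```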

Conclusion
This paper proposed an adaptive procedure that improves the firefly algorithm for data clustering. The PFFA algorithm computes the centroids used for clustering by minimizing the fitness function. The resulting clusters provide valuable information for the decision-making process. Performance analysis on the two datasets shows that PFFA outperforms the existing FFA and K-means by attaining a lower value of the objective function.