A Clustering Algorithm Based on the Maximum Entropy Principle

Aiming at the poor clustering performance of many traditional text clustering methods, a clustering algorithm based on the maximum entropy principle (CAMEP) is proposed. The algorithm uses the cosine similarity measure adopted in the traditional text clustering algorithm SP-Kmeans, and then introduces maximum entropy theory to construct a maximum entropy objective function suitable for text clustering; in this way, the maximum entropy principle is introduced into spherical K-means text clustering. The experimental results show that, compared with the DA-VMFS and SP-Kmeans algorithms, the performance of the CAMEP clustering algorithm on large-scale text clustering problems is greatly improved, and the algorithm has good overall performance.


Introduction
With the growth of the World Wide Web and various text resources, people's desire for rapid, accurate and comprehensive access to information keeps increasing. As an unsupervised technique, text clustering has received more and more attention and research. At present, text clustering has become a key technology of automatic text categorization [1].
In a suitable vector space model, a text can be represented, after appropriate preprocessing, as a high-dimensional vector that is sparse and non-negative [1]. In recent years, research has shown that text data also has a directional character [2]. This feature allows text vectors to be normalized before clustering, after which the cluster analysis is performed. The SP-Kmeans [3] algorithm uses the cosine similarity to measure the correlation of text vectors.
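As a small illustration (not from the paper): on unit-normalized vectors, the cosine measure reduces to a plain dot product, which is why normalization is done before clustering.

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([3.0, 4.0])
x_hat = x / np.linalg.norm(x)   # unit-normalized: [0.6, 0.8]
# for unit vectors, a @ b already equals their cosine similarity
```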
In recent years the maximum entropy principle has also been widely used in natural language processing [7] and text classification [8]. Surian [9] pointed out that in the text clustering algorithm movMF, which is based on a mixture of vMF densities, the entropy of the hidden variables changes with a self-annealing characteristic during clustering; Shi Zhong [10,11] used the deterministic annealing technique to improve the clustering performance of the movMF [14] algorithm. These results provide the basis for introducing the maximum entropy principle into traditional text clustering algorithms.

Maximum Entropy Clustering Algorithm and Spherical K-means Clustering Algorithm
Based on the degradation process of statistical physics, Yasuda proposed the deterministic annealing technique [4]. Following the annealing process, it turns the search for the optimal solution of an optimization problem into minimizing a series of free-energy functions of a physical system that change with temperature. Karayiannis [5] introduced deterministic annealing into clustering and proposed a maximum entropy clustering algorithm, whose essence is to use the deterministic annealing technique to find the minimum of the clustering objective function. There are various versions of the maximum entropy clustering algorithm [15]; although their descriptions differ, the differences are only formal. Only the maximum entropy clustering algorithm MEC of the literature [6] is introduced here.
For a data set X = {x_1, x_2, …, x_N}, let u_ij denote the probability that sample x_j belongs to the i-th cluster center, with u_ij ∈ [0,1] and Σ_{i=1}^{K} u_ij = 1 for every j. The maximum entropy fuzzy clustering algorithm MEC obtains the K cluster centers by minimizing the following objective function:

J(U,V) = Σ_{i=1}^{K} Σ_{j=1}^{N} u_ij ||x_j − v_i||^2 + T Σ_{i=1}^{K} Σ_{j=1}^{N} u_ij ln u_ij,   (2)

where T is a Lagrange multiplier. The above equation can also be expressed as J(U,V) = D(U,V) − T·H(U), where D(U,V) is the total distortion and H(U) = −Σ_ij u_ij ln u_ij is the entropy of the membership matrix. For a large T the algorithm mainly tries to maximize the entropy H(U), so the system is maintained at a high temperature; as T decreases, the emphasis shifts from the entropy to reducing the distortion, and when T tends to zero, minimizing J(U,V) directly yields a non-random (hard) partition. The Lagrange multiplier T here is thus equivalent to the temperature coefficient of the deterministic annealing technique and is also called the annealing coefficient.
The basic steps of the maximum entropy clustering algorithm MEC are as follows. (1) Initialization: give the initial cluster centers v_i (i = 1, …, K) and the initial annealing coefficient T. (2) Update the fuzzy partition matrix with u_ij = exp(−||x_j − v_i||^2 / T) / Σ_{k=1}^{K} exp(−||x_j − v_k||^2 / T). (3) Update the cluster centers with v_i = Σ_{j=1}^{N} u_ij x_j / Σ_{j=1}^{N} u_ij. (4) If the objective function has converged, stop; otherwise lower the annealing coefficient, T = T − ΔT, and return to (2). The MEC algorithm can avoid local minima and obtain the global minimum, and it has therefore been widely used. However, one defect of the MEC algorithm is that it uses the Euclidean metric; for high-dimensional text vectors, the direction feature is more important than the magnitude feature, so MEC is not suitable for clustering text data.
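The MEC iteration can be sketched as follows (an illustrative implementation, not the authors' code; the schedule parameters `dT` and `T_min` are assumptions):

```python
import numpy as np

def mec(X, K, V0=None, T=1.0, dT=0.5, T_min=0.1, max_inner=100, seed=0):
    """Maximum entropy clustering (MEC) sketch: Euclidean metric,
    memberships u_ij ∝ exp(-||x_j - v_i||^2 / T), with T lowered (annealed)."""
    rng = np.random.default_rng(seed)
    V = V0.astype(float).copy() if V0 is not None \
        else X[rng.choice(len(X), K, replace=False)].astype(float)
    while T > T_min:
        for _ in range(max_inner):
            d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
            logits = -d2 / T
            logits -= logits.max(axis=1, keepdims=True)          # numerical stability
            U = np.exp(logits)
            U /= U.sum(axis=1, keepdims=True)                    # fuzzy partition matrix
            V_new = (U.T @ X) / U.sum(axis=0)[:, None]           # center update
            converged = np.allclose(V_new, V, atol=1e-9)
            V = V_new
            if converged:
                break
        T -= dT                                                  # anneal the temperature
    return U, V
```

On well-separated data the memberships harden as T is lowered, recovering near-hard K-means centers.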

The Clustering Algorithm Based on Maximum Entropy Principle
In this paper the maximum entropy principle is introduced into spherical K-means clustering, and a clustering algorithm based on the maximum entropy principle, CAMEP, is derived for text clustering. Let u_ij denote the membership of sample x_j to the i-th center. Unlike the hard partition of spherical K-means, u_ij is a fuzzy value between 0 and 1 that truly reflects the relationship between a data point and a class center, and it satisfies Σ_{i=1}^{K} u_ij = 1. The global cost function to be maximized can then be written as

J(U,V) = Σ_{i=1}^{K} Σ_{j=1}^{N} u_ij x_j^T v_i.   (6)

To obtain the maximum of Eq. (6) while avoiding local optima, the maximum entropy principle is applied, and the minimized objective function is defined as

J(U,V) = −Σ_{i=1}^{K} Σ_{j=1}^{N} u_ij x_j^T v_i − (1/T) H(U).   (7)

Note that the form of Eq. (7) is similar to that of Eq. (2): both introduce an entropy term. The difference is that Eq. (2) uses the Euclidean metric while Eq. (7) uses the cosine similarity measure. Equation (7) can also be expressed as

J(U,V) = −Σ_{i=1}^{K} Σ_{j=1}^{N} u_ij x_j^T v_i + (1/T) Σ_{i=1}^{K} Σ_{j=1}^{N} u_ij ln u_ij,

where T is a Lagrange multiplier that can be chosen as needed and whose value has some influence on the final clustering result, and H(U) is the entropy of the membership matrix.

Minimizing this objective under the constraint Σ_i u_ij = 1 gives the membership update

u_ij = exp(T x_j^T v_i) / Σ_{k=1}^{K} exp(T x_j^T v_k).

Taking the partial derivative of the objective with respect to each center vector v_i under the constraint v_i^T v_i = 1 and setting it to zero gives the center update

v_i = Σ_{j=1}^{N} u_ij x_j / ||Σ_{j=1}^{N} u_ij x_j||.

Alternately applying these two updates to minimize the objective constitutes the clustering algorithm based on the maximum entropy principle (CAMEP). Note that here the Lagrange multiplier T is equivalent to an inverted annealing coefficient: when T is small the system is maintained at a high temperature, and the process of increasing T is the annealing process of the system; the minimum point of the objective function is obtained through a series of changes of the temperature T. A complete description of the CAMEP algorithm is given below.
Step 1: Give the initial cluster centers, the maximum number of iterations M, the initial annealing coefficient T, the maximum annealing coefficient MaxT, the threshold ε, and set the iteration counter r = 0. Step 2: Update the membership matrix U and the center vectors V with the update formulas derived above. Step 3: If the change of the objective function is smaller than ε, or r > M, or T > MaxT, stop; otherwise increase the annealing coefficient T, set r = r + 1, and return to Step 2.
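The steps above can be sketched as follows (an illustrative implementation, not the authors' code; the parameters `T_step` and `max_T` stand in for the annealing coefficient increment and MaxT):

```python
import numpy as np

def camep(X, K, V0=None, T=1.0, T_step=1.0, max_T=20.0, tol=1e-6,
          max_iter=100, seed=0):
    """CAMEP sketch: cosine similarity on unit-normalized text vectors,
    memberships u_ij ∝ exp(T * x_j·v_i), centers kept on the unit sphere,
    with T increased (annealed) over the iterations."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # normalize texts
    rng = np.random.default_rng(seed)
    V = V0.astype(float) if V0 is not None else X[rng.choice(len(X), K, replace=False)]
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    prev_obj = np.inf
    for r in range(max_iter):
        S = X @ V.T                                       # (N, K) cosine similarities
        logits = T * S
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        U = np.exp(logits)
        U /= U.sum(axis=1, keepdims=True)                 # membership update
        M = U.T @ X
        V = M / np.linalg.norm(M, axis=1, keepdims=True)  # center update on the sphere
        obj = -(U * S).sum() + (U * np.log(U + 1e-12)).sum() / T
        if T >= max_T and abs(prev_obj - obj) < tol:      # stop once annealed and stable
            break
        prev_obj = obj
        T = min(T + T_step, max_T)                        # anneal: increase T
    return U, V
```

As T grows the soft memberships harden, and the algorithm approaches a spherical K-means partition while having explored the objective landscape at high temperature first.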

Experimental Results and Analysis
First, the criteria used to evaluate the performance of text clustering are described; then the data sets and experimental setup are described; finally, the experimental results on each data set are given and analyzed.

Algorithm Performance Evaluation Criteria
The performance evaluation criteria for clustering algorithms fall into internal criteria, based on the objective function, and external criteria. The internal criterion used here is the average cosine similarity (ACS). If v_i is the normalized center of class K_i, the ACS is defined as

ACS = (1/N) Σ_{i=1}^{K} Σ_{x∈K_i} x^T v_i,   (16)

where N is the number of samples; the larger the ACS value, the tighter the data points lie around their center vectors. In addition, for text clustering experiments the true class of each text is often known, so the normalized mutual information (NMI) is generally used as the external evaluation criterion. Assuming that X is the random variable of the known text classes and Y is the random variable of the classes produced by clustering, the NMI is defined as

NMI(X,Y) = I(X,Y) / sqrt(H(X) H(Y)),

where I(X,Y) is the mutual information of the variables X and Y, and H(X) and H(Y) are their entropies. Because the number of clusters is often not known in advance, the NMI value is well suited to evaluating the performance of an algorithm under different numbers of clusters. The higher the NMI value, the more accurate the clustering result; an NMI value of 1 means the partitions agree exactly.
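A minimal sketch of the two criteria (illustrative helpers, not the paper's code; NMI is normalized here by the geometric mean of the entropies, a common convention):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label sequence
    n = len(labels)
    return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())

def nmi(y_true, y_pred):
    """Normalized mutual information: I(X;Y) / sqrt(H(X) * H(Y))."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    px, py = Counter(y_true), Counter(y_pred)
    mi = sum((c / n) * np.log(c * n / (px[a] * py[b]))
             for (a, b), c in joint.items())
    hx, hy = entropy(y_true), entropy(y_pred)
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

def acs(X, labels, V):
    """Average cosine similarity between each sample and its cluster center."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return float(np.mean((Xn * Vn[labels]).sum(axis=1)))
```

NMI is invariant to a permutation of the cluster labels, which is exactly what an external clustering criterion needs.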

Description of the Experimental Datasets
The experiment uses the 20-Newsgroups data set and some of the eight data sets from the CLUTO [12] text clustering toolbox. The number of samples in these data sets ranges from 690 to 19949; the smallest dimension is 8261 and the largest is 43586; the smallest actual number of clusters is 3 and the largest is 20. These characteristics show that the data sets reflect the typical properties of text data sets. The NG20 data are drawn evenly from 20 different newsgroups; the Bow toolkit [13] was used to prepare the 20-Newsgroups text as 19949 vector text documents. NG17-19 is a subset of NG20 with three actual classes; each class includes nearly 1000 political news texts, divided into three categories according to their characteristics. Previous clustering results on this data set show that it is difficult to cluster because of the overlap between classes. The other data sets come from the CLUTO toolbox [12] and have been preprocessed into vector text data. A detailed description of the data sets is shown in Table 1.
It should be noted that the balance in Table 1 (which also lists the total number of documents, the total number of terms, and the number of classes k) is the balance of the data, that is, the ratio of the number of texts in the class containing the fewest texts to the number of texts in the class containing the most texts; it reflects the balance between the classes in a data set. The NG20, NG17-19 and sports data sets used in the experiment are relatively balanced, i.e., the number of samples in each class is similar, while the balance of the other data sets is poor.
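The balance measure can be computed directly (a trivial illustrative helper, not from the paper):

```python
from collections import Counter

def balance(labels):
    """Ratio of the smallest class size to the largest; 1.0 means perfectly balanced."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())
```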

Experimental Results and Analysis
To test the proposed maximum entropy spherical K-means algorithm, the clustering results on the above data sets are compared both at the actual number of clusters and at different numbers of clusters. In the experiments each algorithm is run 20 times for each case, and the mean NMI of the clustering results is taken as the final evaluation value; the NMI means with their deviations and the average cosine similarity of the clustering results are reported. The experiments show that the maximum entropy spherical K-means algorithm achieves satisfactory results on these data sets. From Table 2 and Table 3 it can be seen that the SP-Kmeans algorithm has the lowest NMI value on every data set, and its clustering effect is clearly inferior to that of the other two clustering algorithms; the clustering effect of CAMEP is better than that of DA-SPKM. In addition, the NMI deviation of the CAMEP clustering algorithm is much smaller than that of the SP-Kmeans and DA-VMF algorithms in most cases, which means that maximum entropy spherical clustering overcomes the sensitivity to initialization. On the difficult data set NG17-19, the NMI value of CAMEP can reach 0.53, but the deviation of the NMI value is very large; the underlying reason needs further study. Table 4 and Table 5 show the average cosine similarity (ACS) of the clustering results of the different clustering algorithms; it can be seen from the tables that the ACS values of CAMEP and DA-SPKM are greater than that of SP-Kmeans.

Comparison of Clustering Results of Different Algorithms with Different Numbers of Clusters
A clustering algorithm often does not know the actual number of clusters in advance, so the clustering performance of each algorithm is also compared across different numbers of clusters. To ensure the accuracy of the experiment, each algorithm is run 20 times for each number of clusters, and the average NMI of the 20 runs is taken as the NMI value for that number of clusters. Figure 1 and Figure 2 compare the NMI values of the algorithms on some of the data sets under various numbers of clusters. It can be seen from the figures that, because CAMEP uses the maximum entropy strategy to avoid local minima, its clustering performance under different numbers of clusters is better than that of SP-Kmeans. Table 6 shows a comparison of clustering time on some of the data sets at the actual number of clusters; the clustering time of SP-Kmeans is much smaller than that of the other two algorithms. Based on the analysis of the above sections, different clustering algorithms are suitable for different data clustering tasks.

Conclusion
In this paper, the maximum entropy principle is applied to the objective function of the spherical K-means algorithm, and a clustering algorithm based on the maximum entropy principle is proposed. A large number of experiments show that the algorithm can effectively improve the clustering performance on text data sets and outperforms traditional clustering algorithms. In addition, some open problems were found: how to improve the clustering effect of CAMEP on difficult data sets, and how to further improve the clustering effect of CAMEP while reducing its clustering time. These questions are the goals of the next stage of research.