A community detection algorithm based on structural similarity

To further improve the efficiency and accuracy of community detection, a new algorithm named SSTCA (community detection algorithm based on structural similarity with threshold) is proposed. In this algorithm, the structural similarities are taken as the weights of edges, and a threshold k is introduced so that multiple edges whose weights are less than the threshold are removed at once, which improves the computational efficiency. The proposed algorithm was tested on Zachary's network, the Dolphins' social network and the Football dataset, and compared with the GN and SSNCA algorithms. The results show that the new algorithm is more accurate than the other algorithms on dense networks and that its running efficiency is clearly improved.


Introduction
In reality, many systems can be expressed as networks. Examples include the Internet, scientific collaboration networks, transportation networks, a variety of protein interaction networks and many others. Such networks are called "complex networks" because of their complex internal structure. A vertex in the network represents an entity and an edge represents a relationship; in an interpersonal social network, for example, people are represented as vertices that can be connected by edges. Complex networks have several basic statistical characteristics: the small-world property [16], the power-law degree distribution [2] and the community structure [4]. What many networks have in common is the community structure, also called community clustering, which means that the network is composed of a number of sets of vertices with the same or similar functions [17]. In short, the connections between vertices inside a community are dense, while the connections to vertices outside it are sparse. Analysing the community structure and understanding the functions of complex networks have important application prospects and practical significance for discovering the potential rules in complex networks and predicting their behaviour. For example, analysing the structural characteristics of a disease transmission network and the spread of the disease helps to find the source of infection and cut off the routes of transmission, and thus to curb the spread of the disease.
At present, many methods for community detection have been proposed from different perspectives. The first method for identifying the community structure, the GN algorithm [4], deletes the edge with the highest betweenness at every step. This algorithm has a high precision, but recalculating the edge betweenness is time consuming. Liu Dayou presented a community detection algorithm based on loop compactness (LTA), which reduces the time complexity effectively [7]. The problem of local solutions has been addressed from the perspective of discrete particle swarm optimization [1]. Ma et al. proposed the LED algorithm for detecting overlapping communities [14], which reduces the threshold sensitivity to some extent. In addition, there are many other algorithms, such as Kernighan-Lin [6] and the Spectrum Average method [3]. In this article, an improved algorithm, called the community detection algorithm based on Structural Similarity with threshold k (SSTCA), is proposed, which clearly improves the operational efficiency.
ISAMSE 2017, IOP Publishing, IOP Conf. Series: Materials Science and Engineering 231 (2017) 012069, doi:10.1088/1757-899X/231/1/012069

Network description
Networks are usually described as a graph G(V, E), where V is the set of vertices and E is the set of edges, with a symmetric adjacency matrix. The adjacency matrix expresses not only the relationships between the nodes in G, but also the relationships between nodes and edges. In this section, a method which converts the adjacency matrix to structural similarity is described. The symmetric adjacency matrix is defined as follows:
Definition 1 (Adjacency matrix).
$$A_{ij} = \begin{cases} 1, & (i, j) \in E \\ 0, & \text{otherwise} \end{cases}$$
From this definition, the degree $m_i$ of vertex i is $m_i = \sum_{j} A_{ij}$. Obviously, E is easily converted into the adjacency matrix.
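As an illustration of converting E into the adjacency matrix, the following is a minimal Python sketch. The function names `adjacency_matrix` and `degrees`, the 0-based vertex labelling and the toy graph are assumptions made for this example and do not come from the original experiments.

```python
def adjacency_matrix(n, edges):
    """Build the symmetric 0/1 adjacency matrix of an undirected graph.

    n     -- number of vertices, labelled 0 .. n-1 (labelling is an assumption)
    edges -- iterable of (i, j) pairs
    """
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected graph, so the matrix is symmetric
    return A


def degrees(A):
    """Degree of every vertex: the sum of the corresponding matrix row."""
    return [sum(row) for row in A]


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
m = degrees(A)
```

Here `m[i]` is the degree of vertex i, obtained as a row sum of the adjacency matrix.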

Structural similarity construction
The structural similarity is derived from the acquaintance model in sociology, which is used to measure the strength of similarity between two people: the more common neighbours two people have, the greater the possibility that they belong to the same community [9]. It is widely applied to community detection because it is fast to calculate: computing the structural similarity of two vertices does not require considering all the nodes of the network. The method of structural similarity construction based on the adjacency matrix is as follows:
Definition 2 (Vertex neighbours). Let $i \in V$. The vertex neighbours $\Gamma(i)$ consist of the vertex and its neighbours:
$$\Gamma(i) = \{\, j \in V \mid (i, j) \in E \,\} \cup \{ i \}$$
Definition 3 (Structural similarity). The structural similarity of two vertices i and j is
$$\sigma(i, j) = \frac{|\Gamma(i) \cap \Gamma(j)|}{\sqrt{|\Gamma(i)|\,|\Gamma(j)|}}$$
In this formula, the number of neighbours of vertex i is $|\Gamma(i)| = m_i + 1$, and the number of common neighbours of two adjacent vertices i and j is
$$|\Gamma(i) \cap \Gamma(j)| = \sum_{l \in V} A_{il} A_{lj} + 2$$
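The construction can be sketched in Python as follows. This is a minimal sketch that assumes the structural similarity takes the common cosine-style form, i.e. the number of shared closed neighbours divided by the geometric mean of the two closed-neighbourhood sizes. The function names and the toy graph are hypothetical.

```python
import math


def closed_neighbourhoods(n, edges):
    """Closed neighbourhood of each vertex: the vertex itself plus its neighbours."""
    gamma = [{i} for i in range(n)]
    for i, j in edges:
        gamma[i].add(j)
        gamma[j].add(i)
    return gamma


def structural_similarity(gamma, i, j):
    """Assumed form: |G(i) & G(j)| / sqrt(|G(i)| * |G(j)|)."""
    common = len(gamma[i] & gamma[j])
    return common / math.sqrt(len(gamma[i]) * len(gamma[j]))


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
gamma = closed_neighbourhoods(4, edges)
weights = {(i, j): structural_similarity(gamma, i, j) for i, j in edges}
```

Only the neighbourhoods of the two endpoints are needed for each edge, which is why the weight can be computed without looking at the rest of the network. In the toy graph, the pendant edge (2, 3) receives a lower weight than the triangle edges.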

Community detection algorithm based on similarity
After the structural similarities of the network are calculated, a strategy is needed to delete edges. Our algorithm is inspired by the SSNCA algorithm, which consists of four steps. A) The structural similarities in graph G are calculated as the weights of the edges. B) The edge with the lowest weight is deleted. C) Steps A and B are repeated until no more edges are deleted. D) The community structure is detected [5]. In SSNCA, the structural similarity is used instead of the edge betweenness of the GN algorithm, which improves the computational speed. However, only one edge is deleted at a time, so the computational efficiency can still be improved. Thus, we take the structural similarity of two vertices as the weight of the edge between them, and then delete at once all the edges whose weights are less than the threshold.
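The idea of deleting several edges at once can be sketched as follows. This is a minimal Python sketch under stated assumptions: the similarity uses the cosine-style form over closed neighbourhoods, the weights are recomputed after each round of deletions, the loop stops when a round deletes nothing, and the communities are read off as the connected components of the remaining graph. The function names and the toy graph are hypothetical.

```python
import math


def prune_below_threshold(n, edges, k):
    """Repeatedly delete every edge whose structural similarity is below k.

    Similarities are recomputed after each round because deleting edges
    changes the neighbourhoods. Returns the set of surviving edges.
    """
    edges = set(edges)
    while True:
        gamma = [{v} for v in range(n)]  # closed neighbourhoods
        for i, j in edges:
            gamma[i].add(j)
            gamma[j].add(i)
        weak = {
            (i, j) for i, j in edges
            if len(gamma[i] & gamma[j])
            / math.sqrt(len(gamma[i]) * len(gamma[j])) < k
        }
        if not weak:  # no edge deleted in this round: stop
            return edges
        edges -= weak


def connected_components(n, edges):
    """Assumed read-out step: communities are the connected components."""
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, parts = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        parts.append(sorted(comp))
    return parts


# Toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
kept = prune_below_threshold(6, edges, k=0.6)
communities = connected_components(6, kept)
```

With k = 0.6, only the bridge falls below the threshold in the first round, and the two triangles remain as separate communities.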

The SSTCA algorithm
In this section, we consider how to delete multiple edges at once, so as to reduce the number of iterations and the time complexity. To this end, this paper introduces a threshold k, which is used to delete the edges whose weights are less than k, and gives an optimal community selection strategy that avoids both reducing the accuracy of the community partition and selecting the value of k blindly: a step size $\Delta k$ is defined, the modularity Q corresponding to each discrete threshold in the threshold interval is calculated, the maximum modularity is found, and the corresponding partition is taken as the best community partition. The idea of the algorithm is as follows: A) Define the threshold interval $[k_1, k_2]$, and [...] E) Find the community partition which has the maximum modularity and take it as the optimal community partition.
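The selection strategy can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation. It assumes that, for each threshold on the grid $k_1, k_1 + \Delta k, \ldots, k_2$, edges with weights below the threshold are deleted until no more can be deleted, that the communities are the connected components of what remains, and that Q is evaluated on the original graph. All function names and the toy graph are hypothetical.

```python
import math


def similarity_weights(n, edges):
    """Cosine-style structural similarity over closed neighbourhoods (assumed form)."""
    gamma = [{v} for v in range(n)]
    for i, j in edges:
        gamma[i].add(j)
        gamma[j].add(i)
    return {(i, j): len(gamma[i] & gamma[j])
            / math.sqrt(len(gamma[i]) * len(gamma[j])) for i, j in edges}


def communities_for_threshold(n, edges, k):
    """Assumed inner steps: prune edges with weight < k until stable, then label components."""
    edges = set(edges)
    while True:
        w = similarity_weights(n, edges)
        weak = {e for e in edges if w[e] < k}
        if not weak:
            break
        edges -= weak
    label = list(range(n))  # label[v] = community id of vertex v
    changed = True
    while changed:  # spread the smallest vertex id through each component
        changed = False
        for i, j in edges:
            low = min(label[i], label[j])
            if label[i] != low or label[j] != low:
                label[i] = label[j] = low
                changed = True
    return label


def modularity(edges, label):
    """Q = sum over communities of (inside edges / m) - (degree sum / 2m)^2."""
    m = len(edges)
    inside, degree = {}, {}
    for i, j in edges:
        degree[label[i]] = degree.get(label[i], 0) + 1
        degree[label[j]] = degree.get(label[j], 0) + 1
        if label[i] == label[j]:
            inside[label[i]] = inside.get(label[i], 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


def best_partition(n, edges, k1, k2, dk):
    """Sweep the threshold grid and keep the partition with the largest Q."""
    best = None
    steps = int(round((k2 - k1) / dk))
    for s in range(steps + 1):
        k = k1 + s * dk
        label = communities_for_threshold(n, edges, k)
        q = modularity(edges, label)
        if best is None or q > best[0]:
            best = (q, k, label)
    return best


# Toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q, k, label = best_partition(6, edges, k1=0.1, k2=0.9, dk=0.1)
```

On this toy graph, small thresholds delete nothing (Q = 0), very large thresholds fragment the triangles, and the maximum of Q is reached at the first threshold that removes only the bridge. Ties are resolved in favour of the smallest threshold, which is a design choice of this sketch.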

Evaluation of community partition quality
To describe the communities in the network quantitatively, this algorithm uses the modularity function Q proposed by Newman [10] as the standard for evaluating the quality of a community partition. The formula is as follows:
$$Q = \sum_{i} \left( e_{ii} - a_i^{2} \right)$$
where e is a symmetric matrix with $k \times k$ rows and columns (k here denotes the number of communities, not the threshold), $e_{ij}$ is the proportion of the edges connecting community i with community j among all the edges of the network, $e_{ii}$ is the proportion of edges whose two vertices both lie in community i, and $a_i = \sum_{j} e_{ij}$ is the proportion of edges connected to the vertices of community i [12]. The larger Q is, the better the quality of the community partition. In practice, Q usually lies between 0.3 and 0.7. The upper limit of Q is 1, and the closer Q is to 1, the more obvious the community structure. A negative value of Q indicates that the community structure is very poor [13].
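The modularity can be computed directly from the matrix e, as in the following minimal Python sketch. It assumes Newman's convention in which each edge between two different communities is split equally between $e_{ij}$ and $e_{ji}$, so that e is symmetric and its entries sum to 1. The function name and the toy graph are hypothetical.

```python
def modularity_from_e(edges, community):
    """Q = sum_i (e_ii - a_i^2), computed via the matrix e.

    edges     -- list of undirected edges (i, j)
    community -- community[v] is the index (0 .. c-1) of the community of v
    """
    c = max(community) + 1
    m = len(edges)
    e = [[0.0] * c for _ in range(c)]
    for i, j in edges:
        ci, cj = community[i], community[j]
        # Each edge adds half of its share to e[ci][cj] and half to e[cj][ci];
        # an edge inside one community therefore adds 1/m to e[ci][ci].
        e[ci][cj] += 0.5 / m
        e[cj][ci] += 0.5 / m
    a = [sum(row) for row in e]  # a_i: fraction of edge ends in community i
    return sum(e[i][i] - a[i] ** 2 for i in range(c))


# Toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_two = modularity_from_e(edges, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_one = modularity_from_e(edges, [0, 0, 0, 0, 0, 0])  # everything in one community
```

Splitting the toy graph into its two triangles gives Q = 2(3/7 - 1/4) = 5/14, whereas placing all vertices in a single community gives Q = 0, which shows why a larger Q indicates a better partition.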

Experimental results and analysis
In this section, the proposed algorithm is tested on Zachary's network, the dolphins' social network and the football dataset, and compared with the GN and SSNCA algorithms. The experimental results show that the SSTCA algorithm can effectively discover the community structure and that the computation speed is significantly improved. The environment parameters of the experiment were as follows:

Zachary's Network
Zachary's network [15] is a common dataset in complex network community detection. It contains 34 vertices and 78 edges and reflects the social relations among the members of an American karate club. Because of a dispute over charges, the club split into two major groups headed by the director and the president. As shown in Fig.1, vertex 1 represents the director and vertex 33 represents the president, and vertices of different colours represent the members of each group. The community result detected by the SSTCA algorithm is shown in Fig.2, in which solid vertices (yellow in the colour figure) and hollow vertices (white in the colour figure) represent two different communities. This result differs slightly from that of the GN algorithm (Fig.3) but is completely consistent with the classification result of the SSNCA algorithm (Fig.4).

Football club
The Football club dataset contains 115 vertices and 613 edges, in which each vertex represents a team and each edge represents a regular-season game between two teams. The teams are divided into 12 groups, each containing 8 to 12 teams. Matches within a group are more frequent than matches between groups, so the community structure of the network is obvious [11]. The result of the SSTCA algorithm on the Football club dataset is shown in Fig.5. The algorithm divides the network into 12 communities. Fig.6 shows the size of every community, which is consistent with the actual range of 8 to 12 teams per group. Therefore, the algorithm can detect the community structure effectively.

Dolphins' social Network
Lusseau constructed a dolphins' social network with 62 vertices and 159 edges by observing the habits of 62 bottlenose dolphins. Each vertex represents a bottlenose dolphin, and each edge represents frequent association between two dolphins. Lusseau found that these dolphins communicate in a specific pattern, that is, with a certain community structure [8]. The experimental result of the SSTCA algorithm on the Dolphins' social network is shown in Fig.7. Vertices drawn as circles belong to successfully detected communities, and different colours represent different communities. The algorithm divides the network into 3 communities; the remaining vertices 37, 40, and 56 are outliers (drawn as squares). The experimental result is consistent with the actual network community.
In Fig.8, the X axis represents the threshold k, and the Y axis represents the Q value obtained for the selected threshold k. As can be seen from Fig.8 [...] Table 2, where the average clustering coefficient indicates the density of the network: the larger the average clustering coefficient, the denser the network.
The running times of the SSTCA, GN and SSNCA algorithms on Zachary's dataset, the dolphins' social network and the football club are shown in Table 3. The values in Table 3 are averages over 100 runs. The results of the three algorithms on the different datasets show that, for the same algorithm, the larger the network, the longer the computing time, and that, for the same dataset, the SSTCA algorithm is the fastest, followed by SSNCA, while GN is the slowest. Because the average clustering coefficient of the dolphins' social network is small, that is, the network is sparse, the partition effect is poor and the accuracy is low. Zachary's club and the football club networks are dense, so the SSTCA partition effect is better.
In conclusion, the experimental results show that our algorithm can find the community structure effectively and improves on the computational efficiency of the SSNCA algorithm.

Conclusion
In this paper, SSTCA, a community detection algorithm based on structural similarity, was proposed, which sets a threshold k and deletes multiple edges at once. Firstly, the definition of a complex network and the method of constructing the structural similarity from the adjacency matrix were given. Secondly, the main idea of the SSTCA algorithm and the modularity function Q, which was used to evaluate the quality of the community partition, were described. Finally, the SSTCA algorithm was tested on Zachary's network, the football club and the dolphins' social network and compared with the GN and SSNCA algorithms. The experimental results show that the algorithm performs well on dense networks and greatly improves the computation speed. Future work is to optimize the threshold determination strategy and to improve the accuracy of the algorithm on sparse networks.