Overlapping community detection in signed networks

Complex networks considering both positive and negative links have gained considerable attention during the past several years. Community detection is one of the main challenges for complex network analysis. Most of the existing algorithms for community detection in a signed network aim at providing a hard-partition of the network where any node should belong to a community or not. However, they cannot detect overlapping communities where a node is allowed to belong to multiple communities. The overlapping communities widely exist in many real world networks. In this paper, we propose a signed probabilistic mixture (SPM) model for overlapping community detection in signed networks. Compared with the existing models, the advantages of our methodology are (i) providing soft-partition solutions for signed networks; (ii) providing soft-memberships of nodes. Experiments on a number of signed networks show that our SPM model: (i) can identify assortative structures or disassortative structures as the same as other state-of-the-art models; (ii) can detect overlapping communities; (iii) outperform other state-of-the-art models at shedding light on the community detection in synthetic signed networks.


1.
Introduction Complex networks [1] provide a powerful tool for representing many real world complex systems, such as information systems [2,3], social systems [4,5], ecological systems [6], and others [1,7,8].The task of complex network analysis is to identify the network's properties including network structure.Newman firstly introduced two types of network structures: assortative structure and disassortative structure [9].The assortative structurealso called community structure-is a type of network structures in which most edges are within a group.The disassortative structure is a type of network structures in which most edges are across groups.For these two types of network structures, a large number of effective techniques have been proposed during the last several years, such as Potts model [10] and modularity model [11].A detailed survey about them was presented by Fortunato [12].
Networks considering both positive and negative links are called signed networks.The community structure in signed networks is different from the assortative/disassortative structure in un-signed networks.In signed networks, most edges within a community are positive links, and most edges across communities are negative links [13].The signed networks have gained considerable attention from different scientific disciplines, i.e. biology [14], computer [15], and social sciences [16].For example, in a social network, positive links may denote friendship, agreement or trust whereas negative links may denote hostility, disagreement or distrust.Many studies have been presented for community detection in signed networks.Yang et al. proposed an agent-based approach to extract community structures by performing a random walk on positive links [17].Gomez et al. extended the modularity method from un-signed networks to signed networks for community detection [18].Traag proposed an algorithm based on the Potts model to find community structures in signed networks with only negative links [19].Shen et al. provided a statistical probability model based on the mixture model to detect the disassortative structures of signed networks with both positive and negative links [20].
Most of the existing algorithms for community detection in a signed network aim at providing a hard-partition of the network where any node should belong to a community or not.However, they cannot detect overlapping communities where a node is allowed to belong to multiple communities.The overlapping communities widely exist in many real world networks.For example, a person in social networks may belong to both family and hobby groups.In signed networks, a node is overlapping on the condition that it connects with nodes in other communities by positive links or connects with nodes in the same community by negative links, i.e. the node E and node F shown in figure 1.Recently, several methods have been proposed for overlapping community detection in networks with only positive links, which may fall into three categories: Clique Percolation Method (CPM) [21], fuzzy clustering-based method [22][23][24] and mixture models [25][26][27][28][29].The CPM supposes that edges within a community are likely to form a clique due to their high density whereas edges across communities are unlikely to form a clique.The fuzzy clustering-based method uses fuzzy relation to describe the case of a node belonging to more than one community [22].Dunn used a fuzzy probability to describe a node belonging to a community (i.e., c-means clustering) [23].Zhang et al [24] provided a method to approximately map nodes into a dimensional space by combining a modularity function, spectral mapping and fuzzing clustering.The mixture models model a network using generative graphical models.There are two types of mixture models for network community detection.The first are stochastic block-based models [26,28], which generate a network from a perspective on node.In a stochastic block model, each node is assigned to a block, group, or community.And undirected edges then are placed independently between node pairs with probabilities, from a function of the group memberships of the nodes.The second are probabilistic mixture-based models [25,27,29], which generate a network from a perspective on edge.The probabilistic mixture model is inspired by the probabilistic latent semantic analysis [30] for text mining, and introduced for community detection in the Ref. [29].Instead of assigning each node to a specific community, a probabilistic mixture model assigns each edge to one of blocks, groups, or communities with a probability, and then picks up nodes of the edge from the corresponding blocks.The latent variables in stochastic block models operate on vertices, while that in probabilistic mixture models operate on edges [31].Both types of mixture models are suitable for network community detection.The main difference between them is that the stochastic block models can be used not only for network community detection, but also for link prediction [32,33] from the node perspective; but the probabilistic mixture models can only be used for network community detection from the edge perspective.For community detection, stochastic block models usually perform well on synthetic networks, but poorly on many real-world networks [34], whereas the probabilistic mixture models perform well on both synthetic and real-world networks.
Although several methods have been proposed for overlapping community detection, most of them are limited to networks with only positive networks.They do not work in signed networks.In this paper, we proposed a novel probabilistic mixture model based on expectation-maximization (EM) method [35], called signed probabilistic mixture (SPM) model, to detect overlapping communities in undirected signed networks.It is a variant of the probabilistic mixture model which generates positive and negative links with different probabilities.To give a clear description, we presented an illustrative undirected signed network as shown in figure 1.For the signed network, our model will provide a correct overlapping partition, i.e. the community (A, B, C, D, E, F) and community (E, F, G, H, I).The advantages of our method are (i) providing soft-partition solutions in signed networks, such as nodes E and F belonging to two communities simultaneously; (ii) providing soft-memberships, which quantify "how strongly" a node belongs to a community.Experiments on a number of signed networks show that our SPM model (i) can identify assortative structures or disassortative structures as the same as other state-of-the-art models; (ii) can detect overlapping communities; (iii) outperform other state-of-the-art models at shedding light on the community detection in synthetic signed networks.
The remaining part of this paper is organized as follows.Section 2 presents the SPM model.Section 3 discusses the performance of the SPM model on the signed network with only positive links or negative links.Experiments are presented in section 4. Section 5 draws conclusions.A, B, C, D, E, F) and (E, F, G, H, I).The solid lines denote positive links and the dotted lines denote negative links.

2.
The signed probabilistic mixture model (SPM) Before introducing the signed probabilistic mixture model, we give a brief definition of an undirected signed network.Generally, a network is represented by an adjacency matrix A with n dimensions.We use E to denote the edge set, to denote the edge between node i and node j.In addition, E + and E -are used to denote the positive and negative links in a signed network, respectively.It is easy to understand that: . If there is no edge between node i and node j, In the SPM model, given K communities, an edge can chose a community pair from K K  possible community pairs because each node of the edge can chose a community from K possible communities.(Here, K is a predefined number of communities in a network).Formally, we use rs  to denote the probability of a edge choosing a community pair And the probability of a negative edge We unify them into the probability of an edge ij E : since there are no difference between the edges ij E and ji E .Finally, the marginal likelihood of the signed network can be written as The parameters of Eq. ( 4) cannot be estimated using the likelihood maximization estimation because the chosen community pair } ' , ' { s r of an edge is a hidden variable.In our study, we use the EM algorithm for parameter estimation, which is a general approach to estimate the parameters of probabilistic mixture model by maximizing the expected likelihood iteratively.In each iteration, it computes the posterior probabilities of hidden variables using model parameters in E step, and re-estimates the model parameters in M step.
The log-likelihood function of Eq. ( 4) is It is usually converted to an expected log-likelihood function as Eq.( 6) using the Jensen's inequality because it is difficult to be optimized directly. where respectively denote the probabilities of a positive link from a community r and a negative link from different communities r and s.
In the E step, the algorithm calculates the posterior probability of hidden variable ( ' r , ' s ) (i.e., ijrr q and ijrs Q ) using  and  .They can be calculated by In the M step, the algorithm re-estimates  and  using ijrr q and ijrs Q from the E step.To estimate  and  , we optimizes the expected log-likelihood function in Eq. ( 6).Considering that , we obtain the Lagrange form of Eq. ( 6) as follows: where  , r  are the Lagrange multipliers.All parameters are derived by setting the derivative of L to be 0: Once the model parameters are estimated as in Eq. ( 10), the probability of node i belonging to community r denoted by It means that a node can belong to several communities simultaneously.Therefore, the proposed model provides a soft-partition of the network with soft memberships of nodes, not a hard-partition.If we want to get a hard-partition, we can simply assign each node i to the community it most likely belongs to.That is } ,..., , max{ arg . If the EM algorithm converges within T iterations, the timecomplexity of the SPM model will be

Two extreme signed networks
Particularly, in a signed network with only positive links, that is In the E step, In the M step, Our algorithm is similar to the Simple Probabilistic Algorithm Expectation Maximization (SPEAM) model [29], for assortative structure detection, where all edges are in communities.
Similarly, in another signed network with only negative links, that is In the E step, In the M step, Our algorithm can be used to identify the disassortative structure, where all edges are across communities.

Experiment and analysis
To investigate the effectiveness of the SPM model on overlapping community detection in signed networks.We first test it on a large number of signed networks including a signed network with only positive links, a signed network with only negative links, an illustrative network, two real-world networks and a series of synthetic networks.Then we discuss the model selection issue-how to determine the optimal number of communities.

4.1
Community detection in a signed network with only positive links The Zachary club network, which characterizes the acquaintance relationship between 34 members [36], is used to test the capability of our model on assortative structure detection in a signed network with only positive links.The club network is split into two groups because of a dispute between the administrator and karate teacher.It has been used as a common dataset for overlapping community detection in many studies.Figure 2 shows the communities detected by our algorithm when setting K=2.The SPM model correctly identifies two assortative structures with several overlapping nodes: {3, 9, 14, 20, 31, 32}.To further investigate the effectiveness of our model, we compared it with several popular models, including Generalized Stochastic Blockmodel (GSB) [25], Newman Mixture Model (NMM) [9] and SPEAM [29].Table 1 shows the memberships of the 6 overlapping nodes when using different models.The numbers in a parenthesis are coefficients indicating how strongly a node belongs to all communities (called community coefficients).For example, the first (0.51, 0.49) in the first row indicates that node 3 belongs to two communities with probabilities of 0.51 and 0.49 respectively.We can see that our model gets the same result as GSB and SPAEM, and a better result than NMM.It is easy to understand that the results from GSB, SPAEM and our model are the same because a signed network with only positive links only contains assortative structures.Each assortative structure is a community.It is not surprised that our model outperforms NMM since GSB has been proved superior to NMM on the Zachary club network in [25].(0.17, 0.83) (0.00, 1.00) (0.17, 0.83) (0.17, 0.83)

Community detection in a signed network with only negative links
We adopt the dataset used in [37] to test the capability of our model on disassortative structure detection in a signed network with only negative links.The dataset is a network of 112 common adjectives and nouns in the novel David Copperfield by Charles Dickens connected by 425 edges.Each edge in the network denotes a pair of adjacent words in the text.To test our model on the dataset, we change the original edges into negative links. Figure 3 shows the communities detected by our model when setting K=2.Our model detects a bipartite structure, which is composed of two disassortative structures: an adjective group and a noun group.In addition, we also compare our model with GSB, NMM and SPAEM.Their performance is measured by the node accuracy of the hard-partition derived from them. 100 of the 112 nodes are correctly classified by GSB, NMM and our model, while only 60 of the 112 nodes are correctly classified by SPAEM.It means that SPAEM is worse than GSB, NMM and our model on dissortative structure detection in signed networks with only negative links.The reason is that the SPAEM assumes that networks are composed of assorative structures.

4.3
Overlapping community detection in an illustrative signed network We test our model on the illustrative signed network shown in figure 1.It contains 9 nodes connected by 16 positive links and 9 negative links.The nodes fall into two overlapping communities with two overlapping nodes (i.e., E and F).When setting the number of communities K = 2, our model correctly detect two communities with two overlapping nodes as shown in figure 4, where the numbers are community coefficients of nodes.It is very clear that nodes (A, B, C, D) completely belong to the left community since their community coefficients are (1.0,0.0), nodes (G, H, I) completely belong to the right community since their community coefficients are (0.0, 1.0), and nodes (E, F) belong to the two communities simultaneously since their community coefficients are (0.43, 0.57) and (0.34, 0.66) respectively.

4.4
Overlapping community detection in real-world signed networks We test our model on two public datasets, which are widely used for community detection.The first signed network is a relation network of 10 parties of the Slovene Parliamentary in 1994 [38] as shown in figure 5(a).The numbers are the weights of links in the network estimated by 72 questionnaires among 90 members of Slovene National Parliament.The questionnaires were designed to estimate the distance of 10 parties on the scale from -3 to 3, and the final weight were the averaged value multiplying by 100.The 10 parties fall into two communities: (1,3,6,8,9) and (2,4,5,7,10), a hard-partition of the network.When setting K=2, our model detected two communities with an overlapping node as shown in figure 5(b).The community coefficients of nodes are shown in figure 5(c).The overlapping communities detected by our model are (1,3,6,8,9,10) and (2,4,5,7,10), which are a little different with the real communities.The difference is reasonable because this network is designed for finding a hard-partition, not a soft-partition.For node 10, it really does not completely belong to the community on the right because there are two positive links (10-2, 10-4) and two negative links (10-5, 10-7) related to it in that community.On the other hand, if we use the method mentioned in section 2 to convert the soft-partition predicted by our model into a hard-partition, the hard-partition will be the same as the real one.The second signed network is the Gahuku-Gama Subtribes network about the cultures of highland New Guinea [39] as shown in figure 6(a).It describes the political alliance and enmities among the 16 Gahuku-Gama subtribes.The positive and negative links of the network correspond to different political arrangements.The 16 subtribes fall into three communities.Among them, one subtribe sides with two communities.When we apply our model on this network with K=3, three communities are correctly detected with an overlapping node as shown in figure 6(b).The community coefficients of nodes are shown in figure 6(c).

4.5
Community detection in synthetic signed networks It is common to validate the performance of algorithms for community detection on synthetic networks.In our study, we also test the SPM model on some synthetic signed networks.The synthetic signed networks are generated using the method proposed by Yang [17].We use SG(c, (n 1 ,n 2 ,…,n c ), k, p in , p + , p -) to denote a synthetic signed network, where c is the number of communities, (n 1 ,n 2 ,…,n c ) are the number of nodes of each community, k is the degree of each node, p in is the probability of each node connecting with other nodes in the same community, p + denotes the probability of positive links across communities and p -denotes the probability of negative links within communities.Note that we simplify it as SG(c, n, k, p in , p + , p -) when n 1 =n 2 =…=n c .We test our model on two types of synthetic signed networks: partitionable signed networks in which both p + and p -are 0; non-partitionable signed networks in which p + or p -is not 0. In addition, we conduct a number of experiments to test the robustness of the SPM model.The performance of all models is measured by the Normalized Mutual Information (NMI) [40], which is a widely used method for evaluating the community detection: is the mutual information between them.A high P nmi means a good detection.Specially, P nmi = 1 means that the detection is perfect.

Partitionable signed networks
We test our model on three partitionable synthetic signed networks: SG(4, 30, 16, 0.8, 0, 0) as shown in figure 7(a), SG(4, 30, 16, 0.1, 0, 0) as shown in figure 7(c) and SG (20,30,16, 0.8, 0, 0) as shown in figure 7(e).The positive links are denoted by white points; the negative links are denoted by black points in the figures.The first network has the same parameters as the second one except the density of edges in each community.The first network has the same parameters as the third one except the number of communities.The three networks are used to test the effect of density of edges in a community on our model, and the effect of the number of communities on our model.When our model is applied on these three networks, all communities are correctly detected as shown in figure 7(b), 7(d) and 7(f) respectively.The P nmi of our model on all three networks is 1.The results show that the SPM model is unaffected by not only the density of edges in each community (SG(4, 30, 16, 0.8, 0, 0) vs SG (20,30,16, 0.1, 0, 0)), but also the number of communities (SG(4, 30, 16, 0.8, 0, 0) vs SG(4 30, 16, 0.8, 0, 0)).

Non-partitionable signed networks
We also test our model on two non-partitionable synthetic signed networks: SG(4, 30, 16, 0.8, 0.2, 0.2) as shown in figure 8(a) and SG(4, (30, 60, 90, 120), 16, 0.8, 0.2, 0.2) as shown in figure 8(c).All of the first network are the same as the first network in figure 7(a) except p + and p -.In order to test the effect of noises on our model, we set both p + and p -to 0.2.The second network has the same parameters as the first one except the number of edges in each community.In the first network, all communities compose of the same number of nodes.In the second network, the number of nodes in a community is different with each other's.The two networks are used to test the effect of noise on our model, and the effect of the number of edges in each community.When our model is applied on these two networks, all communities are correctly detected as shown in figure 8(b) and figure 8(d) respectively.The P nmi of our model on both two networks is 1.The results show that the SPM model is unaffected when either adding noises (SG(4, 30, 16, 0.8, 0.2, 0.2) vs SG(4, 30, 16, 0.8, 0, 0)) or setting different numbers of nodes in communities (SG(4, 30, 16, 0.8, 0.2, 0.2) vs SG(4, (30, 60, 90, 120), 16, 0.8, 0.2, 0.2)).

Robust evaluation
We not only compare the SPM model with the Signed Newman Mixture (SNM) model [20], which needs a predefined number of communities, but also compare it with the FEC model [17] and Traag's model [19], which do not need a pre-defined number of communities.For comparison, we construct two types of signed works: SG(4, 30, 16, p in , 0, 0) with p in gradually changing from 0 to 1, and SG(4, 30, 16, 0.8, p + , p -) with both p + and p - gradually changing from 0 to 1.The results on them are shown in figure 9 and figure 10.Note that results at some points are not displayed in figure 9 and figure 10 as there is no ground-truth community at them.For example, when p in = 0.0, there is no ground-truth community in SG(4, 30, 16, p in , 0, 0) as there is no positive link in any community.When p + >0.5, there is also no ground-truth community in SG(4, 30, 16, 0.8, p + , p -) as the positive links in any community is less than the positive links across communities.When p -> 0.5, there is no ground-truth community in SG(4, 30, 16, 0.8, p + , p -) too as the negative links across communities is less than the negative links in any community.
In figure 9, each curve is the average P nmi of a model with p in on 30 synthetic random networks.All the models are applied on the same networks.The P nmi of the SPM model is always 1 when  In figure 10, a surface is also the average P nmi of a model with p + and p -on 30 synthetic random networks.The P nmi of the SPM model is much higher than the SNM model when  In summary, experiments on networks with only positive links or negative links show that the SPM model correctly identifies assortative structures or disassortative structures as the same as other state-of-the-art models; experiments on real-world signed networks show that the SPM model is able to detect overlapping communities which are neglected by most of the current popular models; experiments on synthetic signed networks show that the SPM model outperforms other state-of-the-art models at shedding light on community detection in signed networks.

4.6
Model selection issue A limit of our model is that it requires a predefined community number, which is usually unknown or uncertain in many real-world networks.Therefore, it is necessary to provide a criterion to determine the community number for our model.We test our model under two criteria: the minimum description length (MDL) principle [41], and the criterion in [42].
The MDL is a popular criterion for model selection issue which contains two parts: one describing the coding length of the networks; the other one describing the length of model parameters.For our model, the coding length .We apply the MDL to our model on the aforementioned signed networks.The MDL fails to acquire the number of communities in all networks.When the MDL is applied to the SNM model, it fails too.It seems that the MDL is not suitable for our model as well as the SNM model.
The error criterion function presented in [42] can be written as where N denotes the total number of negative links within communities, P denotes the total number of positive links between communities, and  denotes the weight of negative links ( 1 0   ).It can be used to determine the community number of signed networks because there exists only one partition to make the criterion function (Eq.19) minimum for any signed network according to the theorem in [43].We used the error criterion function to determine the community number on the signed networks aforementioned for the SPM model.The results are shown in figure 11.The SPM model finds one optimal community number on the Slovene Parliamentary Party network, the Gahuku-Gama Subtribes network, the SG(4, 30, 16, 0.1, 0, 0) network and the SG(4, (30, 60, 90, 120), 16, 0.8, 0.2, 0.2) network, which are correct.On the illustrative network, the SG(4, 30, 16, 0.8, 0, 0) network, the SG(20, 30, 16, 0.8, 0, 0) network and the SG(4, 30, 16, 0.8, 0.2, 0.2) network, the SPM model finds multiple optimal community numbers.By checking community coefficients of nodes in each network of them, we find that all the optimal community numbers correspond to the same hard-partition, which is also correct.Thus, the error criterion function is suitable to determine the community number for the SPM model.

5.
Conclusions In this paper, we proposed a novel probabilistic model for overlapping community detection in signed networks.The proposed model is a variant of the probabilistic mixture model.The advantages of the model are (i) providing soft-partition solutions for signed networks; (ii) providing soft-memberships of nodes.Experiments on a number of real-world and synthetic signed networks show that our SPM model: (i) can identify assortative structures or disassortative structures as the same as other state-of-the-art models; (ii) can detect overlapping communities; (iii) outperform other state-of-the-art models at shedding light on the community detection in synthetic signed networks.In addition, the general criterion function is proved suitable to determine the optimal number of communities.As future work, we will apply our model to community detection on real scalable signed networks, and seek possible applications.

Figure 1 .
Figure 1.An undirected signed network with two overlapping communities (A, B, C, D, E, F) and (E, F, G, H, I).The solid lines denote positive links and the dotted lines denote negative links.

Figure 2 .
Figure 2. The network of the Zachary club with 34 nodes and 78 positive links.The real communities in this network are marked by different type of shapes: square and circle.The shading nodes in the ellipse are overlapping nodes of soft memberships identified by our algorithm.

Figure 3 .
Figure 3.The network of 112 common adjectives and nouns in the novel David Copperfield by Charles Dickens connected by 425 negative links.The adjective and noun groups (i.e., communities) are denoted by circles and squares, respectively.The shading nodes are overlapping nodes of soft memberships identified by our algorithm.

Figure 4 .
Figure 4. Overlapping community detection in the illustrative signed network shown in figure 1.(a) The overlapping communities are detected by the SPM model.(b) The community coefficients of all nodes predicted by the SPM model.

Figure 5 .
Figure 5. Overlapping community detection in the Slovene Parliamentary signed network.(a) The adjacency matrix of the relation network of 10 Slovene Parliamentary parties.(b) The overlapping communities are detected by the SPM model.The solid lines denote the positive links, and the dotted lines denote the negative links.The real communities are marked by different types of shapes (i.e., square and circle).The shading nodes are nodes of soft memberships identified by our model.(c) The community coefficients of all nodes obtained by the SPM model.All overlapping nodes are emphasized in (b) and (c).

Figure 6 .
Figure 6.Overlapping community detection in the Gahuku-Gama Subtribes signed network.(a) The adjacency matrix of a network of 16 Gahuku-Gama Subtribes.(b) The overlapping communities are detected by the SPM model.The solid lines denote the positive links, and the dotted lines denote the negative links.The real communities are marked by different types of shapes: square, circle and triangle.The shading nodes are overlapping nodes identified by the SPM model.(c) The community coefficients of all nodes obtained by the SPM model.All overlapping nodes are emphasized in (b) and (c).

Figure 7 .
Figure 7. Community detection on three partitionable synthetic signed networks: SG(4, 30, 16, 0.8, 0, 0), SG(4, 30, 16, 0.1, 0, 0) and SG(20, 30, 16, 0.8, 0, 0).(a) The adjacency matrix of the first network.(b) The communities detected by the SPM model on the first network.(c) The adjacency matrix of the second network.(d) The communities detected by the SPM model on the second network.(e) The adjacency matrix of the third network.(f) The communities detected by the SPM model on the third network.
model slightly outperforms the SNM model and Traag's model, significantly outperformed the FEC model.P nmi of the SNM model, and Traag's model achieves 1, which is the same as the SPM model.For the FEC model, the P nmi achieves 1 the SPM model outperforms the other three models when p in gradually changes from 0.05 to 1.

.
Although the SPM model is slightly inferior to the Traag's model when 2 P nmi of the SPM model is still not less than 0.6, which is acceptable.Overall, the SPM model is superior to the SNM and FEC models, competitive with the Traag's model.

Table 1 .
The memberships of 6 overlapping nodes when using different models.