Defining and identifying cograph communities in complex networks

Community or module detection is a fundamental problem in complex networks. Most of the traditional algorithms available focus only on vertices in a subgraph that are densely connected among themselves while being loosely connected to the vertices outside the subgraph, ignoring the topological structure of the community. However, in most cases one needs to make further analysis on the interior topological structure of communities to obtain various meaningful subgroups. We thus propose a novel community referred to as a cograph community, which has a well-understood structure. The well-understood structure of cographs and their corresponding cotree representation allows for an immediate identification of structurally-equivalent subgroups. We develop an algorithm called the Edge P4 centrality-based divisive algorithm (EPCA) to detect these cograph communities; this algorithm is efficient, free of parameters and independent of additional measures mainly due to the novel local edge P4 centrality measure. Further, we compare the EPCA with algorithms from the existing literature on synthetic, social and biological networks to show it has superior or competitive performance in accuracy. In addition to the computational advantages over other community-detection algorithms, the EPCA provides a simple means of discovering both dense and sparse subgroups based on structural equivalence or homogeneous roles which may otherwise go undetected by other algorithms which rely on edge density measures for finding subgroups.


Introduction
As one hotspot and keystone of the research on complex networks, community or module detection has been heavily developed in the past few decades [1]. A range of algorithms have been proposed that focus mainly on how to detect a cohesive group of vertices as a rough community. They primarily use the macroscopic property of communities, namely that they are internally edge-dense while being sparsely connected to the outside, and pay little attention to the interior topological structure. The fact that these traditional algorithms do not reveal a specific structure in their detected communities means that extra work has to be done in order to identify the important subgroups or modules within the community. In applications of complex networks, one often needs to investigate the next-level structure of sub-communities or modules. For example, while protein complexes (modeled as modules) detected in protein-protein interaction (PPI) networks can help us understand biological networks, they still cannot provide enough information, because we also want to obtain the core components of the complexes [2] or to identify the essential proteins [3]. Additionally, for communities detected on practical networks we want to know not only which vertices are grouped together by a network partition but also the relationships among the individual members of the obtained communities, such as the hierarchical organization of actors in a social network [4]. Traditional algorithms cannot meet such requirements without extra tools from network analysis.
While the main approach to community detection has been to find the resulting network clusters via partitive algorithms, there has been some work done in attempting to characterize the topological structure of the community, which leads to an alternate algorithmic approach of attempting to find these special structures.

Induced subgraphs and the P4
An induced subgraph of a network is specified by a set of vertices, and all of the edges that exist on those vertices in the network are also part of the induced subgraph. More formally, for a network G = (V, E), a subnetwork G′ = (V′, E′) is an induced subgraph if, for every pair u and v of V′, uv is in E′ if and only if uv is in E. A P4 is an induced graph on four ordered vertices, which are connected as a simple path [9]. That is, it contains three consecutive edges and, just as importantly, there are no additional edges within these four vertices. An example of a P4 a−b−c−d is shown in figure 1(a), and these four vertices would not be a P4 in a network if the network contained an edge joining a and c, for example.

Cographs
A graph is called a cograph (also known as a P4-restricted graph) if it does not contain a P4 as an induced subgraph [9]. A single vertex is a trivial cograph, as is any network with three or fewer vertices. An example of a cograph is shown in figure 1(b), and we reiterate that while vertices b, d, c and w form a path, they do not induce a path since those four vertices also contain edges bw and dw.

Cograph community
Cograph communities are defined as the connected components of a network that has no induced P4 subgraph. As will be seen in the following section, the EPCA deletes edges that have high P4 centrality until the modified network is a cograph. The resulting connected components define the cograph communities.

Cotree
The rooted tree representing the parse structure of a cograph in normalized form is referred to as a cotree. The leaves of a cotree are the vertices of the corresponding cograph, and each internal tree vertex represents the union or join operation. In order to establish various properties about cographs we label each internal vertex of a cotree as follows: the root is labelled 1, the children of a vertex with label 1 are labelled 0 and the children of a vertex labelled 0 are labelled 1 [14,15]. Figure 1(c) illustrates the cotree for the cograph depicted in figure 1(b). The set of cographs is exactly the set of graphs which can be represented as a cotree, and every cograph has a unique cotree representation. Two vertices that have the same neighbours apart from each other are called strong siblings if they are adjacent and weak siblings if they are not [9]. For example, as shown in figure 1(c), vertices v and u are strong siblings, while vertices b and c are weak siblings. Strong and weak siblings have also been called true twins and false twins in other contexts. Cographs can also be characterized as graphs which can be generated by repeatedly adding strong and weak siblings to a single vertex.

The approach EPCA and cograph communities
To detect the cograph communities of a complex network efficiently, we give an algorithm called the EPCA, a typical divisive algorithm based on edge P4 centrality. In the following, we first introduce edge P4 centrality and the approach EPCA; then, we demonstrate the properties of cograph communities.

EPCA based on P4 centrality

P4 centrality
The set of edges that link the vertices of the same community (also called intra-links) are generally expected to be denser than the set of edges that link different communities (also called inter-links). That is, the inter-links are relatively sparser than the intra-links. Intuitively, there are many more cycles embedded in intra-links, while one does not expect to find many cycles using inter-links. This means that the inter-links tend to belong to more paths. Since a P4 is a very simple induced path, there are no small cycles among the four vertices of a P4. Thus, the inter-links tend to belong to more P4s since they tend to be part of paths, while the intra-links are inclined to belong to fewer P4s since they tend to belong to more cycles. From these facts, we define the edge P4 centrality, which is a score assigned to each edge that counts the number of P4s to which that edge belongs. This definition of edge P4 centrality gives us a way to quantitatively measure how much more inter-link-like than intra-link-like an edge ij is. If its P4 centrality is large it is more likely to be an inter-link, while if its P4 centrality is smaller it is more likely to be an intra-link.
Formally, the edge P4 centrality of an edge ij is defined as the number of pairs of vertices {x, y} for which the set {i, j, x, y} induces a P4. Note that these four vertices can extend edge ij to a P4 in a number of ways: ij can be an end edge of the path, as in i−j−x−y, or its middle edge, as in x−i−j−y, and the same holds with the roles of i and j or of x and y exchanged and for all of their reversals. If any of these configurations occur, this 4-set {i, j, x, y} contributes a score of 1 to the P4 centrality of the edge ij and to the two other edges on these four vertices.
One can check whether four vertices induce a P4 by testing that the induced subgraph on these four vertices contains two vertices of degree 1 and two vertices of degree 2. So, one could write a function IsP4(a, b, c, d) easily (but we omit the details as this highly depends on the data structures one uses to store and access the elements of the graph). Using such a function, a simple algorithm to compute the P4 centrality of all of the edges would be to enumerate all sets of four distinct vertices, test IsP4(a, b, c, d) and, if it is true, increment the centrality score for the three involved edges. Of course, there are a number of improvements that can be added to this process, for example, shortcutting the inner loops when the first three vertices induce degrees of 0, 0, 0 or 2, 2, 2, as these configurations cannot extend to a P4. One can also limit the search for the next candidate vertex by only choosing from the neighbourhood set of the appropriate vertices already in consideration. Of course, obtaining the neighbourhood of a vertex is, once again, dependent on the graph data structure used; so, we do not discuss the finer details of this process here.
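The degree test and the brute-force enumeration described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the adjacency-set dictionary `adj` and the names `is_p4` and `p4_centrality` are assumptions made for the example, and none of the shortcuts mentioned in the text are applied.

```python
from itertools import combinations

def is_p4(adj, a, b, c, d):
    """Return True if {a, b, c, d} induces a P4 (assumed helper name).

    `adj` maps each vertex to the set of its neighbours (symmetric).
    """
    nodes = (a, b, c, d)
    # Degree of each vertex inside the subgraph induced on the four vertices.
    degrees = sorted(
        sum(1 for v in nodes if v != u and v in adj[u]) for u in nodes
    )
    # Two end vertices of degree 1 and two inner vertices of degree 2.
    return degrees == [1, 1, 2, 2]

def p4_centrality(adj):
    """Brute force: each induced P4 adds 1 to each of its three edges."""
    score = {frozenset((u, v)): 0 for u in adj for v in adj[u]}
    for quad in combinations(sorted(adj), 4):
        if is_p4(adj, *quad):
            for u, v in combinations(quad, 2):
                if v in adj[u]:
                    score[frozenset((u, v))] += 1
    return score

# Path a-b-c-d with an extra pendant vertex e attached to c.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d", "e"},
       "d": {"c"}, "e": {"c"}}
scores = p4_centrality(adj)
```

In this toy graph the only induced P4s are a−b−c−d and a−b−c−e, so edges ab and bc each receive a score of 2 while cd and ce each receive 1.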
We note that updating the P4 centrality of all of the edges upon an edge deletion is much faster than the initial calculation of all of the centrality scores, since after removing edge ab, we must only search over pairs of candidate vertices {c, d} to find the affected changes rather than repeat the search over all of the quadruplets.
To understand the edge P4 centrality better, we compare it with edge anti-triangle centrality [16], the edge clustering coefficient [17] and edge betweenness [18,19], respectively. We choose edge anti-triangle centrality for comparison because it has a similar definition, the edge clustering coefficient as it is a typical and efficient local centrality and edge betweenness for its well-known accuracy in identifying an edge as being inside or outside a community. We plot the scatters of the logarithm of edge P4 centrality against edge anti-triangle centrality, the edge clustering coefficient and the logarithm of edge betweenness, respectively, on the Zachary karate club network (ZKCN) [20], the LFR synthetic network [21] with the mixing parameter mu = 0.5 and the S. cerevisiae PPI networks (SceDIP) [22] obtained from the DIP. The details of these three networks are given in table 1. We compare these four centralities for the ability of discriminating inter-links from intra-links on the ZKCN, the LFR synthetic network (mu = 0.5) and the SceDIP, successively. In particular, we compare them for two important quantities. The first one is the fraction of vertices contained in the giant component, denoted by RGC [23]. A sudden decline of RGC is observed if the network disintegrates after the deletion of a certain fraction of edges. The second quantity is the so-called normalized susceptibility [23], defined as
$$S = \sum_{s < s_{\max}} \frac{n_s s^2}{N},$$
where n_s is the number of components with size s, N is the size of the whole network and the sum runs over all of the components except the largest one. When S is plotted as a function of the fraction of removed edges f, an obvious peak can be observed that corresponds to the precise point at which the network disintegrates [23,24].
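As a hedged illustration of the two quantities just described, the sketch below computes RGC and the normalized susceptibility S from the component sizes of a network. The adjacency-set representation and the function names are assumptions for the example; when several components tie for the largest size, only one of them is excluded from the sum.

```python
def component_sizes(adj):
    """Sizes of the connected components of a graph given as adjacency sets."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sizes

def rgc_and_susceptibility(adj):
    """Return (RGC, S) for the current state of the network."""
    sizes = sorted(component_sizes(adj), reverse=True)
    n = len(adj)
    rgc = sizes[0] / n  # fraction of vertices in the giant component
    # Summing size^2 over each non-largest component equals sum_s n_s * s^2.
    s = sum(size ** 2 for size in sizes[1:]) / n
    return rgc, s

# Components of sizes 4, 2 and 1 (N = 7).
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}, 5: {6}, 6: {5}, 7: set()}
rgc, s = rgc_and_susceptibility(adj)
```

For this example RGC is 4/7 and S is (2² + 1²)/7 = 5/7. Calling the function after each edge removal and plotting both values against the fraction of removed edges f would give curves of the kind discussed in the text.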
In figures 2(a), (d) and (g), we plot the scatters of edge anti-triangle centrality and the logarithm of edge P4 centrality on the ZKCN, the LFR synthetic network (mu = 0.5) and the SceDIP, respectively. Figures 2(b), (e) and (h) show the scatters of the edge clustering coefficient and the logarithm of edge P4 centrality, also on these three networks. Figures 2(c), (f) and (i) demonstrate the scatters of the logarithm of edge betweenness and the logarithm of edge P4 centrality. Here, we utilize the logarithm function for edge betweenness and edge P4 centrality to obtain the same order of magnitude as the edge anti-triangle centrality and edge clustering coefficient. As expected, on all three classical networks, the edge P4 centrality is positively correlated to the edge anti-triangle centrality and edge betweenness, while it is negatively correlated to the edge clustering coefficient, although these relations are not revealed very rigorously from visual inspection. In particular, notice that the edges with the highest edge P4 centrality are neither always those with the highest edge anti-triangle centrality and edge betweenness nor those with the lowest edge clustering coefficient. By figure 2, the positive relations between the edge P4 centrality and edge betweenness are revealed more distinctly than with the other two centrality measures. Figures 3(a)-(c) compare the edge P4 centrality, edge anti-triangle centrality, edge clustering coefficient and edge betweenness for the ability of discriminating inter-links from intra-links from the point of view of RGC on the ZKCN, the LFR synthetic network (mu = 0.5) and the SceDIP, respectively, and figures 3(d)-(f) do so correspondingly from the point of view of normalized susceptibility S. Due to the high computational cost of edge betweenness on the SceDIP we do not make comparisons to it in figures 3(c) and (f).
As figure 3 shows, on the ZKCN, edge P4 centrality performs better from the point of view of RGC and S, while on the synthetic network and on the SceDIP it can be slightly poorer than the edge anti-triangle centrality and edge betweenness, and it is better than the edge clustering coefficient on all three networks. Although edge P4 centrality is slightly poorer than edge anti-triangle centrality and edge betweenness from the point of view of RGC and S on synthetic networks and on the SceDIP, the communities obtained by the EPCA based on the edge P4 centrality are much better than those of the algorithms based on edge anti-triangle centrality and edge betweenness. This is explained by our observation that the edges of the highest edge P4 centrality do not correspond to the edges of the highest edge anti-triangle centrality or edge betweenness; so, the respective algorithms, which delete these edges of high centrality, will make different deletion choices early in their execution. Comparing edge P4 centrality with edge anti-triangle centrality, edge clustering coefficient and edge betweenness by plotting scatters and from the point of view of RGC and S, we can summarize that edge P4 centrality is appropriate for community detection and obtains a significant competitive advantage over other processes that remove edges.

EPCA for cograph community detection
We assume that the network G(V, E) is connected, undirected and unweighted. The EPCA repeatedly removes the edge with the highest edge P4 centrality score until the scores of the remaining edges are all zero. The EPCA is described in detail as follows:

Input: G(V, E)
Output: cograph communities
Calculate the P4 centrality score for each available edge
While the highest score ≠ 0 do
    Remove the edge with the highest score
    Recalculate the scores of those edges affected by the removal
End

The EPCA is a typical divisive algorithm for community detection, but it possesses two significant differences from the general divisive algorithm. First, the EPCA does not need to remove the edges one by one until there is no edge left in the complex network. It just removes part of the whole edge set until the P4 centrality scores of the remaining edges are all zero; this is sometimes only a small portion of the edge set. This makes it more computationally efficient and free of any parameters. Second, unlike the general divisive algorithms, which depend on additional measures to decide the community structure, the EPCA does not depend on any additional measures, and it outputs the current components as the expected cograph communities. The remaining components are cographs since they do not contain a P4, and they possess additional algorithmic and structural properties.
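The loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: the graph is a dictionary of symmetric adjacency sets, the names `p4_scores`, `connected_components` and `epca` are made up for the example, ties for the highest score are broken arbitrarily, and all scores are recomputed by brute force after each removal instead of updating only the affected edges.

```python
from itertools import combinations

def p4_scores(adj):
    """Edge P4 centrality by brute force over all 4-subsets of vertices."""
    score = {frozenset((u, v)): 0 for u in adj for v in adj[u]}
    for quad in combinations(list(adj), 4):
        edges = [(u, v) for u, v in combinations(quad, 2) if v in adj[u]]
        if len(edges) != 3:
            continue  # an induced P4 has exactly three edges
        deg = {}
        for u, v in edges:
            deg[u] = deg.get(u, 0) + 1
            deg[v] = deg.get(v, 0) + 1
        # Rules out the star (1,1,1,3) and triangle-plus-vertex (2,2,2).
        if sorted(deg.values()) == [1, 1, 2, 2]:
            for u, v in edges:
                score[frozenset((u, v))] += 1
    return score

def connected_components(adj):
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def epca(adj):
    """Remove highest-scoring edges until every P4 centrality is zero."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    removed = []
    while True:
        score = p4_scores(adj)  # simplification: full recomputation
        if not score or max(score.values()) == 0:
            break
        edge = max(score, key=score.get)
        u, v = tuple(edge)
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append(edge)
    return connected_components(adj), removed

# Two triangles joined by the single bridge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
communities, removed = epca(adj)
```

On this toy network the bridge 3−4 lies on all four induced P4s, so it is the only edge removed and the two triangles are returned as cograph communities.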
The P4 centrality is a local centrality measure. The complexity of the EPCA is the same as that of EACH [16], and the total space complexity of the EPCA is O(E). The computational time complexity is O(E k^2 + T k^4), where E is the number of edges, k is the average degree of the networks and T is the maximum number of iterations. Here, we want to emphasize that T is not a real parameter of the approach EPCA and does not need to be fixed a priori. The condition of the highest P4 centrality score of the available edges equaling zero is the only condition used for ending the loop. The value T is just used to represent the maximum number of iterations to express the complexity of the EPCA for convenience. The space and time complexities of other state-of-the-art algorithms are listed in [16], and the complexity of the EPCA is much lower than the others and is the same as that of the algorithm EACH.

Properties of a cograph community
A cograph community is a connected cograph, which is a special structure that has graceful topological properties evidenced by its unique cotree. The sub-communities or modules of a cograph community are not just groups of tightly cohesive vertices, as in the case of traditional communities; they also include sparsely connected subgroups of vertices which are structurally identical. From the definitions of cographs and P4, the diameter of a cograph community is at most two since there is no induced P4, and as a whole community it reveals more intensive social roles or biological functions than those obtained by other traditional algorithms, as demonstrated in section 4.1. Cographs possess a range of algorithmic and structural properties; for example, many otherwise NP-hard problems, such as coloring, clique detection and Hamiltonicity, can be solved in polynomial time on cographs [14,15].
Here, however, we are especially interested in the unique cotree representation that can be constructed for a cograph and in extracting information on subgroups from it. Almost all of the properties of cographs are revealed by the corresponding cotree; so, constructing the cotree is a prerequisite to making a comprehensive and deep study on cographs. The vertices of a cograph community, organized by the cotree, can reveal more lucidly the interior structure and provide a convenient framework for making a next-level analysis. One can construct a cotree in linear time (that is, in time proportional to the time required to simply read the graph) by the algorithms given in the papers [14,15]. Note that the cotree for a particular cograph is unique up to a permutation of the children of the internal vertices.
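To make the cotree concrete, the following sketch builds one by a simple recursion: a disconnected graph becomes a node labelled 0 (union) over its components, and a graph whose complement is disconnected becomes a node labelled 1 (join) over its co-components. This is not the linear-time algorithm of [14,15]; it is a slower textbook-style decomposition, and the tuple representation `(label, children)` and the function names are assumptions for the example.

```python
def components(vertices, adj):
    """Connected components of the subgraph induced on `vertices`."""
    vertices = set(vertices)
    seen, comps = set(), []
    for start in vertices:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            for v in adj[u] & vertices:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def cotree(vertices, adj):
    """Return a leaf (a vertex) or a pair (label, children); 0 = union, 1 = join.

    Raises ValueError if the induced subgraph is not a cograph.
    """
    vertices = set(vertices)
    if len(vertices) == 1:
        return next(iter(vertices))
    comps = components(vertices, adj)
    if len(comps) > 1:
        return (0, [cotree(c, adj) for c in comps])
    # Connected: split along the components of the complement instead.
    co_adj = {u: (vertices - adj[u]) - {u} for u in vertices}
    co_comps = components(vertices, co_adj)
    if len(co_comps) > 1:
        return (1, [cotree(c, adj) for c in co_comps])
    # Both the graph and its complement are connected: an induced P4 exists.
    raise ValueError("not a cograph: contains an induced P4")

# The 4-cycle a-b-c-d-a is the join of the two non-adjacent pairs {a, c}, {b, d}.
adj = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
tree = cotree(adj.keys(), adj)
```

For the 4-cycle the root is labelled 1 and has two children labelled 0, one over the leaves a and c and one over b and d; the order of children may vary, in line with the uniqueness up to permutation noted above.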
The cotree possesses the advantage of revealing various subgroups of the vertices of the corresponding community. The vertices belonging to the same subgroup are characterized as those having the same adjacency behavior to the remaining nodes in the community. That is, two vertices are in the same subgroup if their neighbours outside of the subgroup are identical. The trivial subgroups are the whole community and each of the single vertices. In addition to these trivial ones, the most basic subgroups are the strong siblings or weak siblings. In a general subgroup, if vertex u of this subgroup is adjacent to some vertex v which is not in the subgroup, all of the vertices of this subgroup are adjacent to v; this means the vertices of this subgroup possess the same connecting pattern. Similarly, if vertex u (in the subgroup) is not adjacent to vertex v (outside of the subgroup), none of the vertices in this subgroup are adjacent to v; that means they possess the same disconnecting pattern. These subgroups have been called homogeneous sets, modules or indistinguishable sets due to the fact that all of the vertices in the subgroup interact with the rest of the vertices (outside of the subgroup) in identical ways.
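The most basic subgroups mentioned above, pairs of strong and weak siblings, can be found directly from neighbourhoods without building the cotree. The sketch below is an illustrative quadratic-time check; the adjacency-set representation and the name `siblings` are assumptions for the example.

```python
from itertools import combinations

def siblings(adj):
    """Return (strong, weak) sibling pairs of a graph given as adjacency sets.

    Two vertices are siblings when they have the same neighbours apart from
    each other; the pair is strong if they are adjacent and weak otherwise.
    """
    strong, weak = [], []
    for u, v in combinations(sorted(adj), 2):
        if adj[u] - {v} == adj[v] - {u}:
            (strong if v in adj[u] else weak).append((u, v))
    return strong, weak

# a and b are adjacent and both see c and d; c and d are not adjacent.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"},
       "c": {"a", "b"}, "d": {"a", "b"}}
strong, weak = siblings(adj)
```

Here a and b form a strong sibling pair and c and d a weak one: c and d share no edge, yet they relate to the rest of the network in exactly the same way.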
As shown clearly in figure 1(c), the vertices v and u are strong siblings in the pale-green region, and the vertices w and y form another pair of strong siblings in the hazel region, while the vertices b and c are weak siblings in the light-pink region, and vertices a and x are weak siblings. The larger subgroup of b, c, d and e is determined by the outer topological environment by being completely connected to the subgroup of v, u, w and y while being disconnected to the subgroup of z, a and x.
Finding various meaningful subgroups of cograph communities according to their neighbours and non-neighbours indeed brings new angles to investigating the relationships among the members of the communities. What we want to emphasize is that these meaningful subgroups within cograph communities cannot be detected by the traditional hierarchical algorithms, which focus only on obtaining hierarchical communities. The essential difference is that a set of vertices can form a structurally-equivalent subgroup even when there is no edge joining any two of them. Traditional community-detection algorithms that depend on identifying dense clusters will never associate such a group of nodes together.
The vertices revealing similar functions or roles in real networks are not always densely linked by edges; they may instead be only sparsely linked, as introduced by [25]. For these reasons, traditional hierarchical community-detection methods cannot always find the sparse groups very accurately. It is reasonable and novel to investigate the subgroups within cographs according to their outer topology.
The cotree has the natural advantage of demonstrating various subgroups according to their structural similarity, and we can identify these sets easily. Thus, analyzing cograph communities based on their cotrees is novel and very fascinating. Several examples of a cotree analysis of the cograph communities or modules obtained by the EPCA on the practical networks are made in detail in section 4.2.

Experiments and analyses
We present the experiments and analyses in this section. In section 4.1 we compare the performance of the EPCA with algorithms from the existing literature on synthetic, social and biological networks. In section 4.2 we perform a next-level analysis of the cograph communities based on their corresponding cotrees to obtain various meaningful subgroups.

Accuracy analyses
Before making comparisons, we first introduce why we select these state-of-the-art algorithms; how we implement them for performance comparison; where we obtain the synthetic, social and biological networks, the protein complex golden standard sets and the high-level GO term sets; and what criteria we use to evaluate the performance of the selected algorithms. After that, we compare all of the algorithms on synthetic, social and biological networks to show that the EPCA has competitive or superior performance.
Considering the EPCA as a typical non-overlapping community or module detection algorithm, the main compared algorithms here are non-overlapping identification algorithms. The state-of-the-art algorithms GN [18,19], ECCA_Q [16,17], ECCA_D [16,17], EAC [16] and EACH [16] are selected, as these algorithms are edge-centrality-based. In particular, GN is based on edge betweenness centrality, which is a typical global centrality, and depends on the Q value [18] to decide the community structure. Both the ECCA_Q and ECCA_D are based on the edge clustering coefficient, which is a typical local centrality. The ECCA_Q and ECCA_D indicate the ECCA based on the additional measures of the Q and D values [26], respectively. The EAC and EACH are based on the anti-triangle centrality, whereas the EACH varies from the EAC by just including an added isolated vertex handling strategy. Neither of them depends on an additional measure while deciding the community structure. The NMF [27,28] and SC [29,30] possess matrix theory supports, while the CNM [31] attempts to optimize the additional measures to decide the community structure. The MCL [32] is based on random walks and is well known for its robustness. The INFOMAP [33] has significant accuracy performance, as reported in [34]. The OSLOM [35] is the only compared algorithm that detects overlapping communities; here, we use OSLOM2, a much faster version from http://oslom.org/software.htm, instead. The LOUVAIN [36] is very fast and widely used for community detection. We use the NodeXL (http://nodexl.codeplex.com/) implementations of the GN and CNM. The ECCA is implemented according to [17]; the NMF and SC are implementations in the R packages NMFN [37] and clusterSim [38], respectively. Lastly, we obtain the source code of the MCL (http://micans.org/mcl). The INFOMAP is implemented by the R package igraph [39], and we get the MATLAB version for the LOUVAIN from http://perso.uclouvain.be/vincent.blondel/research/louvain.html.
All of the parameters are at default, as set in the corresponding tools or packages for the available algorithms.
For the sake of convenience, we first list several networks used in the experiments in table 1: the series of LFR synthetic networks (SNs) [21], the Zachary karate club network (ZKCN) [20], the Political books network (PBN) (http://www.orgnet.com/), the Bottlenose dolphins network (BDN) [40] and the Football network (FN) [19,41]. The parameters of the LFR synthetic network are: average degree k = 15, mixing parameter mu = 0.5, minimum community size minc = 20 and maximum community size maxc = 50. Here, we set mu = 0.5 since its median is 0.5. In fact, aside from mu, all of the other parameters are defaults from the original code (http://santo.fortunato.googlepages.com/inthepress2). Here, the SceDIP represents the S. cerevisiae PPI networks obtained from the DIP [22], and HsaHPRD represents the H. sapiens PPI networks extracted from HPRD [42]. We use the largest components of these two networks as the input of the algorithms. There are four protein complex golden standards: for the SceDIP we use the Munich Information Center for Protein Sequences (MIPS) [43] and the Saccharomyces Genome Database (SGD) [44] golden standards, while for HsaHPRD the golden standards are the Human Protein Complex Database with a Complex Quality Index (PCDq) [45] and the Comprehensive Resource of Mammalian Protein Complexes (CORUM) [46]. We remove the golden standard protein complexes that consist of fewer than 2 proteins. The GO terms are not all of the terms but are instead the high-level GO terms, which have information content that is more than 2 [47]. The definition of the information content IC(g) of a GO term g is
$$IC(g) = -\log\left(\frac{|g|}{|root|}\right),$$
as given in the literature [47], where 'root' is the corresponding root GO term (molecular function (MF), biological process (BP) or cellular component (CC)) of g and |·| denotes the number of proteins annotated with a term. In addition, the GO terms that contain fewer than 2 proteins are removed.
Lastly, we remove the protein complexes or GO terms of which no members appear in the corresponding PPI networks. The details of the SceDIP and HsaHPRD, the complex golden standards and the GO terms are also listed in table 1.
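The GO term filtering steps above can be illustrated with a short sketch. It assumes the common form IC(g) = −log(|g| / |root|) with |g| the number of proteins annotated with g, uses the natural logarithm (the base is not stated in the text), and invents the names `information_content` and `high_level_terms` and the toy data purely for illustration.

```python
import math

def information_content(term_size, root_size):
    """IC(g) = -log(|g| / |root|); natural logarithm is an assumption."""
    return -math.log(term_size / root_size)

def high_level_terms(annotations, root_size, network_proteins,
                     min_ic=2.0, min_size=2):
    """Keep GO terms with IC > min_ic, at least min_size proteins and
    at least one member present in the PPI network.

    `annotations` maps a term name to the set of proteins annotated with it.
    """
    kept = []
    for term, proteins in annotations.items():
        if len(proteins) < min_size:
            continue  # terms with fewer than 2 proteins are removed
        if information_content(len(proteins), root_size) <= min_ic:
            continue  # not informative enough to count as high-level
        if not proteins & network_proteins:
            continue  # no member appears in the network
        kept.append(term)
    return sorted(kept)

# Hypothetical toy data: a root term with 100 annotated proteins.
annotations = {
    "A": {f"p{i}" for i in range(5)},        # IC = -ln(0.05), about 3.0
    "B": {f"p{i}" for i in range(50)},       # IC = -ln(0.5), about 0.69
    "C": {"p0"},                             # fewer than 2 proteins
    "D": {"q1", "q2", "q3"},                 # no member in the network
}
network = {f"p{i}" for i in range(100)}
kept = high_level_terms(annotations, root_size=100, network_proteins=network)
```

With these toy inputs only term A survives all three filters.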

Synthetic networks and social networks
To quantify the accuracy performance of the compared algorithms on synthetic and social networks, we adopt the widely used normalized mutual information (NMI) [48,49] to measure the similarities between the obtained communities and the real ones. The details of this issue are introduced in appendix B.
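For readers who want to reproduce this kind of comparison, a minimal NMI sketch for two non-overlapping partitions follows. It assumes the widely used form NMI = 2 I(A; B) / (H(A) + H(B)); the exact variant used in the paper is specified in its appendix B, which is not reproduced here, so the function below should be read as an illustration only.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two label sequences.

    labels_a[i] and labels_b[i] are the community labels of vertex i in the
    two partitions. Returns a value in [0, 1]; 1 means identical partitions.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mi = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both partitions put every vertex in one community
    return 2 * mi / (h_a + h_b)

same = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])      # same split, renamed
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])          # independent splits
```

The first call returns 1 because the two partitions differ only in label names, and the second returns 0 because knowing one partition says nothing about the other.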
We compare the NMI values of the results obtained by the compared algorithms on the synthetic networks, as shown in figure 4. Each point of the figure corresponds to the average NMI value over 20 LFR networks constructed with the same parameters. The NMI values of all of the algorithms decrease as the mixing parameter mu increases. The reason for this is that the community structures of the LFR networks become increasingly fuzzy and thus are more difficult to detect correctly as mu increases. In figure 4, the black line represents the NMI value of the EPCA, and the results of the other algorithms are indicated by the corresponding colored lines with markers. The INFOMAP obtains the best performance among these compared algorithms, as reported in [34], and OSLOM2 and LOUVAIN are not far behind. As figure 4 shows, the EPCA has superior performance compared with the algorithms SC and CNM across all of the networks, while for mu ⩾ 0.6 it is better than MCL, ECCA_D and ECCA_Q.
Note that the algorithms INFOMAP, OSLOM2, LOUVAIN, GN, ECCA_D and ECCA_Q need to set the number of output communities or must depend on additional measures to decide the structure of the communities. In addition, GN, which is based on the global edge betweenness centrality, has a very high computational cost, while the inflation parameter of MCL directly affects the granularity of the communities. Accounting for these factors, the EPCA achieves impressive performance in general on synthetic networks since it does not need additional measures, is free of parameters and is based on a typical local edge centrality at a very low computational cost.
Then, we compare the NMI values of the results obtained by the compared algorithms on the social networks, as shown in table 2. As depicted in table 2, the EPCA has competitive performance on the four social networks. Although the GN, ECCA_D, ECCA_Q, MCL, SC, NMF, INFOMAP and OSLOM2 algorithms obtain slightly better NMI values on some networks, they all depend on additional measures or a series of parameters; among them, SC and NMF also need to set the number of expected communities, which may bring in great difficulties if we do not have prior knowledge of the networks. The NMI values of the EAC and EPCA reveal that the edge P4 centrality and the edge anti-triangle centrality [16] have similar performances for community detection on small networks, while the EPCA may produce fewer isolated vertices on these social networks. The better performance of EACH is just due to its isolated vertex handling strategy. However, the communities obtained by EACH are no longer cographs after the isolated vertex handler is used, since the diameter of the communities obtained by EACH may be as large as four. So, the results obtained from EACH are not conducive to a cotree-based deeper analysis of those communities. Here, the EPCA is free of any parameters and has lower computational cost. Detecting communities by the EPCA not only gives competitive accuracy but also provides cograph communities on which we can perform a deeper analysis into the meaningful subgroups.
Figure 5 shows the cograph communities obtained by the EPCA on the ZKCN in detail. As figure 5(a) shows, the ZKCN consists of 34 vertices and 78 edges, representing 34 members and 78 social relationships among the members of the karate club. The club suffered a division which split it into two, and the split very closely corresponds to a minimum cut that separates the two most influential opposing individuals, vertices 1 and 34. As shown in figure 5(b), the EPCA obtains 3 communities and one isolated vertex by just removing 23 edges. In essence, the 3 cograph communities obtained by the EPCA match the practical division into two communities comparatively well. Furthermore, the EPCA purifies the community headed by vertex 1 by removing the attachment vertex 17 since it does not directly connect with the leader vertex 1, as shown in figure 5(b). The EPCA partitions the community led by vertex 34 by removing the sub-community that consists of vertices 32, 25 and 26, since vertices 25 and 26 do not directly connect with the leader vertex 34, as shown in figure 5(b).
Figure 6 depicts the cograph communities obtained by the EPCA on FN in detail. The FN consists of 115 vertices and 613 edges, representing 115 teams and 613 games played against each other, as shown in figure 6(a). The 115 teams are grouped into 11 conferences, with a 12th group of independent teams. The EPCA obtains 13 communities and 10 isolated vertices after removing 291 edges. Surprisingly, we find that the 13 cograph communities match the 12 groups comparatively well in general. Here, we focus on the nontrivial cograph communities and ignore the isolated vertices. In fact, the two cograph communities in the shaded area of figure 6(b) mainly correspond to the group presented by the red triangles in figure 6(a). This group is realized as two smaller sub-communities because it has the most members, and the members tend to be led by the two leaders Kent and BallState, respectively. The ten isolated vertices emerge from the yellow, bright-blue circles and the triangle groups, as shown in figure 6(b), all of which are relatively looser than the others.
The 12th group consists of 8 independent teams, presented by green triangles, as shown in figure 6(a); since they are the independent teams, two of them are likely mismatched into the green circle group, and another two are arranged into the two cograph communities in the shaded area, as shown in figure 6(b). Other than the independent and isolated teams, all of the other teams are arranged correctly.
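The remark that a community with a diameter of up to four hops can no longer be a cograph follows from the characterization of cographs as graphs without an induced path on four vertices (P4): any shortest path of three or more hops contains an induced P4. The following minimal sketch checks this property by brute force on small graphs. The helper names `make_graph`, `has_induced_p4` and `is_cograph` are hypothetical, and the exhaustive search is only an illustration, not the procedure used by the EPCA.

```python
from itertools import permutations


def make_graph(edges):
    """Build an undirected adjacency map {vertex: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def has_induced_p4(adj):
    """Return True if some four vertices induce a path a-b-c-d.

    Brute force over ordered 4-tuples, so only suitable for small graphs.
    """
    for a, b, c, d in permutations(adj, 4):
        path_edges = b in adj[a] and c in adj[b] and d in adj[c]
        chords = c in adj[a] or d in adj[a] or d in adj[b]
        if path_edges and not chords:
            return True
    return False


def is_cograph(adj):
    """A graph is a cograph exactly when it has no induced P4."""
    return not has_induced_p4(adj)


# Toy graphs (illustrative only, not taken from the paper's data sets).
star = make_graph([(0, 1), (0, 2), (0, 3)])   # diameter 2
path = make_graph([(1, 2), (2, 3), (3, 4)])   # diameter 3, the P4 itself

print(is_cograph(star))  # True
print(is_cograph(path))  # False
```

The star stays a cograph because every pair of leaves is two hops apart, whereas the four-vertex path already violates the condition.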

Biological networks
In the following, we perform experiments on biological networks and assess the quality of each algorithm for community or module detection by how well its modules predict protein complexes and GO terms. Protein complexes typically have a dense modular structure within which proteins are highly connected. To examine whether the detected modules capture protein functional relationships beyond protein complexes, we use the high-level GO terms in all three domains (MF, BP and CC) as the gold standards for GO term prediction.
To evaluate the performance of complex prediction, we use two independent quality measures [50] to assess the similarity between the predicted complexes and the gold standard reference complexes. In our experiments, we do not consider one-protein modules for any of the compared algorithms. The first measure counts the number of predicted modules that match the gold standards. A predicted module $N_1$ with $|V_{N_1}|$ proteins or genes is considered to match a reference module $N_2$ with $|V_{N_2}|$ proteins or genes when the neighborhood affinity satisfies
$$\mathrm{NA}(N_1, N_2) = \frac{|V_{N_1} \cap V_{N_2}|^2}{|V_{N_1}| \times |V_{N_2}|} \geqslant \omega,$$
where the threshold $\omega$ is usually set to 0.2 or 0.25 [51,52]. The second measure is the geometric mean of two other measures, the cluster-wise sensitivity ($Sn$) and the cluster-wise positive predictive value ($PPV$) [52]. Given $r$ predicted and $s$ reference complexes, let $t_{ij}$ denote the number of proteins that exist in both predicted complex $i$ and reference complex $j$, and let $w_j$ denote the number of proteins in reference complex $j$. Then, $Sn$ and $PPV$ are defined as
$$Sn = \frac{\sum_{j=1}^{s} \max_{1 \leqslant i \leqslant r} t_{ij}}{\sum_{j=1}^{s} w_j}, \qquad PPV = \frac{\sum_{i=1}^{r} \max_{1 \leqslant j \leqslant s} t_{ij}}{\sum_{i=1}^{r} \sum_{j=1}^{s} t_{ij}},$$
respectively. Since $Sn$ can reach its maximum by grouping all proteins in one module, and $PPV$ can be maximized by putting each protein in its own module, we use their geometric mean
$$Acc = \sqrt{Sn \times PPV} \tag{5}$$
as the 'accuracy' to balance these two measures [50,52], where a higher $Acc$ score indicates a better result.
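The two quality measures can be sketched in a few lines of Python. The sketch assumes the standard definitions from the cited literature: NA(A, B) = |A ∩ B|² / (|A| |B|), Sn as the sum over reference complexes of the best overlap divided by the total reference size, PPV as the sum over predicted modules of the best overlap divided by the total overlap, and Acc as their geometric mean. The function names and the toy complexes are hypothetical and are not taken from the paper's data.

```python
from math import sqrt


def neighborhood_affinity(a, b):
    """NA(A, B) = |A & B|^2 / (|A| * |B|); 0.0 if either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) ** 2 / (len(a) * len(b))


def count_matched(predicted, reference, omega=0.2):
    """Number of predicted modules matching at least one reference complex."""
    return sum(
        1 for p in predicted
        if any(neighborhood_affinity(p, r) >= omega for r in reference)
    )


def sn_ppv_acc(predicted, reference):
    """Cluster-wise sensitivity, positive predictive value and accuracy."""
    # t[i][j]: proteins shared by predicted module i and reference complex j
    t = [[len(set(p) & set(r)) for r in reference] for p in predicted]
    total_ref = sum(len(set(r)) for r in reference)
    total_overlap = sum(sum(row) for row in t)
    best_per_ref = sum(max(row[j] for row in t) for j in range(len(reference)))
    best_per_pred = sum(max(row) for row in t)
    sn = best_per_ref / total_ref if total_ref else 0.0
    ppv = best_per_pred / total_overlap if total_overlap else 0.0
    return sn, ppv, sqrt(sn * ppv)


# Toy example with made-up protein labels.
predicted = [{"A", "B", "C"}, {"D", "E"}]
reference = [{"A", "B"}, {"C", "D", "E"}]

sn, ppv, acc = sn_ppv_acc(predicted, reference)
print(count_matched(predicted, reference))  # 2
print(round(sn, 3), round(ppv, 3), round(acc, 3))  # 0.8 0.8 0.8
```

In this toy case, each predicted module shares two proteins with its best reference complex, so both Sn and PPV equal 4/5.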
To investigate the functional significance of the identified modules, we follow the same strategy as used in the literature [25,47] and compute the F-measure based on high-level GO term prediction. The neighborhood affinity score $\mathrm{NA}(p, rg)$ between a predicted module $p$ and a real GO term $rg$ is used to determine whether they match: if $\mathrm{NA}(p, rg) \geqslant \omega$, they are considered to match each other. Here, we set $\omega$ to 0.20, as was done in the literature [51]. Let $PC$ and $RG$ be the sets of modules predicted by a computational method and of real GO terms, respectively. $N_{cp}$ is the number of predicted modules that match at least one real GO term, and $N_{crg}$ is the number of real GO terms that match at least one predicted module. Precision ($P$) and recall ($R$) are defined as follows [53]:
$$P = \frac{N_{cp}}{|PC|}, \qquad R = \frac{N_{crg}}{|RG|}.$$
The F-measure ($F$) is the harmonic mean of precision and recall:
$$F = \frac{2 \times P \times R}{P + R}.$$
Among the algorithms compared in the previous section, GN, NMF and SC are not tested on the SceDIP and HsaHPRD. GN is excluded because it is too slow on large networks due to the expensive edge betweenness calculation. Both NMF and SC require the number of expected modules to be fixed, which introduces an inconvenient ambiguity when we face different gold standards. Although we could follow the strategy in the literature [25] and implement a grid search over the number of expected modules $k = 500 \sim 3000$ with an interval of 100, we consider this step length too large. For uniformity in the comparisons, we exclude these three algorithms.
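The GO term evaluation can be sketched in the same way. The sketch assumes that precision is the fraction of predicted modules matching at least one GO term, that recall is the fraction of GO terms matched by at least one predicted module, and that a match means a neighborhood affinity of at least ω = 0.2. The function names and the toy gene sets are hypothetical.

```python
def neighborhood_affinity(a, b):
    """NA(A, B) = |A & B|^2 / (|A| * |B|); 0.0 if either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) ** 2 / (len(a) * len(b))


def go_f_measure(predicted, go_terms, omega=0.2):
    """Return (precision, recall, F) for predicted modules against GO terms."""
    if not predicted or not go_terms:
        return 0.0, 0.0, 0.0
    # N_cp: predicted modules that match at least one GO term
    n_cp = sum(
        1 for p in predicted
        if any(neighborhood_affinity(p, g) >= omega for g in go_terms)
    )
    # N_crg: GO terms that match at least one predicted module
    n_crg = sum(
        1 for g in go_terms
        if any(neighborhood_affinity(p, g) >= omega for p in predicted)
    )
    precision = n_cp / len(predicted)
    recall = n_crg / len(go_terms)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example with made-up gene labels.
predicted = [{"A", "B", "C"}, {"X", "Y"}]
go_terms = [{"A", "B"}, {"C", "D", "E", "F", "G", "H"}, {"M", "N"}]

p, r, f = go_f_measure(predicted, go_terms)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.333 0.4
```

Only the first predicted module and the first GO term match here, which gives P = 1/2, R = 1/3 and F = 0.4.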
We first present the results of the protein complex prediction in table 3 and then the results of the GO term prediction in figure 7. Table 3 shows the performance of complex prediction on the SceDIP and HsaHPRD in detail. Its columns list the network tested, the gold standard, the algorithm compared, the number of covered proteins, the number of predicted modules, the average module size, the number of matched protein complexes, the cluster-wise sensitivity (Sn), the cluster-wise positive predictive value (PPV) and the accuracy score (Acc). We compare the results on the SceDIP against the S. cerevisiae protein complex gold standards MIPS and SGD, and the results on the HsaHPRD against the H. sapiens protein complex gold standards PCDq and CORUM.
As shown in table 3, the Acc scores of the EPCA are clearly better than those of the other algorithms on SGD. On MIPS, the EPCA scores 0.3749, which is lower than the highest score of 0.3960, obtained by INFOMAP; on PCDq, it scores 0.4607, only slightly lower than the highest score of 0.4613, obtained by ECCA$_D$; and on CORUM, it scores 0.3088, also slightly lower than the highest score of 0.3153, obtained by MCL. The EPCA matches the largest number of protein complexes among the compared algorithms on both MIPS and PCDq. The average module size of the EPCA on PCDq is 4.53, which is close to the average size of 4.51 of the reference complexes in the gold standard. Considering both the Acc scores and the numbers of matched complexes, the EPCA, MCL and ECCA$_D$ are the competitive algorithms, and they all outperform the others substantially.
For the GO term prediction, figure 7(a) illustrates the F-measure of the compared algorithms, and figure 7(b) shows the percentage of GO terms that are correctly matched to at least one of the identified modules by the different algorithms on the SceDIP and HsaHPRD. Figure 7 also clearly shows that the EPCA, MCL and ECCA$_D$ are competitive and outperform the others: the EPCA obtains better F-measure scores on both the SceDIP and HsaHPRD while having slightly fewer matched GO terms than ECCA$_D$ on the SceDIP. In summary, the EPCA is more attractive than MCL and ECCA$_D$ because it is free of any parameters and has a comparatively lower computational cost. By contrast, the inflation parameter of MCL affects the granularity of its modules, and ECCA$_D$ depends on an additional measure, the D value. Also, ECCA$_D$ performs better than ECCA$_Q$, which underlines the drawback that the same algorithm operating with different additional measures can lead to different results.
To make the comparison more intuitive, figure 8 shows, as an example, the Arp2/3 complex as predicted by the compared methods. The Arp2/3 complex consists of seven protein subunits and plays a major role in the regulation of the actin cytoskeleton. As figure 8 shows, the EPCA detects the complex perfectly, while MCL obtains a module of ten proteins, three of which do not belong to the complex. ECCA$_D$ recovers four proteins of the Arp2/3 complex. None of EAC, EACH, INFOMAP and LOUVAIN can extract a candidate complex that includes YDL029W, which is an essential protein member of the Arp2/3 complex. OSLOM2 obtains a candidate complex of 20 proteins, 13 of which are not correct members. Finally, CNM and ECCA$_Q$ cannot obtain a valuable candidate complex for the Arp2/3 complex.

Various meaningful subgroups realized as cotree subgroups
In this section, we carry out a next-level analysis of the cograph modules obtained by the EPCA, based on their corresponding cotrees, to obtain various meaningful subgroups. As introduced in section 3.2, the structure of a cograph module is shown more explicitly by the corresponding cotree. The reason for detecting subgroups in this way is that the vertices belonging to a subgroup must be structurally similar with respect to the rest of the network. This strategy lets us find not only dense subgroups but also sparse subgroups, and even subgroups with no internal connections. The ability to find sparse or non-connected subgroups is a remarkable property that demonstrates the advantage of analyzing cograph modules by cotrees. In the following, as shown in figure 9, we analyze in detail four typical cograph modules obtained from the HsaHPRD based on their cotrees. Figures 9(a), (c), (e) and (g) depict the first, second, third and fourth cograph modules, while (b), (d), (f) and (h) display the corresponding cotrees, respectively.
Figures 9(a) and (b) show the first cograph module, which consists of 6 genes, and its corresponding cotree. The three genes PIWIL1, PIWIL4 and PIWIL2 form the weak sibling subgroup in the light-blue region; they are all adjacent to DICER1 and nonadjacent to the strong siblings TARBP2 and PRKRA. In fact, the weak sibling subgroup of PIWIL1, PIWIL4 and PIWIL2 perfectly matches the GO term 'piRNA binding' (GO:0034584). We emphasize that there are no interactions among these three genes, so traditional density-based algorithms can never identify the subgroup matching the term GO:0034584. Meanwhile, the three genes TARBP2, PRKRA and DICER1 compose a subgroup, shown in figure 9(b), which perfectly matches the term 'RNA interference, production of siRNA' (GO:0030422). The strong siblings TARBP2 and PRKRA interact with the gene DICER1; together, they compose the larger dense subgroup of TARBP2, PRKRA and DICER1, which is in fact a clique, as shown in figure 9(a).
The second cograph module and its cotree are shown in figures 9(c) and (d), respectively. The cotree clearly reveals two distinct subgroups of weak siblings. One of them consists of NPY1R, NPY2R, PPYR1 and NPY5R in the light-yellow region and perfectly matches the 1139th term of the 4457 gold standard terms. Furthermore, three genes of this subgroup are exactly the members of the 1396th term. We again emphasize that there are no interactions among these genes, so density-based clustering methods would fail to detect them.
The third cograph module and its cotree are shown in figures 9(e) and (f), respectively. The module consists of 9 genes that form a typical star-shaped subgraph, and its cotree immediately yields two subgroups: the center BTK, and the weak siblings comprising the remaining 8 genes. Among these 8 genes, the three in the light-blue region perfectly match the term '1-phosphatidylinositol-5-phosphate 4-kinase activity' (GO:0016309).
The fourth cograph module and its cotree are shown in figures 9(g) and (h), respectively. The module consists of 10 genes, and distinct subgroups are revealed by the cotree. Surprisingly, the strong siblings BID and BAK1 in the light-blue region match three terms simultaneously: 'activation and oligomerization of BAK protein' (REAC:111452), 'tBID activates BAK protein' (REAC:139895) and 'tBID binds to inactive BAK protein' (REAC:168848). Also, the genes BIK and BOK from the weak siblings and the gene BAK1 from the strong siblings together compose a subgroup that exactly matches the 1353rd term. The cotree structure predicts further meaningful subgroups; in particular, it predicts that the weak siblings PMAIP1 and BBC3 have the same or similar functions, since they are structurally equivalent.
In summary, analyzing the cograph modules based on their corresponding cotrees leads to an immediate prediction of distinctive subgroups. The subgroups revealed by the cotree may be dense or sparse, and may even have no internal connections. Most importantly, some of the subgroups revealed in this manner have significant biological meaning and perfectly match the corresponding GO terms.
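The sibling subgroups of the first module can be reproduced with a short sketch. It rests on two assumptions that are not spelled out in this section: that weak siblings are pairwise nonadjacent vertices with identical neighborhoods, and that strong siblings are pairwise adjacent vertices with identical closed neighborhoods. The edge list is reconstructed from the textual description of figure 9(a), and the function names are hypothetical.

```python
from collections import defaultdict


def make_graph(edges):
    """Build an undirected adjacency map {vertex: set of neighbours}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def sibling_groups(adj):
    """Group vertices that are structurally equivalent.

    Weak groups share the same open neighbourhood (so members are
    nonadjacent); strong groups share the same closed neighbourhood
    (so members are adjacent). Singleton groups are dropped.
    """
    weak, strong = defaultdict(list), defaultdict(list)
    for v, nbrs in adj.items():
        weak[frozenset(nbrs)].append(v)
        strong[frozenset(nbrs | {v})].append(v)
    weak_groups = sorted(sorted(g) for g in weak.values() if len(g) > 1)
    strong_groups = sorted(sorted(g) for g in strong.values() if len(g) > 1)
    return weak_groups, strong_groups


# First cograph module, as described in the text for figure 9(a).
module = make_graph([
    ("DICER1", "PIWIL1"), ("DICER1", "PIWIL2"), ("DICER1", "PIWIL4"),
    ("DICER1", "TARBP2"), ("DICER1", "PRKRA"), ("TARBP2", "PRKRA"),
])

weak_groups, strong_groups = sibling_groups(module)
print(weak_groups)    # [['PIWIL1', 'PIWIL2', 'PIWIL4']]
print(strong_groups)  # [['PRKRA', 'TARBP2']]
```

The PIWIL genes are grouped even though no edge joins them, which is the kind of sparse subgroup a purely density-based criterion would miss.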

Conclusion and discussion
In this paper, we propose the novel cograph community and develop an approach (EPCA) for extracting cograph communities based on edge P4 centrality. We compare the EPCA with algorithms from the existing literature on synthetic, social and biological networks to show that the EPCA has superior or competitive performance in accuracy and speed, in addition to having the advantages of being free of any parameters and independent of additional measures. The cograph communities have a fine granularity, and their diameters are at most two hops. More importantly, cograph communities exhibit a specialized internal structure, which decomposes the community into structurally-equivalent subgroups. The equivalence to cotrees allows a simple pictorial view for performing a next-level structural analysis for the purpose of finding meaningful subgroups that have functional similarity. In particular, these structurally-equivalent subgroups reveal homogeneous roles or functions that cannot be detected by traditional hierarchical clustering algorithms, which depend on edge density for community detection. Analyzing networks with cograph communities can contribute greatly to understanding the global structures and local structures of the networks more easily and distinctly. Since edge P4 centrality is defined for unweighted graphs, we cannot currently use the EPCA to detect cograph communities on weighted and directed networks. In a future study, we will attempt to develop an extended version of P4 centrality for weighted and directed networks and propose a framework for detecting overlapping and hierarchical [6,48,54] cograph communities. Being able to identify structurally equivalent subgroups in directed networks may have immediate applications to network controllability [55] by controlling communities with few drivers and thus controlling the whole network efficiently.
Figure 8. Illustration of the results predicted by the compared algorithms about the Arp2/3 complex. (a) The real Arp2/3 complex and (b-i) the candidate Arp2/3 complexes predicted by ECCA$_D$, MCL, EAC, EACH, INFOMAP, OSLOM2, LOUVAIN and EPCA, respectively, where the proteins in orange are members of the real Arp2/3 complex and those in green are not. CNM and ECCA$_Q$ cannot extract a valuable candidate complex for the Arp2/3 complex.
We believe that many interesting applications can be obtained by studying other real biological or complex networks with the EPCA and cotrees. Another avenue of potential research is to explore other uses of these cograph communities and their cotree representations, since they possess a multitude of algorithmic and structural properties and benefits besides subgroup detection.
We do, however, stress that although the EPCA offers numerous benefits, such as polynomial runtime, ease of implementation, accuracy in finding cograph communities and the inherent ability to detect meaningful subgroups, it also suffers from the transition between the undetectable and detectable regimes, like virtually all community-finding algorithms [56-60]. We illustrate the details of this transition in appendix C, where we test the EPCA on block model networks analogous to those used by Radicchi in [59].