
Dynamic community discovery via common subspace projection


Published 19 March 2021 © 2021 The Author(s). Published by IOP Publishing Ltd on behalf of the Institute of Physics and Deutsche Physikalische Gesellschaft
Citation Lanlan Yu et al 2021 New J. Phys. 23 033029 DOI 10.1088/1367-2630/abe504


1367-2630/23/3/033029

Abstract

Detecting communities with dense internal and sparse external interactions in dynamically evolving networks has become increasingly important owing to its wide applications in diverse fields. Conventional solutions based on static community detection approaches treat each snapshot of a dynamic network independently, which may fragment communities in time (Aynaud T and Guillaume J L 2010 8th Int. Symp. on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (IEEE) pp 513–9), resulting in the problem of instability. In this work, we develop a novel dynamic community detection algorithm by leveraging the encoding–decoding scheme present in a succinct network representation method to reconstruct each snapshot via a common low-dimensional subspace. This reconstruction removes non-significant links and highlights the community structures, which mitigates community instability to a large degree. We conduct experiments on simulated data and real social networking data with ground truths (GT) and compare the proposed method with several baselines. Our method is shown to be more stable, without missing communities, and more effective than the baselines, with competitive performance. In addition, the distribution of community sizes produced by our method is more in line with the real distribution than those of the baselines.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Community detection, also known as node clustering, is one of the most active topics in the field of graph mining and network science [2–5]. The task of community detection is to group the vertices of a graph into clusters by considering the connection structure, in such a way that the edges within the clusters far outnumber the edges between them. Generally, the wide variety of grouping techniques are based on similarity measures defined on structural properties of the vertices [6–8]. As community detection has been used extensively in many applications, such as identifying political parties [9], genetically similar structures [6] and fraud in telecommunication networks [10], there are many community detection methods available in the literature [11, 12]. Most existing approaches are developed to tackle static community discovery, where network connections are fixed in time. However, in many real-world systems, the relationships between the nodes are constantly changing, leading to the evolution of community structures. That is, a community may grow by absorbing new members, shrink when some of its nodes leave, or even disappear after a period of time. Examples include social networks such as active collaboration circles [13] and video sharing [14], communication networks such as email communications [15], and food webs [16].

Though the conventional community detection methods for static networks can be applied in dynamic scenarios when a time-evolving network is represented by a sequence of snapshots, a fundamental drawback of such a scheme is that most traditional algorithms are sensitive to tiny changes in the network structure [1, 17–19]. Specifically, for two similar networks, an algorithm can provide very different partition results even if only a few links are disturbed. Besides, as dynamic community discovery is usually associated with community tracking [20], a framework that identifies communities in each snapshot independently needs an additional set matching operation to obtain the evolution trajectories of the nodes' affiliations. Instead of detecting and tracking communities in separate phases, we propose to achieve the two goals simultaneously. Precisely, we tailor the succinct representation method [21] to project all snapshots into a common but lower-dimensional subspace where the snapshots can be reconstructed and compared more effectively. It should be noted that the method proposed here is different from the previous work [21]. That work aims at summarizing a dynamic network, i.e. finding a representation of all snapshots of the dynamic network in a period of interest, which is achieved with principal component analysis (PCA) on the multiple network snapshots. In the present method, however, we focus on dynamic community detection using node similarities across the period of interest, i.e. the correlation between the nodes' concatenated connectivity profiles. Although we also use PCA, this technique is applied to the node similarity matrix to obtain an eigen-space, wherein all connections among nodes at each time step can be reconstructed. This operation has two benefits: it retains the strong ties between similar nodes, and it serves the need for global smoothing.

Furthermore, we compare the proposed method with various existing baselines including the state-of-the-art methods and show the superiority of our method. Surprisingly, we find that most existing dynamic community detection approaches lack adaptiveness to different community evolution patterns, which will be detailed in the experimental results. Besides, we explore the community properties related to the performance of the algorithms and especially analyze the results detected by our method.

In the remainder of this paper, we first provide a short summary of previous works relevant to this topic in section 2. In section 3 we describe the proposed method in detail, followed by its experimental evaluations and discussion. Finally, we conclude with a discussion of the proposed method and highlight some promising directions for future work.

2. Related work

2.1. Static community discovery

A community is described as a substructure in a network [22], where the nodes are more densely connected internally than with the rest of the network. However, a community is not defined strictly, but algorithmically [12]. Even so, it is still of particular importance to view the network structure at mesoscales. To this end, many approaches have been developed to mine the communities in modular networks [23–25]. Generally, those approaches fall into several categories: node similarity based methods [26], optimization oriented methods [27], network dynamics based methods [28] and statistical modeling [25]. Among these, spectral clustering [29] is a commonly used community detection method based on node similarity, which has also been adopted to solve the problem of overlapping community structure [7]. Spectral clustering based community detection is closely associated with subspace sparse coding. Several recent studies [8, 30, 31] have shown that the performance of spectral clustering can be improved by introducing subspace sparse representations of the nodes.

However, most of the existing community discovery algorithms work on static networks without temporal information.

2.2. Dynamic community discovery

Community discovery in dynamic networks focuses on identifying and tracing the evolution of communities, which can also characterize the time-varying behavior of the dynamic network [18, 32]. Generally, dynamic community discovery algorithms fall into two main categories: one-stage methods [33, 34] and two-stage methods [1, 35]. A one-stage method detects and traces communities simultaneously; TimeRank [34] falls into this category. By labeling the nodes of each snapshot with a time attribute and transforming the dynamic network into a static network, TimeRank [34] uses a static community discovery algorithm to detect the cross-time communities and trace their evolution. In contrast, a two-stage method detects communities in each snapshot and then matches communities between pairs of snapshots to trace the dynamic community structure. The two-stage method is therefore more complex than the one-stage method, owing to the additional matching over all potential community pairs. A critical issue in one-stage dynamic community detection is the stability of the results, which generally involves two aspects: a tiny variation in network structure may lead to quite different detection results for a generic algorithm, and the algorithms may fail to detect meaningful communities due to the complex evolution processes of dynamic networks [1]. Note that the two-stage method still cannot solve the instability problem if its detection step does not involve smoothing over snapshots [18, 32]. Recently, by considering the time-dependent relationship between sequential snapshots, some two-stage methods [36, 37] have shown advantages in mitigating the instability. Those methods belong to the family of evolutionary clustering [19].

In this work, we propose a one-stage method. Different from the existing algorithms, however, our approach solves the problem from a signal processing perspective, i.e. we treat a dynamic network as the combination of signal connections and noisy connections, for which the theoretical basis is given in the following section. We first generate a similarity matrix to characterize the proximity strengths of the nodes in terms of their dynamic connection records. Under the above signal combination assumption, we suppose that small weights in the similarity matrix may indicate noisy links. Therefore, we apply PCA to the similarity matrix to obtain the signal subspace, where the links between dissimilar nodes can be filtered out by projection and reconstruction operations on the connections of the original network snapshots. As a result, the connections between different communities are most likely to be removed and the relationships among nodes within the communities are emphasized, facilitating the identification of communities. Compared to two-stage methods that require additional matching for all potential community pairs, our method detects the communities all at once, which makes it more efficient. Moreover, since our method operates on the individual snapshots of the dynamic network, rather than on the conjunction of all snapshots as in many one-stage methods, it is also superior to state-of-the-art one-stage methods such as TimeRank.

3. Problem formulation

Considering a dynamic network $G$ that describes the evolution of relations between interacting objects, there are two ways to model it: by temporal networks [38–40] or, alternatively, by a sequence of snapshots [41]. Our proposed method belongs to the latter form. That is, $G=\{G_t \mid t=1,2,\dots,T\}$, where $G_t$ is the snapshot at the $t$th time step and $T$ is the length of the sequence $G$. Let $A_t$ denote the adjacency matrix corresponding to the snapshot $G_t=(V_t,E_t)$, where $E_t$ and $V_t$ are the edges and nodes appearing in $G_t$, respectively. Accordingly, the node set of $G$ within the period of interest is $V=\bigcup_t V_t$.

Dynamic community discovery aims to detect all dynamic communities in a dynamic network. We define the dynamic communities $C=\{C_1,\dots,C_j,\dots,C_K\}$ as a set of clusters with constant labels $C_j$ ($j=1,\dots,K$), where $K$ is the total number of dynamic communities. Moreover, $C_j=\{i_t \mid i_t\in V_t,\ 1\leqslant t\leqslant T\}$ consists of nodes from different time steps, where $i_t$ denotes node $i$ at time $t$. That is, the community members are time-stamped. As such, it is convenient to observe the life-cycle of a dynamic community as well as the transition of community affiliation of a node [32]. For instance, to see how many members a community $C_j$ has at a specific time $t_0$, one can count the nodes with time stamp $t_0$ in $C_j$. Similarly, since there is a correspondence between community affiliations and time-stamped nodes, by tracing the community affiliation of a node in time, the transition from one affiliation to another can be detected.
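The two look-ups described above (community size at a given time, and the affiliation trajectory of a node) can be sketched as follows. This is only a minimal illustration: the dictionary layout, the node names and the helper names `size_at` and `trajectory` are assumptions made for the example, not part of the original method.

```python
# Hypothetical toy result: each dynamic community is a set of
# time-stamped nodes, stored as (node, time) pairs.
communities = {
    "C1": {("a", 1), ("b", 1), ("a", 2), ("b", 2), ("c", 2)},
    "C2": {("c", 1), ("d", 1), ("d", 2)},
}

def size_at(community, t0):
    """Count the members of one dynamic community at time t0."""
    return sum(1 for (_, t) in community if t == t0)

def trajectory(communities, node):
    """Map each time step to the label of the community holding `node`."""
    path = {}
    for label, members in communities.items():
        for (i, t) in members:
            if i == node:
                path[t] = label
    return dict(sorted(path.items()))

print(size_at(communities["C1"], 2))  # 3
print(trajectory(communities, "c"))   # {1: 'C2', 2: 'C1'}
```

Here node `c` moves from `C2` to `C1` between the two time steps, which is exactly the kind of transition the time-stamped definition is meant to expose.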

Generally, discovering the underlying community patterns in networks is also known as clustering of nodes, which is usually based on certain similarity measures defined over networks in a high-dimensional feature space. Among the existing approaches, matrix factorization has been actively used to map the nodes to a lower-dimensional vector subspace of the latent feature space [30, 35]. For instance, a common practice is the spectral decomposition of the Laplacian of the adjacency matrix, or of its variants, which embeds the nodes into the space spanned by one or more eigenvectors. In this work, the proposed method is also based on matrix factorization. However, different from spectral clustering, where the nodes are encoded in a subspace and a spatial clustering algorithm is then applied to the embedding, we construct a common subspace that acts as a filter for the connections among nodes, which can be derived from the following theoretical analysis.

Since the connections that link different communities usually cause the ambiguity of community structure and affect the decision boundary of detection models, such connections can be treated as noise. From this point of view, we decompose the zero-mean similarity matrix as $\bar{M}=X+W$, where X denotes the signal part and W is the noise part. Suppose that all signals and noises are mutually uncorrelated and the noises have identical variance δ, then the covariance matrix $R={\bar{M}}^{\mathrm{T}}{\times}\bar{M}$ can be rewritten as

Equation (1)

where I is the identity matrix. Applying eigen-decomposition to R gives $R=U{\times}{\Sigma}{\times}{U}^{\mathrm{T}}$, where U is the matrix composed of the eigenvectors of R and Σ is the diagonal matrix composed of the corresponding eigenvalues. Then R can be rewritten as:

Equation (2)

Equation (3)

Suppose the signal X is low-rank and the signal-to-noise ratio is large, then equation (2) can be rewritten as:

Equation (4)

where ${\Lambda}=\left[\begin{matrix}\hfill {U}_{s}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill 0\hfill \end{matrix}\right]$. Obviously, the first term in equation (2) is equivalent to the reconstructed signal matrix X. The principal components in U can be obtained by performing PCA on R.

In the light of the theoretical analysis, we use PCA to obtain a common low-dimensional subspace for all network snapshots, where the encoding–decoding operation acts as filtering and retains the strong ties between similar nodes in each snapshot. Then, we perform a conventional clustering method on the node column vectors corresponding to the decoded snapshots. As a result, changes of the community affiliations of a node can be easily identified.
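The signal-plus-noise argument can be illustrated numerically. The sketch below is not the authors' code: the rank-1 signal, the noise level 0.1 and the matrix size are arbitrary assumptions chosen only to show that the leading eigenvector of R carries the signal and that projecting onto it suppresses the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
sign = np.repeat([1.0, -1.0], n // 2)      # two groups of nodes
X = 0.5 * np.outer(sign, sign)             # rank-1, zero-mean "signal"
W = 0.1 * rng.standard_normal((n, n))      # weak, uncorrelated "noise"
M_bar = X + W                              # observed matrix

R = M_bar.T @ M_bar
eigvals, U = np.linalg.eigh(R)             # eigh returns ascending order
eigvals, U = eigvals[::-1], U[:, ::-1]     # sort in descending order

U_s = U[:, :1]                             # signal subspace (top eigenvector)
X_hat = M_bar @ U_s @ U_s.T                # project, then reconstruct

err_raw = np.linalg.norm(M_bar - X)        # error before filtering
err_hat = np.linalg.norm(X_hat - X)        # error after filtering
print(eigvals[:3])
print(err_raw, err_hat)
```

In this toy setting the first eigenvalue dominates the rest, and the reconstruction through the one-dimensional subspace is closer to the signal than the noisy observation is.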

4. Community discovery in common subspace

The framework of our method is illustrated in figure 1 with a toy example. Specifically, we first measure global similarities between nodes by comparing their connectivity profiles across time. In the example shown in the figure, the inner product is computed for each pair of historical connectivity profiles of the nodes (e.g., i, n, k and j), obtained by concatenating the adjacency matrices of the snapshots, resulting in the similarity matrix (as shown on the left-hand side of figure 1). Then we apply PCA to the node similarity matrix to derive an eigen-subspace (as shown in the center of the plot), with which the different network snapshots are reconstructed in an encoding–decoding scheme. Thus the adjacency matrices on the left-hand side of figure 1 can be projected onto the eigen-subspace and then reconstructed by inverse projection. In this way, the weak ties lying between two communities are most likely to be filtered out as noise. As a result, two nodes in the same community are much more likely to be similar to each other, and distinct if they belong to different communities. That is, the community structure in the snapshots becomes more prominent, which facilitates the follow-up clustering. Below are the details:


Figure 1. The framework of the proposed method. It consists of five steps: (1) concatenating nodes' instant connectivity profiles, (2) similarity measuring based on node connectivity profiles, (3) eigen-decomposition using PCA, (4) snapshot reconstruction in the derived common subspace and (5) time-stamped node clustering. There are two data processing channels, indicated by gray arrows and blue arrows, respectively. Specifically, gray arrows represent the process of finding the common subspace (steps (1)–(3)), while blue arrows represent the process of structural filtering on individual network snapshots (steps (4) and (5)).


As the connections of each node are time-varying, a matrix $\mathbb{A}\in {\mathbb{R}}^{\left(T{\times}\vert V\vert \right){\times}\vert V\vert }$ is constructed to store the historical edges in the dynamic network G. Here, $\mathbb{A}\left(i{\ast}t,j\right)=1$ indicates that node i is connected with node j in the tth snapshot, where $i,j\in V$ and $t\in\{1,2,\dots,T\}$. Clearly, a node can be characterized by its immediate and higher-order neighborhoods. For simplicity, we use the nearest neighbors to describe a node; that is, the ith column (or row) of $A_t$, i.e. $A_t(:,i)$, is a basic representation of node i in snapshot $G_t$, which is denoted by ${\boldsymbol{v}}_{{i}_{t}}$. Then the complete representation of node i is the column vector $\mathbb{A}\left(:,i\right)$, i.e. ${\boldsymbol{v}}_{i}={\left[{\boldsymbol{v}}_{{i}_{1}}^{\mathrm{T}},{\boldsymbol{v}}_{{i}_{2}}^{\mathrm{T}},{\boldsymbol{v}}_{{i}_{3}}^{\mathrm{T}}\right]}^{\mathrm{T}}$ as shown in figure 1.
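A minimal sketch of this construction is given below. The toy snapshots are hypothetical, and stacking the snapshots one below the other (time-major order) is an assumption about the row layout of $\mathbb{A}$; any fixed ordering gives the same inner products between columns.

```python
import numpy as np

# Hypothetical toy dynamic network: 3 nodes, T = 2 snapshots.
A1 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]])
A2 = np.array([[0, 1, 0],
               [1, 0, 0],
               [0, 0, 0]])
snapshots = [A1, A2]

# Stack the snapshots vertically: shape (T * |V|, |V|).
A_stack = np.vstack(snapshots)

# Column i is the complete representation of node i,
# i.e. its per-snapshot neighbor vectors concatenated over time.
v_1 = A_stack[:, 1]

print(A_stack.shape)  # (6, 3)
print(v_1)            # [1 0 1 1 0 0]
```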

Next, node similarity is calculated according to the similarity of connectivity profiles between two nodes. In general, the direct connections and the second order neighbors are used to depict the connectivity profile of a node, as a node is more relevant to its nearest neighbors than distant nodes, while the second order neighbors can account for the similarity when the nearest neighbors change in time frequently. In this case, the corresponding similarity matrix M can be written as follows:

Equation (5): $M={\mathbb{A}}^{\mathrm{T}}{\times}\mathbb{A}$

where $M\left(i,j\right)={\boldsymbol{v}}_{i}^{\mathrm{T}}\cdot {\boldsymbol{v}}_{j}$ is the total number of common neighbors of two nodes over the T time stamps, indicating a rough similarity between nodes. As mentioned above, the direct connectivity is a straightforward indicator of the pairwise similarity, especially when the dynamic network changes unevenly. Thus, alternatively, one can combine these two components, which results in

Equation (6)

where α, β ∈ [0, 1] determine the relative importance of the direct and indirect connectivity, respectively, in measuring the dynamic similarity between nodes. Then the columns of M, instead of those of $\mathbb{A}$, are employed to represent the nodes. Assuming that different neighboring nodes contribute differently to the characterization of a target node, we expect that the nodes can be expressed by their similar neighbors. Therefore, we take the links between a node pair with a low similarity value (measured by the number of common neighbors) as noise. Then, a denoising technique can be devised to derive a low-dimensional subspace based on the previous theoretical analysis.
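The similarity computation can be sketched on a toy example. The common-neighbor part follows directly from $M(i,j)={\boldsymbol{v}}_i^{\mathrm{T}}\cdot{\boldsymbol{v}}_j$. The weighted combination, however, is only an assumed form (direct links summed over snapshots, weighted by α, plus common neighbors weighted by β), since equation (6) is not reproduced here; the values of `alpha` and `beta` are likewise hypothetical.

```python
import numpy as np

# Hypothetical toy dynamic network: 3 nodes, T = 2 snapshots.
A1 = np.array([[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]])
A2 = np.array([[0, 1, 0],
               [1, 0, 0],
               [0, 0, 0]])
snapshots = [A1, A2]
A_stack = np.vstack(snapshots)

# Indirect connectivity: common neighbors accumulated over all time steps,
# M_common[i, j] = v_i^T v_j.
M_common = A_stack.T @ A_stack

# Direct connectivity: how often each pair is linked (assumed form).
M_direct = sum(snapshots)

alpha, beta = 0.5, 0.5  # hypothetical weights in [0, 1]
M = alpha * M_direct + beta * M_common

print(M_common)
print(M)
```

In this example nodes 0 and 2 are never linked directly, yet `M_common[0, 2]` is 1 because they share neighbor 1 in the first snapshot.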

We apply PCA to the similarity matrix M to get the principal matrix P, which usually spans a low-dimensional space. The spanned space is taken as the common projection subspace. Specifically, in step 1 we normalize the column representation vectors M(:, i) as follows:

Equation (7)

In step 2, we calculate the covariance matrix Φ,

Equation (8)

Step 3 performs the spectral decomposition to get the principal matrix $P={\left({\boldsymbol{p}}_{1}^{\mathrm{T}},{\boldsymbol{p}}_{2}^{\mathrm{T}},\dots ,{\boldsymbol{p}}_{s}^{\mathrm{T}}\right)}^{\mathrm{T}}$, where ${\boldsymbol{p}}_{i}=\left({p}_{i1},\dots ,{p}_{i\vert V\vert }\right)\in {\mathbb{R}}^{\vert V\vert }$ is the eigenvector corresponding to the ith of the top-s eigenvalues. Generally, s is a prior parameter and can be set according to the distribution of the eigenvalues, which is detailed in appendix A. Alternatively, s can also be determined by the contribution rate of the first s principal components, i.e. $\frac{{\sum }_{i=1}^{s}{\lambda }_{i}}{{\sum }_{j=1}^{N}{\lambda }_{j}}$. The subspace spanned by the principal matrix P is what we seek.
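The three PCA steps and the contribution-rate rule can be sketched as follows. Since equations (7) and (8) are not reproduced here, the column centering and the unscaled covariance $\bar{M}^{\mathrm{T}}\bar{M}$ are assumptions, as are the function name `principal_matrix` and the threshold `rate=0.9`.

```python
import numpy as np

def principal_matrix(M, rate=0.9):
    """Return (P, s): the top-s eigenvectors of the covariance of M, as rows.

    s is the smallest number of components whose cumulative
    contribution rate reaches `rate` (assumes M is not constant).
    """
    M_bar = M - M.mean(axis=0, keepdims=True)   # step 1: zero-mean columns (assumed)
    Phi = M_bar.T @ M_bar                       # step 2: covariance up to scaling (assumed)
    eigvals, eigvecs = np.linalg.eigh(Phi)      # step 3: eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)  # remove tiny negative round-off
    eigvecs = eigvecs[:, order]
    contribution = np.cumsum(eigvals) / eigvals.sum()
    s = int(np.searchsorted(contribution, rate)) + 1
    s = min(s, len(eigvals))
    return eigvecs[:, :s].T, s

# Hypothetical similarity matrix with two groups of three nodes.
M = np.kron(np.eye(2), np.ones((3, 3)))
P, s = principal_matrix(M)
print(s)        # 1
print(P.shape)  # (1, 6)
```

For this block-structured example a single component already explains the whole variance, and its entries have opposite signs on the two groups.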

To mitigate the instability of the communities detected by static community detection approaches, we project all snapshots to the common subspace and then reconstruct the connection patterns by filtering the negative weights. The details are shown in the following:

(i) Firstly, the feature representations of the nodes in each snapshot are normalized to have zero-mean:

Equation (9)

where $j_t$ is a node in $V_t$ and ${\boldsymbol{v}}_{{j}_{t}}$ is its vector representation.

(ii) Then the normalized vector ${\bar{\boldsymbol{v}}}_{{i}_{t}}$ is projected to the common subspace:

Equation (10)

(iii) Furthermore, the low-dimensional embedding of each snapshot is decoded in the following way:

Equation (11)

which means that the relational patterns among the nodes in each snapshot are reconstructed. In fact, ${\boldsymbol{v}}_{{i}_{t}}$ embodies the information about the neighbors of node i. Thus, ${f}_{{i}_{t}}$ reconstructs the relationship between node i and the other nodes in snapshot $G_t$, in which positive values correspond to statistically strong connections among similar nodes, whereas negative values indicate a weak relation over the time window. Through the above encoding–decoding process, ${f}_{{i}_{t}}$ combines the local connection pattern with temporal information for global smoothing.

(iv) Finally, we filter out the negative connections and keep only the strong connections of the reconstructed relationship, which gives the succinct and globally smoothed representation of each node i at time t. The detailed operation is as follows:

Equation (12)

where δ(⋅) is the step function, which takes the value 1 for positive arguments and 0 otherwise, and ${\tilde {\boldsymbol{v}}}_{{i}_{t}}$ is the normalized vector of node i in snapshot $G_t$.
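Steps (i)–(iv) can be sketched for a single snapshot as below. Equations (9)–(12) are not reproduced here, so several details are assumptions: the centering subtracts the mean node vector of the snapshot, encoding and decoding are taken as multiplication by $P$ and $P^{\mathrm{T}}$, and the final normalization uses the Euclidean norm. The function name `reconstruct_snapshot`, the toy graph and the hand-written subspace `P` are hypothetical.

```python
import numpy as np

def reconstruct_snapshot(A_t, P):
    """Encode, decode and filter one snapshot; column i is the new vector of node i."""
    V = A_t.astype(float)                          # column i is v_{i_t}
    V_bar = V - V.mean(axis=1, keepdims=True)      # (i) zero-mean (assumed form)
    Y = P @ V_bar                                  # (ii) project onto the subspace
    F = P.T @ Y                                    # (iii) decode
    F_pos = np.where(F > 0, F, 0.0)                # (iv) drop negative weights
    norms = np.linalg.norm(F_pos, axis=0, keepdims=True)
    return np.divide(F_pos, norms, out=np.zeros_like(F_pos), where=norms > 0)

# Two 4-cliques joined by one weak tie between nodes 3 and 4.
block = np.ones((4, 4)) - np.eye(4)
A_t = np.kron(np.eye(2), block)
A_t[3, 4] = A_t[4, 3] = 1

# Hypothetical one-dimensional common subspace separating the two groups.
P = np.array([[1, 1, 1, 1, -1, -1, -1, -1]]) / np.sqrt(8)

V_tilde = reconstruct_snapshot(A_t, P)
print(V_tilde[:, 3])  # node 3 keeps only its own group
```

In this example the tie between nodes 3 and 4 receives a negative reconstructed weight and is removed, while the within-group weights of node 3 survive.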

Finally, we cluster all the new time-stamped representations of the nodes $\left\{{\tilde {\boldsymbol{v}}}_{{i}_{t}}\vert t=1,2,\dots ,T,i=1,2,\dots ,\vert V\vert \right\}$ using an existing clustering method such as the K-means algorithm [42]. In this way, detecting and tracing dynamic communities are done in one stage. Specifically, the cluster labels are constant over time, and a node that appears in several snapshots may carry different community labels at different times, which makes it convenient to trace changes in the membership of each community as well as the evolution of the communities.
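The clustering and tracing stage can be sketched as follows. The representations below are made-up numbers for two snapshots of four nodes, and the small K-means routine uses a deterministic farthest-first initialization purely to keep the example reproducible; neither is taken from the original method, which only states that an existing K-means algorithm is used.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations with a deterministic farthest-first start."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        updated = centers.copy()
        for c in range(k):
            if np.any(labels == c):          # keep the old center if a cluster is empty
                updated[c] = points[labels == c].mean(axis=0)
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels

T, n_nodes = 2, 4
reps = np.array([
    [0.58, 0.58, 0.58, 0.0],   # t = 1, node 0
    [0.58, 0.58, 0.58, 0.0],   # t = 1, node 1
    [0.58, 0.58, 0.58, 0.0],   # t = 1, node 2
    [0.0, 0.0, 0.0, 1.0],      # t = 1, node 3
    [0.7, 0.7, 0.0, 0.0],      # t = 2, node 0
    [0.7, 0.7, 0.0, 0.0],      # t = 2, node 1
    [0.0, 0.0, 0.7, 0.7],      # t = 2, node 2
    [0.0, 0.0, 0.7, 0.7],      # t = 2, node 3
])

labels = kmeans(reps, k=2)
trajectories = labels.reshape(T, n_nodes)  # row t, column i
print(trajectories)
```

Because all time-stamped vectors are clustered together, the labels are shared across snapshots: reading a column of `trajectories` gives the affiliation of one node over time, and here node 2 switches community between the two time steps.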

Algorithm 1 gives a detailed description of the whole procedure, which is called ComSP for short. We find that the time complexity of the algorithm is $O(T\times\vert V\vert^{2}\times K)$, which is lower than that of the state-of-the-art one-stage method TimeRank [34].

Algorithm 1. Dynamic community discovery by a common subspace (ComSP).

Input: $G=\{G_t \mid t=1,2,\dots,T\}$, the dynamic network; s, the selected number of principal components; K, the number of clusters
Output: the set of dynamic communities C
1: $\mathbb{A}\leftarrow$ the matrix storing the historical edges in G
2: M ← the similarity matrix produced by equation (6)
3: P ← the principal matrix by equations (7) and (8)
4: R ← ∅
5: for t = 1 → T do
6:  for $i\in V_t$ do
7:   ${\bar{\boldsymbol{v}}}_{{i}_{t}}\leftarrow$ the normalized vector by equation (9)
8:   ${\tilde {\boldsymbol{v}}}_{{i}_{t}}\leftarrow$ the representation by equations (10)–(12)
9:   $R\leftarrow R\cup \{{\tilde {\boldsymbol{v}}}_{{i}_{t}}\}$
10:  end for
11: end for
12: C ← the result of the K-means algorithm for R
13: return C

5. Experiments

5.1. Dataset description

We conduct experiments using synthetic networks and a real-world social network with different periods. Table 1 summarizes the statistics of the datasets.

Table 1. Statistics of the datasets: the total number of nodes (n), the minimum number of nodes and edges in the snapshots (nmin and mmin, respectively), the maximum number of nodes and edges in the snapshots (nmax and mmax, respectively), the average ($\bar{n}$ and $\bar{m}$) and the variance (nvar and mvar) of node and edge numbers, respectively, the number of dynamic communities (K) and the size of dynamic network (T). Decimal part is omitted.

Dataset        n     nmin  nmax  $\bar{n}$  nvar  mmin   mmax   $\bar{m}$  mvar  K   T
SBM            1000  1000  1000  1000       0     75367  75442  75423      1027  4   4
Reddit-I(a)    282   117   150   128        173   130    191    150        593   3   4
Reddit-I(b)    470   114   150   129        217   112    191    144        517   3   8
Reddit-II      18    6     12    9          6     7      23     13         51    2   3
Reddit-III(a)  275   31    133   86         1666  25     176    110        3511  17  4
Reddit-III(b)  339   31    133   64         1360  25     176    74         3135  17  8

Synthetic networks. This is a sequence of four network snapshots, designed to imitate the evolution of communities in dynamic networks. We use the stochastic block model (SBM) [43–47] to generate the four snapshots, where the cross-block connection probability is set to 0.2 and the in-block connection probability is 0.6. There are 4 communities with 250 nodes each in these snapshots. Then a perturbation is applied to the edges of the snapshots with probability 0.01. In particular, in the last snapshot, the community affiliations of two randomly selected nodes are changed compared to the previous snapshot. Hence, this sequence embodies typical operations on communities, i.e. apparent continuation and unnoticeable growth/contraction [48].
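A generator of this kind can be sketched as follows. The text does not specify how the perturbation is applied, so flipping each node pair independently with the given probability is an assumption, as are the function names, the random seed and the reduced community size of 50 used to keep the example light (the setting described above uses 250). The relabeling of two nodes in the last snapshot is not included.

```python
import numpy as np

def sbm_snapshot(labels, p_in, p_out, rng):
    """Sample a symmetric 0/1 adjacency matrix from a stochastic block model."""
    n = len(labels)
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # sample each pair once
    return (upper | upper.T).astype(int)

def perturb(A, p_flip, rng):
    """Flip the state of each node pair with probability p_flip (assumed scheme)."""
    n = A.shape[0]
    flip = np.triu(rng.random((n, n)) < p_flip, k=1)
    flip = flip | flip.T
    return np.where(flip, 1 - A, A)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)      # 4 communities, 50 nodes each
snapshots = [perturb(sbm_snapshot(labels, 0.6, 0.2, rng), 0.01, rng)
             for _ in range(4)]

A = snapshots[0]
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)
within = A[same & off_diag].mean()        # empirical in-block density
across = A[~same].mean()                  # empirical cross-block density
print(within, across)
```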

Reddit. The raw dataset records social networking activities as threads of posts and comments associated with subreddits, e.g. 'dogs', 'tennis', which are the innate community labels for the networks composed of the users and their interactions. We take these subreddits as the ground truths (GT) of communities. Moreover, the community label of a node is determined by the subreddit to which the majority of its replies belong. The communities of the evolving social network then behave in a dynamic way, e.g. birth/death, growth/contraction and merging/splitting. Note that, to reduce the influence of random sampling on the performance comparison, we use the entire dataset contributed by Pushshift. Based on the raw records, we construct three sequences of weekly unweighted and undirected networks (referred to as Reddit-I(a), Reddit-II and Reddit-III(a), respectively) from 1st September 2010 to 28th September 2010, and two longer sequences (Reddit-I(b), Reddit-III(b)) covering September and October 2010. The details of how to construct these network sequences can be found in https://github.com/NightmareNyx/CommunityTracking.

5.2. Baselines and experimental settings

The baseline methods used for comparison cover all three categories of dynamic community discovery approaches [32], namely, the instant optimal approach, which considers the communities of each snapshot independently; the temporal trade-off approach, in which the communities at t are the result of a trade-off between the optimal solution at t and the known past (or global optimization); and the cross-time approach, which searches for partition solutions for all snapshots simultaneously.

MultiGL [41, 49] utilizes a multislice generalization of modularity to study the community structure of a dynamic network. Specifically, this method couples successive snapshots and rewrites the weights of the intra-slice and inter-slice connections to realize the 'cross-time' category. Then the GenLouvain algorithm with the specified quality function is used to detect communities in the new network with |V| × T nodes.

TimeRank [34] also belongs to the 'cross-time' category. In this method, a time-weighted network with |V| × T nodes is produced by MutuRank [50] and then undergoes spectral clustering. It has two variants, TR-AOC and TR-NOC, depending on the type of relations between nodes. Since it was shown that TR-NOC is generally better than TR-AOC [34], we use TR-NOC as the baseline in our experiments.

Spectral clustering [29] is designed for static community discovery. Here, we use it as an 'instant optimal' method. We implement two-stage detection with spectral clustering, which is referred to as the ts-Spect approach. Specifically, we use the spectral clustering algorithm to discover communities in each snapshot, and then match the communities in two consecutive snapshots. The match quality is measured with the Jaccard index.
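The matching stage of such a two-stage baseline can be sketched as below. Only the use of the Jaccard index comes from the description above; the greedy best-match rule, the function names and the toy communities are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two node sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_communities(prev, curr):
    """For each current community, pick the previous one with the highest Jaccard index."""
    matches = {}
    for c_label, c_nodes in curr.items():
        best = max(prev, key=lambda p: jaccard(prev[p], c_nodes))
        matches[c_label] = (best, jaccard(prev[best], c_nodes))
    return matches

# Hypothetical communities detected in two consecutive snapshots.
prev = {"A": {1, 2, 3, 4}, "B": {5, 6, 7}}
curr = {"X": {1, 2, 3}, "Y": {4, 5, 6, 7}}

print(match_communities(prev, curr))
# {'X': ('A', 0.75), 'Y': ('B', 0.75)}
```

Here node 4 has moved from community `A` to the successor of `B`, and both current communities are still matched to their predecessors with a Jaccard index of 0.75.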

GDG [30] uses a continuous density field to map each node into a geometric space, and then detects static communities with a clustering algorithm. In order to compare it with two-stage methods, we use this algorithm to detect communities in each snapshot and then align the communities to trace dynamic communities.

PisCES [35] is a temporal trade-off approach, which achieves the global smoothing of the eigenvectors of each snapshot by the regularization optimization method. PisCES detects the clusters in each snapshot by applying the K-means algorithm on their corresponding smoothed eigenvectors. These two operations are performed simultaneously. Finally, we utilize the maximum matching degree to trace the community detected by PisCES.

sE-NMF [36] uses evolutionary non-negative matrix factorization as a temporal smoothness framework for community detection. Then, through a greedy search procedure, sE-NMF uses mutual information (MI) to measure the similarity between clusters of successive snapshots and maps local clusters to dynamic communities for tracing.

CCPSO [37] formulates dynamic community detection as a multi-objective optimization problem. It regards the common knowledge between the optimal partitions of the current and previous snapshots as the consensus community, and then utilizes the consensus community to guide particle swarm optimization (PSO) for the current snapshot. At the last step, CCPSO detects the clusters for each snapshot. In this sense, CCPSO falls into the category of temporal trade-off approaches. Since its original algorithm [37] lacks a tracing method, we track a community in a way similar to PisCES.

Adj-Mat uses the columns of the adjacency matrix, $A_t(:,i)$, as the representations of the nodes in each snapshot, which is the basic representation ${\boldsymbol{v}}_{{i}_{t}}$ of our method. By performing K-means on all basic representations of each snapshot, Adj-Mat detects and traces dynamic communities.

SuRep [21] obtains the symmetric matrix M by averaging all snapshots. Then, similar to the proposed method, SuRep groups the novel time-stamped representation of the existing nodes and traces the dynamic communities.

Other methods such as non-negative tensor factorization [33] have been compared by Sarantopoulos et al [34], and TimeRank outperformed this method in most cases. We therefore omit it in our experiments.

5.3. Evaluation metrics

We evaluate the model performance with several metrics that are borrowed from the field of clustering and commonly used in testing and comparing methods. They are normalized mutual information (NMI) [51], adjusted Rand index (ARI) [52], and the BCubed versions of precision (P), recall (R) and the combined metric (F1) [53], whose definitions and calculations are detailed in appendix B. In particular, NMI derives from entropy in information theory and calculates the mutual information between the ground truth labels and the labels from the clustering results. However, the most important information we can get from clustering results is not the labels, but rather which nodes are clustered together and which are not, which motivates the Rand index. ARI was then proposed to account for the importance of small clusters in imbalanced data. Different from the macroscopic measures NMI and ARI, BCubed measures the precision and recall of the predicted community affiliation of each node and averages them to compare the predicted communities with the GT communities.
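The per-node averaging behind the BCubed metrics can be sketched as follows. This uses the standard BCubed definition for non-overlapping labelings, which may differ in detail from the formulas in appendix B; the function name `bcubed` and the toy labels are hypothetical.

```python
from collections import Counter

def bcubed(pred, truth):
    """Return BCubed precision, recall and F1 for two flat labelings.

    For each item, precision is the share of its predicted cluster that has
    the same true label, and recall is the share of its true class that lies
    in the same predicted cluster; both are averaged over all items.
    """
    pair_counts = Counter(zip(pred, truth))
    pred_sizes = Counter(pred)
    truth_sizes = Counter(truth)
    n = len(pred)
    precision = sum(pair_counts[(p, g)] / pred_sizes[p]
                    for p, g in zip(pred, truth)) / n
    recall = sum(pair_counts[(p, g)] / truth_sizes[g]
                 for p, g in zip(pred, truth)) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical labels for five time-stamped nodes.
pred = [0, 0, 0, 1, 1]
truth = [0, 0, 1, 1, 1]
p, r, f1 = bcubed(pred, truth)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.7333 0.7333 0.7333
```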

6. Results

Table 2 displays the NMI, ARI and BCubed indices for all six datasets. Several observations can be made. No method excels on every dataset. However, our proposed model shows competitive performance compared to the baselines, and the improvement is most prominent in ARI and F1. More importantly, the proposed method is more stable than all nine baselines across the different datasets. Surprisingly, TR-NOC, PisCES and Adj-Mat identify only one giant community in Reddit-I(a) (with 4 snapshots) and Reddit-I(b) (with 8 snapshots), respectively, which accounts for ARI values at or near 0. Similar results are found for CCPSO on SBM, implying a lack of adaptiveness to diverse evolution patterns of communities. In fact, the poor detection results of CCPSO on SBM can be traced to its initialization (PGLP) [54], which is more likely to identify the whole network as one community when there are many links between communities, resulting in high recall but low precision and NMI. Another observation is that MultiGL performs poorly on all six datasets. The reason is that MultiGL takes little notice of the time coherence between snapshots, leading to high sensitivity to small changes in links. Moreover, our method is superior to SuRep on most of the datasets, although both methods use PCA for network reconstruction. We also note that in the simulated dynamic network dataset (SBM), the proposed method and SuRep fail to spot two nodes whose community affiliations change in the fourth snapshot. We believe this is due to the denoising effect of the PCA operation, as trivial changes in a community are neglected in the encoding–decoding process.

Table 2. Performance comparison of the proposed method (ComSP) and nine baselines. The top-2 performance is highlighted in bold.

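| Dataset | Metric | ts-Spect | GDG | PisCES | sE-NMF | MultiGL | CCPSO | TR-NOC | Adj-Mat | SuRep | ComSP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SBM | K | 4 | 4 | 4 | 4 | 4 | 2 | 4 | 4 | 4 | 4 |
| SBM | NMI | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 0.164 | 1.000 | 1.000 | 0.997 | 0.997 |
| SBM | ARI | 1.000 | 1.000 | 1.000 | 1.000 | -0.001 | 0.017 | 1.000 | 1.000 | 0.999 | 0.999 |
| SBM | P | 1.000 | 1.000 | 1.000 | 1.000 | 0.250 | 0.300 | 1.000 | 1.000 | 0.999 | 0.999 |
| SBM | R | 1.000 | 1.000 | 1.000 | 1.000 | 0.250 | 0.906 | 1.000 | 1.000 | 0.999 | 0.999 |
| SBM | F1 | 1.000 | 1.000 | 1.000 | 1.000 | 0.250 | 0.451 | 1.000 | 1.000 | 0.999 | 0.999 |
| Reddit-II | K | 3 | 5 | 3 | 4 | 2 | 2 | 2 | 2 | 2 | 2 |
| Reddit-II | NMI | 0.269 | 0.674 | 0.884 | 0.662 | 0.224 | 1.000 | 1.000 | 0.160 | 1.000 | 1.000 |
| Reddit-II | ARI | 0.078 | 0.561 | 0.880 | 0.394 | 0.276 | 1.000 | 1.000 | -0.078 | 1.000 | 1.000 |
| Reddit-II | P | 0.770 | 1.000 | 1.000 | 1.000 | 0.745 | 1.000 | 1.000 | 0.696 | 1.000 | 1.000 |
| Reddit-II | R | 0.556 | 0.635 | 0.929 | 0.5134 | 0.656 | 1.000 | 1.000 | 0.633 | 1.000 | 1.000 |
| Reddit-II | F1 | 0.645 | 0.777 | 0.963 | 0.678 | 0.698 | 1.000 | 1.000 | 0.663 | 1.000 | 1.000 |
| Reddit-I(a) | K | 5 | 28 | 1 | 4 | 8 | 37 | 3 | 3 | 3 | 3 |
| Reddit-I(a) | NMI | 0.268 | 0.378 | 0.000 | 0.191 | 0.015 | 0.581 | 0.710 | 0.099 | 0.629 | 0.719 |
| Reddit-I(a) | ARI | 0.328 | 0.370 | 0.000 | 0.123 | -0.023 | 0.361 | 0.632 | -0.084 | 0.654 | 0.770 |
| Reddit-I(a) | P | 0.607 | 0.769 | 0.439 | 0.559 | 0.446 | 0.977 | 0.750 | 0.470 | 0.815 | 0.860 |
| Reddit-I(a) | R | 0.666 | 0.346 | 1.000 | 0.377 | 0.395 | 0.324 | 0.977 | 0.817 | 0.763 | 0.829 |
| Reddit-I(a) | F1 | 0.635 | 0.477 | 0.611 | 0.450 | 0.419 | 0.487 | 0.849 | 0.597 | 0.788 | 0.844 |
| Reddit-I(b) | K | 8 | 57 | 1 | 7 | 10 | 59 | 3 | 3 | 3 | 3 |
| Reddit-I(b) | NMI | 0.128 | 0.274 | 0.000 | 0.179 | 0.011 | 0.514 | 0.032 | 0.090 | 0.558 | 0.691 |
| Reddit-I(b) | ARI | 0.146 | 0.240 | 0.000 | 0.164 | -0.010 | 0.291 | 0.004 | -0.043 | 0.679 | 0.747 |
| Reddit-I(b) | P | 0.493 | 0.695 | 0.410 | 0.556 | 0.416 | 0.940 | 0.414 | 0.436 | 0.744 | 0.826 |
| Reddit-I(b) | R | 0.646 | 0.248 | 1.000 | 0.319 | 0.365 | 0.236 | 0.990 | 0.876 | 0.711 | 0.807 |
| Reddit-I(b) | F1 | 0.559 | 0.366 | 0.582 | 0.405 | 0.389 | 0.377 | 0.584 | 0.582 | 0.727 | 0.817 |
| Reddit-III(a) | K | 28 | 44 | 11 | 54 | 30 | 39 | 17 | 17 | 17 | 17 |
| Reddit-III(a) | NMI | 0.510 | 0.595 | 0.291 | 0.641 | 0.188 | 0.805 | 0.785 | 0.522 | 0.741 | 0.773 |
| Reddit-III(a) | ARI | 0.197 | 0.258 | 0.023 | 0.217 | -0.003 | 0.596 | 0.500 | 0.102 | 0.512 | 0.546 |
| Reddit-III(a) | P | 0.408 | 0.534 | 0.175 | 0.626 | 0.153 | 0.805 | 0.615 | 0.420 | 0.617 | 0.651 |
| Reddit-III(a) | R | 0.346 | 0.367 | 0.897 | 0.269 | 0.158 | 0.591 | 0.798 | 0.680 | 0.608 | 0.683 |
| Reddit-III(a) | F1 | 0.374 | 0.435 | 0.293 | 0.377 | 0.156 | 0.681 | 0.695 | 0.520 | 0.612 | 0.667 |
| Reddit-III(b) | K | 36 | 70 | 34 | 81 | 39 | 45 | 17 | 17 | 17 | 17 |
| Reddit-III(b) | NMI | 0.442 | 0.596 | 0.258 | 0.625 | 0.155 | 0.723 | 0.754 | 0.487 | 0.657 | 0.728 |
| Reddit-III(b) | ARI | 0.140 | 0.222 | -0.018 | 0.169 | -0.006 | 0.476 | 0.610 | 0.058 | 0.409 | 0.539 |
| Reddit-III(b) | P | 0.396 | 0.630 | 0.249 | 0.693 | 0.161 | 0.746 | 0.619 | 0.440 | 0.567 | 0.658 |
| Reddit-III(b) | R | 0.237 | 0.289 | 0.575 | 0.206 | 0.128 | 0.432 | 0.797 | 0.622 | 0.519 | 0.575 |
| Reddit-III(b) | F1 | 0.297 | 0.396 | 0.347 | 0.317 | 0.143 | 0.547 | 0.696 | 0.516 | 0.542 | 0.614 |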

Modularity is the most widely used measure to evaluate the compactness and topological consistency of communities. However, a major drawback of using such a gold-standard quality function is that it favors methods that are designed to maximize it, which may result in misleading comparisons. The stability of the detection methods can also be seen in the variation of the modularity [55] at each time step, as shown in figure 2. The modularity curves show that the instant (static) community detection methods (ts-Spect and GDG) are considerably volatile, with large variations in modularity, while our method is more stable than the baselines in terms of both modularity changes and scores. Specifically, our method significantly outperforms the baseline methods except CCPSO and TR-NOC, which are competitive on the last two datasets. It should be noted that PisCES and CCPSO are designed for smoothing [32, 35, 37], leading to even curves on all datasets. This is especially visible in Reddit-III(b), where the snapshot at time step 5 undergoes a remarkable structural change compared to the previous snapshots and the performance of the other methods varies evidently. However, global smoothing (e.g. PisCES) overlooks the evolution of communities, which usually causes poor community identification.
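
For reference, the per-snapshot modularity of a partition can be sketched as follows, assuming the standard Newman-Girvan definition for an undirected, unweighted snapshot; the exact variant used in [55] may differ. The function name `modularity` and the toy graph of two triangles joined by one edge are illustrative assumptions.

```python
def modularity(edges, community):
    """Newman-Girvan modularity Q = sum_c [L_c / m - (d_c / (2m))^2].

    edges:     list of undirected edges (u, v), each listed once
    community: dict mapping node -> community label
    """
    m = len(edges)
    internal = {}  # L_c: number of edges with both ends in community c
    degree = {}    # d_c: total degree of the nodes in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_giant = modularity(edges, {node: "a" for node in range(6)})
```

Under this definition a single giant community always scores Q = 0, whereas the two-triangle split scores 5/14, which illustrates why a method that merges everything into one community shows a flat, low modularity curve.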

Figure 2. Modularity of community partitions on Reddit-I (subplots (a)–(b)) and Reddit-III (subplots (c)–(d)), for the ground-truth community labels and for the dynamic communities discovered by TR-NOC, PisCES, ts-Spect, GDG, sE-NMF, CCPSO, Adj-Mat, SuRep and our method.

Besides the ground-truth-based metrics and gold-standard indices, the partition quality can also be evaluated by comparing the distributions of community sizes produced by the different methods with that of the ground truth. Figure 3 shows the community sizes as box plots, where the red boxes are the ground-truth distributions. Compared with the others, our method and SuRep produce size distributions that closely approximate the ground-truth distribution, while, as shown in table 2, our method outperforms SuRep. Moreover, the distribution of community sizes in TR-NOC shows large variances and low medians, indicating that TR-NOC tends to generate large communities, which explains why TR-NOC has high recall. By contrast, the distribution in CCPSO has low variances and low medians, which means that it generates many small communities, as shown in table 2, and explains why it has high precision.

Figure 3. Community size distribution of the ground truths (GT) and discovered communities in Reddit-I (subplots (a)–(b)) and Reddit-III (subplots (c)–(d)), respectively, for ten methods, i.e. TR-NOC (TR), PisCES (PS), ts-Spect (TS), GDG (GD), sE-NMF (SE), CCPSO (CC), MultiGL (GL), Adj-Mat (AM), SuRep (SR) and ComSP (CS). The red box indicates the ground-truth distribution and the red dotted lines represent its median.

Note that the tracking operation in two-stage methods, e.g. GDG and ts-Spect, may produce many communities. We evaluate our method and the two-stage methods by averaging the metrics over the snapshots in Reddit; the results are shown in table 3. The proposed method shows competitive performance compared to the state-of-the-art two-stage methods. In particular, although CCPSO is superior to our method in terms of precision, our method has much higher recall than CCPSO. Looking into the precision values of CCPSO, which are close to 1, we find that CCPSO generates many communities in each snapshot, independently of the community-alignment operation. As a result, CCPSO performs well on Reddit-III, which consists of a large number of small communities.

Table 3. Performance comparisons between ComSP and the two-stage methods on the identification of community affiliations of each snapshot in Reddit. The top 2 performance is highlighted in bold, and the second-best results are also underlined.

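| Dataset | Method | NMI | ARI | P | R | F1 |
|---|---|---|---|---|---|---|
| Reddit-I(a) | ts-Spect | 0.338 | 0.331 | 0.642 | 0.778 | 0.700 |
| Reddit-I(a) | GDG | 0.458 | 0.382 | 0.797 | 0.422 | 0.540 |
| Reddit-I(a) | PisCES | 0.000 | 0.000 | 0.464 | 1.000 | 0.631 |
| Reddit-I(a) | sE-NMF | 0.341 | 0.187 | 0.652 | 0.610 | 0.628 |
| Reddit-I(a) | CCPSO | 0.639 | 0.409 | 1.000 | 0.392 | 0.555 |
| Reddit-I(a) | ComSP (ours) | 0.722 | 0.775 | 0.866 | 0.839 | 0.852 |
| Reddit-I(b) | ts-Spect | 0.272 | 0.213 | 0.566 | 0.767 | 0.642 |
| Reddit-I(b) | GDG | 0.400 | 0.299 | 0.742 | 0.362 | 0.481 |
| Reddit-I(b) | PisCES | 0.000 | 0.000 | 0.427 | 1.000 | 0.596 |
| Reddit-I(b) | sE-NMF | 0.386 | 0.265 | 0.641 | 0.625 | 0.628 |
| Reddit-I(b) | CCPSO | 0.624 | 0.351 | 1.000 | 0.333 | 0.491 |
| Reddit-I(b) | ComSP (ours) | 0.735 | 0.775 | 0.846 | 0.858 | 0.850 |
| Reddit-III(a) | ts-Spect | 0.597 | 0.205 | 0.698 | 0.493 | 0.560 |
| Reddit-III(a) | GDG | 0.679 | 0.454 | 0.698 | 0.560 | 0.614 |
| Reddit-III(a) | PisCES | 0.149 | 0.046 | 0.280 | 0.907 | 0.372 |
| Reddit-III(a) | sE-NMF | 0.670 | 0.279 | 0.74 | 0.406 | 0.519 |
| Reddit-III(a) | CCPSO | 0.853 | 0.659 | 0.918 | 0.695 | 0.789 |
| Reddit-III(a) | ComSP (ours) | 0.806 | 0.635 | 0.673 | 0.921 | 0.769 |
| Reddit-III(b) | ts-Spect | 0.607 | 0.173 | 0.789 | 0.392 | 0.497 |
| Reddit-III(b) | GDG | 0.719 | 0.483 | 0.815 | 0.559 | 0.65 |
| Reddit-III(b) | PisCES | 0.325 | 0.082 | 0.490 | 0.665 | 0.440 |
| Reddit-III(b) | sE-NMF | 0.659 | 0.269 | 0.813 | 0.373 | 0.504 |
| Reddit-III(b) | CCPSO | 0.839 | 0.659 | 0.959 | 0.659 | 0.772 |
| Reddit-III(b) | ComSP (ours) | 0.808 | 0.680 | 0.735 | 0.898 | 0.797 |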

To gain more insight into the differences between the proposed method and its rival TR-NOC, we visualize the partition results of the two methods for each snapshot of Reddit-I(a), with the ground truth as a reference. Figure 4 makes evident that the communities discovered by our method agree with the ground truth in tracking the community evolution. Specifically, there are three ground-truth clusters (labeled in three colors) and two major components in the first snapshot, which are successfully identified by the proposed method. However, TR-NOC neither uncovers the three clusters in the first snapshot nor tracks the split of the second largest component at the second time step. More noticeably, in the challenging situation at the third time step, where some bridging nodes connect two communities (as shown in the ground truth), our method provides an essentially clear separation between these two communities, with only one mislabeled node on the boundary. In contrast, TR-NOC mixes the two communities (in violet and green, respectively) together. We can conclude from the visualized results that TR-NOC tends to form large communities, which explains why the recall of TR-NOC is remarkably higher than that of most baseline methods (as shown in figure 3).

Figure 4. Visualization of the community detection results of our approach (ComSP) (b) and TR-NOC (c) on the network sequence of Reddit-I(a), in comparison with the ground truth (a). Different colors in the plots correspond to different community labels in the ground truth. Here, the ratio of correctly clustered nodes on each snapshot of the dynamic network is [0.917, 0.854, 0.907, 0.932] for ComSP and [0.900, 0.846, 0.787, 0.846] for TR-NOC, respectively.

Furthermore, we compare the performance of our method with TR-NOC in identifying the community affiliations of the top-10% nodes in terms of average clustering coefficient (defined in the footnotes) in dynamic networks. Tracking the community affiliations of critical nodes has its own merits in fields such as neuroscience [56], and we use this task to evaluate the performance of our method. Specifically, we rank the nodes according to their average clustering coefficient over the period of interest, where the average clustering coefficient is the local clustering coefficient of a node averaged over all snapshots. We then select the nodes whose average clustering coefficient is in the top 10% and identify the community memberships of these nodes in the Reddit datasets with different time spans. Table 4 shows the overall superiority of our method on the four datasets. Note that the recall of TR-NOC remains higher than ours in this application, as discussed above.
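
The node-selection step can be sketched as follows, assuming undirected snapshots stored as symmetric neighbor sets with comparable node identifiers. The function names (`local_clustering`, `top_nodes`), the rounding rule for the number of selected nodes and the toy snapshots are illustrative assumptions.

```python
def local_clustering(adj, i):
    """Local clustering coefficient of node i in one snapshot.

    adj: dict mapping node -> set of neighbors (symmetric).
    """
    nbrs = adj.get(i, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Count the edges that exist among the neighbors of i.
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj.get(u, set()))
    return 2 * links / (k * (k - 1))


def top_nodes(snapshots, fraction=0.1):
    """Rank nodes by their clustering coefficient averaged over all snapshots."""
    nodes = sorted(set().union(*(adj.keys() for adj in snapshots)))
    T = len(snapshots)
    avg = {i: sum(local_clustering(adj, i) for adj in snapshots) / T for i in nodes}
    count = max(1, round(fraction * len(nodes)))
    ranked = sorted(nodes, key=lambda i: avg[i], reverse=True)
    return ranked[:count], avg


# Snapshot 1: triangle 0-1-2 with a pendant node 3; snapshot 2: path 0-1-2-3.
g1 = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
g2 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
selected, avg_cc = top_nodes([g1, g2], fraction=0.5)
```

With only four nodes the example uses a fraction of 0.5 rather than 0.1 so that the selection is non-trivial; nodes 0 and 1 are selected with an average coefficient of 0.5 each.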

Table 4. Performance comparisons between ComSP and TR-NOC on the identification of community affiliations of the top-10% nodes in terms of average clustering coefficient in Reddit.

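| Dataset | Method | NMI | ARI | P | R | F1 |
|---|---|---|---|---|---|---|
| Reddit-I(a) | TR-NOC | 0.733 | 0.590 | 0.767 | 1.000 | 0.868 |
| Reddit-I(a) | ComSP (ours) | 0.854 | 0.870 | 0.938 | 0.925 | 0.931 |
| Reddit-I(b) | TR-NOC | 0.000 | 0.000 | 0.432 | 1.000 | 0.603 |
| Reddit-I(b) | ComSP (ours) | 0.746 | 0.797 | 0.851 | 0.831 | 0.841 |
| Reddit-III(a) | TR-NOC | 0.930 | 0.745 | 0.802 | 1.000 | 0.890 |
| Reddit-III(a) | ComSP (ours) | 0.916 | 0.665 | 0.919 | 0.805 | 0.858 |
| Reddit-III(b) | TR-NOC | 0.853 | 0.636 | 0.672 | 0.963 | 0.791 |
| Reddit-III(b) | ComSP (ours) | 0.892 | 0.709 | 0.879 | 0.781 | 0.827 |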

7. Conclusion

In this work, we have proposed a novel dynamic community discovery algorithm, which projects each snapshot into a common subspace to produce a global smoothing for each snapshot, and clusters all time-stamped nodes in dynamic networks. In this way, our method achieves the best stability in dynamic networks compared to the state-of-the-art methods. Another advantage of our method is that, by clustering the nodes in the projected subspace, community detection and tracing are performed in one stage. Compared with the two-stage methods, the one-stage method omits the matching stage for the sequential snapshots and reduces the computational complexity. However, one limitation of our method, like other one-stage approaches, is that it is off-line, which means we cannot yet detect communities of dynamic networks in real time.

We have evaluated the proposed method on both real and synthetic datasets and demonstrated that it performs more stably than the baselines. The sizes of the communities discovered by the proposed method are closer to the GT, that is, the number of communities is almost the same as in the ground truth, indicating that our method can successfully trace most of the communities. However, the recall of our method is inferior to that of the state-of-the-art method. We believe this is due to clustering based on the constructed node similarity matrix, which tends to neglect some details of the temporal connection patterns among nodes. Mitigating the influence of scattered nodes and increasing the purity of clusters is an important part of our future work.

Acknowledgments

This work is supported by the National Science Foundation of China (Nos. 61873218 and 61973086), Sichuan Province Science and Technology Department (No. 18YYJC1147), Nanchong school cooperation project (No. 18SXHZ0009), Shanghai Municipal Science and Technology Major project (No. 2018SHZDZX01) and ZJ Lab, and SWPU Innovation Base Foundation (No. 642).

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Appendix A. Selection of parameters s and K

In our algorithm, the optimal subspace is composed of principal eigenvectors and can render the clustering pattern of nodes, based on the similarity matrix. Previous work [57] has shown that for a network with N nodes and m communities, there will typically be m eigenvalues that are markedly larger than the magnitudes of all the other (N − m) eigenvalues. That is, there are significant gaps between the m leading eigenvalues and the remainder. We therefore choose the value of s in a similar way: based on the eigenvalue distribution of the covariance matrix (as shown in figure 5), we select the s leading eigenvalues that separate themselves from the majority of eigenvalues with low values.
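
The selection above is made by inspecting the eigenvalue distribution. One way to automate it, offered here only as an assumed rule and not as the procedure used in the experiments, is to place s at the largest gap between consecutive sorted eigenvalues. The function name `select_s` and the eigenvalues in the example are hypothetical.

```python
def select_s(eigenvalues):
    """Return the number of leading eigenvalues before the largest eigen-gap."""
    vals = sorted(eigenvalues, reverse=True)
    # gaps[i] is the drop from the (i+1)-th to the (i+2)-th largest eigenvalue.
    gaps = [vals[i] - vals[i + 1] for i in range(len(vals) - 1)]
    return gaps.index(max(gaps)) + 1


# Hypothetical spectrum: four large eigenvalues followed by a bulk of small ones.
s = select_s([9.1, 8.7, 8.2, 7.9, 0.6, 0.5, 0.3])
```

For this hypothetical spectrum the largest drop occurs after the fourth eigenvalue, so the rule returns s = 4.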

Figure 5. The distributions of the eigenvalues of the similarity matrix, where the inset plots show significant eigen-gaps between leading eigenvalues.

Regarding the number of communities, we also approximate it heuristically according to the aforementioned observations. As shown in figure 5, the prominent eigen-gaps indicate the existence of a clustering pattern when nodes are embedded in the eigen-subspace, and the clusters in the summarized similarity relationships given by M most likely correspond to the communities present in the dynamic network. In particular, the optimal number of communities is expected to be close to the value of s, which makes it convenient for K-means to search for the appropriate number of clusters. Therefore, we use the elbow method to search for the optimal number of clusters around the value of s. For example, for Reddit-I(a) we first determine the optimal number of eigenvalues, i.e. s = 4, then we search for the cluster number around 4, as shown in figure 6, where the metric is the sum of squared distances to the closest centroid over all observations in K-means. The optimal K lies at the 'elbow' of the curve, where K = 3.
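
The elbow is read from the curve by eye. A simple automated stand-in, assumed here for illustration only, picks the K at which the curve bends most sharply, i.e. where the discrete second difference of the K-means objective is largest. The function name `elbow` and the objective values below are hypothetical and are not the values plotted in figure 6.

```python
def elbow(ks, inertias):
    """Return the K with the largest second difference of the K-means objective.

    ks:       candidate cluster numbers in increasing, evenly spaced order
    inertias: sum of squared distances to the closest centroid for each K
    """
    best_k, best_bend = None, float("-inf")
    for idx in range(1, len(ks) - 1):
        # Positive and large when the curve flattens abruptly after ks[idx].
        bend = inertias[idx - 1] - 2 * inertias[idx] + inertias[idx + 1]
        if bend > best_bend:
            best_k, best_bend = ks[idx], bend
    return best_k


# Hypothetical objective values for K = 1..6, searched around s = 4.
k_opt = elbow([1, 2, 3, 4, 5, 6], [100.0, 60.0, 25.0, 20.0, 17.0, 15.0])
```

The first and last candidates cannot be selected because the second difference is undefined there, so the search range should extend at least one step on either side of the expected K.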

Figure 6. The elbow method to search K in Reddit-I(a), where we select K = 3 (dashed red line) in our experiment.

Appendix B. Evaluation metrics

Let |S| be the number of nodes, L(i) be the real community of node i, and |L(i)| be the size of the community L(i). Let ψ(i, j) be 1 if and only if nodes i and j are correctly detected in the same community. The BCubed measure can then be written as follows:

Equation (B.1)

where

Equation (B.2)

and

Equation (B.3)
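
As a computational illustration, BCubed precision, recall and F1 can be sketched as follows, assuming the standard BCubed definitions, in which a pair (i, j) counts as correct when the two nodes share both a predicted and a true community. The function name `bcubed` and the toy labels are illustrative assumptions.

```python
def bcubed(pred, true):
    """BCubed precision, recall and F1.

    pred, true: dicts mapping node -> predicted / true community label.
    """
    nodes = list(true)
    precision = recall = 0.0
    for i in nodes:
        same_pred = [j for j in nodes if pred[j] == pred[i]]
        same_true = [j for j in nodes if true[j] == true[i]]
        # Nodes sharing both the predicted and the true community of i
        # (i itself is always included, so correct >= 1).
        correct = sum(1 for j in same_pred if true[j] == true[i])
        precision += correct / len(same_pred)
        recall += correct / len(same_true)
    precision /= len(nodes)
    recall /= len(nodes)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two true communities of two nodes each, predicted as one giant community.
truth = {0: "a", 1: "a", 2: "b", 3: "b"}
giant = {node: 0 for node in truth}
p, r, f1 = bcubed(giant, truth)
```

The giant-community prediction reaches a recall of 1 with a precision of only 0.5, the same pattern of perfect recall and low precision that appears in table 2 for methods that return a single community.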

We draw a statistical matrix N with dimensions |C| × |L|, where |C| is the number of detected communities and |L| is the number of true communities. $N_{cl}$ is the number of nodes that belong to both the cth predicted community and the lth true community, and $N_c$ and $N_l$ are the numbers of nodes in the cth predicted community and the lth true community, respectively. NMI is then given by:

Equation (B.4)

where

Equation (B.5)

Equation (B.6)

and

Equation (B.7)
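
A minimal sketch of computing NMI from the counts $N_{cl}$, $N_c$ and $N_l$, assuming normalization by the arithmetic mean of the two entropies, i.e. 2I(C; L) / (H(C) + H(L)); other normalizations exist and the one used in [51] may differ. The function name `nmi` is illustrative.

```python
from collections import Counter
from math import log


def nmi(pred, true):
    """Normalized mutual information between two label sequences."""
    n = len(true)
    n_cl = Counter(zip(pred, true))  # entries of the statistical matrix N
    n_c = Counter(pred)              # sizes of the predicted communities
    n_l = Counter(true)              # sizes of the true communities
    mutual = sum(
        v / n * log(v * n / (n_c[c] * n_l[l])) for (c, l), v in n_cl.items()
    )
    h_c = -sum(v / n * log(v / n) for v in n_c.values())
    h_l = -sum(v / n * log(v / n) for v in n_l.values())
    if h_c + h_l == 0:
        return 1.0  # both partitions are a single community
    return 2 * mutual / (h_c + h_l)


nmi_perfect = nmi([0, 0, 1, 1], ["a", "a", "b", "b"])
nmi_giant = nmi([0, 0, 0, 0], ["a", "a", "b", "b"])
```

A single giant predicted community carries no information about the true labels, so its NMI is 0 regardless of the normalization chosen.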

Besides, the corrected-for-chance version of the Rand index, called the adjusted Rand index [52], is employed with the statistical matrix N, which reads as:

Equation (B.8)

where RI is the rand index [58]:

Equation (B.9)

and Max(RI) and E(RI) are the maximum index and the expected index, respectively:

Equation (B.10)

Equation (B.11)

Note that $\binom{|S|}{2}$ is the number of all possible node pairs.
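
The quantities above can be computed from the statistical matrix as sketched below, assuming the usual pair-counting (Hubert-Arabie) form in which the index is the number of node pairs placed together in both partitions. The function name `ari` and the variable names are illustrative; `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from math import comb


def ari(pred, true):
    """Adjusted Rand index, (index - expected) / (maximum - expected)."""
    n = len(true)
    # Pairs of nodes that are together in both the predicted and true partitions.
    index = sum(comb(v, 2) for v in Counter(zip(pred, true)).values())
    pairs_c = sum(comb(v, 2) for v in Counter(pred).values())
    pairs_l = sum(comb(v, 2) for v in Counter(true).values())
    expected = pairs_c * pairs_l / comb(n, 2)
    maximum = (pairs_c + pairs_l) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one community
    return (index - expected) / (maximum - expected)


ari_perfect = ari([0, 0, 1, 1], ["a", "a", "b", "b"])
ari_giant = ari([0, 0, 0, 0], ["a", "a", "b", "b"])
```

For a single giant predicted community the index equals its expected value, so the ARI is exactly 0.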

Appendix C. Complementary visualization of experimental results

We visualize here some results complementary to the tables and figures in the main text. The number of communities detected by each method is shown in figure 7. Moreover, we present the results of table 2 in graphical form in figure 8. However, methods whose scores are very close or equal to zero are not easy to recognize in the figure.

Figure 7. Number of community partitions on SBM (subplot (a)), Reddit-II (subplot (b)), Reddit-I (subplots (c)–(d)) and Reddit-III (subplots (e)–(f)), where each point corresponds to the number of communities detected by one of ten methods, i.e. TR-NOC (TR), PisCES (PS), ts-Spect (TS), GDG (GD), sE-NMF (SE), CCPSO (CC), MultiGL (GL), Adj-Mat (AM), SuRep (SR) and ComSP (CS).

Figure 8. The evaluation of the methods on the real datasets, i.e. TR-NOC (TR), PisCES (PS), ts-Spect (TS), GDG (GD), sE-NMF (SE), CCPSO (CC), MultiGL (GL), Adj-Mat (AM), SuRep (SR) and ComSP (CS).

Footnotes

  • The node similarity matrix is used to organize the mutual similarities between nodes, where the similarity is measured based on the common historical neighbors of two nodes.

  • In practical applications, for dynamic networks with drastic structural changes, the direct connectivity needs to be weighted more heavily than the second-order one, i.e. α > β, while for steady networks equal weights are recommended.

  • The average clustering coefficient is computed by $\bar{c}(i)=\frac{1}{T}\sum_{t=1}^{T}c_{t}(i)$, where $c_t(i)$ is the clustering coefficient of node i in snapshot $G_t$.
