Predicting future links with new nodes in temporal academic networks

Yijun Ran; Si-Yuan Liu; Xiaoyao Yu; Ke-Ke Shang; Tao Jia

doi:10.1088/2632-072X/ac4bee

1. Introduction

Complex networks have emerged as a powerful tool for modeling and describing real-world systems, from traffic systems [1, 2] to power systems [3, 4], from neural systems [5, 6] to ecosystems [7], and from economic systems [8, 9] to social systems [10, 11]. In such networks, nodes are the entities of a system and links are the interactions or connections between entities [12, 13]. Understanding the evolutionary mechanisms of real systems, that is, which mechanisms generate the patterns governing their evolution, is a problem of fundamental importance for network science [14]. Most network evolutionary models, such as the Watts–Strogatz small-world model [15] and the Barabási–Albert preferential attachment model [16], provide an iterative process for constructing global properties but cannot capture the evolutionary mechanisms of real systems. Although real systems may appear to evolve regularly, directly capturing patterns over the whole process of evolution is not trivial. Hence, over the last few years, network scientists have devoted considerable effort to problems concerning local structure, such as link prediction, community detection and network reconstruction, which can help us understand the laws governing the evolution of real systems [17–19].

Link prediction has long been a question of great interest in network mining, dating back to the work of Liben-Nowell and Kleinberg [20]. The aim of link prediction is to uncover missing links from the known parts of a network, or to quantify the likelihood that future links will emerge from the current structure of the network. A considerable number of studies have demonstrated that link prediction has both wide practical value and theoretical significance [21–24]. In applications, the results of link prediction help to understand what structural damage causes a certain disease by analyzing the molecular network in the cell [21]. Link prediction techniques can also help pharmaceutical experts develop therapeutic drugs without obvious side effects by studying the medical network [22]. On the theoretical side, link prediction algorithms help one understand the evolutionary mechanisms of real systems and, further, construct evolution models [25–27]. Although much research has been carried out on link prediction, there is still very little scientific understanding of link prediction in temporal networks.

Recently, several methods have been proposed to predict future links in temporal or time-evolving networks [28]. Shang et al pointed out that if a pair of nodes are linked to each other in the current network, they will have a higher probability to link the same nodes in future networks [29]. Güneş et al treated neighborhood-based node similarity scores at different timestamps as time series, and then used a time series model to predict new or recurring links in evolving networks [30]. Wang et al showed that each node has an ability to attract other nodes, which depends not only on its structural importance but also on its current popularity: active nodes are much more likely to acquire future links [31]. Chi et al proposed a similarity-based predictor, DLPA, which is based on the attraction force between nodes, to detect missing links and to predict whether potential links will become real links in the future [32]. Bütün et al took advantage of both local and global topological structures to predict the citation count of scientists by employing a temporal link prediction metric [33]. Most existing methods focus on predicting future links in time-evolving networks in which only links appear or vanish. However, most real systems evolve over time, with entities and the interactions between them being added and removed: new entities or relationships appear and old ones vanish. Therefore, it is necessary to design an efficient approach to predict future links in temporal networks in which nodes appear or vanish.

In real systems, two users are likely to establish a friendship in the near future if they have many common friends or interests; conversely, common friends or interests can be used to uncover lost friends or predict future friends [18, 34, 35]. This shows that network structure information and nodal attributes are both important for link prediction. However, in most scenarios attribute information is unavailable or unreliable because of privacy policies. Fortunately, user attributes in systems such as scholar social systems are easy to obtain and relatively reliable. This is because a new scholar needs to fill in basic attribute information, such as affiliation, research direction and academic field, when registering in an academic social system such as ResearchGate or SCHOLAT. Hence, it is of great significance to take advantage of the attribute information to study the interaction between new users and old users.

In this study, we extend the structural equivalence and shortest path length hypotheses to predict future links between new nodes and old nodes in time-evolving long-path networks [36]. This work generalizes our previous work [36] from long-path static networks to time-evolving long-path networks. To solve the temporal link prediction problem with new nodes, we integrate the network structure and external information (e.g., nodal attributes). Specifically, we combine nodal attribute similarity and shortest path length in a predictor named ASSPL to predict future links when the networks have new nodes. The results on a scholar social network and academic funding networks show that ASSPL is highly effective and applicable for time-evolving funding networks.

Our contributions are as follows. (1) We propose a new similarity-based framework that integrates the network structure and nodal attributes for temporal link prediction with new nodes. For this, we use a free parameter to examine how network structure and nodal attributes affect the performance of temporal link prediction. (2) Extensive experiments are conducted on a scholar social network and several academic funding networks, and the proposed ASSPL predictor is shown to outperform state-of-the-art methods for temporal link prediction with new nodes in time-evolving funding networks.

The remainder of the paper is organized as follows. We give a brief description of the temporal link prediction task and real network data in section 2. In section 3, we introduce static link prediction methods and propose temporal similarity methods. We report the main results and findings in section 4. Finally, section 5 is the conclusion and discussion.

2. Problem definition and data description

2.1. Problem definition

Given time periods t = 1, ..., T, where T is the window size, we consider an undirected time-evolving network (snapshot) G_t composed of a set of nodes V_t and a set of links L_t, in which a node cannot connect to itself (no self-loops) nor share more than one link with another node (no repeated links). Each node i has a set of associated attributes, denoted A_i, which do not change over time. Within time period t, some new nodes or links will appear in the network and some old nodes or links will vanish. Given a time-evolving network G_t at time t, we formulate the temporal link prediction problem as follows.

The problem of temporal link prediction is to predict the occurrence probabilities of future links from past network structures. For a time-evolving network, we define the links L_t of the network at time t as the training set ${L}_{t}^{T}$ . The links L_t+1 of the network at time t + 1 are taken as the future links set ${L}_{t+1}^{P}$ , which needs to be inferred from the network topology given by the links in ${L}_{t}^{T}$ . To test the performance of temporal link prediction, another probe set ${L}_{t+1}^{N}$ is used as the control group of ${L}_{t+1}^{P}$ . It is composed of randomly chosen nonexistent links from the network G_t+1, usually with the same size as ${L}_{t+1}^{P}$ . The sampling process must ensure that the randomly chosen nonexistent links also do not exist in G_t. Figure 1 illustrates this procedure, where the dotted lines in the network at time t = T + 1 are the future links set ${L}_{t+1}^{P}$ and the solid lines in the network at time t = T are the training set ${L}_{t}^{T}$ .
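The construction of the control set ${L}_{t+1}^{N}$ can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the function name `sample_negative_links`, the representation of links as node pairs, and the choice to draw both endpoints uniformly from V_t+1 are all hypothetical.

```python
import random

def sample_negative_links(nodes_next, links_t, links_next, size, seed=0):
    """Draw `size` distinct node pairs that are links in neither G_t nor G_{t+1}.

    Assumes that at least `size` such pairs exist; otherwise the loop never ends.
    """
    rng = random.Random(seed)
    nodes = sorted(nodes_next)
    # Undirected links: store each one as an unordered pair.
    existing = {frozenset(e) for e in links_t} | {frozenset(e) for e in links_next}
    negatives = set()
    while len(negatives) < size:
        a, b = rng.sample(nodes, 2)  # two distinct nodes, so no self-loops
        pair = frozenset((a, b))
        if pair not in existing:
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]

# Toy example: of the six possible pairs on four nodes, only (1, 4) and (2, 4)
# are absent from both snapshots.
links_t = [(1, 2), (2, 3)]
links_next = [(1, 3), (3, 4)]
negatives = sample_negative_links({1, 2, 3, 4}, links_t, links_next, size=2)
```

Rejecting pairs that exist in either snapshot implements the requirement that a sampled nonexistent link of G_t+1 must also be absent from G_t.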

**Figure 1.** An example demonstrating the task of temporal link prediction. In this network, old nodes or links may vanish and new nodes or links may appear over time. We take advantage of the network structure at time t = T to predict future links at time t = T + 1. Here, the attributes of node 1 are A₁ = {a}, the attributes of new node 7 are A₇ = {c, d, e}, and so on. More importantly, from the beginning, every node in the network has attributes that do not change over time.

In the temporal link prediction problem, a predictor takes node attributes and some features of the network, and assigns to each pair of nodes a and b a score S_ab, which is proportional to the probability that nodes a and b will be connected. As G_t is undirected, the score is symmetric, i.e., S_ab = S_ba. The scores of node pairs that are not currently connected are sorted in descending order, and the top candidates are the most likely future links. The prediction quality is measured by comparing the scores of the links in ${L}_{t+1}^{P}$ and ${L}_{t+1}^{N}$ . Each time, we randomly pick one link from ${L}_{t+1}^{P}$ and one from ${L}_{t+1}^{N}$ and compare their scores. If, among n independent comparisons, the future link in ${L}_{t+1}^{P}$ has the higher score n' times and the two links have the same score n'' times [34, 37], the AUC is calculated as

$\begin{equation}\mathrm{A}\mathrm{U}\mathrm{C}=\frac{{n}^{\prime }+0.5{n}^{{\prime\prime}}}{n}.\end{equation} \tag{ 1 }$
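The sampling estimate in equation (1) can be sketched in Python as below. This is a minimal illustration rather than the code used in this study; the function name `auc_by_sampling`, its arguments and the default number of comparisons are assumptions made for the example.

```python
import random

def auc_by_sampling(scores_pos, scores_neg, n=10000, seed=0):
    """Estimate the AUC of equation (1) from n random pairwise comparisons.

    scores_pos: scores of the future links (L^P)
    scores_neg: scores of the sampled nonexistent links (L^N)
    """
    rng = random.Random(seed)
    higher = 0  # n'  : the future link scores strictly higher
    ties = 0    # n'' : both links receive the same score
    for _ in range(n):
        s_pos = rng.choice(scores_pos)
        s_neg = rng.choice(scores_neg)
        if s_pos > s_neg:
            higher += 1
        elif s_pos == s_neg:
            ties += 1
    return (higher + 0.5 * ties) / n

# Every future link outranks every nonexistent link, so each comparison counts in n'.
perfect = auc_by_sampling([1.0, 2.0], [0.0, 0.5])
```

If a predictor assigns the same score to every pair, every comparison is a tie and the estimate is exactly 0.5, which is the value expected from random guessing.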

2.2. Data description and processing

2.2.1. Data description

In this work, two kinds of large temporal networks are used to evaluate the performance of the algorithms. The first is a scholar social network from the SCHOLAT website. SCHOLAT is an emerging social system specifically designed and built for scholars, learners and course instructors in China. The website supports academic and learning discourse among its users, strengthening social interaction and collaboration among scholar communities. This social network is an open-access dataset from SCHOLAT Lab which contains 10 607 nodes and 168 540 links. The dataset has been divided into the training set ${L}_{t}^{T}$ and the future links set ${L}_{t+1}^{P}$ . According to our previous study [36], the scholar social network is a short-path network, as shown in figure 2(a). More complete statistics of the social dataset are given in table 1.

**Figure 2.** A visualization of two kinds of real networks. (a) The scholar social network built from the SCHOLAT dataset, which has 10 607 nodes and 168 540 links. This scholar social network is a short-path network containing a myriad of closed triangular structures (high clustering coefficient). Using the fast greedy algorithm [38], this short-path network can be partitioned into 325 communities. (b) The academic funding network built from the Funding0809 dataset, which has 18 192 nodes and 21 006 links. This academic funding network is a long-path network with many chain-like structures. Using the fast greedy algorithm, this long-path network can be partitioned into 4420 communities. Nodes and links in the same community have the same color, and different communities have different colors.

Table 1. The basic topological features of the real-world networks. Here, N_t and M_t are the total numbers of nodes and links in network G_t, respectively; that is, M_t is the number of links in the training set ${L}_{t}^{T}$ . We define ⟨d⟩ as the average shortest path length of the giant component of network G_t. In this study, networks with ⟨d⟩ < 9 are short-path networks and networks with ⟨d⟩ ⩾ 9 are long-path networks [36]. T is the window size. M_t+1 represents the number of links in the future links set ${L}_{t+1}^{P}$ . F_n denotes the fraction of new nodes in G_t+1, where the new nodes are those in V_t+1 but not in V_t.

Datasets	N_t	M_t	⟨d⟩	T	M_t+1	F_n	Attributes of nodes
Funding0809	18 192	21 006	19.188	2 year	16 327	0.650	Title, discipline, affiliation
Funding0910	25 698	30 109	15.136	2 year	21 830	0.629	Title, discipline, affiliation
Funding1011	31 952	38 157	15.581	2 year	32 531	0.638	Title, discipline, affiliation
Funding1112	43 069	54 361	15.260	2 year	43 904	0.601	Title, discipline, affiliation
Funding1213	56 926	76 435	14.765	2 year	52 912	0.565	Title, discipline, affiliation
Funding1314	69 976	96 816	13.918	2 year	61 194	0.542	Title, discipline, affiliation
Funding1415	81 540	114 106	13.887	2 year	70 522	0.526	Title, discipline, affiliation
SCHOLAT	10 607	168 540	4.510	\	16 854	0.010	Interests, discipline, affiliation

The second is an academic funding network (a co-occurrence network of funds), in which a link denotes that two funds appear in the same paper. We collected the fund datasets from the National Natural Science Foundation of China (NSFC); there are 320 226 funds in the datasets. For academic papers, we use the publication data of the Web of Science (WoS) from 2008 to 2016. We perform link prediction analysis on the academic funding network to mine the co-occurrence relationship, which represents the fact that two funds jointly support a research paper. Since a fund has its own survival time, we use a two-year window of data to construct a time-evolving network and predict the future links in the following year. In this way, we construct 7 academic funding networks. For example, Funding0809 in table 1 denotes the network constructed from the data of 2008 and 2009, on which we study the likelihood of the emergence of links in 2010. According to our previous study [36], the 7 academic funding networks are all long-path networks, such as the Funding0809 network shown in figure 2(b). More complete statistics of the funding dataset are given in table 1.
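The window-based construction described above can be illustrated with a short Python sketch. It is not the pipeline used in this study: the record layout `(year, [fund ids])`, the function names, the fund identifiers and the convention that a node is any fund with at least one link are hypothetical choices made for the example.

```python
from itertools import combinations

def build_cooccurrence_links(papers, years):
    """Link every pair of funds acknowledged by the same paper published in `years`."""
    links = set()
    for year, funds in papers:
        if year in years:
            for a, b in combinations(sorted(set(funds)), 2):
                links.add((a, b))  # stored once with a < b: undirected, no repeats
    return links

def new_node_fraction(links_t, links_next):
    """F_n under the assumption that it is normalized by the size of V_{t+1}."""
    nodes_t = {v for link in links_t for v in link}
    nodes_next = {v for link in links_next for v in link}
    return len(nodes_next - nodes_t) / len(nodes_next)

# Toy records: (publication year, funds acknowledged by the paper).
papers = [
    (2008, ["F1", "F2"]),
    (2009, ["F2", "F3", "F1"]),
    (2010, ["F3", "F4"]),
]
train_links = build_cooccurrence_links(papers, {2008, 2009})  # two-year window
future_links = build_cooccurrence_links(papers, {2010})       # following year
fraction_new = new_node_fraction(train_links, future_links)
```

In this toy example fund F4 first appears in 2010, so it is a new node with no structural information in the training network.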

Here, the scholar social network and academic funding network are typical time-evolving networks in which new nodes or links will appear and old nodes or links will vanish over time. Hence, the performance of traditional temporal approaches in those networks is poor. In this study, we propose a new similarity-based framework that integrates the network structure and nodal attribute for the temporal link prediction with new nodes.

2.2.2. Data processing

Regardless of whether the scholar social network or an academic funding network is considered, the attribute information we collected is unstructured Chinese text, such as research interests, discipline and fund title. Chinese text must be processed differently from English text because words are not separated by spaces. We first carry out word segmentation for the Chinese text, and then remove stop words that do not provide any useful information. To better study the temporal link prediction task with new nodes, we apply the latent Dirichlet allocation (LDA) topic model [39] to build the topic vector of each node and measure the similarity between nodes by cosine similarity [40]. The topic vectors are built as follows:

Word segmentation. To segment proper nouns correctly (for example, to avoid splitting 'Southwest University' into 'Southwest' and 'University'), we first use the 'modify dictionary' mode in the Jieba tool [41] to adjust the frequency of individual words. We then apply the 'keyword extraction' mode based on the TF-IDF algorithm in the Jieba tool to re-segment each node's attributes and extract the top five weighted keywords of the node. For a node that has no keywords, we take the union of the keyword sets of its neighbors in the network G_t. This attribute augmentation process is depicted in figure 3.
Keywords vector. After obtaining the keyword lists for all nodes, we build a dictionary that assigns each keyword a unique index. We then use the bag-of-words model (Doc2BOW in the Gensim tool [42]) to convert each node's keywords into a vector of occurrence frequencies. We further use the TF-IDF model to convert the bag-of-words result from a frequency vector of keywords into an importance vector of keywords.
LDA topic model. In this step, we use the importance vectors of keywords calculated by TF-IDF to build the LDA topic model, in which we set the number of topics to 10 (the dimension of the topic space). This value was chosen after extensive testing. The LDA model projects a node's keywords into the topic space, which allows us to calculate the similarity between nodes by cosine similarity.
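The dictionary, bag-of-words and TF-IDF steps above can be illustrated in plain Python. The sketch below is a stand-in for the Gensim components named in the text, not a reproduction of them: the function names are hypothetical, the weighting is the textbook tf × log(N/df) form, which may differ from the variant and normalization used by Gensim, and the LDA step is omitted.

```python
import math
from collections import Counter

def build_dictionary(keyword_lists):
    """Assign a unique integer index to every distinct keyword."""
    vocab = sorted({kw for kws in keyword_lists for kw in kws})
    return {kw: idx for idx, kw in enumerate(vocab)}

def bag_of_words(keywords, dictionary):
    """Sparse frequency vector of one node: {keyword index: count}."""
    return dict(Counter(dictionary[kw] for kw in keywords))

def tfidf(bows):
    """Reweight counts by log(N / df), so keywords shared by many nodes count less."""
    n_docs = len(bows)
    df = Counter(idx for bow in bows for idx in bow)  # number of nodes using each keyword
    return [
        {idx: tf * math.log(n_docs / df[idx]) for idx, tf in bow.items()}
        for bow in bows
    ]

# Toy keyword lists for two nodes (English placeholders for segmented Chinese text).
keyword_lists = [["network", "link"], ["network", "fund"]]
dictionary = build_dictionary(keyword_lists)
bows = [bag_of_words(kws, dictionary) for kws in keyword_lists]
weights = tfidf(bows)
```

Here 'network' occurs in both keyword lists and therefore receives zero weight, while 'link' and 'fund' each receive weight log 2. The resulting importance vectors are what the text passes to the LDA model.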

**Figure 3.** An example of the attribute augmentation for a node without attributes. Here, the node E has no attributes, hence we take E's neighbors (A, B, C, and D) in the network G_t to obtain the similar attributes. That is to say, we apply the union set of the keyword sets of the four neighbors to obtain the augmented keyword set of E.
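The augmentation in figure 3 amounts to a set union over neighbors. A minimal Python sketch follows; the function name `augment_keywords`, the dictionary-based data layout and the keyword values are assumptions made for this example.

```python
def augment_keywords(node, keywords, neighbors):
    """Return the node's own keywords, or the union of its neighbors' keywords if it has none."""
    own = keywords.get(node, set())
    if own:
        return set(own)
    merged = set()
    for nb in neighbors.get(node, ()):
        merged |= set(keywords.get(nb, ()))
    return merged

# Node E has no keywords; its neighbors in G_t are A, B, C and D (as in figure 3).
keywords = {
    "A": {"link prediction"},
    "B": {"complex networks"},
    "C": {"link prediction", "topic model"},
    "D": {"funding"},
    "E": set(),
}
neighbors = {"E": {"A", "B", "C", "D"}}
augmented = augment_keywords("E", keywords, neighbors)
```

A node that already has keywords is returned unchanged, so the augmentation only affects nodes without attributes.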

3. Similarity-based methods

Among many similarity-based predictors, Cao et al systematically compared common neighbors (CN) with other well-known similarity predictors and demonstrated that the CN predictor performs best in short-path static networks [43]. Zhou et al proposed the resource allocation (RA) predictor and showed that RA outperforms CN and the Adamic–Adar (AA) index in short-path static networks [44]. In addition, to study the role of weak ties, Lü and Zhou extended CN, AA and RA to weighted static networks by introducing a free parameter that controls the contributions of weak ties [45]. Hence, to study temporal link prediction in short-path networks, we follow the approach of Lü and Zhou and extend CN, AA and RA into temporal similarity methods that integrate network structure and nodal attributes. In our recent study [36], we proposed the SESPL predictor, which is based on the principles of structural equivalence and shortest path length. That study systematically explored link prediction in short-path and long-path networks and showed that SESPL is a state-of-the-art predictor in long-path static networks. Therefore, in this study, we also extend the SESPL predictor to time-evolving long-path networks.

3.1. Static similarity methods

The static similarity methods mostly use network topology to predict future links in static networks. Here, we concentrate on the CN, AA, RA and SESPL predictors, whose definitions are as follows.

(1) Common neighbors (CN)

The CN index directly counts the number of common neighbors that two nodes share [46]. The underlying idea is that two nodes sharing a neighbor are likely to share other common features and hence are likely to have a link. It is defined as

$\begin{equation}{S}_{ab}^{\mathrm{C}\mathrm{N}}=\vert n(a)\cap n(b)\vert ,\end{equation} \tag{ 2 }$

where n(a) and n(b) denote the sets of neighbors of nodes a and b, respectively.

(2) Adamic–Adar index (AA)

AA was proposed by Adamic and Adar and is used to compute the similarity between two web pages [47]. The AA emphasizes less-connected CN. It is defined as

$\begin{equation}{S}_{ab}^{\mathrm{A}\mathrm{A}}=\sum\limits _{c\in n(a)\cap n(b)}\frac{1}{\mathrm{log}\enspace k(c)},\end{equation} \tag{ 3 }$

where k(c) = |n(c)| denotes the degree of node c.

(3) Resource allocation index (RA)

Motivated by the physical process of resource allocation, Zhou et al [44] proposed the RA index, which penalizes common neighbors of large degree. It is defined as

$\begin{equation}{S}_{ab}^{\mathrm{R}\mathrm{A}}=\sum\limits _{c\in n(a)\cap n(b)}\frac{1}{k(c)}.\end{equation} \tag{ 4 }$

(4) SESPL

Our previous work proposed a novel similarity measure, SESPL, which is associated with the principles of structural equivalence and shortest path length [36]. The experimental results showed that the SESPL predictor works better than other classical similarity-based predictors in long-path networks. SESPL is defined as

$\begin{equation}{S}_{ab}^{\mathrm{S}\mathrm{E}\mathrm{S}\mathrm{P}\mathrm{L}}=\frac{\mathrm{SE}(a,b)}{{d}_{ab}-1},\end{equation} \tag{ 5 }$

where d_ab denotes the shortest path length between node a and b, and SE(a, b) is a measure that quantifies the structural equivalence between the local structures centered on a and b, respectively. Here we quantify the structural equivalence of two nodes by taking into account local structures consisting of the first-order and second-order neighbors. The SE(a, b) is defined as

$\begin{equation}\enspace \mathrm{SE}(a,b)=1-\mathcal{J}({P}_{a},{P}_{b}),\end{equation} \tag{ 6 }$

where $\mathcal{J}({P}_{a},{P}_{b})$ represents the Jensen–Shannon divergence between the local structures centered on nodes a and b, and P_a is a vector whose elements are the degree centralities of the nodes i in the local structure centered on node a.
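Equations (2)–(6) can be sketched in Python as follows. The sketch is illustrative only. The CN, AA and RA functions follow the definitions directly, with `adj` a hypothetical mapping from each node to its neighbor set. For SE, the text does not specify how P_a and P_b are aligned, so sorting the entries in descending order, padding the shorter vector with zeros, normalizing each vector to sum to one and using base-2 logarithms are assumptions of this example rather than the procedure of [36].

```python
import math

def common_neighbors(adj, a, b):
    """Equation (2): number of shared neighbors."""
    return len(adj[a] & adj[b])

def adamic_adar(adj, a, b):
    """Equation (3). A common neighbor has degree >= 2, so log k(c) > 0."""
    return sum(1.0 / math.log(len(adj[c])) for c in adj[a] & adj[b])

def resource_allocation(adj, a, b):
    """Equation (4): each common neighbor contributes 1 / k(c)."""
    return sum(1.0 / len(adj[c]) for c in adj[a] & adj[b])

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms, so the value lies in [0, 1]."""
    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def structural_equivalence(p_a, p_b):
    """Equation (6), with assumed alignment: sort, zero-pad, normalize to sum 1."""
    size = max(len(p_a), len(p_b))
    def prepare(v):
        v = sorted(v, reverse=True) + [0.0] * (size - len(v))
        total = sum(v)  # assumed to be positive
        return [x / total for x in v]
    return 1.0 - js_divergence(prepare(p_a), prepare(p_b))

def sespl(se_ab, d_ab):
    """Equation (5) for an unconnected pair, where d_ab >= 2."""
    return se_ab / (d_ab - 1)

# Toy graph: nodes 1 and 2 share the neighbors 3 (degree 3) and 4 (degree 2).
adj = {1: {3, 4}, 2: {3, 4}, 3: {1, 2, 5}, 4: {1, 2}, 5: {3}}
cn = common_neighbors(adj, 1, 2)
aa = adamic_adar(adj, 1, 2)
ra = resource_allocation(adj, 1, 2)
```

In the toy graph, CN counts both common neighbors equally, whereas AA and RA give node 4 (degree 2) more weight than node 3 (degree 3).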

3.2. Temporal similarity methods

Most real systems evolve over time, with entities and the interactions between them being added and removed: new entities or relationships appear and old ones vanish. A systematic understanding of how a new node links to other nodes is therefore particularly meaningful in time-evolving systems. However, because a new node has no pre-existing structural information, it is very difficult for static similarity methods to predict future links involving new nodes. To address temporal link prediction with new nodes, we integrate the network structure and external information (e.g., nodal attributes) to predict future links. The topic vector of each node's attributes is obtained as described in section 2.2.2. We first use the topic vectors to define the similarity between two nodes by the cosine similarity [40], which is defined as

$\begin{equation}\mathrm{C}\mathrm{o}\mathrm{s}(a,b)=\frac{\sum\limits _{i=1}^{m}\enspace V{(a)}_{i}V{(b)}_{i}}{\sqrt{\sum\limits _{i=1}^{m}\enspace V{(a)}_{i}^{2}}\times \sqrt{\sum\limits _{i=1}^{m}\enspace V{(b)}_{i}^{2}}},\end{equation} \tag{ 7 }$

where V(a) is the topic vector of node a and m is the dimension of the topic space in the LDA model.
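Equation (7) translates directly into a few lines of Python. In this sketch the function name and the convention of returning 0 for an all-zero vector are assumptions; the topic vectors are toy values with m = 3 rather than the 10 dimensions used in this study.

```python
import math

def cosine_similarity(u, v):
    """Equation (7) for two topic vectors of equal dimension m."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a node with an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Toy topic vectors; LDA topic weights are non-negative, so the result lies in [0, 1].
v_a = [0.7, 0.2, 0.1]
v_b = [0.6, 0.3, 0.1]
similarity = cosine_similarity(v_a, v_b)
```

Because the similarity depends only on the attribute vectors, it can be evaluated for a new node that has no links in G_t.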

We then extend the static CN, AA and RA predictors to temporal networks by employing the approach of Lü and Zhou [45]. The temporal version of the CN predictor, which we name TCN, is defined as

$\begin{equation}{S}_{ab}^{\mathrm{T}\mathrm{C}\mathrm{N}}=\begin{cases}\sum\limits _{c\in n(a)\cap n(b)}\mathrm{C}\mathrm{o}\mathrm{s}{(a,c)}^{\beta }+\mathrm{C}\mathrm{o}\mathrm{s}{(c,b)}^{\beta }\quad \hfill & \quad \mathrm{i}\mathrm{f}\enspace a\enspace \mathrm{a}\mathrm{n}\mathrm{d}\enspace b\enspace \mathrm{a}\mathrm{r}\mathrm{e}\enspace \enspace \mathrm{n}\mathrm{o}\mathrm{t}\enspace \enspace \mathrm{n}\mathrm{e}\mathrm{w}\enspace \enspace \mathrm{n}\mathrm{o}\mathrm{d}\mathrm{e},\hfill \\ \mathrm{C}\mathrm{o}\mathrm{s}(a,b)\quad \hfill & \quad \mathrm{o}\mathrm{t}\mathrm{h}\mathrm{e}\mathrm{r}\mathrm{w}\mathrm{i}\mathrm{s}\mathrm{e},\hfill \end{cases}\end{equation} \tag{ 8 }$

where β is a free parameter. Likewise, TAA and TRA are defined as

$\begin{equation}{S}_{ab}^{\mathrm{T}\mathrm{A}\mathrm{A}}=\begin{cases}\sum\limits _{c\in n(a)\cap n(b)}\frac{\mathrm{C}\mathrm{o}\mathrm{s}{(a,c)}^{\beta }+\mathrm{C}\mathrm{o}\mathrm{s}{(c,b)}^{\beta }}{\mathrm{log}(1+s(c))}\quad \hfill & \quad \mathrm{i}\mathrm{f}\enspace a\enspace \mathrm{a}\mathrm{n}\mathrm{d}\enspace b\enspace \mathrm{a}\mathrm{r}\mathrm{e}\enspace \enspace \mathrm{n}\mathrm{o}\mathrm{t}\enspace \enspace \mathrm{n}\mathrm{e}\mathrm{w}\enspace \enspace \mathrm{n}\mathrm{o}\mathrm{d}\mathrm{e},\hfill \\ \mathrm{C}\mathrm{o}\mathrm{s}(a,b)\quad \hfill & \quad \mathrm{o}\mathrm{t}\mathrm{h}\mathrm{e}\mathrm{r}\mathrm{w}\mathrm{i}\mathrm{s}\mathrm{e},\hfill \end{cases}\end{equation} \tag{ 9 }$

$\begin{equation}{S}_{ab}^{\mathrm{T}\mathrm{R}\mathrm{A}}=\begin{cases}\sum\limits _{c\in n(a)\cap n(b)}\frac{\mathrm{C}\mathrm{o}\mathrm{s}{(a,c)}^{\beta }+\mathrm{C}\mathrm{o}\mathrm{s}{(c,b)}^{\beta }}{s(c)}\quad \hfill & \quad \mathrm{i}\mathrm{f}\enspace a\enspace \mathrm{a}\mathrm{n}\mathrm{d}\enspace b\enspace \mathrm{a}\mathrm{r}\mathrm{e}\enspace \enspace \mathrm{n}\mathrm{o}\mathrm{t}\enspace \enspace \mathrm{n}\mathrm{e}\mathrm{w}\enspace \enspace \mathrm{n}\mathrm{o}\mathrm{d}\mathrm{e},\hfill \\ \mathrm{C}\mathrm{o}\mathrm{s}(a,b)\quad \hfill & \quad \mathrm{o}\mathrm{t}\mathrm{h}\mathrm{e}\mathrm{r}\mathrm{w}\mathrm{i}\mathrm{s}\mathrm{e},\hfill \end{cases}\end{equation} \tag{ 10 }$

where s(a) = ∑_c∈n(a)Cos(a, c)^β.
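Equations (8)–(10) can be sketched together in Python. This is an illustration under stated assumptions, not the implementation used in this study: the function name `temporal_scores`, the argument layout, and the reading that the first case applies only when neither a nor b is a new node are choices made here. The sketch also assumes strictly positive cosine similarities, since a zero similarity with negative β, or a zero strength s(c), would cause a division by zero.

```python
import math

def temporal_scores(adj, cos, new_nodes, a, b, beta=1.0):
    """Return (TCN, TAA, TRA) for the node pair (a, b).

    adj:       node -> set of neighbors in G_t (old nodes only)
    cos:       function (x, y) -> cosine similarity of the topic vectors
    new_nodes: nodes of V_{t+1} that are absent from V_t
    """
    if a in new_nodes or b in new_nodes:
        s = cos(a, b)  # no structure is available, so fall back to attributes
        return s, s, s

    def strength(c):
        # s(c): sum over the neighbors x of c of Cos(c, x)^beta
        return sum(cos(c, x) ** beta for x in adj[c])

    tcn = taa = tra = 0.0
    for c in adj[a] & adj[b]:
        w = cos(a, c) ** beta + cos(c, b) ** beta
        s_c = strength(c)
        tcn += w
        taa += w / math.log(1 + s_c)
        tra += w / s_c
    return tcn, taa, tra

# Toy data: nodes 1 and 2 share the neighbor 3, and every similarity equals 0.5.
adj = {1: {3}, 2: {3}, 3: {1, 2}}
cos = lambda x, y: 0.5
old_pair = temporal_scores(adj, cos, set(), 1, 2, beta=1.0)
new_pair = temporal_scores(adj, cos, {9}, 1, 9, beta=1.0)
```

For the old pair the single common neighbor has s(3) = 1, giving TCN = 1, TAA = 1/log 2 and TRA = 1; for the pair involving the new node 9 all three scores reduce to Cos(1, 9) = 0.5.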

In our previous work, we used the principles of structural equivalence and shortest path length to mine missing links in long-path static networks. Although the SESPL predictor proposed there cannot predict future links with new nodes in time-evolving networks, it is easy to extend. This feature prompts us to improve the algorithm further so that it can be applied to temporal link prediction. We therefore combine nodal attribute similarity and shortest path length in a predictor named ASSPL to predict future links when new nodes are present. The rationale is that the more similar the identities of two people are, the more likely they are to develop trust and cooperation. We define ASSPL as follows:

$\begin{equation}{S}_{ab}^{\mathrm{A}\mathrm{S}\mathrm{S}\mathrm{P}\mathrm{L}}=\begin{cases}\frac{\beta \ast \mathrm{SE}(a,b)+\mathrm{C}\mathrm{o}\mathrm{s}(a,b)}{{d}_{ab}-1}\quad \hfill & \quad \mathrm{i}\mathrm{f}\enspace a\enspace \mathrm{a}\mathrm{n}\mathrm{d}\enspace b\enspace \mathrm{a}\mathrm{r}\mathrm{e}\enspace \enspace \mathrm{n}\mathrm{o}\mathrm{t}\enspace \enspace \mathrm{n}\mathrm{e}\mathrm{w}\enspace \enspace \mathrm{n}\mathrm{o}\mathrm{d}\mathrm{e},\hfill \\ \mathrm{C}\mathrm{o}\mathrm{s}(a,b)\quad \hfill & \quad \mathrm{o}\mathrm{t}\mathrm{h}\mathrm{e}\mathrm{r}\mathrm{w}\mathrm{i}\mathrm{s}\mathrm{e}.\hfill \end{cases}\end{equation} \tag{ 11 }$
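Equation (11) can be sketched in Python as below. The sketch is illustrative and rests on several assumptions: the structural equivalence SE(a, b) and the cosine similarity are supplied as functions, the shortest path length is obtained by breadth-first search on G_t, the first case is taken to apply only when neither node is new, and pairs that are unreachable or already linked are given score 0. None of these conventions is stated in the text.

```python
from collections import deque

def shortest_path_length(adj, a, b):
    """Breadth-first search distance in an unweighted graph; None if b is unreachable."""
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in adj[node]:
            if nb == b:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None

def asspl(adj, se, cos, new_nodes, a, b, beta=1.0):
    """Equation (11): attribute similarity combined with SE and shortest path length."""
    if a in new_nodes or b in new_nodes:
        return cos(a, b)  # a new node has no structure in G_t
    d = shortest_path_length(adj, a, b)
    if d is None or d < 2:
        return 0.0  # assumption: unreachable or already-linked pairs score 0
    return (beta * se(a, b) + cos(a, b)) / (d - 1)

# Toy chain 1 - 2 - 3 - 4 with constant SE and cosine values.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
se = lambda x, y: 0.6
cos = lambda x, y: 0.4
score_old = asspl(adj, se, cos, set(), 1, 4, beta=2.0)  # (2 * 0.6 + 0.4) / (3 - 1)
score_new = asspl(adj, se, cos, {7}, 1, 7, beta=2.0)    # falls back to Cos(1, 7)
```

Larger β gives the structural term more weight relative to the attribute term for pairs of old nodes, while the score of a pair involving a new node does not depend on β.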

4. Results

4.1. The failure of static similarity methods

To examine the temporal link prediction task, we first apply static similarity methods, which are the best-known algorithms for mining missing links, to predict future links with new nodes in time-evolving networks. For unconnected node pairs that include new nodes, the static similarity methods assign a similarity score of zero. As expected, all static similarity algorithms fail to predict future links with new nodes in time-evolving funding networks (long-path networks), as shown in table 2: their AUC is close to 0.5, the value obtained by random guessing. In particular, the SESPL predictor, which is based on the principles of structural equivalence and shortest path length, exhibits extremely poor accuracy. However, table 2 also shows that all static similarity methods achieve nearly perfect predictive performance in the SCHOLAT network (short-path network). This may be because future links in the SCHOLAT network involve fewer new nodes than those in the funding networks. It may also be because link prediction is fundamentally easier in social networks [48]. Taken together, static similarity algorithms cannot deal with future links with new nodes in time-evolving networks.

Table 2. The performance of static similarity predictors measured by AUC. In this study, all the results are based on the average over 1000 independent runs of simulation.

Datasets	CN	AA	RA	SESPL
Funding0809	0.519	0.519	0.519	0.499
Funding0910	0.522	0.522	0.523	0.502
Funding1011	0.520	0.520	0.520	0.499
Funding1112	0.524	0.524	0.524	0.502
Funding1213	0.529	0.529	0.529	0.509
Funding1314	0.530	0.530	0.530	0.505
Funding1415	0.531	0.531	0.531	0.507
SCHOLAT	0.994	0.995	0.996	0.989

4.2. The highlight of temporal similarity methods

To mine future links with new nodes in time-evolving networks, we evaluate the predictive ability of our temporal similarity methods. To begin with, we consider nodal attributes only for future links between old nodes, and set the score of future links involving one or two new nodes to zero. We find that the predictors in table 3 show a slight gain in predictive accuracy in several networks compared with the results in table 2. This suggests that node attributes can improve the mining of missing links.

Table 3. The performance of temporal similarity predictors measured by AUC. Here, we compute the score of future links only with old nodes in the next year by using network structure and nodal attributes. Hence, the score of future links with new nodes is zero. In this table, the parameter β in all temporal similarity methods is 1.

Datasets	TCN	TAA	TRA	ASSPL
Funding0809	0.520	0.520	0.519	0.501
Funding0910	0.522	0.522	0.522	0.503
Funding1011	0.520	0.520	0.520	0.498
Funding1112	0.524	0.524	0.524	0.503
Funding1213	0.530	0.529	0.529	0.511
Funding1314	0.530	0.530	0.530	0.506
Funding1415	0.530	0.531	0.531	0.509
SCHOLAT	0.994	0.995	0.996	0.989

Next, to predict the future links with new nodes, we use nodal attributes to compute the similarity of node pairs involving new nodes. Table 4 is revealing in several ways. First, compared with table 2, temporal similarity methods achieve better performance than static similarity predictors. For example, on Funding0809 the AUC of ASSPL is 30.86% $\left(\frac{0.653-0.499}{0.499}\times 100\%\right)$ higher than that of SESPL. Second, compared with table 3, we find that using the attribute similarity of new nodes substantially improves the performance of link prediction in time-varying networks. Third, ASSPL is the best predictor for predicting future links with new nodes in academic funding networks. This means that the extended ASSPL is also highly effective and efficient in time-evolving long-path networks. Finally, while static similarity predictors can already successfully predict future links in the SCHOLAT network (table 2), temporal similarity methods utilizing nodal attributes can further improve the predictive performance. Taken together, these results indicate that temporal similarity methods can better deal with future links with new nodes. These results also raise an important question: is the prediction accuracy of temporal similarity methods enhanced mainly by the network structure or by the nodal attributes?
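The relative improvement of ASSPL over SESPL can be recomputed for each funding network from the AUC values in tables 2 and 4. The short Python sketch below does this; the variable and function names are hypothetical, and the numbers are copied from the two tables.

```python
# AUC values for the funding networks, copied from table 2 (SESPL) and table 4 (ASSPL).
sespl_auc = {
    "Funding0809": 0.499, "Funding0910": 0.502, "Funding1011": 0.499,
    "Funding1112": 0.502, "Funding1213": 0.509, "Funding1314": 0.505,
    "Funding1415": 0.507,
}
asspl_auc = {
    "Funding0809": 0.653, "Funding0910": 0.654, "Funding1011": 0.666,
    "Funding1112": 0.670, "Funding1213": 0.654, "Funding1314": 0.657,
    "Funding1415": 0.650,
}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old

gains = {name: relative_gain(asspl_auc[name], sespl_auc[name]) for name in sespl_auc}
for name, gain in gains.items():
    print(f"{name}: {100 * gain:.2f}%")
```

For Funding0809 this gives (0.653 − 0.499)/0.499, which is the 30.86% figure quoted in the text.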

Table 4. The performance of temporal similarity predictors measured by AUC. Here, we compute the score of all future links of the next year by using network structure and nodal attributes. Bold numbers indicate the best performance among all algorithms. The corresponding optimal parameters β of all temporal similarity methods are shown in table 5. Moreover, if we only consider the earliest 10% of new nodes of the next year, the AUC of ASSPL is also the best and can exceed 0.7 for funding networks.

Datasets	TCN	TAA	TRA	ASSPL
Funding0809	0.644	0.642	0.623	0.653
Funding0910	0.643	0.643	0.619	0.654
Funding1011	0.656	0.653	0.632	0.666
Funding1112	0.654	0.652	0.627	0.670
Funding1213	0.633	0.632	0.600	0.654
Funding1314	0.633	0.633	0.600	0.657
Funding1415	0.625	0.621	0.591	0.650
SCHOLAT	0.997	0.998	0.997	0.993
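
To make the comparison in table 4 easier to check, the sketch below transcribes the table and reports the best predictor per dataset, together with each predictor's mean AUC over the seven funding networks. The variable and function names are illustrative assumptions; only the numbers come from the table.

```python
# AUC values transcribed from table 4 (rows: datasets, columns: predictors).
METHODS = ["TCN", "TAA", "TRA", "ASSPL"]
TABLE4 = {
    "Funding0809": [0.644, 0.642, 0.623, 0.653],
    "Funding0910": [0.643, 0.643, 0.619, 0.654],
    "Funding1011": [0.656, 0.653, 0.632, 0.666],
    "Funding1112": [0.654, 0.652, 0.627, 0.670],
    "Funding1213": [0.633, 0.632, 0.600, 0.654],
    "Funding1314": [0.633, 0.633, 0.600, 0.657],
    "Funding1415": [0.625, 0.621, 0.591, 0.650],
    "SCHOLAT": [0.997, 0.998, 0.997, 0.993],
}


def best_method(row):
    """Return the name of the predictor with the highest AUC in one row."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return METHODS[best_index]


def funding_mean(method):
    """Mean AUC of one predictor over the funding networks only."""
    col = METHODS.index(method)
    values = [row[col] for name, row in TABLE4.items()
              if name.startswith("Funding")]
    return sum(values) / len(values)


best = {name: best_method(row) for name, row in TABLE4.items()}
means = {method: round(funding_mean(method), 3) for method in METHODS}

for name, method in best.items():
    print(f"{name}: {method}")
print(means)
```

On these values, ASSPL ranks first on every funding network, while TAA has the highest AUC on SCHOLAT, where all four predictors are within 0.005 of each other.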

Table 5. The optimal parameters β in all temporal similarity methods subject to the highest AUC.

Datasets	TCN	TAA	TRA	ASSPL
Funding0809	1.6	−1.0	1.0	3.0
Funding0910	−0.6	−1.0	0.8	2.8
Funding1011	1.0	−0.6	3.0	3.0
Funding1112	0.6	0.8	1.8	3.0
Funding1213	1.6	−1.0	1.8	2.6
Funding1314	0.4	−0.6	2.8	3.0
Funding1415	0.4	−0.8	3.0	3.0
SCHOLAT	0.6	2.4	2.0	1.6
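
The selection of β in table 5 amounts to a one-dimensional grid search that keeps the value with the highest AUC. The sketch below illustrates the idea under several assumptions that are not stated in this section: the grid runs from −1.0 to 3.0 in steps of 0.2 (consistent with the values in table 5), the AUC is computed exactly over all positive/negative pairs rather than by sampling, and the scoring rule `toy_score` is a placeholder, not the definition of ASSPL or of any other predictor in the paper.

```python
from itertools import product


def auc(pos_scores, neg_scores):
    """Exact pairwise AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = ties = 0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1
        elif p == n:
            ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))


def sweep_beta(score_fn, positives, negatives, betas):
    """Evaluate the AUC for every beta and return (best_beta, best_auc).

    Ties are resolved in favour of the first beta on the grid.
    """
    results = []
    for beta in betas:
        pos = [score_fn(pair, beta) for pair in positives]
        neg = [score_fn(pair, beta) for pair in negatives]
        results.append((auc(pos, neg), beta))
    best_auc, best_beta = max(results, key=lambda item: item[0])
    return best_beta, best_auc


# Assumed grid: -1.0, -0.8, ..., 3.0 (not stated in this section).
BETAS = [round(-1.0 + 0.2 * i, 1) for i in range(21)]


def toy_score(pair, beta):
    """Placeholder rule mixing a structural score and an attribute similarity.

    This linear form is hypothetical and only serves to exercise the sweep.
    """
    structure, attribute = pair
    return structure + beta * attribute


# Toy candidate links as (structural score, attribute similarity) pairs.
positives = [(0.2, 0.9), (0.5, 0.8)]   # links that appear next year
negatives = [(0.6, 0.1), (0.3, 0.2)]   # links that do not appear

best_beta, best_auc = sweep_beta(toy_score, positives, negatives, BETAS)
print(best_beta, best_auc)
```

In this toy example the attribute term must outweigh the misleading structural scores before the positives are ranked above the negatives, which is why the AUC rises with β and then saturates.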

To answer this question, we further analyze how network structure and nodal attributes affect the prediction performance by varying the parameter β. As shown in figure 4, the most obvious finding is that β has a major impact on the prediction accuracy of ASSPL in both the short-path network (SCHOLAT) and the long-path networks (funding networks). The performance of ASSPL increases monotonically with β and eventually reaches a stable level. This means that nodal attributes and network structure work well together for predicting future links with new nodes in time-evolving networks. However, when β < 1.5, β has almost the same impact on the prediction accuracy of TCN, TAA and TRA in both the short-path network and the long-path networks. When β > 1.5, the prediction accuracy of TCN, TAA and TRA in the short-path network shows a downward trend. This indicates that β may have an inhibitory effect on the prediction accuracy of TCN, TAA and TRA once it exceeds this threshold.

**Figure 4.** The performance of temporal similarity predictors measured by AUC as the parameter β varies. (a) Funding0809, (b) Funding1314, (c) Funding1415, and (d) SCHOLAT. Panels (a)–(c) are long-path networks and (d) is a short-path network. All results in this figure are averaged over 1000 independent simulation runs.

5. Conclusion and discussion

To summarize, we extend the structural equivalence and shortest path length hypotheses to predict future links between new nodes and old nodes in long-path, time-evolving networks. To solve the temporal link prediction problem with new nodes, we combine nodal attribute similarity and shortest path length into a single index, ASSPL. The results on the scholar social network and the academic funding networks show that ASSPL is highly effective and applicable in time-evolving funding networks. In addition, to study how network structure and nodal attributes affect the prediction accuracy of temporal similarity methods, we analyze the prediction performance under different values of the parameter β. We find that ASSPL obtains the best predictive performance when it takes advantage of both nodal attribute similarity and network structure in time-evolving networks. Meanwhile, our findings show that the more similar two funds are, the more likely they are to jointly fund the same research. More importantly, this finding will bring new insights into how funds can standardize their funding of research work.

In this work, we propose a new similarity-based framework that integrates network structure and nodal attributes for temporal link prediction with new nodes. One possible limitation is that the nodal attributes of most real networks cannot be utilized. In the future, we will take advantage of representation learning techniques to study link prediction tasks both with and without new nodes. In addition, we will consider time series that use only network structure to predict future links with new nodes.

Acknowledgments

We thank Leilin Feng for her help in collecting the funding datasets. This work is supported by the National Natural Science Foundation of China (61803047, 61603309), the Industry-University-Research Innovation Fund for Chinese Universities (No. 2021ALA03016), the Major Project of the National Social Science Foundation of China (19ZDA149, 19ZDA324), and the Fundamental Research Funds for the Central Universities (14370119, 14390110). Yijun Ran is supported by the China Scholarship Council (CSC No. 202006990042).

Data availability statement

The SCHOLAT dataset used in this study is available from the SCHOLAT website. The funding datasets used in this study are available from the corresponding author upon reasonable request.
