Abstract
Link prediction aims to predict the potential existence of links between two unconnected nodes within a network based on the known topological characteristics. Evaluation metrics are used to assess the effectiveness of algorithms in link prediction. The discriminating ability of these evaluation metrics is vitally important for accurately evaluating link prediction algorithms. In this study, we propose an artificial network model, based on which one can adjust a single parameter to monotonically and continuously tune the prediction accuracy of a specifically designed link prediction algorithm. Building upon this foundation, we present a framework that depicts the effectiveness of evaluation metrics by focusing on their discriminating ability. Specifically, we quantitatively compare the abilities of nine evaluation metrics to correctly discern varying prediction accuracies: Precision, Recall, F1-Measure, Matthews correlation coefficient, balanced precision, the area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPR), normalized discounted cumulative gain (NDCG), and the area under the magnified receiver operating characteristic curve (AUC-mROC). The results indicate that the discriminating abilities of three metrics, AUC, AUPR, and NDCG, are significantly higher than those of the other metrics.

1. Introduction
Link prediction represents a highly vibrant research direction within the realm of network science [1]. Over the past decade, numerous pioneering works in link prediction have emerged [1–5], and link prediction has been applied in various fields, such as life sciences, information security, social network analysis, and transportation planning [6–10]. In social media platforms, link prediction facilitates a more accurate comprehension and forecasting of relationship developments within social networks [11]. Link prediction techniques also enhance the precision and personalization of recommender systems [12], thereby increasing user stickiness [2]. In the domain of life sciences, a significant proportion of interactions remain unobserved. For instance, in protein-protein interaction networks, only about 20% of interactions among yeast proteins are known, and blind experimental validation of potential interactions would lead to substantial resource wastage. Conversely, link prediction techniques make it possible to prioritize probable interactions for subsequent validation, thereby significantly enhancing efficiency and reducing experimental costs [13–16].
To date, the majority of research endeavors in link prediction have centered on designing novel algorithms or refining existing ones. However, the effectiveness of these algorithms often depends on various factors. For instance, Jing et al found that understanding the inherent constraints imposed on data sets is crucial for evaluating the effectiveness of algorithms [17]. Similarly, Ran et al studied the effect of topological features on the theoretical limits of link prediction, providing valuable insights into how specific network topologies limit the effectiveness of prediction algorithms [18]. These studies underscore the importance of considering the intrinsic and topological characteristics of data when applying algorithms. At the same time, the careful selection of pertinent evaluation metrics for the accurate assessment of algorithm performance is an indispensable cornerstone of link prediction. However, researchers often arbitrarily choose classic metrics (e.g. the area under the receiver operating characteristic curve (AUC), balanced precision (BP), and the area under the precision-recall curve (AUPR)) without conducting a detailed analysis of the evaluation metrics themselves [1, 19–21]. Recently, several scientists have conducted critical reevaluations of this foundational issue, with a particular focus on commonly employed metrics such as Precision and AUC [22–24]. For instance, Yang et al [25] argued that, when evaluating link prediction performance, the precision-recall curve might provide better accuracy than AUC. Similarly, Austin [26] emphasized the need to reevaluate the sole reliance on AUC as a benchmark for model efficacy, especially in the prediction of species distribution. Additionally, Lobo et al [27] also questioned AUC and offered reasons against its use. In pursuit of better ways to evaluate algorithms for imbalanced classification and to characterize the differences between algorithms more finely, some researchers have introduced innovative evaluation metrics, such as the area under the magnified receiver operating characteristic curve (AUC-mROC) [28].
In this study, we introduce an artificial network model paired with a corresponding link prediction algorithm. Within this algorithm, we incorporate a parameter that adjusts the noise intensity, where increased noise reduces the algorithm's prediction accuracy. Consequently, we can use a single parameter to monotonically and continuously adjust the prediction accuracy of the algorithm. If a metric can accurately depict variances in prediction accuracy across diverse noise intensities, it has strong discriminating ability and can distinguish the pros and cons of different algorithms. This work advances the analytical framework established in our previous research [29], which concentrated solely on threshold-free metrics such as AUC, AUPR, and BP. The current work expands this scope to include additional metrics, namely normalized discounted cumulative gain (NDCG), AUC-mROC, Precision, Recall, F1-measure, and the Matthews correlation coefficient (MCC), with particular emphasis on the newly explored AUC-mROC. This extension not only enhances the applicability of our framework but also provides deeper insights into the discriminating abilities of these metrics in practical scenarios. This work helps to elucidate the current controversy and confusion within this domain and offers guidance for designing novel evaluation metrics. Furthermore, the findings derived from this work hold broader implications and provide valuable references for addressing more general classification challenges.
2. Problem description
Let $G(V, E)$ denote a network, where V represents the set of nodes and E represents the set of links. A link is drawn between two nodes if there exists a certain relationship or interaction between them [30]. For instance, in social networks, users can be represented as nodes, and friendships as connecting links. This study considers the simplest type of networks, ignoring the weight and directionality of links, and disallowing multiple links and self-connected links. Denote the size of G as the cardinality of its node set, say $N = |V|$, and the set of all potential links within G as U. Evidently, $|U| = N(N-1)/2$. There may be some links in $U - E$ that exist but have not yet been observed. For instance, in biological networks, numerous interactions remain undiscovered, commonly referred to as missing links. Alternatively, with the evolution of networks, new links may appear, often termed future links. The primary objective of link prediction algorithms is to predict, based on the observed link set E, which among the potential links in $U - E$ are likely to be missing links or future links.
To facilitate the training of models and the validation of algorithms, we partition the observed link set E into a training set $E^T$ and a testing set $E^P$, ensuring that $E^T \cup E^P = E$ and $E^T \cap E^P = \emptyset$. The links in the training set are considered known, whereas those in the testing set are regarded as unknown and serve as the basis for algorithmic validation. Evidently, an efficacious algorithm should identify links in $E^P$ as having higher likelihoods of being either missing links or future links among all presumed unknown links $U - E^T$. In practice, when forecasting missing links, the subset $E^P$ is typically constituted by randomly selecting links from E, and when predicting future links, $E^P$ is often constituted by choosing links that appear later. This study predominantly focuses on the former scenario. It is noteworthy that link prediction is also a typical binary classification problem, so most evaluation metrics tailored for binary classification problems can be seamlessly adapted to evaluate the efficacy of link prediction algorithms.
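To make this setup concrete, the following minimal sketch (our own illustration, not code from the paper; names such as split_edges and test_fraction are hypothetical) randomly partitions an observed edge list into $E^T$ and $E^P$ for the missing-link scenario.

```python
import random

def split_edges(edges, test_fraction=0.1, seed=0):
    """Randomly partition the observed link set E into a training set E^T
    and a testing set E^P, with E^T union E^P = E and E^T intersect E^P empty."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    test_set = set(shuffled[:n_test])    # E^P: links treated as unknown
    train_set = set(shuffled[n_test:])   # E^T: links treated as known
    return train_set, test_set

# Toy example on five nodes
E = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (1, 4), (0, 4), (2, 4)]
E_T, E_P = split_edges(E, test_fraction=0.25)
```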
3. Evaluation metrics
Evaluation metrics can be broadly categorized into threshold-dependent metrics and threshold-free metrics. Threshold-dependent metrics produce results that are contingent on the chosen threshold parameters, whereas threshold-free metrics are independent of any threshold parameters. Given that the selection of thresholds often entails an ad hoc approach, which may not be universally applicable, threshold-dependent evaluation metrics are frequently perceived as lacking in persuasiveness for addressing generalized problems. In contrast, threshold-free metrics are generally favored, unless the selection of thresholds is tied to the specific problem rather than being arbitrarily designated by researchers. This work considers nine evaluation metrics, including four threshold-dependent metrics and five threshold-free metrics.
3.1. Threshold-dependent metrics
Common threshold-dependent metrics include Precision@k [31], Recall@k [25], F1-Measure [32], and MCC [33]. Before introducing these specific metrics, we first review the confusion matrix in binary classification problems. Within the confusion matrix, all samples are classified into four categories based on whether they are positive samples (corresponding to missing links $E^P$ in link prediction) or negative samples (corresponding to non-existent links $U - E$ in link prediction), and whether they are correctly predicted. These four categories are: true positive (TP), where a positive sample is correctly predicted as positive; false positive (FP), where a negative sample is incorrectly predicted as positive; true negative (TN), where a negative sample is correctly predicted as negative; and false negative (FN), where a positive sample is incorrectly predicted as negative.
Precision is defined as the fraction of samples predicted as positive by the algorithm that are indeed positive. Without loss of generality, a link prediction algorithm can rank all links in the set $U - E^T$ in descending order of their likelihoods of being missing links. The algorithm may then consider the top-k links as missing links (positive samples), while the remaining links are considered non-existent links (negative samples). In this context, the parameter k serves as a typical threshold parameter. Selecting the top-k links as predicted missing links is equivalent to setting a likelihood threshold and considering links with likelihood scores higher than this threshold as predicted missing links. Once k is determined, precision is calculated as the proportion of potential links ranked within the top k that are indeed missing links, as
$$\mathrm{Precision@}k = \frac{\mathrm{TP}(k)}{\mathrm{TP}(k) + \mathrm{FP}(k)} = \frac{\mathrm{TP}(k)}{k}, \qquad (1)$$
where $\mathrm{TP}(k)$ and $\mathrm{FP}(k)$ respectively refer to the number of missing links and non-existent links among the top-k potential links as ranked by the algorithm.
Recall is defined as the proportion of positive samples that are correctly predicted as positive by the algorithm. Clearly, given a specific algorithm, as k increases, Recall@k is monotonically non-decreasing. When $k = |U - E^T|$, i.e. the algorithm predicts all potential links as missing links, $\mathrm{Recall@}k = 1$. The formula to compute Recall@k is as follows:
$$\mathrm{Recall@}k = \frac{\mathrm{TP}(k)}{|E^P|}. \qquad (2)$$
As the parameter k varies, Precision and Recall generally exhibit inverse trends. To balance the influence of both metrics and provide a more holistic evaluation of algorithm performance, the F1-Measure computes the harmonic mean of Precision and Recall, as
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \qquad (3)$$
For the sake of clarity, we omit the threshold indicator @k in equation (3) and some later equations, given that no ambiguity will be introduced. The values of Precision, Recall and F1-Measure all lie in the interval [0, 1].
MCC is utilized to depict the correlation between actual outcomes and predicted results, taking into account the values of TP, FP, TN, and FN. Due to its balanced nature, even when there is a significant disparity in the sample sizes between the two classes, MCC can effectively reflect the algorithm's performance. The formula for MCC is as follows:
$$\mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}. \qquad (4)$$
The range of MCC is $[-1, 1]$, where $\mathrm{MCC} = 1$ indicates perfect prediction (corresponding to $\mathrm{FP} = \mathrm{FN} = 0$). Conversely, $\mathrm{MCC} = -1$ signifies entirely erroneous predictions. Random classification corresponds to $\mathrm{MCC} = 0$. According to equation (4), TP, FP, TN, and FN carry equal importance, so that even if the positive and negative samples are interchanged, the value of MCC remains unchanged, underscoring its symmetric nature.
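As a concrete illustration (our own sketch, not from the paper; it assumes a candidate list already ranked by predicted likelihood and the set of missing links $E^P$), the four threshold-dependent metrics can be computed from the confusion-matrix counts as follows.

```python
import math

def threshold_metrics(ranked_candidates, missing_links, k):
    """Compute Precision@k, Recall@k, F1@k and MCC for a top-k prediction.

    ranked_candidates: all links in U - E^T, sorted by predicted likelihood (descending).
    missing_links: the set E^P of positive samples.
    """
    top_k = ranked_candidates[:k]
    tp = sum(1 for link in top_k if link in missing_links)  # missing links in the top k
    fp = k - tp                                             # non-existent links in the top k
    fn = len(missing_links) - tp                            # missing links ranked below k
    tn = len(ranked_candidates) - k - fn                    # non-existent links ranked below k

    precision = tp / k                                      # equation (1)
    recall = tp / len(missing_links)                        # equation (2)
    f1 = (0.0 if precision + recall == 0
          else 2 * precision * recall / (precision + recall))  # equation (3)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = 0.0 if denom == 0 else (tp * tn - fp * fn) / denom    # equation (4)
    return precision, recall, f1, mcc
```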
3.2. Threshold-free metrics
Threshold-free metrics require no effort to determine appropriate thresholds, and they also circumvent the issue that different thresholds may lead to different winners. Well-known threshold-free metrics include BP [31], AUC [34], AUPR [35], and NDCG [36]. This work will also analyze a recently proposed metric called AUC-mROC [28].
BP represents the intersection of the Precision@k and Recall@k curves, obtained when the threshold k equals the size of the testing set (i.e. $k = |E^P|$), as
$$\mathrm{BP} = \mathrm{Precision@}k\big|_{k = |E^P|} = \mathrm{Recall@}k\big|_{k = |E^P|}. \qquad (5)$$
AUC represents the area under the receiver operating characteristic (ROC) curve. For each threshold k, there exists a corresponding point on the ROC curve. As shown in figure 1(a), the x-coordinate of this point is the false positive rate at k, denoted as $\mathrm{FPR}(k)$, and the y-coordinate is the true positive rate at k, denoted as $\mathrm{TPR}(k)$. By varying the threshold k from small to large values, the ROC curve is obtained. For the specific task of link prediction, there is a more straightforward method to plot the ROC curve. We first sort all potential links based on their predicted likelihoods in descending order, then we start from the origin and sequentially scan these potential links. If a missing link is encountered, we move upwards by $1/|E^P|$, while if a non-existent link is encountered, we move right by $1/(|U| - |E|)$. Upon completing this scan, the ROC curve is obtained, spanning from $(0, 0)$ to $(1, 1)$. If the likelihoods of potential links are entirely randomly assigned, the ROC curve would approximate the diagonal, with an AUC close to 0.5. Generally, an AUC value provided by an algorithm will range between 0.5 and 1; a value closer to 1 indicates better prediction performance.
Figure 1. Illustration of (a) AUC, (b) AUPR and (c) AUC-mROC.
The value of AUC can be intuitively interpreted as the probability that a randomly chosen positive sample (missing link) has a higher predicted likelihood than a randomly chosen negative sample (non-existent link). This intuitive interpretation is a distinct advantage of the AUC metric. We assume that in the sorted sequence of potential links, the positions of the missing links are $r_1 < r_2 < \cdots < r_{|E^P|}$. Then, prior to the ith missing link, there are $r_i - i$ non-existent links. Therefore, when comparing the ith missing link to all non-existent links, it will lose against $r_i - i$ of them. In other words, its winning probability is $1 - (r_i - i)/n$. Accordingly, the AUC can be calculated by averaging these winning probabilities across all missing links, as
$$\mathrm{AUC} = \frac{1}{|E^P|} \sum_{i=1}^{|E^P|} \left(1 - \frac{r_i - i}{n}\right), \qquad (6)$$
where $n = |U| - |E|$ denotes the number of non-existent links. Given that most real networks are sparse [37], i.e. $|E| \ll |U|$, it can be approximated as:
$$\mathrm{AUC} \approx \frac{1}{|E^P|} \sum_{i=1}^{|E^P|} \left(1 - \frac{r_i}{|U - E^T|}\right), \qquad (7)$$
which is also known as the ranking score in some previous literature [38]. AUPR represents the area under the precision-recall curve. The PR curve is constructed by plotting Precision (on the y-axis) against Recall (on the x-axis) for various threshold values (see figure 1(b)). For a given threshold k, the corresponding point on the PR curve is $(\mathrm{Recall@}k, \mathrm{Precision@}k)$. When k takes its maximum value $|U - E^T|$, the PR curve ends at the point $(1, |E^P|/|U - E^T|)$. Similar to equation (6), AUPR can be expressed as:
$$\mathrm{AUPR} = \frac{1}{|E^P|} \sum_{i=1}^{|E^P|} P_i, \qquad (8)$$
where $P_i = i/r_i$ is defined as the precision at the threshold $k = r_i$.
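The rank-based expressions above translate directly into code. The sketch below (our own illustration; function names are hypothetical) takes the 1-based positions $r_1 < r_2 < \cdots$ of the missing links in the sorted candidate list and returns the AUC of equation (6) and the AUPR of equation (8).

```python
def auc_from_ranks(ranks, n_candidates, n_missing):
    """Equation (6): average winning probability of each missing link.

    ranks: sorted 1-based positions r_1 < r_2 < ... of the missing links.
    n_candidates: |U - E^T|, the total number of scored potential links.
    n_missing: |E^P|, the number of missing links.
    """
    n_negative = n_candidates - n_missing  # number of non-existent links
    wins = [1.0 - (r - i) / n_negative for i, r in enumerate(ranks, start=1)]
    return sum(wins) / n_missing

def aupr_from_ranks(ranks, n_missing):
    """Equation (8): average of the precisions P_i = i / r_i over all missing links."""
    return sum(i / r for i, r in enumerate(ranks, start=1)) / n_missing
```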
Discounted cumulative gain (DCG) considers that the importance of positions in the ranking of potential links is not uniform. If an algorithm ranks missing links higher up in the list, it receives a higher score. Conversely, if these missing links are ranked lower, their scores are discounted. Specifically, DCG employs a logarithmic discounting mechanism, as
$$\mathrm{DCG} = \sum_{i=1}^{|E^P|} \frac{1}{\log_2(r_i + 1)}. \qquad (9)$$
Note that equations (7) and (9) are quite similar. However, in the approximate definition of AUC, the contribution of a missing link ranked at position r is $1 - r/|U - E^T|$, while in the definition of DCG, the contribution of a missing link at position r is $1/\log_2(r + 1)$. Clearly, as r increases, the corresponding contribution in DCG diminishes more rapidly. For instance, if there are a total of 10 000 samples to be predicted, a positive sample at the top rank contributes a score of 1 to both AUC and DCG. However, a positive sample ranked at r = 5000 contributes a score of approximately 0.5 to AUC but only about 0.08 to DCG. While DCG can be used to compare algorithm performances, its absolute value lacks meaning. To address this challenge and enable cross-dataset comparisons, normalization can be implemented by dividing by the maximum possible value of DCG. Clearly, when all missing links are precisely ranked in the top positions, DCG attains its maximum possible value. Consequently, the corresponding NDCG is given by
$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{DCG}_{\max}} = \frac{\sum_{i=1}^{|E^P|} 1/\log_2(r_i + 1)}{\sum_{i=1}^{|E^P|} 1/\log_2(i + 1)}. \qquad (10)$$
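A minimal sketch of equations (9) and (10) (our own, reusing the rank positions introduced above):

```python
import math

def ndcg_from_ranks(ranks, n_missing):
    """Equation (10): DCG of the actual ranking divided by the maximum DCG,
    which is attained when all missing links occupy the top positions."""
    dcg = sum(1.0 / math.log2(r + 1) for r in ranks)                    # equation (9)
    ideal_dcg = sum(1.0 / math.log2(i + 1) for i in range(1, n_missing + 1))
    return dcg / ideal_dcg
```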
The AUC-mROC [28] applies the idea of NDCG to modify AUC. Specifically, it transforms both axes of the ROC curve using logarithmic transformations: the horizontal and vertical coordinates are obtained by logarithmically rescaling the cumulative numbers of false positives and true positives, respectively, so that the top of the ranking list is magnified. The AUC-mROC represents the area under this transformed curve, as shown in figure 1(c).
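For the precise definition of the transformation, readers should consult [28]; the rough sketch below only illustrates one possible reading, in which the cumulative counts are rescaled as $\log(1+\mathrm{FP})/\log(1+N^-)$ and $\log(1+\mathrm{TP})/\log(1+N^+)$ ($N^-$ and $N^+$ being the numbers of negative and positive samples) and the area is integrated with the trapezoidal rule.

```python
import math

def auc_mroc_sketch(ranked_candidates, missing_links):
    """Area under a log-rescaled ('magnified') ROC staircase.

    This is an assumed reading of AUC-mROC for illustration only; the
    authoritative definition is given in the original paper [28].
    """
    n_pos = len(missing_links)
    n_neg = len(ranked_candidates) - n_pos
    tp = fp = 0
    xs, ys = [0.0], [0.0]
    for link in ranked_candidates:           # scan the sorted prediction list
        if link in missing_links:
            tp += 1
        else:
            fp += 1
        xs.append(math.log1p(fp) / math.log1p(n_neg))
        ys.append(math.log1p(tp) / math.log1p(n_pos))
    # trapezoidal integration over the transformed curve
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2.0
               for i in range(1, len(xs)))
```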
4. Discriminating ability
In this study, a simple method is adopted to generate an artificial network [39]. For any node pair (i, j) in G, with $i, j \in V$ and $i \neq j$, the likelihood of forming a link is denoted as $q_{ij}$. These likelihood values are independently generated from a uniform distribution $U(0, \alpha)$, where the parameter $\alpha$ can be utilized to control the linking density. Once all the likelihood values are generated, links between node pairs are established or not based on their corresponding likelihoods. For instance, if the likelihood value for a link is q, then the probability of establishing this link is q, and the probability of not establishing the link is $1 - q$. In this model, if all likelihoods are known, the optimal prediction algorithm would set the likelihood for (i, j) as $q_{ij}$ itself [39].
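The generation process can be sketched as follows (our own illustration; `alpha` stands for the upper bound of the uniform distribution, and the value used below is only a placeholder, not the value reported in section 5).

```python
import numpy as np

def generate_network(n_nodes, alpha=0.5, seed=0):
    """Draw q_ij ~ U(0, alpha) for every node pair and create each link
    independently with probability q_ij."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_nodes, n_nodes))
    adjacency = np.zeros((n_nodes, n_nodes), dtype=int)
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            q[i, j] = q[j, i] = rng.uniform(0.0, alpha)
            if rng.random() < q[i, j]:        # establish the link with probability q_ij
                adjacency[i, j] = adjacency[j, i] = 1
    return q, adjacency

# The optimal (noise-free) predictor simply scores each candidate pair by q_ij.
```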
By introducing a parameter η that depicts the noise, we can continuously and monotonically adjust the prediction accuracy of the algorithm. Let $\Omega_\eta$ represent an algorithm with noise η. The likelihood value $s_{ij}$ provided by this algorithm for any node pair (i, j) in $U - E^T$ is $s_{ij} = q_{ij} + n_{ij}$, where the noise term $n_{ij}$ is sampled from a uniform distribution of width η. Clearly, as η increases, the prediction accuracy of the algorithm decreases. Given an evaluation metric M that operates on the algorithm, if there are two noise parameters η1 and η2 with $\eta_1 < \eta_2$, then $M(\Omega_{\eta_1}) > M(\Omega_{\eta_2})$ should hold. If in an experiment the metric M indeed satisfies $M(\Omega_{\eta_1}) > M(\Omega_{\eta_2})$, we say it correctly distinguishes the performance difference of the algorithms; otherwise, M fails to do so. Clearly, the smaller $\eta_2 - \eta_1$ is, the greater the probability of an incorrect judgment. Suppose $\eta_1 < \eta_2$, and in X independent comparisons there are x comparisons with the result $M(\Omega_{\eta_1}) \leqslant M(\Omega_{\eta_2})$. Then, the p-value for the noise parameters $(\eta_1, \eta_2)$ given M is defined as $p_M(\eta_1, \eta_2) = x/X$. When $p_M(\eta_1, \eta_2)$ is less than a pre-defined significance level $p^*$, we conclude that the metric M is capable of distinguishing between the algorithms $\Omega_{\eta_1}$ and $\Omega_{\eta_2}$. Setting $P_{\eta_1 \eta_2} = p_M(\eta_1, \eta_2)$ if $\eta_1 < \eta_2$, and $P_{\eta_1 \eta_2} = P_{\eta_2 \eta_1}$ if $\eta_1 > \eta_2$, the resulting matrix is symmetric. This matrix is referred to as the discrimination matrix, denoted as P. By contrasting the P-matrices corresponding to different evaluation metrics, we can intuitively assess the discrimination abilities of different evaluation metrics [29].
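Putting the pieces together, the sketch below (our own; it assumes an additive noise term drawn from $U(0, \eta)$, which is one concrete choice of a width-η uniform distribution, and a metric callable such as a wrapper around the rank-based AUC above) estimates $p_M(\eta_1, \eta_2)$ as the fraction of independent comparisons in which the noisier algorithm is not scored lower.

```python
import numpy as np

def noisy_scores(q, candidates, eta, rng):
    """Algorithm Omega_eta: perturb the true likelihoods with uniform noise of width eta."""
    return {pair: q[pair] + rng.uniform(0.0, eta) for pair in candidates}

def estimate_p_value(metric, q, candidates, missing_links, eta1, eta2,
                     n_runs=100, seed=0):
    """Fraction of runs with metric(Omega_eta1) <= metric(Omega_eta2), where eta1 < eta2.

    metric: callable taking (ranked_candidates, missing_links) and returning a score.
    q: matrix of true likelihoods; candidates: list of node pairs in U - E^T.
    """
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(n_runs):
        values = []
        for eta in (eta1, eta2):
            scores = noisy_scores(q, candidates, eta, rng)
            ranked = sorted(candidates, key=scores.get, reverse=True)
            values.append(metric(ranked, missing_links))
        if values[0] <= values[1]:
            failures += 1
    return failures / n_runs
```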
5. Results
In the simulation, we set the number of nodes in the network to N = 1000, with fixed values for the parameter α of the uniform distribution and for the proportion of links assigned to the testing set. For threshold-dependent metrics, we assume that the predicted links are the top-k links ranked by the value $s_{ij}$, and we mainly demonstrate three representative choices of the threshold k. For each set of parameters, we randomly generate ten networks and independently run 100 simulations for each network. Due to the simplicity of the model and the independence between the likelihoods, as long as $N^2$ is sufficiently large and the linking density is neither too small nor too large, the results are similar. Figure 2 shows how the values of the different metrics change as the noise intensity increases. It can be seen that the values of all metrics decrease overall as the noise increases, but some metrics fluctuate strongly. Intuitively, metrics with large fluctuations have difficulty distinguishing the performance of algorithms whose accuracies differ only slightly. In figure 2, it can be seen that the overall fluctuation of a threshold-dependent metric is generally greater than that of AUC, AUPR, or NDCG. Among all metrics, AUC-mROC has the largest fluctuation.
Figure 2. How the value of each metric varies with changing noise. (a)–(d), (e)–(h) and (i)–(l) respectively represent the results when the thresholds for Precision, Recall, F1-Measure, and MCC are set to the three considered values of k. (m)–(p) depict the results for AUC, AUPR, NDCG, and AUC-mROC. The gray points represent the values obtained in single runs, the red points represent the average values at given noise intensities, and the error bars indicate the corresponding standard deviations.
Figure 3 presents the outcomes from 1000 runs at noise intensities of η = 0.1, η = 0.3, η = 0.5, η = 0.7, and η = 0.9. Clearly, if there are noticeable gaps between the curves of different noise intensities for a specific metric, it indicates that the metric can differentiate the performance of algorithms under various noise intensities. In figure 3, it is evident that AUC, AUPR, and NDCG can effectively distinguish between adjacent noise intensities (i.e. a difference of 0.2). On the other hand, threshold-dependent metrics find it challenging to differentiate between neighboring noise intensities. In particular, AUC-mROC struggles to discern algorithms with close noise intensities.
Figure 3. The values of the evaluation metrics under different noise intensities, where η = 0.1, η = 0.3, η = 0.5, η = 0.7 and η = 0.9. The x-axis represents the run index, and the y-axis represents the values of the evaluation metrics over the 1000 runs. (a)–(d), (e)–(h) and (i)–(l) represent the results when the thresholds for Precision, Recall, F1-Measure, and MCC are set to the three considered values of k. (m)–(p) denote the results for AUC, AUPR, NDCG, and AUC-mROC.
To provide a more intuitive comparison of the discrimination abilities of different metrics, we fix a significance level $p^*$. Only when $p_M(\eta_1, \eta_2) < p^*$ do we consider the current metric capable of distinguishing between the noise intensities η1 and η2. Figure 4 displays the binarized discrimination matrix. The colored areas represent elements for which $p_M(\eta_1, \eta_2) < p^*$, while the white areas represent elements for which $p_M(\eta_1, \eta_2) \geqslant p^*$. Clearly, a larger colored area (indicating the distinguishable region) suggests a stronger discrimination ability for the corresponding evaluation metric. Figure 4 shows that AUC, AUPR, and NDCG exhibit significantly superior discrimination abilities compared with the other metrics.
Figure 4. The binarized discrimination matrices of different evaluation metrics. The x-axis and y-axis represent the intensity of noise. (a)–(d), (e)–(h) and (i)–(l) respectively depict the results when the thresholds for Precision, Recall, F1-Measure, and MCC are set to the three considered values of k. (m)–(p) illustrate the outcomes for AUC, AUPR, NDCG, and AUC-mROC.
6. Discussion
We conduct a study on the discriminating abilities of evaluation metrics in link prediction on artificial networks. In a scenario where the likelihoods of links in the network are known, we devise a straightforward algorithm whose accuracy can be regulated by adjusting a single parameter: the intensity of noise. By examining whether the evaluation metrics can accurately discern the performance differences of algorithms under different noise intensities, we can measure the discriminating ability of these metrics. Through observations of the magnitude of fluctuations in the metric values at given noise levels and of the discrimination matrix composed of p-values, we found that the discriminating ability of AUC, AUPR, and NDCG is notably superior to that of other metrics, including the commonly used BP (corresponding to Precision@k for $k = |E^P|$) and the recently proposed AUC-mROC. In addition to the uniform distribution, we have also tested some other distributions, obtaining results consistent with our main finding. In figure 5, we show an example corresponding to a Gaussian distribution of $q_{ij}$, with mean 0.25 and standard deviation 0.1 (values smaller than 0 are reset to 0, and values larger than 1 are reset to 1). Still, AUC, AUPR, and NDCG perform best, although AUC becomes slightly less discriminative than AUPR and NDCG.
Figure 5. The binarized discrimination matrices of different evaluation metrics, generated using a Gaussian distribution with a mean of 0.25 and a standard deviation of 0.1 for the $q_{ij}$ values. The x-axis and y-axis represent the intensity of noise. (a)–(d), (e)–(h) and (i)–(l) respectively depict the results when the thresholds for Precision, Recall, F1-Measure, and MCC are set to the three considered values of k. (m)–(p) illustrate the outcomes for AUC, AUPR, NDCG, and AUC-mROC.
Link prediction in sparse networks is a classic imbalanced binary classification problem, where positive samples (i.e. missing links) are scarce while negative samples (i.e. non-existent links) are abundant. In such cases, finding as many positive samples as possible within a limited number of attempts is more crucial than merely placing positive samples at relatively higher positions than negative samples. Both AUC-mROC and NDCG address this issue by assigning higher weights to the top positions of the prediction list. However, the discriminating ability of AUC-mROC is found to be poor in the artificial networks studied in this paper. One potential reason could be the transformations applied to both coordinates of the ROC curve, which make AUC-mROC highly sensitive to the accurate prediction of the initial entries in the prediction list. Given that $q_{ij}$ in this study was independently generated from a uniform distribution, the average likelihood difference between positive and negative samples is relatively small, presumably much smaller than that in real-world networks. This implies that even with zero noise, the accuracy of the initial predictions made by the $\Omega$ algorithm would not be exceptionally high. Therefore, if an evaluation metric is particularly sensitive to the accuracy of predictions at the very beginning of the list, its discriminating ability in this study would not be high. Analogously, in a recent work, Bi et al [40] used hundreds of real-world networks to examine the consistency of evaluation metrics, finding that AUC-mROC produces rankings of algorithms far different from those produced by other metrics. To further elucidate the discriminating abilities of weighted metrics, including AUC-mROC, it is essential to employ more complex generation processes that capture the topological features of real networks, or to use real networks and commonly used algorithms.
Figures 4 and 5 indicate a notable trend whereby higher thresholds typically enhance the discriminating abilities of metrics. This is because, when the threshold is small, the value of an evaluation metric depends only on a few top-predicted links. This is also the reason for the poor discriminating ability of AUC-mROC, as discussed above. It again indicates that the performance of these metrics is highly contingent on the threshold levels used. However, the process of selecting an optimal threshold is often arbitrary and lacks a standardized approach, which can lead to inconsistencies in performance assessments across different studies. Our findings emphasize the necessity of developing robust guidelines for threshold selection to ensure consistent and reliable evaluations of link prediction algorithms.
Finally, we strongly recommend that readers pay close attention to NDCG. While the discriminating ability of this metric is close to that of AUC and AUPR, it has seldom been utilized in previous studies of link prediction. As demonstrated above, NDCG assigns higher weights to the rankings at the forefront of the list in a logarithmic manner (albeit more conservatively than AUC-mROC), partially reflecting the demands of imbalanced classification. Therefore, we suggest that readers consider NDCG as an alternative evaluation metric. Furthermore, it is essential to highlight the limited discriminating ability of BP. Given its intuitive simplicity, BP still holds some reference value; however, relying solely on this metric to rank candidate algorithms might undermine the credibility of the conclusions drawn.
Data availability statement
The data cannot be made publicly available upon publication because no suitable repository exists for hosting data in this field of study. The data that support the findings of this study are available upon reasonable request from the authors.