Common neighbours and the local-community-paradigm for topological link prediction in bipartite networks

Bipartite networks are powerful descriptions of complex systems characterized by two different classes of nodes and connections allowed only across the two classes. For instance, the connections between workers and their employers, or between electors and the parties they vote for, are examples of affiliation networks in social analysis. Ultimately, predicting interactions between products and consumers in personal recommendation systems and market models can provide priceless information for managing cyber-commerce. Surprisingly, current complex network theory presents a theoretical bottleneck: a general framework for local-based link prediction directly in the bipartite domain is missing. Indeed, local state-of-the-art methods for link prediction do not directly exploit the inner bipartite topology, but rather rely on its projection into two one-mode-dimension networks, examples of which are the monopartite network of consumers connected by products and the monopartite network of products connected by consumers. Unfortunately, the one-mode-projections are always less informative than the original bipartite structure. Here, we overcome this theoretical obstacle, and we present a formal definition of common neighbour index (CN) and local-community-paradigm (LCP) for bipartite networks. As a consequence, we are able to introduce the first node-neighbourhood-based and LCP-based models for topological link prediction that utilize the bipartite domain. We performed link prediction evaluations in several networks of different size and of disparate origin, including technological, social and biological systems. Our models significantly improve topological prediction in many bipartite networks, and represent the first attempt to create a local-based formalism that allows link prediction to be implemented intuitively and fully in the bipartite domain.

Bipartite networks are powerful descriptions of complex systems characterized by two different classes of nodes and connections allowed only across the two classes. Several coupled systems at diverse scales in science, such as plant/pollinator in ecological networks or drug/target in molecular interactomes, can be approximated using bipartite network topology. Bipartite graphs also emerge naturally in many applicative domains. For instance, the connections between workers and their employers, or between electors and the parties they vote for, are examples of affiliation networks in social analysis. Ultimately, predicting interactions between products and consumers in personal recommendation systems and market models can provide priceless information for managing cybercommerce. Surprisingly, current complex network theory presents a theoretical bottleneck: a general framework for local-based link prediction directly in the bipartite domain is missing 1. Indeed, local state-of-the-art methods for link prediction do not directly exploit the inner bipartite topology 2,3, but rather rely on its projection into two one-mode-dimension networks, examples of which are the monopartite network of consumers connected by products and the monopartite network of products connected by consumers. Unfortunately, the one-mode-projections are always less informative than the original bipartite structure 4. Here, we overcome this theoretical obstacle, and we present a formal definition of common neighbour index 5 (CN) and local-community-paradigm 6 (LCP) for bipartite networks. As a consequence, we are able to introduce the first node-neighbourhood-based and LCP-based models for topological link prediction that utilize the bipartite domain. We performed link prediction evaluations in several networks of different size and of disparate origin, including technological, social and biological systems. Our models significantly improve topological prediction in many bipartite networks, and represent the first attempt to create a local-based formalism that allows link prediction to be implemented intuitively and fully in the bipartite domain.
The theory of link prediction based merely on the connectivity structure of a network (a.k.a. topological link prediction) is an extremely developed branch of network science for monopartite undirected networks 7. Given a monopartite undirected network G(V,E), where V is the set of observed nodes and E is the set of observed links, the topological link prediction problem consists in estimating the likelihood that each nonobserved link (missing link) between the observed nodes exists. The ranking of the nonobserved links in order of decreasing likelihood values represents a list of candidate links to be employed for modelling the network growth.
Indeed, the extent to which the process of network formation is explicable by a method coincides with the method's capability to predict missing links 8. There are two main families of methods: model-based and model-learning. The first approach assumes that a well-characterized physical mechanism is the primary driving force behind the network organization 8, and is based on an explicit deterministic model that simulates such a mechanism. The second approach does not make a priori assumptions about any specific organization principle of the network topology, and instead relies on implicit stochastic model-learning: it provides at each step a different solution that should converge to the hidden network behaviour in a sufficient number of iterations.
Elaborate methods with several parameters to tune have been introduced 6 in the class of stochastic models. Nevertheless, the majority of these gracefully designed techniques present drawbacks 6. Apart from the problem of tuning the parameters in an unsupervised framework, a severe limitation is their high computational time, which in practice restricts their application to networks of small dimensions (no more than a few hundred nodes), in comparison to the large networks used in real problems. For these reasons, methods based on explicit models represent an efficient and parameter-free alternative widely employed in many applicative domains, and the node-neighbourhood-based (local-based) indices are the most relevant among them. The rationale behind these methods is that the likelihood of an interaction between two non-adjacent nodes is strongly related to mechanisms of organization involving their first neighbour nodes.
The common neighbours (CN) index, which is the precursor of these methods, follows the intuition that the likelihood that two seed nodes x and y interact increases if their sets of first-node-neighbours overlap substantially: the larger the number of common neighbours, the higher the likelihood that the two seed nodes interact. The network organization model behind the CN index is named triadic closure 9 (Fig. 1A). Many other node-neighbourhood-based indices can be interpreted as a variation or generalisation of CN: Jaccard's index (JC) is a normalisation of CN, while the Adamic & Adar (AA) and Resource Allocation (RA) indices give more importance to CNs with low degree 6. CN and all derived measures following the triadic closure principle are very powerful topology-based methods for link prediction in many monopartite undirected real networks from various fields 7. However, it has been shown that, due to the specific properties of bipartite networks 9, all topological methods based on the triangle closing model cannot work in the bipartite case 1. Thus, at the moment, both the CN index and all the related variations (JC, AA and RA) are not defined for bipartite networks. On the other hand, among the node-neighbourhood-based link prediction methods, the preferential attachment 5,10 (PA) model can be used in bipartite networks, and has been shown to perform better than various algebraic (e.g. matrix factorization) methods in many bipartite real networks 1.
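The indices named above can be sketched for a monopartite undirected network as follows. This is a minimal illustration using the forms commonly given in the link prediction literature, not necessarily the exact formulations of the Methods; the function name `classical_indices`, the adjacency-dictionary representation, the toy network and the use of the natural logarithm in AA are assumptions made for the example.

```python
import math

def classical_indices(adj, x, y):
    """Return CN, JC, AA, RA and PA scores for the node pair (x, y).

    adj maps each node to the set of its first neighbours
    (hypothetical representation chosen for this sketch).
    """
    gx, gy = adj[x], adj[y]
    common = gx & gy   # common neighbours of the two seed nodes
    union = gx | gy    # all first neighbours of either seed node
    return {
        "CN": len(common),
        "JC": len(common) / len(union) if union else 0.0,
        # AA and RA give more weight to common neighbours with low degree.
        # A common neighbour has degree >= 2, so log(degree) > 0.
        "AA": sum(1.0 / math.log(len(adj[z])) for z in common),
        "RA": sum(1.0 / len(adj[z]) for z in common),
        "PA": len(gx) * len(gy),
    }

# Toy monopartite network: x and y are non-adjacent and share a and b.
toy = {
    "x": {"a", "b"},
    "y": {"a", "b", "c"},
    "a": {"x", "y"},
    "b": {"x", "y", "c"},
    "c": {"y", "b"},
}
scores = classical_indices(toy, "x", "y")
```

In this toy network the two common neighbours have degrees 2 and 3, so RA and AA weight the first one more heavily, whereas CN counts both equally.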
In contrast to the existing node-neighbourhood-based approaches, a strategic shift has been introduced recently in which the focus is no longer only on groups of common nodes and their node neighbours, but also on the organization of the links between them. This theory, defined and tested only in monopartite undirected networks, is known as the local community paradigm (LCP-theory) 6. The LCP-theory holds that, for modelling link prediction in complex networks, the information content related to the common neighbour nodes should be complemented with the topological information emerging from the interactions between them. The cohort of common neighbours and their cross-interactions form what is called a local community; the cross-interactions between CNs are called local community links (Fig. 1A). In order to demonstrate the validity of the theory on several classes of networks, different classical node-based link prediction techniques like CN, JC, AA, RA and PA were reinterpreted according to the LCP-theory 6, by introducing terms related to the local community links (LCLs) in their formulations. This mathematical reformulation represents the Cannistraci variation of CN, JC, AA, RA and PA, respectively renamed CAR, CJC, CAA, CRA and CPA 6. For simplicity, from here on we will refer to the former as classical models and the latter as LCP-based models. All the details on these models are provided in the Methods.
A theoretical innovation of this article is the definition of the LCP-theory and the related local-based models for link prediction in undirected bipartite networks. However, the first and nontrivial step required for this extension is the definition of the concept of CN index in bipartite topologies: surprisingly, as mentioned above, this concept has not yet been formally defined in network theory. The definition of CNs in monopartite networks is related to the concept of triadic closure.
In an undirected monopartite graph, the triadic closure produces a triangle when an edge connects two nonadjacent nodes which already have a CN. Therefore, any node connected to two nonadjacent nodes that might be involved in the triangular closure between them is a CN.
Moreover, the triadic closure has the main property of enclosing the shortest path between the two nonadjacent nodes. Thus, intuitively, the definition of CNs in bipartite networks turns out to be related to quadrangular closure (Fig. 1B), which is the equivalent of the triangular closure operation in bipartite topologies 4. In fact, in bipartite structures, the quadrangular closure encloses the shortest path between two non-adjacent nodes that belong to two distinct classes (Fig. 1B). This entails that in bipartite networks we define the CNs of two given nonadjacent seed nodes (for which we want to estimate the likelihood of their missing interaction) as the nodes that are involved in all possible quadrangular closures between these seed nodes, and the LCLs as all the links that occur between these CNs (Fig. 1B). Analogously, for two given adjacent seed nodes for which we want to estimate the reliability of their existing interaction, we define the CNs as the nodes that are involved in all already existing quadrangles passing through the seed nodes, and the LCLs as all the links that occur between these CNs. A practical consequence of these definitions is that, mathematically, the CN, JC, AA and RA indices and their Cannistraci variations for prediction in bipartite networks can be expressed using the same formulations provided for monopartite networks (see Methods).
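The bipartite definitions of CNs and LCLs given above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the function name `bipartite_cn_lcl`, the adjacency-dictionary representation and the toy network are all hypothetical.

```python
from collections import defaultdict

def bipartite_cn_lcl(adj, x, y):
    """Return (CNs, LCLs) for two seed nodes x and y from opposite classes.

    adj maps each node to the set of its neighbours in the bipartite network.
    """
    nx = adj[x] - {y}  # neighbours of x (same class as y), seed y excluded
    ny = adj[y] - {x}  # neighbours of y (same class as x), seed x excluded
    # Every link a-b with b adjacent to x and a adjacent to y lies on a
    # path x-b-a-y of length three, i.e. on a quadrangle closed by x-y.
    lcl = {(a, b) for b in nx for a in adj[b] & ny}
    # The CNs are exactly the end points of those links.
    cn = {node for link in lcl for node in link}
    return cn, lcl

# Toy bipartite network (assumed data): class A = a1..a4, class B = b1..b4.
edges = [("a1", "b1"), ("a1", "b2"), ("a1", "b4"), ("a2", "b1"),
         ("a2", "b3"), ("a3", "b2"), ("a3", "b3"), ("a4", "b3")]
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

cn, lcl = bipartite_cn_lcl(adj, "a1", "b3")  # non-adjacent seed nodes
```

In this toy network, b4 is a neighbour of a1 and a4 is a neighbour of b3, but neither lies on a path of length three between the seed nodes, so neither is counted as a CN. Excluding the seed nodes from each other's neighbourhoods lets the same function cover the adjacent case, where only already existing quadrangles are considered.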
Comparing the proposed LCP-based and classical models (which work in the bipartite domain) to the state-of-the-art methods (which are based on the one-mode-network projections), we found that in general the LCP-based link predictors offer a significant improvement in many technological, social and biological bipartite networks of assorted size. Suppl. Fig. 1 shows the precision of each tested link predictor in recovering L edges, where L is the number of edges equivalent to 10% of the original edges present in the network. For each network, L edges were randomly removed and the remaining observed network topology was used to apply the link predictor, which offered as output the list of candidate links (composed of the nonobserved and the removed edges) ranked in decreasing likelihood of existence. The precision represents the ratio of correct edges recovered out of the top L edges in the candidate list generated by each link predictor. This operation was repeated 100 times for each network and we report the mean and standard error for each method in Suppl. Fig. 1. The described procedure is the gold standard for quantification of link prediction performance 8. In Fig. 2, where the LCP-based models again demonstrate clear superiority, we provide the values obtained using the same procedure for another measure, called area under the precision-recall curve (AUPR). While precision assesses the ability of each method to rank the L removed edges among the top L in the candidate list, AUPR assesses the ability to rank the L removed edges as high as possible in the entire candidate list. For this reason AUPR is considered a more complete and robust estimation of performance. Furthermore, Fig. 3A-B shows the mean precision and mean AUPR across all networks.
Taken together, these results emphasize that LCP-based models can significantly (p-value < 0.001, Fig. 4) outperform state-of-the-art methods for link prediction, at least in bipartite networks topologically similar to the ones considered here, offering 123% improvement in precision compared to classical models, and 186% improvement in precision compared to one-mode-projection methods (Fig. 4A,C). The performance improvements evaluated according to AUPR are even larger (Fig. 4B,D). As a final remark, considering only the biological networks, we can ascertain that LCP-based models provide noteworthy link prediction performance. This is demonstrated by comparing the performance of LCP-based models versus a supervised model named Bipartite Local Model (BLM), which is considered a baseline algorithm specifically developed for prediction of drug-target interactions in bipartite molecular networks 11,12. BLM builds a classifier using not only specific local network topology information of each drug or target directly in the bipartite domain, but also prior knowledge (chemical and sequence similarities) about the molecules that constitute the nodes of such biological networks 11,12. We should expect the performance of BLM to be significantly higher than that of LCP-based models, which are unsupervised and exploit only the topological bipartite information. However, we unexpectedly observed that, in two out of three tested networks, the best LCP-based models are not significantly different from BLM and sometimes are even better (Fig. 3C,D and Suppl. Fig. 2-3).
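The precision evaluation described above (random removal of 10% of the edges, ranking of the candidate links, fraction of removed edges among the top L) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `scorer(adj, x, y)` signature and the tie-breaking by candidate order are assumptions, and preferential attachment is used only as an example of a scorer that is defined in the bipartite domain.

```python
import random
from itertools import product

def precision_at_L(edges, top, bottom, scorer, fraction=0.1, seed=0):
    """Remove L edges at random, rank the candidates, return precision.

    edges are (top_node, bottom_node) pairs; top and bottom are node lists.
    """
    rng = random.Random(seed)
    edges = sorted(set(edges))
    L = max(1, round(fraction * len(edges)))
    removed = set(rng.sample(edges, L))
    observed = set(edges) - removed
    # Adjacency of the observed network only.
    adj = {n: set() for n in list(top) + list(bottom)}
    for a, b in observed:
        adj[a].add(b)
        adj[b].add(a)
    # Candidates: nonobserved links plus the removed edges.
    candidates = [e for e in product(top, bottom) if e not in observed]
    # Ties keep the candidate order (assumption of this sketch).
    ranked = sorted(candidates, key=lambda e: scorer(adj, e[0], e[1]),
                    reverse=True)
    return sum(e in removed for e in ranked[:L]) / L

def preferential_attachment(adj, x, y):
    """PA score: product of the degrees of the two seed nodes."""
    return len(adj[x]) * len(adj[y])

def mean_precision(edges, top, bottom, scorer, repetitions=100):
    """Average the precision over independent random removals."""
    runs = [precision_at_L(edges, top, bottom, scorer, seed=s)
            for s in range(repetitions)]
    return sum(runs) / len(runs)
```

Any other link predictor can be plugged in by passing a different `scorer`; the standard error reported in the text could be computed from the same list of runs.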
To conclude, the bipartite formulations of the CN index and LCP-theory presented here are not only a valuable contribution to improving topological prediction in bipartite complex networks, but also a first effort to intuitively devise local-based link prediction completely in the bipartite domain.

Data Description
We performed link prediction evaluations in six bipartite networks of different size and origin. Basic topological statistical features are reported in Suppl. Table S1.
State-of-the-art link prediction methods based on the one-mode-network projections.
It has been shown that any bipartite network can be projected into its two monopartite network representations by means of a procedure called bipartite network projection (BNP) 2 . Based on the vector representation of the two one-mode projections, we performed link prediction applying the following advanced metrics, which represent the state of the art for link prediction in bipartite networks: Network Based Inference (NBI) 2 , Bipartite Projection via Random-walk (BPR) 3 ; and baseline spatial distance measures: Jaccard (Jac), Euclidean (Euc), Cosine (Cos) and Pearson (Pea).
All these measures have been computed as described by Coscia et al. 3, using a Python implementation provided by the authors (http://www.michelecoscia.com/?page_id=734).
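The one-mode projection step on which these methods rely can be sketched as follows. This is a minimal illustration only: it weights each projected link by a plain count of shared opposite-class nodes, which is an assumption of the example and not the weighting used by NBI, BPR or the implementation cited above; the function name `one_mode_projection` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def one_mode_projection(edges, side=0):
    """Project bipartite edges onto one node class.

    edges are (u, v) pairs with u in the first class and v in the second.
    side=0 keeps the first class, side=1 keeps the second class.
    Returns a dict mapping node pairs to the number of shared neighbours.
    """
    groups = {}
    for e in edges:
        keep, via = e[side], e[1 - side]
        groups.setdefault(via, set()).add(keep)
    weights = Counter()
    # Two kept nodes are linked once for each opposite-class node they share.
    for members in groups.values():
        for u, v in combinations(sorted(members), 2):
            weights[(u, v)] += 1
    return dict(weights)

# Toy consumer-product network (assumed data).
purchases = [("c1", "p1"), ("c2", "p1"), ("c2", "p2"),
             ("c3", "p2"), ("c1", "p2")]
consumers = one_mode_projection(purchases, side=0)
products = one_mode_projection(purchases, side=1)
```

The example also shows why a projection loses information: the products network reduces to a single weighted link between p1 and p2, from which it is no longer possible to tell which consumers generated it.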
Classical node-neighbourhood-based and LCP-based models for link prediction.
Given two non-adjacent nodes x and y, the classical formulations 6 are:
Competing interests: The authors declare no competing financial interests.
Data and materials availability: The MATLAB code of the link prediction models is available upon request to the corresponding author. {NOTE: Upon article publication it will be publicly released}.