Zoo guide to network embedding

Networks have provided extremely successful models of data and complex systems. Yet, as combinatorial objects, networks do not in general have intrinsic coordinates and do not typically lie in an ambient space. The process of assigning an embedding space to a network has attracted great interest in the past few decades, and has been successfully applied to fundamental problems in network inference, such as link prediction, node classification, and community detection. In this review, we provide a user-friendly guide to the network embedding literature and to current trends in this field, allowing the reader to navigate the complex landscape of methods and approaches emerging from the vibrant research activity on these subjects.

Networks are intrinsically combinatorial objects (i.e., interconnected nodes, where certain pairs of nodes are connected by links), with no a priori ambient space, nor geometric node information such as 'coordinates'. Network embedding (also known as representation learning) is the process of assigning such an ambient space (called the latent or embedding space) to a network. This is typically done by mapping the nodes to a geometric space, such as a Euclidean space R^d, while preserving some properties of the nodes, links, and/or network (7). Overall, network embedding methods are used for learning a low-dimensional vector representation from a high-dimensional (as measured by the number of nodes) network. The relationships between nodes in the network are represented by their distances in the low-dimensional embedding space. The low-dimensional vector representation can then be used for visualisation, and in a wide variety of downstream analyses, from network inference or link prediction to node classification or community detection. Moreover, network embedding can provide insights into the geometry of the underlying data. These insights can be useful for performance improvement, by working in a lower-dimensional space, or by exploiting the richer geometry of the embedding space. Finally, some downstream analyses, such as machine learning techniques, require a vector representation of the network. In this context, embedding a network into a vector space is a prerequisite (8).
Network embedding raises many challenges. First, a fundamental question is which network properties should be preserved by the embedding. For instance, the embedding space may preserve the intra-community similarity, the structural role similarity, or the similarity between node labels. A second challenge is related to the choice of dimension. The dimension of the embedding space is a trade-off between two competing requirements: preserving the information encoded in the original network (which favours high-dimensional representations) and reducing the complexity or noise in the original network (which favours low-dimensional representations). Third, the scalability of network embedding methods is important: embedding methods applied to real networks face low parallelizability and data sparsity issues. Methods need to be efficient for networks of the size of typical modern-day network data sets, that is, up to several million nodes and edges (9-11). Lastly, the interpretation of the results of network embedding can be difficult (12).
For decades, dimensionality reduction methods based on matrix factorisation have appeared as a relevant way to encode topological network information (13-15). Given the success and ubiquity of network models, these methods provided an initial set of network embedding techniques. Over the past few years, there has been a significant surge in the number of embedding methods, making it challenging to navigate this fast-evolving field. The purpose of this review is to provide an overview based on a novel taxonomy that extends previous ones (16, 17), and to describe current trends in network embedding.
First, we introduce the basic concepts of network embedding and the state-of-the-art taxonomies of these methods. Next, we present our own taxonomy, based on the common mathematical processes that underlie the embedding methods. This taxonomy aims to assist readers in navigating the field. We describe the two well-established classes of methods: the shallow embedding methods and the deep learning methods. In addition to these two classical approaches, we include two sections dedicated to higher-order network embedding methods and to emerging network embedding methods. These sections highlight current trends in the field, although the taxonomy is broad enough to integrate these new methods. Finally, we illustrate the wide range of network embedding applications, with one section devoted to the classical applications, including user guidelines, and another section dedicated to the emerging applications that are currently growing in popularity.

Definitions and preliminaries
A network, defined formally as a pair G = (V, E), consists of a non-empty set V of vertices (or nodes) and a set E of edges (or links) connecting certain pairs of nodes. In the case of undirected networks, we can define E as a subset of {{u, v} | u, v ∈ V}, and call {u, v} ∈ E an undirected edge between vertices u and v, so that {u, v} = {v, u}. In the case of directed networks, we can define E ⊆ V × V, and call (u, v) ∈ E a directed edge from vertex u to vertex v, so that in general (u, v) ≠ (v, u). If we agree on a labeling of the vertices, V = {v_1, ..., v_n, ...}, we can write e_ij ∈ E for an edge between v_i and v_j (undirected case) or from v_i to v_j (directed case). Depending on the network model, we can also add node or edge weights and types (see below).
In its simplest form, a network embedding maps each node of a network to a point in a space X, typically a Euclidean vector space X = R^d with d ≪ n, the number of nodes. This space is called the latent space or embedding space. In the latent space, certain properties (of the nodes, edges, or the whole network) are preserved. Hence, a network embedding (into X = R^d) is a mapping function f : V → R^d that associates an embedding vector z_i with each node v_i. The embedding vector z_i is expected to capture the topological properties of the original network while reducing the dimension from n to d. Network embedding methods can embed different components of the network. The previous definition describes the most common setting, namely node embedding. In node embedding methods, each node of the network is embedded into an embedding space X, typically a reduced vectorial representation, through a mapping function V → X. However, some methods handle edge embeddings E → X, where each edge of the network is embedded into the embedding space X. Other embedding methods target subgraph or whole-network embedding, where the whole network, or some of its parts, is projected into an embedding space, such as a vector space.
To design efficient network embedding methods, several criteria need to be considered: • Adaptability: Embedding methods need to be applicable to different data and tasks without, for instance, repeating a learning step.
• Scalability: Embedding methods need to process large-scale networks in a reasonable time.
• Topology awareness: The distance between nodes in the latent space should reflect the connectivity and/or homophily of the nodes in the original network, where homophily is the tendency of nodes to be connected to similar nodes (so that similar nodes in a network will be close in the embedding space).
• Low dimensionality: Embedding methods should reduce the dimension of the network, by mapping a network with n nodes to a d-dimensional space, with d ≪ n.
• Continuity: The latent space should be continuous, which is beneficial in some tasks like classification (18).
As mentioned, while reducing the dimension, the embedding space should preserve some node, edge, and/or network properties. Focusing on node properties, the most common properties preserved by network embedding methods include: • The first-order similarity between two vertices, which is the pairwise similarity between the vertices. In other words, the weight of the edge between two vertices defines a first-order similarity measure. Let s_{v_i} (respectively s_{v_j}) be the first-order similarity vector collecting the similarities of the node v_i (resp. v_j) with every other node in the network.
• The second-order similarity between two vertices, which considers the similarity of vertices in terms of neighbourhood structures. The second-order similarity between the nodes v_i and v_j is defined as the similarity between the first-order vectors s_{v_i} and s_{v_j}. Higher-order similarities are based on the same idea. These second- or higher-order similarities define structural equivalence between nodes.
• The regular equivalence similarity, which defines the similarity between vertices that share common roles in their neighbourhood, i.e., that have similar local network structures: for instance, a node that bridges two communities, or a node that belongs to a clique. Regular equivalence aims to unveil the similarity between distant vertices which share common roles, in contrast to similarities based on a common neighbourhood.
• The intra-community similarity, which defines the similarity between vertices in the same community. The intra-community similarity aims to preserve the cluster structure information of the network.
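As a toy illustration of the first two notions, the sketch below uses a hypothetical 4-node graph, takes the adjacency rows as first-order similarity vectors, and uses the cosine similarity between those rows as a second-order measure (both choices are illustrative, not the only possible ones). It shows that two non-adjacent nodes can still be highly similar at second order because they share neighbours.

```python
import numpy as np

# Hypothetical undirected network with edges (0,1), (0,2), (1,2), (2,3)
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

def second_order(i, j):
    """Cosine similarity between the first-order vectors s_{v_i} and s_{v_j}."""
    si, sj = A[i], A[j]
    return float(si @ sj / (np.linalg.norm(si) * np.linalg.norm(sj)))

# Nodes 0 and 3 are not adjacent (first-order similarity A[0,3] = 0),
# yet they share neighbour 2, giving a high second-order similarity.
```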
Embedding methods are often designed to use specific types of networks as input. These network types include: • Homogeneous networks, which correspond to the standard definition of networks mentioned above, G = (V, E), where V is a non-empty set of vertices (nodes) and E a set of (directed or undirected) edges (links). A more general setup, which allows multi-edges, is G = (V, E, s, t), where V ≠ ∅ and E are arbitrary (vertex, edge) sets, and s, t : E → V are the source and target functions, respectively. Homogeneous networks can also be weighted: an edge (respectively vertex) weight function is a function w_E : E → X_E (respectively w_V : V → X_V), where X_E and X_V are weight sets, typically numeric, X_V = X_E = R.
• Heterogeneous networks. In homogeneous networks, the nodes and the edges are all of the same type. In a heterogeneous network, nodes and edges can have different types. Formally, a heterogeneous network is a network G = (V, E) associated with two type functions, φ : V → A and ψ : E → R. These functions associate each node (respectively edge) with its type. More precisely, we define A = {a_1, a_2, ..., a_α}, with α the number of node types, and R = {r_1, r_2, ..., r_β}, with β the number of edge types.
• Signed networks. This is a particular case of a weighted homogeneous network with weights ±1. Namely, G = (V, E) is a network, and τ : E → {−1, 1} is a mapping function that associates a sign with each edge.
• Temporal networks. Temporal networks are specific cases of multilayer networks where the layers are ordered by time, that is, they represent the evolution of a graph over time (19, 25, 26).
• Knowledge networks. Knowledge graphs are defined by a set of triples (u, r, v) ∈ V × R × V, where the nodes u and v belong to the node set V and are connected by an edge of type r ∈ R.

Existing Taxonomies of network embedding methods
The huge number and variety of embedding methods (27) make their classification into a unified taxonomy a difficult task. The methods can indeed be sorted according to several criteria. We briefly present some of the most common taxonomies.
A first way to classify network embedding methods is based on the type of networks used as input. Some authors thereby distinguish the methods designed for homogeneous networks from those designed for heterogeneous networks (27). The same strategy can be used to classify embedding methods designed for static versus temporal networks, or single versus multilayer networks. Based on this taxonomy, it is possible to add a layer of complexity by considering the type of component that the methods embed in the vectorial space, i.e., the nodes, the edges, the subgraphs, or the whole network. Other authors use properties of the network embedding process itself to classify the different methods (28). For instance, the network embedding methods may be classified depending on the network properties they intend to preserve, in addition to the part of the network on which the methods focus (the nodes, the edges, or the whole network). Focusing on nodes, three different types of property preservation can be defined, at different scales: microscopic, mesoscopic, and macroscopic. Methods preserving microscopic properties retain structural equivalences between nodes, such as first-order, second-order, or higher-order similarities. They hence seek to preserve the homophily existing in the original network. Methods preserving mesoscopic properties focus on the regular equivalence between nodes, on intra-community similarity, or, more generally, on properties that lie between the close node neighbourhood and the whole network. Finally, methods preserving macroscopic properties tend to preserve whole-network properties, like scale-freeness (29). Based on the same idea, a different taxonomy has been adopted by Cui et al.
(7). In this work, the authors discriminate the network embedding methods based on the information they wish to encode. The first class of methods, called structure and property preserving methods, preserves structural information like the neighbourhood or the community structures. The second class, called information-preserving methods, constructs the embedding using complementary information like node labels and types (for heterogeneous networks) or edge attributes.
The third class, called advanced information-preserving methods, gathers supervised methods that propose an end-to-end solution (a learning process where all parameters are trained jointly) and use various complementary information to learn the embedding space.
In conclusion, multiple taxonomies have been proposed, based on several criteria: the properties preserved by the network embedding methods (7, 28), the type and the properties of the input networks (18, 27), or mathematical considerations (17, 30-32). Our approach in this review is based on mathematical considerations and is similar to that of Chami et al. (17). However, while Chami et al. extended the encoder-decoder framework of Hamilton et al. (16) (see section 3) to include deep-learning methods as a special case, we have defined a more flexible way to organise the methods that is not constrained by the encoder-decoder framework. While this framework can be a useful tool for understanding the methods, we believe that a more flexible approach offers easier integration of new methods into the taxonomy. Significantly, our review includes higher-order network embedding methods that were not covered in (17). Note that our review presents a wide range of methods akin to a diverse set of 'species' coexisting within a 'zoo', hence the chosen title for our review. In this way, the objective of this new taxonomy is to be fine-grained, to offer a consensual view, and to be easily extended to integrate novel methods. In addition, our approach is independent of the scientific domain of development and application of the embedding methods.

Taxonomy of network embedding methods
Recently, important efforts have been made to produce general frameworks defining different embedding methods under a common mathematical formulation (16, 17, 33). Notably, Hamilton et al. (16) proposed an encoder-decoder framework to define embedding methods, following four components: 1. A pairwise similarity function: s_G : V × V → R^+. This function defines the similarity measure between the nodes in the original (i.e., direct) network space.
2. An encoder function: Enc : V → R^d. This function encodes the nodes into the embedding space.
For instance, the node v_i ∈ V is embedded into the vector z_i ∈ R^d.

3. A decoder function: Dec : R^d × R^d → R^+.
This function associates a similarity measure in the embedding space to each pair of embedding vectors.
4. A loss function ℓ. The model is trained by minimising the discrepancy, over a set D of training node pairs, between the decoded similarity and the similarity in the original network:

L = Σ_{(v_i, v_j) ∈ D} ℓ( Dec(z_i, z_j), s_G(v_i, v_j) ) , [1]

Many embedding techniques align with this taxonomy, but a few of them are inaccurately described. For instance, higher-order network embedding techniques (34-36) do not only use pairwise similarity functions (see section 3.C). While this framework has been extended and enhanced to incorporate deep learning techniques (17), it still does not cover higher-order network embedding methods.
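A minimal sketch of the encoder-decoder view, assuming a shallow lookup encoder, an inner-product decoder, a squared loss ℓ, the adjacency matrix as s_G, and all node pairs as training set (these are illustrative choices, not prescribed by the framework). The gradient of ‖ZZ^T − A‖²_F with respect to Z is 4(ZZ^T − A)Z when A is symmetric:

```python
import numpy as np

# Hypothetical 3-node graph: s_G is the adjacency matrix
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 2))          # encoder output: row i is z_i

def loss(Z):
    E = Z @ Z.T - A                  # Dec(z_i, z_j) - s_G(v_i, v_j) for all pairs
    return float((E ** 2).sum())

loss_before = loss(Z)
for _ in range(200):                 # gradient descent on the empirical loss [1]
    Z -= 0.01 * 4 * (Z @ Z.T - A) @ Z
loss_after = loss(Z)
```

Training only updates the lookup table Z; swapping the decoder or the loss yields different methods within the same framework.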
Another general framework has been proposed by Yang et al. (33) to classify heterogeneous network embedding (HNE) methods. The idea of this framework is to convert the homophily principle (similar nodes in a network will be close in the embedding space) into a generic objective function:

min Σ_{v_i, v_j} w_{v_i v_j} d(z_i, z_j) + J_R , [2]

The term w_{v_i v_j} denotes the proximity weight, d(z_i, z_j) is the embedding distance between the embedding vectors associated with the nodes v_i and v_j, and J_R represents additional objectives such as regularisers.
The taxonomy proposed in this work is based on a mathematical point of view, illustrated in Fig. 1, which splits the methods into two main classes depending on their depth: the shallow embedding methods (3.A) and the deep learning methods (3.B). We complement these two classes with the higher-order methods (3.C), which can be classified as either shallow embedding or deep learning methods, enabling us to spotlight these new types of methods. In the next sections, we adopt the notation defined in section 1 for both the network and the associated embedding. In the following, we write ‖·‖_F for the Frobenius norm and ‖·‖_2 for the Euclidean norm.

A. Shallow network embedding methods.
In this section, we consider the shallow network embedding methods, a set of methods whose encoder function can be written as a simple lookup:

Enc(v_i) = Z · 1_{v_i} , [3]

where Z is the matrix whose columns are the embedding vectors of all nodes, and 1_{v_i} is the indicator vector associated with the node v_i (a vector of zeros except in position i, where the element is equal to 1). In this case, the objective of the embedding process is to optimise the embedding matrix Z in order to obtain the best mapping between the nodes and the embedding vectors (Fig. 2). We define three major classes of shallow embedding methods based on different mathematical processes: the matrix factorisation methods, the random walk methods, and the optimisation methods.
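The lookup encoder of equation [3] can be sketched as follows (the embedding matrix Z is arbitrary here; in practice it is the object being optimised):

```python
import numpy as np

n, d = 5, 2
# Hypothetical embedding matrix; column i is the embedding vector z_i
Z = np.arange(n * d, dtype=float).reshape(d, n)

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def enc(i):
    """Enc(v_i) = Z . 1_{v_i}: the shallow encoder is just a column lookup."""
    return Z @ one_hot(i, n)
```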

A.1. Matrix factorisation methods.
Matrix factorisation is based on the fact that a matrix, such as the adjacency matrix or the Laplacian matrix, can fully represent a network. This implies that existing methods from matrix algebra can be used for network embedding. Network embedding methods based on matrix factorisation are directly inspired by linear dimensionality reduction methods, such as PCA (37), LDA (15), or MDS (18). Other methods are inspired by non-linear dimensionality reduction methods such as Isomap (38), which is an extension of MDS (18), LLE (39), t-SNE (40), or, more recently, UMAP (41). The factorisation process depends on the properties of the matrices. For positive semi-definite matrices, like graph Laplacians, the embedding can be obtained by eigenvalue decomposition. For other matrices, gradient descent or Singular Value Decomposition (SVD) can be used to obtain the network embedding. In the following, we describe the most common network embedding methods based on matrix factorisation.
• Laplacian Eigenmaps (LE) (42) aims to embed the network in such a way that two nodes close in the original network are also close in the low-dimensional embedding space, by preserving a similarity measure defined by the weights between nodes. To that end, we define the weight matrix W, where W_ij encodes the weight between the nodes i and j. The learning process optimises the following objective function:

L = Σ_{i,j} W_ij ‖z_i − z_j‖²_2 , [4]

with Dec(z_i, z_j) = ‖z_i − z_j‖²_2 and s_G(v_i, v_j) = W_ij. We can introduce the Laplacian matrix L, defined as L = D − W, with D_ii = Σ_j W_ji. The equation [4] can be written as:

L = 2 tr(Z L Z^T) , [5]

with Z = (z_1, z_2, ..., z_n) ∈ R^{d×n}. The loss function needs to respect the constraint Z D Z^T = I to avoid trivial solutions. The solution can be obtained by finding the matrix composed of the eigenvectors associated with the d smallest non-trivial eigenvalues of the generalized eigenvalue problem Lz = λDz, with Λ = diag([λ_1, λ_2, ..., λ_d]) (43).

Fig. 1. Pie charts describing the new taxonomy defined in this manuscript. In the top pie chart, the methods are divided into two main categories: the shallow embedding methods and the deep learning methods, complemented by higher-order methods that can be either shallow embedding or deep learning methods. The bottom pie chart highlights the three major emerging groups of methods. Notably, these emerging groups of methods can be classified into our taxonomy thanks to its flexibility.
It is important to note that Laplacian Eigenmaps uses a quadratic decoder function. This function does not preserve the local topology well, because the quadratic penalty is dominated by node pairs at large distances, so that the small distances encoding the local topology have little influence on the objective.
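As an illustrative sketch of Laplacian Eigenmaps (not the authors' implementation), the generalized eigenvalue problem Lz = λDz can be solved through the symmetric normalised form D^{-1/2} L D^{-1/2}. We assume a hypothetical weighted graph of two triangles joined by one weak edge; the first non-trivial coordinate then separates the two triangles:

```python
import numpy as np

# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by a weak edge
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

D = np.diag(W.sum(axis=1))
L = D - W                                   # graph Laplacian
Dm = np.diag(1 / np.sqrt(W.sum(axis=1)))    # D^{-1/2}
vals, vecs = np.linalg.eigh(Dm @ L @ Dm)    # ascending eigenvalues

# Skip the trivial eigenvector; keep d = 2 coordinates, column i of Z is z_i
Z = (Dm @ vecs[:, 1:3]).T
```

By construction this embedding satisfies the constraint Z D Z^T = I exactly.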
• Cauchy Graph Embedding (44) aims to improve on Laplacian Eigenmaps, which does not preserve the local topology. Cauchy Graph Embedding uses a different decoder function, Dec(z_i, z_j) = σ²/(‖z_i − z_j‖²_2 + σ²), with σ² representing the variance. Consequently, the loss function can be written as follows:

max_Z Σ_{i,j} W_ij σ²/(‖z_i − z_j‖²_2 + σ²) , [6]

with the following constraints: Σ_i z_i = 0 and Z Z^T = I, where Z = (z_1, z_2, ..., z_n) ∈ R^{d×n}. The solution is obtained by an algorithm that mixes gradient descent and SVD.
Fig. 2. A network embedding projects the nodes of a network into a latent space. This projection is achieved using a mapping function f from the direct space to the embedding space. The mapping function f is derived by optimising a loss function L, which aims to minimise the difference between the similarity measures of nodes in the direct space (S_D) and their equivalents in the embedding space (S_E) obtained through the decoder function.

• Graph factorisation (45) proposes a factorisation method that is designed for network partitioning. It learns an embedding representation that minimises the number of neighbouring vertices across the partition. The loss function can be written as follows:

L = (1/2) Σ_{(i,j)∈E} (W_ij − z_i^T z_j)² + (λ/2) Σ_i ‖z_i‖²_2 , [7]

with Dec(z_i, z_j) = z_i^T z_j, s_G(v_i, v_j) = W_ij, and λ a regularisation parameter. Notably, this method is scalable and can deal with networks with millions of vertices and billions of edges.
• GraRep (46) extends the skip-gram model (47) to capture higher-order similarity, i.e., the k-step neighbours of nodes. The value of k is chosen such that 1 ≤ k ≤ K, with K the highest order. GraRep is also motivated by the Noise-Contrastive Estimation (NCE) approximation (48), which consists in learning a model that converges to the objective function. NCE trains a binary classifier to distinguish between node samples coming from the similarity distribution s_G and node samples generated by a noise distribution over the nodes. GraRep defines its k-step loss function as follows:

L_k = Σ_{i,j} T^k_ij log σ(x_i · x_j) + λ E_{v_j ∼ p_k(V)} [ log σ(−x_i · x_j) ] , [8]

where the matrix T represents the transition matrix, defined as T = D^{−1} A, with A the adjacency matrix and D the degree matrix. The vectors x_i and x_j are the vector representations of the nodes v_i and v_j in the direct space. The term E_{v_j ∼ p_k(V)} is the expectation over the nodes v_j, obtained by negative sampling; it follows the noise distribution over the nodes in the network, denoted by p_k(V). The parameter λ indicates the number of negative samples, and σ(·) is the sigmoid function, defined as σ(x) = (1 + e^{−x})^{−1}. GraRep reformulates the minimisation of its loss function as a matrix factorisation problem. Each k-step term is computed from the matrix X^k, with X^k_ij = max( log( T^k_ij / Σ_t T^k_tj ) − log(λ/n), 0 ). Then, the low-dimensional representation W^k is constructed from the Singular Value Decomposition SVD(X^k). Finally, the final representation is obtained by concatenating all order-term matrices, W = [W^1, W^2, ..., W^K].
• High-Order Proximity preserved Embedding (HOPE) (49) has been developed to encode the higher-order similarity of large-scale networks while also capturing asymmetric transitivity, i.e., going from node v_i to node v_j can be different from going from node v_j to node v_i. HOPE can hence deal with directed networks. The loss function is equal to:

L = Σ_{i,j} ( s_G(v_i, v_j) − z_i^T z_j )² , [9]

with Dec(z_i, z_j) = z_i^T z_j and s_G(v_i, v_j) denoting any similarity measure between v_i and v_j. The authors of HOPE introduce a general formulation in which the similarity measure can be factorised into one matrix associated with the global similarity, M_g, and another matrix associated with the local similarity, M_l. The similarity matrix can thus be expressed as S = M_g^{−1} M_l, where both the local and global similarity matrices are polynomials of sparse matrices. This also enables HOPE to use an efficient SVD decomposition for embedding large-scale networks. The authors considered different similarity measures, such as the Katz index (S^Katz = (I − βA)^{−1} βA), Rooted PageRank (S^RPR = (I − αT)^{−1} (1 − α)I), common neighbours (S^CN = A²), or Adamic-Adar (S^AA = A D̂ A, with D̂ a diagonal matrix accounting for node degrees), where A indicates the adjacency matrix, T represents the transition matrix, α is a value in [0, 1), and β is a value smaller than the reciprocal of the spectral radius of the adjacency matrix.
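A minimal sketch of a HOPE-like embedding, assuming the Katz index as the similarity measure and a plain truncated SVD (the original method uses a more efficient generalised SVD). The hypothetical directed 4-cycle makes the asymmetry of S visible:

```python
import numpy as np

# Hypothetical directed graph: the cycle 0 -> 1 -> 2 -> 3 -> 0
A = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.],
              [1., 0., 0., 0.]])
beta = 0.5   # must be below 1/spectral_radius; here the spectral radius is 1

# Katz similarity: S = (I - beta*A)^{-1} (beta*A)
S = np.linalg.inv(np.eye(4) - beta * A) @ (beta * A)

U, s, Vt = np.linalg.svd(S)
d = 2
Zs = U[:, :d] * np.sqrt(s[:d])      # source embeddings
Zt = Vt[:d, :].T * np.sqrt(s[:d])   # target embeddings
S_hat = Zs @ Zt.T                   # rank-d reconstruction of S
```

Keeping separate source and target embeddings is what allows the reconstruction to stay asymmetric.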
• Modularised Nonnegative Matrix Factorisation (M-NMF) (50) aims to obtain an embedding representation aware of the community structure of the original network, while maintaining the microscopic information from the first-order and second-order similarities. Let us define the similarity measure S = S^(1) + η S^(2) ∈ R^{n×n}, where S^(1) is the first-order similarity matrix, for instance S^(1)_ij = A_ij with A the adjacency matrix, and S^(2) is the second-order similarity matrix, defined as S^(2)_ij = N_i · N_j / (‖N_i‖_2 ‖N_j‖_2), with N_i = (S^(1)_i1, ..., S^(1)_in) the first-order similarity vector of the node i. The parameter η is the weight of the second-order term (often chosen equal to 5 (50)). The embedding of the microscopic structure can be expressed in the NMF framework as the following optimisation problem:

min_{M,U ≥ 0} ‖S − M U^T‖²_F , [10]

with M ∈ R^{n×d} and U ∈ R^{n×d} two non-negative matrices; U_i is the embedding of the node i.
The community structure is obtained with modularity maximisation, which is expressed for two communities as Q = (1/4m) Σ_{ij} (A_ij − k_i k_j / 2m) h_i h_j, with k_i the degree of the node i, h_i equal to 1 if the node i belongs to the first community and to −1 otherwise, and m the total number of edges. Let us define B such that B_ij = A_ij − k_i k_j / 2m. The generalisation of the modularity optimisation problem to k communities is defined as:

min_H −β tr(H^T B H), s.t. tr(H^T H) = n , [11]

with H ∈ R^{n×k} the community membership matrix and β a positive parameter. The constraint imposes the association of each node to one community. The two models are combined using a term that uses the community structure to guide the node representation learning process. Formally, we define C ∈ R^{k×d} as the community representation matrix, with C_r the representation of the community r, so that U_i C_r^T represents the propensity of the node i to belong to the community r. The last term to optimise is then α ‖H − U C^T‖²_F, with the constraint C ≥ 0 and α a positive parameter. Finally, the overall problem to be optimised is the following one:

min_{M,U,H,C ≥ 0} ‖S − M U^T‖²_F + α ‖H − U C^T‖²_F − β tr(H^T B H), s.t. tr(H^T H) = n , [12]

Due to the non-convexity of this objective function, a non-trivial optimisation process has been developed (50).
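The role of the modularity matrix B can be illustrated on a hypothetical graph with two planted communities: with a hard community-indicator matrix H, the quantity tr(H^T B H)/2m is the modularity of the partition, and a good partition scores higher than a bad one.

```python
import numpy as np

# Hypothetical graph with two planted communities {0,1,2} and {3,4,5}
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
k = A.sum(axis=1)
m = A.sum() / 2
B = A - np.outer(k, k) / (2 * m)   # modularity matrix B_ij = A_ij - k_i k_j / 2m

def modularity(H):
    """Q = tr(H^T B H) / 2m for a hard community-indicator matrix H."""
    return float(np.trace(H.T @ B @ H)) / (2 * m)

H_good = np.zeros((6, 2)); H_good[:3, 0] = 1; H_good[3:, 1] = 1   # planted split
H_bad = np.zeros((6, 2)); H_bad[::2, 0] = 1; H_bad[1::2, 1] = 1   # alternating split
```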
• Text-Associated DeepWalk (TADW) (51) aims to integrate text information into the network embedding process. The authors first prove that the learning process used in the DeepWalk embedding method (see the section about random walk network embedding methods) is equivalent to the optimisation of a matrix factorisation problem, M = W^T H, with M ∈ R^{n×n} the matrix representing the original network, W ∈ R^{d×n} the weight matrix, and H ∈ R^{d×n} the factor matrix. The matrix factorisation problem is the following:

min_{W,H} ‖M − W^T H‖²_F + (λ/2) (‖W‖²_F + ‖H‖²_F) , [13]

The idea of TADW is to take into account a text factor matrix T in the decomposition, such that M = W^T H T, with M ∈ R^{n×n}, W ∈ R^{d×n}, H ∈ R^{d×k}, and T ∈ R^{k×n}. The new matrix factorisation problem is:

min_{W,H} ‖M − W^T H T‖²_F + (λ/2) (‖W‖²_F + ‖H‖²_F) , [14]

The optimisation is carried out with the gradient descent algorithm introduced by H. Yu et al. (52).
A.2. Random walk methods.
The idea behind random walk embedding methods is to encode the statistics of random walks on the network into an embedding space. Most methods build on ideas initially developed in the DeepWalk paper (64). In this section, we describe the most common methods and some of their extensions.
• DeepWalk (64) is a scalable network embedding method that uses local information obtained from truncated random walks to learn latent representations. DeepWalk treats the walks as the equivalent of sentences. The process is inspired by the famous word2vec method (47, 65), in which short sequences of words from a text corpus are embedded into a vectorial space. The first step of DeepWalk consists in generating sequences of nodes from truncated random walks on the network. Then, the update procedure consists in applying the skip-gram model (47) to the sequences of nodes, in order to maximise the probability of observing a node's neighbourhood conditioned on the node embedding. The loss function is defined as follows:

L = − log P( {v_{i−w}, ..., v_{i+w}} \ {v_i} | φ(v_i) ) , [15]

with w indicating the window size (in terms of node sequence) and φ : V → R^d indicating the mapping function.
We can also see φ ∈ R^{n×d} as the matrix of the embedding representations of the nodes. The skip-gram model transforms the equation [15] by approximating the conditional probability with a product of independent probabilities:

P( {v_{i−w}, ..., v_{i+w}} \ {v_i} | φ(v_i) ) = Π_{j=i−w, j≠i}^{i+w} P( v_j | φ(v_i) ) ,

Then, the hierarchical softmax function (66) is applied to approximate each factor of the joint probability distribution as:

P( v_j | φ(v_i) ) = Π_{l=1}^{log(n)} P( b_l | φ(v_i) ) ,

where v_j is identified by a sequence of tree nodes (b_0, b_1, ..., b_log(n)), with b_0 the root of the tree and b_log(n) the node v_j. Notably, similarly to the TADW method, DeepWalk is equivalent to the matrix factorisation problem M = W^T H (51, 67), with W ∈ R^{d×n} the weight matrix and H ∈ R^{d×n} the factor matrix. Several extensions of DeepWalk have been adapted to multilayer networks (68, 69).
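The first step of DeepWalk, generating truncated random walks to be treated as sentences, can be sketched as follows on a hypothetical toy graph. The resulting walks would then be fed to a skip-gram implementation (e.g., word2vec); that training step is omitted here.

```python
import random

# Hypothetical adjacency list of a small connected graph
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def random_walks(adj, num_walks=2, walk_length=5, seed=42):
    """Generate truncated random walks; each walk is a 'sentence' of node ids."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):          # several passes over the nodes
        nodes = list(adj)
        rng.shuffle(nodes)              # visit start nodes in random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(adj[walk[-1]]))  # uniform neighbour step
            walks.append(walk)
    return walks

walks = random_walks(adj)
```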
• node2vec (70) is a modified version of DeepWalk, with two main changes. First, node2vec uses negative sampling instead of the hierarchical softmax for normalisation, which improves the running time. Second, node2vec uses a biased random walk that offers a more flexible learning, tuned by control parameters. The biased random walk can be described as:

P( c_i = x | c_{i−1} = y ) = π_yx / Z if (y, x) ∈ E, and 0 otherwise,

where π_yx is the unnormalised transition probability between node y and node x, and Z is the normalising constant. The unnormalised probability is defined as π_yx = α_pq(t, x) · ω_yx, with α_pq(t, x) = 1/p if d_tx = 0, α_pq(t, x) = 1 if d_tx = 1, and α_pq(t, x) = 1/q if d_tx = 2, where ω_yx is the weight of the edge between the node y and the node x, and d_tx is the shortest-path distance between the node x and the node t, the node visited just before y. The parameters p and q are the two control parameters of the random walk. The return parameter p controls the likelihood of immediately revisiting a node in the walk, while the in-out parameter q controls the likelihood of visiting a node in the neighbourhood of the node that was just visited. Both parameters control whether the random walk follows a Breadth-First Sampling (BFS) strategy or a Depth-First Sampling (DFS) strategy. The first strategy preserves the structural equivalence of the nodes, the second one preserves their homophily. Recently, Multinode2vec (71) and PMNE (72), two extensions of node2vec, adapted the random walk process to multilayer networks.
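The node2vec search bias can be sketched as below, on a hypothetical toy graph with all edge weights ω equal to 1. The bias on a candidate node x depends only on its distance d_tx to the previously visited node t (0, 1, or 2 by construction of the walk):

```python
def alpha(p, q, t, x, adj):
    """Search bias for moving to x, having come from t (the previous node)."""
    if x == t:            # d_tx = 0: step back to the previous node
        return 1.0 / p
    if x in adj[t]:       # d_tx = 1: stay close to the previous node
        return 1.0
    return 1.0 / q        # d_tx = 2: move outward

# Hypothetical graph; the walk went t = 0, then y = 1
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
t, y = 0, 1
# Unnormalised transition probabilities pi_yx for each neighbour x of y
pi = {x: alpha(p=4, q=0.25, t=t, x=x, adj=adj) for x in adj[y]}
```

With p = 4 and q = 0.25, returning to node 0 is discouraged (bias 1/4) while jumping outward to node 3 is encouraged (bias 4), i.e., a DFS-like walk.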
• HARP (73) is an algorithm that was developed to improve the DeepWalk and node2vec embedding methods.
The idea is to capture the global structure of an input network by recursively coalescing edges and nodes of the network into smaller networks with similar structures (see also section 4.B on network compression). The hierarchy of these smaller networks provides an appropriate initialisation for the network embedding process, because it directly expresses a reduced-dimension version of the input network while preserving its global structure. The final embedding is obtained by propagating the embedding of the smallest network through the hierarchy.
• Discriminative Deep Random Walk (DDRW) (74) is particularly suitable for the network classification task.
It can be seen as a DeepWalk extension that considers the label information of the nodes. To do so, DDRW jointly optimises the DeepWalk embedding loss function and a classification loss function. The final loss function to optimise is defined as:

L = η L_DW + L_C , with L_C = (1/2) ‖β‖²_2 + C Σ_i σ( 1 − y_i β^T z_i )² ,

where η is a weight parameter, L_DW is the DeepWalk loss, and σ is the ramp function, i.e., σ(x) = x for x > 0 and σ(x) = 0 otherwise. Moreover, z_i is the embedding vector of the node v_i, y_i is the label of the node v_i, C is the regulariser parameter, and β parametrises the subsequent classifier.
• Walklets (75). Given the observation that Deepwalk can be derived from a matrix factorisation containing the powers of the adjacency matrix (76), it appears that Deepwalk is biased towards lower powers of the adjacency matrix, which correspond to short walks. This can become a limitation when higher-order powers are the most appropriate representations, for instance to embed the regular equivalence between nodes. To bypass this issue, Walklets proposes to learn the embedding from a multi-scale representation. This multi-scale representation is sampled from successive higher powers of the adjacency matrix obtained from random walks. Then, after partitioning the representation by scale, Walklets learns the representation of each node generated for each scale.
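The scale-k sampling that Walklets performs on random walks can be illustrated with a short helper (the function name is hypothetical; the actual method also subsamples pairs and feeds them to a skip-gram model):

```python
def walklets_pairs(walk, k):
    """From one random walk, emit (node, context) pairs of nodes exactly k
    steps apart -- analogous to sampling from the k-th power of the
    adjacency/transition matrix, i.e. one scale of the representation."""
    return [(walk[i], walk[i + k]) for i in range(len(walk) - k)]
```

For example, on the walk [1, 2, 3, 4], scale k = 2 yields the pairs (1, 3) and (2, 4).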
• Struct2vec (77) aims to capture the regular equivalence between nodes in a network. In other words, two nodes that have identical local network structures should have the same embedding representation. The construction of the embedding representation proceeds in several steps. The first step is to determine the structural similarity between each pair of nodes for different neighbourhood sizes. The structural similarity between the nodes v_i and v_j, when considering their k-hop neighbourhoods (all nodes at a distance less than or equal to k and all edges among them), is defined as follows:

f_k(v_i, v_j) = f_{k−1}(v_i, v_j) + g(s(R_k(v_i)), s(R_k(v_j))),

where R_k(v_i) is the set of nodes at a distance less than or equal to k from the node v_i, s(S) is the ordered degree sequence of a set of nodes S, and g(S_1, S_2) is a distance measure between the two ordered degree sequences S_1 and S_2. The distance used is Dynamic Time Warping (78), and by convention f_{−1} = 0. This procedure produces a hierarchy of structural similarities between nodes of the network. The hierarchy is used to create a weighted multi-layer network, in which layers represent node similarities for different levels of the hierarchy. The edge weights between node pairs are inversely proportional to their structural similarity. After that, a biased random walk process is applied to the multilayer network to generate sequences of nodes. The sequences of nodes are used to learn a latent representation with the skip-gram process.
• Other random walk methods: Some methods are designed to integrate additional information in the learning process. For instance, SemiNE (79) is a semi-supervised extension of Deepwalk that takes into account node labels. GENE (80) also integrates node labels. Node labels, as well as additional node contents, are also integrated into TriDNR (81). SNS (82) is another method that aims to preserve structural similarity in the embedding representation. SNS measures the regular equivalence between nodes by representing them as graphlet degree vectors: each element of the graphlet degree vector counts the number of times the node is touched by the corresponding graphlet orbit.

The two previous classes of methods involve two distinct mathematical processes: matrix factorisation and random walks. (Matrix factorisation is a common mathematical operation, while 'random walk' encompasses several methods that share a common principle.) However, there are additional embedding techniques that do not fit into either category but share the common objective of optimising a loss function. In essence, these methods use a broad range of mathematical processes but ultimately involve an optimisation step, usually achieved through gradient descent. As a result, these approaches can be viewed as hybrid methods that all rely on a shared optimisation step. The most important step in optimisation methods is to define a loss function that encodes all the properties to be preserved through the embedding. This loss function often combines similarities between nodes in the direct space with regulariser terms that depend on the network features we want to preserve. The embedding representation is obtained by optimising this loss function. We present below the most common optimisation-based network embedding methods and some of their extensions.
• VERtex Similarity Embeddings (VERSE) (94) is a versatile network embedding method that accepts any network similarity measure. Let G be a network with an associated similarity measure s_G : V × V → R+. The VERSE method constructs the embedding representation of the network G, noted Z ∈ R^{d×n}, associated with a similarity measure in the embedding space Dec : R^d × R^d → R+. Each column of the matrix Z is the embedding vector z_i of the node v_i. The embedding representation is based on the optimisation of a loss function L, corresponding to the Kullback-Leibler divergence between the similarity matrix in the direct space (i.e., the original network) and the similarity matrix in the embedding space:

L = − Σ_{i=1}^n s_G(v_i, ·) · ln(Dec(v_i, ·)) + C,   [23]

with s_G(v_i, ·) (resp. Dec(v_i, ·)) the similarity vector between the node v_i and all the other nodes of the network in the direct space (resp. embedding space). Notably, Σ_{j=1}^n s_G(v_i, v_j) = Σ_{j=1}^n Dec(v_i, v_j) = 1. Moreover, C = Σ_{i=1}^n s_G(v_i, ·) · ln(s_G(v_i, ·)) is a constant that does not affect the optimisation and can therefore be neglected. The vector s_G(v_i, ·) corresponds to the row associated with the node v_i in the similarity matrix defined in the direct space. As stated above, this similarity matrix can be defined by several measures; the authors proposed three: the adjacency matrix, the SimRank similarity matrix (95), and a similarity matrix based on Random Walk with Restart. The vector Dec(v_i, ·) corresponds to the row associated with the node v_i in the similarity matrix defined in the embedding space. It can also be seen as the similarity between the vector z_i and the vectors z_j, with j ≠ i, j ∈ [1, n], where n is the number of nodes in the network. The vectors gathered in the similarity matrix in the embedding space are defined by a softmax over inner products:

Dec(v_i, v_j) = exp(z_iᵀ z_j) / Σ_{k=1}^n exp(z_iᵀ z_k).   [24]

The node embedding is obtained by optimising the loss function with a gradient descent algorithm. Usually, the embedding vectors are initialised from a zero-mean normal distribution. Because the Kullback-Leibler optimisation is time-consuming, a negative sampling procedure, such as Noise Contrastive Estimation (NCE) (48,96), is often used.
Recently, an extension of VERSE to heterogeneous multiplex networks, named MultiVERSE, has been developed (97).
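As an illustration of the VERSE objective, the following sketch evaluates the Kullback-Leibler loss of Eq. [23] with a softmax decoder. The function name and the dense-matrix formulation are for exposition only; the actual method optimises this quantity stochastically with sampling:

```python
import numpy as np

def verse_loss(S, Z):
    """KL-divergence loss between a row-stochastic similarity matrix S in the
    direct space and the softmax similarities induced by the embedding Z.

    S: (n, n) similarity matrix, rows summing to 1.
    Z: (d, n) embedding matrix; column i is z_i.
    """
    logits = Z.T @ Z                               # inner products z_i . z_j
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    dec = np.exp(logits)
    dec /= dec.sum(axis=1, keepdims=True)          # softmax rows: Dec(v_i, .)
    eps = 1e-12
    # KL(S_i || Dec_i) summed over nodes; S*log(S) is the constant term C
    return float(np.sum(S * (np.log(S + eps) - np.log(dec + eps))))
```

The loss vanishes when the decoder reproduces the direct-space similarities exactly, and grows as the two distributions diverge.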
• Large Scale Information Network Embedding (LINE) (98) aims to embed both first-order and second-order similarities.
-The embedding space of the first-order similarity can be obtained by the following optimisation algorithm.
Let us define the theoretical expected probability as:

p_1(v_i, v_j) = 1 / (1 + exp(−z_iᵀ z_j)), with z_i ∈ R^d,   [25]

and the empirical probability as:

p*_1(v_i, v_j) = w_ij / W, with W = Σ_{(i,j)∈E} w_ij.   [26]

The main goal of LINE is to minimise the error between the theoretical expected probability p_1 and the empirical probability p*_1. To do so, the loss function O_1 minimises the distance d(p*_1(·,·), p_1(·,·)) by using the Kullback-Leibler divergence. Hence, O_1 can be written as follows:

O_1 = − Σ_{(i,j)∈E} w_ij ln p_1(v_i, v_j).   [27]

- The embedding space of the second-order similarity is obtained with the following optimisation process. Let us define the theoretical expected probability as:

p_2(v_j | v_i) = exp(z_jᵀ z_i) / Σ_{k=1}^n exp(z_kᵀ z_i).   [28]

The empirical probability is defined as:

p*_2(v_j | v_i) = w_ij / d_i, with d_i = Σ_{k∈N(i)} w_ik,   [29]

where N(i) is the neighbourhood of the node v_i, and d_i defines the out-degree of the node v_i. The idea is again to minimise the error between the theoretical expected probability and the empirical probability.
To do so, the loss function O_2 minimises the distance d(p*_2(· | v_i), p_2(· | v_i)) by using the Kullback-Leibler divergence. Hence, O_2 can be written as follows:

O_2 = − Σ_{(i,j)∈E} w_ij ln p_2(v_j | v_i).   [30]

The first- and second-order node representations are computed separately, and the two embedding representations are concatenated for each node.
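As a concrete illustration of the first-order objective O_1, the sketch below evaluates −Σ w_ij log p_1(v_i, v_j) for a list of weighted edges. The function name and the dense edge-list representation are illustrative; LINE itself optimises this with edge sampling and stochastic gradient descent:

```python
import numpy as np

def line_first_order_loss(edges, Z):
    """First-order LINE objective O1 = -sum_{(i,j) in E} w_ij * log p1(v_i, v_j),
    with p1(v_i, v_j) = 1 / (1 + exp(-z_i . z_j)) (sigmoid of the inner product).

    edges: iterable of (i, j, w_ij); Z: (n, d) array of embedding vectors.
    """
    loss = 0.0
    for i, j, w in edges:
        p1 = 1.0 / (1.0 + np.exp(-Z[i] @ Z[j]))
        loss -= w * np.log(p1 + 1e-12)
    return loss
```

Aligned embedding vectors for linked nodes drive the sigmoid towards 1 and the loss towards 0; anti-aligned vectors are heavily penalised.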
• Transductive LINE (TLINE) (99) is a transductive version of LINE that also uses node labels to train a Support Vector Machine (SVM) classifier for node classification. Both the node embedding and the SVM classifier are optimised simultaneously, in order to make full use of the label information existing in the network. Notably, TLINE, like LINE, permits fast embedding of large-scale networks by using edge sampling and negative sampling in the stochastic gradient descent process. The method optimises a loss function O_T composed of the LINE loss functions O_1 and O_2, which embed the first- and second-order similarities, and an SVM loss function O_SVM, that is:

O_T = O_1 + O_2 + β O_SVM, with O_SVM = Σ_{i=1}^n Σ_{k=1}^K max(0, 1 − y_i^k w_kᵀ z_i),

where β is a trade-off parameter between LINE and SVM, n the number of nodes, K the number of label types in the network, z_i the embedding vector of the node v_i, w_k the parameter vector of the label class k, and y_i^k is equal to 1 if the node v_i is in the class k.
• Other optimisation methods: A wide range of optimisation methods are applied to heterogeneous networks. Many of them are similar to LINE and optimise first- and second-order similarities. In PTE (Predictive Text Embedding) (100), the loss function is divided into several loss functions, each associated with one network of the heterogeneous network. The APP (Asymmetric Proximity Preserving) (101) network embedding method is similar to VERSE, and captures both asymmetric and high-order similarities between node pairs thanks to an optimisation process over random walk with restart results. Recently, many network embedding methods have been adapted or designed for multilayer networks, and a significant portion of them are based on an optimisation process (102)(103)(104)(105)(106). In addition, several embedding methods applied to knowledge graphs are optimisation methods. These methods are often called relation learning methods. We can mention some of the most common ones, like the translation-based methods first defined by Bordes et al.
(107). This method, named TransE, embeds multi-relational data from directed graphs. Edges are defined by three elements: the head node (h), the tail node (t), and the edge label (l). The embedding vector of the tail node t should be close to the embedding vector of the head node h plus some vector that depends on the relationship l. This approach constructs the embedding representation by optimising a loss function that integrates these three elements. This method has given rise to several alternatives: TransH (108), which improves TransE for reflexive/one-to-many/many-to-one/many-to-many relationships; TransR (109), which builds entity and relation embeddings in separate entity and relation spaces (in contrast to the two previous methods); and TransD (110), an improvement of TransR for large-scale networks. Recently, the RotatE method has been developed (111). RotatE is a knowledge graph embedding method that can model and infer various relation patterns such as symmetry, inversion, and composition.
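The translation principle behind TransE (the tail embedding should lie close to head plus relation) reduces to a simple distance score. This is a sketch of the scoring function only, not the full method, which additionally trains the embeddings with a margin-based ranking loss over corrupted triples:

```python
import numpy as np

def transe_score(h, l, t):
    """TransE plausibility score for a triple (head, label, tail):
    the distance ||h + l - t||. Lower is better, since training pushes
    the tail embedding t towards h + l for true triples."""
    return float(np.linalg.norm(h + l - t))
```

A perfectly fitted triple scores 0, while implausible triples receive large distances.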

B. Deep learning methods.
In recent years, the use of deep learning for data analysis has increased steadily, and network analysis, including network embedding, is no exception. The success of deep learning methods can be explained by their ability to capture complex features and non-linearity among input variables. We define three major classes of deep learning embedding methods: methods based on conventional neural networks, methods based on graph neural networks, and graph generative methods. These three classes are based on different philosophies and mathematical formulations of deep learning.

B.1. Conventional neural networks.
The first network embedding methods based on deep learning used conventional deep learning techniques. A notable class of such architectures is the autoencoder, used for instance by SDNE (113) and DNGR (112).

B.2. Graph Neural Networks (GNN).
Recently, an important class of deep learning methods for network embedding has been developed: Graph Neural Networks (GNNs) (120). GNNs generalise the notion of Convolutional Neural Networks (typically applied to image datasets, with an image seen as a lattice network of pixels) to arbitrary networks. GNNs encode high-dimensional information about each node's neighbourhood into a dense vector embedding. GNN algorithms comprise two main components: an encoder, which maps a node v_i into a low-dimensional embedding vector z_i based on the local neighbourhood and the attributes of the node, and a decoder, which extracts user-specified predictions from the embedding vector. This kind of method is suitable for end-to-end learning and offers state-of-the-art performance (120,121). GNNs and their application to network embedding can be divided into different classes of methods.
• Graph Convolutional Networks (GCNs): Graph Convolutional Networks (GCNs) are the generalisation of Convolutional Neural Networks (CNNs) to graphs (122). The basic idea behind CNNs is to apply convolutional operations during the learning process to capture local properties of the input data, recognising identical features regardless of their spatial locations. Several similar successful approaches have been developed, such as Chebyshev Networks (120) and GraphSAGE (123).
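The propagation rule at the core of a GCN layer can be sketched with dense matrices. This is a toy illustration in the spirit of the symmetric-normalisation rule of Kipf and Welling; real implementations use sparse operations and learned weights:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer:
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W),
    where A is the adjacency matrix, H the node-feature matrix, and W the
    (here fixed, normally trained) weight matrix.
    """
    n = A.shape[0]
    A_hat = A + np.eye(n)                     # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalisation
    return np.maximum(A_norm @ H @ W, 0.0)    # ReLU non-linearity
```

Each layer mixes every node's features with those of its neighbours, so stacking k layers aggregates information from the k-hop neighbourhood.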
• Graph Attention Networks: A well-known shortcoming of the graph convolution procedure is that it considers every node neighbour as having the same importance. Graph Attention Networks (GATs) are neural network architectures that leverage masked self-attentional layers to address this shortcoming (124)(125)(126).
• Other Graph Neural Networks: Several other embedding methods based on alternative architectures of GNNs exist (127)(128)(129). The interested reader can refer to the review of Zhou et al. on GNNs for more details (130).

B.3. Graph generative methods.
Graph generative methods are deep learning methods best known through Generative Adversarial Networks (GANs) (131). GANs are based on two components: a generator and a discriminator. The idea is to train the generator until it is efficient enough to mislead the discriminator, which happens when the discriminator can no longer distinguish real data from data produced by the generator. Based on this idea, several embedding methods have appeared, including GraphGAN (132), Adversarial Network Embedding (ANE) (133), and ProGAN (134). In GraphGAN, for a given vertex, the generator tries to fit its underlying true connectivity distribution over all other vertices and produces 'fake' samples to fool the discriminator, while the discriminator tries to detect whether a sampled vertex comes from the ground truth or from the generative model. An alternative to the GAN is the Restricted Boltzmann Machine (135), which has inspired embedding methods such as pRBM (136).

C. Higher-order network methods.
We have seen in section 3.A that shallow embedding methods use pairwise similarity functions. This choice is imposed by the structure of graphs, which by definition connect nodes through pairwise interactions. However, generalisations of graphs that can encode higher-order interactions, such as hypergraphs and simplicial complexes, are increasingly being studied (137)(138)(139)(140). Hypergraphs encode arbitrary relations among any number of nodes: edges are generalised to hyperedges, which can contain any number of nodes, not just two. Simplicial complexes generalise graphs by allowing triangles, tetrahedrons, and higher-dimensional 'cliques' to be represented, and are closely related to topology, particularly Topological Data Analysis (141,142). Note that simplicial complexes are a type of hypergraph, so, at least in principle, hypergraph methods also apply to simplicial complexes. Recently, network embedding methods have been extended to consider these new types of higher-order networks. In particular, some of the existing network embedding methods described in sections 3.A-B have been extended to hypergraphs. Methods capable of embedding hypergraphs include shallow embedding methods: learning hypergraph-regularised attributes (143) (matrix factorisation), spectral hypergraph embedding (34) (random walk), and LBSN2Vec++ (144) and HyperEdge-Based Embedding (35,145) (optimisation methods). There are also methods derived from deep learning: deep hypergraph network embedding (146) (autoencoder), hypergraph neural networks (36) (GNN), and hypergraph attention embedding (147) (GAT). As explained above, simplicial complexes are a special type of hypergraph amenable to treatment by powerful tools based on algebraic topology (148), including Topological Data Analysis (TDA) (141,142). The literature on simplicial complex embedding and simplicial neural networks is growing rapidly (148). It includes simplicial and cell complex neural networks
(149)(150)(151)(152)(153)(154)(155) and Geometric Laplacian eigenmaps embedding (Glee) (156). Simple graphs can also be described as higher-order sets-of-sets formed by node neighbourhoods, for which a new graph embedding (HATS) (157) has recently been proposed.

D. Emerging methods.
Here we highlight some key emerging methods for network embedding which deserve particular attention. Specifically, we discuss key results in the rapidly growing literature on network embeddings in non-Euclidean spaces, including hyperbolic and Lorentzian spaces. Moreover, we cover the vibrant research activity on a new generation of neural networks using Magnetic and Connection Laplacians, which provide powerful tools to treat directed networks and to improve the explainability of the algorithms. Finally, we discuss the important research direction of comparing different embedding algorithms.

D.1. Network embedding in hyperbolic and Lorentzian spaces.
Embedding in hyperbolic spaces offers advantages from different perspectives (158,159). Among the benefits of hyperbolic spaces, the most relevant one is probably that they are the natural spaces in which to embed trees, whose number of nodes grows exponentially with the distance from the root, and more generally small-world networks. We distinguish three major approaches to embedding networks in hyperbolic spaces:
• Embedding based on the complex hyperbolic network models (160,161) in the H^2 plane. According to the spatial hyperbolic network (160) and PSO (161) models, the radial coordinate of a node embedding is determined by the degree of the node, and the angular coordinate by a similarity metric. The hyperbolic embedding can be used to formulate a greedy algorithm for network navigability (159,162), to predict missing edges (163), and also to relate the clusters found along the angular coordinates to network communities (164,165). For a general review of this approach see Ref.
(166). The original embedding algorithm HyperMap (161), used to determine the angular coordinates of the nodes and revisited in (167), maximises the likelihood that the data are drawn from the model. Mercator (168) improves this algorithm by initialising the positions of the nodes using Laplacian Eigenmaps; at each optimisation step, the angular position of a node is updated by choosing, among a set of several possible moves, the one that optimises the likelihood. The possible new moves are drawn from a Gaussian distribution centred on the mean angle of the neighbouring nodes. This algorithm has computational complexity O(n^2) on sparse networks. A fast and efficient alternative is provided by the noncentred minimum curvilinear embedding (ncMCE) (169), a machine learning pipeline comprising three main steps: (i) a pre-weighting procedure that identifies the network backbone; (ii) the extraction of the matrix D of node similarities (distances) measured on this backbone; (iii) a dimensionality reduction of the matrix D. The ncMCE approach has been further extended in (170) to reduce the loss calculated according to the PSO log-likelihood. Recently, an exact and rapid one-dimensional embedding algorithm (171) based on dynamic programming has been proposed; it can determine the angular coordinates of the hyperbolic embedding in H^2 and, more generally, extract other types of one-dimensional embeddings. Finally, the hyperbolic embedding of directed networks has been addressed in Ref. (172).
• Embedding of hierarchical data. Hyperbolic spaces allow the embedding of hierarchical data, and in particular trees, without distortion in spaces of low dimension (173), while the same data would require a high-dimensional Euclidean embedding if low distortion were desired. This fundamental property of hyperbolic spaces was exploited in (174) to first embed a network in a tree and then embed the tree into hyperbolic space, achieving a fast and reliable hyperbolic embedding. This approach has been extended to knowledge graphs in Ref. (175), while in (176) Hyperbolic Graph Convolutional Neural Networks were proposed to learn the hyperbolic embedding. Note that embedding a network into another network (which in general might not be a tree) is a generalised embedding problem, tackled also in Ref. (177).
• Filtering of networks generating the Triangulated Maximally Filtered Graph (TMFG) (178). The TMFG generalises the Minimal Spanning Tree and has the topology of a 'fat tree' formed by d connected (d + 1)-cliques (d-simplices). The structure of the TMFG reduces to that of the model Network Geometry with Flavor (179), which has a natural hyperbolic embedding in H^d (180). Therefore, TMFGs have a hyperbolic geometry, while the previously proposed maximally filtered planar graphs (181) have a natural R^2 embedding.
Finally, we also point out that network embeddings in Lorentzian (Minkowskian) spaces have been proposed (182) and applied to the study of directed acyclic networks such as citation networks. This embedding can be used for paper recommendation, identifying missing citations, and fitting citation models.

D.2. Network embeddings using Magnetic and Connection Laplacians.
Although many networks are directed, machine learning methods are typically developed for undirected networks. One challenge with directed networks is that they are naturally encoded by asymmetric adjacency matrices with a complex spectrum, while machine learning techniques usually require a loss function that is real and positive definite. To address this challenge, the use of magnetic Laplacians is attracting growing attention. Magnetic Laplacians are Hermitian matrices (hence with a real and non-negative spectrum) that encode the direction of edges through a complex phase. Using a magnetic Laplacian, it is possible to formulate Eigenmap embeddings (183)(184)(185) that can detect non-trivial periodic structures in the network, such as three or more communities whose connection pattern is cyclic. Additionally, in Ref. (186), the magnetic Laplacian is used to propose MagNet, a novel and efficient neural network for directed networks. This vibrant research activity on the magnetic Laplacian is indicative of the recent interest in network structures with complex weights (187). The magnetic Laplacian can be considered a special case of the Connection Laplacian, which can be used for Vector Diffusion Maps (188). Moreover, the Connection Laplacian is used to formulate Sheaf Neural Networks (189,190), a new generation of neural networks achieving excellent performance on several machine learning tasks. Finally, the non-backtracking matrix, which identifies non-backtracking cycles and efficiently detects network communities (191), has recently been proposed for embedding oriented edges (Non-backtracking embedding dimensions, NBED) (192).
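A common construction of the magnetic Laplacian can be sketched as follows. The charge parameter g and the exact normalisation vary across the cited works; this unnormalised form is one standard choice:

```python
import numpy as np

def magnetic_laplacian(A, g=0.25):
    """Magnetic Laplacian of a directed graph.

    A: (n, n) asymmetric adjacency matrix of a directed graph.
    g: charge parameter controlling how edge direction enters the phase.
    Returns a Hermitian matrix, so its spectrum is real (and non-negative
    for this construction).
    """
    A_sym = (A + A.T) / 2.0                # symmetrised edge weights
    theta = 2.0 * np.pi * g * (A - A.T)    # antisymmetric phase matrix
    H = A_sym * np.exp(1j * theta)         # Hermitian "magnetic adjacency"
    D = np.diag(A_sym.sum(axis=1))
    return D - H
```

For g = 0 the construction reduces to the ordinary Laplacian of the symmetrised graph; for g > 0 the complex phases retain the edge directions while keeping the spectrum real.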

D.3. Comparison of different algorithms.
An important question is how to choose among the different network embedding algorithms and how to select their hyperparameters. For instance, the embedding dimension is a crucial hyperparameter of network embedding algorithms. Ref. (193) provides a method to determine the value of the embedding dimension that constitutes the best trade-off between favouring low dimensions and generating a network representation close to the best representation the chosen algorithm can achieve. The comparison between different embedding algorithms (194), and the investigation of network embedding algorithms as opposed to alternative inference approaches such as community detection (195), are attracting increasing attention and can guide the choice of the most suitable algorithm. Finally, we note that a unifying approach to node embeddings and structural graph representations has recently been proposed (196).

Applications
A wide variety of applications exist for network embedding methods. Some methods have been designed with specific tasks in mind, but others admit versatile applications. In the following sections, we present the most common applications of network embedding methods, as well as some emergent applications offering promising results. We associate each application with some widely used embedding methods in order to give the interested reader a guideline for applying network embedding to different tasks (Fig. 3).

Operator      Definition
Average       (z_i(k) + z_j(k)) / 2
Hadamard      z_i(k) · z_j(k)
Weighted-L1   |z_i(k) − z_j(k)|
Weighted-L2   (z_i(k) − z_j(k))²

Table 1. Binary operators used to infer edges between pairs of embedding vectors, as defined in node2vec (70). The variable z_i denotes the embedding vector associated with the node v_i, and z_i(k) denotes the k-th element of the embedding vector z_i.
A. Classical applications of network embedding.
• Node classification: The aim is to predict labels for unlabelled nodes based on the information learned from the labelled nodes. Network embedding methods embed nodes into vectors, which can be used in an unsupervised setting. In this case, nodes associated with similar embedding vectors will have similar labels. In a supervised setting, a classifier is trained with the vectors associated with the labelled nodes. The classifier is then applied to predict the labels of unlabelled nodes.
• Link prediction: The aim is to infer new interactions between pairs of nodes in a network. The similarity between two nodes encodes their propensity to be linked. The similarity can be computed, for instance, with an inner product or a cosine similarity between each pair of node embedding vectors. In the embedding space, several operators exist to infer edges between pairs of embedding vectors (70). These operators can be, for instance, binary operators (Table 1) or heuristic scores (Table 2).
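The binary operators of Table 1 are straightforward to implement. The sketch below (hypothetical function name) builds an edge representation from two node embeddings, which can then be fed to any off-the-shelf classifier:

```python
import numpy as np

def edge_features(z_i, z_j, operator="hadamard"):
    """Combine two node embedding vectors into one edge representation
    using the binary operators defined in node2vec."""
    ops = {
        "average": (z_i + z_j) / 2.0,
        "hadamard": z_i * z_j,             # element-wise product
        "weighted_l1": np.abs(z_i - z_j),
        "weighted_l2": (z_i - z_j) ** 2,
    }
    return ops[operator]
```

All four operators are symmetric in their arguments, which is convenient for undirected link prediction.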
• Node clustering/Community detection: The aim is to determine a partition of the network such that the nodes belonging to a given cluster are more similar to each other than to the nodes belonging to other clusters. In practice, any classic clustering method can be directly applied in the latent space to cluster the nodes. K-means (197) is often used for this purpose. However, it is useful to notice that some embedding methods are designed specifically for this task (45).
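As an illustration of clustering in the latent space, here is a deliberately minimal k-means on embedding vectors, a toy stand-in for a library implementation such as scikit-learn's KMeans:

```python
import numpy as np

def kmeans_labels(Z, k, iters=50, seed=0):
    """Minimal k-means on embedding vectors Z (n, d): returns one cluster
    label per node. Centres are initialised from randomly chosen points
    and refined by alternating assignment and mean updates."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centre, shape (n, k)
        dists = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels
```

On well-separated embedding clouds this recovers the communities; in practice a library implementation with multiple restarts is preferable.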
• Network reconstruction: The aim here is to reconstruct the whole network from the learned embedding representation. Let n be the total number of nodes in the network. The reconstruction requires n(n − 1)/2 evaluations to test each potential edge, and each evaluation is equivalent to a link prediction.
• Visualisation: Network embedding methods, like other dimension reduction methods, can be used to visualise high-dimensional data in a lower-dimensional space. We expect similar nodes to be close to each other in the visualisation. However, network embedding methods that were not designed for this specific task show poor results when projected directly into a two-dimensional embedding space (28,98). To bypass this shortcoming, the network is frequently embedded first into a d-dimensional space (2 < d ≪ n), and then into a two-dimensional space. The two-dimensional space is obtained with a dimension reduction method suitable for visualisation, such as Principal Component Analysis (PCA) (13,37), t-distributed Stochastic Neighbor Embedding (t-SNE) (40), or Uniform Manifold Approximation and Projection (UMAP) (41). Recently, an extension of PCA for data lying in hyperbolic spaces has been developed (198). Finally, visualisation is a powerful tool to investigate the results obtained by embedding methods. The interpretability of results can be enhanced through visualisation, and a recent method has been proposed to address questions related to bias and fairness in results (199).
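As a minimal example of the second stage of this two-step visualisation pipeline, the sketch below projects d-dimensional embedding vectors to two dimensions with PCA via an SVD; t-SNE or UMAP from their respective libraries are common drop-in replacements:

```python
import numpy as np

def pca_2d(Z):
    """Project d-dimensional embedding vectors Z (n, d) onto their top two
    principal components for visualisation."""
    Zc = Z - Z.mean(axis=0)                            # centre the cloud
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)  # principal axes in Vt
    return Zc @ Vt[:2].T                               # 2-D coordinates
```

The first output column captures at least as much variance as the second, as PCA guarantees.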

B. Emerging applications.
• Network compression and coarsening: The aim of network compression is to convert a large network into a smaller network with a reduced number of nodes and edges. This compression is expected to store the network more efficiently and to allow network algorithms to run faster. Network coarsening is often used as a preliminary step in the network embedding process: it produces a compression by collapsing pairs of nodes and edges according to appropriate criteria. One example is network compression using symmetries. Real-world networks have a large number of symmetries (200,201), and this can be exploited in practice for compression or coarsening, as well as for speeding up calculations (202,203).
• Network classification: The aim is to associate a label to a whole network. Network classification can easily be applied in the context of whole-network embedding. A wide range of applications has been proposed, such as classifying molecular networks according to their properties (117,(204)(205)(206), predicting therapeutic effects of candidate drugs based on molecular networks (206), or classifying images that have been converted into network representations (207).
• Applications to knowledge graphs: Let us consider a knowledge graph defined by a set of triples (u, r, v) ∈ V × R × V. There are three main applications of embedding for knowledge graphs. Link prediction is used to infer the interaction between u and v. Triplet classification, a standard binary classification task, determines whether a given triplet (u, r, v) is correct. Finally, knowledge completion aims to determine the missing element of a triplet when only two of the three pieces of information are known (27,208).
• Illustration of network embedding with biological applications: Network embedding is an active research topic in bioinformatics (32), and all the classical applications previously mentioned also flourish in the context of biological networks. Some applications emerge as particularly relevant to network biology. We describe some of them here, and the interested reader can refer to (8,32) for detailed reviews. A first important application in biology is network alignment, which aims to find correspondences between nodes in different networks. This can be useful to reveal similar subnetworks, and thereby biological processes, in different species, by aligning their protein-protein interaction networks (209,210). Another important application pertains to network denoising, which consists of projecting a graph into an embedding space to reduce noise while preserving the most relevant properties of the original network. For instance, high-order structures of the original networks can be preserved by diffusion processes. Network embedding methods can also be used to predict the functions of proteins (102), or to detect modules in chromosome conformation networks (8). Laplacian Eigenmaps have also been used to map the niche space of bacterial ecosystems from genomic information (211). In genomics, UMAP embedding has been shown to be useful to identify overlooked sub-populations and characterise fine-scale relationships between geography, genotypes, and phenotypes in a given population (212). Moreover, the prediction of edges (also known as associations in the biological context) is a common task in biological network embedding. For instance, a recent method named GCN-MF (213) used Graph Convolutional Networks and matrix factorisation to predict gene-disease associations. Another method, named NEDTP (106), used a multiplex network embedding based on an optimisation method to predict drug-target associations. Finally, knowledge graphs are also used in biomedical contexts. For
example, Electronic Health Record (EHR) can be represented as a knowledge graph and embedded jointly with other networks integrating proteins, diseases, or drug information, to predict patient outcomes (214,215).
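Once node embeddings are available, association prediction typically reduces to ranking node pairs by a similarity score in the latent space. The sketch below illustrates this with cosine similarity on a toy gene-disease example; the embedding vectors and node names are hypothetical placeholders, not outputs of GCN-MF or NEDTP.

```python
import math

# Hypothetical pre-computed embeddings for a toy gene-disease network
# (vectors are illustrative, not produced by any real embedding method).
emb = {
    "geneA":    (0.9, 0.1),
    "geneB":    (0.1, 0.9),
    "disease1": (0.8, 0.2),
    "disease2": (0.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict_associations(genes, diseases, k=2):
    """Rank candidate gene-disease pairs by embedding similarity."""
    pairs = [(g, d, cosine(emb[g], emb[d])) for g in genes for d in diseases]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:k]
```

In practice the score would be fed to a classifier, or thresholded against a validation set, rather than used directly.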

Conclusion
Network embedding methods are powerful tools for transforming high-dimensional networks into low-dimensional vector representations that can be used for visualisation and a wide range of downstream analyses, such as network inference, link prediction, node classification, and community detection. The resulting low-dimensional representation also enables the use of deep-learning analyses, which are not possible directly on the original network, as it is inherently a combinatorial object. Moreover, the latent space generated through the embedding process can highlight relevant features of the original network and filter out the noise inherent in the dataset used to construct it. This helps to reveal the underlying structure and organisation of the network, making it easier to analyse and interpret its properties. Despite the numerous advantages of network embedding methods, significant questions remain surrounding their use, particularly with regard to selecting appropriate methods, assessing their properties, interpreting the results, and ensuring adaptability across a range of contexts. These questions are especially critical as datasets become more complex and require more sophisticated tools to achieve the desired level of adaptability and interpretability, particularly in biological contexts, where datasets may be especially noisy, leading to errors in the learning process. In recent years, there has been a substantial increase in the number of network embedding methods, making it challenging to stay up to date in this rapidly evolving field. Access to a comprehensive review of state-of-the-art methods is therefore crucial. Our network embedding review provides not only a summary of the current state of the art, but also lays the foundation for future developments by presenting a flexible yet rigorous taxonomy of embedding methods based on a mathematical perspective. To address the remaining challenges of method selection and interpretability, we offer a set of guidelines for practical applications that take these considerations into account and are aligned with the latest developments in the field.

Fig. 3. Workflow for choosing a network embedding method. The application of network embedding starts with some preliminary questions (top box): Which properties should be preserved? (Sections 1, 2) Is metadata information available (node and edge labels, additional content)? (Sections 2, 3) Which class of methods? (Section 3) Which dimension for the embedding space? (193) Depending on the answers, and the task to be performed, different network embedding methods can be applied; here, we list the most common methods associated with the most common network embedding tasks: PCA (37), t-SNE (40), UMAP (41), HoroPCA (198), HOPE (49), VERSE (94), SDNE (113), GraRep (46), Walklets (75), DNGR (112), pRBM (136), Node2vec (70), GraphGAN (132), DeepWalk (64), LINE (98). Once the embedding representation has been obtained, the workflow can further perform some 'sanity checks' to measure the efficiency of the network embedding setup (bottom box): internal checks by LOOCV or network reconstruction (Section 4), comparison to methods on the direct space (194), comparison to other embedding methods (194), and investigation of bias and fairness (199). Depending on the results of these sanity checks, we can either stop the development or go back to the preliminary questions in order to improve the workflow. These improvements are usually based on adding complementary information, tuning the parameters, or redefining the properties to be preserved by the embedding representation.
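One of the sanity checks in this workflow, network reconstruction, can be sketched in a few lines: predict the |E| closest node pairs in the embedding space and measure what fraction of the original edges they recover. The graph, the 1-dimensional coordinates, and the precision-style metric below are all illustrative assumptions.

```python
import itertools

# Toy graph and a hypothetical 1-d embedding of its nodes.
edges = {("a", "b"), ("b", "c"), ("c", "d")}
coord = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.0}

def reconstruction_precision(edges, coord):
    """Sanity check: do the |E| closest pairs in the embedding space
    recover the original edges? Returns the recovered fraction."""
    pairs = sorted(itertools.combinations(sorted(coord), 2),
                   key=lambda p: abs(coord[p[0]] - coord[p[1]]))
    predicted = {tuple(sorted(p)) for p in pairs[: len(edges)]}
    truth = {tuple(sorted(e)) for e in edges}
    return len(predicted & truth) / len(truth)
```

A score well below 1 suggests revisiting the preliminary questions, e.g. the properties to preserve or the embedding dimension.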

Fig. 2. Shallow network embedding: to perform shallow network embedding, a network is projected into a low-dimensional vector space, such as a 2-dimensional embedding space.

• Multilayer networks. A multilayer network is a type of heterogeneous network in which the nodes are grouped into layers, and the edges can connect nodes in the same, or different, layers. Formally, a multilayer network is a triplet M = (Y, G, 𝒢), where Y is the layer index set, G = {Gα | α ∈ Y} is a set of (homogeneous) networks Gα = (Vα, Eα), and 𝒢 is the set of inter-layer edges connecting nodes in different layers.
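The triplet M = (Y, G, 𝒢) maps directly onto a simple data structure: a layer index set, per-layer intra-layer edge sets, and a set of inter-layer edges. The sketch below uses hypothetical layer names ("ppi", "coexpr") and node names purely for illustration.

```python
from collections import namedtuple

# Minimal container mirroring the triplet M = (Y, G, 𝒢).
Multilayer = namedtuple("Multilayer", ["layers", "intra", "inter"])

M = Multilayer(
    layers={"ppi", "coexpr"},                    # Y: the layer index set
    intra={"ppi": {("p1", "p2")},                # G: edges within each layer
           "coexpr": {("p2", "p3")}},
    # 𝒢: inter-layer edges, here as ((layer, node), (layer, node)) pairs
    inter={(("ppi", "p1"), ("coexpr", "p1"))},
)
```

A multiplex network is the special case where inter-layer edges only pair copies of the same node across layers, as in the single coupling above.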

Table 2. Heuristic scores used to predict edges between pairs of nodes in the direct space. N(vi) denotes the set of neighbours of node vi.
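Three of the most common such heuristics, all built from the neighbour sets N(vi), can be written down directly. The toy adjacency below is an illustrative assumption; the score definitions themselves are the standard common-neighbours, Jaccard, and Adamic-Adar indices.

```python
import math

# Toy undirected graph as neighbour sets: N[v] is the neighbour set of v.
N = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def common_neighbors(u, v):
    """Number of neighbours shared by u and v."""
    return len(N[u] & N[v])

def jaccard(u, v):
    """Shared neighbours normalised by the size of the combined neighbourhood."""
    return len(N[u] & N[v]) / len(N[u] | N[v])

def adamic_adar(u, v):
    """Shared neighbours weighted so that low-degree neighbours count more.
    Assumes each shared neighbour has degree >= 2 (log would be 0 otherwise)."""
    return sum(1 / math.log(len(N[w])) for w in N[u] & N[v])
```

Non-adjacent pairs with the highest scores are then predicted as the most likely missing edges.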