Mercator: uncovering faithful hyperbolic embeddings of complex networks

We introduce Mercator, a reliable embedding method to map real complex networks into their hyperbolic latent geometry. The method assumes that the structure of networks is well described by the Popularity$\times$Similarity $\mathbb{S}^1/\mathbb{H}^2$ static geometric network model, which can accommodate arbitrary degree distributions and reproduces many pivotal properties of real networks, including self-similarity patterns. The algorithm mixes machine learning and maximum likelihood approaches to infer the coordinates of the nodes in the underlying hyperbolic disk with the best matching between the observed network topology and the geometric model. In its fast mode, Mercator uses a model-adjusted machine learning technique performing dimensional reduction to produce a fast and accurate map, whose quality already outperform other embedding algorithms in the literature. In the refined Mercator mode, the fast-mode embedding result is taken as an initial condition in a Maximum Likelihood estimation, which significantly improves the quality of the final embedding. Apart from its accuracy as an embedding tool, Mercator has the clear advantage of systematically inferring not only node orderings, or angular positions, but also the hidden degrees and global model parameters, and has the ability to embed networks with arbitrary degree distributions. Overall, our results suggest that mixing machine learning and maximum likelihood techniques in a model-dependent framework can boost the meaningful mapping of complex networks.


I. INTRODUCTION
The main hypothesis of network geometry states that the architecture of real complex networks has a geometric origin [1][2][3]. The nodes of the complex network can be characterized by their positions in an underlying metric space so that the observable network topologyabstracting their patterns of interactions-is then a reflection of distances in this space. This simple idea led to the development of a very general framework able to explain the most ubiquitous topological properties of real networks [1,2], namely, degree heterogeneity, the small-world property, and high levels of clustering. Network geometry is also able to explain in a very natural way other non-trivial properties, like self-similarity [1] and community structure [4][5][6], their navigability properties [7][8][9], and is the basis for the definition of a renormalization group in complex networks [10]. The geometric approach has also been successfully extended to weighted networks [11] and multiplexes [12,13].
Beyond being a formal theoretical framework to explain the topology of real networks, network geometry can be used to develop practical applications for real systems, including routing of information in the Internet [9], community detection [9,14,15], prediction of missing links [3,16,17], a precise definition of hierarchy in networks [15], and downscaled network replicas [10], to name a few. However, applications require faithful embeddings of real-world networks into the hidden metric space using only the information contained in their topology. Several algorithms have been proposed to solve this problem, most of which either use maximum likelihood estimation techniques [9,[18][19][20][21], machine learning [22][23][24], or a combination of both [25].
Maximum likelihood (ML) techniques assume that the network under study has been produced by a given model -a geometric one-and finds the value of its parameters that maximize the probability for the model to generate the observed topology. This technique requires finding the coordinates of every nodes in the latent geometry that maximize the likelihood function: a task that, in general, is NP-hard and consequently must rely on heuristics to obtain a reasonable approximate solution. Maximum likelihood methods are therefore generally slow, and their accuracy depends strongly on the chosen heuristic as well as on the quality of the underlying theoretical model.
In contrast, machine learning techniques are fast and model independent, so they can be used to find embeddings of large networks. A promising and accurate method is based on Laplacian Eigenmaps (LE) [22,23], originally designed to find dimensional reductions of sets of n points [26]. The original method requires the definition of Euclidean distances between nodes in R n , but since no information is available about the "real Euclidean" distances between connected pairs of nodes in networks, the use of heuristic arguments is necessary to estimate these distances [23]. A more fundamental problem with machine learning methods is that the embeddings are performed on Euclidean spaces. However, as shown in [2,3], the geometry of real complex networks is better described by hyperbolic rather than Euclidean geometry, where angular coordinates on a circle are a proxy for the similarity between nodes, and their radial coordinates account for their popularity, which is typically measured by their degrees [1]. Machine learning methods are only able to infer the angular coordinates corresponding to the similarity sub-space while radial coordinates have to be inferred using some geometric model. Hence, these methods end up being model dependent as well.
Consequently, both types of methods are very sensitive to the model used to describe the network. References [18,19,[22][23][24] are based on the Popularity×Similarity Optimization (PSO) model described in [3], which uses a simple mechanism to explain the emergence of an effective hyperbolic geometry in growing networks. However, this model can only generate pure power-law degree distributions P (k) ∼ k −γ with γ > 2, whereas the degree distribution in many real networks shows important deviations from such pure power laws. Moreover, the model does not spontaneously generate the nested hierarchy of self-similar subgraphs with increasing average degree, as observed in real systems [1].
In this paper, we introduce Mercator, a ready-to-use C++ code [27] that mixes the best of the maximum likelihood and machine learning approaches. The mixing of the two techniques was explored in Ref. [25] using the PSO model to maximize the likelihood function. Instead, we use the static version of the same type of Popularity×Similarity geometric models, the S 1 /H 2 model [1,2], that can accommodate arbitrary degree distributions and can reproduce the self-similarity patterns observed in real networks. The first step in Mercator is to apply a LE approach, as in Ref. [23], but using the S 1 /H 2 model instead of the PSO to infer the weights of the Laplacian matrix. Doing so yields a first (and fast) embedding method that already outperforms the one of Ref. [23]. The resulting embedding uses only information about pairs of connected neighbors, and can be further improved by using it as a starting point in a ML optimization-based again on the S 1 /H 2 model-that uses information from both connected and not-connected pairs of nodes. The final result is the most accurate embedding method currently available in the literature. Yet, the final complexity of the method is O(N 2 ) for sparse networks with N nodes, which makes it competitive for real applications.

II. METHODOLOGICAL BACKGROUND
The S 1 model is the simplest among the class of geometric models [1]. The similarity space is a one dimensional sphere-a circle of radius R-where N nodes are distributed with a fixed density, set to one without loss of generality, so that N = 2πR [28]. Each node is also given a hidden variable κ ∈ [κ 0 , ∞) proportional to its expected degree. In general, κ and the angular position θ can be correlated and distributed according to an arbitrary distribution ρ(κ, θ). In such case, the model is able to generate community structure [4][5][6] and can reproduce different degree-degree correlation patterns and clustering spectra.
Once all nodes are assigned a tuple (κ, θ), each pair of nodes is connected with probability where d ij = R∆θ ij is the arc length of the circle between nodes i and j separated by an angular distance ∆θ ij . Parameters µ and β control the average degree and the clustering coefficient, respectively. The model can be defined using any connection probability as long as it is an integrable function p(χ) with χ = d µκκ [1]. The particular functional form that we chose here (the Fermi distribution) is the one that defines maximally random ensembles of geometric graphs that are simultaneously clustered, small-world, and with heterogeneous degree distributions.
If nodes are uniformly distributed over the circle, we have ρ(κ, θ) = ρ(κ)/2π. In this case, the choice µ = β 2π k sin π β guarantees that, in the thermodynamic limit, the expected degree of a node with hidden variable κ isk(κ) = κ and the network average degree is k = κ . It is therefore possible to associate unambiguously the hidden variable κ with the node degree. For finite systems, however, the values of the hidden variables κ must be evaluated numerically. It is also important to notice that the parameter µ is, in fact, superfluous since it can be absorbed in the definition of κ; κ would then be proportional, but not exactly equal, to the expected degree. As a result, the embedding task only requires the estimation of 2N + 1 parameters: the hidden variables (κ i , θ i ), i = 1, · · · , N , and the parameter β. Quite remarkably, the S 1 model can be expressed as a purely geometric model in the hyperbolic plane. By mapping the expected degree of each node κ i to a radial coordinate as withR ≡ 2 ln N µπκ 2 0 , the connection probability becomes where is a good approximation of the hyperbolic distance between two nodes separated by an angular distance ∆θ ij and with radial coordinates r i and r j [29]. The connection probability thus becomes a function of the hyperbolic distance alone, which turns the model into a purely geometric one and has important consequences for the global connectivity of the network. For instance, topological shortest paths closely follow geodesic curves in the hyperbolic plane, and can therefore be used to efficiently navigate the network [7,9]. Furthermore, when the distribution of expected degrees follows a power law of exponent γ, the radial distribution in the hyperbolic plane is with γ = 2α + 1 and α > 0. Nodes are therefore homogeneously distributed in the hyperbolic plane for γ = 3 and are quasi-homogeneously distributed for other values of γ. In this paper, we use the S 1 model for likelihood maximization, and its equivalent H 2 version for visualization purposes.

B. Embedding techniques
Mercator exploits two different embedding techniques, based on ML and on LE, which are briefly outlined in this section.

Model-corrected Laplacian Eigenmaps
Laplacian Eigenmaps was originally designed as a method for dimensional reduction. Given a set of points {x i ∈ R n , i = 1, · · · , N } with the Euclidean metric, LE finds a mapping of these points {x i → y i ∈ R m } with m < n such that the loss function is minimized. Here, |y i − y j | is the euclidian distance between points i and j in R m and ω(·) is a decreasing function of the distance between the same pair of points in the original Euclidean space R n . Intuitively, placing pairs of points far apart in R m if they were originally close in R n increases the loss function Eq. (6). Minimizing should therefore yield a faithful dimensional reduction of the data. In the case of network embedding, our aim is to find coordinates of nodes in R 2 of a network whose structure can be modeled by the S 1 model. To do so, the weight function ω(·) is taken to be proportional to the adjacency matrix so that it is only different from zero if nodes i and j are connected. Yet, the weight associated to connected pairs of nodes is still assumed to be a decreasing function of their original Euclidean distance, which must somehow be estimated from the network structure. To do so, we leverage the S 1 model and estimate the expected distance in R 2 (the chord length) of a pair of nodes based on their degrees. The set of coordinates that minimize the loss function is the solution of a generalized eigenvalue problem with the Laplacian matrix, for which very fast algorithms exist if the network is sparse [30].

Maximum likelihood estimation
Given a real network with adjacency matrix {a ij }, maximum likelihood estimation finds the values of {κ i , θ i }, i = 1, · · · , N , that provide a good match between the S 1 model and the observed network. The posterior probability, or likelihood, that a network specified by its adjacency matrix {a ij } is generated by the S 1 model is where the function L({a ij }, {κ i , θ i }|S 1 ) is the joint probability that the S 1 model generates simultaneously the set of hidden variables {κ i , θ i } and the adjacency matrix {a ij }. Using Bayes rule, we then compute the likelihood that the hidden variables {κ i , θ i } take particular values conditioned on the observed adjacency matrix {a ij } where is the prior probability density function of the hidden variables, is the probability that the S 1 model generates the adjacency matrix {a ij } conditioned on the hidden variables {κ i , θ i }, and p ij is the connection probability given by Eq. (1).
If we have information about the prior distribution of hidden variables, Prob({κ i , θ i }), Bayesian estimators can be obtained by maximizing the likelihood in Eq. (8). However, in most cases, prior information is not available. We then use an improper prior distribution Prob({κ i , θ i }) = cte, and obtain the maximum likelihood estimator as the set of values {κ * i , θ * i } that maximize Eq. (10) or, equivalently, its logarithm The maximization with respect to the expected degrees κ can be performed semi-analytically. The derivative of Eq. (11) with respect to the expected degree κ l of node l is where the second term on the right hand side is the expected degree of node l, and the first term is its actual degree k l . The value κ * l that maximizes the likelihood is therefore the solution of The term on the right hand side can be evaluated in the model assuming a homogeneous angular distribution of nodes on the circle. We use this method to provide estimates of the expected degrees that are then used to maximize the likelihood function with respect to the angular coordinates, as explained in Sec. A 6.

III. MERCATOR AT A GLANCE
We have now all the theoretical background to briefly describe Mercator; the full details are given at Secs. A 1-A 7. Given a network with adjacency matrix {a ij }, we first measure its average degree k , average clustering coefficientc, and all individual nodes' degrees {k i , i = 1, · · · , N }. Second, we estimate hidden degrees and parameters β and µ. Third, we estimate the angular ordering of nodes using the model-corrected LE, and adjust the angles according to the expected angular separation between consecutive nodes given by the S 1 model. This yields Mercator's fast mode version, which produces a first embedding. Fourth, the angular coordinates are refined using ML. Finally, hidden degrees are readjusted given the newly inferred angular positions. All the steps together conform Mercator refined mode. More precisely, Mercator executes the following steps.
2. Using Eqs. (1) and (13), adjust the hidden variables {κ i } such that the expected degree of each node in the S 1 model matches the observed degree in the original network. This step assumes that nodes are homogeneously distributed and uses the values of β and µ from step 1. The initial guess is κ i = k i (the degree of node i in the original network).
3. Using results from steps 1 and 2, evaluate the theoretical value of the average clustering coefficient of the network in the S 1 model. If this value differs from the one measured for the original network, adjust the value of β and return to step 1. Otherwise, proceed to step 4.

4.
For every connected pair of nodes with hidden variables κ i and κ j and original degrees k i , k j > 1, estimate their expected chord length in R 2 as d ij = 2 sin ∆θij 2 , where ∆θ ij is the expected angular separation between connected nodes i and j in the S 1 model.

Construct a weighted Laplacian matrix
ij /t with t being the variance of d ij . Then solve the generalized eigenvalue problem Lv = λDv .
We note v 1 = (v 1,1 , v 1,2 , · · · , v 1,N ) and v 2 = (v 2,1 , v 2,2 , · · · , v 2,N ) the first two eigenvectors with the smallest nonzero eigenvalues. 6. Assign an angular position to each node i as 7. Make a sorted list of the nodes based on their angular position {θ i }. Nodes of degree 1 that were excluded at step 4 are now reinserted in the sorted list randomly before or after their unique neighbor. Note that the angular coordinates computed at step 6 are only used to determine the order in which nodes are located angularly. Their precise angular coordinates are evaluated at the next step.
8. For each pair of consecutive nodes in the list, evaluate its expected angular separation using the S 1 model conditioned on whether the two nodes are connected or not, their hidden variables κ i and κ j , and the fact that they are consecutive. Finally, all gaps are normalized to sum up to 2π. This produces a set of angular coordinates for each node {θ i , i = 1, · · · , N }.  These 8 steps summarize a fast and accurate embedding procedure that, as discussed in Sec. IV, already outperforms current state-of-the-art methods. The next section explains how its accuracy can be further increased.

B. Refined mode
The embedding can be significantly improved with ML techniques using the embedding obtained with the fast mode of Mercator as initial conditions. This is due to the fact that ML uses the information contained both in the presence and absence of links in the network, whereas LE only relies on the information conveyed by the presence of links. The major drawback of ML is the complex configuration space that needs to be explored to find the optimal embedding. However, if the starting point of the exploration is good enough, the maximization of the likelihood function is easy and efficient. Thus, starting from the embedding obtained with the fast mode, we proceed as follows. 9. Extract the onion decomposition of the network [31] and sort nodes accordingly, starting with the node in the deepest layer [32]. Doing so allows the likelihood optimization phase to begin with the most central nodes (based on mesoscale topological information) thereby greatly facilitating the finding of an acceptable local maximum in the configuration space.
10. For each node in the sorted list, find the average angular coordinate of its neighbors.
11. Sample O(ln N ) angular positions around this average value using a normal distribution whose standard deviation is set to half of the angular distance between this average value and the farthest neighbor.
12. Compute the local log-likelihood of the sampled angular positions where p ij is computed with Eq. (1), and set the new angular position of the node to the sampled angle with highest log-likelihood. 13. Once the position of every node has been optimized once, repeat step 2 to find a better estimate of the hidden variables κ i using the newly inferred angular positions. This last step is optional, although it generally leads to substantial improvements of the final embedding.

IV. VALIDATION IN SYNTHETIC NETWORKS
We test Mercator using synthetic networks of different average degrees and clustering coefficients generated with the S 1 model, and consider several quality measures to check the accuracy of the embeddings. We first focus on its capability to recover the angular coordinates. Figure 1 shows the C-score [23], defined as the fraction of pairs of nodes that are correctly ordered in the circle, as well as the Pearson correlation coefficient between the real and inferred angular coordinates [33]. The results of the two versions of Mercator are compared with those obtained using the Coalescent embedding presented in Ref. [23], which was reported to give the best node orderings with respect to other embedding algorithms in the literature. Notice that Mercator is able to outperform the results even in its fast mode, especially for networks with a low average degree. As an example, Fig. 2a depicts the inferred angular coordinates versus the real ones for one of the networks considered. Relative likelihood difference EL = L infer /L real − 1 averaged over all the networks of a given k and β embedded with Mercator (refined mode with final adjustment of the {κi} at step #13). In many cases, Mercator is able to find embeddings with likelihoods above those that generated the networks. In both plots, the errorbars represent the corresponding standard deviations. d. Empirical connection probability (circles) compared to the theoretical curve for the network in a.
Mercator also has the clear advantage of systematically inferring the hidden degrees and global model parameters. In Fig. 2b we show that the inference of β is very precise for all the synthetic networks considered in this section. This has important implications for applications that require finding a good congruency between the network and the model. Indeed, Mercator is able to find embeddings with very high likelihoods. To quantify this, we consider the relative likelihood difference E L =L infer /L real − 1, whereL ≡ L 2/N (N −1) is the geometric average of the likelihood over all pairs of nodes. Hence, a positive (negative) E L indicates that the inferred embedding has a higher (lower) likelihood than the real coordinates and model parameters. Strikingly, for low values of β, the embeddings found by Mercator have E L > 0 (Fig. 2c). Finally, Fig. 2d presents an example of the empirical connection probability (fraction of connected pairs as a function of the rescaled distance χ = d/(µκκ )), which is extremely congruent with Eq. (1). Put together, these results reveal that Mercator is not just the most accurate algorithm in terms of the reconstruction of the angular coordinates of synthetic networks-arguably the most difficult aspect of the embedding problem-, but it also determines correctly all other model parameters, including hidden degrees.

V. EMBEDDING OF REAL NETWORKS
Another strength of Mercator is its ability to embed networks with arbitrary degree distributions. As an illustration, we embedded several real-world complex networks from different domains whose degree distributions include clean scale-free, heavy-tailed, and arbitrary distributions. More specifically, the networks under study are: the world airport network [34], the neural network of the visual cortex of the Drosophila Melanogaster at the neuron level [35], the neural network of the C. Elegans worm [36], a human connectome [37,38], the metabolic network of the bacterium E. Coli [14,39], the world trade web [15], a US commute network [40], a cargo ships network [41], a US commodities network [40], and the Internet at the Autonomous Systems level [9].
One particularly telling example is the airports network whose truncated power-law degree distribution with exponent γ < 2 cannot be easily embedded with methods based on the PSO model. In the case of real networks, we do not have access to the "real" coordinates to compare with those obtained from our embeddings. Yet, in some cases, metadata related to the similarity between nodes is available and can be used to test whether an embedding is meaningful or, instead, is an artifact of the algorithm. In the case of the airports network, geography is such metadata. Figure 3a shows the hyperbolic embedding obtained by Mercator in the slow mode with nodes colored according to the continent they belong to (separating North and South America). Airports belonging to the same continent appear clustered in similar angular positions, thus supporting the relation between the angular space of the embedding and similarity among nodes. Similar analyses were carried out for the Internet at the autonomous systems level [9], metabolic networks [14], and the world trade web [15]. A strong correlation between the angular distribution of points and available metadata was found in all cases. In light of these results, we conclude that our geometric embeddings are meaningful and capture attributes that contribute to the similarity among the elements of complex networks.
Beyond this qualitative agreement, we tested the extent to which the embedding inferred by Mercator is accurate enough to reproduce the topology of the airports network. To do so, we first compare the expected connection probability Eq. (1) with the inferred value of β against the empirical connection probability, computed using the inferred coordinates of the nodes ({κ i , θ i }). This remarkable agreement confirms that the rescaled distance χ provides a meaningful measure to characterize the interaction between nodes in the network. Second, we used the set of coordinates {κ i , θ i } as well as the parameters β and µ to generate an ensemble of synthetic networks using Eq. (1). We then compared several topological properties of this ensemble with those measured on the original network. Specifically, the first row of Fig. 4 shows the results for the complementary cumulative degree distribution P c (k), the average nearest neighbors' degreek nn (k), and the clustering spectrumc(k). The final adjustment of hidden degrees in step #13 of the Mercator algorithm strongly enhances the reproduction of the degree distribution. Notice that the S 1 model does not include any mechanism to control degree-degree correlations or the shape of the clustering spectrum (recall that β is chosen based on the average clustering coefficient only). Yet, the generated network ensemble reproduces these two quantities with remarkable precision. This is particularly interesting in the case of the average nearest neighbors' degree, which shows a non-trivial assortative behavior for low degrees both in the real network and the ensemble. This suggests that the non-uniform angular distribution of nodes inferred by Mercator (and so the network geometric properties) is partly responsible for the observed degree-degree correlations in real complex networks.
The ensemble of synthetic networks generated from the estimated geometric parameters of the airports network performs also very well at reproducing topological properties of individual nodes. The second row of Fig. 4 shows scattered plots of the degree of a given node, the sum of degrees of its neighbors, and number of triangles attached to it in the generated network ensemble versus the same quantities measured on the original network. For each node, we also compute the 2σ confidence interval, and ζ shows the fraction of nodes whose original property (degree, sum of neighbors' degree, number of triangles) lies outside this interval. Results show that the fraction of points outside this interval is around 6%, which is consistent with the 2σ (or ≈ 95%) confidence interval. These results, supported by those presented in the Supplementary Information, clearly illustrate the accuracy of the embeddings provided by Mercator. To the best of our knowledge, such accuracy cannot be obtained with other existing embedding methods.

VI. COMPUTATIONAL COMPLEXITY
We now support our claim that the computational complexity of Mercator scales as O(N 2 ) for sparse networks composed of N nodes (and L = k N/2 links). Figure 5a and 5b show the running time in seconds in function of the number of links (L) for both synthetic and real networks. In both cases, we find that Mercator's refined mode does indeed roughly scale as L 2 although it is clear that other topological properties influence the final total running time. With respect to the fast mode, we find that the computational complexity is roughly linear within the range in the number of links that was considered. Finally, Fig. 5c breaks down the running time into the time spent in each of the Mercator major steps, and doing so shows how most of the running time is spent during the ML step.

VII. CONCLUSIONS
In this work, we introduce and deliver the full code of Mercator, the most accurate method to embed complex networks into their latent metric spaces. We showed that the quality of the embeddings can be significantly improved by a proper combination of machine learning techniques and powerful statistical methods. Thanks to this combination, Mercator is able to overcome some of the drawbacks of other techniques which, for instance, require perfect power laws in the whole domain of degrees, a condition that is not met by many real networks. Our results also indicate that the obtained embeddings are able to recover ground truth information not contained in the network topology. We expect Mercator to become a standard tool within the toolbox of network scientists and anybody interested in retrieving information from big data systems admitting a network representation. AA acknowledges financial support from the project Sentinelle Nord of the Canada First Research Excellence Fund, from "la Caixa" Foundation and from the Spanish "Juan de la Cierva-incorporación" program (IJCI-2016-30193).

Appendix A: Mercator in details
We now provide the full details of Mercator.

Sketch of the method
The S 1 model has two global parameters that need to be inferred; µ, controlling the average degree and β, which determines the level of clustering in the network. In addition, every node i is assigned two hidden variables: a hidden degree κ i and an angular coordinate θ i . Mercator's fast and refined modes using synthetic networks generated with the S 1 model with a power law distribution for the expected degrees κ with exponent γ = 2.2, β = 2 (clustering) and k = 10. Dashed lines proportional to L α were added to guide the eye. Higher values of γ lead to smaller running times but similar scaling behavior as the number of links increases. b. Computational complexity of Mercator's slow mode applied to the real network datasets presented in Sec. V. A line proportional to L 2 has been added to guide the eye. c. Mercator's computational complexity broken down into each each step of the algorithm. The same synthetic networks as in a were used.
The following method finds the values of µ, β and κ i for which the expected degrees in the modelk(κ i ), that is, in synthetic networks generated with uniformly distributed angular positions, equal the observed degrees in the real network and, moreover, the expected mean local clustering of the embedding matches the real value. To that end, some values of µ and β are proposed. Next, the corresponding κ i are calculated. Finally, the expected clustering coefficient is computed and β is adjusted if the predicted value is not within an acceptable range of the original value.
The method relies on the assumption that all nodes with the same degree have the same hidden degree. Therefore, the first preliminary step is reading the network and counting the number of nodes in every degree class k, that we denote by N k .

Inferring the hidden degrees
This step assumes some given value of β and the corresponding µ = β 2π k sin π β , where k is the observed average degree. We then assign to every degree class the hidden degree given by κ(k) = k as the initial guess. The aim of the following algorithm is to adjust this relation so thatk(κ(k)) = k ± , where can be set, for instance, to = 0.01. To solve this problem, we need a way to reckon the values ofk(κ(k)) from the relation κ(k). To that end, it is useful to consider the probability for two nodes with hidden degrees κ and κ to be connected in the ensemble of networks with global parameters R = N/(2π), µ, β and uniformly distributed angular coordinates. This probability is given by Starting from the initial guess κ(k) = k, we perform the following steps to refine the relation κ(k): 1. Initialize expected degrees: For every degree class k, setk(κ(k)) = 0.
3. Compute largest deviation: Let max = max{|k(κ(k)) − k|} k be the maximal deviation between degrees and expected degrees. If max > , the values of κ(k) need to be corrected. Then, for every degree class k, set |κ(k) + [k −k(κ(k))]u| → κ(k), where u is a random variable drawn from U (0, 1). The rationale behind this transformation is that every degree-class hidden degree is corrected according to its expected-degree excess or deficiency; the random variable u prevents the process from getting trapped in a local minimum. Next, go to step 1 to compute the expected degrees corresponding to the new set of κ(k). Otherwise, if max ≤ , hidden degrees have been inferred for the current global parameters.

Inferring parameter β
To infer β, we need to compute the expected mean local clusteringc given the current values of the global parameters as well as of the hidden-degree distribution provided by κ(k) and N k found using the algorithm from the last subsection. The method is based on the following idea. Suppose we want to estimate the expected clusteringc(k) of some node with degree k. According to the definition of mean local clustering, this quantity is given by the probability for two randomly chosen neighbors of the node to be connected, which can be computed in two steps: first, we randomly choose two of its neighbors and draw their distances to the node from the distribution of distances between connected nodes in the model. Second, we compute the distance between the two neighbors and, with it, the probability for them to be connected. Two important points require some clarification: a. The model is uncorrelated at the hidden level.
Therefore, in the calculation of the clustering, we draw the two neighbors from the uncorrelated distribution P (k|k ) = kP (k)/ k .
b. The distribution of angular distance ∆θ between two connected nodes with hidden degrees κ and κ , ρ(∆θ|a κκ = 1), where a κκ stands for the corresponding adjacency-matrix element, is given by In the above expression, p(a κκ = 1|∆θ) = 1/ 1 + (R∆θ/(µκκ )) β is the connection probability between the two nodes with hidden degrees κ and κ separated by a distance ∆θ. The distribution of distances in the S 1 model is simply ρ(∆θ) = 1/π, since angular coordinates are uniformly distributed. Finally, p(a κκ = 1) is given in Eq. (A1). Equation (A2) therefore reads The expected mean local clustering can now be found following three steps: 1. Initialize mean local clustering: Letc(k) represent the expected mean local clustering of degree class k. Setc(k) = 0 for all k.

2.
Compute expected mean local clustering spectrum: For every degree class k, do m times: i. Draw two variables k i from P (k i |k), i = 1, 2.

Compute expected mean local clustering:
The expected mean local clusteringc can be readily computed asc = kc (k)N k /N .
If the error in the expected mean local clustering |c − c emp | < c , where c is the desired precision andc emp is the observed mean local clustering coefficient, we can accept the current value of β and proceed to the inference of the angular coordinates. Otherwise, β needs to be corrected and the hidden degrees must be recalculated accordingly by repeating the process explained in the previous subsection. Notice that, since the expected mean local clustering coefficient is a monotonic function of β, the process can be iterated efficiently using the bisection method. In practice, however, we use a modified version of the bisection method in which the upper bound is let free until we reach a value of β for which the expected clustering is higher than the observed one. More precisely, we start with a value for β picked uniformly between 2 and 3. Then, while the expected clustering is lower than the observed one, we increase β by multiplying it by 1.5. When we reach a value for which the observed clustering is surpassed, we start the regular bisection method. We also note that, for c = 0.01, m = 600 is enough. Of course, if more precision is required, m must be increased to guarantee that the fluctuations in the computedc(k) are small enough.
{θ i }, of each node. This is performed by following two steps: a machine learning step providing an initial ordering of the nodes as well as realistic positions, and a second step in which nodes are moved in order to maximize the likelihood that the S 1 model generated the original edge list.

Initial ordering and positions
This step is a modified version of the Laplacian Eigenmaps (LE) algorithm introduced in Ref. [26] and used in Refs. [22,23]. This machine learning algorithm was originally conceived for dimensionality reduction. The main idea is as follows. Given a set of points in R n , the algorithm first constructs a RGG by, for instance, connecting points located at a distance below some threshold in ndimensional Euclidean space. Once this graph is known, the points are mapped to R m with m < n by diagonalizing the corresponding Laplacian and assigning to every point i the coordinates is the i-th component of the j-th Laplacian eigenvector with non-null eigenvalue (the eigenvectors are ordered according to their eigenvalues). It can be shown that these coordinates minimize the squared distances between connected pairs in the RGG, while preventing all nodes from collapsing into a single point. Furthermore, the relevance of every connection (i, j) in the RGG in the above expression can be modulated by assigning a weight ω ij to it according to where |x i − x j | is the distance between the points in R n and t is a scaling factor fixed as the mean of the squares of all the distances [23]. The same procedure then leads to the minimization of The approach taken by Refs. [22,23] is to consider the network to be embedded as the RGG generated in a higher-dimensional space. Hence, by proceeding to the dimensionality reduction in m (typically m = 2) dimensions, we obtain an embedding in R m , which can be radially normalized so that all points lay in S m−1 . The improvement in Ref. [23] is to assign weights according to some heuristic and, once the coordinates on the plane are known, these are used to infer the ordering of nodes only. The final coordinates are computed by distributing all nodes on the circle with θ i+1 −θ i = 2π/N, ∀i. We now propose an improvement based on assigning the weights in the Laplacian matrix as well as the gaps θ i+1 − θ i according to the S 1 model.

Laplacian Eigenmaps for node ordering:
Since degree-one nodes do not add geometric information, we remove them at this step and work with the subgraph of nodes with k > 1. We then apply the Laplacian Eigenmaps method to such graph after weighting every links according to Eq. (A5), where we use as a proxy for the distance |x i − x j | and ∆θ ij is the expected angular distance between nodes i and j in the S 1 model conditioned to the fact that they are connected. The above expression simply maps such expected angular distance onto the corresponding cord length, since Laplacian Eigenmaps is designed to work on Euclidean space and, therefore, it seems natural to consider the S 1 model as embedded in R 2 for the algorithm. The expected distance between the nodes can be readily computed from Eq. (A3) as Since a similar approach developed in Ref. [23] has been shown to yield very good results in terms of the angular ordering of the nodes, we use this machine learning step to define a sequence of angular coordinates for the nodes in the subgraph, where the angles in S are ordered in increasing order. Each θ i is computed as where v i 1 and v i 2 are the x and y coordinates of node i found by LE. Once we have the ordering of nodes with k > 1, we reincorporate the degree-one nodes. This can be easily done by replacing every node i with t degree-one neighbors by the sequence (n i 1 , . . . , n i t/2 , i, n i t/2 +1 , . . . , n i t ) in S , where n i j is the j-th degree-one neighbor of node i (in any arbitrary order) and · is the floor function. Such operation yields a new sequence of nodes S including all the nodes in the original graph.
2. Order-preserving adjustment: This last step of the approximate embedding locates the nodes on the circle preserving the ordering of the nodes in S. To that end, we set every node's coordinate such that the gap between two consecutive nodes in S is proportional to the expected gap between two consecutive nodes with the same hidden variables and adjacency-matrix element in the S 1 model. We do this in two steps: i. Computing the expected gaps: Let nodes i and i + 1 be consecutive in S. The distribution for the length of the gap g i between them can be obtained from Bayes' rule, as in Eq. (A2), where now ρ(g i ) is an exponential distribution with mean 2π/N , The expected gap g i can thus be computed as g i = π 0 g i p(a i+1,i |g i )ρ(g i )dg i π 0 p(a i+1,i |g i )ρ(g i )dg i . (A14) Both integrals can be carried out numerically.
ii. Normalizing the gaps: By applying the last step to every pair of consecutive nodes (including the pair (N, 1)), we obtain a sequence of N expected gaps which, however, needs not sum up to 2π, so we normalize each g i as Finally, we can assign every node's coordinate sequentially, starting with θ 1 = 0, as θ i = θ i−1 +g i−1 , i = 2, . . . , N .

Likelihood maximization
This stage of the embedding algorithm adjusts the angular coordinates that maximize the likelihood for the observed network to be generated by the model. As opposed to previously proposed likelihood-maximization schemes, we do not need to explore a vast region of configuration space, since the machine learning stage provides a set of coordinates located near an optimal configuration. Hence, we visit every node once and propose several new angular coordinates for it, keeping the one with higher log-likelihood. The steps we follow are: 1. Define a new ordering of nodes: We visit the nodes in the order defined by the network's onion decomposition. In the sequence, the ordering among nodes belonging to the same layer in the decomposition is random.

2.
Find new optimal coordinates: For every node i, we select the optimal coordinate among candidate positions generated in the vicinity of the mean angular coordinate of its neighbors. This is achieved in three steps: i. Compute mean coordinate of node i's neighbors: Let node i have k i neighbors, which we now label with index j = 1, . . . , k i . Since nodes lay on a circle, we must compute their mean angular coordinateθ i using the vector sum of their positioning vectors in R 2 . The polar angle of the resulting vector sum is given bȳ where the hidden degrees in the above expression weight the contribution of every neighbor's positioning vector, as proposed in Ref. [20].

2.
Propose new positions aroundθ i : We generate 100 max(ln N, 1) candidate angular coordinates from a normal distribution with mean θ i and standard deviation σ given by where ∆θ max is the angular distance between θ i and the most distant neighbor of node i, i.e., ∆θ max = max{min |θ j −θ i |, 2π − |θ j −θ i | } j . Locate node i at the angular position maximizing the local log-likelihood.

Adjusting hidden degrees
The final process adjusts hidden degrees according to the hidden coordinates found so thatk(κ i ) = k i . The algorithm is similar to the initial inference of hidden degrees: 1. Compute expected degrees: For every node i, setk 2. Correct hidden degrees: Let max = max{|k(κ i ) − k i |} i be the maximal deviation between degrees and expected degrees. If max > , the set of hidden degrees needs to be corrected. Then, for every node i, set |κ i + [k i −k(κ i )]u| → κ i , where u is a random variable drawn from U (0, 1). As in Sec.A 2, the random variable u prevents the process from getting trapped in a local minimum. Next, go to step 1 to compute the expected degrees corresponding to the new set of hidden degrees. Otherwise, if max ≤ , hidden degrees have been inferred for the current global parameters and angular coordinates.