Neural-prior stochastic block model

The stochastic block model (SBM) is widely studied as a benchmark for graph clustering, also known as community detection. In practice, graph data often come with node attributes that bear additional information about the communities. Previous works modeled such data by considering that the node attributes are generated from the node community memberships. In this work, motivated by a recent surge of works in signal processing using deep neural networks as priors, we propose to model the communities as being determined by the node attributes rather than the opposite. We define the corresponding model; we call it the neural-prior SBM. We propose an algorithm, stemming from statistical physics, based on a combination of belief propagation and approximate message passing. We analyze the performance of the algorithm as well as the Bayes-optimal performance. We identify detectability and exact recovery phase transitions, as well as an algorithmically hard region. The proposed model and algorithm can be used as a benchmark for both theory and algorithms. To illustrate this, we compare the optimal performances to the performance of simple graph neural networks.


Introduction
The stochastic block model (SBM) is widely studied as a benchmark for graph clustering, also known as community detection; see e.g. the reviews Fortunato (2010); Abbe (2017); Peixoto (2019). In the standard version of the stochastic block model one observes a graph of connections, and the goal is to recover the communities from the knowledge of the graph alone.
However, in practice, graph data often come with node attributes that bear additional information about the communities. In such a case there are several sources of information on communities one can use: the structure of the graph (as in the standard SBM), and the features or attributes of the nodes. Past work developed algorithms and models accounting for such node information. Among the best known is the CESNA model of Yang et al. (2013), where the attributes are generated via logistic regression on the community membership. Another model that recently became popular in the context of benchmarking graph neural networks (e.g. Chien et al. (2021); Fountoulakis et al. (2022); Tsitsulin et al. (2021)) is the contextual SBM Binkiewicz et al. (2017); Deshpande et al. (2018), where communities determine centroids for a Gaussian mixture model generating the node features. In both these examples, the node attributes are generated via conditioning on the community label of the node.
In signal processing, a separate line of work, which has witnessed a surge of interest, models signals as the output of a deep generative neural network; for recent reviews see e.g. Ongie et al. (2020); Shlezinger et al. (2020). Deep generative neural networks can be trained on data and, due to their expressivity, are able to capture generic structural properties of the signal. In community detection the signal can be seen as the community memberships; following the line of work on deep generative priors, it is hence of interest to propose a model where the node attributes are the input of a generative neural network and the node community memberships are its output. In this work, motivated by this surge of works in signal processing using deep neural networks as priors, we propose to model the communities as being determined by the node attributes rather than the opposite, and we define the corresponding model, which we call the neural-prior SBM.
One of the appeals of the stochastic block model is that it is amenable to an exact statistical analysis of the best achievable performance, both from an information-theoretic and from an algorithmic point of view. This has led to a line of work, originating in statistical physics, where statistical and computational thresholds are analyzed; see e.g. Decelle et al. (2011b); Abbe et al. (2015); Abbe (2017). It is valuable to have a solvable case for which we know what is statistically and algorithmically achievable, because in the context of modern machine learning it is rarely known if or by how much the observed performance can be further improved. An asymptotically exact analysis of the detectability threshold was also performed for the contextual stochastic block model Deshpande et al. (2018); Lu and Sen (2020). The main topic of the present paper is the statistical-physics analysis of the optimal algorithmic performance for a simplified version of the proposed neural-prior stochastic block model that we call the generalized-linear-model SBM (GLM-SBM).
The GLM-SBM model we propose can be used for benchmarking graph neural networks (GNNs). Since the model is analyzable, we can compare the performance of the evaluated GNN to the optimal algorithmic performance in a non-trivial high-dimensional setting. We treat both the unsupervised and the semi-supervised cases and accompany our paper with an implementation that can be readily used for comparison by GNN developers. As far as we are aware, a model similar to the neural-prior SBM proposed here was used in Cho et al. (2022), as a building block of a larger neural network; it was not analyzed per se.
A large part of this paper is dedicated to the asymptotic analysis of the GLM-SBM model. We identify how the detectability phase transition well known from the SBM changes in the presence of the GLM prior. We also unveil an exact recovery phase transition that happens when the prior on the latent variables of the GLM is binary, while the average degree of the SBM remains finite. Such an exact recovery phase at finite average degree came to us as a surprise, and we find it rather remarkable in view of the fact that, without the GLM prior, exact recovery in the standard SBM is only possible for degrees growing logarithmically with the system size Abbe et al. (2015); Abbe (2017). The exact recovery transition is discontinuous and makes the problem algorithmically challenging, providing a set of parameters that can serve as a benchmark for attempts to improve graph neural networks.
2 The neural-prior stochastic block model

Definition
We consider a set V of |V| = N nodes and a graph G(V, A) on those nodes. Nodes have features/attributes F_µ ∈ R^M of dimension M, µ = 1, ..., N. The features and the graph are observed. We aim to divide the set of nodes into q communities with labels s_µ ∈ {1, ..., q} in such a way that (a) the graph structure correlates with the labels, e.g. nodes in the same community are more likely to be connected, and (b) the node attributes F_µ are correlated with the labels.

SBM:
In the stochastic block model the edges A_µν of the graph G are generated conditioned on the group memberships s_µ; we consider the following rule: A_µν = 1 with probability c_{s_µ s_ν}/N, and A_µν = 0 otherwise. Here c_i and c_o are the affinity coefficients common in the SBM. We define the affinity matrix whose elements are c_{s,t} = c_i δ_{s=t} + c_o δ_{s≠t}. We note that the literature often considers a more general SBM where the affinity matrix has arbitrary elements; the model and analysis proposed in this work could be readily generalized to that case. We consider a slightly restricted version of the SBM purely for simplicity. In the SBM the ground-truth group memberships s_µ are generated at random from a prior that only accounts for the sizes of the q groups. The node attributes F are simply ignored in the SBM.
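As a concrete illustration, the edge rule above can be sampled directly. The following is a minimal sketch; the function name `sample_sbm` and its signature are ours, not part of the paper's reference implementation.

```python
import numpy as np

def sample_sbm(s, c_in, c_out, rng=None):
    """Sample a symmetric SBM adjacency matrix with no self-loops.

    Each edge A[mu, nu] (mu < nu) is present with probability
    c_in / N if s[mu] == s[nu] and c_out / N otherwise.
    """
    rng = np.random.default_rng(rng)
    N = len(s)
    same = (s[:, None] == s[None, :])
    p = np.where(same, c_in / N, c_out / N)
    # draw the upper triangle only, then symmetrize
    upper = np.triu(rng.random((N, N)) < p, k=1)
    return (upper | upper.T).astype(int)
```

For two balanced groups the average degree of the sampled graph is roughly (c_in + c_out)/2, matching the parameterization used below.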
Neural-prior SBM: In the neural-prior SBM, which we define here, the group memberships s_µ can be a generic function of the attributes F_µ. Such a function can be represented by a deep neural network and learned from ground-truth data. The training data would be pairs {F_µ, s_µ}, where the attributes act as the neural network inputs and the group memberships as output labels. For instance, for an L-layer fully connected neural network this reads s_µ = φ^(L)(W^(L) φ^(L-1)(··· φ^(1)(W^(1) F_µ))), with the last activation function φ^(L) chosen as in multi-class classification tasks. The aim of this paper is to provide a benchmark model where the optimal performance can be analyzed asymptotically exactly. For this we need to (a) define the corresponding asymptotic limit, and (b) consider a simple neural network prior that is amenable to asymptotic analysis. We will also limit ourselves to community detection with two groups of the same size, q = 2 (this is not a strong limitation; it is adopted in what follows for simplicity). With this in mind, in the rest of the paper we consider the following model generating the group memberships s_µ.

GLM-SBM:
In order to make the analysis amenable we consider the features F to be random, drawn independently as F_µl ∼ N(0, 1/M). We then consider M latent variables w_l ∼ P_w, l = 1, ..., M, and generate the community memberships as s_µ = sign(Σ_l F_µl w_l). This corresponds to a single-layer neural network with a sign activation function. Such a neural network is also often referred to as the generalized linear model (GLM) or as the perceptron. We will hence call this variant of the neural-prior SBM the GLM-SBM.
Concerning the asymptotic limit, we work in the challenging sparse case of the SBM. We use the standard parameterization c_i = c + λ√c and c_o = c − λ√c. We then consider N → ∞ with c = (c_i + c_o)/2 = O(1) the average degree and λ = O(1) the signal-to-noise ratio. We further work in the high-dimensional limit of the GLM where N/M = α = O(1), with the aspect ratio α playing the role of another signal-to-noise ratio: the higher α, the more correlation there is between the group memberships and the easier the community detection should be.
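Putting the pieces together, a GLM-SBM instance can be generated as follows. This is an illustrative sketch, not the paper's reference implementation; the helper `sample_glm_sbm` and its signature are our own. It uses the parameterization c_i = c + λ√c, c_o = c − λ√c.

```python
import numpy as np

def sample_glm_sbm(N, alpha, c, lam, prior="gauss", rng=None):
    """Sample (F, w, s, A) from the GLM-SBM with q = 2 groups.

    Features F are i.i.d. N(0, 1/M) with M = N / alpha; the labels are
    s = sign(F w); the graph uses c_in = c + lam*sqrt(c) and
    c_out = c - lam*sqrt(c).
    """
    rng = np.random.default_rng(rng)
    M = int(N / alpha)
    F = rng.normal(0.0, 1.0 / np.sqrt(M), size=(N, M))
    if prior == "gauss":
        w = rng.normal(size=M)      # Gaussian prior P_w = N(0, 1)
    else:
        w = rng.choice([-1.0, 1.0], size=M)  # Rademacher prior
    s = np.sign(F @ w).astype(int)
    c_in, c_out = c + lam * np.sqrt(c), c - lam * np.sqrt(c)
    p = np.where(s[:, None] == s[None, :], c_in / N, c_out / N)
    upper = np.triu(rng.random((N, N)) < p, k=1)
    A = (upper | upper.T).astype(int)
    return F, w, s, A
```

Note that λ must satisfy λ ≤ √c for c_out to be a valid (non-negative) affinity, consistent with the λ = √c limit discussed later.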
The GLM-SBM differs from the SBM because the community labels are not independent conditionally on the features. For instance, in the extreme case M = 1, all memberships are known up to a global flip given by w_1; that is to say, they are all very strongly correlated. The GLM-SBM tends toward an SBM when α → 0: indeed, for large M, the preactivations Σ_l F_µl w_l tend to independent Gaussian variables.
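The α → 0 claim can be checked numerically: with F_µl ∼ N(0, 1/M) and, say, a Rademacher w, the preactivations F w are close to i.i.d. standard Gaussians. A small sketch (the variable names are ours):

```python
import numpy as np

# For large M the preactivations z_mu = sum_l F_mul w_l approach
# independent N(0, 1) variables, so the labels become (nearly)
# independent coin flips and the GLM-SBM degenerates to a plain SBM.
rng = np.random.default_rng(0)
N, M = 2000, 4000
F = rng.normal(0.0, 1.0 / np.sqrt(M), size=(N, M))
w = rng.choice([-1.0, 1.0], size=M)  # Rademacher prior, E[w^2] = 1
z = F @ w
print(z.mean(), z.var())  # both close to the N(0, 1) values 0 and 1
```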

Related work
Anticipating the asymptotic analysis that we are aiming at, we note that such an analysis has been done for the standard SBM in Decelle et al. (2011b,a) using the belief propagation algorithm and the cavity method from statistical physics. Concerning semi-supervised learning in the SBM, the information coming from the semi-supervision is readily incorporated into the analysis of the above papers, as has been done in Zhang et al. (2014).
The predictions of Decelle et al. (2011b,a) have since been partially established rigorously; see e.g. Mossel et al. (2015, 2018); Abbe (2017); Coja-Oghlan et al. (2017). However, the full conjecture of Decelle et al. (2011b,a) about the asymptotic exactness of their analysis remains an open question from the mathematical point of view. In this paper we use the same techniques as Decelle et al. (2011b,a), anticipating a follow-up work putting the conjectures about optimality on a rigorous basis. For the GLM, which is defined by a dense graphical model, the rigorous analysis is simpler and was carried out in Barbier et al. (2019).
The analysis of the GLM-SBM requires gluing the two graphical models: the GLM serves as the prior for the SBM, and the SBM acts as a source of uncertainty on the outputs of the GLM. Such a gluing of two dense, exactly solvable graphical models was developed in Manoel et al. (2017), with rigorous justifications given in Gabrié et al. (2018); Aubin et al. (2019); Gerbelot and Berthier (2021). Our work is the first one, as far as we are aware, where a sparse graphical model (the SBM) is glued to a dense graphical model (the GLM). This can be done heuristically and is conjectured asymptotically exact along the lines of the works of Decelle et al. (2011b,a). A complete rigorous justification would have to be preceded by a proof of the conjecture for the SBM, which is still open.
The contextual stochastic block model (CSBM), introduced and studied theoretically in Binkiewicz et al. (2017); Deshpande et al. (2018), is another version of the SBM incorporating node information. In the CSBM the node information is modeled via a Gaussian mixture, with each community having its own centroid. From the analysis point of view, this model takes into account two sources of observation about the latent variables, the community memberships. This is hence different from the GLM-SBM, where one model serves as a prior for the other instead of as an independent source of information. Modulo this difference, some of the analysis performed for the CSBM is related to our work. Notably, the detectability threshold and the linearized message-passing algorithm presented in Deshpande et al. (2018); Lu and Sen (2020) are obtained in a manner similar to the one in which we obtain the detectability phase transition and the linearized algorithm. We note that the semi-supervised version of the CSBM has not been analyzed, but this could be done rather straightforwardly using the same methods as in Zhang et al. (2014).

Bayes-optimal estimation of communities
We consider the GLM-SBM as defined above and aim to analyze the Bayes-optimal inference of the community structure. We consider in general the semi-supervised setting where, next to the structure of the graph A and the covariates F, we observe the communities of a subset Ξ of the nodes, with ρ = |Ξ|/N. We denote by s the vector of labels of the unobserved nodes and by s_Ξ that of the observed nodes. The unsupervised case is then recovered as the special case where Ξ is the empty set, ρ = 0.
The analysis of this paper is set in the so-called Bayes-optimal setting, where we know the parameters of the GLM-SBM model. The only quantities that we do not observe are the ground-truth values of the latent variables w that generate the group memberships s. Of the group memberships, we observe a fraction ρ in the semi-supervised setting and none in the unsupervised setting.
The optimal inference is then done using the posterior distribution over the unobserved communities,

P(s | A, F, s_Ξ) = P(A | s, s_Ξ) P(s | F, s_Ξ) / Z(A, F, s_Ξ),

where Z(A, F, s_Ξ) is the normalization constant. We used here the definition of the GLM-SBM model, which implies P(A | s, s_Ξ, F) = P(A | s, s_Ξ). For the GLM-SBM the prior on s is obtained by marginalizing over the latent variable w,

P(s | F, s_Ξ) = ∫ dw P_w(w) Π_µ P_0(s_µ | Σ_l F_µl w_l) P_{s,µ}(s_µ),   (6)

where P_0(t | z) = δ_{t = sign(z)} is the output distribution and P_{s,µ} is the additional prior distribution used to inject information about the membership of node µ (a point mass on the observed label for µ ∈ Ξ, uniform otherwise). However, since the estimation of the latent variable w is crucial in order to exploit the full power of the prior (6), it is instrumental to consider the posterior as a joint probability of the unobserved communities and the latent variable,

P(s, w | A, F, s_Ξ) = P(A | s, s_Ξ) P_w(w) Π_µ P_0(s_µ | Σ_l F_µl w_l) P_{s,µ}(s_µ) / Z(A, F, s_Ξ),

where Z is the Bayesian evidence. We define the free entropy of the problem as its logarithm, ϕ = log Z(A, F, s_Ξ) / N.

We seek an estimator ŝ that maximizes the overlap with the ground truth. The Bayes-optimal estimator maximizing it is ŝ_µ^MMO = argmax_t p_µ(t), where p_µ is the marginal posterior probability of node µ. Using the ground-truth values s_µ of the communities, the maximal mean overlap is then computed as MMO = (1/N) Σ_µ δ_{ŝ_µ^MMO = s_µ}. To estimate the latent variable w, we minimize the mean squared error via the MMSE estimator ŵ^MMSE, i.e. ŵ^MMSE is the mean of the posterior distribution. Again using the ground-truth values w_l of the latent variables, the MMSE is then computed as MMSE = (1/M) Σ_l (ŵ_l^MMSE − w_l)². The problem is invariant under a global sign flip of s and w, so in practice we measure the sign-invariant overlaps q_S = |(1/N) Σ_µ ŝ_µ s_µ| and q_W = |(1/M) Σ_l ŵ_l w_l|.

In general, Bayes-optimal estimation requires the evaluation of averages over the posterior, which is exponentially costly in N and M. In the next section we derive the AMP-BP algorithm and argue that, in the limit N → ∞ and M → ∞ with N/M = α = O(1) and all other parameters of O(1), this algorithm approximates the MMSE and MMO estimators with an error that vanishes. We give more precise statements below.
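For concreteness, the sign-invariant overlaps can be computed as in the following sketch. The helper names are ours, and the normalization convention (0 for random guessing, 1 for exact recovery, for ±1 labels) follows the reconstruction above.

```python
import numpy as np

def overlap_s(s_hat, s_true):
    """Group-membership overlap for +-1 labels, invariant under a
    global sign flip: ~0 for random guessing, 1 for exact recovery."""
    return abs(np.mean(s_hat * s_true))

def overlap_w(w_hat, w_true):
    """Latent-vector overlap |<w_hat, w_true>| / M, up to sign."""
    return abs(np.dot(w_hat, w_true)) / len(w_true)
```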

The AMP-BP algorithm
To retrieve the communities of the GLM-SBM, our main results rely on an algorithm that we call AMP-BP. We conjecture that, in the large-size limit, this algorithm cannot be beaten by any other polynomial-time algorithm. We can also extract the so-called hard phases, where the randomly initialized algorithm fails but an exponentially costly algorithm would succeed; we do this using an informed initialization and the free entropy. We then analyze the performance of the algorithm and the associated phase transitions.

Algorithm
The algorithm is based on belief propagation (BP) and approximate message passing (AMP). BP was used to solve the SBM in Decelle et al. (2011b) and is conjectured to be asymptotically optimal among efficient algorithms for this task. AMP was used to solve the GLM, see e.g. Donoho et al. (2009); Krzakala et al. (2012), and is again conjectured asymptotically optimal among efficient algorithms, with strong evidence provided by Celentano et al. (2021). We glue these two algorithms together along the lines of Manoel et al. (2017); Aubin et al. (2019) to solve the GLM-SBM; we call the resulting algorithm AMP-BP. Using statistical-physics arguments analogous to those in Decelle et al. (2011b); Krzakala et al. (2012), we conjecture that it provides asymptotically optimal performance in the considered cases.
We derive the AMP-BP algorithm for the GLM-SBM starting from the factor graph of the problem. The χs and ψs are probability distributions on the variables s and w; they are called cavity messages. We write the belief-propagation (BP) equations for these distributions. The proportionality signs ∝ denote that all messages are non-negative numbers summing to one over their lower indices, the corresponding normalization factors being omitted in our notation.
These BP equations still involve a high-dimensional integral and hence cannot be implemented efficiently. We simplify them into AMP-BP by applying the central limit theorem on the dense side of the graphical model and keeping only the means and variances of the resulting Gaussians. This is standard in the derivation of the AMP algorithm, see e.g. Krzakala et al. (2012). The details of this derivation are given in appendix A.
In order to state the final algorithm, we introduce the output denoising function g_o and the input functions. We denote by Z the normalization factors obtained so that the messages sum to one over their lower indices.
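To make the input functions concrete for the two priors analyzed later, here is a hedged sketch. It assumes the standard AMP form in which the cavity information on w_l is Gaussian with natural parameters (Γ_l, Λ), i.e. a likelihood proportional to exp(−Λ w²/2 + Γ w); the function names f_a and f_v are our labels, not necessarily the paper's.

```python
import numpy as np

def f_a_gauss(Gamma, Lam):
    """Posterior mean of w ~ N(0, 1) under likelihood exp(-Lam*w^2/2 + Gamma*w)."""
    return Gamma / (Lam + 1.0)

def f_v_gauss(Gamma, Lam):
    """Posterior variance for the Gaussian prior."""
    return 1.0 / (Lam + 1.0)

def f_a_rad(Gamma, Lam):
    """Posterior mean of w ~ Rademacher: w^2 = 1, so the Lam term is constant."""
    return np.tanh(Gamma)

def f_v_rad(Gamma, Lam):
    """Posterior variance 1 - tanh(Gamma)^2 for the Rademacher prior."""
    return 1.0 - np.tanh(Gamma) ** 2
```

These closed forms are why the two priors are convenient: the Gaussian prior gives linear denoising, while the Rademacher prior saturates to ±1 as |Γ| grows, which is what enables exact recovery of w.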
To give some intuition, we explain the variables AMP-BP employs. a_l is an estimate of the posterior mean of w_l and v_l of its variance; ω_µ is an estimate of the mean of Σ_l F_µl w_l and V of its variance. ψ^{µ→µ}_{s_µ} is a marginal distribution on s_µ as estimated by AMP on the GLM side, while χ^{µ→µ}_{s_µ} is the distribution as estimated by BP on the SBM side. Γ_l is a proxy for the mean of w_l in the absence of the prior P_w and Λ for the variance. h_t can be interpreted as an external field preventing all the nodes from falling into the same group; χ^{µ→ν}_{s_µ} is a marginal distribution on s_µ (these variables are the messages of a sum-product message-passing algorithm) and χ^µ_{s_µ} is the estimated posterior marginal on s_µ, which is the quantity we are interested in.
We provide an implementation of AMP-BP in the supplementary material; it is also available from our repository. We draw attention to the output function g_o, which covers the difference between AMP for the GLM-SBM and AMP for the GLM alone: in AMP for the GLM alone g_o depends on the observed labels, while here we use their estimated marginals. On the other side, the difference between BP for the GLM-SBM and BP for the SBM alone lies in the messages ψ^{µ→µ} in the BP update; ψ^{µ→µ} can be interpreted as the conditional probability of s_µ given w, in the absence of the SBM.
Estimators. The Bayes-optimal estimators of s and w are obtained according to eqs. (10) and (12). Expressed using the AMP-BP messages they become ŝ_µ^{AMP-BP} = sign(χ^µ_+ − 1/2) and ŵ_l^{AMP-BP} = a_l, where χ^µ_+ is the estimated marginal probability of the event s_µ = +1 and a_l is the estimated mean of w_l.
Free entropy. We also express the free entropy ϕ in terms of the messages and variables of AMP-BP at the fixed point; it is called the Bethe free entropy ϕ_Bethe. The derivation from the factor graph is done in appendix B; up to a term that diverges with N, it yields the expression of ϕ_Bethe. If AMP-BP has more than one fixed point, the free entropy serves to select the fixed point that corresponds to Bayes-optimal performance: it is the one with the largest free entropy that should be selected.
We compare later the free entropy of the fixed point of AMP-BP to the free entropy of the fully informative point where q_S = q_W = 1; we write it ϕ_info. At this point the messages are delta functions on the ground truth; ϕ_info can thus be derived directly from the factor graph.

4.2 Asymptotic optimality conjecture
We conjecture that AMP-BP gives the Bayes-optimal estimator for GLM-SBM in the following sense.
We define the two possible initializations: (a) random initialization, where we initialize the messages randomly according to their prior distribution, adding no information, as described in the algorithm above; and (b) informed initialization, where we initialize the estimators to delta functions of the true values of s and w.
We consider the fixed point of AMP-BP that has the largest Bethe free entropy ϕ Bethe .We argue that it suffices to check the random and informed initializations to find all the relevant fixed points.
We conjecture that, asymptotically exactly, the AMP-BP fixed point that has the largest ϕ Bethe provides the Bayes-optimal estimators for the GLM-SBM model.Its overlap q S is asymptotically equal to the Bayes-optimal MMO overlap and ϕ Bethe is equal to ϕ, with high probability as N → +∞.This is aligned with the same conjecture for BP and the standard SBM from Decelle et al. (2011a) and the proofs of this property for the AMP algorithm and the pure GLM model in Barbier et al. (2019).
5 Bayes-optimal estimation with AMP-BP and phase transitions

Gaussian prior, 2nd order transition to partial recovery
In this subsection we consider the GLM prior P w to be a standard Gaussian.The GLM then produces binary labels, the group memberships, with the same probability of being in each of the groups.
We conjecture that for this prior the fixed point of AMP-BP reached from random initialization always corresponds to the Bayes-optimal estimation, and that no computationally hard phase is present. We observe that the algorithm converges to the same fixed point for the two possible initializations. In Fig. 7 in appendix E we illustrate that the system size we use is close enough to the thermodynamic limit, in the sense that the change in the curves is small when the size is changed.
The accuracy AMP-BP achieves is depicted in Fig. 1 (unsupervised case) and Fig. 8 (semi-supervised case, in appendix E). We observe that the larger the snr λ or the aspect ratio α, the better the recovery. The recovery is eased when the community memberships are explained by a few features, i.e. when α is large.

Figure 1: Left and right: overlaps q_S (group-membership estimation) and q_W (GLM latent-vector estimation) of the fixed point of AMP-BP, vs λ for a range of compression ratios α. Vertical dashed lines: theoretical thresholds λ_c to partial recovery, eq. (29). N = 10^4, c = 5, P_w Gaussian. We run ten experiments per point. Inset: we plot the ten data points and their mean.
In the unsupervised case, we observe a phase transition from a non-informative fixed point q S = q W = 0 to an informative fixed point q S > 0, q W > 0. The transition is located at a particular critical threshold λ c .This transition is well known for standard SBM, which is recovered here in the α → 0 limit, where λ c = 1 for q = 2.The transition is of 2nd order; this means that the overlaps vary continuously with respect to λ.In the semi-supervised case the 2nd order transition disappears.
Linearization, spectral algorithm. λ_c can be computed by a linear stability analysis of the non-informative fixed point of AMP-BP: at a given λ, if this fixed point is not stable, the algorithm moves away from it toward the informative fixed point. The linearization of the algorithm is done in appendix C; it yields a linear update equation on real random variables x, involving the matrix (FF^T)_{µν} = Σ_l F_µl F_νl. Taking the variance of this equation and averaging over the realizations of the graph, we obtain the stability criterion of eq. (28). Eq. (28) can be interpreted as a spectral algorithm: we iteratively apply to the variables a linear operator built from the non-backtracking matrix B of Krzakala et al. (2013). Such a spectral algorithm shares the phase transition at the snr given by eq. (28). The study of the resulting overlap is also of interest, but we do not consider it in the present article.

Binary prior, 1st order transition to exact recovery
In this subsection the GLM prior is taken to be Rademacher, P_w = (δ_{w=1} + δ_{w=−1})/2. This still produces two groups of unbiased sizes. The fixed point AMP-BP reaches from a random initialization is depicted in Figs. 2 and 9 (in appendix E). We observe that it admits the same transition to partial recovery at λ_c, eq. (29), as the Gaussian prior does; this is also predicted by the linearization of the previous part.
For values of α > α_algo (that we determine below) we observe another transition; it is discontinuous, from partial recovery q_S > 0, q_W > 0 to exact recovery q_S = q_W = 1. There is a value λ_algo such that for λ > λ_algo randomly initialized AMP-BP recovers the group memberships of all nodes exactly. The overlap q_S does not vary continuously at λ_algo; over many independent trials we observe that there is an interval of overlaps below 1 that cannot be reached by AMP-BP at any λ.
Discontinuous transitions are related to the existence of several fixed points of AMP-BP and to first-order phase transitions. A 1st order phase transition is located by comparing the free entropies ϕ_Bethe of the various fixed points. We notice that, next to the AMP-BP fixed point reached from a random initialization, the exact-recovery point is a fixed point at all values of λ and α (still considering P_w binary). In the region of λ and α where these two fixed points differ we need to compare their free entropies: the fixed point with the larger free entropy describes the Bayes-optimal performance, which can in general be better than that of AMP-BP.
Figure 2: Left and right: overlap q S and free entropy ϕ Bethe − ϕ info of the fixed point of AMP-BP, vs λ for several compression ratios α.N = 10 4 , c = 5, P w Rademacher, ρ = 0. We run ten experiments per point; the median is plotted and the error bars are the difference between the 0.85th and 0.15th quantiles.Insets: we plot the median and the ten data points.We use damping for AMP-BP: we interpolate taking 1/4 of the values at t + 1 and 3/4 of the values at t.
The difference between the free entropies of the fixed point reached by AMP-BP from random initialization and of the informative fixed point is depicted on the rhs of Fig. 2. We see that for λ < λ_IT the fixed point reached from random initialization has larger free entropy, ϕ_Bethe > ϕ_info, and hence describes the optimal performance. In the region λ_IT < λ < λ_algo the informative fixed point has larger free entropy, ϕ_info > ϕ_Bethe, but randomly initialized AMP-BP does not reach it. This is an algorithmically hard phase where exact recovery is statistically possible but the AMP-BP algorithm is sub-optimal. At the same time, AMP-BP is conjectured optimal among efficient algorithms Gamarnik et al. (2022), and thus the hardness of this phase is believed to be intrinsic. For λ > λ_algo we only find the exact-recovery fixed point. Exact recovery in the standard SBM is only achievable for graphs of average degree c diverging logarithmically with the size of the system Abbe et al. (2015), where the logarithm comes from a type of coupon-collector problem. The existence of an exact-recovery phase in graphs of constant degree is novel as far as we know, and it nicely illustrates the power of the GLM prior, which is able to induce it. It is well known that a 1st order phase transition appears for the GLM alone with binary weights and known labels Györgyi (1990); Sompolinsky et al. (1990); Barbier et al. (2019). We note, however, that in the GLM-SBM the labels are not observed directly but only via the graph; it is thus not a priori clear that an exact-recovery phase can appear, and without our analysis its existence would not be easy to anticipate.
Let us finally derive the value α_algo above which the exact-recovery phase exists. We consider the limit λ = √c; then the graph G consists of two disconnected components, one for each community, and AMP-BP performs as AMP for the GLM alone, up to a global sign. We take into account the proportion e^{−c} of nodes that are isolated and bring no information. We obtain

α_algo = α_algo,perceptron (1 − e^{−c})^{−1}   (32)

where α_algo,perceptron ≈ 1.493 is the algorithmic critical compression ratio of the binary perceptron Barbier et al. (2019). Similarly, λ_IT exists above α_IT = α_IT,perceptron (1 − e^{−c})^{−1}, where α_IT,perceptron ≈ 1.249 is the information-theoretic critical threshold of the binary perceptron Györgyi (1990); Sompolinsky et al. (1990); Barbier et al. (2019). The 1st order phase transition λ_IT and its spinodal λ_algo are still present in the semi-supervised case ρ > 0 for small values of ρ, see Fig. 9 in appendix E, contrary to the 2nd order phase transition to partial recovery, which vanishes in the semi-supervised case. Moreover, for ρ > α_algo,perceptron/α, perfect recovery is achieved at any λ, because one then has enough train labels to infer w.
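The two thresholds can be evaluated numerically from eq. (32) and its information-theoretic analogue; a small sketch (the helper names are ours):

```python
import numpy as np

# Thresholds for exact recovery in the GLM-SBM with binary prior:
# the binary-perceptron constants, rescaled by the fraction
# 1 - exp(-c) of non-isolated (informative) nodes.
ALPHA_ALGO_PERC = 1.493  # algorithmic threshold of the binary perceptron
ALPHA_IT_PERC = 1.249    # information-theoretic threshold

def alpha_algo(c):
    return ALPHA_ALGO_PERC / (1.0 - np.exp(-c))

def alpha_it(c):
    return ALPHA_IT_PERC / (1.0 - np.exp(-c))
```

Both thresholds decrease toward the perceptron values as the average degree c grows, since fewer and fewer nodes are isolated.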

Analysis in the dense limit
As in Decelle et al. (2011a,b) for the sparse SBM, the analysis of AMP-BP is based on the numerical investigation of the fixed points and their free entropies on systems large enough that the behavior is representative of the large-size limit. This is also at the root of the mathematical difficulty of establishing this prescription rigorously. At the same time, a dense version of the SBM has been proposed and studied fully rigorously in Lesieur et al. (2017); Miolane (2017). This rigorous analysis has then been extended to include the GLM prior in Aubin et al. (2019), which studies a generic instance of the low-rank matrix factorization problem with a generative prior. We hence study the phenomenology of the AMP-BP algorithm in the limit of large degree c, where it becomes a special case of the framework developed in Aubin et al. (2019).
The dense limit is defined by taking p_i, p_o = O(1) and p_i − p_o = O(1/√N). The SBM is then a low-rank matrix factorization problem, parameterized by its signal-to-noise ratio ∆_I (defined as the inverse variance of an equivalent additive Gaussian channel). We need ∆_I as a function of the parameters of the SBM, that is to say, we need to equalize the signal-to-noise ratios of the two formulations. Computing the Fisher information of the SBM output channel, the mapping is

∆_I = N (p_i − p_o)² / [4 p̄ (1 − p̄)],  with p̄ = (p_i + p_o)/2,

where p_i = c_i/N and p_o = c_o/N. It is of order one in both the sparse and the dense case. The factor 1/4 is included to obtain a phase transition at ∆_I = 1 in the dense case when α = 0. In the following ρ = 0.

The authors of Aubin et al. (2019) give the algorithm corresponding to the dense limit of AMP-BP; we reproduce it in appendix D. Its performance can be tracked by a few scalar equations, the state-evolution (SE) equations; for P_w Rademacher they involve the s- and w-overlaps q_s and q_w, the signal-to-noise ratio ∆_I, and standard Gaussian variables ξ and η. Aubin et al. (2019) also gives the free entropy of the fixed point of the algorithm for the dense problem, expressed in terms of the function xlogx: x → x log x.

The convergence to the dense limit is quite fast; the large-degree results are close to the observed ones even for quite small c. Numerically, c ≈ 20 is enough (at N = 10^4) to observe only a quite small difference, see Fig. 3.
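The mapping from the sparse parameters (c, λ) to ∆_I can be sketched as follows; this relies on the Fisher-information formula reconstructed above, so treat it as illustrative. In the sparse limit p̄ → 0 it reduces to ∆_I → λ², so the dense transition ∆_I = 1 at α = 0 matches λ_c = 1.

```python
import numpy as np

def delta_I(c, lam, N):
    """Effective snr of the dense-SBM channel, from the mapping
    Delta_I = N (p_i - p_o)^2 / (4 pbar (1 - pbar))."""
    p_i = (c + lam * np.sqrt(c)) / N
    p_o = (c - lam * np.sqrt(c)) / N
    pbar = (p_i + p_o) / 2.0
    return N * (p_i - p_o) ** 2 / (4.0 * pbar * (1.0 - pbar))
```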
The fully informative fixed point is (q_s, q̂_w, q_w) = (1, +∞, 1); its free entropy can be computed in closed form. The analysis of the system of SE equations is done in appendix D; we summarize the four main points: (a) the fully informative fixed point is stable for all ∆_I; (b) the width of its stability domain shrinks to zero when ∆_I tends to zero; (c) a general necessary condition to observe a fully informative fixed point is that P_w does not admit a density that is everywhere twice differentiable; (d) the algorithmic critical compression ratio α_algo,d is close to α_algo,perceptron.
We also obtain an approximation for the critical point λ_c of the transition to partial recovery. Aubin et al. (2019) gives the critical snr in the dense limit. The limit of large c = ω(1) gives λ_c = (1 + 4α/π²)^(−1/2) + O(c/N), as predicted by the linearization.

Comparison of performance with standard GNNs on GLM-SBM
The GLM-SBM can be used as a benchmark for clustering or classification tasks on attributed graphs. We compare two simple baselines with AMP-BP. We show that the GLM-SBM is simple to define yet algorithmically challenging, in particular for a binary prior close to the first-order phase transition.
An unsupervised baseline. The algorithm, inspired by graph convolutional networks, performs binary clustering. We compare its performance to the optimal one given by AMP-BP. Its performance is shown on Figs. 4 left (P_w binary) and 10 left (P_w Gaussian, in appendix E). Data are generated according to the GLM-SBM. We stack the features F_µl into vectors F_µ^(0) ∈ R^M. The observed graph G is used for the convolution steps.
We compute n steps of graph convolution on the features, perform PCA on the transformed features, keep the largest component, and threshold its left singular vector to obtain the membership of each node. Formally, we consider the features F_µ^(0) ∈ R^M and apply n convolution steps with a scalar step size a. We then apply PCA to the new matrix F whose rows are the F_µ^(n). Writing u ∈ R^N for the left singular vector of the largest component, the estimator is ŝ = sign(u). We tune n and a empirically to optimize the recovery. We observe that the performance roughly depends on n and a only through their product an. Also, the optimal a scales like 1/c.
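A minimal sketch of this baseline, assuming the convolution step F ← F + aAF with A the adjacency matrix of G; this update form is a reading of the text, consistent with the observed dependence on the product an, since (I + aA)^n ≈ I + naA for small a:

```python
import numpy as np

def unsupervised_baseline(A, F, n=4, a=0.1):
    """Graph-convolution + PCA baseline (a sketch of the procedure in the text).

    A : (N, N) adjacency matrix of the observed graph G.
    F : (N, M) node features, rows F_mu^(0).
    n : number of convolution steps; a : scalar step size (optimal a ~ 1/c).
    Returns estimated memberships in {-1, +1}.
    """
    Fn = F.copy()
    for _ in range(n):
        # one convolution step: mix each node's features with its neighbours'
        Fn = Fn + a * (A @ Fn)
    # PCA via SVD of the (centered) transformed features; u in R^N is the
    # left singular vector of the largest component
    U, S, Vt = np.linalg.svd(Fn - Fn.mean(axis=0), full_matrices=False)
    u = U[:, 0]
    return np.sign(u)
```

The centering before the SVD is a standard PCA convention, not stated explicitly in the text.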
Figure 4: Overlap q_S of the baseline algorithms, vs λ, compared to the overlap obtained by AMP-BP. Left: unsupervised; for the parameters of the graph convolution we choose a = 0.1 and n = 4. Right: semi-supervised; for the hyperparameters of the GNN we choose n = 2, N_hidden = 20, learning rate 3·10^−4 and L2 penalty 10^−3. The train set is ρ = 1/10 of the nodes. N = 10^4, c = 5, P_w binary. We run ten experiments per point.
A semi-supervised baseline. The algorithm is a simple GNN, trained in a semi-supervised way for node classification. Again the data are generated according to the GLM-SBM, with ρ = 1/10. We stack the features F_µl into vectors F_µ^(0) ∈ R^M and use the observed graph G for the message-passing steps. The GNN consists of n message-passing steps, a two-layer perceptron, and a readout layer for the binary classification. We train it given the labels of the subset of nodes Ξ, using gradient descent with the logistic loss, momentum, and L2 regularization. We do not fine-tune the hyperparameters. Its performance is shown on Figs. 4 right (P_w binary) and 10 right (P_w Gaussian, in appendix E).
We also performed experiments where the GNN is made of a single-layer perceptron (no relu), as Cheng et al. (2022) does on the CSBM. The performance is similar to that of the multi-layer perceptron, but many more parameters have to be trained (M² vs M·N_hidden, and we take N_hidden = O(1)).
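A hedged sketch of this single-layer variant: graph convolution followed by a logistic readout trained on the revealed labels. The convolution update F ← F + aAF and the omission of momentum are simplifying assumptions, so this is not the exact network of the text:

```python
import numpy as np

def semi_supervised_baseline(A, F, labels, train_idx, n=2, a=0.1,
                             lr=3e-4, l2=1e-3, epochs=500, seed=0):
    """Graph convolution + logistic readout trained on a labelled subset.

    labels in {-1, +1}; train_idx indexes the revealed nodes (a fraction rho).
    A sketch of the single-layer-perceptron variant, not the two-layer GNN.
    """
    rng = np.random.default_rng(seed)
    Fn = F.copy()
    for _ in range(n):
        Fn = Fn + a * (A @ Fn)                  # message-passing steps on G
    w = rng.normal(0.0, 0.01, size=F.shape[1])  # readout weights
    X, y = Fn[train_idx], labels[train_idx]
    for _ in range(epochs):
        # gradient of the mean logistic loss plus the L2 penalty
        margins = y * (X @ w)
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0) + l2 * w
        w -= lr * grad
    return np.sign(Fn @ w)                      # memberships for all nodes
```

With n = 0 this reduces to plain logistic regression on the raw features, which is a useful sanity check of the readout.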

Conclusion on the comparison.
As to the GLM-SBM dataset, Fig. 4 illustrates that, both in the unsupervised and the semi-supervised settings, the baseline methods exhibit a considerable gap to the optimal performance given by the AMP-BP algorithm. The GLM-SBM setting is hence suitable for developing GNN algorithms that provide higher accuracy.
As to the AMP-BP algorithm, it is very scalable. Its running time is similar to that of the GNN-based approaches, around a few minutes per point on Figs. 1 or 2 (including the ten experiments). Its complexity is O(NM) in time and in memory, which is optimal since merely reading the input already takes Θ(NM) operations. The number of steps needed for convergence does not depend on N.

Conclusions
We propose a model of attributed graphs: a sparse SBM where the nodes carry features that determine their community memberships. We solve it, in the sense that we derive an algorithm conjectured to be optimal among polynomial-time algorithms. We analyze a linearization of the algorithm and the dense limit of the model. The model, though simple, exhibits a rich phenomenology with detectability and exact-recovery phase transitions. It can be used as a challenging benchmark for graph neural networks.
In the analysis of this paper we only considered two groups. For more than two groups, q > 2, the analysis can also be carried out by writing an AMP-BP algorithm; the AMP side would then correspond to a single-layer network with multi-class output. The AMP for such a model has been written and studied in Cornacchia et al. (2022), and one would have to merge it with the BP of Decelle et al. (2011a). Another generalization that could be analyzed is attributes F drawn from a Gaussian with a generic covariance; this can be done along the lines of Loureiro et al. (2021). On the other hand, taking as prior the multi-layer neural network (2) with learned weights W would be more challenging: a corresponding AMP algorithm providing an asymptotically exact solution is not known.
A future direction of work could be to theoretically analyze the learning of the GLM-SBM by a GNN, i.e. to give insights on the generalization performance of the neural network of Section 7, as Cheng et al. (2022) does for a perceptron-based graph convolution network on the CSBM. This would be interesting because few theoretical works address the generalization ability of GNNs.
These messages satisfy the following equations.

A.1 SBM

We can apply the standard simplifications for sparse SBM, Decelle et al. (2011b); Zdeborová and Krzakala (2016). We consider only messages on G. This gives

A.2.1 r-BP
We first apply the simplifications that lead to r-BP. We define the relevant quantities and consider the inner part of the message χ^(l→µ)_(w_l), i.e. the product of the ψ^(ν→l). We set z_ν = F_νl w_l + Σ_(m≠l) F_νm w_m. By independence of the w, the partial sum behaves like a Gaussian whose mean and variance involve

a_(m→ν) = ∫ dw_m χ^(m→ν)_(w_m) w_m ,  v_(m→ν) = ∫ dw_m χ^(m→ν)_(w_m) w_m² − a_(m→ν)² .  (64)

We replace the integral over all the w by a Gaussian integral over z_ν. We can then simplify: F_νl is small, so we expand the exponential. We introduce the denoising function g_o, whose expression differs from the one of Zdeborová and Krzakala (2016); it is evaluated at (ω_(ν→l), χ_(ν→ν), V_(ν→l)). We exponentiate and take the product of the ψ. We close the loop by defining the input functions, from which the mean and the variance of the marginals are estimated. We also obtain the expression of the GLM-to-SBM message.

A.2.2 Time indices
There are two possibilities for mixing the GLM part and the SBM part, depending on which time indices are used in the updates. We try both; we do not observe any numerical difference.

A.2.3 AMP
Then we go from r-BP to AMP: we remove the dependence of the messages on the target and keep only the marginals. The derivation is given by Zdeborová and Krzakala (2016).

A.2.4 Further simplifications

F²_µm self-averages; we can replace it by its average 1/M in eqs. (78) and (81), so that Λ and V become scalars. Also, on average, −∂_ω g_(o,µ) = g²_(o,µ). We obtain the algorithm given in the main part.

Appendix B. Free entropy
We start with the factor graph. The Bethe free entropy N ϕ_Bethe is the sum of the free entropies of the nodes plus those of the factors minus those of the edges, i.e.
This simplifies as follows. On the GLM side, we compute log Ẑ_(µ→l) as a function of the target-free quantities, starting from ω_(µ→l) = ω_µ − F_µl a_(l→µ) and V_(µ→l) = V_µ − F²_µl v_(l→µ) and expanding. Here c_(.,.) denotes the affinity matrix, and we use the standard linearization for the SBM. We then assemble the equations together. The matrices (1/2) c_(.,.)/c − 1 and ∂_ω ψ|_* ∇_χ g_o|_* share the same eigenvectors; each has one null eigenvalue and one positive eigenvalue, (c_i − c_o)/(2c) = λ/√c and 2/π respectively. We project onto these eigenvectors, writing (F F^T)_µν = Σ_l F_µl F_νl. We obtain the threshold λ_c of partial recovery by taking the variance of expression (112), discarding the time indices, and using that (F F^T − I_N)²_µν averages to 1/M if µ ≠ ν and to O(1/M) otherwise. The dense AMP algorithm iterates the resulting updates of (a^(t+1), ω_µ^(t+1), V^(t+1)), t ← t + 1, until convergence of a_l, v_l, σ_µ, Σ_µ, and outputs the estimated mean a_l and variance v_l of w_l and the estimated mean σ_µ and variance Σ_µ of s_µ.
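The self-averaging claim about (F F^T − I_N)² can be checked numerically; a small sketch, assuming i.i.d. feature entries of variance 1/M (consistent with F²_µm averaging to 1/M):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 600
# features with i.i.d. entries of variance 1/M
F = rng.normal(0.0, 1.0 / np.sqrt(M), size=(N, M))
# elementwise square of (F F^T - I_N)
B = (F @ F.T - np.eye(N)) ** 2
off_diag = B[~np.eye(N, dtype=bool)]
print(off_diag.mean() * M)      # concentrates around 1: off-diagonal mean ~ 1/M
print(B.diagonal().mean() * M)  # stays O(1): diagonal mean is O(1/M)
```

For µ ≠ ν the entry (F F^T)_µν is a sum of M products of independent entries, so its variance is M·(1/M)² = 1/M, which is what the squared entries average to.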

D.2 Analysis of the SE equations near the full recovery point
The state evolution equations are given in section 6, eqs. (35)-(37). We study the conditions of stability of the fully informative fixed point (q_s, q̂_w, q_w) = (1, +∞, 1). We denote the update by functions f_i given by the SE update equations, where ξ and η are standard Gaussians. We expand around (1, +∞, 1), using the parametrization (r, t, s) = (∆_I + ε_r, 1/ε_t², 1 − ε_s²).

f_3 — We expand the integrand of f_3 around +∞. This is valid only for √t ξ = ξ/ε_t ≫ 1, so we introduce a cut-off δ such that both δ = o(1) and δ = ω(ε_t). For ξ > δ we use the asymptotic sinh(x) tanh(x) = (1/2)e^x − (3/2)e^(−x) + o(e^(−x)); for ξ < δ we expand the Gaussian density to the first (constant) order. In the last steps we expand the error function around +∞.
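The asymptotic expansion used for f_3 can be sanity-checked numerically; a quick sketch:

```python
import math

def sinh_tanh(x):
    return math.sinh(x) * math.tanh(x)

def asymptotic(x):
    # sinh(x) tanh(x) = (1/2) e^x - (3/2) e^{-x} + o(e^{-x}) as x -> +infinity
    return 0.5 * math.exp(x) - 1.5 * math.exp(-x)

for x in (2.0, 4.0, 8.0):
    err = abs(sinh_tanh(x) - asymptotic(x))
    # the remainder is o(e^{-x}): err * e^x shrinks as x grows
    print(x, err * math.exp(x))
```

Indeed sinh(x) tanh(x) = cosh(x) − sech(x), and expanding sech(x) = 2e^(−x) − 2e^(−3x) + … recovers the stated asymptotic with a remainder of order e^(−3x).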
f_1 — We use the shorthand notation g(x, y) = (sinh(x) + cosh(x) erf(y))² / (cosh(x) + sinh(x) erf(y)). The first order is enough since it is not constant; we have:

D.2.1 Stability
We obtain the following update of the perturbation. We consider only the s variable, because r does not affect the dynamics and the initialization is done on r and s, t being inferred from them. The update is stable for all α and ∆_I. Numerically, however, an instability can be detected: for ε_s large enough the system diverges from the fully informative fixed point. We compute numerically the limiting ε_s*, such that ε_s^(t+1) = ε_s^t; we find that ε_s* tends to zero fast when ∆_I or µ goes to zero. In the Gaussian case we have f_3(t) = t/(1 + t) and thus ε_s^(t+2) = ε_t^(t+1) = ε_s^t/(α C_2(∆_I)); this fixed point is therefore unconditionally unstable.

D.2.2 Generalization of the prior
We ask for which priors P_w the fully-informative fixed point is stable.
We show that if P_w admits an everywhere twice-differentiable density, then the fully-informative fixed point is unstable. Indeed, at large t we obtain an equation similar to that of the Gaussian case, ε_s^(t+2) = C′ ε_s^t with C′ > 0, which is unstable.

D.2.3 Large snr
We give an implicit value for the critical compression ratio α algo,d .We take the limit ∆ I ≫ 1 and seek whether the SE updates converge to the fully informative point.

Figure 3 :
Figure 3: Left and right: overlap q_S and free entropies ϕ_Bethe − ϕ_info and ϕ_Bethe,d − ϕ_info,d of the fixed point of AMP-BP and of the SE equations of the dense limit, vs ∆_I, for several average degrees c. We generate instances of the GLM-SBM according to the λ obtained by inverting eq. (34). N = 10^4, α = 3, P_w binary. For AMP-BP we run ten experiments per point; for the SE equations, one experiment. The median is plotted and the error bars are the difference between the 0.85th and 0.15th quantiles. Insets: we plot the median and the ten data points. We use damping. For SE, the slight decrease of the free entropy at large ∆_I is due to numerical imprecision.

Figure 5:
Figure 5: The limiting perturbation 2 log_10(ε_s*) = log_10(1 − q_w*) at the fixed point, vs µ, for several α; p_o = 1/2. These are the fixed points of eq. (139). When initialized above the curves, the system diverges from the fully-informative fixed point. When the snr ∆_I = µ²/(p_o(1 − p_o)) tends to 0, the size of the attraction basin shrinks to 0.