Approximate message passing with spectral initialization for generalized linear models

We consider the problem of estimating a signal from measurements obtained via a generalized linear model. We focus on estimators based on approximate message passing (AMP), a family of iterative algorithms with many appealing features: the performance of AMP in the high-dimensional limit can be succinctly characterized under suitable model assumptions; AMP can be tailored to the empirical distribution of the signal entries; and, for a wide class of estimation problems, AMP is conjectured to be optimal among all polynomial-time algorithms. However, a major issue of AMP is that in many models (such as phase retrieval) it requires an initialization correlated with the ground-truth signal and independent of the measurement matrix. Assuming that such an initialization is available is typically not realistic. In this paper, we solve this problem by proposing an AMP algorithm initialized with a spectral estimator. With such an initialization, the standard AMP analysis fails, since the spectral estimator depends in a complicated way on the design matrix. Our main contribution is a rigorous characterization of the performance of AMP with spectral initialization in the high-dimensional limit. The key technical idea is to define and analyze a two-phase artificial AMP algorithm that first produces the spectral estimator, and then closely approximates the iterates of the true AMP. We also provide numerical results that demonstrate the validity of the proposed approach.


Introduction
We consider the problem of estimating a d-dimensional signal x ∈ R^d from n i.i.d. measurements of the form y_i ∼ p(y | ⟨x, a_i⟩), i ∈ {1, ..., n}, where ⟨·, ·⟩ is the scalar product, {a_i}_{1≤i≤n} are given sensing vectors, and the (stochastic) output function p(· | ⟨x, a_i⟩) is a given probability distribution. This is known as a generalized linear model [McC18], and it encompasses many settings of interest in statistical estimation and signal processing [RG01, BB08, YLSV12, EK12]. One notable example is the problem of phase retrieval [Fie82, SEC+15], where y_i = |⟨a_i, x⟩|^2 + w_i, with w_i being noise. Phase retrieval appears in several areas of science and engineering, see e.g. [FD87, Mil90, DJ17], and the last few years have witnessed a surge of interest in the design and analysis of efficient algorithms; see the review [FS20] and the discussion at the end of this section.
Here, we consider generalized linear models (GLMs) in the high-dimensional setting where n, d → ∞ with their ratio tending to a fixed constant, i.e., n/d → δ ∈ (0, ∞). We focus on a family of iterative algorithms known as approximate message passing (AMP). AMP algorithms were first proposed for estimation in linear models [DMM09, BM11], and for estimation in GLMs in [Ran11]. AMP has since been applied to a wide range of high-dimensional statistical estimation problems including compressed sensing [KMS+12, BM12, MAYB13], low-rank matrix estimation [RF12, DM14, KKM+16], group synchronization [PWBM18], and specific instances of GLMs such as logistic regression [SC19] and phase retrieval [SR14, MXM19, MLKZ20].
An appealing feature of AMP is that, under suitable model assumptions, its performance in the high-dimensional limit can be precisely characterized by a succinct deterministic recursion called state evolution [BM11, Bol14, JM13]. Using the state evolution analysis, it has been shown that AMP provably achieves Bayes-optimal performance in some special cases [DJM13, DM14, MV17]. Indeed, a conjecture from statistical physics posits that AMP is optimal among all polynomial-time algorithms. The optimality of AMP for generalized linear models is discussed in [BKM+19].
However, when used for estimation in GLMs, a major issue of AMP is that in many problems (including phase retrieval) it requires an initialization that is correlated with the unknown signal x but independent of the sensing vectors {a_i}. In many cases, it is not realistic to assume that such an initialization is available. For such GLMs, without a correlated initialization, the asymptotic state evolution analysis predicts that the AMP estimates will be uninformative, i.e., their normalized correlation with the signal vanishes in the large system limit.
In this paper, we propose an AMP algorithm initialized with a spectral estimator. The idea of using a spectral estimator for GLMs was introduced in [Li92], and its performance in the high-dimensional limit was recently characterized in [LL19, MM19]. It was shown that the normalized correlation of the spectral estimator with the signal undergoes a phase transition, and for the special case of phase retrieval, the threshold for strictly positive correlation with the signal matches the information-theoretic threshold [MM19].
Our main technical contribution is a novel analysis of AMP with spectral initialization for GLMs, under the assumption that the sensing vectors {a_i} are i.i.d. Gaussian. This yields a rigorous characterization of the performance in the high-dimensional limit (Theorem 1). The analysis of AMP with spectral initialization is far from obvious, since the spectral estimator depends in a non-trivial way on the sensing vectors {a_i}. The existing state evolution analysis for GLMs [Ran11, JM13] crucially depends on the AMP initialization being independent of the sensing vectors, and therefore cannot be directly applied.
At the center of our approach is the design and analysis of an artificial AMP algorithm. The artificial AMP operates in two phases: in the first phase, it performs a power method, so that its iterates approach the spectral initialization of the true AMP; in the second phase, its iterates are designed to remain close to the iterates of the true AMP. The initialization of the artificial AMP is correlated with x, but independent of the sensing vectors {a_i}, which allows us to apply the standard state evolution analysis. Note that the initialization of the artificial AMP is impractical (it requires knowledge of the unknown signal x!). However, this is not an issue, since the artificial AMP is employed only as a proof technique: we prove a state evolution result for the true AMP by showing that its iterates are close to those in the second phase of the artificial AMP.
Initializing AMP with a (different) spectral method has recently been shown to be effective for low-rank matrix estimation [MV17]. However, our proof technique for analyzing spectral initialization for GLMs is different from that of [MV17]. The argument in that paper is specific to the spiked random matrix model and relies on a delicate decoupling argument between the outlier eigenvectors and the bulk. Here, we follow an approach developed in [MTV20], where a specially designed AMP is used to establish the joint empirical distribution of the signal, the spectral estimator, and the linear estimator.
For the case of phase retrieval, a slightly different version of the spectral estimator was proposed in [MXM18] to initialize AMP. A heuristic justification of that initialization was given, but a rigorous characterization of its performance remained open.
We note that for some GLMs, AMP does not require a special initialization that is correlated with the signal x. In Section 3, we give a condition on the GLM output function that specifies precisely when such a correlated initialization is required (see (3.13)). This condition is satisfied by a wide class of GLMs, including phase retrieval. It is in these cases that AMP with spectral initialization is most useful.
Other related work. For the problem of phase retrieval, several algorithmic solutions have been proposed and analyzed in recent years. An inevitably non-exhaustive list includes semi-definite programming relaxations [CSV13, CESV15, CLS15a, WdM15], a convex relaxation operating in the natural domain of the signal [GS18, BR17], alternating minimization [NJS13], Wirtinger Flow [CLS15b, CC17, MWCC20], iterative projections [LGL15], and the Kaczmarz method [Wei15, TV19]. A generalized AMP (GAMP) algorithm was introduced in [SR14], and an AMP algorithm to solve the non-convex problem with ℓ_2 regularization was proposed and analyzed in [MXM19]. Most of the algorithms mentioned above require an initialization correlated with the signal x and, to obtain such an initialization, spectral methods are widely employed.
Beyond the Gaussian setting, spectral methods for phase retrieval with random orthogonal matrices are analyzed in [DBMM20]. Statistical and computational phase transitions in phase retrieval for a large class of correlated real and complex random sensing matrices are analyzed in [MLKZ20], and a general AMP algorithm for rotationally invariant matrices is studied in [Fan20]. Thus, the extension of our techniques to more general sensing models represents an interesting avenue for future research.

Preliminaries
Notation and definitions. Given n ∈ N, we use the shorthand [n] = {1, ..., n}. Given a vector x, we denote by ‖x‖_2 its Euclidean norm. The empirical distribution of a vector x = (x_1, ..., x_d)^T is given by (1/d) Σ_{i=1}^d δ_{x_i}, where δ_{x_i} denotes a Dirac delta mass on x_i. Similarly, the empirical joint distribution of a pair of vectors is the average of the Dirac delta masses on the pairs of their components.

Generalized linear models. Let x ∈ R^d be the signal of interest, and assume that ‖x‖_2^2 = d. The signal is observed via inner products with n sensing vectors (a_i)_{i∈[n]}, with each a_i ∈ R^d having independent Gaussian entries with mean zero and variance 1/d, i.e., (a_i) ∼ i.i.d. N(0, I_d/d). Given g_i = ⟨x, a_i⟩, the components of the observed vector y = (y_1, ..., y_n) ∈ R^n are independently generated according to a conditional distribution p_{Y|G}, i.e., y_i ∼ p_{Y|G}(y_i | g_i). We stack the sensing vectors as rows to define the n × d sensing matrix A, i.e., A = [a_1, ..., a_n]^T. For the special case of phase retrieval, the model is y = |Ax|^2 + w, where the square is taken component-wise and w is a noise vector with independent entries. We consider a sequence of problems of growing dimension d, and assume that, as d → ∞, the sampling ratio n/d → δ, for some constant δ ∈ (0, ∞).
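To fix ideas, the following sketch generates a problem instance under the assumptions above (Gaussian sensing vectors with variance 1/d, ‖x‖_2^2 = d), specialized to the phase retrieval output y = |Ax|^2 + w. The function and variable names are ours, for illustration only.

```python
import numpy as np

def generate_phase_retrieval_instance(d, delta, noise_std=0.0, rng=None):
    """Draw (x, A, y) for noisy phase retrieval y = |Ax|^2 + w.
    x is uniform on the sphere of radius sqrt(d), so ||x||_2^2 = d,
    and A has i.i.d. N(0, 1/d) entries, with n = delta * d rows."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(delta * d)
    x = rng.standard_normal(d)
    x *= np.sqrt(d) / np.linalg.norm(x)
    A = rng.standard_normal((n, d)) / np.sqrt(d)
    g = A @ x                                  # g_i = <a_i, x>, approx N(0, 1)
    y = g**2 + noise_std * rng.standard_normal(n)
    return x, A, y
```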

Spectral initialization.
The spectral estimator x̂_s is the principal eigenvector of the d × d matrix D_n, defined as D_n = A^T Z_s A, with Z_s = diag(T_s(y_1), ..., T_s(y_n)) (2.1), where T_s : R → R is a preprocessing function. We now review some results from [MM19, LL19] on the performance of the spectral estimator in the high-dimensional limit. Let G ∼ N(0, 1), Y ∼ p(· | G), and Z_s = T_s(Y). We will make the following assumptions on Z_s.
Lemma 2.1 ([LL19, MM19]). Let x̂_s be the principal eigenvector of the matrix D_n defined in (2.1), and let λ*_δ be the unique solution of ζ_δ(λ) = φ(λ) for λ > τ, as given by (2.7). If ψ'_δ(λ*_δ) > 0, then the normalized correlation between x̂_s and x converges almost surely to the strictly positive limit given in (2.6); here ψ'_δ and φ' denote the derivatives of the respective functions.
Remark 2.1 (Equivalent characterization). Using the definitions (2.4)-(2.5), the conditions in (2.7) can be restated in an equivalent form. When these conditions are satisfied, the limit of the normalized correlation in (2.6) can be expressed in closed form.

Remark 2.2 (Optimal preprocessing function). In [MM19], the authors derived the preprocessing function minimizing the value of δ necessary to achieve weak recovery, i.e., a strictly positive correlation between x̂_s and x. In particular, let δ_u be defined as in (2.10), with G ∼ N(0, 1), and let T* be the preprocessing function defined in (2.11). Then, by taking T_s = T*, for any δ > δ_u, (2.12) holds almost surely for some ε > 0. Furthermore, for any δ < δ_u, there is no preprocessing function T such that, almost surely, (2.12) holds. For a more formal statement of this result, see Theorem 4 of [MM19]. The preprocessing function that, at a given δ > δ_u, maximizes the correlation between x̂_s and x is also related to T*(y) as defined in (2.11), and it is derived in [LAL19].
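Since D_n is in general indefinite (T_s may take negative values), a simple way to extract its principal eigenvector is to run the power method on a shifted matrix. The sketch below follows this route; the shift and the iteration budget are our choices, not prescriptions from [MM19, LL19].

```python
def spectral_estimator(A, y, T_s, iters=500, rng=None):
    """Power method for the principal eigenvector of D_n = A^T Z_s A,
    with Z_s = diag(T_s(y)). Iterating on D_n + c*I with c >= ||D_n||
    makes the target eigenvalue the largest one in modulus."""
    d = A.shape[1]
    rng = np.random.default_rng(0) if rng is None else rng
    z = T_s(y)
    c = np.abs(z).max() * np.linalg.norm(A, ord=2) ** 2  # bound on ||D_n||
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (z * (A @ v)) + c * v       # (D_n + c*I) v
        v /= np.linalg.norm(v)
    return np.sqrt(d) * v                     # scaled so ||x_s||_2^2 = d
```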
3 Generalized Approximate Message Passing with Spectral Initialization

We make the following additional assumptions on the signal x, the output distribution p_{Y|G}, and the preprocessing function T_s used for the spectral estimator.
and G ∼ N(0, 1). Furthermore, there exists a function q : R × R → R and a random variable V independent of G such that Y = q(G, V). More precisely, for any measurable set A ⊆ Y and almost every g, we have P(Y ∈ A | G = g) = P(q(g, V) ∈ A). We also assume that E{|V|^2} < ∞. For instance, for the phase retrieval model y = |Ax|^2 + w, one can take q(g, v) = g^2 + v, with V distributed as the noise.
(B3) The function T s : R → R is bounded and Lipschitz.
Following the terminology of [Ran11], we refer to the AMP for generalized linear models as GAMP. In each iteration t, the proposed GAMP algorithm produces an estimate x^t of the signal x. The algorithm is defined in terms of a sequence of Lipschitz functions f_t : R → R and h_t : R × R → R, for t ≥ 0, applied component-wise. We initialize using the spectral estimator x̂_s via (3.1)-(3.2), where b_0 = (1/n) Σ_{i=1}^d f_0'(x_i^0), the diagonal matrix Z_s is defined in (2.1), and λ*_δ is given by (2.7). Then, for t ≥ 0, the algorithm computes

u^t = A f_t(x^t) − b_t h_{t−1}(u^{t−1}; y),   (3.3)
x^{t+1} = A^T h_t(u^t; y) − c_t f_t(x^t),   (3.4)

with b_t = (1/n) Σ_{i=1}^d f_t'(x_i^t) and c_t = (1/n) Σ_{i=1}^n h_t'(u_i^t; y_i),   (3.5)

where h_t'(·, ·) denotes the derivative with respect to the first argument.
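A minimal sketch of the iteration (3.3)-(3.4), as reconstructed above, is given below. The exact spectral initialization of u^0 in (3.2) is not reproduced in this text, so the sketch simply starts the memory term at zero; this is our simplification, not the paper's exact initialization.

```python
def gamp(A, y, x0, f, df, h, dh, n_iter=25):
    """Sketch of GAMP (3.3)-(3.4) starting from the spectral estimator x0.
    f(t, x), h(t, u, y): Lipschitz denoisers; df, dh: their derivatives
    (dh with respect to the first argument), all acting component-wise."""
    n, d = A.shape
    x = x0.copy()
    h_prev = np.zeros(n)
    for t in range(n_iter):
        fx = f(t, x)
        b = df(t, x).sum() / n         # b_t = (1/n) sum_{i<=d} f_t'(x_i^t)
        u = A @ fx - b * h_prev        # eq. (3.3)
        c = dh(t, u, y).sum() / n      # c_t = (1/n) sum_{i<=n} h_t'(u_i^t; y_i)
        x = A.T @ h(t, u, y) - c * fx  # eq. (3.4)
        h_prev = h(t, u, y)
    return x
```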
The asymptotic empirical distribution of the GAMP iterates x^t, u^t, for t ≥ 0, can be succinctly characterized via a deterministic recursion, called state evolution. Our main result, Theorem 1, shows that for t ≥ 0, the empirical distributions of u^t and x^t converge in Wasserstein distance W_2 to the laws of the random variables U_t and X_t, respectively, with

U_t = µ_{U,t} G + σ_{U,t} W_{U,t},   (3.6)
X_t = µ_{X,t} X + σ_{X,t} W_{X,t},   (3.7)

where (G, W_{U,t}) ∼ i.i.d. N(0, 1); similarly, X ∼ P_X and W_{X,t} ∼ N(0, 1) are independent. The deterministic parameters (µ_{U,t}, σ_{U,t}, µ_{X,t}, σ_{X,t}) are recursively computed via (3.8), for t ≥ 0. For the spectral initialization in (3.1)-(3.2), with a as defined in (2.6), the recursion is initialized with (3.9).

We state the main result in terms of pseudo-Lipschitz test functions. A function ψ : R^m → R is pseudo-Lipschitz of order 2, i.e., ψ ∈ PL(2), if there is a constant C > 0 such that |ψ(x) − ψ(y)| ≤ C(1 + ‖x‖_2 + ‖y‖_2)‖x − y‖_2 for all x, y ∈ R^m.

Theorem 1. Assume that for t ≥ 0, the functions f_t, h_t are Lipschitz with derivatives that are continuous almost everywhere. Then, the limits (3.11)-(3.12) hold almost surely for any PL(2) function ψ : R × R → R and any t such that σ^2_{X,k} is strictly positive for 0 ≤ k ≤ t. The result (3.11) also holds for (t + 1) = 0. In (3.11) (resp. (3.12)), the expectation is over the independent random variables X ∼ P_X and W_{X,t} ∼ N(0, 1) (resp. G and W_{U,t}), and the parameters (µ_{U,t}, σ_{U,t}, µ_{X,t}, σ_{X,t})_{t≥0} are given by the recursion (3.8) with the initialization (3.9).

We give a sketch of the proof in Section 5 and defer the technical details to the appendices. We now comment on some of the assumptions in the theorem. The assumption ψ'_δ(λ*_δ) > 0 is required to ensure that the spectral initialization x^0 has non-zero correlation with the signal x (Lemma 2.1). From Remark 2.2, we also know that for any sampling ratio δ > δ_u there exists a choice of T_s such that ψ'_δ(λ*_δ) > 0. We also note that, for δ < δ_u, GAMP converges to the "un-informative fixed point" (where the estimate has vanishing correlation with the signal) even if the initial condition has non-zero correlation with the signal; see [MM19, Theorem 5].
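Theorem 1 is straightforward to test empirically: for any PL(2) function ψ, the empirical average of ψ over the pairs (x_i, x_i^{t+1}) should match the scalar expectation in (3.11). The sketch below performs this comparison by Monte Carlo, using the empirical distribution of the signal as a proxy for P_X; ψ is assumed to be vectorized.

```python
def compare_with_state_evolution(x_true, x_iter, mu, sigma, psi,
                                 n_mc=200_000, rng=None):
    """Empirical PL(2) average (1/d) sum_i psi(x_i, x_i^t) versus the
    state evolution prediction E{psi(X, mu X + sigma W)}; X is sampled
    from the empirical distribution of x_true as a stand-in for P_X."""
    rng = np.random.default_rng() if rng is None else rng
    lhs = np.mean(psi(x_true, x_iter))
    X = rng.choice(x_true, size=n_mc)
    W = rng.standard_normal(n_mc)
    rhs = np.mean(psi(X, mu * X + sigma * W))
    return lhs, rhs  # the two values should be close for large d
```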
There is no loss of generality in assuming the sign of x̂_s to be such that ⟨x̂_s, x⟩ ≥ 0. Indeed, if the sign were chosen otherwise, the theorem would hold with the state evolution initialization in (3.9) negated, i.e., with µ_{X,0} replaced by −µ_{X,0}. We also remark that the assumption σ^2_{X,k} > 0 for 0 ≤ k ≤ t excludes a degenerate case: if σ^2_{X,k} = 0 for some k, then X_k = µ_{X,k} X. That is, we can perfectly estimate x from x^k, and thus terminate the algorithm after iteration k.
Let us finally remark that the result in (3.11) is equivalent to the statement that the empirical joint distribution of (x, x^{t+1}) converges almost surely in Wasserstein distance (W_2) to the joint law of (X, µ_{X,t+1} X + σ_{X,t+1} W). This follows from the fact that a sequence of distributions P_n with finite second moment converges in W_2 to P if and only if P_n converges weakly to P and ∫ ‖a‖_2^2 dP_n(a) → ∫ ‖a‖_2^2 dP(a); see [Vil08, Definition 6.7, Theorem 6.8].
When does GAMP require spectral initialization? For the GAMP to give meaningful estimates, we need either x^0 or x^1 to have strictly non-zero asymptotic correlation with x. To see when this can be arranged without a special initialization, consider the linear estimator x̂_L(ξ) := A^T ξ(y), for some function ξ : R → R that acts component-wise on y. If there exists a function ξ such that the asymptotic normalized correlation between x̂_L(ξ) and x is strictly non-zero, then AMP does not require a special initialization (spectral or otherwise) that is correlated with x. Indeed, in this case we can replace the initialization in (3.1)-(3.2) by x^0 = 0, u^0 = 0 (by taking f_0 = 0), and let h_0(u^0; y) = √δ ξ(y). This gives x^1 = √δ A^T ξ(y) ∝ x̂_L(ξ), which has strictly non-zero asymptotic correlation with x. This ensures that |µ_{X,1}| > 0, and the standard AMP analysis [JM13] directly yields a state evolution result similar to Theorem 1.
The output function p_{Y|G} determines whether a non-trivial linear estimator exists for the GLM. If E{G | Y} = 0 almost surely (condition (3.13)), then the correlation between A^T ξ(y) and x will asymptotically vanish for any choice of ξ. Indeed, the asymptotic normalized correlation is proportional to E{G ξ(Y)} = E{ξ(Y) E{G | Y}}. The condition (3.13) holds for many output functions of interest, including all distributions p_{Y|G} that are even in G (and, therefore, including phase retrieval). It is for these models that spectral initialization is particularly useful.
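The relevant quantity is easy to estimate numerically: since ⟨x̂_L(ξ), x⟩/d → δ E{G ξ(Y)}, a Monte Carlo evaluation of E{G ξ(Y)} reveals whether a given ξ yields a non-trivial linear estimator. In the sketch below, the noise variable V is taken to be standard normal purely for illustration.

```python
def linear_estimator_overlap(q, xi, n_mc=1_000_000, rng=None):
    """Monte Carlo estimate of E{G xi(Y)} with Y = q(G, V). The asymptotic
    normalized correlation of A^T xi(y) with x is proportional to this
    quantity; if it vanishes for every xi, a spectral (or otherwise
    correlated) initialization is needed. V ~ N(0, 1) is our assumption."""
    rng = np.random.default_rng() if rng is None else rng
    G = rng.standard_normal(n_mc)
    V = rng.standard_normal(n_mc)
    return np.mean(G * xi(q(G, V)))

# Example: for phase retrieval, q(g, v) = g**2 + v is even in g, so the
# estimate is ~0 for any xi, consistent with condition (3.13).
```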
Bayes-optimal GAMP. Applying Theorem 1 to the PL(2) function ψ(x, y) = (x − f_t(y))^2, we obtain the asymptotic mean-squared error (MSE) of the GAMP estimate f_t(x^t). In formulas, for t ≥ 0, almost surely,

lim_{d→∞} (1/d) ‖x − f_t(x^t)‖_2^2 = E{(X − f_t(X_t))^2}.   (3.14)

If the limiting empirical distribution P_X of the signal is known, then the choice of f_t that minimizes the MSE in (3.14) is the posterior mean

f*_t(s) = E{X | X_t = s} = E{X | µ_{X,t} X + σ_{X,t} W_{X,t} = s}.   (3.15)

Similarly, applying the theorem to the PL(2) functions ψ(x, y) = x f_t(y) and ψ(x, y) = f_t(y)^2, we obtain the asymptotic normalized correlation with the signal. In formulas, for t ≥ 0, almost surely,

lim_{d→∞} ⟨x, f_t(x^t)⟩ / (‖x‖_2 ‖f_t(x^t)‖_2) = E{X f_t(X_t)} / (E{f_t(X_t)^2})^{1/2}.   (3.16)

For fixed (µ_{X,t}, σ^2_{X,t}), the normalized correlation in (3.16) is maximized by taking f_t = c f*_t, for any c ≠ 0. This choice also maximizes the ratio in (3.17). We now specify the choice of h_t(u; y) that maximizes the ratio µ^2_{X,t+1}/σ^2_{X,t+1} for fixed (µ_{U,t}, σ^2_{U,t}).

Proposition 3.1. Assume the setting of Theorem 1. For a given (µ_{U,t}, σ^2_{U,t}), the ratio µ^2_{X,t+1}/σ^2_{X,t+1} is maximized when h_t(u; y) = c h*_t(u; y), where c ≠ 0 is any constant and h*_t is given by (3.18), with W ∼ N(0, 1). In (3.18), the random variables U_t and Y are conditionally independent given G, with Y ∼ p_{Y|G}(· | G).

The optimal choice for h*_t in Proposition 3.1 was derived in [Ran11] by approximating the belief propagation equations. For completeness, we provide a self-contained proof in Appendix A; the proof also gives an explicit expression for the resulting maximal ratio. As the choices f*_t, h*_t maximize the signal-to-noise ratios µ^2_{U,t}/σ^2_{U,t} and µ^2_{X,t+1}/σ^2_{X,t+1}, respectively, we refer to this algorithm as Bayes-optimal GAMP. We note that, to apply Theorem 1 to the Bayes-optimal GAMP, we need f*_t, h*_t to be Lipschitz. This holds under relatively mild conditions on P_X and p_{Y|G} [MV17, Lemma F.1].
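For concreteness, here are the posterior-mean denoisers (3.15) worked out for two priors used in the experiments below. Both formulas follow from a direct computation of E{X | µX + σW = s} and are stated here as derived examples, not quoted from the text.

```python
def f_star_gaussian(s, mu, sigma):
    """E{X | mu X + sigma W = s} for X ~ N(0, 1): linear shrinkage."""
    return mu * s / (mu**2 + sigma**2)

def f_star_binary(s, mu, sigma):
    """E{X | mu X + sigma W = s} for P_X(1) = P_X(-1) = 1/2."""
    return np.tanh(mu * s / sigma**2)
```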

Numerical Simulations
We now illustrate the performance of the GAMP algorithm with spectral initialization via numerical examples. For concreteness, we focus on noiseless phase retrieval, where y = |Ax|^2.

Gaussian prior. In Figure 1, x is chosen uniformly at random on the d-dimensional sphere with radius √d (hence, P_X is standard Gaussian). We take d = 8000, and the numerical simulations are averaged over n_sample = 50 independent trials. The performance of an estimate x̂ is measured via its normalized squared scalar product with the signal x. The black points are obtained by estimating x via the spectral method, using the optimal pre-processing function T_s reported in Eq. (137) of [MM19]. The empirical results match the black curve, which gives the best possible squared correlation in the high-dimensional limit, as given by Theorem 1 of [LAL19]. The red points are obtained by running the GAMP algorithm (3.3)-(3.4) with the spectral initialization (3.1)-(3.2). The function f_t is chosen to be the identity, and h_t = √δ h*_t, for h*_t given by Proposition 3.1. The algorithm is run until the normalized squared difference between successive iterates is small. As predicted by Theorem 1, the numerical simulations agree well with the state evolution curve in red, which is obtained by computing the fixed point of the recursion (3.8) initialized with (3.9).
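The stopping rule described above can be implemented generically; the sketch below wraps any iteration map (such as one GAMP step) and is our illustration of the criterion, with the tolerance chosen arbitrarily.

```python
def run_to_convergence(step, x0, tol=1e-9, max_iter=200):
    """Iterate x^{t+1} = step(t, x^t) until the normalized squared
    difference between successive iterates drops below tol."""
    x = x0
    for t in range(max_iter):
        x_next = step(t, x)
        if np.sum((x_next - x) ** 2) / x.size < tol:
            return x_next, t + 1
        x = x_next
    return x, max_iter
```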
Bayes-optimal GAMP for a binary-valued prior. Assume now that each entry of the signal x takes value in {−1, 1}, with P_X(1) = 1 − P_X(−1) = p. In Figure 2, we take p = 1/2, and compare the performance of the GAMP algorithm with spectral initialization for two different choices of the function f_t: f_t equal to the identity (in blue) and f_t = f*_t (in red), where f*_t is the Bayes-optimal choice (3.15). By computing the conditional expectation for p = 1/2, we have f*_t(s) = tanh(µ_{X,t} s / σ^2_{X,t}). The rest of the setting is analogous to that of Figure 1. There is a significant performance gap between the Bayes-optimal choice f_t = f*_t and the choice f_t(x) = x. As in the previous experiment, we observe very good agreement between the GAMP algorithm and the state evolution prediction of Theorem 1. We remark that for this setting, the information-theoretically optimal overlap (computed using the formula in [BKM+19]) is 1 for all δ > 0. Since the components of x are in {−1, 1}, there are 2^d choices for x, and the information-theoretically optimal estimator picks the choice that is consistent with the observations. (Since A is Gaussian, with high probability this solution is unique.)

Coded diffraction patterns. We consider the model of coded diffraction patterns described in Section 7.2 of [MM19]. Here the signal x is the image of Figure 3a, and it can be viewed as a d_1 × d_2 × 3 array with d_1 = 820 and d_2 = 1280. The sensing vectors are given by (4.2). To obtain non-integer values of δ, we set to 0 a suitable fraction of the vectors a_r, chosen uniformly at random. In this model, the scalar product ⟨x_j, a_r⟩ can be computed with an FFT algorithm. Furthermore, in order to evaluate the principal eigenvector for the spectral initialization, we use a power method which stops if either the number of iterations reaches the maximum value of 100000, or the modulus of the scalar product between the estimate at the current iteration T and the estimate at iteration T − 10 is larger than 1 − 10^{−7}.
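The lagged stopping rule for the power method can be sketched as follows; matvec stands for one application of D_n (implemented via FFTs in the coded diffraction setting), and the function name is ours.

```python
from collections import deque

def power_method_lagged_stop(matvec, d, max_iter=100_000, lag=10, tol=1e-7):
    """Power method with the stopping rule above: stop when the modulus of
    the scalar product between the unit-norm estimates at iterations T and
    T - 10 exceeds 1 - 1e-7, or after 100000 iterations."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    v /= np.linalg.norm(v)
    past = deque([v], maxlen=lag + 1)
    for _ in range(max_iter):
        v = matvec(v)
        v /= np.linalg.norm(v)
        past.append(v)
        if len(past) == lag + 1 and abs(np.vdot(past[0], v)) > 1 - tol:
            break
    return v
```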
The GAMP algorithm with spectral initialization for the complex-valued setting is described in Appendix D. Figure 3 shows a visual representation of the results. The improvement achieved by the GAMP algorithm over the spectral estimator is impressive, with GAMP achieving full recovery already at δ = 2.4. A numerical comparison of the performance of the two methods is given in Figure 5 in Appendix D. We emphasize that the state evolution result of Theorem 1 is only valid for Gaussian sensing matrices; extending it to structured matrices such as coded diffraction patterns is an interesting direction for future work.

Sketch of the Proof of Theorem 1
We give an outline of the proof here, and provide the technical details in the appendices.
The artificial GAMP algorithm. We construct an artificial GAMP algorithm, whose iterates are denoted by x̃^t, ũ^t, for t ≥ 0. Starting from an initialization (x̃^0, ũ^0), for t ≥ 0 we iteratively compute the iterates according to (5.1)-(5.2). For t ≥ 0, the functions f̃_t : R → R and h̃_t : R × R → R are Lipschitz, and will be specified below.
The scalars c̃_t and b̃_{t+1} are defined in (5.3), where h̃_t' denotes the derivative with respect to the first argument. The iteration is initialized as follows. Choose any α ∈ (0, 1) and a standard Gaussian vector n ∼ N(0, I_d) that is independent of x and A. Then, set x̃^0 = α x + √(1 − α^2) n, cf. (5.4). The artificial GAMP is divided into two phases. In the first phase, which lasts up to iteration T, the functions f̃_t, h̃_t, for 0 ≤ t ≤ (T − 1), are chosen such that, as T → ∞, the iterate x̃^T approaches the initialization x^0 of the true GAMP algorithm defined in (3.1). In the second phase, the functions f̃_t, h̃_t, for t ≥ T, are chosen to match those of the true GAMP. The key observation is that a state evolution result for the artificial GAMP follows directly from the standard analysis of GAMP [JM13], since the initialization x̃^0 is independent of A. By showing that, as T → ∞, the iterates and the state evolution parameters of the artificial GAMP approach the corresponding quantities of the true GAMP, we prove that the state evolution result of Theorem 1 holds.
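Assuming the form of (5.4) reconstructed above, the artificial initialization is one line of code; we include it only to emphasize that it uses the unknown signal x and is therefore a proof device, not an algorithm.

```python
def artificial_initialization(x, alpha, rng=None):
    """Impractical initialization of the artificial GAMP (cf. (5.4) as
    reconstructed above): correlated with the signal x, independent of A."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(x.size)
    return alpha * x + np.sqrt(1.0 - alpha**2) * noise
```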
We now specify the functions used in the artificial GAMP. For 0 ≤ t ≤ (T − 1), the functions f̃_t, h̃_t are set as in (5.5), where T_s is the pre-processing function used for the spectral estimator, λ*_δ is the unique solution of ζ_δ(λ) = φ(λ) for λ > τ (also given by (2.7)), and (β_t)_{t≥0} are constants coming from the state evolution recursion defined below. Furthermore, for t ≥ T, the functions are set as in (5.6) to match those of the true GAMP. With these choices of f̃_t, h̃_t, the coefficients c̃_t and b̃_t in (5.3) take the form (5.7). Since the initialization x̃^0 in (5.4) is independent of A, the state evolution result of [JM13] can be applied to the artificial GAMP. This result, formally stated in Proposition B.1 in Appendix B.1, implies that for t ≥ 0, the empirical distributions of x̃^t and ũ^t converge in W_2 distance to the laws of the random variables X̃_t = µ_{X̃,t} X + σ_{X̃,t} W_{X̃,t} and Ũ_t = µ_{Ũ,t} G + σ_{Ũ,t} W_{Ũ,t}, respectively, cf. (5.8). Here W_{X̃,t}, W_{Ũ,t} are standard normal and independent of X and G, respectively. The state evolution recursion defining the parameters (µ_{X̃,t}, σ_{X̃,t}, µ_{Ũ,t}, σ_{Ũ,t}, β_t) has the same form as (3.8), except that we use the functions defined in (5.5) for 0 ≤ t ≤ (T − 1), and the functions in (5.6) for t ≥ T. The detailed expressions are given in Appendix B.1.
Analysis of the first phase. The first phase of the artificial GAMP is designed so that its output vectors after T iterations, (x̃^T, ũ^T), are close to the initialization (x^0, u^0) of the true GAMP algorithm given by (3.1)-(3.2). This part of the algorithm is similar to the GAMP used in [MTV20] to approximate the spectral estimator x̂_s. In particular, the state evolution recursion of the first phase (given in (B.2)) converges as T → ∞ to the fixed point characterized in Lemma B.2.

Analysis of the second phase. The second phase of the artificial GAMP is designed so that its iterates x̃^{T+t}, ũ^{T+t} are close to x^t, u^t, respectively, for t ≥ 0, and the corresponding state evolution parameters are also close. In particular, in order to prove Theorem 1, we first analyze a slightly modified version of the true GAMP algorithm in (3.3)-(3.4), where the 'memory' coefficients b_t and c_t in (3.5) are replaced by deterministic values obtained from state evolution. The iterates of this modified GAMP, denoted by x̂^t, û^t, start from the same initialization as the true GAMP and are computed via (5.13)-(5.14), where b̂_0 = (1/δ) E{f_0'(X_0)}. Here, for t ≥ 0, the deterministic memory coefficients b̂_t and ĉ_t are given by (5.15), namely b̂_t = (1/δ) E{f_t'(X_t)} and ĉ_t = E{h_t'(U_t; Y)}, where X_t, U_t are defined in (3.6)-(3.7).
We have now defined three different GAMP iterations: the original one with iterates (x^t, u^t) given by (3.3)-(3.4), the modified one above with iterates (x̂^t, û^t), and the artificial GAMP with iterates (x̃^t, ũ^t) given by (5.1)-(5.2). Lemma B.5 in Appendix B.3 proves that, for each t ≥ 0, (i) the iterates (x̃^{t+T}, ũ^{t+T}) are close to (x̂^t, û^t) for sufficiently large T, and (ii) the corresponding state evolution parameters are also close. We then use this lemma to prove Theorem 1 in Appendix B.4. In particular, we show that, almost surely, the iterates in (5.13)-(5.14) have the same asymptotic empirical distribution as those of the original version in (3.3)-(3.4), cf. (5.16).
A Proof of Proposition 3.1

where the last equality is due to (A.6), and step (i) holds due to Stein's lemma. Finally, we obtain (A.2) from (A.1), where step (ii) again holds due to Stein's lemma. This completes the proof of the proposition.

B Proof of the Main Result

B.1 The Artificial GAMP Algorithm
The state evolution parameters for the artificial GAMP are recursively defined as follows. Recall from (5.8) that X̃_t = µ_{X̃,t} X + σ_{X̃,t} W_{X̃,t} and Ũ_t = µ_{Ũ,t} G + σ_{Ũ,t} W_{Ũ,t}. Using (5.4), the state evolution is initialized with µ_{X̃,0} = α and σ^2_{X̃,0} = 1 − α^2 (B.1). For 0 ≤ t ≤ (T − 1), the state evolution parameters are iteratively computed by using the functions defined in (5.5) in the recursion (3.8); the resulting expressions are given in (B.2). Here we recall that G ∼ N(0, 1), and the equality in (2.7) is used to obtain the expression for µ_{X̃,t+1}. For t ≥ T, the state evolution parameters are given by (B.3).

Proposition B.1 (State evolution for the artificial GAMP). For any PL(2) function ψ : R^2 → R, the following holds almost surely for t ≥ 1: the empirical averages (1/d) Σ_i ψ(x_i, x̃_i^t) and (1/n) Σ_i ψ(g_i, ũ_i^t) converge to E{ψ(X, X̃_t)} and E{ψ(G, Ũ_t)}, respectively. Here X ∼ P_X and Y ∼ p_{Y|G}, with G ∼ N(0, 1). The random variables X̃_t, Ũ_t are defined in (5.8).
The proposition follows directly from the state evolution result of [JM13], since the initialization x̃^0 of the artificial GAMP is independent of A.

B.2 Analysis of the First Phase
Lemma B.2 (Fixed point of state evolution for first phase). Consider the setting of Theorem 1. Then, the state evolution recursion for the first phase, given by (B.1)-(B.2), converges as T → ∞ to a fixed point expressed in terms of the quantity a defined in (2.8).

Lemma B.3 (Convergence to spectral estimator).
Consider the setting of Theorem 1, and consider the first phase of the artificial GAMP iteration, given by (5.1)-(5.2) with f̃_t and h̃_t defined in (5.5). Then, (B.9) holds. Furthermore, for any PL(2) function ψ : R × R → R, (B.10) holds almost surely. Here X ∼ P_X and W ∼ N(0, 1) are independent.
Proof. As in the proof of the previous result, let Z = Z_s/(λ*_δ − Z_s) and note that (B.7)-(B.8) hold. Then, the assumptions of Lemma 5.4 in [MTV20] are satisfied, with the only difference being the initialization of the GAMP iteration (cf. (5.4) in this paper and (5.4) in [MTV20]). However, it is straightforward to verify that this difference in the initialization does not affect the proof of Lemma 5.4 in [MTV20]. Thus, (B.9) follows from (5.87) of [MTV20], and (B.10) follows by taking k = 2 in (5.31) of [MTV20].
We will also need the following result on the convergence of the GAMP iterates.
Lemma B.4 (Convergence of GAMP iterates). Consider the first phase of the artificial GAMP iteration, given by (5.1)-(5.2) with f̃_t and h̃_t defined in (5.5). Then, almost surely, the normalized squared differences between successive iterates vanish, i.e., lim_{T→∞} lim_{d→∞} ‖x̃^T − x̃^{T−1}‖_2^2/d = 0 and lim_{T→∞} lim_{d→∞} ‖ũ^T − ũ^{T−1}‖_2^2/d = 0. Though the initialization of the GAMP in [MTV20] is different from (5.4), the proof of Lemma B.4 is the same as that of Lemma 5.3 in [MTV20], since it only relies on µ_{X̃,0} = α being strictly non-zero.

B.3 Analysis of the Second Phase
Lemma B.5 (Second phase approximates the true GAMP). Consider the setting of Theorem 1. Then, for t ≥ 0 such that σ^2_{X,k} > 0 for 0 ≤ k ≤ t, the statements (B.13)-(B.16) hold. The limits in (B.14) and (B.16) also hold for t + 1 = 0.
Proof. We will use κ_t, c_t, γ_t to denote generic positive constants which depend on t, but not on n, d, or ε. The values of these constants may change throughout the proof.
For brevity, we write ∆_{µ,t}, ∆_{σ,t} for (µ_{X,t} − µ_{X̃,t+T}) and (σ_{X,t} − σ_{X̃,t+T}), respectively. By the induction hypothesis, given any ε > 0, for T sufficiently large we have |∆_{µ,t}| ≤ κ_t ε and |∆_{σ,t}| ≤ κ_t ε. Since σ_{X,t} is strictly positive, κ_t is finite and bounded above. From (3.8) we have the expression for µ_{U,t}. Recalling that f_t is Lipschitz and letting L_t denote its Lipschitz constant, we obtain the bound (B.19), where we have used E{|X|} ≤ (E{X^2})^{1/2} = 1. Noting that E{|W_{X,t}|} = √(2/π), from (B.19) it follows that, for sufficiently large T, the difference (B.21) is at most κ_t ε.

Next consider σ^2_{U,t}. From (3.8), we have the expression (B.24). Furthermore, as W_{X,t} is equal in distribution to W_{X,t+T} and independent of X, we also have (B.25)-(B.26). Using the reverse triangle inequality, we obtain (B.27), where the last inequality follows from (B.21). Similarly, we obtain (B.28). Using (B.27) we bound the first term, and using (B.28) we bound the second. From (3.8) and (B.3), we note the relation between the state evolution parameters of the two recursions; using this in (B.25)-(B.26), we obtain (B.33). Furthermore, as f_t is Lipschitz, from (B.31) and the induction hypothesis we have the bounds (B.34)-(B.35) for some constant c_t. Using (B.24), (B.34) and (B.35) in (B.33), we conclude that, for sufficiently large T, |σ^2_{U,t} − σ^2_{Ũ,t+T}| ≤ κ_t ε.

Next, we show that if (B.13) holds for some t ≥ 0 and σ^2_{X,k} > 0 for k ≤ t, then the limits in (B.37) hold. We denote the Lipschitz constant of h_t by L̃_t, and write ∆'_{µ,t}, ∆'_{σ,t} for (µ_{U,t} − µ_{Ũ,t+T}) and (σ_{U,t} − σ_{Ũ,t+T}), respectively. The induction hypothesis (B.13) implies that, for sufficiently large T, the bounds in (B.39) hold. We note that σ_{U,t} > 0 since σ_{X,t} > 0. Indeed, from the discussion leading to (3.17), for fixed µ_{X,t}, σ_{X,t}, the smallest possible ratio σ^2_{U,t}/µ^2_{U,t} is achieved by the Bayes-optimal choice f_t = c f*_t, where f*_t(X_t) = E{X | X_t}. Furthermore, from (3.17), in order for σ_{U,t} = 0, we need E{E{X | X_t}^2} = 1. From Jensen's inequality, E{E{X | X_t}^2} ≤ E{X^2} = 1, with equality only if X is a function of X_t, which is impossible when σ_{X,t} > 0. Therefore σ_{U,t} > 0, and γ_t in (B.39) is strictly positive.
Figure 5 shows the performance with coded diffraction pattern sensing vectors, given by (4.2). The signal x is the image in Figure 3a. The red points in Figure 5 are obtained by running the complex GAMP algorithm with spectral initialization, as given in (D.1)-(D.4). We perform n_sample = 5 independent trials and show error bars at 1 standard deviation. For comparison, the black points correspond to the empirical performance of the spectral method alone, and the black curve gives the theoretical prediction for the optimal squared correlation for Gaussian sensing vectors (see Theorem 1 of [LAL19]).

Figure 1: Performance comparison between GAMP with spectral initialization (in red) and the spectral method alone (in black) for a Gaussian prior P_X = N(0, 1). The solid lines are the theoretical predictions of Theorem 1 for GAMP with spectral initialization, and of Lemma 2.1 for the spectral method. Error bars indicate one standard deviation around the empirical mean.
Figure 2: Performance comparison between two different choices of f_t for a binary prior P_X(1) = P_X(−1) = 1/2. The Bayes-optimal choice f_t = f*_t (in red) has a lower threshold compared to f_t equal to the identity (in blue).

Figure 3: Visual comparison between the reconstruction of the GAMP algorithm with spectral initialization and that of the spectral method alone, for measurements given by coded diffraction patterns.

From Lemma B.4, we know that lim_{T→∞} lim_{d→∞} ‖ũ^{T−1} − ũ^{T−2}‖_2^2/d = 0 almost surely. We now show that the other terms on the RHS of (B.60) are bounded almost surely. Recall from (5.7) that b̃_T = (1/n) Σ_{i=1}^d f̃_0'(x̃_i^T). Proposition B.1 guarantees that the empirical distribution of x̃^t converges to the law of X̃_t = µ_{X̃,t} X + σ_{X̃,t} W. Since f̃_0' is Lipschitz, Lemma C.1 in Appendix C therefore implies that, almost surely, b̃_T converges to (1/δ) E{f̃_0'(µ_{X̃,T} X + σ_{X̃,T} W)}, cf. (B.61). We have thus shown via (B.77) that the limit (B.90) holds almost surely. To complete the proof via induction, we need to show that if (B.87) and (B.91) hold with (t + 1) replaced by t for some t > 0, then the corresponding limits hold almost surely at t + 1.

Figure 5: Performance comparison between complex GAMP with spectral initialization (in red) and the spectral method alone (in black) for a model of coded diffraction patterns.
Figure 4 shows the performance of GAMP with spectral initialization when the signal x is uniform on the d-dimensional complex sphere with radius √d, and the sensing vectors (a_i) ∼ i.i.d. CN(0, I_d/d). Figure 5 shows the performance with coded diffraction pattern sensing vectors, given by (4.2). The signal x is the image in Figure 3a, which is a d_1 × d_2 × 3 array with d_1 = 820 and d_2 = 1280. The three components x_j ∈ R^d (j ∈ {1, 2, 3} and d = d_1 · d_2) are treated separately, and the performance is measured via the average squared normalized scalar product (1/3) Σ_{j=1}^3 |⟨x̂_j, x_j⟩|^2 / (‖x̂_j‖_2^2 ‖x_j‖_2^2). In (4.2), the (d_ℓ(t_1, t_2))'s are i.i.d. and uniform in {1, −1, i, −i}, and the index r ∈ [n] is associated to a pair (ℓ, k).

By using (B.46) and (B.39), we can upper bound the difference E{h_t(U_t; Y)^2} − E{h_t(Ũ_{t+T}; Y)^2}, and hence the RHS of (B.45), by κ_{t+1} ε for sufficiently large T. This completes the proof of the second limit in (B.37).
As T → ∞, the random variable µ_{Ũ,T+t} G + σ_{Ũ,T+t} W_{U,t} converges in distribution to µ_{U,t} G + σ_{U,t} W_{U,t}, cf. (B.43). Thus lim_{T→∞} lim_{d→∞} S_{3c} = 0 almost surely. Using the results above in (B.64), we have shown the desired limit. To complete the proof for the base case, we show that the term inside the brackets in (B.50) is finite almost surely as n → ∞. First, by assumption (B2), we have lim_{n→∞} ‖y‖_2^2/n = E{Y^2} almost surely. Furthermore, by Proposition B.1, we almost surely have the limit in (B.79), where L̃_t is the Lipschitz constant of the function h̃_t. Since the operator norm of A is bounded almost surely as d → ∞, by the induction hypothesis (B.76) the required limit holds. Finally, Proposition B.1 guarantees that the joint empirical distribution of (ũ^{T+t}, y) converges to the law of (Ũ_{T+t}, Y), which yields that the empirical memory coefficient converges to E{h_t'(µ_{U,t} G + σ_{U,t} W_{U,t}, Y)} = ĉ_t almost surely.