Nearly optimal bounds for the global geometric landscape of phase retrieval

The phase retrieval problem is concerned with recovering an unknown signal $\mathbf{x} \in \mathbb{R}^n$ from a set of magnitude-only measurements $y_j=|\langle \mathbf{a}_j,\mathbf{x} \rangle|, \; j=1,\ldots,m$. A natural least squares formulation can be used to solve this problem efficiently even with random initialization, despite the non-convexity of the loss function. One way to explain this surprising phenomenon is the benign geometric landscape: (1) all local minimizers are global; and (2) the objective function has a negative curvature around each saddle point and local maximizer. In this paper, we show that $m=O(n \log n)$ Gaussian random measurements are sufficient to guarantee that the loss function of a commonly used estimator has such a benign geometric landscape with high probability. This is a step toward answering the open problem posed by Sun-Qu-Wright, in which the authors suggest that $O(n \log n)$ or even $O(n)$ measurements are enough to guarantee the favorable geometric property.


Background
In a prototypical phase retrieval problem, one is interested in recovering an unknown signal $\mathbf{x} \in \mathbb{R}^n$ from a series of magnitude-only measurements
$$y_j=|\langle \mathbf{a}_j,\mathbf{x} \rangle|, \quad j=1,\ldots,m, \qquad (1)$$
where $\mathbf{a}_j \in \mathbb{R}^n$, $j = 1, \ldots, m$ are given vectors and $m$ is the number of measurements. This problem is of fundamental importance in numerous areas of physics and engineering, such as X-ray crystallography [17,24], microscopy [23], astronomy [10], coherent diffractive imaging [16,28] and optics [34], where optical detectors can only record the magnitude of signals while losing the phase information. Despite its simple mathematical formulation, it has been shown that reconstructing a finite-dimensional discrete signal from the magnitude of its Fourier transform is in general an NP-complete problem [27].
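As a concrete illustration, the Gaussian measurement model above can be simulated in a few lines. This is a minimal sketch; the dimensions `n`, `m` and the seed are arbitrary choices for the simulation, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 400                    # signal dimension and number of measurements

x = rng.standard_normal(n)        # the unknown signal (known here only to simulate)
x /= np.linalg.norm(x)
A = rng.standard_normal((m, n))   # rows a_j ~ N(0, I_n)
y = np.abs(A @ x)                 # magnitude-only measurements y_j = |<a_j, x>|

# The measurements are sign-invariant: x and -x produce identical data,
# which is exactly the global-phase ambiguity of real phase retrieval.
assert np.allclose(y, np.abs(A @ (-x)))
```

The final assertion makes the global-phase ambiguity explicit: any recovery guarantee can only identify $\mathbf{x}$ up to sign.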
Due to the practical ubiquity of the phase retrieval problem, many algorithms have been designed for it. For example, based on the "matrix-lifting" technique, the phase retrieval problem can be recast as a low-rank matrix recovery problem. Using convex relaxation, one can show that the matrix recovery problem under suitable conditions is equivalent to a convex optimization problem [5,7,33]. However, since the matrix-lifting technique involves semidefinite programming over $n \times n$ matrices, the computational cost is prohibitive for large-scale problems. In contrast, many non-convex algorithms bypass the lifting step and operate directly on the lower-dimensional ambient space, making them much more computationally efficient. Early non-convex algorithms were mostly based on the technique of alternating projections, e.g. Gerchberg-Saxton [16] and Fienup [12]. The main drawback, however, is the lack of theoretical guarantees. Later, Netrapalli et al. [25] proposed the AltMinPhase algorithm based on a technique known as spectral initialization. They proved that the algorithm converges linearly to the true solution with $O(n \log^3 n)$ resampled Gaussian random measurements. This work led to several other non-convex algorithms based on spectral initialization. A common thread is to first choose a good initial guess through spectral initialization, and then solve an optimization model through gradient descent, as in [6,8,13,18,31,35,36]. We refer the reader to the survey papers [9,19,28] for accounts of recent developments in the theory, algorithms and applications of phase retrieval.
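The "spectral initialization + gradient descent" template can be sketched concretely for the real Gaussian model. The function names, step size and iteration count below are illustrative choices, not taken from any of the cited papers:

```python
import numpy as np

def spectral_init(A, y):
    """Top eigenvector of Y = (1/m) sum_j y_j^2 a_j a_j^T.
    For ||x|| = 1, E[Y] = I + 2 x x^T, so the leading eigenvector
    aligns with +-x; scale by the norm estimate sqrt(mean(y^2))."""
    m = len(y)
    Y = (A * (y**2)[:, None]).T @ A / m
    _, V = np.linalg.eigh(Y)          # eigenvalues in ascending order
    return np.sqrt(np.mean(y**2)) * V[:, -1]

def gradient_descent(A, y, z, step=0.02, iters=3000):
    """Gradient descent on F(z) = (1/(2m)) sum_j (y_j^2 - (a_j^T z)^2)^2."""
    m = len(y)
    for _ in range(iters):
        Az = A @ z
        z = z - step * (2 / m) * (A.T @ ((Az**2 - y**2) * Az))
    return z

rng = np.random.default_rng(0)
n, m = 20, 400
x = rng.standard_normal(n); x /= np.linalg.norm(x)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)

z = gradient_descent(A, y, spectral_init(A, y))
dist = min(np.linalg.norm(z - x), np.linalg.norm(z + x))  # sign-invariant error
```

With a generous oversampling ratio `m / n`, the spectral initializer lands close to $\pm\mathbf{x}$ and plain gradient descent then converges to the target up to global sign.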

Prior arts and motivation
As stated earlier, producing a good initial guess via carefully designed initialization seems to be a prerequisite for prototypical non-convex algorithms to succeed with good theoretical guarantees. A natural and fundamental question is: is it possible for non-convex algorithms to achieve successful recovery with a random initialization?
Recently, Sun et al. [29] carried out a deep study of the global geometric structure of the loss function
$$F(\mathbf{z}) = \frac{1}{2m}\sum_{j=1}^m \left( y_j^2 - (\mathbf{a}_j^\top \mathbf{z})^2 \right)^2, \qquad (2)$$
where $y_j$ are the measurements given in (1). They proved that the loss function does not possess any spurious local minima under $O(n \log^3 n)$ Gaussian random measurements. More specifically, it was shown in [29] that all minimizers of $F(\mathbf{z})$ coincide with the target signal $\mathbf{x}$ up to a global phase, and $F(\mathbf{z})$ has a negative directional curvature around each saddle point. Thanks to this benign geometric landscape, any algorithm that can avoid strict saddle points converges to the true solution with high probability. A trust-region method was employed in [29] to find the global minimizers with random initialization. The results in [29] require $m \geq O(n \log^3 n)$ samples to guarantee the favorable geometric property and efficient recovery. On the other hand, based on ample numerical evidence, the authors of [29] conjectured that the optimal sampling complexity could be $O(n \log n)$ or even $O(n)$ to guarantee the benign landscape of the loss function $F(\mathbf{z})$ (cf. p. 1160 therein).
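The defining feature of this landscape, namely that the quartic loss vanishes exactly at the target up to global sign and is positive elsewhere, is easy to verify numerically. A small sketch with arbitrary simulation dimensions:

```python
import numpy as np

def F(z, A, y):
    """Quartic least squares loss F(z) = (1/(2m)) * sum_j (y_j^2 - (a_j^T z)^2)^2."""
    return np.sum((y**2 - (A @ z)**2)**2) / (2 * len(y))

rng = np.random.default_rng(0)
n, m = 10, 200
x = rng.standard_normal(n); x /= np.linalg.norm(x)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)

# F vanishes at the target up to the global sign ambiguity ...
assert F(x, A, y) == 0.0 and F(-x, A, y) == 0.0
# ... and is strictly positive at a generic point.
z = rng.standard_normal(n)
assert F(z, A, y) > 0
```

Of course, the hard part, which the landscape results address, is ruling out *other* local minimizers far from $\pm\mathbf{x}$, which no finite sampling of points can certify.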
In this paper, we focus on this conjecture and prove, through a refined analysis, that the loss function $F(\mathbf{z})$ possesses the favorable geometric property as long as the number of measurements satisfies $m \geq O(n \log n)$. In other words, we prove that (1) all local minimizers of the loss function $F(\mathbf{z})$ are global; and (2) the objective function $F(\mathbf{z})$ has a negative curvature around each saddle point and local maximizer. This is a step toward resolving the open problem.
We emphasize that if one allows some modifications to the loss function $F(\mathbf{z})$, the sampling complexity can be reduced to the optimal bound $O(n)$ [3,4,22]. In [22] the authors show that a combination of the loss function (2) with a judiciously chosen activation function also has the benign geometric structure under $O(n)$ Gaussian random measurements. Furthermore, in our recent work [2], we consider a new smoothed amplitude flow estimator based on a piecewise smooth modification of the loss function, and prove that the loss function (3), after such modifications, has a benign geometric landscape under the optimal sampling threshold $m = O(n)$.
The emerging concept of a benign geometric landscape has also recently been explored in many other applications in signal processing and machine learning, e.g. matrix sensing [1,26], tensor decomposition [14], dictionary learning [30] and matrix completion [15]. For general optimization problems there exists a plethora of loss functions with well-behaved geometric landscapes, such that all local optima are also global optima and each saddle point has a negative directional curvature in its vicinity. Correspondingly, several techniques have been developed to guarantee that standard gradient-based optimization algorithms escape such saddle points efficiently; see e.g. [11,20,21].

Our contributions
In this paper, we focus on the open problem: what is the optimal sampling complexity that guarantees the loss function $F(\mathbf{z})$ given in (2) has a favorable geometric landscape? We develop several new techniques and prove that $m \geq O(n \log n)$ Gaussian random measurements are enough. While we cannot prove the optimality of this bound, it improves on the bound $m \geq O(n \log^3 n)$ given in [29]. The main result of our paper is the following theorem.
Theorem 1 Assume that $\mathbf{a}_j \in \mathbb{R}^n$, $j = 1, \ldots, m$ are i.i.d. standard Gaussian random vectors and $0 \neq \mathbf{x} \in \mathbb{R}^n$ is a fixed vector. There exist positive absolute constants $C$, $c$ and $c'$ such that if $m \geq Cn \log n$, then with probability at least $1 - cm^{-1} - 7\exp(-c'm)$ the loss function $F(\mathbf{z})$ defined by (2) has no spurious local minimizers. In other words, the only local minimizer is $\mathbf{x}$ up to a global phase, and all saddle points are strict, i.e., each saddle point has a neighborhood where the function has a negative directional curvature. Moreover, the loss function is strongly convex in a neighborhood of $\pm\mathbf{x}$, and the point $\mathbf{z} = 0$ is a local maximizer at which the Hessian is strictly negative definite.
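The local claims of Theorem 1 (vanishing gradient and strong convexity at $\pm\mathbf{x}$, and a negative definite Hessian at $\mathbf{z} = 0$) can be sanity-checked numerically using the closed-form gradient and Hessian of the quartic loss (2). A sketch with illustrative dimensions:

```python
import numpy as np

def grad_F(z, A, y):
    """Gradient of F: (2/m) * sum_j ((a_j^T z)^2 - y_j^2) (a_j^T z) a_j."""
    Az = A @ z
    return (2 / len(y)) * A.T @ ((Az**2 - y**2) * Az)

def hess_F(z, A, y):
    """Hessian of F: (2/m) * sum_j (3 (a_j^T z)^2 - y_j^2) a_j a_j^T."""
    w = 3 * (A @ z)**2 - y**2
    return (2 / len(y)) * (A * w[:, None]).T @ A

rng = np.random.default_rng(0)
n, m = 15, 1500                        # m ~ C n log n for a generous C
x = rng.standard_normal(n); x /= np.linalg.norm(x)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)

assert np.allclose(grad_F(x, A, y), 0)               # x is a critical point
assert np.linalg.eigvalsh(hess_F(x, A, y))[0] > 0    # strong convexity at x
assert np.linalg.eigvalsh(hess_F(np.zeros(n), A, y))[-1] < 0  # 0 is a local max
```

At $\mathbf{z} = 0$ the Hessian reduces to $-\frac{2}{m}\sum_j y_j^2 \mathbf{a}_j\mathbf{a}_j^\top$, which is negative definite whenever the $\mathbf{a}_j$ span $\mathbb{R}^n$; at $\mathbf{z} = \mathbf{x}$ it reduces to $\frac{4}{m}\sum_j (\mathbf{a}_j^\top\mathbf{x})^2 \mathbf{a}_j\mathbf{a}_j^\top \succeq 0$.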
Remark 1 For simplicity we consider here only the real-valued case; the complex-valued case will be investigated elsewhere. The probability bound $O(m^{-1})$ in Theorem 1 can be refined.

Remark 2
Another interesting issue is to show that the measurements are non-adaptive, i.e., that a single realization of the measurement vectors $\mathbf{a}_j \in \mathbb{R}^n$ can be used to reconstruct all $0 \neq \mathbf{x} \in \mathbb{R}^n$. However, we shall not dwell on this refinement here, for simplicity.

Notations
Throughout the paper, we write $z \in S^{n-1}$ if $z \in \mathbb{R}^n$ and $\|z\|_2 = 1$. We use $\chi$ to denote the usual characteristic function; in the truncation expressions where it appears, the constant $c > 0$ will be taken sufficiently small. We write $m \gtrsim n$ to denote $m \geq Cn$, where $C > 0$ is a universal constant. Throughout, we use $C$, $c$ and their subscripted (or superscripted) variants to denote universal constants whose values may vary with the context.

Organization
The rest of the paper is organized as follows. In Section 2, we divide the whole space $\mathbb{R}^n$ into several regions and investigate the geometric properties of $F(\mathbf{z})$ on each region. In Section 3, we present detailed justifications for the technical lemmas given in Section 2. Finally, the appendix collects some auxiliary estimates needed in the proofs.

Proof of the main result
In the rest of this section we carry out the proof of Theorem 1 in several steps. More specifically, we decompose $\mathbb{R}^n$ into several (not necessarily disjoint) regions, on each of which $F(\mathbf{z})$ has a property that allows us to show that, with high probability, $F(\mathbf{z})$ has no local minimizers other than $\pm\mathbf{x}$. Furthermore, we show that $F(\mathbf{z})$ is strongly convex in a neighborhood of $\pm\mathbf{x}$.
Without loss of generality, we assume $\|\mathbf{x}\|_2 = 1$. Denote $\sigma = \sigma(\mathbf{z}) := \langle \mathbf{z}, \mathbf{x}\rangle / (\|\mathbf{z}\|\,\|\mathbf{x}\|)$. Then we can decompose $\mathbb{R}^n$ into three regions as shown below.
-$R_2 := \{\mathbf{z} \in \mathbb{R}^n : |\sigma| \geq 0.5 \text{ and } \operatorname{dist}(\mathbf{z}, \mathbf{x}) \geq \delta_0\}$, where $\varepsilon_0$ is an arbitrarily small positive constant and $0 < \delta_0 < 1/4$ is a universal constant. Figure 1 visualizes the regions described above and illustrates how they cover the whole space. The properties of $F(\mathbf{z})$ over these regions are summarized in the following three lemmas.
Lemma 1 For any $\varepsilon_0 > 0$ there exists a constant $\delta_1 > 0$ such that the negative-curvature bound below holds with probability at least $1 - \frac{c}{m^2} - \exp(-c(\varepsilon_0)m)$, provided $m \geq C(\varepsilon_0)n$. Here, $C(\varepsilon_0)$ and $c(\varepsilon_0)$ are positive constants depending only on $\varepsilon_0$, and $c > 0$ is a universal constant.
The proofs of the above lemmas are given in Section 3. Lemma 2 guarantees that the gradient of $F(\mathbf{z})$ does not vanish on $R_2$, so the critical points of $F(\mathbf{z})$ can only occur in $R_1$ and $R_3$. Lemma 1 shows that at any critical point in $R_1$, $F(\mathbf{z})$ has a negative directional curvature. Finally, Lemma 3 implies that $F(\mathbf{z})$ is strongly convex on $R_3$. Since $\nabla F(\mathbf{x}) = 0$ and $\mathbf{x} \in R_3$, the point $\mathbf{x}$ is a local minimizer. Putting everything together, we can establish Theorem 1 as shown below.
Proof of Theorem 1 For any possible critical point $\mathbf{z}$ lying in $R_1$, Lemma 1 shows that $F(\mathbf{z})$ has a negative directional curvature, so $\mathbf{z}$ cannot be a local minimizer. For any $\mathbf{z} \in R_2$, i.e., satisfying $|\sigma| \geq 0.5$ and $\operatorname{dist}(\mathbf{z}, \mathbf{x}) \geq \delta_0$, Lemma 2 demonstrates that $\nabla F(\mathbf{z}) \neq 0$. Finally, when $\mathbf{z}$ is sufficiently close to the target solutions $\pm\mathbf{x}$, the function $F(\mathbf{z})$ is strongly convex and $\pm\mathbf{x}$ are the global minimizers.
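The two quantities driving this case analysis, the normalized correlation $\sigma(\mathbf{z})$ and the distance to the target, can be computed as follows. Note that the sign-invariant distance below is the standard notion of error in real phase retrieval and is assumed here to be the paper's $\operatorname{dist}(\cdot,\cdot)$:

```python
import numpy as np

def sigma(z, x):
    """Normalized correlation sigma(z) = <z, x> / (||z|| ||x||), as in Section 2."""
    return (z @ x) / (np.linalg.norm(z) * np.linalg.norm(x))

def dist(z, x):
    """Distance to the target modulo the global sign (assumed definition):
    dist(z, x) = min(||z - x||, ||z + x||)."""
    return min(np.linalg.norm(z - x), np.linalg.norm(z + x))

x = np.array([1.0, 0.0, 0.0])
assert sigma(x, x) == 1.0 and sigma(-x, x) == -1.0
assert dist(-x, x) == 0.0            # -x is as good a solution as x
```

In this language, $R_2$ consists of points well correlated with $\pm\mathbf{x}$ (large $|\sigma|$) yet still bounded away from them in distance.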

Proofs of technical results in Section 2
The basic idea of the proof is to show that for each critical point other than $\pm\mathbf{x}$ there is a direction of negative curvature.

Proof of Lemma 1
Proof For any $\mathbf{z} \neq 0$, denote $\hat{\mathbf{z}} := \mathbf{z}/\|\mathbf{z}\|_2$ and $R := \|\mathbf{z}\|_2^2$, so that $\mathbf{z} = \sqrt{R}\,\hat{\mathbf{z}}$. Recall the loss function $F(\mathbf{z})$ from (2). Through a simple calculation, the Hessian of $F(\mathbf{z})$ along a direction $\xi$ can be written as
$$\nabla^2 F(\mathbf{z})[\xi,\xi] = \frac{2}{m}\sum_{j=1}^m \left(3(\mathbf{a}_j^\top \mathbf{z})^2 - y_j^2\right)(\mathbf{a}_j^\top \xi)^2.$$
We first show that $\mathbf{z} = 0$ is a local maximizer. Indeed, by Corollary 2 in the appendix, if $m \gtrsim n$ then it holds with probability at least $1 - \exp(-cm)$ that
$$\nabla^2 F(0)[\xi,\xi] = -\frac{2}{m}\sum_{j=1}^m y_j^2 (\mathbf{a}_j^\top \xi)^2 \leq -c_1\|\xi\|_2^2.$$
Here, $c$ and $c_1$ are universal positive constants. This means that with high probability the Hessian $\nabla^2 F(0)$ is strictly negative definite.
Next, we consider the case $\mathbf{z} \neq 0$ and prove that the loss function has a negative curvature at each critical point in this regime. If some $\mathbf{z} \neq 0$ is a critical point, then the stationarity relation (6) holds, where $R$ is defined by (4). By Lemma 11, if $m \gtrsim n$ then the corresponding concentration bound holds with probability at least $1 - \exp(-c_2 m)$, where $c_2 > 0$ is a universal constant. Consequently, at any critical point $\mathbf{z} \neq 0$ we obtain a two-sided bound on $R$. On the other hand, the Hessian at such a point along the direction $\mathbf{x}$ can be computed explicitly; using equation (6), we obtain (7). We claim that for any $0 < \epsilon < 1$, when $m \geq C(\epsilon)n$, with probability at least $1 - \frac{c}{m^2} - \exp(-c(\epsilon)m)$, the estimate (8) holds, where $C(\epsilon), c(\epsilon) > 0$ are constants depending only on $\epsilon$ and $c > 0$ is a universal constant. On the other hand, by Lemma 11, when $m \geq C(\epsilon)n$, the estimate (9) holds with probability at least $1 - \exp(-c(\epsilon)m)$. Putting (8) and (9) into (7), we obtain the desired curvature bound for $m \geq C(\epsilon)n$, with probability at least $1 - \frac{c}{m^2} - \exp(-c(\epsilon)m)$. Since the term $\frac{1}{m}\sum_{j=1}^m (\mathbf{a}_j^\top \hat{\mathbf{z}})^4$ is a sum of nonnegative random variables, its deviation below the expectation is bounded and the lower tail is well behaved. It immediately gives a curvature bound of $-\delta_1$ for some constant $\delta_1 > 0$, by taking $\epsilon$ sufficiently small (depending on $\varepsilon_0$). Here, $c_0$ is an absolute constant. This means the Hessian has a negative curvature along the direction $\mathbf{x}$, which proves the lemma.
Finally, it remains to prove the claim (8). Due to the heavy tails of the fourth powers of Gaussian random variables, to prove the result with sampling complexity $m \gtrsim n$ we decompose the sum into several parts, $B_1$, $B_2$ and $B_3$, by means of a Lipschitz continuous truncation function. Next, we give upper bounds for these three terms. Thanks to the smooth cutoff, $B_1$ can be bounded well: by Lemma 12, for any $0 < \epsilon < 1/2$, the truncated average concentrates with the stated probability; moreover, it follows from Lemma 6 that for $N$ sufficiently large (depending only on $\epsilon$) the truncated mean is close to the untruncated one, which bounds $B_1$. For the terms $B_2$ and $B_3$, when $N$ is sufficiently large depending only on $\epsilon$, applying Lemma 10 gives the required tail bounds, where $C$, $c''$ and $c'''$ are universal positive constants. Thus, for $B_2$ and $B_3$, the Cauchy-Schwarz inequality yields the desired bounds. Collecting the above estimates together shows that when $m \geq C(\epsilon)n$, the claimed bound holds with the stated probability, which completes the proof of claim (8). ⊓⊔
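The truncation step can be illustrated numerically: the empirical fourth moment splits into a bounded part, controlled by a Lipschitz cutoff, plus a small heavy-tail remainder. The piecewise-linear cutoff below is an illustrative stand-in for the smooth $C_c^\infty$ function used in the proof:

```python
import numpy as np

def cutoff(t, N):
    """Lipschitz cutoff: equals 1 on [-N, N], vanishes outside [-2N, 2N]."""
    return np.clip(2.0 - np.abs(t) / N, 0.0, 1.0)

rng = np.random.default_rng(1)
m, N = 100_000, 5.0
g = rng.standard_normal(m)

# Split (1/m) sum_j g_j^4 into a truncated part and a tail remainder,
# mimicking the decomposition behind claim (8):
truncated = np.mean(g**4 * cutoff(g, N))
tail = np.mean(g**4 * (1.0 - cutoff(g, N)))

# The truncated summands are bounded by (2N)^4, so the truncated part
# concentrates at a sub-Gaussian rate, while the tail carries almost
# none of the mass: E[g^4] = 3 for g ~ N(0, 1).
```

The point of the decomposition is exactly this trade-off: the bounded part admits strong uniform concentration over the sphere, and the tail contributes only an $\epsilon$-sized error once $N$ is large.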

Proof of Lemma 2
Proof Without loss of generality we can assume $\sigma := \langle \hat{\mathbf{z}}, \mathbf{x}\rangle \geq 0$. For any $\mathbf{z} \neq 0$, expanding $F(\mathbf{z})$, and similarly $\nabla F(\mathbf{z})$, and combining the two resulting equations leads to the following fundamental relation (11) for any critical point $\mathbf{z} = \sqrt{R}\hat{\mathbf{z}}$. Observe the decomposition in which $\sigma := \langle \hat{\mathbf{z}}, \mathbf{x}\rangle \geq 0$. By Corollary 3, for any $0 < \epsilon < 1/2$, if $m \geq C(\epsilon)n$ then (12) holds with probability at least $1 - \frac{c(\epsilon)}{m^2}$. For convenience, we denote the two relevant sums by $A$ and $B$. We claim that for any $0 < \epsilon < 1/2$, if $m \geq C(\epsilon)n$ then with probability at least $1 - \frac{c(\epsilon)}{m^2} - \exp(-c'(\epsilon)m)$ the bounds (13) and (14) hold. Putting (12), (13) and (14) into (11), we immediately obtain (15) with probability at least $1 - \frac{c(\epsilon)}{m^2} - \exp(-c'(\epsilon)m)$, provided $m \geq C(\epsilon)n$. By Lemma 11, for $m \geq C(\epsilon)n$, with probability at least $1 - \exp(-c'(\epsilon)m)$, the bound (16) holds; in particular, $A \geq 1$. We can then simplify (15) as (17). Note that $\mathbf{z} \in R_2$, which means $\sigma \geq 0.5$; on the other hand, $\sigma \leq 1$. Taking $\epsilon > 0$ sufficiently small, it follows from (17) that $\sigma$ must be sufficiently close to $1$. This implies that for any $0 < \eta < 1$, if $m \geq C(\eta)n$ then the corresponding bound holds with the stated probability. Furthermore, it follows from the equality in (17), combined with (16), that $A$ satisfies the desired two-sided bound. On the other hand, it follows from (13) that if $m \geq C(\epsilon)n$ then, with the stated probability, the term $B$ also satisfies the desired two-sided bounds. Finally, if $0 \neq \mathbf{z} = \sqrt{R}\hat{\mathbf{z}}$ is a critical point, then (6) holds; since we have already shown that $A$ and $B$ are well bounded, combining (18) and (19) we obtain that any critical point $\mathbf{z} = \sqrt{R}\hat{\mathbf{z}}$ satisfies $\operatorname{dist}(\mathbf{z}, \mathbf{x}) < \delta_0$ upon taking $\eta := \delta_0^2/3$. This contradicts the condition $\operatorname{dist}(\mathbf{z}, \mathbf{x}) \geq \delta_0$ for all $\mathbf{z} \in R_2$. Thus the loss function $F(\mathbf{z})$ has no critical points in $R_2$, and we arrive at the conclusion.
Finally, it remains to prove the claims (13) and (14). Let $\varphi \in C_c^\infty(\mathbb{R})$ be such that $0 \leq \varphi(x) \leq 1$ for all $x \in \mathbb{R}$, $\varphi(x) = 1$ for $|x| \leq 1$ and $\varphi(x) = 0$ for $|x| \geq 2$. Then we can split the relevant sums accordingly. Through a simple calculation, and using the same procedure as for the claim (8), it is easy to derive from Lemma 12, Lemma 6 and Lemma 15 that for any $0 < \epsilon < 1$, if $m \geq C(\epsilon)n$, the main terms are controlled with the stated probability. To deal with the error terms, using Lemma 15 again, we obtain that when $m \geq C(\epsilon)n$, with the stated probability, $|r_1| \leq \epsilon$. This completes the proofs of claims (13) and (14). ⊓⊔

Proof of Lemma 3
This subsection shows that the loss function is strongly convex in a neighborhood of $\pm\mathbf{x}$, as stated in Lemma 3.
Proof Recall that along any direction $u \in S^{n-1}$,
$$\nabla^2 F(\mathbf{z})[u,u] = \frac{2}{m}\sum_{j=1}^m 3(\mathbf{a}_j^\top \mathbf{z})^2 (\mathbf{a}_j^\top u)^2 - \frac{2}{m}\sum_{j=1}^m (\mathbf{a}_j^\top \mathbf{x})^2 (\mathbf{a}_j^\top u)^2.$$
To prove the lemma, it suffices to give a lower bound for the first term and an upper bound for the second term. Indeed, for the second term $\frac{1}{m}\sum_{j=1}^m (\mathbf{a}_j^\top u)^2 (\mathbf{a}_j^\top \mathbf{x})^2$, by Lemma 17, for any $0 < \epsilon < 1/2$, if $m \geq C(\epsilon)n\log n$ then the required upper bound holds with probability at least $1 - \frac{c(\epsilon)}{m^2}$. For the first term, by Lemma 12, for any $0 < \epsilon < 1/2$, when $m \geq C(\epsilon)n$, the truncated sum concentrates with probability at least $1 - \exp(-c'(\epsilon)m)$; on the other hand, it follows from Lemma 6 that there exists $N > 0$ sufficiently large (depending only on $\epsilon$) such that the truncation error is at most $\epsilon$. Collecting the above estimates, we conclude that when $m \geq C(\epsilon)n\log n$, with probability at least $1 - \frac{c(\epsilon)}{m^2} - \exp(-c'(\epsilon)m)$, the lower bound holds for all $u, \hat{\mathbf{z}} \in S^{n-1}$. Here, we use the fact that $\mathbb{E}(\mathbf{a}_1^\top u)^2(\mathbf{a}_1^\top \hat{\mathbf{z}})^2 = 1 + 2(u^\top \hat{\mathbf{z}})^2$ in the first inequality. Recall that $\mathbf{z} \in R_3$, i.e., $\mathbf{z}$ lies in a small neighborhood of $\pm\mathbf{x}$. Without loss of generality we assume $\sigma := \langle \hat{\mathbf{z}}, \mathbf{x}\rangle \geq 0$. It then follows from (21), put into (20), that the curvature is bounded below. Note that $\delta_0 \leq 1/4$. Taking $\epsilon > 0$ sufficiently small, we arrive at the conclusion. ⊓⊔
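The strong convexity asserted by Lemma 3 can be probed numerically by evaluating the directional quadratic form of the Hessian at random points near $\mathbf{x}$. A sketch, where the neighborhood radius and sample counts are arbitrary test choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 15, 1500
x = rng.standard_normal(n); x /= np.linalg.norm(x)
A = rng.standard_normal((m, n))
y = np.abs(A @ x)

def curvature(z, u):
    """u^T Hess F(z) u = (2/m) * sum_j [3 (a_j^T z)^2 - y_j^2] (a_j^T u)^2."""
    return (2 / m) * np.sum((3 * (A @ z)**2 - y**2) * (A @ u)**2)

# Sample random unit directions u and random points z in a small ball
# around x, and record the directional curvature at each pair:
curvs = []
for _ in range(200):
    u = rng.standard_normal(n); u /= np.linalg.norm(u)
    z = x + 0.02 * rng.standard_normal(n)
    curvs.append(curvature(z, u))
# Every sampled curvature stays strictly positive, consistent with
# strong convexity of F in a neighborhood of +-x.
```

Such a test can only sample finitely many directions, whereas the lemma controls the curvature uniformly over the neighborhood, which is where the $m \gtrsim n \log n$ uniform concentration bounds enter.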

Appendix: Preliminaries and supporting lemmas
In this section we shall adopt the following convention.
-For a random variable $Y$, we shall sometimes use "mean" to denote $\mathbb{E}Y$. This notation is particularly handy when $Y$ is given by a sum of random variables involving various truncations and modifications.
-For a random variable $Y$, the sub-exponential Orlicz norm $\|Y\|_{\psi_1}$ is defined in the usual way. -We denote by $\{\mathbf{a}_j\}_{j=1}^m$ a sequence of i.i.d. random vectors which are copies of a standard Gaussian random vector $\mathbf{a} : \Omega \to \mathbb{R}^n$ with $\mathbf{a} \sim N(0, I_n)$.
Proof Since the $\mathbf{a}_j$ are i.i.d., it suffices to prove the statement for a single random vector $\mathbf{a} \sim N(0, I_n)$. Noting that $\mathbf{a}^\top u \sim N(0,1)$ and $\mathbf{a}^\top v \sim N(0,1)$, we obtain the desired estimate if $N$ is sufficiently large. Note that one can easily quantify $N_0$ in terms of $\epsilon$; however, we shall not dwell on this here. ⊓⊔ In particular, for $m \geq C\epsilon^{-2}n$, with probability at least $1 - \exp(-c\epsilon^2 m)$, the stated bound holds. Here, $C$ and $c$ are universal positive constants.
Proof Introduce a $\delta$-net $S_\delta$ on $S^{n-1}$ with $\mathrm{Card}(S_\delta) \leq (1 + \frac{2}{\delta})^n$. Observe that by Lemma 5, for any $0 < \epsilon_1 \leq 1$, the net-point bound holds. By Lemma 7, for $m \geq Cn$, with probability at least $1 - \exp(-c_2 m)$, the discretization error is controlled. Thus, with probability at least $1 - \exp(-c_2 m)$ and uniformly for $u, v \in S^{n-1}$, we obtain the combined bound. Now take $\epsilon_1 = \frac{\epsilon}{2}$ and $\delta = \frac{\epsilon}{20}$. It follows that for $m \geq C\epsilon^{-2}\log(1/\epsilon)\,n$, with the stated probability, the desired inequality holds uniformly for all $u \in S^{n-1}$. ⊓⊔ Proof We choose $\varphi \in C_c^\infty(\mathbb{R})$ such that $\varphi(z) \equiv 1$ for $|z| \leq \frac{1}{2}$, $\varphi(z) = 0$ for $|z| \geq 1$, and $0 \leq \varphi(z) \leq 1$ for all $z \in \mathbb{R}$. We can then apply Lemma 8 with $h(z) = 1 - \varphi(\frac{z}{N})$. Note that (below $Z \sim N(0,1)$ is a standard normal random variable) the relevant tail quantities are finite, and the claimed bound follows for any $t > 0$. Proof Without loss of generality we can assume $X_i$ has zero mean. The result then follows from Markov's inequality and the moment-generating-function observation.
Proof By the Cauchy-Schwarz inequality, and then by Lemma 9, we obtain a bound for any $t_1 > 0$. Choosing $t_1$ to be an absolute constant, the desired result then follows from Corollary 1.

Proof
Step 1. Write $v = su + \sqrt{1-s^2}\,u^\perp$, where $|s| \leq 1$ and $u^\perp \in S^{n-1}$ satisfies $\langle u^\perp, u\rangle = 0$. Let $\mathbf{a} \sim N(0, I_n)$ and denote $X = \mathbf{a}^\top u$, $Y = \mathbf{a}^\top u^\perp$. Clearly $X \sim N(0,1)$, $Y \sim N(0,1)$, and $X$, $Y$ are independent. Now let $N \geq 4$. We obtain the stated bound, where $c_1 > 0$ is an absolute constant and $N$ is taken to be a sufficiently large absolute constant.
Step 2. Let $\varphi \in C_c^\infty(\mathbb{R})$ be such that $\varphi(x) = x^2$ for $|x| \leq N$ and $\varphi(x) = 0$ for $|x| \geq N+1$. Clearly, if $m \geq Cn$, then the stated bound holds with probability at least $1 - \exp(-cm)$. Define the set $F$ as above. For any $0 < \epsilon \leq 1/2$, if $m \geq C\epsilon^{-2}\log(1/\epsilon)\,n$, then the uniform bound holds with probability at least $1 - \exp(-c\epsilon^2 m)$. Proof First, it is easy to check that $\max_j \|h(\mathbf{a}_j^\top u)\|_{\psi_2} \lesssim 1$. By Lemma 4, the pointwise bound holds for each $u \in F$. Now let $\delta > 0$ and introduce a $\delta$-net $S_\delta$ on the set $F$. Note that the set $F$ can be identified with a unit ball in $\mathbb{R}^{n-1}$, so $\mathrm{Card}(S_\delta) \leq (1 + \frac{2}{\delta})^n$. By Lemma 7, if $m \geq Cn$, then with probability at least $1 - \exp(-cm)$, the discretization error is controlled, where $K_0 > 0$, $K_1 > 0$ are absolute constants. It follows that, for some absolute constant $K_3 > 0$, the uniform bound holds, and the desired conclusion follows with the stated probability. Let $f_3 : \mathbb{R} \to \mathbb{R}$ be such that $\sup_{z\in\mathbb{R}} \frac{|f_3(z)|}{1+|z|^3} \lesssim 1$. Then for any $0 < \epsilon \leq 1/2$, there exist $C_1 = C_1(\epsilon) > 0$ and $C_2 = C_2(\epsilon) > 0$ such that if $m \geq C_1 n$, then the following holds with probability at least $1 - \frac{C_2}{m^2}$:
By Lemma 9, we obtain the moment bound; then, with the stated probability, the conclusion holds, where $B_0 > 0$ is some absolute constant.

Lemma 15 For any $0 < \epsilon \leq 1/2$, there exist constants $C_1, C_2, N_0 > 0$ (depending only on $\epsilon$) such that if $m \geq C_1 n$, then the following holds with probability at least $1 - \frac{C_2}{m^2}$: for any $N \geq N_0$, the two stated inequalities hold. Proof We only sketch the proof. Write $u = (u^\top \mathbf{x})\mathbf{x} + \tilde{u}$, where $\tilde{u} \in F = \{\tilde{u} \in \mathbb{R}^n : \|\tilde{u}\|_2 \leq 1,\ \tilde{u}^\top \mathbf{x} = 0\}$. For the first inequality, note that $|u^\top \mathbf{x}| \leq 1$; for the first term one can then use Lemma 9, and for the second term one can use Lemma 14.
Now for the second inequality, we write the sum as $H_1 + H_2$. For $H_2$, using the estimates already obtained at the beginning of this proof, it is clear that we can take $M$ sufficiently large so that $H_2 \leq \epsilon/2$. Once $M$ is fixed, we return to the estimate of $H_1$. Note that we can work with a smoothed cutoff function instead of the strict cutoff. The result then follows from Lemma 12 by taking $N$ sufficiently large.

⊓ ⊔
For any $0 < \epsilon \leq 1/2$, if $m \geq C\epsilon^{-1} n \log n$, then with probability at least $1 - \exp(-c\epsilon m/\log m)$, the stated bound holds. Proof Let $0 < \delta < 1/2$ and introduce a $\delta$-net $S_\delta$ on the set $F$. Note that the set $F$ can be identified with a unit ball in $\mathbb{R}^{n-1}$, so $\mathrm{Card}(S_\delta) \leq (1 + \frac{2}{\delta})^n$. Introduce the operator $\mathcal{A} = \frac{1}{m}\sum_{j=1}^m b_j (\mathbf{a}_j \mathbf{a}_j^\top - I)$.
Thus, with the same probability, we can control
$$\frac{1}{m}\sum_{j=1}^m |b_j| \cdot \left| h_1(\mathbf{a}_j^\top u)h_2(\mathbf{a}_j^\top v) - h_1(\mathbf{a}_j^\top \tilde{u})h_2(\mathbf{a}_j^\top \tilde{v}) \right|,$$
where $K_1 > 0$ is another absolute constant. It is also not difficult to control the differences in expectation, i.e., for some absolute constant $K_2 > 0$. Now take $\delta$ sufficiently small (depending on $\epsilon$); it follows that
$$\left|\frac{1}{m}\sum_{j=1}^m (\mathbf{a}_j^\top u)^2 (\mathbf{a}_j^\top \mathbf{x})^2 - \text{mean}\right| \leq \epsilon, \quad \forall\, u \in S^{n-1}.$$
Proof Write $u = (u^\top \mathbf{x})\mathbf{x} + u^\perp$, where $\langle u^\perp, \mathbf{x}\rangle = 0$. The first two resulting terms can be handled by Lemma 9 and Lemma 14, respectively; for these terms we actually only need $m \geq Cn$. To handle the last term we need $m \gtrsim n \log n$.
The main observation is that $(\mathbf{a}_j^\top \mathbf{x})$ and $(\mathbf{a}_j^\top u^\perp)$ are independent. Write $b_j = (\mathbf{a}_j^\top \mathbf{x})^2$ and observe that, with probability at least $1 - O(m^{-2})$, we have