On phase retrieval via matrix completion and the estimation of low rank PSD matrices

Given underdetermined measurements of a Positive Semi-Definite (PSD) matrix $X$ of known low rank $K$, we present a new algorithm to estimate $X$ based on recent advances in non-convex optimization schemes. We apply this in particular to the phase retrieval problem for Fourier data, which can be formulated as a rank 1 PSD matrix recovery problem. Moreover, we provide theory for how oversampling affects the stability of the lifted inverse problem.


Introduction
The (discretized) formulation of the phase retrieval problem consists in finding a complex vector x, usually a discretized signal or image, given a certain amount of measurements in form of modulus of scalar products (see [SEC15]), i.e. b k = | a k , x | 2 , k = 1, . . . , M (where a k ∈ C N are known), and possibly additional geometric constraints. The aim is thus to reconstruct the discrete vector x ∈ C N representing the object. In [BCE06] is shown that M ≥ 4N − 2 generic complex measurements are needed in order to be able to distinguish two different signals (up to a phase); in [CEH15] the lower bound was improved to M ≥ 4N − 4. This number is believed to be close to optimal. A possible combinatorial approach to the problem has been described in [SCa91], together with the proof that, as formulated, the problem is NP-hard. An optimization-based standard solution technique would be to solve the least-squares minimization problem but the quadratic terms in the objective function makes the problem highly non-convex and therefore hard to solve. A convex optimization approach was suggested in [CSV13], called PhaseLift, and this note considers an improvement of this approach. We recommend [FWd15] for a pleasant overview of recent advances regarding PhaseLift and related techniques. A lot of work has been devoted to deal with cases in which a k are Gaussian measurements, see e.g. [CLS15], but these algorithms seems to work poorly with Fourier data. The main objective of the present article is to provide improvements of the PhaseLift approach, with a particular focus on Fourier measurements, based on recent advances in non-convex low rank approximation algorithms. The theory relies on the quadratic envelope applied to the indicator functional of matrices with a fixed predetermined rank, and can be used also for any other PSD-estimation problem of fixed low rank (Section 2.1).

Physical background
The phase retrieval problem appears frequently for instance in elecrodynamics; the single scalar complex field Ψ(x, y, z, t), often called wave-function, solution of the d'Alembert equation is enough to describe the electromagnetic disturbance in the free space. Making use of Fourier integral, the wave-function can be spectrally decomposed as superposition of monochromatic fields where the spatial wave-function ψ ω (x, y, z) associated with a given monochromatic component of the spectral decomposition of Ψ solves the Helmholz equation (∇ 2 + k 2 )ψ ω (x, y, z) = 0, k = ω/c with suitable boundary conditions. In Coherent Diffractive Imaging (CDI) an object (sample) is illuminated by a coherent almost monochromatic wavefield and subsequentially the diffracted farfield intensity pattern is measured by detectors. The current detectors measure intensities but they are not able to measure the phase of the diffracted pattern; the phase retrieval problem consists then in retrieving the object knowing the aforementioned intensity plus, in general, some additional physical constraints. In the Fraunhofer regime, which occurs whenever the Fresnel number N F := b 2 /λ∆ 1 (b being the diameter of the region occupied by the sample and ∆ the distance between the sample and the detector) it can be shown (see for instance [Pag06]) that the 2-dimensional propagated disturbance ψ ω measured at z = ∆ ≥ 0 (where the optical axis is considered to be coincident with the z axis) is where k is the wavenumber and · the 2-dimensional Fourier transform. In words this says that "the propagated disturbance has the form of a paraxial modulated spherical wave with the modulation being proportional to (a transversely scaled form of) the two-dimensional Fourier transform of the unpropagated disturbance" ( [Pag06]). A family of algorithms developed for solving the phase retrieval problem in combination with a priori knowledge of the support of the object, relies on the simple idea of alternatingly adjust the support of the image and the modulus of its Fourier transform. This goes back to Gerchberg and Saxton ([GSa72]) and to Fienup ([Fie82]), even though the pool of mathematical ideas, borrowed from convex analysis, goes back at least to Bregman ([Bre65]). The most famous is the errorreduction algorithm which is essentially a non-convex adaptation of the idea of Projection On Convex Sets (cfr. POCS algorithm, chatper 5 of [BCo10]): if ψ O is the object we want to reconstruct, supported in C ⊆ R 2 , error-reduction performs successive projections between the two sets The drawback is that the first set is not convex and therefore there are no theoretical guarantees of the convergence of this scheme. Further algorithms have of course been developed; in [Mar07] the projections are replaced with reflections, in [MBK16] a Newton-type scheme with Tikhonov regularization is proposed, to mention two recent contributions. These methods have been successfully applied with real data, but the non-convexity issue persists.

PhaseLift approach
In the present note we focus our attention to a lifting approach popularized by [CSV13] and [CES13]. For A k = a k a * k , where * is the conjugate transpose and the latter product is the usual matrix multiplication, we define a linear operator A : being ·, · F the Frobenius inner product and H N the space of N × N Hermitian matrices. Noticing that | a k , x | 2 = a k a * k , xx * F and recalling that every positive semidefinite matrix X ∈ H N (C) with rank(X) = 1 admits a factorization of the type xx * = X with x ∈ C N , the quadratic phase retrieval problem is lifted onto a linear one with dimension squared; the problem (1) to be solved is now to find a suitable rank 1 PSD-matrix X such that In real applications, the "perfect measurement" b k = | a k , x | 2 is contaminated by noise, so the problem can be re-casted as an optimization one min A(X) − b 2 subject to rank(X) = 1, X 0. (4) The constraint rank(X) = 1 is however highly non-convex, so a convex relaxation has been proposed in [CSV13] based on the nuclear norm, known for being "low-rank inducing" (cfr. [RFP10]): The parameter λ gives a tradeoff between low rank of the sought minimum X and data fit, and this term also induces a bias which scales with λ. For this reason, in practice one does not in general find a rank 1 matrix, but a low rank one from which a suitable vector candidate x can be extracted, we refer to [CSV13] for the details.

Innovations and discussion of the lifting trick
While recognizing the PhaseLift as an important and celebrated contribution, we have found that this method suffers from a number of drawbacks making its practical use limited. The main issue is the size of the involved objects due to the lifting process. A typical (CDI) measurement generates an image of at least 100 × 100 which leads to a vector x of size 10 4 , which in turn yields a 10 4 × 10 4matrix which is then fed to an iterative algorithm which need to compute eigenvalue factorization. The time complexity of the latter, without any implementation tricks, is O((10 4 ) 3 ) = O(10 12 ).
The second drawback stems from the fact that one usually need to run the algorithm several times in order to find a suitable value of λ, and a third drawback is that one typically need lots of linearly independent measurements (A 1 , . . . , A M ) for stable recovery. As mentioned initially, one needs at least M = 4N measurements for having an (essentially) unique solution in the noiseless case. [CSV13] suggests working with M ≥ cN log(N ) where c is some (unknown) constant, whereas in their numerical section they use M = 6N where N = 128. When dealing with Fourier data, one commonly used method for boosting M is simply oversampling in the Fourier domain [MSC98], but it was observed numerically in [CES13] that this gives an ill-posed inverse problem (see Sections 2.1 and 4.4.3), which led them to instead suggest the use of e.g. masks to increase the amount of measurements.
In this article we will prove that oversampling indeed does not give rise to more linearly independent data beyond M ≈ 2 d N , where d is the dimension of the problem. In other words it is pointless to oversample more than a factor 2 in each dimension. This also shows that any method based on the lifting principle (3) is bound to fail unless combined with masks or similar tricks, at least for d ≤ 2. Furthermore, we will rewrite (4) into an optimization problem where the constrains are built into the functional to be minimized, and then use the quadratic envelope to regularize this functional. Ideally one would like to work with the convex envelope of the functional, but this is not doable in practice. The quadratic envelope yields a non-convex continuous surrogate which has recently been shown to have the desirable property of yielding a regularizer which, just like the convex envelope, does not move the global minima [Car19]. While not completely removing other local minimizers, it reduces the amount and in practice it seems that the corresponding algorithm can find the global minimizer in fairly difficult problems. In other words we suggest a new framework which seems to be able to solve the original problem (4). We refrain from proving this claim mathematically, but remark that in [CGO19] an analogous statement is rigorously shown for the similar problem of low rank recovery without PSD constraints. Here we satisfy with developing the algorithm, with a particular focus on efficient implementation, as well as numerically demonstrating several desirable features (see Section 5); • the algorithm can solve any problem of the type (4) but with a predetermined rank K, not necessarily K = 1.
• The nuclear norm approach (5) yields matrices with rank larger than 1, hence using more degrees of freedom. Despite this, the proposed method gives a better data fit while at the same time fulfilling the rank constraint.
• In [CES13] a reweighted version of (5) is used, which is known for better accuracy, (but this algorithm is non-convex as well). In terms of reconstruction accuracy the two algorithms now perform similarly. Benefits of the proposed algorithm is that it finds the correct rank and does so within fewer iterations, plus that it relies on a more developed theoretical framework.
The paper is structured as follows; Section 2 presents the mathematical details behind the new general fixed rank PSD-matrix estimation algorithm, Section 3 presents our theoretical findings regarding the application to phase retrieval and sampling issues, Section 4 covers implementation details and finally Section 5 concludes with some numerical illustrations.

Notation
M N will denote the set of N ×N complex matrices and H N the subset of Hermitian (i.e. self-adjoint) matrices. N is the size of the unknown vector before lifting to a problem with N 2 unknowns, and M is the amount of measurements, which we assume is larger than N . Images and higher dimensional objects are represented as tensors ⊗ d j=1 C n and the measurements then take place in the Fourier domain which we discretize as ⊗ d j=1 C m , where m ≥ n. Hence N = n d and M is a constant multiple N m + 1 of m d , where N m is the number of masks. A denotes the operator (3) and its "matrixification" (as described in Section 4.1) is denoted by A, except in the case of pure Fourier data, in which case we denote these objects by F and F respectively. The quadratic envelope is denoted by Q γ (f ). • denotes composition of two functions, in particular if f acts on R N then f • λ acts on H N , where λ denotes the eigenvalues of any given X ∈ H N ordered non-increasingly.
2 Quadratic envelope approach to low rank recovery Let H be a Hilbert space, f : H → R ∪ {∞} a non-convex penalty functional and set where A is a linear operator from H to some other Hilbert space in which the "measurement" b lives. In [Car19] the quadratic envelope is studied as a means of regularizing functionals of this type. It is shown that the regularized functional has some desirable properties when A 2 < γ. Most notably, R reg lies between R and its lower semi-continuous convex envelope, any local minimizer of R reg is a local minimizer of R and the global minimizers of R and R reg coincide. In [CGO18] the quadratic envelope has been studied further for the case of the famous 0 − 2 problem in which H = R N and f (x) = x 0 or, in case the sought degree of sparsity K is known, Here x 0 denotes the number of non-zero entries in x. In these cases R reg has been numerically compared to the more classical 1 -penalty in solving the problem of retrieving a sparse signal. In the companion work [CGO19] the analysis has been lifted to the space of N × N -matrices H = M N (C), and f (X) = rank(X) or The vector problem and the matrix problem are closely related; note that ι R K (X) = ι K (σ(X)) where σ(X) denotes the singular values of X. It turns out that and, more importantly, if X has SVD X = U ΣV * , that prox ι R K (X) = U diag(prox ι K (σ(X)))V * . In other words, if we can compute the proximal operator in the scalar case, the matrix case follows immediately. In [CGO19] the theory is much further developed for the case f = ι R K ; in particular we show that for noisy measurements the functional R reg has a unique local minima which is also the solution of arg min rank(X)≤K A(X) − b . The new ingredient in the present contribution is basically the PSD condition, and we satisfy with numerically testing the machinery on the Phase Retrieval problem as well as some generic low rank recovery problems with K > 1.

Low-rank Positive Semi-Definite problems
We now come to the new material for this paper. In this section we set H to be the set of Hermitian matrices H N ⊂ M N (C) and consider the problem of searching for a PSD matrix with fixed rank K. Of course, for the PhaseLift problem we will set K = 1, but we develop the theory for general K ∈ N. Motivated by the previous section we introduce the function ι + K : R N → {0, +∞} defined by where x ≥ 0 is interpreted elementwise. If R + K denotes the set of positive semi-definite matrices with rank ≤ K, then it is easy to see that ι R + K (X) = ι + K (λ(X)) where λ(X) denotes the eigenvalues of X ∈ H. Given a transform A we then have that the fixed-rank minimization problem arg min This in turn has the same global minimizer as the regularized problem arg min as long as γ > A 2 , as argued earlier. The latter is a continuous (in its domain H + N , cfr. Proposition 3.2 of [Car19]) functional whose stationary points can be found for instance via Forward-Backward Splitting scheme (also know as Proximal Gradient Method: see for instance [Bec17]) 1 , as long as the proximal operator of Q γ (ι R + K ) is computable. We now describe how this can be done. We recall that a function f : R N → R is said to be symmetric if f (x) = f (Πx) for every permutation Π and for every x ∈ R N . We also recall that for a lower semi-continuous function g : V → [−∞, ∞], being V an Hilbert space, the (possibly empty or set-valued) proximal operator is defined by be a symmetric function and consider F : , by a variation of the von Neumann trace inequality (cfr. [Lew96]). Moreover the equality holds if and only if there exists a unitary matrix V such that We may therefore assume without loss of generality that By the symmetry of f we have that f (λ(X)) = f (ξ), and therefore The corresponding claim for Q γ follows immediately by Q γ (f ) = S γ (S γ (f )).
Let us prove the second part of the statement. We have that [BoL05] follows that the latter holds if and only if Z − X and Z share the same eigenvectors and This is equivalent to λ(Z) ∈ prox Qγ (f )/ρ (λ(X)) and by the symmetry of f we have that λ i (Z) = λ j (Z) whenever λ i (X) = λ j (X). Hence any unitary matrix U of eigenvectors to X are also eigenvectors to Z, and since Z = U diag(λ(Z))U * , the desired conclusion follows.
We remark that the maximum negative quadrature of Q γ (F ) is γ, as shown in [Car19], so the condition γ < ρ ensures that the proximal operator is single valued (since the corresponding minimization problem is strongly convex).
Of course we are interested in the concrete case of f = ι + K , but it turns out that Q γ (ι + K ) has a rather complicated expression. Luckily, the related transform S γ (ι + K ) has a simpler expression, and it follows that we can still compute the prox Qγ (ι + K ) by duality; we postpone the details of this to Section 4.2. In the coming section we investigate the structure of the operator A for the phase retrieval problem with multidimensional data and Fourier measurements.

Images and Fourier data
Let us return to the phase retrieval problem. We consider "images" in d dimensions, which can be realized as the tensor product vector space ⊗ d j=1 C n , where we use the same number of points in every dimension for simplicity only. This linear vector space can of course be identified with C N = C n d by introducing some basis, but we will see that it is often convenient to actually skip this step and work directly in the more abstract setting. In this case a rank 1 matrix corresponds to a linear operator on ⊗ d j=1 C n of the form x ⊗ y, and the matrix is PSD if and only if the operator can be written x ⊗ x, where the bar indicates complex conjugation.
Given elements a k ∈ ⊗ d j=1 C n , where k runs over some set of (usually multidimensional) In this new framework, PhaseLift corresponds to the equivalent problem of finding a rank 1 PSD linear operator X on ⊗ d j=1 C n such that

Fourier data and limitations to oversampling
In the common case of Fourier measurements, the a k 's are discretizations of pure oscillatory exponential functions, and in this case we denote them by f k and the corresponding operator by F in place of A. We denote by 2 (S) the linear vector space of all functions on S, so that b naturally identifies with an element of 2 (S). Often, we measure on an m d grid where (m/n) d is the oversampling factor, i.e. S = {0, . . . , m − 1} d in which case 2 (S) is readily identified with ⊗ d j=1 C m . To be more explicit, in this case we have for k ∈ {0, . . . , m − 1} d and n ∈ {0, . . . , n − 1} d . However, to allow for unequally spaced sampling we stick to the more general setting where and {ζ k ∈ R d } k∈S are some frequencies and S some set. The equation (9) can now be written Since we are working with an ill-posed inverse problem, it is crucial to get as much linearly independent equations as possible. In other words we want to choose {ζ k ∈ R n } k∈S so that F has maximal rank. The next result basically states that it is pointless to oversample beyond a factor of two.
The amount of tuples of the form n 1 −n 2 equals (2n−1) d , which gives the desired upper bound.
We now prove that, given sufficient oversampling and a vise choice of {ζ k } k∈S , the rank of F actually equals the maximal rank (2n − 1) d . More precisely, we will need that {ζ k } k∈S is such that n → e iζ k ·n is a linearly independent set on ⊗ d j=1 C 2n−1 or, if |S| > (2n − 1) d , that these functions span the full space ⊗ d j=1 C 2n−1 . Such a choice of {ζ k } k∈S will be called non-degenerate. By considering the DFT it is clear that non-degenerate sets of frequencies exist for all cardinalities of S. More generally, say that we pick our ζ k 's on an irregular product set, i.e. suppose that for each j = 1 . . . d there are "coordinate-frequencies" {ζ j k } 2n−1 k=1 and the multi-frequencies are formed Then, in order for the multidimensional set {ζ k } k∈{1,...,2n−1} d to be non-degenerate it suffices for each coordinate set {ζ j k } 2n−1 k=1 is non-degenerate (see e.g. Theorem 2 in [Hay82] or Proposition 4.1 in [AnC17]).
Proposition 3.2. Given a non-degenerate set of frequencies {ζ k } k∈S , the rank of F equals rank(F) = min(|S|, (2n − 1) d ). (13) Proof. First assume that |S| = (2n − 1) d and denote F in this case by F 0 . As in the previous proof we have that the range of F 0 is spanned by k → e iζ k ·n where n ∈ {−n + 1, . . . , n − 1}. In this case, there are as many different n's as there are k's, and it follows by basic linear algebra that the set of functions of the form k → e iζ k ·n is linearly independent if and only if the set of functions n → e iζ k ·n is, which is true by assumption. This finishes the proof under the assumption that |S| = (2n − 1) d . Given any particular basis for the domain and the codomain, the operator F 0 has a matrix representation − → F 0 (of dimension (2n − 1) d × n 2d ). We know from what we have already shown that we have (2n − 1) d linearly independent columns, i.e. − → F 0 has full rank. If |S| > (2n − 1) d we can think of − → F as adding rows to the matrix − → F 0 , which then impossibly can result in a lower rank. Combined with the previous lemma, this shows that the rank in this case is (2n − 1) d . Similarly, if |S| < (2n − 1) d we can think of − → F as removing rows from the full rank matrix − → F 0 , which then clearly results in a new full rank matrix. Hence the rank will be |S|, and the proof is complete.
The above proposition is somehow bad news for the "oversampling approach", since it shows that the maximum oversampling factor one could hope for without adding more linearly dependent equations is roughly 2 d . This conclusion is also backed by the numerical experiments in [CES13]: Section 4.4.3 is devoted to numerically demonstrate that oversampling alone does not yield a wellposed inverse problem. Hence Proposition 2.1 can be seen as a theoretical justification of numerical observations in [CES13]. We also validate this conclusion experimentally in Section 5.2. An interesting remark is that two is also a suitable oversampling parameter for other Phase Retrieval methods, see e.g. [MSC98] which backs up this conclusion experimentally. This can be understood since |x(θ)| 2 is a trigonometric polynomial with (2n − 1) d undetermined coefficients, and it is well known that one needs precisely (2n − 1) d (non-degenerate) measurements to uniquely determine these coefficients. The following interesting papers [BeP15,BBE17,BrS79,Hay82,HES16,Osh12] contain more information about uniqueness of the Phase Retrieval problem, (as well as other interesting algorithms and plenty of other information). A curious remark is that the factor 2 d does get better with the dimension d, and it is also well known in the experimental community that 3d reconstructions are more stable; see e.g. [CBM06,CZL14]. For the moment, the case d = 3 is too computationally expensive with the present method, but we hope to address this shortcoming in future work. The proposition also shows that one does not increase stability, or degree of linear independence to be more precise, by picking {ζ k } from some irregularly sampled grid. Hence we find no motivation to deviate from the simplest possible choice, i.e. picking an m in the range n ≤ m < 2n and use (10) or, which is the same, setting S = {0, . . . , m − 1} d and ζ k = k/m in (11).

Stabilizing PhaseLift by adding new equations
Section 2.1 of [CES13] provide a list of experimental methods which can be employed in order to add further linearly independent measurements to those given by the operator F from the previous section. In particular it is argued that one can use masks, that is, you create new measurement vectors f C k by introducing a mask which only lets light through in a region C. Mathematically, this amounts to multiplying the image x with the characteristic function χ C . If we use regularly sampled measurements as in (10), this gives new measurement-functions via the formula f C k (n) = e −2πi k·n m χ C (n) where C is realized as a subset of {0, . . . , n − 1} d , and k runs over the index set {0, . . . , m − 1} d . Also ptychography, optical grating and oblique illuminations are considered. However additional linearly independent measurements are created, we can form an operator A by extending F so that the problems (6)-(8) has an essentially unique solution.
While the methods mentioned above are great from a mathematical perspective, they are often not feasible or slow or expensive from a physical perspective in a concrete experiment. Due to this, a priori estimates on the support of the object measured is still by far the most common method used in practice. This was pioneered by Fienup in [Fie82] and presently the most popular methods to solve the phase retrieval problem in this context are Hybrid Input Output ( [Fie82]), Difference Map ([Els03]) or Relaxed Averaged Alternating Reflection ([Luk05]). A simple way to incorporate support constraints into the scheme (6)-(8) is to construct A by extending F by simply adding linear equations X(m, n) = 0 for all pairs (m, n) ∈ ({0, . . . , n − 1} d ) 2 such that either m or n is outside of C (recall that X is a tensor in (⊗ d j=1 C n ) ⊗ (⊗ d j=1 C n )). However, this will yield an algorithm which balances meeting the support constraint versus the low rank inducing functional Q γ (ι R + K ), and hence the output may still be non-zero off C, which may be suitable depending on how certain the support estimate is. An alternative is to set A = F (i.e. only "pure" Fourier data) and use ADMM on the problem arg min X∈H: X(m,n)=0 for m∨n ∈C which will force the solution to obey the support constraint. We leave it for further research to investigate these options from a practical perspective.

Estimating A for masked Fourier data
As explained in section 2, the parameter γ in the quadratic envelope needs to be chosen considering the size of A 2 , and since A in practice is a huge matrix it is not desirable to have to compute its norm. We therefore provide some rough estimates here depending only on the dimension of the images, in the simplest case when m = n. We thus assume that A is formed as explained in the first paragraph of the previous section, using N m number of masks (plus pure Fourier data). Let A j , j = 0, . . . , N m be the suboperator of A connected to a particular mask (so A 0 = F are those measurement where no mask is used.) In other words A 0 (X)(k) = X, f k ⊗ f k . The tensors {f k ⊗ f k : k ∈ {0, . . . , n − 1} d } are mutually orthogonal (since the f k 's are). Moreover so it follows that A 0 is N times a unitary operator, which has operator norm 1. Now, if we represent A 0 as a matrix A 0 , then it is clear that A j (for each j ≥ 1), is obtained from A 0 by replacing entire rows and columns by 0. This means that A j ≤ A 0 = N by basic estimates, and hence the triangle inequality implies that A ≤ j A j ≤ (N m + 1)N = M . Summing up we have shown that

Implementation aspects
In this section we show how to efficiently minimize (8) without additional constraints, (or at least how to compute a stationary point). Since (8) is is smooth, we use the Forward-Backward Splitting scheme, as this has been established to converge to a stationary point under assumptions which are applicable in our setting. Moreover we suggest to use the accelerated version FISTA, since we have observed that this significantly speeds up convergence. The algorithm alternates between a gradient update step and a proximal operator step. The computation of the gradient of g = 1 2 A(·) − b 2 2 (S) is mathematically straightforward but very time consuming, and we will therefore introduce certain tricks based on FFT and Toeplitz matrices for efficient evaluation of this step when working with Fourier data, see Section 4.1. The computation of the proximal operator of Q γ (ι R + K )(X) = Q γ (ι + K )(λ(X)) is rather tricky, the details are given in Section 4.2. We give here a general overview of all the functions involved. For concreteness we present the code when working with Fourier data and a number of masks, and leave it up to the reader to work out details for e.g. ADMM routines with support constraints.
In FISTA, short for Fast Iterative Shrinkage-Thresholding Algorithm, the proximal-gradient steps are taken at the interpolation where {θ k } k≥1 is a sequence of positive real numbers with some properties that ensure convergence in the convex case (see for instance [Bec17]). We used the sequence θ k = (k + 1)/2, for k ≥ 1 as suggested in Remark 10.35 of [Bec17]. Both X k and X k+1 are initialized at zero. Upon choosing a step-size t (which we discuss a little further on), the FISTA algorithm now alternates between the update steps For the latter to be single valued, we need to have 1 γ > t, which puts an upper bound to the choice of step-size t.
We now discuss suitable choices of γ and t. Based on (15) and the theory for the quadratic envelope, it would seem that γ = M 2 would be a natural choice, since it would guarantee that (6) and (8) have the same global minimizer. However, a large γ also means that Q γ (ι R + K ) is, informally speaking, more non-convex, and we have found that the choice γ = N 2 gives a better performance. We have also observed that performance is rather stable with respect to changes in γ around this value. With this choice, we get the bound t < 1/N 2 for the step-size. However, the convergence of FISTA is guaranteed (in both convex and non-convex cases) if the stepsize t is < 1/L(∇g) (see [BCo10] and [ABS13]), where g is the differentiable function and L(∇g) the Lipschitz constant of its gradient, which in our case leads to t ≤ 1/ A 2 . Based on (15), we therefore used t = 1/(M 2 + 1) in our experiments. In our experience, larger step-size leads to faster convergence, but it is important to not exceed the bound.

Efficient computation of the gradient step
We shall here only treat Fourier measurements with masks, in which case the gradient can be computed very efficiently by using the Fast Fourier Transform (FFT in short), but the details are a bit intricate. A technical problem is the representation of tensors as vectors in a computer. Given any enumeration of the index set {0, . . . , n − 1} d and a vector x ∈ ⊗ d j=1 C n , we denote the corresponding vector by x ∈ C N . More precisely, we let β n : {0, . . . , n − 1} d → {1, . . . , N } be a bijection and set x(β n (j)) = x(j). If for example d = 2 (so we are dealing with images) and we vectorize by column-stacking, then β n (j) = β n (j 1 , j 2 ) = j 2 n + j 1 + 1.
All vectors are realized as column-vectors, so that e.g. xx * becomes a matrix. Similarly, if M denotes the amount of elements in S the function b ∈ 2 (S) can be identified with a vector b ∈ C M by ordering the elements of S. Once both the domain (C n ) d and codomain 2 (S) have been vectorized, the operators A and F get matrix representations that we denote by A and F. If a k relates to A (and A) as described shortly before (3), it is now easy to see that by basic multivariable calculus. We first consider the case of no masks and follow the oversampling recommendations from Section 3.1; we pick a parameter m with n ≤ m < 2n and set S = {0, . . . , m − 1} d . This set is then in bijective correspondence with {1, . . . , M } through β m , and we will implicitly identify k in the latter with k = β m (k) in the former. Now A = F and a k derive from pure oscillatory exponentials as in (10). More precisely a k corresponds to a k = f k (j) = e −2πi jk m as in (10), where a k is the tensor counterpart of a k .
Similarly, since X represents an arbitrary matrix in M N it implicitly defines a tensor X on the tensor product of ⊗ d j=1 C n with itself, by the formula X(j, l) = X(β n (j), β n (l)). Summing up we have that a k a * k , X F equals f k ⊗ f k , X in the space ⊗ d j=1 C n ⊗ ⊗ d j=1 C n or, more explicitly where the sums over j, l run over all {0, . . . , n − 1} d , p runs through {−n + 1, . . . , n − 1} d and each coordinate of q, say q 1 , satisfies max(0, −p 1 ) ≤ q 1 ≤ min(n−1, n−1−p 1 ) to ensure that both q and p + q are in {0, . . . , n − 1} d . We now introduce Y (p) = q X(q, p + q) (where summation bounds for q are as before) for p ∈ {−n + 1, . . . , n − 1} d , and define Y (p) = 0 outside of this grid. In the case d = 1 the operation above corresponds to summing over lines in X parallel to the diagonal, as in a Toeplitz matrix. For the multivariable case the above is easily computed using X, by simply noting that Y (p) = q X(β n (q), β n (p + q)).
Clearly (17) is a sort of inverse Fourier transform on Y , and moreover the computation of Y is fast; O(N 2 ). To make use of FFT for fast evaluation of this expression jointly for all k ∈ {0, . . . , m − 1} d , we introduce Y mod (p) = n∈Z d f (p + mn) for all p ∈ {0, . . . , m − 1} d . Since usually n ≤ m ≤ 2m, note that for each p the sum contains at most 2 non-zero entries, so this operation is efficiently evaluated. Summing up we have that where DF T denotes the discrete Fourier transform on the grid {0, . . . , m − 1} d . This can be efficiently evaluated (i.e. O(N 2 log(N ))) in most programming languages using some preset implementation of the FFT. For example, when d = 3 and when using MATLAB, we can represent Y MOD as a multidimensional array and evaluate (18) using the command fftn.
Let us now return to the formula (16). We denote the element computed in (18) by c k , and similarly we introduce b k = b βm(k) . Then (16) takes the form where f k is the vector representation of the tensor f k via β n , as before. At some fixed index pair (p, q) ∈ {1, . . . , N } 2 , this is evaluated as where p = β n (p) and q = β n (q). Clearly all these N 2 = n 2d values reduce to (at most) M = m d different values, and can be computed by an inverse Fourier transform on the tensor (c k −b k ), which has time complexity O(M 2 log(M )). This concludes the explanation of how to efficiently compute the gradient step (16) in the case of no masks. Luckily, the incorporation of masks poses very little additional difficulty. Say we have N m masks C 1 , . . . , C Nm as described in Section 3.2. Clearly then the computation of (16) can be separated in N m + 1 independent pieces, one for each mask. We just describe how to compute one of these contributions for a fixed mask C. In this case we again have that k = 1, . . . , m d and we may as well index the vectors using k ∈ {0, . . . , m − 1} d instead. Now a k corresponds to the tensor χ C (j)e −2πi jk m and a k to its vectorization via β n . If w is the vectorization of χ C via β n and I w denotes the diagonal operator on C N with w as diagonal, we thus have that a k = I w f k with f k as before. Returning to the coefficients in (16) we have that which means that we can use formula (18) upon replacing X with I w XI w . Finally, since a k a * k = I w f k f * k I w the summation in (16) can be evaluated as in (19) with the difference that we insert a 0 in the final matrix on positions (p, q) where either of the two coordinates p and q lie outside of C.

Computation of the proximal operator
The expression of the proximal operator of Q γ (ι + K ) is quite involved. We first recall that Q γ (f ) = S γ (S γ (f )) where S γ (h)(x) is computed by first taking the Legendre transform of h+ γ 2 · 2 and then subtracting γ 2 · 2 . By an alteration of the Moreau decomposition (cfr. Chapter 14 of [BCo10]) it is possible to show (cfr. [Car16], Proposition 3.3) that if ρ > γ we have where (see again [Car16]) beingx the decreasing rearrangement of x. We now sketch the routine for computing prox ρ−γ ργ Sγ (ι + K ) (y). After some basic algebra it turns out that By the rearrangement inequality we may assume that y is already ordered non increasingly and therefore the latter turns into arg min For a given x letk(x) be the minimum between K and the last index j for which x j is non-negative. Then (21) becomes arg min Now it is clear that if the vector is non-increasing then it is a global minimizer for (21). In particular this happens whenever y switches sign before K (i.e.k(y) < K) or whenever y K ≥ ρy K+1 /γ. We exclude these two scenarios from the further analysis. Ifk(x) < K then xk (x) ≥ 0 and xk (x)+1 < 0, but looking at (22) we see that setting xk (x)+1 = 0 would diminish the quantity (since yk (x)+1 > 0), a contradiction. Thus we have thatk(x) = K, and so (22) becomes arg min Let x be some candidate for the global minimum and set s := x K . It is easy to see that x must have the following structure: Now consider (25) inserted into (24); except from an additive constant this function takes the form By inspection it is clear that this function is convex and has a unique minimum on the interval V = [y K , ρy K+1 /γ]; indeed the first sum is constant on (−∞, y K ] and strictly convex (increasing) on (y K , ∞), whereas the second sum is strictly convex (decreasing) (−∞, ρy K+1 /γ) and constant on [ρy K+1 /γ, ∞). We seek for the unique solution of d ds F (s) = 0 in V . Let j * be the smallest index such that y j * ≤ ρy K+1 /γ and let l * be the biggest index such that ρy l * /γ ≥ y K ; now the set If this value is outside I then the optimal s is to be found in another interval. By strict convexity s I ∈ I will hold in at least one interval, and at most two intervals (when s I lies on the boundary).
In conclusion we can summarize the previous observations in an algorithm for computing the proximal operator: 1. If y K < 0 or ρy K+1 /γ < y K return (23), else 2. compute j * and l * ; 3. sort {y i } K i=j * ∪ {ρy i /γ} l * i=K+1 decreasingly and call it z; 4. for m ∈ {1, . . . , l * −j * −1} set s = (z m +z m+1 )/2 and compute the indices j and l as described above. Notice that the indices j and l are the same for all t ∈ [z m , z m+1 ], and this is why it is enough to consider the midpoints; 5. compute the new value of s according to (26); 6. if z m ≥ s ≥ z m+1 stop and return (25). Otherwise increase m with 1 and repeat the steps 4-5.
The "for-loop" 4-6 must stop since F is strictly convex and has a unique minimizer.

Numerical examples
We will illustrate the various results by conducting four experiments. In the first experiment we will compare the proposed method with PhaseLift and reweighted PhaseLift, in the second we test the oversampling conjectures from Section 3, in the third we discuss the capacity of the proposed method of solving low rank PSD problems of any given rank as predicted in Section 2, and finally we end with the reconstruction of a 2D image using the tricks of Section 4. Figure 2: On x-axis the 2 norm of the noise and on the y-axis D. In blue performances of (5), in red of (27) and in yellow of (8). Figure 3: On x-axis the 2 norm of the noise and on the y-axis the Frobenius norm ofX − X0. In blue performances of (5), in red of (27) and in yellow of (8).

1D synthetic "masked" Fourier measurements
In this subsection we consider 1-dimensional signals x 0 ∈ C 100 with x 0 = 1 randomly generated with a Gaussian distribution; they represent our "ground truth", i.e. the signals that we are interested in retrieving. The method we proposed is compared to the reweighted nuclear norm technique, following the numerical section of [CES13]. The idea of reweighting the nuclear norm was heuristically introduced by [FHB01]. Despite of the lack of a systematic mathematical theory behind, this trick seemed to work quite well for our problem too, proving to be able to significantly enhance the capacity of the nuclear norm of finding low rank matrices. We consider binary masks, as introduced in Section 3.2. Following [CES13] we use m = n (oversampling is considered separately in the next section) and three masks, so n = m = N = 100 and M = 400. Note that this means that we use 400 equations plus a rank-PSD constraint to determine roughly 5000 variables in the lifted problem. We focus our attention in comparing the approximated solutions to the two problems (8) and min X∈H100 λ W (l) X * + 1 2 A(X) − b 2 2 , for l = 1, 2, . . .
where W (l) are weight (diagonal) matrices and where A is as described in section 3.2 and b = A(x 0 x * 0 ) + with ∈ C 400 additive random gaussian noise. Figure 4: On x-axis the 2 norm of the noise and on the y-axis the average rank ofX. In red performances of (27), in yellow of (8). The rank was computed using the MATLAB function rank with tolerance 10 −6 for (8) and tolerance 10 −3 for (27).
The proposed method only contains one user parameter γ which is fixed as described in Section 4, independent of noise level. The minimization of (27) on the other hand relies on λ which must scale with the noise for optimal performance, and moreover the update of the weights requires yet another parameter δ to be chosen; following [CWB08] we set being X (l) the outcome of the algorithm after some fixed number of iterations with W (l) . W (1) is just the identity matrix.
We have not found any clear guidance in the literature for how to pick λ and δ; in [CES13] the algorithm is run several times for different λ's picked according to a golden section search or cross validation. This is very time consuming, especially taking into account that the algorithm is slow to begin with due to the lifting trick. We have found that λ = 0.01 + 0.75 and δ = 0.01 + 0.05 seemed to work fairly well in our experiments, and these are the values used in the graphs presented. The bisection search only gives a marginal improvement to the results shown, but it would be unfair to do this in comparison with 8, since this is run only once with one fixed choice of γ. For practical purposes, the fact that (8) does not rely on intricate parameter choices is clearly a strength.
To compute the numerical approximations we used the FISTA algorithm (already previously described) with 10000 iterations for (8) and with 10000 iterations for each weights update for (27); we found out that l > 2 does not give any significant contribution to reconstruction accuracy. The stepsize was fixed to t = 1/(16N 2 + 1) = 1/1601 (see section 4) and 5 different trials for each level of noise were carried out, in the sense that for each d ∈ {0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4} we run five different experiments, i.e. we randomly generated five different ground truths X 0 (and consequently five different noise vectors with d = ), and we averaged the various outcomes over the number of trials. All the masks were randomly generated and the norm of the measured data vector is averagely ≈ 20; in the graphs below we cover cases where the noise has up to 20% of the signal magnitude. According to [CES13] the approximated signal reconstructed using (27) is chosen as the eigenvector of the greatest eigenvalue of the outputX of the algorithm. Since in (8) only rank 1 matrices are involved, the approximated signal will be the only nonzero eigenvector ofX. The underlying signal is unique only up to a global phase, therefore the distance between x 0 andx is computed as D = min c∈C∩{|z|=1} cx 0 −x 2 2 .
In Figure 3 we plot D, X − X 0 F and in Figure 4 the average rank ofX. The graphs show that the performances of the three methods. It is clear that nuclear norm without reweighting is inferior in all aspects. We focus therefore on discussing the proposed method versus reweighted nuclear norm. In terms of "norms precision", these are essentially comparable; nevertheless the reweighted nuclear norm fails in finding the correct rank for high levels of noise. Thus, when this method is equivalent with the proposed method, it uses more degrees of freedom. On the other hand the method that we propose is always able to retrieve rank 1 solutions, in perfect compliance with the theory developed.
Since determining the numerical rank is somewhat fishy, we include the following tables which show the first ten eigenvalues (decreasingly ordered) of the approximationX, computed respectively with (8) and (27), in each of the five different experiments run for the noise level = 3. As is plain to see, the solutions obtained by the reweighted nuclear norm are far from rank 1.

Stabilizing by oversampling
In broad terms, the conclusion of Section 3 is that it is desirable to oversample a factor 2 (in each dimension), but beyond this point no further gain is expected. We test this for the one dimensional case on synthetic data of size N = 25 with an identical setup to the one above. As measurement we use b = A(X 0 ) + where is noise which we vary between 0 and 20% of the magnitude of b. Each data point in the below graphs is the average of five distinct trials. A numerical observation in [CES13] is that one needs 3 masks to have stable inversion. The authors also point out that a low residual error A(X) − b in combination with a poor actual error X − X 0 is an indicator of having an ill-posed inverse problem, i.e. one where several different points minimize the functional in question. Below we plot both graphs as a function of the Noise to Signal Ratio / b .  To the left we see that the residual errors are almost identical, and moreover close to the line y = x, as expected, since it is clearly unlikely to beat the noise level by any notable margin. Indeed, if we were to getX = X 0 then clearly A(X) − b = , so the above graph actually tells us that we most of the time find a solution better than ground truth.
To the right we see the normalized reconstruction error, and we can confirm that the expectations from Section 3 is in perfect compliance with the outcome. Oversampling by a factor two has a significant effect in improving the actual error, whereas oversampling by a factor 3 only has a marginal effect, as predicted by Proposition 3.2. Noteworthy is that also the latter two curves are close to "y = x".
Encouraged by this result we now try the same setup with 2 masks. As is plain to see, the pattern repeats itself with the major difference that m = n is completely unreliable except for 0 noise, whereas m = 2n and m = 3n do a similar job which, although not extremely convincing, clearly outperforms m = n. In conclusion, it seems that the recommendation of oversampling with a factor of two has both theoretical and numerical support, at least in the case d = 1.

Finding the correct rank for general PSD estimation problems
We have ran extensive tests for minimizing (8) with different choices of A, b and ranks K, and never found a single instance where the algorithm converges to a point with the wrong rank, i.e. different from K, as long as γ > A 2 . 2 This is in accordance with the theory for quadratic envelopes as developed in [Car19] and summarized in Section 2, stating that local minimizers of (8) form a subset of those for (6), except for the fact that FBS is only guaranteed to find stationary points, 3 not necessarily local minima. This indicates that either there are no saddle point type stationary points, or that FBS actually has the capacity to avoid them. We do not underline these observations with any specific graph, since it seems impossible to design one experiment that would cover all different types of potential applications.

2D image reconstruction
We complement this section by displaying a 2D reconstruction example. The example chosen is a 27 × 27 pixels grayscale real image; the initial measurements were contaminated with 1% of additive gaussian random noise. The algorithm used was again FISTA with eight binary masks and the parameters choice indicated in Section 3.3 where we employed the tricks of Section 4.1 for efficient evaluation via FFT of the gradient-step. The reconstruction was made on a standard laptop and the bottleneck timewise for the algorithm is the eigendecomposition, which prohibits larger images to be processed. We plan to address this shortcoming in future works, which clearly needs to be solved for the algorithm (as well as PhaseLift) to be of practical use for 2D or 3D imaging. Our experience suggested that the 2D problem needs a higher number of masks (with respect to the 1D case) in order to be stabilized; this observation is present in [CES13] as well where, in the noisefree case, eight binary masks are used. We used m = n for simplicity, it is of course possible that oversampling could lead to fewer masks, but we have not yet tested this.
On the side, the original image is the second while the first is our reconstruction.