Perfect reconstruction of sparse signals with piecewise continuous nonconvex penalties and nonconvexity control

Ayaka Sakata; Tomoyuki Obuchi

doi:10.1088/1742-5468/ac1403

1. Introduction

A signal processing scheme for reconstructing signals through linear measurements, when the number of measurements is less than the dimensionality of the signals, is known as compressed sensing (or compressive sensing) [1, 2]. Let ${\boldsymbol{x}}^{0}\in {\mathbb{R}}^{N}$ and $A\in {\mathbb{R}}^{M{\times}N}$ be the unknown original signal and measurement matrix, respectively. Compressed sensing is mathematically formulated as a problem of reconstructing the signal x ⁰ from its measurement y = A x ⁰, where the number of measurements is less than the signal dimension (M < N). In general, the problem is underdetermined, and the solution is not unique. However, the signal can be reconstructed utilizing the knowledge that the original signal is sparse; it contains zero components with a finite probability. The reconstruction of signals from a limited number of measurements is a common challenge in various fields. In the past decade, theories and techniques of compressed sensing have been enriched by interdisciplinary work in fields such as signal processing, medical imaging, and statistical physics.

A natural way to reconstruct a sparse signal is to minimize the ℓ₀ norm under a constraint:

$\begin{equation}\underset{\boldsymbol{x}}{\mathrm{min}}{\Vert}\boldsymbol{x}{{\Vert}}_{0},\quad \mathrm{s}\mathrm{u}\mathrm{b}\mathrm{j}\mathrm{e}\mathrm{c}\mathrm{t}\enspace \mathrm{t}\mathrm{o}\enspace \boldsymbol{y}=A\boldsymbol{x},\end{equation} \tag{ 1 }$

where || x ||₀ is the number of nonzero components in x . However, a combinatorial search with respect to the support set is required to exactly solve (1); hence, it is unrealistic for implementation. The minimization of the ℓ₁ norm [1, 3] is a widely used approach:

$\begin{equation}\underset{\boldsymbol{x}}{\mathrm{min}}{\Vert}\boldsymbol{x}{{\Vert}}_{1},\quad \mathrm{s}\mathrm{u}\mathrm{b}\mathrm{j}\mathrm{e}\mathrm{c}\mathrm{t}\enspace \mathrm{t}\mathrm{o}\enspace \boldsymbol{y}=A\boldsymbol{x},\end{equation} \tag{ 2 }$

which is a convex relaxation of (1), where ${\Vert}\boldsymbol{x}{{\Vert}}_{1}={\sum }_{i=1}^{N}\vert {x}_{i}\vert$ . Efficient algorithms for solving (2) have been developed [4, 5], in addition to general convex optimization techniques [6]. Further, conditions on the measurement matrix, such as the null space property and the restricted isometry property [7, 8], have been identified under which the solutions of (1) and (2) are shown to be equivalent.

Despite the mathematical tractability of ℓ₁ minimization, its performance is inferior to that of ℓ₀ minimization for practical settings of the measurement matrix A. The gap between ℓ₁ and ℓ₀ is expected to be reduced by minimizing the ℓ_p (0 < p < 1) norm. In fact, ℓ_p (0 < p < 1) minimization achieves the reconstruction of the original signal from fewer measurements than ℓ₁ minimization [9, 10]. However, ℓ_p (0 < p < 1) minimization leads to a discontinuity of the reconstructed signal with respect to the input, which induces algorithmic instability. The smoothly clipped absolute deviation (SCAD) [11] and the minimax concave penalty (MCP) [12], which are piecewise continuous nonconvex penalties, are potential candidates to address this limitation. SCAD and MCP are designed to provide continuity, unbiasedness, and sparsity to the estimates, and their nonconvexities are controlled by nonconvexity parameters. The mathematical treatment of nonconvex penalties seems difficult compared with that of ℓ₁. However, it has been shown that a data compression problem under SCAD and MCP can be solved without additional computational cost compared with convex optimization problems in a certain parameter region, and this region is characterized by replica symmetry in the context of statistical physics [13]. This result suggests that these penalties may improve signal reconstruction in compressed sensing.

In this study, we theoretically verify the performance of SCAD and MCP minimization for the reconstruction of sparse signals in compressed sensing. Perfect reconstruction is achieved with a smaller number of measurements than the ℓ₁ reconstruction limit [14–16]. Further, SCAD and MCP minimization overcomes the algorithmic limit of the Bayes-optimal method [17], in the sense that there exists a unique stable solution corresponding to the perfect reconstruction, even beyond the algorithmic limit of the Bayes-optimal method, and that there are no phase transitions that can act as algorithmic barriers. Based on this finding, we apply the approximate message passing (AMP) algorithm to SCAD and MCP minimization as a reconstruction algorithm. Examining its performance, we find that AMP cannot actually achieve the perfect reconstruction beyond the Bayes-optimal algorithmic limit, despite the theoretical basis. To investigate the gap between the behavior of AMP and the analytical result, we use the state evolution (SE) technique, which allows us to track the macroscopic dynamics of AMP. As a result, we find that the gap comes from the shrinkage of the basin of attraction (BOA) of the perfect reconstruction and from the divergent behavior of AMP in some parameter regions. One of the contributions of this paper is the identification of this scenario of algorithmic failure, which differs from that in the Bayes-optimal setting. In the Bayes-optimal setting, the emergence of a local minimum other than the success solution, which is the global minimum, hampers the convergence of AMP to the perfect reconstruction.

We mitigate the abovementioned gap and improve the performance of AMP by introducing a method controlling nonconvexity parameters, named nonconvexity control, into AMP, which is our other contribution. The efficiency of the nonconvexity control is understood from the flow of SE. Further, the property of the fixed point of SE gives a guide for the protocol of the nonconvexity control. The algorithmic limit of SCAD and MCP minimization, called the nonconvexity control limit (NCC limit), is determined as the limit where the proposed nonconvexity control leads to the perfect reconstruction. The resultant algorithmic limit is very close to but slightly inferior to the Bayes-optimal algorithmic limit.

The remainder of this paper is organized as follows. In section 2, we introduce the nonconvex sparse penalties, SCAD and MCP, used in this study. The equilibrium properties of compressed sensing under SCAD and MCP are studied in section 3, based on the replica method under the replica symmetric (RS) assumption. In section 4, the limit for the perfect reconstruction is derived for SCAD and MCP, and we show that their performance is expected to overcome the ℓ₁ reconstruction limit and the algorithmic limit of the Bayes-optimal method by investigating the presence of phase transitions and the local stability of the solution corresponding to the perfect reconstruction. In section 5, we demonstrate the actual reconstruction of the signal using AMP, and show that the reconstructed signal diverges when the nonconvexity parameters are small, even when the reconstruction is theoretically supported. We show that this divergence can be suppressed by introducing the nonconvexity control. Section 6 is devoted to the summary and discussion of this paper.

2. Definition of SCAD and MCP

The problem considered in this study is formulated as

$\begin{equation}\underset{\boldsymbol{x}}{\mathrm{min}}J(\boldsymbol{x};\lambda ,a)\quad \text{subject}\;\text{to}\enspace \boldsymbol{y}=A\boldsymbol{x},\end{equation} \tag{ 3 }$

where $J(\boldsymbol{x};\lambda ,a)={\sum }_{i=1}^{N}J({x}_{i};\lambda ,a)$ is a sparsity-inducing penalty and λ, a are regularization parameters. We deal with two types of nonconvex penalties, SCAD and MCP. The shapes of these penalties are controlled by the two parameters λ and a, and the ℓ₁ penalty is recovered as a limiting case. We call these regularization parameters nonconvexity parameters.

SCAD is defined by [11]

$\begin{equation}J(x;\lambda ,a)=\begin{cases}\lambda \vert x\vert \quad & (\vert x\vert {\leqslant}\lambda )\\ -\frac{{x}^{2}-2a\lambda \vert x\vert +{\lambda }^{2}}{2(a-1)}\quad & (\lambda {< }\vert x\vert {\leqslant}a\lambda )\\ \frac{(a+1){\lambda }^{2}}{2}\quad & (\vert x\vert { >}a\lambda )\end{cases},\end{equation} \tag{ 4 }$

where λ ∈ (0, ∞) and a ∈ (1, ∞). Figure 1(a) represents the SCAD penalty at λ = 1 and a = 3. The dashed vertical lines are the thresholds |x| = λ and |x| = aλ. The SCAD penalty for |x| ⩽ λ is equivalent to the ℓ₁ penalty, and that for |x| > aλ is equivalent to the ℓ₀ penalty; i.e. the penalty has a constant value. These ℓ₁ and ℓ₀ regions are connected to each other through a quadratic function. At a → ∞, SCAD is reduced to the ℓ₁ penalty, J(x; λ, a → ∞) = λ|x|, and the minimization of SCAD at λ → ∞ is also equivalent to the ℓ₁ minimization.

MCP is defined by [12]

$\begin{equation}J(x;\lambda ,a)=\begin{cases}\lambda \vert x\vert -\frac{{x}^{2}}{2a}\quad & (\vert x\vert {\leqslant}a\lambda )\\ \frac{a{\lambda }^{2}}{2}\quad & (\vert x\vert { >}a\lambda )\end{cases},\end{equation} \tag{ 5 }$

where λ ∈ (0, ∞) and a ∈ (1, ∞). Figure 1(b) represents MCP for λ = 1 and a = 3. The vertical line represents the threshold |x| = aλ. As with SCAD, MCP is also reduced to ℓ₁ by taking the limit a → ∞, and is also equivalent to ℓ₁ minimization when the penalty is minimized at λ → ∞.
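The definitions (4) and (5) can be written down directly as code. The following is a minimal Python sketch; the function names `scad_penalty` and `mcp_penalty` are illustrative choices and do not come from the paper.

```python
def scad_penalty(x, lam, a):
    """SCAD penalty of equation (4); assumes lam > 0 and a > 1."""
    t = abs(x)
    if t <= lam:
        # l1-like region
        return lam * t
    if t <= a * lam:
        # quadratic region connecting the l1-like and constant regions
        return -(t * t - 2.0 * a * lam * t + lam * lam) / (2.0 * (a - 1.0))
    # constant (l0-like) region
    return (a + 1.0) * lam * lam / 2.0


def mcp_penalty(x, lam, a):
    """MCP of equation (5); assumes lam > 0 and a > 1."""
    t = abs(x)
    if t <= a * lam:
        return lam * t - t * t / (2.0 * a)
    # constant (l0-like) region
    return a * lam * lam / 2.0
```

Evaluating both functions at the thresholds |x| = λ and |x| = aλ gives the same value from either side, which is the continuity property discussed above; for very large a both penalties approach λ|x|.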

2.1. Estimator under SCAD and MCP

For understanding the properties of SCAD and MCP, let us consider the one-dimensional fitting problem of the input w using a Gaussian model penalized by SCAD or MCP as

$\begin{equation}\hat{x}(s,w)=\mathrm{arg}\underset{x}{\mathrm{min}}\left\{\frac{{x}^{2}}{2s}-wx+J(x;\lambda ,a)\right\},\end{equation} \tag{ 6 }$

where s > 0. Both SCAD and MCP contain concave terms; hence, to define a solution of (6) for all values of w, we need to carefully consider the admissible regions of s and a, which determine the coefficient of the quadratic term on the right-hand side of (6). In the case of SCAD, the relationship a > 1 + s should hold to obtain the solution as

$\begin{equation}\hat{x}(s,w)={{\Sigma}}_{\text{SCAD}}(s,w){\mathcal{M}}_{\text{SCAD}}(s,w)\end{equation} \tag{ 7 }$

where ${\mathcal{M}}_{\text{SCAD}}$ and Σ_SCAD represent the coefficients of the linear and quadratic terms in (6) given by

$\begin{equation}{\mathcal{M}}_{\text{SCAD}}(s,w)=\begin{cases}w-\mathrm{sgn}(w)\lambda \quad & \mathrm{f}\mathrm{o}\mathrm{r}\quad \lambda (1+{s}^{-1}){\geqslant}\vert w\vert { >}\lambda \\ w-\mathrm{sgn}(w)\frac{a\lambda }{a-1}\quad & \mathrm{f}\mathrm{o}\mathrm{r}\quad a\lambda {s}^{-1}{\geqslant}\vert w\vert { >}\lambda (1+{s}^{-1})\\ w\quad & \mathrm{f}\mathrm{o}\mathrm{r}\quad \vert w\vert { >}a\lambda {s}^{-1}\\ 0\quad & \mathrm{o}\mathrm{t}\mathrm{h}\mathrm{e}\mathrm{r}\mathrm{w}\mathrm{i}\mathrm{s}\mathrm{e}\end{cases},\end{equation} \tag{ 8 }$

$\begin{equation}{{\Sigma}}_{\text{SCAD}}(s,w)=\begin{cases}s\quad & \mathrm{f}\mathrm{o}\mathrm{r}\quad \lambda (1+{s}^{-1}){\geqslant}\vert w\vert { >}\lambda \\ {\left({s}^{-1}-\frac{1}{a-1}\right)}^{-1}\quad & \mathrm{f}\mathrm{o}\mathrm{r}\quad a\lambda {s}^{-1}{\geqslant}\vert w\vert { >}\lambda (1+{s}^{-1})\\ s\quad & \mathrm{f}\mathrm{o}\mathrm{r}\quad \vert w\vert { >}a\lambda {s}^{-1}\\ 0\quad & \mathrm{o}\mathrm{t}\mathrm{h}\mathrm{e}\mathrm{r}\mathrm{w}\mathrm{i}\mathrm{s}\mathrm{e}\end{cases},\end{equation} \tag{ 9 }$

and sgn(w) denotes the sign of w. An example of the estimator as a function of the input w under SCAD is shown in figure 2(b), where s = 1, λ = 1 and a = 3. The SCAD estimator behaves like the ℓ₁ estimator, which is shown in figure 2(a), when λ(1 + s⁻¹) ⩾ |w| > λ, and like the ordinary least squares (OLS) estimator when |w| > aλs⁻¹. In the region aλs⁻¹ ⩾ |w| > λ(1 + s⁻¹), the estimator interpolates linearly between the ℓ₁ and OLS estimators.
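Equations (7)–(9) amount to a piecewise linear thresholding rule. The sketch below is one way to write it in Python, assuming the condition a > 1 + s stated above; the function name `scad_estimator` is hypothetical.

```python
def scad_estimator(s, w, lam, a):
    """Minimizer of x^2/(2s) - w*x + J_SCAD(x; lam, a), following (7)-(9)."""
    if not a > 1.0 + s:
        raise ValueError("the SCAD estimator is defined only for a > 1 + s")
    t = abs(w)
    if t <= lam:
        return 0.0  # "otherwise" case of (8) and (9)
    sign = 1.0 if w > 0 else -1.0
    if t <= lam * (1.0 + 1.0 / s):
        # l1-like region: Sigma = s, M = w - sgn(w) * lam
        return s * (w - sign * lam)
    if t <= a * lam / s:
        # intermediate region: Sigma = (1/s - 1/(a-1))^{-1}
        return (w - sign * a * lam / (a - 1.0)) / (1.0 / s - 1.0 / (a - 1.0))
    # OLS region: Sigma = s, M = w
    return s * w
```

With s = 1, λ = 1 and a = 3, the three branches meet continuously at |w| = 2 and |w| = 3, where the estimator equals 1 and 3, respectively.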

By the same argument as for SCAD, a > s should hold to define the solution of (6) under MCP. Further, the condition a > 1 is imposed in the definition of MCP; hence, both conditions together give the restriction a > max{1, s}. When the condition a > max{1, s} is satisfied, the solution of the single-body problem under MCP is given by [18]

$\begin{equation}{x}^{{\ast}}(s,w)={{\Sigma}}_{\text{MCP}}(s,w){\mathcal{M}}_{\text{MCP}}(s,w),\end{equation} \tag{ 10 }$

where

$\begin{equation}{\mathcal{M}}_{\text{MCP}}(s,w)=\begin{cases}w-\mathrm{sgn}(w)\lambda \quad & \mathrm{f}\mathrm{o}\mathrm{r}\quad a\lambda {s}^{-1}{\geqslant}\vert w\vert { >}\lambda \\ w\quad & \mathrm{f}\mathrm{o}\mathrm{r}\quad \vert w\vert { >}a\lambda {s}^{-1}\\ 0\quad & \mathrm{o}\mathrm{t}\mathrm{h}\mathrm{e}\mathrm{r}\mathrm{w}\mathrm{i}\mathrm{s}\mathrm{e}\end{cases},\end{equation} \tag{ 11 }$

$\begin{equation}{{\Sigma}}_{\text{MCP}}(s,w)=\begin{cases}{({s}^{-1}-{a}^{-1})}^{-1}\quad & \mathrm{f}\mathrm{o}\mathrm{r}\quad a\lambda {s}^{-1}{\geqslant}\vert w\vert { >}\lambda \\ s\quad & \mathrm{f}\mathrm{o}\mathrm{r}\quad \vert w\vert { >}a\lambda {s}^{-1}\\ 0\quad & \mathrm{o}\mathrm{t}\mathrm{h}\mathrm{e}\mathrm{r}\mathrm{w}\mathrm{i}\mathrm{s}\mathrm{e}\end{cases}.\end{equation} \tag{ 12 }$

Figure 2(c) shows the behaviour of the MCP estimator at s = 1, λ = 1, and a = 3. The MCP estimator behaves like the OLS estimator at |w| > aλs⁻¹, and is connected from zero to the OLS estimator in the region aλs⁻¹ ⩾ |w| > λ.
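The MCP estimator (10)–(12) can be sketched in the same way. The guard in the code combines the two requirements a > s and a > 1 mentioned in the text; the function name `mcp_estimator` is an illustrative choice.

```python
def mcp_estimator(s, w, lam, a):
    """Minimizer of x^2/(2s) - w*x + J_MCP(x; lam, a), following (10)-(12)."""
    if not a > max(1.0, s):
        raise ValueError("the MCP estimator needs both a > s and a > 1")
    t = abs(w)
    if t <= lam:
        return 0.0  # "otherwise" case of (11) and (12)
    sign = 1.0 if w > 0 else -1.0
    if t <= a * lam / s:
        # intermediate region: Sigma = (1/s - 1/a)^{-1}, M = w - sgn(w) * lam
        return (w - sign * lam) / (1.0 / s - 1.0 / a)
    # OLS region: Sigma = s, M = w
    return s * w
```

At s = 1, λ = 1 and a = 3, the estimator rises from zero at |w| = 1 with slope 3/2 and joins the OLS line at |w| = 3.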

3. Replica analysis for SCAD and MCP

We assume that the signal to be reconstructed is generated according to the Bernoulli–Gaussian distribution,

$\begin{equation}{P}_{0}({\boldsymbol{x}}^{0})=\prod\limits _{i}\left\{(1-\rho )\delta ({x}_{i}^{0})+\frac{\rho }{\sqrt{2\pi {\sigma }_{x}^{2}}}\enspace \mathrm{exp}\left(-\frac{{({x}_{i}^{0})}^{2}}{2{\sigma }_{x}^{2}}\right)\right\},\end{equation} \tag{ 13 }$

where δ(x) is the Dirac delta function. Further, we consider the measurement matrix A to be a random Gaussian matrix, where each component is independently and identically distributed according to the Gaussian distribution with mean 0 and variance N⁻¹. The measurement is expressed as y = Ax⁰, and the minimization of J(x; λ, a) is implemented under the constraint y = Ax. For mathematical tractability, we express the constraint y = Ax by introducing a parameter τ as

$\begin{equation}{P}_{\tau }(\boldsymbol{y}\vert \boldsymbol{x})=\frac{1}{{(\sqrt{2\pi \tau })}^{M}}\enspace \mathrm{exp}\left\{-\frac{1}{2\tau }{\Vert}\boldsymbol{y}-A\boldsymbol{x}{{\Vert}}_{2}^{2}\right\},\end{equation} \tag{ 14 }$

where the probability is concentrated at y = A x taking the limit τ → 0. The posterior distribution corresponding to the problem (3) is given by

$\begin{equation}P(\boldsymbol{x}\vert \boldsymbol{y})=\underset{\beta \to \infty }{\mathrm{lim}}\enspace \underset{\tau \to 0}{\mathrm{lim}}\frac{1}{{Z}_{\beta ,\tau }(\boldsymbol{y})}\enspace \mathrm{exp}(-\beta J(\boldsymbol{x};\lambda ,a)){P}_{\tau }(\boldsymbol{y}\vert \boldsymbol{x}),\end{equation} \tag{ 15 }$

where β is a parameter to attain the uniform distribution over the minimizer of (3) at β → ∞, and

$\begin{equation}{Z}_{\beta ,\tau }(\boldsymbol{y})=\int \mathrm{d}\boldsymbol{x}\enspace \mathrm{exp}(-\beta J(\boldsymbol{x};\lambda ,a)){P}_{\tau }(\boldsymbol{y}\vert \boldsymbol{x})\end{equation} \tag{ 16 }$

is the normalization constant. The minimizer of (3) is given by $\hat{\boldsymbol{x}}=\langle \boldsymbol{x}\rangle$ , where ⟨⋅⟩ denotes the expectation with respect to x according to the posterior distribution (15).
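For concreteness, the random setup assumed in this section (a Bernoulli–Gaussian signal as in (13), an i.i.d. Gaussian matrix with variance N⁻¹, and the noiseless measurement y = Ax⁰) can be generated as in the following minimal Python sketch. The helper names, the seed, and the values N = 200, α = 0.5, ρ = 0.1 are illustrative assumptions, not settings taken from the paper.

```python
import random


def bernoulli_gaussian_signal(n, rho, sigma_x, rng):
    """Each component is 0 with probability 1 - rho, else N(0, sigma_x^2)."""
    return [rng.gauss(0.0, sigma_x) if rng.random() < rho else 0.0
            for _ in range(n)]


def gaussian_matrix(m, n, rng):
    """M x N matrix with i.i.d. entries of mean 0 and variance 1/N."""
    std = n ** -0.5
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(m)]


def measure(A, x):
    """Noiseless linear measurement y = A x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


rng = random.Random(0)          # fixed seed, for reproducibility only
N, alpha, rho = 200, 0.5, 0.1   # illustrative sizes
M = int(alpha * N)
x0 = bernoulli_gaussian_signal(N, rho, 1.0, rng)
A = gaussian_matrix(M, N, rng)
y = measure(A, x0)
```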

The performance of the reconstruction (3) depends on the random variables A and x⁰. Here, we examine the typical performance of SCAD and MCP minimization at N → ∞ and M → ∞ keeping M/N = α ∼ O(1), where α is the compression ratio. The free energy density defined by

$\begin{equation}f=-\underset{\beta \to \infty }{\mathrm{lim}}\enspace \underset{N\to \infty }{\mathrm{lim}}\enspace \underset{\tau \to 0}{\mathrm{lim}}\enspace \frac{1}{N\beta }{E}_{{\boldsymbol{x}}^{\mathbf{0}},A}[\mathrm{ln}\enspace {Z}_{\beta ,\tau }(\boldsymbol{y})]\end{equation} \tag{ 17 }$

is the key in this discussion, where ${E}_{{\boldsymbol{x}}^{0},A}[\dots ]$ denotes the expectation with respect to A and x ⁰ introduced for the discussion of the typical property. Here, we proceed with the calculation for general τ and β, and take the limit after the derivation of the general form. It is calculated using the following identity

$\begin{equation}{E}_{{\boldsymbol{x}}^{0},A}[\mathrm{ln}\enspace {Z}_{\beta ,\tau }(\boldsymbol{y})]=\underset{n\to 0}{\mathrm{lim}}\frac{{E}_{{\boldsymbol{x}}^{0},A}[{Z}_{\beta ,\tau }^{n}(\boldsymbol{y})]-1}{n}.\end{equation} \tag{ 18 }$
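The identity (18) can be checked numerically on a toy example. The sketch below replaces the partition function by a positive random variable Z with three possible values (an assumption made purely for illustration) and compares (E[Zⁿ] − 1)/n at small n with E[ln Z].

```python
import math


def replica_estimate(values, probs, n):
    """(E[Z^n] - 1) / n for a discrete positive random variable Z."""
    moment = sum(p * z ** n for z, p in zip(values, probs))
    return (moment - 1.0) / n


def mean_log(values, probs):
    """Direct evaluation of E[ln Z]."""
    return sum(p * math.log(z) for z, p in zip(values, probs))


values = [0.5, 2.0, 8.0]   # toy support of Z
probs = [0.2, 0.5, 0.3]    # toy probabilities
exact = mean_log(values, probs)
approximations = [replica_estimate(values, probs, n) for n in (1e-1, 1e-2, 1e-4)]
```

The entries of `approximations` approach `exact` as n decreases. In the replica calculation itself, E[Zⁿ] is first evaluated for positive integer n and then continued analytically to n → 0, which this toy check does not reproduce.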

Assuming that n is a positive integer, we can express the expectation of ${Z}_{\beta ,\tau }^{n}$ in (18) using n-replicated systems

$\begin{align}{E}_{{\boldsymbol{x}}^{0},A}[{Z}_{\beta ,\tau }^{n}(\boldsymbol{y})]& =\int \mathrm{d}A\enspace \mathrm{d}\boldsymbol{y}\enspace \mathrm{d}{\boldsymbol{x}}^{0}\enspace {P}_{0}({\boldsymbol{x}}^{0}){P}_{A}(A)\delta (\boldsymbol{y}-A{\boldsymbol{x}}^{0})\\ & \quad {\times}\int \mathrm{d}{\boldsymbol{x}}^{(1)}\dots \mathrm{d}{\boldsymbol{x}}^{(n)}\enspace \underset{\tau \to 0}{\mathrm{lim}}\frac{1}{{(\sqrt{2\pi \tau })}^{nM}}\\ & \quad {\times}\mathrm{exp}\left[\sum\limits _{a=1}^{n}\left\{-\frac{1}{2\tau }{\Vert}\boldsymbol{y}-A{\boldsymbol{x}}^{(a)}{{\Vert}}_{2}^{2}-\beta J({\boldsymbol{x}}^{(a)};\lambda ,a)\right\}\right].\end{align} \tag{ 19 }$

The details of the calculation are given in [13, 16, 19], and here we briefly summarize the result. The free energy density under the RS assumption is given by

$\begin{equation}f={\mathrm{extr}}_{{\Omega},\tilde{{\Omega}}}\left[\frac{\alpha (Q-2m+\rho {\sigma }_{x}^{2})}{2\chi }+m\tilde{m}-\frac{\tilde{Q}Q-\tilde{\chi }\chi }{2}+\frac{\overline{\xi (\tilde{Q},\sigma )}}{2}\right],\end{equation} \tag{ 20 }$

where ${\mathrm{extr}}_{{\Omega},\tilde{{\Omega}}}$ represents the extremization with respect to the quantities Ω = {Q, χ, m} and $\tilde{{\Omega}}=\left\{\tilde{Q},\tilde{\chi },\tilde{m}\right\}$ . The function $\xi (\tilde{Q},\sigma )$ depends on the regularization as

$\begin{equation}\xi (\tilde{Q},\sigma )=2\int DzL(\tilde{Q},\sigma z),\end{equation} \tag{ 21 }$

$\begin{equation}L(\tilde{Q},\sigma z)=\underset{x}{\mathrm{min}}\left(\frac{\tilde{Q}}{2}{x}^{2}-\sigma zx+J(x;\lambda ,a)\right),\end{equation} \tag{ 22 }$

where $\int Dz={\int }_{-\infty }^{\infty }\mathrm{d}z\enspace \mathrm{exp}(-{z}^{2}/2)/\sqrt{2\pi }$ . Equation (22) is equivalent to the one-dimensional problem (6). Here, $\overline{\dots }$ denotes the average over σ according to the distribution

$\begin{equation}{P}_{\sigma }(\sigma )=(1-\rho )\delta (\sigma -{\sigma }_{-})+\rho \delta (\sigma -{\sigma }_{+}),\end{equation} \tag{ 23 }$

with ${\sigma }_{-}=\sqrt{\tilde{\chi }}$ and ${\sigma }_{+}=\sqrt{\tilde{\chi }+{\tilde{m}}^{2}{\sigma }_{x}^{2}}$ . The random fields σ₋z and σ₊z effectively represent the randomness induced by A and x⁰ for the zero and nonzero components of the signal, respectively. We denote the solution of x in the effective single-body problem (22) as ${x}^{{\ast}}({\tilde{Q}}^{-1},\sigma z)$ , which depends on the regularization, and we consider the specific forms of ${x}^{{\ast}}({\tilde{Q}}^{-1},\sigma z)$ and $L(\tilde{Q},\sigma z)$ later. The saddle point equations are given by

$\begin{equation}\chi =-\frac{\partial \overline{\xi (\tilde{Q},\sigma )}}{\partial \tilde{\chi }}=\int Dz\overline{\frac{\partial {x}^{{\ast}}({\tilde{Q}}^{-1},\sigma z)}{\partial (\sigma z)}},\end{equation} \tag{ 24 }$

$\begin{equation}Q=\frac{\partial \overline{\xi (\tilde{Q},\sigma )}}{\partial \tilde{Q}}=\int Dz\overline{{({x}^{{\ast}}({\tilde{Q}}^{-1},\sigma z))}^{2}},\end{equation} \tag{ 25 }$

$\begin{equation}m=-\frac{1}{2}\frac{\partial \overline{\xi (\tilde{Q},\sigma )}}{\partial \tilde{m}}=\rho \tilde{m}{\sigma }_{x}^{2}\int Dz\frac{\partial {x}^{{\ast}}({\tilde{Q}}^{-1},{\sigma }_{+}z)}{\partial ({\sigma }_{+}z)},\end{equation} \tag{ 26 }$

$\begin{equation}\tilde{\chi }=\frac{\alpha (Q-2m+\rho {\sigma }_{x}^{2})}{{\chi }^{2}},\end{equation} \tag{ 27 }$

$\begin{equation}\tilde{Q}=\frac{\alpha }{\chi },\end{equation} \tag{ 28 }$

$\begin{equation}\tilde{m}=\frac{\alpha }{\chi }.\end{equation} \tag{ 29 }$

At the saddle point, ${x}^{{\ast}}({\tilde{Q}}^{-1},\sigma z)$ is statistically equivalent to the point estimate $\hat{\boldsymbol{x}}$ , and χ, Q and m are related to the physical quantities as

$\begin{equation}Q=\underset{N\to \infty }{\mathrm{lim}}\enspace \frac{1}{N}\sum\limits _{i=1}^{N}{E}_{{\boldsymbol{x}}^{0},A}[{\hat{x}}_{i}^{2}],\end{equation} \tag{ 30 }$

$\begin{equation}m=\underset{N\to \infty }{\mathrm{lim}}\enspace \frac{1}{N}\sum\limits _{i=1}^{N}{E}_{{\boldsymbol{x}}^{0},A}[{x}_{i}^{0}{\hat{x}}_{i}],\end{equation} \tag{ 31 }$

$\begin{equation}\chi =\underset{\beta \to \infty }{\mathrm{lim}}\enspace \underset{N\to \infty }{\mathrm{lim}}\enspace \frac{\beta }{N}\sum\limits _{i=1}^{N}{E}_{{\boldsymbol{x}}^{0},A}\left[\langle {x}_{i}^{2}\rangle -\langle {{x}_{i}\rangle }^{2}\right].\end{equation} \tag{ 32 }$

Hence, the expectation value of the mean squared error (MSE) between the reconstructed signal and the original signal is represented as

$\begin{equation}\varepsilon \equiv \frac{1}{N}{E}_{{\boldsymbol{x}}^{0},A}\left[{\Vert}\hat{\boldsymbol{x}}-{\boldsymbol{x}}^{0}{{\Vert}}_{2}^{2}\right]=Q-2m+\rho {\sigma }_{x}^{2}.\end{equation} \tag{ 33 }$
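Relation (33) follows from expanding the squared norm, and the same identity holds for a single finite sample if the expectations are replaced by empirical averages. The following sketch checks this; the function names are hypothetical, and the empirical second moment of x⁰ plays the role of ρσ_x², which is an assumption of this illustration.

```python
def empirical_order_parameters(x_hat, x0):
    """Sample versions of Q in (30), m in (31), and the second moment of x0."""
    n = len(x0)
    Q = sum(v * v for v in x_hat) / n
    m = sum(u * v for u, v in zip(x0, x_hat)) / n
    second_moment = sum(u * u for u in x0) / n  # stands in for rho * sigma_x^2
    return Q, m, second_moment


def mse_from_order_parameters(Q, m, second_moment):
    """Right-hand side of (33)."""
    return Q - 2.0 * m + second_moment


def direct_mse(x_hat, x0):
    """Left-hand side of (33) for one sample."""
    return sum((v - u) ** 2 for u, v in zip(x0, x_hat)) / len(x0)
```

When the estimate coincides with the signal, Q and m both equal the second moment of x⁰ and the MSE vanishes.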

The saddle point equations for variables Ω = {Q, χ, m} directly depend on the functional form of the regularization, but the equations for $\tilde{{\Omega}}$ do not depend on it. In the following subsections, we show the form of saddle point equations of Ω for SCAD and MCP.

The RS solution is stable against the symmetry breaking perturbation when [13, 16]

$\begin{equation}\frac{\alpha }{{\chi }^{2}}\int Dz\overline{{\left(\frac{\partial {x}^{{\ast}}({\tilde{Q}}^{-1},\sigma z)}{\partial (\sigma z)}\right)}^{2}}{< }1,\end{equation} \tag{ 34 }$

which is known as the de Almeida–Thouless (AT) condition [20]. The form of this condition also depends on the functional form of the regularization.

3.1. SCAD

As mentioned in section 2.1, $\tilde{Q}{ >}{(a-1)}^{-1}$ should hold to define the minimizer of (22). In the following, the subspace of the macroscopic parameters where the minimizer of (22) can be defined is denoted by ${{\Omega}}^{{\dagger}}(a)\equiv \left\{Q,\chi ,m\vert \tilde{Q}{ >}{(a-1)}^{-1}\right\}$ for each a, and we restrict our discussion to Ω^†(a). When we are in Ω^†(a), the minimizer of (22) under SCAD is given by [18]

$\begin{equation}{x}^{{\ast}}({\tilde{Q}}^{-1},\sigma z)={{\Sigma}}_{\text{SCAD}}({\tilde{Q}}^{-1},\sigma z){\mathcal{M}}_{\text{SCAD}}({\tilde{Q}}^{-1},\sigma z),\end{equation} \tag{ 35 }$

and substituting solution (35) into (22), we obtain

$\begin{equation}-2L(\tilde{Q},\sigma z)=\begin{cases}\frac{{(\sigma z-\lambda \enspace \mathrm{sgn}(z))}^{2}}{\tilde{Q}}\quad & (\sqrt{2}{\theta }_{1}(\sigma ){< }\vert z\vert {\leqslant}\sqrt{2}{\theta }_{2}(\sigma ))\\ \frac{{\left(\sigma z-\frac{a\lambda }{a-1}\enspace \mathrm{sgn}(z)\right)}^{2}}{\tilde{Q}-\frac{1}{a-1}}+\frac{{\lambda }^{2}}{a-1}\quad & (\sqrt{2}{\theta }_{2}(\sigma ){< }\vert z\vert {\leqslant}\sqrt{2}{\theta }_{3}(\sigma ))\\ \frac{{(\sigma z)}^{2}}{\tilde{Q}}-(a+1){\lambda }^{2}\quad & (\vert z\vert { >}\sqrt{2}{\theta }_{3}(\sigma ))\\ 0\quad & (\mathrm{otherwise})\end{cases},\end{equation} \tag{ 36 }$

where ${\theta }_{1}(\sigma )=\lambda /(\sqrt{2}\sigma )$ , ${\theta }_{2}(\sigma )=\lambda (1+\tilde{Q})/(\sqrt{2}\sigma )$ , and ${\theta }_{3}(\sigma )=a\lambda \tilde{Q}/(\sqrt{2}\sigma )$ . Equation (21) for SCAD regularization is derived as

$\begin{equation}-\xi (\sigma )={\xi }_{1}(\sigma )+{\xi }_{2}(\sigma )+{\xi }_{3}(\sigma )+\frac{{\lambda }^{2}{\xi }_{4}(\sigma )}{a-1}-(a+1){\lambda }^{2}\enspace \mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{3}(\sigma )),\end{equation} \tag{ 37 }$

where

$\begin{equation*}{\xi }_{1}(\sigma )=\frac{{\sigma }^{2}}{\tilde{Q}}\left[-\frac{2{\theta }_{1}(\sigma )}{\sqrt{\pi }}\left({\text{e}}^{-{\theta }_{1}^{2}(\sigma )}+(\tilde{Q}-1){\text{e}}^{-{\theta }_{2}^{2}(\sigma )}\right)\right.\end{equation*}$

$\begin{equation}\left.\quad +(1+2{\theta }_{1}^{2}(\sigma ))\left\{\mathrm{erfc}({\theta }_{1}(\sigma ))-\mathrm{erfc}({\theta }_{2}(\sigma ))\right\}\right],\end{equation} \tag{ 38 }$

$\begin{equation*}{\xi }_{2}(\sigma )=\frac{{\sigma }^{2}}{\tilde{Q}-\frac{1}{a-1}}\left[\frac{2}{\sqrt{\pi }}\left\{{\theta }_{2}(\sigma ){\text{e}}^{-{\theta }_{2}^{2}(\sigma )}\right.\right.\end{equation*}$

$\begin{equation*}\left.\left.\quad -{\theta }_{3}(\sigma ){\text{e}}^{-{\theta }_{3}^{2}(\sigma )}-\frac{2{\theta }_{3}(\sigma )}{\tilde{Q}(a-1)}\left({\text{e}}^{-{\theta }_{2}^{2}(\sigma )}-{\text{e}}^{-{\theta }_{3}^{2}(\sigma )}\right)\right\}\right.\end{equation*}$

$\begin{equation}\left.\quad +\left\{1+2{\left(\frac{{\theta }_{3}(\sigma )}{\tilde{Q}(a-1)}\right)}^{2}\right\}{\xi }_{4}(\sigma )\right],\end{equation} \tag{ 39 }$

$\begin{equation}{\xi }_{3}(\sigma )=\frac{{\sigma }^{2}}{\tilde{Q}}\left[\frac{2{\theta }_{3}(\sigma )}{\sqrt{\pi }}{\text{e}}^{-{\theta }_{3}^{2}(\sigma )}+\mathrm{erfc}({\theta }_{3}(\sigma ))\right],\end{equation} \tag{ 40 }$

$\begin{equation}{\xi }_{4}(\sigma )=\mathrm{erfc}({\theta }_{2}(\sigma ))-\mathrm{erfc}({\theta }_{3}(\sigma )).\end{equation} \tag{ 41 }$

The regularization-dependent saddle point equations are given by

$\begin{equation}Q=\overline{\frac{{\xi }_{1}(\sigma )}{\tilde{Q}}+\frac{{\xi }_{2}(\sigma )}{\tilde{Q}-\frac{1}{a-1}}+\frac{{\xi }_{3}(\sigma )}{\tilde{Q}}},\end{equation} \tag{ 42 }$

$\begin{equation}\chi =\frac{1}{\tilde{Q}}\left[\hat{\rho }+\frac{\frac{1}{a-1}}{\tilde{Q}-\frac{1}{a-1}}\overline{{\xi }_{4}(\sigma )}\right],\end{equation} \tag{ 43 }$

$\begin{equation}m=\rho {\sigma }_{x}^{2}\left[\mathrm{erfc}({\theta }_{1}({\sigma }_{+}))+\frac{\frac{1}{a-1}{\xi }_{4}({\sigma }_{+})}{\tilde{Q}-\frac{1}{a-1}}\right],\end{equation} \tag{ 44 }$

where $\hat{\rho }$ is the density of the nonzero component in the estimate given by

$\begin{equation}\hat{\rho }=\overline{\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{1}(\sigma ))}.\end{equation} \tag{ 45 }$

From (34), the AT condition is derived as

$\begin{equation}\frac{1}{\alpha }\left[\hat{\rho }+\left\{{\left(\frac{\tilde{Q}}{\tilde{Q}-\frac{1}{a-1}}\right)}^{2}-1\right\}\overline{{\xi }_{4}(\sigma )}\right]{< }1.\end{equation} \tag{ 46 }$

3.2. MCP

As with SCAD, we concentrate our discussion on the subspace of the macroscopic parameters ${{\Omega}}^{{\dagger}}(a)=\left\{Q,\chi ,m\vert \tilde{Q}{ >}{a}^{-1}\right\}$ , where the solution of (22) can be defined. The solution of the single-body problem under MCP in Ω^†(a) is given by [18]

$\begin{equation}{x}^{{\ast}}({\tilde{Q}}^{-1},\sigma z)={{\Sigma}}_{\text{MCP}}({\tilde{Q}}^{-1},\sigma z){\mathcal{M}}_{\text{MCP}}({\tilde{Q}}^{-1},\sigma z),\end{equation} \tag{ 47 }$

and we obtain

$\begin{equation}-2L(\tilde{Q},\sigma z)=\begin{cases}\frac{{(\sigma z-\lambda \enspace \mathrm{sgn}(z))}^{2}}{\tilde{Q}-{a}^{-1}}\quad & (\sqrt{2}{\theta }_{1}(\sigma ){< }\vert z\vert {\leqslant}\sqrt{2}{\theta }_{2}(\sigma ))\\ \frac{{(\sigma z)}^{2}}{\tilde{Q}}-a{\lambda }^{2}\quad & (\vert z\vert { >}\sqrt{2}{\theta }_{2}(\sigma ))\\ 0\quad & (\mathrm{otherwise})\end{cases},\end{equation} \tag{ 48 }$

where ${\theta }_{1}(\sigma )=\lambda /(\sqrt{2}\sigma )$ and ${\theta }_{2}(\sigma )=a\lambda \tilde{Q}/(\sqrt{2}\sigma )$ , and (21) for MCP is derived as

$\begin{equation}-\xi (\sigma )={\xi }_{1}(\sigma )+{\xi }_{2}(\sigma ),\end{equation} \tag{ 49 }$

where

$\begin{equation*}{\xi }_{1}(\sigma )=-\frac{2{\sigma }^{2}}{\sqrt{\pi }(\tilde{Q}-{a}^{-1})}\left\{{\theta }_{1}(\sigma )({\text{e}}^{-{\theta }_{1}^{2}(\sigma )}-{\text{e}}^{-{\theta }_{2}^{2}(\sigma )})-{\text{e}}^{-{\theta }_{2}^{2}(\sigma )}({\theta }_{1}(\sigma )-{\theta }_{2}(\sigma ))\right\}\end{equation*}$

$\begin{equation}\quad +\frac{({\sigma }^{2}+{\lambda }^{2}){\xi }_{3}(\sigma )}{\tilde{Q}-{a}^{-1}},\end{equation} \tag{ 50 }$

$\begin{equation}{\xi }_{2}(\sigma )=\frac{{\sigma }^{2}}{\tilde{Q}}\left(\frac{2{\theta }_{2}(\sigma )}{\sqrt{\pi }}{\text{e}}^{-{\theta }_{2}^{2}(\sigma )}+\mathrm{erfc}({\theta }_{2}(\sigma ))\right)-a{\lambda }^{2}\mathrm{erfc}({\theta }_{2}(\sigma )),\end{equation} \tag{ 51 }$

$\begin{equation}{\xi }_{3}(\sigma )=\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{1}(\sigma ))-\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{2}(\sigma )).\end{equation} \tag{ 52 }$

The saddle point equations for Ω are given by

$\begin{equation}Q=\overline{\frac{{\xi }_{1}(\sigma )}{\tilde{Q}-\frac{1}{a}}+\frac{{\sigma }^{2}}{{\tilde{Q}}^{2}}\left\{\frac{2{\theta }_{2}(\sigma )}{\sqrt{\pi }}{\text{e}}^{-{\theta }_{2}^{2}(\sigma )}+\mathrm{erfc}({\theta }_{2}(\sigma ))\right\}},\end{equation} \tag{ 53 }$

$\begin{equation}\chi =\frac{1}{\tilde{Q}}\left[\hat{\rho }+\frac{{a}^{-1}\overline{{\xi }_{3}(\sigma )}}{\tilde{Q}-{a}^{-1}}\right],\end{equation} \tag{ 54 }$

$\begin{equation}m=\rho {\sigma }_{x}^{2}\left[\mathrm{erfc}({\theta }_{1}({\sigma }_{+}))+\frac{{a}^{-1}{\xi }_{3}({\sigma }_{+})}{\tilde{Q}-{a}^{-1}}\right],\end{equation} \tag{ 55 }$

where $\hat{\rho }$ is the density of the nonzero component in the estimate given by

$\begin{equation}\hat{\rho }=\overline{\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{1}(\sigma ))}.\end{equation} \tag{ 56 }$

The AT condition for MCP is derived as

$\begin{equation}\frac{1}{\alpha }\left[\hat{\rho }+\left\{{\left(\frac{\tilde{Q}}{\tilde{Q}-{a}^{-1}}\right)}^{2}-1\right\}\overline{{\xi }_{3}(\sigma )}\right]{< }1.\end{equation} \tag{ 57 }$

4. Stability of success solution

One of the solutions of the saddle point equations in Ω^†(a) is characterized by $Q=m=\rho {\sigma }_{x}^{2}$ . Following the correspondence between the order parameters and the MSE (33), this solution indicates the perfect reconstruction of the original signal x ⁰. Hence, we call the solution with $Q=m=\rho {\sigma }_{x}^{2}$ the success solution. The saddle point equation can have solutions other than the success solution; however, these solutions do not satisfy the AT condition as far as we observed. Substituting the relationship $Q=m=\rho {\sigma }_{x}^{2}$ , we immediately obtain χ = 0 and $~{Q}=~{m}=\infty$ , and the only variable to be solved is $~{\chi }$ . The expansion of Q and m up to the order $O({~{Q}}^{-2})$ gives the expression of $~{\chi }$ for the success solution under SCAD

$\begin{align}~{\chi }& =\frac{1-\rho }{\alpha }\left[-\frac{2~{\chi }}{\sqrt{\pi }}{\theta }_{-}{\text{e}}^{-{\theta }_{-}^{2}}+(~{\chi }+{\lambda }^{2})\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{-})\right]\\ & \quad +\frac{\rho }{\alpha }\left[~{\chi }+{\lambda }^{2}\left\{1-\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{+})\right\}+\left\{{\left(\frac{a\lambda }{a-1}\right)}^{2}+\frac{{\sigma }_{x}^{2}}{{(a-1)}^{2}}\right\}\left\{\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{+})-\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}(a{\theta }_{+})\right\}\right.\\ & \quad \left.+\frac{2{\sigma }_{x}^{2}{\theta }_{+}}{\sqrt{\pi }(a-1)}\left\{\frac{a}{a-1}({\text{e}}^{-{a}^{2}{\theta }_{+}^{2}}-{\text{e}}^{-{\theta }_{+}^{2}})-{\text{e}}^{-{\theta }_{+}^{2}}\right\}\right],\end{align} \tag{ 58 }$

and under MCP

$\begin{align}~{\chi }& =\frac{1-\rho }{\alpha }\left\{-\frac{2~{\chi }}{\sqrt{\pi }}{\theta }_{-}{\text{e}}^{-{\theta }_{-}^{2}}+(~{\chi }+{\lambda }^{2})\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{-})\right\}\\ & \quad +\frac{\rho }{\alpha }\left[~{\chi }+\left({\lambda }^{2}+\frac{{\sigma }_{x}^{2}}{{a}^{2}}\right)\left(1-\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}(a{\theta }_{+})\right)+\frac{2{\sigma }_{x}^{2}{\theta }_{+}}{a\sqrt{\pi }}{\text{e}}^{-{a}^{2}{\theta }_{+}^{2}}-\frac{4{\sigma }_{x}^{2}{\theta }_{+}}{a\sqrt{\pi }}\right],\end{align} \tag{ 59 }$

where ${\theta }_{-}=\lambda /\sqrt{2\tilde{\chi }}$ and ${\theta }_{+}=\lambda /\sqrt{2{\sigma }_{x}^{2}}$ . Equations (58) and (59) reduce to the saddle point equation of $\tilde{\chi }$ corresponding to the success solution for ℓ₁ regularization by setting λ = 1 and taking a → ∞ [16].

For both penalties, the success solution is a locally stable solution as a saddle point of the RS free energy when

$\begin{equation}\frac{1}{\alpha }\left\{(1-\rho )\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}\left({\theta }_{-}\right)+\rho \right\}{< }1.\end{equation} \tag{ 60 }$

This condition is derived by the linear stability analysis of χ around 0. Further, we can show that the AT condition for the success solution is equivalent to (60). This means that when the success solution is locally stable as an RS saddle point, it is also stable with respect to the replica symmetry breaking perturbation. Therefore, the reconstruction limit α_c(ρ) is defined as the minimum value of α that satisfies (60) for each ρ. We also define ρ_c(α) as the maximum value of ρ to satisfy (60) at α, and we use both α_c(ρ) and ρ_c(α) for convenience.
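As a minimal sketch of how condition (60) can be checked numerically, the snippet below evaluates its left-hand side for given α, ρ, λ and $\tilde{\chi }$. The function names are hypothetical, and the value of $\tilde{\chi }$ is taken as an input: in practice it would have to be obtained by solving (58) or (59), which is not done here.

```python
import math


def success_stability_lhs(alpha, rho, lam, chi_tilde):
    """Left-hand side of condition (60).

    chi_tilde is assumed to be a solution of (58) or (59); here it is
    simply supplied by the caller (illustrative assumption).
    """
    theta_minus = lam / math.sqrt(2.0 * chi_tilde)
    return ((1.0 - rho) * math.erfc(theta_minus) + rho) / alpha


def is_success_stable(alpha, rho, lam, chi_tilde):
    """True when (60) holds, i.e. the success solution is locally stable."""
    return success_stability_lhs(alpha, rho, lam, chi_tilde) < 1.0
```

Scanning α at fixed ρ with such a check (with $\tilde{\chi }$ re-solved at each point) is one way to locate α_c(ρ) as the smallest α for which the condition holds.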

Figures 3 and 4 show the ρ-dependence of α_c(ρ) for SCAD minimization and MCP minimization, respectively, where (a) represents a = 3 and (b) represents λ = 0.01. The typical reconstruction is possible in the parameter region α ⩾ α_c(ρ). For comparison, we also show the reconstruction limit of ℓ₁ minimization (L1), the algorithmic limit of the Bayes-optimal method given by the spinodal transition (BO(S)), and the phase transition boundary of the Bayes-optimal method (BO(P)), over which the success solution is locally stable. As λ and a decrease, α_c(ρ) of SCAD and MCP become less than the algorithmic limit of the Bayes-optimal reconstruction method [17]. Further, the reconstruction limit α_c(ρ) approaches ρ as λ → 0. Mathematically, α_c(ρ) → ρ is provided by the scaling θ₋ → ∞ and $\tilde{\chi }\to 0$ at λ → 0, which reduces (60) to ρ < α. This inequality, ρ < α, is considered to be the fundamental limit, because, in general sparse estimation methods, the estimation of the support increases the effective degrees of freedom of the estimated variables. Hence, we need more measurements than the number of variables to be estimated. It is indicated that SCAD and MCP with λ → 0 achieve the typical reconstruction when the number of measurements and that of nonzero variables are balanced.
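The reduction of (60) to ρ < α can be made explicit in one step. Since $\mathrm{erfc}(\theta )\simeq {\text{e}}^{-{\theta }^{2}}/(\theta \sqrt{\pi })$ for large θ, the first term of (60) vanishes as θ₋ → ∞:

$\begin{equation}\frac{1}{\alpha }\left\{(1-\rho )\mathrm{erfc}({\theta }_{-})+\rho \right\}\to \frac{\rho }{\alpha }\quad ({\theta }_{-}\to \infty ),\end{equation}$

so that (60) becomes ρ/α < 1. In the opposite limit θ₋ → 0, erfc(θ₋) → 1 and the left-hand side tends to 1/α, which exceeds 1 for any α < 1; the success solution cannot be stable in that limit.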

**Figure 3.** Reconstruction limit of SCAD at (a) a = 3 for λ = 0.01 and λ = 0.1 and (b) λ = 0.01 for a = 3 and a = 50. The lines with 'L1', 'BO(S)' and 'BO(P)' are the reconstruction limit under ℓ₁ minimization, the algorithmic limit by the Bayes-optimal method given by spinodal transition, and the phase transition point of the Bayes-optimal method, respectively. The shaded regions are α < ρ.

**Figure 4.** Reconstruction limit of MCP at (a) a = 3 for λ = 0.01 and λ = 0.1 and (b) λ = 0.01 for a = 3 and a = 50. The lines with 'L1', 'BO(S)' and 'BO(P)' are the same as figure 3. The shaded regions are α < ρ.

We denote by a_c(λ) the maximum value of a below which the signals can be reconstructed for each λ. The reconstruction limit a_c(λ) on the λ − a plane is shown in figure 5 for (a) SCAD and (b) MCP, respectively, at α = 0.5 for ρ = 0.3 and ρ = 0.4. The horizontal dashed lines represent a_min, which is equal to 1 when the success solution is stable, and the signals can be reconstructed in the parameter region a_min < a ⩽ a_c(λ). The dashed vertical lines represent λ_c, defined as the maximum value of λ that gives a_c(λ) > a_min. Hence, the signal cannot be reconstructed at λ ⩾ λ_c. For the reconstruction of dense signals, small nonconvexity parameters λ and a are required, and a_c(λ) and λ_c for MCP are always greater than those for SCAD. The dependences of λ_c on ρ/α for SCAD and MCP are compared in figure 6 for (a) α = 0.3 and (b) α = 0.5. The vertical lines represent the reconstruction limit of ℓ₁ minimization, and the values of λ_c diverge as ρ/α approaches the ℓ₁ reconstruction limit. This divergence of λ_c means that one can reconstruct the signals using any λ ∈ (0, ∞) and a ∈ (a_min, ∞) when the signals are sufficiently sparse to be reconstructed by ℓ₁ minimization. For any system parameters, the divergence of λ_c in MCP is faster than that in SCAD, which indicates that the range of nonconvexity parameters allowing reconstruction is wider in MCP than in SCAD. In this sense, MCP is superior to SCAD.

**Figure 5.** Reconstruction limit a_c(λ) at α = 0.5 for (a) SCAD and (b) MCP, respectively. The vertical dashed lines represent the maximum value of λ, where the reconstruction is possible with a_min < a ⩽ a_c(λ). The horizontal dashed lines represent a = 1, which is the minimum value of a for the success solution.

**Figure 6.** ρ/α-dependence of λ_c for (a) α = 0.3 and (b) α = 0.5. In the parameter region to the left of the vertical lines, ℓ₁ minimization reconstructs the original signals.

4.1. Comment on the RS-failure solution

For subsequent discussions, we note the existence of a solution of the RS saddle point equation at α < α_c. One can find a solution with ɛ > 0 ( $Q{< }\rho {\sigma }_{x}^{2}$ and $m{< }\rho {\sigma }_{x}^{2}$ ) and χ > 0 within Ω^†(a) at α < α_c, which violates the AT condition. We term this solution the RS-unstable failure solution. Figures 7 and 8 show the α-dependence of ɛ and χ for SCAD and MCP, respectively, where the vertical dotted lines represent α_c. The RS-unstable failure solution is smoothly connected to the success solution that appears at α ⩾ α_c, and does not coexist with the success solution. The RS-unstable failure solution does not contribute to the equilibrium property of the system, but it is still useful for understanding the behavior of the algorithm, as shown in the following sections.

**Figure 7.** α-dependence of ɛ and χ in RS solution of SCAD at ρ = 0.35 for (a) λ = 1, a = 10 and (b) λ = 0.3, a = 5. The vertical dotted lines indicate α_c, and the vertical dashed lines in (b) indicate the disappearance of the finite ɛ and χ.

**Figure 8.** α-dependence of ɛ and χ in RS solution of MCP at ρ = 0.4 for (a) λ = 1, a = 10 and (b) λ = 0.5, a = 5. The vertical dotted lines indicate α_c, and the vertical dashed lines in (b) indicate the disappearance of the finite ɛ and χ.

4.2. Comment on the diverging 'solution'

As shown in figures 7(b) and 8(b), when λ is sufficiently small, the values of ɛ and χ tend to diverge at sufficiently small α, and the solution with finite ɛ and χ disappears. In fact, (43) indicates that the solution χ → ∞ is stable for both SCAD and MCP when

$\begin{equation}\alpha \enspace {< }(1-\rho )\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{\infty }^{-})+\rho \enspace \mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{\infty }^{+})\end{equation} \tag{ 61 }$

holds, where ${\theta }_{\infty }^{+}=a\lambda \sqrt{\alpha /\left\{2(\alpha +\varepsilon )\right\}}$ and ${\theta }_{\infty }^{-}=a\lambda \sqrt{\alpha /(2\varepsilon )}$ , although the solution is out of the physical region Ω^†(a). In the limit χ → ∞, the sets of saddle point equations for SCAD and MCP reduce to the same single equation for the MSE ɛ:

$\begin{align}\varepsilon & =\frac{\varepsilon (1-\rho )}{\alpha }\left\{\frac{2{\theta }_{\infty }^{-}}{\sqrt{\pi }}{\text{e}}^{-{({\theta }_{\infty }^{-})}^{2}}+\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{\infty }^{-})\right\}+\frac{\rho (\varepsilon +\alpha )}{\alpha }\left\{\frac{2{\theta }_{\infty }^{+}}{\sqrt{\pi }}{\text{e}}^{-{({\theta }_{\infty }^{+})}^{2}}+\mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{\infty }^{+})\right\}\\ & \quad -2\rho {\sigma }_{x}^{2}\enspace \mathrm{e}\mathrm{r}\mathrm{f}\mathrm{c}({\theta }_{\infty }^{+})+\rho {\sigma }_{x}^{2}.\end{align} \tag{ 62 }$

The solutions of (62) can be finite or infinite depending on α, λ, ρ and a, and the infinite ɛ is always stable if it exists when α < 1. The diverging solutions do not contribute to the thermodynamic behavior, because they are not in the region Ω^†(a), but they affect the algorithmic behavior of SCAD or MCP minimization.
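The stability condition (61) is straightforward to evaluate once ɛ is known. The following is a minimal sketch with hypothetical function names; ɛ is supplied by the caller rather than obtained by solving (62).

```python
import math


def diverging_state_rhs(alpha, rho, lam, a, eps):
    """Right-hand side of (61) for a given MSE value eps (> 0)."""
    theta_plus = a * lam * math.sqrt(alpha / (2.0 * (alpha + eps)))
    theta_minus = a * lam * math.sqrt(alpha / (2.0 * eps))
    return (1.0 - rho) * math.erfc(theta_minus) + rho * math.erfc(theta_plus)


def diverging_state_is_stable(alpha, rho, lam, a, eps):
    """True when inequality (61) holds for the supplied eps."""
    return alpha < diverging_state_rhs(alpha, rho, lam, a, eps)
```

For very large ɛ both arguments of erfc tend to zero, so the right-hand side approaches 1 and the condition becomes α < 1, in line with the statement that the infinite-ɛ solution is stable whenever α < 1.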

5. Approximate message passing with nonconvexity control

For numerical computation of the estimate under a given measurement matrix, AMP is a feasible algorithm with low computational cost. As discussed below, the typical trajectory and fixed point of AMP can be connected to the analysis based on the replica method; hence, we can understand the algorithmic behavior by comparing it with the analysis. The detailed derivation has been given in previous studies [13, 17, 21], and here we briefly introduce the algorithm. In AMP, the estimates under a general separable sparse penalty are recursively updated as

$\begin{equation}{\hat{x}}_{i}^{(t+1)}=\mathrm{arg}\underset{x}{\mathrm{min}}\left\{\frac{{\tilde{Q}}_{i}^{(t)}}{2}{x}^{2}-{h}_{i}^{(t)}x+J(x;\lambda ,a)\right\},\end{equation} \tag{ 63 }$

where ${\hat{x}}_{i}^{(t)}$ denotes the estimate at time step t, and

$\begin{equation}{\tilde{Q}}_{i}^{(t)}=\frac{1}{{\hat{V}}^{(t)}},\end{equation} \tag{ 64 }$

$\begin{equation}{h}_{i}^{(t)}={\hat{x}}_{i}^{(t-1)}{\tilde{Q}}_{i}^{(t)}+\sum\limits _{\mu =1}^{M}{A}_{\mu i}{R}_{\mu }^{(t)},\end{equation} \tag{ 65 }$

$\begin{equation}{\hat{V}}^{(t)}=\frac{1}{M}\sum\limits _{i=1}^{N}{\hat{v}}_{i}^{(t-1)},\end{equation} \tag{ 66 }$

$\begin{equation}{\hat{v}}_{i}^{(t)}=\frac{\partial {\hat{x}}_{i}^{(t)}}{\partial {h}_{i}^{(t)}},\end{equation} \tag{ 67 }$

$\begin{equation}{R}_{\mu }^{(t)}=\frac{{y}_{\mu }-{\sum }_{i}{A}_{\mu i}{\hat{x}}_{i}^{(t-1)}}{{\hat{V}}^{(t)}}.\end{equation} \tag{ 68 }$

The solution of (63) corresponds to the minimizer of (22), with the replacement of $\tilde{Q}$ and σz with ${\tilde{Q}}_{i}^{(t)}$ and ${h}_{i}^{(t)}$ , respectively.
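To illustrate the scalar minimization in (63), the sketch below gives a closed-form minimizer for MCP together with the derivative in (67). It assumes the standard MCP form, J(x; λ, a) = λ|x| − x²/(2a) for |x| ⩽ aλ and aλ²/2 otherwise, and assumes that the quadratic coefficient `q` (playing the role of the coefficient of x²/2 in (63)) satisfies q > 1/a, so that the scalar objective is convex. The function names are hypothetical.

```python
import numpy as np


def mcp_penalty(x, lam, a):
    """Standard MCP penalty (assumed form, see lead-in)."""
    ax = np.abs(x)
    return np.where(ax <= a * lam, lam * ax - x ** 2 / (2.0 * a), a * lam ** 2 / 2.0)


def mcp_estimate(h, q, lam, a):
    """Minimizer of (q/2) x^2 - h x + J_MCP(x; lam, a) and its derivative w.r.t. h.

    Requires q > 1/a (assumption); returns arrays (x_hat, v_hat).
    """
    if q <= 1.0 / a:
        raise ValueError("this sketch assumes q > 1/a")
    h = np.asarray(h, dtype=float)
    abs_h = np.abs(h)
    zero = abs_h <= lam                      # estimate is thresholded to zero
    shrink = (~zero) & (abs_h <= a * lam * q)  # intermediate, rescaled region
    x_shrink = np.sign(h) * (abs_h - lam) / (q - 1.0 / a)
    x_hat = np.where(zero, 0.0, np.where(shrink, x_shrink, h / q))
    v_hat = np.where(zero, 0.0, np.where(shrink, 1.0 / (q - 1.0 / a), 1.0 / q))
    return x_hat, v_hat
```

The three branches follow from the stationarity condition of the scalar objective in each region of the penalty; a brute-force grid search over x reproduces the same minimizers.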

The local stability of AMP corresponds to the AT instability condition [13]. Hence, it is expected that AMP reconstructs the original signal in the theoretically derived parameter region α > α_c(ρ) at sufficiently large N, because the current problem does not exhibit any first order transitions or spinodal transitions under fixed nonconvexity parameters, in contrast to the Bayes-optimal setting [17] or the Monte Carlo sampling case [22]. Figure 9 shows examples of reconstructed signals of N = 400 and α = 0.5 after 1000 update steps of AMP under SCAD (a) at λ = 1 and a = 3 for ρ = 0.25 and (b) at λ = 0.1 and a = 3 for ρ = 0.3, where the original and reconstructed signals are represented by solid lines and circles, respectively. In these parameter regions, perfect reconstruction is theoretically supported, but as shown in figure 9(b), the naive update of AMP does not achieve the perfect reconstruction for small values of nonconvexity parameters. The discrepancy between the replica analysis and AMP appears, in particular, when the signal is dense. The tendency is common to both SCAD and MCP, hence we explain their characteristic behavior using SCAD as a representative.

**Figure 9.** True signal ${x}_{i}^{0}$ (solid line) and reconstructed signal ${\hat{x}}_{i}$ (circles) at N = 100 and M = 50 for (a) ρ = 0.25 with SCAD at λ = 1 and a = 3, and (b) ρ = 0.28 with SCAD at λ = 0.1 and a = 3.

As mentioned above, no other stable solutions exist when the success solution is locally stable; hence, the discrepancy between the replica analysis and AMP cannot be understood by the spinodal transition as with the Bayes-optimal method. For understanding the difficulty in achieving the perfect reconstruction of the signal by AMP at a small nonconvexity parameter region, we use state evolution (SE) [23]. The typical trajectory of AMP is characterized by two parameters ${V}^{(t)}\equiv {E}_{{\boldsymbol{x}}^{0},A}[{\hat{V}}^{(t)}]$ and ${\varepsilon }^{(t)}\equiv {E}_{{\boldsymbol{x}}^{0},A}[{\hat{\varepsilon }}^{(t)}]$ , where ${\hat{\varepsilon }}^{(t)}\equiv {\sum }_{i=1}^{N}{({\hat{x}}_{i}^{(t)}-{x}_{i}^{0})}^{2}/N$ is the MSE at the tth iteration step. In particular, when the components of A are independently and identically distributed with mean 0 and variance 1/N, as for the Gaussian measurement matrix, the time evolution of V^(t) and ɛ^(t) is described by SE equations [13, 17]

$\begin{equation}{V}^{(t+1)}=\int \mathrm{d}{x}^{0}\enspace {P}_{0}({x}^{0})\int Dz{\Sigma}({\alpha }^{-1}{V}^{(t)},{x}^{0}+z\sqrt{{\alpha }^{-1}{\varepsilon }^{(t)}}),\end{equation} \tag{ 69 }$

$\begin{equation}{\varepsilon }^{(t+1)}=\int \mathrm{d}{x}^{0}\enspace {P}_{0}({x}^{0})\int Dz{\left[\hat{x}({\alpha }^{-1}{V}^{(t)},{x}^{0}+z\sqrt{{\alpha }^{-1}{\varepsilon }^{(t)}})-{x}^{0}\right]}^{2},\end{equation} \tag{ 70 }$

where $\hat{x}(s,w)={{\Sigma}}_{\text{p}}(s,w){\mathcal{M}}_{\text{p}}(s,w)$ for p ∈ {SCAD, MCP}. SE is equivalent to the RS saddle point equation, and the fixed point denoted by V* and ɛ* corresponds to the RS saddle point as V* = χ and ${\varepsilon }^{{\ast}}=Q-2m+\rho {\sigma }_{x}^{2}$ , respectively. Hence, the success solution is described as V* = ɛ* = 0 in the SE. As mentioned in section 4.1, the failure solution appears in some parameter regions, but it always involves the RS instability and never coexists with the success solution. Note that the flow of the SE describes the typical trajectory of the AMP with respect to A and x ⁰. Hence, it does not necessarily describe a trajectory under a fixed realization of A and x ⁰. However, it is expected that the trajectories converge to the flow of SE for a sufficiently large system size. Hence, SE flow supports an understanding of a trajectory of AMP under a fixed set of A and x ⁰ [24].
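A single SE update (69) and (70) can be sketched by Monte Carlo sampling. The example below is for MCP and rests on several assumptions that are not taken from the text: the prior P₀ is Bernoulli–Gaussian with density ρ and variance σ_x², the MCP penalty has the standard form, x̂(s, w) is the minimizer of (x − w)²/(2s) + J(x; λ, a), and Σ(s, w) is s times the derivative of x̂ with respect to w. The function names, the sample size and the seed are hypothetical.

```python
import numpy as np


def mcp_denoiser(s, w, lam, a):
    """Assumed MCP estimator x_hat(s, w) and variance-like factor Sigma(s, w).

    Requires 0 <= s < a so that the scalar problem is convex (assumption).
    """
    if not 0.0 <= s < a:
        raise ValueError("this sketch assumes 0 <= s < a")
    abs_w = np.abs(w)
    zero = abs_w <= s * lam
    shrink = (~zero) & (abs_w <= a * lam)
    x_shrink = np.sign(w) * (abs_w - s * lam) / (1.0 - s / a)
    x_hat = np.where(zero, 0.0, np.where(shrink, x_shrink, w))
    sigma = np.where(zero, 0.0, np.where(shrink, s / (1.0 - s / a), s))
    return x_hat, sigma


def se_step(V, eps, alpha, rho, lam, a, sigma_x=1.0, n_samples=200_000, seed=0):
    """One Monte Carlo evaluation of (69) and (70); returns (V_next, eps_next)."""
    rng = np.random.default_rng(seed)
    nonzero = rng.random(n_samples) < rho
    x0 = np.where(nonzero, sigma_x * rng.standard_normal(n_samples), 0.0)
    z = rng.standard_normal(n_samples)
    w = x0 + z * np.sqrt(eps / alpha)          # effective noisy observation
    x_hat, sigma = mcp_denoiser(V / alpha, w, lam, a)
    return float(sigma.mean()), float(np.mean((x_hat - x0) ** 2))
```

Under these assumptions, V = ɛ = 0 is mapped to itself, which is the SE counterpart of the success solution. Iterating `se_step` from other initial conditions gives a rough picture of the flow, up to sampling error.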

Figure 10 shows the flow of SE at α = 0.5 and ρ = 0.28 for SCAD at (a) λ = 1 and a = 3 and (b) λ = 0.1 and a = 3. The arrows assigned to the coordinate $(\hat{V},\hat{\varepsilon })$ are the normalized vector of (V^(t+1) − V^(t), ɛ^(t+1) − ɛ^(t)) with ${V}^{(t)}=\hat{V}$ and ${\varepsilon }^{(t)}=\hat{\varepsilon }$ , which indicate the direction of SE's flow, and the stars depict the fixed points of SE. As shown in figure 10(a), SE has a fixed point with finite ɛ and V for large λ, which corresponds to the RS-unstable failure solution, but as λ decreases, most of the SE flow leads to a divergence of V and ɛ as shown in figure 10(b); however, there is still a region close to the V-axis where the flows are directed to V = ɛ = 0. This region shrinks to the V-axis as the nonconvexity parameter decreases. Flows of SE in MCP are almost the same as those in SCAD shown in figure 10. In the case of ℓ₁ minimization, one can check that SE reaches the success solution from any initial condition of (V, ɛ), namely the volume of the basin of attraction (BOA) diverges to infinity at α > α_c(ρ) or ρ < ρ_c(α). Therefore, this shrinking basin and the diverging flow are significant properties of the minimization problems of SCAD and MCP.

We quantify the volume of the BOA under SCAD minimization and show its dependency on ρ for α = 0.5 in figure 11(a). The BOA to V = ɛ = 0 is zero at ρ = ρ_c, and gradually increases from zero as ρ decreases from ρ_c. When the nonconvexity parameter λ is small, the basin volume tends to be small for any ρ. Figure 11(b) shows the possible maximum value of ɛ, denoted by ɛ_max, as an initial condition to converge to V = ɛ = 0. Namely, ɛ_max is the maximum value of ɛ on the boundary of the BOA. It means that to achieve perfect reconstruction at sufficiently large ρ with a small nonconvexity parameter, we need to set the initial condition as ɛ ∼ O(10⁻²) for λ = 0.1 and a = 3, and as ɛ ∼ O(10⁻⁵) for λ = 0.01 and a = 3. Such initial conditions with small ɛ are not realistic, and the possibility that AMP attains the success solution V = ɛ = 0 from randomly chosen initial conditions is exceedingly small.

The shrinking BOA is the origin of the difficulty of AMP for small nonconvexity parameters. To resolve this problem, we recall that the RS-unstable failure solution appears for α < α_c for large nonconvexity parameters, as discussed in figures 7 and 8. The emergence of the RS-unstable failure solution implies that the SE has a locally stable fixed point from the correspondence between the saddle point of RS free energy and the SE, as shown in figure 10(a). Therefore, AMP for the large nonconvexity parameter does not converge to a fixed point, but its trajectory is confined into a subshell characterized by finite ɛ and V. We utilize this nondivergence property of AMP in α < α_c at sufficiently large nonconvexity parameters for the perfect reconstruction of dense signals at small nonconvexity parameters. The procedure introduced here, based on the above consideration, is termed nonconvexity control, where we decrease the value of nonconvexity parameters in updating AMP.

Here, we consider the control of the parameter λ under a fixed value of a. Figure 12 shows λ-dependence of ɛ and V at α = 0.5 and a = 3 for ρ = 0.28 and ρ = 0.32, and explains how the nonconvexity control proposed here works or fails for the perfect reconstruction. We treat the set of macroscopic fixed points as a sequence generated by shifting the value of λ. Note that the sequence mainly consists of RS-unstable failure solutions. At ρ = 0.28, the sequences of ɛ and V are connected to zero by decreasing λ, hence one can potentially attain perfect reconstruction by starting from large λ and decreasing λ. However, at ρ = 0.32, ɛ and V tend to diverge when λ ∈ (0.2117, 0.5530) as shown in figure 12. In this region, the finite RS-unstable fixed points disappear, and the SE flow goes to the diverging state, although the state is not allowed as a solution. Figure 13 is an example of SE flow going towards the diverging state in the absence of the solution with finite ɛ and V, which is observed at α = 0.5, ρ = 0.32, λ = 0.4 and a = 3. The discontinuity in the sequence of the macroscopic parameters and the SE flow to the diverging state obstruct the nonconvexity control.

**Figure 13.** SE flow at α = 0.5, ρ = 0.32, λ = 0.4 and a = 3. There is no fixed point and the flow diverges.

In figure 14, we compare the sequence of the macroscopic fixed point with the BOA to the origin at α = 0.5 and a = 3 for (a) ρ = 0.28 and (b) ρ = 0.32. The solid lines of figure 14 are drawn by continuously shifting λ, and are equivalent to figure 12. Dots on the lines represent examples of the fixed points at each λ. In figure 14(a), the shaded region denotes the BOA to V = ɛ = 0 at λ = 0.3 and a = 3, where the perfect reconstruction is theoretically supported. To attain perfect reconstruction at this parameter region, we need to prepare the initial condition with ɛ ∼ O(10⁻²), which is not realistic. However, by decreasing the nonconvexity parameter λ from larger values, such as λ = 1, the fixed point, which corresponds to the RS-unstable failure solution, comes into the BOA to the perfect reconstruction at λ = 0.3 as shown in figure 14(a). In figure 14(b), BOA to the perfect reconstruction at λ = 0.1 and a = 3 is depicted as the shaded region at α = 0.5 and ρ = 0.32. For larger ρ, the sequence of the fixed point shows discontinuity as shown in figure 14(b), which is caused by the diverging property of the macroscopic quantities as shown in figures 12 and 13. It is expected that the sequence of the fixed point below λ = 0.2117 can provide a clue to attaining the perfect reconstruction, but the corresponding BOA is already shrunk.

Based on the abovementioned observations, we define the NCC limit ρ_NCC(α) as the largest value of ρ under given α at which the sequence of the fixed points reaches V = ɛ = 0 as λ decreases without encountering the discontinuity due to the divergence of the macroscopic parameters; α_NCC(ρ) is defined in the same way. Figures 15(a) and (b) show the phase diagram on the ρ–λ plane at α = 0.5 and a = 3, and that on the α–λ plane at ρ = 0.32 and a = 3. The NCC limit is denoted by the horizontal dashed lines, and the stability of the diverging state (61) is satisfied on the left side of the dotted–dashed lines. At ρ < ρ_NCC or α > α_NCC(ρ), the sequence of the fixed point is connected to V = ɛ = 0 by decreasing λ as shown in figure 14(a). Examples of the SE's flow below the NCC limit in the RS-unstable failure and the success region are shown in figures 10(a) and (b), respectively. At ρ > ρ_NCC or α < α_NCC, a 'no solution' region, in which SE does not have any fixed points in Ω^†(a) and diverges, appears between the 'success' region and the 'RS-unstable failure' region, and the nonconvexity control fails due to the 'no solution' region. An example of the SE flow in the 'no solution' region is figure 13. As λ increases, the RS-unstable failure phase and the success phase are connected to each other for any ρ. This property is the same as that of ℓ₁ minimization.

**Figure 15.** Phase diagram of SCAD on (a) ρ–λ plane at α = 0.5 and a = 3 and (b) α–λ plane at ρ = 0.32 and a = 3. The horizontal dashed lines denote the NCC limit: (a) ρ_NCC(α = 0.5) and (b) α_NCC(ρ = 0.32). On the left of the dotted–dashed line, the diverging state is stable and the SE flow tends to diverge.

Figure 16 shows the α-dependence of the NCC limit ρ_NCC(α) for (a) SCAD and (b) MCP. The dependency of ρ_NCC on a is weak; changes in a induce changes in ρ_NCC smaller than O(10⁻³), and here the value is maximized with respect to a. For comparison, the phase transition boundary α_c(ρ) for λ = 0.1 and a = 3 is shown as the principle limit, in the sense that the stability of the success solution is guaranteed. According to the principle limit, it is expected that MCP can achieve perfect reconstruction under denser signals than SCAD, but its NCC limit is inferior to that of SCAD by the order of O(10⁻²) in terms of ρ. This observation implies that there remains room for improvement in designing nonconvex penalties that overcome the basin shrinkage while sustaining the global stability of V = ɛ = 0 in the dense region.

The problem in practice is the protocol of the nonconvexity control. We consider here an 'equilibrium' approach; we spend sufficient time steps at each λ for the convergence to the macroscopic fixed point, which corresponds to the RS-unstable failure solution, and after that we decrease λ by dλ. Hence, we need to set the sufficient time for 'equilibration' and dλ appropriately. However, we cannot assess ${\hat{\varepsilon }}^{(t)}$ in AMP's trajectory since its calculation requires the unknown true signal x ⁰. Instead, we observe ${\hat{V}}^{(t)}$ and ${\hat{D}}^{(t)}\equiv \frac{1}{N}{\sum }_{i=1}^{N}{({\hat{x}}_{i}^{(t+1)}-{\hat{x}}_{i}^{(t)})}^{2}$ as criteria of the convergence, and we decrease λ by dλ after the saturation of the ${\hat{V}}^{(t)}$ and ${\hat{D}}^{(t)}$ around certain values. Next, in determining dλ, the macroscopic fixed point at λ is required to be in the BOA of the macroscopic fixed point at λ − dλ for the effective nonconvexity control. We denote the maximum value of dλ as dλ_max over which the abovementioned condition is violated, and choose a value dλ smaller than dλ_max for nonconvexity control. The value of dλ_max assessed by SE at α = 0.5 and ρ = 0.28 is shown as the solid line in figure 17(a), where dλ_max is obtained by observing the SE flow at λ − dλ starting from the fixed point at λ for various dλ. Here, we note that at this parameter region, the perfect reconstruction is possible at λ < 0.3, hence dλ for λ < 0.3 is trivially dλ ∼ λ.
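The 'equilibrium' protocol described above can be sketched as a simple outer loop around an arbitrary update rule. In the sketch below, `step(x, lam)` is a hypothetical callable standing in for one AMP update at fixed λ that returns the new estimate and the current value of ${\hat{V}}^{(t)}$; the saturation tolerance `tol` and the iteration cap `max_iter` are likewise illustrative assumptions, not values from the text.

```python
import numpy as np


def nonconvexity_control(step, x_init, lam_start, lam_end, dlam,
                         tol=1e-10, max_iter=10_000):
    """Decrease lam from lam_start to lam_end in steps of dlam.

    At each lam, `step` is iterated until both V_hat and
    D_hat = mean((x_new - x)^2) stop changing between iterations.
    Returns the final estimate and a list of (lam, V_hat, D_hat) per stage.
    """
    x = np.asarray(x_init, dtype=float)
    n_stages = int(round((lam_start - lam_end) / dlam))
    history = []
    for k in range(n_stages + 1):
        lam = lam_start - k * dlam
        v_prev, d_prev = np.inf, np.inf
        for _ in range(max_iter):
            x_new, v_hat = step(x, lam)
            d_hat = float(np.mean((x_new - x) ** 2))
            # Saturation test: neither monitored quantity changes any more.
            saturated = abs(v_hat - v_prev) < tol and abs(d_hat - d_prev) < tol
            x, v_prev, d_prev = x_new, v_hat, d_hat
            if saturated:
                break
        history.append((lam, v_prev, d_prev))
    return x, history
```

The loop monitors only quantities that are available without knowledge of x ⁰. It tests saturation rather than vanishing of ${\hat{D}}^{(t)}$ because, at α < α_c, the trajectory may settle around finite values without converging to a fixed point.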

**Figure 17.** (a) The value of dλ_max below which the nonconvexity control is effective at α = 0.5, ρ = 0.28 and a = 3 for SCAD. (b) Trajectory on V–ɛ plane with nonconvexity control of SCAD. The trajectories of AMP for one realization of x ⁰ and A at N = 10⁵ (solid line) and SE (circles) are shown. The initial condition is set to be V = ɛ = ρ, and λ = 1, a = 3. The value of λ is decreased by dλ = 0.1 after the convergence of ${\hat{V}}^{(t)}$ and ${\hat{D}}^{(t)}$ at each λ.

The value of dλ shown in figure 17(a) is for the typical realization of A and x ⁰, and the possible value of dλ might fluctuate depending on A and x ⁰. To be on the safe side, we set dλ = 0.1 for any λ in applying the nonconvexity control to AMP under a given A and x ⁰. Figure 17(b) shows the actual trajectory of AMP (solid line) for one realization of A and x ⁰ at N = 10⁵, and the corresponding SE (circles) at α = 0.5, ρ = 0.28 under nonconvexity control. The initial condition of AMP is set to be x = 0 and v = ρ 1_N, where 1_N is the N-dimensional vector whose components are all 1, and hence that of SE is V = ɛ = ρ. We start with λ = 1 at a = 3, and decrease λ by dλ = 0.1 after the convergence of ${\hat{V}}^{(t)}$ and ${\hat{D}}^{(t)}$ at each λ, until λ reaches 0.3. The behavior of AMP is well described by SE. Compared with the flow of SE (figure 10(b)), the trajectory of AMP with nonconvexity control approaches the BOA connected to the success solution at λ = 0.1, which is almost on the V axis.

6. Summary and discussion

We have analytically derived the perfect reconstruction limit of the sparse signal in compressed sensing by the minimization of nonconvex sparse penalties, SCAD and MCP. In particular, when the nonconvexity parameters are small, SCAD and MCP minimization reconstruct dense signals that are beyond the ℓ₁ reconstruction limit. This analytical result also appears to imply that SCAD and MCP minimization overcome the algorithmic limit of the Bayes-optimal method, but the numerical experiments using AMP have shown that this is actually not the case. The gap between the analytical and numerical results has been understood by observing the flow of SE, revealing that the failure of AMP comes from the shrinking BOA and the divergent behavior of AMP in some parameter regions. We have found that SCAD and MCP minimization show a novel failure scenario of the algorithm, different from the Bayes-optimal setting where the algorithmic limit is characterized by the emergence of local minima. To mitigate the abovementioned gap and determine the algorithmic limit of AMP, we have proposed the protocol of the nonconvexity control, leading to largely improved performance.

Originally, SCAD and MCP were designed to satisfy the continuity and the oracle property, which is the simultaneous appearance of the asymptotic normality and consistency, at a certain limit with respect to the nonconvexity parameters [11, 12]. However, such a property is not sufficient for practical usage. The design of nonconvex penalties that do not lead to a shrinkage of the basin is another possibility for nonconvex compressed sensing. From the relationship between the sparse prior in the Bayesian approach and the sparse penalty in the frequentist approach, it is implied that SCAD and MCP can be related to the Bernoulli–Gaussian prior in Bayesian terminology with large variance [25]. A unified understanding of the sparsity over the Bayesian and frequentist approaches will be helpful for designing such desirable sparse penalties. Further, the application of the nonconvexity control to general matrices beyond those consisting of i.i.d. entries should be discussed for practical usage. The rotationally invariant matrix is one of the candidates to examine the effectiveness of the nonconvexity control for general matrices [26–28].

Acknowledgments

The authors would like to thank Yoshiyuki Kabashima, Satoshi Takabe, Mirai Tanaka, and Yingying Xu for their helpful discussions and comments. This work is partially supported by JSPS KAKENHI No. 19K20363 (A S), and Nos. 19H01812, 18K11463 and 17H00764 (T O), Japan Science and Technology Agency (JST) PRESTO Grant No. JPMJPR19M2 (A S), and a Grant for Basic Science Research Projects from the Sumitomo Foundation (T O).
