Approximate message passing for nonconvex sparse regularization with stability and asymptotic analysis

We analyse a linear regression problem with nonconvex regularization called smoothly clipped absolute deviation (SCAD) under an overcomplete Gaussian basis for Gaussian random data. We propose an approximate message passing (AMP) algorithm considering nonconvex regularization, namely SCAD-AMP, and analytically show that the stability condition corresponds to the de Almeida–Thouless condition in spin glass literature. Through asymptotic analysis, we show the correspondence between the density evolution of SCAD-AMP and the replica symmetric (RS) solution. Numerical experiments confirm that for a sufficiently large system size, SCAD-AMP achieves the optimal performance predicted by the replica method. Through replica analysis, a phase transition between replica symmetric and replica symmetry breaking (RSB) region is found in the parameter space of SCAD. The appearance of the RS region for a nonconvex penalty is a significant advantage that indicates the region of smooth landscape of the optimization problem. Furthermore, we analytically show that the statistical representation performance of the SCAD penalty is better than that of ℓ1-based methods, and the minimum representation error under RS assumption is obtained at the edge of the RS/RSB phase. The correspondence between the convergence of the existing coordinate descent algorithm and RS/RSB transition is also indicated.


Introduction
Variable selection is a basic and important problem in statistics, the objective of which is to find parameters that are significant for the description of given data as well as for the prediction of unknown data.The sparse estimation approach for variable selection in high-dimensional statistical modelling has the advantages of high computational efficiency, stability, and the ability to draw sampling properties compared with traditional approaches that follow stepwise and subset selection procedures [1].It has been studied intensively in recent decades.The development of sparse estimation has been accelerated since the proposal of least absolute shrinkage and selection operator (LASSO) [2], where variable selection is formulated as a convex problem of the minimization of the loss function associated with 1 regularization.Although the LASSO has many attractive properties, the shrinkage introduced by 1 regularization results in significant bias toward 0 for large regression coefficients.To resolve this problem, nonconvex penalties have been proposed, such as the smoothly clipped absolute deviation (SCAD) penalty [3] and minimax concave penalty (MCP) [4].The estimators under these regularizations have several desirable properties, including unbiasedness, sparsity, and continuity [3].Nonconvex penalties are seemingly difficult to tackle because of the concerns regarding local minima owing to a lack of convexity.Thus, the development of efficient algorithms and verification of their typical performance have not been achieved.In previous studies, it has been shown that coordinate descent (CD) is an efficient algorithm for such nonconvex penalties.A sufficient condition in parameter space for CD to converge to the globally stable state has been proposed in [5].It was derived by satisfying the convexity of the objective function in a local region of the parameter space that contains the sparse solutions; however, the convergence condition has not been derived mathematically.Altough the optimal potential of the nonconvex sparse penalty remains unclear, the existence of such a workable region would imply that the theoretical method applied to the convex penalty is partially valid for the nonconvex penalty.This factor motivates us to theoretically evaluate the performance and develop an algorithm that is theoretically guaranteed to fulfill the potential of the nonconvex sparse penalty.
In this study, we develop an approximate message passing (AMP) algorithm for linear regression with nonconvex regularization.For comparison with the 1 penalty, we employ the SCAD penalty, where the nonconvexity is controlled by parameters and 1 regularization is contained at a limit of the parameters.We start with a brief review of the features of the SCAD penalty in the context of variable selection in the following subsections.

Overview of variable selection by sparse regularizations
The regularization functions that provide the following properties to the resulting estimator are regarded as appropriate for variable selection [3].
• Unbiasedness: To avoid excessive modelling bias, the resulting estimator is nearly unbiased when the true unknown parameter is sufficiently large.
• Sparsity: The resulting estimator should obey a thresholding rule that sets small estimated coefficients to zero and reduces model complexity.
• Continuity: The resulting estimator should be continuous with respect to data in order to avoid instability in model prediction.
Table 1 summarizes the properties of the representative regularizations.The regression problem with q regularization is called bridge regression [6].A well-Regularization Unbiasedness Sparsity Continuity  known example for q > 1 is ridge regression associated with the 2 penalty [7].The resulting estimator under q (q > 1) has continuity, but it is not sparse; hence, q (q > 1) regularization is not appropriate for variable selection.1 regularization gives a continuous solution that obeys a soft thresholding rule, and it is widely used in various problems [2,8].However, it shrinks the estimator by λ.The q (q < 1) penalty gives a sparse and unbiased estimator, but the estimator is discontinuous with respect to data.

Smoothly clipped absolute deviation
Smoothly clipped absolute deviation (SCAD) is a type of nonconvex regularization that improves the q penalty to simultaneously achieve unbiasedness, sparsity, and continuity.SCAD regularization is defined as where λ and a(> 1) are parameters that control the form of SCAD regularization.
In particular, a contributes to the nonconvexity of the SCAD function.Figure 1 shows the dependence of SCAD regularization on a at λ = 1.In the range [−λ, λ], SCAD regularization behaves like 1 regularization.Above aλ and below −aλ, SCAD regularization corresponds to 0 regularization in the sense that the regularization has a constant value.These 1 and 0 regions are connected by a quadratic function, by which the estimator continuously shifts between the shrunken 1 -type estimator and the unbiased 0 -type estimator.At the limit a → ∞, the regions [λ, aλ] and [−aλ, −λ] become linear with gradients λ and −λ, respectively.Hence, SCAD regularization is reduced to 1 regularization at a → ∞.
SCAD regularization provides the oracle property to the estimator in the leastsquares problem [3,9,10].In other words, when the true solution exists, the zero components in the true parameters are estimated as 0 with probability tending to 1, and the nonzero components are estimated as well when the correct support, i.e. the position of the nonzero components, is known.The conditions for the oracle property, which are subject to the true solution and the predictor matrix, and the appropriate scaling of λ with respect to the data dimension are mathematically provided [3].

Our main contributions
In this paper, we focus on the linear regression problem in high-dimensional settings where target data y ∈ R M is approximated by using an overcomplete predictor matrix or basis matrix A ∈ R M ×N (M < N ) as y ∼ Ax.The regression coefficient x ∈ R N is to be estimated under a nonconvex sparse penalty.Among the given overcomplete basis matrix, a small combination of basis vectors is selected corresponding to the regularization parameters, which compactly and accurately represent the target data.The sparse representation under the overcomplete basis is an appropriate problem to check the variable selection ability of sparse regularization.
The organization and main contributions of this paper are summarized below: • We develop an efficient message passing algorithm, namely SCAD-AMP, for a SCAD regularized linear regression problem (Section 3.1), and we analytically show its local stability condition in Section 3.3.
• Through asymptotic analysis, the density evolution of the message passing algorithm is shown in Section 3.2.Further, the equivalence of the stability of AMP's fixed point and the replica symmetric solution is shown in Section 4. In addition, the validity of the asymptotic analysis is confirmed by numerical simulation.
• Based on the replica method from statistical mechanics, a phase transition between the replica symmetric (RS) and replica symmetry breaking (RSB) regions is found in the parameter space of the SCAD regularization (Section 5).In other words, the RS phase exists when the nonconvexity of the regularization is appropriately controlled.We also confirm that the conventional parameter setting a = 3.7 corresponds to the RS phase for sufficiently large λ.
• We show that the stability of the RS solution corresponds to that of AMP's fixed points.This means that the local stability of AMP is mathematically guaranteed in the RS phase.This correspondence holds regardless of the type of regularization.Hence, the result is valid for other regularizations where the RS/RSB transition appears (section 4.1).
• We evaluate the statistically optimal value of the error between y and Ax for the SCAD regularized linear regression problem.The error has a decreasing tendency as parameter a decreases and approaches the RS/RSB boundary, and it is the lowest at the edge of the RS/RSB phase.In addition, we analyze the parameter dependence of the density of the nonzero components in x in the RS phase.The sparsity is nearly controlled by λ, but as a decreases, the number of nonzero components slightly increases.1 regularization corresponds to a → ∞ of SCAD regularization; hence, SCAD typically provides a more accurate and sparser expression compared with 1 in the RS phase (section 5).
• It is numerically shown that the RS/RSB transition point corresponds to the limit that the coordinate descent algorithm reaches the globally stable solution (Section 6).
• Our analysis shows that the transient region of SCAD estimates between the 1type and the 0 -type contributes to the occurrence of the RSB transition (Section 4.1).

Problem settings
In this paper, we analyze a linear regression problem with SCAD regularization, which is formulated as where y ∈ R M and A ∈ R M ×N are the given data and predictor matrix or basis matrix, respectively.Here, we consider the case in which the generative model of y does not contain a true sparse expression; hence, (2) corresponds to a compression problem of the given data y under the basis A, rather than the reconstruction of the true signal.We assume that each component of y and A is independently and identically distributed (i.i.d) according to Gaussian distributions.More specifically, their joint distribution is given by P y,A (y, A) = P y (y)P A (A) In conventional analysis by statistical mechanics for linear systems, such as compressed sensing problems, the elements of the matrix A are often assumed with zero mean and variance 1/N .We set the variance to 1/M in order to match the conventional assumptions in the statistics literature, where µ A 2 µi = 1 holds for all i.The coefficient 1/2 of (2) is introduced for mathematical convenience, and we denote the function to be minimized divided by M as which corresponds to the energy density.When the regularization is given by 1 norm, the problem is known as the LASSO [2].SCAD regularization induces zero components into the estimated variable x corresponding to parameters λ and a.We set λ and a as O(1) regardless of the system size.Further, we introduce the posterior distribution of the estimate with a parameter β: where Z β (y, A) is the normalization constant.The limit β → ∞ leads to the uniform distribution over the minimizers of (2).The estimate of the solution of (2) under a fixed set of {y, A}, denoted by x(y, A), is given by where • β denotes the expectation according to (7) at β.We focus on the overcomplete basis, where the number of column vectors is greater than the dimension of the data, i.e.N > M , with the compression ratio α = M/ N (< 1).We do not impose orthogonality on the basis vectors.In general, the overcomplete basis gives an infinitely large number of solutions, but by adding the cost functions that promote sparsity, a more informative representation than that under the orthogonal basis is potentially obtained.The sparse representation under the overcomplete basis is an appropriate problem to check the variable selection ability of the sparse regularization.

Approximate message passing algorithm
Exact computation of the expectation in (8) requires exponential time and is thus intractable [11].We develop an approximate algorithm for (8) following the framework of belief propagation (BP) or message passing [12,13,14,15].BP has been developed for problems with sparse regularizations, such as compressed sensing with linear measurements [16] and LASSO [17,18], exhibiting high reconstruction accuracy and computational efficiency.To incorporate the nonconvex penalty, we employ a variant of BP known as generalized approximate message passing (GAMP) [19].In the following, we first describe the details of the derivation of the SCAD-AMP algorithm and then discuss the asymptotic analysis.Stability analysis, which is an essential and a common concern of algorithms developed for nonconvex regularizations, will be discussed at the end of the section.

Derivation of SCAD-AMP
We explain the algorithm from general form to specific form for the SCAD penalty.Further, we clarify the required assumptions on the elements of matrix A in each stage of simplification.

Stage 1: General form of belief propagation
We graphically denote the probability system by two types of nodes and connect them by an edge when they are related.The conditional probability of y µ depends on all of x 1 , x 2 , , • • • , x N , implying that the posterior distribution ( 7) can be expressed as a (dense) complete bipartite graph, as shown in Figure 2. Let us define a general constraint/penalty/prior distribution for the variable nodes as P penalty (x), and a probability distribution for the output function notes as P (y|Ax).The canonical BP equations for the probability P (x|A, y) are generally expressed in terms of 2M × N messages, m i→µ (x i ) and which represent probability distribution functions that carry posterior information and output information, respectively.The messages are written as where, Z µ→i and Z i→µ are normalization constants that satisfy dx i m µ→i The means and variances of x i under the posterior information message distributions are defined by We also define for notational convenience.Using (9), we evaluate the approximation of marginal distributions (beliefs) as where Z i is a normalization constant for dx i m i (x i ) = 1.We denote the means and variances of the beliefs as a i and ν i , respectively, which are given by The mean a = (a i ) represents the approximation of the posterior mean x(y, A).
The coupled integral equations ( 9) (10) for the messages are too complicated to be of any practical use.However, in the large N, M limit, when the matrix elements A µi scale as 1/ √ M (or 1/ √ N ), these equations can be simplified.We derive algebraic equations corresponding to ( 9) and ( 10) by using sets of a i→µ and ν i→µ .For this purpose, we define u µ ≡ (Ax) µ and insert the identity into (9), which yields We truncate the Taylor series of exp{iû µ A µj x j } for j = i up to the second order of A µj .By integrating dx j m j→µ (x j ) (. ..) for j = i and carrying out the resulting Gaussian integral of ûµ , we obtain We again truncate the Taylor series of the exponential in (20) up to the second order on the basis of the smallness of A µi .Consequently, a parameterized expression of m µ→i (x i ) is derived as where Zµ→i = 2π 2A µ→i .The parameters A µ→i and B µ→i are evaluated as using The relevant derivations are given in the appendix of [20].Equations ( 22) and ( 23) act as algebraic expressions of ( 9).Note that the form of g out and g out only depends on the output distribution P (y µ |u µ ).We will give the specific expressions of g out and g out for linear regression later.
A similar expression for ( 10) is obtained by substituting the last expression of ( 21) into (10), which leads to where Zi→µ is a normalization constant.The expression of equation (26) indicates that γ =µ m γ→i (x i ) in ( 10) is expressed as a Gaussian distribution with mean R i→µ and variance Σ 2 i→µ , given by We define an auxiliary distribution of x as and the mean and variance over this distribution are defined as the functions Note that the form of f a and f c depends only on the distribution P penalty (x).For a general penalty distribution of x, the functions f a and f c are computed by numerical integration over x.In special cases, such as the Bernoulli Gaussian or l 1 penalty, these functions are easily computed analytically [20], as in the case of the SCAD penalty.We will give the specific analytic form of f a , f c for the SCAD penalty later.
In (29), Ẑ(Σ 2 , R) is a normalization constant for the distribution.Note that By referring to the definitions of a i→µ and ν i→µ in ( 11) and ( 12), we provide the closed form of BP update as The moments of m i (x i ) are evaluated by adding the µ-dependent part to (34) and ( 35) as where Equations ( 11) and (12) together with ( 22), (23), and (26) lead to closed iterative message passing equations.These equations can be used for any data vector y and matrix A. We consider the case in which the matrix A is not sparse ( i A µi x i consists of order N nonzero terms), and each element of the matrix scales as Based on this fact, the use of means and variances instead of the canonical BP messages is exact in the large N limit.
Stage 2: TAP form of the general message passing algorithm BP updates 2M × N messages using ( 22), ( 23), (34), and (35 per iteration, which limits the practical applicability of BP to systems of relatively small size.In fact, it is possible to rewrite the BP equations in terms of M + N messages instead of 2M × N messages under the assumption that matrix A is not sparse and all its elements scale as In statistical physics, BP with reduced messages corresponds to the Thouless-Anderson-Palmer equations (TAP) [21] in the study of spin glasses.For large N , the TAP form is equivalent to the BP equations, and it is called the AMP algorithm [16] in compressed sensing.This form will result in a significant reduction of computational complexity to O(M × N ) per iteration, which enhances the practical applicability of message passing.First, we ignore A 2 µi ν i→µ in (20) for a sufficiently large system, because A 2 µi vanishes as O(M −1 ) while ν i→µ ∼ O(1).Then, we obtain which can be used in Σ 2 i and R i : In the large N limit, it is clear from (34) and ( 35) that the messages a i→µ and ν i→µ are nearly independent of µ.However, one must be careful to keep the correcting terms, which are referred to as the "Onsager reaction terms" [21,22] in the spin glass literature.We express a i→µ by applying Taylor's expansion to (34) around R i as where ).According to definition (13), multiplying (44) by A µi and summing the resultant expressions over i yields where we have used µi , all the correction terms are negligible in the N → ∞ limit.Therefore, we have The general TAP form of the message passing or general approximated message passing algorithm is summarized in Figure 3.

Stage 3: Further simplification for basis matrix with random entries
For special cases of random matrix A, the TAP equations can be simplified further.For homogeneous matrix A with i.i.d random entries of zero mean and variance 1/M (the distribution can be anything as long as the mean and variance are fixed), the simplification can be understood as follows.Define V as the average of V µ with respect to different realisations of the matrix A. We replace µ g out (ω µ ) 5) Variances of output information message distributions : −1 6) Average of output information message distributions : i ) 8) Posterior variance : i ) 9) Iteration : Repeat from step 2 until convergence.
Figure 3. GAMP algorithm with the assumption that matrix A is not sparse and all its elements scale as The convergent vectors of a (t) , ν (t) , and ω (t) obtained in the previous loop are denoted by a * , ν * , and ω * , respectively.expectation M −1 from the law of large numbers.This replacement makes V µ for any µ equal to the average as where • • • denotes the average over A µi .The same argument can be repeated for all the terms that contain A µi .For sufficiently large N , Σ 2 i typically converges to a constant denoted by Σ 2 , independent of index i.This removal of site dependence, in conjunction with ( 22) and ( 23), yields By iterating (47), ( 48), (49), and we can solve the equations with only 2(M + N + 1) variables involved.

SCAD-AMP for linear regression
In this paper, we consider the case in which matrix A contains i.i.d random variables from a Gaussian distribution with zero mean and variance 1/M .We can employ the simplified version of the GAMP algorithm derived above.First, let us consider the output channel.In the linear regression problem, where we focus our analysis on the β → ∞ case.We re-scale Ṽ = βV and Σ2 = βΣ 2 such that Ṽ and Σ2 are O(1) in the β → ∞ limit.Inserting (53) into (40) and ( 41) gives (g out ) µ and (g out ) µ as respectively.Inserting (54) into (50) gives and inserting (55) into (48) gives Next, on the prior distribution side, by inserting SCAD penalty distribution into (29), in the limit of β → ∞, we obtain where and R is given by inserting (54) and ( 57) into (49), which gives At β → ∞, the function f a corresponds to the minimiser of φ as By solving the minimization problem, we obtain where Here, we have replaced f a (Σ 2 , R) and f c (Σ 2 , R) with f a ( Σ2 , R) and f c ( Σ2 , R), respectively, since there is no β dependence in the final form of (64) and (65).Figure 4 shows an example of f a as a function of R. In the regions I and III, f a behaves like the existence of regularization.Hence, this region contributes to the unbiasedness of the estimates under SCAD regularization.In region II, f a transits linearly between the 1 and 0 estimates.We will discuss the contribution of this region to the instability of AMP in connection with the replica method in Section 4.
We term the entire procedure as the approximate message passing for smoothly clipped absolute deviation (SCAD-AMP) algorithm.Figure 5 shows the pseudocode of SCAD-AMP.To improve the convergence property, employing an appropriate damping factor in conjunction with the normalization of |a| is valid.The most time-consuming parts of SCAD-AMP are the matrix-vector multiplications µ (g out ) µ A µi in (43) and i A µi a i in (45); hence, the computational complexity is O(N M ) per iteration.We note that a i in equation ( 49) and (g out ) µ V in equation ( 50) correspond to the Onsager reaction term in the spin glass literature [21,22].These terms effectively cancel out the self-feedback effects, thereby stabilising the convergence of SCAD-AMP.a seed : t ← 0 2) Counter increase : t ← t + 1 3) Mean of variances of posterior information message distributions : ) Iteration : Repeat from step 2 until convergence.
Figure 5. Pseudocode of SCAD-AMP with the assumption that matrix A contains i.i.d random variables with zero mean and variance 1/M .For the linear regression problem, the variance vector of output information message distributions is V + 1. Functions f a and f c for SCAD are (64) and (65), respectively.The convergent vectors of a (t) , ν (t) , and ω (t) obtained in the previous loop are denoted by a * , ν * , and ω * ,respectively. 1 is the N -dimensional vector whose entries are all unity.The quantities we are focusing on are calculated at AMP's fixed point.For example, we calculate the ratio of the nonzero components to the data dimension, called sparsity, as at AMP's fixed point.Figure 6 shows the dependence of ρ/α on λ for N = 200 and a = 5 by circle •.The results are averaged over 1000 samples of y and A. At a certain value of λ that depends on a, AMP does not converge to fixed points, for which the values of λ are denoted by vertical black dashed lines in Figure 6.Understanding and practically implementing AMP requires stability analysis of the fixed point.We note that the dashed magenta lines in Figure 6 are the results obtained by the replica method that describes the asymptotic behaviour of AMP, which is explained in Section 4.

Macroscopic analysis: density evolution of message passing
Density evolution is a framework for analyzing the dynamical behaviour of BP through a macroscopic distribution of messages [23].Here, we derive density evolution for linear regression with the SCAD penalty in the case where the matrix A and vector y have random entries that are i.i.d., with mean 0 and variances 1/M and σ 2 y , respectively.We start from the following expression of the quantity R (t) i : where the superscript (t) denotes the values at iteration step t.The variables {R

(t)
i } are random variables with respect to the distribution of the basis matrix elements A µi and the data elements y µ ; hence, they are regarded as Gaussian random variables from the central limit theorem.The mean of R (t) i is given by where • • • denotes the average with respect to A and y.The variance of R (t) i , defined as E (t) , is given by (R where we ignore the terms of O(1/ √ N ).Based on the statistical property of R where z is a random Gaussian variable with zero mean and unit variance, and Ẑ(t+1) i is the a normalization constant of the marginal distribution at step t + 1.
Equation (73) implies that the marginal distribution at step t can be simply expressed in terms of two parameters V (t) and E (t) at sufficiently large N .Finally, by referring to (30), (31), and the definition of E (72), the density evolution equations are derived as where . They describe how macroscopic parameters E and V evolve during the iterations of the BP algorithm.The density evolution equations are the same for the message passing and for the TAP equations, as the factors of O(1/N ) are ignored in the density evolution.The correspondence between density evolution and the replica symmetric solution is provided in Section 4.
In the current problem setting, the fixed point of the density evolution equation is unique within the range we observed for any system parameters α, a, and λ.

Microscopic stability of AMP
Even when the density evolutions (74) and (75) converge to a stationary state, the convergence of the microscopic variables updated in AMP, such as R (t) i→µ , is not guaranteed.We perform linear stability analysis of AMP's fixed points to examine their local stability, which is a necessary condition for the global stability.A similar analysis has been discussed in [24] for binary signals.Here, we provide a more general analysis.
Assume the update of R i→µ is linearized around a fixed point solution R i→µ , using equations ( 27) and (44) yields Therefore, where we have replaced V µ with the average V and assumed that M is large.Assuming that δR i→µ are uncorrelated, the summation including the basis matrix elements {A µi }, which are i.i.d.random numbers, on the right-hand side of (77) leads to a Gaussian random number because of the central limit theorem.Furthermore, the correlations of R (t+1) i→µ with respect to indices µ, i, and t are negligible, because the right-hand side of (77) does not include the indices µ and i.These facts make it possible to analyze the stability of the fixed point by observing the growth of the first and second moments of δR (t) i→µ through each update.
The first moment of δR i→µ is negligible at sufficiently large M and N , since the average of A µi is zero.The second moment of δR i→µ is given by (δR where we assume the self-averaging property, and • • • denotes the average over A µi .Further, we have replaced the sample average of products with the product of the sample average in the last step, which is valid when N is large.In addition, it can be expected that because of the self-averaging property, the macroscopic variable ( regardless of i, where V and E are fixed point values of V t and E t , respectively.For sufficiently large M and N , N −1 j =i (δR j→γ ) 2 coincides with (δR i→µ ) 2 itself, because their dependency on µ can be ignored.Therefore, the second moment is enlarged through the belief update, and the fixed point solution becomes locally unstable if We will show the correspondence of the AMP's local stability condition and the de Almeida-Thouless condition derived from the replica method in Section 4.

Replica analysis
The replica method provides performance of the optimal algorithm for problem (2).The asymptotic property of AMP's fixed point presented in the previous section can also be analytically derived independently using the replica method.The replica method provides the physical meanings of the density evolution and the stability of AMP.The basis for the analysis is the free energy density defined by which corresponds to the expectation of the minimum value of the energy e(x|y, A).
The expectation according to (3) is implemented by the replica method based on the following identity: which is employed for the analysis of sparse regularizations in various problems [25,26,27,28].We briefly summarize the replica analysis for the SCAD penalty.The detailed explanation is given in Appendix A. We focus on the N → ∞ and M → ∞ limits by keeping M/N = α ∼ O(1).Under the replica symmetric (RS) assumption, the free energy density is given by where extr Q,χ, Q, χ denotes extremization with respect to the variables {Q, χ, Q, χ}, which are given by at the extremum.The function ξ( Q, χ) is given by where √ χz is the random field that effectively represents the randomness introduced by y and A. It is shown that Q and χ relate to the physical quantities at the extremum as where the superscript T denotes matrix transpose.The solution of x concerned with the effective single-body problem (89), denoted by x * (z; Q, χ) for SCAD regularization, is given by where and the thresholds are given by θ 1 = λ/ √ 2 χ, θ 2 = λ( Q + 1)/ √ 2 χ, and θ 3 = aλ Q/ √ 2 χ, respectively.The behaviour of x * is the same as that shown in Figure 4. Using x * , we can represent the quantities Q and χ in the replica method as The solution x * is statistically equivalent to the solution of the original problem (2); hence, the density of the nonzero component is given by where erfc(θ ).The dependence of ρ/α on λ derived by the replica method is shown in Figure 6 by dashed lines.The results of the replica method match those of AMP for a sufficiently large system size, which is mathematically supported by considering the correspondence between the variables used in AMP and the replica method.
A comparison of f a in AMP (63) and x * in replica analysis (89) shows that they are equivalent to each other on the basis of the correspondences From equations ( 57) and (86), the correspondence (99) is equivalent to that between V in AMP and χ in replica analysis.This is consistent with the definition of V and the physical meaning of χ given by (91).Equation (100) implies that the distribution of R i /Σ 2 with respect to y and A is represented by a Gaussian distribution with variance χ.From the calculation explained in Appendix A, it is shown that χ corresponds to the representation error defined by which evaluates the accuracy of the expression of data using the estimated sparse expression.The variance of R i is defined as E; hence, This correspondence implies that density evolutions ( 74) and ( 75) can be regarded as recursively solving the saddle point equations χ and (Q + σ 2 y ), respectively, under RS assumption.

de Almeida-Thouless condition for replica symmetric phase
The RS solution discussed thus far loses local stability under perturbations that break the symmetry between replicas in a certain parameter region.This phenomenon is known as de Almeida-Thouless (AT) instability [29], and it occurs when holds [29], as explained in Appendix B. By applying this to the minimizer of the singlebody problem (92), we get the AT instability condition for the SCAD penalty as where γ = erfc(θ 2 )−erfc(θ 3 ).At a → ∞, the AT instability condition for SCAD reduces to that for 1 regularization: ρ/α > 1.The second term of (104) is the characteristic of SCAD regularization.By definition, γ/α is the probabilistic weight of the transit region (region II of Figure 4), and Q/( Q − (a − 1) −1 ) denotes the ratio between the variances in the transient region and those in other regions (equation ( 92)).Equation (104) implies that as the transient region is extended or as the variance of the estimates in the transient region becomes smaller, the RS solution is likely to be unstable.
Based on the correspondence between AMP and the RS saddle point discussed thus far, the AT instability condition (103) is coincident with the instability condition of density evolution given by (80).Similar correspondence has been shown in the BP algorithm for CDMA [24].We note that the correspondence between AMP's local stability and the RS/RSB transition holds regardless of the type of regularization.Hence, the result is valid for other regularizations where the RS/RSB transition appears.In the current problem, the RS saddle point equation has a unique solution; the first order transition does not occur.Therefore, the local stability of the AMP's fixed points indicates their global stability.

Phase diagram and representation error
Figure 7 shows the phase diagram on the λ−a plane for (a) α = 0.1 and (b) α = 0.5.The range of parameters that gives the RS phase shrinks as α decreases.The conventional value a = 3.7 [3], which is indicated by horizontal lines in Figure 7, is considered as an appropriate value of a to obtain RS stability at sufficiently large λ.As λ decreases, the value of a for achieving the RS phase diverges, which is consistent with the shape of SCAD regularization.
Figure 8 shows the parameter dependence of the sparsity ρ/α in the RS phase.The sparsity is nearly controlled by λ; as λ increases, the nonzero components in the estimate are enhanced.The sparsity slightly depends on a compared with λ, but as a decreases, the number of nonzero components increases.This fact is reasonable from the form of the regularization, because the gradient around the origin that is a key aspect of the sparsity is governed by λ.
As the number of nonzero components decreases, the representation performance of the data is generally degraded.There is a trade-off relationship between the simplicity and the accuracy of the expression.We quantify this relationship by regarding the representation error (101) as a function of sparsity ρ/α.Here, ρ/α corresponds to the ratio of the number of variables to be estimated to the number of known variables.Hence, ρ/α ≤ 1 is the physically meaningful region.As mentioned in Section 4 and explained in Appendix A, the representation error is given by χ in the replica method.Hence, we obtain the dependency of the representation error on the sparsity by solving (87) and (98).
Figure 9 shows the representation error as a function of the sparsity at α = 0.5 and α = 0.8.For comparison, the results obtained by 1 regularization, which corresponds .Relationship between the representation error and the normalized sparsity ρ/α for (a) α = 0.5 and (b) α = 0.8.LASSO corresponds to the limit a → ∞, and the shift of the error curve by decreasing a is indicated by arrows.The shaded regions are regions that are not achievable for any compression method [30].
to a → ∞, are shown by dashed lines, and the regions that are not achievable for any estimation method [30] are shaded.At each value of λ, we find the value of a that gives the smallest representation error.The shift of the representation error curve associated with the decrease in a is shown by arrows in Figure 9, where the minimum values of a are (a) a = 3, 6, and 20 for λ = 0.290, 0.614, and 1.02, respectively, and (b) a = 2.739, 6.51, and 25 for λ = 0.2, 0.5, and 1, respectively.The representation error monotonically decreases as a decreases.Hence, when we restrict the SCAD parameters to be within the replica symmetric region, the smallest representation error is obtained at the RS/RSB boundary.In addition, the sparsity slightly decreases as a decreases.Hence, the sparsest and most accurate expression is obtained by decreasing a to be on the RS/RSB boundary.

Correspondence between RS/RSB transition and convergence of coordinate descent: a conjecture
For the SCAD penalty, it is shown that the coordinate descent (CD) algorithm is valid in a certain parameter region [5].CD is the component-wise minimization of energy density (6) while all other components are fixed; it cycles through all parameters until convergence is reached.Let us denote the estimate at step t of CD as x(t) and define the residue at step 0 as r (0) = y − A x(0) .CD for SCAD is given by the cyclic update of one component of the estimate x according to the following equations [5]: x(t+1) where S is the soft-thresholding function defined by and A j denotes the j-th column vector of A. A sufficient condition for the convergence of coordinate descent to the globally stable state is proposed as [5] a > 1 + c(λ, a; A) −1 .
Here, c(λ, a; A) is the minimum eigenvalue of the Gram matrix A T S(λ,a) A S(λ,a) , where A S denotes the submatrix of A consists of columns in the set S, S(λ, a) is the union of the support at λ and the index of a component in x, which will become nonzero at λ − dλ.The infinitesimal quantity dλ(> 0) is set such that the difference between the numbers of supports at λ and λ − dλ becomes 1.However, the necessary and sufficient condition for the convergence of CD is not known.Here, we suggest that the convergence conditions of CD and AMP are equivalent from numerical observation.
To check the convergence of CD, we prepare m random initial conditions.The fixed point of CD starts from the k-th initial condition under the fixed data y and predictor matrix A is denoted as xCD k (y, A).We compute the average of the differences between the fixed points as The uniqueness of the stable solution for CD is indicated by d(y, A) = 0. We define a * (y, A; λ) as the minimum value of a that satisfies d(y, A) = 0 for each λ.In other words, at a < a * (y, A; λ), CD does not always attain the global minimum.In Figure 10, the averaged value of a * (y, A; λ), which indicates a typical solvable limit, over 100 samples of i.As shown in Figure 10, RS/RSB transition is an approval indicator of the convergence of CD, although complete correspondence between them has not been mathematically proved.From the physical consideration, RS/RSB transition is supposed to correspond to the appearance of the local minima, whose number is an exponential order of the system size.This view of the RS/RSB transition is consistent with the result shown in Figure 10.

Summary and conclusion
We studied a linear regression problem under SCAD regularization.We developed AMP for SCAD regularization and identified the stability condition of AMP's fixed points.This stability condition corresponds to the AT condition derived by the replica method, and the correspondence between the stability of AMP and that of the RS solution was indicated.The stability analysis does not depend on the form of the regularization; hence, the correspondence holds for other regularizations that exhibit RS/RSB transition.In addition, we identified the replica symmetric phase on the λ − a parameter plane, and quantified the relationship between the representation error and the sparsity.Furthermore, we theoretically showed that, for i.i.d.Gaussian data and a basis matrix, SCAD regularization typically provides a more accurate and sparser expression compared with 1 regularization in the RS phase.In addition, the smallest representation error is obtained at the RS/RSB boundary for each λ.
Our result not only shows that the nonconvexity of the sparse regularization does not always result in instability of the replica symmetry but also highlights the potential of message passing algorithms for nonconvex regularizations.Because the unclear condition of global stability is the main concern for nonconvex sparse regularization in practice, identifying and using the RS region of other nonconvex sparse regularizations is a topic of interest for future work.
In this paper, we focused on the case in which data y and predictor matrix A are i.i.d.Gaussian random variables.Our discussion is applicable to non-Gaussian A and y that satisfy the following conditions: the correlation between components is negligible, each component of A has zero mean and variance 1/M , and each component of y has zero mean and variance σ 2 y .However, in dealing with general data, a nonzero mean might be considered.Our algorithm is applicable to such cases by introducing a simple modification that leads to a zero-mean matrix and vector as y − ȳ ∼ (A − Ā)x, where ȳ ≡ ( 1 M µ y µ )1, 1 denotes the M -dimensional identity vector, and Ā is the M × N matrix where the i-th column is given by the values Āi ≡ (1/M ) µ A µi .For a more general case where the correlation between the components of A and y is not negligible, the convergence of AMP and the validity of the approximations introduced into AMP are interesting problems to be resolved.The incorporation of nonconvex regularization into variants of AMP [31,32] where the convergence is improved merits further discussion.with an arbitrary function g(w).When q 0 = Q and χ0 = χ1 , the 1RSB solution is reduced to the RS solution.Hence, the stability of the RS solution is examined by the stability analysis of the 1RSB solution of Q = q 0 and χ1 = χ0 .We define ∆ = Q − q 0 and ∆ = χ1 − χ0 , and we apply Taylor expansion to them by assuming that they are sufficiently small.From (B.This is the de Almeida-Thouless condition for the current problem.

Figure 4 .
Figure 4. Example of f a as a function of R. The dashed diagonal line represents the behaviour of the estimate when the regularization term does not exist.

Figure 6 .
Figure 6.Dependence of ρ/α on λ at SCAD-AMP's fixed point for a = 5 with (a) α = 0.1 and (b) α = 0.5.Circles are obtained by the SCAD-AMP algorithm.The dashed magenta lines represent the result derived by the replica method, and SCAD-AMP does not converge in the left-hand side region of the black dashed lines.

Figure 8 .Figure 9
Figure 8. Parameter dependence of the sparsity at (a) α = 0.1 and (b) α = 0.5 in the RS phase.The figures were created by presenting the elements of matrix ρ(a, λ)/α in a colour map, resulting in block-like shapes.