
Statistical mechanical analysis of sparse linear regression as a variable selection problem


Published 19 October 2018 © 2018 The Author(s). Published by IOP Publishing Ltd on behalf of SISSA Medialab srl
Citation: Tomoyuki Obuchi et al J. Stat. Mech. (2018) 103401. DOI: 10.1088/1742-5468/aae02c


Abstract

An algorithmic limit of compressed sensing and related variable-selection problems is analytically evaluated when the design matrix is given by an overcomplete random matrix. The replica method from statistical mechanics is employed to derive the result. The analysis is conducted through evaluation of the entropy, the exponential rate of the number of combinations of variables giving a specific value of fit error to given data, which are assumed to be generated from a linear process using the design matrix. This yields the typical achievable limit of the fit error when solving a representative problem and reveals the presence of unfavourable phase transitions preventing local search algorithms from reaching the minimum-error configuration. The associated phase diagrams are presented. A noteworthy outcome of the phase diagrams is that there exists a wide parameter region where no phase transition occurs from the high-temperature limit down to the lowest temperature at which the minimum-error configuration, or the ground state, is reached. This implies that certain local search algorithms can find the ground state with moderate computational costs in that region. Another noteworthy result is the presence of a random first-order transition in the strong-noise case. The theoretical evaluation of the entropy is confirmed by extensive numerical experiments using the exchange Monte Carlo and the multi-histogram methods. Another numerical test based on a metaheuristic optimisation algorithm called simulated annealing is conducted; it supports the theoretical predictions on the local search algorithms well. In the successful region with no phase transition, the computational cost of the simulated annealing to reach the ground state is estimated to scale as the third power of the model dimensionality.


Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Compressed sensing is a technique used to recover a high-dimensional signal from a limited number of measurements by exploiting the fact that the signal of interest has redundancy and can thus be 'sparse': many of its coefficients are zero when it is described with an appropriate basis. This technique has a long history [1–3], but it has recently attracted increased attention as its high performance has been demonstrated in recent influential papers [4–8]. There has since been a surge of research on compressed sensing, based on the general idea that the signal has a sparse representation on an appropriate basis; this idea is shared by many other contexts such as data compression, multivariate regression, and variable selection. This trend has triggered a major evolution in techniques of signal and information processing, which is gradually forming a new framework called 'sparse modelling' [9–12].

For clarity, we provide a concise mathematical form to the problem treated here. Suppose a data vector $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\mR}{\mathbb{R}} \V{y}\in \mR^{M}$ is generated by the following linear process with a design matrix $ \newcommand{\mR}{\mathbb{R}} A\in \mR^{M\times N}$ and a signal vector $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\mR}{\mathbb{R}} \V{x}_0 \in \mR^{N}$ such that:

$\boldsymbol{y} = A\boldsymbol{x}_0 + \boldsymbol{\xi}, \qquad (1)$

where $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{\xi}$ is a noise vector, the components of which are assumed to be independent and identically distributed (i.i.d.) variables drawn from the normal distribution with zero mean and variance $\sigma_{\xi}^2$ , $\mathcal{N}(0, \sigma_{\xi}^2)$ . In the context of compressed sensing, the design matrix A represents the measurement process, and given A and $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}$ , we try to infer $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ in the situation M  <  N. This is an underdetermined problem, and the sparsity assumption that the number of nonzero components of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ is smaller than M is needed to solve it. With this assumption, the perfect reconstruction of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ is possible if the noise is absent. The most naive algorithm to achieve this is the exhaustive search, which selects the sparsest set of variables among the error-free ones. This is clearly infeasible if the model dimensionality N is large, and more efficient algorithms or approximations must be tailored. A common approximation is to relax the sparsity constraint. Many studies have been conducted along this direction, and some theoretical studies demonstrated that the perfect reconstruction of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ is possible under reasonable conditions even under such a relaxation [13–15]. Associated efficient algorithms achieving perfect reconstruction in the noiseless case have been developed [16–19].

Some degree of compromise such as the relaxation above appears to be unavoidable, because variable selection in the present problem is NP-hard in the worst case [20]. However, more recent works suggest that, even without such relaxation, variable selection can be achieved at reasonable expense in 'typical' cases [21, 22], where the components of the design matrix are assumed to be i.i.d. from the normal distribution. Their formulation is based on the Bayesian framework and assumes that the signal's generative process is known. The algorithmic limit was computed by using non-rigorous statistical mechanical techniques, an associated message-passing algorithm was developed and shown to achieve the limit, and those results have since been supported on a firmer mathematical basis [23].

However, the Bayesian framework is not always preferred in the context of signal processing. This is because the signal's generative process is not necessarily evident and is difficult to model in many practical situations. In such cases, it may be better to focus less on the signal sources and rely on a less informative prior. Following this idea, in the context of data compression, the present authors recently proposed a variable selection criterion based on the following widely used optimisation formulation [24]:

$\hat{\boldsymbol{x}}(K) = \mathop{\rm arg\,min}_{\boldsymbol{x}} ||\boldsymbol{y}-A\boldsymbol{x}||_2^2 \quad {\rm subject~to~} ||\boldsymbol{x}||_0 \leqslant K, \qquad (2)$

where $ \newcommand{\V}[1]{\boldsymbol{#1}} ||\V{x}||_k=(\sum_{i}|x_i|^k){}^{1/k}$ denotes the $ \newcommand{\e}{{\rm e}} \ell_k$ norm and the $ \newcommand{\e}{{\rm e}} \ell_0$ norm $ \newcommand{\V}[1]{\boldsymbol{#1}} ||\V{x} ||_0$ is assumed to give the number of nonzero components of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ . Our basic idea in [24] is to compare all variable sets of size K and to organise them as an ensemble in the statistical mechanical sense by regarding the fit error as energy. Our statistical mechanical analysis, again non-rigorous and performed under the same assumption on the design matrix as [21, 22], showed that the configuration space of the variable set is rather 'smooth', implying that certain local search algorithms can efficiently find the minimum-error variable set. Based on this finding, we developed an algorithm based on the so-called simulated annealing (SA) algorithm [25], which is a Monte Carlo (MC)-based optimisation solver, and demonstrated that it can efficiently find the minimum-error set for a wide range of parameters [26, 27]. These results again suggest that variable selection in the present setting can be efficiently achieved, even without relaxation or resorting to the Bayesian framework.

The success of the MC based method further motivates us to analyse the property of the ensemble of the variable set in detail. The previous analysis [24] was limited to the data compression context, and another analysis directly relevant to compressed sensing or multivariate regression is desirable. The present paper addresses this point. The main difference from [24] is the presence of the true signal $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ in equation (1). This introduces other criteria on the reconstructed signal such as the prediction ability and the error to the true signal. We provide a quantitative analysis for these issues.

The remainder of the paper is organised as follows. In section 2, we state the problem setting and the formulation which we employ in this paper. The meaning of the formulation in relation to the Bayesian framework is also explained. In section 3, we provide the analytical solution of the fit error and related quantities derived by the statistical mechanical formulation. In section 4, we present the results of numerical experiments using a careful MC method to support our analytical computations. The performance of the SA algorithm for finding the minimum-error set is also revisited. The final section concludes the study.

2. Problem and formulation

2.1. Problem setting and notation

As noted in section 1, the data are assumed to be generated by the linear process (1). We assume the true signal $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ is K0-sparse and K0 is less than M: $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\nnzero}{K} ||\V{x}_0||_0=\nnzero_0<M$ . For notational convenience, we introduce the $ \newcommand{\e}{{\rm e}} \ell_0$ operator or support operator of a vector $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ as $ \newcommand{\V}[1]{\boldsymbol{#1}} |\V{x}|_{0}$ , which results in a binary vector whose components are $ \newcommand{\V}[1]{\boldsymbol{#1}} (|\V{x}|_{0})_i=1$ if $x_i\neq 0$ , and $ \newcommand{\V}[1]{\boldsymbol{#1}} (|\V{x}|_{0})_i=0$ otherwise. The support of the true signal is represented by a support vector $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\e}{{\rm e}} \V{c}_0\equiv |\V{x}_0|_0$ .

We are interested in the fit quality to the data $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}$ for a given set of variables described by a support vector $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}\in\{0, 1\}^N$ , on a linear model basis. The fit quality is quantified by a mean squared error (MSE) with respect to $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}$ , where the coefficients of the chosen variables are optimised to describe $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}$ . The optimised coefficients are written in the following form:

$\hat{\boldsymbol{x}}(\boldsymbol{c}) = \mathop{\rm arg\,min}_{\boldsymbol{x}}\, ||\boldsymbol{y}-A(\boldsymbol{c}\circ \boldsymbol{x})||_2^2, \qquad (3)$

where $ \newcommand{\V}[1]{\boldsymbol{#1}} (\V{c}\circ \V{x})_i=c_i x_i$ represents the Hadamard product. To eliminate an ambiguity in equation (3), the coefficients of variables out of the support are set to zero, $ \newcommand{\V}[1]{\boldsymbol{#1}} c_i=0 \Rightarrow \hat{x}_i(\V{c})=0$ , to provide consistency with the support operator: $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}=|\hat{\V{x}}(\V{c})|_0$ . We denote the corresponding MSE with $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}$ by

Equation (4)

This is called the output MSE throughout this paper. In addition to the output MSE, we are interested in the MSE with the true signal defined by

Equation (5)

Hereafter this is termed the input MSE, and the hat symbol is assumed to represent an estimator of the corresponding quantity.
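The following Python sketch makes these definitions concrete: the coefficients on a given support are obtained by ordinary least squares and the off-support coefficients are set to zero, consistent with equation (3). The normalisation of the MSEs below (a plain average of squared residuals, with no extra constant factor) is an assumption of the sketch; equations (4) and (5) fix the actual convention.

import numpy as np

def fit_support(c, y, A):
    # Least-squares coefficients restricted to the support c (cf. equation (3)).
    x_hat = np.zeros(A.shape[1])
    idx = np.flatnonzero(c)
    if idx.size > 0:
        # solve min_x ||y - A[:, idx] x||_2^2; off-support entries stay zero
        x_hat[idx], *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
    return x_hat

def output_mse(c, y, A):
    # Fit error to the data y (output MSE, cf. equation (4)).
    r = y - A @ fit_support(c, y, A)
    return np.mean(r ** 2)

def input_mse(c, y, A, x0):
    # Error with respect to the true signal x0 (input MSE, cf. equation (5)).
    d = fit_support(c, y, A) - x0
    return np.mean(d ** 2)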

The primary object of our investigation is the histogram of $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\e}{{\rm e}} \newcommand{\MSEy}{\epsilon_{y}} \MSEy(\V{c}|\V{y}, A)$ when changing the set of variables $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ . There are two reasons for evaluating this quantity. One is related to the reconstruction of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ . In this context, lower values of the output MSE are not always preferable and we need more global information regarding the set of variables to obtain a good solution. The other involves possible algorithmic implications. To find a small output-MSE configuration, we usually conduct a local recursive search from certain initial conditions. The performance of such local search algorithms is strongly affected by the structure of the configuration space of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ . The histogram of $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\e}{{\rm e}} \newcommand{\MSEy}{\epsilon_{y}} \MSEy(\V{c}|\V{y}, A)$ provides the necessary information about the structure.

More specifically, we evaluate the exponential rate of the histogram in the large size limit $N\to \infty$ while keeping $\alpha=M/N$ and $ \newcommand{\nnzero}{K} \rho=\nnzero/N=\sum_{i}c_i/N$ finite. This quantity is simply the statistical mechanical entropy defined by

Equation (6)

Entropies or similar thermodynamic functions associated with certain optimisation problems have provided algorithmic implications and benefits in several contexts such as information theory, computer science, and neural networks [28–39]. We will see below that this strategy also works well in the present problem.
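For very small instances, the histogram behind this entropy can be evaluated literally by enumerating every support of size K = Nρ and recording its output MSE. The sketch below does exactly that; it assumes the output_mse helper from the previous snippet and is only feasible for small N, since the number of supports grows combinatorially.

from itertools import combinations
import numpy as np

def mse_histogram(y, A, K, bins=50):
    # Output MSE of every support of size K (brute force; small N only).
    N = A.shape[1]
    errors = []
    for idx in combinations(range(N), K):
        c = np.zeros(N, dtype=int)
        c[list(idx)] = 1
        errors.append(output_mse(c, y, A))
    counts, edges = np.histogram(errors, bins=bins)
    return counts, edges   # (1/N) log counts approximates the entropy s(eps_y)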

2.2. Outline of analysis

We outline the analysis of the entropy. Evaluation of the entropy is replaced with an assessment of a generating function that is a Legendre transform of the entropy. Assumptions to enable this assessment are stated as well.

2.2.1. Generating function: Legendre transform of entropy.

Direct computation of the entropy is not easy, and a more systematic method of evaluation is available using a Legendre transform of the entropy, which we call free entropy throughout this study [40].

We introduce a partition function $ \newcommand{\V}[1]{\boldsymbol{#1}} G(\mu|\rho, \V{y}, A)$ as

Equation (7)

where $\delta(\cdot)$ is the delta function. The free entropy is represented by the exponential rate of G. Considering the definition of the entropy and assuming that the saddle-point method is applicable in the large N limit, we can easily see that the following relation holds

Equation (8)

where $ \newcommand{\supp}[1]{{\rm supp}({#1})} \supp{f(x)}$ denotes the support of the function $f(x)$ . Hence g and s are the Legendre transforms of each other. If $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} s(\MSEp)$ is concave with respect to $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ , then g and s are in one-to-one correspondence and are connected by the control parameter $\mu(\geqslant 0)$ . This control parameter plays the role of an 'inverse temperature' in physics. The maximiser in equation (8) and the corresponding entropy, $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp(\mu|\rho, \V{y}, A)$ and $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} s(\mu|\rho, \V{y}, A)=s(\MSEp(\mu|\rho, \V{y}, A)|\rho, \V{y}, A)$ , are thus parameterised as

Equation (9a)

Equation (9b)

Employing these relations, we can easily handle the dominant output MSE and determine the corresponding entropy from g.
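As a small numerical illustration of equations (8) and (9): given any concave entropy curve on a grid, the free entropy follows from a pointwise maximisation, and the maximiser traces out the dominant output MSE as a function of μ. The energy rate μαε_y used below corresponds to the Boltzmann weight e^{-Mμε_y} appearing in the Monte Carlo acceptance probability of section 4.1; the toy entropy curve is purely illustrative and is not the solution of the present problem.

import numpy as np

def legendre_transform(eps, s, mu, alpha):
    # Evaluate g(mu) = max_eps [ s(eps) - mu * alpha * eps ] on a grid,
    # together with the maximiser eps(mu) and the entropy s(mu).
    objective = s - mu * alpha * eps
    k = np.argmax(objective)
    return objective[k], eps[k], s[k]

# toy concave entropy curve on a grid (illustration only)
eps = np.linspace(1e-4, 1.0, 2001)
s_toy = 0.3 * np.sqrt(eps) * (1.0 - eps)
for mu in (0.1, 1.0, 10.0):
    g, eps_star, s_star = legendre_transform(eps, s_toy, mu, alpha=0.5)
    print(f"mu={mu:5.1f}  g={g:.4f}  eps_y(mu)={eps_star:.4f}  s(mu)={s_star:.4f}")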

However, there are two difficulties in the evaluation of g. One is the dependence on $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}$ and A, and the other is the presence of the least squares problem in the definition of $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\e}{{\rm e}} \newcommand{\MSEy}{\epsilon_{y}} \MSEy(\V{c})$ .

The first problem is overcome as follows. The free entropy g is a self-averaging quantity, as is the entropy. Typical values of self-averaging quantities are in accordance with their averaged values in the large N limit. This means that we can calculate the averaged value of g over A and $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}=A\V{x}_0+\V{\xi}$ instead of directly considering the entropy's dependence on those quantities. We denote the average over $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0, \V{\xi}$ and A by square brackets with appropriate subscripts as $ \newcommand{\V}[1]{\boldsymbol{#1}} [\cdots ]_{\V{x}_0, \V{\xi}, A}$ . Unfortunately, this average is not easy to calculate. The so-called replica method is a great aid in such a situation, and is symbolised by the following identity

$\left[\log G(\mu|\rho, \boldsymbol{y}, A)\right]_{\boldsymbol{x}_0, \boldsymbol{\xi}, A} = \lim_{n\to 0}\frac{1}{n}\log \left[G^{n}(\mu|\rho, \boldsymbol{y}, A)\right]_{\boldsymbol{x}_0, \boldsymbol{\xi}, A}. \qquad (10)$

In addition to this identity, we assume that n is a positive integer. This assumption enables us to compute the average $ \newcommand{\V}[1]{\boldsymbol{#1}} [\cdots ]_{\V{x}_0, \V{\xi}, A}$ . After computing this average, we take the limit $n\to 0$ by employing the analytical continuation from $n \in \mathbb{N}$ to $n \in \mathbb{R}$ under the so-called replica symmetric (RS) or the replica symmetry breaking (RSB) ansatz, which will be explained later.

The second problem is solved by introducing a variable β and taking a limit as follows

Equation (11)

where

Equation (12)

where the factor $ \newcommand{\lb}{\left(} \newcommand{\rb}{\right)} \newcommand{\lbb}{\left\{} \newcommand{\rbb}{\right\}} \prod_{i}\lbb (1-c_i)\delta(x_i) + c_i \rbb$ is introduced to make the integral well-defined and can be regarded as a prior for $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ . In the limit $\beta \to \infty$ , only the contribution corresponding to the solution of the least squares problem in equation (3) survives the integration, and $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\mc}[1]{\mathcal{#1}} \newcommand{\e}{{\rm e}} \newcommand{\MSEy}{\epsilon_{y}} \mc{H}(\V{c}|\beta, \V{y}, \V{A}) \to M \MSEy(\V{c}|\V{y}, \V{A})$ . We further assume that $\nu=\mu/\beta$ is a positive integer as well as n in equation (10). This enables us to treat in parallel the summation over $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ and the integration over $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ , as well as the average $ \newcommand{\V}[1]{\boldsymbol{#1}} [\cdots ]_{\V{x}_0, \V{\xi}, A}$ . The limit $\beta\to \infty \Leftrightarrow \nu \to 0$ is taken after those operations through the analytic continuation.

These operations can be summarised in a line

Equation (13)

with abbreviations $ {\rm Tr}_{\boldsymbol{c}}=\left( \prod_{i=1}^{N}\sum_{c_i=0, 1} \right) \delta\left( \sum_{i}c_i-N\rho \right) $ and $ {\rm Tr}_{\boldsymbol{x}|\boldsymbol{c}}=\int{\rm d}\boldsymbol{x} \prod_{i}\left\{ (1-c_i)\delta(x_i) + c_i \right\}$ . Overall, to calculate the entropy, we assess the free entropy g. The averages over $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0, \V{\xi}, A$ are taken through the replica method with a replica number n, and the internal variables $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ are integrated with an additional replica number ν.

2.2.2. Assumptions for theoretical computation.

The description above is generic, but for technical reasons we need additional assumptions to complete the computation. The most crucial assumption is applied to the distribution of A: each component of A is assumed to be i.i.d. from $ \newcommand{\mc}[1]{\mathcal{#1}} \mc{N}(0, 1/N)$ . Relaxing this assumption, i.e. introducing correlations between components, makes the analysis much more complicated. Admittedly, this assumption is not necessarily realistic. However, the purpose of this paper is to provide an analytical basis to understand the variable-selection performance, and we consider that the random-matrix assumption can provide sufficiently nontrivial implications for this purpose.

Thus assuming the absence of correlations among components of A, we note that only the average behaviour of the signal components is relevant. According to this observation, without loss of generality, we may assume a factorised prior of the signal vector $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ as

Equation (14)

where $ \newcommand{\nnzero}{K} \rho_0=\nnzero_0/N$ is the density of nonzero components and P0(x) is a prior distribution for the nonzero component. Our theoretical computation can be performed for any prior P0(x) having no probability mass at x  =  0, and we keep it unspecified for a while.
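For concreteness, the following sketch generates an instance $(\boldsymbol{x}_0, A, \boldsymbol{y})$ under these assumptions: the entries of A are i.i.d. from $\mathcal{N}(0, 1/N)$, the nonzero signal components are drawn from the Gaussian prior that is eventually adopted in section 3.3 (with $\rho_0\sigma_x^2=1$), and the noise follows equation (1). The sizes and noise level below are arbitrary choices of the example.

import numpy as np

rng = np.random.default_rng(0)

N, M, K0 = 1000, 500, 200             # N, M = alpha*N, K0 = rho0*N
sigma_xi = 0.1                         # noise standard deviation

# design matrix: entries i.i.d. from N(0, 1/N)
A = rng.normal(0.0, 1.0 / np.sqrt(N), size=(M, N))

# K0-sparse true signal; nonzero components from N(0, 1/rho0) so that rho0*sigma_x^2 = 1
rho0 = K0 / N
x0 = np.zeros(N)
support = rng.choice(N, size=K0, replace=False)
x0[support] = rng.normal(0.0, np.sqrt(1.0 / rho0), size=K0)
c0 = (x0 != 0).astype(int)             # support vector c0 = |x0|_0

# data generated by the linear process (1)
y = A @ x0 + rng.normal(0.0, sigma_xi, size=M)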

Usually, the entropy function $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} s(\MSEp)$ enjoys some useful properties such as non-negativity, boundedness, bounded support, concavity, and analyticity. We assume these properties, but as shown below, the concavity and analyticity are partially broken in the present problem. This causes problems in evaluating the parametric form (9), but they can be bypassed by some additional considerations when conducting the saddle-point method. The analyticity breaking of the entropy is actually related to algorithmic performances and is one of the central issues discussed in this paper.

2.2.3. Probabilistic meaning and intrinsic hierarchy of the problem.

Our formulation has a probabilistic meaning which involves two different intrinsic hierarchies in the present problem. We can define the distribution for $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ as

Equation (15)

Let us denote the average over P(1) by angular brackets as $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\Ave}[1]{\left\langle {#1} \right\rangle} \Ave{\cdots}_{\V{c}}$ . This distribution is clearly conditioned by $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}$ and A. We also define another distribution for $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ given $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ as

Equation (16)

where

Equation (17)

This distribution is conditioned by $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ in addition to $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}$ and A. Hence, we denote the average over P(2) by $ \newcommand{\V}[1]{\boldsymbol{#1}} \newcommand{\Ave}[1]{\left\langle {#1} \right\rangle} \Ave{\cdots}_{\V{x}|\V{c}}$ . The simultaneous average over both P(1) and P(2) is denoted by double angular brackets $ \newcommand{\dAve}[1]{\left\langle \left\langle {#1} \right\rangle \right\rangle} \dAve{\cdots}$ .

Recalling that $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}$ and A are also random variables, we note that there are three different hierarchies of random variables: $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ is conditioned by $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ , which is in turn conditioned by $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{y}$ and A. This distinction is a natural consequence of the structure of the present problem.

2.3. Relationship to Bayesian inference using sparsity-inducing priors

Our formulation is related to a Bayesian framework. To demonstrate the relationship, we introduce the following prior distribution of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ given $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ :

Equation (18)

where ϕ is the prior distribution of the nonzero components. As our purpose is to achieve a variable selection, i.e. choosing the best support $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ , the variable $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ can be treated as a hidden variable. Hence in the Bayesian framework, the most rational approach is to sample $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ from the following posterior distribution:

Equation (19)

where $ \newcommand{\V}[1]{\boldsymbol{#1}} P(\V{y}|\V{x}, \V{c})$ is our model distribution of data, which is derived through the noise distribution with variance $\sigma^2$ :

Equation (20)

and the prior distribution of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ , $ \newcommand{\V}[1]{\boldsymbol{#1}} P(\V{c})$ , is set to be a uniform distribution at a fixed $ \newcommand{\nnzero}{K} \nnzero=N\rho$

Equation (21)

This method is optimal when our parameters and model match the true generative process. This matching condition is called the Nishimori condition in physics. Performance at optimality is an issue worth further study; however, in the present paper, we do not pursue this direction. Instead, we perform a maximum a posteriori (MAP) estimation for $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ by assuming an (un-normalised) flat prior $ \newcommand{\mR}{\mathbb{R}} \phi(x)=1, ~(\forall{x}\in\mR)$ . This yields the $ \newcommand{\V}[1]{\boldsymbol{#1}} \hat{\V{x}}(\V{c}|\V{y})$ expression defined in equation (3) as the MAP estimator. Hence, the MAP estimation approximates equation (19) as

Equation (22)

Overall, the posterior distribution of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ defined in equation (15) can be regarded as a MAP approximation of equation (19) in the Bayesian framework.

There are two reasons for treating the MAP estimator. The first reason is the computational cost of equation (19). The computation of equation (19) requires integration with respect to $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ , which is computationally expensive in general even if we employ versatile approximations such as the MC method. In contrast, the MAP estimator allows us to skip this integration and provides a reasonable estimator (3), which is relatively easy to compute. The second reason concerns the plausibility of matching our inference model with the true generative process, as also discussed in section 1. In many practical tasks for which our regression model is employed, it is not realistic to assume that we have precise knowledge of the generative process. Hence, a certain mismatch between the inference and generative models is inevitable. This is a common criticism against applying the Bayesian approach to signal processing tasks. In contrast, the MAP estimator with the uninformative flat prior $\phi(x)=1$ allows us to bypass this problem and can yield better performance than the Bayes estimator (19) in certain mismatched cases. These reasons naturally motivate us to investigate equation (15) instead of equation (19).

2.4. Related work

Here we give a brief summary of several preceding works treating related problems and clarify how the present paper is similar to, or different from, them.

In [41], Guo and Verdú studied a random linear estimation problem in the context of code-division multiple access by using the replica method, as in [32]. Its MAP estimator was considered in [42], again by using the replica method. Reeves and Gastpar treated similar linear models in the variable selection context, as in this paper: they computed the trade-off relation between the measurement rate ($\alpha=M/N$ in our notation) and a distortion quantifying an error rate in the reconstruction of the true support, derived rigorous lower and upper bounds of the measurement rate needed to achieve a given value of distortion, and compared them with the replica result [43–45]. Another investigation, by Reeves and Pfister, succeeded in proving that the replica prediction is exact, by tightening the bounds under the assumption that the matrix A is i.i.d. from the normal distribution in the Bayes optimal setting [46]. In [47], the same problem was investigated by using the replica method, except that the matrix A is drawn from rotationally invariant ensembles. These preceding results employing the replica method were obtained under the RS ansatz; Bereyhi et al examined the RSB ansatz and showed that the k-step RSB with small k can considerably reduce the inconsistency of the RS solution that appears when the MAP estimator with $ \newcommand{\e}{{\rm e}} \ell_0$ -norm regularisation is considered [48, 49]. In [50], a similar problem with the restriction that the signal components take only binary values was studied by using the improved second moment method. From a wider viewpoint, generic glassy natures of MAP estimators in inference problems were examined in [51] by considering a rank-one matrix estimation as an example; the so-called survey propagation, a variant of message-passing algorithms that takes glassy natures into account, was reported to fail to improve the inference accuracy over the standard message-passing algorithm [16–19]. In [52], a similar model was considered in the context of reconstructing encrypted signals and investigated by using the replica method; the full-step RSB ansatz was applied and thus the result is expected to be exact and tight.

These studies are connected to this paper through the basic problem setting, but we stress that our present formulation is very different from all of them. We again note that in this paper all possibilities of the support are examined by computing the entropy curve, while the usual replica analysis treats only certain specific supports or estimators. All the preceding works referred to in the previous paragraph fall into this case. For example, the estimator given in equation (2), which is a subject of study in [14, 42–45, 47–49], corresponds to just one point on the entropy curve: the minimum $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ point given K. Meanwhile, the present formulation enables us to simultaneously treat all the other supports, or estimators optimising the coefficients $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ on the chosen support. This leads to the distribution of the MSE, which is nothing but the entropy curve. Hence, our formulation provides more global information about the problem, yielding deeper insights from both the information theoretic and algorithmic perspectives, as shown in [24] and below. A drawback is that the analytical procedure is much more complicated than the usual replica analysis, as explained in section 2.2.1.

3. Analytical results

3.1. Summary of order parameters

As shown below, the free entropy is characterised by a number of macroscopic order parameters and we summarise them here. The order parameters are defined as

Equation (23a)

Equation (23b)

Equation (23c)

Equation (23d)

m is the overlap with the true signal $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ and is relevant to the reconstruction performance of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ . R and Q describe the powers (per element) of the reconstructed signal, but the latter takes into account the 'thermal' fluctuation that results from the introduction of β. These two quantities coincide in the limit $\beta \to \infty$ , but their infinitesimal difference yields an important contribution

Equation (24)

This is $O(1)$ even in the limit $\beta \to \infty$ . The last order parameter q directly reflects the fluctuation of the support vector $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ and exhibits the RSB in some parameter regions.

Using these order parameters, the average value of the input MSE is

Equation (25)

where $ \newcommand{\POWx}{\sigma_x^2} \POWx$ represents the typical power (per non-zero element) of the signal defined by

Equation (26)

The output MSE is computed from equation (9) once the free entropy is obtained in terms of the order parameters. Hence, we can naturally compute both the input and output MSEs in the present formalism.

3.2. Expressions of free entropy

3.2.1. RS solution.

Postponing the details of analysis to the appendix, we present the resultant formulas of the free entropy and related quantities. The expression for g in the RS level is given by

Equation (27)

where for simplicity of notation we let $ \newcommand{\lb}{\left(} \newcommand{\rb}{\right)} \newcommand{\lbb}{\left\{} \newcommand{\rbb}{\right\}} \Omega_{{\rm RS}}=\lbb \chi, Q, q, m, \tilde{\rho}, \tilde{\chi}, \tilde{Q}, \tilde{q}, \tilde{m} \rbb$ , $Dz={\rm d}z{\rm e}^{-\frac{1}{2}z^2}/\sqrt{2\pi}$ , $ \newcommand{\POWx}{\sigma_x^2} V=\rho_0 \POWx+\sigma_{\xi}^2-2m$ , and

Equation (28)

Equation (29)

Equation (30)

Equation (31)

$Y^{{\rm RS}}_{0}$ is obtained by substituting $\tilde{m}=0$ into $Y^{{\rm RS}}_{\tilde{m}}$ , and $h^{{\rm RS}}_0$ is defined similarly. The symbol $ \newcommand{\Extr}[1]{\mathop{\rm Extr}_{#1}} \Extr{\Omega}$ denotes the extremisation condition with respect to Ω coming from the saddle-point method. This extremisation condition yields the following equations of state (EOS):

Equation (32a)

Equation (32b)

Equation (32c)

Equation (32d)

Equation (32e)

Equation (32f)

Equation (32g)

Equation (32h)

Equation (32i)

Note that all tilde variables are conjugates of their respective variables and are introduced to expand delta functions with the Fourier transform as shown in the appendix. From equation (32), we obtain some simple relations

Equation (33a)

Equation (33b)

Equation (33c)

Equation (33d)

Equation (33e)

3.2.2. RSB solution and the instability of the RS solution.

The RS solution can be inaccurate when the configuration space of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ exhibits spontaneous breaking into many locally separated components. In such a situation, the RSB ansatz should be adopted. The RSB solution is categorised by the level of the emerging hierarchical structure of the separation. In the simplest case, called the 1st step RSB (1RSB) solution, only one level of hierarchy is taken into account, and we examine this case in the present paper. The 1RSB solution is actually sufficient to expose the instability of the RS solution and hence is sufficient to achieve the present purpose of obtaining implications for local search algorithms.

We postpone the detailed derivation of the 1RSB solution to the appendix. The explicit 1RSB formula involving g is given as

Equation (34)

where τ is Parisi's breaking parameter, $ \newcommand{\lb}{\left(} \newcommand{\rb}{\right)} \newcommand{\lbb}{\left\{} \newcommand{\rbb}{\right\}} \Omega_{{\rm 1RSB}}=\lbb \chi, Q, q_1, q_0, m, \tilde{\rho}, \tilde{\chi}, \tilde{Q}, \tilde{q}_1, \tilde{q}_0, \tilde{m}, \tau \rbb$ , and

Equation (35)

Equation (36)

Equation (37)

Equation (38)

Equation (39)

As in the RS case, $h^{{\rm 1RSB}}_0$ and $Y^{{\rm 1RSB}}_{0}$ are given by inserting $\tilde{m}=0$ into $h^{{\rm 1RSB}}_{\tilde{m}}$ and $Y^{{\rm 1RSB}}_{\tilde{m}}$ , respectively. The corresponding EOS are involved and are given in the appendix.

By examining the 1RSB solution, we can determine the instability points of the RS solution. Empirically, two types of instabilities are known to appear in a wide range of systems:

  • Global instability  
    The RS solution is locally stable but there emerges another solution involving exponentially many metastable states, which induces the so-called random first-order transition (RFOT). We thus call the associated instability RFOT instability in this paper.
  • Local instability  
    The RS solution becomes locally unstable; this can be signalled by expanding the 1RSB solution with respect to $\Delta_0$ and observing the coefficient of the first-order term. This is also known as the de Almeida–Thouless (AT) instability.

According to these empirical facts, we derive two instability conditions below.

The RFOT instability is known to emerge at $\tau=1$ and can be detected in an easy manner as follows. For $\tau=1$ , we can identify q0 and $\tilde{q}_0$ with q and $\tilde{q}$ in the RS solution, respectively, because their EOS formally accord with each other. Likewise, the 1RSB EOS of all other order parameters except for q1 and $\tilde{q}_1$ become identical to their corresponding RS EOS. Hence, we should compute q1 and $\tilde{q}_1$ on top of the RS solution and examine whether a nontrivial solution with $q_1\neq q_0=q$ exists or not. The equations to be solved are written in terms of $\Delta_0\equiv q_1-q_0=q_1-q$ and $ \newcommand{\e}{{\rm e}} \tilde{\Delta}_0\equiv \tilde{q}_1-\tilde{q}_0=\tilde{q}_1-\tilde{q}$ as follows

Equation (40)

Equation (41)

In these equations, we should read $\tilde{q}_1=\tilde{q}+\tilde{\Delta}_0$ in $h^{{\rm 1RSB} }$ and $Y^{{\rm 1RSB} }$ . The trivial RS solution $\Delta_0=0$ always exists and the question is whether a nontrivial solution $\Delta_0\neq 0$ exists or not. Such a nontrivial solution is absent for the low μ region but is present at sufficiently large values of μ. The lowest value of μ for which the nontrivial solution exists defines the RFOT point $\mu_{\rm RFOT}$ .

The AT instability is observed by examining the presence of a nontrivial solution around $\Delta_0=0$ . This can be accomplished by expanding the right hand side of equation (41) with respect to $\Delta_0$ up to the first order after inserting equation (40) into $\tilde{\Delta}_0$ . If the coefficient of the first-order term is greater than unity, a nontrivial solution emerges. This condition is written using the RS solution only, because the small $\Delta_0$ limit implies that the order parameters q1 and q0 of the 1RSB solution can be identified as q in the RS solution, and the corresponding tilde variables can also be identified. The explicit stability condition of the RS solution against the AT instability is given by

Equation (42)

This condition always holds for sufficiently low values of μ. The lowest μ value violating equation (42) indicates the AT transition point $\mu_{\rm AT}$ .

These two instabilities are known to affect the performance of local search algorithms. The origin of the degraded performance caused by the RFOT transition is clear: there emerge exponentially many local minima, and thus the search dynamics is easily trapped in one of those states; the typical trapping state will be the most numerous one and will be far from the true global minimum. Each trapping state is separated from the others by high energy barriers, and hence escaping will take an exponentially long time [40]. Meanwhile, the influence of the AT instability is less trivial than that of the RFOT. According to the standard physical picture, when the AT instability occurs, the configuration space of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ contains many saddle-point-like structures, which leads to a complicated critical slowing down of the dynamics, and thus the performance of local search algorithms will be strongly degraded. However, this degradation will be less serious than that caused by the RFOT transition, because in the AT case there are considered to exist certain directions along which the system can escape from a local saddle point. This may take a long time, as the energy landscape will be very flat along the escape directions owing to the saddle-point nature, but it will be shorter than in the RFOT case, where an exponentially long time is required. These descriptions have actually been supported by numerical simulations of some metaheuristic algorithms in several optimisation problems [53–58]. We stress, however, that there are still many unclear points about the dynamics around the AT instability, and further investigations using concrete algorithms, such as Monte Carlo methods or message-passing algorithms, are desired.

As will be seen in the next section, there are several characteristic regions depending on the parameters ρ and $\sigma_{\xi}^2$ . In some regions, the RSB transitions induced by the AT and RFOT instabilities occur, which makes it difficult for the system's dynamics to converge to the equilibrium distribution. In other regions, the RS solution is always stable as μ is changed. However, the RS-stable region is separated into two smaller regions: one has no phase transitions, and thus metaheuristic algorithms can work well there; the other exhibits another first-order transition, which prevents such algorithms from approaching the global minimum. These descriptions will be confirmed by numerical experiments with a metaheuristic algorithm in section 4.2.

3.2.3. Some simple limits.

To check our replica results, we summarise some simple solutions obtained at particular limits below.

High temperature solution.

A trivial solution in the high temperature limit, $T=\mu^{-1} \to \infty$ , is derived from equation (32). From simple algebra based on equations (32) and (33), we obtain

Equation (43)

Accordingly,

Equation (44a)

Equation (44b)

Equation (44c)

Using the relation $\tilde{\rho}=-\log (\rho/(1-\rho))$ , we have

Equation (45a)

Equation (45b)

Equation (45c)

where $H_2(\rho)$ is the binary entropy, giving a reasonable result.

The local stability of this solution can be checked by substituting equations (43) and (44) into equation (42), which yields

Equation (46)

This always holds in the meaningful setup of $\rho\leqslant 1$ and $\rho\leqslant \alpha$ , and hence the high temperature solution is stable.

Perfect reconstruction in the noiseless limit.

Of particular interest for the noiseless limit $\sigma_{\xi}=0$ is whether we can achieve the perfect reconstruction of $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ . We call the corresponding solution the perfect reconstruction (PR) solution, which is defined by

Equation (47)

Note that support components outside the true support may take unity: ci can be 1 even if $|x_{0i}|_0=0$ . This is because the coefficients of $ \newcommand{\V}[1]{\boldsymbol{#1}} \hat{\V{x}}(\V{c})$ outside the true support automatically become zero as a result of the optimisation if all components of the true support are covered, as in equation (47), leading to vanishing MSEs. In other words, the PR solution always exists if the estimated nonzero density is greater than or equal to the true one, $\rho \geqslant \rho_0$ . This seemingly indicates that it is safer to overestimate the nonzero density ρ. This is, however, not necessarily true, because larger estimates of ρ tend to involve unfavourable phase transitions, as shown below.

The output MSE $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ of the PR solution is zero and the entropy becomes

Equation (48)
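A simple counting argument indicates the scale of this degeneracy (a sketch consistent with the discussion above, not a substitute for the derivation of equation (48)): in the noiseless case, any support of size $K=N\rho$ (with $\rho\leqslant \alpha$) that covers the true support of size $K_0=N\rho_0$ is a PR solution, and there are $\binom{N-K_0}{K-K_0}$ such supports, whose exponential rate is $(1-\rho_0)H_2\left(\frac{\rho-\rho_0}{1-\rho_0}\right)$ in terms of the binary entropy $H_2$ appearing in the high temperature solution above.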

We can actually find this PR solution in the limit $\mu \to \infty$ of our RS formula (32). To derive this, we have to carefully treat the scaling of $V+q$ and $\Delta_{{\rm RS}}$ within that limit. We first assume the following scaling:

Equation (49)

The consistency of this assumption is confirmed after the computation. Using this and equation (33), we can easily determine the following limits

Equation (50)

These relations mean that $Y^{{\rm RS}}_{\tilde{m}}$ diverges or vanishes depending on the values of x0 and z while $Y^{{\rm RS}}_{0}$ converges to ${\rm e}^{-\tilde{\rho}}$ . The vanishing region of $Y^{{\rm RS}}_{\tilde{m}}$ is expressed in terms of z as

Equation (51)

Equation (52)

As we have assumed $(V+q)\to 0~(\mu \to \infty)$ , this vanishing region rapidly shrinks and does not provide any meaningful contribution. Hence, we may treat $Y^{{\rm RS}}_{\tilde{m}}$ as a diverging factor in all contributing regions. This consideration yields, from equation (32e):

Equation (53)

Additionally, some algebraic operations from equations (32) and (33) lead to

Equation (54)

where

Equation (55)

which yields a finite contribution in the limit $\mu \to \infty$ . The divergence of $Y^{{\rm RS}}_{\tilde{m}}$ implies that

Equation (56)

The precise scaling of the right hand side strongly depends on the choice of the prior distribution $P_0(x_0)$ . For example, if it is Gaussian, $ \newcommand{\POWx}{\sigma_x^2} P_0(x_0)=\mathcal{N}(0, \POWx)$ , then $(V+q)=O(\mu^{-\frac{3}{2}})$ ; if the signal strength is constant, $P_0(x_0)=\delta(x_0-C)$ with $C\neq 0$ , then $(V+q)$ decays exponentially as μ increases. Similar calculations apply to $\Delta_{{\rm RS}}$ , which exhibits the same scaling as $(V+q)$ . These scalings are consistent with the assumed ones (49).

Summarising these calculations in conjunction with equations (32) and (33), we determine that

Equation (57)

This implies that the free entropy g coincides with the entropy in the limit $\mu\to \infty$ . Substituting the scalings obtained so far into equation (27), we find that

Equation (58)

The limiting behaviours of $\Delta_{{\rm RS}}$ and $V+q$ cause the input MSE to vanish:

Equation (59)

Hence, the PR solution is successfully derived.

The local stability (42) of this PR solution should be checked. Inserting equations (50) and (53) into equation (42) results in the stability condition

Equation (60)

Hence, the PR solution is stable as long as $\rho_0 \leqslant \rho \leqslant \alpha$ where the PR solution exists. Furthermore, this stability is considered to be rather 'robust'. It is physically reasonable that the RS solution is stable even for $\rho < \rho_0$ if ρ is sufficiently close to $\rho_0$ because the inequality (60) is safely satisfied even at $\rho=\rho_0$ . This will be demonstrated in the phase diagram shown below. Note that equation (60) is just a necessary condition and not a sufficient one.

The stability of the RS solution at and around the PR solution has an algorithmic implication for the $ \newcommand{\e}{{\rm e}} \ell_0$ -minimisation approach [59–61]. Namely, local search algorithms such as the message-passing algorithm will not be degraded by a rugged energy landscape if the initial condition is sufficiently close to the PR solution, and hence the predictions based on the RS computation will be precise.

3.3. Entropy curve and phase diagram

To obtain a concrete result, in the following we set the prior distribution of the nonzero component as a Gaussian:

Equation (61)

Owing to this assumption, the double integrations with respect to x0 and z in equation (32) can be merged into a Gaussian integration by a variable transform $ \newcommand{\POWx}{\sigma_x^2} (\tilde{m}x_0+\sqrt{\tilde{q}}z)\to \sqrt{\tilde{q}+\tilde{m}^2\POWx}z$ . The EOS (32) should be appropriately transformed. Except for equation (32i), this can be accomplished by neglecting the integration $\int {\rm d}x_0 P_0(x_0)$ and reading

Equation (62)

in the EOS and $Y^{{\rm RS}}_{\tilde{m}}$ . Equation (32i) involves the cross term of x0 and $h^{{\rm RS}}_{\tilde{m}}$ so some special care is needed. A slight calculation yields

Equation (63)

Furthermore, the signal variance $ \newcommand{\POWx}{\sigma_x^2} \POWx$ is set to be $ \newcommand{\POWx}{\sigma_x^2} \POWx=1/\rho_0$ to fix the per-element signal power to unity, $ \newcommand{\POWx}{\sigma_x^2} \rho_0\POWx=1$ .

3.3.1. Noiseless case.

We start from the noiseless limit $\sigma_{\xi}=0$ . Figure 1 shows the plot of entropy $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} s(\MSEp)$ versus $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ for several different values of ρ. Other parameters are common and are set as $\alpha=0.5$ and $\rho_0=0.2$ . The drawn curves are based on the RS solution, and the RSB solution is only used to indicate the respective instability point. As seen from figure 1, we observe four different characteristic behaviours of $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} s(\MSEp)$ . For small ρ, the AT instability occurs at small $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ , where the full-step RSB will be needed to correctly describe that region. This RS unstable region vanishes for larger ρ, defining a critical value $\rho_{{\rm AT}}(<\rho_0)$ . In the region $\rho_{{\rm AT}} < \rho < \rho_0$ , the RS solution is stable for the entire $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ region having the nonnegative entropy $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} s(\MSEp)\geqslant 0$ . A somewhat surprising fact in this region is that the entropy crisis (EC), $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} s(\MSEp)=0$ , occurs at a finite critical value of $\mu_{\rm EC}(<\infty)$ . As ρ approaches $\rho_0$ , this critical value $\mu_{\rm EC}(\rho)$ diverges and the entropy curve is continuously connected to the PR solution in the region $\rho_0 \leqslant \rho$ . For a wide range of $\rho(\geqslant \rho_0)$ , the RS solution is again stable for the entire region of $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ . However, at larger values of ρ, there emerges a new phase transition, defining another critical value, $\rho_{\rm SP}$ . For $\rho_{\rm SP} \leqslant \rho$ , the PR solution is detached from the high temperature limit, and there are two different branches for the small $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ or large μ region. Two critical temperature points are accordingly defined: the spinodal point TSP, at which the low-temperature branch connected to the PR solution vanishes, and the first-order transition point TF, at which the g values of the two branches coincide. To locate the corresponding critical temperatures at a given $\rho(>\rho_{\rm SP})$ , we plot g against $T=1/\mu$ in figure 2. As a reference, the output MSE $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ is also plotted against T around the critical temperatures in the right panel. The presence of the first-order phase transition implies that the simple SA algorithm with a rapid annealing schedule will fail to find the PR solution in $\rho_{\rm SP}\leqslant \rho$ . The system's dynamics goes along the branch connected to the high temperature limit and cannot move to the PR solution. Actually, this strongly affects the SA performance, as the values of input MSEs $ \newcommand{\e}{{\rm e}} \newcommand{\MSEx}{\epsilon_{x}} \MSEx$ are quite different between the two branches, despite the output MSEs $ \newcommand{\e}{{\rm e}} \newcommand{\MSEy}{\epsilon_{y}} \MSEy$ both being small, as shown in the right panel of figure 2. This will be demonstrated in section 4.2.

Figure 1. Entropy plotted against the output MSE $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ , which is represented by the solid black line, for $\alpha=0.5$ and $\rho_0=0.2$ in the noiseless limit $\sigma_{\xi}=0$ . The nonzero density is different among the four panels: $\rho=0.04<\rho_{{\rm AT}}$ (upper left), $\rho_{{\rm AT}}< \rho=0.18 <\rho_0$ (upper right), $\rho_{0}< \rho=0.3 {<} \rho_{\rm SP}$ (lower left), and $\rho_{\rm SP}< \rho=0.45$ (lower right). In the last case, there are two entropy curves denoted by the solid and dashed dotted lines, which correspond to two different phases, and the inset is a close-up around $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} (\MSEp, s(\MSEp))=(0, s_{{\rm PR}})$ . Broken and dotted lines denote the high temperature solution (45) and the PR solution (48), respectively.

Figure 2. Plots of the free entropy g (left) and the output MSE $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ (right) against the temperature $T=1/\mu$ around $T_{\rm F}\approx 3.3\times 10^{-4}$ (vertical broken line) defined by the intersection of two branches of g represented by black solid and blue dashed dotted lines. The parameters are $\rho=0.45$ , $\alpha=0.5$ , and $\rho_0=0.2$ . The output MSE of the right panel shows that both branches seem to produce vanishing $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ in the limit $T\to 0$ , though Branch 2 connected to the PR solution yields lower values.


Summarising the above findings, we draw a phase diagram in the ρ-T plane for $\alpha=0.5$ and $\rho_0=0.2$ in figure 3. Note that the entropy-crisis line $T_{\rm EC}(\rho)$ below the AT line $T_{\rm AT}(\rho)$ has no direct physical consequence because the RS solution is unreliable in that region. The exact entropy-crisis line would be derived by the full step RSB solution, but this is beyond the scope of the present paper.

Figure 3. Phase diagram in the ρ-T plane for the noiseless case $\sigma_{\xi}=0$ . The parameters are $\alpha=0.5$ and $\rho_0=0.2$ . The right panel is a close-up of the left one in the low T region, used to focus on the spinodal and the first-order transition lines, TSP and TF, which are rather small compared to the other critical temperatures TAT and TEC in the region $\rho<\rho_0$ . The vertical dashed lines denote the critical values of ρ: from left to right: $\rho_{\rm AT}\approx 0.17$ , $\rho_{0}=0.2$ , and $\rho_{\rm SP}\approx 0.43$ .


Figure 3 implies that finding the ground state is easy in the region $\rho_{\rm AT}< \rho < \rho_{\rm SP}$ , while it is difficult in the other regions, $\rho\leqslant \rho_{\rm AT}$ and $\rho_{\rm SP}\leqslant \rho$ . We focus on how the easy region behaves when changing the external parameters α and $\rho_0$ . Examining the parameters, we find that the easy region shrinks as $\rho_0$ increases at fixed α and finally vanishes at a certain critical value of $\rho_0$ . As an example, the ρ-T phase diagram for $\alpha=0.5$ and $\rho_0=0.3$ is shown in figure 4. The spinodal and first-order transition lines cover the entire $\rho_0< \rho$ region and hence the easy region disappears, implying that local search algorithms cannot find the PR solution for any ρ in this case. This vanishing of the easy region thus defines the algorithmic limit. By searching all parameter regions of $\rho_0$ and α, we can draw a phase diagram of the algorithmic limit, which is shown in figure 5. The performance of our formulation is clearly better than that of the $ \newcommand{\e}{{\rm e}} \ell_1$ relaxation [14] and is competitive with the Bayesian result [21]. This implies that the present formulation, which can be regarded as a MAP estimation in the Bayesian framework, does not significantly lose reconstruction performance despite discarding information about the signal source. This encourages the use of the present formulation in the context of signal recovery and is one of the main results of this paper.

Figure 4. Phase diagram in the ρ-T plane for the noiseless case $\sigma_{\xi}=0$ . The parameters are $\alpha=0.5$ and $\rho_0=0.3$ . The right panel is a close-up of the left one in the low T region. The spinodal and first-order transition lines cover the entire $\rho_0\leqslant \rho$ region, implying that local search algorithms will fail to find the PR solution for any ρ.

Figure 5. The algorithmic limits for the perfect reconstruction of the planted solution $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ . The red solid line is the algorithmic limit derived here. Two other boundaries, the blue solid and red broken ones, indicate the $ \newcommand{\e}{{\rm e}} \ell_1$ relaxation derived in [14] and the Bayesian inference shown in [21], respectively. Our result, which is regarded as an MAP approximation of the Bayesian inference, is competitive with the Bayesian result.


3.3.2. Noisy case.

A new behaviour specific to the noisy case is the presence of the RFOT transition for strong noise at intermediate values of ρ. As an example, the entropy curves for $\sigma_{\xi}^2=10$ are given in figure 6. Two critical ρ values, $\rho_{\rm AT}$ and $\rho_{\rm RFOT}$ , accordingly emerge. For $\rho\leqslant \rho_{\rm AT}$ , the AT instability occurs first as the temperature decreases. For $\rho_{\rm AT} < \rho \leqslant \rho_{\rm RFOT}$ , the RFOT begins to emerge above the AT instability temperature. For larger ρ, the RS solution is accurate for all temperatures. The RS EC occurs at finite T in that region, which is somewhat similar to the Ising perceptron problem [37, 38]. Note that $\rho_{\rm RFOT}$ does not exist if the noise is sufficiently weak.

Figure 6. The solid black line denotes the entropy curve plotted against the output MSE $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ for $\alpha=0.5, \rho_0=0.2$ and $\sigma_{\xi}^2=10$ . The non-zero component density ρ is $0.1, ~0.25$ and 0.4 from left to right. The RFOT transition, which is absent in the noiseless case, appears at middle values of ρ.


Summarising the above findings, we show phase diagrams for three noise strengths, $\sigma_{\xi}^2=0.001, ~0.1$ and 10, in figure 7. The first-order transition in the noiseless case is quite fragile and disappears for very weak noise, as seen in the left panel of figure 7. We attempted to capture the critical value of $\sigma_{\xi}$ at which it disappears, but this proved numerically difficult and we did not pursue the point. As the noise increases, the phase boundaries deviate more and more from the noiseless limit, and the RSB region, $\rho<\rho_{\rm RFOT}$ , seems to grow. One may wonder whether this RSB region covers the low-temperature region completely when the noise is very large. To answer this, we tested the no-signal case $\rho_0=0$ and $\sigma_{\xi}^{2}=1$ and observed that $\rho_{\rm RFOT}$ takes a value similar to that in the right panel of figure 7. Hence, an RS region exists even in the strong noise case, which is consistent with our previous analysis in the data compression context [24].


Figure 7. Phase diagrams for three different noise strengths: $\sigma_{\xi}^{2}=0.001, ~0.1$ and 10 from left to right. The other parameters are $\alpha=0.5$ and $\rho_0=0.2$ . The phase boundaries are denoted by solid lines with different colours (green: $T_{\rm RFOT}$, red: $T_{\rm AT}$, black: $T_{\rm EC}$). For the weak noise case (left), we could not locate the RFOT region. The vertical dashed lines represent the critical values of ρ.


4. Numerical simulations

4.1. Monte Carlo evaluation of entropy curves

Here, we examine the analytical results by comparing them with numerical simulations. Our simulations are based on exchange Monte Carlo sampling [62], and the free entropy g is estimated using the multi-histogram method [63].

Our MC sampling is based on the Metropolis criterion, where an MC move $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}\to \V{c}'$ is accepted according to the probability

Equation (64)

During the update, we keep the non-zero component density $\rho=\sum_{i}c_i/N$ constant. To this end, we generate trial moves $\boldsymbol{c}\to \boldsymbol{c}'$ by 'pair flipping' of two support indicators, one equal to 0 and the other equal to 1. Namely, choosing an index i from $ {\rm ONES}\equiv \{k|c_{k}=1 \}$ and another index j from $ {\rm ZEROS}\equiv \{k|c_{k}=0 \}$ , we set $\boldsymbol{c}'=\boldsymbol{c}$ except for the pair $(c_i, c_j)=(1, 0)$ , which is flipped to $(c'_i, c'_j)=(0, 1)$ . The pseudocode of our MC algorithm is given in algorithm 1; a minimal Python sketch of this update follows the pseudocode. We define one MC step (MCS) as N trials of pair flipping for each system at every temperature point. The exchange of every pair of neighbouring temperature points is attempted after every 1/N MCS, which is a more frequent exchange than is conventional.

Algorithm 1. MC update with pair flipping.

1: procedure ${\rm MC_{PF}}(\boldsymbol{c}, \mu, \boldsymbol{y}, A)$     $\rhd$ MC routine with pair flipping
2:     ${\rm ONES} \leftarrow \{k|c_k=1\}$ , ${\rm ZEROS} \leftarrow \{k|c_k=0\}$
3:     randomly choose i from ONES and j from ZEROS
4:     $\boldsymbol{c}' \leftarrow \boldsymbol{c}$
5:     $(c'_i, c'_j) \leftarrow (0, 1)$
6:     $(\epsilon_{y}, \epsilon_{y}')\leftarrow (\epsilon_{y}(\boldsymbol{c}|\boldsymbol{y}, A), \epsilon_{y}(\boldsymbol{c}'|\boldsymbol{y}, A))$
7:     $p_{\rm accept} \leftarrow \min(1, {\rm e}^{-M\mu\left( \epsilon_{y}' -\epsilon_{y}\right) })$
8:     generate a random number $r\in [ 0, 1]$
9:     if $ r < p_{\rm accept} $ then
10:       $\boldsymbol{c} \leftarrow \boldsymbol{c}'$
11:     end if
12:     return $\boldsymbol{c}$
13: end procedure
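For concreteness, a minimal Python sketch of this pair-flipping Metropolis update might read as follows. It recomputes the output MSE from scratch with a full least-squares solve at every proposal, rather than using the optimised incremental update of [26]; the function names, the 1/(2M) normalisation of the residual, and the use of numpy are our own illustrative choices.

import numpy as np

def output_mse(c, y, A):
    """epsilon_y(c | y, A): least-squares fit of y using only the columns of A
    selected by the support vector c; returns (1/(2M)) * ||y - A_S x_S||^2
    (the 1/(2M) normalisation is assumed here for illustration)."""
    S = np.flatnonzero(c)
    x_S, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)
    r = y - A[:, S] @ x_S
    return 0.5 * np.mean(r ** 2)

def mc_pair_flip(c, mu, y, A, rng):
    """One Metropolis update with pair flipping; keeps sum(c) fixed."""
    M = y.size
    ones = np.flatnonzero(c == 1)
    zeros = np.flatnonzero(c == 0)
    i, j = rng.choice(ones), rng.choice(zeros)
    c_new = c.copy()
    c_new[i], c_new[j] = 0, 1
    de = output_mse(c_new, y, A) - output_mse(c, y, A)
    if rng.random() < min(1.0, np.exp(-M * mu * de)):   # Metropolis criterion
        return c_new
    return c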

In all simulations, we set $\alpha=0.5$ , $\rho_0=0.2$ and $\rho_0\sigma_x^2=1$ . The configurational average is calculated by taking the median over 1000 different samples of $(\boldsymbol{x}_0, \boldsymbol{\xi}, \boldsymbol{A})$ . The error bars are estimated by the bootstrap method. The examined system sizes are $N=30, 40, \cdots, 100$ . Equilibration is checked by monitoring the convergence of all measured quantities $(g, s, \epsilon_y)$ to stable values while varying the total number of MCSs; for reference, $256\times 10^2$ MCSs are needed for equilibration when N  =  100, $\rho=0.3$ and $\sigma_\xi^2=10$ . For burn-in, the first half of the total MCSs is discarded.
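As an illustration of the error-bar estimation, a small sketch of a bootstrap estimate for the standard error of the median might look as follows; the function name, the number of bootstrap resamples, and the synthetic data are our own choices.

import numpy as np

def bootstrap_median(values, n_boot=1000, rng=None):
    """Median over configurational samples and a bootstrap estimate of its
    standard error, obtained by resampling the samples with replacement."""
    rng = rng if rng is not None else np.random.default_rng()
    values = np.asarray(values)
    boot = np.array([np.median(rng.choice(values, size=values.size, replace=True))
                     for _ in range(n_boot)])
    return np.median(values), boot.std(ddof=1)

# Example with synthetic per-sample measurements of some observable.
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000) + 1.0
med, err = bootstrap_median(samples, rng=rng)
print(f"median = {med:.3f} +/- {err:.3f}")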

4.1.1. Simulation in noiseless case.

The free-entropy values evaluated by the numerical simulations, together with their extrapolation to the infinite-size limit, are presented in figure 8 for the noiseless case $\sigma_{\xi}^2=0$ . The extrapolation lines result from linear regression using the asymptotic form $g\approx a+bN^{-1}+cN^{-1}\log N^{-1}$ . The regression is conducted by applying the least squares method as follows:

Equation (65)

Figure 8. Plots of g against 1/N at several values of μ for $\rho=0.1$ (left) and 0.3 (right) in the noiseless case $\sigma_{\xi}^2=0$ . The lines are produced by linear regression based on equation (65). On the vertical axis, the black circles and red crosses represent the extrapolated and analytical values in the $N \to\infty$ limit, respectively.


This asymptotic form is based on Stirling's formula and is exact at $\mu=0$ , which motivates us to use the form even when $\mu\not=0$ . The same asymptotic form is used for obtaining the extrapolated values of the output MSE $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ and the entropy s. Using these values for the limit $N \to \infty$ , we present the curves $g(T)$ and $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} s(\MSEp)$ in figure 9. The lines represent the RS analytical results. The circles represent the extrapolated values obtained from the numerical results. The extrapolated values show fairly good agreement with the RS analytical ones, justifying our analytical results. For the case of $\rho=0.1$ , the AT instability occurs at $T\approx 0.04$ , but even below this temperature, the agreement between the RS analytical result and the numerical one is fairly good, suggesting a weak RSB effect on s and $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ . This is, however, not the case for the input MSE $ \newcommand{\e}{{\rm e}} \newcommand{\MSEx}{\epsilon_{x}} \MSEx$ , as demonstrated in section 4.2 below.
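A minimal sketch of this finite-size extrapolation, assuming ordinary least squares on the three basis functions $1$ , $N^{-1}$ and $N^{-1}\log N^{-1}$ , might read as follows; the numerical values are synthetic placeholders generated purely to illustrate the fit, not measured data.

import numpy as np

rng = np.random.default_rng(0)
N = np.array([30, 40, 50, 60, 70, 80, 90, 100], dtype=float)

# Synthetic "measured" free entropies, generated from the asymptotic form
# itself purely to illustrate the fitting procedure (not real data).
a_true, b_true, c_true = -0.25, 0.8, 0.5
g = a_true + b_true / N + c_true * (1.0 / N) * np.log(1.0 / N)
g += 1e-3 * rng.standard_normal(N.size)          # small measurement noise

# Ordinary least squares for g ~ a + b/N + c*(1/N)*log(1/N).
X = np.column_stack([np.ones_like(N), 1.0 / N, (1.0 / N) * np.log(1.0 / N)])
a_hat, b_hat, c_hat = np.linalg.lstsq(X, g, rcond=None)[0]
print("extrapolated N -> infinity value:", a_hat)   # estimate of a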


Figure 9. The free-entropy curve $g(T)$ (left) and the entropy curve $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} s(\MSEp)$ (right) for $\rho=0.1$ (upper) and 0.3 (lower) in the noiseless case $\sigma_{\xi}=0$ .


4.1.2. Simulation in noisy case.

A similar analysis for the case of strong noise ($\sigma_{\xi}^2=10$ ) is performed and the results are shown in figures 10 and 11. These figures again demonstrate good agreement between the RS and numerical results as long as the RSB does not occur. Below the RSB transition point, we observe a deviation between them, as shown in the left upper panel of figure 11. In such a situation, generally speaking, the RS entropy curve can be regarded as an upper bound of the entropy values [64], although a meaningful difference is not observed in the right upper panel of figure 11. Again, the RSB effect on s and $ \newcommand{\e}{{\rm e}} \newcommand{\MSEp}{\epsilon_{y}} \MSEp$ appears to be weak.


Figure 10. Plots of g versus 1/N at several values of μ for $\rho=0.1$ (left) and 0.3 (right) in the strong noise case $\sigma_{\xi}^2=10$ . The gap between the RS and extrapolated results at $\mu=1.5$ in the left panel is probably caused by the RSB effect.


Figure 11. The free-entropy curve $g(T)$ (left) and the entropy curve $s(\epsilon_{y})$ (right) for $\rho=0.1$ (upper) and 0.3 (lower) in the strong noise case $\sigma_{\xi}^2=10$ . The deviation between the RS and the extrapolated values below the AT point in the upper left panel is considered to be caused by the RSB effect.


A noteworthy remark concerns the EC phenomenon for $\rho=0.3$ . We stress that this EC phenomenon can be described at the RS level and occurs at a finite temperature. This may be somewhat surprising for readers familiar with other models exhibiting similar EC phenomena, because in most such systems the 1RSB treatment is needed to describe the EC. We note that the energy levels of the present system around the ground state can be very dense and the energy gap between the ground and excited states can be extremely small, as argued from the fact that the PR solution in the noiseless case has numerous degeneracies for $\rho >\rho_0$ . This gap is expected to vanish in the $N\to \infty$ limit, presumably enabling the RS EC phenomenon to appear at a finite μ. The agreement between the RS and numerical results strongly supports this description. Note that the EC phenomenon at small values of ρ is in a different situation and its RS description is not accurate, because the AT instability occurs at higher temperatures in that region and hence the full-step RSB treatment is needed. This is in contrast to the large-ρ region, in which no instability occurs at temperatures above $T_{\rm EC}$.

4.2. Monte Carlo-based optimisation and its performance

The SA is a metaheuristic solver of generic optimisation problems based on the MC method. A variant of the SA for the present problem was proposed in [26], and its performance was examined in a limited parameter region of the present synthetic model and on a real astronomical dataset [26, 27]. Here, we re-examine it over a wider range of parameters to provide more quantitative information.

Our SA algorithm is summarised in algorithm 2. The lines marked with # are not strictly needed for the SA, but have been inserted for later convenience. In algorithm 2, we have a set of inverse temperature points $\{\mu_a \}_{a=1}^{L_{\mu} }$ arranged in ascending order $(0=)\mu_{1}<\mu_{2}<\cdots<\mu_{L_{\mu}}(\gg1)$ and the waiting times $\{\tau_a \}_{a}$ at those points. Hence, as the algorithm proceeds, the temperature of the system $T=1/\mu$ decreases step by step. It is theoretically guaranteed that the SA finds the optimal solution if the cooling schedule is slow enough [65]. However, the guaranteed schedule is usually overcautious, and in many practical situations a faster one may be chosen. The actual schedule examined below consists of $ L_{\mu}=200$ temperature points chosen as

Equation (66)

The initial linear part of the schedule is inserted simply for visibility in the plots shown below; the important point is that $\mu_a$ increases exponentially as a grows. The final temperature is very low, $T_{L_{\mu}}=\mu_{L_{\mu}}^{-1}=10^{-6}$ . The waiting time at each temperature point is kept constant, $\tau_a=\tau~(\forall{a})$ , for simplicity. We show below that this rapid schedule works very efficiently over a wide range of parameters, and we argue that the performance is closely related to the system's equilibrium properties calculated in section 3.
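As an illustration, a schedule with these qualitative properties (an initial linear part followed by exponential growth of μ up to $\mu_{L_\mu}=10^6$ ) could be generated as follows; the length of the linear part and the growth factor are our own choices and do not reproduce equation (66) exactly.

import numpy as np

# Illustrative annealing schedule: L_mu = 200 inverse-temperature points,
# an initial linear part followed by exponential growth, ending at mu = 1e6
# (T = 1e-6).  The split point and growth rate are arbitrary choices.
L_mu = 200
n_lin = 20                                   # length of the initial linear part
mu_lin = np.linspace(0.0, 1.0, n_lin)        # mu_1 = 0, ..., mu_{n_lin} = 1
r = (1e6) ** (1.0 / (L_mu - n_lin))          # growth factor reaching mu = 1e6
mu_exp = 1.0 * r ** np.arange(1, L_mu - n_lin + 1)
mu_schedule = np.concatenate([mu_lin, mu_exp])

assert mu_schedule.size == L_mu and np.isclose(mu_schedule[-1], 1e6)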

Algorithm 2. SA for variable selection in sparse linear regression.

1: procedure SA($\{\mu_a, \tau_a \}_{a=1}^{L_{\mu}}, \rho, \boldsymbol{y}, A$ )
2:    Generate a random initial configuration $\boldsymbol{c}$ with $\sum_{i}c_i=N\rho$
3:    for $a=1:L_{\mu}$ do         $\rhd$ Changing temperature
4:       for $t=1:\tau_{a} $ do           $\rhd$ Sampling at $\mu=\mu_a$
5:         for $i=1:N$ do     $\rhd$ Extensive number of updates
6:           $\boldsymbol{c}\leftarrow {\rm MC_{PF}}(\boldsymbol{c}, \mu_a, \boldsymbol{y}, A)$        $\rhd$ MC update with pair flipping
7:         end for
8:         # Calculate the MSEs $\epsilon_{x}(\boldsymbol{c}_t), \epsilon_{y}(\boldsymbol{c}_t)$ of the current support vector $\boldsymbol{c}_t=\boldsymbol{c}$
9:       end for
10:       # Calculate the average as $\left\langle \left\langle \epsilon_{y} \right\rangle \right\rangle \approx (1/\tau_a)\sum_{t=1}^{\tau_a}\epsilon_{y}(\boldsymbol{c}_t)$ .
11:     end for
12:     return $\boldsymbol{c}$
13: end procedure
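A compact Python sketch of this SA driver, reusing the mc_pair_flip routine sketched in section 4.1, might read as follows; the initialisation and loop structure mirror algorithm 2, while the averaging lines marked with # are omitted.

import numpy as np

def simulated_annealing(mu_schedule, tau, rho, y, A, rng):
    """SA for support selection: anneal c through the given mu schedule,
    performing tau sweeps of N pair-flip updates at each temperature."""
    M, N = A.shape
    c = np.zeros(N, dtype=int)
    c[rng.choice(N, size=int(round(N * rho)), replace=False)] = 1  # random init
    for mu in mu_schedule:                 # decreasing temperature T = 1/mu
        for _ in range(tau):               # waiting time at this temperature
            for _ in range(N):             # one MC step = N pair-flip trials
                c = mc_pair_flip(c, mu, y, A, rng)
    return c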

A noteworthy remark concerns the computational cost of this SA algorithm. The cost can be formally written as $O(L_{\mu}\tau N C_{\rm MC})$ , where the last factor is the cost of a single MC update. The most expensive operation therein is the matrix inversion required to calculate the energy, i.e. the output MSE. If simple multiplication and Gaussian elimination are used for the inversion at each step, then $C_{\rm MC}=O(M(N\rho)^2+(N\rho)^3)$ . However, each update is a pair flip, so the change in the relevant matrices per update is small, and successive updates can exploit this. Using this fact and the matrix inversion formula, the cost of each MC update can be reduced to $C_{\rm MC}=O((N\rho)^2+MN\rho)=O(N^2\alpha \rho)$ , as explained in [26]. Hence, if $L_{\mu}$ and τ do not scale with N and can be kept constant, the total computational cost is $O(L_{\mu}\tau N C_{\rm MC})=O(L_{\mu}\tau \alpha \rho N^3)$ , i.e. it scales as a third-order polynomial of the system size N. This is comparable with versatile algorithms solving the $\ell_1$ relaxation, and thus the present algorithm solves the $\ell_0$ problem at a fairly reasonable computational cost. The assumption of constant $L_{\mu}$ and τ is not trivial, but it appears to be correct, i.e. sufficient to find the PR solution in the successful region, over the parameter range we have numerically explored. Hence, we adopt this constant assumption below.
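To illustrate the origin of this cost reduction, the following sketch updates the inverse Gram matrix when a single column is added to the active set, using block (Schur-complement) inversion, and checks it against a direct inversion. The pair flip of the main text additionally requires the corresponding column-removal downdate, which we omit here; this is only a simplified illustration, not the implementation of [26].

import numpy as np

def add_column(G_inv, A_S, a):
    """Update the inverse of G = A_S^T A_S when column a is appended to A_S,
    via block (Schur-complement) inversion; costs O(K^2 + M*K) instead of O(K^3)."""
    b = A_S.T @ a                              # O(M*K)
    u = G_inv @ b                              # O(K^2)
    s = a @ a - b @ u                          # scalar Schur complement
    top_left = G_inv + np.outer(u, u) / s
    return np.block([[top_left, -u[:, None] / s],
                     [-u[None, :] / s, np.array([[1.0 / s]])]])

rng = np.random.default_rng(1)
M, K = 50, 10
A_S = rng.standard_normal((M, K))
a = rng.standard_normal(M)
A_new = np.column_stack([A_S, a])
updated = add_column(np.linalg.inv(A_S.T @ A_S), A_S, a)
direct = np.linalg.inv(A_new.T @ A_new)
assert np.allclose(updated, direct)            # incremental update matches direct inverse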

4.2.1. Reconstruction performance of simulated annealing.

Noiseless case.

Let us begin by showing the results for the noiseless case. The MSEs at $\alpha=0.5$ and $\rho_0=0.2$ are plotted against T in figure 12. The assumed values of ρ are $\rho=0.02$ , $\rho=0.3$ , and $\rho=0.45$ for the left, middle, and right panels, respectively. The numerical results agree well with the black solid line representing the RS solution connected to the high-temperature limit in all cases. The middle panels lie in the successful region for finding the PR solution, $\rho_{0} < \rho < \rho_{\rm SP}$ , and the vanishing $\epsilon_{x}$ means that the PR solution is actually found. The right panels lie in $\rho_{\rm SP} < \rho $ , meaning that the search is trapped in the metastable state connected to high temperatures. The SA result follows the high-temperature branch and cannot reach the low-temperature one, denoted by the blue dashed-dotted line. Overall, the SA experiments demonstrate that our theoretical predictions are accurate and that the presence of phase transitions strongly degrades the SA's performance in finding the minimum-MSE configuration.


Figure 12. SA performance for the noiseless case at $\alpha=0.5$ and $\rho_0=0.2$ . The output MSE $\epsilon_{y}$ (upper) and the input MSE $\epsilon_{x}$ (lower) are plotted against the temperature T. The values of ρ are 0.02 (left), 0.3 (middle), and 0.45 (right). The waiting time is fixed at $\tau=100$ MCSs; the numbers of samples for averaging are $800, ~200$ and 50 for $N=100, ~200, $ and 400, respectively. For visibility, $\epsilon_{y}$ is plotted on a double-logarithmic scale while $\epsilon_{x}$ is on a semi-logarithmic one. The black solid and blue dashed-dotted lines show the RS analytical solutions.

Noisy case.

Next, we show the results for the noisy case. The SA results for the strong noise case $\sigma_{\xi}^2=10$ at $\alpha=0.5$ and $\rho_0=0.2$ are given in figure 13. As seen from the figure, the MSEs show good agreement with the analytical curve (black solid line) up to the transition points in the left and middle panels. An exceptional deviation is observed at low temperatures in the upper right panel, but this is considered to be a finite-size effect because the range of $\epsilon_{y}$ in this region is very small and presumably unreachable for systems of size $N\approx 100$ . Hence, these behaviours are consistent with the analytical prediction that the system's dynamical behaviour is affected by the RSB transitions and ceases to follow the equilibrium state.


Figure 13. SA performance for the strong noise case $\sigma_{\xi}^2=10$ at $\alpha=0.5$ and $\rho_0=0.2$ . The output MSE $\epsilon_{y}$ (upper) and the input MSE $\epsilon_{x}$ (lower) are plotted against the temperature T. The values of ρ are 0.1 (left), 0.25 (middle), and 0.4 (right). The waiting time is fixed at $\tau=100$ MCSs; the numbers of samples for averaging are $800, ~200$ , and 100 for $N=100, ~200, $ and 400, respectively.


The effect of the RSB on the reconstruction performance becomes much clearer when examining the limiting values of the MSEs achievable at low temperature in a moderate-noise case. Figure 14 shows, plotted against ρ, the limiting values obtained at very low temperatures by the rapid SA with $\tau=5$ for $\sigma_{\xi}^2=0.1$ . The $\epsilon_{x}$ values at intermediate ρ are clearly dominated by those at the AT transition points rather than those at the EC points, implying that the system's search is trapped by local minima emerging at the transition points. An interesting outcome of this phenomenon is a better reconstruction of the planted signal $\boldsymbol{x}_0$ : as seen from figure 14, the $\epsilon_{x}$ values at the AT points are lower than those at the RS EC points, implying that the reconstruction performance of the local minima induced by the RSB is better than that of the minimum-$\epsilon_{y}$ configuration $\hat{\boldsymbol{c}}$ obtained by exactly solving equation (2). This means that the generalisation capability of the rapid SA is no worse than that of exactly solving equation (2) in this case, because the input MSE $\epsilon_{x}$ is proportional to the generalisation error when each component of the design matrix and of the noise is drawn i.i.d. from a zero-mean Gaussian. This encourages the use of the presented formulation and algorithm for practical purposes, as reported in [27].


Figure 14. Plots against ρ of the limiting values of the MSEs $\epsilon_{y}$ (left) and $\epsilon_{x}$ (right) obtained at very low temperatures for $\alpha=0.5$ , $\rho_0=0.2$ and $\sigma_{\xi}^2=0.1$ . Three analytical values, at which the RS entropy crisis, the AT instability, and the RFOT occur, are denoted by black, red, and green solid lines, respectively, although the green line exists only in an extremely small region and is difficult to observe. The simulation results (markers), obtained by the SA with a very rapid schedule $\tau=5$ , agree well with the analytical curves. The numbers of samples for averaging are $100, ~100$ and 40 for $N=100, ~200, $ and 400, respectively.


5. Conclusion

In this study, we have analytically derived an algorithmic limit of the $\ell_0$ -based formulation of compressed sensing through an evaluation of the entropy using statistical mechanical techniques. The results are mainly characterised by the ratio α of the number of observations to the model dimensionality, the non-zero component densities ρ and $\rho_0$ of the inference and generative models, and the signal and noise strengths $\sigma_{x}$ and $\sigma_{\xi}$ . The entropy curves and the associated phase diagrams have been provided, and their implications for local search algorithms have been discussed. A quantitative analysis of the noisy cases has also been conducted. To validate the analytical computation, we have performed a careful MC simulation using the exchange MC method and the multi-histogram method, the results of which exhibit fairly good agreement with the analytical results. To test the theoretical predictions on local search algorithms, we have also performed numerical experiments using the SA-based algorithm. The performance of the SA is well understood through the phase diagrams. Over a wide parameter region in the noiseless case, we have actually located the PR solution at a reasonable computational cost of $O(N^3)$ , whereas in the hard regions the RSB and the RS first-order transitions prevent rapid convergence of the SA to the PR solution.

To overcome the problems caused by phase transitions, it may be interesting to tailor new algorithms based on MC methods. The idea of extended ensembles can be useful: it would be promising future work to devise an algorithm that relaxes the hard constraint on the non-zero component density by introducing a 'chemical potential' that softly controls the density. Multi-canonical sampling, such as the Wang–Landau algorithm [66], may also be an interesting direction. Extending the work in [51], it is also worth trying to incorporate the glassy nature of the problem into algorithm design.

Relaxing the i.i.d. random-matrix assumption on A is also an interesting problem. This is of practical relevance because, in an ideal compressed-sensing setup, each new observation should be designed to maximise the information gained, implying that the rows of A should be mutually orthogonal. Considering such orthogonal ensembles within the presented framework is a high-priority issue to be addressed in the near future.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers 26870185 and 18K11463 (TO), 17K12749 (YN-O), 25120009 (MO), and 25120013 and 17H00764 (YK). TO is also supported by a Grant for Basic Science Research Projects from the Sumitomo Foundation. YN-O is also supported by JST PRESTO Grant Number JPMJPR1773.

Appendix. Calculations of g(μ)

Assuming n and ν are positive integers, we can rewrite

Equation (A.1)

These summations over $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{c}$ and $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}$ are now calculated over all the replicated variables $ \newcommand{\V}[1]{\boldsymbol{#1}} \{\V{c}^{a}\}_{a=1}^{n}$ and $ \newcommand{\V}[1]{\boldsymbol{#1}} \{\V{x}^{a\alpha}\}_{a=1, \cdots, n, \alpha=1, \cdots, \nu}$ . Let us set

Equation (A.2)

The variable d is an extensive sum of random variables and can be expressed by an appropriate sum of Gaussian variables with a certain covariance. The covariance is expressed by

Equation (A.3)

Evaluating this full description is difficult in general. Instead, we consider more amenable subspaces which are described by the RS or RSB ansatz.

A.1. RS computation

In the RS ansatz, the dominant contribution is assumed to come from the following subspace:

Equation (A.4)

Equation (A.5)

Equation (A.6)

Equation (A.7)

Equation (A.8)

and hence the covariance matrix is described by four order parameters and one external parameter $ \newcommand{\POWx}{\sigma_x^2} \rho_0\POWx$ . The corresponding useful description of $d^{a\alpha}_{\mu}$ is

Equation (A.9)

where $u, v$ and z are i.i.d. from $\mathcal{N}(0, 1)$ . The average with respect to A thus can be replaced by the average over $u, v$ and z, yielding

Equation (A.10)

where $\mathcal{I}$ is the subshell (the state density) characterised by the above four order parameters such that

Equation (A.11)

and $\mathcal{L}$ describes

Equation (A.12)

where we merge two Gaussian variables $(z, \xi)$ into z and

Equation (A.13)

Direct integrations yield

Equation (A.14)

where we keep only the linear term with respect to n in the first line, take the limit $\nu\to 0$ while keeping $\beta \nu=\mu$ in the last line, and apply $\beta (R-Q)=\chi$ according to equation (24). Summarising equations (A.12) and (A.14), we obtain

Equation (A.15)

Evaluation of $\mathcal{I}$ requires additional algebra. We break the delta functions by using the Fourier transform as follows

Equation (A.16a)

Equation (A.16b)

Equation (A.16c)

Equation (A.16d)

Equation (A.16e)

Then,

Equation (A.17)

where

Equation (A.18)

The replica indices have been superscripts so far, but we rewrite them as subscripts for visibility. Using Gaussian integrals, we break the sum $\sum_{a<b}$ into single-replica sums as

Equation (A.19)

Hence we can take $\sum_{c_a=0, 1}$ for each a independently

Equation (A.20)

where

Equation (A.21)

The sums $\sum_{\alpha<\beta}$ and $(\sum_{\alpha}x_{\alpha}){}^2$ can also be broken

Equation (A.22)

Hence,

Equation (A.23)

Summarising the result yields

Equation (A.24)

and the law of large numbers implies

Equation (A.25)

where

Equation (A.26)

and X0 is obtained by inserting $\tilde{m}=0$ in $X_{\tilde{m}}$ . This relation means that the average over $ \newcommand{\V}[1]{\boldsymbol{#1}} \V{x}_0$ in equation (A.17) is actually not needed, which comes from the absence of correlations among components of the design matrix A. This validates the use of the factorised prior (14).

To take the limit $\nu \to 0$ , we rescale the tilde variables as follows

Equation (A.27a)

Equation (A.27b)

Equation (A.27c)

Equation (A.27d)

Equation (A.27e)

Applying this rescaling and keeping only the linear term with respect to n, we get

Equation (A.28)

Combining equations (A.15) and (A.28) and using the saddle-point method yield equation (27).

A.2. 1RSB computation

At the 1RSB level, the overlap between $c^a$ and $c^b$ with different $a, b=1, \cdots, n$ should be treated in a more involved manner. In the standard 1RSB construction, the n replicas are separated into $n/\tau$ blocks of size τ. We label the blocks by $B=1, \cdots, n/\tau$ , and the replica index a is represented by two indices as $a=(B_a, I_a)$ , where $I_a$ denotes the component index inside the block. For simplicity, we identify the index set of those components with the block label $B_a$ , allowing us to write $I_a \in B_a$ . The overlap $q_{b\beta}^{a\alpha}=(1/N)\sum_{i}c^a_i c^{b}_i x^{a\alpha}_{i}x^{b\beta}_{i}$ is assumed to take two distinct values depending on whether the replica indices $a, b$ are in the same block or not. This can be written as

Equation (A.29)

Correspondingly, $d^{a\alpha}_{\mu}$ can be expressed as

Equation (A.30)

where $u, v, w, $ and z are i.i.d. from the normal distribution. Then, equation (A.1) is rewritten as

Equation (A.31)

where

Equation (A.32)

and

Equation (A.33)

where

Equation (A.34)

$\mathcal{L}_{{\rm 1RSB}}$ can be computed by repeated Gaussian integrations as in the RS case, and the result is

Equation (A.35)

The computation of $\mathcal{I}_{{\rm 1RSB}}$ can also be performed in parallel with the RS case. The delta functions of the variables shared with the RS case are broken in the same manner as in equation (A.16), and those of $q_1$ and $q_0$ are broken similarly to equation (A.16d). These transforms yield

Equation (A.36)

where

Equation (A.37)

Equation (A.38)

To derive a form factorised with respect to the replica index, we again use the Gaussian-integration trick as in the RS case. For example, the last two terms in equation (A.38) are factorised as

Equation (A.39)

where we have inserted $X_{BI}= \sum_{\alpha}x_{BI\alpha}$ . Repeating similar computations, we finally obtain, to the leading order in n,

Equation (A.40)

To derive this, when taking the limit $\nu\to 0$ , we have rescaled $\tilde{q}_1$ and $\tilde{q}_0$ as

Equation (A.41a)

Equation (A.41b)

Other tilde variables are rescaled in the same manner as equation (A.27).

Inserting equations (A.35) and (A.40) into equation (A.31) and using the saddle-point method, we obtain equation (34). The EOS are obtained from the extremisation conditions, and the result is

Equation (A.42a)

Equation (A.42b)

Equation (A.42c)

Equation (A.42d)

Equation (A.42e)

Equation (A.42f)

Equation (A.42g)

Equation (A.42h)

Equation (A.42i)

Equation (A.42j)

Equation (A.42k)

where

Equation (A.43)

Footnotes

  • There is an analogy between the one-step RSB formulation and the present problem: $(\boldsymbol{y}, A)$ correspond to quenched variables; $\boldsymbol{c}$ and $\boldsymbol{x}$ are dynamical variables, but $\boldsymbol{c}$ determines a pure state while $\boldsymbol{x}$ is an active dynamical variable inside the pure state; the replica number ν corresponds to Parisi's breaking parameter.

  • Insightful readers may doubt the probabilistic interpretation of $P(\boldsymbol{x}|\boldsymbol{c})$ because it cannot be normalised owing to the presence of the flat prior ϕ. This inconvenience can be resolved by replacing the flat prior with $(1/\sqrt{2\pi L}){\rm e}^{-x_i^2/(2L)}$ and taking the limit $L\to \infty$ , although this causes another problem in model selection based on the marginal likelihood in the usual Bayesian framework. However, no model selection using this criterion is performed in the following discussions, and the entire treatment in the main text is unaffected by these problems.
