Abstract
An algorithmic limit of compressed sensing and related variable-selection problems is analytically evaluated when the design matrix is an overcomplete random matrix. The replica method from statistical mechanics is employed to derive the result. The analysis proceeds by evaluating the entropy, the exponential rate of the number of combinations of variables giving a specific value of the fit error to given data, which are assumed to be generated from a linear process using the design matrix. This yields the typical achievable limit of the fit error when solving a representative problem and reveals the presence of unfavourable phase transitions preventing local search algorithms from reaching the minimum-error configuration. The associated phase diagrams are presented. A noteworthy outcome of the phase diagrams is that there exists a wide parameter region where no phase transition occurs from high temperature down to the lowest temperature at which the minimum-error configuration, or the ground state, is reached. This implies that certain local search algorithms can find the ground state with moderate computational cost in that region. Another noteworthy result is the presence of a random first-order transition in the strong-noise case. The theoretical evaluation of the entropy is confirmed by extensive numerical experiments using the exchange Monte Carlo and multi-histogram methods. A further numerical test based on a metaheuristic optimisation algorithm, simulated annealing, is conducted and supports the theoretical predictions on local search algorithms well. In the successful region with no phase transition, the computational cost for simulated annealing to reach the ground state is estimated to scale as a third-order polynomial in the model dimensionality.
Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
Compressed sensing is a technique used to recover a high-dimensional signal from a limited number of measurements by utilising the fact that the signal of interest has redundancy and thus can be 'sparse'; many of the coefficients are set to zero when described with an appropriate basis. This technique has a long history [1–3], but it has recently attracted increased attention as its high performance has been demonstrated in recent influential papers [4–8]. There has been a surge of research on compressed sensing, based on the general idea that the signal has a sparse representation on an appropriate basis; this idea carries over to many other contexts such as data compression, multivariate regression, and variable selection. This trend has triggered a major evolution in techniques of signal and information processing, which is gradually forming a new framework called 'sparse modelling' [9–12].
For clarity, we provide a concise mathematical form to the problem treated here. Suppose a data vector is generated by the following linear process with a design matrix and a signal vector such that:
where is a noise vector, the components of which are assumed to be independent and identically distributed (i.i.d.) variables from the normal distribution with zero mean and variance , . In the context of compressed sensing, the design matrix A represents the measurement process, and given A and , we try to infer in the situation M < N. This is an underdetermined problem, and the sparsity assumption that the number of nonzero components of is smaller than M is needed to solve it. Under this assumption, perfect reconstruction of is possible if noise is absent. The most naive algorithm to achieve this is the exhaustive search, which selects the sparsest set of variables among the error-free ones. This is clearly infeasible if the model dimensionality N is large, and more efficient algorithms or approximations are therefore required. A common approximation is to relax the sparsity constraint. Many studies have been conducted along this direction, and some theoretical studies demonstrated that perfect reconstruction of is possible under reasonable conditions even with such a relaxation [13–15]. Efficient algorithms achieving perfect reconstruction in the noiseless case have been developed accordingly [16–19].
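As a concrete illustration, the linear generative process (1) can be simulated in a few lines. This is a sketch under illustrative assumptions: the sizes, the noise variance, and the 1/N variance normalisation of A below are choices made for the example, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, K0 = 20, 12, 4   # dimensionality, measurements, true sparsity (illustrative)
sigma2 = 0.01          # noise variance (illustrative)

# i.i.d. Gaussian design matrix; variance 1/N per component is one common normalisation
A = rng.normal(0.0, 1.0 / np.sqrt(N), size=(M, N))

# K0-sparse true signal with Gaussian nonzero components
x0 = np.zeros(N)
support = rng.choice(N, size=K0, replace=False)
x0[support] = rng.normal(size=K0)

# data vector: linear measurement plus additive Gaussian noise
y = A @ x0 + rng.normal(0.0, np.sqrt(sigma2), size=M)
```

With M < N as here, recovering x0 from (y, A) alone is the underdetermined problem described above; the sparsity K0 < M is what makes it solvable in principle.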
Some degree of compromise such as the relaxation above appears to be unavoidable, because variable selection in the present problem is NP-hard in the worst case [20]. However, more recent works suggest that, even without such relaxation, variable selection can be achieved at reasonable expense for 'typical' cases [21, 22] where the design matrix is assumed to be i.i.d. from the normal distribution. Their formulation is based on the Bayesian framework assuming the signal's generative process is known. The algorithmic limit was computed by using non-rigorous statistical mechanical techniques, and an associated message-passing algorithm was developed and shown to achieve the limit, and those results have been supported from a firmer mathematical basis [23].
However, the Bayesian framework is not always preferred in the context of signal processing. This is because the signal's generative process is not necessarily evident and is difficult to model in many practical situations. In such cases, it may be better to focus less on the signal sources and rely more on the less informative prior. According to this idea, in the context of data compression, the present authors recently proposed a variable selection criterion based on the following widely-used optimisation formulation [24]:
where denotes the norm and the norm is assumed to give the number of nonzero components of . Our basic idea in [24] is to compare all variable sets of size K and to organise them as an ensemble in the statistical mechanical sense by regarding the fit error as energy. Our statistical mechanical analysis, again non-rigorous and performed under the same assumption on the design matrix as [21, 22], showed that the configuration space of the variable set is rather 'smooth', implying that certain local search algorithms can efficiently find the minimum-error variable set. Based on this finding, we developed an algorithm based on the so-called simulated annealing (SA) algorithm [25], which is a Monte Carlo (MC)-based optimisation solver, and demonstrated that it can efficiently find the minimum-error set for a wide range of parameters [26, 27]. These results again suggest that variable selection in the present setting can be efficiently achieved, even without relaxation or resorting to the Bayesian framework.
The success of the MC based method further motivates us to analyse the property of the ensemble of the variable set in detail. The previous analysis [24] was limited to the data compression context, and another analysis directly relevant to compressed sensing or multivariate regression is desirable. The present paper addresses this point. The main difference from [24] is the presence of the true signal in equation (1). This introduces other criteria on the reconstructed signal such as the prediction ability and the error to the true signal. We provide a quantitative analysis for these issues.
The remainder of the paper is organised as follows. In section 2, we state the problem setting and the formulation which we employ in this paper. The meaning of the formulation in relation to the Bayesian framework is also explained. In section 3, we provide the analytical solution of the fit error and related quantities derived by the statistical mechanical formulation. In section 4, we present the results of numerical experiments using a careful MC method to support our analytical computations. The performance of the SA algorithm for finding the minimum-error set is also revisited. The final section concludes the study.
2. Problem and formulation
2.1. Problem setting and notation
As noted in section 1, the data is supposed to be generated by the linear process (1). We assume the true signal is K0-sparse and that K0 is less than M: . For notational convenience, we introduce the support operator of a vector as , which yields a binary vector whose component is if , or otherwise. The support of the true signal is represented by a support vector .
We are interested in the fit quality to the data for a given set of variables described by a support vector , on a linear-model basis. The fit quality is quantified by a mean squared error (MSE) for , where the coefficients of the chosen variables are optimised to describe . The optimised coefficients are written in the following form:
where represents the Hadamard product. To eliminate an ambiguity in equation (3), the coefficients of variables out of the support are set to zero, , to provide consistency with the support operator: . We denote the corresponding MSE with by
This is called the output MSE throughout this paper. In addition to the output MSE, we are interested in the MSE with the true signal defined by
Hereafter this is termed the input MSE, and the hat symbol is assumed to represent an estimator of the corresponding quantity.
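A minimal sketch of the support-restricted least-squares estimator (3) and the output MSE (4) is given below. The plain mean of squared residuals is used here as the MSE; the paper's normalisation may include additional constant factors, so treat this as an illustrative convention.

```python
import numpy as np

def fit_on_support(y, A, c):
    """Least-squares coefficients restricted to the support vector c (0/1).

    Components outside the support are pinned to zero, consistent with the
    convention in the text that the estimator vanishes where c_i = 0.
    """
    x = np.zeros(A.shape[1])
    idx = np.flatnonzero(c)
    if idx.size > 0:
        x[idx], *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
    return x

def output_mse(y, A, c):
    """Mean squared residual of the best fit on support c."""
    r = y - A @ fit_on_support(y, A, c)
    return np.mean(r ** 2)
```

In the noiseless case, evaluating output_mse on the true support returns (numerically) zero, which is the perfect-reconstruction situation discussed in section 3.2.3.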
The primary object of our investigation is the histogram of when changing the set of variables . There are two reasons for evaluating this quantity. One is related to the reconstruction of . In this context, lower values of the output MSE are not always preferable and we need more global information regarding the set of variables to obtain a good solution. The other involves possible algorithmic implications. To find a small output-MSE configuration, we usually conduct a local recursive search from certain initial conditions. The performance of such local search algorithms is strongly affected by the structure of the configuration space of . The histogram of provides the necessary information about the structure.
More specifically, we evaluate the exponential rate of the histogram in the large size limit while keeping and finite. This quantity is simply the statistical mechanical entropy defined by
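For very small N the entropy has a direct finite-size analogue: enumerate every size-K support, histogram the resulting output MSEs, and take log(count)/N per bin. The sketch below does exactly this (sizes must stay tiny; this brute-force enumeration is precisely what becomes infeasible at large N and motivates the replica analysis).

```python
import numpy as np
from itertools import combinations

def enumerate_mse(y, A, K):
    """Output MSE of every size-K support (brute force; small N only)."""
    errs = []
    for idx in combinations(range(A.shape[1]), K):
        sub = A[:, list(idx)]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)
        errs.append(np.mean((y - sub @ coef) ** 2))
    return np.array(errs)

def entropy_curve(errs, N, bins=20):
    """Finite-size entropy: log of histogram counts divided by N."""
    counts, edges = np.histogram(errs, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = counts > 0
    return centers[mask], np.log(counts[mask]) / N
```

The returned curve is a crude finite-N estimate of the entropy defined above; the theoretical analysis computes its exact exponential rate in the large-N limit.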
Entropies or similar thermodynamic functions associated with certain optimisation problems have provided algorithmic implications and benefits in several contexts such as information theory, computer science, and neural networks [28–39]. We show below that this strategy indeed works well for the present problem.
2.2. Outline of analysis
We outline the analysis of the entropy. Evaluation of the entropy is replaced with an assessment of a generating function that is a Legendre transform of the entropy. Assumptions to enable this assessment are stated as well.
2.2.1. Generating function: Legendre transform of entropy.
Direct computation of the entropy is not easy, and a more systematic method of evaluation is available using a Legendre transform of the entropy, which we call free entropy throughout this study [40].
We introduce a partition function as
where is the delta function. The free entropy is represented by the exponential rate of G. Considering the definition of the entropy and assuming that the saddle-point method is applicable in the large N limit, we can easily see that the following relation holds
where denotes the support of the function . Hence g and s are the Legendre transforms of each other. Assuming is concave with respect to , then g and s have a one-to-one correspondence and are connected by the control parameter . This control parameter plays the role of 'inverse temperature' in physics. The maximiser in equation (8) and the corresponding entropy, and , are thus parameterised as
Employing these relations, we can easily handle the dominant output MSE and determine the corresponding entropy from g.
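For a finite instance, this Legendre structure can be checked directly: defining g(μ) = (1/N) log Σ_c exp(-NμE_c) over any finite set of energies, its derivative satisfies -dg/dμ = ⟨E⟩_μ, the Boltzmann-average energy at inverse temperature μ. A numerical sketch of this identity:

```python
import numpy as np

def free_entropy(errs, N, mu):
    """g(mu) = (1/N) log sum_c exp(-N*mu*E_c), computed stably."""
    a = -N * mu * np.asarray(errs)
    amax = a.max()
    return (amax + np.log(np.exp(a - amax).sum())) / N

def avg_energy(errs, N, mu):
    """Boltzmann average of the energy; equals -dg/dmu."""
    a = -N * mu * np.asarray(errs)
    w = np.exp(a - a.max())
    return float((w * np.asarray(errs)).sum() / w.sum())
```

Increasing μ weights the sum toward low-energy supports, which is how the control parameter μ selects the dominant output MSE along the entropy curve.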
However, there are two difficulties in the evaluation of g. One is the dependence on and A, and the other is the presence of the least squares problem in the definition of .
The first problem is overcome as follows. The free entropy g is a self-averaging quantity, as is the entropy. Typical values of self-averaging quantities are in accordance with their averaged values in the large N limit. This means that we can calculate the averaged value of g over A and instead of directly considering the entropy's dependence on those quantities. We denote the average over and A by square brackets with appropriate subscripts as . Unfortunately, this average is not easy to calculate. The so-called replica method is a great aid in such a situation, and is symbolised by the following identity
In addition to this identity, we assume that n is a positive integer. This assumption enables us to compute the average . After computing this average, we take the limit by employing the analytical continuation from to under the so-called replica symmetric (RS) or the replica symmetry breaking (RSB) ansatz, which will be explained later.
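For reference, the replica identity invoked here is presumably the standard one; writing the quenched average with square brackets as in the text, it reads

```latex
\left[\ln G\right]_{\boldsymbol{x}^{0},A}
  = \lim_{n \to 0} \frac{\partial}{\partial n}
    \ln \left[ G^{n} \right]_{\boldsymbol{x}^{0},A}
  = \lim_{n \to 0} \frac{\left[ G^{n} \right]_{\boldsymbol{x}^{0},A} - 1}{n}.
```

The average of the nth moment is computed for integer n, and the limit n → 0 is then taken by analytic continuation under the RS or RSB ansatz, as described above.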
The second problem is solved by introducing a variable β and taking a limit as follows
where
where the factor is introduced to make the integral well-defined and can be regarded as a prior for . In the limit , only the contribution corresponding to the solution of the least squares problem in equation (3) survives the integration, and . We further assume that ν is a positive integer, as n is in equation (10). This enables us to treat in parallel the summation over and the integration over , as well as the average . The limit is taken after those operations through analytic continuation.
These operations can be summarised in a line
with abbreviations and . Overall, to calculate the entropy, we assess the free entropy g. The averages over are taken through the replica method with a replica number n, and the internal variables are integrated with an additional replica number ν.
2.2.2. Assumptions for theoretical computation.
The description above is generic, but for technical reasons we need additional assumptions to complete the computation. The most crucial assumption is applied to the distribution of A: each component of A is assumed to be i.i.d. from . Relaxing this assumption, i.e. introducing correlations between components, makes the analysis much more complicated. Admittedly, this assumption is not necessarily realistic. However, the purpose of this paper is to provide an analytical basis to understand the variable-selection performance, and we consider that the random-matrix assumption can provide sufficiently nontrivial implications for this purpose.
Thus assuming the absence of correlations among components of A, we note that only the average behaviour of the signal components is relevant. According to this observation, without loss of generality, we may assume a factorised prior of the signal vector as
where is the density of nonzero components and P0(x) is a prior distribution for the nonzero component. Our theoretical computation can be performed for any prior P0(x) having no probability mass at x = 0, and we keep it unspecified for a while.
Usually, the entropy function enjoys some useful properties such as non-negativity, boundedness, bounded support, concavity, and analyticity. We assume these properties, but as shown below, the concavity and analyticity are partially broken in the present problem. This causes problems in evaluating the parametric form (9), but they can be bypassed by some additional considerations when conducting the saddle-point method. The analyticity breaking of the entropy is actually related to algorithmic performances and is one of the central issues discussed in this paper.
2.2.3. Probabilistic meaning and intrinsic hierarchy of the problem.
Our formulation has a probabilistic meaning which involves two different intrinsic hierarchies in the present problem. We can define the distribution for as
Let us denote the average over P(1) by angular brackets as . This distribution is clearly conditioned by and A. We also define another distribution for given as
where
This distribution is conditioned by in addition to and A. Hence, we denote the average over P(2) by . The simultaneous average over both P(1) and P(2) is denoted by double angular brackets .
Recalling that and A are also random variables, we note that there are three different hierarchies of random variables: is conditioned by , which is conditioned by and A. This discrimination is a natural consequence of the structure of the present problem.
2.3. Relationship to Bayesian inference using sparsity-inducing priors
Our formulation is related to a Bayesian framework. To demonstrate the relationship, we introduce the following prior distribution of given :
where ϕ is the prior distribution of the nonzero components. As our purpose is to achieve a variable selection, i.e. choosing the best support , the variable can be treated as a hidden variable. Hence in the Bayesian framework, the most rational approach is to sample from the following posterior distribution:
where is our model distribution of data, which is derived through the noise distribution with variance :
and the prior distribution of , , is set to be a uniform distribution at a fixed
This method is optimal when our parameters and model match the true generative process. This matching condition is called the Nishimori condition in physics. Performance at the optimality can be an issue to be studied further; however, in the present paper, we do not pursue this direction. Instead, we perform a maximum a posteriori (MAP) estimation for by assuming a (un-normalised) flat prior . This yields the expression defined in equation (3) as the MAP estimator. Hence, the MAP estimation approximates equation (19) as
Overall, the posterior distribution of defined in equation (15) can be regarded as a MAP estimation of equation (19) in the Bayesian framework.
There are two reasons for treating the MAP estimator. The first is the computational cost of equation (19). Computing equation (19) requires integration with respect to , which is generally expensive even with versatile approximations such as the MC method. In contrast, the MAP estimator allows us to skip this integration and provides a reasonable estimator (3), which is relatively easy to compute. The second reason concerns the plausibility of matching our inference model to the true generative process, as also discussed in section 1. In many practical tasks for which our regression model is employed, it is not realistic to assume precise knowledge of the generative process. Hence, a certain mismatch between the inference and generative models is inevitable. This is a common criticism against applying the Bayesian approach to signal-processing tasks. By contrast, the MAP estimator with the uninformative flat prior allows us to bypass this problem and can yield better performance than the Bayes estimator (19) in certain mismatched cases. These reasons naturally motivate us to investigate equation (15) instead of equation (19).
2.4. Related work
Here we give a brief summary of several preceding works treating related problems and clarify how the present paper is similar to or differs from them.
In [41], Guo and Verdú studied a random linear estimation problem in the context of code-division multiple access by using the replica method, as in [32]. Its MAP estimator was considered in [42], again by using the replica method. Reeves and Gastpar treated similar linear models in the variable-selection context, as in this paper: they computed the trade-off between the measurement rate ( in our notation) and a distortion quantifying the error rate in the reconstruction of the true support, derived rigorous lower and upper bounds on the measurement rate needed to achieve a given distortion, and compared them with the replica result [43–45]. Another investigation, by Reeves and Pfister, succeeded in proving that the replica prediction is exact, by tightening the bounds under the assumptions that the matrix A is i.i.d. from the normal distribution and that the setting is Bayes optimal [46]. In [47], the same problem was investigated using the replica method, except that the matrix A is drawn from rotationally invariant ensembles. These preceding replica analyses were conducted under the RS ansatz; Bereyhi et al examined the RSB ansatz and showed that k-step RSB with small k can substantially reduce the inconsistency of the RS solution that appears when the MAP estimator with -norm regularisation is considered [48, 49]. In [50], a similar problem with the restriction that the signal components take only binary values was studied using an improved second-moment method. From a wider viewpoint, generic glassy natures of MAP estimators in inference problems were examined in [51], taking rank-one matrix estimation as an example; the so-called survey propagation, a variant of message-passing algorithms that takes glassy natures into account, was reported to fail to improve the inference accuracy over standard message-passing algorithms [16–19].
In [52], a similar model was considered in the context of reconstructing encrypted signals and investigated by using the replica method; the full-step RSB ansatz was applied and thus the result is expected to be exact and tight.
These studies are connected to this paper through the basic problem setting, but we stress that our present formulation is very different from all of them. We note again that in this paper all possibilities of the support are examined by computing the entropy curve, while the usual replica analysis treats only certain specific supports or estimators. All the preceding works referred to in the previous paragraph fall into this category. For example, the estimator given in equation (2), which is the subject of study in [14, 42–45, 47–49], corresponds to just one point on the entropy curve: the minimum point at given K. Meanwhile, the present formulation enables us to simultaneously treat all the other supports, or estimators optimising the coefficients on the chosen support. This leads to the distribution of the MSE, which is nothing but the entropy curve. Hence, our formulation provides more global information about the problem, yielding deeper insights from both the information-theoretic and algorithmic perspectives, as shown in [24] and below. A drawback is that the analytical procedure is much more complicated than the usual replica analysis, as explained in section 2.2.1.
3. Analytical results
3.1. Summary of order parameters
As shown below, the free entropy is characterised by a number of macroscopic order parameters and we summarise them here. The order parameters are defined as
m is the overlap with the true signal and is relevant to the reconstruction performance of . R and Q describe the powers (per element) of the reconstructed signal, but the latter takes into account the 'thermal' fluctuation that results from the introduction of β. These two quantities coincide in the limit , but their infinitesimal difference yields an important contribution
This is even in the limit . The last order parameter q directly reflects the fluctuation of the support vector and exhibits the RSB in some parameter regions.
Using these order parameters, the average value of the input MSE is
where represents the typical power (per non-zero element) of the signal defined by
The output MSE is computed from equation (9) once the free entropy is obtained in terms of the order parameters. Hence, we can naturally compute both the input and output MSEs in the present formalism.
3.2. Expressions of free entropy
3.2.1. RS solution.
Postponing the details of analysis to the appendix, we present the resultant formulas of the free entropy and related quantities. The expression for g in the RS level is given by
where for simplicity of notation we let , , , and
is obtained by substituting into , and is defined similarly. The symbol denotes the extremisation condition with respect to Ω coming from the saddle-point method. This extremisation condition yields the following equations of state (EOS):
Note that all tilde variables are conjugates of their respective variables and are introduced to expand delta functions with the Fourier transform as shown in the appendix. From equation (32), we obtain some simple relations
3.2.2. RSB solution and the instability of the RS solution.
The RS solution can be inaccurate when the configuration space of exhibits spontaneous breaking into many locally separated components. In such a situation, the RSB ansatz should be adopted. The RSB solution is categorised by the level of the emerging hierarchical structure of the separation. In the simplest case, called the 1-step RSB (1RSB) solution, only one level of hierarchy is taken into account, and we examine this case in the present paper. The 1RSB solution is actually sufficient to expose the instability of the RS solution, and hence sufficient for the present purpose of obtaining implications for local search algorithms.
We postpone the detailed derivation of the 1RSB solution to the appendix. The explicit 1RSB formula involving g is given as
where τ is Parisi's breaking parameter, , and
As in the RS case, and are given by inserting into and , respectively. The corresponding EOS are involved and are given in the appendix.
By examining the 1RSB solution, we can determine the instability points of the RS solution. Empirically, two types of instabilities are known to appear in a wide range of systems:
- Global instability: the RS solution is locally stable, but another solution emerges involving exponentially many metastable states, which induces the so-called random first-order transition (RFOT). We thus call the associated instability the RFOT instability in this paper.
- Local instability: the local instability of the RS solution, which can be signalled by expanding the 1RSB solution with respect to and observing its coefficient. This is also known as the de Almeida–Thouless (AT) instability.
According to these empirical facts, we derive two instability conditions below.
The RFOT instability is known to emerge at and can be detected easily as follows. For , we can identify q0 and with q and in the RS solution, respectively, because their EOS formally accord with each other. Likewise, the 1RSB EOS of all order parameters other than q1 and become identical to their corresponding RS EOS. Hence, we should compute q1 and on top of the RS solution and examine whether a nontrivial solution exists. The equations to be solved are written in terms of and as follows
In these equations, we should read in and . The trivial RS solution always exists and the question is whether a nontrivial solution exists or not. Such a nontrivial solution is absent for the low μ region but is present at sufficiently large values of μ. The lowest value of μ for which the nontrivial solution exists defines the RFOT point .
The AT instability is observed by examining the presence of a nontrivial solution around . This can be accomplished by expanding the right hand side of equation (41) with respect to up to the first order after inserting equation (40) into . If the coefficient of the first-order term is greater than unity, a nontrivial solution emerges. This condition is written using the RS solution only, because the small limit implies that the order parameters q1 and q0 of the 1RSB solution can be identified as q in the RS solution, and the corresponding tilde variables can also be identified. The explicit stability condition of the RS solution against the AT instability is given by
This condition always holds for sufficiently low values of μ. The lowest μ value violating equation (42) indicates the AT transition point .
These two instabilities are known to affect the performance of local search algorithms. The origin of the degradation caused by the RFOT is clear: exponentially many local minima emerge, and the search dynamics is easily trapped in one of them; the typical trapping state will be the most numerous one and will be far from the true global minimum. Each trapping state in this case is separated by high energy barriers, and hence escaping takes an exponentially long time [40]. Meanwhile, the influence of the AT instability is less trivial than that of the RFOT. According to the standard physical picture, when the AT instability occurs, the configuration space of acquires many saddle-point-like structures, which leads to a complicated critical slowing down of the dynamics, and thus the performance of local search algorithms is strongly degraded. However, this degradation is expected to be less serious than that caused by the RFOT, because in the AT case there are considered to exist certain directions along which the system can escape from a local saddle point. This may take a long time, as the energy landscape will be very flat along the escape directions owing to the saddle-point nature, but it will be shorter than in the RFOT case, where an exponentially long time is required. These descriptions have been supported by numerical simulations of metaheuristic algorithms on several optimisation problems [53–58]. We stress, however, that many points about the dynamics around the AT instability remain unclear, and further investigations using concrete algorithms, such as Monte Carlo methods or message-passing algorithms, are desired.
As will be seen in the next section, several characteristic regions appear depending on the parameters ρ and . In some regions, RSB transitions induced by the AT and RFOT instabilities occur, which makes it difficult for the system's dynamics to converge to the equilibrium distribution. In the other regions, the RS solution is stable for all μ. However, the RS-stable region is separated into two subregions: one has no phase transition, and thus metaheuristic algorithms can work well there; the other exhibits a first-order transition that prevents such algorithms from approaching the global minimum. These descriptions are confirmed by numerical experiments with a metaheuristic algorithm in section 4.2.
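A minimal sketch of an SA procedure of the kind tested in section 4.2 is given below, under illustrative assumptions: pair-swap moves on a fixed-size support, a simple increasing β schedule, and Metropolis acceptance on the raw MSE difference. The paper's actual schedule, move set, and energy scaling may differ.

```python
import numpy as np

def anneal_support(y, A, K, betas, sweeps=5, seed=None):
    """Simulated annealing over size-K supports with pair-swap moves."""
    rng = np.random.default_rng(seed)
    N = A.shape[1]

    def energy(idx):
        # output MSE of the least-squares fit restricted to idx
        coef, *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
        return np.mean((y - A[:, idx] @ coef) ** 2)

    idx = list(rng.choice(N, size=K, replace=False))
    E = energy(idx)
    for beta in betas:                          # increasing beta = cooling
        for _ in range(sweeps * K):
            i = int(rng.integers(K))            # active slot to swap out
            j = int(rng.choice(np.setdiff1d(np.arange(N), idx)))
            cand = idx.copy()
            cand[i] = j
            Ec = energy(cand)
            # Metropolis acceptance at inverse temperature beta
            if Ec <= E or rng.random() < np.exp(-beta * (Ec - E)):
                idx, E = cand, Ec
    return sorted(int(k) for k in idx), E
```

In a region with no phase transition, a schedule of this kind is expected to reach the minimum-error support reliably; in the RSB or first-order-transition regions, the dynamics gets trapped as described above.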
3.2.3. Some simple limits.
To check our replica results, we summarise some simple solutions obtained at particular limits below.
High temperature solution.
A trivial solution in the high temperature limit, , is derived from equation (32). From simple algebra based on equations (32) and (33), we obtain
Accordingly,
Using the relation , we have
where is the binary entropy, giving a reasonable result.
The local stability of this solution can be checked by substituting equations (43) and (44) into equation (42), which yields
This always holds in the meaningful setup of and , and hence the high temperature solution is stable.
Perfect reconstruction in the noiseless limit.
Of particular interest for the noiseless limit is whether we can achieve the perfect reconstruction of . We call the corresponding solution the perfect reconstruction (PR) solution, which is defined by
Note that the support components outside the true support may take unity: ci can be 1 even if the corresponding true component is zero. This is because the coefficients outside the true support automatically become zero as a result of the optimisation once all components of the true support are covered, as in equation (47), leading to vanishing MSEs. In other words, the PR solution always exists if the estimated nonzero density is greater than or equal to the true one, . This seemingly indicates that it is safer to overestimate the nonzero density ρ. This is, however, not necessarily true, because larger estimates of ρ tend to involve unfavourable phase transitions, as shown below.
The output MSE of the PR solution is zero and the entropy becomes
We can actually find this PR solution in the limit of our RS formula (32). To derive this, we have to carefully treat the scaling of and within that limit. We first assume the following scaling:
The consistency of this assumption is confirmed after the computation. Using this and equation (33), we can easily determine the following limits
These relations mean that diverges or vanishes depending on the values of x0 and z while converges to . The vanishing region of is expressed in terms of z as
As we have assumed , this vanishing region rapidly shrinks and does not provide any meaningful contribution. Hence, we may treat as a diverging factor in all contributing regions. This consideration yields, from equation (32e):
Additionally, some algebraic operations from equations (32) and (33) lead to
where
which yields a finite contribution in the limit . The divergence of implies that
The precise scaling of the right hand side strongly depends on the choice of the prior distribution . For example, if it is Gaussian , then ; if the signal strength is a constant at with , then exponentially decays as μ increases. Similar calculations apply to , which involves the same scaling of as . These scalings are consistent with the assumed ones (49).
Summarising these calculations in conjunction with equations (32) and (33), we determine that
This implies that the free entropy g coincides with the entropy in the limit . Substituting the scalings obtained so far into equation (27), we find that
The limiting behaviours of and cause the input MSE to vanish:
Hence, the PR solution is successfully derived.
The local stability (42) of this PR solution should be checked. Inserting equations (50) and (53) into equation (42) results in the stability condition
Hence, the PR solution is stable as long as where the PR solution exists. Furthermore, this stability is considered to be rather 'robust'. It is physically reasonable that the RS solution is stable even for if ρ is sufficiently close to because the inequality (60) is safely satisfied even at . This will be demonstrated in the phase diagram shown below. Note that equation (60) is just a necessary condition and not a sufficient one.
The stability of the RS solution at and around the PR solution has an algorithmic implication for the -minimisation approach [59–61]. Namely, local search algorithms such as the message-passing algorithm will not be degraded by a rugged energy landscape if the initial condition is sufficiently close to the PR solution, and hence the predictions based on the RS computation will be precise.
3.3. Entropy curve and phase diagram
To obtain a concrete result, in the following we set the prior distribution of the nonzero component as a Gaussian:
Owing to this assumption, the double integrations with respect to x0 and z in equation (32) can be merged into a Gaussian integration by a variable transform . The EOS (32) should be appropriately transformed. Except for equation (32i), this can be accomplished by neglecting the integration and reading
in the EOS and . Equation (32i) involves the cross term of x0 and so some special care is needed. A slight calculation yields
Furthermore, the signal variance is set to be to fix the per-element signal power to unity, .
3.3.1. Noiseless case.
We start from the noiseless limit . Figure 1 shows the plot of entropy versus for several different values of ρ. Other parameters are common and are set as and . The drawn curves are based on the RS solution, and the RSB solution is only used to indicate the respective instability point. As seen from figure 1, we observe four different characteristic behaviours of . For small ρ, the AT instability occurs at small , where the full-step RSB will be needed to correctly describe that region. This RS unstable region vanishes for larger ρ, defining a critical value . In the region , the RS solution is stable for the entire region having the nonnegative entropy . A somewhat surprising fact in this region is that the entropy crisis (EC), , occurs at a finite critical value of . As it approaches , this critical value diverges and the entropy curve is continuously connected to the PR solution in the region . For a wide range of , the RS solution is again stable for the entire region of . However, at larger values of ρ, there emerges a new phase transition, defining another critical value, . For , the PR solution is detached from the high temperature limit, and there are two different branches for the small or large μ region. Two critical temperature points are accordingly defined—the spinodal point TSP at which the low-temperature branch connected to the PR solution vanishes, and the first-order transition point TF at which g values of the two branches coincide. To locate the corresponding critical temperatures at a given , we plot g against in figure 2. As a reference, the output MSE is also plotted against T around the critical temperatures in the right panel. The presence of the first-order phase transition implies that the simple SA algorithm with a rapid annealing schedule will fail to find the PR solution in . The system's dynamics goes along the branch connected to the high temperature limit and cannot move to the PR solution. 
Actually, this strongly affects the SA performance, as the values of input MSEs are quite different between the two branches, despite the output MSEs both being small, as shown in the right panel of figure 2. This will be demonstrated in section 4.2.
Summarising the above findings, we draw a phase diagram in the ρ-T plane for and in figure 3. Note that the entropy-crisis line below the AT line has no direct physical consequence because the RS solution is unreliable in that region. The exact entropy-crisis line would be derived by the full step RSB solution, but this is beyond the scope of the present paper.
Figure 3 implies that finding the ground state is easy in the region , while it is difficult in the other regions: and . We focus on how the easy region behaves when the external parameters α and are changed. Examining the parameters, we find that the easy region shrinks as increases for fixed α and finally vanishes at a certain critical value of . As an example, the ρ-T phase diagram for and is shown in figure 4. The spinodal and first-order transition lines cover the entire region and hence the easy region disappears, implying that local search algorithms cannot find the PR solution for any ρ in this case. This vanishing of the easy region thus defines the algorithmic limit. By searching all parameter regions of and α, we can draw a phase diagram of the algorithmic limit, which is shown in figure 5. The performance of our formulation is clearly better than that of the relaxation [14] and is competitive with the Bayesian result [21]. This implies that the present formulation, which can be regarded as a MAP estimation in the Bayesian framework, does not significantly lose reconstruction performance despite discarding the signal source information. This encourages the use of the present formulation in the context of signal recovery and is one of the main results of this paper.
3.3.2. Noisy case.
A new behaviour specific to the noisy case is the presence of the RFOT for strong noise at intermediate values of ρ. As an example, the entropy curves for are given in figure 6. Two critical values of ρ, and , accordingly emerge. For , the AT instability occurs first as the temperature decreases. For , the RFOT emerges above the AT instability temperature. For larger ρ, the RS solution is accurate in all temperature regions. The RS EC occurs at finite T in that region, which is somewhat similar to the Ising perceptron problem [37, 38]. Note that does not exist if the noise is sufficiently weak.
Summarising the above findings, we show phase diagrams for three noise strengths, and 10, in figure 7. The first-order transition in the noiseless case is quite fragile and disappears for very weak noise, as seen in the left panel of figure 7. We attempted to capture the critical values of for the disappearance, but this proved numerically difficult and we did not pursue the point. As the noise increases, the phase boundary expands further from the noiseless limit and the RSB region, , seems to grow. One may wonder whether this RSB region covers the low-temperature region completely when the noise is very large. To answer this, we tested the no signal case and and observed that takes a value similar to the one in the right panel of figure 7. Hence, an RS region exists even in the strong noise case, which is consistent with our previous analysis in the data compression context [24].
4. Numerical simulations
4.1. Monte Carlo evaluation of entropy curves
Here we examine the analytical results by comparing them with numerical simulations. Our simulations sample from the equilibrium distribution by the exchange Monte Carlo method [62]. Estimation of the free entropy g is accomplished by using the multi-histogram method [63].
Our MC sampling is based on the Metropolis criterion, where an MC move is accepted according to the probability
During the update, we keep the density of nonzero components constant. To this end, we generate trial moves by 'pair flipping' two support indicators, one equal to 0 and the other equal to 1. Namely, choosing an index i of the support indicator from and another index j from , we set , except for the counterpart of , which is given as . The pseudocode of our MC algorithm is given in algorithm 1. We define one MC step (MCS) as N trials of pair flipping for each system at every temperature point. The exchange of every pair of neighbouring temperature points is conducted after every 1/N MCS, which is a more frequent exchange schedule than is conventional.
Algorithm 1. MC update with pair flipping.
1: procedure MCPF() | MC routine with pair flipping
2:
3: randomly choose i from ONES and j from ZEROS
4:
5:
6:
7:
8: generate a random number
9: if then
10:
11: end if
12: return
13: end procedure
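As a concrete illustration, algorithm 1 can be sketched in Python as follows. The energy is taken here to be the output MSE of the least-squares fit restricted to the current support; the 1/(2M) normalisation and the function names are assumptions of this sketch, and the efficient incremental matrix-inversion update of [26] is replaced by a full least-squares solve for clarity.

```python
import numpy as np

def output_mse(A, y, c):
    """Least-squares fit restricted to the support c; returns the output MSE.

    The 1/(2M) normalisation is an assumption of this sketch.
    """
    S = np.flatnonzero(c)                       # active column indices
    x_S, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)
    r = y - A[:, S] @ x_S
    return r @ r / (2 * len(y))

def mc_pair_flip(A, y, c, E, beta, rng):
    """One Metropolis update with pair flipping: swap a 1 and a 0 in c."""
    i = rng.choice(np.flatnonzero(c))           # an index with c_i = 1
    j = rng.choice(np.flatnonzero(c == 0))      # an index with c_j = 0
    c_new = c.copy()
    c_new[i], c_new[j] = 0, 1                   # pair flip keeps sum(c) fixed
    E_new = output_mse(A, y, c_new)
    dE = E_new - E
    # Metropolis acceptance criterion
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        return c_new, E_new
    return c, E
```

A production implementation would update the Gram-matrix inverse incrementally rather than refitting from scratch at every step, as discussed in section 4.2.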
In all simulations, we set , and . The configurational average is calculated by taking the median over 1000 different samples of . The error bars are estimated by the Bootstrap method. The examined system sizes are . The equilibration is checked by monitoring the convergence of all measured quantities to stable values by changing the total MCSs; for reference, we note that MCSs are needed for equilibration when N = 100, , . For burn-in, the first half of the total MCSs is discarded.
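The neighbouring-temperature exchanges mentioned above follow the standard replica-exchange acceptance rule; a minimal sketch (the function name and data layout are ours) is:

```python
import numpy as np

def exchange_step(betas, configs, energies, rng):
    """One sweep of replica-exchange attempts between neighbouring temperatures.

    A swap of replicas k and k+1 is accepted with probability
    min(1, exp((beta_k - beta_{k+1}) * (E_k - E_{k+1}))),
    which preserves the joint equilibrium distribution.
    """
    for k in range(len(betas) - 1):
        d = (betas[k] - betas[k + 1]) * (energies[k] - energies[k + 1])
        if d >= 0 or rng.random() < np.exp(d):
            configs[k], configs[k + 1] = configs[k + 1], configs[k]
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
```

In practice one alternates attempts between even and odd neighbouring pairs so that successive swaps do not interfere.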
4.1.1. Simulation in noiseless case.
The free-entropy values evaluated by numerical simulations and the extrapolation to the infinite size limit of the noiseless case are presented in figure 8. The extrapolation lines result from linear regression using an asymptotic form . The regression is conducted by applying the least squares method as follows:
This asymptotic form is based on Stirling's formula and is exact at , which motivates us to use the form even when . The same asymptotic form is used for obtaining the extrapolated values of the output MSE and the entropy s. Using these values for the limit , we present the curves and in figure 9. The lines represent the RS analytical results. The circles represent the extrapolated values obtained from the numerical results. The extrapolated values show fairly good agreement with the RS analytical ones, justifying our analytical results. For the case of , the AT instability occurs at , but even below this temperature, the agreement between the RS analytical result and the numerical one is fairly good, suggesting a weak RSB effect on s and . This is, however, not the case for the input MSE , as demonstrated in section 4.2 below.
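The extrapolation procedure amounts to an ordinary least-squares fit that is linear in 1/N. A minimal sketch, assuming the leading finite-size correction is a single 1/N term (our simplification of the elided asymptotic form), is:

```python
import numpy as np

def extrapolate_to_infinite_n(Ns, g_values):
    """Fit g_N ~ g_inf + c / N by least squares; return the intercept g_inf."""
    x = 1.0 / np.asarray(Ns, dtype=float)
    c_, g_inf = np.polyfit(x, np.asarray(g_values, dtype=float), 1)
    return g_inf
```

The same fit is applied to the output MSE and entropy data before they are compared with the RS analytical curves.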
4.1.2. Simulation in noisy case.
A similar analysis for the case of strong noise () is performed and the results are shown in figures 10 and 11. These figures again demonstrate good agreement between the RS and numerical results as long as the RSB does not occur. Below the RSB transition point, we observe a deviation between them, as shown in the left upper panel of figure 11. In such a situation, generally speaking, the RS entropy curve can be regarded as an upper bound of the entropy values [64], although a meaningful difference is not observed in the right upper panel of figure 11. Again, the RSB effect on s and appears to be weak.
We have a noteworthy remark on the EC phenomenon for . We stress that this EC phenomenon can be described at the RS level and occurs at a finite temperature. This may be somewhat surprising for readers familiar with other models exhibiting similar EC phenomena, because in most such systems the 1RSB treatment is needed to describe the EC phenomenon. We note that the energy levels of the present system around the ground state can be very dense and the energy gap between the ground and excited states can be extremely small, as suggested by the fact that the PR solution in the noiseless case has numerous degeneracies for . This gap is supposed to vanish in the limit, presumably enabling the RS EC phenomenon to appear at a finite μ. The agreement between the RS and numerical results strongly supports this description. Note that the EC phenomenon at small values of ρ is in a different situation and its RS description is not accurate. This is because the AT instability occurs at higher temperatures in that region and hence the full-step RSB treatment is needed. This is in contrast to the large-ρ region, in which no instability occurs at temperatures higher than TEC.
4.2. Monte Carlo-based optimisation and its performance
The SA is a metaheuristic solver of generic optimisation problems based on the MC method. A variant of the SA for the present problem was proposed in [26] and its performance was examined in a limited parameter region of the present synthetic model and in a real astronomical dataset [26, 27]. We re-examine this over a wider range of parameters to provide more quantitative information.
Our SA algorithm is summarised in algorithm 2. The lines marked with # are not necessarily needed for SA, but have been inserted for later convenience. In algorithm 2, we have a set of inverse temperature points arranged in ascending order and the waiting times at those points. Hence, as the algorithm proceeds, the temperature of the system decreases step by step. It is theoretically guaranteed that if the schedule of the decreasing temperature is slow enough, then the SA can find the optimal solution [65]. However, the guaranteed schedule is usually overcautious and in many practical situations we may choose a faster one. The actual schedule examined below consists of temperature points chosen as
The first linear region of the schedule is simply inserted for visibility in the plots shown below and the important point is that the schedule is exponentially increasing as a grows. The final temperature is very low, . The waiting time at each temperature point is kept constant, at , for simplicity. We show below that this rapid schedule works very efficiently for a wide range of parameters and discuss that the performance is closely related to the system's property at equilibrium, which was already calculated in section 3.
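Since the precise temperature points are specified by the formula above, the following sketch only illustrates the qualitative shape we describe: a short linear ramp in inverse temperature followed by exponential growth. All numerical values here are placeholders, not the ones used in the experiments.

```python
import numpy as np

def inverse_temperature_schedule(n_linear=10, n_exponential=40,
                                 beta_step=0.1, growth=1.2):
    """Illustrative annealing schedule: beta grows linearly, then exponentially.

    All parameter values are placeholders for demonstration only.
    """
    linear = beta_step * np.arange(1, n_linear + 1)
    exponential = linear[-1] * growth ** np.arange(1, n_exponential + 1)
    return np.concatenate([linear, exponential])
```

With a constant waiting time τ at each point, the total number of MC updates is simply τ times the schedule length times N.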
Algorithm 2. SA for variable selection in sparse linear regression.
1: procedure SA ()
2: Generate a random initial configuration with
3: for do | Changing temperature
4: for do | Sampling at
5: for do | Extensive number of updates
6: | MC update with pair flipping
7: end for
8: # Calculate the MSEs of the current support vector
9: end for
10: # Calculate the average as .
11: end for
12: return
13: end procedure
A remark is in order on the computational cost of this SA algorithm. This cost can be formally written as , where the last factor is the computational cost of each MC update. The most expensive operation is the matrix inversion required to calculate the energy of the output MSEs. If we use simple multiplication and Gauss elimination in the inversion process for each step, then . However, we employ pair flipping in each update, so the change in the relevant matrices between successive updates is small. Using this fact and the matrix inversion formula, the total cost of each MC update can be reduced to , as explained in [26]. Hence, if and τ do not scale with N and can be kept constant, the total computational cost is and scales as the third-order polynomial of the system size N. This is comparable with the versatile algorithms solving the relaxation, and thus the present algorithm solves the problem with a fairly reasonable computational cost. The assumption of constant and τ is not trivial, but it appears to be correct, i.e. sufficient to find the PR solution in the successful region, within the region we have numerically searched. Hence, we adopt this constant assumption below.
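The matrix inversion formula alluded to here is presumably of the Sherman–Morrison type: when a pair flip swaps one column of the active submatrix, the Gram-matrix inverse can be refreshed by low-rank updates instead of being recomputed from scratch. A sketch of the basic rank-one step (the decomposition of a column swap into such steps follows [26], which we do not reproduce here) is:

```python
import numpy as np

def sherman_morrison_update(Ainv, u, v):
    """Given Ainv = A^{-1}, return (A + u v^T)^{-1} in O(K^2) operations."""
    Au = Ainv @ u
    vA = v @ Ainv
    return Ainv - np.outer(Au, vA) / (1.0 + v @ Au)
```

If each update costs O(N^2) in this way and one MCS comprises N updates, a cubic total cost per MCS follows.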
4.2.1. Reconstruction performance of simulated annealing.
Noiseless case.
Let us begin by showing the results for the noiseless case. The MSEs at and are plotted against T in figure 12. The assumed values of ρ are , , and for the left, middle, and right panels, respectively. The numerical results agree well with the black solid line representing the RS solution connected to the high temperature limit in all cases. The middle panels show the successful region for finding the PR solution, , and the vanishing means that we actually find the PR solution. The right panels are in , meaning that the search is trapped in the metastable state connected to high temperatures. The SA result follows the high-temperature branch and cannot reach the low-temperature one denoted by the blue dashed dotted line. Overall, the SA experiments demonstrate that our theoretical predictions are very precise, and the presence of phase transitions strongly degrades the SA's performance in finding the minimum-MSE configuration.
Noisy case.
Next, we show the results for the noisy case. The SA results for the strong noise case at and are given in figure 13. As seen from the figure, the MSEs show good agreement with the analytical curve (black solid line) up to the transition points in the left and middle panels. An exceptional deviation is observed at low temperatures in the upper right panel, but this is considered to be a finite-size effect, because the range of in this region is very small and is supposedly unreachable by systems. Hence, these behaviours are quite consistent with the analytical prediction that the system's dynamics are affected by the RSB transitions and cease to follow the equilibrium state.
The effect of the RSB on the reconstruction performance becomes much clearer by examining the achievable-limit values of the MSEs for a moderate noise case. Figure 14 shows the plots against ρ of the limit values obtained at very low temperatures by the rapid SA with for . The values at intermediate ρ are clearly dominated by the ones at the AT transition points rather than the ones at the EC points, implying that the system's search is trapped by local minima emerging at the transition points. An interesting outcome of this phenomenon is a better reconstruction of the planted signal . As seen from figure 14, the values at the AT points are lower than the ones at the RS EC points, implying that the reconstruction performance of the local minima induced by the RSB is better than that of the minimum- configuration obtained by exactly solving equation (2). This means that the generalisation capability of the rapid SA is no worse than that of exactly solving equation (2) in this case, because the input MSE is proportional to the generalisation error when the components of the design matrix and the noise are i.i.d. from the zero-mean Gaussian. This encourages the use of the presented formulation and algorithm for practical purposes, as reported in [27].
5. Conclusion
In this study, we have analytically derived an algorithmic limit of an -based formulation of compressed sensing through evaluation of the entropy using statistical mechanical techniques. The results are mainly characterised by the ratio of the number of observations α, the nonzero-components densities of the inference and generative models ρ and , and the strengths of the signal and noise and . The entropy curves and the associated phase diagrams have been provided, and their implications for local search algorithms have been discussed. Quantitative analysis of the noisy cases has also been conducted. To validate the analytical computation, we have performed a careful MC simulation using the exchange MC method and the multi-histogram method, the results of which have exhibited fairly good agreement with the analytical results. To test the theoretical predictions on local search algorithms, we have also performed numerical experiments using the SA-based algorithm. The performance of the SA is well understood through the phase diagrams. Over a wide parameter region in the noiseless case, we have actually located the PR solution with a reasonable computational cost of O(N^3), although in the other, hard regions, the RSB and the RS first-order transitions prevent rapid convergence of the SA to the PR solution.
To overcome the problems caused by phase transitions, it may be interesting to tailor new algorithms based on MC methods. The idea of extended ensembles can be useful: a promising direction for future work is to devise an algorithm that relaxes the hard constraint on the nonzero-components density by introducing a 'chemical potential' that softly controls the density. The idea of multi-canonical sampling, such as the Wang–Landau algorithm [66], may also be an interesting direction. By extending the work in [51], it is also worth trying to take the glassy nature of the problem into account in algorithm design.
Relaxing the i.i.d. random-matrix assumption on A is also an interesting problem. This is of practical relevance because, in an ideal compressed-sensing setup, each new observation should be designed to maximise the information it provides, implying that all rows of A should be orthogonal. Considering such orthogonal ensembles within the presented framework is a high-priority issue which should be addressed in the near future.
Acknowledgments
This work was supported by JSPS KAKENHI Grant Numbers 26870185 and 18K11463 (TO), 17K12749 (YN-O), 25120009 (MO), and 25120013 and 17H00764 (YK). TO is also supported by a Grant for Basic Science Research Projects from the Sumitomo Foundation. YN-O is also supported by JST PRESTO Grant Number JPMJPR1773.
Appendix. Calculations of g(μ)
Assuming n and ν are positive integers, we can rewrite
These summations over and are now calculated over all the replicated variables and . Let us set
The variable d is an extensive sum of random variables and can be expressed by an appropriate sum of Gaussian variables with a certain covariance. The covariance is expressed by
Evaluating this full description is difficult in general. Instead, we consider more amenable subspaces which are described by the RS or RSB ansatz.
A.1. RS computation
In the RS ansatz, the dominant contribution is assumed to come from the following subspace:
and hence the covariance matrix is described by four order parameters and one external parameter . The corresponding useful description of is
where and z are i.i.d. from . The average with respect to A thus can be replaced by the average over and z, yielding
where is the subshell (the state density) characterised by the above four order parameters such that
and describes
where we merge two Gaussian variables into z and
Direct integrations yield
where we keep only the linear term with respect to n at the first line and take the limit while keeping at the last line and applying according to equation (24). Summarising equations (A.12) and (A.14), we obtain
Evaluation of requires additional algebra. We break the delta functions by using the Fourier transform as follows
Then,
where
The replica indices have been superscripts so far but we rewrite them as subscripts for visibility. Using the Gaussian integrals, we break the sum into a single replica sum as
Hence we can take for each a independently
where
The sums and can also be broken
Hence,
Summarising the result yields
and the law of large numbers implies
where
and X0 is obtained by inserting in . This relation means that the average over in equation (A.17) is actually not needed, which comes from the absence of correlations among components of the design matrix A. This validates the use of the factorised prior (14).
To take the limit , we rescale the tilde variables as follows
Applying this rescaling and keeping only the linear term with respect to n, we get
Combining equations (A.15) and (A.28) and using the saddle-point method yield equation (27).
A.2. 1RSB computation
At the 1RSB level, the overlap between ca and cb with different should be treated in a more involved manner. In the standard 1RSB construction, the n replicas are separated into blocks of size τ. We may label the blocks by , and the replica index a is represented by two indices as where Ia denotes the component index inside the block. For simplicity, we identify the index set of those components with the block label Ba, allowing us to represent them as . The overlap is assumed to take two distinct values depending on whether the replica indices are in the same block or not. This can be written as
Correspondingly, can be expressed as
where and z are i.i.d. from the normal distribution. Then, equation (A.1) is rewritten as
where
and
where
can be computed by repeated Gaussian integrations as in the RS case, and the result is
The computation of can also be performed in parallel with the RS case. The delta functions of common variables with the RS case are broken in the same manner as equation (A.16), and the ones of q1 and q0 are broken similarly to equation (A.16d). These transforms yield
where
To derive a factorised form with respect to the replica index, we again use the trick of Gaussian integration, as in the RS case. For example, the last two terms in equation (A.38) are factorised as
where we insert . Repeating similar computations, to leading order in n we finally get
To derive this, when taking the limit , we have rescaled and as
Other tilde variables are rescaled in the same manner as equation (A.27).
Inserting equations (A.35) and (A.40) into equation (A.31) and using the saddle-point method, we obtain equation (34). The EOS are obtained by taking the extremisation condition and the result is
where
Footnotes
- 5
There is an analogy between the one-step RSB formulation and the present problem: correspond to quenched variables; and are dynamical variables, but determines a pure state and is an active dynamical variable inside the pure state; the replica number ν corresponds to Parisi's breaking parameter.
- 6
Insightful readers may doubt the probabilistic interpretation of because it is not possible to normalise due to the presence of the flat prior ϕ. This inconvenience can be resolved by replacing the flat prior with and taking the limit , although this causes another problem in model selection according to the marginal likelihood in the usual Bayesian framework. However, no model selection using this criterion is performed in the following discussions, and the entire treatment in the main text is free of these problems.