Abstract
We investigate the signal reconstruction performance of sparse linear regression in the presence of noise when piecewise continuous nonconvex penalties are used. Among such penalties, we focus on the smoothly clipped absolute deviation (SCAD) penalty. The contributions of this study are three-fold: we first present a theoretical analysis of the typical reconstruction performance, using the replica method, under the assumption that each component of the design matrix is an independent and identically distributed (i.i.d.) Gaussian variable. This clarifies the superiority of the SCAD estimator over the LASSO one in a wide parameter range, although the nonconvex nature of the penalty tends to lead to solution multiplicity in certain regions. This multiplicity is shown to be connected to replica symmetry breaking in spin-glass theory, and the associated phase diagrams are given. We also show that the global minimum of the mean square error between the estimator and the true signal is located in the replica symmetric phase. Second, we develop an approximate formula that efficiently computes the cross-validation error without actually conducting cross-validation; it is also applicable to non-i.i.d. design matrices. This formula is shown to be applicable only in the unique-solution region and tends to be unstable in the multiple-solution region. We implement instability detection procedures, which allow the approximate formula to stand alone and thereby enable us to draw phase diagrams for any specific dataset. Third, we propose an annealing procedure, called nonconvexity annealing, to obtain the solution path efficiently. Numerical simulations on simulated datasets verify the consistency of the theoretical results and demonstrate the efficiency of the approximate formula and of nonconvexity annealing. The characteristic behaviour of the annealed solution in the multiple-solution region is addressed.
Another numerical experiment on a real-world dataset of Type Ia supernovae is conducted; its results are consistent with those of earlier studies using the same formulation. A MATLAB package of numerical codes implementing the estimation of the solution path using nonconvexity annealing in conjunction with the approximate CV formula and the instability detection routine is distributed in Obuchi (2019 https://github.com/T-Obuchi/SLRpackage_AcceleratedCV_matlab).
Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
Variable selection problems appear ubiquitously in statistics and machine learning tasks. Although traditional statistical approaches to variable selection work well in principle [2], difficulties in computational efficiency and stability emerge owing to the large size and high dimensionality of modern datasets. To overcome this, the possibility of sparse estimation has been pursued for decades. Naive methods for sparse estimation, however, require solving discrete optimisation problems, which entails serious computational difficulty even in the simplest case of linear models [3]. Hence, certain relaxations or approximations are required for handling such large high-dimensional datasets.
A breakthrough was made with the least absolute shrinkage and selection operator (LASSO) [4]. Its basic idea is to relax the sparsity constraint by using regularisation. The success of LASSO motivated the usage of regularisation in many different contexts and models [5–7], leading to an ongoing innovation in signal and information processing [8–10].
Although LASSO has many attractive properties, the shrinkage introduced by the regularisation results in a significant bias in the regression coefficients. To reduce this bias, nonconvex penalties such as the smoothly clipped absolute deviation (SCAD) penalty [11] and the minimax concave penalty (MCP) [12] have recently been proposed. Although the estimators under these regularisations have desirable properties such as unbiasedness and continuity [11], there are concerns about their stability and interpretability because of potential local minimums arising from the lack of convexity. Hence, investigations of the typical performance of these estimators are desired.
Under this situation, one of the present authors recently analysed the SCAD estimator performance in [13], using the replica method and the message passing technique [14–16]. It was shown that there exist two regions in the space of regularisation parameters, in one of which the estimator is uniquely and stably obtained. It was also shown that the estimator outperforms LASSO in fit quality, and that the emergence of the two regions can be viewed as a phase transition involving replica symmetry breaking (RSB). The latter finding yields a nontrivial insight into the behaviour of local search algorithms: it was demonstrated that the convergence limit of the coordinate descent (CD) algorithm is closely related to the RSB transition and that a sufficient condition for convergence, derived in [17], is not tight.
The above-mentioned analysis was limited to the data compression context, in which only the fit quality to a given dataset was considered important. However, in certain applications of sparse estimation, such as compressed sensing [18, 19], the reconstruction performance of the true signal embedded in the data generation process is more important. The current study addresses this problem and conducts a quantitative analysis for the case where noise exists, while the noiseless limit is investigated in a separate study [20]. We provide phase diagrams with respect to regularisation parameters derived using the replica method and discuss their implications for the reconstruction performance and the behaviour of local search algorithms. Moreover, we develop an approximate formula for efficiently computing the cross-validation (CV) error, which can be identified with the reconstruction error in our setting. The key results are summarised as follows:
- (i)In the replica symmetric (RS) phase, a unique solution is stably obtained also in the signal reconstruction context.
- (ii)The global minimum of the CV error is (presumably always) obtained in the RS phase.
- (iii)Our approximate formula efficiently estimates the CV error, without actually conducting CV.
These imply that we need not care about the RSB phase as long as our purpose is to obtain the model that best reconstructs the true signal, and that in the RS phase we can benefit from the proposed approximate CV formula, which enables an efficient estimation of the reconstruction error. Below, we show the theoretical results supporting these messages.
The remainder of the paper is organised as follows: in the next section, our problem setting and an overview of the SCAD penalty are given; in section 3, the replica analysis result is shown without the derivation, because the essential part is already given in [13, 20], and phase diagrams and plots of relevant quantities are presented; in section 4, the approximate formula for the CV error is derived; in section 5, numerical experiments are carried out on both simulated and real-world datasets to check the accuracy of the replica result and the approximate formula. The last section concludes the paper.
2. Problem settings
Suppose a data vector is generated by the following linear process with a design matrix and a signal vector :
where is a noise vector, the component of which is assumed to be an independent and identically distributed (i.i.d.) variable from the normal distribution with zero mean and variance , . We denote our dataset as . In the context of compressed sensing, the design matrix represents the measurement process, and we try to infer given and . The inference is herein formulated as a regularised linear regression, and the concrete form of our estimator is given by:
where is the regularisation inducing the estimator sparsity, and is a set of regularisation parameters with a concrete form shown below. To quantify the fit quality of the estimator to the data , we introduce:
and call it the output mean squared error (MSE). We also introduce a MSE between the estimator and the true signal as:
which is termed input MSE, and these characterise the goodness of fit of our estimator .
The purpose of this study is to compute the typical behaviour of and to obtain insights into the estimator behaviour; meanwhile, some other relevant quantities are also evaluated. The analytical techniques for achieving this purpose are explained in section 3, with a more detailed description on and .
2.1. SCAD regularisation
As a representative piecewise continuous nonconvex penalty, we investigate SCAD regularisation in this study. The parameter set consists of , and the functional form is:
An illustration of this form is given in the left panel of figure 1. In the limit , the SCAD regularisation tends to the regularisation , and correspondingly, the SCAD estimator converges to the LASSO one, allowing a continuous comparison between the two. For later convenience, we term a the switching parameter and λ the amplitude parameter.
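For reference, the standard Fan–Li form of the SCAD penalty, with amplitude parameter λ (`lam`) and switching parameter a, can be sketched in Python as follows (a minimal illustration only; the package distributed with this paper is in MATLAB):

```python
import numpy as np

def scad_penalty(x, lam, a):
    """SCAD penalty J(x; lam, a), elementwise (standard Fan-Li form)."""
    absx = np.abs(x)
    return np.where(
        absx <= lam,
        lam * absx,                                          # linear (LASSO-like) part
        np.where(
            absx <= a * lam,
            (2 * a * lam * absx - absx**2 - lam**2) / (2 * (a - 1)),  # quadratic blend
            lam**2 * (a + 1) / 2,                            # constant part: no shrinkage
        ),
    )
```

The three branches match continuously at |x| = λ and |x| = aλ, which is the piecewise continuity emphasised in the text.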
To obtain an intuitive view for the SCAD estimator behaviour, we compute the one-dimensional case:
The solution is given by:
where
The middle panel of figure 1 presents an illustration of the estimator at , which behaves as the LASSO estimator when , and as the ordinary least squares (OLS) estimator when . In the intermediate region , the estimator interpolates linearly between the LASSO and OLS estimators.
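The three regimes just described can be written as a scalar thresholding rule. A Python sketch, assuming the one-dimensional cost (1/2)(x − h)² + J(x) with the Fan–Li penalty and a > 2:

```python
import numpy as np

def scad_prox(h, lam, a):
    """Minimiser of (1/2)(x - h)^2 + J(x; lam, a); requires a > 2."""
    absh = np.abs(h)
    soft = np.sign(h) * np.maximum(absh - lam, 0.0)        # LASSO (soft-threshold) region
    mid = ((a - 1) * h - np.sign(h) * a * lam) / (a - 2)   # linear transition region
    return np.where(absh <= 2 * lam, soft,
                    np.where(absh <= a * lam, mid, h))     # OLS region: identity
```

The pieces match continuously at |h| = 2λ and |h| = aλ, and the denominator a − 2 foreshadows the special role of the value a = 2 discussed in section 3.3.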
The one-dimensional estimator plays a key role in our analysis, because the original high-dimensional problem eventually reduces to an effective one-dimensional problem in the limit , termed the decoupling principle in [21].
3. Macroscopic analysis
In this section, we provide order parameters and their determining equations of state (EOS). Associated phase diagrams are shown, and their implications on the performance and the computational stability of the SCAD estimator are also discussed.
For proceeding with the analysis, the ensemble of and is required to be fixed. We assume that is a random matrix whose component is i.i.d. from . The true signal is also assumed to be a random number drawn from the independent Bernoulli–Gaussian distribution:
We note that the i.i.d. assumption on is crucial for completing the computation, while the choice of the distribution of does not matter for the analytical tractability. We admit that this i.i.d. assumption on is not necessarily realistic, but it provides a sufficiently nontrivial setup for our purpose. Although it is possible to extend the present analysis to certain other ensembles [22–29], we leave this as a future study.
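For concreteness, the data-generation process just described can be sketched as follows (Python; the N(0, 1/N) scaling of the design-matrix entries and the parameter names are illustrative assumptions consistent with the conventions used later):

```python
import numpy as np

def generate_dataset(N, alpha, rho, sigma_x2, sigma_noise2, rng=None):
    """Simulated data y = A x0 + noise.

    Assumed conventions: A is M x N with i.i.d. N(0, 1/N) entries,
    x0 is Bernoulli-Gaussian with density rho and nonzero variance
    sigma_x2, the noise is i.i.d. N(0, sigma_noise2), and M = alpha * N.
    """
    rng = np.random.default_rng(rng)
    M = int(alpha * N)
    A = rng.normal(0.0, 1.0 / np.sqrt(N), size=(M, N))
    support = rng.random(N) < rho                 # Bernoulli part
    x0 = np.where(support, rng.normal(0.0, np.sqrt(sigma_x2), N), 0.0)
    noise = rng.normal(0.0, np.sqrt(sigma_noise2), M)
    return A, x0, A @ x0 + noise
```

Choosing `sigma_x2 = 1/rho` sets the signal power per component to unity on average, as in section 3.3.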
In the following discussion, we consider the so-called thermodynamic limit , while keeping .
3.1. Outline of analysis
In order to avoid duplication with [13, 20], an outline of the analysis is presented here, instead of the EOS derivation.
Our analysis starts from defining Hamiltonian , partition function Z, and free energy density f as follows:
As seen from (2), the minimiser or the ground state of corresponds to our estimator and, hence, we are interested in the limit of the free energy density. The input and output MSEs can be computed from the free energy in this limit, by following a standard prescription. The free energy density becomes the primary object to be computed, and enjoys the self-averaging property, and the typical value thus converges to the averaged one , where denotes the average over and .
Unfortunately, the average density is not analytically tractable. To overcome this, we employ the following identity:
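The identity in question is presumably the standard replica identity,

```latex
[\log Z] \;=\; \lim_{n \to 0} \frac{\log\,[Z^n]}{n} \;=\; \lim_{n \to 0} \frac{[Z^n]-1}{n},
```

which trades the intractable average of the logarithm for the moments \([Z^n]\), computable for integer n.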
However, the computation for general is still intractable; thus, we additionally assume that n is a positive integer, because is analytically computable for . Then, using an analytically continuable expression of from to , we evaluate at the final step. These procedures are termed the replica method.
The final expression of is given as an extremisation problem with respect to a number of parameters, called order parameters. The extremisation condition appears because of the limit , and yields EOS determining the values of the order parameters. The explicit formulas are given below. It should be noted that the following analysis is conducted only under the RS assumption, although RSB occurs in some parameter regions. This is because the RS analysis is sufficient for the present purpose of obtaining insights on the stability of the SCAD estimator. Beyond this purpose, the RSB analysis will provide further quantitative information about the estimator when many local minimums exist, which will be an interesting future direction.
3.2. Order parameters, equations of state, and stability condition
Here, we summarise the order parameters and EOS. At the RS level, our system is characterised by the following three order parameters:
where the angular brackets, , denote the average over the Boltzmann distribution . m is the overlap with the true signal and is relevant to the reconstruction performance. Q and q both describe the powers (per element) of the estimator, but the latter takes into account the 'thermal' fluctuation that results from the introduction of . These two quantities coincide in the limit , but their infinitesimal difference yields an important contribution:
This is , even in the limit . Besides, we introduce the conjugate parameters of as , respectively, and denote their sets as and . The RS free-energy density in the limit takes the following extremisation problem, with respect to and :
where:
and represents the average over , whose distribution is:
The minimiser of (20) is the solution of the one-dimensional problem (7), with and , thus we can denote it as:
The extremisation condition in (19) yields EOS as:
where
The SCAD regularisation divides the domain of definition into several analytic pieces, and are the corresponding boundary values of z for the integration . The parameter is the density of non-zero components in the estimate.
Using the solution of EOS, the input and output MSEs can be expressed as:
Furthermore, we quantify the reconstruction performance of the support of the true signal. Denoting the support or active set of as , we introduce the true positive rate and the false positive rate , where Sc denotes the complement set of S. These are expressed by using the solution of EOS as:
where |x|0 denotes the operator giving 0 if x = 0 and 1 otherwise. Following the standard analysis [13], we can derive the stability condition of the RS solution, called the de Almeida–Thouless (AT) condition [30]. The derivation for our specific case is already given in [13], and we just quote the resultant expression:
Apart from the AT condition, we also notice that the RS solution does not exist when the switching nonconvexity parameter a is small. This is because (20) tends to have no solution in the small a limit, leading to the following existence condition:
These provide sufficient information for the following analyses.
3.3. Phase diagram
In this subsection, we show the phase diagrams in the –a plane for a wide range of parameters. We introduce three boundaries: the first, derived from (33), is the AT line , below which the RS solution is unstable; the second, derived from (34), is the existence limit of the RS solution , below which the RS solution does not exist; and the third, , represents the minimum point of the input MSE when sweeping given a. For clarity, the variance of the non-zero components of is fixed as , setting the signal power per component to unity on average, .
First, we compare the phase diagrams for different noise strengths at and in figure 2. We plot , and by blue, red, and green lines, respectively. The green diamond represents the location of the global minimum of under the RS assumption. A useful finding is that this location lies above the AT line; this has been the case in every instance we have examined, and additional evidence is given later in figures 3 and 4. In the right panel of figure 2, the green diamond is not shown, because the input MSE continuously decreases as a grows, implying that the global minimum of is obtained in the LASSO limit . These observations imply that the best reconstruction performance of the true signal is always obtained in the RS phase, which is one of the main claims of this study. Admittedly, there remains a possibility that the true global minimum exists in the RSB phase and that the green diamond represents only a local minimum; our present analysis does not exclude this. To clarify this point, further quantitative analysis in the RSB framework is required, but this is beyond the scope of this study.
Another interesting observation in figure 2 is the re-entrant phase transition concerning in relatively small a regions for the weak noise cases (left and middle panels). For example, at a = 2.8 in the left panel, when decreasing from a large enough value, we first cross the rightmost branch of around and enter the RSB phase from the RS phase; decreasing further, we meet the middle branch of around and thus re-enter the RS phase; decreasing still further, we hit the leftmost branch of around and end up in the RSB phase. Although the physical reason for the emergence of the re-entrance is unclear, it seems to exist only in the weak noise region. We also note that the AT line, , is always located above . The vanishing of the solution in the low region is thus an artefact of the RS assumption, and the corresponding parameter regions should be described by the RSB solution.
Next, we check the dependence of the phase structures. Phase diagrams at and a moderate noise level are shown in figure 3. Although the basic structure does not change from figure 2, the re-entrant transitions of the weak noise cases disappear. As increases, the minimum location of moves to larger a, and for large (right panel) the green diamond tends to disappear at finite values of a, implying that the LASSO limit yields the minimum of , as in the strong noise case.
The last phase diagrams are given for checking the dependence. Phase diagrams for at and are shown in figure 4. As seen in the left panel, if the value of is close to that of , the larger a tends to give smaller values of the input MSE, as in the right panel of figure 3. In contrast to the other diagrams of the underdetermined case , the right panel of the case shows a particular behaviour, as both the AT line and the RS existence limit tend to converge to certain finite values of a in the limit. Hence, the whole region becomes RS at sufficiently large but finite a values.
In all the phase diagrams shown above, the minimum value of a is fixed to be 2. This is because the RS solution cannot describe the region a < 2. There is a simple reason for this. According to the argument of the approximate message passing technique [13, 20], the effective one-dimensional problem (20) corresponds to the following marginal distribution:
where the minimiser of (20) corresponds to the location of xi, at which the measure concentrates in the limit . The factor corresponds to , while is a non-negative quantity related to in the RS solution. This means that, under the assumption of , is bounded as:
Combining this with (34), we find that the condition is necessary for the existence of the RS solution. The merging behaviour of the three lines onto the a = 2 line as grows in the phase diagrams matches this condition well. Although a = 2 is a known critical value [11, 31], the above argument provides another perspective on it. We also note that this non-existence of the RS solution does not mean the non-existence of SCAD estimators. Indeed, numerical experiments easily show that the estimators take nontrivial values in the region a < 2, where they tend to show strong multiplicity and dependence on the initial condition. To analyse the behaviour of those estimators, we would need the RSB solution, but this is beyond the present purpose, as stated above.
3.4. Receiver operating characteristic curve
To characterise the reconstruction performance of the true signal's support, we employ the so-called receiver operating characteristic (ROC) curve. The ROC curve is a plot of TP (31) against FP (32). The best ROC curve goes through the point . Accordingly, to quantify 'optimality' of the points on a ROC curve, we use the following quantity:
Thus, the smallest value of R defines the 'optimal' point of the ROC curve. This easy-to-use quantity is commonly applied as a criterion, and we follow that convention here.
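Assuming R is the Euclidean distance from a point (FP, TP) on the ROC curve to the ideal corner (0, 1), the quantities involved can be sketched as:

```python
import numpy as np

def roc_point(x_hat, x_true):
    """Empirical (FP, TP) of the estimated support against the true one."""
    est, true = x_hat != 0, x_true != 0
    tp = np.count_nonzero(est & true) / max(np.count_nonzero(true), 1)
    fp = np.count_nonzero(est & ~true) / max(np.count_nonzero(~true), 1)
    return fp, tp

def optimality_R(fp, tp):
    """Distance from (FP, TP) to the ideal point (0, 1); smaller is better."""
    return np.sqrt(fp ** 2 + (1.0 - tp) ** 2)
```

A perfect support recovery gives (FP, TP) = (0, 1) and hence R = 0.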
First, ROC curves obtained when sweeping at given values of a are plotted in the left panel of figure 5. The other parameters are . The curves are not monotonic and tend to change sharply in the small region, but the locations of the minimums of R, depicted by filled magenta circles, tend to lie in the monotonic region. To compare the values of the R minimums, we plot them against a in the right panel. The global minimum is located at , which matches the minimum location of depicted by the green diamond in the middle panel of figure 3. This suggests that minimising also approximately minimises the variable selection error.
To scrutinise this possibility, we show ROC curves obtained when adaptively changing the nonconvexity parameters along the line in the –a phase diagrams: the upper panels of figure 6 are the ROC curves for , and at . The corresponding plots of R and against a are also shown in the lower panels. These figures show that the minimums of are indeed close to those of R. As far as we have searched, a similar tendency holds for other parameter values. These results fully support the above-mentioned possibility. Such a nice property is absent in LASSO [32], and the minimum point of in LASSO tends to give a solution with rather large FP. Hence, in the reconstruction performance of the true model, the SCAD estimator is superior to the LASSO one.
Readers may doubt the practical relevance of this statement, because the input MSE cannot be computed in realistic settings with unknown true signals. As explained later, however, the input MSE has a simple linear relation to the generalisation error estimated by CV when the rows of the design matrix are uncorrelated with each other. Hence, we may minimise the CV error instead of the input MSE.
Figures 2–4 show that in some parameter regions there seems to be no global minimum of the input MSE at finite a, as for the strong noise case of in figure 2 and the dense signal case in figure 3. To examine those cases, we plot relevant quantities when changing a, again along the line , in figure 7. The left panels show the plots of TP and FP against a, the middle panels display the plots of R and against a, and the right panels give the associated ROC curves. The quantities , and all behave monotonically with respect to a and seem to converge to finite values in the LASSO limit . The minimums of R and would thus be attained by LASSO with an optimised . These observations imply that LASSO is sufficient for difficult cases with strong noise or dense signals. They also imply that it is difficult to determine a good value of a prior to the actual analysis, because the best choice strongly depends on the noise strength and the signal density.
4. Approximate formula for cross-validation
In this section, we derive an approximate formula for the leave-one-out (LOO) CV error. If the dataset size M is large enough, the difference between the estimators of the full and LOO datasets is considered to be small, and it is expected that those two estimators can be connected in a perturbative manner. We concretise this idea below.
The estimator obtained without the th data point in (2) is hereafter termed the th LOO estimator, and its explicit formula is:
The LOO CV error (LOOE) is accordingly defined as:
where is the th row vector of A. The LOOE is an estimator for the generalisation error or extra-sample error, defined as:
where represents a new data sample, and denotes its distribution. In our setting, the distribution corresponds to the i.i.d. process described around (1), and it is analytically shown that has a direct connection to as:
Hence, we can estimate the input MSE from the LOOE. Note that the sufficient condition for (41) is that both the noise components and the rows of the design matrix are zero-mean and uncorrelated; correlations within the signal vector are allowed, as they do not affect (41).
Owing to sparse priors, the variables in the estimator separate into two types: some are set to zero and the others take non-zero values. We call the former inactive variables and the latter active variables. The index set of the inactive variables, or inactive set, is denoted by , while that of the active variables, or active set, is . The active (inactive) components of a vector are formally expressed as . For any matrix X, we use double subscripts in the same manner and introduce the symbol meaning all the components in the respective dimension. For example, for an matrix X, and denote the sub-matrices of X whose row components are restricted to SA and unrestricted, respectively, while their column components are commonly restricted to SI.
4.1. Derivation
The basic assumption used to derive the approximate formula is that the active set is 'common' between the full and LOO estimators. Although this assumption does not hold exactly, we numerically confirmed that it holds approximately in the RS region; that is, the change of the active set under the LOO operation is small compared to the size of the active set itself for large N and M. Moreover, in the LASSO case, it has been shown that the contribution of the active set change vanishes in the limit , keeping [32]. It is expected that the same holds for the present problem. Hence, we adopt this assumption in the following derivation. We also note that this assumption is not applicable to the RSB region.
Once assuming the active set is known and common between the full and LOO systems, we can easily get the determining equations for the coefficients in the active set, by differentiating the cost function with respect to , yielding:
for the full system and:
for the LOO system.
Let us denote and expand (43) with respect to up to first order. Eliminating some terms using (42) and solving the remaining expression with respect to , we obtain an equation for as:
Using (44), we can connect the residuals of the LOO and full systems as:
Using the relation and the Woodbury matrix inversion formula, we finally get
where
The right-hand side of (46) can be computed from the full solution alone, enabling an approximate evaluation of the LOOE without literally conducting CV. The error bar can be taken as the standard deviation among all the terms in (46) divided by the square root of their number. This is convincing because each term of (46) gives an independent estimate of the generalisation error (40), and hence the error bar can be given as the standard error. The numerical experiments below show that this definition gives a reasonable error bar.
In the case of LASSO, the Hessian of the penalty term is identically zero, , meaning that (46) reduces to the 'approximation 1' in [32]. For SCAD, the Hessian takes the following form:
where denotes the indicator function, giving 1 if the statement is true and 0 otherwise.
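Putting (46) and the SCAD Hessian together, the approximate LOOE can be sketched as below (Python; the cost normalisation (1/2)||y − Ax||² + Σ J(xᵢ) and the overall 1/2 factor in the LOOE are assumptions of this sketch, and `scad_hessian_diag` encodes the piecewise form just displayed):

```python
import numpy as np

def scad_hessian_diag(x, lam, a):
    """Second derivative of the SCAD penalty, elementwise:
    -1/(a-1) on the middle branch lam < |x| <= a*lam, zero elsewhere."""
    absx = np.abs(x)
    return np.where((absx > lam) & (absx <= a * lam), -1.0 / (a - 1.0), 0.0)

def approx_looe(y, A, x_hat, lam, a):
    """Approximate LOO CV error computed from the full solution alone.

    The LOO residuals are obtained by inflating the full residuals by
    1/(1 - C_mu), where C_mu is built from the active-set Hessian."""
    S = x_hat != 0                                   # active set
    AS = A[:, S]
    H = AS.T @ AS + np.diag(scad_hessian_diag(x_hat[S], lam, a))
    C = np.einsum('mi,ij,mj->m', AS, np.linalg.inv(H), AS)
    r = y - A @ x_hat                                # full residuals
    return 0.5 * np.mean((r / (1.0 - C)) ** 2)
```

In the limit where the penalty Hessian vanishes and all variables are active, this reduces to the exact leave-one-out formula for ordinary least squares.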
5. Numerical experiments and numerical codes
Here we present numerical experiments. To obtain the SCAD estimator, we use the CD algorithm, because it is common and stable. We implement it in the C language, while the approximate CV formula is implemented as a plain MATLAB® code. Hence, it is not necessarily fair to compare the computational times of the literal and approximate CVs, computed by (39) and (46) respectively, in the experiments below; however, even under this caveat there is a meaningful difference in computational time. When reporting computational times, we fix our experimental environment, which uses a single CPU of a 3.3 GHz Intel Core i7.
In a single step of the CD algorithm, we update all the components of in a random order. To judge the convergence of the CD algorithm, we monitor the difference between the estimate at the step t and the previous one . If all the component-wise differences are smaller than a threshold value , then the algorithm stops; otherwise it continues. We set the threshold value as in all the experiments below.
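A minimal Python sketch of this CD procedure (the C implementation used in the paper is not shown here; unit-norm columns of A are assumed so that the one-dimensional SCAD update applies directly):

```python
import numpy as np

def scad_prox(h, lam, a):
    """Scalar SCAD thresholding (one-dimensional solution, a > 2)."""
    ah = abs(h)
    if ah <= 2 * lam:
        return np.sign(h) * max(ah - lam, 0.0)
    if ah <= a * lam:
        return ((a - 1) * h - np.sign(h) * a * lam) / (a - 2)
    return h

def cd_scad(y, A, lam, a, x0=None, eps=1e-8, max_iter=1000, rng=None):
    """Coordinate descent for (1/2)||y - A x||^2 + sum_i J(x_i; lam, a).

    One step sweeps all coordinates in random order; the algorithm stops
    when every componentwise change is below the threshold eps."""
    rng = np.random.default_rng(rng)
    x = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    r = y - A @ x
    for _ in range(max_iter):
        x_old = x.copy()
        for i in rng.permutation(A.shape[1]):
            h = A[:, i] @ r + x[i]          # local field (unit-norm column)
            new = scad_prox(h, lam, a)
            r += A[:, i] * (x[i] - new)     # incremental residual update
            x[i] = new
        if np.max(np.abs(x - x_old)) < eps:
            break
    return x
```

Each coordinate update minimises the cost along that coordinate, so the cost is non-increasing over sweeps.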
5.1. Simulated dataset
In this subsection, we conduct experiments using simulated datasets. The main purpose is to confirm analytical predictions and to examine the accuracy of the approximate formula. Our simulated datasets are generated by the process described around (1) and (10). The signal power is set to be , matching with section 3.3.
5.1.1. Consistency check of the replica solution.
To check the accuracy of the replica result and to examine the finite-size effect, we first plot the input and output MSEs against , given a for different sizes. The plots for and are given in figure 8 as example cases. The black thick curves and the colour markers denote the analytical and numerical results, respectively. For the numerical data, the average over different samples of the set is taken; the sample numbers are for , respectively; the error bars are given as the standard error among the samples. The vertical dashed line represents the AT instability point below which the RS solution is unstable. Focusing only on the RS region, we can find that the finite-size effect is quite weak, and the numerical results show an excellent agreement with the analytical ones, justifying our analytical solutions. In this experiment, even below the AT point, the numerical results show a strong regularity. This is because the solutions are obtained by gradually changing from large to small values, and hence these solutions below the AT point are, in some sense, continuously connected to the ones above the AT point. We term this scheme annealing, which can be considered as a part of the nonconvexity control proposed in [20]. We warn that the solution path obtained by the annealing is very atypical below the AT point, as implied in section 5.1.2.
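The annealing scheme just described can be sketched generically in Python (here `solver` stands for any routine returning the SCAD estimator from a given initial condition, e.g. coordinate descent; the interface is an assumption for illustration):

```python
import numpy as np

def annealed_path(y, A, lams, a, solver):
    """Solution path by annealing: sweep the regularisation amplitude from
    large to small, warm-starting each solve from the previous solution.

    `solver(y, A, lam, a, x0)` is any routine returning the SCAD estimator
    from the initial condition x0 (e.g. coordinate descent)."""
    x = np.zeros(A.shape[1])
    path = []
    for lam in sorted(lams, reverse=True):   # large -> small
        x = solver(y, A, lam, a, x)
        path.append((lam, x.copy()))
    return path
```

The same template applies to annealing the nonconvexity parameter a, as in the nonconvexity control of [20].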
As another check of the RS solution's consistency, we also draw ROC curves from numerical experiments. In figure 9, we give the ROC curves along the line for and , which correspond to the upper middle panel of figure 6 and the lower middle panel of figure 7, respectively. The numerical result is displayed as scatter plots (orange cross points) of TP and FP for experiments on 10 different samples at N = 1000. The numerical plots show fairly good agreement with the analytical curve, which again justifies our analytical solutions.
5.1.2. Accuracy of the approximate CV formula.
To check the accuracy of the approximate CV formula, in figure 10 we compare the CV errors between the literal (by (39)) and approximate (by (46)) CVs for two specific samples of system size N = 100. The other parameters are and , which correspond to figure 8. Here, all the results are obtained using the annealing and show regular behaviours, even below the AT point. In all the cases, the approximate result reproduces the literal one well up to the AT point, including the error bars given in the way explained in section 4.1. The uncontrolled behaviour of the approximate formula below the AT point is owing to the singular behaviour of the factor in (44). This is natural because this factor is nothing but the susceptibility, which is known to involve diverging modes when the AT instability occurs [14]. These considerations mean that our approximate formula is applicable only above the AT point, i.e. in the RS phase.
Can we detect the instability point from the approximate CV result alone, without referring to the replica computation? Figure 10 suggests that we can, because the approximate CV error tends to show uncontrolled behaviour at and below the AT point. As a trial, assuming combined use with the λ-annealing, we examine the following procedures to detect the uncontrolled behaviour:
- (i) Detect 'irregular' datapoints by locally comparing each datapoint with neighbouring points along the path (here, datapoints mean the approximate CV results, red circles in figure 10).
- (ii) Find the maximum value of λ whose corresponding datapoint is irregular, and regard the entire region below it as the 'instability region'.
To obtain a concrete result, we need to implement step (i) as an algorithm. The actual implementation is:
- (i)-1 If the CV error difference between the irregular-point candidate and the compared datapoint is sufficiently large relative to the error bar of the compared point, then the candidate is regarded as 'irregular'.
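The two-step procedure above can be sketched as follows; the threshold of 'sufficiently large' (here, a multiple n_sigma of the error bar) is an assumption of ours, not the paper's exact choice:

```python
import numpy as np

def detect_instability(lambdas, cv_err, cv_bar, n_sigma=3.0):
    """Instability detection along a lambda path.
    (i)  Mark a datapoint 'irregular' when its CV error differs from the
         preceding (larger-lambda) point's by more than n_sigma times
         that point's error bar.
    (ii) Return the largest irregular lambda; everything below it is the
         'instability region'.  Returns None if the whole path is stable."""
    lambdas = np.asarray(lambdas)
    cv_err = np.asarray(cv_err)
    cv_bar = np.asarray(cv_bar)
    irregular = np.zeros(len(lambdas), dtype=bool)
    order = np.argsort(lambdas)[::-1]       # descending, as in the annealing
    for a, b in zip(order[:-1], order[1:]):
        if abs(cv_err[b] - cv_err[a]) > n_sigma * cv_bar[a]:
            irregular[b] = True
    if not irregular.any():
        return None
    return lambdas[irregular].max()
```

The comparison runs in the descending-λ direction because that is the direction of the annealing, so the neighbouring reference point is always one already deemed on the path.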
By these procedures, we can separate the whole parameter region into two parts: stable and unstable regions, corresponding to the RS and RSB phases, respectively. By employing this, in figure 11 we draw 'phase diagrams' of the two specific samples also used in figure 10. This figure shows that the boundary between the white and black regions behaves similarly to the AT line, although there is a gap, supposedly owing to the sample fluctuation and the finite-size effect. As a reference, we also compute the minimum location of the approximate CV error with respect to λ given a, which is denoted by the green curve in figure 11. According to (41), this corresponds to the quantity appearing in the phase diagrams in section 3.3. Note that these procedures are somewhat overcautious and can miss some stable regions possibly existing in the small-λ region. As typically seen in the left panel of figure 2, a re-entrant transition can emerge in the weak-noise case, but the present procedures cannot detect this re-entrancy, because they detect the first RS–RSB transition, corresponding to the rightmost transition branch, and regard the entire region below this first transition point as the 'instability region'. However, for practical use, it is more important to avoid giving wrong estimates by our approximate formula. Hence, we do not aim to improve the above instability detection procedures in this study.
Apart from the re-entrancy, it is worthwhile to investigate the cause of the gap between the AT line and the instability points detected by our procedures. To this end, we compute the boundary value between the black and white regions in figure 11, given a, for many samples and different system sizes. The results at a = 5 and a = 10 are shown in figure 12. The left panels provide the histograms of the boundary value for different system sizes, discriminated by different colours. As the system size grows, the width of the histogram shrinks and the mode value tends to approach the value at the AT point. Here, the number of bins for the histogram is determined by the so-called Sturges rule [33], enabling us to define the mode value without ambiguity. To quantify the convergence behaviour of the mode value, we plot the mode against the inverse system size 1/N in the right panels. The mode value shows a clear tendency of approaching the AT value as N grows. This indicates that the gap is actually due to the sample fluctuation and the finite-size effect, and also implies that our instability detection procedures are reasonably connected to the AT instability.
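Sturges' rule fixes the number of histogram bins from the sample count alone; a one-line implementation:

```python
import math

def sturges_bins(n):
    """Sturges' rule: number of histogram bins for n samples,
    k = ceil(log2(n)) + 1."""
    return int(math.ceil(math.log2(n))) + 1
```

Because the bin count is fully determined by n, the histogram mode is defined without any tuning, which is the point of using the rule here.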
The AT instability is known to be connected to the emergence of many local minima [14]. To check this directly, we conduct the literal CV without the annealing. For each point of the parameter space, the estimator is computed by the CD algorithm from ten different randomly drawn initial conditions, each component of which is generated i.i.d. from a common distribution. In figure 13, the resultant CV errors are given as scatter plots in combination with the CV error using the annealing. The experimental setup of each panel is again identical to the corresponding one in figure 10. This figure gives clear evidence of multiple solutions below the AT point. Figure 13 also implies that the solution obtained by the annealing is rather atypical: solutions obtained from random initial conditions tend to give rather different values of the CV error from the annealed solution. To give a better theoretical background to this statement, we would have to construct the full-step RSB solution and characterise the annealed solution in the ensemble of all the local minima. This is beyond the present purpose but will be an interesting future work.
Our present attitude toward the multiplicity of solutions is to avoid it. This is reasonable because the global minimum of the generalisation error is in the RS region, free of the multiplicity, as clarified by our analytical computation. Once this attitude is accepted, we can use the proposed approximate formula to efficiently estimate the generalisation error and, fortunately, the formula also enables us to avoid the multiple-solution region via the above instability detection procedures. This is the main outcome and contribution of this study.
Before closing this subsection, we check the computational time and the approximation accuracy of the proposed formula more quantitatively. Here, we quantify the difference between the literal and approximate CVs by a normalised MSE, where the two quantities entering the MSE are the CV errors evaluated by the approximate and literal CV procedures, respectively. According to the derivation of the formula in section 4, the accuracy is expected to improve as N and M increase. Thus, we plot the normalised MSE against N in the left panel of figure 14. The parameters are fixed to an example setting. For each N, we compute the normalised MSE for several different samples; the marker (blue asterisk) denotes the median among the samples, and the upper and lower error bars correspond to the 0.86 and 0.14 quantiles, respectively. The number of samples depends on the system size. As expected, the normalised MSE is quite small for large system sizes, although at the smallest size N = 50 there is a non-negligible difference. This difference is dominated by a few per cent of samples giving accidentally large values. The probability of such accidents seems to decrease rapidly as the system size grows. In the middle panel, the same plot is given in double-log scale for small sizes, showing a clear decay of the normalised MSE proportional to N−2 as the system size grows. This is naturally understood from the scaling argument presented in section 5.1.1 of [32]. The corresponding computational times of the CD algorithm convergence and the approximate formula are given in the right panel. The approximate formula requires taking the inverse of the Hessian, leading to a computational cost of O(|SA|3), which scales as a third-order polynomial of N if |SA| = O(N). This computational cost can be more expensive than the optimisation cost of the CD algorithm in the large-N limit, because the total computational time of the CD algorithm is considered to scale as O(N2), under the assumption that the convergence of the CD algorithm takes place in a number of steps independent of the system size N, although the O(N2) behaviour is hard to see in figure 14.
Despite this inconvenience in the limiting case, figure 14 shows that the computational time of the approximate formula is much smaller than that of the CD algorithm convergence over the entire investigated range of system sizes. We note that the computational time of the CD algorithm shown in figure 14 is for a single optimisation, and hence, for conducting the literal k-fold CV, the required computational time becomes approximately k times larger. Overall, although there is no superiority in the large-N limit, our approximate formula works very efficiently in practice over a wide range of system sizes.
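One plausible form of such a normalised MSE between the two CV curves (the exact normalisation used in the figure is not reproduced here) is:

```python
import numpy as np

def normalised_mse(eps_approx, eps_literal):
    """Normalised MSE between the approximate and literal CV errors
    along the lambda path: squared difference summed over the path,
    divided by the squared norm of the literal curve (one plausible
    normalisation; an illustrative choice)."""
    eps_approx = np.asarray(eps_approx, dtype=float)
    eps_literal = np.asarray(eps_literal, dtype=float)
    return np.sum((eps_approx - eps_literal) ** 2) / np.sum(eps_literal ** 2)
```

A value of zero means the two curves coincide; the N−2 decay discussed above would appear as this quantity shrinking with the system size.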
5.2. A real-world dataset: Type Ia supernovae
Here, we apply the proposed approximate method to a dataset of Type Ia supernovae. Our dataset is a part of the data from [34, 35], screened by a certain criterion [36]. This dataset has recently been analysed by a number of sparse-estimation techniques, and a set of important variables, known to be empirically significant, has been reproduced [32, 36–38]. In those studies, the LASSO and ℓ0 cases are treated, and the CV is employed for hyperparameter estimation. We reanalyse this dataset using the SCAD penalty and compute the CV error by the approximate formula. The parameters of the screened data are M = 78 and N = 276, and an appropriate standardisation is employed as pre-processing.
Again, we use the λ-annealing to obtain the SCAD estimators for this dataset, and the CV error is computed by our approximate formula. The instability region is detected by the procedures explained in section 5.1.2. As in figure 11, the instability detection gives a phase diagram, shown in figure 15. The overall shape of this phase diagram is similar to the ones in section 3.3 and figure 11, supporting the practical relevance of our results so far. To directly check the approximation accuracy, we also conducted the literal CV at a number of values of a. The results for a = 4 and 50 are given in figure 16. The approximate error matches the literal one well up to the instability point determined by the procedures explained in section 5.1.2, which justifies our instability detection procedures. For the left panel of a = 4, however, even below the instability point, there exists a region in which the two CV errors agree well. This implies the presence of a re-entrant transition, and it is probably related to the protruding black region in the right panel of figure 15. As declared in section 5.1.2, we do not try to detect the re-entrancy in the present study, but some workarounds are conceivable. For example, annealing with respect to a instead of λ would be able to identify the re-entrancy with respect to λ. We found that this strategy can actually detect the re-entrancy, but the strategy itself is far from perfect, for several reasons. One reason is that the switching parameter a has no upper bound, in contrast to λ (which has an effective upper bound, as explained in section 5.3), and hence the initialisation becomes nontrivial.
Another reason is that some instability 'islands' seem to exist in unexpected regions of the parameter space for this specific dataset: some compact parameter regions exhibiting the instability appear to exist, in contrast to the theoretically derived phase diagrams in section 3.3, and hence isolating the instability regions becomes nontrivial even if the annealing with respect to a is correctly performed. Owing to these difficulties, we leave further exploration of better ways of nonconvexity control as future work.
To extract relevant values of the parameters, we plot the approximate CV error and the number of non-zero components along the line of the minimum-CV-error λ given a in figure 17. Here, some outliers exhibiting extraordinarily small CV errors are omitted. At the CV error minimum, a solution with K = 10 is obtained, which is comparable with K = 9 of the LASSO solution at the minimum CV error [32, 36]. In the case of LASSO, it is common to select a sparser solution than the one at the CV error minimum, according to the one-standard-error rule [10, 39]. Although it is unclear whether the application of this rule to SCAD estimators is appropriate, we here try applying it to our case. As a result, we obtain a solution with K = 3 at a = 2.2 or 2.3, as seen in figure 17. We globally examined all the datapoints within one standard error in the stable phase and confirmed that the K = 3 solution is the sparsest. This sparsest solution consists of the variables whose IDs are 1, 2 and 233. This accurately matches the result of [37, 38, 40], in which the ℓ0 formalism is treated, while the LASSO estimator tends to give a denser solution with K = 6 even under the one-standard-error rule [32, 36]. These results demonstrate the effectiveness of the SCAD estimator, and the presented analysis and approximate formula resolve its disadvantages: the multiplicity of solutions and the computational cost in hyperparameter estimation. The effect of the one-standard-error rule on the SCAD estimator also seems favourable, though further case studies would be needed.
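The one-standard-error rule used here can be sketched as follows (a generic implementation; the CV errors, error bars and sparsity counts along the path are inputs):

```python
import numpy as np

def one_se_rule(lambdas, cv_err, cv_bar, n_nonzero):
    """One-standard-error rule: among all datapoints whose CV error is
    within (min CV error + its error bar), pick the sparsest solution.
    Returns (selected lambda, its number of non-zero components)."""
    cv_err = np.asarray(cv_err, dtype=float)
    i_min = int(np.argmin(cv_err))
    threshold = cv_err[i_min] + cv_bar[i_min]
    candidates = [i for i in range(len(cv_err)) if cv_err[i] <= threshold]
    best = min(candidates, key=lambda i: n_nonzero[i])
    return lambdas[best], n_nonzero[best]
```

In the text this selection is applied over the stable (RS) region only, so in practice the inputs should already be restricted to that region.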
5.3. Numerical codes
In [1], a MATLAB package of numerical codes implementing the estimation of the solution path using the λ-annealing, in conjunction with the approximate CV formula, is distributed; the optimisation is performed by the CD algorithm, as in the experiments so far. In the package, three regularisations, LASSO, SCAD, and MCP, are treated in a unified manner. All the parameters are tunable in the codes, but the minimally required inputs are the data vector, the design matrix A6, and the switching parameter a. In the default setting, L = 100 values of λ are chosen in descending order: the largest value is determined from the column vectors of A so that only the trivial solution exists above it; the smallest value is a fixed fraction of the largest; and the intermediate values interpolate these two by a geometric progression with a constant ratio. This follows the convention of the commonly used package glmnet [10, 41]. The annealing is basically the same as the warm starts explained in [10], but it carries a stronger meaning for nonconvex penalties because it inevitably picks out a certain solution path, as exemplified in the numerical experiments so far. At each point of the λ path, the CD algorithm finds the solution starting from the solution at the previous point as the initial condition (for the first point the initial condition is the zero vector), and hence consecutive solutions are expected to be close to each other. In the default setting, after the whole solution path is obtained, the approximate CV formula is applied, followed by the instability detection routine, yielding the approximate values of the CV error and its reliable region. The package also includes demonstration codes with which some experiments in section 5.1 can easily be reproduced; interested readers are encouraged to try them. The details of usage are explained further in [1].
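The default λ-grid construction described above follows the glmnet convention and can be sketched as follows (the λ_max expression assumes a 1/M loss scaling and is our illustrative choice, not necessarily the package's exact formula):

```python
import numpy as np

def lambda_grid(A, y, L=100, ratio=1e-4):
    """Descending geometric lambda grid in the glmnet style.
    lambda_max is set so that only the trivial (all-zero) solution
    exists above it (here max_j |A_j^T y| / M); lambda_min is
    ratio * lambda_max; the L values interpolate geometrically."""
    M = A.shape[0]
    lam_max = np.max(np.abs(A.T @ y)) / M
    return np.geomspace(lam_max, ratio * lam_max, num=L)
```

Running the warm-started solver over this grid, from the first (largest) element downward, is exactly the λ-annealing of the package's default workflow.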
6. Conclusion
In this study, using the replica method, we analysed the macroscopic properties of the SCAD estimator in the context of signal reconstruction in the presence of noise, under the assumption that the design matrix is an i.i.d. random matrix. We derived phase diagrams involving the RSB phase, and showed the superiority of the SCAD estimator over the LASSO one based on ROC curves. We also provided analytical evidence that the global minimum of the input MSE, or the generalisation error, is located in the RS phase. Furthermore, we derived an approximate formula for the CV error, although it is applicable only in the RS phase. We implemented procedures detecting the AT instability, or the approximation instability, enabling us to clarify the applicable limit of the approximate formula and making the formula stand-alone.
To examine the analytical results, numerical experiments on simulated datasets and a real-world dataset of Type Ia supernovae were conducted. On the simulated datasets, the replica prediction was well reproduced. The accuracy and the computational time of the approximate CV formula were examined, and its effectiveness was demonstrated over a wide range of system sizes. For the real-world dataset, the application of the SCAD penalty reproduced the variables known to be empirically important. By using the approximate formula, we could search the parameter space globally and efficiently, and found that the SCAD estimator can provide a very sparse solution giving a reasonable value of the CV error. This solution matches that of the earlier studies using the ℓ0 formulation [37, 38, 40], and cannot be found by LASSO. These experiments demonstrate the effectiveness of the SCAD estimator, and the presented analysis and approximate formula resolve its disadvantages: the multiplicity of solutions and the computational cost in hyperparameter estimation.
As an efficient strategy to obtain a solution path, we proposed nonconvexity annealing as a part of the nonconvexity control proposed in [20], and especially focused on the annealing with respect to λ, termed λ-annealing in this paper. It was shown that this strategy works well also in combination with our approximate CV formula, but it raised a further question related to RSB: in the RSB phase exhibiting the multiplicity of solutions, which solution is obtained by the annealing? Our numerical experiments showed that the annealed solution tends to give a smaller CV error than the solutions computed from random initial conditions, and in this sense the annealing is a good strategy even in the multiple-solution region. A similar observation was obtained in an inference problem for the Gaussian mixture model [42, 43]. To make a more accurate and quantitative analysis of these findings, one needs to construct the full-step RSB solution and to characterise the annealed solution in the ensemble of all the solutions. This will be an interesting future work.
The present instability detection procedures for the approximate CV formula are rather ad hoc and have some ambiguity, especially in specifying irregular datapoints along the solution path with respect to λ. This ambiguity is related to which points of λ should be sampled when computing the solution path. In the case of LASSO, the change points of the active set, usually called knots, can be efficiently computed [44], which provides a clear criterion for the above ambiguity problem. It is expected that a similar technique computing knots for SCAD would be useful for improving the instability detection procedures.
As a final remark, we mention the MCP penalty. If we use it instead of the SCAD penalty, the effective one-dimensional estimator, (26) in the SCAD case, is replaced by the corresponding MCP thresholding function, (51).
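For reference, the conventional MCP parameterisation and its effective one-dimensional ('firm-thresholding') estimator take the following standard forms (these are the textbook conventions and may differ from the paper's normalisation):

```latex
J_{\lambda,a}(x) =
\begin{cases}
\lambda |x| - \dfrac{x^2}{2a}, & |x| \leqslant a\lambda,\\[6pt]
\dfrac{a\lambda^2}{2}, & |x| > a\lambda,
\end{cases}
\qquad (a > 1),
\qquad
x^*(z) =
\begin{cases}
\dfrac{\operatorname{sign}(z)\max(|z| - \lambda,\, 0)}{1 - 1/a}, & |z| \leqslant a\lambda,\\[6pt]
z, & |z| > a\lambda.
\end{cases}
```

In words, the MCP interpolates between soft thresholding near the origin and the identity (no shrinkage) beyond aλ, which is why it shares the unbiasedness property of SCAD for large signals.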
Replacing x* in equations (27a)–(27c) by (51), we obtain the EOS for the MCP penalty, and the AT condition (33) is replaced in the same way. Corresponding to (34), the RS existence limit of the MCP case is also given analogously in (54).
Using these replacements, it is straightforward to obtain the result for the MCP case. As far as we have examined, the MCP result is qualitatively similar to the SCAD one. To illustrate this, we give some phase diagrams and related plots, together with plots of literal CV errors with and without the annealing, in figure 18. We see qualitatively similar results to the SCAD case: the re-entrancy in the weak-noise region; the absence of a global minimum of the input MSE in the RS phase at finite a for the strong-noise or dense-signal cases; the accurate agreement between the RS and numerical results above the AT point; and the solution multiplicity below the AT point. Although there can be quantitative differences between the SCAD and MCP penalties, as reported in [17], such a comparative study requires more detailed quantitative analyses, and we also leave it as future work. Note that the lower existence limit of the RS phase in the MCP case is given as a = 1, which is derived from (36) and (54); hence the RS stable region tends to be wider than in the SCAD case. However, the direct comparison of the two parameter spaces is not necessarily meaningful, and another systematic way of comparison is desired.
Acknowledgments
The authors would like to thank Yoshiyuki Kabashima, Satoshi Takabe, Takashi Takahashi, and Yingying Xu for their helpful discussions and comments. This work is partially supported by JSPS KAKENHI No. 16K16131 (AS) and Nos. 18K11463 and 17H00764 (TO). TO is also supported by a Grant for Basic Science Research Projects from the Sumitomo Foundation.
Footnotes
- 4
In [32], essentially the same analysis is done for LASSO, but a wrong terminology is used. The quantity R is termed Youden's index in that study, but this contradicts the conventional terminology; Youden's index is another, similar but different, criterion for choosing an 'optimal' point on an ROC curve.
- 5
- 6
In the package, the design matrix is denoted as X and the regression coefficients as β, following the statistics convention.