Inducing sparsity via the horseshoe prior in imaging problems

A problem typically arising in imaging applications is the reconstruction task under sparsity constraints. A computationally efficient strategy to address this problem is to recast it in a hierarchical Bayesian framework coupled with a Maximum A Posteriori (MAP) estimation approach. More specifically, the original unknown is modeled as a conditionally Gaussian random variable with an unknown variance. Here, the expected behavior of the variance is encoded in a half-Cauchy hyperprior. The latter, coupled with the conditionally Gaussian prior, yields the horseshoe shrinkage prior, which is particularly popular within the statistics community and is here introduced into the context of imaging problems. The arising non-convex MAP estimation problem is tackled via an alternating minimization scheme for which global convergence to a stationary point is guaranteed. Experimental results show that the derived hypermodel is competitive with classical variational methods as well as with other hierarchical Bayesian models typically employed for sparse recovery problems.


Introduction
We consider a linear imaging inverse problem, where we seek an x ∈ R^n approximating the original image x̄ ∈ R^n in

b = Ax̄ + e, e ∼ N(0, σ²I_m). (1)

Here b ∈ R^m denotes the vectorized observed data, and A ∈ R^{m×n} with m ⩽ n is the known forward model operator, such as the convolution operator in image deblurring problems or the discrete Radon transform in computed tomography (CT) reconstruction problems. Furthermore, e ∈ R^m represents additive zero-mean white Gaussian noise with standard deviation σ, and I_m is the m × m identity matrix. Due to the ill-conditioning of A, the linear problem in (1) cannot be addressed by direct inversion. As a way to overcome the ill-posedness of the original problem, one rather seeks a solution that minimizes a given cost functional

min_x (1/2)‖Ax − b‖₂² + R(x),

where R is referred to as the regularization term, which encodes information or beliefs possibly available a priori on x̄.
In this work, we assume that the original image x̄ is sparse (which is the case in many applications, such as microscopy and astronomical imaging) or, more generally, that it can be sparsely represented in a given basis or dictionary. The latter scenario makes sparse recovery a very flexible framework and has motivated the large amount of research devoted to the design and analysis of sparsity-promoting models; see, e.g. the reviews in [25,26] and the application-oriented works in [1,10,41].
To enforce sparsity on the solution, the regularizer R can be set to the ℓ₀ pseudo-norm of x, with ‖x‖₀ being the number of nonzero entries of x. However, the minimization problem corresponding to this choice is known to be NP-hard. To overcome this issue, one can replace the ℓ₀ pseudo-norm by the ℓ₁-norm, which is its convex relaxation. The regularizer ‖x‖₁ is the core of basis pursuit (BP) [25]. The impact of BP has extended beyond the imaging community to many other fields, ranging from geophysics to statistics [43,44]. One of the reasons behind the popularity of the ℓ₁ regularizer can be ascribed to the restricted isometry property, which guarantees exact recovery of the sparse solution under certain conditions [20,27] that are, however, satisfied only in very specific cases [24]. Besides the convex ℓ₁ regularizer, significant attention has been given to the class of ℓ_p regularizers with p < 1; see, e.g. [23,29,30,35,50]. Although they can provide good sparse solutions, such strategies have to face the computational challenges that non-convex optimization typically brings along, together with the problem of selecting a suitable value for the exponent p.
Alternative approaches rely on Bayesian statistics, incorporating a probabilistic description of the target signal and combining it with a likelihood function that accounts for the forward problem (1) and the observed data. Besides the classical sparse Bayesian learning strategies [45,51], in the last decades significant attention has also been given to hierarchical Bayesian methods, which model the unknown parameters possibly arising in the problem of interest as random variables. Within this class, a plethora of works investigate the probabilistic counterparts of popular sparsity-promoting penalty terms, such as the ℓ₁ penalty [4], the total variation term [37], and the non-convex ℓ_p terms [3]. Priors related to the literature on relevance vector machines have also been employed to enhance sparsity in image restoration [5]. More recently, in [15,16,19], the authors model the entries of x as Gaussian random variables with unknown variances following generalized gamma hyperpriors. The parameters arising in the expression of the hyperpriors determine the (non-)convexity of the resulting hypermodel and the overall performance of the method in terms of sparsity promotion; the selection of those parameters is a challenging issue.

Contribution
In this paper, we address the sparse recovery problem under the acquisition model in (1) by means of a hierarchical Bayesian model based on the horseshoe (HS) prior. In the HS prior, x is assumed to follow a conditionally Gaussian distribution whose variance is controlled by half-Cauchy distributed parameters. In contrast to the parameterized priors studied in [16], due to its layered formulation, the HS prior does not involve any tuning parameters in its expression. Then, relying on the popular Maximum A Posteriori (MAP) approach, we derive a new non-convex minimization problem to reconstruct a sparse solution. Due to the non-convexity of this minimization problem, the numerical results may depend on the initialization and on the numerical algorithm. We apply the iterative alternating sequential (IAS) algorithm [12], also known as the alternating minimization algorithm, to solve the minimization problem, and we study its convergence.
In more detail, our contribution can be summarized as follows. (i) We introduce the horseshoe prior, originally employed by the statistics community [21,22,39], into the field of imaging problems. Moreover, upon its selection, we validate the effectiveness of a MAP-based approach for the reconstruction of 2D signals that are sparse or that are characterized by sparse features. (ii) We prove that the IAS algorithm, for which convergence results have so far been proven only in convex scenarios, globally converges to a critical point when employed for the minimization of the non-convex MAP hypermodel resulting from the adoption of the HS prior. (iii) We empirically analyze the robustness of the overall approach with respect to the initialization, which is known to be crucial in non-convex settings, and we draw some general guidelines for its selection. (iv) Finally, we set up an extensive numerical comparison on image deblurring and CT reconstruction aimed at assessing the performance of the HS prior with respect to classical regularizers and other popular hierarchical models.
The paper is organized as follows. In section 2 we recall the properties of the HS prior. The hypermodel of interest is derived in section 3, while in section 4 we outline the main steps of IAS and study its convergence properties. Then, in section 5, we test our hypermodel on different 1D and 2D reconstruction problems. Finally, we conclude our work with an outlook on future research in section 6.

The horseshoe prior: a closer look
In this section, for the sake of self-containedness, we review the definition and some properties of the HS prior. In order to illustrate why the HS prior is suitable for representing sparse solutions, we also set up a comparison with two other widely adopted prior distributions in the context of hierarchical Bayesian methods for sparse recovery problems.
In the Bayesian framework, one possible way to impose sparseness on x is by modeling its entries as conditionally Gaussian distributed random variables,

x_i | λ_i ∼ N(0, λ_i²), i = 1, . . . , n, (2)

where λ_i² is the unknown variance, which is expected to be small when x_i vanishes, while it should be larger in correspondence of the few nonzero entries of the signal x. In the Bayesian paradigm, the unknown variances can also be modeled as random variables, and the information on their expected behavior is encoded in the so-called hyperprior.
In a number of recent papers [11,12,16,18,19], the hyperprior has been selected from the family of generalized gamma distributions, which includes some notable probability density functions (pdfs), such as the pdf of the gamma distribution,

π(λ_i² | α, β) = (1/(Γ(α)β^α)) (λ_i²)^{α−1} exp(−λ_i²/β), (3)

and the pdf of the inverse gamma distribution,

π(λ_i² | α, β) = (β^α/Γ(α)) (λ_i²)^{−α−1} exp(−β/λ_i²), (4)

where Γ(·) denotes the gamma function, and α, β > 0 are the shape and the scale parameters of the gamma/inverse gamma distribution, respectively. More specifically, the shape parameter α is hand-tuned so as to model the expected sparsity of x. We refer to the prior resulting from the adoption of a Gaussian mixture coupled with the selection of a gamma and of an inverse gamma hyperprior as the Gg and the Gig prior, respectively. In order to gain flexibility, one can think of λ_i as being composed of a global parameter τ that informs on the sparsity level of the signal and a parameter θ_i that encodes the local sparsity information, i.e.

λ_i = τ θ_i, i = 1, . . . , n. (5)
We assume that τ and θ_i follow a half-Cauchy distribution with parameters (0, 1), τ, θ_i ∼ hC⁺(0, 1), i.e.

π(τ) ∝ 1/(1 + τ²) and π(θ_i) ∝ 1/(1 + θ_i²), i = 1, . . . , n. (6)

Typically, one refers to (2) together with (6) as the horseshoe (HS) prior. In a hierarchical set-up, it is well established that the influence of the hyperparameters on the overall prior becomes weaker as one goes down the levels of the hierarchy [33]. From this perspective, the HS prior is informed on the expected local and global sparsity properties of the signal at a higher level, so that any beliefs on the sparsity structure can be enforced more strongly. Moreover, the HS prior is fully determined by its hierarchical formulation, i.e. its expression does not involve any tuning parameters.
In figure 1(a) we plot the marginalized Laplace and Student's t priors together with the upper and lower bounds for the marginalized HS prior. Moreover, figures 1(b) and (c) show close-ups of the peak and of the tails of the pdfs, respectively.
Notice that the band determined by the bounds given in (7) is characterized by an infinite spike at the origin as well as by fat tails, which are expected to yield either shrinkage or preservation of the data, depending on its informativeness. We also highlight that the Student's t pdf with ν = 1, for which the fat-tail behavior is more accentuated, presents a lower peak.
We conclude this section by taking the comparison with the Gg and the Gig priors a little further. In what follows, we keep τ fixed, consider n > 1, and introduce the so-called shrinkage coefficients

ϕ_i = 1/(1 + λ_i²), i = 1, . . . , n.

The shrinkage coefficient ϕ_i is expected to be close to 0, which gives no shrinkage, when x_i is a non-vanishing component, while it approaches 1, where shrinkage occurs, when x_i ≈ 0.
In the case of the HS prior, we recall that λ_i = τθ_i. As θ_i follows a half-Cauchy distribution with parameters (0, 1), by applying classical rules on the transformation of random variables, one can derive the pdf of ϕ_i given τ, which models the behavior of the shrinkage coefficients:

π(ϕ_i | τ) = τ / ( π √(ϕ_i(1 − ϕ_i)) (1 + (τ² − 1)ϕ_i) ), ϕ_i ∈ (0, 1). (8)

Notice that when τ = 1, ϕ_i follows a Beta distribution with both shape parameters equal to 1/2. The density in (8) is plotted in figure 2(a) for different values of the global parameter τ. One can observe that, regardless of the value of τ, the horseshoe-shaped distribution is capable of promoting the strong components of the signal while shrinking the less informative ones, owing to the presence in each curve of infinite spikes at 0 and 1. Based on (3) and (4), one can derive the pdfs of the shrinkage coefficients ϕ_i when the Gg and the Gig priors are selected. The resulting pdfs are shown in figures 2(b) and (c), respectively, for different values of the scale and shape parameters. Compared with the HS prior, the pdfs arising from the Gg/Gig priors cannot present infinite spikes at 0 and at 1 simultaneously, thus suggesting that a severe shrinkage of the vanishing entries of the signal and a strong preservation of the larger components cannot be guaranteed at the same time.
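As a quick sanity check of the density reconstructed in (8), the following Python sketch (not part of the original experiments; the sample size, seed and helper name phi_density are ours) samples θ_i ∼ hC⁺(0, 1) and compares the empirical distribution of ϕ_i = 1/(1 + τ²θ_i²) with the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.5                 # fixed global shrinkage parameter
n_samples = 200_000

# theta_i ~ half-Cauchy(0, 1): absolute value of a standard Cauchy draw
theta = np.abs(rng.standard_cauchy(n_samples))

# lambda_i = tau * theta_i and shrinkage coefficient phi_i = 1 / (1 + lambda_i^2)
phi = 1.0 / (1.0 + (tau * theta) ** 2)

def phi_density(p, tau):
    """Closed-form density of phi_i given tau, cf. (8)."""
    return tau / (np.pi * np.sqrt(p * (1.0 - p)) * (1.0 + (tau**2 - 1.0) * p))

# empirical density away from the integrable spikes at 0 and 1
counts, edges = np.histogram(phi, bins=50, range=(0.01, 0.99))
width = edges[1] - edges[0]
empirical = counts / (n_samples * width)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(empirical - phi_density(centers, tau))))  # should be small
```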

Hierarchical Bayesian model
In this section, we couple the HS prior with the information encoded in the likelihood so as to derive the hierarchical model of interest.
Considering the linear inverse problem stated in (1), we can whiten the noise by transforming the operator A and the data b as

A ← A/σ, b ← b/σ. (9)

When the noise level is not available, one can invoke some of the very robust existing strategies for the estimation of σ when the corrupting noise is known to be white Gaussian; see, e.g. [32,34]. Therefore, without loss of generality, we here assume the noise level to be known, with unitary standard deviation, i.e. σ = 1. Based on (2) and (5), the conditional prior density function of x can be expressed as

π(x | θ, τ) ∝ ∏_{i=1}^n (1/(τθ_i)) exp( −x_i²/(2τ²θ_i²) ). (10)

By assuming the mutual independence of the univariate θ_i and of the global parameter τ, and upon the selection of a half-Cauchy hyperprior on the unknown parameters as stated in (6), the joint hyperprior takes the form

π(θ, τ) ∝ (1/(1 + τ²)) ∏_{i=1}^n 1/(1 + θ_i²). (11)

According to Bayes' formula, we combine the HS prior defined in (10), the hyperprior defined in (11) and the likelihood function to obtain the posterior density function

π(x, θ, τ | b) ∝ exp( −(1/2)‖Ax − b‖₂² ) π(x | θ, τ) π(θ, τ). (12)

A commonly used point estimate for the above pdf is the maximum a posteriori (MAP) estimate, where one sets the mode of the posterior as the single point representative of the whole density function:

(x*, θ*, τ*) ∈ argmax_{x,θ,τ} π(x, θ, τ | b) = argmin_{x,θ,τ} ( −log π(x, θ, τ | b) ).

After inserting (12) and dropping the constant terms, we obtain the variational model that we want to solve,

min_{x ∈ R^n, θ ⩾ ε1_n, τ ⩾ ε} J(x, θ, τ), (13)

where the cost functional J is defined as

J(x, θ, τ) = (1/2)‖Ax − b‖₂² + Σ_{i=1}^n x_i²/(2τ²θ_i²) + Σ_{i=1}^n log θ_i + n log τ + Σ_{i=1}^n log(1 + θ_i²) + log(1 + τ²). (14)

Here, we set a lower bound 1 ≫ ε > 0 on θ and τ, which by definition are required to be strictly positive, so as to avoid numerical instabilities. In what follows, we refer to J defined in (14) as the HS-L₂ hypermodel. Notice that the cost functional J is non-convex in the joint unknown (x, θ, τ), which makes the minimization problem challenging.
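For reference, a minimal Python sketch evaluating the functional J in (14) (assuming the reconstruction above and whitened data, i.e. σ = 1; the function name hs_l2_objective is ours) reads:

```python
import numpy as np

def hs_l2_objective(x, theta, tau, A, b):
    """Evaluate the HS-L2 functional J of (14) on whitened data (sigma = 1)."""
    n = x.size
    r = A @ x - b
    data_fit = 0.5 * r @ r                                  # 0.5 * ||Ax - b||^2
    prior = 0.5 * np.sum(x**2 / (tau**2 * theta**2))        # conditionally Gaussian prior
    log_norm = np.sum(np.log(theta)) + n * np.log(tau)      # Gaussian normalization terms
    hyper = np.sum(np.log1p(theta**2)) + np.log1p(tau**2)   # half-Cauchy hyperpriors
    return data_fit + prior + log_norm + hyper
```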

Numerical method
In this section, we apply the iterative alternating sequential (IAS) algorithm proposed in [12] to solve the minimization problem in the HS-L 2 model (13). After giving details on the numerical method, we study its convergence properties.

IAS algorithm
The principle of the IAS algorithm, also known as the alternating minimization/descent algorithm [46] or the block coordinate minimization/descent algorithm [7,38], is to alternately update the unknown variables starting from a proper initial guess. In algorithm 1, we summarize the IAS algorithm applied to the problem in the HS-L₂ hypermodel. We then give details on the numerical solvers for the subproblems with respect to x, θ and τ.
Algorithm 1: IAS for the HS-L₂ hypermodel (13).
1. Initialize θ^(0) = ϑ^(0)1_n with ϑ^(0) > 0, and τ^(0) > 0; set k = 0.
2. Update x^(k+1) = argmin_x J(x, θ^(k), τ^(k)) by solving (15)-(16).
3. Update θ^(k+1) = argmin_{θ ⩾ ε1_n} J(x^(k+1), θ, τ^(k)) via (19).
4. Update τ^(k+1) = argmin_{τ ⩾ ε} J(x^(k+1), θ^(k+1), τ) via (20).
5. If the stopping criterion is not met, set k ← k + 1 and go to step 2.

The x-subproblem reads

x^(k+1) = argmin_x { (1/2)‖Ax − b‖₂² + Σ_{i=1}^n x_i²/(2(τ^(k))²(θ_i^(k))²) }. (15)

Notice that, after whitening the noise, τ plays the role of a regularization parameter ruling the amount of smoothing that has to be performed on x. After defining a new variable w = (1/τ^(k)) D^{-1}_{θ^(k)} x, with D_θ = diag(θ₁, . . . , θ_n), we solve the following minimization problem instead of (15):

min_w (1/2)‖τ^(k) A D_{θ^(k)} w − b‖₂² + (1/2)‖w‖₂², (16)

which amounts to solving the well-conditioned linear system

[ τ^(k) A D_{θ^(k)} ; I_n ] w = [ b ; 0 ]

in the least squares sense. Depending on its dimension, the linear system in (16) can be solved either by a direct method or by an iterative scheme.

We then consider the θ-subproblem, i.e. the minimization of J(x^(k+1), θ, τ^(k)) with given x^(k+1) and τ^(k). The problem can be decomposed into n one-dimensional minimization problems of the form

min_{θ_i ⩾ ε} f(θ_i), f(θ_i) = x_i²/(2τ²θ_i²) + log θ_i + log(1 + θ_i²), (18)

with x_i = x_i^(k+1) and τ = τ^(k). By imposing the first-order optimality condition on the function f, we get

f′(θ_i) = −x_i²/(τ²θ_i³) + 1/θ_i + 2θ_i/(1 + θ_i²) = 0,

which, upon the substitution u = θ_i², reduces to the quadratic equation 3u² + (1 − x_i²/τ²)u − x_i²/τ² = 0 and yields

θ_i^(k+1) = ( [ (x_i²/τ² − 1) + √( (1 − x_i²/τ²)² + 12 x_i²/τ² ) ] / 6 )^{1/2}. (19)

It is easy to verify that the second derivative of f computed at θ_i^(k+1) is strictly positive, hence the stationary point in (19) is a minimizer.
Moreover, notice that the minimizer of f(θ_i) can only vanish when τ = 0, which is not allowed, or when x_i = 0. To account for this event, and for the finite-precision arithmetic one has to work with, we introduce an explicit constraint in the original problem (13) to keep θ_i away from 0.
Similarly to the θ_i-update, the τ-subproblem reduces to the one-dimensional minimization of

g(τ) = Σ_{i=1}^n x_i²/(2τ²θ_i²) + n log τ + log(1 + τ²),

with x_i = x_i^(k+1) and θ_i = θ_i^(k+1). We use the first-order optimality condition on g and obtain

g′(τ) = −Σ_{i=1}^n x_i²/(τ³θ_i²) + n/τ + 2τ/(1 + τ²) = 0,

which, setting S = Σ_{i=1}^n x_i²/θ_i² and u = τ², reduces to the quadratic equation (n + 2)u² + (n − S)u − S = 0 and yields

τ = ( [ (S − n) + √( (n − S)² + 4(n + 2)S ) ] / (2(n + 2)) )^{1/2}.

According to the constraint, we thus have

τ^(k+1) = max{ ε, ( [ (S − n) + √( (n − S)² + 4(n + 2)S ) ] / (2(n + 2)) )^{1/2} }, (20)

which can be proven to be a minimizer, since the second derivative of g computed at τ^(k+1) is strictly positive. The discussion reported at the end of section 4.1.2 applies here as well.
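Collecting the three updates, a vectorized Python sketch of steps 2-4 of algorithm 1 (assuming the formulas (16), (19) and (20) as reconstructed above; the helper names are ours, eps plays the role of ε, and a dense least-squares solve stands in for a large-scale iterative solver) could read:

```python
import numpy as np

def x_update(A, b, theta, tau):
    """Solve (16) in the substituted variable w and map back to x (a sketch)."""
    m, n = A.shape
    As = A * (tau * theta)               # tau * A * D_theta: scales column i by tau*theta_i
    M = np.vstack([As, np.eye(n)])       # well-conditioned augmented system [As; I]
    rhs = np.concatenate([b, np.zeros(n)])
    w, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return tau * theta * w               # x = tau * D_theta * w

def theta_update(x, tau, eps=1e-8):
    """theta-update (19): positive root of the quadratic in theta_i^2 (a sketch)."""
    c = (x / tau) ** 2                                       # x_i^2 / tau^2
    u = ((c - 1.0) + np.sqrt((1.0 - c) ** 2 + 12.0 * c)) / 6.0
    return np.maximum(np.sqrt(u), eps)                       # enforce theta_i >= eps

def tau_update(x, theta, eps=1e-8):
    """tau-update (20): positive root of the quadratic in tau^2 (a sketch)."""
    n = x.size
    s = np.sum(x**2 / theta**2)
    u = ((s - n) + np.sqrt((n - s) ** 2 + 4.0 * (n + 2.0) * s)) / (2.0 * (n + 2.0))
    return max(np.sqrt(u), eps)                              # enforce tau >= eps
```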

Convergence analysis
The IAS algorithm outlined in section 4.1 can also be regarded as an alternating minimization/descent or block coordinate minimization/descent algorithm, for which a wide literature on convergence analysis in convex/nonconvex and smooth/nonsmooth settings has been developed; see, e.g. [6,49]. In [2], the authors provide a general proof of convergence under the assumption that the cost functional satisfies the Kurdyka-Łojasiewicz (KL) inequality. In more detail, denoting by f(ζ): R^s → R ∪ {∞} a given lower semicontinuous smooth proper function and by {ζ^(k)} the sequence generated by the alternating descent algorithm, it is possible to prove the convergence of the algorithm to a stationary point provided that the following assumptions are satisfied:

(A1) Sufficient decrease condition. There exists a > 0 such that f(ζ^(k+1)) + a‖ζ^(k+1) − ζ^(k)‖₂² ⩽ f(ζ^(k)) for all k.

(A2) Relative error condition. There exists γ > 0 such that ‖∇f(ζ^(k+1))‖₂ ⩽ γ‖ζ^(k+1) − ζ^(k)‖₂ for all k.

(A3) Continuity condition. There exist a subsequence {ζ^(k_j)} and ζ̄ such that ζ^(k_j) → ζ̄ and f(ζ^(k_j)) → f(ζ̄) as j → ∞.

We remark that assumption (A2) can be formulated more generally for nonsmooth functions. In this section, we verify that the sequence {(x^(k), θ^(k), τ^(k))} generated by algorithm 1 for solving (13) satisfies (A1)-(A3).
First, we summarize some analytical properties of J defined in (14) that will be used to derive the convergence results.

Lemma 4.1. The functional J defined in (14) is continuous and coercive, and satisfies the KL property. Moreover, it admits a global minimizer.
Proof. The continuity of J is obvious. Since J is a composition, via elementary operations, of real analytic functions, it satisfies the KL property [8]. For what concerns coercivity, it is easy to verify that J positively diverges as ‖(x, θ, τ)‖₂ → +∞. To address the case when either τ or θ_i tends to ε, it is sufficient to observe that, for any fixed a > 0, the auxiliary function h: [ε, +∞) → R defined as

h(t) = a/(2t²) + log t + log(1 + t²)

goes to +∞ as t tends to ε and ε goes to 0 from the right. Since J is continuous, coercive and bounded from below, it admits a global minimizer.
Proof. From lemma 4.1, the cost functional J is bounded from below, which implies that the sequence {J(x^(k), θ^(k), τ^(k))} generated by algorithm 1 is also bounded from below. Moreover, in light of the updating steps of algorithm 1, we have

J(x^(k+1), θ^(k+1), τ^(k+1)) ⩽ J(x^(k+1), θ^(k+1), τ^(k)) ⩽ J(x^(k+1), θ^(k), τ^(k)) ⩽ J(x^(k), θ^(k), τ^(k)),

that is, the sequence {J(x^(k), θ^(k), τ^(k))} is non-increasing.

We next verify the sufficient decrease condition (A1) for the x-update. We consider the stacked matrix

M^(k) = [ A ; (1/τ^(k)) D^{-1}_{θ^(k)} ] ∈ R^{(m+n)×n}, (21)

which is a full column rank matrix, its lower block being an invertible diagonal matrix. As a function of x, J(x, θ^(k), τ^(k)) is a quadratic functional with Hessian (M^(k))ᵀM^(k), and x^(k+1) is its exact minimizer. From lemma 4.2 and (21), we have

J(x^(k), θ^(k), τ^(k)) − J(x^(k+1), θ^(k), τ^(k)) = (1/2)‖M^(k)(x^(k) − x^(k+1))‖₂². (22)

We can thus conclude that the decrease produced by the x-update is bounded from below by a positive multiple of ‖x^(k) − x^(k+1)‖₂², since the smallest singular values of the matrices M^(k) are uniformly bounded away from zero along the iterations. Arguing in the same way on the exactly solved one-dimensional θ- and τ-subproblems yields the sufficient decrease condition (A1) for the whole sequence.

Concerning the relative error condition (A2), notice that the relevant bounds follow from τ and θ_i, for all i = 1, . . . , n, lying in [ε, +∞). According to the boundedness of {x^(k)} and (22), we know that ∇_x J is bounded when computed on the sequence {(x^(k), θ^(k), τ^(k))}. As a consequence, there exists a constant γ > 0 such that the x-component of ∇J at (x^(k+1), θ^(k+1), τ^(k+1)) is bounded by γ‖(x^(k+1), θ^(k+1), τ^(k+1)) − (x^(k), θ^(k), τ^(k))‖₂. We now consider the θ-components: θ_i^(k+1) is the global minimizer of (18) with x_i^(k+1) and τ^(k), so the corresponding partial derivative of J vanishes at (x^(k+1), θ^(k+1), τ^(k)), and its variation when τ^(k) is replaced by τ^(k+1) can be bounded by a multiple of |τ^(k+1) − τ^(k)|. To derive this result, we use the lower bound ε of τ and θ and the upper bounds h, c of the sequences {τ^(k)} and {‖x^(k)‖₂}, respectively. Based on the definition of the τ-update in (20), the partial derivative of J with respect to τ vanishes at (x^(k+1), θ^(k+1), τ^(k+1)). Finally, combining the above estimates, condition (A2) follows. We can now conclude that the following convergence result holds true.

Theorem 4.3. The sequence {(x^(k), θ^(k), τ^(k))} generated by algorithm 1 converges to a stationary point of the functional J defined in (14).

Computed examples
In this section, we evaluate the performance of the proposed hypermodel on different imaging inverse problems. More specifically, as a first example we consider a one-dimensional deconvolution problem and test the robustness of the numerical scheme employed for the solution of the non-convex problem (13)-(14) with respect to different initial choices of τ^(0) and θ^(0). Then, the proposed HS-L₂ hypermodel is tested on the restoration of different images, and compared with the popular TV-L₂ [40] and CEL0-L₂ [42] models as well as with the Gg-L₂ and Gig-L₂ hypermodels recalled in section 2. Finally, we consider the CT reconstruction problem, and compare our method with the aforementioned methods as well as with the filtered back projection (FBP) [28].
In the 2D tests, the quality of the output image x* with respect to the original uncorrupted image x̄ is evaluated by means of the structural similarity index (SSIM) [48] and the signal-to-noise ratio (SNR). The IAS iterations are stopped as soon as the relative decrease of the objective falls below a prescribed tolerance δ > 0, i.e.

|J^(k) − J^(k−1)| / |J^(k−1)| < δ,

where k ∈ N and J^(k) = J(x^(k), θ^(k), τ^(k)). The experiments have been performed using MATLAB R2019b under Windows 10 on an ASUS PC with a 2.50 GHz Intel Core i7-6500U processor and 8 GB of RAM.
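A compact driver loop implementing algorithm 1 with this stopping rule (a sketch, reusing the hypothetical helpers x_update, theta_update, tau_update and hs_l2_objective from the previous sections; the tolerance value is illustrative) could read:

```python
import numpy as np

def ias(A, b, theta0, tau0, delta=1e-6, max_iter=500, eps=1e-8):
    """Run IAS (algorithm 1) until the relative decrease of J falls below delta."""
    theta, tau = theta0.copy(), tau0
    J_old = np.inf
    for _ in range(max_iter):
        x = x_update(A, b, theta, tau)          # x-update, cf. (15)-(16)
        theta = theta_update(x, tau, eps)       # theta-update, cf. (19)
        tau = tau_update(x, theta, eps)         # tau-update, cf. (20)
        J_new = hs_l2_objective(x, theta, tau, A, b)
        if abs(J_new - J_old) < delta * abs(J_old):
            break                               # relative-decrease stopping criterion
        J_old = J_new
    return x, theta, tau
```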

One-dimensional signal deconvolution
We start with the problem of restoring a piecewise linear signal f: [0, 1] → R from a few data points generated as

b_i = ∫₀¹ A(s_i − t) f(t) dt + e_i, i = 1, . . . , m. (26)

Here, A represents the Gaussian blur kernel

A(t) = (1/√(2πw²)) exp( −t²/(2w²) ),

whose width is denoted by w > 0. In addition, e_i is a realization of a scaled white Gaussian noise process. In the test examples, we set w = 0.01 and consider two different noise standard deviations, namely 3% and 5% of the maximum of the noiseless signal.
To avoid inverse crime, the underlying original signal as well as the observed data are generated on a dense grid of n_dense = 1253 equidistant points in [0, 1]. In the forward model, however, we set n = 500 and m = 91, both on equidistant grids. We denote x_i = f(t_i) and show the original signal and the noisy data in figure 3.
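One possible construction of the discrete forward operator, discretizing the convolution in (26) with trapezoidal quadrature weights (a sketch with our own function name; the paper's exact discretization may differ), is:

```python
import numpy as np

def gaussian_blur_operator(m, n, w=0.01):
    """Trapezoidal discretization of the Gaussian convolution in (26) on [0, 1]."""
    s = np.linspace(0.0, 1.0, m)                  # observation points
    t = np.linspace(0.0, 1.0, n)                  # quadrature nodes
    h = t[1] - t[0]
    weights = np.full(n, h)
    weights[[0, -1]] = h / 2.0                    # trapezoidal end-point weights
    kernel = np.exp(-((s[:, None] - t[None, :]) ** 2) / (2.0 * w**2))
    kernel /= np.sqrt(2.0 * np.pi * w**2)
    return kernel * weights                       # A in R^{m x n}

A = gaussian_blur_operator(91, 500)               # m = 91, n = 500 as in the experiment
```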
One can easily observe that the original signal is not sparse; however, it admits a sparse representation z = Lx under the basis L of second-order increments, defined so that L ∈ R^{n×n} is nonsingular and (Lx)_i corresponds to the second-order increment x_i − 2x_{i−1} + x_{i−2} away from the left boundary (a sketch of one possible construction is given below). Therefore, instead of (1) we consider

b = AL⁻¹z + e,

with A ∈ R^{m×n} being the linear operator associated with the discretization of the integral in (26) with trapezoidal weights. Notice that the number of selected data points makes the above linear problem significantly underdetermined, with ratio m/n = 0.182. Due to the non-convexity of the cost functional J, the solution depends on the initialization of the numerical method. In the first test, we numerically assess the dependence of the proposed method on the initial choices of τ^(0) and θ^(0) = ϑ^(0)1_n, where ϑ^(0) > 0 and 1_n is the vector of all ones. For the two noise levels, we run algorithm 1 for τ^(0), ϑ^(0) ∈ {10⁻³, 10⁻¹, 10, 10³}.
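The following sketch builds one invertible realization of L (the boundary handling in the first two rows is our assumption; the paper's exact choice may differ):

```python
import numpy as np

def second_increment_matrix(n):
    """Nonsingular second-order increment operator L (a sketch).

    The first two rows store the boundary value and the first increment,
    the remaining rows the second-order increments of the signal."""
    L = np.zeros((n, n))
    L[0, 0] = 1.0                                 # z_1 = x_1
    L[1, 0], L[1, 1] = -1.0, 1.0                  # z_2 = x_2 - x_1
    for i in range(2, n):
        L[i, i - 2], L[i, i - 1], L[i, i] = 1.0, -2.0, 1.0  # z_i = x_i - 2x_{i-1} + x_{i-2}
    return L

# substituted model b = (A L^{-1}) z + e: sparsity is promoted directly on z
```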
In this test, we set the lower bound ε of θ and τ to 10⁻⁸. The results for the 3% noise level (σ = 0.0430) and the 5% noise level (σ = 0.0671) are shown in figures 4 and 5, respectively. The restored signal, plotted in red, is compared to the original signal, marked by the black solid line.
One can observe that when τ^(0), ϑ^(0) ≪ σ, our method converges to a poorly informative stationary point. However, when τ^(0), ϑ^(0) > σ, our method presents a robust behavior. The connection arising between prior-related quantities, such as τ^(0) and ϑ^(0), and a likelihood-related quantity, i.e. σ, is not surprising. In fact, it has already been observed that the knowledge of σ can be particularly informative for the parameter selection in the horseshoe prior; see, e.g. [22]. We also remark that in the MAP hierarchical Bayesian paradigm τ^(0) and σ are intrinsically related since, leaving aside the whitening of the data and of the forward model operator, they can be coupled so as to give the regularization parameter in the x-update problem (15).
To provide further insight on this point, figures 6 and 7 show a few snapshots of x^(k) and θ^(k) computed by the IAS with (τ^(0), ϑ^(0)) = (10, 10) and (τ^(0), ϑ^(0)) = (10⁻³, 10⁻³) under the 3% noise level. The selection of small initial values yields a significantly oversmoothed result at the very first iterations of the algorithm, as the regularization parameter imposes a strong regularization. On the other hand, considering larger τ^(0) and ϑ^(0) allows the IAS to identify the jumps from the very beginning, as the x-update takes greater advantage of the information encoded in the data.
Finally, with the settings τ^(0) = ϑ^(0) = 10 and the 3% noise level, we test the robustness of algorithm 1 with respect to mild and severe under- and over-estimation of the noise level. More specifically, the actual noise standard deviation σ, which, we recall, arises in the whitening process (9) and, subsequently, in the forward model operator A and in the data b, is replaced by σ̃ = κσ. In figure 8, we show the reconstructed signals for κ ∈ [0.3, 1.8]. We can see that our method is rather robust with respect to σ̃ when the estimated noise level is close to the true σ; otherwise, an under-estimation of the noise level is preferable to an over-estimation.

Image restoration
In this section, we test our method on restoring the test images qrcode (128 × 128) and geometric (160 × 160), with pixel values between 0 and 1. Both images are corrupted by blur and additive white Gaussian noise with standard deviation σ. The considered test images admit sparse representations in the basis of the horizontal and vertical increments. More specifically, we introduce z = Lx, with L being the matrix mapping the free nodal values collected in x to the increments along free edges in the vector z; a sketch of one such operator is given below. More details on L can be found in [15, section 4].
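A sparse sketch of such an increment operator for an h × w image stored column-wise (a simplified version without the free-node bookkeeping of [15, section 4], which may differ in detail) is:

```python
import numpy as np
from scipy.sparse import diags, eye, kron, vstack

def increment_operator_2d(h, w):
    """Stack horizontal and vertical first increments of an h-by-w image (a sketch)."""
    def diff(k):
        # forward differences between the k nodes along one direction
        return diags([-np.ones(k - 1), np.ones(k - 1)], [0, 1], shape=(k - 1, k))
    Dh = kron(diff(w), eye(h))          # increments between horizontally adjacent pixels
    Dv = kron(eye(w), diff(h))          # increments between vertically adjacent pixels
    return vstack([Dh, Dv]).tocsr()     # z = L x collects both increment fields
```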
We compare our method with the methods based on the Gg-L₂ and the Gig-L₂ hypermodels, which were proposed in [11,18] and derived from the Gg prior given by (2) together with (3) and from the Gig prior given by (2) with (4), respectively. In addition, we show the results of the popular TV-L₂ model [40]:

min_x (1/2)‖Ax − b‖₂² + µ TV(x), (27)

where TV(x) admits an isotropic and an anisotropic version,

TV_iso(x) = Σ_i ‖(Dx)_i‖₂, TV_aniso(x) = Σ_i ‖(Dx)_i‖₁, (28)

and (Dx)_i represents the discrete gradient at pixel i. In the tests, we consider the anisotropic TV(x) for the test image qrcode, since it presents edges lying along the horizontal and vertical directions, while the isotropic TV(x) is employed for the restoration of the test image geometric. The minimization problem in (27) is solved by means of the alternating direction method of multipliers (ADMM) [9]. Furthermore, we compare our method with the CEL0-L₂ model, which combines the continuous exact ℓ₀ penalty term (CEL0) [42] and the L₂ fidelity term:

min_x (1/2)‖Ax − b‖₂² + Σ_{i=1}^n ϕ_CEL0(a_i, µ; g_i(x)), (29)

where a_i ∈ R^m, i = 1, . . . , n, denotes the ith column of the matrix A. Note that the regularization parameter µ has been incorporated in the expression of the function ϕ_CEL0. Similarly to TV, the CEL0 penalty term is applied to enforce sparsity on the magnitudes g_i of the gradients defined as in (28). Also in this case, the anisotropic version of g_i is considered when restoring the qrcode test image, while for geometric the isotropic instance of g_i is adopted. To solve the minimization problem in (29), we apply the modified iterative reweighted ℓ₁ (IRL1) algorithm [36].
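For completeness, the CEL0 penalty of [42] can be evaluated pointwise as follows (a sketch; a holds the column norms ‖a_i‖₂ and t the gradient magnitudes g_i(x), as equally sized arrays):

```python
import numpy as np

def cel0_penalty(t, a, mu):
    """Pointwise CEL0 penalty phi_CEL0(a_i, mu; t_i) (a sketch).

    phi = mu - (a^2 / 2) * (|t| - sqrt(2 mu)/a)^2 for |t| <= sqrt(2 mu)/a,
    and phi = mu otherwise; it vanishes at t = 0 and is flat beyond the threshold."""
    thresh = np.sqrt(2.0 * mu) / a
    clipped = np.minimum(np.abs(t), thresh)   # beyond the threshold the penalty is constant
    return mu - 0.5 * a**2 * (clipped - thresh) ** 2
```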
For the parameter selection, the regularization parameter µ in the TV-L₂ and CEL0-L₂ models, as well as the scale and shape parameters in the Gg-L₂ and Gig-L₂ hypermodels, have been manually adjusted so as to maximize the SNR. In our HS-L₂ hypermodel, we set the lower bound ε for τ and θ to 10⁻⁵ and τ^(0) = ϑ^(0) = 10.
In table 1, we list the SNR and SSIM values achieved by applying the five methods to the two test images, with the largest quality measures reported in bold. We can see that in most cases our method outperforms the other methods in terms of SNR, except for the test image geometric with σ = 0.1, where CEL0-L₂ performs slightly better than HS-L₂. In terms of SSIM, however, our method returns significantly larger values in all cases. In figures 10 and 11, we show the restored images (left), together with the zoomed regions (middle) and the log₁₀ absolute-error images for the case σ = 0.1 on the two test images, respectively. From a visual inspection, the TV-L₂ model and the Gg-L₂ hypermodel have more difficulties in recovering sharp edges, as can be easily observed in the zoomed regions. This can be mainly ascribed to the convexity of the cost functions, which does not guarantee a significant sparsity promotion. When applying the non-convex CEL0-L₂, Gig-L₂ and HS-L₂ hypermodels, one can clearly see that the recovered edges are sharper. Nonetheless, as highlighted by the corresponding absolute-error images, CEL0-L₂ is less capable of preserving the original contrast of the image than the other two. For the test image geometric, we notice that HS-L₂ returns slightly sharper edges and preserves the contrast better than Gig-L₂.
In the hypermodels, the hyperparameter λ corresponds to the standard deviation of the horizontal and vertical increments, and we expect larger values at the edges. In figure 12, we show the log₁₀ plots of λ returned by the Gg-L₂, Gig-L₂ and HS-L₂ models for the two test images in the case σ = 0.1. Notice that, as discussed in section 2, for the HS prior we have λ = τθ. We can see that the range of λ from Gg-L₂ is concentrated on smaller values compared with those from the other two models, meaning that the capability of Gg-L₂ of restoring sharp edges is limited with respect to the others. On the other hand, the range of λ from Gig-L₂ is located at larger values than that of HS-L₂, meaning that Gig-L₂ may be slightly less powerful than HS-L₂ in restoring the 0-increment regions, i.e. the homogeneous regions.

CT image reconstruction
We now consider a CT reconstruction problem where the forward model operator A ∈ R m×n in (1) represents the discrete approximation of the Radon transform and m = pq with p, q denoting the number of detectors and the number of view angles, respectively.
The HS-L₂ hypermodel is compared with FBP [31], CEL0-L₂, TV-L₂, Gg-L₂ and Gig-L₂. We remark that for the CT reconstruction problem an additional constraint on the non-negativity of the processed image has to be incorporated in the models. For TV-L₂ and CEL0-L₂, the constraint x ⩾ 0 can be easily handled by considering an additional auxiliary variable in the ADMM framework, while for the hierarchical models we rely on [16], where the authors have shown that, to add a bound constraint to the IAS algorithm, it suffices to project x onto the feasible set, which in our setting is the non-negative orthant; see the sketch below.
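In code, the constrained IAS iteration only changes by one projection after the x-update (a sketch reusing the hypothetical x_update from section 4.1):

```python
import numpy as np

def x_update_nonneg(A, b, theta, tau):
    """x-update followed by projection onto the non-negative orthant (a sketch)."""
    x = x_update(A, b, theta, tau)     # unconstrained least-squares update, cf. (16)
    return np.maximum(x, 0.0)          # projection enforcing x >= 0
```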
We first consider the CT reconstruction of the shepp-logan phantom (150 × 150) shown in figure 13(a), which is not sparse but admits sparse horizontal and vertical increments. We use AIR Tools II [31] to generate the data with a parallel-beam geometry with q = 45 equidistant view angles between 0° and 180° and p = 170 detector pixels. The corresponding linear forward operator A is a pq-by-n matrix with pq = 7650 and n = 22 500, i.e. the ratio m/n of the reconstruction problem is 0.34. The acquired sinogram has been further corrupted by additive white Gaussian noise with standard deviation equal to 2% and to 6% of the maximum of the sinogram. In the numerical tests, we keep the same parameter selection as in the previous section. In figure 14 we show the reconstructed images for σ = 6%, together with a selected close-up and the absolute-error image. The FBP reconstruction suffers from the limited number of view angles as well as from the presence of Gaussian noise in the sinogram. TV-L₂ and Gg-L₂ show similar performance in terms of edge preservation, with the latter being more effective in removing noise artifacts. The reconstructions from CEL0-L₂ and Gig-L₂ present sharp edges; however, the corresponding absolute-error images clearly suggest that both methods have difficulties in preserving the original contrast. The reconstruction from HS-L₂ is characterized by a very homogeneous background as well as by sharp edges and the lowest contrast loss.
In the second example, we consider the 200 × 200 test image grains, shown in figure 13(b), from AIR Tools II in a limited-angle scenario, where we consider q = 25 view angles between −40° and 40° around the vertical direction and p = 240 detector pixels. In this case the forward operator A is in R^{6000×40000}, and the ratio m/n is 0.15. As in the previous case, 2% and 6% Gaussian noise realizations are added to the simulated data. The CT reconstructions for the larger noise level are shown in figure 15, whence one can see that one of the main difficulties is the low contrast of the target image. This issue is particularly relevant for the TV-L₂ model (see figure 15(d)), which is known to suffer from contrast loss, as well as for the Gig-L₂ hypermodel (see figure 15(j)). However, the Gg-L₂, CEL0-L₂ and HS-L₂ models are more capable of preserving edges, with the latter presenting the sharpest profiles. In table 2, we list the SNR and SSIM values achieved by the compared methods in both examples. Notice that HS-L₂ outperforms the other methods in most cases. The only exception is shepp-logan with the lower noise level, where the SNR value of HS-L₂ is lower than that of CEL0-L₂, while the SSIM values are comparable. In the other cases, our method significantly outperforms the other methods in terms of SSIM, especially in the very challenging example on the grains phantom.
Finally, the output hyperparameters λ for the Gg-L₂, Gig-L₂ and HS-L₂ models, corresponding to the vertical and horizontal increments, are shown in figure 16 for σ = 6%. As in the image restoration examples, we can draw the same conclusion, i.e. HS-L₂ tends to perform better in reconstructing sharp edges as well as homogeneous regions.

Conclusions
The present work discusses the properties and the potential of adopting a horseshoe prior in the context of hierarchical Bayesian models for sparse recovery problems. The proposed hypermodel has proven to be particularly effective in preserving the informative structures of the image while smoothing out unwanted noise and artifacts. Although the employed alternating scheme has been proven to converge to a stationary point, the non-convexity of the HS-L₂ hypermodel remains a relevant issue. As a future research direction, we aim to assess the quality of the MAP estimate by comparing it with the posterior mean, which can be computed via suitable sampling methods [47].

Data availability statement
No new data were created or analyzed in this study.