On an iteratively reweighted linesearch based algorithm for nonconvex composite optimization

In this paper we propose a new algorithm for solving a class of nonsmooth nonconvex problems, which is obtained by combining the iteratively reweighted scheme with a finite number of forward–backward iterations based on a linesearch procedure. The new method overcomes some limitations of linesearch forward–backward methods, since it can be applied also to minimize functions containing terms that are both nonsmooth and nonconvex. Moreover, the combined scheme can take advantage of acceleration techniques consisting in suitable selection rules for the algorithm parameters. We develop the convergence analysis of the new method within the framework of the Kurdyka–Łojasiewicz property. Finally, we present the results of a numerical experience on microscopy image super-resolution, showing that the performance of our method is comparable or superior to that of other algorithms designed for this specific application.


Introduction
Variational models offer a great tool for solving a variety of inverse problems in many fields of application, such as imaging and machine learning, where an explicit direct solution is seldom available. Through a Bayesian argument, they lead to computing the solution of the inverse problem by solving a problem of the following form:

min_{x ∈ R^n} f(x) := f_0(x) + f_1(x).  (1)

Problem (1) is often referred to as a composite optimization problem. On the one hand, f_0 represents a fidelity term, which serves the purpose of evaluating the quality of the solution with respect to the data at hand. On the other hand, f_1 is the regularization function, which is related to the prior probability of the unknown, and is used to impose additional information that is believed to be valid in advance. More generally, here and in the following we will simply assume that f_0 contains only smooth terms, coming from the Bayesian paradigm, while all the nonsmooth ones are included in f_1.
Problem (1) becomes especially difficult to handle, from the theoretical and numerical point of view, when either f_0 and/or f_1 are nonconvex. In these cases, the best we can expect from a descent algorithm is the convergence to a stationary point, i.e. a point satisfying the necessary optimality condition, instead of a global or even a local minimizer. The most difficult case in the analysis and implementation of optimization algorithms is when f_1 is nonconvex, i.e. when the objective function contains some terms that are both nonconvex and nonsmooth. When f_1 is convex, that is all the nonsmooth terms are convex, problem (1) can be easily tackled by forward-backward optimization methods [1][2][3][4][5]. The forward step performs a gradient descent on the smooth part of the objective function, and then feeds the result to the backward one, which takes advantage of the convexity of the nonsmooth term to compute the proximal operator, also known as the resolvent. More specifically, the proximal operator associated to f_1 is defined as

prox^D_{α f_1}(z) = argmin_{y ∈ R^n} f_1(y) + (1/(2α)) ∥y − z∥²_D,  (2)

where α > 0 is the steplength parameter, D is a symmetric and positive definite matrix, and ∥·∥_D denotes the norm with respect to the metric induced by the matrix D. A quite general forward-backward scheme can be summarized in the following two steps:

y^(k) = prox^{D_k}_{α_k f_1}(x^(k) − α_k D_k^{−1} ∇f_0(x^(k))),  (3)
x^(k+1) = x^(k) + λ_k (y^(k) − x^(k)),  (4)

where λ_k ∈ (0, 1] is a linesearch (or relaxation) parameter.
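To make the scheme concrete, the following minimal Python sketch implements steps (3) and (4) with the Euclidean metric D_k = I_n and fixed parameters; the problem instance (a LASSO-type objective) and all names are illustrative, not taken from the paper.

```python
import numpy as np

def forward_backward(x0, grad_f0, prox_f1, alpha, lam=1.0, n_iter=100):
    # Steps (3)-(4) with D_k = I_n, fixed steplength alpha and relaxation lam
    x = x0.copy()
    for _ in range(n_iter):
        y = prox_f1(x - alpha * grad_f0(x), alpha)  # forward + backward step (3)
        x = x + lam * (y - x)                       # relaxed update (4)
    return x

# Illustrative instance: f0(x) = 0.5*||Ax - b||^2, f1(x) = eta*||x||_1
rng = np.random.default_rng(0)
A, b, eta = rng.standard_normal((20, 50)), rng.standard_normal(20), 0.1
grad_f0 = lambda x: A.T @ (A @ x - b)
soft = lambda z, a: np.sign(z) * np.maximum(np.abs(z) - a * eta, 0.0)  # prox of a*eta*|.|
x_hat = forward_backward(np.zeros(50), grad_f0, soft,
                         alpha=1.0 / np.linalg.norm(A, 2) ** 2)
```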
Many algorithms can be cast in this framework, with each one exploiting the three parameters α_k, D_k and λ_k in different ways to achieve convergence and effectiveness. Besides the standard choices for α_k and λ_k as fixed values related to the Lipschitz constant of ∇f_0 [1,2], adaptive procedures for computing these steplength parameters can be included. Regarding the steplength selection, we may adopt backtracking (adaptive computation of α_k with λ_k = 1 [6,7]), or linesearch approaches (adaptive computation of λ_k for a given α_k > 0 [3][4][5]). As concerns the choice of the matrix D_k, we mention the approach in [3], where the matrix D_k must be selected in order to define a quadratic majorant of the objective function, and the linesearch based approach in [4], where both the matrix D_k and the steplength α_k can be considered as almost entirely free parameters, which can be exploited to improve the practical performance of the method. The convergence analysis of forward-backward methods has also received much attention in recent years. Besides the classical results obtained under convexity assumptions on both f_0 and f_1 [1,2], more general results have been obtained by assuming that the objective function f satisfies the Kurdyka-Łojasiewicz property, for several different choices of the algorithm parameters [3,[7][8][9][10][11].
On the other side, the case when f_1 is nonsmooth and nonconvex is more difficult to handle in a forward-backward scheme, from both the theoretical and practical point of view. A simple example of these difficulties is the fact that, when f_1 is nonconvex, the minimization problem underlying the resolvent operator may not have a unique solution. However, variational models including nonsmooth nonconvex terms are receiving increasing interest, since they allow the modelling of more accurate noise statistics [12,13] and the recovery of sparse solutions, which are typically promoted by including an ℓ_0 penalization (or a continuous, nonconvex approximation of it) in the objective function [14][15][16]. The second kind of problem is especially relevant in the field of single-molecule localization microscopy (SMLM) [17][18][19], where the object represented in the acquired images is typically very sparse. From the theoretical viewpoint, iterations (3) and (4) can still be safely applied when f_1 is nonconvex and nonsmooth [8,20,21]; however, to the best of our knowledge, all the related convergence analyses require that λ_k = 1 for all k, thus limiting the possibility to adaptively compute the algorithm parameters.
The iteratively reweighted methods, whose idea dates back to [22], have been developed in recent years as a tool to solve (1) when the (possibly) nonsmooth term f_1 can be written as the composition of a nonconvex smooth function with a convex one. The main idea underlying these methods consists in replacing problem (1) with a sequence of convex subproblems, whose objective function is a majorant (or surrogate) of the original one [23]. In this sense, the iteratively reweighted methods are strictly related to majorization-minimization methods [3,12] and they require the minimization of the surrogate function at each iteration. In the recent literature, this kind of method has been applied to the minimization of the ℓ_p norm with p < 1 [24] and, in particular, to the case p = 0 [14,25,26]. In the last few years, the analysis has been extended also to more general problems, within the framework of the Kurdyka-Łojasiewicz property [23,27].

Contribution
In this paper we propose a new algorithm that combines the basic iteratively reweighted scheme with a forward-backward linesearch based procedure. In particular, we focus on problem (1) where

f_1(x) = Σ_{j=1}^J (ϕ_j ∘ ψ_j)(x) + χ(x),  (5)

with ψ_j : R^n → R convex and lower semicontinuous, ϕ_j : R → R concave, nondecreasing and continuously differentiable, j = 1, ..., J, and χ : R^n → R convex and lower semicontinuous. The algorithm iteration is derived by a two-step procedure: first, a majorizer for f_1 is obtained by exploiting the concavity of the ϕ_j; then, the surrogate problem is approximately solved by a finite, bounded number of steps of a linesearch forward-backward method. We develop the convergence analysis of the new method within the framework of the Kurdyka-Łojasiewicz property. Even if the assumptions needed for the theoretical analysis, which will be detailed later, prevent us from directly handling the ℓ_p norms with p < 1, our approach can still be applied to several interesting cases. In particular, we will apply our algorithm to the regularization proposed in [16], named CEL0, as an approximation of the ℓ_0 norm for image super-resolution in microscopy. We will show that this functional can be set in the form (5), and we will present the results of a numerical experience in this framework, comparing our proposed algorithm to other iteratively reweighted methods presented in the literature for this specific application [16,17,28].

Related work
The idea of plugging forward-backward iterations into an outer iteratively reweighted / majorization-minimization scheme is proposed also in [27]. Although the theoretical settings are very similar, in our approach we exploit a quite different forward-backward iteration based on a descent direction and a linesearch procedure along it [4,9]. This allows greater freedom in the choice of the parameters α_k and D_k, which, in turn, can lead to improved practical performance. In this way we can introduce acceleration techniques through suitable steplength and scaling selection strategies, such as the Barzilai-Borwein rules for α_k [29] and the Split Gradient technique for D_k [30]. Moreover, we generalize the analysis in such a way that the proximity operator of the convex part of the surrogate problem can be computed inexactly according to an implementable inexactness condition, in the same spirit as in [9].

Paper organization
The outline of the paper is as follows. In section 2 we recall preliminary results regarding subdifferential calculus and the inexact computation of the proximal operator that are needed later. Section 3 is dedicated to our proposed algorithm: in 3.1 we state the problem of interest under some appropriate assumptions on the involved functions; in 3.2 the proposed iteratively reweighted variable metric inexact linesearch algorithm (IR-VMILA) is reported and explained in detail; in 3.3 we provide an insightful analysis of its convergence properties. Finally, in section 4 we report the results of our numerical testing, with 4.1 and 4.2 being focused on image deblurring toy problems, while 4.3 illustrates a more interesting application to super-resolution in microscopy where a nonconvex regularizer is employed. Some concluding remarks are offered in section 5.

Notations
In this paper we will adopt a discrete setting and the reference space is R^n. The symbol ⟨·, ·⟩ denotes the standard inner product on R^n, i.e. ⟨x, y⟩ = x^T y for all x, y ∈ R^n. We denote with ∥·∥ the Euclidean norm of a vector, i.e. ∥x∥ = √⟨x, x⟩ for all x ∈ R^n, whereas the symbol ∥·∥_D indicates the norm induced by the symmetric positive definite matrix D, that is ∥x∥_D = √⟨x, Dx⟩. The symbol S(R^n) denotes the set of all n × n symmetric matrices, whereas S_{++}(R^n) is the set of all n × n symmetric positive definite matrices. Furthermore, we employ the notation I_n ∈ R^{n×n} to denote the identity matrix of order n. On the set S(R^n), we consider the Loewner partial ordering relation, which is defined as follows: given D_1, D_2 ∈ S(R^n), we write D_1 ⪰ D_2 if and only if D_1 − D_2 is positive semidefinite. We denote by R̄ the extended real line, i.e. R̄ = R ∪ {+∞, −∞}. Given a function f : R^n → R̄, its domain is defined as dom(f) = {x ∈ R^n : f(x) < +∞}. If F is a set-valued mapping defined on R^n, its domain is defined as dom(F) = {x ∈ R^n : F(x) ≠ ∅}. Given f : R^n → R̄ and l, u ∈ R̄, with l ⩽ u, we adopt also the following notation for (sub)level sets: [l ⩽ f ⩽ u] = {x ∈ R^n : l ⩽ f(x) ⩽ u}, with the shorthands [f ⩽ u] and [l < f < u] defined accordingly. The distance of a point x ∈ R^n from a set Ω ⊆ R^n is defined as dist(x, Ω) = inf_{y∈Ω} ∥x − y∥. The relative interior of a convex set Ω is the interior of Ω relative to the smallest affine set aff(Ω) containing Ω, i.e. relint(Ω) = {x ∈ Ω : ∃ ϵ > 0 s.t. B(x, ϵ) ∩ aff(Ω) ⊆ Ω}. Finally, given a set A ⊆ R^n, the characteristic function 1_A is such that 1_A(x) = 1 if x ∈ A, and 1_A(x) = 0 otherwise.

Subdifferential calculus
In this section we summarize some basic results concerning the subdifferential calculus of (possibly nonconvex) functions. Most of the definitions and results reported here are from [31,32]. We start by reporting the definitions of regular vector and regular function. The latter concept is of paramount importance for deriving the subdifferential calculus rules required for our convergence analysis.
The limiting-subdifferential (or simply subdifferential) of f at x is defined as

∂f(x) = {v ∈ R^n : ∃ x^(t) → x with f(x^(t)) → f(x), v^(t) ∈ ∂̂f(x^(t)) with v^(t) → v},

where ∂̂f denotes the regular (Fréchet) subdifferential. Finally, the horizon-subdifferential of f at x is given by

∂^∞ f(x) = {v ∈ R^n : ∃ x^(t) → x with f(x^(t)) → f(x), v^(t) ∈ ∂̂f(x^(t)) and λ_t ↓ 0 with λ_t v^(t) → v}.

Remark 1. Given f : R^n → R̄, we have that any (local) minimum point x ∈ R^n of f is a stationary point, i.e. 0 ∈ ∂f(x), while the converse may not be true in general. If f is also convex, then x is a (local) minimum point if and only if x is a stationary point.
In the following lemma, we state a useful closedness property satisfied by the limiting subdifferential. Such a property will be employed to show that the limit points of our proposed algorithm are stationary, as similarly done in other works in the literature [3,27].
We now report some basic subdifferential calculus rules under appropriate convexity or regularity assumptions on the involved functions.
Lemma 2.

(i) [31, exercise 8.8] Let f = f_0 + ψ, where f_0 is continuously differentiable on a neighbourhood of x and ψ : R^n → R̄ is proper and lower semicontinuous. Then ∂f(x) = ∇f_0(x) + ∂ψ(x).

(ii) [32, p 222] Let ψ : R^n → R̄ be proper, convex, lower semicontinuous, and λ ⩾ 0. If x ∈ dom(∂ψ), then ∂(λψ)(x) = λ ∂ψ(x).

(iii) [32, theorem 23.8] Let ψ_j : R^n → R̄ be proper, convex, and lower semicontinuous for all j = 1, ..., J, and define ψ = ψ_1 + ··· + ψ_J. If relint(dom(ψ_i)) ∩ relint(dom(ψ_j)) ≠ ∅ for all i, j = 1, ..., J, then for all x ∈ ∩_{j=1}^J dom(∂ψ_j) there holds ∂ψ(x) = ∂ψ_1(x) + ··· + ∂ψ_J(x).

(iv) [31, corollary 10.9] Let f_j : R^n → R̄ be proper and lower semicontinuous for all j = 1, ..., J, define f = f_1 + ··· + f_J, and let x ∈ dom(f). Suppose that the only combination of vectors v_j ∈ ∂^∞ f_j(x) such that v_1 + ··· + v_J = 0 is v_j = 0 for all j = 1, ..., J, and that each f_j is regular at x. Then, also f is regular at x, and ∂f(x) = ∂f_1(x) + ··· + ∂f_J(x).

The next lemma provides the chain rule for functions of the form (5) and will be employed in the subsequent analysis. The rule can be derived by collecting and putting together some known results of variational analysis contained in [31]. Similar results have been obtained also in [23,27] under slightly different conditions on the functions of interest.
is Clarke regular at x; indeed, since each ψ j is convex and lower semicontinuous, and ϕ ′ j (ψ j (x)) ⩾ 0 due to the monotonicity of ϕ j , it turns out that also s is convex and lower semicontinuous, which implies that s is regular [31, example 7.27].
Hence, since properties (i)-(iii) are satisfied, we can apply the chain rule provided in [31, theorem 10.49], and thus conclude that ϕ ∘ Ψ is regular at x and that

∂(ϕ ∘ Ψ)(x) = ∂(Σ_{j=1}^J ϕ′_j(ψ_j(x)) ψ_j)(x) = Σ_{j=1}^J ϕ′_j(ψ_j(x)) ∂ψ_j(x),  (7)

where the first equation is due to the application of [31, theorem 10.49], whereas the second one follows from lemma 2(ii) and (iii), which are applicable due to the fact that each ψ_j is convex and lower semicontinuous, and ϕ′_j(ψ_j(x)) ⩾ 0. Now, we turn to the computation of ∂(ϕ ∘ Ψ + χ)(x). First, observe that ϕ ∘ Ψ is locally Lipschitz continuous at x, since it is the composition of two locally Lipschitz functions at x, as ϕ is continuously differentiable and Ψ is convex and finite [31, example 9.14]. Second, χ is locally Lipschitz continuous on its domain by assumption. Then, we can apply [31, theorem 9.13] and deduce that ∂^∞(ϕ ∘ Ψ)(x) = ∂^∞χ(x) = {0}. Furthermore, both ϕ ∘ Ψ and χ are regular functions at x. Then, we can employ lemma 2(iv) with J = 2, f_1 = ϕ ∘ Ψ, f_2 = χ, together with (7), to get the thesis.
As the objective function in (1) may be nonconvex, we need to replace convexity with a different analytical property that still allows us to ensure convergence of the proposed method. Several works in the literature have compensated for the lack of convexity by assuming that the objective function satisfies the so-called Kurdyka-Łojasiewicz inequality [8,20,33].
If f satisfies the KL inequality for all x * ∈ dom(∂f), then f is called a KL function.
The KL inequality is satisfied on the entire domain by indicator functions of semi-algebraic sets, real polynomials, p−norms and, in general, semi-algebraic functions or real analytic functions (see [20] and references therein for more details). In practice, most objective functions arising in imaging inverse problems do comply with the KL assumption.
In the upcoming convergence analysis, it will be convenient to employ the following uniformized version of the KL inequality, which holds whenever the function satisfies (8) over a compact set upon which it is constant. [20, lemma 6] Let f : R^n → R̄ be a proper, lower semicontinuous function and Ω* ⊆ R^n a compact set. Suppose that f satisfies the KL inequality at each point of Ω*, and that f is constant over Ω*, i.e. f(x*) = f* for all x* ∈ Ω*. Then, there exist ρ, ν > 0 and a function ξ : [0, ν) → [0, +∞) satisfying the same properties as in definition 5 such that

ξ′(f(x) − f*) dist(0, ∂f(x)) ⩾ 1 for all x ∈ B*,

where the set B* ⊆ R^n is given by

B* = {x ∈ R^n : dist(x, Ω*) < ρ} ∩ [f* < f < f* + ν].

Inexact computation of a proximal-gradient point
The aim of this section is to provide the setting for the forward-backward iterations employed in our iteratively reweighted scheme. In particular, we will include the possibility of inexactly computing the proximal operator in the backward step, according to an implementable inexactness criterion. To this end, we follow the discussion in [9, section 3], reporting some useful results that will be needed in the next section. Let f_0 : Ω_0 → R and ℓ : R^n → R̄, where f_0 is continuously differentiable with L-Lipschitz continuous gradient on the open set Ω_0 ⊆ R^n with dom(ℓ) ⊆ Ω_0, while ℓ is proper, convex, and lower semicontinuous. Given x ∈ dom(ℓ), we introduce the function

h_γ(y; x) = ⟨∇f_0(x), y − x⟩ + ℓ(y) − ℓ(x) + (γ/(2α)) ∥y − x∥²_D,  (11)

where γ ∈ [0, 1], α ∈ [α_min, α_max] being 0 < α_min ⩽ α_max, and D ∈ R^{n×n} such that (1/μ)I_n ⪯ D ⪯ μI_n for some μ ⩾ 1. Denoting h(·; ·) := h_1(·; ·), we observe that h_γ(y; x) ⩽ h(y; x) for all y ∈ R^n, and that h_γ(x; x) = 0. Thanks to the assumptions on α and D, we also have that h(·; x) is a strongly convex function with modulus 1/(α_max μ), that is

h(y; x) ⩾ h(z; x) + ⟨w, y − z⟩ + (1/(2 α_max μ)) ∥y − z∥² for all y, z ∈ dom(ℓ), w ∈ ∂h(z; x),  (12)

where ∂h denotes the (convex) subdifferential with respect to the first argument. Therefore, h(·; x) admits a unique minimum point ŷ ∈ R^n, which we call the proximal-gradient point. Indeed, using the definition of proximal operator in (2), remark 1, and lemma 2(i), we easily deduce that ŷ is the point obtained by applying a proximal-gradient step to the function f_0 + ℓ at the point x, namely

ŷ = argmin_{y ∈ R^n} h(y; x) = prox^D_{αℓ}(x − α D^{−1} ∇f_0(x)).  (13)

We are now interested in the following notion of inexact proximal-gradient point.
Furthermore, we denote any such approximation of the proximal-gradient point with the notation ỹ ≈ ŷ.
Remark 3. By combining the definition of h_γ(·; x) in (11) and the fact that γ ∈ [0, 1] (first inequality), the inexactness condition (14) (second inequality), the fact that ŷ is the minimum point of h(·; x) (third inequality), and the definition of h(·; x), which yields h(x; x) = 0 (last equality), we obtain a chain of inequalities from which we can conclude that

h_γ(ỹ; x) ⩽ 0.  (15)

The inexactness criterion (14) has been investigated and employed in several works (see e.g. [4,5,9,34]). Unlike other conditions available in the literature [3,8,27], it has the advantage of being practically implementable in certain cases of interest. In particular, if ℓ is the composition of a linear operator with a convex function, then we can compute a point ỹ satisfying (14) by first applying an iterative method to the dual problem of (13), and then stopping the iterates when the primal-dual gap satisfies an appropriate stopping condition (see [9, section 3.1] for further details on this practical procedure).
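As an illustration of this duality-based procedure, the following Python sketch computes an inexact proximal point for ℓ(y) = ∥Dy∥_1 (a 1D total variation term) by projected ascent on the dual problem, stopping when the primal-dual gap is small. This is a simplified stand-in for the criterion (14): the operator, step size, and tolerances are our own illustrative choices.

```python
import numpy as np

def prox_tv1d(z, alpha, tol=1e-8, max_iter=1000):
    # Inexact prox of l(y) = ||D y||_1, D = 1D forward differences, via the dual:
    #   max_{||p||_inf <= 1}  <D^T p, z> - (alpha/2) ||D^T p||^2,
    # with primal point y(p) = z - alpha * D^T p. The duality gap is used as an
    # implementable stopping rule, in the same spirit as condition (14).
    n = z.size
    D = np.diff(np.eye(n), axis=0)                  # (n-1) x n difference matrix
    p = np.zeros(n - 1)
    step = 1.0 / (4.0 * alpha)                      # 4 >= ||D D^T||, so this is 1/L
    for _ in range(max_iter):
        y = z - alpha * (D.T @ p)                   # primal point from the dual one
        p = np.clip(p + step * (D @ y), -1.0, 1.0)  # projected dual ascent step
        y = z - alpha * (D.T @ p)
        primal = np.sum((y - z) ** 2) / (2 * alpha) + np.abs(D @ y).sum()
        dual = (D.T @ p) @ z - 0.5 * alpha * np.sum((D.T @ p) ** 2)
        if primal - dual <= tol:                    # gap controls the inexactness
            break
    return y
```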
To conclude this section, we recall some inequalities that involve the points ỹ, ŷ, and x.

Problem formulation
From now on, we address the following minimization problem:

min_{x ∈ R^n} f(x) := f_0(x) + f_1(x),  (21)

and make the following assumptions on the functions f_0, f_1.

Assumption 1.
(i) f_0 : Ω_0 → R is continuously differentiable with L-Lipschitz continuous gradient on the open set Ω_0 ⊆ R^n, with dom(f_1) ⊆ Ω_0.
(ii) f_1 : R^n → R̄ is of the form

f_1(x) = Σ_{j=1}^J (ϕ_j ∘ ψ_j)(x) + χ(x),  (22)

where:
• ψ_j : R^n → R is convex and lower semicontinuous, and ϕ_j : R → R̄ is concave, monotone nondecreasing, continuously differentiable on an open set Ω_j with ψ_j(R^n) ⊆ Ω_j ⊆ R, and with ϕ′_j locally Lipschitz continuous on Ω_j, for all j = 1, ..., J;
• χ : R^n → R̄ is proper, convex, lower semicontinuous, and locally Lipschitz continuous on its domain.
(iii) f = f_0 + f_1 is coercive.

Remark 4.
(i) The general model (22) is especially suited for the nonconvex nonsmooth regularization arising in imaging problems. On the one hand, the term Σ_{j=1}^J (ϕ_j ∘ ψ_j)(x) encompasses several nonconvex regularizers employed in the literature, such as the log-sum penalty term, the Cauchy penalty term, and smoothed versions of the ℓ_p norm with p < 1 (see [23,27] and our section 4 for other examples). On the other hand, the function χ may represent an easy constraint on x. For instance, we may enforce nonnegativity in the solution by setting χ = ι_{⩾0}, where ι_{⩾0} denotes the indicator function of the nonnegative orthant. This constraint is especially relevant in imaging applications.
(ii) Since the functions ψ_j, j = 1, ..., J, are convex and finite on R^n, it follows that they are locally Lipschitz continuous [31, example 9.14]. Hence, each subdifferential ∂ψ_j is locally bounded [31, theorem 9.16]. Therefore, for any compact set K ⊆ R^n there exists ρ_K > 0 such that ∥u∥ ⩽ ρ_K for all u ∈ ∂ψ_j(x), x ∈ K, j = 1, ..., J.
(iii) Being ϕ′_j and ψ_j both locally Lipschitz continuous, we have that ϕ′_j ∘ ψ_j is Lipschitz continuous on every compact subset of R^n. This means that, for any compact set K ⊆ R^n, there exists L_K > 0 such that |ϕ′_j(ψ_j(x)) − ϕ′_j(ψ_j(z))| ⩽ L_K ∥x − z∥ for all x, z ∈ K, j = 1, ..., J.

For our convergence analysis, we will require an additional assumption on the objective function, namely that it satisfies the KL inequality on its domain. Such an assumption will be needed only for the final convergence result, hence we state it in the next section.

The algorithm
In this section, we introduce our proposed method to solve problem (21), which is denoted as IR-VMILA.
At each outer iteration k ⩾ 0, IR-VMILA defines a convex majorant q^(k) for the (possibly nonconvex) function f_1, which is obtained by linearizing the functions ϕ_j, j = 1, ..., J, around the current iterate x^(k). Then, the next iterate x^(k+1) is computed through the inner iterates x^(k,0), x^(k,1), ..., x^(k,I_k), which are generated by applying a (uniformly bounded) number I_k of iterations of a variable metric linesearch based algorithm to the minimization of the function f_0 + q^(k). In order to better describe the steps of IR-VMILA, we introduce some further notation. For all k ⩾ 0 and j = 1, ..., J, we define the following function:

q_j^(k)(x) = ϕ_j(ψ_j(x^(k))) + ω_j^(k) (ψ_j(x) − ψ_j(x^(k))), with ω_j^(k) = ϕ′_j(ψ_j(x^(k))).

Since each ϕ_j is concave, it follows that q_j^(k)(x) ⩾ (ϕ_j ∘ ψ_j)(x) for all x ∈ R^n, with equality holding at x = x^(k). As a result, for all k ⩾ 0, the following function

q^(k)(x) = Σ_{j=1}^J q_j^(k)(x) + χ(x)

is a majorant of f_1 at the point x^(k). For the sake of simplicity, it will be convenient to consider also the function

ℓ^(k)(x) = Σ_{j=1}^J ω_j^(k) ψ_j(x) + χ(x).  (25)

Note that ℓ^(k) differs from q^(k) by an additive constant, i.e. q^(k)(x) = ℓ^(k)(x) + Σ_{j=1}^J (ϕ_j(ψ_j(x^(k))) − ω_j^(k) ψ_j(x^(k))).
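As a small sanity check of the majorization step, the following Python sketch builds the weights and the linearized majorant q_j^(k) for the log-sum penalty ϕ(t) = log(1 + t/δ), ψ(x) = |x| (one of the regularizers mentioned in remark 4), and verifies the tangent-majorant property numerically; δ and the test points are arbitrary illustrative values.

```python
import numpy as np

delta = 0.1
phi = lambda t: np.log(1.0 + t / delta)
dphi = lambda t: 1.0 / (delta + t)        # phi' > 0: phi is concave and increasing
psi = np.abs

xk = np.array([-0.5, 0.0, 1.2])           # current outer iterate x^(k)
omega = dphi(psi(xk))                     # weights omega_j^(k) = phi_j'(psi_j(x^(k)))

x = np.linspace(-2.0, 2.0, 401)           # grid with spacing 0.01
for xkj, om in zip(xk, omega):
    # q_j^(k)(x) = phi(psi(x^(k))) + omega_j^(k) * (psi(x) - psi(x^(k)))
    q = phi(psi(xkj)) + om * (psi(x) - psi(xkj))
    assert np.all(q >= phi(psi(x)) - 1e-12)   # majorization everywhere...
    i = int(round((xkj + 2.0) / 0.01))        # grid index of x^(k)_j
    assert np.isclose(q[i], phi(psi(x[i])))   # ...with equality at x^(k)
```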
In the context of iteratively reweighted algorithms, the parameters ω_j^(k), j = 1, ..., J, are usually called weights. For all k ⩾ 0 and i = 0, 1, ..., I_k − 1, given α_{k,i} > 0 and D_{k,i} ∈ S_{++}(R^n), we define the following functions:

h_γ^(k,i)(y) = ⟨∇f_0(x^(k,i)), y − x^(k,i)⟩ + ℓ^(k)(y) − ℓ^(k)(x^(k,i)) + (γ/(2α_{k,i})) ∥y − x^(k,i)∥²_{D_{k,i}},

with h^(k,i) := h_1^(k,i). Finally, the unique minimum point of h^(k,i)(·) will be denoted as

ŷ^(k,i) = prox^{D_{k,i}}_{α_{k,i} ℓ^(k)}(x^(k,i) − α_{k,i} D_{k,i}^{−1} ∇f_0(x^(k,i))).  (27)

We are now ready to detail the proposed method IR-VMILA in algorithm 1. At each outer iteration k ⩾ 0, we first define the majorant function q^(k) of f_1 around the current iterate x^(k), by updating the weights ω_j^(k) according to (25) (Step 1). Then, we initialize the inner routine for inexactly solving the kth subproblem min_{x ∈ R^n} f_0(x) + q^(k)(x) by setting x^(k,0) = x^(k) (Step 2). Next, the inner routine starts. At each inner iteration i = 0, 1, ..., I_k − 1, we select a steplength parameter α_{k,i} > 0 and a scaling matrix D_{k,i} ∈ S_{++}(R^n) (Step 3.1), and we compute a (possibly inexact) proximal-gradient point ỹ^(k,i) ∈ R^n (Step 3.2) according to the inexactness condition (28), i.e. the counterpart of condition (14) written for the function h_γ^(k,i), where τ ⩾ 0. If τ = 0, the proximal-gradient point is computed exactly, i.e. we have ỹ^(k,i) = ŷ^(k,i); otherwise, if τ > 0, we look for an approximation ỹ^(k,i) of the exact point ŷ^(k,i). Based on remark 3, we observe that the point ỹ^(k,i) makes the function h_γ^(k,i) negative or null, namely

h_γ^(k,i)(ỹ^(k,i)) ⩽ 0.  (29)

This property is crucial for the well-posedness of the next inner steps.
Then, we define the descent direction d^(k,i) = ỹ^(k,i) − x^(k,i) (Step 3.3) and perform a linesearch procedure for the function f_0 + ℓ^(k) along d^(k,i) (Step 3.4). More precisely, we compute the linesearch parameter λ_{k,i} = δ^{m_{k,i}}, where δ ∈ (0, 1) and m_{k,i} is the smallest nonnegative integer such that the following Armijo-like condition holds:

f_0(x^(k,i) + λ_{k,i} d^(k,i)) + ℓ^(k)(x^(k,i) + λ_{k,i} d^(k,i)) ⩽ f_0(x^(k,i)) + ℓ^(k)(x^(k,i)) + β λ_{k,i} h_γ^(k,i)(ỹ^(k,i)),  (30)

where β ∈ (0, 1). Such a condition has already been considered in the nonsmooth framework for other algorithms [4,5,35]. Note that, if h_γ^(k,i)(ỹ^(k,i)) < 0, then d^(k,i) is a descent direction for f_0 + ℓ^(k) at x^(k,i), and the linesearch procedure based on (30) terminates in a finite number of steps [35, proposition 3.1]. Finally, the next inner iterate is computed as the convex combination x^(k,i+1) = x^(k,i) + λ_{k,i} d^(k,i) (Step 3.5). Once the inner routine has ended, we compute the next outer iterate x^(k+1) as the point that attains the smallest objective function value between x^(k,I_k) and ỹ^(k,0) (Step 4).
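The overall structure of algorithm 1 can be summarized by the following Python sketch, which assumes the Euclidean metric D_{k,i} = I_n, a fixed steplength, and an exactly computable proximal operator (τ = 0); all function names and the closing usage example are illustrative, not the paper's implementation.

```python
import numpy as np

def ir_vmila(x0, f0, grad_f0, dphi, psi, prox_ell, f1, K=50, I_max=5,
             alpha=1.0, gamma=1.0, beta=1e-4, delta_ls=0.5):
    # Sketch of algorithm 1 with D_{k,i} = I_n, fixed alpha, tau = 0 (exact prox)
    x = x0.copy()
    for k in range(K):
        omega = dphi(psi(x))                           # Step 1: weights omega^(k)
        ell = lambda u, w=omega: (w * psi(u)).sum()    # convex surrogate ell^(k)
        xi, y0 = x.copy(), None                        # Step 2: x^(k,0) = x^(k)
        for i in range(I_max):
            g = grad_f0(xi)
            y = prox_ell(xi - alpha * g, alpha, omega) # Step 3.2, exact point
            y0 = y if y0 is None else y0
            d = y - xi                                 # Step 3.3
            h = g @ d + ell(y) - ell(xi) + gamma / (2 * alpha) * (d @ d)
            lam = 1.0                                  # Step 3.4: Armijo condition (30)
            while (f0(xi + lam * d) + ell(xi + lam * d)
                   > f0(xi) + ell(xi) + beta * lam * h):
                lam *= delta_ls
            xi = xi + lam * d                          # Step 3.5
        # Step 4: best value between the last inner iterate and y^(k,0)
        x = xi if f0(xi) + f1(xi) <= f0(y0) + f1(y0) else y0
    return x

# Illustrative usage: quadratic fidelity + log-sum penalty + nonnegativity
rng = np.random.default_rng(1)
A, b, dlt = rng.standard_normal((30, 60)), rng.standard_normal(30), 0.1
x_hat = ir_vmila(
    np.zeros(60),
    f0=lambda x: 0.5 * np.sum((A @ x - b) ** 2),
    grad_f0=lambda x: A.T @ (A @ x - b),
    dphi=lambda t: 1.0 / (dlt + t), psi=np.abs,
    f1=lambda x: np.log(1.0 + np.abs(x) / dlt).sum(),
    prox_ell=lambda z, a, w: np.maximum(0.0, z - a * w),  # chi = indicator of x >= 0
    alpha=1.0 / np.linalg.norm(A, 2) ** 2)
```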
Note that, if we set J = 1, ϕ 1 (x) = x, ψ 1 ≡ 0 in (22), and I k = 1 for all k ⩾ 0 in algorithm 1, then our proposed IR-VMILA reduces to the so-called VMILAn algorithm [4,9,36] applied to the minimization of the function f = f 0 + f 1 , where f 1 = χ is convex. In other words, we can see IR-VMILA as a generalization of the VMILAn algorithm to completely nonconvex optimization problems, where both f 0 and f 1 are possibly nonconvex.
We also remark that the proposed IR-VMILA has a similar structure to the C2FB algorithm considered in [27], which is also an iteratively reweighted algorithm with an inner forward-backward routine. However, in [27], the steplength parameters are forced to be smaller than one, and the scaling matrices are required to satisfy an appropriate majorization-minimization condition. On the contrary, our proposed algorithm allows for greater flexibility, as the parameters α_{k,i} and D_{k,i} can be freely chosen according to any rule of preference, provided that they both belong to compact sets (see the forthcoming assumption 2). Furthermore, IR-VMILA adopts a different inexactness criterion for the computation of the proximal point ỹ^(k,i), which has the advantage of being practically implementable (see section 2.3).

Convergence analysis
We now develop the convergence analysis for algorithm 1. The main result, stated in theorem 1 to follow, establishes the convergence of the sequence {x (k) } k∈N to a stationary point of problem (21). Our convergence analysis can be considered as a generalization of the one devised for the VMILAn algorithm in [9], even though some major modifications are needed to establish a connection between the (possibly nonconvex) objective function and its iteratively reweighted convex majorizers.
In order to proceed with the analysis, we need some additional hypotheses. First, we state the assumptions needed on the algorithmic parameters α_{k,i}, D_{k,i}, I_k, which are all required to belong to compact sets.

Assumption 2.
(i) There exists I_max ∈ Z_+ such that 0 < I_k ⩽ I_max for all k ⩾ 0.
(ii) There exist 0 < α_min ⩽ α_max such that α_{k,i} ∈ [α_min, α_max] for all k ⩾ 0 and i = 0, ..., I_k − 1.
(iii) There exists μ ⩾ 1 such that (1/μ) I_n ⪯ D_{k,i} ⪯ μ I_n for all k ⩾ 0 and i = 0, ..., I_k − 1.
Proof. The thesis follows by employing the same arguments as in [35, proposition 3.2].
Next, we show that a sufficient decrease condition holds for the sequence of function values {f(x (k) )} k∈N evaluated at the iterates of IR-VMILA. Furthermore, we show that the gap between two consecutive function values f(x (k) ) and f(x (k+1) ) is a quantity converging to zero, and related to the distance between the two consecutive iterates x (k) and x (k+1) . Lemma 8. Suppose assumptions 1 and 2 hold. Then the following statements hold.
Proof. (ii) Due to assumption 1, f is continuous on its domain and coercive, hence it is bounded from below. This implies that the monotone nonincreasing sequence {f(x^(k))}_{k∈N} is bounded from below. Then, there must exist f̄ ∈ R such that f(x^(k)) → f̄, and taking the limit of (33) for k → ∞ we obtain (34). (iii) If Step 4 yields x^(k+1) = x^(k,I_k), then an easy application of Jensen's inequality, combined with Step 3.5 of algorithm 1, λ_{k,i} ⩽ 1, and assumption 2(i), shows that inequality (41) holds. Otherwise, if x^(k+1) = ỹ^(k,0), then (41) trivially follows. Thus, in both cases, inequality (41) holds, and applying inequality (17) completes the proof.

In the following lemma, we prove that the norm of the subdifferential of f at the exact proximal-gradient point ŷ^(k,0) can be suitably upper bounded by a quantity converging to zero. In the KL framework, this is usually referred to as the relative error condition [8]. Unlike other standard results available in the literature, our result holds at the exact proximal-gradient point ŷ^(k,0), of which we have no knowledge if τ > 0, rather than at the point ỹ^(k,0) that is actually computed in the algorithm. This is because such an upper bound is unlikely to hold at an inexact point computed according to (28), as discussed in [36, section 3.2]. Indeed, even if it is possible to obtain an upper bound for the distance between the exact point ŷ^(k,0) and its approximation ỹ^(k,0) (see inequality (16) in lemma 5), a similar relation between the subgradients at ŷ^(k,0) and those at ỹ^(k,0) cannot be proved.

Lemma 9. Suppose assumptions 1 and 2 hold. For all k ⩾ 0, there exists v^(k) ∈ ∂f(ŷ^(k,0)) satisfying the bound (42).

Proof. By the definition (27) of the exact proximal-gradient point, ŷ^(k,0) is given by

ŷ^(k,0) = prox^{D_{k,0}}_{α_{k,0} ℓ^(k)}(x^(k) − α_{k,0} D_{k,0}^{−1} ∇f_0(x^(k))).
By applying lemma 2(i) to the differential inclusion 0 ∈ ∂h^(k,0)(ŷ^(k,0)), we obtain that the following vector u^(k) ∈ R^n is a subgradient of ℓ^(k) at the point ŷ^(k,0):

u^(k) = −∇f_0(x^(k)) − (1/α_{k,0}) D_{k,0}(ŷ^(k,0) − x^(k)).

Recalling the definition of ℓ^(k) in (25) and using lemma 2(ii) and (iii), it follows that

∂ℓ^(k)(ŷ^(k,0)) = Σ_{j=1}^J ω_j^(k) ∂ψ_j(ŷ^(k,0)) + ∂χ(ŷ^(k,0)),  (44)

which implies the existence of J vectors u_j(ŷ^(k,0)) ∈ ∂ψ_j(ŷ^(k,0)), j = 1, ..., J, and u_χ(ŷ^(k,0)) ∈ ∂χ(ŷ^(k,0)) such that

u^(k) = Σ_{j=1}^J ω_j^(k) u_j(ŷ^(k,0)) + u_χ(ŷ^(k,0)).  (45)

Using lemma 2(i), we can write ∂f(ŷ^(k,0)) = ∇f_0(ŷ^(k,0)) + ∂f_1(ŷ^(k,0)). Note that f_1 can be written as f_1 = ϕ ∘ Ψ + χ, where ϕ and Ψ satisfy the hypotheses of lemma 3 thanks to assumption 1(ii). Therefore, we can apply lemma 3 to obtain

∂f_1(ŷ^(k,0)) = Σ_{j=1}^J ϕ′_j(ψ_j(ŷ^(k,0))) ∂ψ_j(ŷ^(k,0)) + ∂χ(ŷ^(k,0)).

Hence, the following vector w^(k) ∈ R^n is a subgradient of f_1 at the point ŷ^(k,0):

w^(k) = Σ_{j=1}^J ϕ′_j(ψ_j(ŷ^(k,0))) u_j(ŷ^(k,0)) + u_χ(ŷ^(k,0)),

and v^(k) := ∇f_0(ŷ^(k,0)) + w^(k) is a subgradient of f at the point ŷ^(k,0), i.e. v^(k) ∈ ∂f(ŷ^(k,0)). By lemma 8, it follows that the sequence {x^(k)}_{k∈N} is contained in the set K_1 = [f ⩽ f(x^(0))], which is compact due to assumption 1(iii). Analogously, denoting ω_max = max_{k∈N, j=1,...,J} |ϕ′_j(ψ_j(x^(k)))|, which is a finite real value due to the continuity of ϕ′_j ∘ ψ_j on K_1, we note that {ŷ^(k,0)}_{k∈N} is a subset of the set

K_2 = {prox^D_{αℓ}(x − α D^{−1} ∇f_0(x)) : x ∈ K_1, α ∈ [α_min, α_max], (1/μ)I_n ⪯ D ⪯ μI_n, ℓ = Σ_{j=1}^J ω_j ψ_j + χ, ω ∈ [0, ω_max]^J},

which is compact due to the continuity of the proximal operator with respect to its arguments, and the compactness of K_1. Thus, letting K := K_1 ∪ K_2, we can write a chain of equalities and inequalities leading to the bound (42), where the first equality is due to the definition of v^(k), the first inequality follows from the triangular inequality, the second equality comes from (44) and (45), the second inequality is again due to the triangular inequality, whereas the third one follows from remark 4(ii) and (iii), and finally the fourth one is a consequence of inequality (15).
The next lemma contains the first convergence result for IR-VMILA, i.e. that each limit point of the sequence {x (k) } k∈N is stationary for problem (21). Even though the arguments employed are analogous to those followed in several other algorithms framed within the KL framework [3,4,8,27], we report its proof for the sake of completeness. Lemma 10. Suppose assumptions 1 and 2 hold true. Let X * (x (0) ) denote the set of all limit points of the sequence {x (k) } k∈N generated by algorithm 1 starting from the initial iterate x (0) . Then, the following statements hold true.

Proof. (i) From lemma 8(i), we know that {x^(k)}_{k∈N} ⊆ [f ⩽ f(x^(0))].
Then, assumption 1(iii) implies that {x (k) } k∈N is bounded, which in turn guarantees the existence of at least one limit point, hence X * (x (0) ) ̸ = ∅. The compactness of X * (x (0) ) holds by observing that X * (x (0) ) is a countable intersection of compact sets, exactly as in [20, lemma 5]. (ii) This is a consequence of the definition of limit point of {x (k) } k∈N . (iii) Let x * ∈ X * (x (0) ). Since x (k) ∈ dom(χ) for all k ⩾ 0, and f is continuous on dom(χ) by assumption 1, it follows that there exists a subset of indices K ⊆ N such that lim k∈K f(x (k) ) = f(x * ). As { f(x (k) )} k∈N is a monotone nonincreasing and bounded sequence, it converges to a finite value, hence we have lim k→∞ f(x (k) ) = f(x * ). (iv) This is a straightforward consequence of point (iii).
(v) Let x * ∈ X * (x (0) ) and K ⊆ N any subset of indices such that lim k∈K x (k) = x * . From point (iii), we also know that lim k∈K f(x (k) ) = f(x * ). In addition, by plugging lemma 8(ii) inside (42), we obtain that lim k∈K v (k) = 0. Hence, we can employ the closedness property of the limiting subdifferential (see lemma 1) to conclude that 0 ∈ ∂f(x * ).
As in [9], we now introduce the surrogate function F defined in (49), which will be needed to incorporate the inexact computation of the point ỹ^(k,0) inside the proof of theorem 1.
We are now ready to state the main result, namely that the sequence {x (k) } k∈N satisfies the finite length property (which implies convergence to a stationary point). To this aim, we need the following additional requirement on F.

Assumption 3. The surrogate function F defined in (49) is a KL function.
The previous assumption is satisfied if f is the sum of semialgebraic, subanalytic and real analytic functions, see [9, remark 5].
Theorem 1. Suppose that assumptions 1–3 hold. Then, the sequence {x^(k)}_{k∈N} generated by algorithm 1 satisfies the finite length property

Σ_{k=0}^∞ ∥x^(k+1) − x^(k)∥ < +∞,

hence it converges to a stationary point x* of problem (21).
Since ξ is concave and differentiable, ξ′ is nonincreasing. Then, inequality (60) implies: and hence: Combining (61) and lemma 11(ii) leads to (62). Next, we can write the following chain of inequalities: where the first inequality is due to the concavity of ξ, the second one follows from (62), and the third one from (33). Setting: the previous inequality can be reformulated as: By applying the square root on both sides and using the basic inequality 2√(uv) ⩽ u + v, we obtain: Summing the previous inequality over p = 1, ..., k yields: which, by simplifying terms, reduces to: Note that: where the second inequality follows from the fact that ξ is nonnegative. Thus, combining (64) and (65) yields: Taking the limit for k → ∞ allows us to conclude that: Finally, recalling lemma 8(iii), we can conclude that: By (66) and lemma 10(v), the thesis follows.
The following result is concerned with the convergence rate of the sequence {f(x^(k))}_{k∈N}, under the hypothesis that the desingularizing function ξ is defined as ξ(t) = (1/c) t^{1−θ}, with c > 0 and θ ∈ [0, 1). In particular, it is stated that the iterates {x^(k)}_{k∈N} converge either polynomially, linearly, or in a finite number of iterations, depending on the value that the KL exponent θ assumes.

In analogy with [9, theorem 2], the statement distinguishes three cases: if θ ∈ (1/2, 1), then there exists C > 0 such that ∥x^(k) − x*∥ ⩽ C k^{−(1−θ)/(2θ−1)}, i.e. the convergence is polynomial; if θ ∈ (0, 1/2], then the sequence {x^(k)}_{k∈N} converges linearly; if θ = 0, then {x^(k)}_{k∈N} converges in a finite number of iterations.
Proof. The proof is omitted, as it runs analogously to the one available for the VMILAn algorithm in [9, theorem 2] by combining lemmas 8 and 11, the KL property, and the monotonicity of the derivative ξ ′ .

Numerical experiments
In order to evaluate the performance of IR-VMILA, we turn to the realm of imaging inverse problems, as many of them can be expressed as optimization problems. In this context, the use of composite energy functionals with one or more nonconvex terms, that better capture the properties of the images of interest, has become extremely popular in the last few years [13,16,23,27].
The following experiments are implemented with Matlab 2022a on a MacBook Pro equipped with a 2.4 GHz Intel Core i5 quad-core CPU.

Image deblurring and denoising using smoothed TV and bimodal regularization
In image deconvolution the goal is to compute a suitable approximation x̂ of the unobtainable ground truth x_true ∈ R^n, which gets corrupted with blur and noise in the acquisition process. The observed data can then be modelled as y = Hx_true + n, where H ∈ R^{n×n} is the matrix representing the convolution with the filter h causing the blurring effect, i.e. the point spread function (PSF), and n ∈ R^n is the noise vector. Here we simulate this process by filtering the true image x_true with a disk PSF of radius 15, causing the out-of-focus blurring, and then adding white Gaussian noise with standard deviation 0.01. We deal with 20 black-and-white images of a string of text, in the spirit of the Helsinki Deblur Challenge 2021.
Taking into account that the operator H is ill-conditioned, and motivated by the features of the target image, one possible way of obtaining a valid estimate x̂ is to minimize the following energy functional:

f(x) = (1/2)∥Hx − y∥² + θ TV_δ(x) + η B(x) + ι_{[0,1]^n}(x),  (67)

where:
• TV_δ(x) = Σ_{j=1}^n √(∥∇_j x∥² + δ²) is a smoothed version of the Total Variation functional for δ > 0, ∇_j being the discrete gradient of x at pixel j;
• B(x) is a bimodal nonconvex function used to force the pixels to be either 1 or 0.
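For concreteness, a possible implementation of the smoothed TV value and of its gradient is sketched below in Python; the discretization (forward differences with replicated boundary) is our own choice and may differ from the one used in the experiments.

```python
import numpy as np

def tv_smooth(x, delta):
    # Smoothed TV of a 2D image x: sum over pixels of sqrt(|grad|^2 + delta^2)
    gx = np.diff(x, axis=0, append=x[-1:, :])   # vertical differences (last row: 0)
    gy = np.diff(x, axis=1, append=x[:, -1:])   # horizontal differences (last col: 0)
    mag = np.sqrt(gx**2 + gy**2 + delta**2)     # per-pixel smoothed gradient norm
    value = mag.sum()
    # gradient of the value: adjoint (negative divergence) of the differences
    px, py = gx / mag, gy / mag
    grad = np.zeros_like(x)
    grad[:-1, :] -= px[:-1, :]; grad[1:, :] += px[:-1, :]
    grad[:, :-1] -= py[:, :-1]; grad[:, 1:] += py[:, :-1]
    return value, grad
```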
The function in (67) can be cast in the form (21) by setting f_0(x) = (1/2)∥Hx − y∥² + θ TV_δ(x) and f_1(x) = η B(x) + ι_{[0,1]^n}(x). Furthermore, it is easy to see that f is coercive (see the appendix) and assumption 1 is satisfied. The weights of the convex majorant are defined as in (25). If we denote ω^(k) = (ω_1^(k), ..., ω_n^(k))^T, then the proximal operator required in algorithm 1 admits the explicit formulation reported in (68). We tested the proposed IR-VMILA algorithm using two different strategies for the selection of the inner steplengths: (i) computing α_{k,i} by alternating the Barzilai-Borwein (BB) rules [29] according to the adaptive criterion employed in [37, section 3.2] (a sketch of this alternation is reported below); (ii) computing α_{k,i} as a prefixed, constant steplength. Moreover, we investigated the effect of the scaling matrix D_{k,i} when chosen according to the split-gradient strategy, see e.g. [30,37]. The variants of IR-VMILA equipped with the BB rules will be denoted as IR-VMILA BB (no scaling) and IR-VMILA BBS (with scaling). We compared IR-VMILA with the standard ISTA [6] and the VMILAn scheme presented in [4,9]. For these algorithms, we chose a different splitting for the function f, gathering all the smooth terms together in f_0(x) = (1/2)∥Hx − y∥² + θ TV_δ(x) + η B(x), so that the resolvent operator needed in the backward step simply consists in the orthogonal projection onto the cartesian product [0,1]^n. For all algorithms, the outer iterations were stopped when either a maximum number of iterations k_max = 1000 was achieved, or when the relative difference e_r^(k) = |f(x^(k)) − f(x^(k−1))|/|f(x^(k))| was lower than ϵ = 10^{−7}. Likewise, the inner iterations were halted when either a maximum number I_max = 25 was reached, or when the relative difference of the convex majorant f_0 + ℓ^(k) between two consecutive inner iterations was smaller than the tolerance ϵ. All of the algorithms were initialized at the blurred and noisy data y. The plots shown in the following are obtained by considering only the image in figure 1, whereas the results in table 1 are averaged across the 20 images considered.
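The alternation of the BB rules mentioned above can be sketched as follows; the switching rule is a simplified version of the adaptive criterion of [37], and all thresholds and safeguards are illustrative.

```python
import numpy as np

def bb_steplength(s, z, a_min=1e-8, a_max=1e8, tau=0.5):
    # Alternated Barzilai-Borwein rules; s = difference of consecutive iterates,
    # z = difference of the corresponding gradients.
    sz = s @ z
    if sz <= 0:
        return a_max                 # no positive curvature: fall back to a_max
    bb1 = (s @ s) / sz               # BB1 rule (larger steplength)
    bb2 = sz / (z @ z)               # BB2 rule (smaller steplength)
    alpha = bb2 if bb2 / bb1 < tau else bb1   # simplified adaptive alternation
    return float(np.clip(alpha, a_min, a_max))
```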
In figure 2 we compare the performances of IR-VMILA BB and IR-VMILA BBS. For this problem, the use of the scaling matrix in combination with the BB rules does not seem to provide any considerable acceleration. On the other hand, in figure 3 we illustrate how the BB rules are crucial in speeding up the algorithm, as choosing constant steplengths leads to a run time more than 10 times greater, while still obtaining a similar quality in the reconstruction.

Image deblurring and denoising using nonconvex edge preserving regularization
In this section we illustrate the applicability of our approach to problems where the proximity operator of the function ℓ^(k) is not available in closed form. To this end, we consider a deblurring problem where the data y ∈ R^n is blurred and corrupted by Gaussian noise. Then, we seek an approximation of the true image by minimizing the functional (69) proposed in [23], where H ∈ R^{n×n} is the blurring operator and η > 0. As explained in [23], the second term in (69) is motivated by the analysis of natural image statistics and, in general, it promotes solutions that are mainly smooth with some sharp discontinuities corresponding to edges. The objective function (69) is a special instance of (21), with f_0(x) = (1/2)∥Hx − y∥². It is easy to check that assumption 1 is satisfied. The weights of the convex majorant are defined as in (25), and the proximal operator involved at the inner iterations of algorithm 1 is given by (71). The solution of the minimization problem (71) cannot be expressed in explicit form. Therefore, within our convergence framework, the proximal-gradient point ŷ^(k,i) must be approximated by a point ỹ^(k,i) satisfying inequality (28). Such a point can be computed by applying an iterative optimization method to the dual of problem (71), until the primal-dual gap satisfies a suitable, easy to check condition [9, section 3.1]. Overall, when applied to the minimization of (69), algorithm 1 consists of three nested loops.
We define a test problem by selecting a natural image x_true whose pixel values range in [0, 1], convolving it with a Gaussian kernel (standard deviation 1.5), and then adding Gaussian noise of zero mean and standard deviation 0.01. The target image and its noisy version are depicted in figures 4(a) and (b). As for the model parameters, we empirically select η = 5 × 10^{−4}, which produced a good visual quality for the reconstructions. The scaling matrices in algorithm 1 are set as D_{k,i} = I_n, for all k and i, whereas the steplengths α_{k,i} are selected with the same alternation of the BB rules described in the previous section. The inner stopping condition is defined as in the previous section, with ϵ = 10^{−7} and I_max = 10. The dual problem for approximating the proximal-gradient point at each inner iteration is solved with FISTA, until condition (28) is satisfied with τ = 2 × 10^{−6}.
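The degradation process described at the beginning of this section can be reproduced with a few lines of Python; the ground truth below is a random placeholder, whereas in the experiments a natural image is used.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(42)
x_true = rng.random((64, 64))            # placeholder for the natural test image
y = gaussian_filter(x_true, sigma=1.5)   # blur: convolution with a Gaussian PSF
y = y + 0.01 * rng.standard_normal(y.shape)   # additive Gaussian noise, std 0.01
```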
For comparison, we consider the method IRL1 [23, algorithm 3], implemented as described in the paper. More specifically, the inner convex problem is approximately solved by using  the Chambolle-Pock method, which is stopped after 10 inner iterations. Both methods are initialized with the noisy blurred image y and run for 100 outer iterations.
The restored image obtained at the final iteration of IR-VMILA is reported in figure 4(c), while in figure 5 we report the objective function decrease for IR-VMILA and IRL1 with respect to the iteration number (a) and the computational time (b). From the plot we can conclude that IR-VMILA provides an accelerated convergence rate towards its limit point, which is mainly due to the variable steplength selection. The acceleration is more evident with respect to the iteration number, but it is still significant also in terms of the computational time. In this regard, it should be noted that the implementation of IRL1 was tailored for this specific problem, whereas the IR-VMILA code was not optimized likewise.

Image super-resolution with CEL0
We now turn our attention to an interesting image super-resolution problem arising in the field of microscopy. The aim is to obtain a high-resolution image x̂ ∈ R^{K²n} from a blurred, noisy and low-resolution observation y ∈ R^n, where K > 0 is the super-resolution factor. This is a common task in microscopy, where the physical limitations of the acquisition tools often lead to under-sampled images, causing close and small objects to remain undetected.
Recent super-resolution techniques in SMLM, like [38][39][40], propose to acquire a large number of under-sampled images, where each time only a limited random number of molecules is activated. After these acquisitions are cleaned of noise and blur, they can be stacked to form the super-resolved image. The effectiveness of this process is influenced by the number of such acquisitions, which in turn is related to the density of the activated molecules. In order not to damage the object of interest, the number of acquisitions should be restricted, which leads to a high density of activated molecules and, again, to the risk that smaller objects go unnoticed. All things considered, using a sparsity-inducing model when restoring the samples is a valid strategy to overcome this issue.
Here, specifically, we focus on the microscopy problem from the ISBI13 challenge. The ground-truth image x_true ∈ R^{K²n}, with K = 4 and n = 64², shows 8 slim tubular structures, which were localized, using the scheme previously described, through 361 acquisitions y_i ∈ R^n, i = 1, ..., 361, that are 4 times smaller than the true image.
If y represents one low-resolution image corrupted by additive Gaussian noise and blur, then the corresponding super-resolution image x̂ can be obtained as a minimizer of the energy functional

f(x) = (1/2)∥Ax − y∥² + R_CEL0(x) + ι_{x⩾0}(x),

where A = SH is the composition of two operators: H ∈ R^{K²n×K²n} is the blurring operator that represents the convolution with the PSF h, while S ∈ R^{n×K²n} is the downsampling operator, which reduces each dimension of the image by a factor K. The CEL0 regularization term R_CEL0(x) is meant to induce sparsity in the reconstruction, being an (exact) continuous approximation of the ℓ_0 pseudonorm [16]. Denoting with A_j the jth column of the operator A, the regularization term reads

R_CEL0(x) = Σ_j φ(∥A_j∥, x_j), with φ(a, t) = η − (a²/2) (|t| − √(2η)/a)² 1_{{|t| ⩽ √(2η)/a}}.

In this case we set f_0(x) = (1/2)∥Ax − y∥² and f_1(x) = R_CEL0(x) + ι_{x⩾0}(x). Taking into account the non-negativity constraint that is enforced in the problem through the indicator function, we can actually ignore the absolute values in the expression of φ. Thus, the objective function can be framed in problem (21) by setting ψ_j(x) = x_j, ϕ_j(t) = φ(∥A_j∥, t) and χ = ι_{⩾0}. Note that assumption 1 is satisfied; particularly, the coercivity of f holds, since the operator A has nonnegative entries and no identically null rows and columns (see the appendix). Consequently, the weights are given by

ω_j^(k) = ϕ′_j(x_j^(k)) = (∥A_j∥ √(2η) − ∥A_j∥² x_j^(k)) 1_{{x_j^(k) ⩽ √(2η)/∥A_j∥}},

and the proximal operator of ℓ^(k), whose formulation only depends on the expression of the ψ_j, has the same form of equation (68). We employed IR-VMILA to solve the resulting minimization problem and we compared the results with the ones obtained by running ISTA, the C2FB algorithm from [27], as well as the IRL1 algorithm [23] with FISTA as the inner solver, following the authors of [16]. As in the previous experiment, we equipped IR-VMILA with different choices for the steplength: the alternated Barzilai-Borwein rules employed in [37], the constant steplength α_{k,i} = 1, or α_{k,i} = 1/L, where L is an approximation of the Lipschitz constant of ∇f_0. Regarding the scaling matrix, we computed D_{k,i} using again the split-gradient strategy [30]. On the other hand, the scaling matrix for C2FB has been obtained with the procedure described in [3]. All the algorithms were stopped when either the maximum number of k_max = 1000 iterations was reached, or when the relative difference of the objective function in two consecutive iterates was less than ϵ = 6 × 10^{−6}. However, we adopted different settings for the inner routines of the iteratively reweighted schemes. In particular, the maximum number of inner iterations for IR-VMILA was set to I_max = 3, whereas the corresponding parameter was fixed as I_max = 50 for C2FB and I_max = 25 for IRL1-FISTA. All three algorithms stopped the inner iterations when the relative difference in the function f_0 + ℓ^(k) was smaller than 10^{−7}. The reason why we adopted such different values of I_max is that, with a smaller upper bound on the inner iterations, the quality of the solutions provided by C2FB and IRL1-FISTA was much worse than that provided by IR-VMILA. Regarding the regularization parameter η, we used the value η = 0.1, which is optimal for the performance of ISTA. For all schemes the initial point was chosen as x^(0) = A^T y. The plots presented in figure 6 are obtained by applying the algorithms to only one of the 361 frames.
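For illustration, the following Python sketch evaluates the reweighting step for the CEL0 penalty, assuming the expression of R_CEL0 recalled above; all variable names are ours.

```python
import numpy as np

def cel0_weights(x, a_norms, eta):
    # Weights omega_j^(k) = phi_j'(x_j^(k)), assuming (for x_j >= 0)
    # phi_j(t) = eta - (a_j^2/2)*(t - sqrt(2*eta)/a_j)^2 for t <= sqrt(2*eta)/a_j,
    # and phi_j(t) = eta otherwise, with a_j = ||A_j|| the jth column norm of A.
    thr = np.sqrt(2.0 * eta) / a_norms
    return np.where(x <= thr,
                    a_norms * np.sqrt(2.0 * eta) - a_norms**2 * x,
                    0.0)

# Usage sketch: omega = cel0_weights(xk, np.linalg.norm(A, axis=0), eta)
# with A the (dense) super-resolution operator and xk the current iterate.
```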
In figure 7 we illustrate how, this time, the scaling matrix does indeed provide a boost for the proposed algorithm in terms of a faster decrease of the objective function. It reduces the total number of iterations by a large enough margin to justify the computational cost of constructing the matrix, resulting in an overall reduced computational time. Figure 8 shows the effects of the BB rules on the performance of IR-VMILA. As expected, an adaptive strategy for the steplength selection improves the computational time, especially when compared to the small steplength provided by the inverse of the Lipschitz constant of ∇f_0. In conclusion, a clever choice of the steplength parameter, possibly in the presence of a scaling matrix, seems to accelerate the proposed approach towards a good approximation of the optimal value.
When compared against the other algorithms tested for the ISBI13 challenge, IR-VMILA is the best performing in terms of decrease of the objective function. Indeed, in figure 9 we see that IR-VMILA reaches the stopping criterion for the outer loop in fewer iterations compared to the other schemes. Furthermore, even though the iterations of IR-VMILA are more costly than those of ISTA and IRL1-FISTA, the final computational time is still definitely in favor of the former. In super-resolution microscopy, the goal is not to recover the correct intensity of the pixels, but rather the true position of the molecules. Thus, a sensible measure of the quality of the reconstruction is the Jaccard index, which is computed as

J_δ = (# of correct detections) / [(# of correct detections) + (# of false negatives) + (# of false positives)],

where δ ∈ N is the number of pixels of tolerance. In table 2 we report the average Jaccard index measured for all the 361 reconstructions, for three different tolerances. Each column shows the Jaccard index relative to the reconstructions provided by one of the algorithms IR-VMILA BB, ISTA with backtracking, C2FB, and IRL1-FISTA. We did not report the results for the IR-VMILA BBS variant, as we observed that the solution obtained with a nontrivial scaling matrix was worse than the nonscaled one, even if the objective function value was smaller. This could be due to the fact that the problem may have multiple local minima. From table 2, we can observe that the algorithms achieve comparable results in terms of accuracy, although IR-VMILA BB reconstructs the image in a significantly reduced amount of time, as figure 9 shows.
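The Jaccard index J_δ can be computed, for instance, with the following Python sketch, which uses a greedy one-to-one matching between detections and true molecule positions; the matching strategy is our simplification of the challenge's assessment procedure.

```python
import numpy as np

def jaccard_index(detected, true_molecules, delta):
    # A detection is correct if it falls within delta pixels (Chebyshev
    # distance) of a still-unmatched true molecule position.
    remaining = [np.asarray(t) for t in true_molecules]
    correct = 0
    for d in map(np.asarray, detected):
        if remaining:
            dists = [np.abs(d - t).max() for t in remaining]
            j = int(np.argmin(dists))
            if dists[j] <= delta:
                remaining.pop(j)       # each true molecule can be matched once
                correct += 1
    false_neg, false_pos = len(remaining), len(detected) - correct
    total = correct + false_neg + false_pos
    return correct / total if total > 0 else 1.0

# e.g. jaccard_index([(3, 4), (10, 10)], [(3, 5), (20, 20)], delta=1) == 1/3
```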

Conclusions
In this paper we have proposed a novel iteratively reweighted algorithm, where the solution of the inner subproblem is obtained by a finite number of linesearch based forward-backward iterations. In our approach, acceleration strategies related to the selection of the forward-backward parameters can be included. We provided the convergence analysis of the proposed algorithm and the results of a numerical experience on a couple of relevant problems. In particular, the novel algorithm seems to be promising as a tool for solving image super-resolution problems in microscopy. Future work will address the design of effective rules for the selection of the algorithmic parameters and for the specific microscopy application, as well as the definition of a suitable strategy to achieve maximum accuracy in the molecule localization while preserving the acceleration effect of the algorithm.