Optimal-order convergence of Nesterov acceleration for linear ill-posed problems*

Published 4 May 2021 © 2021 The Author(s). Published by IOP Publishing Ltd
Citation: Stefan Kindermann 2021 Inverse Problems 37 065002, DOI 10.1088/1361-6420/abf5bc

Abstract

We show that Nesterov acceleration is an optimal-order iterative regularization method for linear ill-posed problems provided that a parameter is chosen according to the smoothness of the solution. This result is proven both for an a priori stopping rule and for the discrepancy principle under Hölder source conditions. Furthermore, some converse results and logarithmic rates are verified. The essential tool for obtaining these results is a representation of the residual polynomials via Gegenbauer polynomials.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

One option to calculate a regularized solution to a linear ill-posed problem Ax = y, with A: X → Y linear and bounded and X, Y Hilbert spaces, when only noisy data yδ with ||y − yδ|| = δ are available is to employ iterative regularization schemes. Here, approximate solutions ${x}_{k}^{\delta }$ are calculated iteratively, combined with a stopping rule as regularization parameter choice. The simplest such scheme is Landweber iteration (cf, e.g. [7]), which has the downside of being rather slow. To speed up convergence, acceleration schemes may be used, such as the following Nesterov acceleration:

Equation (1)

where ||A*A|| ⩽ 1 is assumed and where the sequence αk is chosen, for instance, as

Equation (2)

Here, β is a parameter; common choices are, for example, β = 1 or β = 2. We remark that other choices of the sequence αk are possible as well, but for the main analysis of this paper we only consider (2).
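Since the displayed formulas (1) and (2) are not reproduced in this version of the text, the following minimal Python sketch assumes the standard form of the scheme as used in [14, 15], namely zk = xk + αk(xk − xk−1), xk+1 = zk − A*(A zk − yδ) with αk = (k − 1)/(k + β); the names `A`, `y_delta`, and `beta` are placeholders of this sketch, not notation from the paper.

```python
import numpy as np

def nesterov_iteration(A, y_delta, beta=1.0, n_iter=100):
    """Sketch of Nesterov-accelerated Landweber iteration, assuming the
    standard form z_k = x_k + a_k (x_k - x_{k-1}),
    x_{k+1} = z_k - A^T (A z_k - y_delta), with a_k = (k - 1)/(k + beta).
    Requires ||A^T A|| <= 1 (rescale A and y_delta otherwise)."""
    x_prev = np.zeros(A.shape[1])
    x = x_prev.copy()
    iterates = [x.copy()]
    for k in range(1, n_iter + 1):
        a_k = (k - 1.0) / (k + beta)                 # momentum parameter, cf (2)
        z = x + a_k * (x - x_prev)                   # extrapolation step
        x_prev, x = x, z - A.T @ (A @ z - y_delta)   # Landweber step applied to z
        iterates.append(x.copy())
    return iterates
```

Setting αk ≡ 0 in this sketch recovers plain Landweber iteration, while a larger β (e.g. β = 4 as in the experiments of section 4) corresponds to the optimal-order regime for higher smoothness discussed below.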

This iteration (in a general nonlinear context) was suggested by Yurii Nesterov for general convex optimization problems [14]. It is an instance of a method that achieves the best rate of convergence (in the sense of objective-function decrease) that is generally possible for a first-order method. Nesterov acceleration can be employed to speed up the convergence of gradient methods in nonlinear or convex optimization. A particularly successful instance is the FISTA algorithm of Beck and Teboulle [3] for nondifferentiable convex optimization.

In the realm of ill-posed problems, Hubmer and Ramlau [12] performed a convergence analysis for the nonlinear case, and showed the efficiency of the method.

We note that although the main field of application of Nesterov acceleration lies in nonlinear optimization, in this paper we treat only the case of linear operator equations and the acceleration properties of the method for linear ill-posed problems. Other recent acceleration schemes proposed in the literature use, e.g., Hilbert scale preconditioning [6], the continuous version of Nesterov's scheme [4, 9], or fractional asymptotical regularization [17].

The background and main motivation of the present article is the recent interesting work of Neubauer [15] for ill-posed problems in the linear case. He showed that (1) is an iterative regularization scheme and, more importantly, proved convergence rates, which are of optimal order only for a priori parameter choices and in the case of low smoothness of the solution, while being suboptimal otherwise. What is puzzling is that the method shows a quite unusual 'semi-saturation' phenomenon (we explain this term below in section 3.1).

Our contribution in this article is twofold: at first, we prove a formula for the residuals of the iteration (1) involving Gegenbauer polynomials. On this basis, we can build a convergence rate analysis, which improves and extends the results of Neubauer. In particular, we show that the method can always be made an optimal-order method if the parameter β is chosen according to the (Hölder-)smoothness index of the solution. This result holds both for an a priori stopping rule and for the discrepancy principle.

Our analysis also explains the quite nebulous role that this parameter plays in the iteration; it turns out that it is related to the index of the orthogonal polynomials appearing in the residual formula.

Moreover, the above mentioned residual representation also clearly elucidates the semi-saturation phenomenon because the iteration can be interpreted as a mixture of a saturating iteration (Brakhage's ν-method) and a non-saturating one (Landweber method).

In the following we employ some standard notation of regularization theory as in [7]: δ = ||Ax† − yδ|| is the noise level and x† denotes the minimum-norm solution to the operator equation Ax = y with exact data y = Ax†. The index δ of yδ indicates noisy data, and analogously, ${x}_{k}^{\delta }$ denotes the iterates of (1) with noisy data yδ, while the lack of δ indicates exact data: xk denotes the iteration (1) with the exact data y in place of yδ.

2. Residual polynomials for Nesterov acceleration

Our work follows the general theory of spectral-filter-based regularization methods as in [7], where the convergence analysis results from estimates of the corresponding filter function. The first main result, theorem 1, is quite useful for this purpose, as it represents the residual function in terms of known polynomials.

The iteration (1) is a Krylov-space method, and the residual can be expressed as

with the residual polynomials satisfying the recurrence relation (cf [15])

Equation (3)

This is a simple consequence of the definition in (1). The kth iterate can be expressed via spectral filter functions

Observe that the three-term recursion (3) is not of the form required to apply Favard's theorem [8]; hence rk does not agree with any family of polynomials orthogonal with respect to some weight function. (Note that Favard's theorem fully characterizes three-term recurrence relations that lead to orthogonal polynomials.)

Before we proceed, we may compare the residual polynomials with other well-known cases. For classical Landweber iteration [7], which is obtained by setting αk = 0 and thus ${z}_{k}^{\delta }={x}_{k}^{\delta }$, the corresponding residual function ${r}_{k}= :\enspace {r}_{k}^{\left(\mathrm{L}\mathrm{W}\right)}$ is

On the other hand, another class of well-known iteration methods for ill-posed problems that are based on orthogonal polynomials is that of two-step semiiterative methods [10]. They have the form

where μk and ωk are appropriately chosen sequences. The corresponding residual functions satisfy the recurrence relation

Equation (4)

and thus, rk (λ) form a sequence of orthogonal polynomials. Of special interest in ill-posed problems are the ν-methods of Brakhage [5, 10], defined by the sequences, for k > 1,

the initial values x0 = 0, ${x}_{1}=\frac{4\nu +2}{4\nu +1}{A}^{{\ast}}{y}^{\delta }$, and with ν > 0 a user-selected parameter. The associated residual polynomials ${r}_{k}= :\enspace {r}_{k}^{\left(\nu \right)}$ related to (4) with r0 = 1, ${r}_{1}=1-\lambda \frac{4\nu +2}{4\nu +1}$, have the representation [5]

where ${C}_{n}^{\left(\alpha \right)}$ denotes the Gegenbauer polynomials (also known as ultraspherical polynomials); cf [1].

We now obtain the corresponding representation for the Nesterov residual polynomials, which is the basis of this article.

Theorem 1. Let β > −1. The residual polynomials for the Nesterov acceleration (1) with (2) are

Equation (5)

with the Gegenbauer polynomials ${C}_{n}^{\left(\alpha \right)}$.

Proof. Defining ${h}_{k}\left(\lambda \right)={r}_{k}\left(\lambda \right){\left(1-\lambda \right)}^{-\frac{k+1}{2}}$ and multiplying (3) by ${\left(1-\lambda \right)}^{-\frac{k+2}{2}}$ leads to the relation

Equation (6)

We note that ${C}_{n}^{\left(\frac{\beta +1}{2}\right)}\left(x\right)$ satisfy the recursion relation (cf [1, p 782])

Equation (7)

with

Using the recurrence relation with x = 1 leads to

with

Dividing (7) by ${C}_{k}^{\left(\frac{\beta +1}{2}\right)}\left(1\right)$ and using this relation yields

Equation (8)

By induction (or by well-known formulae [1, 16]), it can easily be verified that $\frac{{C}_{k-1}^{\left(\frac{\beta +1}{2}\right)}\left(1\right)}{{C}_{k-2}^{\left(\frac{\beta +1}{2}\right)}\left(1\right)}=\frac{k+\beta -1}{k-1}$, from which it follows that ${\theta }_{k}-1={\alpha }_{k}^{-1}$ as well as $1-{\theta }_{k}^{-1}={\left(1+{\alpha }_{k}\right)}^{-1}$. Thus, $\frac{{C}_{k}^{\left(\frac{\beta +1}{2}\right)}\left(x\right)}{{C}_{k}^{\left(\frac{\beta +1}{2}\right)}\left(1\right)}$ satisfies the same recursion as hk+1(λ), and the corresponding initial values for k = 0, 1 agree when setting $x=\sqrt{1-\lambda }$. This allows us to conclude that

which proves the theorem. □

This theorem relates the residual function of Nesterov acceleration to other known iterations. In particular, the residual rk is roughly the product of that of $\frac{k}{2}$ Landweber iterations and that of $\frac{k}{2}$ iterations of a ν-method with $\nu =\frac{\beta +1}{4}$.
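The identity of theorem 1 can also be checked numerically. The sketch below assumes the recurrence (3) in the form rk+1(λ) = (1 − λ)[(1 + αk)rk(λ) − αk rk−1(λ)] with r0 = 1, r1 = 1 − λ, and the closed form (5) as rk(λ) = (1 − λ)^((k+1)/2) C_{k−1}^{((β+1)/2)}(√(1−λ))/C_{k−1}^{((β+1)/2)}(1); both expressions are reconstructions from the surrounding text rather than verbatim copies of the displayed equations.

```python
import numpy as np
from scipy.special import eval_gegenbauer

def residuals_by_recurrence(lam, beta, K):
    # r_0 = 1, r_1 = 1 - lambda, then the (assumed) three-term recurrence (3)
    r = [np.ones_like(lam), 1.0 - lam]
    for k in range(1, K):
        a_k = (k - 1.0) / (k + beta)
        r.append((1.0 - lam) * ((1.0 + a_k) * r[k] - a_k * r[k - 1]))
    return r

def residual_closed_form(lam, beta, k):
    # (assumed) closed form (5) via Gegenbauer polynomials C_{k-1}^{((beta+1)/2)}
    alpha = (beta + 1.0) / 2.0
    x = np.sqrt(1.0 - lam)
    return ((1.0 - lam) ** ((k + 1) / 2.0)
            * eval_gegenbauer(k - 1, alpha, x) / eval_gegenbauer(k - 1, alpha, 1.0))

lam = np.linspace(0.0, 1.0, 201)
beta, K = 2.0, 20
r = residuals_by_recurrence(lam, beta, K)
dev = max(np.max(np.abs(r[k] - residual_closed_form(lam, beta, k))) for k in range(1, K + 1))
print("max deviation between recurrence and closed form:", dev)  # close to machine precision
```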

Remark 1. Gegenbauer polynomials are special cases of Jacobi polynomials and they themselves embrace several other orthogonal polynomials as special cases. Certain values of β in (1) yield various specializations in (5): the choice β = 0 leads to Legendre polynomials, the often encountered choice β = 1 leads to Chebyshev polynomials of the second kind [1].

We note that the result of theorem 1 even holds for β = −1. In this case, only α1 is not well-defined, but it is always 0 for β > −1. Thus, we may extend the definition of the iteration to β = −1 by setting ${x}_{k}^{\delta }{:=}{\mathrm{lim}}_{\beta \to -1}\enspace {x}_{k}^{\delta }$. (This just amounts to slightly modifying (2) by setting α1 = 0 for k = 1; the remaining iteration is well-defined by (1) and (2).) In this case, we may use [1, equation (22.5.28)] to conclude that the resulting polynomials are Chebyshev polynomials of the first kind.

Before we proceed with the convergence analysis, we state for generality the corresponding theorem for the Nesterov acceleration (1) with a general sequence αk .

Theorem 2. Consider the iteration (1) with a positive sequence αk . Then the corresponding residual function can be expressed as

Equation (9)

where Pk is a sequence of orthogonal polynomials obeying the recurrence relation

Equation (10)

with cn and dn recursively defined to satisfy

Equation (11)

Conversely, given a sequence of orthogonal polynomials defined by the recurrence relation (10) with sequences cn, dn, there exists a sequence αk (defined via (11)) such that the corresponding Nesterov iteration (1) has a residual function as in (9).

Proof. The function ${h}_{k}\left(\lambda \right){:=}{r}_{k}\left(\lambda \right){\left(1-\lambda \right)}^{-\frac{k}{2}}$ satisfies, for k ⩾ 1, the recursion (6) with h0(λ) = 1 and ${h}_{1}\left(\lambda \right)=\sqrt{1-\lambda }$. As in the proof of theorem 1, we may conclude that (10) leads to a similar recursion as (8):

with

From (10) we can conclude by some algebraic manipulations that

If (11) holds, then from the recursion for θk it follows that we can perform an induction step: ${\theta }_{k-1}^{-1}=1+{\alpha }_{k-1}^{-1}$ implies ${\theta }_{k}^{-1}=1+{\alpha }_{k}^{-1}$. Since ${\theta }_{1}^{-1}=1+{\alpha }_{1}^{-1}$ by definition, we obtain that hk (λ) and $\frac{{P}_{k}\left(x\right)}{{P}_{k}\left(1\right)}$ satisfy identical recursions and have identical initial conditions with the setting $x=\sqrt{1-\lambda }$.

Conversely, if (10) is given and the sequence αk is recursively defined by (11), then it follows in a similar manner that $\frac{{P}_{k}\left(x\right)}{{P}_{k}\left(1\right)}$ has the same recursion and initial conditions as hk (λ) and thus both functions agree. □

The polynomials Pk (x) in this theorem correspond to $x{C}_{k-1}^{\frac{\beta +1}{2}}\left(x\right)$ in theorem 1. As an illustration, we may consider the peculiar choice of αk in Nesterov's original paper [14], which is also used in the well-known FISTA iteration [3]: first, a sequence is defined recursively,

and then the sequence αk is given by

Note that tk+1 is the positive root of the equation ${t}_{k+1}\left({t}_{k+1}-1\right)={t}_{k}^{2}$. Using this identity, we may calculate that

Thus, coefficients for a recurrence formula for orthogonal polynomials that correspond to such an iteration are

However, this does not seem to be related to any common polynomial family, to the knowledge of the author.
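For concreteness, the standard choice of [3, 14] reads t1 = 1, tk+1 = (1 + √(1 + 4tk²))/2 and αk = (tk − 1)/tk+1; the short sketch below generates this sequence (this is the textbook form, quoted here because the displayed formulas above are not reproduced).

```python
import math

def fista_momentum(K):
    """Standard FISTA/Nesterov momentum sequence:
    t_1 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2, alpha_k = (t_k - 1) / t_{k+1}."""
    t = 1.0
    alphas = []
    for _ in range(K):
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        alphas.append((t - 1.0) / t_next)
        t = t_next
    return alphas

print(fista_momentum(5))  # alpha_1 = 0; the sequence increases towards 1
```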

On the other hand, we may design Nesterov iterations from the recurrence relation of classical polynomials. For instance, the Hermite polynomials obey a relation (10) with ck = 2, dk = 2k. Thus, the sequence αk has to satisfy the recursion

3. Convergence analysis

We consider the iteration (1) with the usual αk-sequence (2) and show that it is an optimal-order regularization method (of course, when combined with a stopping rule).

3.1. Convergence rates and semi-saturation

In the classical analysis of regularization schemes [7], one tries to bound the error in terms of the noise level δ: ${\Vert}{x}_{k\left(\delta \right)}^{\delta }-{x}^{{\dagger}}{\Vert}{\leqslant}\psi \left(\delta \right)$, where ψ is some function decreasing to 0 as δ → 0. Often, Hölder-type rates are considered, with $\psi \left(\delta \right)=C{\delta }^{\xi }$ for some ξ > 0. For such estimates, one has to impose smoothness conditions in the form of a source condition

Equation (12)

It is also well-known [7] that the optimal rate of convergence under (12) is of the form

and a regularization scheme that achieves this bound is called of optimal order.

The phenomenon of saturation is the effect that for certain regularization methods, the convergence rate ψ(δ) does not improve even when the smoothness is higher, i.e., μ is larger. This happens, for instance, for Tikhonov regularization at μ = 1 or for the ν-methods at μ = ν; see [7].

For the Nesterov acceleration (1), a detailed analysis has been performed by Neubauer [15] with the result that, assuming a usual source condition (12) and an appropriate a priori stopping rule, the resulting iterative regularization scheme is of optimal order for $\mu {\leqslant}\frac{1}{2}$, and, for $\mu { >}\frac{1}{2}$, the convergence rates improve with μ but in a suboptimal way. More precisely, the convergence rates proven in [15] are

Thus, contrary to saturating methods, the order still improves beyond the 'saturation index' $\mu =\frac{1}{2}$, but in a suboptimal way. This is what we call 'semi-saturation', and, to the knowledge of the author, this has not been observed before for a classical regularization method. A further result of [15] is that, using the discrepancy principle as a stopping rule, convergence rates are obtained, which are, however, always suboptimal.

Our second main contribution is an improvement of Neubauer's result in the sense that we show that the Nesterov iteration is of optimal order for a smoothness index $\mu {\leqslant}\frac{\beta +1}{4}$ with an a priori stopping rule. Moreover, contrary to [15], we also obtain optimal-order rates with the discrepancy principle provided that $\mu {\leqslant}\frac{\beta -1}{4}$. These findings allow one to always achieve optimal-order convergence provided that β is chosen sufficiently large.

Moreover, the phenomenon of semi-saturation is made transparent by referring to the representation in theorem 1: the residual is a product of Landweber-type and ν-type residuals, and keeping in mind that Landweber iteration does not show saturation for Hölder indices while the ν-methods do, it is clear that a product as in (5) leads to the above-described semi-saturation.

3.2. Convergence analysis

In this section we perform a convergence analysis for the iteration (1). By theorem 1, we may base our investigation on the known results for Landweber iteration and the ν-methods.

We collect some useful known estimates:

Equation (13)

This is well-known and follows from [16, equations (7.33.1) and (4.73)]. From this we immediately obtain that

Equation (14)

which has already been shown in [15]. Moreover, we may conclude from (13) and (5) as well that

Equation (15)

Recall that we denote by xk the iteration with yδ replaced by the exact data. As usual, this allows one to split the total error into an approximation and stability term. We estimate the stability term:

Proposition 1. Let ||A*A|| ⩽ 1 and define ${x}_{k}^{\delta }$ by (1) and (2) with β > −1. Let xk be the corresponding noise-free iteration with yδ replaced by y = Ax†. Then we have the estimate

Equation (16)

Proof. Following [7], it is enough to estimate

where we used the mean value theorem with $\tilde {\lambda }\in \left(0,\lambda \right)$. The derivative may be calculated from (5) as

We use Markov's inequality (cf [7, equation (6.16)]) and (13) to conclude that

Thus,

Equation (17)

The result now follows with [7, theorem 4.2] and (14). □

Note that this estimate is a slight improvement over the corresponding estimate in [15, equation (3.2)], which has a factor 2 on the right-hand side, as for the ν-methods.

From this we may conclude convergence:

Theorem 3. Let ||A*A|| ⩽ 1 and β > −1. If the iteration is stopped at a stopping index k(δ) that satisfies k(δ)δ → 0 and k(δ) → ∞ as δ → 0, then the iteration (1) is convergent:

Proof. We estimate

The first term converges to 0 by the assumption on k(δ), and the second term does so because k(δ) → ∞ and by the dominated convergence theorem using (14) and (15), as in [7]. □

We now consider convergence rates, and for this, the following rather deep estimate for orthogonal polynomials is needed. It was derived by Brakhage [5] as well as by Hanke [7, appendix A.2], [10] on the basis of Hilb-type estimates for Jacobi polynomials.

Proposition 2. Let β > −1. Then there is a constant cβ with

Equation (18)

Proof. For k even, this is [7, equation (6.22)] (with k there meaning 2k here), or [10, theorem 4.1]. However, the result there is based on the Hilb-type formula ([16, theorems 8.21.12, 8.21.13]), which holds for all k, as in [5, p 170]. Thus, by following the steps in [7, appendix A.2], the result is obtained. □

Note that in case −1 < β < 1, the constant cβ may be explicitly calculated from [16, equation (7.33.5)].

The corresponding estimates for the residuals of Landweber iteration are standard; cf [7, equation (6.8)]:

Equation (19)
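For the reader's convenience, the elementary bound underlying such estimates can be spelled out; this is our reconstruction of the standard inequality behind (19), not a verbatim copy of the display:

```latex
\sup_{0 \leqslant \lambda \leqslant 1} \lambda^{\mu}(1-\lambda)^{k}
  = \Big(\frac{\mu}{\mu+k}\Big)^{\mu}\Big(\frac{k}{\mu+k}\Big)^{k}
  \leqslant \Big(\frac{\mu}{\mu+k}\Big)^{\mu}
  \leqslant c_{\mu}\,(k+1)^{-\mu},
  \qquad k \geqslant 1,\ \mu > 0 .
```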

As a consequence, we may state our main convergence rate result for an a priori stopping rule:

Theorem 4. Let ||A*A|| ⩽ 1 and β > −1, and suppose that a source condition (12) is satisfied with some μ > 0.

  • (a)  
    If $\mu {\leqslant}\frac{\beta +1}{4}$ and the stopping index is chosen as
    then optimal order convergence is obtained:
  • (b)  
    If $\mu { >}\frac{\beta +1}{4}$ and the stopping index is chosen as
    Equation (20)
    then the following suboptimal order convergence is obtained:

Proof. For λ ⩽ 1, the estimate (18) together with ${\left(1-\lambda \right)}^{\frac{k+1}{2}}{\leqslant}1$ yields (by interpolation) that

Equation (21)

In the case $\mu { >}\frac{\beta +1}{4}$, additionally using (19), we have

Equation (22)

The result now follows by standard means:

Solving for k by equating the two terms in the last bounds yields the a priori parameter choice and the corresponding rates. □
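To make the balancing step explicit, here is a sketch of the computation in the optimal case μ ⩽ (β + 1)/4, assuming (as in the standard error decomposition) an approximation error of order k−2μρ from (21) and a propagated noise term of order kδ from (16):

```latex
k^{-2\mu}\rho \sim k\,\delta
\;\Longrightarrow\;
k(\delta) \sim \Big(\frac{\rho}{\delta}\Big)^{\frac{1}{2\mu+1}}
\;\Longrightarrow\;
\Vert x_{k(\delta)}^{\delta}-x^{\dagger}\Vert
 = O\!\big(k(\delta)^{-2\mu}\rho + k(\delta)\,\delta\big)
 = O\!\big(\delta^{\frac{2\mu}{2\mu+1}}\rho^{\frac{1}{2\mu+1}}\big).
```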

By a slight refinement, we may even show o(.)-rates as in [15]:

Corollary 1. With the same assumptions as in theorem 4, if the stopping index is chosen such that

(cf, [15, equation (3.1)]), then the same rates for k(δ) as in theorem 4 hold for the stopping index and we obtain the convergence rates

Proof. Following [15], we only have to improve the approximation rates ||xk − x†|| to o(.)-rates. From (21), it follows that for $\mu {\leqslant}\frac{\beta +1}{4}$

Equation (23)

In case of $\mu { >}\frac{\beta +1}{4}$, we note that for all λ ∈ (0, 1] and ξ > 0, ${\mathrm{lim}}_{k\to \infty }{\left(1-\lambda \right)}^{\frac{k+1}{2}}{k}^{\xi }=0$. Thus, we may conclude from (22) that for all λ ∈ (0, 1],

Equation (24)

Thus, by the theorem of dominated convergence, we obtain o(.)-rates for the approximation error:

The result now follows in exactly the same way as in [15]. □

These results correspond to those of Neubauer when β = 1. However, for β > 1 this is an improvement, as we obtain optimal-order convergence if β is chosen larger than 4μ − 1. We note that in the optimal-order case, the number of iterations needed is $O\left({\delta }^{-\frac{1}{2\mu +1}}\right)$, which is the same order as for semiiterative methods and for the conjugate gradient method. The corresponding number of iterations for the Landweber method is $O\left({\delta }^{-\frac{2}{2\mu +1}}\right)$; cf, e.g. [7, theorem 6.5]. Since $\frac{1}{\mu +\frac{\beta +1}{4}+1}{< }\frac{2}{2\mu +1}$, the number of iterations is, in general, smaller than for the Landweber method even in the suboptimal case. Thus, Nesterov acceleration certainly qualifies as a fast method.

3.3. Converse results and logarithmic rates

We present some further contributions to the regularization theory of Nesterov acceleration, namely converse results and logarithmic rates. In this section, we denote by Eλ the spectral family of A*A (cf [7]).

Converse results are statements that certain rates for the approximation error ||xk − x†|| imply some regularity of the true solution x† in the form of source conditions. These are converse to standard convergence rate results as, e.g., in theorem 4, where a decay of the approximation error follows from a regularity condition. The results of (21) and (22) state that a given rate of approximation

Equation (25)

for some ξ > 0, is obtained for a smoothness index μ with μ ⩾ μ*, where

Equation (26)

Now, concerning converse results for Nesterov acceleration, we may verify, similarly to [7, proposition 4.13], that a given rate of approximation requires a certain smoothness index, so that our convergence results are rather sharp in that respect. Unfortunately, we can prove this only for the optimal-order range of indices.

Theorem 5. Let ||A*A|| ⩽ 1. For β > −1 fixed, assume that the approximation error obeys a certain rate (25) for some ξ > 0. Then x† has to satisfy a source condition

for any ε > 0, with ${\mu }^{{\ast}}=\frac{\xi }{2}$.

Proof. In (17), we established the bound $\vert {g}_{k}\left(\lambda \right)\vert {\leqslant}{k}^{2}$ for k ⩾ 1. This yields that

Using spectral theory, it follows that

Thus, a convergence rate of ||xk − x†|| = O(k−ξ) implies ${\Vert}{E}_{\frac{1}{2{k}^{2}}}{x}^{{\dagger}}{\Vert}=O\left({\left(\frac{1}{{k}^{2}}\right)}^{\frac{\xi }{2}}\right)$, which implies the source condition with $\mu {\leqslant}\frac{\xi }{2}-{\epsilon}$ for any positive ε by [7, lemma 4.12] (see also [2]). This proves the result. □

Remark 2. The result of theorem 5 is comparable to well-known result in the optimal-order situation. It is an open problem whether it can be verified that for a rate $\frac{\xi }{2}{ >}\frac{\beta +1}{4}$, i.e., in the suboptimal case, also a higher smoothness index as in (26) (second line) is needed. We could not establish results in that direction, mainly because it is difficult to find lower bounds for the Gegenbauer polynomials (which may have zeros in the spectrum).

Some more general rates and converse results have been established in [2]. There (see also [4]), the so-called best worst case error has been defined as (adapted to our notation)

which represents the best δ-rate that one can get for x† satisfying certain smoothness conditions. What has been found in [2] (cf proposition 3.3) and [4, theorem 2.16] is that, for certain methods, a convergence rate of ${{\Vert}{x}_{k}-{x}^{{\dagger}}{\Vert}}^{2}{\leqslant}\phi \left(k\right)$ is equivalent to a convergence rate for the best worst case error bwc(δ) ⩽ ψ(δ) and also equivalent to a decay rate of ${{\Vert}{E}_{t}{x}^{{\dagger}}{\Vert}}^{2}=\phi \left(t\right)$ (which is related to a source condition). These results are established for a general class of monotone regularization schemes. However, in our case, this monotonicity (and various positivity assumptions) need not be satisfied, as the Gegenbauer polynomials are not monotone in k. Thus, a further investigation of such equivalences and converse results is an open problem.

As an illustration of this theory and as an example of convergence rates under general smoothness classes similar to those in, e.g. [2, 4, 13], we can verify logarithmic rates for the Nesterov acceleration scheme. We define the logarithmic (monotone) rate function (cf [2, 4]) for some ν > 0:

with the continuous extension φν (0) = 0.

Proposition 3. Let ||A*A|| ⩽ 1 and β > −1. Suppose that the following logarithmic source condition holds:

Equation (27)

Then Nesterov acceleration shows a logarithmic best worst case rate

for δ → 0.

Proof. From log(x) ⩽ x − 1 for x > 0, it follows that log(1 − λ) ⩽ −λ for λ ∈ [0, 1], hence

Combining this with (5) and (13) yields the bound

We may proceed similarly to [4]. Using [4, estimate (4.6)] with α = k−1 yields

Thus,

As in [4, (2.18)] we find with (27) that

Furthermore,

Since the supremum is easily seen to be bounded, it follows that this integral can be bounded by O(φν (k−1)). Altogether, we find, for k−1 ⩽ e−(1+ν) and with some generic constant C, that

where we used (27). Since additionally φν (k−1) ⩽ 2ν φν (k−2), we observe with a different constant C that

By balancing the two terms, we obtain an equation for k which, when put back into the bound, yields as in [2, p 533] (with k−2 playing the role of α) the upper bound

for δ sufficiently small. Taking the inf and sup on the left-hand side establishes the result. □

3.4. Discrepancy principle

With the improved estimates, we can also strengthen the result of [15] when the iteration is combined with the well-known discrepancy principle. Recall that it defines a stopping index k(δ) a posteriori as the first (smallest) k that fulfils the inequality

Equation (28)

where τ > 1 is fixed. The corresponding convergence rates can be obtained by a slight modification of the proof in [15] and the general theory in [7].
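As a usage illustration, the following sketch combines the iteration with the discrepancy principle, assuming (28) has the usual form ||A xkδ − yδ|| ⩽ τδ; the argument names are placeholders of this sketch.

```python
import numpy as np

def nesterov_discrepancy(A, y_delta, delta, beta=3.0, tau=1.01, max_iter=10000):
    """Nesterov iteration stopped by the discrepancy principle, assuming (28)
    reads ||A x_k - y_delta|| <= tau * delta (standard form, tau > 1)."""
    x_prev = np.zeros(A.shape[1])
    x = x_prev.copy()
    for k in range(1, max_iter + 1):
        if np.linalg.norm(A @ x - y_delta) <= tau * delta:    # stopping test (28)
            return x, k - 1                                   # first index fulfilling (28)
        a_k = (k - 1.0) / (k + beta)                          # momentum, cf (2)
        z = x + a_k * (x - x_prev)
        x_prev, x = x, z - A.T @ (A @ z - y_delta)            # Landweber step applied to z
    return x, max_iter
```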

Theorem 6. Let ||A*A|| < 1, β > −1, and assume that a source condition (12) is satisfied. If the iteration (1) is stopped by the discrepancy principle (28), then the following convergence rates are obtained:

  • (a)  
    If $\mu +\frac{1}{2}{\leqslant}\frac{\beta +1}{4}$, then optimal-order convergence rates are achieved
    with a stopping index k(δ) being of the same order as in (20).
  • (b)  
    If $\mu +\frac{1}{2}{ >}\frac{\beta +1}{4}$, then it holds that
    and a rate of
    is achieved.

Proof. The proof of [15, theorem 4.1] needs only minor modifications. The estimate [15, equation (4.3)]

is valid independently of our new rate results; hence it follows as in [15, equation (4.4)] that ${\Vert}{x}_{k\left(\delta \right)}-{x}^{{\dagger}}{\Vert}{\leqslant}o\left({\delta }^{\frac{2\mu }{2\mu +1}}\right)$. It remains to estimate ${\Vert}{x}_{k\left(\delta \right)}^{\delta }-{x}^{{\dagger}}{\Vert}$ by (16) combined with an upper bound for k(δ). The estimate [15, equation (4.2)] and the discrepancy principle yield

for k = k(δ).

To obtain o(.)-estimates, we slightly refine the bound (21). By interpolation, we obtain from (18) and (13) that

Equation (29)

Thus,

where ${\mathrm{lim}}_{k\to \infty }\enspace {\gamma }_{1}\left(k,\lambda \right)={\mathrm{lim}}_{k\to \infty }\enspace {\gamma }_{2}\left(k,\lambda \right)=0$ pointwise for λ ∈ [0, 1). Thus, in the case $\mu +\frac{1}{2}{\leqslant}\frac{\beta +1}{4}$, we obtain by the theorem of dominated convergence that

Hence,

which yields (20), and with (16) we obtain ${\Vert}{x}_{k\left(\delta \right)}^{\delta }-{x}^{{\dagger}}{\Vert}=o\left({\delta }^{\frac{2\mu }{2\mu +1}}\right)$, which proves the result in the optimal case.

In case that $\mu +\frac{1}{2}{ >}\frac{\beta +1}{4}$, since γ2(k, λ) = o(k), the corresponding estimate is

from which the result in the second case follows. □

These rates agree with those of [15] when setting β = 1. There, however, only the suboptimal case (b) was possible. Our improvement is to show that we may achieve optimal-order results even with the discrepancy principle provided β is sufficiently large.

Remark 3. It is clear that in practice β should be selected in the regime of optimal rates, i.e. β > 4μ − 1 for a priori choices and β > 4μ + 1 for the discrepancy principle. However, it is a rule of thumb to choose such parameters as small as possible or, more precisely, in such a way as to come close to the saturation point, i.e., β ∼ 4μ − 1 and β ∼ 4μ + 1, respectively.

Remark 4. For semiiterative methods, a modified discrepancy principle [7, 10] has been defined, where the residual in (28) is replaced by an expression of the form (yδ, sk(AA*)yδ) with a constructed function sk. This yields an order-optimal method, as for the a priori stopping rule. An adaptation of this strategy for the Nesterov iteration is certainly possible and should yield order-optimal rates for all $\mu {\leqslant}\frac{\beta +1}{4}$. However, the strategy is quite involved and it is not completely clear to us how to incorporate it into the iteration efficiently. We thus do not investigate such modifications in this article.

4. Numerical results

In this section we present some small numerical experiments to illustrate the semi-saturation phenomenon and to investigate the performance of Nesterov's iteration, in particular, with respect to the optimal-order results.

In a first example, we consider a simple diagonal operator $A=\mathrm{diag}\left(\frac{1}{{n}^{2}}\right)$, for n = 1, ..., 1000, as well as an exact solution ${x}^{{\dagger}}={\left(\frac{1}{{n}^{4}}{\left(-1\right)}^{n}\right)}_{n=1}^{\mathrm{1000}}$, which amounts to a source condition being satisfied with index μ = 0.75. Thus, we are in a case of higher smoothness, where the results of the present article really improve those of [15]. We added standard normally distributed Gaussian noise to the exact data and applied various iterative regularization schemes: Landweber iteration, the ν-method, and the Nesterov iteration, the latter two with various settings of the parameters ν and β, respectively.

We calculated the stopping index either by the discrepancy principle (28) with τ = 1.01 or, since we have the luxury of an available exact solution in this synthetic example, we also calculate the oracle stopping index, which is defined as

In other words, kopt is the theoretically best possible stopping index.
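For reproducibility, the diagonal test problem and the oracle index can be set up in a few lines, reusing the `nesterov_iteration` sketch from section 1; scaling the Gaussian noise so that ||y − yδ|| = δ is our assumption, since the exact noise scaling is not spelled out above.

```python
import numpy as np

n = np.arange(1, 1001)
A = np.diag(1.0 / n**2)                       # A = diag(1/n^2)
x_true = (-1.0)**n / n**4                     # exact solution, smoothness index mu = 0.75
y = A @ x_true

rng = np.random.default_rng(0)
delta = 1e-3
noise = rng.standard_normal(y.size)
y_delta = y + delta * noise / np.linalg.norm(noise)   # noise scaled so that ||y - y_delta|| = delta

# oracle stopping index: the iterate closest to the exact solution
iterates = nesterov_iteration(A, y_delta, beta=4.0, n_iter=500)
errors = [np.linalg.norm(xk - x_true) for xk in iterates]
k_opt = int(np.argmin(errors))
print("k_opt =", k_opt, ", error =", errors[k_opt])
```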

In figure 1, we display the error ${\Vert}{x}_{k\left(\delta \right)}^{\delta }-{x}^{{\dagger}}{\Vert}$ against various noise levels on a log–log scale. The curves correspond to convergence rates for the Nesterov iteration (full line, blue), Landweber iteration (dotted line, black), and the ν-method (dash-dotted line, red). The parameters were chosen as β = 4 and ν = 1, i.e., we are in the optimal-order case covered by item (a) of theorems 4 and 6. On the left-hand side we employ the oracle stopping rule using kopt and on the right-hand side we use the discrepancy principle.

Figure 1. Log–log plot of the error ${\Vert}{x}_{k\left(\delta \right)}^{\delta }-{x}^{{\dagger}}{\Vert}$ versus the noise level δ for the Nesterov iteration (full line, blue), Landweber iteration (dotted line, black), and the ν-method (dash-dotted line, red). Left: optimal stopping rule. Right: stopping by discrepancy principle. The parameters β, ν are in an optimal-order regime.

As can be observed, all three methods show a similar (optimal-order) rate, as stated in theorems 4 and 6. In particular, this verifies one of our findings that the discrepancy principle for Nesterov's iteration leads to an optimal-order method provided β is chosen appropriately.

In figure 2, we illustrate the semi-saturation phenomenon: here β and ν are deliberately chosen too small (β = 0, ν = 0.4 on the left-hand side and β = −0.5, ν = 0.3 on the right-hand side). We observe that for small ν, the convergence rate of the ν-method is slow as a result of its saturation. On the other hand, the Nesterov iteration also has a slower rate than the non-saturating Landweber iteration but, as can be expected from our residual polynomial representation, it lies in between the other two.

Figure 2. The same plot as in figure 1 (left) for various iteration parameters in a suboptimal-order regime. Left: ν = 0.4 and β = 0. Right: ν = 0.3 and β = −0.5. Stopping by the optimal stopping rule kopt.

We remark that the ν-methods show some unpleasant behaviour when ν is chosen small. The residual is highly oscillatory, and for small noise levels we could not even reach the prescribed discrepancy; if we did, the number of iterations was quite high, even higher than for Landweber iteration. This might be attributed to our quite aggressive setting of the discrepancy principle with τ = 1.01. In that respect, the Nesterov iteration was very well behaved, and we had no problems with a small β, which is probably due to the robust Landweber component in the representation (5).

The optimal-order convergence only partly illustrates the effective performance of the methods. In table 1 we therefore provide the ratio of error values, i.e., the numbers in the table are $\frac{{\Vert}{x}_{\mathrm{m}\mathrm{e}\mathrm{t}\mathrm{h}\mathrm{o}\mathrm{d},k}^{\delta }-{x}^{{\dagger}}{\Vert}}{{\Vert}{x}_{\mathrm{N}\mathrm{e}\mathrm{s}\mathrm{t}\mathrm{e}\mathrm{r}\mathrm{o}\mathrm{v},{k}_{\text{opt}}}^{\delta }-{x}^{{\dagger}}{\Vert}}$, where ${x}_{\mathrm{N}\mathrm{e}\mathrm{s}\mathrm{t}\mathrm{e}\mathrm{r}\mathrm{o}\mathrm{v},{k}_{\text{opt}}}^{\delta }$ denotes the Nesterov iteration with the optimal stopping rule and ${x}_{\mathrm{m}\mathrm{e}\mathrm{t}\mathrm{h}\mathrm{o}\mathrm{d},k}^{\delta }$ the iterate of the respective method with the respective stopping rule. All results correspond to an optimal-order regime of parameters (those of figure 1). The number of iterations (both for the oracle stopping rule and the discrepancy principle) is given in table 2. In these tables, we also include the corresponding results for the conjugate gradient iteration CGNE [11].

Table 1. Errors compared to Nesterov iteration: $\frac{{\Vert}{x}_{\mathrm{m}\mathrm{e}\mathrm{t}\mathrm{h}\mathrm{o}\mathrm{d},k}^{\delta }-{x}^{{\dagger}}{\Vert}}{{\Vert}{x}_{\mathrm{N}\mathrm{e}\mathrm{s}\mathrm{t}\mathrm{e}\mathrm{r}\mathrm{o}\mathrm{v},{k}_{\text{opt}}}^{\delta }-{x}^{{\dagger}}{\Vert}}$.

Method      Stopping      δ = 10−5   10−4   10−3   10−2   10−1
Nesterov    kopt          1          1      1      1      1
Landweber   kopt          1.15       0.83   0.96   1.05   1.06
ν-method    kopt          1.02       1.06   1.01   1.26   0.97
CGNE        kopt          1.02       0.82   1.05   1.02   0.84
Nesterov    Discrepancy   1.58       1.10   1.41   2.84   1.90
Landweber   Discrepancy   2.23       1.17   1.41   2.80   1.98
ν-method    Discrepancy   1.02       1.13   1.00   1.56   1.88
CGNE        Discrepancy   1.81       1.19   1.05   2.51   1.97

Table 2. Number of iterations for various methods; setting as in table 1.

Method      Stopping      δ = 10−5   10−4   10−3   10−2   10−1
Nesterov    kopt          371        163    65     26     15
Landweber   kopt          11 000     2193   512    145    36
ν-method    kopt          190        82     33     22     9
CGNE        kopt          10         6      4      3      2
Nesterov    Discrepancy   260        111    39     13     1
Landweber   Discrepancy   5106       1080   220    37     1
ν-method    Discrepancy   190        96     33     10     1
CGNE        Discrepancy   8          5      4      2      1

In terms of the number of iterations, the Nesterov iteration is slightly slower than the ν-methods (approximately by a constant factor of 1.5), but both show a similarly modest increase in the number of iterations when δ is decreased. Both need more iterations than the CGNE method, which, of course, is the fastest one by design. The slightly higher number of iterations might be attributed to the better error estimate in (16). (Note that the ν-methods have a 2 in place of $\sqrt{2}$ there.) It might appear a little paradoxical that a better estimate leads to slower convergence, but this is clear from the theory, as the number of iterations is a decreasing function of δ and thus also of any factor in front of δ. This factor, however, pays off when considering the total error of the method, and we observe that the Nesterov iteration with the optimal choice kopt indeed almost always has a slightly smaller error than the ν-method. Surprisingly, it is in several instances also better than the CGNE method. However, the Nesterov method sometimes loses some of its advantages over the ν-method when using the discrepancy principle, but the performance is still acceptable.

Some further experiments indicate that the results are rather insensitive to overestimating β. As stated in remark 3, the best choice is usually related to the smoothness index, but no serious problems arose when β was chosen larger.

Further numerical experiments have been performed in [15]: even though the value of β was not reported there, the results are consistent with our theory with the choice β = 1. The forward operator there was the Green's function for the solution of the 1D boundary value problem −u'' = f with homogeneous boundary conditions. Exact solutions of various smoothness are stated there: example 5.1 with $\mu =\frac{1}{8}$, example 5.2 with $\mu =\frac{5}{8}$, and example 5.3 with $\mu =\frac{17}{8}$. We used the same problem and the same examples, but we computed A from an FEM discretization of the boundary value problem, with A being the corresponding discrete solution operator. For simplicity, we ignored discretization errors and took the discretized (projected) solution as x†.
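A minimal stand-in for this forward operator can be built from a simple finite-difference discretization of −u'' = f with homogeneous Dirichlet boundary conditions (the paper uses an FEM discretization; the grid size and the normalization below are assumptions of this sketch):

```python
import numpy as np

def discrete_solution_operator(N=200):
    """Discrete solution operator of -u'' = f on (0, 1) with homogeneous
    Dirichlet boundary conditions, via finite differences on N interior nodes."""
    h = 1.0 / (N + 1)
    L = (2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h**2   # discrete -d^2/dx^2
    A = np.linalg.inv(L)                      # discrete analogue of the Green's operator
    return A / np.linalg.norm(A, 2)           # rescale so that ||A^T A|| <= 1

A = discrete_solution_operator()
print(A.shape, np.linalg.norm(A, 2))          # spectral norm 1 after rescaling
```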

The main purpose of this experiment is to verify that the discrepancy principle (τ = 1.1) can be made an optimal-order method. We chose β = 3.5 for the first two examples and β = 9.5 for the third, which should in any case lead to an optimal-order situation. In figure 3, we plot the error versus the relative noise level on a logarithmic scale for the three examples with this choice of β, indicated by the marker 'x'. As a comparison, we also indicate the predicted optimal rate by a solid line. Furthermore, also shown and marked with '+' are the corresponding results for β = 1, i.e., in the suboptimal case.

Figure 3. Convergence rates for the examples in [15]. Left: example 1, smoothness index $\mu =\frac{1}{8}$. Centre: example 2, smoothness index $\mu =\frac{5}{8}$. Right: example 3, smoothness index $\mu =\frac{17}{8}$. Displayed are the errors versus the noise level on a logarithmic scale. A marker 'x' indicates optimal choice of β, and '+' indicates suboptimal choice β = 1. The full line indicates the optimal order rate.

These results clearly illustrate that for the discrepancy principle we may achieve the optimal-order rates with the correct choice of β, and that for a wrong choice of β the rate deteriorates. For low smoothness as in example 1 (left picture in figure 3), however, there seems to be almost no deterioration, contrary to expectation.

5. Conclusion

We have provided a representation of the residual polynomials for Nesterov's acceleration method for linear ill-posed problems as a product of Gegenbauer polynomials and Landweber-type residuals. This allowed us to prove optimal-order rates for an a priori stopping rule and the discrepancy principle as long as β in (2) is sufficiently large. The number of iterations is shown to be of the same order as for other fast methods such as the ν-method or the conjugate gradient method. Moreover, our representation clearly explains the observed semi-saturation phenomenon.

Within the class of linear iterative methods, Nesterov acceleration is an excellent choice, as it is a fast method as well as a quite robust one. It must be conceded, though, that it cannot compete with the conjugate gradient method in terms of the number of iterations. However, this is compensated by its flexibility and simplicity of use, which allows one to easily integrate it into existing gradient methods and also to apply it in nonlinear cases.

Data availability statement

No new data were created or analysed in this study.

Footnotes

  • Dedicated to A Neubauer on the occasion of his 60th birthday.
