A modified discrepancy principle to attain optimal convergence rates under unknown noise

We consider a linear ill-posed equation in the Hilbert space setting. Multiple independent unbiased measurements of the right-hand side are available. A natural approach is to take the average of the measurements as an approximation of the right-hand side and to estimate the data error as the inverse of the square root of the number of measurements. We determine the optimal convergence rate (as the number of measurements tends to infinity) under classical source conditions and introduce a modified discrepancy principle which asymptotically attains this rate.


Introduction
We aim to solve Kx̂ = ŷ, where K is compact with dense range and x̂ and ŷ are elements of infinite-dimensional Hilbert spaces X and Y. The exact data ŷ is unknown, but we have access to multiple unbiased i.i.d. measurements Y_1, ..., Y_n with unknown arbitrary distribution and finite variance (E‖Y_1 − ŷ‖² < ∞). Note that at this point the measurements are infinite-dimensional objects (e.g. functions); we will later discretise along the singular vectors of the operator K. Repeating and averaging the measurement process is a standard engineering practice to estimate and reduce random uncertainties; see [23], [5] and [18] for introductory monographs on the subject of error analysis from a practical viewpoint. In the given setting, a natural estimator of the unknown data ŷ is the sample mean Ȳ_n := (1/n) Σ_{i=1}^n Y_i.

The compactness of K implies that the equation is ill-posed, so one cannot rely on classical direct methods like the LR- or QR-decomposition to determine the (generalised) inverse of K. Regularisation is needed, and the inverse is replaced with a family of related but continuous approximations, e.g. Tikhonov or spectral cut-off regularisation. The particular choice of the approximation inevitably has to be based on knowledge of an upper bound of the true error δ^true_n := ‖Ȳ_n − ŷ‖, as the famous result of Bakushinskii [2] states. While the exact value of δ^true_n is clearly not given due to randomness, its size depends mainly on the number of measurements, since E(δ^true_n)² = E‖Y_1 − ŷ‖²/n. Thus δ^est_n := 1/√n, or a version rescaled by an estimate of E‖Y_1 − ŷ‖², is a natural estimator of the unknown true error δ^true_n. So a natural approach for the solution of the equation is to use the mean Ȳ_n and the estimated data error δ^est_n together with a deterministic regularisation method. Indeed, in [11] it was verified that this approach converges in a suitable sense for a large class of regularisation methods. See also [12] and [20], where this approach was extended to settings involving white or Poissonian noise.
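As a toy illustration of this averaging step, the following sketch (a hypothetical finite-dimensional stand-in; all sizes and the noise level are illustrative choices, not taken from the analysis) compares the true error of the sample mean with the natural estimate 1/√n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite-dimensional stand-in for the data space Y:
# exact data y_hat and n unbiased i.i.d. measurements Y_1, ..., Y_n.
m, n = 200, 400
y_hat = 1.0 / np.arange(1, m + 1)              # exact right-hand side (illustrative)
Y = y_hat + 0.1 * rng.standard_normal((n, m))  # measurements with centred noise

Y_bar = Y.mean(axis=0)                          # sample mean, our estimator of y_hat
delta_true = np.linalg.norm(Y_bar - y_hat)      # true error (unknown in practice)
delta_est = 1.0 / np.sqrt(n)                    # natural estimate of the error size

print(delta_true, delta_est)                    # both are of order 1/sqrt(n)
```

Averaging more measurements shrinks the true error at the 1/√n rate, which is exactly what the estimated noise level tracks.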
The rate of convergence of a given regularisation method is known to depend on a certain smoothness of the unknown solution x̂ relative to the operator K. Classical convergence rates for deterministic noise (i.e. in a setting where one knows an upper bound for the norm of the noise) are deduced by a worst-case error analysis. In our setting, however, the noise, though random, typically excludes many 'bad' directions for a fixed unknown error distribution. So, as is typical under random noise (see e.g. [3], [8] or [17]), the optimal rates obtained here should be substantially better than the ones one would expect from a deterministic worst-case error analysis. Indeed, we show that the optimal rates here are better than for the deterministic worst case. The main result of this work then consists of a modified discrepancy principle, which yields (almost) the best possible rate for arbitrary unknown error distributions.

Denote by (σ_j, u_j, v_j)_{j∈N} the singular value decomposition of K, i.e. (u_j)_{j∈N} is an orthonormal basis of Y (note that K is assumed to have dense range), (v_j)_{j∈N} is an orthonormal basis of N(K)^⊥, (σ_j)_{j∈N} is a monotone sequence of positive numbers converging to 0, and Kv_j = σ_j u_j. In the following we restrict to the spectral cut-off regularisation and to mildly ill-posed problems, i.e. we assume that there exists q > 0 such that σ_j² ≍ j^{−q}. Thus the reconstruction will be based on the projections (Y_i, u_j) for i, j ∈ N. The unbiasedness assumption reads E[(Y_i, u_j)] = (ŷ, u_j) for all i, j ∈ N, and we moreover assume that there is a p > 1 such that E[(Y_1 − ŷ, u_j)²] ≍ j^{−p}. Later we will always consider only a finite number of components for a fixed number of measurements n. Spectral cut-off at truncation level k for the component-wise averages then yields the following estimator for x̂:

x̂_k^{(n)} := Σ_{j=1}^k σ_j^{−1} (Ȳ_n, u_j) v_j.    (1.1)

In order to find a reasonable reconstruction, the truncation level has to be determined in dependence on the (estimated) noise level, which depends on the number of measurements n.
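The cut-off estimator is straightforward to sketch numerically. The following toy example (a diagonal model along the singular basis; all exponents and sizes are illustrative assumptions) truncates the SVD inversion at level k:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative diagonal model along the singular basis of K:
# sigma_j^2 ≍ j^{-q}, component-wise noise variances decaying like j^{-2}.
m, n, q = 100, 1000, 2.0
j = np.arange(1, m + 1)
sigma = j ** (-q / 2.0)

x_true = j ** (-1.5)                                 # coefficients (x, v_j)
y_hat = sigma * x_true                               # exact data (y, u_j)
Y = y_hat + 0.05 * j ** (-1.0) * rng.standard_normal((n, m))
Y_bar = Y.mean(axis=0)

def cutoff(Y_bar, sigma, k):
    """Spectral cut-off: invert only the first k singular components."""
    x = np.zeros_like(Y_bar)
    x[:k] = Y_bar[:k] / sigma[:k]
    return x

err = np.linalg.norm(cutoff(Y_bar, sigma, 20) - x_true)
```

A moderate truncation level already beats a very small one here, because it trades a little propagated noise for a much smaller approximation error.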
The rest of the paper is organised as follows. In the following Section 2 we state the three main Theorems, which are proven in Section 3. The main result is accompanied by numerical experiments in Section 4 and the article ends with a short conclusion in Section 5.

Main results
We derive convergence rates with respect to classical Hölder-type source conditions

X_{ν,ρ} := { x ∈ X : x = (K*K)^{ν/2} ξ with ‖ξ‖ ≤ ρ },  ν, ρ > 0.    (2.1)

If x̂ ∈ X_{ν,ρ}, we say that x̂ obeys smoothness (ν, ρ) relative to K. Via (1.1) a whole class of estimators indexed by k ∈ N is defined, also known under the term projection estimators (with respect to the singular value decomposition, see [9]). The first result gives the optimal error bound for our estimators (1.1) on X_{ν,ρ}, where we measure performance by the integrated mean squared error (also called the minimax (L²)-risk in this context).
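In the singular basis the source condition translates into a decay condition on the coefficients of x̂; a short computation (using σ_j² ≍ j^{−q}) reads:

```latex
\hat{x} = (K^*K)^{\nu/2}\,\xi
\;\Longrightarrow\;
(\hat{x}, v_j) = \sigma_j^{\nu}\,(\xi, v_j),
\qquad
\sum_{j=1}^{\infty} \sigma_j^{-2\nu}\,(\hat{x}, v_j)^2 = \|\xi\|^2 \le \rho^2,
```

and hence |(x̂, v_j)| ≲ ρ j^{−qν/2} up to constants; it is exactly this coefficient decay that the bias estimates rest on.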
In particular, for q − p > −1 the optimal rate is attained by the a priori choice k_n ≍ n^{1/(qν+q−p+1)}. Note that under additional assumptions one can show that the rate from Theorem 2.1 is (up to a constant factor) the optimal rate for all possible estimators, not just for the projection estimators (1.1). See e.g. [22] and [4] for the case where ((Y_1 − ŷ, u_j))_{j∈N} are independent and Gaussian.
In view of the fact that the optimal worst-case error bound for deterministic noise level 1/√n under the source condition x̂ ∈ X_{ν,ρ} has order (1/√n)^{ν/(ν+1)}, we see that the minimax risk attained by the oracle k_n is in all cases strictly better. In particular, for q − p < −1 the problem is in fact well-posed. However, the above choice k_n requires knowledge of both the smoothness ν and the decay of the variances p. A plain use of the discrepancy principle [21] as an adaptive strategy to determine the truncation level would be to find k = k(Ȳ_n, δ^est_n) such that the size of the residual is approximately equal to the estimated noise level, i.e. via the relation

‖Ȳ_n − K x̂_k^{(n)}‖ ≈ δ^est_n.    (2.2)

In [11] it was shown that the choice (2.2) adapts to the unknown smoothness ν in the sense that asymptotically the optimal deterministic bound holds with a probability converging to 1. According to Theorem 2.1, this is suboptimal. The reason is an intrinsic drawback of the plain discrepancy principle for statistical noise, which makes it prone to stop too late. We therefore consider in this work a modified version of the discrepancy principle, which also takes information about the stochastic nature of the noise into account.
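Schematically, for spectral cut-off the residual is just the norm of the discarded tail of Ȳ_n, so this residual-matching stopping rule can be sketched as follows (the fudge factor tau is an illustrative assumption):

```python
import numpy as np

def discrepancy_principle(Y_bar, delta_est, tau=1.2):
    """Plain discrepancy principle for spectral cut-off (schematic):
    return the smallest k whose residual ||Y_bar - K x_k|| = ||tail of Y_bar||
    falls below tau * delta_est."""
    m = len(Y_bar)
    for k in range(m + 1):
        if np.linalg.norm(Y_bar[k:]) <= tau * delta_est:
            return k
    return m

# Toy data: decaying components, estimated noise level 0.2.
k = discrepancy_principle(np.array([1.0, 0.5, 0.1, 0.01]), delta_est=0.2)
```

The rule keeps adding components only as long as the residual still exceeds the (estimated) noise level.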
We first formulate a simplified version of the main result to illustrate the approach. As already mentioned, the rate of convergence depends on certain smoothness properties of the true unknown solution x̂ relative to the forward operator K, see e.g. (2.1). The general idea is to rescale the operator K with a weighting operator S such that the smoothness of x̂ relative to the rescaled operator SK is better than the original one relative to K. In order to avoid the distinction of several cases, let us assume for the moment that q − p > −1 in addition to the assumptions of Theorem 2.1. Moreover, we assume that p is known to us. The latter is a serious restriction, which will be dropped in the main result, Theorem 2.3, below. However, there are settings where this knowledge is justified, see Example 2.1 at the end of this section. For any ε > 0 with p > 1 + ε we define the (linear and unbounded) weighting operator S as the linear extension of

S u_j := d_j u_j with d_j := j^{(p−1−ε)/2},  j ∈ N.

Since q − p > −1, we directly see that SK : X → Y is compact, with singular values d_j σ_j ≍ j^{−(q+1+ε−p)/2} and the same singular bases (v_j)_{j∈N} and (u_j)_{j∈N} as K. Now assume that x̂ obeys smoothness (ν, ρ) relative to K, i.e. there exists ξ ∈ X with x̂ = (K*K)^{ν/2} ξ and ‖ξ‖ ≤ ρ. Let ν′ := qν/(q+1+ε−p). Note that ν′ > ν, since we assumed that p > 1 + ε. Then x̂ = ((SK)*(SK))^{ν′/2} ξ′ with ‖ξ′‖ ≤ cρ, therefore x̂ obeys smoothness (ν′, cρ) relative to SK (with a constant c > 0). Moreover, the rescaled measurements SY_1, SY_2, ... are unbiased estimators of Sŷ with finite variance

E‖S(Y_1 − ŷ)‖² = Σ_{j=1}^∞ d_j² E(Y_1 − ŷ, u_j)² ≍ Σ_{j=1}^∞ j^{−(1+ε)} < ∞.    (2.3)

Our modification of the discrepancy principle is that we apply it not to the unscaled operator and measurements K and Ȳ_n, but to the rescaled ones SK and SȲ_n (note that SK also has dense range). Consequently, the stopping index k_n is the solution of the equation

‖SȲ_n − SK x̂_{k_n}^{(n)}‖ ≈ δ^est_n,    (2.4)

with δ^est_n now estimating the rescaled data error ‖S(Ȳ_n − ŷ)‖. The following theorem states that, up to ε, the optimal bound from Theorem 2.1 holds with a probability converging to 1 as n → ∞ using this strategy.
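The effect of the rescaling is easy to check numerically. With illustrative exponents satisfying the standing assumptions (σ_j² ≍ j^{−q}, p > 1 + ε, q − p > −1), the weights d_j = j^{(p−1−ε)/2} flatten the decay of the singular values and raise the Hölder index from ν to ν′:

```python
import numpy as np

# Illustrative exponents: q = 2, p = 1.5, eps = 0.2 (so p > 1 + eps).
q, p, eps = 2.0, 1.5, 0.2
j = np.arange(1, 201)

sigma = j ** (-q / 2.0)              # singular values of K
d = j ** ((p - 1 - eps) / 2.0)       # weights of the rescaling operator S

rescaled = d * sigma                 # singular values of SK: j^{-(q+1+eps-p)/2}

nu = 1.0
nu_prime = q * nu / (q + 1 + eps - p)  # improved smoothness index nu' > nu
```

The slower the singular values of SK decay, the larger the Hölder index at which the fixed coefficient decay of x̂ is measured, which is precisely the smoothness gain the modification exploits.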
Note that convergence in mean squared error cannot be expected for the discrepancy principle, see [11]. However, adding a so-called "emergency stop" to the procedure (see [8], [11]) might allow one to deduce rates in mean squared error, but we leave this as future work and focus on the rates in probability.
Theorem 2.2. Let ε > 0 be such that p > 1 + ε and assume that x̂ ∈ X_{ν,ρ}. Then there exists L > 0 such that for k_n the solution of (2.4) there holds

P( ‖x̂_{k_n}^{(n)} − x̂‖ ≤ L (1/√n)^{ν′/(ν′+1)} ) → 1    (2.5)

as n → ∞.
Theorem 2.2 is an immediate consequence of Theorem 1.2.4 from [14] (which is a refined version of Theorem 4 of [11]) applied to SK and SȲ_n. A quick calculation reveals that (1/√n)^{ν′/(ν′+1)} = n^{−qν/(2(qν+q+1+ε−p))}; thus up to ε > 0 we get the optimal rate from Theorem 2.1. Note, however, that the smaller we choose ε, the slower will be the convergence to 1 in (2.5).

Now we generalise the above result in several ways. We relax the condition that the component-wise variances E(Y_1 − ŷ, u_j)² decay exactly polynomially. Most importantly, the exponent p is no longer assumed to be known. Moreover, we account for the fact that in practice we can only measure a finite number of components. The component-wise variances E[(Y_1 − ŷ, u_j)²] will be estimated from the multiple measurements and are then used to determine the (now random) rescaling weights d_{j,n}. Consequently, the weighting operator S_n is now also random and depends on the samples Y_1, ..., Y_n. The precise implementation of the modified discrepancy principle (with a priori unknown decay of the component-wise variances) is given in Algorithm 1. Hereby, the factor Σ_{j=1}^{m_n} s²_{j,n} is in essence a normalisation by E‖Y_1 − ŷ‖². The two arguments of the min(·) function can be roughly interpreted as follows: the first one assures that the rescaled measurements still have finite variance (cf. (2.3)), while the second one assures that the rescaled operator is still bounded. We now state the main result, which confirms that the optimal bound from Theorem 2.1 holds with a probability converging to 1 as n → ∞ (up to discretisation and an arbitrarily small ε_2 > 0 in the exponent) using this strategy.
Theorem 2.3. Assume that E[(Y_1 − ŷ, u_j)^4]/(E[(Y_1 − ŷ, u_j)²])² ≤ C for all j ∈ N. Assume that x̂ ∈ X_{ν,ρ} for ν, ρ > 0 and let k_n be the stopping index of the modified discrepancy principle as implemented in Algorithm 1 with 0 < ε_1 < 1 and 0 < ε_2 < p − 1 and Y_1, ..., Y_n. Then there is an L > 0 such that there holds

P( ‖x̂_{k_n}^{(n)} − x̂‖ ≤ L max( n^{−qν/(2(qν+q+1+ε_2−p))}, n^{−(1−ε_1)qν/2} ) ) → 1    (2.6)

as n → ∞.

The proof of Theorem 2.3 is substantially more difficult than that of Theorem 2.2, mostly due to the dependence of the rescaling operator S_n on the (realisations of the) measurements Y_1, ..., Y_n. In particular, the S_n Y_1, ..., S_n Y_n are not independent, and hence we cannot simply apply the results from [11].
Algorithm 1 Modified discrepancy principle with estimated data error
1: Given measurements (Y_i, u_j) with i = 1, ..., n and j = 1, ..., m_n := ⌊n^{1−ε_1}⌋;
2: Estimate the variances s²_{j,n} := 1/(n−1) Σ_{i=1}^n ((Y_i, u_j) − (Ȳ_n, u_j))²;
3: Set d_{j,n} := min( (Σ_{l=1}^{m_n} s²_{l,n})^{1/2} j^{−(1+ε_2)/2} / s_{j,n}, σ_j^{−1} );
4: Choose k_n as the smallest k ≤ m_n with Σ_{j=k+1}^{m_n} d²_{j,n} (Ȳ_n, u_j)² ≤ (1/n) Σ_{j=1}^{m_n} d²_{j,n} s²_{j,n}.

Remark 2.1. Algorithm 1 could be applied in a general setting, e.g. also to severely ill-posed problems. The weights d_{j,n} are defined such that the rescaled measurements have summable component-wise variances while the rescaled operator stays bounded. Note that a chosen ε_2 fulfills the condition ε_2 < p − 1 if lim_{j→∞} lim_{n→∞} d_{j,n} = ∞, and the latter can be checked to verify that ε_2 was chosen sufficiently small. The assumption q > p − 1 is only made for convenience and is not essential. The second argument of the maximum in (2.6) is a discretisation error due to the usage of only finitely many singular vectors. If the latter is negligible, i.e. if ν/(ν+1) < (1 − ε_1)qν, the rate from Theorem 2.3 is better than the deterministic worst-case rate from [11]. The additional assumption sup_{j∈N} E[(Y_1 − ŷ, u_j)^4]/(E[(Y_1 − ŷ, u_j)²])² < ∞ is mild (e.g. it holds under Gaussian noise). In particular, no independence between the components is required. Our assumption of finite variance (E‖Y_1 − ŷ‖² < ∞) excludes a direct application to white noise scenarios. The following example shows how to adapt the approach for Hilbert-Schmidt operators under white noise.
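A schematic implementation may look as follows; this is our reading of the procedure, and the exact form of the weights and of the stopping threshold (a plug-in estimate of the rescaled noise level) should be treated as an assumption for illustration:

```python
import numpy as np

def modified_dp(Y, sigma, eps1=0.5, eps2=0.1):
    """Modified discrepancy principle with estimated data error (schematic).
    Y: (n, m) array of measurement components (Y_i, u_j);
    sigma: singular values of K."""
    n = Y.shape[0]
    m_n = min(Y.shape[1], int(n ** (1 - eps1)))
    j = np.arange(1, m_n + 1)

    Y_bar = Y[:, :m_n].mean(axis=0)
    s2 = Y[:, :m_n].var(axis=0, ddof=1)        # estimated variances s_{j,n}^2

    # min(.): the first argument keeps the rescaled variances summable,
    # the second keeps the rescaled operator bounded.
    d = np.minimum(np.sqrt(s2.sum()) * j ** (-(1 + eps2) / 2.0) / np.sqrt(s2),
                   1.0 / sigma[:m_n])

    noise2 = np.sum(d ** 2 * s2) / n           # estimate of E||S_n(Ybar - y)||^2
    for k in range(m_n + 1):
        if np.sum((d[k:] * Y_bar[k:]) ** 2) <= noise2:
            return k
    return m_n
```

In a toy diagonal model this stopping index settles between the noise-dominated and the bias-dominated regimes, while the plain residual rule with an unscaled threshold would stop later.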
Example 2.1. Consider the equation Ax̂ = ẑ for A : X → Y Hilbert-Schmidt, and assume that there is a q > 1 such that σ_j(A)² ≍ j^{−q}. Assume that the measurements are corrupted by i.i.d. centred Hilbert-space processes Z_1, Z_2, ... (operating on Y), i.e. the Z_i : Y → L²(Ω, A, P) are bounded linear operators from Y to the space of square-integrable real-valued random variables (on some probability space (Ω, A, P)) such that E(Z_1, z) = 0 for all z ∈ Y. Moreover, Z_i has an arbitrary covariance operator Cov_Z : Y → Y, which is the bounded linear operator defined implicitly via the equation E[(Z_1, y)(Z_1, z)] = (Cov_Z y, z) for all y, z ∈ Y. Applying A* to the equation yields the symmetrised problem Kx̂ = ŷ with K := A*A and ŷ := A*ẑ, and the measurements Y_i := A*ẑ + A*Z_i satisfy E‖Y_1 − ŷ‖² = Σ_{j=1}^∞ σ_j(A)² (Cov_Z u_j, u_j) < ∞, since A is Hilbert-Schmidt. In the white noise case Cov_Z = Id we have E(Y_1 − ŷ, v_j)² = σ_j(A)² ≍ j^{−q}, so that p = q > 1 is known. Since we assume to know the singular value decomposition, this allows us to apply Theorem 2.2. It should be noted that for coloured noise no prewhitening step or additional assumptions on the covariance operator are needed (in contrast to e.g. [8]).
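A finite-dimensional sketch of this mechanism (our reading of the example; sizes and q are illustrative assumptions): applying A* turns white noise, whose total variance diverges with the dimension, into measurement errors with summable component-wise variances σ_j(A)² ≍ j^{−q}, so the decay exponent p = q is known.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hilbert-Schmidt toy operator, diagonal in its singular basis:
# sigma_j(A)^2 ≍ j^{-q} with q > 1.
m, n, q = 300, 200, 1.5
j = np.arange(1, m + 1)
sigma_A = j ** (-q / 2.0)

# White noise: i.i.d. standard normal components (Z_i, u_j).
Z = rng.standard_normal((n, m))

# Components of the symmetrised noise: (A* Z_i, v_j) = sigma_j (Z_i, u_j).
AstarZ = sigma_A * Z

var_components = AstarZ.var(axis=0)   # ≈ sigma_j^2 ≍ j^{-q}, so p = q
total_var = np.sum(sigma_A ** 2)      # finite: A is Hilbert-Schmidt
```

The empirical total variance of the symmetrised noise stays bounded as m grows, in contrast to the raw white noise whose squared norm grows like m.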
All in all, the main contribution of this work is to answer the question of optimal adaptivity (in the minimax sense) for the discrepancy principle in statistical inverse problems with multiple measurements, which was left open in the original work [11]. Moreover, in the light of Example 2.1 the results may be compared to classical existing results for statistical inverse problems, which usually use a white noise error model. In particular, they generalise results from [8], [15] and [17], where modifications of the discrepancy principle are applied to the symmetrised equation, in several ways: firstly, the error distribution is arbitrary; secondly, the noise level and the covariance structure need not be known; and thirdly, a self-similarity condition (Assumption 3 in [17] and Assumption 2.4 in [15]) for x̂ is not needed. However, it should be mentioned here that symmetrisation is usually avoided, since the ill-posedness of the symmetrised equation A*Ax = A*y is much worse than that of the original equation Ax = y. Note that this does not contradict the (almost) order-optimality of the methods relying on the discrepancy principle mentioned above, but it may still cause problems in practice. Because of this, under white noise one often relies on other methods which do not depend on the residual, see e.g. [6] for a priori bounds, [19] for the Lepski principle, or [16] for unbiased risk estimation, to name only a few. Finally, in [7], [12] and [13] recent modifications of the discrepancy principle, which are based on discretisation and not on symmetrisation, are investigated in white noise scenarios.

Proofs
In this section we present the proofs of the above statements.

Proof of Theorem 2.1
Note that (ŷ, u_j) = σ_j (x̂, v_j) and E(Ȳ_n − ŷ, u_j)² = E(Y_1 − ŷ, u_j)²/n ≍ j^{−p}/n. Therefore it holds that

E‖x̂_k^{(n)} − x̂‖² = Σ_{j=1}^k σ_j^{−2} E(Ȳ_n − ŷ, u_j)² + Σ_{j>k} (x̂, v_j)² ≍ (1/n) Σ_{j=1}^k j^{q−p} + Σ_{j>k} (x̂, v_j)²,

and taking the supremum over x̂ ∈ X_{ν,ρ}, the second term is of order ρ² k^{−qν}. The right-hand side is minimised by the choices k_n ≍ n^{1/(qν+q−p+1)} for q − p > −1, while for q − p ≤ −1 the first term remains bounded (up to a logarithm) in k. Thus we obtain the claimed rate.
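The balance between the two error terms can be checked numerically. A small script (illustrative exponents, all constants set to one) locates the minimiser of the bound and compares it with the predicted oracle order n^{1/(qν+q−p+1)}:

```python
import numpy as np

q, p, nu = 2.0, 1.5, 1.0   # illustrative exponents with q - p > -1
n = 10 ** 6

def bound(k):
    """Upper bound on the risk: propagated noise plus worst-case squared bias."""
    j = np.arange(1, k + 1)
    return np.sum(j ** (q - p)) / n + k ** (-q * nu)

ks = np.arange(1, 2000)
k_opt = ks[np.argmin([bound(k) for k in ks])]
k_pred = n ** (1.0 / (q * nu + q - p + 1))   # predicted oracle order
```

The numerical minimiser and the predicted order agree up to a constant factor, as the balancing argument says they should.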

Proof of Theorem 2.3
Let m_n := ⌊n^{1−ε_1}⌋ and x̂ = (K*K)^{ν/2} ξ with ‖ξ‖ ≤ ρ. For fixed j ∈ N it holds that s²_{j,n} → E(Y_1 − ŷ, u_j)² (in probability, almost surely and in L²). We denote the deterministic limit (for n → ∞) of the random weights d_{j,n} by

d_j := min( (E‖Y_1 − ŷ‖²)^{1/2} j^{−(1+ε_2)/2} (E(Y_1 − ŷ, u_j)²)^{−1/2}, σ_j^{−1} ).

The weights d_{j,n} and d_j can be interpreted as belonging to weighting operators S_n and S respectively. Moreover, the assumptions on the error distribution of the (Y_1 − ŷ, u_j) imply that S can be seen as a deterministic limit (for n → ∞) of the S_n in a suitable sense. This will ultimately allow us to phrase the increased smoothness relative to the deterministic rescaled limit operator SK instead of the random rescaled operator S_n K. In the following, S_n and S are not used explicitly; it suffices to stick to the weights d_{j,n} and d_j. We start with the following auxiliary proposition, which summarises some properties of the sequence (d_j)_{j∈N}.
Proof of Proposition 3.1. Note that obviously d_j > 0 for all j ∈ N. First, (3.1) is fulfilled for j = 1; for j ≥ 2 we argue as follows. If J = ∞, the statement is proven, since from p > 1 + ε_2 it follows that the first argument of the min(·) is eventually the smaller one.

Now we first show that the true solution has at least smoothness ν′ := qν/(q+1+ε_2−p) (relative to the rescaled limit operator SK). Since ε_2 < p − 1, there holds ν′ > ν. We use (the reverse of) (3.1) together with (3.3) and obtain the required coefficient decay for all j ∈ N. We then express x̂ with respect to the rescaled limit operator SK and obtain the representation (3.6).

The assumption sup_{j∈N} E[(Y_1 − ŷ, u_j)^4]/(E[(Y_1 − ŷ, u_j)²])² < ∞ guarantees that we can estimate the variances E(Y_1 − ŷ, u_j)² uniformly for j = 1, ..., m_n. To see this, we use Theorem 2 of [1], which bounds the moments E|s²_{j,n} − E(Y_1 − ŷ, u_j)²|^k of the sample variance s²_{j,n} from Algorithm 1. With γ := Σ_{j=1}^∞ d_j² E(Y_1 − ŷ, u_j)², the threshold of Algorithm 1 then converges to γ/n in a suitable sense as n → ∞. We will distinguish two cases in the following. In the analysis we will often restrict to certain good events Ω_n which hold with a probability P(Ω_n) → 1 as n → ∞. Moreover, we will repeatedly use Markov's/Chebyshev's inequality and the fact that for i.i.d. real-valued random variables Z_1, ..., Z_n the variance of the sample mean decays like 1/n. Thus, e.g.
for all j, n ∈ N.

Case 1
We first assume that for all k ∈ N there exists j_k ≥ k such that (ŷ, u_{j_k}) ≠ 0. Note that then also (x̂, v_{j_k}) ≠ 0 and (ξ, v_{j_k}) ≠ 0.
Lemma 3.1. Assume that for all k ∈ N there exists j_k ≥ k such that (ŷ, u_{j_k}) ≠ 0. Then k_n → ∞ in probability.

Proof of Lemma 3.1. We first show that there exists (q_n)_{n∈N} such that

P(k_n ≥ q_n) → 1 and q_n → ∞    (3.10)

as n → ∞. For that it suffices to show that lim_{n→∞} P(k_n ≥ k) = 1 for all k ∈ N. By assumption there exists j_k ≥ k such that (ŷ, u_{j_k}) ≠ 0. We define the event Ω_n accordingly. Then, for n ≥ max(j_k, 32γ²/(d_{j_k}(ŷ, u_{j_k}))²), it follows that k_n χ_{Ω_n} ≥ k χ_{Ω_n} by Algorithm 1, and (3.10) follows with lim_{n→∞} P(Ω_n) = 1, which holds because of (3.7), (3.8) and the law of large numbers.

We come to the main proof. Let ε > 0 and let (q_n) be such that q_n → ∞ and P(k_n ≥ q_n) → 1    (3.12) as n → ∞. We decompose the error as usual into a data propagation error, an approximation error and a discretisation error. We first consider the second term (approximation error). With the convention Σ_{j=s}^t = 0 for s > t, a standard application of Hölder's inequality with exponents p = (ν′+1)/ν′ and q = ν′+1, together with (3.6) and the triangle inequality, yields a bound for the approximation error, by the definition of Ω_n in (3.15) and of k_n. Consequently, for the approximation error and the discretisation error there holds the claimed bound as n → ∞, where we used (3.7), (3.8) and Lemma 3.1 for Ω_n given in (3.15).
To finish the proof we need to verify a similar bound for the data propagation error. By definition of the discrepancy principle (Algorithm 1) and of Ω_n in (3.15) there holds the corresponding residual bound, where we used that d_1 σ_1 ≥ d_2 σ_2 ≥ ... by the definition of the d_j. So, as n → ∞, the data propagation error is controlled on Ω_n given in (3.15). Now we show that for all ε > 0 the probability bound (3.19) holds for n large enough. Let j_ε be such that the tail sum Σ_{j≥j_ε} j^{−(1+ε_2)} is sufficiently small (of order εγ) and set C_ε to be the corresponding finite constant. Define the event Ω_{n,ε} accordingly for n large enough, since ε′ < 1/(ν′+1). To prove (3.19) it remains to show that P(Ω_{n,ε}) ≥ 1 − 3ε for n large enough. We apply Markov's inequality and obtain the first estimate for n large enough by the definition of C_ε. Further, by the choice of j_ε, (3.23) holds. Therefore, by (3.18), (3.21) and (3.22) there holds P(Ω_{n,ε}) ≥ 1 − 3ε for n large enough and thus (3.19). Since ε > 0 was arbitrary, it follows that (3.24) holds as n → ∞. Finally, (3.17) and (3.24) together prove Theorem 2.3 for the case that for all k ∈ N there is j_k ≥ k with (ŷ, u_{j_k}) ≠ 0.

Case 2
Now we assume that there exists J ∈ N such that (ŷ, u_j) = 0 for all j ≥ J. We cannot expect a result similar to Lemma 3.1 (since k_n will not converge to ∞ in probability), but the true solution x̂ has arbitrarily large smoothness. Let ε > 0 be such that (d_J σ_J)^{−ε} ≤ 2. We set ν″ := ν′ + ε and use the representation from (3.6). Then P(Ω_n) → 1 as n → ∞ because of (3.7) and (3.8). For n large enough (such that m_n ≥ J) the approximation and discretisation error is controlled, where we used d_J σ_J ≥ 2^{−1/ε} in the sixth step and the definition of the discrepancy principle in the fifth step. It remains to treat the data propagation error. We set b_n := (γ/(8ρ″√n))^{1/(ν″+1)}. Let the deterministic sequence (q_n)_{n∈N} ⊂ N be defined via

q_n := min{ j ≤ m_n : d_j σ_j ≤ b_n }.    (3.29)

If b_n > d_j σ_j for all j = 1, ..., m_n we set q_n := m_n. Note that q_n → ∞ as n → ∞, since b_n → 0. Define the event Ω̄_n accordingly, with Ω_n from (3.26). We claim that

k_n χ_{Ω̄_n} ≤ q_n − 1    (3.31)

as n → ∞. Since (3.31) trivially holds for q_n = m_n, we may assume that d_{q_n} σ_{q_n} ≤ b_n. Then the residual at truncation level q_n − 1 already falls below δ^est_n on Ω̄_n, and the claim (3.31) follows by the definition of k_n in Algorithm 1. It holds that P(Ω̄_n) → 1 (3.32) as n → ∞ because of (3.27).

Numerical experiments

We consider a test problem Ax = b discretised with an (orthonormal) Galerkin basis e_1, e_2, ..., and apply our method to the symmetrised equation with right-hand side ŷ = A*b (cf. Example 2.1). We set the discretisation level to m = 1000 and approximated the singular value decomposition (σ_j, u_j, v_j) of A with the function 'csvd'. We used n = 50, 500, ..., 500000 measurements and compared the classical discrepancy principle to the modified one implemented in Algorithm 1 with ε_1 = 0.5, and with ε_2 = 0.5 (large) and ε_2 = 0.1 (small). We calculated the relative errors for 100 independent runs and visualised the results as box plots in Figure 1. We clearly see that the errors decay faster for the modified discrepancy principle.

Figure 1: Relative errors for (modified) discrepancy principle.
Moreover, in Table 1 we compare the (relative) median error of the plain and the modified discrepancy principle (the red bar in each of the boxes of Figure 1) to the square root of the minimax risk from Theorem 2.1, where we sampled the latter from the same data. We see that the error of the modified discrepancy principle is comparable to the minimax risk for smaller sample sizes. For larger sample sizes the minimax risk is better, which is consistent with the loss of ε_2 in the exponent of (2.6).

Table 1: Sampled median of the relative error of the plain and modified discrepancy principle (dp) and sampled relative minimax risk for different sample sizes.

sample size n | plain dp | modified dp, ε_2 large | modified dp, ε_2 small | oracle (square root of the minimax risk)
5e1 | 4.36e-1 | 4.36e-1 | 4.4e-1  | 4.48e-1
5e2 | 4.17e-1 | 3.75e-1 | 3.72e-1 | 3.69e-1
5e3 | 3.67e-1 | 3.32e-1 | 3.14e-1 | 3.07e-1
5e4 | 3.32e-1 | 2.85e-1 | 2.69e-1 | 2.54e-1
5e5 | 3.05e-1 | 2.41e-1 | 2.32e-1 | 2.09e-1

Concluding remarks
In this work we have presented a modified discrepancy principle which yields (almost) optimal convergence rates for arbitrary unknown error distributions, provided one is able to repeat the measurements. This was achieved by estimating the variances of a single measurement along the singular directions of the operator K, which were then used to rescale the measurements and the operator.
We restricted ourselves to linear mildly ill-posed problems and classical Hölder-type source conditions in Hilbert spaces, but the results can probably be extended to a general degree of ill-posedness and to general source conditions. A major drawback is that the singular value decomposition of the operator needs to be known. It would be interesting to investigate whether the approach could be adapted to settings where the singular value decomposition is not given.