On the discrepancy principle for stochastic gradient descent

Stochastic gradient descent (SGD) is a promising numerical method for solving large-scale inverse problems. However, its theoretical properties remain largely underexplored through the lens of classical regularization theory. In this note, we study the classical discrepancy principle, one of the most popular a posteriori choice rules, as the stopping criterion for SGD, and prove the finite-iteration termination property and the convergence of the iterate in probability as the noise level tends to zero. The theoretical results are complemented with extensive numerical experiments.


Introduction
In this work, we study the following finite-dimensional linear inverse problem:
Ax = y † , (1.1)
where x ∈ R m is the unknown signal of interest, y † ∈ R n is the exact data and A ∈ R n×m is the system matrix.
In practice, we have access only to a corrupted version y δ of the exact data y † = Ax † (with the reference solution x † being any exact solution):
y δ = y † + ξ,
where ξ ∈ R n denotes the noise, with a noise level δ = ‖ξ‖. In the literature, a large number of numerical methods have been proposed for solving linear inverse problems accurately and efficiently (see, e.g., [3,11,7]).
When the size of problem (1.1) is massive, one attractive method is a simple stochastic gradient descent (SGD) [16,2]. In its simplest form, it reads as follows: given an initial guess x δ 1 = x 1 ∈ R m , let
x δ k+1 = x δ k + η k (y δ i k − (a i k , x δ k ))a i k , (1.2)
where η k > 0 is a decreasing stepsize, a i is the i-th row of the matrix A (as a column vector), (•, •) denotes the Euclidean inner product on R m , and the row index i k at the k-th SGD iteration is chosen uniformly (with replacement) from the set {1, ..., n}. Notably, the method operates on only one data point (a i k , y i k ) at a time, and thus it scales directly with the problem size n. This feature makes it especially attractive in the context of massive data.
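For concreteness, the following is a minimal Python sketch of the iteration (1.2). The routine name sgd, the stepsize defaults eta0 and alpha, and stopping after a fixed n_iter are illustrative choices, not part of the method specification.

```python
import numpy as np

def sgd(A, y, n_iter, eta0=1.0, alpha=0.1, rng=None):
    """Run n_iter SGD steps (1.2) for the system A x = y, starting from x_1 = 0."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = A.shape
    x = np.zeros(m)
    for k in range(1, n_iter + 1):
        i = rng.integers(n)                  # row index i_k, uniform with replacement
        eta = eta0 * k ** (-alpha)           # polynomially decaying stepsize eta_k
        x += eta * (y[i] - A[i] @ x) * A[i]  # update using the single data pair (a_i, y_i)
    return x
```

Each step touches only one row of A, so the per-iteration cost is O(m), independent of the number of data points n.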
In fact, SGD and its variants (e.g., mini-batch and accelerated) have been established as the workhorse behind many challenging training tasks in deep learning [1,2], and they are also popular for image reconstruction in computed tomography [5,15]. Despite the apparent simplicity of the method, its mathematical theory through the lens of classical regularization theory is far from complete. In the work [8], the regularizing property of SGD was proved for a polynomially decaying stepsize schedule, when the stopping index k is determined a priori in relation to the noise level δ. Further, a convergence rate in the mean squared norm between the iterate x δ k and the exact solution x † was derived, under a suitable source-type condition on the ground truth x † . These results were recently extended to mildly nonlinear inverse problems, under suitable nonlinearity conditions on the forward map [9]. However, in these works, the convergence rate can only be achieved with knowledge of the smoothness index of x † , which is usually not directly accessible in practice. Therefore, it is of enormous practical importance and theoretical interest to develop a posteriori stopping rules that do not require such knowledge.
For deterministic iterative methods [11], e.g., the Landweber method and the Gauss-Newton method, one popular a posteriori stopping rule is the discrepancy principle, due to Morozov [14]. Specifically, with x δ k being the k-th iterate constructed by an iterative regularization method, the principle determines the stopping index k(δ) by
k(δ) = min{k ∈ N : ‖Ax δ k − y δ ‖ ≤ τ δ}, (1.3)
where the constant τ > 1 is fixed. The use of the discrepancy principle for many deterministic iterative methods is well understood (see the monograph [11] and the references therein), but in the context of stochastic iterative methods, to the best of our knowledge, it has not been explored so far. The goal of this work is to study the basic properties of the discrepancy principle for SGD.

It is worth noting that a direct computation of the residual ‖Ax δ k − y δ ‖ at every SGD iteration is demanding. However, one may compute it not at every SGD iteration but only with a given frequency (e.g., once per epoch, see Section 5), as is done in the popular stochastic variance reduced gradient method [10], for which the residual evaluation is part of the gradient computation. There are also efficient methods to determine ‖Ax δ k − y δ ‖ using the randomized SVD [12], by exploiting the intrinsic low-rank nature of many practical inverse problems. A sketch of the discrepancy principle with such infrequent residual checks is given below.
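The following sketch combines the SGD iteration with the stopping rule (1.3), evaluating the residual only every check_every iterations as just described. The function name discrepancy_index and the default values are illustrative, assuming the per-epoch checking strategy.

```python
import numpy as np

def discrepancy_index(A, y_delta, delta, tau=1.2, check_every=None, max_iter=10**6,
                      eta0=1.0, alpha=0.1, rng=None):
    """Run SGD; return (k, x) at the first checked k with ||A x - y_delta|| <= tau * delta."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = A.shape
    check_every = n if check_every is None else check_every  # default: once per epoch
    x = np.zeros(m)
    for k in range(1, max_iter + 1):
        i = rng.integers(n)
        x += eta0 * k ** (-alpha) * (y_delta[i] - A[i] @ x) * A[i]
        # the residual is evaluated only every check_every steps, so the O(n m) cost
        # of forming A @ x is amortized over many O(m) SGD updates
        if k % check_every == 0 and np.linalg.norm(A @ x - y_delta) <= tau * delta:
            return k, x
    return max_iter, x  # iteration budget exhausted without meeting the criterion
```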
Now we specify the algorithmic parameters for SGD, and state the main results of the work. Throughout, we make the following assumption on the stepsizes and on the regularity of the ground truth solution x † , i.e., the minimum-norm solution defined by x † = arg min x:Ax=y † ‖x‖. (1.4)

Assumption 1.1. (i) The stepsizes are polynomially decaying, i.e., η k = c 0 k −α , k = 1, 2, ..., with an exponent α ∈ (0, 1) and a sufficiently small constant c 0 > 0.
(ii) There is a p > 0 and a w ∈ R m such that x † = B p w.

The stepsize schedule in (i) is commonly known as the polynomially decaying stepsize schedule, and (ii) is the classical power-type source condition, where B = n −1 A T A, imposing a type of smoothness on the solution x † (relative to the system matrix A and the initial guess x 1 ). In the analysis and computation below, x 1 is fixed at 0.

The first theorem gives a finite-iteration termination property of the discrepancy principle, where P denotes the probability with respect to the filtration generated by the random indices (i k ) ∞ k=1 . It can also be viewed as a partial result on optimality: it implies in particular that for p < 1/2, the data propagation error is of optimal order. The proof relies crucially on the observation that the variance component of the mean squared residual contributes only marginally for sufficiently large k.
Theorem 1.1. Let Assumption 1.1 be fulfilled, and k(δ) be determined by the discrepancy principle (1.3). Then for all 0 < r < 1 and τ

The second contribution of this work is the convergence in probability of the SGD iterate x δ k(δ) , with the stopping index k(δ) determined by (1.3). This result has one drawback: in the proof, we have to assume that the stopping index k(δ) is independent of the iterate x δ k(δ) . In practice, this can be achieved by running SGD twice with the same data (y δ , δ): the first run is used to determine k(δ), and the second (independent) run is then stopped at k(δ) (see the sketch following Theorem 1.2). This increases the computational expense by a factor of 2. However, the numerical results in Section 5 show that one can use the iterate from the first run without compromising the accuracy.

Theorem 1.2. Let Assumption 1.1 be fulfilled, and k(δ) be determined by the discrepancy principle (1.3), with the stopped iterate x δ k(δ) computed by an independent SGD run with the same data (y δ , δ). Then for all ε > 0 there holds
lim δ→0 + P(‖x δ k(δ) − x † ‖ ≥ ε) = 0.
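The two-run procedure underlying Theorem 1.2 can be sketched as follows, building on the sgd and discrepancy_index routines above; the function name two_run_dp and the seeds are illustrative.

```python
import numpy as np

def two_run_dp(A, y_delta, delta, eta0=1.0, alpha=0.1):
    """First run fixes k(delta); an independent second run is stopped at that index."""
    k_delta, x_dp = discrepancy_index(A, y_delta, delta, eta0=eta0, alpha=alpha,
                                      rng=np.random.default_rng(1))
    x_idp = sgd(A, y_delta, n_iter=k_delta, eta0=eta0, alpha=alpha,
                rng=np.random.default_rng(2))  # fresh indices => independent of k(delta)
    return k_delta, x_dp, x_idp  # x_dp is the "DP" iterate, x_idp the "i-DP" iterate
```

The second run uses a fresh stream of random indices, so the returned iterate is independent of the stopping index, as required by the theorem; the terminology DP/i-DP is taken up again in Section 5.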
In sum, Theorems 1.1 and 1.2 confirm that the discrepancy principle is a valid a posteriori stopping rule for SGD. However, they do not give a rate of convergence, which remains an open problem. Numerically, we observe that the convergence rate obtained with the discrepancy principle is nearly order-optimal for low-regularity solutions, as with the a priori rule of [8] in that regime, and the performance is competitive with the standard Landweber method. Thus, the method is especially attractive for finding a low-accuracy solution. However, for very smooth solutions (i.e., large p), it exhibits a highly undesirable saturation phenomenon, due to the presence of a significant variance component (when compared with the approximation error) under the setting of Assumption 1.1.

The rest of the paper is organized as follows. In Sections 2 and 3, we prove Theorems 1.1 and 1.2, respectively. Several auxiliary results needed for the proof of Theorem 1.1 are given in Section 4. Finally, several numerical experiments are presented in Section 5 to complement the theoretical analysis. We conclude with some useful notation. We denote the SGD iterate for exact data y † by x k , and that for noisy data y δ by x δ k . The expectation E[•] is with respect to the filtration F k generated by the random indices {i 1 , . . . , i k , . . .}.
The proof of Theorem 1.1

In this section, we give the proof of Theorem 1.1. First, we collect several preliminary facts. By the construction in (1.2), since x δ k is measurable with respect to F k−1 , there holds
E[x δ k+1 | F k−1 ] = x δ k + η k n −1 A T (y δ − Ax δ k ).
Thus, by the law of total expectation, the sequence (E[x δ k ]) k∈N satisfies the following recursion:
E[x δ k+1 ] = E[x δ k ] + η k Ā T (ȳ δ − Ā E[x δ k ]), (2.1)
with Ā = n −1/2 A and ȳ δ = n −1/2 y δ . This is exactly the classical Landweber method [13] (but with diminishing stepsizes) applied to the rescaled linear system Āx = ȳ δ . For the Landweber method, the discrepancy principle (1.3), e.g., its regularizing property and optimal convergence rates, has been thoroughly studied for both linear and nonlinear inverse problems (see, e.g., [3, Chapter 6] and [11]). The key insight for the analysis below is the following empirical observation: for suitably large k, the variance component E[‖A(x δ k − E[x δ k ])‖ 2 ] is typically much smaller than δ 2 , as confirmed by the numerical experiments in Section 5.2. This fact allows us to transfer the results for the Landweber method to SGD. The relation (2.1) can also be verified numerically, as in the sketch below.
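The following sketch compares a Monte Carlo average of SGD iterates with the deterministic Landweber iteration on the rescaled system, illustrating (2.1); all sizes and parameter values are illustrative, and the discrepancy between the two is only of Monte Carlo size.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K, n_runs = 20, 10, 200, 2000
A = rng.standard_normal((n, m))
y = rng.standard_normal(n)
eta = lambda k: 0.5 * k ** (-0.1) / np.max(np.sum(A**2, axis=1))  # safe stepsize

# Monte Carlo approximation of E[x^delta_k] by averaging independent SGD runs
avg = np.zeros(m)
for _ in range(n_runs):
    x = np.zeros(m)
    for k in range(1, K + 1):
        i = rng.integers(n)
        x += eta(k) * (y[i] - A[i] @ x) * A[i]
    avg += x / n_runs

# Landweber iteration for the rescaled system (deterministic), cf. (2.1):
# eta_k * Abar^T (ybar - Abar z) = eta_k * n^{-1} A^T (y - A z)
z = np.zeros(m)
for k in range(1, K + 1):
    z += eta(k) * A.T @ (y - A @ z) / n

print(np.linalg.norm(avg - z))  # small, up to Monte Carlo error
```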
The proof of Theorem 1.1 employs two preliminary results, whose lengthy proofs are deferred to Section 4 below. The first result gives an upper bound on the following stopping index k * (δ), defined for any τ * > 1 by
k * (δ) = min{k ∈ N : ‖A E[x δ k ] − y δ ‖ ≤ τ * δ}. (2.2)
Clearly, k * (δ) is the stopping index of the classical discrepancy principle applied to the sequence (E[x δ k ]) k∈N , which, in view of the relation (2.1), is exactly the Landweber method.
Proposition 2.1. Let Assumption 1.1 be fulfilled. Then for k * (δ) defined in (2.2), there holds

The second result gives an upper bound on the variance E[‖A(x δ k − E[x δ k ])‖ 2 ]. It implies that the variance contributes only marginally to the mean squared residual E[‖Ax δ k(δ) − y δ ‖ 2 ], and consequently the squared residual ‖Ax δ k(δ) − y δ ‖ 2 of an individual realization of SGD may be used instead for determining an appropriate stopping index.
Remark 2.1. The condition r < 1 is related to an apparent saturation phenomenon of SGD: for any p > 1/2, the SGD iterate x δ k with a priori stopping can only achieve a convergence rate comparable with that for p = 1/2, at least with the current analysis [8]. It remains unclear whether this is an intrinsic drawback of SGD or a limitation of the proof technique.
Remark 2.2. In practice, we prefer computing the residual only with a frequency ωn ∈ N, i.e., checking the condition in (1.3) only at iterations k that are multiples of ωn. The upper bound on the stopping index remains largely valid for this variant of the discrepancy principle (1.3).
The proof of Theorem 1.2

In this section, we prove Theorem 1.2. It employs the following proposition, which states that potential early stopping does not actually cause any problem.
Proposition 3.1. For all ε > 0, there is a sequence (k − δ ) δ>0 with k − δ → ∞ as δ → 0 + such that, for all sufficiently small δ > 0,
P(k(δ) ≥ k − δ ) ≥ 1 − ε. (3.1)

In order to show this, we need two estimates, (3.2) and (3.3), for the iterated noise, with the conventions Σ 0 j=1 = 0 and Π 0 j=1 = 1. We prove the estimates (3.2) and (3.3) by mathematical induction. Note that a i = A T e i . The estimate (3.2) follows by the triangle inequality and the defining relation (1.2). For the estimate (3.3), we have x δ 1 − x 1 = 0, and the claim then follows using the estimate (3.2). Now, for each fixed K, since there are only finitely many different realisations of the first K SGD iterates, there is a (deterministic) η > 0, depending on K, such that the minimum of the nonvanishing residuals over these realisations is at least η, where without loss of generality we have assumed y † ≠ 0. Therefore, using the estimates (3.2) and (3.4), for any δ small enough the discrepancy condition cannot be met within the first K iterations, and by the definition of the discrepancy principle in (1.3) this implies k(δ) > K. The proof of (3.1) is concluded by letting δ → 0 + , where we have used the estimate (3.3). This completes the proof of the proposition.
Now we can state the proof of Theorem 1.2.
Proof of Theorem 1.2. Fix ε > 0. Proposition 3.1 and Theorem 1.1 guarantee the existence of two sequences (k − δ ) δ>0 and (k + δ ) δ>0 with k − δ → ∞. Consequently, for δ > 0 small enough, there holds P(k − δ ≤ k(δ) ≤ k + δ ) ≥ 1 − ε. In view of Theorem 1.1, it remains to show the convergence of the stopped iterate on this event. To this end, let Ω δ := {k − δ ≤ k(δ) ≤ k + δ }, and split the error into three parts in the customary way: the approximation error, the data propagation error and the stochastic error. Specifically, by the triangle inequality, there are constants c 1 and c 2 bounding the first two parts, where we have used [8, Theorem 3.2] and Lemma 4.1 below in the third line. These two terms clearly tend to 0 as δ → 0 + (since k − δ → ∞ and δ(k + δ ) (1−α)/2 → 0, in view of Theorem 1.1). By Markov's inequality [4, p. 242] and the assumed independence between k(δ) and x δ k(δ) , the stochastic error is controlled in probability. Now Jensen's inequality and Proposition 4.1 below (with s = 0, γ < min(α, 1 − α) and β < 1 − α) show that it tends to 0 as δ → 0 + . This completes the proof of the theorem.
The proofs of Propositions 2.1 and 2.2

In this part, we prove Propositions 2.1 and 2.2, which are used in the proofs of Theorems 1.1 and 1.2. We shall frequently use the following result from [8, Theorem 3.1].

Lemma 4.1. Let Assumption 1.1 be fulfilled. Then for s ∈ {0, 1/2} and a constant c p,s depending on p, s and ‖w‖, there holds

The proof of Proposition 2.1
Proof. We may assume k * (δ) > 2. By the definition of k * (δ) and the triangle inequality, we obtain the estimate (4.1). Combining (4.1) with (4.2) immediately implies the desired assertion. It remains to show the claim (4.2). To this end, we employ the filter of the Landweber method: the relation (2.1) implies that E[x δ k ] satisfies the Landweber recursion. Using this yields the claim (4.2), and consequently, by the choice of c 0 , the desired bound follows. This completes the proof of the proposition.

The proof of Proposition 2.2
The proof of Proposition 2.2 employs several technical estimates from [8], stated for s ∈ {0, 1/2} and with the constants c s and c p given below.
Proof. By [8, Theorem 3.3] and the bias-variance decomposition, the left-hand side is bounded by the stated quantity. Now, by the triangle inequality and (4.3), for j ≥ 2 the summands are bounded using Lemma 4.1. Thus the assertion follows, which completes the proof of the lemma with c s = n 2s ‖A‖ 4(1/2−s) .
Remark 4.1. The factor n in the estimate is due to the variance inflation caused by using stochastic gradients in place of the full gradient in SGD. This factor can be reduced by suitable variance reduction techniques, e.g., mini-batching and the stochastic variance reduced gradient [10]; see the sketch following this remark. Note that with [8, Theorems 3.1 and 3.2] and s = 0, Proposition 4.1 gives an improved (regarding the exponents) a priori bound for the mean squared error E[‖x δ k − x † ‖ 2 ]. Last, using Lemma 4.4 and Proposition 4.1, we can prove Proposition 2.2.
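A minimal sketch of the mini-batch variant mentioned in Remark 4.1; the batch size b is an illustrative parameter, and averaging the stochastic gradient over b rows divides the per-step gradient variance by b.

```python
import numpy as np

def minibatch_sgd_step(A, y, x, k, b=16, eta0=1.0, alpha=0.1, rng=None):
    """One mini-batch SGD step: average the stochastic gradient over b sampled rows."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    idx = rng.integers(n, size=b)                  # b row indices, uniform with replacement
    grad = (A[idx].T @ (A[idx] @ x - y[idx])) / b  # averaged stochastic gradient
    return x - eta0 * k ** (-alpha) * grad
```

For b = 1 this reduces to the plain update (1.2); larger b trades a higher per-step cost for a smaller variance component.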

Numerical experiments and discussions
Now we provide numerical experiments to complement the theoretical analysis. Three model examples, i.e., phillips (mildly ill-posed, smooth), gravity (severely ill-posed, medium smooth) and shaw (severely ill-posed, nonsmooth), are taken from the open-source MATLAB package Regutools [6], available at http://people.compute.dtu.dk/pcha/Regutools/ (last accessed on April 14, 2020). The problems cover a variety of settings, e.g., different solution smoothness and degrees of ill-posedness. These examples are discretizations of Fredholm/Volterra integral equations of the first kind, by means of either the Galerkin approximation with piecewise constant basis functions or quadrature rules. All the examples are discretized into a linear system of size n = m = 1000. In addition, we generate a synthetic example, termed smoothed-phillips, whose exact solution x † is first generated by x † = A T AA T ȳ † and then normalized to have unit maximum, i.e., x † = x † /‖x † ‖ ∞ , where A is the system matrix and ȳ † the exact data from phillips; the corresponding exact data is formed by y † = Ax † . By its very construction, the solution x † satisfies Assumption 1.1(ii) with an exponent p > 3/2, and thus it is very smooth. Throughout, the noisy data y δ is generated by adding componentwise Gaussian noise scaled by the relative noise level, where the i.i.d. random variables ξ i follow the standard Gaussian distribution (with zero mean and unit variance) and δ > 0 denotes the relative noise level (by slightly abusing the notation). The constant c 0 in the stepsize schedule in Assumption 1.1(i) is set to (max i ‖a i ‖ 2 ) −1 , the exponent α is taken from the set {0.1, 0.3, 0.5}, and unless otherwise stated, the stopping criterion is tested every 100 SGD iterations (see Remarks 2.2 and 3.1). SGD is always initialized with x 1 = 0, and the maximum number of epochs is fixed at 5000, where one epoch refers to n SGD iterations. The parameter τ in the discrepancy principle (1.3) is fixed at τ = 1.2. All the statistical quantities presented below are computed from 100 independent runs. A sketch of this protocol on a generic test problem is given below.
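The following sketch reproduces the experimental protocol on a generic smoothing-kernel test problem, since the Regutools examples are MATLAB routines; the Gaussian kernel, the smooth solution, and the particular way the noise is scaled are illustrative stand-ins, not the paper's exact test problems or noise model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = m = 1000
t = np.linspace(0, 1, n)
A = np.exp(-(t[:, None] - t[None, :])**2 / 0.01) / n  # smoothing kernel => ill-posed
x_true = np.sin(np.pi * t)                            # a smooth exact solution
y_true = A @ x_true

rel_noise = 1e-2
xi = rng.standard_normal(n)
y_delta = y_true + rel_noise * np.abs(y_true).max() * xi  # one way to realize a
delta = np.linalg.norm(y_delta - y_true)                  # "relative" noise level

c0 = 1.0 / np.max(np.sum(A**2, axis=1))  # (max_i ||a_i||^2)^{-1}, as in the text
# stopping via (1.3), residual checked every 100 iterations as in the experiments:
# k_delta, x_dp = discrepancy_index(A, y_delta, delta, tau=1.2, check_every=100,
#                                   eta0=c0, alpha=0.1, rng=rng)
```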
The example shaw is challenging for numerical recovery, since its solution is far less smooth; at the low noise level δ = 1e-3, the discrepancy principle (2.2) cannot be reached even after 5000 Landweber iterations, see Table 3.
A similar behavior is also observed for SGD with α = 0.3 and α = 0.5. Nonetheless, with α = 0.1, the discrepancy principle (1.3) can be reached by SGD after a few hundred epochs, clearly showing the surprisingly beneficial effect of the SGD noise for low-regularity solutions.
Next we examine more closely the performance of individual samples. The boxplots are shown in Fig. 1 for the examples in two different scenarios, i.e., fixed α and fixed δ. On each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively; the whiskers extend to the most extreme data points not considered outliers, and the outliers are plotted individually with the '+' symbol. It is observed that for a fixed α, on average the error ‖x δ k(δ) − x † ‖ 2 increases with the noise level δ samplewise, and its distribution broadens accordingly. However, the number of iterations required to fulfill the discrepancy principle (1.3) decreases dramatically as the noise level δ increases, concurring with the preceding observation that SGD is especially efficient for data with high noise levels. Meanwhile, with the noise level δ fixed, the value of α does not change the results much overall. However, a larger α can enlarge the percentile box and produce more outliers, as shown by the results for gravity in Fig. 1, and thus give less accurate results. This observation is counter-intuitive in that a smaller variance does not immediately lead to better accuracy. It might be related to the delicate interplay between the total error and the various problem and algorithmic parameters, e.g., α and p. Further, the outliers in the boxplots mostly lie above the box. These observations are typical for all the examples.

How influential is the variance?
Now we examine more closely the dynamics of the SGD iteration via the bias-variance decomposition of the error:
E[‖x δ k − x † ‖ 2 ] = ‖E[x δ k ] − x † ‖ 2 + E[‖x δ k − E[x δ k ]‖ 2 ].
In Fig. 2, we display the dynamics of the mean squared error E[‖x δ k − x † ‖ 2 ] and the mean squared residual E[‖Ax δ k − y δ ‖ 2 ], together with their variance components, for the examples at two different relative noise levels, i.e., δ = 5e-3 and δ = 5e-2. Each time, SGD is run for 100 epochs (i.e., 1e5 SGD iterations), and the results are recorded every 50 SGD iterations, starting from the 50th iteration. The decomposition can be estimated empirically as in the sketch below.
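A sketch of estimating the bias-variance decomposition empirically: run SGD repeatedly, record the iterates on a grid of iterations, and split the mean squared error into squared bias and variance. The function name and the sizes are illustrative; the Monte Carlo mean over n_runs plays the role of E[x δ k].

```python
import numpy as np

def bias_variance(A, y_delta, x_true, ks, n_runs=100, eta0=1.0, alpha=0.1, seed=0):
    """ks: increasing snapshot iterations; returns (squared bias, variance) on ks."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    K = max(ks)
    snapshots = np.zeros((n_runs, len(ks), m))
    for r in range(n_runs):
        x = np.zeros(m)
        j = 0
        for k in range(1, K + 1):
            i = rng.integers(n)
            x += eta0 * k ** (-alpha) * (y_delta[i] - A[i] @ x) * A[i]
            if j < len(ks) and k == ks[j]:
                snapshots[r, j] = x
                j += 1
    mean_x = snapshots.mean(axis=0)                                 # approximates E[x_k]
    bias2 = np.sum((mean_x - x_true)**2, axis=1)                    # ||E[x_k] - x^dag||^2
    var = np.mean(np.sum((snapshots - mean_x)**2, axis=2), axis=0)  # E||x_k - E[x_k]||^2
    return bias2, var  # bias2 + var approximates E||x_k - x^dag||^2
```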
In the plots, we indicate the true noise ‖y δ − y † ‖ 2 , also denoted by δ 2 . It is observed that both E[‖x δ k − x † ‖ 2 ] and E[‖Ax δ k − y δ ‖ 2 ] decay steadily at an algebraic rate up to an index comparable to the stopping index k * (δ) for the Landweber method (by the discrepancy principle (2.2)). Beyond the critical threshold k * (δ), the error E[‖x δ k − x † ‖ 2 ] exhibits a semiconvergence behavior in that it starts to increase, whereas the residual E[‖Ax δ k − y δ ‖ 2 ] nearly levels off at a value comparable with the noise level δ 2 (actually it oscillates slightly, since the SGD iterate decreases the residual only on average). This is typical of iterative regularization methods for inverse problems, since for the later iterates the noise becomes the dominant driving force. Proposition 4.1 with s = 1/2 indicates that a similar behavior holds also for the variance components (up to slightly beyond k * (δ)). Actually, the residual variance E[‖A(x δ k − E[x δ k ])‖ 2 ] decays algebraically (upon ignoring the δ term), which matches well the empirical rate in the plot. For the later iterates, as suggested by the δ term in Proposition 4.1, the decay is roughly O(k −α ). Likewise, the error variances E[‖x δ k − E[x δ k ]‖ 2 ] in the first and last columns are largely comparable, despite the drastic difference in the smoothness of the exact solution x † . Thus, the decay estimate in Proposition 4.1 is actually quite sharp, partially explaining the saturation phenomenon observed earlier. This behavior is consistently observed for all three α values. It is worth noting that for smoothed-phillips, the curves for E[‖x δ k − x † ‖ 2 ] and E[‖x δ k − E[x δ k ]‖ 2 ] nearly overlay each other, i.e., the bias component is negligible after the initial 50 iterations, due to the high smoothness of the true solution, clearly indicating the saturation. For the other three examples, empirically, the variance components are of smaller order right after the initial 50 iterations. In particular, as stated in Proposition 2.2, E[‖A(x δ k − E[x δ k ])‖ 2 ] contributes very little to the mean squared residual E[‖Ax δ k − y δ ‖ 2 ] in the neighborhood of k * (δ). This holds for all three values of the exponent α in the stepsize schedule. The observations hold also for individual realizations; see Fig. 3 for the corresponding plots. The overall behavior of the curves in Fig. 3 is fairly similar to that in Fig. 2, except that the residual and error curves exhibit pronounced oscillations due to the randomness of the row index selection. Nonetheless, in the neighborhood of k * (δ), the variance components remain much smaller in magnitude. This observation provides the key insight for the analysis in Section 2.

Independent run
The convergence analysis in Theorem 1.2 requires an SGD iterate x δ k(δ) independent of the stopping index k(δ) determined by the discrepancy principle (1.3). In practice, this can be achieved by an independent run of SGD, at the expense of a slight increase in computational effort. Now we examine the impact of this choice; we denote by DP the SGD iterate used in (1.3), and by i-DP that from an independent SGD run. The relevant numerical results are presented in Tables 5-7, where the numbers outside and inside the brackets denote e sgd and std(e sgd ), respectively. It is observed that DP gives only slightly better results in terms of the mean, but its standard deviation std(e sgd ) is generally much smaller than that of i-DP. Nonetheless, both the mean e sgd and the standard deviation std(e sgd ) of i-DP decrease steadily as the noise level δ tends to 0, confirming the convergence result in Theorem 1.2.

Figure 1: Box plots for the error ‖x δ k(δ) − x † ‖ 2 and the stopping index k(δ) by SGD. The first two columns are obtained by SGD with α = 0.3, whereas the last two columns are for the noise level δ = 1e-2. The rows from top to bottom refer to phillips, gravity, shaw and smoothed-phillips, respectively.

Figure 2: The decay of the mean squared error E[‖x δ k − x † ‖ 2 ] and residual E[‖Ax δ k − y δ ‖ 2 ] and their variance components E[‖x δ k − E[x δ k ]‖ 2 ] and E[‖A(x δ k − E[x δ k ])‖ 2 ] versus the SGD iteration number k. The solid and dashed curves denote the mean squared quantity and the variance component, respectively, and the black curve indicates the discrepancy δ 2 = ‖y δ − y † ‖ 2 . The first two columns are for the noise level δ = 5e-3 and the last two columns are for the noise level δ = 5e-2. The rows from top to bottom refer to phillips, gravity, shaw and smoothed-phillips, respectively.

Figure 3: The decay of the squared error ‖x δ k − x † ‖ 2 and residual ‖Ax δ k − y δ ‖ 2 and their variance components E[‖x δ k − E[x δ k ]‖ 2 ] and E[‖A(x δ k − E[x δ k ])‖ 2 ] versus the SGD iteration number k. The solid and dashed curves denote the squared quantity and the variance components, respectively, and the black curve indicates the discrepancy δ 2 = ‖y δ − y † ‖ 2 . The first two columns are for the noise level δ = 5e-3 and the last two columns are for the noise level δ = 5e-2. The rows from top to bottom refer to phillips, gravity, shaw and smoothed-phillips, respectively.

Table 5: Comparison between DP and i-DP for phillips.

Table 1: Comparison between SGD and LM for phillips.

Table 2: Comparison between SGD and LM for gravity.

Table 3: Comparison between SGD and LM for shaw.

Table 4: Comparison between SGD and LM for smoothed-phillips.