Linearly convergent adjoint-free solution of least squares problems by random descent

We consider the problem of solving linear least squares problems in a framework where only evaluations of the linear map are possible. We derive randomized methods that do not need any matrix operations other than forward evaluations; in particular, no evaluation of the adjoint map is needed. Our method is motivated by the simple observation that one can obtain an unbiased estimate of the application of the adjoint. We show convergence of the method and then derive a more efficient method that uses an exact linesearch. This method, called random descent, resembles known methods in other contexts and has the randomized coordinate descent method as a special case. We provide a convergence analysis of the random descent method, emphasizing the dependence on the underlying distribution of the random vectors. Furthermore, we investigate the applicability of the method in the context of ill-posed inverse problems and show that the method can have beneficial properties when the unknown solution is rough. We illustrate the theoretical findings in numerical examples. One particular result is that the random descent method actually outperforms established transpose-free methods (TFQMR and CGS) in some examples.

with a linear map A from R^d to R^m and b ∈ R^m. We revisit this classical problem under the following restricting assumptions:
• A is not given as a matrix; only an implementation is available that takes as input a vector v ∈ R^d and returns the output Av ∈ R^m.
• It is not feasible to store many vectors of either size d or m due to memory constraints (in particular, we assume that it is not possible to build large parts of the full matrix by computing Ae_i for all standard basis vectors e_i).
• The adjoint linear map (i.e. the transposed matrix A^T) cannot be evaluated. (One can, in principle, evaluate the adjoint map by applying automatic differentiation to the map x ↦ ⟨y, Ax⟩. However, this requires that the programming language allows for automatic differentiation and that the program is written in a suitable way, which is not always the case.)
We do not make any structural assumptions on the linear map A. In particular, we consider the square, over-, and underdetermined cases and the cases of full-rank and rank-deficient maps. If the evaluation of A^T is not possible, one cannot immediately calculate the gradient of the least squares functional, as this also involves the application of the adjoint linear map.
In this paper we derive randomized methods which provably converge to a minimizer of the least squares functional (in expectation) under the restrictive assumptions formulated above. The idea behind these algorithms is a simple observation: there is a simple way to get an unbiased sample of the gradient A^T(Av − b) which only uses evaluations of A:

Lemma 1.1. Let x ∈ R^d be a random vector such that E(xx^T) = I_d. Then it holds that
  E(⟨Av − b, Ax⟩x) = A^T(Av − b),
in other words, ⟨Av − b, Ax⟩x is an unbiased estimator for the gradient of (1/2)∥Av − b∥^2.

Proof. We rewrite ⟨Av − b, Ax⟩x = x⟨Ax, Av − b⟩ = xx^T A^T(Av − b) and see that the result follows from linearity of the expectation.

This observation motivates the use of the stochastic gradient method for the minimization of the least squares functional (1.1), which we state in Algorithm 1.1.
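Lemma 1.1 is easy to check numerically. The following sketch (a small dense A chosen only for illustration, not part of the paper's experiments) averages the estimator ⟨Av − b, Ax⟩x over many Gaussian directions and compares the result with the true gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 7, 5
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)
v = rng.standard_normal(d)

grad = A.T @ (A @ v - b)  # true gradient of 0.5*||Av - b||^2

# Monte Carlo average of the estimator <Av - b, Ax> x with x ~ N(0, I_d).
# Note <Av - b, Ax> = <A^T(Av - b), x>, which lets us batch the computation.
N = 200_000
X = rng.standard_normal((N, d))   # rows are isotropic directions x
coeff = X @ grad                  # the scalar coefficients <Av - b, Ax>
est = (coeff[:, None] * X).mean(axis=0)

rel_err = np.linalg.norm(est - grad) / np.linalg.norm(grad)
print(rel_err)  # small (Monte Carlo error of order 1/sqrt(N))
```

The agreement improves like 1/√N, as expected for a Monte Carlo average of an unbiased estimator.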
We will analyze the convergence of this algorithm in Section 3.1.1 and in particular treat the question of suitable choices of the stepsizes τ_k. One observes that the SGDAS method takes a step in a fairly random direction x (a simple observation shows that the coefficient ⟨Av_k − b, Ax⟩ turns this direction into a descent direction). This motivates considering an even simpler method where we omit the coefficient and let the choice of the stepsize do the work of guaranteeing descent. This method is given in Algorithm 1.2.

Algorithm 1.2 Random descent (RD)
Initialize v_0 = 0 ∈ R^d, k = 0
while not stopped do
  obtain random vector x ∈ R^d
  update v_{k+1} = v_k + τ_k x, where the stepsize solves min_τ (1/2)∥A(v_k + τx) − b∥^2
  update k ← k + 1
end while

1.2. Related work. Methods for the solution of linear least squares problems or linear equations that only need evaluations of A are known as "transpose-free methods". An early such method for the case of general square linear systems of equations was proposed by Freund in [8] as a generalization of the quasi-minimal residual method, named the transpose-free quasi-minimal residual method (TFQMR). Slightly later, Brezinski and Redivo-Zaglia [1] and Chan et al. [2] proposed methods of Lanczos type which led to the conjugate gradient squared method (CGS) proposed by Sonneveld [16].
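For concreteness, here is a minimal NumPy sketch of Algorithm 1.2 (variable names are ours; the linear map is only accessed through a forward-evaluation callback, and the residual is carried along so that no extra evaluations of A are needed):

```python
import numpy as np

def random_descent(apply_A, b, d, n_iter=20000, rng=None):
    """Algorithm 1.2: random descent with exact linesearch.

    apply_A: callable v -> A @ v, the only access to the linear map.
    """
    rng = np.random.default_rng(rng)
    v = np.zeros(d)
    r = apply_A(v) - b              # residual A v - b
    for _ in range(n_iter):
        x = rng.standard_normal(d)  # isotropic random direction
        Ax = apply_A(x)             # one forward evaluation per step
        nAx2 = Ax @ Ax
        if nAx2 == 0.0:
            continue                # Ax = 0: objective does not depend on tau
        tau = -(r @ Ax) / nAx2      # exact linesearch along x
        v += tau * x
        r += tau * Ax               # cheap residual update, no call to A
    return v

# Example on a small consistent system:
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 6))
b = A @ rng.standard_normal(6)
v = random_descent(lambda u: A @ u, b, d=6, rng=2)
print(np.linalg.norm(A @ v - b) / np.linalg.norm(b))  # relative residual, small
```

With the residual carried along, only one forward evaluation per iteration is needed; recomputing Av_k instead would give the two evaluations counted in the text.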
The methods mentioned above are available in current software (e.g. in SciPy) and are deterministic. Early randomized methods, again for square linear systems, are based on the idea of reformulating the system Av = b as v = Gb + (I − GA)v for some square matrix G and then considering the solution expressed by the Neumann series, which converges to the solution if the spectral radius of I − GA is smaller than one. An approximate solution is then obtained by using a specific random walk to approximate the truncated series, see [7,3,10,5,21].
Gradient based methods, however, are usually not applicable since the evaluation of the gradient of ∥Av − b∥^2 needs the evaluation of A^T. One exception are coordinate descent methods [12], which will be a special case of the random descent method. Another related method is the random search method (see [15, Section 3.4]). This is a derivative-free method for the minimization of a convex objective Φ: R^d → R and the iterates are
  v_{k+1} = v_k − γ_k (Φ(v_k + α_k u) − Φ(v_k))/α_k · u
for stepsizes γ_k and discretization lengths α_k, where u is sampled uniformly from the unit sphere. As discussed in [15], this method converges if Φ has a Lipschitz continuous gradient, γ_k is small enough and α_k → 0. Applied to the objective Φ(v) = (1/2)∥Av − b∥^2 we obtain
  v_{k+1} = v_k − γ_k (⟨Av_k − b, Au⟩ + (α_k/2)∥Au∥^2) u,
and if we set α_k = 0 in the final expression we obtain the update of Algorithm 1.1 (but with u being uniformly distributed on the unit sphere). While convergence of the random search method is known, we are not aware of known rates for this method and, to the best of our knowledge, only the case of sampling from the unit sphere has been analyzed. An extension of random search with exact and inexact linesearch (similar to Algorithm 1.2) has been studied in [18]. There, the authors obtain convergence rates in the strongly convex case for directions x uniformly distributed on the unit sphere and for random coordinate vectors. Later, [14] studied a related method based on smoothing: for µ > 0 the authors consider
  Φ_µ(v) = E(Φ(v + µu)),
which amounts to a convolution of Φ with the scaled probability density of u (in [14] the authors consider normally distributed search directions). It is observed that
  E((Φ(v + µu) − Φ(v))/µ · u) = ∇Φ_µ(v),
i.e. the expectation of the random search step is a gradient step for a smoothed function. The authors derive convergence rates and also an accelerated method.
Our method can also be derived from the framework of Gower and Richtárik [9]: their class of methods reads as
(1.2)  v_{k+1} = v_k − B^{-1}A^T S(S^T A B^{-1} A^T S)^† S^T(Av_k − b),
where S ∈ R^{m×l} is a random sketching matrix and B ∈ R^{d×d} is a positive definite matrix. If A has linearly independent columns, one can set B = A^T A; using the sketch S = Ax ∈ R^m with x ∈ R^d a random vector, one ends up with the random descent method.
Finally, note that SGDAS and RD are fundamentally different from the randomized Kaczmarz method [19] (which can also be obtained from (1.2) by choosing B = I_d and S = e_k a random coordinate vector). The iterates of the Kaczmarz iteration always operate in the range of A^T, while SGDAS and RD operate in the full space.

1.3. Contribution.
• We derive transpose-free methods that minimize the linear least squares functional using only applications of A (i.e. no applications of A^T are made and no other data, such as knowledge of ∥A∥, is used). These methods can be seen as special cases of previously known random search methods, but we obtain more refined convergence results.
• We illustrate that these adjoint-free methods can outperform established methods for least squares problems such as TFQMR and CGS.
• We give a detailed analysis of the random descent method (Algorithm 1.2) for a class of distributions of the search direction and obtain results on the influence of the distribution of the search direction on the rate.
• Without any assumptions on A we show that the residual of the normal equations converges (in expectation) at a sublinear rate.
• We analyze the semi-convergence property of the method in the case of inconsistent problems and illustrate that random descent may be beneficial (in comparison to other iterative methods like the Landweber method) for the solution of linear ill-posed inverse problems when the solution is rough.
2. Notation and basic results. We use ∥x∥ to denote the Euclidean norm and write ρ(M) for the spectral radius of a square matrix M, i.e. the maximum magnitude of the eigenvalues of M. By λ_max(M) and λ_min(M) we denote the largest and smallest eigenvalue of a real symmetric matrix M, respectively, and σ_min(A) is the smallest positive singular value of a matrix A. The range of a matrix A is denoted by rg(A) and the kernel by ker(A). Moreover, E denotes the expectation over all involved randomness (if not stated otherwise). By A ≽ B we denote the Loewner order on symmetric matrices, i.e. A ≽ B if A − B is positive semi-definite.
3. Convergence results. We start our convergence analysis of SGDAS and RD by stating the standing assumption on the random directions.

Assumption 3.1. We assume that our random vector x ∈ R^d is isotropic, i.e. that it fulfills E(xx^T) = I_d. Moreover, we assume that E(xx^T ∥x∥^2) = c I_d for some constant c > 0.

By linearity of the expectation we get E(∥x∥^2) = trace(E(xx^T)) = d for random vectors satisfying Assumption 3.1 (cf. [20, Lemma 3.2.4]). The further assumption that E(xx^T∥x∥^2) is a multiple of the identity is needed in the proofs below and is fulfilled by the sampling methods listed in the next example.

3.1.1. Convergence of the iterates. In these paragraphs we investigate convergence of the iterates of Algorithm 1.1, similar to the analysis in [13]. For this to be reasonable we need that a unique solution v of Av = b exists, i.e. that A^T A has full rank.

Theorem 3.3. Let Av = b be a consistent system with solution v and let the sequence (v_k)_k be generated by Algorithm 1.1, where the x are sampled according to Assumption 3.1. Then it holds that
  E(∥v_{k+1} − v∥^2) = ∥v_k − v∥^2 − 2τ_k∥A(v_k − v)∥^2 + τ_k^2 c∥A^T A(v_k − v)∥^2,
where the expectation is taken with respect to the random choice of x in the kth step. Hence, the sequence is strictly decreasing, and thus convergent, if 0 < τ_k < 2/(c∥A∥^2) for all k.
Proof. Defining g_k = ⟨Av_k − b, Ax⟩x, the update step can be formulated as v_{k+1} = v_k − τ_k g_k. Therefore, we get
  ∥v_{k+1} − v∥^2 = ∥v_k − v∥^2 − 2τ_k⟨g_k, v_k − v⟩ + τ_k^2∥g_k∥^2.
The summands can be rewritten as
  ⟨g_k, v_k − v⟩ = (v_k − v)^T xx^T A^T A(v_k − v)
and
  ∥g_k∥^2 = (v_k − v)^T A^T A (xx^T∥x∥^2) A^T A(v_k − v).
Hence it follows from Assumption 3.1 that the expectation with respect to x is as claimed.

Theorem 3.4. Let Av = b be a consistent system with solution v and let the sequence (v_k)_k be generated by Algorithm 1.1, where the x are sampled according to Assumption 3.1 and the stepsizes τ_k = τ are constant and fulfill 0 < τ < 2/(c∥A∥^2). Then, with
(3.1)  λ = ρ(I − τA^T A(2I − τcA^T A)),
the iterates converge linearly with E(∥v_k − v∥^2) ≤ λ^k∥v_0 − v∥^2.

Proof. As in the proof of Theorem 3.3 we get
  E(∥v_{k+1} − v∥^2) = (v_k − v)^T(I − τA^T A(2I − τcA^T A))(v_k − v) ≤ λ∥v_k − v∥^2.
Iterating this result and using the law of total expectation gives the convergence result.
Remark 3.5. The matrices A^T A and I − τA^T A(2I − τcA^T A) (in the argument of ρ in the definition of λ) share the same eigenvectors, and if µ is an eigenvalue of A^T A, then 1 − τµ(2 − τcµ) is an eigenvalue of the latter matrix. Convergence of E(∥v_k − v∥^2) to zero happens when λ < 1 holds, and for this it is necessary that A^T A has full rank and that τ fulfills the bounds stated in Theorem 3.4.
For fastest convergence, one minimizes λ with respect to the stepsize τ. The spectral radius is
  λ = max_µ |1 − τµ(2 − τcµ)|,
where µ ranges over the eigenvalues of A^T A. Since Θ_τ(µ) := 1 − τµ(2 − τcµ) is a quadratic polynomial in µ with a positive leading term, it is convex and thus attains its maxima at the endpoints of the interval of eigenvalues. Hence, λ is given by
  λ = max(|Θ_τ(λ_min(A^T A))|, |Θ_τ(λ_max(A^T A))|).
To determine the optimal stepsize τ, we minimize λ over τ, i.e. we have to minimize max(Θ_τ(λ_min(A^T A)), Θ_τ(λ_max(A^T A))) over τ. As can be seen in Figure 1, the minimum is attained where Θ_τ(λ_min(A^T A)) = Θ_τ(λ_max(A^T A)).

Fig. 1: How to find the optimal stepsize: illustration for the derivation in Remark 3.5.
Hence, we get as optimal stepsize
  τ_* = 2/(c(λ_min(A^T A) + λ_max(A^T A))).
Plugging this in, we get that the optimal convergence rate is
  λ_* = 1 − 4λ_min(A^T A)λ_max(A^T A)/(c(λ_min(A^T A) + λ_max(A^T A))^2).
Now we consider the inconsistent case, in which we do not assume the existence of a solution. We model this as an additive error and assume that the right hand side is b + r with b ∈ rg(A).
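Before turning to the inconsistent case, we note that the balancing argument for the optimal stepsize can be checked numerically. A small sketch (we use c = d + 2, which is the value of c for Gaussian directions; the minimization over τ is done by a bounded scalar search):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 8))
d = 8
c = d + 2                                  # E(xx^T ||x||^2) = (d + 2) I_d for x ~ N(0, I_d)
mus = np.linalg.eigvalsh(A.T @ A)
lmin, lmax = mus[0], mus[-1]

Theta = lambda tau, mu: 1 - tau * mu * (2 - tau * c * mu)
lam = lambda tau: max(abs(Theta(tau, lmin)), abs(Theta(tau, lmax)))

tau_star = 2 / (c * (lmin + lmax))         # closed-form optimal stepsize
res = minimize_scalar(lam, bounds=(1e-12, 2 / (c * lmax)), method="bounded")
print(abs(res.x - tau_star) / tau_star)    # the numerical minimizer agrees with tau_star

rate = 1 - 4 * lmin * lmax / (c * (lmin + lmax) ** 2)
print(np.isclose(lam(tau_star), rate))     # optimal contraction factor, prints True
```

The numerical minimizer coincides with the closed-form stepsize, and the contraction factor at τ_* matches the rate λ_* given above.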
Theorem 3.6. Let v fulfill Av = b and let the sequence (v_k)_k be generated by Algorithm 1.1 with x sampled according to Assumption 3.1, a constant stepsize τ which fulfills 0 < τ < 2/(c∥A∥^2), and where the right hand side is changed to b + r with b ∈ rg(A) and r = r′ + r″ with r′ ∈ rg(A) and r″ ∈ rg(A)^⊥. Then, with λ from (3.1), there are a factor λ̃ ∈ [λ, 1) and a constant C > 0 such that
  E(∥v_k − v∥^2) ≤ λ̃^k∥v_0 − v∥^2 + C∥A^T r′∥^2.

Proof. Following the proof of Theorem 3.4 and taking the expected value with respect to x, Young's inequality with ε > 0 yields a recursion for E(∥v_k − v∥^2) with an additional additive term proportional to ∥A^T r∥^2. Choosing ε = 1/(1 − λ), applying the law of total expectation and iterating the recursion, the additive terms sum to a constant by the geometric series. Since rg(A)^⊥ = ker(A^T), we have A^T r″ = 0 and hence the result.
In Theorem 3.6 we see that we can estimate the error by two terms. The first one decays to zero exponentially, but the second is a constant offset which is due to the noise in the right hand side. This offset only depends on the part of the noise that lies in the range of A. We remark that the estimate is not meaningful in the case where λ = 1 occurs (which would also happen in the case of an ill-posed compact operator between Hilbert spaces, a situation which is not considered in this paper).

3.1.2. Convergence of the residual.
In the results of the previous section we needed λ_min(A^T A) > 0 to get λ < 1 and hence, the results are not useful for problems where A^T A is rank deficient, in particular not in the case of underdetermined systems. In this case we do not have a unique solution of Av = b, but a nontrivial solution subspace (in the case of a consistent system).
We first show convergence properties of the residual. The following result holds for any linear system Av = b.

Theorem 3.7. Let Av = b be a linear system and let the sequence (v_k)_k be generated by Algorithm 1.1, where the x are sampled according to Assumption 3.1 and the τ_k fulfill 0 < τ_k < 2/(c∥A∥^2). Then the residual fulfills
  E(∥Av_{k+1} − b∥^2) ≤ ∥Av_k − b∥^2 − τ_k(2 − τ_k c∥A∥^2)∥A^T(Av_k − b)∥^2,
where the expectation is taken with respect to the random choice of x in the kth step, and hence is strictly decreasing and thus convergent.
Proof. Taking the expected value of
  ∥Av_{k+1} − b∥^2 = ∥Av_k − b∥^2 − 2τ_k⟨Av_k − b, Ax⟩^2 + τ_k^2⟨Av_k − b, Ax⟩^2∥Ax∥^2
with respect to x and using Assumption 3.1 together with ∥Ax∥^2 ≤ ∥A∥^2∥x∥^2 results in the claimed estimate.
Without any assumption on the shape of A, rank or consistency we always get sublinear convergence of the residual of the normal equations.
Theorem 3.8. Let the sequence (v_k)_k be generated by Algorithm 1.1 with constant stepsize τ = 1/(c∥A∥^2), where the x are sampled according to Assumption 3.1. Then it holds that
  min_{j=0,…,k} E(∥A^T(Av_j − b)∥^2) ≤ c∥A∥^2∥Av_0 − b∥^2/(k + 1).

Proof. Under the given conditions, Theorem 3.7 and the law of total expectation provide
  E(∥Av_{j+1} − b∥^2) ≤ E(∥Av_j − b∥^2) − (1/(c∥A∥^2)) E(∥A^T(Av_j − b)∥^2).
Summing up yields
  (1/(c∥A∥^2)) Σ_{j=0}^{k} E(∥A^T(Av_j − b)∥^2) ≤ ∥Av_0 − b∥^2,
and thus the result is proven.
The following theorem shows expected linear convergence of the residual for the iterates generated by Algorithm 1.1.

Theorem 3.9. Let Av = b be a consistent linear system and let the sequence (v_k)_k be generated by Algorithm 1.1 with constant stepsize 0 < τ < 2/(c∥A∥^2), where the x are sampled according to Assumption 3.1. Then it holds for the expectation taken with respect to the random choice of x in the kth step that
  E(∥Av_{k+1} − b∥^2) ≤ β∥Av_k − b∥^2  with  β = 1 − τ(2 − τc∥A∥^2)σ_min(A)^2.
If σ_min(A) > 0 we have 0 < β < 1 and hence we have linear convergence
  E(∥Av_k − b∥^2) ≤ β^k∥Av_0 − b∥^2.

Proof. Similar to the proof of Theorem 3.7 we calculate
  E(∥Av_{k+1} − b∥^2) ≤ ∥Av_k − b∥^2 − τ(2 − τc∥A∥^2)∥A^T(Av_k − b)∥^2.
Since the system is consistent, Av_k − b ∈ rg(A) and hence ∥A^T(Av_k − b)∥ ≥ σ_min(A)∥Av_k − b∥. This proves the first statement. By induction and the law of total expectation, the second result follows.
Minimizing β over τ leads to the optimal stepsize τ = 1/(c∥A∥^2) and β = 1 − σ_min(A)^2/(c∥A∥^2). This shows:

Corollary 3.10. Let Av = b be a consistent linear system with σ_min(A) > 0 and let the sequence (v_k)_k be generated by Algorithm 1.1 with τ = 1/(c∥A∥^2), where the x are sampled according to Assumption 3.1. Then it holds that
  E(∥Av_k − b∥^2) ≤ (1 − σ_min(A)^2/(c∥A∥^2))^k ∥Av_0 − b∥^2.

In conclusion, we have shown that the stochastic gradient method with adjoint sampling leads to a linear convergence rate in terms of the distance to the solution (if it is unique) and the residual (if the system is consistent). For systems which are not of full rank we still get sublinear convergence of the residual of the normal equations. The standard stochastic gradient method for the least squares functional, where the objective is written as
  (1/2)∥Av − b∥^2 = Σ_{i=1}^m (1/2)(⟨a_i, v⟩ − b_i)^2
with rows a_i^T of A, leads (after taking the steepest descent stepsize) to the randomized Kaczmarz method. For this one obtains convergence of the iterates (and residuals) with linear rate 1 − σ_min(A)^2/∥A∥_F^2 (cf. [19]). Since ∥A∥_F^2 ≤ d∥A∥^2 and c ≥ d, the rate in Corollary 3.10 is slightly worse than the one for the randomized Kaczmarz method. However, the Kaczmarz method only needs access to a single row per iteration (something we do not have access to in the situation of this paper), while SGDAS needs two applications of the linear map per iteration.
All our results so far have at least two drawbacks:
1. The convergence is slow due to the factor c which is, in all cases of our random sampling, larger than the input dimension d.
2. The stepsize conditions depend on the operator norm ∥A∥, which is usually not known under our assumptions.
Both problems can be circumvented by random descent, as we show in the next section.
3.2. Random descent. Now we turn to the analysis of the random descent algorithm (Algorithm 1.2). While this algorithm just takes steps in random directions, we will be able to show linear convergence towards minimizers by using optimal stepsizes. Hence, as a first step, we calculate the optimal stepsize for minimizing the least squares functional in a given search direction.
Lemma 3.11. The minimum of τ ↦ (1/2)∥A(v_k + τx) − b∥^2 is attained at
  τ_k = ⟨b − Av_k, Ax⟩/∥Ax∥^2.

Proof. If Ax = 0, the objective does not depend on τ at all. Otherwise the objective is convex and the minimizer is provided by the solution of 0 = (∂/∂τ)(1/2)∥A(v_k + τx) − b∥^2 = ⟨A(v_k + τx) − b, Ax⟩.

Note that this stepsize may be positive or negative. Most notably, we can evaluate this stepsize under the restrictive assumptions stated in the introduction: we only need to evaluate A two times (once at the direction x and once at the current iterate). Moreover, no knowledge of the operator norm of A is needed to use this stepsize in practice. Since this step is optimal in the direction x, we immediately get linear convergence of random descent (Algorithm 1.2) with this stepsize by comparison with our results for stochastic gradient descent (Algorithm 1.1). If v_{k+1} = v_k + τ_k x is the random descent update with optimal stepsize and ṽ_{k+1} = v_k − τ⟨Av_k − b, Ax⟩x with some τ between zero and 2/(c∥A∥^2), we always have
  ∥Av_{k+1} − b∥ ≤ ∥Aṽ_{k+1} − b∥,
i.e. random descent decreases the residual at least as fast as stochastic gradient descent with adjoint sampling. Thus, we can transfer some results from the previous section to random descent:

Theorem 3.12. Let Av = b be a linear system and let the sequence (v_k)_k be generated by Algorithm 1.2, where the x are sampled according to Assumption 3.1 and the τ_k are given by τ_k = ⟨b − Av_k, Ax⟩/∥Ax∥^2. Then we have ∥Av_{k+1} − b∥ ≤ ∥Av_k − b∥ for every k; in particular, the sequence (E(∥Av_k − b∥))_k of the expected residuals is decreasing. If furthermore λ_min(AA^T) > 0, then the sequence (E(∥Av_k − b∥^2))_k of expected residuals converges at least linearly, i.e. we have
  E(∥Av_k − b∥^2) ≤ (1 − λ_min(AA^T)/(c∥A∥^2))^k ∥Av_0 − b∥^2.

Proof. The first result follows directly from Lemma 3.11, while the second is an immediate result of the comparison with Theorem 3.9 and Corollary 3.10, in which linear convergence is proven for non-optimal fixed stepsizes.
We would expect a better convergence rate due to the optimal stepsize. It turns out that the refined convergence analysis differs for different sampling schemes for the random vectors x. The central object is the matrix
  M := E(xx^T/∥Ax∥^2),
where we have to assume that the expectation exists (see Remark 3.14 below). One sees that M only depends on the marginal distribution on the unit sphere, so all results that we obtain for the standard normal distribution also apply to the uniform distribution on the sphere. The next theorem shows the role of the matrix M; after the theorem we discuss the usefulness of the different estimates in different cases.

Theorem 3.13. Let v_k be generated by Algorithm 1.2 and assume that M exists. Then it holds:
1. For the expectation with respect to the choice of x in the kth step we have
(3.2)  E(∥Av_{k+1} − b∥^2) = ∥Av_k − b∥^2 − (Av_k − b)^T AMA^T(Av_k − b).
2. It holds that
(3.3)  min_{j=0,…,k} E(∥A^T(Av_j − b)∥^2) ≤ ∥Av_0 − b∥^2/(λ_min(M)(k + 1)).
3. If the system is consistent, then
(3.4)  E(∥Av_k − b∥^2) ≤ (1 − λ_min(M)σ_min(A)^2)^k ∥Av_0 − b∥^2
and
(3.5)  E(∥Av_k − b∥^2) ≤ (1 − λ_min(AMA^T))^k ∥Av_0 − b∥^2.

Proof. For the first claim we calculate
  ∥Av_{k+1} − b∥^2 = ∥Av_k − b∥^2 − ⟨Av_k − b, Ax⟩^2/∥Ax∥^2,
and taking the expectation over the random choice of x proves the claim.
For the second claim we use the law of total expectation and the estimate
  (Av_k − b)^T AMA^T(Av_k − b) ≥ λ_min(M)∥A^T(Av_k − b)∥^2;
rearranging, summing up and estimating by the minimum gives the claim (similarly to the proof of Theorem 3.8).
For the third claim we further estimate ∥A^T(Av_k − b)∥ ≥ σ_min(A)∥Av_k − b∥ (which is possible since Av_k − b ∈ rg(A) by consistency) and get the first inequality. The second inequality in 3. follows directly from 1.
Remark 3.14. The two estimates (3.4) and (3.5) in item 3 of Theorem 3.13 are relevant in different circumstances: Estimate (3.4) with the contraction factor (1 − λ_min(M)σ_min(A)^2) is relevant if λ_min(M) > 0. Whether this is true depends on the dimensions of A, the distribution of the directions x, and the matrix A (see below for further details). For overdetermined systems, i.e. for m > d and A with rank d, this is true for all the isotropic sampling schemes we consider here.
The estimate (3.5) with contraction factor (1 − λ_min(AMA^T)) is only relevant if the minimal eigenvalue is positive. This cannot happen when m > d since AMA^T is an m × m matrix with rank at most d. In the case m < d the matrix A has a nontrivial kernel and hence, the expectation E(xx^T/∥Ax∥^2) may not exist (in fact, it does not exist for x ∼ N(0, I_d) and x ∼ Unif(√d S^{d−1})). However, for the discrete distributions (random coordinate vectors and Rademacher vectors), this depends on whether P(x ∈ ker(A)) > 0 or not. If this is not the case, the expectation exists and is finite; moreover, we then have λ_min(AMA^T) > 0.

The convergence speed is governed by spectral properties of M. A simple and crude estimate shows the following bounds for the eigenvalues of M which hold for all isotropic random vectors:

Proposition 3.15 (Estimates for general isotropic random vectors). Let d ≤ m, A ∈ R^{m×d} and x ∈ R^d be an isotropic random vector. Then it holds that
  (1/(d σ_max(A)^2)) I_d ≼ M ≼ (1/(d σ_min(A)^2)) I_d.
Consequently it holds that
  E(∥Av_k − b∥^2) ≤ (1 − 1/(dκ))^k ∥Av_0 − b∥^2
with κ = σ_max(A)^2/σ_min(A)^2 being the condition number of A.
Proof. Since m ≥ d we have σ_min(A)^2∥x∥^2 ≤ ∥Ax∥^2 ≤ σ_max(A)^2∥x∥^2 and hence
  (1/σ_max(A)^2) E(xx^T/∥x∥^2) ≼ M ≼ (1/σ_min(A)^2) E(xx^T/∥x∥^2).
Hence, we only have to compute E(xx^T/∥x∥^2), and for isotropic random variables it holds that E(xx^T/∥x∥^2) = (1/d)I_d, which proves the claim.

Unfortunately, these bounds do not imply any faster convergence for RD in comparison with SGDAS (compare the statements of Theorem 3.8 and Corollary 3.10 with Theorem 3.13 and Proposition 3.15, respectively). However, numerical experiments indicate that the smallest and largest eigenvalues of M are usually closer together than the simple bounds from Proposition 3.15 suggest. The following lemma (Lemma 3.16) shows better bounds for the standard normal distribution (and, by the above remark, the uniform distribution on the sphere): the matrices A^T A and M can be diagonalized simultaneously. More precisely, if A^T A = U diag(λ_1, …, λ_d)U^T with an orthonormal matrix U = (u_1 | ⋯ | u_d), then M = U diag(µ_1, …, µ_d)U^T with
  µ_i = E(⟨u_i, x⟩^2 / Σ_{j=1}^d λ_j⟨u_j, x⟩^2).

Proof. For i, k ∈ {1, …, d} consider
  (U^T M U)_{i,k} = E(⟨u_i, x⟩⟨u_k, x⟩ / Σ_{j=1}^d λ_j⟨u_j, x⟩^2).
The random variables ⟨u_i, x⟩ and ⟨u_k, x⟩ are uncorrelated, since E(⟨u_i, x⟩⟨u_k, x⟩) = u_i^T E(xx^T)u_k = 0 for i ≠ k, and have zero mean. Since the denominator Σ_{j=1}^d λ_j⟨u_j, x⟩^2 contains only squares of inner products, it is independent of the signs of the ⟨u_i, x⟩. This shows that (U^T M U)_{i,k} = 0 for i ≠ k. The formula for µ_i follows from the case i = k.
For the upper estimate we denote z_i = ⟨u_i, x⟩ and note that z_i ∼ N(0, 1). We use the inequality between the arithmetic and the geometric mean. To estimate µ_i we take, without loss of generality, i = 1 (since in the following derivation we do not assume that the values λ_k are ordered decreasingly) and bound the expectation using known identities for moments of Gaussians. Now we turn to the lower estimate. We write the expectation as an integral in d-dimensional spherical coordinates, estimate all trigonometric functions in the denominator by 1, and evaluate the remaining integrals in terms of the Gamma function.

Remark 3.17 (Approximate bounds for the normal distribution, reduction of the dependence on the dimension). The lower and upper estimates in Lemma 3.16 are difficult to interpret. To simplify them we note that the occurring quotients of Gamma functions can be approximated for large d. With these approximations we obtain approximate bounds on the µ_i, where λ_i = σ_i(A)^2 are the squares of the singular values of A. In particular, we get the bound λ_min(M) ≥ (√(8d) Σ_j λ_j)^{-1} and, since ∥A∥_F^2 = Σ_j λ_j, we get from equation (3.4)
  E(∥Av_k − b∥^2) ≤ (1 − σ_min(A)^2/(√(8d)∥A∥_F^2))^k ∥Av_0 − b∥^2.
Likewise we get from (3.3) that
  min_{j=0,…,k} E(∥A^T(Av_j − b)∥^2) ≤ √(8d)∥A∥_F^2∥Av_0 − b∥^2/(k + 1).
We remark that the estimates on M are loose and that in practice one usually observes faster convergence. We will do a refined analysis in the next section and show how the iterates "converge along singular vectors".
At the end of this section we have a closer look at the choice of random coordinate vectors.
Lemma 3.18 (Estimates for random coordinate vectors). Let x be a random coordinate vector, i.e. x = √d e_k with k selected uniformly at random from {1, …, d}, and assume that all columns of A are nonzero. Then it holds that
  M = (1/d) diag(1/∥a_1∥^2, …, 1/∥a_d∥^2)
and consequently λ_min(M) = (d max_i ∥a_i∥^2)^{-1}.

Proof. We denote the columns of A by a_i, i = 1, …, d. Then ∥Ax∥^2 = d∥a_k∥^2 with probability 1/d, and the claim directly follows.

We can also consider sampling of x from non-isotropic distributions, which brings an additional degree of freedom to the setup of the algorithm. However, it is in general hard to come up with sampling schemes that provably improve the convergence rate and are simple to implement in practice.
Example 3.19. In the case of random coordinate vectors we can consider sampling x = √d e_k with probability p_k and get the expectation
  M = diag(p_1/∥a_1∥^2, …, p_d/∥a_d∥^2).
For the special choice p_k = ∥a_k∥^2/∥A∥_F^2 (known from the randomized Kaczmarz method [19]) we obtain M = ∥A∥_F^{−2} I_d, which leads to the rate
  E(∥Av_k − b∥^2) ≤ (1 − σ_min(A)^2/∥A∥_F^2)^k ∥Av_0 − b∥^2,
which is known from [12] (see also [9]).
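The setting of Example 3.19 corresponds to randomized coordinate descent; a small sketch (our own implementation, using the column-norm probabilities from the example; note that the scaling √d of the directions is irrelevant since the random descent update is invariant under scaling of x):

```python
import numpy as np

def rd_coordinate(A, b, n_iter=20000, rng=None):
    """Random descent along coordinate directions e_k, with the column index
    k drawn with probability p_k = ||a_k||^2 / ||A||_F^2 (Example 3.19)."""
    rng = np.random.default_rng(rng)
    m, d = A.shape
    col_sq = (A * A).sum(axis=0)          # ||a_k||^2 for every column
    p = col_sq / col_sq.sum()
    v = np.zeros(d)
    r = -b.astype(float)                  # residual A v - b for v = 0
    for _ in range(n_iter):
        k = rng.choice(d, p=p)
        tau = -(r @ A[:, k]) / col_sq[k]  # exact linesearch along e_k
        v[k] += tau
        r += tau * A[:, k]                # update residual using column k only
    return v

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 10))
b = A @ rng.standard_normal(10)
v = rd_coordinate(A, b, rng=5)
print(np.linalg.norm(A @ v - b) / np.linalg.norm(b))  # small
```

Each iteration touches only one column of A, which matches the per-iteration cost of randomized coordinate descent.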

4. Ill-posed problems and convergence along singular vectors.
In this section we analyze the behavior of the quantities ⟨v_k − v, u_i⟩ for iterates v_k and right singular vectors u_i of A. These quantities have been studied in [17] (and previous results in a similar direction can be found in [11]) for the randomized Kaczmarz iteration, and it has been shown that their expectation vanishes at different rates. We will analyze the noisy case, i.e. the case where Av = b but the iteration is run with the right hand side b + r. As a baseline for comparison, we first analyze the iterates of the Landweber iteration.
The Landweber iteration with noisy data reads v_{k+1} = v_k − ωA^T(Av_k − (b + r)) for some stepsize ω > 0. Theorem 4.1 states that its iterates fulfill
  ⟨v_k − v, u_i⟩ = (1 − ωσ_i^2)^k⟨v_0 − v, u_i⟩ + (1 − (1 − ωσ_i^2)^k)⟨r, w_i⟩/σ_i.

Proof. We calculate
  ⟨v_{k+1} − v, u_i⟩ = (1 − ωσ_i^2)⟨v_k − v, u_i⟩ + ωσ_i⟨r, w_i⟩,
and the claim follows by induction.

This theorem shows the typical semiconvergence property of the Landweber method: the contributions ⟨v_k − v, u_i⟩ of the ith singular vector have a part stemming from the initial error v_0 − v that decays faster for the larger singular values, and another part that converges to ⟨r, w_i⟩/σ_i and which is due to the noise.
In a similar way we get for the iterates of the random descent method (Theorem 4.2) that
  E(⟨v_k − v, u_i⟩) = (1 − µ_iσ_i^2)^k⟨v_0 − v, u_i⟩ + (1 − (1 − µ_iσ_i^2)^k)⟨r, w_i⟩/σ_i,
where the expectation is taken with respect to the random choice of x in the kth step.
Comparing Theorems 4.1 and 4.2, we observe that in the case of random descent we get a factor µ_i which depends on i, instead of the constant ω. For the Landweber iteration in Theorem 4.1 one takes 0 < ω < 2/∥A∥^2, and we see that if µ_i > ω, both terms on the right hand side of the estimate converge faster (the first towards zero and the second towards ⟨r, w_i⟩/σ_i); thus, the semiconvergence happens faster for random descent than for the Landweber iteration. We will see the practical implications of this result in the numerical experiments in Section 5.2.

5. Experiments.
5.1. Comparison with other solvers. In this section we compare SGDAS and RD with other solvers for linear systems Av = b that meet our requirements, i.e. only with solvers that need only forward evaluations of A and do not need to store a larger number of vectors, namely with TFQMR [8] and CGS [16]. Note that GMRES also only needs forward evaluations of A, but constructs an orthonormal basis that grows with the number of iterations and hence does not meet our criteria. Both TFQMR and CGS are designed for square systems but do not assume further structure of the operator A. However, we can turn both over- and underdetermined systems into square systems by adding rows or columns. Although this is pretty straightforward, we include the details for completeness (see the paragraph on under- and overdetermined systems). We implemented SGDAS and RD in Python using NumPy and SciPy. We generated several sparse random m × d matrices A, corresponding solutions v, and right hand sides b = Av to obtain consistent linear systems. We let the methods SGDAS (Algorithm 1.1) and RD (Algorithm 1.2) run with different isotropic random vectors until a specified relative tolerance ∥Av − b∥/∥b∥ was reached or a maximum number of 10000 iterations was reached. As a comparison we called TFQMR and CGS from SciPy with the same tolerance and maximal number of iterations. In Table 1 we collect the results for several sizes and densities of A and a fairly large tolerance of 10^−2, and Table 2 gives results for smaller matrices A and a smaller tolerance of 10^−5.
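For reference, the SciPy baselines can be called as follows (a minimal sketch on a synthetic, well-conditioned square system; `tfqmr` is available in SciPy since version 1.7):

```python
import numpy as np
from scipy.sparse.linalg import cgs, tfqmr

rng = np.random.default_rng(6)
n = 60
A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant, easy system
b = A @ rng.standard_normal(n)

results = {}
for name, solver in [("TFQMR", tfqmr), ("CGS", cgs)]:
    x, info = solver(A, b, maxiter=10000)
    results[name] = (info, np.linalg.norm(A @ x - b) / np.linalg.norm(b))
    print(name, *results[name])                   # info == 0 signals convergence
```

Both solvers accept any object implementing a matrix-vector product (e.g. a `LinearOperator`), which matches the forward-evaluation-only setting of this paper.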
Fig. 2: Comparison of factors influencing convergence of the Landweber method (σ_i^2/∥A∥^2) and random descent with normally distributed directions (µ_iσ_i^2).
Furthermore, we tested RD on least squares problems from the SuiteSparse Matrix Collection [4]. We used matrices of different sizes and let RD, TFQMR and CGS run for 10·max(m, d) iterations or until a tolerance of 10^−2 was reached. We did not include spherical sampling in this case since the iterates are the same as for normal sampling (the update is zero-homogeneous in x and the spherical uniform distribution is the normalized normal distribution). We report the final relative residual and runtime in Table 3.
Notably, RD always works and is reasonably fast; sometimes it is even faster than TFQMR and CGS. Moreover, TFQMR and CGS sometimes fail dramatically (actually, they fail on most non-square problems), and, as expected, SGDAS is always slower than RD. In conclusion, RD is a reliable and comparably fast method to minimize least squares functionals under the computational constraints that we consider in this paper. The distribution of the random x also seems to play a role, but our experiments do not allow for a clear conclusion in this regard.

5.2. Ill-posed problems.
In this section we illustrate the effect of the findings from Section 4. We consider a simple toy example with d = m = 100, namely the discrete counterpart of "inverse integration". The corresponding map A is the cumulative sum, i.e. (Av)_i = Σ_{j=1}^i v_j. From Theorems 4.1 and 4.2 we see that random descent with normally distributed directions has advantages over the Landweber iteration if the eigenvalues µ_i of M are larger than the stepsize ω of the Landweber iteration. In Figure 2 we plot the values ωσ_i^2 = σ_i^2/∥A∥^2 and µ_iσ_i^2. Larger values lead to faster semiconvergence, and we observe that the values µ_iσ_i^2 are indeed larger for larger indices i, i.e. for the smaller singular values. This indicates that random descent should perform favorably if the initial error v_0 − v has large contributions from singular vectors belonging to small singular values, i.e. for rough solutions. We constructed a rough solution v and corresponding noisy data b + r with little noise (see Figure 3).
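The toy problem and the two iterations can be set up in a few lines. This is a sketch of the experiment (our own variable names; not the exact code used for the figures):

```python
import numpy as np

d = 100
A = np.tril(np.ones((d, d)))            # cumulative sum: (Av)_i = v_1 + ... + v_i
rng = np.random.default_rng(7)
v_true = rng.standard_normal(d)         # a rough solution (white noise)
b = A @ v_true
b_noisy = b + 1e-3 * rng.standard_normal(d)

omega = 1.0 / np.linalg.norm(A, 2) ** 2  # Landweber stepsize (Landweber needs A^T)
v_lw = np.zeros(d)
v_rd = np.zeros(d)
r_rd = -b_noisy                          # residual A v_rd - b_noisy for v_rd = 0
for _ in range(2000):
    v_lw -= omega * A.T @ (A @ v_lw - b_noisy)  # Landweber baseline
    x = rng.standard_normal(d)                  # random descent step, adjoint-free
    Ax = A @ x
    tau = -(r_rd @ Ax) / (Ax @ Ax)
    v_rd += tau * x
    r_rd += tau * Ax

err = lambda v: np.linalg.norm(v - v_true) / np.linalg.norm(v_true)
print(err(v_lw), err(v_rd))              # relative reconstruction errors
```

Tracking `err` over the iterations (instead of only printing the final value) reproduces the semiconvergence behavior discussed above.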
We ran the Landweber iteration as well as random descent (with different sampling of the direction x) and report the decay of residuals and the semiconvergence of the errors in Figure 4. We indeed observe a faster asymptotic decay of the residual for certain sampling schemes in random descent (namely for spherical, normal and Rademacher directions) than for the Landweber iteration, although the residual decays faster for Landweber in the beginning, as predicted by our observation in Figure 2; a similar behavior shows for the semiconvergence of the errors.
In Table 4 we report the values of the best errors achieved by the methods and also the errors achieved when stopping according to Morozov's discrepancy principle. We observe that the adjoint-free methods perform comparably well in terms of reconstruction quality, but are in fact faster than the Landweber iteration.

1. Introduction.

1.1. Problem statement. We consider the basic problem of solving linear least squares problems
(1.1)  min_{v ∈ R^d} (1/2)∥Av − b∥^2

Algorithm 1.1 Stochastic gradient descent with adjoint sampling (SGDAS)
Initialize v_0 = 0 ∈ R^d, k = 0
while not stopped do
  obtain random vector x ∈ R^d with E(xx^T) = I_d
  update v_{k+1} = v_k − τ_k⟨Av_k − b, Ax⟩x for some stepsize τ_k
  update k ← k + 1
end while

Lemma 3.16 (Estimates for the normal distribution). Let x ∼ N(0, I_d) or x ∼ Unif(√d S^{d−1}) and m > d > 2. Then the matrices A^T A and M = E(xx^T/∥Ax∥^2) can be diagonalized simultaneously.

Theorem 4.1. Let A ∈ R^{m×d} with right and left singular vectors (u_i) and (w_i), respectively, and singular values σ_i. Moreover, let b ∈ R^m and v ∈ R^d with Av = b, and let the Landweber iteration be run with right hand side b + r and initialization v_0 ∈ R^d.

Theorem 4.2. Let A ∈ R^{m×d} with right and left singular vectors (u_i) and (w_i), respectively, and singular values σ_i. Moreover, let b ∈ R^m and v ∈ R^d with Av = b, let v_0 ∈ R^d, and let µ_i be the eigenvalues of the matrix M = E(xx^T/∥Ax∥^2). Then the iterates v_k of Algorithm 1.2 run with b + r instead of b fulfill
  E(⟨v_k − v, u_i⟩) = (1 − µ_iσ_i^2)^k⟨v_0 − v, u_i⟩ + (1 − (1 − µ_iσ_i^2)^k)⟨r, w_i⟩/σ_i.

Underdetermined systems: If A ∈ R^{m×d} and b ∈ R^m with m < d, we define Ã := (A; 0) ∈ R^{d×d} (A padded with d − m zero rows) and b̃ := (b; 0) ∈ R^d. Then v solves Ãv = b̃ exactly if it solves Av = b.
Overdetermined systems: If m > d, we define Ã := (A, 0) ∈ R^{m×m} (A padded with m − d zero columns) and have that a vector ṽ = (v^T, w^T)^T solves Ãṽ = b exactly if v solves Av = b. If the system Av = b is inconsistent, we still have that ṽ solves Ã^T Ãṽ = Ã^T b exactly if v solves A^T Av = A^T b.
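The embeddings into square systems can be checked numerically; a minimal sketch (dense arrays, random data only for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)

# Underdetermined case, m < d: pad A with zero rows to a d x d system.
m, d = 3, 5
A = rng.standard_normal((m, d))
v = rng.standard_normal(d)
b = A @ v
A_sq = np.vstack([A, np.zeros((d - m, d))])
b_sq = np.concatenate([b, np.zeros(d - m)])
ok_under = np.allclose(A_sq @ v, b_sq)        # v solves the padded square system

# Overdetermined case, m > d: pad A with zero columns to an m x m system.
m, d = 5, 3
A = rng.standard_normal((m, d))
v = rng.standard_normal(d)
b = A @ v
A_sq = np.hstack([A, np.zeros((m, m - d))])
v_sq = np.concatenate([v, np.zeros(m - d)])   # the padded block is arbitrary
ok_over = np.allclose(A_sq @ v_sq, b)

print(ok_under, ok_over)  # True True
```

Since the padded blocks are zero, forward evaluations of Ã cost the same as forward evaluations of A, so the embedding does not change the computational model.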

Fig. 3: Solution and data for the ill-posed inverse integration example.

"Morozov error" shows the relative reconstruction error achieved by stopping the iteration with Morozov's discrepancy principle, i.e. when the residual ∥Av_k − (b + r)∥ falls below c∥r∥ with c = 1.001.
Since ∥x∥^2 = d holds deterministically, it again holds that c = d in this case.

Table 1: Comparison of SGDAS, RD, TFQMR and CGS. Size of A: 600 × 600, density of A: 0.5. Stopped at relative tolerance of 10^−2 or after 10000 iterations. Random sparse matrices A with normally distributed entries and random solution vectors v with normally distributed entries.

Table 2: Comparison of SGDAS, RD, TFQMR and CGS. Size of A: 200 × 100, density of A: 0.02. Stopped at relative tolerance of 10^−5 or after 500000 iterations. Random sparse matrices A with normally distributed entries and random solution vectors v with normally distributed entries.

Table 3: Comparison of RD, TFQMR, and CGS on least squares problems from the SuiteSparse Matrix Collection.

Fig. 4: Decay of residuals and semi-convergence of errors for the ill-posed inverse integration example.

Table 4: Results of the experiment for the ill-posed problem shown in Figure 4.