Deep Null Space Learning for Inverse Problems: Convergence Analysis and Rates

Recently, deep learning based methods have emerged as a new paradigm for solving inverse problems. These methods empirically show excellent performance but lack theoretical justification; in particular, no results on their regularization properties are available. This is notably the case for two-step deep learning approaches, where a classical reconstruction method is applied to the data in a first step and a trained deep neural network is applied to improve the result in a second step. In this paper, we close the gap between practice and theory for a new network structure in a two-step approach. For that purpose, we propose so-called null space networks and introduce the concept of Φ-regularization. Combined with a standard regularization method as reconstruction layer, the proposed deep null space learning approach is shown to be a Φ-regularization method; convergence rates are also derived. The proposed null space network structure naturally preserves data consistency, which is considered a key property of neural networks for solving inverse problems.


Introduction
We study the solution of inverse problems of the form

Estimate x ∈ X from data y^δ = Ax + ξ. (1.1)

Here A : X → Y is a linear operator between Hilbert spaces X and Y, and ξ ∈ Y models the unknown data error (noise), which is assumed to satisfy ‖ξ‖ ≤ δ for some noise level δ ≥ 0. We consider a possibly infinite-dimensional function space setting, but the approach and results apply to a finite-dimensional setting as well.
We focus on the ill-posed (or ill-conditioned) case where, without additional information, the solution of (1.1) is either highly unstable, highly underdetermined, or both. Many inverse problems in biomedical imaging, geophysics, engineering sciences, or elsewhere can be written in such a form (see, for example [7,22]). For its stable solution one has to employ regularization methods, which are based on approximating (1.1) by neighboring well-posed problems, which enforce stability, accuracy, and uniqueness.

Regularization methods
Any method for the stable solution of (1.1) uses, either implicitly or explicitly, a priori information about the unknowns to be recovered. Such information can be that x belongs to a certain set M of admissible elements, or that it has small value of some regularizing functional. The most basic regularization method is probably Tikhonov regularization, where the solution is defined as a minimizer of the quadratic Tikhonov functional

T_{α,y^δ}(x) := ½ ‖Ax − y^δ‖² + (α/2) ‖x‖².

Other classical regularization methods for solving linear inverse problems are filter based methods [7], which include Tikhonov regularization as a special case.
In the last couple of years, variational regularization methods, including TV regularization and ℓ^q regularization, became popular [22]. They also include classical Tikhonov regularization as a special case. In the general version, the regularizer ½‖·‖² is replaced by a general convex and lower semi-continuous functional.
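As a concrete illustration of classical Tikhonov regularization, the following self-contained numpy sketch compares regularized and unregularized inversion on a synthetic ill-conditioned problem; the operator, dimensions, and parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Toy ill-conditioned problem (all sizes and values are illustrative only).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 50)))
V, _ = np.linalg.qr(rng.standard_normal((50, 50)))
s = np.exp(-np.arange(50.0))            # rapidly decaying singular values
A = U @ np.diag(s) @ V.T

x_true = rng.standard_normal(50)
delta = 1e-6
y_delta = A @ x_true + delta * rng.standard_normal(50)

def tikhonov(A, y, alpha):
    """Minimizer of 1/2 ||Ax - y||^2 + (alpha/2) ||x||^2, i.e. (A^T A + alpha I)^{-1} A^T y."""
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

x_alpha = tikhonov(A, y_delta, alpha=1e-8)
x_naive = np.linalg.solve(A, y_delta)   # unregularized inversion amplifies the noise
err_alpha = np.linalg.norm(x_alpha - x_true)
err_naive = np.linalg.norm(x_naive - x_true)
```

Even with very small noise, the unregularized solve is useless because the inverse of A amplifies the noise by the reciprocal of the smallest singular values, whereas the Tikhonov filter damps exactly those components.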
In this paper, we develop a new regularization concept that we name Φ-regularization. Roughly speaking, a Φ-regularization method is a tuple ((R_α)_{α>0}, α∗) where (for a precise definition see definition 2.3): M ⊆ X is the set of admissible elements, defined by the function Φ; R_α : Y → X are continuous mappings; α∗ = α∗(δ, y^δ) is a suitable parameter choice; and for any x ∈ M we have R_{α∗(δ,y^δ)}(y^δ) → x as δ → 0.
Note that in some cases it might be reasonable to take R_α multivalued. For the sake of simplicity, here we only consider the single-valued case. Classical regularization methods are special cases of Φ-regularization methods in Hilbert spaces where M = ker(A)^⊥. A typical regularization method in this case is given by Tikhonov regularization,

B_α := (A*A + α Id_X)^{−1} A*.

Here and in the following, Id_X denotes the identity on X.

Solving inverse problems by neural networks
Very recently, deep learning approaches appeared as alternative, very successful methods for solving inverse problems (see, for example [2-6, 8, 10, 12, 15, 25-28]). In most of these approaches, a reconstruction network R : Y → X is trained to map measured data to the desired output image.
Various reconstruction networks have been introduced in the literature. In the two-step approach, the reconstruction network takes the form R = L ∘ B, where B : Y → X maps the data to the reconstruction space (reconstruction layer or backprojection; no free parameters) and L : X → X is a neural network (NN) whose free parameters are adjusted to the training data. In particular, so-called residual networks L = Id_X + N, where only the residual part N is trained [11], show very accurate results for solving inverse problems [3,6,12,13,16,18,21,26]. Another class of reconstruction networks learns free parameters in iterative schemes. In such approaches, a sequence of reconstruction networks R = R^(k) is defined by some iterative process R^(k)(y) = N_k(Φ_k(x_{k−1}, …, x_0, y)), where x_0 is some initial guess, N_k : X → X are networks that can be adjusted to available training data, and Φ_k are updates based on the data and the previous iterates [2,14,15,23].
Further existing deep learning approaches for solving inverse problems are based on trained projection operators [5,8], or use neural networks as trained regularization term [17].
While the above deep learning based reconstruction networks empirically yield good performance, none of them is known to be a convergent regularization method. In this paper, we use a new network structure (the null space network) that, when combined with a classical regularization of the Moore-Penrose inverse, is shown to provide a convergent Φ-regularization method with rates. One of the reviewers of this manuscript kindly brought to our attention that the null space network structure has in fact already been introduced by Mardani and collaborators in [19,20] in a finite dimensional setting. We extend the use of the null space network to operators with non-closed range and analyze its stable approximation in the context of regularization methods.

Proposed null space networks and main results
As often argued in the recent literature, deep learning based reconstruction approaches (especially two-stage networks) lack data consistency, in the sense that outputs of existing reconstruction networks fail to accurately predict the given data. In order to overcome this issue, in this paper we introduce a new network that we name null space network. The proposed null space network takes the form

L = Id_X + (Id_X − A⁺A) N.

The function N : X → X, for example, can be defined by a neural network according to definition 3.1. Note that Id_X − A⁺A = P_ker(A) equals the projector onto the null space ker(A) of A. Consequently, the null space network L satisfies ALx = Ax for all x ∈ X. This yields data consistency, which means that the equation Ax = y is invariant under application of a null space network (compare figure 1). Suppose x_1, …, x_N are some desired output images and let L be a trained null space network that approximately maps A⁺Ax_n to x_n (see the appendix for a possible training strategy). In this paper, we show that if (B_α)_{α>0} is any classical ker(A)^⊥-regularization, then the two-stage reconstruction network

R_α := L ∘ B_α (1.4)

yields a Φ-regularization with M := L(ran(A⁺)). To the best of our knowledge, these are the first such results for regularization by neural networks. Additionally, we will derive convergence rates for (R_α)_{α>0} on suitable function classes. The intuition behind using the null space network in the two-stage approach (1.4) is that only the invisible information in ker(A) should be learned by the network, whereas the visible part in ker(A)^⊥ should be kept (compare figure 1). Moreover, in the case that A has non-closed range, the visible part in ker(A)^⊥ will be sensitive with respect to data perturbations. These instabilities with respect to noise can exactly be addressed by a regularizing family (B_α)_{α>0} of continuous operators that converge pointwise to A⁺ in the limit α → 0.
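The data consistency property ALx = Ax can be checked directly in a small finite-dimensional experiment. The following numpy sketch is illustrative: the operator is a random underdetermined matrix and the map N is an arbitrary stand-in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))         # underdetermined: nontrivial null space
A_pinv = np.linalg.pinv(A)
P_ker = np.eye(5) - A_pinv @ A          # projector onto ker(A)

def N(x):
    """Stand-in for a trained network (arbitrary Lipschitz nonlinearity)."""
    return np.tanh(x) + 0.1 * x

def L(x):
    """Null space network L = Id + P_ker(A) ∘ N."""
    return x + P_ker @ N(x)

x = rng.standard_normal(5)
# Data consistency: A L(x) = A x, since the learned correction lives in ker(A).
consistent = np.allclose(A @ L(x), A @ x)
```

Whatever N does, its output is projected onto ker(A) before being added, so the network can only alter the part of x that is invisible in the data.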

Outline
This paper is organized as follows. In section 2 we develop a general theory of Φ-regularization, introducing the notions of the Φ-generalized inverse (definition 2.1) and of Φ-regularization methods (definition 2.3), which generalize the classical Moore-Penrose generalized inverse and the classical regularization concept. We show convergence (see theorem 2.4) and derive convergence rates (theorem 2.8) that include regularization via null space networks as a special case. In section 3 we introduce null space networks and extend the convergence results to the special case of the null space network (theorems 3.4 and 3.5). Possible strategies for training a null space network are described in the appendix. The paper concludes with a summary and outlook presented in section 4.

A theory of Φ-regularization
In this section, we introduce the novel concepts of the Φ-generalized inverse and of Φ-regularization. We derive a general class of Φ-regularization methods for which we show convergence and derive convergence rates.
Throughout this section, let A : X → Y be a bounded linear operator, let Φ : X → ker(A) ⊆ X be Lipschitz continuous, and define

M_Φ := (Id_X + Φ)(ran(A⁺)) ⊆ X. (2.1)

Figure 1. Sketch of the action of a null space network L that maps points z_i ∈ ran(A⁺) to more desirable elements in z_i + ker(A) along the null space of A. The component in ker(A) is invisible in the data Ax, whereas the part in ker(A)^⊥ can be found by applying the pseudoinverse A⁺ to the data Ax.
The prime example is Φ = P_ker(A) ∘ N with a neural network function N : X → X, in which case Id_X + Φ is a null space network. This case will be studied in the following section. The results presented in this section apply to general Lipschitz continuous functions Φ whose image is contained in ker(A).

Φ-regularization methods
In the following we denote by dom(A⁺) := ran(A) ⊕ ran(A)^⊥ the domain of the Moore-Penrose generalized inverse A⁺ of A. In particular, for any y ∈ dom(A⁺), the element A⁺y is well defined and can be found as the unique minimal norm solution of the normal equation A*Ax = A*y. Classical regularization methods aim at approximating A⁺y. In contrast, the null space network will recover different solutions of the normal equation. For that purpose, we introduce the following concept.

Definition 2.1 (Φ-generalized inverse). We call A_Φ := (Id_X + Φ) ∘ A⁺ : dom(A⁺) → X the Φ-generalized inverse of A.
Recall that for any y ∈ dom(A⁺), the solution set of the normal equation A*Ax = A*y is given by A⁺y + ker(A). Hence A_Φ y gives a particular solution of the normal equation that can be adapted to a training set. The Φ-generalized inverse coincides with the Moore-Penrose generalized inverse in the special case Φ = 0.
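That A_Φ y solves the same normal equation as the minimal norm solution A⁺y can be verified numerically. In this sketch, the operator and the map Φ are illustrative choices; Φ is an arbitrary Lipschitz map with values in ker(A).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))
A_pinv = np.linalg.pinv(A)
P_ker = np.eye(5) - A_pinv @ A

def Phi(v):
    """Any Lipschitz map with values in ker(A) (an arbitrary choice for illustration)."""
    return P_ker @ np.sin(v)

y = rng.standard_normal(3)
z = A_pinv @ y                  # minimal norm solution A⁺y
x_phi = z + Phi(z)              # Φ-generalized inverse: A_Φ y = (Id + Φ)(A⁺y)
```

Both z and x_phi satisfy A*Ax = A*y; they differ only by the null-space component Phi(z), which is exactly the freedom that can be adapted to training data.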

Lemma 2.2. The Φ-generalized inverse is continuous if and only if ran(A) is closed.
Proof. If ran(A) is closed, then classical results show that A⁺ is bounded (see, for example, [7]). Consequently, (Id_X + Φ) ∘ A⁺ is continuous, too. Conversely, if A_Φ is continuous, then the identity P_ran(A⁺) ∘ A_Φ = A⁺ implies that the Moore-Penrose generalized inverse A⁺ is bounded and therefore that ran(A) is closed. □

Lemma 2.2 shows that, as for the classical Moore-Penrose generalized inverse, the Φ-generalized inverse is discontinuous whenever ran(A) is not closed. In order to stably solve the equation Ax = y, we therefore require bounded approximations of the Φ-generalized inverse. For that purpose, we introduce the following concept of regularization methods adapted to A_Φ.
Definition 2.3 (Φ-regularization method). A pair ((R_α)_{α>0}, α∗) consisting of a family of continuous mappings R_α : Y → X and a parameter choice α∗ : (0, ∞) × Y → (0, ∞) is called a Φ-regularization method for the equation Ax = y if, for all y ∈ dom(A⁺),

sup { ‖A_Φ y − R_{α∗(δ,y^δ)}(y^δ)‖ : y^δ ∈ Y with ‖y^δ − y‖ ≤ δ } → 0 as δ → 0.

In our generalized notation, a classical regularization method for the equation Ax = y corresponds to a 0-regularization method (that is, the case Φ = 0). Further note that the null space assumption ran(Φ) ⊆ ker(A) is required for (Id_X + Φ)A⁺y with y ∈ ran(A) to be a solution of Ax = y instead of merely some element of X. We consider this data consistency property central for solving inverse problems with neural network based reconstruction methods.

Convergence analysis
The following theorem shows that the combination of a null space network and a regularization method for A⁺ yields a regularization of A_Φ.

Theorem 2.4. Let Φ : X → ker(A) be Lipschitz continuous, and let ((B_α)_{α>0}, α∗) be a classical regularization method for Ax = y (that is, a 0-regularization method). Then ((R_α)_{α>0}, α∗) with R_α := (Id_X + Φ) ∘ B_α is a Φ-regularization method for Ax = y.
In particular, a wide class of Φ-regularization methods can be defined by regularizing filters. Recall that a family (g_α)_{α>0} of functions g_α : [0, ‖A‖²] → ℝ is called a regularizing filter if
(i) for all α > 0, g_α is piecewise continuous;
(ii) sup { |λ g_α(λ)| : α > 0 and λ ∈ [0, ‖A‖²] } < ∞;
(iii) for all λ ∈ (0, ‖A‖²], we have g_α(λ) → 1/λ as α → 0.

Corollary 2.6. Let (g_α)_{α>0} be a regularizing filter and define B_α := g_α(A*A)A*. Then ((Id_X + Φ) ∘ B_α)_{α>0}, together with a suitable parameter choice, is a Φ-regularization method.

Proof. The family (B_α)_{α>0} is a regularization of A⁺; see [7]. Therefore, the claim follows according to theorem 2.4. □

Basic examples of filter based regularization methods are Tikhonov regularization, where g_α(λ) = 1/(α + λ), and the truncated singular value decomposition, where g_α(λ) = 1/λ for λ ≥ α and g_α(λ) = 0 otherwise.

Classical regularization methods are based on approximating the Moore-Penrose inverse. The following observation shows that Φ-regularization methods are essentially continuous approximations of A_Φ. Suppose that P_ran(A⁺) ∘ R_α is a regularization of A⁺. Then classical regularization theory implies that there exists a parameter choice α∗ that is continuous in the first argument. If, moreover, R_α = (Id_X + Φ) ∘ P_ran(A⁺) ∘ R_α, then, according to theorem 2.4, the pair ((R_α)_{α>0}, α∗) is a Φ-regularization method. Together with standard regularization theory, this shows that Φ-regularization methods are, up to the Lipschitz perturbation Id_X + Φ, point-wise approximations of the Moore-Penrose inverse on dom(A⁺).
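The construction R_α = (Id + Φ) ∘ B_α can be exercised numerically with the Tikhonov filter g_α(λ) = 1/(α + λ). The following sketch is illustrative: the operator, the map Φ, and the parameter choice α = δ are assumptions chosen for the demonstration, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))
A_pinv = np.linalg.pinv(A)
P_ker = np.eye(6) - A_pinv @ A
Phi = lambda v: P_ker @ np.tanh(v)       # Lipschitz, maps into ker(A)

def B(y, alpha):
    """Filter-based regularization B_alpha = g_alpha(A*A)A* with g_alpha(t) = 1/(alpha+t)."""
    return np.linalg.solve(A.T @ A + alpha * np.eye(6), A.T @ y)

def R(y, alpha):
    z = B(y, alpha)
    return z + Phi(z)                    # R_alpha = (Id + Phi) o B_alpha

x0 = A_pinv @ (A @ rng.standard_normal(6))   # x0 in ran(A^+) = ker(A)-perp
x = x0 + Phi(x0)                             # x in M = (Id + Phi)(ran(A^+))
y = A @ x                                    # exact data; note A x = A x0

# Reconstruction errors for noisy data with noise level ~ d and the choice alpha = d:
errs = [np.linalg.norm(R(y + d * rng.standard_normal(4), alpha=d) - x)
        for d in (1e-2, 1e-4, 1e-6)]
```

As the noise level decreases and α is driven to zero accordingly, the reconstructions converge to the admissible element x, illustrating the Φ-regularization property.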

Convergence rates
Next we derive quantitative error estimates. For that purpose, we assume in the following that B_α = g_α(A*A)A* is defined by a regularizing filter (g_α)_{α>0}. We write α∗ ≍ (δ/ρ)^a as δ → 0, where α∗ : (0, ∞) × Y → (0, ∞) and a, ρ > 0, to indicate that there are positive constants d_1, d_2 such that d_1 (δ/ρ)^a ≤ α∗ ≤ d_2 (δ/ρ)^a.

Theorem 2.8. Suppose µ, ρ > 0 and let (g_α)_{α>0} be a regularizing filter such that there exist constants α_0, c_1, c_2 > 0 with, for all α ∈ (0, α_0],

sup { λ^µ |1 − λ g_α(λ)| : λ ∈ [0, ‖A‖²] } ≤ c_1 α^µ,
sup { √λ |g_α(λ)| : λ ∈ [0, ‖A‖²] } ≤ c_2 / √α.

Set M_{µ,ρ,Φ} := (Id_X + Φ)((A*A)^µ(B_ρ(0))) and R_α := (Id_X + Φ) ∘ B_α. Then, for the parameter choice α∗ ≍ (δ/ρ)^{2/(2µ+1)}, we obtain the convergence rate

sup { ‖R_{α∗(δ,y^δ)}(y^δ) − x‖ : x ∈ M_{µ,ρ,Φ} and ‖Ax − y^δ‖ ≤ δ } = O(δ^{2µ/(2µ+1)}). (2.5)

In particular, (2.5) holds for any x ∈ ran((Id_X + Φ) ∘ (A*A)^µ).

Proof. Under the given assumptions, g_α(A*A)A* is an order optimal regularization method on (A*A)^µ(B_ρ(0)), which implies (see [7])

‖B_{α∗(δ,y^δ)}(y^δ) − A⁺y‖ ≤ C δ^{2µ/(2µ+1)} ρ^{1/(2µ+1)}

for some constant C > 0 independent of x, y^δ. Consequently, we have

‖R_{α∗(δ,y^δ)}(y^δ) − x‖ ≤ L C δ^{2µ/(2µ+1)} ρ^{1/(2µ+1)},

where L is the Lipschitz constant of Id_X + Φ. Taking the supremum over all x ∈ M_{µ,ρ,Φ} and y^δ ∈ Y with ‖Ax − y^δ‖ ≤ δ yields (2.5). □

Note that the filters (g_α)_{α>0} of the truncated SVD and the Landweber iteration satisfy the assumptions of theorem 2.8. In the case of Tikhonov regularization, the assumptions are satisfied for µ ≤ 1. In particular, under the assumption x ∈ (Id_X + Φ)(ran(A*)) (resembling the classical source condition), we obtain the convergence rate ‖R_{α∗(δ,y^δ)}(y^δ) − x‖ = O(δ^{1/2}).
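The O(δ^{1/2}) rate under the source condition can be made visible numerically. The sketch below is an illustrative construction: a diagonal operator with a dyadically decaying spectrum and a nontrivial null space, noiseless data, and x satisfying the µ = 1/2 source condition. For α ≍ δ² (the choice of theorem 2.8 with µ = 1/2), an O(α^{1/2}) decay of the error in α corresponds to the O(δ^{1/2}) rate.

```python
import numpy as np

# Illustrative diagonal operator: dyadically decaying spectrum, 5-dim null space.
s = 2.0 ** -np.arange(20)
A = np.zeros((20, 25))
A[:, :20] = np.diag(s)
A_pinv = np.linalg.pinv(A)
P_ker = np.eye(25) - A_pinv @ A
Phi = lambda v: P_ker @ (np.tanh(v) + 1.0)   # Lipschitz map into ker(A)

# Source condition with mu = 1/2: x0 in ran(A*) = ran((A*A)^{1/2}).
x0 = A.T @ np.ones(20)
x = x0 + Phi(x0)
y = A @ x                                    # noiseless data (= A x0)

def R(y, alpha):
    """R_alpha = (Id + Phi) o B_alpha with the Tikhonov filter g_alpha(t) = 1/(alpha+t)."""
    z = np.linalg.solve(A.T @ A + alpha * np.eye(25), A.T @ y)
    return z + Phi(z)

alphas = np.array([1e-2, 1e-4, 1e-6])
errs = np.array([np.linalg.norm(R(y, a) - x) for a in alphas])
# Empirical decay exponent of the error with respect to alpha:
slope = np.log10(errs[0] / errs[2]) / np.log10(alphas[0] / alphas[2])
```

The measured slope is close to 1/2, matching the predicted α^{1/2} (equivalently δ^{1/2}) behavior on this example.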

Deep null space learning
Throughout this section, let A : X → Y be a bounded linear operator. We define Φ-regularizations by null space networks, describe a possible training strategy, and derive regularization properties and convergence rates. For the following, recall that the projector onto the kernel of A is given by P_ker(A) = Id_X − A⁺A.

Null space networks
We work with layered feed forward networks in infinite dimensional spaces, although more complicated networks can be used as long as their Lipschitz constant is finite. As the error estimates depend on the Lipschitz constant, it is also desirable that the Lipschitz constant is not too large. Neural network functions in infinite dimensional spaces can be found in [1,9,17] and the references therein. Nevertheless, while the notion of neural networks is standard in a finite-dimensional setting, no established definition seems available for general Hilbert spaces. Here we use the following notion of layered networks in Hilbert spaces.

Definition 3.1 (Layered feed forward network). Let X and Z be Hilbert spaces. We call a function N : X → Z a layered feed forward network if it has the form

N = (σ_L ∘ W_L) ∘ ⋯ ∘ (σ_1 ∘ W_1), (3.1)

where the W_ℓ are affine mappings and the σ_ℓ are Lipschitz continuous nonlinearities between Hilbert spaces, with W_1 defined on X and σ_L mapping into Z.
Usually the nonlinearities σ_ℓ are fixed and the affine mappings W_ℓ are trained. In the case that X is a function space, a standard choice for σ_ℓ is the ReLU (rectified linear unit), ReLU(x) := max{x, 0}, applied component-wise, or ReLU in combination with max pooling, which takes the maximum value max{|x(i)| : i ∈ I_k} within clusters of transform coefficients. Note that in the typical case of L²-spaces the elements are equivalence classes of functions. In this case, a representative has to be selected for the application of ReLU; the result is clearly independent of the chosen representative. The network in definition 3.1 may in particular be a convolutional neural network (CNN); see [17] for a definition of CNNs in Banach spaces. In a similar manner, one could define more general feed forward networks in Hilbert spaces, for example following the notion of [24] in the finite dimensional case.
We are now able to formally define the concept of a null space network.

Definition 3.2 (Null space network). A function L : X → X is called a null space network if it has the form L = Id_X + (Id_X − A⁺A)N, where N : X → X is any Lipschitz continuous neural network function.
An example is the null space network L = Id_X + (Id_X − A⁺A)N with N : X → X as defined in (3.1). We again point out, however, that N could be a more general Lipschitz continuous neural network function, for which the results below hold equally. For the sake of clarity of presentation, we use the simple definition 3.1 of layered neural networks.
An example of a standard residual network Id_X + N and a layered null space network Id_X + (Id_X − A⁺A)N, both with depth L = 2 (i.e. two weight layers), is shown in figure 2.

Remark 3.3.
Throughout the following we assume that L = Id_X + (Id_X − A⁺A)N is a given null space network. Following the deep learning philosophy, the network would be selected from a parameterized family (L_θ)_{θ∈Θ} based on given training data. A possible training strategy is presented in the appendix. The results below hold for any null space network whose Lipschitz constant is finite. It is widely accepted that the Lipschitz constant of typically trained networks is reasonably small. As the error constant depends on the Lipschitz constant of the network, it is desirable to keep it small. The proposed training strategy also accounts for this issue for the layered neural networks according to definition 3.1.
Another simple way of constructing a null space network is to add a data consistency layer to an existing network. To be specific, let L_0 : X → X be any trained network. Then one obtains a null space network by considering

L := Id_X + (Id_X − A⁺A)(L_0 − Id_X).

Moreover, one can easily show that ‖x − L(A⁺Ax)‖ ≤ ‖x − L_0(A⁺Ax)‖ for every x ∈ X. Hence, in terms of the reconstruction error for recovering x from data y = Ax, the null space network is always at least as good as the original network.
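Both the data consistency of the wrapped network and the error inequality can be checked numerically. In this sketch, L_0 is an arbitrary stand-in for a pre-trained network and the operator is an illustrative random matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 6))
A_pinv = np.linalg.pinv(A)
P_ker = np.eye(6) - A_pinv @ A

def L0(x):
    """Stand-in for an arbitrary pre-trained network."""
    return np.tanh(x) + x

def L(x):
    """Data-consistent wrap: L = Id + P_ker(A)(L0 - Id)."""
    return x + P_ker @ (L0(x) - x)

x = rng.standard_normal(6)
z = A_pinv @ A @ x            # z = A^+ A x, the visible part of x
err_wrapped = np.linalg.norm(x - L(z))
err_plain = np.linalg.norm(x - L0(z))
```

The inequality holds because x − L(z) = P_ker(A)(x − L_0(z)), and projecting can only shrink the norm; in addition, A L(z) = A z by construction.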

Convergence and convergence rates
Let L = Id_X + (Id_X − A⁺A)N be a null space network, possibly trained as described in the appendix by approximately minimizing (A.1). Any such network belongs to the class of functions Id_X + Φ by taking Φ = (Id_X − A⁺A)N. Consequently, the convergence theory of section 2 applies. In particular, theorem 2.4 shows that a regularization (B_α)_{α>0} of the Moore-Penrose generalized inverse defines a Φ-regularization method via R_α := (Id_X + (Id_X − A⁺A)N) ∘ B_α. Additionally, theorem 2.8 yields convergence rates for the regularization (R_α)_{α>0} of A_Φ.
In some cases, the projection P_ker(A) = Id_X − A⁺A might be costly to compute exactly. For that purpose, in this section we derive more general regularization methods that allow approximate evaluations of A⁺A.

Theorem 3.4. Let L = Id_X + (Id_X − A⁺A)N be a null space network and set M := ran(L). Suppose ((B_α)_{α>0}, α∗) is a regularization method for Ax = y, and let (Q_α)_{α>0} be a family of bounded operators on X with ‖Q_α − P_ker(A)‖ → 0 as α → 0. Then the pair ((R_α)_{α>0}, α∗) with

R_α := (Id_X + Q_α N) ∘ B_α

is a Φ-regularization method for Ax = y. In particular, the family (R_α)_{α>0} is a regularization of A_Φ.

Proof. We have

‖R_α(y^δ) − (Id_X + P_ker(A) N)(B_α(y^δ))‖ ≤ ‖Q_α − P_ker(A)‖ ‖N(B_α(y^δ))‖, (3.4)

and the right-hand side vanishes as α → 0. The claim therefore follows from theorem 2.4 applied to Φ = P_ker(A) N. □

Theorem 3.5. In the situation of theorem 3.4, suppose additionally that the assumptions of theorem 2.8 hold and that ‖Q_{α∗(δ,y^δ)} − P_ker(A)‖ = O(δ^{2µ/(2µ+1)}). Then the parameter choice α∗ ≍ (δ/ρ)^{2/(2µ+1)} yields the convergence rate ‖R_{α∗(δ,y^δ)}(y^δ) − x‖ = O(δ^{2µ/(2µ+1)}) for x ∈ M_{µ,ρ,Φ}.

Proof. Follows from the estimate (3.4) with B_α = g_α(A*A)A* and theorem 2.8. □

One might use Q_α = Id_X − B_{φ(α)}A for some function φ : [0, ∞) → [0, ∞), since B_{φ(α)}A provides a possible approximation to P_{ker(A)^⊥} = A⁺A. In such a situation, one can use existing software packages (for example, for the filtered backprojection algorithm and the discrete Radon transform in the case of computed tomography) for evaluating B_{φ(α)} and A.
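The convergence ‖Q_α − P_ker(A)‖ → 0 for the choice Q_α = Id − B_α A with Tikhonov-regularized B_α can be observed directly. The following numpy sketch is illustrative; the operator and the α-values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 6))
P_ker = np.eye(6) - np.linalg.pinv(A) @ A

def Q(alpha):
    """Q_alpha = Id - B_alpha A with Tikhonov B_alpha; approximates P_ker(A) as alpha -> 0."""
    B = np.linalg.solve(A.T @ A + alpha * np.eye(6), A.T)
    return np.eye(6) - B @ A

# Spectral-norm distance to the exact projector for decreasing alpha:
errs = [np.linalg.norm(Q(a) - P_ker, 2) for a in (1e-2, 1e-5, 1e-8)]
```

On each singular direction with singular value s, the deviation equals α/(s² + α), so the distance decreases monotonically to zero as α → 0 whenever the range is closed (as in finite dimensions).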

Conclusion
In this paper, we introduced the concept of null space networks that have the form L = Id X + (Id X − A + A)N, where N is any neural network function (for example a deep convolutional neural network) and Id X − A + A = P ker(A) is the projector onto the kernel of the forward operator A : X → Y of the inverse problem to be solved. The null space network shares similarity with a residual network that takes the general form Id X + N. However, the introduced projector Id X − A + A guarantees data consistency which is an important issue when solving inverse problems.
The null space networks are special members of the class of functions Id X + Φ that satisfy ran(Φ) ⊆ ker(A). For this class, we introduced the concept of Φ-generalized inverse A Φ and Φ-regularization as point-wise approximations of A Φ on dom(A + ). We showed that any classical regularization (B α ) α>0 of the Moore-Penrose generalized inverse defines a Φ-regularization method via (Id X + Φ)B α . In the case of null space networks where Φ = (Id X − A + A)N, we additionally derived convergence results using only approximation of the projection operator P ker(A) . Additionally, we derived convergence rates using either exact or approximate projections.
To the best of our knowledge, the obtained convergence and convergence rates are the first regularization results for solving inverse problems with neural networks. In future work, the null space networks should be numerically tested on typical inverse problems, such as limited data problems in CT or deconvolution, and their performance compared with standard residual networks, iterative networks and variational networks.

Acknowledgment
The work of MH and SA has been supported by the Austrian Science Fund (FWF), project P 30747-N32.

Appendix. Possible network training
We may train the null space network L = Id_X + (Id_X − A⁺A)N to (approximately) map the elements A⁺Ax_n to the desired training phantoms x_n. For that purpose, fix training phantoms x_1, …, x_N and consider the error functional

E(N) := ½ Σ_{n=1}^N ‖x_n − L(A⁺Ax_n)‖² + µ Σ_{ℓ=1}^L ‖L_ℓ‖, (A.1)

where N is of the form (3.1), L_ℓ is the linear part of the affine mapping W_ℓ, and µ ≥ 0 is a regularization parameter. Network training aims at making E(N) small, for example by gradient descent. Clearly, the product Π_{ℓ=1}^L ‖L_ℓ‖ is an upper bound on the Lipschitz constant of N (provided the σ_ℓ are 1-Lipschitz), and by the inequality of arithmetic and geometric means the sum Σ_{ℓ=1}^L ‖L_ℓ‖ essentially bounds this product. Therefore, the Lipschitz constant of the finally trained network will stay reasonably small. Alternatively, one can directly use the product Π_{ℓ=1}^L ‖L_ℓ‖ as regularization term in (A.1), but using the sum seems more in line with standard practice.
Note that it is not required that (A.1) is exactly minimized. Any trained network for which ½ Σ_{n=1}^N ‖x_n − (Id_X + (Id_X − A⁺A)N)(A⁺Ax_n)‖² is small yields a null space network Id_X + (Id_X − A⁺A)N that does, at least on the training set, a better job in estimating x_n from A⁺Ax_n than the identity.
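The training problem becomes particularly transparent when N is linear: minimizing the data term of (A.1) is then a least-squares problem. The following numpy sketch uses this simplification (and omits the Lipschitz penalty for brevity); the training set, operator sizes, and the linear ansatz N(x) = Vx are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 6))
A_pinv = np.linalg.pinv(A)
P_ker = np.eye(6) - A_pinv @ A

# Hypothetical training set of desired outputs x_n (columns of X).
X = rng.standard_normal((6, 40))
Z = A_pinv @ A @ X                # network inputs A^+ A x_n

# Linear ansatz N(x) = V x: minimizing sum_n ||x_n - (Id + P_ker V)(A^+ A x_n)||^2
# reduces to the least-squares problem V Z ~ X - Z (note X - Z lies in ker(A)).
Vt, *_ = np.linalg.lstsq(Z.T, (X - Z).T, rcond=None)
V = Vt.T

def L(x):
    """Trained null space network L = Id + P_ker(A) N with N(x) = V x."""
    return x + P_ker @ (V @ x)

err_L = np.linalg.norm(X - L(Z))  # training error of the null space network
err_id = np.linalg.norm(X - Z)    # training error of the identity (N = 0)
```

Since N = 0 is feasible, the fitted network is never worse than the identity on the training set, in line with the remark above.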
Alternatively, we may train a regularized null space network Id_X + (Id_X − B_α A)N to map the regularized reconstructions B_α Ax_n (instead of A⁺Ax_n) to the outputs x_n. This yields the modified error functional

E_α(N) := ½ Σ_{n=1}^N ‖x_n − (Id_X + (Id_X − B_α A)N)(B_α Ax_n)‖² + µ Σ_{ℓ=1}^L ‖L_ℓ‖.

Trying to minimize E_α may be beneficial in the case that many singular values of A are small but do not vanish exactly. The regularized inverse B_α might be defined by truncated SVD or Tikhonov regularization.