Invertible residual networks in the context of regularization theory for linear inverse problems

Learned inverse problem solvers exhibit remarkable performance in applications like image reconstruction tasks. These data-driven reconstruction methods often follow a two-step procedure. First, one trains the reconstruction scheme, which is often based on neural networks, on a dataset. Second, one applies the scheme to new measurements to obtain reconstructions. We follow these steps but parameterize the reconstruction scheme with invertible residual networks (iResNets). We demonstrate that the invertibility enables us to investigate the influence of training and architecture choices on the resulting reconstruction scheme. For example, assuming local approximation properties of the network, we show that these schemes become convergent regularizations. In addition, the investigations reveal a formal link to the linear regularization theory of linear inverse problems and provide a nonlinear spectral regularization for particular architecture classes. On the numerical side, we investigate the local approximation property of selected trained architectures and present a series of experiments on the MNIST dataset that underpin and extend our theoretical findings.


Introduction
In inverse problems, one aims to recover underlying causes from measurements by reversing the forward measurement process. Naturally, such problems arise in several fields, e.g., medical imaging, non-destructive testing, and PDE-based models. Inverse problems are characterized by their ill-posed nature, i.e., reversing the measurement process is discontinuous, and obtaining reasonable reconstructions requires a stable reconstruction scheme. Usually, one defines the forward problem via a possibly nonlinear forward operator A : X → Y mapping between Banach or Hilbert spaces. Linear spectral regularizations for linear problems in Hilbert space settings were already well studied over two decades ago [17]. Also, more general nonlinear regularizations, e.g., sophisticated variational and iterative approaches, have been the subject of extensive investigations for linear and nonlinear problems. We refer to the review [9] for a more general overview of this development. More recently, the linear spectral regularization approach was generalized to diagonal frames [15]. Frames provide larger flexibility by allowing advantageous representations of the data, e.g., via a set of images, instead of solely the singular functions provided by the operator. Nonlinear spectral regularizations with a nonlinear dependence on the measurement data have only been considered in a few works, e.g., in the context of CG methods [17, Cor. 7.4].
In the present work, we follow a general supervised learning approach for a reconstruction scheme based on the concept of invertible residual networks [8]. We use this architecture to provide a nonlinear regularization scheme for which we develop a general convergence theory by exploiting a local approximation property of the underlying network. Furthermore, instead of solving the inverse problem directly, we use a training strategy that aims to approximate the forward operator. We also introduce an architecture type acting on the singular function directions and show that it provides a data-dependent nonlinear spectral regularization, which we analyze theoretically for specific shallow architectures and link to the established linear spectral regularization theory. In addition, we underpin our theoretical findings with several numerical experiments.
The manuscript is structured as follows: Section 2 defines the problem setting and the theoretical framework. We then investigate general regularization properties in Section 3. In Section 4, we discuss specific architectures and the outcome of a training approach aimed at approximating the forward operator. We then relate it to the classical filter-based regularization theory. Following the theoretical investigation, Section 5 presents a series of numerical experiments. We conclude with a discussion and outlook in Section 6.

Related work
Empirically, the success of learning-based methods has been established for several applications, e.g., various imaging modalities. Theoretically, there is still a gap, e.g., regarding the convergence properties of such reconstruction schemes. Recently, an increasing number of works have examined this aspect. We recommend the excellent survey article [32], which illustrates regularization properties of some learned methods [32, Fig. 3].
One early example providing a convergent learned postprocessing approach is the deep null space network [41]. It utilizes established regularization methods and turns them into learned variants. While it replaces the concept of a minimum norm solution, it inherits convergence guarantees. The work [5] investigates the convergence of regularization methods trained on a finite set of training samples without access to the entire forward operator. Specifically, the authors consider the projections onto the subspaces spanned by the dataset. Assumptions on the dataset ensure the approximation properties required for the convergence analysis of the regularization method. The authors then also consider the data-dependent projected problem in a variational Tikhonov-type framework and derive convergence of the regularization by assuming that certain approximation properties of the projection are fulfilled for the desired solution.
Tikhonov-type/variational methods and their classical regularization theory have also been used to obtain learned regularizations with convergence guarantees, in particular for learned convex regularizers [31, 33] and the network Tikhonov (NETT) approach [26, 35].
Minimization of Tikhonov-type functionals is often performed by solving an equilibrium equation, e.g., obtained from first-order optimality conditions. This concept is generalized in the recent work [34], where a learned operator is introduced in the equilibrium equation, replacing, for example, the component delivered by the regularizer. The authors formulate suitable assumptions on the learnable operator to guarantee convergence, and they also illustrate that residual networks fulfill the desired properties if the residual part of the network is contractive. The authors of [16] recently suggested convergence results for plug-and-play reconstructions that are conceptually motivated by the Tikhonov theory. They guarantee convergence by assuming the required conditions on their parameterized family of denoising operators.
The deep image prior [42] proposed exploiting a network's training and architecture as a regularization to achieve better image representation. Based on this concept of unsupervised learning from one single measurement, the authors of [14] derived an analytic deep prior framework formulated as a bilevel optimization problem. This allowed them to verify regularization properties for particular cases by exploiting relations to the classical filter theory of linear regularizations. More recently, [3] investigated the equivalence between Ivanov regularization and analytic deep prior, included early stopping, and proved regularization properties.
An earlier work [12] proposed a data-driven approach to a learned linear spectral regularization; more recently, [7, 22] considered convergence aspects. In [22], the authors learn a scalar for each singular function direction in a filter-based reconstruction scheme, i.e., a linear regularization scheme. Due to the assumed properties of the noise and data distributions, their training outcome is equivalent to a standard Tikhonov regularization with a data- and noise-dependent linear operator included in the penalty term, which is diagonal with respect to the system of singular functions. The authors further verify convergence results with respect to the data- and noise-dependent regularization parameters.

Problem setting - theoretical framework
We consider linear inverse problems based on the operator equation Ax = y (2.1), where A : X → Y is a linear and bounded operator between Hilbert spaces X and Y. For simplicity, ∥A∥ = 1 is assumed, which can easily be obtained by a scaling of the operator. We aim to recover the unknown ground truth x† as well as possible while only having access to a noisy observation y^δ ∈ Y such that ∥y^δ − Ax†∥ ⩽ δ, where δ > 0 is the noise level. Instead of solving (2.1) directly, we define A = A*A and z^δ = A*y^δ to get the operator equation Ax = z^δ (2.2), which only acts on X, i.e., we consider the normal equation with respect to A. We propose a two-step data-based approach for solving (2.2), resp. (2.1), which we refer to as the iResNet reconstruction approach: (I) Supervised training of an invertible neural network φ_θ : X → X to approximate A.
(II) Using φ_θ^{-1} (or φ_θ^{-1} ∘ A*, respectively) to solve the inverse problem. In general, the supervised training covers both cases, either training on noise-free data tuples (x^(i), Ax^(i)) or on noisy tuples (x^(i), Ax^(i) + η^(i)) with noise η^(i) for a given dataset {x^(i)}_{i∈{1,...,N}} ⊂ X. To guarantee invertibility, we use a residual network of the form φ_θ(x) = x − f_θ(x) and restrict f_θ to be Lipschitz continuous with Lip(f_θ) ⩽ L < 1. We refer to this architecture, which is illustrated in Figure 1, as an invertible residual network (iResNet) [8]. Note that [8] considers concatenations of several of these invertible blocks, while we focus on a single residual block. The essential properties of iResNets (like how they can be inverted) are summarized in the following lemma.
(iii) For given z ∈ X, the unique x = φ_θ^{-1}(z) is given by x = lim_{k→∞} x_k, where x_{k+1} = z + f_θ(x_k) for an arbitrary and given x_0 ∈ X.
Proof. The idea of the proof is taken from [8, Section 2]. The first assertion follows directly from the definition of φ_θ. Assertion (iii) follows from Banach's fixed-point theorem: since f_θ is contractive, the iteration converges to a unique fixed point. This fixed point x fulfills (2.8). Finally, assertion (ii) follows by replacing φ_θ(x_i) with z_i and x_i with φ_θ^{-1}(z_i) for i = 1, 2. Besides, we assume the existence of a singular value decomposition (SVD) (u_j, v_j, σ_j)_{j∈N} of the operator such that (2.10) holds. This is guaranteed for all compact operators. With respect to A, the vectors v_j are eigenvectors, and σ_j² are the corresponding eigenvalues. The system {v_j}_{j∈N} builds an orthonormal basis of N(A)^⊥.
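The fixed-point iteration from assertion (iii) is straightforward to implement. The following sketch inverts φ_θ(x) = x − f_θ(x) via x_{k+1} = z + f_θ(x_k); the residual function f used here is a hypothetical toy choice for illustration, not one of the trained architectures discussed later.

```python
import numpy as np

def invert_iresnet(f, z, x0=None, tol=1e-10, max_iter=1000):
    """Invert phi(x) = x - f(x) via the Banach fixed-point iteration
    x_{k+1} = z + f(x_k), which converges whenever Lip(f) <= L < 1."""
    x = np.zeros_like(z) if x0 is None else x0
    for _ in range(max_iter):
        x_next = z + f(x)
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# toy residual with Lip(f) = 0.9 < 1 (tanh is 1-Lipschitz)
f = lambda x: 0.9 * np.tanh(x)
phi = lambda x: x - f(x)

x_true = np.array([0.3, -1.2, 2.0])
x_rec = invert_iresnet(f, phi(x_true))   # recovers x_true up to tol
```

Note that the contraction factor governs both invertibility and the speed of the iteration: the closer Lip(f) is to 1, the more iterations are needed, mirroring the increasing ill-conditioning of the regularized inverse as L → 1.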

Regularization properties of the iResNet reconstruction
In this section, we discuss regularization properties in terms of well-definedness, stability, and convergence of the general reconstruction approach defined by the family of operators T_L = φ_{θ,L}^{-1} ∘ A* : Y → X for L ∈ [0, 1). The bound L on the Lipschitz constant of the residual function undertakes the role of the regularization parameter via L → 1, as will be made apparent in this section.

General architecture and local approximation property
To highlight the essential dependence of the trained network on L, we write φ_θ,L and f_θ,L in the following. Please note that the set of possible network parameters and the underlying architecture might also depend on L. We thus consider a network parameter space Θ(L) depending on L.

Lemma 3.1 (Well-definedness and stability). For L ∈ [0, 1) and θ ∈ Θ(L), let f_θ,L : X → X be such that Lip(f_θ,L) ⩽ L and φ_θ,L(x) = x − f_θ,L(x). Then, the reconstruction scheme T_L = φ_θ,L^{-1} ∘ A* : Y → X is well-defined and Lipschitz continuous.
While well-definedness and continuity (stability) immediately follow from the construction of the reconstruction method T_L, the convergence property requires a specific approximation property, which we must guarantee during training. We do so by introducing an index function that generalizes the concept of convergence rates (see, e.g., [40]).

Definition 3.1. An index function is a mapping ψ : R_⩾0 → R_⩾0 which is continuous, strictly increasing, and satisfies ψ(0) = 0.

Now, we can formulate a convergence result for the reconstruction T_L(y^δ) for δ → 0 and a suitable a priori parameter choice L(δ).
Theorem 3.1 (Convergence - local approximation property). Let x† ∈ X be a solution of the problem Ax = y for y ∈ R(A), and let y^δ ∈ Y fulfill ∥y^δ − y∥ ⩽ δ. Furthermore, let the network parameters θ(L) ∈ Θ(L) for L ∈ [0, 1) be obtained in such a way that the local approximation property (3.1) holds for some index function ψ.
Remark 3.2. Note that the local approximation property (3.1) is a slightly stronger assumption, which is derived by exploiting the properties of the iResNet to remove the explicit dependency on the inverse within the constraint. A weaker but sufficient condition is implied by (3.1). The local approximation property (3.1) provides a relation between the approximation capabilities of the network architecture and the regularization properties of the iResNet reconstruction scheme, and it also contains a certain kind of source condition. Note that the previous statement differs from common convergence results, which are stated not locally but for a broader range of solutions, i.e., for any y ∈ R(A).
Here, the locality is strongly linked to the approximation capabilities of the underlying network, which can imply certain limitations. Thus, of particular interest for the convergence properties is the approximation property set S = {x ∈ X | ∃ index function ψ : x fulfills (3.1)}. (3.7) No specific assumptions on the design of the training of the network φ_θ are made by (3.1). But since both the architecture choice and the training procedure are crucial for the resulting network parameters θ(L), they also have a strong impact on the local approximation property and the structure of S. This is discussed in more detail in the context of the specific shallow architectures in Section 4. In addition, we report numerical experiments regarding the local approximation property in Section 5.
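A minimal sketch of how such a property can be probed numerically: we use a diagonal linear toy network with weights w_j = min{1 − σ_j², L} as a hypothetical stand-in for a trained φ_{θ(L),L} (the exact constants and the index function in (3.1) are omitted here) and monitor the approximation error ‖φ_{θ(L),L}(x†) − Ax†‖, which should decay as L → 1.

```python
import numpy as np

def approx_error(L, sigma2, x):
    """Approximation error ||phi_{theta(L),L}(x) - A x|| for a diagonal
    linear toy network acting as x_j -> max(sigma_j^2, 1 - L) * x_j,
    while A = A*A acts as x_j -> sigma_j^2 * x_j."""
    w = np.minimum(1.0 - sigma2, L)   # hypothetical trained weights
    phi_x = (1.0 - w) * x
    return np.linalg.norm(phi_x - sigma2 * x)

sigma2 = np.array([0.8, 0.3, 0.05, 0.001])   # eigenvalues of A = A*A
x = np.ones_like(sigma2)
errors = [approx_error(L, sigma2, x) for L in (0.9, 0.99, 0.999)]
# the error decays monotonically as the Lipschitz bound L approaches 1
```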
A simple example illustrating the conditions under which a linear network satisfies the local approximation property is given in the following remark.
Remark 3.3 (Local approximation property for linear networks). Assume that the network φ_θ,L is linear and that the prediction error is bounded by ∥φ_{θ(L(δ)),L(δ)}(x^(i)) − A*y^{δ,(i)}∥ ⩽ ζ(δ) with ζ(δ) ∈ R on a dataset (x^(i), y^{δ,(i)})_{i=1,...,N} with ∥Ax^(i) − y^{δ,(i)}∥ ⩽ δ. Then, for x† given by x† = Σ_{i=1}^N t_i x^(i) with t_i ∈ R, we have (3.8). As a result, Theorem 3.1 implies convergence if ζ(δ) fulfills (3.9) and L(δ) satisfies the conditions of (3.2). This example shows that in the case that the reconstruction error can be bounded by ζ(δ) satisfying (3.9), the convergence directly translates to the linear span of the training data. Consequently, for test data similar to the training data, a simple criterion that can be numerically validated to ensure convergence to the ground truth is obtained.
Similar to a source condition, (3.1) also influences the type of solution x† one approximates in the limit δ → 0 if the operator A has a non-trivial kernel. For example, if R(f_{θ(L),L}) = N(A)^⊥, the local approximation property enforces P_{N(A)} x† = 0, i.e., together with the approximation of A*A, one would consider the minimum norm solution. More generally, we can formulate the following property of S.

Lemma 3.2. Let x†_1, x†_2 ∈ S both be solutions of Ax = y for one y ∈ R(A). Then it holds that x†_1 = x†_2.

Proof. For abbreviation, we write f_L and φ_L instead of f_{θ(L),L} and φ_{θ(L),L}. Since x†_1 and x†_2 both fulfill the local approximation property (3.1), there exists an index function ψ such that

Diagonal architecture
In order to investigate relations to established and well-studied spectral regularization methods, we continue with the introduction of a particular network design, which is exploited in the remainder of the manuscript. The idea of spectral regularization is to decompose the data using the SVD (u_j, v_j, σ_j)_{j∈N} of A and to apply filter functions to the singular values. Analogously, we consider a network architecture consisting of subnetworks in which all the components ⟨x, v_j⟩ of the data are processed separately. We refer to this architecture as a diagonal architecture, which is also illustrated in Figure 2. The naming is motivated by the similarity of the structure of f_θ to a multiplication with a diagonal matrix w.r.t. the basis {v_j}_{j∈N}. This architecture choice is, of course, less expressive than general architectures, but it has the useful benefit that the iResNet can be analyzed like a filter-based regularization scheme. Besides, it enables us to simplify the infinite-dimensional inverse problem Ax = y to 1D subproblems of the form σ_j⟨x, v_j⟩ = ⟨y, u_j⟩, which will be exploited in Section 4.
In the case of a diagonal architecture, the Lipschitz constraint Lip(f_θ) ⩽ L translates to a corresponding constraint on each subnetwork. The local approximation property (3.1), in this case for x† ∈ N(A)^⊥, implies a componentwise condition for any j ∈ N, where x†_j = ⟨x†, v_j⟩. Note that for infinite-dimensional operators, the constants in O((1 − L)ψ(1 − L)) must form an ℓ² sequence with respect to j to obtain the implication in the opposite direction and thus equivalence.
The particular aim of using the diagonal architecture is now to find a filter function that defines a spectral regularization scheme equivalent to T_L = φ_θ^{-1} ∘ A*. Since filter-based regularization methods are usually linear and we also want to allow for nonlinear architectures, our filter function r_L must be data-dependent. Therefore, we define r_L : R × R → R such that (3.15), resp. (3.16), holds. The first argument of r_L is the singular value (as usual), and the data-dependency comes via the second argument. This can be seen as a nonlinear extension of the established filter theory of linear regularization methods [38, 27], where r_L would depend on σ_j² only. The subnetworks thus play an important role in defining the filter functions. For this to be well-defined, we need to assume that all singular values σ_j > 0 have a multiplicity of one. In the case of σ_j = σ_k for j ≠ k (multiplicity greater than one), one must ensure that f_θ,L,j and f_θ,L,k also coincide. In practice, this could be realized by sharing the weights of these networks.
Figure 2: Diagonal structure of the residual function including subnetworks f_θ,j : R → R acting on x_j = ⟨x, v_j⟩.
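Numerically, the data-dependent filter values can be read off a diagonal iResNet by inverting each scalar subnetwork with the fixed-point iteration from Section 2. The sketch below uses a linear subnetwork f_j(x) = w_j x with w_j = min{1 − σ_j², L}; this particular weight choice anticipates the trained architectures of Section 4 and serves only as an illustration.

```python
import numpy as np

def invert_scalar(f_j, s, tol=1e-12, max_iter=100000):
    """Invert the 1D residual map phi_j(x) = x - f_j(x) at s by the
    fixed-point iteration x <- s + f_j(x) (converges for Lip(f_j) < 1)."""
    x = 0.0
    for _ in range(max_iter):
        x_next = s + f_j(x)
        if abs(x_next - x) < tol:
            break
        x = x_next
    return x_next

def filter_value(f_j, s):
    """Data-dependent filter value defined via r_L(sigma_j^2, s) * s
    = phi_j^{-1}(s), evaluated for s != 0."""
    return invert_scalar(f_j, s) / s

# linear subnetwork with w_j = min{1 - sigma_j^2, L}
L, sigma2 = 0.9, 0.25
w = min(1.0 - sigma2, L)
r = filter_value(lambda x: w * x, s=1.0)
# for this linear choice, phi_j^{-1}(s) = s / (1 - w_j), i.e. r = 1 / max(sigma2, 1 - L)
```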
Remark 3.4. Some authors define filter functions in terms of the original problem with respect to the generalized inverse A†. Note that the choice of the representation, either in terms of r_L or in terms of F_L, depends on personal preference (compare, for example, [38] and [27]). In the following, we represent the filter function in terms of r_L.
One simple nonlinear dependence on s will be taken into account explicitly in the following, as it becomes relevant in later investigations. In analogy to the bias in neural network architectures, we consider a linear filter framework with bias, i.e., T_L(y) = b̃_L + Σ_{j∈N} r̃_L(σ_j²) σ_j ⟨y, u_j⟩ v_j, (3.19) where b̃_L ∈ X is a bias term and r̃_L is a classical filter function. With additional simple assumptions on b̃_L, this becomes a regularization method. For the sake of completeness, we provide the following result.
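A reconstruction of the form (3.19) can be sketched in finite dimensions as follows; the matrices U, V hold the singular vectors as columns, and the concrete filter used in the sanity check is illustrative, not the paper's choice.

```python
import numpy as np

def filter_reconstruction(y, U, V, sigma, r, b=None):
    """T_L(y) = b_L + sum_j r_L(sigma_j^2) * sigma_j * <y, u_j> * v_j,
    a linear filter-based reconstruction with optional bias b_L."""
    coeffs = r(sigma**2) * sigma * (U.T @ y)
    x = V @ coeffs
    return x if b is None else x + b

# sanity check: with r(s) = 1/s the scheme is the (unregularized) inverse
U = V = np.eye(3)
sigma = np.array([1.0, 0.5, 0.2])
A = U @ np.diag(sigma) @ V.T
x_true = np.array([1.0, -2.0, 3.0])
x_rec = filter_reconstruction(A @ x_true, U, V, sigma, r=lambda s: 1.0 / s)
```

A regularizing filter would instead damp the small singular values, e.g., the Tikhonov filter r(σ²) = 1/(σ² + α).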

Approximation training of specialized architectures
Besides the architecture choice and the available data, the loss function for network training is an important ingredient. Having regularization properties in mind, a natural choice for training the iResNet is to approximate the forward operator A : X → X, A = A*A, on a given training dataset {x^(i)}_{i∈{1,...,N}} ⊂ X. This is also strongly motivated by the structure of the local approximation property (3.1) to obtain convergence guarantees. The training of φ_θ = Id − f_θ then consists of solving (4.1) with L < 1 as a hyperparameter, which we refer to as approximation training. Here, we restrict ourselves to the noise-free case, as the outcome of the approximation training is independent of the noise for N → ∞ (assuming stochastic independence between data and noise).
In the following, we analyze to which extent φ_θ^{-1} acts as a regularized inverse of A for different simple diagonal iResNet architectures trained according to (4.1). We compute the particular network parameters θ and derive the corresponding data-dependent filter function r_L : R × R → R (see Section 3.2). Additionally, we check whether the local approximation property (3.1) is fulfilled. For this purpose, the assumptions on the architecture as well as on the dataset used for the training must be specified.

One-parameter network - Tikhonov
We begin with a very simple linear network, which has only a single learnable scalar parameter. We show that this architecture is equivalent to Tikhonov regularization if trained appropriately.
Then, the solution of (4.1) is k = L, and (3.15) holds with the corresponding filter function for any s ∈ R.
The proof of the lemma can be found in Appendix A.1. The derived filter function corresponds to the Tikhonov method (with reference element), as one can see by considering the first-order optimality condition and using the fact that 1/L = 1 + α. It is illustrated in Figure 3. We plot σ² r_L(σ², s) since the multiplication of the filter function with σ² corresponds to the concatenation φ_θ^{-1} ∘ A, which emphasizes the regularizing effect.
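The Tikhonov correspondence can be checked numerically. Assuming the trained one-parameter network acts on the j-th singular direction such that φ_θ^{-1} multiplies by 1/(1 − L + Lσ_j²) (a plausible reading of the lemma, used here only for illustration), the identity 1/L = 1 + α yields exactly a rescaled Tikhonov filter with α = 1/L − 1:

```python
import numpy as np

L = 0.95
alpha = 1.0 / L - 1.0                      # from 1/L = 1 + alpha
sigma2 = np.array([1.0, 0.5, 0.1, 0.01])   # squared singular values

# per-direction factor of phi^{-1} (hypothetical form, see lead-in)
iresnet_factor = 1.0 / (1.0 - L + L * sigma2)
# (1/L)-rescaled Tikhonov filter 1/(sigma^2 + alpha)
tikhonov_factor = (1.0 / L) / (sigma2 + alpha)
# both coincide, since 1 - L + L*sigma^2 = L * (alpha + sigma^2)
```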
Remark 4.1. This particular choice of the residual function f_θ is one example where the local approximation property is not fulfilled for any x† ∈ X except for eigenvectors corresponding to σ_1² = 1, i.e., S = span{v_1}. But the linear nature and the specific spectral representation of the residual network allow for the verification of an error bound which converges to zero for L → 1.
Alternatively, convergence can be verified by standard arguments following, for example, the line of reasoning in the proof of [27, Thm. 3.3.3], where properties of the filter function F_L(σ_j) = σ_j² r_L(σ_j²) are exploited. Since F_L defines the filter function for Tikhonov regularization and is therefore known to fulfill the desired properties, we have convergence for arbitrary x† ∈ N(A)^⊥.

Affine linear network - squared soft TSVD
We continue with a slightly more general affine linear architecture f_θ(x) = Wx + b. It can be observed that the linear operator W : X → X only depends on A and L, while the bias vector b ∈ X is data-dependent.

Lemma 4.2. Let (v_j, σ_j²)_j be the eigenvectors and eigenvalues of A and φ_θ = Id − f_θ be an iResNet which solves (4.1), where (ii) the training dataset {x^(i)}_{i∈{1,...,N}} and the mean values satisfy suitable assumptions. Then, the training problem (4.1) admits an explicit solution.

The proof of the lemma can be found in Appendix A.2. The filter function is illustrated in Figure 4.
Remark 4.2. Note that a similar filter function has been found in the context of an analytic deep prior approach for regularization [14]. The authors derived the filter function F_L(σ) = σ² r̃_L(σ²), which is linear for small σ, and named it soft TSVD (soft truncated SVD). In contrast, the filter function in (4.8) is a polynomial of degree two for small σ, which we refer to as squared soft TSVD.
Figure 4: The plot depicts graphs of σ² r̃_L(σ²) as given in (4.8) to visualize the effect of the regularization, which is similar to the known soft TSVD regularization.
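To make the shape of the filter concrete: under the natural reading that the trained φ_θ acts as multiplication by max{σ_j², 1 − L} on the j-th direction (an assumption consistent with the discussion above, not a verbatim restatement of (4.8)), the effective filter on the data becomes a quadratic ramp clipped at one:

```python
import numpy as np

def squared_soft_tsvd_effective(sigma2, L):
    """Effective filter sigma^2 * r_L(sigma^2) = sigma^2 / max(sigma^2, 1 - L),
    assuming phi acts as multiplication by max(sigma_j^2, 1 - L) per direction."""
    return sigma2 / np.maximum(sigma2, 1.0 - L)

L = 0.9
sigma2 = np.array([1.0, 0.5, 0.1, 0.05, 0.01])
eff = squared_soft_tsvd_effective(sigma2, L)
# quadratic ramp sigma^2 / (1 - L) below sigma^2 = 1 - L, clipped at 1 above
```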
Due to the affine linear nature of the found reconstruction scheme, regularization properties can immediately be derived by verifying certain properties of the filter function r̃_L and the bias b̃_L using Lemma 3.3. This is summarized in the following corollary.

The previous result provides a link to the classical theory of linear regularization operators. Beyond the well-known classical theory, we further want to investigate the local approximation property (3.1) to verify whether the convergence property can also be obtained using Theorem 3.1. As a first step, we derive a technical source condition for x† which implies the local approximation property.

Lemma 4.3. Let the network φ_θ and the setting be that of Lemma 4.2. Assume x† ∈ N(A)^⊥ and let ψ be an index function such that ∃ β̄ ∈ (0, 1] : ∀ β ∈ (0, β̄) : (4.9) holds. Then there exists another index function ψ̄ such that the local approximation property (3.1) is fulfilled for x†.
Proof. With the setting of Lemma 4.2 and the trained parameters (W, b) of the network f_θ, the error in the local approximation property decomposes into two parts. We now estimate the convergence order of both parts w.r.t. (1 − L) → 0.
For the first part, we exploit simple computations and, using assumption (4.9), obtain the desired rate. Regarding the second part, since the values µ_j form by definition (see Lemma 4.2) an ℓ² sequence, there exists an index function ψ̄ ⩾ ψ such that the required bound holds; thus, the local approximation property (3.1) is fulfilled.
While (4.9) is quite a technical source condition, standard source conditions of the form ∃ w ∈ X, µ > 0 : x† = A^µ w imply it. In this case, the index function ψ is of the form ψ(β) = β^µ. The proof of the following corollary reveals the exact relation between standard source conditions and (4.9).
Corollary 4.2. We assume that the setting of Lemma 4.2 holds. Let A be compact. Then, for any x† ∈ N(A)^⊥ the local approximation property (3.1) is fulfilled (i.e., S = N(A)^⊥).
Proof. Let x† ∈ N(A)^⊥ be arbitrary. By construction, A = A*A is self-adjoint and non-negative. As A is also compact, we can deduce from [29, Thm. 1] that for any ε > 0 there exists an index function ψ such that x† ∈ R(ψ(A)). This implies that there exists a w ∈ X such that x† = ψ(A)w. We thus obtain the desired estimate for β > 0.

The verification of regularization properties by Theorem 3.1 in terms of the local approximation property is thus in line with the existing theory exploited in Lemma 3.3, and it further illustrates the character of the local approximation property, combining the approximation capabilities of the residual network and a source condition on the desired solution.

Linear network with ReLU activation
In the following, we include selected nonlinearities as activation functions in a shallow network, allowing for an analytical investigation of the resulting reconstruction scheme. Here, we start with a ReLU activation, which requires a certain assumption on the training dataset depending on the chosen nonlinearity. For simplicity, we do not include a bias in the architecture, in contrast to the architecture from the last section. (ii) For every eigenvector v_j of A, the training dataset {x^(i)}_{i∈{1,...,N}} contains at least one x^(i) s.t.
Then, the solution of (4.1) is W = Σ_{j∈N} w_j ⟨·, v_j⟩ v_j with w_j = min{1 − σ_j², L}, and (3.15) holds with the corresponding filter function. The proof of the lemma can be found in Appendix A.3. The obtained filter function is now characterized by a varying behavior depending on the actual measurement y. This is expressed via the variable s, which represents the coefficients ⟨z, v_i⟩ = σ_i⟨y, u_i⟩ (see (3.15), (3.16)). Whenever the coefficient is positive, the reconstruction scheme behaves like the squared soft TSVD without bias discussed in Section 4.2, i.e., the two schemes share the same filter function in these directions. In all other cases, the reconstruction method does not change the coefficients of the data, i.e., it behaves like the identity. Due to the relationship to the squared soft TSVD, we can immediately specify those x† fulfilling the local approximation property. Thus, the nonlinearity in the network architecture introduces restricted approximation capabilities as well as convergence guarantees on a limited subset of X only.
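The sign-dependent behavior described above can be summarized in a small sketch; the case distinction follows the text, and the concrete form 1/max{σ², 1 − L} for positive coefficients is the squared soft TSVD filter without bias, used here as an assumption.

```python
def relu_iresnet_filter(sigma2, s, L):
    """Data-dependent filter of the ReLU architecture (sketch):
    positive coefficients s are filtered like squared soft TSVD,
    non-positive ones pass through unchanged (identity)."""
    if s > 0:
        return 1.0 / max(sigma2, 1.0 - L)   # squared soft TSVD branch
    return 1.0                              # identity branch

r_pos = relu_iresnet_filter(0.25, s=1.0, L=0.9)   # 1 / max(0.25, 0.1) = 4.0
r_neg = relu_iresnet_filter(0.25, s=-1.0, L=0.9)  # identity: 1.0
```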

Linear network with soft thresholding activation
At last, we want to analyze a network with soft thresholding activation, e.g., known from LISTA [18]. This function is known to promote sparsity since it shrinks all coefficients and sets those below a certain threshold to zero. That is why only training data with sufficiently large coefficients matter for the result of the training, and the condition |⟨x, v_j⟩| > α_j/L is crucial in the following lemma.

Lemma 4.5. Let (v_j, σ_j²)_j be the eigenvectors and eigenvalues of A, α = (α_j)_j with α_j ⩾ 0, and φ_θ = Id − f_θ be an iResNet which solves (4.1) with
(i) f_θ(x) = ϕ_α(Wx), where θ = W, Θ = {W ∈ L(X) | ∃(w_j)_{j∈N} : W = Σ_{j∈N} w_j ⟨·, v_j⟩ v_j} and ϕ_α is the soft thresholding function w.r.t. the eigenvectors, i.e., ϕ_α(x) = Σ_{j∈N} sign(⟨x, v_j⟩) max(0, |⟨x, v_j⟩| − α_j) v_j,
(ii) the training dataset is {x^(i)}_{i∈{1,...,N}}, where for any j ∈ N : ∃ i : |⟨x^(i), v_j⟩| > α_j/L.
Then, a solution of (4.1) is W = Σ_{j∈N} w_j ⟨·, v_j⟩ v_j, w_j = min{…}. For singular values σ_j² = 1, w_j is not uniquely determined.

The proof of the lemma can be found in Appendix A.4. It follows the same line of reasoning as in the previous sections but is more technical due to the effects of the nonlinearity.
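For reference, the componentwise soft thresholding operation from (i) can be written as follows (here acting on coefficient vectors in the eigenbasis, with a scalar or per-component threshold):

```python
import numpy as np

def soft_threshold(x, alpha):
    """Componentwise soft thresholding: shrinks each coefficient by alpha
    and sets those with |x_j| <= alpha_j to zero (sparsity-promoting)."""
    return np.sign(x) * np.maximum(0.0, np.abs(x) - alpha)

# large coefficients are shrunk, small ones are set to zero
out = soft_threshold(np.array([2.0, -0.3, 0.5]), 0.5)
```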
So far, the filter function r_L in Lemma 4.5 is only defined on the discrete values σ_j² (and not continuously for σ² ∈ [0, 1]) since it depends on the coefficients p_L,j, α_j, and w_j. However, if we assume continuous extensions p_L(σ²) with p_L(σ_j²) = p_L,j, w_L(σ²) with w_L(σ_j²) = w_j, and α(σ²) with α(σ_j²) = α_j, the function r_L = r_L(σ², s) also becomes continuous. The continuity at the point |s| = α(σ²)/w_L(σ²) is assured by construction. To be able to interpret the filter function, suitable values for s representing the coefficients ⟨z, v_i⟩ = σ_i⟨y, u_i⟩ (see (3.15), (3.16)) need to be considered. One reasonable option is to choose s according to the data on which φ_θ has been trained. We thus consider an eigenvector v_j scaled with the coefficient p_L,j. Since φ_θ minimizes (4.1), we expect φ_θ(p_L,j v_j) ≈ p_L,j Av_j = p_L,j σ_j² v_j and φ_θ^{-1}(p_L,j σ_j² v_j) ≈ p_L,j v_j, respectively. Accordingly, we choose |s| proportional to p_L(σ²)σ², i.e., |s| = γ p_L(σ²)σ² for different values γ > 0. Hence, the case γ = 1 corresponds to a test vector z with coefficients |⟨z, v_j⟩| = |s| = p_L(σ_j²)σ_j², which perfectly fits the training data. Analogously, the cases γ < 1 (and γ > 1, respectively) correspond to test data whose coefficients are smaller (or bigger, respectively) than the average coefficients of the training data.
For γ = 1, the filter function r_L can be written in the form (4.22), which allows for an easier interpretation. The derivation can be found in Appendix A.5. Note that the filter function depends especially on the quotient of α(σ²) (soft thresholding parameter) and p_L(σ²) (property of the training data). To visualize the influence, we depict the graph in Figure 5 for two different (constant) values of α(σ²)/p_L(σ²). As can be seen, the graph of σ² ↦ σ² r_L(σ², p_L(σ²)σ²) has two kinks. The first one depends mainly on the choice of α(σ²)/p_L(σ²), the second one mainly on the choice of L.
For γ ≠ 1, the filter functions cannot be written in a similarly compact form as in (4.22). Instead, we illustrate them in Figure 6. In contrast to the case of γ = 1, the graph of σ² ↦ σ² r_L(σ², γ p_L(σ²)σ²) is not equal to one for the larger values of σ².
Finally, we want to analyze to which extent φ_θ fulfills the local approximation property (3.1). The illustration of the filter function in Figure 5 implies that convergence can only be possible if both kinks tend to zero. So, we need L → 1 and α → 0. Thus, the structure of the nonlinearity (soft thresholding) has a severe influence on the regularization properties of the reconstruction scheme. But in contrast to the ReLU architecture, the soft thresholding operator provides the opportunity to control its similarity to a linear operator via the coefficients α. In the following lemma, we discuss in more detail how α can be chosen depending on the regularization parameter L to obtain the desired regularization properties.

Figure 5: The choice s = p_L(σ²)σ² corresponds to the case that the test data is similar to the average training data. In this case, r_L shows quite similar behavior to the squared soft TSVD function.
Lemma 4.6. Let all assumptions of Lemma 4.5 hold. Further, assume that A is compact and let p† = Σ_{j∈N} p†_j v_j be given in terms of the training coefficients |⟨x^(i), v_j⟩|. In addition, consider x† ∈ X as well as strictly monotonic and continuous architecture parameter choices α_j, β : (0, 1) → [0, ∞). Then, the local approximation property (3.1) holds for any x† ∈ N(A)^⊥.
Proof. We first verify that p_L, p† ∈ X (with p_L = Σ_{j∈N} p_L,j v_j). For this, we exploit for each j ∈ N a bound in terms of the coefficients |⟨x^(i), v_j⟩| for any i. We thus obtain p† ∈ X via Young's inequality, and analogously for p_L. Due to the finite nature of the dataset and the properties of α, we immediately have p_L → p† for L → 1 and p_L = p† for L sufficiently close to 1.
We now continue with the approximation property, making use of the notation x_j := ⟨x†, v_j⟩ and a sufficiently large L < 1, where w_j = min{1 − σ_j^2, L}, due to Young's inequality, the Lipschitz continuity of the ReLU function, and the assumption on α_j. We further split the resulting series over j ∈ N into two sums and thus need to derive estimates for (I) and (II).
For (I), we again exploit Young's inequality and the properties of α_j. For (II) we immediately obtain the required bound, and due to the properties of β, (II) already has the desired characteristics for L → 1. In contrast, (I) requires some further calculations. Following the same line of reasoning as in the proof of Corollary 4.2, we can find an index function ψ′ for any ε > 0 such that the corresponding estimate holds. Combining this with (4.26), (4.28) and (4.29), we obtain the desired result.
Remark 4.3. Analogous to the line of reasoning in the proof of Lemma 4.3, we split the series into two sums, (I) and (II). The sum (I) takes care of small singular values and needs to be related to a property of the element x† in terms of a certain kind of source condition. The second sum (II) is related to the approximation properties of the underlying architecture. In the proof of Lemma 4.3 it is immediately zero for any x†, which is due to the linear network assumption therein. In contrast, in the previous proof we had to carefully control the behavior of this term, since it is strongly influenced by the structure of the nonlinearity.
In general, the previous lemma serves as an example of the interplay between the approximation capabilities of the network and the resulting regularization properties.

Numerical experiments
In this section, we present numerical results in order to compare our theoretical findings from Section 3 and Section 4 to their corresponding numerical implementations, and we extend these by experiments in a more general setting by learning from noisy measurements. To this end, we use a Radon transform with 30 angles and 41 detector pixels as our forward operator. We discretize the Radon transform for 28 × 28 px images, which results in a linear forward operator A : R^{28×28} → R^{30×41}. For training the different iResNet architectures, we use the well-known MNIST dataset of handwritten digits [25], which is split into 60 000 images for training and 10 000 images for testing. For our experiments, we treat the 28 × 28 images as column vectors of length 784 and use fully connected layers in all network architectures. We optimize the networks using Adam [23] and a learning rate scheduling scheme.
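Since the discretized Radon transform is simply a matrix in this setting, the operator setup can be sketched with NumPy as follows. This is a minimal sketch under our own assumptions: `radon_matrix` and its nearest-bin projection of pixel centers are a hypothetical simplification, not the discretization used in the experiments; it only illustrates how a dense operator A of the stated shape and its SVD can be obtained.

```python
import numpy as np

def radon_matrix(n=28, n_angles=30, n_det=41):
    """Assemble a naive discrete Radon transform as a dense matrix.

    Each pixel center is projected onto the detector line for every angle,
    and its intensity is accumulated in the nearest detector bin.
    """
    # Pixel centers on [-1, 1]^2.
    c = (np.arange(n) + 0.5) / n * 2.0 - 1.0
    xx, yy = np.meshgrid(c, c)
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    det = np.linspace(-np.sqrt(2), np.sqrt(2), n_det)  # detector bin centers
    h = det[1] - det[0]
    A = np.zeros((n_angles * n_det, n * n))
    for a, th in enumerate(angles):
        # Signed offset of each pixel center along the detector direction.
        t = np.cos(th) * xx.ravel() + np.sin(th) * yy.ravel()
        # Nearest detector bin for each pixel.
        idx = np.clip(np.rint((t - det[0]) / h).astype(int), 0, n_det - 1)
        for p, k in enumerate(idx):
            A[a * n_det + k, p] += 1.0
    return A

# The SVD of A then provides the singular system (u_j, sigma_j, v_j)
# underlying the diagonal architecture.
U, s, Vt = np.linalg.svd(radon_matrix(), full_matrices=False)
```

The resulting matrix has shape (30 · 41) × 784, matching A : R^{28×28} → R^{30×41} after vectorization of the images.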
There are multiple ways to satisfy the Lipschitz constraint during training. Bungert et al. [11] use an additional penalty term in the loss function; however, this method does not strictly enforce the constraint. Behrmann et al. [8] use contractive nonlinearities and only constrain the Lipschitz constant of each individual linear mapping. They compute the Lipschitz constant using a power iteration and normalize the weights after each training step. We observe improved convergence when directly parameterizing the weights to satisfy a specific Lipschitz constant, extending the approach in [30].
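The layer-wise normalization in the spirit of Behrmann et al. [8] can be sketched as follows. Note that `spectral_norm` and `normalize_layer` are hypothetical helper names, and that the direct weight parameterization extending [30] that we actually use differs from this post-hoc rescaling.

```python
import numpy as np

def spectral_norm(W, n_iter=100, seed=0):
    """Estimate the largest singular value of W (the Lipschitz constant of
    x -> W x) via power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ (W @ v))

def normalize_layer(W, L=1.0):
    """Rescale W after a training step so that Lip(x -> W x) <= L."""
    s = spectral_norm(W)
    return W if s <= L else W * (L / s)
```

Since the nonlinearities are contractive, constraining every weight matrix in this way bounds the Lipschitz constant of the whole residual function by the product of the layer-wise bounds.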
In Section 5.1, we show experiments on the local approximation property in Theorem 3.1. To this end, we use the diagonal architecture proposed in Section 3.2. The resulting data-dependent filter functions are visualized in Section 5.2, and the convergence property is considered in Section 5.3.
We implement each subnetwork f_{θ,j}, j = 1, …, n (see (3.12)), as a small fully connected neural network with independent weights for each j, where n denotes the total number of singular values in the discrete setting. Each subnetwork consists of three layers: the first two layers f^k_{θ,j}, k = 1, 2, each contain 35 hidden neurons, and the final layer f^3_{θ,j} contains a single neuron. Every layer performs a matrix multiplication with an additive bias, followed by a ReLU activation function for k = 1, 2. Accordingly, each subnetwork has 1366 trainable parameters. An illustration of the architecture is provided in Figure 7. Altogether, the parameters of the subnetworks determine the parameters θ ∈ Θ(L) of φ_θ = Id − f_θ, where Θ(L) comprises the network parameters together with the Lipschitz constraint, realized via constraints on the network parameters. Here, we enforce the Lipschitz constants Lip(f_{θ,j}) ⩽ L, j = 1, …, n, by constraining Lip(f^k_{θ,j}) ⩽ 1 for k = 1, 2 and Lip(f^3_{θ,j}) ⩽ L. In the following, we thus write θ(L) in order to emphasize the regularization parameter. Our training objective is the minimization of the approximation loss (4.1). The source code corresponding to the experiments in this section is available at https://gitlab.informatik.uni-bremen.de/inn4ip/iresnet-regularization.
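A minimal sketch of one such scalar subnetwork is given below, assuming NumPy and a post-hoc rescaling of each weight matrix to meet its layer-wise Lipschitz bound (the trained networks enforce the bounds through the parameterization instead); `make_subnetwork` is a hypothetical helper. The parameter count matches the 1366 parameters stated above (70 + 1260 + 36).

```python
import numpy as np

def make_subnetwork(L=0.9, hidden=35, seed=0):
    """One scalar subnetwork f_{theta,j}: R -> R with Lip(f) <= L.
    Layers 1-2: 35 hidden neurons, ReLU, spectral norm <= 1;
    layer 3: linear output neuron, spectral norm <= L."""
    rng = np.random.default_rng(seed)
    def layer(n_out, n_in, bound):
        W = rng.standard_normal((n_out, n_in))
        s = np.linalg.svd(W, compute_uv=False)[0]
        if s > bound:
            W = W * (bound / s)      # enforce Lip(x -> W x) <= bound
        return W, 0.01 * rng.standard_normal(n_out)
    W1, b1 = layer(hidden, 1, 1.0)       # 35 + 35 = 70 parameters
    W2, b2 = layer(hidden, hidden, 1.0)  # 1225 + 35 = 1260 parameters
    W3, b3 = layer(1, hidden, L)         # 35 + 1 = 36 parameters
    def f(s):
        h = np.maximum(W1 @ np.atleast_1d(float(s)) + b1, 0.0)  # ReLU layer 1
        h = np.maximum(W2 @ h + b2, 0.0)                        # ReLU layer 2
        return float(W3 @ h + b3)                               # linear output
    return f
```

Because ReLU is 1-Lipschitz, the composition inherits the bound Lip(f_{θ,j}) ⩽ 1 · 1 · L = L.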

Local approximation property
According to Theorem 3.1, we expect φ^{-1}_{θ(L)} to act as a regularizer as soon as the local approximation property (3.1) is fulfilled. Note that the architecture does not change for varying L in our experiments, so that we write φ_{θ(L)} instead of φ_{θ(L),L}. To observe this behavior of our trained network, the local approximation property needs to be verified with respect to the trained network parameters. Therefore, we evaluate the approximation error with respect to fixed data samples as well as the mean approximation error E_mean over the test data set. Figure 8 indicates superlinear convergence of E_mean for L → 1, i.e., the existence of an index function ψ such that the local approximation property (3.1) is satisfied within our test data set on average. This underpins the capability of the chosen network architecture and training to locally approximate the operator A in terms of (3.1). Furthermore, the evaluation of E_{x^(1)} shows that the chosen sample x^(1) fulfills the superlinear convergence and behaves very similarly to the mean error over the test data set. However, from the selection of data sample x^(2) and the corresponding error E_{x^(2)}, we notice that the local approximation property does not hold for all samples of our test data set. Figure 8 additionally shows that some coefficients x^(2)_j of x^(2), corresponding to large singular values, deviate severely from the corresponding mean values with respect to the data distribution. This effect is reduced for x^(1). Therefore, the slow convergence of E_{x^(2)}, respectively its non-convergence, could be explained by the fact that structures outside of the core of the distribution of the chosen singular value coefficients have not been learned properly during network training.

Data-dependent filter functions for diagonal architecture
For the experiments in this and the subsequent section, we train networks φ_{θ(L)} with varying Lipschitz bounds. We also include additive Gaussian noise in the network training via minimizing the approximation loss (4.1). More precisely, for each training sample x^(i) we generate Ax^(i) + η^(i) with η^(i) ∼ N(0, δ_ℓ Id) and relative noise level δ_ℓ = δ̃_ℓ · std_MNIST, where std_MNIST denotes the averaged standard deviation of the coefficients ⟨x^(i), v_j⟩ of the MNIST data set (standard deviation with respect to i, mean with respect to j) and δ̃_ℓ takes the values given in (5.5). Note that this includes (4.1) in the noise-free case. Networks trained on noisy samples with noise level δ and a particular Lipschitz bound L are denoted by φ_{θ(L,δ)}; the noise-free training outcome is denoted by φ_{θ(L)}.
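The generation of noisy training measurements can be sketched as follows; `noisy_measurements` is a hypothetical helper, and `X` is assumed to hold the vectorized samples row-wise.

```python
import numpy as np

def noisy_measurements(A, X, delta_rel, std_data, seed=0):
    """Generate y^(i) = A x^(i) + eta^(i) with eta^(i) ~ N(0, delta^2 Id),
    where delta = delta_rel * std_data is the relative noise level."""
    rng = np.random.default_rng(seed)
    delta = delta_rel * std_data
    Y = X @ A.T                      # noise-free measurements, row-wise
    return Y + delta * rng.standard_normal(Y.shape)
```

For delta_rel = 0 this reduces to the noise-free training data of (4.1).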
Utilizing identical network architectures as in the previous subsection, we evaluate the learned filter functions of chosen networks. Analogously to (3.19), the reconstruction can be written in filter form with bias b̃_L ∈ X = R^n. As stated in Section 3.2, the learned filter function of a trained network with diagonal architecture follows immediately from its subnetworks. Due to the additional bias, we make a small adaptation to (3.17), which gives (Id − f_{θ,j})^{-1}(s) − b̃_{L,j} = r_L(σ_j^2, s)s for s ∈ R, where each entry b̃_{L,j} = ⟨b̃_L, v_j⟩ corresponds to the axis intercept of a subnetwork φ_{θ,j}, j = 1, …, n, i.e., b̃_{L,j} = (Id − f_{θ,j})^{-1}(0). Since f_{θ,j} corresponds to σ_j or σ_j^2, respectively, we also write b̃_{L,j} = b̃_L(σ_j^2). Adapting to the mean values with respect to the distribution of each coefficient, we compute the mean coefficients µ_j and evaluate the filter function with respect to its second variable at the continuous extension σ^2 µ(σ^2) with µ(σ_j^2) = µ_j, j = 1, …, n.
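The evaluation of (Id − f_{θ,j})^{-1} can be sketched as a plain fixed-point iteration; `invert_residual` is a hypothetical helper name, and by Banach's fixed-point theorem the iteration converges whenever Lip(f) < 1, which the architecture guarantees.

```python
def invert_residual(f, z, n_iter=30):
    """Solve (Id - f)(x) = z via the fixed-point iteration
    x_{k+1} = z + f(x_k), convergent for Lip(f) < 1."""
    x = z
    for _ in range(n_iter):
        x = z + f(x)
    return x
```

With z = 0 this also yields the axis intercept b̃_{L,j} = (Id − f_{θ,j})^{-1}(0); the error contracts by the factor Lip(f) per iteration, so 30 iterations suffice for L well below one.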
From Figure 9 we notice that the regularization of smaller singular values increases with decreasing Lipschitz bound L for the mean MNIST data sample. This is in line with the regularization theory, as L serves as our regularization parameter; the filter plots in Figure 9 thus visualize how L acts as a regularization parameter. The trained diagonal networks φ_{θ(L_m)} for L_2 and L_5 show a slightly different behavior at σ_j^2 ≈ 0.27. The distributions of the coefficients ⟨x, v_j⟩, j = 1, …, n, are treated independently, and at this particular singular value we observed a wider distribution of coefficients ⟨x, v_j⟩ in the dataset. The inherent structure within the dataset might be one possible explanation for this outlier. In general, when neglecting this outlier, one can observe for the mean MNIST sample a certain similarity to the squared soft TSVD filter function (see Section 4.2). In addition, the observed decay of ∥b̃_L∥ with increasing L is also in line with the necessary condition for a classical filter function with bias to become a convergent regularization (see Lemma 3.3). The observed increase for the largest L is most likely caused by a numerical instability when evaluating (Id − f_{θ,j})^{-1}(0) via the fixed-point iteration performed for 30 iterations.
Figure 10 includes reconstructions of a fixed sample from the test data set. This example illustrates the effect of the regularization induced by the Lipschitz bound even if the training takes noise into account. It becomes stronger for small values of L, which coincides with our previous observations. Reconstructions resulting from increased noise levels require stronger regularization in order to improve reconstruction quality. Therefore, the best reconstructions in the cases δ̃_0 and δ̃_1 result from φ^{-1}_{θ(L_3,δ_0)} and φ^{-1}_{θ(L_3,δ_1)}. In comparison, φ^{-1}_{θ(L_2,δ_4)} and φ^{-1}_{θ(L_2,δ_3)} provide improved reconstructions for the cases δ̃_4 and δ̃_3. Moreover, at L_1 we notice similarly blurred structures in the background of the reconstructed digit for all noise levels. One might argue that this structure resembles a handwritten digit itself, i.e., the learned background pattern encoded in the bias b̃_L is adapted to the data set. These additional observations from Figure 10 indicate a dependency of the regularization on learned data structures. The corresponding filter functions, illustrated in the right column of Figure 10, show a similar behavior for all training noise levels, which underpins that the outcome of the approximation training is independent of the noise for a sufficiently large number of training samples. The outlier in the filter function for the mean sample (µ_j)_j can also be observed in Figure 10. In addition, this test sample shows a slightly different behavior with respect to the second singular value. Note that the seemingly linear behavior for σ_j^2 > 0.4 is only due to the fact that this is the linear continuous extension between the first and second singular value. In summary, the resulting filter behaves similarly to the squared soft TSVD filter, independent of the training noise and with exceptions at two singular values.

Convergence property
After investigating the approximation capabilities of trained networks with respect to the operator A in Section 5.1 in the noise-free case and extending the training to the noisy case in Section 5.2, we now verify the convergence property with respect to different noise levels. We analyze the convergence property by computing the averaged reconstruction error, including the noise levels δ_ℓ and noise samples η^(i) ∼ N(0, δ_ℓ Id) in the reconstruction process, where the network has been trained on noise level δ. We can thus evaluate the noise-free training case, which solely aims to impart data dependency, on reconstructions from noisy data, as well as the noisy training case, where training and test noise share the same level. Reconstructions, i.e., the inverse of the iResNet, are obtained by using 30 fixed-point iterations.
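The averaged reconstruction error can be sketched as follows; `mse_reco` and `recon` are hypothetical names, where `recon` stands for the inversion of a trained network composed with the back-projection, as in the experiments.

```python
import numpy as np

def mse_reco(recon, A, X, delta, seed=0):
    """Averaged squared reconstruction error over samples with fresh noise:
    mean_i || recon(A x^(i) + eta^(i)) - x^(i) ||^2."""
    rng = np.random.default_rng(seed)
    Y = X @ A.T + delta * rng.standard_normal((X.shape[0], A.shape[0]))
    Xr = np.stack([recon(y) for y in Y])
    return float(np.mean(np.sum((Xr - X) ** 2, axis=1)))
```

Evaluating this error over a grid of Lipschitz bounds L and noise levels δ_ℓ, as in Figure 11, then suggests a candidate parameter choice L(δ) as the bound minimizing the error for each noise level.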
Figure 11 shows that φ^{-1}_{θ(L,δ_0)} as well as φ^{-1}_{θ(L,δ_ℓ)} provide more accurate reconstructions with respect to (5.9) at large L and low noise levels, whereas this behavior is reversed for decreasing Lipschitz bound and increasing noise levels. This is consistent with the regularization theory and the visualized reconstructions in Figure 10, as high noise levels require strong regularization and vice versa. The behavior of MSE^{δ_ℓ}_{reco}(φ_{θ(L,δ_ℓ)}, A) for small L and small noise levels δ_ℓ is rather intuitive, since the approximation capability is limited as a consequence of the strong regularization. In addition, from Figure 11 one can also extract a suitable candidate for the parameter choice L(δ) to obtain convergence. The similarity in the behavior of φ^{-1}_{θ(L,δ_0)} and φ^{-1}_{θ(L,δ_ℓ)} underpins that the outcome of the approximation training is independent of the noise if data and noise are independent.

Discussion and Outlook
In the present work, we developed and investigated the regularization theory for the proposed iResNet reconstruction approach, providing a method learned from data samples. The network's local approximation property is fundamental to delivering a convergent regularization scheme. It comprises approximation properties of the architecture and training, a definition of the solution type, and a source condition. The approximation loss used for training is motivated by this property. In its most general version, the framework can be understood as a fully learned end-to-end reconstruction scheme with minor limitations, as it relies on the concatenation with A*, i.e., some a priori knowledge of the forward operator is included, at least in a hand-crafted way, in the reconstruction scheme. Introducing a diagonal architecture type relying on the SVD of the forward operator A allowed for an analytical investigation of the resulting learned nonlinear spectral reconstruction method, which becomes a convergent regularization when fulfilling the local approximation property. The analysis of trained shallow architectures revealed the link between the classical linear filter-based regularization theory and the concept of the local approximation property, and it illustrated the interplay between the approximation capability of the nonlinear network and the source condition. In addition, we validated and extended the theoretical findings by a series of numerical experiments on the MNIST data set and provided further insights into the learned nonlinear spectral regularization, such as the similarity to the analytically determined linear regularization (squared soft TSVD) and the data-dependency of the learned regularization.
Our iResNet method using the diagonal architecture can be seen as a generalization of the learned linear spectral regularization considered in [22]. Using a different loss, which measures the reconstruction error, the authors of [22] obtain learned filter functions corresponding to Tikhonov regularization with data- and noise-dependent regularization parameters for each singular vector. An extension of the iResNet approach to this kind of training loss is thus desirable for future comparison.
The present work also serves as a starting point for various future research directions: as we obtain the inverse of the residual network via a fixed-point iteration, findings in the context of the iResNet reconstruction may be related to learned optimization methods [6], maximally monotone operators [37], or potentially to plug-and-play methods [16]. Investigations at the interface to these methods are beyond the scope of the present work and remain future work.
The concept of a local criterion to guarantee convergence of the iResNet reconstruction method provides further opportunities to investigate the regularization properties of learned reconstruction schemes. In a different context, the authors of [5] also consider a local property: they consider a learned projection onto subspaces concatenated with the forward operator, i.e., a projected linear problem, and combine it with Tikhonov regularization. There, convergence of the regularization method is guaranteed by the Tikhonov theory, including a different but also localized assumption. The findings in the present work and the literature thus reveal the necessity to explicitly take the approximation capabilities of learned schemes and the training procedure into account. Immediate future research includes a comprehensive numerical study of how architectural choices, e.g., particular activation functions that are strictly monotone but nonlinear, influence the desired regularization properties and reconstruction quality. Further potential future investigations in this context are twofold. On the one hand, one may exploit universal approximation properties of the underlying architectures to guarantee convergence for a sufficiently large subset of X, e.g., by verifying the local approximation property. On the other hand, one may adapt the definition of regularization to the approximate and data-based nature of the learning-based setting.
Besides the more general open research questions at the interface to other methods and data-based concepts, the iResNet reconstruction approach itself provides various potential future works. The observed similarity of the learned nonlinear filter functions in the numerical experiments to the analytical solution for the affine network with bias (Section 4.2) immediately raises the question: how much does the structure of the data influence the resulting reconstruction scheme? Phrased differently, one needs to test the limits of the proposed approximation training. Here, it would be desirable to change to a Bayesian perspective considering data and noise distributions and to further investigate the training outcome, including additional variations of the training loss and its influence on the local approximation property. Natural extensions and generalizations, such as including learned solution types, e.g., by combining our approach with null space networks [41], or relaxing the diagonal assumption on the architecture, remain future work.
In summary, the present first work on learned iResNet regularization schemes builds the basis for various future works of theoretical and numerical nature.

By assumption, for every v_j there is at least one x^(i) s.t. ⟨x^(i), v_j⟩ ≠ µ_j; thus, the problem can be simplified to min (1 − w_j − σ_j^2)^2 for all j ∈ N. (A.9) The obvious solution is w_j = min{1 − σ_j^2, L}, since 1 − σ_j^2 is by assumption always positive. Now we use the SVD of A to solve φ_θ(x) = z for some z ∈ R(A) component-wise.

Figure 1 :
Figure 1: Illustration of the residual network architecture of φ_θ with residual function f_θ and skip connection.

Figure 3:
Figure 3: The plot depicts graphs of σ^2 r_L(σ^2, s) as given in (4.2) to visualize the effect of the regularization, which is also known as standard Tikhonov regularization.

Corollary 4.1.
The filter-based method (3.19) with r̃_L and b̃_L given by (4.8) fulfills properties (i)-(iii) in Lemma 3.3, i.e., in the setting of Lemma 3.3 it is a convergent regularization.

Figure 5:
Figure 5: The plots depict graphs of σ^2 r_L(σ^2, ±p_L(σ^2)σ^2) as given in (4.20) and (4.22). The choice s = p_L(σ^2)σ^2 corresponds to the case that the test data is similar to the average training data. In this case, r_L shows a behavior quite similar to the squared soft TSVD function.

Figure 6 :
Figure 6: The plots depict graphs of σ^2 r(σ^2, ±γ p_L(σ^2)σ^2) as given in (4.20). The case γ ≠ 1 corresponds to test data which differs from the average training data. Especially for larger singular values σ^2 (to the right of the two kinks), the filter function shows a suboptimal behavior compared to the squared soft TSVD filter function.

The loss function l then amounts to min θ∈Θ(L)

Figure 9 :
Figure 9: Learned filter functions r_L(σ^2, σ^2 µ(σ^2)) of the iResNet with diagonal architecture and zero noise level δ̃_0 for different Lipschitz constraints L (left), where the eigenvalues σ_j^2 are highlighted as black bullets on the x-axis. ∥b̃_L∥ corresponding to the axis intercept b̃_L = Σ_j b̃_{L,j} v_j, evaluated for L_m, m = 1, …, 5 (right).