Bi-level iterative regularization for inverse problems in nonlinear PDEs

We investigate the ill-posed inverse problem of recovering unknown spatially dependent parameters in nonlinear evolution PDEs. We propose a bi-level Landweber scheme, where the upper-level parameter reconstruction embeds a lower-level state approximation. This can be seen as combining the classical reduced setting and the newer all-at-once setting, allowing us to, respectively, utilize well-posedness of the parameter-to-state map, and to bypass having to solve nonlinear PDEs exactly. Using this, we derive stopping rules for lower- and upper-level iterations and convergence of the bi-level method. We discuss application to parameter identification for the Landau-Lifshitz-Gilbert equation in magnetic particle imaging.

1 Parameter identification

Introduction
We investigate the inverse problem of recovering the spatially dependent parameter θ in the nonlinear evolution system

u̇(t) + f(t, θ, u(t)) = 0,  t ∈ (0, T),  (1)
u(0) = Aθ,  (2)

with nonlinear model f and linear initial-condition operator A; here, u̇ denotes the time derivative of u. The parameter identification is carried out with an additional linear measurement of the state u, which is usually contaminated by noise of some level δ, resulting in data y^δ with ∥y^δ − y∥ ≤ δ. The parameter identification problem (1)-(2) can be formulated in a reduced setting with the parameter-to-observation map G = L ∘ S as the forward operator (3). Essentially, S is a well-defined parameter-to-solution map (4) whose output is the unique solution of the nonlinear PDE (1), while L is the linear observation operator (5). This abstract setting is the framework considered throughout the paper.
Regularization with state approximation error Iterative regularization algorithms for solving the inverse problem (3)-(5), such as Landweber-type or Newton-type methods, require repeated evaluation of S(θ) to perform an update of the current parameter iterate.
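Before turning to the nonlinear setting, the basic Landweber update that these methods share can be sketched for a linear toy forward operator; the matrix `A`, step size `mu`, and iteration count below are illustrative assumptions, not objects from the paper.

```python
import numpy as np

def landweber(A, y, mu, n_iter, x0=None):
    """Classical Landweber iteration x_{k+1} = x_k - mu * A^T (A x_k - y)."""
    x = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    for _ in range(n_iter):
        x = x - mu * A.T @ (A @ x - y)
    return x

# Toy linear forward operator standing in for the composition G = L o S.
A = np.array([[1.0, 0.2], [0.2, 1.0], [0.5, 0.5]])
theta_true = np.array([1.0, -0.5])
y = A @ theta_true                      # noise-free synthetic data

# mu is chosen so that mu * ||A||^2 < 2, the classical stability condition.
theta = landweber(A, y, mu=0.5, n_iter=500)
```

In the nonlinear case discussed next, each evaluation of the forward map inside such a loop requires solving the PDE (1), which is exactly the cost the bi-level scheme addresses.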
The task becomes challenging when the PDE model f is nonlinear. Nonlinear solvers frequently invoke a fixed-point argument to handle the nonlinearity, which results in additional internal iterations. The convergence analysis for numerically approximating u for nonlinear PDEs is rather technical; [2,3,5,8] are a few examples of FEM solvers for the nonlinear Landau-Lifshitz-Gilbert equation in magnetic particle imaging (more details in Section 5.2). Approximating u by ũ introduces additional error into the primary reconstruction process for the parameter θ; as such, it is important to consider the effect that the propagation of the state approximation error ϵ := ∥ũ − u∥, or the discretization error in the case of FEM, has on the recovery of θ. This is the motivation for this work. Inspired by this line of research, in particular when the underlying PDE model is nonlinear, we investigate a general framework suitable for inverse coefficient problems. In this spirit, we consider the problem of approximating the solution to the PDE (1) as another inverse problem strategically embedded into the main ill-posed parameter identification problem.
Introducing the lower level Adapting our notation to the framework of iterative methods, we call the iteration for estimating u the lower-level iteration, and the iteration for reconstructing θ the upper-level iteration. To this end, we first recast the lower-level problem as an inverse problem in which the unknown is the state u. The forward map, at a given parameter θ, is the PDE model (1),

F(θ)(u) := (u̇ + f(t, θ, u(t)); Aθ − u|_{t=0}),  (6)

with the nonlinear model f : X × V → U* and the initial condition A : X → H. We aim at driving the PDE residual to zero, that is, at solving

F(θ)(u) = φ := (0, 0),  (7)

with the image φ ∈ U* × H simply being zero. Seeking a good approximation S̃(θ) of the exact state S(θ) in (4) is equivalent to finding a ũ ∈ V such that the PDE residual in (7) is sufficiently small. Here, the choice of function space plays an important role and will be discussed in detail later.
The state approximation error ϵ := ∥ũ − u∥ in the lower-level iteration must be handled with care: allowing it to be large saves computational time, yet at the same time it must be kept small enough that, when it enters the upper-level iteration, one can still obtain an acceptable reconstruction result for θ.
Bi-level as a mixed approach The suggested approach can be seen as a combination of the reduced formulation (3)-(5) and an all-at-once approach. The all-at-once formulation is a recent approach introduced in PDE-constrained optimization. This concept was subsequently studied in the context of inverse problems [6,7,16,27,28,48], and more recently specifically for time-dependent inverse problems [32,41]. In this setting, one constructs, instead of the reduced map G in (3), the higher-dimensional all-at-once map G and solves G(θ, u) = (0, 0, y) with

G : X × V → U* × H × Y,  G(θ, u) := (u̇ + f(t, θ, u(t)); Aθ − u|_{t=0}; Lu).  (8)
By treating u as an independent unknown, one jointly recovers the parameter and the state (θ, u), bypassing the step of solving the nonlinear PDE (1) exactly. This feature makes the all-at-once approach powerful in practice. A comparison between these formulations can be found in [32,41]. We remark that the classical reduced approach usually imposes certain structural conditions on the model f to guarantee unique existence for the PDE (1). The all-at-once approach (8), as explained, bypasses the need to study unique existence as well as to solve the PDE exactly. This raises the question: can one utilize accessible properties of f and still bypass the exact solution for a nonlinear PDE? This is the viewpoint that we take in this study. In particular, we employ coercivity of the PDE model to ensure convergence of the lower-level iteration. Accordingly, as with the all-at-once approach, no nonlinear PDE needs to be solved exactly. Essentially, the bi-level scheme as a mixed approach combining the advantages of the reduced setting and the all-at-once setting is the new perspective of this work.
Connection and contrast to other directions The importance of analyzing the finite-dimensional approximation error in an inverse problem framework has been pointed out in [40]; in particular, discretization error in FEM was treated in [19,9,33,25,20]. These stimulating works specialize in inverse problems with linear PDEs, mainly in the context of variational regularization. Here, we focus on a general framework of inverse coefficient problems with nonlinear PDEs. In addition, our research focuses on Landweber regularization instead of variational regularization. Furthermore, we do not use FEM to approximately solve the PDE, but rather take the perspective of solving the nonlinear PDE as a lower-level inverse problem.
We point out that the bi-level viewpoint has been realized in optimization and is an active research area in its own right [4]. There, the main focus is on optimality conditions rather than, as is the case in inverse problems, regularization parameters; in our case, the regularization parameters will take the form of stopping rules for the Landweber iteration. To the best of the author's knowledge, the bi-level Landweber (or gradient descent) iteration has yet to be discussed for inverse problems.
Interestingly, the idea of approximating the inverse component in a regularized iteration by another iteration can already be found in the existing literature. For Newton-type methods, [26,17,43] propose to approximate the inverse component [α_k Id + G′(θ_k)* G′(θ_k)]^{-1} by either a Landweber or a Conjugate Gradient (CG) inner iteration, forming the so-called Newton-Landweber and Newton-CG methods. In contrast, the inner iteration in our work approximates S(θ_k) = F(θ_k)^{-1}(φ) (recalling that G = L ∘ S), rather than terms related to the derivative G′. The outer iteration in our algorithm also employs the Landweber method. Following the established naming pattern, we might refer to our approach as a Landweber-Landweber method.
It is worth remarking on the growing popularity of the model order reduction approach for large-scale inverse problems [14]. That approach and ours align in the idea of solving PDEs inexactly. If one replaces the lower level by another inexact PDE solver, our upper-level analysis still integrates well. In particular, it suggests an adaptive approximation/discretization estimate, e.g. via increasing the degrees of freedom in the solver, that can ensure convergence of the regularized reconstruction.
Contribution In a broad sense, we exploit the structure of the inverse coefficient problem G = L ∘ S, where in practice the observation operator L is linear but ill-posed, while the PDE solution map S is nonlinear but well-posed. In addition, we pay attention to nonlinear time-dependent PDEs, a setting studied relatively infrequently.
The techniques in this study are largely influenced by the seminal works [45,29] on the use of the tangential cone condition in iterative regularization to handle nonlinearity. Regarding the regularization parameter choice, we are inspired by the recent work [35], in which an a priori stopping scheme was introduced, as opposed to a classical a posteriori stopping rule. Although our proof strategy is motivated by these profound results, there are important differences between these works and our analysis. First, we investigate the Landweber method in a bi-level framework, in comparison to the standard single-level setting. Second, our work is tailored particularly to parameter identification in PDEs, rather than to the abstract formulation F(x) = y. Third, from a practical point of view, our bi-level algorithm (Algorithm 1) is ready to use in real-life applications, as all the ingredients of the algorithm are explicitly expressed.
Concretely, we impose a coercivity assumption borrowed from PDE theory on the lower Landweber iteration. This yields interesting outcomes such as verification of the tangential cone condition (Lemma 3), a convergence rate (Theorem 1) and monotonicity of the residual sequence (Proposition 2). For the upper Landweber iteration, we carefully control the interaction between the state approximation error and the noise level, and their propagation through the adjoint process and the iterative update. As our main result, we provide explicit adaptive stopping rules, both for the lower and the upper iteration, confirming convergence of the bi-level scheme (Theorem 2).
Structure The paper is outlined as follows: after an introduction of the problem setting in Section 1.2, we present the bi-level algorithm in Section 2. Section 3 is dedicated to studying the lower-level problem. Section 4 advances the investigation to the upper-level problem. Section 5 verifies the proposed assumptions and discusses the application to magnetic particle imaging. Finally, we summarize our findings and future prospects in Section 6.

Setting
The space of the spatially dependent parameter is X, and the data space is denoted by Y. Notationally, calligraphic letters indicate spaces that involve the time variable t.
The smooth state space is V. The image space, to which the PDE residual belongs, is U* = L²(0, T; U*) (the calligraphic counterpart of U*). Our choice of V ensures well-definedness and boundedness of d/dt : V → U*. Unlike in the reduced formulation, where U* does not appear, in the all-at-once setting (8) and the combined setting proposed in (6), the PDE image space U* plays a clear role, guaranteeing well-definedness of the lower-level problem.
The state space V must be sufficiently smooth for well-definedness of the PDE (1) and for the existence of an adjoint state (Lemma 1). This function space setting is inspired by the book [44, Chapter 8].
X, Y, U, V, U * are all Hilbert spaces, ensuring that our problem remains in a Hilbert space framework.
The parameter-to-state (or solution) map S : X ∋ θ → u ∈ V(⊆ U) is nonlinear and assumed to be Fréchet differentiable.
The observation (or measurement) operator L : U ∋ u → y ∈ Y is linear and bounded. In real life, L is usually a compact operator; it cannot capture the smoothness of u ∈ V due to discrete or noisy measurements. Its domain U hence does not need to be as smooth as V.
The choice of the data (or observation) space Y depends on how the data is acquired, e.g. full measurements, partial measurements or snapshots of u; thus we keep Y as a general Hilbert space. The superscript δ in y^δ indicates data contaminated by noise of level δ.
The initial value operator A : X → H is linear and bounded. The continuous embedding V → C(0, T; H) [44, Section 7.2] enables the time-wise evaluation of u, ensuring well-definedness of the initial condition u(0) = Aθ ∈ H.
The PDE model f : (0, T) × X × U → U* is nonlinear and is assumed to satisfy the Carathéodory condition. This means that for each θ ∈ X, the mapping f is measurable with respect to t for any u ∈ U, and continuous with respect to u for almost every fixed t ∈ (0, T). In this way, f naturally induces a Nemytskii operator between the corresponding function spaces, which we denote by the same symbol. We refer to [44,47] for details on Carathéodory and Nemytskii mappings.
Finally, Ω is a smooth and bounded spatial domain, and the time horizon T is finite.

Notation
We use the capitalized subscript R for constants relating to the upper-level problem (e.g. (46)-(47)), while the lower-case subscript r is reserved for constants in the lower-level problem (e.g. (31)-(32)). The subscript S is used in constants relating to the parameter-to-state map S (e.g. (22)). Regarding iteration indices, the superscript (•)^j denotes the iterates of the upper-level problem, and the subscript (•)_k is used for the lower-level problem.
B* denotes the Hilbert space adjoint, and B⋆ denotes the Banach space adjoint of a linear, bounded operator B. The notation (•, •) means the inner product, and ⟨•, •⟩ the dual pairing (cf. Proposition 1).
∥B∥_{X→Y} denotes the operator norm of B : X → Y. In most cases, where it is clear which norms are used due to the definition of the operators, we abbreviate, e.g. ∥L∥ := ∥L∥_{U→Y}, ∥A∥ := ∥A∥_{X→H}, etc. Nevertheless, in some computations where technicalities are involved, we write out the norm explicitly. We also write various other norms in shortened form.
The constant C_{X→Y} indicates the norm of the continuous embedding X → Y.
Throughout the paper, we assume that the PDE (1)-(2) is solvable, and denote by θ† the exact parameter and by u† the corresponding exact solution. B_R(θ†) ⊂ X is the ball of radius R around the ground truth θ†; a similar interpretation applies to the ball B_r(u†) ⊆ V.
For the lower-level problem in Section 3, θB_{r/2}(u*) ⊆ V denotes the ball of radius r/2 centered at u*, where u* is a solution of F(θ)(u) = φ at a given θ (see (29)).
f′_θ, f′_u are the partial derivatives of f, while ∇f is its gradient.

Bi-level algorithm
We now establish the bi-level Landweber Algorithm 1. The main idea is to first perform the upper-level iteration for θ by the standard Landweber iteration with the exact forward map G = L ∘ S. We then replace the parameter-to-state map S by an approximation S_{K(j)} obtained through a lower-level Landweber iteration. S_{K(j)} enters the forward and backward (adjoint) evaluations in the upper-level iteration, yielding a gradient update step for θ.
After Algorithm 1 is constructed, we perform some preliminary error analysis in Section 2.2.
Replacing S by its approximation S_{K(j)}, the Landweber iteration with S_{K(j)} takes the approximate form given in Algorithm 1, where at each j, that is, at each θ^{δ,j}, the state approximation S_{K(j)}(θ^{δ,j}) := u^j_{K(j)} is the outcome of the lower-level iteration with the PDE map F(θ^{δ,j}) described in (6)-(7). The stopping rule K(j) for the lower-level iteration, as well as j* for the upper-level iteration, will be the subject of the later sections.
In order to see how the state approximation enters the adjoint S′(θ^{δ,j})*, we first derive the adjoint process.
Proposition 1 (Adjoint). Let D_U : U → U* and I_X : X* → X be isomorphisms. Then the Hilbert space adjoint of G′, with G defined in (3)-(5), is given by (11), where z is the solution to the final value problem (13) with the corresponding source term; the Banach space adjoints are given in (12).

Proof. As the parameter-to-state map S : X → V in (4) is assumed to be differentiable, S′(θ)ξ =: p ∈ V solves the sensitivity (or linearized) equation (14), where u = S(θ) is the solution to the original PDE (1). Note that p ∈ V, with V being smooth as defined in Section 1.2, together with the Nemytskii assumption on f, ensures ṗ + f′_u(θ, u)p ∈ U* and f′_θ(θ, u)ξ ∈ U*, confirming well-definedness of the sensitivity equation. In the following computation, we respectively use (13), integration by parts and (14), with the help of the isomorphisms, to translate inner products into dual pairings for every ξ ∈ X, v ∈ U. This proves that the adjoints of S′(θ) are as shown in (12). Applying the chain rule, we obtain the adjoint of the composition map as claimed in (11).
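As a sanity check for adjoint computations of this kind, one can verify the Hilbert-space adjoint identity ⟨S′(θ)ξ, v⟩ = ⟨ξ, S′(θ)* v⟩ numerically. The sketch below does this for a hypothetical finite-dimensional stand-in for the state equation, u + u³ = Bθ (not the paper's PDE); there the implicit function theorem gives S′(θ) = J(u)⁻¹B with J(u) = I + 3 diag(u²). All objects are illustrative assumptions.

```python
import numpy as np

def solve_state(B, theta, n_newton=50):
    """Solve the componentwise state equation u + u^3 = B @ theta by Newton."""
    c = B @ theta
    u = np.zeros_like(c)
    for _ in range(n_newton):
        u = u - (u + u**3 - c) / (1.0 + 3.0 * u**2)
    return u

B = 0.4 * np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                    [1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)
theta = np.array([1.0, -0.5, 0.8])
u = solve_state(B, theta)
J = np.diag(1.0 + 3.0 * u**2)          # linearization of the state equation

S_prime = np.linalg.solve(J, B)        # S'(theta)   : parameter -> state
S_prime_adj = S_prime.T                # S'(theta)^* : state -> parameter

xi = np.array([0.3, -0.1, 0.2])        # direction in parameter space
v = np.linspace(-1.0, 1.0, 6)          # direction in state space

lhs = (S_prime @ xi) @ v               # <S'(theta) xi, v>
rhs = xi @ (S_prime_adj @ v)           # <xi, S'(theta)^* v>

# Cross-check S' against a finite-difference derivative of the solution map.
h = 1e-6
fd = (solve_state(B, theta + h * xi) - u) / h
```

The same two checks (adjoint identity and finite-difference derivative) are standard practice before trusting a hand-derived adjoint PDE solver.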
With the adjoints detailed, we can now state the Landweber iterations for the exact problem (single-level) and the approximated problem (bi-level).
{Upper-level} Update the parameter. {Lower-level} At each j, meaning at each θ^{δ,j}, iterate the lower-level update (19). {Back to upper-level} Then compute the approximate adjoint (20). Remark 1. In the bi-level algorithm, no exact S appears; equivalently, no nonlinear PDE needs to be solved exactly. The state approximation S_{K(j)}(θ^{δ,j}) enters the residual L S_{K(j)}(θ^{δ,j}) − y^δ as well as the adjoint (20). We emphasize that the term S̃′(θ^{δ,j})* appearing here is not, in fact, the adjoint of the derivative of the exact S evaluated at θ^{δ,j}, but an approximation thereof.
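The structure of the bi-level loop can be sketched numerically. The sketch below is not the paper's Algorithm 1 applied to a PDE: a hypothetical finite-dimensional state equation u + 0.1u³ = Bθ stands in for (1), the observation is full (L = Id), the data are noise-free, and all sizes, step lengths and iteration counts are assumptions chosen so that the toy problem is benign.

```python
import numpy as np

B = 0.4 * np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                    [1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)
theta_true = np.array([1.0, -0.5, 0.8])

def residual(u, theta):
    """Lower-level residual F(theta)(u) = u + 0.1 u^3 - B theta (phi = 0)."""
    return u + 0.1 * u**3 - B @ theta

def lower_level(theta, u0, n_inner=40, tau=0.5):
    """Landweber on the state: u <- u - tau * F'_u(u)^T (F(theta)(u) - phi)."""
    u = u0.copy()
    for _ in range(n_inner):
        u = u - tau * (1.0 + 0.3 * u**2) * residual(u, theta)
    return u

# Synthetic data y = L S(theta_true) with L = Id, computed very accurately.
y = lower_level(theta_true, np.zeros(6), n_inner=400)

theta = np.zeros(3)
u = np.zeros(6)
for j in range(200):                   # upper-level Landweber iteration
    u = lower_level(theta, u)          # warm-started state approximation
    J_inv = 1.0 / (1.0 + 0.3 * u**2)   # (F'_u)^{-1}, diagonal in this toy
    grad = B.T @ (J_inv * (u - y))     # approximate S'(theta)^* L^* (L u - y)
    theta = theta - 1.0 * grad         # upper-level step size mu = 1 (assumed)
```

Note how the approximate state `u` enters both the residual and the adjoint factor `J_inv`, mirroring Remark 1: no exact solve of the state equation is ever performed.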

Preliminary error analysis
At this stage, the Landweber algorithms for the single-level and the bi-level setting have been established. The two algorithms differ in the state and in the adjoint, as mentioned in Remark 1. We thus examine the system output error and the adjoint error with respect to a given approximation error ϵ_{K(j)}; this is the focus of this section.
The state approximation error ϵ_{K(j)} will be quantified in the succeeding section. The subscript K(j) alludes to the fact that we will eventually derive a stopping rule K for the lower level with respect to the upper-level iterate j. For now, we assume that K(j) is given.
Lemma 1 (System output error). Denote by ϵ_{K(j)} := ∥S(θ^{δ,j}) − S_{K(j)}(θ^{δ,j})∥_U the state approximation error at the input θ^{δ,j}, and assume that there exists a constant M_S > 0 such that

∥S′(θ)∥_{X→U} ≤ M_S for all θ ∈ B_R(θ†).  (22)

Given any other input θ^j ∈ B_R(θ†), the system output error is then bounded by

∥L S(θ^j) − L S_{K(j)}(θ^{δ,j})∥_Y ≤ ∥L∥ (M_S ∥θ^j − θ^{δ,j}∥_X + ϵ_{K(j)}).

Proof. Differentiability of S : X → V, thus of S : X → U due to V → U, an application of the mean value theorem and the triangle inequality imply the claim.

Remark 2. M_S in (22) is a locally uniform bound for solutions of the linearized PDE (14). A sufficient condition for (22) to hold is Lipschitz continuity of the parameter-to-state map S. When θ plays the role of a source term or initial data, results on Lipschitz continuity of S for general classes of nonlinear PDE models f exist in the literature [44, Theorem 8.32].
We now turn our attention to the discrepancy between the exact adjoint and the approximate adjoint with respect to the state approximation error ϵ_{K(j)}.
Lemma 2 (Adjoint state error). Let the assumptions in Lemma 1 hold. Suppose that the solution z of the adjoint PDE (13) depends continuously on the source term, as in (24). Furthermore, assume that f′_θ, f′_u are Lipschitz continuous, with the constants in (25). Then, given any other input θ^j ∈ B_R(θ†), we obtain the adjoint error bound (26).

Proof. The Lipschitz condition (25) implies boundedness of the derivatives on the balls under consideration, as in (27), recalling that R and r are the radii of the balls B_R(θ†) ⊂ X and B_r(u†) ⊂ V. We will use this bound when evaluating the term Q_2 later in the proof.
Consulting Algorithm 1, subtracting the approximate adjoint state (20) from the exact adjoint state (17) and inserting the mixed term f′_θ(θ^{δ,j}, S_{K(j)}(θ^{δ,j}))* z̃, we obtain, for any v ∈ U, a decomposition into three terms Q_1, Q_2, Q_3. We recall that z and z̃ are, respectively, the solutions of the exact and the approximate adjoint equations. Inspecting Q_2, we see that the difference z − z̃ solves an adjoint equation whose source is (−f′_u(θ^j, S(θ^j)) + f′_u(θ^{δ,j}, S_{K(j)}(θ^{δ,j})))⋆ z̃. Now, using (24), we can bound z by the source term D_U v; in a similar fashion, we bound z − z̃ by its source term, which we then estimate further via the Lipschitz condition (25). Using these bounds and the boundedness of the derivatives from (27), we can estimate Q_1, Q_2 and Q_3 by the parameter difference ∥θ^j − θ^{δ,j}∥ and by its propagation through the state approximation ∥S(θ^j) − S_{K(j)}(θ^{δ,j})∥. In the last step, combining Q_1, Q_2 and Q_3, then applying Lemma 1 with L = Id to further estimate ∥S(θ^j) − S_{K(j)}(θ^{δ,j})∥_U, we arrive at (26); this completes the proof.
At this point, we have analyzed the system output error and the adjoint error in the bi-level Algorithm 1 via the state approximation error ϵ_{K(j)}. We now study ϵ_{K(j)} itself.

The lower-level iteration
In this section, we approximate S by S̃ at a given θ ∈ B_R(θ†) via the procedure (19) in Algorithm 1. In particular, we examine convergence and the convergence rate of (19) in order to derive the state approximation error ϵ_{K(j)} := ∥S(θ^{δ,j}) − S_{K(j)}(θ^{δ,j})∥_U in explicit form. Optimally, ϵ_{K(j)} should be controllably small by letting the lower-level scheme iterate until a sufficiently large stopping index K(j) is reached. The lower-level stopping index K(j) will be derived in the next section, when its role in the convergence of the upper-level iteration becomes relevant.
For convenience, we shorten the notation in the lower-level problem (6). Explicitly, for a fixed θ, we introduce the model operator θF := F(θ); thus, θF′ indicates the derivative with respect to the state u, and an exact solution u* of (29) for the given θ satisfies θF(u*) = φ. The procedure (19) in Algorithm 1, dropping the superscript j, now takes the form

u_{k+1} = u_k − θF′(u_k)* (θF(u_k) − φ).  (30)

Finally, we recall the use of (θ†, u†) to denote the exact parameter and the corresponding exact state of the original inverse problem (1)-(2).
The following result presents the convergence analysis for the lower-level Landweber iterates. We derive the convergence rate in the noise-free case not via the well-known source condition, but rather through a coercivity assumption. The weak convergence analysis is carried out without assuming weak closedness of θF.
Theorem 1 (Convergence and rate). Suppose that at each θ ∈ B_R(θ†), the equation θF(u) = φ admits a solution u* ∈ B_{r/2}(u†).

i) Assume that there exists a constant M_r > 0 such that the uniform derivative bound (31) holds, and a constant µ_r > 0 such that the weak tangential cone condition (wTC) (32) holds uniformly for all θ ∈ B_R(θ†). Then the whole sequence {u_k} given by (19) and (30) remains in the ball B_r(u†) for all k ∈ N. Furthermore, the sequence converges weakly to a solution of θF(u) = φ.

ii) If additionally θF is coercive, in the sense that there exists some C_coe > 0 such that (33) holds for all u ∈ B_r(u†) and all θ ∈ B_R(θ†), then we obtain strong convergence of {u_k}, with the rate (34).

Before proceeding with the proof, we make some observations on these assumptions.
Remark 3. (i) This is the form of the weak tangential cone condition introduced in [45,36], with the tangential cone constant c_tc := 1 − (M_r² + µ_r)/2 ∈ (0, 1).

(ii) Its strong version (sTC) takes the form (36). Moreover, as the initial condition operator A is linear, (wTC) and (sTC) in fact constrain only the nonlinear model f here; furthermore, this constraint is with respect to the state u only, and not to the parameter θ (see also Remark 10).
(iii) Assumption (33) can be relaxed to the weaker norm ∥u − u*∥^α_U. In this case, (34) yields a correspondingly modified rate; this rate is sufficient for the treatment of the upper-level problem in the later sections, as we only need to quantify ϵ_{K(j)} := ∥S(θ^{δ,j}) − S_{K(j)}(θ^{δ,j})∥_U, the state approximation error in the U-norm.
Proof of Theorem 1. i) Fix θ ∈ B_R(θ†) and let u* be a solution of θF(u*) = φ. Following the classical approach [18,29], we show Fejér monotonicity of the error sequence via the boundedness assumption (31) on the derivative and the weak tangential cone condition (32); this yields the estimate (35). This estimate shows that, for each θ, as long as u_0 lies inside the ball θB_{r/2}(u*), the whole sequence {u_k} remains in θB_{r/2}(u*). Together with the fact that u* ∈ B_{r/2}(u†), it follows that u_k ∈ B_r(u†) for all θ ∈ B_R(θ†). As a consequence, there exists a subsequence {u_{k_n}} weakly convergent to some ū ∈ V. Next, summing the inequality (35) from 0 up to an arbitrary index K ∈ N and using the uniform bound (31) on the derivative, we obtain convergence of the residual series and of the gradient step series. Now, as ū ∈ B_r(u†), applying again the weak tangential cone condition (32), inserting φ and using the weak convergence u_{k_n} ⇀ ū together with the strong convergence of the residuals, we deduce θF(ū) = φ, proving that the weak limit of any weakly convergent subsequence is in fact an exact solution.
Next, we deduce weak convergence of the whole sequence {u_k} from Fejér monotonicity. Similarly to [10, Lemma 9.1], let u*, ū be two weak accumulation points, thus exact solutions, as argued above. Then there exist subsequences {u_{k_n}}, {u_{k_m}} with u_{k_n} ⇀ u* and u_{k_m} ⇀ ū as n, m → ∞. Accordingly, one compares the limits of the corresponding distance sequences; these four limits exist since {u_{k_n}}, {u_{k_m}} are extracted from the iterate sequence {u_k}, which has monotonically decreasing distance to any exact solution, as shown in (35). As these limits coincide, a sub-subsequence argument yields weak convergence of the full sequence to a solution.

ii) For the convergence rate, coercivity (33) (which in particular implies uniqueness of u*) and strong convergence of the residuals yield strong convergence of u_k to the exact u*. By Fejér monotonicity, we obtain the rate for the iterate sequence; for the residual sequence, we argue similarly.

Remark 4 (Coercivity and PDE theory). Coercivity (33) is one of the key elements in analyzing unique existence for PDEs [12,44]. This condition is often stated for the nonlinear model f; after some derivation, it may then be transferred to semi-coercivity of the PDE model, which is θF in our case. This, as alluded to in Section 1.1, is the motivating point of our study. We have employed well-posedness of the parameter-to-state map S, which is induced by coercivity of the model θF, to iteratively approximate the state u with a convergence guarantee and rate.
At the end of this paper (Section 5.1, Proposition 6, Remark 9), we will establish coercivity for some classes of nonlinear parabolic PDEs. This demonstrates that (33) is indeed a suitable and verifiable assumption for the lower-level problem.
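The qualitative content of Theorem 1 (Fejér monotonicity of the error and strong convergence under coercivity) can be observed numerically on a coercive toy model; below, the componentwise map g(u) = u + u³ stands in for θF, with φ = g(u*). The model, step size, and iteration count are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def g(u):                                   # monotone model standing in for theta_F
    return u + u**3

def g_prime(u):
    return 1.0 + 3.0 * u**2

u_star = np.array([0.6, -0.4, 0.2, 0.5])    # exact lower-level solution
phi = g(u_star)                             # right-hand side, so g(u_star) = phi

u = np.zeros_like(u_star)                   # initial guess inside the ball
tau = 0.2                                   # step size with tau * sup(g')^2 < 2
errors, residuals = [], []
for _ in range(300):
    errors.append(np.linalg.norm(u - u_star))
    residuals.append(np.linalg.norm(g(u) - phi))
    # Landweber / gradient step on the residual, u <- u - tau g'(u)(g(u) - phi)
    u = u - tau * g_prime(u) * (g(u) - phi)
```

For this model, g′ ≥ 1 and the iteration contracts componentwise, so both the error and the residual decrease monotonically, echoing (35) and Proposition 2.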
Remark 5 (Coercivity and conditional stability). Assumption (33) expresses Hölder continuity of the inverse of θF and can be extended to superadditive index functions ϕ, as in (39). Indeed, (37) then becomes (38), implying the linear rate. The extended version (39) can be seen as a conditional stability property of a continuous and injective operator.
There is also a relation to the variational source condition in which the link from the loss functional ℓ to the data penalty via the index function ϕ is the key to convergence rates in variational regularization [13].
Going back to our setting, where θF describes a PDE, we focus on coercivity (33) rather than on the extension (39). In fact, coercivity leads us to the following result.
Lemma 3 (Strong tangential cone condition (sTC)). The strong tangential cone condition (see Remark 3-ii)) holds under coercivity (33) and the Hölder continuity assumption (40) on the derivative, with the constants constrained as in (41). In this scenario, all the claims in Theorem 1 remain valid.
Proof. The Hölder condition (40) and coercivity (33) yield the strong tangential cone condition for u, u* ∈ B_r(u†). Above, the strict inequality (41) ensures that the tangential cone constant c_tc is less than one.
Remark 6 (Convergence is highly feasible). Lemma 3 shows that, given coercivity, one only needs Hölder continuity of the derivative as in (40) to confirm convergence of the lower-level iteration. Indeed, the special case β = 1, that is, a Lipschitz continuous derivative, is a natural assumption for convergence of gradient descent in many contexts, e.g. in convex optimization.
Continuing along these lines, we observe some connections between the weak tangential cone condition and other renowned conditions yielding convergence of the gradient descent method, such as convexity and the Polyak-Lojasiewicz (PL) condition. To this end, consider the cost functional J(u) := ∥θF(u) − φ∥². Recall that J is said to satisfy the PL condition [34] if there exists a µ > 0 such that ½∥∇J(u)∥² ≥ µ (J(u) − min J) for all u.

Lemma 4 (Convexity versus wTC versus Polyak-Lojasiewicz condition). Assume that the equation θF(u) = φ is solvable at u* ∈ B_r(u†). Let θF be differentiable with (rescaling as necessary) ∥θF′(u)∥ ≤ M_r < 1 for all u ∈ B_r(u†). Then convexity of J implies the wTC with the fixed constant (M_r² + µ_r) = 1, which in turn implies the standard wTC (32).

Indeed, by differentiability, the subdifferential ∂J(u) is a singleton. Furthermore, convexity yields ⟨∇J(u), u − u*⟩ ≥ J(u) − J(u*) for θF(u*) = φ and any u ∈ V. Restricting our consideration to u ∈ B_r(u†), this is in fact the wTC (32) with the fixed constant (M_r² + µ_r) = 1; clearly, this implies the standard wTC.
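The PL condition in Lemma 4 can be checked numerically for the same kind of monotone toy model: for J(u) = ∥g(u) − φ∥² with g(u) = u + u³ (an assumed stand-in, not the paper's θF), one has ∥∇J(u)∥² = 4∥g′(u)(g(u) − φ)∥² ≥ 4 J(u) since g′ ≥ 1, so PL holds with µ = 2, and gradient descent on J decays geometrically. All constants below are illustrative assumptions.

```python
import numpy as np

def g(u):
    return u + u**3

def J(u, phi):
    return np.sum((g(u) - phi) ** 2)

def grad_J(u, phi):
    return 2.0 * (1.0 + 3.0 * u**2) * (g(u) - phi)

phi = g(np.array([0.5, -0.3, 0.1]))      # min J = 0 at the exact state

# Check the PL inequality (1/2)||grad J||^2 >= mu * J with mu = 2 on samples.
rng = np.random.default_rng(1)
samples = rng.uniform(-0.8, 0.8, size=(100, 3))
pl_ok = all(np.sum(grad_J(u, phi) ** 2) >= 4.0 * J(u, phi) for u in samples)

# Gradient descent: PL plus smoothness yields geometric decay of J.
u = np.zeros(3)
values = []
for _ in range(400):
    values.append(J(u, phi))
    u = u - 0.05 * grad_J(u, phi)
```

The decay of `values` is the linear (geometric) convergence that the PL condition guarantees for gradient descent, despite J not being convex in general.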
Going back to the central point of this section: monotonicity of the error sequence is the main character in Theorem 1; however, the same question for the residual sequence in the nonlinear case has, to the best of the author's knowledge, yet to be discussed. The following proposition gives sufficient conditions for monotonicity of the residual.
1 That is, not only at a solution v = u * .
Proof. Consider u ∈ int B_r(u†), the interior of B_r(u†), and v := u + ϵh ∈ int B_r(u†) with sufficiently small ϵ > 0. Employing the coercivity condition (33), we obtain a lower bound for θF′(u). The open mapping theorem implies that at any u ∈ int B_r(u†), the bounded linear operator θF′(u) ∈ L(V, U* × H) has closed range. Equivalently, the adjoint θF′(u)* also has closed range. This, together with injectivity, gives us that θF′(u_k)* is also bounded below by the same constant, see (43). Note that this estimate holds uniformly over all θF with θ ∈ B_R(θ†).
Next, using the weak tangential cone condition (32) and the parallelogram law, one obtains an estimate for the residual difference. Applying boundedness of the derivative (31) to the first and third terms on the right-hand side, then moving the third term to the left-hand side, we obtain (44). Recall that as long as the initial guess satisfies u_0 ∈ B_{r/2}(u*), all the iterates u_k remain in the ball. Moreover, u_k belongs to the interior of B_{r/2}(u*) due to Fejér monotonicity (35); thus, u_k ∈ int B_{r/2}(u*) ⊂ int B_r(u†). This enables us to apply the lower bound (43) to the first term in the above estimate, with u = u_k and y = θF(u_{k+1}) − φ. Here, we notice that the lower bound 1/C_coe cannot exceed M_r, the upper bound for ∥θF′(u)∥; hence, we cannot yet claim decay of the residual sequence. Therefore, we further estimate the middle term of (44), again invoking coercivity (33) and the lower bound (43). Substituting this into (44), we arrive at the final estimate, where the strict inequality is achieved according to the constraint (42). This shows the claimed monotonicity of the residual sequence.
Remark 7 (Boundedness from below of θF′(u_k) and its adjoint). Boundedness from below of θF′(u_k) and of its adjoint θF′(u_k)* is in fact the same as invertibility of θF′(u_k), and thus of θF′(u_k)*. This ties in with the unique existence analysis of the linearized PDE (14) and the adjoint equation (13), which appear in the iterative Algorithm 1.
We conclude this section by discussing some links from the lower level to the upper level, stepping toward parameter recovery.
• A convergence rate for the state approximation error ϵ_{K(j)} := ∥u_k − u*∥_U in the U-norm is sufficient to carry out the analysis of the upper level (cf. Remark 3-iii)). In some practical examples, this simplifies verification of the coercivity assumption (33).
• Consider the lower-level problem θF(u) = φ, θ ∈ B_R(θ†). As long as the exact state u* and the initialization u_0, which might depend on θ, lie in B_{r/2}(u†), all iterates remain in the ball B_r(u†) (Theorem 1), which is independent of θ. This indicates that if we ensure that, in the upper-level problem, the parameter iterates θ^j do not escape the ball B_R(θ†), then the lower-level iterates u_k are guaranteed to remain in the ball B_r(u†).
• The lower-level problem is noise-free, as φ is the residual of the PDE model. We can therefore iterate u_k until the error ϵ_K(j) is acceptably small, then input this approximate state into the upper-level iteration. This gives rise to a lower-level stopping rule k ≤ K, where the stopping index K depends on the j-th iterate of the upper-level iteration and the noise level δ. We will determine K(j, δ) in the next section.
• The state approximation error ϵ_K(j) can be observed in practical experiments by plotting the PDE residual. Indeed, the two behave in a similar manner (Theorem 1, ii), rate), and the PDE residual even decays monotonically (Proposition 2), echoing Fejér monotonicity of the error (Theorem 1, (35)).
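To make the lower-level stopping rule concrete, the following is a minimal numerical sketch (not the paper's PDE setting): a noise-free Landweber iteration for a toy finite-dimensional nonlinear equation F(u) = φ, stopped once the residual falls below a prescribed tolerance. The model F, step size, and tolerances below are illustrative assumptions.

```python
import numpy as np

def landweber_lower_level(F, dF, phi, u0, mu=0.5, tol=1e-8, max_iter=10_000):
    """Noise-free lower-level Landweber iteration for F(u) = phi.

    Stops once the residual ||F(u_k) - phi|| falls below `tol`,
    mirroring the lower-level stopping rule k <= K(j, delta)."""
    u = np.asarray(u0, dtype=float).copy()
    for k in range(max_iter):
        residual = F(u) - phi
        if np.linalg.norm(residual) <= tol:
            return u, k
        # Landweber step: u_{k+1} = u_k - mu * F'(u_k)^* (F(u_k) - phi)
        u = u - mu * dF(u).T @ residual
    return u, max_iter

# Toy monotone nonlinearity (illustrative): F(u) = u + 0.1 u^3
F = lambda u: u + 0.1 * u**3
dF = lambda u: np.diag(1.0 + 0.3 * u**2)   # Jacobian F'(u)
u_star = np.array([0.5, -0.3, 0.8])        # exact state u*
phi = F(u_star)                            # exact right-hand side
u_rec, k_stop = landweber_lower_level(F, dF, phi, u0=np.zeros(3))
```

Since this toy F is bi-Lipschitz on bounded sets, a small PDE residual certifies a small state error, which is exactly the mechanism the bullet points above exploit.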
We summarize the assumptions that have been used.
Assumption 1 (Bi-level algorithm). For all θ, θ̃ ∈ B_R(θ†) and for all u, û ∈ B_r(u†), the following all hold:
1. S′ is uniformly bounded, with bound M_S.
2. The solution z to the adjoint PDE depends continuously on the source h.
5. Either the weak tangential cone condition holds, or F_θ′ satisfies the Hölder condition (see Lemma 3).
6. F_θ is coercive, with coercivity constant C_coe.

The upper-level iteration
In the preceding section, we approximated the PDE solution by the lower-level iteration. The approximate state u^j_K(j) =: S_K(j)(θ̂^{δ,j}) and the associated approximate adjoint will now be fed into the upper level of Algorithm 1 to update the parameter θ̂^{δ,j}. This section studies convergence of the upper-level iteration under the influence of the measurement noise level δ and of the state and adjoint approximation errors. In particular, we discuss two types of stopping rule: an a posteriori choice inspired by the classical result [29], and an a priori choice inspired by the newer result [35]. We put more emphasis on the latter.
Let us recall from (3) that the reduced map G = L ∘ S is a composition of the linear, bounded observation operator L and the nonlinear, differentiable parameter-to-state map S. The forward map G has a uniformly bounded derivative, satisfying ∥G′(θ)∥ = ∥LS′(θ)∥ ≤ ∥L∥M_S for all θ ∈ B_R(θ†), with M_S as in Assumption 1.

Strong tangential cone condition (sTC)
The first part of (47) is the weak tangential cone condition, while the second part strengthens the assumption (cf. Discussion 3, ii)). Assumption (47) can also be written in the equivalent form (48). Unlike the tangential cone condition w.r.t. u of the lower-level problem, the derivative here is taken w.r.t. the parameter θ.
Algorithm 1 presents the upper-level iteration, where S_K(j)(θ̂^{δ,j}) is approximated according to the lower-level iteration (19) and S′(θ̂^{δ,j})* is computed according to the adjoint process (20).
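A compact numerical sketch of this interplay may help fix ideas (purely illustrative: the toy model, observation operator, step size, tolerances, and the accuracy factor γ(j) below are assumptions, not the paper's setting). The lower level is solved only up to accuracy γ(j)δ, and the upper level is stopped by a discrepancy-type rule.

```python
import numpy as np

def bilevel_landweber(y_delta, L, solve_state, dG_adjoint, theta0,
                      mu=0.5, delta=1e-3, tau=2.0, gamma=0.1, max_upper=1000):
    """Sketch of a bi-level Landweber scheme in the spirit of Algorithm 1.

    solve_state(theta, tol) plays the role of the lower-level iteration
    S_K(theta); dG_adjoint(theta, u, r) applies S'(theta)^* L^* to the
    residual r. All names here are illustrative, not from the paper."""
    theta = np.asarray(theta0, dtype=float).copy()
    for j in range(max_upper):
        u_K = solve_state(theta, tol=gamma * delta)   # inexact state S_K(theta)
        residual = L @ u_K - y_delta
        # discrepancy-type stop: residual at or below tau * delta
        if np.linalg.norm(residual) <= tau * delta:
            return theta, j
        theta = theta - mu * dG_adjoint(theta, u_K, residual)
    return theta, max_upper

# Toy problem: S(theta) solves u + u^3 = theta componentwise; L a fixed matrix.
rng = np.random.default_rng(0)
L = np.eye(3) + 0.05 * rng.standard_normal((3, 3))

def solve_state(theta, tol):
    u = np.zeros_like(theta)
    while np.linalg.norm(u + u**3 - theta) > tol:   # inner fixed-point solve
        u = u - 0.3 * (u + u**3 - theta)
    return u

def dG_adjoint(theta, u, r):
    return (L.T @ r) / (1.0 + 3.0 * u**2)           # S'(theta)^* L^* r

theta_true = np.array([0.4, -0.2, 0.7])
y = L @ solve_state(theta_true, tol=1e-12)
noise = rng.standard_normal(3)
y_delta = y + 1e-3 * noise / np.linalg.norm(noise)  # noise of level delta = 1e-3
theta_rec, j_stop = bilevel_landweber(y_delta, L, solve_state, dG_adjoint,
                                      theta0=np.zeros(3))
```

With exact state evaluations (γ(j) = 0, K(j) = ∞) the stopping check reduces to the classical discrepancy principle, as Remark 8 points out.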

An a posteriori stopping rule
In this section, we analyze an a posteriori choice for the stopping index j* of the upper-level iteration. The analysis also yields a stopping rule K(j) for the lower-level problem.
(in particular, the whole sequence remains in the ball B_R(θ†)) if, for some γ(j) ≥ 0, the following two conditions are satisfied:
1. The lower-level iteration (19) is run until step K(j), such that the state approximation error satisfies ϵ_K(j) ≤ γ(j)∥LS(θ̂^{δ,j}) − y^δ∥ as in (50), with γ(j) nonnegative and sufficiently small.
2. In the upper level, the residual satisfies the relation (51) for some ϵ > 0, with C_tc as in (48) and M_R, µ_R, K_R as in (47).
Proof. For E_1, we insert y^δ − LS(θ̂^{δ,j}). For E_2, we recall the state approximation error ϵ_K(j) := ∥S(θ̂^{δ,j}) − S_K(j)(θ̂^{δ,j})∥ and use the second part of the tangential cone condition (47). For E_3 and E_4, we employ boundedness of the derivative from (46), as well as the definition of ϵ_K(j) for E_4. Combining these estimates, then inserting y and applying the tangential cone condition in the equivalent form (48), we obtain (52). This estimate reveals that if the state approximation error ϵ_K(j) is sufficiently small in comparison to ∥LS(θ̂^{δ,j}) − y^δ∥, we obtain negativity of e^δ_j. This motivates us to control the lower-level error ϵ_K(j) by the upper-level residual γ(j)∥LS(θ̂^{δ,j}) − y^δ∥ as in (50). Indeed, (50) yields an estimate in which negativity is achieved if we stop the upper-level iteration according to the rule (51). Fejér monotonicity follows.
Remark 8 (Stopping rule: consistency check). We wish to emphasize that when S is solved exactly or an analytic solution of the PDE (1) can be derived, one can set K(j) = ∞ and γ(j) = ϵ = 0 in (51) and bypass the lower-level problem. In this event, (51) reduces to the known discrepancy principle in [29, Section 2.2, (2.12)].

A priori stopping rule
At this point, it is natural to ask how one may obtain the exact evaluation S(θ̂^{δ,j}) required to compare the residual ∥LS(θ̂^{δ,j}) − y^δ∥ with τδ, as the discrepancy principle (53) suggests, and how to terminate the lower-level iteration w.r.t. the upper-level residual (50). A possible scenario is that one has access to the system output LS : θ̂^{δ,j} ↦ LS(θ̂^{δ,j}), even though S(θ̂^{δ,j}) is not directly accessible. In this event, the proposed a posteriori stopping rule (53) can be verified. If this is not the case, an a priori stopping rule j* based on the noise level δ is desirable.
Proposition 4 (Uniform boundedness). Let Assumptions 1, 2, 3 hold. All the iterates θ̂^{δ,j} with 0 ≤ j ≤ j*(δ) remain in the ball B_R(θ†) if, for some γ(j) ≥ 0, the following two conditions are satisfied:
1. The lower-level problem (19) is iterated until step K(j), such that the state approximation error satisfies (54).
2. In the upper level, the initialization and the a priori stopping index satisfy (55).
Proof. Similarly to Proposition 3, we evaluate the error e^δ_j between two consecutive iterations. We begin with the estimate (52), where in the last line we further insert y into E_1 and E_3 (that is, the terms containing y^δ). This yields a decomposition in which A has the same form as (E_1 + E_2 + E_3 + E_4) in (52), with y replacing y^δ in E_1 and E_3, and B is the residual generated by inserting y.
In A, we can directly apply the first part of the tangential cone condition (47) to E_1; the rest is estimated in a similar manner as in (52). Notice that, in comparison to E_3, A_3 carries the extra constant (1 + ϵ′) generated by splitting the square term via (a + b)² ≤ (1 + ϵ′)a² + (1 + 1/ϵ′)b², ϵ′ > 0, when inserting y. In B, we use the second part of the tangential cone condition (47), boundedness of the derivative as in (46), and finally Young's inequality 2ab ≤ µ_R a² + b²/µ_R with a := ∥LS(θ̂^{δ,j}) − y∥, b := δK_R. We now set ϵ = ϵ′ = √2 − 1 in both A and B, which fixes the constant in A_3. Next, we bound ϵ_K(j) in A_2 and A_4 by δγ(j), as the rule (54) suggests. In A_2, we further apply Young's inequality 2ab ≤ µ_R a²/2 + 2b²/µ_R with a := ∥LS(θ̂^{δ,j}) − y∥, b := ∥L∥ϵ_K(j)K_R. Summing A and B, then summing e^δ_j over j = 0, …, j for any j ≤ j*(δ) with the upper-level stopping index j* proposed in (55), we conclude that all the iterates until j*(δ) remain in the ball, which completes the proof.
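For reference, the elementary inequalities invoked repeatedly in this proof are the following standard facts, stated here for the reader's convenience:

```latex
(a+b)^2 \le (1+\epsilon')\,a^2 + \Bigl(1+\tfrac{1}{\epsilon'}\Bigr)b^2
\quad (\epsilon' > 0),
\qquad
2ab \le \mu_R\, a^2 + \tfrac{b^2}{\mu_R}
\quad (\mu_R > 0).
```

With the choice ϵ′ = √2 − 1, the constants become 1 + ϵ′ = √2 and 1 + 1/ϵ′ = 2 + √2.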
In comparison to Proposition 3, an a priori stopping rule cannot yield Fejér monotonicity of the iterates with noisy data. As monotonicity is essential in ensuring the regularizing property of the Landweber iteration, and also due to the bi-level structure, we need to take particular care regarding stability of the upper level.
The following proposition is crucial for deriving a more concrete stopping rule, which ensures controllability of the approximation errors propagating to the upper level. At this point, the preliminary error analysis in Section 2.2 comes into play. We remind ourselves that the circumflex (•̂) denotes the approximate/bi-level scheme; notation without the circumflex refers to the exact/single-level scheme.
Inserting S′(θ_j)*L*(LS_K(j)(θ̂^{δ,j}) − y^δ) (first estimate below) and using the analysis of the state approximation error derived in Lemma 1 (second and third estimates) and of the adjoint approximation error derived in Lemma 2 (second estimate), we deduce the claimed bound. We simplify the notation by grouping terms, where in B^{δ,ϵ_K(j)} we have invoked uniform boundedness of the iterates in the ball B_R(θ†) proved in Proposition 4. Grouping the estimate in terms of T_j, δ and ϵ_K(j), we obtain (60). The estimate (60) exposes an important fact: if ϵ_K(j) decays sufficiently fast with respect to δ, then we can bound the term T_{j+1} from above. Indeed, (54) suggests running the lower-level iteration until ϵ_K(j) ≤ δγ(j). Employing the upper bounds on δ and γ from the statement of the proposition, we obtain Fejér monotonicity (57). These two properties are exactly the same as (35) and (36) in Theorem 1, the key steps in proving convergence of the iteration, with θ in place of u and G in place of F_θ. Therefore, following the line of the proof of Theorem 1, part i), and noting that the tangential cone condition (even the strong version) is also assumed here, we can assert that the whole iterate sequence of the single-level algorithm with noise-free data converges to a solution of the inverse problem.
5 Discussion on assumptions and application

Classes of problems with coercivity
As presented in Remark 4 of Section 3, coercivity (34) is the essential property that we extract from well-posedness of S to achieve the convergence rate for the lower-level iterations. In what follows, we present two classes of PDEs for which coercivity can be established: Lipschitz nonlinearity and monotone-type nonlinearity (66).
Let us consider the PDE (29), where the unknowns θ := (a, c, ϕ, u_0) ∈ X consist of the diffusion coefficient a, the potential c, the source term ϕ and the initial state u_0. These unknown parameters enter the PDE model (65) in the Hilbert space framework (cf. Section 1.2) with bounded Lipschitz domain Ω ⊂ R³. In (65), the nonlinearity Φ : V → U* is a Nemytskii operator induced by Φ : U → U* with (by abuse of notation) [Φ(u)](t) := Φ(u(t)). We consider the two cases (a) and (b) in (66) for all u, v ∈ U with L_Φ, M_Φ ≥ 0. Clearly, case (b) covers the class of standard monotone functions.
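The two cases in (66) can be illustrated with concrete scalar examples (a quick numerical sanity check; Φ(u) = sin u for the Lipschitz case (a) with L_Φ = 1, and Φ(u) = u³ for the monotone case (b), both chosen here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(-5.0, 5.0, size=10_000)
v = rng.uniform(-5.0, 5.0, size=10_000)

# Case (a): global Lipschitz continuity, |sin(u) - sin(v)| <= L_Phi |u - v|, L_Phi = 1
assert np.all(np.abs(np.sin(u) - np.sin(v)) <= np.abs(u - v) + 1e-12)

# Case (b): monotonicity, (Phi(u) - Phi(v)) (u - v) >= 0 for Phi(u) = u^3
assert np.all((u**3 - v**3) * (u - v) >= -1e-12)
```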
The property (67) is also referred to as Lipschitz stability.
Proof. We begin by testing F_θ(u) − F_θ(v) with (u − v; 0) ∈ V × H. Setting ∥u∥_{H¹_0} := ∥∇u∥_{L²} (justified by the Gagliardo-Nirenberg-Sobolev inequality) and recalling the uniform lower bound a ≥ a̲ > 0, we write the basic energy identity. Estimating the c-term by Hölder's inequality with p = q = 2, then by Young's inequality xy ≤ x^p/p + y^q/q with p = 4/3, q = 4, we obtain a bound in which C^Ω_{H¹→L⁶} indicates the application of the continuous embedding H¹(Ω) → L⁶(Ω), Ω ⊂ R³. For the nonlinear term Φ in (66), one observes that, in case (a), Young's inequality xy ≤ x²/ϵ + ϵy² with ϵ = a̲/8 applies. Combining the above estimates of the nonlinearity Φ and the c-term with (68), we have, for all t ∈ [0, T], an estimate in which, in the second line, Young's inequality ab ≤ ϵa² + b²/ϵ, ϵ < a̲/8, has been applied once more and the L²-terms grouped with a nonnegative constant. In the second step, integrating (69) over the time interval (0, t), t ≤ T, leads to (70). We then apply Grönwall's inequality to (70), where x(t) := ½∥u(t) − v(t)∥²_{L²} is the second term on the left-hand side and Π collects the remaining data terms. In (70), one can moreover bound the term (a̲/8 − ϵ)∥u − v∥_{L²(0,T;H¹(Ω))} on the left-hand side by the entire right-hand side, thus eventually by ∥F_θ(u) − F_θ(v)∥_{U*×H}. Altogether, we obtain (71) with some C_{Φ,R} > 0. Importantly, as the parameter θ lies in the ball B_R(θ†) with finite radius R, the estimate (71) is locally uniform. If only (ϕ, u_0) are the unknown parameters, the radius R is allowed to be arbitrarily large, as C_{Φ,R} then depends only on (c, a). This shows Lipschitz stability of u ↦ F_θ(u), equivalently the desired coercivity.
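For convenience, the integral form of Grönwall's inequality applied to (70) reads (a standard statement; x and Π as in the proof above):

```latex
x(t) \le \Pi(t) + c\int_0^t x(s)\,\mathrm{d}s \;\text{ for all } t\in[0,T]
\quad\Longrightarrow\quad
x(t) \le \Pi(t) + c\int_0^t \Pi(s)\,e^{c(t-s)}\,\mathrm{d}s .
```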
Remark 10 (Tangential cone condition). In view of Corollary 3, the tangential cone condition for the lower-level problem (in u) is clearly satisfied for Φ(u) = 0. When Φ(u) is as in (66), since the coercivity/Lipschitz stability property is already established in Proposition 6, one only needs to assume Lipschitz continuity of its derivative to conclude the tangential cone condition. Thus, convergence of the lower-level iteration is readily achievable.
For the upper-level problem, verification of the tangential cone condition (in θ) is another technical question. We refer to [30] for a study of this condition in several linear and nonlinear parabolic benchmark problems, to [21] for abstract linear elliptic problems, to [23] for quantitative elastography, to [24] for magnetic resonance elastography, to [36] for electrical impedance tomography, to [11] for full waveform inversion, and recently, to [1], [46] in the context of machine learning.

Application to magnetic particle imaging
This section is dedicated to a real-life application: magnetic particle imaging (MPI). The material of the following discussion is extracted from the two papers [31, 42]. The work in [31] puts forward the mathematical analysis of the parameter identification problem in MPI; subsequently, [42] proceeds with Landweber and Landweber-Kaczmarz algorithms and numerical experiments. Moreover, an ad hoc bi-level algorithm, in which the lower-level iteration is used as an approximation method for the nonlinear PDE (73), was successfully implemented, albeit without a full analysis.
State of the art. MPI is a novel medical imaging technique invented by B. Gleich and J. Weizenecker in 2005 [37]. The underlying concept is to observe the magnetic behavior of iron oxide nanoparticle tracers, which are injected into patients and respond to an external oscillating magnetic field (with a field-free point). One can thus visualize the particles' concentration, yielding quantitative imaging of blood vessels. MPI offers favorable features: no harmful radiation, high resolution (< 1 mm) and fast acquisition (< 0.1 s).
Here, µ_0 is the magnetic permeability in vacuum, a_l is a band-pass filter, c is the known particle concentration and p^R is the 3D receive coil sensitivity.
Inspired by [15, 38], we use the Landau-Lifshitz-Gilbert (LLG) equation (73) to model the evolution of the magnetization moment vector m at a microscopic scale. The LLG equation (73) is highly nonlinear, with vectorial solution m; its numerical solution is a greatly challenging task [2, 3, 5, 8] (see also [42, Remark 1] for a survey). This initiated our idea of a bi-level approach, in which the lower-level problem is used to approximate m. We invite the reader to review [31, 42] for various numerical results. To illustrate, [42, Fig. 9] shows the state approximation m(t, x) in the lower level, while [42, Fig. 14] shows the reconstruction as well as the internal operation of the bi-level scheme. The reconstruction with Landweber step size µ = 1 ran 250 upper-level iterations and a total of 11095 lower-level iterations over 90,000 seconds; this can be sped up by using the larger step size µ = 10. The final state error is below 1%, and the parameter error is below 2% [42, Table 13]. In essence, the proposed bi-level scheme demonstrates its robustness through several test cases with experimental parameters, true physical parameters, and noise-free and noisy data. We are currently carrying out an extended numerical comparison of the three approaches: reduced, all-at-once and bi-level.

Conclusion
In this work, we have proposed a bi-level Landweber framework with convergence guarantees for coefficient identification in ill-posed inverse problems. The method is tailored to, but not limited to, nonlinear time-dependent PDE models. In addition, the regularization study in the upper level can be combined with other inexact PDE solvers, such as reduced-order models or principal component analysis, to obtain an end-to-end error analysis.
The author intends to extend the study in the following directions:
• The lower level could further include a model inexactness term δ_PDE. This reflects the case, for instance, when the PDE is not modeled under ideal conditions, or when a simplified version is employed. In this situation, the PDE itself contains a quantity of inexactness, and the problem becomes ill-posed in both levels.
• In Algorithm 1, the lower-level iteration is initiated at some u_0 near u†, and the lower-level iterations between different upper-level iterates do not communicate with each other. Arguing that a small update in the parameter θ causes a correspondingly small update in the PDE solution u, one can initialize the lower-level iteration of step j + 1 at the last iterate of step j, i.e. u^{j+1}_0 := u^j_K(j), with the aim of lowering the stopping index K(j + 1). Inspired by the all-at-once approach, the author also wishes to investigate the situation where one lower-level iteration is sufficient, i.e. K(j + 1) = K(j) + 1, yielding alternating updates between u and θ.
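The warm-start idea can be previewed on a toy scalar problem (a hypothetical sketch; the model u + u³ = θ, step sizes, and iteration counts are illustrative and not from the paper): the inner solve at outer step j + 1 is initialized at the last inner iterate of step j, which typically reduces the total number of lower-level iterations.

```python
import numpy as np

def inner_solve(theta, u0, tol=1e-8, mu_in=0.3):
    """Lower-level solve of the toy state equation u + u^3 = theta.
    Returns the approximate state and the number of inner iterations."""
    u, k = float(u0), 0
    while abs(u + u**3 - theta) > tol:
        u -= mu_in * (u + u**3 - theta)
        k += 1
    return u, k

def outer_run(warm_start, theta_true=0.8, n_outer=30, mu=0.5):
    """Upper-level Landweber for G(theta) = S(theta) (identity observation).
    If warm_start, the inner solve of step j+1 starts at u_K^j."""
    y = inner_solve(theta_true, 0.0, tol=1e-12)[0]    # noise-free data
    theta, u_init, total_inner = 0.0, 0.0, 0
    for _ in range(n_outer):
        u, k = inner_solve(theta, u_init)
        total_inner += k
        theta -= mu * (u - y) / (1.0 + 3.0 * u**2)    # S'(theta)^* (u - y)
        if warm_start:
            u_init = u                                # u_0^{j+1} := u_K^{j}
    return theta, total_inner

theta_cold, inner_cold = outer_run(warm_start=False)
theta_warm, inner_warm = outer_run(warm_start=True)
```

In this toy run, both variants reconstruct the same parameter, while the warm-started one needs fewer total inner iterations.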
Replacing the Landweber method by its accelerated version [39,22] or by another iterative method in either or both levels is also a promising research direction.
This is due to the fact that we already have boundedness of the linear operators ∥d/dt∥_{V→U*}, ∥A∥_{X→U*} and ∥(·)|_{t=0}∥_{V→U*}, guaranteed by the choice of function spaces suggested in Section 1.2. The magnitude condition M_r <
(ii) Assumption (32) is equivalent to