Relaxation approach for learning neural network regularizers for a class of identification problems

The present paper deals with the data-driven design of regularizers, in the form of artificial neural networks, for solving certain inverse problems formulated as optimal control problems. These regularizers aim at improving accuracy and well-posedness, or at compensating uncertainties, for a given class of optimal control problems (inner-problems). Parameterized as neural networks, their weights are chosen so as to reduce a misfit between data and observations of the state solution of the inner optimal control problems. Learning these weights constitutes the outer-problem. Based on necessary first-order optimality conditions for the inner-problems, a relaxation approach is proposed in order to implement efficient solving of these inner-problems, namely the forward operator of the outer-problem. Optimality conditions are derived for the latter, and are implemented in numerical illustrations dealing with the inverse conductivity problem. The numerical tests show the feasibility of the relaxation approach, first for rediscovering standard L²-regularizers, and next for designing regularizers that compensate unknown noise on the observed state of the inner-problem.


Introduction
Designing regularizers for solving inverse problems is an important data-science topic for which data-driven approaches have recently been developed. In particular, the design of regularizers as artificial neural networks demonstrates interesting capacities for improving characteristics of inverse problems. Among these characteristics, one can aim at recovering uniqueness of solutions for initially ill-posed inverse problems, facilitating their solving, obtaining specific qualitative properties like sparsity, or even compensating unknown real-world noise that leads to uncertainties on the measured data. The design of such regularizers can be achieved through an optimal control approach, consisting in optimizing parameters of a machine, namely an input-output map, in order to minimize a distance between real-world data and outputs of this machine.

General learning problem
Consider a series of K problems that belong to a given class of optimal control problems indexed by 1 ≤ k ≤ K, and assume that for each of them we know a solution. More precisely, we consider a set of tasks ŷ_k that we know to be achieved by the means of parameters û_k (the controls). The K pairs {(ŷ_k, û_k), 1 ≤ k ≤ K} constitute our data set, and our main goal is, from this data set, to train a function denoted by r that improves the solving of this class of control problems. Our supervised-learning problem can be formulated as follows: Problems ( * ) constitute the class of inner-problems for which we want to design a function r ∈ R acting on the parameters u to identify. Designing this function r ∈ R constitutes the outer-problem ( * * ). The set R describes the class of functions in which we design a possibly optimal function r. In Problem ( * ), the state is denoted by y, and the state equation is given by ϕ(y, u) = 0. The cost c(y, ŷ) corresponds to a misfit between the state y and the task ŷ to achieve. The mapping γ is given, with values in R, and can correspond for example to a norm.
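Schematically, and consistently with the description above, the two levels can be written as follows (this rendering is a sketch; the precise spaces and mappings are introduced in section 2):

```latex
% Inner-problem (*): for a given task \hat y and regularizer r,
\min_{u \in U} \; c(y, \hat y) + \gamma(r(u))
\quad \text{subject to} \quad \phi(y, u) = 0 .

% Outer-problem (**): learn r within the class R from the data set,
\min_{r \in \mathcal{R}} \; \frac{1}{2} \sum_{k=1}^{K} \| u_k - \hat u_k \|_U^2
\quad \text{subject to} \quad u_k \text{ solves } (*) \text{ with } \hat y = \hat y_k .
```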

Motivation of data-driven regularization and state-of-the-art
In the formulation of the inner-problem ( * ), the additional term r(u) corresponds to a so-called regularizer. Its role consists in forcing the parameter/control function u to present certain properties, like regularity, boundedness, or sparsity for instance. Besides the qualitative properties aforementioned, designing a regularizer for a class of identification problems like ( * ) can address several issues. For example, inverse problems are well-known to present non-unique solutions, leading to describe them as being ill-posed. A regularizer designed with a data-driven approach could help to select a relevant solution. Another issue could be due to uncertainty on measurements: data can be subject to a noise phenomenon whose nature is unknown, and that would lead to inaccuracy on the measured solution. An appropriate regularizer, implementing implicitly the effects of this noise, could help to compensate this lack of accuracy. We illustrate this specific application in our numerical realizations (see section 6).
In order to design such a regularizer, one needs to parameterize the set R, which we choose as a set of artificial neural networks (ANNs) with a prescribed number of layers. A regularizer (u ↦ r(w, u)) ∈ R is then a mapping parameterized by a set of weights w. We refer to section 3.1 for further details on the definitions and notation related to the ANNs we will consider. Besides the theoretical approximation capacities of ANNs (see [Elb+21] for example), data-driven methods training ANNs have led to the design of regularizers that outperform classical solving methods for inverse problems.
For example, the data-driven modeling of regularizers with NETT (Network Tikhonov) was investigated in [Li+20a], in order to solve inverse problems subject to unknown noise, with applications to computed tomography. Let us also cite [OSH19] in the context of trained decoder networks for sparse reconstruction, and refer to [Li+20b] for a theoretical analysis of such an approach. Further developments were studied in [Obm+19; Obm+20], considering aNETT (augmented Network Tikhonov), designing a regularizer and also a penalty (misfit) function in the form of a neural network. Still on applications in medical imaging, the use of adversarial neural networks as regularizers was studied in [LOS18]. The use of deep neural networks for solving ill-posed inverse problems related to medical imaging was also studied in [AO17]. The special case of input-convex neural networks (ICNNs) for modeling regularizers with a data-driven approach was considered in [Muk+20]. More generally, we orient the reader to [BB18; HN20] for a comprehensive review of techniques related to this topic.

Approach and strategy
In [HKB18] the authors investigated the problem of learning the Tikhonov regularization parameter for a class of inverse problems, with a bi-level approach. This approach consists in adopting a hierarchy in the formulation of the regularizer learning problem: the identification problems ( * ) for which we want to design a regularizer are called inner-problems, and the main problem ( * * ) consists in optimizing the parameters of the regularizer for a given set of identification problems for which we know the ground-truth (the data set). The term optimal control problem will be employed for both inner and outer problems. At a meta-level, the outer-problem consists in optimizing parameters (here the parameters w of the regularizer) of the family of the inner-control problems, for which the parameters u to identify are also described as controls. Let us mention that the bi-level approach was initially developed for image denoising in [KP13].
In the present paper, we adopt this bi-level approach, where the inner-problem ( * ) is considered as a constraint in the outer-problem. Compared with [HKB18], in our case the range of possible regularizers is a priori wider than a Tikhonov regularizer, as the regularizers are chosen in the form of artificial neural networks parameterized with weights. The price to pay is a theoretical analysis (existence and uniqueness of minimizers) that becomes out of reach in the general case, due in particular to the purely nonlinear aspects of neural networks. Therefore, our main contribution is the development of a relaxation approach, consisting in replacing the optimality constraint of the inner-problems by the corresponding necessary Karush-Kuhn-Tucker conditions.
In our case these conditions are equality constraints (vanishing gradient equalities) that transform the outer-problem into a standard optimal control problem of state constraint type. This relaxation approach is inspired by optimal control problems with partial differential equations constraints. Indeed, partial differential equations modeling physical dynamical systems can often be derived from the least action principle. They are merely first-order optimality conditions of a minimization problem. Well-posedness questions for these partial differential equations are then related to existence and uniqueness of minimizers for functionals called actions.
The main theoretical result is given in Theorem 4.4, where we provide the sensitivity of the objective function for the relaxed version of Problem ( * * ) with respect to the weights of the neural network. This allows us to implement the determination of these weights in Algorithm 1, with gradient rules. Illustrations are proposed by considering as inner-problems the classical identification of coefficients in elliptic partial differential equations.
Regularizers are designed, in particular for compensating the effects of a Gaussian noise introduced on the state variable.

Plan
The paper is organized as follows: notation and assumptions are given in section 2. In section 3 we study the inner-problems with ANNs as regularizers. The latter are described in section 3.1. The necessary optimality conditions related to the inner-problems are investigated in sections 3.2 and 3.3. The relaxation approach is presented in section 4.1. We prove existence of minimizers for the relaxed outer-problem in section 4.2, and derive optimality conditions in section 4.5 for the latter. In section 5 we present algorithms for implementing the relaxation approach. Section 6 is devoted to numerical illustrations considering the identification of coefficients in an elliptic partial differential equation. Concluding remarks are given in section 7. In Appendix A.2 we present advantages that the optimal control approach can offer when dealing with state-constrained identification problems. Appendix A.1 recalls classical results related to the inverse conductivity problem.

Notation, assumptions and preliminaries
The reader is invited to refer to the present section when browsing the rest of the paper, especially regarding the notation.

Functional setting
We denote by Y, U, Q real Banach spaces, and by ϕ : Y × U → Q the state mapping. We assume that Q is reflexive. The control space U is assumed to be finite-dimensional, of dimension n_in, so that U ≃ R^{n_in}. Therefore in particular U is a Hilbert space, and reflexive. The range space V of the neural networks is finite-dimensional, of dimension n_out, and V ≃ R^{n_out}. The space of weights is denoted by W, and described in section 3.1. Given weights w ∈ W, we denote by r(w, ·) : U → V the associated neural network of R, and by γ : V → R a smooth given mapping, possibly chosen as a norm on R^{n_out}.
The data {(ŷ_k, û_k), 1 ≤ k ≤ K} are finite in number. The controls (û_k)_{1≤k≤K} are assumed to lie in U, while the tasks (ŷ_k)_{1≤k≤K} lie in a finite-dimensional subspace of Y.

The regularizer as realization of a neural network
Given L ≥ 1, we denote by w = (w_1, w_2, . . ., w_L) a vector of affine functions. The set of regularizers, denoted by R, is made of feedforward neural networks of L layers, for which the dimensions n_ℓ (1 ≤ ℓ ≤ L) of the hidden layers are prescribed. We denote by Aff(n_ℓ, n_{ℓ+1}, R) the set of affine functions from R^{n_ℓ} to R^{n_{ℓ+1}}. For 1 ≤ ℓ ≤ L, the components of w write

w_ℓ(z) = A_ℓ z + b_ℓ,

where A_ℓ ∈ R^{n_{ℓ+1} × n_ℓ} and b_ℓ ∈ R^{n_{ℓ+1}}, with n_1 = n_in = dim(U) and n_{L+1} = n_out = dim(V). For the sake of concision, we will denote throughout the paper w^{(ℓ)} := (w_1, . . ., w_ℓ), for 1 ≤ ℓ ≤ L. We introduce the space of weights W := ∏_{ℓ=1}^{L} Aff(n_ℓ, n_{ℓ+1}, R). We endow the space W with the Euclidean norm given by

‖w‖_W² = ∑_{ℓ=1}^{L} ( |A_ℓ|²_{R^{n_{ℓ+1} × n_ℓ}} + |b_ℓ|² ),

where |A_ℓ|²_{R^{n_{ℓ+1} × n_ℓ}} = trace(A_ℓᵀ A_ℓ) and |b_ℓ| is the Euclidean norm of R^{n_{ℓ+1}}. Given a smooth activation function ρ : R → R and a vector w of L weights, a neural network r(w, ·) ∈ R of L layers writes as follows:

r(w, u) = w_L ∘ ρ ∘ w_{L−1} ∘ ρ ∘ ⋯ ∘ ρ ∘ w_1(u).

In the expression above, since the functions w_ℓ have values in R^{n_{ℓ+1}}, the compositions by ρ have to be understood coordinate-wise as ρ(w_ℓ(z))_i = ρ((w_ℓ(z))_i), for z ∈ R^{n_ℓ} and i ∈ {1, . . ., n_{ℓ+1}}. Thus, the regularizer space R is parameterized by the weights lying in W, and the outer-problem ( * * ) is formulated as follows:

min over w ∈ W of (1/2) ∑_{k=1}^{K} ‖u_k − û_k‖_U²,
subject to: u_k is solution of Problem ( * ) with (ŷ, r) = (ŷ_k, r(w, ·)), for all 1 ≤ k ≤ K.    ( * * )

We keep the same notation for this new description of the outer-problem.
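As a concrete illustration of the feedforward architecture just described, the realization r(w, ·) can be evaluated as below (a minimal sketch; the layer sizes, the random weights, and the choice ρ = tanh are hypothetical):

```python
import numpy as np

def forward(weights, u, rho=np.tanh):
    """Evaluate r(w, u) = w_L ∘ rho ∘ w_{L-1} ∘ ... ∘ rho ∘ w_1(u).

    weights: list of (A_l, b_l) pairs, with A_l of shape (n_{l+1}, n_l).
    The activation rho is applied coordinate-wise, and is skipped after
    the last affine layer, as in the definition above.
    """
    z = np.asarray(u, dtype=float)
    for l, (A, b) in enumerate(weights):
        z = A @ z + b
        if l < len(weights) - 1:      # output layer stays affine
            z = rho(z)
    return z

# toy network with n_in = 2, one hidden layer of size 3, n_out = 1
rng = np.random.default_rng(0)
w = [(rng.standard_normal((3, 2)), np.zeros(3)),
     (rng.standard_normal((1, 3)), np.zeros(1))]
print(forward(w, np.array([0.5, -1.0])).shape)   # (1,)
```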

General assumptions
Throughout the paper, we will assume that the following set of assumptions holds:

(A1) The mapping ϕ is of class C² over Y × U. For all u ∈ U, there exists y ∈ Y such that ϕ(y, u) = 0.

(A2) For all (y, u) ∈ Y × U, the linear mapping ϕ_y(y, u) : Y → Q is surjective. Moreover, there exists a constant C(y, u) > 0, depending only on (y, u), such that for all ỹ ∈ Y the following estimate holds: ‖ỹ‖_Y ≤ C(y, u) ‖ϕ_y(y, u)ỹ‖_Q.

Instead of (A2), we may consider a weaker assumption, namely (A2)':

(A2)' For all (y, u) ∈ Y × U, the linear mapping ϕ_y(y, u) : Y → Q is almost surjective, namely its range is dense in Q.

(A3) For all ŷ, the mapping u ↦ c(S(u), ŷ) admits global minimizers over U.

(A4) For all ŷ, the cost function c(·, ŷ) : Y → R is of class C².

(A5) The activation function ρ : R → R is of class C².

Assumption (A1) combined with (A2) implies the existence of a parameter-to-state mapping for the inner-problem (see section 2.4). Assumption (A3) is natural, as we aim at designing regularizers for (inner-)problems that are solvable. From (A3) we deduce easily existence of minimizers when the regularizer is non-trivial (Proposition 3.1). While addressing the inner-problems only requires first-order differentiability (section 3), the derivation of optimality conditions for the outer-problem (section 4) necessitates second-order derivatives, in particular for the objective function of the inner-problem, including the cost function c (Assumption (A4)) and the neural network (Assumption (A5)). Throughout the paper these assumptions will be implicitly assumed.

On the parameter-to-state mapping for the inner-problem
The state constraint is given by the implicit function ϕ : Y × U → Q. Under Assumptions (A1)-(A2), we are able to define the parameter-to-state mapping that we denote by S:

Proposition 2.1. Assume (A1)-(A2). There exists a mapping S : U → Y such that for all u ∈ U we have ϕ(S(u), u) = 0. Furthermore, S is of class C² over U.

Proof. The result is due to the implicit function theorem.

Depending on the context, for the sake of convenience we may consider the original equality constraint ϕ(y, u) = 0, or substitute the variable y by S(u). The lack of a parameter-to-state mapping S for the inner-problem would introduce further difficulties that we choose not to address in this paper.

Necessary optimality conditions for the inner-problem
Given w ∈ W, when the regularizer is chosen in the form of a neural network as described in section 2.2, the inner-problems correspond to the following class of optimal control problems:

min over (y, u) ∈ Y × U of c(y, ŷ) + γ(r(w, u)), subject to ϕ(y, u) = 0.    ( * )

Again, we keep the same notation for this new formulation of the inner-problems ( * ). The aim of this section is to provide necessary optimality conditions for them. For that purpose, we first need to investigate the differential properties of the neural networks parameterized by W.

Differential properties of neural networks
Consider the partial neural networks with ℓ layers (1 ≤ ℓ ≤ L), namely r^{(ℓ)}(w^{(ℓ)}, u) = w_ℓ ∘ ρ ∘ w_{ℓ−1} ∘ ρ ∘ ⋯ ∘ ρ ∘ w_1(u), where w^{(ℓ)} := (w_1, . . ., w_ℓ). Recall that the weights w_ℓ are affine functions of R^{n_ℓ} with values in R^{n_{ℓ+1}}, of the form w_ℓ(z) = A_ℓ z + b_ℓ. Observe that for 1 ≤ ℓ ≤ L − 1 the following formula holds:

r^{(ℓ+1)}(w^{(ℓ+1)}, u) = w_{ℓ+1}( ρ(r^{(ℓ)}(w^{(ℓ)}, u)) ).    (1)

From (1), by induction we can verify that the sensitivity of r^{(L)} with respect to u satisfies the identity

(∂r^{(L)}/∂u)(w^{(L)}, u) = A_L ρ'(r^{(L−1)}(w^{(L−1)}, u)) A_{L−1} ⋯ ρ'(r^{(1)}(w^{(1)}, u)) A_1,    (2)

where the factors are multiplied iteratively to the left as the layer index increases. The vector/matrix product ρ'(r^{(ℓ)}(w^{(ℓ)}, u)) A_ℓ is a matrix, and has to be understood as follows:

( ρ'(r^{(ℓ)}) A_ℓ )_{ij} = ρ'( (r^{(ℓ)})_i ) (A_ℓ)_{ij}.    (3)

Without ambiguity, we will simply denote r = r^{(L)} and w = w^{(L)}. Using the mean-value theorem, by induction we can estimate the values of r(w, u) and of its derivatives in terms of the norms of the weights introduced in section 2.2. The differentiation of r with respect to the weight w_ℓ, for 1 ≤ ℓ ≤ L − 1, is given by the following formula, for all w̃_ℓ ∈ Aff(n_ℓ, n_{ℓ+1}, R):

(∂r/∂w_ℓ)(w, u) · w̃_ℓ = A_L ρ'(r^{(L−1)}) A_{L−1} ⋯ A_{ℓ+1} ρ'(r^{(ℓ)}) w̃_ℓ( ρ(r^{(ℓ−1)}) ),

with the convention ρ(r^{(0)}) = u, while for ℓ = L we simply have (∂r/∂w_L)(w, u) · w̃_L = w̃_L( ρ(r^{(L−1)}) ). We can obtain these formulas also by induction, using (1). Here, as in (3), the products by ρ'(r^{(ℓ)}) have to be understood componentwise. These expressions for the derivatives of the neural network will be used in practice for the implementation of the gradient associated with the inner-problem. Let us now derive expressions for this gradient.
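The chain-rule formula (2) for the sensitivity with respect to u can be checked numerically against finite differences (a sketch with hypothetical layer sizes and ρ = tanh):

```python
import numpy as np

def network_with_jacobian(weights, u, rho=np.tanh,
                          drho=lambda z: 1.0 - np.tanh(z) ** 2):
    """Return r(w, u) and its Jacobian with respect to u, implementing
    dr/du = A_L diag(rho'(r^(L-1))) A_{L-1} ... diag(rho'(r^(1))) A_1,
    where r^(l) is the pre-activation output of the partial network."""
    z = np.asarray(u, dtype=float)
    pre = []                                  # pre-activations r^(l)
    for l, (A, b) in enumerate(weights):
        z = A @ z + b
        pre.append(z.copy())
        if l < len(weights) - 1:              # output layer stays affine
            z = rho(z)
    J = weights[-1][0].copy()                 # start from A_L
    for l in range(len(weights) - 2, -1, -1):
        J = J @ np.diag(drho(pre[l])) @ weights[l][0]
    return z, J

# finite-difference check on a random 2-16-1 network (arbitrary sizes)
rng = np.random.default_rng(1)
w = [(rng.standard_normal((16, 2)), rng.standard_normal(16)),
     (rng.standard_normal((1, 16)), rng.standard_normal(1))]
u0 = np.array([0.3, -0.7])
_, J = network_with_jacobian(w, u0)
h = 1e-6
J_fd = np.column_stack([
    (network_with_jacobian(w, u0 + h * e)[0]
     - network_with_jacobian(w, u0 - h * e)[0]) / (2 * h)
    for e in np.eye(2)])
print(np.max(np.abs(J - J_fd)))               # small discrepancy
```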

Direct approach
The parameter-to-state mapping S obtained in Proposition 2.1 enables us to derive classically the necessary optimality conditions for the inner-problems, directly by eliminating the state constraint. Indeed, using y = S(u), and given w ∈ W, we can rewrite Problem ( * ) as

min over u ∈ U of c(S(u), ŷ) + γ(r(w, u)).

Using Assumption (A3) and the formulation above, we obtain existence of minimizers for the inner-problem ( * ):

Proposition 3.1. Under Assumption (A3), for all w ∈ W Problem ( * ) admits a global minimizer.

Proof. From Assumption (A3) and Proposition 2.1, the mapping u ↦ c(S(u), ŷ) admits global minimizers, and due to the regularity of u ↦ γ(r(w, u)), the result follows straightforwardly.

We next obtain necessary optimality conditions involving the derivative of S:

Proposition 3.2. Let w ∈ W. If u is a solution to Problem ( * ), then

G(u, w, ŷ) := S'(u)* c_y(S(u), ŷ) + r_u(w, u)* γ'(r(w, u)) = 0.    (4)

Proof. Following the previous comments and Proposition 2.1, the result follows from the Karush-Kuhn-Tucker conditions (see [Trö10, section 2.8, p. 63]).
In certain problems, like shape optimization for example, evaluating efficiently values of the parameter-to-state mapping S(u), and in particular of S'(u)*, can be challenging numerically, or expensive computationally. Therefore we sometimes prefer to adopt an optimal control approach, also called the adjoint approach, that we develop in our context in section 3.3 below. See the remarks in [Hin+09, section 1.6.1, page 59] for further explanations. Further comments on the possible advantages of the optimal control approach are given in Appendix A.2.

Optimal control formalism for the inner-problem ( * )
Let us recall the initial description of the inner-problem, when we keep the original state constraint:

min over (y, u) ∈ Y × U of c(y, ŷ) + γ(r(w, u)), subject to ϕ(y, u) = 0.    ( * )

Unlike the direct approach, which uses the equivalence ϕ(y, u) = 0 ⇔ y = S(u), the optimal control approach, also called the adjoint approach, consists in taking into account the state constraint by introducing a Lagrange multiplier.
We define the Lagrangian of problem ( * ) as follows:

L(y, u, p) = c(y, ŷ) + γ(r(w, u)) + ⟨p, ϕ(y, u)⟩_{Q', Q}.

Problem ( * ) can be solved by finding a saddle-point of the Lagrangian L, minimized with respect to the variables (y, u) and maximized with respect to p. The sensitivity of L with respect to the variable y yields the adjoint equation, whose unknown is the adjoint state p:

ϕ_y(y, u)* p = −c_y(y, ŷ).    (5)

We define a solution of system (5) by transposition.

Definition 3.3. We say that p ∈ Q' is a solution of system (5) if for all f ∈ Q we have

⟨p, f⟩_{Q', Q} = −⟨c_y(y, ŷ), ỹ⟩_{Y', Y},

where ỹ ∈ Y denotes the solution of the following linear system in Q:

ϕ_y(y, u) ỹ = f.    (7)

From Assumption (A2), system (7) is well-posed. The existence of solutions for the adjoint system (5), in the sense of Definition 3.3, is stated as follows:

Proposition 3.4. Let (y, u) ∈ Y × U. There exists a unique solution p ∈ Q' to system (5), in the sense of Definition 3.3, and there exists a constant C(y, u) > 0 (depending only on (y, u)) such that ‖p‖_{Q'} ≤ C(y, u) ‖c_y(y, ŷ)‖_{Y'}.

Proof. From Assumption (A2)', it is easy to verify that the mapping ϕ_y(y, u)* is injective, and so uniqueness holds. Let us prove existence. Denote by Λ(y, u) : Q → Y the linear operator which maps f to ỹ, where ỹ is the solution of (7). From Assumption (A2), the operator Λ(y, u) is well-defined and bounded, and consequently p = −Λ(y, u)* c_y(y, ŷ) defines a solution in the sense of Definition 3.3; the announced estimate follows from the boundedness of Λ(y, u)*, which completes the proof.
Consequently, the adjoint approach provides another expression for the gradient associated with the inner-problem, namely the quantity G(u, w, ŷ) introduced in (4).
Proposition 3.5. Let w ∈ W. If u is a solution to Problem ( * ), then the gradient G(u, w, ŷ) introduced in Proposition 3.2 writes

G(u, w, ŷ) = ϕ_u(y, u)* p + r_u(w, u)* γ'(r(w, u)),

where y = S(u), p is the solution of (5) corresponding to (y, u), and necessarily G(u, w, ŷ) = 0.
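The agreement between the direct expression of Proposition 3.2 and the adjoint expression of Proposition 3.5 can be illustrated on a linear-quadratic toy problem (the operators M, B and the regularizer (α/2)‖u‖² below are hypothetical stand-ins, not the paper's setting):

```python
import numpy as np

# Toy setting: state equation phi(y, u) = M y - B u = 0,
# cost c(y, yhat) = 1/2 |y - yhat|^2, regularizer (alpha/2) |u|^2.
rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)   # invertible state operator
B = rng.standard_normal((4, 3))
yhat = rng.standard_normal(4)
alpha, u = 0.1, rng.standard_normal(3)

# Direct approach: y = S(u) = M^{-1} B u, gradient via S'(u)* = B^T M^{-T}
y = np.linalg.solve(M, B @ u)
G_direct = B.T @ np.linalg.solve(M.T, y - yhat) + alpha * u

# Adjoint approach: solve M^T p = -(y - yhat), then G = -B^T p + alpha u
p = np.linalg.solve(M.T, -(y - yhat))
G_adjoint = -B.T @ p + alpha * u

print(np.allclose(G_direct, G_adjoint))   # True
```

The adjoint route requires one extra linear solve but never forms S'(u) explicitly, which is the point made in section 3.2.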
The necessary optimality conditions given in Proposition 3.2 are self-consistent, and thus more appropriate for theoretical analysis, while those provided by Proposition 3.5 are more convenient for numerical realization.
Following these two results, the constraint of Problem ( * * ) is then relaxed in the next section, by being simply replaced by the identity G(u, w, ŷ) = 0.
Relaxation approach for the outer-problem

A solution u_k of Problem ( * ) necessarily satisfies G(u_k, w, ŷ_k) = 0, but if u_k satisfies this identity only, then in general it is not necessarily a solution of Problem ( * ). Therefore we consider the relaxed version of ( * * ) below:

min over (w, (u_k)_{1≤k≤K}) of (1/2) ∑_{k=1}^{K} ‖u_k − û_k‖_U² + (ν/2) ‖w‖_W²,
subject to: G(u_k, w, ŷ_k) = 0 for all 1 ≤ k ≤ K.
( * * )
The parameter ν ≥ 0 is given, and is introduced mainly for theoretical purposes (see Theorem 4.1). This regularization term for the set of weights w could also facilitate the numerical solving of Problem ( * * ), by forcing the weights to remain bounded. Note that a parameter-to-solution mapping w ↦ u characterizing G(u, w, ŷ) = 0 is not always available. Even if such a mapping exists, its derivative is difficult to describe explicitly. See section 4.4 for more details. For taking into account the equality constraints of problem ( * * ), we adopt an adjoint approach and introduce multipliers µ ∈ U, with the associated Lagrangian mapping L. Like in section 3.3 for the inner-problems (see Remark 3.6), a solution of Problem ( * * ) can be sought, when K = 1, as a saddle-point of L with respect to the variables (u, w) and µ. In practice, we shall consider u = (u_k)_{1≤k≤K} and µ = (µ_k)_{1≤k≤K}, with the corresponding Lagrangian L^{(K)}. Up to a change of the functional framework, for the theoretical questions we can simply consider the functional introduced in (12), by considering the vector versions of the different variables, namely u = (u_k)_{1≤k≤K}, û = (û_k)_{1≤k≤K} and µ = (µ_k)_{1≤k≤K}. Before deriving stationarity conditions for L, let us discuss existence of solutions for Problem ( * * ).
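A rendering of the Lagrangians (12)-(13) consistent with the adjoint equation and gradient expressions used below (a plausible sketch, following the surrounding references) is:

```latex
\mathcal{L}(u, w, \mu)
  = \tfrac{1}{2}\,\| u - \hat u \|_U^2
  + \tfrac{\nu}{2}\,\| w \|_W^2
  + \big\langle \mu,\, G(u, w, \hat y) \big\rangle_U ,
\qquad
\mathcal{L}^{(K)}(u, w, \mu)
  = \sum_{k=1}^{K} \tfrac{1}{2}\,\| u_k - \hat u_k \|_U^2
  + \tfrac{\nu}{2}\,\| w \|_W^2
  + \sum_{k=1}^{K} \big\langle \mu_k,\, G(u_k, w, \hat y_k) \big\rangle_U .
```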

Existence of minimizers for the relaxed problem
Existence of solutions for Problem ( * * ) follows from standard arguments:

Theorem 4.1. Assume that ν > 0. Then Problem ( * * ) admits a global minimizer.

Proof. The existence of feasible solutions for Problem ( * * ) is due to Proposition 3.1, combined with Proposition 3.2 (or also Proposition 3.5). Since the functional is bounded from below for ν > 0, Problem ( * * ) admits a minimizing sequence (u^{(n)}, w^{(n)})_n of feasible solutions.
Further, from (14), the sequences (u^{(n)})_n and (w^{(n)})_n are bounded in U and W, respectively. Recall that the spaces U and W are finite-dimensional. Consequently, up to extraction, there exists (u, w) ∈ U × W such that (u^{(n)}, w^{(n)})_n converges strongly to (u, w). It is clear that U × W ∋ (u, w) ↦ G(u, w, ŷ) ∈ U is continuous, so that we can pass to the limit in G(u^{(n)}, w^{(n)}, ŷ) = 0 and obtain G(u, w, ŷ) = 0, which concludes the proof, as (u, w) is then a solution of Problem ( * * ).
Remark 4.2. If we consider the case ν = 0, the boundedness of the sequence (w^{(n)})_n in the proof above is a priori not guaranteed. To proceed as we did, in such a case we may need to assume further hypotheses on the neural network. For example, we could exploit the equality G(u^{(n)}, w^{(n)}, ŷ) = 0, which is due to the identity (4). The sequence (u^{(n)})_n still converges, which implies that the quantity r_u(w^{(n)}, u^{(n)})* γ'(r(w^{(n)}, u^{(n)})) is uniformly bounded with respect to n. In the simple one-dimensional case where n_in = n_out = 1 and γ ≡ Id, this would lead us to consider that the boundedness of the gradient of the neural network implies the boundedness of its weights, which is not true in general. Therefore considering ν > 0 seems reasonable, as this provides a simple proof of the existence of minimizers.

Derivatives of second-order for the neural network
In section 4.5, we will need expressions of the second-order derivatives of the neural network. They are obtained by differentiating formula (2) of section 3.1 once more, with respect to u and with respect to each weight w_s (1 ≤ s ≤ L − 1), where we simply denote r^{(k)} = r^{(k)}(w^{(k)}, u); the case s = L is handled separately, as in section 3.1. Like for (3), the products of type (ρ'' r)_i = ρ''_i r_i are defined componentwise. These expressions in themselves are not essential for what follows in the present section, but they can help the reader to reproduce the numerical results we obtained in section 6.

Properties of the adjoint operator
The adjoint equation for Problem ( * * ) is obtained by differentiating the Lagrangian L with respect to the variable u, which gives

G_u(u, w, ŷ)* µ + (u − û) = 0.    (17)

Let us take a look at the operator G_u(u, w, ŷ)* ∈ L(U, U) which arises in this equation. Since U ≃ R^{n_in} is assumed to be of finite dimension, for each (u, w) ∈ U × W the operator G_u(u, w, ŷ) can be represented by a matrix of R^{n_in × n_in}. Further, recalling that G(u, w, ŷ) = (∂/∂u) J(S(u), u, w), it follows that G_u(u, w, ŷ) is a Hessian matrix, and so it is represented by a symmetric matrix. Let us give the expression of G_u(u, w, ŷ), denoting y = S(u):

G_u(u, w, ŷ) = S'(u)* c_yy(y, ŷ) S'(u) + c_y(y, ŷ) · S''(u) + γ'(r(w, u)) · r_uu(w, u) + r_u(w, u)* γ''(r(w, u)) r_u(w, u).

Solving equation (17) determines the value of the variable µ, and thus the question arises whether the operator G_u(u, w, ŷ)* = G_u(u, w, ŷ) is invertible. Let us comment on the general case where G_u(u, w, ŷ) is not necessarily invertible.

When the adjoint operator is invertible
Consider any (u, w) ∈ U × W such that G_u(u, w, ŷ) is an invertible matrix. Analogously to the inner-problems, such a case corresponds to an assumption of type (A2) transcribed to Problem ( * * ), and following section 2.4 we can prove via the implicit function theorem the existence of a mapping w ↦ u characterizing the equality G(u, w, ŷ) = 0. This is a mapping of control-to-state type, since for the outer-problem the variable u plays the role of a state and w is the control. This case corresponds to variables w ∈ W such that the associated inner-problem admits a unique solution u ∈ U. Since G_u(u, w, ŷ) is a square matrix, the matrix G_u(u, w, ŷ)* is also invertible, and so the adjoint equation (17), namely G_u(u, w, ŷ)* µ + (u − û) = 0 in the generic case, admits the unique solution µ = −(G_u(u, w, ŷ)*)^{−1}(u − û). Therefore in this case we are able to derive the necessary Karush-Kuhn-Tucker conditions for Problem ( * * ), as the associated constraint qualification conditions are satisfied without ambiguity (see Theorem 4.4).

When the adjoint operator is not invertible
In the case where G_u(u, w, ŷ) is not invertible, in practice we introduce a perturbation of the latter, based on the following basic observation: since the symmetric matrix G_u(u, w, ŷ) has a finite number of eigenvalues, the matrix G_u(u, w, ŷ) − εI is invertible for every ε > 0 outside this finite set, in particular for every ε > 0 small enough. Therefore, when G_u(u, w, ŷ) is singular, we can find ε(u, w, ŷ) > 0 small enough such that G_u(u, w, ŷ) − ε(u, w, ŷ)I is invertible. Note that, up to adapting the argument above, we can also choose ε(u, w, ŷ) > 0 small enough such that G_u(u, w, ŷ) + ε(u, w, ŷ)I is invertible. Thus we replace G_u(u, w, ŷ) by G_u(u, w, ŷ) ± ε(u, w, ŷ)I.
Note that the data ŷ_k are finite in number, and thus the corresponding solutions u_k of the inner-problems are finite in number too. Therefore ε > 0 can be chosen independently of u and ŷ, but not independently of the number of data K.
In practice, we can choose ε = ε(w) > 0 as small as desired, as long as G_u(u, w, ŷ) ± ε(w)I is invertible. This is equivalent to adding the small L²-type regularizer u ↦ ∓(ε(w)/2)‖u‖_U² to the functional of the inner-problem ( * ).
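Numerically, the perturbation argument amounts to shifting a singular symmetric matrix off its (finite) spectrum; a minimal sketch:

```python
import numpy as np

# A singular symmetric "Hessian" G_u (eigenvalues 0 and 2):
H = np.array([[1.0, 1.0],
              [1.0, 1.0]])

eps = 1e-2   # any eps outside the finite spectrum of H works
mu = np.linalg.solve(H - eps * np.eye(2), np.array([1.0, -1.0]))
print(np.allclose((H - eps * np.eye(2)) @ mu, [1.0, -1.0]))   # True
```

Note that when the right-hand side has a component in the kernel of H, the solution µ scales like 1/ε, which is why ε is kept as large as the invertibility requirement allows in practice.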

Necessary optimality conditions for the relaxed problem
The following result gives necessary conditions that an optimal solution of Problem ( * * ) has to fulfill.
Theorem 4.4. Assume that w ∈ W is a solution of Problem ( * * ). Then for all 1 ≤ k ≤ K we have G(u_k, w, ŷ_k) = 0, and necessarily

G_ℓ(w) := ν w_ℓ + ∑_{k=1}^{K} ( (∂G/∂w_ℓ)(u_k, w, ŷ_k) )* µ_k = 0    (18)

for all 1 ≤ ℓ ≤ L, where µ_k is the solution of

G_u(u_k, w, ŷ_k)* µ_k + (u_k − û_k) = 0    (19)

for each 1 ≤ k ≤ K.
Proof. Since the linear mappings G_u(u_k, w, ŷ_k) ∈ L(U, U) are invertible for each 1 ≤ k ≤ K, or can be perturbed so that they become invertible (see section 4.4.2), the constraint qualification conditions of the Karush-Kuhn-Tucker theorem are satisfied for the Lagrangian (13). Therefore a solution of Problem ( * * ) is necessarily a critical point of L^{(K)}, which leads to the announced identities.
In practice, we solve Problem ( * * ) iteratively, by updating the weights w such that the norm of (18) converges to zero. Given a set of weights w, we first compute each u_k solution of ( * ) corresponding to (ŷ, r) = (ŷ_k, r(w, ·)).
Note that the variable u = (u_k)_{1≤k≤K} plays the role of a state variable in Problem ( * * ). Next we compute µ = (µ_k)_{1≤k≤K}, where each µ_k is the solution of (19). Finally we are able to determine the values of the left-hand sides G_ℓ(w) in (18), which we use in a gradient rule.
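The three steps just described (solve the inner-problem, solve the adjoint equation (19), assemble the gradient (18)) can be mimicked on a hypothetical scalar example, where the "weight" w is a single scalar and everything is available in closed form:

```python
# Scalar toy version (illustrative, not the paper's setting): inner-problem
#   min_u 1/2 (u - yhat)^2 + (w/2) u^2,
# with stationarity G(u, w, yhat) = (u - yhat) + w*u = 0, so u(w) = yhat/(1+w).
yhat, w_true = 2.0, 3.0
uhat = yhat / (1.0 + w_true)       # ground-truth control (the data)

w = 2.0                            # initial weight
for _ in range(200):
    u = yhat / (1.0 + w)           # step 1: solve the inner-problem
    mu = -(u - uhat) / (1.0 + w)   # step 2: adjoint eq. G_u*mu + (u - uhat) = 0
    grad_w = mu * u                # step 3: gradient mu * dG/dw, with dG/dw = u
    w -= 50.0 * grad_w             # gradient rule on the weight
print(round(w, 3))                 # recovers w_true = 3.0
```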

A gradient-based algorithm
This section is devoted to the algorithmic implementation of the relaxation approach for Problem ( * * ). Let us first explain how the inner-problems ( * ) can be solved in practice.

Nesterov gradient descent for solving the inner-problems
We solve the inner-problems with the Nesterov algorithm [Nes83], also called the accelerated gradient method. It offers a nice compromise between robustness and speed of convergence, which matters since the inner-problems need to be solved many times when solving the outer-problem. The corresponding method is given in Algorithm 0 below.
Algorithm 0: Solving the inner-problem for a given regularizer u ↦ r(w, u) via first-order necessary optimality conditions.
Initialization: choose an initial guess u^{(0)}.
Initial gradient: from u^{(0)}, compute the initial gradient G^{(0)} in 3 steps:
- Compute the state y corresponding to u = u^{(0)}, by solving ϕ(y, u) = 0.
- Compute the corresponding adjoint state p, by solving (5).
- Assemble the gradient G^{(0)} = G(u^{(0)}, w, ŷ) with the expression given in Proposition 3.5.
Store u^{(0)} and G^{(0)}.
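A sketch of the accelerated gradient driver around Algorithm 0, on an illustrative quadratic inner-objective (the matrix A, the vector b, the step size, and the iteration count are hypothetical):

```python
import numpy as np

def nesterov(grad, u0, step, iters=2000):
    """Accelerated gradient method [Nes83] with constant step size;
    grad(u) plays the role of the inner gradient G(u, w, yhat)."""
    u = np.asarray(u0, dtype=float)
    v = u.copy()
    t = 1.0
    for _ in range(iters):
        u_next = v - step * grad(v)                       # gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0  # momentum schedule
        v = u_next + ((t - 1.0) / t_next) * (u_next - u)   # extrapolation
        u, t = u_next, t_next
    return u

# illustrative quadratic inner-objective 1/2 u^T A u - b^T u
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
u_star = nesterov(lambda u: A @ u - b, np.zeros(2), step=0.25)
print(u_star)   # close to the exact minimizer [0.2, 0.4]
```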
Note that for evaluating the gradient (18) of the outer-problem, Algorithm 0 needs to be performed K times, since we need to determine each u_k corresponding to ŷ_k. For a given w ∈ W, the functional of Problem ( * * ) is then evaluated with the u_k obtained with Algorithm 0.

Barzilai-Borwein gradient rule for the relaxed outer-problem
We solve the outer-problem with the Barzilai-Borwein algorithm [Ray97]. The corresponding method uses Theorem 4.4, and is explained in Algorithm 1 below, in the case ν = 0.
- Choose random weights as initial weights w^{(0)} for the neural network.
The Barzilai-Borwein method is not monotone, but presents second-order convergence behavior, as the expressions of its gradient steps can also be obtained via a quasi-Newton method, more precisely a BFGS method (see [NW06, Chapter 6] for more details).
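The Barzilai-Borwein step used for the outer-problem can be sketched as follows (BB1 step; the quadratic test objective is an illustrative stand-in for the gradient assembled from (18)):

```python
import numpy as np

def barzilai_borwein(grad, w0, step0=1e-2, iters=100):
    """Barzilai-Borwein gradient method (BB1 step alpha = s.s / s.yd);
    grad(w) stands for the outer gradient assembled from (18)."""
    w_prev = np.asarray(w0, dtype=float)
    g_prev = grad(w_prev)
    w = w_prev - step0 * g_prev          # plain first step
    for _ in range(iters):
        g = grad(w)
        s, yd = w - w_prev, g - g_prev   # weight and gradient increments
        denom = s @ yd
        alpha = (s @ s) / denom if abs(denom) > 1e-14 else step0
        w_prev, g_prev = w, g
        w = w - alpha * g
    return w

# illustrative quadratic objective 1/2 w^T A w - b^T w
A = np.array([[4.0, 0.0],
              [0.0, 1.0]])
b = np.array([2.0, 1.0])
w_min = barzilai_borwein(lambda w: A @ w - b, np.array([5.0, 5.0]))
print(w_min)   # close to the exact minimizer [0.5, 1.0]
```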
Given the definition of U in (20), proving existence and uniqueness of solutions for systems (21)-(22) is classical (see section A.1 for the analysis).

On the relaxed outer-problem
The coefficient ν ≥ 0 was introduced for theoretical purposes. In the numerical experiments we take ν = 0. The relaxed outer-problem ( * * ) is formulated as in section 4. We equip L∞_F(Ω) with the Euclidean norm of R^{n_in}. In order to apply Theorem 4.4 for this particular example, one needs to verify that the theoretical assumptions (A1)-(A3) are satisfied. This is done in the Appendix. Assumption (A4) is satisfied, since c(y, ŷ) = (1/2)‖y − ŷ‖²_{L²(Ω)}. Assumption (A5) is satisfied, as the activation functions chosen for the numerical experiments are tanh and tan⁻¹, which are of class C². We now focus on the numerical results when solving Problem ( * * ), following the method presented in section 5.

Data generation
In what follows, we consider Ω = (0, 1) (and thus d = 1), and discretize the different elliptic equations with P1 finite elements on a uniform mesh made of 100 subdivisions. The parameters u are piecewise constant. Data will be generated with different values of the parameters u, choosing as right-hand sides

f(x) = −u ((2 − π²x²) sin(πx) + 4πx cos(πx)),    g(x) = 0,

such that implicitly the corresponding exact solution is y_ex(x) = x² sin(πx).
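Assuming the state equation is the elliptic problem −u y″ = f with a constant coefficient u (consistent with the inverse conductivity setting recalled in Appendix A.1), the manufactured solution can be verified symbolically:

```python
import sympy as sp

# Check that y_ex(x) = x^2 sin(pi x) satisfies -u * y'' = f
# with the right-hand side f given above.
x, u = sp.symbols('x u')
y_ex = x**2 * sp.sin(sp.pi * x)
f = -u * ((2 - sp.pi**2 * x**2) * sp.sin(sp.pi * x)
          + 4 * sp.pi * x * sp.cos(sp.pi * x))
residual = sp.simplify(-u * sp.diff(y_ex, x, 2) - f)
print(residual)   # 0
```

Note also that y_ex vanishes at both endpoints of Ω = (0, 1), consistently with the homogeneous boundary data g = 0.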

Rediscovering the L²-norm, or not
Data {(ŷ_k, û_k); 1 ≤ k ≤ K} are generated artificially by solving the inner-problems corresponding to the L² regularizer u ↦ (3/2)‖u‖²_U, with U = R. More precisely, given a set of coefficients u_truth, the associated states y_truth =: ŷ_k are computed by solving system (21), and next Problem ( * ), with r(w, u) replaced by the regularizer u ↦ (3/2)‖u‖²_U, is solved with Algorithm 0, which computes the optimal parameters û_k.
Next, a feed-forward neural network is trained on this data set, using Algorithm 1 presented in section 5.
The purpose of this test is to compare the trained neural network with the L² function that represents the ground-truth regularizer to (re-)discover. We designed a feedforward neural network with 16 layers, with uniform hidden-layer dimension n = 1, and x ↦ tanh(x) as activation function. Convergence of the objective function is presented in Figure 1, with values given in Table 1. Figure 2 compares the graph of the trained neural network with that of the ground-truth L² regularizer. In Figure 1 and Table 1 we see that the relative misfit decreased from 54% to 0.27% after a few iterations.
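A minimal sketch of such a network (scalar input and output, 16 layers, uniform hidden width n = 1, tanh activations) is given below; the random initialization scale and seed are illustrative assumptions, not the paper's choices:

```python
import numpy as np

def init_params(n_layers=16, width=1, scale=0.1, seed=0):
    """Weights and biases of a scalar-input, scalar-output feedforward
    network with n_layers layers of uniform hidden width."""
    rng = np.random.default_rng(seed)
    dims = [1] + [width] * (n_layers - 1) + [1]
    return [(scale * rng.standard_normal((dims[i + 1], dims[i])),
             scale * rng.standard_normal(dims[i + 1]))
            for i in range(len(dims) - 1)]

def regularizer(params, u):
    """Evaluate r(w, u): tanh on all layers except the last, linear output."""
    h = np.atleast_1d(float(u))
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)
    W, b = params[-1]
    return float(W @ h + b)
```

With hidden width 1 the network reduces to a composition of scalar affine maps and tanh, which is already enough to mimic the ground-truth regularizer on the range of coefficients seen during training.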
The curves of Figure 2 suggest non-uniqueness for Problem ( * * ), as the residual misfit is of the same order as the approximation error made when discretizing (21). Therefore the trained neural network does the same job as the original L²-regularizer, although it is clearly different from it. We note that increasing the number of layers (more specifically to L = 32 or L = 64) does not change the result, and in particular does not improve the approximation quality.

Regularizers for compensating noise on the data
Given a set of K = 10 piecewise constant coefficients u_true = (u_true,k)_{1≤k≤K}, we compute the corresponding states y_true = (y_true,k)_{1≤k≤K}, solutions of (21). We then artificially introduce Gaussian noise that transforms each y_true,k into y_noisy,k. The pairs {(ŷ_k, û_k) := (y_noisy,k, u_true,k), 1 ≤ k ≤ K} constitute the data set on which we train the neural network as regularizer for the inner-problem. The aim of such a regularizer is to compensate the effects of the noise. Problem ( * ) equipped with the so-trained regularizer then enables us to determine the desired coefficients u, even though only the noisy state ŷ is available. In the illustration presented in Figure 3 and Table 2, the initial misfit (corresponding to a neural network close to zero) is about 54%. We trained a feedforward neural network with L = 8 layers, and x ↦ tan⁻¹(x) as activation function. We see that only a few iterations were needed to obtain a misfit lower than 5%. The type of neural network we consider does not allow us to reduce the misfit below 3.32%. Again, increasing the number of hidden layers did not improve the result.
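The noisy data set can be produced with a short helper like the following; the noise level sigma and the seed are illustrative assumptions (the paper does not specify the noise magnitude):

```python
import numpy as np

def make_noisy_dataset(y_true, u_true, sigma=0.01, seed=0):
    """Build the training pairs (y_hat_k, u_hat_k) := (y_noisy_k, u_true_k)
    by adding i.i.d. Gaussian noise of standard deviation sigma to each
    discretized state y_true_k (a NumPy array of nodal values)."""
    rng = np.random.default_rng(seed)
    y_noisy = [y + sigma * rng.standard_normal(y.shape) for y in y_true]
    return list(zip(y_noisy, u_true))
```

Training the regularizer on these pairs then asks the network to map noisy observations back to the coefficients that generated the clean states.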

Conclusion
We have proposed a method for addressing in practice the data-driven design of regularizers that improve a class of inverse problems formulated as optimal control problems. The class of regularizers is restricted to a set of artificial neural networks, which is consistent with the genuinely nonlinear nature of such a problem.
Our approach is bi-level, and based on a relaxation exploiting the necessary optimality conditions for these problems. We saw that this type of approach enables practical and efficient implementation, provided that the different objective functions and the activation functions of the neural network are smooth. Developing a similar approach when the involved quantities are less smooth could present an interesting challenge. Further, investigating the applicability of our approach to more complex problems (like image segmentation, for example) would demonstrate its relevance even more. In such a context, one may need to consider neural network architectures that allow more sophisticated realizations. Finally, noticing that our approach requires the evaluation of second-order derivatives of the parameter-to-state mapping of the inner-problem, developing higher-order adjoint-based methods for deriving the optimality conditions of the outer-problem could improve the implementation and the applicability to large-scale tasks.

A Appendix
In this section we deal with the example considered in section 6, namely the problem of identifying a conductivity represented by the variable u in the elliptic partial differential equation above. The result below establishes existence of minimizers for Problem (26), and thus that Assumption (A3) is fulfilled.
Proof. Since the function J of Problem (26) is bounded from below, and since system (21) is well-posed, J admits a minimizing sequence (y_n, u_n)_n of feasible solutions. From the definition of U_{m,M}, the sequence (u_n)_n is uniformly bounded in L^∞_F(Ω), which is of finite dimension. Therefore, up to extraction, the sequence (u_n)_n converges strongly in L^∞(Ω) towards some u ∈ U_{m,M}. From estimate (24), the sequence (y_n)_n is bounded in H¹(Ω).
Passing to the limit, we deduce that −div(u∇y) = f in H⁻¹(Ω). Passing to the limit in the equality y_n|_∂Ω = g is straightforward. Thus (y, u) is a solution of Problem (26), which completes the proof.
A.2 On the optimal control approach for solving an inverse problem
Let us fix the ANN weights w for this subsection. A classical approach for solving inverse problems consists in taking into account the equality constraint ϕ(y, u) = 0 by incorporating it into the functional to minimize. This approach would require differentiation of the data (represented here by u), more specifically the computation of ∇u, and thus could lead to approximation biases, especially when the set of data is incomplete. Instead, we consider the state equation as a constraint that we impose by introducing a Lagrange multiplier (p, q) ∈ H¹₀(Ω) × H^{−1/2}(∂Ω) =: Q′, namely the adjoint variable, leading to the search for a saddle-point of a Lagrangian functional, given as: L(y, u, p, q) = c(y, ŷ) + γ r(w, u) + ⟨(p, q), ϕ(y, u)⟩_{Q′,Q}.

4.1 Relaxed formulation of the main problem
The aim of the present section is to translate the abstract constraint of Problem ( * * ), namely "u_k is solution of Problem ( * ) [. . .]", into an equality constraint. For that purpose, we use the necessary vanishing-gradient condition given in Propositions 3.2 and 3.5. Note that a solution to Problem ( * ) with (ŷ, r) = (ŷ_k, r(w, •))

Figure 1: Misfit evolution during gradient steps for training a neural network from data generated with an L²-regularizer, in linear scale (left) and log scale (right).

Figure 2: Graph of the trained neural network in blue, and of the ground-truth L² regularizer in red. (For interpretation of the references to color in this figure legend, the reader is referred to the digital version of this article.)

Figure 3: Misfit evolution during gradient steps for training a noise-compensating neural network, in linear scale (left) and log scale (right).

Table 1: Misfit values (in %) through the different gradient steps.

Table 2: Misfit values (in %) through the different gradient steps.