Sequential Projected Newton method for regularization of nonlinear least squares problems

We develop a computationally efficient algorithm for the automatic regularization of nonlinear inverse problems based on the discrepancy principle. We formulate the problem as an equality constrained optimization problem, where the constraint is given by a least squares data fidelity term and expresses the discrepancy principle. The objective function is a convex regularization function that incorporates some prior knowledge, such as the total variation regularization function. Using the Jacobian matrix of the nonlinear forward model, we consider a sequence of quadratically constrained optimization problems that can all be solved using the Projected Newton method. We show that the solution of such a quadratically constrained sub-problem results in a descent direction for an exact merit function. This merit function can then be used to describe a formal line-search method. We also formulate a slightly more heuristic approach that simplifies the algorithm and allows for an inexact solution of the sequence of sub-problems. We illustrate the behavior of the algorithm using a number of numerical experiments, with Talbot-Lau X-ray phase contrast imaging as the main application. The numerical experiments confirm that the quadratically constrained sub-problems need not be solved with high accuracy in early iterations to make sufficient progress towards the solution. In addition, we show that the proposed method is able to produce reconstructions of similar quality compared to other state-of-the-art approaches with a significant reduction in computational time.


Introduction
In this manuscript we consider the regularized nonlinear least squares problem

min_{x ∈ R^n} Ψ(x) subject to (1/2)||s(x) − b||^2 ≤ σ^2/2, (1)

where Ψ : R^n → R is a convex, twice continuously differentiable function. The function s : R^n → R^m with m ≥ n models some nonlinear forward operation and b ∈ R^m is some data contaminated with measurement errors or noise. Here, σ is an estimate of the noise-level, i.e. σ ≈ ||b_ex − b||, for some "exact" data b_ex. In the following we assume that the unconstrained solution of the least squares problem is below the noise-level, i.e. that min_{x ∈ R^n} ||s(x) − b|| ≤ σ. This implies that the constraint in (1) is feasible and thus there exists at least one local solution. Furthermore, we assume s(x) has component functions s_i(x) : R^n → R for i = 1, ..., m that are all continuously differentiable, such that the Jacobian matrix J(x) is well-defined. Let us denote the constraint function as c(x) = (1/2)||s(x) − b||^2 − σ^2/2. It is an easy calculation to verify that the gradient of the constraint function is given by ∇c(x) = J(x)^T (s(x) − b). Note that the constraint function c(x) is not necessarily convex.
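As an illustrative sketch (the paper's experiments use MATLAB; here we use Python/NumPy), the constraint function c(x) and its gradient ∇c(x) = J(x)^T(s(x) − b) can be implemented and verified against finite differences. The model s(x) = exp(−Ax) is a hypothetical stand-in chosen only because its Jacobian is easy to write down:

```python
import numpy as np

# Toy nonlinear forward model s(x) = exp(-A @ x), applied component-wise
# (a hypothetical choice; the paper only assumes s is continuously
# differentiable with m >= n).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) * 0.3
b = rng.standard_normal(5)
sigma = 0.1

def s(x):
    return np.exp(-A @ x)

def jacobian(x):
    # J_ij = ds_i/dx_j = -exp(-(A@x))_i * A_ij
    return -np.exp(-A @ x)[:, None] * A

def c(x):
    # Constraint function c(x) = 0.5*||s(x)-b||^2 - sigma^2/2
    return 0.5 * np.linalg.norm(s(x) - b) ** 2 - sigma ** 2 / 2

def grad_c(x):
    # Gradient: J(x)^T (s(x) - b)
    return jacobian(x).T @ (s(x) - b)

# Finite-difference check of the gradient formula
x = rng.standard_normal(3)
h = 1e-6
fd = np.array([(c(x + h * e) - c(x - h * e)) / (2 * h) for e in np.eye(3)])
assert np.allclose(fd, grad_c(x), atol=1e-6)
```

The finite-difference check is a useful sanity test when implementing ∇c(x) for any concrete forward model.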
The conditions in lemma 1.1 are very mild, since ∇Ψ(x*) = 0 would mean that x* is an unconstrained minimizer of the convex regularization function. Similarly, ∇c(x*) = 0 would express the first order necessary optimality condition of the unconstrained optimization problem argmin_x ||s(x) − b||.
Note that the equations in (2) are precisely the KKT conditions of the equality constrained problem

min_{x ∈ R^n} Ψ(x) subject to (1/2)||s(x) − b||^2 = σ^2/2. (3)
Here, the constraint expresses the discrepancy principle [2], which basically says that a point x that satisfies ||s(x) − b|| < σ is expected to be 'over-fitted' to the noisy data b. Among all possible solutions x that fit the data b 'well enough' without being over-fitted, the one that minimizes the regularization function Ψ(x) is chosen. The equality constrained optimization problem (3) is closely related to the (unconstrained) regularized nonlinear least squares problem

min_{x ∈ R^n} (1/2)||s(x) − b||^2 + αΨ(x) (4)

with fixed regularization parameter α > 0. The first order optimality conditions of (4) are given by

J(x*)^T (s(x*) − b) + α∇Ψ(x*) = 0.
Any point x* that satisfies these equations also satisfies the first equation of the KKT conditions (2) of the equality constrained optimization problem (3) with λ* = 1/α. Hence, solving (3) can be seen as an approach to simultaneously solve the regularized nonlinear least squares problem (4) and find the corresponding regularization parameter α such that the discrepancy principle is satisfied. Other well-known parameter choice methods include generalized cross validation and the L-curve criterion. We refer to [2] for more general information on the different parameter choice methods. In this work we consider a novel approach to compute a solution (x*, λ*) of the nonlinear system of equations (2) by solving a sequence of sub-problems, obtained by using the linear approximation s(x + p) ≈ s(x) + J(x)p for the constraint in (3) for some x ∈ R^n. By Taylor's theorem we have s(x + p) = s(x) + J(x)p + o(||p||). Hence it follows that

||s(x + p) − s(x) − J(x)p|| = o(||p||). (5)

When J(x) is Lipschitz continuous the bound o(||p||) is replaced by O(||p||^2).
After some calculations it follows that we have the following equivalent expressions:

(1/2)||J(x)p + s(x) − b||^2 − σ^2/2 = c(x) + ∇c(x)^T p + (1/2) p^T J(x)^T J(x) p.

Hence it is clear that taking a linear approximation of the forward model s(x + p) corresponds to taking a quadratic model of the constraint function c(x + p). Alternatively, we could have immediately considered a quadratic model of c(x + p) using Taylor's theorem:

c(x + p) ≈ c(x) + ∇c(x)^T p + (1/2) p^T ∇^2 c(x) p.

However, since c(x) is in general non-convex, the Hessian ∇^2 c(x) is possibly indefinite, which is undesirable for optimization algorithms. Moreover, for a least squares function we know that the matrix J(x)^T J(x) is an approximation of the true Hessian ∇^2 c(x), see for instance chapter 10 in [1]. This approximation is for instance used in the Gauss-Newton algorithm for solving a least squares minimization problem. The main idea of the method proposed in this manuscript is to solve a sequence of problems of the form

min_{p ∈ R^n} Ψ(x + p) subject to (1/2)||J(x)p + s(x) − b||^2 = σ^2/2 (6)

for a given iterate x ∈ R^n. The KKT conditions for this optimization problem are given by

∇Ψ(x + p) + λ J(x)^T (J(x)p + s(x) − b) = 0,
(1/2)||J(x)p + s(x) − b||^2 = σ^2/2, (7)

where λ ∈ R is the Lagrange multiplier. Alternatively, we could also consider a quadratic model of the objective function and solve a sequence of problems

min_{p ∈ R^n} Ψ(x) + ∇Ψ(x)^T p + (1/2) p^T ∇^2 Ψ(x) p subject to (1/2)||J(x)p + s(x) − b||^2 = σ^2/2. (8)

This idea forms the basis of one of the reference methods that we consider in this manuscript. In this case we get an approach that closely resembles the one taken in [3], namely the Sequential Quadratically Constrained Quadratic Programming (SQCQP) method. The SQCQP method is slightly more general in the sense that it works with possibly multiple constraint functions c_i(x) : R^n → R for i = 1, ..., m. However, the authors assume convexity of these functions and then use the true Hessians ∇^2 c_i(x) in the quadratic model. With the numerical experiments, see section 5, it will become clear that in the context of the current manuscript it is better to solve a sequence of problems of the form (6). Hence, we focus our presentation on this case.
However, many of the results presented here can be adapted to the case where we solve a sequence of problems (8) instead.
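The point about the Gauss-Newton approximation can be made concrete with a small Python/NumPy sketch (the toy model s(x) = exp(−Ax) is a hypothetical stand-in, chosen so that the second derivatives of s are easy to write down). For c(x) = (1/2)||s(x) − b||^2 the true Hessian equals J^T J plus a residual-weighted second-order term, and only the first part is guaranteed positive semi-definite:

```python
import numpy as np

# For c(x) = 0.5*||s(x)-b||^2 the true Hessian is
#   hess c(x) = J(x)^T J(x) + sum_i (s_i(x) - b_i) * hess s_i(x);
# the Gauss-Newton matrix J^T J drops the residual-weighted second-order
# term and is always positive semi-definite. Toy model s(x) = exp(-A @ x).
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3)) * 0.5
b = rng.standard_normal(4)
x = rng.standard_normal(3)

sx = np.exp(-A @ x)
J = -sx[:, None] * A                       # Jacobian of s at x
gn = J.T @ J                               # Gauss-Newton approximation of hess c
# hess s_i = exp(-(A@x)_i) * a_i a_i^T, with a_i the i-th row of A
second = sum((sx[i] - b[i]) * sx[i] * np.outer(A[i], A[i]) for i in range(4))
hess = gn + second                         # true Hessian of c at x

# J^T J is positive semi-definite regardless of the residual s(x) - b,
# while the true Hessian can become indefinite for large residuals.
assert np.linalg.eigvalsh(gn)[0] >= -1e-10
assert np.allclose(hess, hess.T)
```

This is exactly why the sub-problem (6) has a convex quadratic constraint even though c(x) itself is non-convex.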
A second reference method that we consider for the numerical experiments is based on the approximation J(x) T J(x) ≈ ∇ 2 c(x). The algorithm can be seen as a type of Gauss-Newton method and is able to solve the regularized nonlinear inverse problem with fixed regularization parameter α > 0 shown in (4). The linear systems of equations that appear in each iteration can be solved using the Conjugate Gradient method [4].
Note that even when (2) has a solution, this does not necessarily imply that (7) has a solution for all x. However, using the implicit function theorem, we can show that this sub-problem has a solution for all x in a neighborhood of the solution x*. The implicit function theorem can be formulated as follows:

Theorem 1.2 (Implicit function theorem). Let Φ : R^n × R^q → R^q be continuously differentiable and let (x*, y*) ∈ R^n × R^q be such that Φ(x*, y*) = 0. If the Jacobian of Φ with respect to y evaluated in (x*, y*), which is a matrix in R^{q×q}, is invertible, then there exists some open set X ⊂ R^n containing x* such that there exists a unique continuously differentiable function g : X → R^q such that g(x*) = y* and Φ(x, g(x)) = 0 for all x ∈ X.
Proof. To prove this claim we apply the implicit function theorem to the function Φ : R^n × R^{n+1} → R^{n+1} defined by

Φ(x, p, λ) = ( ∇Ψ(x + p) + λ J(x)^T (J(x)p + s(x) − b) ; (1/2)||J(x)p + s(x) − b||^2 − σ^2/2 ).

Obviously we have Φ(x*, 0, λ*) = 0 since (x*, λ*) solves (2). Moreover, the Jacobian of Φ(x, p, λ) with respect to (p, λ) is given by

( ∇^2 Ψ(x + p) + λ J(x)^T J(x)    J(x)^T (J(x)p + s(x) − b) ;
  (J(x)p + s(x) − b)^T J(x)       0 ).

Evaluating this matrix in (x*, 0, λ*) gives us

( ∇^2 Ψ(x*) + λ* J(x*)^T J(x*)    ∇c(x*) ;
  ∇c(x*)^T                         0 ),

which is non-singular since ∇^2 Ψ(x*) + λ* J(x*)^T J(x*) is positive definite and ∇c(x*) ≠ 0. Hence, from theorem 1.2 it follows that there exists an open set X ⊂ R^n containing x* such that there exists a unique continuously differentiable function g : X → R^{n+1} such that g(x*) = (0, λ*) and Φ(x, g(x)) = 0 for all x ∈ X. It is now clear that g(x) = (p, λ) is a solution for (7). Furthermore, by continuity of g(x) and the fact that λ* > 0 we can choose a radius ρ > 0 small enough such that B_ρ(x*) ⊂ X and λ > 0.
The rest of the paper is organized as follows. In section 2 we derive the Sequential Projected Newton method, which is based on (approximately) solving a sequence of problems of the form (6). In section 3 we give some comments on related work and describe two reference methods. Next, in section 4, we introduce the application that we consider for our numerical experiments, namely Talbot-Lau X-ray phase contrast imaging. In section 5 we perform a number of experiments with the proposed algorithm and compare it with the two reference methods described in section 3. First, we show that the sub-problems (6) do not need to be solved with a high accuracy to make reasonable progress towards the solution. Next, we illustrate the fact that the Sequential Projected Newton method is significantly less computationally expensive compared to the two reference methods, while producing solutions of similar quality. Lastly, the work is concluded and an outlook is given in section 6.

Derivation of the Sequential Projected Newton method
In this section we describe a line-search strategy that can be used when solving a sequence of problems of the form (7). Let us first describe a good choice of initial point x 0 to start our algorithm. We know that there exists a point x 0 that satisfies ||s(x 0 ) − b|| ≤ σ, since we assumed that the unconstrained solution is below the noise-level. Moreover, such a point can be computed very easily by performing just a few Gauss-Newton iterations applied to argmin x∈R n ||s(x) − b||. For such a point we know that the linearized inequality constraint ||J(x 0 )p + s(x 0 ) − b|| ≤ σ is feasible, since p = 0 satisfies the inequality. From this it follows that the equality constraint ||J(x 0 )p + s(x 0 ) − b|| = σ is also feasible, since the constraint can be seen as a convex paraboloid. This means that the nonlinear system of equations (7) has a solution for x = x 0 and hence we can use this as our initial point. See remark 3.1 for more details on how we can compute such a point.
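The computation of such an initial point can be sketched in a few lines of Python/NumPy (the paper's experiments use MATLAB; the model s(x) = exp(−Ax) is a hypothetical stand-in). We simply run plain Gauss-Newton iterations on argmin_x ||s(x) − b|| and stop as soon as the residual drops below σ:

```python
import numpy as np

# Sketch: obtain an initial point x0 with ||s(x0) - b|| <= sigma by a few
# plain Gauss-Newton iterations on min_x ||s(x) - b||. The model is a toy
# stand-in; for the real problem, s and its Jacobian come from the forward
# model under consideration.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3)) * 0.3
x_exact = rng.standard_normal(3)

def s(x):
    return np.exp(-A @ x)

def jac(x):
    return -np.exp(-A @ x)[:, None] * A

b_exact = s(x_exact)
noise = rng.standard_normal(6)
b = b_exact + 0.01 * np.linalg.norm(b_exact) * noise / np.linalg.norm(noise)
sigma = np.linalg.norm(b - b_exact)        # noise-level (known in this synthetic setting)

x0 = np.zeros(3)
for _ in range(50):
    if np.linalg.norm(s(x0) - b) <= sigma:
        break
    # Gauss-Newton step: linearize s around x0 and solve the least squares problem
    p, *_ = np.linalg.lstsq(jac(x0), b - s(x0), rcond=None)
    x0 = x0 + p

assert np.linalg.norm(s(x0) - b) <= sigma  # the linearized constraint is now feasible
```

In practice only a handful of Gauss-Newton steps are needed, which is why this initialization is cheap.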
Let us denote [·]_+ = max{0, ·}. It is well known that the following function is an exact merit function for the optimization problem (1):

F_r(x) = Ψ(x) + r [c(x)]_+,

where r > 0 is a penalty parameter. The authors in [3] use this merit function in proving global convergence of the SQCQP method. The analysis in this section is inspired by this work.
In this section we show that if we choose the penalty parameter r > 0 large enough, we can then always find a step-length β ∈ (0, 1] such that F_r(x + βp) < F_r(x). We start by proving the following lemma:

Proof. By Taylor's theorem we have that there exists some t_β ∈ (0, 1) such that and thus we have for all β that with φ = max_{t ∈ (0,1)}. Moreover, we also have Using this inequality together with (9) and the KKT conditions (7), we get

Lemma 2.2. Let x ∈ B_ρ(x*) with corresponding KKT pair (p, λ) that satisfies (7) and r > λ. Let 0 < η < 1; then we have for all β sufficiently small

Proof. From lemma 2.1 we have that there exists a constant φ_1 > 0 such that From the second equation in (7) it follows that c(x) + ∇c(x)^T p + (1/2) p^T J(x)^T J(x) p = 0. From this it also follows that ∇c(x)^T p = −c(x) − (1/2) p^T J(x)^T J(x) p. Again using Taylor's theorem we have for some t_β ∈ (0, 1) that Let us now write The final inequality follows from the fact that r > λ. It is clear that (10) holds for all values of β that satisfy By rearranging the terms it follows that the above inequality holds for all β with This concludes the proof since the right-hand side in this inequality is strictly positive.
Theorem 2.3. Suppose we generate an infinite sequence of iterates x_k = x_{k−1} + β_k p_k that all satisfy descent condition (10) for penalty parameter r_k = max(2λ_k, r_{k−1}), with (p_k, λ_k) the KKT pair (7) corresponding with x_{k−1} and r_0 an initial penalty parameter. Assume that ∇^2 Ψ(x_{k−1}) + λ_k J(x_{k−1})^T J(x_{k−1}) ⪰ µI for some constant µ > 0 independent of k. Suppose in addition that the iterates (x_k, λ_k) remain bounded and that there exists a constant γ independent of k such that β_k λ_k > γ > 0. Then there exists a sub-sequence K ⊂ N such that (x_k, λ_k)_{k ∈ K} converges to a solution of (2).
Proof. By boundedness of (x_k, λ_k) we know, using the Bolzano-Weierstrass theorem, that there exists a convergent sub-sequence (x_k, λ_k)_{k ∈ K} → (x̄, λ̄) for some K ⊂ N. Hence, we can assume that r_k = r for some constant r for all k ∈ K sufficiently large. Since the descent condition (10) holds and λ_k β_k > γ we have for k sufficiently large that the decrease F_r(x_{k−1}) − F_r(x_k) is bounded below by a positive multiple of ||p_k||^2. In addition we have that the sequence F_r(x_k) is decreasing and bounded below and thus it must converge. Hence, we can conclude that p_k → 0. Writing out the KKT conditions (7) for (x_k, λ_k) we get

∇Ψ(x_{k−1} + p_k) + λ_k J(x_{k−1})^T (J(x_{k−1}) p_k + s(x_{k−1}) − b) = 0,
(1/2)||J(x_{k−1}) p_k + s(x_{k−1}) − b||^2 = σ^2/2.

Now by taking the limit for k ∈ K and using the fact that p_k → 0 we have

∇Ψ(x̄) + λ̄ J(x̄)^T (s(x̄) − b) = 0, (1/2)||s(x̄) − b||^2 = σ^2/2,

i.e. (x̄, λ̄) satisfies (2), which concludes the proof.

5: Update penalty parameter r_{k+1} = max(2λ_{k+1}, r_k).
6: Choose β_{k+1} the largest value in 1, θ, θ^2, θ^3, ... such that the descent condition (10) is satisfied.

Remark 2.4. Boundedness of the Lagrange multiplier λ_k is the most important assumption in theorem 2.3. Otherwise, the descent property (10) is meaningless, since this would imply r_k → ∞. Boundedness of x_k is not necessary to prove that p_k → 0, which can be seen by carefully reading the proof. By (5) we know that this implies that the sub-problem (6) becomes a better and better approximation of (3). Hence, it might be possible to remove the boundedness assumption on x_k. However, since for practical purposes (which will become clear in the rest of the manuscript) we need to consider a more heuristic approach, we believe there is little added value in explicitly proving this. Theorem 2.3 allows us to formulate a line-search method for solving (2), see algorithm 1. Convergence of the algorithm can be monitored using the nonlinear function

F(x, λ) = ( ∇Ψ(x) + λ J(x)^T (s(x) − b) ; c(x) ) (11)

that expresses the KKT conditions (2) of the equality constrained optimization problem (3) (not to be confused with the merit function F_r(x) used in the line-search). In section 5 we also investigate other possible convergence metrics. Note that the definition of λ_0 on line 2 can be seen as the least squares approximation of the Lagrange multiplier corresponding with x_0, see for instance [7]. First of all, by lemma 2.2 we have that the backtracking line-search on line 6 is well-defined and terminates with some β_{k+1} > 0. Furthermore, we know by theorem 2.3 that this algorithm converges towards the solution of (2) under certain assumptions.
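The backtracking rule on line 6 can be sketched as follows in Python (the paper's code is MATLAB; the merit function, constraint and descent test below are simplified toy stand-ins, and we use a plain decrease test rather than the full condition (10)):

```python
import numpy as np

# Sketch of a backtracking line-search on the exact penalty merit function
# F_r(x) = Psi(x) + r*[c(x)]_+ : try step lengths 1, theta, theta^2, ...
# until the merit function decreases.
def backtrack(F_r, x, p, theta=0.5, max_tries=30):
    beta = 1.0
    for _ in range(max_tries):
        if F_r(x + beta * p) < F_r(x):
            return beta
        beta *= theta
    raise RuntimeError("no acceptable step length found")

# Toy ingredients (hypothetical, for illustration): smoothed l1 objective
# and a simple quadratic constraint function c(x).
r, xi = 10.0, 1e-6
Psi = lambda x: np.sum(np.sqrt(x ** 2 + xi))
c = lambda x: 0.5 * np.linalg.norm(x - 1.0) ** 2 - 0.05
F_r = lambda x: Psi(x) + r * max(c(x), 0.0)

x = np.zeros(3)
p = np.ones(3)                      # a descent direction for this toy F_r
beta = backtrack(F_r, x, p)
assert 0 < beta <= 1.0
assert F_r(x + beta * p) < F_r(x)   # merit function strictly decreased
```

In the experiments of section 5 the full step β = 1 is accepted in every iteration, which is what motivates the heuristic variant below.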
To improve the practical performance of algorithm 1 we describe a slightly more heuristic approach. In principle we can use any constrained optimization algorithm to solve (6) on line 4. However, the Projected Newton (PN) method [8,9] is particularly well suited to solve this problem, since it exploits some of the problem-specific properties. The Projected Newton method is itself an iterative method that uses Generalized Krylov subspaces to solve a sequence of projected problems. The number of floating point operations used in the Projected Newton method grows with the iteration index of the algorithm, so for performance it is important that the number of (inner) iterations required to converge remains relatively small. More specifically we have that the number of floating point operations in the Projected Newton method grows with O(k 2 inner ), where k inner is the inner iteration index. Hence, if we require that (6) is solved very accurately, then algorithm 1 will not be very efficient. In the numerical experiments section we illustrate that (6) does not have to be solved very accurately to be able to make progress towards the solution. Furthermore it turns out that (6) is a very accurate model of (3) in the sense that the descent property (10) is satisfied by β = 1 for all iterations in the experiments of section 5. Hence, to improve performance of the algorithm and remove the seemingly unnecessary line-search we describe a slightly more heuristic approach.
To formulate the computationally efficient variant of algorithm 1 we define the system of nonlinear equations

F^{(k)}(p, λ) = ( ∇Ψ(x_k + p) + λ J(x_k)^T (J(x_k)p + s(x_k) − b) ; (1/2)||J(x_k)p + s(x_k) − b||^2 − σ^2/2 ) (12)

that expresses the KKT conditions (7) of the sub-problem on line 4 of algorithm 1 for x = x_k. Algorithm 2 is a heuristic approach that works very well in practice. We illustrate this claim by a number of numerical experiments in section 5.
4: Apply Projected Newton method [9] to sub-problem (6), with a stopping criterion on ||F^{(k)}(p_{k+1}, λ_{k+1})|| based on the parameter ζ.
5: Update x_{k+1} = x_k + p_{k+1}
6: k = k + 1
7: end while

Remark 2.5. The introduction of the parameter ζ in algorithm 2 is inspired by the class of inexact Newton methods [10], which is a generalization of Newton's method for solving a nonlinear system of equations F(v) = 0 with F : R^n → R^n. Instead of solving the Newton system exactly, the step is only required to reduce the linearized residual up to a relative tolerance given by a forcing sequence. Choosing a suitable forcing sequence can significantly improve practical performance [11,12].

Reference methods and related work
As a first reference method we consider an algorithm that is closely related to the Sequential Quadratically Constrained Quadratic Programming (SQCQP) method developed in [3]. It can be seen as a modified version of algorithm 1, where we replace sub-problem (6) on line 4 with sub-problem (8) that uses a quadratic model of Ψ(x + p). Moreover, we also allow an approximate solution of the sub-problem using the ζ parameter, similarly as in algorithm 2. Accuracy of the sub-problem is monitored using the analogue of (12) for sub-problem (8). The resulting algorithm is shown in algorithm 3 and is referred to as SPN-Q (where Q stands for "quadratic"). As previously mentioned, we can readily modify the analysis in section 2 to show similar results for this modified version. In fact, the authors in [3] provide a local and global convergence analysis of the closely related SQCQP method.

4: Apply Projected Newton method [9] to sub-problem (8), with a stopping criterion based on ζ.
5: Update penalty parameter r_{k+1} = max(2λ_{k+1}, r_k).
6: Choose β_{k+1} the largest value in 1, θ, θ^2, θ^3, ... such that the descent condition (10) is satisfied.
7: Update x_{k+1} = x_k + β_{k+1} p_{k+1} and λ_{k+1} = λ_k + β_{k+1}(λ_{k+1} − λ_k).

To describe the second reference method, we consider the unconstrained reformulation of the regularized nonlinear inverse problem with fixed regularization parameter given by (4). The main difficulty with solving (4) is the fact that the fit-to-data term (1/2)||s(x) − b||^2 is in general a non-convex function. This means that the solution is not necessarily unique, but more importantly, this also means that the Hessian ∇^2 f(x) of the objective function f(x) = (1/2)||s(x) − b||^2 + αΨ(x) can possibly be indefinite. This in turn implies that the standard Newton direction p = −(∇^2 f(x))^{−1} ∇f(x) is not necessarily a descent direction.
However, this difficulty can be circumvented by instead computing the Gauss-Newton direction p from

(J(x)^T J(x) + α∇^2 Ψ(x)) p = −(J(x)^T (s(x) − b) + α∇Ψ(x)). (15)

A similar approach is for instance taken in [13], where the authors consider nonlinear inverse problems where the forward model is described by a system of partial differential equations. The right-hand side in (15) is precisely −∇f(x) (i.e. the same as in Newton's method) and J(x)^T J(x) is a positive-definite approximation of the true Hessian of the nonlinear least squares term (1/2)||s(x) − b||^2. It is then easy to show that p ∈ R^n is a descent direction for f(x), i.e. that p^T ∇f(x) ≤ 0.
This means that we can compute a suitable step-length such that we have a decrease in the objective function, f(x + βp) < f(x). The ideas above form the basis of the Gauss-Newton method, see algorithm 4. The linear system of equations (15) can be solved (approximately) using the Conjugate Gradient method [4], which is one of the most well-known and efficient Krylov subspace methods [14,15]. The stopping criterion for the linear system of equations is chosen to be 1/ζ, in correspondence with the choice of tolerance used for the Projected Newton method in algorithm 2, see also remark 2.5.
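One Gauss-Newton step of this reference method can be sketched in Python with SciPy's Conjugate Gradient solver, applied matrix-free (the model s(x) = exp(−Ax) and the regularizer Ψ(x) = (1/2)||x||^2 are toy stand-ins; the real method uses the Talbot-Lau forward model and the smoothed total variation function):

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

# Sketch of one Gauss-Newton step for f(x) = 0.5*||s(x)-b||^2 + alpha*Psi(x):
# solve (J^T J + alpha * hessPsi) p = -grad f(x) with CG, without ever
# forming J^T J explicitly.
rng = np.random.default_rng(3)
A = rng.standard_normal((8, 4)) * 0.2
b = rng.standard_normal(8)
alpha = 0.5
x = rng.standard_normal(4)

sx = np.exp(-A @ x)
J = -sx[:, None] * A                          # Jacobian of s at x
grad_f = J.T @ (sx - b) + alpha * x           # grad Psi(x) = x for Psi = 0.5*||x||^2
H = LinearOperator((4, 4), matvec=lambda v: J.T @ (J @ v) + alpha * v)

p, info = cg(H, -grad_f)                      # approximate Gauss-Newton direction
assert info == 0                              # CG converged
assert p @ grad_f < 0                         # p is a descent direction for f
```

Since H is positive definite, the (approximate) CG solution always yields a descent direction, which is the property emphasized above.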
Unfortunately, due to the fact that the fit-to-data term in (4) is non-convex, we cannot use specialized algorithms such as the split Bregman method [16] or ADMM [17]. However, SPN-Q and the Gauss-Newton method can be used regardless of which forward model s(x) is used. Hence, we will use algorithms 3 and 4 to compare the Sequential Projected Newton method with in our numerical experiments.
For linear inverse problems there also exists a wide range of different algorithms that can be used, see for instance [18,19,20,21,22,23,24] and references therein. Hence, many of these algorithms could also be used to solve sub-problem (6) in the Sequential Projected Newton method. However, the efficiency of the Projected Newton method was already illustrated by a number of numerical experiments in [9], so we do not consider alternative algorithms for the inner solver in our experiments.
Remark 3.1. We can use the Gauss-Newton method (algorithm 4) with α = 0 to compute the initial point x_0 used in algorithms 1, 2 and 3. In this case, the Gauss-Newton method converges towards the unregularized solution argmin_x ||s(x) − b||. We terminate the algorithm when ||s(x) − b|| < σ, which typically happens already after only a few iterations. Hence, computing this initial point is relatively cheap. We use a fixed tolerance of 10^-10 (i.e. ζ = 10^10) for (16) in all our experiments (to make sure all algorithms start from the same initial point, regardless of other parameter choices).

Test problem: Talbot-Lau X-ray phase contrast imaging
For our numerical experiments we consider a test-problem from Talbot-Lau phase contrast imaging [25]. The forward model of the Talbot-Lau phase contrast projection operation in the discretized setting can be described as

s(µ, ε, δ) = s_0 exp(−Aµ) ⊙ (1 + v_0 exp(−Aε) ⊙ cos(φ_0 + A_φ δ)), (17)

where exp and cos are applied component-wise and ⊙ denotes the component-wise product, s_0 ∈ R^+_0 is the incoming intensity, v_0 ∈ (0, 1) the visibility and φ_0 ∈ R^m the phase shift without the object. The variables µ, ε and δ are the absorption, dark field and phase contrast, respectively. In the current manuscript we consider two-dimensional examples where the three components µ, ε and δ can be visualized as pixel images. Similarly, we could also consider three-dimensional examples where the different components contain a large number of voxels. If we denote by ñ the total number of pixels (or voxels) then we have µ, ε, δ ∈ R^ñ. Now let m denote the total number of projections, i.e. the number of projection angles times the number of detectors. Then we have s(µ, ε, δ) ∈ R^m. In addition, we denote by b the vector of length m containing the acquired (noisy) projections.
The matrix A ∈ R^{m×ñ} is the conventional projection matrix and A_φ ∈ R^{m×ñ} the system matrix for differential phase contrast: A_φ = DA, where D ∈ R^{m×m} is a matrix performing the finite difference operation. We generate the matrix A using the ASTRA toolbox [26,27] based on a simple parallel beam geometry with line kernel, see Appendix A. Let us denote x = (µ^T, ε^T, δ^T)^T ∈ R^n with n = 3ñ the vector obtained by stacking the three components. Note that s_i(x) is a function R^n → R and s(x) is a function R^n → R^m. To use the techniques developed in this manuscript we first need to compute the Jacobian matrix J(x) of the nonlinear forward model s(x). To do so it is convenient to write s(x) = s̃(Āx), where Ā is the block diagonal matrix with diagonal blocks A, A and A_φ, and s̃(t, u, v) = s_0 exp(−t) ⊙ (1 + v_0 exp(−u) ⊙ cos(φ_0 + v)). Let us calculate the Jacobian Js̃ of the function s̃. For i ≠ j we have that ∂s̃_i/∂t_j = ∂s̃_i/∂u_j = ∂s̃_i/∂v_j = 0 and for i = j we have

∂s̃_i/∂t_i = −s̃_i(t, u, v),
∂s̃_i/∂u_i = −s_0 e^{−t_i} v_0 e^{−u_i} cos(φ_{0,i} + v_i),
∂s̃_i/∂v_i = −s_0 e^{−t_i} v_0 e^{−u_i} sin(φ_{0,i} + v_i).

Hence we can write Js̃ as a block row of three diagonal matrices. Using the chain rule for the Jacobian of s(x) = s̃(Āx) we immediately get J(x) = Js̃(Āx) Ā.
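The forward model and the Jacobian assembled via the chain rule can be sketched and verified in Python/NumPy (the matrices A and D below are tiny random stand-ins; the real A comes from the ASTRA toolbox):

```python
import numpy as np

# Sketch of the Talbot-Lau forward model (17) and its Jacobian, with toy
# stand-ins for the projection matrix A and A_phi = D @ A.
rng = np.random.default_rng(4)
npixels, m = 6, 10
A = rng.random((m, npixels)) * 0.3             # toy projection matrix
D = np.eye(m) - np.roll(np.eye(m), 1, axis=1)  # toy finite-difference matrix
A_phi = D @ A
s0, v0 = 1.0, 0.75
phi0 = rng.random(m)

def forward(x):
    mu, eps, delta = np.split(x, 3)
    t, u, v = A @ mu, A @ eps, A_phi @ delta
    return s0 * np.exp(-t) * (1 + v0 * np.exp(-u) * np.cos(phi0 + v))

def jacobian(x):
    mu, eps, delta = np.split(x, 3)
    t, u, v = A @ mu, A @ eps, A_phi @ delta
    sx = s0 * np.exp(-t) * (1 + v0 * np.exp(-u) * np.cos(phi0 + v))
    d_t = -sx                                                 # ds_i/dt_i
    d_u = -s0 * np.exp(-t) * v0 * np.exp(-u) * np.cos(phi0 + v)
    d_v = -s0 * np.exp(-t) * v0 * np.exp(-u) * np.sin(phi0 + v)
    # chain rule J(x) = Js(A_bar x) A_bar with A_bar = blkdiag(A, A, A_phi):
    return np.hstack([d_t[:, None] * A, d_u[:, None] * A, d_v[:, None] * A_phi])

# finite-difference check of the Jacobian
x = rng.random(3 * npixels)
J = jacobian(x)
h = 1e-6
fd = np.column_stack([
    (forward(x + h * e) - forward(x - h * e)) / (2 * h)
    for e in np.eye(3 * npixels)
])
assert np.allclose(fd, J, atol=1e-6)
```

Note that J(x) never needs to be formed explicitly in practice: only products with J(x) and J(x)^T are required by the inner solvers.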

Total variation regularization
Total variation regularization is a popular regularization technique that preserves edges in images. Let y ∈ R^ñ be a vector obtained by stacking all columns of a pixel image Y ∈ R^{N×N} with ñ = N^2. The anisotropic total variation function is defined as

TV(y) = ||D_h y||_1 + ||D_v y||_1,

with finite difference operators D_h and D_v in the horizontal and vertical direction, respectively. We can rewrite this in a more convenient way by first writing the matrix D ∈ R^{(N−1)×N}, which represents a finite difference approximation of the derivative operator in one dimension. Let ⊗ denote the Kronecker product. We can then compactly write D_h = D ⊗ I_N and D_v = I_N ⊗ D, where the matrices D_h and D_v represent the two-dimensional finite difference approximation of the derivative operator in the horizontal and the vertical directions, respectively. Now because we have three different images µ, ε and δ we consider the regularization term ||Lx||_1, with L the block diagonal matrix containing a copy of the stacked operator (D_h^T, D_v^T)^T for each of the three components. (18) The ℓ_1 norm is of course non-differentiable; however, it is easy to formulate a smooth approximation Ψ(x) ≈ ||Lx||_1, where Ψ : R^n → R is a twice continuously differentiable convex function. More precisely we define

Ψ(x) = Σ_i sqrt((Lx)_i^2 + ξ), (19)

where ξ > 0 is a small scalar that ensures smoothness. This is the same technique as used in [9].
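The Kronecker construction and the smoothed approximation can be checked on a tiny image in Python/NumPy (a sketch; the horizontal/vertical assignment of the Kronecker factors assumes column-stacking, matching the convention above):

```python
import numpy as np

# Anisotropic TV building blocks: one-dimensional difference matrix D and
# its Kronecker lifts for a column-stacked N x N image.
N = 4
D = (np.eye(N) - np.roll(np.eye(N), 1, axis=1))[:-1]   # (N-1) x N forward differences
I = np.eye(N)
D_v = np.kron(I, D)     # differences within each column (vertical direction)
D_h = np.kron(D, I)     # differences across columns (horizontal direction)

Y = np.arange(N * N, dtype=float).reshape(N, N)        # Y[i, j] = 4*i + j
y = Y.flatten(order="F")                               # stack columns

tv = np.abs(D_h @ y).sum() + np.abs(D_v @ y).sum()
# vertical neighbors differ by 4 (12 pairs), horizontal neighbors by 1 (12 pairs)
assert abs(tv - 60.0) < 1e-12

# Smooth approximation Psi(y) of ||L y||_1 with L = [D_h; D_v] stacked
xi = 1e-8
L = np.vstack([D_h, D_v])
Psi = np.sum(np.sqrt((L @ y) ** 2 + xi))
assert abs(Psi - tv) < 1e-3    # for small xi, Psi is close to the exact TV value
```

For small ξ the smoothed value is indistinguishable from the exact ℓ_1 value, while Ψ is twice continuously differentiable, which is what the Projected Newton inner solver requires.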

Numerical experiments
In this section we perform a number of numerical experiments using the Talbot-Lau test-problem described in section 4. The aim is to illustrate that algorithm 2 is a robust and computationally efficient method for solving the nonlinear system of equations F(x, λ) = 0 defined by the KKT equations (11) for the equality constrained optimization problem (3). This function is also used to monitor convergence of the algorithm. All experiments in this section are performed using MATLAB on a laptop computer with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz. For all our experiments we set s_0 = 1 in the forward model (17), which can be seen as a normalization of the projection data. Unless explicitly stated otherwise, the value v_0 = 0.75 is used for the visibility and φ_0 is generated using three phase steps. We refer to Appendix A for more details on how the test-problems are generated. For a certain exact solution x_ex = (µ_ex, ε_ex, δ_ex) we generate exact data b_ex = s(x_ex) and then add Gaussian distributed white noise to obtain noisy data b. We refer to the value σ = ||b − b_ex|| as the noise-level and to σ/||b_ex|| as the relative noise-level. We denote by npix the number of pixels in each direction (we only consider square images) and by nangles the number of projection angles used, and we set the number of detectors equal to npix. For these choices we have n = 3 × npix^2 unknowns and m = npix × nangles measurements. We always choose nangles to be a multiple (≥ 3) of npix, which implies m ≥ n. In each iteration of the SPN algorithm we start the Projected Newton method with initial Lagrange multiplier 10^5, which works well in practice.

Experiment 1. We start by performing a small experiment with some exact images µ_ex, ε_ex and δ_ex of size 10 × 10, i.e. npix = 10, and nangles = 4 × npix, resulting in a projection matrix A of size 400 × 100 (m = 400, n = 300).
Next we generate data b with relative noise-level 0.01 in MATLAB as follows (with nn = npix^2 and s_0 = 1):

    b_ex = exp(-A*x_ex(1:nn)).*(1+v0*exp(-A*x_ex(nn+1:2*nn)).*cos(phi0+D*(A*x_ex(2*nn+1:end))));
    noiselevel = 0.01; noise = randn(m, 1);
    b = b_ex + noiselevel*norm(b_ex)*noise/norm(noise);

We apply the "exact" Sequential Projected Newton method, algorithm 1, with Ψ(x) = Σ_{i=1}^n sqrt(x_i^2 + ξ) and parameters r_0 = 10^3, η = 10^-4, θ = 0.5, τ = 10^-6 and ξ = 10^-6. Note that this choice of regularization term can be seen as a smooth approximation of the ℓ_1-norm, which is not a good choice for the random exact solution that we generated above. However, with the current experiment we are not yet concerned with the quality of the obtained solution, but only with the convergence behavior. We use the Projected Newton method [9] to solve sub-problem (6) and set the tolerance for this solver to τ/10, i.e. we stop the algorithm when we have found a point that satisfies ||F^{(k)}(p_{k+1}, λ_{k+1})|| ≤ τ/10, where F^{(k)}(p, λ) is defined by (12). For comparison we also apply SPN-Q (algorithm 3) using the same parameters and with ζ = ∞.
The result of this experiment is given by figure 1. We compare the value ||F(x_k, λ_k)|| for both methods in terms of the outer SPN iterations k. In addition we also report the number of inner Projected Newton iterations required to converge for the sub-problems. First of all it is important to note that the decrease condition on line 6 of algorithm 1 is satisfied by the step-length β_k = 1 for all iterations k in case we solve sub-problem (6). In contrast, in case we use a quadratic model (8), the line-search terminates with β_k < 1 for a large number of iterations. This is the reason that the convergence in the latter case is much slower. It is also interesting to observe that the quadratic model requires more Projected Newton iterations on average to satisfy the stopping criterion. Hence it is clear from this experiment that sub-problem (6) is the better choice over (8).

Experiment 2.
In the next experiment we illustrate the fact that the sub-problem (6) does not have to be solved to a high accuracy to be able to make progress towards the solution, i.e. to decrease ||F(x, λ)||. We keep the same set-up as before and consider two different relative noise-levels, namely 0.1 and 0.01. We apply algorithm 2 with ζ = ∞ and τ = 10^-4, 10^-6, 10^-8 and 10^-10. This can be seen as applying algorithm 1, i.e. taking a fixed tolerance for the sub-problems, but always taking a full step-length (which satisfies the decrease condition on line 6 anyway). To clearly observe the effect of the tolerance of the Projected Newton method on convergence, we let the algorithm run for a fixed number of iterations. The result of this experiment is given by figure 2. The final achievable accuracy in terms of ||F(x, λ)|| is completely determined by the tolerance of the inner solver. Furthermore, it can be observed that convergence (up to stagnation of the method) does not differ much in case we use a lower tolerance for the sub-problems. This implies that in early iterations it is not necessary to compute a very accurate solution. Hence, an adaptive tolerance as implemented using the ζ coefficient in algorithm 2 seems like a good idea to improve performance of the algorithm. This is especially important since the computational cost of the Projected Newton method grows with the number of iterations. With the next experiment we investigate the trade-off between keeping the number of inner iterations as low as possible without compromising the convergence in terms of the number of outer iterations needed to converge.
Experiment 3. In this experiment we study the effect of the parameter ζ > 1 in algorithm 2. If this value is large, then we require a relatively accurate solution of the sub-problem and we can expect that we need a large number of inner iterations to converge. However, we can also expect to make significant progress towards the solution in terms of ||F(x_k, λ_k)|| for the outer iterations. To confirm this we apply the Sequential Projected Newton method, algorithm 2, with different values of the ζ parameter to a Talbot-Lau test-problem with npix = 16 and nangles = 5 × npix, i.e. m = 1280, n = 768. Furthermore we take relative noise-level 0.01, visibility v_0 = 0.5, tolerance τ = 10^-6 and smoothing parameter ξ = 10^-6. In figure 3 we show convergence of the method for ζ = 2, 3, 10 and ∞. When we take a small value such as ζ = 2 we can observe that the number of inner iterations needed to satisfy the stopping criterion is relatively small. However, convergence in terms of the outer iterations is also quite slow. For the choice ζ = ∞ we get very rapid convergence in terms of outer iterations since we require an accurate solution (τ = 10^-6) of the sub-problem, but we need a lot of inner iterations to satisfy the stopping criterion of the inner solver. It is also interesting to observe that in this case the number of Projected Newton iterations decreases towards convergence. This is because the Projected Newton method starts from an initial point containing all zeroes, and near the solution of F(x, λ) = 0 this is indeed a good initial "guess" for p_k.
To investigate this trade-off in a bit more detail we apply algorithm 2 again with a linearly spaced range of 100 values for ζ from 1.1 to 30. The number of outer iterations required to converge is reported in figure 4 (left). If we choose ζ close to one, then we need to perform a lot of outer iterations. However, from a certain point onward, the number of outer iterations hardly changes, which implies that there is almost no benefit to solving the sub-problem very accurately.
In figure 4 (right) we show the total number of inner (Projected Newton) iterations. Again, when ζ is close to one, we need to perform a lot of inner iterations, which is due to the fact that the number of outer iterations is quite large. A value of ζ a little less than 5 seems to result in the fewest inner iterations for this particular experiment. From that point onwards, the number of inner iterations starts to increase again, since we require a more accurate solution of the sub-problem without requiring fewer outer iterations to converge.

Table 1: Experiment 4. The number of outer iterations (outer its), average number of inner iterations per outer iteration (avg. inner) and the total runtime (in seconds) of the SPN method (algorithm 2) applied to a Talbot-Lau test-problem with npix = 16, 32 and 64. Reported timings are averaged over 10 runs with the same instance of noise. The number of outer and inner iterations do not change over the different runs.

Experiment 4. Note that, since the cost of the Projected Newton method grows (quadratically) with the inner iteration index, more inner iterations do not necessarily imply a longer total runtime. Fewer inner iterations on average, but more outer iterations, might be preferable to few outer iterations that each require a large number of inner iterations. Recall that the number of floating point operations used in the Projected Newton method increases with the number of iterations, so it is important to keep this number as low as possible. Hence, the choice ζ = ∞ performs very badly in terms of total runtime. The timings for ζ between 1.1 and 10 are all comparable, except for the case npix = 64 and ζ = 1.1, which seems to perform a bit worse. The choice ζ = 10 seems to keep the number of outer iterations close to that of ζ = ∞, while also keeping the average number of inner iterations modest. This value is used in all subsequent experiments.
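The runtime trade-off can be made concrete with a back-of-the-envelope model. Assuming, as stated above, that the cost of a k-iteration inner solve grows quadratically with k (say iteration j costs roughly j units of work), many cheap outer iterations can be cheaper in total than a few expensive ones; the iteration counts below are hypothetical:

```python
def subproblem_cost(k):
    # If inner iteration j costs ~j units of work, a sub-problem solved
    # in k inner iterations costs k*(k+1)/2 units, i.e. quadratic in k.
    return k * (k + 1) // 2

# Hypothetical counts: 20 outer iterations of 5 inner steps each,
# versus 5 outer iterations of 30 inner steps each.
many_cheap = 20 * subproblem_cost(5)    # 20 * 15 units
few_costly = 5 * subproblem_cost(30)    # 5 * 465 units
```

Even though the second scenario uses fewer outer iterations, its total cost is several times larger, which matches the poor runtime observed for ζ = ∞.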
Experiment 5. For the next experiment we consider total variation for the regularization function, as implemented in (19). We use the exact images for µ ex , ex and δ ex with npix = 64 as shown in figure 5 and we take nangles = 5 × npix (m = 20480, n = 12288). For the other parameters we take tolerance τ = 10^-6, relative noise-level 0.01, ζ = 10 and smoothing parameter ξ = 10^-8. Note that it is important that the smoothing parameter is relatively small, since otherwise (19) is not a good approximation of the true total variation regularization function (18).
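Formulas (18) and (19) are not reproduced in this excerpt, but a smoothed isotropic total variation typically has the form Σ sqrt(|∇x|² + ξ²), which tends to the true TV as ξ → 0. The following numpy sketch assumes this generic form and illustrates why ξ must be small:

```python
import numpy as np

def smoothed_tv(img, xi):
    """Generic smoothed isotropic total variation of a 2-D image:
    sum over pixels of sqrt(dx^2 + dy^2 + xi^2).  This is an assumed
    generic form, not necessarily formula (19) of the paper."""
    dx = np.diff(img, axis=1)[:-1, :]   # horizontal forward differences
    dy = np.diff(img, axis=0)[:, :-1]   # vertical forward differences
    return np.sum(np.sqrt(dx**2 + dy**2 + xi**2))

def tv(img):
    """Non-smooth isotropic TV (xi = 0)."""
    return smoothed_tv(img, 0.0)

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))

# A small xi yields a close approximation of the true TV value,
# a large xi does not:
err_small = abs(smoothed_tv(img, 1e-8) - tv(img))
err_large = abs(smoothed_tv(img, 1.0) - tv(img))
```

The smoothing makes the function twice continuously differentiable (as required of Ψ in the introduction) at the price of a bias that grows with ξ, hence the choice ξ = 10^-8 here.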
To study the performance of the Sequential Projected Newton method (algorithm 2) we also apply the two reference methods from section 3, namely SPN-Q (algorithm 3) and Gauss-Newton (algorithm 4), to the same test-problem. All parameters (such as ζ or τ) are kept the same for the different methods (when applicable). The regularization parameter α for the Gauss-Newton method is chosen as 1/λ*, where λ* is the final Lagrange multiplier computed by the Sequential Projected Newton method. In table 2 we report the total runtime (in seconds) as well as the number of outer iterations needed to converge. We perform 5 different runs with 5 different instances of noise. It can be observed that algorithm 2 always performs best, both in number of iterations and in runtime. Moreover, it is clear that SPN-Q is not competitive with the other two approaches; the benefit of the more accurate sub-problem (13) compared to the quadratic model (14) is clear from this experiment. The reconstructed solutions are shown in figure 5. Note that all methods (SPN, SPN-Q and GN) produce a solution of similar quality, which is not surprising since they all essentially solve the same regularized nonlinear inverse problem with the total variation regularization function. We can clearly see that the reconstructed solutions are quite good and that total variation regularization nicely removes noise in the reconstruction, while still preserving edges in the images.
In figure 6 (left) we report several convergence metrics for the SPN method (algorithm 2) applied to a particular instance of the test-problem. We show convergence of the values ||F(x_k, λ_k)|| as always, as well as the relative difference of the iterates x_k and λ_k. In addition we show the convergence of the constraint by plotting ||s(x_k) − b|| − σ. The first three metrics converge in a similar fashion, while the latter value converges much more quickly. In the figure on the right, we show the Structural Similarity Index (SSIM) of the reconstructed solution compared with the exact solution. The closer this value is to 1, the more similar the two images are. The actual formula for the SSIM is quite involved, so we do not explicitly write it down here but rather refer to [28], where it was first introduced. We compute this value using the MATLAB function ssim. It is interesting to see that the Structural Similarity Index stabilizes very quickly, even when ||F(x_k, λ_k)|| is not yet very small. Hence, for practical problems it might be appropriate to use a stopping criterion based on ||s(x_k) − b|| − σ, which decreases much more quickly. Similar observations were made by the authors in [9].

Experiment 6. For our final experiment we use entirely the same set-up as before, but now with a larger test-problem with npix = 128 (m = 81920, n = 49152) as shown in figure 7. Furthermore, we now choose a moderate tolerance τ = 10^-3. Results of this experiment can be found in figures 7 and 8. The same conclusions as in the previous experiment can be drawn here. Results comparing algorithm 2 with Gauss-Newton are presented in table 3. We do not include the SPN-Q method in this table since a single run took more than one hour to complete. Again we conclude that the Sequential Projected Newton method converges much more quickly than the Gauss-Newton method, while producing a solution of similar quality.

Table 3: Experiment 6. The number of outer iterations and total runtime (in seconds) of the SPN method (algorithm 2) and Gauss-Newton (algorithm 4) for 5 different runs (with different instances of noise) applied to the Talbot-Lau test-problem with npix = 128. The standard deviation (Std dev) over the different runs is also reported for the different quantities.
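The stopping criterion based on ||s(x_k) − b|| − σ suggested in Experiment 5 might be sketched as follows; the relative tolerance and the linear toy forward model are assumptions for illustration only:

```python
import numpy as np

def discrepancy_stop(s, x, b, sigma, tol=1e-3):
    """Hypothetical stopping rule: stop once the residual norm is within
    a relative tolerance of the target discrepancy sigma, motivated by
    the observation that ||s(x_k) - b|| - sigma converges quickly."""
    return abs(np.linalg.norm(s(x) - b) - sigma) <= tol * sigma

# Toy linear forward model s(x) = A @ x with data at noise-level sigma.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
x_ex = rng.standard_normal(20)
noise = rng.standard_normal(50)
sigma = 0.01 * np.linalg.norm(A @ x_ex)
b = A @ x_ex + sigma * noise / np.linalg.norm(noise)
s = lambda x: A @ x

# At the exact solution the residual norm equals sigma by construction,
# so the criterion fires there but not at an arbitrary point:
hit = discrepancy_stop(s, x_ex, b, sigma)
miss = discrepancy_stop(s, np.zeros(20), b, sigma)
```

Such a rule avoids iterating until ||F(x_k, λ_k)|| is tiny when the reconstruction quality (as measured by SSIM) has long since stabilized.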

Conclusions and outlook
In this work we have developed an algorithm for the automatic regularization of nonlinear inverse problems based on the discrepancy principle. By considering a linear approximation of the forward model using the Jacobian matrix, we obtain a quadratically constrained optimization problem that can be solved using the Projected Newton method. We prove that the solution of this equality constrained optimization problem results in a descent direction for a certain merit function, which can be used to formulate a formal line-search method for solving the inverse problem. We also formulate a slightly more heuristic version of the algorithm by loosening the tolerance for the solution of the sub-problems and removing the (seemingly unnecessary) line-search. We illustrate through numerical experiments that these simplifications lead to a robust and computationally efficient method (in terms of runtime) for solving the regularized nonlinear inverse problem compared to two reference methods. Moreover, we show that we can obtain high-quality reconstructions for Talbot-Lau X-ray phase contrast imaging using the proposed algorithm.
Since the Sequential Projected Newton method uses the discrepancy principle, it can only be applied when a good estimate of the noise-level is available. It might be worth investigating whether techniques similar to those presented in this manuscript can also be used in combination with other parameter choice methods, such as generalized cross validation or the L-curve criterion. Moreover, we have assumed throughout this work that the nonlinear inverse problem is overdetermined. However, in many applications (including Talbot-Lau phase contrast imaging) this is not always the case. Hence, a generalization of the Sequential Projected Newton method to underdetermined inverse problems would certainly be valuable. Preliminary experiments (not reported in this work) indicate that the Sequential Projected Newton method often converges nicely in the underdetermined case as well, but it can also diverge on some test-problems. Hence, further research and safeguards to ensure convergence are required.