Adaptive anisotropic Bayesian meshing for inverse problems

We consider inverse problems estimating distributed parameters from indirect noisy observations through discretization of continuum models described by partial differential or integral equations. It is well understood that the errors arising from the discretization can be detrimental for ill-posed inverse problems, as discretization error behaves as correlated noise. While this problem can be avoided with a discretization fine enough to suppress the modeling error level below that of the exogenous noise that is addressed, e.g., by regularization, the computational resources needed to deal with the additional degrees of freedom may require high performance computing environment. Following an earlier idea, we advocate the notion that the discretization is one of the unknowns of the inverse problem, and is updated iteratively together with the solution. In this approach, the discretization, defined in terms of an underlying metric, is refined selectively only where the representation power of the current mesh is insufficient. In this paper we allow the metrics and meshes to be anisotropic, and we show that this leads to significant reduction of memory allocation and computing time.


Introduction
The estimation of distributed parameters from indirect noisy observations is an inverse problem arising in a wide range of applications, including geophysics, non-destructive material evaluation and medical imaging.The distributed parameters may contain information about the underlying medium such as density, elastic parameters, electric conductivity or optical absorption and scattering coefficients.Typically these parameters are related to the observations through continuum models governed by partial differential equations or integral equations whose numerical solutions require discretization via finite differences, finite element, or finite volumes.To ensure that the discretized problem is close to the continuum one, the mesh must be fine enough for the discretization error not to hamper the approximation.The inherent ill-posedness of inverse problems means that small errors in data may be significantly amplified in the computed solutions, thus proper handling of the discretization error is even more critical in this context than in simulations or model predictions.Since numerical approximation errors arising from coarse discretization affect all output parameters simultaneously, they manifest themselves as if highly correlated noise had been added to the observations [20,21,6] and without a careful analysis of the correlation structure and suitable compensation, they may be detrimental for solving the inverse problem.Because the correlation of the discretization error is typically very structured, it is not well described in terms of the exogenous observation noise, and the higher the quality of the measured data is, the more noticeable the artifacts due to discretization in the computed solution may be.
There are two different ways to mitigate the effect of discretization errors.The first and most obvious approach is to refine the mesh until the error due to the discretization is smaller than the observation error.The drawback of this approach is that refining the discretization may increase the computational complexity of the forward model to the point where the numerical treatment of the inverse problem becomes impractical or impossible.An alternative is to model the approximation error in the Bayesian statistical framework by estimating its probability distribution and incorporating it in the likelihood model that acknowledges the imprecision of the forward model used for the inversion.The latter approach, which has been shown to consistently produce high quality reconstructions without excessive computational complexity [1,20,21,6,28] once the density of the modeling error has been estimated, is not without a price.In fact, since the modeling error is a function of the unknown distributed parameter to be evaluated, estimating its distribution typically requires off-line nmerical forward simulations that may be very costly.Philosophically, the Bayesian modeling error approach can be seen as a training algorithm similar to machine learning algorithms.
In this article, we advocate a third approach that reduces the discretization error by selectively refining the mesh.We assume that the distributed parameter is piecewise constant or nearly constant function, so that its generalized gradient admits a sparse or compressible approximation in terms of appropriately chosen basis functions.The inverse problem is solved by an iterative scheme in which the support of the sparse coefficient vector of the approximation of the gradient is learned while estimating the solution, and the computational domain is iteratively remeshed.The meshing is a function of an underlying metric that is updated at each iteration concomitantly with the approximate solution.The computational engine used for estimating the sparse representation of the gradient is an iterative numerical scheme based on hierarchical Bayesian hypermodels, a hybrid version of the iterative alternating scheme (IAS) studied extensively in the literature [7,8,10,13,17,18,29,30].This article elaborates and extends the work in [5], with a special attention to the computational aspects.One of the novelties of this work is that the approximation of the generalized gradient of the unknown using Whitney elements [2,3] bridges the sparse representation and a geometric interpretation of it.Using the Whitney representation, we build a Clément-type interpolant [4,14], a continuous approximation of the generalized gradient that allows us to define the mesh-generating metric.This new formalism removes the need to introduce intermediate meshing interpolating the hyperparameter associated to the edges of the mesh that made the algorithm in [5] slightly cumbersome, and required some hand-tuning of the mesh near the domain boundary.With a continuous interpolant of the gradient at our disposal, we then define the metric in terms of the gradient.Another novelty is that now we can allow the metric and the corresponding mesh to be anisotropic, leading to a more efficient representation of the unknown, requiring less elements to approximate discontinuities.Moreover, instead of seeking a way to approximate the modeling error, the more efficient adaptive anisotropic meshing that we propose here is able to selectively refine the discretization so that the modeling error falls below the level of the observation noise.
Finally, the Bayesian updating algorithm introduced in this article uses a hybrid version of the IAS algorithm, further improving the sparsity of the gradient representation, thus reducing the overall computing cost.
The computed examples presented in this article are similar to those considered in [5], i.e., the X-ray fanbeam tomography problem with few illumination angles, and the inverse source problem for Darcy flow, thus facilitating the comparison of the novel algorithm with the adaptive isotropic Bayesian meshing.Extending the Bayesian adaptive anisotropic meshing presented here to nonlinear models, while in principle natural, will entail a further analysis of the solver, and will be is the topic of a separate contribution.

Bridging the continuum and discrete formulations
Consider a continuous observation model: Let Ω ⊂ R d , d = 2, 3, be a bounded domain with a Lipschitz boundary, and let u be a distributed parameter function defined in Ω.Let F * denote a function that models the dependency of a finite dimensional observable quantity b on u, and consider an observation model with additive noise, The inverse problem is to estimate u based on observation of b.The forward model F * is typically based on PDE/integral equation.For the sake of definiteness, we assume that the function u is a piecewise constant, or more generally, piecewise smooth function with small gradient outside the discontinuities, whose singular support we want to estimate numerically.
Typically, when solving the inverse problem, the unknown u as well as the forward model F * need to be discretized.It is well understood [21] that the discrete approximation differs from the presumably accurate theoretical continuous model by a model discrepancy representing the discretization error.
Assuming that the data are well described by the continuous model, the model discrepancy acts as an additional noise term.If the data are of high quality in the sense that the exogenous noise ε is of low level, the model discrepancy may be the dominant part of the noise, and as inverse problems are ill-posed, ignoring it may lead to significant loss of accuracy in the estimate.Modeling the discrepancy, on the other hand, is a non-trivial task for several reasons: First, the modeling error may have a non-zero mean and a complex correlation structure, so inflating artificially the noise level in the inverse solver may not have the desired effect.In fact, the model discrepancy depends on the discretization of the continuous model and on the unknown parameter u itself, complicating the process of estimating it.
To address the modeling error question in a more rigorous framework, consider a discretization scheme of the forward model.For simplicity, we limit ourselves to the case d = 2. Assume that the boundary ∂Ω is regular enough to allow an approximation of the domain Ω by triangular finite element mesh: Let T h = {K j } Nt j=1 be a conforming tessellation in triangles K j , where h is a symbolic mesh size parameter, and let Ω h be a polygonal approximation of the domain, Denoting by {ψ j } Nv j=1 a set of appropriately chosen Lagrange basis functions related to the tessellation T h , where N v is the number of vertices of the mesh, we approximate the unknown u by Furthermore, let us write an approximate discrete model as where the error ε h contains the measurement error and the approximation error, While the modeling error cannot be estimated without knowning u, the Bayesian framework provides a natural solution.In the Bayesian approach, since u is an unknown, it is modeled as a random variable U with a prior probability distribution µ U .By denoting we may define the probability distribution of the discrepancy, modeled as a random variable M , as the push-forward of the prior distribution, An analytic form of this distribution is usually not available, however, a computational approximation can be carried out using sampling techniques.In practice, instead of the accurate model F * , an "accurate enough" model is used, and the modeling error sample is generated off-line, akin to network training in machine learning [1,6,21,28].
To minimize the effect of the model discrepancy, one could use a mesh fine enough so that ε h is negligible, or, alternatively, find a way to estimate the modeling error distribution, which also requires an accurate model F * , or a good approximation of it, to generate a model discrepancy sample.One of the main goals of this paper is to bypass the need for a globally accurate model by developing a computationally feasible way to perform selective mesh refinement.In other words, instead of using a costly dense mesh over the whole domain Ω to emulate F * , the mesh is refined only where it is necessary to represent accurately the large gradients of the unknown.The same challenges are encountered in a posteriori mesh refining problems in standard finite element analysis, and the approach that we propose has been inspired largely by the contributions to finite element error analysis.In the adaptive meshing algorithms for approximating the solutions of second order PDEs of advection-diffusion-reaction (ADR) type by finite elements [31,25,26], referred to as the a posteriori error estimate, an important role is played by the approximation error functional where u is the true solution of the PDE, u h is the current Galerkin FEM approximation of the weak solution, and ∇ * u is a computational proxy for the true solution, referred to as the recovered gradient, a Clément-type interpolant of a given order [25,26].The adaptive meshing of the computational domain is guided by the equi-distribution principle of the error: Each element in the new mesh should contribute approximately the same amount to the total error.The local error estimates can be expressed in terms of a metric that is related to the underlying meshing.The metric-based discretization is the main source of inspiration of the current approach [16,19].
Assume that G = (g ij ) is a metric tensor on Ω, defined so that the length of a parametrized curve x = γ(t) is given by the integral of the line element, The principle guiding our meshing strategy is that the triangles K j have approximately the same diameter with respect to the metric distance for a given metric.Conversely, given a tessellation of Ω with triangles of different Euclidian size, we may define approximately a metric so that the elements have the same diameter in this metric.The connection between the tessellation and metric is developed in detail in Section 3.2.Observe that the metric may be anisotropic, leading to triangles of varying eccentricity, defined in terms of the eccentricity of the Steiner ellipses circumscribing the triangle.
To design a metric that adapts to the unknown distributed parameter u, we would like to require that: 1.When |∇u| is large, the triangles defined by the metric are relatively small in the Euclidian metric; 2. For anisotropic triangles, the shorter semi-axes of the Steiner ellipses are parallel to the gradient; 3. For |∇u| small, the metric is isotropic, corresponding to spherical Steiner ellipses.
A candidate metric satisfying these requirements can be written as where (e ∥ (x), e ⊥ (x)) is a local orthonormal frame such that e ∥ (x) is parallel to the gradient of u(x), and θ ∥ (x), θ ⊥ (x) ≥ θ min > 0 are non-decreasing scalar functions of the norm of the gradient of u(x), with the property that for where α is the maximum anisotropy ratio, and for |∇u(x)| small, θ ∥ (x) = θ ⊥ (x).Thus, for |∇u(x)| large, the triangles are small, with width in the direction of ∇u(x) by a factor √ α smaller than the the width in the orthogonal direction, while for small gradients, the triangles are relatively large and isotropic.The exact definitions of the functions θ ∥ and θ ⊥ and of the metric respecting these principles are given in Section 3.2.
Given a metric and the corresponding tessellation (T h ), the discrete inverse problem to be solved corresponds to the model When solving the inverse problem in the Bayesian framework, it is necessary to specify in which way the quality of the solution will be assessed, to define a feedback mechanism to update the metric, and to provide a way to implement the mesh refinement.This feedback is defined by using hierarchical Bayesian models, and therefore implemented as part of the prior for u.In this manner, finding the optimal metric, and therefore the optimal discretization, becomes part of the inverse problem itself.
The reconstruction problem will be addressed in a hierarchical Bayesian framework where the prior assumption that u is piecewise constant, or nearly piecewise constant, is tantamount to assuming that the discretized gradient can be represented as a sparse or compressible vector.To this end, we set up a Bayesian model that favors sparse solutions of the discretized gradient of u.We want to develop an algorithm suitable for estimating a distributed variable u with the a priori belief that the numerical approximation of the total variation of u is limited.We begin by setting up a prior expressing the belief that the discrete approximation of the gradient of u h given by ( 2) is compressible.We begin with representing ∇ u h in terms of a Whitney basis, where N e is the number of edges in the triangular mesh generated by the tessellation T h , and w ℓ is the Whitney basis function associated to the ℓth edge.Denoting the first order piecewise linear nodal basis functions over the current mesh by ψ j , the Whitney basis function associated with a directed edge e ℓ = {v j , v k } is given by the gradients being understood in the weak sense.The relation between the coefficients z ℓ and the nodal values of the function u is discussed in detail in the next section.With the representation introduced above, we want to find u so that the coefficient vector z is sparse, or more generally, compressible.
To set up the hierarchical model first in the abstract setting, consider a discrete inverse problem, where Z and E are random variables taking values in R n and R m representing the unknown and noise, respectively, and B is the R m -valued random variable representing the data.Moreover, we assume that the random variable Z is sparse or compressible, meaning that only a limited number of its components are above some small threshold value.We express our information about Z in terms of a prior density conditional on a hyperparameter Θ, denoted by π Z|Θ (z | θ).In turn, the hyperparameter itself is a random variable in R k , and we define a hyperprior density π Θ (θ) to describe the prior belief about it.The joint prior density of the pair (Z, Θ) is given by Next we write the likelihood density π B|Z (b | z) based on the forward model ( 6).If Z and E are assumed mutually independent, the likelihood is of the form where π E is the probability density of the additive noise.The posterior density is given by Bayes' formula, Given the posterior density, it is possible to compute the Maximum A Posteriori (MAP) estimate (z MAP , θ MAP ), a maximizer of the posterior density above, assuming that such maximizer exists.
In a series of articles [9,13,10,8], the authors have developed a methodology based on a family of hierarchical models that promote sparsity of the variable z MAP by favoring solutions for which the majority of the components of z are either zero or of negligible size.Since the algorithm proposed in this work for the adaptive remeshing is based on those ideas, for completeness we provide a more in-depth review of the sparsity promoting hierarchical prior methodology in Section 3.3, while only presenting a general outline here.
Let Z be a compressible random variable, with zero mean hierarchical conditionally Gaussian prior density When the prior variance θ j is small, it is expected that the variable Z j takes on a small value, too.Since part of the the problem is to identify the few components of Z that are large, the hypermodel for the variance Θ should favor small values most of the time, while allowing occasional large outliers.Among the several fat-tailed distributions that could be chosen for the he components of Θ, for reasons of computational convenience we choose the generalized gamma distribution, i.e., where r ̸ = 0, and the shape parameter β > 0 is assumed to be the same for all j while the scale parameter ϑ j may differ for every j.Assuming that the variables Θ j are mutually independent, the joint prior model then becomes In the case where the additive noise is Gaussian, E ∼ N (0, Σ), where Σ ∈ R m×m is a symmetric positive definite covariance matrix, the likelihood density (7) is the form and the expression for the posterior density (8) becomes Algorithms for finding a sparse or compressible estimate maximizing the above expression, or, equivalently, minimizing the Gibbs energy, are discussed in [9,7,8,10,13,29].An overview of the methodology will be given in Section 3.3.
The Bayesian adaptive (anisotropic) remeshing algorithms that we propose here can be summarized as follows.
1. Given the current tessellation T h , (a) Discretize the forward problem (4), representing the unknown u h in terms of the degrees of freedom z j corresponding to the gradient; (b) Compute the MAP estimate (z MAP , θ MAP ) using the hierarchical Bayesian model; 2. Interpolate the MAP estimate for ∇ u h and update the metric; 3. Update the tessellation T h in accordance with the updated metric; 4. If convergence criterion is met, stop, otherwise repeat from 1.
Several important details are omitted in the outline above, including how the the gradient is interpolated to allow point-wise estimates, and metric is related to the interpolated gradient.These and other details are presented below in separate subsections.The performance of the algorithm is then illustrated with computed examples.

Detailed derivation of the Bayesian adaptive meshing algorithm
In this section, we discuss the details of the various steps outlined above, and we conclude it with a detailed algorithm.We begin with the details of the model discretization.

Model discretization
Let T h denote the current tessellation of the domain Ω h ≈ Ω, which for simplicity we assume to be simply connected and denote by {v j } Nv j=1 the vertices of the mesh.We assume that (2) is a piecewise linear approximation of the unknown function u, that is, the basis functions {ψ j } Nv j=1 are the standard piecewise linear Lagrange basis functions, We assume that the first n v nodes are interior nodes and that the values of u at the boundary nodes are known; without loss of generality, we assume that

The number of boundary nodes is denoted by n
Consider the edge elements of Whitney type: If e ℓ = {v j , v k } is the directed edge connecting the nodes v j and v k , we define the gradients understood as weak derivatives of the piecewise linear basis functions; see Figure 1.Given the approximation (2) in terms of the nodal values u j , the passage to the approximation ( 5) is straightforward, and can be found in the literature [2,3], however, for completeness, we derive it in detail below.Let v j be an arbitrary node, and denote its neighboring nodes by v j 1 , . . ., v jp .Let I i denote the set of indices of the edges with the vertex v i as one of the endpoints.In the patch neighborhood ω i , we have where s i µ = ±1, the sign depending on whether v i is the tail or the head of the edge e µ .This shows that the gradients of the basis functions are in the span of the Whitney elements.Moreover, it follows from this representation that Every edge µ appears exactly twice in the sum, corresponding to the end edges, and if e µ = {v j , v k }, we have Hence, the coefficients z ℓ are obtained by a finite difference formula that is independent of the geometry but depends only on the topology of the mesh.We write the relation in matrix form as where each row of the matrix L full contains exactly two non-zero entries, (L full ) ℓj = 1, (L full ) ℓk = −1.
To modify the formula so that it contains only the nodal values in the interior nodes, we observe first that the matrix L full has a one-dimensional null space, Next we partition the matrix into the columns corresponding to interior and those corresponding to boundary nodes, To show that L int has a trivial null space, let z 0 ∈ N (L int ).Then, and from (11) it follows that z 0 = 0. Finally, after permuting the rows so that those corresponding to the boundary edges are at the bottom, we partition L int as Since for every x ∈ R nv , we must have we conclude that L ′ = O (Ne−ne)×nv .In the following, we denote by z ∈ R ne the increments over edges that have at least one end point in the interior, thus excluding the zero increments over boundary edges, and write where the matrix is full rank, as n e = n v + N t − 1 ≥ n v by the definition of the Euler characteristics.

Tessellation and metric
We start with a brief summary of the metric-based tessellation algorithm.Consider a nonisoparametric triangle K ∈ T h with vertices v 1 , v 2 and v 3 ordered counterclockwise, and let K denote an equilateral reference triangle mapped onto K by the affine mapping Consider the polar decomposition of F, that can be derived from the singular value decomposition by letting where W is an orthogonal matrix and P is a symmetric positive definite (SPD) matrix.The formula defining the Steiner circumellipse is and, conversely, the family of triangles with the same Steiner ellipse is given by Given the tessellation, we define the piecewise constant metric over the approximation Ω h of the domain Ω by setting The arc length element of a line segment across the triangle K, given by γ(t) = v 0 + Fξ(t), where ξ(t) ∈ K and F = PW is then given by implying that with respect to this metric, the triangles are of equal size.Conversely, we consider the following meshing principle: Given a metric G = G(x), x ∈ Ω, find a tessellation such that the triangles have approximately the same size with respect to the metric.The problem can be set up as follows.
Problem 3.1 Given a metric G in Ω and 0 < h min < h max , find a tessellation over the tessellation.
Observe that the bounds h min and h max are safeguards that could be built in the metric itself.In this work, the mesh generation is done by employing an existing software FreeFEM and particularly the function adaptmesh, see [19] for details.

Solving the MAP estimate: IAS algorithm
Consider the Gibbs energy (10) in the case when the forward model corresponds to a linear observation model, f (z) = Az.Let u ∈ R nv denote the nodal values of the discretized function of interest, and let z ∈ R ne denote the coefficients in (5) of the gradient in terms of the Whitney basis, related to u through the equation (12).Since our goal is is to write the sparsity-promoting prior for z, it is necessary to express the likelihood in terms of z.
Consider the QR-decomposition of the matrix L, ×nv is a null matrix, and , where the second equation corresponds to the compatibility condition of vanishing circulations over each triengle, expressed in the orthonormal basis given by Q. Introducing we may write the Gibbs energy in terms of z as where Σ −1 = S T S is a symmetric factorization of the precision matrix of the noise.Without loss of generality, we may assume that S = I, that is, the additive noise is whitened by scaling A 1 and b by the factor S.
The basic IAS algorithm seeks a minimizer of the Gibbs energy by alternating the following minimization steps: 1. Given the current approximation of θ, update z by minimizing 2. Given the current approximation of z, update θ by minimizing We remark that the updating of z is tantamount to solving a least squares problem, and the updating of θ can be done componentwise.Indeed, by differentiating with respect to θ j , the first order optimality condition yields In the special cases r = ±1 this equation admits an explicit solution, while in general, a numerical scheme is required, see [8].
While in principle rather straightforward, there are several details in the algorithm that need to be addressed properly.Before discussing the fine tuning of the algorithm, we will look at the hyperparameters.

Hyperparameters and sensitivity scaling
Let's start by considering the hypermodel r = 1 corresponding to the gamma distribution.In the literature [9,13,11], the following results have been proved.
Theorem 3.2 (a) The Gibbs energy (13) for the gamma hyperprior (r = 1) is a convex functional of (x, θ) with a unique minimizer ( z, θ), and the IAS iteration converges to this minimizer.
(b) Moreover, when η = β−3/2 > 0 converges to zero, the minimizer z converges to the minimizer of the functional The second part of the above theorem suggests that to guarantee sparsity of the solution the variable η should be set to a small value.Observe that in (15) there is no explicit dependence on the noise level, because of the whitening assumption.The formula indicates that the choice of the values of the hyperparameters ϑ j must be related to the sensitivity of the data to the components z j .Indeed, it is well understood that in inverse problems, the weighting of the components in the penalty term is essential when the sensitivity of the data to different components varies: Without sensitivity weighting, optimization algorithms will converge to solutions that explain the data in terms of the variables to which the data are most sensitive, see [11] for a discussion and, e.g., [22,23,24] in the context of geophysics and medical imaging.In the optimization literature, sensitivity weighting is referred to as scaling of the problem [15].This observation may appear to be in conflict with the Bayesian philosophy according to which the prior is independent of the observation model.In [10,8], the authors established a relationship between the hyperparameters and the prior information about signal-to-noise ratio (SNR), thus providing a consistent interpretation of sensitivity weighting.In the same article, the concept of SNR-exchangeability was introduced: A prior model satisfies the SNR-exchangeability condition, if any subset of components having the same cardinality can lead to the expected SNR.In other words, the SNR-exchangeability principle states that all subsets of a given cardinality of the coordinates of the unknown must have the same possibility to explain the data within the presumably known SNR level.It was shown also that if there is SNR-exchangeability, the hyperparameters ϑ j must satisfy i.e., ϑ j is proportional to the squared norm of the jth column of the matrix A 1 .Interestingly, in light of formula (15), this principle leads exactly to the classical sensitivity weighting or scaling referred to above.The parameter ϑ * r > 0 depends on the hypermodel and the estimated SNR; we refer to the original articles [10,8] for the details.

Non-convexity and hybrid schemes
The discussion in the previous subsection was restricted to the IAS algorithm for r = 1.In [7,8], the IAS algorithm with r < 1 was discussed in detail, and it was demonstrated that the Gibbs energy functional to be minimized is non-convex.In those cases the IAS algorithm converges rapidly, however, the minimizer found is typically a local minimizer that depends on the starting point.In order to take advantage of the fast convergence rate and greedier sparsity promotion of the IAS algorithm with r < 1 while avoiding to stop at unsuitable local minimizers, a hybrid IAS algorithm was proposed.The hybrid approach consists of two phases: Phase I, using the hypermodel r = 1, lets the algorithm find a reasonable approximation of the unique minimizer of the Gibbs energy functional.Phase II switches to a hyperprior with r < 1, and the IAS algorithm is run to convergence.The heuristic idea is that this way, the greedier Phase II converges to a minimizer which is close to the global minimizer of Phase I.
To match the hyperparameters in Phases I and II so as to keep consistency throughout the entire process, the following matching conditions were proposed in [12] (see also [8]): 1.When z j = 0, the formula ( 14) should yield the same value for θ j with both hyperpriors, 2. The expected value of θ j in Phase I and in Phase II should remain the same.
Assuming that the parameters of Phase I are set, that is, r = 1, and β 1 = 3/2 + η, these two conditions lead to equations and from which the pair (β 2 , ϑ * 2 ) can determined.Closed form solutions for the special cases r = ±1/2 and r 2 = −1 were given in [12].In our computed examples, we will use the hybrid model with r 2 = 1/2, for which , where µ = 1 + 3 2η , With these choices of the hyperparameters, only two scalar variables, η and ϑ * 1 , remain to be set.The discussion of the choice of their values is postponed to the section of computed examples.

Implementational details
To guarantee that the IAS algorithm produces iterates of z satisfying the compatibility condition of vanishing circulation around each element, we organize the computations as indicated in [5].In the update of z, with the current value of θ, we introduce auxiliary variables and write the objective function E θ (z) in terms of w.Defining D θ = diag(θ), we have Next we compute the reduced QR factorization of L θ , where R θ is upper triangular and invertible, and write It was proved in [5] that the minimizer of this expression satisfies automatically the compatibility condition.To compute efficiently an approximate minimizer, we observe that the least squares solution of the above functional is the standard Tikhonov regularized solution of the linear inverse problem b = A θ w, A θ = AR −1 θ Q T θ , with the regularization parameter equal to unity.The Tikhonov regularized solution is the least squares solution of an overdetermined linear system, and finding it may be computationally very costly.We can reduce the computational complexity, approximating the Tikhonov regularized solution by applying the Conjugate Gradient for Least Squares (CGLS) algorithm directly to the problem b = A θ w.The CGLS algorithm determines a sequence of approximate solutions by solving the minimization problems in a nested sequence of Krylov subspaces, where the Krylov subspace of order k is defined as The CGLS method is a computationally efficient alternative to Tikhonov regularization, albeit not equivalent to it, provided that the iterations are terminated right before the amplified noise components start to overtake the solution.In the present problem, we stop the iterations as soon as one of the following conditions are satisfied, Condition ( 19) is the classical Morozov discrepancy principle, recalling that for Gaussian white noise E ∼ N (0, I m ), the noise level can be defined as The condition (20), on the other hand, monitors the convergence of the original least squares problem whose solution the iterative process seeks to approximate.The updating procedure for z is the same regardless of the hyperprior model for θ, so it applies for both Phase I and Phase II of the hybrid algorithm.For further discussion of the iterative algorithm, we refer to [11].

Updating the metric
Consider next the updating scheme of the metric.Assume that we have completed the IAS iteration, and calculated the pair (z, θ), where z contains the coefficients of the representation of the approximate gradient ∇ u in the Whitney basis and θ is the vectors of the variances.More specifically, both the variance θ j and the coefficient z j are associated to the edge e j of the mesh.
Because of the discontinuity of the Whitney basis functions across the edges, pointwise evaluation of the approximate gradient is not well defined, therefore, to avoid ambiguities, we approximate the gradient over the domain Ω by a continuous interpolant.For that purpose we chose the standard first order Clément interpolant [14], which we briefly review here for the sake of completeness.
To interpolate the discontinuous function ( 5), we begin by considering its components separately, introducing the notation where e 1 e 2 are the canonical unit vectors.Clément interpolation proceeds by 1. finding a local patch approximation around each nodal point v i using orthogonal projections to local piecewise polynomials; 2. interpolating the values of the patch approximations at their central node as global piecewise polynomial.
In the following, we describe the two steps of first order Clément interpolation in detail for ∂ 1 u(x); the procedure for other component is analogous.
We define a local patch ω i where the approximation is carried out by picking a node v i , and consider all triangles with v i as a vertex, see Figure 1.We define where P 1 (ω i ) is the space of piecewise first order polynomials over the patch ω i .If I i is the index set of all nodes associated with the patch ω i , then On the other hand, the orthogonality of the projection is tantamount to the condition This set of conditions leads to a square system of linear equations that must be satisfied by the coefficients determining the interpolation polynomial in ω i .More precisely, which can be expressed in matrix form as This is the linear system that needs to be solved to find he coefficient vector α i .
Once the first order polynomials p i have been found for each node, the Clément interpolant is defined as We denote the Clément interpolation of the function ∇ u(x) by Having the interpolated function ∇ * u(x) at our disposal, we can define the metric G(x) at an arbitrary point x ∈ Ω as follows.Let δ > 0 be a fixed threshold value below which the interpolated gradient is considered negligible.Then: and define where α ≥ 1 is a given anisotropy ratio, and C > 0 is a scaling factor that will be specified below.
The metric defined in this manner produces relatively large and approximately equilateral elements when the interpolated gradient is small, while for large values of the interpolated gradient, the elements are smaller, with a prescribed eccentricity encoded in the anisotropy ratio α.
In practice, the mesh generator of FreeFEM [19] requires the specification of the approximate minimum and maximum edge lengths.In order for the mesh generator to interface seamlessly with the metric defined here, we first specify 0 < h min < h max .Assuming that h min represents the minimum edge length in the direction parallel to ∇ * u, we set from which we can compute the value of C. To select the threshold value δ, we require that when |∇ * u(x)| = δ, the edge length perpendicular to the gradient assumes the value h max , yielding , from which can be compute the value of δ.Therefore, the algorithm requires the specification of only the three parameters (h min , h max , α), all having a direct geometric interpretation.Figure 3: A cartoonish rendition of the geometric setup for the first computed example.Here, for simplicity, the number of views is limited to three.In the computed example, 15 equally spaced illumination directions are used.The number of traversing rays per view is 300, leading to data with dimension m = 45 000.

Algorithm and computed examples
We test the algorithm on two computed examples.For the sake of simplicity, we only consider linear inverse problems, and the examples presented here are similar to those discussed in the earlier paper [5] to allow a comparison of the results obtain with a Bayesian adaptive isotropic meshing algorithm proposed there with our novel anisotropic model.The problems are an X-ray fanbeam tomography with sparse illumination angle data, and an inverse source problem for Darcy flow.

Tomography problem
In this example, the data consists of N illumination views of the target by a fan of rays as indicated in Figure 3.The image area Ω is defined as the unit disc centered at the origin, and we assume that each illumination comprises n uniformly spaced rays traversing the image area.Thus, the dimension of the data is m = N n.In the computed example, we set N = 15 and n = 300, so the data have dimension m = 45 000.
The attenuation of the intensity I of the radiation along an infinitesimal line segment ds is given by the Beer-Lambert law [27], dI = −µds, where µ = µ(x) is the absorption coefficient.Thus, if a ray is represented in a form x = x(s) ∈ Ω parametrized by its arc length s, the attenuated intensity at the end of a ray of length L is obtained by the formula We define the noiseless data b * ∈ R N n componentwise as where x k (s) is the parametrization of the kth ray.We assume for simplicity that the noise in the components b * k can be approximated by additive Gaussian scaled white noise, In the numerical simulation, we assume that the target is a translucent body with two absorbing inclusions with different, yet constant absorption coefficients.
To highlight different features of the algorithm, we run three protocols.In the first run, the noiseless data are corrupted by additive Gaussian scaled white noise, with standard deviation σ equal to 4% of the maximum of the noiseless signal, which corresponds to σ = 0.0472.In the first phase of the hybrid IAS algorithm, which uses the gamma hyperprior parameter, we set η = 0.001 and ϑ * 1 = 0.05.In this example, the sensitivity scaling of the vector ϑ is not necessary, thus, for simplicity, the components of the vector ϑ are all set equal.When the relative change of the vector θ is below a threshold value of 5%, the iterations in Phase I are stopped and as we move to Phase II, the hypermodel changes to a generalized gamma hyperprior with r = 1/2.The hyperparameter values for Phase II are set automatically by using the matching conditions.Similarly to Phase I, the IAS iterations are stopped when the relative change of θ is below 5%.However, in both phases, we set the maximum number of IAS iterations to 15.In the remeshing algorithm, the mesh size parameters are h min = 0.01 and h max = 0.1, while the anisotropy factor is α = 12.
We start the iterations by defining an isotropic mesh with uniform metric, with mesh size parameter h init = 0.05.The initial mesh is shown in Figure 4 top row, left panel, superimposed with the true target.The absorption coefficient in the circular inclusion of the true target is µ = 1.2, and in the kite-shaped inclusion µ = 1.0, while the background is for simplicity assumed to have a vanishing absorption coefficient.
We run four full iterations of the IAS-remeshing algorithm, and show the results at the end of each iteration in Figure 4.The left panel of each row shows the current tessellation, and the middle and right panels show the corresponding reconstructions.Numerical experiments show that after four iterations, the reconstructions and the meshes do not change significantly.
To assess the efficiency of the algorithm, we consider first the computational complexity of the IAS algorithm, and in particular, the number of inner and outer iterations.Table 1 lists the number of outer IAS iterations of both phases if the hybrid IAS algorithm, as well as the number of the CGLS iterations.We recall that the CGLS iterations are stopped when one of the stopping criteria (19) or ( 20) is satisfied.We observe that each IAS update requires fewer than 20 CGLS iterations, making the IAS algorithm very fast.
Before discussing the complexity of the algorithm measured in terms of mesh sizes and computing times, we address first the question whether the mesh refinement is fine enough to ignore the modeling error.Recall that the artifacts due to the modeling error become particularly significant when the data are of high quality, in which case the discretization error dominates the exogenous noise.Therefore, we test the algorithm using the same parameter values for the IAS and the   The rows correspond to the full iterations consisting of the full hybrid IAS updates, and the columns correspond to IAS itrations with gamma hyperprior (Phase I) and generalized gamma hyperprior with r = 1/2 (Phase II).The maximum number of IAS iterations allowed was 15, and a missing number indicates that the stopping criterion was reached before this limit.error, in the first iteration we artificially inflate the standard deviation in the likelihood and set σ = 0.3 × h init = 0.015.The motivation for this choice is that in [5], it was found through numerical simulations that in the tomography problem with piecewise constant inclusion, the modeling error level increases roughly linearly as a function of the mesh parameter with the empirical coefficient 0.3.No artificial inflation is applied after the first iteration.
The results of the first iteration rounds are shown in the top row of Figure 5.In the first reconstructions, the amplified modeling errors are clearly visible, and even in fourth iteration boundary artifacts persist, yielding worse reconstructions than those obtained with higher noise level.This indicates that the refinement parameter h min in the metric was not chosen small enough.The number of the CGLS iterations within the IAS iterations are listed in Table 2.As expected, the lower noise level requires a better fit to the data, thus more inner iterations are required.
The third run is almost identical to the second one, except for the halving of the mesh size parameter in the remeshing, which is set to h min = 0.005, that is, the minimum mesh edge is halved.The results of the four iterations of the algorithm are shown in in Figure 6.This time, there are no boundary artifacts around the inclusions, and the algorithm approximates well the inclusions.The number of CGLS iterations is given in tabular form in Table 3.
To have closer look at the effect of the modeling error, in Figure 7 we plot the estimated absorption coefficients corresponding to experiments 2 and 3 along the vertical diagonal through the disc Ω.The first reconstructions perfectly coincide, as all parameters are the same.The second reconstructions are relatively good in both examples, however, in the third and fourth reconstruction, Figure 5: Four iterations of the hybrid IAS-remeshing algorithm with low noise of 1% of the maximum of the noiseless signal.In this case, the mesh refinement is the same as in the high noise example, making the approximation error the dominant part.The effect of the modeling error is visible in the reconstructions.Table 3: Number of CGLS iterations in the two phases corresponding to the third simulation.The rows correspond to the full iterations consisting of the full hybrid IAS updates, and the columns correspond to IAS itrations with gamma hyperprior (Phase I) and generalized gamma hyperprior with r = 1/2 (Phase II).The maximum number of IAS iterations allowed was 15, and a missing number indicates that the stopping criterion was reached before this limit.when the algorithm starts to sparsify the mesh where the gradient is believed to be small, boundary artifacts with the coarser mesh appear.These artifacts can be attributed to the discretization error.
Finally, we address the question of complexity of the algorithm in terms of number of elements and computing times.Table 4 shows the mesh sizes of each full iteration (I-IV) of the algorithm, together with the computing times.We observe that the first remeshing increases the number of elements, creating an overly dense mesh around a wide area around the object boundaries, after which the algorithm learns the location of the discontinuities and reduces the complexity to only the necessary mesh.
For comparison, we run the three experiments setting the anisotropy parameter to α = 1, that is, the refined meshes are generated by using isotropic metric as in [5].Table 5 shows the mesh sizes and computing times, underlining the advantage of allowing anisotropy in the meshes.In particular, the first remeshing that is always the densest one creates a slowdown of the algorithm by a factor of 4 to 8. Finally, we point out that if the original mesh parameter h = 0.05 would be refined homogeneously over the entire computational domain to yield a mesh parameter h = h min = 0.01 as in experiments 1 and 2, each element would be split into 25 subelements, yielding a mesh with n t = 69 350, resulting in significant computational complications.Table 5: The number of elements in the meshes, and the total computing time of the full updating of the estimates and the meshes using isotropic metric, α = 1.The recomputing of the forward model is not included in the times.

Inverse source problem
Let In this example the data consists of noisy observations of the solution f at the nodes of a regular grid of size 20 × 20, The inverse problem is to estimate the source term u from this data.As in the previous example, we assume that the source term is known to satisfy u| ∂Ω = 0.
Given a tessellation T h , we approximate the source term u in terms of the first order Lagrange basis functions {ψ j } as in (2).To approximate f , we use the second order piecewise polynomial Lagrange basis functions {φ j } associated to nodes obtained by augmenting the vertices {v j } by the midpoints of the edges of the triangles.Denoting by I free the indices to nodes satisfying 0 < v (1) j < 1, the Galerkin approximation of the weak form leads to the linear system for the nodal values {f j } j∈I free of the hydraulic potential, arranged in a vector f h , where the entries of the corresponding matrices are given by (G h ) jk = Ω ∇φ j • ∇φ k dx, j, k ∈ I free , The matrix G h is symmetric positive definite, thus the vector f h can be written as Assuming that the data consist of evaluations of f at points p i ∈ Ω, 1 ≤ i ≤ m, we have the discretized model for the observation, b = A h u h , where the matrix A h is given by We generate the data using a fine hexagonal mesh with mesh size parameter h fine = 0.02.The data are corrupted by scaled Gaussian white noise with standard deviation σ = 3 × 10 −4 , corresponding to a noise level of 10 −4 of the maximum of the noiseless data.
For the inverse problem, we run the algorithm starting with a regular hexagonal coarse mesh with mesh size parameter h = 0.05, performing four full iterations of IAS and subsequent remeshing.As in the previous example, we run the hybrid IAS algorithm with r 1 = 1 and r 2 = 1/2.Unlike in the tomography problem, here it is important for the quality of the reconstruction to use the sensitivity scaling (16), due to the fact that the mesh edges are at different distances from the data points, and without the scaling adjustment, the coefficients z j related to edges near the data points would be favored.The hyperparameters for the IAS algorithm were set to η = β 1 − 3/2 = 0.001 and ϑ * 1 = 0.05, and the remeshing parameters were chosen as h min = 0.003, h max = 0.05, and the anisotropy factor is α = 12.Again, in the first iteration round, we inflate the variance of the likelihood to compensate for the unknown discretization error, using the same inflation factor as in the previous example, σ = 0.3 × h.
The results, together with the true source used for generation of the data are shown in Figure 8.The results show similar characteristics as in the tomography problem: The first iteration tends to overdensify the mesh near the discontinuities, after which the algorithm learns quickly the positions of the jumps and the mesh size decreases.The mesh sizes together with the computing times are again given in tabular form, see Table 6.Table 6: The number of elements in the meshes, and the total computing time of the full updating of the estimates and the meshes.The recomputing of the forward model is not included in the times.

Discussion and outlook
Computational inverse problems for estimating distributed parameters are known to be sensitive to modeling errors, including those arising from insufficiently fine discretization of the underlying continuum model.This work develops further the ideas, outlined in the earlier work [5], that the discretization is part of the inverse problem and should be decided concomitantly with the actual solution of the unknown parameter.Modeling errors are insignificant only if the discretization is fine enough to reduce the error below the noise level that needs to be taken into consideration in the solution of the problem, however, without a guidance of the refinement of the discretization, this may render the forward problem computationally too complex for efficient solution of the problem.In this work we are considering the anisotropic meshing connected to an anisotropic metric that is used to define tessellations allowing a numerical representation of the unknown and the solution of the forward model with sufficient accuracy to yield a modeling error dominated by the exogenous noise.Numerical tests show that the anisotropy reduces the computation times by a factor of 4-8 compared to isotropic selective meshing, and compared to uniform isotropic mesh refining, the reduction would be of orders of magnitude.The examples in this work are in two dimensions.Generalization of the algorithm to three dimensions is possible albeit not without challenges.Furthermore, generalizations to non-linear inverse problems are underway and will be the topic of a separate contribution.

Figure 1 :
Figure 1: The local patch ω i associated to the vertex v i .In the figure, the Whitney basis function associated to the edge {v i , v j 3 } is shown as a vector field plot.Outside the two triangles sharing the edge the basis function vanishes.Observe that the basis function is discontinuous, while having continuous tangential component across the edges.

Figure 2 :
Figure 2: Left panel: An equilateral reference triangle K plotted in blue, and a physical triangle K plotted in red, along with the Steiner circumellipse.On the right, a family of triangles with the same polar matrix P but with different rotations W, all sharing the same Steiner ellipse and thus having the same eccentricity.

Figure 4 :
Figure 4: Four iterations of the hybrid IAS-remeshing algorithm.Here, the data are contaminated by high noise level of 4% of the maximum of the noiseless signal.

Figure 6 :
Figure 6: Four iterations of the hybrid IAS-remeshing algorithm with low noise of 1% of the maximum of the noiseless signal.Here, the minimum edge length in the mesh refinement is one half of the previous example, reducing the modeling error below the additive noise level, yielding a better reconstruction.

Figure 7 :
Figure 7: Profiles of the reconstructions along the vertical diagonal of the disc Ω.The black curve corresponds to the coarser discretization (Experiment 2), the red one to the finer discretization (Experiment 3).The four plots correspond to the four full iterations of the IAS and remeshing algorithm.
a source function, and let f ∈ H 1 (Ω) be the weak solution of the Dirichlet problem, −∆f = u, with a mixed boundary condition: Defining the vertical and horizontal boundary parts as

Table 1 :
Number of CGLS iterations in the two phases corresponding to the first simulation.The rows refer to the full iterations consisting of the full hybrid IAS updates and remeshing, and the columns correspond to IAS iterations with gamma hyperprior (Phase I) and generalized gamma hyperprior with r = 1/2 (Phase II).The maximum number of IAS iterations allowed was 15, and a missing number indicates that the stopping criterion of relative change of θ dropping under 5% was satisfied before the maximum number of iterations was reached.

Table 2 :
Number of CGLS iterations in the two phases corresponding to the second simulation.

Table 4 :
The number of elements in the meshes, and the total computing time of the full updating of the estimates and the meshes.The computations are performed with Matlab on a standard laptop (MacBookPro 3.1 GHz intel Core i5).The recomputing of the forward model is not included in the times.