A Bayesian approach for consistent reconstruction of inclusions

This paper considers a Bayesian approach for inclusion detection in nonlinear inverse problems using two known and popular push-forward prior distributions: the star-shaped and level set prior distributions. We analyze the convergence of the corresponding posterior distributions in a small measurement noise limit. The methodology is general; it works for priors arising from any Hölder continuous transformation of Gaussian random fields and is applicable to a range of inverse problems. The level set and star-shaped prior distributions are examples of push-forward priors under Hölder continuous transformations that take advantage of the structure of inclusion detection problems. We show that the corresponding posterior mean converges to the ground truth in a proper probabilistic sense. Numerical tests on a two-dimensional quantitative photoacoustic tomography problem showcase the approach. The results highlight the convergence properties of the posterior distributions and the ability of the methodology to detect inclusions with sufficiently regular boundaries.


Introduction
The Bayesian approach to inverse problems has in recent decades generated considerable interest due to its ability to incorporate prior knowledge and quantify uncertainty in solutions to inverse problems, see [1,2]. A commonly recurring objective in inverse problems for imaging science is to recover inhomogeneities or inclusions, i.e. piecewise constant features, in a medium; applications range from cancer detection in medical imaging [3,4] to defect detection in materials science [5,6]. In a Bayesian framework, this can be tackled by designing a prior distribution that favors images with these features.
An optimization-based approach can address this by parametrizing the relevant subset of the image space and minimizing a functional over the preimage of this parametrization, see for example [7]. This is visualized in Figure 1, where we consider the parametrization Φ defined on a linear space Θ and giving rise to the subset Φ(Θ) of the image space $L^2_\Lambda(D) = \{\gamma \in L^2(D) : \Lambda^{-1} \le \gamma \le \Lambda \text{ a.e.}\}$, where $D$ is a bounded and smooth domain in $\mathbb{R}^d$, $d = 2, 3$, and $\Lambda > 0$ is a constant. Such approaches benefit computationally from the fact that the set of images with inclusions, i.e. Φ(Θ), forms a low-dimensional subset of the image space $L^2_\Lambda(D)$. In the Bayesian framework, a related approach makes use of a push-forward distribution as the prior distribution, i.e. the distribution of a transformed random element of Θ. This often leads to strong a priori assumptions, as the prior only gives mass to the range of the parametrization. More classical prior distributions, including Laplace-type priors, see for example [8], and other heavy-tailed distributions, often fail to take advantage of the low dimension of such images.
In this paper, we consider a Bayesian approach that captures this idea for two parametrizations used in the detection of inclusions for nonlinear inverse problems: the star-shaped set and level set parametrizations. These parametrizations are studied rigorously in [9,10,11] and remain popular with Bayesian practitioners: we mention [1,12,13,14,15,16] in the case of star-shaped inclusions and [17,18,19,20,12,21] for level set inclusions, see also references therein.
The solution to the inverse problem in the Bayesian setting is the conditional distribution of the unknown given data, referred to as the posterior distribution. The posterior distribution has proved to be well-posed in the sense of [2] for such parametrizations. This means that the posterior distribution continuously depends on the data in some metric for distributions. This property implies, for example, that the posterior mean and variance are continuous with respect to the data, see [8]. However, such results give no guarantee as to where the posterior distribution puts its mass.
A more recent framework provided in [22], using ideas from [23], see also [24], gives tools to analyze the convergence of the posterior distribution for nonlinear inverse problems. Such results, known as 'posterior consistency', address whether the sequence of posterior distributions arising from improving data (in a small noise or large sample size limit) assigns mass approaching 1 to balls centered at the ground-truth parameter $\gamma_0$ generating the data. Nonlinearity in the forward map and parametrization makes consistency results for Gaussian posterior distributions, as in [25], inapplicable. Currently, the setting of [22] and similar approaches require smoothness of the parameter of interest. A crucial condition is that the parameter set that is given most of the mass by the prior has small 'complexity' in the sense of covering numbers, see [23, Theorem 2.1] or [24, Theorem 1.3.2]. Using Gaussian priors, this parameter set is typically a closed Sobolev or Hölder norm ball, see [24, Theorem 2.2.2] or [22]. However, such priors do not give sufficient mass to discontinuous parameters to conclude consistency. In this paper, we aim to address this, at least partially, by parametrizing the set of discontinuous parameters from a linear space Θ of sufficiently smooth functions.
We aim to recover an element γ, which we call the image or the physical parameter, in a subset Φ(Θ) of $L^2_\Lambda(D)$ for some continuous map $\Phi : \Theta \to L^2_\Lambda(D)$. We consider a nonlinear forward map $G : L^2_\Lambda(D) \to \mathcal{Y}$ mapping into a real separable Hilbert space $\mathcal{Y}$ with inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$. (Figure 1: A visualization of the parametrization $\Phi : \Theta \to L^2_\Lambda(D)$ and forward map $G$.) We refer again to Figure 1 for an overview of this setup. This setting allows us to make use of the framework provided in [22], but transfer the complexity condition from subsets of $L^2_\Lambda(D)$ to subsets of Θ, see Section 3.2. In the context of inclusion detection, this means we can detect inclusions with sufficiently smooth boundaries.
Our contributions can be summarized as follows:
• We present a posterior consistency result for the general setting mentioned above, when the parametrization Φ satisfies mild regularity conditions. We use the framework provided by [22], extending it to Hölder continuous G and push-forward priors. In particular, this gives an estimator, the posterior mean, which converges in probability to the true physical parameter in the small noise limit. Formally, this means there is an estimator $\hat\gamma$ defined for noisy measurements $Y$ depending on the noise level $\varepsilon > 0$ such that $\|\hat\gamma(Y) - \gamma_0\|_{L^2(D)} \to 0$ in probability as $\varepsilon \to 0$. This statement will be made precise in Section 2. Furthermore, the rate of convergence is determined in part by the smoothness of elements in Θ and the regularity of the parametrization.
• We show that two parametrizations for inclusion detection, a star-shaped set parametrization and a smoothened level set parametrization, satisfy the conditions for this setup. This verifies and quantifies the use of such parametrizations.
• We numerically verify the approach based on the two parametrizations in a small noise limit for a nonlinear PDE-based inverse problem using Markov chain Monte Carlo (MCMC) methods. We consider a two-dimensional quantitative photoacoustic tomography (QPAT) problem of detecting an absorption coefficient. We derive a new stability estimate following [26,27].
We note that the framework of [22], as applied in e.g. [28] and [29], shows consistency for 'regular link functions' Φ (defined in [30]), which are smooth and bijective. The archetypal example is Φ = exp or a smoothened version of it ensuring positivity of the physical parameter γ. As we shall see, injectivity and inverse continuity are not necessary for the proofs when we want to show consistency in $L^2_\Lambda(D)$. One novelty of our work is to show that this observation has relevance: we seek to recover the physical parameter γ instead of a non-physical parameter in Θ that generated it. As we shall see, a natural parametrization for star-shaped inclusions is Hölder continuous from a suitable space Θ to $L^2_\Lambda(D)$. The same holds true for a smoothened level set parametrization, which we will encounter in Section 4.2.
The structure of the paper can be summarized as follows. In Section 2, we recall a few key elements of the Bayesian framework in a 'white noise' model as outlined in [2] and [31, Section 7.4], including the notion of posterior consistency with a rate. In Section 3, we show that Hölder continuity of Φ, some smoothness of elements in Φ(Θ) and conditional continuity of $G^{-1}$ suffice to show that the posterior mean converges to the ground truth $\gamma_0$ in $L^2(D)$ as the noise goes to zero. Section 4 considers these conditions for the level set and star-shaped set parametrizations, which are well-known in the literature. In Section 5, we consider the two-dimensional quantitative photoacoustic tomography problem suited for piecewise constant parameter inference. Then, Section 6 gives background to our numerical tests and results that emulate the theoretical setting of Section 3. We present concluding remarks in Section 7.
In the following, we let random variables be defined on a probability space $(\Omega, \mathcal{F}, \Pr)$. For a metric space $Z_1$ the Borel σ-algebra is denoted by $\mathcal{B}(Z_1)$. If $F : Z_1 \to Z_2$ is a measurable map between the measure space $(Z_1, \mathcal{B}(Z_1), m)$ and the measurable space $(Z_2, \mathcal{B}(Z_2))$, then $Fm$ denotes the push-forward measure defined by $Fm(B) = m(F^{-1}(B))$ for all $B \in \mathcal{B}(Z_2)$. We denote by $L^2(\Omega, \Pr)$ the space of real-valued square integrable measurable functions from $(\Omega, \mathcal{F}, \Pr)$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. When Pr is the Lebesgue measure on $\mathbb{R}$, we simply write $L^2(\Omega)$. We call probability measures defined on $\mathcal{B}(Z_1)$ Borel distributions.

The Bayesian approach to inverse problems
Bayesian inference in inverse problems centers around a posterior distribution. This is formulated by Bayes' rule once a prior distribution in $L^2_\Lambda(D)$ has been specified and the likelihood function has been determined by the measurement process. In this paper, we consider a 'continuous' model of indirect observations

$Y = G(\gamma) + \varepsilon_n \xi$,    (1)

for a continuous forward map $G : L^2_\Lambda(D) \to \mathcal{Y}$, where the separable Hilbert space $\mathcal{Y}$ has an orthonormal basis $\{e_k\}_{k=1}^\infty$. Here ξ is 'white noise' in $\mathcal{Y}$ defined below in (4). We denote the noise level by $\varepsilon_n := \sigma n^{-1/2}$ for some $\sigma > 0$ and $n \in \mathbb{N}$, which has this convenient form to study a countable sequence of posterior distributions in decreasing noise, i.e. for growing n. When we write Y, it is understood that this depends on n and γ. The rate $n^{-1/2}$ is natural: if $\mathcal{Y}$ is a subspace of Hölder continuous functions on a bounded domain, this observation model is equivalent to observing n discrete point evaluations of G(γ) with added standard normal noise as $n \to \infty$, see [31] and [32]. Given a Borel prior distribution Π on $L^2_\Lambda(D)$, the posterior distribution Π(·|Y) is proportional to the product of the likelihood and prior. Indeed, according to Bayes' rule, if $\mathcal{Y}$ is finite-dimensional, the posterior distribution has a density (Radon-Nikodym derivative) of the form

$\frac{d\Pi(\cdot|y)}{d\Pi}(\gamma) = \frac{1}{Z} \exp\Big(-\frac{1}{2\varepsilon_n^2} \|y - G(\gamma)\|^2\Big)$,

where $Z > 0$ is a constant, see for example [8,33]. This is well-defined for almost all y under the marginal distribution of Y. The relevance of this object emerges when evaluating it in a realization Y(ω) = y. Using inner product rules we can rewrite this as

$\frac{d\Pi(\cdot|y)}{d\Pi}(\gamma) = \frac{1}{Z} \exp\Big(\frac{1}{\varepsilon_n^2} \langle y, G(\gamma)\rangle - \frac{1}{2\varepsilon_n^2} \|G(\gamma)\|^2\Big)$,    (2)

where the contribution of y is absorbed in the constant $Z > 0$.
The purpose of the following paragraphs is to argue that this formula remains valid when $\mathcal{Y}$ is infinite-dimensional, with the interpretation that $\langle Y, G(\gamma)\rangle$ is a Gaussian random variable defined by

$\langle Y, G(\gamma)\rangle := \sum_{k=1}^\infty \langle Y, e_k\rangle_{-} \, \langle G(\gamma), e_k\rangle$,

which is convergent in the mean square sense, see [8, Section 2.4], where $\mathcal{Y}_-$ is the Hilbert space, see also [31, Section 7.4], defined by

$\mathcal{Y}_- := \Big\{ y = \sum_{k=1}^\infty y_k e_k : \|y\|_{\mathcal{Y}_-}^2 := \sum_{k=1}^\infty \lambda_k^2 y_k^2 < \infty \Big\}$,

for a fixed positive sequence $(\lambda_k)$ with $\sum_k \lambda_k^2 < \infty$. Note ξ is a Gaussian random element of $\mathcal{Y}_-$, since it is the Karhunen-Loève expansion of a mean zero Gaussian random element with covariance operator $K : \mathcal{Y}_- \to \mathcal{Y}_-$ defined by $K e_k = \lambda_k^2 e_k$, see [8]. Then Y is also a $\mathcal{Y}_-$-valued Gaussian random element, since it is a translation of $\varepsilon_n \xi$ by an element in $\mathcal{Y}$. We denote the distributions of $\varepsilon_n \xi$ and Y in $\mathcal{Y}_-$ by $P_n$ and $P^\gamma_n$, respectively. We can think of $P^\gamma_n$ as the data-generating distribution indexed by γ, the physical parameter generating the data, and n, which controls the noise regime.
The likelihood function arises as the density (Radon-Nikodym derivative) of $P^\gamma_n$ with respect to $P_n$. This is a consequence of the Cameron-Martin theorem in the Hilbert space $\mathcal{Y}$. The theorem gives the likelihood function as

$p^\gamma_n(Y) := \frac{dP^\gamma_n}{dP_n}(Y) = \exp\Big(\frac{1}{\varepsilon_n^2}\langle Y, G(\gamma)\rangle - \frac{1}{2\varepsilon_n^2}\|G(\gamma)\|^2\Big)$,

here evaluated in Y, see [32, Proposition 6.1.5]. See also a derivation in [31, Section 7.4], for which it suffices that $\gamma \mapsto G(\gamma)$ is continuous from (the standard Borel space) $L^2_\Lambda(D)$ to $\mathcal{Y}$. Then Bayes' rule [33, p. 7] formulates a posterior distribution as a measure in $L^2_\Lambda(D)$ as in the right-hand side of (2), well-defined for almost all Y. According to [33], this equals almost surely a Markov kernel, which we will call the posterior distribution and also denote by Π(·|Y). That is to say that $B \mapsto \Pi(B|Y(\omega))$ is a measure for every $\omega \in \Omega$ and $\omega \mapsto \Pi(B|Y(\omega))$ is measurable for every $B \in \mathcal{B}(L^2_\Lambda(D))$. In particular, $\omega \mapsto \Pi(B|Y(\omega))$ is a [0,1]-valued random variable.
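To make the observation model (1) concrete, the following sketch simulates a truncated version of the data in the basis $\{e_k\}$; the truncation level, forward coefficients and function names are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def observe(G_gamma_coeffs, n, sigma=1.0, rng=None):
    """Simulate truncated white-noise data Y_k = <G(gamma), e_k> + eps_n * xi_k.

    G_gamma_coeffs: coefficients of G(gamma) in the basis {e_k} (assumed known here).
    n: controls the noise level eps_n = sigma / sqrt(n).
    """
    rng = np.random.default_rng(rng)
    eps_n = sigma / np.sqrt(n)
    xi = rng.standard_normal(len(G_gamma_coeffs))  # white noise coefficients
    return G_gamma_coeffs + eps_n * xi

# As n grows, the data concentrate around the forward image G(gamma):
coeffs = np.array([1.0, 0.5, 0.25])
y = observe(coeffs, n=10**6, rng=0)
```

For large n the simulated data are close to `coeffs`, illustrating the small noise limit in which the posterior is analyzed below.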

Convergence in probability
In preparation for the subsequent section, we recall the notion of convergence in probability. Let $t_n > 0$ be a decreasing sequence going to zero. For a fixed $\gamma_0 \in L^2_\Lambda(D)$ and a sequence of measurable functions $f_n : \mathcal{Y}_- \to \mathbb{R}$ we say that the sequence of random variables $\{f_n(Y)\}_{n=1}^\infty$ converges to $\upsilon \in \mathbb{R}$ in $P^{\gamma_0}_n$-probability with rate $t_n$ as $n \to \infty$ if there exists a constant $C > 0$ such that

$P^{\gamma_0}_n\big(|f_n(Y) - \upsilon| > C t_n\big) \to 0$

as $n \to \infty$. We consider the following two cases, where we recall that both the posterior distribution and Y depend tacitly on n.
(i) For a sequence of sets $\{B_n\}_{n=1}^\infty$ in $\mathcal{B}(L^2_\Lambda(D))$, we could claim that $\Pi(B_n|Y) \to 1$ in $P^{\gamma_0}_n$-probability with rate $t_n$ as $n \to \infty$. That is, $f_n(Y) = \Pi(B_n|Y)$ and $\upsilon = 1$. If this is the case for $B_n := \{\gamma : \|\gamma - \gamma_0\|_{L^2(D)} \le C_0 r_n\}$ for some decreasing sequence $r_n > 0$ going to zero and constant $C_0 > 0$, we say that the posterior distribution contracts around, or is consistent in, $\gamma_0$ at rate $r_n$.
(ii) Denote by E[γ|Y] the mean ('posterior mean') with respect to Π(·|Y). This is defined in the sense of a Bochner integral, which is well-defined by [34, Theorem 2], since $\int \|\gamma\|_{L^2(D)} \, d\Pi(\gamma|Y(\omega)) \le \Lambda |D|^{1/2} < \infty$ for all $\omega \in \Omega$. Then $\omega \mapsto E[\gamma|Y(\omega)]$ is an $L^2_\Lambda(D)$-valued random element by the definition of the Bochner integral and by the measurability of pointwise limits of measurable functions, see [35, Theorem 4.2.2]. We could claim that $\|E[\gamma|Y] - \gamma_0\|_{L^2(D)} \to 0$ in $P^{\gamma_0}_n$-probability with rate $t_n$ as $n \to \infty$. That is, $f_n(Y) = \|E[\gamma|Y] - \gamma_0\|_{L^2(D)}$ and $\upsilon = 0$.

Posterior consistency
In this section we recall sufficient conditions posed in [22], see also [24], such that the posterior distribution in our specific setup is consistent. More specifically, we recall for which ground truths $\gamma_0 \in L^2_\Lambda(D)$, forward models G and prior distributions Π we have

$\Pi\big(\gamma : \|\gamma - \gamma_0\|_{L^2(D)} \le C \tilde r_n \,\big|\, Y\big) \to 1$ in $P^{\gamma_0}_n$-probability    (6)

as $n \to \infty$ for some positive decreasing sequence $\tilde r_n$ going to zero. A consequence of this result, under additional assumptions on the prior, is that the posterior mean converges to $\gamma_0$ in $P^{\gamma_0}_n$-probability, see [24, Theorem 2.3.2] or [33, Theorem 8.8], with rate $\tilde r_n$ as $n \to \infty$. This is the case of (ii) above. In the nonlinear inverse problem setting, posterior consistency in the sense of (6) follows from a two-step procedure with the use of conditional stability estimates.
Step 1 The first step reduces convergence of $\{\Pi(B_n|Y)\}_{n=1}^\infty$ for sets of the form $B_n = \{\gamma \in L^2_\Lambda(D) : \|\gamma - \gamma_0\|_{L^2(D)} \le C \tilde r_n\}$ to sets of the form

$\tilde B_n = \{\gamma \in A_n : \|G(\gamma) - G(\gamma_0)\|_{\mathcal{Y}} \le r_n\}$.    (7)

Indeed, for specially chosen subsets $A_n \subset L^2_\Lambda(D)$, which may depend on n, assume we have the estimate

$\|\gamma_1 - \gamma_2\|_{L^2(D)} \le C \|G(\gamma_1) - G(\gamma_2)\|_{\mathcal{Y}}^{\nu}$    (8)

for all $\gamma_1, \gamma_2 \in A_n$ and some $\nu > 0$. Then $\tilde B_n \subset B_n$, and hence $\Pi(B_n|Y) \ge \Pi(\tilde B_n|Y)$, where $\tilde r_n = r_n^\nu$.
Step 2 The second step involves showing that $\Pi(\tilde B_n|Y)$ converges to 1 in $P^{\gamma_0}_n$-probability as $n \to \infty$. This is posterior consistency on the 'forward level'.

Combining Step 1 and Step 2, we find that $\Pi(B_n|Y)$ converges to 1 in $P^{\gamma_0}_n$-probability as $n \to \infty$. The 'conditional' stability estimate of the first step is of independent interest for many inverse problems in the literature and usually requires an in-depth analysis of the inverse problem at hand. In this paper we first treat (8) as an assumption, see Condition 2. Although any modulus of continuity will do for the first step in this two-step procedure, for our concrete example in photoacoustic tomography we will show a Lipschitz stability estimate that holds for all $\gamma \in L^2_\Lambda(D)$, see Section 5. Our main motivation for including $A_n$ in the analysis is to keep the exposition generally applicable.
One of the contributions of [22,24] is to address Step 2 for a random design regression observation model using Theorem 2.1 in [23] and the equivalence between the distance (semi-metric) and the Hellinger distance, see [24], of the data-generating distributions (corresponding to our $p^\gamma_n$). Theorem 28 in [31], see also [32, Theorem 7.3.5], adapts the proof to the observation model (1), which is what we will use. One can see this second step as showing posterior consistency in $G(\gamma_0)$ at rate $r_n$ for the push-forward $G\Pi(\cdot|Y)$ as in [36]. Below, we use the covering number $N(A, d, \rho)$ for a semimetric d, which denotes the minimum number of closed d-balls of radius $\rho > 0$ needed to cover A, see Appendix A for a precise definition. Then the condition to complete Step 2 is as follows.
Condition A. Let $\Pi = \Pi_n$ be a sequence of prior distributions in $L^2_\Lambda(D)$. Let G be the forward model $G : L^2_\Lambda(D) \to \mathcal{Y}$ and $\gamma_0 \in L^2_\Lambda(D)$ the ground truth. Let $r_n$ satisfy $r_n = n^{-a}$ for some $0 < a < 1/2$. Suppose that,
A.1 the prior gives enough mass to contracting balls $B_G(\gamma_0, r_n) := \{\gamma : \|G(\gamma) - G(\gamma_0)\|_{\mathcal{Y}} \le r_n\}$:

$\Pi(B_G(\gamma_0, r_n)) \ge e^{-C_1 n r_n^2}$,

A.2 there exist sets $A_n$ that are almost the support of Π in the sense that

$\Pi(A_n^c) \le e^{-C_2 n r_n^2}$,

A.3 and that there exists a constant $m_0 > 0$ such that

$\log N\big(A_n, d_G, m_0 r_n\big) \le C_3 n r_n^2$, where $d_G(\gamma_1, \gamma_2) := \|G(\gamma_1) - G(\gamma_2)\|_{\mathcal{Y}}$,

for n large enough.
Condition A.1 is a sufficient condition ensuring that the denominator of the posterior distribution cannot decay too fast as $n \to \infty$. This is helpful when showing $\Pi(B_n^c|Y) \to 0$ in $P^{\gamma_0}_n$-probability as $n \to \infty$. Conditions A.2 and A.3, on the other hand, give control over the numerator in a sense that is made precise in the proof of Theorem 2.1 in [23] (or for example Theorem 28 in [31]). There is also a trade-off; the sets $A_n$ should be large enough that they are almost the support of the prior, but small enough that the covering number increases sufficiently slowly as $n \to \infty$. In the general case, Step 2 is completed by the following result proved in Appendix B.
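As a toy illustration of the covering numbers appearing in Condition A.3 (not from the paper): the interval $[0, L]$ with the absolute-value metric satisfies $N([0,L], |\cdot|, \rho) = \lceil L/(2\rho) \rceil$, since each closed ball is an interval of length $2\rho$.

```python
import math

def covering_number_interval(length, rho):
    """N([0, length], |.|, rho): minimal number of closed intervals of radius
    rho (i.e. of length 2*rho) needed to cover [0, length]."""
    return max(1, math.ceil(length / (2 * rho)))

# log N grows like log(1/rho) for an interval; for function classes such as
# Sobolev balls it grows polynomially in 1/rho, which is what A.3 controls.
```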
Theorem 2.1. Let Π(·|Y) be the sequence of posterior distributions arising for the model (1) with $\gamma_0 \in L^2_\Lambda(D)$, G and prior distributions $\Pi = \Pi_n$ satisfying Condition A for some rate $r_n$. Then, there exists $C_0 = C_0(C_2, C_3, m_0, \sigma)$ such that

$\Pi\big(\gamma \in A_n : \|G(\gamma) - G(\gamma_0)\|_{\mathcal{Y}} \le C_0 r_n \,\big|\, Y\big) \to 1$ in $P^{\gamma_0}_n$-probability    (9)

with rate $e^{-b n r_n^2}$ for all $0 < b < C_2 - C_1 - 4$ as $n \to \infty$. Given the preceding result, we can conclude posterior consistency in $\gamma_0$ at rate $\tilde r_n$ as in Step 1, if we have a conditional stability estimate as (8).

Markov chain Monte Carlo
While Section 2.1 concludes in an abstract way the usefulness of the posterior distribution, in this section we briefly recall methods to approximate it. We consider MCMC methods that approximate E[γ|Y ] (or other statistics) from averages of samples from a Markov chain that has the posterior distribution as its stationary distribution. Since the composition G • Φ maps Θ into Y continuously by assumption, given a prior distribution Π θ in Θ, there exists a posterior distribution Π θ (·|Y ) in Θ of the form Naturally, if Π = ΦΠ θ , then by a change of variables . This gives rise to the following 'high-level' algorithm: given a realization y ∈ Y − of Y , in Θ using θ (0) as initial condition with an MCMC method targeting Π θ (·|y), and 3. return {Φ(θ (k)) } K k=1 . For our numerical examples, we use the preconditioned Crank-Nicolson (pCN) MCMC method, see [37]. This method uses only single evaluation of the log-likelihood function every iteration and is hence attractive for expensive PDE-based forward maps. It is well-defined when Θ is a Hilbert space and possesses favorable theoretical properties, see [37,38]. The idea to generate samples from Π(·|y) by pushing forward samples also appears in certain reparametrizations of posterior distributions for the use of hyperparameters, see [39].
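A minimal finite-dimensional sketch of the pCN method (a stand-in for Θ; the function names and the toy Gaussian target are illustrative assumptions, not the paper's PDE setting):

```python
import numpy as np

def pcn(log_lik, sample_prior, theta0, n_iter, step=0.2, rng=None):
    """Preconditioned Crank-Nicolson MCMC for a posterior with a centred Gaussian
    prior. The proposal sqrt(1 - step^2)*theta + step*w, with w drawn from the
    prior, leaves the prior invariant, so the accept/reject ratio involves the
    likelihood only (one log-likelihood evaluation per iteration)."""
    rng = np.random.default_rng(rng)
    theta = np.asarray(theta0, dtype=float)
    ll = log_lik(theta)
    chain = []
    for _ in range(n_iter):
        prop = np.sqrt(1.0 - step**2) * theta + step * sample_prior(rng)
        ll_prop = log_lik(prop)
        if np.log(rng.uniform()) < ll_prop - ll:   # likelihood-only Metropolis ratio
            theta, ll = prop, ll_prop
        chain.append(theta.copy())
    return np.array(chain)

# Toy check: prior N(0,1), one observation y = 1 with noise variance 0.1
chain = pcn(lambda t: -0.5 * np.sum((t - 1.0)**2) / 0.1,
            lambda rng: rng.standard_normal(1), np.zeros(1), 20000, rng=1)
```

In the paper's setting, `log_lik` would be the data-misfit built from G ∘ Φ and `sample_prior` a draw of the Gaussian field coefficients; samples of the physical parameter are obtained by applying Φ to the chain.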

Posterior consistency using parametrizations
In this section, we follow [22,24] in their approach to satisfying Condition A. In the case where $\Pi = \Phi\Pi_\theta$ for $\Pi_\theta$ Gaussian and G ∘ Φ Lipschitz continuous, the approach is the same. We give a brief recap for the case where G ∘ Φ is Hölder continuous for the convenience of the reader. We tackle this by introducing three new conditions convenient for an inverse problem setting. This is in contrast to Condition A, which is general and applicable in many statistical inference problems. As our base case, we assume $\Theta = H^\beta(\mathcal{X})$, where $\mathcal{X}$ is either the $d'$-dimensional torus or a bounded Lipschitz domain $\mathcal{X} \subset \mathbb{R}^{d'}$, $d' \ge 1$, and $\beta > d'/2$. We include the torus in our considerations, since it is a numerically convenient setting. For more general parametrizations for inclusion detection, we shall need small deviations from this setting. However, these cases will take the same starting point of $H^\beta(\mathcal{X})$ in Section 4. We begin by stating conditions on Φ, G and Π so that Condition A is satisfied. To do this, we introduce the following subset of Θ.
We then require the following conditions of Φ.

Condition 1 (On the parametrization Φ). For any
That is, we require at least conditional Hölder continuity of the parametrization map Φ. The $L^\infty(\mathcal{X})$ topology is not necessary for what follows and can be generalized to any $L^p$-norm, $p \ge 1$, or $H^s$-norm, $s < \beta$. Similarly, we require conditional forward and inverse Hölder continuity of the forward map G.
We have the following condition on the prior distributions Π we consider. They should be push-forward distributions of a scaled Gaussian prior distribution in Θ.
Let the reproducing kernel Hilbert space (RKHS), see [33], for a as in Condition A. Then let Π = ΦΠ θ .
This gives the following structure. If one chooses for example a Matérn covariance, see [40], such that $\Pi'_\theta(H^\beta(\mathcal{X})) = 1$, then $H = H^\delta(\mathcal{X})$ with $\delta = \beta + d'/2$, see Example 11.8 and Lemma 11.35 in [33] or [24, Theorem 6.2.3]. The scaling in (16) essentially updates the weight of the prior term to go slower to zero. Indeed, dividing through by the factor $\varepsilon_n^{-2}$ appearing in the data-misfit term, the prior term scales as $\varepsilon_n^2 n^{1-2a} \sim r_n^2$. This term plays the role of the 'regularization parameter' in [41]. Note that $\lim_{n\to\infty} r_n = 0$ and $\lim_{n\to\infty} \varepsilon_n^2/r_n^2 = 0$, as is needed for the convergence of Tikhonov regularizers for example, see [41, Theorem 5.2]. The scaling (16) is also common in the consistency literature, see for example [22]. In our setting, it ensures that samples are with high probability in a totally bounded set $A_n$, as was called for in Conditions A.2 and A.3. We note for $\beta > d'/2$ that $\Pi'_\theta$ is also a Gaussian measure on the separable Banach space $C(\mathcal{X})$ endowed with the usual supremum norm $\|f\|_\infty := \sup_{x\in\mathcal{X}} |f(x)|$. This is a consequence of a continuous Sobolev embedding and [42, Exercise 3.39].
Under Condition 1, 2 and 3, the lemmas in the subsequent sections ensure that Condition A is satisfied. Then we have the following theorem for posterior consistency at γ 0 ∈ Φ(H) using the push-forward prior Π = ΦΠ θ for Π θ a Gaussian distribution satisfying Condition 3.
The rate of convergence in probability is $e^{-bnr_n^2}$ for any $b > 0$, choosing $C_0 > 0$ large enough.
Proof. Note first that Lemma 3.4 shows that Condition A.1 is satisfied, and the lemmas of the preceding subsections verify Conditions A.2 and A.3. Hence Theorem 2.1 gives $\Pi(\tilde B_n|Y) \to 1$ in $P^{\gamma_0}_n$-probability, with rate $e^{-bnr_n^2}$ as $n \to \infty$. Then the wanted result is a consequence of (9).
Posterior consistency with a rate as in the preceding theorem often leads to the convergence of related estimators with the same rate, see [33]. Here, we repeat an argument found in [24] to conclude that the posterior mean converges in $P^{\gamma_0}_n$-probability to $\gamma_0$ as $n \to \infty$.
Proof. The proof of Theorem 2.3.2 in [24] applies here, since Φ maps into L 2 Λ (D) by assumption and hence

Excess mass condition A.2
To motivate more precisely the scaling of the prior and the form of $A_n$, we recall [22, Lemma 5.17]. Hence, $\Pi_\theta$ charges $S_\beta(M)$ with sufficient mass in relation to Condition A.2. However, we can consider a smaller set with the same property: Condition A.2 is then satisfied for $A_n$ defined by (20).
for any given as follows from (19).

Metric entropy condition A.3
Now we show that the sets of the form $A_n$ defined by (20) satisfy the metric entropy condition A.3, with a as in (21).

Small ball condition A.1
In this section, we consider the strong assumption that $\gamma_0 \in \Phi(H)$. We refer the reader to [43] for a more general case where $\theta_0$ is only in the closure of H in Θ. However, this extension is not immediately compatible with the scaling (16). What follows in this section is based on the work [24]. We extend it to the case of Hölder continuous maps G ∘ Φ in a straightforward manner. Below we need the scaled RKHS; this is the RKHS associated with $\Pi_\theta$, see [42] or [32, Exercise 2.6.5].
Proof. For $R > 0$ large enough depending on $\theta_0$ and $\Pi'_\theta$, we obtain the bound (24) by Conditions 1 and 2 and the triangle inequality. Note also that $\tilde\Pi_\theta(\cdot) = \Pi_\theta(\cdot + \theta_0)$ is a Gaussian measure in the separable Hilbert space $H^\beta(\mathcal{X})$. In addition, a closed norm ball in $H^\beta(\mathcal{X})$ is a closed subset of $H^\beta(\mathcal{X})$, and so is $\{\theta \in H^\beta(\mathcal{X}) : \|\theta\|_\infty \le C r_n\}$ by a Sobolev embedding. Then we can apply the Gaussian correlation inequality [24, Theorem 6.2.2] to (24) so that (25) holds. To each of the factors on the right-hand side of (25) we apply [32, Corollary 2.6.18], to the effect that the bound holds for n large enough and R large enough, as follows from (19). The rest of the argument follows [28, Lemma 11] and uses [44, Theorem 1.2], see also Lemma A.2, and the continuous embedding $H \subset H^\delta(\mathcal{X})$ to conclude.

Parametrizations for inclusions
In this section, we make use of Theorem 3.1 for two specific parametrizations suited for inclusion detection: a star-shaped set parametrization and a level set parametrization. These are parametrizations of the form

$\Phi(\theta) = \sum_{i=1}^N \kappa_i 1_{A_i(\theta)}$,    (26)

for some Lebesgue measurable subsets $A_i(\theta)$ of $\mathbb{R}^d$ and constants $\kappa_i > 0$ for $i = 1, \ldots, N$, which we denote collectively as $\kappa = \{\kappa_i\}_{i=1}^N$. Since we consider parametrizations that map into $L^2_\Lambda(D)$, we will implicitly consider Φ(θ) as the restriction of the right-hand side of (26) to D. Note that recovering parameters of this form requires that we know a priori the parameter values $\kappa_i$. However, this could further be modelled into the prior. In the following, we construct $A_i(\theta)$ as star-shaped sets and level sets.

Star-shaped set parametrization
We start by considering the parametrization for a single inclusion, i.e. N = 1. For simplicity of exposition, we consider star-shaped sets in the plane, although it is straightforward to generalize to higher dimensions. Let θ be a continuously differentiable 2π-periodic function. We can think of $\theta : \mathbb{T} \to \mathbb{R}$ as a function defined on the 1-dimensional torus $\mathbb{T} := \mathbb{R}/2\pi\mathbb{Z}$. The boundary of the star-shaped set is a deformed unit circle: for a point x in D it takes, for $v(\vartheta) := (\cos\vartheta, \sin\vartheta)$, the form

$\partial A(\theta) = \{x + e^{\theta(\vartheta)} v(\vartheta) : \vartheta \in \mathbb{T}\}$.

Then we write

$A(\theta) = \{x + s e^{\theta(\vartheta)} v(\vartheta) : 0 \le s \le 1, \ \vartheta \in \mathbb{T}\}$.    (27)

Let $\kappa_1, \kappa_2 > 0$ and define

$\Phi(\theta) = \kappa_1 1_{A(\theta)} + \kappa_2$.    (28)

We have the following conditional continuity result, where we for simplicity fix x ∈ D.

Lemma 4.1. For $\theta_1, \theta_2 \in S_\beta(M)$ we have $\|\Phi(\theta_1) - \Phi(\theta_2)\|_{L^2(D)}^2 \le C \|\theta_1 - \theta_2\|_{L^\infty(\mathbb{T})}$, where C only depends on M and $\kappa_1$.
Proof. By the translation invariance of the Lebesgue measure, it is sufficient to bound the area of the symmetric difference $A(\theta_1)\Delta A(\theta_2) := (A(\theta_1) \setminus A(\theta_2)) \cup (A(\theta_2) \setminus A(\theta_1))$ for x = 0. We parameterize this planar set using a map $K(s, \vartheta)$ and bound its area by the integral of $|JK(s, \vartheta)|$, the determinant of the Jacobian of the map K. In the last line, we used that $z \mapsto \exp(z)$ is locally Lipschitz, as follows from the mean value theorem.
Using the triangle inequality for the symmetric difference and the main result of [45], we would also have an estimate on the continuity of Φ as defined on $D \times H^\beta(\mathbb{T})$, i.e. on elements (x, θ). We could then endow $D \times H^\beta(\mathbb{T})$ with a product prior which straightforwardly satisfies Condition A.1. For simplicity we skip this extension. Instead, we gather the following conclusion, which follows directly from Theorem 3.1 and Corollary 1.
Note that this is the rate of (18) with ζ = 1/2 and d′ = 1. Clearly this convergence rate takes into account that a smooth star-shaped inclusion belongs to a low-dimensional subset of $L^2_\Lambda(D)$. One can think of this fast convergence rate (compared to Gaussian priors directly in $L^2(D)$) as an expression of uncertainty reduction. Parameters $\gamma \in L^2_\Lambda(D)$ of the form (28) carry some regularity. Indeed, using results in [46,47], showing α-Sobolev regularity for 0 < α < 1/2 reduces to giving an upper bound on the area of the ε-tubular neighborhood of ∂A(θ) with respect to ε. This is provided by Steiner's inequality, see [48], for d = 2, or more generally by Weyl's tube formula, see [49], when d ≥ 2.
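A pixelized sketch of the single star-shaped inclusion may clarify the construction; we assume the radial link $r(\vartheta) = e^{\theta(\vartheta)}$ used in the proof above, and the grid, parameter values and function names are illustrative.

```python
import numpy as np

def star_shaped_indicator(theta_fn, center, X, Y, kappa_in=2.0, kappa_out=1.0):
    """Piecewise-constant image taking the value kappa_in on the star-shaped set
    with boundary radius exp(theta(angle)) around `center`, and kappa_out outside.
    theta_fn is the 2*pi-periodic log-radius field."""
    dx, dy = X - center[0], Y - center[1]
    angle = np.arctan2(dy, dx)                  # angular coordinate of each pixel
    radius = np.hypot(dx, dy)
    inside = radius <= np.exp(theta_fn(angle))  # compare with the boundary radius
    return np.where(inside, kappa_in, kappa_out)

# theta = 0 gives the unit disk
xs = np.linspace(-2, 2, 101)
X, Y = np.meshgrid(xs, xs)
gamma = star_shaped_indicator(lambda a: 0.0 * a, (0.0, 0.0), X, Y)
```

A smooth random field θ (e.g. sampled from a Gaussian prior on the torus) deforms the disk into a smooth star-shaped inclusion, which is the low-dimensional structure exploited by the rate above.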
Multiple inclusions The case of multiple star-shaped inclusions is a straightforward generalization using the triangle inequality. We consider for N ≥ 1 the map (26) with sets $A_i(\theta_i)$ of the form (27) centered at points $x_i \in D$, where we set $\theta = (\theta_1, \ldots, \theta_N)$. We denote by $\|\cdot\|_N$ the direct product norm associated with the norm on $L^\infty(\mathbb{T})$, i.e. $\|\theta\|_N := \max_{1 \le i \le N} \|\theta_i\|_{L^\infty(\mathbb{T})}$.
We have the following continuity result.

Lemma 4.3. For $\theta, \theta' \in S_\beta(M)^N$ we have $\|\Phi(\theta) - \Phi(\theta')\|_{L^2(D)}^2 \le C \|\theta - \theta'\|_N$, where C only depends on M, κ and N.
Proof. The claim follows using the triangle inequality, Lemma 4.1, and the equivalence of the p-norms, $p > 0$, on $\mathbb{R}^N$.
Parallel to the remark before Lemma 4.1, we mention that a statement similar to Lemma 4.3 holds true for a map Φ defined on $(D \times H^\beta(\mathbb{T}))^N$, if we in addition wish to infer $x_1, \ldots, x_N$. In preparation for the main result of this section, let us change notation to suit the current setting and let $\Theta = H^\beta(\mathbb{T})^N$. We then endow Θ with a (product) prior distribution $\Pi_\theta$ satisfying Condition 3, see (31); the last equality is found in for example [50, Lemma 1.2]. For this prior, we have the following result, which is proved in Appendix C.

Theorem 4.4. Let Π(·|Y) be the sequence of posterior distributions arising for the model (1) and prior $\Pi = \Phi\Pi_\theta$ for $\Pi_\theta$ as in (31). Then there exists C > 0 such that the posterior contracts around $\gamma_0$ at the rate of Theorem 4.2. Note that this is the same rate as in Theorem 4.2, i.e. the rate does not depend on the number of inclusions; this dependence appears in the constant C.
The example is easy to extend to the more general case where the $L^\infty$-norm is replaced with the $C^k$-norm. Note also that for fixed $\theta^{(n)} = \theta$, we have continuity of Φ in this particular example. This fact generalizes to continuity of Φ in functions θ that do not have critical points on $\{x : \theta(x) = 0\}$. However, for the stronger Condition 1, it is not obvious how much mass Gaussian distributions give to functions whose gradient is bounded away from zero near $\{x : \theta(x) = 0\}$. For this reason, we take a different approach. We define an approximation $\Phi_\epsilon$ of Φ for which Condition 1 is satisfied. This gives an approximate posterior distribution that contracts around $\gamma^\epsilon_0 = \Phi_\epsilon(\theta_0)$. We shall see that if we take $\epsilon = n^{-k}$ for some $k \in (0, 1)$, then the approximation properties of $\Phi_\epsilon$ to Φ and a triangle inequality argument ensure we have consistency at $\gamma_0 = \Phi(\theta_0)$. To this end, consider a continuous approximation $H_\epsilon$ of the Heaviside function. We note two straightforward properties of $H_\epsilon$, stated in (33) and (34). We could even consider a smooth approximation for $H_\epsilon$, as in [21], but this is not necessary for our case. To construct the continuous level set parametrization, take levels $c_0 < c_1 < \cdots < c_N$, set $A_i(\theta) = \{x : c_{i-1} < \theta(x) \le c_i\}$, and let Φ be of the form (26). The corresponding approximate level set parametrization is then

$\Phi_\epsilon(\theta) = \sum_{i=1}^N \kappa_i \big( H_\epsilon(\theta - c_{i-1}) - H_\epsilon(\theta - c_i) \big)$,

where we define $H_\epsilon(z - c_0) = 1$ and $H_\epsilon(z - c_N) = 0$ for any $z \in \mathbb{R}$. One can check that $\Phi_\epsilon$ coincides with Φ when $\epsilon = 0$. Motivated by Example 1 and the property that stationary Gaussian random fields almost surely have no critical points on their level sets, we define the admissible level set functions via

$T_c := \{\theta \in C^2(\mathcal{X}) : \exists x \in \mathcal{X}, \ \theta(x) = c, \ |(\nabla\theta)(x)| = 0\}^\complement$.
(ii) Again by the triangle inequality, now using (34), we obtain the claim.

For the following consistency result, we let Θ and $S_\beta(M)$ be as in (39). We endow Θ with a prior distribution $\Pi_\theta$ that satisfies Condition 3 for $\beta > 2 + d'/2$ such that the covariance kernel associated with the random field is stationary. For simplicity we assume $f(x) = x^\nu$ for some $0 < \nu < 1$ in Condition 2. Then we have the following result, proved in Appendix C.
Theorem 4.6. Suppose Condition 2 is satisfied for S_β(M) as in (39), for f(x) = Cx^ν, with Φ replaced by Φ_{n^{−k}} for a well-chosen k, and where C and C_G are independent of n. Let γ_0 = Φ(θ_0) for θ_0 ∈ H ∩ Θ, and let Π(·|Y) be the corresponding sequence of posterior distributions arising from the model (1) and the prior Π = Φ_{n^{−k}} Π_θ as above. Then the posterior contracts around γ_0 with rate n^{−aν} as n → ∞, where a = ηδ/(2dνη + 2ηδ + d).
Note that for weak inverse stability estimates, i.e. for small ν, the obtained contraction rate approaches the usual rate (18).

Quantitative photoacoustic tomography problem
To test the convergence of the inclusion detection methods, we consider the following test problem in quantitative photoacoustic tomography (QPAT), see [54,27,55]. The diffusion approximation in QPAT models light transport in a scattering medium by an elliptic equation, where μ ∈ L^2_{Λ_μ}(D), Λ_μ > 0, and γ ∈ L^2_Λ(D) are the optical diffusion and absorption parameters, respectively. The prescribed Dirichlet boundary condition u = g defines the source of incoming radiation. It is well known that (42) has a unique solution u ∈ H^1(D) for each g ∈ H^{1/2}(∂D) and any nonzero source function h ∈ H^{−1}(D) in (42). Furthermore, we have the corresponding a priori estimate, see for example [56, Chapter 6]. QPAT aims to reconstruct the optical parameters from the absorbed optical energy density map H, which equals the product γu up to a proportionality constant that models the photoacoustic effect. In our simplified approach, we aim to invert this forward map. For smoothness and physical accuracy we assume g ∈ H^{3/2}(∂D) and 0 < g_min ≤ g ≤ g_max.
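To make the structure of this forward map concrete, here is a one-dimensional finite-difference stand-in; everything below is an illustrative assumption (the paper solves the 2D problem on a disk with FEM). We solve a diffusion-type equation −(μu′)′ + γu = 0 with Dirichlet data g and return the absorbed energy map H = γu.

```python
import numpy as np

def forward_map(gamma, mu, g_left, g_right, n=200):
    """1D sketch of a QPAT-style forward map: solve -(mu u')' + gamma u = 0
    on (0, 1) with u(0) = g_left, u(1) = g_right, and return H = gamma * u.

    gamma, mu : arrays of length n + 1 (values at the grid nodes)
    """
    h = 1.0 / n
    mu_half = 0.5 * (mu[:-1] + mu[1:])      # mu at cell midpoints
    A = np.zeros((n - 1, n - 1))
    b = np.zeros(n - 1)
    for i in range(1, n):                   # interior nodes
        A[i - 1, i - 1] = (mu_half[i - 1] + mu_half[i]) / h**2 + gamma[i]
        if i > 1:
            A[i - 1, i - 2] = -mu_half[i - 1] / h**2
        if i < n - 1:
            A[i - 1, i] = -mu_half[i] / h**2
    b[0] += mu_half[0] * g_left / h**2      # Dirichlet data moved to the rhs
    b[-1] += mu_half[-1] * g_right / h**2
    u = np.empty(n + 1)
    u[0], u[-1] = g_left, g_right
    u[1:-1] = np.linalg.solve(A, b)
    return gamma * u                        # absorbed energy H = gamma * u
```

By the maximum principle, for positive boundary data the solution u stays positive and below max g, so H inherits pointwise bounds from γ, mirroring the estimates used in Lemmas 5.1 and 5.2.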
This setting allows a simple inverse stability estimate. First, we have the following continuity result for G.
Proof. We note that u_1 − u_2 solves the corresponding boundary value problem. Then the claim follows by (43) and the maximum principle [57, Theorem 8.1].

Lemma 5.2. Under the same assumptions as in Lemma 5.1, there exists a constant C > 0 such that the stated stability estimate holds.

Proof. See also [27]. By elliptic regularity, and by the trace theorem, see [58], for g as in (44) there exists v ∈ H^2(D) such that u_2 − v ∈ H^1_0(D). By a Sobolev embedding, v ∈ C^{0,α_0}(D) for some α_0 > 0 depending on d = 2, 3. Theorem 8.29 and the remark thereafter in [57] state that u_2 ∈ C^α(D) for some α = α(d, Λ_μ, Λ, D, α_0) > 0, together with an accompanying estimate in terms of g. By the maximum principle [57, Theorem 8.1] we can collect the right-hand side into one constant U = U(U_1, U_2, g_max) > 0. Now, using the argument of [26, Lemma 12], which in turn uses the Harnack inequality [57, Corollary 8.21], we conclude the lower bound, where m = m(d, Λ_μ, Λ, D, U, α, g_min) is a constant. Combining this with (46) and (47) gives the claim. We note that G satisfies Condition 2 with η = 1 and f(x) = x. We also note that Y = L^2(D) is a separable Hilbert space with an orthonormal basis consisting of eigenfunctions of the Dirichlet Laplacian on D. We conclude that this problem is suitable as a test problem and that Theorems 4.4 and 4.6 apply. In Section 7 we discuss other suitable inverse problems.

Numerical results
We discuss our numerical tests for detecting inclusions in the QPAT problem using the pCN algorithm of Section 2.2 and the parametrizations of Section 4. For simplicity, we take D = B(0, 1), the two-dimensional unit disk.

Observation model
As an approximation to the continuous observation model (1), we consider for the numerical experiments a truncated observation, where {e_k}^∞_{k=1} is an orthonormal basis of L^2(D) consisting of eigenfunctions of the Dirichlet Laplacian on D and N_d ∈ N is a suitable truncation number. This observation is the sequence of coefficients of the projection of Y from (1) onto the span of {e_k}^{N_d}_{k=1}. As N_d → ∞, observing the truncated data becomes equivalent to observing Y, see for example [28, Theorem 26]. Besides being a convenient approximation, this model has numerical relevance: closed-form reconstruction formulas exist for ⟨G(γ), e_k⟩_{L^2(D)} in the first part of the photoacoustic problem, see [59,60]. The likelihood function then takes the corresponding finite-dimensional form.
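A sketch of this truncated observation, using the unit interval as a stand-in for the disk. Everything below is illustrative: on (0, π) the Dirichlet Laplacian eigenfunctions are sines, whereas the disk requires Bessel-type functions, and the forward data G(γ) is replaced by a placeholder.

```python
import numpy as np

# Stand-in for the truncated observation: coefficients of noisy data in the
# first N_d Dirichlet Laplacian eigenfunctions. On (0, pi) these are
# e_k(x) = sqrt(2/pi) sin(k x); the disk would require Bessel functions.
rng = np.random.default_rng(0)
n_grid = 4001
x = np.linspace(0.0, np.pi, n_grid)
dx = x[1] - x[0]
G_gamma = np.sin(x) * np.exp(-x)      # placeholder for the forward data G(gamma)

N_d, noise_level = 10, 0.01
Y = np.empty(N_d)
for k in range(1, N_d + 1):
    e_k = np.sqrt(2.0 / np.pi) * np.sin(k * x)
    coeff = np.sum(G_gamma * e_k) * dx  # approximates <G(gamma), e_k>_{L^2}
    Y[k - 1] = coeff + noise_level * rng.standard_normal()
```

Each entry of Y is one noisy basis coefficient; as N_d grows, the truncated data carries the same information as the full observation.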

Approximation of the forward map
We approximate the forward map using the Galerkin finite element method (FEM) with piecewise linear basis functions {ψ_k}^{N_m}_{k=1} over a triangular mesh with N_m vertices and N_e elements, see [54,61]. When γ ∈ L^2_Λ(D) is discontinuous or continuous, we approximate it by γ̃_{N_e} or γ̄, respectively. Here E_k denotes the k-th element of the triangular mesh. This gives two approximations of the forward map, G̃_{N_e} and Ḡ_{N_m}, where ũ is the FEM solution corresponding to γ̃_{N_e} and ū_{N_m} is the FEM solution corresponding to γ̄. For the smooth level set parametrization we use Ḡ_{N_m} with N_m = 12708 nodes, while for the star-shaped set parametrization we use G̃_{N_e} with N_e = 25054 elements.
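A minimal 1D sketch of the two coefficient discretizations, an illustrative reduction since the actual implementation works on a 2D triangular mesh: a per-element constant approximation for discontinuous coefficients and a nodal, piecewise-linear one for continuous coefficients.

```python
import numpy as np

# 1D sketch of the two coefficient discretizations: per-element constants
# (for discontinuous gamma) and nodal values with linear interpolation
# between nodes (for continuous gamma). The triangular-mesh version is
# analogous, with one value per triangle or per vertex.
nodes = np.linspace(0.0, 1.0, 11)                    # 11 nodes, 10 elements
gamma_disc = lambda x: np.where(x < 0.55, 0.1, 0.4)  # a jump coefficient

# per-element: one value per element, here taken at the element midpoint
midpoints = 0.5 * (nodes[:-1] + nodes[1:])
gamma_elem = gamma_disc(midpoints)                   # shape (10,)

gamma_cont = lambda x: 0.1 + 0.3 * x**2              # a smooth coefficient
gamma_node = gamma_cont(nodes)                       # shape (11,), linear in between
```

The per-element version keeps jumps sharp at element boundaries, which is why it is paired with the star-shaped parametrization, while the nodal version suits the continuous level set parametrization.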
We compute {e_k}^{N_d}_{k=1} by solving the generalized eigenvalue problem arising from the FEM formulation of the Dirichlet eigenvalue problem with the Matlab function sptarn. Then ⟨G(γ), e_k⟩_{L^2(D)} is approximated using the mass matrix for k = 1, …, N_d, with N_d = N_freq(N_freq + 1) and N_freq = 13.

Phantom, noise and data
The phantom we seek to recover consists of two inclusions with values (κ_1, κ_2, κ_3) = (0.1, 0.4, 0.2), where A_1 and A_2 are two star-shaped sets described by their boundaries with ϕ(ϑ) = 0.12(0.8 + 0.8(cos(4ϑ) − 1)^2), see Figure 2. We compute and fix the optical diffusion parameter to μ = 1/(2(γ_0 + μ_s(1 − 0.8))) following [61]. Here the scattering parameter μ_s equals 100γ_0 smoothed with a Gaussian smoothing kernel of standard deviation 15 using the Matlab function imgaussfilt. We choose an illumination g that is smooth and positive on ∂D, defined as a superposition of three Gaussians with m_3 = −m_1, s_1 = 10, s_2 = 2 and s_3 = 5; this illuminates the target well. We simulate data Y as in (48) by computing G̃_{N_{e_0}}(γ_0) on a fine mesh of N_{e_0} = 75624 elements and N_{m_0} = 38127 nodes. The corresponding projection can be seen in Figure 3. We choose ε > 0 such that the relative error takes values in {1, 2, 4, 8, 16} · 10^{−2}. See Figure 3 for a realization of the white noise expansion (3) projected onto the first N_d orthonormal vectors {e_k}^{N_d}_{k=1} and scaled so that it accounts for 4% relative noise.
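A sketch of how such a piecewise-constant phantom with star-shaped inclusions can be rasterized. The centers and the first radial function below are illustrative assumptions; only the values (0.1, 0.4, 0.2) and the cosine-type boundary function are taken from the text.

```python
import numpy as np

def star_indicator(X, Y, center, radial):
    """Boolean mask of a star-shaped set: points whose distance to the
    center is below a radial boundary function of the polar angle."""
    dx, dy = X - center[0], Y - center[1]
    r = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)
    return r <= radial(theta)

n = 256
xs = np.linspace(-1.0, 1.0, n)
X, Y = np.meshgrid(xs, xs)
disk = X**2 + Y**2 <= 1.0

kappa_bg, kappa_1, kappa_2 = 0.1, 0.4, 0.2   # values from the text
gamma0 = np.where(disk, kappa_bg, 0.0)
# first inclusion: a disk-shaped star set (center and radius assumed)
gamma0[star_indicator(X, Y, (-0.4, 0.3), lambda t: 0.25 * np.ones_like(t))] = kappa_1
# second inclusion: the cosine-type boundary from the text (center assumed)
gamma0[star_indicator(X, Y, (0.35, -0.3),
                      lambda t: 0.12 * (0.8 + 0.8 * (np.cos(4 * t) - 1)**2))] = kappa_2
```

The resulting image takes only the values 0 (outside the disk), 0.1, 0.4 and 0.2, matching the piecewise-constant structure the priors are designed for.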
To estimate the approximation error, we compute the vectors V_j for γ_j, j = 1, …, 200, samples of the prior for the level set parametrization introduced in Section 6.4 below. We then compute ε^2_level = tr(C)/N_d, where tr(C) is the trace of the sample covariance matrix C of the vectors V_j. For this choice, N(0, ε^2_level I) minimizes the Kullback–Leibler distance to N(0, C), see [37]. We compute ε_star in the same way using Ḡ_{N_{m_0}}, Ḡ_{N_m} and samples of the prior for the star-shaped set parametrization in Section 6.4.
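The noise-level estimate can be sketched as follows, with mock approximation-error vectors V_j standing in for the differences of the two forward map discretizations; the KL-optimal isotropic variance is ε^2 = tr(C)/N_d.

```python
import numpy as np

# Estimate a scalar noise level from approximation-error samples: the
# variance eps^2 = tr(C) / N_d makes N(0, eps^2 I) closest in KL to
# N(0, C). The vectors V_j below are mock data, not the paper's errors.
rng = np.random.default_rng(1)
N_d, J = 20, 200
V = rng.standard_normal((J, N_d)) * np.linspace(0.5, 1.5, N_d)  # mock error vectors
C = np.cov(V, rowvar=False)            # sample covariance of the V_j
eps_level = np.sqrt(np.trace(C) / N_d)  # KL-optimal isotropic noise level
```

This reduces a full covariance of model errors to a single scalar that can be folded into the observation noise.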

Results
In this section we present the numerical results using the star-shaped and level set parametrizations in different noise regimes. We use the algorithm described in Section 2.2 with the pCN method, implemented with an adaptive stepsize targeting a 30% acceptance rate; the initial stepsize is denoted by b. For an example implementation of this sampling method, we refer to the Python package CUQIpy, see [66]. For the star-shaped set parametrization, we choose the prior and algorithm parameters in the order (16, 8, …). For the star-shaped set parametrization we obtain K = 10^6 samples after a burn-in of 5 · 10^5, whereas for the level set parametrization we take K = 10^6 samples after a burn-in of 1.2 · 10^6. We find this choice suitable, since the truncation in Section 6.4 leaves us with a higher-dimensional sampling problem in the level set case. We base our posterior mean approximations on Monte Carlo estimates using 10^2 equally spaced samples of the chain. In Figure 5, we see the posterior mean arising from the star-shaped set parametrization and observations with different noise levels. The posterior mean approximates the ground truth well for all noise levels. Note that the posterior mean varies only slightly for each noise level and is approximately piecewise constant, indicating little posterior variance. This is due to the small noise level and the fast contraction rate that this inverse problem provides by virtue of (45). The estimates are not exact, but note that exact data is not available due to projection and discretization. Taking N_d large improves the data but also causes the likelihood function to attain larger values, which in turn requires a smaller stepsize b; there is thus a computational trade-off between N_d and b. Even for 16% relative noise the reconstruction is fairly good, although the variance of the posterior samples is visibly larger. It is a strength of this method that it is robust to large noise levels.
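A minimal pCN sketch with the kind of adaptive stepsize described above; the adaptation rule below is an assumption, as the paper only states that a 30% acceptance rate is targeted. The proposal θ′ = √(1 − b²)θ + bξ with ξ ~ N(0, I) is reversible with respect to the Gaussian prior, so the acceptance ratio involves only the likelihood.

```python
import numpy as np

def pcn(neg_log_lik, theta0, n_iter=5000, b=0.1, target=0.3, rng=None):
    """Preconditioned Crank-Nicolson sampler for a N(0, I) prior.

    neg_log_lik : negative log-likelihood of the data given theta
    b           : initial stepsize, adapted toward `target` acceptance
    Returns the chain (n_iter, dim) and the overall acceptance rate.
    """
    rng = rng or np.random.default_rng(0)
    theta, nll = theta0.copy(), neg_log_lik(theta0)
    samples, acc = [], 0
    for i in range(1, n_iter + 1):
        # pCN proposal: prior-reversible, so only the likelihood appears
        prop = np.sqrt(1.0 - b**2) * theta + b * rng.standard_normal(theta.size)
        nll_prop = neg_log_lik(prop)
        if np.log(rng.uniform()) < nll - nll_prop:
            theta, nll = prop, nll_prop
            acc += 1
        # diminishing stochastic adaptation of b toward the target rate
        b = float(np.clip(b * np.exp((acc / i - target) / np.sqrt(i)), 1e-4, 0.9999))
        samples.append(theta.copy())
    return np.array(samples), acc / n_iter
```

The posterior mean is then a Monte Carlo average over (thinned) post-burn-in samples, as done for the reconstructions above.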
The mixing of the sample chains in the trace plots in Figure 7 indicates that the sampling algorithm is performing well. The convergence of the posterior mean is also evident in L^2-distance as computed numerically, see Figure 8. This rate does not match the theoretical one, but that is too much to expect for the observation model (48), since it does not match the continuous observation (1) for which the rate is proved. Note that we do not numerically scale the priors as the theoretical results require. Figure 6 suggests that the posterior mean converges as the noise level goes to zero, as is also evident from its L^2-loss in Figure 8. Note that the reconstructions are continuous, not only because we take an average, but also because we use a continuous level set parametrization. Here, the sampling is initialized at θ^(0) = 2φ_(0,−1), since this guess captures some of the low-frequency information of a possible θ_0 that can give rise to γ_0. We report that chains with a small stepsize and the natural starting guess θ^(0) = 0 often get stuck in local minima, owing to the number of levels in (52) and to the fact that the pCN method uses no gradient information about either the parametrization or the forward map.
The sample diagnostics of Figure 7 indicate that sampling is harder for the level set parametrization than for the star-shaped set parametrization. There are at least two likely reasons. The first is the large number of coefficients θ_ℓ, max(|ℓ_1|, |ℓ_2|) ≤ 4; this was also noted in [10]. The second is that θ → Φ_ε(θ) is not injective for any ε ≥ 0. The prior could therefore be multimodal, which can lead to correlated samples in the Markov chain. Other work suggests that the pCN method performs poorly when applied to a correlated and multimodal posterior, see [67], which also provides a gradient-based remedy. The level set method has found success in optimization-based approaches, for example in [68], where a descent step is taken in each iteration of an iterative algorithm. A Bayesian maximum a posteriori approach [21] has also been shown to succeed for a smoothed level set. We expect that using gradient information in gradient-based MCMC methods would improve the performance significantly. A benefit of the level set parametrization is that we do not need to know the number of inclusions a priori, as in the case of the star-shaped set parametrization. One could also combine the two methods as in [12]. Note in Figure 8 that, for both parametrizations, the posterior mean is stable under different noise realizations. This mirrors the convergence in probability we expect from Theorems 4.4 and 4.6.

Conclusions
In this paper, we provide and investigate a Bayesian approach to consistent inclusion detection in nonlinear inverse problems. The posterior consistency analysis is performed under general conditions of Hölder continuity of the parametrization and conditional well-posedness of the inverse problem, and it gives an explicit rate. We showcase the convergence of the posterior mean in a small noise limit for a photoacoustic problem, where we note that the star-shaped set parametrization outperforms the level set parametrization. We highlight that Theorems 4.2 and 4.6 hold for any forward map satisfying Condition 2 and can be applied to other parametrizations; a different parametrization could, for example, arise in the related problem of crack detection. Interesting future work includes applying the inclusion detection method to other inverse problems. Stability estimates similar to that of Lemma 5.2 exist for the mathematically closely related problems of determining the absorption coefficient in acousto-optic imaging and the permittivity in microwave imaging, see [26]. This is also the case for conductivity imaging in quantitative thermoacoustic tomography, where [69] employed complex geometrical optics solutions. For the Calderón problem in two dimensions, [70] provides a stability estimate suited to the star-shaped set parametrization, see also the comments after Theorem 4.2 on the regularity of γ. There is a natural Hilbert space observation setting for the Calderón problem, see [28]. In three dimensions and higher, conditional stability for inclusion detection in the context of the Calderón problem has also been considered and shown to be logarithmic at best [71]. The generalization to three dimensions and more complex phantoms is left for future work.

A. Covering numbers
Consider a compact subset A of a space X endowed with a semimetric d. The covering number N(A, d, ρ) denotes the minimum number of closed d-balls {x ∈ X : d(x_0, x) ≤ ρ} with center x_0 ∈ A and radius ρ > 0 needed to cover A, see for example [33, Appendix C] or [32, Section 4.3.7]. The metric entropy is then log N(A, d, ρ). When d is replaced by a norm, we mean the metric induced by the norm.
Lemma A.1. Let (X, d_X) and (Y, d_Y) be two linear spaces endowed with semimetrics d_X and d_Y, respectively.
(i) If Φ is Hölder continuous on some A ⊂⊂ X with exponent η > 0, then for any ρ > 0 the covering number of Φ(A) is controlled by that of A. (ii) For A ⊂⊂ X and B ⊂⊂ Y, the covering number of A × B with respect to the product metric d_∞ is bounded by the product of the covering numbers of A and B.
Proof. (i) We denote by B_X(x′, ρ) and B_Y(y′, ρ) the balls in X and Y with centers x′ ∈ X and y′ ∈ Y and radius ρ > 0, respectively. For any ρ > 0, the Hölder continuity gives an inclusion of images of balls, and the claim follows. (ii) Let C_A be a finite set in A and C_B a finite set in B realizing the respective coverings. Take z = (x, y) ∈ A × B; then there exist x_0 ∈ C_A such that x ∈ B_X(x_0, ρ) and y_0 ∈ C_B such that y ∈ B_Y(y_0, ρ). Hence z ∈ B_{X×Y}((x_0, y_0), ρ) := {z ∈ X × Y : d_∞(z, (x_0, y_0)) ≤ ρ}, and the desired property follows.
Lemma A.2. Let X be a bounded Lipschitz domain in R^{d′} or the d′-dimensional torus, and let β > d′/2. Then the stated metric entropy bound holds.

Proof. Corollary 4.3.38 and the remark thereafter in [32] state the corresponding bound for the norm ball; for the extension operator E, see for example [72]. We denote by R : H^β([0, 1]^{d′}) → H^β(X) the restriction, which is a contraction in supremum norm and a left-inverse of E. Then S_β(M) = R(E(S_β(M))) and E(S_β(M)) ⊂ B_β(CM), and the claim follows using also Lemma A.1 (i) and (55).

B. On maximum likelihood composite testing
We denote by E_n and E^γ_n the expectation with respect to P_n and P^γ_n, respectively.

Lemma B.1. Suppose that for a non-increasing function N(ρ), some ρ_0 > 0 and all ρ > ρ_0, the stated covering bound holds. Then for every ρ > ρ_0 there exist measurable functions Ψ_n : Y → {0, 1} satisfying the stated testing bounds.

Proof. Choose finite sets S′_j so that every γ ∈ S_j is within distance jρ/4 of a point in S′_j. For ρ > ρ_0, there are at most N(jρ) such points. For each γ_jl ∈ S′_j, define the measurable function Ψ_{n,j,l} : Y → {0, 1}, known as the maximum likelihood test. Now set Ψ_n(y) := 1_{∪_{j,l} A_{n,j,l}}(y). This is again a measurable function, since a countable union of measurable sets is measurable. The first bound then follows by the union bound; on the other hand, for any j ≥ 1, the second bound holds over S_j. For j = 1 we obtain the desired result.
There are other ways to prove the existence of suitable measurable functions Ψ_n used in the proof of Theorem 2.1; we mention the approximation argument of [28, Lemma 8], which requires smoothness properties of G. Given any C_2 > C_1 + 4, we set ρ := 4mr_n for m > m_0 large enough (depending also on C_2, C_3 and σ^2 by the following) and apply Lemma B.1: there exist measurable functions Ψ_n : Y → {0, 1} such that

E^{γ_0}_n(Ψ_n) ≤ e^{C_3 n r_n^2} · e^{−2σ^{−2} m n r_n^2} / (1 − e^{−2σ^{−2} m n r_n^2}) ≤ e^{−C_2 n r_n^2}.
Then Theorem 28 in [31] and modifications as in the proof of Theorem 1.3.2 in [24] give the claim for any given C_2 > 0; hence Condition A.2 is satisfied. Condition A.3 is satisfied as in Lemma 3.3, using Lemma A.1 (ii). The result then follows as in the proof of Theorem 3.1 and Corollary 1.
Proof of Theorem 4.6. Let γ^n_0 = Φ_{n^{−k}}(θ_0) and γ^n = Φ_{n^{−k}}(θ) for some 0 < k < 1 to be chosen later. For any r_n > 0, the triangle inequality gives a bound whose two factors on the right-hand side we consider in separate parts below: 1) We check that Condition A.