Multilevel Dimension-Independent Likelihood-Informed MCMC for Large-Scale Inverse Problems

We present a non-trivial integration of dimension-independent likelihood-informed (DILI) MCMC (Cui, Law, Marzouk, 2016) and the multilevel MCMC (Dodwell et al., 2015) to explore the hierarchy of posterior distributions. This integration offers several advantages: First, DILI-MCMC employs an intrinsic likelihood-informed subspace (LIS) (Cui et al., 2014) -- which involves a number of forward and adjoint model simulations -- to design accelerated operator-weighted proposals. By exploiting the multilevel structure of the discretised parameters and discretised forward models, we design a Rayleigh-Ritz procedure to significantly reduce the computational effort in building the LIS and operating with DILI proposals. Second, the resulting DILI-MCMC can drastically improve the sampling efficiency of MCMC at each level, and hence reduce the integration error of the multilevel algorithm for fixed CPU time. Numerical results confirm the improved computational efficiency of the multilevel DILI approach.


Introduction
Inverse problems aim to estimate unknown parameters of mathematical models from noisy and indirect observations. The unknown parameters, often represented as functions, are related to the observed data through a forward model, such as a differential equation, that maps realisations of parameters to observables. For ill-posed inverse problems, there may exist many feasible realisations of parameters that are consistent with the observed data, and small perturbations in the data may lead to large perturbations in unregularised parameter estimates. The Bayesian approach [26,37,38] casts the solution of inverse problems as the posterior probability distribution of the model parameters conditioned on the data. This offers a natural way to integrate the forward model and the data together with prior knowledge and a stochastic description of measurement and/or model errors to remove the ill-posedness and to quantify uncertainties in parameters and parameter-dependent predictions. As a result, parameter estimates, model predictions, and associated uncertainty quantifications can be issued in the form of marginal distributions or expectations of some quantities of interest (QoI) over the posterior. Due to the typically high parameter dimensions and the high computational cost of the forward models, characterising the posterior and computing posterior expectations are in general computationally challenging tasks. Integrating multilevel Markov chain Monte Carlo (MCMC) [14,23], likelihood-informed parameter reduction [11,36,41] and dimension-independent MCMC [4,8,10,34], we present here an integrated framework to significantly accelerate the computation of posterior expectations for large-scale inverse problems.
In inverse problems, unknown parameters are often cast as functions, and hence the Bayesian inference has to be carried out over typically high-dimensional discretisations of the parameters that sufficiently resolve the spatial and/or temporal variability of the underlying problem. Examples are the permeability field of a porous medium [9,14,22,24] or Brownian forcing of a stochastic ordinary differential equation [3]. In those settings, efficient MCMC methods have been developed to sample the posterior and compute posterior expectations with convergence rates that are independent of the discretised parameter dimension; these include (preconditioned) Crank-Nicolson (pCN) methods [4,8,20] that establish the foundation for designing and analysing MCMC algorithms in a function space setting, stochastic Newton methods [29,31] that utilise Hessian information to accelerate the convergence, as well as operator-weighted methods [10,27,34] that generalise pCN methods using (potentially location-dependent) operators to adapt to the geometry of the posterior.
Discretisation also arises in the numerical solution of the forward model, e.g., finite-element discretisation of PDEs. As many degrees of freedom are needed, it can be computationally demanding to accurately resolve the forward model, which is required to simulate the posterior density. A natural way to reduce the computational cost is to utilise a hierarchy of forward models related to a sequence of grid discretisations, ranging from computationally cheaper and less accurate coarse models to more costly but more accurate fine models. Corresponding to this hierarchy of models, the parameters can also be represented by a sequence of discretised functions with increasing dimensions. This yields a hierarchy of posterior distributions. By allocating different numbers of MCMC simulations to sample posteriors across different levels and by combining all those sample-based posterior estimations using a telescoping sum [15], the multilevel MCMC [14,23] provides accelerated and unbiased estimates of posterior expectations.
We present a non-trivial integration of the dimension-independent likelihood-informed (DILI) MCMC [10] and the multilevel MCMC in [14] to explore the hierarchy of posterior distributions. This integration offers several advantages: First, DILI-MCMC employs an intrinsic likelihood-informed subspace (LIS) [11] -- which involves a number of forward and adjoint model simulations -- to design accelerated operator-weighted proposals. By exploiting the multilevel structure of the discretised parameters and discretised forward models, we design a Rayleigh-Ritz procedure to significantly reduce the computational effort in building a hierarchical LIS and operating with DILI proposals. Second, the resulting DILI-MCMC can drastically improve the sampling efficiency of MCMC at each level, and hence reduce the integration error of multilevel Monte Carlo for a fixed CPU time budget. Numerical results confirm the improved computational efficiency of the proposed multilevel DILI approach.
We note that the DILI proposal has been used before in the multilevel sequential Monte Carlo (SMC) setting [2], but in a very different way. We use derivative information of the likelihood to recursively construct the LIS via matrix-free eigenvalue solves, whereas [2] uses multilevel SMC to estimate the full-rank empirical posterior covariance matrix and then builds the LIS from this posterior covariance matrix. Moreover, we construct DILI proposals by exploiting the structure of the hierarchical LIS to couple Markov chains across levels, whereas [2] employs the original DILI proposal in the mutation step of SMC to improve mixing.
The paper is structured as follows. Section 2 introduces the framework of Bayesian inverse problems and MCMC sampling, while section 3 discusses the general framework of multilevel MCMC. The Rayleigh-Ritz procedure for the recursive construction of the hierarchical LIS is presented in section 4. The coupled DILI proposals that can exploit the hierarchical LIS are introduced in section 5. Section 6 provides numerical experiments to demonstrate the efficacy of the resulting MLDILI method, while finally, in section 7, we provide some concluding remarks.

Background
In this section, we review the Bayesian formulation of inverse problems, the dimension-independent likelihood-informed MCMC approach, posterior discretisation, as well as the bias-variance decomposition for MCMC algorithms.

Bayesian inference framework
Suppose the parameter of interest is some function u in a separable Hilbert space H(Ω) defined over a given bounded domain Ω ⊂ R^d. We introduce a prior probability measure µ_0 to represent the a priori information about the function u. The inner product on H is denoted by ⟨·, ·⟩_H, with associated norm denoted by ∥·∥_H. For brevity, where misinterpretation is not possible, we will drop the subscript H. We assume that the prior measure is Gaussian with mean m_0 ∈ H and a self-adjoint, positive definite covariance operator Γ_pr that is trace-class, so that the prior provides a full probability measure on H.
Given observed data y ∈ R^d and the forward model F : H → R^d, we define the likelihood function L(y|u) of y given u. Denoting the posterior probability measure by µ_y, the posterior distribution on any infinitesimal volume du ⊆ H is given by

µ_y(du) ∝ L(y|u) µ_0(du). (1)

Making the simplifying assumption that the observational noise is additive and Gaussian with zero mean and covariance matrix Γ_obs, the observation model has the form

y = F(u) + e, e ∼ N(0, Γ_obs), (2)

and it follows immediately that the likelihood function satisfies

L(y|u) ∝ exp( −η(y; u) ), (3)

where η(y; u) is the data-misfit functional defined by

η(y; u) = (1/2) ∥Γ_obs^{−1/2} ( y − F(u) )∥². (4)

Assumption 2.1. We assume that the forward model F : H → R^d satisfies:
(i) For all ε > 0, there exists a constant K(ε) > 0 such that, for all u ∈ H, ∥F(u)∥ ≤ K(ε) exp( ε ∥u∥²_H ).
(ii) For any u ∈ H, there exists a bounded linear operator J(u) : H → R^d, the Fréchet derivative of F at u, such that ∥F(u + h) − F(u) − J(u)h∥ / ∥h∥_H → 0 as ∥h∥_H → 0. In particular, this also implies the Lipschitz continuity of F.
Given observations y ∈ R^d and a forward model that satisfies Assumption 2.1, [37] shows that the resulting data-misfit functional is suitably bounded and locally Lipschitz, and thus the posterior measure is dominated by the prior measure. The second condition states that the forward model is first-order Fréchet differentiable, and hence the Gauss-Newton approximation of the Hessian of the data-misfit functional is bounded.
Suppose we have some quantity of interest (QoI) that is a functional of the parameter u, denoted by Q : H → R^q, e.g., a flow rate. Then, posterior-based model predictions can be formulated as expectations of that QoI over the posterior. We will denote them by

E_{µ_y}[Q] = ∫_H Q(u) µ_y(du).

MCMC methods construct a Markov chain of correlated random variables U^(1), ..., U^(N) for which the posterior is the invariant distribution. Then, one can estimate expected QoIs using Monte Carlo integration:

E_{µ_y}[Q] ≈ (1/N) ∑_{k=1}^N Q(U^(k)). (5)

Dimension-independent likelihood-informed MCMC on function space
The Metropolis-Hastings (MH) algorithm [21,30] provides a general framework for designing transition kernels that have the posterior as their invariant distribution, and thus for generating a Markov chain of random variables that targets the posterior.
Definition 2.2 (Metropolis-Hastings Kernel). Given the current state U^(k) = u*, a candidate state u′ can be drawn from a proposal distribution q(u*, ·). We define a pair of probability measures

ν(du, du′) = q(u, du′) µ_y(du) and ν^⊥(du, du′) = q(u′, du) µ_y(du′), (6)

and the acceptance probability

α(u*, u′) = min{ 1, (dν^⊥/dν)(u*, u′) }. (7)

Then, the next state of the Markov chain is set to U^(k+1) = u′ with probability α(u*, u′), and to U^(k+1) = u* otherwise.
MH algorithms require the absolute continuity condition ν^⊥ ≪ ν to define a valid transition kernel with non-zero acceptance probability as the dimension goes to infinity [40]. We will refer to an MH algorithm as well-defined or dimension-independent if this absolute continuity condition holds. For probability measures over function spaces in the setting considered here, the sequence of papers [4,8,19,20,37] provides a viable way to construct well-defined MH algorithms using a preconditioned Crank-Nicolson (pCN) discretisation of a particular Langevin SDE. The pCN proposal has the form

u′ = m_0 + a (u − m_0) − γ (1 − a) Γ_pr ∇_u η(u; y) + √(1 − a²) Γ_pr^{1/2} ξ, (8)

where ξ ∼ N(0, I) and γ ∈ {0, 1} is a tuning parameter to switch between the Langevin (γ = 1) and the Ornstein-Uhlenbeck proposal (γ = 0). It is required that a ∈ (−1, 1). The pCN proposal (8) satisfies the desired absolute continuity condition, and the acceptance probability does not go to zero as the discretisation of u is refined. In addition, [18,34] establish that, under mild conditions, the spectral gaps of the MCMC transition kernels defined by (generalised) pCN proposals exist and are independent of the dimension of the discretised parameters. Thus, the statistical efficiency of pCN proposals is also dimension-independent. The pCN proposal (8) scales uniformly in all directions with respect to the norm induced by the prior covariance. Since the posterior necessarily contracts the prior along parameter directions that are informed by the likelihood, the Markov chain produced by the standard pCN proposal decorrelates more quickly in the likelihood-informed parameter subspace than in the orthogonal complement, which is prior-dominated [10,27]. Thus, proposed moves of pCN can be effectively too small in prior-dominated directions, resulting in poor mixing.
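The pCN mechanism can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it works in whitened coordinates, uses the Ornstein-Uhlenbeck form of the proposal (γ = 0), and assumes a toy quadratic data-misfit `eta`; since this proposal is reversible with respect to the whitened prior N(0, I), the acceptance probability reduces to a ratio of data-misfit values.

```python
import numpy as np

def pcn_step(v, eta, a, rng):
    # Ornstein-Uhlenbeck pCN proposal (gamma = 0) in whitened coordinates:
    # v' = a*v + sqrt(1 - a^2) * xi, with xi ~ N(0, I).
    xi = rng.standard_normal(v.shape)
    v_prop = a * v + np.sqrt(1.0 - a ** 2) * xi
    # Prior-reversible proposal: acceptance reduces to the misfit ratio.
    log_alpha = eta(v) - eta(v_prop)
    if np.log(rng.uniform()) < log_alpha:
        return v_prop, True
    return v, False

# Toy quadratic data-misfit (a hypothetical stand-in for eta(v; y)).
rng = np.random.default_rng(0)
eta = lambda v: 0.5 * np.sum((v - 1.0) ** 2)
v = np.zeros(5)
accepts = 0
for _ in range(2000):
    v, acc = pcn_step(v, eta, a=0.8, rng=rng)
    accepts += acc
```

The step size a trades off move size against acceptance rate; a is typically tuned so that a reasonable fraction of proposals is accepted.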
The dimension-independent likelihood-informed (DILI) MCMC [10] provides a systematic way to design proposals that adapt to the anisotropic structure of the posterior while retaining dimension-independent performance. It considers operator-weighted proposals of the form

u′ = m_0 + Γ_pr^{1/2} A Γ_pr^{−1/2} (u − m_0) − Γ_pr^{1/2} G Γ_pr^{1/2} ∇_u η(u; y) + Γ_pr^{1/2} B ξ, (9)

where A, B, and G are bounded, self-adjoint operators on Im(Γ_pr^{1/2}) that satisfy certain properties to be discussed below. In this paper, we set G to zero throughout and thus consider only non-Langevin type proposals. By applying the whitening transform

v = Γ_pr^{−1/2} (u − m_0), (10)

and by denoting (in a slight abuse of notation) the associated data-misfit functional again by η(v; y) ← η(Γ_pr^{1/2} v + m_0; y), the proposal (9) simplifies to

v′ = A v − G ∇_v η(v; y) + B ξ, ξ ∼ N(0, I). (11)

The following theorem provides sufficient conditions for constructing the operators A and B such that the proposal (11) yields a well-defined MH algorithm, as well as a formula for the acceptance probability.
Theorem 2.3. Suppose that the posterior measure µ_y is equivalent to the prior measure µ_0 and that the self-adjoint operators A and B commute, that is, they can be defined by a common set of eigenfunctions {ψ_i ∈ Im(Γ_pr^{1/2})}_{i=1}^∞ with eigenvalues {a_i}_{i=1}^∞ and {b_i}_{i=1}^∞, respectively, satisfying a_i² + b_i² = 1 for all i. Then, the proposal (11) delivers a well-defined MCMC algorithm and the acceptance probability is given by

α(v, v′) = min{ 1, exp( η(v; y) − η(v′; y) ) }.

Proof. The above assumptions are simplified versions of those in Theorem 3.1 of [10]. The acceptance probability directly follows from Corollary 3.5 of [10].
The DILI proposal (11) enables different scalings in the proposal moves along different parameter directions. By choosing appropriate eigenfunctions {ψ_i}_{i=1}^∞ and eigenvalues {a_i}_{i=1}^∞ and {b_i}_{i=1}^∞, it can capture the geometry of the posterior, and thus can potentially improve the mixing of the resulting Markov chain. The likelihood-informed subspace (LIS) [11,12] provides a viable way to construct such operators A and B. It is spanned by the leading eigenfunctions of the eigenvalue problem

( ∫ H(v) µ*(dv) ) ψ_i = λ_i ψ_i,

where H(v) is some information metric of the likelihood function (with respect to the transformed parameter v), for example, the Hessian of the data-misfit functional or the Fisher information, and µ* is some reference measure, for example, the posterior or the Laplace approximation of the posterior. In the LIS, spanned by {ψ_1, ..., ψ_r}, the posterior may significantly differ from the prior. Thus, we prescribe inhomogeneous eigenvalues {a_i}_{i=1}^r and {b_i}_{i=1}^r to ensure that the proposal follows the possibly relatively tight geometry of the posterior. In the complement of the LIS, where the posterior does not differ significantly from the prior, we can use the original pCN proposal and set {a_i}_{i>r} and {b_i}_{i>r} to some constant values a_⊥ and b_⊥, respectively. Further details on the computation of the LIS basis and the choice of eigenvalues will be discussed in the multilevel context in later sections.
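In finite dimensions, operators A and B of this form can be assembled directly from an orthonormal LIS basis. The sketch below enforces the condition a_i² + b_i² = 1 of Theorem 2.3; the specific shrinkage a_i = (1 + λ_i)^{−1/2} inside the LIS is a heuristic assumption of this sketch, not the paper's prescription.

```python
import numpy as np

def dili_operators(Psi, lam, a_perp=0.9):
    # Psi: orthonormal LIS basis (columns); lam: associated eigenvalues
    # of the averaged information metric. Inside the LIS the step is
    # shrunk according to the (heuristic) posterior contraction; in the
    # complement we fall back to pCN with a constant a_perp.
    a_r = np.sqrt(1.0 / (1.0 + lam))     # heuristic scaling inside LIS
    b_r = np.sqrt(1.0 - a_r ** 2)        # enforces a_i^2 + b_i^2 = 1
    b_perp = np.sqrt(1.0 - a_perp ** 2)
    P = Psi @ Psi.T                      # projector onto the LIS
    I = np.eye(Psi.shape[0])
    A = Psi @ np.diag(a_r) @ Psi.T + a_perp * (I - P)
    B = Psi @ np.diag(b_r) @ Psi.T + b_perp * (I - P)
    return A, B

# Toy example: a random 3-dimensional LIS in an 8-dimensional space.
rng = np.random.default_rng(1)
Psi, _ = np.linalg.qr(rng.standard_normal((8, 3)))
lam = np.array([50.0, 10.0, 2.0])
A, B = dili_operators(Psi, lam)
```

Because A and B share the eigenbasis and a_i² + b_i² = 1 in every direction, the identity A² + B² = I holds, which is the finite-dimensional analogue of the prior-reversibility condition in Theorem 2.3.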

Posterior discretisation and bias-variance decomposition
When the forward model involves a partial/ordinary differential equation and the parameter is defined as a spatial/temporal stochastic process, it is necessary in practice to discretise the parameter and the forward model using appropriate numerical methods.
A common way to discretise the parameter is the Karhunen-Loève expansion, which also serves the purpose of the whitening transform. Given the prior mean m_0(x) and the prior covariance Γ_pr, we express the unknown parameter u as a linear combination of the first R eigenfunctions {ϕ_1, ..., ϕ_R} of the eigenvalue problem Γ_pr ϕ_j = ω_j ϕ_j, such that

u(x) ≈ m_0(x) + ∑_{j=1}^R √ω_j v_j ϕ_j(x).

The discretised prior p_R(v) associated with the random coefficients v = [v_1, ..., v_R]^⊤ is Gaussian with zero mean and covariance equal to the R × R identity matrix I_R. In this context, the selection of the truncation dimension R is typically based on the rate of decay of the eigenvalues ω_j, for example such that ∑_{j=1}^R ω_j / ∑_{j=1}^∞ ω_j ≥ τ with a threshold τ ∈ (0, 1) close to one. In this way, the truncated representation encapsulates a specific percentage of the total prior variance.
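The energy criterion for choosing the truncation dimension R can be written as a one-line search over the cumulative eigenvalue sum. A small sketch, using a hypothetical algebraically decaying spectrum in place of actual prior covariance eigenvalues:

```python
import numpy as np

def kl_truncation(omega, tau=0.99):
    # Smallest R such that sum_{j<=R} omega_j / sum_j omega_j >= tau,
    # for eigenvalues omega sorted in decreasing order.
    ratio = np.cumsum(omega) / np.sum(omega)
    return int(np.searchsorted(ratio, tau) + 1)

# Hypothetical spectrum omega_j = j^{-2}, truncated at 1000 modes.
omega = 1.0 / np.arange(1, 1001, dtype=float) ** 2
R = kl_truncation(omega, tau=0.99)
```

In practice the infinite sum in the criterion is replaced by a sum over all computed eigenvalues, so R is slightly optimistic when the neglected tail carries non-trivial mass.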
We discretise the forward model using a numerical method, such as finite elements or finite differences, with M degrees of freedom, which yields a discretised forward model F_{R,M} mapping the discretised coefficients v to the observables. In this way, the posterior measure (1) can be discretised, leading to the finite-dimensional density

π_{R,M}(v|y) ∝ exp( −η_{R,M}(v; y) ) p_R(v), where η_{R,M}(v; y) = (1/2) ∥Γ_obs^{−1/2} ( y − F_{R,M}(v) )∥²

is the discretised data-misfit function. Correspondingly, we also define the discretised QoI Q_{R,M}(v), which maps the discretised coefficient vector v to the discretised QoI.
The discretised parameters and forward models can be indexed by the discretisation level. We consider a hierarchy of L + 1 levels of discretised parameter spaces with dimensions R_0 ≤ R_1 ≤ ... ≤ R_L and a hierarchy of discretised forward models with M_0 ≤ M_1 ≤ ... ≤ M_L degrees of freedom. The discretised parameter, forward model and QoI on level ℓ are denoted by

v_ℓ ∈ R^{R_ℓ}, F_ℓ := F_{R_ℓ,M_ℓ} and Q_ℓ := Q_{R_ℓ,M_ℓ},

respectively. Thus, the discretised data-misfit function, prior and posterior on level ℓ are

η_ℓ(v_ℓ; y), p_ℓ(v_ℓ) and π_ℓ(v_ℓ|y) ∝ exp( −η_ℓ(v_ℓ; y) ) p_ℓ(v_ℓ),

respectively, with the associated posterior expectation E_{π_ℓ}[Q_ℓ].

Assumption 2.4.
(i) The bias of the posterior expectation on level ℓ can be bounded in terms of the number of degrees of freedom of the forward model such that

| E_{π_ℓ}[Q_ℓ] − E_{µ_y}[Q] | = O( M_ℓ^{−ϑ_b} )

for some constant ϑ_b > 0.
(ii) For the computational cost C_ℓ of carrying out one step of MCMC on level ℓ (including a forward model simulation), it is assumed that there exists a constant ϑ_c > 0 such that

C_ℓ = O( M_ℓ^{ϑ_c} ).

Implicitly, Condition (i) in Assumption 2.4 also assumes that R_ℓ is sufficiently large such that on level ℓ the bias due to the parameter approximation is dominated by the error due to the forward model approximation. This condition can be verified for certain classes of model problems. For instance, for finite element methods applied to elliptic PDEs (the model problem used in the numerical experiments of this work), the convergence analysis in [14, Section 4.2] shows that the discretisation error of the posterior expectation decays algebraically in M_ℓ; choosing R_ℓ to grow suitably with M_ℓ, the two error contributions are balanced. The constant ϑ_c in Condition (ii) of Assumption 2.4 depends on the underlying linear solver and/or numerical integrator, so that a theoretical upper bound on ϑ_c is often known.
Consider discretisation level L and let {V_L^{(j)}}_{j=1}^{N_MC} be a Markov chain produced by an MCMC algorithm converging in distribution to π_L. An estimate for the expectation E_{π_L}[Q_L] is then given by

Y_MC = (1/N_MC) ∑_{j=1}^{N_MC} Q_L(V_L^{(j)}). (18)

The focus of this work is the asymptotic performance of algorithms, and hence the initialisation bias of MCMC and the computational cost due to burn-in are not discussed. The mean-squared error (MSE) of the Monte Carlo estimator (18) allows a bias-variance decomposition of the form

MSE(Y_MC) = ( E_{π_L}[Q_L] − E_{µ_y}[Q] )² + Var_{π_L}(Q_L) / N_MC^eff, (19)

where N_MC^eff is the effective sample size of the Markov chain {V_L^{(j)}}_{j=1}^{N_MC}. This effective sample size is proportional to the total sample size, i.e., N_MC^eff = N_MC / τ_MC, where τ_MC ≥ 1 is the integrated autocorrelation time (IACT) of the Markov chain. The work of [18] shows that for pCN-type algorithms the IACT τ_MC is dimension-independent given a local Lipschitz assumption, which is often satisfied for inverse problems governed by elliptic PDEs.
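The IACT that enters the effective sample size can be estimated from the chain itself. The sketch below uses the standard estimator τ = 1 + 2 ∑_k ρ(k), truncating the sum at the first non-positive empirical autocorrelation; that truncation rule is a common heuristic, not the estimator prescribed in the paper.

```python
import numpy as np

def iact(x, max_lag=None):
    # Integrated autocorrelation time of a scalar chain x:
    # tau = 1 + 2 * sum_k rho(k), truncated at the first rho(k) <= 0.
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, max_lag or n // 2):
        rho = np.dot(x[:-k], x[k:]) / (n * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau

# Synthetic AR(1) chain with coefficient a: tau = (1 + a)/(1 - a) = 9.
rng = np.random.default_rng(2)
a = 0.8
x = np.empty(100_000)
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = a * x[t - 1] + rng.standard_normal()
tau_hat = iact(x, max_lag=1000)
```

The effective sample size is then N_eff = N / tau_hat, which is the quantity appearing in the variance term of the bias-variance decomposition (19).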
Choosing N_MC such that the two terms in (19) are balanced and using Assumption 2.4, the total computational cost to achieve MSE(Y_MC) < ε² is

Cost(Y_MC) = O( τ_MC ε^{−2 − ϑ_c/ϑ_b} ). (20)

Thus, one of the key aims in accelerating MCMC sampling is to reduce τ_MC, which can be achieved, e.g., via DILI MCMC proposals. In addition, the multilevel method will allow us to improve the asymptotic rate of growth of the cost of the standard MCMC estimator in (20) with respect to ε, as well as to further reduce τ_MC on the higher levels. Multilevel MCMC achieves this by using coarse-level samples as proposals on the higher levels and by dealing with the high numerical correlation between subsequent MCMC samples produced by standard proposal mechanisms on the coarsest level (level zero). Thus, most samples are drawn on the computationally least costly level zero, and most of the work for removing the initialisation bias is also shifted to level zero, all contributing to the practical advantages of multilevel algorithms compared to their single-level counterparts.

Multilevel MCMC
By exploiting the hierarchy of posteriors, the rate of growth of the computational cost in (20) can be reduced significantly using the multilevel idea in [14]. We expand the posterior expectation in the telescoping sum

E_{π_L}[Q_L] = E_{π_0}[Q_0] + ∑_{ℓ=1}^L ( E_{π_ℓ}[Q_ℓ] − E_{π_{ℓ-1}}[Q_{ℓ-1}] ). (21)

For level zero, the sample set {V_0^{(0,j)}}_{j=1}^{N_0} is assumed to be drawn via some MCMC method that converges to π_0(·|y), and the first term in the telescoping sum (21) is estimated via

Y_0 = (1/N_0) ∑_{j=1}^{N_0} Q_0(V_0^{(0,j)}).
Since the two expectations in each difference E_{π_ℓ}[Q_ℓ] − E_{π_{ℓ-1}}[Q_{ℓ-1}] are taken with respect to two different posteriors, we introduce a joint density π_{ℓ,ℓ-1}(v_ℓ, v_{ℓ-1}|y) such that

∫ π_{ℓ,ℓ-1}(v_ℓ, v_{ℓ-1}|y) dv_{ℓ-1} = π_ℓ(v_ℓ|y) and ∫ π_{ℓ,ℓ-1}(v_ℓ, v_{ℓ-1}|y) dv_ℓ = π_{ℓ-1}(v_{ℓ-1}|y),

that is, the posteriors π_ℓ(v_ℓ|y) and π_{ℓ-1}(v_{ℓ-1}|y) are the two marginals. Then, the difference between the expectations can be expressed as a single expectation of the difference

D_ℓ(v_ℓ, v_{ℓ-1}) := Q_ℓ(v_ℓ) − Q_{ℓ-1}(v_{ℓ-1})

over the joint density,

E_{π_ℓ}[Q_ℓ] − E_{π_{ℓ-1}}[Q_{ℓ-1}] = E_{π_{ℓ,ℓ-1}}[D_ℓ].

The construction of the joint density and the associated sampling procedure will be critical to reduce the computational complexity. Suppose the samples {(V_ℓ^{(ℓ,j)}, V_{ℓ-1}^{(ℓ,j)})}_{j=1}^{N_ℓ} are drawn from a Markov chain that converges in distribution to the joint density π_{ℓ,ℓ-1}(v_ℓ, v_{ℓ-1}|y).
Then, the remaining terms in (21), for ℓ = 1, ..., L, are estimated by

Y_ℓ = (1/N_ℓ) ∑_{j=1}^{N_ℓ} [ Q_ℓ(V_ℓ^{(ℓ,j)}) − Q_{ℓ-1}(V_{ℓ-1}^{(ℓ,j)}) ],

and the multilevel MCMC estimator for E_{π_L}[Q_L] is

Y_ML = ∑_{ℓ=0}^L Y_ℓ.

The mean square error of this estimator can again be decomposed as follows:

MSE(Y_ML) = ( E_{π_L}[Q_L] − E_{µ_y}[Q] )² + Var(Y_ML).
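The assembly of the multilevel estimator from the telescoping sum (21) is mechanically simple once the per-level sample sets exist. A minimal sketch with synthetic per-level data (the sample values are hypothetical placeholders for QoI evaluations and coupled differences):

```python
import numpy as np

def ml_estimator(samples_per_level):
    # samples_per_level[0]: Q_0 evaluated at level-0 posterior samples;
    # samples_per_level[l] for l >= 1: coupled differences
    # D_l^(j) = Q_l(v_l^(j)) - Q_{l-1}(v_{l-1}^(j)).
    # The multilevel estimate is the sum of the per-level sample means.
    return sum(np.mean(d) for d in samples_per_level)

# Synthetic check: per-level means 1.0, 0.1, 0.01 telescope to 1.11.
levels = [np.full(1000, m) for m in (1.0, 0.1, 0.01)]
y_ml = ml_estimator(levels)
```

The point of the telescoping structure is visible here: the coarse level carries the bulk of the value with cheap samples, while each correction term has a much smaller magnitude and therefore needs far fewer (expensive) samples for the same accuracy.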

Variance management
For optimal efficiency, we now choose the numbers of samples N_ℓ, ℓ = 0, ..., L, so as to minimise Var(Y_ML) for a fixed computational effort. This involves the within-level variances Var(Y_ℓ) and the cross-level covariances Cov(Y_ℓ, Y_k) for k ≠ ℓ. We provide justifications for managing these variances using the following assumptions.

Remark 3.1. Suppose the effective sample sizes are proportional to the total sample sizes, i.e., N_ℓ^eff = N_ℓ / τ_ℓ for all ℓ, where τ_ℓ ≥ 1 is the IACT of the Markov chain of the differences D_ℓ. Then, the within-level variance has the form

Var(Y_ℓ) = Var_{∆ℓ,ℓ-1}(D_ℓ) / N_ℓ^eff = τ_ℓ Var_{∆ℓ,ℓ-1}(D_ℓ) / N_ℓ, (26)

where we set Var_{∆0,−1}(D_0) = Var_{π_0}(Q_0), and we have, by the Cauchy-Schwarz inequality,

Var_{∆ℓ,ℓ-1}(D_ℓ) = Var_{πℓ}(Q_ℓ) + Var_{πℓ-1}(Q_{ℓ-1}) − 2 Cov( Q_ℓ(V_ℓ), Q_{ℓ-1}(V_{ℓ-1}) ) ≤ ( √Var_{πℓ}(Q_ℓ) + √Var_{πℓ-1}(Q_{ℓ-1}) )².

Thus, to reduce Var(Y_ℓ), the joint density should be constructed in such a way that Cov( Q_ℓ(V_ℓ), Q_{ℓ-1}(V_{ℓ-1}) ) is positive and (if possible) maximised. In addition, the MCMC simulation should be made statistically efficient in the sense that τ_ℓ is as close to one as possible.

Assumption 3.2. There exists a constant ϑ_v > 0 such that Var_{∆ℓ,ℓ-1}(D_ℓ) = O( M_ℓ^{−ϑ_v} ).
Proposition 3.3. Suppose that there exists an r < 1 such that

∑_{ℓ=0}^L ∑_{k≠ℓ} | Cov(Y_ℓ, Y_k) | ≤ r ∑_{ℓ=0}^L Var(Y_ℓ), (28)

i.e., the cross-level covariance is insignificant compared to the within-level variance. Then

Var(Y_ML) ≤ (1 + r) ∑_{ℓ=0}^L Var(Y_ℓ).

Proof. Expanding the variance of the sum and applying (28), we have the bound

Var(Y_ML) = ∑_{ℓ=0}^L Var(Y_ℓ) + ∑_{ℓ=0}^L ∑_{k≠ℓ} Cov(Y_ℓ, Y_k) ≤ (1 + r) ∑_{ℓ=0}^L Var(Y_ℓ).
Using Proposition 3.3 and (26), the variance of the multilevel estimator satisfies

Var(Y_ML) ≤ (1 + r) ∑_{ℓ=0}^L τ_ℓ Var_{∆ℓ,ℓ-1}(D_ℓ) / N_ℓ.

The total computational cost is

Cost(Y_ML) = ∑_{ℓ=0}^L N_ℓ C_ℓ.

This way, for a fixed variance, the computational cost is minimised by choosing the sample sizes

N_ℓ ∝ √( τ_ℓ Var_{∆ℓ,ℓ-1}(D_ℓ) / C_ℓ ), (30)

which leads to a total computational cost that satisfies

Cost(Y_ML) ∝ ( ∑_{ℓ=0}^L √( τ_ℓ Var_{∆ℓ,ℓ-1}(D_ℓ) C_ℓ ) )².

Theorem 3.4. Suppose Assumptions 2.4 and 3.2 and condition (28) in Proposition 3.3 hold. For the multilevel MCMC estimator to satisfy MSE(Y_ML) < ε², the multilevel MCMC with N_ℓ chosen as in (30) requires an overall computational cost

Cost(Y_ML) = O( ε^{ −2 − max(0, (ϑ_c − ϑ_v)/ϑ_b) } ).

Proof. Follows directly from the multilevel Monte Carlo complexity theorems in [7,15].
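The optimal allocation (30) can be implemented directly, given per-level estimates of the IACTs, variances, and costs. The sketch below scales the sample sizes so that the variance term consumes half of the ε² MSE budget; that fifty-fifty split between bias and variance is an assumption of the sketch, as are the synthetic decay rates.

```python
import numpy as np

def optimal_sample_sizes(var_delta, tau, cost, eps):
    # Lagrange-multiplier solution: N_l proportional to
    # sqrt(tau_l * V_l / C_l), scaled so that the estimator variance
    # sum_l tau_l * V_l / N_l is at most eps^2 / 2.
    var_delta = np.asarray(var_delta, dtype=float)
    tau = np.asarray(tau, dtype=float)
    cost = np.asarray(cost, dtype=float)
    work = np.sqrt(tau * var_delta * cost)
    n = np.sqrt(tau * var_delta / cost) * work.sum() / (0.5 * eps ** 2)
    return np.ceil(n).astype(int)

# Hypothetical rates: variances decay by 8x, costs grow by 4x per level.
V = 8.0 ** -np.arange(4)
C = 4.0 ** np.arange(4)
tau = np.full(4, 2.0)
N = optimal_sample_sizes(V, tau, C, eps=0.01)
achieved_var = np.sum(tau * V / N)
```

Because the variance decays faster than the cost grows in this example, the sample sizes decrease sharply across levels, so almost all the work is done on the coarsest, cheapest level.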
It is difficult to rigorously verify condition (28) in Proposition 3.3, but in practice the cross-level covariances Cov(Y_ℓ, Y_k) are often observed to decay rapidly to zero, since the Markov chains used for computing Y_ℓ and Y_k with ℓ ≠ k are statistically independent. For example, in [25] independent Markov chains are constructed, and in [14] a subsampling strategy for the coarser chains is employed to ensure independence. Nevertheless, the bound on the computational complexity of multilevel MCMC obtained under assumption (28) improves on that presented in [14], which has an extra |log ε| factor. For any positive values of ϑ_b, ϑ_v, ϑ_c, the multilevel MCMC approach asymptotically requires less computational effort than single-level MCMC. To choose optimal numbers of samples on the various levels, estimates of the IACTs τ_ℓ, the variances Var_{∆ℓ,ℓ-1}(D_ℓ), and the computational costs C_ℓ are needed. Such quantities may not be known a priori, but they can all be obtained and adaptively improved on the fly as the simulation progresses.

Notations
To map vectors and matrices across adjacent levels of discretisation, we define the following notation. Given the canonical basis (ê_1, ê_2, ..., ê_{R_ℓ}) of the parameter space at level ℓ, where ê_j ∈ R^{R_ℓ}, we define the basis matrices Θ_{ℓ,c} ≡ (ê_1, ê_2, ..., ê_{R_{ℓ-1}}) and Θ_{ℓ,f} ≡ (ê_{R_{ℓ-1}+1}, ..., ê_{R_ℓ}). Here the subscripts c and f denote the coefficients that are 'active' on the coarse level and the coefficients that are 'active' only on the fine level, respectively. We can split the parameter v_ℓ into the two components

v_{ℓ,c} = Θ_{ℓ,c}^⊤ v_ℓ and v_{ℓ,f} = Θ_{ℓ,f}^⊤ v_ℓ,

which correspond to the coefficients on the previous level ℓ − 1 and the additional coefficients, respectively. Given a matrix A_ℓ ∈ R^{R_ℓ × R_ℓ}, we partition the matrix as

A_ℓ = [ A_{ℓ,cc}  A_{ℓ,cf} ; A_{ℓ,fc}  A_{ℓ,ff} ],

where A_{ℓ,cc} ≡ Θ_{ℓ,c}^⊤ A_ℓ Θ_{ℓ,c}, and A_{ℓ,ff}, A_{ℓ,fc} and A_{ℓ,cf} are defined analogously. The matrices Θ_{ℓ,c} and Θ_{ℓ,f} are never constructed explicitly; operations with them only involve the selection of the corresponding rows or columns of a matrix or vector.
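As noted, multiplication by Θ_{ℓ,c}^⊤ or Θ_{ℓ,f}^⊤ is pure index selection. A minimal sketch of the two helpers implied by this notation:

```python
import numpy as np

def split_levels(v, r_coarse):
    # Action of Theta_{l,c}^T and Theta_{l,f}^T on a vector: select the
    # first R_{l-1} entries and the remaining ones, respectively.
    return v[:r_coarse], v[r_coarse:]

def partition_matrix(A, r_coarse):
    # The four blocks (A_cc, A_cf, A_fc, A_ff) of the level-l partition.
    c, f = slice(None, r_coarse), slice(r_coarse, None)
    return A[c, c], A[c, f], A[f, c], A[f, f]

v = np.arange(5.0)                   # R_l = 5, R_{l-1} = 3
vc, vf = split_levels(v, 3)
A = np.arange(25.0).reshape(5, 5)
Acc, Acf, Afc, Aff = partition_matrix(A, 3)
```

NumPy slicing returns views, so, as in the text, no basis matrices are ever formed and no data is copied.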

Multilevel LIS
We aim to employ the DILI method (cf. Section 2.2) as the proposal mechanism for multilevel MCMC. Since the computation of the LIS basis used by the DILI proposal can be costly, here we develop a Rayleigh-Ritz procedure to recursively compute new, multilevel likelihood-informed subspaces using the model hierarchy. The resulting hierarchical LIS basis can be used to generalise DILI proposals to the multilevel setting and to improve the efficiency of multilevel MCMC sampling. In Section 4.1, we define the concept of the LIS in the multilevel context. In Sections 4.2 and 4.3, we present the recursive construction of the multilevel LIS using the Rayleigh-Ritz procedure.

Setup
For each level ℓ ∈ {0, 1, ..., L}, we denote the linearisation of the forward model F_ℓ at v_ℓ by

J_ℓ(v_ℓ) := ∇_{v_ℓ} F_ℓ(v_ℓ) ∈ R^{d × R_ℓ}.

This yields the Gauss-Newton approximation of the Hessian of the data-misfit functional at v_ℓ (hereafter referred to as the Gauss-Newton Hessian) in the form of

H_ℓ(v_ℓ) = J_ℓ(v_ℓ)^⊤ Γ_obs^{−1} J_ℓ(v_ℓ). (35)

The Gauss-Newton Hessian in (35) corresponds to the Fisher information matrix of the likelihood with additive Gaussian noise. It is commonly used in statistics to measure the local sensitivity of the parameter-to-likelihood map. The leading eigenvectors of H_ℓ(v_ℓ) (corresponding to the largest eigenvalues) indicate parameter directions along which the likelihood function varies rapidly.
However, to extract the global sensitivity of the parameter-to-likelihood map from the local sensitivity information contained in the Gauss-Newton Hessian, it is necessary to compute the expectation of H_ℓ(v_ℓ) with respect to some reference distribution p*_ℓ(v_ℓ), i.e.,

∫ H_ℓ(v_ℓ) p*_ℓ(v_ℓ) dv_ℓ. (36)

Finally, this is approximated using the sample average with K_ℓ random samples {v_ℓ^{(k)}}_{k=1}^{K_ℓ} drawn from the reference distribution, which yields

H_ℓ := (1/K_ℓ) ∑_{k=1}^{K_ℓ} H_ℓ(v_ℓ^{(k)}). (37)

Note that the matrix H_ℓ is symmetric and positive semidefinite. Different choices of the reference distribution, such as the prior or the posterior, lead to different ways to construct the LIS and different performance characteristics.

Remark 4.1. Following the discussion in [12,41], using the posterior as the reference leads to sharp approximation properties [13,41] compared to other choices. However, posterior exploration relies on MCMC sampling, and thus this choice requires adaptively estimating the LIS during the MCMC sampling. The Laplace approximation to the posterior provides a reasonable alternative in a wide range of problems where the posterior is unimodal. We use the Laplace approximation as the reference distribution in this work.
The choice of the reference distribution can have an impact on the quality of the LIS basis and on the IACT of the Markov chains produced by DILI MCMC, but it does not affect the convergence of MCMC, as DILI samples the full parameter space and only uses the LIS to reduce the IACT and thus to accelerate posterior sampling.
It is often computationally infeasible to form the Gauss-Newton Hessian matrix (35) explicitly. However, all we need are matrix-vector products with the Gauss-Newton Hessian. These require only applications of the linearised forward model J_ℓ(v_ℓ) and its adjoint J_ℓ(v_ℓ)^⊤, which are well-established operations in the PDE-constrained optimisation literature. We refer the reader to recent applications in Bayesian inverse problems for further details, e.g., [5,29,31].
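The matrix-free structure is easy to express in code. In the sketch below, `jvp` and `vjp` are hypothetical callables standing in for the solver-provided application of J_ℓ and its adjoint; the Hessian itself is never formed.

```python
import numpy as np

def gn_hessian_matvec(jvp, vjp, Gamma_obs_inv, w):
    # Matrix-free product H w = J^T Gamma_obs^{-1} (J w), cf. (35).
    # jvp(w) applies the linearised forward model J; vjp(r) its adjoint.
    return vjp(Gamma_obs_inv @ jvp(w))

# Toy check against the explicit matrix for a linear forward model J.
rng = np.random.default_rng(4)
J = rng.standard_normal((3, 6))
Gi = np.eye(3)
w = rng.standard_normal(6)
hw = gn_hessian_matvec(lambda x: J @ x, lambda r: J.T @ r, Gi, w)
```

With such a matvec, the eigenproblems in the following sections can be handed to an iterative eigensolver (e.g. Lanczos-type methods), so the cost is a modest number of forward/adjoint solves rather than R_ℓ of them.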

Base level LIS
At the base level, we use the samples {v_0^{(k)}}_{k=1}^{K_0} drawn from the reference p*_0(·) to construct the sample-averaged Gauss-Newton Hessian H_0. Then, we use the Rayleigh quotient ⟨ϕ, H_0 ϕ⟩ / ⟨ϕ, ϕ⟩ to measure the (quadratic) change in the parameter-to-likelihood map along a parameter direction ϕ. Hence, the LIS can be identified via a sequence of optimisation problems of the form

ψ_{0,k} = arg max_{ϕ ⊥ span{ψ_{0,1}, ..., ψ_{0,k-1}}} ⟨ϕ, H_0 ϕ⟩ / ⟨ϕ, ϕ⟩, k = 2, 3, ..., (38)

where ψ_{0,1} is the solution of the unconstrained optimisation problem. The sequence of optimisation problems in (38) is equivalent to finding the leading eigenvectors of H_0.
Definition 4.2 (Base level LIS). Given the sample-averaged Gauss-Newton Hessian H_0 on level 0 and a threshold ϱ > 0, we solve the eigenproblem

H_0 ψ_{0,i} = λ_{0,i} ψ_{0,i}, (39)

and then use the r_0 leading eigenvectors with eigenvalues λ_{0,i} > ϱ, for i = 1, ..., r_0, to define the LIS basis Ψ_{0,r_0} = [ψ_{0,1}, ψ_{0,2}, ..., ψ_{0,r_0}], which spans an r_0-dimensional subspace of R^{R_0}.

The eigenvalues in (39) provide empirical sensitivity measures of the likelihood function relative to the prior (which here is i.i.d. Gaussian) along the corresponding eigenvectors [11,41]. Eigenvectors corresponding to eigenvalues less than 1 can be interpreted as parameter directions where the likelihood is dominated by the prior. Thus, we typically choose a value less than one for the truncation threshold, i.e., ϱ < 1.
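In a dense finite-dimensional setting, Definition 4.2 reduces to a thresholded symmetric eigendecomposition. A minimal sketch, using a synthetic sample-averaged Hessian with two dominant directions (in large-scale practice one would use a matrix-free iterative eigensolver instead):

```python
import numpy as np

def base_lis(H0, rho=0.1):
    # Base level LIS: eigenvectors of the sample-averaged Gauss-Newton
    # Hessian H_0 whose eigenvalues exceed the threshold rho.
    lam, psi = np.linalg.eigh(H0)           # eigh returns ascending order
    order = np.argsort(lam)[::-1]           # reorder to decreasing
    lam, psi = lam[order], psi[:, order]
    r0 = int(np.sum(lam > rho))
    return psi[:, :r0], lam[:r0]

# Synthetic H_0 with spectrum {30, 4, 0.05, 0.02, 0.01, 0}.
rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
H0 = Q @ np.diag([30.0, 4.0, 0.05, 0.02, 0.01, 0.0]) @ Q.T
Psi0, lam0 = base_lis(H0, rho=0.1)
```

With ϱ = 0.1 < 1, only the two directions in which the likelihood clearly dominates the prior survive the truncation; the remaining directions are left to the pCN part of the proposal.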

LIS enrichment
Because the computational cost of a matrix-vector product with the Gauss-Newton Hessian scales at least linearly with the degrees of freedom M_ℓ of the forward model on level ℓ, constructing the LIS can be computationally costly. We present a new approach to accelerate the LIS construction by employing a recursive LIS enrichment using the hierarchy of forward models and parameter discretisations. The resulting hierarchy of LISs will be used to reduce the computational complexity of constructing and operating with the resulting DILI proposals.
Rather than recomputing the entire basis from scratch on each level, we reuse the LIS bases computed on the coarser levels by 'lifting' them and then recursively enriching them at each new level using a Rayleigh-Ritz procedure. Ideally, the subspace added on each level will have decreasing dimension, since the model and parameter approximations are assumed to converge as ℓ → ∞ and thus the finer levels eventually provide little additional information for the parameter inference.

Definition 4.3 (Lifted LIS basis). Suppose we have an orthogonal LIS basis Ψ_{ℓ-1,r} ∈ R^{R_{ℓ-1} × r_{ℓ-1}} on level ℓ − 1. We lift Ψ_{ℓ-1,r} from the coarse parameter space R^{R_{ℓ-1}} to the fine parameter space R^{R_ℓ} using the basis matrix Θ_{ℓ,c} defined in Section 3.2. The lifted LIS basis vectors are collected in the matrix

Ψ_{ℓ,c} = Θ_{ℓ,c} Ψ_{ℓ-1,r}. (40)

Proposition 4.4. The lifted LIS basis matrix Ψ_{ℓ,c} has orthonormal columns that span an r_{ℓ-1}-dimensional subspace of R^{R_ℓ}, i.e., Ψ_{ℓ,c}^⊤ Ψ_{ℓ,c} = I_{r_{ℓ-1}}.
Proof. The proof directly follows as the matrix Θ_{ℓ,c} has orthonormal columns.
Analogously to (38), the enrichment directions on level ℓ are sought in the orthogonal complement of the lifted LIS. Let Π_{ℓ,c} = Ψ_{ℓ,c} Ψ_{ℓ,c}^⊤ denote the orthogonal projector onto the span of the lifted basis and let H_ℓ be the sample-averaged Gauss-Newton Hessian on level ℓ. The new directions solve the constrained optimisation problems

ψ_{ℓ,k} = arg max_{ϕ : Π_{ℓ,c} ϕ = 0, ϕ ⊥ span{ψ_{ℓ,1}, ..., ψ_{ℓ,k-1}}} ⟨ϕ, H_ℓ ϕ⟩ / ⟨ϕ, ϕ⟩, k = 1, 2, .... (41)

Theorem 4.5. The optimisation problem (41) is equivalent to finding the leading eigenvectors of the projected eigenproblem

( I − Π_{ℓ,c} ) H_ℓ ( I − Π_{ℓ,c} ) ψ = γ ψ. (42)

Proof. This result follows from the properties of orthogonal projectors and of the stationary points of the Rayleigh quotient. Here, we sketch the proof as follows. The constraint Π_{ℓ,c} ϕ = 0 implies ϕ = (I − Π_{ℓ,c}) ϕ, and I − Π_{ℓ,c} is also an orthogonal projector. Hence, the optimisation problem becomes

ψ_{ℓ,k} = arg max_{ϕ ⊥ span{ψ_{ℓ,1}, ..., ψ_{ℓ,k-1}}} ⟨ (I − Π_{ℓ,c}) ϕ, H_ℓ (I − Π_{ℓ,c}) ϕ ⟩ / ⟨ϕ, ϕ⟩.

The solutions (for k = 1, 2, ...) to these optimisation problems are given by the leading eigenvectors of the eigenproblem (42).

Definition 4.6 (LIS enrichment on level ℓ). The leading s_ℓ (normalised) eigenvectors of the eigenproblem (42) with eigenvalues γ_{ℓ,i} > ϱ are denoted by

Ψ_{ℓ,f} = [ψ_{ℓ,1}, ..., ψ_{ℓ,s_ℓ}] ∈ R^{R_ℓ × s_ℓ}. (43)

They are added to the lifted LIS basis from level ℓ − 1 to form the enriched LIS basis on level ℓ,

Ψ_{ℓ,r} = [ Ψ_{ℓ,c}, Ψ_{ℓ,f} ],

where the basis vectors in (43) denote the auxiliary "fine scale" directions added on level ℓ. By construction, all the LIS basis vectors at level ℓ are mutually orthogonal, that is, Ψ_{ℓ,r}^⊤ Ψ_{ℓ,r} = I_{r_ℓ}, and we have r_ℓ = r_{ℓ-1} + s_ℓ. Moreover, the LIS basis Ψ_{ℓ,r} is block upper triangular and can be recursively defined as

Ψ_{ℓ,r} = [ Ψ_{ℓ-1,r}  Θ_{ℓ,c}^⊤ Ψ_{ℓ,f} ; 0  Θ_{ℓ,f}^⊤ Ψ_{ℓ,f} ], (44)

where Θ_{ℓ,f}^⊤ Ψ_{ℓ,f} ∈ R^{(R_ℓ − R_{ℓ-1}) × s_ℓ} and Ψ_{ℓ-1,r} ∈ R^{R_{ℓ-1} × r_{ℓ-1}}. We have s_ℓ = r_ℓ − r_{ℓ-1} and define s_0 = r_0 for consistency.

The hierarchical LIS reduces the computational cost of operating with the LIS basis and the associated storage cost. This is critical for building efficient multilevel DILI proposals that will be discussed later. In addition, the recursive LIS enrichment is computationally more efficient, since the number of costly PDE solves on the finer levels is significantly reduced. In Appendix A, we develop heuristics to quantify the reduction factors of the hierarchical construction of the LIS basis in terms of storage and the number of matrix-vector products.
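The enrichment step of Theorem 4.5 can be sketched with a dense projected eigensolve. The example is purely illustrative: the lifted basis, the level-ℓ Hessian, and the threshold are all synthetic, and a large-scale implementation would use a matrix-free iterative eigensolver rather than forming the projected matrix.

```python
import numpy as np

def enrich_lis(Psi_c, H, rho=0.1):
    # Solve the projected eigenproblem (I - Pi) H (I - Pi) psi = gamma psi,
    # with Pi the projector onto the lifted coarse basis Psi_c, and append
    # the eigenvectors whose eigenvalues exceed rho (Definition 4.6).
    n = H.shape[0]
    proj = np.eye(n) - Psi_c @ Psi_c.T
    gam, phi = np.linalg.eigh(proj @ H @ proj)
    order = np.argsort(gam)[::-1]
    gam, phi = gam[order], phi[:, order]
    s = int(np.sum(gam > rho))
    return np.hstack([Psi_c, phi[:, :s]]), gam[:s]

# Coarse basis with r = 1 on a level with R_{l-1} = 3, lifted to R_l = 5
# by zero-padding (the action of Theta_{l,c}).
Psi_coarse = np.array([[1.0], [0.0], [0.0]])
Psi_c = np.vstack([Psi_coarse, np.zeros((2, 1))])
rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))
H = A @ A.T + np.diag([20.0, 0.0, 0.0, 0.0, 10.0])
Psi, gam = enrich_lis(Psi_c, H, rho=1.0)
```

Eigenvectors with non-zero eigenvalues of the projected matrix automatically lie in the complement of the lifted basis, so the enriched basis keeps orthonormal columns without any explicit re-orthogonalisation.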

Coupled DILI proposal
Let v*_{ℓ−1} and v*_ℓ be the j-th states of the Markov chains on levels ℓ−1 and ℓ, respectively. The state at level ℓ has the form v*_ℓ = (v*_{ℓ,c}, v*_{ℓ,f}), corresponding to the coarse part of the parameters (shared with level ℓ−1) and the refined part, respectively. The two Markov chains are called coupled at the j-th state if v*_{ℓ,c} = v*_{ℓ−1}. Assuming the two chains are coupled at the j-th state, we first present the general form of the multilevel MCMC for generating the next pair of coupled states, and then design the hierarchical DILI proposal within this general framework.
Following [14], we assume that we can generate independent posterior samples on level ℓ−1. In practice, this is achieved (approximately) by sub-sampling a Markov chain that targets the level-(ℓ−1) posterior, with a sub-sampling rate that depends on the sample autocorrelation [14, Sect. 3]. In other words, coupled posterior samples from π(v_{ℓ−1}|y) and π(v_ℓ|y) are generated by using the posterior π(v_{ℓ−1}|y) on level ℓ−1 as the proposal distribution for the Markov chain on level ℓ, thus reducing the within-level variance Var_{∆_{ℓ,ℓ−1}}(D_ℓ).
The proposed candidate v′_{ℓ−1} ∼ π_{ℓ−1}(·|y) is assumed to be independent of the current state v*_{ℓ−1}. To sample from the refined posterior π_ℓ(v_ℓ|y), we then consider the factorised proposal (46), where the coarse part v′_{ℓ,c} of the proposal is set to the (independent) proposal v′_{ℓ−1} from level ℓ−1. The proposal candidate v′_ℓ conditioned on v*_ℓ can then be expressed as in (47) (copy from the level-(ℓ−1) proposal). Based on the factorised proposal (46), the acceptance probability for the chain targeting the level-ℓ posterior π_ℓ(v_ℓ|y) is of the form (48). Figure 1 shows a schematic of the coupling strategy. The double arrows represent the coupling of the two MCMC states, as well as the coupling of the two proposal candidates across levels. The dashed arrows represent the proposal and acceptance/rejection steps. The top half represents the Markov chain on level ℓ−1 and the bottom half the Markov chain on level ℓ. Since all proposal candidates are coupled, every state that follows the acceptance of a proposal candidate on level ℓ is also coupled with the corresponding state on level ℓ−1.
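The sub-sampling step used to obtain (approximately) independent coarse proposals can be illustrated with a crude integrated autocorrelation time (IACT) estimate; this is a hedged sketch, not the paper's exact estimator, and the truncation rule below is one common heuristic:

```python
import numpy as np

def iact(x, max_lag=None):
    """Crude integrated autocorrelation time estimate for a scalar chain,
    truncating the sum at the first non-positive empirical autocovariance."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 3
    c0 = np.dot(x, x) / n
    tau = 1.0
    for lag in range(1, max_lag):
        c = np.dot(x[:-lag], x[lag:]) / n
        if c <= 0:                      # stop once noise dominates
            break
        tau += 2.0 * c / c0
    return tau

def subsample(chain, tau):
    """Thin the coarse chain by roughly one IACT, so that the retained
    states are approximately independent draws from the coarse posterior."""
    stride = max(1, int(np.ceil(tau)))
    return chain[::stride]
```

For a well-mixing chain the estimated IACT is close to one and almost no thinning is needed, which is one reason the fast-mixing DILI chains below make the coupling cheap.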

DILI proposal
Next, we design the DILI proposal using the hierarchical LIS introduced in Section 4. Recall the discretised DILI proposal (11)
as it was introduced in [10]. Suppose we have a LIS basis Ψ_{ℓ,r} ∈ R^{R_ℓ×r_ℓ}. By treating the likelihood-informed parameter directions and the prior-dominated directions separately, we can construct the matrices A_ℓ and B_ℓ as in (51), where A_{ℓ,r}, B_{ℓ,r} ∈ R^{r_ℓ×r_ℓ}, a_⊥, b_⊥ ∈ R, and Π_ℓ = Ψ_{ℓ,r}Ψ_{ℓ,r}^⊤ is a rank-r_ℓ orthogonal projector.

Corollary 5.1. In the proposal (50), suppose that A_{ℓ,r}, B_{ℓ,r} ∈ R^{r_ℓ×r_ℓ} are non-singular matrices satisfying A_{ℓ,r}^2 + B_{ℓ,r}^2 = I_{ℓ,r}, and that a_⊥ and b_⊥ are scalars satisfying a_⊥^2 + b_⊥^2 = 1. Then the corresponding proposal distribution q(v′_ℓ|v*_ℓ) satisfies the conditions of Theorem 2.3 and has the prior as its invariant measure; i.e., this proposal has acceptance probability one if it is used to sample the prior. The corresponding acceptance probability for sampling from π_ℓ(v_ℓ|y) takes the general form given in Theorem 2.3.

Proof. Given A_{ℓ,r}^2 + B_{ℓ,r}^2 = I_{ℓ,r}, the symmetric matrices A_{ℓ,r} and B_{ℓ,r} can be simultaneously diagonalised under some orthogonal transformation. Thus, the operators A_ℓ and B_ℓ can also be simultaneously diagonalised, where the eigenspectrum of A_ℓ consists of the eigenvalues of A_{ℓ,r} and a_⊥, and the same applies to B_ℓ. This way, it is easy to check that the proposal distribution q(v′_ℓ|v*_ℓ) has the prior as invariant measure and that the conditions of Theorem 2.3 are satisfied.
The form of the acceptance probability for sampling from π_ℓ(v_ℓ|y) follows directly from the acceptance probability defined in Theorem 2.3.
We use the empirical posterior covariance, as commonly used in adaptive MCMC [32,17,16], to construct the matrices A_{ℓ,r} and B_{ℓ,r} for our DILI proposal (50). On each level, the empirical covariance matrix Σ_{ℓ,r} ∈ R^{r_ℓ×r_ℓ} is estimated from past posterior samples projected onto the LIS. Given a jump size ∆t, we can then define the matrices A_{ℓ,r} and B_{ℓ,r} as in (52). The operators A_{ℓ,r} and B_{ℓ,r} satisfy A_{ℓ,r}^2 + B_{ℓ,r}^2 = I_{ℓ,r} by construction. By estimating the empirical covariance within the subspace, common conditions such as diminishing adaptation [1,33] for the convergence of adaptive MCMC can easily be satisfied. In addition, we adopt a finite adaptation strategy in our numerical implementation, in which only the samples generated after the adaptation has stopped are used for estimating QoIs.
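For concreteness, the following sketch builds commuting matrices A_{ℓ,r} and B_{ℓ,r} satisfying A_{ℓ,r}^2 + B_{ℓ,r}^2 = I from an empirical covariance Σ and a jump size ∆t. The specific Crank-Nicolson-style spectral scaling b_i^2 = ∆t d_i/(1 + ∆t d_i) is our illustrative assumption, not necessarily the paper's formula in (52); any scaling with a_i^2 + b_i^2 = 1 in the eigenbasis of Σ satisfies the hypotheses of Corollary 5.1:

```python
import numpy as np

def build_AB(Sigma, dt):
    """Build commuting SPD matrices A_r, B_r with A_r^2 + B_r^2 = I from the
    empirical LIS-projected covariance Sigma.  The scaling below is one
    plausible choice (an assumption, not the paper's exact formula):
    directions with larger posterior variance receive larger steps."""
    d, W = np.linalg.eigh(Sigma)          # Sigma = W diag(d) W^T
    b2 = dt * d / (1.0 + dt * d)          # step sizes in (0, 1)
    a = np.sqrt(1.0 - b2)
    b = np.sqrt(b2)
    A = (W * a) @ W.T                     # W diag(a) W^T
    B = (W * b) @ W.T                     # W diag(b) W^T
    return A, B
```

Since A and B share the eigenbasis of Σ, they commute and are simultaneously diagonalisable, exactly the structure exploited in the proof of Corollary 5.1.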

Conditional DILI proposal
On level 0, the vanilla DILI proposal (cf. [10]) can be used to sample the Markov chain with invariant distribution π_0(v_0|y). On level ℓ, to simulate coupled Markov chains using the proposal mechanism defined in (46)-(48), a key step is to use DILI to generate the fine components v′_{ℓ,f} of the proposal candidate, and thus to fix the conditional probability q(v′_{ℓ,f}|v*_ℓ, v′_{ℓ,c}). Defining the precision matrix P_ℓ as in (54), the DILI proposal (50) can be split as in (55), where the partitions of the vectors and of the matrix P_ℓ correspond to the parameter coordinates shared with level ℓ−1 and the refined parameter coordinates on level ℓ.
To draw candidate samples from the factorised proposal distribution defined in (46), we use the procedure outlined in Algorithm 1, which employs the DILI proposal in the form of (55) for the conditional distribution q(v′_{ℓ,f}|v*_ℓ, v′_{ℓ,c}).

Corollary 5.2. Using the above procedure to draw candidates from the factorised proposal distribution, the acceptance probability to sample from the posterior distribution π_ℓ(v_ℓ|y) simplifies accordingly.

Algorithm 1 Conditional DILI proposal on the fine level based on (55).
1: procedure Conditional DILI proposal
2: Compute A_ℓ v*_ℓ using (55).
3: Draw a random variable r_{ℓ,f} conditioned on r_{ℓ,c} such that jointly (r_{ℓ,c}, r_{ℓ,f}) ∼ N(0, P_ℓ^{−1}). Due to (55), the fine-level components of the proposed candidate v′_{ℓ,f} then satisfy (56).
4: end procedure

Generating conditional samples

The computational cost of the coupling procedure is dictated by the multiplication with A_ℓ in Step 2 and the generation of conditional proposal samples in Step 3. The multiplication with A_ℓ has a computational complexity of O(∑_{j=0}^{ℓ} R_j s_j), using the low-rank representation (51) and the upper-triangular hierarchical LIS basis in (45).

We can also exploit the hierarchical LIS to reduce the computational cost of generating conditional proposal samples. As shown in Equation (54), given the LIS basis Ψ_{ℓ,r}, the precision matrix P_ℓ is dictated by the matrix B_{ℓ,r}^{−2}, whose block structure corresponds to the splitting of the enriched LIS basis into Ψ_{ℓ,c} and Ψ_{ℓ,f}. Generating conditional proposal samples only involves the blocks P_{ℓ,ff} and P_{ℓ,fc} of the matrix P_ℓ, which in turn only require the blocks Ξ_{ℓ,fc} ∈ R^{s_ℓ×r_{ℓ−1}} and Ξ_{ℓ,ff} ∈ R^{s_ℓ×s_ℓ} of the matrix B_{ℓ,r}^{−2}. We derive low-rank operations to avoid the direct inversion or factorisation of the matrices P_{ℓ,ff} and P_{ℓ,fc} in the generation of conditional samples and to reduce the computational cost. Suppose the block Z_{ℓ,f} ∈ R^{(R_ℓ−R_{ℓ−1})×s_ℓ} has the thin QR factorisation Z_{ℓ,f} = U_ℓ T_ℓ, where U_ℓ has orthonormal columns and T_ℓ is upper triangular. Then the matrix P_{ℓ,ff} can be expressed in terms of this factorisation.

Algorithm 2 Base level algorithm.
Input: A set of samples W_0 drawn from the base level reference p*_0(·), the number of MCMC iterations N_0, and an initial MCMC state V_0^{(0)}.
Output: A LIS basis Ψ_{0,r} and a Markov chain of posterior samples V_0.
1: procedure Base level LIS and MCMC
2: Use W_0 to solve the eigenproblem in (39) to obtain the base level LIS basis Ψ_{0,r}.

3: Estimate the empirical covariance matrix Σ_{0,r} from the samples in W_0 and define the operators A_0 and B_0 as in (51)-(52).

7: With probability α(V_0^{(j)}, V′_0), accept the proposed candidate; otherwise keep the current state.
end for
9: end procedure
Note: Optionally, Σ_{0,r}, A_0 and B_0 can be adaptively updated within the MCMC after a pre-fixed number of iterations, cf. [1,17].
where W_ℓ and D_ℓ are orthogonal and diagonal matrices, respectively, and U_ℓ ∈ R^{(R_ℓ−R_{ℓ−1})×s_ℓ} has orthonormal columns. Using these representations of the matrices P_{ℓ,ff}^{−1}P_{ℓ,fc} and P_{ℓ,ff}^{−1/2}, the conditional Gaussian in (56) can be simulated efficiently. The associated computational cost is O(R_ℓ s_ℓ).
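The conditional draw in Step 3 of Algorithm 1 is standard Gaussian conditioning on the joint precision matrix: r_f | r_c is Gaussian with mean −P_{ff}^{−1}P_{fc} r_c and covariance P_{ff}^{−1}. The following dense sketch (function and variable names are ours) uses Cholesky solves where the paper substitutes the low-rank QR/eigen factorisations just described:

```python
import numpy as np

def sample_fine_given_coarse(P, nc, r_c, rng):
    """Draw r_f | r_c when (r_c, r_f) ~ N(0, P^{-1}) and P is the joint
    precision matrix whose first nc coordinates are "coarse".  Dense sketch
    of the conditional step; the paper replaces the factorisations below with
    low-rank operations on the small LIS blocks."""
    P_fc = P[nc:, :nc]
    P_ff = P[nc:, nc:]
    L = np.linalg.cholesky(P_ff)                  # P_ff = L L^T
    mean = -np.linalg.solve(P_ff, P_fc @ r_c)     # conditional mean
    z = rng.standard_normal(P_ff.shape[0])
    noise = np.linalg.solve(L.T, z)               # Cov(noise) = P_ff^{-1}
    return mean + noise
```

Note that only the blocks P_{fc} and P_{ff} enter the draw, consistent with the observation above that the coarse-coarse block of P_ℓ is never needed.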

Final MLDILI algorithm
Here, we assemble all the elements of the multilevel DILI method defined in the previous sections in algorithmic form. For the base level (ℓ = 0), the LIS construction and the DILI-MCMC sampling are presented in Algorithm 2. The recursive LIS construction and the coupled DILI-MCMC are presented in Algorithm 3.
In both algorithms, we need both the LIS basis Ψ_{ℓ,r} and an empirical covariance matrix Σ_{ℓ,r} projected onto the LIS to define the operators A_ℓ and B_ℓ in the DILI proposal. Computing the LIS basis requires some reference distribution p*_ℓ(·).

Algorithm 3 Level-ℓ algorithm.

3: Use W_ℓ to solve the eigenproblem in (42) to obtain the auxiliary LIS vectors Ψ_{ℓ,f}.

4: Estimate the empirical covariance matrix Σ_{ℓ,r} from the samples in W_ℓ and define the operators A_ℓ and B_ℓ as in (51)-(52).

9: With probability α^{ML}_ℓ(V_ℓ^{(j)}, V′_ℓ), accept the proposed candidate; otherwise keep the current state.
end for
11: end procedure
Note: Optionally, Σ_{ℓ,r}, A_ℓ, B_ℓ, and the matrices P_{ℓ,ff}^{−1}P_{ℓ,fc} and P_{ℓ,ff} can be adaptively updated within MCMC after a pre-fixed number of iterations.

As reference distribution p*_ℓ(·) we employ the Laplace approximation to the posterior (e.g., [29,31]). This way, all the samples from p*_ℓ(·) can be generated in parallel and prior to the DILI-MCMC simulation. The empirical covariance Σ_{ℓ,r} can be estimated using either samples drawn from the reference distribution (before the start of MCMC) or adaptively using posterior samples generated during MCMC. The latter option is the classical adaptive MCMC method [17]. The adaptation of Σ_{ℓ,r} is optional in Algorithms 2 and 3. Similar to the adaptation of the covariance, the LIS basis can also be adaptively updated using newly generated posterior samples during the MCMC simulation. The implementation details for the adaptation of the LIS can be found in Algorithm 1 of [10].

Pooling strategy
Finally, we present an alternative proposal strategy that fully exploits the power of multilevel MCMC while reducing the dependencies between samples on different levels, for better parallel performance. In this pooling strategy, we simulate the coupled multilevel Markov chains level by level.
Given a set V_{ℓ−1} of posterior samples on level ℓ−1, we again generate N_ℓ samples on level ℓ using the multilevel proposal mechanism (46)-(48) with conditional DILI proposals as described in Algorithm 1. However, here the inputs to Algorithm 1, i.e., the proposals v′_{ℓ,c}, are drawn uniformly at random (with replacement) from the set V_{ℓ−1}, in contrast to using proposals v′_{ℓ−1} ∼ π_{ℓ−1}(·|y) from a sub-sampled Markov chain on level ℓ−1, as discussed in Section 5.1 above. Thus, in this pooling strategy the empirical distribution of the samples in V_{ℓ−1} is used as an approximation of π_{ℓ−1}(·|y).
Due to the variance reduction from level to level in the multilevel MCMC algorithm (cf. eqn. (30)) and the excellent mixing of our MLDILI algorithm, the effective sample size of V_{ℓ−1} will in general be significantly larger than the number N_ℓ of samples that we plan to generate at level ℓ. Thus, after some burn-in phase, the set V_{ℓ−1} will contain (approximately) independent samples from the coarse-level posterior, which are needed in the construction of the Markov chain on level ℓ in Algorithm 3 (Line 7).
With the pooling strategy, it is possible to run multiple Markov chains at the coarse level and to form the pool from the union of the coarse-level samples. The strategy parallelises much more easily and also provides flexibility if the user decides to run further refined levels to improve the discretisation accuracy: one can simply reuse the pool of previously computed samples as the coarse-level proposal after the refinement. Despite its practical usefulness, we note that a formal proof of convergence for the pooling strategy remains open and will need to be addressed in future research.
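A minimal sketch of one pooled Metropolis-Hastings step, with the conditional DILI proposal and the multilevel acceptance probability abstracted as callables (all names here are illustrative assumptions, not the paper's interface):

```python
import numpy as np

def pooled_mh_step(state, pool, draw_fine, accept_prob, rng):
    """One coupled MH step of the pooling strategy: the coarse part of the
    candidate is drawn uniformly with replacement from the pool of
    level-(l-1) samples, the fine part from the conditional DILI proposal
    (abstracted as `draw_fine`), and `accept_prob` stands in for the
    multilevel acceptance probability."""
    v_c = pool[rng.integers(len(pool))]    # independence proposal from the pool
    v_f = draw_fine(state, v_c, rng)       # conditional fine-level proposal
    candidate = (v_c, v_f)
    if rng.random() < accept_prob(state, candidate):
        return candidate, True
    return state, False
```

Because the coarse proposal is an independence proposal from a fixed pool, many level-ℓ chains can run in parallel against the same pool, which is the source of the improved parallel scalability noted above.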

Numerical experiments
In this section, the algorithms are tested on a model problem involving an elliptic PDE with random coefficients, described in Section 6.1. Numerical comparisons are then given in Section 6.2.

Setup
We consider an elliptic PDE on the domain Ω = [0, 1]^2 with boundary ∂Ω, which models, e.g., the pressure distribution p(x) of a stationary fluid in a porous medium described by a spatially heterogeneous permeability field k(x). Here, x ∈ Ω denotes the spatial coordinate and n(x) denotes the outward normal vector along the boundary.
The goal is to recover the permeability field from pressure observations.We assume that the permeability field follows a log-normal prior, and thus we denote the permeability field by k(x) = exp(u(x)), where u(x) is a random function equipped with a Gaussian process prior.In this setting, the pressure p(x) depends implicitly on the (random) realisation of u(x).
For a given realisation u(x), the pressure satisfies the elliptic PDE (65). On the left and right boundaries, we specify Dirichlet boundary conditions, while on the top and bottom we impose homogeneous Neumann boundary conditions (66). As the quantity of interest, we define the outflow through the left vertical boundary, i.e. (67),
where φ(x) is a linear function taking value one on ∂Ω_left and zero on ∂Ω_right, as suggested in [39]. The Gaussian process prior for u(x) is defined by the exponential kernel k(x, x′) = exp(−5|x − x′|). Figure 2 (left) displays the true (synthetic) permeability field in log10 scale. Noisy observations of the pressure field are collected from 71 sensors, located as in Figure 2 (right), with a signal-to-noise ratio of 50. A likelihood function can then be defined as in (3), which, together with the prior, characterises the posterior distribution in (1). In practice, (65)-(67) have to be solved numerically. We use standard piecewise bilinear finite elements (FEs) on a hierarchy of nested Cartesian grids with mesh size h_ℓ = (1/20) × 2^{−ℓ}, for ℓ = 0, 1, 2, 3. Furthermore, we approximate the unknown function u(x) by truncated Karhunen-Loève expansions with R_ℓ = 50 + 100 × 2^ℓ random modes, respectively.
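A collocation sketch of the truncated Karhunen-Loève construction for this prior; only the kernel k(x, x′) = exp(−5|x − x′|) and the idea of keeping the leading R_ℓ modes are taken from the text, while the FE discretisation and quadrature weighting used in the paper are omitted:

```python
import numpy as np

def kl_modes(points, R):
    """Eigendecomposition of the exponential covariance kernel
    k(x, x') = exp(-5 |x - x'|) discretised on the given points, truncated
    to the R leading modes (a simple collocation sketch)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    K = np.exp(-5.0 * d)
    lam, phi = np.linalg.eigh(K)
    idx = np.argsort(lam)[::-1][:R]       # leading eigenpairs
    return lam[idx], phi[:, idx]

def sample_prior(lam, phi, rng):
    """Draw a truncated-KL prior realisation u = sum_i sqrt(lam_i) xi_i phi_i."""
    xi = rng.standard_normal(len(lam))
    return phi @ (np.sqrt(lam) * xi)
```

The hierarchy of parameter dimensions R_ℓ then simply corresponds to retaining more KL modes on finer levels, which is what makes the coarse parameters a subvector of the fine ones.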

Comparisons
Let us now test and compare our algorithms on the model problem described above. First, we proceed as in Section 4 to build a LIS on every level, using both the non-recursive and the recursive construction. Table 1 summarises the number of basis functions obtained in each case with truncation threshold ϱ = 10^{−2}, as well as the storage reduction factor of the recursive procedure at each level.
Because the recursive LIS construction recycles the LIS bases from previous levels and enriches them with a number of auxiliary LIS vectors on each level, the total number of basis functions obtained by the enriching procedure at each level is expected to be slightly higher than for the direct (spectral) LIS on the same level. However, in the recursive construction, the dimension of the auxiliary set of vectors is expected to decrease as the level increases, requiring less storage and less computational effort on finer levels, since the posterior distributions are assumed to converge as ℓ → ∞. For problems with parametrisations where the parameter dimension increases more rapidly with the discretisation level (e.g., using the same FE grid to discretise the prior covariance, the setting used in the original DILI paper [10]), we expect the reduction factor to be even smaller.
In the comparison of sampling performance, we denote by MLpCN the MLMCMC algorithm using the pCN proposal for the additional parameters on each level (as in [14]). The MLMCMC algorithm using the recursive LIS and the coupled DILI proposals, as summarised in Algorithms 2 and 3, is denoted by MLDILI. The integrated autocorrelation times (IACTs) of the Markov chains constructed by MLpCN and MLDILI are reported in Table 2. For each algorithm, the IACTs of two functionals are reported. In the "refined parameters" case, at every level ℓ we report the average IACT of the refined parameters v_{ℓ,f}. This quantifies how well the algorithm performs in exploring the posterior distribution. In the second case, we consider the IACT of the level-ℓ corrections of the quantity of interest. In the "refined parameters" case, we observe a significant improvement of MLDILI over MLpCN: the coupled DILI proposal reduces the IACT at every level compared to MLpCN. At the base level, DILI reduces the IACT by two orders of magnitude compared to pCN. This suggests that the coarse parameter modes are strongly informed by the data, and thus utilising the DILI proposal is highly beneficial. In the case of the quantity of interest, we observe an even more impressive improvement at the base level (a factor of 456!), while the IACTs of MLDILI and MLpCN on the finer levels are comparable. This suggests that the posterior distribution of the chosen quantity of interest (the integrated flux over the boundary) is not strongly affected by the high-frequency parameter modes on the finer levels. Nevertheless, in both cases, using DILI provides a huge acceleration compared to pCN. Figure 3 compares the integrated autocorrelation times of DILI and pCN on level 0, for both the first parameter component and the quantity of interest.
The IACTs for the level-ℓ corrections of the quantity of interest in Table 2 suggest that a mixed strategy, in which one employs the LIS and DILI only at the coarsest level and uses pCN on the refined levels, is also a reasonable approach in cases where the important likelihood-informed directions that influence the quantity of interest are already identified well enough in the base-level LIS. We refer to this as the MLmixed strategy.
We compare the computational performance of the three multilevel algorithms (MLDILI, MLpCN, MLmixed) with the two single-level algorithms using DILI and pCN proposals. The finite element model and all MCMC algorithms are implemented in MATLAB; we use sparse Cholesky factorisation [6] to solve the finite element systems and ARPACK [28] to solve the eigenproblems. All simulations are carried out on a workstation equipped with 28 cores (two Intel Xeon E5-2680 CPUs). The performance of MLmixed is estimated using only the IACTs and the actual computing times measured in the MLDILI and MLpCN runs.
The computational complexities of the five algorithms for approximating E_π[Q] on (discretisation) levels L = 1, 2 and 3, with Q defined in (67), are compared in Figure 4 (right). In the multilevel estimators, the coarsest level is always ℓ = 0, so that the number of levels is 2, 3 and 4, respectively. The sampling error tolerance on each level is adapted to the corresponding bias error due to finite element discretisation and parameter truncation, such that the squared bias equals the variance of the estimator. The bias errors were estimated beforehand to be 9 × 10^{−3}, 4 × 10^{−3} and 2 × 10^{−3} on levels L = 1, 2 and 3, leading to total errors of 1.27 × 10^{−2}, 5.7 × 10^{−3} and 2.8 × 10^{−3}, respectively. Those bias estimates are plotted in Figure 4 (left), together with estimates of Var_{π_ℓ}(Q_ℓ) and Var_{∆_{ℓ,ℓ−1}}(Q_ℓ − Q_{ℓ−1}), which suggest that θ_b ≈ 0.5 and θ_v ≈ 0.5 in Assumptions 2.4(i) and 3.2. This agrees with the theoretical results in [14]. The cost per sample is dominated by the sparse Cholesky factorisation on each level and scales roughly like O(M_ℓ^{1.2}), so that θ_c ≈ 1.2 in Assumption 2.4(ii). Optimally scaling multigrid solvers exist for this model problem, but for the FE problem sizes considered here they are more costly in absolute terms. Moreover, we can also exploit the fact that the adjoint problem is identical to the forward problem here, so that the Cholesky factors can be reused for the adjoint solves required in the LIS construction.
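The bias/variance balancing just described can be illustrated with the standard MLMC sample-allocation rule, which minimises the total cost Σ_ℓ N_ℓ C_ℓ subject to the variance budget Σ_ℓ V_ℓ/N_ℓ ≤ ϵ². This is the textbook allocation, not necessarily the exact schedule used in the paper's experiments:

```python
import numpy as np

def mlmc_allocation(V, C, eps):
    """Per-level sample counts minimising total cost sum(N_l * C_l) subject
    to the variance budget sum(V_l / N_l) <= eps^2 (matched to the squared
    bias).  Standard Lagrange-multiplier MLMC allocation: N_l is
    proportional to sqrt(V_l / C_l)."""
    V, C = np.asarray(V, float), np.asarray(C, float)
    mu = np.sum(np.sqrt(V * C)) / eps**2      # Lagrange multiplier
    return np.ceil(mu * np.sqrt(V / C)).astype(int)
```

With the level variances V_ℓ decaying and the per-sample costs C_ℓ growing across levels, most samples land on the cheap coarse levels, which is exactly the regime in which the pooled coarse chains above pay off.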
Let us now discuss the results. Single-level pCN becomes impractical in this example, since the data is very informative and leads to an extremely low effective sample size. Some of this poor statistical efficiency is inherited by MLpCN, at least in absolute terms, due to the poor effective sample size on level 0. Asymptotically this effect disappears, and the rate of growth of the cost is smallest for MLpCN, with an observed asymptotic cost of about O(ϵ^{−2.3}). As observed in [14], this is better than the theoretically predicted asymptotic rate and is likely a pre-asymptotic effect due to the high cost on level 0. Unsurprisingly, given the low IACTs reported in Table 2, the methods based on DILI proposals all perform significantly better. MLDILI and MLmixed perform almost identically, since the corresponding IACTs on all levels are very similar. They are consistently better than single-level DILI, and the asymptotic rate of growth of the cost is also better: O(ϵ^{−3.4}) versus O(ϵ^{−4.1}). Both rates are consistent with the theoretically predicted rates in Theorem 3.4, given the estimates for θ_b, θ_v and θ_c above. For the highest accuracies, MLDILI is almost 4 times faster than DILI, and due to the better asymptotic behaviour this reduction factor will grow as ϵ → 0. For grid level L = 4, even MLpCN is expected to outperform single-level DILI, but the computational costs of the estimators for higher accuracies are starting to become impractical even with the multilevel acceleration, as the dashed line representing one CPU day in Figure 4 (right) indicates.
The dominating cost in solving the eigenproblems (39) and (42) is the Cholesky factorisation. As mentioned above, sparse direct solvers are used to solve the stationary forward model, and we are able to recycle the Cholesky factors from the forward solve to compute the actions of the adjoint model in (39) and (42) for each sample. As a result, the computational cost of building the LIS is negligible compared to that of the MCMC simulation here (for both the single-level and the recursive construction). This also explains why MLmixed performs almost identically to MLDILI. However, in many other applications this is not possible, due to the high storage cost or because the adjoint differs from the forward model. Each action of the adjoint model typically has a cost comparable to solving the forward model in the stationary case, and can be even more expensive in time-dependent problems. To provide a thorough comparison for that case, we also report in Figure 5 the total CPU times of all the estimators when the LIS setup cost is included. Here, we compute both the single-level LIS and the recursive LIS without storing the Cholesky factors, to mimic the behaviour in the general, large-scale case. In this setup, we observe that a significant amount of computing effort is spent on building the LIS, and thus MLmixed and MLDILI significantly outperform the single-level DILI for all error thresholds. MLmixed is more than 4 times faster than DILI even for the largest error threshold of 1.27 × 10^{−2}. The construction of the single-level LIS requires twice as much CPU time as the actual MCMC simulation in that case. In comparison, a significant number of adjoint model solves can be saved by the recursive LIS construction. Furthermore, we expect that the computational cost of constructing the recursive LIS will stop increasing, since the dimension of the auxiliary LIS will eventually be zero at higher levels. Overall, for large-scale problems where the adjoint cannot be cheaply computed by recycling the forward model simulation, the recursive LIS construction, and hence MLDILI, is clearly more computationally efficient than the single-level DILI.

Conclusion
We integrate the dimension-independent likelihood-informed MCMC from [10] into the multilevel MCMC framework of [14] to improve the computational efficiency of estimating expectations of functionals of interest over posterior measures. Several novel elements are introduced in this integration. We first design a Rayleigh-Ritz procedure to recursively construct likelihood-informed subspaces that exploit the hierarchy of model discretisations. The resulting hierarchical LIS requires less computational effort to construct and has a lower operating cost than the original LIS proposed in [11]. Then, we present a new pooling strategy to couple Markov chains on consecutive levels, which enables more flexible parallelisation and management of computing resources. Finally, we design new coupled DILI proposals by exploiting the hierarchical LIS, so that the DILI proposal can be applied in the multilevel MCMC setting. We also demonstrate the efficacy of our integrated approach on a model inverse problem governed by an elliptic PDE.

A. Computational complexity of hierarchical LIS
Here we develop heuristics, under a set of restrictive assumptions, to compare the complexity of constructing the hierarchical LIS with that of the single-level LIS constructed directly on level L. Using a similar derivation, we can also obtain the reduction factor for constructing the hierarchical LIS basis. The number of matrix-vector products (with the sample-averaged Gauss-Newton Hessian H_0) in the construction of the base-level LIS via the eigenproblem (39) is linear in the number of leading eigenvectors obtained, i.e., O(s_0). The same holds for the number of matrix-vector products with H_ℓ in the construction of the auxiliary LIS vectors in the recursive enrichment via the eigenproblems in (42). From these counts, the overall computational complexity of constructing the hierarchical LIS basis follows; the complexity of constructing the single-level LIS on level L is obtained similarly, with the same prefactors. The following corollary can be proved in the same way as Corollary A.2, since we have assumed that M_ℓ^{ϑ_c} = M_0^{ϑ_c} e^{β_m ϑ_c ℓ}.

Corollary A.3. The reduction factor for building the hierarchical LIS basis (as opposed to the standard single-level LIS basis on level L) satisfies the stated upper bound.

Proof (of Corollary 5.2). Due to the acceptance probability (49), the acceptance ratio contains the factor

exp(−η_ℓ(v′_ℓ; y) + η_{ℓ−1}(v′_{ℓ−1}; y)) / exp(−η_ℓ(v*_ℓ; y) + η_{ℓ−1}(v*_{ℓ−1}; y)).

The level-ℓ parameter vectors can be split as v_ℓ = (v_{ℓ,c}, v_{ℓ,f}),
and we have v′_{ℓ,c} = v′_{ℓ−1} and v*_{ℓ,c} = v*_{ℓ−1} by construction in the coupling procedure. The density of the conditional DILI proposal q(v′_{ℓ,f}|v*_{ℓ,f}, v*_{ℓ,c}, v′_{ℓ,c}) is defined as the ratio between the DILI proposal density and the marginal DILI proposal density. Due to Corollary 5.1, the DILI proposal q(v′_ℓ|v*_ℓ) has the prior distribution p_ℓ(v_ℓ) as invariant measure. Hence, if v*_ℓ = (v*_{ℓ,f}, v*_{ℓ,c}) is drawn from the prior p_ℓ(v_ℓ), then the proposal candidate v′_ℓ = (v′_{ℓ,f}, v′_{ℓ,c}) also follows the prior p_ℓ(v_ℓ). Furthermore, if v*_ℓ is drawn from p_ℓ(v_ℓ), then the marginal DILI proposal q(v′_{ℓ,c}|v*_{ℓ,f}, v*_{ℓ,c}) generates candidates whose coarse components follow the marginal prior, which for our particular choice of parametrisation is the same as the prior p_{ℓ−1}(v′_{ℓ,c}) on level ℓ−1, that is, p_ℓ(v*_ℓ) q(v′_{ℓ,c}|v*_ℓ) = p_{ℓ−1}(v′_{ℓ,c}). Using this identity and substituting (77) into (76), the ratio can be simplified, and the result then follows immediately from (75).

Figure 1: This diagram illustrates the coupling strategy, with double arrows representing the coupling of two MCMC states as well as the coupling of two proposal candidates across levels. The dashed arrows represent the proposal and the accept/reject steps. At each iteration, the proposal candidates are coupled by construction, while the samples v^{(ℓ,j)}_ℓ and v^{(ℓ,j)}_{ℓ−1} on two adjacent levels are only coupled whenever the proposal on level ℓ is accepted, i.e., for all cases j, j+1, j+2 here.


Figure 2: Setup of the elliptic inverse problem. Top left: "true" permeability field used for generating the synthetic data set. Top right: observation sensors (red dots) and the pressure field corresponding to the "true" permeability field. Bottom row: realisations of the permeability drawn from the prior.

Figure 4: Left: the variance Var_{π_ℓ}(Q_ℓ) (blue) and the bias ϵ_ℓ = E_{µ_y}[Q] − E_{π_ℓ}[Q_ℓ] (yellow) at each level, and the cross-level variances Var_{∆_{ℓ,ℓ−1}}(D_{ℓ−1}) = Var_{∆_{ℓ,ℓ−1}}(Q_ℓ − Q_{ℓ−1}) (red) used for estimating the CPU time for the various MCMC methods. Right: total CPU time (in seconds) for the various methods to achieve different total error tolerances. The LISs are constructed by recycling Cholesky factors. The dotted line represents one CPU day.

Figure 5: Left: total CPU time (in seconds) for the single-level and recursive constructions of the LISs on levels 2, 3 and 4. Right: total CPU time (in seconds) for the various methods to achieve different error tolerances. The LISs are constructed without recycling Cholesky factors. The dotted line represents one CPU day.

Table 1: LIS dimensions. Results of the non-recursive construction (single-level LIS for each ℓ) are reported in the first row; for the recursive construction, the number of vectors added on the current level and the total LIS dimension are given in the second and third rows, respectively; the fourth row displays the storage reduction factor of the recursive procedure at each level.

Table 2: Comparison of the IACTs of the Markov chains generated by MLDILI and MLpCN. This table reports the IACTs of the refined parameters and of the level-ℓ correction of the quantity of interest.