Uncertainty quantification for goal-oriented inverse problems via variational encoder-decoder networks

In this work, we describe a new approach that uses variational encoder-decoder (VED) networks for efficient uncertainty quantification for goal-oriented inverse problems. Contrary to standard inverse problems, the problems we consider are goal-oriented in that the goal is to estimate some quantities of interest (QoI) that are functions of the solution of an inverse problem, rather than the solution itself. Moreover, we are interested in computing uncertainty metrics associated with the QoI, thus utilizing a Bayesian approach for inverse problems that incorporates the prediction operator and techniques for exploring the posterior. This may be particularly challenging, especially for nonlinear, possibly unknown, operators and nonstandard prior assumptions. We harness recent advances in machine learning, namely VED networks, to describe a data-driven approach to large-scale inverse problems. This enables real-time uncertainty quantification for the QoI. One of the advantages of our approach is that we avoid the need to solve challenging inversion problems by training a network to approximate the mapping from observations to QoI. Another main benefit is that we enable uncertainty quantification for the QoI by leveraging probability distributions in the latent and target spaces. This allows us to efficiently generate QoI samples and circumvent complicated or even unknown forward models and prediction operators. Numerical results from medical tomography reconstruction and nonlinear hydraulic tomography demonstrate the potential and broad applicability of the approach.


Introduction
Inverse problems arise in many scientific applications, and there has been a significant amount of research on the development of theory and computational methods for solving inverse problems. However, solving an inverse problem is often just one step in a multi-step process, where the solution of the inverse problem is a tool that is used for some end goal (e.g. predictions that are used for optimal design or control). Standard approaches for goal-oriented inverse problems first seek a solution to the inverse problem and, informed by this solution, subsequently compute quantities of interest (QoI). While in some cases, estimation of these QoI is the end goal, there are many scenarios where uncertainties for the QoI provide critical information, e.g. for further decision-making. In this work, we are interested in combining goal-oriented inverse problems with uncertainty quantification for the QoI. In particular, we provide an efficient end-to-end computational approach that can be used to directly estimate the QoI and their uncertainties.
Consider a goal-oriented inverse problem of the form b = F(y) + e, x = G(y), (1) where b ∈ R^m contains the observations, y ∈ R^n contains unknown true parameters (e.g. the true image or state parameters), F : R^n → R^m is the parameter-to-observable map, and e ∈ R^m represents additive noise. In classical inverse problems, the goal is to compute approximations of the unknowns in y, given observations b and knowledge of the forward operator F [1]. Here, we are interested in goal-oriented inverse problems (1) where we aim to estimate some QoI in x ∈ R^q that may be determined by a prediction operator G : R^n → R^q, mapping the unknown y onto x.
Some motivating examples of goal-oriented inverse problems include carbon sequestration [2], modeling transport of a contaminant in an urban environment [3], and detecting the locations of discontinuities in reconstructed CT images [4,5]. In many applications, the prediction operator may be simply taking a maximum, minimum, average, or some combination of the unknown parameters in y [2,3,6,7]. However, QoIs may also be related to y by more complex mappings where the prediction operator may not be well-defined or may rely on expert opinion (e.g. manual classification or segmentation of images). Regardless, the QoI typically represents some low dimensional property of the unknown y, i.e. q ≪ n.
Goal-oriented inverse problems suffer from the same obstacles as classical inverse problems such as ill-posedness and large parameter dimensions. Further challenges arise when considering UQ for the QoI due to the additional prediction operator G. Recent works on goal-oriented inverse problems consider modifications of the inversion process or the data acquisition process to take into account the output QoI. For example, in goal-oriented inference, end goals (i.e. output QoI) are incorporated into the inference process, thereby leading to efficient inference-for-prediction algorithms [2,8]. These approaches rely on estimating the uncertainty of some predictive QoI based on the solution of an inverse problem. However, the main difference is that these approaches target the most relevant parameters for prediction (i.e. modifying the inference process to account for output quantities of interest), and we are interested in estimating uncertainties of the QoI as the end goal. Goal-oriented approaches have also been considered in the context of optimal approximations [7] and optimal design of experiments [3]. These approaches use the output QoI, which are often linear mappings of the unknowns, to inform dimensionality reduction techniques and experimental design (e.g. optimal placement of sensors), respectively.

Overview of contributions.
In this work, we provide a novel computational framework for directly estimating the QoI and their uncertainties, given observations. We assume that sample pairs {(b_j, x_j)}_{j=1}^J of observations and QoI are readily available and use a supervised learning approach to train a variational encoder-decoder (VED) network to approximate the mapping from observations b to QoI x. This approach confers several advantages:
• We bypass the computationally challenging inversion process of estimating y by establishing an end-to-end framework that maps the observations b directly to the QoI x. Although one could consider a two-step approach where a network is used to approximate the inversion process from b to y [9], followed by the QoI computation, there are significant challenges due to the potentially large number of unknowns in y. Dimension reduction techniques have been considered, e.g. to train a deep neural network (DNN) to approximate the mapping from observation data to desired (e.g. regularization) parameters [10,11]; however, these approaches do not allow for UQ of the QoI.
• By leveraging VED networks, we can perform efficient UQ of the QoI. More specifically, we use a VED network to create a mapping from the observation to some latent space, enforce a predetermined distribution on the unobserved latent variables, and then create a mapping from the latent space to the target data distribution (e.g. the QoI). The latter half of the VED defines a generating function that maps independent and identically distributed (i.i.d.) samples from the latent space variables to the target distribution. Once the network is trained, UQ for the QoI can be performed very efficiently in an online phase, via forward propagation of the observation b through the VED, where the output of the VED is a set of samples that can be used for the statistical description of the QoI x.
• One key advantage of our approach is that it is data-driven, in that training data are used to circumvent complicated or even unknown forward models and prediction operators. Most goal-oriented approaches use linear prediction operators G. Our approach is not restricted to linear operators and does not require knowledge of G (or the forward operator F).
An overview of the paper is as follows. In section 2 we provide a brief description of the Bayesian formulation for goal-oriented inverse problems and describe some basics of VED networks for readers unfamiliar with these machine learning techniques. Then in section 3, we describe various details about using VED networks for UQ of the QoI for goal-oriented inverse problems. Numerical experiments provided in section 4 illustrate the benefits of our approach for various example applications and goal-oriented problem setups. Conclusions are provided in section 5.

Background
Recall that for goal-oriented inverse problems (1), the standard approach is to first estimate the unknowns in y, given b and F, and second to compute the QoI x from y. However, standard approaches often do not reveal the uncertainties in this estimation. A popular approach that enables quantification of uncertainties (on y, x, or both) is to provide a probabilistic formulation for (1). That is, in order to provide information about the accuracy or uncertainties of the estimate, we are motivated to consider inverse problems in the framework of Bayesian statistics. In section 2.1, we review the Bayesian formulation of an inverse problem with particular emphasis on prediction for the goal-oriented problem. Then, in section 2.2 we describe tools from machine learning and their use within the inverse problems community. In particular, we describe VED networks that will be used in section 3.

Bayesian inverse problem
In this section, we describe a statistical approach to solving inverse problems. This approach enables us to include uncertainties, e.g. in measurement, into our model. In addition, the statistical approach provides a natural method for incorporating our prior knowledge of the parameters. Contrary to deterministic inverse problems where the goal is to compute a point estimate, the solution of a Bayesian inverse problem is a random variable and thus has a distribution [12]. More specifically, in the statistical framework, all unknown values are modeled as random variables. Let B, X, Y, and E be random vectors representing the observational data, the QoI, the unknown state, and additive noise, respectively. Then, the Bayesian interpretation of (1) is given by B = F(Y) + E, X = G(Y). (2) Note that since Y and E are unknown, we represent their uncertainties with a probability distribution constructed from prior knowledge. In the following, assuming for simplicity of illustration that all densities exist, the solution to the inverse problem (2) is given by the posterior density function, conditioned on the observations b, i.e. the posterior is proportional to the product of the likelihood and the prior, π(y | b) ∝ π(b | y) π(y).
Since we are interested only in the distribution of the QoI X rather than those of the state parameters Y themselves, the posterior distribution is propagated through to the QoI X, i.e. the posterior predictive is the distribution of X = G(Y) with Y | B = b distributed according to the posterior.
In the context of goal-oriented inference problems, this is often referred to as the posterior predictive [2]. The prediction is a random variable and, hence, is characterized by a distribution. We denote its density by π_pred(x | b). Given a prior and some observation, samples from the posterior Y | B can be passed through the prediction process, and the resulting samples can be used to describe the posterior predictive and provide uncertainties for the QoI.
In the special case where the parameter-to-observable map is linear, F(y) = Ay, and the noise E and the prior on Y are Gaussian, i.e. E ∼ N(0, Γ_noise) and Y ∼ N(ȳ, Γ_y) with Γ_noise and Γ_y symmetric and positive definite covariance matrices, the posterior (4) is also a Gaussian and is given by N(y_post, Γ_post) where
Γ_post = (A^⊤ Γ_noise^{-1} A + Γ_y^{-1})^{-1} and y_post = Γ_post (A^⊤ Γ_noise^{-1} b + Γ_y^{-1} ȳ). (6)
If, additionally, the prediction operator is linear, i.e. G(y) = Py for an appropriately sized matrix P, then the posterior predictive (5) is also a Gaussian and is given by N(x_pred, Σ_pred) where
x_pred = P y_post and Σ_pred = P Γ_post P^⊤. (7)
In the linear setting, a computationally efficient inference-for-prediction method was described in [2] that exploits underlying balanced truncation model reduction for dimension reduction.
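To make the linear Gaussian case concrete, the following is a minimal numerical sketch of (6) and (7) for a small synthetic problem; the operators A and P, the covariances, and all dimensions are illustrative assumptions, not the settings used in the numerical experiments later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small illustrative linear problem (all sizes and operators are made up for this sketch).
n, m, q = 50, 30, 2
A = rng.standard_normal((m, n))            # forward operator F(y) = A y
P = rng.standard_normal((q, n))            # linear prediction operator G(y) = P y
Gamma_noise = 0.05 * np.eye(m)             # noise covariance
Gamma_y = np.eye(n)                        # prior covariance
y_bar = np.zeros(n)                        # prior mean

y_true = rng.standard_normal(n)
b = A @ y_true + rng.multivariate_normal(np.zeros(m), Gamma_noise)

# Posterior (6): Gamma_post = (A^T Gamma_noise^{-1} A + Gamma_y^{-1})^{-1},
#                y_post = Gamma_post (A^T Gamma_noise^{-1} b + Gamma_y^{-1} y_bar)
Gni = np.linalg.inv(Gamma_noise)
Gyi = np.linalg.inv(Gamma_y)
Gamma_post = np.linalg.inv(A.T @ Gni @ A + Gyi)
y_post = Gamma_post @ (A.T @ Gni @ b + Gyi @ y_bar)

# Posterior predictive (7): x_pred = P y_post, Sigma_pred = P Gamma_post P^T
x_pred = P @ y_post
Sigma_pred = P @ Gamma_post @ P.T
print(x_pred, np.sqrt(np.diag(Sigma_pred)))
```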
The posterior predictive was also used for the goal-oriented optimal design of experiments in [3]. So far, we have described goal-oriented approaches, and for the special case described above, efficient tools have been developed for exploring the posterior predictive. In this work, we exploit VED networks for UQ of the QoI for goal-oriented inverse problems, resulting in a data-driven approach that can exploit recent developments in machine learning, is generalizable, and works for complex and even unknown prediction operators.

Learning from data
Machine learning has become an important tool in the inverse problems community. In this section, we provide some background on neural networks and describe the training process, given some training data. We describe how these networks work by starting with a description of DNNs, then encoder-decoder networks, and finally VED networks.
Let us begin with the assumption that there exists some target function Φ : R^m → R^q that maps an input vector b ∈ R^m to a target output vector x ∈ R^q. Deep neural networks (DNNs) are among the many classic machine learning modeling tools, using a parameterized and layered mapping Φ : R^m × R^p → R^q to emulate the target function Φ [13]. Given input-target pairs {(b_j, x_j)}_{j=1}^J, the aim of supervised learning is to train the neural network, i.e. to compute network parameters θ, such that Φ(b; θ) ≈ Φ(b). For this, a loss function determines the performance of a network by comparing Φ(b_j; θ) and x_j for a particular θ. An optimization method is used to find appropriate network parameters θ by minimizing a suitable loss function.
DNNs have been used for solving inverse problems (e.g. for regularization parameter selection [10,11] or for learning regularization functionals [14,15]). Furthermore, DNNs have been adopted for learning the inversion process, i.e. the mapping from the data space b to the inverse solution y [9]. Such approaches could be extended for the prediction or estimation of the QoI.
Encoder-decoder networks. DNNs can be implemented within an encoder-decoder (ED) architecture. Such networks consist of the concatenation of two networks (the encoder and the decoder). The encoder network is comprised of layers that successively reduce the dimension of the input vector. The output of the encoder is referred to as the latent vector z ∈ R^ℓ that describes a representation of the input. Then, the decoder network increases the dimension through the network to match the target dimension. More specifically, the two main parts can be defined using transformations:
• the encoder network e : R^m × R^{p_e} → R^ℓ with network parameters θ_e ∈ R^{p_e} maps an input b to the latent variables z,
• the decoder network d : R^ℓ × R^{p_d} → R^q with network parameters θ_d ∈ R^{p_d} maps the latent variables to the output x.
The entire ED network then takes the form Φ_ed(b; θ_e, θ_d) := d(e(b; θ_e); θ_d). See figure 1 for a visual representation of an ED network. Analogous to DNNs, the aim is to find parameters θ_e and θ_d that minimize some loss function. A common approach is to seek network parameters θ_e, θ_d that minimize the mean squared error, i.e.
min over θ_e, θ_d of (1/J) Σ_{j=1}^J ∥Φ_ed(b_j; θ_e, θ_d) − x_j∥_2^2.
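As an illustration of the ED construction, the following is a minimal PyTorch sketch of the composition Φ_ed(b; θ_e, θ_d) = d(e(b; θ_e); θ_d) trained with a mean-squared-error loss; the layer sizes, dimensions, and synthetic data are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Minimal ED network: the encoder e maps b to a latent vector z,
    the decoder d maps z to the output x (all dimensions are illustrative)."""
    def __init__(self, m=64, ell=8, q=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(m, 32), nn.ReLU(), nn.Linear(32, ell))
        self.decoder = nn.Sequential(nn.Linear(ell, 32), nn.ReLU(), nn.Linear(32, q))

    def forward(self, b):
        z = self.encoder(b)          # z = e(b; theta_e)
        return self.decoder(z)       # x = d(z; theta_d)

# training with a mean-squared-error loss over pairs (b_j, x_j)
model = EncoderDecoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
b_batch, x_batch = torch.randn(16, 64), torch.randn(16, 4)   # synthetic stand-in data
loss = nn.functional.mse_loss(model(b_batch), x_batch)
opt.zero_grad(); loss.backward(); opt.step()
```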
A commonly-used ED network is the autoencoder, which aims to map an input b onto itself, i.e. Φ_ed(b; θ_e, θ_d) ≈ b. The encoder represents the input in a lower dimensional space while the decoder attempts to reconstruct the input from the low dimensional representation z. The flow of data through a low-dimensional latent space forces the autoencoder to find a low-dimensional representation for a set of data, by which insignificant data is eliminated. Hence, such networks are classically used in dimensionality and noise reduction, and data compression applications, see e.g. [16-19].
Variational encoder-decoder (VED) networks. Standard ED networks do not have generative capability, meaning a random sample from the latent space z_j ∈ R^ℓ may not necessarily generate a meaningful target sample d(z_j; θ_d). Variational encoder-decoder (VED) networks address this issue by enforcing a predetermined structure/distribution on the latent variables. This is particularly relevant in the context of variational autoencoders (VAEs), where the goal is to have the output match the input. Here, by sampling from the predetermined latent distribution, the decoder d provides a generating function that can be used to simulate or generate new data with desired features [20].
In a VED network, the input b is passed through an encoder e(b; θ_e) that depends on parameters θ_e. This is similar to the ED networks where we reduce the dimensions of the input. However, the output of the encoder is a mean vector µ_e ∈ R^ℓ and a vector of standard deviations σ_e ∈ R^ℓ. For simplicity, we assume that µ_e and σ_e define a Gaussian distribution for the latent vector z, i.e. z ∼ N(µ_e, diag(σ_e)²).
Then a sample z is fed to the decoder d(z; θ_d) that depends on parameters θ_d. Notice that contrary to the ED network, the latent vector z is not the output of the encoder but rather a sample from the latent distribution. See figure 2 for a visual representation of a VED network. Training a VED network requires determining network parameters θ_e and θ_d that minimize some objective or loss. However, contrary to DNNs and EDs, for VEDs there are two main components to consider. First, we wish to reconstruct an output matching a target by minimizing a training loss, e.g. for a sample (b_j, x_j),
L_train(θ_e, θ_d) = ∥d(z; θ_d) − x_j∥_2^2, with z ∼ N(µ_e, diag(σ_e)²) and [µ_e; σ_e] = e(b_j; θ_e).
Second, we aim to drive the latent space distribution toward the isotropic Gaussian (standard Gaussian with zero mean and unit variance), which can be obtained by minimizing a similarity loss between distributions. For example, we can define the similarity loss for the VED network as
L_sim(θ_e) = D_kl(N(µ_e, diag(σ_e)²) ∥ N(0, I)),
where D_kl is the Kullback-Leibler (KL) divergence or KL loss [21], and the objective is to find encoder and decoder parameters θ_e and θ_d that minimize the total loss [20],
L(θ_e, θ_d) = L_train(θ_e, θ_d) + L_sim(θ_e). (15)
Variational autoencoders are special cases of VED networks, and they have been used in various contexts for solving inverse problems, especially in imaging [22-24]. Variational autoencoders were also used in the context of uncertainty quantification in [25], where they were used to learn the parameter-to-observable map as well as the inverse problem solver. However, since the input and output dimensions must be the same for variational autoencoders and adversarial autoencoders, these approaches are too restrictive to be considered for goal-oriented inverse problems.
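A minimal PyTorch sketch of a VED of this form is given below: the encoder outputs the latent mean and log-variance, a latent sample is drawn via the reparameterization trick, and the loss combines a reconstruction term with the KL similarity term as in (15). The architecture and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VED(nn.Module):
    """Sketch of a variational encoder-decoder: the encoder outputs (mu_e, log sigma_e^2)
    of the latent Gaussian; the decoder maps a latent sample z to the target."""
    def __init__(self, m=64, ell=8, q=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(m, 32), nn.ReLU(), nn.Linear(32, 2 * ell))
        self.dec = nn.Sequential(nn.Linear(ell, 32), nn.ReLU(), nn.Linear(32, q))
        self.ell = ell

    def forward(self, b):
        mu_e, logvar_e = self.enc(b).chunk(2, dim=-1)
        z = mu_e + torch.exp(0.5 * logvar_e) * torch.randn_like(mu_e)  # reparameterization trick
        return self.dec(z), mu_e, logvar_e

def ved_loss(x_hat, x, mu_e, logvar_e):
    # reconstruction (training) loss plus KL similarity loss to N(0, I)
    rec = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar_e - mu_e.pow(2) - logvar_e.exp())
    return rec + kl
```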
Another class of methods that combine learning from data with computational inverse problems is simulation-based inference [26]. In these approaches, neural networks, which predict unknown parameters, are trained from a collection of simulated data. To incorporate uncertainties, deterministic neural networks are augmented with additional stochastic layers, e.g. see [27-29]. Although these methods show impressive performance in inferring model parameters, their evaluation of the predictive aspects of these parameters, e.g. the QoI, is lacking. This is particularly concerning for goal-oriented inverse problems, where the inverse map may be severely ill-conditioned.

Variational encoder-decoder (VED) networks for UQ of the QoI
In this section, we describe an approach that builds a distribution on the QoI using a generative algorithm, which can be used to approximate the posterior predictive (5), where the distributions and mappings are encoded in the training data and its corresponding trained network. We establish a relationship between the random variables B and X. The maps in this graph are approximated using a VED network. In section 3.1, we argue how the relation between B and X results in a loss function similar to the one expressed in (15). We describe how training data can be used to estimate network parameters. Once the VED is trained and we have a new observation, we describe in section 3.2 a sampling approach to perform UQ for the QoI.
Recall that the density function for the QoI π(x) is, in general, unknown. We aim to approximate this distribution, as well as the density of the posterior predictive π_pred(x | b). In this approach, we assume there exists a latent variable with a convenient density function, e.g. a Gaussian density function, and consider a map d : R^ℓ → R^q such that x = d(z). The benefit of such a map [30] is that it reduces the task of exploring an unknown and complex density function, e.g. π(x), to exploring a computationally convenient density function. However, the complexity of π(x) is still encoded in the nonlinear map d.
We now relate π(z) to the observation density π(b) by introducing another map ē : R^m → R^ℓ, such that z = ē(b). Note that, in this case, the density function π(b) is unknown. The composite map d ∘ ē establishes a connection between B and X. We represent the connection between the random variables B, Z, and X in the graphical model in figure 3. Now we check if the latent variable Z introduces a bias in the posterior predictive. We first construct an expression for the joint distribution,
π(x, z, b) = π(z | x) π(x | b) π(b). (16)
Now we evaluate the bias in the posterior predictive in the presence of the latent variable z,
π(x | b, z) = π(x, z, b)/π(z, b) = π(z | x) π(x | b) π(b)/π(z, b),
where we used (16) in the last equality. Writing π(z, b) = π(z | b) π(b), i.e. using the definition of a conditional distribution, and cancelling π(b) in the numerator and the denominator yields
π(x | b, z) = [π(z | x)/π(z | b)] π_pred(x | b). (19)
Note that the right-hand side contains the posterior predictive π_pred(x | b), and the ratio π(z | x)/π(z | b) does not equal one for general maps d and ē. However, we require d and ē to satisfy the relation
π(z | x) = π(z | b). (20)
In this case, the ratio in (19) is one, and π(x | b, z) coincides with the posterior predictive.

Learning VED maps
In what follows, we discuss how to use training data to approximate the maps ē and d using VEDs. Assume that training data {(b_j, x_j)}_{j=1}^J, containing observation and QoI pairs, are provided. Let e : R^m × R^{p_e} → R^{2ℓ} and d : R^ℓ × R^{p_d} → R^{2q} be an encoder and a decoder network, respectively.
In figure 4, we provide a visual representation of the VED network used for UQ of the QoI. The main distinction from the VED network in figure 2 is that, in addition to enforcing a distribution on the latent variables, we assume a distribution on the target space so that we can perform UQ of the QoI.
Ideally, we would like to obtain the equality in (20) with a learned encoder e and decoder d. Hence, we aim to learn the parameterized encoder and decoder to minimize the distance between the distributions π^e(z | b) and π^d(z | x) in (20). We relax (20) to obtain the minimization problem
min over θ_e, θ_d of D_kl(π^e(z | b) ∥ π^d(z | x)). (21)
Here, π^e is a family of distributions that the encoder e defines on the latent space. Similarly, π^d is a family of posterior distributions that are defined on the latent space. Therefore, the minimization problem (21) looks for the best approximation to (20) within these families.
This approach is referred to as variational inference [13]. The superscript on a density function indicates the family to which the density function belongs. We now expand (21) to obtain
D_kl(π^e(z | b) ∥ π^d(z | x)) = E_{π^e(z|b)}[log π^e(z | b) − log π^d(z | x)]
= E_{π^e(z|b)}[log π^e(z | b) − log π^d(x, z)] + log π(x)
= log π(x) − L(θ_e, θ_d).
Here we used the definition of a conditional density in the second line, and the logarithm product rule in the third line. Moreover, L(θ_e, θ_d) is called the evidence lower bound (ELBO) [13] and is defined as
L(θ_e, θ_d) := E_{π^e(z|b)}[log π^d(x, z) − log π^e(z | b)]. (23)
Note that this is a lower bound on log(π(x)) since the KL divergence is non-negative. Following [20,31], we want to differentiate and optimize the lower bound L(θ_e, θ_d), thus leading to
max over θ_e, θ_d of L(θ_e, θ_d). (24)
We now apply Bayes' theorem to (23) to obtain
L(θ_e, θ_d) = E_{π^e(z|b)}[log π^d(x | z)] − D_kl(π^e(z | b) ∥ π(z)). (25)
The first term in the right-hand side of (25) is interpreted as computing an expectation with respect to the density function π^e(z | b). Note that the KL divergence in (21) is with respect to the posterior density π^d(z | x) while the one in (25) is with respect to the prior density π(z). Similar derivations, in the context of VAEs, can be found in [31-33] and references therein.
Computing the ELBO, as expressed in (25), still poses challenges, e.g. the density functions π^e(z | b) and π^d(z | x) are intractable for nonlinear maps e and d. We consider the following approximations to simplify the computation of the ELBO.
(a) The prior density π(z) is a standard normal density, i.e. following the distribution N(0, I), with I the identity matrix of size ℓ.
(b) We assume that π^e(z | b) is a product of independent Gaussian densities, i.e.
π^e(z | b) = (2π)^{−ℓ/2} |D_e(b)²|^{−1/2} exp(−½ ∥D_e(b)^{−1}(z − µ_e(b))∥²),
where D_e(b) = diag(σ_e(b)) is a diagonal matrix and | · | denotes the determinant.
(c) We assume that π^d(x | z) is a product of independent Gaussian densities of the form
π^d(x | z) = (2π)^{−q/2} |D_d(z)²|^{−1/2} exp(−½ ∥D_d(z)^{−1}(x − µ_d(z))∥²),
where D_d(z) = diag(σ_d(z)) and [µ_d(z); σ_d(z)] = d(z; θ_d).
Combining approximations (a) and (b), the KL divergence in (25) can be written as
D_kl(π^e(z | b) ∥ π(z)) = ½ (∥µ_e(b)∥² + trace(D_e(b)²) − ℓ − log |D_e(b)²|).
Then, with approximation (c), the ELBO simplifies to
L_j(θ_e, θ_d) ≈ (1/L) Σ_{l=1}^L [ −½ ∥D_d(z_l)^{−1}(x_j − µ_d(z_l))∥² − ½ log |D_d(z_l)²| ] − D_kl(π^e(z | b_j) ∥ π(z)), with z_l ∼ N(µ_e, diag(σ_e)²). (29)
Note that this is for one sample, so the overall goal is
max over θ_e, θ_d of Σ_{j=1}^J L_j(θ_e, θ_d), (30)
where L_j(θ_e, θ_d) is defined in (29) for (b_j, x_j). A standard optimization algorithm such as ADAM can be used to solve (30). See algorithm 1 for a general outline:
1: input training data {(b_j, x_j)}_{j=1}^J
2: initialize network parameters θ_e, θ_d
3: for each epoch do
4: for j = 1, . . ., L (loop over training samples, j = 1, . . ., J) do
5: Evaluate encoder [µ_e; σ_e] = e(b_j; θ_e)
6: Generate L samples from the latent space z_l ∼ N(µ_e, diag(σ_e)²)
7: for l = 1, . . ., L (loop over samples from latent space) do
8: Evaluate decoder d(z_l; θ_d)
9: end for
10: Compute loss L_j and gradient ∇L_j
11: end for
12: Update θ_e and θ_d
13: end for
14: output θ_e and θ_d
We remark that in practice the loss function of the VED network must be optimized up to a tolerance η > 0. The interpretation of η, when the loss function is optimized with respect to the ℓ_2-norm, is that the output of the VED, i.e. x, has a Gaussian distribution of variance η, i.e. D_d(z) = ηI in (29). The tolerance is a hyperparameter of the model and can be inferred in the learning process. In the case where η is constant and known, the objective function (29) reduces (up to additive constants) to
L_j(θ_e, θ_d) ≈ −(1/(2η)) (1/L) Σ_{l=1}^L ∥x_j − µ_d(z_l)∥² − D_kl(π^e(z | b_j) ∥ π(z)). (31)
A good estimate for η can significantly reduce the complexity of optimization.
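The following illustrative training loop follows the outline of algorithm 1, reusing the VED class and ved_loss function from the sketch above with a single latent sample per training point (L = 1); the synthetic data, epoch count, and learning rate are assumptions made only for this sketch.

```python
import torch

model = VED(m=64, ell=8, q=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# synthetic stand-in training pairs (b_j, x_j); in practice these come from the application
B = torch.randn(256, 64)
X = torch.randn(256, 4)

for epoch in range(100):                       # loop over epochs
    for j in range(B.shape[0]):                # loop over training samples
        b_j, x_j = B[j:j+1], X[j:j+1]
        x_hat, mu_e, logvar_e = model(b_j)     # draw z ~ N(mu_e, diag(sigma_e)^2), evaluate decoder
        loss = ved_loss(x_hat, x_j, mu_e, logvar_e)   # reconstruction + KL terms
        opt.zero_grad(); loss.backward(); opt.step()
```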

Quantifying uncertainties in the QoI
In this section, we discuss how to estimate and perform UQ on the QoI x for a trained VED network, given an unobserved vector b. In the context of goal-oriented inverse problems, a common approach to estimate the QoI is to use point estimates, e.g. the mean or the maximum of the posterior predictive. However, to quantify uncertainties for goal-oriented inverse problems one may need to compute the higher central moments of the posterior predictive, which in this case is computationally prohibitive. Hence, we first relate the outputs of the trained VED with the posterior predictive, describe how to sample from the posterior, and then utilize generated samples to estimate uncertainties for the QoI.
The following result relates the posterior predictive with the output of the VED network. Let B, Z, and X denote random variables with corresponding probability densities π(b), π(z), and π(x). The posterior predictive is related to the output of the VED according to
π_pred(x | b) = E_{z|b}[π(x | z, b)]. (32)
To see this, notice that
π(x | b) = ∫ π(x, z | b) dz = ∫ π(x | z, b) π(z | b) dz = E_{z|b}[π(x | z, b)]. (33)
Here, we used the definition of marginalization in the first equality and the conditional probability in the second equality. We refer to E_{z|b}[π(x | z, b)] as the VED posterior predictive distribution.
In the following, we assume that Z parameterizes X such that X and Z are indistinguishable to B, i.e.
π(b | x, z) = π(b | x). (34)
In other words, the parameterization of the prior distribution does not affect the data generation process. To elaborate on this assumption, we look at (34) in terms of parameterization of the prior in the Bayesian setting and use Z to parameterize X. It is natural to assume that the parameterization of X is independent of the data generation process, i.e. π(b | x, z) = π(b | x). Furthermore, we assume that the parameterization of X is exact. One way to achieve this is to assume that there is a mapping d : z → x such that π(x | z) = δ_{d(z)}(x), the Dirac delta function. This yields
π(x, z, b) = π(b | x, z) π(x | z) π(z) = π(b | x) δ_{d(z)}(x) π(z).
Here, we used the additional assumption in the second equality. Notice that the right-hand side only exists, as a distribution, when x = d(z), yielding π(b | x, z) = π(b | z). Note that these conditions are necessary for an unbiased posterior predictive. Using the additional assumption π(b | z) = π(b | x, z), we can further simplify (33). Let us first reformulate the joint distribution π(x, z, b) as
π(x, z, b) = π(b | z) δ_{d(z)}(x) π(z).
Now we have
π(x | z, b) = π(x, z, b)/π(z, b) = δ_{d(z)}(x).
This simplifies (33) to
π_pred(x | b) = E_{z|b}[δ_{d(z)}(x)]. (38)
The expectation in the right-hand side of (32) and (38) is with respect to z. Therefore, the VED posterior predictive is a probability distribution with respect to x. Now we discuss how to generate samples from the VED posterior predictive and perform UQ, using the VED network. Suppose that an observation vector b is provided. We then form the random variable Z | B, which has a Gaussian distribution N(µ_e, diag(σ_e)²). We generate samples {z_l}, l = 1, . . ., L, from this distribution. For each of these samples from the latent space, we evaluate the decoder d(z_l; θ_d) and sample from the Gaussian distribution N(µ_d(z_l), diag(σ_d(z_l))²); see algorithm 2.
We provide some remarks on algorithm 2. We might require a large number of Monte Carlo samples x_l^k to accurately approximate the distribution of the QoI. However, these samples can be obtained efficiently by simple forward propagation of the decoder, which can all be performed in parallel. Further recall that σ_d in algorithm 2 represents the accuracy of the VED network and in most applications diag(σ_d)² = ηI. Therefore, we may choose K = 1 when η is small.
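A sketch of the sampling procedure of algorithm 2 for a trained network is given below; it assumes the VED sketch from section 2, where the decoder returns only the mean µ_d and the output standard deviation is a constant determined by η, so the inner loop over K simply adds Gaussian noise of variance η.

```python
import torch

@torch.no_grad()
def sample_qoi(model, b, L=1000, K=1, eta=0.1):
    """Draw samples from the VED posterior predictive for one observation b.
    Assumes model.enc returns (mu_e, log sigma_e^2) stacked, model.dec returns mu_d,
    and the output variance is the constant eta (all assumptions of this sketch)."""
    mu_e, logvar_e = model.enc(b).chunk(2, dim=-1)
    sigma_e = torch.exp(0.5 * logvar_e)
    samples = []
    for _ in range(L):
        z = mu_e + sigma_e * torch.randn_like(mu_e)      # z_l ~ N(mu_e, diag(sigma_e)^2)
        mu_d = model.dec(z)                               # decoder mean for this latent sample
        for _ in range(K):
            samples.append(mu_d + eta**0.5 * torch.randn_like(mu_d))
    return torch.stack(samples)   # empirical mean/variance of these samples describe the QoI
```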

Numerical results
In this section, we consider two examples from inverse problems, where the aim is to perform UQ for some quantities of interest. The QoI is different in each example. In section 4.1, the goal is to perform UQ on the regularization parameter required to compute tomography reconstructions using total variation regularization. In section 4.2, we consider a hydraulic tomography example, where the QoI represents expansion coefficients defining a piecewise constant conductivity field. In both instances, we sample from the posterior predictive and bypass the intermediate reconstructions (i.e. the tomography reconstruction and the conductivity field reconstruction, respectively).

Experiment 1: x-ray computed tomography (CT)
X-ray CT is an imaging technique that uses a combination of x-ray exposure and computational procedures to construct cross-sectional images of an object. When an object is exposed to an x-ray, its material attenuates the radiation, where the amount of attenuation depends on the material's properties. The interaction between the object and the x-ray can be modeled as a line integral and is referred to as the Beer-Lambert law [34]. The transformation that takes an object (an attenuation field) into its line integrals is mathematically modeled as a linear transformation and is referred to as the Radon transform [34]. The transformed function is referred to as a sinogram. The CT reconstruction aims to infer the attenuation field from its Radon transform; however, this inversion process is ill-posed, and stabilization through regularization is required.
Let y ∈ R^n be a vectorized two-dimensional discretized and unknown attenuation field and let b ∈ R^m be the noisy sinogram. Consider the model given in (1), where F(y) denotes the discrete Radon transform of y. The corresponding variational inverse problem can be written as a deterministic minimization problem
y(x) = argmin over y of ½ ∥F(y) − b∥_2^2 + x ∥Dy∥_1, (39)
where the first term represents the data fidelity and the second term is the regularization term.
Here we promote piecewise constant solutions in y by utilizing a 1-norm and a first-order two-dimensional finite difference operator D [35,36], representing a total variation regularizer. The balance between the two terms in this minimization problem is established by the regularization parameter x > 0.
The regularization parameter is a priori unknown and choosing an appropriate x can be a challenging task [37-41]. Suppose that a true attenuation field y_true is available; then one way to select x is to solve a bilevel optimization problem
min over x > 0 of ∥y(x) − y_true∥_2^2, where y(x) is given in (39). (40)
The optimization problem in (39) is regarded as the inner minimization problem with a fixed regularization parameter x. The minimization problem in (40) is referred to as the design problem, which seeks the optimal regularization parameter. This bilevel optimization problem (40) is, in general, non-convex and numerically challenging to solve. It requires the repeated evaluation of subproblem (39), which in turn uses numerical solvers for computing y(x) to find an optimal x. Hence, numerical solvers may only produce approximations to x. We use the alternating direction method of multipliers for the inner optimization problem and a Bayesian optimization method to solve the corresponding design problem [42]. In [10], it was shown that DNNs can be trained to represent the mapping b → x, but the framework did not allow for UQ. Furthermore, it was observed that uncertainties in x arise due to challenges in the optimization in (40) and in the numerical evaluation of its solution. Thus, it is possible that a numerical solution for (40) is far from the optimal regularization parameter.
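To illustrate the bilevel structure of (39) and (40), the following sketch replaces the TV-regularized inner problem with a simple Tikhonov-regularized stand-in that can be solved in closed form, and optimizes the design objective over x with a bounded scalar search rather than the ADMM/Bayesian-optimization setup used in the paper; all operators and data are synthetic assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n, m = 40, 60
A = rng.standard_normal((m, n))                 # stand-in forward operator
D = np.diff(np.eye(n), axis=0)                  # 1D first-difference operator (stand-in for 2D TV)
y_true = np.repeat([0.0, 1.0, 0.3, 0.8], n // 4)  # piecewise constant ground truth
b = A @ y_true + 0.01 * rng.standard_normal(m)

def inner_solution(x):
    # Tikhonov stand-in for (39): y(x) = argmin 1/2 ||A y - b||^2 + x ||D y||^2
    return np.linalg.solve(A.T @ A + x * D.T @ D, A.T @ b)

def design_objective(log_x):
    # outer (design) problem (40): misfit of y(x) to the known y_true
    return np.linalg.norm(inner_solution(np.exp(log_x)) - y_true)

res = minimize_scalar(design_objective, bounds=(-8, 4), method="bounded")
print("selected regularization parameter:", np.exp(res.x))
```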

Description of training.
In this work, we train a VED network to represent the mapping b → x. Hence, in addition to obtaining a VED-predicted regularization parameter, we can efficiently draw samples from the VED posterior predictive, and these samples allow us to quantify the uncertainty in the predicted parameter. In this context, given y_true, the prediction operator G corresponds to solving the bilevel minimization (40) to find the optimal regularization parameter, i.e. x = G(y_true) is a numerically evaluated optimal regularization parameter. For the training data, we generate J = 24 000 images y_true^j, j = 1, . . ., J, as randomized Shepp-Logan phantoms [43,44]. Then the training data are given by the pairs (F(y_true^j) + e_j, G(y_true^j)). For simplicity, we assume F(y) = Ay, where A is a matrix representing parallel beam tomography (with 181 parallel rays over 180 equidistant angles). The noisy sinogram is then given by
b_j = A y_true^j + e_j. (41)
The noise vector e_j in (41) is chosen to be white noise with a varying noise level that is selected uniformly at random between 0.1% and 5%. A noise level of r% corresponds to a noise vector e_j satisfying ∥e_j∥_2 / ∥A y_true^j∥_2 = r/100. The architecture of the VED network is illustrated in figure 5. The encoder network is comprised of convolutional layers and ReLU layers [45]. The dimension of the latent space is set to ℓ = 128. The decoder network is a feed-forward fully-connected ReLU network with one hidden layer. This architecture is chosen to exploit local correlations in the intensities in a sinogram, but other architectures can be considered. We observe that other network architectures result in similar results (data not shown). The VED is trained by minimizing the cost function (31), with η = 0.1, using the ADAM optimization method [46]. The optimization process is stopped after 10^4 epochs. The learning rate is adjusted dynamically within a prescribed interval.

Evaluation of VED.
To test the performance and generalizability of the learned VED network, we consider three datasets outside the training dataset. First, we consider an example from the test data. We generate a sinogram from a randomized Shepp-Logan phantom and add noise. The true image and sinogram are provided in figures 6(a) and (b), respectively. Using the approach described in section 3.2, we generate 10^3 samples x_j from the VED posterior predictive. The distribution of these samples is provided in figure 6(c). Notice that the VED predicted values for the regularization parameter are within the interval [0.2, 0.7], with many predicted values being close to the optimal regularization parameter x = G(y_true). To visualize the corresponding uncertainties in the reconstructed images, for each of the samples of the regularization parameter, we solve (39) to get a reconstruction y(x_j). The mean reconstruction image, an image of the pixel-wise variances (in an inverted colormap so that white corresponds to small variances and black corresponds to larger variances), and the distribution of relative reconstruction errors are provided in figures 6(d)-(f), respectively. Similar to observations made in [5,47], we observe that large uncertainties occur at the locations of discontinuities in the image. Furthermore, as expected, the relative reconstruction errors for samples from the posterior predictive are larger than the relative reconstruction error corresponding to the optimal regularization parameter (denoted by the red dotted vertical line), but the distribution of relative reconstruction errors is tight.
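A sketch of the sample-propagation step behind figures 6(d)-(f) is given below: each sampled regularization parameter is pushed through a reconstruction of the form (39), and pixel-wise means, variances, and relative errors are computed from the resulting images. The callable reconstruct(b, x) is an assumed placeholder for a TV-regularized CT solver and is not provided here.

```python
import numpy as np

def uq_reconstructions(samples_x, b, reconstruct, y_true=None):
    """Propagate VED samples of the regularization parameter into reconstructions.
    `reconstruct(b, x)` is an assumed callable solving (39) for a given parameter x."""
    recs = np.stack([reconstruct(b, x) for x in samples_x])
    mean_img = recs.mean(axis=0)       # mean reconstruction image
    var_img = recs.var(axis=0)         # pixel-wise variances
    rel_err = None
    if y_true is not None:             # relative reconstruction errors per sample
        rel_err = [np.linalg.norm(r - y_true) / np.linalg.norm(y_true) for r in recs]
    return mean_img, var_img, rel_err
```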
For the second out-of-sample dataset, we generate a sinogram from a randomized Shepp-Logan phantom, but we modify the projection angles. That is, we use a forward model matrix A where the 180 projection angles are no longer equidistant. We add white noise to the vector of projection angles with a standard deviation of 0.1. The sinogram is propagated through the trained VED network, and we draw 10^3 samples of the regularization parameter from the VED posterior predictive π_pred(x | b). The true image, the sinogram corresponding to modified angles, and the distribution of the regularization parameter samples are provided in figures 7(a)-(c), respectively. Although the QoI x is slightly larger than many of the predicted values, the range is fairly small. The mean reconstruction, pixel-wise variances for the reconstruction, and the distribution of relative reconstruction errors are provided in figures 7(d)-(f), respectively.
For the third out-of-sample experiment, we use data for the tomographic reconstruction of a walnut given in [48]. The walnut image, which was computed using a CT scan but that we take as ground truth, is provided in figure 8(a) along with a noisy sinogram in figure 8(b) that was generated as in (41). Similar to the previous experiment, we draw 10^3 samples of the regularization parameter from the VED posterior predictive π_pred(x | b). For this example, we observe that the trained VED network provides a wide range of values for the predicted regularization parameter, but none of the samples is too far from the QoI x. The mean reconstruction and pixel-wise variances are provided in figures 8(d) and (e), respectively, along with the distribution of relative reconstruction errors in figure 8(f). Again, we see that large uncertainties are present at the locations of the discontinuities in the image, and the relative reconstruction error norms for the sampled regularization parameters are close to that of the optimal regularization parameter for this problem.

Experiment 2: nonlinear hydraulic tomography
Next, we consider an example where the goal is to detect the locations of discontinuities in reconstructed images. A typical approach (e.g. used in CT) is to obtain an image reconstruction using standard algorithms, followed by post-processing, e.g. with image segmentation. We consider an approach that avoids the intermediate step and goes directly from the observation to the discontinuities [4,5]. Although the QoI could represent locations of the discontinuities, instead we consider a nonlinear hydraulic tomography problem that is adopted from [49-51] and define the QoI to be coefficients representing a parameterization of the locations of the discontinuities. Then, we use VEDs to obtain uncertainties for the coefficients.
Consider a confined aquifer that is modeled using an elliptic partial differential equation (PDE),
−∇ · (y(ξ) ∇u(ξ)) = q_k δ(ξ − ξ_k^well), ξ ∈ D, (42)
where D ⊂ R² is the unit square domain, u : R² → R is the hydraulic head, y : R² → R_+ is the hydraulic conductivity, {ξ_k^well}, k = 1, . . ., N_well, are locations for N_well wells, and q_k is the pumping rate at the k-th well. We consider a zero-head (homogeneous Dirichlet) boundary condition on the left, right, and bottom boundaries of D. Furthermore, we consider a no-flux (homogeneous Neumann) boundary condition on the top boundary.
We begin by describing the forward process. Assume we have the domain D and the true hydraulic conductivity. Consider placing N_well = 20 wells in the domain D located at coordinates (ξ_1, ξ_2) where ξ_1 ∈ {1/4, 1/2, 3/4, 1} and ξ_2 ∈ {1/5, 2/5, 3/5, 4/5, 1}. At the k-th well, we perform an injection at the rate of q_k. We measure the head at the rest of the N_well − 1 wells, and this process is repeated for all wells. That is, let B_true be the matrix with entries
(B_true)_{k,i} = u_y^k(ξ_i^well), i ≠ k, (43)
where u_y^k is the solution to (42) when the conductivity is y and the injection is at the k-th well. Thus, the parameter-to-observation map F in (1) corresponds to the map y → b_true, where b_true is a vectorized version of the matrix B_true containing the elements (43). We consider a noise level of 1%, i.e. b = b_true + e where e is a realization of a mean-zero Gaussian multivariate random variable with covariance matrix σ_n² I, with σ_n set according to the 1% noise level. Note that b is of size N_well(N_well − 1). In this example, we have 380 measurements corresponding to 20 injections and 19 measurements per injection. Given measurements b, the aim of the hydraulic tomography inverse problem is to reconstruct the hydraulic conductivity.
To formulate this problem in a Bayesian setting, we introduce a prior distribution for the hydraulic conductivity y. We choose a Gaussian level-set prior [52] for y, i.e. we first introduce a Gaussian random function v : D → R distributed according to N(0, C_v), where C_v is a trace-class symmetric and positive-definite linear operator [53]. For this experiment, we choose C_v = (ϵ_cl² I + ∆)^{−1}, with ϵ_cl = 1/0.1, and I and ∆ the identity and the Laplacian operators, respectively. We then define y to be the push-forward random function through the relation
y(ξ) = a_+ H(v(ξ)) + a_− (1 − H(v(ξ))), (45)
where H is the Heaviside step function, and a_+, a_− > 0 are positive constants. Such a prior promotes a piecewise constant conductivity field y. Now we introduce the goal-oriented Bayesian setting. We can achieve a low-dimensional representation for y by truncating the Karhunen-Loève expansion [53],
v ≈ Σ_{i ⩽ N_goal} √(g_i) X_i v_i, (46)
where {g_i}_{i ⩽ N_goal} are the largest N_goal eigenvalues of C_v, {v_i}_{i ⩽ N_goal} are the associated eigenfunctions, and {X_i}_{i ⩽ N_goal} are independent and standard normal random variables. We define the goal of the inverse problem to be the coefficients X_i in (46). We discretize the forward operator, the functions y and v, and the linear operator C_v using a finite element method on a structured mesh with N_FEM = 10 201 degrees of freedom (on a grid of approximate size 100 × 100). Let N_goal = 32; then the QoI are parameters x ∈ R^32, and the prediction operator G in (1) corresponds to the map y → x.
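The following sketch draws one conductivity sample from the level-set prior on a coarse uniform grid: it builds a finite-difference stand-in for the precision operator ϵ_cl² I + ∆ (the paper uses a FEM discretization), computes the leading Karhunen-Loève modes of C_v, forms v as in (46), and applies the level-set map (45). The grid size and discretization details are assumptions of this sketch.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(3)
N, eps_cl, N_goal = 50, 1.0 / 0.1, 32           # 50 x 50 grid (coarser than the paper's FEM mesh)
a_plus, a_minus = 10.0, 1.0

# finite-difference (negative) Laplacian on the grid, a stand-in for the FEM operator
h = 1.0 / (N - 1)
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(N, N)) / h**2
L = sp.kron(sp.eye(N), T) + sp.kron(T, sp.eye(N))

# precision operator eps_cl^2 I + Laplacian; its smallest eigenpairs give the leading modes of C_v
P = (eps_cl**2) * sp.eye(N * N) + L
lam, V = eigsh(P.tocsc(), k=N_goal, sigma=0.0)   # shift-invert: smallest eigenvalues of P
g = 1.0 / lam                                    # largest eigenvalues of C_v = P^{-1}

x = rng.standard_normal(N_goal)                  # QoI: standard normal expansion coefficients (46)
v = (V * np.sqrt(g)) @ x                         # truncated Karhunen-Loeve expansion of v
y = np.where(v >= 0.0, a_plus, a_minus).reshape(N, N)   # level-set map (45)
```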
We remark that after discretizing v and v_i using a FEM method, we obtain vectors v and v_i which represent the expansion coefficients of v and v_i in the FEM basis. In this case, v becomes a mean-zero multivariate Gaussian random variable of size N_FEM with the covariance matrix obtained from the discretization of C_v.

Description of training.
We collect a dataset of pairs {(b_j, x_j)}_{j=1}^J, with J = 10^4, where the x_j are realizations from the standard normal distribution and b_j has components described in (43). We then consider the VED network summarized in figure 9 for the hydraulic tomography problem. Note that in this example, we do not know the mean and the standard deviations of the target. We consider a Gaussian multi-layer perceptron output layer that estimates these values with 2 output layers. The cost function for this network can be constructed following (29), and the network is trained using the ADAM optimization algorithm [46].
The learning rate is scaled by a factor of 0.9 on a plateau of the loss function, with a patience of 100 iterations. This adjustment is performed only when the learning rate is greater than 10^{−4}.

Evaluation of VED.
We evaluate the performance of the trained VED network for an out-of-sample test problem of finding the hydraulic conductivity field in figure 10(a), which is constructed by placing disks centered at coordinates (0.3, 0.65) and (0.25, 0.65) with radii 0.15 and 0.25 and conductivity values set to a_+ = 10 and a_− = 1, inside and outside of the disk inclusions, respectively. Note that this conductivity field does not belong to the training dataset and is not generated from the prior distribution (46) and (45). A measurement vector b is generated as in (43), and we follow algorithm 2 to draw samples from the VED posterior predictive distribution. In figure 10(b), we provide the mean of the posterior predictive, i.e. the 32 expansion coefficients in x using the VED (in green), along with the 99% credibility intervals.
We compare the performance of the VED method with a Markov chain Monte Carlo (MCMC) method, Gaussian process (GP) regression [54], and Bayesian neural networks (BNN) [55]. These are common uncertainty quantification tools. However, the MCMC and VED methods are the only approaches that exploit the inverse nature of the test case in this section.
MCMC methods approximate the posterior predictive with a sequence of samples {x_MCMC^j}_{j=1}^{N_MCMC} from the posterior predictive. MCMC methods allow for estimation of moments, e.g. E(f(X)) can be estimated using the ergodic sum
(1/N_MCMC) Σ_{j=1}^{N_MCMC} f(x_MCMC^j). (48)
We estimate the mean of the posterior predictive E(X) by taking f to be the identity map in (48). Similarly, we estimate the variance by taking f(X) = (X − E(X))². The preconditioned Crank-Nicolson (pCN) method [56] was used for this problem because of its suitability for expansions of the form (46) [56].
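A minimal sketch of a pCN sampler for the coefficients x with a standard normal prior is given below; the negative log-likelihood is passed in as a function, and the quadratic misfit used in the usage example is a toy stand-in since the PDE-based forward model is not reproduced here.

```python
import numpy as np

def pcn(neg_log_lik, dim, n_samples=20000, beta=0.1, rng=None):
    """Preconditioned Crank-Nicolson sampler for a standard normal prior on x."""
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(dim)
    phi = neg_log_lik(x)
    chain = np.empty((n_samples, dim))
    for i in range(n_samples):
        prop = np.sqrt(1.0 - beta**2) * x + beta * rng.standard_normal(dim)  # prior-preserving proposal
        phi_prop = neg_log_lik(prop)
        if np.log(rng.uniform()) < phi - phi_prop:   # accept with probability min(1, exp(phi - phi_prop))
            x, phi = prop, phi_prop
        chain[i] = x
    return chain

# toy quadratic misfit standing in for the hydraulic-tomography data misfit
x_star = np.full(32, 0.5)
chain = pcn(lambda x: 0.5 * np.sum((x - x_star) ** 2) / 0.1**2, dim=32, n_samples=5000)
post_mean = chain[1000:].mean(axis=0)      # discard burn-in, then estimate moments via (48)
```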
We use the GPy Python package [57] to construct a GP surrogate for the map b → x. The kernel for the GP model is constructed using radial basis functions. The parameters of this model are then optimized using the training dataset mentioned above.
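A minimal GPy sketch of such a surrogate with an RBF kernel is shown below for a single expansion coefficient; the data shapes and dimensions are illustrative assumptions.

```python
import numpy as np
import GPy

rng = np.random.default_rng(4)
B = rng.standard_normal((200, 10))          # stand-in training inputs (observations b_j)
X = rng.standard_normal((200, 1))           # stand-in training targets (one expansion coefficient)

kernel = GPy.kern.RBF(input_dim=B.shape[1])        # radial basis function kernel
gp = GPy.models.GPRegression(B, X, kernel)
gp.optimize()                                       # fit kernel and noise hyperparameters
mean, var = gp.predict(rng.standard_normal((1, 10)))  # predictive mean and variance for a new b
```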
To construct a BNN model, we designed a feed-forward fully-connected network architecture with 3 hidden layers of sizes 200, 100, and 50, respectively, using the BLiTZ Python package [58]. We then trained the parameters of this probabilistic network with a built-in variational estimator of BLiTZ and by minimizing the ELBO loss function for BNNs using the ADAM optimization algorithm [46]. BNNs are notoriously difficult to train, and we noticed that the BNN approach failed to draw meaningful predictions of the expansion coefficients or of their uncertainties. Hence, we omit further comparisons with the BNN model.
We provide means and 99% credibility/confidence intervals of the estimated expansion coefficients in figure 10(b). We notice a stronger correlation between the mean estimates of the VED and the MCMC methods, suggesting that the mean of the VED posterior predictive correctly approximates the mean of the actual posterior predictive. The GP model's estimates of the expansion coefficients are less accurate, particularly for the lower mode numbers.
The convergence of the sum in (48) for a general posterior distribution is an ongoing topic of research. One indication for convergence is the effective sample size (ESS) [59] of MCMC samples, which suggests how many of the MCMC samples can be regarded as independent samples. Larger ESS values indicate a better approximation in (48). In this example, we compute 2 × 10^5 samples from the posterior predictive using the pCN sampler and discard the first 2 × 10^4 samples as the burn-in period. Figure 11(a) shows the autocorrelation function [59] of the samples for the first 3 expansion coefficients in (46). We observe a similar behavior for the other coefficients. We report that it takes at most 6 × 10^4 samples for the autocorrelation function to reach zero. This suggests that the convergence of the MCMC is only achieved with a very large number of samples. In figure 11(b), we provide the ESS for each expansion coefficient. A reliable estimate of the variance of the posterior requires an ESS of around 100, so the ESS estimates in figure 11(b) suggest that the estimated moments of the posterior cannot be trusted, even with 2 × 10^5 samples computed for this test case.
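For reference, the following sketch estimates the autocorrelation function and the ESS of a single-coefficient chain; the simple truncation at the first negative autocorrelation lag is an assumption and not necessarily the estimator used in [59].

```python
import numpy as np

def autocorrelation(chain):
    """Normalized autocorrelation function of a 1D chain, computed via FFT with zero padding."""
    x = chain - chain.mean()
    n = len(x)
    f = np.fft.rfft(x, 2 * n)
    acf = np.fft.irfft(f * np.conj(f))[:n]
    return acf / acf[0]

def effective_sample_size(chain):
    """ESS = N / (1 + 2 * sum of autocorrelations), truncated at the first negative lag."""
    rho = autocorrelation(chain)[1:]
    cut = np.argmax(rho < 0) if np.any(rho < 0) else len(rho)
    return len(chain) / (1.0 + 2.0 * rho[:cut].sum())

# usage with the pCN chain from the sketch above (first expansion coefficient, after burn-in)
# ess = effective_sample_size(chain[1000:, 0])
```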
The bars in figure 10(b) indicate 3 times the length of the standard deviation for all the methods. The standard deviation is estimated from the samples of the posterior in the VED and the MCMC methods, and from the samples of the probabilistic model in the GP model. We emphasize that the MCMC estimates of the standard deviation cannot be trusted due to the low ESS. Regarding the VED estimates of the standard deviation, we see that the confidence in the estimates reduces for higher modes, i.e. larger indices. This behavior is also reported in other works, e.g. see [51,60]. In addition, since samples provided by the VED posterior predictive are independent samples, the estimated variance can be trusted, up to the accuracy of the VED approximation.
The GP model provides high uncertainty in the estimation of the expansion coefficients. GP models often provide a reliable estimate near training data points and high uncertainty far away from data points [54]. However, the number of data points required for training a reliable model, i.e. one with sharp confidence intervals, grows exponentially with the number of input dimensions. Therefore, the high level of uncertainty in the GP model in figure 10(b) suggests that the training dataset must be refined significantly.
Next, we provide some samples for visualization. In the first row of figure 12, we provide samples from the prior distribution that were obtained by mapping random samples of the parameter vector x_j onto the conductivity field y_j following (46) and then (45). Then in rows (b)-(d) of figure 12, we provide conductivity fields corresponding to samples from the VED posterior predictive distribution, MCMC samples from the posterior predictive distribution, and samples from the GP model, respectively. In terms of the conductivity field, we observe comparable samples between the VED and the MCMC methods.
To visualize the uncertainty in the QoI, one approach is to compute the variance of the posterior predictive for each expansion coefficient X_i in (46) (see figure 10(b)). Another approach is to compute the hydraulic conductivity for all samples from the posterior predictive distribution using the transformation (45), and then estimate means and variances in the conductivity space. These results are provided in figure 13. The VED and the MCMC methods correctly identify the two inclusions in the domain, while the GP method is unable to distinctively identify the two regions. The variance in all methods indicates a point-wise level of uncertainty in the estimation. The VED and the MCMC methods show high certainty away from the boundaries of the inclusions and low certainty at the location of the boundaries. The GP method provides high uncertainty near some of the boundaries; however, it also shows higher levels of uncertainty away from the boundaries. In addition, we see uncertainty near the boundary of the domain for all methods.
We compare the computational efficiency of the two methods by providing CPU times for 1000 samples and 10 independent samples from the posterior predictive in table 1. Experiments were carried out on a MacBook Pro with an Apple M1 Pro processor. Recall that the MCMC method does not require any offline computations, whereas the VED network requires training using the training dataset. We provide comparisons between the VED and the MCMC methods in 2 cases. The first case includes both the time for data generation and training of the VED network. Comparing the elapsed times to compute 1000 samples for the VED method and the MCMC method (in the first column of table 1), we notice that the VED approach provides significant speedup over the MCMC method. In the second case, we exclude the data generation time for the VED method. We provide the elapsed time to compute 1000 samples for this case in the second column of table 1. We notice that we achieve a slightly larger speedup, since the time for training the VED network is negligible compared to the time for data generation using the forward operator. This case is relevant since for any new vector b, the MCMC method requires the same computation time to draw 1000 samples, while a trained VED network can quickly perform online computations to draw samples. We remark that for UQ applications, it is the number of independent samples that indicates the efficiency of an MCMC method. Obtaining a large number of independent samples from posteriors for level-set priors remains a challenge [51,52,61]. The MCMC method used for this problem yields approximately 10 independent samples according to figure 11(b). Therefore, we compare its efficiency to that of obtaining 10 independent samples from the VED network in the last column of table 1. It is clear that the VED method provides a significant computational advantage. This is because samples of the VED network are independent. In addition, evaluating VED networks can be performed efficiently, e.g. on GPUs. Therefore, we expect similar efficiency for larger network architectures.

Conclusions
In this work, we use VED networks for the direct estimation of QoI and their uncertainties for inverse problems, given observations. Following a supervised learning approach, we use training data to learn the parameters of a VED network, where an encoder maps an observation to a distribution over the latent space and a decoder maps samples from the latent space to distributions over the QoI output space. By leveraging probability distributions in both the latent space and the posterior predictive, VED networks can be used to efficiently generate independent samples from the posterior predictive, thereby enabling efficient UQ for the QoI without the use of MCMC methods. The potential benefits of our approach for UQ of the QoI in goal-oriented inverse problems are demonstrated by the numerical experiments for two scenarios. In section 4.1, we demonstrate the use of VED networks for UQ of the regularization parameter for total variation regularized x-ray CT reconstruction, and in section 4.2, we demonstrate the ability of trained VEDs to estimate not only the locations of discontinuities but also their uncertainties in nonlinear hydraulic tomography reconstructions.

Figure 1 .
Figure 1. Encoder-decoder network mapping input vector b to output vector x. The encoder maps the input vector b to the latent variables z and the decoder maps the latent variables to the output x.

Figure 2 .
Figure 2. Variational encoder-decoder network mapping input vector b to target vector x. The encoder maps the input vector b to the latent distribution defined by mean µ_e and standard deviations σ_e. The latent variables z are obtained by sampling from the latent distribution, and the decoder maps the latent variables to the target x.

Figure 3 .
Figure 3. Graphical model representing the connection between the observation b, the latent variables z and the QoI x.

Figure 4 .
Figure 4. Variational encoder-decoder network mapping for UQ of the QoI for goal-oriented inverse problems. The encoder maps the input vector b to the latent distribution defined by mean µ_e and standard deviations σ_e. The latent variables z are obtained by sampling from the latent distribution, and the decoder maps the latent variables to the target distribution defined by mean µ_d and standard deviations σ_d. Samples of the QoI x are obtained from the target distribution.

Figure 5 .
Figure 5. Architecture of the VED network for the x-ray CT problem, where the goal is to perform UQ on the regularization parameter. On the left is the encoder network that maps the input sinogram to the latent space (defined by mean µ_e and standard deviations σ_e). On the right is the decoder network that maps samples from the latent distribution to the output regularization parameter (defined by mean µ_d).

Figure 6 .
Figure 6. The trained VED network from the x-ray CT experiment is used to perform UQ on the regularization parameter for a test data sample. The true image is provided in (a) and the corresponding noisy sinogram is provided in (b). The distribution of the predicted regularization parameters x_j from π_pred(x | b) is provided in (c) with the QoI x provided for reference. The mean reconstruction and the image of pixel-wise uncertainties (in an inverted colormap so that large variances are denoted in black) are provided in (d) and (e), respectively. The plot in (f) contains the distribution of the relative reconstruction errors.

Figure 7 .
Figure 7. UQ on the regularization parameter via a trained VED network for x-ray CT reconstruction with modified angles. The true image y_true is provided in (a) and the noisy sinogram b corresponding to modified angles is in (b). The distribution of the predicted regularization parameters x_j from π_pred(x | b) is provided in (c) with the QoI x provided for reference. Figures (d) and (e) contain the mean reconstruction and the image of pixel-wise uncertainties, respectively, and the plot in (f) contains the distribution of the relative reconstruction errors.

Figure 8 .
Figure 8. UQ on the regularization parameter via a trained VED network for an out-of-sample walnut image [48], where the true image and the noisy sinogram are provided in (a) and (b), respectively. The VED predicted regularization parameters x_j from π_pred(x | b) shown in (c) are not too far from the QoI x. Corresponding reconstructions are reasonable, as demonstrated by the mean reconstruction in (d), the image of pixel-wise uncertainties in (e), and the distribution of the relative reconstruction errors in (f). Reproduced from [48]. CC BY 4.0.

Figure 9 .
Figure 9. Architecture of the VED network for the hydraulic tomography problem, where the goal is to perform UQ on the parameters describing the discontinuities in the conductivity. On the left is the encoder network that maps the input, the hydraulic head at the location of the wells, to the latent space (defined by mean µ_e and standard deviations σ_e). On the right is the decoder network that maps samples from the latent distribution to the target discontinuity parameters (defined by mean µ_d and standard deviations σ_d).

Figure 10 .
Figure 10. Results for the nonlinear hydraulic tomography example. In (a), we provide the true hydraulic conductivity field for an out-of-sample test problem. In (b), we provide the estimated expansion coefficients and associated 99% credibility/confidence intervals (indicated by bars) using samples from the VED posterior predictive, MCMC, and GP models. The mode number corresponds to the index i for coefficient X_i.

Figure 11 .
Figure 11. For MCMC samples from the posterior predictive, we provide the autocorrelation function for the first three expansion coefficients in (a) and the effective sample size for each expansion coefficient in (b).

Figure 12 .
Figure 12. Row (a) contains samples from the prior distribution visualized on the conductivity field, with samples obtained by (46) and then (45). Row (b) contains samples from the VED posterior predictive distribution mapped onto the conductivity field. Row (c) contains MCMC samples from the posterior predictive distribution mapped onto the conductivity field. Row (d) contains GP samples mapped onto the conductivity field.

Figure 13 .
Figure 13. For samples from the VED posterior predictive, we provide the means and variances of the conductivity fields in rows (a) and (b), respectively. For comparison, we provide the means and variances of the conductivity fields for MCMC samples from the posterior predictive and for samples of the GP model.

Table 1 .
CPU times for sampling from the posterior predictive using MCMC versus using VED sampling. The speed-up in terms of obtaining independent samples is significant.