Machine learning based priors for Bayesian inversion in MR imaging

The Bayesian approach allows the incorporation of informative prior knowledge to effectively enable and improve the solution of inverse problems. Obtaining prior information in probabilistic terms is, however, a challenging task. Recently, machine learning has been applied for the training of generative models to facilitate the translation of historically or otherwise available data to a prior distribution. In this work, we apply this methodology to undersampled magnetic resonance imaging. In particular, we employ an autoencoder as part of a generative model to statistically regularise and solve the high-dimensional inverse problem using Bayesian inversion. Comparison with a classical Gaussian Markov random field prior is performed and numerical examples highlight the possible advantages of data-driven priors.


Introduction
Inverse problems are frequently encountered in metrology. Examples include applications such as scatterometry [1], magnetic resonance imaging (MRI) [2], magnetic resonance fingerprinting [3,4], cardiac perfusion [5], computed tomography [6], and spectroscopy [7], to mention just a few. Statistical modeling is an appropriate tool to tackle inverse problems and provides the basis for established procedures such as maximum likelihood estimation from classical statistics [8] or the application of Bayesian inference [9].
The Bayesian interpretation of probability relies on the concept of subjective probability or degree of belief, which is updated in the presence of fresh data using Bayes' theorem.
The Bayesian approach is well known and has been applied to many metrological applications such as scatterometry [10] or magnetic resonance fingerprinting [11]. The application of the Bayesian paradigm to inverse problems has the advantage that probability statements about the quantity of interest can be made conditional on the observations. Moreover, it allows prior knowledge to be employed [12][13][14]. This prior knowledge, in terms of a probability distribution, reflects one's state of knowledge regarding the involved quantities and provides a natural stabilisation of the inverse problem, addressing classical ill-posedness [15].
Generating or extracting meaningful prior knowledge is, however, challenging and an active field of research. Expert elicitation [16,17] can be employed to transform the knowledge about a parameter into a probability density function (PDF). It is also common to include expected properties of the solution. For example, spatial smoothness can be expressed by a Gaussian Markov random field (GMRF) [18] taken as a prior distribution (see applications such as cardiac perfusion [5], nano-spectroscopy [19] and magnetic resonance fingerprinting [11]). Another thriving field of research deals with the determination of prior knowledge from data. For instance, historical data can be statistically modeled as a time series, e.g. for data forecasting [20]. Other prominent examples are Gaussian processes [21], which can be turned into a prior for Bayesian inference [22] (see also data assimilation and recursive Bayesian inference in this context [23]).
In recent years, machine learning, and in particular deep learning, has attracted immense interest, and the number of applications capitalizing on it has grown rapidly. Today, deep learning marks the state of the art in several medical applications [24], autonomous driving [25], and speech recognition [26], to name but a few. Characteristic of machine learning approaches is that a solution to a task is learned from a typically large database of example solutions and that no physical model needs to be specified a priori. In deep learning, deep neural networks are taken as the model and their large number of parameters is determined in a training phase using the available database. The advantage of this modeling approach is its great flexibility and generality.
Machine learning has recently also been successfully applied to inverse problems. Most of the employed approaches learn an inverse mapping from a large database containing pairs of data and the corresponding solutions [27], or they are designed to further improve approximate solutions such as the Moore-Penrose inverse [28]. In some cases, deep learning has also been utilized to determine prior distributions in a Bayesian inversion through so-called generative models [29][30][31]. Typical approaches to generative models are variational autoencoders (VAEs) [32] and generative adversarial networks (GANs) [33]. The advantage of such procedures is that an available physical model for the inverse problem can be utilized to ensure consistency of the inferred solution with the observed data, while for the prior a highly flexible model is taken that reflects the actual characteristics of solutions to the inverse problem. Figure 1 illustrates the difference between conventional priors and generative models expressing the a priori knowledge about the distribution of a quantity of interest gained from a corresponding database. On the left, realisations of two default Gaussian priors are shown that represent white noise models (a) and GMRF models (b). The latter is a typical model to induce spatial smoothness on the obtained solutions of an inverse problem. While the GMRF prior induces spatial smoothness and acts as a proper regularisation approach, it does not reflect the typical structure of the solution, such as, for example, images of human brains. The flexibility of deep neural networks, on the other hand, allows a generative model to be trained in such a way that it truly reflects the expected characteristics of the solution. Therefore, one expects that a prior based on such a model will provide more prior information about the sought solution. To visualize this, we show in figure 1(c) samples from a generative model trained on typical images of human brains. For details on the sample generation, see appendix D. Employing an informative prior as shown in figure 1 serves to statistically regularise the underlying inverse problem. In the case of an ill-posed problem, as considered here, such a regularization is required to ensure the existence of a well-defined solution.
Using data-driven priors through generative models has already been proposed and discussed in [29]. Generative models in machine learning also allow for a dimension reduction through a latent space of much smaller dimension. This latent space is usually defined by one of the layers of the neural network that has a significantly smaller number of neurons than the input or output layer, cf figure 2(b). Most of the Bayesian inversion methods that use generative models exploit this dimension reduction property and tackle the inverse problem in terms of the reduced dimension [30]. This effectively reduces the dimensionality of the inverse problem, but the resulting posterior distribution in latent space needs to be transformed into the original space, e.g. to the space of brain images. At this point, the quality of the database plays a crucial role and the quality of the generative model determines the induced bias of the solution obtained. In [31] it has been shown that a formulation in the original space has statistical advantages. In particular, estimators derived from the Bayesian posterior are consistent.
In this paper, we propose to use the original space formulation for the Bayesian inversion, and we explore the potential benefits of using a data-driven prior based on an autoencoder (AE) through numerical examples. The numerical examples considered in this paper specifically address the task of MRI [2] using undersampled data. The process of MRI results in a quantity of interest, namely the transverse magnetization, that depends on the device conducting the measurements and its operational parameters, such as the strengths of its magnetic fields and the temporal sequence of high-frequency excitations. Conditioned on these factors, the goal of MRI can be viewed as the measurement of a well-defined quantitative measurand. This measurand is often taken as the starting point for the determination of other physical quantities, such as perfusion [34], volumes [35] or proton densities and T1 and T2 relaxation times [3]. The undersampling offers the opportunity to reduce acquisition times, e.g. in routine clinical practice, yet it leads to a severely ill-posed inverse problem. In figure 1, the inverse problem involved is visualized on the right: in (e) undersampled MR measurements are shown, with the desired ground truth human brain image in (d). Estimates resulting from the approaches considered and proposed in this work are shown in figures 1(f)-(h), with the advantage of the machine learning prior clearly visible in (h). We will discuss the different methods in the following sections.
The contribution of this work is twofold. First, we propose to apply Bayesian inversion using data-driven priors to the task of MR imaging from undersampled data; second, we develop an efficient and numerically feasible approach to compute Bayesian point estimates for this task using a novel data-driven prior motivated by adversarial regularisation [36]. The method developed here circumvents the challenging task of training a VAE by relying on a simple AE as the basis for the data-driven prior employed in our Bayesian inversion. For numerical illustration, we utilize the publicly available database [37] of human brain images to determine the AE model and to create realistic test cases.
The paper is organized as follows. Section 2 gives a brief introduction into generative models. In section 3.1, the Bayesian inversion using a prior determined by parts of a generative model is presented. Results of an application to MR imaging are then given in section 4, which also includes a comparison to a Bayesian inversion using a 'standard' prior. Finally, some conclusions and an outlook to further related research are given in section 5.

Generative models and AEs
Generative models are a well-known tool in statistics and are particularly useful in experimental design and decision making [38]. A generative model is obtained from a given database and employed to generate novel data that is not part of the original database but bears a close resemblance in its properties. Statistical modeling using, e.g. hierarchical models can be employed to model a sampling distribution from which the observed data are taken as one realisation. Another data-driven approach that is frequently applied is principal component analysis (PCA). Depending on the field of expertise, similar procedures are called proper orthogonal decomposition or Karhunen-Loève expansion [39]. These approaches compute a basis representation from the given database and are therefore limited by an inherent linear structure. By assigning a distribution to the coefficients of a PCA, new data can be generated by drawing PCA coefficients at random and taking them as weights in an expansion using the PCA components. Modern tools in machine learning and deep learning facilitate the use of highly non-linear and high-dimensional models. A famous example is the work in [40], where deep learning is employed to generate realistic, high-resolution human faces of people that do not exist. Typical representatives of generative models that are extensively studied in machine learning are GANs and VAEs. In the following we will focus on AEs, which are a deterministic alternative to VAEs. An introduction to VAEs can be found in appendix A.
AEs can be seen as a non-linear extension of a PCA where the model parameters are obtained by tools from deep learning. A sketch of an AE is shown in figure 2(b). An AE usually consists of two neural networks. Mapping from the space of the database, an encoder takes data and translates it to a latent space, which is usually of much smaller dimension. Subsequently, a decoder maps the latent code back to the space of the database. The aim of an AE is to approximate an identity that maps through a low-dimensional latent space (bottleneck). One assumption for a successful application of an AE (as well as of a VAE below) is that the distribution of training data lies approximately on a manifold of much lower dimension than the dimension of the training data itself, typically the number of pixels of an image. Similarly to PCA, an AE can be used for dimension reduction. The results shown in section 4 and figures 3 and 4 are based on an AE that has been trained by minimizing the loss function

L(θ, ϕ) = Σ_{x ∈ D} ∥x − f_θ(g_ϕ(x))∥²,    (1)

where ∥·∥ denotes the Euclidean norm and D comprises the actual training data, e.g. brain images. The output of the AE is given by

AE(x) = f_θ(g_ϕ(x)),

where g_ϕ is a deep neural network with parameters ϕ that constitutes the encoder and maps x to the latent space variable z = g_ϕ(x) of lower dimension, and f_θ is the decoder, given by another deep neural network with parameters θ, mapping the latent space variable z to the output. Encoder and decoder together form a single network of AE structure having a bottleneck in the middle.
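To make the training objective concrete, the following minimal sketch evaluates the loss (1) for a generic AE model. It is written in Python with PyTorch, whereas the experiments reported below were run in Matlab®; the names `model` and `batch` are illustrative assumptions.

```python
import torch

def ae_loss(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """Loss (1): summed squared Euclidean distance between each training
    image x and its reconstruction AE(x) = f_theta(g_phi(x))."""
    recon = model(batch)  # forward pass through encoder and decoder
    return ((batch - recon) ** 2).sum()
```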
In figure 2, we show in (a) examples from the training set of human brain tissue images. A sketch of the encoder and decoder networks that are trained by minimizing (1) over this database is shown in figure 2(b). After training, the resulting AE generates the images on the right of figure 2(c) when presented with the images in (a) as input.
In the following, we aim to employ an AE as part of a prior for the statistical regularisation of an ill-posed inverse problem. To this end, we briefly introduce the Bayesian paradigm next.

The statistical inverse problem in MRI
We consider inverse problems of the kind

Y = Fx + ε,   ε ∼ N(0, σ²I),    (2)

where the task is to estimate x ∈ R^p given data y ∈ R^n, modeled as a realization of the random variable Y following a Gaussian distribution with mean Fx and homoscedastic variance. The identity matrix of the appropriate dimension is denoted by I, and F : R^p → R^n denotes the forward operator. In the case of a linear forward operator, F is an n × p matrix. The sought solution x is typically high-dimensional and represents a discrete approximation to a function, such as the spatial distribution of the transverse magnetisation of nuclear spins within human tissue in the case of MRI in the brain [41]. The simple homoscedastic Gaussian model makes solving the inverse problem easier, and we will here also assume knowledge about the variance σ². These assumptions are reasonable for the considered application of MR imaging, see e.g. [41], where F stands for the Fourier transform and x comprises the real and imaginary part of the magnetisation of the brain tissue. A Bayesian inversion determines the posterior distribution π(x|y) for the unknown x given the data and one's prior knowledge about x. Technically, Bayes' theorem is employed according to

π(x|y) ∝ l(x; y) π(x),    (3)

where π(x) denotes the prior and l(x; y) the likelihood function determined through the statistical model (2). The posterior π(x|y) comprises one's complete knowledge about x after accounting for the information contained in the observed data y, and it allows one to make probability statements about x or to determine a credibility region which contains the unknown solution with a prescribed probability. In most cases the posterior needs to be computed numerically, which is challenging in high dimensions, particularly for Markov chain Monte Carlo procedures [42]. Approximate methods such as the Laplace approximation [43] or variational inference [44] can instead be applied.
In this paper, we intend to illustrate the potential benefit of using a data-driven prior determined by an AE. However, we do not aim at a complete Bayesian inference for the task of MR imaging at this point. This challenge is subject to current research and requires a discussion of technical concepts that are beyond the scope of this work. For a discussion on uncertainty estimation using VAEs, see [31]. In what follows, we consider and compare Bayesian point estimates for x obtained for a data-driven prior and conventional priors such as GMRF priors. We will consider the maximum a posteriori (MAP) estimate, i.e. that x for which the posterior is maximized.

AEs as priors for Bayesian inference
An AE is trained to approximately act as an identity mapping on the training data, i.e. for every x within the set of training data taken as the input to the AE, its output approximates the input x. The AE can also be viewed as parameterizing the lower dimensional manifold on which the set of training data is assumed to live since its output is given by the value of the decoder for an input produced within the lower dimensional latent space.
Let AE(x) denote the output of a trained AE, i.e. x is first mapped to the latent space through its encoder, followed by the application of its decoder, see figure 2. We propose to use the following prior,

π(x) ∝ exp(−λ ∥x − AE(x)∥²),    (4)

which is motivated by the use of adversarial regularizers in inverse problems as recommended in [36]. In (4), ∥·∥ stands for the Euclidean norm, and λ is a precision parameter to be chosen. The term ∥x − AE(x)∥ penalizes any deviation of x from the manifold determined by the AE, i.e. the larger the deviation of x from this manifold, the smaller the value of the prior. For any x from the distribution of the training data, on the other hand, the output of the AE approximates x and the value of the prior (4) is large. The parameter λ needs to be chosen, and our choice in the numerical examples below is

λ = M / ( (1/N) Σ_{i=1}^N ∥x_i − AE(x_i)∥² ),    (5)

where the sum in (5) runs through the N training data. The factor M balances the magnitude of the prior and the likelihood by taking M as the dimensionality of the image x, see also appendix C. Note that when the training data lie perfectly in the manifold of the trained AE, the precision λ tends to infinity and the prior (4) excludes any x that is not part of the manifold. In practice, however, the AE is not perfect and our choice of λ reflects the expected deviation from this manifold over the training data. In the numerical experiments below this choice works well, but other approaches to the choice of λ could also be considered, for example selecting λ in view of the resulting properties of the posterior for a set of independent test problems.
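As a sketch of how the choice (5) can be computed in practice, assuming a trained `autoencoder` callable and a list of training images as NumPy arrays (both hypothetical names):

```python
import numpy as np

def choose_lambda(train_images, autoencoder):
    """Precision parameter of equation (5): the image dimensionality M
    divided by the mean squared AE reconstruction error over the N
    training images."""
    errors = [np.sum((x - autoencoder(x)) ** 2) for x in train_images]
    M = train_images[0].size  # number of pixels of one image
    return M / np.mean(errors)
```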

Data and machine learning prior
The database used in this work comprises approximately 21 000 T2-weighted 2D brain MR images and was extracted from the fastMRI challenge data set [37]. In addition to (raw) uncompressed complex-valued multi-coil MR measurements of human brains, the original data set also contains corresponding real-valued ground truth images. It is these images that we focus on here. From these images, 2D axial slices were taken that show a significant proportion of the brain, i.e. we removed empty slices and slices containing only small brain profiles. Example images from our database are shown in figure 2(a).
To accelerate computations, the original 320 × 320 slices were reduced to 64 × 64 and the intensity was normalized to the interval [0, 1]. This database is used to train an AE. The encoder comprises three convolutional layers with batch normalisation and ReLU activation function. The final layer of the encoder is a fully connected layer mapping to a latent space of dimension 1024, effectively reducing the dimensionality of the considered images by a factor of four. Further reduction of the latent dimension reduced the quality of the identity mapping of the AE. In total, the encoder admits 782 200 learnable parameters. The decoder comprises five (transposed) convolutional layers, each again with batch normalization and ReLU activation. In total, the decoder has 3.5 million learnable parameters. The parameters of the encoder and decoder are trained for 4600 epochs using the Adam optimiser in Matlab® with a batch size of 256 and learning rate 10⁻⁴.
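A possible PyTorch sketch of such an architecture is given below. Kernel sizes, strides and channel counts are illustrative assumptions, chosen only so that the layer counts and the 1024-dimensional latent space match the description above; the exact Matlab® configuration, including the stated parameter counts, is not reproduced here.

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim=1024):
        super().__init__()
        # three convolutional layers, each with batch normalisation and ReLU
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.BatchNorm2d(16), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),  # 16 -> 8
        )
        self.fc = nn.Linear(64 * 8 * 8, latent_dim)  # fully connected bottleneck layer

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class Decoder(nn.Module):
    def __init__(self, latent_dim=1024):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 8 * 8)
        # five transposed convolutional layers back to a 64 x 64 image
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.BatchNorm2d(16), nn.ReLU(),  # 32 -> 64
            nn.ConvTranspose2d(16, 8, 3, stride=1, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 3, stride=1, padding=1), nn.Sigmoid(),  # intensities in [0, 1]
        )

    def forward(self, z):
        return self.deconv(self.fc(z).view(-1, 64, 8, 8))
```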
The quality of the identity mapping approximated by the AE is visualized in figures 2(a) and (c).

Solutions to the inverse problem
The reconstruction of MR images from undersampled MR measurements can be modeled as a linear inverse problem of the form (2), where the forward operator involves the Fourier transform F and an undersampling matrix P of size n × p that steers the reduced measurement process in Fourier space. A typical reduced measurement is shown in figure 1. It can be seen that measurements are acquired in slices, a fact that reflects the physical characteristics of the measurement devices. Alternative sampling strategies for reduced MR measurements are discussed, for instance, in [41,45]. The noise standard deviation of the data is assumed to be known and set to produce a signal-to-noise ratio of approximately 400. For a detailed discussion of our choice of the data variance, we refer to [11]. Finally, we note that the measurements y, as realisations of Y, and the corresponding unknown MR image x are complex-valued, which we represent in real-valued terms by an absolute value and a phase angle. This parametrization is motivated by the Bayesian approaches used to tackle the inverse problem (2) and by the fact that prior knowledge, in terms of an image database, is available for the absolute value of x only. In the following, we discuss the approaches employed to solve the above inverse problem.
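A minimal sketch of this measurement model, assuming an orthonormal 2D FFT for F and a boolean row mask for the undersampling operator P (function and variable names are hypothetical):

```python
import numpy as np

def forward_op(x, row_mask):
    """Undersampled Fourier forward operator PF of model (2): transform the
    complex image x and keep only the acquired k-space rows."""
    return np.fft.fft2(x, norm="ortho")[row_mask]

def simulate_measurement(x, row_mask, sigma, rng):
    """Noisy data y = PFx + eps; here complex Gaussian noise with total
    variance sigma^2 per sample is assumed."""
    y = forward_op(x, row_mask)
    noise = rng.standard_normal(y.shape) + 1j * rng.standard_normal(y.shape)
    return y + sigma / np.sqrt(2) * noise
```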

Zero-filling.
One baseline method frequently employed in MR imaging to reconstruct missing data is zero-filling. This approach replaces non-acquired data with zeros and applies the inverse Fourier transform to obtain a solution to problem (2). This procedure efficiently computes the pseudo-inverse of the forward operator. A typical reconstruction using the zero-filling approach is shown in figure 1(f). A common problem with such images is aliasing (replications of the brain from top to bottom), which challenges the usability in clinical applications.
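In the notation of the sketch above, zero-filling amounts to the following (again a sketch under the same assumptions):

```python
import numpy as np

def zero_filling(y, row_mask, shape):
    """Place the measured k-space rows back on the full grid, fill the
    non-acquired rows with zeros, and apply the inverse Fourier transform."""
    k = np.zeros(shape, dtype=complex)
    k[row_mask] = y
    return np.fft.ifft2(k, norm="ortho")
```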

GMRF prior for Bayesian inversion.
This is a statistical approach to tackle problem (2) that assumes spatial smoothness in the reconstructed solution. In particular, we impose a GMRF prior with a 3 × 3 pixel neighborhood for the absolute value and the phase of x. For the construction of a corresponding precision matrix Γ we refer to [18]. For reference, the resulting unnormalized posterior PDF reads

π(x_a, x_φ | y) ∝ exp( −(1/(2σ²)) ∥y − PF(x_a ⊙ exp(i x_φ))∥² − λ_GMRF x_a^T Γ x_a − λ_φ x_φ^T Γ x_φ ).    (6)

Here, x_a denotes the vector of absolute values of the unknown MR image x, x_φ the phase angles, and ⊙ stands for the Hadamard product. The GMRF prior for the phase is natural, since a phase can often be assumed to be spatially smooth. Additionally, our attempts at using a non-informative prior for the phase resulted in reconstructions that were much worse; we therefore do not report them here. The regularization parameter λ_GMRF is chosen according to the database using the scheme of (5), where the inner norm is replaced by the expression x^T Γ x to account for the GMRF. For the present case, this value reads λ_GMRF = 37.22. This choice is equivalent to the choice for the AE prior. Moreover, we observed in our experiments that this value performs well in the sense that small changes do not significantly affect the reconstruction quality, while large changes worsen the results. For the regularisation parameter of the phase, λ_φ, an 'oracle' value is chosen that produces the smallest root-mean-square error (RMSE) of the reconstruction with respect to the ground truth. This choice is made to clearly highlight the impact of the prior for the absolute value. In practice, λ_φ has to be estimated from data as well, or an L-curve criterion can be employed [46]. Note that the same precision matrix Γ is chosen for both the magnitude and the phase regularisation. This follows from the assumption that magnitude and phase admit similar smoothness properties. In general, it would be possible to assign different neighborhood structures here.
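For illustration, one common way to assemble a precision matrix with a 3 × 3 pixel neighborhood is sketched below; this is a generic graph-Laplacian construction under our own assumptions, not necessarily the exact scheme of [18]:

```python
import numpy as np
import scipy.sparse as sp

def gmrf_precision(nx, ny, kappa=1e-4):
    """Sketch of a GMRF precision with 3x3 pixel neighborhood: each pixel
    interacts with its (up to) 8 nearest neighbors; the small diagonal term
    kappa*I renders the intrinsic precision positive definite."""
    idx = np.arange(nx * ny).reshape(ny, nx)
    rows, cols = [], []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # pairs of pixel indices shifted by the offset (dy, dx)
            src = idx[max(0, dy):ny + min(0, dy), max(0, dx):nx + min(0, dx)]
            dst = idx[max(0, -dy):ny + min(0, -dy), max(0, -dx):nx + min(0, -dx)]
            rows.append(src.ravel())
            cols.append(dst.ravel())
    rows, cols = np.concatenate(rows), np.concatenate(cols)
    W = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(nx * ny, nx * ny))
    D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
    return (D - W + kappa * sp.eye(nx * ny)).tocsr()  # graph-Laplacian precision
```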
As the Bayesian point estimate, the MAP of the posterior PDF (6) is calculated by numerical optimization using the Matlab ® function fminunc with quasi-Newton optimizer and default parameter settings. Gradients are obtained by automatic differentiation. The initial value to start the optimization is set to the zero-filling solution. A typical reconstruction using the GMRF prior is shown in figure 1(g).
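The same computation could, for example, be reproduced with SciPy's quasi-Newton optimizer; `neg_log_posterior` below is a hypothetical stand-in for the negative logarithm of (6) (or of (7) in the next section):

```python
import numpy as np
from scipy.optimize import minimize

def map_estimate(neg_log_posterior, x_zero_fill):
    """MAP estimate: minimize the negative log posterior over the stacked
    magnitude and phase vectors, starting from the zero-filling solution."""
    x0 = np.concatenate([np.abs(x_zero_fill).ravel(),
                         np.angle(x_zero_fill).ravel()])
    result = minimize(neg_log_posterior, x0, method="BFGS")
    return result.x
```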

Data-driven AE prior for Bayesian inversion.
The underlying database contains real-valued magnitudes only. This means that data-driven prior knowledge, in terms of the AE-based prior (4), is solely available for the absolute value of the unknown MR image x. For the phase, a GMRF prior with 'oracle' precision parameter λ_φ, as above, is employed. The linear inverse problem (2) is translated to a non-linear problem resulting in the following Bayesian posterior PDF

π(x_a, x_φ | y) ∝ exp( −(1/(2σ²)) ∥y − PF(x_a ⊙ exp(i x_φ))∥² − λ ∥x_a − AE(x_a)∥² − λ_φ x_φ^T Γ x_φ ).    (7)

As the Bayesian point estimate, the MAP of the posterior PDF (7) is obtained by the same numerical procedure that has been used for the GMRF prior approach in (6). The regularization parameter λ is chosen as described in (5). For the present case, this value reads λ = 1476.3. An example reconstruction using the AE prior is shown in figure 1(h). It can be seen that the AE has the ability to remove the aliasing effects and improve the image quality.
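A sketch of the corresponding objective, i.e. the negative logarithm of (7) up to an additive constant, using the forward operator and names introduced above (`autoencoder` again denotes the trained AE applied to the magnitude image):

```python
import numpy as np

def neg_log_posterior_ae(x_a, x_phi, y, row_mask, sigma, lam, lam_phi, Gamma, autoencoder):
    """Negative log of posterior (7), up to an additive constant."""
    x = x_a * np.exp(1j * x_phi)                        # complex MR image
    residual = y - np.fft.fft2(x, norm="ortho")[row_mask]
    data_term = np.sum(np.abs(residual) ** 2) / (2 * sigma ** 2)
    ae_term = lam * np.sum((x_a - autoencoder(x_a)) ** 2)
    phi = x_phi.ravel()
    phase_term = lam_phi * phi @ (Gamma @ phi)          # GMRF prior on the phase
    return data_term + ae_term + phase_term
```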

Results
For the inverse problem, data are simulated based on test images from the database that were not used for training the AE. In particular, for a given ground truth image that contains solely the absolute value x_a of the MR image, a fully-sampled measurement ỹ is obtained by

ỹ = F(x_a ⊙ exp(i x_φ,sim)) + ε,    (8)

where x_φ,sim denotes a smooth, simulated phase, and ε Gaussian noise of variance σ². For the simulated phase, we generate smooth realizations from a random field. For the undersampling operator, 100p% of the rows of ỹ are chosen at random with equal probability. In figure 3 we show four case studies chosen at random. The first column shows the ground truth image, the second column the absolute value of the zero-filling solution, and the other two columns the MAP results for both Bayesian approaches (GMRF prior in the third column and AE prior in the fourth column). We focus on the absolute value of the reconstruction result since the phase angle is usually not considered in clinical applications. To obtain simulated measurement data, the ground truth images were transformed according to (8) and subsampling with p = 0.65 was applied. It can be seen in these particular cases that the zero-filling solution tends to retain strong aliasing effects, while the GMRF solution reduces the aliasing by imposing a smoothness constraint. The AE prior approach, however, recovers the image much better by removing the aliasing effects completely. This is because the AE is trained on images that do not exhibit such effects and therefore penalizes aliasing, which allows a much cleaner image to be obtained. At the top of each image, we show the RMSE. It can be seen that the GMRF prior improves on the zero-filling approach, while the AE prior yields the smallest RMSE in the cases shown.
To ensure that this result generalizes, we performed a quantitative analysis, summarized in figure 4. One hundred simulated measurements, each from a different ground truth image, were taken from the test set. The values p ∈ {0.75, 0.65, 0.5} were analyzed and all three cases show similar characteristics. It can be seen that the zero-filling solution yields the largest RMSE in all cases. The GMRF prior yields smaller RMSE values than zero-filling, but the AE prior solution produces the smallest RMSE. Also, particularly for p = 0.75 and p = 0.65, the AE prior solution admits the smallest spread in RMSE values.
Figures 3 and 4 support our claim and clearly show the potential improvement offered by a 'data-driven' prior (AE) over a 'conventional' prior (GMRF).

Conclusions and outlook
In this work, a novel Bayesian inversion approach to MR imaging was proposed. The employed data-driven AE prior was trained on a real-world MR image database of human brain images and the successful application to undersampled MR imaging tasks was shown. The resulting Bayesian point estimate, in terms of the MAP, outperforms classical reconstruction schemes, such as zero-filling and statistical regularisation with a 'conventional' GMRF prior.
This work serves as a proof of concept to demonstrate the potential of data-driven priors and machine learning in metrological applications. Incorporating existing data, and providing guidance on translating them into prior knowledge for Bayesian inference, is a thriving field of research from which metrologists will surely benefit.
To keep matters simple, this work focuses on point estimates. However, the fully Bayesian posterior derived here allows the computation of uncertainties, coverage probabilities and more general probability statements conditional on the observed data. This requires more sophisticated estimation procedures, such as MCMC or variational inference, which will be covered elsewhere. Also, the inverse problem considered in (2) and the simulation strategy in (8) are simplifications of reality. Accounting for the multi-coil information of the measurement device and proper handling of the phase information are required in future research. Moreover, as this work serves as a proof-of-concept study, any comparison to other current MR image reconstruction strategies that employ machine learning is left to future research. For a comparison of such strategies on the present data set, which will be considered in future work, we refer to [37] and to the results of the associated challenge [47].

Appendix A. Variational AEs
A VAE makes use of an AE architecture where, again, deep neural networks are employed for the encoder and decoder. Although the same architecture is taken for a VAE as for an AE, it is a different object playing a different role than an AE. While the latter is simply a low-dimensional approximation to the identity mapping, the former can be used as a generative model which produces random data whose distribution approximates that of the database used to train the VAE.
The neural networks of encoder and decoder of a VAE are usually determined by approximately maximizing the marginal likelihood over all data in the training set. The marginal likelihood is hard to evaluate and therefore a variational approach is utilized, with a lower bound maximized numerically (see for example [32]). The parameters of the neural networks of encoder and decoder are then determined by minimizing the loss function given by the negative of the lower bound. The following loss function is often taken to train a VAE:

L(θ, ϕ; x) = −E_{z∼q_ϕ(·|x)} [log p_θ(x|z)] + D_KL(q_ϕ(·|x) ∥ p_θ(·)).    (9)

In (9), the term p_θ(x|z) stands for the probabilistic model of the data given the variable z in latent space, which is often taken as a multivariate Gaussian with mean vector and diagonal covariance matrix both determined as the output of the neural network of the decoder, where z is the input to the decoder network. For the prior PDF p_θ(z) in latent space the multivariate standard Gaussian N(0, I) is usually taken, and q_ϕ(z|x) stands for the variational model approximating p_θ(z|x). For the variational model q_ϕ(z|x), typically a multivariate Gaussian with diagonal covariance matrix is taken, where the mean and covariance matrix are given by the output of the encoder network for input x. The Kullback-Leibler distance D_KL(q_ϕ(·|x)∥p_θ(·)) can in this case be calculated analytically, and the expectation E_{z∼q_ϕ(·|x)} [log p_θ(x|z)] is approximated through Monte Carlo draws from q_ϕ(·|x), using the reparametrization trick that enables easy calculation of the gradient w.r.t. ϕ. For further details and derivations, we refer to [32].
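A minimal sketch of this loss for a Gaussian encoder and standard normal prior, with a single reparameterized Monte Carlo draw; `log_px` is a hypothetical stand-in for the decoder's log-likelihood log p_θ(x|z):

```python
import torch

def vae_loss(x, mu_z, logvar_z, log_px):
    """Negative ELBO (9): single-sample Monte Carlo estimate of the
    reconstruction term plus the analytic Gaussian KL divergence."""
    eps = torch.randn_like(mu_z)
    z = mu_z + torch.exp(0.5 * logvar_z) * eps  # reparametrization trick
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), computed in closed form
    kl = 0.5 * torch.sum(torch.exp(logvar_z) + mu_z ** 2 - 1.0 - logvar_z)
    return -log_px(x, z) + kl
```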

Appendix B. VAE as prior
In this paper, we intend to explore the benefit that can already be gained when using a prior that is based on an AE rather than on a VAE. The motivation is that it is often easier to train an AE than a VAE [48]. Furthermore, no approximation to the resulting prior (e.g. in terms of a Laplace approximation) is required. However, in the following we outline the steps required when using a VAE for Bayesian inference. Generative models such as VAEs or GANs are natural candidates to be used as priors. The generative model is determined such that its distribution reflects that of the database used for training. Figure 1 shows randomly drawn samples from a trained VAE that resemble the characteristics of brain images.
However, while drawing random samples from a generative model can be done efficiently, evaluation of the posterior (3) requires that the prior π(x) be calculated for a given x. For a VAE, this requires calculation of the PDF of the mixture distribution

π(x) = ∫ π(x|z) π(z) dz,    (10)

where π(z) denotes the chosen distribution in the latent space, typically a multivariate standard Gaussian distribution, and π(x|z) stands for the trained probabilistic decoder of the VAE (cf p_θ(x|z) in appendix A). Since π(x|z) is a chosen distribution, e.g. a multivariate Gaussian distribution, with parameters being functions of the latent variables z, and as those functions are represented by neural networks, the integral (10) cannot be solved analytically and its numerical evaluation is challenging. One way to tackle this challenge is to evaluate (10) through a Laplace approximation, as proposed in [31]. Such a procedure allows for an analytical evaluation of the approximation to the prior. Furthermore, in cases where F in (2) is linear, the resulting approximation to the posterior is analytically given by a Gaussian distribution with explicitly known mean and covariance matrix. We refer to [31] for the use of a VAE as a prior and its potential benefit in linear inpainting and deblurring tasks exemplified for the MNIST database of handwritten digits. In [31], the statistical properties of the resulting Bayes estimators are also explored, and their advantages over an estimator obtained from the Bayesian posterior in latent space are demonstrated. In particular, it has been shown that Bayes estimators derived using a VAE prior are consistent and asymptotically normal.

Appendix C. Discussion on the regularization parameter
The choice of the regularization parameter λ in (7) and λ_GMRF in (6) is not trivial and impacts the solution of the inverse problem. From a fully Bayesian perspective, λ denotes the precision of the Gaussian prior, and the uncertainty about this parameter should be modeled accordingly, e.g. using appropriate hierarchical hyper-priors or another deep neural network as in [31]. However, for illustration purposes, the formula (5) is considered here. This choice is motivated by the following arguments.
The precision parameter λ in (7) defines a scale, and this scale could be determined by requiring that

λ E_{π_0} [∥x − AE(x)∥²] = E_{π_0} [(x − E_{π_0}[x])^T Cov_{π_0}(x)^{−1} (x − E_{π_0}[x])],    (11)

where π_0 denotes the distribution of brain images in the database, and E_{π_0} and Cov_{π_0} denote the expectation and covariance with respect to π_0. The rationale is that if a Gaussian distribution for the brain images in the database is assumed, the argument of the exponent of the prior (4) should behave similarly in scale to the exponent of a Gaussian distribution. By replacing the expectation with respect to π_0 on the left in (11) with the mean over the N samples from the database, one obtains

λ (1/N) Σ_{i=1}^N ∥x_i − AE(x_i)∥² = E_{π_0} [(x − E_{π_0}[x])^T Cov_{π_0}(x)^{−1} (x − E_{π_0}[x])].    (12)

Assuming a Gaussian distribution, the term on the right hand side of (12) equals M, where M denotes the dimensionality of x. Then λ could be chosen as

λ = M / ( (1/N) Σ_{i=1}^N ∥x_i − AE(x_i)∥² ).    (13)

Analogously, for the GMRF prior we take

λ_GMRF = M / ( (1/N) Σ_{i=1}^N x_i^T Γ x_i ),    (14)

where Γ denotes the precision matrix of the GMRF.

Appendix D. Sample generation from an AE
For the generation of realisations from a generative model in figure 1(c), we employed a VAE that is based on an AE trained on the database of brain images. To this end, an AE is trained as described in section 2. Then, a multivariate Gaussian distribution is assigned to the latent variables of the AE by fitting such a distribution to the data obtained by the output of the encoder on the database of brain images. A random brain image is then generated by drawing a Gaussian variate with corresponding mean and covariance matrix for the latent variables, and subsequently applying the decoder network.
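A sketch of this sample-generation procedure, with `encoder` and `decoder` as hypothetical callables for the trained AE networks:

```python
import numpy as np

def sample_brain_images(encoder, decoder, train_images, n_samples, rng):
    """Fit a multivariate Gaussian to the latent codes of the training
    images, draw new latent variables from it, and decode them."""
    Z = np.stack([encoder(x) for x in train_images])  # N x latent_dim codes
    mu = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False)
    z_new = rng.multivariate_normal(mu, cov, size=n_samples)
    return np.stack([decoder(z) for z in z_new])
```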