A deep neural network approach for parameterized PDEs and Bayesian inverse problems

We consider the simulation of Bayesian statistical inverse problems governed by large-scale linear and nonlinear partial differential equations (PDEs). Markov chain Monte Carlo (MCMC) algorithms are standard techniques to solve such problems. However, MCMC techniques are computationally challenging as they require a prohibitive number of forward PDE solves. The goal of this paper is to introduce a fractional deep neural network (fDNN) based approach for the forward solves within an MCMC routine. Moreover, we discuss some approximation error estimates. We illustrate the efficiency of fDNN on inverse problems governed by nonlinear elliptic PDEs and the unsteady Navier–Stokes equations. In the former case, two examples are discussed, respectively depending on two and 100 parameters, with significant observed savings. The unsteady Navier–Stokes example illustrates that fDNN can outperform existing DNNs, doing a better job of capturing essential features such as vortex shedding.


Introduction
Large-scale statistical inverse problems governed by partial differential equations (PDEs) are increasingly found in different areas of computational science and engineering [5,11,12,23,41]. The aim in this class of problems is to recover certain physical quantities from limited and noisy observations. Such problems are generally computationally demanding and can be solved using the Bayesian framework [11,21,23]. In the Bayesian approach to statistical inverse problems, one models the solution as a posterior distribution of the unknown parameters conditioned on the observations. Such mathematical formulations are often ill-posed, and regularization is introduced in the model via prior information. The posterior distribution completely characterizes the uncertainty in the model. Once it has been computed, statistical quantities of interest can be obtained from it. Unfortunately, many problems of practical relevance do not admit an analytic representation of the posterior distribution. Thus, posterior distributions are generally sampled using Markov chain Monte Carlo (MCMC)-type schemes.
We point out here that a large number of samples is generally needed by MCMC methods to obtain convergence. This makes MCMC methods computationally intensive for simulating Bayesian inverse problems. Moreover, for statistical inverse problems governed by PDEs, one needs to solve the forward problem corresponding to each MCMC sample. This task further increases the computational complexity of the problem, especially if the governing PDE is a large-scale nonlinear PDE [21]. Hence, it is reasonable to construct a computationally cheap surrogate to replace the forward model solver [45]. Problems that involve large numbers of input parameters often lead to the so-called curse of dimensionality. Projection-based reduced-order models (ROMs), e.g. reduced basis methods and the discrete empirical interpolation method (DEIM), are typical examples of dimensionality reduction methods for tackling parametrized PDEs [1,6,14,20,24,37,50]. However, they are intrusive by design in the sense that they do not allow reuse of existing codes in the forward solves, especially for nonlinear forward models. To mitigate this computational issue, the goal of this work is to demonstrate the use of deep neural networks (DNNs) to construct surrogate models specifically for nonlinear parametrized PDEs governing statistical Bayesian inverse problems. In particular, we will focus on fractional deep neural networks (fDNNs), which have recently been introduced and applied to classification problems [2]. The approach is flexible and also applies directly to time-dependent problems such as an unsteady Navier-Stokes system exhibiting vortex shedding.
The use of DNNs for surrogate modeling in the framework of PDEs has received increasing attention in recent years, see e.g. [15,29,33,46,49,51,56,60,61,63] and the references therein. Some of these references also cover the type of Bayesian inverse problems considered in our paper. To accelerate or replace computationally-expensive PDE solves, these references show that trained DNNs can approximate the mapping from parameters in the PDE to observations obtained from its solution. In a supervised learning setting, training data consisting of inputs and outputs are available and the learning problem aims at tuning the weights of the DNN. To obtain an effective surrogate model, it is desirable to train the DNN to high accuracy.
We consider a different approach than the aforementioned works. We propose novel dynamical-systems-based neural networks which allow connectivity among all the network layers. Specifically, we consider a fractional DNN technique recently proposed in [2] in the framework of classification problems, for which a PyTorch implementation is also available. We note that [2] is motivated by [30]. Both papers are in the spirit of the push to develop rigorous mathematical models for the analysis and understanding of DNNs. The idea is to consider DNNs as dynamical systems. More precisely, in [7,30,32], a DNN is viewed as an optimization problem constrained by a discrete ODE. As pointed out in [2], designing the DNN solution algorithms at the continuous level has the appealing advantage of architecture independence; in other words, the number of optimization iterations remains the same even if the number of layers is increased. Moreover, [2] specifically considers a continuous fractional ODE constraint. Unlike standard DNNs, the resulting fractional DNN allows the network to access historical information about the input and gradients across all subsequent layers, since all the layers are connected to one another.
The main contributions of the paper are as follows:
• We demonstrate that our fractional DNN leads to a significant reduction in the overall computational time used by MCMC algorithms to solve inverse problems, while maintaining accurate evaluation of the relevant statistical quantities. We note here that, although the papers [46,56,60,63] consider DNNs for inverse problems governed by PDEs, the DNNs used in these studies do not, in general, have the optimization-based formulation considered here.
• Additionally, using an unsteady Navier-Stokes model exhibiting vortex shedding, we further demonstrate that the fractional DNN approach outperforms existing approaches such as a residual neural network (ResNet).
• Moreover, we discuss some approximation error estimates.
The remainder of the paper is organized as follows: in section 2 we state the problem we want to solve and give details of both existing and proposed solution methods. In particular, in section 2.1 we state the generic parametrized PDEs under consideration and discuss well-known surrogate modeling approaches. Our main DNN algorithm to approximate parametrized PDEs is provided in section 2.2: we first discuss a ResNet architecture to approximate parametrized PDEs, followed by our fractional ResNet (or fractional DNN) approach to carry out the same task. A brief discussion of error estimates is provided in section 2.3. Next, we discuss the application of fractional DNN to statistical Bayesian inverse problems in section 2.4. This is followed by several illustrative numerical examples in section 3. A few conclusions are provided in section 4.

Problem statement and solution methods
We begin by stating generic parametrized PDEs and discussing the state-of-the-art surrogate approaches (e.g. proper orthogonal decomposition) to solve this class of problems in section 2.1. This is followed by a discussion of DNNs, in particular ResNet and fractional DNN, in section 2.2. Subsequently, we state certain approximation error estimates in section 2.3. Applications to Bayesian inverse problems are considered in section 2.4.

Parametrized PDEs
In this section, we present an abstract formulation of the forward problem as a discrete PDE depending on parameters. We are interested in the following model, which represents a (finite element, finite difference or finite volume) discretization of a (possibly nonlinear) parametrized PDE

F(u(ξ); ξ) = 0, (2.1)

where u(ξ) ∈ U ⊂ R^{N_x} and ξ ∈ P ⊂ R^{N_ξ} denote the solution of the PDE and the parameter in the model, respectively. Here, P denotes the parameter domain and U the solution manifold. For a fixed parameter ξ ∈ P, we seek the solution u(ξ) ∈ U. In other words, we have the functional relation given by the parameter-to-solution map

Φ : P → U, ξ ↦ Φ(ξ) := u(ξ). (2.2)

Several processes in computational science and engineering are modelled via parameter-dependent PDEs which, when discretized, are of the form (2.1); examples include viscous flows governed by the Navier-Stokes equations, which are parametrized by the Reynolds number [22], and unsteady natural convection problems modelled via the Boussinesq equations, in which the Grashof or Prandtl numbers are important parameters [50]. Approximating the high-fidelity solution of (2.1) can be done with nonlinear solvers, such as Newton or Picard methods combined with Krylov solvers [22]. However, computing solutions to (2.1) can become prohibitive, especially when they are required for many parameter values and N_ξ is large. Besides, a relatively large N_x (resulting from a very fine mesh in the discretization of the PDE) yields large (nonlinear) algebraic systems which are computationally expensive to solve and may also lead to huge storage requirements. This is, for instance, the case in Bayesian inference problems governed by PDEs, where a prohibitive number of queries of the forward model are required to adequately sample posterior distributions through MCMC-type schemes [8,47]. As a result of the above computational challenges, it is reasonable to replace the high-fidelity model by a surrogate model which is relatively easy to evaluate.
In what follows, we discuss two classes of surrogate models, namely: ROMs and deep learning models.

Surrogate Modeling
Surrogate models are cheap-to-evaluate models designed to replace computationally costly models. Their major advantage is that an approximate solution can be evaluated at any new parameter instance with minimal loss of accuracy, at a cost independent of the dimension of the original high-fidelity problem. A popular class of surrogate models are the ROMs, see e.g. [1,6,13,19,20,24,37,50]. Typical examples of ROMs include the reduced basis method [50], proper orthogonal decomposition (POD) [13,28,37,50] and the discrete empirical interpolation method (DEIM) and its variants [1,6,13,20,24]. A key feature of ROMs is that they use the so-called offline-online paradigm. The offline step constructs a low-dimensional approximation to the solution space, generally known as the reduced basis. The online step uses the reduced basis to solve a smaller reduced problem. The resulting reduced solution accurately approximates the solution of the original problem.
Deep neural network (DNN) models constitute another class of surrogate models, well known for their high approximation capabilities. The basic idea of DNNs is to approximate multivariate functions through a set of layers of increasing complexity [27]. Examples of DNNs for surrogate modeling include ResNet [32,35], physics-informed neural networks (PINNs) [51] and fractional DNN [2].
Note that ROMs require system solves and are highly intrusive, especially for nonlinear problems [1,13,20,21]. In contrast, the DNN approach is fully non-intrusive, which is essential for legacy codes. Although rigorous error estimates for ROMs under various assumptions have been well studied [50], and more recently also for DNNs [9,48,52], we find the non-intrusive aspect of DNNs particularly attractive.
In this work, we propose a surrogate model based on the combination of POD and fractional DNN. Before we discuss our proposed model, we first review POD.

Proper orthogonal decomposition
For sufficiently large N_s ∈ N, suppose that E := {ξ_1, ξ_2, . . . , ξ_{N_s}} is a set of parameter samples with ξ_i ∈ R^{N_ξ}, and {u(ξ_1), u(ξ_2), . . . , u(ξ_{N_s})} the corresponding snapshots (solutions of the model (2.1), with u(ξ_i) ∈ R^{N_x}). Here, we assume that span{u(ξ_1), u(ξ_2), . . . , u(ξ_{N_s})} sufficiently approximates the space of all possible solutions of (2.1). Next, we denote by

S := [u(ξ_1), u(ξ_2), . . . , u(ξ_{N_s})] ∈ R^{N_x × N_s} (2.3)

the matrix whose columns are the solution snapshots. Then, the singular value decomposition (SVD) of S is given by

S = V Σ W^T, (2.4)

where V ∈ R^{N_x × r} and W ∈ R^{N_s × r} are the matrices of left and right singular vectors, and r ⩽ min{N_s, N_x} is the rank of S. Thus, V^T V = I_r, W^T W = I_r, and Σ = diag(ρ_1, ρ_2, . . . , ρ_r) ∈ R^{r×r}, where ρ_1 ⩾ ρ_2 ⩾ . . . ⩾ ρ_r ⩾ 0 are the singular values of S. Now, denote by V ∈ R^{N_x × k} the matrix whose columns are the first k ⩽ r left singular vectors of S. Then, the columns of V form a POD basis of dimension k. According to the Schmidt-Eckart-Young theorem [17,26], the POD basis V minimizes, over all possible k-dimensional orthonormal bases Z ∈ R^{N_x × k}, the sum of the squares of the errors between each snapshot vector u(ξ_i) and its projection onto the subspace spanned by Z. More precisely,

V = argmin_{Z ∈ Z} Σ_{i=1}^{N_s} ||u(ξ_i) − Z Z^T u(ξ_i)||_2^2, (2.5)

where Z := {Z ∈ R^{N_x × k} : Z^T Z = I_k}. Note from (2.5) that the POD basis V solves a least squares minimization problem, which guarantees that the approximation error is controlled by the singular values.
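For concreteness, the following is a minimal NumPy sketch of this construction (our experiments use MATLAB, so this is purely illustrative): it forms the POD basis by truncating the SVD of a given snapshot matrix S; the variable names and the synthetic snapshot data are placeholders.

```python
import numpy as np

def pod_basis(S, k):
    """Rank-k POD basis from a snapshot matrix S of shape (Nx, Ns)."""
    # Thin SVD: S = V @ diag(rho) @ W^T
    V, rho, Wt = np.linalg.svd(S, full_matrices=False)
    return V[:, :k], rho           # first k left singular vectors, all singular values

# Usage sketch with synthetic snapshots (Nx = 200 dofs, Ns = 50 samples)
S = np.random.rand(200, 50)
Vk, rho = pod_basis(S, k=10)
energy = np.sum(rho[:10]**2) / np.sum(rho**2)   # relative energy captured by k modes
```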
For every ξ, we then approximate the continuous solution u(ξ) as u(ξ) ≈ V ũ(ξ), where the reduced solution ũ(ξ) ∈ R^k solves the (Galerkin-projected) reduced problem

V^T F(V ũ(ξ); ξ) = 0. (2.6)

Notice that, in some reduced-modeling techniques such as DEIM, additional steps are needed to fully reduce the dimensionality of the problem (2.6). Nevertheless, one still needs to solve a nonlinear (reduced) system like (2.6) to evaluate ũ. We conclude this section by emphasizing that the above approach is 'linear' because u(ξ) ≈ Φ̃(ξ), where Φ̃(ξ) = V ũ(ξ) is an approximation of the map Φ(ξ) given in (2.2).
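To make the intrusiveness point concrete, a reduced solve of the form (2.6) still requires access to the full residual F; a minimal sketch (with `F`, `Vk` and the dimension `k` as placeholders) might look as follows.

```python
import numpy as np
from scipy.optimize import fsolve

def solve_reduced(F, Vk, xi, k):
    """Solve the Galerkin-reduced system Vk^T F(Vk u_r, xi) = 0 for u_r."""
    residual = lambda ur: Vk.T @ F(Vk @ ur, xi)   # requires the full-order residual F
    ur = fsolve(residual, np.zeros(k))            # reduced nonlinear solve
    return Vk @ ur                                # lift back to the full space
```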

DNN
The DNN approach to surrogate modeling produces a nonlinear approximation Φ̃ of the input-output map Φ : R^{N_ξ} → R^{N_x} given in (2.2), where Φ̃ depends implicitly on a set of parameters θ ∈ R^{N_θ} configured as a layered set of latent variables that must be trained. We represent this dependence using the notation Φ̃(ξ; θ). In the context of PDE surrogate modeling, training a DNN requires a data set (E, S), where the parameter samples ξ_j ∈ E are the inputs and the corresponding snapshots u_j ∈ S are the targets; training then consists of constructing θ so that the DNN output Φ̃(ξ_j; θ) matches u_j for each ξ_j. This matching is measured by a loss functional. The learning problem therefore involves computing the optimal parameter θ that minimizes the loss functional and satisfies Φ̃(ξ_j; θ) ≈ u_j. The ultimate goal is that this approximation also holds, with the same optimal parameter θ, for unseen data; in other words, for ξ ∉ E, we take Φ̃(ξ; θ) to be a good approximation to Φ(ξ).
Deep learning can be either supervised or unsupervised, depending on the data set used in training. In supervised learning for DNNs, each input sample ξ_j comes paired with its corresponding target u_j. In contrast, the unsupervised learning framework does not require all the outputs to accomplish the training phase. In this paper, we adopt the supervised learning approach to build our surrogate and apply it to Bayesian inverse problems. In particular, we shall discuss ResNet [32,35] and fDNN [2] in the context of PDE surrogate modeling.

ResNet
The ResNet model was originally proposed in [35]. For a given input datum ξ, ResNet approximates Φ through the recursive expression (2.7), where {W_j, b_j} are the weights and biases, h > 0 is the stepsize, L is the number of layers, and σ is an activation function which is applied element-wise to its arguments. Typical examples of σ include the hyperbolic tangent function, the logistic function and the rectified linear unit (ReLU) function. ReLU is the nonlinear function given by σ(x) = max{x, 0}. In this work, we use a smoothed ReLU function σ_ε, defined for ε > 0 in (2.8). Note that as ε → 0, the smooth ReLU approaches ReLU. It follows from (2.7) that the network can be expressed in terms of the affine maps K_j(y) := W_j y + b_j, for j = 0, . . . , L − 1 and any y.
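For concreteness, the following NumPy sketch illustrates the structure just described: an input layer, residual (middle) layers of the form ϕ_{j+1} = ϕ_j + h σ(W_j ϕ_j + b_j), and an affine output layer. The smoothing σ_ε(x) = (x + √(x² + ε²))/2 used here is one common smoothed ReLU and is not necessarily the exact σ_ε of (2.8); the layer indexing is likewise an illustrative assumption.

```python
import numpy as np

def smooth_relu(x, eps=0.1):
    # one common smoothing of ReLU; not necessarily the sigma_eps of (2.8)
    return 0.5 * (x + np.sqrt(x**2 + eps**2))

def resnet_forward(xi, Ws, bs, W_out, b_out, h=1.0/3.0):
    """ResNet-type forward pass: input layer, residual layers, affine output layer."""
    phi = smooth_relu(Ws[0] @ xi + bs[0])          # input layer
    for W, b in zip(Ws[1:], bs[1:]):               # residual (middle) layers
        phi = phi + h * smooth_relu(W @ phi + b)
    return W_out @ phi + b_out                     # output layer

# Usage sketch: N_xi = 2 inputs, width 15, three hidden layers, k = 5 outputs
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((15, 2))] + [rng.standard_normal((15, 15)) for _ in range(2)]
bs = [rng.standard_normal(15) for _ in range(3)]
W_out, b_out = rng.standard_normal((5, 15)), rng.standard_normal(5)
u_approx = resnet_forward(rng.standard_normal(2), Ws, bs, W_out, b_out)
```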
At this point, two critical questions naturally arise: (a) how well does Φ̃(ξ; θ) approximate Φ(ξ)? (b) How do we determine θ?
We address (b) first and defer (a) to the discussion in section 2.3. Now, setting θ = {W_j, b_j}, it follows that the problem of approximating Φ via ResNet is essentially the problem of learning the unknown parameter θ. More specifically, the learning problem is the solution of the minimization problem [32]

min_θ J(θ; ξ, u) subject to the ResNet constraints (2.7). (2.9)

In this work, we consider the mean squared error, together with a regularization term, as our loss functional:

J(θ; ξ, u) := (1/(2N_s)) Σ_{j=1}^{N_s} ||Φ̃(ξ_j; θ) − u_j||_2^2 + (λ/2) ||θ||^2, (2.10)

where λ is the regularization parameter. Due to its highly non-convex nature, this is a very difficult optimization problem. Indeed, a search over a high-dimensional parameter space for the global minimum of a non-convex function can be intractable. The current state-of-the-art approaches to solve these optimization problems are based on the stochastic gradient descent method [46,56,63]. As pointed out in, for instance, [32], the middle equation in expression (2.7) mimics the forward Euler discretization of a nonlinear differential equation of the form

d_t ϕ(t) = σ(W(t) ϕ(t) + b(t)). (2.11)

It is known that standard DNNs are prone to the vanishing gradient problem [58], leading to loss of information during the training phase. ResNets do help, but more helpful is the so-called DenseNet [39]. Notice that in a typical DenseNet, each layer takes into account all the previous layers. However, this is an ad hoc approach with no rigorous mathematical framework. Recently, a mathematically rigorous approach based on fractional derivatives has been introduced in [2]; the resulting network is called fractional DNN. In [2], the authors numerically establish that the fractional-derivative-based ResNet outperforms ResNet in overcoming the vanishing gradient problem. This is not surprising because fractional derivatives allow connectivity among all the layers. Building rigorously on the idea of [2], we proceed to present the fractional DNN surrogate model for the discrete PDE in (2.1).

fDNN
As pointed out in the previous section, the fDNN approach is based on the replacement of the standard ODE constraint in the learning problem by a fractional differential equation. To this end, we first introduce the definitions and concepts on which we shall rely to discuss fractional DNN.
Let u : [0, T] → R be an absolutely continuous function and assume γ ∈ (0, 1). Next, consider the fractional differential equations (2.12) and (2.13), which involve the left and right Caputo fractional derivatives d_t^γ and d_{T−t}^γ, respectively [2], given by

d_t^γ u(t) = (1/Γ(1 − γ)) ∫_0^t (t − s)^{−γ} u′(s) ds and d_{T−t}^γ u(t) = (−1/Γ(1 − γ)) ∫_t^T (s − t)^{−γ} u′(s) ds.

We note here that the use of the Caputo fractional derivative requires the underlying function to be absolutely continuous; since the output of a neural network is typically absolutely continuous, this derivative is an appealing choice. Another advantage is that the Caputo derivative admits classical, integer-order initial conditions, unlike other fractional derivatives such as the Riemann-Liouville derivative, which requires fractional-order initial conditions. Now, setting u(t_j) = u_j and using the L1-scheme (see e.g. [2]) for the discretization of (2.12) and (2.13) yields, respectively, the discrete schemes (2.14) and (2.15), where h > 0 is the step size, Γ(·) is the Euler Gamma function, and the coefficients a_i are given by (2.16). After this brief overview of fractional derivatives, we are ready to introduce our fractional DNN (cf (2.7)), given by (2.17). Our learning problem then amounts to

min_θ J(θ; ξ, u) (2.18)

subject to the constraints (2.17). Notice that the middle equation in (2.17) mimics the L1-in-time discretization of a nonlinear fractional differential equation of the form (2.19). There are two ways to approach the constrained optimization problem (2.18). The first is the so-called reduced approach, in which we eliminate the constraints (2.17) and consider the minimization problem (2.18) in terms of θ only. The resulting problem can be solved using a gradient-based method such as BFGS, see e.g. [43, chapter 4]. During every step of the gradient method, one needs to solve the state equation (2.17) and an adjoint equation; these two solves yield an expression for the gradient of the reduced loss function with respect to θ. Alternatively, one can derive the gradient and adjoint equations using the Lagrangian approach. It is well known that the gradient with respect to θ is the same for both approaches, see [3, p 14] for instance.
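To make the memory terms concrete, the following NumPy sketch propagates the network using an explicit L1-type update for the fractional dynamics, with the standard L1 weights a_i = (i+1)^{1−γ} − i^{1−γ} and τ = h^γ Γ(2 − γ); this is only an illustration of the idea behind (2.17), and its indexing and weights may differ in detail from the scheme used in the paper.

```python
import numpy as np
from math import gamma

def smooth_relu(x, eps=0.1):
    # one common smoothed ReLU (not necessarily the sigma_eps of (2.8))
    return 0.5 * (x + np.sqrt(x**2 + eps**2))

def fdnn_forward(xi, Ws, bs, W_out, b_out, gam=0.5, h=1.0/3.0):
    """Explicit L1-scheme propagation: each new layer sees all previous layers."""
    tau = h**gam * gamma(2.0 - gam)
    a = lambda i: (i + 1)**(1.0 - gam) - i**(1.0 - gam)    # L1 weights, a(0) = 1
    phi = [smooth_relu(Ws[0] @ xi + bs[0])]                # input layer
    for j in range(1, len(Ws)):
        # memory term: weighted differences of all earlier layers
        memory = sum(a(j - 1 - k) * (phi[k + 1] - phi[k]) for k in range(j - 1))
        phi.append(phi[-1] + tau * smooth_relu(Ws[j] @ phi[-1] + bs[j]) - memory)
    return W_out @ phi[-1] + b_out                         # affine output layer
```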
We next illustrate how to evaluate this gradient using the Lagrangian approach. From here on, we use the 'smoothed' activation function σ_ε. The Lagrangian functional L associated with the discrete constrained optimization problem (2.18) is obtained by appending the constraints (2.17) to the loss functional via Lagrange multipliers {ψ_j}, also called adjoint variables, where ⟨·, ·⟩ denotes the inner product on R^{N_x}. Next we write the state and adjoint equations fulfilled at a stationary point of the Lagrangian L; in addition, we state the derivative of L with respect to the design variable θ. These conditions are collected in (2.20a)-(2.20c).
In (2.20a)-(2.20c), ⊙ denotes the Hadamard (pointwise) product for matrices. The right-hand side of (2.20c) is the gradient of L with respect to θ. We then use a gradient-based method (BFGS in our case) to identify θ.

Remark 1.
We emphasize that solving the adjoint equation (2.20b) can be thought of as equivalent to the well-known back-propagation method of the machine learning literature. Recall that back-propagation is typically carried out using automatic differentiation; the adjoint equation approach is more direct.

Error analysis
In this section we briefly address the question of how well the deep neural network approximation approximates the PDE solution map Φ : R^{N_ξ} → R^{N_x}. The approximation capabilities of neural networks have received a lot of attention recently in the literature, see e.g. [18,38,53,62] and the references therein. The papers [38,53] obtain results based on general activation functions for which the approximation rate is O(n^{−1/2}), where n is the total number of neurons in the hidden layer. This suggests that neural networks can overcome the curse of dimensionality, as the approximation rate is dimension-independent. We remark, however, that for some PDEs, especially those posed on complex geometries, the neural network approach, just like ROMs, may face approximation difficulties [16,25]. We begin with the observation that the fractional DNN can be written as a linear combination of activation functions evaluated at the different layers. Indeed, observe from (2.17) that if Φ(ξ) is approximated by a one-hidden-layer network Φ̃(ξ; θ) := ϕ_2 (that is, L = 2), then it can be expressed in the form (2.21). By a one-hidden-layer network, we mean a network with the input layer, one hidden layer and the output layer [18,38,44,53]. Next, for L = 3, we obtain an analogous expansion with coefficients α_0 = 1 − a_1, α_1 = τ := h^γ Γ(2 − γ) and α_2 = a_1. Similarly, setting L = 4 yields an expansion whose coefficients are again determined by τ and the a_i. Proceeding in this fashion yields the following result for the multilayer fractional DNN.

Proposition 2.1 (Representation of multilayer fractional DNN). For L ⩾ 3, the fractional DNN given by (2.17) can be written as a linear combination of activation functions evaluated at the different layers, with coefficients α_i that are constants depending on τ and the a_i given in (2.16). Moreover, the one-hidden-layer fDNN (that is, L = 2) coincides with a one-hidden-layer standard DNN.
Error analysis for multilayer networks is generally challenging. Some papers that study this include [9,34,48,52,62] and the references therein. The results of the two papers [34,62] focus mainly on the ReLU activation function. In particular, [34] discusses the approximation of linear finite elements by ReLU deep and shallow networks. In [48,52], the authors prove rigorous bounds on the errors resulting from the approximation of forward and inverse problems with PINNs. The paper [9] establishes error estimates and stability analysis for DNN approximation of the incompressible Navier-Stokes equations.
In this paper, we restrict the analysis to the one-hidden-layer fDNN, i.e. L = 2. In particular, we consider a one-hidden-layer network with finitely many neurons. Notice that for L = 2, fDNN coincides with a standard DNN according to proposition 2.1. Therefore, it is possible to extend the result from [53] to our case. To state the result from [53], we first introduce some notation.
To this end, we assume that the function Φ(ξ) is defined on a bounded domain P ⊂ R^{N_ξ} and has a bounded Barron norm ||Φ||_B. Also, we say that f ∈ W^{m,p}_loc(P) if f ∈ W^{m,p}(P′) for every compact P′ ⊂ P. Since the mapping Φ : R^{N_ξ} → R^{N_x} can be computed using N_x scalar mappings Φ_j : R^{N_ξ} → R, it suffices to focus on networks with one output unit. Observe from (2.21) that

Φ̃_j(ξ; θ) = Σ_{i=1}^{n} c_i σ(ω_i · ξ + b_i), (2.28)

where n is the total number of neurons in the hidden layer, b_i is an element of the bias vector, and c_i and ω_i are rows of W_1 and W_0, respectively. Now, for a given activation function σ, define the set

F_n^{N_ξ}(σ) := { Σ_{i=1}^{n} c_i σ(ω_i · x + b_i) : c_i, b_i ∈ R, ω_i ∈ R^{N_ξ} }. (2.29)

We can now state the following result, which holds for any non-zero activation function satisfying some regularity conditions [53, Corollary 1]. We will then show that the activation function of (2.8) fulfills the required assumptions.

Theorem 2.2. Let P be a bounded domain and let σ ∈ W^{m,∞}_loc(R) be a non-zero activation function. Suppose there exists a function ν ∈ F_q^1(σ), for some q ∈ N, satisfying the decay condition

|ν(x)| ⩽ C_p (1 + |x|)^{−p} for some p > 1 and C_p > 0. (2.30)

Then, for any function Φ with bounded Barron norm, there exists Φ̃_n ∈ F_n^{N_ξ}(σ) satisfying the approximation bound (2.31), where β = diam(P) and C is a constant depending on σ, m, p and β.
Note that the bound in (2.31) is O(n^{−1/2}). Here, the function ν(x) is a linear combination of shifts and dilations of the activation σ(x). Moreover, the activation function itself need not satisfy the decay (2.30). The theorem says that it suffices to find some function ν ∈ F_q^1(σ) (see (2.29) with n = q, N_ξ = 1) for which (2.30) holds.
We work with the smooth ReLU (2.8) as the activation function σ(x), for which m = 2. One choice of ν for which (2.30) holds is a function ν ∈ F_3^1(σ); it can be shown that this ν satisfies (2.30) with C_p = 20 and p = 1.5, which can be verified from figure 1.

Application to Bayesian inverse problems
The inverse problem associated with the forward problem (2.1) requires estimating a parameter vector ξ ∈ R^{N_ξ} from limited and noisy observed data d ∈ R^{N_x}. In the Bayesian inference framework, the posterior probability density π(ξ|d) : R^{N_ξ} → R solves the statistical inverse problem; it encodes the uncertainty in the sought parameter vector given the observed data. More formally, the posterior probability density is given by Bayes' rule as

π_pos(ξ) := π(ξ|d) ∝ π(ξ) π(d|ξ), (2.34)

where π(·) : R^{N_ξ} → R is the prior density and π(d|ξ) is the likelihood function. The standard approach to Bayesian inference assumes that the observed data are of the form (see e.g. [12,23])

d = Φ(ξ) + η, (2.35)

where Φ is the parameter-to-observable map and η ∼ N(0, Σ_η). In our numerical experiments, we assume that Σ_η = κ² I, where I is the identity matrix of appropriate dimension and κ² denotes the noise variance. Now, let the (negative) log-likelihood be given by

L(ξ) := (1/2) (d − Φ(ξ))^T Σ_η^{−1} (d − Φ(ξ)), (2.36)

then Bayes' rule (2.34) becomes

π_pos(ξ) = (1/Z) exp(−L(ξ)) π(ξ), (2.37)

where Z = ∫_P exp(−L(ξ)) π(ξ) dξ is a normalizing constant. In practice, the posterior density rarely admits an analytic expression; it is generally explored using MCMC sampling techniques. MCMC schemes require a prohibitive number of queries of the forward model, which is a significant computational challenge, especially if the forward model represents a large-scale high-fidelity discretization of a nonlinear PDE. It is therefore reasonable to construct a cheap-to-evaluate surrogate for the forward model in the offline stage for use in the online computations of the MCMC algorithms.
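As an illustration of how cheaply the surrogate can be queried inside a sampler, the following is a minimal Python sketch (our experiments use MATLAB) of the unnormalized log-posterior for the Gaussian likelihood above with a uniform prior on a box, as used in section 3.3; the function `surrogate` is a placeholder for the trained fDNN map.

```python
import numpy as np

def log_posterior(xi, d, surrogate, kappa=1e-2, lo=0.01, hi=10.0):
    """Unnormalized log-posterior: Gaussian likelihood, uniform prior on [lo, hi]^d."""
    if np.any(xi < lo) or np.any(xi > hi):
        return -np.inf                      # zero prior probability outside the box
    misfit = surrogate(xi) - d              # d = Phi(xi_true) + noise, cf. (2.35)
    return -0.5 * np.dot(misfit, misfit) / kappa**2
```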
In what follows, we discuss MCMC schemes for sampling posterior probability distributions, as well as how our fractional DNN surrogate modeling strategy can be incorporated in them.
In what follows, we describe in more detail the MH algorithm used in this paper and refer the reader to [8] for the details of the derivation and properties of the many variants of the other algorithms discussed below (MALA, pCN and HMC).
Given the current state ξ, the MH algorithm proceeds as follows: (1) draw a candidate ξ* from a proposal density q(ξ*|ξ); (2) compute the acceptance probability

α(ξ, ξ*) = min{1, [π(ξ*|d) q(ξ|ξ*)] / [π(ξ|d) q(ξ*|ξ)]}; (2.38)

(3) accept ξ* with probability α(ξ, ξ*), and otherwise retain ξ. Observe from (2.36) and (2.37) that each evaluation of the likelihood function, and hence of the acceptance probability (2.38), requires an evaluation of the forward model to compute π(ξ*|d). We propose to replace these forward solves by the fractional DNN surrogate model. In our numerical experiments, we follow [31], in which an adaptive Gaussian proposal (2.39) is used: the proposal covariance is initialized as C_0 = I and is subsequently updated using the past samples of the chain. When the proposal is chosen adaptively in this way, the MH method is referred to as the Adaptive Metropolis (AM) method [4,31].
We note here that RWM belongs to the family of MH algorithms with a symmetric proposal, as the proposal move is generated according to a random walk. MALA was inspired by stochastic models of molecular dynamics and is an extension of the multivariate RWM algorithm that incorporates gradient (partial derivative) information to improve mixing. The preconditioned Crank-Nicolson (pCN) algorithm differs only slightly from the RWM algorithm in the sense that the proposal is a first-order autoregressive process rather than a centered random walk. Moreover, the pCN algorithm has an acceptance probability that is invariant to dimension, making it useful for large-scale problems. HMC corresponds to an instance of the MH algorithm as described above, with a Hamiltonian dynamics evolution simulated using a time-reversible and volume-preserving numerical integrator (typically the leapfrog integrator) to propose a move to a new point in the state space.
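The following Python sketch assembles the pieces into an adaptive Metropolis loop of the kind described above, with the proposal covariance re-estimated from the chain history every 100 steps; the scaling 2.38²/dim is the usual choice from the adaptive Metropolis literature and is an assumption rather than a value stated here. The `log_post` argument can be the surrogate-based log-posterior sketched in section 2.4.

```python
import numpy as np

def adaptive_metropolis(log_post, xi0, n_samples, update_every=100, seed=0):
    """Random-walk MH with an adaptively estimated Gaussian proposal covariance."""
    rng = np.random.default_rng(seed)
    dim = len(xi0)
    C = np.eye(dim)                                   # initial proposal covariance C_0 = I
    scale = 2.38**2 / dim                             # standard AM scaling (assumption)
    chain = np.empty((n_samples, dim))
    xi = np.asarray(xi0, dtype=float)
    lp = log_post(xi)
    for k in range(n_samples):
        prop = rng.multivariate_normal(xi, scale * C)          # Gaussian random-walk proposal
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:                # accept/reject (symmetric proposal)
            xi, lp = prop, lp_prop
        chain[k] = xi
        if (k + 1) % update_every == 0 and k > dim:            # adapt covariance from history
            C = np.cov(chain[:k + 1].T) + 1e-8 * np.eye(dim)
    return chain
```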

Numerical results
Next, we apply fDNN to time-dependent PDEs and statistical inverse problems governed by elliptic PDEs. We observe that fDNN outperforms existing networks such as ResNet. It can capture highly nonlinear phenomena such as vortex shedding in a model of incompressible flow (see section 3.1). It can also efficiently handle inverse problems with two parameters (see section 3.3) or 100 parameters (see section 3.4), with significant observed savings.
All experiments were performed using MATLAB R2020b on a Mac desktop with RAM 32GB and CPU 3.6 GHz Quad-Core Intel Core i7.

Unsteady Navier-Stokes equations
The goal of this section is to compare the ResNet and fDNN approximations of the unsteady Navier-Stokes equations with finite element approximations. This problem was studied using a finite element method in [40], with the goal of benchmarking two-dimensional finite element codes. The example is highly challenging since vortex formation and shedding take place.
Consider the unsteady Navier-Stokes system (3.1). We discretize the weak form of (3.1) using Taylor-Hood P2-P1 finite elements, i.e. piecewise quadratic basis functions for the velocity v and piecewise linear basis functions for the pressure p. The time discretization is carried out using a modified time-stepping θ-scheme as described in [57]; this modified θ-scheme is a second-order, fully implicit scheme.

Code benchmarking
We first benchmark our implementation against [40, figure 2]. As in [40, figure 2], we observe that two vortices start to develop (behind the cylinder) at t = 2 s and that the vortices separate from the cylinder between t = 4 s and t = 5 s. Moreover, the vortices are still visible at the final time T = 8 s. Using our implementation, we also obtain the same values of other benchmark quantities, such as the drag coefficient, as reported in [40]; see figure 4 below. The goal of our DNN approximation is to replicate these results, which are obtained using the finite element method.

Network details
We consider a network with 5 hidden layers and 30 neurons in each hidden layer. The space discretization is carried out using a finite element triangulation with a total of 30 364 degrees of freedom. Motivated by our recent work on stiff ODEs [10], the network is trained to advance the solution from any time step to the next. (Figure 4(c) reports the drag coefficient; see [40] for its expression.)
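Once trained, such a network is used autoregressively: each prediction is fed back as the input for the next step. A minimal sketch of this rollout (assuming, purely for illustration, that the network maps the state at one time step directly to the state at the next) is:

```python
import numpy as np

def rollout(net, u0, n_steps):
    """Advance the surrogate in time: u_{n+1} = net(u_n), reusing each prediction."""
    traj = [np.asarray(u0, dtype=float)]
    for _ in range(n_steps):
        traj.append(net(traj[-1]))
    return np.stack(traj)            # shape: (n_steps + 1, state dimension)
```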
It is evident from figure 3 that fDNN does a dramatically superior job of capturing the effect of vortex shedding than does ResNet, and, similarly, in figure 4, fDNN is more accurate in the other metrics.

Application to statistical inverse problems
In this section, we focus on a diffusion-reaction problem in which two parameters need to be inferred [21], as well as a thermal fin problem from [8], which involves one hundred parameters to be identified.
In both of these experiments we train a 3-hidden-layer network with 15 neurons in each hidden layer (that is, L = 4, n = 45) and k = 400 neurons in the output layer, where k matches the dimension of the POD basis as described in section 3.3 below. We chose ε = 0.1 in the smooth ReLU activation function, final time T = 1 and step size h = 1/3. We set the fractional exponent γ = 0.5 in the Caputo fractional derivative and the regularization parameter λ = 10^{−6}. To train the network, we use the BFGS optimization method [43, chapter 4]. Table 1 shows the number of BFGS iterations and the CPU times required to train the data from the diffusion-reaction problem (top) and the thermal fin problem (bottom), respectively. Also shown are the relative errors obtained by evaluating the trained network at a parameter ξ_e not in the training set E; here

relative error = ||u(ξ_e) − Φ̃(ξ_e)||_2 / ||u(ξ_e)||_2,

where u(ξ_e) and Φ̃(ξ_e) are the true and surrogate solutions at ξ_e, respectively. Note in both cases that after 1600 BFGS iterations the decrease in errors is not significant. Hence, for all the experiments discussed below, we use 1600 BFGS iterations.
Equation (3.3) is discretized on a uniform mesh in Ω with 64 grid points in each direction using centered differences, resulting in 4096 degrees of freedom. We obtained the solution of the resulting system of nonlinear equations using an inexact Newton-GMRES method as described in [42]. The stopping tolerance was 10^{−6}.
To train the network, we first computed N_s = 900 solution snapshots S corresponding to a set E of 900 parameters ξ = (ξ_1, ξ_2) drawn from the parameter space [0.01, 10]^2. It took 642.8 s to generate the data. These parameters were chosen using Latin hypercube sampling. Each solution snapshot is of dimension N_x = 4096. Next, we computed the SVD of the matrix of solution snapshots, S = VΣW^T, and set our POD basis V = V(:, 1:k), where k = 400. Our training set then consisted of E as inputs and V^T S ∈ R^{k × N_s} as targets. As reported by Hesthaven and Ubbiali in [36] in the context of solving parametrized PDEs, the POD-DNN approach accelerates the online computations for surrogate models. Thus, we follow this approach in our numerical experiments; in particular, using k-dimensional data (where k ≪ N_x) confirms an overall speed-up in the solution of the statistical inverse problems.
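A sketch of this offline data preparation (Latin hypercube samples of the two parameters, full-order snapshots, POD truncation, and reduced targets V^T S) is given below; `solve_pde` is a placeholder for the full-order nonlinear solve, and the experiments themselves were carried out in MATLAB.

```python
import numpy as np
from scipy.stats import qmc

def build_training_data(solve_pde, n_samples=900, k=400, lo=0.01, hi=10.0, seed=0):
    """Latin hypercube inputs, full-order snapshots, and POD-reduced targets."""
    sampler = qmc.LatinHypercube(d=2, seed=seed)
    E = qmc.scale(sampler.random(n_samples), [lo, lo], [hi, hi])   # 900 x 2 parameter samples
    S = np.column_stack([solve_pde(xi) for xi in E])               # snapshot matrix, Nx x Ns
    V, _, _ = np.linalg.svd(S, full_matrices=False)
    Vk = V[:, :k]                                                  # POD basis of dimension k
    targets = Vk.T @ S                                             # k x Ns reduced targets
    return E, Vk, targets
```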
Next, in both numerical examples considered in this section, we solved the inverse problems using M = 20,000 MCMC samples. In each case, the first 10,000 samples were discarded for 'burn-in' effects, and the remaining 10,000 samples were used to obtain the reported statistics. Here, we used an initial Gaussian proposal (2.39) with covariance C_0 = I and updated the covariance after every 100th sample. As in [21], we assume for this problem that ξ is uniformly distributed over the parameter space [0.01, 10]^2; hence, (2.37) becomes a posterior proportional to exp(−L(ξ)) on [0.01, 10]^2 and zero outside. We generated the observations d using (2.35) with the true parameter to be identified, ξ_e = (1, 0.1), and a Gaussian noise vector η with κ = 10^{−2}. Figures 5-7 show, respectively, the histograms (which depict the posterior distributions), the Markov chains and the autocorrelation functions corresponding to the parameters, using the high-fidelity model (Full) and the DNN surrogate models in the MCMC algorithm. Recall that, for a Markov chain {δ_j} generated by the Metropolis-Hastings algorithm, with variance κ², the autocorrelation function (ACF) ϱ of the δ-chain is given by ϱ(j) = cov(δ_1, δ_{1+|j|})/κ², and the integrated autocorrelation time (IACT), τ_int, of the chain is defined as

τ_int = 1 + 2 Σ_{j=1}^{J} ϱ(j).

The ACF decays very fast to zero as the lag increases. In practice, J is often taken to be 10 log_10 M [5]. If IACT = K, this means that roughly every Kth sample of the δ-chain is independent. We approximate ϱ(j) and τ_int in our computations using standard sample estimators [4,55], in which the chain mean δ̄ replaces the exact mean and J is chosen to be the smallest integer such that J ⩾ 3τ_int.
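For reference, a hedged sketch of such sample estimators is given below; the truncation rule and normalization follow common practice and may differ in detail from the estimators of [4,55].

```python
import numpy as np

def acf(chain, max_lag):
    """Sample autocorrelation function of a scalar chain."""
    d = np.asarray(chain, dtype=float)
    d = d - d.mean()
    var = np.dot(d, d) / len(d)
    return np.array([np.dot(d[:len(d) - j], d[j:]) / (len(d) * var)
                     for j in range(max_lag + 1)])

def iact(chain, max_lag=None):
    """Integrated autocorrelation time: 1 + 2 * (sum of the ACF over positive lags)."""
    M = len(chain)
    if max_lag is None:
        max_lag = int(10 * np.log10(M))        # common truncation rule
    rho = acf(chain, max_lag)
    return 1.0 + 2.0 * np.sum(rho[1:])
```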
Observe from figures 5 and 6 that, for both the Full and fDNN models, the respective histograms and Markov chains are centered around the parameters of interest (1, 0.1); indeed, for the full model, the 95% confidence intervals (CIs) for ξ_1 and ξ_2 contain the true parameter values. We also examine the impact of training accuracy in figure 7. The figure shows that the ACFs decay very fast to zero, as expected. The value of the IACT is roughly 10 for all the MCMC chains generated by the full model (left). The right of the figure shows the IACT for the fDNN model, where the red curves show the results obtained with 1600 training steps and the blue curves show those obtained with 6400 steps. In general, the IACT values imply that roughly every 10th sample of the MCMC chains generated by both the full and the fDNN models is independent.
These results indicate that the fDNN surrogate model produces results with accuracy comparable to those obtained using the full model. The advantage of the surrogate lies in its reduced cost. In this example, using the MH algorithm, the full model required 2347.7 seconds of CPU time to solve the inverse problem, whereas the fractional DNN model required 29.3 seconds (the online cost for fDNN), a reduction in CPU time by a factor of about 80. As always for surrogate approximations, there is overhead, the offline computations, associated with construction of the surrogate, i.e. identification of the parameter set θ that defines the fDNN. For this example, this entailed construction of the 900 snapshot solutions used to produce targets for the training process, computation of the SVD of the matrix S of snapshots giving the targets, together with the training of the network (using BFGS). The times required for these computations were 127.8 s to construct the inputs/targets, 0.52 s to compute the SVD and 29.5 s to train the network. Thus, the offline and online times sum to 187.1 s, which is much smaller than the time needed for the full solution (that is, 2347.7 s). Of course, the offline computations represent a one-time expense which need not be repeated if more samples are used to perform an MCMC simulation or if different MCMC algorithms are used. This is, for instance, the case with the Differential Evolution Adaptive Metropolis (DREAM) method, which runs multiple chains simultaneously when used to sample the posterior distributions [59].

Thermal fin example
Next, we consider a thermal fin problem from [8]. Given the log-conductivity determined by the parameter ξ, the forward problem (3.6) is used to compute the temperature u. The goal of this inverse problem is to infer 100 unknown parameters ξ from 262 noisy observations of u. Figure 8 shows the location of the observations on the boundary ∂Ω \ Γ, as well as the forward PDE solution u at the true parameter ξ.
To train the network, we first computed, as in the diffusion-reaction case, N_s = 900 solution snapshots S corresponding to 900 parameter sets ξ ∈ R^{100} drawn using Latin hypercube sampling. It took 1435.5 s to generate the data. This problem is considerably more difficult than the previous one in the sense that the dimension of the parameter space is now 100 rather than 2. Next, we compute the SVD of S and proceed as before.
In what follows, the infinite-dimensional variants of the Riemannian manifold Metropolis-adjusted Langevin algorithm (MALA) and Hamiltonian Monte Carlo (HMC) algorithms presented in [8] are denoted by ∞-MALA and ∞-HMC, respectively. In particular, figure 9 shows the true conductivity, as well as the posterior mean estimates obtained by three MCMC methods-pCN, ∞-MALA and ∞-HMC-using the fractional DNN as a surrogate model. As in [8], for this problem we used a Gaussian prior defined on the domain D := [−3, 3] × [0, 4] with covariance C whose eigen-structure is given by C φ_i = µ_i² φ_i, i ∈ I, where µ_i² = {π²((i_1 + 1/2)² + (i_2 + 1/2)²)}^{−1.2}, φ_i(x) = 2|D|^{−1/2} cos(π(i_1 + 1/2)x_1) cos(π(i_2 + 1/2)x_2), and I = {i = (i_1, i_2) : i_1 ⩾ 0, i_2 ⩾ 0}. Moreover, the log-conductivity ξ was the true parameter to be inferred, with coordinates ξ_i = µ_i sin((i_1 − 1/2)² + (i_2 − 1/2)²), i_1 ⩽ 10, i_2 ⩽ 10. Table 2 shows the average acceptance rates for these models and the computational times required to perform the MCMC simulation for different variants of the MCMC algorithm, using both fDNN surrogate computations and the full-order discrete PDE solution. These results indicate that the acceptance rates are comparable for the fDNN surrogate model and the full-order forward discrete PDE solver. Moreover, there is about a 90% reduction in the computational times when the fDNN surrogate solution is used, a clear advantage of the fDNN approach. The costs of the offline computations for fDNN were 63.2 s to generate the data used for the fractional DNN models and 33.5 s to train the network (with 1600 BFGS iterations), a total of 99.7 s. As in the case of the diffusion-reaction problem, the overhead required for the one-time offline computations is offset by the short CPU time required for the online phase of the MCMC algorithms. Note that the same offline computations were used for each of the three MCMC simulations tested.

Discussion and conclusions
This paper introduces a DNN-based approach for solving Bayesian statistical inverse problems governed by large-scale linear and nonlinear PDEs. The approach combines an fDNN with an MCMC algorithm. The fDNN is a deep learning network that incorporates history or memory into the network by connecting all layers to one another. It can be viewed as a time-discretization of a fractional-in-time nonlinear ordinary differential equation (ODE). The fDNN learns the parameter-to-PDE-solution operator and provides fast and accurate forward solves within the MCMC routine. We also discuss some approximation error estimates for the fDNN. We illustrate the efficiency and advantages of fDNN on inverse problems governed by nonlinear elliptic PDEs and the unsteady Navier-Stokes equations. The former case involves two examples, with two and 100 parameters, respectively, while the latter demonstrates that fDNN can capture essential features such as vortex shedding and outperform existing methods such as ResNet. The proposed approach exhibits multiple benefits over traditional surrogate approaches for parametrized PDEs, such as reduced basis methods. For instance, it is fully non-intrusive and can therefore be directly used with legacy codes, unlike ROMs for nonlinear PDEs, which can be highly intrusive. However, we remark that for some PDEs, especially those posed on complex geometries, the neural network approach, just like ROMs, may face approximation difficulties [16,25]. Further extensions of this work are possible, for instance, Bayesian inversion with local adaptive reduced bases or DNNs [60,64].