Variational Gaussian Approximation for Poisson Data

The Poisson model is frequently employed to describe count data, but in a Bayesian context it leads to an analytically intractable posterior probability distribution. In this work, we analyze a variational Gaussian approximation to the posterior distribution arising from the Poisson model with a Gaussian prior. This is achieved by seeking an optimal Gaussian distribution minimizing the Kullback-Leibler divergence from the posterior distribution to the approximation, or equivalently maximizing the lower bound for the model evidence. We derive an explicit expression for the lower bound, and show the existence and uniqueness of the optimal Gaussian approximation. The lower bound functional can be viewed as a variant of classical Tikhonov regularization that penalizes also the covariance. Then we develop an efficient alternating direction maximization algorithm for solving the optimization problem, and analyze its convergence. We discuss strategies for reducing the computational complexity via low rank structure of the forward operator and the sparsity of the covariance. Further, as an application of the lower bound, we discuss hierarchical Bayesian modeling for selecting the hyperparameter in the prior distribution, and propose a monotonically convergent algorithm for determining the hyperparameter. We present extensive numerical experiments to illustrate the Gaussian approximation and the algorithms.


Introduction
This work is concerned with Gaussian approximations to a Poisson noise model for linear inverse problems. The Poisson model is popular for modeling count data, where the response variable follows a Poisson distribution with a parameter that is the exponential of a linear combination of the unknown parameters. The model is especially suitable for low count data, where the standard Gaussian model is inadequate. It has found many successful practical applications, including transmission tomography [41,12].
One traditional approach to parameter estimation with the Poisson model is the maximum likelihood method or penalized variants with a convex penalty. This leads to a convex optimization problem, whose solution is then taken as an approximation to the true solution. This approach has been extensively studied, and we refer interested readers to the survey [18] for a comprehensive account on important developments along this line. However, this approach gives only a point estimator, and does not allow quantifying the associated uncertainties directly. In this work, we aim at a full Bayesian treatment of the problem, where both the point estimator (mean) and the associated uncertainties (covariance) are of interest [23,38]. We shall focus on the case of a Gaussian prior, which forms the basis of many other important priors, e.g., sparsity prior via scale mixture representation. Then following the Bayesian procedure, we arrive at a posterior probability distribution, which however is analytically intractable due to the nonstandard form of the likelihood function for the Poisson model. We will explain this more precisely in Section 2. To explore the posterior state space, instead of applying popular general-purposed sampling techniques, e.g., Markov chain Monte Carlo (MCMC), we employ a variational Gaussian approximation (VGA). The VGA is one extremely popular approximate inference technique in machine learning [40,8].

Notation and problem setting
First we recall some standard notation in linear algebra. Throughout, (real-valued) vectors and matrices are denoted by bold lower-and upper-case letters, respectively, and the vectors are always column vectors. We will use the notation (·, ·) to denote the usual Euclidean inner product. We shall slightly abuse the notation (·, ·) also for the inner product for matrices. That is, for two matrices X, Y ∈ R n×m , we define (X, Y) = tr(XY t ) = tr(X t Y), where tr(·) denotes taking the trace of a square matrix, and the superscript t denotes the transpose of a vector or matrix. This inner product induces the usual Frobenius norm for matrices. We shall use extensively the cyclic property of the trace operator tr(·): for three matrices X, Y, Z of appropriate size, there holds tr(XYZ) = tr(YZX) = tr(ZXY).
We shall also use the notation diag(·) for a vector and a square matrix, which gives a diagonal matrix and a column vector from the diagonals of the matrix, respectively, in the same manner as the diag function in MATLAB. The notation N = {0, 1, . . .} denotes the set of natural numbers. Further, the notation • denotes the Hadamard product of two matrices or vectors. Last, we denote by S + m ⊂ R m×m the set of symmetric positive definite matrices in R m×m , I m the identity matrix in R m×m , and by | · | and · the determinant and the spectral norm, respectively, of a square matrix. Throughout, we view exponential, logarithm and factorial of a vector as componentwise operation.
Next we recall the finite-dimensional Poisson data model. Let x ∈ R m be the unknown signal, a i ∈ R m , i = 1, . . . , n, and y ∈ N n ⊂ R n be the data vector. We stack the column vectors a i into a matrix A by A = [a t i ] ∈ R n×m . Given the matrix A and data y ∈ N n , the Poisson model takes the form: y i ∼ Pois(e (ai,x) ), i = 1, 2, . . . , n.
Thus, the likelihood function p(y i |x) for the data point y i is given by p(y i |x) = λ yi i e −λi y i ! , λ i = e (ai,x) , i = 1, . . . , n. (2.1) It is worth noting that the exponential function enters into the Poisson parameter λ. This is commonly known as the log link function or log-linear model in the statistical literature [7]. There are several other models for the (inverse) link functions, e.g., rectified-linear and softplus [33], each having its own pros and cons for modeling count data. In this work, we shall focus on the log link function. Also this model can be viewed as a simplified statistical model for transmission tomography [41,12]. The likelihood function p(y i |x) can be equivalently written as p(y i |x) = e yi(ai,x)−e (a i ,x) −ln(yi!) .
Under the independent identically distributed (i.i.d.) assumption on the data points y i , the likelihood function p(y|x) of the data vector y is given by p(y i |x) = e (Ax,y)−(e Ax ,1n)−(ln(y!),1n) , (2.2) where 1 n ∈ R n is the vector with all entries equal to unity, i.e., 1 n = [1, . . . , 1] t ∈ R n . Further, we assume that the unknown x follows a Gaussian prior p(x), i.e., where µ 0 ∈ R m and C 0 ∈ S + m denote the mean and covariance of the Gaussian prior, respectively, and N denotes the normal distribution. In the framework of variational regularization, the corresponding penalty 1 2 (x − µ 0 ) t C −1 0 (x − µ 0 ) often imposes certain smoothness constraint. The Gaussian prior p(x) may depend on additional hyperparameters, cf. Section 5 for details. Then by Bayes' formula, the posterior probability distribution p(x|y) is given by where the joint distribution p(x, y) is defined by and the normalizing constant Z(y), which depends only on the given data y, is given by That is, the normalizing constant Z is an integral living in a very high-dimensional space if the parameter dimension m is large. Thus it is computationally intractable, and so is the posterior distribution p(x|y), since it also involves the constant Z. The quantity Z is commonly known as model evidence in the literature, and it underlies many model selection rules, e.g., Bayes factor [24]. Thus the reliable approximation of Z(y) is important in certain tasks. The posterior distribution p(x|y) given in (2.3) is the Bayesian solution to the Poisson model (2.1) (under a Gaussian prior), and it contains all the information about the inverse problem. In order to explore the posterior state space, one typically employs Markov chain Monte Carlo methods, which, however, can be prohibitively expensive for high-dimensional problems, apart from the well-known challenge in diagnosing the convergence of the Markov chain. To overcome the challenge, over the last two decades, a large number of approximate inference methods have been developed, including mean-field approximation [40], expectation propagation [30] and variational Gaussian approximation (VGA) [31,8]. In all these approximations, we aim at finding a best approximate yet tractable distribution q(x) within a family of parametric/nonparametric probability distributions, by minimizing the error in a certain probability metric, prominently the Kullback-Leibler divergence D KL (q||p), cf. Section 3.1 below.
In this work, we shall employ the VGA to obtain an optimal Gaussian approximation q(x) to the posterior distribution p(x|y) in the Kullback-Leibler divergence D KL (q||p). Fitting a Gaussian to an intractable distribution is a well-established norm for approximate Bayesian inference, and it has demonstrated success in many practical applications [17,3,8,2]. The popularity can be largely attributed to the fact that the Gaussian approximation is computationally attractive, and yet delivers reasonable accuracy for a wide range of problems, due to the good analytical properties and great flexibility of the Gaussian family. However, analytical properties of approximate inference procedures are rarely studied. In the context of Poisson mixed models, the asymptotic normality of the estimator and its convergence rate was analyzed [16]. In a general setting, some theoretical issues were studied in [34,29].

Gaussian variational approximation
In this section, we recall the Kullback-Leibler divergence, derive explicit expressions for the lower bound functional and its gradient, and discuss basic analytic properties, e.g., concavity and existence.

Kullback-Leibler divergence
The Kullback-Leibler divergence is one of the most popular metrics for measuring the distance between two probability distributions. The Kullback-Leibler (KL) divergence [28] from one probability distribution p to another distribution q is a functional defined by Clearly, KL divergence is not symmetric and thus not a metric in the mathematical sense. Since the logarithm function ln x is concave and that q is normalized, i.e., q(x)dx = 1, by Jensen's inequality, we can derive the nonnegativity of the KL divergence: Due to unsymmetry of the KL divergence, to find an approximation q to the target distribution p, there are two options, i.e., minimizing either D KL (q||p) or D KL (p||q). These two options lead to different approximations. It was pointed out in [6, Section 10.1.2] that minimizing D KL (p||q) tends to find the average of modes of p, while minimizing D KL (q||p) tends to find one exact mode. Traditionally, the former is used in expectation propagation, and the latter in variational Bayes. In this work, we focus on the approach min D KL (q||p), which leads to the VGA to be described below.
Remark 3.1. In practice, the so-called Laplace approximation is quite popular [39]. Specifically, letx be the maximum a posteriori (MAP) estimatorx, i.e.,x = arg min x∈R m g(x), where g(x) = − ln p(x|y) is the negative log posterior distribution. Consider the Taylor expansion of g(x) at the MAP estimatorx: since ∇g(x) vanishes. The Hessian H of g(x) is given by Thus,x might serve as an approximate posterior mean, and the inverse Hessian H −1 as an approximate posterior covariance. However, unlike the VGA discussed below, it lacks the optimality as evidence lower bound (within the Gaussian family), and thus may be suboptimal for model selection etc.

Variational Gaussian lower bound
Now we derive the variational Gaussian lower bound. By substituting p(x) with the posterior distribution p(x|y) in (3.1), we obtain Since the posterior distribution p(x|y) depends on the unknown normalizing constant Z(y), the integral on the right hand side is not computable. Nonetheless, given y, Z(y) is fixed. In view of the identity instead of minimizing D KL (q(x)||p(x|y)), we may equivalently maximize the functional By (3.2), we have D KL (q(x)||p(x|y)) ≥ 0, and thus lnZ ≥ F (q, y). That is, F (q, y) provides a lower bound on the model evidence Z, for any choice of the distribution q. For any fixed q, F (q, y) may be used as a substitute for the analytically intractable model evidence Z(y), and hence it is called an evidence lower bound (ELBO). Since the data y is fixed, it will be suppressed from F (q, y) below. In the VGA, we restrict our choice of q to Gaussian distributions. Meanwhile, a Gaussian distribution q(x) is fully characterized by its meanx ∈ R m and covariance C ∈ S + m ⊂ R m×m , i.e., Thus, F (q) is actually a function ofx ∈ R m and C ∈ S + m , and will be written as F (x, C) below. Then the approach seeks optimal variational parameters (x, C) to maximize ELBO. This step turns a challenging sampling problem into a computationally more tractable optimization problem.
The next result gives an explicit expression for the lower bound F (x, C).

(3.4)
Proof. By the definition of the functional F (x, C) and the joint distribution p(x, y), we have It suffices to evaluate the integrals termwise. Clearly, we have N (x;x, C)(Ax, y)dx = (Ax, y). Next, using moment generating function, we have With the Cholesky decomposition C = LL t , for z ∼ N (0, I m ), x = µ + Lz ∼ N (x; µ, C). This and the bias-variance decomposition yield (E q(x) [·] takes expectation with respect to the density q(x)) By the cyclic property of trace, we have E N (z;0,Im) [z t L t ALz] = tr(L t AL) = tr(ALL t ) = tr(AC). In particular, this gives Collecting preceding identities completes the proof of the proposition.
Remark 3.2. The terms in the functional F (x, C) in (3.4) admit interesting interpretation in the lens of classical Tikhonov regularization (see, e.g., [11,19,37]). To this end, it is instructive to rewrite it as The first line represents the fidelity or "pseudo-likelihood" function. It is worth noting that it actually involves the covariance C. In the absence of the covariance C, it recovers the familiar log likelihood for Poisson data, cf. Remark 3.1. The second line imposes a quadratic penalty on the meanx. This term recovers the familiar penalty in Tikhonov regularization (except that it is imposed onx). Recall that the function − ln |C| is strictly convex in C ∈ S + m [13, Lemma 6.2.2]. Thus, one may define the corresponding Bregman divergence d(C, C 0 ). In view of the identities [9] simple computation gives the following expression for the divergence: Statistically, it is the Kullback-Leibler divergence between two Gaussians of identical mean. The divergence d(C, C 0 ) provides a distance measure between the prior covariance C 0 and the posterior one C. Let be the pairs of generalized eigenvalues and eigenfunctions of the pencil (C, C 0 ), i.e., Cv i = λ i C 0 v i . Then it can be expressed as This identity directly indicates that d(C, C 0 ) ≤ c implies C ≤ c and C −1 ≤ c, where here and below c denotes a generic constant which may change at each occurrence.
Thus, the third line regularizes the posterior covariance C by requesting nearness to the prior one C 0 in Bregman divergence. It is interesting to observe that the Gaussian prior implicitly induces a penalty on C, although it is not directly enforced. In statistics, the Bregman divergence d(C, C 0 ) is also known as Stein's loss [21]. In recent years, the Bregman divergence d(C, C 0 ) has been employed in clustering, graphical models, sparse covariance estimate and low-rank matrix recovery etc. [27,35].

Theoretical properties of the lower bound
Now we study basic analytical properties, i.e., concavity, existence and uniqueness of maximizer, and gradient of the functional F (x, C) defined in (3.4), from the perspective of optimization.
A first result shows the concavity of F (x, C). Let X and Y be two convex sets. Recall that a functional f : X × Y → R is said to be jointly concave, if and only if for all x 1 , x 2 ∈ X, y 1 , y 2 ∈ Y and λ ∈ [0, 1]. Further, f is called strictly jointly concave if the inequality is strict for any (x 1 , y 1 ) = (x 1 , y 1 ) and λ ∈ (0, 1). It is easy to see that S + m is a convex set. Theorem 3.1. For any C 0 ∈ S + m , the functional F (x, C) is strictly jointly concave with respect tox ∈ R m and C ∈ S + m . Proof. It suffices to consider the terms apart from the linear terms (y, Ax) and − 1 2 tr(C −1 0 C) and the constant term − 1 2 ln|C 0 | + m 2 − (1 n , ln(y!)). Since Ax + 1 2 diag(ACA t ) is linear inx and C, and exponentiation preserves convexity, the term −(1 n , e Ax+ 1 2 diag(ACA t ) ) is also jointly concave. Next, the is strictly concave for any C 0 ∈ S + m . Last, the log-determinant ln |C| is strictly concave over S + m [13, Lemma 6.2.2]. The assertion follows since strict concavity is preserved under summation.
Next, we show the well-posedness of the optimization problem in VGA.
Theorem 3.2. There exists a unique pair of (x, C) solving the optimization problem Proof. The proof follows by direct methods in calculus of variation, and we only briefly sketch it. Clearly, there exists a maximizing sequence, denoted by By the Cauchy-Schwarz inequality, we have ( Then by the continuity of the functional F in (x, C), we deduce that (x * , C * ) is a maximizer to F (x, C), i.e., the existence of a maximizer. The uniqueness follows from the strict joint-concavity of F (x, C), cf. Theorem 3.1.
Since F is composed of smooth functions, clearly it is smooth. Next we give the gradient formulae, which are useful for developing numerical algorithms below. Theorem 3.3. The gradients of the functional F (x, C) with respect tox and C are respectively given by Then by the chain rule showing the first formula. Next we derive the gradient with respect to the covariance C. In view of (3.5), it remains to differentiate the term (1 n , e Ax+ 1 2 diag(ACA t ) ) with respect to C. To this end, let H be a small perturbation to C. By Taylor expansion, and with the diagonal matrix D = diag(e Ax+ 1 2 diag(ACA t ) ), we deduce Since the matrix D is diagonal, by the cyclic property of trace, we have

Now the definition of the gradient completes the proof.
An immediate corollary is the following optimality system.
Corollary 3.1. The necessary and sufficient optimality system of problem (3.6) is given by Remark 3.3. Challis and Barber [8] showed that for log-concave site posterior potentials, the variational lower bound is jointly concave inx and the Cholesky factor L of the covariance C. This assertion holds also for the lower bound F (x, C) in (3.4), i.e., joint concavity with respect to (x, L).
Remark 3.4. Corollary 3.1 indicates that the covariance C * of the optimal Gaussian approximation q * (x) is of the following form: Thus it is tempting that one may minimize with respect to D instead of C in order to reduce the complexity of the algorithm, by reducing the number of unknowns from m 2 to m. However, F is generally not concave with respect to D; see [26] for a one-dimensional counterexample. The loss of concavity might complicate the analysis and computation. Remark 3.5. In practice, the parameter x in the model (2.2) often admits physical constraint. Thus it is natural to impose a box constraint on the meanx in problem (3.6), e.g., c l ≤x i ≤ c u , i = 1, . . . , m, for some c l < c u . This can be easily incorporated into the optimality system in Corollary 3.1, and the algorithms below remain valid upon minor changes, e.g., including a pointwise projection operator in the update ofx.

Numerical algorithm and its complexity analysis
Now we develop an efficient numerical algorithm, which is of alternating direction maximization type, provide an analysis of its complexity, and discuss strategies for complexity reduction.

Numerical algorithm
In view of the strict concavity of F (x, C), it suffices to solve the optimality system (cf. Corollary 3.1): This consists of a coupled nonlinear system for (x, C). We shall solve the system by alternatingly maximizing F (x, C) with respect tox and C, i.e., coordinate ascent. From the strict concavity in Theorem 3.1, we deduce that for a fixed C, (4.1) has a unique solutionx, and similarly, for a fixedx, (4.2) has a unique solution C. Below, we discuss the efficient numerical solution of (4.1)-(4.2).

Newton method for updatingx
To solvex from (4.1), for a fixed C, we employ a Newton method. Let the nonlinear map G : The Jacobian ∂G of the map G is given by where the partial ordering ≥ is in the sense of symmetric positive definite matrix, i.e., X ≥ Y if and only if X − Y is positive semidefinite. That is, the Jacobian ∂G(x) is uniformly invertible (since the prior covariance C −1 0 is invertible). This concurs with the strict concavity of the functional F (x, C) inx. This motivates the use of the Newton method or its variants: for a nonlinear system with uniformly invertible Jacobians, the Newton method converges globally [25]. Specifically, givenx 0 , we iterate The main cost of the Newton update (4.3) lies in solving the linear system involving ∂G(x k ). Clearly, the Jacobian ∂G(x k ) is symmetric and positive definite, and thus the (preconditioned) conjugate gradient method is a natural choice for solving the linear system. One may use C −1 0 (or the diagonal part of the Jacobian ∂G(x)) as a preconditioner. It is worth noting that inverting the Jacobian ∂G(x) is identical with one fixed point update of the covariance C below. In the presence of a priori structural information, this can be carried out efficiently even for very large-scale problems; see Section 4.2 below for further details. By the fast local convergence of the Newton method, a few iterations suffice the desired accuracy, which is fully confirmed by our numerical experiments.

Fixed-point method for updating C
Next we turn to the solution of (4.2) for updating C, withx fixed. There are several different strategies, and we shall describe two of them below. The first option is to employ a Newton method. Let the nonlinear map S : R m×m → R m×m be defined by The Jacobian ∂S of the map S is given by It can be verified that the map ∂S(C) is symmetric with a uniformly bounded inverse (see the proof of Theorem B.1 in the appendix for details). However, its explicit form seems not available due to the presence of the operator diag. Nonetheless, one can apply a (preconditioned) conjugate gradient method for updating C. The Newton iteration is guaranteed to converge globally. The second option is to use a fixed-point iteration. This choice is very attractive since it avoids solving huge linear systems. Specifically, given an initial guess C 0 , we iterate by Conceptually, it has the flavor of a classical fixed point scheme for solving algebraic Riccati equations in Kalman filtering [1], and it has also been used in a slightly different context of variational inference with Gaussian processes [26]. Numerically, each inner iteration of (4.4) involves computing the vector diag(AC k A t ) (which should be regarded as computing a i C k a t i , i = 1, . . . , m, instead of forming the full matrix AC k A t ) and a matrix inversion.
Next we briefly discuss the convergence of (4.4). Clearly, for all iterates C k , we have C k ≤ C 0 . We claim λ max (C k ) ≤ λ max (C 0 ). To see this, let v ∈ R m be a unit eigenvector corresponding to the largest eigenvalue λ max (C k ), i.e., v t C k v = λ max (C k ). Then by the minmax principle Thus, the sequence {C k } ∞ k=1 generated by the iteration (4.4) is uniformly bounded in the spectral norm (and thus any norm due to the norm equivalence in a finite-dimensional space). Hence, there exists a convergent subsequence, also relabeled as {C k }, such that C k → C * , for some C * . In practice, the iterates converge fairly steadily to the unique solution to (4.2), which however remains to be established. In Appendix A, we show a certain "monotone" type convergence of (4.4) for the initial guess C 0 = C 0 .

Variational Gaussian approximation algorithm
With the preceding two inner solvers, we are ready to state the complete procedure in Algorithm 1. One natural stopping criterion at Step 7 is to monitor ELBO. However, computing ELBO can be expensive and cheap alternatives, e.g., relative change of the meanx, might be considered. Note that Step 3 of Algorithm 1, i.e., randomized singular value decomposition (rSVD), has to be carried out only once, and it constitutes a preprocessing step. Its crucial role will be discussed in Section 4.2 below.
With exact inner updates (x k , C k ), by the alternating maximizing property, the sequence {F (x k , C k )} is guaranteed to be monotonically increasing, i.e., with the inequality being strict until convergence is reached. Further, F (x k , C k ) ≤ ln Z(y). Thus, {F (x k , C k )} converges. Further, by [5,Prop. 2.7.1], the coordinate ascent method converges if the maximization with respect to each coordinate is uniquely attained. Clearly, Algorithm 1 is a coordinate ascent method for F (x, C), and F (x, C) satisfies the unique solvability condition. Thus the sequence {(x k , C k )} generated by Algorithm 1 converges to the unique maximizer of F (x, C).

Complexity analysis and reduction
Now we analyze the computational complexity of Algorithm 1, and describe strategies for complexity reduction, in order to arrive at a scalable implementation. When evaluating the functional F (x, C) and its gradient, the constant terms can be precomputed. Thus, it suffices to analyze the terms that will be updated. Standard linear algebra [14] gives the following operational complexity. Update the meanx k+1 by (4.3);

7:
Check the stopping criterion.
In summary, evaluating ELBO F (x, C) and its gradients each involves O(nm 2 + m 3 ) complexity, which is infeasible for large-scale problems. The most expensive piece lies in matrix products/inversion, e.g., (1 n , e Ax+ 1 2 diag(ACA t ) ), A t e Ax+ 1 2 diag(ACA t ) and A t diag(e Ax+ 1 2 diag(ACA t ) )A. The log-determinant ln|C| can be approximated accurately with O(m 2 ) operations by a stochastic algorithm [42]. In many practical inverse problems, there do exist structures: (i) A is low rank, and (ii) C is sparse, which can be leveraged to reduce the per-iteration cost.
First, for many inverse problems, the matrix A is ill-conditioned, and the singular values decay to zero. Thus, A naturally has a low-rank structure. The effective rank r is determined by the decay rate of the singular values. In this work, we assume a known rank r. The rSVD is a powerful technique for obtaining low-rank approximations [15]. For a rank r matrix, the rSVD can yield an accurate approximation with O(mn ln r + (m + n)r 2 ) operations [15, pp. 225]. We denote the rSVD approximation by A ≈ UΣV t , where the matrices U ∈ R n×r and V ∈ R m×r are column orthonormal, and Σ ∈ R r×r is diagonal with its entries ordered nonincreasingly.
Second, the covariance C is approximately sparse, and each row/column has at most s nonzero entries. This reflects the fact that only (physically) neighboring elements are highly correlated, and there is no long range correlation. This choice will be implemented in the numerical experiments for 2D image deblurring. Naturally, one can also consider a more flexible option by adaptively selecting the sparsity pattern. This can be achieved by penalizing of the off-diagonal entries of C by the 1 -norm, which allows automatically detecting significant correlation [35]. Other structures, e.g., low-rank plus sparsity, offer potential alternatives. We leave these advanced options to a future study.
Under these structural assumptions, the complexity of computing the terms (1 n , e Ax+ 1 2 diag(ACA t ) ), A t e Ax+ 1 2 diag(ACA t ) and A t diag(e Ax+ 1 2 diag(ACA t ) )A can be reduced to O(smn). Thus, the complexity of calculating F and ∂F ∂x is reduced to O(smn + m 2 ). For the matrix inversion in (4.4), we exploit the low-rank structure of A. Upon recalling the low-rank approximation of A and the Sherman-Morrison-Woodbury formula [14, pp. 65], i.e., Note that the inversion step only involves a matrix in R r×r , and can be carried out efficiently. The sparsity structure on C can be enforced by computing only the respective entries. Then the update formula (4.5) can be achieved in O(smn + r 2 n + r 2 m) operations. In comparison with the O(m 3 + nm 2 ) complexity of the direct implementation, this represents a substantial complexity reduction.

Hyperparameter choice with hierarchical model
When encoding prior knowledge about the unknown x into the prior p(x), it is often necessary to tune its strength, a scalar parameter commonly known as hyperparameter. It plays the role of the regularization parameter in variational regularization [19,Chapter 7], where its proper choice is notoriously challenging.
In the Gaussian prior p(x), C 0 = α −1C 0 , whereC 0 describes the interaction structure and the scalar α determines the strength of the interaction which has to be specified.
In the Bayesian paradigm, one principled approach to handle hyperparameters is hierarchical modeling, by assuming a hyperprior and treating them as a part of the inference procedure. Specifically, we write the Gaussian prior p(x|α) = N (x|0, α −1C 0 ), and employ a Gamma distribution p(α|a, b) = Gamma(α|a, b) on α, where (a, b) are the parameters. The Gamma distribution is the conjugate prior for α, and it is analytically and computationally convenient. In practice, one may take (a, b) close to (1, 0) to mimic a noninformative prior. Then appealing to Bayes' formula again, one obtains a posterior distribution (jointly over (x, α)). Conceptually, with the VGA, this construction determines the optimal parameter by maximizing ELBO as a function of α, i.e., model selection within a parametric family. Thus it can be viewed as a direct application of ELBO in model selection.
One may explore the resulting joint posterior distribution in several ways [19,Chapter 7]. In this work, we employ an EM type method to maximize the following (joint) lower bound where the subscript α indicates the dependence of ELBO on α. Then, using (3.4) and substituting C 0 with α −1C 0 , we have .
This functional extends ELBO F (x, C) to estimate the hyperparameter α simultaneously with (x, C) in a way analogous to augmented Tikhonov regularization [22]. To maximize F (x, C, α), we employ an EM algorithm [6, Chapter 9.3]. In the E-step, we fix α, and maximize F (x, C, α) for a new Gaussian approximation N (x|x, C) by Algorithm 1, with the unique maximizer denoted by (x α , C α ). Then in the M-step, we fix (x, C) and update α by This follows from the condition ∂F ∂α = 0. These discussions lead to the procedure in Algorithm 2. A natural stopping criterion at line 5 is the change of α. Below we analyze the convergence of Algorithm 2.
Remark 5.1. The first two terms in the denominator of the iteration (5.2) is given by i.e., the expectation of the negative logarithm of the Gaussian prior p(x) with respect to the Gaussian posterior approximation q(x). Formally, the fixed point iteration (5.2) can be viewed as an extension of that for a balancing principle for Tikhonov regularization in [22,20] to a probabilistic context.

5:
Check the stopping criterion; 6: end for 7: Output: (x, C) In order to analyze the convergence of Algorithm 2, we write the functional F α (x, C) as Thus the functional F α (x, C) resembles classical Tikhonov regularization. By Theorem 3.2, for any α > 0, there exists a unique maximizer (x α , C α ) to F α , and the value function ψ(x α , C α ) is continuous in α, cf. Lemma 5.2 below. In Appendix B, we show that the maximizer (x α , C α ) is actually differentiable in α. Proof. Taking inner product between (4.1) andx α , we deduce (C −1 0x α ,x α ) + (e Axα+diag(ACA t ) , Ax α ) = (A t y,x α ). It can be verified directly that the function f (t) = te t is bounded from below by −e −1 for t ∈ R. Meanwhile, by (4.2), C ≤ C 0 , and thus This and the Cauchy-Schwarz inequality give x α ≤ cα −1 , with c depending only on y. Next, by (4.2), and consequently appealing to (4.2) again yields (C −1 0 + cA t A) −1 ≤ C ≤ C 0 , completing the proof. Lemma 5.2. The functional value ψ(x α , C α ) is continuous at any α > 0.
Proof. Let {α k } ⊂ R + be a sequence convergent to α. By Theorem 3.2, for each α k , there exists a unique maximizer (x k , C k ) to F α k (x, C). By Lemma 5.1, the sequence {(x k , C k )} is uniformly bounded, and there exists a convergent subsequence, relabeled as {(x k , C k )}, with a limit (x * , C * ). By the continuity of the functionals φ(x, C) and ψ(x, C), we have for any ( That is, (x * , C * ) is a maximizer of F α (x, C). The uniqueness of the maximizer to F α (x, C) and a standard subsequence argument imply that the whole sequence converges. The desired continuity now follows by the continuity of ψ(x, C) in (x, C).
Next we give an important monotonicity relation for ψ(x α , C α ) in α, in a manner similar to classical Tikhonov regularization [20]. In Appendix B, we show that it is actually strictly monotone.
Proof. This result follows by a standard comparison principle. For any α 1 , α 2 , by the maximizing property of (C α1 ,x α1 ) and (C α2 ,x α2 ), we have Summing up these two inequalities and collecting terms yield Then the desired monotonicity relation follows.
Remark 5.2. The proof of Theorem 5.1 provides a constructive approach to the existence of a solution to (5.2). The uniqueness of the solution α * to (5.2) is generally not ensured. However, in practice, it seems to have only two fixed points: one is in the neighborhood of +∞, which is uninteresting, and the other is the desired one.

Numerical experiments and discussions
Now we present numerical results to examine algorithmic features (Sections 6.1-6.4, with the example phillips) and to illustrate the VGA (Section 6.5). All one-dimensional examples are taken from public domain MATLAB package Regutools 1 , and the discrete problems are of size 100 × 100. We refer the prior with a zero mean µ 0 = 0 and the covariance α −1 I m and α −1 L −1 1 L −t 1 (with L 1 being the 1D first-order forward difference matrix) to as the L 2 -and H 1 -prior, respectively, and letC 0 = I m , andC 1 = L −1 1 L −t 1 . Unless otherwise specified, the parameter α is determined in a trial-and-error manner, and in Algorithm 1, the Newton update δx in (4.3) is computed by the MATLAB built-in function pcg with a default tolerance, the prior covariance C −1 0 as the preconditioner and a maximum 10 PCG iterations.

Convergence behavior of inner and outer iterations of Algorithm 1
First, we examine the convergence behavior of inner iterations for updatingx and C, i.e., (4.3) and (4.4), for the example phillips with the L 2 -prior C 0 = 1.0 × 10 −1C 0 and H 1 -prior C 0 = 2.5 × 10 −3C 1 . To study the convergence, we fix C at C 1 = I forx and present the 2 -norm of the update δx (initialized withx 0 = 0), and similarly fixx at the converged iteratex 1 for C and present the spectral norm of the change δC. For both (4.3) and (4.4), these initial guesses are quite far away from the solutions, and thus the choice allows showing their global convergence behavior. The convergence is fairly rapid and steady for both inner iterations, cf. Fig. 1. For example, for a tolerance 10 −5 , the Newton method (4.3) converges after about 10 iterations, and the fixed point method (4.4) converges after 4 iterations, respectively. The global as well as local superlinear convergence of the Newton method (4.3) are clearly observed, confirming the discussions in Section 4. The convergence behavior of the inner iterations is similar for both priors. In practice, it is unnecessary to solve the inner iterates to a very high accuracy, and it suffices to apply a few inner updates within each outer iteration. Since the iteration (4.4) often converges faster than (4.1), we take five Newton updates and one fixed point update per outer iteration for the numerical experiments below.

Low-rank approximation of A and sparsity of C
The discussions in Section 4.2 show that the structure on A and C can be leveraged to reduce the complexity of Algorithm 1. Now we evaluate their influence on the accuracy of the VGA. First, we examine the influence of low-rank approximation to A. Since the kernel function of the example phillips is smooth, the inverse problem is mildly ill-posed and the singular values σ k decay algebraically, cf. Fig. 4(a). A low-rank matrix A r of rank r ≈ 10 can already approximate A well. To study its influence on the VGA, we denote by (x r , C r ) and (x * , C * ) the VGA for A r and A, respectively. The errors ex = x r −x * and e C = C r − C * for different ranks r are shown in Figs. 4 (b) and (c) for the L 2 -and H 1 -prior, respectively. Too small a rank r of the approximation A r can lead to pronounced errors in both the meanx and the covariance C, whereas for a rank of r = 10, the errors already fall below one percent. Interestingly, the decay of the error ex is much faster than that of the singular values σ k , and the error e C decays slower than ex. The fast decay of the errors ex and e C indicates the robustness of the VGA, which justifies using low-rank approximations in Algorithm 1. Next we examine the influence of the sparsity assumption on the covariance C, which is used to reduce the complexity of Algorithm 1. Due to the coupling betweenx and C, cf. (4.1)-(4.2), the sparsity assumption on C affects the accuracy of bothx and C. To illustrate this, we take different sparsity levels s on C in Algorithm 1, i.e., at most s nonzero entries around the diagonal of C. Surprisingly, a diagonal C already gives an acceptable approximation measured by the errors ex = x s −x * 2 and e C = C s − C * 2 , where (x s , C s ) is the VGA with a sparsity level s. The errors ex and e C decrease with the sparsity level s, cf. Table 1. Thus the sparsity assumption on C can reduce significantly the complexity while retaining the accuracy. 4.88e-2 7.02e-2 1.00e-2 4.29e-2

Hierarchical parameter choice
Now we examine the convergence of Algorithm 2 for choosing the parameter α in the prior p(x). By Theorem 5.1, the sequence {α k } generated by Algorithm 2 is monotone. We illustrate this by two initial guesses, i.e., α 1 = 0.1 and α 1 = 10. Both sequences of iterates generated by Algorithm 2 converge monotonically to the limit α * = 0.7778, and the convergence of Algorithm 2 is fairly steady, cf. Fig.  5(a). Further, Algorithm 2 indeed maximizes the joint lower bound (5.1) with its maximum attained at α * = 0.7778, cf. Fig. 5(b). Though not shown, the lower bound F α (x, c|α) is also increasing during the iteration. Thus, the hierarchical approach is indeed performing model selection by maximizing ELBO. To illustrate the quality of the automatically chosen parameter α, we take six realizations of the Poisson data y and compare the meanx of the VGA with the optimal regularized solutions, where α is tuned so that the error is smallest (and thus it is infeasible in practice). The meansx by Algorithm 2 are comparable with the optimal ones, cf. Fig. 6, and thus the hierarchical approach can yield reasonable approximations. The parameter α by the hierarchical approach is slightly smaller than the optimal one, cf. Table 2, and hence the corresponding reconstruction tends to be slightly more oscillatory than the optimal one. The value of the parameter α by the hierarchical approach is relatively independent of the realization, whose precise mechanism is to be ascertained.

VGA versus MCMC
Despite the widespread use of variational type techniques in practice, the accuracy of the approximations is rarely theoretically studied. This has long been a challenging issue for approximate Bayesian inference, including the VGA. In this part, we conduct an experiment to numerically validate the VGA against the results by Markov chain Monte Carlo (MCMC). To this end, we employ the standard Metropolis-Hastings algorithm, with the Gaussian approximation from the VGA as the proposal distribution (i.e., independence sampler). In other words, we correct the samples drawn from VGA by a Metropolis-Hastings step. The length of the MCMC chain is 2 × 10 5 , and the last 1 × 10 5 samples are used for computing the summarizing statistics. The acceptance rate in the Metropolis-Hastings algorithm is 96.06%. This might be attributed to the fact that the VGA approximates the posterior distribution fairly accurately, and thus nearly all the proposals are accepted. The numerical results are presented in Fig. 7, where the mean and the 90% highest posterior density (HPD) credible set are shown. It is observed that the mean and HPD regions by MCMC and VGA are very close to each other, cf. Figs. 7 and 8, thereby validating the accuracy of the VGA. The 2 error between the mean by MCMC and GVA is 9.80 × 10 −3 , and the error between corresponding covariance in spectral norm is 6.40 × 10 −3 . Just as expected, graphically the means and covariances are indistinguishable, cf. Fig. 8. well with the true solution x † . However, near the boundary, the meanx is less accurate. This might be attributed to the fact that in these regions, the Poisson count is relatively small, and it may be insufficient for an accurate recovery. For the L 2 -prior, the optimal C is diagonal dominant, and decays rapidly away from the diagonal, cf. Fig. 9(b). For the H 1 -prior, C remains largely diagonally dominant, but the off-diagonal entries decay a bit slower. Thus, it is valid to assume that C is dominated by local interactions as in Section 4.2. These observations remain largely valid for the other 1D examples, despite that they are much more ill-posed. Last, we test Algorithm 1 on a 2D image of size 128×128. In this example, the matrix A ∈ R 16384×16384 is a (discrete) Gaussian blurring kernel with a blurring width 99, variance 1.5 and a circular boundary condition. Since the blurring width is large, the matrix A is indeed low-rank, and we employ a rSVD approximation of rank 2000, where the rank is determined by inspecting the singular value spectrum. The true solution x † consists of two Gaussian blobs, cf. Fig. 13(a), and thus we employ a smooth prior with C 0 = 6.00 × 10 −2 L −1 L −t , where L = I ⊗ L 1 + L 1 ⊗ I is the 2D first-order finite difference matrix. Since the problem size is very large, we restrict C to be a sparse matrix such that every pixel interacts only with at most four neighboring pixels. This allows reducing the computational cost greatly. The meanx is nearly identical with the true solution x † , and the error is very small, cf. Fig. 13. The structural similarity index between the meanx and the exact solution x † is 0.812. We also compare the VGA solution with the MAP estimator. The 2 error of the mean of the VGA is 9.7205, which is slightly smaller than that of the MAP estimator (9.7355). To indicate the uncertainty around the meanx, we show in Fig. 13(f) the diagonal entries of C (i.e., the variance at each pixel). The variances are relatively large at pixels where the meanx is less accurate.

Numerical reconstructions
In summary, the VGA can provide a reliable point estimator together with useful covariance estimates.

Conclusions
In this work, we have presented a study of the variational Gaussian approximation to the Poisson data (under the log linear link function) with respect to the Kullback-Leibler divergence. We derived explicit expressions for the lower bound functional and its gradient, and proved its strict concavity and existence and uniqueness of an optimal Gaussian approximation. Then we developed an efficient algorithm for maximizing the functional, discussed its convergence properties, and described practical strategies for reducing the complexity per iteration. Further, we analyzed hierarchical modeling for automatically determining the hyperparameter using the variational Gaussian approximation, and proposed a monotonically convergent algorithm for the joint estimation. These discussions were supported by extensive numerical experiments. There are several avenues for further study. First, one of fundamental issues is the quality of the Gaussian approximation relative to the true posterior distribution. In general this issue has been long standing, and it also remains to be analyzed for the Poisson model. Second, the variational Gaussian approximation can be viewed as a nonstandard regularization scheme, by also penalizing the covariance. This naturally motivates the study on its regularizing property from the perspective of classical regularization theory, e.g., consistency and convergence rates. Third, the approach generally gives a very reasonable approximation. This suggests itself as a preconditioner for sampling techniques, e.g., variational approximation as the proposal distribution (i.e., independence sampler) in the standard Metropolis-Hastings type algorithm or as the base distribution for importance sampler. It is expected to significantly speed up the convergence of these sampling procedures, which is confirmed by the preliminary experiments. We plan to study these aspects in future works.

A On the iteration (4.4)
In this appendix, we discuss an interesting property of the iteration (4.4), for the initial guess C 0 = C 0 . We denote the fixed point map in (4.4) by T, i.e., The next result gives the antimonotonicity of the map T on S + m , i.e., for C,C ∈ S + m , if 0 ≤ C ≤C, then T(C) ≥ T(C).
Lemma A.1. The mapping T is antimonotone.
The next result shows the monotonicity of the sequence {C k } generated by (4.4).
Lemma A.2. For any initial guess C 0 ∈ S + m , the sequence {C k } k≥0 generated by the iteration (4.4) has the following properties: (i) C k ≥ 0 for all k ≥ 0; (ii) C k ≤ C 0 for all k ≥ 0; (iii) If C k ≥ C j then C k+1 ≤ C j+1 ; (iv) If C k ≥ C j then C k+2 ≥ C j+2 .
Proof. Properties (i) and (ii) are obvious. Properties (iii) and (iv) are direct consequences of the fact that the map T is antimonotone on S + m , cf. Lemma A.1.
The next result shows that the sequence constitutes two subsequences, each converging to a fixed point of T 2 , which implies either a periodic orbit of period 2 of the map T or a fixed point of T, Theorem A.1. With the initial guess C 0 = C 0 , the sequence {C k } k≥0 generated by iteration (4.4) converges to a fixed-point of T 2 .
Proof. Lemma A.2(ii) implies C 2 ≤ C 0 , (A.1) so we can use Lemma A.2(iv) inductively to argue that {C 2k } k≥0 is a decreasing sequence. From (A.1) and Lemma A.2(iii), we deduce C 1 ≤ C 3 , which together with Lemma A.2(iv) implies that the sequence {C 2k+1 } k≥0 is increasing. By the boundedness and monotonicity, both {C 2k } k≥0 and {C 2k+1 } k≥0 converge, with the limit C * and C * * , respectively. These are the limits of the fixed point map T 2 .
Remark A.1. By Lemma A.2, C * ≥ C * * , and if C * = C * * , the whole sequence converges. Generally, the interval of matrices [C * * , C * ] provides a lower and sharp bounds for the fixed point of the iteration (4.4) (which is a priori known to be unique and to exist). By repeating the argument in [10, Theorem 2.2], one may also examine the convergence of the sequence for the initial guess either C 0 < C * * or C 0 > C * .

B Differentiability of the regularized solution
In this part, we discuss the differentiability of the regularized solution (x α , C α ) in α. For simplicity, we omit the subscript α. By differentiating (3.6) in α and chain rule, we obtain (withẋ = dx dα andĊ = dC dα ): where D = diag(e Ax+ 1 2 diag(ACA t ) ) ∈ R n×n is a diagonal matrix. This constitutes a coupled linear system for (ẋ,Ċ). The next result gives its unique solvability.
Proof. Since the system (B.1) is linear and square, it suffices to show that the homogeneous problem has only a zero solution. To this end, by eliminating the variableẋ from the second line in (B.1) using the first line, we obtain the Schur complement forĊ: For any fixed C, this defines a linear map on R m×m . Next we show its invertibility. To this end, we take inner product the map withĊ, and show its positivity. Clearly, the first term is strictly positive. Thus it suffices to consider the last two terms. By the cyclic property of trace, with d = diag(D) ∈ R n , we have (A t diag(d • diag(AĊA t ))A,Ċ) = tr(A t diag(d • diag(AĊA t ))AĊ) =(Ddiag(diag(AĊA t )), AĊA t ) = (Ddiag(AĊA t ), diag(AĊA t )) = (ē,ē), whereē = D Since I n −Ā(Ā tĀ + αC −1 0 ) −1Āt > 0, the associated bilinear form is coercive on S + m . Thus the Schur complement is invertible, and the system (B.1) has a unique solution. By assumption, A is rank deficient, and thus the left hand side is rank deficient, whereas the right hand side is of full rank, which leads to a contradiction. Thus we haveĊ = 0.