Joint Gaussian dictionary learning and tomographic reconstruction

This paper studies ill-posed tomographic imaging problems where the image is sparsely represented by a non-negative linear combination of Gaussians. Our main contribution is to develop a scheme for directly recovering the Gaussian mixture representation of an image from tomographic data, which here is modeled as noisy samples of the parallel-beam ray transform. An important aspect of this non-convex reconstruction problem is the choice of initial guess. We propose an initialization procedure based on a filtered back-projection type operator tailored for the Gaussian dictionary. This operator can be evaluated efficiently using an approximation of the Riesz potential of an anisotropic Gaussian, which in turn is based on an exact closed form expression for the Riesz potential of an isotropic Gaussian. The proposed method is evaluated on simulated data.


Introduction
An inverse problem aims to recover a hidden signal from indirect observations (data). Such problems arise in many areas of science and industry, for instance in applications that involve tomographic imaging, where the signal is the image and the data consists of measured tomographic projections. The recovered signal is however rarely the end-goal; instead it often serves as input to a down-stream, task-specific post-processing procedure.
Reconstruction and task-specific post-processing are commonly performed sequentially. As outlined in [ALV+22], much attention has been devoted to the possibility of performing these tasks jointly, e.g., empirical evidence shows that the post-processing task may act as an additional regularizer. An example is when tomographic image reconstruction is performed jointly with dictionary learning and sparse coding. Hence, sparse coding does not only act as a regularizer through a dictionary based prior, it is also part of a post-processing step that transforms the image into a more suitable representation. The case considered in this paper is that of Gaussian dictionaries, which are more suitable than pixel/voxel-based representations for certain image features, such as representing molecules.
Ill-posedness and regularization. Inverse problems are however often ill-posed, meaning that there can be multiple solutions consistent with data and/or solutions are sensitive to (small) variations in data (instability). Hence, besides having access to a sufficiently accurate model for simulating noise-free data (forward operator), a key component in solving an inverse problem lies in handling this intrinsic instability. This cannot be addressed by abundantly sampling data or by controlling the discretization error when evaluating the forward operator and its inverse (assuming it is formally invertible). Instead, stability is enforced by making sure the recovered signal is sufficiently consistent w.r.t. a prior model (regularization).
The aforementioned forward operator is typically derived from knowledge about the experiment used for generating observed data. This also applies to the statistical model for the observation error that is used by some solvers. In contrast, the prior model encodes a priori information about the true (unknown) signal that is to be recovered, and therefore it should not depend on the observed data.
Much of regularization theory has been developed for inverse problems where both signal and data naturally reside in some function spaces. Prior models will in this setting typically encode desirable regularity/roughness properties. Resulting solvers have been successful, but there is rarely a straightforward way to relate these prior models to the a priori information that domain experts normally possess. The effort to address this has resulted in the development of prior models with ever increasing complexity, as is the case for various dictionary based priors.
Dictionary based priors. Generative modeling can be used to define parametrized models for synthesizing images. Such models have been used to define priors that have successfully been applied to solve ill-posed inverse problems in imaging.
A natural approach for defining a generative model of the above type is to base it on a dictionary, which is a domain adapted representation of images that 'preserves' desirable features. The typical way to represent images is by arrays that correspond to pixel/voxel values. For generative models it is however advantageous to consider representations that better reflect the particular aspects of the underlying use case. In particular, we seek interpretable multi-scale representations.
A prime example is sparse representations, which have become popular regularizers for inverse problems that are ill-posed due to lack of data (under-determined problems). Regularization is here through a prior that is based on representing the image and/or data in a sparse way w.r.t. a dictionary that retains essential features [JMS17]. This type of regularization is an essential part of the non-linear sampling theory developed within compressed sensing [CW08].
Joint dictionary learning, reconstruction, sparse coding. Using priors given by generative models that are parameterized by dictionaries involves selecting an appropriate domain specific dictionary, e.g., by learning it from specific training data. The corresponding learned generative model can then be used in yet another variational model to regularize ill-posed inverse problems (indirect sparse coding).
Learning the dictionary separately comes with the issue of selecting representative training data for dictionary learning. An alternative approach that avoids this is to formulate a variational model for joint dictionary learning and reconstruction by indirect sparse coding. The corresponding generative model used for regularization is then specifically tailored to the particular inverse problem, so there is no need for selecting a training data set. Such an approach was successfully applied to denoising [EA06, Ela10], i.e., to perform dictionary learning jointly with denoising of images. It is also applicable to other signal processing tasks, like deblurring and inpainting, where the signal and data reside in the same space. The situation, however, becomes more complicated for inverse problems where the signal and data reside in different spaces, as in tomographic imaging. The reason is that the dictionary one seeks to learn from noisy data is used for generating signals that reside in a different space.
One can set up a variational model for joint dictionary learning and reconstruction as in (7), but this often results in a very challenging minimization of a non-convex and non-smooth objective. In certain cases it is however possible to use the forward operator to 'transport' a dictionary on signals to one on data, as outlined in section 2.3. This is the case for dictionaries given by Gaussian mixture models (GMMs) and a forward operator given by the ray transform. A variational model for joint dictionary learning and reconstruction that utilizes this additional information is formulated in (13). This in turn opens up novel approaches to joint GMM based dictionary learning and tomographic reconstruction. In addition, GMM based dictionaries are also suitable for multi-scale imaging problems since dictionary atoms can be made to represent image features at various scales in a controlled and interpretable manner.

Summary of contributions
We propose a variational model in (15) for joint dictionary learning and sparse reconstruction (indirect coding). A particular advantage of this variational model is its ability to use the forward operator in order to relate a dictionary based generative model on signals to one on data. This is possible for dictionaries based on GMMs and a forward operator given by the ray transform.
We further propose to approximately solve the non-convex optimization problem in (15) using a greedy strategy outlined in (16). Specific attention is devoted to tomographic image reconstruction and GMM dictionaries, where we suggest a novel initialization for the greedy strategy in (16). This initialization is given by an analytic pseudo-inverse of the ray transform of filtered back-projection (FBP) type that is tailored to GMM dictionary atoms (see section 4.2). A key element in deriving the latter is to obtain closed form expressions for the Riesz potential of an isotropic Gaussian together with a corresponding closed form approximate expression for the Riesz potential of an affinely transformed Gaussian (GMM dictionary atom), see theorem 4.1 and lemma 4.2. The greedy method is demonstrated for GMM dictionaries on simulated 2D sparse view tomography data, a problem with finitely many projection directions [Far97, Nat01, Mar06] that is known to be severely ill-posed. In practice, this severe instability can be partially, if not totally, addressed by a sufficiently strong prior model, which in this case is the choice of GMM dictionaries.

Organization of paper
Section 2 provides a mathematical formalization of various inverse problems, ultimately leading to the one in (6) (joint dictionary learning and indirect coding) that we seek to solve. It then introduces various variational models, ultimately leading to the variational model in (15) that forms the basis for our approach. Section 3 recalls some facts about the ray transform and GMMs. In particular, section 3 contains a proof of the fact that the ray transform of a GMM is a GMM. In section 4 we show how the Riesz potential intertwines with affine group actions, and specialize the proposed method in section 2.4 to tomographic reconstruction with Gaussian dictionaries. Section 5 contains numerical results for a synthetic 2D tomography problem. In section 6 we summarize our conclusions.

Mathematical formalization
As stated in the introduction, our aim is to perform dictionary learning jointly with sparse coding from data representing indirect noisy observations of a signal. The motivation is applications where the signal is more appropriately represented in such a learned dictionary, i.e., the sparse coding part is here not only a regularizer; it is the desired way of representing the signal.

Generative models, dictionaries and dictionary learning
A generative model defines a subset of signals in X that can be generated from a latent space representation in Z by means of a synthesis operator S : Z → X, i.e., signals of the form f = S(α) for some α ∈ Z.
Many generative models start out by specifying a dictionary, which is a finite, or countably infinite, subset D := {φ_i}_i ⊂ X. A signal is here synthesized by taking a linear combination of dictionary atoms φ_i ∈ D. This yields a linear synthesis operator S_D : Z → X parametrized by the (countable) dictionary, where the latent space Z is naturally a sequence space (vector space of finite/infinite sequences of scalars) and

    S_D(α) := Σ_i α_i φ_i   for α = (α_i)_i ∈ Z.   (1)

Coding refers to the inverse problem of recovering a latent space representation α ∈ Z for a signal f ∈ X assuming the following generative model:

    f = S_D(α)   for some α ∈ Z.   (2)

This is the reverse of synthesis, i.e., it amounts to finding an 'inverse' to the synthesis operator. The synthesis operator is rarely injective, e.g., due to dimension mismatch, so the aim is to define an appropriately regularized inverse C_D : X → Z (coding operator). A common approach is to regularize the coding through a variational model:

    C_D(f) ∈ arg min_{α∈Z} L_X(S_D(α), f) + R_λ(α).   (3)

Here, L_X : X × X → R quantifies similarity in X and R_λ : Z → R is the (parametrized) regularizer that ensures the coding defined in (3) is a well-defined stable mapping that serves as a 'pseudo-inverse' of S_D.
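To make (1)-(3) concrete, here is a minimal numerical sketch (not part of the paper's method) for a finite dictionary, with L_X the squared norm and the convex-relaxation choice R_λ(α) = λ‖α‖_1, solved by a basic proximal gradient (ISTA) loop; the dictionary and signal are synthetic placeholders.

```python
import numpy as np

def synthesis(D, alpha):
    """Linear synthesis operator S_D in (1); columns of D are the atoms phi_i."""
    return D @ alpha

def coding(D, f, lam=0.05, n_iter=500):
    """Variational coding (3) with L_X the squared norm and R_lam = lam*||.||_1,
    solved by ISTA (proximal gradient)."""
    alpha = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2          # 1/Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = alpha - step * (D.T @ (D @ alpha - f))  # gradient step on the L_X term
        alpha = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return alpha

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 200))                  # overcomplete dictionary
alpha_true = np.zeros(200); alpha_true[[3, 77]] = [1.0, -2.0]
f = synthesis(D, alpha_true)
print(np.flatnonzero(np.abs(coding(D, f)) > 0.1))   # typically recovers the support {3, 77}
```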
Remark 2.1. An alternative definition of the coding is to replace the penalty R_λ(α) in the objective of (3) with a constraint of the form R_λ(α) ≤ δ, or to replace the L_X-term in the objective of (3) with a constraint of the form L_X(S_D(α), f) ≤ ε.
This far, the dictionary that parametrizes the generative model is assumed to be given. Dictionary learning is the procedure for determining a suitable dictionary, either from some example data or jointly as part of the coding step. The latter, henceforth referred to as joint dictionary learning and coding, is the inverse problem of recovering a dictionary D ∈ 𝔻_X and a latent space representation α ∈ Z from a signal f ∈ X assuming (2) holds. Here, 𝔻_X is a fixed class of dictionaries on X. A natural approach for solving this is to include D in the optimization in (3), so the joint dictionary learning and coding operator is a mapping C : X → 𝔻_X × Z given by

    C(f) ∈ arg min_{D∈𝔻_X, α∈Z} L_X(B_X(D, α), f) + R_λ(α).   (4)

Here, B_X : 𝔻_X × Z → X is simply the synthesis operator in (1) extended to 𝔻_X × Z, i.e.,

    B_X(D, α) := S_D(α)   with S_D : Z → X given as in (1).   (5)

Joint dictionary learning and indirect coding
A further difficulty arises when the signal is only indirectly observed through some noisy data. Performing coding in this setting will henceforth be referred to as indirect coding. If the signal is generated by the model in (2) with a known dictionary D_X ∈ 𝔻_X, then indirect coding is formally the inverse problem of recovering a latent space representation α ∈ Z of an unknown signal in X from noisy data g ∈ Y where

    g = A(S_{D_X}(α)) + e   for some observation error e ∈ Y.   (6)

Here S_{D_X} : Z → X is the synthesis operator associated with the generative model in (2) and A : X → Y is the forward operator that models how a signal in X gives rise to data in Y in the absence of observation errors.
Joint dictionary learning and indirect coding refers to attempts at also learning the dictionary itself from the data while performing indirect coding. Stated more precisely, this is the inverse problem of recovering a dictionary D_X ∈ 𝔻_X on X and a latent space representation α ∈ Z of an unknown signal in X from noisy data g ∈ Y assuming (6) holds. One can here re-use the regularizer R_λ : Z → R in (3) for the coding part, thus leading to the following variational model for joint dictionary learning and indirect coding of signals in X:

    arg min_{D_X∈𝔻_X, α∈Z} L_Y(A(S_{D_X}(α)), g) + R_λ(α).   (7)

It is clear that solving (7) is more challenging than indirect signal coding on X with a known dictionary. In particular, the objective in (7) is non-convex even though the corresponding objective for indirect coding with a known dictionary is convex. Hence, 'good' initialization may reduce the number of iterates needed for indirect coding with a known dictionary (important for computational feasibility). In contrast, 'good' initialization is an essential part of the regularization when solving (7).

Inducing a dictionary on data through the forward operator
The forward operator A : X → Y in (6) can be viewed as a mapping A_dict : 𝔻_X → 𝔻_Y that 'transports' a dictionary on X to a dictionary on Y; simply define

    A_dict(D) := {A(φ) : φ ∈ D}   for D ∈ 𝔻_X.   (8)

Next, if the forward operator A : X → Y and the synthesis operator S_{D_X} : Z → X in (6) are both linear, then

    A(S_{D_X}(α)) = S_{A_dict(D_X)}(α)   for all α ∈ Z.   (9)

This is an additional constraint since it describes how the prior in signal space gives rise to a 'prior' on data space. In particular, A maps a sparse representation of signals in X onto a sparse representation of data in Y.
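The identity (9) is elementary but central to what follows. A minimal finite-dimensional sanity check, with a random matrix standing in for the forward operator A and dictionary atoms as matrix columns:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 50))    # linear forward operator X -> Y
D_X = rng.standard_normal((50, 8))   # signal dictionary, atoms as columns
alpha = rng.standard_normal(8)

D_Y = A @ D_X                        # transported data dictionary A_dict(D_X), cf. (8)
assert np.allclose(A @ (D_X @ alpha), D_Y @ alpha)   # the relation (9)
```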
To formulate a variational model that uses this additional information, we start by defining the class of data dictionaries induced by the forward operator:

    𝔻_Y := {A_dict(D) : D ∈ 𝔻_X}   with A_dict : 𝔻_X → 𝔻_Y given by (8).   (10)

Next, we introduce the joint data dictionary learning and coding operator as the mapping C_Y : Y → 𝔻_Y × Z given by

    C_Y(g) ∈ arg min_{D_Y∈𝔻_Y, α∈Z} L_Y(B_Y(D_Y, α), g) + R_λ(α).   (11)

Here, L_Y : Y × Y → R is the data fidelity term and B_Y : 𝔻_Y × Z → Y is the corresponding data synthesis operator that is defined as in (5). The diagram in (12) summarizes the relation between the above mappings. Note here that due to (9), we have

    B_Y(A_dict(D), α) = A(B_X(D, α))   for D ∈ 𝔻_X and α ∈ Z.

We can now revise the variational model in (7) for joint dictionary learning and indirect sparse coding to account for the above additional information by considering

    arg min_{D∈𝔻_X, α∈Z} L_{𝔻_Y×Z}((A_dict(D), α), C_Y(g)).   (13)

Here, L_{𝔻_Y×Z} is the induced loss in (14) that quantifies similarity in the data dictionary and latent space. Using a sparse coding regularizer R_λ : Z → R, which means choosing R_λ(α) := λ‖α‖_0, leads to the following sparse coding variant of (13):

    arg min_{D∈𝔻_X, α∈Z} L_{𝔻_Y×Z}((A_dict(D), α), C_Y(g)) + λ‖α‖_0.   (15)

For certain dictionary classes 𝔻_X and forward operators A : X → Y, there is a direct efficient software implementation of A_dict in (10). An example is the ray transform, which maps a Gaussian onto a Gaussian, so a GMM based dictionary on signals will be mapped to another GMM based dictionary on data. We refer to section 3.2 for further details. This addresses an important computational feasibility problem. Furthermore, in this setting with a data fidelity term L_Y given as the squared two-norm, one can also analytically evaluate the induced loss in (14).

Continuous orthogonal matching pursuit (COMP)
There is no generally applicable and computationally feasible method for solving non-differentiable and non-convex ℓ0-regularized problems, like the one in (15). One therefore typically resorts to various approximation techniques. As an example, it is common to replace the computationally intractable ℓ0 pseudo-norm with the ℓ1-norm through convex relaxation and then update D and α in an alternating manner as part of an intertwined iterative scheme.
The ℓ1-norm from the convex relaxation does not enforce sparsity in a strict sense. The approach taken here is based on another approximation that is offered by greedy methods such as matching pursuit (MP) [MZ93] and its variants [PRK93, TGA13, KYHP14], which build up a signal by gradually increasing its ℓ0-norm. One then considers an iterative scheme of the form (16), in which one new dictionary atom is added and all continuous parameters are refined at each step. Here we assume that 𝔻_X and 𝔻_Y are smooth manifolds and that L_{𝔻_Y×Z}(·, C_Y(g)) and A_dict are differentiable, which opens up for using smooth optimization techniques based on gradient descent (GD).
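The following is a schematic sketch of such a greedy scheme, assuming a data misfit given by the squared norm and using scipy's L-BFGS-B (the optimizer used in section 5); the callables `forward` and `init_new_atom` are hypothetical problem-specific stand-ins, with the latter playing the role of the initialization discussed next.

```python
import numpy as np
from scipy.optimize import minimize

def comp(g, forward, init_new_atom, n_atoms, p_dim):
    """Schematic greedy (COMP-style) loop for (16).

    theta holds the continuous parameters of all atoms added so far, one row
    per atom. `forward(theta)` synthesizes data from them (and must return
    zeros for an empty theta); `init_new_atom(residual)` proposes parameters
    for the next atom, e.g. from a pseudo-inverse of the residual.
    """
    theta = np.zeros((0, p_dim))
    for _ in range(n_atoms):
        residual = g - forward(theta)
        theta = np.vstack([theta, init_new_atom(residual)])
        loss = lambda t: np.sum((g - forward(t.reshape(-1, p_dim))) ** 2)
        # Refine all continuous parameters jointly (smooth, gradient-based step).
        theta = minimize(loss, theta.ravel(), method="L-BFGS-B").x.reshape(-1, p_dim)
    return theta
```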
Initialization. Since the functional in (16) is typically non-convex, the choice of initialization is important when one seeks to minimize it. Stochastic initializations, such as the one proposed for a similar setting in [KBGP18], come with certain disadvantages, e.g., they might produce non-repeatable results. Therefore, we promote choosing a deterministic initialization adapted to a certain class of dictionaries.
More precisely, if the forward operator admits an (approximate) pseudo-inverse A† : Y → X, we then propose to initialize the kth iteration of (16) based on features derived from the pseudo-inverted residual, i.e., from A†(g − A(S_{D^{k−1}}(α^{k−1}))). See section 4.3 for an example of this in the case of the ray transform and GMMs of signals.
Computational complexity. The aim here is to quantify the computational complexity associated with continuous orthogonal matching pursuit (COMP).
Schematically, each iteration of COMP involves adding a new GMM dictionary term to the signal representation. After this new term is initialized, the parameters of the signal representation are updated by minimizing a misfit function through GD. Hence, each iteration of COMP is dominated by the GD algorithm, whose complexity is given by

    O((# of data parameters) × (# of signal parameters at the kth iteration of COMP) × (# of GD iterations)).   (17)

In the following, we express the complexity in (17) for COMP in the specific case of GMM based dictionaries and the ray transform as forward operator. This will result in an expression for the computational complexity in (18) of using COMP for joint tomographic image reconstruction and decomposition using GMM based dictionaries. Note first that in general, a sparse GMM representation of data and signal aims to reduce the computational complexity through the first two terms in (17), and a good initialization aims to reduce the number of GD iterations (third term in (17)) without compromising computational complexity. The representation of data is initially given by the data acquisition; however, one can re-express the data in a more compact form by considering its relation to the signal through the forward operator. Now, in general, a sparse GMM representation for the signal does not necessarily imply a sparse GMM representation for the corresponding data, and conversely, GMM sparsity in data does not necessarily imply GMM sparsity in signal. However, in the case of the ray transform we know that a GMM representation of a signal is transformed to data given by a GMM that has at most the same number of terms. Hence, a computational framework with fast forward and (pseudo-)inverse operators expressed directly in terms of GMMs becomes attractive in problems where signals are naturally given in terms of GMMs, like those encountered in high resolution electron microscopy.
To derive (18), consider a Gaussian defined over R^n. This GMM dictionary element is fully specified by 1, n, and n(n+1)/2 parameters representing its amplitude, mean vector, and precision matrix, respectively, making a total of (n+1)(n+2)/2 parameters. Hence, a signal over R^n that is synthesized from k GMM terms is specified through k(n+1)(n+2)/2 parameters. Note that the amplitudes are linear whereas the mean vector and the precision matrix are non-linear parameters of the GMM. Although all parameters contribute similarly to the computational complexity of GD, exploiting the linear and non-linear nature of the parameters through the variable projection method can potentially help with the convergence rate as well as reduce the number of iterations, consequently reducing the computational complexity of the GD [GP73, GP03]. For each projection direction, the ray transform of the GMM signal will be a GMM over R^{n−1} with at most k·n(n+1)/2 parameters. Thus, the computational complexity for m projection directions at the kth iteration of COMP based reconstruction will be proportional to

    m × k·n(n+1)/2 × k·(n+1)(n+2)/2 × (# of GD iterations).   (18)

Here m is governed by the data acquisition process, and k is governed by the nature of the signal. Note in particular that the latter could be a very large number for a signal that is compressible in the GMM dictionary. Finally, another advantage of using GMM representations of signals is that their ray transform and the partial derivatives of the forward operator w.r.t. the GMM parameters can be computed analytically (see section 3), thus making them attractive for efficient functional and gradient evaluations without compromising accuracy. Conversely, although the inverse ray transform of arbitrary GMM data does not have an explicit analytic expression, we derive an analytic approximate pseudo-inverse for the ray transform, namely the Gaussian FBP (see section 4.2), which is used for initialization of the COMP iterations without introducing additional computational complexity to the inversion, with the intention to reduce the number of GD iterations.
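For the 2D experiment in section 5 (n = 2, k = 9 Gaussians, m = 14 directions), these counts work out as in the following snippet:

```python
# Parameter counts behind (17)-(18) for the experiment in section 5:
n, k, m = 2, 9, 14                           # dimension, number of Gaussians, directions
per_atom_signal = (n + 1) * (n + 2) // 2     # amplitude (1) + mean (n) + precision (n(n+1)/2)
per_atom_data = n * (n + 1) // 2             # a projected Gaussian lives on R^{n-1}
print(k * per_atom_signal, m * k * per_atom_data)    # 54 signal and 378 data parameters
```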

Tomography and Gaussian mixture models
Data collected in 3D cryo-electron microscopy (cryo-EM) can, after appropriate pre-processing, be interpreted as noisy samples of the ray transform, i.e., intensities in a 2D cryo-EM image represent line integrals of a function representing the 3D structure of the specimen being imaged [Ökt15]. Acquiring multiple 2D cryo-EM images from different directions gives rise to a tomography problem that involves inverting the ray transform to infer the 3D structure of the specimen.

Ray transform
A straight line in R^n can be parameterized by a direction ϕ ∈ S^{n−1} and a point x ∈ ϕ⊥ as {x + tϕ : t ∈ R}, so the manifold of lines in R^n can be identified with

    T^n := {(ϕ, x) : ϕ ∈ S^{n−1}, x ∈ ϕ⊥}.

We let S(R^n) denote the Schwartz space over R^n, i.e., the vector space of all smooth functions whose derivatives are rapidly decreasing. In the following we define the ray transform and state some standard results from integral geometry.
Definition 3.1. The ray transform integrates a function along straight lines, i.e., it is the mapping P : S(R^n) → S(T^n) defined as

    (P f)(ϕ, x) := ∫_R f(x + tϕ) dt   for (ϕ, x) ∈ T^n.

The ray transform along a fixed direction ϕ ∈ S^{n−1} is defined as P_ϕ f := (P f)(ϕ, ·).

Lemma 3.2 ([Nat01]). The adjoint of P_ϕ is given by (P_ϕ* g)(x) = g(x − (x·ϕ)ϕ). Moreover, the adjoint of P, also called the back-projection operator, satisfies

    (P* g)(x) = ∫_{S^{n−1}} g(ϕ, x − (x·ϕ)ϕ) dϕ.

When combined with the Riesz potential, the back-projection P* can be used in defining an inversion formula for the ray transform (theorem 3.5).
Definition 3.3. The Riesz potential of f ∈ S(R^n) for β < n is defined as

    (I^β f)(x) := (|ξ|^{−β} f̂(ξ))ˇ(x).

Here, f̂(ξ) and ǧ(x) denote the Fourier transform and its inverse:

    f̂(ξ) := (2π)^{−n/2} ∫_{R^n} f(x) e^{−i x·ξ} dx,   ǧ(x) := (2π)^{−n/2} ∫_{R^n} g(ξ) e^{i x·ξ} dξ.

Theorem 3.5 (Inversion formula for the ray transform [Nat01]). For f ∈ S(R^n) and β < n,

    f = c(n, β) I^{−β} P* (I^{β−1} (P f)),   (23)

where c(n, β) is an explicit constant and the Riesz potential and Fourier transform on T^n are defined as in definition 3.3, acting in the detector variable x ∈ ϕ⊥ ≅ R^{n−1}. In practice, considering its Fourier representation, the Riesz potential operator is referred to as a filtering operator. Choosing β = 1 in (23) forms the basis for the back-projection filtering method for inverting the ray transform:

    f = c(n, 1) I^{−1} (P* P f).

Similarly, β = 0 in (23) forms the basis for the FBP method for inverting the ray transform:

    f = c(n, 0) P* (I^{−1} P f).

Gaussian mixture models of signals and their ray transform
Focus here is on reconstructing a non-negative linear combination of Gaussians, also known in the statistics community as a GMM representation, from data representing finitely many ray transform projections. A GMM representation uses a dictionary whose atoms are given by continuous functions (Gaussians). These are, compared to patch based dictionaries, better suited for interpolation in between sample points of the signal/data. In particular, when learning a GMM representation from examples, one can adaptively learn the (effective) support of the GMM dictionary atoms (decay rate of the Gaussians) instead of having it as a fixed pre-defined hyper-parameter in terms of a patch size as in K-SVD. Next, we do not normalize GMMs to have unit L¹ norm, so a GMM representation is any non-negative linear combination of Gaussians. Finally, note also that GMMs span L² [MMS07, NSS11] and the Schwartz space is contained in L², so GMMs span the Schwartz space. Therefore, the theory in section 3.1 applies directly to functions defined through such GMM representations.

Finally, in certain applications, like electron microscopy, one can also give a physical justification for the choice of Gaussian dictionary atoms. Here, molecular structures naturally decompose into components that have 'blob'-like appearances, so it is natural to represent a molecular structure by a GMM. A concrete example of this is when 3D electron microscopy is used to quantitatively characterize complex structures and morphologies at the nano-scale. Relevant features of such structures typically have 'blob'-like appearances and ultimately, at very high resolution, one may even discern the atomic composition of the material [EHD+07]. Similarly, one can in favorable circumstances image a macromolecular assembly at atomic resolution by single particle cryo-EM [NKS+20]. At such high resolution, it is natural to represent the structure of the macromolecular assembly by a GMM instead of an array of voxel values. Likewise, GMMs are also useful for representing lower resolution, or coarse-grained, 3D cryo-EM images of bio-molecules. In this connection, mixtures of a pre-set number of isotropic Gaussians were employed for 3D-EM reconstruction in [JH15, MBM18]. Moreover, sparse representations by GMMs are often used for post-processing of 3D-EM reconstructions [JS15, Kaw18].
Let Θ_n := R^n × PSD_n with PSD_n denoting the set of positive semi-definite real n × n matrices. For (μ, Σ) ∈ Θ_n we define g_{(μ,Σ)} : R^n → R as the (unnormalized) Gaussian with mean vector μ and precision matrix Σ, i.e.,

    g_{(μ,Σ)}(x) := exp(−½ (x − μ)ᵀ Σ (x − μ)).
The set of finite sequences in Θ_n parametrizes the class D_n of n-dimensional Gaussian dictionaries:

    D_n := { {g_{(μ_i,Σ_i)}}_i : (μ_i, Σ_i) ∈ Θ_n }.

The corresponding latent space Z_n consists of finitely supported sequences of non-negative amplitudes. A GMM G(x) is then a function of the form S_D(α) with D ∈ D_n and α ∈ Z_n, or, more explicitly:

    G(x) = Σ_i α_i g_{(μ_i,Σ_i)}(x)   with α_i ≥ 0.

For a real valued function f : R^n → R let us define the left regular action of the affine group by

    (W(A, x_0) f)(x) := f(A^{−1}(x − x_0)),

where x_0 is referred to as the translation and A is a composition of rotations, dilations, and shears. Then any non-degenerate Gaussian may be obtained from the standard Gaussian g(x) := e^{−|x|²/2}, since g_{(μ,Σ)} = W(A, μ) g whenever Σ = A^{−T} A^{−1}.

We now claim that the ray transform of a GMM for a fixed angle is again a GMM. To see this, first we note that P_ϕ(W(A, x_0) f) is again an affinely transformed function on ϕ⊥, where we have tacitly identified the orthogonal complement of ϕ with R^{n−1}. Next, for a single Gaussian term a·g_{(μ,Σ)}, a direct computation of the Gaussian integral along the line {x + tϕ : t ∈ R} gives

    P_ϕ(a·g_{(μ,Σ)}) = a·√(2π/(ϕᵀΣϕ)) · g_{(μ′,Σ′)}   on ϕ⊥,

where μ′ is the orthogonal projection of μ onto ϕ⊥ and Σ′ is the restriction to ϕ⊥ of

    Σ − (Σϕ)(Σϕ)ᵀ/(ϕᵀΣϕ).

The linearity of the ray transform together with (32)-(34) now implies that the ray transform of a GMM is indeed a GMM.
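The following snippet numerically verifies this closed form for a single anisotropic 2D Gaussian (under the unnormalized-Gaussian convention used above), comparing the projected parameters against direct quadrature of the line integral:

```python
import numpy as np
from scipy.integrate import quad

def gauss2d(x, mu, Sigma):
    d = x - mu
    return np.exp(-0.5 * d @ Sigma @ d)

mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])          # anisotropic precision matrix
theta = 0.7
phi = np.array([np.cos(theta), np.sin(theta)])      # integration direction
psi = np.array([-np.sin(theta), np.cos(theta)])     # unit vector spanning phi-perp

# Closed-form projection: a 1D Gaussian in the offset s along psi.
pSp = phi @ Sigma @ phi
amp = np.sqrt(2 * np.pi / pSp)                      # amplitude factor
mean1d = psi @ mu                                   # projected mean
prec1d = np.linalg.det(Sigma) / pSp                 # restriction of the rank-one update (n = 2)

for s in (-1.0, 0.3, 2.0):
    num, _ = quad(lambda t: gauss2d(s * psi + t * phi, mu, Sigma), -np.inf, np.inf)
    closed = amp * np.exp(-0.5 * prec1d * (s - mean1d) ** 2)
    assert np.isclose(num, closed, rtol=1e-6)
```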

Reconstructing a GMM from its ray transform
The proposed reconstruction method is a pursuit type greedy reconstruction method as discussed in section 2.4. It is based on theorem 3.5, where it is essential to compute the Riesz potential of GMMs. Thus, we start by studying some properties of Riesz potentials of GMMs before introducing the Gaussian filtered back-projection (G-FBP) operator that is utilized in the proposed reconstruction method.

Riesz potentials of GMMs
Because each term of a GMM is an affinely transformed Gaussian, we study how the Riesz potential interacts with actions of the affine group in data and signal domains.
Theorem 4.1. Let f ∈ S(R^n) and β < 0. Then:

(a) The Riesz potential commutes with translations, i.e.,

    I^β(f(· − x_0)) = (I^β f)(· − x_0)   for x_0 ∈ R^n.   (37)

(b) The Riesz potential commutes (up to a multiplicative constant) with isotropic scaling, i.e., for σ > 0,

    I^β(f(·/σ)) = σ^β (I^β f)(·/σ).   (38)

(c) The Riesz potential commutes approximately (up to a multiplicative constant) with affine transformations. More precisely, if f ≠ 0, then the error committed by interchanging I^β with the affine action of A satisfies the two-sided bounds (39) and (40), which depend only on q. Here q := σ_max/σ_min with σ_max, σ_min denoting the maximal and minimal singular values of A^{−1}, respectively.
Proof. The claim in (37) follows directly from the fact that the Riesz potential is a convolution operator. Next, (38) is a consequence of definition 3.3 and the way the Fourier transform intertwines with affine transformations:

    (f ∘ A^{−1})^(ξ) = |det A| f̂(Aᵀ ξ).

Finally, the proofs of (39) and (40) are given in appendix B.
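Property (37) can be checked numerically with a discrete, periodic realization of definition 3.3, where the Riesz potential is a Fourier multiplier and a circular shift plays the role of translation; the check below is exact up to floating point since both operations are diagonalized by the FFT:

```python
import numpy as np

def riesz(f, dx, beta):
    """Discrete Riesz potential I^beta (beta < 0) on a periodic grid:
    Fourier multiplier |xi|^{-beta}, cf. definition 3.3."""
    xi = 2 * np.pi * np.fft.fftfreq(f.size, d=dx)
    return np.fft.ifft(np.fft.fft(f) * np.abs(xi) ** (-beta)).real

x = np.linspace(-20, 20, 1024, endpoint=False)
dx = x[1] - x[0]
f = np.exp(-0.5 * x**2)          # standard 1D Gaussian
beta, shift = -1.0, 64           # translate by an integer number of grid cells

# (37): the Riesz potential commutes with translations.
lhs = riesz(np.roll(f, shift), dx, beta)
rhs = np.roll(riesz(f, dx, beta), shift)
assert np.allclose(lhs, rhs)
```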
Our next result describes how the Riesz potential transforms a Gaussian.

Gaussian filtered back-projection (G-FBP)
Let Φ := {ϕ_1, . . ., ϕ_m} ⊂ S^{n−1} be a collection of projection directions given by the acquisition geometry, and let P_Φ denote the restriction of the ray transform to these directions, i.e., P_Φ f := (P_{ϕ_1} f, . . ., P_{ϕ_m} f).
We also let P†_Φ denote the finite-angle FBP operator that corresponds to P_Φ, i.e., the FBP formula (β = 0 in (23)) in which the integral over S^{n−1} in the back-projection is replaced by a sum over the directions in Φ.

Remark 4.3. The restriction of FBP-type reconstruction operators for inverting the ray transform with incomplete data has been studied in both two- and three-dimensional settings [Qui88, QÖ08, Fri13, BFJQ18]. These papers consider the continuum data setting, i.e., the case where one has a continuum set of projection directions. Missing data is treated as zero, so reconstructions suffer from structured artifacts. A way to reduce these artifacts is through completion of the missing data using some information about the object of interest [Nat01, AKT+18, BAKG21]. Such extrapolation in the data domain is difficult in cases where data is sparse view and/or highly noisy. The approach taken in this work aims to achieve a similar effect by exploiting the GMM nature of the object being imaged, as demonstrated in section 5, thus avoiding any explicit data completion step.
As mentioned in section 3.2, the ray transform of a Gaussian along a single direction is again a Gaussian, hence P_Φ has a well-defined corresponding dictionary domain operator P_{G,Φ}. We next derive an approximate explicit expression for the corresponding G-FBP operator P†_{G,Φ}. Since P†_Φ is a linear operator, it is sufficient to consider P†_{G,Φ} acting on a single affinely transformed Gaussian at a single orientation ϕ_i. Consequently, theorem 4.1 enables us to make the approximation in (47), where c is a constant that can be chosen depending on the bounds in theorem 4.1(c); see appendix B for explicit forms of c. Furthermore, note that the evaluation of (47) requires computing the Riesz potential of a Gaussian, which in turn is given by lemma 4.2. Finally, note also that the approximation in (47) becomes an equality in the case of isotropic dilation, i.e., when A = σI.
Remark 4.4. The formulation of most iterative schemes, like COMP in algorithm 1, does not depend on the specific acquisition geometry. In that sense, the theory developed here is valid for any dimension n. On the other hand, using the G-FBP operator (defined in section 4.2) for initialization is an important part of COMP.
The definition of the G-FBP operator does depend on the acquisition geometry. We use a variant based on having data that is sampled from the full manifold of lines. However, in many 3D image reconstruction problems, data typically corresponds to sampling the ray transform on lines in a 3D sub-manifold instead of the full 4D manifold of lines. In these cases one cannot use the G-FBP operator in the form that is given in section 4.2. Nevertheless, in certain applications, like single particle cryo-EM, data is actually (randomly) sampled on the full 4D manifold of lines. For this type of 3D problem the G-FBP operator in section 4.2 can be used.
The results in the previous section can now be used to compute the Riesz potential, and therefore the G-FBP, of a function represented by a GMM. First, note that lemma 4.2 together with (37), (38) and (47) yields an exact expression for the Riesz potential of a function represented by a GMM with isotropic Gaussians and its G-FBP. Since any one-dimensional Gaussian is isotropic, this covers the case n = 2. When combined with (39) and (40), we obtain an approximation of the Riesz potential of a function represented by a GMM with anisotropic Gaussians and, consequently, a closed form approximate inverse of the ray transform (G-FBP) acting on GMM representations of projection data using (47).
Remark 4.5. While a Gaussian approaches zero for large arguments, the absolute value of the confluent hypergeometric function tends to infinity as the absolute value of its third argument tends to infinity. Hence, some care must be taken when numerically evaluating the right-hand side of (43) for large x. We here use a generalized Padé approximation [YF15] to approximately compute the right-hand side of (43); the details are given in the appendices.

A greedy reconstruction
As discussed in section 2.4, we adopt COMP for a greedy reconstruction of the signal from its ray transform; the steps are presented in algorithm 1. Continuity and orthogonality are achieved by minimization of the L²-loss (step 10 of algorithm 1) with respect to the continuous parametrization of the GMM. The matching pursuit is guided by the choice of the new dictionary element at each iteration (step 7 of algorithm 1), which is performed as follows.
Consider an n-dimensional GMM as the signal of interest. For each projection direction, because the ray transform maps an n-dimensional GMM to an (n−1)-dimensional GMM, the pre-image of the mean vectors within the corresponding projection contains the mean vectors of the Gaussians within the signal. Consequently, the pre-images of all the mean vectors of the projection GMMs intersect, creating a grid of points (not necessarily regular) in the signal domain which contains the mean vectors of the GMM signal. In practice, one may not be able to recover the parameters of the projection GMMs exactly, only approximately. As a result, pre-images of the mean vectors may not intersect for n > 2. However, an appropriate grid can still be formed, as sketched below. First, each projection data g_j is decomposed into an (n−1)-dimensional GMM (D^{(j)}, α^{(j)}), for j = 1, . . ., m; more precisely, we use a method from [ZY21] for this data decomposition step. A set B ⊂ R^n of grid points is then pre-computed based on the decomposed data as follows. Let μ_{j,s} denote the sth mean vector of D^{(j)}, for s = 1, . . ., k_j. Let ℓ_{j,s} denote the line formed by back-projecting μ_{j,s} along ϕ_j. Now, for j ≠ j′ and s = 1, . . ., k_j, s′ = 1, . . ., k_{j′}, let x_{j,j′,s,s′} denote the point that lies closest to ℓ_{j,s} and ℓ_{j′,s′}. Finally, let B be the set of all these points. Although the data decomposition may not be perfect, when back-projected it provides a sufficient starting point for an initial guess in the signal domain before being refined to match the data in step 10 of algorithm 1.
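The point closest to a pair (or any collection) of back-projected lines can be found by least squares; a minimal sketch (illustrative, not the paper's exact implementation):

```python
import numpy as np

def closest_point(lines):
    """Least-squares point minimizing the sum of squared distances to the given
    lines, each given as (p, d) with p on the line and d a unit direction.
    Assumes the directions are not all parallel."""
    n = lines[0][0].size
    M, b = np.zeros((n, n)), np.zeros(n)
    for p, d in lines:
        P = np.eye(n) - np.outer(d, d)   # projector onto the complement of d
        M += P
        b += P @ p
    return np.linalg.solve(M, b)

# Two back-projected lines l_{j,s}: detector offsets mu in phi-perp, directions phi.
phi1, phi2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mu1, mu2 = np.array([0.0, 2.0]), np.array([3.0, 0.0])
print(closest_point([(mu1, phi1), (mu2, phi2)]))   # -> [3. 2.], their intersection
```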
Let q := P†_{G,Φ}(g_G) − P†_{G,Φ}(P_G(D, α)) denote the pseudo-inverted residual. A point μ ∈ B is found such that q(μ) ≥ q(x) for all x ∈ B. The center μ_0 is defined as a local maximum of q, initialized at μ. The initialization of each Gaussian at a local maximum of the residual is motivated by theorem C.1 in the appendix, which provides an upper bound on the distance from a local maximum of a GMM to the set of mean vectors. The precision matrix Σ_0 is computed based on the Hessian H of q, in such a way that if q happened to be a single Gaussian centered at μ_0, then Σ_0 would be its precision matrix. In the computation of the Hessian we take advantage of the fact that the G-FBP produces a smooth function, in contrast to ordinary (discrete) FBP implementations. The new Gaussian atom is then defined as D_new := (μ_0, Σ_0). Finally, the amplitude α_new is computed as the global minimizer of the squared L²-norm of the residual. The steps for introducing and initializing the new Gaussian are summarized in algorithm 2. For the sake of reducing computational time, one may modify step 1 to compute an approximation of q by considering the closed form analytic approximation of P†_{G,Φ} using (47).
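A minimal sketch of this initialization (algorithm 2) for a smooth residual q, with the local maximization done by scipy and the Hessian by central finite differences; h and the optimizer are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def init_atom(q, mu_start, h=1e-3):
    """Sketch of the new-atom initialization: refine mu_start to a local maximum
    mu0 of the residual q, then set Sigma0 = -H/q(mu0) so that a pure Gaussian
    residual a*exp(-(x-mu0)'S(x-mu0)/2) yields Sigma0 = S exactly."""
    mu0 = minimize(lambda x: -q(x), mu_start).x
    n = mu0.size
    H = np.empty((n, n))
    for i in range(n):                  # central finite-difference Hessian
        for j in range(n):
            ei, ej = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (q(mu0 + ei + ej) - q(mu0 + ei - ej)
                       - q(mu0 - ei + ej) + q(mu0 - ei - ej)) / (4 * h * h)
    return mu0, -H / q(mu0)

# Sanity check on an exactly Gaussian residual:
S_true = np.array([[2.0, 0.3], [0.3, 1.0]])
center = np.array([1.0, 2.0])
q = lambda x: 1.7 * np.exp(-0.5 * (x - center) @ S_true @ (x - center))
mu0, Sigma0 = init_atom(q, np.array([0.8, 1.9]))
print(np.round(mu0, 3), np.round(Sigma0, 3))   # ~[1. 2.] and ~S_true
```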

Numerical experiments
As a proof of concept for the proposed method, we applied it to simulated projections of a 2D phantom. As stated earlier, the inverse problem we consider is not traditional tomographic image reconstruction; it is to recover the GMM dictionary and the GMM representation of a signal from noisy under-sampled tomographic data. As a consequence, the reconstructions are not 2D images; they are the tabulated GMM coefficients given in tables 1-4. We do nevertheless choose a phantom that resembles the classical Shepp-Logan phantom, but this is merely to have an object that can both be expressed using GMMs (so we have the ground truth in table 1) and at the same time is well recognized by the imaging community.
First, using the ODL software framework [AKÖ17], we simulated ray transform projection data g of a grey-scale image synthesized from a phantom expressed in terms of a ground-truth GMM dictionary and dictionary coefficients (D_true, α_true) as f_true := S_{D_true}(α_true). Here, the dictionary atoms corresponding to non-zero coefficients are nine two-dimensional Gaussians, see table 1. We used N₁ = 14 equispaced angles, collectively described by Φ = {ϕ_1, . . ., ϕ_{N₁}}. For the generation of the data, the phantom was sampled on a uniform 501 × 501 grid on the box [−25, 25]², and projections were computed on the interval [−30, 30] using a uniform sampling grid with N₂ = 601 points, see figure 1 for an image of the data.
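A hedged sketch of this simulation setup in ODL; the geometry helper and noise utility below are part of ODL's public API, but the phantom atoms are placeholders (table 1 lists the actual nine Gaussians) and an installed tomography backend (e.g. astra) is assumed:

```python
import numpy as np
import odl

# Reconstruction space: 501 x 501 grid on the box [-25, 25]^2.
space = odl.uniform_discr([-25, -25], [25, 25], [501, 501])

# 14 equispaced parallel-beam angles, 601 detector points.
geometry = odl.tomo.parallel_beam_geometry(space, num_angles=14, det_shape=601)
ray_trafo = odl.tomo.RayTransform(space, geometry)

# Phantom f_true = S_{D_true}(alpha_true): a sum of unnormalized anisotropic
# Gaussians. The single atom below is a placeholder.
atoms = [(1.0, np.array([0.0, 0.0]), np.eye(2))]

def gmm(x):  # x is a meshgrid tuple of coordinate arrays
    out = 0.0
    for a, mu, S in atoms:
        d0, d1 = x[0] - mu[0], x[1] - mu[1]
        out = out + a * np.exp(-0.5 * (S[0, 0]*d0**2 + 2*S[0, 1]*d0*d1 + S[1, 1]*d1**2))
    return out

f_true = space.element(gmm)
g = ray_trafo(f_true)

# Additive white noise scaled so that 10*log10(||g||^2 / (L*sigma^2)) = 30 dB.
g_arr = np.asarray(g)
sigma = np.sqrt(np.sum(g_arr**2) / g_arr.size / 10**3)
g_noisy = g + sigma * odl.phantom.white_noise(ray_trafo.range)
```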
Noisy data was then generated as g_noisy := g + e, where e was a sample from e ∼ N(0, σ²I), and σ was chosen so that SNR := 10 log₁₀(‖g‖₂²/L) = 30 dB with L := N₁N₂σ². We also generated 'analytic data' g_G using the same operator that was used for reconstruction, i.e., g_G := P_{G,Φ}(D_true, α_true).
Then we applied algorithm 1 to g_G, g and g_noisy. For g_G no pre-processing was necessary, since it was already in GMM form. For g and g_noisy we used the method of [ZY21] for the pre-processing step (line 2 of algorithm 1). We used the L-BFGS-B method [BLNZ95] for all minimization problems in algorithms 1 and 2. For the noise-free data (g_G and g), we stopped the data decomposition as well as the reconstruction after nine iterations. For the noisy data we stopped the decomposition and reconstruction when Λ_discrete was less than 1.2 × L. We did not explicitly optimize over PSD₂; instead we optimized over A ∈ R^{2×2} and used the parametrization A ↦ AᵀA ∈ PSD₂. For evaluations of the Riesz potential of the Gaussians, we used an approximation of the form described in appendix A with M = 2. For comparison, and to give an indication of the level of noise in the data, we also performed ordinary FBP reconstructions on the discrete data g and g_noisy (without any additional denoising).
The true GMM dictionary and the true dictionary representation within that dictionary are both given in table 1; see also the leftmost image in the first row of figure 2 for the corresponding image. The recovered GMM dictionary and the corresponding dictionary representation of the image obtained from analytic data are tabulated in table 2, with the middle image in the 2nd row of figure 2 showing the corresponding image. Likewise, table 3 shows the reconstructed GMM parameters and the corresponding dictionary representation of the image from noise-free data; the corresponding image is shown as the rightmost image in the 2nd row of figure 2. Furthermore, the reconstruction of GMM parameters and the corresponding dictionary representation of the image from noisy data are tabulated in table 4, with the corresponding image shown as the leftmost image in the 2nd row of figure 4. Finally, intermediate reconstructions are shown in figure 3.
In the noise-free case, the proposed method was able to recover all nine Gaussians in the phantom. In the noisy case some of the most prominent Gaussians were recovered.

Conclusions
In order to overcome the ill-posedness of certain tomographic inverse problems, some form of sparsity w.r.t. a dictionary is often imposed on the reconstruction. Furthermore, this dictionary is ideally learned from data alongside the reconstruction as part of indirect sparse coding. Finally, in some applications, it is preferable to represent the solution in this learned dictionary.
Here we have described a new method for recovering a sparse GMM from ray transform data, and demonstrated its performance in a proof-of-concept example in 2D tomography. A key aspect is to utilize the fact that the GMM dictionary on signals is transported by the ray transform to a GMM dictionary on sinograms. In particular, from our theoretical and experimental investigations, we see how to exploit an explicit analytical formula for the Riesz potential of a Gaussian. This makes it possible to reconstruct a mixture of anisotropic Gaussians from a few ray transform projections without introducing an arbitrary sampling of the reconstruction space.
The above approach could be useful in 3D electron microscopy where GMMs are natural, see section 3.2 for further details.

For the greedy reconstruction that uses G-FBP only for initialization, the performance was similar for M = 1, 2 and 10 (see figure 5). We used M = 2 for the results presented for the greedy reconstructions.

Appendix B. Deferred proofs
Proof of Lemma 4.2. By the definition of the Riesz potential, the Fourier inversion formula, a change to polar coordinates, and using that the standard Gaussian is a fixed point of the Fourier transform, the Riesz potential of the standard Gaussian reduces to a one-dimensional radial integral. The inner spherical integral can be expressed in terms of Bessel functions by equation (2.3) in section VII.2 of [Nat01]. Hence, the claim follows by evaluating the integral

    ∫₀^∞ ρ^{n/2−β} e^{−ρ²/2} J_{(n−2)/2}(ρ|x|) dρ,

which yields the confluent hypergeometric expression in (43).

Proof of (39) and (40). Since β < 0 and f is not everywhere vanishing, the quantities involved are well defined and non-zero. The bounds then follow by combining (B.17) with Parseval's theorem and the Fourier inversion formula; moreover, since f ≥ 0, the relevant norms can be compared directly, which proves (b).

Figure 1. The red marks show the 14 angles that we used for the reconstructions.

Figure 2. Phantom and reconstructions. The term 'ana data' refers to analytic projection data from P_{G,Φ}, 'decomp data' refers to decomposed discrete projection data, 'reco' refers to reconstructions obtained from applying algorithm 1 to the analytic and decomposed data respectively, and 'diff' refers to the difference between the aforementioned reconstructions and the phantom. SNR = ∞.

Figure 3. Intermediate reconstructions.

Figure 4. The phantom and reconstructions. The term 'decomp data' refers to decomposed discrete projection data, 'reco' refers to reconstructions obtained from applying algorithm 1 to the analytic and decomposed data respectively, and 'diff' refers to the difference between the aforementioned reconstructions and the phantom. SNR = 30 dB.

Table 1. GMM parameters for the phantom shown in figure 2 (first row, left).

Table 2. Reconstructed GMM parameters from analytic data generated by the phantom in table 1, see figure 2 (2nd row, middle) for the corresponding image.

Table 3. Reconstructed GMM parameters from noise-free data, see figure 2 (2nd row, right) for the corresponding image.