
Faster PET reconstruction with non-smooth priors by randomization and preconditioning


Published 21 November 2019 © 2019 Institute of Physics and Engineering in Medicine
Citation: Matthias J Ehrhardt et al 2019 Phys. Med. Biol. 64 225019. DOI: 10.1088/1361-6560/ab3d07


Abstract

Uncompressed clinical data from modern positron emission tomography (PET) scanners are very large, exceeding 350 million data points (projection bins). The last decades have seen tremendous advancements in mathematical imaging tools, many of which lead to non-smooth (i.e. non-differentiable) optimization problems that are much harder to solve than smooth optimization problems. Most of these tools have not been translated to clinical PET data, as the state-of-the-art algorithms for non-smooth problems do not scale well to large data. In this work, inspired by big data machine learning applications, we use advanced randomized optimization algorithms to solve the PET reconstruction problem for a very large class of non-smooth priors which includes, for example, total variation, total generalized variation, directional total variation and various physical constraints. The proposed algorithm randomly uses subsets of the data and only updates the variables associated with these. While this idea often leads to divergent algorithms, we show that the proposed algorithm does indeed converge for any proper subset selection. Numerically, we show on real PET data (FDG and florbetapir) from a Siemens Biograph mMR that about ten projections and backprojections are sufficient to solve the MAP optimisation problem related to many popular non-smooth priors, thereby showing that the proposed algorithm is fast enough to bring these models into routine clinical practice.


Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Positron emission tomography (PET) is an important clinical imaging technique as it allows the function of the human body to be monitored by following a radioactive tracer. The image reconstruction process in PET is challenging as the low number of photon counts calls for Poisson noise modeling and the amount of data is excessively large on modern scanners. While most clinical systems still run non-penalized reconstructions, it has been shown that priors can improve noise control and quantification (Ahn et al 2015, Teoh et al 2015). In addition, the research of the last decade suggests that non-smooth priors, such as the total variation (Rudin et al 1992) and its relatives like total generalized variation (Benning et al 2010, Bredies et al 2010, Bredies and Holler 2015), are beneficial for imaging applications as they allow smooth variations within regions without oversmoothing sharp boundaries (Rudin et al 1992, Benning et al 2010, Bredies et al 2010, Setzer et al 2010, Anthoine et al 2012, Burger and Osher 2013, Sawatzky et al 2013, Bredies and Holler 2015, Ehrhardt and Betcke 2016, Zhang et al 2016). These priors (and their smoothed variants) have been widely studied in the context of PET (e.g. Sawatzky et al (2008), Guo et al (2009), Ahn et al (2012), Müller et al (2012), Cabello et al (2013), Wang et al (2014) and Wang and Qi (2015)) and other medical imaging modalities, e.g. computed tomography (CT) (Niu et al 2014, Gu et al 2018), photoacoustic tomography (PAT) (Boink et al 2018) and magnetic resonance imaging (MRI) (Knoll et al 2011, Ehrhardt and Betcke 2016). Modern PET scanners always come with a second anatomical modality such as CT or MRI. Non-smooth priors can also be used either to incorporate anatomical knowledge from MRI or CT into the reconstruction, e.g. Bowsher et al (2004), Ehrhardt and Betcke (2016), Ehrhardt et al (2016), Mehranian et al (2017), Schramm et al (2017) and Hintermüller et al (2018), or to jointly reconstruct PET and the anatomical CT/MRI image (Ehrhardt et al 2015, Knoll et al 2016, Mehranian et al 2018, Rasch et al 2018). Only a few optimization algorithms are capable of combining non-smooth priors and the Poisson noise model, e.g. Esser et al (2010), Figueiredo and Bioucas-Dias (2010), Setzer et al (2010), Chambolle and Pock (2011), Dupe et al (2011), Pock and Chambolle (2011), Harmany et al (2012), Krol et al (2012), Sawatzky et al (2013), Wang and Qi (2015) and Lin et al (2019), and most of these are not applicable to all regularization models mentioned above.

One of the most popular algorithms to solve the resulting non-smooth convex optimization problem is the primal-dual hybrid gradient (PDHG) algorithm4 (Esser et al 2010, Chambolle and Pock 2011, Pock and Chambolle 2011). PDHG has been used in numerous imaging studies on multiple imaging modalities, including PET, see e.g. Dupe et al (2011), Wolf et al (2013), Rigie and La Riviere (2015), Foygel Barber et al (2016), Knoll et al (2016), Zhang et al (2016), Rigie et al (2017), Schramm et al (2017) and Rasch et al (2018). While this algorithm is flexible enough to solve a variety of non-smooth optimization problems, in every iteration both the projection and the backprojection have to be applied for all projection bins. Moreover, in every iteration computations on vectors that have the size of the data have to be performed. For modern scanners like the Siemens Biograph mMR with span-1 data format, these vectors contain more than 350 million elements, which limits the applicability of this algorithm (and thus of many non-smooth priors) to state-of-the-art scanners.

1.1. Contributions

1.1.1. Subset acceleration with randomization

We propose an algorithm, coined Stochastic PDHG or SPDHG for short, which in every iteration performs computations only for a random subset of the data. We show on clinical data from a Siemens Biograph mMR that, with this algorithm, non-smooth priors become feasible for routine clinical imaging for the first time. Numerically, we show that SPDHG is competitive with OSEM on unregularized reconstruction problems but, in contrast to OSEM, is stable with respect to the choice of the subsets due to its mathematically guaranteed convergence. In fact, SPDHG converges to the deterministic solution for any proper subset selection, see theorem 1.

In addition to the general randomized solution strategy, we propose two further algorithmic advancements: preconditioning and non-uniform sampling.

1.1.2. Preconditioning

We propose and evaluate the use of data-dependent preconditioners in SPDHG for PET image reconstruction. While the convergence theory for a large class of preconditioners has been available since 2011 (Pock and Chambolle 2011), our proposed preconditioners are the first to be computationally efficient and effective for PET image reconstruction with non-smooth priors. The speed enhancement of preconditioning for PDHG was recognized before (Sidky et al 2012); however, we present a novel formulation of these preconditioners that is computationally efficient, see theorem 2.

1.1.3. Non-uniform sampling

We propose a novel non-uniform sampling strategy, which is necessary to accommodate the differences between the data fidelity and the regularizer. Both randomization and preconditioning can be used independently or can be combined, as proposed in this work.

1.1.4. Uncompressed data

In this work we use uncompressed (span-1) data from the Siemens Biograph mMR. While it is not clear if and how much this improves the reconstructed PET images (Belzunce and Reader 2017), the proposed algorithm is fast enough to study the benefits of uncompressed data in combination with a variety of regularization models.

A few initial findings on randomized reconstruction without preconditioning were published in a conference paper (Ehrhardt et al 2017).

1.2. PET reconstruction via optimization

Given the measured data vector $b \in {{\mathbb N}}^M$ and the projection model $ \newcommand{\mP}{\mathbf P} \mP$ , the PET reconstruction problem can be formulated as the solution to the optimization problem

Equation (1)

$u^{\#} \in \mathop{\rm arg\,min}\limits_{u \geqslant 0} \left\{ D(\mathbf{P} u) + \alpha R(u) \right\}$

where the data fidelity $ \newcommand{\mP}{\mathbf P} D(\mP u)$ measures the match of the estimated image u with the data and the prior $\alpha R(u)$ penalizes features that are not desirable in the solution. In other words the prior can be used to avoid solutions which would fit the noisy data too closely. The data fidelity D is (up to constants independent of u) the negative log-likelihood of the multi-variate Poisson distribution

with expected value being the sum of the projected image y  and the estimated background activity r. The latter is needed in order to model non-linear effects such as scatter and randoms. The data fidelity D measures the distance of the estimated data $ \newcommand{\mP}{\mathbf P} \mP u + r$ to the measured data b in the sense that $ \newcommand{\mP}{\mathbf P} D(\mP u) \geqslant 0$ and $ \newcommand{\mP}{\mathbf P} D(\mP u) = 0$ if and only if $ \newcommand{\mP}{\mathbf P} \mP u + r = b$ . The operator $ \newcommand{\mP}{\mathbf P} \mP$ performs the projection and includes geometric factors, attenuation and normalization.
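To make the structure of $ \newcommand{\mP}{\mathbf P} \mP$ concrete, the following Python/NumPy sketch builds a toy factored system model in which a geometric projector is followed by per-bin attenuation and normalization factors. The sizes, the dense matrix G and all variable names are illustrative assumptions for exposition only; the experiments in this paper use the scanner model provided by NiftyPET rather than such a toy operator.

import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: N voxels and M projection bins (real span-1 mMR data has >350 million bins).
N, M = 400, 1000

# Illustrative sparse, non-negative geometric projector; in practice this is an
# on-the-fly ray-tracing operator, not a stored dense matrix.
G = rng.random((M, N)) * (rng.random((M, N)) < 0.05)

# Assumed per-bin multiplicative factors: attenuation and normalization.
attenuation = rng.uniform(0.2, 1.0, size=M)
normalization = rng.uniform(0.5, 1.5, size=M)

def forward_project(u):
    # P u: geometric projection followed by per-bin attenuation and normalization
    return normalization * attenuation * (G @ u)

def back_project(y):
    # P^T y: the exact adjoint of forward_project
    return G.T @ (normalization * attenuation * y)

u = rng.random(N)
print(forward_project(u).shape)   # (M,)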

While the main motivation is the efficient solution of non-smooth optimization problems, we first compare the method to ordered subsets expectation maximization (OSEM) (Hudson and Larkin 1994) for unregularized reconstruction. The 'ordered subsets' idea has subsequently been used for many algorithms related to non-smooth optimisation, see e.g. McGaffin and Fessler (2015). We would like to show in the next example that (1) the 'ordered subsets' idea is generally non-convergent and thus may be unstable, and (2) the proposed algorithm is as fast as OSEM, despite its proven convergence.

1.3. Motivating example: OSEM

If there is no prior, i.e. $\alpha R = 0$ , the most common algorithm to solve the optimization problem (1) is the maximum likelihood expectation maximization algorithm (MLEM) (Shepp and Vardi 1982) defined by

Equation (2)

$u^{k+1} = \frac{u^k}{\mathbf{P}^T \mathbf{1}} \, \mathbf{P}^T \left( \frac{b}{\mathbf{P} u^k + r} \right)$

where all operations have to be understood element-wise. The computational bottleneck in the MLEM algorithm is the evaluation of the operator $ \newcommand{\mP}{\mathbf P} \mP$ and its transpose $ \newcommand{\mP}{\mathbf P} \mP^T$ in each iteration.

To overcome this hurdle, it has been proposed to change the update and evaluate the operator and its adjoint only on one out of m subsets of the data in each iteration. At every iteration k we choose $i = {\rm mod}(k, m)$ and change update formula (2) to

Equation (3)

$u^{k+1} = \frac{u^k}{\mathbf{P}_i^T \mathbf{1}} \, \mathbf{P}_i^T \left( \frac{b_i}{\mathbf{P}_i u^k + r_i} \right)$

where $b_i$ and $r_i$ denote the restrictions of $b$ and $r$ to the $i$th subset.

This algorithm became known as OSEM. Here $ \newcommand{\mP}{\mathbf P} \mP_i$ is the restriction of $ \newcommand{\mP}{\mathbf P} \mP$ onto the ith subset, i.e. $ \newcommand{\mP}{\mathbf P} \mP = (\mP_1^T, \ldots, \mP_m^T){}^T$ . While this change of the update equation reduces the computational burden per iteration by a factor of m, it is in general not guaranteed to converge to a solution of (1), as illustrated in figures 1(a) and (b). A convergent version of OSEM, called complete-data OSEM (COSEM), has been developed (Hsiao et al 2002). While it comes with mathematical convergence guarantees, it is much slower than OSEM (see figure 1(c)) and therefore never became popular for the reconstruction of clinical PET data.
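For concreteness, the following NumPy sketch implements the MLEM update (2) and the OSEM update (3) on a toy dense system matrix with interleaved subsets; the toy sizes, data and subset construction are illustrative assumptions and not the clinical setting considered in this paper.

import numpy as np

rng = np.random.default_rng(1)
N, M, m = 200, 600, 6                                  # voxels, bins, number of subsets

P = rng.random((M, N)) * (rng.random((M, N)) < 0.2)    # toy non-negative system matrix
u_true = rng.random(N)
r = np.full(M, 0.1)                                    # background (scatter and randoms)
b = rng.poisson(P @ u_true + r).astype(float)          # Poisson distributed data

subsets = [np.arange(i, M, m) for i in range(m)]       # interleaved ('equidistant') bins

def mlem_update(u):
    # u <- u / (P^T 1) * P^T (b / (P u + r)), all operations element-wise
    return u / (P.T @ np.ones(M)) * (P.T @ (b / (P @ u + r)))

def osem_update(u, k):
    i = k % m                                          # i = mod(k, m)
    S = subsets[i]
    Pi, bi, ri = P[S], b[S], r[S]
    return u / (Pi.T @ np.ones(len(S))) * (Pi.T @ (bi / (Pi @ u + ri)))

u = np.ones(N)
for k in range(10 * m):                                # 10 epochs of OSEM
    u = osem_update(u, k)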


Figure 1. (a) OSEM may become unstable. OSEM and SPDHG+ are compared for a varying number of subsets. While the speed of SPDHG+ increases with the number of subsets, OSEM fails to converge to the right solution for 100 subsets. $^\ast$ proposed. (b) OSEM may become unstable II. In this example both OSEM and SPDHG+ use 21 subsets, with the projection bins divided equidistantly into the subsets. In contrast to OSEM, SPDHG+ is robust with respect to this subset selection and achieves a reasonable solution. (c) Faster with subsets. Comparison of the reconstruction speed of several algorithms. We compare MLEM, OSEM (21 subsets), COSEM (252 subsets) and the proposed SPDHG+ (252 subsets) in terms of ${\rm PSNR}(x^k, x^\ast)$ (see section 4), where $x^\ast$ is an optimal solution for the $ \newcommand{\dataAmy}{\texttt{florbetapir}} \dataAmy$ dataset (see section 4.1) approximated by 5000 MLEM iterations. The subsets are selected with angles equidistantly divided. OSEM and SPDHG+ are clearly faster than MLEM and COSEM. $^\ast$ proposed. (d) OSEM and SPDHG+ look the same. Visual comparison of OSEM (21 subsets) and SPDHG+ (252 subsets) after 10 epochs for maximum likelihood reconstruction. Both algorithms achieve very similar images.


MLEM has been extended to include smooth (Green 1990) and certain non-smooth (Sawatzky et al 2013) prior information; however, conceptually both algorithms intrinsically struggle with the ordered subset acceleration. Other algorithms have also been 'accelerated' based on the ordered subset idea, e.g. McGaffin and Fessler (2015) and Ross Schmidtlein et al (2017), but are similarly intrinsically unstable due to their non-convergence. See Cheng et al (2013) for a numerical comparison and Ahn et al (2015) and Teoh et al (2015) for a validation on clinical PET data. For differentiable priors, a surrogate-based technique allows for stable subset acceleration (De Pierro and Yamagishi 2001, Ahn and Fessler 2003, Ahn et al 2006). In this work we propose the subset-accelerated algorithm SPDHG that is provably convergent and thus stable and robust, see figures 1(a) and (b). SPDHG is flexible enough to be applicable to a large variety of convex and non-smooth priors and is as efficient as OSEM if no explicit prior is being used, see figures 1(c) and (d).

2. Mathematical model

2.1. Non-smooth PET reconstruction with subsets

As outlined above, PET reconstruction can be formulated in terms of the optimization problem (1). Computationally, it is convenient to rewrite (and solve) the optimization problem (1) in terms of subsets. We denote by M the number of projection bins. Let {Si} be a partition of [M], in the sense that $\cup_{i=1}^m S_i = [M]$ , where we used the notation $[M] := \{1, \ldots, M\}$ . It is not necessary to assume that $ \newcommand{\e}{{\rm e}} S_i \cap S_j = \emptyset$ for $i \neq j$ , but for notational simplicity we will restrict ourselves to this case. We define

Equation (4)

$D_i(y) = \sum_{j \in S_i} \varphi_j(y_j)$

with the distance function for every data point given by

Equation (5)

$\varphi(y) = \begin{cases} y + r - b + b \log\left(\frac{b}{y + r}\right) & \text{if } y + r \geqslant 0 \\ \infty & \text{else} \end{cases}$

where we omitted the index j  at $\varphi, y, r$ and b for readability. Algorithms from convex optimization require the problem to be defined over an entire vector space which we satisfy by extending $\varphi$ to $\infty$ for non-positive estimated data y   +  r. The data and the background are photon counts and therefore have a natural non-negativity constraint. To allow for the concise notation in (5), we define $0 \log 0 := 0$ and $- \log 0 := \infty$ .
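The per-bin distance (5), together with the conventions above, can be evaluated as in the following NumPy sketch; the function name and the array-based interface are our own choices, and all inputs are assumed to be arrays of the same shape.

import numpy as np

def phi(y, r, b):
    # Per-bin Poisson distance with the conventions 0*log(0) := 0 and -log(0) := inf,
    # extended by +inf for negative estimated data y + r.
    y, r, b = (np.atleast_1d(v).astype(float) for v in (y, r, b))
    yr = y + r
    out = np.full(yr.shape, np.inf)
    pos = yr > 0
    with np.errstate(divide="ignore", invalid="ignore"):
        log_term = np.where(b[pos] > 0, b[pos] * np.log(b[pos] / yr[pos]), 0.0)
    out[pos] = yr[pos] - b[pos] + log_term
    out[(yr == 0) & (b == 0)] = 0.0     # the convention 0*log(0) := 0 keeps this bin finite
    return out

print(phi(2.0, 1.0, 3.0))               # [0.], since the estimate y + r matches the data b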

We model the non-negativity constraint for the image u with the indicator function $ \newcommand{\im}{{\rm Im}}\imath_+$ , which is defined as

Equation (6)

$\imath_+(u) = \begin{cases} 0 & \text{if } u_j \geqslant 0 \text{ for all } j \\ \infty & \text{else} \end{cases}$

Thus, this results in the unconstrained optimization problem

Problem 1 (PET reconstruction with subsets). 

Equation (7)

$\min_u \left\{ \sum_{i=1}^m D_i(\mathbf{P}_i u) + \alpha R(u) + \imath_+(u) \right\}$

We would like to stress that solving problem (7) is equivalent to solving the original problem (1) for any choice of subsets. In fact, the subset selection becomes a reconstruction parameter that may be varied to speed up the reconstruction procedure.

Often, our prior assumptions involve linear operators, too. One of the most prominent examples of this is the total variation (Rudin et al 1992)

$R(u) = {\rm TV}(u) = \|\nabla u\|_{2,1}$

where we take the 2-norm locally, i.e. at every voxel i we take the 2-norm of the spatial gradient, and the 1-norm globally, i.e. we sum over all voxels. Forward difference discretization of the gradient operator $\nabla$ is used as in Chambolle and Pock (2011). Similarly, we use the directional total variation $ \newcommand{\dTV}{{\rm dTV}} \newcommand{\mD}{\mathbf D} R(u) = \dTV(u) = \|\mD \nabla u\|_{2,1}$ to incorporate a-priori knowledge about the solution given by an anatomical prior image, see Ehrhardt and Betcke (2016), Ehrhardt et al (2016), Schramm et al (2017) and Bungert et al (2018) for details.
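As an illustration of this discretization, the following NumPy sketch computes the isotropic total variation $\|\nabla u\|_{2,1}$ of a 3D image with forward differences (and zero difference at the far boundary); the helper names and the image size are illustrative. The directional total variation differs only by a voxel-wise weighting $\mathbf D$ applied to the gradient before the norms are taken.

import numpy as np

def grad(u):
    # Forward-difference gradient of a 3D image, zero beyond the last voxel in each direction.
    g = np.zeros((3,) + u.shape)
    g[0, :-1, :, :] = u[1:, :, :] - u[:-1, :, :]
    g[1, :, :-1, :] = u[:, 1:, :] - u[:, :-1, :]
    g[2, :, :, :-1] = u[:, :, 1:] - u[:, :, :-1]
    return g

def tv(u):
    # ||grad u||_{2,1}: 2-norm over the 3 components at every voxel, then summed over voxels.
    return np.sqrt((grad(u) ** 2).sum(axis=0)).sum()

u = np.random.default_rng(2).random((32, 32, 16))
print(tv(u))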

Solving problem (7) is challenging, even when the involved variables are small and matrix-vector products are easy to compute. The difficulty stems from its non-smoothness. The data term Di is not finite everywhere and while it is differentiable on its effective domain $ \newcommand{\dom}{{\rm dom}} \dom(D_i) := \{y \mid D_i(y) < \infty\}$ , the gradient is not globally Lipschitz continuous. In addition, further non-smoothness comes from the constraint $ \newcommand{\im}{{\rm Im}}\imath_+$ and the prior R may be non-smooth as well. To make matters worse, in PET reconstruction the variables are very large and matrix-vector products are expensive to compute.

To apply optimization algorithms to solve (7), we reformulate it as a generic optimization problem of the form

Problem 2 (Generic optimization problem). 

Equation (8)

$\min_x \left\{ \sum_{i=1}^n f_i(\mathbf{A}_i x) + g(x) \right\}$

For instance, for unregularized reconstructions, i.e. $\alpha R = 0$ , we may make the association

$n = m, \qquad f_i = D_i, \qquad \mathbf{A}_i = \mathbf{P}_i \quad (i \in [m]), \qquad g = \imath_+,$

and reconstructions regularized by the total variation, i.e. $R(u) = \|\nabla u\|_{2,1}$ , can be achieved by

Equation (9)

$n = m + 1, \qquad f_i = D_i, \quad \mathbf{A}_i = \mathbf{P}_i \quad (i \in [m]), \qquad f_n = \alpha \|\cdot\|_{2,1}, \qquad \mathbf{A}_n = \nabla, \qquad g = \imath_+.$

2.2. Optimization with saddle-point problems

Instead of solving problem (8) directly, it is more efficient to reformulate the minimization problem as a saddle point problem making use of the convex conjugate of a functional, see e.g. Bauschke and Combettes (2011).

Definition 1 (Convex conjugate). Let $ \newcommand{\bR}{{\mathbb R}} \newcommand{\bRI}{\bR_\infty} f : Y \to \bRI := \bR \cup \{\infty\}$ be a functional with extended real values. Then we define the convex conjugate of f  as $ \newcommand{\bR}{{\mathbb R}} \newcommand{\bRI}{\bR_\infty} f^\ast : Y \to \bRI$ with

$f^\ast(y) = \sup_{x \in Y} \left\{ \langle x, y \rangle - f(x) \right\}.$

For convex, proper and lower semi-continuous (lsc) functionals f  we have that $f^{\ast\ast} = f$ , see e.g. Bauschke and Combettes (2011), and thus $f(x) = \sup_y \left\{\langle x, y\rangle - f^\ast(y) \right\}$ . Then, with $Y = \prod_{i=1}^n Y_i$ , problem (8) is equivalent to

Problem 3 (Generic saddle point problem). 

Equation (10)

$\min_x \sup_y \left\{ \sum_{i=1}^n \left( \langle \mathbf{A}_i x, y_i \rangle - f_i^\ast(y_i) \right) + g(x) \right\}$

We will refer to the variable x as the primal variable and to y  as the dual variable.

Example 1. The convex conjugate of the PET distance function (4) is given by $D_i^\ast(y) = \sum_{j \in S_i} \varphi^\ast_j(y_j)$ with

Equation (11)

$\varphi^\ast(y) = \begin{cases} -r y - b \log(1 - y) & \text{if } y < 1, \text{ or } y \leqslant 1 \text{ and } b = 0 \\ \infty & \text{else} \end{cases}$

where we omitted the index j  at $\varphi, y, b$ and r for readability.

The derivations of the formulas in this and the following examples are omitted for brevity.

As some (or all) of the fi and g in (8) are non-smooth, we make use of the proximal operator of these. Our definition varies slightly from the usual definition as we allow the step size parameter to be matrix-valued. For a symmetric and positive definite matrix $ \newcommand{\mS}{\mathbf S} \mS$ , we define the weighted norm $ \newcommand{\mS}{\mathbf S} \|x\|_\mS$ as $ \newcommand{\mS}{\mathbf S} \|x\|_\mS^2 := \|\mS^{-1/2} x\|^2 = \langle \mS^{-1} x, x \rangle$ .

Definition 2 (Proximal operator). Let $ \newcommand{\mS}{\mathbf S} \mS$ be a symmetric and positive definite matrix. Then we define the proximal operator of f  with metric (or step size) $ \newcommand{\mS}{\mathbf S} \mS$ as

$\mathop{\rm prox}\nolimits^{\mathbf{S}}_{f}(y) = \mathop{\rm arg\,min}\limits_{x} \left\{ \tfrac{1}{2} \|x - y\|_{\mathbf{S}}^2 + f(x) \right\}.$

From here on, $ \newcommand{\mS}{\mathbf S} \mS$ and $ \newcommand{\mT}{\mathbf T} \mT$ will always be diagonal (and thus symmetric) and positive definite matrices.

Example 2. The proximity operator of the non-negativity constraint (6) is given element-wise by

$[\mathop{\rm prox}\nolimits^{\mathbf{T}}_{\imath_+}(x)]_j = \max(x_j, 0).$

Example 3. Let $ \newcommand{\Diag}{{\rm diag}} \newcommand{\mS}{\mathbf S} \mS_i = \Diag((\sigma_j)_{j \in S_i})$ . The proximal operator of the convex conjugate of the PET distance (11) can be computed element-wise as $ \newcommand{\Prox}{{\rm prox}} \newcommand{\mS}{\mathbf S} [\Prox^{\mS_i}_{D_i^\ast}(y)]_j = \Prox^{\sigma_j}_{\varphi_j^\ast}(y_j)$ . For each element, the proximal operator is given by

$\mathop{\rm prox}\nolimits^{\sigma}_{\varphi^\ast}(y) = \frac{1}{2} \left( w + 1 - \sqrt{(w - 1)^2 + 4 \sigma b} \right)$

where we again omitted the indices j  for readability and denoted $w = y + \sigma r$ .
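The following NumPy sketch collects the two proximal operators of examples 2 and 3: the projection onto the non-negative orthant and the element-wise closed form for the conjugate of the Poisson distance stated above; the function names and the sanity check at the end are our own additions.

import numpy as np

def prox_nonneg(x):
    # Proximal operator of the non-negativity indicator: projection onto x >= 0
    # (independent of the diagonal step size).
    return np.maximum(x, 0.0)

def prox_poisson_conjugate(y, sigma, b, r):
    # Element-wise proximal operator of the convex conjugate of the Poisson distance,
    # with diagonal step sizes sigma: with w = y + sigma*r,
    #     prox(y) = (w + 1 - sqrt((w - 1)**2 + 4*sigma*b)) / 2.
    # All arguments are arrays of the same shape (one entry per projection bin).
    w = y + sigma * r
    return 0.5 * (w + 1.0 - np.sqrt((w - 1.0) ** 2 + 4.0 * sigma * b))

# Quick sanity check: the output stays in the domain of phi^*, i.e. below 1 whenever b > 0.
rng = np.random.default_rng(3)
y, sigma = rng.normal(size=5), rng.uniform(0.1, 1.0, 5)
b, r = rng.poisson(5.0, 5).astype(float), rng.uniform(0.0, 0.5, 5)
print(prox_poisson_conjugate(y, sigma, b, r) < 1.0)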

3. Algorithm

The saddle point problem (10) (and therefore the PET reconstruction problem (8)) can be solved with PDHG (Chambolle and Pock 2011), see algorithm 1. It consists of very simple operations involving only basic linear algebra, matrix-vector multiplications and the evaluation of proximal operators. As seen in line 4 of the pseudo-code, PDHG updates all dual variables simultaneously. Therefore, in lines 4 and 5, the projection and backprojection corresponding to the whole data set have to be evaluated. The idea of SPDHG, algorithm 2, is to select only one dual variable randomly in each iteration (line 4) and to perform the update accordingly (lines 5 and 6). An important detail is the extrapolation in line 8 with the inverse of the probability pi that i will be selected in each iteration. This guarantees convergence, as proven in theorem 1 below.

3.1. Convergence

SPDHG is guaranteed to converge for any fi and g which are convex, proper and lsc. We now state a very general convergence result which can be derived from (Chambolle et al 2018, theorem 4.3). The actual proof is omitted here for brevity. For more details on convergence and convergence rates we refer the reader to Chambolle et al (2018).

Theorem 1 (Convergence). Assume that the sampling is proper, i.e. the probability pi for an index $i \in [n]$ to be sampled is positive. Let the step size parameters $ \newcommand{\mS}{\mathbf S} \mS_i$ and $ \newcommand{\mT}{\mathbf T} \mT = \min_{i \in [n]} \mT_i$ be chosen such that for all $i \in [n]$ the following bound on the operator norm

Equation (12)

$\left\| \mathbf{S}_i^{1/2} \mathbf{A}_i \mathbf{T}^{1/2} \right\|^2 < p_i$

holds. Then for any initialization, the iterates $(x, y)$ of SPDHG (algorithm 2) converge to a saddle point of (10) almost surely in a Bregman distance.

Algorithm 1. Primal-dual hybrid gradient (PDHG) to solve (10). Default values given in brackets.

Input: iterates $x (= 0)$ , $y (= 0)$ , step parameters $ \newcommand{\mS}{\mathbf S} \mS = \{\mS_i\}$ , $ \newcommand{\mT}{\mathbf T} \mT$
  1: $ \newcommand{\mA}{\mathbf A} \overline{z} = z = \mA^T y \; (= 0)$
  2: for $k = 1, \ldots$ do
  3:   $ \newcommand{\Prox}{{\rm prox}} \newcommand{\mT}{\mathbf T} x = \Prox^\mT_g\left(x - \mT \overline{z}\right)$
  4:   $ \newcommand{\Prox}{{\rm prox}} \newcommand{\mA}{\mathbf A} \newcommand{\mS}{\mathbf S} y_i^+ = \Prox^{\mS_i}_{f^\ast_i}\left(y_i + \mS_i \mA_i x\right)$   for $i = 1, \ldots, n$
  5:   $ \newcommand{\mA}{\mathbf A} \Delta z = \sum_{i=1}^n \mA_i^T \left(y_i^+ - y_i\right)$
  6:  $z = z + \Delta z, \quad y = y^+$
  7:   $\overline{z} = z + \Delta z$

Algorithm 2. Stochastic primal-dual hybrid gradient (SPDHG) to solve (10). Default values given in brackets.

Input: iterates $x (= 0)$ , $y (= 0)$ , step parameters $ \newcommand{\mS}{\mathbf S} \mS = \{\mS_i\}$ , $ \newcommand{\mT}{\mathbf T} \mT$
  1: $ \newcommand{\mA}{\mathbf A} \overline{z} = z = \mA^T y \; (= 0)$
  2: for $k = 1, \ldots$ do
  3:   $ \newcommand{\Prox}{{\rm prox}} \newcommand{\mT}{\mathbf T} x = \Prox^\mT_g\left(x - \mT \overline{z}\right)$
  4:   Select $i \in [n]$ at random with probability pi
  5:   $ \newcommand{\Prox}{{\rm prox}} \newcommand{\mA}{\mathbf A} \newcommand{\mS}{\mathbf S} y_i^+ = \Prox^{\mS_i}_{f^\ast_i}\left(y_i + \mS_i \mA_i x\right)$
  6:   $ \newcommand{\mA}{\mathbf A} \Delta z = \mA_i^T \left(y_i^+ - y_i\right)$
  7:   $z = z + \Delta z, \quad y_i = y_i^+$
  8:   $\overline{z} = z + \frac{1}{p_i} \Delta z$
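To connect the pseudo-code to an implementation, the following Python/NumPy sketch spells out the SPDHG iteration for the generic problem (8) with dense matrices $\mathbf A_i$, diagonal step sizes and user-supplied proximal operators, followed by a toy non-negative least-squares run. It is a minimal illustration of the update logic under these assumptions, not the ODL/NiftyPET implementation used for the experiments; the quadratic data term of the toy run is chosen only because its conjugate has a simple proximal operator.

import numpy as np

def spdhg(A, prox_g, prox_fconj, S, T, p, n_iter, rng=None):
    # Stochastic PDHG (algorithm 2) for min_x sum_i f_i(A_i x) + g(x).
    #   A          : list of n matrices A_i
    #   prox_g     : (x, T) -> prox of g with diagonal step T
    #   prox_fconj : list of (y_i, S_i) -> prox of f_i^* with diagonal step S_i
    #   S, T       : dual step sizes (list of arrays) and primal step size (array)
    #   p          : sampling probabilities p_i
    rng = rng or np.random.default_rng()
    n, N = len(A), A[0].shape[1]
    x = np.zeros(N)
    y = [np.zeros(Ai.shape[0]) for Ai in A]
    z = zbar = np.zeros(N)                        # z = A^T y (= 0 for y = 0)
    for _ in range(n_iter):
        x = prox_g(x - T * zbar, T)               # line 3: primal update
        i = rng.choice(n, p=p)                    # line 4: select one block at random
        y_new = prox_fconj[i](y[i] + S[i] * (A[i] @ x), S[i])   # line 5
        dz = A[i].T @ (y_new - y[i])              # line 6: backproject only block i
        y[i] = y_new
        z = z + dz                                # line 7
        zbar = z + dz / p[i]                      # line 8: extrapolation with 1/p_i
    return x

# Toy usage: non-negative least squares min_x 0.5*||A x - b||^2 + i_+(x), split into two
# row blocks; the proximal operator of f_i^* is y -> (y - S*b_i) / (1 + S).
rng = np.random.default_rng(4)
A_full = rng.random((60, 20)); b = A_full @ rng.random(20)
A, b_blocks = [A_full[:30], A_full[30:]], [b[:30], b[30:]]
norms = [np.linalg.norm(Ai, 2) for Ai in A]
p = np.array([0.5, 0.5])
S = [0.9 / nrm * np.ones(Ai.shape[0]) for Ai, nrm in zip(A, norms)]      # (13), gamma=1, rho=0.9
T = min(0.9 * pi / nrm for pi, nrm in zip(p, norms)) * np.ones(20)       # (13), T = min_i T_i
prox_fconj = [lambda y, s, bi=bi: (y - s * bi) / (1.0 + s) for bi in b_blocks]
x = spdhg(A, lambda v, t: np.maximum(v, 0.0), prox_fconj, S, T, p, n_iter=2000, rng=rng)
print(np.linalg.norm(A_full @ x - b))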

Remark 1 (Computational efficiency). Each iteration of algorithm 2 is computationally efficient as only projections and backprojections corresponding to the randomly selected subset i of the data are required. However, the algorithm maintains the whole backprojected dual variable $ \newcommand{\mP}{\mathbf P} z = \mP^T y = \sum_{i=1}^m \mP_i^T y_i$ and in each iteration updates the primal variable with it.

Remark 2 (Memory requirements). The memory requirement of algorithm 2 is higher compared to OSEM or gradient descent but still reasonably low. It requires memory equivalent to two images $(z, \overline{z})$ and up to twice the binned sinogram data $(y, y^+)$ in addition to the necessary memory consumption (output image, sinogram data, background and normalization).

Remark 3 (Sampling). SPDHG allows any kind of random selection as long as the draws are independent and each block i is selected with positive probability pi  >  0. We will investigate two choices of sampling in the numerical section of this paper. A more thorough numerical and theoretical investigation will be the subject of future work.

3.2. Step sizes and preconditioning

We will now discuss two different choices of step sizes under which SPDHG is guaranteed to converge. The proof of the following theorem uses arguments from Chambolle et al (2018) and Pock and Chambolle (2011) and is omitted here for brevity.

Theorem 2 (Step size parameters). Let $\rho < 1$ and $\gamma > 0$ . Then, condition (12) of theorem 1 is satisfied by

Equation (13)

$\mathbf{S}_i = \frac{\gamma \rho}{\|\mathbf{A}_i\|} \, \mathbf{I}, \qquad \mathbf{T}_i = \frac{\rho\, p_i}{\gamma\, \|\mathbf{A}_i\|} \, \mathbf{I}$

Moreover, if $ \newcommand{\mA}{\mathbf A} \mA_i$ has only non-negative elements, then condition (12) is also satisfied by

Equation (14)

$\mathbf{S}_i = \gamma \rho \, {\rm diag}(\mathbf{A}_i \mathbf{1})^{-1}, \qquad \mathbf{T}_i = \frac{\rho\, p_i}{\gamma} \, {\rm diag}(\mathbf{A}_i^T \mathbf{1})^{-1}$

where $\mathbf{1}$ denotes the vector of ones of appropriate size.

An example of preconditioned step sizes (14) is shown in figure 2.


Figure 2. Preconditioned parameters $ \newcommand{\mT}{\mathbf T} \mT$ (top) and $ \newcommand{\mS}{\mathbf S} \mS$ (bottom) (14) for the data set $ \newcommand{\dataFdg}{\texttt{FDG}} \dataFdg$ (see section 4.1). Apart from the boundary the step sizes are large in interesting regions, clearly showing the head of the patient.


Remark 4. If n  =  1 and pi  =  1, then the step sizes (13) can be identified with the scalar step sizes $ \newcommand{\mA}{\mathbf A} \sigma_i = \gamma \rho / \|\mA_i\|$ and $ \newcommand{\mA}{\mathbf A} \tau = \gamma^{-1} \rho / \|\mA_i\|$ which are commonly chosen for PDHG.

Remark 5. Note that the non-negativity condition holds for the PET projection operator (and any other ray tracing based operator). Moreover, the step size $ \newcommand{\mT}{\mathbf T} \mT$ in (14) resembles the sensitivities used in the update of MLEM (2) and OSEM (3). In addition, a similar preconditioning is performed for the dual variable in the data space.
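The following NumPy sketch computes the preconditioned step sizes (14) from the row and column sums of non-negative blocks, taking the element-wise minimum over the blocks for $\mathbf T$ as in theorem 1; the small epsilon guarding empty rows and columns of the toy matrices is an implementation detail we add, not part of the theorem.

import numpy as np

def preconditioned_steps(A_blocks, p, gamma=1.0, rho=0.99, eps=1e-12):
    # Step sizes of (14) for non-negative blocks A_i:
    #   S_i = gamma * rho / (A_i 1)                  (one value per bin of block i)
    #   T   = min_i rho * p_i / gamma / (A_i^T 1)    (one value per voxel)
    S = [gamma * rho / np.maximum(Ai @ np.ones(Ai.shape[1]), eps) for Ai in A_blocks]
    T_blocks = [rho * pi / gamma / np.maximum(Ai.T @ np.ones(Ai.shape[0]), eps)
                for Ai, pi in zip(A_blocks, p)]
    return S, np.min(T_blocks, axis=0)

rng = np.random.default_rng(5)
A = [rng.random((40, 25)) * (rng.random((40, 25)) < 0.2) for _ in range(3)]
S, T = preconditioned_steps(A, p=[1 / 3, 1 / 3, 1 / 3])
print(T.shape, [Si.shape for Si in S])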

4. Numerical results

The numerical experiments use the open-source package ODL (Adler et al 2017), which allows for efficient algorithm prototyping in Python. The projections and backprojections are computed with CUDA in single-precision through the open-source package NiftyPET (Markiewicz et al 2018), which is accessible via Python. All results in this section were obtained by selecting subsets with equidistantly divided angles. In all numerical experiments we use the parameter $\gamma=1$ . Fine-tuning of this parameter is left for future work. Moreover, all peak signal-to-noise ratio (PSNR) and relative objective comparisons are performed by first computing an approximate minimizer $x^\ast$ with the deterministic PDHG using 5000 iterations. The PSNR is defined as ${\rm PSNR}(x^k, x^\ast) = 20 \log(\|x^\ast\|_\infty / \|x^k - x^\ast\|_2)$ and the relative objective value is defined as $(\Psi(x^k) - \Psi(x^\ast)) / (\Psi(x^0) - \Psi(x^\ast))$ . We frequently use the word 'epoch' to denote the number of iterations of a randomized algorithm that are, in expectation, computationally equivalent to one iteration of the deterministic algorithm that uses all data in each iteration. As an example, if a randomized algorithm only uses 1/10 of the data in each iteration, then after 10 iterations one can expect that the algorithm has used all data, thus in this case 1 epoch equals 10 iterations. In all figures, the dashed lines correspond to deterministic and the solid lines to randomized algorithms. The Python code is available online at https://github.com/mehrhardt/spdhg_pet and the $ \newcommand{\dataAmy}{\texttt{florbetapir}} \dataAmy$ data set can be downloaded via https://zenodo.org/record/1472951#.XZtnR-dKhTY.
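For reference, the two figures of merit can be written in a few lines of NumPy; we assume the base-10 logarithm that is conventional for PSNR, and the toy arrays at the end are purely illustrative.

import numpy as np

def psnr(x, x_star):
    # PSNR(x, x*) = 20 * log10(||x*||_inf / ||x - x*||_2)
    return 20.0 * np.log10(np.linalg.norm(x_star, np.inf) / np.linalg.norm(x - x_star))

def relative_objective(obj_k, obj_star, obj_0):
    # (Psi(x^k) - Psi(x*)) / (Psi(x^0) - Psi(x*))
    return (obj_k - obj_star) / (obj_0 - obj_star)

x_star = np.ones(100)
x_k = x_star + 0.01 * np.random.default_rng(6).normal(size=100)
print(psnr(x_k, x_star))   # roughly 20 dB for this noise level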

4.1. Data

We validate the numerical performance of the proposed algorithm on two clinical PET data sets which we refer to as $ \newcommand{\dataFdg}{\texttt{FDG}} \dataFdg$ and $ \newcommand{\dataAmy}{\texttt{florbetapir}} \dataAmy$ . The two separate PET brain datasets each use a distinct radiotracer: [18F]FDG for epilepsy and [18F]florbetapir for the neuroscience sub-study Insight'46 of the Medical Research Council National Survey of Health and Development (Lane et al 2017). The epileptic patient was injected with 250 MBq of FDG, one hour before the 15 min PET acquisition. The neuroscience volunteer was injected with 370 MBq of florbetapir and scanned dynamically for one hour, starting at the injection time. The last ten minutes were used as a measurement of amyloid deposition, which was negative for this participant.

4.2. Results for total variation

In this section we analyze the impact of various choices within SPDHG on its performance, from randomness and sampling to preconditioning. The test case is the total variation prior as defined in (9).

4.2.1. Randomness

Figure 3 shows the effect of randomness, where we compare the deterministic PDHG to SPDHG with uniform sampling and scalar step sizes (13) for two different numbers of subsets. The horizontal axis reflects the number of projections used by each algorithm; we call one full projection of the whole data set one 'epoch'. Here and in the following, dashed lines represent deterministic and solid lines randomized algorithms. We can easily see that both random variants are faster than the deterministic PDHG. Moreover, the randomized SPDHG becomes faster by choosing a larger number of subsets.


Figure 3. Deterministic versus randomized. The results for the data set $ \newcommand{\dataFdg}{\texttt{FDG}} \dataFdg$ with TV prior show that the randomized algorithms are much faster than their deterministic counterpart. Moreover, using more subsets leads to a faster algorithm.


4.2.2. Sampling

The effect of different choices of sampling is shown in figure 4. We compare two different samplings: uniform sampling and balanced sampling. The uniform sampling chooses all indices $i \in [n]$ with equal probability pi  =  1/n. In contrast, for balanced sampling we choose with uniform probability either data or prior. If we choose data, then we select a subset again randomly with uniform probability. Thus, the probability for each subset of the data to be selected is $p_i = 1 / (2m), i \in [m]$ and for the prior to be selected pn  =  1/2.
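The two samplings can be summarized in the following NumPy sketch, which returns the probability vector over the n = m + 1 dual blocks (m data subsets plus one prior block) and draws a single block as in line 4 of algorithm 2; the function name and the convention that the prior is the last block are our own choices.

import numpy as np

def sampling_probabilities(m, mode="balanced"):
    # Probabilities over n = m + 1 dual blocks: m data subsets followed by the prior block.
    #   uniform : every block has probability 1 / (m + 1)
    #   balanced: the prior block has probability 1/2, each data subset 1 / (2m)
    n = m + 1
    if mode == "uniform":
        return np.full(n, 1.0 / n)
    prob = np.full(n, 1.0 / (2 * m))
    prob[-1] = 0.5
    return prob

rng = np.random.default_rng(7)
prob = sampling_probabilities(21, "balanced")
i = rng.choice(len(prob), p=prob)       # one SPDHG iteration samples a single block i
print(prob.sum(), i)                    # the probabilities sum to 1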


Figure 4. Uniform versus balanced sampling. In addition to increasing the number of subsets, the sampling is also very important for the speed of the algorithm: 21 subsets with balanced sampling is faster than 100 subsets with uniform sampling.


We make two observations. First, balanced sampling is always faster than uniform sampling. This shows the importance of updating the dual variable associated with the prior. Second, for either sampling, choosing a larger number of subsets again improves the performance.

4.2.3. Preconditioning

As shown in theorem 2, the step size parameters $ \newcommand{\mT}{\mathbf T} \mT$ and $ \newcommand{\mS}{\mathbf S} \mS_i$ can be chosen either as scalars (13) or as vectors (14), the latter can be seen as a form of preconditioning. Results are shown in figure 5, where we see that preconditioning may accelerate the convergence of either the deterministic PDHG or the randomized SPDHG. Moreover, combining randomization and preconditioning yields an even faster algorithm.


Figure 5. Preconditioning can be used with and without randomization. The preconditioned algorithms are much faster than their unpreconditioned counterparts.


4.2.4. Performance of proposed algorithm

Based on the previous three examples, we propose to combine randomization, balanced sampling and preconditioning, which we refer to as SPDHG+. Figure 6 shows the visual performance of PDHG and SPDHG+. In contrast to the deterministic PDHG, the proposed SPDHG+  yields a good approximation of the optimal solution after only 10 epochs.


Figure 6. Qualitative results show that in contrast to the deterministic PDHG, the proposed SPDHG+  (252 subsets) approximates the optimal solution well after only 10 epochs. The 'optimal' solution was computed with 5000 iterations of PDHG.


4.3. Further numerical results

4.3.1. Anisotropic total variation

Anisotropic total variation decouples the penalization of the derivatives. The mathematical model is similar to the isotropic TV model (9), the only difference being the norm in which the total variation is measured: $f_n = \alpha \|\cdot\|_{1,1}$ . It can be seen in figure 7 for $ \newcommand{\dataAmy}{\texttt{florbetapir}} \dataAmy$ that with randomization and preconditioning only a few epochs are needed to obtain a good approximation of the optimal solution.
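The difference between the two models is only the norm applied to the discrete gradient field, as the following NumPy sketch illustrates; the random toy gradient field is purely illustrative.

import numpy as np

def norm_21(g):
    # ||g||_{2,1}: 2-norm over the gradient components at each voxel, summed over voxels (isotropic TV)
    return np.sqrt((g ** 2).sum(axis=0)).sum()

def norm_11(g):
    # ||g||_{1,1}: absolute values of all components, summed (anisotropic TV)
    return np.abs(g).sum()

g = np.random.default_rng(8).normal(size=(3, 16, 16, 8))
print(norm_11(g) >= norm_21(g))         # True: the 1,1-norm always dominates the 2,1-norm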


Figure 7. Anisotropic TV regularized reconstruction from $ \newcommand{\dataAmy}{\texttt{florbetapir}} \dataAmy$ data. Top: PDHG and SPDHG+ (252 subsets) reconstructions after 10 epochs. Bottom: quantitative results show a significant speed-up from randomization and preconditioning. Increasing the number of subsets from 21 to 252 has little effect on this data set. The 'optimal' solution was computed with 5000 iterations of PDHG.


4.3.2. Directional total variation

Anatomical information from a co-registered MRI is available on combined PET-MR scanners. The structural information of the anatomy can be utilized by the directional total variation prior, see Ehrhardt and Betcke (2016), Ehrhardt et al (2016), Schramm et al (2017) and Bungert et al (2018) for details. The mathematical model is similar to the total variation model (9), except for an additional matrix $ \newcommand{\mD}{\mathbf D} \mD$ . Thus, the only difference is $ \newcommand{\mA}{\mathbf A} \newcommand{\mD}{\mathbf D} \mA_n = \mD \nabla$ . A numerical example is shown in figure 8 for the data set $ \newcommand{\dataAmy}{\texttt{florbetapir}} \dataAmy$ .


Figure 8. Directional TV prior (which uses MRI information) for $ \newcommand{\dataAmy}{\texttt{florbetapir}} \dataAmy$ data. Both qualitative (top) and quantitative results (bottom) show the speed up provided by randomization and preconditioning. The 'optimal' solution was computed with 5000 iterations of PDHG.


4.3.3. Total generalized variation

More sophisticated regularization can be achieved by the total generalized variation (TGV) (Bredies et al 2010, Bredies and Holler 2015)

${\rm TGV}(u) = \min_{w} \left\{ \alpha_1 \|\nabla u - w\|_{2,1} + \alpha_0 \|{\mathcal E} w\|_{2,1} \right\}$

which can balance first and second order regularization and achieves edge-preserving reconstruction while avoiding the stair-casing artifact. We can solve the TGV regularized PET reconstruction problem by solving problem (8) with the assignment $x = (u, w)$ and

$f_i = D_i, \quad \mathbf{A}_i = (\mathbf{P}_i, 0) \quad (i \in [m]), \qquad f_{m+1} = \alpha_1 \|\cdot\|_{2,1}, \quad \mathbf{A}_{m+1} = (\nabla, -\mathbf{I}), \qquad f_{m+2} = \alpha_0 \|\cdot\|_{2,1}, \quad \mathbf{A}_{m+2} = (0, {\mathcal E}), \qquad g(x) = \imath_+(u),$

where ${\mathcal E}$ is a symmetrized gradient operator, see Bredies et al (2010) and Bredies and Holler (2015) for more details.

The numerical results shown in figure 9 are in line with the previous findings indicating that randomization and preconditioning can significantly speed up the reconstruction. However, we notice a significant increase in performance by increasing the number of subsets from 21 to 252.


Figure 9. TGV regularized reconstruction for the $ \newcommand{\dataFdg}{\texttt{FDG}} \dataFdg$ data. Only a few epochs are needed to approximate the optimal solution with randomization and preconditioning. This is visible for both the actual images u and for the reconstructed vector field w. The 'optimal' solution was computed with 5000 iterations of PDHG. *proposed


4.3.4. Comparison of mathematical models

We conclude this section with a comparison of various methods on both data sets in figures 10 and 11. While we leave the detailed visual comparisons to the reader, we would like to note that all these images use the same number of projections and thus have essentially the same computational cost.


Figure 10. Comparison of several reconstruction approaches for the $ \newcommand{\dataFdg}{\texttt{FDG}} \dataFdg$ data. All approaches have about the same computational cost (10 epochs).


Figure 11. Comparison of several reconstruction approaches for the $ \newcommand{\dataAmy}{\texttt{florbetapir}} \dataAmy$ data. All approaches have about the same computational cost (10 epochs).


5. Discussion

The extensive numerical experiments all consistently confirm that randomization and preconditioning both speed up the reconstruction. These trends held irrespective of the data set and the chosen prior. The convergence speed in our work was measured against a solution of the underlying mathematical optimization model, approximated with far more iterations than would be feasible in routine clinical practice. This strategy was chosen intentionally as we did not want to target a specific clinical use case. After these successful initial trials, in the future we will collaborate with medical researchers and clinicians to focus on specific use cases, where each use case defines its own metric for the images we wish to reconstruct.

The focus of this contribution was on non-smooth priors like total variation and its descendants like total generalized variation and directional total variation. However, as long as the proximal operators are simple to evaluate, the proposed randomized and preconditioned algorithm can be applied to any other model, too. It would be of interest to compare this algorithm to convergent subset accelerated algorithms for smooth priors like BSREM (De Pierro and Yamagishi 2001, Ahn and Fessler 2003), TRIOT (Ahn et al 2006) and OS-SPS (Ahn and Fessler 2003).

We highlighted the improvements from choosing different distributions for subset selection by comparing 'uniform' and 'balanced' sampling. Further improvements are expected from optimizing the sampling distribution of this algorithm. This could either be an optimal distribution that is constant across the iterations or one that evolves over the course of the iterations. We will investigate this direction further in the future.

With the exception of figures 7 and 8, where 21, 100 and 252 subsets were similarly fast, more subsets always resulted in a faster algorithm. There are neither theoretical nor numerical insights into how the speed depends on the subset selection and whether more subsets always result in a faster algorithm. However, the numerical evidence suggests that increasing the number of subsets never decreases the speed of the algorithm. This being said, due to the per-iteration computational costs, from a practical point of view there will be an optimal number of subsets that might depend on the prior and even the data (e.g. number of counts) to be reconstructed. We would like to point out that the two figures 7 and 8 have in common that both used the same tracer $ \newcommand{\dataAmy}{\texttt{florbetapir}} \dataAmy$ . In future work we will study the tracer-dependence of the convergence speed in more detail.

Moreover, the algorithm does not exploit any special structure of our optimization problem like smoothness or strong convexity. It is likely that exploiting these properties will lead to additional speed-up. However, as these properties for the PET data term depend on the acquired data, it is unlikely that a straightforward approach will be sufficient and a tailored solution will be necessary.

6. Conclusion

We introduced a convergent subset-accelerated algorithm for the reconstruction of PET images with non-smooth priors. The algorithm was enhanced by data-dependent preconditioning. Our numerical results showed that using both randomized subset selection and preconditioning can dramatically speed up the convergence of an iterative reconstruction algorithm. It was observed that a computational effort similar to the current clinical standard OSEM was sufficient for many non-smooth priors, showing that these are now, for the first time, feasible for daily clinical routine.

While these observations were consistent among two data sets with different tracers, more studies are needed to confirm the benefits of this reconstruction strategy. Overall, this algorithmic advancement has the potential to change the PET reconstruction landscape as advanced mathematical models can now be combined with efficient and convergent subset acceleration.

Acknowledgments

MJE and C-BS acknowledge support from Leverhulme Trust project 'Breaking the non-convexity barrier', EPSRC Grant 'EP/M00483X/1', EPSRC centre 'EP/N014588/1', the Cantab Capital Institute for the Mathematics of Information, and from CHiPS and NoMADS (Horizon 2020 RISE project grants). In addition, MJE acknowledges support from the EPSRC platform Grant 'EP/M020533/1'. Moreover, C-BS is thankful for support by the Alan Turing Institute. In addition, all authors gratefully acknowledge the hardware donation by the NVIDIA Corporation.

Footnotes

  • Also known as the 'Chambolle–Pock algorithm'.
