Computed tomography reconstruction using deep image prior and learned reconstruction methods

Daniel Otero Baguer; Johannes Leuschner; Maximilian Schmidt

doi:10.1088/1361-6420/aba415

1. Introduction

Deep learning approaches to solving ill-posed inverse problems currently achieve state-of-the-art reconstruction quality. However, they require large amounts of training data, i.e., pairs of ground truths and measurements, and it is not clear how much is necessary to be able to achieve good generalization. For ill-posed inverse problems arising in medical imaging, such as magnetic resonance imaging (MRI), guided positron emission tomography (PET), magnetic particle imaging, or computed tomography (CT), obtaining such high amounts of training data is challenging. In particular, ground truth data is difficult to obtain as it is impossible to take a photograph of the inside of the human body. What learned methods usually consider as ground truths are phantoms or high-dose reconstructions obtained with classical methods, such as filtered back-projection (FBP). These methods work well when using a large amount of low-noise measurements. In MRI, it is possible to obtain these reconstructions, but the data acquisition process requires a great deal of time. Therefore, one potential benefit of learned approaches in MRI is the reduction of data acquisition times [30]. In other applications such as CT, it would be necessary to expose patients to high doses of x-ray radiation to obtain the required training ground truths.

There is another approach called deep image prior (DIP) [31] that also uses deep neural networks, for example, a U-Net [45]. However, there is a remarkable difference: the DIP does not need any learning, i.e., the weights of the network are not trained. This approach seems to have low applicability because it requires a lot of time for image reconstruction, in contrast to learned methods. In the applications initially considered, for example, inpainting, denoising, and super-resolution, it is much easier to obtain or simulate data, which allows for the use of learned methods, and the DIP does not seem to have an advantage.

In this paper, we aim to explore the application of the DIP together with other deep learning methods for obtaining CT reconstructions when little training data is available. The structure of the paper and the main contributions are organized as follows. In section 2, we briefly describe the CT reconstruction problem. Section 3 provides a summary of related articles and approaches, together with some background and observations that we use as motivation for our work. In section 4, we introduce the combination of the DIP with classical regularization methods and discuss under which assumptions the classical regularization results still hold. In section 5, we propose a similar approach to the DIP but using an initial reconstruction given by any end-to-end learned method. Finally, in section 6, we present a benchmark of the different methods that we have analyzed using varying amounts of data from two standard datasets.

2. CT

CT is one of the most valuable technologies in modern medical imaging [9]. It allows for a non-invasive acquisition of the inside of the human body using x-rays. Since the introduction of CT in the 1970s, technical innovations such as new scan geometries have extended the limits on speed and resolution. Current research focuses on reducing the amount of potentially harmful radiation to which a patient is exposed during the scan [9]. These innovations include making measurements using lower intensity x-rays or at fewer angles. Both approaches introduce particular challenges for reconstruction methods that can severely reduce the image quality. In our work, we compare several reconstruction methods in these low-dose scenarios for a basic 2D parallel beam geometry (cf figure 1).

**Figure 1.** Parallel beam geometry.
Download figure:
Standard image High-resolution image

In this case, the forward operator is given by the 2D Radon transform [43] and models the attenuation of the x-ray when passing through a body. We can parameterize the path of an x-ray beam by the distance from the origin $s\in \mathbb{R}$ and angle φ ∈ [0, π]:

$\begin{equation}{L}_{s,\varphi }\left(t\right)=s\omega \left(\varphi \right)+t{\omega }^{\perp }\left(\varphi \right),\quad \omega \left(\varphi \right){:=}{\left[\mathrm{cos}\left(\varphi \right),\enspace \mathrm{sin}\left(\varphi \right)\right]}^{\text{T}}.\end{equation} \tag{ 1 }$

The Radon transform then calculates the integral along the line for parameters s and φ:

$\begin{equation}Ax\left(s,\varphi \right)={\int }_{\mathbb{R}}x\left({L}_{s,\varphi }\left(t\right)\right)\enspace \enspace \mathrm{d}t.\end{equation} \tag{ 2 }$

According to Beer–Lambert's law, the result is the logarithm of the ratio of the intensity, I₀, at the x-ray source to the intensity, I₁, at the detector

$\begin{equation}Ax\left(s,\varphi \right)=-\mathrm{ln}\left(\frac{{I}_{1}\left(s,\varphi \right)}{{I}_{0}\left(s,\varphi \right)}\right)=y\left(s,\varphi \right).\end{equation} \tag{ 3 }$

Calculating the transform for all pairs (s, φ) results in a so-called sinogram, which we also call an observation. To get a reconstruction $\hat{x}$ from the sinogram, we have to invert the forward model. Since the Radon transform is linear and compact, the inverse problem is ill-posed in the sense of Nashed [39, 40].

3. Related approaches and motivation

In this section, we first review and describe some of the existing data-driven and classical methods for solving ill-posed inverse problems, that have also been applied to obtain CT reconstructions. Following this, we review the DIP approach and related works.

In inverse problems one aims at obtaining an unknown quantity, in this case the image of the interior of the human body, from indirect measurements that frequently contain noise [16, 36, 44]. The problem is modeled by an operator A : X → Y between Banach or Hilbert spaces X and Y and the measured noisy data or observation:

$\begin{equation}{y}^{\delta }=A{x}^{{\dagger}}+\tau .\end{equation} \tag{ 4 }$

The aim is to obtain an approximation $\hat{x}$ for x^† (the true solution), where τ, with ||τ|| ⩽ δ, describes the noise in the measurement.

Classical approaches to solving inverse problems include linear pseudo inverses given by filter functions [36] or non-linear regularized inverses given by the variational approach

$\begin{equation}{\mathcal{T}}_{\alpha }\left({y}^{\delta }\right)\in \mathrm{arg}\underset{x\in \mathcal{D}}{\mathrm{min}}\enspace \mathcal{S}\left(Ax,\enspace {y}^{\delta }\right)+\alpha J\left(x\right),\end{equation} \tag{ 5 }$

where $\mathcal{S}:Y{\times}Y\to \mathbb{R}$ is the data discrepancy, $J:X\to \mathbb{R}\cup \left\{\infty \right\}$ is the regularizer, $\mathcal{D}{:=}\mathcal{D}\left(A\right)\cap \mathcal{D}\left(J\right)$ and $\mathcal{D}\left(A\right)$ , $\mathcal{D}\left(J\right)$ are the domains of A and J respectively. Examples of hand-crafted regularizers/priors are ||x||², ||x||₁ and total variation (TV). The value of the regularization parameter α should be carefully selected. One way to do that, in the presence of a validation dataset with some ground truth and observation pairs, is to do a line-search and select the α that yields the best performance on average, assuming there is a uniform noise level. Given validation data ${\left\{{x}_{i}^{{\dagger}},\enspace {y}_{i}^{\delta }\right\}}_{i=1}^{N}$ , the data-driven parameter choice would be

$\begin{equation}\hat{\alpha }{:=}\underset{\alpha \in {\mathbb{R}}_{+}}{\text{arg}\;\text{max}}\enspace \sum _{i=1}^{N}\ell \left({\mathcal{T}}_{\alpha }\left({y}_{i}^{\delta }\right),\enspace {x}_{i}^{{\dagger}}\right),\end{equation} \tag{ 6 }$

where $\ell :X{\times}X\to \mathbb{R}$ is some similarity measure, such as peak signal-to-noise ratio (PSNR) or structural self-similarity (SSIM).

Data-driven regularized inversion methods for solving inverse problems in imaging have recently had great success in terms of reconstruction quality [6]. Three main classes of methods are: end-to-end learned methods [1, 3, 8, 21, 28, 46], learned regularizers [34, 37] and generative networks [2, 7, 13]. For the study described in this paper, we only focus on the end-to-end learned methods.

3.1. End-to-end learned methods

In this section, we briefly review some of the most successful end-to-end learned methods. Most of them were implemented and included in our benchmark.

3.1.1. Post-processing

This method aims at improving the quality of the FBP reconstructions from noisy or few measurements by applying learned post-processing. Recent works [11, 28, 42, 48] have successfully used a convolutional neural network (CNN), such as the U-Net [45], to remove artifacts from FBP reconstructions. In mathematical terms, given a possibly regularized FBP operator ${\mathcal{T}}_{\text{FBP}}$ , the reconstruction is computed using a network D_θ : X → X as

$\begin{equation}\hat{x}{:=}\left[{D}_{\theta }{\circ}{\mathcal{T}}_{\text{FBP}}\right]\left({y}^{\delta }\right)\end{equation} \tag{ 7 }$

with parameters θ of the network that are learned from data.

3.1.2. Fully learned

Methods of this type aim at directly learning the inversion process from data while keeping the network architecture as general as possible. This idea was successfully applied in MRI by the AUTOMAP architecture [49]. The main building blocks consist of fully connected layers. Depending on the problem, the number of parameters can grow quickly with the data dimension. For mapping from sinogram to reconstruction in the LoDoPaB-CT dataset [32] (see section 6.1), such a layer would have over 1000 × 513 × 362² ≈ 67 × 10⁹ parameters. This makes the naive approach infeasible for large CT data.

He et al [22] introduced an adapted two-part network, called iRadonMap. The first part reproduces the structure of the FBP. A fully connected layer is applied along s and shared over the rotation angle dimension φ, playing the role of the filtering. For each reconstruction pixel (i, j) only sinogram values on the sinusoid s = i cos(φ) + j sin(φ) have to be considered and are multiplied by learned weights. For the example above, the number of parameters in this layer reduces to 513² + 362² × 1000 ≈ 13 × 10⁷. The second part consists of a post-processing network. We choose the U-Net architecture for our experiments, which allows for a direct comparison with the FBP + U-Net approach.

3.1.3. Learned iterative schemes

Another series of works [1, 3, 20, 21] use CNNs to improve iterative schemes commonly used in inverse problems for solving (5), such as gradient descent, proximal gradient descent or hybrid primal-dual algorithms. For example, the proximal gradient descent is given by the iteration

$\begin{equation}{x}^{\left(k+1\right)}={\phi }_{J,\enspace \alpha ,\enspace {\lambda }_{k}}\left({x}^{\left(k\right)}-{\lambda }_{k}{A}^{{\ast}}\left(A{x}^{\left(k\right)}-{y}^{\delta }\right)\right),\end{equation} \tag{ 8 }$

for k = 0, ..., L − 1, where ϕ_J,α,λ : X → X is the proximal operator or projector. In [20], the authors replace the projector by a CNN that is trained to project perturbed reconstructions to the set of clean reconstructions. However, this approach is not end-to-end because the network is first trained to do the projection and then inserted into the iterative scheme.

The idea behind end-to-end learned iterative methods is to unroll these schemes with a small number of iterations, and replace some operators by CNNs with parameters that are trained using ground truth and observation data pairs. Each iteration is performed by a convolutional network ${\psi }_{{\theta }_{k}}$ that includes the gradients of the data discrepancy and of the regularizer as input in each iteration. Moreover, the number of iterations is fixed and small, e.g., L = 10. The reconstruction operator is given by ${\mathcal{T}}_{\theta }:Y\to X$ with ${\mathcal{T}}_{\theta }\left({y}^{\delta }\right)={x}^{\left(L\right)}$ and

$\begin{align*}\hfill {x}^{\left(k+1\right)}& ={\psi }_{{\theta }_{k}}\left({x}^{\left(k\right)},\enspace {A}^{{\ast}}\left(A{x}^{\left(k\right)}-{y}^{\delta }\right),\nabla J\left({x}^{\left(k\right)}\right)\right)\hfill \\ \hfill {x}^{\left(0\right)}& ={A}^{+}\left({y}^{\delta }\right)\hfill \end{align*}$

for any pseudo inverse A⁺ of the operator A and θ = (θ₀, ..., θ_L−1). Alternatively, x⁽⁰⁾ could be just randomly initialized.

Similarly, more sophisticated algorithms, such as hybrid primal-dual algorithms, can be unrolled and trained in the same fashion. In this work, we used an implementation of the learned gradient descent [1] and the learned primal-dual method [3].

The above mentioned approaches all rely on a parameterized operator ${\mathcal{T}}_{\theta }:Y\to X$ , whose parameters θ are optimized using a training set of N ground truth samples ${x}_{i}^{{\dagger}}$ and their corresponding noisy observations ${y}_{i}^{\delta }$ . Usually, the empirical mean squared error is minimized, i.e.,

$\begin{equation}\hat{\theta }\in \underset{\theta \in {\Theta}}{\text{arg}\;\text{min}}\enspace \frac{1}{N}\sum _{i=1}^{N}{\Vert}{\mathcal{T}}_{\theta }\left({y}_{i}^{\delta }\right)-{x}_{i}^{{\dagger}}{{\Vert}}^{2}.\end{equation} \tag{ 9 }$

After training, the reconstruction $\hat{x}\in X$ from a noisy observation y^δ ∈ Y is given by $\hat{x}={\mathcal{T}}_{\hat{\theta }}\left({y}^{\delta }\right)$ . The main disadvantage of most of these approaches is that they do not enforce data consistency. As a consequence, some information in the observation could be ignored, yielding a result that might lack important features of the image. In medical imaging, this is critical since it might remove an indication of a lesion. Recent works [4, 19] also show that some methods, such as those which are fully learned or follow the post-processing approach, are unstable, which means that tiny perturbations in the ground truth or the measurements may result in severe artifacts in the reconstructions. These are the main motivations for the approach we introduce in section 5. Nevertheless, there exist other methods [46] that do enforce data consistency and may not suffer from these instabilities.

3.2. DIP

The DIP is similar to the generative networks approach and the variational method. However, instead of having a regularization term J(x), the regularization is incorporated by the reparametrization x = φ(θ, z), where φ is a deep generative network, for example a U-Net, with randomly initialized weights θ ∈ Θ, and z is a fixed input such as random white noise. The approach is depicted in figure 2 and consists in solving

$\begin{equation}\hat{\theta }\in \underset{\theta \in {\Theta}}{\text{arg}\;\text{min}}{\Vert}A\varphi \left(\theta ,\enspace z\right)-{y}^{\delta }{{\Vert}}^{2},\enspace \enspace \enspace \enspace \enspace \hat{x}{:=}\varphi \left(\hat{\theta },\enspace z\right).\end{equation} \tag{ 10 }$

The weights are optimized by a gradient descent method to minimize the data discrepancy of the output of the network. In the original method, the authors use gradient descent with early stopping to avoid reproducing noise. This is necessary due to the overparameterization of the network, which makes it able to reproduce the noise. The regularization is a combination of early stopping (similar to the Landweber iteration) and the architecture [14]. The drawback is that it is not clear how to choose when to stop. In the original work, the authors do this using a validation set and select the number of iterations that performs best on average in terms of PSNR.

**Figure 2.** The figure illustrates the DIP approach. We use a U-Net architecture with 128 channels at every layer. Some layers have additionally the skip channels (coming from the dashed arrows). We always use either 4 or 0 skip channels.
Download figure:
Standard image High-resolution image

The prior is related to the implicit structural bias of this kind of deep convolutional networks. In the original DIP paper [31] and more recently in [10, 24], it is shown that convolutional image generators, optimized with gradient descent, fit natural images faster than noise and learn to construct them from low to high frequencies. This effect is illustrated in figure 3.

**Figure 3.** Intermediate reconstructions of the DIP approach for CT (ellipses dataset, see section 6.2). At the beginning the coefficients are randomly initialized from a prior distribution. The method starts reconstructing the image from global to local details.
Download figure:
Standard image High-resolution image

3.2.1. Related work

The DIP approach has inspired many other researchers to improve it by combining it with other methods [35, 38, 47], to use it for a wide range of applications [17, 18, 26, 27] and to offer different perspectives and explanations of why it works [10, 12, 14]. In [38], the concept of regularization by denoising (RED) is introduced and it is shown how the two (DIP and RED) can be merged into a highly effective unsupervised recovery process. Another series of works also adds explicit priors but on the weights of the network. In [47], this is done in the form of a multi-variate Gaussian but learning the covariance matrix and the mean using a small dataset. In [12], a Bayesian perspective on the DIP is introduced by also incorporating a prior on the weights θ and conducting the posterior inference using stochastic gradient Langevin dynamics.

So far, the DIP has been used for denoising, inpainting, super-resolution, image decomposition [17], compressed sensing [47], PET [18], MRI [27] among other applications. A similar idea [26] was also used for structural optimization, which is a popular method for designing objects such as bridge trusses, airplane wings, and optical devices. Rather than directly optimizing densities on a grid, they instead optimize the parameters of a neural network which outputs those densities.

3.2.2. Network architecture

In the paper by Ulyanov et al [31], several architectures were considered, for example, ResNet [23], encoder–decoder (autoencoder) and a U-Net [45]. For inpainting large regions, the Autoencoder with depth = 6 performed best, whereas for denoising a modified U-Net achieved the best results. The regularization happens mainly due to the architecture of the network, which reduces the search space but also influences the optimization process to find more natural images. Therefore, for each application, it is crucial to choose the appropriate architecture and to tune hyper-parameters, such as the network's depth and the number of channels per layer. Optimizing the hyper-parameters is the most time-consuming part. In figure 4 we show some reconstructions from the ellipses dataset (see section 6.2) with different hyper-parameter choices. In this case, it seems that the U-Net without skip connections and depth 5 (encoder–decoder) achieves the best performance. One can see that when the number of channels is too low, the network does not have enough representation power. Also, if there are no skip channels, the higher the number of scales (equivalent to the depth), the more the regularization effect. The extraordinary success of this approach demonstrates that the architecture of the network has a significant influence on the performance of deep learning approaches that use similar kinds of networks.

**Figure 4.** CT reconstructions after 5000 iterations using the DIP with a U-Net architecture and different scales (depths), channels per layer (the network has the same number of channels at every layer) and number of skip connections (the first two rows do not use skip connections, i.e., skip: [0, 0, 0, 0, 0]). In the last row all reconstructions use 5 scales and 128 channels.
Download figure:
Standard image High-resolution image

3.2.3. Early-stopping

As mentioned earlier, in [31], it is shown that early stopping has a positive impact on the reconstruction results. It was observed that in some applications, such as denoising, the loss decreases rapidly toward natural images, but takes much more time to go toward noisy images. This empirical observation helps to determine when to stop. In figure 5, one can observe how the similarity with respect to the ground truth (measured by the PSNR and the SSIM metrics) reaches a maximum and then deteriorates during the optimization process.

**Figure 5.** Training loss and true similarity (PSNR and SSIM) of CT reconstructions using the DIP approach. The training was done over 15 000 iterations and the architecture is an encoder–decoder (no skip channels) with 5 scales and 128 channels per layer.
Download figure:
Standard image High-resolution image

4. DIP and classical regularization

In this section we analyze the DIP in combination with classical regularization, i.e., we include a regularization term $J:X\to \mathbb{R}\cup \left\{\infty \right\}$ , such as TV. We give necessary assumptions under which we are able to obtain standard guarantees in inverse problems, such as existence of a solution, convergence, and convergence rates.

In the general case, we consider X and Y to be Banach spaces, and A : X → Y a continuous linear operator. To simplify notation, we use φ(⋅) instead of φ(⋅, z), since the input to the network is fixed. Additionally, we assume that Θ is a Banach space, and φ : Θ → X is a continuous mapping.

The proposed method aims at finding

$\begin{equation}{\theta }_{\alpha }^{\delta }\in \underset{\theta \in {\Theta}}{\text{arg}\;\text{min}}\enspace \mathcal{S}\left(A\varphi \left(\theta \right),{y}^{\delta }\right)+\alpha J\left(\varphi \left(\theta \right)\right)\quad \text{for}\quad \alpha { >}0\end{equation} \tag{ 11 }$

to obtain

$\begin{equation}{\mathcal{T}}_{\alpha }\left({y}^{\delta }\right){:=}\varphi \left({\theta }_{\alpha }^{\delta }\right).\end{equation} \tag{ 12 }$

With this approach, we eliminate the need for early stopping, i.e., the need to find an optimal number of iterations. However, we introduce the problem of finding an optimal α, which is a classical issue in inverse problems. These problems are similar since both choices depend on the noise level of the observation data. The higher the noise, the higher the value of α or the smaller the number of iterations for obtaining optimal results.

If the range of φ is Ω := rg(φ) = X, i.e.,

$\begin{equation}\forall \enspace x\in X:\enspace \exists \enspace \theta \in {\Theta}\enspace \text{s.t}.\enspace \varphi \left(\theta \right)=x;\end{equation} \tag{ 13 }$

this is equivalent to the standard variational approach in equation (5). However, although the network can fit some noise, it cannot fit, in general, any arbitrary x ∈ X. This depends on the chosen architecture, and it is mainly because we do not use any fully connected layers. Nevertheless, the minimization in (11) is similar to the setting in equation (5) if we restrict the domain of A to be $\tilde {\mathcal{D}}\left(A\right){:=}\mathcal{D}\left(A\right)\cap {\Omega}$

$\begin{equation}{\mathcal{T}}_{\alpha }\left({y}^{\delta }\right)\in \mathrm{arg}\underset{x\in \tilde {\mathcal{D}}}{\mathrm{min}}\enspace \mathcal{S}\left(Ax,\enspace {y}^{\delta }\right)+\alpha J\left(x\right),\end{equation} \tag{ 14 }$

where $\tilde {\mathcal{D}}{:=}\tilde {\mathcal{D}}\left(A\right)\cap \mathcal{D}\left(J\right)$ . If the following assumptions are satisfied, then all the classical theorems, namely well-posedness, stability, convergence, and convergence rates, still hold, see [25].

Assumption 1. The range of φ with respect to θ (parameters of the network), namely Ω, is closed, i.e., if there is a convergent sequence {x_k} ⊂ Ω with limit $\tilde {x}$ , it holds $\tilde {x}\in {\Omega}$ .

Definition 1. An element ${x}^{{\dagger}}\in \tilde {\mathcal{D}}$ is called a J-minimizing solution if Ax^† = y^† and $\forall \enspace x\in \tilde {\mathcal{D}}:\enspace J\left({x}^{{\dagger}}\right){\leqslant}J\left(x\right)$ , where y^† is the perfect noiseless data.

Assumption 2. There exists a J-minimizing solution ${x}^{{\dagger}}\in \tilde {\mathcal{D}}$ and J(x^†) < ∞.

Assumption 1 guarantees that the restricted domain of A is closed, whereas assumption 2 guarantees that there is a J-minimizing solution in the restricted domain. In appendix A, we analyze in which cases these conditions hold.

5. DIP with initial reconstruction

In this section, we propose a two-steps approach based on the method from the previous section. The idea is to take the result from any end-to-end learned method $\mathcal{T}:Y\to X$ as initial reconstruction (first step) and further enforce data consistency by optimizing over its deep-neural parameterization (second step).

Definition 2 (Deep-neural parameterization). Given an untrained network φ : Θ × Z → X and a fixed input z ∈ Z, the deep-neural parameterization of an element x ∈ X with respect to φ and z is

$\begin{equation}{\theta }_{x}\in \underset{\theta \in {\Theta}}{\text{arg}\;\text{min}}{\Vert}\varphi \left(\theta ,\enspace z\right)-x{{\Vert}}^{2}.\end{equation} \tag{ 15 }$

The projection onto the range of the network is possible because of the result of assumption 1, i.e., the range is closed. If φ is a deep convolutional network, for example, a U-Net, the deep-neural parameterization has similarities with other signal representations, such as the Wavelets and Fourier transforms [26]. For image processing, such domains are usually more convenient than the classical pixel representation.

As shown in figure 6, one way to enforce data consistency is to project the initial reconstruction into the set where ||Ax − y^δ|| ⩽ δ. The puzzle is that due to the ill-posedness of the problem, the new solution (red point) will very likely have artifacts. The proposed approach first obtains the deep-neural parameterization θ₀ of the initial reconstruction $\mathcal{T}\left({y}^{\delta }\right)$ and then use it as starting point to minimize

$\begin{equation}\mathcal{L}\left(\theta \right){:=}{\Vert}A\varphi \left(\theta ,z\right)-{y}^{\delta }{{\Vert}}^{2}+\alpha J\left(\varphi \left(\theta ,z\right)\right),\end{equation} \tag{ 16 }$

over θ via gradient descent. The iterative process is continued until ||Aφ(θ, z) − y^δ|| ⩽ δ or for a given fixed number of iterations K determined by means of a validation dataset. This approach seems to force the reconstruction to stay close to the set of natural images because of the structural bias of the deep-neural parameterization. The procedure is listed in algorithm 1 and a graphical representation is shown in figure 6.

**Figure 6.** Graphical illustration of the DIP approach with initial reconstruction. The blue area refers to an approximation of some part of the space of *natural* images.
Download figure:
Standard image High-resolution image

Algorithm 1. DIP with initial reconstruction.

1: ${x}_{0}{\leftarrow}\mathcal{T}\left({y}^{\delta }\right)$ ${x}_{0}{\leftarrow}\mathcal{T}\left({y}^{\delta }\right)$

2: z ← noise

3: ${\theta }_{0}\in \underset{\theta }{\text{arg}\;\text{min}}{\Vert}\varphi \left(\theta ,z\right)-{x}_{0}{{\Vert}}^{2}$ ${\theta }_{0}\in \underset{\theta }{\text{arg}\;\text{min}}{\Vert}\varphi \left(\theta ,z\right)-{x}_{0}{{\Vert}}^{2}$

4: for k ← 0 to K − 1 do

5: $\omega \in \partial \mathcal{L}\left({\theta }_{k}\right)$ $\omega \in \partial \mathcal{L}\left({\theta }_{k}\right)$

6: θ_k+1 ← θ_k − ηω

7: end for

8: $\hat{\mathcal{T}}\left({y}^{\delta }\right){\leftarrow}\varphi \left({\theta }_{k},z\right)$ $\hat{\mathcal{T}}\left({y}^{\delta }\right){\leftarrow}\varphi \left({\theta }_{k},z\right)$

The new method $\hat{\mathcal{T}}:Y\to X$ is similar to other image enhancement approaches. For example, related methods [15] first compute the wavelet transform (parameterization), and then repeatedly perform smoothing or shrinking of the coefficients (further optimization).

6. Benchmark setup and results

For the benchmark, we implemented the end-to-end learned methods described in section 3.1. We trained them on different data sizes and compared them with classical methods, such as FBP and TV regularization, and with the proposed methods. The datasets we use were recently released to benchmark deep learning methods for CT reconstruction [32]. They are accessible through the DIVαℓ python library [33]. We also provide the code and the trained methods in the following GitHub repository: https://github.com/oterobaguer/dip-ct-benchmark.

6.1. The LoDoPaB-CT dataset

The low-dose parallel beam (LoDoPaB) CT dataset [32] consists of more than 40 000 two-dimensional CT images and corresponding simulated low-intensity measurements. Human chest CT reconstructions from the LIDC/IDRI database [5] are used as virtual ground truth. Each image has a resolution of 362 × 362 pixels. For the simulation setup, a simple parallel beam geometry with 1000 angles and 513 projection beams is used. To simulate low intensity, Poisson noise is applied to the projection data. The noise amount corresponds to an x-ray source that on average emits 4096 photons per detector pixel. We use the standard dataset split defining in total 35 820 training pairs, 3522 validation pairs and 3553 test pairs. In addition, we analyze another dataset, LoDoPaB (200), obtained by uniformly sampling 200 angles from the original 1000 without any further modification.

6.2. Ellipses dataset

As a synthetic dataset for imaging problems, random phantoms of combined ellipses are commonly used. We use the 'ellipses' standard dataset from the DIVαℓ python library (as provided in version 0.4) [33]. The images have a resolution of 128 × 128 pixels. Measurements are simulated with a parallel beam geometry with only 30 angles and 183 projection beams. In addition to the sparse-angle setup, moderate Gaussian noise with a standard deviation of 2.5% of the mean absolute value of the projection data is added to the projection data. In total, the training set contains 32 000 pairs, while the validation and test set consist of 3200 pairs each.

6.3. Implementation details

For the DIP with initial reconstruction, we used the learned primal-dual, which has the best performance among the compared methods (see the results in figure 7). For each data size, we chose different hyper-parameters, namely the step-size η, the TV regularization parameter α, and the number of iterations K, based on the available validation dataset.

**Figure 7.** Benchmark results of several existing methods and the proposed approaches (DIP + TV, learned primal-dual + DIP) on the Ellipses, LoDoPaB (200) and LoDoPaB datasets. The horizontal lines indicate the performance of data-free methods.
Download figure:
Standard image High-resolution image

Minimizing $\mathcal{L}\left(\theta \right)$ in (16) is not trivial because TV is not differentiable. In our implementation we use the PyTorch automatic differentiation framework [41] and the ADAM [29] optimizer. For the Ellipses dataset we use the ℓ₂-discrepancy term, whereas for LoDoPaB we use the Poisson loss.

6.4. Numerical results

We trained all the methods with different dataset sizes. For example, 0.1% on the ellipses dataset means we trained the model with 0.1% (32 data-pairs) of the available training data and 0.1% (3 data-pairs) of the validation data. Afterward, we tested the performance of the method on the first 100 samples of the test dataset (in the original order, i.e., not sorted by patient). This reduced test dataset was used because some of the methods require a lot of time for reconstruction, and the mean performance on 100 samples already allows for accurate benchmarking. The results are depicted in figure 7 and more details can be found in appendix B.

As expected, the fully learned method (iRadonMap) requires a large amount of data to achieve acceptable performance. On the ellipses and LoDoPaB (200) dataset, it outperformed TV using 100% of the data, whereas on the LoDoPaB dataset, it performed just slightly better than the FBP. The learned post-processing (FBP + U-Net) required much less data. It outperformed TV with only 10% of the ellipses dataset and 0.1% of the LoDoPaB dataset. On the other hand, we find that the learned primal-dual is very data efficient and achieved the best performance. In figure 8, we show some results from the test set for different data sizes.

**Figure 8.** Reconstructions of test samples using the learned primal-dual method trained with different amounts of data from the ellipses and LoDoPaB datasets. The ℓ₂ data error measures the discrepancy between the noisy observation and the noise-free projection of the (reconstructed) image.
Download figure:
Standard image High-resolution image

The DIP + TV approach achieved the best results among the data-free methods. On average, it outperforms TV by 1 dB on all the analyzed datasets. In figures 9 and 10, it can be observed that TV tends to produce flat regions but also produces high staircase effects on the edges. On the other hand, the combination with DIP produces more realistic edges. For the first two smaller data sizes of the ellipses and LoDoPaB (200) datasets, it performs better than all the end-to-end learned methods.

**Figure 9.** Reconstruction obtained with the FBP method, isotropic TV regularization and the DIP approach combined with TV, for test samples from the ellipses and LoDoPaB datasets. The ℓ₂ data error measures the discrepancy between the noisy observation and the noise-free projection of the (reconstructed) image.
Download figure:
Standard image High-resolution image

**Figure 10.** Reconstruction obtained with the FBP method, isotropic TV regularization and the DIP approach combined with TV, for test samples from the LoDoPaB (200) dataset. The ℓ₂ data error measures the discrepancy between the noisy observation and the noise-free projection of the (reconstructed) image.
Download figure:
Standard image High-resolution image

The DIP + TV with initial reconstruction improved the results on the low-data regime for the ellipses and LoDoPaB (200) datasets. For the higher data sizes and the LoDoPaB dataset, it did not yield reconstructions with higher quality than those already obtained by the DIP + TV or learned primal-dual methods. We believe that this approach is more useful in the case of having sparse measurements and little training data.

In figures 11 and 12, we show some reconstructions obtained using this method for the LoDoPab (200) and ellipses datasets. The reconstructions have a better data consistency with respect to the observed data (ℓ₂-discrepancy) and higher quality both visually and in terms of the PSNR and SSIM measures. Moreover, this approach is in general much faster, even if we also consider the iterations required to obtain the deep-prior/neural parameterization of the first reconstruction. These initial iterations are much faster because they only use the identity operator instead of the Radon transform. For example, for the Ellipses dataset, the DIP + TV approach needs 8000 iterations to obtain optimal performance in a validation dataset (five ground truth and observation pairs). On the other hand, by using the initial reconstruction, it needs 4000 iterations with the identity operator and only 1000 with the Radon transform operator, which results in a 2× speed factor.

**Figure 11.** Examples of reconstructions obtained with the DIP + TV approach, the learned primal-dual method trained with 0.01% and 0.1% of the LoDoPaB (200) dataset and the DIP + TV approach with initial reconstruction. The ℓ₂ data error measures the discrepancy between the noisy observation and the noise-free projection of the (reconstructed) image.
Download figure:
Standard image High-resolution image

**Figure 12.** Examples of reconstructions obtained with the DIP approach combined with TV, the learned primal-dual method trained with 0.1% and 0.2% of the Ellipses dataset (32 and 64 resp. data-pairs) and the DIP approach with initial reconstruction. The ℓ₂ data error measures the discrepancy between the noisy observation and the noise-free projection of the (reconstructed) image.
Download figure:
Standard image High-resolution image

7. Conclusions

In this work, we study the combination of classical regularization, deep-neural parameterization, and deep learning approaches for CT reconstruction. We benchmark the investigated methods and evaluate how they behave in low-data regimes. Among the data-free approaches, the DIP + TV method achieves the best results. However, it is considerably slow and does not benefit from having a small dataset with reference reconstructions. On the other hand, the learned primal-dual is very data efficient. However, it lacks data consistency when not trained with enough data. These issues motivate us to adjust the reconstruction obtained with the learned primal-dual to match the observed data. We solved the puzzle without introducing artifacts through a combination of classical regularization and the DIP.

The results presented in this paper offer several baselines for future comparisons with other approaches. Moreover, the proposed methods could be applied to other imaging modalities.

Acknowledgments

The authors acknowledge the support by the Deutsche Forschungsgemeinschaft (DFG) within the framework of GRK 2224/1 'π³: Parameter Identification—Analysis, Algorithms, Applications'. The authors also thank Jonas Adler, Jens Behrmann, Sören Dittmer, Peter Maass and Michael Pidcock for useful comments and discussions.

Appendix A.: DIP and classical regularization

The mapping φ : Θ → X has a neural network structure, with a fixed input $z\in {\mathbb{R}}^{{n}_{0}}$ , and can be expressed as a composition of affine mappings and activation functions:

$\begin{equation}\varphi ={\sigma }^{\left(L\right)}{\circ}{\mathcal{K}}^{\left(L\right)}{\circ}\cdots {\circ}{\sigma }^{\left(2\right)}{\circ}{\mathcal{K}}^{\left(2\right)}{\circ}{\sigma }^{\left(1\right)}{\circ}{\mathcal{K}}^{\left(1\right)},\end{equation} \tag{ A.1 }$

where ${\mathcal{K}}^{\left(i\right)}\left(x\right){:=}{W}^{\left(i\right)}x+{b}^{\left(i\right)}$ , ${W}^{\left(i\right)}\in {G}^{\left(i\right)}\subseteq {\mathbb{R}}^{{n}_{i}{\times}{n}_{i-1}}$ , ${b}^{\left(i\right)}\in {B}^{\left(i\right)}\subseteq {\mathbb{R}}^{{n}_{i}}$ , ${\sigma }^{\left(i\right)}:\mathbb{R}\to \mathbb{R}$ (applied component-wise), and θ = (W^(L), b^(L), ..., W⁽¹⁾, b⁽¹⁾) ∈ G^(L) × B^(L) ⋯ × G⁽¹⁾ × B⁽¹⁾ = Θ. In the following we analyze under which conditions we can guarantee that the range of φ (with respect to Θ) is closed.

Definition 3. An activation function $\sigma :\mathbb{R}\to \mathbb{R}$ is valid, if it is continuous, monotone, and bounded, in the sense there exist c > 0 such that ∀ x ∈ X : |σ(x)| ⩽ c|x|.

Lemma 1. Let φ be a neural network φ : Θ → X with L layers. If Θ is a compact set, and the activation functions σ⁽ⁱ⁾ are valid, then the range of φ is closed.

Proof. In order to prove the result, we show that the range after each layer of the network is compact.

(a)
Let the set $V=\left\{Wu:\enspace W\in G\subset {\mathbb{R}}^{m{\times}n},\enspace u\in U\subset {\mathbb{R}}^{n}\right\}$ , where G and U are compact sets, i.e., bounded and closed. Since G and U are bounded, it follows that V is bounded.Let the sequence {W^(k) u^(k)}, with W^(k) ∈ G and u^(k) ∈ U, converge to v. Since {W^(k)} and {u^(k)} are bounded, there is a subsequence $\left\{{\overline{W}}^{\left(k\right)}{\bar{u}}^{\left(k\right)}\right\}$ , where both $\left\{{\overline{W}}^{\left(k\right)}\right\}$ and {ū^(k)} converge to $\overline{W}\in G$ and ū ∈ U respectively. It follows that $\left\{{\overline{W}}^{\left(k\right)}{\bar{u}}^{\left(k\right)}\right\}$ converges to $\overline{W}\bar{u}$ , therefore, $v=\overline{W}\bar{u}\in V$ , which shows that V is closed. Thus, V is compact.

(b)
From (a), the fact that G⁽ⁱ⁾, B⁽ⁱ⁾ are compact sets, and assuming ${U}^{\left(i\right)}\subset {\mathbb{R}}^{{n}_{i}-1}$ is also compact, it follows that ${V}^{\left(i\right)}=\left\{Wu+b:\enspace W\in {G}^{\left(i\right)},\enspace u\in {U}^{\left(i\right)},b\in {B}^{\left(i\right)}\subset {\mathbb{R}}^{{n}_{i}}\right\}$ is compact.
(c)
It is easy to show that if the pre-image of a valid activation σ is compact, then its image is also compact.

In the first layer, U₀ = {z}, which is compact; thus, using (a), (b), and (c) it can be shown by induction that the range of φ : Θ → Ω is closed. □

All activation functions commonly used in the literature, for example, sigmoid, hyperbolic tangent, and piece-wise linear activations, are valid. The bounds on the weights of the network can be ensured by clipping the weights after each gradient update. □

Remark 1. An alternative condition to the bound on the weights is to use only valid activation functions with closed range, for example, ReLU or leaky ReLU. However, it wouldn't be possible to use sigmoid or hyperbolic tangent. In our experiments, we observed that having a sigmoid activation in the last layer of the DIP network performs better than having a ReLU.

Appendix B.: Dataset details, hyper-parameters and results

In this appendix, we present all the hyper-parameters tables B3–B10 that were selected for the method using a validation set. The first two tables B1–B2 depict the number of samples used for training and validation in each case.

Table B1. Amounts of training and validation pairs from the ellipses dataset used for the benchmark in section 6.

%	0.1	0.2	0.5	1.0	2.0	5.0	10.0	25.0	50.0	100.0
#train	32	64	160	320	640	1600	3200	8000	16 000	32 000
#val	3	6	16	32	64	160	320	800	1600	3200

Table B2. Amounts of training and validation pairs from the LoDoPaB dataset used for the benchmark in section 6. The last two lines denote the numbers of patients of whom images are included.

%	0.01	0.1	1.0	10.0	100.0
#train	3	35	358	3582	35 820
#val	1	3	35	352	3522
#patients train	1	1	7	64	632
#patients val	1	1	1	6	60

Table B3. FBP hyper-parameters and results.

Dataset	Filter type	Low-pass cut-off	PSNR (dB)	SSIM
Ellipses	Hann	0.7051	24.18	0.5939
LoDoPaB (200)	Hann	0.5000	28.38	0.6492
LoDoPaB	Hann	0.6410	30.37	0.7386

Table B4. TV hyper-parameters and results. The step size is set to 10⁻³.

Dataset	Loss function	α	PSNR (dB)	SSIM
Ellipses	ℓ₂	7.743 × 10⁻⁴	27.84	0.8495
LoDoPaB (200)	Poisson	12.63	30.89	0.7563
LoDoPaB	Poisson	20.55	32.95	0.8034

Table B5. DIP + TV hyper-parameters and results. For all experiments the number of channels is set to 128 at every scale. For the output sigmoid activation is used.

Dataset	Loss func.	Scales	Skip channels	Α	step size	PSNR (dB)	SSIM
Ellipses	ℓ₂	5	(0, 0, 0, 0, 0)	3.162 × 10⁻⁴	1 × 10⁻³	28.94	0.8855
LoDoPaB (200)	Poisson	6	(0, 0, 0, 0, 4, 4)	4.0	5 × 10⁻⁴	32.51	0.7803
LoDoPaB	Poisson	6	(0, 0, 0, 0, 4, 4)	7.0	5 × 10⁻⁴	34.44	0.8143

Table B6. DIP + TV (with initial reconstruction given by the learned primal-dual method). For all experiments the number of channels is set to 128 at every scale. For the output sigmoid activation is used.

Dataset	Data size (%)	Loss func.	Scales	Skip channels	Α	PSNR (dB)	SSIM
Ellipses	0.1	ℓ₂	5	(0, 0, 0, 0, 0)	3.162 × 10⁻⁴	29.23	0.8915
0.2	ℓ₂	5	(0, 0, 0, 0, 0)	2.154 × 10⁻⁴	29.39	0.8911
0.5	ℓ₂	5	(0, 0, 0, 0, 0)	2.154 × 10⁻⁴	29.85	0.904
1.0	ℓ₂	5	(0, 0, 0, 0, 0)	2.154 × 10⁻⁴	30.39	0.915
2.0	ℓ₂	5	(0, 0, 0, 0, 0)	2.154 × 10⁻⁴	30.99	0.9253
5.0	ℓ₂	5	(0, 0, 0, 0, 0)	2.154 × 10⁻⁴	31.44	0.9285
10.0	ℓ₂	5	(0, 0, 0, 0, 0)	1.292 × 10⁻⁴	31.78	0.9337
LoDoPaB (200)	0.01	Poisson	6	(0, 0, 0, 0, 4, 4)	4.0	32.52	0.7822
0.1	Poisson	6	(0, 0, 0, 0, 4, 4)	3.0	32.78	0.7821

Table B7. FBP + U-Net. The input FBP reconstruction uses a Hann filter with no additional low-pass filter. Common hyperparameters: scales = 5, skip channels = 4, linear output (i.e. no sigmoid activation). The maximum learning rate is set to 10⁻² or 10⁻³ and scheduled with either cosine annealing or one-cycle policy.

Dataset	Data size (%)	Channels	Batch size	Epochs	PSNR (dB)	SSIM
Ellipses	0.1	(32, 32, 64, 64, 128)	16	5000	26.33	0.7895
0.2	(32, 32, 64, 64, 128)	16	5000	26.59	0.8042
0.5	(32, 32, 64, 64, 128)	16	5000	26.80	0.8114
1.0	(32, 32, 64, 64, 128)	16	5000	27.12	0.8321
2.0	(32, 32, 64, 64, 128)	16	2500	27.44	0.8323
5.0	(32, 32, 64, 64, 128)	16	1000	27.97	0.8604
10.0	(64, 64, 128, 128, 256)	16	700	28.49	0.8751
25.0	(64, 64, 128, 128, 256)	16	280	28.80	0.8872
50.0	(64, 64, 128, 128, 256)	16	140	29.10	0.8940
100.0	(64, 64, 128, 128, 256)	16	70	29.36	0.8987
LoDoPaB (200)	0.01	(32, 32, 64, 64, 128)	32	5000	29.33	0.7143
0.1	(32, 32, 64, 64, 128)	32	5000	31.58	0.7616
1.0	(32, 32, 64, 64, 128)	32	2000	32.60	0.7818
10.0	(32, 32, 64, 64, 128)	32	500	33.19	0.7931
100.0	(32, 32, 64, 64, 128)	32	250	33.55	0.7994
LoDoPaB	0.01	(32, 32, 64, 64, 128)	32	5000	31.36	0.7727
0.1	(32, 32, 64, 64, 128)	32	5000	33.27	0.7982
1.0	(32, 32, 64, 64, 128)	32	2000	34.62	0.8209
10.0	(32, 32, 64, 64, 128)	32	500	35.18	0.8313
100.0	(32, 32, 64, 64, 128)	32	250	35.48	0.8371

Table B8. Learned gradient descent. For all experiments the number of iterations is set to L = 10. The output of the network is linear, i.e. no sigmoid activation is used.

Dataset	Data size (%)	Channels	Batch size	Epochs	lr	PSNR (dB)	SSIM
Ellipses	0.1	32	32	5000	10⁻³	27.81	0.8580
0.2	32	32	5000	10⁻³	28.40	0.8769
0.5	32	32	5000	10⁻³	29.15	0.8955
1.0	32	32	5000	10⁻³	29.55	0.9027
2.0	32	32	2500	10⁻³	29.70	0.9051
5.0	32	32	1000	10⁻³	29.84	0.9077
10.0	32	32	500	10⁻³	29.88	0.9082
25.0	32	32	200	10⁻³	29.95	0.9094
50.0	32	32	100	10⁻³	30.07	0.9121
100.0	32	32	50	10⁻³	30.30	0.9162
LoDoPaB (200)	0.01	32	20	5000	10⁻⁴	29.87	0.7151
0.1	32	20	5000	10⁻⁵	31.28	0.7473
1.0	32	20	500	10⁻⁵	31.83	0.7602
10.0	64	1	200	10⁻⁵	32.41	0.7724
100.0	64	1	20	10⁻⁵	32.41	0.7724
LoDoPaB	0.01	32	1	5000	10⁻³	32.70	0.7860
0.1	32	1	5000	10⁻³	33.81	0.8043
1.0	32	1	500	10⁻³	34.29	0.8103
10.0	64	1	100	10⁻⁴	34.34	0.8115
100.0	64	1	10	10⁻⁴	34.36	0.8122

Table B9. Learned primal-dual. For all experiments the number of iterations is set to L = 10. The output of the network is linear, i.e. no sigmoid activation is used.

Dataset	Data size [%]	Channels	Batch size	Epochs	lr	PSNR [dB]	SSIM
Ellipses	0.1	32	5	5000	10⁻³	28.09	0.8621
0.2	32	5	5000	10⁻³	28.45	0.8778
0.5	32	5	5000	10⁻³	29.35	0.8997
1.0	32	5	5000	10⁻³	30.11	0.9124
2.0	32	5	2500	10⁻³	30.84	0.9258
5.0	32	5	1000	10⁻³	31.44	0.9282
10.0	32	5	500	10⁻³	31.84	0.9360
25.0	32	5	200	10⁻³	32.15	0.9367
50.0	32	5	100	10⁻³	32.21	0.9390
100.0	32	5	50	10⁻³	32.27	0.9403
LoDoPaB (200)	0.01	32	1	5000	10⁻³	29.65	0.7343
0.1	32	1	5000	10⁻³	32.48	0.7771
1.0	32	1	500	10⁻³	33.21	0.7929
10.0	64	1	100	10⁻⁴	33.53	0.7990
100.0	64	1	10	10⁻⁴	33.64	0.8020
LoDoPaB	0.01	32	1	5000	10⁻³	32.68	0.7842
0.1	32	1	5000	10⁻³	34.65	0.8227
1.0	32	1	500	10⁻³	35.27	0.8303
10.0	64	1	100	10⁻⁴	35.63	0.8401
100.0	64	1	10	10⁻⁴	35.73	0.8426

Table B10. iRadonMap. The U-Net part of the network has the same hyperparameters for all experiments: scales = 5, skip channels = 4, channels = (32, 32, 64, 64, 128). The learning rate is set to 10⁻². Selection of the sigmoid output is based on the validation performance; the difference on LoDoPaB with and without sigmoid is marginal.

Dataset	Data size (%)	Batch size	Epochs	Sigmoid output	PSNR (dB)	SSIM
Ellipses	0.1	64	1000	✓	17.83	0.2309
0.2	64	1000	✓	18.35	0.2837
0.5	64	1000	✓	21.41	0.5378
1.0	64	1000	✓	22.64	0.6312
2.0	64	1000	✓	23.62	0.7042
5.0	64	1000	✓	24.77	0.7444
10.0	64	1000	✓	25.61	0.8051
25.0	64	400	✓	26.56	0.8389
50.0	64	200	✓	27.36	0.8615
100.0	64	100	✓	28.02	0.8766
LoDoPaB (200)	0.01	32	150	✓	14.61	0.3529
0.1	32	150		18.77	0.4492
1.0	32	150		24.63	0.6031
10.0	32	150		31.27	0.7569
100.0	32	30	✓	32.45	0.7781
LoDoPaB	0.01	2	150		14.82	0.3737
0.1	2	150		17.67	0.4438
1.0	2	150		22.73	0.5361
10.0	2	150		28.69	0.6929
100.0	2	15	✓	30.99	0.7486

For the data-free baseline approaches, i.e. FBP and TV, we used 100 samples for selecting the optimal hyper-parameters. In the low-data regime this by far exceeds the number of samples used by the learned approaches, leading to a slight bias of the comparison in favor of the data-free baseline approaches. For the DIP + TV we used at most 5 samples for validation and selection of hyper-parameters.

For the learned methods, the numbers of epochs listed in the tables denote the maximum numbers—the model with best mean PSNR on the validation set reached during training is selected. In some cases we used a learning rate scheduler that improved the training. More details can be found in https://github.com/oterobaguer/dip-ct-benchmark.

Computed tomography reconstruction using deep image prior and learned reconstruction methods

Article metrics

Submit

Author e-mails

Author affiliations

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. CT