Reducing the risk of hallucinations with interpretable deep learning models for low-dose CT denoising: comparative performance analysis

Objective. Reducing the CT radiation dose is an often proposed measure to enhance patient safety, which, however, results in increased image noise and thus degraded clinical image quality. Several deep learning methods have been proposed for low-dose CT (LDCT) denoising. The high risks posed by possible hallucinations in clinical images necessitate methods which aid the interpretation of deep learning networks. In this study, we aim to use qualitative reader studies and quantitative radiomics studies to assess the perceived quality, signal preservation, and statistical feature preservation of LDCT volumes denoised by deep learning, and to compare interpretable deep learning methods with classical deep neural networks in clinical denoising performance. Approach. We conducted an image quality analysis study to assess the denoised volumes on four criteria of perceived image quality. We subsequently conducted a lesion detection/segmentation study to assess the impact of denoising on signal detectability. Finally, a radiomic analysis study was performed to observe the quantitative and statistical similarity of the denoised images to standard dose CT (SDCT) images. Main results. Specific deep learning based algorithms generated denoised volumes which were qualitatively inferior to SDCT volumes (p < 0.05). Contrary to previous literature, denoising the volumes did not reduce the accuracy of the segmentation (p > 0.05). The denoised volumes, in most cases, yielded radiomics features which were statistically similar to those generated from SDCT volumes (p > 0.05). Significance. Our results show that the denoised volumes have a lower perceived quality than SDCT volumes. Noise and denoising did not significantly affect the detectability of the abdominal lesions. Denoised volumes also contained features statistically similar to those of SDCT volumes.


Introduction
The use of radiative imaging in CT scans has been estimated to pose some risk to patient safety (Monga 2007). An often considered method to reduce the radiation dose is using low-dose CT (LDCT) imaging in place of standard dose CT (SDCT) imaging; however, this results in noisy images (Oppelt 2005). Deep learning transformations for CT denoising (Chen et al 2017, Wolterink et al 2017, Shan et al 2018, Patwari et al 2020a, 2020b) have grown in popularity over the last few years, and are now commercially available (Boedeker et al 2019, Hsieh et al 2019). Even though deep learning models have shown excellent results in LDCT denoising, the uninterpretability of these approaches raises several concerns. Large deep learning models, particularly language models, have been shown to hallucinate information (Alkaissi and McFarlane 2023, Ji et al 2023). Hallucination in a medical image could result in the addition of clinically incorrect information to, or the removal of clinically relevant information from, the image (Antun et al 2019, Bhadra et al 2020, Genzel et al 2020). In fact, using deep learning for tomographic reconstruction has shown visible structural instability (Antun et al 2019) and hallucinations (Bhadra et al 2020), although such methods have generally shown robustness (Genzel et al 2020). To avoid the possibility of hallucination, the use of interpretable deep learning models is necessary. Interpretable deep learning models differ from classical deep learning models by providing an insight into their behaviour and functioning. For example, a deep learning model for CT reconstruction which includes the physics of the reconstruction task in the network architecture would be considered an interpretable deep learning model (Würfl et al 2016). In contrast, a straightforward stack of convolutional layers would be considered a classical deep learning model. There have been approaches to build interpretable deep learning denoising models, either by tuning known filter parameters (Shen et al 2018, Patwari et al 2020a, 2022) or by integrating known operators to mimic the physics of the imaging system (Syben et al 2019, Patwari et al 2020b).
Most image processing studies using deep learning report their results using quantitative metrics such as the PSNR, SSIM (Wang et al 2004, Patwari et al 2020a, 2020b), and normalized root mean square error (Chen et al 2017, Wolterink et al 2017, Shan et al 2018). While these metrics do provide an insight into image quality (Zhang et al 2018), image quality for clinical purposes is assessed using noise based metrics such as the noise power spectrum (Kijewski and Judy 1987), or task based metrics such as low contrast detectability with model observers (Wunderlich et al 2015, Sidky et al 2020, Zhou et al 2020, Li et al 2021). These metrics require a defined task to be pre-computed, and may not transfer directly to clinical data due to the non-linearity and uninterpretability of classical deep learning.
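For concreteness, the PSNR mentioned above reduces to a few lines of numpy; the sketch below is a generic implementation of the standard definition, not the code used in the cited studies (SSIM requires a windowed, locally weighted computation and is omitted here).

```python
import numpy as np

def psnr(reference, test, data_range=None):
    """Peak signal-to-noise ratio (dB) of `test` against `reference`."""
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    if data_range is None:
        # Default to the dynamic range of the reference image.
        data_range = reference.max() - reference.min()
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```

For CT, `data_range` is usually fixed to the Hounsfield window of interest rather than derived from the reference image.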
The gold standard in image quality assessment remains the subjective assessment of denoised images by clinicians (Brendlin et al 2022). The quality of medical images as assessed by clinicians is usually not in agreement with the standard comparative metrics (PSNR, SSIM) used to analyze deep learning based LDCT denoising methods (Verdun et al 2015, Renieblas et al 2017), and may even be task-dependent (Li et al 2021), although some uncommon statistical metrics show higher agreement with clinicians (Renieblas et al 2017).
The low contrast signal detectability task is often used to measure signal quality (Favazza et al 2015, Kopp et al 2018, Sidky et al 2020, Li et al 2021). It has been observed that classical deep learning based denoising methods can reduce signal detectability while improving PSNR and SSIM (Li et al 2021). However, to the best of our knowledge, the impact of interpretable deep learning methods on signal detectability has not been reported. An emerging quantitative task for assessing image quality is radiomic feature extraction (Song et al 2020, Guiot et al 2022). There is a growing body of research which presents the dissimilarity of radiomics features as a clinically sound image quality metric beyond traditional comparative metrics (Wei et al 2020, Pan et al 2021, Wei and Hsu 2021, Moummad et al 2022).
In this study, we aimed to assess the performance of interpretable deep learning models using clinically relevant metrics, particularly in comparison to existing classical deep learning methods. We compared the volumes denoised by interpretable deep learning methods to those denoised by classical deep learning methods, as well as to SDCT volumes. We conducted a human reader study for image quality analysis (IQA) and lesion detection/segmentation, as a proxy for signal detectability. We also conducted a radiomic analysis to extract and compare radiomic feature dissimilarity.

Network training

Data: Training data consisted of 58 abdominal contrast-enhanced CT scans in the portal-venous phase performed at the University Hospital Zurich. This data was acquired using a Somatom Force (Siemens Healthineers, Forchheim, Germany) in single source mode with a z-axis flying focal spot and 0.5 s rotation time. The field of view (FOV) was limited to the patient. The data was reconstructed with an image matrix of 512 × 512, with a slice thickness of 0.6 mm and an increment of 0.6 mm, using weighted filtered backprojection (Stierstorfer et al 2004) with a medium soft body kernel. Simulated LDCT volumes were generated using Poisson noise insertions to the pre-log x-ray projections. The noise was estimated using the following formula:

P_B = P_A + sqrt( ((1 − a) / a) · exp(P_A) / N_0A ) · x

where P_B is the reduced dose measurement with noise applied, P_A is the original full dose measurement, a is a scaling factor for reduction of the tube current, N_0A is the estimated number of penetrated photons for the full dose scan, and x is a stochastic process with unit variance and zero mean. Other factors such as electronic noise, bowtie filtering, and the AEC were taken into account (Yu et al 2012). The implementation of the noise insertion was through a proprietary tool (Siemens Healthineers, Forchheim, Germany). In line with previous work (Chen et al 2017, Wolterink et al 2017, Shan et al 2018, Patwari et al 2020a, 2020b), we simulated 25% of the clinical dose as the LDCT. Additionally, to avoid our models learning a specific dose level, we also simulated a second set of CT volumes at 50% of the clinical dose. Both sets of LDCT volumes were used for training of our denoising networks.

Methodology: We used two modern interpretable denoising techniques: RLDN (Patwari et al 2020a, 2022) and JBFnet (Patwari et al 2020b). For comparison, we included three classical deep learning approaches: CPCE3D (Yin et al 2019), REDCNN (Chen et al 2017) and GAN3D (Wolterink et al 2017). The denoisers were used to map the noisy volumes, both 50% and 25%, to the SDCT volumes. The errors between the generated volume and the actual SDCT volume were computed using the cost functions described in the literature (Chen et al 2017, Wolterink et al 2017, Yin et al 2019, Patwari et al 2020a, 2020b, 2022). To prevent blurring of edges, we replaced the mean squared error metric with the sum of the mean absolute error and the structural similarity metric. Gradient descent (Robbins and Monro 1951) was used to adjust the weights of the various neural networks. The error gradients of each weight in the deep network were calculated using backpropagation, and the updates were applied using the Adam optimizer (Kingma and Ba 2015). The learning rate of the optimizer was set to 10^−4, with the betas set to 0.9 and 0.99. Learning rate decay was not used. The parameter counts for all methods are presented in the supplementary material.
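The noise insertion step can be sketched in a few lines. The version below is a simplified re-implementation of the stated formula only, not the proprietary Siemens tool: it omits the electronic-noise, bowtie-filtering, and AEC corrections, and the function and argument names are ours.

```python
import numpy as np

def insert_noise(p_full, a, n0, seed=None):
    """Simulate a reduced-dose log projection P_B from a full-dose one P_A.

    p_full : full-dose log-projection values (P_A)
    a      : tube-current scaling factor, e.g. 0.25 for a 25% dose scan
    n0     : estimated photon count of the full-dose scan (N_0A)
    """
    rng = np.random.default_rng(seed)
    # Quantum-noise standard deviation; grows with attenuation (exp(P_A))
    # and with the amount of dose reduction ((1 - a) / a).
    sigma = np.sqrt((1.0 - a) / a * np.exp(p_full) / n0)
    x = rng.standard_normal(np.shape(p_full))  # zero mean, unit variance
    return p_full + sigma * x
```

Setting a = 1 (no dose reduction) makes sigma vanish and returns the projections unchanged, which is a convenient sanity check.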

Qualitative reader study

Reader description
Two board certified radiologists at the Cantonal Hospital of Lucerne, with 7 and 8 years of experience, evaluated the images. The readers were asked to perform two tasks: an image quality assessment task and a lesion detection and segmentation task. A complete reading of all tasks took approximately four hours. The readings were collected on two consecutive days, with each reader conducting their readings on a different day.

IQ rating and volume selection task
Data: A dataset consisting of 10 abdominal CT scans was provided. This dataset originated from the same source described in section 2.1. LDCT volumes at 25% of the standard clinical dose were produced. Each of the CT scans was denoised using each of the interpretable and classical denoising methods discussed in section 2.1. This resulted in a total of 70 CT volumes for each of the readers to assess (the 5 denoised series, the LDCT volume, and the SDCT volume for each of the 10 patients).
Methodology: A customized software was used to display all the denoised volumes for a particular patient, along with the SDCT and LDCT images (see figure 1). The readers were asked to rate the quality of the displayed volumes with respect to four criteria: noise suppression, feature preservation, texture preservation, and overall image quality. The readers were asked to assess each of the displayed volumes on each of the given criteria using a five point Likert scale (1-5, with 1 being the lowest). The readers were allowed to scroll, change the windowing, and zoom in on the volumes. Ultimately, the readers were asked to choose their preferred volume for use in clinical practice.

Segmentation task
Data: A dataset of 10 annotated axial CT volumes, each containing one or more liver lesions, was selected from the AAPM-Mayo Clinic Low Dose CT Grand Challenge data set (McCollough 2016). 15 slices in the vicinity of the lesions were selected. The data was obtained using a SOMATOM Definition Flash (Siemens Healthineers, Forchheim, Germany) in single source mode, with CAREkV used to determine the appropriate tube potentials. Other acquisition details include a 64 × 0.6 mm collimation with z-axis flying focal spot, a pitch of 0.8, and a 0.5 s rotation time. The CT volumes were reconstructed using a medium smooth body kernel and weighted filtered backprojection (Stierstorfer et al 2004). The scans were contrast enhanced, acquired 70 s after contrast agent injection. The slice thickness was 5 mm with an increment of 3 mm. The FOV was limited to the patient size, with an image matrix of 512 × 512 (McCollough et al 2020). The CTDIvol for the SDCT volumes was 15.6 mGy. The LDCT volumes were denoised using the two interpretable methods, resulting in a total of 40 volumes. To avoid reader fatigue, the classical deep learning algorithms were not applied.
All lesions chosen were below 13 mm in diameter, with the smallest lesion being only 2.6 mm in diameter. The annotations were made by board certified subspecialist radiologists, who reviewed all cases along with the patients' medical records. All lesions chosen for this task were annotated as metastases.
Methodology: The volumes were displayed and presented in a random order using ITK-SNAP (Yushkevich et al 2016). No information about the patient, reconstruction data, or denoising algorithm was provided to the readers. The reader was asked to annotate the visualised lesions using ITK-SNAP's interactive segmentation tool. The reader was allowed to scroll through the slices, adjust the windowing, and annotate several lesions if multiple separate lesions were detected in the subvolume (see supplementary material). The Dice score was used as a quantitative metric to assess segmentation performance.
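For reference, the Dice score between a reader's mask and the ground-truth annotation reduces to a few numpy operations; this is the standard definition rather than a study-specific implementation (the convention for two empty masks is ours).

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / denom
```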

Radiomics study

Data: The test dataset generated as described in section 2.2.2 was used for the radiomics analysis.
Methodology: Segmentation, feature extraction, and data management were all performed using a dedicated prototype (MM Radiomics 1.3.0 (Wels et al 2019), Siemens Healthineers, Forchheim, Germany). A total of 93 features, as described in the PyRadiomics documentation (Van Griethuysen et al 2017), were used. The SDCT volume was used to segment the organs present in the volume. Segmentation was performed both manually (for the liver) and automatically (heart, spinal cord, right kidney), as illustrated in figure 2. Afterwards, radiomics features were computed for the segmented regions in all volumes. Each of these features is a statistical quantity extracted from the segmented organs (e.g. the average intensity value of all the voxels in a certain organ). For each of the 10 patients, the 93 features of the liver, kidney, spinal cord, and heart were extracted for each of the volumes (SDCT, LDCT, RLDN, GAN3D, JBFnet, CPCE3D, and REDCNN).
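To make "statistical quantity extracted from a segmented organ" concrete, a few first-order features of the kind PyRadiomics computes can be sketched as follows. This is illustrative only: the feature names follow the PyRadiomics first-order group, but the exact definitions there differ in binning and normalisation details.

```python
import numpy as np

def first_order_features(volume, mask, n_bins=32):
    """Compute a few first-order radiomic features for a masked ROI."""
    voxels = np.asarray(volume, dtype=float)[np.asarray(mask, dtype=bool)]
    # Discretise intensities for the histogram-based entropy feature.
    hist, _ = np.histogram(voxels, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is defined
    return {
        "Mean": voxels.mean(),
        "Variance": voxels.var(),
        "Skewness": ((voxels - voxels.mean()) ** 3).mean()
                    / (voxels.std() ** 3 + 1e-12),
        "Entropy": -(p * np.log2(p)).sum(),
    }
```

A perfectly uniform ROI, for example, has zero variance and zero entropy, which is why a homogeneous liver ROI isolates the effect of the denoiser on the noise statistics.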
An ideal denoising algorithm operating on LDCT volumes should yield volumes that possess radiomics feature values statistically similar to those of an SDCT volume. The following two tests were conducted to show statistical similarity and superiority:

• Liver Statistic Variability: For the liver ROI (figure 2, yellow box), we performed a statistical test based on univariate statistics. For this test, we chose a relatively homogeneous, manually drawn ROI, to observe the statistical features with the minimum possible anatomical variance. The chosen ROI contained purely soft tissue with no visible blood vessels, so that all statistical differences may be attributed to the denoising algorithm. All radiomic features were extracted from this ROI for all denoising algorithms and the SDCT volume.
For each denoising algorithm, a linear discrimination model was fitted for every individual feature with the goal of distinguishing the feature values that correspond to the SDCT volume and to the algorithm. We assumed that an LDCT volume processed by an ideal denoising algorithm would have no radiomic features that were significantly different from the same radiomic features extracted from an SDCT volume. If the differences for a feature were statistically significant, as assessed using an FDR Benjamini-Hochberg corrected P-value < 0.05, the feature was classified as significantly different. The number of significantly different features was counted and compared among the different algorithms and reconstructions.
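The Benjamini-Hochberg step used to flag significantly different features can be reproduced in a few lines of numpy; this is a generic step-up implementation, not the statistics package used in the study.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of p-values significant after FDR (BH) correction."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # BH step-up thresholds: alpha * rank / m for ranks 1..m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank passing its threshold
        significant[order[: k + 1]] = True
    return significant
```

Counting `benjamini_hochberg(pvals).sum()` over the 93 per-feature p-values of one algorithm yields the per-method counts reported in table 4.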
Figure 1. The user interface used for the IQA tasks. Seven volumes are loaded simultaneously and the reader is asked to enter ratings and choose a volume. The reader could scroll through the slices, switch a single volume to full screen, and alter the windowing as they wished. The order of the images was randomized. The algorithm used on each series can be seen in yellow. The images were viewed using a window of [−110, 190].
• Feature Ratio Similarity: We conducted a series of pairwise feature ratio comparisons to determine quantitative feature preservation across all segmented ROIs of heart, kidney, and spine. The average (mean) values of a given feature F for a given reconstruction and ROI were computed across all the patients. Subsequently, the following feature ratio R was computed for each denoising method x, based on the normalized mean absolute error:

R_x = |mean(F_x) − mean(F_SDCT)| / |mean(F_SDCT)|

where x is the chosen denoising algorithm and F the chosen feature. Lower feature ratio (R_x) values indicated greater similarity of the feature of the denoised volume to the same feature of the SDCT volume.
Consistently low feature ratio values across multiple features indicated higher statistical feature preservation. We computed pairwise comparisons between radiomics features extracted from all deep learning algorithms and the LDCT volumes. The number of features in which a deep learning algorithm had a lower feature ratio than the LDCT volumes was counted and tabulated for each organ.
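The ratio and the per-organ count described above can be written compactly. The sketch below takes per-feature means as plain lists and uses a small epsilon to guard against zero-valued reference features; both are our choices, not details from the study.

```python
def feature_ratio(mean_denoised, mean_sdct):
    """Normalized absolute difference of a feature mean vs its SDCT reference."""
    return abs(mean_denoised - mean_sdct) / (abs(mean_sdct) + 1e-12)

def count_improved(denoised_means, ldct_means, sdct_means):
    """Count features where a denoising method is closer to SDCT than LDCT is."""
    return sum(
        feature_ratio(d, s) < feature_ratio(l, s)
        for d, l, s in zip(denoised_means, ldct_means, sdct_means)
    )
```

Running `count_improved` per organ and summing over heart, kidney, and spine gives the totals reported in table 5.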

IQ rating
The highest quality scores in each category were achieved by the SDCT volumes (table 1, figure 3). Each of the readers had a different preference for their second highest scored algorithm. Reader 1 preferred JBFnet as their second highest scored algorithm for all four criteria. Reader 2 preferred GAN3D as their second highest scored algorithm for all criteria except texture preservation, for which CPCE3D was rated higher. In all categories, both readers assigned higher quality scores to the deep learning algorithms than to the LDCT volumes. When presented with all volumes, both readers overwhelmingly chose the SDCT images as their preferred volume for clinical practice: Reader 1 chose the SDCT image in 9 out of 10 cases, and Reader 2 chose the SDCT image in all cases.
JBFnet showed statistically similar performance to GAN3D on all criteria, outperformed CPCE3D in noise suppression, and outperformed REDCNN on all criteria except texture preservation. JBFnet was not statistically outperformed on any criterion by any of the classical deep learning methods. RLDN was statistically significantly outperformed on all criteria by GAN3D, and on overall quality and texture preservation by CPCE3D; however, it showed statistically similar overall quality and feature preservation to REDCNN (table 2).

Liver statistic variability
LDCT features were significantly different from SDCT features for 71 out of 93 statistical features (table 4). CPCE3D was significantly different from SDCT for 28 out of 93 features. JBFnet, RLDN, and GAN3D showed few radiomic features that were significantly different from SDCT features (2 out of 93). REDCNN had 65 out of 93 significantly different features compared to SDCT features.

Feature ratio similarity
For the majority of radiomics features, the deep learning methods were successful in restoring radiomic features to the values extracted from SDCT volumes (table 5). However, no deep learning based method had lower feature ratios than the LDCT volumes on all 93 features across all three organs. The highest number of improved feature ratios was 270 (RLDN), with the lowest being 253 (REDCNN). JBFnet was outperformed by CPCE3D and GAN3D, but outperformed REDCNN.

Discussion
Both readers agreed that the quality of the SDCT volumes was better than the quality of any of the denoised volumes on all assessed criteria. Hence, at this moment, none of the denoised volumes can replace SDCT volumes. This was reinforced by the results of the volume selection task, where both readers almost unanimously selected the SDCT volumes as their volumes of choice. Reader 1 chose JBFnet as his second choice method, and Reader 2 chose GAN3D. Interestingly, GAN3D and JBFnet have statistically similar performance on all criteria. Reader 2 chose CPCE3D as his second choice network for texture preservation, a criterion on which CPCE3D significantly outperformed RLDN while performing statistically similarly to JBFnet. This shows that interpretable methods achieve statistically similar IQA results to classical deep learning methods.

Both readers were able to segment the lesions with comparable accuracy in denoised volumes, LDCT volumes, and SDCT volumes. This implies that the signal detectability in the cases we studied was not reduced by noise, and also that the denoising models did not reduce the detectability of the lesions. Li et al (2021) had reported that deeper neural network models caused degradation in signal detectability. It is possible that interpretable approaches do not suffer from depth based signal degradation, and the results of the detection/segmentation study are an indication of this. An alternative explanation is that the image features which aided in the detection of the tumours were not significantly affected by image noise at all. The design of our study considers tumour detection a proxy for signal detectability in a low contrast task. This could be a false assumption, which would imply that the ability to segment lesions is unconnected to image quality. This is partially evidenced by the fact that the overall image quality seems to have no correlation with the detection performance.
The radiomic analysis indicated that volumes denoised by deep learning algorithms contain similar statistical features to SDCT volumes, and show significant improvement compared to the same statistical features extracted from LDCT volumes. Interpretable approaches are particularly good at conserving statistical information in comparison to classical deep learning approaches; only one out of the three classical methods provided comparable performance in the statistical feature similarity task. This study used interpretable deep learning methods with the rationale that such methods would reduce the possibility of hallucinations. Although the readers were not explicitly asked to check for hallucinations, none of the readers noted any hallucinations during any of the studies. It should be noted, however, that this study focused on the impact of noise and noise removal, not on hallucination detection. This would imply that hallucinations were either not present, or so minor that they were of no clinical importance.
The study is of a preliminary nature and had several limitations. Firstly, a larger number of readers would be required to provide statistically significant results. Secondly, data was solely acquired at a single simulated low dose (25% dose). Denoising of volumes acquired at a higher dose, such as 50% or 75% of the standard clinical dose, could have yielded images of higher quality, albeit with a lower potential for dose reduction. Additionally, our networks were trained on data acquired at only two doses, 25% and 50% of the standard clinical dose; introducing more dose levels during training might have improved denoising performance. Finally, the study did not account for any acquisition or injection parameters, and considered the volume simply as a digital object. The study should be repeated with multiple different sets of parameters.

Figure 2 .
Figure 2. Example 2D view of 3D segmentations for liver, kidney, and spinal cord for radiomic analysis. Manual segmentations were used for analysing the liver ROI. The kidney and spinal cord were segmented automatically.

Figure 3 .
Figure 3. Graph representing the measured quality scores. The blue bars represent the mean readings from R1 and the orange bars represent the mean readings from R2. The error bars represent the standard deviations.

Figure 4 .
Figure 4. Graph representing the Dice indices of the segmented lesions. The medians are represented by the yellow lines. The whiskers show a range of 1.5 times the interquartile range.

Table 1 .
Mean and standard deviation of the measured quality scores. Both readers rated the SDCT volumes as having the highest quality scores for each criterion. As second choice, R1 preferred JBFnet, whereas R2 preferred GAN3D and CPCE3D (texture preservation only).

Table 2 .
P-values for statistical t-tests comparing pairs of readings for each criterion. A P-value under 0.05 indicates a statistically significant difference. Pairs with statistical differences are marked by *.

Table 3 .
Median and 1.5 times the interquartile range of the Dice indices of the segmented lesions. Both readers had the highest median score with RLDN.

Table 4 .
Number of features which are statistically dissimilar to the same feature extracted from an SDCT volume. The methods with the lowest number of dissimilar features are JBFnet, RLDN, and GAN3D.

Table 5 .
Number of feature ratios, out of the 93 features, which were lower for each deep learning method compared to the LDCT volumes. All deep learning methods have vastly better radiomic feature similarity to SDCT volumes compared to LDCT volumes.