Generation model meets swin transformer for unsupervised low-dose CT reconstruction

Computed tomography (CT) has evolved into an indispensable tool for clinical diagnosis. Reducing the radiation dose minimizes adverse effects on patients but may introduce noise and artifacts into the reconstructed images, affecting physicians' diagnoses. To address the training instability of deep networks, researchers have explored diffusion models. Given the scarcity of clinical data, we propose the unsupervised image domain score generation model (UISG) for low-dose CT reconstruction. During training, normal-dose CT images are utilized as network inputs to train a score-based generative model that captures the prior distribution of CT images. In the iterative reconstruction, the initial CT image is obtained using a filtered back-projection algorithm. Subsequently, diffusion-based prior, high-frequency convolutional sparse coding prior, and data-consistency steps are employed to obtain a high-quality reconstructed image. Given the global characteristics of the noise, the score network of the diffusion model adopts a swin transformer structure to enhance the model's ability to capture long-range dependencies. Furthermore, convolutional sparse coding is applied exclusively to the high-frequency components of the image to prevent over-smoothing and the loss of crucial anatomical details during denoising. Quantitative and qualitative results indicate that UISG outperforms competing methods in terms of denoising and generalization performance.


Introduction
X-ray computed tomography (CT) is capable of capturing cross-sectional images with significant anatomical detail. It has become an indispensable tool in modern medicine due to its wide-ranging applications, rapid imaging capabilities, and excellent reproducibility. However, high radiation doses may pose potential health risks in clinical examinations [1,2]. Low-dose CT (LDCT) has significant potential for reducing patients' exposure to radiation risks, but it also faces challenges related to the deterioration of image quality. Given the potential and challenges of LDCT, an increasing number of researchers are exploring methods for high-quality CT image reconstruction.
Over the past decades, a wide range of traditional methods has been studied extensively to enhance the quality of LDCT images. These methods encompass sinogram filtering [3,4], image post-processing [5][6][7][8], and model-based iterative reconstruction [9][10][11][12]. Sinogram filtering methods reprocess the sinogram data and then map the processed data back to CT images to reduce noise and improve image quality. However, this approach can introduce inconsistencies, leading to secondary artifacts in the reconstructed images. Image post-processing methods, on the other hand, apply image processing techniques to denoise the LDCT reconstructed images; their effectiveness is limited because the noise distribution in the image domain is relatively complex. In contrast, iterative reconstruction methods simultaneously consider information in both the sinogram and the image domain, yielding better image quality. Nevertheless, iterative methods come with higher complexity and time costs, and their performance relies heavily on manually crafted regularizers and hyperparameters. Overall, traditional approaches have a limited impact on LDCT imaging, potentially constraining their clinical applicability.
With the rapid advancement of deep learning (DL) technology [13], deep neural networks have emerged as a powerful tool for LDCT reconstruction, with remarkable achievements. Convolutional neural networks (CNNs) were the most commonly used approach for LDCT reconstruction; representative examples include the residual encoder-decoder convolutional neural network (REDCNN) [14], FBPConvNet [15], and the DenseNet deconvolution network (DD-Net) [16]. To enhance the visual quality of denoised images, generative adversarial networks (GANs) have also been applied to LDCT reconstruction [17][18][19][20][21]. Due to the adversarial nature of GANs, their training can be challenging, requiring carefully designed optimization methods and network architectures to ensure stable convergence [21]. Furthermore, transformer-based networks have improved LDCT reconstruction performance by leveraging their ability to capture long-range dependencies and spatial correlations [22][23][24][25][26]. A popular line of work combines DL-based image processing with iterative image reconstruction to enhance LDCT imaging performance [27][28][29][30]. While these methods have demonstrated some capability in medical image reconstruction tasks, many focus on optimizing network structures or designing case-specific loss functions. Typically, DL-based methods learn an end-to-end mapping from low-dose to normal-dose CT (NDCT) in a supervised manner. It is worth noting that the success of supervised DL algorithms relies heavily on large amounts of paired training data, and specialized models need to be retrained when the measurement process changes, limiting their generality and flexibility in clinical applications. Therefore, developing high-performance methods that leverage deep neural networks without extensive labeled data is of paramount importance.
Recently, diffusion models [31][32][33][34] have garnered significant attention due to their excellent image generation performance. Among them, score-based generative models (SGMs) [34] have gained prominence for their exceptional ability to sample accurately from complex distributions. The capabilities of SGMs extend seamlessly to various applications in medical image reconstruction [35]. Huang et al [36] proposed a fully unsupervised one-sample diffusion model in the projection domain (OSDM) for low-dose CT reconstruction. Li et al [37] proposed a low-rank tensor-assisted k-space generative model (LR-KGM) for parallel imaging reconstruction. Li et al [37] introduced a patch-based denoising diffusion probabilistic model (DDPM) for sparse-view CT reconstruction; the network does not require paired fully-sampled and down-sampled data, thus enabling unsupervised learning. Inspired by cold diffusion, He et al [38] presented CoreDiff, a contextual error-modulated generalized diffusion model for LDCT denoising that reduces the number of sampling steps. SGMs, built on the theoretical framework of stochastic differential equations (SDEs), combine forward and backward diffusion processes to accomplish image generation without the complexity of adversarial training. Unlike traditional generative models such as GANs [18,19,39] or variational autoencoders (VAEs) [39], SGMs directly model the data probability distribution instead of learning a latent space. Compared with GANs, SGMs abandon adversarial training, offering better interpretability and lower training complexity.
This paper proposes a novel unsupervised image domain score generation (UISG) model that aims to combine the advantages of traditional iterative methods and deep generative models into a robust solution for LDCT reconstruction. The optimization strategy of model-based iterative reconstruction is integrated into the diffusion sampling step, achieving a seamless fusion of data-driven and traditional priors. The score network, termed 'TransDiff', replaces the U-Net backbone with a swin transformer structure to capture global dependencies effectively, providing global information to the noise predictor. Additionally, we introduce a high-frequency convolutional sparse coding (CSC) step, which decomposes the CT image into low-frequency and high-frequency components and applies CSC priors only to the high-frequency part, better restoring image texture and detail while preventing excessive smoothing. Our method is entirely unsupervised and demonstrates strong generalization: even when the projection dose changes, no additional retraining on low-dose/normal-dose CT images is needed.
The main contributions of this work can be summarized as follows: (1) We propose an entirely unsupervised LDCT reconstruction algorithm based on diffusion priors. UISG demonstrates powerful reconstruction and generalization capabilities for LDCT images. Experimental validation on the NIH-AAPM-Mayo CT dataset, a piglet dataset, and a real walnut dataset confirms the effectiveness and applicability of the proposed method.
(2) The score network 'TransDiff' of the UISG model is equipped with a swin transformer architecture, endowing it with the capability to model long-range dependencies. This enables more effective learning of global prior knowledge of images, facilitating better modeling of the globally distributed noise in LDCT images.
(3) By incorporating CSC exclusively in the high-frequency part of the image, the algorithm can concentrate on local features without excessively affecting the overall structure. Excluding the low-frequency part also improves the efficiency of the algorithm.
The remainder of the manuscript is structured as follows: the second section furnishes background on LDCT imaging models and SGMs, elucidates the proposed prior learning approach, and delves into implementation details. The third section delineates the experimental setup and presents the results. Lastly, the fourth section offers concluding remarks.

LDCT imaging model
In CT imaging, the Poisson noise model is employed to approximate the physical effects of measurement data [40]. Assuming the use of a monochromatic source, the projection measurements from a CT scan are assumed to follow a Poisson distribution and can be represented as

P_z ∼ Poisson{b_z exp(−l_z)} + Normal(0, E_z), z = 1, 2, …, Z, (1)

where P_z is the number of transmitted photons, b_z denotes the x-ray source intensity for the zth ray, E_z signifies the background electronic noise, l_z represents the line integral of the attenuation coefficient along the zth ray path, and Z corresponds to the total number of x-ray paths. The LDCT image can be approximated and modeled with an additive noise term n:

y_n = y + n, (2)

where y_n ∈ R^{h×w} of size h × w stands for the LDCT image, regarded as the NDCT image y ∈ R^{h×w} with the addition of the noise component n. The inverse problem is equivalent to reconstructing y from y_n. Note that the LDCT images here are all initial images reconstructed by filtered back projection (FBP), i.e. y_n = FBP(x_n), where x_n denotes the acquired low-dose projection.
To mitigate ill-conditioning, the reconstruction of the CT image is formulated as a constrained optimization problem:

y* = argmin_y ∥y − y_n∥_2^2 + λ R(y), (3)

where ∥y − y_n∥_2^2 is the data consistency term, which enforces consistency between the expected NDCT y ∈ R^{h×w} and the LDCT y_n ∈ R^{h×w}, R(y) stands for the regularization term, and λ is employed to balance data consistency and regularization.
To solve equation (3), various forms of prior knowledge are integrated into the regularization term to obtain a stable, high-quality solution, for instance sparsity-promoting regularization derived from compressed sensing theory [41], such as wavelets [42], total variation [6], and sparse coding [43]. Existing supervised end-to-end learning methods inherently rely on discriminative approaches to learn implicit priors, which can lack flexibility and robustness. In this study, we instead construct an explicit prior through an SGM, and achieve CT image restoration by alternating between SGM solver, ADMM, and data consistency steps.

Score-based generative model with SDE
Score-based SDEs have garnered significant attention due to their remarkable success in generating realistic and diverse image samples, opening new possibilities in image generation [35]. The idea is to perturb the data by injecting Gaussian noise at different scales to obtain a tractable distribution, and to generate data from noise by reversing the forward process [44]. More specifically, consider a diffusion process {y(t)}_{t=0}^{T} with y(t) ∈ R^{h×w}, where t ∈ [0, T] is a continuous time variable and h × w represents the dimensionality of the CT image. We have a dataset of independently and identically distributed samples (y(0) ∼ p_0), along with a tractable distribution from which samples can be generated efficiently (y(T) ∼ p_T), where p_0 and p_T correspond to the data distribution and the prior distribution, respectively. The forward process can be simulated as the solution to the following SDE:

dy = f(y, t) dt + g(t) dw, (4)

where f(y, t) and g(t) represent the drift and diffusion coefficients, respectively, and w ∈ R^{h×w} stands for standard Brownian motion. According to Song et al [44], variance exploding (VE) SDEs typically lead to higher sample quality, so we choose f(y, t) = 0 and g(t) = √(d[σ²(t)]/dt), where σ(t) > 0 is a monotonically increasing function.
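As a concrete illustration of the VE forward process, the following numpy sketch perturbs an image with the kernel N(y(0), σ(t)²I) under an assumed geometric schedule σ(t) = σ_min(σ_max/σ_min)^t; the schedule bounds here are illustrative, not the paper's settings:

```python
import numpy as np

sigma_min, sigma_max = 0.01, 50.0   # illustrative VE noise scales, not the paper's values

def sigma(t):
    """Geometric noise schedule sigma(t), monotonically increasing on [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t

def perturb(y0, t, rng):
    """Sample y(t) from the VE perturbation kernel N(y(0), sigma(t)^2 I)."""
    return y0 + sigma(t) * rng.standard_normal(y0.shape)

rng = np.random.default_rng(0)
y0 = np.zeros((512, 512))            # stands in for a clean NDCT image
yt = perturb(y0, 0.5, rng)           # heavily perturbed sample at t = 0.5
```

At t = 0 the sample is essentially the clean image; at t = 1 it is dominated by noise of scale σ_max, which is what makes the terminal distribution tractable.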
Based on the reversibility of the SDE, the reverse of the process in equation (4) can be expressed as another diffusion process:

dy = [f(y, t) − g(t)² ∇_y log p_t(y)] dt + g(t) dw̄, (5)

where w̄ is the standard Wiener process when time flows backward and dt is an infinitesimal negative time step. We can train a time-dependent score network S_θy(y, t) to estimate the score function ∇_y log p_t(y), i.e. S_θy(y, t) ≃ ∇_y log p_t(y), which can then be used in numerical solutions of the backward SDE.
In our approach, we employ 'TransDiff' to train S_θy(y, t); its specific architecture is depicted in figure 1(B). Formally, we optimize the parameters θ_y of the score network with the following cost function:

θ*_y = argmin_{θy} E_t { λ(t) E_{y(0)} E_{y(t)|y(0)} [ ∥S_θy(y(t), t) − ∇_{y(t)} log p_{0t}(y(t) | y(0))∥_2^2 ] }, (6)

where λ(t) is a positive weighting function and p_{0t}(y(t) | y(0)) is the Gaussian perturbation kernel centered at y(0).
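Since the perturbation kernel is Gaussian, its score has the closed form −(y(t) − y(0))/σ(t)², so the cost function reduces to a denoising score matching loss. A minimal numpy sketch follows; `score_fn` stands in for 'TransDiff', and the σ²-weighting is one common choice of λ(t), not necessarily the paper's:

```python
import numpy as np

def dsm_loss(score_fn, y0, sigma_t, rng):
    """Denoising score matching loss at noise level sigma_t.

    The target is the analytic score of the Gaussian perturbation kernel
    N(y0, sigma_t^2 I), i.e. -(y(t) - y0) / sigma_t^2; the sigma_t^2
    weighting is one common choice for lambda(t)."""
    z = rng.standard_normal(y0.shape)
    yt = y0 + sigma_t * z
    target = -(yt - y0) / sigma_t ** 2
    residual = score_fn(yt, sigma_t) - target
    return sigma_t ** 2 * np.mean(residual ** 2)

rng = np.random.default_rng(0)
y0 = rng.standard_normal((64, 64))          # stands in for a training image
zero_score = lambda y, s: np.zeros_like(y)  # placeholder for the score network
loss = dsm_loss(zero_score, y0, 0.5, rng)   # for a zero network this is ~E[z^2] = 1
```

In practice the loss is averaged over random t (equivalently, random σ) and over training batches.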

Score network 'TransDiff'
Transformer-based network architectures have demonstrated competitive performance in various generative models [45][46][47]. Inspired by this, we construct a score network named 'TransDiff', which uses swin transformer layers as its backbone, aiming to enhance feature extraction and achieve self-attention from local to global scales. 'TransDiff' is a hybrid hierarchical architecture with a U-shaped encoder and decoder. In the encoder, the input image is first segmented into non-overlapping patches of size 8 × 8. The feature dimensions are then projected to an arbitrary dimension C (C = 512) through a linear embedding layer. The transformed patches generate hierarchical feature representations through multiple swin transformer blocks (STBs) and patch merge layers. The patch merge layer handles 2× down-sampling, while the STBs perform deep feature representation learning at constant resolution and feature dimension.
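The patch partition and patch merge operations are essentially reshapes; a numpy sketch under assumed sizes (8 × 8 patches on a 512 × 512 single-channel image, and 96-dim tokens for the merge example) illustrates how resolution halves while channels quadruple before the usual linear reduction:

```python
import numpy as np

def patch_partition(x, p=8):
    """Split an (H, W, C) image into non-overlapping p x p patches, giving
    (H//p * W//p) tokens of dimension p*p*C, as in the encoder's first stage."""
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * C)

def patch_merge(tokens, H, W):
    """2x down-sampling: concatenate each 2x2 token neighbourhood, halving the
    resolution and quadrupling channels (before the usual linear reduction)."""
    C = tokens.shape[-1]
    x = tokens.reshape(H, W, C)
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    return x.reshape((H // 2) * (W // 2), 4 * C)

tokens = patch_partition(np.zeros((512, 512, 1)), p=8)  # 64 x 64 = 4096 tokens
merged = patch_merge(np.zeros((4096, 96)), 64, 64)      # 32 x 32 tokens, 4 * 96 channels
```

The linear embedding, window attention, and channel-reducing projections are learned layers omitted here.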
The decoder comprises multiple STBs and patch expansion layers. In contrast to the patch merge layer, the patch expansion layer performs up-sampling. Furthermore, a multi-scale skip connection scheme guides the decoder's up-sampling, alleviating the detail loss caused by up-sampling and mitigating issues such as vanishing gradients and overfitting. Finally, the output image is obtained through a linear projection layer.
The bottleneck of 'TransDiff' uses two cascaded STBs for feature learning without altering the dimensions or resolution of the feature map. Additionally, skip connections are an essential component of the encoder-decoder network: they fuse the extracted contextual features with the multi-scale features of the encoder to compensate for the spatial information loss caused by down-sampling. Further details of the score network 'TransDiff' are provided in figure 1(B).

Prior learning reconstruction of UISG
In the reconstruction stage, the acquired projection data are first reconstructed into LDCT images using the FBP algorithm, expressed as y_n = FBP(x_n). Once the score network S_θy(y, t) is trained, we incorporate it into equation (6) and solve the resulting inverse SDE to reconstruct the CT image. During the conditional generation process, the PC sampling [44], high-frequency CSC, and data consistency steps alternate to obtain the final high-quality reconstructed image. The LDCT reconstruction process of UISG is illustrated in figure 1. In the following, we describe the solution of these three subproblems in turn.
Step 1: PC sampler. To obtain richer information and higher-quality results, the SGM is utilized to estimate the prior distribution of the CT images. Specifically, it considers continuous distribution changes over time rather than simply adding a finite amount of noise. By solving the inverse SDE, we can transform random noise into sampled data, thus more accurately simulating the evolution of the data. Drawing inspiration from the work of Song et al [44], we employ the PC sampling method. In this context, the 'predictor' refers to the numerical solver for the inverse SDE, which estimates the sample at the next time step from the current one:

y^i = y^{i+1} + (σ_{i+1}^2 − σ_i^2) [∇_y log p_t(y) + ∇_y log p_t(y_n | y) + ∇_y log p_t(y_HFCSC)] + √(σ_{i+1}^2 − σ_i^2) z,

where i = I − 1, …, 0 indexes the discrete steps of the reverse-time SDE. We set z ∼ N(0, 1), y(0) ∼ p_0, and σ(0) = 0. The term ∇_y log p_t(y) is determined by the prior model, representing pre-existing information about the true model parameters; ∇_y log p_t(y_n | y) is derived from the data consistency term, and ∇_y log p_t(y_HFCSC) from the high-frequency CSC term.
A corrector is then used to obtain a more efficient and robust iterate. Annealed Langevin dynamics serves as the corrector to refine the solution of the numerical SDE solver:

y^{i,j} = y^{i,j−1} + ε_i S_θy(y^{i,j−1}, σ_i) + √(2ε_i) z,

where j = 1, 2, …, J indexes the correction steps and ε_i > 0 is the step size.
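Putting predictor and corrector together, a toy numpy sketch of PC sampling for a VE-SDE is given below. The analytic score of N(0, 1) data stands in for the trained network, the step-size rule is an illustrative SNR-based choice, and the conditioning terms (data consistency and high-frequency CSC) are omitted for brevity:

```python
import numpy as np

def pc_sample(score_fn, sigmas, shape, n_corrector=2, snr=0.16, rng=None):
    """Predictor-corrector sampling for a VE-SDE: one reverse-diffusion
    predictor step per noise level, followed by annealed Langevin correctors."""
    rng = rng or np.random.default_rng(0)
    y = sigmas[0] * rng.standard_normal(shape)        # y(T) ~ N(0, sigma_max^2 I)
    for i in range(len(sigmas) - 1):
        s_cur, s_next = sigmas[i], sigmas[i + 1]
        d = s_cur ** 2 - s_next ** 2
        # Predictor: discretized reverse diffusion
        y = y + d * score_fn(y, s_cur) + np.sqrt(d) * rng.standard_normal(shape)
        # Corrector: annealed Langevin dynamics at the new noise level
        for _ in range(n_corrector):
            eps = 2 * (snr * s_next) ** 2             # illustrative step-size rule
            y = y + eps * score_fn(y, s_next) + np.sqrt(2 * eps) * rng.standard_normal(shape)
    return y

# Sanity check: analytic score of N(0, 1) data observed at noise level s
toy_score = lambda y, s: -y / (1.0 + s ** 2)
sigmas = np.geomspace(50.0, 0.01, 500)
samples = pc_sample(toy_score, sigmas, (10000,))      # should approach N(0, 1)
```

In UISG the toy score would be replaced by 'TransDiff', and the CSC and data-consistency updates would be interleaved into the two loops as described below.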
Step 2: High-frequency CSC step. Following the sampling process, the high-frequency CSC step is applied to the updated reconstructed image. First, Gaussian filtering decomposes the CT image into low-frequency and high-frequency components; this enhances the saliency of details in the CT image, making further processing of the high-frequency part more advantageous. We then process the high-frequency components using CSC. High-frequency information typically contains the crucial details and structural information of the image, and adopting CSC in this step helps accurately preserve and enhance these subtle structures. Finally, the processed high-frequency components are recombined with the low-frequency components to form the output CT image. This approach balances sensitivity to local features against preservation of the overall structure, thereby improving the quality of CT image restoration. Compared with applying CSC directly to the entire CT image, this decomposed, separately processed scheme brings evident benefits in enhancing image quality, reducing computational cost, and better adapting to local features. The high-frequency CSC step can be expressed as follows:

y_l = Gaussian(y), y_h = y − y_l,

min_{Z_k} (1/2) ∥y_h − Σ_{k=1}^{K} D_k ⊗ Z_k∥_2^2 + γ Σ_{k=1}^{K} ∥Z_k∥_1, (11)

y_HFCSC = y_h + y_l,

where Gaussian(·) represents the Gaussian filter, y_l denotes the low-frequency component of the CT image, y_h the high-frequency component, and y_HFCSC the CT image after the high-frequency CSC step. ⊗ represents the convolution operator, {Z_k}_{k=1,2,…,K} the sparse codes, and {D_k}_{k=1,2,…,K} a set of filters. We employ an external fully sampled CT dataset and train the filters D_k using the method proposed in [48]. ADMM [49] is used, and a dual variable B constrained to equal Z_k is introduced to solve equation (11), resulting in the following problem:

min_{Z_k, B_k} (1/2) ∥y_h − Σ_k D_k ⊗ Z_k∥_2^2 + γ Σ_k ∥B_k∥_1 + (ζ/2) Σ_k ∥Z_k − B_k + C_k∥_2^2, (12)

where C is the scaled Lagrange multiplier, and γ and ζ are two balance parameters.
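The frequency decomposition at the start of this step can be sketched with a separable Gaussian low-pass filter; by construction y_l + y_h reproduces the input exactly (filter width and image size are illustrative):

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at 3 sigma."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_filter2d(img, sigma):
    """Separable Gaussian low-pass filter with reflect padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    pad = np.pad(img, r, mode="reflect")
    rows = np.apply_along_axis(lambda m: np.convolve(m, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda m: np.convolve(m, k, mode="valid"), 0, rows)

def split_frequencies(y, sigma=2.0):
    """Decompose an image as y = y_l + y_h with y_l = Gaussian(y)."""
    y_l = gaussian_filter2d(y, sigma)
    return y_l, y - y_l

y = np.random.default_rng(0).standard_normal((32, 32))   # stands in for a CT slice
y_l, y_h = split_frequencies(y)
```

The CSC/ADMM solve would then operate on y_h only, and the filtered low-frequency part y_l is added back unchanged.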
Step 3: Data consistency step. Incorporating the statistical characteristics of the CT measurements into the objective function improves noise robustness [46]. We therefore adopt a statistical approach that uses the penalized weighted least squares (PWLS) method to find the optimal image from the noisy image. Introducing PWLS into the regularized objective function gives:

y* = argmin_y (y − y_n)^T W (y − y_n) + λ ∥y − y_HFCSC∥_2^2, (17)

where the hyperparameter λ trades off PWLS against the CSC prior, and W is a diagonal weighting matrix. We solve equation (17) using the separable parabolic substitution algorithm; since W is diagonal, the update decouples elementwise:

y = (W + λI)^{−1} (W y_n + λ y_HFCSC). (18)

Taking these considerations together, a detailed description of the training and reconstruction process of UISG is provided in algorithm 1. The entire iterative reconstruction process of UISG consists of a two-level loop: an outer loop and an inner loop, which perform the predictor and the corrector, respectively. Simultaneously, the prior data term and the data consistency step are incorporated into the outer and inner loops, respectively. The iterative steps of algorithm 1 read:

5. y_l = Gaussian(y)
6. y_h = y − y_l
7. Update Z_k^i, B^i, C^i, y_h^i via equations (13)-(16)
8. y_HFCSC = y_h + y_l
9. Update y^i via equation (18)
10. For j = 1 to J do (inner loop)
11.   y^{i,j} ← Corrector(y^{i,j−1}, σ_i, ε_i)
12.   Repeat steps 5 to 9
13. End for
14. End for
15. Return ỹ
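Assuming the PWLS term with diagonal W is paired with a quadratic proximity term to the CSC-refined image (a simplification of the paper's equation (17)), the data-consistency update has an elementwise closed form. A hedged numpy sketch, with placeholder weights and hyperparameters:

```python
import numpy as np

def pwls_update(y_n, y_prior, w, lam):
    """Elementwise PWLS data-consistency update for diagonal weights w:

        argmin_y  (y - y_n)' W (y - y_n) + lam * ||y - y_prior||_2^2
                = (w * y_n + lam * y_prior) / (w + lam)

    Here y_prior plays the role of the CSC-refined image y_HFCSC."""
    return (w * y_n + lam * y_prior) / (w + lam)

y_n = np.array([1.0, 2.0, 3.0])        # noisy image (toy values)
y_prior = np.zeros(3)                  # prior image (toy values)
w = np.ones(3)                         # diagonal statistical weights (toy values)
y = pwls_update(y_n, y_prior, w, lam=1.0)
```

Larger weights w pull the result toward the measured image, while larger λ pulls it toward the prior image, which is exactly the trade-off the λ hyperparameter controls.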

Experiment
To assess the performance of the proposed UISG, we compared it with FBP [1], REDCNN [14], MAGIC [50], and DU-GAN [19]. FBP is a classical analytical reconstruction technique based on a ramp filter. REDCNN is a deep learning algorithm based on image post-processing. MAGIC is a convolutional network combining manifolds and graphs for LDCT reconstruction. DU-GAN is a U-Net-based denoising generative adversarial network that simultaneously learns global and local differences in the image and gradient domains. The parameters of the compared methods were set following their original papers. For quantitative evaluation, we employ the root mean square error (RMSE) and the structural similarity index (SSIM) to measure, respectively, the difference and the similarity between the reconstructed image and the reference image. Additionally, we use an objective image quality assessment (IQA) metric, visual information fidelity (VIF), which agrees better with radiologists' subjective assessment of medical images.
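For reference, RMSE and a simplified (global, non-windowed) SSIM can be computed as below; library implementations such as skimage's `structural_similarity` use a sliding window, so values will differ slightly from this sketch:

```python
import numpy as np

def rmse(x, ref):
    """Root mean square error between a reconstruction and its reference."""
    return float(np.sqrt(np.mean((x - ref) ** 2)))

def global_ssim(x, ref, data_range=1.0):
    """Simplified global SSIM: a single window covering the whole image,
    with the standard stabilizing constants c1 and c2."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, mr = x.mean(), ref.mean()
    vx, vr = x.var(), ref.var()
    cov = ((x - mx) * (ref - mr)).mean()
    return ((2 * mx * mr + c1) * (2 * cov + c2)) / ((mx ** 2 + mr ** 2 + c1) * (vx + vr + c2))
```

Identical images give RMSE 0 and SSIM 1; VIF requires a multi-scale wavelet decomposition and is best taken from an IQA library rather than re-implemented.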

NIH-AAPM-Mayo CT dataset
To evaluate the performance of UISG, simulated data were generated from the publicly available '2016 NIH-AAPM-Mayo Clinic Low-Dose CT Grand Challenge' dataset [51]. This dataset comprises 2D slices of full-dose abdominal CT scans from 10 anonymous patients acquired at 120 kV and 200 mAs. In our experiments, 4825 NDCT images were randomly selected as the training set, while 100 images were used for testing. All images were reconstructed with a slice thickness of 1.0 mm, and each slice had dimensions of 512 × 512 pixels. Artifact-free images generated from the normal-dose projection data with the FBP algorithm served as the ground truth. Fan-beam projection data were generated using the Siddon ray-driven algorithm, with x-ray source-to-detector and source-to-rotation-center distances set to 1270.00 mm and 870.00 mm, respectively. The detector width was 4130.00 mm, consisting of 600 detector elements. In this study, 600 views of forward projections were collected at angular intervals of 0.6°.
To obtain projections at different dose levels, the well-recognized 'Poisson + Gaussian' noise model is used to generate low-dose projections:

N_z = Poisson{b_z exp(−x̄_z)} + Normal(0, σ_E²), x_{n,z} = −ln(N_z / b_z),

where b_z is the number of incident x-ray photons, set to b_z = 1 × 10⁶ for the normal dose, x̄_z denotes the noise-free line integral along the zth ray, and σ_E² is the variance of the electronic noise caused by equipment measurement error, fixed to σ_E² = 10. In the experiment, three low-dose levels of 25%, 10%, and 5% were simulated, corresponding to b_z = 2.5 × 10⁵, b_z = 1 × 10⁵, and b_z = 5 × 10⁴, respectively.
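The 'Poisson + Gaussian' simulation above can be sketched in a few lines of numpy; the flat-sinogram input is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def low_dose_projection(l_clean, b, sigma_e2=10.0):
    """'Poisson + Gaussian' low-dose simulation: photon statistics plus
    zero-mean electronic noise, then a log transform back to line integrals."""
    counts = rng.poisson(b * np.exp(-l_clean)).astype(float)
    counts += rng.normal(0.0, np.sqrt(sigma_e2), size=l_clean.shape)
    counts = np.clip(counts, 1.0, None)       # guard the logarithm
    return -np.log(counts / b)

l = np.full((600, 600), 1.5)                  # toy sinogram: 600 views x 600 bins
x_low = low_dose_projection(l, b=1e5)         # b = 1e5 matches the 10% dose level
```

Lowering b increases the relative Poisson noise in the counts, which is what degrades the FBP reconstruction at the 25%, 10%, and 5% dose levels.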

Piglet dataset
We tested the piglet dataset [52] using the network trained on the NIH-AAPM-Mayo CT dataset. The piglet dataset consists of CT scans of deceased piglets obtained on a GE scanner (Discovery CT750 HD) with a source potential of 100 kVp and a slice thickness of 0.625 mm. These scans were acquired at different dose levels by adjusting the tube current (or voltage) during scanning. Four types of LDCT images were generated, with 300 mAs serving as the normal dose and the others representing LDCT (tube current reduced to 50%, 25%, 10%, and 5%). Each slice had dimensions of 512 × 512 pixels.

Implementation details
All experiments were conducted on a PC equipped with a 48 GB NVIDIA RTX A6000 GPU, an Intel(R) Xeon(R) Silver 4216 CPU @ 2.10 GHz, and 64 GB of RAM. Model training and evaluation were implemented in Python using the Operator Discretization Library and PyTorch. We employed the Adam optimizer (β_1 = 0.9, β_2 = 0.999) [53] with a fixed learning rate of 1 × 10⁻³ and initialized the network weights with the Kaiming initialization method. During the reconstruction phase, the numbers of iterations were set to I = 500 and J = 2, where I = 500 is the number of outer loops and J = 2 the number of inner loops: each prediction step of the outer loop is followed by two annealed Langevin correction steps of the inner loop.

Test on NIH-AAPM-Mayo CT dataset
The reconstruction performance of each algorithm was tested at dose levels of 25%, 10%, and 5%. All compared methods except our UISG were trained individually at the three dose levels. Representative slices reconstructed at the 25% and 10% dose levels are shown in figures 2 and 3 for visual comparison. As can be seen from figures 2(b) and 3(b), the FBP reconstructions are significantly affected by noise at both dose levels. In contrast, the deep learning-based methods exhibit noise suppression and texture recovery capabilities. Since it does not involve the physical mechanism of CT reconstruction, the post-processing method REDCNN loses structural details (yellow circles in figures 2(c1-c2) and 3(c1-c2)) and still cannot produce satisfactory images. MAGIC achieves better reconstruction results than REDCNN, but still suffers from loss of detail (yellow circles in figures 2(d1-d2) and 3(d1-d2)) and blurred boundaries (blue arrows in figure 3(d2)). DU-GAN benefits from generative modeling, which improves visual fidelity and reduces noise to varying degrees; however, its reconstructions exhibit some detail distortion, such as the blood vessels indicated by the blue arrow in figure 2(c2), since the measurement data are not taken into account. Compared with MAGIC and DU-GAN, UISG integrates the diffusion model into the iterative algorithm framework, retaining sufficient texture during denoising (yellow circles in figures 2(f1-f2) and 3(f1-f2)) and generating more accurate edge features (blue arrows in figures 2(f1-f2) and 3(f1-f2)). UISG thus strikes a balance between noise suppression and image fidelity, resulting in clearer vessel and lesion boundaries. Without separate training, our algorithm still achieves results comparable to MAGIC and DU-GAN at the ultra-low dose level of 5%, demonstrating that our method matches supervised learning methods even at ultra-low-dose levels.
To assess these methods objectively, we conducted a quantitative comparison of the reconstruction results on the NIH-AAPM-Mayo CT dataset, computing the metrics over all slices in the test set.

Detectability of tiny lesions is one of the key challenges in ultra-low-dose CT denoising. We introduce the contrast-to-noise ratio (CNR) [54] in figure 4 to evaluate the detectability of low-contrast lesions: a higher CNR indicates a more pronounced contrast between the lesion and the background area, increasing the probability of detecting low-contrast lesions. As shown in figure 5, we selected the yellow lesion ROI and the green background ROI and calculated the CNR values of the ROIs for the different reconstruction methods; the results are given in table 2. REDCNN attains the highest CNR value, while our method ranks second. However, as shown in figure 4(c2), REDCNN blurs the edges of liver cysts, which are very important for disease staging and differential diagnosis. Given that CT values are often used in clinical practice to distinguish healthy from diseased tissue, we also report the average CT values of the diseased ROI in table 2. It is worth noting that UISG yields the CT number of the lesion area closest to the reference.
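The CNR used here is the standard ratio CNR = |μ_lesion − μ_background| / σ_background; a small numpy helper (the ROI values and masks below are assumptions for illustration, not data from the paper):

```python
import numpy as np

def cnr(img, lesion_mask, background_mask):
    """Contrast-to-noise ratio of a lesion ROI against a background ROI:
    CNR = |mu_lesion - mu_background| / sigma_background."""
    mu_l = img[lesion_mask].mean()
    mu_b = img[background_mask].mean()
    return abs(mu_l - mu_b) / img[background_mask].std()

# Toy ROIs: a "lesion" of value 3 against a background with mean 1 and std 1
img = np.array([3.0, 3.0, 0.0, 2.0, 0.0, 2.0])
lesion = np.array([True, True, False, False, False, False])
value = cnr(img, lesion, ~lesion)
```

In practice the masks would be the yellow lesion ROI and green background ROI drawn on the reconstructed slice.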
To further assess the ability of the proposed algorithm to preserve structure, figure 6 compares intensity profiles across the reconstructed images. It reveals that the intensity profile of UISG is closest to the reference image, demonstrating the edge-preserving capability of the algorithm. While PSNR and SSIM are currently the most widely used image evaluation metrics, previous studies have demonstrated their limited correlation with radiologists' actual perception of image quality, as they rely primarily on numerical pixel values [55]. Consequently, we invited two radiologists, each with 10 years of experience in CT image interpretation, to assess artifact reduction, noise suppression, contrast preservation, and overall quality. From the test set, we randomly selected 50 reference images and the corresponding LDCT reconstructions produced by the different methods, with the reference images serving as the gold standard. Subjective ratings followed these criteria: 5 = optimal contrast and image quality, no artifacts, conducive to diagnosis, down to 1 = unacceptable image quality and contrast, severe artifacts and noise, rendering diagnosis impossible. The final human-perceived score for each image was the average of the two radiologists' scores. Statistical analysis employed Student's t-test with a significance level of p < 0.05 to assess differences between the sets; the results are summarized in table 3. All reconstruction methods except FBP improved image quality to varying degrees. The scores of UISG were closest to those of the reference images, and the t-tests showed a consistent trend: the differences between the reference images and the UISG results were not statistically significant on any qualitative measure.

Generalization
Despite the remarkable results of DL techniques in medical image reconstruction, the generalization of these models across different datasets, scanning conditions, and real-world clinical applications remains a challenge. This subsection further demonstrates the generalizability of the UISG algorithm on diverse datasets.
We transferred the model trained with the UISG algorithm on the NIH-AAPM-Mayo CT dataset to the piglet dataset [52]. To visually illustrate the performance of UISG, figure 7 displays reconstructed images of the piglet dataset at dose levels of 50%, 25%, 10%, and 5%. The results demonstrate that UISG effectively suppresses noise and maintains structural fidelity at dose levels above 5%. At the 5% dose level, noticeable noise remains in the reconstructed images and bone boundaries appear blurred (yellow circles in figure 7(e2)). Table 4 records the average numerical results of the generalization experiments on the piglet dataset. Qualitative and quantitative evaluations consistently confirm that UISG exhibits generalization and robustness even without being specifically trained on the piglet dataset: it can handle different types of noise and artifacts, producing visually acceptable results.

Real walnut test result
To evaluate the performance of our algorithm on real data, projection data acquired with the FleX-ray scanner [56] were also employed. In this study, projection data were collected by scanning 42 walnuts with a flat-panel cone-beam geometry. The tube voltage and power were set to 40 kV and 12 W, respectively. The source-to-rotation-center distance was 66.00 mm and the detector-to-rotation-center distance was 133.00 mm. The normal-dose acquisition comprised 1200 views collected uniformly over 360°, with a detector of 972 elements. Central slices (z = 0) of the original cone-beam projections were extracted as fan-beam data. The variance of the electronic noise was fixed to σ_E² = 10, and Poisson noise (b_z = 5 × 10⁵, 2.5 × 10⁵, 1 × 10⁵, 5 × 10⁴, 1 × 10⁴) was added to the real normal-dose projections to generate low-dose projections for testing the LDCT reconstruction performance of UISG.
Table 5 presents the quantitative results. Because no ground truth is available, this quantitative comparison is not entirely fair and is provided for reference only. Although the model was not retrained, the quality of the UISG-reconstructed images improves as the noise level decreases. Figure 8 illustrates the UISG reconstructions of the real walnut data at the five low-dose levels. When the intensity drops to 1e4, UISG struggles to recover the crack structure of the walnut meat (yellow circle in figure 8(f2)), and some septa within the walnut exhibit noticeable boundary blurring and loss in certain regions (blue arrows in figure 8(f3)). Except at an intensity of 1e4, our model effectively suppresses noise and reconstructs visually pleasing images. These experimental results suggest that UISG can handle a wide range of noise intensities.

Ablation experiment
3.6.1. Effectiveness of different modules
To validate the necessity of the different modules in our method, we conducted ablation experiments on the NIH-AAPM-Mayo CT dataset at the 10% dose level, using an incremental verification strategy. HFCSC, which applies high-frequency CSC only as post-processing for LDCT, served as the baseline model. PWLS was then introduced as a fidelity term to augment the baseline, creating the first comparative model, termed PWLS-HFCSC. Augmenting PWLS-HFCSC with the diffusion model yielded the second comparative model, denoted PWLS-DHFCSC. Quantitative and qualitative comparisons were conducted between these three models and UISG. Figure 9 and table 6 display the visual effects and quantitative performance of the ablation experiments. The post-processing technique HFCSC produced excessively smoothed reconstructions, making it nearly impossible to recover subtle features (figure 9(b2)), and yielded the lowest quantitative metrics. In contrast, PWLS-HFCSC and PWLS-DHFCSC showed improved reconstruction results, although fine details remained challenging to reconstruct. As shown in figure 9(e), UISG preserves these fine details most faithfully.
As can be seen from figure 10, under the same sampling-step settings, the reconstruction quality of the UISU model is slightly lower, indicating that 'TransDiff' can learn more image features than U-Net and thus leads to superior reconstruction results. Compared with UISG-CSC, UISG exhibits minimal loss of detail in its reconstructed images. This suggests that CSC applied to the high-frequency components can more effectively represent and preserve detailed information without excessive smoothing. The combination of high-frequency and low-frequency processing strikes a balance between denoising and preserving detailed information.

Computational cost
Table 7 shows the average training and testing times of the different methods at the 10% dose level of the NIH-AAPM-Mayo CT dataset. Although the training time of UISG is relatively long, our algorithm adopts a train-once framework: it need not be retrained when faced with a new, unseen dataset, which avoids tedious empirical hyperparameter adjustment for different data.

Discussion and conclusions
In this work, we proposed a score-based LDCT image reconstruction algorithm, UISG. The algorithm trains an SGM to capture the prior distribution of CT images. During the iterative inference stage, numerical SDE solving, ADMM, and data-consistency operations are alternated to reconstruct the image, ensuring a high correlation between the generated CT image and the input low-dose projection. Ablation results indicate that restricting CSC to the high-frequency components not only removes noise more accurately during denoising but also preserves fine structures in the image. The introduction of 'TransDiff' further strengthens the model's ability to capture global information, contributing to better-reconstructed CT images. Moreover, our model was trained within a single learning framework and demonstrated robust generalization, with outstanding results on both the Piglet and the real walnut datasets.
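The alternation described above can be summarized schematically. The operator names below (`score_step`, `hfcsc_step`, `data_consistency`) are placeholders standing in for the paper's components, so this is a structural sketch rather than a working implementation:

```python
import numpy as np

def uisg_reconstruct(y_low_dose, fbp, score_step, hfcsc_step,
                     data_consistency, n_steps=1000):
    """Schematic UISG iterative reconstruction loop (illustrative only).

    fbp              : filtered back-projection operator (initialization)
    score_step       : one predictor/corrector update of the reverse SDE
    hfcsc_step       : ADMM update enforcing the high-frequency CSC prior
    data_consistency : enforces agreement with the measurements y_low_dose
    """
    x = fbp(y_low_dose)                      # initial CT image
    for t in reversed(range(n_steps)):
        x = score_step(x, t)                 # diffusion (score) prior
        x = hfcsc_step(x)                    # high-frequency sparsity prior
        x = data_consistency(x, y_low_dose)  # fidelity with projections
    return x
```

The key point is the ordering: each reverse-diffusion step is immediately followed by the sparsity prior and a data-consistency correction, which is what ties the generated image to the measured low-dose projection.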

Y Li et al
Although our approach achieves promising results in terms of visual quality and quantitative metrics, some limitations remain. The main limitation is the slow sampling speed during the reconstruction stage, which makes inference for an individual LDCT reconstruction significantly slower than for end-to-end networks. Improving sampling speed will therefore be a focus of future work. Furthermore, multi-domain diffusion models could be investigated to further improve LDCT reconstruction performance.

Figure 1.
Figure 1. The proposed UISG framework and its pivotal components. (A) An overview of the proposed pipeline for LDCT imaging. Top: training phase in the image domain, where gradient distributions are learned via denoising score matching. Bottom: iterative reconstruction pipeline in UISG. 'P' represents sampling and 'C' correction; 'HFCSC' stands for the high-frequency convolutional sparse coding step, and 'DC' represents the data-consistency term. (B) Detailed structure of the time-dependent score network 'TransDiff' S_θ(y, t).

Figure 4
Figure 4 illustrates the qualitative results at the 5% dose. Compared to figures 2 and 3, all compared methods face significant challenges in recovering tissue details and preserving edges at this ultra-low dose. As shown in figure 4(b), the low-attenuation lesions in the FBP reconstruction are seriously contaminated by noise due to the photon starvation effect, making the lesion information difficult to distinguish. All network-based reconstruction methods enhanced the visibility of lesions relative to the FBP images. Because the MSE loss tends toward the average without constraining image edges and details, the REDCNN reconstruction exhibits unclear edges between different tissues (blue arrow in figure 4(c2)). MAGIC and DU-GAN strike a balance between noise suppression and image fidelity, resulting in clearer vessel and lesion boundaries. Without separate training, our algorithm still achieves results comparable to MAGIC and DU-GAN at the ultra-low dose level of 5%, demonstrating that our method can match supervised learning methods even at this level. To assess these methods objectively, we conducted a quantitative comparison of the reconstruction results on the NIH-AAPM-Mayo CT dataset; metrics were computed for all slices in the test set.
Figure 6 presents the 1D line intensity profiles passing through the red solid line in the bottom-left image. These profiles compare the intensity distribution along the same line in the CT images reconstructed by the various methods.
Figure 5. CNR calculated for selected ROIs. (A) Reference image, (B) FBP, (C) REDCNN, (D) MAGIC, (E) DU-GAN, and (F) UISG. The display window is [−160, 240] HU. The yellow ROI is extracted in the lesion area and the green ROI in the background.

Figure 6.
Figure 6. Comparison of the 1D intensity profiles of different methods passing through the red solid line in the bottom left of the image. The dotted black box shows an enlarged ROI of the profile lines.
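For reproduction purposes, a profile like those compared in figure 6 is simply the vector of pixel values sampled along the same line in each reconstruction. A minimal sketch (names illustrative):

```python
import numpy as np

def line_profiles(images, row):
    """Extract the same horizontal line of pixel values from each named
    reconstruction so their intensity distributions can be overlaid."""
    return {name: np.asarray(img)[row, :] for name, img in images.items()}
```

Plotting these vectors on shared axes, with the reference profile included, makes deviations of each method from the ground truth directly visible.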

Table 1.
Quantitative results (means ± SDs) obtained on the NIH-AAPM-Mayo CT dataset at 25%, 10%, and 5% dose levels using different methods. The best results are highlighted in bold.

Table 1 lists the statistical quantitative evaluations of the reconstruction results at the three dose levels, including average PSNR, SSIM, and VIF. The optimal values at each dose are highlighted in bold in the table. Across all three dose levels, the FBP algorithm yields the poorest evaluations. The proposed method achieves the highest PSNR, SSIM, and VIF among the state-of-the-art DL-based methods at the 25% and 10% dose levels. Even at the ultra-low dose level of 5%, our reconstructions achieve PSNR, SSIM, and VIF values of 36.0043, 0.9638, and 0.5135, respectively, indicating that the UISG algorithm excels in preserving structural details and suppressing noise.
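Of the three metrics, PSNR has a simple closed form; a minimal NumPy version is shown below (SSIM and VIF involve perceptual models and are typically taken from libraries such as scikit-image):

```python
import numpy as np

def psnr(reference, reconstruction, data_range=None):
    """Peak signal-to-noise ratio (in dB) between a normal-dose
    reference slice and a reconstruction of it."""
    if data_range is None:
        # Dynamic range of the reference, e.g. its HU window.
        data_range = reference.max() - reference.min()
    mse = np.mean((reference - reconstruction) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)
```

For example, a reconstruction offset from a unit-range reference by a uniform 0.1 has MSE 0.01 and therefore a PSNR of exactly 20 dB.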

Table 2.
Comparison of CNR and average pixel value of lesion ROI in reconstructed images using different methods.
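The CNR values in table 2 are computed from a lesion ROI and a background ROI. A common definition, assumed here since the paper's exact formula is not shown in this excerpt, is |μ_lesion − μ_background| / σ_background:

```python
import numpy as np

def cnr(image, lesion_mask, background_mask):
    """Contrast-to-noise ratio between a lesion ROI and a background ROI.

    Defined here as |mean_lesion - mean_background| / std_background,
    one common convention; the paper's exact formula may differ.
    """
    lesion = image[lesion_mask]          # boolean mask over lesion pixels
    background = image[background_mask]  # boolean mask over background
    return abs(lesion.mean() - background.mean()) / background.std()
```

Higher CNR indicates that the lesion stands out more clearly against the background noise, which is why it complements PSNR/SSIM for diagnostic relevance.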

Table 5.
PSNR/SSIM/MSE (means ± SDs) of the model trained on the NIH-AAPM-Mayo CT dataset when reconstructing the real walnut dataset at different dose levels.

Table 6.
Quantitative evaluations (means ± SDs) for the ablation study.
3.6.2. TransDiff and high-frequency CSC
This section validates the effectiveness of 'TransDiff' and of the high-frequency CSC in UISG. In contrast to typical diffusion model approaches, the score network 'TransDiff' focuses on the global structure of the image to enhance the reconstruction quality of LDCT. Consequently, a critical question arises: to what extent does the global structure impact image quality? We built a comparison model, UISU, whose structure is similar to UISG; the only difference is that the diffusion model's score network 'TransDiff' is

Table 7.
Comparison of computational costs of different models.
replaced by the traditional CNN-based U-Net, to verify the effectiveness of the swin transformer-based score network. Additionally, to assess the impact of restricting CSC to the high-frequency components, we replaced this step with a CSC prior applied directly to the entire image, creating the comparative model UISG-CSC.
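The high-/low-frequency decomposition that UISG-CSC removes can be sketched with a Gaussian low-pass filter. The actual filter used in UISG may differ, so treat this as an illustrative assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_frequency(image, sigma=2.0):
    """Decompose an image into low- and high-frequency components via a
    Gaussian low-pass filter; their sum reconstructs the input exactly."""
    low = gaussian_filter(image, sigma=sigma)   # smooth structures
    high = image - low                          # edges and fine details
    return low, high
```

In UISG the CSC prior is applied only to the high component, so the smooth low component, which carries most of the anatomy's bulk intensity, is never over-regularized; UISG-CSC instead regularizes the full image and risks smoothing away fine structure.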