Transformer CycleGAN with uncertainty estimation for CBCT based synthetic CT in adaptive radiotherapy

Objective. Clinical implementation of synthetic CT (sCT) from cone-beam CT (CBCT) for adaptive radiotherapy necessitates a high degree of anatomical integrity, Hounsfield unit (HU) accuracy, and image quality. To achieve these goals, a vision-transformer-based model and anatomically sensitive loss functions are described. Better quantification of image quality is achieved using the alignment-invariant Fréchet inception distance (FID), and uncertainty estimation for sCT risk prediction is implemented in a scalable, plug-and-play manner. Approach. Baseline U-Net, generative adversarial network (GAN), and CycleGAN models were trained to identify shortcomings in each approach. The proposed CycleGAN-Best model was empirically optimized based on a large ablation study and evaluated using classical image quality metrics, FID, gamma index, and a segmentation analysis. Two uncertainty estimation methods, Monte-Carlo dropout (MCD) and test-time augmentation (TTA), were introduced to model epistemic and aleatoric uncertainty. Main results. FID correlated with blind observer image quality scores with a Pearson correlation coefficient of −0.83, validating the metric as an accurate quantifier of perceived image quality. The FID and mean absolute error (MAE) of CycleGAN-Best were 42.11 ± 5.99 and 25.00 ± 1.97 HU, compared to 63.42 ± 15.45 and 31.80 ± 1.46 HU for CycleGAN-Baseline, and 144.32 ± 20.91 and 68.00 ± 5.06 HU for the CBCT, respectively. Gamma 1%/1 mm pass rates were 98.66 ± 0.54% for CycleGAN-Best, compared to 86.72 ± 2.55% for the CBCT. TTA- and MCD-based uncertainty maps were well spatially correlated with poor synthesis outputs. Significance. Anatomical accuracy was achieved by suppressing CycleGAN-related artefacts. FID better discriminated image quality, where alignment-based metrics such as MAE erroneously suggest poorer outputs perform better.
Uncertainty estimation for sCT was shown to correlate with poor outputs and has clinical relevance toward model risk assessment and quality assurance. The proposed model and accompanying evaluation and risk assessment tools are necessary additions for achieving clinically robust sCT generation models.


Introduction
Adaptive radiotherapy (ART) involves the re-optimization of plan parameters in response to changing patient anatomy, either prior to treatment (online-ART) (Lim-Reinders et al 2017, Green et al 2019), or between fractions (offline-ART) (Green et al 2019, Nigay et al 2019). Image guidance for online-ART is achieved via magnetic resonance (MR) or CBCT imaging (Hoegen et al 2020, Zwart et al 2022). MR based online-ART boasts superior image quality compared to CBCT at the expense of increased costs, increased acquisition time, and stringent bunker and equipment requirements. Alternatively, CBCT based ART offers a more economical solution that may be applied broadly given the ubiquity of in-room CBCT devices (Jong et al 2021).
Poor CBCT image quality resulting from the large scanning geometry and lengthy acquisition time has impeded its clinical utility beyond patient setup positioning. The increased scatter conditions lead to inaccurate Hounsfield units (HU) and reduced soft tissue contrast, while patient motion during acquisition contributes to blurring, double contours, and inconsistent HU reproduction (Schulze et al 2011). The resulting dose calculation accuracy on uncorrected images has been shown to deviate by up to 8.0% ± 5.7%, 10.9% ± 6.8%, and 14.5% ± 10.4% for pelvic, thoracic, and head and neck regions, respectively (Richter et al 2008).
Deep learning based synthetic CT (sCT) generation has shown significant potential for correcting CBCT image quality for online and offline-ART (Eckl et al 2021, Rusanov et al 2022). Early attempts utilized a supervised learning approach by training the U-Net architecture (Ronneberger et al 2015) in combination with pixelwise loss functions to learn a mapping between deformably registered (DIR) planning CT and CBCT images (Kida et al 2018, Li et al 2019, Chen et al 2020). A clinically significant shortcoming of supervised methods is that unavoidable anatomic misalignments between CBCT and DIR-CT images embed an error in the transformation mapping. This manifests in a loss of boundary sharpness and fine anatomical detail, loss of anatomical consistency in regions of large inter-fractional variability, and a general 'averaging' of image texture resulting in overly smooth and unrealistic outputs (Rusanov et al 2022).
The generative adversarial network (GAN) (Alec et al 2016), in the conditional form of CycleGAN (Jun-Yan et al 2020), was proposed as an unsupervised approach for domain translation. These models do not require paired data, owing to the cycle-consistency constraint, and output sCT images perceptually similar to planning CT images based on the adversarial loss, which approximates the probability density function of real CT data (Goodfellow et al 2014, Jun-Yan et al 2020). However, CycleGAN has a tendency to preserve unwanted CBCT artefacts in sCT images as prompts to make the backward transformation easier (Bashkirova et al 2019, Liu et al 2021).
Convolutional neural networks (CNNs) have been the de facto backbone for computer-vision tasks; however, their finite kernel size limits the receptive field of the network, in turn restricting its capacity to relate long-range information (Jieneng et al 2021). Inspired by the transformer architecture found in large language models such as ChatGPT to handle long-term memory dependencies in text (Radford et al 2018), the vision transformer (ViT) excels in modelling global contextual image information (Dosovitskiy et al 2020). By tokenizing the image into separate patches, the transformer mechanism learns how each patch relates to every other patch (Dosovitskiy et al 2020, Jieneng et al 2021). A drawback is that local information is coarsely modelled. This led to the emergence of hybrid CNN-ViT models that draw on the benefits of CNNs and ViTs for fine and coarse data modelling, respectively (Jieneng et al 2021, Dalmaz et al 2022). Recently, Chen et al demonstrated the improved performance of their hybrid model over a fully convolutional counterpart for CBCT based sCT generation (Chen et al 2022).
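The tokenization step described above can be sketched in a few lines (an illustrative NumPy version; the function name and the toy 4 × 4 image are ours, not from the cited works):

```python
import numpy as np

def tokenize(image, patch):
    # Split an (H, W) image into non-overlapping (patch x patch) tiles and
    # flatten each tile into a token vector, as ViT does before attention.
    H, W = image.shape
    assert H % patch == 0 and W % patch == 0
    return (image.reshape(H // patch, patch, W // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch))

img = np.arange(16.0).reshape(4, 4)   # toy 4x4 "image"
tokens = tokenize(img, 2)             # 4 tokens, each of length 4
```

Each token then attends to every other token, which is what gives the ViT its global receptive field.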
Robust risk identification is necessary where model predictions have significant clinical consequences. Uncertainty estimation in deep learning is one approach to identify and mitigate potentially dangerous predictions (Alex and Yarin 2017, Abdar et al 2021). Bayesian neural networks (Jospin et al 2022) and Gaussian processes (Joost Van et al 2022) have been proposed as mathematically sound uncertainty estimation methods; however, their inability to scale with model size renders them intractable for practical purposes (Yehao et al 2021). In contrast, test-time augmentation (TTA) (Ayhan and Berens 2018, Wang et al 2019) and Monte-Carlo dropout (MCD) (Yarin and Zoubin 2016, Wang et al 2019) have been proposed as scalable methods for aleatoric (data-based) and epistemic (model-based) uncertainty estimation, respectively.
In this work, we aim to:
1. Compare baseline U-Net, GAN, and CycleGAN approaches and identify weaknesses in each synthesis.
2. Conduct an ablation and segmentation study to determine effective changes to model architecture and loss function to mitigate weaknesses in point 1.
3. Introduce and validate the Fréchet inception distance (FID) as a more relevant image quality metric in the CBCT based sCT literature.
4. Introduce TTA and MCD uncertainty estimation methods to the CBCT based sCT literature for risk identification and as inference-based post-processing strategies to further boost performance.

Methods
The flowchart in figure 1 describes the conceptual workflow of the current study. After data selection (section 2.1) and preprocessing (section 2.2), three baseline models, U-Net, GAN, and CycleGAN, were trained using the same generator and discriminator architectures (section 2.4.1). Next, the proposed model, 'CycleGAN-Best', was developed by identifying shortcomings in the baseline models and conducting an ablation study that experimented with a multitude of architectural and loss function changes made to baseline CycleGAN (section 2.4.2). CycleGAN-Best was formed by combining a subset of the best performing changes into a single model. Lastly, TTA was utilized on CycleGAN-Best with a median aggregation strategy as a post-processing method to further improve results (section 2.5.3).

Data
A total of 50 prostate cancer patients receiving curative radiotherapy were retrospectively enrolled as part of an ethics approved study at our institution (Human Research Ethics Approval RGS4979). First-scan CBCT and planning CT (pCT) images were collected and deidentified prior to project initialization. All CBCT images were acquired on a Varian TrueBeam™ linear accelerator kV system using the standard pelvis protocol at our institution, while pCT images were acquired on an Aquilion LB CT scanner.

Preprocessing
The dataset was partitioned into training, validation, and testing sets containing 40, 5, and 5 patients, respectively. Image resampling and registration were performed in Velocity (Velocity Medical, Atlanta, GA, USA) by initially performing a rigid alignment followed by the deformable multi-pass registration algorithm for deformation of CT to CBCT images. If the registration failed to sufficiently approximate bladder and rectal filling, a subsequent deformable multi-pass registration within the problematic region of interest (ROI) was applied using the ROI tool. Further preprocessing for all images included HU clipping to [−1000, 2000] and intensity rescaling to [−1, 1]. The conical ends of CBCT volumes, typically the first and final four slices, were removed. Non-anatomical structures such as patient immobilization equipment and the couch were replaced by air by binarizing the anatomical region from the background. Otsu thresholding and morphological erosion, dilation, and area closing were used to extract the relevant anatomical region. Each patient volume contained 80 slices, for a total of 3280 training, 400 testing, and 400 validation images. Data partitioning was performed on a patient-wise basis. All models utilized the same preprocessing procedure.
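The HU clipping and rescaling steps can be sketched as follows (a minimal NumPy illustration of the stated [−1000, 2000] → [−1, 1] mapping; the function names and the inverse map are our own):

```python
import numpy as np

HU_MIN, HU_MAX = -1000.0, 2000.0

def preprocess(volume):
    # Clip HU to the training window, then linearly rescale to [-1, 1],
    # matching the preprocessing described above.
    v = np.clip(volume, HU_MIN, HU_MAX)
    return 2.0 * (v - HU_MIN) / (HU_MAX - HU_MIN) - 1.0

def postprocess(v):
    # Inverse map from the network output range [-1, 1] back to HU.
    return (v + 1.0) / 2.0 * (HU_MAX - HU_MIN) + HU_MIN
```

The inverse map is needed at inference time to restore HU values from the Tanh-bounded network output.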

Network architectures
Two different generators and a single discriminator architecture were investigated.

Generators
The fully convolutional U-Net generator (FC-Gen) had four down- and up-sampling levels with residual blocks after each sampling operation. Six residual blocks were stacked at the bottleneck. Concatenations from the encoder wing were employed at each level, including the input image, to aid spatial and contextual information synthesis by the decoder. The ReLU nonlinearity and batch normalization were used after each convolution, while the output layer used the Tanh nonlinearity to scale image intensity back to [−1, 1]. The second generator, ResViT, combines convolutional layers with aggregated residual transformer block (ARTB) units to model both local and global contextual information. Consequently, ResViT does not utilize concatenation between the encoder and decoder wings, since the ARTB units are sufficiently expressive for accurate decoder synthesis. ResViT was implemented using the smaller 'base' variant, which contained a total of 129,340,801 trainable parameters. Specific model details and network schematics can be found in the original publication (Dalmaz et al 2022).

Discriminator
The PatchGAN discriminator was used to model image semantics at the scale of local image patches, thereby capturing local structural and textural information (Isola et al 2017). The effective receptive field at the image level captured overlapping patches of size 32 × 32 pixels. Hence, each patch is classified as real or fake, with the outputs averaged across the entire image. PatchGAN had a total of 2,762,689 trainable parameters. Figure 3(A) in the supplementary materials shows the schematic diagram of PatchGAN as used in the present study.

Training configurations

Baseline models
To generate baseline results, three training configurations were defined using the same generator and discriminator architectures, optimizers, augmentation strategies, and datasets. Loss functions varied depending on model configuration: 'U-Net' was a supervised model consisting of FC-Gen trained with the pixel-wise L1 loss function. 'GAN' was trained with supervision using FC-Gen and PatchGAN with the adversarial and L1 loss functions to enforce anatomic preservation. 'CycleGAN' was trained unsupervised and consisted of two FC-Gen generators and two PatchGAN discriminators trained using the adversarial and cycle losses. All models were developed in PyTorch and trained on a Titan RTX 2090 GPU with the Adam optimizer. The following augmentation strategies were used: flips (vertical/horizontal) and rotations (0°, 90°, 180°, 270°).

CycleGAN-best optimization
After highlighting the weaknesses in each approach relating to clinically relevant endpoints (anatomical preservation, artefact correction, perceptual image quality), model modifications were developed to solve the identified shortcomings. Individual changes relating to loss functions, model architecture, and training strategies were investigated. Finally, the best performing changes were combined into a single model and compared to baseline results.
Figure 2 summarises the MAE and FID of different baseline CycleGAN models modified by a single change, along with CycleGAN-Best, the final optimal model. Changes to loss functions, generator architecture, and training strategies were explored; interested readers can refer to Supplementary Material section 1 for more information on the motivation and implementation of each of the model modifications and their impact on the sCT. The final model, CycleGAN-Best, was trained in a supervised manner using the same augmentations as the baseline models. Specifically, the model used the ResViT generator with Style Loss (applied between CT and sCT), Conditional L2 Log Base-1000 Loss (applied between CT and sCT, and CBCT and sCBCT), and Cycle-MSSIM Loss (applied between CBCT and cycle-CBCT, and CT and cycle-CT), along with the standard adversarial and cycle losses. Model additions were selected by considering quantitative MAE and FID performance, qualitative image quality analysis, and computational limitations.
Uncertainty quantification

Aleatoric uncertainty and TTA
Aleatoric uncertainty is an inherent property of any measurement and represents randomness that cannot be explained away by the acquisition of more data. Practically, this uncertainty can be modelled by applying a series of transformations to input data prior to inference (the TTA method). In this study, we consider reversible data augmentations, which include rotations, translations, flips across each 2D axis, and scaling. These augmentations are chosen such that they do not change the image after forward and backward transformation. The probability of a given transformation is assumed to originate from a prior distribution, which should capture the variability of that transformation occurring in pCT scans (Wang et al 2019). However, since all models were trained with data augmentation exceeding what was observed in the pCT scans, matching prior distribution probabilities are assigned to the transformation parameters. Specifically, P(Rotation) ∼ U(0, 2π), P(Scale) ∼ U(1, 1.05), P(Flip) ∼ Bern(0.5), and P(Voxel Translation) ∼ U(−20, 20).
To model the uncertainty, a total of N = 20 iterations with augmentation priors based on the distributions above were applied to CBCT images before inference. The reverse transformation was then applied to the sCT, and the variability of each voxel was computed by finding its standard deviation (STD) over the 20 simulations. Since the STD alone could not sufficiently represent the variability, the final estimate of the uncertainty map was generated by computing the natural logarithm of the STD map for each voxel i:

$$U_i^{\mathrm{aleatoric}} = \ln(\sigma_i), \qquad (1)$$

where $\sigma_i$ is the STD of voxel i across the N augmented predictions. Random errors in regression measurements follow a normal distribution. To determine if the estimated aleatoric uncertainty satisfied this assumption, the Shapiro-Wilk test for normality was performed, with p > 0.05 indicating a high likelihood that the data originate from a normal distribution.
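The TTA log-STD map described above can be sketched as follows (restricted to flips for brevity; the `tta_uncertainty` name and the toy stand-in model are illustrative, not the study's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def tta_uncertainty(model, cbct, n=20):
    # Run the model under random reversible flips, undo each flip on the
    # output, then report the per-voxel natural-log STD map.
    preds = []
    for _ in range(n):
        axes = tuple(ax for ax in (0, 1) if rng.random() < 0.5)  # Bern(0.5)
        x = np.flip(cbct, axis=axes) if axes else cbct
        y = model(x)
        y = np.flip(y, axis=axes) if axes else y   # reverse transformation
        preds.append(y)
    stack = np.stack(preds)
    return np.log(stack.std(axis=0) + 1e-6)        # epsilon avoids log(0)

# Toy stand-in for the generator: identity plus noise.
toy_model = lambda x: x + rng.normal(0.0, 0.1, x.shape)
u_map = tta_uncertainty(toy_model, np.zeros((8, 8)))
```

A real pipeline would also draw the rotation, translation, and scale parameters from the priors listed above and invert each of them exactly.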

Epistemic uncertainty and MCD
Epistemic uncertainty relates to uncertainty in model parameters and can be estimated using the heuristic proposed in (Yarin and Zoubin 2016, Wang et al 2019), namely MCD. In this study, the objective was to estimate the predictive distribution $p(y \mid x)$, which required capturing the distribution over model parameters $q(\theta \mid x_i, y_i)$, with model parameters $\theta$ and training data $(x_i, y_i)$. This is described by:

$$p(y \mid x) = \int p(y \mid x, \theta)\, q(\theta \mid x_i, y_i)\, d\theta. \qquad (2)$$

By switching off a unique subset of neurons prior to inference and sampling N times, the predictive distribution could be approximated. Each unique dropout configuration, $p(y \mid x, \theta_n)$, constituted a permutation of the model parameters that samples the approximate predictive distribution (Yarin and Zoubin 2016):

$$p(y \mid x) \approx \frac{1}{N} \sum_{n=1}^{N} p(y \mid x, \theta_n), \qquad (3)$$

where $\theta_n$ is a Monte Carlo sample from $q(\theta \mid x_i, y_i)$. Hence, the epistemic uncertainty can be simulated by computing the variability of model predictions obtained from N iterations. A dropout probability of 0.5 was adopted. As with aleatoric uncertainty, the final epistemic uncertainty map was found by computing the natural logarithm of the STD for each voxel i:

$$U_i^{\mathrm{epistemic}} = \ln(\sigma_i). \qquad (4)$$
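The MCD procedure can be sketched as follows (a toy NumPy network stands in for the generator; the inverted-dropout scaling and all names are our own illustration, not the study's code):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_dropout(forward, x, p=0.5, n=20):
    # Keep dropout active at inference, sample N stochastic forward passes,
    # and summarize them by the mean prediction and per-voxel log-STD.
    preds = np.stack([forward(x, p) for _ in range(n)])
    return preds.mean(axis=0), np.log(preds.std(axis=0) + 1e-6)

# Toy stand-in for the generator: one hidden layer with inference dropout.
W = rng.normal(size=(16, 64))
V = rng.normal(size=(64, 16))

def forward(x, p):
    h = np.maximum(x @ W, 0.0)               # ReLU features
    keep = rng.random(h.shape) >= p          # Bernoulli keep-mask
    h = h * keep / (1.0 - p)                 # inverted dropout scaling
    return h @ V

mean_pred, log_std = mc_dropout(forward, rng.normal(size=(4, 16)))
```

Each stochastic pass corresponds to one Monte Carlo parameter sample; the spread across passes approximates the epistemic uncertainty.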

TTA and MCD postprocessing
It has been shown that TTA can be used as a post-processing method to improve deep learning model predictions for a variety of tasks such as segmentation (Wang et al 2019, Moshkov et al 2020) and natural language processing (Lu et al 2022). The general idea behind TTA is that by exposing the model to different perspectives of the same input, its error can be reduced by ensembling the predictions in some manner. In this work, the stacks of predictions corresponding to MCD and TTA are aggregated in various ways to further improve sCT image quality. Specifically, the mean, mode, median, 100th, 75th, and 25th quantile predictions are investigated to generate the final sCT.
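The aggregation strategies above amount to a single reduction over the prediction stack (a minimal sketch; the mode is omitted because it is ill-defined for float-valued voxels, and the `"q75"` naming is ours):

```python
import numpy as np

def aggregate(stack, method="median"):
    # Collapse an (N, H, W) stack of TTA/MCD predictions into one sCT.
    # "q75" denotes the 75th quantile, "q25" the 25th, and so on.
    if method == "mean":
        return stack.mean(axis=0)
    if method == "median":
        return np.median(stack, axis=0)
    if method.startswith("q"):
        return np.quantile(stack, int(method[1:]) / 100.0, axis=0)
    raise ValueError(f"unknown aggregation: {method}")
```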

Loss functions
The L1 loss function was used to measure the absolute difference in pixel intensities between synthesized and ground truth CT images for the U-Net and GAN baselines:

$$\mathcal{L}_{L1} = \frac{1}{N} \sum_{i=1}^{N} \left| \mathrm{CT}_i - G(\mathrm{CBCT})_i \right|, \qquad (5)$$

where N corresponds to the number of pixels in the image and G is a generator used for domain translation.
The adversarial loss was defined by the output of PatchGAN and measures the extent to which the discriminator has mischaracterized patches as real when they were actually fake:

$$\mathcal{L}_{\mathrm{adv}} = \frac{1}{M} \sum_{j=1}^{M} \left( D(G(\mathrm{CBCT}))_j - J_j \right)^2, \qquad (6)$$

where J is a matrix of ones, D is the PatchGAN discriminator, and M is the number of pixels in the output. Note, the discriminators were trained to classify pCT and CBCT images in the training step prior to applying the adversarial loss.
The cycle loss was specific to CycleGAN and measured the absolute difference in pixel intensities between cycle and ground truth images:

$$\mathcal{L}_{\mathrm{cyc}} = \frac{1}{N} \sum_{i=1}^{N} \left| x_i - G_B(G_A(x))_i \right|. \qquad (7)$$

Additional loss functions were identified for the final improved model. These were the mean structural similarity (MSSIM) Loss, Style Loss, and Conditional L2 Log Loss. MSSIM Loss is a differentiable version of the classical perceptual image quality metric (Zhou et al 2004). This metric is a composite function measuring luminance (local mean, μ), contrast (local STD, σ), and structure (a local comparison function). MSSIM is computed with a window size of k = 5 using:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}. \qquad (8)$$

The Style Loss aims to extract style representations of two images using Gram matrices, followed by minimization of their L2 distance (Gatys et al 2016). In the pretrained VGG-19 (Simonyan and Zisserman 2015) network used to encode image semantics, style embeddings relating to textural information reside in the middle layers, between conv3-1 and conv4-4 (Lee and Tseng 2019). Specifically, using network activations from the relu4-1 layer, the Gram matrix $G_{i,j}$ for an input can be constructed by computing the inner product between all permutations of the flattened filter responses f contained in the relu4-1 feature map:

$$G_{i,j} = f_i \cdot f_j. \qquad (9)$$

The Style Loss can then be computed as the L2 distance between the input ($G_{i,j}$) and target ($\hat{G}_{i,j}$) Gram matrices, with C and K representing feature map depth and size, respectively (Gatys et al 2016):

$$\mathcal{L}_{\mathrm{style}} = \frac{1}{4 C^2 K^2} \sum_{i,j} \left( G_{i,j} - \hat{G}_{i,j} \right)^2. \qquad (10)$$

The Conditional L2 Log Loss function computes the log base-1000 weighted L2 distance between synthesized and ground truth image pixels. This loss is conditionally applied to pixels in both images which range between −465 and 1950 HU, to avoid the impact of air pocket and fiducial marker based misalignments on model optimization:

$$\mathcal{L}_{\mathrm{cond}} = \frac{1}{N} \sum_{i=1}^{N} \log_{1000}\!\left( \left( \mathrm{CT}_i - \mathrm{sCT}_i \right)^2 + 1 \right), \quad -465 \leqslant \mathrm{CT}_i, \mathrm{CBCT}_i \leqslant 1950. \qquad (11)$$
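The Style Loss and Conditional L2 Log Loss can be sketched as follows (hedged NumPy illustrations; the exact log-1000 weighting used in the study is not fully specified in the text, so the weighting below is an assumption, as are the function names):

```python
import numpy as np

def gram(f):
    # Gram matrix of a (C, K) flattened feature map: inner products between
    # all pairs of channel responses (Gatys et al 2016).
    return f @ f.T

def style_loss(f_sct, f_ct):
    # L2 distance between Gram matrices, normalized by depth C and size K.
    C, K = f_sct.shape
    return float(np.sum((gram(f_sct) - gram(f_ct)) ** 2) / (4 * C**2 * K**2))

def conditional_l2_log_loss(sct, ct, lo=-465.0, hi=1950.0):
    # L2 distance weighted by a log base-1000 term, applied only where BOTH
    # images lie within [lo, hi] HU (the weighting form is an assumption).
    mask = (sct >= lo) & (sct <= hi) & (ct >= lo) & (ct <= hi)
    if not mask.any():
        return 0.0
    d = ct[mask] - sct[mask]
    w = np.log(np.abs(d) + 1.0) / np.log(1000.0)
    return float(np.mean(w * d**2))
```

In practice the Gram matrices would be built from relu4-1 activations of a frozen VGG-19; plain arrays are used here to keep the sketch self-contained.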

 
The overall loss function was formulated as the weighted sum of the individual loss components:

$$\mathcal{L} = \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{cyc}} \mathcal{L}_{\mathrm{cyc}} + \lambda_{\mathrm{MSSIM}} \mathcal{L}_{\mathrm{MSSIM}} + \lambda_{\mathrm{style}} \mathcal{L}_{\mathrm{style}} + \lambda_{\mathrm{cond}} \mathcal{L}_{\mathrm{cond}}. \qquad (12)$$

Evaluation metrics

Image quality
Classical image quality metrics compare pixel distances for well-aligned data and include MAE and mean error (ME), both of which are reported within the anatomical region of the image in this study. The peak signal-to-noise ratio (PSNR) and MSSIM are also reported. Differences in image quality metrics between CBCT and all models were assessed for statistical significance based on the paired t-test, with Bonferroni correction applied for multiple testing. The effective significance level corrected for multiple testing was p < 0.0125.
In addition, the FID (Heusel et al 2018) was introduced, which is a widely used metric in the generative image synthesis community to quantify image quality in the absence of paired ground truth data. FID measures latent vector distances based on deep feature activations of the InceptionV3 (Szegedy et al 2015) model pretrained on ImageNet. The final global average pooling layer activations consist of a 2048-dimensional vector, which is used to compute the FID for real (y) and generated (ŷ) images using:

$$\mathrm{FID} = \lVert \mu_y - \mu_{\hat{y}} \rVert^2 + \mathrm{Tr}\!\left( \mathrm{Cov}(y) + \mathrm{Cov}(\hat{y}) - 2\sqrt{\mathrm{Cov}(y)\,\mathrm{Cov}(\hat{y})} \right), \qquad (13)$$

where $\mu$ is the element-wise mean of the feature activations across all input images, $\mathrm{Tr}(\cdot)$ is the trace operation, and $\mathrm{Cov}(\cdot)$ is the covariance matrix of the feature vectors. In this study, FID is computed on a per-patient basis. In addition, a blind observer image quality assessment was conducted by asking 3 Radiation Oncologists (ROs) and 5 Radiation Oncology Medical Physicists (ROMPs) at our institution to score images based on their similarity to real CT images. Observers were blind to both the methods used to create the images and the patient identifiers. Images were assessed such that a score of 10 implied the image is equivalent to a pCT, while a score of 1 implied the image is unworkable for treatment planning. Assessments were made on baseline U-Net, baseline GAN, baseline CycleGAN, pCT, CBCT, and the proposed CycleGAN-Best model images.
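The per-patient FID can be sketched directly from its definition (the trace of the matrix square root is evaluated via the eigenvalues of the covariance product, which are real and non-negative for covariance matrices; extraction of InceptionV3 activations is omitted, and the function name is ours):

```python
import numpy as np

def fid(act_real, act_fake):
    # Frechet inception distance between two (n, d) activation sets:
    # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2*sqrt(C1 @ C2)).
    mu1, mu2 = act_real.mean(axis=0), act_fake.mean(axis=0)
    c1 = np.cov(act_real, rowvar=False)
    c2 = np.cov(act_fake, rowvar=False)
    # Tr(sqrt(C1 C2)) = sum of square roots of the eigenvalues of C1 @ C2.
    eig = np.linalg.eigvals(c1 @ c2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2) - 2.0 * tr_sqrt)
```

Identical activation sets yield an FID of zero; any shift in the mean or covariance of the generated features increases it.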

Dosimetric assessment
To assess the dosimetric impact of computing the clinical dose distributions on CBCT and CycleGAN-Best sCT images, both image sets were deformably registered to the pCT using the deformable multi-pass algorithm in Velocity, and the resulting dose distributions were compared using global 3%/3 mm, 2%/2 mm, and 1%/1 mm gamma criteria with a 10% threshold. Furthermore, the dose-volume histogram (DVH) was used to compute the Dmean component for the rectum, bladder, and CTV. Acuros XB version 16.1 was used for all dose calculations.

Anatomical agreement
Quantification of anatomical preservation between sCT and CBCT datasets was performed by segmenting soft tissue and bony structures on all datasets. To reduce observer variability, the commercial auto-contouring software MIM Contour Protégé AI was used to create initial segmentations for the bladder, rectum, both femoral heads, and sacrum. Minor errors were manually corrected prior to review and approval by an experienced RO. Structural evaluation was performed using the mean surface distance (MSD) metric. MSD is computed by averaging the symmetric minimum Euclidean distance for each voxel between corresponding CBCT and sCT contours.
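The symmetric MSD can be sketched as a nearest-neighbour reduction over surface points (a brute-force NumPy version on point sets; production implementations operate on contour surfaces extracted from masks):

```python
import numpy as np

def mean_surface_distance(surf_a, surf_b):
    # Symmetric MSD between two (n, 3) surface point sets: average the
    # nearest-neighbour distances from A to B and from B to A.
    d = np.linalg.norm(surf_a[:, None, :] - surf_b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```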

Results

Baseline results
U-Net, GAN, and CycleGAN baseline results are shown in figures 3(a)-(c), respectively. Note the overly smooth and unrealistic appearance of the U-Net outputs, and the loss of anatomical consistency about the seminal vesicles and rectum. The GAN produces a more realistic texture, but was unable to visualize the separation between bladder and seminal vesicles. Furthermore, a false high-density structure was present at the center of the rectum. CycleGAN provided the most accurate anatomical representation of the baseline models, with discernable separation between all structures. However, CycleGAN preserved many high-frequency streaking artefacts throughout the image, most evident about the outer fatty region in figure 3(c).
The mean HU and STD values within ROIs (shown in red squares in figure 3) corresponding to muscle, bladder, fatty tissue, and femoral heads are summarized in table 1. All baseline methods improved the mean HU accuracy over the CBCT, while U-Net consistently demonstrated the lowest STD, corresponding to the overly smooth texture.
Table 2 summarizes image quality metrics relating to MAE, ME, MSSIM, PSNR, and FID. For each metric, all models showed statistically significant improvement over the original CBCT (p < 0.0125). The U-Net model demonstrated the lowest MAE of 22.98 ± 1.49 HU, followed by GAN at 30.65 ± 1.62 HU, and CycleGAN at 31.80 ± 1.46 HU. A similar trend was observed for MSSIM and PSNR, with U-Net scoring the highest followed by the GAN and CycleGAN models. However, out of the baseline models, the FID for CycleGAN was superior to both the U-Net and GAN approaches, with a reduction from 144.32 ± 20.91 for the CBCT to 63.42 ± 15.45. The GAN approach was slightly higher at 70.16 ± 11.65, followed by U-Net at 77.95 ± 9.06.

CycleGAN improvements
The ablation study summarized in figure 2 shows the MAE and FID of different baseline CycleGAN models modified by a single change, along with CycleGAN-Best, the final optimal model. CycleGAN-Best resulted in the lowest MAE and FID, demonstrating the additive synergistic effect of combining the various changes. Figure 3(f) shows the sCT generated from CycleGAN-Best. Of note is the reduction in streaking artefacts across the image, the improved contrast between seminal vesicles and the bladder, and the overall textural similarity to the CT, as compared to the baseline CycleGAN implementation. Table 1 shows the model demonstrated good agreement in mean HU accuracy for muscle, fatty tissue, and bladder ROIs. The model also showed the closest agreement in STD to the CT for each tissue type, suggesting the texture and noise levels were most alike.
The classical image quality metrics in table 2 show CycleGAN-Best is second to U-Net in terms of MAE, MSSIM, and PSNR.However, the FID for CycleGAN-Best shows a drastic improvement of 42.05% over U-Net, 35.60% improvement over the GAN, and 29.45% improvement over the baseline CycleGAN.
Figure 4 exemplifies typical failure cases for baseline CycleGAN which may impede subsequent ART applications. In image a-2, baseline CycleGAN was unable to correct the severe streaking across the image, resulting in an erroneous high-density artefact about the center. Images a-2 and b-2 demonstrate how baseline CycleGAN can erode anatomic information in the sCT due to severe non-uniform shading artefacts in the CBCT. Lastly, image c-2 shows the preservation of shading artefacts, as baseline CycleGAN is unable to disentangle the underlying anatomy from artefacts. It should be noted that CycleGAN-Best was able to overcome the typical failure cases demonstrated above. Furthermore, it was able to preserve delicate soft tissue information about the anus in image set b-3 and the bladder in image set c-3. These structures are hardly perceptible to the human observer in CBCT images, yet are retained and enhanced by CycleGAN-Best, demonstrating the superior anatomical awareness of the model.
The volumetric heatmap in figure 5 shows the error distribution in both transverse and isometric views. CycleGAN-Best removed gross shading HU errors distributed about the center of the image volume and reduced the image noise. Most errors stemmed from image mis-registration between high gradient information, as can be seen around the bladder, rectum, femur, and patient body contour.
The cumulative histogram for all test patients for CBCT, CT, and CycleGAN-Best is shown in figure 6. The overlap between curves improved from 58.80% for the CBCT to 93.70% for CycleGAN-Best, while baseline CycleGAN (not shown) achieved 88.31% overlap.

Blind observer scores
Blind observer scores for each imaging modality and synthesis method are summarized in table 3. Furthermore, the Pearson correlation between observer scores and the corresponding FID for each image volume is included. CycleGAN-Best scored highest among the synthetic datasets for both RO and ROMP participants, although ROMPs in general scored all image sets higher than ROs. Disagreement between ROMPs and ROs was observed for the U-Net and GAN datasets, where ROMPs scored them higher than CBCT images while ROs scored them lower. Given the anatomical inconsistency observed in U-Net and GAN images, this outcome was expected, as ROs are extensively trained to evaluate subtle anatomical details, while ROMPs may overlook such subtleties and rely more on overall image quality. For the same reason, CycleGAN and CycleGAN-Best were both scored lower by ROs compared to ROMPs; however, neither dataset was scored lower than CBCT by any observer. FID image quality scores correlated best with ROMP assessments, and to a lesser extent with RO scores, with an overall correlation of −0.83 for all observers. By comparison, MAE had a correlation coefficient of −0.64 for all observers.

Epistemic and aleatoric uncertainty
Spatial uncertainty maps produced using equations (1) and (4) for aleatoric and epistemic uncertainty, respectively, are depicted in figure 7. The baseline CycleGAN model was investigated as it contained obvious anatomical errors. Red arrows indicate synthesis artefacts which correlate spatially with higher variability in aleatoric and epistemic uncertainties. However, not all regions of higher uncertainty manifest as an artefact in the sCT, but they do suggest regions where the model may underperform. High gradient regions are preferentially captured by both uncertainty estimation methods. However, aleatoric uncertainty seemed to better correspond with missing anatomy, while epistemic uncertainty agreed better with incorrect HU assignment, for example the bladder in images a-3 and c-3 in figure 7. This suggests that aleatoric uncertainty can capture synthesis artefacts due to severe CBCT degradation (i.e. uncertainty in data), while epistemic uncertainty captures errors due to incorrect model prediction of tissue density (i.e. uncertainty in model parameters), which corresponds well with the definitions of each uncertainty type. By plotting the magnitude of the uncertainties against the absolute HU error for each pixel, their correlation can be visualized. Figure 8 shows this relationship accumulated for all test images. A positive linear trend was observed for both uncertainty methods until around 70 HU, after which HU errors due to anatomical misalignments predominate over synthesis errors and the relationship breaks down. Since the vast majority of soft tissue errors resided in the linear region (0-70 HU), it could be concluded that both uncertainty estimation methods correlate with errors in image synthesis when compared against the ground truth CT.
The Shapiro-Wilk test for normality was computed for the aleatoric uncertainty of a single test patient to examine the shape of the predictive distribution. Specifically, this was performed using the stacked predictions after TTA. The percentage of voxels with p > 0.05 for different tissue types was 85.7% (fat), 84.9% (muscle), and 78.6% (bone). Hence, the vast majority of points demonstrated a Gaussian distribution, which is a commonly assumed regression target distribution in neural network modelling for aleatoric uncertainty (Seitzer et al 2022). Voxels with p < 0.05 typically had a normal distribution with one or more extrema points skewing the distribution tail.
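The per-voxel normality analysis can be sketched as follows (assuming SciPy's `scipy.stats.shapiro`; the stack shape, the 0.05 threshold, and the function name mirror the description above but are our own illustration):

```python
import numpy as np
from scipy.stats import shapiro

def normality_fraction(stack, alpha=0.05):
    # Fraction of voxels whose N stacked TTA samples pass the Shapiro-Wilk
    # normality test (p > alpha), mirroring the per-tissue analysis above.
    flat = stack.reshape(stack.shape[0], -1)
    pvals = np.array([shapiro(flat[:, i])[1] for i in range(flat.shape[1])])
    return float((pvals > alpha).mean())

rng = np.random.default_rng(0)
frac = normality_fraction(rng.normal(size=(20, 4, 4)))  # Gaussian samples
```

For truly Gaussian per-voxel samples, roughly 95% of voxels are expected to pass at the 0.05 level.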

TTA and MCD postprocessing
Various aggregation methods for TTA and MCD are summarized in table 4 for all test patients using CycleGAN-Best. Using TTA postprocessing, the mean and median values both demonstrated superior MAE compared to the other aggregation methods, with improvements over the original sCT of 6.66% and 6.50%, respectively. For FID, the 75th quantile and median methods produced the best image quality, with improvements of 7.20% and 5.90%, respectively. MCD postprocessing had little to no impact on MAE and FID for all aggregation methods. The application of TTA and MCD to create N = 20 image stacks took 79.44 and 74.08 s, respectively. The aggregation of image stacks ranged from 12.65 to 40.38 s depending on the aggregation method.

Dosimetric analysis
CycleGAN-Best improved over the CBCT for all gamma criteria and structure Dmean values, as summarized in table 5. Although minor anatomical shifts between CT and sCT/CBCT persisted even after deformation, and mismatched air pockets were identified in the rectum, the sCT achieved an average gamma pass rate of 98.66 ± 0.54% for the most stringent 1%/1 mm criterion, with no cases falling below the 95% threshold for clinical acceptability. When comparing the mean dose per structure, the CBCT produced errors greater than double those of the sCT for the bladder and prostate. The CTV showed the largest deviation in dose, with the CBCT showing a mean error of 1.27 ± 0.22 Gy, which the sCT reduced to 0.27 ± 0.11 Gy.

Anatomical agreement
Figure 9 shows the MSD for the rectum, bladder, and bony structures. Both mean and median values for the GAN and U-Net were larger than for CycleGAN and CycleGAN-Best, while the latter two models showed similar performance. For the bladder and rectum, CycleGAN showed slightly lower mean MSDs of 0.31 ± 0.13 mm and 0.55 ± 0.19 mm compared to 0.36 ± 0.17 mm and 0.62 ± 0.11 mm for CycleGAN-Best, respectively. Meanwhile, the GAN showed much greater mean MSDs of 2.90 ± 2.49 mm and 1.55 ± 1.77 mm for the bladder and rectum, respectively, and U-Net showed the worst mean MSDs of 3.98 ± 4.78 mm and 2.87 ± 2.24 mm. Bony structures showed much less variation between the models, with mean MSDs from smallest to largest of 0.19 ± 0.06 mm (CycleGAN-Best), 0.19 ± 0.06 mm (CycleGAN), 0.25 ± 0.17 mm (U-Net), and 0.28 ± 0.23 mm (GAN).

Discussion
This study set out to compare baseline methods for CBCT based sCT generation; identify weaknesses in each approach; make appropriate model modifications to rectify the weaknesses; introduce a more relevant image quality metric; and devise a method for risk identification.

Baseline models
In brief, U-Net based synthesis relied too strongly on training supervision, which led to unfaithful texture reproduction and loss of anatomical integrity (due to imperfect ground truth/training data correspondence).
The GAN included an adversarial loss to improve image realism but required supervision to maintain CBCT anatomy, resulting in similar issues to U-Net. The unsupervised CycleGAN model showed the best anatomical integrity but preserved CBCT artefacts such as streaking and non-uniformity, which manifested as erosion of real tissue or addition of tissue-like structures, as shown in figure 4.

CycleGAN improvements
To solve baseline CycleGAN issues, the improved model was trained with several novel loss functions informed by the ablation experiment summarized in figure 2. The combination of the proposed model changes resulted in better MAE and FID than any single change, suggesting a non-trivial additive relationship. For details on all model changes, please refer to supplementary materials section 1. Firstly, the model was trained in a supervised manner with the Conditional L2 Log loss function, which was designed to soften the impact of sCT/CT anatomical misalignments due to (i) air pocket and bony/fiducial marker mismatch; and (ii) soft tissue mismatch. For (i), the function conditionally operated within the range of −465 to 1950 HU to omit air and high density structures from its calculation of error. For (ii), supplementary materials figure 1(A) shows the response of the L1, L2, and L2 Log base-1000 error functions. We note that as the normalized error approaches ∼0.01, which corresponds to a HU error of ∼30, the error response of the L2 Log base-1000 function decreases at an increasing rate. By comparison, the L1 function response remains linear irrespective of the error, while the L2 loss response decreases at a decreasing rate. Since the parameter update rule used with backpropagation considers the partial derivative of the error with respect to a given weight, the L2 Log loss lends itself to penalizing smaller errors more than larger errors compared to the L1 and L2 loss functions. This reduces the training impact of poor registration of soft tissue between CBCT and CT data and ensures the model pays greater attention to regions of anatomical agreement. Visual inspection showed that this form of model supervision was crucial in maintaining anatomical integrity and avoiding the synthesis errors depicted in figure 4. Please refer to supplementary materials section 2 for more details.
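One plausible reading of the conditional masking and log-compressed error response can be sketched as follows. This is a hypothetical NumPy illustration, not the paper's implementation: the exact formulation and normalization are given in the supplementary materials, and the function name and normalization constant here are assumptions.

```python
import numpy as np

def conditional_l2_log_loss(sct, ct, lo=-465.0, hi=1950.0):
    """Hypothetical sketch of a Conditional L2 Log-style loss.

    Voxels whose reference CT value lies outside [lo, hi] HU (air,
    bone/fiducial markers) are excluded from the error calculation.
    Remaining errors are normalized and passed through a squared
    log-base-1000 response, which down-weights large (likely
    misalignment-driven) errors relative to a plain L2 loss.
    """
    mask = (ct >= lo) & (ct <= hi)
    # illustrative normalization over the conditional HU window
    err = np.abs(sct[mask] - ct[mask]) / (hi - lo)
    return float(np.mean((np.log(1.0 + err) / np.log(1000.0)) ** 2))

# A 3000 HU error at an out-of-range voxel (e.g. a fiducial) is ignored
ct = np.array([0.0, 3000.0])
sct = np.array([0.0, 0.0])
```

The log compression means a mismatch of hundreds of HU (typically registration error) contributes far less gradient than the same total error spread over many small, well-aligned voxels.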
To further reduce the negative impact of data misalignment between CT/CBCT data, the MSSIM loss was applied to CBCT/cycle-CBCT and CT/cycle-CT images. Hence, the two generators could still benefit from being trained with the MSSIM loss without the introduction of error due to CT/CBCT misalignment, since cycle images are perfectly aligned with their input. When compared to MSSIM trained on synthetic pairs (sCT/CT and sCBCT/CBCT) in figure 2, a noticeable increase in MAE is observed for the synthetic pairs at roughly the same FID, indicating the greater anatomical preservation attained by training on cycle images.
The style loss (Gatys et al 2016), which yielded a negligible improvement in MAE over the baseline CycleGAN, resulted in an appreciable improvement in FID (figure 2). Even though the InceptionV3 network was pretrained on ImageNet, which consists of non-medical images, its deep feature activations were still capable of identifying and eliminating non-representative textural information in sCT images. This resulted in a perceptually more realistic sCT, which is hypothesized to be a crucial characteristic for downstream autocontouring applications since segmentation models are trained on real CT images.
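The style loss of Gatys et al (2016) compares Gram matrices of deep feature activations, capturing texture statistics independently of spatial layout. A minimal sketch, assuming feature maps have already been extracted from a pretrained network; the layer choice and loss weighting used in the paper are not reproduced:

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Gram matrix of a (C, H, W) feature map: channel co-activations."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return (f @ f.T) / (c * h * w)

def style_loss(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Squared Frobenius distance between Gram matrices (Gatys et al 2016)."""
    return float(np.sum((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2))

f = np.ones((2, 3, 3))  # toy feature maps
```

Because the Gram matrix discards spatial position, the loss penalizes texture mismatch without requiring voxel-wise alignment, which is why it can be applied despite imperfect CT/CBCT registration.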
Lastly, the ResViT (Dalmaz et al 2022) architecture reduced both MAE and FID compared to the baseline CycleGAN, which utilized a fully convolutional U-Net generator. The improvement was attributed to the improved long-range modelling capacity of the vision transformer sub-network, which operated at the model bottleneck. Convolution layers in the encoder pathway excelled at capturing local textural information and gross spatial information, while the vision transformer effectively isolated critical structural information, which was then fed into the decoder for synthesis of accurate fine details. Traditional U-Net generators include skip connections to maintain structural coherence; the lack of these connections in ResViT attests to the strength of vision transformer units in capturing long-range data dependencies.

Evaluation metrics
According to a recent review of CBCT based sCT generation, the three highest relative percent improvements in MAE over the original CBCT for the pelvic region were 85.2%, 71.4%, and 70.8% (Harms et al 2019, Dong et al 2021, Rusanov et al 2022, Wu et al 2022). In comparison, this study achieved an improvement of 60.7%. As highlighted in this study, however, MAE, and alignment-dependent metrics in general, are poor indicators of perceptual image quality and anatomical consistency. It was shown that U-Net, which clearly demonstrated poor anatomical consistency and unrealistic image quality, had the best MAE, PSNR, and SSIM. This observation agrees with previous research, which has shown that supervised (U-Net) models produce lower MAE but do not necessarily preserve anatomy, whereas unsupervised models have higher MAE but maintain the underlying anatomy (Rossi and Cerveri 2021). By contrast, FID agreed with RO and ROMP perceptual image quality scoring to a high degree, with a correlation coefficient of −0.83 across all observers, compared to −0.64 for MAE. Therefore, low FID and MAE together are a good litmus test for ensuring the sCT is realistic and simultaneously maintains accurate patient anatomy. The results of this study therefore suggest that model evaluation be performed using both FID and MAE, with comparisons always made relative to a baseline to assess the benefit or detriment of a given model change. It should generally follow that, of two sCTs with the same FID, the image with lower MAE will have better anatomical consistency, since image quality features (contrast, noise, texture, HU accuracy) are controlled for, with only gross anatomy differences contributing to the MAE difference.
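For reference, FID is the Fréchet distance between two Gaussians fitted to deep (InceptionV3) feature embeddings of the real and synthetic image sets. A minimal NumPy sketch of the closed-form distance, assuming the feature means and covariances have already been computed; the symmetrized square-root form is an implementation choice here, not necessarily the one used in the paper:

```python
import numpy as np

def _sqrtm_psd(a: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(mu1, sigma1, mu2, sigma2) -> float:
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^0.5 S2 S1^0.5)^0.5).

    Using the symmetrized product avoids taking the square root of the
    (generally non-symmetric) matrix S1 @ S2.
    """
    s1h = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1h @ sigma2 @ s1h)
    diff = np.asarray(mu1) - np.asarray(mu2)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

mu, eye = np.zeros(2), np.eye(2)
```

Because the statistics are pooled over whole image sets, FID is insensitive to voxel-wise misalignment, unlike MAE, PSNR, and SSIM.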
However, robust quantification of anatomical consistency is only possible through segmentation analysis. The MSD analysis demonstrated what has previously been shown by authors comparing supervised and unsupervised training strategies and their effects on anatomical preservation: supervised training reduces anatomical consistency (Rossi and Cerveri 2021, Hond et al 2023). The rectum and bladder were both poorly synthesized on U-Net and GAN images (maximum MSD for any structure of 12.14 mm and 7.16 mm, respectively), while CycleGAN maintained sub-millimeter MSD for all structures analyzed (largest MSD for any structure of 0.77 mm). Similarly, CycleGAN-Best did not exceed 1 mm MSD for any structure (largest MSD of 0.73 mm). This further demonstrates the anatomy-preserving properties of the proposed supervised model due to the Cycle-SSIM and Conditional L2 Log loss functions.
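A symmetric mean surface distance can be sketched directly from two surface point clouds. The brute-force nearest-neighbour version below is a minimal illustration (real implementations use KD-trees or distance transforms on the segmentation masks); the coordinates are hypothetical:

```python
import numpy as np

def mean_surface_distance(surf_a: np.ndarray, surf_b: np.ndarray) -> float:
    """Symmetric mean surface distance between two (N, 3) and (M, 3)
    arrays of surface coordinates in mm, via brute-force nearest
    neighbours (fine for a sketch, O(N*M) in memory)."""
    d = np.linalg.norm(surf_a[:, None, :] - surf_b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = a + np.array([0.0, 0.0, 2.0])  # same surface shifted 2 mm in z
```

Averaging both directions keeps the metric symmetric, so neither contour is privileged as the reference.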
Dosimetrically, the proposed sCT achieved a higher 1%/1 mm gamma pass rate (98.66 ± 0.54%) than a recently published transformer-based synthesis method (97.5 ± 1.1%) (Chen et al 2022). Due to imperfect DIR, a body contour mismatch of up to 3 mm was present between the pCT and sCT/CBCT data, which contributed additional 1%/1 mm fail points in the outer dose build-up region. Omitting these regions from our analysis increased the 1%/1 mm pass rate to 99.16 ± 0.35%; however, this result is not reported in order to maintain consistency with the literature. Other studies reporting gamma passing rates for the 2%/2 mm criterion achieved 97% (Sun et al 2021) and 98.5 ± 1.7% (Eckl et al 2020), compared to 99.94 ± 0.08% with the proposed method. The high dosimetric agreement is attributed to the excellent HU tissue accuracy, as demonstrated in table 1, and the use of DIR for greater anatomical agreement between pCT and CBCT/sCT.
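The gamma criterion combines a dose-difference and a distance-to-agreement tolerance. The sketch below is a simplified 1-D global gamma analysis for illustration only; clinical gamma analysis is performed in 3D with dose interpolation and search-radius optimizations, and the profile data here are hypothetical.

```python
import numpy as np

def gamma_pass_rate(ref_dose, eval_dose, x, dose_crit=0.01, dist_crit=1.0):
    """Simplified 1-D global gamma analysis (e.g. 1%/1 mm).

    ref_dose, eval_dose: dose profiles sampled on the same grid x (mm).
    dose_crit is a fraction of the reference maximum (global criterion).
    Returns the percentage of reference points with gamma <= 1.
    """
    dd = dose_crit * ref_dose.max()
    gammas = []
    for xi, dr in zip(x, ref_dose):
        # minimize combined dose/distance term over all evaluated points
        terms = ((x - xi) / dist_crit) ** 2 + ((eval_dose - dr) / dd) ** 2
        gammas.append(np.sqrt(terms.min()))
    return 100.0 * float(np.mean(np.array(gammas) <= 1.0))

x = np.linspace(0.0, 10.0, 11)
profile = np.exp(-((x - 5.0) ** 2) / 8.0)  # toy Gaussian dose profile
```

An identical pair of profiles passes everywhere; a grossly rescaled one fails wherever neither a nearby point nor a small dose difference can rescue agreement.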
The use of FID in the medical domain is scarce, with its application typically found in purely generative tasks used to address data limitation problems (Saad et al 2021, Segal et al 2023). In this study, FID was shown to be a superior metric for image quality, as demonstrated by its high agreement with observer scores. Its main strength is its invariance to data alignment, thereby removing a large source of uncertainty typical when reporting classical image quality metrics such as SSIM, MAE, and PSNR. It is hypothesized that sCT images with lower FID, and hence feature representations more similar to those of real CT images, will perform better in downstream autosegmentation tasks critical for an adaptive radiotherapy pipeline. Optimizing for FID is therefore advantageous if sCT images are to be used with commercial autocontouring solutions. These are typically trained on a variety of CT scanner vendors and acquisition parameters, but not on sCT images, thereby limiting their generalizability if sCT FID is too high.

Uncertainty estimation and postprocessing
Two inference-based methods for uncertainty estimation were presented with accompanying spatial maps. The major advantage of using TTA and MCD to model aleatoric and epistemic uncertainty, respectively, is that they can be applied retrospectively to any pretrained model. Even models trained without dropout can be modified with dropout layers prior to inference to produce the related maps. Both types of uncertainty correlate positively with either local regions of synthesis failure or gross regions of poor HU reproduction, as indicated in figures 7 and 8. One other study has investigated uncertainty maps generated with MCD and a loss-based aleatoric uncertainty for sCT from MRI (Hemsley et al 2020). The authors also noted a positive correlation between uncertainty and synthesis error, as well as an over-response at high gradient interfaces. Random error in regression measurements typically follows a normal distribution. Normality was observed for the majority of uncertainty voxel estimates generated by TTA for aleatoric uncertainty, suggesting that TTA is well calibrated to model aleatoric uncertainty.
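The retrofitting idea can be sketched without any deep learning framework: keep the trained weights fixed, apply random (inverted) dropout masks during repeated forward passes, and read the per-output spread as epistemic uncertainty. The tiny linear "generator" below is a hypothetical stand-in for a full synthesis network; names and shapes are illustrative only.

```python
import numpy as np

def mc_dropout_std(forward, x, n=20, rng=None):
    """Epistemic uncertainty via Monte-Carlo Dropout, sketched in NumPy.

    forward(x, rng) must apply dropout stochastically using rng. The
    mean over n stochastic passes is the prediction; the per-output
    standard deviation is the epistemic uncertainty estimate.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    preds = np.stack([forward(x, rng) for _ in range(n)])
    return preds.mean(axis=0), preds.std(axis=0)

# Hypothetical pretrained weights with dropout retrofitted at inference
W = np.array([[1.0, -0.5], [0.3, 0.8]])

def forward(x, rng, p=0.2):
    h = W @ x
    mask = rng.random(h.shape) >= p   # Bernoulli keep-mask
    return (h * mask) / (1.0 - p)     # inverted-dropout rescaling

mean_pred, epistemic = mc_dropout_std(forward, np.array([1.0, 2.0]))
```

In a real model the dropout layers would typically be inserted (or simply left active) before the bottleneck, and the same stacking/aggregation machinery used for TTA applies unchanged.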
The establishment of uncertainty estimation methods, such as those presented here, will play a crucial role in the clinical validation of high-risk AI tasks such as sCT generation, and will be an important next step for clinical deployment of sCT models. For example, uncertainty maps can be used to automatically flag poor outputs based on user-defined criteria such as maximum, mean, or cluster-based statistics to spatially locate poor synthesis regions. This can form part of a pre-treatment ART QA protocol to speed up the localization of potentially erroneous sCT outputs such as false or missing anatomy. Uncertainty maps may also be used as quality assurance tools to assess changes in updated sCT models by comparing updated model uncertainty to baseline uncertainty levels. If a newer model produces sCT of higher uncertainty for the same input CBCT, this may indicate the model should not be used clinically. Similarly, if the image quality of the CBCT changes due to changes in acquisition parameters such as kV, mA, or ms (either intentionally or due to deliverability), the output sCT may be negatively impacted. Therefore, analysis of uncertainty maps can help inform clinicians whether undesirable acquisition parameters have been inadvertently chosen. Regarding model development, incorporating uncertainty maps as dual-channel inputs or as a novel loss function during model training may boost convergence speed and overall model accuracy, since regions requiring greater attention are innately localized by the TTA and MCD methods.
These applications must be weighed against potential pitfalls such as requirements for additional computational power, additional time required to create the uncertainty maps, and instances where false negative or false positive regions occur.Future work needs to address these issues and further validate the use of uncertainty maps in other sites and models.
The possibility of using TTA and MCD as post-processing tools to further improve sCT image quality was also explored in the present study. A small boost to both MAE and FID was observed using TTA with median aggregation. In a sense, the aggregation methods are a way of ensembling various model predictions. Given that a Gaussian distribution was found for most voxels after TTA, median aggregation was well suited to minimize the impact of noisy predictions and select the 'safest' voxel values, in turn minimizing model uncertainty.

Shortcomings
Several shortcomings were identified in the proposed sCT generation approach. Firstly, high VRAM consumption (22.74 GB) was noted when training the ResViT generators. This is due to the high parameter count of vision transformers, even with the weight sharing strategy applied in ResViT. During inference, 6.6 GB of VRAM was required to initialize the single generator responsible for sCT generation. Consequently, the model must be run either in the cloud or on a local machine with a modern, high-VRAM GPU. This limitation feeds into the second concern, which is that the model does not operate on 3D images. This limits the capacity of the CNN and transformer subnetworks to model volumetric long-range dependencies and artefacts. With further developments in computational power and model efficiency, these issues are expected to dissipate.
Considering the blind observer assessment, despite standardized scoring criteria being administered, some observers stated that greater emphasis was given to structural resolution rather than perceptual image quality. Therefore, some appreciable heterogeneity in scoring approach was present.
Two more general shortcomings relate to the input data itself: information which is not present in the CBCT cannot be added to the sCT. One example is the lack of fine anatomical detail in CBCT, and therefore in sCT. One RO commented on their 'inability to define the inferior prostatic sheath', a fine tissue surrounding the prostate gland that is used to accurately contour the prostate (Salembier et al 2018). Since this tissue is not visible in CBCT images, it will not be present in any sCT. This was a major reason why blind observer scores for CycleGAN-Best were lower for ROs than for ROMPs. Secondly, sCT based on conventional CBCT images cannot resolve the issue of severe motion artefacts. While not a major issue in the pelvic site, abdominal and lung scans can exhibit severe tissue ghosting artefacts. The solution is to utilize 4D-CBCT for image synthesis, which is inherently of poorer quality than regular CBCT.

Conclusion
The present study has demonstrated the weaknesses of baseline supervised (U-Net, GAN) and unsupervised (CycleGAN) training schemes for sCT generation. The former may drastically distort anatomical consistency, while the latter fails to correct severe CBCT artefacts which may compromise anatomical integrity. A supervised CycleGAN trained with novel anatomy-preserving loss functions was introduced to maintain anatomical integrity while correcting severe CBCT artefacts. Pixelwise evaluation metrics such as MAE were shown to falsely indicate that sCT with worse anatomical correctness and image quality performed better than realistic and anatomically correct sCT images. A blind observer study validated FID as a more accurate measure of perceptual image quality, with implications for training and reporting of model performance. Spatial uncertainty maps from two uncertainty estimation methods (TTA and MCD) were shown to correlate with regions of poor synthesis, opening the door to model risk assessment strategies and new quality assurance methods. In addition, post-processing using TTA, akin to model ensembling, was shown to further boost the MAE and FID of the proposed model. These findings contribute valuable insights toward building anatomically consistent models with accurate dose calculation and (hypothesized) auto-segmentation properties for downstream online adaptive radiotherapy tasks.

Figure 1. Study flowchart describing the training cohort, preprocessing, investigated baseline models, uncertainty quantification, the ablation study, the selected model changes for our proposed model, post-processing using TTA, and results formation.

Figure 2. Comparison of mean absolute error and Fréchet inception distance for single changes made to the baseline CycleGAN, including loss function, training configuration, and architectural changes. The bar outlined in black corresponds to baseline results, while bars outlined in red reflect selected model changes. CycleGAN-Best is also displayed for comparison purposes.

Figure 6. Histogram of CBCT (dark blue), CycleGAN-Best (red), and CT (green) for all test images. Solid lines indicate the kernel density estimate of each distribution for better visualization.

Figure 8. Scatter plot of aleatoric and epistemic uncertainty versus the absolute error of each corresponding pixel in all test images. Dark red and blue correspond to data within the 95% confidence interval.

Figure 9. Mean surface distance for pelvic contours segmented on CBCT and compared against corresponding contours generated on the proposed model (CycleGAN-Best) and the three baseline models (CycleGAN, GAN, U-Net).
FC-Gen had a total of 6 299 910 trainable parameters; please refer to supplementary materials figure 2(A) for a schematic diagram of FC-Gen. The second generator was the hybrid CNN-ViT architecture ResViT (Dalmaz et al 2022), which was recently proposed for multi-modal image translation but modified for single-modality output in the present study. ResViT uses CNN blocks for precise encoder and decoder local feature modelling. The bottleneck utilizes aggregated residual transformer blocks (ARTB) with weight sharing to reduce model complexity. The aim of ARTB is to simultaneously leverage the long-range attention mechanism inherent to ViT with high-quality local features mapped by CNNs, fusing both information branches to distill relevant structural and contextual information.

Table 1. ROI mean and standard deviation values in Hounsfield units.

Table 2. Image quality metrics calculated on test set patients for the CBCT and various model outputs. Values with * did not show statistical significance (P < 0.0125) compared to CBCT.

Table 3. Blind observer image quality scores for ROMPs, ROs, and all participants. The Pearson correlation coefficient for FID and MAE was computed for each observer category.

Table 5. Gamma pass rates and absolute dose differences for rectum, bladder, and CTV structures. Results are computed for the CBCT and the sCT generated by CycleGAN-Best.