Neural blind deconvolution for deblurring and supersampling PSMA PET

Objective. To simultaneously deblur and supersample prostate specific membrane antigen (PSMA) positron emission tomography (PET) images using neural blind deconvolution. Approach. Blind deconvolution is a method of estimating the hypothetical ‘deblurred’ image along with the blur kernel (related to the point spread function) simultaneously. Traditional maximum a posteriori blind deconvolution methods require stringent assumptions and suffer from convergence to a trivial solution. A method of modelling the deblurred image and kernel with independent neural networks, called ‘neural blind deconvolution’ had demonstrated success for deblurring 2D natural images in 2020. In this work, we adapt neural blind deconvolution to deblur PSMA PET images while simultaneous supersampling to double the original resolution. We compare this methodology with several interpolation methods in terms of resultant blind image quality metrics and test the model’s ability to predict accurate kernels by re-running the model after applying artificial ‘pseudokernels’ to deblurred images. The methodology was tested on a retrospective set of 30 prostate patients as well as phantom images containing spherical lesions of various volumes. Main results. Neural blind deconvolution led to improvements in image quality over other interpolation methods in terms of blind image quality metrics, recovery coefficients, and visual assessment. Predicted kernels were similar between patients, and the model accurately predicted several artificially-applied pseudokernels. Localization of activity in phantom spheres was improved after deblurring, allowing small lesions to be more accurately defined. Significance. The intrinsically low spatial resolution of PSMA PET leads to partial volume effects (PVEs) which negatively impact uptake quantification in small regions. The proposed method can be used to mitigate this issue, and can be straightforwardly adapted for other imaging modalities.


Introduction
Prostate specific membrane antigen (PSMA) positron emission tomography (PET) is an imaging modality typically used to detect prostate cancer [1].PSMA ligands accumulate not only in prostate tissue, but also in the salivary glands, lacrimal glands, liver, spleen, kidneys, and colon [2][3][4][5].This leads to undesirable dose to these healthy regions [2][3][4].However, this typically unwanted uptake can be potentially utilized to investigate intra-salivary gland parenchyma, which is an area of active research in the context of radiotherapy treatment planning [6][7][8].
Quantifying uptake statistics in the relatively small size of salivary glands is difficult, due to partial volume effects (PVEs) associated with the intrinsically low resolution of PET imaging [9], which is typically over 5 mm [10].PVEs in PET images primarily correspond to spill-over between image voxels due to the positron range and pointspread function of the scanner [11].Patient motion and Poisson-distributed noise due to low photon counts further exacerbates blur in PET images.While official SNMMI guidelines recommend using the maximum standard uptake value (SU V max ) in single voxels for uptake metrics [12], maximum uptake values are affected by "spill-in" from neighbouring voxels and would be expected to differ within small volumes if PVEs could be mitigated.
There have been attempts to mitigate partial volume effects in PET images using analytical methods [11,[13][14][15][16], with most traditional approaches involving estimation of the point spread function.These methods are not applicable in the case of collaborative studies involving multiple scanners, or where the point spread function is unknown.
The ill-posed problem of simultaneously estimating the theoretical "deblurred" image, x, along with the spread-function or blur kernel, k, from the original image y (y = x * k), is referred to as blind deconvolution.Traditional maximum a posteriori (MAP)-based methods for solving blind deconvolution require estimation of prior distributions for the kernel and deblurred image and specialized optimization techniques to avoid convergence towards a trivial solution.MAP-based methods are governed by the equation Pr(y|k, x)Pr(x)Pr(k) (1) where Pr(y|k|x) is the fidelity term likelihood and Pr(x) and Pr(k) are the priors of the deblurred image and blur kernel, respectively.Many prior models have been suggested [17][18][19][20][21][22], but are generally hand-crafted and insufficient for accurate modelling of x and k [23].
In 2020, Ren et al. [23] developed "neural blind deconvolution" for estimating the deblurred image and blur kernel from 2D natural images using two neural networks which are optimized simultaneously, as opposed to traditional MAP-based methods which generally employ alternating optimization to avoid a trivial solution [23].This is a self-supervised deep learning method that does not involve a separate training set, but instead learns to predict deblurred images independently for each image.Neural blind deconvolution out-performed [23] other traditional methods based on prediction's peak signal-to-noise ratio, SSIM, and error ratio.
In this work, we build off of the development of 2D neural blind deconvolution [23,24], implementing changes to the network architecture and optimization procedures for suitability with 3D PSMA PET medical images.We also modify the architecture to accomodate simultaneous supersampling to enhance spatial resolution.

Methods
Retrospective Patient Data Set Full-body [18F]DCFPyL PSMA PET/CT images were de-identified for 30 previous prostate cancer patients (Mean Age 68, Age Range 45-81; mean weight: 90kg, weight range 52kg-128kg).Patients had been scanned, two hours following intravenous injection, from the thighs to the top of the skull on a GE Discovery MI (DMI) scanner.The mean and standard deviation of the injected dose was 310 ± 66 MBq (minimum: 182 MBq, maximum: 442 MBq).PET images were reconstructed using VPFXS (OSEM with time-of-flight and point spread function corrections) (number of iterations/subsets unknown, pixel spacing: 2.73/3.16mm,slice thickness: 2.8/3.02mm).The scan duration was 180 s per bed position.For appropriate comparison of voxel values between patients, patient images were all resampled to a slice thickness of 2.8 and pixel spacing of 2.73 using linear interpolation.Helical CT scans were acquired on the same scanner (kVP: 120, pixel spacing: 0.98mm, slice thickness: 3.75mm).
Images were cropped to within 6 slices below the bottom of the submandibular glands and 6 slices above the top of the parotid glands.Cropping was employed to avoid exceeding time constraints and the 6 GB memory of the NVIDIA GeForce GTX 1060 GPU used for training.Registered CT images were used for delineating parotid and submandibular glands for calculating uptake statistics.Limbus AI [25] was used for preliminary auto-segmentation of the glands on CT images, which were then manually refined by a single senior Radiation Oncologist, Jonn Wu.Contours were used to calculate uptake statistics within salivary glands.Images were normalized to the maximum voxel value prior to deblurring, and afterwards rescaled to standard uptake values normalized by lean body mass (SU V lbm ).Lean body mass was estimated from patient weight and height using the Hume formula [26].

Overview of Neural Blind Deconvolution
Neural blind deconvolution, as first demonstrated by Ren et al. [23], estimates a deblurred version, x, of an actual image, y, by training neural networks on a casespecific basis.Specifically, two neural networks, G x and G k are trained to predict x, and a blur kernel, k, whose convolution together yields a close estimate of the original image, y.The use of neural networks for image prediction was motivated by the work of Ulyanov et al. on deep image priors [27] which defines where θ x and θ k represent trainable model parameters of G x and G k , respectively.The following implicit constraints exist on network outputs which are satisfied automatically due to the networks' architectures.
Previously, [23] [24] such networks were trained using a loss function including a mean squared error (MSE) fidelity term along with regularization terms, R(x, k), where x, k = arg min The mathematical formulation of our problem differs slightly as we have adapted G x to yield an output image, x which is twice as large as y in each dimension to accomplish supersampling.Our x is then linearly downsampled to the original image size, x ↓ , before being convolved with the kernel.Furthermore, we incorporate a component of mean absolute error into our loss function's fidelity term.
x, k = arg min This can be written in terms of the model parameters θ x and θ k as which can then be optimized via backpropagation of loss gradients.

Model Architecture
Neural blind deconvolution was originally demonstrated on 2D images using a "U-Net" [28]-style, symmetric, multiscale, autoencoder network for G x and a shallow fullyconnected network for G k .We built off of this methodology while implementing some notable changes to optimize model performance on PSMA PET images.A schematic diagram of the network design is shown in Figure 1.

G x Architecture
All operations involving G x were adapted for use with 3D images.The network includes only 3 layers in the encoder-decoder series (as opposed to 5 in Ren et al.'s implementation) [23].A decrease in network layers is justified by the low resolution of PET images, which requires a smaller receptive field than needed for high-resolution images to localize edge features [29].An additional layer was added to the end of the decoder section, including an upsampling operation, and double convolution, to output a final x twice as large in each dimension as the original input.This design allows x to be simultaneously supersampled and deblurred within the blind deconvolution process.Rather than iteratively increasing channels towards deeper regions of the encoder and decoder, we found that reversing the design such that the channel count decreases towards inward regions of the network produced favourable results.For each double convolution, the first convolution decreases the image size via a stride of 2. We included a high number of skip connection channels (64), as it was found to improve prediction accuracy.
Furthermore, we scaled the final sigmoid layer used in previous versions of neural blind deconvolution [23,24], as a regular sigmoid constrains the x to have the same maximum value as y.This is not necessarily true, nor expected for the case of partial volume effect correction.We therefore scaled the final layer by 3/2.

G k Architecture
G k is a single layer perceptron, whose architecture Ren et al. [23] found to outperform the auto-encoder style design of G x for the case of predicting the kernel.Model input is a 1-dimensional vector of length 3375, which is fed through a fully connected layer with 16,875 nodes, to an output of length 3375 which is passed through a SoftMax operation Neural blind deconvolution for deblurring and supersampling PSMA PET 6 before being reshaped into a 15×15×15 3D kernel image.The softmax function ensures that kernel voxels sum to unity, and has the form:

Loss Function and Optimization
It was unnecessary to form distinct training and validation sets as needed in supervised learning algorithms.Neural blind deconvolution is a type of "zero shot" [30] Compute loss and back-propagate gradients 8.
Update G i x and G i k using the ADAM optimizer [31] x Compute loss and back-propagate gradients 20.
Update G i x and G i k using the ADAM optimizer [31] The main optimization procedure consisted of 5000 iterations broken into 3 stages, stage 1: 0 ≤ step < 300, stage 2: 300 ≤ step < 2000, stage 3: 2000 ≤ step < 3500, stage 4: step ≥ 3500.5000 iterations was chosen as it had been used in previous neural blind deonvolution methods, and furthermore because we found this number of iterations to be sufficient for the training loss to level out.We implemented a multi-scale pre-training procedure similar to the method recommended by Kotera et al. [24].This involves initially training G x and G k to predict x and k using 2× downsampled PSMA PET images, which are then upsampled by √ 2 and used as input for a second round of pre-training.The output is then upsampled back to the original size to serve as inputs for the regular training stage.We modify the procedure further such that after every upsampling during pre-training, the models, G x and G k are trained to predict their own inputs for 100 iterations, which was sufficient to allow training in the next scale to begin by predicting upsampled versions of the predictions in the previous layer.The size of k varies with the size of x, such that three kernel sizes are used, 7 × 7 × 7, 10 × 10 × 10, and the original size 15 × 15 × 15.For the downsampled pre-training of x and k, 1000 iterations were used in each stage.This number of iterations was chosen to allow the pre-training image to approximately converge to its final estimate.The initial input vector for G k was a Gaussian Kernel with a standard deviation of two voxels, and the initial input vector for G x was a pseudorandom-generated array, sampled from a uniform distribution.The loss function components for each stage are summarized in Table 1.The parameters of G x and G k were updated using the ADAM optimizer [31].
Table 2: Loss Function components used for optimization of G x (θ x ) and G k (θ k ).To guide training during the first 300 optimization steps, the Structural Similarity Index Measure (SSIM) is minimized between the deblurred image x ↓ and the registered CT's texture map.Conditions are invoked on the kernel estimate until the final 1500 steps of optimization.TV loss is introduced after the first 2000 steps, to avoid convergence towards a trivial solution.An ℓ 2 norm loss term is applied to kernel, but only to voxels with values exceeding 0.7.
It was previously reported by [24] that adopting a total variation (TV) loss term biases outputs towards a trivial solution during the initial stages of training, but improves image quality as predictions near the final, correct solution.We experienced similar findings, and therefore implemented a TV loss term only after the first 2000 iterations.After the first 3500 iterations, the fidelity term of the loss function transitions from a combination of MSE and MAE to the structural similarity index metric, which was also found to boost performance during final stages [24].
During the first 300 iterations of training, we employ registered computed tomography (CT) images to provide anatomical guidance.This is implemented by including a SSIM loss term involving x and a CT texture map of the grey level run length matrix (GLRLM) with a long run emphasis, computed with pyradiomics [32].We opted to use the texture map as opposed to the regular CT image as CT texture maps have been demonstrated to have a closer correlation with PSMA PET uptake [33] than standard images.CT guidance is used to steer outputs towards the correct solution during the the first 300 steps of training before being dropped, to prevent competition with the fidelity term as outputs move closer towards final estimates.A previous study using traditional MAP-based blind deconvolution of PET images demonstrated the utility of magnetic resonance images for guiding optimization [34][35][36].
A cumbersome challenge in performing a blind deconvolution is avoiding a trivial solution, x * k = x * δ = y [37][38][39].It is therefore necessary to impose regularization on kernel predictions.Previously, [23,24] applied an ℓ 2 norm constraint to the kernel to avoid convergence to a delta function.We also applied an ℓ 2 norm term; however, we only penalized kernel values larger than 0.7, to avoid a trivial solution while not penalizing small-to mid-range values in the kernel.

Patient Analysis
Ground truth deblurred patient images were not available, and we therefore adopted a variety of strategies and metrics for evaluating the quality of deblurred patient images and the model's ability to predict accurate blur kernels.
Two quantitative blind image quality metrics were compared between x and y.First, the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) score [40] was used, which varies between 0 and 100, indicating the lowest and highest possible image quality, respectively.BRISQUE takes an input image and computes a set of features using the distribution of intensity and relationships between neighbouring voxels.It then predicts its image quality from its predicted deviation from a natural, undistorted image.Second, images were compared using the Contrastive Language-Image Pre-training (CLIP) metric [41] which varies between 0 and 1. CLIP is a multimodal (language and vision) neural network trained with millions of image/text pairs.The model can be used for predicting both image quality and image caption quality.The clip network predicts image quality by ranking the applied captions "good photo" and "bad photo".CLIP was compared with numerous other quality metrics and was found to perform second to only BRISQUE.
The theoretical blur kernel of PET images has a component resulting from the point spread function of the scanner, as well as other limitations and patient motion.Since all PET images were acquired on the same scanner, with the same image collection protocol, it is expected that predicted blur kernels should have high inter-patient similarity, Kernels were compared between patients quantitatively by taking the inner product of normalized kernel images.To validate the ability of neural blind deconvolution to accurately recover blur kernels, rather than arbitrary or semi-random shapes imposed by the loss function/optimization, we generated four types of "pseudokernels" and convolved them with previously deblurred images, which were then fed back into the blind deconvolution model.Predicted kernels were then compared with pseudokernels visually and by taking inner products.Generated pseudokernel shapes included a regular Gaussian, and 3 Gaussians skewed in the x, y, and z direction.

Phantom Analysis
The deblurring methodology was further tested using PET images acquired using the Quantitative PET Prostate Phantom (Q3P) [42].The phantom had been developed for a previous study [43] and included a set of "shell-less" spheres (3-16 mm diameters) of 22 NaCl infused epoxy resin.The positron energy via 22 Na decay is similar to that from 18 F (220.3 keV and 252 keV respectively) [44,45], making 22 Na suitable for approximating the decay seen in [18F]DCFPyL PET.The phantom contains eight spheres with diameters of 16 mm, 14 mm, 10 mm, 8 mm, 7 mm, 6 mm, 5 mm, and 3 mm arranged evenly in a coaxial ring and 50 mm from the radial centre of the phantom.The spheres were prepared with a 22 Na concentration of 57.6 kBq/mL to represent lesion concentrations from an analysis of patients imaged with [18F]DCFPyL PET [43].By the time of acquisition, activity concentrations had decayed to approximately 31 kBq/mL.To establish realistic background concentration, the phantom was filled with a background concentration of 2 kBq/mL.To account for Poisson noise, the phantom was imaged for five noise realizations (2.5 min scan time) using a 5-ring Discovery MI PET/CT scanner.Phantom images were reconstructed using VPFXS (8 subsets, 4 iterations, 6.4 mm filter) using the clinical scanner console.The phantom specifications and image acquisition process are described in further detail by Fedrigo et al. [42] Phantom images were deblurred using the same neural blind deconvolution methodology applied to patient images.However, registered CT images of phantoms did not exist and therefore texture features could not be included during early stages of optimization.As regions of high uptake within the phantom were all spherically symmetric, there was inherent ambiguity in the extent of blurring.This is because multiple combinations of blur kernels and "deblurred" images can be convolved to produce highly similar observed images.This ambiguity is broken when shapes of arbitrary differing shapes are included in images, as in the case of patient acquisitions.To mitigate this challenge, we modified the final layer of G x to produce maximum values of 1.1 × the maximum voxel in the original image, rather than 1.5 ×.This did not present physical challenges, as the maximum voxel value was approximately equal to the known activity concentration in the largest sphere.
Phantom spheres were contoured via two methods.First, a single fixed threshold that minimized the total difference between contoured and actual sphere volumes was determined separately for original and deblurred images.Second, variable thresholds were determined for each sphere size individually.Thresholds were expressed as a fraction of the maximum voxel value within a given region, and values between 0.2 and 0.8 (0.01 step size) were tested.Regions were defined by all voxels connected to the maximum voxel within a given sphere and greater than the given threshold.Actual volumes are referred to as V 0 .The 3 mm sphere was not included in the analysis as it was poorly defined on original and deblurred images.
Activity concentration uniformity was first compared in original and deblurred sphere images by plotting intensity profiles.Recovery coefficients (RCs) for the mean, median, and maximum activity concentration (RC mean , RC median , RC max ) were also determined.These were defined as where a h,true is the ground truth activity concentration for each sphere determined in a previous study [43], and stat = {mean, median, max}.The mean inner product between combinations of phantom kernels was calculated.The difference in total activity within deblurred and original images was assessed.

Patient Analysis
Deblurred images had higher BRISQUE and CLIP scores than original images upsampled using nearest-neighbours, linear, quadratic, and cubic interpolation.These results are summarized in Table 3.
Table 3: Blind image quality metrics, CLIP and BRISQUE are compared for original (y) and deblurred (x) PSMA PET images.For appropriate comparison, y was upsampled by 2 in each dimension to match the size of x.In the first column, y is upsampled using nearest neighbours interpolation to remain visually identical.We also compared the quality of y when upsampled using linear, quadratic and cubic interpolation (y ↑ 1 ,y ↑ 2 ,y ↑ 3 ).Maximum intensity projections of x, k, x * k, and y are shown for a single patient in Figure 2 and visual comparisons of original and deblurred PET image slices including parotid glands and submandibular glands are shown for 3 different patients in Figure 3. Maximum intensity projection images are shown through the head and neck in Figure 4.
Predicted blur kernels displayed relatively low inter-patient variability, having a mean inner product between different patients of 0.73.Projections of the mean predicted   kernel onto the three standard cartesian planes are found in Figure 5. Predicted kernels were mostly symmetric while having small inter-patient deviations in skewness.The distribution of kernel voxels in various patient directions is summarized in Table 4, with maximum variation found in the anterior-posterior direction.Kernels were found to display a centred peak, which rapidly fell off to less than 0.01 when two voxels from the center in any direction.
The mean absolute difference of the total activity found within all voxels of original and deblurred images varied by less than 0.05 ± 0.07%.The predicted blurred image, x * k, and the original image, y, had a MAE and MSE of 0.013 ± 0.004 and 0.003 ± 0.001 SU V lbm , respectively.Figure 5: The mean predicted kernel is projected onto the 3 standard patient planes.The predicted blur kernel was found to be mostly symmetric.
Table 4: The fraction of the kernel contained within a plane oriented in each Cartesian direction and intersecting the center of the kernel image, averaged over all patients.For example, the average blur kernel has a sum of 0.52 anterior to the center point.Pseudokernels applied to deblurred PET images were predicted well, with mean inner products of predicted and generated pseudokernels above 0.90 for each of the 4 types of pseudokernel applied (regular Gaussian, and skewed Gaussian along 3 patient axes).These results are displayed in Figure 6 Deblurred images appeared to be visually sharper than original images.Since x is double the size of y, we upsampled y using nearest-neighbours, linear, quadratic, and cubic interpolation for comparison's sake.Side by side images are found in Figure 7.
Cohort uptake statistics within CT-defined salivary glands, calculated using both original and deblurred images, are listed in Table 5. Maximum uptake values in both parotid and submandibular glands were significantly higher in deblurred images (p < 0.01), and mean uptake values were insignificantly higher.Uptake plots are shown over corresponding lines through the parotid glands in Figure 8.The gradient of uptake from 0.5 SU V lbm to the full width at half-maximum (FWHM) is approximately twice as steep for deblurred images than their original counter-parts.The deblurred images display internal heterogeneity within parotid and submandibular glands that appears unresolved in original images.Axial, coronal and sagittal slices of fusion PET/CT images are shown for both deblurred and original images in Figure 9.     Axial MIPs and slices of original and deblurred phantom images are shown in Figure 10.Activity concentrations were more localized within spheres after deblurring.This was most pronounced in small spheres (≤ 10 mm).In larger spheres, the activity concentration had higher uniformity after deblurring as seen in signal versus distance plots (across axial slices in both y and x directions) in Figure 11. Figure 11 also shows ROI masks defined for each sphere using variable thresholds (Table 6).Deblurred and supersampling images resulted in ROIs with higher symmetry.The total activity concentration in phantom images was unaltered after deblurring (p > 0.61).RC mean , RC median , and RC max increased for each sphere upon deblurring (Table 6).Small sphere volumes were thresholded with higher volumetric accuracy, and ROI contouring using a single fixed threshold was made more effective for minimizing total volume error over all sphere sizes after deblurring (Figure 12).In particular, a single fixed threshold characterized small spheres more effectively after deblurring.In both original and deblurred images, the maximum activity concentration in 16 mm and 14 mm exceeded actual activity levels.Deblurring increased this overestimation by approximately 5 % (Table 6).Table 6 includes recovery coefficients and volumes of thresholded phantom regions.
Deblurred images of all separate phantom acquisitions were similar, with a mean inner product of 0.93 ± 0.04 between distinct kernels.The shape and size of the mean kernel determined for phantom images was similar to that found for patient images.   NaCl infused epoxy spheres in phantom images for original and deblurred images.The top row corresponds to sphere statistics determined using volumes defined with a single fixed threshold, while the bottom row corresponds to statistics determined with individual thresholds determined for each sphere size.Spheres defined using deblurred/supersampled images had lower volume discrepancies, and higher recovery coefficients (RCs) of mean, median, and maximum activity concentration.Error bars indicate standard deviations over the 5 consecutively acquired phantom images.

Discussion
Based on blind image quality metrics and visual appearance, neural blind deconvolution helped mitigate PVEs and improve overall image quality.Quality improvements were a demonstrated result of the blind deconvolution process, and not only standard upsampling, as shown by comparing results with original images upsampled using various polynomial interpolation orders.The intrinsically low spatial resolution of PET imaging, and consequent PVEs [46], render PVE correction a necessary task when quantifying uptake in small image regions.Furthermore, PVE correction of PET images has been shown to increase the statistical significance of correlations between uptake bio-markers and various clinical endpoints and prognostic indices [47][48][49][50][51].
Predicted blur kernels were consistent with anticipated sizes and shapes, being small (generally no more than 3 voxels across in any direction with values larger than 0.01) and mostly symmetric.By observing uptake fall-off at body borders in original images, it appears that PVEs cause small spill-out to distances larger than one voxel, but neural blind deconvolution cannot detect such small blur components.For model validation, it was essential that predicted kernels shared similarities between patients.The low but non-zero inter-patient variability of predicted kernels was as expected, which can be considered to be an unknown combination of the point spread function, scanner limitations, patient motion and other artifacts.
The model's capacity to predict pseudokernels convolved with previously deblurred images validated its ability to predict various kernel shapes rather than simply Gaussian shapes.All generated pseudokernel shapes were predicted with high accuracy, but it is notable that the highest accuracy (mean inner product = 0.98) was found for pseudokernel's that were stretched onto primarily one axial slice, and all kernel's stretched onto primarily one axis were predicted with a higher accuracy than the symmetric Gaussian kernel.This may be due to the slightly larger slice spacing than pixel spacing (ratio of 1.025), as it is unclear what mechanism in the model architecture would result in increased predictive power of axially flattened kernels.
The methodology was further validated on phantom images.Activity was more concentrated within spheres and recovery coefficients were higher after deblurring.Use of a single fixed threshold for defining regions had lower volume error for small regions after deblurring and greater fine detail in small spheres is observable in Figure 10 and Figure 11.These findings, along with improvements in blind image quality metrics and sharpened ROI uptake boundaries support the ability of this method to improve image quality.It is particularly important to further explore the affect of deblurring on the SU V max in lesions, as deblurring led to overestimation of the maximum activity concentration within large spheres.This effect may have been due to the spherical symmetry of phantom lesions, which leads to ambiguity in the determination of a theoretical deblurred image and blur kernel.Overestimation of activity concentrations in large phantom spheres suggest that maximum uptake values in deblurred parotid and submandibular glands may be overestimated.Further exploration of this technique on phantom data with various shapes and sizes could help to understand limitations of this methodology.
An advantage of neural blind deconvolution over traditional blind deconvolution methods [52] for mitigating PVEs in PET images is that it does not require prior assumptions of the PSF to be imposed.A review of various blind deconvolution algorithms (excluding neural blind deconvolution) applied to natural images [53] found that the shift-invariant assumption for the blur kernel is often incorrect, which possibly explains a portion of the variance seen in predicted kernels.The proposed method assumes a uniform blur kernel, such that the image can be modelled as y = x * k.However, it is known that PET image blur is dependent on radial distance from the isocentre (radial astigmatism) [54,55], so this methodology could likely be improved by modifying the training algorithm to incorporate a spatially varying kernel.However, due to the static nature of the PET scanner, it is likely that this variance is less pronounced here than in cases of natural image acquisition using hand-held devices.
Previous approaches to super-resolution of PSMA PET images typically seek to reconstruct images of standard quality using lower than standard doses [56][57][58][59].These methods use supervised learning to train models to predict known"ground truth" high resolution images from their lower resolution counterparts.Other approaches acquire standard images at multiple points of view and attempt to reconstruct single images of supersampled resolution [60].In contrast, the present approach seeks to supersample reconstructed images of standard quality to higher than standard resolution, using selfsupervised learning.Therefore, ground truth images do not exist in the present study for a direct comparison of real and predicted high resolution images.We therefore relied on assessing improvements in blind image quality metrics, and testing kernel similarity between patients and the model's ability to accurately predict pseudokernels.
To directly compare the present method with previous supersampling techniques [ [56][57][58][59], a potential study could acquire both low and high resolution PSMA PET images, and then apply the present method to the low resolution images.The acquired "high resolution" images could then be compared with predicted values.This allows for direct quantification of supersampling quality.
The performance of neural blind deconvolution on 2D natural images was previously found to out-perform other deblurring approaches [23,24].A fundamental difference between applying deblurring approaches to PET images vs natural images is that 'still' natural images can typically be taken as ground-truth images, while 'still' PSMA PET images have intrinsic blur.This makes it especially challenging to assess deblurring approaches applied to PET images.It was not possible to directly assess the accuracy of predicted blur kernels, as true blur kernels are unknown, and can only be estimated from the scanner's point spread function, which is a source of uncertainty that burdens traditional blind deconvolution a posteriori analyses.
A great advantage of self-supervised learning over supervised learning deep learning algorithms is that a large training set is unnecessary.This is especially important in the case of PSMA PET deblurring, PSMA PET is relatively expensive [61], making it difficult to acquire large datasets.A self-supervised approach to super-resolution of PET images using generative adversarial networks [62] has had demonstrated success, despite its use of 2D rather than 3D convolutions.
While in the traditional U-Net architecture as well as previous neural blind deconvolution studies, the channel count increases towards deeper regions of the encoder and decoder [23,24,28], we discovered that larger channel counts towards outside regions of the network, decreasing towards the bottom of the encoder and decoder, performed most accurately.Furthermore, including a large number of skip connection channels (64 each) greatly improved performance.This is likely due to increased high-level feature extraction at the top of the network allowing for ultimately better image reconstruction, along with bottle-necking in deeper regions leading to improved learning of important features [63].Our modelling power was limited by time and GPU resources, and we were unable to test whether scaling all channel counts further could improve results.The model code is available for download online [64].
Image quality appeared higher within the high uptake regions of salivary glands as well as other low uptake regions.Before adopting the combination of MAE and MSE used in the loss function, we used only MSE as in previous literature [23,24], but found that this lent itself poorly to the highly skewed distribution of PSMA PET uptake.Adding in MAE allowed appropriate penalization of discrepancy in lower-uptake regions, leading to higher overall quality.We found that decreasing the number of autoencoder layers in G x from 5 to 4 or 3 had no perceived affect on the model's predictive power, and opened up GPU memory for more channels to be added to the remaining layers.We experimented with various numbers of hidden layers and channel counts in G k and found that a single layer with 5 times the number of channels in the flattened kernel image worked best.
PET imaging suffers from intrinsically low resolution, resulting in apparent partial volume effects.We have demonstrated neural blind deconvolution to be a viable postreconstruction method for mitigating these effects.As opposed to traditional maximum a posteriori methods for estimating deblurred images, neural blind deconvolution allows for simultaneous estimation of x and k without convergence towards a trivial solution.We have built off of previously demonstrated 2D natural image architectures [23,24] to create a suitable architecture for 3D PET images.Inclusion of the CT texture map for steering optimization during early stages helped guide training towards a centered kernel, as opposed to an off-centered kernel and deblurred image, which when convolved, predict a centered image for comparison with the original image.Use of CT images for guiding blind deconvolution relies on accurate image registration.The GLRLM with a long-run emphasis was used rather than the original CT im age, due to its lower contrast which matches PSMA PET better, and due to our previous findings which demonstrated a strong correlation between PSMA PET uptake and the CT GLRLM [33].Initially, a joint entropy loss function was used rather than SSIM, however, this relies on voxel-binning which causes problems during gradient calculation for back-propagation.Regularization of the kernel was necessary for avoiding a trivial solution.However, once a MSE penalty was applied to kernel values greater than 0.7, the model converged to non-trivial solutions for all patients.Maximum kernel values were generally less than 0.4, so our kernel constraint did not arbitrarily constrain the maximum value found in predicted kernels to a set value.
This analysis was performed on a dataset of previously reconstructed images, without access to raw scanner data.VPFXS had been previously chosen for clinical image reconstruction over BSREM/Q.Clear [65,66] for consistency across scanners (which did not all have access to BSREM/Q.Clear reconstruction algorithms).The voxel size is an important acquisition parameter whose modification is sure to affect model performance.The dependence of model performance on voxel size would make for an interesting future study.Many other acquisition and model parameters can be further fine-tuned for optimal performance.Future studies seeking to enhance PSMA PET with neural blind deconvolution should maintain a consistent voxel size across patients for the sake of comparison.
PSMA PET uptake is heavily biased towards the salivary glands than other regions in the head and neck, as seen in figures.A previous study has demonstrated heterogeneous uptake of PSMA-PET in parotid glands [33], and mitigating partial volume effects using neural blind deconvolution appears to make this effect more pronounced.The tubarial glands [67] appeared with greater definition in deblurred images, as two distinct regions of high uptake.Maximum uptake regions were better localized in deblurred images, resulting in increased overall maximum values, as well as slight increases in whole-gland means.These results are too be expected, due to mitigated spill-out and spill-in of voxel values around maximum values, and spill-out near gland edges.The total activity in original and deblurred images remained constant, which was necessary in the context of PET imaging.Originally, we included a separate total activity constraint in the loss function, but found this redundant since our fidelity term accomplished this goal.
Supersampling during neural blind deconvolution is not limited to double scaling, and the demonstrated model architecture can easily be amended for accommodating additional resizing.Furthermore, the methodology could be adapted for kernels to be first convolved with deblurred images before downsampling to compute the fidelity loss.Our methodology was limited by time and computation resources.
Deblurring and supersampling PSMA PET images was motivated by the challenge of quantifying uptake in small spatial regions of images, such as the salivary glands, where partial volume effects become increasingly problematic.Prostate specific membrane antigen (PSMA) positron emission tomography (PET) has high ligand accumulation in the parotid glands [2][3][4], and has been suggested to relate to whole-gland functionality [68][69][70].Heterogeneity of PSMA PET uptake trends in salivary glands can be analysed with greater fine detail brought out by deblurring / supersampling.PSMA PET has the unique potential to quantitatively investigate salivary gland physiology, which could potentially lead to better understanding of their functionality as relevant for radiotherapy treatment planning.
PSMA PET is primarily used for detecting and staging prostate cancer, and deblurring/supersampling with neural blind deconvolution could improve fine detail in small lesions, leading to better localization.This methodology has the potential to improve target localization for treatment planning and reduce the rate of false negative lesion detection.Greater localization ability will lead to better estimation of lesion volumes and better monitoring of treatment outcomes.Use of SU V mean over SU V max will likely be necessary for characterising regions of interest as super-resolution algorithms continue to develop and gain clinical confidence, as the SU V max is more sensitive to partial volume effects in small regions.

Conclusion
In this work, we have adapted neural blind deconvolution for simultaneous PVE correction and supersampling of 3D PSMA PET images.PVE correction allows for fine detail recovery within the imaging field, such as uptake patterns within salivary glands.Leveraging the power of deep learning without the need for a large training data set or prior probabilistic assumptions, makes neural blind deconvolution a powerful PVE correction method.Our results demonstrate improvements in quality metrics of deblurred images over other commonly used supersampling techniques.The model code is available online for further studies [64].

Figure 1 :
Figure 1: The blind deconvolution architecture used for deblurring PSMA PET images is illustrated.G x is an asymmetric, convolutional auto-encoder network for predicting the deblurred image, x, and G k is a fully-connected network for predicting the blur kernel, k.The convolution of x and k is trained to match the original image by using back-propagated model gradients from the loss function for iterative optimization.

Figure 2 :
Figure 2: From left to right, for an individual patient, a deblurred maximum intensity projection image, and axial projection of the predicted kernel are shown, along with the maximum intensity projection of the convolution of the deblurred image and kernel, which is trained to match the original image, on the right.The inferior aspect of the tubarial glands are visible between the parotid glands.

Figure 3 :
Figure 3: Axial slices intersecting parotid glands (left) and submandibular glands (right) for 3 different patients are shown on original and deblurred images.

Figure 4 :
Figure 4: Maximum intensity projections of deblurred (left) and original (right) PSMA PET images are shown through the head and neck in the axial (top), coronal (middle) and sagittal (bottom) planes.

Figure 6 :
Figure 6: To validate the model's ability to predict accurate blur kernels, 4 variously skewed pseudokernels were applied to deblurred images before re-running the blind deconvolution.Generated pseudokernels (k * , left) and their corresponding predicted kernel (k, right) are shown in separate rows for each of the 4 pseudokernel shapes, in axial, coronal, and sagittal image slices.The dot-product of normalized kernels, k * • k, was used to assess kernel similarity.

Figure 7 :
Figure 7: Predicted deblurred images, x are twice the original resolution, and we therefore tested whether perceived improvements in image quality were only due to the size difference by upsampling the original image using nearest neighbours, linear, quadratic, and cubic interpolation.The deblurred image (right) appeared to have the best quality.

Figure 8 :
Figure 8: An axial slice through the original (left) and deblurred (right) PSMA PET images, intersecting the parotid glands is shown, along with a red line whose corresponding uptake plot is shown directly below each image.Uptake plots for deblurred and original images are displayed together at the bottom.The deblurred images are found to have smaller partial volume effects, as demonstrated by a more rapid ascent to the full width at half maximum, as shown on lateral edges.In this case, the rise to FWHM was twice as steep in the deblurred images.

Figure 9 :Figure 10 :
Figure 9: From top to bottom, axial, coronal, and sagittal image slices are shown for both deblurred (left) and original (right) PSMA PET/CT fusion images.

Figure 11 :
Figure 11: Axial slices of 22 NaCl infused epoxy spheres in phantom images for original and deblurred images are shown, along with masks corresponding each sphere's best variable threshold for contouring accurate volumes (blue).From top to bottom, sphere diameters are: 16 mm, 14 mm, 10 mm, 8 mm, 7 mm, 6 mm, 5 mm.Activity concentration plots versus distance across each volume in the coronal and sagittal direction are shown to the right, corresponding to red lines shown in PET images.

Figure 12 :
Figure12: Volume and concentration statistics for22 NaCl infused epoxy spheres in phantom images for original and deblurred images.The top row corresponds to sphere statistics determined using volumes defined with a single fixed threshold, while the bottom row corresponds to statistics determined with individual thresholds determined for each sphere size.Spheres defined using deblurred/supersampled images had lower volume discrepancies, and higher recovery coefficients (RCs) of mean, median, and maximum activity concentration.Error bars indicate standard deviations over the 5 consecutively acquired phantom images.

Table 1 :
[24]optimization algorithm for updating network weights to predict deblurred PSMA PET images.This algorithm builds off of Ren et al.'s proposed joint optimization algorithm[23]and implements modifications suggested by Kotera et al.[24].

Table 5 :
PSMA PET uptake statistics in CT-defined salivary glands are summarized using both original and deblurred images.

Table 6 :
Volume and recovery coefficient statistics for22NaCl infused epoxy spheres in original and deblurred images are shown for volumes determined using a single fixed threshold (top) or a variable threshold determined separately for each sphere size (bottom).Deblurring decreased thresholded volume error, and raised recovery coefficients of the mean, median, and maximum activity concentration.Deblurring increased overestimation of the maximum activity concentration in 16 and 14 mm spheres.