Uncertainty estimation and evaluation of deformable image registration based on convolutional neural networks

Objective. Fast and accurate deformable image registration (DIR), including DIR uncertainty estimation, is essential for safe and reliable clinical deployment. While recent deep learning models have shown promise in predicting DIR with its uncertainty, challenges persist in the proper evaluation of that uncertainty and in the hyperparameter optimization of these methods. This work aims to develop and evaluate a model that can perform fast DIR and predict its uncertainty in seconds. Approach. This study introduces a novel probabilistic multi-resolution image registration model that uses convolutional neural networks to estimate a multivariate normally distributed dense displacement field (DDF) in a multimodal image registration problem. To assess the quality of the DDF distribution predicted by the model, we propose a new metric based on the Kullback–Leibler (KL) divergence. The performance of our approach was evaluated against three other DIR algorithms (VoxelMorph, Monte Carlo dropout, and Monte Carlo B-spline) capable of predicting uncertainty. The evaluation of the models included not only the quality of the deformation but also the reliability of the estimated uncertainty. Our application investigated the registration of a treatment planning computed tomography (CT) to follow-up cone beam CT images for daily adaptive radiotherapy. Main results. The hyperparameter tuning of the models showed a trade-off between the reliability of the estimated uncertainty and the accuracy of the deformation. At the optimal trade-off, our model excelled in contour propagation and uncertainty estimation (p < 0.05) compared to existing uncertainty estimation models. We obtained an average dice similarity coefficient of 0.89 and a KL divergence of 0.15. Significance. By addressing challenges in DIR uncertainty estimation and evaluation, our work showed that both the DIR and its uncertainty can be reliably predicted, paving the way for safe deployment in a clinical environment.


Introduction
Multimodal deformable image registration (DIR) is a sophisticated computational process aimed at establishing a spatial correspondence between two paired images while accounting for spatial distortions. DIR is crucial for several image processing applications, such as contour propagation (Vargas-Bedoya et al 2022), motion tracking and modeling (Christensen et al 2007), adaptive radiotherapy (Chetty and Rosu-Bubulac 2019) and lesion tracking, among others. Classical registration algorithms have been extensively developed and validated to achieve high-performance registrations. Among these algorithms, B-spline models hold a special place (Rueckert et al 1999), being frequently utilized to perform high-quality elastic deformations in numerous open-source DIR software tools (Klein et al 2010, Shackleford et al 2012). These algorithms iteratively optimize an objective function that measures the quality of spatial transformations, leading to precise alignments. However, due to the iterative nature and complexity of the optimization process, they often demand significant computational resources. This results in execution times that exceed the time available in the clinical environment. For example, in radiotherapy, fast DIR is indispensable to perform daily online adaptation of the treatment plan (Albertini et al 2020, Hunt et al 2023). Another limitation of these algorithms is their inability to provide reliable uncertainty estimates for the deformation maps.
DIR is mainly subject to two types of uncertainty. The first is aleatoric uncertainty, primarily stemming from limitations in the image acquisition process, such as limited resolution, poor contrast, or high noise. These limitations make it difficult to establish a clear correspondence between the same anatomical point in two distinct images. Reducing this uncertainty would require advancements in image acquisition methods. The second is epistemic uncertainty, often referred to as model uncertainty, which arises from the limitations of the model in finding a proper registration. Reducing this type of uncertainty entails enhancing the capabilities of the employed models. The development of DIR models that can effectively capture both types of uncertainty is essential for the safe use of DIR algorithms in clinical applications. For example, in adaptive radiotherapy, the estimation of DIR uncertainty plays an important role in quantitatively evaluating the dose accumulated over the treatment fractions (Chetty and Rosu-Bubulac 2019). This directly impacts the plan quality evaluation and the clinical decisions for re-adaptation. In image-guided neurosurgery, DIR uncertainty estimation is crucial to prevent overconfidence in the accuracy of preoperative image matching (Luo et al 2019). This can help surgeons understand the registration limitations and adapt their approach to enhance surgical precision.
For the commonly used B-spline DIR models specifically, early efforts to estimate uncertainty consisted of evaluating the sensitivity of an image similarity metric against random variations of the optimized B-spline coefficients (Hub et al 2009). Although such methods can preserve the high accuracy of the B-spline models, they can only provide a rough estimation of the uncertainty in subregions of the image. Another way to estimate B-spline DIR uncertainty is by performing random Monte Carlo (MC) sampling over the hyperparameters of the B-spline model (Smolders et al 2023). In this way, slightly different models can be optimized, and their solutions can be used to estimate the mean and standard deviation of the likely deformation. Although this method can estimate a voxel-wise DIR uncertainty, it is very time-consuming since it requires multiple runs of the registration algorithm.
With the advent of deep learning, novel algorithms have been developed for fast DIR, demonstrating competitive performance compared to classical registration methods (Balakrishnan et al 2018, Hu et al 2018, Che et al 2023, Hunt et al 2023). At the same time, considerable advancements in deep learning were made to create models capable of estimating uncertainty. The development of Bayesian deep learning approaches offered an intuitive way to estimate model uncertainty (Magris and Iosifidis 2023). Unlike classical neural networks (NNs), these architectures are parameterized by a normal distribution. However, training such a network is often impractical because of its substantial computational cost. Recent results have shown that Monte Carlo dropout (MCD) models can be used to approximate Bayesian networks and to estimate the model uncertainty (Gal and Ghahramani 2015). Dropout layers randomly disable nodes of the network, yielding different realizations of the same model during training. When such a model is run multiple times at inference, it generates different likely solutions that can be used to calculate the dispersion of the model via the standard deviation of the results. This concept was used by Gong et al (2022) to create a model that uses convolutional neural networks (CNNs) to perform DIR and predict its uncertainty. However, it remains an open question how the uncertainty estimated with these models is affected by the choice of the dropout probability.
Another approach to estimating uncertainty in deep learning is the use of Bayesian probabilistic models. Such models describe the problem using Bayesian probabilities and estimate the posterior probability distribution of the phenomenon to be modeled using variational inference (Cheng et al 2023). NNs are used to model the parameters of the approximated probability density function. One of the main features of these methods is that they allow estimating the uncertainty of the problem directly from the network. Such a method was used by Dalca et al (2019) to create a probabilistic DIR model (VoxelMorph) that can predict the deformation and its uncertainty. Their innovative work not only established a bridge between classical and learning paradigms but also presented exciting possibilities for predicting DIR uncertainty quickly. Using a similar methodology, Smolders et al (2023) built a CNN model that can predict the DIR uncertainty of deformation maps generated with classical registration algorithms. They validated their model by calculating a reliability diagram that correlates the error of the propagation of landmarks drawn by a physician with the uncertainty estimated by their model.
Despite these efforts, the performance of models that predict DIR uncertainty remains challenging to evaluate because of the absence of gold standard deformations and of dedicated metrics that directly evaluate the quality of the predicted distribution. Furthermore, the relation between the accuracy of the predicted deformation maps and the reliability of the estimated uncertainty remains unstudied. The main goal of this work is to develop and validate a fast deep learning DIR algorithm that can accurately predict the most likely deformation field and its uncertainty in a multimodal DIR problem.

Methods
In our work, we adopted the Bayesian probabilistic approach of VoxelMorph (Dalca et al 2019). Their methodology describes how to predict a distribution of voxel-wise correspondences, or dense displacement fields (DDFs), between a moved image (m) and a fixed image (f) using variational inference and CNNs. They approximated the posterior probability density function p(ϕ|f, m) with a multivariate normal distribution N(ϕ; µ(ψ), Σ(ψ, σ)), where ϕ is a sample of the DDF, µ is the mean of the distribution, Σ its covariance matrix, σ a fixed scalar, and ψ the (hidden) parameters of the CNN, which were optimized by minimizing equation (1), where f, m ∈ R^(1×N) and N is the number of voxels in the image. To derive equation (1), the authors assumed independence between the variables ϕ and m and approximated p(ϕ|m) with a normal distribution N(ϕ; 0, λ⁻¹L⁻¹).

Unsupervised multimodal DIR framework
Although VoxelMorph provides a mathematical framework to predict the DDF mean and standard deviation, the approach has limitations in predicting a smooth DDF with a realistic uncertainty. Furthermore, it does not provide an efficient way to calculate the term LC(σ)Σ(ψ)C(σ) from equation (1), and the approach is derived only for the unimodal registration case. Because the probability p(ϕ|m) is modeled as independent of the properties of the image m, the VoxelMorph implementation is missing important prior information that could better guide the registration uncertainty. For these reasons, we approximated p(ϕ|m) with a normal distribution that depends on some properties of the moved image m. We assumed that, given a moved image m, the average of the deformation field ϕ over all possible fixed images would be zero and that the covariance matrix would depend on properties of the moved image, with Σ_p a diagonal covariance matrix that depends on the properties of m. We modeled Σ_p considering the independent contribution of two different sources of uncertainty. The first is related to random misalignments between the moved and fixed images; its magnitude can be represented as Σ₁ = d²I, where d is the average misalignment distance and I is the identity matrix. The second is related to low-contrast regions in the moved image and is modeled using a smooth convolution matrix C(d) with standard deviation d and a function T: R^(1×N) → R^(1×N) that computes, for every voxel of the moved image, the distance to the closest neighboring voxel that generates a difference in value above a specified threshold in Hounsfield units (HU). In this way, we calculated the prior covariance matrix as the sum of these two contributions.
To perform multimodal image registration between different imaging modalities, we exchanged the mean square error term of equation (1) (image similarity term) with the mutual information (MI). MI was chosen for its robust performance in quantifying image similarity between two multimodal images, as it maximizes the information extracted about the moved image when considering the fixed image (Pluim et al 2003). We also scaled the magnitude of the bending energy (BE), since its value is influenced by the number of background voxels present in the images. Including these changes in equation (1) and solving the algebra (see supplementary material), we can rewrite equation (1) as equation (3), where γ is a scalar hyperparameter of the problem, K(σ) ∈ R^(3N×3N) is an effective smoothing convolution matrix and 1 ∈ R^(3N×1) is a vector of ones. Note that the expression K(σ)Σ(ψ)1 is equivalent to convolving the image of Σ(ψ) with the kernel used to generate the matrix K(σ). This effective 3D convolution kernel k(σ) can be calculated with equation (4) (detailed calculation in the supplementary material), where n_r is the size of the fixed or moved image along the r axis, δ is the Kronecker delta, ĉ(σ, [i, j, k]) is a Gaussian kernel centered at position (i, j, k), c_(i,j,k) is the (i, j, k) element of the centered Gaussian kernel, and ⊙ is an element-wise product. For implementation purposes, we normalized k(σ) so that its elements sum to 1. The incorporation of the effective smoothing convolution matrix K gives our method a computationally efficient way to calculate the term LC(σ)Σ(ψ)C(σ) from equation (1) using convolutions that can be computed in any deep learning framework in Python.
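The key computational trick described above — replacing a multiplication by a huge (normalized) convolution matrix with an ordinary image convolution — can be illustrated with a minimal NumPy/SciPy sketch. This is not the paper's implementation: a plain normalized Gaussian stands in for the effective kernel k(σ) of equation (4), and a toy variance map stands in for Diag(Σ(ψ)).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
variance_map = rng.uniform(0.5, 2.0, size=(16, 16, 16))  # stand-in for Diag(Sigma) as a 3D image

# Convolving the variance image with a normalized kernel is equivalent to the
# matrix-vector product K(sigma) @ Diag(Sigma) @ 1 without ever materializing
# the 3N x 3N matrix K(sigma).
smoothed = gaussian_filter(variance_map, sigma=1.5, mode="wrap")

# Because the kernel is normalized to sum to 1 (and boundaries wrap here),
# the total sum of the variance image is preserved by the smoothing.
total_before = variance_map.sum()
total_after = smoothed.sum()
```

Any deep learning framework offers the same operation as a fixed-weight 3D convolution, which is what makes the trace term of the loss cheap to evaluate during training.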
Note that the MI term of equation (3) encourages image similarity between the fixed image and one realization of the warped moved image, the bending energy term regularizes µ(ψ), and the trace term, or uncertainty term, regulates the magnitude of Σ(ψ). This can be seen by considering the simplified 1D problem with only one voxel. In this case, Σ(ψ) is a scalar variable, and the uncertainty term is Σ(ψ)/Σ_p − log Σ(ψ). This function is minimized only when Σ(ψ) = Σ_p. Extrapolating this result, the uncertainty term pushes the smoothed variational diagonal covariance matrix Σ(ψ) to be similar to Σ_p. In the limit ω, γ → ∞, the unsupervised loss function collapses to the sum of an image similarity metric and a DDF regularizer, as previously formulated in other works that predict only a DDF without its uncertainty (Dalca et al 2019, Fu et al 2020). However, for finite ω, there is a trade-off between the image similarity term and the uncertainty term.
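The claim that the scalar uncertainty term Σ/Σ_p − log Σ is minimized exactly at Σ = Σ_p can be verified numerically with a short sketch (illustrative values only):

```python
import numpy as np

def uncertainty_term(sigma_var, sigma_prior):
    # 1D single-voxel version of the uncertainty term: Sigma/Sigma_p - log(Sigma)
    return sigma_var / sigma_prior - np.log(sigma_var)

sigma_prior = 2.0
grid = np.linspace(0.1, 10.0, 10_000)        # candidate variational variances
values = uncertainty_term(grid, sigma_prior)
best = grid[np.argmin(values)]               # numeric minimizer

# Analytically, d/dSigma (Sigma/Sigma_p - log Sigma) = 1/Sigma_p - 1/Sigma = 0
# holds only at Sigma = Sigma_p, matching the numeric minimizer.
```

This is why the trace term alone pulls the predicted variance toward the prior Σ_p, while the image similarity term can push it away where the images disagree.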

Supervised multimodal DIR framework
In many clinical applications where registration is needed, the fixed and moving images have segmentation masks of anatomical regions that can be incorporated into the training process to improve the registration's accuracy. However, segmentation is itself subject to uncertainty at the borders of the structures, mostly related to poorly defined anatomical boundaries. For this reason, we developed a probabilistic approach that incorporates the segmentations and their uncertainty using probability maps to better guide the registration distribution.
In addition to the methodology described for the unsupervised model, we propose to maximize the log-likelihood of the posterior probability distribution p(s_f^i | ϕ, s_m^i), where s_f^i ∈ R^(1×N) is the ith contour in the fixed image and s_m^i ∈ R^(1×N) is the ith contour in the moved image. Because s_f^i and s_m^i are binary contours, i.e. their voxel values are 1 or 0, we can consider s_f^i as a realization of the probability map defined as the average of different samples of the warped contour. Following this rationale, we can model p(s_f^i | ϕ, s_m^i) with an independent Bernoulli distribution, where f is the voxel-wise Bernoulli distribution, ∏ is an element-wise product over voxels, and PM is the probability map of the position of the contour in the fixed anatomy. Minimizing the negative log-likelihood of p(s_f^i | ϕ, s_m^i) (see supplementary material) and using equation (3), we can write the loss function for the supervised model, where H is the average of the voxel-wise cross-entropy (CE) and the ith probability map of the fixed contour is obtained by averaging the contours of J physicians on the same organ. In this work, we used J = 1 because we only have one segmentation per organ.
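The supervised H term above is a voxel-averaged binary cross-entropy between a probability map (built from warped contour samples) and the fixed binary contour. A minimal sketch, assuming a toy fixed contour and perfectly matching warped samples (names are illustrative, not the paper's code):

```python
import numpy as np

def cross_entropy_loss(prob_map, fixed_contour, eps=1e-7):
    # Voxel-wise binary cross-entropy between a probability map PM and a binary
    # fixed contour, averaged over voxels (the H term of the supervised loss).
    p = np.clip(prob_map, eps, 1.0 - eps)
    return float(np.mean(-(fixed_contour * np.log(p)
                           + (1.0 - fixed_contour) * np.log(1.0 - p))))

rng = np.random.default_rng(1)
fixed = (rng.random((8, 8, 8)) > 0.5).astype(float)  # toy binary fixed contour

# Probability map built by averaging warped binary contour samples; here the
# samples match the fixed contour exactly, so the loss should be ~0.
warped_samples = np.stack([fixed] * 60)
prob_map = warped_samples.mean(axis=0)

loss_match = cross_entropy_loss(prob_map, fixed)
loss_mismatch = cross_entropy_loss(prob_map, 1.0 - fixed)
```

When the probability map and fixed contour agree, the loss is near zero; disagreement drives it up sharply, which is the gradient signal that guides the supervised registration.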

Neural network (NN)
The variational mean µ(ψ) and the diagonal of the covariance matrix Diag(Σ(ψ)) were modeled using CNNs. The architecture employed in this study is an adapted version of the network developed by Hu et al (2018). By configuring the network with an output of 6 channels (3 for µ and 3 for Σ), it becomes adept at executing probabilistic multi-resolution image registration, effectively capturing both global and local deformations progressively from coarse to fine scales. At each resolution level j, the network predicts a distribution that establishes the registration from m to f at that resolution. The composite distribution is derived by aggregating the contributions across the resolutions.
The network takes the fixed and moved images as inputs and predicts µ(ψ) and Σ(ψ) as outputs (figure 1). These outputs are subsequently employed to construct a differentiable DDF sample ϕ using the reparameterization step ϕ = µ(ψ) + C(σ)Σ^(1/2)(ψ)δ, where δ is a sample of the standard normal distribution N(0, I) (Dalca et al 2019). The resulting sample is then used to warp the moved image and structures using a resampler layer. Following this step, the loss function is computed, and ψ is updated accordingly.
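The reparameterization step ϕ = µ + C(σ)Σ^(1/2)δ can be sketched in NumPy; this is a simplified stand-in (Gaussian smoothing plays the role of C(σ), and random arrays replace the network outputs), not the trained model's forward pass:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(2)
shape = (3, 16, 16, 16)                    # 3 displacement components per voxel
mu = rng.normal(0.0, 1.0, size=shape)      # stand-in for the network's mean output
std = rng.uniform(0.5, 1.5, size=shape)    # stand-in for the voxel-wise Sigma^{1/2}

def sample_ddf(mu, std, smooth_sigma=1.0, seed=3):
    # phi = mu + C(sigma) * Sigma^{1/2} * delta, with the smoothing matrix C
    # implemented as a Gaussian filter applied to the scaled noise.
    delta = np.random.default_rng(seed).standard_normal(mu.shape)
    noise = std * delta
    smoothed = np.stack([gaussian_filter(n, smooth_sigma, mode="wrap")
                         for n in noise])
    return mu + smoothed

phi = sample_ddf(mu, std)
```

Because the noise enters through a differentiable affine transform of δ, gradients flow back to µ and Σ during training, which is the point of the reparameterization.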

Data
Validation studies were conducted on head and neck patients undergoing external beam radiation therapy. The anonymized data were retrospectively acquired under an approved institutional review board protocol. The moved image and contours were the planning computed tomography (pCT) and the planning structures, respectively. Each pCT contained at least ten planning structures segmented by an experienced radiation oncologist. The fixed image and contours were the daily cone beam computed tomography (CBCT) and its associated structures, obtained by propagating the planning contours using a certified DIR algorithm (ANACONDA) that is used clinically to perform contour propagation (Weistrand and Svensson 2015). Every patient in this dataset had at least one pCT with its structures and around 30 daily CBCTs with their structures.
The networks were trained with 220 pairs of images (pCT and CBCT) obtained from 9 patients. To increase the diversity of the training pairs, the pCTs and their structures were anatomically modified by warping them to different daily anatomies using Elastix. The training pairs were then sampled from the different anatomical variations of the pCTs and the daily CBCTs.
For validation and testing, the sampling differed from training. The image pairs were generated as they would be in the clinic, i.e. the pCT was registered to every daily CBCT. This resulted in an independent validation dataset of 98 image pairs (pCT and CBCT) obtained from 3 patients and an independent test dataset of 66 image pairs obtained from 2 patients. To guarantee the fidelity of the evaluation, the validation and test daily CBCT structures were reviewed and corrected by an experienced radiation oncologist.
All images and structures were resampled to achieve an isotropic voxel spacing of 1.15 mm, resulting in a size of 192 voxels along each dimension.The image intensity was normalized to ensure values fell within the range [0,1].
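The resampling and normalization described above can be sketched as follows. This is a simplified illustration (corner placement instead of proper center-crop/pad, linear interpolation, toy data), not the actual preprocessing pipeline:

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, spacing, target_spacing=1.15, target_size=192):
    # Resample to isotropic target_spacing (mm) with trilinear interpolation,
    # place into a cube of target_size voxels, and normalize intensities to [0, 1].
    factors = [s / target_spacing for s in spacing]
    resampled = zoom(volume.astype(float), factors, order=1)
    out = np.zeros((target_size,) * 3)
    crop = [min(d, target_size) for d in resampled.shape]
    out[:crop[0], :crop[1], :crop[2]] = resampled[:crop[0], :crop[1], :crop[2]]
    lo, hi = out.min(), out.max()
    return (out - lo) / (hi - lo) if hi > lo else out

vol = np.random.default_rng(4).normal(size=(40, 40, 20)) * 500  # toy HU volume
img = preprocess(vol, spacing=(2.3, 2.3, 4.6))                  # anisotropic input
```

With a (2.3, 2.3, 4.6) mm input spacing, the zoom factors are (2, 2, 4), giving an isotropic 1.15 mm grid before padding to the 192-voxel cube.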

Validation and testing
For validation and testing, the data flow was similar to the one described in figure 1, with the difference that 60 samples of the pCT and its corresponding structures were generated to evaluate their dispersion. This number was selected as the maximum sampling size that our GPU could efficiently handle at once. Several metrics were used to assess the quality of the DDF distribution. The dice similarity coefficient (DSC) between the structures propagated with the mean DDF (µ ∘ s_m^i) and the physician's structures was calculated as an indirect measure of the quality of µ. The MI between a sample of the warped CT (wCT, ϕ(ψ) ∘ m) and the daily CBCT was calculated as an indirect measure of the quality of the DDF samples. The average percentage of voxels with a negative determinant of the Jacobian (|J| ⩽ 0) over different samples of the DDF was calculated to evaluate the smoothness of the sampled DDFs. The expected calibration error (ECE) of the probability maps (Mehrtash et al 2019) was used as an indirect measure of the quality of the DDF distribution.
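The DSC used above is the standard overlap metric between binary masks; a minimal sketch with toy masks:

```python
import numpy as np

def dice(a, b, eps=1e-8):
    # Dice similarity coefficient between two binary masks:
    # 2|A ∩ B| / (|A| + |B|)
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + eps)

m1 = np.zeros((10, 10, 10), bool); m1[2:8, 2:8, 2:8] = True  # 6x6x6 cube
m2 = np.zeros((10, 10, 10), bool); m2[3:9, 2:8, 2:8] = True  # shifted by 1 voxel

overlap = dice(m1, m2)   # 2*180 / (216 + 216) = 5/6
```

A one-voxel shift of a 6-voxel cube already drops the DSC to about 0.83, which gives a feel for the scale of the values reported later (average DSC 0.89).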
While it would be ideal to assess the DIR distribution against ground truth DDFs, such information is largely unavailable in most datasets. Consequently, various methods have been introduced to assess DDF accuracy using DDFs derived from plausible anatomical deformations as a reference (Nie et al 2013, Pukala et al 2013). However, these methods primarily evaluate the accuracy of individual realizations of the DDF rather than the entire distribution, which is our aim. Additionally, they are constrained to unimodal image registration, since applying the known DDF to the fixed image generates a moved image of the same modality as the fixed image. To address these challenges, we developed a probabilistic approach that enables the evaluation of DDF distribution quality in multimodal image registration problems.
The method consists of creating a simulated phantom image (Ω ∘ f) with a likely anatomical deformation (Ω) obtained by registering f to a follow-up image of the same patient (figure 2). Then, a DIR model is used to estimate the DDF distributions which register f to m and Ω ∘ f to m (ϕ₁ and ϕ₂, respectively). Considering that ϕ₁ and ϕ₂ are diffeomorphisms, the deep learning DDF distribution Ω* which registers f to Ω ∘ f can be estimated as Ω* = ϕ₂ − ϕ₁. If Ω* is a good estimate of Ω, then the standardized residual vector z(Ω*) should follow a standard normal distribution, where Diag is a function that sets all off-diagonal elements to zero (detailed calculation in the supplementary material). In other words, the distribution generated by the elements of the vector z(Ω*) should follow a normal distribution N(0, 1). For this reason, the Kullback-Leibler divergence KL(z(Ω*) ∼ N(0, 1)) was used to evaluate the quality of a multivariate normal DDF distribution. In this work, this metric was averaged over all the likely anatomical deformations (Ω) between the first CBCT and all the follow-up CBCTs.
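One simple way to score how close standardized residuals z are to N(0, 1) is to fit a Gaussian to them and use the closed-form Gaussian KL divergence. This is a sketch of that idea with synthetic residuals, not necessarily the exact estimator used in the paper:

```python
import numpy as np

def kl_to_standard_normal(z):
    # Fit N(m, s^2) to the standardized residuals z, then evaluate the
    # closed-form KL( N(m, s^2) || N(0, 1) ) = log(1/s) + (s^2 + m^2)/2 - 1/2.
    m, s = z.mean(), z.std()
    return float(np.log(1.0 / s) + (s**2 + m**2) / 2.0 - 0.5)

rng = np.random.default_rng(5)
well_calibrated = rng.standard_normal(100_000)       # residuals already ~ N(0, 1)
overconfident = 3.0 * rng.standard_normal(100_000)   # uncertainty underestimated 3x

kl_good = kl_to_standard_normal(well_calibrated)     # near 0
kl_bad = kl_to_standard_normal(overconfident)        # clearly larger
```

A model that underestimates its uncertainty by a factor of three yields a KL near 2.9, while calibrated residuals give a KL near zero, which is the behavior the metric is designed to discriminate.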

Hyperparameter tuning
The hyperparameters of the supervised (α, γ, ω) and unsupervised (γ, ω) models were fine-tuned on the validation dataset to achieve the best deformation capabilities while still predicting good-quality uncertainties based on the KL-divergence metric. The hyperparameter tuning was performed iteratively. First, one hyperparameter was selected, and a grid search was performed to find the value that generated the best trade-off between the averaged DSC, MI, ECE, and KL-divergence metric on the validation dataset. This hyperparameter was then fixed at its optimal value, and the same procedure was repeated for the remaining hyperparameters. To highlight the benefit in image similarity, the MI was normalized (%MI) to be 0 when there is no registration and 100 when the MI is the lowest achieved by the model. In this work, the value of σ used to calculate Σ(ψ, σ) was not considered a hyperparameter of the problem. Instead, the same value was used for all models. To avoid over-smoothing the matrix Σ(ψ, σ), we selected the smallest σ that generated, on average, less than 0.1% of voxels with a negative Jacobian in the sampled DDFs (ϕ) for all the models. We also fixed the average misalignment distance (d) used to calculate Σ_p to 2.4 mm. This value is within the expected misalignment error found for different organs when using rigid registration in head and neck treatments (Ronneberger et al 2015). T(m) was calculated for each voxel as the maximum distance in all directions (x, y, z) that resulted in an average difference of at least 20 HU compared to the voxel's position. This HU threshold was chosen because it corresponds to two sigmas of the average HU uncertainty of a CT.
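The map T(m) — distance from each voxel to the closest voxel that differs by at least the HU threshold — can be illustrated with a brute-force 1D sketch. The paper's 3D implementation is certainly more efficient; this only shows the definition:

```python
import numpy as np

def contrast_distance_1d(profile, threshold=20.0):
    # For every position, distance (in voxels) to the closest position whose
    # value differs by at least `threshold` HU -- a 1D illustration of T(m).
    n = len(profile)
    out = np.full(n, np.inf)
    for i in range(n):
        for j in range(n):
            if i != j and abs(profile[j] - profile[i]) >= threshold:
                out[i] = min(out[i], abs(j - i))
    return out

# A flat soft-tissue region next to a high-contrast edge:
profile = np.array([0.0, 0.0, 0.0, 100.0, 100.0])
t = contrast_distance_1d(profile)   # [3, 2, 1, 1, 2]
```

Voxels deep inside the flat region sit far from any contrast and therefore get a large T value, i.e. a large prior uncertainty, which matches the intuition that low-contrast regions are harder to register.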

Evaluation
The model's capability to predict the DDF and its uncertainty was evaluated in different ways. A qualitative assessment was performed by visually inspecting the DDF and its uncertainty. A semi-quantitative assessment was performed by examining the trend of the DDF uncertainty as the spatial resolution of the pCT and the CBCTs gradually decreased. A quantitative evaluation was performed by comparing the evaluation metrics between the developed models and other DIR methods.

Qualitative assessment
The supervised model's ability to generate high-quality deformations and uncertainty estimates was visually evaluated by the authors to detect obvious gross misregistrations in different challenging scenarios of the test set. The deformation capabilities were assessed in two patients who experienced significant anatomical changes throughout their treatment. The diagonal of the covariance matrix (Diag(Σ(ψ, σ))) was examined in regions where the wCT showed discrepancies with the CBCT, including cases where anatomical structures were present in one image but not in the other, as well as in the presence of image artifacts. Furthermore, the structure uncertainty was assessed by comparing probability maps against the binary target segmentations generated by a physician, the B-spline model, and the dropout model.

Aleatoric uncertainty analysis
The relation between the contrast of the input images and the DDF uncertainty, represented by Diag(Σ(ψ, σ)), was examined on the test set. Gaussian kernels with various standard deviations were employed to smooth the input pairs (pCT, CBCT). The average of the median DDF uncertainty was calculated over all test samples.

Model comparison
The performance of the models developed in this work was compared against three other models that can predict DIR uncertainty: a MCD model, a VoxelMorph model, and a Monte Carlo B-spline (MCBS) model (see detailed descriptions in the supplementary material).
The MCD model was implemented by introducing dropout layers in all the convolutional layers of the encoding block of the network. The weighted sum of the MI, BE, and DSC losses was used as the loss function (Weistrand and Svensson 2015, Dalca et al 2019, Gong et al 2022). The VoxelMorph model was implemented to predict a multivariate normal distribution of a DDF (Dalca et al 2019). The network was a 3D U-Net extracted from the PyTorch version of their GitHub code. The loss function was the same as proposed in their paper, with the additional calculations developed in this work for the term LC(σ)Σ(ψ)C(σ) of equation (1). The MCBS model was implemented using an open-source DIR algorithm (Elastix) that uses MI and BE to regularize the displacement field. The MC sampling was performed over the strength of the regularizer and the size of the B-spline grid. The hyperparameters of all these models were fine-tuned on the validation dataset to achieve a similar MI to the supervised and unsupervised models. Then, all the models were compared on the test set using the metrics described in section 2.5. For each metric, a paired Wilcoxon signed-rank test was used to compare the distribution of every model against the distribution of the supervised model. Bonferroni correction was used when multiple comparisons were performed.
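The statistical comparison above can be sketched with `scipy.stats.wilcoxon` and a Bonferroni-adjusted threshold. The per-case scores below are synthetic and the model names illustrative; only the testing procedure mirrors the text:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(6)
n_cases = 66                                     # size of the test set
supervised = rng.normal(0.89, 0.02, size=n_cases)  # synthetic per-case DSC scores

# Synthetic paired scores for the competing models (hypothetical offsets):
others = {
    "dropout": supervised - rng.normal(0.03, 0.01, size=n_cases),
    "voxelmorph": supervised - rng.normal(0.05, 0.01, size=n_cases),
    "mcbs": supervised + rng.normal(0.00, 0.01, size=n_cases),
}

alpha = 0.05 / len(others)   # Bonferroni correction for 3 paired comparisons
results = {}
for name, vals in others.items():
    stat, p = wilcoxon(supervised, vals)   # paired signed-rank test on the differences
    results[name] = bool(p < alpha)
```

Models built with a systematic DSC deficit come out significant, while the one differing only by noise generally does not; the Bonferroni division keeps the family-wise error at 5%.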

Results

Hyperparameter tuning
An analysis of the smoothness of the DDF samples predicted by the models for different values of the standard deviation of the smooth convolution matrix C showed that using σ₀ = 6 voxels and a kernel size of 25 voxels generated, on average, less than 0.1% of voxels with a negative Jacobian. With these parameters, we used equation (4) to calculate the effective smoothing convolution kernel k(σ₀) and compared it with the Gaussian kernel (figure 3).
The effective kernel k(σ) was sharper than the Gaussian kernel, giving more importance to neighboring voxels. With these parameters fixed, we optimized the supervised and unsupervised networks.
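The smoothness criterion used for this analysis counts voxels with a non-positive Jacobian determinant of the transform identity + displacement. A minimal NumPy sketch on toy displacement fields (not the trained model's output):

```python
import numpy as np

def negative_jacobian_fraction(ddf):
    # ddf: displacement field of shape (3, X, Y, Z). The transform is
    # x -> x + u(x), so its Jacobian is I + grad(u), estimated here with
    # finite differences via np.gradient.
    grads = np.stack([np.stack(np.gradient(ddf[c]), axis=0) for c in range(3)])
    jac = grads + np.eye(3)[:, :, None, None, None]     # J[c, r] = delta_cr + du_c/dx_r
    det = np.linalg.det(np.moveaxis(jac, (0, 1), (-2, -1)))
    return float(np.mean(det <= 0.0))                   # fraction of folded voxels

smooth = np.zeros((3, 8, 8, 8))                         # identity transform: no folding
x = np.arange(8, dtype=float)
fold = np.zeros((3, 8, 8, 8))
fold[0] = -2.0 * x[:, None, None]                       # u_x = -2x flips orientation everywhere
```

The identity transform gives a fraction of 0, while the orientation-flipping field gives 1, bracketing the <0.1% target used to choose σ₀.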

Unsupervised network
For the unsupervised model, γ/ω = 50 generated the highest DSC in the limit ω, γ → ∞. This also generated the lowest MI = −0.73, i.e. %MI = 100, among all explored hyperparameter combinations for this model. The maximum MI = −0.52, i.e. %MI = 0, was obtained when no registration was applied. The optimization of ω showed that the DSC remained roughly stable at 0.855. However, the average normalized MI and the average KL-divergence metric showed a trade-off for different values of ω (figure 4).
When ω = 20, the MI decreased only 15% from its maximum, while the quality of the deformation field distribution had almost reached its optimum (KL-divergence ≈ 0). Moreover, the ECE analysis revealed a decline in conjunction with the KL-divergence metric for lower values of ω, reaching its minimum (0.04) at ω = 20, as illustrated in figure 5. As a compromise between registration accuracy and uncertainty reliability, ω = 20 was used for the supervised and unsupervised models.

Supervised network
To fine-tune the hyperparameter α of the supervised network, the optimal hyperparameters of the unsupervised network (γ/ω = 50 and ω = 20) were used. The MI and the KL-divergence metric remained stable for different values of α, but the DSC increased from 0.855 at α = 0 to 0.863 at α = 100. For this reason, in the following experiments, we used γ/ω = 50, ω = 20, and α = 100 for the supervised model.

Qualitative assessment
The visual inspection of figure 6 shows that the model could effectively predict a DDF that generates a wCT with characteristics similar to the CBCT, even in the presence of strong anatomical changes.
Because of the superior quality of the pCT, some anatomical structures can be found on it but not in the daily CBCTs. In figure 6, case 1, a blood vessel and muscle structures can be spotted in the pCT but are not clearly seen in the CBCT. Because of this information loss, the model predicts high values of Diag(Σ(ψ, σ₀)) in those regions.
High uncertainty is also found in regions where the model could not find a proper DDF to register the pCT to the CBCT. In figure 7, case 2, the anatomical differences in the oral cavity between the wCT and the CBCT are associated with an increase of Diag(Σ(ψ, σ₀)). Image artifacts were also responsible for high DDF uncertainty. In figure 7, case 3, the CBCT is cropped in the bottom right region of the pCT. Because of this loss of information, the DDF shows high uncertainty in that region. Note that high-contrast regions, such as bones, were generally associated with low uncertainty, while low-contrast regions in the pCT and CBCT were associated with high uncertainty. The magnitude of the DDF uncertainty also affects the confidence in the position of the warped structures. Figure 8 shows that the supervised model is confident (PM ≈ 1) in regions where most binary structures agree and not confident in regions where the binary structures disagree (PM ≈ 0.5).

(Figure 8 caption: The black contours were segmented by a physician; the white contours were propagated using the B-spline model; the cyan contours were propagated using the mean of the dense displacement field obtained with the dropout model; and the probability maps (color map) were generated using 60 samples from the supervised model on a test set patient. The red arrow shows that regions in which the binary contours do not match each other present low confidence in the probability map (PM ≈ 0.5).)

Aleatoric uncertainty analysis
Figure 10 shows that the median DDF uncertainty of the supervised model increases significantly with the intensity of the smoothing filter applied to the input images. This agrees with the observation in figure 7, which shows that low-contrast regions have large DDF uncertainty. However, the VoxelMorph model does not associate significant changes in the predicted uncertainty with the degree of smoothing, and the dropout model predicts a smaller uncertainty when the contrast of the image decreases.

(Table 1 caption: Comparison of different deformable image registration models that can predict DIR uncertainty. The first value shows the median of the metric over all scenarios, and the brackets show the first and third quartiles of the distribution. DSC is the dice similarity coefficient between the structures propagated with µ and the physician's structures (higher is better). MI is the average mutual information between 60 different samples of the warped CTs and the CBCT (lower is better). The KL divergence is the metric developed in this work to evaluate the quality of the DIR distribution (lower is better). The ECE is the average expected calibration error calculated between the probability maps for all the structures (lower is better). The run time is the time spent by the computer to obtain the uncertainty of the DDF.)

Model comparison
The hyperparameter tuning of the dropout and VoxelMorph models also showed a trade-off similar to the one shown in figure 4 (see results in the supplementary material). The hyperparameters of these models were selected such that these models generate the same MI on the validation dataset as obtained by the supervised and unsupervised models.
Even though the DSC for all the test contours and all the models does not present obvious visible differences (figure 9), the supervised model exhibited statistically superior performance (p < 0.005) compared to the unsupervised model in 6/10 assessed organs (organs 2, 4, 5, 6, 8 and 9). At the same time, it demonstrated statistically inferior performance in the submandibular glands and the thyroid when compared to the unsupervised model. In contrast, when compared to the dropout model, the unsupervised model displayed statistically better performance in 5/10 organs (3, 5, 6, 8 and 9), while no statistically significant differences were observed in the remaining organs. A similar pattern was observed when comparing the supervised model with the B-spline model, as the supervised model exhibited superior performance in organs 1, 3, 4, 5 and 6, while no statistical differences were observed in the other organs. Notably, the unsupervised model outperformed the VoxelMorph model in all the assessed organs.
Although the supervised model demonstrated enhanced performance in specific organs, with a higher DSC compared to both the unsupervised model and the B-spline model, there were no statistically significant differences at the p < 0.05 level in terms of the average DSC across all organs (table 1). This suggests that while the supervised approach may excel in specific organ segmentation tasks, its overall performance, as measured by the average DSC across all organs, remains comparable to that of the B-spline and unsupervised models. However, the supervised model outperformed both the dropout and VoxelMorph models as measured by the DSC.
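As a reference for the contour comparisons above, a minimal DSC implementation over binary masks can be sketched as follows (the function name and the empty-mask convention are illustrative choices, not taken from the authors' code):

```python
import numpy as np

def dsc(a, b):
    """Dice similarity coefficient between two binary masks (1.0 = perfect overlap)."""
    a = np.asarray(a, bool)
    b = np.asarray(b, bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom
```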
When analyzing the MI, no statistically significant differences were observed between the supervised model and the dropout, unsupervised and VoxelMorph models. This can be attributed to the hyperparameter tuning, which aimed to achieve comparable image similarity performance in order to assess the uncertainty estimation capabilities fairly.
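A histogram-based estimate of the mutual information used for this image-similarity check can be sketched as below; the bin count is an arbitrary choice and the implementation is illustrative, not the authors':

```python
import numpy as np

def mutual_information(x, y, bins=32):
    """MI (in nats) between two images, estimated from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    pxy = joint / joint.sum()                       # joint intensity distribution
    px = pxy.sum(axis=1, keepdims=True)             # marginal of x
    py = pxy.sum(axis=0, keepdims=True)             # marginal of y
    nz = pxy > 0                                    # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

In the table, the MI would be averaged over the 60 sampled warped CTs against the CBCT.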
When evaluating the ECE of the probability maps and the KL divergence metric, the two models presented in this study demonstrated superior performance compared to the other models. Additionally, in terms of computational efficiency, our models outperformed all the models except for VoxelMorph. This discrepancy arises from the nature of the algorithms employed. Our model and VoxelMorph directly estimate uncertainty from the network, whereas the other models need sampling over various parameters, resulting in time-consuming computations.
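The ECE of a structure's probability map can be computed as sketched below. The equal-width binning scheme is an assumption (the paper does not specify it), and the names are illustrative:

```python
import numpy as np

def expected_calibration_error(prob_map, mask, n_bins=10):
    """ECE between predicted voxel probabilities and a binary ground-truth mask."""
    p = np.asarray(prob_map, float).ravel()
    y = np.asarray(mask, float).ravel()
    # Assign each voxel to an equal-width confidence bin in [0, 1].
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        sel = idx == b
        if sel.any():
            # |mean confidence - observed frequency|, weighted by bin occupancy
            ece += sel.mean() * abs(p[sel].mean() - y[sel].mean())
    return ece
```

The table's value would then be this quantity averaged over all the structures.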

Discussion
In this work, we built a probabilistic multi-resolution image registration model that uses prior and image information to estimate the parameters of a normally distributed DDF using CNNs. The model offers the flexibility to be trained in either an unsupervised or supervised manner, enabling the integration of segmentation probability maps to enhance the accuracy of the DDF distribution prediction. Our model has shown comparable or superior performance when compared to other DIR models considering metrics related to DDF accuracy, such as MI and DSC, and has outperformed all the methods when considering metrics that evaluate the DDF distribution quality, such as the KL divergence and ECE. The analysis carried out in this paper has shown that our model can generate registrations in a matter of seconds, allowing fast and reliable DIR.
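Since the network outputs the mean and a diagonal covariance of a multivariate normal DDF, samples can be drawn voxel-wise. A minimal sketch, with illustrative array names and shapes (not the authors' implementation):

```python
import numpy as np

def sample_ddf(mu, sigma, n_samples=60, seed=0):
    """Draw DDF samples from N(mu, diag(sigma^2)).

    mu, sigma: arrays of shape (3, D, H, W) holding the per-voxel mean
    displacement and standard deviation along each spatial axis.
    Returns an array of shape (n_samples, 3, D, H, W).
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + sigma * eps
```

Each sample can then warp the pCT or a structure mask; averaging the 60 warped binary masks yields probability maps like those shown in figure 8.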
The assumption that the propagated structures follow an independent Bernoulli distribution generates a CE term in the loss function of the supervised model. This metric has been extensively used in segmentation models but had not been used to optimize a DIR distribution before (Loi et al 2018, Boyd et al 2021). The integration of CE into our model offers distinct advantages over the DSC employed in previous works (Hu et al 2018, Dalca et al 2019), particularly in terms of providing a more probabilistic interpretation of the contour-matching metric and improving the calibration of the probability maps (Mehrtash et al 2019). In summary, the supervised model presented in this paper has a significant advantage over other methods, as it allows not only the use of segmentations provided by a physician but also of segmentations coded as probability maps, where the probability maps reflect the aggregated overlap of contours drawn by multiple evaluators.
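Under the independent Bernoulli assumption, the contour-matching term reduces to a voxel-wise binary cross-entropy between the warped probability map and the target map. A minimal sketch (the clipping constant is an implementation detail assumed here, not taken from the paper):

```python
import numpy as np

def bernoulli_ce(warped_prob, target_prob, eps=1e-7):
    """Mean voxel-wise cross-entropy between a warped probability map and a
    target map (a binary mask or aggregated multi-observer probabilities)."""
    p = np.clip(np.asarray(warped_prob, float), eps, 1.0 - eps)  # avoid log(0)
    t = np.asarray(target_prob, float)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))
```

Note that the target need not be binary, which is what allows training directly on probability maps aggregated from multiple evaluators.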
Unlike other evaluation methods that use DDFs derived from plausible anatomical deformations to assess the DDF accuracy (van Kranen et al 2009, Fu et al 2021), our approach provides a way to perform a probabilistic evaluation of the whole DDF distribution in a multimodal image registration problem. Notably, we observed a correlation between this metric and the ECE (figure 5), a metric that indirectly assesses the quality of the DDF by looking at the calibration of the probability maps. However, our metric presents some limitations, since it only works for multivariate normal DDF distributions and requires the generation of likely anatomical deformations.
The hyperparameter tuning of the deep learning models employed in this study revealed a trade-off between the average MI, calculated with different realizations of the wCT and the CBCTs, and the quality of the DDF distribution. This trade-off arises from the fine-tuning of the hyperparameter that governs the amplitude of the DDF dispersion. Higher DDF dispersions can impact voxel-wise matching in regions with high intensity gradients, resulting in reduced image similarity. These trade-off curves are characteristic of a model's performance in predicting DIR uncertainty, since better models will preferentially increase the uncertainty in regions of the image where this will not significantly affect the image similarity (low-contrast areas). Our unsupervised and supervised models achieved the best trade-off balance between these two metrics, allowing the generation of a DDF distribution with a KL divergence of 0.2 at an average normalized MI (nMI) of 85% during validation. Even though we selected ω = 20 to generate these values, the optimal value of this hyperparameter should be selected based on how much accuracy the user is willing to sacrifice to gain confidence in the estimated uncertainty. This trade-off is particular to the problem at hand.
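Once the estimated DDF distribution is normalized as in figure 4, the KL term behind this trade-off has a closed form for the empirical moments of z(Ω*): KL(N(μ, σ²) ‖ N(0, 1)) = ½(σ² + μ² − 1 − ln σ²). A sketch of this computation (an assumed reading of the metric, kept univariate for brevity):

```python
import numpy as np

def kl_to_standard_normal(z):
    """Closed-form KL(N(mu, var) || N(0, 1)) from the empirical moments of z."""
    z = np.asarray(z, float).ravel()
    mu = z.mean()
    var = z.var()
    return 0.5 * (var + mu**2 - 1.0 - np.log(var))
```

A well-calibrated DDF distribution yields z(Ω*) close to N(0, 1) and hence a KL value near zero, matching the behavior shown in figures 4(B) and (C).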
The visual inspection of the DDF uncertainty showed that our model presents high uncertainty in smoother regions of the images (aleatoric uncertainty) and in regions where the matching of different anatomical structures was not performed accurately (epistemic uncertainty). The uncertainty estimated with our model might also be useful for out-of-distribution (OOD) detection. In the experiment conducted in figure 10, the use of over-smoothed images can be regarded as an exploration of OOD scenarios. From the model's response to such a simple analysis, we suspect that it can recognize inputs that fall outside the distribution it was trained on. The heightened uncertainty for these extreme cases indicates that the model exhibits greater uncertainty when encountering unfamiliar or abnormal inputs, such as over-smoothed images. However, a more detailed analysis should be performed to make a stronger claim.
Despite the theoretical advantages of the supervised model over the unsupervised model, the analysis of the DSC revealed superior performance in only 6 out of 10 organs. Moreover, the supervised model did not exhibit better results when assessing the average DSC across all organs. We suspect that the absence of ground-truth segmentations during the training of the supervised model limited its ability to generate even better contour propagation. This problem is significantly more relevant in small organs such as the submandibular glands and the thyroid, in which subtle differences between the predicted and validation masks generate larger changes in the DSC. This might explain why the unsupervised model performed better on these structures. On the other hand, it is expected that, overall, the dropout model and the supervised model did not present statistical differences in DSC because both methods are supervised. Conversely, the supervised model achieved a better DSC than VoxelMorph because the former was trained in a supervised way.
The overall MI did not show statistical differences for the learning-based models because we set the loss-function weights to generate similar MI between these models, in order to make a fair analysis of the DDF distribution. The quality of the DDF distribution was relevant only for the unsupervised and supervised models developed in this work. The main reason why the VoxelMorph model could not predict a relevant DDF distribution while generating a high image similarity is that, in the loss function, the model couples the uncertainty term with the regularization term through the hyperparameter λ. The dropout model's lack of performance in predicting a good DIR distribution can be interpreted as a limitation of the model in estimating the aleatoric uncertainty (figure 10). As Kendall and Gal (2017) described, such models cannot estimate the uncertainty related to noise in the input data. Although we could not assess the quality of the DIR distribution of the MCBS model, its performance can be estimated to be poor because of its high probability-map ECE. However, we believe these results might be influenced by a poor selection of the phase space used to investigate the B-spline uncertainty.
Incorporating structures within the supervised model inevitably introduces a bias in estimating the DDF uncertainty when compared to the unsupervised model. While quantifying the exact magnitude of this bias presents challenges, the assessment of uncertainty prediction consistency, as gauged by the KL divergence, reveals statistically significant differences (Wilcoxon p < 0.05). Specifically, the supervised model demonstrates an average KL divergence of 0.11, whereas the unsupervised model registers a slightly higher average of 0.15. This statistically significant difference in uncertainty prediction between the two methods suggests a bias introduced by the supervised model. Incorporating multiple physician segmentations during the training process could increase the bias in the supervised model, thereby generating better DIR distribution predictions in the vicinity of the structures.
One of the most direct applications of this method in the clinic is in adaptive radiotherapy, since the estimation of DIR uncertainty can be used to better quantify the dose accumulated over the treatment fractions and to decide whether re-planning is needed (Chetty and Rosu-Bubulac 2019). Furthermore, fast and reliable DIR can be used for contour propagation in daily online adaptive radiotherapy. The ability to predict the structure uncertainty might open the possibility of speeding up the adaptive workflow by incorporating these uncertainties into the treatment optimizer, thus reducing the need for time-consuming physician review of the structures.

Conclusion
In conclusion, we developed a probabilistic multi-resolution image registration model that uses CNNs to estimate the DDF distribution between two multimodal images. Compared to other models that can estimate DIR uncertainty, our model presented equal or better contour propagation capabilities, predicted a higher-quality DDF distribution and presented faster execution times. Future research will study the consequences of the DIR uncertainty when re-planning in a daily adaptive radiotherapy workflow.

Figure 1 .
Figure 1. Illustration of the training workflow used in our supervised and unsupervised models. The solid lines show the data flow for the unsupervised model, and the solid lines plus the dashed lines show the data flow for the supervised model.

Figure 2 .
Figure 2. Illustration of the method used to evaluate a normalized dense displacement field (DDF) distribution. m and f are the moved and fixed images, respectively; Ω is a DDF generated by registering the fixed image to one of the follow-up images; Ω ∘ f is a virtual image generated by warping f with a known DDF; ϕ1 and ϕ2 are the DDF distributions created with the DIR method; and Ω* is the estimated DDF distribution which registers f to Ω ∘ f.

Figure 3 .
Figure 3. Comparison of the middle slice of the normalized Gaussian kernel (A) and the normalized effective kernel (B) used in this work. (C) shows the diagonal profile indicated by the lines in images (A) and (B).

Figure 4 .
Figure 4. (A) shows the hyperparameter optimization plot for the unsupervised model with γ ω = 50. The blue plot shows the average KL divergence metric evaluated on the validation dataset as described in the methods (lower is better). The axis is inverted to show lower KL divergence at the top. The red plot shows the average normalized mutual information evaluated on the validation dataset (higher is better image similarity). (B) and (C) compare the distribution z(Ω*) (blue) with N(0, 1) (orange) for a specific sample of the validation dataset and ω equal to 20 and 200, respectively.

Figure 5 .
Figure 5. Relation between the KL-Divergence metric developed in this work and the expected calibration error (ECE) for different values of the hyperparameter ω of the unsupervised model.

Figure 6 .
Figure 6. Coronal, sagittal and axial images of two different patients (1 and 2) of the test set. The first, second and third rows show slices from the planning CT (pCT), the last CBCT of the treatment and the warped CT (wCT), propagated with the mean DDF, respectively. The DDF is illustrated with small arrows superimposed over different planes of the pCT. The color of the arrows qualitatively indicates the intensity of the DDF in the region: red for large, yellow and green for moderate and black for mild deformations. The large red arrows indicate the regions of the pCT which shrank during treatment.

Figure 7 .
Figure 7. Different images driving the uncertainty of the dense displacement field (DDF). Columns show (a) the planning CT (pCT), (b) the CBCT, (c) the warped CT (wCT), propagated with the mean DDF, and (d) the diagonal elements of Σ(ψ, σ0), for different test samples in rows 1, 2 and 3. The DDF is illustrated with small arrows superimposed on the pCT. The color of the arrows indicates the intensity of the DDF in the region: red for large, yellow/green for moderate and black for mild deformations. The large red arrows show regions of the pCT that do not match the CBCT. The blue arrow shows a region of the patient's anatomy which is not captured in the CBCT. The vertical gray bar on the right is a scale for the intensity variations within the RGB image of Diag(Σ(ψ, σ0)). Within this RGB representation, the uncertainty along the right-left direction is depicted in red, the uncertainty along the posterior-anterior direction is represented in green, and the uncertainty along the inferior-superior direction is visualized in blue.

Figure 8 .
Figure 8. Warped planning target volume for two different patients of the test set. The black contours were segmented by a physician; the white contours were propagated using the B-spline model; the cyan contours were propagated using the mean of the dense displacement field obtained using the dropout model; and the probability maps (color map) were generated using 60 samples from the supervised model on a test set patient. The red arrow shows that regions in which the binary contours do not match each other present low confidence in the probability map (PM ≈ 0.5).

Figure 9 .
Figure 9. Dice similarity coefficient (Dice) calculated with the four calibrated methods in 10 different organs of the test dataset.

Figure 10 .
Figure 10. Dependence of the dense displacement field uncertainty on the image contrast. The y-axis shows the normalized average median Diag(Σ(ψ, σ0)) calculated for all the test samples. Because of differences in the magnitude of the uncertainty predicted with the different methods, the median uncertainty values were normalized by their value at σ = 0. The x-axis shows the standard deviation of the Gaussian filter applied to the input image pairs. The small images at the bottom of the figure show an example of the CBCTs smoothed with σ.