Predicting dice similarity coefficient of deformably registered contours using Siamese neural network

Objective. Automatic deformable image registration (DIR) is a critical step in adaptive radiotherapy. Manually delineated contours of organs-at-risk (OARs) on planning CT (pCT) scans are deformably registered onto daily cone-beam CT (CBCT) scans for delivered dose accumulation. However, evaluation of registered contours requires human assessment, which is time-consuming and subject to high inter-observer variability. This work proposes a deep learning model that allows accurate prediction of the Dice similarity coefficient (DSC) of registered contours in prostate radiotherapy. Approach. Our dataset comprises 20 prostate cancer patients with 37–39 daily CBCT scans each. The pCT scans and planning contours were deformably registered to each corresponding CBCT scan to generate virtual CT (vCT) scans and registered contours. The DSC, a common contour-based validation metric for registration quality, was computed between the registered and manual contours. A Siamese neural network was trained on the vCT-CBCT image pairs to predict DSC. To assess the performance of the model, the root mean squared error (RMSE) between the actual and predicted DSC was computed. Main results. The model showed promising results for predicting DSC, giving RMSEs of 0.070, 0.079 and 0.118 for the rectum, prostate, and bladder respectively on the holdout test set. Clinically, a low RMSE implies that the predicted DSC can be reliably used to determine if further DIR assessment from physicians is required. Considering the event where a registered contour is classified as poor if its DSC is below 0.6 and good otherwise, the model achieves an accuracy of 92% for the rectum. A sensitivity of 0.97 suggests that the model can correctly identify 97% of poorly registered contours, allowing manual assessment of DIR to be triggered. Significance.
We propose a neural network capable of accurately predicting DSC of deformably registered OAR contours, which can be used to evaluate eligibility for plan adaptation.


Introduction
In conventional radiotherapy, treatment is typically planned days in advance using a snapshot planning computed tomography (pCT) scan of the patient's anatomy taken during simulation, then delivered in fractions across multiple weeks. The patient may undergo inter-fractional and intra-fractional anatomical changes during the treatment period (Guckenberger et al 2007, Josipovic et al 2012). Inter-fractional changes refer to changes in patient anatomy that occur between fractions of radiotherapy treatment, typically due to weight gain or loss and organ displacements. Such changes occur over a period of days or weeks. Intra-fractional changes occur within a single fraction of radiotherapy treatment, typically from breathing and digestive or metabolic activity. Such changes occur on a time scale of seconds to minutes. As a result, using the same initial treatment plan across all
fractions may lead to differences between planned dose and actual delivered dose to the target and organs-at-risk (OARs) (Noble et al 2019, Tamihardja et al 2021). This effect is especially pronounced in proton and heavy ion radiotherapy, where the benefits of a conformal dose distribution come at a cost of higher sensitivity to daily anatomical changes (Paganetti et al 2021).
Adaptive radiotherapy (ART) aims to customise the treatment plan based on the patient's anatomical measurements of the day, thereby allowing safe dose escalation and improving treatment outcomes (Yan et al 1997). To achieve accurate daily dose accumulation, ideally, the target and OARs need to be delineated on daily cone-beam computed tomography (CBCT) scans, which is a laborious task if performed manually by human experts (Lustberg et al 2018). In addition, manual contouring on CBCT scans is subject to higher inter- and intra-observer variability (Choi et al 2011, Nishioka et al 2013) due to their poorer image quality (Létourneau et al 2005, Stock et al 2009). Image registration presents an alternative for automatically propagating contours delineated on the pCT scans onto the daily CBCT scans (Faggiano et al 2011, Yeap et al 2017). It is the process of finding the optimal geometric transformation that relates identical points between two image series (Brock et al 2017). Among the three categories of image registration, namely rigid, affine, and deformable image registration (DIR), most clinical settings employ rigid-body registration for soft tissue or bony anatomy matching (Lim-Reinders et al 2017). However, as DIR offers the highest number of degrees of freedom, namely three times the number of voxels in the source dataset (Brock et al 2017), it is able to capture the non-rigid anatomical changes of tumours and OARs. As such, validation of DIR accuracy is an essential quality assessment step in the clinical implementation of ART, especially in online ART where the treatment plan is adapted according to ongoing changes during treatment (Lim-Reinders et al 2017). Manual evaluation of deformably registered contours by human experts is time-consuming, hence there is a need for an automatic and reliable method for validation of DIR accuracy.
Methods to determine registration errors between manually annotated points are becoming a popular area of research (Bierbrier et al 2022). Sokooti et al quantified registration errors using random regression forests with both intensity-based and registration-based features (Sokooti et al 2019), while another work estimated registration errors through independent directions using a random forest regressor with high accuracy (Saygili 2021). Saygili et al estimated the confidence of a registration based on stereo matching algorithms (Saygili et al 2015). Muenzing et al employed a supervised learning method to classify registration quality into the correct, poor or wrong categories (Muenzing et al 2012). Other works have made use of 3D convolutional neural networks to estimate registration errors from pairs of image patches centred around voxels with subvoxel accuracy (Eppenhof and Pluim 2018).
In this work, we focus on the assessment of registration quality in terms of the Dice similarity coefficient (DSC). The workflow involves the use of DIR to propagate OAR contours delineated on pCT scans onto daily CBCT scans for dose evaluation. We propose a deep learning model to automatically estimate the DSC of registered OAR contours, in order to assess their reliability for daily adaptations. To our knowledge, this is the first application of a neural network to DSC prediction, and we believe the methodology in this work will be relevant to many centres which are planning to implement ART in the clinic.

Methods
This study was approved by the SingHealth Centralised Institutional Review Board, Singapore.

Patient characteristics and dataset
The dataset comprised 20 retrospective high-risk prostate cancer (HR-PCa) patients treated at National Cancer Centre Singapore between 2016 and 2019. These patients received a sequential treatment regimen with a total dose prescription of 74–78 Gy over 37–39 fractions (Ong et al 2021, 2022). The patients underwent a full-bladder filling protocol. Daily CBCT scans were taken before each fraction for positioning, giving 760 CBCT scans in total. The OARs of interest (namely rectum, prostate and bladder) were contoured on each pCT and CBCT scan by the same experienced radiation oncologist (JKL Tuan) for this study. Figure 1 shows the manually contoured OARs on a CT and CBCT scan.
Some images contained imaging artefacts; these were retained in the dataset to improve the generalisability of the model. The dataset was further augmented by mirroring each scan along the vertical axis, giving 1520 vCT-CBCT input pairs in total. An additional experiment was run where the dataset was further augmented by: (i) introducing a random rotation between −10° and 10°, and (ii) introducing Gaussian noise to the original 760 image pairs. Together with the vertically flipped pairs, which are a simple but effective way to augment the dataset given the presence of slight asymmetries in pelvic scans, this brings the total to 3040 image pairs.
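The flip and noise augmentations described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the rotation step is omitted (it requires an interpolation routine), and the noise standard deviation is our own placeholder value, not one specified in the study.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(volume: np.ndarray, noise_sigma: float = 10.0):
    """Return two augmented copies of a CT/CBCT volume (slices, rows, cols).

    Implements the simplest augmentations described in the text: a mirror
    along the vertical axis and additive Gaussian noise. noise_sigma (in HU)
    is an illustrative assumption.
    """
    flipped = np.flip(volume, axis=2)  # mirror each slice left-right
    noisy = volume + rng.normal(0.0, noise_sigma, volume.shape)
    return flipped, noisy

vol = np.zeros((64, 128, 128))
flipped, noisy = augment(vol)
print(flipped.shape, noisy.shape)  # (64, 128, 128) (64, 128, 128)
```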

DIR and parameter optimisation
DIR was performed on RayStation 10A (RaySearch Laboratories, Stockholm, Sweden). A hybrid DIR using a combination of image intensity information and anatomical information was used. The optimisation problem was based on an objective function that combines an image similarity term, a grid regularisation term, and anatomical penalty terms (Oh and Kim 2017, Motegi et al 2019).
In all registrations, the pCT scan was assigned as the target (moving) image set, while the same patient's daily CBCT scans were assigned as the reference (fixed) image set. Each pCT scan was deformed into the coordinate system of the CBCT scan, generating a virtual CT (vCT) scan. Manually delineated OAR contours on the pCT scan were then registered onto each CBCT scan.
The DSC, computed as twice the area of intersection of two contours divided by the sum of their areas as shown in equation (1), was then calculated between the deformably registered contour and the manually delineated contour (i.e. ground truth) on each CBCT scan using the Python Shapely v2.0.1 package. While it has been reported that the DSC has limitations in evaluating small structures (Reinke et al 2021) such as seminal vesicles, we assess that the use of DSC may be appropriate for the OARs considered in this study given their relatively larger volumes (Roeske et al 1995):

DSC(A, B) = 2|A ∩ B| / (|A| + |B|),    (1)

where A and B denote the two contours.

DIR parameters were optimised for all 20 patients, using a subset of 8 CBCT scans (1st, 6th, 11th, 16th, 21st, 26th, 31st, and 36th fraction) per patient. Correlation coefficient (CC) and mutual information (MI) were used as similarity measures given their suitability for multimodal image registrations. Final grid resolutions of 0.1 cm, 0.2 cm, 0.3 cm, and 0.4 cm were tested, as a high grid spacing can result in small structures being disregarded while a low spacing can result in erratic registration behaviour. The final grid resolutions and similarity measures were optimised, and the parameters that give the widest range of DSC values were selected.
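For illustration, the DSC of equation (1) can be evaluated on rasterised binary masks with NumPy. This sketch uses toy masks of our own devising rather than the Shapely polygon operations applied in the study.

```python
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for two boolean masks."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Toy example: two overlapping 4x4 square "contours" on a 10x10 grid
a = np.zeros((10, 10), dtype=bool); a[2:6, 2:6] = True  # 16 voxels
b = np.zeros((10, 10), dtype=bool); b[4:8, 4:8] = True  # 16 voxels
print(dice_coefficient(a, b))  # 2*4 / (16+16) = 0.25
```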

Image pre-processing steps
The image pre-processing steps prior to model training are illustrated in figure 2. The steps include: (i) clipping of CT numbers to ±200 in order to enhance contrast between soft tissues in the body, (ii) min-max normalisation of pixel values to the range (0, 1) for better training, (iii) down-sampling of each slice to 256 by 256 pixels for computational efficiency, and finally (iv) cropping each slice down to the central portion of the body (128 by 128 pixels) to remove most of the space outside the body. Figure 3 shows some sample scans with artefacts and with rotation and noise introduced after pre-processing.
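The four steps above can be sketched with NumPy as follows. This is a minimal illustration assuming a 512 by 512 input slice and simple 2 × 2 block averaging for the down-sampling step; the study's exact resampling method is not specified, and the function name is our own.

```python
import numpy as np

def preprocess_slice(ct_slice: np.ndarray) -> np.ndarray:
    """Apply the four pre-processing steps to one 512x512 CT/CBCT slice."""
    # (i) Clip CT numbers to ±200 to enhance soft-tissue contrast
    x = np.clip(ct_slice.astype(np.float32), -200.0, 200.0)
    # (ii) Min-max normalise pixel values to the range (0, 1)
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    # (iii) Down-sample to 256 x 256 via 2x2 block averaging (assumption)
    x = x.reshape(256, 2, 256, 2).mean(axis=(1, 3))
    # (iv) Crop the central 128 x 128 region of the slice
    return x[64:192, 64:192]

slice_in = np.random.randint(-1000, 1500, size=(512, 512))
out = preprocess_slice(slice_in)
print(out.shape)  # (128, 128)
```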

Siamese neural network
Convolutional neural networks have been shown to be effective in image analysis (Anwar et al 2018), as they are designed to automatically learn features from images without the need for manual feature extraction. This is especially useful for medical images, which can be complex and difficult to interpret.
A Siamese neural network, which is capable of computing similarity measures between two different input vectors, was chosen. It uses two identical artificial neural networks (or 'towers') that share the same weights, each learning a representation of its input vector. Some well-known applications include fingerprint verification and face recognition (Taigman et al 2014). The purpose of weight sharing is to enforce similarity between the representations of the different inputs processed by the towers; hence, the inputs to the towers should be similar. In our centre, the CBCT imaging systems on our linear accelerators undergo routine quality assurance, and HU values for different materials on the Catphan phantom (Phantom Laboratory, USA) lie within ±40 of those on fan-beam CT scans. Hence, it is appropriate to use CT and CBCT scans as inputs to the two towers and to apply the same weights to generate their respective representations. This work is the first use case of a Siamese neural network for DSC prediction in DIR. In this study, a corresponding pair of pre-processed vCT and CBCT scans, each with 128 by 128 by 64 voxels, was used as the input vectors, while the true DSC values were used as the target label. 20% of the dataset was set aside as test data, while the remaining 80% was used for training with 10-fold cross validation.
Given the limited training data, transfer learning using the pre-trained ResNet-50 deep residual network (He et al 2016) was used for both network towers, with the weights shared between them. As ResNet-50 is trained on 2D images with 3 RGB channels, this work expanded the weight dimensions to take the number of slices into account. To initialise the additional channels, the mean of the initial weights was copied into them. Using TensorFlow v2.7, a Lambda layer then estimates the similarity of the two representations by calculating the Euclidean distance between the vectors. Finally, a dense layer using the sigmoid activation function outputs the predicted DSC value in the (0, 1) range. The model predicts a single DSC value, hence 3 sets of parameters are required to predict the DSC values for the 3 organs. The model has over 23 million trainable parameters.
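The overall forward pass (shared-weight towers, a Euclidean-distance Lambda layer, and a sigmoid output head) can be illustrated with a toy NumPy sketch. Here a single shared linear layer stands in for the ResNet-50 towers, and all sizes and weight values are our own illustrative choices, not the trained model; with a negative output weight, a larger distance between representations maps to a lower predicted DSC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "tower": one linear layer + ReLU standing in for ResNet-50.
# Both inputs are mapped with the SAME weights (weight sharing).
W = rng.normal(size=(64, 8))  # (flattened input dim, embedding dim)

def tower(x: np.ndarray) -> np.ndarray:
    return np.maximum(x.ravel() @ W, 0.0)

def predict_dsc(vct: np.ndarray, cbct: np.ndarray,
                w_out: float = -1.0, b_out: float = 3.0) -> float:
    """Siamese forward pass: shared towers -> Euclidean distance -> sigmoid."""
    d = np.linalg.norm(tower(vct) - tower(cbct))        # "Lambda" layer
    return 1.0 / (1.0 + np.exp(-(w_out * d + b_out)))   # dense sigmoid head

img = rng.normal(size=(8, 8))
same = predict_dsc(img, img)                    # identical inputs: distance 0
diff = predict_dsc(img, rng.normal(size=(8, 8)))
print(same > diff)  # an identical pair scores higher than a dissimilar one
```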
Three loss functions, namely the contrastive loss, L1 loss, and L2 loss, were compared for the rectum. While the L1 or L2 loss is commonly used for regression tasks and is appropriate for this work since the goal is to predict a continuous value representing the similarity between the vCT and CBCT images, we found that the L1 loss did not perform as well, while the L2 loss performed comparably with the contrastive loss. In this work, the contrastive loss function was used, as it encourages similar pairs of vCT-CBCT scans to have similar representations while pushing dissimilar pairs to have different representations. This makes it easier to predict the DSC score between two images. To learn the parameters, W, the cost increases with the distance between the predicted and true DSC scores. The contrastive loss function can be expressed as (Hadsell et al 2006):

L(Y, Ŷ) = ½(1 − Y)Ŷ² + ½Y[max(0, m − Ŷ)]²,    (2)

where Y is the true DSC score (from 0 to 1), Ŷ is the predicted DSC score, and m is a margin. The first term of equation (2) has a higher weight if the image pair has a lower DSC score, while the second term has a higher weight if the image pair has a higher DSC score. A batch size of 4, the Adam optimiser, and an exponential decay learning rate schedule from 0.1 to 0.000001 were used. Figure 4 illustrates the network architecture used.
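A minimal NumPy sketch of this adapted contrastive loss follows. We assume the Hadsell-style form with the predicted DSC taking the place of the embedding distance and a margin of m = 1; both are our reading of the description above, not values stated in the study.

```python
import numpy as np

def contrastive_loss(y_true, y_pred, margin: float = 1.0) -> float:
    """Contrastive loss adapted for DSC regression (after Hadsell et al 2006).

    The (1 - Y)-weighted term dominates for poorly registered pairs (low DSC)
    and penalises high predictions; the Y-weighted term dominates for well
    registered pairs (high DSC) and penalises predictions below the margin.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    low = 0.5 * (1.0 - y_true) * y_pred ** 2
    high = 0.5 * y_true * np.maximum(0.0, margin - y_pred) ** 2
    return float(np.mean(low + high))

# Accurate predictions incur less loss than inaccurate ones
good = contrastive_loss([0.9, 0.2], [0.85, 0.25])
bad = contrastive_loss([0.9, 0.2], [0.25, 0.85])
print(good < bad)  # True
```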

Results

DIR parameter optimisation
The violin plots of DSC values for the 8 different combinations of grid resolutions and similarity measures are shown in figure 5. The choice of similarity measure was observed to affect DSC values for the rectum and prostate, but no significant differences were observed for the bladder.

Evaluation of neural network predictions
The root mean squared error (RMSE) and R² values of the predicted DSC values were used as evaluation metrics for the model, as given in table 1. The model showed promising results for predicting DSC, giving RMSEs of 0.070 and 0.079 for the rectum and prostate respectively on the holdout test set. The predictions for bladder DSC values did not perform as well, giving a higher RMSE of 0.118. Figure 7 shows the model predictions on the training and test sets, while figure 8 shows the true and predicted DSC of the rectum and bladder for two CBCT-vCT image pairs.
The model's ability to identify poorly registered contours (i.e. contours with low DSC values) was also evaluated. In this study, we consider a contour to be poorly registered if its DSC value falls below a defined threshold, and well registered otherwise. The AAPM Task Group No. 132 has suggested a DSC tolerance of 0.80–0.90 (Brock et al 2017). Taking the threshold to be 0.6 for the rectum and 0.8 for the prostate and bladder, the model achieves accuracies of 92%, 82% and 81% respectively. The model also achieves high sensitivities of 0.97, 0.94 and 0.97 respectively, meaning it correctly flags 97%, 94%, and 97% of poorly registered rectum, prostate and bladder contours. The model is thus able to identify the vast majority of poorly registered contours, allowing manual inspection of DIR to be triggered. In particular, the model also achieved a reasonably high specificity of 0.59 for the rectum, meaning that it is able to correctly identify 59% of well registered rectum contours. The accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) are also shown in table 1.
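The threshold-based metrics reported here can be reproduced from true and predicted DSC values as follows; this is a short NumPy sketch with toy numbers (not the study's data), treating a poorly registered contour (DSC below the threshold) as the positive class.

```python
import numpy as np

def classification_metrics(dsc_true, dsc_pred, threshold=0.6):
    """Compute detection metrics, with 'poorly registered' as positive."""
    true_poor = np.asarray(dsc_true) < threshold
    pred_poor = np.asarray(dsc_pred) < threshold
    tp = np.sum(true_poor & pred_poor)    # poor contours correctly flagged
    tn = np.sum(~true_poor & ~pred_poor)  # good contours correctly passed
    fp = np.sum(~true_poor & pred_poor)
    fn = np.sum(true_poor & ~pred_poor)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),    # fraction of poor contours flagged
        "specificity": tn / (tn + fp),    # fraction of good contours passed
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

m = classification_metrics([0.4, 0.5, 0.7, 0.8], [0.45, 0.65, 0.75, 0.78])
print(m["sensitivity"])  # 0.5: one of the two poor contours was flagged
```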
Using the same model on the extended dataset of 3040 image pairs for the rectum yields a lower RMSE of 0.067 and a higher R² value of 0.64 compared to the original dataset, as seen in table 2. However, with the DSC threshold set at the same value of 0.6, the model's accuracy at classifying well registered and poorly registered contours is lower at 89%. Its sensitivity and specificity are also slightly lower at 96% and 46% respectively. While this suggests that the extended model may not perform better at classifying poorly registered contours in a clinical ART workflow, the result may also be due to the presence of poorer-quality images in the holdout test set.

Discussion
In this study, we showed that a Siamese neural network trained with a relatively small dataset is capable of accurately predicting the DSC values of deformably registered contours of certain OARs. From table 1, the registered contours of the rectum and prostate achieved low RMSE values of 0.070 and 0.079 respectively, implying that the predicted DSC values can be reliably used in clinical settings to determine the accuracy of the registered contours.
From the clinical perspective, it is critical to identify poorly registered contours for further DIR assessment by the physician. In this aspect, the model does a good job for the rectum, with a high sensitivity and specificity of 0.97 and 0.59 respectively. This means that 97% of poorly registered contours and 59% of well registered contours can be accurately identified by the model. The model also achieves a high PPV of 0.95 and a high NPV of 0.71, suggesting that a contour classified by the model as poorly (or well) registered has a high probability of having a low (or high) DSC value. This provides potential time savings from a reduced need to manually inspect every contour. Furthermore, by predicting the DSC value, clinics can implement a 'traffic light system' that indicates whether a deformably registered contour is predicted to be good, acceptable, or unacceptable. This is an easy and time-saving tool for radiotherapists in an online adaptive workflow.

The model was observed to perform worst for the bladder, with an RMSE of 0.118. The low R² value could be attributed to the presence of other confounding variables in our clinical data, such as contouring uncertainties, that are not included in our training. The specificity for the bladder was 0.04, meaning that only 4% of well registered contours were identified. This might be due to artefacts from bowel gas that could obscure the bladder outline superiorly, as reported by other groups (Kochan et al 2017). Nonetheless, as the model's ability to identify poorly registered contours has higher clinical importance, its high sensitivity of 0.97 for the bladder can still provide utility in an ART workflow. The model also has a slightly lower PPV of 0.83, suggesting that some time may be spent manually investigating contours that have been incorrectly classified as poorly registered.
The findings in this work highlight the novelty and benefit of using Siamese neural networks for accuracy prediction of registered contours. Since such network architectures allow comparison of distances between two input vectors, they have shown success in use cases that require similarity evaluation, such as facial and fingerprint verification. Hence, such models are also suitable for comparing the similarity of a deformed pCT (vCT) image and its reference (CBCT) image, and predicting their contour conformity (DSC). Our results further showed that Siamese neural networks are capable of accurately assessing the quality of registered OAR contours, which can result in time savings in a clinical ART workflow.
The major limitation of this study is the lack of training examples with DSC values at the higher and lower extremes. As the DIR parameters were chosen to generate more poorly registered images (i.e. with registered contours of lower DSC values), there were fewer well registered images for the model to learn from. In particular, for the rectum, there were few registered contours with DSC > 0.7, and none with DSC > 0.8. Therefore, even though the suggested DSC tolerance (Brock et al 2017) is 0.80–0.90, we defined our classification threshold for a poorly registered rectum contour to be 0.6 in our clinical evaluation of the model. Furthermore, despite choosing DIR parameters that favour the generation of poorly registered images, it can be observed from figure 7 that the model still tends to perform worse in the low DSC range, with larger local RMSE values. Nonetheless, the model is still capable of correctly classifying these as poorly registered contours, hence it can still be useful for classification purposes in an ART workflow. A suggestion for future work is to generate more training examples using other DIR parameters, such that the DSC values of the dataset span a larger range between 0 and 1.
Finally, as in all detection analysis work, the threshold selection needs to balance false negatives against false positives. Since the potential consequences of false negatives in an ART workflow can greatly affect patient safety, it may be necessary to have an operator verify certain cases to minimise false negatives. As this model is also able to predict the DSC value itself, operators can verify the cases flagged by the model (i.e. those below an acceptable threshold), which can reduce their workload significantly. The model is not designed to eliminate operator involvement, but to serve as a complementary tool for operators.

Conclusion
This work demonstrates the first use case of a Siamese neural network to predict the DSC of registered OAR contours by comparing CBCT images with deformed vCT images. It has been shown that the model gives reasonably low RMSE values and high accuracy in identifying poorly registered contours, particularly for the rectum. The model will be helpful for centres planning to implement ART workflows in their clinic, and can be used to evaluate the accuracy of deformed contours and their eligibility for plan adaptation (de Jong et al 2021, Zwart et al 2022).
Further work to validate the model on other datasets, registration methods, and anatomical sites will be considered. There is also scope to consider the use of a Siamese neural network to predict other metrics that exhibit greater correlation with deformation vector field errors, such as the distance to agreement (Shi et al 2021).

Data availability statement
No new data were created or analysed in this study.