Patient-specific neural networks for contour propagation in online adaptive radiotherapy

Objective. Fast and accurate contouring of daily 3D images is a prerequisite for online adaptive radiotherapy. Current automatic techniques rely either on contour propagation with registration or on deep learning (DL) based segmentation with convolutional neural networks (CNNs). Registration lacks general knowledge about the appearance of organs, and traditional methods are slow. CNNs lack patient-specific details and do not leverage the known contours on the planning computed tomography (CT). This work aims to incorporate patient-specific information into CNNs to improve their segmentation accuracy. Approach. Patient-specific information is incorporated into CNNs by retraining them solely on the planning CT. The resulting patient-specific CNNs are compared to general CNNs and to rigid and deformable registration for contouring of organs-at-risk and target volumes in the thorax and head-and-neck regions. Results. Patient-specific fine-tuning of CNNs significantly improves contour accuracy compared to standard CNNs. The method further outperforms rigid registration and a commercial DL segmentation software, and yields contour quality similar to that of deformable image registration (DIR). It is additionally 7–10 times faster than DIR. Significance. Patient-specific CNNs are a fast and accurate contouring technique, enhancing the benefits of adaptive radiotherapy.


Introduction
Over the years, advanced radiation delivery paradigms such as intensity-modulated radiotherapy, volumetric modulated arc therapy and intensity-modulated proton therapy have increased the dose conformality with the tumor, resulting in improved healthy tissue sparing (Lomax 1999, Bortfeld 2006, Otto 2008, Tran et al 2017, Moreno et al 2019). However, daily set-up variations and longitudinal anatomical changes throughout the treatment, such as weight loss and tumor shrinkage, result in differences between the planned dose and the delivered dose. This can lead to target coverage degradation for highly conformal radiotherapy that may impact tumor local control. The effect is especially apparent for proton therapy, because the depth of the proton dose peak is highly dependent on the tissue densities along the beam path, which change with changing anatomy (Lomax 2008, Zhang et al 2011). Uncertainties in set-up, anatomy and range are accounted for in the planning process either by applying margins around the clinical target volume (CTV) (Albertini et al 2011) or by incorporating the uncertainties directly using robust optimization (Liu et al 2012, Unkelbach et al 2018). However, both techniques result in an increased dose to the normal tissue, reducing the advantage of conformal radiotherapy. With online adaptive radiotherapy, the set-up and anatomical uncertainty can be strongly reduced. The daily treatment plan is reoptimized based on a 3D daily image taken shortly before the treatment (Yan et al 1997, Lim-Reinders et al 2017, Albertini et al 2020, Paganetti et al 2021). The consequent reduction of uncertainty increases the plan conformality and, hence, the sparing of healthy tissue. Online plan adaptation is a time- and resource-intensive process as it requires the repetition of several planning steps for every fraction.
In particular, it requires delineation of organs-at-risk (OARs) and target volumes on the new images, plan evaluation, adaptation, reoptimization and quality assurance (QA). To be effective, all these steps need to be executed within several minutes because the time between the image acquisition and the treatment needs to be as short as reasonably possible to ensure high correspondence between the image and the treated anatomy. Furthermore, faster adaptation shortens the patient's overall treatment time and therefore increases patient comfort.
The time constraints of online adaptation require automation of each sub-process with as few manual interventions as possible. The most resource-intensive step is daily contouring, so there is great interest in automating it with sufficient accuracy and robustness (Lim-Reinders et al 2017). It can be automated in two distinct ways: automatic segmentation or registration.
Firstly, state-of-the-art segmentation is usually based on deep learning (DL) with convolutional neural networks (CNNs), which learn to segment medical images based on large datasets with manually annotated contours (Chen et al 2021, Nikolov et al 2021). The advantages of these methods are that they are fast, consistent and yield accurate results. On the downside, they require large amounts of annotated data to train and do not always generalize well to out-of-distribution data, e.g. scans that are significantly different from the training data. Furthermore, their applicability for tumor and target volume segmentation is limited (Kosmin et al 2019, Liu et al 2021). CNNs do not require manual contours for the patient under study, which is an advantage for segmentation in general. However, in adaptive therapy, such a reference annotation is always available, i.e. on the planning CT, containing information that is not used in the automatic segmentation of the daily scans.
Secondly, another set of methods relies on image registration for contouring (Thor et al 2011, Kumarasiri et al 2014, Elmahdy et al 2019). Specifically for adaptive therapy, the manual contours on the reference CT can be propagated to the daily scan by registering the former to the latter and applying the same transformation to the reference contours. The main advantage is that this technique does not require a large training dataset. The disadvantage is that it requires at least one annotated scan per patient and that traditional techniques are slow compared to auto-contouring with CNNs (Klein et al 2009, Costea et al 2022). Furthermore, when anatomical changes occur, deformable image registration (DIR) is needed, which is an ill-posed problem requiring careful hyperparameter tuning and algorithm choice to achieve high performance (Brock et al 2017).
To overcome the long runtime of traditional DIR algorithms, recent works have proposed image registration with deep learning (Fu et al 2020, Haskins et al 2020, Xiao et al 2020). Instead of iteratively optimizing a similarity metric, these CNNs are trained to directly predict the deformation, which strongly reduces the runtime. However, despite their great potential, these techniques have not yet achieved the same performance as iterative algorithms (Fu et al 2020).
Both registration and segmentation have their advantages and disadvantages. On the one hand, iterative deformable registration is slow and can be unreliable in case of large anatomical changes or mass variations (Oh and Kim 2017, Brock et al 2017). On the other hand, CNNs can fail on out-of-distribution data and cannot accurately segment tumors, so they cannot be employed in adaptive therapy without time-consuming manual checks and adjustments made by clinicians. However, by including the information from the (contoured) planning CT in the CNN, its robustness can be increased because the daily images are closely related to the planning CT, so the distribution learned by the CNN is likely to encompass the daily images. This can be achieved by (re-)training the CNN on the planning CT, also known as patient-specific fine-tuning, which has been explored for prostate cancer on MR and CT (Elmahdy et al 2020, Fransson et al 2022), for a single OAR in the head region on CT (Chun et al 2021) and for brain white matter segmentation on MR (Jansen et al 2020).
Whereas all works report a strong improvement of the quality of the CNN by patient-specific fine-tuning, their implementation details differ and the results are specific to a single anatomical site. A rigorous comparison of this technique to other auto-contouring methods for adaptive therapy has not yet been performed, and it is therefore unclear whether it is usable and optimal.
In this work, we train patient-specific CNNs for automatic contouring in online adaptive proton therapy (PT) and compare this technique to general segmentation networks and registration-based contour propagation for patients with head and neck cancer (HNC) and non-small cell lung cancer (NSCLC). Our work differs from previous publications in the following ways:
• It uses transfer learning, as in Elmahdy et al (2020) and Jansen et al (2020), but updates all parameters of the CNN instead of a subset, which enhances the learning capabilities.
• It uses affine and elastic deformations along with noise addition as data augmentations to mimic the set-up and anatomical variations happening in adaptive therapy. This further prevents overfitting, making the quality of the contours less sensitive to the number of training steps during the retraining.
• The technique is evaluated for different anatomical sites. The HNC patients are representative of small anatomical changes whereas the NSCLC patients undergo larger anatomical deformation, so that together they cover a large spectrum of clinically relevant deformations in adaptive radiotherapy. Additionally, both OAR and CTV segmentation are tested.

Materials and methodology
This section describes the different methods for contour propagation used in this study. First, the datasets used for training and evaluation are presented. Then, the registration and segmentation-based methods are described, followed by a short description of the evaluation metrics.

Datasets
This work is based on three datasets. The first dataset is from the Center For Proton Therapy (CPT) in Switzerland and contains patients treated with proton therapy between 2013 and 2021. A total of 388 patients with various indications was included, all having at least one planning CT with or without replanning CTs, yielding a total of 464 scans with annotations. Depending on the tumor location, different OARs were contoured manually by expert medical personnel, resulting in a large variation in the number of ground truth labels for each OAR (table 1). As none of these patients underwent online adaptive therapy, this dataset is solely used to pretrain the segmentation models (see section 2.3). In the remainder of this paper, this dataset will be referred to as the CPT dataset.
The second dataset consists of five patients with non-small cell lung cancer (NSCLC), not included in the CPT dataset. This data has previously been described in Josipovic et al (2016), Nenoff et al (2020), Amstutz et al (2021), Nenoff et al (2021). Each patient has one planning and nine repeated voluntary deep breath hold CTs. The repeated CTs were acquired on three different days, each day consisting of three different acquisitions. However, for this study, we will consider each CT to be representative of a different fraction in online adaptive therapy. All CTs were retrospectively recontoured by expert radiation oncologists according to the clinical protocol (Nenoff et al 2021), which included propagating the planning contours with DIR and slice-wise manual adjustments, either in Eclipse or Velocity (Varian Medical Systems, Palo Alto, USA). This dataset will be referred to as the NSCLC dataset.
The last dataset consists of five patients with various indications of head and neck cancer treated with proton therapy at the CPT. Each patient has a planning CT and 4 to 7 repeated CTs acquired on separate days throughout the treatments. All patients were removed from the CPT dataset so that they were not included in pretraining the networks. Even though these patients were not treated with online adaptive therapy, the repeated CTs are representative of the daily and longitudinal anatomic and set-up variations to be expected during online adaptive therapy. The repeated CTs were retrospectively recontoured by expert radiation oncologists according to the same clinical protocol as the NSCLC scans. We will refer to this data as the HNC dataset.

Registration based methods
In registration-based contour propagation, the reference CT is considered the moving scan, which is registered to the daily CT, i.e. the fixed scan. This registration results in a deformation vector field (DVF), which is used to interpolate the binarized reference contours to transfer them to the daily scan. In this work, we consider two distinct registration techniques.
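The propagation step described above can be sketched as follows. This is a minimal illustration, not the implementation used in elastix or plastimatch: it works on a 2D mask for brevity, assumes the DVF is stored on the fixed (daily) grid in voxel units, and uses nearest-neighbour sampling so that the propagated mask stays binary. The function name `propagate_mask` and the example values are hypothetical.

```python
import math


def propagate_mask(moving_mask, dvf):
    """Warp a binary 2D mask from the moving (reference) grid onto the fixed (daily) grid.

    moving_mask: list of rows with 0/1 values, defined on the reference CT grid.
    dvf: dvf[r][c] = (d_row, d_col), displacement in voxels for fixed voxel (r, c).
         The moving mask is sampled at (r + d_row, c + d_col).
    """
    n_rows, n_cols = len(moving_mask), len(moving_mask[0])
    warped = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            d_r, d_c = dvf[r][c]
            # Nearest-neighbour sampling (assumption): round to the closest voxel.
            src_r = int(math.floor(r + d_r + 0.5))
            src_c = int(math.floor(c + d_c + 0.5))
            # Voxels that map outside the reference image stay background.
            if 0 <= src_r < n_rows and 0 <= src_c < n_cols:
                warped[r][c] = moving_mask[src_r][src_c]
    return warped


# Illustrative example: a single labelled voxel and a uniform one-voxel shift.
reference_mask = [[0, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]]
uniform_dvf = [[(-1.0, 0.0)] * 4 for _ in range(4)]
daily_mask = propagate_mask(reference_mask, uniform_dvf)
```

With this pull-back convention, a displacement of -1 along the row axis makes the structure appear one row further down on the daily grid. A rigid registration corresponds to the special case in which the whole field is generated by one rotation and translation.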

Rigid registration
The first registration method relies on rigid registration (RR), i.e. the reference CT is only translated and rotated to match the daily CT. In case the anatomy is not strongly deforming (such as the head), this technique can be preferred because of its simplicity, speed and consistency. More specifically, we employ rigid registration implemented in elastix (Klein et al 2010) with mean squared error (MSE) as similarity criterion and four consecutive resolutions.

Deformable registration
The second registration method is a deformable image registration (DIR) method, which is the preferred method for contour propagation in case of deforming anatomy. The downside of DIR is that the problem is ill-posed, so that different algorithms, and even different hyperparameters of the same algorithm, can lead to strongly different results (Brock et al 2017). In this work, we use the b-spline algorithm implemented in plastimatch (Sharp et al 2010) with MSE as similarity criterion. A detailed description of the hyperparameters can be found in Nenoff et al (2021). Several other DIR algorithms were also tested, but for the sake of clarity we focus on this one as it led to good results compared to the other DIR algorithms and is publicly available.

Segmentation based methods
We train deep CNNs for the task of contour propagation in adaptive radiotherapy in two different settings: pretrained (or general) and patient-specific. All networks are based on the 3D UNet architecture, which takes the daily CT as input and outputs a set of segmentation maps S, each map corresponding to an OAR or target volume (TV). The network has 16 initial convolutional filters, which are doubled in each of the four encoder blocks. Max pooling with kernel size and stride 2 is used for downsampling between the encoders. Four decoders upsample the features to the original resolution with nearest-neighbor interpolation. All encoders and decoders consist of 2 convolutional filters with kernel size 3 × 3 × 3 followed by a rectified linear unit activation. A final convolution with kernel 1 × 1 × 1 is used to convert the 16 features into a set of organ-specific activation maps. All networks are trained with binary cross-entropy, i.e. the segmentation allows each voxel to be part of multiple labels. Even though organs generally do not overlap, this makes it easy to handle sparsely annotated scans, i.e. scans on which some organs are visible but not segmented by the medical personnel because they were irrelevant to the planning. To still leverage all the contours in the dataset, the loss function is adjusted in such a way that it ignores the loss contributions from labels that were not manually segmented.
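The adjusted loss for sparsely annotated scans can be illustrated with a small sketch. It is written in plain Python on flattened voxel lists rather than in a DL framework, and it assumes that the loss is averaged over the voxels of the annotated labels only; the exact normalization used in this work is not specified, so that choice, the function name `masked_bce` and the example values are assumptions.

```python
import math


def masked_bce(probs, targets, annotated, eps=1e-7):
    """Mean binary cross-entropy over the labels that were manually contoured.

    probs:     {label: [p, ...]} predicted foreground probability per voxel.
    targets:   {label: [0 or 1, ...]} manual segmentation, needed for annotated labels only.
    annotated: set of labels that have a manual contour on this scan.
    """
    total, count = 0.0, 0
    for label, p_list in probs.items():
        if label not in annotated:
            continue  # no ground truth available: this label contributes nothing
        for p, t in zip(p_list, targets[label]):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count if count else 0.0


# Illustrative scan on which only the lung was contoured.
example_probs = {"lung": [0.9, 0.1], "heart": [0.5, 0.5]}
example_targets = {"lung": [1, 0]}
example_loss = masked_bce(example_probs, example_targets, annotated={"lung"})
```

Because each label has its own independent sigmoid output, dropping the unannotated labels from the sum does not affect the gradients of the annotated ones, which is what makes the binary formulation convenient here.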

Pretrained neural network
The pretrained neural networks (PNN) are first trained on the relatively large CPT dataset. The models are trained from scratch with the Adam optimizer for 200 epochs with an initial learning rate of 10⁻³, which is halved every 20 epochs. Early stopping is applied by retaining the model with the lowest loss on the validation set (10% of the patients). All scans are resampled to a fixed resolution of 0.97 × 0.97 × 2 mm, and data augmentations include random cropping, rotations within ±5°, ±5% scaling, small localized elastic deformations (Isensee et al 2020) and Gaussian noise with σ² = 10⁻⁴. Note that these networks do not segment any of the target volumes, because the dataset contains a wide variety of indications and previous work has shown the poor quality of CNNs for target volume segmentation (Kosmin et al 2019, Liu et al 2021). We train two networks, one specific for the OARs in the head and neck region, i.e. the pretrained HNC network, and one for the OARs in the lung region, i.e. the pretrained NSCLC network. After training on the CPT dataset, the models are retrained in a second step on the HNC and NSCLC datasets themselves. The evaluation is done with leave-one-out validation, i.e. the pretrained model is retrained on 4 out of 5 scans of either the HNC or NSCLC dataset and the retrained model is evaluated on the remaining scan. In that way, the pretrained model has still never seen the anatomy of the patient under study and should therefore generalize what it has learned from other patients. The retraining parameters are similar to the initial training parameters, but the magnitude of the data augmentations was increased to ±10° rotations, ±10% scaling and σ² = 10⁻² Gaussian noise to avoid overfitting the very small dataset.
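The learning rate schedule used for pretraining (initial value 10⁻³, halved every 20 epochs) amounts to a simple step decay. The sketch below assumes zero-indexed epochs and that the first halving happens at epoch 20; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-3, halve_every=20):
    """Step decay: the learning rate is halved after every `halve_every` epochs."""
    return initial_lr * 0.5 ** (epoch // halve_every)


# Learning rate at the start of each 20-epoch block of a 200-epoch training run.
schedule = [learning_rate(epoch) for epoch in range(0, 200, 20)]
```

Under these assumptions the rate is halved nine times over 200 epochs, so the final block trains with 10⁻³ / 512, i.e. roughly 2 × 10⁻⁶.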

Patient-specific neural network
During online adaptive therapy, clinically accepted contours on the planning CT are always available because they were used for the initial planning. The pretrained networks, however, do not leverage these. To include this prior information, we fine-tuned the pretrained networks by retraining them only on the reference CT (Elmahdy et al 2020, Chun et al 2021), yielding patient-specific neural networks (PSNN). This retraining results in overfitting of the network to the reference CT, but because the reference CT is very similar to the daily CTs, it can be expected that this overfitted network still performs better than the generalizing pretrained networks. Further, to avoid complete overfitting, training is restarted with a lower learning rate of 10⁻⁴ and, since there is only one scan in the training set, runs for 50 000 epochs. Data augmentations are the same as for the initial pretraining, with the exception of a stronger Gaussian noise with σ² = 10⁻².
Pretraining a network for target volume segmentation is very difficult and would require a lot of data. However, this does not mean that the target volumes (TV) cannot be segmented with deep CNNs. Similar to the fine-tuned models, we can train a neural network solely on the reference CT, which contains the TVs contoured by a clinician. This is commonly referred to as one-shot image segmentation (Shaban et al 2017). In contrast to the fine-tuned models, the training cannot restart from a pretrained neural network that is already able to segment TVs. It is however possible to leverage some prior information during one-shot learning by means of transfer learning (Weiss and Khoshgoftaar 2016), which has shown promising results in e.g. video segmentation (Caelles et al 2017). With transfer learning, the network is first trained on a task different from the intended one (e.g. lung segmentation). In a second step, the network is then retrained on the intended task (e.g. TV segmentation), starting from the weights obtained in the first training. Here, we take the pretrained models on the OARs and use transfer learning to segment the CTV. We restart the training from the final weights of the pretrained models for all layers except the final convolution, as this convolution creates organ-specific maps which are not informative for the TVs.
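The weight initialization for the target volume network can be sketched as follows. The sketch represents a model as a plain dictionary from parameter names to values, which is only a stand-in for a framework-specific state; the parameter names, the `final_conv.` prefix and the function name `init_for_target_volumes` are hypothetical and chosen for illustration.

```python
def init_for_target_volumes(pretrained, fresh, final_layer_prefix="final_conv."):
    """Reuse pretrained weights for all layers except the final convolution.

    pretrained: {name: value} weights of the network pretrained on the OARs.
    fresh:      {name: value} newly initialized weights of the target volume network.
    """
    weights = {}
    for name, value in fresh.items():
        if name.startswith(final_layer_prefix) or name not in pretrained:
            # The final convolution is organ-specific: keep the new initialization.
            weights[name] = value
        else:
            # Encoder and decoder features are transferred from the OAR network.
            weights[name] = pretrained[name]
    return weights


# Illustrative parameter sets with placeholder values.
oar_weights = {"encoder1.conv": "pretrained_A", "decoder1.conv": "pretrained_B",
               "final_conv.weight": "pretrained_organ_maps"}
new_weights = {"encoder1.conv": "random_A", "decoder1.conv": "random_B",
               "final_conv.weight": "random_tv_maps"}
start_weights = init_for_target_volumes(oar_weights, new_weights)
```

After this initialization all parameters, including the transferred ones, remain trainable, in line with the choice in this work to update the full network during fine-tuning.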

Commercial segmentation
Finally, the trained CNNs are also compared to a clinically used commercial auto-contouring software Limbus Contour 1.7 (AI Limbus Inc., 2076 Athol Street, Regina, SK S4T 3E5, Canada). This software has been clinically validated and shown to only rarely require manual adjustments of OARs (Wong et al 2020, D'Aviero et al 2022).

Evaluation methods
The performance of the above-mentioned contour propagation methods is evaluated on the HNC and NSCLC datasets by comparing the results with the manually annotated contours on the repeat CTs. We use three well-known geometric metrics for this comparison. Firstly, the dice coefficient is used to evaluate the overlap between the manual and propagated contours. The dice coefficient is however strongly dependent on the size of the structure and is therefore difficult to compare between organs of different sizes. To alleviate this effect, we also include the surface dice, which represents the proportion of the organ surface that is within a tolerance of the surface of the manually annotated organ (Nikolov et al 2021). We set this tolerance to 2 mm. Both dice and surface dice coefficients give insight into the average difference between segmentations. To also assess the maximal error, we evaluate the 95th percentile of the Hausdorff distance (HD95). A Wilcoxon signed rank test is performed between each method and the patient-specific NNs to test whether the patient-specific NNs perform significantly better or worse than the other methods.
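The three metrics can be illustrated with a brute-force sketch on small point sets. Several conventions are assumed here and may differ from the implementation used in this work: volumes are given as sets of voxel coordinates, surfaces as non-empty lists of points in mm, the surface dice is computed symmetrically over both surfaces, and HD95 is taken as the maximum of the two directed 95th percentiles using a nearest-rank percentile. All function names are hypothetical.

```python
import math


def dice(a, b):
    """Volumetric dice between two sets of voxel coordinates."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def _min_distances(src, dst):
    """Distance (mm) from every point in src to its closest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def surface_dice(surf_a, surf_b, tol_mm=2.0):
    """Fraction of surface points lying within tol_mm of the other surface."""
    d_ab = _min_distances(surf_a, surf_b)
    d_ba = _min_distances(surf_b, surf_a)
    within = sum(d <= tol_mm for d in d_ab) + sum(d <= tol_mm for d in d_ba)
    return within / (len(d_ab) + len(d_ba))


def hd95(surf_a, surf_b):
    """95th percentile Hausdorff distance (max of both directions)."""
    def p95(values):
        ordered = sorted(values)
        rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile
        return ordered[rank - 1]
    return max(p95(_min_distances(surf_a, surf_b)),
               p95(_min_distances(surf_b, surf_a)))


# Illustrative inputs: two overlapping voxel sets and two tiny surfaces.
manual_voxels = {(0, 0, 0), (0, 0, 1)}
auto_voxels = {(0, 0, 1), (0, 0, 2)}
manual_surface = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
auto_surface = [(0.0, 0.0, 1.0), (3.0, 0.0, 5.0)]
```

The brute-force distance search is quadratic in the number of surface points, so for clinical contours a distance transform or spatial index would normally replace `_min_distances`.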
Two preliminary experiments are performed on the NSCLC dataset to highlight the differences between the proposed method and previous works. Firstly, we compare our approach (i.e. fine-tuning all weights of the network) to fine-tuning only the final layer, as proposed by Elmahdy et al (2020) and Jansen et al (2020). Secondly, we evaluate the importance of using data augmentations by comparing our approach to fine-tuning without data augmentations.

Preliminary experiments
Fine-tuning all weights of the network improves the segmentation compared to only fine-tuning the last layer (table 2). This means that increasing the learning capability by retraining all weights indeed improves the performance of the network.
Including data augmentations during fine-tuning increases contouring accuracy (figure 1). For all patients, the maximum dice score during training is higher with data augmentation than without. Moreover, training with data augmentations avoids overfitting, i.e. the segmentation accuracy on the repeated CTs first increases and then stagnates, without significantly decreasing at the end of the training. In contrast, without data augmentations, the accuracy reaches a maximum after which it steadily decreases. In practice, a fixed number of iterations needs to be defined. When training without data augmentations, the iteration at which the dice score is maximal depends on the patient (figure 1). For example, patient 1 reaches the maximal dice after 700 iterations, whereas patient 3 reaches it after 3500. Therefore, selecting a fixed number will result in suboptimal performance for some patients. In contrast, when training with data augmentations, the number of iterations can simply be set high (50 000 in our case) as the quality stagnates.

Contouring accuracy
Regarding the OAR contours, rigid registration (RR) generally performs worst of all methods for the NSCLC dataset (figure 2), except for the spinal cord. This is because RR aligns the spine well and, hence, the spinal cord is also accurately contoured. The pretrained NN achieves better contour accuracy but suffers from outliers with low performance for the lungs and esophagus. This happens when the network is evaluated on out-of-distribution data, i.e. data that is significantly different from the training data. Because the training set is small for these OARs (table 1), the probability of this is indeed larger than for the more frequently occurring OARs. The commercial system consistently outperforms the pretrained NN and is especially more robust, i.e. it suffers less from outliers. The large HD95 in the lungs in some cases is due to the presence of a tumor, which, depending on location, annotator or method, is included in or excluded from a contour. This has only a limited effect on the dice and surface dice, but strongly affects the HD95.
Fine-tuning the segmentation networks on a specific patient strongly improves the segmentation accuracy of the OARs, significantly outperforming rigid registration, the pretrained NN and the commercial contouring software for all OARs (figure 3). Note that we only show the significance test results for the surface dice, but similar results are found for the dice and HD95. It also resolves the outliers, because fine-tuning on the planning CT prevents the network from being run on out-of-distribution data. The contour quality is similar to DIR for the lungs, significantly lower for the heart and esophagus and significantly better for the spinal cord (figure 3).
In order to assess whether the obtained performance of the patient-specific NNs is clinically acceptable, it can be compared to the variability between the contours drawn by different observers, i.e. the inter-observer variability. This variability was not studied here, but has been quantified in other works for the relevant OARs in the thorax region (Yang et al 2018). It is important to note that the values are organ and image-modality-specific, as the metrics are strongly affected by the volume or contrast of the organ. The reported inter-observer dice scores were 0.96 for the lungs, 0.93 for the heart, 0.81 for the esophagus and 0.86 for the spinal cord. These values are very close to the dice scores for the patient-specific NNs and DIR (figure 2).
Regarding target segmentation, the patient-specific NNs perform significantly better than RR, but significantly worse than DIR (figure 3). However, these differences are small and only significant for the surface dice and not for dice. For one patient, the patient-specific NN has much lower contour quality. In this patient, the shape of the tumor changed throughout the treatment, causing the manual delineations to differ significantly from the reference. Whereas the performance of DIR is also low for this patient, the drop in quality is less pronounced. Such strong outliers did not occur for the patient-specific OAR segmentation, which indicates that one-shot segmentation lacks robustness because of its limited general knowledge.
Most general trends found for the NSCLC data also hold for the HNC dataset (figure 4). The main difference is that RR performs much better. For OARs in the head, close to the skull (e.g. brainstem, chiasm, hippocampus), RR performs as well as the more advanced methods (figure 5), because it matches the skull accurately and the rigid assumption is applicable there. In contrast, for OARs further from the skull (e.g. spinal cord, thyroid), RR performs badly because the rigid transformation matching the skull is not valid there. Lastly, for organs that change strongly during radiotherapy (e.g. parotid glands), RR sometimes performs very badly.
Despite the larger OAR dataset in the HN region compared to the thorax (table 1), the performance of the pretrained NN is still low. This is most apparent for the smaller OARs (e.g. lacrimal gland, optic nerve, chiasm). Again, the commercial system outperforms the pretrained NN. The patient-specific NNs significantly outperform all other methods (including DIR) for all organs in general (figure 5). However, for the individual organs, we find that the difference is not always significant and that segmentation of the optic nerves is even better with registration and the commercial segmentation.
Several other works have investigated the inter-observer variability for OARs in the head and neck region (Deeley et al 2011, Brouwer et al 2012, Mattiucci et al 2013, Verhaart et al 2014, Tao et al 2015, van der Veen et al 2019, Wong et al 2020). Whereas the stated values vary between the publications because of differences in experimental set-up, we found that the mean dice scores of patient-specific NNs and DIR here are similar to or even higher than the reported inter-observer variabilities for all OARs except the thyroid.
Figure 3. Overview of the Wilcoxon signed rank test results for the surface dice of the contours in the NSCLC dataset. Green: patient-specific NN performs significantly better than the method. Red: patient-specific NN performs significantly worse than the method. White: the performance of the method is not significantly different from the PSNN. Grey: the method does not segment the structure. The significance level is set to 2.5%.
Contouring of the main CTV works best with DIR, followed by patient-specific NNs and rigid registration. Rigid registration does not work well because the CTV covers part of the neck region, where significant shrinkage happened for these patients, which cannot be captured with rigid transformations. The improvement of the patient-specific NNs compared to rigid registration is only significant based on the dice score but not for the surface dice (figure 5). For the boosted region, rigid registration works well and even significantly better than the patient-specific segmentation, as this region is inside the head close to the skull.

Contouring speed
The runtime of the algorithms depends strongly on the hardware and potential GPU acceleration. Image registration in plastimatch and elastix runs on CPU and the runtime is evaluated on a Linux-based system with 8 Intel Xeon E3-1240 v5 CPU cores. The runtime of the in-house trained NNs is evaluated by running inference on an Nvidia Quadro P6000 GPU and the commercial software was run on an Nvidia RTX 3060 GPU.
Figure 5. Overview of the Wilcoxon signed rank test results for the surface dice of the contours in the HNC dataset. Green: patient-specific NN performs significantly better than the method. Red: patient-specific NN performs significantly worse than the method. White: the performance of the method is not significantly different from the PSNN. Grey: the method does not segment the structure. The significance level is set to 2.5%.
Rigidly registering the CTs takes approximately the same time as running inference of the in-house trained CNNs. The commercial segmentation software is approximately 2 times slower, but still significantly faster than DIR, which is 7-10 times slower than rigid registration (table 3). Note that several DIR methods with GPU acceleration exist, which could lead to a significant speed-up (Gu et al 2010, Weistrand and Svensson 2015). Whereas the runtime of the DIR is likely acceptable, the speed of the other methods offers an advantage for patient comfort and correspondence between CT and treated anatomy in a particularly time-dependent setting such as adaptive therapy. Figure 6 visualizes the trade-off between speed and accuracy. Patient-specific NNs lie on the Pareto front for both HNC and NSCLC datasets, i.e. none of the other methods can improve accuracy without increasing runtime nor improve runtime without reducing accuracy. For NSCLC, DIR also lies on the Pareto front, yielding slightly higher accuracy but slower runtime. For HNC, the Pareto front is shared with RR, which is faster but yields lower accuracy.
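The Pareto criterion used above can be made explicit with a short sketch: a method is on the front if no other method is at least as fast and at least as accurate while being strictly better in one of the two. The function name `pareto_front`, the placeholder method names and the numbers below are purely illustrative and are not results from this study.

```python
def pareto_front(methods):
    """Return the names of the non-dominated methods.

    methods: {name: (runtime, accuracy)}; lower runtime and higher accuracy are better.
    """
    front = []
    for name, (runtime, accuracy) in methods.items():
        dominated = any(
            other_runtime <= runtime and other_accuracy >= accuracy
            and (other_runtime < runtime or other_accuracy > accuracy)
            for other, (other_runtime, other_accuracy) in methods.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Made-up (runtime in seconds, accuracy) pairs for four placeholder methods.
illustrative_methods = {
    "method_a": (1.0, 0.80),
    "method_b": (1.0, 0.90),
    "method_c": (10.0, 0.92),
    "method_d": (2.0, 0.85),
}
front = pareto_front(illustrative_methods)
```

In this made-up example `method_b` and `method_c` form the front: the first is fastest at its accuracy level, while the second trades a longer runtime for slightly higher accuracy, mirroring the kind of trade-off discussed for the NSCLC dataset.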

Discussion
Our results show that DIR yields generally the most accurate contours for the targets and OARs in the thorax region. Contrarily, the patient-specific NNs are best for OARs in the head and neck region. The differences are however small and not always significant for all metrics. The PSNN is further on average 10 times faster, which is advantageous in adaptive therapy. For the HNC specifically, rigid registration is both fast and accurate for the structures close to the skull, but the accuracy is lower for those far from the skull which can lead to unacceptable degradation in target coverage.
Although both patient-specific NNs and DIR lead to high-quality contours, they do not perfectly correspond to the manual ones. This can be due to limitations of the methods, but also due to inaccuracies in the manual contours, as it has been shown that substantial intra- and inter-observer variability in delineation of HNC and NSCLC exists (Zhang and Huang 2022). The PSNN and DIR methods reach accuracies similar to such inter-observer variabilities found in the literature, which impose an upper bound on the average achievable accuracy. This further means that the methods perform similarly to a human, indicating that they can be used directly in adaptive therapy.
In order to meticulously evaluate the use of contour propagation methods for adaptive therapy, the effect on the dose and the corresponding biological effect should be analyzed. Treatment plans reoptimized on automatically propagated contours should be compared to plans reoptimized on manual contours, and these dosimetric differences have to be interpreted clinically before implementation in the clinic. This is the subject of current work at CPT.
The size of the evaluation datasets is relatively small, mainly because manual delineation of all daily CTs is a time-consuming process. For the NSCLC dataset, the clinicians completely manually recontoured because of the limited number of OARs. As the number of OARs in the HNC region is much larger, the clinicians manually adjusted contours propagated from the reference using DIR, in accordance with the current clinical protocol for replanning. Even though this creates a bias, the resulting contours are clinically acceptable and the DIR algorithm used to create these initial contours was different from the one used in this study.
The quality of the pretrained NN for the HNC dataset is low, even though the training dataset is relatively large. Especially for the smaller organs, the segmentation accuracy is largely insufficient. This could be due to the large number of OARs segmented by a single network. During training, the loss function is only slightly affected by these small organs, similar to class imbalance. This could result in the network favoring accurate segmentation of the larger structures over the smaller ones. This can be overcome by simply training one network for each structure. Even though this would lead to an increase in runtime, inference could still be parallelized or hierarchical approaches could be employed instead of splitting the image in patches (Shaheen et al 2021).
This analysis relies on the presence of daily CT scans, which requires an in-room CT. Although such an in-room CT is present at several proton therapy centers, gantry-mounted CBCT scanners are more prevalent. In the future, daily MRI scans might also be used. The registration-based methods could easily be adjusted to allow multi-modal registration between CBCT/MRI and CT to propagate the contours. Further, a general segmentation network for CBCT/MRI could also be developed if an appropriate dataset is available. The patient-specific fine-tuning cannot be applied directly with CBCT/MRI. However, it can be applied on synthetic CTs, which are produced from the daily CBCT/MRI to reoptimize the plan in adaptive therapy. Although it is expected that the networks will work on such synthetic CTs, the quality of contours will have to be evaluated.

Conclusion
In this work, patient-specific CNNs were compared to general CNNs and (deformable) registration for the task of contour propagation in adaptive radiotherapy. We found that patient-specific fine-tuning leads to higher-quality contours than general segmentation networks, reaching a quality similar to that of DIR but with a significant reduction in runtime. Fine-tuning further allows target volume segmentation, which is not yet feasible with general CNNs.