Evaluating the relationship between contouring variability and modelled treatment outcome for prostate bed radiotherapy

Objectives. Contouring similarity metrics are often used in studies of inter-observer variation and automatic segmentation but do not provide an assessment of clinical impact. This study focused on post-prostatectomy radiotherapy and aimed to (1) identify whether there is a relationship between variations in commonly used contouring similarity metrics and resulting dosimetry and (2) identify the variation in clinical target volume (CTV) contouring that significantly impacts dosimetry. Approach. The study retrospectively analysed CT scans of 10 patients from the TROG 08.03 RAVES trial. The CTV, rectum, and bladder were contoured independently by three experienced observers. Using these contours, reference simultaneous truth and performance level estimation (STAPLE) volumes were established. Additional CTVs were generated using an atlas algorithm based on a single benchmark case with 42 manual contours. Volumetric-modulated arc therapy (VMAT) treatment plans were generated for the observer, atlas, and reference volumes. The dosimetry was evaluated using radiobiological metrics. Correlations between contouring similarity and dosimetry metrics were calculated using Spearman's coefficient (Γ). To assess the impact of variations in planning target volume (PTV) margin, the STAPLE PTV was uniformly contracted and expanded, with plans created for each PTV. STAPLE dose-volume histograms (DVHs) were exported for plans generated based on the contracted/expanded volumes, and dose-volume metrics were assessed. Main results. The study found no strong correlations between the considered similarity metrics and modelled outcomes. Moderate correlations (0.5 < Γ < 0.7) were observed between the Dice similarity coefficient, Jaccard, and mean distance to agreement metrics and rectum toxicities. The observations of this study indicate a tendency for variations in CTV contraction/expansion below 5 mm to result in minor dosimetric impacts. Significance. Contouring similarity metrics must be used with caution when interpreting them as indicators of treatment plan variation. For post-prostatectomy VMAT patients, this work showed that variations in contours with an expansion/contraction of less than 5 mm did not lead to notable dosimetric differences; this should be explored in a larger dataset to assess generalisability.


Introduction
Prostate cancer is the second most frequent malignancy in men (Leslie and Soon-Sutton 2023), with radical prostatectomy being the most frequently used treatment for patients with localised prostate cancer (Yeison et al 2023). However, nearly one quarter to one third of these patients will experience biochemical relapse, and early salvage radiation therapy (RT) can offer high rates of long-term disease-free survival (Kneebone et al 2020). Modern RT techniques, such as intensity-modulated radiotherapy (IMRT) and volumetric-modulated arc therapy (VMAT), can produce highly conformal dose distributions to the target volume, which in post-prostatectomy patients is defined as the prostate bed (Dean et al 2016, Mohammad et al 2018). This volume is identified by the organs that surround the prostatic fossa, and includes the anterior rectum, bladder base, and anastomosis (Sidhom et al 2008).
Accurate target delineation is essential for successful RT treatment, as the quality of target delineation can greatly impact patient outcomes (Cloak et al 2019a, 2019b). In the case of prostate bed RT, delineating the clinical target volume (CTV) is a challenging task. This is due to the complex anatomy of the prostate bed, which is not a single solid structure, making it more difficult to delineate than most other treatment sites (Michalski et al 2010b). Despite the availability of consensus guidelines for CTV delineation, delineations are often inconsistent (Latorzeff et al 2017), and manual delineation of the prostate bed CTV is subject to large variability (Weiss and Hess 2003). Efforts have been made to reduce this variability, including the use of multimodality images from magnetic resonance imaging (MRI) and computed tomography (CT) scans, but achieving agreement among observers remains a challenge (Lee et al 2018).
Inter-observer variability (IOV) in prostate bed CTV contouring has been widely studied, and multiple studies have found significant levels of variation (Mitchell et al 2009, Hwee et al 2011, Ost et al 2011, Cloak et al 2019a, 2019b). For example, a study by Ost et al assessed the agreement of prostate bed CTV contouring using CT among six observers for 10 patients (Ost et al 2011) and found only moderate observer agreement (mean kappa, 0.49; range, 0.35-0.62). The study also found that the mean volume of 100% agreement (±standard deviation, std) was only 5.0 (±3.3) ml, while the mean union of all contours (±std) was 41.1 (±11.8) ml, indicating a high degree of variation in contouring among observers.
Despite the numerous studies that have investigated IOV in prostate bed CTV contouring, relatively few have examined the effect of this variation on dosimetry. Those that have done so found that variation in CTV contouring can have a significant impact on dosimetry (Mitchell et al 2009, Cloak et al 2019a, 2019b). For example, Mitchell et al found that variation in CTV contouring has a significant impact on rectal dosimetry when using the three-dimensional conformal RT (3D-CRT) technique (Mitchell et al 2009). Cloak et al also found that variations in prostate bed CTV contouring can have an impact on dosimetric compliance when using IMRT (Cloak et al 2019a, 2019b). These studies also highlighted the importance of minimising IOV in prostate bed CTV contouring to ensure accurate dosimetry and improve patient outcomes.
The manual contouring process in RT is known to be time-consuming and labour-intensive (Weiss and Hess 2003). To address this, there has been increasing interest in automatic segmentation methods (Cardenas et al 2019). In the landscape of auto-segmentation, various techniques have emerged, including traditional methods such as intensity thresholding, region growing, and heuristic edge detection, as well as advanced approaches like (multi)atlas-based segmentation, model-based segmentation, and machine learning-based (ML) segmentation (Cardenas et al 2019). These methods have shown promising results in terms of reducing IOV and can potentially improve the efficiency of the RT treatment planning process, which can be particularly beneficial in adaptive radiotherapy (Hwee et al 2011, Delpon et al 2016).
To quantify the magnitude of IOV or to evaluate the quality of automatic segmentations for prostate bed RT, various contouring similarity metrics have been employed in multiple studies (Mitchell et al 2009, Hwee et al 2011, Ost et al 2011, Delpon et al 2016, Cloak et al 2019a, 2019b). These studies commonly used metrics such as volume difference (VD), Dice similarity coefficient (DSC), 95% Hausdorff distance (95% HD), mean distance to agreement (MDA), and the Jaccard metric (Jaccard). While these metrics are simple to compute, they lack true clinical meaning as they do not have an established correlation with dosimetric effects or patient outcomes (Liu et al 2021). Because the impact of contouring variability on clinical outcome is poorly understood, commonly used contouring similarity metrics may provide no dosimetric context regarding the clinical impact of the variation (Jameson et al 2014, Vinod et al 2016).
The objective of this study was to investigate the correlation between commonly used similarity metrics and dosimetric effects in prostate bed RT using the VMAT technique. Additionally, the study sought to determine the range of prostate bed CTV variation and assess its impact on dosimetry, categorising deviations as minor or major based on the constraints outlined in the TROG 08.03 RAVES trial protocol. The findings are intended to provide a better understanding of the clinical meaning of commonly used similarity metrics in prostate bed RT and of the potential impact of contouring variation on dosimetry, and may ultimately aid decision-making in the contouring process. This article is based on chapter 4 of Viet Le Bao's PhD thesis (Le 2023).

Patient datasets and contouring
This retrospective study used two patient datasets from the TROG 08.03 RAVES trial (Pearse et al 2014, Kneebone et al 2020). The first dataset (dataset 1) included CT scans from 10 patients, with the CTV, rectum, and bladder independently contoured by three experienced observers (denoted A, B, and C) on all scans. The second dataset (dataset 2) included one patient's CT scan, with 92 independent manual contour sets of CTV, rectum, and bladder derived from the RAVES trial benchmarking study. Out of the 92 manually generated CTV contours, 42 that were deemed to have no minor or major variations by the RAVES quality assurance (QA) review committee were utilised for analysis.
The contouring guidelines for this trial were adapted from the Faculty of Radiation Oncology Genito-Urinary Group (FROGG) consensus guidelines (Sidhom et al 2008). All planning CTs were acquired with the patient in the supine position and the maximal CT slice thickness was 2.5-3 mm. The patients were simulated and treated with a full bladder and were instructed to maintain an empty rectum (Pearse et al 2014). The rectum was contoured as a solid organ from the rectosigmoid junction superiorly (where the rectum turns horizontally into the sigmoid, usually at the inferior border of the sacro-iliac joint) to 15 mm inferior to the inferior border of the CTV. The rectal contours should extend at least 15 mm superior and inferior to the CTV (Sidhom et al 2008).
The research design comprises two distinct sections: the correlation study and the CTV contraction/expansion study. The correlation study focuses on relationships between contouring similarity metrics and radiobiological parameters. It used dataset 1 (10 patients with 3 manual contour sets each) for analysis, while dataset 2 (1 patient with 42 contour sets) was employed to generate additional CTV contours on the patient images from dataset 1. The CTV contraction/expansion study, which aimed to identify the range of contouring variability, used dataset 1 (10 patients with 3 manual contour sets each) and no additional contour sets.

Reference STAPLE volumes
To assess contouring variability, a reference contour was established for each structure (CTV, rectum, and bladder) using the three-observer contour sets for each patient in dataset 1. The reference contour set was generated using the simultaneous truth and performance level estimation (STAPLE) algorithm (Warfield et al 2004), assuming that the true contour existed within the observer contours. The STAPLE contour was generated using Python software (SimpleITK library) (Lowekamp et al 2013).
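For readers unfamiliar with STAPLE, the idea can be illustrated with a minimal pure-Python sketch of the binary expectation-maximisation scheme described by Warfield et al. This is not the SimpleITK implementation used in the study: the function name `staple_binary`, the initial sensitivity/specificity of 0.9, the data-derived prior, the 0.5 consensus threshold, and the flattened toy masks are all assumptions made for illustration.

```python
def staple_binary(decisions, n_iter=50, init=0.9, eps=1e-9):
    """Toy binary STAPLE.

    decisions: list of rater masks, each a flat list of 0/1 of equal length.
    Returns (w, p, q): per-voxel foreground probability, and each rater's
    estimated sensitivity (p) and specificity (q).
    """
    n_raters = len(decisions)
    n_vox = len(decisions[0])
    # Fixed foreground prior: overall fraction of voxels labelled foreground
    # (an assumption of this sketch).
    prior = sum(sum(d) for d in decisions) / (n_raters * n_vox)
    p = [init] * n_raters
    q = [init] * n_raters
    w = [0.0] * n_vox
    for _ in range(n_iter):
        # E-step: probability that each voxel is truly foreground,
        # given the current performance estimates of every rater.
        for i in range(n_vox):
            a, b = prior, 1.0 - prior
            for j in range(n_raters):
                if decisions[j][i]:
                    a *= p[j]
                    b *= 1.0 - q[j]
                else:
                    a *= 1.0 - p[j]
                    b *= q[j]
            w[i] = a / (a + b)
        # M-step: re-estimate sensitivity and specificity from the soft truth.
        sum_w = sum(w)
        sum_not_w = n_vox - sum_w
        for j in range(n_raters):
            tp = sum(w[i] for i in range(n_vox) if decisions[j][i])
            tn = sum(1.0 - w[i] for i in range(n_vox) if not decisions[j][i])
            # Clamp away from 0 and 1 so later products never vanish entirely.
            p[j] = min(max(tp / max(sum_w, eps), eps), 1.0 - eps)
            q[j] = min(max(tn / max(sum_not_w, eps), eps), 1.0 - eps)
    return w, p, q


# Hypothetical flattened masks for three observers (1 = inside the contour).
observer_a = [1, 1, 1, 1, 0, 0, 0, 0]
observer_b = [1, 1, 1, 0, 0, 0, 0, 0]
observer_c = [1, 1, 1, 1, 1, 0, 0, 0]

probabilities, sensitivity, specificity = staple_binary(
    [observer_a, observer_b, observer_c]
)
# Threshold the probability map to obtain a consensus mask (0.5 is assumed).
consensus = [1 if prob >= 0.5 else 0 for prob in probabilities]
```

In this toy case the consensus follows the voxels on which at least two observers agree, and the observer who under-contours receives a lower estimated sensitivity than the others.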

Generation of additional atlas CTVs
To generate additional CTV contours on patient images from dataset 1, an in-house atlas-based auto-segmentation (ABAS) algorithm was developed. The ABAS algorithm was based on the benchmarking contouring dataset (dataset 2) and implemented using Python software (SimpleITK library) (Dowling et al 2014). The ABAS process uses a reference image (the atlas) in which anatomical regions of interest have been previously delineated. To segment a new patient image, a transformation that registers the atlas to the patient image is computed and used to map the atlas contours onto the new patient image (Dowling et al 2015).
The ABAS process in this study is illustrated in figure 1, and it consisted of the following steps. First, an affine transformation (twelve degrees of freedom) was applied to register the atlas (dataset 2) to each individual patient image (dataset 1). This was followed by a deformable transformation using a demons-based algorithm (Vercauteren et al 2009) to further improve registration. Next, the 42 acceptable CTV contours from dataset 2 were propagated to the patient images in dataset 1. Finally, for each patient image in dataset 1, a set of contours was generated by combining these 42 contours into a probabilistic label (Kardman et al 2016), and taking thresholds at levels of agreement of 20%, 40%, 60%, 80%, 95%, and 100%.
This process provided six atlas-based contours for each patient image in dataset 1. In total, for each of the 10 patient image data sets, six probabilistic (atlas) contours, three observer contours (A, B, and C), and the STAPLE contour were included. It is important to note that for this study, the additional atlas contours were used to statistically quantify the correlation between contouring variation and clinical outcomes, rather than to evaluate the accuracy of the proposed ABAS algorithm.
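The final thresholding step can be sketched as follows. This is a minimal illustration under stated assumptions, not the in-house ABAS code: the function names are hypothetical, the masks are flattened toy data with five propagated contours instead of 42, and a voxel is assumed to be included when the agreement is greater than or equal to the threshold.

```python
AGREEMENT_LEVELS = (20, 40, 60, 80, 95, 100)  # percent, as listed in the text


def vote_counts(masks):
    """Number of propagated contours that include each voxel."""
    return [sum(votes) for votes in zip(*masks)]


def threshold_contours(masks, levels=AGREEMENT_LEVELS):
    """Return one binary mask per agreement level.

    Integer arithmetic (votes * 100 >= level * n) avoids floating-point
    surprises when a fraction sits exactly on a threshold.
    """
    n = len(masks)
    counts = vote_counts(masks)
    return {
        level: [1 if c * 100 >= level * n else 0 for c in counts]
        for level in levels
    }


# Hypothetical propagated CTV masks (flattened), one list per atlas contour.
propagated = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0, 0],
]

atlas_contours = threshold_contours(propagated)
volumes = {level: sum(mask) for level, mask in atlas_contours.items()}
```

Lower agreement levels give larger, more inclusive contours and higher levels give smaller ones, which is how a single set of propagated contours yields a graded family of six CTVs per patient.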

Treatment planning
To evaluate dosimetric uncertainty, two-arc VMAT treatment plans with beam start and stop angles of 184°/176° were generated for each patient in dataset 1, covering all CTV volumes (atlas-based, observer-based, and STAPLE-based), as follows:
• STAPLE plan: generated based on STAPLE CTV, STAPLE rectum, and STAPLE bladder volumes (one plan).
• Observer plans: generated based on observer CTV, observer rectum, and observer bladder volumes (three plans).
• Atlas plans: generated based on atlas CTV, STAPLE rectum, and STAPLE bladder volumes (six plans).
A 6 MV photon beam model for the Elekta Versa linear accelerator was used for plan generation with the auto-planning (AP) module of the Pinnacle³ treatment planning system (TPS) (Pinnacle³ 9.10, Philips Healthcare, Fitchburg, WI). A three-dimensional uniform margin of 10 mm was added to the CTV to obtain the PTV (Kneebone et al 2020). The plans were generated with a prescription of 64 Gy in 32 fractions to the PTV. AP was used to avoid planner bias. After AP operation, an identical optimisation constraint (approved by a senior dosimetrist) was used for each patient to ensure that the target and organ at risk (OAR) dose goals defined in the RAVES treatment plan protocol were met (Pearse et al 2014). The RAVES trial's dose constraints are presented in table 1.

Contouring similarity metrics
To quantitatively compare the similarity between the observer/atlas-based and STAPLE-based volumes, we used MiM Maestro software (Version 6.7, MiM Software Inc., Cleveland, OH). The similarity metrics in this study include VD, DSC, Jaccard, 95% HD, and MDA, which are summarised in appendix A.
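As a concrete illustration of the two overlap metrics, the sketch below computes the DSC and Jaccard index for volumes represented as sets of voxel indices. It uses the standard textbook definitions, which are assumed to match appendix A; the function names and the toy volumes are hypothetical, and the study itself used commercial software for these calculations.

```python
def dice(a, b):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty volumes are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical 2D example: a 4 x 4 reference contour and an observer
# contour of the same size shifted by one voxel in x.
reference = {(x, y) for x in range(0, 4) for y in range(0, 4)}
observer = {(x, y) for x in range(1, 5) for y in range(0, 4)}

dsc = dice(reference, observer)         # 2 * 12 / (16 + 16) = 0.75
jac = jaccard(reference, observer)      # 12 / 20 = 0.6
```

The two metrics are monotonically related (Jaccard = DSC / (2 − DSC)), so they rank contour pairs identically; this is worth bearing in mind when both are correlated against the same outcome.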

Dosimetric and radiobiological analysis
Assuming the STAPLE volumes represent the true target and OARs, dose-volume histograms (DVHs) for the STAPLE plan, as well as plans generated based on the observer and atlas volumes, were generated. The plans based on the STAPLE volumes were taken as the reference plans, and the differences in dose-volume metrics were calculated between these plans and the plans using the observer and atlas volumes. The DVH data were generated for all plans and exported to a Microsoft Excel spreadsheet with a bin resolution of 0.1 Gy for the purpose of data analysis and comparison. From the DVHs, the dose-volume metrics were reported for the prostate bed CTV (D2%, D98%), rectum (V40Gy, V60Gy), and bladder (V50Gy, V60Gy).
Figure 1. The ABAS process: (1) registration of the atlas image to the target image using both an affine and a deformable transformation, (2) propagation of the atlas CTV labels onto the registered image, and (3) generation of a set of contours by combining the propagated labels and applying thresholds at levels of agreement of 20%, 40%, 60%, 80%, 95%, and 100%.
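The dose-volume metrics reported from the DVHs in this section can be illustrated with a minimal sketch. It assumes the usual definitions (Dx% is the minimum dose received by the hottest x% of the structure, and VxGy is the percentage of the structure receiving at least x Gy) and works from a list of per-voxel doses with equal voxel volumes rather than from the exported 0.1 Gy DVH bins; the function names and dose values are hypothetical.

```python
import math


def v_dose(doses, threshold_gy):
    """Percentage of the structure volume receiving at least threshold_gy."""
    return 100.0 * sum(d >= threshold_gy for d in doses) / len(doses)


def d_volume(doses, volume_percent):
    """Minimum dose (Gy) received by the hottest volume_percent of the structure."""
    ordered = sorted(doses, reverse=True)
    # Number of voxels making up the hottest volume_percent (at least one).
    n_hot = max(1, math.ceil(len(ordered) * volume_percent / 100.0))
    return ordered[n_hot - 1]


# Hypothetical structure with 50 equal-volume voxels receiving 15, 16, ..., 64 Gy.
structure_doses = [float(d) for d in range(15, 65)]

v40 = v_dose(structure_doses, 40.0)   # percentage of volume at >= 40 Gy
v60 = v_dose(structure_doses, 60.0)   # percentage of volume at >= 60 Gy
d2 = d_volume(structure_doses, 2)     # near-maximum dose
d98 = d_volume(structure_doses, 98)   # near-minimum dose
```

D2% and D98% act as robust surrogates for the maximum and minimum dose, which is why they are used for the target, while the Vx metrics summarise how much of an OAR sits above a given dose level.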
In addition to comparing dose-volume metrics, radiobiological modelling was also used to estimate tumour control probability (TCP) and normal tissue complication probability (NTCP) using the Comp Plan software (open source) (Holloway et al 2012). The logistic TCP model and fitted parameters described by King et al for prostate bed salvage RT were used, with a TCD50 for biochemical control of 66.8 Gy and a γ50 of 3.8 (King and Kapp 2008). The NTCP for grade 2 rectal toxicity or bleeding and the NTCP for grade 2 bladder toxicity (urinary urgency or urinary frequency) were estimated with the widely used Lyman-Kutcher-Burman (LKB) model (Lyman 1985, Kutcher and Burman 1989). The TCP and NTCP parameters used for the model are summarised in table 2.
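To make the two models concrete, the sketch below evaluates one common parameterisation of the logistic TCP model and the LKB NTCP model. It is an illustration under stated assumptions rather than the Comp Plan implementation: the logistic form TCP = 1 / (1 + (TCD50 / D)^(4·γ50)) is assumed, the TCP is evaluated at a single uniform dose, and the LKB parameters shown (TD50, m, n) are placeholders, not the values of table 2. The DVH is a hypothetical differential DVH.

```python
import math

TCD50_GY = 66.8  # from the text (King and Kapp 2008)
GAMMA50 = 3.8    # from the text


def logistic_tcp(dose_gy, tcd50=TCD50_GY, gamma50=GAMMA50):
    """Logistic TCP: equals 0.5 at TCD50, with normalised slope gamma50 there."""
    return 1.0 / (1.0 + (tcd50 / dose_gy) ** (4.0 * gamma50))


def geud(dvh, n):
    """Generalised EUD from a differential DVH [(dose_gy, fractional_volume), ...]."""
    total = sum(volume for _, volume in dvh)
    return sum((volume / total) * dose ** (1.0 / n) for dose, volume in dvh) ** n


def lkb_ntcp(dvh, td50, m, n):
    """LKB NTCP: standard normal CDF of t = (gEUD - TD50) / (m * TD50)."""
    t = (geud(dvh, n) - td50) / (m * td50)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


# Placeholder LKB parameters for illustration only (NOT the table 2 values).
EXAMPLE_PARAMS = {"td50": 76.9, "m": 0.13, "n": 0.09}

# Hypothetical differential DVH for an OAR: (dose in Gy, fraction of volume).
example_dvh = [(20.0, 0.4), (45.0, 0.3), (60.0, 0.2), (66.0, 0.1)]

tcp_at_prescription = logistic_tcp(64.0)
example_ntcp = lkb_ntcp(example_dvh, **EXAMPLE_PARAMS)
```

Because the prescription dose of 64 Gy lies below the TCD50 of 66.8 Gy on the steep part of the curve, this form places the modelled TCP below 50%, so small changes in delivered target dose translate into visible changes in TCP.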

Statistical analysis
Statistical analysis was conducted using the SPSS software platform (SPSS Inc., Chicago, IL). To evaluate the impact of IOV upon target coverage and OAR sparing, a paired Student's t-test was used to compare the dosimetric and radiobiological parameters, with statistical significance defined as a p-value < 0.05.
Spearman's non-parametric rank-correlation coefficient (Γ) was used to assess relationships between geometric and plan dose metrics, with a p-value less than 0.05 considered statistically significant. In general, a value of Γ between −0.3 and 0.3 indicates almost no relationship; Γ between 0.3 and 0.5 (−0.5 to −0.3) indicates a weak positive (negative) relationship; Γ between 0.5 and 0.7 (−0.7 to −0.5) indicates a moderate positive (negative) relationship; Γ between 0.7 and 1 (−1 to −0.7) indicates a strong positive (negative) relationship (Hinkle et al 2003). A total of 90 treatment plans (three manual and six atlas contour-based plans for each of 10 patients) were used to investigate the relationship between contouring variability, dose, and radiobiological metrics. With a two-sided test at α = 0.05 and a power of 1 − β = 0.9, this sample gives the study the statistical power to detect true correlations of coefficient ρ = 0.3 (Machin et al 2011).
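The correlation analysis can be illustrated with a short sketch that computes Spearman's coefficient as the Pearson correlation of average ranks and then applies the interpretation bands quoted above. The study used SPSS; the functions, the handling of values that fall exactly on a band boundary, and the six data points below are assumptions made for illustration and are not study data.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def strength(coefficient):
    """Interpretation bands from the text; boundary handling is assumed."""
    size = abs(coefficient)
    if size >= 0.7:
        return "strong"
    if size >= 0.5:
        return "moderate"
    if size >= 0.3:
        return "weak"
    return "almost none"


# Hypothetical values: CTV DSC against rectal NTCP (%) for six plans.
dsc_values = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70]
ntcp_values = [2.0, 2.5, 2.2, 3.1, 2.9, 3.5]

coefficient = spearman(dsc_values, ntcp_values)
label = strength(coefficient)
```

Because the coefficient depends only on ranks, it captures any monotonic relationship between a similarity metric and a modelled outcome without assuming linearity, which suits metrics such as DSC that saturate near 1.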

The CTV contraction/expansion simulation
In this section, the aim was to identify the range of CTV variation in which the impact of this variation on dosimetry was minor or major, according to the RAVES protocol. A description of the major and minor variations is given in Cloak et al's work (Cloak et al 2019a, 2019b). As described in section 2.2, the STAPLE CTV for each patient in dataset 1 was generated by combining the CTV contours of the three experienced observers. A uniform 1 cm margin was then added to the STAPLE CTV to create the STAPLE PTV. To simulate variations in CTV delineation, the STAPLE PTV was contracted or expanded in this study. For contraction, the PTV was uniformly reduced by 1 mm, 2 mm, 3 mm, and 4 mm. For expansion, the PTV was uniformly increased by 1 mm, 2 mm, 3 mm, and 4 mm. Figure 2 illustrates these volumes.
The VMAT treatment technique presented in section 2.4 was applied for all PTV contraction/expansion volumes (combined with the STAPLE rectum and the STAPLE bladder volumes). This resulted in four contraction plans and four expansion plans being generated for each patient. The STAPLE DVHs for plans generated based on the contraction/expansion volumes were exported to a Microsoft Excel spreadsheet with a bin resolution of 0.1 Gy. The dose-volume metrics for the prostate bed PTV (D2%, D98%) and rectum (V40Gy, V60Gy) were reported from the DVHs and compared to the target dose requirements and dose constraints outlined in the RAVES protocol (see table 1).
The aim of this simulation was to assess how variations in CTV contouring, represented by the contraction/expansion of the STAPLE PTV, would influence dosimetric outcomes. This section focuses on evaluating the impact of simulated CTV variation rather than the application of different PTV margins.
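The uniform contraction and expansion of a volume can be pictured as morphological erosion and dilation with a spherical structuring element. The sketch below is a simplified 2D illustration on a 1 mm grid, not the treatment planning system's margin tool; the function names and the square toy volume are hypothetical.

```python
def disk_offsets(radius_mm):
    """Integer offsets within radius_mm of the origin on a 1 mm grid."""
    r = int(radius_mm)
    return [
        (dx, dy)
        for dx in range(-r, r + 1)
        for dy in range(-r, r + 1)
        if dx * dx + dy * dy <= radius_mm * radius_mm
    ]


def expand(mask, radius_mm):
    """Uniform expansion: every pixel within radius_mm of the mask."""
    offsets = disk_offsets(radius_mm)
    return {(x + dx, y + dy) for (x, y) in mask for (dx, dy) in offsets}


def contract(mask, radius_mm):
    """Uniform contraction: pixels whose whole radius_mm neighbourhood is inside."""
    offsets = disk_offsets(radius_mm)
    return {
        (x, y)
        for (x, y) in mask
        if all((x + dx, y + dy) in mask for (dx, dy) in offsets)
    }


# Hypothetical 10 mm x 10 mm square standing in for one slice of a PTV.
toy_ptv = {(x, y) for x in range(10) for y in range(10)}

# Contracted and expanded versions for 1 to 4 mm, as in the simulation.
variants = {}
for margin in (1, 2, 3, 4):
    variants[-margin] = contract(toy_ptv, margin)
    variants[+margin] = expand(toy_ptv, margin)
areas = {margin: len(mask) for margin, mask in variants.items()}
```

Even in this toy example the area changes by roughly a third for a 1 mm step in either direction, which illustrates why millimetre-scale margin changes can move OAR dose-volume metrics across a constraint.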

Patient statistics
The average volume of the CTV for all observers, all atlas-based contours, and the STAPLE contour (mean ± std) was 93.1 ± 22.5 c.c.

Observer-based volume evaluation
The geometric evaluation was conducted by calculating the VD, DSC, 95% HD, MDA, and Jaccard metrics between the STAPLE and observer contours. Table 3 presents the average and standard deviation of the various geometric measures for the CTV and OARs. Figure 3 illustrates the results by plotting the (a) DSC and (b) MDA for the CTV, as well as the DSC for the (c) rectum and (d) bladder for each individual patient. The results indicate that significant geometric variations were found for the CTV and rectum contours, while there were no significant changes observed in the bladder contours. Table 4 presents the mean and standard deviation for each of the CTV contouring similarity metrics evaluated. The largest geometric variation was observed for patient 6, while patient 4 had the smallest overall difference. Figure 4 illustrates these patients by displaying (a) patient 6 and (b) patient 4. The largest discrepancies in CTV delineation were observed at the superior and posterior borders of the prostate bed, particularly in the superior areas where the CTV and bladder overlap.
The mean values and standard deviations of dosimetric and radiobiological parameters for the STAPLE-based and observer-based plans are presented in table 5. The results indicate that, when comparing plans generated using observer contours to those generated using STAPLE contours, there were statistically significant increases in rectal and bladder doses. However, no significant differences were identified in CTV doses. These significant differences were specific to the contours of observer C; no significant differences were found in the rectum and bladder parameters for the contours of observers A and B.

Correlations between contouring similarity metrics and radiobiological parameters
Table 6 presents the mean and standard deviation of the various geometric measures for the CTV when comparing the STAPLE and atlas volumes (referred to as CTV atlas) as well as when comparing the STAPLE volumes and the atlas/observer-based volumes (referred to as CTV overall). The results indicate that there were large differences between the atlas and STAPLE volumes (CTV atlas), with the discrepancies being overall larger than those observed in the observer-based volumes (as seen in table 3). These large ranges of geometric variations (CTV overall) were used to evaluate the correlation with treatment outcomes in this study. Table 7 illustrates the impact of geometric variations on radiobiological parameters. Compared to plans based on the STAPLE volumes, plans based on the atlas and observer volume variations resulted in a 2.2 ± 4.5% change in TCP (equivalent to a ~6% change in the mean value), a 1.4 ± 1.4% change in rectal NTCP for parameter set a (equivalent to a ~50% change in the mean value), and a 3.2 ± 3.4% change in urinary NTCP for parameter set c (equivalent to a ~20% change in the mean value). Table 8 presents the results of Spearman's correlation analysis between geometric metrics and radiobiological parameters for all patients and contour sets, including both atlas- and observer-based contours. The analysis revealed no strong correlations between the geometric metrics and the modelled outcomes. Only moderate correlations (0.5 < |Γ| < 0.7) were identified between several similarity metrics (DSC, Jaccard, and MDA) and rectum toxicities (NTCP for parameter sets a and b).

The impact of CTV contraction/expansion simulation on dosimetry
The study simulated a range of CTV to PTV margins by adding a 1 cm margin to the STAPLE CTV and then contracting or expanding it. Figure 6 illustrates the mean and standard error of (a) PTV D2%, (b) PTV D98%, (c) rectum V40Gy, and (d) rectum V60Gy for all patients when the PTV is contracted or expanded. The results show that all plans met the PTV dose requirement at D2%; however, only plans generated based on a PTV larger than PTV − 3 mm met the PTV dose requirement at D98%. As for the rectum, only plans generated based on a PTV smaller than PTV + 4 mm met the V40Gy dose constraint, and only plans generated based on a PTV smaller than PTV + 2 mm met the V60Gy dose constraint. It can be deduced that only plans generated based on a PTV larger than PTV − 3 mm and smaller than PTV + 2 mm met all the target dose requirements and dose constraints. Therefore, the range of CTV variation within which the impact on dosimetry is minor is <5 mm (from contraction at 3 mm to expansion at 2 mm).
Abbreviations: VD, volume difference; DSC, Dice similarity coefficient; 95% HD, 95% Hausdorff distance; Jaccard, Jaccard metric; overall, both atlas- and observer-based volumes. Data are presented as the mean ± standard deviation.
Table 7. The impact of geometric variations on radiobiological parameters: mean ± standard deviation of TCP, rectal NTCP, and bladder NTCP for plans based on the STAPLE volumes and for plans based on the atlas and observer-based volume variations.
Abbreviations: TCP, tumour control probability; NTCP, normal tissue complication probability; Δ, the difference between the atlas/observer-based plans and the STAPLE-based plans. Data are presented as the mean ± standard deviation. NTCP values labelled a, b, c, and d are calculated from the corresponding parameter sets a, b, c, and d described in table 2.

Discussion
Accurate target delineation is widely recognised as essential for generating high-quality treatment plans for RT. However, manual contouring by medical professionals is a time-consuming task and may result in IOV. To quantify the magnitude of IOV, many contouring similarity metrics have been used. Although these metrics are easy to calculate, they lack true clinical meaning. In this study, we aimed to identify the clinical meaning of commonly used similarity metrics by evaluating the relationship between these metrics and radiobiological parameters. Our results revealed that, in the post-prostatectomy setting, no strong correlations were observed, even though the contouring variation generated and used was larger than typical clinical variation.
The results from this study indicate that only weak or no correlations were found between the considered geometric metrics and tumour control (refer to table 8). This can be attributed to the substantial 1 cm PTV margin employed in this study to create the PTV, which appears to effectively account for inter-observer CTV contouring variation. However, it is important to note that while the PTV margin is intentionally designed to incorporate some contouring variation, it may not specifically address disproportionate contouring variation. The results also indicated that for rectal NTCP, the MDA, DSC, and Jaccard metrics moderately correlated with modelled toxicity (0.5 < |Γ| < 0.7). This can be explained by the fact that the rectum is often in close proximity to the target volume and in the region of a unidirectional steep dose gradient, such that global differences in target position/surface have a large effect on modelled toxicity. In contrast, these metrics had weak or no correlations with modelled toxicity for the bladder (|Γ| < 0.5). This was likely caused by the fact that part of the bladder, as defined in the FROGG guidelines used in the RAVES trial, is included in the target volume (Sidhom et al 2008). Therefore, this part will always receive the prescribed dose, while the proportion of the remaining bladder will depend on the bladder volume, which can vary considerably between patients due to their capacity to maintain a full bladder.
In the case of intact prostate RT, Roach et al investigated the relationship between CTV DSC and MDA metrics and modelled outcomes using an auto-planning VMAT technique, finding no or only weak correlations (Roach et al 2018). In the present study, the auto-planning VMAT treatment technique was used for prostate bed RT, and moderate correlations between contouring similarity metrics and modelled outcomes were observed. This demonstrates that the relationships between geometric variations and modelled clinical outcomes may vary between intact prostate and prostate bed. The discrepancy may be due to the significant difference in treatment volumes between the two studies. In the intact prostate study, the CTV STAPLE volume was 42.71 ± 21.3 c.c. with a PTV margin expansion of 0.7 cm, while in the present study, the CTV STAPLE volume was 88.4 ± 24.5 c.c. with a PTV margin expansion of 1 cm. A larger PTV in prostate bed RT compared to intact prostate RT means that a larger part of the rectum will always be irradiated. This increases the sensitivity of rectal dosimetry to IOV in prostate bed RT.
The results in figure 3 show that manual bladder contours were highly consistent between all three clinicians (Observers A, B, and C); however, significant variations in manual prostate bed CTV and rectum contours were observed. This can be explained by the bladder having well-defined borders that are clearly visible on planning CT scans, which aids structure delineation, while the CTV and rectum have poor contrast between the region of interest and surrounding tissues (Liu et al 2021). Additionally, greater variation was found in prostate bed CTV delineation, particularly in superior areas (figure 4). This is because the prostate bed CTV is an anatomically complex structure that is mainly defined by surrounding OARs. In the RAVES trial, the FROGG guidelines, which define the CTV as including a part of the bladder volume (CTV superior areas), were used (Sidhom et al 2008).
The goal of developing a treatment plan in RT is to achieve adequate target coverage while minimising dose to OARs. The results of this study demonstrate that CTV variation can result in significant dosimetric/radiobiological variation in OARs. As shown in figure 3, the CTVs delineated by Observer C had larger variations compared to the CTVs delineated by Observers A and B. This resulted in significant differences in rectum and bladder dosimetry/radiobiology for Observer C, while no significant differences were found for Observers A and B when compared to the STAPLE plans (table 5). These findings suggest that IOV of the prostate bed CTV plays an important role in achieving a high-quality treatment plan, and may affect OAR toxicities. However, no significant difference in CTV dosimetry/radiobiology was found, because the PTV margin used in this study was adequate to account for IOV.
The observations depicted in figure 6 highlight that, when the range of CTV variation was limited to within 5 mm, the resulting dosimetric impact was generally deemed minor for this cohort of patients. A study by Cloak et al investigated the prostate bed CTV IOV (as defined according to the FROGG guidelines) using data from the RAVES trial. The authors found that the median CTV variation was 3.8 mm, ranging from 1.8 to 8 mm (median MDA = 1.9 mm) (Cloak et al 2019a, 2019b). This indicates that the CTV variation was within 5 mm on the majority of occasions, but at some points of the volume, the variation could be larger. The variations at some points of the CTV had only weak or no correlations with the modelled outcomes. This is evident from the weak or no correlations observed between 95% HD and radiobiological parameters in table 8. Therefore, it can be inferred that the impact of the prostate bed CTV variation on the RAVES trial's results is not significant. However, it is suggested that the degree of agreement between observers in clinical practice still needs to be further improved.
The use of automated planning in this study allows for the creation of unbiased, high-quality treatment plans based on a given structure set. In this study, the AP module was used consistently for all cases and an identical optimisation constraint was used to drive the optimisation, ensuring that the target and OAR dose goals defined in the RAVES treatment plan protocol were met. This minimised the impact on dosimetry of treatment plan variation associated with an individual planner's knowledge and experience. As the present study aimed to correlate contouring similarity metrics with radiobiological parameters, rather than to evaluate the treatment plan quality, it is assumed that the residual variation in treatment plan quality is insignificant.
When comparing the atlas contours and manual contours, this study found that the variation of the atlas contours was larger than that of the manual contours (see tables 3 and 6). The likely explanation is that the benchmarking data comprised only one patient CT data set, so the atlas created in this study fails to account for patients whose anatomy differs significantly from the benchmarking data. For example, the CTV STAPLE volumes in patients 2 and 9 were 65.769 c.c. and 79.33 c.c., respectively, and these patients had significant anatomical differences compared to the benchmarking patient, whose median CTV volume was 113.2 c.c. Additionally, it is essential to acknowledge the impact of deformable registration uncertainties on the ABAS: the atlas contours rely on accurate registration, and potential uncertainties associated with deformable registration are recognised (Dowling et al 2015). Furthermore, while the ABAS tool offered an effective approach with the potential to increase adherence to contouring guidelines, it still needs to be further refined and validated for clinical use (Gambacorta et al 2013, Valentini et al 2014). Therefore, this study suggests that the prostate bed CTV cannot be automatically defined using the proposed ABAS strategy (single atlas with multiple contour sets) and would require multiple corrections. Using atlas-based segmentation tools that include multiple patients within the atlas increases the chance of the best match being similar to the patient's anatomy and improves delineation accuracy and consistency (Dowling et al 2015); therefore, further studies should include more patients within the atlas. It is important to clarify that varying contours were used intentionally, in keeping with the study's specific aim of assessing geometric variation and its associated dosimetric impact, rather than requiring high accuracy of deformable registration. Therefore, while deformable registration uncertainties are acknowledged, the study contends that these uncertainties do not undermine the reliability of the paper's conclusions, given the study's distinct focus on clinical interpretation of similarity metrics.
Additionally, the second dataset in this study, comprising 1 patient with 92 manual contour sets, was primarily utilised to generate additional contours for patients in dataset 1 using the ABAS algorithm. The subset of 42 CTV contours was selected to ensure a focused analysis within the accepted guidelines of the RAVES trial, guided by the RAVES quality assurance (QA) review committee. This excluded contours with minor or major variations, with the goal of generating additional contours that are still considered acceptable with minimal deviation compared to real-world clinical scenarios, considering the imperfections of the ABAS algorithm and its sensitivity to input contours (Dowling et al 2015). Consequently, this study selected contours post-QA to mitigate excessive variation of the additional contours compared to clinical trial contours. Within the 92-contour dataset, some contours were submitted by the same clinician owing to resubmission or post-feedback submission, although it is not possible to determine how many. These contours represent intra-observer rather than inter-observer variation for these cases, although they still represent the range of variation that might be seen in standard practice. For future studies, the inclusion of contours that did not meet the trial protocols could also be considered, arguably providing a more 'real world' assessment of the variation.
The results of this study suggest that, despite the uncertainties in the parameters used in radiobiological models, the impact of IOV on modelled outcomes remains unchanged. As shown in table 2, various parameters were applied to calculate NTCP, yet the results in table 5 indicate that the use of different parameters did not affect the impact of IOV on the modelled outcomes. This suggests that, while the use of different parameters in radiobiological models may introduce uncertainty, the impact of IOV remains consistent regardless of the parameters used.
For this study the reference volumes were generated using the STAPLE algorithm; other approaches have used the single most experienced observer or majority voting to define a reference volume (Roach et al 2018). There is no consensus in the literature on the best approach to determining reference volumes. The reference volumes were generated from only the three manual volumes rather than also including the six additional atlas volumes. This is because a low level of consistency was found between the atlas and the manual contours in this study; the atlas contours were therefore excluded to minimise potential bias. It should be noted that the STAPLE volumes were used as a reference for comparison to other volumes and not necessarily as true volumes; different reference volumes may change the conclusions, and future studies may wish to assess this. As the purpose of this study was to assess the relationship with the dosimetry of the target, rather than that of the OARs, the atlas-based OAR volumes were not generated or assessed. It is noted that two of the three observers in dataset 1 were among the original 42 observers in dataset 2, indicating an overlap between observer subsets. This may have resulted in 2 contours in dataset 2 showing more similarity to the STAPLE volume than others (an intra-observer variation between datasets 1 and 2); however, our study primarily focused on assessing dosimetric consequences within the specified contouring framework. The practical constraints of observer availability influenced the choice of the observer subset in dataset 1, and future investigations may explore the impact of observer subset size on the validity of STAPLE contours in prostate bed radiotherapy.
Finally, the choice of treatment planning technique might be an important factor to consider when evaluating the relationship between contouring similarity metrics and dosimetry. Livsey et al found no statistically significant correlations between intact prostate CTV variation and rectum dosimetry when using 3D-CRT treatment plans, while Roach et al observed a significant difference using VMAT techniques (Livsey et al 2004, Roach et al 2018). This discrepancy may be explained by the steeper dose gradients generated by VMAT treatment plans, which increase the sensitivity of target volume dosimetry to IOV (Roach et al 2018). Therefore, future studies exploring different treatment techniques, such as stereotactic body radiotherapy (SBRT) treatment plans that deliver higher doses per fraction, may reveal stronger correlations than those observed in the present study for prostate bed RT.
This work was conducted to determine if there were clear correlations between geometric and plan dose metrics. It is important to note the impact of the sample size used for this study. The statistical power calculation was based on Spearman's rank-correlation coefficient, which was used to assess relationships between geometric and plan dose metrics. The sample size of 90 treatment plans was determined to achieve a statistical power of 0.9, corresponding to a 10% chance of a false negative result. The study has the power to detect true correlations with a coefficient of ρ = 0.3 (medium correlation) (Machin et al 2011). There is also a possibility of encountering false positives, and the impact of multiple variables was not considered (Miguel 2023). A larger future study would provide more confidence and may detect smaller correlations. A more automated approach would likely be required to undertake this, considering that detecting true correlations with a coefficient of ρ = 0.2 at a statistical power of 0.9 (a 10% chance of a false negative) would require a sample size of 258 (Jameson et al 2014).
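To make the dependence of sample size on the target correlation concrete, the following is a minimal sketch of one common approximation, the Fisher z-transformation. It is an assumption that this approximation is suitable here: it is derived for Pearson correlation and is often applied to Spearman's coefficient only as a rough guide. The function name, the default significance level of 0.05, and the two-sided default are illustrative choices not stated in this paper, so the output need not reproduce the sample sizes quoted above, which come from the cited references.

```python
import math
from statistics import NormalDist


def sample_size_for_correlation(rho, alpha=0.05, power=0.9, two_sided=True):
    """Approximate sample size needed to detect a correlation of size rho.

    Uses the Fisher z approximation n = ((z_alpha + z_beta) / atanh(rho))^2 + 3.
    alpha and two_sided are assumptions; the paper does not state them.
    """
    if not 0.0 < abs(rho) < 1.0:
        raise ValueError("rho must be non-zero and strictly between -1 and 1")
    norm = NormalDist()
    # Critical value for the chosen significance level
    tail = alpha / 2.0 if two_sided else alpha
    z_alpha = norm.inv_cdf(1.0 - tail)
    # Quantile corresponding to the requested power (1 - false negative rate)
    z_beta = norm.inv_cdf(power)
    effect = math.atanh(abs(rho))  # Fisher z-transform of the target correlation
    return math.ceil(((z_alpha + z_beta) / effect) ** 2 + 3)


if __name__ == "__main__":
    for rho in (0.3, 0.2):
        print(rho, sample_size_for_correlation(rho))
```

Under any fixed choice of significance level and sidedness, the required sample size grows roughly with the inverse square of the correlation to be detected, which is why detecting ρ = 0.2 needs considerably more plans than detecting ρ = 0.3.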
The findings of this study indicate that while contouring similarity metrics can be effective in measuring inter-observer variation, they should be approached with caution when assessing the holistic quality of RT treatment plans, as similarity metrics may not be appropriate for this purpose. Despite generating a large degree of contouring variation in this study, only moderate correlations were observed between several similarity metrics and modelled outcomes. It is recommended that, in addition to utilising similarity metrics, the range of CTV variation should also be considered to determine the potential impact on dosimetry. This can be used as a guide to determine whether the variation is minor or major according to the trial protocol.

Conclusions
In conclusion, this study represents the first investigation of the relationship between contouring similarity metrics and radiobiological parameters in the context of prostate bed radiotherapy. The results indicate that, although a large contouring variation was generated and used, there were no strong correlations between the considered similarity metrics and modelled outcomes. Only moderate correlations between the modelled outcomes and several geometric metrics (DSC, Jaccard, and MDA) could be observed. Utilising a range of CTV-PTV margins in conjunction with similarity metrics can help to provide a more comprehensive understanding of the potential impact of contouring variation on dosimetry.

Figure 1. Workflow of the proposed ABAS algorithm used in this study. The process includes three steps: (1) registration of the atlas image to the target image using both an affine and a deformable transformation, (2) propagation of the atlas CTV labels onto the registered image, and (3) generation of a set of contours by combining the propagated labels and applying thresholds at levels of agreement of 20%, 40%, 60%, 80%, 95%, and 100%.
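Step (3) of the workflow can be illustrated with a minimal sketch. It assumes that the propagated CTV labels are already available as flattened binary masks on a common voxel grid (registration and propagation are not shown), and that a voxel is kept when the fraction of propagated labels containing it is at least the threshold. The function names and the "at least" convention are illustrative assumptions, not the implementation used in this study.

```python
def agreement_map(masks):
    """Per-voxel fraction of propagated labels that include the voxel.

    masks: list of equal-length lists of 0/1 values (flattened binary masks).
    """
    n_masks = len(masks)
    return [sum(voxel_labels) / n_masks for voxel_labels in zip(*masks)]


def threshold_contours(masks, levels=(0.20, 0.40, 0.60, 0.80, 0.95, 1.00)):
    """Return one binary mask per agreement level.

    A voxel is kept if its agreement fraction is >= the level (assumed
    convention); a small tolerance guards against floating-point rounding.
    """
    agreement = agreement_map(masks)
    eps = 1e-9
    return {
        level: [1 if fraction + eps >= level else 0 for fraction in agreement]
        for level in levels
    }


if __name__ == "__main__":
    # Five hypothetical propagated labels over four voxels
    propagated = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
    ]
    for level, mask in threshold_contours(propagated).items():
        print(f"{level:.2f}", mask)
```

Lower agreement levels give larger, more inclusive volumes and higher levels give smaller ones, which is how a single atlas case with many contour sets can yield a graded family of CTVs.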

Figure 2. Simulation of CTV delineation variation by contracting and expanding the PTV, which was generated from the STAPLE CTV by adding a uniform 1 cm margin. The red lines represent the PTV, the light blue lines represent the contracted PTV (uniformly −1 mm, −2 mm, −3 mm, and −4 mm), and the yellow lines represent the expanded PTV (uniformly +1 mm, +2 mm, +3 mm, and +4 mm). The figure shows transverse, coronal, and sagittal images in a clockwise direction from the left.

Figure 3. Geometric variation between the STAPLE and observer-based volumes for the CTV, rectum, and bladder: (a) and (b) show the DSC and MDA for the CTV, respectively, while (c) and (d) show the DSC for the rectum and bladder, respectively. The results are plotted for each individual patient with observers A (blue), B (red), and C (green). P (X-Y) denotes the Student's t-test P-value of the geometric measures between observers X and Y.

Figure 4. (a) Patient 6 with the largest geometric variations in CTV delineation and (b) patient 4 containing the smallest overall differences in CTV delineation. The contours of the CTV are displayed for observer A (blue), observer B (purple), and observer C (green) in clockwise order from the top on transverse, coronal and sagittal images. The STAPLE volume is shaded in light gold for comparison.

Figure 6. Impact of PTV contraction/expansion on dosimetry, showing the mean and standard error of (a) PTV D2, (b) PTV D98, (c) rectum V40, and (d) rectum V60 for all patients. The dashed lines indicate the target requirements in (a) and (b) and the rectal dose constraints in (c) and (d) according to the RAVES protocol. The PTV volume was uniformly contracted by −1 mm, −2 mm, −3 mm, and −4 mm, and uniformly expanded by +1 mm, +2 mm, +3 mm, and +4 mm.

Table 1. Target dose requirements and dose constraints for treatment plans according to the TROG 08.03 RAVES protocol.
PTV
• The D98 (dose covering 98% of the PTV) shall be at least 95% of the prescription dose (60.8 Gy)
• The maximum dose (D2, dose to 2% of the PTV) shall be no more than 107% of the prescription dose (68.48 Gy)
CTV
• The minimum dose (defined as the D98) shall be 64 Gy

Table 2. The parameters used for radiobiological modelling in this study. Included are the logistic TCP model and its parameters for prostate bed salvage RT, as well as the NTCP parameters for rectal and bladder toxicity using the widely accepted LKB model. This study considered two rectum NTCP parameters (a, b) and two bladder NTCP parameters (c, d).
a Rectal bleeding (Michalski et al 2010a). d Urinary frequency (Mavroidis et al 2018).
Abbreviations: ┼, dose to achieve 50% biochemical control; ╪, slope of logistic curve at 50% TCP; ╫, dose to the whole organ leading to 50% complication of the population; m, slope parameter from the LKB model; n, LKB model exponent.
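For readers unfamiliar with the models named in table 2, the following is a minimal sketch of the standard LKB NTCP formulation (generalised EUD followed by a probit function) and of one common form of the logistic TCP model. The exact logistic form and the dose quantity fed into it in this study are not given in this section, so those choices are assumptions. All parameter values in the example are placeholders for illustration only and are not the values from table 2.

```python
from statistics import NormalDist


def geud(dvh, n):
    """Generalised equivalent uniform dose from a differential DVH.

    dvh: list of (dose_gy, volume) bins; volumes are normalised internally.
    n: LKB volume-effect exponent.
    """
    total_volume = sum(volume for _, volume in dvh)
    return sum((volume / total_volume) * dose ** (1.0 / n) for dose, volume in dvh) ** n


def lkb_ntcp(dvh, td50, m, n):
    """LKB NTCP: Phi((gEUD - TD50) / (m * TD50))."""
    t = (geud(dvh, n) - td50) / (m * td50)
    return NormalDist().cdf(t)


def logistic_tcp(dose, d50, gamma50):
    """One common logistic TCP form: 1 / (1 + (D50 / D)^(4 * gamma50)).

    Assumed form; 'dose' would typically be a representative target dose.
    """
    if dose <= 0.0:
        return 0.0
    return 1.0 / (1.0 + (d50 / dose) ** (4.0 * gamma50))


if __name__ == "__main__":
    # Placeholder DVH and parameters, NOT taken from table 2
    example_dvh = [(20.0, 0.5), (40.0, 0.3), (65.0, 0.2)]
    print(lkb_ntcp(example_dvh, td50=80.0, m=0.2, n=0.1))
    print(logistic_tcp(64.0, d50=60.0, gamma50=1.0))
```

By construction, NTCP equals 0.5 when the gEUD equals TD50 and TCP equals 0.5 when the dose equals D50, which matches the parameter definitions given in the table abbreviations.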

Table 5. Comparison of dose and radiobiological metrics from STAPLE-based and observer-based (A, B, C) contoured volumes for all patients. CTV, clinical target volume; Dxx%, dose incident on xx% of the structure volume; VxxGy, % volume of structure receiving a dose of xx Gy; TCP, tumour control probability; NTCP, normal tissue complication probability; P1, P2 and P3 are the P-values for STAPLE versus observers A, B, and C, respectively; bold numbers, statistical significance (P < 0.05); data are presented as the mean ± standard deviation. NTCP a, b, c, d are the values calculated from parameter sets a, b, c, d as described in table 2.

Table 6. The CTV similarity metrics calculated between atlas-based volumes and STAPLE-based volumes (referred to as CTV atlas), and between atlas/observer-based volumes and STAPLE-based volumes (referred to as CTV overall). The table displays the mean and standard deviation of the various geometric measures for the CTV. The CTV overall values were used to evaluate the correlation with treatment outcomes in this study.
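As a reminder of how the overlap-based similarity metrics are defined, the following minimal sketch computes the DSC and Jaccard index for two volumes represented as sets of voxel indices. The set representation, the function names, and the convention of returning 1.0 for two empty volumes are illustrative assumptions, not a description of the software used in this study.

```python
def dice(volume_a, volume_b):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(volume_a), set(volume_b)
    if not a and not b:
        return 1.0  # assumed convention for two empty volumes
    return 2.0 * len(a & b) / (len(a) + len(b))


def jaccard(volume_a, volume_b):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    a, b = set(volume_a), set(volume_b)
    if not a and not b:
        return 1.0  # assumed convention for two empty volumes
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two hypothetical volumes sharing half of their voxels
    reference = range(0, 10)
    observer = range(5, 15)
    d = dice(reference, observer)
    j = jaccard(reference, observer)
    print(d, j)
    # The two metrics are related by J = DSC / (2 - DSC)
    print(abs(j - d / (2.0 - d)) < 1e-12)
```

Because the Jaccard index is a fixed increasing function of the DSC, the two metrics carry the same overlap information on different scales.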

Table 8. Results of Spearman's correlation analysis between CTV geometric variation and radiobiological parameters for all patients and all contour sets, including both atlas- and observer-based contours. The table displays the correlation coefficient (Γ), indicating the strength of the correlation between the two variables.
Abbreviations: *, correlation is significant; mod, moderate correlation; CTV = clinical target volume; TCP = tumour control probability; NTCP = normal tissue complication probability; VD = volume difference; DSC = Dice similarity coefficient; 95% HD = 95% Hausdorff distance; MDA = mean distance to agreement; Jaccard = Jaccard similarity; the values under the metrics represent the mean and standard deviation calculated for all contour sets and all patients. NTCP a, b, c, d are the values calculated from parameter sets a, b, c, d as described in table 2.