Evaluation of auto-segmentation for brachytherapy of postoperative cervical cancer using deep learning-based workflow

Objective. The purpose of this study was to evaluate the accuracy of brachytherapy (BT) planning structures derived from deep learning (DL) based auto-segmentation compared with standard manual delineation for postoperative cervical cancer. Approach. We introduced a convolutional neural network (CNN) that had been developed and presented for auto-segmentation in cervical cancer radiotherapy. A dataset of 60 patients who received BT for postoperative cervical cancer was used to train and test this model for delineation of the high-risk clinical target volume (HRCTV) and organs at risk (OARs). Dice similarity coefficient (DSC), 95% Hausdorff distance (95%HD), Jaccard coefficient (JC) and dose-volume index (DVI) were used to evaluate the accuracy. The correlation between geometric metrics and dosimetric differences was assessed by Spearman's correlation analysis. Radiation oncologists scored the auto-segmented contours by rating the level of satisfaction (no edits, minor edits, major edits). Main results. The mean DSC values of the DL based model were 0.87, 0.94, 0.86, 0.79 and 0.92 for HRCTV, bladder, rectum, sigmoid and small intestine, respectively. The Bland-Altman test showed dose agreement for HRCTV_D90%, HRCTV_Dmean, bladder_D2cc, sigmoid_D2cc and small intestine_D2cc. Wilcoxon's signed-rank test indicated significant dosimetric differences in bladder_D0.1cc, rectum_D0.1cc and rectum_D2cc (P < 0.05). Spearman's correlation analysis found a strong correlation between HRCTV_D90% and its DSC (R = −0.842, P = 0.002) and JC (R = −0.818, P = 0.004). In the physician review, 80% of HRCTVs and 72.5% of OARs in the test dataset were rated satisfactory (no edits). Significance. The proposed DL based model achieved satisfactory agreement between the auto-segmented and manually defined contours of HRCTV and OARs, although the clinical acceptability of the small-volume doses of OARs around the target was a concern. DL based auto-segmentation was an essential component of the cervical cancer workflow and could generate accurate contours.


Introduction
There is strong evidence (Chino et al 2020) supporting the use of radiotherapy (RT) as an adjuvant treatment for postoperative cervical cancer in the presence of risk factors, and brachytherapy (BT) is strongly recommended for all women receiving definitive RT. Mauro et al (2019) reported that vaginal cuff BT was associated with a reduced recurrence rate (disease-free survival of 86.9 months) in the postoperative setting of high-risk patients with early-stage cervical cancer. Indeed, BT is a critical treatment modality and is closely associated with improvements in clinical outcomes (Contreras et al 2020). Modern BT techniques allow adaptive treatment planning based on three-dimensional (3D) images and offer advantages over the conventional two-dimensional (2D) image-based approach (Harkenrider et al 2015). Currently, the application of 3D image-guided BT (IGBT) for each applicator implantation is still limited in China because of the complex treatment workflow and high workload pressures. This highlights the importance of a rapid and accurate method that would improve the IGBT workflow through automation (Banerjee et al 2021).
Artificial intelligence (AI) has been applied at several steps of the RT procedure, from initial treatment decision-making to the delivery of radiation therapy (Huynh et al 2020). Many studies have demonstrated the accuracy and availability of auto-segmentation of targets and organs at risk (OARs) in external beam radiation therapy (EBRT) (Rhee et al 2020, Wang et al 2020, Liu et al 2020a, 2020b). However, relevant research has rarely been reported for BT (Cao et al 2022, Yoganathan et al 2022). The delineation of planning structures is an important but labor-intensive part of the RT workflow, and it suffers from inter- and intra-observer variability (Duane et al 2014). Moreover, contouring uncertainties are magnified in high-dose-rate BT owing to the steep dose gradient (Saarnak et al 2000). Therefore, improving the precision of delineation and simplifying the RT procedure are of practical significance in the BT workflow. Mohammadi et al (2021) reported that a ResU-Net deep learning (DL) model achieved good agreement between the predicted and manually defined contours of bladder, rectum and sigmoid. Jiang et al (2021) used a RefineNet-based DL model to obtain rapid and accurate automatic delineation of CTVs and OARs. As a novel technique, DL has replaced most conventional machine learning models and has shown the ability to reach human-equivalent performance (LeCun et al 2015). For automatic delineation in the BT workflow, the quality and reliability of DL models should be further evaluated against multiple criteria, including geometric metrics, dosimetric metrics and subjective assessment.
The primary purpose of this study was to evaluate the accuracy of a DL model in automatically delineating the high-risk clinical target volume (HRCTV) and OARs for patients with postoperative cervical cancer in the BT workflow.

Experiments
The evaluation in this work was divided into two parts. In the first, the accuracy of DL based auto-segmentation was assessed using objective indicators, including geometric and dosimetric metrics. In the second, the auto-segmented contours were assessed subjectively by two experienced radiation oncologists.

Clinical datasets
Sixty patients with postoperative cervical cancer, collected between August 2021 and June 2022, were included in this study. The enrolled patients were diagnosed with 2018 International Federation of Gynecology and Obstetrics (FIGO) stage IB1-IIIC1 disease and treated with EBRT (45-50.4 Gy, 1.8 Gy per fraction) and IGBT (12-30 Gy, 6 Gy per fraction) as a boost treatment. All patients were treated using interstitial metal needles with a diameter of 1.5 mm. The mean age ± standard deviation of these patients was 48.20 ± 14.52 years. For each patient, oral contrast (diatrizoate meglumine) was required for small intestine preparation before each BT implantation session. The bladder was filled with 150 cc of normal saline in the CT scanning room, and the CT images were reconstructed with a 512 × 512 matrix size and 3 mm slice thickness using a Philips Brilliance Big Bore CT scanner system (Philips Healthcare, Best, the Netherlands).
Definitions for volume-based targets (HRCTVs) followed those established by the Groupe Européen de Curiethérapie-European Society for Radiotherapy and Oncology (GEC-ESTRO) (Haie-Meder et al 2005). The boost dose was prescribed to the vaginal surface, with multiple needles to the upper vagina. The OARs included in the IGBT plans were the bladder, rectum, sigmoid and small intestine. All manual contours were reviewed and approved by senior radiation oncologists specialized in cervical cancer to generate the standard delineation.

Deep learning for segmentation
We adopted a DL model based on a convolutional neural network (CNN) that was developed to delineate the CTVs and OARs for cervical cancer patients treated with EBRT and has been shown to provide high performance in automatic segmentation of planning structures (Wang et al 2022). This model is an end-to-end segmentation architecture that predicts pixel class labels in CT images. The inputs to the DL based model were 2D CT images, and the outputs were the corresponding labels of HRCTV and OARs. The network was built in Python using the Keras DL library with TensorFlow as the backend. Training and testing were performed on a computer with a graphics card (11 GB of memory), an Intel Core i7 processor and 64 GB of random access memory.
Data augmentation methods such as cropping and flipping were used to obtain a superior model. The DL model was trained for 10,000 epochs. The initial learning rate was set to 0.01 and multiplied by 0.9 after each epoch. The weight decay was set to 0.001 and the momentum parameter to 0.9. The variants were trained until the training loss converged. The performance of the model was assessed using ten-fold cross-validation, in which each fold consisted of 40 randomly selected subjects for training, 10 for validation, and the remaining independent 10 for final testing.
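The data partition and learning-rate schedule described above can be illustrated with a minimal sketch. It assumes hypothetical patient IDs 1 to 60, treats each "fold" as an independent random 40/10/10 partition, and uses an arbitrary seed per fold; the function names are illustrative and are not taken from the authors' code.

```python
import random


def split_patients(patient_ids, n_train=40, n_val=10, n_test=10, seed=0):
    """Randomly partition patient IDs into training, validation and test subsets."""
    ids = list(patient_ids)
    if n_train + n_val + n_test != len(ids):
        raise ValueError("subset sizes must add up to the number of patients")
    random.Random(seed).shuffle(ids)  # seeded so the partition is reproducible
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


def learning_rate(epoch, initial=0.01, decay=0.9):
    """Learning rate after `epoch` completed epochs of multiplicative decay."""
    return initial * decay ** epoch


# Ten random partitions, one per fold (assumption: one seed per fold).
folds = [split_patients(range(1, 61), seed=k) for k in range(10)]
```

Splitting at the patient level, rather than at the slice level, keeps all CT slices of one patient in a single subset, which is what makes the 10 test patients independent of the training data.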

Objective indicators
Geometric and dosimetric metrics were used for quantitative analysis. Segmentation was assessed geometrically by the Dice similarity coefficient (DSC), 95% Hausdorff distance (95%HD) and Jaccard coefficient (JC). DSC and JC describe the relative overlap between segmentations A and B. HD quantifies the 3D distance between two segmentation surfaces, and the 95%HD is the largest surface-to-surface separation among the closest 95% of surface points. For complete overlap, HD is 0 and DSC and JC are 1. As the overlap decreases, HD becomes large and DSC and JC approach 0.
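As an illustration of these three metrics, the sketch below uses the standard overlap definitions, DSC = 2|A∩B| / (|A| + |B|) and JC = |A∩B| / |A∪B|, on segmentations represented as sets of integer voxel coordinates. The 95%HD convention varies between tools; here the two directed sets of surface distances are pooled and a nearest-rank 95th percentile is taken, which is an assumption and not necessarily the convention used in this study. The surface extraction and brute-force distance search are kept simple for clarity and would be slow on full-resolution CT masks.

```python
import math


def dice(a, b):
    """Dice similarity coefficient of two voxel sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard coefficient of two voxel sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def surface(voxels):
    """Voxels that have at least one of their six face neighbours outside the set."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return {
        v for v in voxels
        if any((v[0] + dx, v[1] + dy, v[2] + dz) not in voxels for dx, dy, dz in offsets)
    }


def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric surface distance (pooled, nearest-rank percentile)."""
    if not a or not b:
        raise ValueError("both segmentations must be non-empty")
    pa = [tuple(c * s for c, s in zip(v, spacing)) for v in surface(a)]
    pb = [tuple(c * s for c, s in zip(v, spacing)) for v in surface(b)]

    def directed(src, dst):
        # Distance from every source surface point to its nearest target surface point.
        return [min(math.dist(p, q) for q in dst) for p in src]

    distances = sorted(directed(pa, pb) + directed(pb, pa))
    index = max(0, math.ceil(0.95 * len(distances)) - 1)
    return distances[index]


# Toy example: a 4 x 4 x 4 cube and the same cube shifted by one voxel along x.
cube = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
shifted = {(x + 1, y, z) for (x, y, z) in cube}
print(dice(cube, shifted), jaccard(cube, shifted), hd95(cube, shifted))
```

For CT data with 3 mm slices, passing the physical voxel size through `spacing` gives distances in millimetres rather than voxel units.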
Dose-volume indices (DVI) were used to compare the dosimetric metrics. The original BT plans were designed and optimized on the standard manual contours using the Oncentra Treatment Planning System (Nucletron, Elekta AB, Stockholm, Sweden, V.4.3), and the auto-segmented structures were transferred to the original BT plans for dosimetric evaluation. For HRCTV, we mainly focused on Dmean and D90%, defined as the average dose and the minimal dose to 90% of the HRCTV, respectively. For OARs, we mainly focused on D0.1cc, D2cc and D5cc, where Dxcc denotes the minimum dose received by the maximally irradiated x cm³ of a volume. A prescription dose of 60 Gy EQD2 (45-50.4 Gy as EBRT) was considered for the target, where EQD2 is the equivalent dose in 2 Gy fractions calculated with an α-to-β ratio of 10. The maximum doses (D2cc) were 80 Gy EQD2, 65 Gy EQD2, 70 Gy EQD2 and 70 Gy EQD2 for the bladder, rectum, sigmoid and small intestine, respectively, assuming an α-to-β ratio of 3. Table 1 presents the constraints and dosimetric metrics.
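The dose quantities above can be made concrete with a short sketch. EQD2 is computed from the standard linear-quadratic relation EQD2 = D (d + α/β) / (2 + α/β), with D the total dose and d the dose per fraction. The Dxcc and D90% helpers work on a flat list of per-voxel doses with a uniform voxel volume and count whole voxels only; this is a simplification for illustration and not how a treatment planning system interpolates a dose-volume histogram. All function names are hypothetical.

```python
import math


def eqd2(total_dose, dose_per_fraction, alpha_beta):
    """Equivalent dose in 2 Gy fractions from the linear-quadratic model."""
    return total_dose * (dose_per_fraction + alpha_beta) / (2.0 + alpha_beta)


def d_xcc(voxel_doses, voxel_volume_cc, x_cc):
    """Minimum dose within the most irradiated x cm^3 (whole voxels only)."""
    # Small tolerance so that exact multiples are not rounded up by float error.
    n = max(1, math.ceil(x_cc / voxel_volume_cc - 1e-9))
    if n > len(voxel_doses):
        raise ValueError("structure is smaller than the requested volume")
    return sorted(voxel_doses, reverse=True)[n - 1]


def d_percent(voxel_doses, percent):
    """Minimum dose received by the hottest `percent` of the structure volume."""
    n = max(1, math.ceil(percent / 100.0 * len(voxel_doses) - 1e-9))
    return sorted(voxel_doses, reverse=True)[n - 1]


# Example with the fractionation quoted in this study: 45 Gy of EBRT in
# 1.8 Gy fractions plus two BT fractions of 6 Gy, with an alpha/beta of 10.
total_target_eqd2 = eqd2(45.0, 1.8, 10.0) + eqd2(12.0, 6.0, 10.0)
print(round(total_target_eqd2, 2))
```

With these inputs the EBRT and BT contributions add up to roughly the 60 Gy EQD2 prescription level mentioned in the text.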

Subjective assessment
For qualitative analysis, two experienced radiation oncologists blindly evaluated the manual and automatic delineations of the 10 test patients and scored each contour as needing no edits, minor edits, or major edits. If an auto-segmented contour was scored as needing no edits, it was considered suitable for clinical use.

Statistical analysis
The Bland-Altman test was used to assess agreement between the manual and DL based methods; P > 0.05 indicates agreement between the two segmentation methods. Wilcoxon's paired nonparametric signed-rank test was performed to compare the dosimetric differences; P < 0.05 indicates a statistically significant difference. The correlations between geometric metrics and dosimetric differences were evaluated with Spearman's correlation analysis. All statistical analyses were performed using IBM SPSS Statistics software (version 19.0, IBM Inc., Armonk, NY, USA) and Python software (version 3.6.5, Anaconda Inc.).

Figure 1 shows the quantitative indicators of segmentation results on the testing dataset with our proposed DL based model. Automatic delineation produced the results for HRCTV with average DSC value of 0.87 ± 0.06,

Analysis of dosimetric metrics
The comparisons of dosimetric metrics between the two methods using Wilcoxon's paired nonparametric signed-rank test are presented in table 2. Significant dosimetric differences were found in bladder_D0.1cc, rectum_D0.1cc and rectum_D2cc (P < 0.05), while the remaining dosimetric indicators showed no significant differences. For all of the planning structures, both the manual and automatic delineations met the clinical dose constraints (EQD2 with an α-to-β ratio of 10 or 3). Examples of segmentation and dose distributions from the manual and DL based methods are illustrated in figure 2.

Correlation analysis between geometric and dosimetric metrics
The results of Spearman's correlation analysis between geometric metrics and dosimetric metrics (Δdose) are presented in table 3.
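For readers unfamiliar with the statistic, Spearman's correlation is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The sketch below is a minimal standard-library version; the study itself used SPSS and Python, and the DSC and Δdose values here are toy numbers chosen for illustration, not data from this work.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values only: a higher DSC paired with a smaller dose difference.
toy_dsc = [0.80, 0.85, 0.88, 0.90, 0.93]
toy_delta_dose = [0.9, 0.7, 0.8, 0.3, 0.1]
print(spearman(toy_dsc, toy_delta_dose))
```

A negative coefficient, as in this toy case, means the dose difference tends to shrink as the overlap metric grows.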

Test agreement between manual and DL based approaches
The Bland-Altman test was mainly calculated for HRCTV_D90%, HRCTV_Dmean, bladder_D2cc, rectum_D2cc, sigmoid_D2cc and small intestine_D2cc. Figure 4 shows the 95% limits of agreement for all of the BT planning structures between the two methods. The agreement of the DL based method can be evaluated according to the number of points outside the 95% limits of agreement and the maximum difference within those limits. From the Bland-Altman plot, HRCTV_D90%, HRCTV_Dmean, bladder_D2cc, sigmoid_D2cc and small intestine_D2cc showed no significant inconsistency (P > 0.05) between the two segmentation methods.
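The quantities read off a Bland-Altman plot can be sketched as follows, assuming the usual definition of the 95% limits of agreement as the mean difference ± 1.96 standard deviations of the paired differences. The function name and the dose values are hypothetical, and the P value reported in this study is not reproduced here.

```python
import statistics


def bland_altman(manual, auto):
    """Return bias, lower and upper 95% limits of agreement, and outlier indices."""
    diffs = [a - m for m, a in zip(manual, auto)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    outside = [i for i, d in enumerate(diffs) if d < lower or d > upper]
    return bias, lower, upper, outside


# Toy paired doses (Gy) for one dose-volume index, manual versus automatic.
toy_manual = [10.0, 20.0, 30.0]
toy_auto = [11.0, 20.0, 29.0]
bias, lower, upper, outside = bland_altman(toy_manual, toy_auto)
print(bias, lower, upper, outside)
```

The number of indices in `outside` and the width of the interval between `lower` and `upper` correspond to the two criteria named in the paragraph above.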

Subjective assessment
The two radiation oncologists' scores for the automatically generated contours on the 10 test patients are shown in table 4. For the HRCTV, 80% of the segmentations were clinically acceptable (no edits) and 20% were scored as needing minor edits. For bladder, rectum, sigmoid and small intestine, 90%, 70%, 50% and 80% were clinically acceptable, respectively, while 10% of rectum and 30% of sigmoid contours were scored as needing major edits.

Discussion
IGBT using 3D images for RT planning allows for more precise and individualized treatment, even compared with stereotactic body radiotherapy (SBRT) (Lee et al 2021, Shenker et al 2022). However, modern IGBT procedures, with their increasing number of real-time steps and complexity, require more technical and manpower resources before they can be applied with confidence. Meanwhile, BT provides lower reimbursement and requires a longer treatment time per fraction than EBRT. Together, these factors help explain the declining use of BT over the last decades (Petereit et al 2015). The development of AI has the potential to reverse this trend because of its ability to solve highly challenging tasks, and it might thereby contribute to the progress of IGBT. Segmentation of targets and OARs is an essential task in the treatment planning process of IGBT. GEC-ESTRO recommends magnetic resonance imaging (MRI) as the gold standard in IGBT because of its excellent soft tissue contrast for HRCTV and OAR delineation (Dimopoulos et al 2012). Compared with MRI, auto-segmentation in CT images is more challenging owing to the lack of visible anatomical edges, especially in the definition of the target volume (Cardenas et al 2018). At present, CT images are still the most widely used for IGBT in China, and few studies have addressed automatic delineation of BT planning structures using CT images. The purpose of this study was to compare the performance of DL based auto-segmentation against standard contours in postoperative cervical cancer patients with CT images.
We observed that the DL based model generated structures with average DSC of 0.87, 0.94, 0.86, 0.79 and 0.92 for HRCTV, bladder, rectum, sigmoid and small intestine, respectively. The performance of auto-segmentation for targets and OARs in CT based IGBT from other DL models is presented in table 5. The geometric similarity of all BT planning structures was equivalent to or better than that in other published literature (Jiang et al 2021, Mohammadi et al 2021, Zhang et al 2020). Among all the DL based models, the best results were obtained for bladder delineation (DSC of 0.86-0.96, HD of 4.05-19.98). This is mainly because of the bladder filling protocol, which gives a relatively consistent volume, as well as the high contrast in CT images. The geometric metrics of the sigmoid in our study were worse than those of the other BT planning structures, and similar results have been reported for other DL models. The spatial variation of the sigmoid between BT fractions and its low contrast in CT images may explain this finding (Jamema et al 2013). Another interesting finding in this work was the superior geometric results for the small intestine (DSC of 0.92, 95%HD of 8.83 and JC of 0.85). The location of the small intestine varies during the RT process, and it is easily confused with other soft tissue in CT images. Therefore, the representative features of the small intestine are difficult to extract and classify using a CNN model. To make the small intestine recognizable in CT images, we used an oral contrast protocol for small intestine preparation to obtain a distinct training dataset for the DL based model. The accuracy for the small intestine was satisfactory and supplements our previous study, which did not assess this organ (Wang et al 2022). Usually, better auto-segmentation results are achieved by increasing the size of the training dataset.
Nevertheless, our DL model used fewer enrolled patients for training and generated higher overlap with standard manual contours than other models in target volume segmentation. Yoganathan et al (2022) demonstrated the importance of dosimetric evaluation over geometric evaluation for an automatic segmentation problem. They found poor geometric metrics for the sigmoid and small intestine, yet the dosimetric metrics closely matched those of manual segmentation. Similar results were found in our work, particularly for auto-segmentation of the sigmoid. In addition, significant dosimetric differences were found in bladder_D0.1cc, rectum_D0.1cc and rectum_D2cc (P < 0.05), while the remaining dosimetric parameters showed no significant differences. These data indicate that auto-segmented OARs inside the high-dose region (figure 2) still need to be reviewed by senior radiation oncologists rather than judged on geometric values alone. Certainly, we could continue to improve the performance of the DL model to obtain better geometric values, which may reduce the dosimetric differences in high-dose regions for auto-segmented OARs. For instance, when we mainly consider the clinical criteria of D90% and D2cc, the automatic method would be good enough for clinical applicability if the DSC of HRCTV and OARs could be raised to 0.87 and 0.94, respectively. The heatmap of the correlation analysis showed no clear strong relationship between geometric metrics and dosimetric differences for OARs. The only strong correlation was found for the D90% of HRCTV with its DSC and JC (R = −0.842 and −0.818, respectively). This quantitative analysis indicates that more accurate delineation of the target volume would lead to more precise delivery of radiation dose in the IGBT workflow. In addition to evaluating dosimetric differences, we also used the test of agreement to verify the performance of DL based auto-segmentation.
The automatic method was shown by the Bland-Altman test to achieve dose consistency for HRCTV_D90%, HRCTV_Dmean, bladder_D2cc, sigmoid_D2cc and small intestine_D2cc. However, the test can still indicate dose agreement even when the segmentation is inaccurate outside the high-dose region, as for the sigmoid. Therefore, multiple objective indicators should be used to assess the accuracy of DL based auto-segmentation in order to obtain meaningful results. As for subjective assessment, the clinical acceptance rates were 80% for the HRCTVs and 72.5% for the OARs. The quality of 10% of rectum and 30% of sigmoid contours was unsatisfactory, with scores of needing major edits. Overall, the quantitative and qualitative evaluations of the automatic delineation were relatively consistent.
In this work, we investigated the performance of DL based auto-segmentation in postoperative cervical cancer patients treated with IGBT. An automatic approach would reduce the waiting time for patients in the adaptive IGBT process, which may decrease uncertainties such as applicator displacement. As a novel technology, the DL based method has the potential to link EBRT and BT across the full workflow of cervical cancer, especially in dosimetric prediction of cold spots in the target and hot spots in the OARs in combined therapy. This work focused on the accuracy of auto-segmentation, which would be an essential part of the IGBT workflow for cervical cancer.
There are still several limitations in this study. First, our DL based model was trained and verified using a clinical dataset of patients treated with multiple metal needles, so it may not be suitable for other situations, such as patients treated with a cylinder applicator. The choice of applicator mostly affects the HRCTV definition generated by the DL based method when different catheters are implanted. Second, we used bladder and small intestine preparation protocols, whereas different centers employ their own methods, such as contrast agent inside the rectum. Therefore, increasing the amount of training data to include various applicators and protocols in the IGBT workflow could make the DL model more robust and further improve performance.

Conclusion
Segmentation of planning structures is an important component of RT treatment planning, and an automatic approach would relieve physicians of labor-intensive tasks in the adaptive IGBT workflow. The proposed DL

Figure 4. Bland-Altman plot for BT planning structures. The brown horizontal dotted lines represent the upper and lower bounds of the 95% limits of agreement; the blue horizontal solid lines represent the average of the differences; the green horizontal dotted lines represent the location with difference equal to 0.

Table 4. Qualitative scores of the auto-segmented contours on 10 tested CT scans.