
A deep learning approach for automatic delineation of clinical target volume in stereotactic partial breast irradiation (S-PBI)


Published 11 May 2023 © 2023 Institute of Physics and Engineering in Medicine
Citation: Mahdieh Kazemimoghadam et al 2023 Phys. Med. Biol. 68 105011. DOI: 10.1088/1361-6560/accf5e


Abstract

Accurate and efficient delineation of the clinical target volume (CTV) is of utmost significance in post-operative breast cancer radiotherapy. However, CTV delineation is challenging as the exact extent of microscopic disease encompassed by CTV is not visualizable in radiological images and remains uncertain. We proposed to mimic physicians' contouring practice for CTV segmentation in stereotactic partial breast irradiation (S-PBI), where CTV is derived from the tumor bed volume (TBV) via a margin expansion followed by correcting the extensions for anatomical barriers of tumor invasion (e.g. skin, chest wall). We proposed a deep-learning model, where CT images and the corresponding TBV masks formed a multi-channel input for a 3D U-Net based architecture. The design guided the model to encode location-related image features and directed the network to focus on the TBV to initiate CTV segmentation. Gradient weighted class activation map (Grad-CAM) visualizations of the model predictions revealed that the extension rules and geometric/anatomical boundaries were learnt during model training, assisting the network to limit the expansion to a certain distance from the chest wall and the skin. We retrospectively collected 175 prone CT images from 35 post-operative breast cancer patients who received a 5-fraction partial breast irradiation regimen on GammaPod. The 35 patients were randomly split into training (25), validation (5), and test (5) sets. Our model achieved a mean (standard deviation) of 0.94 (±0.02), 2.46 (±0.5) mm, and 0.53 (±0.14) mm for the Dice similarity coefficient, 95th percentile Hausdorff distance, and average symmetric surface distance respectively on the test set. The results are promising for improving the efficiency and accuracy of CTV delineation during the online treatment planning procedure.


1. Introduction

Breast cancer is the most prevalent cancer among women. Breast cancer incidence rates have been steadily increasing every year, and with an estimated 2.3 million new cases in 2020, it remains the leading cause of cancer mortality worldwide (Sung et al 2021). Partial breast irradiation after breast conserving surgery is an integral part of breast-conserving therapy for treatment of early-stage breast cancer (Njeh et al 2010). Several clinical trials have investigated stereotactic partial breast irradiation (S-PBI) (Takanen et al 2022). In S-PBI, accurate delineation of the post-surgical clinical target volume (CTV) is a critical step for personalized and efficient treatment planning and delivery. However, manual delineation of CTV is highly labor intensive, time consuming, and relies largely on the experience of clinicians. Several studies have shown that intra- and inter-observer variations are very common among radiation oncologists in defining CTV after breast lumpectomy (Wong et al 2006, Landis et al 2007, Petersen et al 2007). Such inconsistencies can negatively influence the efficacy of radiotherapy treatment and lead to recurrence and associated complications (Glatstein 2002, Boersma et al 2012a). Accurate and efficient breast CTV delineation is of special significance in GammaPod (i.e. a breast-specific stereotactic body radiation therapy system) based S-PBI treatment, where CT simulation, planning, and treatment are completed in a single session with a vacuum-assisted breast immobilization cup (Yu et al 2013). An efficient and accurate breast CTV segmentation approach can accelerate the delineation process, improve the consistency of the results, and support an efficient clinical workflow as well as successful treatment delivery.

Deep learning (DL) approaches have been widely used for image segmentation in radiotherapy (Meyer et al 2018, Kazemimoghadam et al 2021). DL has been successfully applied to segmenting the CTVs of several cancers including cervical cancer (Shi et al 2021), esophageal cancer (Jin et al 2021), prostate cancer (Balagopal et al 2021), and rectal cancer (Men et al 2017). Shi et al (2021) built their proposed method upon a 3D U-Net architecture and further extended it with a recursive refinement strategy to improve delineation of CTV in cervical cancer CT images. They adjusted the model's attention through an area-aware reweighting strategy, where the variance between the network prediction and the ground truth (GT) was used as a metric to assign varying weights to each slice. For esophageal cancer, Jin et al (2021) introduced a spatial context encoded CTV segmentation framework, where 3D spatial distance maps from the gross tumor volume (GTV), lymph nodes, and organs at risk, along with radiotherapy planning CT images, were incorporated into a deep convolutional network to produce accurate margin-based CTV boundaries. A deep dilated convolutional neural network (DDCNN) was introduced by Men et al (2017) for auto-segmentation of breast CTV in planning CT images. In this work, multi-level and multi-scale features were learnt by the network utilizing dilated convolution filters. They further demonstrated that such contextual features could successfully capture detailed structural information about intensity, texture, and contour, and play a critical role in accurate CTV segmentation.

Auto-segmentation of breast CTV is a challenging task for several underlying reasons. The exact extent of microscopic disease encompassed by CTV is not visualizable in CT images and remains uncertain (Luo et al 2021). Low contrast and high noise levels usually lead to blurred boundaries between the CTV and normal tissues. As a result, CTV delineation is not a straightforward segmentation task, i.e. unlike GTV, it is not simply about segmenting an abnormal-appearing mass in the image. Instead, CTV is primarily identified by clinicians to ensure that tumor spread patterns are incorporated in the delineation process. Although guidelines exist, CTV contouring is highly dependent upon the clinician's knowledge and experience (Struikmans et al 2005, Landis et al 2007, Boersma et al 2012b). Despite written delineation guidelines, there are usually substantial differences in the identification of breast CTV among radiation oncologists (Struikmans et al 2005, van Mourik et al 2010). Hence, consistent segmentation of breast CTV has been a 'bottleneck' in radiotherapy.

Despite the success of DL in medical imaging, the models have often been considered 'black boxes', and the interpretability of their results has been generally neglected by researchers in the field. Transparency and a clearer understanding of how DL models perform their tasks are essential to successfully integrate them into the clinical environment and to ensure that medical professionals can trust the predictions of such algorithms. Transparency can also assist with identifying potential failures so that efforts to address them can be focused in the right direction (Holzinger et al 2017). Gradient weighted class activation mapping (Grad-CAM) (Selvaraju et al 2017) has been a popular technique for visualizing where a DL model is 'looking' in an input image. Grad-CAM is an efficient method of producing localization maps that highlight the important regions in the image for predicting the target concept. This allows visualizing network attention over the input image. In this work, we utilized Grad-CAM to exploit the spatial information flowing through convolutional layers and to better understand which region(s) of an input image were important for breast CTV segmentation.

In partial breast irradiation, the area with the highest risk of tumor recurrence is typically adjacent to the lumpectomy cavity (Menes et al 2005). Therefore, in clinical practice, the CTV is derived from the tumor bed volume (TBV) via a margin extension. These extensions often must be corrected for anatomical barriers of tumor invasion such as the skin and the chest wall. The objective of this work was to propose a deep learning network that mimics physicians' contouring practice for CTV segmentation. For this purpose, we used the patient's planning CT image as an input to a 3D U-Net model along with the corresponding TBV mask as an additional input. This allows the network to emulate the oncologist's manual TBV-to-CTV expansion while considering the anatomical barriers to disease infiltration. Further, in order to improve the transparency of the strategy taken by the model, we attempted to provide visual explanations for the internal layers of the model by utilizing Grad-CAM to visualize the regions of input images that are 'important' for CTV segmentation.

2. Methods and materials

2.1. Mimic clinical contouring practice for CTV segmentation by incorporating TBV

Figure 1 illustrates the proposed DL method for breast CTV segmentation. We employed U-Net (Ronneberger et al 2015) as the backbone of our proposed method. The contracting path of U-Net followed the typical architecture of a convolutional network. It consisted of the repeated application of two 3×3×3 convolutions (unpadded convolutions), each followed by a batch normalization and a LeakyReLU activation function, and a 3D max pooling operation with 2× down sampling. The first block contained 32 channels, and the number of channels was doubled after every max-pooling layer. The expansive path of U-Net included four convolutional blocks to up-sample feature maps, with the number of channels halved at every step. Every step in the expansive path consisted of up-sampling of the feature maps by an up-convolution that halved the number of feature channels, a concatenation with the corresponding feature maps from the contracting path, and two 3×3×3 convolutions, each followed by a batch normalization and a LeakyReLU activation function. A 3D spatial dropout with a rate of 0.2 was applied after the second convolution layer of each block.
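The channel/resolution bookkeeping implied by this description can be sketched in a few lines of Python. The 32 base channels come from the text and the 96³ input from section 2.4; a depth of four pooling steps is our assumption, chosen to mirror the four expansive blocks:

```python
def unet3d_shapes(in_size=96, base_channels=32, depth=4):
    """Track (channels, spatial extent) through the contracting path of the
    3D U-Net described above: channels double after every max-pooling step
    while the spatial extent halves (2x down sampling)."""
    shapes = []
    ch, sz = base_channels, in_size
    for _ in range(depth):
        shapes.append((ch, sz))   # after the double-convolution block
        ch, sz = ch * 2, sz // 2  # max pooling: halve size, double channels
    shapes.append((ch, sz))       # bottleneck
    return shapes
```

Under these assumptions, a 96³ input yields 32 channels at 96³ down to 512 channels at 6³ at the bottleneck; the expansive path mirrors these shapes in reverse.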

Figure 1.

Figure 1. The proposed deep learning network architecture for segmentation of breast CTV in post-operative CT images. CT images and tumor bed volume (TBV) masks formed a multi-channel input for the model. Orange and green arrows denote the training and test steps respectively.


To mimic the clinical contouring practice for breast CTV segmentation, CT images and the corresponding TBV masks formed a multi-channel input for the CTV segmentation model. During training, each CT image and its ground truth TBV mask were jointly fed into the U-Net model. Supervised by the manually delineated CTV expansion masks (CTV minus TBV), U-Net learned to look for the location cues provided by the TBV and extracted discriminative features. Network weights were continuously updated until convergence. The supervised learning forced the model to encode location-related image features, guiding the network to focus on the TBV region for TBV-to-CTV expansion. During training, the expansion rules and the anatomical barriers of the chest wall and the skin boundary were also learnt to constrain TBV-to-CTV expansion. To further clarify this, in section 3.3, we simulated three test cases with the TBV migrated from adjacent to the chest wall to the skin boundary and demonstrated how the predicted CTVs follow the rules. The proposed model parameters were optimized by minimizing the Dice loss function (equation (1)), which compared the prediction with manually contoured masks (Milletari 2016). At the testing stage, CT images and their corresponding TBV masks went through the trained network. Following a softmax function, thresholding was applied such that the probability of the predicted voxels was set to either zero (background) or one (CTV expansion). Element-wise addition of the predicted CTV expansion masks and TBV masks was applied to obtain the CTV masks as the final output of the network (figure 1).
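The test-time post-processing described above — softmax probabilities binarized by thresholding, then element-wise addition with the TBV mask — can be sketched as follows (a minimal NumPy sketch; the 0.5 threshold and array layout are our assumptions, not stated in the text):

```python
import numpy as np

def ctv_from_prediction(expansion_prob, tbv_mask, threshold=0.5):
    """Binarize the predicted CTV-expansion probability map, then combine it
    with the TBV mask by element-wise addition (clipped to {0, 1}) to form
    the final CTV mask."""
    expansion = (np.asarray(expansion_prob) >= threshold).astype(np.uint8)
    return np.clip(expansion + np.asarray(tbv_mask, dtype=np.uint8), 0, 1)
```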

$\mathrm{Dice\,loss}=1-\dfrac{2{\sum }_{k=1}^{K}{Y}_{k}^{P}{Y}_{k}^{G}}{{\sum }_{k=1}^{K}{({Y}_{k}^{P})}^{2}+{\sum }_{k=1}^{K}{({Y}_{k}^{G})}^{2}}$ (1)

${Y}^{P}$ and ${Y}^{G}$ represent the prediction and manual segmentation with K voxels respectively.
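A minimal NumPy sketch of the Dice loss over the K voxels of ${Y}^{P}$ and ${Y}^{G}$, using the squared-denominator form from Milletari (2016); the smoothing term `eps` is an implementation detail we add to avoid division by zero:

```python
import numpy as np

def dice_loss(y_pred, y_true, eps=1e-7):
    """Soft Dice loss: 1 - 2*sum(Y^P * Y^G) / (sum((Y^P)^2) + sum((Y^G)^2)),
    with the sums running over all K voxels."""
    p = np.asarray(y_pred, dtype=float).ravel()
    g = np.asarray(y_true, dtype=float).ravel()
    num = 2.0 * np.sum(p * g) + eps
    den = np.sum(p * p) + np.sum(g * g) + eps
    return 1.0 - num / den
```

A perfect prediction gives a loss of 0, and fully disjoint masks give a loss close to 1.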

2.2. Gradient weighted class activation maps

Highlighting the regions in an input image that contribute most to the model's outcome can provide an understanding of the overall strategy the network employs to segment the target. Grad-CAM (Selvaraju et al 2017) is an effective method that extracts gradients from the network's convolutional layers and uses this information to highlight the regions that are most responsible for predicting the segmentation target. We utilized Grad-CAM to identify the regions of the input image that the model 'looks at' in its internal layers. This can provide insight into how information flows in the network for target localization/segmentation and can shed light on how the spatial attention of a network develops for predicting the segmentation target. Grad-CAM was applied to produce post hoc local explanations by providing localization maps for every internal layer. For this purpose, we converted segmentation to a two-label (object and background) classification problem. To obtain Grad-CAMs, we first computed the gradient of the score for class c, ${y}^{c},$ with respect to the feature maps ${A}^{k,l}$ of a convolutional layer, i.e. $\frac{{\partial y}^{c}}{\partial {A}^{k,l}}$ (equation (2)). ${y}^{c}$ represents the spatially pooled final segmentation map, and Z is the number of pixels in the activation map. To obtain the neuron importance weights (${\alpha }_{k,l}^{c}$), these backward-flowing gradients were global-average-pooled as described in equation (2). ${\alpha }_{k,l}^{c}$ captures the 'importance' of feature map k in layer l.

${\alpha }_{k,l}^{c}=\dfrac{1}{Z}{\sum }_{i}{\sum }_{j}\dfrac{{\partial y}^{c}}{\partial {A}_{ij}^{k,l}}$ (2)

${L}_{\mathrm{Grad}-\mathrm{CAM}}^{c}=\mathrm{ReLU}\left({\sum }_{k}{\alpha }_{k,l}^{c}{A}^{k,l}\right)$ (3)

The Grad-CAM heat-map is in fact a weighted combination of feature maps followed by a ReLU. ${L}_{\mathrm{Grad}-\mathrm{CAM}}^{c}$ is an output map which is the result of performing Grad-CAM for class c (equation (3)).
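Equations (2) and (3) together amount to global-average-pooling the gradients to obtain per-channel weights, then taking a ReLU of the weighted sum of feature maps. A NumPy sketch for one layer (2D maps for brevity; the model itself uses 3D feature maps):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM for one layer. feature_maps and gradients have shape
    (K, H, W): K channel maps A^k and the gradients dy^c/dA^k."""
    alphas = gradients.mean(axis=(1, 2))              # eq. (2): 1/Z * sum over pixels
    cam = np.tensordot(alphas, feature_maps, axes=1)  # weighted combination of A^k
    return np.maximum(cam, 0.0)                       # eq. (3): ReLU
```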

2.3. Experimental data

We evaluated our model on our in-house retrospective dataset of 35 post-operative breast cancer patients. All patients received a 5-fraction PBI regimen on GammaPod. Prior to each fraction, head-first-prone CT scans were acquired on either an Airo® Mobile Intraoperative CT (Brainlab AG, Munich, Germany) or a Philips Brilliance Big Bore CT (Philips Healthcare, Amsterdam, Netherlands), both at 120 kVp and with scanning lengths covering the whole chest region. CT images had a slice thickness of 1 mm, and pixel resolution varied between 1.17 and 1.37 mm. TBV contours were manually delineated on the planning CT using the Eclipse® treatment planning system (Varian Medical Systems, Palo Alto, CA) by the attending physician on the treatment day. The clinical guideline for CTV contours was to expand uniformly from the TBV with a 10 mm margin, limited to 5 mm from the skin and not extending beyond the chest wall (the chest wall and pectoralis muscles are not to be included).

In order to avoid the potential bias due to potentially high correlation across fractional images for each patient, we purposely split the data into training, validation, and test sets by patients, not images. The test patients were completely untouched during the experiments, and none of the patients in the test set overlapped with the training set. Out of the total of 35 patients, in random selections, 25 patients were assigned to the training set, 5 patients to the validation set, and the remaining 5 patients to the test set. Considering 5 fractions per patient, the total numbers of images for the training, validation, and test sets were 125, 25, and 25 respectively. Validation data was used for tuning the network's hyperparameters and selecting the best model. The tuned hyperparameters comprised the number of training epochs, the learning rate, the activation function, and the dropout rate. The network was trained with the Adam optimizer. During training, network weights were continuously updated until convergence, i.e. until the loss given by the model reached its minimum and the validation loss no longer improved. We set the training to stop when the performance on the validation dataset did not improve for 50 epochs. The learning rate controls how much the model changes in response to the estimated error each time the model weights are updated. While large learning rates lead to faster learning and require fewer training epochs, the model might skip the optimal solution. Small learning rates, on the other hand, require more training epochs to converge to the best value, leading to longer training time as smaller changes are applied to the network weights after each update. For our experiments, a learning rate of ${10}^{-4}$ appeared to be optimal. We selected LeakyReLU over the ReLU and sigmoid activation functions as it showed superior performance in our experiments, appearing to reach convergence faster than either. Setting the dropout rate to 0.2 assisted the network in avoiding overfitting. Larger dropout rates degraded model performance and reduced the convergence rate of the model, while smaller dropout rates did not make a noticeable difference in avoiding overfitting. Finally, the model that performed best on the validation set was selected to evaluate the test dataset. The test set was untouched during training and was only used during the final evaluation of segmentation performance.
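The early-stopping rule described above (stop when the validation loss has not improved for 50 epochs) can be sketched as a hypothetical helper, not the authors' code:

```python
def early_stopping_epoch(val_losses, patience=50):
    """Return the epoch index at which training stops because the
    validation loss has not improved for `patience` epochs, or None
    if stopping never triggers."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # new best validation loss
        elif epoch - best_epoch >= patience:
            return epoch                     # no improvement for `patience` epochs
    return None
```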

The patient cohort used in this study is from a phase II multi-center clinical trial (University of Texas Southwestern Medical Center 2018). Briefly, the study recruited patients with early-stage breast cancer for GammaPod stereotactic radiotherapy after lumpectomy. The dose to the CTV (1 cm expansion on TBV) was 40 Gy/5 fractions, and the PTV (3 mm expansion of CTV) received a minimum dose of 30 Gy/5 fractions. Patients were treated every other day. Since GammaPod treats patients in a prone position and its beams are limited in treating a TBV close to the chest wall, we applied a criterion to exclude patients with a TBV within 2 cm above the level of the breast cup. No special planning constraints or unique indications for GammaPod were applied. Details about patient inclusion and exclusion criteria can be found at University of Texas Southwestern Medical Center (2018).

For each patient, at least two attending physicians covered the entire treatment course of five fractions. However, for each fraction, only one physician was involved in contouring that individual CTV volume. The physicians were all from the department of radiation oncology at UT Southwestern Medical Center who had similar levels of expertise.

CTV was derived from TBV via a margin extension, followed by correcting for the anatomical barriers of the skin and the chest wall. CTV volumes (cc) were calculated for all five fractions for each patient. We used the coefficient of variation (CV), defined as the relative standard deviation (the ratio of the standard deviation to the mean), as the measure of inter-observer variability; it was calculated over the CTV volumes (cc) of the five fractions for each patient.
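The CV measure is straightforward to compute; a sketch (the population standard deviation is assumed, since the text does not specify the variant):

```python
import numpy as np

def coefficient_of_variation(volumes_cc):
    """Relative standard deviation (std / mean) over the per-fraction
    CTV volumes of one patient."""
    v = np.asarray(volumes_cc, dtype=float)
    return float(v.std() / v.mean())
```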

2.4. Implementation details

The experiments were implemented using the PyTorch (version 1.8) framework in a Python (version 3.8) environment on 64-bit Windows 10, with an Intel Xeon CPU, 64 GB of RAM, and an NVIDIA GeForce 2080 Ti GPU with 12 GB of memory. To maintain consistency across the entire dataset, all CT images and the corresponding TBV and CTV masks were resampled to a voxel size of 2 mm × 2 mm × 2 mm. Images and masks were then center-cropped to 96 × 96 × 96 (the cropped region centered on the breast). The voxel intensity of CT images was normalized to [0, 1] according to the HU window [−200, 200].
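The windowing/normalization and cropping steps can be sketched as follows (resampling to 2 mm voxels is omitted; the HU window and crop size come from the text, while everything else is an illustrative assumption):

```python
import numpy as np

def normalize_hu(ct, hu_min=-200.0, hu_max=200.0):
    """Clip CT intensities to the HU window [-200, 200] and rescale
    linearly to [0, 1]."""
    ct = np.clip(np.asarray(ct, dtype=float), hu_min, hu_max)
    return (ct - hu_min) / (hu_max - hu_min)

def center_crop(volume, size=(96, 96, 96)):
    """Center-crop a 3D volume to `size` (assumes the volume is at least
    as large as `size` in every dimension)."""
    starts = [(d - s) // 2 for d, s in zip(volume.shape, size)]
    return volume[tuple(slice(st, st + s) for st, s in zip(starts, size))]
```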

2.5. Network evaluation

We evaluated the performance of our proposed model on our clinical in-house breast CT dataset. The Dice similarity coefficient (DSC), 95th percentile Hausdorff distance (HD95), and average symmetric surface distance (ASD) (Jafari-Khouzani et al 2011) were used to quantitatively assess the segmentation performance of our model. DSC was computed to measure the degree of overlap between the model output and the ground truth CTV masks (equation (4)). HD95 (equation (5)) and ASD (equation (6)) were computed to compare the distances between the model's segmentation output and the ground truth.

$\mathrm{DSC}(A,B)=\dfrac{2|A\cap B|}{|A|+|B|}$ (4)

$\mathrm{HD}95(A,B)=\max \left\{{{\rm{d}}{\rm{H}}}_{95}(\partial A,\partial B),\,{{\rm{d}}{\rm{H}}}_{95}(\partial B,\partial A)\right\},\quad {{\rm{d}}{\rm{H}}}_{95}(\partial A,\partial B)={P}_{95 \% ,\,a\in \partial A}\,\mathop{\min }\limits_{b\in \partial B}\,d(a,b)$ (5)

$\mathrm{ASD}(A,B)=\dfrac{{\sum }_{a\in \partial A}\mathop{\min }\limits_{b\in \partial B}d(a,b)+{\sum }_{b\in \partial B}\mathop{\min }\limits_{a\in \partial A}d(a,b)}{|\partial A|+|\partial B|}$ (6)

${\rm{d}}{\rm{H}}$ in equation (5) is the Hausdorff distance, and d(a, b) in equations (5) and (6) is the Euclidean distance. A and B in all three equations represent the ground truth and predicted segmentation volumes respectively. a and b are points on the surfaces of A and B, $\partial A$ and $\partial B,$ respectively.
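For small masks and surface point sets, the three metrics can be sketched with brute-force NumPy (an illustrative sketch only; surface extraction and efficient nearest-neighbor search are omitted):

```python
import numpy as np

def dsc(a, b):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def _nearest_distances(pts_a, pts_b):
    # distance from each surface point of A to the closest surface point of B
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return d.min(axis=1)

def hd95(pts_a, pts_b):
    """95th percentile Hausdorff distance between two surface point sets."""
    return max(np.percentile(_nearest_distances(pts_a, pts_b), 95),
               np.percentile(_nearest_distances(pts_b, pts_a), 95))

def asd(pts_a, pts_b):
    """Average symmetric surface distance between two surface point sets."""
    da = _nearest_distances(pts_a, pts_b)
    db = _nearest_distances(pts_b, pts_a)
    return (da.sum() + db.sum()) / (len(da) + len(db))
```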

3. Results

3.1. Quantitative outcomes

Table 1 summarizes the DSC, HD95, and ASD for each test case across fractions as well as the average across cases. The outcomes showed strong agreement between the predicted contours and the physicians' manual contours for all the test cases. The Dice scores (DSC) ranged between 0.91 ± 0.01 and 0.95 ± 0.01, with an average DSC of 0.94 ± 0.02 across cases. The highest and lowest HD95 values across cases were 3.2 ± 0.5 and 2.0 ± 0.0 mm, with the average being 2.5 ± 0.5 mm. ASD values ranging from 0.4 ± 0.1 to 0.7 ± 0.1 mm, with an average of 0.5 ± 0.1 mm, were obtained across the test set.

Table 1. Quantitative results for each test case and the average across cases. Values represent mean ± standard deviation.

Metrics      Case 1        Case 2        Case 3        Case 4        Case 5       Mean
DSC          0.95 (0.01)   0.95 (0.00)   0.91 (0.01)   0.94 (0.03)   0.93 (0.01)  0.94 (0.02)
HD95 (mm)    2.0 (0.0)     2.0 (0.0)     3.2 (0.5)     2.4 (0.9)     2.8 (0.7)    2.5 (0.5)
ASD (mm)     0.4 (0.1)     0.4 (0.1)     0.7 (0.1)     0.6 (0.3)     (0.1)        0.5 (0.1)

3.2. Qualitative evaluation

Examples of input CT images with the ground truth TBV (blue) and CTV (yellow) contours, as well as the segmentation outputs of the proposed approach (red), are illustrated for test set examples in figure 2. The predicted CTV appeared to match the ground truth CTV very well for all the cases. Predicted CTVs followed the clinical TBV-to-CTV expansion rules, i.e. the predicted CTV was derived from the TBV via a uniform margin extension, with the extensions corrected for the anatomical barriers. The chest wall and the first 5 mm underneath the surface of the skin were considered anatomic boundaries for the expansion. For instance, in case a, the TBV was surrounded by both the chest wall and the skin on the posterior and anterior sides of the breast. Therefore, correcting for the anatomical barriers resulted in minimal expansion of the TBV on the anterior and posterior sides, while relatively uniform expansions were observed in other directions. In case b, TBV-to-CTV expansion was restricted by the chest wall: the extremely small space between the TBV and the chest wall resulted in either minimal or no expansion on the posterior side of the breast, with uniform expansion otherwise. In case c, the TBV was located adjacent to the skin boundary, leading to correction of the extensions in the anterior direction. Automatic segmentation of CTV thus allowed for automatic correction of TBV-to-CTV expansions for anatomical barriers.

Figure 2.

Figure 2. Qualitative comparison of the predicted CTV contours (red) with the ground truth CTV (yellow) for three test cases (a–c). Ground truth tumor bed volume (TBV) contours are shown in blue. In case a, the tumor bed is surrounded by both the chest wall and the skin boundary, resulting in minimal expansion in these directions. Similarly, in cases b and c, tumor-bed-to-CTV expansion was restricted by the chest wall and the skin boundary respectively.


3.3. Anatomic barriers

In the previous section, it was observed that the network learned anatomic barriers to constrain TBV-to-CTV expansion. In this section, we provide examples to show how the network learns and considers anatomic barriers for CTV segmentation when the TBV's distance from the skin boundary and the chest wall is altered. To this end, we simulated three test cases with the TBV migrated from the chest wall to the skin boundary and examined the predicted CTV for each case. As shown in the top row of figure 3, as we purposely moved the TBV from adjacent to the chest wall towards the skin (cases a to c), the extension in the posterior direction gradually changed from zero to the uniform margin value (i.e. ∼10 mm). A similar phenomenon was observed in the anterior direction: in case a, as there was enough space between the TBV and the skin boundary, the CTV expanded up to the margin (10 mm) anteriorly; in case b, while the TBV had migrated towards the skin, there was still enough space for the CTV to grow by the margin in the anterior direction. However, further migration of the TBV (case c) constrained the expansion significantly in this direction. The model attempted to maintain the 5 mm distance between the CTV contour and the skin barrier, resulting in minimal expansion in the regions where the TBV became adjacent to the skin. The outcomes further confirm that geometric boundaries play a significant role in CTV segmentation and are learnt by the network during training.

Figure 3.

Figure 3. Top row, illustration of the predicted CTVs (red) where TBV (blue) migrated from the chest wall towards the skin boundary (a to c). Bottom row, illustration of the corresponding Grad-CAMs obtained at the last convolution layer of the network for images in the top row.


3.4. Extracting visual representations using Grad-CAM

3.4.1. Network attention over input images with migrated TBV

We utilized Grad-CAM to improve the transparency of our proposed model in CTV segmentation by producing visual explanations. Grad-CAM produces a coarse localization map highlighting the important regions in the image for predicting the target, i.e. the CTV expansion in our study. The resulting heatmaps show the relevance of individual pixels for the segmentation.

The bottom row of figure 3 depicts the Grad-CAM results obtained at the last double convolution layer of the network as the TBV migrated from the chest wall towards the skin boundary (a to c). We chose to show the last layer as it contains the finest Grad-CAM maps. According to the outcomes, network attention was highest on the CTV expansion region, and the attention moved as the TBV migrated. In column a, in the posterior direction, where almost no TBV-to-CTV expansion occurred, there was no network attention. As the TBV migrated in the anterior direction, network attention in the posterior direction increased (columns b, c). In column c, there was high attention in the posterior direction, while low attention was observed in the anterior direction because the TBV was close to the skin boundary, which did not allow expansion in this direction.

3.4.2. Network attention over internal layers of the network

To elucidate the strategy taken by the network for CTV segmentation, we utilized Grad-CAM to see how network attention changes over internal layers and to determine how spatial information flows in the proposed model. To understand the layer-wise feature map importance, Grad-CAM was applied to observe the attention of different internal layers. This amounts to aggregating the gradients of ${y}^{c}$ with respect to the chosen feature layers to determine their general relevance for the decision of the network (equations (2) and (3)). Based on this relevance, a heatmap was then obtained for the respective layer as a weighted average of the activations of the feature maps. In figure 4, Grad-CAMs at the second convolution layer in each contraction and expansion block are shown for a given input image. Images b, c, and d represent Grad-CAMs obtained at the double convolution layer of the first, second, and third contraction (down sampling) blocks respectively. Images e, f, g, and h depict Grad-CAMs at the double convolution layer of the first, second, third, and fourth (i.e. last) expansion (up sampling) blocks respectively.

Figure 4.

Figure 4. Grad-CAM results for the internal layers of the network (b–h) for a given input image (a). Images b, c, and d represent Grad-CAMs obtained at the double convolution layer of the first, second, and third contraction (down sampling) blocks respectively. Images e, f, g, and h depict Grad-CAMs at the double convolution layer of the first, second, third, and fourth (i.e. last) expansion (up sampling) blocks respectively. The TBV (blue) and the predicted CTV (red) for the input CT image are illustrated in a.


Figure 4 demonstrates the approach the network took to localize the CTV expansion. For instance, in the first down-sampling block, the model paid attention to the skin boundary (b). Attention then moved to the tumor bed region (c), and in the third down-sampling block it returned to the skin boundary (d). In the first up-sampling block (e), network attention appeared to be on the chest wall region, and it moved again to the tumor bed region in the second expansion block (f). In the final two up-sampling blocks, network attention converged to the finer segmentations (g, h). In (g), the model's focus on the boundary of the predicted CTV expansion was evident, and finally in (h), the predicted CTV expansion was the most highlighted region.

4. Discussion

Automatic delineation of CTV can significantly reduce the contouring burden on radiation oncologists, who currently must manually contour both TBV and CTV for each patient. Automatic methods can also yield more consistent target volumes, as they follow the same clinical guidelines, reducing the intra- and inter-observer variability induced by humans. In current practice, CTV is derived from TBV via a margin extension, and these extensions are then corrected for anatomical barriers (e.g. the skin and the chest wall). Our proposed automatic CTV segmentation approach aimed to mimic physicians' contouring practice to maximize the agreement between the model's prediction and current clinical contouring guidelines. In this study, the supervised TBV-to-CTV expansion region guided the model to encode location-related features and thus directed the network to focus on the tumor bed to initiate CTV segmentation. The expansion rules and anatomical boundaries were learnt during model training, assisting the network to prevent expansions beyond the chest wall and to stop at a certain distance from the skin. Grad-CAMs highlighted the image regions considered important for producing the final segmentation. Grad-CAM visualizations of the model prediction over internal layers revealed that the model learned to look at the tumor bed, chest wall, and skin boundary to segment the CTV. The outcomes imply that the approach taken by the network to segment breast CTV resembles that of a human expert identifying CTV in breast CT images. Extracting such qualitative features of a segmentation model can provide human-understandable information about the concepts that the network learns and how these concepts develop over the internal layers of the network. This can have implications for better acceptability of such deep learning models in clinical applications, as trusting a model with human-like inference is easier than trusting a 'black box'.
Finally, during model testing, the learnt rules were applied to unseen data. The results suggested high agreement between the network prediction and the CTV masks contoured manually by clinical experts. Average DSC, HD95, and ASD values of 0.94 ± 0.02, 2.5 ± 0.5 mm, and 0.5 ± 0.1 mm respectively confirmed the accurate performance of the model.

Our method showed superior performance to previous state-of-the-art approaches for breast CTV segmentation. Men et al (2018) introduced a very deep dilated residual network (DD-ResNet) for auto-segmentation of the breast CTV. The performance of their proposed model was evaluated against a DDCNN and a deep deconvolutional neural network (DDNN). For the right breast CTV, mean DSCs of 0.91, 0.85, and 0.88 and HD95 values of 10.5, 15.1, and 13.5 mm were reported for DD-ResNet, DDCNN, and DDNN respectively. For the left breast CTV, mean DSCs of 0.91, 0.85, and 0.87 and HD95 values of 10.7, 15.6, and 14.1 mm were obtained for DD-ResNet, DDCNN, and DDNN respectively. Qi et al (2020) developed a three-stage process for segmentation of the CTV in breast CT slices. They first identified slices containing CTV, then removed the black borders around the body by cropping the image, and finally segmented the CTV in the cropped CT slices. The proposed method achieved DSCs of 0.828 and 0.836 for the right and left breast respectively. Seo et al (2020) developed a three-dimensional fully convolutional DenseNet (FCDN) to automatically contour breast CTV and OARs and compared their results with commercially available atlas-based solutions. DSC and HD95 values of 0.90 ± 0.05 and 4.7 ± 1.7 mm were achieved for right breast CTV segmentation, with similar results reported for the left breast (DSC = 0.90 ± 0.03, HD95 = 4.3 ± 1.7 mm). Liu et al (2021) proposed U-ResNet to improve the efficiency and accuracy of breast CTV and OAR delineation compared to those generated by U-Net. To avoid the vanishing gradients of deep convolutional networks, ResNet was incorporated as the encoder. DSC and HD95 scores of 0.94 and 4.31 mm were reported for breast CTV.

Although useful, previous research has some limitations. For instance, in (Men et al 2018) and (Qi et al 2020), rather than a single generalizable model, two separate networks were trained and evaluated for left- and right-sided breast cancer. In (Men et al 2018, Qi et al 2020, Seo et al 2020, Liu et al 2021) the datasets included CT images and ground-truth labels collected from 800, 455, 62, and 160 patients respectively. Considering the highly labor-intensive nature of manual CTV contouring, preparing such datasets is significantly challenging. To cope with the dataset size, in (Men et al 2018, Qi et al 2020, Liu et al 2021), the 3D volumetric CT images were cut into 2D slices and 2D models were used for segmentation. 2D approaches are prone to discontinuity in 3D space and fail to leverage context from adjacent slices, which is crucial for accurate segmentation, resulting in rough segmentation outcomes in 3D view. Our proposed model addressed these limitations by utilizing a simple yet effective 3D structure trained on significantly smaller training data (i.e. 35 patients), and still outperformed the previous state-of-the-art. The proposed auto-segmentation approach generated TBV-to-CTV expansion masks in ∼11 s, while manual contouring of CTV takes at least 5 min per CT volume, implying significantly improved efficiency.

In order to further justify the model structure proposed in this work, i.e. using TBV alongside the CT image, we performed an experiment where CT images were the only input to the model, and DSCs were calculated with respect to the ground truth (table 2). The significantly lower DSCs compared to table 1 further support the motivation for adding TBV to assist CTV contouring. The underlying reason for this inferior performance is that the extent of microscopic disease encompassed by CTV is not visible in CT images. Consequently, CTV delineation does not amount to a pure image segmentation task, unlike GTV delineation, which concerns delineating an abnormal-appearing mass in radiological images. Hence, CTV cannot be accurately identified exclusively from intensity information in the image (i.e. using CT images only). In clinical practice, CTV is derived from TBV via an expansion by a clinical-knowledge-informed disease infiltration margin, followed by correcting the extensions for anatomical barriers of tumor invasion (e.g. skin, chest wall). Inputting TBV as the second channel directs the network to focus on TBV to initiate CTV segmentation. The expansion rules and geometric boundaries are learnt during model training, assisting the network to prevent expansion beyond the chest wall and to stop at a certain distance from the skin.
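The multi-channel input design described above can be sketched as follows. This is a minimal illustration, not the paper's preprocessing pipeline: the HU clipping window, normalization, and function name are our assumptions.

```python
import numpy as np

def make_two_channel_input(ct_volume, tbv_mask, clip=(-1000.0, 1000.0)):
    """Stack a CT volume and its binary TBV mask into a 2-channel input.

    ct_volume : (D, H, W) array of Hounsfield-unit values
    tbv_mask  : (D, H, W) binary array (1 inside the tumor bed volume)
    Returns a (2, D, H, W) array for a channels-first 3D U-Net.
    """
    lo, hi = clip
    ct = np.clip(ct_volume.astype(np.float32), lo, hi)
    ct = (ct - lo) / (hi - lo)            # normalize HU to [0, 1]
    mask = tbv_mask.astype(np.float32)
    return np.stack([ct, mask], axis=0)   # channel axis first
```

Feeding the TBV mask as a second channel gives the network an explicit spatial prior, so the learnt task reduces to expanding the TBV by the margin rules rather than localizing the target from image intensities alone.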

Table 2. Dice coefficients for the test dataset utilizing the network with the CT images as the only input.

        Case 1       Case 2       Case 3       Case 4       Case 5       Mean
DSC     0.52 (0.17)  0.77 (0.05)  0.42 (0.2)   0.76 (0.03)  0.53 (0.11)  0.6 (0.15)

In this study, inter-observer variation was shown to be substantial for the CTV, despite delineation guidelines. This implies that there is significant difference of opinion across physicians on what to consider as target volume. Figure 5 summarizes the CV calculated over the actual CTV volumes (cc) of five fractions for each case. The results indicate a wide range (CV = 1.9%–21.8%) of inter-observer variation in manual delineation, with an average CV of 8.5% across cases. Our proposed model achieved a DSC of 0.94, which is comparable to physicians' contouring considering the significant inter-physician variability in manual segmentation of CTV.
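The coefficient of variation used here is simply the standard deviation of the repeated volume measurements divided by their mean. A minimal sketch (the use of the sample standard deviation, ddof = 1, is our assumption):

```python
import numpy as np

def coefficient_of_variation(volumes_cc):
    """CV (%) across repeated CTV volume measurements (in cc)."""
    v = np.asarray(volumes_cc, dtype=float)
    return 100.0 * v.std(ddof=1) / v.mean()
```

Applied per case to the five per-fraction CTV volumes, this yields the percentages summarized in figure 5.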

Figure 5. Coefficient of variation (CV) across five manually delineated CTV contours for each case.


We provide further evidence of inter-observer variability in manual CTV contouring by calculating DSC, ASD, and HD95, using the CTV volume of one fraction as the reference (table 3). The relatively low DSC and high ASD and HD95 reported in table 3 imply high levels of human-induced inter-observer variability.

Table 3. DSC, HD95, and ASD obtained across manually delineated breast CTV contours as a measure of inter-observer variability of manual contouring. Values represent mean ± standard deviation.

Metric     Case 1        Case 2        Case 3        Case 4        Case 5        Mean
DSC        0.87 (0.02)   0.63 (0.14)   0.67 (0.1)    0.46 (0.09)   0.75 (0.07)   0.68 (0.15)
HD95 (mm)  4.12 (1.16)   10.07 (4.08)  8.26 (2.77)   14.05 (1.85)  6.21 (1.41)   8.54 (3.8)
ASD (mm)   1.26 (0.28)   4.56 (2.05)   3.22 (1.26)   6.91 (1.37)   2.67 (0.82)   3.72 (2.1)

In the setting of this work, the second-channel TBV directs CTV segmentation, where the learnt expansion rules are applied around the TBV. However, the quality of TBV contours strongly depends on the experience of the radiation oncologist. Any inaccuracy in manual TBV delineation would adversely affect CTV segmentation and result in inaccurate outcomes. To mitigate such effects, poorly delineated TBV contours should be identified and excluded from the dataset before model training and testing (Tajbakhsh et al 2020). Moreover, in addition to inter-observer contouring variability, changes in the TBV in response to radiation treatment, such as breast tissue swelling, could lead to TBV variability over fractions. In this work, however, we have not separated the impact of these two factors. Further analysis is required to differentiate volumetric changes due to radiation response from those caused by different observers' contouring styles.

5. Conclusion

Accurate and efficient delineation of the breast CTV is crucial for achieving effective radiotherapy outcomes. However, manual delineation is time-consuming and subject to significant inter- and intra-observer variations. This study proposed and developed a deep learning approach mimicking clinical contouring practice for auto-segmentation of the CTV in partial breast irradiation, using planning CT images and manual delineations of the TBV. The results demonstrated high levels of agreement between the predicted contours and physicians' manual contours. We further utilized Grad-CAM to improve the transparency of the model regarding the concepts the network learns and to elucidate the network's approach to CTV segmentation.

Acknowledgments

This work was in part supported by the National Institutes of Health under Grant No. R01-CA235723 and SBIR 75N91021C00031.

Data availability statement

The data cannot be made publicly available upon publication because no suitable repository exists for hosting data in this field of study. The data that support the findings of this study are available upon reasonable request from the authors.
