Effect of dataset size, image quality, and image type on deep learning-based automatic prostate segmentation in 3D ultrasound

Three-dimensional (3D) transrectal ultrasound (TRUS) is utilized in prostate cancer diagnosis and treatment, necessitating time-consuming manual prostate segmentation. We have previously developed an automatic 3D prostate segmentation algorithm involving deep learning prediction on radially sampled 2D images followed by 3D reconstruction, trained on a large, clinically diverse dataset with variable image quality. As large clinical datasets are rare, widespread adoption of automatic segmentation could be facilitated with efficient 2D-based approaches and the development of an image quality grading method. The complete training dataset of 6761 2D images, resliced from 206 3D TRUS volumes acquired using end-fire and side-fire acquisition methods, was split to train two separate networks using either end-fire or side-fire images. Split datasets were reduced to 1000, 500, 250, and 100 2D images. For deep learning prediction, modified U-Net and U-Net++ architectures were implemented and compared using an unseen test dataset of 40 3D TRUS volumes. A 3D TRUS image quality grading scale with three factors (acquisition quality, artifact severity, and boundary visibility) was developed to assess the impact on segmentation performance. For the complete training dataset, U-Net and U-Net++ networks demonstrated equivalent performance, but when trained using split end-fire/side-fire datasets, U-Net++ significantly outperformed the U-Net. Compared to the complete training datasets, U-Net++ trained using reduced-size end-fire and side-fire datasets demonstrated equivalent performance down to 500 training images. For this dataset, image quality had no impact on segmentation performance for end-fire images but did have a significant effect for side-fire images, with boundary visibility having the largest impact. 
Our algorithm provided fast (<1.5 s) and accurate 3D segmentations across clinically diverse images, demonstrating generalizability and efficiency when employed on smaller datasets, supporting the potential for widespread use, even when data is scarce. The development of an image quality grading scale provides a quantitative tool for assessing segmentation performance.


Introduction
Prostate biopsy is the current clinical standard for prostate cancer (PCa) diagnosis, but the conventional two-dimensional (2D) transrectal ultrasound (TRUS)-guided biopsy has been reported to have a false negative rate of up to 30% (Leite et al 2009). Improved tumour sampling can be achieved with the addition of three-dimensional (3D) TRUS imaging using a magnetic resonance imaging (MRI)-3D TRUS fusion-guided biopsy approach, which utilizes the superior soft-tissue contrast of MRI to identify suspicious tissue regions for targeting with real-time TRUS guidance (Cool et al 2016). For PCa treatment, high-dose-rate (HDR) brachytherapy is a common treatment modality for intermediate- and high-risk localized PCa (Yamada et al 2012, Yoshioka et al 2016). 3D TRUS imaging provides spatial context through visualization of the anatomy in 3D, in addition to improving intraoperative needle tip identification and patient-specific dose optimization (Hrinivich et al 2016). While 3D TRUS imaging offers many benefits, it also necessitates accurate segmentation of the prostate to perform the biopsy and brachytherapy clinical tasks. Segmentations are often completed manually by the physician during the procedure, which can be time-consuming and highly variable, extending procedure times and increasing risk due to anesthesia exposure (Reich et al 2005).
Multiparametric MRI is quickly becoming a standard of care imaging modality for both biopsy and brachytherapy procedures based on the results of recent randomized controlled clinical trials (Kasivisvanathan et al 2018, Alayed et al 2019, Eklund et al 2021). While MRI offers high soft-tissue contrast, limitations including high cost limit widespread adoption. For MRI-guided brachytherapy, patient movement to the MRI scanner after needle implant may cause needle shifts (Holly et al 2011), while in-bore procedures require highly specialized magnet-safe tools. In hospitals where MRI is utilized for MRI-3D TRUS fusion-guided biopsy or brachytherapy, registration between the modalities is required, often utilizing a surface-based approach, which requires accurate segmentation of the prostate in both image modalities. Thus, even with an MRI-based approach, accurate and fast prostate segmentation in 3D TRUS images is critical.
With the increasing prevalence of deep learning in medicine (Piccialli et al 2021), and specifically of convolutional neural networks (CNNs) for medical imaging tasks, many deep learning-based automatic prostate segmentation approaches have been proposed for TRUS imaging, promising reduced procedure time and similar performance compared to manual approaches (Anas et al 2018). We previously proposed an automatic segmentation algorithm involving deep learning prediction with a modified U-Net on 2D TRUS images radially sliced from 3D TRUS volumes, followed by reconstruction into a 3D surface (Orlando et al 2020). The algorithm was trained on a large dataset with nearly 7000 2D images resliced from 206 clinically variable 3D TRUS images from various acquisition methods, procedure types (biopsy and HDR brachytherapy), ultrasound machines, and ultrasound transducers. On a testing set of 40 unseen 3D TRUS volumes from different acquisition methods, we demonstrated high performance with a median [quartile 1, quartile 3] Dice similarity coefficient (DSC), mean surface distance (MSD), and Hausdorff distance (HD) of 94.1 [92.6, 94.9]%, 0.89 [0.73, 1.09] mm, and 2.89 [2.37, 4.35] mm, respectively. This algorithm outperformed a fully 3D V-Net and state-of-the-art methods in the literature (Orlando et al 2020).
However, large and diverse medical image datasets are rare, especially for ultrasound, with recent papers reporting utilization of datasets with 2238 2D TRUS images (Anas et al 2018), and 40 (Wang et al 2019), 44 (Lei et al 2019), 86 (Lei et al 2021), and 109 (Ghavami et al 2018) 3D TRUS volumes. Generation of large clinical datasets is time-consuming and costly, so generalizable and accurate automatic segmentation approaches suitable for small datasets are critical for the widespread integration of deep learning in minimally invasive PCa procedures. The U-Net++ architecture, an evolution of the standard U-Net, has recently been proposed and may help accomplish this goal, introducing multiple CNN backbones as well as nested, dense skip connections (Zhou et al 2018, 2020). These redesigned skip connections attempt to reduce the semantic gap between feature maps in the encoder and decoder sections of the network, resulting in an easier optimization problem and thus higher performance with small training datasets.
Image quality is highly variable between 3D TRUS volumes due to differences in acquisition method, image acquisition artifacts, patient anatomy artifacts (such as gas, calcifications, and catheters), and prostate boundary visibility. These factors may influence prostate segmentation results, so a 3D TRUS image quality grading scale is required to aid in comparing results from different datasets and to identify the key image quality factors that influence segmentation performance.
Our work aims to demonstrate that our 2D radial deep learning plus 3D reconstruction approach offers efficient utilization of training data and thus high segmentation performance when trained with smaller datasets and datasets split based on 3D TRUS acquisition type (end-fire and side-fire). To assess the impact of image quality on segmentation performance, we propose an image quality grading scale containing three distinct image quality factors. By rigorously evaluating our deep learning segmentation approach in the context of image quality, training dataset size, and 3D TRUS acquisition type (end-fire and side-fire), we intend to provide a widely accessible, robust, generalizable, and efficient prostate segmentation algorithm. This approach may allow for reduced clinical procedure time and increased efficiency for minimally invasive PCa procedures, allowing for optimization to a clinic's local preferences, without requiring access to large ultrasound image datasets.

Complete ultrasound dataset
The complete ultrasound dataset consisted of 246 3D TRUS volumes of the prostate. This dataset contained 104 end-fire 3D TRUS volumes, obtained from clinical prostate biopsy procedures, and 142 side-fire 3D TRUS volumes, obtained from clinical prostate brachytherapy procedures. Patient clinical information such as age, stage of prostate cancer, and Gleason score was not recorded. The methods used to acquire 3D TRUS volumes have been described previously and are briefly summarized here (Tong et al 1996, Bax et al 2008). To generate these images, a TRUS transducer was mechanically rotated about its long axis using a motorized fixture. 2D TRUS images were acquired at set angular intervals and then reconstructed to generate 3D TRUS volumes. The choice of TRUS transducer leads to geometrically variable images: the end-fire transducer used for prostate biopsy was rotated 180° while 2D TRUS images were acquired at 1.0° intervals and reconstructed into a 3D volume; the side-fire transducer used for prostate brachytherapy was rotated 140° while 2D TRUS images were acquired at 0.5° intervals and reconstructed into a fan-shaped 3D TRUS volume. These 3D TRUS volumes were acquired with three transducers used with three ultrasound systems of different ages and from two manufacturers. Specifically, an 8848 transducer was used with the Profocus 2202 ultrasound system (BK Medical, Peabody, MA, USA), C9-5 and BPTRT9-5 transducers were used with the ATL HDI-5000 ultrasound system (Philips, Amsterdam, the Netherlands), and a C9-5 transducer was used with the iU22 ultrasound system (Philips, Amsterdam, the Netherlands). Manual prostate segmentations in the 3D TRUS volumes, excluding the seminal vesicles, were completed by an observer experienced with 3D TRUS imaging (IG). 20 end-fire and 20 side-fire 3D TRUS volumes were randomly selected from the complete dataset and reserved as a testing dataset, and thus were not included during training.
As outlined in Orlando and Gillies et al, the complete training dataset of 206 3D TRUS volumes was resliced at randomized axial, sagittal, coronal, radial, and oblique image planes, resulting in a final training dataset of 6761 2D TRUS images with matched manual segmentations (Orlando et al 2020). This reslicing allowed for more efficient use of the TRUS data, demonstrating improved performance compared to a fully 3D V-Net approach (Milletari et al 2016, Orlando et al 2020). 2D images were resampled to 256 × 256 pixels with no other preprocessing applied. The complete training dataset of 2D TRUS images was split for deep learning, with 80% (5409 images) used for training and 20% (1352 images) used for validation.
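The resampling and 80/20 split steps above can be illustrated with a minimal numpy sketch; the nearest-neighbour interpolation, shuffling, and seed are assumptions, as the paper does not specify these details.

```python
import numpy as np

def resample_to_256(image):
    """Nearest-neighbour resample of a 2D image to 256 x 256 pixels.

    A stand-in for the resampling step; the interpolation method used in
    the paper is not specified.
    """
    h, w = image.shape
    rows = np.arange(256) * h // 256
    cols = np.arange(256) * w // 256
    return image[np.ix_(rows, cols)]

def train_val_split(images, val_fraction=0.2, seed=0):
    """Shuffle a list of 2D images and split it 80/20 for training/validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(images))
    n_val = int(round(val_fraction * len(images)))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return [images[i] for i in train_idx], [images[i] for i in val_idx]
```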

Reduced-size datasets
To evaluate our method's efficiency in utilizing the training data, we generated smaller datasets by splitting and reducing the complete dataset of 6761 2D TRUS images. In all smaller datasets, we maintained the 80/20 training/validation split for deep learning.

Split end-fire and side-fire datasets
We first split the complete dataset into an end-fire training dataset of 2738 2D TRUS images and a side-fire training dataset of 4023 2D TRUS images (table 1). This allowed for an assessment of generalizability by training two sets of parameters and testing on the opposite 3D TRUS acquisition type, which was unseen during training.
2.2.2. Smaller end-fire, side-fire, and mixed datasets

Smaller datasets were generated by reducing the split end-fire and side-fire datasets to assess how segmentation performance depends on the size of the dataset used for training, with an aim to find the smallest dataset which still maintains high segmentation performance. Using the split end-fire and side-fire datasets, images were removed at random to create training datasets with 1000, 500, 250, and 100 2D TRUS images of each acquisition type (table 1). These smaller datasets were generated by reslicing from 36, 18, 9, and 4 3D TRUS volumes, respectively. Thus, variation in image quality and anatomical features, as determined by the 3D TRUS volume, was similarly reduced. This resulted in eight reduced-size datasets (four end-fire and four side-fire).
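A sketch of how such reduced-size sets might be drawn, given that each reduced dataset comes from a smaller pool of whole 3D volumes; `reduce_dataset` and its volume-first sampling order are hypothetical, not the authors' exact procedure.

```python
import numpy as np

def reduce_dataset(slice_volume_ids, n_images, seed=0):
    """Select a reduced-size training set by drawing whole 3D volumes at
    random and keeping their resliced 2D images until n_images remain.

    slice_volume_ids: for each 2D slice, the id of its source 3D volume.
    Returns sorted indices of the kept 2D slices. Hypothetical helper:
    the paper removed images at random, with reduced sets resliced from
    progressively fewer volumes.
    """
    rng = np.random.default_rng(seed)
    volumes = sorted(set(slice_volume_ids))
    rng.shuffle(volumes)
    kept = []
    for v in volumes:
        kept.extend(i for i, vid in enumerate(slice_volume_ids) if vid == v)
        if len(kept) >= n_images:
            break
    return sorted(kept[:n_images])
```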
Similarly, smaller mixed datasets were generated by reducing the complete 2D TRUS dataset. Images were removed at random to create training datasets with 4023, 2738, 1000, 500, 250, and 100 mixed 2D TRUS images (table 1), resliced from 119, 86, 36, 18, 9, and 4 3D TRUS volumes, respectively. The segmentation performance of a network trained using 4023 mixed-acquisition images was compared to a network trained using 4023 side-fire images; similarly, networks trained using 2738 mixed images and 2738 end-fire images were compared. In all reduced-size mixed datasets, the ratio between end-fire and side-fire images matched the complete dataset, with 40.5% end-fire images and 59.5% side-fire images. This resulted in six reduced-size mixed datasets.

Image quality assessment
3D TRUS image quality varies across patients and 3D acquisition methods, and so it is expected to impact segmentation performance. To explore this effect, an experienced interventional and genitourinary radiologist (DC) developed a 3D TRUS image quality grading scale, provided in table 2. To ensure the scale was unbiased and generalizable, it was developed before the physician viewed our 3D TRUS dataset. Image quality was graded using three factors: acquisition quality, artifact severity, and prostate boundary visibility. Acquisition quality rated the quality of the 3D TRUS acquisition itself, ignoring anatomy artifacts and boundary visibility, ranging from 1 (poor) to 5 (ideal). Examples of poor acquisition quality included image shadowing due to inadequate transducer contact, transducer translation during 3D TRUS acquisition causing anatomy distortion, and issues with ultrasound gain or depth. Artifact severity estimated the degree of image degradation caused by artifact-generating items within the prostate gland, such as calcifications, gas, urinary catheters, and brachytherapy seeds, ranging from 1 (major artifacts) to 5 (no artifacts at all). Prostate boundary visibility graded the clarity of the prostate boundary with the adjacent periprostatic soft tissue, a key factor in the prostate segmentation task, ranging from 1 (more than 75% of the boundary is indistinguishable) to 3 (40% of the boundary is indistinguishable) to 5 (the entire boundary is clearly visible). The test dataset of 20 end-fire and 20 side-fire 3D TRUS volumes was graded by the same radiologist, who was blinded to the qualitative and quantitative segmentation performance. Only the test dataset was graded; as the test dataset was randomly selected from the complete dataset, its image quality distribution was representative of the complete dataset.
Five-point numerical grading allowed for a quantitative comparison between end-fire and side-fire 3D TRUS volumes, including the calculation of means and statistical testing.

3D segmentation algorithm
Our radial prostate segmentation algorithm was first described in Orlando and Gillies et al and will be briefly summarized here (figure 1). This method utilized a radial segmentation approach, first proposed by Qiu et al for a prostate segmentation algorithm based on convex optimization with shape priors (Qiu et al 2015). In this approach, a 3D TRUS volume is resliced radially about the approximate center of the prostate gland at 15° intervals, generating 12 2D TRUS images. The extracted 2D TRUS images appear very similar, as each plane passes through the mid-gland of the prostate, resulting in similar prostate size and shape regardless of the 3D TRUS acquisition method. This radial approach has been shown to improve segmentation performance in the apex and base of the prostate compared to alternative approaches such as transverse reslicing (Qiu et al 2015).

Table 2. Image quality grading scale for 3D TRUS images of the prostate.
Acquisition quality: quality of the 3D TRUS image acquisition regardless of anatomy; 1 (poor) to 5 (ideal).
Anatomy artifacts: severity of anatomy artifacts (calcification, gas, catheter, etc); 1 (major artifacts) to 5 (no artifacts).
Prostate boundary: visibility/clarity of the prostate boundary; 1 (>75% of boundary indistinguishable), 3 (40% of boundary indistinguishable), 5 (entire boundary visible).
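The radial reslicing step described above can be sketched with nearest-neighbour sampling about the volume's central axis; using the volume centre rather than the approximate prostate centre, and the interpolation scheme, are simplifying assumptions.

```python
import numpy as np

def radial_reslice(volume, n_planes=12, size=256):
    """Reslice a 3D volume (depth, height, width) into n_planes 2D images
    at equal angular spacing (15 degrees for 12 planes) about the vertical
    axis through the volume centre, using nearest-neighbour sampling.

    Simplified sketch: the paper reslices about the approximate prostate
    centre, not the volume centre.
    """
    d, h, w = volume.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = min(cy, cx)
    # in-plane axis runs from -radius to +radius through the centre
    t = np.linspace(-radius, radius, size)
    zs = np.rint(np.linspace(0, d - 1, size)).astype(int)
    slices = []
    for k in range(n_planes):
        theta = np.pi * k / n_planes  # 0 to 180 degrees over n_planes
        ys = np.rint(cy + t * np.sin(theta)).astype(int)
        xs = np.rint(cx + t * np.cos(theta)).astype(int)
        # rows index depth, columns index the radial line through the centre
        slices.append(volume[zs[:, None], ys[None, :], xs[None, :]])
    return np.stack(slices)
```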
The 12 radial 2D TRUS images were automatically segmented using neural networks trained with the 2D datasets described in sections 2.1 and 2.2 to generate 12 segmented prostate boundaries, which were used to reconstruct the 3D surface of the prostate (figure 1).

2D neural networks
Two neural network architectures were used in this work, trained with identical 2D TRUS datasets (see sections 2.1 and 2.2). Detailed network diagrams are provided in figures A1 and A2 in appendix A for the modified U-Net and U-Net++, respectively. Data augmentation using random combinations of horizontal flips, shifts up to 20%, rotations up to 20°, and zooms up to 20% was applied to double the training datasets. A personal computer with an i7-9700K central processing unit (CPU) at 3.60 GHz (Intel Corporation, Santa Clara, CA, USA), 64 GB of RAM, and a 24 GB NVIDIA TITAN RTX graphics processing unit (GPU) (NVIDIA Corporation, Santa Clara, CA, USA) was used for training all 2D neural networks and for subsequent prediction on unseen testing data.
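A framework-agnostic sketch of one augmentation draw (zoom omitted for brevity), assuming bilinear interpolation for images and nearest-neighbour interpolation for masks; the paper's exact implementation may differ.

```python
import numpy as np
from scipy import ndimage

def augment_pair(image, mask, rng):
    """Apply one random combination of horizontal flip, shift up to 20%,
    and rotation up to 20 degrees identically to an image and its
    segmentation mask (zoom omitted for brevity).

    A sketch of the augmentation policy, not the paper's exact code.
    """
    if rng.random() < 0.5:  # random horizontal flip
        image, mask = image[:, ::-1].copy(), mask[:, ::-1].copy()
    shift = rng.uniform(-0.2, 0.2, size=2) * np.array(image.shape)
    angle = rng.uniform(-20.0, 20.0)

    def warp(arr, order):
        # bilinear interpolation (order=1) for the image,
        # nearest-neighbour (order=0) for the mask to keep it binary
        arr = ndimage.shift(arr, shift, order=order, mode="nearest")
        return ndimage.rotate(arr, angle, reshape=False, order=order, mode="nearest")

    return warp(image, 1), warp(mask, 0)
```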

Modified U-Net
A five-layer deep modified version of the widely prevalent U-Net (Ronneberger et al 2015) was implemented using Keras with TensorFlow (Abadi et al 2016). First, 50% dropouts were applied at the last block on the contracting section of the network and at every block on the expansion section of the network to increase regularization and prevent overfitting (Orlando et al 2020). In addition, transpose convolutions were applied in the expansion section of the network instead of the standard upsampling followed by convolution (upconvolution), as this allowed for improved performance (Orlando et al 2020). Padding and ReLU activation were applied in each (3 × 3) convolution operation, with sigmoid activation used in the final (1 × 1) convolution operation. Additional hyperparameter selection based on preliminary experiments included the use of an Adam optimizer, a learning rate of 0.0001, a Dice-coefficient loss function, 100 epochs, and 200 steps per epoch.
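The Dice-coefficient loss used during training can be written compactly; this numpy sketch is framework-agnostic, and the smoothing constant guarding against empty masks is an assumption not stated in the paper.

```python
import numpy as np

def dice_loss(y_true, y_pred, smooth=1.0):
    """Soft Dice-coefficient loss: 1 - DSC over the flattened masks.

    The smoothing term (an assumption) keeps the loss defined when both
    masks are empty; the same loss was used for the modified U-Net and
    the U-Net++.
    """
    intersection = np.sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)
    return 1.0 - dice
```

Perfect overlap gives a loss of 0, and disjoint masks push the loss toward 1, so minimizing it directly maximizes the DSC metric reported in the evaluation.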

U-Net++
A state-of-the-art U-Net++ architecture (Zhou et al 2018, 2020) was also implemented using Keras with TensorFlow (Abadi et al 2016). We used a standard ResNet-50 architecture (He et al 2016) with batch normalization and a batch size of 10 as our CNN backbone, as it balanced the number of parameters against the overfitting risk for the scale of our training datasets. As described in section 2.5.1, the convolution operations and hyperparameters matched the modified U-Net implementation, including the use of transpose convolutions, the Adam optimizer, a 0.0001 learning rate, a Dice-coefficient loss function, and the number of epochs.

Evaluation and comparison
All trained models were evaluated using a testing dataset consisting of 20 end-fire and 20 side-fire 3D TRUS volumes unseen by the networks during training. The evaluation metrics included the Dice similarity coefficient (DSC), recall, precision, absolute volume percent difference (VPD), mean surface distance (MSD), and Hausdorff distance (HD), computed for both the 2D radial slice and reconstructed 3D segmentations for each prostate. Computation times were recorded for 2D slice segmentation, 3D reconstruction, and overall 3D segmentation. We have previously demonstrated significantly improved performance with a 2D radial deep learning plus 3D reconstruction approach compared to fully 3D CNNs; consequently, no 3D CNNs were used for comparison in this work. A detailed list of comparisons and corresponding statistical tests is provided in table 3.

Figure 1. 3D prostate segmentation workflow using an example end-fire 3D TRUS volume. The input 3D TRUS volume was resliced radially at 15° spacing to generate 12 2D TRUS images with similar size and shape. A trained 2D neural network was used to predict the prostate boundary locations in 2D binary masks, which were used to reconstruct the 3D prostate surface.
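The overlap- and volume-based metrics, plus a brute-force symmetric Hausdorff distance over boundary point sets, can be sketched as follows; MSD is omitted for brevity and the helper names are illustrative, not the authors' implementation.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """DSC, recall, precision, and absolute volume percent difference (VPD)
    between binary segmentations. Illustrative sketch of the overlap- and
    volume-based evaluation metrics.
    """
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    dsc = 2.0 * tp / (pred.sum() + truth.sum())
    recall = tp / truth.sum()
    precision = tp / pred.sum()
    vpd = 100.0 * abs(int(pred.sum()) - int(truth.sum())) / truth.sum()
    return dsc, recall, precision, vpd

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two (N, 3) surface point sets,
    computed by brute force (fine for the point counts of a prostate surface).
    """
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```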
Statistical calculations were performed in GraphPad Prism 9.2 (Graphpad Software, Inc., San Diego, CA, USA). The Shapiro-Wilk test was used to evaluate the normality of distributions. Failure of the Shapiro-Wilk test led to the use of nonparametric statistical tests and the reporting of median [quartile 1, quartile 3] results. The significance level for statistical analysis was chosen such that the probability of making a type I error was less than 5% (p<0.05), with statistically significant differences denoted simply as 'significant' for the remainder of this manuscript.
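The normality-gated testing logic described above can be sketched as below; the specific paired tests shown (paired t-test versus Wilcoxon signed-rank) are illustrative assumptions, as the exact test for each comparison is listed in table 3.

```python
import numpy as np
from scipy import stats

def compare_paired_metrics(metric_a, metric_b, alpha=0.05):
    """Compare one evaluation metric between two networks on the same
    test cases: use a paired t-test if both samples pass the
    Shapiro-Wilk normality test, otherwise fall back to the
    nonparametric Wilcoxon signed-rank test.

    A sketch of the normality-gated test selection; the paper's exact
    test per comparison is given in its table 3.
    """
    normal = (stats.shapiro(metric_a).pvalue > alpha
              and stats.shapiro(metric_b).pvalue > alpha)
    if normal:
        result = stats.ttest_rel(metric_a, metric_b)
    else:
        result = stats.wilcoxon(metric_a, metric_b)
    return result.pvalue, result.pvalue < alpha
```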

Complete dataset
Example U-Net, U-Net++, and manual segmentations for median end-fire and side-fire cases are shown in figure 2. The evaluation metric results comparing the modified U-Net to the U-Net++ when trained using the full dataset of 6761 images are shown in table 4. No significant differences were observed between the U-Net and U-Net++ for any metric for the full testing dataset. When considering the side-fire and end-fire test datasets separately, no significant differences were observed for the end-fire testing images, while only the precision and recall metrics were significantly different for side-fire testing images, with the U-Net demonstrating higher precision and the U-Net++ demonstrating higher recall. The mean computation time per 2D segmentation was 0.028 s for the modified U-Net and 0.088 s for the U-Net++. The mean 3D reconstruction time was 0.27 s, resulting in a total 3D segmentation time of 0.61 s for the modified U-Net and 1.33 s for the U-Net++.
Of note, a comparison of segmentation performance relative to prostate volume for the U-Net and U-Net++ demonstrated significant correlations between prostate size and the DSC and VPD metrics. The DSC metric showed a Spearman r coefficient of 0.58 and 0.61 for the U-Net and U-Net++, respectively, while the VPD metric showed a Spearman r coefficient of −0.44 and −0.51 for the U-Net and U-Net++, respectively.

Split end-fire and side-fire datasets
Qualitative segmentation results comparing the modified U-Net and U-Net++ to manual segmentations for networks trained with only end-fire and only side-fire images are shown in figures 3 and 4, respectively, and the corresponding quantitative comparisons are shown in tables 5 and 6. Plots showing DSC for the modified U-Net and U-Net++ trained using only end-fire and only side-fire datasets are shown in figure 5. For both the end-fire and side-fire networks evaluated on the complete testing dataset, which included images from both acquisition methods, the U-Net++ significantly outperformed the modified U-Net for all metrics except VPD for the side-fire networks. When evaluated on the end-fire and side-fire testing datasets separately, the U-Net++ also significantly outperformed the U-Net for all metrics aside from VPD when tested on the same image type it was trained on. Comparing the results shown in tables 4 and 5 of the U-Net++ trained with the full 6761-image dataset to the U-Net++ trained using only end-fire images, use of the full dataset only demonstrated a significant improvement for the HD metric (0.4 mm) when tested on end-fire images. Similarly, comparing the U-Net++ trained with the full dataset to one trained using only side-fire images, only the precision metric was significantly different (1.8%) when tested on side-fire images. When tested on the unseen acquisition type, the use of the full dataset demonstrated improved performance for every metric in both cases.

Smaller end-fire, side-fire, and mixed datasets
Qualitative segmentation results for networks trained with reduced-size datasets are shown in figure 6. Plots of DSC as a function of training dataset size are shown in figure 7, highlighting the high performance of the U-Net++ when trained with small datasets. Corresponding quantitative comparisons are provided in appendix tables B1, B2, and B3.
As shown in table B1, for the U-Net++ trained with reduced-size end-fire datasets and tested on end-fire images, significant differences were observed between the full (2738 images) end-fire training dataset and the 250 and 100 image sets for the DSC and MSD metrics, and all reduced-size image sets for the HD metric.
When the U-Net++ was trained with reduced-size side-fire datasets and tested on side-fire images (table B2), multiple comparisons tests showed significant differences for the DSC, MSD, and HD metrics between the full (4023 images) side-fire training dataset and the 500 and 100 image sets.
As shown in table B3, for the U-Net++ trained with reduced-size mixed datasets and tested on end-fire images, multiple comparisons tests showed significant differences between the full (6761 images) mixed training dataset and the 500, 250, and 100 image sets for the DSC, MSD, and HD metrics. When tested on side-fire images significant differences were observed between the full mixed training dataset and the 1000 through 100 image sets for the DSC, MSD, and HD metrics.
Comparing the U-Net++ trained with 2738 mixed images to the U-Net++ trained with 2738 end-fire images, no significant differences were observed when tested on end-fire images, but when tested on side-fire images, use of the mixed training dataset demonstrated significantly improved performance for all metrics. Similarly, for the U-Net++ trained with 4023 mixed images compared to the network trained with 4023 side-fire images, only the precision and recall metrics were significantly different when tested on side-fire images, with all metrics except precision significantly improved with use of the mixed training dataset when tested on end-fire images.

Image quality
A comparison of average image quality grading results for side-fire and end-fire 3D TRUS images of the prostate is shown in table 7. There were no significant differences between end-fire and side-fire image quality for any image quality factor or for the total averaged image quality.
A graph of DSC as a function of grade for each individual image quality factor is shown in figure 8. For end-fire testing images, image quality grade did not have a significant effect on segmentation performance for any metric. For side-fire testing images, only the boundary visibility grade had a significant effect for the modified U-Net, while all image quality factors except the anatomy artifact grade had a significant effect on the DSC metric for the U-Net++. Analysis of plots of DSC as a function of total image quality grade for the U-Net and U-Net++ (figure 9) showed no significant correlation for the end-fire testing dataset for any metric, with Spearman r coefficients less than 0.4. For the side-fire testing images, the modified U-Net showed a significant correlation between total image quality grade and the DSC, recall, and HD metrics, with Spearman r coefficients of 0.60, 0.61, and −0.56, respectively, while the U-Net++ showed a significant correlation for the DSC and recall metrics with Spearman r coefficients of 0.46 and 0.55, respectively.

Figure 4. Example side-fire (top row) and end-fire (bottom row) median DSC prostate segmentation results comparing manual (red), modified U-Net (blue), and U-Net++ (yellow) 3D surfaces for networks trained only using side-fire images. The columns from left to right show the prostate surface in the axial plane, sagittal plane, and an oblique radial plane, respectively.

Complete dataset
To provide a baseline maximum performance level, we first compared the segmentation accuracy of the modified U-Net to the U-Net++ when both networks were trained on the complete dataset. The results shown in table 4 demonstrate the nearly equivalent performance of the networks. This highlights that with a large training dataset of nearly 7000 2D images, the more advanced U-Net++ network with significantly more parameters did not offer any improvement in performance, motivating the experiments described in sections 3.2 and 3.3 focused on reduced-size datasets. Using the same 24 GB NVIDIA TITAN RTX GPU, the modified U-Net demonstrated a segmentation time three times faster, at 0.028 s per 2D slice compared to 0.088 s per slice for the U-Net++. After reconstruction of the 2D predictions into a 3D prostate surface, the total segmentation time was 0.61 s for the modified U-Net, half of the 1.33 s for the U-Net++. While this is a large relative difference, in a clinical setting the difference is inconsequential, as both represent a significant reduction in segmentation time relative to manual segmentation, which can take 10-20 min. Correlations between segmentation performance and prostate size were only significant for the DSC and VPD metrics. This is an expected result due to the nature of these metrics, as absolute differences that would be readily apparent for smaller prostate volumes would be reduced for large volumes when considering these overlap- and volume-based metrics. As expected, boundary-based metrics showed no correlation with prostate size. The correlations we did observe were still weak, however, with Spearman coefficients of roughly r = 0.6 for DSC and r = −0.4 to −0.5 for VPD, highlighting the general robustness of our approach to prostate size differences.

Split end-fire and side-fire datasets
Segmentation performance of the modified U-Net and U-Net++ trained with only end-fire or only side-fire images (figure 5 and tables 5 and 6) showed that the U-Net++ significantly outperformed the modified U-Net in nearly all cases. When trained using side-fire images and tested on end-fire images, no significant difference was observed, but the U-Net++ did have higher median performance, countered by a larger variation. These differences highlight the generalizability and efficiency of the U-Net++ in utilizing small training datasets. The modified U-Net had boundary errors due to shadowing artifacts, even when tested on the same image type, as seen in the top row of figure 3. When tested on the image type not seen during training, the U-Net++ still performed better, although it also had difficulties with shadowing artifacts (e.g. the bottom row of figure 3, with the heavily shadowed region near the top of the prostate). The modified U-Net had a depth of five layers compared to 50 for the U-Net++. This reduction in depth and number of parameters for the U-Net compared to the U-Net++ may alleviate the overfitting problem, which becomes important as training dataset size is reduced. When assessing how the U-Net++ trained with only end-fire or only side-fire images compared to one trained with the full dataset, we found little difference when tested on the same TRUS acquisition type the networks were trained with. This highlights a potentially practical finding: the presence of other image types in the training dataset does not add a significant benefit to segmentation performance when only one image type needs to be segmented. However, when the U-Net++ networks trained with only end-fire or only side-fire images were tested on the TRUS acquisition type they had never seen before, use of the full dataset significantly improved performance.
This demonstrates the necessity of including all image types in the training dataset, especially when generalizability and widespread application is important. DSC performance in these cases was still in the range of 85%-89% for the U-Net++, however, demonstrating the generalizability of our approach.
Differences between end-fire and side-fire images, including image quality and artifact prevalence, may explain the observed segmentation performance differences between TRUS image types. The differences in acquisition method between end-fire and side-fire 3D TRUS may result in artifacts in side-fire images such as air gaps due to lack of transducer contact or distal shadowing due to transducer distance from the prostate. Due to the nature of end-fire image acquisition, the radial plane used for deep learning segmentation closely matches the acquisition plane, resulting in improved segmentation accuracy. For side-fire images, only one of the twelve radial planes is the acquisition plane; the other eleven are interpolated slices with reduced resolution, potentially explaining some of the observed differences in segmentation performance. In HDR brachytherapy procedures where side-fire 3D TRUS is utilized, urinary catheters are commonly used, which create artifacts that are not seen in end-fire images used for prostate biopsy. The appearance of other organs such as the rectum and bladder also differs between end-fire and side-fire images, leading to increased prostate segmentation error where the algorithm included parts of the rectum or bladder when tested on the 3D TRUS type unseen by the network. Furthermore, due to differences in patient selection and the prevalence of hormone therapy prior to HDR brachytherapy treatment, prostate sizes in patients presenting for end-fire TRUS-guided biopsy are typically larger than those of patients undergoing side-fire TRUS-guided HDR brachytherapy. This led to underpredictions for side-fire networks tested on end-fire images and overpredictions for end-fire networks tested on side-fire images, limiting generalizability and necessitating the presence of both 3D TRUS types in the training dataset so the network can learn differences in size and shape.
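The radial reslicing described above, where twelve planes are sampled about the central axis and only one coincides with the side-fire acquisition plane, can be illustrated with a simplified sketch. This assumes an isotropic NumPy volume and samples planes by rotation about the central z-axis; it is not the implementation used in this work:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def radial_slices(volume: np.ndarray, n_planes: int = 12) -> list:
    """Sample 2D planes rotated about the volume's central z-axis.

    Planes are spaced 180/n_planes degrees apart; with n_planes=12 this
    gives the 15-degree spacing described in the text. Off-axis planes
    require interpolation, analogous to the interpolated side-fire slices.
    """
    z, y, x = volume.shape
    cy, cx = (y - 1) / 2.0, (x - 1) / 2.0
    radius = min(cy, cx)
    r = np.linspace(-radius, radius, x)      # in-plane radial coordinate
    zs = np.arange(z)
    slices = []
    for k in range(n_planes):
        theta = np.pi * k / n_planes         # 0 to 180 degrees
        ys = cy + r * np.sin(theta)
        xs = cx + r * np.cos(theta)
        coords = np.stack([
            np.repeat(zs, len(r)),           # z index for every sample
            np.tile(ys, z),                  # y coordinates of the plane
            np.tile(xs, z),                  # x coordinates of the plane
        ])
        # Trilinear interpolation (order=1) at the rotated sample points
        plane = map_coordinates(volume, coords, order=1).reshape(z, len(r))
        slices.append(plane)
    return slices
```

After 2D prediction on each plane, the per-plane contours would be reassembled into a 3D surface; that reconstruction step is omitted here.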
Smaller end-fire, side-fire, and mixed datasets
With as few as 500 end-fire images in the training dataset, just over 7% of the full dataset, DSC performance was within 1% of the U-Net++ trained with the full dataset of 6761 images. Results were similar for the U-Net++ trained with reduced-size side-fire datasets and tested on side-fire images. Networks trained with end-fire images performed better when tested on side-fire images than networks trained with side-fire images and tested on end-fire images, suggesting that the features the network learns from end-fire images are more generalizable to side-fire images. As expected, mixed training datasets had high segmentation performance when tested on both image types even as the dataset size was reduced. This improved performance and generalizability is apparent in figure 7, highlighting the benefit of including all image types in the training dataset.
For a segmentation task involving only one image type, performance plateaus at a training dataset size of 1000 2D training images of that type, which were obtained from approximately 36 3D volumes. A dataset of this size is achievable at even small hospitals or research centers, showing that deep learning segmentation in 3D may be possible even with limited data by utilizing organ symmetry and a radial approach. The reduced training data requirement reduces the amount of manual segmentation required, a key benefit as accurate manual segmentation is a difficult and time-consuming process that is often a bottleneck in supervised machine learning. These results also show that for a segmentation task involving multiple image types, the presence of all image types in the training dataset is critical. Segmentation performance for mixed training datasets also plateaus at approximately 1000 training images, suggesting that deep learning segmentation in two image types is possible even if data is scarce.

Image quality
We developed a 5-point image quality grading scale based on three factors specifically for 3D TRUS prostate images. This grading scale provides transparency regarding the image quality of our clinical dataset, helping to contextualize our results. A numerical scale with clearly defined image quality factors rated from one to five may enable easier comparison of segmentation performance between networks trained using different datasets. Designing the image quality grading scale independently of our dataset should allow it to be applied successfully to 3D TRUS datasets of varying quality.
Mean image quality grades for each individual factor provided in table 7 highlight the overall high quality of our dataset and the general similarity in image quality between end-fire and side-fire images, with no statistically significant differences observed and a maximum difference in mean of only 0.2. Side-fire images did have an increased standard deviation for each individual factor, highlighting the larger range of image qualities, including the presence of grades of 2 in each factor, which was not seen in the end-fire images. Our dataset contained no images with a grade of 1.
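A total averaged image quality grade can be derived by combining the three factor grades; a minimal sketch, assuming equal weighting of the factors (the exact weighting scheme is an assumption for illustration):

```python
def total_quality_grade(acquisition: int, artifact: int, boundary: int) -> float:
    """Combine the three 5-point factor grades (acquisition quality,
    artifact severity, boundary visibility) into one overall grade.

    Equal weighting is assumed here for illustration.
    """
    for grade in (acquisition, artifact, boundary):
        if not 1 <= grade <= 5:
            raise ValueError("each factor grade is on a 1-5 scale")
    return (acquisition + artifact + boundary) / 3.0

# e.g. a volume graded 4, 3, and 5 on the three factors averages to 4.0
```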
For end-fire images in our testing dataset, image quality had no significant effect on segmentation performance. In contrast, for side-fire images, the boundary visibility grade (for the U-Net) and the acquisition quality, boundary visibility, and total averaged image quality grades (for the U-Net++) significantly impacted segmentation performance. As expected, boundary visibility proved to be a key factor in the algorithm's ability to accurately segment the prostate boundary for both networks. These results were further confirmed by the correlation analysis shown in figure 9, highlighting the significant effect of image quality on segmentation performance for side-fire images, but not for end-fire images. Correlations were moderate, with Spearman r coefficients in the range of 0.46-0.6 for the DSC metric for both the U-Net and U-Net++. The lack of significant differences observed when comparing how segmentation performance varies with image quality, especially for the end-fire images, may be attributed to the high mean image quality and the subtle variation between the poorest and highest quality images. A dataset with more variation in image quality may better demonstrate the dependence of segmentation performance on image quality. In addition, due to the testing set size of 20 end-fire and 20 side-fire 3D TRUS volumes, some individual image quality grades had a very small sample size, which likely contributed to the lack of significant differences observed for some of the image quality factors.
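A correlation analysis of this kind can be run with `scipy.stats.spearmanr`; the per-volume values below are hypothetical, for illustration only, and are not the study's data:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-volume pairs: total averaged quality grade vs DSC (%)
quality = np.array([3.0, 3.3, 3.7, 4.0, 4.3, 4.7, 5.0])
dsc     = np.array([86.1, 88.0, 87.2, 89.5, 90.1, 91.4, 93.0])

# Spearman's rank correlation is appropriate here because quality
# grades are ordinal and the relationship need not be linear.
rho, p_value = spearmanr(quality, dsc)
print(f"Spearman r = {rho:.2f}, p = {p_value:.4f}")
```

With 40 test volumes split across grade levels, per-grade sample sizes become small, which is why some comparisons in the text lacked statistical power.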
The differences in image quality and its effect on segmentation performance for end-fire compared to side-fire images may be explained in part by the nature of image acquisition. Ultrasound transducer orientation is one critical component; during end-fire image acquisition, the transducer contacts the rectal wall only at its tip, which points towards the prostate. During side-fire acquisition, however, the transducer is positioned horizontally inside the rectum, requiring a much larger contact area, which can result in an increased prevalence of air gaps due to lost contact, reducing image quality. Furthermore, due to differences in transducer position based on the intended application, the side-fire transducer is further from the prostate, leading to hypoechoic regions away from the transducer due to issues with time-gain compensation.

Limitations and future work
Only one observer provided manual gold standard segmentations; thus, inter- and intra-observer variability were not directly assessed; however, these considerations were addressed in Orlando et al (2020). In addition, only one observer defined the image quality grading scale and graded the testing dataset, which did not assess the impact of inter- or intra-observer variability. Future work will include validation of our image quality grading scale and its reliability, including an assessment of inter- and intra-observer variability. Image quality of the training dataset may play a critical role in segmentation performance, and although the image quality of the testing dataset should have been representative of the training dataset, direct grading of the training images would allow confirmation of this assumption. As shown in table 7, our 3D TRUS dataset was of high quality on average. A wide range in image quality is important for algorithm generalizability. Future work should investigate our segmentation approach when trained and tested with a lower quality dataset, ideally from a different center.
Patient clinical information, such as age, stage of prostate cancer, and Gleason score, was not recorded for our dataset, and thus an assessment of how segmentation performance is impacted by these measures could not be completed. While this has not been assessed in previous work to our knowledge, differences in such measures could manifest as differences in image quality, potentially captured by our image quality grading scale, as artifact severity for example. Future work could explicitly investigate the influence of patient clinical information on segmentation quality.
For our U-Net++ implementation, only one type of CNN backbone (ResNet) was used. Future work will utilize a U-Net++ ensemble network with results from multiple CNN backbones combined into one segmentation result using a method such as averaging, majority vote, or the STAPLE algorithm (Warfield et al 2004). Finally, a leave-one-out vendor study examining the impact of ultrasound machine vendor on segmentation performance would offer a strong assessment of generalizability, which is critical for widespread clinical translation.
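Of the combination methods mentioned, per-voxel majority voting is the simplest; a minimal sketch, assuming binary masks from several hypothetical backbone variants:

```python
import numpy as np

def majority_vote(masks: list) -> np.ndarray:
    """Combine binary segmentation masks by per-voxel majority vote.

    A voxel is labeled foreground when more than half of the input
    masks (e.g. predictions from different CNN backbones) mark it so.
    """
    stacked = np.stack([np.asarray(m).astype(bool) for m in masks])
    return stacked.sum(axis=0) > (len(masks) / 2)

# Three 2x2 masks: a voxel is kept only where at least 2 of 3 agree
m1 = np.array([[1, 0], [1, 1]])
m2 = np.array([[1, 0], [0, 1]])
m3 = np.array([[0, 1], [1, 1]])
# majority_vote([m1, m2, m3]) -> [[True, False], [True, True]]
```

Averaging softmax probabilities before thresholding, or STAPLE's iterative performance-weighted estimate, would weight the constituent networks less naively than this simple vote.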

Conclusions
This study investigated the effect of training dataset size, image quality, and image type on prostate segmentation in 3D TRUS volumes using a 2D radial plus 3D reconstruction approach, comparing a modified U-Net to a U-Net++ architecture. Beginning with a large, clinically diverse dataset of TRUS images, smaller training datasets were generated by splitting and reducing the dataset. Segmentation performance for the U-Net++ plateaued at end-fire, side-fire, or mixed training dataset sizes of 1000 2D images, resliced from approximately 36 3D volumes. This high performance with small datasets highlights the potential for widespread use of our approach or similar methods, even if data is scarce, demonstrating the possibility for increased access to automated segmentation methods. The development of an image quality grading scale specifically for 3D TRUS imaging provides a quantitative tool for assessing segmentation performance, with an aim to increase transparency regarding dataset quality and aid in comparison between segmentation methods trained using different datasets.

Acknowledgments
… program using funds raised by the London Health Sciences Foundation. N Orlando was supported in part by the Queen Elizabeth II Graduate Scholarship in Science and Technology. The authors would also like to thank Dr Ashley Mercado for his assistance in collecting images during prostate biopsy procedures and Dr Aaron Ward for his thoughtful discussion surrounding this work.

Disclosures
The authors have no relevant conflicts of interest to disclose.
Appendix A

Figure A1. Network diagram for the modified U-Net.

Figure A2. Network diagram for the U-Net++.

Appendix B

Table B1. Median [Q1, Q3] 2D results for the U-Net++ trained using end-fire datasets of varying size, from 2738 (full end-fire set) to 100 images. The networks were evaluated on an unseen test dataset of 20 end-fire and 20 side-fire 3D TRUS images of the prostate.