Deep learning for high-resolution dose prediction in high dose rate brachytherapy for breast cancer treatment

Objective. Monte Carlo (MC) simulations are the benchmark for accurate radiotherapy dose calculations, notably in patient-specific high dose rate brachytherapy (HDR BT), in cases where considering tissue heterogeneities is critical. However, the lengthy computational time limits the practical application of MC simulations. Prior research used deep learning (DL) for dose prediction as an alternative to MC simulations. While accurate dose predictions akin to MC were attained, graphics processing unit limitations constrained these predictions to large voxels of 3 mm × 3 mm × 3 mm. This study aimed to enable dose predictions as accurate as MC simulations in 1 mm × 1 mm × 1 mm voxels within a clinically acceptable timeframe. Approach. Computed tomography scans of 98 breast cancer patients treated with Iridium-192-based HDR BT were used: 70 for training, 14 for validation, and 14 for testing. A new cropping strategy based on the distance to the seed was devised to reduce the volume size, enabling efficient training of 3D DL models using 1 mm × 1 mm × 1 mm dose grids. Additionally, novel DL architecture with layer-level fusion were proposed to predict MC simulated dose to medium-in-medium (D m,m ). These architectures fuse information from TG-43 dose to water-in-water (D w,w ) with patient tissue composition at the layer-level. Different inputs describing patient body composition were investigated. Main results. The proposed approach demonstrated state-of-the-art performance, on par with the MC D m,m maps, but 300 times faster. The mean absolute percent error for dosimetric indices between the MC and DL-predicted complete treatment plans was 0.17% ± 0.15% for the planning target volume V 100, 0.30% ± 0.32% for the skin D 2cc , 0.82% ± 0.79% for the lung D 2cc , 0.34% ± 0.29% for the chest wall D 2cc and 1.08% ± 0.98% for the heart D 2cc . Significance. Unlike the time-consuming MC simulations, the proposed novel strategy efficiently converts TG-43 D w,w maps into precise D m,m maps at high resolution, enabling clinical integration.


Introduction
In high dose rate brachytherapy (HDR BT), catheters and/or applicators are inserted temporarily inside the patientʼs body to guide a sealed and highly radioactive photon emitting source (a seed), usually Iridium-192  ( 192 Ir), to the tumor site, irradiating the tumor from the inside out (Schulz et al 1984).Post-implant imaging of the treatment site is then acquired with computed tomography (CT) or magnetic resonance imaging (MRI).The resulting images are used to segment the target volume containing the tumor, the surrounding organs at risk (OARs) and to digitize the catheters or applicators to identify possible locations (dwell positions) where the radioactive source temporarily rests a certain time (dwell time) to deliver radiation during the treatment.Next, the absorbed dose contribution per second from each potential dwell position is calculated to optimize the dwell times.The optimization process involves finding the most effective combination of dwell positions and dwell times to achieve an optimal treatment plan.The primary tool for assessing a treatment plan is the dose volume histogram (DVH) and its dosimetric indices (Drzymala et al 1991).DVHs simplify the 3D dose distribution data into 2D graphs, illustrating the radiation distribution across each structure.Cumulative DVHs depict the percentage or absolute volume of a particular tissue receiving an absorbed dose equal to or greater than specified dose intervals.Dosimetric indices, derived from DVHs, summarize information in a volume of interest.V x is the volume receiving at least x% of the prescribed dose for one fraction.The conformity index (CI) is defined as the planning target volume (PTV) V 100 divided by the body V 100 .D xcc is defined as the maximum dose received by x cm 3 of a volume and is used to evaluate the impact of the treatment plans on the OARs.The selection of dosimetric indices investigated in this study aligns with the guidelines outlined by Strnad et al (2018).
The absorbed dose is calculated according to the American Association of Physicists in Medicine (AAPM), task group report 43, termed TG-43 (Nath et al 1995, Rivard et al 2004) by using precalculated dose distributions around a single source and scaling it to the daily air kerma strength of the source.The TG-43 formalism ignores the influence of patient tissue, seed, applicator heterogeneities and finite patient dimensions on absorbed dose.It considers the patient an infinite, homogeneous water sphere, and the absorbed dose is computed and reported as dose to water in water (D w,w ).Model-based dose calculation algorithms (MBDCAs), with the Monte Carlo (MC) method being the gold standard, calculate absorbed dose to medium in medium (D m,m ) (Enger et al 2020).The accuracy of MBDCAs depends on the precise description of the seed, applicator, and patient geometry, along with their respective material/tissue compositions and corresponding mass densities.The dosimetric influence of MBDCAs for 192 Ir HDR BT for treatment of prostate, gynecological, and rectal cancers is on average less than 5% when compared to the conventional TG-43 dose calculation formalism (Beaulieu et al 2012, Shoemaker et al 2019, Famulari et al 2020, Morcos et al 2021).However, in cases where the radiation source is positioned close to materials that can significantly modify the scattering conditions, such as bones or air, as is often the case in breast brachytherapy, the difference in absorbed dose between voxels can reach up to 20% (Beaulieu et al 2012, Enger et al 2020).The adoption of MBDCAs in clinical practice has posed challenges despite their potential to offer superior dosimetry compared to TG-43 for various tumor types.These challenges primarily stem from their high computational demands and time-consuming nature.For the gold standard MC method, the accuracy of the absorbed dose relies on the number of primary particles employed in the simulations.To decrease the statistical uncertainty, a considerable number of histories (i.e.primary particle decay) must be simulated (Ma et al 2005).Running high-resolution simulations with a large number of histories can take hours, even when parallelized on advanced hardware.
Numerous studies, including  Wang et al (2023), have explored the potential of deep learning (DL) in radiotherapy to predict D m,m with accuracy comparable to the ground truth MC, while substantially reducing the computational time.However, in the context of HDR BT, only two prior studies have demonstrated the feasibility of DL-based dose prediction, but only on a coarse grid, i.e. volume elements (voxels) of 3 mm × 3 mm × 3 mm (Mao et al 2020, Akhavanallaf et al 2021) and small tumor sites.Using low-resolution voxels to predict absorbed dose for HDR BT fails to comply with clinical practice standards.The training of DL models for predicting D m,m in high-resolution settings, specifically on clinical voxel sizes of 1 mm × 1 mm × 1 mm, presents a notable challenge due to the substantial memory and computation requirements involved.This challenge stems from the training heavily relying on graphics processing units (GPUs), which possess limited memory capacity.Consequently, to our knowledge, no DL-based HDR BT dose prediction models are specifically designed for high-resolution scenarios.
To overcome this challenge, this study aimed to develop a high-resolution DL-based HDR BT dose prediction model using 1 mm × 1 mm × 1 mm voxels for breast cancer cases.An innovative problem size reduction strategy was developed to facilitate DL model training in a high-resolution 3D context, and a novel layer-level fusion DL architecture was proposed.It is worth noting that breast cancer patients often have larger treatment sites compared to other types of tumors.Consequently, training a 3D model with a resolution of 1 mm × 1 mm × 1 mm becomes impractical without reducing the volume size.DL models were trained with a 3D TG-43 based D w,w dose map and 3D patient composition as inputs to accurately predict a MC-calculated 3D D m,m dose map as output.An extensive comparison of different 3D convolutional neural network (CNN) architectures and model inputs was performed.Layer-level fusion DL architectures were compared to a U-Net model, and different inputs were investigated to describe the patient composition.

Learning strategy and problem formulation
In this study, we propose a supervised learning approach for predicting D m,m absorbed dose maps.The objective was to learn a function f for predicting the MC D m,m dose map accounting for tissue heterogeneities, given the TG-43 D w,w absorbed dose map and the corresponding patient body composition.The function f was approximated using a deep CNN and defined as ( ) = f D D , patient body composition w w m m , , .We aimed to minimize the difference between the predicted  D m m , and the MC-calculated D m,m .Additionally, we hypothesized that by providing information on the patient's tissues/materials, a neural network could learn to modify TG-43 D w,w and generate an accurate estimate of D m,m in a timely manner suitable for clinical use.

Patient cohort
The Institutional Review Board of the Jewish General Hospital (JGH) and McGill University Health Center approved this retrospective study using data from cancer patients treated with HDR BT.For this project, data from 98 breast cancer patients were extracted from the JGH clinical database.These patients were chosen as per the American Society for Radiation Oncology criteria (Correa et al 2017).The prescribed dose was 32.4 Gy in 10 fractions.Treatment planning CT images were used for each case, with contours imported from DICOM RT structure set files.Furthermore, the dwell positions and their respective dwell times were extracted from DICOM RT plan files.
The pixel size averaged 1.00 mm ± 0.13 mm, varying from 0.64 to 1.36 mm.Slice thickness was either 2.0 mm or 3.0 mm, with a mean of 2.84 mm.The PTV had a mean measurement of 163 cm 3 , ranging from 41 cm 3 to 418 cm 3 .On average, the treatment involved 15 needles (ranging from 3 to 28) and 158 dwell positions (ranging from 34 to 312) per patient.
All patients were treated with the 192 Ir microSelectron-v2 (Elekta Brachytherapy, Veenendaal, Netherlands) source (Perez-Calatayud et al 2012).Consequently, this source model was used in both TG-43-calculated D w,w and MC-calculated D m,m .

Absorbed dose calculation
As mentioned earlier, the dwell positions used in this study were extracted from the clinical treatment plans used for each patient's treatment.For every dwell position, two distinct 3D dose maps were produced: one based on TG-43 (D w,w map) and another based on MC simulations (D m,m map).Each 3D dose map represents the dose deposited by the radioactive seed decay at that particular position.Throughout the manuscript, these 3D dose maps are referred to as dwell position dose maps and were normalized to one second of decay of 192 Ir, with the corresponding activity obtained from each patient's plan file.

Dose to water in water calculation
The analytical 3D D w,w absorbed dose maps were generated using the TG-43 formalism, as implemented by Kalinowski and Enger (2024).The patients' tissue was considered water with unit density for each voxel element.Each dwell position dose map was created on a separate thread on a typical PC with 8 cores and 16 threads.With these 16 threads, it took on average one to two seconds to calculate a single dwell position 3D D w,w absorbed dose map with 1 mm × 1 mm × 1 mm resolution.

Dose to medium in medium calculation
RapidBrachyMCTPS (Glickman et al 2007, Famulari et al 2018), an MC-based treatment planning software, built on the Geant4 simulation toolkit (Allison 2007), was used to generate the 3D D m,m absorbed dose maps for each dwell position.To match the resolution used in the clinic, the volumes based on the CT images were resampled linearly to 1 mm × 1 mm × 1 mm voxels.These resampled CT volumes were then used to create a tissue/material volume, referred to later as the 'patient geometry', and a mass densities volume, referred to later as the 'mass density'.A material and a mass density was assigned to every single voxel of the volume using a Hounsfield Unit to mass density/material curve.The curve used is provided in table A.1 in appendix A. The tissue elemental compositions were obtained from the International Commission on Radiation Units and Measurements-46 (Scott 1993) and are reported in table A.2 in appendix A.
The voxels beyond the boundaries of the patient's body contours were cropped to improve simulation efficiency.RapidBrachyMCTPS utilizes the parallel world concept available in Geant4 for scoring.The absorbed dose per voxel is scored by overlaying a parallel voxelized geometry over the patientʼs volume (Apostolakis et al 2008, Enger et al 2012).Following this, the patient geometry is placed within a 5 m × 5 m × 5 m air box, which remains non-voxelized.While particle transport and interaction inside the air box are considered, the actual scoring of absorbed dose deposition is omitted.Consequently, removing the voxels outside of the patient's body only allows the scoring of the absorbed dose to a voxelized geometry within the patient's body contour.The voxelized geometry with assigned tissues and mass densities per voxel was saved in egsphant format (Walters et al 2005) and provided as input to the simulations, along with the dwell position and source specifications.
For each dwell position, 10 7 192 Ir decay events were parallelized on a computer equipped with 2 AMD Rome 7532 central processing units and 64 cores.The history-by-history method (Walters et al 2002) was used to compute the uncertainty in each of the created dwell position dose maps.In HDR BT, the final treatment dose map is a weighted sum of Gaussian distributions described in equation (1) where w i is the treatment time at dwell position i divided by the total treatment time T, N is the number of dwell positions, and D i is the dose absorbed for 1 s of irradiation at dwell position i.Consequently, the uncertainty of the treatment plan dose map was obtained by computing the standard deviation of a weighted sum of Gaussian distributions (Weisstein 2024, Wu 2004) as described in equation (2) The σ i is the dwell position dose map uncertainty.
The duration of a single dwell position dose map simulation varied between 10 and 15 min, depending on the patient volume.Parameters for MC simulations are summarized in table A.3 in appendix A as recommended by the TG-268 report (Sechopoulos et al 2018).

Data preprocessing 2.4.1. Training data
The D w,w dose map, the patient geometry, and the mass density were used as various inputs, and the D m,m dose map was used as the desired output, i.e. ground truth, for the model.In the same fashion as in Berumen Murillo et al (2023), we investigated if combining the patient body composition with the distance to the dwell positions as inputs could benefit the developed models.Hence, we created an inverse squared distance volume representing the inverse squared distance to the dwell position with 1 mm × 1 mm × 1 mm voxels.This volume is referred to as r 2 -kernel in future sections.
Of the 98 patients included in this study, 70 were randomly assigned to the training set, 14 to the validation set, and 14 to the test set.Details on the diversity of the patient datasets can be found in appendix B. Due to the varying number of catheters/dwell positions used in the treatments of different patients, 25 dwell positions were randomly selected from the treatment plans of each patient to create D w,w and D m,m maps for the training and validation sets.This resulted in 1750 dwell position samples for training and 350 for validation.All the dwell positions used in the treatment of the test set patients were utilized, allowing for treatment plan comparison between the different dose calculation algorithms, resulting in 2060 dwell position samples.

Problem size reduction
All the volumes, being proportional to the patient size, remained quite large even after removing voxels outside the body contour.The size of these volumes was approximately 500 × 250 × 200 voxels.Large volumes fed to a 3D CNN demand extensive computations and memory, resources constrained by the GPU memory and computational power.To reduce volume sizes while keeping the 1 mm × 1 mm × 1 mm resolution, the approach proposed by Antaki et al (2020) was adapted to crop the volumes and only keep parts of the patient volumes needed to compute dosimetric indices as shown in figure 1.Indeed, to compute D xcc only the voxels making the x cm 3 with the highest dose were used.To compensate for the absence of D m,m to compute dosimetric indices at the preprocessing stage, the distance between each volume voxel and the dwell position was computed as the metric for voxel selection.The distances to every dwell position of a patient volume were summed to give the final patient-wise metric volume.Indices based on this new distance metric volume are referred to as M xcc and M x as opposed to the clinical dosimetric indices D xcc and D x which are based on a dose.
To account for differences between M xcc and D xcc , and be able to calculate the OARs' D 2cc (see Results section 3.2), voxels inside the 10 cm 3 that had the lowest values in the distance metric volume, i.e.M 10cc , were kept for each OAR.100% of the PTV was kept.This cropping process was applied to all patient volumes, and minimal volume cropping boundaries were saved.The distribution of volume sizes along the x, y, and z axes of minimal volumes that preserve the same OAR M 10cc information as the complete un-cropped volumes can be found in figure C.1, appendix C.
Based on this analysis, the input volume dimensions for training the DL models were established and set to 160 × 128 × 112 .When minimal volumes were smaller on one axis compared to the defined input shape, the cropping boundaries were evenly extended at both ends of the cropped axis.The same cropping coordinates were used to crop the patient geometry volume, the mass density volume, the TG-43 D w,w maps, MC D m,m maps, and the r 2 -kernel.

Data normalization
The patient geometry was represented as a volume in which each voxel indicated a value corresponding to a tissue type.Values ranging from 0 to 5 representing six tissues (see appendix A) were used.These voxel values were divided by the highest value (5 in this case) to normalize the data between 0 and 1.The mass density volume, containing floating point values representing the mass densities of the different materials (see appendix A), was divided by the maximum mass density in the used CT to density curve to normalize the data between 0 and 1.The dose map values, created using sources with varying radioactivity for different patients, were all scaled to the highest source activity in the training set.Subsequently, potential new architectures were investigated to manage multi-input training, particularly when using two distinct inputs, D w,w and patient body composition, to generate a single output volume (D m,m ).Several fusion techniques exist for combining inputs in DL (Zhou et al 2019).The baseline U-Net model used in this study corresponds to the straightforward approach and performed input-level fusion, where the different inputs were concatenated and directly fed into the convolutional neural network.Consequently, features learned inside the neural network encompass information from both inputs simultaneously rather than being specific to a single input.Thus, a layer-level fusion model named Combining Network (C-Net) was created, where features were autonomously learned for each of the two inputs and merged.

Model design
The proposed architecture is shown in figure 2. Two different CNNs encoded features for the two inputs.The architectures of both CNNs were identical, but weights were not shared, resulting in potentially different learned features for the two inputs.Each CNN consisted of three downsampling blocks, each containing two sequential 3D convolutional layers followed by IN and ReLU activation.
The features generated from each spatial dimension were combined using a fusion operation approach.Two different fusion operations between the dose in water features, x, and the patient body features, y, were considered as f (x, y) = x + y and f (x, y) = x + x * y; hereafter, the resulting models are noted respectively with '+' and ' * ' at the end of the model names.A 3D MaxPooling operation is used to downsample the volumes.The decoder part of our model was a CNN with three upsampling blocks that mirror the downsampling blocks.3D transpose convolutional layers were used for upsampling.The fused feature map with the smallest spatial dimension was used as the input to the CNN decoder, while the rest of the fused feature maps were passed to the CNN decoder through skip connections.To allow for a fair comparison with the proposed U-Net-like model and to have a similar number of parameters between the two models, the number of filters used for each convolutional block in the C-Net model was set to (25, 50, and 100).Architectural parameters are summarized in table 1.
Finally, a deeper version of the U-Net and C-Net implementations was created to evaluate whether increasing the network depth could enhance performance.Models with three and five downsampling blocks are referred to as U-Net-D3 and C-Net-D3, and U-Net-D5 and C-Net-D5, respectively.The number of convolutional filters for the deeper versions was determined so that the models would have comparable training   [12,24,48,96,192] times with their shallower versions.The shallower model variants possessed more filters for convolving larger volumes, maintaining a comparable computational time compared to the deeper counterparts across various models, even with the additional parameters in the deeper versions., .The r 2 -kernel was concatenated with the mass density or the patient geometry input in the second CNN encoder for the C-Net model architectures.In contrast, the three inputs were concatenated in the U-Net model architectures.

Model training
The training objective was to minimize the sum of absolute errors (SAE) between the model prediction  D m m , and MC D m,m .All the models were trained for 1000 epochs.No augmentation was performed during the data loading.The Adam optimizer was used with a learning rate of 0.001 and a polynomial scheduler with a power of 0.9, as in Liu et al (2015), Chen et al (2018), Isensee et al (2021).PyTorch 1.13 was used to define the model architecture, and PyTorch Lightning 1.8.3 was used for the model training.

Model testing
After model training, the performance evaluation was conducted in two ways.Each dwell position dose map created with DL and MC was compared.Alternatively, the 'combined doses', encompassing the total delivered dose prescribed to patients from the entire treatment plan with all utilized dwell positions, were calculated for comparison.The combined doses were computed using equation (1).Combined absorbed dose maps were generated for all the 14 test patients, with MC-simulated D m,m maps, TG-43-created D w,w maps and , DLpredicted  D m m , maps.For the combined DL-generated dose maps,  D m m , maps were padded with zeros to match the patient volume size.No padding was necessary for MC D m,m and TG-43 D w,w combined dose maps.
The per dwell position MC D m,m maps were compared with the per dwell position DL-predicted  D m m , or the per dwell position TG-43 D w,w maps in terms of per voxel mean absolute percent error (MAPE) and mean absolute error (MAE).The MAPE and MAE are presented in equations (3) and (4) , V refers to the number of voxels in the test dataset.The standard deviation of these metrics across all voxels in the dwell position dose maps of the test set was also calculated and is reported in the Results section 3.1.Furthermore, the comparison of combined MC-created D m,m maps with combined TG-43-created D w,w maps or combined DL-predicted  D m m , maps included per voxel MAPE, gamma index, and dosimetric indices MAPE.All these metrics were computed for each patient combined dose map separately and then averaged.The standard deviation of these metrics among test patients was also computed and is reported in the Results section 3.2.

Results
All models presented in the experiments subsection 2.6 were trained.The training took between two to five days on an A100 GPU.The training time of the C-Net architectures was between one to two times longer than for the U-Net architectures.The best MAEs obtained on the validation set during the different training epochs were saved (see appendix E).For each model architecture, the combination of model/inputs yielding optimal performance on the validation set was selected for further analysis on the test set.

Comparison of TG-43 and DL absorbed dwell position dose maps with Monte Carlo
The initial comparison involved evaluating the TG-43 D w,w and DL-predicted dose maps  D m m , against MC D m,m for single dwell positions.These volumes represented cropped patient volume data and without zeropadding.Table 2 presents the per voxel MAE and MAPE across all dwell position dose maps in the test patient set.Notably, the per voxel MAPE between TG-43 and MC exceeded 16%, whereas it consistently remained around 6% for DL solutions.A similar pattern was observed for MAE, with DL solutions demonstrating values below 2E-4 Gy for most cases, while TG-43 exceeded 6E-4 Gy.Remarkably, the U-Net-D5 model, predicting with D w,w and mass density, achieved the lowest MAPE and MAE, signifying superior per-voxel predictive performance on average.

Comparison of TG-43 and DL predicted treatment plans with Monte Carlo
The capability of the TG-43 algorithm and different DL models to produce absorbed dose maps for entire treatment plans, referred to as combined dose maps, was evaluated and compared with MC simulations.To accomplish this, the combined dose maps for all 14 test patients were reconstructed as detailed in the subsection 2.8.
Type A uncertainties of the combined dose maps obtained from MC simulations were computed as described in the subsection 2.3.2 and averaged 0.20% ± 0.03% and 0.52% ± 0.16% for isodose lines 100% and 10% respectively.The mean uncertainties were 0.21% ± 0.03% for the PTV, 1.03% ± 0.37% for the lung closest to the treatment site, 1.09% ± 0.43% for the chest wall, 0.94% ± 0.29% for the heart, and 3.02% ± 2.20% for the skin.However, the mean uncertainties on the cropped volumes (see section 2.4.2) used as ground truth for the DL model were lower: 0.21% ± 0.03% for the PTV, 0.70% ± 0.22% for lung, 0.66% ± 0.25% for the chest wall, 0.72% ± 0.17% for the heart, and 0.59% ± 0.20% for the skin.Figure 3 shows the uncertainty map of the treatment plan dose map of one of the patients used in this study.
Using an A100 GPU, the prediction time for all 2060 dwell position  D m m , maps in the test set remained below 200 s for all DL models.The average prediction time for each dwell position  D m m , map was less than 0.1 s.In contrast, MC-simulated single dwell position D m,m maps typically required 10-15 min.Test patients had an average of 147 dwell positions per patient.This equated to combined dose maps for entire plans predicted by DL in less than 15 s, a stark contrast to the hours required for MC simulations.Figure 3 shows the combined dose map of one of the test patients obtained from dwell position dose maps created with MC simulations, TG-43, and DL predictions.Table 3 presents the MAPE between MC D m,m combined dose maps and TG-43 D w,w combined maps or DL  D m m , combined maps.As DL-predicted combined doses were padded with zeros to match the patient volume and the size of MC simulated dose maps, the MAPEs were computed on maps cropped to non-zero values of DL predictions.Results for full-size maps can be found in table F.1 in appendix F. The per voxel MAPE was around 13% for TG-43 D w,w maps and around 2% for DL  D m m , predictions.The difference in MAPE with MC D m,m combined dose maps between TG-43 and DL was also important when focusing on each specific contoured organ.The largest difference was observed for the skin, for which the mean MAPE between TG-43 and MC combined doses was around 22% whereas it was less than 2% for DL solutions.The U-Net-D5 approach performed overall the best in terms of MAPE for contoured OARs.Gamma index pass rate results are presented in table F.1 in appendix F.
To assess the impact of the DL-based dose prediction on treatment planning, the dosimetric indices obtained with MC simulations, TG-43 and DL were compared.As described earlier, DL-based combined doses were padded with zeros to match the full size of MC and TG-43 dose maps.Discrepancies in dosimetric indices between the MC D m,m maps and TG-43 D w,w maps were presented in table 4. For the PTV, the V 150 MAPE exceeded 20%, and the conformity index MAPE approached 5% in the test patients.Regarding the OAR, differences in dosimetric indices were observed, with variations of up to 5%.The highest error was observed for the lung D 2cc and the skin D 2cc , with 9.69% and 7.51%, respectively.
Table 2. Mean absolute percent error (MAPE) and mean absolute error (MAE) computed on the dwell position dose maps of the test set, comparing dose maps simulated with MC against dose maps computed using the TG-43 formalism or predictions made by DL models.DL approaches are presented as 'model (input 1, input 2)'.The reported measures are in the form of mean ± standard deviation over all voxels in the test set.It is evident that, on average, the predictions of any DL model are much closer to the MC calculated dose than those obtained with the TG-43 formalism.The developed DL models predict the D m,m absorbed dose map for the full treatment in 15 s on average per patient.This will not impact the brachyterapy procedure time that usually takes more than an hour between CT scan acquisition and treatment delivery (Sanders et al 2018, Dutta 2019, Suralik et al 2020).
The presented dose prediction method relied on voxel-by-voxel material and mass density information derived from CT images rather than the CT images themselves.Consequently, it can be readily implemented in various institutions, even if different CT scanners are utilized.The dose prediction method can also be used with MRI images or other imaging modalities, with per contour material and mass density assignment instead of pervoxel assignment.In addition, the DL models were developed for tissues specific to breast cancer patients.The models can only be applied to other treatment sites after additional training on a suitable dataset.
Patient body sizes vary substantially, and our DL predictions were performed on a portion of the patient volume.However, the size of the cropped subvolume was defined to accommodate the computation of dosimetric indices for the largest patients in our set of 98 patients.Therefore, the suggested cropping method was considered generalizable to any new patient.Additionally, as demonstrated in table 4, the cropping method did not affect the treatment planning.Table 4 indicates a significant discrepancy in PTV treatment metrics when calculated with TG-43 compared to MC for the test patients.This observation extends to OARs, with the most substantial variations observed for the lung and skin.This is attributed to the fact that air is the material for which scattering conditions differ the most from water.These findings underscore the critical need for accurate DL solutions.
The highest differences in dosimetric indices MAPE between MC D m,m maps and DL-predicted  D m m , maps were observed for the heart.It might be explained by the fact that the heart, being the farthest OAR, receives a lower dose, resulting in a lesser impact on the loss function during training when compared to other OARs.
Remarkable concordance with MC can be achieved by DL models, regardless of the input used to describe the patient body composition.Mass density volumes and material volumes appeared interchangeable, even though mass density volumes should contain more information due to their creation with the CT to density/ material curve.Moreover, incorporating the r 2 -kernel as an input to the DL models did not enhance their performance, even for the C-Net architecture, where the models were trained to learn features representing the connection between patient body composition and the distance from the dwell positions.It was implied that the distance to the radioactive source was already efficiently captured by the TG-43 dose in water.
Even though excellent dose prediction performance was achieved with different inputs, the TG-43 D w,w map was always included in the inputs, demonstrating that feeding the model with dose to water was efficient.Implicitly, information about the shape of the radioactive source and its orientation to the model was provided.These data were overlooked when feeding the model with a volume representing the dwell position or an r-squared distance volume instead.
The success of our method is partly explained by the choice of DL model inputs.The number of parameters in the model architecture setting also played a significant role in this success, along with using IN layers.
C-Net, a novel DL model architecture tailored for tackling the multi-input dose prediction challenge, was designed.Two effective feature fusion approaches for integrating input-specific features were identified, resulting in outstanding dose prediction accuracy comparable to U-Net-like architectures.However, distinct learning behaviors were exhibited by C-Net and U-Net models.Superior per voxel MAPE results were illustrated for the U-Net-D5 model in table 2, while more robust dosimetric indices MAPE were showcased for the C-Net-D5+ model in table 4. Notably, only the C-Net-D5+ architecture consistently achieved dosimetric indices MAPE below 1.1% for all OARs, including challenging cases such as the heart D 0.1cc .This observation suggested that the C-Net-D5+ architecture predicted lower dose values than the U-Net-D5 architectures.Conducting an external multi-center study with a larger cohort of test patients beyond our 14-patient test set could further validate these findings.
It was also empirically observed that a C-Net with the multiplication fusion method was unstable when training with more than two inputs and could sometimes lead to overfitting on the training set.This behavior had not been observed with only two inputs.Additionally, the non-sequential aspect of the architecture led to longer training time than for the U-Net-like models.Conversely, inference time was similar, allowing for clinical implementation.
Results presented in tables 3 and 4 indicated a modest performance improvement when deeper models were used, especially for predicting dose to the heart, which is situated farther from the dwell positions.This validated the utilization of deeper models.
In this study, only one instance of each model was trained to save computational time; training several model instances for each model in a cross-validation manner would have allowed for a better statistical comparison between the performance of each model.

Conclusion
A pipeline was developed to predict high-resolution dose maps in patient geometry within an acceptable timeframe for clinical use and with the same accuracy as MC simulations.The pipeline introduced methods to efficiently crop the input data, providing the DL models with inputs of appropriate size for training.A novel model architecture was trained to fuse information from the water dose map and the patient geometry.This approach yielded state-of-the-art results for DL dose prediction in HDR BT.Future work will involve generalizing the method to new materials and treatment sites and integrating the software into the RapidBrachyMCTPS treatment planning system.
A 3D U-Net architecture(Ronneberger et al 2015(Ronneberger et al  , Çiçek et al 2016) )  was used as the baseline model, which reflects the current state-of-the-art in radiotherapy dose prediction(Javaid et al 2019, Mao et al 2020, Bai et al 2021, Tsekas et al 2021).Since large volumes were used as input for the model, the batch size was set to two, allowing for more parameters in the neural network while utilizing the same amount of GPU memory compared to larger batch sizes.To effectively manage learning with this reduced batch size, Batch Normalization (BN) layers (Ioffe and Szegedy 2015) of the traditional architecture were replaced with Instance Normalization (IN) layers (Ulyanov et al 2017).These normalization layers were incorporated between the convolutional layer and the Rectified Linear Unit (ReLU).The number of filters for convolutional layers was determined by considering the balance between performance and computational time.The model architecture is illustrated in figure D.1, in appendix D.

Figure 1 .
Figure 1.Proposed strategy to define cropping boundaries.The importance metric is the sum of the distances to the different seed positions of the treatment.For each patient, a subvolume keeping voxels that represent the 10 cm 3 with the minimal importance metric values (M 10cc ) for each OAR is defined.

Figure 2 .
Figure 2. Proposed C-Net-D3 architecture.In this network, features were learned separately for the two different inputs via two different CNNs.The features generated from each spatial dimension were combined using a fusion operation approach.The fusion operations can be f (x, y) = x + y (referred to as '+') and f (x, y) = x + x * y (referred to as ' * ').The fused feature map with the smallest spatial dimension was used as the input to the CNN decoder, while the rest of the fused feature maps were passed to the CNN decoder through skip connections.The various components of the model include Conv3D (3D Convolution), ReLu (Rectified Linear Unit), MaxPooling3D (3D Max Pooling), IN (Instance Normalization), Copy operator, Deconv3D (3D Deconvolution), and element-wise fusion.
Each of the six model architectures underwent distinct training with numerous inputs.Initially, the models were trained to forecast the D m,m dose maps based on inputs D w,w and patient geometry ( Following this, predictions with the mass density were made, expressed as ( ) = f D , mass density w w , Dm m , .Finally, for both situations, the distance to the dwell position was combined with the patient body composition by adding the r 2 -kernel as input, as in (

Table 1 .
Comparison of architectures between the different models.

Table 3 .
Mean absolute percent error (MAPE) computed for each contoured structure between combined dose maps (sum of single dwell position dose maps weighted by their dwell times) obtained with MC-simulated single dwell position dose maps in patient geometry and TG-43-computed single dwell position dose maps in water or DL-predicted single dwell position dose maps in patient geometry.The left column presents DL models as 'model (input 1, input 2)'.The reported measures are in the form of mean ± standard deviation over all voxels in the test set.MAPE between the MC and DL is always below 4% in any contoured structures.

Table 4 .
Mean absolute percent error between dosimetric indices obtained from combined dose maps simulated with MC and computed with TG-43 or predicted with DL.V x represents the volume receiving at least x% of the prescribed dose and D xcc the maximum dose received in x cm 3 of the volume.CI is the conformity index.The reported measures are in the form of mean ± standard deviation calculated for the patients in the test.The left column presents DL models as 'model(input 1, input 2)'.We can see that the proposed DL solutions agreement in dosimetric indices is almost always within 1%.
composition and mass densities encountered in breast cancer cases, if our approach can accurately predict the D m,m absorbed dose maps for such cases, it is expected to perform equally well, if not better, for other treatment sites.