Improving proton dose calculation accuracy by using deep learning

Abstract Pencil beam (PB) dose calculation is fast but inaccurate due to the approximations when dealing with inhomogeneities. Monte Carlo (MC) dose calculation is the most accurate method but it is time consuming. The aim of this study was to develop a deep learning model that can boost the accuracy of PB dose calculation to the level of MC dose by converting PB dose to MC dose for different tumor sites. The proposed model uses the PB dose and computed tomography image as inputs to generate the MC dose. We used 290 patients (90 head and neck, 93 liver, 75 prostate and 32 lung) to train, validate, and test the model. For each tumor site, we performed four numerical experiments to explore various combinations of training datasets. Training the model on data from all tumor sites together and using the dose distribution of each individual beam as input yielded the best performance for all four tumor sites. The average gamma passing rate (1 mm/1%) between the converted and the MC dose was 92.8%, 92.7%, 89.7% and 99.6% for head and neck, liver, lung, and prostate test patients, respectively. The average dose conversion time for a single field was less than 4 s. The trained model can be adapted to new datasets through transfer learning. Our deep learning-based approach can quickly boost the accuracy of PB dose to that of MC dose. The developed model can be added to the clinical workflow of proton treatment planning to improve dose calculation accuracy.


Introduction
Proton therapy has attracted increasing attention in recent years.The main rationale for using proton therapy is the physical characteristics of the depth dose curve, which has a dose peak (Bragg peak) at a well-defined depth in tissue [1].This physical advantage allows proton therapy to achieve higher dose conformity in the tumor volume with a lower dose to the surrounding healthy tissue than conventional radiation therapy.Nonetheless, uncertainties in dose calculation have a bigger impact on the desired dose distributions in proton therapy, so accurate dose calculation is essential for the success of a proton therapy treatment [2].Currently, many pencil beam (PB)-based dose calculation algorithms, based on the works of Hong et al. [3] and Schaffner et al. [4], are widely used in clinical practice.These algorithms provide fast computation but come at the expense of lower dose calculation accuracy in the presence of tissue heterogeneity [5].This is mainly because these algorithms adopt approximations that disregard lateral inhomogeneities to achieve fast dose calculations [6,7].In PB dose calculation, the proton dose is computed by using the water equivalent path length along the central path of a pencil beam, and the medium on the central axis is assumed to be laterally infinite and homogeneous [6].These approximations lead to inaccurate modeling of the multiple Coulomb scattering (MCS) process and, therefore, cause both dose distortion and range uncertainties, especially in the presence of complex geometries and heterogeneous environments [8].Moreover, PB algorithms' approximations in modeling elastic and inelastic nuclear interactions can also lead to considerable errors in dose calculation, even in homogeneous geometries [6,7].
Monte Carlo (MC) methods are recognized as the gold standard for dose calculation because they can simulate particle propagation through materials by randomly sampling the cross-section of interactions [7][8][9][10].Several phantom studies comparing PB dose calculations and MC algorithms have demonstrated that MC algorithms can provide more accurate dose distributions than PB algorithms [11][12][13][14].For example, in a multiinstitution phantom study, Taylor et al reported that the dose distribution obtained with the MC algorithm matched the phantom dose measurement more closely than the dose distribution obtained with the PB algorithm [11].Besides phantom studies, PB dose accuracy deficiencies have also been investigated in several studies comparing PB dose with MC dose in different tumor sites [8,9,15,16].Schuemann et al demonstrated that current PB dose calculation algorithms can cause underdosage to the target by as much as 5%, which can result in differences in tumor control probability of up to 11%.For complex geometries (head and neck cancer and lung cancer) and very deep-seated targets (prostate cancer), MC simulations should be considered instead of PB dose calculations [8].Moreover, proton therapy can be very sensitive to patient anatomical changes that may distort the planned dose distribution and deteriorate the treatment quality.A recent study has also shown that online adaptation with MC dose calculation can further improve treatment quality for inter-fractional patient geometry changes [17].Despite all these advantages, MC dose calculation methods have not been widely used in clinical routine because of their computational burden and implementation complexity [7].
Over the years, the research community has devoted significant efforts to accelerating MC dose calculation for proton therapy [18].Recently, parallel computing techniques based on graphics processing units (GPU) have been employed for this application.Different groups have achieved considerable acceleration factors over conventional CPU-based computations [18][19][20][21][22][23][24][25].To date, the computation time of the fastest reported MC dose engine is in the minute range.Furthermore, some commercial treatment planning systems (e.g., RayStation and Eclipse) have already included the MC dose calculation feature for proton therapy.
Recently, deep convolutional neural networks have been successfully used to predict patient-specific dose distributions from anatomical information [26][27][28][29][30][31][32][33].Although these works tried to learn the relationship between patient anatomy and the optimal dose distribution, they did inspire us to use deep learning methods to learn the relationship between tissue inhomogeneities and the differences between PB and MC dose distributions.
In this paper, we present a novel approach that uses deep learning techniques to achieve MC dose calculation accuracy with PB dose calculation efficiency for proton therapy.Specifically, we developed a deep learning model that can precisely and efficiently convert a 3D PB dose distribution to a dose distribution with MC-equivalent accuracy for different tumor sites.The model architecture and training details are presented in Section 2. In Section 3, we present the results of the proposed model.We discuss some future work in Section 4 and draw conclusions in Section 5.

Model architecture
The model developed in this work is based on the 3D HD U-Net that was developed and tested for voxelwise 3D dose prediction for patients with head and neck cancer [28].HD U-Net combines the essence of two influential neural networks: U-Net [34] and DenseNet [35].In general, HD U-net's architecture combines U-net's ability to abstract both local and global features from input images with DenseNet's efficient feature propagation and reuse, while maintaining a reasonable memory usage [28,32].Figure 1 shows the architecture of the HD Unet model modified for this work.The details of the three operations between layers (Dense Convolve, Dense Downsample and U-net Upsample) have been previously introduced elsewhere [28].
The proposed model contains two input channels: one for the 3D proton PB dose distribution and the other for the corresponding CT image.The model has 5 max pooling and 5 upsampling operations, which decrease the image patch size from 128 x 128 x 16 voxels to 8 x 8 x 1 voxels, then increase it back to 128x 128 x 16 voxels.The convolutional kernel size is set to 3 x 3 x 3, and the max pooling size is set to 2 x 2 x 1. Batch normalization is added after the convolution and before the rectified linear unit operations.The dropout rate is set to 0 because no overfitting issue has been observed during training.

Patient database
The patient database used in this work consists of 90 head and neck, 93 liver, 75 prostate and 32 lung patients treated the double scattering method at the Massachusetts General Hospital (MGH) proton therapy center.For each tumor site, about 20% of the patients were randomly selected as the testing set, and the rest were used for training and validation.The specific numbers of test patients and training & validation patients for each tumor site are presented in Table 1.It should be noted that, some treatment plans were only used for research use or as alternate plan options or beam arrangements and the corresponding treatment fields were not actually delivered to patients.For each patient, the proton PB dose was calculated using an algorithm implemented on the XiO treatment planning system (by Computerized Medical Systems Inc, now by ELEKTA), and the MC dose was obtained by using TOPAS version 3.0.1 [36], which is based on Geant4.10.3 [37].All the dose data and their corresponding CT images were resampled to have a voxel resolution of 2 x 2 x 2.5 mm 3 .

Training Experiment Design
The patient data of the four tumor sites used in this work have different beam configurations, so the model must be able to maintain high performance across different beam settings.To address this issue, for each patient, we rotated the dose distribution of each beam and the CT image to the same default angle and then used them as inputs for the model.As the model output, the converted dose distribution of each beam is rotated back and added to the dose distributions of other beams to obtain the total converted dose.In this work, the default angle is set to a 270° gantry angle and a 0° couch angle.We believe this method allows the model to better learn the upstream and downstream characteristics when the proton beam passes through the patient's body and deposits energy along the path, and therefore helps the model to better learn the accurate mapping between the PB dose and the MC dose in relation to tissue inhomogeneities.We also implemented and tested an alternative method that directly uses the composite dose distribution as the model input and output.In this article, we refer to the method that uses the beam dose as the model input and output as Method 1 and the method that uses the composite dose as Method 2. The schematic flow chart of the two methods is shown in Figure 2.
Because the prescription dose varies from patient to patient and from tumor site to tumor site, all the dose and CT images are normalized before being input into the model.For both methods, the CT voxel values are normalized to be between 0 and 1, and the dose distribution voxel values are normalized by dividing the 95th percentile dose value of all voxels receiving dose values greater than 5% of the maximum PB dose.On the output side of the model, the converted beam or composite dose is multiplied by the same normalization factor.To develop a general model that can convert PB dose distribution to MC dose distribution for each tumor site, in addition to using site-specific data for training, we implemented joint training that used data from all sites for each method.For each tumor site, four different numerical experiments were carried out to investigate which model had the best performance.Take head and neck patients as an example: for Experiment 1, the beam dose distributions of head and neck patients were used as the model input, and for Experiment 2, the beam dose distributions of all four tumor sites were used as the model input.For Experiment 3, the composite dose distributions of head and neck patients were used as the model input, and for Experiment 4, the composite dose distributions of all four tumor sites were used as the model input.The details of the four experiments are summarized in Figure 3.In total, 10 models were trained and evaluated (two general models trained with all-sites data and eight models trained with site-specific data).For each tumor site, the models of all four experiments were tested on the same test dataset.

Model Training details
The mean squared error (MSE) between the converted dose distribution and the MC dose distribution was used as the loss function for training each model.The learning rate of each model was adjusted to minimize the validation loss as a function of epochs.During each training iteration, a patch of size 128×128×16 was randomly selected from the patient volume.This patch-wise training method, similar to data augmentation, can reduce overfitting [28].The Adam algorithm was selected as the optimizer to minimize the loss function.All the deep learning models were built and implemented in Keras with Tensorflow [38] as the back end.Each model was trained for 200 epochs on one NVIDIA Tesla V100 GPU card with 32 GB RAM.

Gamma index and MSE results
To assess the accuracy of the converted dose, we computed the MSE between the converted dose and the MC dose above a threshold of 10% of the maximum MC dose in the test dataset.We also computed the 3D gamma index (γ), which can evaluate the dosimetric accuracy of voxels by combining the distance difference and dose difference metrics [39].The gamma index with 1%/1mm criteria and the MSE results of all experiments for head and neck and liver test patients are presented in Table 2; the corresponding results for lung and prostate test patients are presented in Table 3.Both tables include the PB dose metrics for comparison.For every tumor site, the gamma index and MSE results are substantially better for all the experiments than for the same metrics of the PB dose.Method 1 (Experiments 1 and 2), which uses the beam dose as the model input, clearly outperformed Method 2 (Experiments 3 and 4), which directly uses the composite dose as the model input.Experiment 2, which used the beam dose distribution data from all sites as the model input, clearly outperformed Experiment 1, which used the beam dose distributions of site-specific data as the model input, except in the case of prostate test patients.In general, Experiment 2, which used the beam dose distribution data from all sites as the model input, had the best performance across all four types of test tumor sites.The average gamma index between the converted and the MC dose distributions for head and neck, liver, lung and prostate test patients was 92.8%, 92.7%, 89.7% and 99.6%, respectively.To illustrate the performance of the developed model, we compared the converted and the MC dose distributions for one example head and neck and prostate test patient.Figures 4 and 6 show dose color washes for one example patient from the head and neck and prostate test sets, respectively.In each of these two figures, one axial slice of the PB dose distribution, the converted dose distribution, the absolute differences between the PB and MC dose distribution and the absolute differences between the converted and MC dose distribution are illustrated in sub-figures a, b, c and d, respectively.Figures 5 and 7 show the dose volume histograms (DVHs)curves of the same example head and neck and prostate test patient, respectively.Upon visual inspection of the dose color washes, the converted dose distributions are almost identical to the MC dose distributions for each example patient.The DVH curves of the converted dose are found to be very close to those of the MC dose for both targets and OARs, but an obvious gap can be observed between the DVH curves of PB and MC doses.

Dosimetry analysis
The results of the other test patients and other tumor sites are not presented here because of space limits in the manuscript, but the behaviors are similar for all of them.
We compared the voxel-based dose difference between the PB dose and the MC dose and between the converted dose and the MC dose for all test patients.If any voxel in the PB or the converted dose distribution received more than 3% of the maximum MC dose and its absolute dose difference with the MC dose distribution was larger than 3% of the maximum MC dose, then that voxel was used for evaluation.Dose difference histograms for 18 head and neck, 18 liver, 6 lung and 14 prostate test patients are shown in Figure 8.We can clearly see that, for each type of test patient, the number of voxels with dose differences larger than 3% of the maximum MC dose is substantially lower for the converted dose than for the PB dose.

Efficiency analysis
Table 4 shows the average time required for MC simulation, PB dose calculation and dose conversion of a single field for each tumor site.The MC simulations were implemented in TOPAS, and the PB doses were calculated using an algorithm implemented on the XiO system, both calculations were performed at the MGH proton therapy center.The conversion processes were executed on a NVIDIA Tesla V100 card with 16 GB dedicated RAM.The total time to obtain the converted MC dose is roughly the sum of the PB dose calculation time and the conversion time.Compared with the MC simulation time, the total time to obtain the converted MC dose is substantially shorter.

Generalizability analysis
Experiment 2, which used the beam dose distribution data from all sites as the model input, generally gave the best performance.To test the generalizability of this model, we designed and implemented Experiment 5.In this experiment, for a tumor site A, first, we implemented joint training using the beam dose distributions of the other three tumor sites B, C and D as the model input.The model was trained for 200 epochs, and the learning rate was adjusted to minimize the validation loss as a function of epochs.Then, the trained model was fine-tuned by using the beam dose of tumor site A for 100 epochs.The learning rate for fine-tuning was set to 1E-4.After joint training and fine-tuning, the model was tested on the same test dataset of tumor site A as the other experiments.The gamma index and MSE results of Experiments 2 and 5 for the four types of test patients are presented in Table 5, both using MC results as reference.In general, Experiments 2 and 5 had similar performance.The models trained in Experiment 5 showed good performance for each tumor site, with the average gamma index between the converted dose and the MC dose above 90%.These results show that the models that use the beam dose distributions of three tumor sites as input can be easily adapted to a new tumor site, which suggests that joint training using beam dose distributions of multiple tumor sites as the model input has good generalizability.This supports our assumption that the model that uses the beam dose distributions of all four tumor sites as input in Experiment 2 can be adapted to new datasets in clinical practice.To further test the generalizability of the model trained in Experiment 2 and to see whether it can be adapted to a new dataset from another hospital, we designed and implemented Experiment 6.The new dataset we used in Experiment 6 includes 27 patients with non-small cell lung cancer from a hospital other than MGH.We randomly selected 22 of these patients as the training set to fine-tune the model and used the remaining 5 patients as the testing set.For each patient, we used RayStation 9A to develop intensity modulated proton therapy (IMPT) treatment plans, and to calculate the PB and MC dose distributions.Method 1, which uses beam dose as the model input, was applied, and the beam dose distributions of the patients in the training set were used to fine-tune the model trained in Experiment 2. We set the total number of epochs to 100 and the learning rate to 1E-4.After finetuning, we evaluated the model on the testing set.For the patients in the testing set, the mean gamma index (1%/1 mm) between PB and MC dose distributions was 75.5% ± 8.9% (standard deviation).We found that the converted dose distributions showed substantial improvement over the PB dose distributions and had strong agreement with the MC dose distributions.The mean gamma index between the converted dose and MC dose (1%/1 mm) was 91.8% ± 4.8% (standard deviation).This result proves that the model trained in Experiment 2 can be adapted to a new dataset from different hospitals through simple fine-tuning.

Discussion
In this work, we present a novel method that uses deep learning methods to convert PB dose distributions to MC dose distributions for different tumor sites.We performed four different experiments to compare different combinations of training datasets, and we found that joint training using the beam dose distributions data from all sites as the model input had the best performance.The converted dose distributions showed substantial improvement over the PB dose distributions in all evaluation criteria that we considered: gamma index, MSE, DVH and dose difference histogram.The developed model can learn from a heterogeneous database that includes beam dose distributions from four tumor sites with different beam configurations and maintain high performance for each tumor site.Using beam dose distributions as the model input (Experiments 1 and 2) yielded substantially better model performance than directly using composite dose distributions as the model input (Experiments 3 and 4).These results suggest that rotating the normalized beam dose and the corresponding CT to the same angle and using them as the model input helps the model to better learn the difference between the PB and the MC dose distributions in relation to tissue heterogeneity and thus improves the model's performance.We also noticed that the converted dose of prostate test patients had the best performance, and the converted dose of lung test patients had the worst, in terms of gamma index and MSE results.One possible reason is that all the prostate patient data used in this work have similar beam settings.The beam angles for prostate patients are set to be either 90° or 270° gantry angle, which we believe makes the prostate patients' data size needed for the model to achieve high performance smaller than other tumor sites.The gamma index and MSE results of Experiments 1 and 2 (see Table 3) show that, for prostate patients, adding the beam dose of other tumor sites for training did not improve the model's performance.This indicates that, in this work, the prostate patient data size is big enough for the model to learn the mapping between the PB and the MC dose distributions.For the other three treatment sitesespecially for lung patients which have the smallest data size, adding the training data of the other tumor sites improved the model's performance.Thus, joint training using beam dose distributions from all the tumor sites as the model input can improve the model's performance when the beam configurations are heterogeneous and the data size is relatively small.
The average time needed to convert a single field for all four types of patients in this work is less than 4 seconds.To date, the fastest proton PB dose calculations are in the sub-second range.Da Silva et al achieved PB dose calculations with a double Gaussian kernel in 0.22 s [40].Because the number of fields used in proton therapy is usually no more than 4, the total time required to obtain the converted dose can potentially be kept within half a minute.This will allow the implementation of online adaption of MC calculations to further improve the treatment quality for inter-fractional geometry changes [17].We have not yet dedicated any efforts to improving the model's efficiency, which can be easily achieved through methods like model compression; we will explore such methods in our future work.In addition, by accurately and efficiently generating converted MC doses, the proposed model could be used as a plan evaluation or even a decision support tool for physicians.
We have also demonstrated generalizability of the model.In Experiment 5, we showed that the trained model can be easily used for another tumor site through transfer learning.With Experiment 6, we also showed that the trained model can be deployed to different hospitals through transfer learning, even when using a different treatment plan delivery, i.e.IMPT vs. double scattering The patient database used in this work comprises 90 head and neck, 93 liver, 75 prostate and 32 lung patients treated with double scattering methods at MGH proton center.Currently, IMPT which uses scanned proton pencils to shape the delivered dose distributions, is adopted by most new proton therapy facilities.We plan to extend this work by applying the developed model to IMPT plans of different tumor sites and testing the performance.

Figure 1 .
Figure 1.Architecture of the HD U-Net.Black numbers on the left side of the model represent the volume shape and resolution at a specific hierarchy.Blue features represent the newly calculated features and trainable parameters to learn.Yellow features are copied or max pooled features that do not need trainable parameters.

Figure 2 .
Figure 2. Flow chart of the two training methods adopted in this work.

Figure 3 .
Figure 3. Schematic overview of the four experiments for each tumor site.

Figure 4 .
Figure 4. Dose color wash of an axial slice close to the center of the target volume for one example head and neck test patient.(a) PB dose distribution; (b)Converted dose distribution; (c) Absolute dose difference between the PB dose distribution and the MC dose distribution (PB-MC), and (d) Absolute dose difference between the Converted dose distribution and the MC dose distribution (Converted-MC).

Figure 5 .
Figure 5. DVH plots of the MC dose distribution(solid), the PB dose distribution (dashed), and the converted dose distribution (dotted) for one example head and neck test patient.

Figure 6 .
Figure 6.Dose color wash of an axial slice close to the center of the target volume for one example prostate test patient.(a) PB dose distribution; (b)Converted dose distribution; (c) Absolute dose difference between the PB dose distribution and the MC dose distribution (PB-MC), and (d) Absolute dose difference between the Converted dose distribution and the MC dose distribution (Converted-MC).

Figure 7 .
Figure 7. DVH plots of the MC dose distribution(solid), the PB dose distribution (dashed), and the converted dose distribution (dotted) for one example prostate test patient.

Figure 8 .
Figure 8. Dose difference histograms for all test patients.(a) Head and neck test patients; (b) liver test patients; (c) lung test patients, and (d) prostate test patients.

Table 1 .
Beam numbers and distribution details of the testing set and the training & validation set for each tumor site.

Table 2 .
Gamma index (1%/1 mm) and MSE results of all the experiments for head and neck and liver test patients (above a threshold of 10% of the maximum MC dose).

Table 3 .
Gamma index (1%/1 mm) and MSE results of all the experiments for lung and prostate test patients (above a threshold of 10% of the maximum MC dose).

Table 4 .
Average single-field MC simulation, PB dose calculation, and dose conversion time results for all four tumor sites used in this work (all in seconds).

Table 5 .
Gamma index (1%/1 mm) and MSE results of Experiments 2 and 5 for all test patients (above a threshold of 10% of the maximum MC dose).