Robustness analysis of CTV and OAR dose in clinical PBS-PT of neuro-oncological tumors: prescription-dose calibration and inter-patient variation with the Dutch proton robustness evaluation protocol

Objective. The Dutch proton robustness evaluation protocol prescribes the dose of the clinical target volume (CTV) to the voxel-wise minimum (VWmin) dose of 28 scenarios. This results in a consistent but conservative near-minimum CTV dose (D98%,CTV). In this study, we analyzed (i) the correlation between VWmin/voxel-wise maximum (VWmax) metrics and actually delivered dose to the CTV and organs at risk (OARs) under the impact of treatment errors, and (ii) the performance of the protocol before and after its calibration with adequate prescription-dose levels. Approach. Twenty-one neuro-oncological patients were included. Polynomial chaos expansion was applied to perform a probabilistic robustness evaluation using 100,000 complete fractionated treatments per patient. Patient-specific scenario distributions of clinically relevant dosimetric parameters for the CTV and OARs were determined and compared to clinical VWmin and VWmax dose metrics for different scenario subsets used in the robustness evaluation protocol. Main results. The inclusion of more geometrical scenarios leads to a significant increase of the conservativism of the protocol in terms of clinical VWmin and VWmax values for the CTV and OARs. The protocol could be calibrated using VWmin dose evaluation levels of 93.0%–92.3%, depending on the scenario subset selected. Despite this calibration of the protocol, robustness recipes for proton therapy showed remaining differences and an increased sensitivity to geometrical random errors compared to photon-based margin recipes. Significance. The Dutch proton robustness evaluation protocol, combined with the photon-based margin recipe, could be calibrated with a VWmin evaluation dose level of 92.5%. However, it shows limitations in predicting robustness in dose, especially for the near-maximum dose metrics to OARs. Consistent robustness recipes could improve proton treatment planning to calibrate residual differences from photon-based assumptions.


Introduction
Proton therapy (PT) with pencil-beam scanning (PBS) allows us to achieve better dose conformity to the clinical target volume (CTV) compared to conventional radiotherapy (RT) and PT with passive scattering (Bortfeld et al 2005, Kosaki et al 2012, Langen and Zhu 2018, Florijn et al 2020). However, the distribution of pencil-beam Bragg peaks with modulated intensities is very sensitive to errors in beam and patient alignment (setup or geometrical error), variations in anatomy and uncertainties in the proton stopping-power prediction (SPP or range error) (Stroom et al 1999, van Herk et al 2000, Lomax 2008a, 2008b). These may compromise both organ-at-risk (OAR) sparing and CTV coverage, while conventional expansion margins are not well-suited to mitigate their impact (van Herk et al 2004, Unkelbach et al 2018). To this end, scenario-based robust minimax optimization (Fredriksson et al 2011, Unkelbach et al 2007) and robustness evaluation (Henríquez and Castrillón 2008, Korevaar et al 2019, Buti et al 2020, Hernandez et al 2020, Teoh et al 2020, Sterpin et al 2021, Rojo-Santiago et al 2021a) are widely used in PBS-PT nowadays. Both the optimization and the evaluation are based on a sample set of (combined) geometrical and SPP (range) error scenarios, replacing planning target volume (PTV) margins (Liu et al 2013a, 2013b, van Dijk et al 2016). In the Netherlands, a national proton robustness evaluation protocol has been established following the Dutch Proton Therapy (DUPROTON) group guidelines (Korevaar et al 2019). A voxel-wise minimum (VWmin) dose level is prescribed to the CTV, while near-maximum doses to the CTV and serial OARs are assessed on a voxel-wise maximum (VWmax) dose distribution. This protocol was defined in order to establish a robustness evaluation for PT consistent with PTV-based photon plan evaluation metrics. Although it has been in use in all three Dutch proton therapy centers since 2018, it has some known limitations.
In a recent paper (Rojo-Santiago et al 2021a), a probabilistic robustness evaluation using polynomial chaos expansion (PCE) for a cohort of neuro-oncological patients was performed. It was found that the DUPROTON robustness evaluation protocol, which uses 28 error scenarios, is safe but conservative in terms of dose delivered to the CTV. These results indicate the following: (i) A dosimetric calibration of the robustness evaluation protocol is required. The conservativism of the protocol can partially be explained by the construction of the VWmin dose as a composite of extreme scenario voxel doses. The fact that setup robustness settings are often derived from photon-based margin recipes, while the underlying assumption of the static dose cloud approximation does not hold for PT, may also play a role.
(ii) The consistency of the protocol needs to be assessed. Neither the degree of inter-patient variation in the protocol, nor how that depends on the number and (sub)set of scenarios used for the evaluation is known.
(iii) It is unknown how clinical VWmax dose metrics correlate with delivered dose to serial OARs. In the DUPROTON consensus paper (Korevaar et al 2019), it was found that the clinically used VWmax-D 2%,CTV to the CTV is conservative by 2.3 percentage points (p.p.), but no data are available for serial OARs.
To address the abovementioned points, we systematically and quantitatively investigated the robustness of dose to the CTV and serial OARs for a cohort of 21 clinically robust neuro-oncological treatment plans. The impact of geometrical and range errors was modeled with PCE. PCE was applied to perform a robustness evaluation with 100,000 complete fractionated treatments per plan, which naturally results in proper statistical weighting of the scenarios. Treatment courses were sampled from error distributions consistent with van Herk's photon-based margin recipe (van Herk et al 2000), which is also used as the basis of the Dutch proton robustness evaluation protocol. We analyzed (i) how the clinically used near-minimum (D 98% ) VWmin and near-maximum (D 2% and D 0.03cc ) VWmax metrics correlate with the corresponding evaluation metrics in delivered dose to the CTV and serial OARs; (ii) how the Dutch proton robustness evaluation protocol can be calibrated in terms of dose for different scenario subsets (Korevaar et al 2019) and what degree of inter-patient variation remains; and (iii) how a probabilistically derived robustness recipe, consistent with the requirements van Herk used in his derivation (van Herk et al 2000) and derived from this clinical cohort with the clinical treatment planning software (TPS), differs from the photon-based margin recipe before and after calibration of the protocol.

Patient data and treatment planning
The first 21 neuro-oncological patients treated at our center for meningioma, grade-I glioma, grade II-III oligodendroglioma with 1p/19q co-deletion and grade-II astrocytoma with isocitrate dehydrogenase (IDH) mutation, and robustly planned according to clinical protocol, were analyzed (van der Weide et al 2021). The prescribed doses (D pres ) were 45 Gy(RBE) (1 case), 50.4 Gy(RBE) (15 cases), 54 Gy(RBE) (2 cases) and 59.4 Gy(RBE) (3 cases) in 1.8 Gy(RBE) fractions, prescribed to the VWmin dose of 28 evaluation scenarios (see figure 1). Planning goals for the CTV were specified on the VWmin near-minimum dose (VWmin-D 98%,CTV ≥ 95% D pres ) and on the VWmax near-maximum CTV dose (VWmax-D 2%,CTV ≤ 107% D pres ) (ICRU 1993, 1999). Furthermore, planning constraints on the VWmax-D 0.03cc,OARs and on the VWmax-D mean,OARs for the relevant serial OARs (Eekers et al 2018, Weide et al 2020) were also included. A constant relative biological effectiveness (RBE) of 1.1 was assumed. For more details, we refer to (Rojo-Santiago et al 2021a). All treatment plans were made using the RayStation TPS (version 7, RaySearch Labs, Sweden), with patient-specific non-coplanar arrangements of two or three beam directions. They were made using minimax robust optimization (Fredriksson et al 2011, Unkelbach et al 2007) and evaluated with VWmin and VWmax dose distributions of 28 evaluation scenarios (Korevaar et al 2019). An isotropic setup robustness (SR) setting of 3 mm was used to account for geometrical errors. Based on errors in the conversion of the CT number to proton stopping-power ratio from the literature (Lomax 2008a, van der Voort et al 2016), a relative range robustness (RR) setting of 3% was used, i.e. uncertainties of ±3% were taken into account.
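The voxel-wise composites on which these planning goals are evaluated can be illustrated with a minimal sketch. It is not the clinical TPS implementation: the scenario doses are synthetic random numbers, and the helper names `dose_at_volume` and `voxelwise_composites` are hypothetical. It only shows why a composite built from per-voxel extremes is at least as pessimistic as the worst single scenario.

```python
import math
import random


def dose_at_volume(voxel_doses, fraction):
    """Dose received by at least `fraction` of the voxels (e.g. 0.98 -> D98%)."""
    ordered = sorted(voxel_doses, reverse=True)
    k = max(1, math.ceil(fraction * len(ordered)))
    return ordered[k - 1]


def voxelwise_composites(scenario_doses):
    """Per-voxel minimum and maximum over all scenarios (VWmin, VWmax)."""
    vw_min = [min(voxel) for voxel in zip(*scenario_doses)]
    vw_max = [max(voxel) for voxel in zip(*scenario_doses)]
    return vw_min, vw_max


# Synthetic stand-in data: 28 scenarios, 500 CTV voxels, dose in % of D_pres.
rng = random.Random(0)
n_scenarios, n_voxels = 28, 500
scenario_doses = [
    [rng.gauss(100.0, 2.0) for _ in range(n_voxels)] for _ in range(n_scenarios)
]

vw_min, vw_max = voxelwise_composites(scenario_doses)

vwmin_d98 = dose_at_volume(vw_min, 0.98)
worst_scenario_d98 = min(dose_at_volume(d, 0.98) for d in scenario_doses)
vwmax_d2 = dose_at_volume(vw_max, 0.02)
worst_scenario_d2 = max(dose_at_volume(d, 0.02) for d in scenario_doses)

print(f"VWmin-D98% = {vwmin_d98:.1f}, lowest scenario D98% = {worst_scenario_d98:.1f}")
print(f"VWmax-D2%  = {vwmax_d2:.1f}, highest scenario D2% = {worst_scenario_d2:.1f}")
```

Because every VWmin voxel dose is at most the dose of that voxel in any scenario, VWmin-D 98% can never exceed the lowest per-scenario D 98% ; the mirror argument holds for VWmax-D 2% .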

Scenario subsets with the DUPROTON protocol
With the Dutch proton robustness evaluation protocol (DUPROTON protocol), voxel-wise dose distributions from 28 evaluation scenarios are generated to assess clinical planning goals. As maximum and minimum voxel dose levels of all scenarios are considered in this approach, different scenario selections will result in different VWmin/max dose distributions. To analyze this dependence, different sets of geometrical (seven geometrical strategies S N , see figure 1) and range error scenarios (two range strategies R N ), all within the framework of the DUPROTON protocol were combined (scenario subsets S N ⊗R N ) (Korevaar et al 2019). As depicted in figure 1, 14 subsets of a total of 81 error scenarios were defined. The S N geometrical error scenarios were selected as the normalized vectors, to the clinical SR setting used, pointing towards the faces (Fs), vertices (Vs) and edges (Es) of a cube. This resulted in seven geometrical strategies, which were ordered according to the number of error scenarios included: (S 1 ) F (six error scenarios); (S 2 ) V (eight error scenarios); (S 3 ) E (12 error scenarios); (S 4 ) F+V (14 error scenarios); (S 5 ) F+E (18 error scenarios); (S 6 ) V+E (20 error scenarios); (S 7 ) F+V+E (26 error scenarios). The R N range error scenarios were selected following two different strategies: (R 1 ) ±3% range extremes and (R 2 ) ±3%, 0% and also the nominal (free of geometrical error) scenarios for each of the RR values. The scenario subset that is calibrated in the DUPROTON protocol and clinically used in the three Dutch proton centers, results from the combination of the geometrical strategy S 4 and range strategy R 1 (S 4 ⊗R 1 ).
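The construction of the scenario subsets can be sketched as follows. This is an illustrative reading of the description above, not the protocol's reference implementation: the function names are hypothetical, every shift vector is assumed to have length equal to the 3 mm SR setting, and R 2 is interpreted as adding the 0% range error plus the geometrically nominal scenario at each range value.

```python
import itertools
import math

SR_MM = 3.0  # clinical setup robustness setting from the text

# Which cube features each geometrical strategy combines (F, V, E).
GEOMETRICAL_STRATEGIES = {
    "S1": "F", "S2": "V", "S3": "E", "S4": "FV",
    "S5": "FE", "S6": "VE", "S7": "FVE",
}


def cube_directions():
    """Directions towards the faces, vertices and edges of a cube."""
    groups = {"F": [], "V": [], "E": []}
    for v in itertools.product((-1, 0, 1), repeat=3):
        nonzero = sum(1 for c in v if c != 0)
        if nonzero == 1:
            groups["F"].append(v)
        elif nonzero == 2:
            groups["E"].append(v)
        elif nonzero == 3:
            groups["V"].append(v)
    return groups


def scale_to_length(v, length):
    """Rescale a direction so that its Euclidean length equals `length`."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(length * c / norm for c in v)


def geometrical_shifts(strategy):
    groups = cube_directions()
    return [
        scale_to_length(v, SR_MM)
        for key in GEOMETRICAL_STRATEGIES[strategy]
        for v in groups[key]
    ]


def scenario_subset(strategy, range_strategy):
    """List of (shift in mm, range error in %) pairs for S_N (x) R_N."""
    shifts = geometrical_shifts(strategy)
    if range_strategy == "R1":
        return [(s, r) for s in shifts for r in (-3.0, 3.0)]
    if range_strategy == "R2":
        nominal = (0.0, 0.0, 0.0)  # assumed: no geometrical error
        return [(s, r) for s in shifts + [nominal] for r in (-3.0, 0.0, 3.0)]
    raise ValueError(f"unknown range strategy: {range_strategy}")


counts = {name: len(geometrical_shifts(name)) for name in GEOMETRICAL_STRATEGIES}
clinical_subset = scenario_subset("S4", "R1")
largest_subset = scenario_subset("S7", "R2")
print(counts, len(clinical_subset), len(largest_subset))
```

Under these assumptions the clinical subset S 4 ⊗R 1 contains 28 scenarios and the largest subset S 7 ⊗R 2 contains 81, which matches the totals quoted in the text.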

PCE-based robustness evaluation
PCE was applied to provide a computationally efficient patient- and treatment plan-specific analytical model of the dependence of voxel doses on treatment uncertainties. In a 3D dose distribution, the dose D i of each voxel i is approximated by the series expansion

$$D_i(\vec{\xi}) \approx \sum_{k=0}^{P} a_{i,k}\,\Psi_k(\vec{\xi}),$$

where $\vec{\xi}$ collects the uncertain inputs (geometrical and range errors), $\Psi_k$ are multivariate polynomial basis functions and $a_{i,k}$ are the voxel-specific expansion coefficients.

For the PCE-based robustness evaluation, treatment courses were sampled assuming systematic and random geometrical (Σ and σ) and systematic range (ρ) errors (1 SD) from Gaussian distributions. The (1 SD) errors were chosen since they exactly match a 3 mm SR in treatment planning, given by the linearized photon-based margin recipe M = 2.5Σ + 0.7σ, and to be consistent with clinical experience. Thus, a systematic and a random geometrical error of Σ = 0.92 mm and σ = 1.00 mm were considered for the PCE-based robustness evaluation. For more error combinations, see [Supplementary Material (SM), section S1]. For the range error, a fixed systematic SPP value of 1.2% ± 1.0% (1 SD) was used for the PCE-based robustness evaluations (Wohlfahrt et al 2017, 2018, 2019). Thus, one systematic geometrical and one systematic range error were sampled for each treatment course and one random geometrical error for each treatment fraction. Using PCE, scenario probability distributions of voxel doses and clinically relevant dose-volume histogram (DVH) parameters for the CTV (PCE-D 98%,CTV and PCE-D 2%,CTV ) and for the serial OARs (PCE-D 0.03cc,OARs ) were obtained per patient.
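The sampling scheme (one systematic geometrical and one systematic range error per course, one random geometrical error per fraction) can be sketched in a few lines. The sketch only draws the error scenarios; it does not evaluate any dose, and the function names are hypothetical. The range SD passed in the example is a placeholder for illustration only, and 28 fractions corresponds to the 50.4 Gy(RBE) schedule in 1.8 Gy(RBE) fractions.

```python
import random


def linear_margin(sys_sd_mm, rand_sd_mm):
    """Linearized photon-based margin recipe M = 2.5*Sigma + 0.7*sigma."""
    return 2.5 * sys_sd_mm + 0.7 * rand_sd_mm


def sample_course(rng, n_fractions, sys_sd_mm, rand_sd_mm, range_sd_percent):
    """Draw the errors of one complete fractionated treatment course."""
    # One systematic shift and one range error for the whole course.
    systematic = tuple(rng.gauss(0.0, sys_sd_mm) for _ in range(3))
    range_error = rng.gauss(0.0, range_sd_percent)
    # A fresh random shift per fraction, added to the systematic shift.
    fraction_shifts = []
    for _ in range(n_fractions):
        random_shift = tuple(rng.gauss(0.0, rand_sd_mm) for _ in range(3))
        fraction_shifts.append(
            tuple(s + r for s, r in zip(systematic, random_shift))
        )
    return {
        "systematic_mm": systematic,
        "range_error_percent": range_error,
        "fraction_shifts_mm": fraction_shifts,
    }


margin_mm = linear_margin(0.92, 1.00)  # should reproduce the 3 mm SR setting

rng = random.Random(42)
course = sample_course(
    rng,
    n_fractions=28,          # 50.4 Gy(RBE) / 1.8 Gy(RBE) per fraction
    sys_sd_mm=0.92,          # Sigma from the text
    rand_sd_mm=1.00,         # sigma from the text
    range_sd_percent=1.0,    # placeholder value, illustration only
)
print(f"margin = {margin_mm:.2f} mm, fractions = {len(course['fraction_shifts_mm'])}")
```

In the actual evaluation each sampled course would be passed to the PCE dose model; here the dictionary simply records which errors belong to the course and which to the individual fractions.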
To calibrate the protocol in terms of CTV dose, VWmin-D 98%,CTV and VWmax-D 2%,CTV doses were scaled to a fixed percentile of the scenario D 98%,CTV distribution. In line with van Herk (van Herk et al 2000), they were consistently scaled per patient so that at least 95% of D pres is achieved in 90% of the sampled treatments, i.e. at the 10th percentile of the D 98%,CTV probability distribution (10th percentile PCE-D 98%,CTV = 95%D pres ). Furthermore, boxplots of the scaled VWmin-D 98%,CTV and VWmax-D 2%,CTV values were generated for all 14 scenario subsets. Adequate prescription-dose levels (L) for all scenario subsets within the protocol were determined as the median of the scaled VWmin-D 98%,CTV values (L (S N ⊗R N )).
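The calibration step can be sketched as follows, with all doses expressed as a fraction of D pres . The patient data are synthetic, the 2.5 p.p. gap between VWmin and the 10th percentile is an arbitrary illustrative choice, and the helper names are hypothetical; the percentile uses simple linear interpolation, which may differ from the estimator used in the study.

```python
import random
import statistics


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def scale_factor(pce_d98_samples, target=0.95):
    """Factor that puts the 10th percentile of PCE-D98%,CTV at the target."""
    return target / percentile(pce_d98_samples, 10)


def prescription_level(patients, target=0.95):
    """Median of the per-patient scaled VWmin-D98%,CTV values (level L)."""
    scaled = [
        p["vwmin_d98"] * scale_factor(p["pce_d98"], target) for p in patients
    ]
    return statistics.median(scaled), scaled


# Synthetic cohort: three hypothetical patients, not study data.
rng = random.Random(1)
patients = []
for _ in range(3):
    pce_d98 = [rng.gauss(0.99, 0.01) for _ in range(2000)]
    patients.append({
        "pce_d98": pce_d98,
        # Assumed for illustration: VWmin sits 2.5 p.p. below the 10th percentile.
        "vwmin_d98": percentile(pce_d98, 10) - 0.025,
    })

level, scaled_vwmin = prescription_level(patients)
rescaled_p10 = [
    percentile([d * scale_factor(p["pce_d98"]) for d in p["pce_d98"]], 10)
    for p in patients
]
print(f"L = {100 * level:.1f}% of D_pres")
```

When the VWmin metric is more pessimistic than the 10th percentile of the probabilistic distribution, the resulting level L falls below the clinical 95%, which is the situation the protocol calibration addresses.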

Comparison of the robustness recipe with the photon-based margin recipe
Since the assumptions underlying the static dose cloud approximation do not apply to PBS-PT, photon-based margin recipes cannot be directly applied to calculate the SR setting. To this end, PCE was used to construct a robustness recipe, i.e. the set of combinations of systematic (Σ) and random (σ) geometrical errors for which adequate CTV dose is achieved with exactly a pre-defined probability when the clinical 3 mm SR setting is used. The probability of achieving adequate CTV dose was defined as the probability of meeting the planning CTV constraint (D 98%,CTV ≥ 0.95 D pres ) for a given percentile of the scenario D 98%,CTV distribution. Therefore, robustness recipes aiming to achieve adequate CTV dose for the 10th (90% robustness recipe), 5th (95% robustness recipe) and 2nd (98% robustness recipe) percentiles of the D 98%,CTV scenario (and population) distribution were derived. For an initial combination of Σ and σ geometrical errors, PCE was first used to sample 100,000 fractionated treatments to determine the D 98%,CTV scenario distribution for all 21 plans. For each of the 21 D 98%,CTV distributions, the probability of achieving adequate dose was determined as the probability of meeting the planning CTV constraint, P const = P(D 98%,CTV ≥ 0.95 D pres ). If the probability averaged over all 21 plans did not meet the criterion within a bandwidth of 0.1 p.p., the value of the geometrical Σ was iteratively changed. A non-linear three-parameter function was used to fit the recipes: Σ = −aσ/exp(−bσ²) + c. The coefficients for each of the recipes are tabulated in [SM, section S2].
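The iterative search for Σ at a given σ can be sketched as a bisection. The text does not state which update rule was used, so bisection is an assumption here. The coverage function below is a deliberately crude one-dimensional toy that stands in for the PCE-based evaluation of P const averaged over the cohort; it is not the model used in the study, and all function names are hypothetical.

```python
import math

SR_MM = 3.0  # clinical setup robustness setting


def toy_coverage_probability(sys_sd_mm, rand_sd_mm):
    """Toy stand-in for the cohort-averaged P_const (NOT the PCE model).

    Assumes coverage is adequate when a 1D Gaussian systematic shift stays
    within the part of the SR setting not consumed by random errors.
    """
    usable_mm = SR_MM - 0.7 * rand_sd_mm
    if usable_mm <= 0.0:
        return 0.0
    return math.erf(usable_mm / (sys_sd_mm * math.sqrt(2.0)))


def find_sys_sd(coverage, rand_sd_mm, target, tol=0.001,
                lo=1e-6, hi=10.0, max_iter=100):
    """Bisect on Sigma until |P_const - target| <= tol (0.1 p.p.).

    Returns None when the target cannot be bracketed for this sigma.
    """
    if not (coverage(lo, rand_sd_mm) >= target >= coverage(hi, rand_sd_mm)):
        return None
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        p = coverage(mid, rand_sd_mm)
        if abs(p - target) <= tol:
            return mid
        if p > target:
            lo = mid  # coverage too high: larger systematic errors are tolerable
        else:
            hi = mid
    return None


def recipe_fit(rand_sd_mm, a, b, c):
    """Three-parameter form from the text: Sigma = -a*sigma/exp(-b*sigma^2) + c."""
    return -a * rand_sd_mm / math.exp(-b * rand_sd_mm ** 2) + c


recipe_points = {
    sigma: find_sys_sd(toy_coverage_probability, sigma, target=0.90)
    for sigma in (0.5, 1.0, 1.5, 2.0)
}
for sigma, sys_sd in recipe_points.items():
    print(f"sigma = {sigma:.1f} mm -> Sigma = {sys_sd:.2f} mm")
```

Each (σ, Σ) pair found this way is one point of a 90% recipe under the toy model; in the study such points were then fitted with the three-parameter function, whose coefficients are given in the SM and are not reproduced here.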
Robustness recipes for two different situations were determined. The first situation addresses the remaining differences between photon- and proton-based robustness recipes after calibration of the DUPROTON protocol (robustness recipe after protocol calibration). To this end, treatment plans for all patients were scaled according to the VWmin adequate dose evaluation level (L) of the scenario subset used clinically in the DUPROTON protocol (L(S 4 ⊗R 1 )), determined in section 2.4. The second situation focuses on how the protocol performs when SR settings are tight relative to the errors assumed. Thus, no protocol calibration was used for the derivation of this recipe (without protocol calibration), and the treatment plans were scaled per patient to achieve D pres at the 50th percentile of the scenario D 50%,CTV distribution (50th percentile PCE-D 50%,CTV = D pres ) to reduce inter-patient variation.
To assess the applicability of the robustness recipe, different combinations of geometrical Σ and σ errors satisfying the P const = 90% and 98% robustness recipes were evaluated and compared against the photon-based margin recipe in the [SM, section S1]. The extreme zones of the clinical and photon-based recipes (where Σ or σ is 0 mm) were excluded from the evaluation since they are not realistic in clinical practice.

Statistical analysis
Statistical analyses of the median (Wilcoxon signed-rank test) and of the data dispersion (Ansari-Bradley test) were performed using Matlab (Mathworks, version R2017a) to evaluate the differences between the scenario subsets. A p-value < 0.05 was considered statistically significant.

Correlation between clinical plan evaluation metrics versus probabilistic CTV dose metrics
In order to assess how the selection of the scenario subset affects the protocol, VWmin/VWmax CTV and VWmax OAR dose values for the different combinations of geometrical (S 1 to S 7 ) and range (R 1 or R 2 ) scenario subsets were compared to the actually delivered CTV (D 98%,CTV , D 2%,CTV ) and OAR (D 0.03cc,OARs ) dose metrics. The coefficients of the linear and non-linear regressions of PCE against the clinical CTV and OAR voxel-wise metrics can be found in table 1 and table 2, respectively. The inclusion of more geometrical scenarios in the subsets leads to a decrease in VWmin-D 98%,CTV and an increase in VWmax-D 2%,CTV values (p < 0.05), increasing the conservativism of the protocol. At the 10th percentile of the scenario D 98%,CTV distribution, the slope increased significantly from 1.022 (S 1 ) to 1.030 (S 7 ) for R 1 and from 1.023 to 1.031 for R 2 (p < 0.05). For the 90th percentile of the D 2%,CTV distribution, increases of 0.6 percentage points (p.p.) and 0.5 p.p. along the geometrical strategies were found for R 1 and R 2 , respectively. For the OARs, the correlation was best in the zero-dose and high-dose regions, with the largest differences between PCE-D 0.03cc,OARs and VWmax-D 0.03cc,OARs at 60% of D pres (figure 2(b)). The non-linear coefficients A and B were on average 0.35 and 0.61 for the 98th percentile fit (coefficient of determination R² = 0.99), and respectively increased and decreased with the addition of geometrical scenarios. Despite the considerable difference in the number of scenarios between the R 1 and R 2 strategies, the differences in the slopes and non-linear coefficients between range strategies were not statistically significant (p > 0.05). Correlations of the voxel-wise CTV and OAR dose metrics for the scenario subset clinically used in the DUPROTON protocol (S 4 ⊗R 1 ) are depicted in figure 2. A visualization of the conservativism of the DUPROTON protocol on these metrics can be found in figure 3.

Adequate prescription-dose evaluation levels for the DUPROTON protocol
Differences between VWmin-D 98%,CTV /D pres and VWmax-D 2%,CTV /D pres depending on the geometrical (S 1 to S 7 ) and range strategies (R 1 and R 2 ) are displayed in figure 4. Dose metrics were scaled for each patient to the 10th percentile of their scenario D 98%,CTV distribution to determine adequate dose evaluation levels for each scenario subset. All scaled VWmin-D 98%,CTV /D pres values extended below the clinical target criterion (D 98%,CTV ≥ 95%D pres ). Assuming a population coverage probability of 90%, adequate dose evaluation levels from 93.0% (figure 4(a): S 1 ⊗R 1 ) to 92.2% (figure 4(b): S 7 ⊗R 2 ) on average were found, compared to the clinically used 95%. The protocol also results in more homogeneous plans than expected, with scaled VWmax-D 2%,CTV /D pres values of 1.01 (S 1 ⊗R 1 ) to 1.02 (S 7 ⊗R 2 ) on average. Inter-patient variation had a larger impact on the clinical VWmin-D 98%,CTV /D pres than inter-scenario subset variation; no significant differences were found for the latter (p > 0.05). Further analysis based on other combinations of σ and Σ errors can be found in the [SM, section S1].

Photon-based margin recipe versus consistent robustness recipes before and after protocol calibration
The robustness recipe, derived for this patient cohort from the clinical TPS, is displayed in figure 5 after (figure 5(a)) and before (figure 5(b)) calibration against the scenario subset used clinically (S 4 ⊗R 1 ). Before calibration of the protocol, the errors determined from the robustness recipe, assuming a population coverage of 90% (90% robustness recipe, blue line), are significantly larger than the errors assumed from the photon-based margin recipe, also compared to its original non-linearized form. In fact, it does not reproduce the factor 2.5 that was determined in the photon-based margin recipe when the random geometrical σ error is 0 mm. For σ ≤ 1.5 mm, both the linearized and non-linearized photon-based margin recipes could be calibrated with a linear scale factor. After calibration of the protocol, the differences between the 90% robustness recipe and the linearized and non-linearized photon-based margin recipes were reduced, but variations remained. When σ > 1.5 mm, neither form of the photon-based margin recipe can be calibrated to reproduce the robustness recipes. Small differences in the σ values lead to significant Σ differences in this part of the recipe, indicating that PT is more sensitive to random errors. For instance, a σ error of 1 mm (central part of the recipe) leads to geometrical errors of Σ = 1.54 mm (before calibration) and Σ = 1.15 mm (after calibration) according to the 90% clinical recipe, while a lower geometrical Σ error of 0.92 mm is suggested by the linearized photon-based margin recipe, which aims at the same population coverage. No fitting parameters were found for the robustness recipe aiming at a population coverage of 98%, since no combination of Σ and σ errors ensured a 98% probability after calibration of the protocol.
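As a worked check of the numbers quoted for σ = 1 mm, the linearized photon-based recipe can be solved for Σ at the 3 mm SR setting and compared with the two robustness-recipe values. The subscripts (photon, before, after) are labels introduced here for readability only.

```latex
\begin{align}
M &= 2.5\,\Sigma + 0.7\,\sigma = 3\ \text{mm}
  \;\Rightarrow\;
  \Sigma_{\text{photon}}(\sigma) = \frac{3\ \text{mm} - 0.7\,\sigma}{2.5}, \\
\Sigma_{\text{photon}}(1\ \text{mm}) &= \frac{3 - 0.7}{2.5}\ \text{mm} = 0.92\ \text{mm}, \\
\frac{\Sigma_{\text{before}}}{\Sigma_{\text{photon}}} &= \frac{1.54}{0.92} \approx 1.67,
\qquad
\frac{\Sigma_{\text{after}}}{\Sigma_{\text{photon}}} = \frac{1.15}{0.92} = 1.25 .
\end{align}
```

At this point of the recipe, the tolerated systematic error is therefore about 67% larger than the photon-based value before calibration and 25% larger after calibration.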
The robustness recipes for both situations (after and without protocol calibration), showed a similar consistency for different combinations of geometrical Σ and σ errors as linear photon-based margin recipes [SM, section S1]. For the robustness recipe without protocol calibration, the inter-patient variation in the scaled VWmin-D 98%,CTV values was higher for the different combinations of geometrical Σ and σ errors compared to the recipe after protocol calibration, indicating that the protocol might not be suitable when a large number of geometrical errors are handled in comparison to the SR setting used [SM, section S2].

Discussion
In this paper, we have quantitatively and systematically assessed the performance of the Dutch proton robustness evaluation protocol in a cohort of robustly planned PT treatments for 21 neuro-oncological patients. We evaluated how VWmin and VWmax dose metrics probabilistically translate into delivered dose to the CTV and OARs under geometrical and range errors. Next, we calibrated the DUPROTON protocol by deriving adequate CTV prescription-dose levels, assuming different scenario subsets in line with the DUPROTON group, and analyzed the residual inter-patient variation. Finally, a robustness recipe was determined before and after calibration of the protocol and compared to the photon-based margin recipe on which the protocol is based, to assess the remaining differences when (i) SR settings are pushed to the limits to handle geometrical errors and (ii) the photon-based margin recipe is applied to PT. The DUPROTON protocol, combined with the photon-based margin recipe to determine the adequate SR setting, can be calibrated using a lower evaluation dose level depending on the evaluation scenarios selected to construct the voxel-wise doses. In line with our findings in (Rojo-Santiago et al 2021a), the DUPROTON protocol (S 4 ⊗R 1 ) as implemented at our center leads to consistent but conservative results in terms of CTV and OAR doses (figure 2). Assuming a population coverage of 90%, VWmin and VWmax doses result in an under- and over-estimation of 3 p.p. and 2 p.p. of the near-minimum and near-maximum CTV doses, respectively. The slight CTV overdose can be corrected by evaluating the VWmin CTV dose at a level L between 93.0% (S 1 ⊗R 1 ) and 92.2% (S 7 ⊗R 2 ), instead of the usual 95%, depending on the scenario subset used for the robustness evaluation (figures 4(a) and (b)). In addition, this lower prescription-dose level does not lead to unacceptable hotspots in the delivered dose distribution.
In fact, as the VWmax dose metric also overestimates the near-maximum CTV dose (figures 4(c) and (d): VWmax-D 2%,CTV = 1.02 on average for all scenario subsets), the protocol realizes slightly more homogeneity in the delivered dose distributions compared to conventional RT plans.
In contrast, if the SR settings are pushed to the limit [SM section S1], the DUPROTON protocol is no longer conservative. In this case, the dose can be corrected by evaluating the VWmin CTV dose at a 95.6% level, which leads to more inter-patient variation in the clinical metrics. The lack of consistency of the protocol when tight robustness settings are used may be due to the limitation of using robust minimax optimization in treatment planning, which uses a discrete set of scenarios. Thus, a calibration of the protocol with probabilistic robustness evaluation approaches, which use a semi-infinite set of scenarios, comes at the expense of increased inter-patient variation in the clinical dose metrics. In addition, the larger number of geometrical and range errors used might also contribute to the inter-patient variation, but comparable results were obtained with a cohort of head-and-neck patients planned with an SR = 5 mm setting (Rojo-Santiago et al 2021b). Therefore, proper probabilistic approaches for treatment plan optimization could aid in reducing the remaining inter-patient variation.
The conservativism of the DUPROTON protocol while applying photon-based margin recipes could be partially explained by (i) the inherent construction of the voxel-wise approach, in which the extreme dose levels for each voxel are reported, and (ii) the incompatibility of photon-based margins to calculate SR settings for PBS-PT, as shown in figure 5. If only Σ errors are considered, the robustness recipe does not reproduce the factor of 2.5 from the photon-based margin recipe. In addition, PT planning is more sensitive to random errors due to the steeper lateral and distal penumbrae compared to conventional RT. In fact, the remaining differences after calibration of the protocol confirm that photon-based margin recipes do not apply to PT (figure 5(a)). The differences in the degree of modulation of the intensities (conventional RT versus PBS-PT) in the treatment plans used, how the optimization was done and the assumption of a constant lateral penumbra from conventional RT and its application in PBS-PT might also contribute to these differences. Furthermore, the point minimum dose (D min ) was the metric proposed to assess the plan adequacy during the construction of the photon-based margin recipe, which was used for PTV evaluation, while nowadays the near-minimum dose (D 98% ) is commonly used instead (ICRU 1993, 1999).

Figure 5. Robustness recipe probabilistically consistent with the TPS before (right) and after (left) the calibration of the protocol. The recipe ensures that the clinical CTV criterion (D 98%,CTV ≥ 95%D pres ) is met at different population coverage probabilities (90% in blue, 95% in green and 98% in red). Linearized (black) and non-linearized (dashed black) photon-based margin recipes from van Herk, which aim for a 90% population coverage probability in conventional radiotherapy, are also displayed (van Herk et al 2004).
The VWmax-D 0.03cc dose, which is commonly evaluated for serial OARs in clinical practice (Eekers et al 2018), is a conservative metric, showing that it is not a good predictor of the near-maximum D 0.03cc dose to serial organs in dose gradients. Furthermore, its accuracy depends on the dose level: the largest absolute deviations from the unity lines in figure 2(b) are found at intermediate dose levels (relative to the prescribed dose). This is particularly relevant for cases in which robust target coverage is sacrificed to spare critical OARs, as it leads to over-estimation of the OAR dose and, therefore, to suboptimal trade-offs between target coverage and critical serial OARs. Thus, based on the DUPROTON protocol, one can give additional dose to OARs that are located close to the target if there is an improvement in CTV coverage. An example is skull-base chordoma patients, in which the prescription dose (70-74 Gy(RBE)) to the target is above critical OAR tolerances (Fung et al 2018, Kroesen et al 2022). However, a higher dose to serial OARs should be balanced against RBE effects, which have an increased impact beyond the distal part of the spread-out Bragg peak (Luhr et al 2018).
A limitation of the study comes from the lack of knowledge about adequate probabilistic planning goals to assess target dose adequacy directly on the CTV. Clinical plan robustness evaluations depend on the robustness approaches used to mitigate uncertainties in PT (robust optimization and evaluation) and in RT (PTV-based methods), which are usually based on enlarged treated volumes around the CTV (PTV-D 98% for RT and VWmin-D 98%,CTV for PT) instead of on the CTV itself. In addition, the relaxation of the historical clinical goal from a point minimum (PTV-D 100% ) to the near-minimum dose (PTV-D 98% ) masked the volume v of the CTV that should be probabilistically covered by 95% of the D pres , which has also been adopted in the DUPROTON protocol (VWmin-D 98%,CTV ). In this paper, we used the 10th percentile of the scenario D 98%,CTV distribution as an adequate probabilistic CTV dose metric from the PCE-based robustness evaluations, to subsequently calibrate the DUPROTON protocol. Other dose-volume metrics may be established through cross-calibration with photon treatment plans.
We limited this study to the evaluation of the clinical treatment plans in a clinical TPS, including per-patient clinical decisions and trade-offs in treatment planning, with geometrical and range robust optimization settings of 3 mm and 3%. Instead, we evaluated and optimized the performance of the protocol by using different numbers of treatment errors.
Another limitation of this study relates to the number of scenarios selected for the protocol, which were defined in line with the DUPROTON consensus group. The scenarios used in each of the subsets are highly correlated, limiting the coverage of the actual error distribution even when using a larger sample of scenarios. Thus, a more uniform sampling of the scenarios, i.e. from a fixed percentile of the 4D probability distribution used in the DUPROTON protocol might lead to a better interpretation of the clinical metrics. However, the addition of more scenarios turns the approach to a more conservative direction, as figure 4 shows. Consequently, a robustness evaluation protocol that satisfies a lower VWmin dose evaluation level and includes fewer scenarios could reduce computational time in treatment planning.
Compared to MC-based robustness evaluation methods, PCE is an analytical approximation of the dose engine that, by making millions of dose calculations computationally feasible, can aid in accurately interpreting the impact of treatment uncertainties in fractionated treatments, in this case geometrical and SPP errors, on relevant dosimetric parameters (D 98%,CTV , D 0.03cc,OARs ) in PBS-PT. Its speed and accuracy allow us to perform probabilistic robustness evaluations, which enable us (i) to quantify the sensitivity and true robustness of these clinical dose metrics more precisely, and (ii) to benchmark other robustness strategies used in clinical practice. However, the parameterization of the treatment errors in the problem requires a validation of the model, which might fail in the case of more complicated sources of uncertainty, e.g. anatomical variations. Furthermore, additional sources of error in the PCE construction might increase its complexity and computational cost. For other treatment sites including moving targets and anatomical deformations, the combination of PCE with more advanced anatomical modeling could further improve clinical robustness evaluation protocols (Pastor-Serrano et al 2021).

Conclusion
In summary, we have shown that the Dutch proton robustness protocol, when combined with the photon-based margin recipe to determine the adequate SR, can be calibrated with a lower VWmin evaluation dose level depending on the chosen scenario subset (e.g. S 4 ⊗R 1 : 92.5%). PCE-based robustness evaluations showed that the protocol leads to consistent but conservative results in patients in which robustness in target coverage can be achieved. Without a dose calibration, the protocol underestimates/overestimates the near-minimum/near-maximum CTV doses (D 98%,CTV /D 2%,CTV ) by 3 p.p./2 p.p. on average for all scenario subsets. Furthermore, the protocol shows limitations when assessing robustness in OAR doses. The VWmax near-maximum dose proved to be a poor predictor of the delivered near-maximum OAR dose, especially for cases in which a trade-off between robust target coverage and OAR dose must be made. Finally, the protocol might not perform well when tight SR settings are used in planning, in which case the inter-patient variation in clinical dose metrics substantially increases.