A generalized model for monitor units determination in ocular proton therapy using machine learning: A proof-of-concept study

Objective. Determining and verifying the number of monitor units is crucial to achieving the desired dose distribution in radiotherapy and maintaining treatment efficacy. However, current commercial treatment planning systems dedicated to ocular passive eyelines in proton therapy do not provide the number of monitor units for patient-specific plan delivery. Performing dedicated pre-treatment field measurements, which is time and resource consuming, is usually the gold-standard practice. This proof-of-concept study reports on the development of a multi-institutional generalized model for monitor unit determination in proton therapy for eye melanoma treatments. Approach. To cope with the small number of patients treated in individual proton centers, three European institutes participated in this study. Measurement data were collected to address output factor differences across the institutes, especially as a function of field size, spread-out Bragg peak modulation width, residual range, and air gap. A generic model for monitor unit prediction, using a large dataset of 3748 patients with a broad diversity of tumor patterns, was evaluated using six popular machine learning algorithms: (i) decision tree, (ii) random forest, (iii) extra trees, (iv) K-nearest neighbors, (v) gradient boosting, and (vi) support vector regression. Features used as inputs into each machine learning pipeline were the spread-out Bragg peak width, range, air gap, fraction dose, and calibration dose. Performance was scored using the mean absolute error between predicted and real monitor units, the latter collected from institutional gold-standard methods. Main results. Predictions across algorithms were accurate within 3% uncertainty for up to 85.2% of the plans and within 10% uncertainty for up to 98.6% of the plans with the extra trees algorithm. Significance. A proof-of-concept of machine learning-based generic monitor unit determination in ocular proton therapy has been demonstrated. This could trigger the development of an independent monitor unit calculation tool for clinical use.


Introduction
The number of proton therapy centers treating eye melanomas worldwide has recently reached 20 (Hrbacek et al 2016). Most of the institutes operate dedicated ocular fixed horizontal beamlines using passive scattering technology for beam delivery (Kacperek 2012, Fleury et al 2021). However, these systems vary widely, ranging from a few centers equipped with low-energy accelerators to the more common, contemporary setup of multiroom centers linked to high-energy accelerators (Hrbacek et al 2016). Consequently, the differences across the institutes lie in the use of a single or double scattering system, the quality of the initial beam entering the ocular nozzle, the shape of the range modulator wheels, the material of the range shifter, and the position of each component within the nozzle.
The calibration of a beamline is performed under reference conditions, which involve a fixed and defined setup. During treatment, however, the setup becomes patient-specific. This difference affects the relation between dose and monitor units (MUs). The changes in dose/MU relative to the reference conditions can be defined as output factors (OFs) (Newhauser et al 2002, Hsi et al 2009, Slopsema et al 2013). The OF depends on several factors: field size, modulation width, range of the proton beam, geometrical distances between the beam-modifying devices, air gap, source-to-axis distance, and the positioning of the ionization chambers along the beamline (Bonnett et al 1993, Daartz et al 2009, Hsi et al 2009). The OF must be calculated for each patient plan individually, and the number of MUs needed to deliver a certain dose is usually calculated by the treatment planning system (TPS). Currently, no TPS for dedicated ocular proton beamlines allows the direct calculation of plan-specific MUs. Hence, treatment plans are calculated only in terms of relative dose. To translate the planned treatment dose into the number of required MUs, several approaches have emerged in practice: (i) determination by dosimetric measurements; (ii) in centers with long experience, look-up tables relating the absolute dose and MUs for specific nozzle settings, with measurements performed only for a subset of patients to verify the validity of the look-up tables; (iii) Monte Carlo simulations (Hérault et al 2005, Koch et al 2008); and/or (iv) a (semi-)empirical model (Kooy et al 2003, 2005). Patient-specific quality assurance measurements are generally considered the gold-standard method, as the development of a Monte Carlo algorithm and/or the implementation of an empirical model require time and specific knowledge.
In an effort to reduce the workload associated with these measurements, and with the goal of implementing an independent MU calculation tool, several research groups have already investigated the prediction of OFs or MUs for passively-scattered or uniform-scanning proton beamlines (Kooy et al 2005, Fontenot et al 2007, Sahoo et al 2008, Kim et al 2011, Zheng et al 2011, Lin et al 2014, Ferguson et al 2016, 2017). Recently, another approach for OF determination using supervised machine learning (ML) has been proposed in non-ocular proton therapy (Sun et al 2018, Grewal et al 2020, Zhu et al 2022). This is an interesting alternative that might be applied to MU prediction from dosimetric factors. However, in light of the technical complexity and site-specific nature of ocular beamlines, the direct application of MU prediction methodologies developed for general passive scattering proton therapy to ocular beamlines is not feasible. Nevertheless, the underlying principles and reasoning of these previous works may provide a robust foundation for developing such models in ocular proton therapy. Independent of the specific application, ML algorithms have the ability to identify data patterns in complex relationships and to use this learning for generalization to unseen data (Li and Murphy 2015). Among available ML algorithms, regression models can be used to estimate a continuous target variable from input features (El Naqa et al 2018, Alpaydin 2020).
As a wide variety of methods exists for regression problems and the optimal choice depends on the complexity of the problem at hand, the primary goal of this study was to implement ML algorithms to develop predictive models for MU prediction in passively-scattered ocular proton therapy. The ultimate goal was to investigate the applicability of ML to build a generalized model for MU determination across different institutes, using data from three European centers. To the best of our knowledge, no other study has focused on machine learning-based MU determination for dedicated passive ocular eyelines.

Materials and methods
2.1. Center-specific proton delivery and eye nozzles

Data were provided by the Centre Antoine Lacassagne, France (abbr. CAL); the Bronowice Cyclotron at the Institute of Nuclear Physics, Polish Academy of Sciences, Poland (abbr. IFJ); and HollandPTC, The Netherlands. The three centers are cyclotron-based proton therapy facilities producing proton beams with extracted nominal energies of 65, 235 and 250 MeV, respectively. The beam delivery and transport system at CAL was developed in-house, whereas IFJ and HollandPTC were built upon commercial solutions, from IBA (Louvain-la-Neuve, Belgium) for the former and Varian (a Siemens Healthineers company, Palo Alto, California, USA) for the latter. Therefore, the three institutes utilize different technologies in terms of manufacturers, cyclotron designs, energy selection and degradation systems, and eye nozzle components. As a consequence, the energies and energy spreads at the entrance of the three fixed horizontal eyelines differ: 62.5 ± 0.10 MeV (CAL), 70 ± 0.70 MeV (IFJ), and 75 ± 0.70 MeV (HollandPTC). All institutes use a single scattering foil made of tantalum for passively-scattered proton therapy beams. The virtual source is located at the scattering foil, resulting in an average virtual source-to-axis distance of 300 cm (CAL), 165 cm (IFJ), and 120 cm (HollandPTC). The snout-isocenter distance is 7 cm and all treatments are performed under isocentric conditions in all three centers. A dose monitor system is integrated in all beamlines (PTW, Freiburg, Germany). Each center independently commissioned its clinical treatment options, defined by a unique combination of a range shifter thickness (resulting in a finite residual range in water) and modulation widths achieved by a spectrum of wheels creating spread-out Bragg peaks (SOBPs). CAL has 50+, IFJ 50+, and HollandPTC 14 SOBP wheels available for clinical use. Maximum commissioned ranges in water with no range shifter in the beam (therefore corresponding to the eyeline energies) were 3.24, 3.15 and 3.97 g cm−2, respectively. The residual range is defined at the 90% distal dose level in depth.
The detailed characterization of the HollandPTC eye beamline, compared with five existing clinical eyelines worldwide including CAL and IFJ, is reported in Fleury et al (2021).

2.2. Eyeline-specific OF and patient-specific MU determination
To understand the impact of the beamline design on the determination of MUs, OF measurements were collected from all three institutes.
Center-specific calibration factors relating MUs and dose for the nominal eyeline energy were defined independently on-site. The unique and constant calibration dose for each eyeline was determined under the same reference conditions: SOBP_REF = 2.00 g cm−2, field size_REF = 25 mm circular diameter, z_REF = mid-SOBP, no range shifter in place, thus leading to an OF = 1.00 (figure 1). The reference depth z_REF in water corresponded to 2.00, 2.32 and 2.97 g cm−2 for CAL, IFJ, and HollandPTC, respectively. The relation between MU and dose is, however, dependent on the inserted beam-modifying devices and therefore differs with every option.
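In schematic form (our notation, not a center-specific formula), the OF and the resulting number of MUs can be written as

$$\mathrm{OF} = \frac{(D/\mathrm{MU})_{\mathrm{plan}}}{(D/\mathrm{MU})_{\mathrm{REF}}}, \qquad \mathrm{MU} = \frac{D_{\mathrm{fx}}}{(D/\mathrm{MU})_{\mathrm{REF}}\,\cdot\,\mathrm{OF}},$$

where $(D/\mathrm{MU})_{\mathrm{REF}}$ is the center-specific calibration factor measured under the reference conditions above and $D_{\mathrm{fx}}$ is the fraction dose prescribed at the point of interest (typically mid-SOBP).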
We analysed the OFs as a function of the field size, SOBP modulation width, range shifter, and air gap (Sahoo et al 2008, Daartz et al 2009, Titt et al 2017). Hereafter, the air gap corresponds to the in-air physical distance between the eye nozzle and the entrance of the phantom along the central axis of the beam. Changing any of these four factors alters the OF value. The OFs under non-reference conditions were determined independently in all three centers. All centers measured an extensive set of ranges, SOBPs, field sizes, and air gaps, thereby spanning the minimum and maximum residual ranges in water, the minimum and maximum SOBPs achievable in each eye beamline, and air gaps between −2.5 and 2.5 cm away from the isocentre along the central axis of the beam. The number of measured options (combinations of range, SOBP and air gap) differed, however, among the centers due to differences in available resources and beam time.
The clinical determination of MUs was performed by Monte Carlo simulations at CAL, as this is standard practice there (Hérault et al 2005, 2007), and by patient-specific quality assurance measurements at IFJ and HollandPTC. The measurements were performed at mid-SOBP prior to each treatment course at IFJ and HollandPTC using the PTW 23343 and PTW Advanced Markus chambers, respectively. The ranges of clinical patient-specific real MUs in each cohort were [108; 3801] at CAL, [3287; 7367] at IFJ, and [4325; 8507] at HollandPTC.

2.4. Machine learning-based models for MU prediction

2.4.1. Supervised ML algorithms
Selecting the best possible algorithm(s) for the prediction of MUs is a costly process. For this purpose, an automated machine learning tool (Auto-sklearn, version 0.15.0) was used to extract and select the best-performing regression models (Feurer et al 2019). Auto-sklearn integrates a Bayesian optimization process to search for the best machine learning pipeline(s), including feature pre-processing, algorithm selection, and hyperparameter tuning. The methodology behind Bayesian optimization in Auto-sklearn can be summarized in three main steps: (i) initialization: Auto-sklearn starts with a collection of machine learning algorithms and a search space of hyperparameters for each algorithm, and initializes a surrogate model, often a Gaussian process, to model the performance of different pipeline configurations; (ii) sequential optimization: Auto-sklearn performs a sequential optimization loop over a pre-defined number of iterations or a pre-defined time; (iii) final model selection: after the optimization process is complete, Auto-sklearn selects the best-performing machine learning pipeline(s) based on the final surrogate model's predictions.
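This search step can be reproduced with a few lines of auto-sklearn code. The sketch below uses placeholder data and illustrative time budgets and seeds, not the settings of this study:

```python
import numpy as np
import autosklearn.regression
import autosklearn.metrics

# Placeholder data: 5 input features per plan (SOBP width, residual range,
# air gap, calibration factor, fraction dose) and the corresponding real MUs.
rng = np.random.default_rng(0)
X_train = rng.random((200, 5))
y_train = rng.uniform(100, 8500, size=200)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=3600,   # total search budget in seconds (illustrative)
    per_run_time_limit=300,         # budget per candidate pipeline (illustrative)
    metric=autosklearn.metrics.mean_absolute_error,
    seed=42,
)
automl.fit(X_train, y_train)
print(automl.leaderboard())         # ranked list of the best-performing pipelines
```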
One single regression model is referred to as an ML-pipeline hereafter. Based on the analysis of the factors influencing the OFs, and therefore the determination of MUs (see section 2.2), the following numerical feature variables were used as inputs to the ML-pipelines: patient-specific SOBP, residual range, air gap, center-specific calibration factor, and fraction dose. An overview is provided in figure 2. The dependent variables were the clinical patient-specific MUs determined by the institutional gold-standard methods, i.e. measurements performed at HollandPTC and IFJ, or Monte Carlo simulations at CAL.
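As an illustration, the per-plan feature table could be assembled as sketched below; all column names and values are hypothetical and purely illustrative, not the actual center data:

```python
import pandas as pd

# Hypothetical per-plan records from the three centers (illustrative values only)
plans = pd.DataFrame({
    "sobp_width":       [2.0, 1.5, 2.5],       # SOBP modulation width, g cm^-2
    "residual_range":   [2.8, 3.0, 2.4],       # residual range in water, g cm^-2
    "air_gap":          [0.0, 1.2, -0.5],      # cm along the central axis from isocentre
    "calibration":      [1.02, 0.98, 1.01],    # center-specific calibration factor (dose/MU, arbitrary values)
    "fraction_dose":    [13.0, 14.0, 13.0],    # prescribed fraction dose (illustrative)
    "real_mu":          [3500.0, 4200.0, 3900.0],  # gold-standard MUs (target variable)
})

X = plans[["sobp_width", "residual_range", "air_gap", "calibration", "fraction_dose"]]
y = plans["real_mu"]
```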

ML-pipeline
MUs derived from the machine learning algorithms were compared against on-site measured MUs for IFJ and HollandPTC data, and against Monte Carlo MCNPX-based (Hérault et al 2005, 2007) determined MUs for CAL, as they represent the gold standard in each institute.
Each ML-pipeline was built by randomly splitting the dataset into a training, validation, and evaluation set using the Scikit-learn library (Python 3.8, version 0.24.2). The three sets were generated independently of each other. The training set sampled 70% of the patients from each center; these data were then concatenated into a single training set. The validation set is an internal process using patients assigned to the training set only, and was used to optimize the hyperparameters of each ML-pipeline. Hyperparameter tuning during the training of each ML-pipeline was performed using a grid search approach. Validation of the hyperparameters was performed using a repeated K-fold cross-validation process, dividing the data into K = 10 subsets (splits) and repeating this K-fold process five times with different random data partitions in each repetition. The repetitions help obtain a more robust and reliable estimate of the performance of each ML-pipeline, as they average over multiple rounds of testing, hence reducing the influence of randomness in the data split. The optimal hyperparameters were those resulting in the lowest mean absolute error, i.e. the difference between real and predicted MUs, in the validation set. To prevent an overly optimistic estimate of the MU predictive value, we evaluated the models on the evaluation set only once all hyperparameters were considered optimal and the final model was trained. In this way, the performance on the evaluation set did not influence the decisions made during the implementation of the ML-pipeline, thereby preventing possible overfitting. The final performance of each ML-pipeline was determined by calculating the mean and median absolute errors between real and predicted MUs for the patients in the evaluation set.
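The tuning scheme described above maps onto standard scikit-learn components. The following is a minimal sketch with placeholder data, an illustrative hyperparameter grid, and a plain random split rather than the per-center 70% sampling used in the study:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split

# Placeholder data standing in for the concatenated multi-center feature matrix
rng = np.random.default_rng(0)
X = rng.random((300, 5))                  # 5 input features per plan
y = rng.uniform(100, 8500, size=300)      # real MUs (gold standard)

# 70/30 split into training and evaluation sets (simplified: one random split)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, train_size=0.7, random_state=42)

# Repeated K-fold cross-validation on the training set only: K = 10 splits, 5 repeats
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=42)

# Illustrative hyperparameter grid for one of the six pipelines (extra trees)
grid = {"n_estimators": [100, 300, 500], "max_depth": [None, 10, 20]}

search = GridSearchCV(
    ExtraTreesRegressor(random_state=42),
    param_grid=grid,
    scoring="neg_mean_absolute_error",    # optimal hyperparameters minimize the MAE
    cv=cv,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)   # MAE of the best configuration
```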
Additionally, as the patients from CAL constituted 90% of the whole cohort, each ML-pipeline was independently retrained and tuned using HollandPTC data only, to compare the performance of a single-institutional model against the multi-institutional model and thus enable a better interpretation of the generalization results.

Results

OF comparison
A comparison of the OFs among the three institutes, as a function of field size, range shifter thickness, SOBP width, and air gap, is provided in figure 1. The dependence of the OF on field size was negligible for field sizes larger than 1 cm. The OF values decreased as the SOBP widths increased from the reference width. The largest difference between centers was observed for small range modulations, with an OF difference of up to 17% between CAL and IFJ for a modulation of 1.00 g cm−2. A third-order polynomial best-fit curve representing the trend for any non-discrete value of SOBP width is also displayed in figure 1(C). Due to the insertion of the range shifter in the proton beam, the number of protons contributing to the dose at the measurement point decreased; this resulted in an OF difference of up to 20% between CAL and HollandPTC for a range shifter thickness of 1.50 g cm−2. A difference in OF of 1.5% was observed between CAL and HollandPTC for an air gap distance of 2.5 cm.
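For reference, a third-order best fit of this kind can be obtained with a few lines of numpy; the OF values below are illustrative, not the measured data of figure 1:

```python
import numpy as np

# Illustrative (not measured) OF values versus SOBP modulation width in g cm^-2
sobp = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
of = np.array([1.10, 1.05, 1.00, 0.96, 0.93])

# Third-order polynomial best fit, as used for the trend curves in figure 1(C)
coeffs = np.polyfit(sobp, of, deg=3)
of_interp = np.polyval(coeffs, 1.75)   # OF estimate for a non-measured SOBP width
```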

ML-based models for MU prediction
Based on the Bayesian optimization, the six algorithms leading to the lowest mean absolute errors were automatically selected: decision tree, random forest, extra trees, K-nearest neighbors (KNN), gradient boosting (XGBoost), and support vector regression (SVR) (Hastie et al 2009).
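In scikit-learn terms, the six selected regressors correspond to the following estimators; this is a sketch only, the XGBoost implementation of gradient boosting is assumed, and all tuned hyperparameters are omitted:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor   # gradient boosting (XGBoost)

# The six ML-pipelines evaluated in this study; each was tuned independently
models = {
    "decision_tree":     DecisionTreeRegressor(),
    "random_forest":     RandomForestRegressor(),
    "extra_trees":       ExtraTreesRegressor(),
    "knn":               KNeighborsRegressor(),
    "gradient_boosting": XGBRegressor(),
    "svr":               SVR(),
}
```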
The ML-based methods were tested using 30% of the delivered patient-specific treatment plans from each institution, resulting in MU predictions for 1025 patients from CAL, 64 patients from IFJ, and 36 patients from HollandPTC. Results are shown in table 1; results for HollandPTC only are shown in table 2. Figure 3 shows a comparison between the MUs predicted by each algorithm and the real MUs obtained using the institutional gold-standard methods, i.e. Monte Carlo simulations at CAL and measurements at HollandPTC and IFJ, for the test dataset. Figure 4 displays histograms of the relative errors between predicted and real MUs for the test dataset. The extra trees regressor outperformed the other ML-based algorithms. In the test set comprising all 1125 patients, the predictive outputs of the extra trees were within 3%, 5% and 10% of the real MUs (simulated by Monte Carlo or measured) for 85.2%, 91.4% and 98.6% of the plans, respectively. Mean absolute errors between predicted and real MUs were 25.7 and 39.4 for the extra trees and support vector regression algorithms, respectively (table 1).
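The reported percentages and absolute errors follow directly from the predicted and real MUs of the evaluation set; a minimal sketch of how such figures could be computed (array names hypothetical):

```python
import numpy as np

def summarize_predictions(mu_pred, mu_real):
    """Fraction of plans within 3%, 5% and 10% of the real MUs, plus mean/median absolute error."""
    rel_err = np.abs(mu_pred - mu_real) / mu_real
    within = {f"within {int(tol * 100)}%": 100 * np.mean(rel_err <= tol)
              for tol in (0.03, 0.05, 0.10)}
    abs_err = np.abs(mu_pred - mu_real)
    return within, abs_err.mean(), np.median(abs_err)

# Example usage with illustrative arrays of predicted and gold-standard MUs
print(summarize_predictions(np.array([4100.0, 3550.0]), np.array([4000.0, 3500.0])))
```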
The mean and median absolute errors generated with the single-institutional model were globally larger than with the multi-institutional model. Predicted MUs were nevertheless within 10% of the real values for more than 97% of the 36 patients in the test dataset, except with the decision tree algorithm (table 2).

Discussion
In this proof-of-concept study, we have implemented a methodology to determine the MUs for patient-specific treatment plans in proton therapy of eye melanomas. We have demonstrated that the results are comparable, within a certain uncertainty, to the current institutional gold-standard methods, i.e. measurements or Monte Carlo simulations. This is an important step towards decreasing the workload of measurements prior to treatment delivery and towards the development of an independent MU calculation tool that can be used for quality assurance. As ocular proton therapy centers treat limited numbers of patients, it is challenging to collect the large dataset needed for an institution-specific ML prediction tool. A generalized multi-institutional tool would overcome this problem and might serve as a warm start for centers wishing to implement their own model. Therefore, in this study we collected data from three passively single-scattered proton therapy systems utilizing different technologies and demonstrated the feasibility of a generalized MU prediction tool using six popular machine learning regressors (Kacperek 2012, Fleury et al 2021). Predictions across algorithms were accurate within 3% uncertainty for up to 85.2% of the plans and within 10% uncertainty for up to 98.6% of the plans with the extra trees algorithm.
The first important question to address is what level of uncertainty might be clinically acceptable. As presented in tables 1 and 2, the larger the accepted uncertainty, the higher the performance score of the selected ML prediction tool. At this stage, the proposed methodology is successful in predicting the correct number of MUs required for a plan within 10% uncertainty, independently of the center the input data came from. In radiotherapy, an accuracy below 3% is generally required for clinical use (Arjomandy et al 2019); improvements in the current ML models are therefore necessary. On the other hand, this tool might have great potential for flagging output errors through outlier identification, thus improving the current process for detecting treatment delivery failures and increasing patient safety. A thorough external validation is needed to identify whether a particular type of patient-specific plan systematically fails the prediction.
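Such a flagging step could, for instance, compare the clinically determined MUs against the model prediction and raise an alert beyond a chosen tolerance. The sketch below is purely illustrative, and the 10% threshold mirrors the current prediction accuracy rather than a clinical recommendation:

```python
def flag_mu_outlier(mu_planned, mu_predicted, tolerance=0.10):
    """Return True if the planned MUs deviate from the ML prediction by more than the tolerance."""
    return abs(mu_planned - mu_predicted) / mu_predicted > tolerance

# Example: a plan whose planned MUs differ by ~15% from the prediction would be flagged
flag_mu_outlier(mu_planned=4600, mu_predicted=4000)   # True
```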
As demonstrated in figure 1 for the three centers participating in this investigation, the OF as a function of field size, range shifter thickness, SOBP modulation width and air gap depends on the site-specific beamline parameters. We used this knowledge to objectively decide which features should be used as inputs to the ML-predictive pipelines. As the performance of the algorithms is driven by the complexity of the data themselves, this first investigation was restricted to the five input features (figure 2) that most influence the monitor units. Overall, the best-performing ML algorithm, associated with the lowest mean absolute errors, was the extra trees algorithm. Unlike the decision tree algorithm, which is easily prone to overfitting, ensemble methods use multiple trees instead of just one to prevent this problem. Both random forest and extra trees build a group of decision trees, leading to better accuracy and robustness; with these two algorithms, the final MU prediction was the average of the predictions made by each tree. While random forest and extra trees are similar in concept, the key difference lies in how the trees are split and combined. The extra trees method, which adds more randomness during the splitting phase, seems to handle the five input features of our study better. It is also interesting to mention that we observed a similar predictive power when using the extra trees regressor or gradient boosting. Unlike the extra trees algorithm, which builds trees independently, the gradient boosting algorithm constructs them one by one in a deterministic and systematic splitting approach, with each tree correcting the errors of the previous ones. Finally, the SVR algorithm also demonstrated the ability to make accurate predictions; however, it resulted in larger mean absolute errors in both the multi-institutional and single-institutional models. This observation can be attributed to the high sensitivity of SVR to outliers, as illustrated in figure 3(F). To better understand this limitation, the selection of the kernel function that builds the hyperplane of the SVR algorithm should be explored further using a larger dataset.
It is important to note that the beamlines selected for this study deliberately included completely different setups. Geometrical distances between the beam-modifying devices, the location of the ionization chambers, and material thicknesses are major parameters influencing the absolute dosimetry in ocular proton therapy, and they differ between current clinical installations (Fleury et al 2021). Factors contributing to changes in MU for a given patient are both center-specific and patient-specific. For the former, the energy straggling and the spreading system up to the scattering foil define the initial beam quality of the pristine Bragg peak and hence the emittance of the nominal-energy beam. The latter, defined by a patient's plan, is expressed by a specific SOBP and residual range. The main beam-modifying devices that play a role are the scattering foil(s), the range shifter and the modulator wheels, which trigger multiple Coulomb scattering (Gottschalk et al 1993) and lead to a wider angular emittance of the beam envelope. While the scattering foil(s) contribute less to changes in the final MUs required for a treatment, the range shifter and modulator wheels account for most of the differences in MUs among patients. The air gap is an indirect translation of the angular beam emittance of the proton beam back-projected at the virtual source, which differs from center to center and from patient to patient depending on the scattering foil(s), the range shifter materials and thicknesses, and the steps of each modulator wheel. To further improve the predictions with a view to clinical deployment, it could be interesting to accurately take into account the center-dependent angular beam emittance, the design of all beam-modifying devices, and the source-to-axis distances. Lastly, the surface area of the aperture shape could be added as an input feature to the predictive models, especially for small tumors between 5 mm and 1 cm in diameter, for which the OF as a function of field size matters most.
This study has some limitations. The first is the different number of patients from each center: CAL provided 3415 patients, IFJ 213 patients, and HollandPTC 120 patients. The predictions might therefore be biased towards the CAL beamline and patient selection. The beamline differences were presented in section 2.1. The differences in patient selection are explained by local patient referral patterns; for example, the spectrum of tumors treated at CAL was broader than at IFJ and HollandPTC, due for instance to the inclusion of conjunctival melanomas at CAL (Scholz et al 2019, Thariat et al 2019), leading to a different selection of beam-modifying devices. To validate the generalization potential of this proof-of-concept study, it would be necessary to involve more centers for both training and testing of the models. The multi-institutional data would have to represent the typical patient set per institute and should be balanced in the number of patients. It is expected that additional data included in the training process would provide a higher prediction accuracy and a higher generalization score across proton centers. It would be interesting to investigate the versatility of such a generalized model with the inclusion of centers that have even more therapeutic options; the ocular beamline OPTIS2 at the Paul Scherrer Institute, Switzerland, for instance, comprises a set of nine scattering foils in clinical use (Heufelder et al 2008). It would also be interesting to investigate the applicability of such a generalized model in institutes that did not provide training data but would like to use the tool in their clinical practice.
Building on this work, the ultimate goal is to develop an MU calculator that could be integrated into any commercial treatment planning system and serve as an independent MU calculation tool for quality assurance.

Conclusion
A proof-of-concept of a machine learning-based generic MU prediction tool for ocular proton therapy has been demonstrated. At this stage, the prediction accuracy is within 10%, with both the extra trees and gradient boosting regressors performing best in generalizing across the proton eye beamlines of CAL, IFJ, and HollandPTC. Future work requires the collection of an extensive amount of data within a broader international setting to improve the prediction accuracy.

Figure 1.
Figure 1. Output factor comparison. Output factors expressed as a function of: (A) field size; (B) range shifter thickness (and therefore residual range); (C) SOBP; (D) air gap distance. Reference conditions for the three institutes were: SOBP_REF = 2.00 g cm−2, field size_REF = 25 mm circular diameter, z_REF = mid-SOBP, no range shifter in place. Measurement points are represented with crosses and best-fitting curves with solid lines.

Figure 2.
Figure 2. Venn diagram explaining the selection of the numerical features used as inputs in every machine learning (ML)-pipeline implemented in this study, and the number of patients used for training and evaluation (testing) of the models.

Figure 3.
Figure 3. Scatter plots displaying the ratio between predicted and real MUs for each ML-pipeline considered in this study (A)-(F). Real MUs were obtained by the institutional gold-standard methods, i.e. on-site measurements for the cohorts of patients from IFJ-Krakow and HollandPTC, and Monte Carlo simulations using the MCNPX code for the cohort of patients from CAL-Nice. Results are shown for patients belonging to the test dataset.


Figure 4.
Figure 4. Histograms displaying the counts of the relative errors between predicted and real MUs for each ML-pipeline considered in this study (A)-(F). Real MUs were obtained by the institutional gold-standard methods. The non-dashed zone displays the accuracy tolerance of ±3% commonly required in clinical use. Results are shown for patients belonging to the test dataset.

Table 1.
Differences between predicted and real MUs in the generalized model: the training, validation and test sets were based on the treatment plans from all centers together.

Table 2.
Differences between predicted and real MUs in a single-institution model: the training, validation and test sets were based on the treatment plans from HollandPTC only.