Benchmarking proton RBE models

Objective. To biologically optimise proton therapy, models which can accurately predict variations in proton relative biological effectiveness (RBE) are essential. Current phenomenological models show large disagreements in RBE predictions, due to different model assumptions and differences in the data to which they were fit. In this work, thirteen RBE models were benchmarked against a comprehensive proton RBE dataset to evaluate their predictions when all models are fit using the same data and fitting techniques, and to assess the statistical robustness of the models. Approach. Model performance was initially evaluated by fitting to the full dataset, and then a cross-validation approach was applied to assess model generalisability and robustness. The impact of weighting the fit and the choice of biological endpoint (either single or multiple survival levels) was also evaluated. Main results. Fitting the models to a common dataset reduced differences between their predictions; however, significant disagreements remained due to different underlying assumptions. All models performed poorly under cross-validation in the weighted fits, suggesting that some uncertainties on the experimental data were significantly underestimated, resulting in over-fitting and poor performance on unseen data. The simplest model, which depends linearly on LET but has no tissue or dose dependence, performed best for a single survival level. However, when fitting to multiple survival levels simultaneously, more complex models with tissue dependence performed better. All models had significant residual uncertainty in their predictions compared to experimental data. Significance. This analysis highlights that poor quality of error estimation on the dose response parameters introduces substantial uncertainty in model fitting.
The significant residual error present in all approaches illustrates the challenges inherent in fitting to large, heterogeneous datasets and the importance of robust statistical validation of RBE models.


Introduction
Proton therapy is an important alternative to conventional radiotherapy, due to protons' physical advantages over photons. Protons' finite range and Bragg peak allow dose to be conformed to the tumour region, which can significantly reduce radiation-induced side-effects (Mohan and Grosshans 2017). Protons also produce slightly more biological damage than photons for the same physical dose, due to energy being deposited more densely, resulting in more complex damage (Vitti and Parsons 2019). This biological difference is quantified by their relative biological effectiveness (RBE), defined as the ratio of a reference photon dose to the dose of the radiation of interest (such as protons) which leads to the same biological effect. Together, these factors can enable better treatment outcomes for proton therapy.
Currently, a constant proton RBE of 1.1 is widely adopted in proton therapy treatment planning. Although it has been successful clinically, the use of a constant RBE is a simplification, with early experimental and clinical studies suggesting variations in proton RBE (Blomquist et al 1993, Jones and Errington 2000). Experimental studies in vitro and in vivo have shown that RBE depends on the proton energy (through linear energy transfer, LET), tissue radiosensitivity, dose, biological endpoint, and tissue oxygenation (Paganetti 2014, Paganetti et al 2002). In high precision in vitro cell survival experiments, RBE increases from approximately 1 in the entrance region, to 1.4 in the mid spread-out Bragg peak (SOBP) region, and significantly higher (over 3) in the distal fall-off region (Guan et al 2015).
Deviation from a constant RBE of 1.1 in clinical data remains less clear (Underwood et al 2022), but analysis of magnetic resonance (MR) images in pediatric ependymoma patients revealed a correlation between post-treatment image changes and increased LET, indicating increased biological effectiveness (Gunther et al 2015, Peeler et al 2016). Another study, which analysed follow-up computed tomography (CT) scans of chest-wall patients, showed changes in lung density at the end of range for proton treatments, indicating variations in proton RBE (Underwood et al 2018). In addition to these correlative LET and RBE studies, clinical studies have reported a higher incidence of serious late effects following treatment of brain gliomas with proton therapy compared to photon therapy, which may indicate an increase in RBE (Eulitz et al 2023).
While protons have both physical and biological advantages over photons, the associated uncertainties limit the potential of proton therapy. Due to the physical uncertainties associated with proton beam delivery and target conformity, primarily arising from beam range uncertainty, a safety margin is added to protect organs at risk (OARs), resulting in a less conformal dose distribution (Paganetti 2012). Similarly, the large uncertainties associated with RBE limit its clinical exploitation in treatment planning, meaning that for many patients the target tumour may be under-dosed and healthy tissue over-dosed, limiting the effectiveness of the treatment (Mohan et al 2017). Therefore, to achieve the biological optimisation of proton therapy, accurate proton RBE models are essential.
Mechanistic approaches to modelling proton RBE have proven challenging to validate (McMahon and Prise 2019), so most published proton RBE models are phenomenological. These models are based on empirical in vitro cell survival data, and generally use the well-established linear quadratic (LQ) radiation response model to predict proton α and β values (McNamara et al 2020). The models are similar in that they incorporate the proton LET, tissue-specific photon α and β values, and the dose to predict proton RBE by identifying key trends in the experimental data. However, they all propose different relationships between these parameters and RBE.
A number of dose planning studies have incorporated phenomenological RBE models to examine the clinical effects of variable RBE in treatment planning (Giovannini et al 2016, Underwood et al 2016, Ödén et al 2017). These studies have demonstrated the impact that incorporating variable RBE in clinical scenarios can have on treatment predictions, highlighting the importance of validating models for clinical use. However, most such studies incorporate only a small subset of the available models, often in too few scenarios to enable robust comparison.
Rørvik et al completed the first wide comparison across eleven different phenomenological models and two plan-based models, in which model intentions were compared and, by applying the models to simulated patient cases, differences in the model dependencies and assumptions were analysed (Rørvik et al 2018). This study showed considerable variation in the RBE-weighted doses calculated by the different models, particularly in the tumour region, where all models predicted RBEs much higher than 1.1. It was suggested that this variability resulted from fundamental differences in the experimental databases used to fit the different models, the underlying model assumptions, and the fitting techniques applied. Analysis of the different datasets used to fit the models highlighted significant variation in the ranges of LET and (α/β)x values used, with most models fit to very small datasets.
The same thirteen models were also investigated to evaluate systematic differences and similarities in their underlying assumptions (McMahon 2021). The effect of both the physical (LET) and biological ((α/β)x) parameters on the RBE predictions was investigated by analysing the correlation between the predictions of the different models with respect to these parameters. The LET dependence of RBE across models was shown to correlate well at clinically relevant LETs, but significant disagreement was seen in the impact of the (α/β)x parameter on RBE, suggesting large variation in how the models interpret biological factors.
The variability in fitting data is thought to be a key limitation of current models. Many models were fitted using small datasets, some including only one cell line (and therefore a very small range of (α/β)x values), whereas other models were fitted to larger heterogeneous datasets covering a wider range of cell lines. This hampers direct comparison of model performance and identification of optimal model features. A given (α/β)x dependence may be representative of the limited subset of cells used to fit a model, but may not capture the true variation seen across diverse cell types. The different models may thus only be applicable to specific clinical scenarios. This is further compounded by the significant uncertainties associated with many biological endpoints, such as the (α/β)x values used in RBE model fitting, and the inter-experimental variation introduced when data are compared across several publications (Seed et al 2016). These uncertainties can significantly obscure the real biological trends and contribute to the disagreement seen between different models.
This work aims to benchmark phenomenological proton RBE models against a comprehensive dataset of experimentally measured proton RBE data, assessing the robustness of their predictions when all models are fitted to the same data using the same fitting techniques. This may allow the identification of the models, or specific model features, which best represent the proton RBE data, providing useful information for future model development and the biological optimisation of proton therapy.

Data preparation
Data from the proton RBE review by Paganetti was selected for this analysis, as it is the largest available source of proton RBE data, containing 369 cell survival data points from 76 published experiments across a large range of LET and (α/β)x (Paganetti 2014). Most data points have low LETs, with 62% below 5 keV μm−1 and 87% within the clinically relevant range of LET (<20 keV μm−1). There was a large spread of (α/β)x in the dataset, with the majority (78%) within the clinically relevant range ((α/β)x < 10 Gy). However, a sizeable fraction of points (9%) have very high (α/β)x values (>25 Gy) and a small fraction (1.6%) have (α/β)x values close to zero.
Some data was excluded from this analysis. Two data points with very high LET values (63.7 keV μm−1 and 88.8 keV μm−1) from the Belli et al (1989) study were excluded, along with one point with a very high RBE value (RBE = 7.42) from the Petrović et al (2010) study. Particles with such high LET values have a range less than a cell diameter, increasing uncertainties and hampering interpretation. Experiments performed under hypoxia were also excluded, since hypoxia affects both radiosensitivity and RBE compared to well-oxygenated conditions. Finally, some points were removed due to variability in experimental methods, including cells that had been re-irradiated (Bettega et al 2000), and those with only <2 Gy survival data, which did not allow accurate estimation of the β parameter (Wouters et al 2015). In total, 18 data points were removed for these reasons, leaving 351 points.
An initial analysis identified that fit errors were dominated by a small number of high-LET points (>20 keV μm−1). As these points were highly heterogeneous and subject to significant uncertainty, the final benchmarking study was performed within the clinically relevant range, with all points with LET >20 keV μm−1 excluded from the dataset. This removed a further 40 points, leaving 311 points in this analysis.
In addition, many experiments reported either no uncertainties or very small uncertainties on α and β (below 1% of the parameter value). Distributions of these reported uncertainties on α and β are shown in supplementary figure 1. These experiments dominated weighted fits and made them very unstable. As such levels of uncertainty are biologically implausible, uncertainties reported as fractional values of <1% were scaled to the median value in the dataset, 13% of the parameter value. 61 of the 311 remaining points had their uncertainties scaled.
Finally, when the (α/β)x ratio was used in an RBE model, its value was restricted to between 0.01 and 100 Gy, as some RBE models scale either in proportion or in inverse proportion to this parameter, and fits were made unstable by outlying values. Similarly, for models which scale inversely with the αx parameter, this parameter was restricted to be above 0.0001. The underlying α and β values were still used to calculate overall survival.
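These preparation steps can be sketched as follows. This is an illustrative reconstruction, not the original analysis code; the function and variable names are ours, and only the numerical thresholds (the 1% floor, the 13% median, and the parameter limits) come from the text above.

```python
import numpy as np

def prepare_parameters(alpha_x, beta_x, sigma_alpha, sigma_beta):
    """Illustrative sketch of the data-preparation steps described above.

    1. Uncertainties reported as <1% of the parameter value are rescaled
       to the dataset's median fractional uncertainty (13%).
    2. The (alpha/beta)_x ratio used inside the RBE models is clipped to
       [0.01, 100] Gy, and alpha_x floored at 1e-4, to stabilise fits.
    """
    alpha_x = np.asarray(alpha_x, dtype=float)
    beta_x = np.asarray(beta_x, dtype=float)
    sigma_alpha = np.asarray(sigma_alpha, dtype=float)
    sigma_beta = np.asarray(sigma_beta, dtype=float)

    # Step 1: floor implausibly small reported uncertainties.
    frac_a = sigma_alpha / alpha_x
    frac_b = sigma_beta / np.where(beta_x > 0, beta_x, np.inf)
    sigma_alpha = np.where(frac_a < 0.01, 0.13 * alpha_x, sigma_alpha)
    sigma_beta = np.where(frac_b < 0.01, 0.13 * beta_x, sigma_beta)

    # Step 2: restrict the parameters used inside the RBE models.
    alpha_model = np.clip(alpha_x, 1e-4, None)
    ab_ratio = np.clip(alpha_x / np.maximum(beta_x, 1e-12), 0.01, 100.0)
    return sigma_alpha, sigma_beta, alpha_model, ab_ratio
```

Note that the clipped values are only fed to the RBE models; as stated above, the unmodified α and β values are still used when evaluating survival itself.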

Proton RBE models
The thirteen phenomenological RBE models benchmarked in this study (see table 1 for references) are built on the linear quadratic (LQ) model of cell survival:

S = exp(−αD − βD²)

where S is the surviving fraction of cells following exposure to a dose D, and α and β are the dose response parameters (McMahon 2019). RBE is defined as the ratio of doses of two different types of radiation which produce the same biological effect, so the choice of biological effect can impact the absolute RBE value. In this work, two biological effects were investigated. Firstly, the mean inactivation dose (MID), which is defined as the area under the dose response curve, was used to calculate RBE as follows:

RBE_MID = MID_x / MID_p

where MID_x and MID_p are the photon and proton MID values, respectively (Belli and Simula 1991). The key advantage of using MID is that it summarises the whole survival curve in a single value, avoiding the use of a single survival or dose level selected by the investigator. The second biological effect considered is the dose D_X required to give a particular survival level, where X is the percentage survival (e.g. D10 is the dose at 10% survival). This RBE was calculated as follows:

RBE_DX = D_X,x / D_X,p

where D_X,x and D_X,p are the photon and proton doses giving X% survival. When analysing different surviving fractions, the RBE was calculated at ten log-spaced survival levels between 0.01 and 1 for each experiment, and the models were fitted to all survival levels at once, so that the effect of incorporating different dose levels into the model fits could be studied.
The phenomenological RBE models implemented in this study were formulated based on the standardised model expressions published by Rørvik et al (2018), where they were expressed in terms of RBEmax and RBEmin functions:

RBEmax = αp / αx,    RBEmin = √(βp / βx)

These RBEmin and RBEmax functions depend on the cell-specific αx and βx parameters and the proton LET (with varying dependencies between models), and define the relationship between the proton and photon LQ dose response parameters. The functions are listed in table 1 for each of the models, where the parameters p0 to p4 represent the model-specific fit parameters.
These models were originally fitted to different selections of data (in most cases, a very small subset of the Paganetti dataset), and these published fit parameters were used for initial comparisons. In this work, all models were re-fitted to the Paganetti dataset, giving new fit parameters (p0 to p4) for each model. Note that, as the expressions for the Wedenberg, Mairani, and Rørvik un-weighted (Rørvik U) models are the same, the published models differ only in their parameter values due to fitting of different datasets. Re-fit models of this form are denoted simply as 'Wedenberg' in the text below.
Some of the parameter estimations for the Jones model were very unstable when fit to the Paganetti dataset alone. The Jones model was developed using both proton and heavy ion data covering a wide range of LET and RBE values. This was intended to provide applicability to clinically relevant ions other than protons, but because the model design incorporates a turnover at higher LETs, fitting it to proton data alone often resulted in unstable fits. Therefore, to ensure stable fits, the two parameters in this model most strongly dependent on these high-LET points (p0 and p4) were set to the published values. These values have a limited impact on the fit performance at low LETs as they are highly covariant with other model parameters (p1 and p3).

Table 1. The thirteen phenomenological RBE models benchmarked in this study, as formulated in the study by Rørvik et al (2018). In the RBEmin and RBEmax expressions, p0-p4 represent the fit parameters, LET is the linear energy transfer, and αx and βx are the photon dose response parameters.

Similarly, for the Carabe model, a stable fit could not be generated for one of the parameters (p4). This is because p4 was used to normalise predicted responses to the (α/β)x of their reference cell line (V79), and so is perfectly covariant with the two other slope parameters (p1 and p3). To enable comparison with parameters in the original publication, we therefore fixed p4 at the published value.
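The MID endpoint described above has a convenient closed form for the LQ model, since the area under exp(−αD − βD²) can be written with the scaled complementary error function, erfcx(x) = exp(x²)·erfc(x). The helper functions below are our own minimal sketch, not the study's code, and include a numerical cross-check of the closed form:

```python
import numpy as np
from scipy.special import erfcx
from scipy.integrate import quad

def mid(alpha, beta):
    """Mean inactivation dose: area under the LQ survival curve,
    MID = integral from 0 to infinity of exp(-alpha*D - beta*D^2) dD.
    Closed form: sqrt(pi)/(2*sqrt(beta)) * erfcx(alpha/(2*sqrt(beta)));
    reduces to 1/alpha when beta = 0."""
    if beta <= 0:
        return 1.0 / alpha
    x = alpha / (2.0 * np.sqrt(beta))
    return np.sqrt(np.pi) / (2.0 * np.sqrt(beta)) * erfcx(x)

def rbe_mid(alpha_x, beta_x, alpha_p, beta_p):
    """RBE at the MID endpoint: RBE_MID = MID_x / MID_p."""
    return mid(alpha_x, beta_x) / mid(alpha_p, beta_p)

# Cross-check the closed form against direct numerical integration.
numeric, _ = quad(lambda D: np.exp(-0.2 * D - 0.05 * D**2), 0, np.inf)
```

Using erfcx rather than exp(x²)·erfc(x) directly avoids overflow for radioresistant parameter combinations with large α/√β.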
A function for each model was created in Python, which takes the photon dose response parameters and LET values from the Paganetti dataset, along with the chosen biological endpoint, as input, and uses the model-specific RBEmin and RBEmax functions listed in table 1 to calculate the proton dose response parameters (αp and βp) for each experiment. These proton response parameters are then used along with the corresponding photon response parameters to calculate the predicted proton RBE at the specified biological endpoint for each model.
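The per-model workflow described above can be sketched as follows. The RBEmax/RBEmin functions and the parameter value in the example are illustrative placeholders (a linear-in-LET RBEmax with no tissue dependence and RBEmin = 1), not any published parameterisation:

```python
def proton_lq_parameters(alpha_x, beta_x, let, rbe_max_fn, rbe_min_fn):
    """Sketch of the per-model workflow: model-specific RBE_max and
    RBE_min functions (of the photon parameters and LET) scale the
    photon LQ parameters to proton ones via
    alpha_p = RBE_max * alpha_x  and  beta_p = RBE_min**2 * beta_x."""
    rbe_max = rbe_max_fn(alpha_x, beta_x, let)
    rbe_min = rbe_min_fn(alpha_x, beta_x, let)
    return rbe_max * alpha_x, rbe_min**2 * beta_x

# Example with an illustrative RBE_max linear in LET (slope invented for
# demonstration) and RBE_min fixed at 1.
alpha_p, beta_p = proton_lq_parameters(
    alpha_x=0.15, beta_x=0.05, let=3.0,
    rbe_max_fn=lambda a, b, L: 1.0 + 0.04 * L,
    rbe_min_fn=lambda a, b, L: 1.0,
)
```

The resulting αp and βp would then be passed to the chosen endpoint calculation (MID or D_X) to produce the predicted RBE for each experiment.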

Comparing the published models
The RBE predictions of the published models were compared as a function of (α/β)x and LET, to examine similarities and differences in the model trends. For this, values for αx, βx and LET were randomly generated using clinically relevant ranges: 0-10 Gy for (α/β)x and 0-15 keV μm−1 for the LET. These were used to calculate the αp and βp values and corresponding RBE_MID for each virtual experiment, similar to previous studies (McMahon 2021).
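A sketch of this virtual-experiment comparison is given below, with two toy RBE models (one LET-only, one with an inverse (α/β)x dependence) standing in for the published parameterisations; all coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample clinically relevant (alpha/beta)_x and LET values and evaluate
# each model's RBE on the same virtual experiments.
n = 1000
ab_x = rng.uniform(0.0, 10.0, n)   # (alpha/beta)_x in Gy
let = rng.uniform(0.0, 15.0, n)    # LET in keV/um

# Two toy models to compare (illustrative coefficients only):
rbe_a = 1.0 + 0.04 * let                              # LET-only
rbe_b = 1.0 + 0.5 * let / np.maximum(ab_x, 0.01)      # inverse (a/b)_x

# Pearson correlation between the two models' predictions.
r = np.corrcoef(rbe_a, rbe_b)[0, 1]
```

Repeating this for every pair of models yields the kind of correlation matrix used later in this work to compare benchmarked models.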

Benchmarking the RBE models
The RBE models were benchmarked against the restricted Paganetti dataset to evaluate their performance when using a common dataset. This involved fitting each model to the data to determine new fit parameters (p0-p4), calculating the predicted RBE values using these fit parameters, and comparing the level of agreement statistically, as described below.
The models were initially benchmarked using all experiments in the restricted Paganetti dataset to both fit the models and then measure performance. Each model was fitted to the data using Levenberg-Marquardt optimisation via the SciPy library's optimize.curve_fit method, and RBE values were calculated for each experiment. The impact of weighting by uncertainties on the model fits was also considered, where the reported uncertainties on the dose response parameters were used to calculate the uncertainties on the observed RBE values, which were then used in a weighted optimisation.
Model performance was then quantified through residual analysis. For the un-weighted fits, the root mean square deviation (RMSD) was used to measure model performance:

RMSD = √[ (1/N) Σ_i (RBE_obs,i − RBE_pred,i)² ]

and for the weighted fits the chi-square (χ²) statistic was used:

χ² = Σ_i (RBE_obs,i − RBE_pred,i)² / σ_i²

The reduced χ², which is the χ² value divided by the number of degrees of freedom (DOF) in the dataset, was also calculated. If a model has a reduced χ² ≫ 1, this suggests that either the model was a poor fit and unable to effectively capture the data, or that the errors associated with the data were underestimated. On the other hand, if a model has a reduced χ² < 1, this may mean the model is over-fitting to the data or the errors are overestimated. While such quantitative observations cannot be drawn directly from the RMSD, the relative RMSD between models is informative about their performance. The models were then ranked from best to worst performing based on these values, and the residual distributions were also assessed to identify any outliers. To identify the optimal model, the Akaike information criterion (AIC), which rewards models which perform well and penalises models for increased complexity, was used:

AIC = N ln(RSS/N) + 2k

where k represents the number of parameters in the model, N is the total number of data points, and RSS is the residual sum of squares. For each model, the difference between its AIC (AIC_model) and the AIC value of the best performing model (AIC_min) was calculated to determine their relative performance:

ΔAIC = AIC_model − AIC_min
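The fitting and scoring workflow can be illustrated with a toy example. The model, data points, and uncertainties below are placeholders for demonstration, not the benchmarked dataset:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(let, slope):
    """Toy one-parameter RBE model, linear in LET."""
    return 1.0 + slope * let

# Placeholder observations (not the Paganetti dataset).
let = np.array([1.0, 3.0, 5.0, 8.0, 12.0])
rbe_obs = np.array([1.05, 1.12, 1.22, 1.30, 1.50])
sigma = np.full_like(rbe_obs, 0.05)  # uncertainties for the weighted fit

# curve_fit uses Levenberg-Marquardt by default for unbounded problems;
# passing sigma performs the weighted optimisation.
popt, pcov = curve_fit(model, let, rbe_obs, p0=[0.04],
                       sigma=sigma, absolute_sigma=True)
resid = rbe_obs - model(let, *popt)

n, k = len(rbe_obs), len(popt)
rmsd = np.sqrt(np.mean(resid**2))                  # un-weighted score
chi2_red = np.sum((resid / sigma)**2) / (n - k)    # reduced chi-square
aic = n * np.log(np.sum(resid**2) / n) + 2 * k     # least-squares AIC
```

Ranking models then amounts to comparing their RMSD (or reduced χ²) and computing ΔAIC relative to the lowest AIC.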

Cross-validation
When fitting to the full dataset, the same data is used to both fit the models and test performance, which means there is a high chance of over-fitting to the data, and there is no way to know how well a model will perform on unseen data. A K-fold cross-validation approach, where part of the dataset is used for fitting and the rest for testing, was therefore implemented to validate model performance and robustness.
In K-fold cross-validation, the parameter K is the number of groups or 'folds' that the dataset is split into. Each model is fitted using K−1 folds and validated using the remaining fold, with the process repeated until all folds have been used for validation. Three folds were selected for this analysis, as this was high enough to examine how well the models generalise, but low enough to still allow sufficient data points in each fold.
For this study, the data was divided into three parts based on publication (ordered alphabetically by author), balanced so that approximately 33% of the data was in each fold. The split was performed by experiment to ensure there was no leakage of information due to data from a given experiment appearing in both the testing and training data. Each of the three folds initially had 117 data points, and when high-LET values were filtered out the three folds had 103, 103, and 105 points, respectively.
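The grouped three-fold split can be sketched as below. This simplified helper is ours (not the study's code): it orders publications alphabetically and cuts at cumulative thirds of the data, so folds are balanced only approximately, but no publication is ever split across folds:

```python
def three_fold_by_publication(publications):
    """Split a list of per-point publication labels into three folds of
    whole publications, cut at cumulative thirds of the data, so that no
    experiment appears in both training and testing sets."""
    pubs = sorted(set(publications))
    counts = {p: publications.count(p) for p in pubs}
    total = len(publications)
    folds, current, size = [], [], 0
    for p in pubs:
        current.append(p)
        size += counts[p]
        # Close a fold once the cumulative count reaches the next third.
        if size >= total * (len(folds) + 1) / 3 and len(folds) < 2:
            folds.append(current)
            current = []
    folds.append(current)
    return folds
```

Each fold of publications then defines a testing set, with the remaining two folds forming the corresponding training set.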
The data was fitted and tested in each fold using the workflow outlined above for the full dataset. Figure 1 shows an overview of this cross-validation process, which was implemented in Python. The three folds were used to create training and testing datasets; distributions of the ranges of LET and (α/β)x values in each training and testing dataset are shown in supplementary figures 2-4. Once the three iterations of training and testing were completed, the average parameter values and the corresponding standard deviations were used to examine the variation in predictions across folds, and the sums of the chi-square values or residuals across the three folds were calculated to assess model performance. As for the full dataset fits, the models were then ranked using AIC to identify the best performing. The cross-validation was completed using both MID and multiple survival levels as the biological endpoint, as outlined above.

Testing model performance on low (α/β)x values
To investigate whether any models would be best for predicting responses in late-reacting tissues, where serious side effects can occur, model performance was assessed using a range of low (α/β)x values. For this analysis, the models were trained using the full dataset and then performance was tested only on the points with (α/β)x values between 1 and 5 Gy (179 data points). Model performance was quantified through residual analysis, as described above.

Results
Comparison of published models
Large differences in both the underlying structures of the published models and their absolute values were seen when used with published parameters (figure 2). While most models have a linear dependence on LET (apart from Peeler, Rørvik W, and Chen), the magnitude of the RBE predictions differed significantly. In particular, the Chen, Jones, and Wilkens models predicted notably higher RBEs. The Chen and Wilkens models have no tissue dependence, and the Jones model is unique in that RBEmax and RBEmin depend only on αx or βx respectively, rather than on the (α/β)x ratio.
The dependence on cellular (α/β)x is where the models show the largest discrepancies. Several models have an explicit inverse dependence on this parameter, predicting lower RBE values at higher (α/β)x. However, the models which do not include this inverse dependence generally predicted higher RBE values at higher (α/β)x. The Unkelbach model showed no tissue dependence, depending only on LET, and the Peeler model showed almost no dependence at this LET.

Fitting to the full restricted dataset
Table 2 shows the model performance results from the full dataset fits, measured using RMSD for the un-weighted analysis and χ² for the weighted analysis, with models ranked from best to worst performing for each analysis. The McNamara model, which was originally fitted to the Paganetti dataset (though using different restrictions), performed best for both the un-weighted and weighted fits. The Wilkens and Chen models showed the worst performance for both approaches, with more variability amongst the other models. The reduced χ² values were much higher than 1 for all weighted fits, with the best performing model, McNamara, having a reduced χ² value of 3.5. This indicates that none of the model fits fully captured the trends in the weighted analysis and that they differ from measured values by much more than the published uncertainty. This in turn indicates either that the models do not fully describe the underlying RBE variation, or that the published uncertainties do not fully capture the experimental variability.
Figure 3 shows the resulting RBE predictions. Using a uniform dataset improved the overall agreement in RBE predictions; however, large differences in both the magnitude of the predicted values and the structure of the models remained, particularly in their (α/β)x dependency. Notably, the models were in closer agreement for the un-weighted fits, with largely similar trends and magnitudes of RBE. The weighted fits showed significantly more divergence in RBE predictions, with the Chen, Frese, and Peeler models appearing almost independent of LET, while other models showed both positive and negative trends. Analysis of the dependence of RBE on (α/β)x showed that although fitting all models to the same (α/β)x ranges brought the magnitudes of the RBE predictions closer together, a number of models lost their (α/β)x dependency, particularly in the weighted fits.

Cross-validation
When model performance was measured using cross-validation, the model rankings changed slightly, with the Jones model performing best for the weighted analysis, followed by the McNamara model (table 2). However, as noted above, additional constraints had to be placed on the Jones model to enable fitting to proton data alone, due to the inclusion of heavy ion data in its original design, which may affect the interpretation of these results.
It is expected that χ² values would be somewhat higher for cross-validation than for the full dataset fitting (since performance is measured on unseen data). However, in this weighted analysis the χ² values increased dramatically, by more than a factor of two in all cases, indicating poor generalisability of model predictions from weighted fits. This highlights the importance of testing the statistical robustness of the models, as they may not perform as well when applied to unseen data, particularly in the weighted analysis.

Table 2. The root mean square deviation (RMSD) values for the un-weighted fits (left) and the reduced chi-square values for the weighted fits (right) for both the full dataset and the cross-validation analysis, with the models ranked from best to worst performing.

Both the un-weighted and weighted fits had a number of outliers that made disproportionately large contributions to the total RSS/χ² values. This was particularly significant for the weighted fits, where small numbers of outliers with small uncertainties had a much greater influence on the fit, resulting in unstable parameters and very high χ² values. This highlights that many experiments have uncertainties which underestimate their true variability, causing the models to over-fit to these outliers. Due to the poor performance of all models on the weighted data, an un-weighted fit was selected for the final benchmarking of the RBE models.

Final benchmarking
Analysis of outliers showed that many of the points making large contributions to the total RMSD or χ² sum were experiments with a high LET (>20 keV μm−1). Compared to other points in the dataset, these showed significantly greater heterogeneity, which may reflect the greater challenges in both determining LET and ensuring experimental robustness for such low-energy protons. To evaluate how the models performed using only clinically relevant LETs, these 40 high-LET points were filtered out of the dataset and the cross-validation analysis was repeated using an un-weighted fit. The calculated model parameters are listed in supplementary table 1.
Table 3 shows the cross-validation results using the LET-filtered dataset, with the models ranked from best to worst performing using the AIC. This analysis showed that the overall performance of the models improved when the high-LET points were excluded, with the RMSD values reduced by approximately 35% after removing only 11% of the points. The rankings showed that the simplest model, the Unkelbach model, performed best. Notably, the Unkelbach model has no (α/β)x dependence by design, and the next two models (Tilly and McNamara) show a greatly reduced (α/β)x dependence in this fit, suggesting that including (α/β)x may not dramatically affect model performance when only a single survival level is considered.
The comparisons of the observed and predicted RBEs for the Unkelbach and Tilly models in figure 4 highlight the large differences which remained following the benchmarking, even for the best performing models. Model trends for the final benchmarking in figure 5 show that incorporating this LET restriction further improved the agreement between most models (supplementary table 2); however, differences in the underlying structure remained, particularly in the dependence on (α/β)x.
Differences between the RBE_MID predictions of the final benchmarked models were assessed by analysing the correlation between the RBE predictions of each model over clinically relevant ranges of LET and (α/β)x, as previously completed for the published model parameters (McMahon 2021). Figure 6 shows scatter plots of the correlations between RBEs for individual pairs of models by row and column, along with the corresponding Pearson correlation coefficients, where dark blue indicates a strong positive correlation and red indicates a weak correlation. These correlations show a slight improvement over those determined using the published models, indicating that the benchmarking improved the overall agreement between model predictions. However, as expected, many models still correlate poorly due to differences in underlying assumptions, specifically their tissue dependence.

Fitting to multiple survival levels
When the models were fitted to multiple survival levels, the Tilly model performed best (table 4). This suggests that while the Unkelbach model (which has no dose dependence) performs best at describing a single metric (MID), a more complex model is required when considering multiple doses. In this analysis the models are clearly grouped by (α/β)x dependence, with the top four models having an inverse dependence of RBE on (α/β)x, followed by the Unkelbach model, which has no dependence on (α/β)x, and then the other models with various positive dependencies. Notably, the Tilly model's (α/β)x dependence is not as strong as that of the other models, mitigating its impact at extreme (α/β)x values, which may indicate where disagreements arise in the other models. The calculated model parameters for the fitting to multiple survival levels are listed in supplementary table 3.
Testing model performance on low (α/β)x values
Analysis of model performance when testing only on data points with low (α/β)x values (in the range 1-5 Gy) showed that the McNamara model performed best, with small variations in the performance of the top models, similar to the results when testing on the full dataset (table 5). The current data therefore do not indicate any particular model being optimal specifically for late-responding tissues with low (α/β)x values.

Discussion
Proton RBE uncertainties remain a major limitation on the biological optimisation of proton therapy. Current phenomenological RBE models make very different predictions, due to differences in model assumptions, the data to which they were fit, and fitting techniques. Benchmarking these models to a uniform dataset allows for the direct comparison of model predictions and the identification of the model assumptions which perform best on the currently available data. Analysis of published models highlighted the significant differences in the RBE predictions between models, and how they depend on physical (LET) and biological ((α/β)x) factors, as previously shown in other publications (Rørvik et al 2018, McMahon 2021). The largest discrepancies between models arose from the relationship between RBE and (α/β)x, with models which include a 1/(α/β)x dependency in α or β showing RBE increasing with decreasing (α/β)x, while others showed a decrease in RBE with decreasing (α/β)x.
The results of the initial benchmarking to the full dataset showed improved agreement and less variation in RBE predictions between the models, particularly for high (α/β)x and high LET. This confirmed that, in addition to differences in model design, the different data used to fit the published models caused large differences in their predictions, particularly in the case of models with few or only a single cell line in their training data.
Analysis of model performance in the initial benchmarking revealed very different results depending on the choice of weighting applied to the model fits. Both the full dataset fits and the cross-validation analysis had high reduced χ2 values for the weighted analysis, indicating that differences between models and observations were significantly greater than the uncertainties in the data. Analysis of residual distributions for each of the model fits showed that several experiments made disproportionate contributions to the overall χ2 error in the weighted analyses. This resulted in skewed error distributions and undue influence of these points on the fit, suggesting that published uncertainties underestimated the true variability in the data. Dose response parameters obtained from cell survival experiments are often associated with large uncertainties, compounded by different authors using different approaches to obtain these parameters from their raw data (Friedrich et al 2021). While a weighted approach is normally advantageous, if the uncertainties on measurements are incorrectly estimated this can result in worse model performance. As our analysis suggested that many of the uncertainties in the dataset were underestimated, an unweighted fit was applied for the final benchmarking. This is the approach originally used when fitting most of the published models, excluding the Mairani, Rørvik-weighted, and McNamara models, which applied weighted fits. It is also important to note that these analyses are focused on uncertainties in the biological parameters. However, physical parameters, such as dose and LET, can also have significant uncertainties which are frequently not reported (Draeger et al 2020). These factors may both explain some degree of the observed variability in biological parameters and contribute to fit uncertainties. Improved reporting of such parameters may prove important to refining future RBE models.
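The contrast between the weighted and unweighted goodness-of-fit metrics discussed above can be illustrated with a minimal sketch. The observed RBEs, reported uncertainties, and fitted values below are toy numbers, not the study's data:

```python
import numpy as np

# Toy example: observed RBE values with (possibly underestimated) reported
# uncertainties, and predictions from a fitted model with k free parameters.
rbe_obs = np.array([1.05, 1.10, 1.22, 1.35, 1.50])
sigma   = np.array([0.005, 0.005, 0.008, 0.008, 0.01])  # reported errors
rbe_fit = np.array([1.06, 1.12, 1.20, 1.33, 1.48])
k = 1                                                    # free model parameters

resid = rbe_obs - rbe_fit
dof = len(rbe_obs) - k

# Weighted metric: reduced chi-square. Values much greater than 1 mean the
# residuals exceed the stated uncertainties, i.e. the error bars are likely
# underestimated and can give individual points undue influence on the fit.
chi2_red = np.sum((resid / sigma) ** 2) / dof

# Unweighted alternative: root-mean-square error on the same residuals,
# which ignores the reported uncertainties entirely.
rmse = np.sqrt(np.mean(resid ** 2))
print(chi2_red, rmse)
```

Here the residuals are small in absolute terms (RMSE below 0.02) yet the reduced χ2 is far above 1, mirroring the situation described above where stated uncertainties understate the true variability.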
The results from the cross-validation study demonstrated the importance of testing the generalisability of the models. While it is expected that the residuals would be higher for cross-validation as models are tested on unseen data, for the un-weighted fit some models showed a dramatic difference in performance in the cross-validation compared to the full dataset fit. For example, the RMSE for the Rørvik weighted model was over 30% higher when fitted using cross-validation, compared to the full dataset fits. This indicates that the model performed much worse when tested on unseen data, which may be due to the inclusion of the higher-order LET3 and LET4 terms leading to over-fitting at high LETs.
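The three-fold cross-validation procedure (summarised in figure 1) can be sketched as follows, assuming a hypothetical one-parameter LET-only model fitted to synthetic data; none of the numbers below come from the benchmarking dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 90
let = rng.uniform(1.0, 15.0, n)
# Synthetic "observed" RBE: a linear-in-LET trend plus noise (illustrative only).
rbe = 1.0 + 0.04 * let + rng.normal(0.0, 0.05, n)

# Split the data into three random folds; each fold serves once as the test set.
folds = np.array_split(rng.permutation(n), 3)
rmse_cv = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Fit a one-parameter model RBE = 1 + c*LET on the training folds
    # (closed-form least-squares slope for the shifted data).
    x, y = let[train_idx], rbe[train_idx] - 1.0
    c = np.sum(x * y) / np.sum(x * x)
    # Evaluate on the held-out fold only.
    pred = 1.0 + c * let[test_idx]
    rmse_cv.append(np.sqrt(np.mean((rbe[test_idx] - pred) ** 2)))

print(np.mean(rmse_cv))
```

A model that over-fits (for example, one with unnecessary higher-order LET terms) would show a much larger gap between its full-dataset RMSE and its cross-validated RMSE than the simple model here.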
The results of the final benchmarking, where a cross-validation analysis was applied to the LET-restricted Paganetti dataset with MID as the biological endpoint, showed that the optimal model was the simplest one with no tissue or dose dependence: the Unkelbach model. The Unkelbach model was originally designed for LET optimisation in treatment plans for intensity modulated proton therapy (IMPT), to avoid high LET values in critical structures positioned in proximity to the target volume (Unkelbach et al 2016). The model is therefore based simply on the principle that RBE depends linearly on LET. While other parameters are believed to affect RBE, this analysis suggests that they may not contribute enough to the overall trend to offset the risk of over-fitting posed by more complex models.
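A minimal sketch of an Unkelbach-style LET-only model and its one-parameter least-squares fit; the measurement values below are illustrative, not taken from the benchmarking dataset:

```python
import numpy as np

def rbe_let_only(let_d, c):
    """LET-only RBE model of the Unkelbach form: RBE = 1 + c * LET_d.
    c is the single free parameter (units um/keV); there is no dose or
    tissue dependence."""
    return 1.0 + c * let_d

# Fit c to illustrative (made-up) RBE_MID measurements. Because the model is
# linear in its one parameter, the least-squares solution is closed-form.
let_obs = np.array([1.0, 3.0, 5.0, 8.0, 12.0])
rbe_obs = np.array([1.03, 1.12, 1.21, 1.33, 1.47])
c_fit = np.sum(let_obs * (rbe_obs - 1.0)) / np.sum(let_obs ** 2)
print(c_fit)
```

With only one parameter, the model has very little capacity to over-fit, which is consistent with its strong cross-validation performance for the single-endpoint (MID) fits.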
When the models were re-fitted considering multiple survival levels as the biological endpoint rather than simply the MID, the Unkelbach model no longer performed best. This suggests that a more complex model with dose and tissue dependence is required when considering different survival levels. The results of this analysis also showed that the models which have an inverse (α/β)x dependence performed best on the currently available data, indicating that this is a key RBE model feature. However, the two best-performing models (Tilly and McNamara) both do not have a purely inverse (α/β)x dependence, and their best-fit forms show a flatter dependence on (α/β)x than models such as Wedenberg (figure 5). This suggests a slightly more complex relationship between α and β may be required to fully capture proton LET trends.
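For models of this family, the dose dependence enters through the standard linear-quadratic (LQ) relation between RBE and the limiting values RBEmax (at zero dose) and RBEmin (at very high dose). A minimal sketch follows; the RBEmax/RBEmin values are placeholders, since each phenomenological model parameterises them differently in terms of LET and (α/β)x:

```python
import math

def rbe_lq(dose, abx, rbe_max, rbe_min):
    """LQ-based RBE at a given proton dose per fraction, for a tissue with
    photon (alpha/beta)_x = abx (Gy). rbe_max and rbe_min are the low- and
    high-dose RBE limits; phenomenological models differ mainly in how these
    depend on LET and abx."""
    return (math.sqrt(abx**2 + 4*dose*abx*rbe_max + 4*dose**2*rbe_min**2)
            - abx) / (2*dose)

# Sanity checks with placeholder values: RBE approaches rbe_max at low dose
# and rbe_min at high dose.
print(rbe_lq(0.01, 2.0, 1.5, 1.1))    # ~1.5
print(rbe_lq(100.0, 2.0, 1.5, 1.1))   # ~1.1
```

This construction makes explicit why fitting to multiple survival levels (and hence multiple doses) probes the dose dependence in a way that a single MID endpoint cannot.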
In this analysis the (α/β)x values input to RBEmin and RBEmax calculations were restricted to between 0.01 and 100 to ensure stable model fits could be produced. Additional analysis of model performance when different restrictions were applied to the (α/β)x parameter resulted in slightly different model performance rankings (supplementary tables 4 and 5). However, the general trend of model performance was not significantly impacted, with similar models performing best and worst in each of the fits (supplementary figure 5). Therefore, while model performance is sensitive to how extreme values of (α/β)x are handled, there is not a significant change in the relative performance of the models.
With the noisy biological response parameters proving to be a key limitation in the development of robust RBE models, there has been interest in models which can predict responses based on physical factors alone. For example, a recent empirical model, which predicts proton RBE from the linear correlation between x-ray and proton radiosensitivity as a function of LET, has shown improved performance at a number of survival levels compared to the Wedenberg, Mairani and McNamara models (Flint et al 2022). This is in line with the results of this study when fitting to a single endpoint (MID): a simpler approach based on LET alone provides the best predictions for the currently available data. However, the analysis of fits to multiple survival levels in this study highlighted that tissue dependence is important to consider, and more complex models may be required.
The lack of a definitive 'best' RBE model across all conditions suggests that more work is required to fully understand these biological processes and quantitatively characterise them. However, the general consistency of results does suggest some potentially preferred models which could be considered to guide initial evaluations of treatments. In particular, for simple fields with relatively consistent dose levels but varying LET, the analysis suggests that very simple approaches, such as the Unkelbach model, which is equivalent to the addition of a simple dose × LET term to the delivered dose, effectively capture much of the biological variability. For more complex conditions with greater variation in delivered dose, and thus expected survival, more complex models are required. The analysis in this work suggests that models which contain a dependence on (α/β)x which is slightly weaker than a 1/(α/β)x scaling in the alpha term typically perform best regardless of selections made regarding extreme (α/β)x values, and so those model forms (McNamara and Tilly) may provide useful candidates for initial RBE screening analyses. However, it is important to emphasise that this work has shown that model performance is dependent on the various data processing steps and analysis techniques applied. Therefore, care must be taken when extrapolating these observations both to new datasets and to clinical settings.
Current experimental proton RBE datasets have several limitations which should be considered. Large datasets which gather information from multiple experiments are very useful for benchmarking models as they cover a large range of conditions. This allows features to be analysed with high statistical power, and enables cross-validation analysis. However, as seen here, inconsistencies in experimental procedures and different approaches to the calculation of uncertainties between publications should be considered when analysing large heterogeneous datasets, as these could potentially mask important biological information (Matsui et al 2019). Additionally, while the Paganetti dataset contains results from a large number of proton RBE experiments, the majority of these used only one cell line, and mostly cell lines with small (α/β)x values, making it difficult to identify the models which best describe trends in the underlying biology across the full clinical range.
If standardised techniques, such as those recently outlined in a report by Paganetti et al, were implemented for in vitro measurements, this could reduce the uncertainties associated with future inter-experimental comparisons (Paganetti et al 2019). Furthermore, if a wider range of comprehensively genetically characterised human cell lines were studied, this could help improve the robustness of the model predictions for early responding tissues with high (α/β)x values, and potentially facilitate the identification of genetic features in the cell lines which are linked to RBE variation.
It is also important to note that the models were only benchmarked against measurements of clonogenic survival, which may not be the only clinically relevant endpoint. Although cell survival is considered a strong indicator for tumour control probability (TCP), it is thought that other endpoints may be more relevant for normal tissue complication probability (NTCP). Endpoints such as DNA damage, chromosome aberrations, and micronuclei are potential markers of normal tissue toxicity that may not be fully captured by cell survival (Paganetti 2014). Which parameters best correlate with NTCP remains unclear due to the lack of experimental data, and this requires further study for future model benchmarking.
Finally, as the data used in this analysis was based on in vitro radiation sensitivity measurements, it is difficult to know how well the models will predict response in vivo (Willers et al 2018). In vitro data does not account for effects such as the interaction of cells with their microenvironment, involvement of the immune system, or differences in proliferation patterns (McNamara et al 2020). However, as discussed by Paganetti et al, in vivo studies may be associated with higher levels of uncertainty due to the challenges of precisely positioning the target area within the high LET region, and the use of imaging-based response parameters of the biological effect in animals may be more informative (Paganetti et al 2019). Therefore, to determine how well these phenomenological models can predict RBE variation and identify an optimal model for clinical use, verification using clinical data is essential.

Conclusions
Fitting phenomenological proton RBE models to a uniform dataset improved the agreement between their predictions. However, some disagreements remained due to fundamental differences in model assumptions. Cross-validation showed that many models perform poorly when generalising to unseen data, highlighting the need for robust validation. The results suggest that while models which incorporate an inverse dependence on (α/β)x perform better across this dataset, the optimal model depends on data selection and endpoint. To identify the most accurate model for clinical use, a deeper understanding of the underlying biology of proton RBE is essential. To achieve this, more studies which investigate the mechanistic biological dependence of RBE are necessary, and it is essential that these studies follow standardised procedures to obtain cell survival measurements and associated uncertainties, to allow robust integrated analysis in the future.

Figure 1 .
Figure 1. Cross-validation benchmarking workflow. The restricted Paganetti dataset was split into three folds, and these were used to create training and testing datasets for each iteration of the cross-validation. The models were then fitted to the training dataset to obtain the fit parameters, and model-predicted RBE values were calculated using the fit parameters and the testing data. Model performance was then measured through residual analysis. After repeating the fitting and testing for all three folds, the average parameter values were calculated, along with the total RSS/chi-square, and the best-performing model was identified.

Figure 2 .
Figure 2. Comparing model RBE MID predictions as (a) a function of LET for a constant (α/β)x of 6 Gy, and (b) as a function of (α/β)x for a constant LET of 5 keV μm−1, for the published models.

Figure 3 .
Figure 3. Plots of RBE MID as a function of LET for a constant (α/β)x of 6 Gy calculated using the fit parameters from the full dataset (a) unweighted fits and (b) weighted fits. RBE as a function of (α/β)x, for a constant LET of 5 keV μm−1, calculated using the fit parameters from the full dataset (c) unweighted fits and (d) weighted fits.

Figure 4 .
Figure 4. Plots of observed RBE against model-predicted RBE for (a) the Unkelbach model and (b) the Tilly model. Points represent individual measurements, coloured according to fold, and the shaded regions represent the difference between the best fit lines for the individual folds and the best fit line to all of the data.

Figure 5 .
Figure 5. Plots of RBE MID as a function of (a) LET for a constant (α/β)x of 6 Gy, and (b) (α/β)x for a constant LET of 5 keV μm−1, calculated using the fit parameters from the final benchmarking (un-weighted cross-validation on the LET-restricted dataset). Note the Jones model has been excluded due to unstable fits which resulted in very high RBE values.

Figure 6 .
Figure 6. Correlation of trends in the different phenomenological RBE models. RBE MID values were calculated for each model for a set of randomly generated LET and (α/β)x values and plotted against one another to evaluate correlation. The lower triangle shows scatter plots of modelled RBEs between pairs of models, while the upper triangle presents Pearson correlation coefficients for the two models being compared. Dark blue represents a strong positive correlation whereas red indicates a weaker correlation. Poor correlation is typically seen between those models which differ in their assumed (α/β)x dependence. Models are arranged in order of performance, best to worst.

Table 3 .
Akaike information criterion (AIC) model rankings for the final benchmarking by MID. RMSD: root mean square deviation.
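For least-squares fits with Gaussian errors, the AIC used in these rankings can be computed (up to an additive constant) directly from the residual sum of squares; the RSS values and parameter counts below are illustrative, not taken from the tables:

```python
import math

def aic_from_rss(rss, n, k):
    """Akaike information criterion for a least-squares fit with Gaussian
    errors, up to an additive constant: n*ln(RSS/n) + 2k, where n is the
    number of data points and k the number of free model parameters.
    Lower AIC indicates a better trade-off between fit quality and
    model complexity."""
    return n * math.log(rss / n) + 2 * k

# Illustrative comparison (made-up numbers): a 1-parameter and a 4-parameter
# model with similar residual sums of squares on n = 200 points.
a1 = aic_from_rss(2.10, 200, 1)
a2 = aic_from_rss(2.05, 200, 4)
print(a1, a2)   # the simpler model wins despite its slightly higher RSS
```

This illustrates why, in these rankings, a simple model such as Unkelbach can outrank more complex models that achieve marginally smaller residuals.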

Table 4 .
The Akaike information criterion (AIC) model rankings when fitted to multiple survival levels. RMSD: root mean square deviation.

Table 5 .
The root mean square deviation (RMSD) values for unweighted fits to the full dataset, with performance tested on (α/β) x values between 1 and 5 Gy.Models are ranked from best to worst performing.