OpenKBP-Opt: an international and reproducible evaluation of 76 knowledge-based planning pipelines

Objective. To establish an open framework for developing plan optimization models for knowledge-based planning (KBP). Approach. Our framework includes radiotherapy treatment data (i.e. reference plans) for 100 patients with head-and-neck cancer who were treated with intensity-modulated radiotherapy. The data also include high-quality dose predictions from 19 KBP models that were developed by different research groups using out-of-sample data during the OpenKBP Grand Challenge. The dose predictions were input to four fluence-based dose mimicking models to form 76 unique KBP pipelines that generated 7600 plans (76 pipelines × 100 patients). The predictions and KBP-generated plans were compared to the reference plans via: the dose score, which is the average mean absolute voxel-by-voxel difference in dose; the deviation in dose-volume histogram (DVH) points; and the frequency of clinical planning criteria satisfaction. We also performed a theoretical investigation to justify our dose mimicking models. Main results. The range in rank order correlation of the dose score between predictions and their KBP pipelines was 0.50–0.62, which indicates that the quality of the predictions was generally positively correlated with the quality of the plans. Additionally, compared to the input predictions, the KBP-generated plans performed significantly better (P < 0.05; one-sided Wilcoxon test) on 18 of 23 DVH points. Similarly, each optimization model generated plans that satisfied a higher percentage of criteria than the reference plans, which satisfied 3.5% more criteria than the set of all dose predictions. Lastly, our theoretical investigation demonstrated that the dose mimicking models generated plans that are also optimal for an inverse planning model. Significance. This was the largest international effort to date for evaluating the combination of KBP prediction and optimization models. We found that the best performing models significantly outperformed the reference dose and dose predictions. In the interest of reproducibility, our data and code are freely available.


Abstract
We establish an open framework for developing plan optimization models for knowledge-based planning (KBP) in radiotherapy. Our framework includes reference plans for 100 patients with head-and-neck cancer and high-quality dose predictions from 19 KBP models that were developed by different research groups during the OpenKBP Grand Challenge. The dose predictions were input to four optimization models to form 76 unique KBP pipelines that generated 7600 plans. The predictions and plans were compared to the reference plans via: the dose score, which is the average mean absolute voxel-by-voxel difference in dose a model achieved; the deviation in dose-volume histogram (DVH) criteria; and the frequency of clinical planning criteria satisfaction. We also performed a theoretical investigation to justify our dose mimicking models. The range in rank order correlation of the dose score between predictions and their KBP pipelines was 0.50 to 0.62, which indicates that the quality of the predictions is generally positively correlated with the quality of the plans. Additionally, compared to the input predictions, the KBP-generated plans performed significantly better (P < 0.05; one-sided Wilcoxon test) on 18 of 23 DVH criteria. Similarly, each optimization model generated plans that satisfied a higher percentage of criteria than the reference plans. Lastly, our theoretical investigation demonstrated that the dose mimicking models generated plans that are also optimal for a conventional planning model. This was the largest international effort to date for evaluating the combination of KBP prediction and optimization models. In the interest of reproducibility, our data and code are freely available at https://github.com/ababier/open-kbp-opt.

I. INTRODUCTION
Automated radiotherapy planning is transforming clinical practice and personalized cancer treatment [1]. The most common type of automated planning is knowledge-based planning (KBP), which leverages knowledge derived from historical clinical treatment plans to generate new treatment plans without human intervention [2]-[4]. Most common KBP methods can be thought of as a two-stage pipeline that first predicts the dose that should be delivered to a patient, and then converts that prediction into a treatment plan via optimization (Figure 1). Both stages of this pipeline, which are active areas of research, can significantly affect the quality of generated treatment plans [5]. The contributions of this paper are twofold: 1) to provide data that supports KBP optimization research at scale and 2) to establish a connection between dose mimicking (a type of KBP optimization) and conventional planning methods. We expand on the impact of these contributions throughout this paper.
[...] includes data for 340 head-and-neck patients undergoing intensity modulated radiotherapy (IMRT), is limited to dose prediction research (i.e., it is incompatible with KBP optimization research). Although there are still no open datasets for KBP optimization research, there are two open datasets that support research in other areas of plan optimization [11], [12]. However, it is challenging to use these datasets in KBP plan optimization research for two reasons. First, neither dataset includes dose predictions, which are the input to KBP plan optimization models. Second, they are smaller (123 patients across both datasets) and span multiple sites (prostate, liver, head-and-neck) and multiple modalities (CyberKnife, volumetric modulated arc therapy, proton therapy, IMRT). While such diversity in cases is important to demonstrate the robustness and generalizability of optimization algorithms across sites and modalities, this same diversity is a disadvantage when it comes to training dose prediction models, since there is insufficient data for any one site-modality pair [13].
Most KBP pipelines are developed as fully-automated pipelines that can replace human treatment planners in the planning process [14]-[17]. These approaches have demonstrated promising results in prospective research studies. However, a sizeable portion of the KBP-generated plans in those studies were considered inferior to human-generated plans, which suggests that there is an opportunity for improvement [2], [4]. In those cases, making manual adjustments to the KBP-generated plan is nontrivial because the plans are generated by fully-automated pipelines that rely on the quality of the data. In contrast to fully-automated pipelines, semi-automated pipelines rely on both the quality of data and human expertise, which puts less reliance on the data. For example, a semi-automated KBP pipeline could enable human planners to improve upon a KBP-generated plan via an intuitive process (e.g., inverse planning) and thereby provide a pipeline that leverages human expertise, models, and data. In the KBP literature, however, there are relatively few papers that describe tools that humans can intuitively interact with in a semi-automated KBP pipeline [18]-[21].
In this paper, we extend the results from the OpenKBP Grand Challenge, which we call OpenKBP, with an international validation of 76 KBP pipelines. We made this extension, which we call OpenKBP-Opt, open to provide a benchmark for KBP optimization research and to lower the barriers for contributing to this research area. We also demonstrate how KBP plan optimization models can be used to initialize the conventional planning process (i.e., inverse planning) with good patient-specific parameters (i.e., objective weights) and provide the means for a semi-automated KBP pipeline. Identifying this relationship provides a mechanism for transforming existing KBP optimization models, which are generally fully-automated pipelines that impede manual intervention, into semi-automated pipelines that enable human planners to improve upon a KBP-generated plan via inverse planning (i.e., a familiar and intuitive process). The data and code to reproduce this paper are publicly available at https://github.com/ababier/open-kbp-opt.

II. METHODS AND MATERIALS
Figure 2 summarizes the overall methodological approach in five components. The first three components (i.e., processing data, developing dose prediction models, and generating KBP dose predictions) are based on the results of the OpenKBP Grand Challenge. The final two components (i.e., developing plan optimization models and generating KBP treatment plans) are an extension of the OpenKBP Grand Challenge and the focus of this paper. Below, we describe all five components and our analysis.

A. Processing patient data
We obtained data for 340 patients with head-and-neck cancer from the OpenKBP Grand Challenge. The data consisted of a training set (n = 200), a validation set (n = 40), and a testing set (n = 100). The plans were delivered via 6 MV step-and-shoot IMRT from nine equidistant coplanar beams at angles 0°, 40°, ..., 320°. Those beams were divided into a set of beamlets $B$, which make up a fluence map. The relationship between the intensity $w_b$ of beamlet $b$ and the dose $d_v$ deposited to voxel $v$ was determined using the influence matrix $D_{v,b}$ generated by the IMRTP library from A Computational Environment for Radiotherapy Research [22] using MATLAB, and it is given by

$$d_v = \sum_{b \in B} D_{v,b} \, w_b.$$
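As a small illustration of how an influence matrix maps beamlet intensities to voxel doses, the sketch below evaluates the dose as a matrix-vector product. The matrix entries, the intensities, and the function name `compute_dose` are made up for this example; real influence matrices are far larger and are typically stored in a sparse format.

```python
# Toy influence matrix: D[v][b] is the dose deposited in voxel v per unit
# intensity of beamlet b. All values are illustrative, not from the dataset.
D = [
    [1.0, 0.5, 0.0],
    [0.2, 0.6, 0.1],
    [0.0, 0.3, 0.9],
    [0.4, 0.4, 0.4],
]

# Hypothetical beamlet intensities w_b (one per column of D).
w = [10.0, 20.0, 5.0]


def compute_dose(influence, intensities):
    """Return the dose to each voxel: d_v = sum over b of D[v][b] * w[b]."""
    return [
        sum(d_vb * w_b for d_vb, w_b in zip(row, intensities))
        for row in influence
    ]


dose = compute_dose(D, w)
```

Because the dose is linear in the beamlet intensities, any objective that is linear (or linearizable) in dose stays linear in the fluence variables, which is what keeps fluence-based optimization models tractable.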

B. Developing dose prediction models
All dose prediction models used in this paper were developed in the OpenKBP Grand Challenge [10]. During the challenge, teams developed dose prediction models using identical training and validation datasets with access only to ground truth data (i.e., dose) for the training set. Every dose prediction model used a neural network architecture that was based on either a U-Net [23], V-Net [24], or Pix2Pix [25] architecture. Many of the best performing models also used other generalizable techniques like ensembles [26], one-cycle learning [27], radiotherapy-specific loss functions [28], and deep supervision [29].
All teams competed to develop models that minimize one of two pre-defined error metrics that quantified the difference between the reference dose and a KBP-generated dose (i.e., KBP prediction or plan dose). The metrics were: 1) dose error, which was the mean absolute voxel-by-voxel difference between two dose distributions, and 2) dose-volume histogram (DVH) error, which was the absolute difference between a DVH point from two dose distributions. The DVH error was evaluated on two DVH points for each organ-at-risk (OAR) and three DVH points for each target. The OAR DVH points were the $D_{\text{mean}}$ and $D_{0.1\text{cc}}$, which were the mean dose delivered to the OAR and the maximum dose delivered to 0.1cc of the OAR, respectively. The target DVH points were the $D_1$, $D_{95}$, and $D_{99}$, which were the doses delivered to 1% (99th percentile), 95% (5th percentile), and 99% (1st percentile) of voxels in the target, respectively. The models were ranked according to: 1) dose score, which was the average dose error of a model, and 2) DVH score, which was the average DVH error of a model.
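To make the two error metrics concrete, the following sketch computes a dose error and a DVH error for a single toy structure. It assumes a nearest-rank definition of the percentile-based DVH points (no interpolation); the function names and dose values are hypothetical, and the challenge's own evaluation code may define the percentiles differently.

```python
def dose_error(dose_a, dose_b):
    """Mean absolute voxel-by-voxel difference between two dose distributions."""
    return sum(abs(a - b) for a, b in zip(dose_a, dose_b)) / len(dose_a)


def d_percent(doses, percent):
    """Dose received by at least `percent`% of voxels (integer percent).

    Nearest-rank definition (an assumption of this sketch): sort the doses
    from highest to lowest and take the k-th value, where k = ceil(percent% of n).
    """
    ranked = sorted(doses, reverse=True)
    k = max(1, -(-percent * len(ranked) // 100))  # integer ceiling
    return ranked[k - 1]


def dvh_error(dose_a, dose_b, percent):
    """Absolute difference between the same DVH point on two dose distributions."""
    return abs(d_percent(dose_a, percent) - d_percent(dose_b, percent))


# Made-up doses (Gy) in a ten-voxel target.
reference = [70.0, 71.0, 69.0, 72.0, 68.0, 70.0, 71.0, 69.0, 70.0, 66.0]
predicted = [69.0, 71.5, 70.0, 71.0, 67.0, 70.0, 72.0, 68.0, 70.5, 65.0]

error = dose_error(reference, predicted)     # average of |difference| over voxels
d95_error = dvh_error(reference, predicted, 95)
d1_error = dvh_error(reference, predicted, 1)
```

Averaging `dose_error` over all patients gives a model's dose score, and averaging the DVH errors over all DVH points and patients gives its DVH score.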

C. Generating KBP dose predictions
In this paper, the OpenKBP organizers collaborated with teams that competed in the OpenKBP Grand Challenge. The 28 teams that completed the final phase of the OpenKBP Grand Challenge were invited to participate in the OpenKBP-Opt project, and 21 of those teams agreed to participate. We obtained the dose predictions from all teams for each patient in the test set to create a set of 2100 dose predictions (21 different predictions for each of the 100 patients). We observed that two models produced dose scores that were over two standard deviations (6.3 Gy) above the mean (4.0 Gy), whereas the rest were within half a standard deviation (1.6 Gy) of the mean. Thus, we omitted those two outlier models and proceeded with only 19 KBP models (n = 1900 predictions).

D. Developing plan optimization models
Next, we formulated four dose mimicking models, which are a type of KBP optimization model. Each model used the same set of structures and objective functions, which are described in Section II-D.1 and Section II-D.2, respectively. However, they differ in how they mimic (i.e., penalize differences from) a specific dose distribution. In particular, they each have a different cost function, outlined in Section II-D.3. Note that in this paper the terms "objective function" and "cost function" refer to distinct concepts, and the cost functions in this paper are functions of objective functions.
1) Structures: All of our optimization models used the same set of regions-of-interest (ROIs) $R_p$ for each patient $p \in P$ in our test set. The set $R_p$ contains OARs $I_p$, targets $T_p$, and optimization structures $O_p$. The OARs contained in $I_p$ were the brainstem, spinal cord, right parotid, left parotid, larynx, esophagus, and mandible. Each target $t \in T_p$ was a planning target volume (PTV) with a dose level $\theta_t$, and those targets were the PTV56, PTV63, and PTV70. The optimization structures contained in $O_p$ were the limPostNeck, which was used to limit dose to the posterior neck, and six PTV ring structures (a 3 mm ring and a 6 mm ring for each target). These were the same structures used to generate the plans in the original OpenKBP dataset [10]. Every ROI $r \in R_p$ was also divided into a set of voxels $V_r$.
2) Objective functions: Our models used the objective functions in Table I. Each objective function quantified a different measure of the dose delivered to a single ROI $r \in R_p$ in a patient $p \in P$, which we call an objective value. Specifically, the average and maximum objective values quantified the average dose and maximum dose delivered to an ROI $r$, respectively. The high and low conditional value at risk (CVaR) objective values quantified the average dose in ROI $r$ that was higher and lower, respectively, than the dose threshold $f$.
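The four kinds of objective value can be illustrated with a short sketch. It follows a literal reading of the prose description (the CVaR values are taken as the average dose over voxels above or below the threshold) and is not the exact formulation in Table I; the function names, the toy doses, and the convention of returning 0.0 when no voxel crosses the threshold are all assumptions.

```python
def average_dose(doses):
    """Average dose delivered to an ROI."""
    return sum(doses) / len(doses)


def maximum_dose(doses):
    """Maximum dose delivered to any voxel in an ROI."""
    return max(doses)


def high_cvar(doses, threshold):
    """Average dose over the voxels whose dose exceeds the threshold f.

    Returns 0.0 when no voxel exceeds the threshold (assumed convention).
    """
    above = [d for d in doses if d > threshold]
    return sum(above) / len(above) if above else 0.0


def low_cvar(doses, threshold):
    """Average dose over the voxels whose dose is below the threshold f.

    Returns 0.0 when no voxel is below the threshold (assumed convention).
    """
    below = [d for d in doses if d < threshold]
    return sum(below) / len(below) if below else 0.0


# Made-up voxel doses (Gy) for one ROI and a hypothetical threshold.
roi_doses = [10.0, 20.0, 30.0, 40.0]
f = 25.0

objective_values = {
    "average": average_dose(roi_doses),
    "maximum": maximum_dose(roi_doses),
    "high_cvar": high_cvar(roi_doses, f),
    "low_cvar": low_cvar(roi_doses, f),
}
```

In an optimization model these quantities are expressed with auxiliary variables and linear constraints rather than evaluated directly, so this sketch is only useful for scoring a fixed dose distribution.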

[Table I: Name and objective function for each type of objective (average dose, maximum dose, high CVaR, low CVaR).]

In total, we considered 107 objective functions: seven per OAR, three per target, and seven per optimization structure. The objective functions for each OAR were the mean dose; maximum dose; and high CVaR dose with thresholds $f$ equal to 0.25, 0.50, 0.75, 0.90, and 0.975 of the maximum predicted dose to that structure. The objective functions for each target were the maximum dose, a low CVaR dose with a threshold equal to the dose level of the target (i.e., $f = \theta_t$), and a high CVaR dose with a threshold $f$ equal to 1.05 of the dose level of the target (i.e., $f = 1.05\theta_t$). The objective functions for each optimization structure were the same as the OAR objective functions. Not all patients had all ROIs, so the models associated with those patients had fewer than 107 objective functions.
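The total of 107 objective functions follows directly from the structure counts given in Section II-D.1. The sketch below reproduces that arithmetic; the ring names are placeholders invented for this example, since only the ring widths and targets are stated in the text.

```python
oars = [
    "brainstem", "spinal cord", "right parotid", "left parotid",
    "larynx", "esophagus", "mandible",
]
targets = ["PTV56", "PTV63", "PTV70"]

# One limPostNeck structure plus a 3 mm and a 6 mm ring per target.
# The ring names below are hypothetical labels.
optimization_structures = ["limPostNeck"] + [
    f"{target} ring {width} mm" for target in targets for width in (3, 6)
]

OBJECTIVES_PER_OAR = 7        # mean, max, and five high-CVaR thresholds
OBJECTIVES_PER_TARGET = 3     # max, low CVaR, high CVaR
OBJECTIVES_PER_OPT = 7        # same set as the OARs

total_objectives = (
    OBJECTIVES_PER_OAR * len(oars)
    + OBJECTIVES_PER_TARGET * len(targets)
    + OBJECTIVES_PER_OPT * len(optimization_structures)
)
```

That is, 7 × 7 + 3 × 3 + 7 × 7 = 49 + 9 + 49 = 107 for a patient who has every ROI.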
3) Model formulations: Our KBP optimization models performed dose mimicking to generate plans with optimized objective values that closely matched the input objective values from a dose prediction. To streamline our model formulation, let each $m \in M_p$ denote one of the 107 objective functions (as outlined in Section II-D.2). Let $g_m$ and $\hat{g}_m$ be the objective values of their corresponding objective functions evaluated over the optimized plan and predicted dose, respectively. In all models, the cost functions were formulated such that lower values of $g_m$ were favored over higher values.
Table II presents the cost functions of our dose mimicking models. Each model minimized either the mean or max difference between all corresponding pairs $(g_m, \hat{g}_m)$ of the objective values, which were quantified via an absolute (e.g., $g_m - \hat{g}_m$) or relative (e.g., $(g_m - \hat{g}_m)/\hat{g}_m$) difference measure, resulting in four dose mimicking models. In the mean difference models, we chose to prioritize the positive differences (i.e., where the optimized plan objective value was higher than the predicted dose objective value) more than the negative differences, which we assigned a small positive weight ($\epsilon = 0.0001$ in our experiments). This was done to incentivize the model to do at least as well as the dose prediction before striving to outperform the dose prediction on other objective functions. In contrast, the max difference models used only a single term because the max difference naturally incentivizes the model to outperform the prediction only once the plan outperforms the prediction across all objective values (i.e., when $g_m \le \hat{g}_m, \forall m \in M_p$).
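The sketch below evaluates the four cost functions for a fixed pair of objective-value vectors. It is one plausible reading of the description above, not the formulations in Table II: positive differences are counted at full weight and negative differences at the small weight, and the mean or max is then taken. The function names and the objective values are hypothetical, and in the actual models these costs are minimized inside a linear program rather than evaluated after the fact.

```python
EPSILON = 0.0001  # small weight on negative differences, as stated in the text


def differences(g, g_hat, relative=False):
    """Absolute (g - g_hat) or relative ((g - g_hat) / g_hat) differences."""
    return [
        (plan - pred) / pred if relative else plan - pred
        for plan, pred in zip(g, g_hat)
    ]


def mean_cost(diffs, epsilon=EPSILON):
    """Mean difference with negative differences down-weighted by epsilon."""
    weighted = [d if d > 0 else epsilon * d for d in diffs]
    return sum(weighted) / len(weighted)


def max_cost(diffs):
    """Largest difference between plan and prediction objective values."""
    return max(diffs)


# Made-up objective values for the optimized plan (g) and the prediction (g_hat).
g = [22.0, 9.0, 5.0]
g_hat = [20.0, 10.0, 5.0]

costs = {
    "MeanAbs": mean_cost(differences(g, g_hat)),
    "MaxAbs": max_cost(differences(g, g_hat)),
    "MeanRel": mean_cost(differences(g, g_hat, relative=True)),
    "MaxRel": max_cost(differences(g, g_hat, relative=True)),
}
```

In this toy case the plan is worse than the prediction on the first objective and better on the second, so the mean costs are dominated by the positive difference while the improvement contributes only a tiny reward.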
The main constraint in all four models was a constraint to limit plan complexity. In particular, the sum-of-positive gradients (SPG) [30] of all plans generated by the models was constrained to be less than or equal to 65, which was a constraint in the reference plans [10]. The remaining constraints were simply auxiliary constraints (including auxiliary variables) used to linearize both the objective and cost functions (i.e., the formulations in Table I and Table II). The optimization models were all formulated in Python 3.7 using OR-Tools 8.2 and solved using Gurobi 9.1 (Gurobi Optimization, TX, US) on a single computer with an Intel i7-8700K (6-Core 3.7 GHz) CPU and 16 GB of random access memory. Default parameters were used with the Gurobi solver except for Crossover set to 0, Method set to 2, and BarConvTol set to 0.0001, which were selected based on past experience to improve solve time without compromising solution quality.

E. Generating KBP treatment plans
Next, we assembled 76 KBP pipelines by combining the 19 dose prediction models with each of the four dose mimicking models. Each pipeline was applied to the 100 patients in the testing set, resulting in 7600 KBP plans (see Figure 3). We used these plans in our analysis to measure the quality of the respective KBP models. We refer to the four plans generated from each dose prediction as the MeanAbs, MaxAbs, MeanRel, and MaxRel plans. Altogether, after completing the process in Figure 3, we had dose distributions for a set of reference plans (n = 100), predictions (n = 1900), and KBP plans generated by four dose mimicking models (n = 4 × 1900). The reference plans are the plans that were released as part of the OpenKBP Grand Challenge, and the predictions are dose distributions that were submitted by 19 teams in the final testing phase of OpenKBP. In general, there will be differences between the reference plan, prediction, and KBP plan dose distributions. Differences between a dose prediction and its corresponding KBP plan are due to factors including prediction noise and deliverability of the dose prediction. Differences between a KBP plan and its corresponding reference plan reflect different trade-offs in the cost function used to generate these plans.

F. Analysis
We conducted three analyses to measure model performance in terms of dose error, DVH point differences, and clinical criteria satisfaction. We also investigated the theoretical connection between our dose mimicking models and inverse planning. Finally, we summarized empirical optimization metadata.
1) Dose score and error: We evaluated the KBP models using the dose score and dose error as defined in Section II-B. We calculated the Spearman rank order correlation of the dose score between the prediction models and corresponding KBP pipelines. The distribution of dose error was visualized using a box plot. A one-sided Wilcoxon signed-rank test was used to determine whether the dose error of the optimization models was the same as (null hypothesis) or lower than (alternative hypothesis) that of the dose prediction models. For all hypothesis tests in this paper, P < 0.05 was considered significant.
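As a worked example of the rank order correlation, the sketch below ranks five hypothetical prediction models by dose score, ranks the pipelines built on them, and applies the standard Spearman formula for data without ties. The dose scores are invented for illustration and the helper names are not from the released code.

```python
def ranks(scores):
    """Rank scores from best (lowest, rank 1) to worst; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation for untied data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(scores_a)
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    sum_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_squared / (n * (n ** 2 - 1))


# Made-up dose scores (Gy) for five prediction models and their pipelines.
prediction_scores = [2.4, 2.6, 2.7, 3.0, 3.3]
pipeline_scores = [2.5, 2.9, 2.6, 3.1, 3.0]

rho = spearman(prediction_scores, pipeline_scores)
```

Here two adjacent pairs of models swap ranks once their predictions are converted to plans, which lowers the correlation from 1 to 0.8. For real analyses with ties, and for the one-sided Wilcoxon signed-rank test, an established statistics package (for example SciPy's `scipy.stats.spearmanr` and `scipy.stats.wilcoxon` with `alternative="less"`) is a more robust choice than hand-written formulas.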
2) DVH point differences: To measure the relative quality of dose distributions from a clinical perspective, we examined the distribution of DVH point differences between the reference and KBP-generated dose. The differences were evaluated over the DVH points listed in Section II-B and visualized using box plots. We used the one-sided Wilcoxon signed-rank test to determine whether the dose generated by all optimization models performed the same as (null hypothesis) or better than (alternative hypothesis) the dose predictions. This test was chosen to evaluate the aggregate performance of all optimization models relative to the predictions. Lower values were better for $D_{\text{mean}}$, $D_{0.1\text{cc}}$, and $D_1$; higher values were better for $D_{95}$ and $D_{99}$.
3) Expected criteria satisfaction: As another measure of plan quality, we examined the proportion of clinical criteria that were satisfied by the reference plans and KBP-generated dose. One criterion was evaluated for each ROI (see Table III). We tabulated the proportion of criteria that were satisfied by the reference plans, dose predictions, MeanAbs plans, MaxAbs plans, MeanRel plans, MaxRel plans, and the plans from the KBP pipeline that satisfied the most clinical criteria overall. We also plotted the proportion of OAR, target, and all ROI clinical criteria that each of the 76 KBP pipelines achieved.
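The proportion of satisfied criteria can be computed with a few lines of code. In the sketch below, each ROI has one criterion that is either an upper limit (as for OARs) or a lower limit (as for targets). The limits and DVH values are placeholders chosen for illustration and are not the clinical criteria from Table III; ROIs that a patient does not have are skipped.

```python
def is_satisfied(value, kind, limit):
    """Check one criterion: 'max' means value <= limit, 'min' means value >= limit."""
    if kind == "max":
        return value <= limit
    if kind == "min":
        return value >= limit
    raise ValueError(f"unknown criterion kind: {kind}")


def proportion_satisfied(values, criteria):
    """Fraction of criteria met, counting only ROIs present for this patient."""
    evaluated = [roi for roi in criteria if roi in values]
    met = sum(is_satisfied(values[roi], *criteria[roi]) for roi in evaluated)
    return met / len(evaluated)


# Placeholder criteria (kind, limit in Gy); NOT the values from Table III.
criteria = {
    "brainstem": ("max", 50.0),
    "spinal cord": ("max", 45.0),
    "larynx": ("max", 45.0),
    "esophagus": ("max", 45.0),
    "PTV70": ("min", 66.5),
}

# Made-up DVH values for one plan; this patient has no esophagus contour.
plan_values = {
    "brainstem": 45.0,
    "spinal cord": 41.0,
    "larynx": 48.0,
    "PTV70": 66.0,
}

proportion = proportion_satisfied(plan_values, criteria)
```

Pooling these counts over all patients and plans of a given type yields the kind of percentage reported for each column of a criteria satisfaction table.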
4) Theoretical analysis of dose mimicking models: To justify our choice of dose mimicking models, we conducted a theoretical analysis into their structure using linear programming duality theory [31, Chapter 4]. This analysis was based on previous literature that showed a connection between Benson's method [32], which identifies efficient solutions to multiobjective optimization models, and estimating the weights for inverse planning [33]. We were motivated to conduct a similar analysis as in Chan et al. [33] because our dose mimicking models are similar to the formulations in Benson [32].
5) Optimization metadata: Lastly, we summarized the metadata that each optimization model generated. In particular, we evaluated the average proportion of objective weight that each model assigned to OAR, target, and optimization structure objective functions. Additionally, we recorded the average, first quartile, and third quartile solve time.

III. RESULTS
In this section, we summarize the performance of the 19 dose prediction models, four dose mimicking models, and 76 KBP pipelines.

A. Dose score and error
Table IV summarizes the rank order correlation between the dose prediction models and their corresponding KBP pipelines. We found that the rank of a prediction model is positively correlated with its corresponding KBP pipeline rank. However, the correlation varied across optimization models, ranging from 0.50 to 0.62. This demonstrates that high quality predictions are correlated with high quality plans, but this result also indicates that a prediction model that outperforms a competitor will not always generate better plans. Additionally, the KBP plans generated by an optimization model that evaluated relative differences (i.e., MeanRel and MaxRel) achieved higher rank order correlations than their counterparts that evaluated absolute differences (i.e., MeanAbs and MaxAbs).
The dose errors of predictions and KBP plans are shown in Figure 4. Two of the four sets of KBP plans had a median dose error that was lower than the median dose error of the predictions (2.79 Gy), implying that it is possible for

B. DVH point differences
Figure 5 shows the DVH point differences between the reference dose and either the predicted dose or KBP plan dose. In general, dose mimicking tends to produce a plan dose that is significantly better than the dose it received as input from a dose prediction model. In particular, the KBP plan dose is significantly better than the predicted dose on 18 of the 23 DVH points (all OAR points and four target points). The five DVH points where the plans were not significantly better are the three $D_{95}$ points and two $D_{99}$ points.

C. Expected criteria satisfaction
In Table V, we compare the percentage of criteria that were satisfied by the reference plans (n = 100), the predictions (n = 1900), the plans generated by each of the four dose mimicking models (n = 4 × 1900), and the plans generated by the top performing KBP pipeline (n = 100). We use the term baselines to refer to the reference dose and dose predictions collectively. The top performing KBP pipeline (denoted "Best" in Table V) was defined as the single pipeline (i.e., the combination of one dose prediction model and one dose mimicking model) whose plans satisfied the most clinical criteria. Of all dose mimicking models, the MaxRel and MeanAbs models generated plans that satisfied the fewest (69.8%) and most (72.9%) ROI clinical criteria, respectively. For comparison, predictions only satisfied 66.2% of all clinical criteria, which was 3.5 percentage points lower than the reference plans (69.7%). The best KBP pipeline, which used the MeanAbs model and one of the 19 prediction models (discussed later), satisfied 77.0% of all ROI clinical criteria.
In general, clinical criteria satisfaction varied across ROI criteria. The brainstem, spinal cord, esophagus, and mandible criteria were each satisfied more than 85% of the time across all the baselines and our dose mimicking models in Table V. The right parotid, left parotid, and larynx criteria were satisfied less than 40% of the time for the two baselines. In contrast, each of our four KBP models generated a higher average criteria satisfaction for these ROIs compared to the baselines. In fact, some were substantially higher. For example, the average criteria satisfaction of the MeanAbs model on the larynx was 71.5%, compared to an average of 36.2% for the baselines. In aggregate over all 19 prediction models, the performance of the four dose mimicking models was comparable to or slightly worse than the reference dose in terms of criteria satisfaction in the targets. However, the best KBP pipeline outperformed the baselines on all criteria.
Figure 6 summarizes the clinical criteria that were satisfied by each of the 76 KBP pipelines that we evaluated. The MeanAbs model generated plans that satisfied more criteria than the other three optimization models for 16 of the 19 dose prediction models (see Figure 6(c)). Additionally, the pipelines that used better prediction models (i.e., dose score rank closer to 1) generally produced plans with higher criteria satisfaction. Interestingly, the best performing KBP pipeline (the last column of Table V) used the dose prediction model that ranked 16th in terms of dose score. The spread in OAR criteria satisfaction across all 19 models (55.4% to 82.1%) was lower than that of target criteria satisfaction (24.5% to 89.7%); see Figure 6(a) and Figure 6(b), respectively. Note that the poor performing KBP pipelines used the 12th, 13th, 17th, 18th, and 19th ranked dose prediction models. Since the columns in Table V included all KBP pipelines, these poor performing models contributed to low performance on the target criteria. In contrast, many of the KBP pipelines that used the top ranked prediction models clearly performed much better on target criteria.

D. Theoretical analysis of dose mimicking models
The inverse planning model (2) is shown in model (3) in vector and matrix notation following Chan et al. [33].
The objective functions are the rows of matrix $C$ and the objective function weights are represented by the vector $\alpha$.
The decision variables, which include the fluence variables ($w_b, \forall b \in B$) and auxiliary variables, are represented by vector $x$. The SPG and auxiliary constraints are encoded in the matrix $A$ and vector $b$.
Table VI presents the formulations of the four dose mimicking models and their respective dual models. The positive and negative differences between optimized objective values $Cx$ and predicted objective values $C\hat{x}$ are represented by vectors $\sigma$ and $\delta$, respectively. The max difference between the optimized and predicted objective values is expressed as a scalar $\zeta$. The dual variables of the dose mimicking models are denoted $\alpha$ and $p$. The vectors of all 0 and 1 are denoted by $0$ and $e$, respectively. Table VI also uses a symbol for element-wise multiplication of two vectors, and prime denotes the transpose operator.
Next, we complete our theoretical analysis. By Proposition 5 from [33], it follows that an optimal decision vector $x^*$ from each dose mimicking model is also optimal for the inverse planning model (3) with an optimal dual vector $\alpha^*$ as objective weights (i.e., $x^*$ is an optimal solution for model (3) when $\alpha = \alpha^*$). This result means that the solution to each dose mimicking model is also optimal for the inverse planning model with a particular set of objective function weights.
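The duality argument can be sketched for a max-difference model with absolute differences. The form below is an assumption made for illustration: all SPG, auxiliary, and sign constraints are folded into $Ax \ge b$, the variables $x$ and $\zeta$ are treated as free, and the formulation is not copied from Table VI.

```latex
\begin{aligned}
\text{(mimicking)} \quad & \min_{x,\,\zeta}\ \zeta
  \quad \text{s.t.} \quad \zeta e - Cx \ge -C\hat{x}, \quad Ax \ge b, \\
\text{(dual)} \quad & \max_{\alpha,\,p}\ p'b - \alpha' C\hat{x}
  \quad \text{s.t.} \quad e'\alpha = 1, \quad A'p = C'\alpha, \quad \alpha \ge 0, \quad p \ge 0, \\
\text{(weighted)} \quad & \min_{x}\ \alpha^{*\prime} C x
  \quad \text{s.t.} \quad Ax \ge b .
\end{aligned}
```

Let $(x^*, \zeta^*)$ and $(\alpha^*, p^*)$ be optimal for the mimicking model and its dual. Then $x^*$ is feasible for the weighted model, and $p^*$ is feasible for the weighted model's dual ($\max_p\ p'b$ s.t. $A'p = C'\alpha^*$, $p \ge 0$). Complementary slackness in the mimicking model gives $p^{*\prime}(Ax^* - b) = 0$, so $\alpha^{*\prime} C x^* = p^{*\prime} A x^* = p^{*\prime} b$. The primal and dual objective values of the weighted model therefore coincide, which shows that $x^*$ is optimal for the weighted model with weights $\alpha^*$. The constraint $e'\alpha = 1$ also shows that the recovered weights are nonnegative and sum to one under this assumed form.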

E. Optimization metadata
In Table VII, we present metadata that was generated by each optimization model, which assigned a different proportion of weight to the objectives for each group of ROIs (i.e., OARs, targets, optimization structures). The models that evaluate relative differences (i.e., MeanRel and MaxRel) spread the proportion of weight relatively evenly between the OAR and target objectives, but the other two models assigned the majority of the weight to target objectives with no more than 0.018 weight to OARs. Additionally, the optimization structures generally received the smallest proportion of weight with the exception of the MaxAbs model, which assigned more weight to optimization structure objectives (0.170) than OAR objectives (0.011). There is also a wide range in average solve time of the models (222 seconds to 393 seconds). On average, the MaxAbs model was the fastest.

IV. DISCUSSION
Knowledge-based planning research is flourishing. However, optimization models for KBP (e.g., dose mimicking) have received much less attention in the literature than dose prediction models. In this paper, we developed four dose mimicking models and evaluated their performance with 19 different dose prediction models, which were inputs to the optimization models. We showed that both the dose prediction model and optimization model contributed to considerable variation in the quality of plans generated by the corresponding KBP pipeline. Additionally, we conducted a theoretical investigation to show that our dose mimicking models generate plans that are optimal for a multi-objective inverse planning model with particular weights.
Our data and code are published at https://github.com/ababier/open-kbp-opt to enable others to reproduce our results, which meets the gold standard in reproducibility [34]. Our data includes the first open dataset of predictions and reference plans to accompany CT images. We hope that this effort produces a common resource and lowers the barriers for future KBP optimization research, given that researchers must currently acquire their own private datasets and develop in-house prediction models before they can start testing new KBP optimization models.
Our open dataset contains the data for 100 patients who were treated with IMRT and a sample of high quality dose predictions for those same patients. The dataset was curated for the purpose of developing new fluence-based KBP optimization models that use ROI masks, dose influence matrices, and a dose prediction. The dose predictions were generated by 21 dose prediction models that were developed by an international group of researchers, which provided a diverse sample of realistic inputs for a KBP optimization model. Two of those prediction models (the 20th and 21st ranked models) were removed from our analysis because their dose scores were outliers (i.e., their dose error was substantially higher than that of the other models), which we elaborated on in Section II-C. For completeness, however, those 200 predictions are also available as part of our dataset.
We also performed a theoretical analysis to justify our dose mimicking models. Our key theoretical finding was that dose mimicking and conventional inverse planning are equivalent under certain specifications of the objective function weights. This allows us to interpret previous weight estimation techniques [33] through the more intuitive lens of dose mimicking models. Finally, by connecting dose mimicking to inverse planning, there is the potential to convert fully-automated KBP pipelines into semi-automated pipelines. Specifically, we use dose mimicking to generate a high-quality plan with its corresponding objective weights, which can be used in an inverse planning model (i.e., model (3)). This is advantageous because it enables human planners to improve the quality of plans generated by KBP via a conventional inverse planning process. By enabling this intuitive human interaction, we create a semi-automated KBP pipeline that is aligned with a common belief that AI will augment, rather than replace, the duties of healthcare practitioners [35].
Evaluating the performance of optimization models using many different dose predictions helps to identify interaction effects between these two stages of a KBP pipeline [5]. For example, the 16th ranked model generated lower quality predictions (in terms of dose error) than most of its competitors. However, when used in a KBP pipeline with the right optimization model, in this case the MeanAbs model, it generated high-quality plans that achieved more clinical criteria than any other KBP pipeline. In other words, the errors made by the 16th ranked model that contribute to its low prediction quality were corrected by the KBP optimization model. Since these interaction effects contribute to considerable variation in quality, it is important to evaluate KBP optimization models across a diverse set of dose prediction models. Additionally, if we can understand what types of prediction error are most highly correlated with KBP plan quality, we could propose better evaluation metrics to drive KBP prediction research towards making predictions that consistently translate into higher quality plans.
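The interaction effect described above can be quantified with a rank order correlation and a per-model rank change. The following is a minimal sketch, not the analysis code from our repository: the dose scores are hypothetical values chosen for illustration, and the helper names `ranks` and `spearman` are our own.

```python
from statistics import mean


def ranks(values):
    """Rank values from 1 (smallest) upward, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical dose scores in Gy (lower is better) for five prediction
# models and for the plans that one optimization model generated from them.
prediction_scores = [2.4, 2.6, 2.7, 2.9, 3.1]
plan_scores = [2.5, 2.3, 2.8, 2.2, 3.0]

rho = spearman(prediction_scores, plan_scores)

# Positive values mean the model ranks better after optimization.
rank_change = [p - q for p, q in zip(ranks(prediction_scores), ranks(plan_scores))]

print(round(rho, 2))  # 0.3
print(rank_change)    # [-2.0, 0.0, -1.0, 3.0, 0.0]
```

In this toy example the fourth model climbs three places after optimization, which mirrors the kind of interaction effect observed for the 16th ranked model: a weakly positive correlation overall can still hide large rank changes for individual pipelines.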
As in the original OpenKBP challenge, a limitation of this work is that we use synthetic dose distributions (i.e., the reference dose) as a substitute for real clinical dose. Although these dose distributions were subject to less quality assurance than clinical plans, they were previously shown to be of similar quality [10]. A second limitation of this work is that the dose prediction models were developed with the goal of optimizing the dose and DVH scores. There may be other scoring metrics that are better suited for developing a dose prediction model that excels in a KBP pipeline. This is a possible direction for future research. Lastly, this work only covers a single site and treatment modality. There is no guarantee that KBP optimization models that are developed with this dataset can generalize to other sites or treatment modalities.

V. CONCLUSION
In this paper, we combined the dose predictions contributed by a large international team with several KBP optimization models, resulting in 76 KBP pipelines. This was the largest international effort to date on KBP pipeline evaluation. We found that the best performing pipeline significantly outperformed the baseline approaches. In the interest of reproducibility, our data and code are freely available.

Fig. 1. Overview of a complete knowledge-based planning pipeline.

Fig. 2. An overview of our methods. A full description of each component is provided in the corresponding subsection.

Fig. 3. An overview of our process. First, dose prediction models were trained on out-of-sample data. Those models were used to predict dose for input to dose mimicking optimization models to generate KBP plans.

Fig. 4. The distribution of dose error over all KBP-generated dose (n = 1900 points in each box). Boxes indicate median and interquartile range (IQR). Whiskers extend to the minimum of 1.5 times the IQR and the most extreme outlier.

Fig. 5. The distribution of DVH point differences between the reference dose and each set of KBP-generated dose. Negative differences indicate cases where the KBP-generated dose had a lower DVH point than the reference dose. Boxes indicate median and IQR. Whiskers extend to the minimum of 1.5 times the IQR and the most extreme outlier.

Fig. 6. The percentage of (a) OAR, (b) target, and (c) all ROI clinical criteria that were satisfied by each KBP pipeline. The points indicate the percentage of satisfied criteria for n = 100 patients. A dashed line indicates the percentage of criteria satisfied by reference plans.

TABLE III
THE CLINICAL CRITERIA THAT WE USED TO EVALUATE DOSE DISTRIBUTIONS. BEFORE EVALUATING THESE CRITERIA, WE REINSTATED ANY OVERLAP BETWEEN TARGETS THAT WAS REMOVED.
m, subject to SPG ≤ 65, Auxiliary constraints to linearize functions in Tables I and II.

TABLE IV
EACH DOSE MIMICKING MODEL IS COMPARED TO THE PREDICTIONS IN TERMS OF MEDIAN RANK CHANGE AND RANK ORDER CORRELATION.

TABLE V
THE PERCENTAGE OF CLINICAL CRITERIA SATISFIED IN EACH SET OF KBP-GENERATED DOSE. NOTE THAT "BEST" IS DEFINED AS THE TOP PERFORMING KBP PIPELINE THAT GENERATED PLANS THAT SATISFIED THE MOST ROI CLINICAL CRITERIA. THE HIGHEST PERCENTAGE OF SATISFIED CRITERIA IS BOLDED IN EACH ROW.

TABLE VI
THE DOSE MIMICKING MODELS USED IN THIS PAPER ARE PRESENTED IN MATRIX NOTATION WITH THEIR CORRESPONDING DUAL MODELS.

TABLE VII
A SUMMARY OF THE METADATA THAT EACH OPTIMIZATION MODEL GENERATED AFTER OPTIMIZING 1900 PLANS.
Lulin Yuan is with Department of Radiation Oncology, Virginia Commonwealth University Medical Center, Richmond, VA, United States (e-mail: luliny@gmail.com).
Simeng Zhu is with Department of Radiation Oncology, Henry Ford Health System, Detroit, MI, United States (e-mail: szhu1@hfhs.org).
Lukas Zimmermann is with Faculty of Health, University of Applied Sciences Wiener Neustadt, Wiener Neustadt, Austria and Competence Center for Preclinical Imaging and Biomedical Engineering, University of Applied Sciences Wiener Neustadt, Wiener Neustadt, Austria (e-mail: