Embedding machine learning based toxicity models within radiotherapy treatment plan optimization

Objective. This study addresses radiation-induced toxicity (RIT) challenges in radiotherapy (RT) by developing a personalized treatment planning framework. It leverages patient-specific data and dosimetric information to create an optimization model that limits adverse side effects using constraints learned from historical data. Approach. The study uses the optimization with constraint learning (OCL) framework, incorporating patient-specific factors into the optimization process. It consists of three steps: optimizing the baseline treatment plan using population-wide dosimetric constraints; training a machine learning (ML) model to estimate the patient’s RIT for the baseline plan; and adapting the treatment plan to minimize RIT using ML-learned patient-specific constraints. Various predictive models, including classification trees, ensembles of trees, and neural networks, are applied to predict the probability of grade 2+ radiation pneumonitis (RP2+) for non-small cell lung cancer (NSCLC) patients three months post-RT. The methodology is assessed with four high RP2+ risk NSCLC patients, with the goal of optimizing the dose distribution to constrain the RP2+ outcome below a pre-specified threshold. Conventional and OCL-enhanced plans are compared based on dosimetric parameters and predicted RP2+ risk. Sensitivity analysis on risk thresholds and data uncertainty is performed using a toy NSCLC case. Main results. Experiments show the methodology’s capacity to directly incorporate all predictive models into RT treatment planning. In the four patients studied, mean lung dose and V20 were reduced by an average of 1.78 Gy and 3.66%, resulting in an average RP2+ risk reduction from 95% to 42%. Notably, this reduction maintains tumor coverage, although in two cases, sparing the lung slightly increased the spinal cord max-dose (by 0.23 and 0.79 Gy). Significance.
By integrating patient-specific information into learned constraints, the study significantly reduces adverse side effects like RP2+ without compromising target coverage. This unified framework bridges the gap between predicting toxicities and optimizing treatment plans in personalized RT decision-making.


Introduction
Radiation-induced toxicity (RIT), an unavoidable downside of all radiotherapy (RT) treatments, is a critical bottleneck and a major concern throughout patients' treatment course, from treatment planning and delivery through post-RT follow-ups. Traditionally, the RIT associated with a given RT plan has been estimated using normal tissue complication probability (NTCP) models such as the Lyman model (Lyman 1985), the Lyman-Kutcher-Burman (LKB) model (Kutcher and Burman 1989), the relative seriality (RS) model (Källman et al 1991), and their variants. Because these models rely only on external (dosimetric) factors and ignore patients' underlying risk factors, such as physical health and genetic predispositions, their potential is limited when applied to individual patients, which is the aim of personalized radiotherapy (Baumann et al 2016). In recent years, machine learning (ML)-based outcome models have been introduced as more powerful tools for estimating RIT. These models are capable of synthesizing (many) dosimetric and non-dosimetric risk factors, often showing superior performance (i.e. predictive power) when compared to their non-ML counterparts (see Isaksson et al 2020 for an overview of ML-based outcome modeling of RIT).
Regardless of which model is used to estimate the risk of toxicity, in a RIT-cognizant, personalized RT, the estimated risks should be translated into 'clinically actionable decisions.' That is, one must learn how to adjust the RT treatment plan to keep the estimated patient-specific risk within tolerable bounds. Despite this, the field of ML-based outcome modeling remains largely disconnected from clinical RT planning. Using outcome models directly during treatment planning has so far remained limited to NTCP-based RIT models (see Witte et al 2007, Semenenko et al 2008, Nahum and Uzan 2012, Kierkels et al 2014, 2016, for some examples), again making the resulting treatment plans more population-based than personalized. Ajdari et al (2022) proposed a framework for 'calibrating' the prediction of a conventional NTCP model according to the prediction of an ML-based outcome model (a Bayesian network) for radiation pneumonitis (RP). The updated NTCP model was then embedded within the treatment plan optimization to obtain RT plans more tailored toward patients' individual RP risk. While the authors effectively integrate a predictive model with treatment planning optimization, the approach separates the ML predictions from the optimization module. In other words, the ML model is utilized in the optimization only indirectly. As a result, the treatment plans are not optimized based on the actual risk but assume a fixed risk during the optimization process, independent of the optimized decision variables.
In this work, we use a novel methodology for directly embedding entire ML risk models into treatment planning, thus closing the gap between ML-based outcome prediction and treatment plan optimization. The approach is built upon the concept of optimization with constraint learning (OCL) (Verwer et al 2017, Biggs et al 2023, Maragno et al 2023), in which mathematical representations (constraints) of ML predictive models are learned and embedded directly within the treatment plan optimization. Although generalizable to different types of cancer and toxicity endpoints, we build our framework on a non-small cell lung cancer (NSCLC) cohort, with the goal of personalizing the treatment plans to minimize the patient-specific risk of RP.
By integrating biomarker data, ML-based risk prediction, and treatment plan optimization, our approach provides a computationally tractable means of achieving personalized treatment planning that can be customized to other treatment sites beyond lung cancer.As a result, this study represents a crucial step forward in the pursuit of more efficient and tailored cancer treatment.

Methodology
Our method is based on the OCL framework (Fajemisin et al 2023). The framework is generic and can be deployed on a wide range of problems to learn mathematical representations (constraints) from ML models. In the context of RT treatment plan optimization (TPO), OCL can be used to learn a variety of outcome models, such as TCP, NTCP, and any combinations thereof (e.g. uncomplicated tumor control probability, UTCP (Chaikh et al 2016)). OCL leverages ML to design optimization models in which (parts of) the constraints and objectives are directly learned from data when an explicit expression is unknown or the conventional formulations are deemed inadequate, which is the case for conventional TCP/NTCP models. In the remainder of this section, we describe the general OCL framework in a TPO context. In the last part of this section, we describe the design of the experiments to evaluate our methodology.

General OCL framework
Suppose we have a dataset D = {(x̂_i, ŵ_i, ŷ_i)}_{i=1}^N of size N, with x̂_i the observed decision variables (e.g. mean total lung dose), ŵ_i the contextual information (e.g. patient's age, gender, or tumor location), and ŷ_i the outcomes of interest for sample i (e.g. risk of developing RIT side effects), such that the relationship between these variables has to be learned from data. An OCL framework involves three primary steps. First, the conceptual model is designed, including the formulation of decision variables, objective function, and the constraints that are to be learned. Next, one or more predictive models are fitted to historical data to estimate the outcomes of interest. Finally, the predictive models are embedded into the optimization model, which is now ready to be solved. Figure 1 provides a schematic representation of the OCL framework. In the remainder of this section, we discuss each step of the OCL framework in a radiation therapy setting. The learned outcome is given by y = f̂(x, w), where f̂ is an ML model trained on D. We do not use the 'hat' notation for x, w, and y because these variables represent a distinct and unobserved instance, which is not part of the dataset D. The predictive model outcome, denoted as y, corresponds to the probability of a patient experiencing a serious complication or not, subject to the constraint that it remains below a threshold value τ. We note that the embedding of a single learned outcome may require multiple constraints and auxiliary variables, as shown in the next section.
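The three steps above can be sketched on a toy problem (this is an illustration, not the paper's pipeline): fit a risk model on synthetic historical data, then turn the learned risk bound into a limit on the decision variable for a new patient with fixed contextual information. All variable names (mld, age) and numbers below are hypothetical.

```python
# Illustrative sketch of the three OCL steps on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N = 200
mld = rng.uniform(5, 25, N)      # decision-dependent feature x (mean lung dose, Gy)
age = rng.uniform(40, 80, N)     # contextual feature w
# synthetic outcome y: risk grows with dose and age
logit = 0.3 * (mld - 15) + 0.05 * (age - 60)
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-logit))).astype(int)

# Step 2: fit a predictive model f(x, w) on the historical data
model = LogisticRegression().fit(np.column_stack([mld, age]), y)

# Step 3: for a new patient (fixed w), the learned constraint
# "predicted risk <= tau" becomes a bound on the decision variable x:
# sigmoid(b_x*x + b_w*w + b0) <= tau  <=>  x <= (logit(tau) - b_w*w - b0) / b_x
tau, age_new = 0.5, 70.0
b_x, b_w = model.coef_[0]
b0 = model.intercept_[0]
x_max = (np.log(tau / (1 - tau)) - b_w * age_new - b0) / b_x
print(f"max admissible mean lung dose for this patient: {x_max:.2f} Gy")
```

The closed-form bound exists here only because the model is linear in x; the later sections show how richer models require additional constraints and variables.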

Predictive models
The second step of the OCL framework involves exploring various ML models to learn the outcome of interest, denoted as variable y in the conceptual model (1). In RT, the goal is to find a predictive model that accurately approximates the probability of a patient experiencing high levels of toxicity. To evaluate each predictive model, we consider four key metrics, summarized in table 1 and described below:
• Interpretability: the selection of a predictive model may depend on whether the practitioner requires the model to be interpretable. Interpretable models, such as linear models or decision trees, can provide insight into the factors that influence the predictions.
• Approximation accuracy: a predictive model must be able to accurately approximate the underlying function that generates the data. The ability of a model to fit a wide range of functions is an important consideration when selecting a predictive model. More flexible models, like neural networks, can potentially approximate more complex functions.
• Data required: the quantity and quality of data required to fit a model can vary depending on the model's complexity and the amount of information it can extract from the data. More complex models may require larger and higher-quality datasets to achieve good performance.
• Complexity class: the selection of a predictive model impacts the computational complexity class of the conceptual optimization model. Linear models typically do not affect the complexity class, while other models, like tree ensembles and neural networks, require binary variables to be represented as constraints, which increases the model's complexity. In some cases, such as Bayesian networks and neural networks with activation functions other than the rectified linear unit (ReLU), the embedding requires complex (non-convex) constraints.
The interpretability of predictive models is crucial in clinical decision-making for radiation treatment planning. It ensures transparency, trust, and ethical considerations, allowing clinicians to comprehend and validate the model's reasoning for accurate and accountable treatment plans. The outcomes of interest depend on multiple factors, including dosimetric, clinical, and imaging information, making their approximation challenging, especially with scarce, fragmented, and incomplete data. Moreover, the complexity class of the predictive model significantly impacts the tractability of the final optimization problem. Since the conceptual model may involve a large number of constraints and decision variables, incorporating complex predictive models can lead to problems that cannot be solved within a reasonable timeframe.

Embedding the predictive models
In the final step of our framework, we integrate the trained predictive models into the conceptual model. In this section, we focus on those predictive models belonging to the complexity classes LO (linear optimization) and MIO (mixed-integer optimization); see table 1. It is important to note that although the embedding of the predictive models is done using piece-wise linear constraints, this does not imply that the function being approximated is linear. The embedding is possible for various types of predictive models, such as linear models, decision trees, tree ensembles, and neural networks, and is applicable to both classification and regression tasks. In light of the paper's objective, the following discussion focuses on binary classification models. For readability, we use the notation x to indicate both the vector of decision variables and the derived features previously denoted as d(x).
Below, we show the formulations adopted by Maragno et al (2023) to embed these predictive models, and we refer to them for further details.
Logistic regression. Logistic regression is a natural choice due to its inherent linearity and ease of integration. Once the model has been trained, only the final coefficient vectors β_x ∈ R^n and β_w ∈ R^p (along with the intercept term β_0) are necessary to describe the model. In classification tasks, these models find a hyperplane that separates positive and negative samples. The prediction for a given sample is computed as

ŷ = σ(β_x^T x + β_w^T w + β_0) = 1 / (1 + e^{-(β_x^T x + β_w^T w + β_0)}).

This function, known as the sigmoid function, is nonlinear. However, if we impose the constraint y ≤ τ, where τ represents the desired probability bound, this constraint is satisfied exactly when

β_x^T x + β_w^T w + β_0 ≤ ln(τ / (1 - τ)).

Consequently, the binarized output of logistic regression, using a threshold of τ, can be expressed as follows: ŷ = 1 if β_x^T x + β_w^T w + β_0 ≥ ln(τ / (1 - τ)), and ŷ = 0 otherwise. For example, at a threshold of τ = 0.5, the predicted value is 1 when β_x^T x + β_w^T w + β_0 ≥ 0. Here, τ can be chosen according to the minimum necessary probability to predict 1.
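The equivalence between the sigmoid constraint and its linear counterpart can be checked numerically; the snippet below is a standalone illustration with an arbitrary τ, writing z for the linear score β_x^T x + β_w^T w + β_0.

```python
# Check that sigmoid(z) <= tau holds exactly when z <= ln(tau / (1 - tau)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

tau = 0.7
z_bound = np.log(tau / (1 - tau))   # logit(0.7) ~ 0.847
zs = np.linspace(-5, 5, 1001)
nonlinear = sigmoid(zs) <= tau      # original sigmoid constraint
linear = zs <= z_bound              # its exact linear counterpart
print(nonlinear.sum(), linear.sum())
```

Because the sigmoid is strictly increasing, the two feasible sets coincide, which is what makes the logistic-regression embedding a single linear inequality.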
Support vector machines. Similar to logistic regression models, linear support vector machines (SVM) aim to identify the optimal hyperplane that can effectively separate positive and negative samples (Cortes and Vapnik 1995). A trained binary classification SVM also provides coefficients β_x, β_w, and β_0, and the prediction for a given sample can be expressed as follows:

ŷ = sign(β_x^T x + β_w^T w + β_0).

In SVM, the output variable y is binary rather than a probability, and as a result, it does not necessitate the use of the threshold parameter τ. In this case, the constraint can simply be embedded as β_x^T x + β_w^T w + β_0 ≤ 0. In the case of SVM with a nonlinear kernel (such as the Gaussian kernel or polynomial kernel), the embedding process differs, and its linearization is not possible.
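A quick standalone check of this property with scikit-learn's LinearSVC (synthetic data; the class labeled 0 plays the role of the 'safe' outcome): the prediction is fully determined by the hyperplane coefficients, so requiring the safe class is a single linear inequality on the coefficients.

```python
# For a linear SVM, "predict class 0" embeds as beta @ x + beta0 <= 0.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # separable synthetic labels

svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
beta, beta0 = svm.coef_[0], svm.intercept_[0]

x_new = np.array([-1.0, -0.5])                  # a point on the "safe" side
print(beta @ x_new + beta0, svm.predict([x_new])[0])
```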
Decision trees. A generic decision tree of depth two is shown in figure 2. A split at node i is described by an inequality A_i^T x + A_iw^T w ≤ b_i, where A_i represents the coefficients associated with the input features x, and A_iw captures the weights of the auxiliary variables w. To simplify the notation, we include the effect of A_iw in the constant term b_i and omit it for the remainder of this paragraph. We assume that A_i can have multiple non-zero elements, in which case we have the hyperplane-split setting; if there is only one non-zero element, this creates a parallel (single-feature) split. Each terminal node j (i.e. leaf) yields a prediction p_j for its observations. In binary classification, the prediction is the proportion of leaf members with the positive class.
Suppose that we aim to constrain the predicted value of this tree to be at most τ, a fixed constant between 0 and 1. Once we have trained the tree in figure 2, we can identify which paths satisfy the desired bound (p_i ≤ τ). Suppose that p_3 does satisfy the bound, but p_4, p_6, and p_7 do not. In order to enforce the decision vector x to reach leaf p_3, we can use as constraints the split inequalities along the path from the root to that leaf. When more than one leaf satisfies the condition p_i ≤ τ, we can decompose the problem into multiple separate subproblems, one per feasible leaf, where the learned constraints for leaf i's subproblem are implicitly represented by the polyhedron P_i. These subproblems can be solved in parallel, and the minimum across all subproblems is taken as the optimal solution. An alternative approach is to represent the entire decision tree by utilizing auxiliary binary variables, which assist in determining the leaf to which the solution belongs; see Verwer et al (2017), Halilbasic et al (2018).
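The path-extraction step can be sketched with scikit-learn's tree internals (synthetic data with hypothetical features, not the paper's cohort): for each leaf whose predicted risk is at most τ, list the single-feature split constraints along its root-to-leaf path.

```python
# Sketch: extract, from a fitted sklearn tree, the split constraints
# defining each leaf whose predicted positive-class proportion is <= tau.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.uniform(0, 30, size=(300, 2))    # e.g. [mean lung dose, FEV1] (made up)
y = (X[:, 0] > 15).astype(int)           # synthetic "toxicity" label

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
t = tree.tree_

def feasible_leaf_paths(tau):
    """Per feasible leaf, the list of constraints (feature, '<=' or '>', threshold)."""
    paths = []
    def walk(node, constraints):
        if t.children_left[node] == -1:                        # leaf node
            counts = t.value[node][0]
            if counts[1] / counts.sum() <= tau:                # predicted risk <= tau
                paths.append(list(constraints))
            return
        f, thr = t.feature[node], t.threshold[node]
        walk(t.children_left[node], constraints + [(f, "<=", thr)])
        walk(t.children_right[node], constraints + [(f, ">", thr)])
    walk(0, [])
    return paths

for path in feasible_leaf_paths(tau=0.5):
    print(" AND ".join(f"x[{f}] {op} {thr:.2f}" for f, op, thr in path))
```

Each printed conjunction is exactly one polyhedron P_i from the text; in an MIO these become linear constraints of the corresponding subproblem.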
The computational complexity arising from the embedding of tree-based models is determined by the total number of nodes and leaves. Therefore, it becomes crucial to leverage the values of the contextual features w to selectively prune unreachable leaves. This is especially significant in radiation therapy since most features are patient-specific and known prior to solving the optimization model. In practical terms, for each patient, we first prune the fitted decision tree by removing leaves that cannot be reached based on the contextual features. Subsequently, we incorporate the pruned tree into the optimization model. An example of pruning is shown in section 2.2.
Tree ensembles. Ensemble methods, such as random forests and gradient-boosting machines, utilize multiple decision trees combined to generate a single prediction for an observation (Breiman 2001). These models can be implemented by embedding multiple sub-models representing each tree. In random forests, predictions are typically obtained by averaging the predictions from individual trees as follows:

ŷ = (1/P) Σ_{i=1}^{P} y_i,

where P represents the number of trees in the ensemble, and y_i denotes the predicted value from the ith tree. On the other hand, gradient-boosting machines calculate the model output by summing the predictions of the individual regression models, weighted by corresponding coefficients:

ŷ = Σ_{i=1}^{P} β_i y_i,

where y_i is the predicted value of the ith model, and β_i represents the weight associated with the prediction.
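That a random forest's output is exactly the average of its trees' probabilities, so the ensemble embeds as P copies of the single-tree constraints plus one averaging constraint, can be verified directly in scikit-learn (synthetic data):

```python
# Verify: forest probability == mean of the individual trees' probabilities.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0).fit(X, y)
x_new = rng.normal(size=(1, 3))

per_tree = np.mean([est.predict_proba(x_new)[0, 1] for est in rf.estimators_])
ensemble = rf.predict_proba(x_new)[0, 1]
print(per_tree, ensemble)
```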
Neural networks. Neural networks with a ReLU activation function are known to be linearly representable (Grimstad and Andersson 2019, Anderson et al 2020). These networks typically consist of an input layer, L − 2 hidden layers, and an output layer. Importantly, the ReLU operator, v = max{0, x}, can be encoded using linear constraints:

v ≥ x,  v ≥ 0,  v ≤ x − M_L(1 − z),  v ≤ M_U z,  z ∈ {0, 1}.

Here, the large numbers M_L and M_U stand for lower and upper bounds on all possible values of x. Although this embedding relies on a big-M formulation, there are ways to enhance it. The constraints for a neural network can be recursively generated, starting from the input layer, allowing the embedding of trained networks with any number of hidden layers and nodes. The output layer of the neural network consists of a single neuron with a sigmoid activation function. Similarly to logistic regression, the sigmoid function can be represented as a linear constraint using the τ bound:

β^T v_{L−1} + β_0 ≤ ln(τ/(1 − τ)),

where v_{L−1} represents the output vector of the penultimate layer L − 1.
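The big-M encoding above can be checked numerically: for each x within the assumed bounds, the constraints leave v = max{0, x} as the only feasible activation value. The snippet below is a plain enumeration over z, not a solver run, and the constraint form matches the inequalities quoted in the text.

```python
# Numeric check of the big-M ReLU encoding:
#   v >= x,  v >= 0,  v <= x - M_L * (1 - z),  v <= M_U * z,  z in {0, 1}.
import numpy as np

M_L, M_U = -10.0, 10.0   # assumed bounds on the pre-activation value x

def feasible_v_intervals(x):
    """Feasible interval(s) for v under the big-M constraints, one per z choice."""
    intervals = []
    for z in (0, 1):
        lo = max(x, 0.0)                       # from v >= x and v >= 0
        hi = min(x - M_L * (1 - z), M_U * z)   # the two big-M upper bounds
        if lo <= hi:
            intervals.append((lo, hi))
    return intervals

print(feasible_v_intervals(2.5))   # [(2.5, 2.5)] -> v forced to max(0, 2.5)
```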
Literature on predictive models in RIT. Table 2 provides an overview of published papers on predictive models for radiation treatment of head and neck (H&N), lung, and prostate cancers, focusing on xerostomia, pneumonitis, and rectal toxicity/bleeding due to their prevalence in ML-based predictions. The last column of the table categorizes the feasibility of embedding the ML models in treatment plan optimization as follows:
• The model is linear or linearly representable, and the methodology proposed in this study can be followed for the embedding.
• The model is linear or could be linearly representable, but this study does not provide the methodology for embedding the learned constraints.
• The model is not linearly representable.
Among the selected toxicities, 19 out of 23 (82.6%) literature models belong to the first category, indicating that they can be embedded in the optimization process following the methodology proposed in this paper. However, it is noteworthy that linear discriminant analysis (LDA) and distance-based (norm) models, while being MIO-representable, are not incorporated into our methodology due to the absence of documented MIO embeddings in the existing literature. In contrast, neither Naïve Bayes nor Bayesian networks can be represented using an MIO formulation. Attempting to embed them into the optimization model, though theoretically feasible, would introduce highly non-convex constraints, thereby making the model impractical to solve to optimality.

Experimental design
The experiments aim to optimize the treatment plan for patients with NSCLC while constraining the risk of symptomatic pneumonitis (grade 2 or higher), referred to as RP2+, to be smaller than a given threshold (τ). The experiments consist of two parts. In the first part, we design a (simplified) toy example with voxel spacing values of [10, 10, 10], which allows for faster optimization even when complex predictive models, like neural networks, are deployed to constrain the RP2+ risk. In the second part, we use a voxel spacing of [5, 5, 5] to demonstrate the performance on realistic (clinical) cases and use a decision tree to model the RP2+ outcome. In both scenarios, we use a spot spacing and layer spacing equal to 10, and a target margin equal to 5. In the appendix, we report the size, in terms of beamlets and voxels, of the optimization problems solved for the clinical study. The first part of the experiments demonstrates how various predictive models, including those discussed in section 2.1.3, can be used to learn the probability of toxicity. We conduct plan adaptations for patients with an RP2+ predicted risk higher than 0.5 and sensitivity analyses by modifying both the RP2+ threshold and several patient-specific parameters that impact the predictive model's outcome. The second part showcases the suitability of decision trees due to their interpretability and their linear embedding (without the use of binary variables). We present adjusted treatment plans for four patients who would otherwise have a probability greater than 0.5 of experiencing RP2+ according to the fitted decision tree. Since the primary objective of this work is to offer practitioners a structured approach for incorporating predictive models into treatment planning, our discussion of the experiments emphasizes the interconnection between prediction and prescription rather than the fine-tuning of highly performing predictive models. In the remainder of this section, we offer additional information about the conceptual model we adopted, the dataset utilized for training the predictive models, and the specific predictive models employed to assess the RP2+ risk. Finally, we provide a detailed explanation of how a decision tree can be seen as a different set of rules depending on the patient's characteristics.
Conceptual model. The conceptual model used for the experiments is a treatment plan optimization model. A max-dose constraint of 54 Gy-RBE is imposed on the spinal cord. For the heart and total lung, we used mean-dose constraints with upper bounds in the range 15.06–20.03 Gy-RBE, depending on the patient. The probability of experiencing RP2+ is constrained to be less than τ = 0.5; however, any value between 0 and 1 can be used, and we recommend that the threshold value be chosen in consultation with a physician. The reference doses for the gross tumor volume (GTV) and planning target volume (PTV) in the objective function are 66 Gy-RBE and 65 Gy-RBE, respectively. We use the unit 'Gy-RBE' to keep the doses consistent between photon and proton treatments in terms of their biological effect. There were no specific constraints on RBE or its range during the treatment planning. As per standard practice, we scale the dosimetric metrics for proton planning using a weighting factor of RBE = 1.1 to accommodate the biological differences between protons and photons. To ensure that the optimal solution yields low mean doses delivered to normal tissues, we introduce a penalty in the objective function by incorporating a weighted sum of the mean doses delivered to the heart, lung, and spinal cord. We give equal weight to all objectives, and this remains consistent in the OCL-adapted plans too.
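A miniature stand-in for this kind of conceptual model can be posed with SciPy. The dose-deposition matrices below are random placeholders (not clinical data); only the 54 Gy-RBE cord limit, an illustrative mean lung-dose bound of 16.8 Gy within the reported range, and the 65 Gy-RBE PTV reference come from the text.

```python
# Toy treatment-plan optimization: reference-dose objective with mean-dose
# penalties, plus max-dose and mean-dose constraints on organs at risk.
import numpy as np
from scipy.optimize import LinearConstraint, minimize

rng = np.random.default_rng(4)
n_beamlets = 20
A_ptv = rng.uniform(0.5, 1.5, (30, n_beamlets))   # dose per unit beamlet weight
A_lung = rng.uniform(0.0, 0.3, (40, n_beamlets))
A_cord = rng.uniform(0.0, 0.2, (10, n_beamlets))

d_ref, lung_mean_max, cord_max = 65.0, 16.8, 54.0

def objective(w):
    # PTV reference-dose deviation plus equally weighted mean-dose penalties
    return (np.mean((A_ptv @ w - d_ref) ** 2)
            + np.mean(A_lung @ w) + np.mean(A_cord @ w))

constraints = [
    # mean lung dose <= 16.8 Gy (the kind of bound a learned constraint yields)
    LinearConstraint(A_lung.mean(axis=0).reshape(1, -1), -np.inf, lung_mean_max),
    # max dose on every cord voxel <= 54 Gy-RBE
    LinearConstraint(A_cord, -np.inf, cord_max),
]
res = minimize(objective, x0=np.ones(n_beamlets),
               bounds=[(0, None)] * n_beamlets, constraints=constraints)
print(f"mean PTV dose: {np.mean(A_ptv @ res.x):.1f} Gy-RBE")
```

The paper solves the real counterpart of this problem with Gurobi; SciPy's SLSQP is used here only to keep the sketch self-contained.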
Dataset. We built our predictive models on a dataset of 73 stage IIB-IIIA NSCLC patients treated with passive-scattering proton or intensity-modulated RT (Liao et al 2018). More details about the dataset can be found in Ajdari et al (2022). Among the patients in the dataset, 27% experienced grade 2+ RP, which serves as the outcome for the predictive models. The features used to predict RP2+ can be categorized into three main groups: dosimetric information (e.g. mean lung dose), clinicopathological information (e.g. age, smoking status, breathing function), and tumor-related information (e.g. GTV size, histology, tumor location). Before training the predictive models, categorical features are encoded into numerical values using one-hot encoding. We utilized the same dataset throughout the experiments. However, in the first part of the experiments, we excluded a set of features associated with the calculation of the deposition matrix, such as the GTV size, as this was necessary to facilitate the sensitivity analysis.
Predictive models. We trained four predictive models:
• Logistic regression with an ℓ1 penalty term and default regularization coefficient (C = 1).
• Random forest composed of 10 decision trees each with a max depth of 4.
• Neural network with one hidden layer of 10 nodes activated with a ReLU activation function.
• Decision tree, used for the second part of the experiments.
In figure 3, we report the fitted decision tree used for the second part of the experiments. Among the 73 patients available in the dataset, 19 were selected to evaluate the performance of the model: 14 patients did not develop RP (RP2+ = 0) and 5 did (RP2+ = 1). The fitted decision tree yields a cross-validation AUC of 0.66 (95% confidence interval, CI = [0.65, 0.68]) and a test AUC of 0.66. While we recognize the significance of achieving a predictive model with a high AUC score, improving model accuracy lies beyond the scope of our present work.
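A cross-validated AUC of this kind can be computed mechanically with scikit-learn's cross_val_score; the data below are synthetic stand-ins (the paper's cohort is not public), so the resulting score is illustrative only.

```python
# Cross-validated AUC for a shallow decision tree on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(73, 5))                 # 73 "patients", 5 made-up features
y = (X[:, 0] + rng.normal(scale=1.0, size=73) > 0.6).astype(int)

auc = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=0),
                      X, y, cv=5, scoring="roc_auc")
print(f"cross-validation AUC: {auc.mean():.2f}")
```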
From decision tree to treatment constraints. In the second part of the experiments, we evaluate our method by conducting tests on several patients. For each patient, we first solve an optimization model (baseline_model) without incorporating the learned constraints. Subsequently, we assess whether the patient, under the resulting treatment, has a probability greater than 0.5 of developing pneumonitis within three months. The decision tree shown in figure 3 indicates that a patient receives an RP2+ score smaller than 0.5 if one of the following three scenarios is met:
(i) The GTV is less than or equal to 55.20 cc.
(ii) The GTV is greater than 55.20 cc and the FEV1 pre-bronchodilator (BD) score is greater than 2.52.
(iii) The GTV is greater than 55.20 cc, the FEV1 pre-bronchodilator (BD) score is less than or equal to 2.52, and the total lung mean dose is smaller than or equal to 16.80 Gy.
The decision tree structure ensures that only one set of conditions can be active at a time, depending on the patient's information and the treatment plan. If none of these conditions is satisfied, the RP2+ score for the patient, given their treatment plan, will be higher than 0.5 (bottom-left leaf). To illustrate how different conditions are active for different patients, let us consider the cases of three patients (see figure 4 for a visual representation). The first patient has a GTV of 45 cc, falling under scenario (i). Regardless of the other features, the decision tree predicts a low probability of RP2+. The second patient has a GTV of 60 cc and an FEV1 pre-BD score of 3, corresponding to scenario (ii). Regardless of the mean dose delivered to the lung, the patient is unlikely to experience side effects according to the decision tree. Finally, the third patient has a GTV of 60 cc, but this time the FEV1 pre-BD score is 2. This scenario corresponds to (iii), and the only way to achieve an RP2+ score smaller than 0.5 is to design a treatment plan with a mean total lung dose below 16.80 Gy. While some patients fall into scenarios (i) or (ii), where a low RP2+ score is ensured without adapting the treatment plan, other patients require a treatment plan that keeps the mean total lung dose below the threshold suggested by the tree. These three scenarios demonstrate that, for a given patient, only certain leaves of the decision tree are active and therefore reachable, while the remaining leaves can be excluded from consideration. In the next section, we incorporate the learned decision tree into the optimization model and observe how the solution changes for some of the patients.
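The three scenarios can be written out as a small rule function (thresholds as quoted from figure 3 in the text); applying it to the three example patients reproduces the reasoning above.

```python
# The three low-risk scenarios of the fitted decision tree as plain rules.
def rp2_low_risk(gtv_cc, fev1_pre_bd, mean_lung_dose_gy):
    if gtv_cc <= 55.20:                          # scenario (i): small GTV
        return True
    if fev1_pre_bd > 2.52:                       # scenario (ii): good lung function
        return True
    return mean_lung_dose_gy <= 16.80            # scenario (iii): dose must be low

# Patient 1: small GTV -- low risk regardless of dose or FEV1
print(rp2_low_risk(45, 2.0, 20.0))   # True
# Patient 2: large GTV but good lung function
print(rp2_low_risk(60, 3.0, 20.0))   # True
# Patient 3: large GTV, poor lung function -- only the dose can help
print(rp2_low_risk(60, 2.0, 20.0))   # False
print(rp2_low_risk(60, 2.0, 15.0))   # True
```

For patient 3 the function makes explicit that the only actionable lever is the mean total lung dose, which is precisely the constraint the OCL-adapted plan enforces.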

Results
For the experiments, we use the open-source Python package OpenTPS (Wuyckens et al 2023) for dose calculation and visualization. We embed the predictive models into the optimization framework using the open-source Python package OptiCL (Maragno and Wiberg 2021). The optimization model is written and solved in Gurobi (Gurobi Optimization, LLC 2022). The training of the predictive models is performed using the Python library Sklearn v1.0.2 (Pedregosa et al 2011). All computations are performed on a laptop with an Intel i7-10850H CPU at 2.70 GHz and 32 GB of RAM.

Toy example
In this first part of the experiments, we use a simplified treatment planning problem and employ three distinct predictive models, namely logistic regression, random forest, and neural network, to learn the RP2+ risk function. In this context, we demonstrate how the adjusted treatment plan varies based on the chosen predictive model used to define the RP2+ risk for a given patient. The dose-volume histograms (DVH) can be found in figure 5, and the dosimetric parameters are detailed in table 3. The results show that across all three scenarios, adjustments to the plans are necessary to decrease the RP2+ risk below the 0.5 threshold. In fact, the logistic regression model assigns a 57% RP2+ risk probability to the baseline plan, the random forest model assigns 55%, and the neural network model assigns 87%. The adapted plan obtained using logistic regression shows a reduction of 1.25 Gy in mean total lung dose, compared to 1.68 Gy for the random forest and 0.95 Gy for the neural network. However, this decrease in mean total lung dose required an increase in mean dose for the heart and spinal cord, ranging between 0.47 Gy and 0.56 Gy for the heart and between 0.06 Gy and 0.09 Gy for the spinal cord.
The adapted plans exhibit variations in the delivered dose to the GTV and PTV, evident especially in terms of D98, with differences remaining below 2.46 Gy (random forest). Using the same toy example, we run two sensitivity analyses. The first analysis, shown in figure 6, involves varying the threshold τ in model (1) to observe its impact on treatment planning, represented by the PTV D98 and the mean total lung dose. As the threshold is incrementally raised, the plots illustrate the changes in the treatment solution, allowing for a higher mean total lung dose and PTV D98 until the solution plateaus. In the case of the random forest, it is not possible to obtain a feasible solution to the resulting optimization problem for a threshold smaller than 0.5. This analysis provides insights into the behavior of the predictive models concerning the constraint threshold, helping to identify critical points where the optimal treatment plan attains a steady state and remains insensitive to further changes in the threshold value.
In figure 7, we report the results of the sensitivity analyses performed using three different predictive models: logistic regression, random forest, and neural network. For each predictive model, we select the three most relevant features that influence the RP2+ risk prediction. These features were perturbed individually to observe their effects on the optimal solution while ensuring a maximum risk threshold of τ = 0.5 for the RP2+ risk. The conducted experiments are patient-specific and offer insights into the treatment's sensitivity to changes in features. Both the logistic regression model and the neural network model demonstrate that a higher Karnofsky score allows for a higher mean total lung dose and PTV dose while ensuring that the RP2+ risk remains below the threshold. The logistic regression model indicates that patients with a higher average number of daily smoked cigarettes and those who are currently smoking can tolerate a higher radiation dose. Moreover, the neural network model suggests that for older patients, the delivered dose should decrease, while higher values of forced vital capacity (FVC), corresponding to healthier lung function, allow for a higher radiation dose.

Clinical study
In the second part of the experiments, we run a clinical study to compare the treatment planning solutions before and after the adaptation for some of the patients falling in scenario (iii); see section 2.2. The adaptation is done using the decision tree in figure 3 to constrain the risk of RP2+ to be smaller than 0.5. In table 4, we summarize the optimal treatment plans for the four patients using various metrics.
For the first two patients, we display in figures 8-9 the DVH, as well as the dose distribution (a) before adaptation, (b) after adaptation, and (c) their voxel-wise difference. Notably, the lung DVH is significantly reduced in all cases (mean dose reductions of 1.06 Gy, 1.85 Gy, 1.72 Gy, and 2.49 Gy). Occasionally, a lower lung dose may result in trade-offs with other organs at risk. For instance, in the adapted plan for Patient 2, sparing the lung slightly increased the maximum spinal cord dose; nonetheless, the maximum spinal cord dose limit was respected in all adapted plans. In all cases, the RP2+ probability prediction drops, on average, from 95% (baseline) to 42% (adapted).

Table 3. Comparison of dosimetric parameters for various organs at risk obtained by solving both baseline and adapted models using different predictive models for the radiation pneumonitis constraint with 0.5 as the upper bound. GTV stands for gross tumor volume; PTV stands for planning target volume; D98/D2 are the minimum doses delivered to 98%/2% of the volume; V5/20/60 is the percentage of volume receiving at least 5/20/60 Gy; D mean and D max are the mean and max dose in Gy delivered to the volume; time is the computation time, in seconds, required to optimize the treatment plan.
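The dosimetric parameters reported throughout the tables (D mean, D max, Dq, Vt) are standard DVH summary statistics. As a sketch, using numpy on synthetic voxel doses rather than real plan data, they can be computed as:

```python
import numpy as np

def dvh_metrics(dose_gy):
    """Standard DVH summary metrics from a structure's voxel doses (Gy).

    Dq = minimum dose received by the hottest q% of the volume
         (i.e. the (100 - q)th percentile of the voxel doses).
    Vt = percentage of the volume receiving at least t Gy.
    """
    d = np.asarray(dose_gy, dtype=float)
    return {
        "D_mean": d.mean(),
        "D_max": d.max(),
        "D98": np.percentile(d, 100 - 98),  # near-minimum dose
        "D2": np.percentile(d, 100 - 2),    # near-maximum dose
        "V5": 100.0 * np.mean(d >= 5.0),
        "V20": 100.0 * np.mean(d >= 20.0),
        "V60": 100.0 * np.mean(d >= 60.0),
    }

# Example on synthetic lung voxel doses (illustrative only).
rng = np.random.default_rng(0)
lung = rng.gamma(shape=2.0, scale=6.0, size=10_000)
print({k: round(v, 2) for k, v in dvh_metrics(lung).items()})
```

Real treatment planning systems compute these from the dose grid restricted to each contoured structure; the formulas themselves are the same.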

Discussion
Despite the recent breakthroughs in the realm of ML-based outcome modeling in RT, there is currently no rigorous methodology for integrating these predictions within RT treatment planning to guide the treatment design. In this study, we set out to address this gap. To the best of our knowledge, our work is the first study to formalize a methodology that directly integrates ML-based outcome models within RT treatment planning. Our methodology is capable of handling the majority of popular ML-based outcome models proposed in the literature (Isaksson et al 2020), including decision trees, random forests, support vector machines (with linear kernels), neural networks, and general ensemble models. We show that the often complex structure of these models can be mathematically represented by a set of linear, mixed-integer constraints which can, in turn, be included in the conventional treatment plan optimization, thus bridging the gap between data-oriented outcome modeling and optimization-based treatment planning. Our methodology is capable of learning the radiation dose response directly from historical data (as long as the relationship can be captured through one of the proposed models), without the need to rely on oversimplified, a priori assumptions on the shape and nature of such a response.

Marks et al (2010) observed that a small reduction in the mean total lung dose had a relatively modest effect on the risk of RP, which might appear in conflict with our results, in which, for some cases, even a small reduction in mean lung dose led to a considerable RP risk reduction. However, we note two points. First, the reductions reported in Marks et al (2010) are averaged over a population encompassing a spectrum of patients; the reported effects therefore include individuals who experience no improvement in outcomes as well as those who are significantly affected, as indicated by the presence of large error bars in the data. Second, our predictive model (decision tree) is a classification-based model, meaning that for some patients whose predicted risk lies close to the cutoff point, even a small reduction in dose might lead to reclassification (changing from a positive risk group to a negative one). We note that this is an inherent property of these classification models and not a quirk of our methodology. Nevertheless, we attempted to demonstrate the high sensitivity of the mean total lung dose to variations in input data, as illustrated in figure 7. Whether such 'sensitive' models should be used to guide clinical dose-based decisions, though certainly a worthy question, lies beyond the scope of the current article.

Figure 6. Effect of the radiation pneumonitis constraint (upper bound) threshold (horizontal axis) on the optimal treatment plan (vertical axis) before (dashed line) and after (solid line) adaptation.

Figure 7. Impact of perturbed relevant features (horizontal axis) on the optimal treatment plan (vertical axis) while guaranteeing a radiation pneumonitis risk prediction smaller than 0.5. The black (dashed-dotted) line represents the original value of each feature, while the blue (solid) line depicts the planning target volume D98 value, and the red (dashed) line shows the mean total lung dose.
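As a toy illustration of how a decision tree's structure reduces to linear, mixed-integer constraints, the sketch below encodes a hypothetical depth-2 tree (thresholds and leaf risks are invented, not the trained tree of figure 3) using one binary leaf indicator per leaf and big-M constraints along each root-to-leaf path. Enumerating the one-hot indicator vectors recovers the tree's prediction, which is the mechanism a MILP solver exploits when the features depend on the dose decision variables:

```python
M, EPS = 1e3, 1e-6

# Hypothetical depth-2 tree: each leaf is a list of
# (feature_index, threshold, go_left) conditions on its path,
# plus a predicted risk probability.
LEAVES = [
    ([(0, 0.5, True),  (1, 0.3, True)],  0.10),
    ([(0, 0.5, True),  (1, 0.3, False)], 0.40),
    ([(0, 0.5, False), (1, 0.7, True)],  0.60),
    ([(0, 0.5, False), (1, 0.7, False)], 0.95),
]

def bigM_feasible(x, z):
    """Check the big-M linear constraints for a leaf-indicator vector z."""
    if sum(z) != 1:  # exactly one leaf must be active
        return False
    for (path, _), z_l in zip(LEAVES, z):
        for j, t, go_left in path:
            slack = M * (1 - z_l)  # constraints relax when leaf inactive
            if go_left and not (x[j] <= t + slack):
                return False
            if not go_left and not (x[j] >= t + EPS - slack):
                return False
    return True

def predict_via_milp_encoding(x):
    """The unique feasible one-hot z reproduces the tree's prediction."""
    for i in range(len(LEAVES)):
        z = [int(k == i) for k in range(len(LEAVES))]
        if bigM_feasible(x, z):
            return LEAVES[i][1]
    raise RuntimeError("no feasible leaf")

print(predict_via_milp_encoding([0.2, 0.9]))
```

In the actual optimization, the constraint "predicted risk ≤ τ" becomes a linear inequality over the leaf indicators (only leaves with risk ≤ τ may be active), and a MILP solver chooses z jointly with the beamlet intensities rather than by enumeration.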
Our streamlined approach for directly embedding ML models within treatment plan optimization reduces the sources of uncertainty to a single model, as opposed to the two models used in Ajdari et al (2022), namely the NTCP model and the predictive model used to learn the NTCP parameters. This reduction in complexity and reliance on surrogate models enhances the transparency and robustness of our methodology, making it a valid alternative for personalized treatment planning. Furthermore, it is crucial to recognize the potential extension of this study to different endpoints. The framework is designed to be site- and endpoint-agnostic, allowing its application to any endpoint of interest, as long as a predictive (outcome) model is known for that endpoint or can otherwise be learned from available datasets.
A side-advantage of our current approach (learning mathematical representations of ML-based outcome models) is that it adds a level of transparency to the more 'black-box' ML models. This, in turn, can enhance the users' ability to detect 'nonsensical' decision rules that might have been learned by the models due to, among other factors, model overfitting, insufficient or noisy data, or misrepresentation or bias in the training dataset (Cho 2021, Belenguer 2022, Tasci et al 2022). This is a crucial step, as prediction uncertainty and the lack of physician trust in ML-based outcome models are among the most important hurdles facing the clinical implementation of outcome-based treatment planning.
Despite its advantages, the limitations of our study must be acknowledged. First, the predictive performance of our tested ML models was not very high (AUC = 0.66), most likely due to the small size of our training dataset (n = 73) and the quality of the features at our disposal; this limited performance adds significant uncertainty to the models' predictions and to the resulting risk-adapted plans. Prior studies, such as that conducted by Scott et al (2021), have shown that integrating additional features, such as genomic information, can enhance predictive performance. Furthermore, research by Palma et al (2019) showed a connection between radiation-induced pneumonitis and the dose to the lower lung, suggesting that predictive models could benefit from the inclusion of spatial information in the dataset. However, our primary goal here was to provide a proof-of-concept for the integration of ML-based outcome models into RT treatment planning. Our approach is cancer- and endpoint-agnostic, meaning that it can easily be applied to other, more powerful outcome models. Second, the integration of complex predictive models, such as neural networks and tree ensembles, into the treatment planning process increases the computational complexity of the optimization problem, making it difficult to solve in a reasonable timeframe for large optimization models with numerous constraints and (binary) decision variables. Moreover, the use of learned constraints for specific types of toxicities may impact the optimal treatment plan, potentially at the expense of other organs that are not considered in the optimization model. In our feasibility study, one patient showed an increased dose to the spinal cord in order to meet the RP2+ threshold. In cases where specific organs might be more susceptible to increased dose delivery, e.g. the spinal cord, additional (learned) constraints can be included in the optimization model.

Conclusion
In this study, we have illustrated the use of the OCL framework to tailor radiotherapy treatment plans based on individually predicted risks of side effects. Our methodology is showcased through its application to non-small cell lung cancer patients, where we employ various predictive models to learn the risk of radiation-induced pneumonitis. The learned function is then incorporated as a constraint within the treatment plan optimization model. This work can be seen as a proof-of-concept for the future development of personalized treatment planning models. The objective of our work is to establish a bridge between prediction and prescription by proposing a common framework for both the ML and optimization communities within the domain of RT. By effectively integrating predictive insights into the optimization process, our contribution has the potential to replace the current one-size-fits-all methodologies in the RT field.
2.1.1. Conceptual model

Let n represent the dimension of the beamlet intensity vector x, p denote the number of non-decision-variable features w, and k indicate the dimension of the outcome y. The dataset D contains historical solution information and has dimension N × (n′ + p + k), where N represents the number of data points and n′ stands for the number of derived features obtained using the function d: ℝ^n → ℝ^{n′}. Some examples of derived features include the mean dose, max dose, and V20Gy. With the decision variable x ∈ ℝ^n and the fixed feature vector w ∈ ℝ^p, a TPO conceptual model can be defined as

    minimize_x  f(d(x))
    subject to  g_i(d(x)) ≤ 0,    i = 1, ..., m,        (1)
                ĥ(d(x), w) ≤ τ,

where f(·) is a dose fidelity function and g_i(·) represents the conventional dosimetric constraints (mean-dose, max-dose, and dose-volume histogram constraints) and/or known biological constraints (e.g. equivalent uniform dose, biologically effective dose, tumor control probability). The learned outcome is represented by the function ĥ(d(x), w), which is constrained to stay below the risk threshold τ.
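A minimal numerical instance of model (1), with made-up numbers (two beamlets, a 2×2 dose-influence matrix, a quadratic dose-fidelity objective, and a linear stand-in for the learned constraint ĥ) and scipy's SLSQP as the solver, might look like:

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of model (1): d(x) = D @ x maps beamlet intensities to
# (tumor dose, lung dose); all numbers are invented for illustration.
D = np.array([[1.0, 0.2],    # tumor dose per unit beamlet intensity
              [0.3, 0.8]])   # lung dose per unit beamlet intensity
d_target = np.array([60.0, 0.0])  # prescribed tumor dose, zero lung dose
tau = 0.5                          # risk threshold in the learned constraint
risk_coef = np.array([0.0, 0.01])  # stand-in for hat-h: risk = 0.01 * lung dose

def f(x):  # dose-fidelity objective: squared deviation from the target dose
    return np.sum((D @ x - d_target) ** 2)

constraints = [
    {"type": "ineq", "fun": lambda x: 20.0 - (D @ x)[1]},        # g: lung dose
    {"type": "ineq", "fun": lambda x: tau - risk_coef @ (D @ x)} # learned h-hat
]
res = minimize(f, x0=[10.0, 10.0], method="SLSQP",
               bounds=[(0.0, None)] * 2, constraints=constraints)
print(np.round(res.x, 2), round((D @ res.x)[1], 2))
```

In the full model, ĥ is the mixed-integer representation of a trained ML model rather than a linear function, so the problem is solved with a MILP solver instead of SLSQP; the structure of objective, conventional constraints, and learned constraint is the same.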

Figure 1. Schematic representation of an optimization with constraint learning framework. The flowchart is based on Fajemisin et al (2023).

Figure 2. Decision tree of depth two.

Figure 3. Classification tree trained on the radiation pneumonitis dataset. The three split nodes are represented as histograms with the frequency on the y-axis and the following features: total GTV measured in cubic centimeters, mean total lung dose measured in Gy, and FEV1.Pre.BD. Every split node is defined by a threshold, symbolized by the black triangle positioned on each x-axis of the histograms. The histograms are constructed using the training samples that reach the respective split node. GTV: gross tumor volume; D_mean_TotalLung: mean dose delivered to the total lung; FEV1.Pre.BD: forced expiratory volume in the first second before bronchodilator administration.

Figure 4. Side-by-side comparison of a fitted decision tree illustrating different paths leading to the prevention of radiation pneumonitis development with a probability greater than 0.5 for three distinct patients.

Figure 5. Comparison of dose-volume histograms between baseline and OCL-adapted plans using logistic regression, random forest, and neural network approaches to capture the radiation pneumonitis risk constraint.

Figure 8. Baseline versus optimization with constraint learning (OCL) adapted plans: (a) dose-volume histograms for the baseline (solid) and the adapted (dashed) plans; (b) CT images overlaid with the dose distribution for (left) the baseline plan, (mid) the adapted plan, and (right) their difference. The contours for the planning target volume, total lung, and spinal cord are colored in white, green, and yellow, respectively.

Figure 9. Baseline versus optimization with constraint learning (OCL) adapted plans: (a) dose-volume histograms for the baseline (solid) and the adapted (dashed) plans; (b) CT images overlaid with the dose distribution for (left) the baseline plan, (mid) the adapted plan, and (right) their difference. The contours for the planning target volume, total lung, and spinal cord are colored in white, green, and yellow, respectively.

Table 1. Pros and cons of predictive models commonly used in optimization with constraint learning.

Table 2. Supervised machine learning models for predicting radiation-induced toxicity (adapted from Isaksson et al 2020).
Abbreviations: LDA, linear discriminant analysis; LR, logistic regression; MV-LR, multivariate logistic regression; NN, neural network; RF, random forest; SVM, support vector machine; Symbols: : linearly representable and considered in the methodology; : linearly representable but not considered in the methodology; : not linearly representable and not considered in the methodology.

Table 4. Comparison of dosimetric parameters for various organs at risk obtained by solving both baseline and adapted models for four different patients. GTV stands for gross tumor volume; PTV stands for planning target volume; D98/D2 are the minimum doses in Gy delivered to 98%/2% of the volume; V5/20/60 is the percentage of volume receiving at least 5/20/60 Gy; D mean and D max are the mean and max dose delivered to the volume; time is the computation time, in seconds, required to optimize the treatment plan.