Machine learning-based predictions of gamma passing rates for virtual specific-plan verification based on modulation maps, monitor unit profiles, and composite dose images

Machine learning (ML) methods have been implemented in radiotherapy to aid virtual specific-plan verification protocols, predicting gamma passing rates (GPR) based on calculated modulation complexity metrics because of their direct relation to dose deliverability. Nevertheless, these metrics might not comprehensively represent the modulation complexity, and automatically extracted features from alternative predictors associated with modulation complexity are needed. For this reason, three convolutional neural network (CNN)-based models were trained to predict GPR values (regression and classification) using, respectively, three predictors: (1) the modulation maps (MM) from the multi-leaf collimator, (2) the relative monitor units per control point profile (MUcp), and (3) the composite dose image (CDI) used for portal dosimetry, from 1024 anonymized prostate plans. The models’ performance was assessed for classification and regression by the area under the receiver operating characteristic curve (AUC_ROC) and Spearman’s correlation coefficient (r). Finally, four hybrid models were designed using all possible combinations of the three predictors. The prediction performances for the CNN models using single predictors (MM, MUcp, and CDI) were AUC_ROC = 0.84 ± 0.03, 0.77 ± 0.07, 0.75 ± 0.04, and r = 0.6, 0.5, 0.7. In contrast, the hybrid models (MM + MUcp, MM + CDI, MUcp + CDI, MM + MUcp + CDI) reached AUC_ROC = 0.94 ± 0.03, 0.85 ± 0.06, 0.89 ± 0.06, 0.91 ± 0.03, and r = 0.7, 0.5, 0.6, 0.7. The MM, MUcp, and CDI are suitable predictors for dose deliverability models implementing ML methods. Additionally, hybrid models including two or more input predictors can improve the prediction performance.


Introduction
Artificial intelligence (AI) methods have been applied in radiotherapy, supporting the contouring of the target and organs at risk volumes (Lustberg et al). Certainly, in the past six years, machine learning (ML) methods dedicated to quality assurance (QA) predictions of intensity-modulated radiotherapy (IMRT) and volumetric modulated arc therapy (VMAT) treatments have increasingly been studied (Hussein et al 2017, Chan et al 2020a, Osman and Maalej 2021). The most common ML models implemented in this matter are Poisson regression (Valdes et al 2016, Li et al 2019), decision tree-based models (e.g. random forest or gradient boosting models) (Lam et al 2019, Hirashima et al 2020), support vector machines (Valdes et al 2016, Granville et al 2019), and artificial neural networks or convolutional neural networks (CNN) (Interian et al 2018, Tomori et al 2018). The CNN-based models, which have been less explored in QA predictions, are commonly characterized by convolution plus pooling layers arranged consecutively, ending with fully connected layers and a Softmax-activated dense layer for classification or a Linear-activated dense layer for regression (Payer et al 2016). The convolution operations are intended to detect patterns from the input images, using specific filters and reducing their dimensions. Then, these newly detected features are processed by the pooling layers, weighting the found features and their nearby values to be the input of the next convolution-pooling layer arrangement, filtering intricate 'hidden' features that will potentially be associated with the predicted output.
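As a concrete illustration of these two operations, the following minimal, dependency-free Python sketch (not the TensorFlow implementation used in this study) shows a 'valid' 2D convolution followed by non-overlapping max pooling:

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D convolution (cross-correlation): slide the kernel over
    the image and sum the elementwise products, shrinking the output."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def maxpool2d(feature_map, size=2):
    """Non-overlapping max pooling: keep the strongest response in each
    size x size window, reducing the feature-map dimensions."""
    out = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size)))
        out.append(row)
    return out
```

Stacking several such convolution-pooling pairs, followed by fully connected layers, yields the architecture pattern described above.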
From the specific-plan verification perspective, models dedicated to QA prediction were generally implemented to detect potential treatment errors (Ezzell et al 2009, Miften et al 2018) and predict gamma passing rate (GPR) values (Low et al 1998). The GPRs account for the dosimetric regions in agreement with the gamma index analysis between the calculated and the measured dose distributions (Low et al 1998, Hussein et al 2013). In turn, the gamma index is a metric that evaluates the coincidence between both dose distributions, calculating the dose difference (DD) and the distance to agreement (DTA) (Hussein et al 2013). Commonly, a verified treatment is suitable for delivery if the GPR is higher than a reference value, selecting the DD/DTA criteria defined in each institution and per the expert recommendations (Miften et al 2018). For instance, a specific treatment might be considered appropriate if its GPR is equal to or higher than 98% based on 3%/2 mm criteria. Nevertheless, although this metric has been studied and implemented widely, some gaps have been identified in detecting errors with clinical impact or retrieving information needed to detect specific discrepancies regarding treatment parameters (Zhen et al 2011, Hussein et al 2017, Park et al 2018). Hence, the GPR evaluation and the modelled predictions should be considered complementary tests to other assessment protocols (e.g. dose-volume histogram change evaluation) rather than one exclusive verification method.
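How DD and DTA combine into a single index can be sketched in one dimension. The simplified, exhaustive-search version below uses global dose normalization; it is an illustrative sketch, not the clinical gamma implementation used in this study:

```python
import math

def gamma_1d(ref_dose, eval_dose, spacing_mm, dd_pct=2.0, dta_mm=1.0):
    """Simplified global 1D gamma index (after Low et al 1998).

    ref_dose, eval_dose: dose profiles sampled on the same grid.
    spacing_mm: distance between adjacent samples.
    dd_pct: dose-difference criterion, % of the maximum reference dose.
    dta_mm: distance-to-agreement criterion in mm.
    Returns one gamma value per reference point.
    """
    dd_abs = dd_pct / 100.0 * max(ref_dose)  # global normalization
    gammas = []
    for i, d_ref in enumerate(ref_dose):
        best = float("inf")
        for j, d_eval in enumerate(eval_dose):
            dist = (i - j) * spacing_mm
            gamma_sq = (dist / dta_mm) ** 2 + ((d_eval - d_ref) / dd_abs) ** 2
            best = min(best, gamma_sq)
        gammas.append(math.sqrt(best))
    return gammas

def passing_rate(gammas):
    """GPR: percentage of points with gamma <= 1."""
    return 100.0 * sum(g <= 1.0 for g in gammas) / len(gammas)
```

For identical profiles every gamma value is zero, so the passing rate is 100%; measurement noise and delivery discrepancies push individual gamma values above 1 and lower the GPR.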
Consequently, a useful GPR prediction model based on ML methods should be able to provide additional information to complement and explain the expected dose deliverability evaluation results, featuring the predominant predictors and achieving a more robust evaluation of the treatment parameters. Similarly, it might be beneficial to track possible 'problematic' treatment features, as suggested by previous works (Park et al 2018, Chiavassa et al 2019), using modulation complexity metrics and plan parameters. However, the reported models using automatic feature extraction methods (e.g. CNN-based models) are based mainly on dose distributions (Osman and Maalej 2021), and predictor features associated with the plan parameters cannot be extracted. In contrast, other input features, such as modulation maps (MM) given by the multi-leaf collimator (MLC) trajectories per control point (CP), gantry speed variations, or monitor unit (MU) variation profiles, have not been explored, and they might help to complement the dose deliverability evaluation because of their direct relation to specific treatment conditions.
In terms of the studied features for GPR predictions using ML models, classification or regression solutions have been proposed based on IMRT beam fluences (Interian et al 2018, Hirashima et al 2020), planar dose images plus organs at risk volumes and total MU values (Tomori et al 2018), radiomic features from the dose distribution images (Nyflot et al 2018, Hirashima et al 2020), and various calculated modulation complexity metrics (Valdes et al 2016, Ono et al 2019, Chan et al 2020a). In fact, benefits on prediction performance have been reported when more than one input feature category is implemented (i.e. hybrid datasets or hybrid models). Considering the above, this study aims to explore features directly related to treatment unit parameters to predict GPR values based on CNN models, contributing to the inclusion and evaluation of additional treatment parameters that might facilitate the design of more robust dose deliverability evaluation protocols. For this reason, the primary objective of this study was to evaluate the potential utility of MM and MUcp as input features for GPR predictions. Consequently, since our GPR values were calculated using electronic portal imaging device (EPID) measurements, we decided to include the calculated composite dose image (CDI) as a third evaluated input feature (i.e. a dosimetric input feature). The second objective was to verify whether concatenated models presented an improved GPR prediction performance or not. Furthermore, we aimed to evaluate the model stability in terms of the quality of the learned features extracted by each model.

Workflow
The five-step workflow followed in this study is illustrated in figure 1. (I) From 1024 DICOM-RT files, the MM, MUcp profiles, and CDI were retrieved and classified to form three specific datasets representing each feature category. (II) An independent CNN model was designed for each input dataset to predict GPRs (classification and regression). The architecture optimization, the hyper-parameter tuning, and stability tests were performed with TensorFlow (Dillon et al 2017). (III) In addition, four hybrid models based on all possible combinations of the previous models were proposed to verify whether the GPR prediction improves when concatenating two or more models. (IV) Next, the ROC_AUC and the accuracy were calculated to evaluate the prediction performance of classification models, and the MAE, RMSE, and Spearman correlation coefficients were calculated for regression models. (V) Finally, the activation maps for randomly chosen plans were extracted to verify the relevance of the trained features.

Dataset
A total of 1024 anonymized DICOM-RT files from 746 prostate plans, retrospectively treated in our institution, were retrieved to extract the MM, the MUcp, and the CDI features by Python scripting (Quintero 2020). The treatments were planned with Eclipse version 15.6 (Varian Medical Systems, Palo Alto, CA), a 2 degrees per CP configuration, and 6 MV beam energy in two Varian treatment units (TrueBeam and Halcyon-v2) available in our institution with the same EPID model (aS1200), calibrated under the same reference conditions. Both treatment units have 5 mm of nominal resolution at the isocentre, with Millennium 120 MLC (TrueBeam) and dual-layer MLC (Halcyon-v2) models and maximum leaf speeds of 25 mm s⁻¹ and 50 mm s⁻¹, respectively. Furthermore, the dataset was divided into 80% for the training and validation sub-datasets (split 80%/20% in turn, N = 819) and 20% for the testing sub-dataset (N = 205), as illustrated in figure 2. The treatment plan conditions are summarised in table 1.
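The nested split described above can be reproduced with a short sketch; the helper name and the fixed seed are illustrative, not the study's actual code:

```python
import random

def split_dataset(plan_ids, test_frac=0.20, val_frac=0.20, seed=42):
    """Hold out 20% of the plans for testing, then split the remaining
    80% again 80/20 into training and validation sub-datasets."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    ids = list(plan_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_frac)
    test = ids[:n_test]
    remainder = ids[n_test:]
    n_val = round(len(remainder) * val_frac)
    # train, validation, test
    return remainder[n_val:], remainder[:n_val], test

train, val, test = split_dataset(range(1024))
# 1024 plans -> 205 test; 819 remainder -> 655 train / 164 validation
```

These counts match the sub-dataset sizes reported above (N = 819 for training/validation and N = 205 for testing).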
The GPRs were calculated from gamma analysis evaluation (Low et al 1998) based on EPID measurements and a global 2% dose and 1 mm distance difference criterion (2%/1 mm). For classification models, the VMAT dose distributions with a GPR ≥ 98% were labelled as 'pass' (N = 49%); otherwise, they were labelled as 'fail' (N = 51%). This 2%/1 mm criterion was chosen considering both treatment units and an evaluation threshold able to discriminate potential errors that might affect the planned dose distributions, in accordance with the AAPM TG-218 recommendations (Miften et al 2018). However, this value also promoted the best-balanced conditions in GPR terms when the datasets were divided into sub-datasets (figure 2(b)), avoiding unreliable classification modelling and overfitting effects (Chen et al 2020). As registered in the supplementary material 1.1, most measured plans evaluated with 3%/3 mm, 3%/2 mm, 2%/3 mm and 2%/2 mm criteria presented GPR values of 100%, generating highly unbalanced datasets.

Input features
• The MM input feature from a single VMAT arc is a two-dimensional image created with all MLC positions per CP (figure 3(a)). The leaf number indicated on the y-axis includes both MLC banks (four in the case of Halcyon-v2), and the displacements were normalized to take values from zero to one. Additionally, to optimize the model's 'learning process', the static leaves were removed, keeping just the leaves active during the treatment (figure 3(b)).
• The MUcp is one-dimensional data containing all MU contributions per CP during one VMAT-arc trajectory, normalized from zero to one based on the total MU value (figure 3(c)). It is extracted from the dose contribution coefficients within the DICOM-RT tag [300A,010C], labelled CumulativeDoseReferenceCoefficient.
• The CDI is a two-dimensional image created with the superposition of all calculated dose fluences during the VMAT-arc trajectory over a common plane perpendicular to the gantry. It is calculated by the portal dosimetry image prediction algorithm (Berger et al 2006, Esch et al 2013) integrated into Eclipse (figure 3(d)) and is compared to the dose measured by the EPID to perform the gamma analysis. For modelling purposes, the CDIs were normalized from zero to one.
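The derivation of the MUcp profile from the cumulative coefficients can be sketched as follows; this minimal illustration assumes the coefficient list has already been read from the DICOM-RT file (e.g. with pydicom), and the function name is illustrative:

```python
def mucp_profile(cumulative_coeffs):
    """Relative MU contribution per control point, obtained by
    differencing consecutive CumulativeDoseReferenceCoefficient values
    (DICOM tag (300A,010C)). The result already lies in [0, 1] because
    the cumulative coefficients are fractions of the total arc MU."""
    profile = [cumulative_coeffs[0]]  # first CP delivers no prior MU
    for prev, curr in zip(cumulative_coeffs, cumulative_coeffs[1:]):
        profile.append(curr - prev)
    return profile
```

Since the coefficients increase monotonically from 0.0 to 1.0 over the arc, the differenced profile sums to one, giving the normalized one-dimensional input described above.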

Models
The designed models for the MM, MUcp, and CDI features were noted as M_1, M_2, and M_3, respectively. An r or c character was included at the end of the notation to differentiate between regression and classification models (e.g. M_1r for regression and M_1c for classification). Additionally, four hybrid models were created from the three main previous models and were noted as M_12, M_13, M_23, and M_123, indicating the included concatenated models with their indexed notation. Furthermore, five-fold cross-validation was applied, and 'horizontal flip' was the only data augmentation explored in this study, to ensure that all input features keep an accurate physical representation during training. Accordingly, all models implemented in this study were based on CNN architectures and were designed using the most straightforward possible architectures, establishing the minimum optimal number of CNN-Maxpool layers and filters for each type of input category. This direction might help to control overfitting events, track specific features from each input to increase the model reliability, and reduce predictions dominated by random features with no physical context (Chauhan et al 2018, Chen et al 2020, Kimura et al 2020).
After the models were designed and optimized, the three main models, M_1c, M_2c, and M_3c, were modified by including drop-out layers after each convolution/max-pooling layer arrangement to evaluate their performance stability as the drop-out rate increases systematically. This test is proposed to verify the minimum number of nodes needed to extract features that correlate with GPRs and, simultaneously, to evaluate the contribution of the randomly extracted features created by the convolutions.

Evaluation
The prediction performance of the regression models was evaluated by measuring the mean absolute error (MAE), the root mean squared error (RMSE), and Spearman's correlation coefficient (r) between the measured and the predicted GPR values. Low, moderate, and high correlations were defined as r < 0.4, 0.4 ≤ r ≤ 0.7, and r > 0.7, respectively. Furthermore, the classification model performance was assessed by calculating the area under the receiver operating characteristic curve (ROC_AUC), accuracy, specificity, and sensitivity (table 2).
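The regression metrics can be computed with a short, dependency-free sketch; the Spearman version below ranks without tie correction (in practice a library routine such as scipy.stats.spearmanr would be used):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between measured and predicted GPRs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large prediction errors."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def spearman_r(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the ranks
    (simplified, without tie correction)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(y_true), ranks(y_pred)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because Spearman's r depends only on ranks, it captures whether the model orders plans from low to high GPR correctly, even if the predicted values themselves are biased.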

Activation maps
The activation maps of six plans from the testing datasets were generated to verify whether the trained features correspond to regions of interest associated with dose deliverability (e.g. demanding hardware conditions) that might help in further decision support tool implementations. Three cases were randomly selected from the correctly classified plans labelled as 'pass', and three from those correctly labelled as 'fail'.

Model architecture
The M_1, M_2, and M_3 models were designed independently using the HParams tool in TensorBoard, optimizing for each model the number of layers, number of filters, kernel size, drop-out rate, and activation functions. A brief representation of the resulting models' architectures is displayed in figure 4, and a detailed description is available in the supplementary material 1.2.

Architecture stability
The results of the model stability test are represented in figure 5. The models M_1, M_2, and M_3 remained stable with up to 50% of the nodes of each convolution layer deactivated (drop-out rate of 0.5), indicating that the remaining extracted features are still sufficient for GPR predictions. These results are consistent with the original models' performances; however, it is clear that M_2 is more susceptible to accuracy reduction than M_1, which represents a more robust prediction based on the remaining features.

Modelling performance
The classification and regression performances of all models are summarised in table 3 and in figures 6 and 7. The activation maps for the correctly classified 'passing' and 'failing' plans are summarised in figures 8 and 9, respectively. For the MM, passing plans activated static leaf regions, while failing plans highlighted specific regions associated with demanding variations of leaf positions. For the MUcp profiles, no distinctive regions were detected. Finally, for the CDI, the high-dose regions were identified in both failing and passing scenarios. The information on the other plans is available in supplementary material 1.3.

Discussion
Our study investigated the suitability of the MM, MUcp, and CDI for GPR predictions implementing ML models. We used these three input features to explore new treatment-plan information apart from the already studied dose distributions and reported complexity metrics. Hence, we intended to predict GPRs based on practical physical aspects involved in the treatment delivery, avoiding calculating limited complexity metrics from empirical equations. Furthermore, we also evaluated the CDI as an additional predictor feature because the GPR values in this study were calculated from EPID measurements, and these dose images might contain information associated with demanding linac conditions (Agnew et al 2014, Miri et al 2016, Lam et al 2019). In addition to this exploratory study, we also evaluated and confirmed the potential benefit of including more than one kind of treatment feature within the GPR prediction process (figure 6). Certainly, we believe that a GPR prediction model should consider all possible physical aspects involved in the treatment simultaneously, whether dosimetric or mechanical features, to achieve a more robust performance based on all variables that intervene in each treatment plan delivery. Considering the above, the goal of this study was not to propose the most efficient and complex CNN-based models but to (1) implement straightforward architecture models to evaluate the potential utility of the MM, MUcp, and CDI features in GPR predictions, (2) verify whether concatenated models increase the GPR prediction performance, and (3) assess the quality of the learned features extracted by each model in GPR predictions.
This study is the first reported evaluation of the MM, MUcp, and CDI as potential GPR predictors using ML methods (Chan et al 2020a, Osman and Maalej 2021). Previous works have implemented regression models based on modulation complexity metrics and dosimetric parameters, reporting mean prediction errors between 2.2% and 4.5%, and our ROC_AUC values presented comparable results for all models (table 3), demonstrating the potential benefits of these features for GPR prediction. Indeed, for classification, the models designed in this study demonstrated outstanding performance, with similar or higher ROC_AUC values than the reported studies. However, while many published models did not report the model performance on validation tests (Chan et al 2020a, Osman and Maalej 2021), the results obtained in this study using the validation dataset are also comparable (ROC_AUC values of 0.84 ± 0.01, 0.77 ± 0.05, and 0.75 ± 0.03 for M_1, M_2, and M_3, respectively). These results demonstrate the present models' suitability, since the validation results are one of the main approaches to verify the model generalization and the overfitting level; consequently, it is usual that these values are lower than those obtained with the training-testing dataset.
Following the already reported works (Tomori et al 2018, Hirashima et al 2020) and the discussion regarding model evaluation, we also confirm the improving effects of concatenating models using more than one feature category, especially from the validation dataset point of view, with the combination of MM and CDI in model M_13 reaching a ROC_AUC value of 0.91 ± 0.02 (figures 6, 7). However, the general improvement effects of concatenated models are still a field not completely explored and should be evaluated independently in each case because of the different origins and dimensions of the predictor features (Shin et al 2016, Li et al 2017). Furthermore, although the benefits of concatenating various multi-scale features have been reported, even in radiotherapy (Hirashima et al 2020), concatenating too many features might compromise the model's performance and training (Li et al 2017). However, using concatenated models and controlling the different types of inputs might represent a technical advantage in mitigating premature or suboptimal gradient optimization (Tomori et al 2018), plus the benefit of implementing additional treatment plan features that describe treatment plan parameters related to dose deliverability during the same control points.
From the dataset conformation point of view, it is important to notice that the GPR and modulation metric ranges are susceptible to change between treatment units and anatomic regions (Wall and Fontenot 2020, Jin et al 2015). Thus, the previously reported models trained with their respective datasets (having a heterogeneous number of anatomic regions, beam energies, treatment units, and unbalanced GPR values) might potentially experience low data generalization and overfitting events (Payer et al 2016, Chen et al 2020), leading to suboptimal predictions. Therefore, our datasets were designed using treatment plans for one single pathology (prostate), planned for two different treatment units (46.6% TB and 53.4% Halcyon, table 1), and ensuring that the passing and failing plans contribute equally to the dataset. Furthermore, with this dataset design and by adopting the most straightforward CNN architectures, we intended that the features extracted by the CNNs correspond mainly to specific treatment conditions and, in turn, be able to associate physical or mechanical aspects with the final prediction. Consequently, we only explored horizontal flip for data augmentation. This rationale, from a practical point of view, might produce more robust models, since the predicting process is highly focused on features with a real physical meaning and does not rely completely on random weighted feature extractions. Eventually (with further studies), tools like activation maps (Payer et al 2016) might be used to narrow down specific treatment moments susceptible to contributing to a 'fail' or lower GPR prediction, or to assist onboard adaptive therapy strategies. Accordingly, similar insights will be beneficial to develop ML solutions from a closer medical physics perspective, contemplating potential strategies to evaluate the reliability and consistency of in-house or commercial models dedicated to dose deliverability predictions.
In this study, we proposed to evaluate the architecture model stability and the relevance of the 'learned' (extracted) features in the prediction performance by systematically increasing the drop-out rate after each CNN layer (figure 5). With this method, we implicitly estimated for each model (1) the minimum proportion of active nodes (i.e. remaining features) needed to maintain comparable prediction performance and, subsequently, (2) the potential random features extracted by the model that do not necessarily contribute to the prediction.
From the model interpretability point of view, the reported CNN-based models dedicated to GPR predictions (Tomori et al 2018) do not offer straightforward ways to retrieve or identify the features associated with the predictions (Feng et al 2018, Chan et al 2020b), limiting the understanding and evaluation of the model quality because they were developed using dose distribution regions as predictors (Osman and Maalej 2021). These inputs do not provide enough explanatory parameters for plan deliverability analysis; hence, ML models considering high-dimensional treatment parameters are also needed to exploit the utility of retrieving the activation maps, pinpointing specific hardware or dosimetric aspects that might influence the dose deliverability at a particular treatment moment (i.e. control point). Accordingly, and considering the mentioned utility of activation maps, figures 8 and 9 are a clear representation of the retrieved plan information associated with the prediction. However, although the failing and passing activation maps localized distinctive regions, mainly for the MMs, further studies are needed to verify that these highlighted changes in MLC position represent actual demanding hardware scenarios that might compromise the dose deliverability. Furthermore, this information might potentially support the setting of hardware tolerance limits for MLC trajectories or the configuration of TPS tools associated with the MLC sequencing algorithms (Varian Medical Systems 2018).
The GPR evaluation is widely used as a deliverability metric and is one of the worldwide standard tests for specific treatment verification (Miften et al 2018). However, it has been thoroughly questioned because of its arguable sensitivity to reflect or discriminate plan errors with potential clinical implications (Hussein et al 2017). Nevertheless, this study, rather than predicting just one metric, shows the promising opportunity to explore more treatment-associated parameters that can be part of an integral method of dose deliverability evaluation. We consider that this evaluation does not have to be enclosed by one single metric; hence, ML-based models in this matter will have to explore how to include new treatment parameters to predict relevant features contributing to a multiple-factor analysis to decide whether the deliverability of a specific plan is acceptable or not. Additionally, we note that ML-based applications within treatment verification protocols are not intended to replace the quality assurance evaluation. Instead, ML models are recommended as part of decision-making tools to ease the evaluation workflow and reduce the number of dose measurements from suboptimal plans.
We acknowledge that this study was performed with limitations also identified in previously reported works. First, the dataset size is a fundamental factor related to ML model performance, especially for CNN-based models (Tomori et al 2018). However, considering that our dataset size is similar to or larger than others reported, our principal aim was to explore the suitability of three treatment features, and our results were consistent, encouraging further investigations. Similarly, we acknowledge that the extracted datasets were based on treatment plan information from one institution, and external verifications will be necessary to perform further validations. Finally, we acknowledge that further studies are necessary to explore and evaluate the effects of including the intrinsic uncertainty of the dose detectors, the dose calculation, and mainly the uncertainty from the model itself (el Naqa and Murphy 2015, Avanzo et al 2020, el Naqa and Das 2020). We consider that including different sources of uncertainty in ML algorithm design is an essential field to be explored, which might increase the model's robustness and reliability, mainly if it is intended to be implemented in practice.
In summary, with this research, we aimed to address three main gaps within the ML models predicting dose deliverability using CNN-based models: first, the implementation of new treatment features, especially those with potential physical factors traceable by the activation maps; second, the use of multiple feature inputs to increase the prediction performance; and finally, the opening of the discussion about how to develop and understand ML applications in radiotherapy that might help to design new strategies to evaluate dose deliverability.

Conclusions
The MM, MUcp, and CDI are convenient features for dose deliverability predictive models implementing ML methods. Additionally, hybrid models including two or more input features can improve the prediction performance compared to models with single features. Besides, decision-making strategies based on ML models might help to support new methodologies to evaluate dose deliverability within patient-specific treatment verification protocols.