A 3D transfer learning approach for identifying multiple simultaneous errors during radiotherapy

Objective. Deep learning models, such as convolutional neural networks (CNNs), can take full dose comparison images as input and have shown promising results for error identification during treatment. Clinically, complex scenarios should be considered, with the risk of multiple anatomical and/or mechanical errors occurring simultaneously during treatment. The purpose of this study was to evaluate the capability of CNN-based error identification in this more complex scenario. Approach. For 40 lung cancer patients, clinically realistic ranges of combinations of various treatment errors within treatment plans and/or computed tomography (CT) images were simulated. Modified CT images and treatment plans were used to predict 2580 3D dose distributions, which were compared to dose distributions without errors using various gamma analysis criteria and relative dose difference as dose comparison methods. A 3D CNN capable of multi-label classification was trained to identify treatment errors at two classification levels, using dose comparison volumes as input: Level 1 (main error type, e.g. anatomical change, mechanical error) and Level 2 (error subtype, e.g. tumor regression, patient rotation). For training the CNNs, a transfer learning approach was employed. An ensemble model was also evaluated, which consisted of three separate CNNs each taking a region of interest of the dose comparison volume as input. Model performance was evaluated by calculating sample F1-scores for training and validation sets. Main results. The model had high F1-scores for Level 1 classification, but performance for Level 2 was lower, and overfitting became more apparent. Using relative dose difference instead of gamma volumes as input improved performance for Level 2 classification, whereas using an ensemble model additionally reduced overfitting. The models obtained F1-scores of 0.86 and 0.62 on an independent test set for Level 1 and Level 2, respectively. Significance. This study shows that it is possible to identify multiple errors occurring simultaneously in 3D dose verification data.


Introduction
Dose-guided radiotherapy (DGRT) is a strategy for quality assurance and adaptive radiotherapy (RT) that assesses whether the delivered dose corresponds to the planned dose during radiotherapy treatment. The delivered dose can be mapped by capturing the exit radiation on an electronic portal imaging device (EPID) behind the patient and converting this measurement to a 2D portal dose image (PDI) or 3D dose distribution in the patient's computed tomography (CT) or cone beam CT (CBCT) image. The planned dose distributions can be calculated using knowledge about the patient's anatomy (i.e. a CT image), the treatment plan, and prior knowledge of the EPID response (van Elmpt et al 2005, van Elmpt et al 2008, Podesta et al 2014).
Gamma analysis is a commonly used tool to compare the delivered dose with the planned dose using point-by-point analysis, implementing dose difference (DD) and distance-to-agreement (DTA) criteria that allow for small discrepancies and shifts during evaluation (Low and Dempsey 2003, Low et al 1998, 2013). Thresholding methods are the current clinical standard for error detection when evaluating gamma analysis results. Several studies have used this methodology to identify simulated treatment errors by placing fixed thresholds on metrics (e.g. gamma pass/fail rate, mean gamma value, and near maximum gamma value) that are obtained from gamma maps (2D) or volumes (3D) (Bojechko and Ford 2015, Vieillevigne et al 2015, Wolfs et al 2017, Mijnheer et al 2018, Olaciregui-Ruiz et al 2020). However, this approach has limitations as multi-dimensional EPID data is compressed into a few numbers, leading to the loss of spatial information (Wolfs et al 2020). Using the spatial information of the EPID data and their respective dose comparison maps/volumes can be crucial for the identification of the error source. Knowing the error source can improve the RT workflow, as it provides additional information for a more targeted approach when adapting the RT treatment plan. However, manual evaluation of all clinically obtained EPID data is unrealistic as this would cause high workloads in the RT workflow.
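To illustrate the point-by-point principle, the following is a minimal sketch of a global gamma computation for a 1D dose profile; it uses a naive exhaustive search and is illustrative only, not the optimized 3D implementation used clinically.

```python
import numpy as np

def gamma_index_1d(ref, eval_, spacing_mm, dd=0.03, dta_mm=3.0):
    """Naive global gamma index for 1D dose profiles (illustrative only).

    For every reference point, all evaluation points are searched and the
    minimum of the combined dose-difference / distance-to-agreement metric
    is taken; gamma <= 1 means the point passes the (dd, dta) criteria.
    """
    x = np.arange(len(ref)) * spacing_mm
    dd_norm = dd * ref.max()  # global dose-difference criterion
    gamma = np.empty(len(ref), dtype=float)
    for i, (xi, di) in enumerate(zip(x, ref)):
        dist2 = ((x - xi) / dta_mm) ** 2            # DTA term, squared
        dose2 = ((eval_ - di) / dd_norm) ** 2       # DD term, squared
        gamma[i] = np.sqrt((dist2 + dose2).min())
    return gamma

ref = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
print(gamma_index_1d(ref, ref, spacing_mm=1.0).max())  # identical doses -> 0.0
```

A 10% dose scaling error on this profile pushes the maximum gamma well above 1, which is the behavior thresholding methods exploit.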
Artificial intelligence (AI) algorithms, such as deep learning (DL), have become widely available and present an opportunity for the automatic processing of large amounts of complex multi-dimensional data. The application of these DL networks has seen a major increase within the medical field, including radiation oncology (Thompson et al 2018). Recent studies have shown promising results for error detection during treatment, using DL models such as convolutional neural networks (CNNs), that can take full dose comparison images as input (Nyflot et al 2019, Potter et al 2020, Wolfs et al 2020, Kimura et al 2020, 2021, Wolfs and Verhaegen 2022). Wolfs et al (2020) showed promising results for identifying the underlying source of errors using the information from entire gamma maps. This simulation study showed a proof-of-concept for a DL model built for the identification of treatment errors in EPID dosimetry based DGRT in an in vivo scenario using CNNs. However, they only simulated one treatment error at a time, while clinically more complex scenarios should be considered, with the risk of multiple errors occurring simultaneously during treatment.
Therefore, this study expands upon previous research by investigating the possibilities of a DL algorithm for the identification of multiple treatment errors that can occur simultaneously during RT treatment. This work is a proof-of-concept with the aim to test the feasibility of CNNs developed for the automatic identification of multiple simultaneous treatment errors in EPID dosimetry based DGRT.

Data
In this study, data from 40 lung cancer patients (43 treatment plans) treated with volumetric modulated arc therapy (VMAT) was used (retrospective study approved by Maastro IRB: P0310-Decision DGRT-I). These patients were randomly selected and received RT treatment between July 2015 and July 2018. All VMAT plans consisted of two half-arcs. Table 1 shows that the selected cohort represents sufficient variation in anatomical characteristics, such as tumor location and stage, as well as in treatment plan characteristics such as fractionation.
The original dose volumes were recalculated with in-house developed software using the original treatment plans and CT images. The 3D dose distributions were reconstructed from 2D EPID dose predictions with Monte Carlo simulations, using the XVMC algorithm for the 3D photon dose calculation (Fippel 1999, van Elmpt et al 2006). The motivation to use 3D doses in this work is that the additional spatial information compared to 2D doses is easier to interpret and could be essential for the identification of multiple treatment errors occurring simultaneously. These dose distributions can be evaluated per VMAT segment, i.e. time-resolved, or can be summed over the entire VMAT arc, i.e. time-integrated. In this work, only time-integrated data was considered for the simulated 3D dose distributions.

Error simulation
In this work, simulated data was used, as manual evaluation and accurate labeling of a large set of clinical images is an arduous, difficult, and time-consuming task. A dataset for lung cancer patients, containing planned and simulated 3D dose distributions, was created by simulating a clinically realistic range of possible combinations of various treatment errors that can occur during RT treatment. The treatment errors are defined by three main error types: anatomical changes, patient positioning errors, and mechanical errors. Across the three main error types, 12 error subtypes were simulated in total. Anatomical changes were replicated by simulating tumor shifts, tumor regression, pleural effusion, and mediastinum shifts. Patient positioning errors included simulations of translation and rotation of the patient within the linear accelerator (linac) and mechanical errors were characterized by the simulation of monitor unit (MU) scaling errors, multileaf collimator (MLC) shifts, and collimator rotation errors. The magnitudes of the simulated errors were randomly chosen from a predefined range that was set for each of the different error subtypes. With in-house developed software, multiple treatment errors could be simulated simultaneously within the treatment plan and/or CT image of one patient. Anatomical changes were introduced by shifting and changing the intensity of the voxels within the relevant structures delineated in the CT image, positioning errors were simulated by rotating or translating the entire CT image along a randomly sampled axis, and mechanical errors, such as MLC and MU scaling errors, were introduced systematically or randomly within the treatment plans. A more detailed description of the error simulation process is provided in Supplementary Materials A and B.
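A minimal sketch of how an error magnitude could be sampled from a predefined range and applied to a plan. The `ERROR_RANGES` values and the toy per-segment MU list are hypothetical illustrations, not the study's actual simulation parameters (which are in Supplementary Materials A and B).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical error ranges per subtype (NOT the study's exact values)
ERROR_RANGES = {
    "mu_scaling":     (0.90, 1.10),  # systematic MU scale factor
    "mlc_shift_mm":   (-3.0, 3.0),   # uniform leaf-bank shift
    "translation_mm": (-5.0, 5.0),   # patient translation per axis
}

def sample_error(subtype):
    """Draw a random error magnitude from the subtype's predefined range."""
    lo, hi = ERROR_RANGES[subtype]
    return rng.uniform(lo, hi)

def apply_mu_scaling(plan_mu, factor):
    """Scale the monitor units of every segment by a systematic factor."""
    return [mu * factor for mu in plan_mu]

plan_mu = [50.0, 75.0, 60.0]  # toy MU values per VMAT segment
factor = sample_error("mu_scaling")
modified_mu = apply_mu_scaling(plan_mu, factor)
```

Several such modifications can be chained on the same plan and/or CT image to create combinations of simultaneous errors, as described above.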

Model input
The error-induced 3D dose distributions were calculated using the modified treatment plans and CT images. The original and error-induced 3D dose distributions were compared using gamma analysis, using the clinically standard (3%, 3 mm) DD and DTA criteria (Low and Dempsey 2003, Low 2010). Several preprocessing steps were used. A large part of the obtained gamma volumes contained voxels with gamma values close to 0, which were considered uninformative. Therefore, these were removed by applying a low dose threshold, removing all voxels from the volume that contained a dose value lower than 5% of the maximum dose in the volume. Furthermore, voxels with values higher than 0 that were located outside of the body structure were removed from the dose comparison volumes for the same reason. Figure 1 shows an example of an original and simulated CT image, the corresponding 3D dose distributions, and the resulting gamma volume. Additional details about data pre-processing steps are outlined in Supplementary Material C.
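The two pre-processing steps described above (the 5% low-dose threshold and the body masking) can be sketched as a single masking operation:

```python
import numpy as np

def preprocess_gamma_volume(gamma, dose, body_mask, low_dose_frac=0.05):
    """Zero out uninformative voxels before feeding the CNN.

    Removes voxels below a fraction (default 5%) of the maximum dose in the
    volume, and voxels outside the delineated body structure.
    """
    keep = (dose >= low_dose_frac * dose.max()) & body_mask
    out = gamma.copy()
    out[~keep] = 0.0
    return out

gamma = np.ones((1, 2, 2))
dose = np.array([[[1.0, 0.01], [1.0, 1.0]]])       # one low-dose voxel
body_mask = np.array([[[True, True], [True, False]]])  # one voxel outside body
cleaned = preprocess_gamma_volume(gamma, dose, body_mask)
```

After this step, the only non-zero voxels are those inside the body that received at least 5% of the maximum dose.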
Besides the clinically standard (3%, 3 mm) DD-DTA gamma analysis, (3%, 1 mm) and (1%, 1 mm) gamma analysis and also a simpler dose comparison method, i.e. the relative dose difference, were tested to optimize DL model performance. These additional dose comparison methods allow for the analysis of smaller discrepancies between the simulated and original dose, thereby providing dose comparison images with additional information. While these dose comparison methods may be too sensitive for conventional threshold-based analysis (where gamma values are expected to exceed 1 if an error is present), the additional detail may benefit DL models. Additional to the 3D gamma volume of the body, which is used as standard input for the CNN, the left or right lung volume (depending on the location of the tumor) and the planning target volume (PTV) were extracted. Using the clinical target volume (CTV) instead of the PTV was not feasible as the size of the extracted volumes would be too small for the proposed DL models. Figure 2 shows an example of the lung volume and PTV extracted from the corresponding body volume. This work tested if evaluating the ROIs individually and then combining the information found within these volumes, using an ensemble model, can improve AI model performance.
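A sketch of the relative dose difference computation and of cropping a volume to an ROI bounding box (e.g. lung or PTV). Normalizing to the maximum of the original dose is an assumption here; the text does not state the exact normalization.

```python
import numpy as np

def relative_dose_difference(original, simulated, eps=1e-6):
    """Voxel-wise relative dose difference in percent.

    Normalization to the maximum of the original (error-free) dose is an
    assumption made for this sketch.
    """
    return 100.0 * (simulated - original) / (original.max() + eps)

def crop_to_roi(volume, roi_mask):
    """Crop a volume to the bounding box of a region of interest,
    producing the smaller input volumes used by the ensemble model."""
    idx = np.argwhere(roi_mask)
    lo, hi = idx.min(axis=0), idx.max(axis=0) + 1
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

original = np.full((3, 3, 3), 2.0)
simulated = original.copy()
simulated[1, 1, 1] = 2.2                       # a 10% local discrepancy
rdd = relative_dose_difference(original, simulated)
mask = np.zeros((3, 3, 3), dtype=bool)
mask[1:3, 1, 1] = True                          # toy ROI of two voxels
roi = crop_to_roi(rdd, mask)
```

Unlike gamma analysis, this comparison needs no DTA search, which makes it both simpler and more sensitive to small dose discrepancies.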
In total, 2580 gamma volumes (i.e. 60 simulations per treatment plan) were simulated. The dose comparison volumes were split patient-based into a training, validation, and a hold-out test set, to ensure that all data of one patient belonged to the same set. In total 35 patients (2100 volumes) were used for training and validation and 8 patients (480 volumes) were used as a hold-out test set. The models were trained and evaluated for the identification of multiple treatment errors at two classification levels: Level 1, the main error type with three classes, and Level 2, the error subtype with 12 classes. For both levels, separate models using the same architecture were trained. This provides insight into the level of detail at which the model can accurately identify multiple treatment errors. Nested cross-validation was used to train, evaluate, and optimize the models. In nested cross-validation, k-fold cross-validation for hyperparameter optimization is nested inside the k-fold cross-validation for model evaluation. The additional k-fold cross-validation loop for hyperparameter optimization provides a less biased estimate of the optimized model performance on the dataset, compared to normal cross-validation. In this work, 3D CNNs were trained on (3%, 3 mm) gamma volumes to obtain baseline results. For analysis of these results, the number of folds for both hyperparameter optimization (inner loop) and model evaluation (outer loop) was set to five. For evaluation of the results obtained from the different dose comparison methods and the ensemble model, the number of folds for the hyperparameter optimization was set to five as well, however, for model evaluation only one fold was used to speed up the training process. The final optimized model was applied to the hold-out test set to evaluate the model's performance on unseen data.
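The patient-based split can be sketched with a simple group-aware k-fold, where all volumes of one patient land in the same fold. In nested cross-validation, an inner split of the same kind would be run on the training folds for hyperparameter optimization; the round-robin assignment below is a minimal illustration, not the study's exact splitting code.

```python
import numpy as np

def group_kfold(groups, k):
    """Split sample indices into k folds such that all samples of one
    group (here: one patient) fall into the same fold."""
    uniq = np.unique(groups)
    fold_of_group = {g: i % k for i, g in enumerate(uniq)}  # round-robin
    folds = [[] for _ in range(k)]
    for idx, g in enumerate(groups):
        folds[fold_of_group[g]].append(idx)
    return folds

# Toy stand-in: 10 patients with 6 simulated volumes each
patients = np.repeat(np.arange(10), 6)
folds = group_kfold(patients, k=5)
# Outer loop: each fold serves once as validation set; an inner
# group_kfold over the remaining folds would tune hyperparameters.
```

This guarantees that no patient contributes volumes to both the training and the validation side of any split, which prevents patient-level information leakage.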

Classification models
Models capable of multi-label classification were developed for error identification on dose comparison volumes. Multi-label classification means that the classes are not mutually exclusive, and a data sample can be assigned to multiple classes (as opposed to the more common multi-class classification problem, where each data sample can only be assigned to one class). Multi-label classification is achieved by using multiple neurons in the output layer of the classification model, where the number of neurons equals the number of classes, and by performing a binary classification per class.
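The per-class binary decision can be sketched as follows: a sigmoid maps each output logit to an independent class probability, and a class is assigned when its probability exceeds a (possibly per-class) threshold, so several classes can be active at once.

```python
import numpy as np

def multilabel_decision(logits, thresholds):
    """Convert per-class logits into a binary multi-label prediction.

    Each class is decided independently (binary classification per class),
    so the resulting vector can contain multiple 1s simultaneously.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))  # sigmoid
    return (probs > np.asarray(thresholds, dtype=float)).astype(int)

# Level 1 example: three main error classes, default 0.5 thresholds
print(multilabel_decision([2.0, -1.5, 0.3], [0.5, 0.5, 0.5]))  # -> [1 0 1]
```

Here the sample is assigned two of the three classes at once, e.g. an anatomical change and a mechanical error occurring in the same treatment fraction.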
The CNNs consist of two main structures: (1) a backbone, that takes an input image and outputs a latent feature space representation, and (2) a classification head that converts this representation into a 1D array of logits, and converts these logits into probabilities using a sigmoid activation function. The output of the classification head of each model is a 1D array containing the class probabilities [0-1] for each class. A class was assigned to the input image when the class probability exceeded a pre-defined threshold (by default 0.5), essentially transforming the 1D array of probabilities into a one-hot encoded vector, where multiple classes could have a value of 1 simultaneously. In this work, a 3D CNN and an ensemble of three 3D CNNs, where each CNN was provided with a different ROI as input volume, were developed and evaluated. For training the CNNs, a transfer learning approach was employed for faster convergence and to improve model performance, by initializing the backbone with pre-trained network weights. Transfer learning was shown to be effective in medical physics datasets when the dataset size is small, especially when the source and target datasets have similarities (Chen et al 2019, Romero et al 2020). Hyperparameters, such as batch size and learning rate, were optimized using the Optuna framework (Akiba et al 2019). More details about training and hyperparameter optimization of the CNNs are provided in Supplementary Material D.

3D CNN
The 3D CNN used a slightly modified 3D ResNet architecture as its backbone structure. This architecture was adapted from the paper of Chen et al (2019). Figure 3 shows a schematic representation of the CNN's architecture. The modifications compared to the standard 3D ResNet are: (1) the number of input channels was changed from 3 to 1 due to the use of single channel input volumes (i.e. grayscale images containing gamma or dose difference values) and (2) strided convolutional layers in the third and fourth block were removed and replaced with dilated convolutional layers, to prevent down-sampling of the feature maps (Chen et al 2019). A smaller ResNet backbone is preferred as the larger models are more prone to overfit on a smaller dataset, therefore the size of the backbone was empirically set to a 3D ResNet-18 model. Transfer learning was utilized by initializing the backbone with weights obtained from pre-training on the 3DSeg-8 dataset. This 3D medical imaging dataset contains diverse modalities, target organs, and pathologies (Chen et al 2019). The ResNet-18 backbone architecture consists of a 3D convolutional layer with a filter size of 7 × 7 × 7 and a max-pooling operation with a pool size of 2 × 2 × 2, followed by four blocks each containing: a 3D convolutional layer with a filter size of 3 × 3 × 3, a batch normalization function, and a ReLU activation function. This series of layers was repeated four times in each block. The classification head contained a global average pooling (GAP) operation and one fully connected layer, resulting in a final total of 18 layers within the 3D CNN. The GAP operation compresses the 3D extracted feature maps into a 1D array by taking the average over each map. For the convolutional layers, ReLU activation was used.
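The key modifications can be sketched in a deliberately tiny PyTorch module: a single input channel, dilated instead of strided convolutions in the later blocks (so the feature-map resolution is preserved), and a head of GAP plus one fully connected layer. This is an illustrative stand-in with far fewer layers and channels than the actual modified ResNet-18, and without the 3DSeg-8 pre-trained weights.

```python
import torch
import torch.nn as nn

class TinyResNet3DSketch(nn.Module):
    """Minimal single-channel 3D CNN in the spirit of the modified
    ResNet-18 backbone described above (sketch, not the authors' model)."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=7, stride=2, padding=3),  # 1 input channel
            nn.BatchNorm3d(16), nn.ReLU(),
            nn.MaxPool3d(2))
        # Later blocks use dilated rather than strided convolutions,
        # keeping the feature-map size constant
        self.blocks = nn.Sequential(
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.BatchNorm3d(32), nn.ReLU(),
            nn.Conv3d(32, 32, 3, padding=2, dilation=2), nn.BatchNorm3d(32), nn.ReLU(),
            nn.Conv3d(32, 32, 3, padding=4, dilation=4), nn.BatchNorm3d(32), nn.ReLU())
        self.gap = nn.AdaptiveAvgPool3d(1)   # global average pooling
        self.head = nn.Linear(32, n_classes)  # one neuron per class

    def forward(self, x):
        x = self.blocks(self.stem(x))
        x = self.gap(x).flatten(1)
        return self.head(x)  # logits; sigmoid is applied downstream

model = TinyResNet3DSketch(n_classes=3)
out = model(torch.zeros(1, 1, 32, 32, 32))  # (batch, channel, D, H, W)
```

In a transfer learning setup, the stem and blocks would be initialized from pre-trained weights while the head is trained from scratch for the new classes.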

Ensemble model
An ensemble model was developed as an extension of the 3D CNN. This model combines the features that are extracted from dose comparison volumes localized in different ROIs. Figure 4 shows a schematic representation of the ensemble model. It consists of three 3D ResNet-18 models, each extracting features from either the body volume, lung volume, or PTV. The ResNet backbone that extracts the features from the lung and PTV is again slightly modified, removing one additional down-sampling step to accommodate the smaller input volume sizes. The model combines the extracted features after the GAP operation resulting in a feature vector three times larger compared to the default 3D CNN. To accommodate the larger output feature vector, an additional fully connected layer was added to the classification head architecture that reduces the feature vector back to its default size. For both the convolutional layers and the fully connected layers ReLU activation was used. Additionally, dropout was added to the fully connected layers to reduce overfitting (Srivastava et al 2014). To reduce computational costs, hyperparameters were optimized globally, meaning that each model in the ensemble used the same batch size and learning rate.
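The ensemble structure (three backbones, feature concatenation after GAP, an extra fully connected layer with dropout) can be sketched as follows; the backbones here are single-layer stand-ins for the modified ResNet-18s, and the feature sizes are illustrative.

```python
import torch
import torch.nn as nn

class ROIEnsembleSketch(nn.Module):
    """Sketch of the ROI ensemble: three backbones (body, lung, PTV),
    GAP features concatenated, then an extra fully connected layer
    reduces the tripled feature vector back to the default size."""
    def __init__(self, feat=32, n_classes=12):
        super().__init__()
        def backbone():  # stand-in for a modified 3D ResNet-18
            return nn.Sequential(
                nn.Conv3d(1, feat, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.body, self.lung, self.ptv = backbone(), backbone(), backbone()
        self.head = nn.Sequential(
            nn.Linear(3 * feat, feat), nn.ReLU(),  # reduce 3x vector
            nn.Dropout(0.5),                       # dropout against overfitting
            nn.Linear(feat, n_classes))

    def forward(self, body, lung, ptv):
        z = torch.cat([self.body(body), self.lung(lung), self.ptv(ptv)], dim=1)
        return self.head(z)

m = ROIEnsembleSketch()
logits = m(torch.zeros(1, 1, 32, 32, 32),   # body volume
           torch.zeros(1, 1, 16, 16, 16),   # lung volume (smaller)
           torch.zeros(1, 1, 8, 8, 8))      # PTV (smallest)
```

Because the backbones end in adaptive average pooling, each ROI can keep its own input size, which is what allows the smaller lung and PTV crops to be combined with the full body volume.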

Evaluation metrics
Model performance was optimized and evaluated based on the sample precision, recall, and F1-score. The precision and recall scores give a clear indication of the number of false positive and false negative predicted classes. In other words, a false positive would be triggered if the model predicts an error that does not exist in the data sample, while a false negative would be triggered if the model does not predict an existing error in the data sample. A low precision would indicate that the model is over-confident in its predictions (i.e. high number of false positives), while in contrast, a low recall would indicate that the model is not confident in its predictions (i.e. high number of false negatives). The sample F1-score combines these two values per input volume to create a value that can be more easily optimized. A perfect model would have a precision and recall score of 1, and consequently, a sample F1-score of 1. A sample F1-score of 1 would indicate that all errors have been identified correctly for an input volume. To obtain the predictions, the probabilistic array that the AI model outputs is mapped to an array of binary values of 0 or 1 indicating the presence or absence of an error. Commonly a classification threshold of 0.5 is used, meaning that when the class probability is higher than 0.5 the model would identify that error class. However, this default classification threshold does not always provide the highest F1-score and thus, in this case, optimal model performance. In this study, classification thresholds were optimized per class, i.e. each class probability needed to exceed a different threshold. Classification thresholds that provide the highest sample F1-score on the validation set were chosen. Area under the receiver-operating-characteristic (ROC) curve (AUC) scores were also reported per class, to evaluate differences in performance between classes.
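The sample-wise metrics can be sketched as follows: precision, recall, and F1 are computed per input volume over its predicted class vector and then averaged over volumes.

```python
import numpy as np

def sample_prf1(y_true, y_pred, eps=1e-12):
    """Sample-wise precision, recall and F1 for multi-label predictions.

    y_true, y_pred: binary arrays of shape (n_samples, n_classes).
    Metrics are computed per sample, then averaged over samples.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = (y_true & y_pred).sum(axis=1)                # true positives per sample
    precision = tp / (y_pred.sum(axis=1) + eps)       # low -> many false positives
    recall = tp / (y_true.sum(axis=1) + eps)          # low -> many false negatives
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision.mean(), recall.mean(), f1.mean()

# One volume predicted perfectly, one with a missed (false negative) error
p, r, f = sample_prf1([[1, 0, 1], [1, 1, 0]],
                      [[1, 0, 1], [1, 0, 0]])
```

In this toy example the missed error lowers the mean recall (0.75) while precision stays at 1.0, the imbalance pattern reported for the Level 2 models.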

Baseline results
The results of the nested cross-validation for the baseline model are shown in figure 5. The training and validation F1-scores are higher than 0.87 for Level 1, with an average F1-score on the validation set of 0.87 ± 0.01. The results show a balanced precision and recall score in the training and validation sets for Level 1. For Level 2, a steep drop in performance can be seen, with an F1-score on the validation set of 0.59 ± 0.02. For Level 2, a substantial imbalance in the precision and recall score can be observed for both the training and validation set, indicating the presence of a relatively large amount of false negative predicted classes.

Dose comparison methods
Figure 6 shows the effect of the input data on the overall model performance. Models using as input (3%, 1 mm) and (1%, 1 mm) gamma volumes and relative dose difference are compared to gamma volumes with (3%, 3 mm) criteria. While differences are small, all the different dose comparison methods have a negative effect on the model performance for Level 1, resulting in a decrease in the F1-score of 0.01, 0.02, and 0.01 on the validation set, for the (3%, 1 mm) gamma analysis, (1%, 1 mm) gamma analysis, and relative dose difference, respectively. There also appears to be more overfitting for the (3%, 1 mm) and (1%, 1 mm) gamma analysis compared to the standard (3%, 3 mm) gamma analysis, indicated by an increase in the training F1-score and a decrease in the validation F1-score. For Level 2, the opposite effect on the model performance can be observed. All the different dose comparison methods have a positive effect on the model performance for Level 2, increasing the F1-score by 0.04, 0.03, and 0.05 on the validation set, for the (3%, 1 mm) gamma volumes, (1%, 1 mm) gamma volumes, and relative dose difference volumes, respectively. Overfitting remains similar between the standard (3%, 3 mm) gamma analysis and the other dose comparison methods, as both the training and validation F1-scores increase.

Ensemble model
Figure 7 shows the results of the ensemble model (ensemble of different ROIs) compared to the 3D CNN with (3%, 3 mm) gamma analysis input volumes. The ensemble model is trained with the standard (3%, 3 mm) gamma volumes and the relative dose difference volumes, the latter being the best performing dose comparison method for Level 2 (figure 6). For Level 1 the 3D CNN with (3%, 3 mm) gamma volumes slightly outperformed the ensemble model when trained on either the (3%, 3 mm) gamma volumes or the relative dose difference volumes, decreasing the validation F1-score by 0.02 for both input volumes. However, for Level 2 the ensemble model outperformed the 3D CNN with (3%, 3 mm) gamma volumes when trained with both dose comparison methods, improving the validation F1-score by 0.05 and 0.06 when trained on the (3%, 3 mm) gamma volumes and relative dose difference volumes, respectively.

Prediction thresholds
Based on these results, the best performing models were: the 3D CNN using (3%, 3 mm) gamma analysis for Level 1 and the ensemble model using relative dose difference for Level 2. Figure 8 shows the AUC values of the validation set per class for both levels for these best performing models. It can be observed that there are differences between the AUC scores for different classes. Tumor regression and random MU scaling errors have lower AUC scores than, for instance, patient rotation and translation errors.
Classification thresholds of the individual classes were optimized for the best performing models to improve the validation F1-score. The results of the threshold optimization process are shown in figure 9, and the classification threshold values of the individual classes are provided in Supplementary Material E. Optimizing the classification thresholds increases the validation F1-score by 0.01 for both Level 1 and Level 2. For Level 2, it can be observed that the precision and recall scores become more balanced when the classification thresholds are optimized, reducing the gap between the precision and recall from 0.26 to 0.15.

Figure 7. F1-score of the ensemble model for classification Level 1 and Level 2 on the training and validation datasets, using both the standard (3%, 3 mm) gamma volumes and relative dose difference volumes as input. The performance of the ensemble model is compared with the performance of the 3D CNN using (3%, 3 mm) gamma volumes as input, which is considered a baseline performance.
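The per-class threshold optimization can be sketched as a greedy grid search on validation data; the grid values and the one-class-at-a-time search order are assumptions for this sketch, as the exact search procedure is not detailed here.

```python
import numpy as np

def mean_sample_f1(y_true, y_pred, eps=1e-12):
    """Mean sample F1 for binary multi-label arrays (n_samples, n_classes)."""
    tp = (y_true & y_pred).sum(axis=1)
    p = tp / (y_pred.sum(axis=1) + eps)
    r = tp / (y_true.sum(axis=1) + eps)
    return float((2 * p * r / (p + r + eps)).mean())

def optimize_thresholds(y_true, probs, grid=np.linspace(0.05, 0.95, 19)):
    """Greedy per-class threshold search (sketch): for each class in turn,
    pick the grid value maximizing the mean sample F1 on the validation
    set while keeping the other classes at their current thresholds."""
    thr = np.full(probs.shape[1], 0.5)
    for c in range(probs.shape[1]):
        scores = []
        for t in grid:
            trial = thr.copy()
            trial[c] = t
            scores.append(mean_sample_f1(y_true, (probs > trial).astype(int)))
        thr[c] = grid[int(np.argmax(scores))]
    return thr

# Toy validation data: class 1 positives sit at probability 0.4,
# so the default 0.5 threshold misses them all
y_true = np.array([[1, 1], [0, 1], [1, 0]])
probs = np.array([[0.9, 0.40], [0.1, 0.40], [0.8, 0.05]])
thr = optimize_thresholds(y_true, probs)
```

Lowering the class 1 threshold below 0.4 recovers the missed positives, which is exactly the precision/recall rebalancing effect described above.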

Hold-out test set
The best performing models, i.e. the 3D CNN with (3%, 3 mm) gamma volumes for Level 1 and the ensemble model with relative dose difference volumes for Level 2, were evaluated on the hold-out test set. The performance of the best performing models on the training, validation, and hold-out test set are shown in table 2. For Level 1 the performance is good, with F1-scores of 0.91, 0.90, and 0.86 for the training, validation, and hold-out test set, respectively. For Level 2, performance is moderate to good, with F1-scores of 0.78, 0.67, and 0.62 for the training, validation, and hold-out test set, respectively.

Discussion
The results of the nested cross-validation in figure 5 show that the 3D CNN scores very well for Level 1 error identification, with a mean sample F1-score higher than 0.87 on the validation set. Furthermore, the precision and recall scores on the validation set are balanced, indicating a balanced number of false positive and false negative predictions. These scores suggest that the 3D CNN is capable of accurate identification of multiple main treatment errors. For Level 2, overfitting becomes more apparent and a decrease in performance on the validation set can be observed. Furthermore, compared to Level 1, there seems to be a substantial imbalance between the precision and recall scores, indicating that there are more false negative predictions. This suggests that the model is not confident when predicting the error subtypes.
The results can be partly compared to those of Wolfs et al (2020), who developed a CNN that was capable of accurate identification of the main error type (Level 1) and error subtype (Level 2) in 2D gamma images (not in 3D dose comparison volumes, as in this work) when only one error occurred during treatment. Similar to the present study, Wolfs et al (2020) observed that overfitting became more apparent when the classification level was increased, decreasing the validation accuracy from 98.1% for Level 1 to 89.8% for Level 2. However, the overall performance is higher in the work of Wolfs et al (2020), 98.1% accuracy versus 0.90 F1-score and 89.8% versus 0.67 F1-score for Level 1 and Level 2, respectively, which highlights the added complexity when classifying multiple errors at once. The increased overfitting from Level 1 to Level 2 can be explained by the larger number of classes in Level 2 compared to Level 1. This means that there are fewer data points available per class, making it more difficult for the models to generalize between the training and validation sets. The number of error simulations used in this study was capped at 60 simulations per treatment plan. However, the dataset could be expanded by simulating additional errors for each treatment plan or by adding new patients to the dataset, potentially resolving the overfitting issue. Additional optimization of the CNNs may also improve the results, e.g. optimization of the 3D CNN classification-head architecture or a more exhaustive hyperparameter search.
Besides the standard (3%, 3 mm) gamma volumes, 3D CNNs were trained on input volumes with stricter gamma criteria, i.e. (3%, 1 mm) and (1%, 1 mm), and a simpler dose comparison method, i.e. relative dose difference. These stricter and simpler dose comparison methods are more sensitive and, as a result, provide additional information within the images. While no improvements can be found for Level 1 compared to the standard (3%, 3 mm) gamma analysis, Level 2 seems to benefit from the stricter (3%, 1 mm) and (1%, 1 mm) criteria and the simpler relative dose difference comparison. This indicates that the additional information in the input images helps the model with error subtype identification. The results found for Level 2 can be partly compared to the results found in the work of Wolfs and Verhaegen (2022). That study showed that (3%, 1 mm) and (1%, 1 mm) gamma criteria and relative dose difference maps improved model performance compared to the standard (3%, 3 mm) gamma analysis, for the identification of mechanical error subtypes. The results found for Level 2 are in line with the findings of their research. For Level 1, the model performance does not improve when different dose comparison methods are used. While differences are small, the results show a slight decrease in performance when compared to the standard (3%, 3 mm) criteria. The stricter gamma criteria add complexity to the data (i.e. more information within the volumes). For Level 2 this seems to boost the performance, however, for Level 1 this seems to cause overfitting as the task is less complex. This is supported by the enlarged gap between the training and validation F1-score for the (3%, 1 mm) and (1%, 1 mm) gamma criteria for Level 1. It should be noted that these results are based on simulation data. In real clinical practice, noise within the measured dose images during RT treatment is almost unavoidable, meaning that (3%, 3 mm) criteria might be more beneficial in real clinical practice as the stricter criteria might amplify the noise. Furthermore, the simulations used in this study are static; however, motion can influence the dose during RT treatment of the lungs as the patient is likely to move (i.e. breathing) during irradiation. Again, less strict gamma criteria might be more favorable to prevent amplification of the noise caused by movement of the patient.
For Level 1, the 3D CNN outperforms the ensemble model based on the validation F1-score. However, for Level 2, the developed ensemble model outperforms the 3D CNN when trained on either the (3%, 3 mm) gamma volumes or relative dose difference volumes (figure 7). The performance difference between the 3D CNN and the ensemble model shows that the ensemble model can extract useful features from the lung volume and PTV for the identification of the error subtypes. Similar to the results of the different dose comparison methods, Level 1 again does not seem to benefit from the additional input information, as the added complexity can lead to overfitting. Therefore, it is not recommended to use the ensemble model for the identification of the main error types. Additional optimization of the ensemble model may further improve the results, e.g. different dose comparison methods per ROI volume, hyperparameter optimization for the individual models, and improvement of the classification head architecture.
The results additionally showed that AUC scores for different classes differed substantially, especially for Level 2. It is interesting to note that errors affecting the whole or a large part of the high dose treatment field, such as patient positioning errors, anatomical changes in large organs like the mediastinum, and systematic mechanical errors, have higher AUC scores and therefore seem easier for the model to identify. As a final optimization step, the classification thresholds of the individual class predictions were optimized for the best-performing model, i.e. the 3D CNN with (3%, 3 mm) gamma volumes for Level 1 and the ensemble model with relative dose difference volumes for Level 2 (figure 9). Optimization of the classification thresholds improved the F1-score both for Level 1 and Level 2 (figure 9). In this work, the classification thresholds that result in the highest possible validation F1-score were chosen for both Level 1 and Level 2. A different approach could be to optimize the ratio between the precision and recall score, allowing more false positive or false negative predictions depending on the problem at hand. For clinical use, a higher recall might be preferred, as a false negative prediction (not detecting an error) can be potentially more harmful than a false positive prediction (detecting an error that is not there). The best performing models, after classification threshold optimization, were evaluated on the hold-out test set. The final models show good performance on the hold-out test set for Level 1 and a moderate performance for Level 2. For Level 2, differences between the training, validation, and test set are quite substantial, highlighting the overfitting issue.
Several approaches can be taken to further improve the results of the proposed models. A possible limitation of the 3D CNNs is the GAP operation in the classification head architecture. With a GAP-based approach, each 3D feature map extracted by the backbone is compressed into a single value, causing a loss of spatial information, which can be suboptimal for multi-label classification (Ridnik et al 2023). Relatively small anatomical changes, such as a small tumor shift and tumor regression, are likely most affected by this. An attention-based approach can be more optimal for multi-label classification, as it enables a more elaborate use of spatial information (Ridnik et al 2023). New developments in AI for medical imaging, such as vision transformers (Ayana et al 2023, Wang et al 2023), could also be explored. However, the dataset size needed for these novel architectures to train optimally is an order of magnitude larger than the dataset available in this work. They are therefore currently not expected to improve performance, although the combination of transfer learning with vision transformers could be promising. Another approach to improve model performance is to train on time-resolved data. In this work, only time-integrated input images are considered as model input. However, random MLC and MU errors can average out when dose images are accumulated over the entire VMAT arc. Using time-resolved data allows for better analysis of these random errors (Persoon et al 2016, Schyns et al 2016), but would significantly increase computational time and complexity.
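To make the GAP limitation concrete, the sketch below shows in NumPy, with made-up shapes and random weights, how a stack of 3D feature maps collapses to one scalar per channel before the multi-label sigmoid head, discarding *where* in the volume each feature was detected:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy backbone output: (channels, depth, height, width) feature maps.
features = rng.normal(size=(64, 4, 16, 16))

# Global average pooling: each 3D feature map is reduced to a single
# value; the spatial location of the activation is lost.
pooled = features.mean(axis=(1, 2, 3))  # shape (64,)

# Multi-label head: one linear layer followed by a sigmoid, giving an
# independent probability per class (no softmax competition).
n_classes = 5
W = rng.normal(size=(n_classes, 64)) * 0.1
b = np.zeros(n_classes)
probs = 1 / (1 + np.exp(-(W @ pooled + b)))
```

A small, localized dose discrepancy (e.g. a minor tumor shift) contributes only a tiny fraction to each channel's mean, which is why attention-based heads, which weight spatial positions before aggregating, can be better suited here.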
Another direction for future research is the identification of the magnitude of the treatment errors. This additional classification level would identify whether or not it is necessary to act upon an identified error, and is required for a fully adaptive RT workflow based purely on EPID dosimetry. Furthermore, the number and the combinations of the simulated errors should be further investigated. In this work, a wide range of possible combinations of treatment errors was simulated, creating a dataset of dose comparison images containing one up to four simulated treatment errors. However, the study of Mans et al (2010) showed that the occurrence rate of serious errors during RT treatment is low: only 17 out of 4337 treatment plans at the Antoni van Leeuwenhoek hospital contained an error that led to intervention. This suggests that, while this work shows the robustness of CNNs for multiple treatment error identification, not all combinations might be clinically relevant, some having a very low probability of occurring during actual RT treatment. To reduce overfitting, a simulation dataset could be created with error combinations known to be more pertinent in real clinical practice, which would lead to a dataset containing more data per class.
An additional factor that may hinder clinical use of an AI model as proposed in this work is that, in practice, errors may occur that differ from those the model was trained on, known as out-of-distribution data. There are different methods for incorporating this uncertainty in an AI model, such as adding an additional 'unknown error' class during training with a variety of cases that contain errors different from those originally included in the model, or evaluating the output classification probabilities across all classes and leveraging this information to determine whether a data sample is in- or out-of-distribution (Wang et al 2021).
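As a minimal illustration of the second option, a common heuristic scores a sample by its highest predicted class probability: if no class fits well, the sample may be out-of-distribution. This specific scoring rule is an assumption for illustration, not the method of Wang et al (2021):

```python
import numpy as np

def ood_score(probs):
    """Heuristic OOD score: a low maximum class probability means no
    known error class matches the sample well."""
    return 1.0 - np.max(probs, axis=-1)

# Hypothetical per-class probabilities for two samples.
confident = np.array([0.95, 0.02, 0.90])  # clear, known error pattern
uncertain = np.array([0.40, 0.35, 0.30])  # no class stands out

flag_confident = ood_score(confident) > 0.5  # not flagged
flag_uncertain = ood_score(uncertain) > 0.5  # flagged as possibly OOD
```

The flagging threshold (0.5 here) would itself need calibration on validation data, analogous to the classification threshold optimization discussed earlier.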

Conclusion
This study shows that it is possible to identify multiple errors occurring simultaneously in 3D dose verification data from EPID imaging. While classification of the main error type shows high performance, overfitting becomes more apparent for classification of the error subtype; this improves by using relative dose difference volumes and ensemble models that take dose comparison volumes in different anatomical structures into account.

Funding
This study was partially funded by Varian Medical Systems (project: Decision DGRT-I).

Figure 1 .
Figure 1. Examples of the 3D dose volumes obtained from the original and simulated CT and treatment plan. This example contains a total of three simulated errors: tumor regression, pleural effusion, and a random MU scaling error (not visible in the bottom left panel). The gamma volume obtained from gamma analysis between the original and simulated dose is visualized on the right side of the figure. The CT images, 3D dose distributions, and gamma volume are visualized in 2D by showing the nth slice in the volumes.

Figure 2 .
Figure 2. Examples of regions of interest in 3D gamma volumes. From left to right: body volume (default structure), lung volume, and PTV.

Figure 3 .
Figure 3. Representation of the backbone and classification head architecture of the 3D CNN. The backbone extracts features from the input image and the classification head converts these features into class probabilities, performing a binary classification per class using a sigmoid activation function. GAP: global average pooling; 3 × 3 × 3 Conv3D: a 3D convolutional layer with a filter size of 3 × 3 × 3. The dose comparison volume has one input channel (1 × 128 × 128 × 32) and is therefore visualized in grayscale.

Figure 4 .
Figure 4. Backbone and classification head architecture of the ensemble model. Due to the smaller size of the lung and PTV volumes, the ResNet architecture employs one less down-sampling step and instead uses dilated convolutions to extend the model's field of view. The features obtained from each model are concatenated after the global average pooling operation. This feature vector is used as input for the classification head, which contains two fully connected layers and a sigmoid activation function to obtain the final probabilities, performing a binary classification per class. GAP: global average pooling; 3 × 3 × 3 Conv3D: a 3D convolutional layer with a filter size of 3 × 3 × 3. The dose comparison volumes have one input channel (1 × height × width × depth) and are therefore visualized in grayscale.

Figure 5 .
Figure 5. Sample precision, recall, and F1-score of the 3D CNN for Level 1 and Level 2 errors using (3%, 3 mm) gamma volumes. Level 1 constitutes the classification of the main error type; Level 2 constitutes the classification of the error subtypes. Results are obtained with nested cross-validation; each box represents the metric score of the five outer folds on the training (blue) and validation (orange) datasets.

Figure 6 .
Figure 6. F1-score of the 3D CNN for Level 1 and Level 2 for different dose comparison methods on the training and validation datasets. Model performance with the standard (3%, 3 mm) gamma volumes, the (3%, 1 mm) and (1%, 1 mm) gamma volumes, and the relative dose difference volumes is compared.

Figure 8 .
Figure 8. AUC-ROC scores per class for the best-performing models, i.e. the 3D CNN for classification Level 1 and the ensemble model for classification Level 2, on the validation dataset.

Figure 9 .
Figure 9. Optimized (hatched) and non-optimized (blank) F1-scores of the best-performing models, i.e. the 3D CNN for classification Level 1 and the ensemble model for classification Level 2, on the validation dataset. Model performance was optimized by tuning the individual classification thresholds for each class on the validation dataset.

Table 1 .
Patient, anatomical, and treatment plan characteristics of the patients included in this study.

Table 2 .
Performance of the best-performing models for classification Level 1 and Level 2 on the training, validation, and hold-out test sets.