Performance Deterioration of Deep Learning Models after Clinical Deployment: A Case Study with Auto-segmentation for Definitive Prostate Cancer Radiotherapy

We evaluated the temporal performance of a deep learning (DL) based artificial intelligence (AI) model for auto-segmentation in prostate radiotherapy, seeking to correlate its efficacy with changes in the clinical landscape. Our study involved 1328 prostate cancer patients who underwent definitive radiotherapy from January 2006 to August 2022 at the University of Texas Southwestern Medical Center. We trained a U-Net-based segmentation model on data from 2006 to 2011 and tested it on data from 2012 to 2022 to simulate real-world clinical deployment. We measured model performance using the Dice similarity coefficient (DSC) and visualized trends in contour quality using exponentially weighted moving average (EMA) curves. Additionally, we performed the Wilcoxon rank-sum test to analyze differences in DSC distributions across distinct periods, and multiple linear regression to investigate the impact of various clinical factors. The model exhibited peak performance in the initial phase (from 2012 to 2014) for segmenting the prostate, rectum, and bladder. However, we observed a notable decline in performance for the prostate and rectum after 2015, while bladder contour quality remained stable. Key factors that impacted prostate contour quality included physician contouring styles, the use of various hydrogel spacers, CT scan slice thickness, MRI-guided contouring, and the use of intravenous (IV) contrast. Rectum contour quality was influenced by slice thickness, physician contouring styles, and the use of various hydrogel spacers. Bladder contour quality was primarily affected by the use of IV contrast. This study highlights the challenges of maintaining consistent AI model performance in a dynamic clinical setting. It underscores the need for continuous monitoring and updating of AI models to ensure their ongoing effectiveness and relevance in patient care.


Introduction
Over the last ten years, artificial intelligence (AI), driven by deep learning (DL) techniques, has achieved remarkable progress, particularly in areas such as computer vision (CV) and natural language processing (NLP), leading to transformative developments across numerous applications. This surge has led to significant enthusiasm in the medical realm, and DL-related medical publications have been growing exponentially since 2015. 1 However, despite the promising prospects of DL in the medical field, its practical deployment remains constrained. 2 This lack of clinical translation is multifactorial. [12][13][14][15][16][17][18][19][20] For instance, a model trained and validated on data from one institution may fail when implemented at another. 21 [24][25][26] The decline in model performance after clinical deployment can often be attributed to data drift, such as variations in imaging acquisition protocols over time within the institution, and evolving practice patterns as new faculty join. 27 In one of the first clinically oriented studies evaluating a model's performance, Davis et al. 22 observed a temporal decline in their model's ability to predict acute kidney injury. While they attributed this decline to calibration drift, they did not explore the underlying factors in detail. Similarly, Nestor et al. 25 noted temporal performance changes when predicting mortality and prolonged length of stay. Clearly, there is a pressing need for further research to explore how and why a DL model's performance may deteriorate over time. In this study, we observed a temporal decrease in the accuracy of our automated prostate segmentation model. Furthermore, we investigated the potential impact of evolving clinical workflows on this observed decline in model performance. We found that by refreshing the model with recent data, we were able to enhance its accuracy.

Dataset
In this single-institution study approved by our institutional review board, we identified 1480 patients at UTSW diagnosed with prostate cancer and treated with definitive external beam radiotherapy (EBRT) from January 2006 to August 2022. EBRT treatment regimens included conventional, moderately hypofractionated, or ultrafractionated radiotherapy, the last also known as stereotactic body radiotherapy (SBRT). All patients had delineated contours on the radiotherapy planning computed tomography (CT) for the prostate, rectum, or bladder. We excluded prostate contours that incorporated the seminal vesicles. To be included, patients were required to have at least one of the prostate, bladder, or rectum contours. Moreover, patients were excluded if significant imaging artifacts were observed. Our final cohort comprised 1,328 patients (Figure 1).

Model Training
We utilized a 3D U-Net-based auto-segmentation model to contour the prostate, rectum, and bladder on CT images intended for radiotherapy planning. Our implementation was based on the open-source MONAI U-Net. 28 The model was trained using the Adam optimizer with default hyperparameters ($\beta_1 = 0.9$ and $\beta_2 = 0.999$) over $1 \times 10^{5}$ iterations with the Dice loss function. We initialized the learning rate at $1 \times 10^{-4}$ and reduced it to $1 \times 10^{-5}$ at the $4 \times 10^{4}$th iteration and to $1 \times 10^{-6}$ at the $8 \times 10^{4}$th iteration. We set the batch size to one. The model was trained and validated using data from 163 patients treated before 2012.
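The step schedule described above can be written as a small piecewise-constant function. This is a minimal sketch, not the authors' training code: the function name `learning_rate` and the constant `TOTAL_ITERATIONS` are illustrative, and it assumes iterations are counted from zero and that each reduction takes effect exactly at the stated iteration.

```python
TOTAL_ITERATIONS = 100_000  # 1 x 10^5 iterations, as stated in the text


def learning_rate(iteration: int) -> float:
    """Return the learning rate for a zero-based iteration index.

    Assumption: the drop to 1e-5 applies from iteration 40,000 onward
    and the drop to 1e-6 applies from iteration 80,000 onward.
    """
    if iteration < 40_000:
        return 1e-4
    if iteration < 80_000:
        return 1e-5
    return 1e-6
```

In a training loop, the returned value would be assigned to the optimizer's learning rate at the start of each iteration.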

Longitudinal performance evaluation
DSC was calculated to evaluate the model's performance (i.e. the quality of the contours it generated).
The ground truth contours were the clinical contours used in the patients' delivered treatments.
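For reference, the DSC between a predicted contour A and a clinical contour B is 2|A ∩ B| / (|A| + |B|). The sketch below is illustrative only: it represents each contour as a set of voxel index tuples, and the function name `dice` and the handling of two empty contours are assumptions rather than details taken from the study.

```python
def dice(pred_voxels, truth_voxels) -> float:
    """Dice similarity coefficient between two sets of voxel indices."""
    pred = set(pred_voxels)
    truth = set(truth_voxels)
    denom = len(pred) + len(truth)
    if denom == 0:
        # Assumption: two empty contours are treated as perfect agreement.
        return 1.0
    return 2.0 * len(pred & truth) / denom


# Two 2-voxel contours sharing one voxel: 2 * 1 / (2 + 2) = 0.5
example = dice({(0, 0, 0), (0, 0, 1)}, {(0, 0, 1), (0, 0, 2)})
```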
To comprehensively evaluate the model's performance and discern any influencing factors, we employed both visualization techniques and statistical analyses. We charted the model's performance trends from 2012 to 2022 using an EMA. The EMA was applied to the DSC trend curve (termed EMA DSC) with a window size of 180 days, and a value was reported only when the window contained at least 50 observations.
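The text does not specify the exact EMA formulation, so the following is only one plausible reading. It assumes a trailing 180-day window, exponentially decaying weights within that window, and no output until the window holds at least 50 observations. The decay constant `tau_days` and the function name `ema_dsc` are hypothetical and do not come from the study.

```python
import math


def ema_dsc(days, dsc, window_days=180.0, min_obs=50, tau_days=60.0):
    """Time-windowed exponentially weighted moving average of DSC.

    days: observation times in days, sorted in ascending order.
    dsc:  DSC values aligned with `days`.
    Returns one value per observation, or None where the trailing window
    holds fewer than `min_obs` observations.
    tau_days is a hypothetical decay constant, not a value from the text.
    """
    out = []
    start = 0
    for i, t in enumerate(days):
        # Advance the window start past observations older than window_days.
        while days[start] <= t - window_days:
            start += 1
        if i - start + 1 < min_obs:
            out.append(None)
            continue
        num = 0.0
        den = 0.0
        for j in range(start, i + 1):
            weight = math.exp(-(t - days[j]) / tau_days)
            num += weight * dsc[j]
            den += weight
        out.append(num / den)
    return out
```

Because recent observations carry larger weights, the curve responds to a performance change sooner than an unweighted moving average over the same window would.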
We performed the Wilcoxon rank-sum test to analyze the differences in DSC distributions across distinct periods.
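The study's analyses were run in SAS 9.4; the Python sketch below only illustrates the mechanics of a two-sided rank-sum test. It assumes the large-sample normal approximation with a tie correction and no continuity correction, so its p-values may differ slightly from those reported by statistical packages. The function name `rank_sum_test` is illustrative.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Ties receive average ranks and the variance is
    tie-corrected. No continuity correction is applied (an assumption).
    Not defined when every pooled value is identical (zero variance).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n + 1) / 2.0
    var_w = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

A call such as `rank_sum_test(dsc_2012_2014, dsc_2015_2019)` (hypothetical variable names) would compare the DSC distributions of two deployment periods.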
Our analysis examined the potential effects of evolving clinical practices on the model's temporal performance. We examined the influence of factors such as CT imaging slice thickness, types of hydrogel spacers, IV contrast, MRI-guided contouring techniques, and physician contouring styles. We utilized multiple linear regression to identify the key contributors to performance degradation. Because DSC values fall between 0 and 1 and the DSC distributions of the predicted contours for the prostate, rectum, and bladder were left-skewed, we applied a logit transformation to the DSC values.
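The logit transformation maps a DSC value d in (0, 1) to ln(d / (1 - d)), which stretches values near 1 and reduces the left skew. The following is a minimal sketch: the clipping constant `eps`, which guards against DSC values of exactly 0 or 1, is an assumption and is not described in the text.

```python
import math


def logit_dsc(dsc: float, eps: float = 1e-6) -> float:
    """Logit transform of a DSC value.

    eps is a hypothetical clipping constant that keeps the transform
    finite when dsc is exactly 0 or 1.
    """
    clipped = min(max(dsc, eps), 1.0 - eps)
    return math.log(clipped / (1.0 - clipped))
```

The transformed values are unbounded, which makes them better suited as the response variable of a linear model than raw DSC values.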
Subsequently, these logit-transformed DSC values were fitted with a linear model that incorporated the variables mentioned above and their second-order interactions. We preset a significance level of 0.05 to determine the statistical significance of each factor. All analyses were performed using SAS 9.4.

Model Performance Deterioration
The model's performance trajectory can be divided into three periods:
• Year 2012-2014: Following its initial deployment, the model exhibited peak performance for all three organs.
• Year 2015-2019: Contour quality for the prostate and rectum declined markedly.
• Year 2020-2022: Modest improvements were noted in the quality of the prostate and rectum contours, but quality did not return to the level seen at initial deployment (2012-2014).

Data Drift
From 2006 to 2011, the majority of patients underwent conventionally fractionated radiotherapy, with a small subset (17.8%) participating in an SBRT clinical trial. From 2012 to 2014, conventionally fractionated radiotherapy remained prevalent, but a significant transition to SBRT was observed after 2015. SBRT, characterized by its precision facilitated by specialized equipment and refined imaging, predominantly utilizes CT scans with a 2 mm slice thickness, compared to the 3 mm thickness common in conventional treatments. The use of hydrogel spacers and IV contrast became more prevalent in SBRT protocols, with hydrogel spacers also being used in some patients undergoing moderately hypofractionated treatments. Institutionally, where feasible, MRI-guided prostate contouring was the preferred method for most patients, especially after 2015. As precision radiotherapy gained prominence, these techniques became foundational in most radiotherapy planning protocols for prostate cancer patients.

Model Updates
We found that the use of different hydrogel spacer types, different slice thicknesses, MRI-guided contouring, IV contrast, and different physicians' contouring styles contributed to the deterioration of the DL-based auto-segmentation model's performance.
To counteract this deterioration, we implemented model updates that incorporated enriched datasets to enhance performance. Updates were triggered upon the identification of discernible performance declines and were conducted every two years.

Discussion and Conclusions
In this simulated yet practical study evaluating the performance of a DL model deployed in the clinic, we showed that performance decreased over time. We then investigated and identified the potential factors contributing to this deterioration, explicitly evaluating the introduction of new technologies and procedures such as hydrogel spacers, changes in CT slice thickness, MRI-guided contouring, and the presence of contrast agent in the bladder. Additionally, we considered changes in clinical personnel. To the best of our knowledge, no other studies in the DL auto-contouring medical domain have undertaken a similar exploration into identifying and analyzing factors contributing to the performance deterioration of a deployed DL model over time.
[31][32][33][34] However, many of these studies do not evaluate the models' performance years later. Given the evolving nature of medical practice, shifts in treatment paradigms are inevitable. Such changes can pose challenges for DL models that are not periodically updated for clinical application.
Our research revealed that our model, trained on data from 2006 to 2011, maintained clinically acceptable performance until the end of 2014. After this period, its ability to contour the prostate and rectum declined significantly. This decline is concerning, especially considering the critical role these structures play in prostate cancer radiotherapy. Radiotherapy targets the prostate while trying to spare adjacent structures, like the rectum and bladder, from receiving toxic radiation doses. 35 An inappropriately contoured target can lead to poor treatment coverage, 36 which increases the patient's risk of recurrence, or it can increase the patient's risk of toxicity if the contoured target inadvertently includes normal tissues, such as the rectum.
It is critical to understand why models might fail in the future. Thus, we explored several potential factors that could have impacted our model's performance: the evolution of new treatments and changing personnel. Prostate cancer treatment has evolved over the last several decades. Initially, patients were treated with conventionally fractionated radiotherapy, which involved daily radiation treatment, five days a week, for more than seven weeks. 37 In more recent years, patients have been more likely to be treated with hypofractionated approaches, which may take only two to four weeks. 38 However, early studies of these hypofractionated approaches did not appear until 2007, 39 with their mainstream clinical adoption occurring much later.
Given this shift, we investigated the potential influence of SBRT on our model. Specifically, we assessed the impact of various factors commonly associated with SBRT treatment, such as CT slice thickness, hydrogel spacer usage, MRI-guided contouring, and the use of contrast. Our findings revealed that subgroups with hydrogel spacers (type I and type II), MRI-guided contouring, or 2 mm thick CT slices generally exhibited lower contour quality than their counterparts. Specifically, the model's predictions for prostate and rectum contours were inferior in these subgroups compared to those without hydrogel spacers, without MRI-guided contouring, or with 3 mm thick CT slices. Interestingly, the presence of contrast in the bladder enhanced the model's accuracy in predicting prostate contours.
However, it is crucial to note that no single technique was solely responsible for the model's performance decline. This is clearly illustrated by the marked decreases observed in the EMA DSC curves of the groups without a given technique (e.g., WO Hydrogel Spacer in Figure 4B), underscoring the multifactorial nature of the performance deterioration.
We quickly observed that the model's ability to contour the rectum and the prostate declined after the type I hydrogel spacer's introduction into the clinic. However, the prostate contour quality improved significantly after the type II hydrogel spacer came into clinical use. In contrast, bladder contour quality remained relatively stable, primarily influenced by the use of IV contrast. Clinically, a hydrogel creates a gap between the prostate and the rectum, reducing rectal radiation exposure and associated side effects. 40 We believe this introduced a data distribution shift by systematically altering traditional anatomy, given that the original model was trained before the FDA's approval of the hydrogel spacer in 2015. 41 The type I hydrogel spacer, while indistinguishable on CT scans, was visible on MRI images.
Consequently, contours for SBRT cases, which predominantly used hydrogel spacers, were delineated with MRI guidance, adhering to our institutional protocol. In contrast, the brighter type II hydrogel spacer was evident on CT scans due to its higher Hounsfield Unit (HU) value. The pre-trained DL model likely interpreted the 'brighter' type II hydrogel spacer as non-prostate tissue, leading to a noticeable gap between the auto-generated prostate and rectum contours for patients with the type II spacer. We speculate that the effect of contrast present within the bladder mirrors that of the type II hydrogel spacer, aiding the model in distinguishing non-prostate tissues. This could explain the enhanced prostate contour quality in patients with contrast present in the bladder. A similar rationale applies to the prediction of rectum contours in the presence of hydrogel spacers and bladder contrast, though the impact of the latter was not statistically significant.
We also found that MRI-guided contouring and physicians' contouring styles correlated with the model's performance in predicting the prostate and rectum contours. Notably, an interaction effect between physicians' contouring styles and MRI-guided contouring was observed for the prostate contour quality.
This suggests that physicians' contours vary depending on whether or not they use MRI.
The exact reason is unknown, but we postulate that physicians are trained to prioritize target coverage and often contour the entire target based on their expertise and interpretation. While there are established guidelines for contouring the prostate and rectum, the delineation can be ambiguous in certain imaging modalities. MRI-guided contouring, in contrast to CT imaging alone, offers enhanced accuracy and potentially more consistent target and OAR contours. 42 Other possible explanations for the observed decline in the model's performance need further exploration, especially if the model is to be integrated into clinical practice. This observation underscores the critical need for continuous and systematic monitoring of deployed models. We advocate for the establishment of guidelines that methodically oversee these models' performance, coupled with strategies to investigate contributing variables. Leveraging out-of-distribution quantification metrics, such as uncertainty estimation techniques, might offer a case-specific approach to identify testing instances that diverge from the training dataset. 43,44

Conclusion
In this novel study, we clearly demonstrated that a model's performance can deteriorate significantly over time.While we could not draw definitive causal conclusions from this retrospective analysis, we pinpointed potential variables that led to shifts in data distribution and subsequent performance decline.
Based on these insights, we updated the model biennially using post-deployment data and observed consistent performance improvements after each iteration. These findings are not limited to our specific model but are broadly applicable to other models employed in medicine. Thus, it is crucial to establish guidelines for regular monitoring and refinement of these models, ensuring their continued efficacy and contribution to patient care across various medical domains.

Data Sharing Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request. The source code of the deep learning model used in this study is available online and was adapted to the investigated data (https://github.com/baiti01/iRT_AutoSegmentation).

Declaration of interests
We declare no conflict of interests.

Acknowledgment
This study is supported by NIH grants R01CA237269, R01CA254377, and R01CA258987.
We would like to thank Ms. Sepeadeh Radpour for editing the manuscript.
We retrospectively simulated the clinical implementation of our DL model to investigate temporal performance patterns. Our study involved 1328 prostate cancer patients who underwent definitive external beam radiotherapy (EBRT) between January 2006 and August 2022 at the University of Texas Southwestern Medical Center (UTSW). We trained a U-Net-based auto-segmentation model on data obtained between 2006 and 2011 and tested it on data obtained between 2012 and 2022, simulating the model's clinical deployment starting in 2012. We measured the model's performance using the Dice similarity coefficient (DSC) and visualized the trends in contour quality (DSC) using exponentially weighted moving average (EMA) curves. Additionally, we performed the Wilcoxon rank-sum test to analyze the differences in DSC distributions across distinct periods, and multiple linear regression to investigate the impact of various clinical factors.
Within the final cohort, 982 patients had well-defined prostate contours, 1269 had available rectum contours, and 1277 had available bladder contours. One hundred sixty-three (163) patients were treated between 2006 and 2011, 203 between 2012 and 2014, 602 between 2015 and 2019, and 360 between 2020 and 2022.

Figure 1. Data selection flow chart

Figure 2A, B and C illustrate the temporal trends in contour quality (DSC) for the prostate, rectum, and bladder.We observed distinct performance patterns for each organ.While the quality of prostate and rectum contours declined significantly over time, bladder contours remained relatively stable.

Figure 2. Trends in Auto-Generated Contour Quality. Panels (A), (B), and (C) present the exponentially weighted moving average (EMA) of the Dice similarity coefficient (EMA DSC) over time after simulated model deployment for the (A) prostate, (B) rectum, and (C) bladder. EMA DSC for the auto-generated prostate and rectum contours declined, whereas that for the bladder contours remained stable.

Figure 4A depicts a decline in the model's prostate contour quality, correlating with the increased incorporation of diverse clinical parameters, including various hydrogel spacer types, slice thickness variations, MRI-guided contouring, and IV contrast usage starting from 2015.

Figure 4B and Table 1 collectively indicate superior prostate contour quality during 2012-2014 when

Figure 4. Auto-generated Prostate Contour Quality vs. Technique Utilization and Subgroups. (A) EMA DSC for prostate contour quality and proportion of patients with different techniques utilized versus time after the simulated model deployment. (B) Prostate contour quality for subgroups stratified by techniques used versus time. (C) Prostate contour quality for subgroups stratified by treating physician versus time. The minimum number of observations for the subgroup EMA DSC curves is set to 30 in panel (C) because of the small number of patients treated by some physicians. Physicians with fewer than 30 available prostate contours do not have EMA DSC curves.

Figure 5C suggests the model demonstrated superior accuracy for physician B compared to other physicians for generating rectum contours.The contour quality hierarchy was: during 2012-2014: B > H > A; during 2015-2019: H > E > B > C and F; during 2020-2022: E > B > C > D.

Figure 6 underscores the bladder contour quality's stability, demonstrating its resilience to changes in clinical techniques and personnel.The EMA DSC curves of different subgroups are nearly indistinguishable, each maintaining a median DSC of 0.95, indicating consistent performance.However, a deviation is noted between 2012 and 2014, where the IV contrast group exhibited a lower DSC compared to the non-contrast group.Meanwhile, the model showcased consistent efficacy across all physicians (Figure 6C) with minimal variations.
• Update Model 1 was trained on data spanning 2006-2016 and tested on data from 2017-2022.
• Update Model 2 was trained on data from 2006-2018 and tested on data from 2019-2022.
• Update Model 3 was trained on data from 2006-2020 and tested on data from 2021-2022.
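The temporal splits used for the original model and the three updates can be collected in a small lookup table. The sketch below is illustrative only: the names `MODEL_SPLITS` and `split_role` are hypothetical, the year ranges are taken from the text, and "train" here covers both training and validation data.

```python
# (first_year, last_year) for training/validation and for testing, inclusive.
MODEL_SPLITS = {
    "Original": ((2006, 2011), (2012, 2022)),
    "Update Model 1": ((2006, 2016), (2017, 2022)),
    "Update Model 2": ((2006, 2018), (2019, 2022)),
    "Update Model 3": ((2006, 2020), (2021, 2022)),
}


def split_role(model: str, treatment_year: int):
    """Return 'train', 'test', or None for a patient's treatment year."""
    (train_start, train_end), (test_start, test_end) = MODEL_SPLITS[model]
    if train_start <= treatment_year <= train_end:
        return "train"
    if test_start <= treatment_year <= test_end:
        return "test"
    return None
```

Each split is strictly temporal: every test year falls after the last training year, so each update is evaluated only on patients treated after its training cutoff.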

Figure 7A, B, and C show the EMA DSC curves for prostate, rectum, and bladder contours, respectively, illustrating performance enhancements after each update. Each model update resulted in improved contour predictions, although we noticed a decline in rectum contour quality for Update Model 2 relative to Update Model 1 (Figure 7B) in 2019, which we attributed to disparate initial values influencing the EMA curves. After 2019, the EMA DSC curve for Update Model 1 benefited from the weighted averaging effect of preceding higher values, which did not apply to Update Model 2.

Biling Wang and Michael Dohopolski provided data resources, curated data, searched the literature, and wrote and revised the manuscript. Biling Wang, Michael Dohopolski and Steve Jiang had full access to the data and verified the data. Biling Wang trained the deep learning model and analyzed the data. Ti Bai constructed the deep learning framework and guided the model training. Junjie Wu downloaded all the DICOM data and helped to analyze the DICOM data. Dan Nguyen guided the model training and the analysis. Mu-Han Lin and Michael Dohopolski guided the data curation and contributed essential clinical insights. Raquibul Hannan, Neil Desai, Aurelie Garant, Daniel Yang and Robert Timmerman provided data annotations and clinical expertise. Xinlei Wang guided the overall research study, supervised the statistical evaluation, and contributed editorial assistance to the manuscript. Steve Jiang was pivotal in conceptualizing and directing the research, providing overall guidance on the direction and goals of the project, ensuring the integrity and quality of the research, and contributing editorial assistance to the manuscript. All co-authors have reviewed and consented to the published version of the manuscript.

Table 1
of MRI-guided contouring, the group without this technique registered a superior median DSC of 0.84, compared to the MRI-guided contouring group, which recorded a median DSC of 0.77. In contrast, the performance of the IV contrast groups diverged notably during 2020-2022, with the IV contrast group achieving a median DSC of 0.78, surpassing the non-contrast group's 0.76. Additionally, the trend of narrowing differences in subgroup performances suggests potential interaction effects among these variables.

Table 1 Results for Auto-generated Contour Quality
This precision can lead to variations in contouring, especially when combined with individual physician preferences. Furthermore, it is plausible that the introduction of MRI-guided contouring has influenced physicians to adapt or modify their contouring techniques, given the superior soft tissue contrast and detail provided by MRI. This adaptation might result in contours that deviate from those based solely on CT, leading to discrepancies when compared with predictions from a DL model trained on CT-based contours. Additionally, while not widespread, there might be instances where physicians defer OAR contouring to residents or other staff, making only minor adjustments themselves. Such delegation can introduce variability, especially if the training or experience level of the delegates differs from that of the physicians.
The 2 mm CT slice thickness, favored for its precision and detailed representation, is predominantly used for SBRT procedures. Notably, the proportion of CT scans acquired with this 2 mm thickness rose from 19.6% (training/validation) to 51.2% (testing). However, our observations indicated that the contour quality for both the prostate and rectum was inferior with 2 mm thick slices compared to 3 mm thick slices. This counterintuitive finding suggests that while finer slices capture more anatomical detail, they might introduce complexities or nuances that challenge the ability of the current DL model, which was trained predominantly on contours based on 3 mm thick CT images, to generate accurate contours.