Prospective validation of clinical deterioration predictive models prior to intensive care unit transfer among patients admitted to acute care cardiology wards

Objective. Very few predictive models have been externally validated in a prospective cohort following the implementation of an artificial intelligence analytic system. This type of real-world validation is critically important due to the risk of data drift, or changes in data definitions or clinical practices over time, that could impact model performance in contemporaneous real-world cohorts. In this work, we report the performance of a predictive analytics tool developed before COVID-19 and demonstrate its performance during the COVID-19 pandemic. Approach. The analytic system (CoMETⓇ, Nihon Kohden Digital Health Solutions LLC, Irvine, CA) was implemented in a randomized controlled trial that enrolled 10 422 patient visits in a 1:1 display-on/display-off design. The CoMET scores were calculated for all patients but only displayed in the display-on arm. Only the control/display-off group is reported here because the scores could not alter care patterns. Main results. Of the 5184 visits in the display-off arm, 311 experienced clinical deterioration and care escalation, resulting in transfer to the intensive care unit, primarily due to respiratory distress. The model performance of CoMET was assessed based on areas under the receiver operating characteristic curve, which ranged from 0.725 to 0.737. Significance. The models were well-calibrated, and there were dynamic increases in the model scores in the hours preceding the clinical deterioration events. A hypothetical alerting strategy based on a rise in score and duration of the rise would have had good performance, with a positive predictive value more than 10-fold the event rate. We conclude that predictive statistical models developed five years before study initiation had good model performance despite the passage of time and the impact of the COVID-19 pandemic.


Introduction
Unexpected clinical deterioration is a pernicious problem in hospital medicine. With the advent of widespread availability of electronic health record data and new awareness of and capabilities for machine learning, many academic and commercial groups have developed clinical decision-support tools. Early detection of clinical deterioration is a common goal among them.
Uptake into clinical practice, though, has been slow. There are concerns about clinical content validity (are the models capturing an interpretable signature of illness?) and about data drift (are the models relying on tests and measurements we no longer make while ignoring newer ones?) (Finlayson et al 2021, Yang et al 2022). For example, the myocardial band of creatine kinase was once the blood test of choice in patients with suspected acute myocardial infarction, but it has been replaced by troponin. Thus, a predictive model for heart attack that included the former and omitted the latter would be severely impaired. Moreover, implementation is difficult (Prudente Moorman 2021), and randomized clinical trials (RCTs) showing benefit are rare (Moorman et al 2011). In fact, a recent review of 41 RCTs showed no benefit of ML tools in clinical practice (Plana et al 2022).
Demonstrating clinical content validity and the absence of drift within a prospective non-intervention group is a fundamental step before examining the findings of RCTs. As part of a recent RCT of predictive analytics monitoring on consecutive patients admitted to an acute care cardiac medical-surgical ward (Keim-Malpass et al 2021), we calculated but did not display risk scores for over 4484 patients among 5184 visits randomized to the control arm. More than five years elapsed between model development and the randomized trial, and the trial started, by coincidence, at the beginning of the first wave of the pandemic in Charlottesville, VA. Here, we report the performance of this predictive analytics tool as a means of testing clinical content validity and drift. This is a TRIPOD type IV prospective validation study, since the data were not used for model development (Collins et al 2015). Primary findings from the parent RCT will be reported separately.

Methods
Following Institutional Review Board approval from the University of Virginia (IRB #22196), we undertook a 2-arm cluster randomized controlled trial to test the impact of a predictive analytics monitor display (CoMETⓇ, Nihon Kohden Digital Health Solutions LLC, Irvine, CA) on outcomes in consecutive patients admitted to acute care medical/cardiology, post-operative cardio-thoracic surgery, and medical/surgical hospital wards. This research was conducted in accordance with the principles embodied in the Declaration of Helsinki and in accordance with local statutory requirements under a waiver of informed consent. The trial protocol has been published (Keim-Malpass et al 2021). Briefly, we randomized clusters of four contiguous beds among the 80 beds spread over three wards. Clusters were re-randomized every two months. In the display-on arm, large computer monitor screens displayed the predicted risks of imminent cardiorespiratory events and of imminent cardiovascular events using CoMET.
CoMET models are logistic regression models with cubic splines based on physiological measurements from continuous cardiorespiratory monitoring data and on EHR elements of vital signs and laboratory tests. The models were trained separately on cardiorespiratory and cardiovascular events of clinical deterioration leading to escalation in care delivery (Moss et al 2017, Ruminski et al 2019, Blackwell et al 2020, Keim-Malpass et al 2022). The predictors of the CoMET score include: (1) cardiorespiratory dynamics measured from the continuous electrocardiogram (ECG), including heart rate (HR) variability, pairwise cross-correlations between HR and ECG-derived respiratory rate (RR), a local dynamics score, the coefficient of sample entropy (COSEn), and detrended fluctuation analysis (DFA) of heart inter-beat intervals, all sampled every 2 s; and (2) electronic medical record-derived parameters, including vital signs (temperature, HR, blood pressure, RR, SpO2), oxygen flow rate, and laboratory results (complete blood count, basic metabolic panel), all sampled every 15 min.
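The model form described above, logistic regression on a cubic spline basis, can be sketched in a few lines. The snippet below is a minimal illustration and not the CoMET implementation: it builds a truncated-power cubic spline basis for a single hypothetical predictor (heart rate), fits a logistic regression by gradient ascent, and evaluates on simulated data; the variable names, knot locations, and the simulated U-shaped risk are all assumptions made for illustration.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated-power basis for a cubic spline: 1, x, x^2, x^3, (x - k)^3_+."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def fit_logistic(X, y, lr=0.2, n_iter=5000):
    """Plain gradient-ascent fit of a logistic regression (no regularization)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    return w

# Simulated data: risk is a U-shaped function of heart rate -- the kind of
# nonlinearity a spline term can capture but a single linear term cannot.
rng = np.random.default_rng(0)
hr = rng.uniform(40, 160, 2000)
x = (hr - 100) / 60.0                                # crude standardization
p_true = 1.0 / (1.0 + np.exp(-(9 * x ** 2 - 1.5)))   # high risk at both extremes
y = (rng.uniform(size=x.size) < p_true).astype(float)

X = cubic_spline_basis(x, knots=[-0.5, 0.0, 0.5])
w = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ w))                 # fitted event probabilities
```

The spline basis is what lets a single physiological measurement contribute a non-monotonic risk shape, which is clinically sensible for signals such as heart rate, where both bradycardia and tachycardia are concerning.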
Randomization began on 4 January 2021 and ended 4 October 2022. We enrolled 10 422 patient visits into the randomized controlled trial. The display-off arm comprised fewer than half of the sample and received usual care. This subgroup is the focus of this report on predictive model performance, as the scores were not available to clinicians to influence care and potentially alter performance. Clinical research coordinators prospectively followed enrolled patients and individually adjudicated transfers to the intensive care unit (ICU). We analyzed only those transfers undertaken for true clinical deterioration, as opposed to movement to the ICU for elective procedures. An independent extractor then reviewed all emergent ICU transfer cases to determine the type of clinical deterioration using earlier definitions (Blackwell et al 2020). Reasons for ICU transfer included bleeding/acute hemorrhage, coronary artery disease (CAD), neurological disease (Neuro), arrhythmia, heart failure (HF), infection, and respiratory deterioration. Importantly, these categories are not mutually exclusive, so a patient could have multiple etiologies of clinical deterioration for a single ICU transfer event.
We tested the performance of the models as continuous risk predictors analyzed every 15 min. All analyses used the 12 h preceding clinical deterioration requiring emergent ICU transfer as the event detection window; i.e. a positive binary classification was labeled a true positive if it occurred within 2-12 h before emergent ICU transfer. We calculated (1) model discrimination using the area under the receiver operating characteristic (ROC) curve, (2) model calibration of predicted relative risk to observed relative risk, (3) dynamic changes in model outputs prior to the deterioration event, with significance testing using a Wilcoxon signed-rank test, (4) model binary classification performance with an alert for a specified amount and duration of model rise, and (5) empirical risk of emergent ICU transfer as a function of the joint distribution of estimated cardiovascular and cardiorespiratory risk.
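As a concrete sketch of step (1), the snippet below labels 15 min epochs as positive when they fall 2-12 h before an emergent ICU transfer and computes the ROC area via the Mann-Whitney rank identity. The data are simulated, and the handling of ties and of the final 2 h is a simplification of this sketch, not the study's exact pipeline.

```python
import numpy as np

def label_epochs(hours_to_event, lo=2.0, hi=12.0):
    """1 if the epoch falls 2-12 h before emergent ICU transfer, else 0.
    hours_to_event is np.inf for visits with no transfer."""
    return ((hours_to_event >= lo) & (hours_to_event <= hi)).astype(int)

def roc_auc(scores, labels):
    """ROC area via the Mann-Whitney U identity (ties ignored in this sketch)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Simulated epochs: scores run higher, on average, in the pre-event window.
rng = np.random.default_rng(1)
hours = np.concatenate([rng.uniform(0, 48, 500), np.full(4500, np.inf)])
labels = label_epochs(hours)
scores = rng.normal(1.0, 1.0, hours.size) + 1.5 * labels
auc = roc_auc(scores, labels)
```

Evaluating every epoch against a time-anchored window, rather than the best score of the whole stay, is what ties the reported AUC to a clinically actionable lead time.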
We additionally estimated the observed risk of emergent ICU transfer in the next 12 h as a function of the empirical distribution of cardiovascular and cardiorespiratory risk estimates. Using kernel density estimation, we calculated the two-dimensional density of risk estimates for all patients at all times, and also for data within 12 h before emergent ICU transfer (i.e. event data). We estimated the kernel bandwidth using the event data and used the same bandwidth for the event-data and all-data density estimates. The relative risk of ICU transfer, then, is the ratio of the event-data density divided by the density of all data.
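The density-ratio estimate described here can be sketched as follows. This is an illustrative reimplementation on simulated risk pairs, not the study code; the Gaussian product kernel and the Scott's-rule bandwidth (estimated from the event data and reused for both densities, as in the text) are assumptions of the sketch.

```python
import numpy as np

def kde2d(points, query, bw):
    """Gaussian product-kernel density estimate at each (x, y) query location."""
    d = query[:, None, :] - points[None, :, :]      # (n_query, n_points, 2)
    k = np.exp(-0.5 * (d / bw) ** 2).prod(axis=2)   # product of per-axis kernels
    return k.mean(axis=1) / (2 * np.pi * bw[0] * bw[1])

rng = np.random.default_rng(2)
all_pts = rng.normal([1.0, 1.0], 0.8, size=(3000, 2))    # all epochs
event_pts = rng.normal([2.5, 2.5], 0.8, size=(200, 2))   # epochs within 12 h of transfer

# Bandwidth from the event data (Scott's rule in two dimensions), reused for both.
bw = event_pts.std(axis=0) * len(event_pts) ** (-1.0 / 6.0)

query = np.array([[1.0, 1.0], [2.5, 2.5]])               # low-risk and high-risk corners
rel_risk = kde2d(event_pts, query, bw) / kde2d(all_pts, query, bw)
```

The ratio is below 1 where event data are rare relative to all data and well above 1 where pre-transfer epochs concentrate, which is exactly the grayscale surface of figure 6.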

Patient population
Over 22 months, there were 5184 visits among 4482 patients. Table 1 shows the characteristics of the display-off arm patients at the time of entry into the study. The sample was predominantly male (58%) and white (77%), with a mean age at admission of 66 years. Of these display-off arm patients, 311 (6.9%) were emergently transferred to the ICU. We analyzed model outputs in 2 174 117 15 min epochs, of which 0.59% were within 12 h prior to an emergent ICU transfer.

Clinical events
Figure 1 shows the etiologies of clinical deterioration, which were often multiple, leading to emergent ICU transfer. The UpSet plot represents overlapping attributes; for example, the set in the farthest right-hand column represents the number of patients transferred emergently to the ICU for arrhythmia, infection, and respiratory decompensation together. The most common reason for clinical deterioration was respiratory distress (36.3%). The plot shows the wide diversity of reasons for decompensation along with various co-occurring etiologies.
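The overlapping etiologies behind an UpSet plot reduce to counts of label combinations. A minimal illustration with made-up adjudications (these five transfers are invented for the example and are not study data):

```python
from collections import Counter

# Hypothetical adjudicated etiologies for five ICU transfers (illustrative only).
transfers = [
    {"respiratory"},
    {"arrhythmia", "infection", "respiratory"},
    {"heart failure"},
    {"respiratory", "infection"},
    {"bleeding"},
]

# Bars of an UpSet plot: counts of each exact combination of etiologies...
combo_counts = Counter(frozenset(t) for t in transfers)

# ...and the marginal count of each single etiology across all transfers.
single_counts = Counter(e for t in transfers for e in t)
```

Because the categories are not mutually exclusive, the marginal counts can sum to more than the number of transfers, which is why a combination-aware display such as an UpSet plot is more faithful than a simple bar chart.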

Model performance: continuous risk prediction
We evaluated the discrimination of the models by measuring the area under the receiver operating characteristic curve (AUC) on every 15 min predicted risk. The cardiovascular model AUC was 0.725 (95% CI 0.695-0.755), the cardiorespiratory model AUC was 0.737 (95% CI 0.709-0.761), and the all-cause ICU transfer model AUC was 0.733 (95% CI 0.701-0.761). For reference, the all-cause ICU transfer model had an AUC of 0.729 in the training dataset using a lead time of zero hours, compared to 2 h in this analysis (Moss et al 2017). We also compared performance against qSOFA, a well-known model for clinical deterioration, which yielded an AUC of 0.669 (95% CI 0.643-0.693).
We also assessed model performance by biological sex, as recommended by the SAGER guidelines, and found differences in AUC by sex (table 2).
Figure 2 shows the calibration of the models. The major finding is that the points lie close to the dotted line of identity; the predicted risks were close to the observed risks. Thus, the models are well calibrated, though the respiratory model slightly underestimated the observed risk at the extreme high end.
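A calibration curve like figure 2 can be computed by binning predictions into deciles and comparing the mean predicted relative risk with the observed relative risk in each bin. The sketch below runs on simulated, perfectly calibrated predictions; the base event rate and lognormal risk distribution are illustrative assumptions, not study values.

```python
import numpy as np

def decile_calibration(pred_rr, events):
    """Mean predicted vs observed relative risk within each decile of prediction."""
    base = events.mean()                                   # overall event rate
    edges = np.quantile(pred_rr, np.linspace(0, 1, 11))    # decile boundaries
    bins = np.clip(np.searchsorted(edges, pred_rr, side="right") - 1, 0, 9)
    pred = np.array([pred_rr[bins == b].mean() for b in range(10)])
    obs = np.array([events[bins == b].mean() / base for b in range(10)])
    return pred, obs

rng = np.random.default_rng(3)
pred_rr = rng.lognormal(0.0, 0.5, 20000)                   # predicted relative risk
# Simulate events so the predictions are calibrated by construction.
events = (rng.uniform(size=pred_rr.size) < np.clip(0.01 * pred_rr, 0, 1)).astype(int)
pred, obs = decile_calibration(pred_rr, events)            # points should hug identity
```

Plotting `obs` against `pred` with a dashed identity line reproduces the format of figure 2; departures from the line at the high end are the kind of underestimation noted for the respiratory model.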
Figure 3 shows the diagnostic likelihood ratio, or the ratio of post-test risk to pre-test risk, of ICU transfer as a function of the range of the CoMET score. These ranges provide risk groups based on natural thresholds interpretable to end users. The likelihood ratio increases with increasing risk group, and the risk is much higher in patients with scores >4.
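The diagnostic likelihood ratio per risk group reduces to the event rate within a score range divided by the overall event rate. A sketch with illustrative score bins follows; the bin edges and simulated data are assumptions and do not reproduce the published thresholds.

```python
import numpy as np

def lr_by_group(scores, events, edges=(0.0, 1.0, 2.0, 4.0, np.inf)):
    """Post-test / pre-test risk ratio within fixed score ranges,
    inclusive at the lower bound and exclusive at the upper bound."""
    pre_test = events.mean()
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (scores >= lo) & (scores < hi)
        out[(lo, hi)] = events[in_bin].mean() / pre_test if in_bin.any() else float("nan")
    return out

# Simulated scores and events in which risk scales with the score.
rng = np.random.default_rng(4)
scores = rng.lognormal(0.0, 0.7, 50000)
events = (rng.uniform(size=scores.size) < np.clip(0.005 * scores, 0, 1)).astype(int)
lr = lr_by_group(scores, events)
```

Fixed, interpretable score ranges (rather than data-driven deciles) are what make the figure 3 presentation usable by clinicians at the bedside.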

Model performance: dynamic changes before events
Figure 4 shows the average model scores as a function of time until the ICU transfer at time zero. The average relative risk rose from approximately 2-fold in the 48-24 h prior to emergent ICU transfer to approximately 3-fold at transfer. Open circles denote a statistically significant (p < 0.05) rise in the CoMET score relative to scores for the same patients 24 h prior; the average score begins to rise appreciably up to 8 h before the event.
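The paired comparison against the same patients' scores 24 h earlier can be illustrated with a Wilcoxon signed-rank test. The sketch below uses the large-sample normal approximation, drops zero differences, and does not mid-rank tied differences; the paired scores are simulated, not study data.

```python
import numpy as np
from math import erf, sqrt

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.
    Simplified: zero differences dropped, tied |differences| not mid-ranked."""
    d = np.asarray(x, float) - np.asarray(y, float)
    d = d[d != 0]
    n = len(d)
    order = np.argsort(np.abs(d))
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    w_pos = ranks[d > 0].sum()                     # sum of ranks of positive diffs
    mu = n * (n + 1) / 4.0
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_pos - mu) / sigma
    p = 1.0 - erf(abs(z) / sqrt(2.0))              # equals 2 * (1 - Phi(|z|))
    return z, p

# Simulated paired scores: "now" runs about 0.8 units above "24 h prior".
rng = np.random.default_rng(5)
prior = rng.normal(2.0, 1.0, 100)
now = prior + 0.8 + rng.normal(0.0, 1.0, 100)
z, p = wilcoxon_signed_rank(now, prior)
```

A rank-based paired test is a reasonable choice here because risk-score differences are skewed and each patient serves as their own control.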

Model performance: binary classification
Here we defined hypothetical alerts based on the trend of an increase in score that remains high for a period of time. To calculate this, we averaged the model scores hourly and calculated the delta at each hour relative to the value 2 h prior (each delta is therefore based on a 3 h retrospective window). An alert is identified when an hourly score results in a delta greater than d and the average over the next N hours remains >90% of the hourly score. If the score rises from, for example, 1 to 4, this satisfies a delta d = 3; N is then the number of hours for which the (running) average of the hourly scores is ⩾3.6 (90% of 4).
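The hypothetical alert rule can be expressed directly in code. The sketch below assumes the rise test is inclusive (a delta of exactly d triggers, as in the 1-to-4 example) and uses a plain mean over the next N hours; it illustrates the rule as described rather than reproducing the trial's software.

```python
def detect_alerts(hourly_scores, d=3.0, n_hours=3):
    """Return hour indices where the score rose by >= d versus 2 h earlier AND
    the mean of the next n_hours stays at or above 90% of the current score."""
    alerts = []
    for t in range(2, len(hourly_scores) - n_hours):
        rise = hourly_scores[t] - hourly_scores[t - 2]
        persistence = sum(hourly_scores[t + 1 : t + 1 + n_hours]) / n_hours
        if rise >= d and persistence >= 0.9 * hourly_scores[t]:
            alerts.append(t)
    return alerts

# A score that jumps from ~1 to ~4.5 at hour 3 and stays elevated for three hours.
scores = [1.0, 1.0, 1.0, 4.5, 4.4, 4.3, 4.2, 1.0, 1.0, 1.0, 1.0]
alerts = detect_alerts(scores, d=3.0, n_hours=3)   # fires once, at the jump
```

Requiring both the rise and the persistence is what filters out transient spikes, trading alert timeliness for the higher positive predictive value reported below.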
Figure 5 shows the positive predictive value (PPV) for emergent ICU transfer as a function of the alert threshold. The PPV is shown relative to the event rate of 0.59%. Note that, unlike previous studies of alert strategies, only emergent ICU transfers were used to identify true positive alerts (Keim-Malpass et al 2020). The best binary classification performance occurs when an alert is issued for an increase in the CoMET score of 3 or 4 units followed by an elevated score for a period of 3-6 h.

Empirical risk of emergent ICU transfer
In figure 6, grayscale quantifies the relative risk of emergent ICU transfer in the next 12 h as a function of estimated cardiovascular risk (x-axis) and cardiorespiratory risk (y-axis); whiter indicates higher risk. While figure 2 shows the calibration of each model individually, the calibration in figures 2(a) and (b) is the marginal distribution that would be obtained by averaging over the y- and x-axes of figure 6, respectively. Figure 6 shows that the combination of the two model risk estimates adds information about patient risk: while a patient with 2-fold cardiovascular risk has on average a 2-fold risk of ICU transfer, that risk is less than 1-fold when the cardiorespiratory risk is near zero, but greater than 3-fold when the cardiorespiratory risk is near 5.

Discussion
To assess clinical content validity and drift, we studied the performance of risk models for clinical deterioration in acute care cardiology, cardio-thoracic surgery, and medical-surgical wards in the context of a randomized controlled trial. The model performance and calibration were good, and the scores appropriately and dynamically rose in the 8-16 h preceding clinical deterioration requiring ICU transfer. We also found that these CoMET models performed better than qSOFA, and we made hypothesis-generating findings about differences in model performance by sex that require further study. This analysis is limited to the patients in the control (display-off) arm of the randomized controlled trial to evaluate model accuracy without contamination by potential changes in clinical practice based on the model results. More work is needed to understand the link between model rise and actual clinical action in real-world contexts.
A major finding was that there was no drift in model predictions. The predictive models used in this study, trained on 2013-2015 data (Moss et al 2016), continued to produce calibrated and accurate estimates of the risk of clinical deterioration despite the COVID-19 pandemic and its attendant changes in hospital practice: fewer elective admissions for cardiac surgery, fewer emergency room visits for cardiac symptoms, and high turnover rates among nursing staff.
The ROC areas ranged from 0.725 to 0.737. We note that the interpretation of the ROC area for a clinical prediction model depends both on how it is measured and on the clinical scenario. First, ROC areas for model predictions assessed outside the relevant time windows can be very high even if the model's utility is low. For example, the ROC area for TREWS, a model trained on septic shock in ICU patients and applied to all patients in the hospital, is 0.97 (Henry et al 2022, Saria et al 2022). This very high value was obtained by comparing the highest score observed during the entire hospitalization with the overall outcome of whether or not there was an episode of sepsis during the hospitalization. There was no attempt to limit the assessment of the score to the hours before the event. Thus, the score and the outcome were dissociated in time, and the ROC area has no clinical meaning. Another approach is to evaluate predictive models using time windows where the event has already occurred. The Rothman Index, for example, was evaluated for sepsis using data from up to 7 d after the diagnosis (Rothman et al 2017). More recently, Kamran and coworkers evaluated the Epic Sepsis Model at several time points: the whole hospital stay (like TREWS), before sepsis diagnosis, before blood cultures, before fluids, and before lab tests (Kamran et al 2024). The ROC areas fell with each successive restriction, from 0.87 to 0.50. The lowest ROC area corresponded to the most relevant clinical scenario: the prediction of sepsis before there was any clinical suspicion in the form of tests or treatments. Here, we report ROC areas for the period of time ending 2 h before ICU transfer.
Second, if the outcome is obvious to clinicians, a predictive model with even a very high ROC area adds no information. For example, patients with terminal illnesses assigned to comfort care have very deranged vital signs and lab values as they approach death, and a predictive model of the kind described here would rise to high values. If many patients of this kind were mixed into the population, the ROC area would be high, but since the outcome was a foregone conclusion, the model's value would be zero. We report ROC areas for unexpected ICU transfer as adjudicated by individual chart review, a non-obvious outcome. Heart rate characteristics monitoring for early detection of neonatal sepsis, the only AI tool to show a survival benefit in a large RCT, had a ROC area of 0.71 in the derivation cohort. This value, thought by many to be only moderate, nonetheless had high impact in the difficult clinical scenario of sepsis in premature infants.
We also found that an alerting strategy based on both a rise in score and a continued elevation over a duration of several hours promises good performance, with a PPV more than 10-fold the event rate. We have evaluated these and other models for their performance as a basis for alerting clinicians to patient deterioration. Rather than use a single observed value, we have used the pattern of a large, abrupt spike in the risk score. In this way, we have found PPVs ranging from 14% to nearly 50% (Sullivan et al 2014, Keim-Malpass et al 2020). In a retrospective analysis, the models reported here had a PPV of 24% for a combined endpoint of unplanned ICU transfer (the subject of this report), infection workup, rapid response team visit, myocardial infarction, stroke, unplanned surgery, initiation of cardiopulmonary resuscitation, and death (Keim-Malpass et al 2020). Other reported alerting strategies are usually based on the transgression of a threshold by a single value. TREWS, moreover, requires the presence of 'significant' (Saria et al 2022) and 'verifiable symptoms' (Henry et al 2022) as well as 'notable findings such as the presence of deterioration in key markers' (Adams et al 2022) before an alert is issued.
We note that the cardiovascular model performed less well than the cardiorespiratory model based on AUC, though still commensurate with its performance in the training data. We have consistently found higher predictive accuracy, as represented by model AUC, for models targeting urgent unplanned intubation for acute respiratory failure than for acute coronary events or stroke. One interpretation is that, on an acute cardiology ward, there are frequent episodes of cardiovascular decompensation that are treated without requiring intensive care and would count as false positives when evaluating the cardiovascular model, while respiratory decompensation that elevates the cardiorespiratory model more often requires escalation to the ICU.
Finally, we note that these models use mathematical analyses of continuous cardiorespiratory monitoring data in addition to EHR-based data elements. These continuously acquired physiological measurements can provide a more complete representation of the patient's physiological state and fill in the gaps between blood draws and nursing assessments. They represent a substantial advance in dynamic predictive risk estimation in real-world settings (Moss et al 2017, Monfredi et al 2021). The calibration of the CoMET models compares favorably to that of heart rate characteristics (HRC) monitoring for predicting neonatal sepsis, which is based entirely on continuous cardiorespiratory monitoring data; the HRC values fit well at low and medium risk but over-estimate risk at the high end (Lake et al 2014).
We conclude that signatures of cardiorespiratory and cardiovascular illness are present in the continuous cardiorespiratory monitoring, vital signs, and laboratory data of patients hospitalized on an acute care cardiology ward. These statistical predictive models, developed five years previously, had good predictive performance despite the passage of time and the impact of the COVID-19 pandemic. Incorporation into clinical care awaits additional research on model impact on clinical actions, additional alert-based strategies, and patient-centered modeling approaches to provide added understanding in this important field.

Figure 1. UpSet plot of additive reasons for emergent ICU transfer. The sets represent overlapping attributes; for example, the first set in the right-hand column represents the number of patients who had simultaneous arrhythmia, infection, and respiratory clinical deterioration events.

Figure 2. Calibration of the predictive models. (a) Cardiorespiratory model; (b) cardiovascular model. The data points are the average observed risk plotted as a function of each decile of the CoMET score, i.e. predicted relative risk. Perfect calibration is shown as a dashed line of identity.

Figure 3. Calibration of the risk models ((a) cardiorespiratory model; (b) cardiovascular model) by risk group. Each bin is inclusive at the lower bound and exclusive at the upper bound. The figure demonstrates the likelihood ratio of the event as a function of the CoMET score. As expected, the likelihood of an event increases substantially as the model-predicted risk rises.

Figure 4. (a) Cardiorespiratory model; (b) cardiovascular model. The relative risk of an event rises as the time to ICU transfer nears. Open circles denote statistically significant increases in the score relative to 24 h prior, using a Wilcoxon signed-rank test at the 0.05 significance level.

Figure 5. (a) Cardiorespiratory model; (b) cardiovascular model. Evaluation as a trend-based threshold alert, shown as the positive predictive value as a function of both the rise and the duration of risk score elevation. An alert is identified when an hourly score results in a delta greater than d and the average over the next N hours remains >90% of the hourly score. If the score rises from, for example, 1 to 4, this satisfies a delta d = 3; N is then the number of hours for which the (running) average of the hourly scores is ⩾3.6 (90% of 4).

Figure 6. Empirical risk of emergent ICU transfer as a function of both estimated cardiovascular risk (x-axis) and cardiorespiratory risk (y-axis). Contour lines are lines of iso-risk. Grayscale indicates relative risk at a given location (x, y): black is low risk and white is high risk.

Table 2. Model performance metrics by biological sex.