Detecting central apneas using multichannel signals in premature infants

Objective. Monitoring of apnea of prematurity, performed in neonatal intensive care units by detecting central apneas (CAs) in the respiratory traces, is characterized by a high number of false alarms. A two-step approach consisting of a threshold-based apneic event detection algorithm followed by a machine learning model was recently presented in the literature aiming to improve CA detection. However, since this approach is characterized by high complexity and low precision, we developed a new direct approach that consists only of a detection model based on machine learning working directly with multichannel signals. Approach. The dataset used in this study consisted of 48 h of ECG, chest impedance and peripheral oxygen saturation extracted from 10 premature infants. CAs were labeled by two clinical experts. 47 features were extracted from the time series using 30 s moving windows with an overlap of 5 s and evaluated in sets of 4 consecutive moving windows, similarly to what was indicated for the two-step approach. An undersampling method was used to reduce imbalance in the training set while aiming at increasing precision. A detection model using logistic regression with elastic net penalty and leave-one-patient-out cross-validation was then tested on the full dataset. Main results. This detection model returned a mean area under the receiver operating characteristic curve value equal to 0.86 and, after the selection of an FPR equal to 0.1 and the use of smoothing, an increased precision (0.50 versus 0.42) at the expense of a decrease in recall (0.70 versus 0.78) compared to the two-step approach around suspected apneic events. Significance. The new direct approach guaranteed correct detections for more than 81% of CAs with length L ≥ 20 s, which are considered among the most threatening apneic events for premature infants. These results require additional verification using more extensive datasets but could lead to promising applications in clinical practice.


Introduction
Neonatal intensive care units (NICUs) are the designated hospital environments where care for premature infants is provided (Bambang Oetomo 2012). Here, physiological signals from these fragile patients are continuously monitored by devices which alert nurses and clinicians in case of patient deterioration. It has however been estimated that up to 80% of the alarms sounding in NICUs are false or clinically irrelevant (Poets 2018, Ostojic et al 2020). This problem can also result in alarm fatigue, a condition in which nurses and clinicians become desensitized to the alarms, with the potential risk of delayed and missed responses to critical alarms (Cvach 2012, Keller 2012, ECRI Institute 2020).
Different solutions were proposed in recent years to address the high alarm burden present in NICUs. These are focused on two main strategies. The first relates to the optimization of the alarm parameters, which can for instance be customized depending on different patient profiles and on the workflow within a NICU (Van Pul et al 2015a, 2015b, McClure et al 2016, Varisco et al 2021a). The second strategy is focused on the development of new solutions and algorithms to improve the detection of important conditions that are currently associated with a high number of false alarms (Lee et al 2012, Gee et al 2017, Joshi et al 2020).
A condition which is characterized by a high number of false alarms in clinical practice (up to even 65%) is apnea of prematurity, a developmental disorder which occurs in premature infants and that relates to an immaturity of respiratory control (Abu-Shaweesh and Martin 2008, Di Fiore et al 2013, Eichenwald and Committee on Fetus and Newborn 2016, Lee et al 2012, Amin and Erica 2013). Apnea of prematurity is diagnosed when several apneic events, characterized by cessations of breathing longer than 20 s, or longer than 10 s in case these are accompanied by a bradycardia or desaturation, are detected in the respiratory traces of a premature infant (Zhao et al 2011, Mohr et al 2015a, Eichenwald and Committee on Fetus and Newborn 2016, Fairchild et al 2016). Apneic events are classified into central apneas (CAs), characterized by a cessation of the respiratory drive with consequent absence of effort to breathe in the infant, obstructive apneas, characterized by an airway obstruction and therefore an absence of respiratory flow, as well as mixed apneas, which share characteristics of both central and obstructive apneas (Mohr et al 2015a, Eichenwald and Committee on Fetus and Newborn 2016).
Different solutions to reduce the number of false apnea alarms were proposed in recent years. A noticeable one is the algorithm for CA detection developed by Lee et al, which filters out the cardiac artefact from the chest impedance (CI) signal before computing a CA probability function (Lee et al 2012, Vergales et al 2014, Mohr et al 2015a). It was also shown that this algorithm is characterized by a higher precision compared to the algorithms used by patient monitors in clinical practice, and it was further optimized without affecting its high recall (Lee et al 2012, Varisco et al 2021b). However, since several events extracted by this algorithm do not share all the characteristics of a CA (i.e. a trace of flat CI signal following a previously regular fluctuating CI) (Varisco et al 2021b), it is more precise to consider it an optimized algorithm for a general detection of apneic events. Other studies made use of machine learning (ML) and deep learning algorithms and showed promising results (Williamson et al 2013, Mago et al 2016, Shirwaikar et al 2016, Shirwaikar et al 2019, Lim et al 2020, Zuzarte et al 2021, Varisco et al 2022). Several of these studies were performed by Shirwaikar et al (Mago et al 2016, Shirwaikar et al 2016, Shirwaikar et al 2019). These used various supervised ML algorithms on a dataset containing demographic information and maternal covariates together with physiological values to detect apnea of prematurity. Their approach however did not allow the detection of single apneic events but was rather applied to detect the occurrence of this condition in entire patient records. Zuzarte et al (2021) managed instead to predict apneic events accompanied by bradycardia and hypoxia, using cardiorespiratory and movement features derived from physiological waveform data and considering sets of events consisting of 7.5 min long inter-apnea frames and pre-apnea frames extracted from 10 premature infants, in a model that combined an unsupervised method (Gaussian mixture models) with a supervised one (logistic regression, LR). More recently, we performed a CA detection study in which we included 20 patients and 48 h of data per patient to detect the occurrence of CAs (Varisco et al 2022). We developed a two-step approach consisting of a first detection of apneic events, performed by means of the previously mentioned optimized algorithm for the detection of apneic events, followed by the use of a machine-learning-based detection model to determine whether these were CAs or not. We obtained significant results, including a mean area under the receiver operating characteristic curve (AUROC) equal to 0.88 with our best detection model, achieved by using LR with elastic net penalty and leave-one-patient-out cross-validation (LOPO CV).
The aim of this new study is to verify whether it is feasible to perform CA detection by using a direct approach, which consists only of a detection model based on ML working directly with multichannel signals extracted from premature infants, without requiring a preliminary screening of the apneic events. The new direct approach was developed to reduce complexity and improve precision and, as shown in figure 1, its performance was compared to that of the two-step approach (Varisco et al 2022) using our complete dataset.
The current study is organized as follows: section (2) describes the methods used for the development of the new direct approach to perform CA detection; section (3) provides the detection results obtained by means of the new direct approach; section (4) provides a discussion with further considerations regarding the obtained results, as well as a comparison between the two-step approach and the new direct approach in terms of performance and computational time needed to perform CA detection; and section (5) provides a conclusion to the current study.

Dataset and annotations
In this study we used a dataset which consisted of 10 premature infants who developed late-onset sepsis (LOS) and 10 matched controls, all characterized by a gestational age (GA) ≤ 30 weeks and admitted to the NICU of Máxima Medical Center in Veldhoven, the Netherlands, from July 2016 to December 2018 (Varisco et al 2022).
While for LOS patients we extracted the 48 h of data immediately preceding the CRASH-moment (blood Culture collection, Resuscitation, and Antibiotics Started Here), as defined by Griffin and Moorman (2001), for the matched controls we extracted the 48 h immediately preceding an equivalent-in-time CRASH-moment, which was calculated after matching the GA and postmenstrual age (PMA). Since our study had a retrospective and non-invasive nature, a waiver was provided by the medical ethical committee in accordance with the Dutch law on medical research with humans (WMO).
Physiological signals for each patient, measured using Philips IntelliVue MX800 patient monitors (Philips Medical Systems, Böblingen, Germany), were obtained from a data warehouse (PIIC-iX, Data Warehouse Connect; Philips Medical Systems, Andover, MA). Three different physiological signals were considered to perform CA detection: the ECG and the CI signal, which were respectively measured at 250 and 62.5 Hz using three ECG leads, and the peripheral oxygen saturation (SpO2), which was measured at 1 Hz by means of a photoplethysmogram.
Annotation of apneic events was performed in our previous study (Varisco et al 2022) and required two main steps: (1) a first detection of suspected apneic events, consisting of both true and false CAs (i.e. false CA alarms), by means of the optimized algorithm for the detection of apneic events, and (2) manual annotation of the apneic events by two clinical experts in two rounds of annotations until a final consensus was reached. Events were annotated as CAs, rejections (i.e. cessations of breathing located in the CI which were not annotated as CAs, due to the presence of small oscillations in the respiratory signal which can possibly exclude a central origin for these apneic events, and thus considered false CA alarms) or artefacts (used in case of missing or corrupt signal). Matlab (R2022b, The MathWorks, Natick, Massachusetts, United States) was used for the annotation process, as well as for signal processing, to perform feature extraction and to build the datasets used to perform CA detection, as explained in the following sections.

Signal processing and feature extraction
From the patients' physiological signals we first extracted four different signals and subsequently computed a total of 47 features (Varisco et al 2022).
From the ECG we extracted the RR-intervals, a measure of the regulation performed by the autonomic nervous system on the cardiovascular system which was computed by detecting the R-peaks (Task Force of The European Society of Cardiology and The North American Society of Pacing and Electrophysiology 1996, Joshi et al 2020, Cabrera-Quiros et al 2021).From the ECG we also extracted the signal instability index (SII), a measure of patient motion which can be derived by applying a band-pass filter (0.001-0.40 Hz) to 10 s long ECG windows followed by a computation of a kernel density estimate (Joshi et al 2020, Cabrera-Quiros et al 2021).
From the CI we extracted the respiration ribcage respiratory effort (RRE), a filtered and normalized respiration signal which was computed by subtracting the mean from the CI, applying a tenth-order Butterworth low-pass filter (cutoff 0.8 Hz) to remove the high-frequency noise, and performing an amplitude normalization using the median peak-to-trough amplitude to remove the tidal volume for each patient (Redmond and Heneghan 2006). To ensure a comparable amplitude throughout the dataset, normalization was performed by computing the median peak-to-trough amplitude over the 48 h included for a patient; the CI was then normalized using this value, in order to obtain a median peak-to-trough amplitude equal to 1. Finally, the cardiorespiratory coupling (CRC) signal, a signal able to provide information regarding the interaction between the cardiovascular and the respiratory system, was computed by resampling the respiration RRE in correspondence of the R-peaks in the ECG (Long et al 2014). These four signals, together with the SpO2, were then used to perform the feature extraction.
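The amplitude normalization described above can be sketched as follows. This is a minimal illustration under our own assumptions: the function name and the crude extrema detector are ours, and the tenth-order Butterworth low-pass step is omitted, so this is not the authors' implementation.

```python
import numpy as np

def normalize_rre(ci):
    """Sketch: remove the mean from the CI, then scale the signal so
    that the median peak-to-trough amplitude equals 1, as described
    in the text. Extrema are found via sign changes of the first
    difference (a simplification of real peak detection)."""
    x = ci - np.mean(ci)
    d = np.diff(x)
    # local maxima: slope changes from positive to non-positive
    peaks = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    # local minima: slope changes from negative to non-negative
    troughs = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
    n = min(len(peaks), len(troughs))
    if n == 0:
        return x
    amp = np.median(np.abs(x[peaks[:n]] - x[troughs[:n]]))
    return x / amp if amp > 0 else x
```

For a clean sinusoidal input, the normalized output has a peak-to-trough amplitude of 1, matching the normalization target stated in the text.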
Feature extraction was performed using 30 s moving windows with an overlap of 5 s for the full dataset, to be able to capture quick changes happening in each premature infant. A total of 47 features were computed for each moving window. Twenty features were extracted from the RR-intervals: these included features for the heart rate variability computed in the time domain (e.g. Mean RR, standard deviation of the normal RR-intervals), in the frequency domain (e.g. power spectral densities (PSD) computed for 4 frequency ranges), as well as features derived using the phase-rectified signal averaging algorithm (e.g. immediate deceleration and acceleration response) (Task Force of The European Society of Cardiology and The North American Society of Pacing and Electrophysiology 1996, Bauer et al 2006, Kantelhardt et al 2007, Indic et al 2008). Four features were extracted from the SII (e.g. Mean SII, Standard Deviation of the SII) (Joshi et al 2020, Cabrera-Quiros et al 2021). Seventeen features were extracted from the respiration RRE: these included features computed in the time domain (e.g. Mean RRE, Skewness RRE, breath-by-breath correlation) and in the frequency domain (computed following the same frequency ranges defined for the RR-intervals) (Richman and Moorman 2000, Redmond and Heneghan 2006, Indic et al 2008). Three features were extracted from the CRC: these were found after computing the nonlinear visibility graph, a method able to describe a time series based on specific geometric criteria (e.g. Mean Degree of the Nodes, Degree Variation of Nodes) (Long et al 2014). Finally, three features were extracted from the SpO2 (e.g. Mean SpO2, Slope of the SpO2). A table with all the features included in the current study is presented in the appendix. Further details on the computation of these features can be found in our previous study (Varisco et al 2022).
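The moving-window scheme (30 s windows with a 5 s overlap, hence a 25 s stride) can be sketched as follows. The three toy features (mean, standard deviation, slope) only stand in for the 47 actual features, and all names are our own illustration.

```python
import numpy as np

def window_starts(n_samples, fs, win_s=30, overlap_s=5):
    """Start indices of win_s-second windows where consecutive
    windows share overlap_s seconds (i.e. a 25 s stride here)."""
    win = int(win_s * fs)
    step = int((win_s - overlap_s) * fs)
    return range(0, n_samples - win + 1, step)

def extract_features(signal, fs):
    """Toy per-window features standing in for the paper's 47:
    mean, standard deviation and least-squares slope."""
    win = int(30 * fs)
    feats = []
    for s in window_starts(len(signal), fs):
        w = signal[s:s + win]
        slope = np.polyfit(np.arange(win) / fs, w, 1)[0]
        feats.append((np.mean(w), np.std(w), slope))
    return np.array(feats)
```

With a 1 Hz signal of 105 samples this yields windows starting at 0, 25, 50 and 75 s, i.e. each pair of consecutive windows shares exactly 5 s.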

Windows selection and creation of training and test sets
Previous studies indicated that ML algorithms are sensitive to imbalance in the dataset, for instance when the minority class samples are severely outnumbered by majority class samples (Fairchild et al 2016). In small datasets, this type of imbalance can severely affect the learning process of various ML algorithms, which learn too much from the majority class and not enough from the minority one, as well as the evaluation of different metrics (e.g. accuracy, precision and recall) (He and Garcia 2019). It is also important to include a sufficiently high number of minority class samples in a dataset for a ML algorithm to learn from these samples (He and Garcia 2019). Considering the low count of CAs included in our dataset consisting of 20 premature infants, we used the following steps, which are also described in the pipeline included in figure 2, to generate the training set and test set used in this study.
In step 1, we excluded 5 out of the 20 patients since they presented a very low count of CAs compared to the others (CAs ≤ 5, resulting in a maximum of 68 30 s moving windows overlapping with CAs per excluded patient) and could have otherwise had a large effect on the imbalance. Since we used a matched dataset based on GA and PMA in order to exclude differences in heart rate variability features due to maturation (Chiera et al 2020), we excluded the matched patients as well. The dataset after the exclusion consisted of the 10 remaining patients. The characteristics of this dataset are presented in table 1.
In step 2, from the 48 h of data from all patients a total of 345 570 sets of 4 consecutive moving windows (4-windows-sets) were defined. For each moving window, the 47 features defined in the previous section were computed, resulting in a total of 188 features for each 4-windows-set. In step 3, since the clinical experts annotated all apneic events as CAs, rejections or artefacts, we used this information to determine a label for each 4-windows-set. All 4-windows-sets that presented at least one window with a minimum overlap (> 0 s) with a CA annotation were annotated as CA-sets, as shown in figure 3, resulting in 10 273 CA-sets in total (median count of 707 CA-sets per patient, maximum count equal to 3793, minimum count equal to 161). CA-sets were defined considering a brief overlap with a CA since we were interested in detecting an upcoming CA already at an early stage, with our detection model trained to recognize the upcoming CA using data from the preceding period.
The same procedure mentioned for CA annotations was applied to artefact annotations, resulting in artefact-sets, which were discarded together with 4-windows-sets that presented one or more 'NaN' feature values; these accounted for a total of 9739 4-windows-sets. All remaining 4-windows-sets, even in the event of an overlap with a rejection, were annotated as stable-sets (i.e. 4-windows-sets referring to stable periods for the premature infant). As a result of this step, a total of 335 831 4-windows-sets were left and constituted the test set used in this study. All the steps that are mentioned next were used for the definition of the training set used in this study.
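The overlap-based labeling of 4-windows-sets can be sketched as follows. The interval representation (start, end) in seconds and the order in which overlaps are checked are our assumptions for illustration, not the authors' implementation.

```python
def overlaps(win, event):
    """True if the interval win = (start, end) shares > 0 s with event."""
    return min(win[1], event[1]) - max(win[0], event[0]) > 0

def label_set(four_windows, ca_events, artefact_events):
    """Label one 4-windows-set following the rule in the text: any
    window overlapping a CA annotation makes it a CA-set; any window
    overlapping an artefact makes it an artefact-set (discarded);
    everything else, including rejection overlaps, is a stable-set.
    Checking CAs before artefacts is our assumption."""
    if any(overlaps(w, e) for w in four_windows for e in ca_events):
        return "CA-set"
    if any(overlaps(w, e) for w in four_windows for e in artefact_events):
        return "artefact-set"
    return "stable-set"
```

For example, four windows covering 0-105 s and a CA starting at 100 s already yield a CA-set, reflecting the "minimum overlap (> 0 s)" rule that lets the model learn from the period preceding the event.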
In step 4, after noticing the high imbalance that still characterized our dataset (10 273 CA-sets on a total of 335 831 4-windows-sets), we decided to perform a further sub-selection of 4-windows-sets useful to perform CA detection. Together with all the CA-sets (i.e. the minority class), we decided to first include the stable-sets that respected one of the following conditions:
• being one of the 7 stable-sets preceding the onset of a CA or following its end (i.e. a stable-set whose start and end were fully enclosed in the 1 min preceding or following a CA);
• being a stable-set originally annotated as a rejection by the clinical experts (i.e. a stable-set overlapping with an apneic event not annotated as a CA);
• being one of the 7 stable-sets preceding the onset of a rejection or following its end.
Their inclusion was considered relevant to verify the effect of the detection around different apneic events, assuming that if a detection model based on ML is able to distinguish stable-sets located very close to CA-sets, or with similar characteristics, it could also possibly perform well in detecting stable-sets located far from a CA. After this step, a total of 81 258 4-windows-sets were included and the ratio between CA-sets and stable-sets was therefore found to be around 1:7. However, this dataset was not considered representative enough of the original dataset since it was still quite small and did not include stable-sets located far from a CA (distant-stable-sets).
We decided to also perform additional inclusions from the pool of distant-stable-sets without building a severely imbalanced dataset. Literature regarding solutions to this problem is extensive and encompasses undersampling methods (e.g. random and informed undersampling) as well as oversampling methods (e.g. SMOTE) (He and Garcia 2019). Since there are no previous reports regarding the use of oversampling methods considering the signals and features included in this study, we preferred not to create additional artificial data for the minority class. In step 5, we therefore added additional distant-stable-sets for each patient by means of an undersampling method based on the K-nearest neighbor (KNN) algorithm, called NearMiss 3 by Zhang and Mani (2003), to extract a training set. This undersampling method selects majority class examples considering the minimum distance to each minority class example, and was selected since we prioritized a higher precision over a higher recall in order to reduce the number of false alarms, a characteristic that has been achieved by means of this undersampling method (Zhang and Mani 2003). NearMiss 3 was implemented for each patient independently from the others. Consistent with our previous choice in the definition of CA-sets, the number of majority class examples to be selected for each minority class example was set to 5.
As a result of all the steps described here, a training set which included a total of 132 623 4-windows-sets after NearMiss 3 (approximately 184 h of data, of which 7.74% were CA-sets) was defined to perform CA detection using ML.
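The NearMiss-3-style selection can be sketched as follows, under stated assumptions: this simplified version keeps, for each minority example, its 5 nearest majority examples by Euclidean distance, and omits the additional re-ranking step of the full method by Zhang and Mani (2003), so it is an illustration rather than the actual implementation.

```python
import numpy as np

def nearmiss3_simplified(X_maj, X_min, n_per_min=5):
    """Simplified NearMiss-3-style undersampling: for each minority
    example, keep the indices of its n_per_min nearest majority
    examples. Returns the sorted union of selected indices, so the
    retained majority samples are those closest to the minority
    class (which tends to favor precision, as discussed in the text)."""
    keep = set()
    for m in X_min:
        d = np.linalg.norm(X_maj - m, axis=1)
        keep.update(np.argsort(d)[:n_per_min].tolist())
    return np.sort(np.fromiter(keep, dtype=int))
```

The choice of 5 majority examples per minority example mirrors the setting reported in the text; in practice a library implementation (e.g. imbalanced-learn's NearMiss) would typically be used instead of this sketch.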

Machine learning: detection model and cross-validation
Logistic regression (LR) with elastic net penalty was selected as the ML algorithm to create a detection model since it is faster than many other algorithms (an important characteristic when large datasets are considered), allows for an easier interpretation of the results compared to other ML algorithms, and performed best using the two-step approach for CA detection (Varisco et al 2022). The detection model used in this study was trained using the training set extracted with the NearMiss 3 undersampling method, as discussed in section (2.3), while its performance was tested on the test set (i.e. the full dataset), therefore using all available 4-windows-sets. Leave-one-patient-out cross-validation (LOPO CV) was used since it suits clinical practice applications well, where each new patient admitted to a NICU is treated as a new test set. Scikit-learn in Python (Python Software Foundation, Fredericksburg, United States) was used to implement the detection model.
An overview of the steps performed during each iteration of the LOPO CV is shown in figure 4. During each iteration, the 4-windows-sets for the selected left-out patient were extracted from the test set and constituted an independent test subset (i.e. 4-windows-sets for this patient were not used for the training and validation of the model during the same iteration). The 4-windows-sets for the remaining 9 patients were instead extracted from the training set and separated using a 70/30 split ratio (70% constituting a training subset and 30% constituting a validation subset). LR with elastic net penalty was then tuned through hyperparameter optimization using a grid search approach.
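The LOPO CV splitting logic can be sketched as follows; the function name and the list-of-patient-ids representation are our own illustration.

```python
def lopo_splits(patient_ids):
    """Leave-one-patient-out splits: each unique patient in turn
    becomes the independent test subset while the samples of all
    remaining patients form the training pool, as described in the
    text. Yields (test_patient, train_indices, test_indices)."""
    for p in sorted(set(patient_ids)):
        test_idx = [i for i, q in enumerate(patient_ids) if q == p]
        train_idx = [i for i, q in enumerate(patient_ids) if q != p]
        yield p, train_idx, test_idx
```

Splitting by patient rather than by sample guarantees that no 4-windows-set from the left-out patient leaks into training or validation within the same iteration, which is the property the text emphasizes for clinical applicability.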
This allowed us to extract the training and validation scores and to choose the optimal set of hyperparameters, considering the AUROC as scoring function, based on two requirements: (1) a training score greater than a predefined threshold (Th class), a solution implemented to prevent underfitting, and (2) the minimum difference between the training and validation scores, a solution implemented to avoid overfitting. Th class was initially empirically set equal to 0.9. In case less than 30% of the training scores computed for all sets of hyperparameters were found to be greater than 0.9, its value was decreased by 0.01 each time until this requirement was met.
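The two-requirement selection, including the adaptive lowering of Th class, can be sketched as follows; the dictionary format mapping each hyperparameter set to a (train score, validation score) pair is our assumption.

```python
def select_hyperparameters(scores, th=0.9, step=0.01):
    """Sketch of the selection rule in the text: (1) keep only
    hyperparameter sets whose training AUROC exceeds Th class,
    lowering Th class by 0.01 while fewer than 30% of the sets pass,
    then (2) among the survivors pick the set with the minimum
    train-validation gap (to limit overfitting)."""
    # requirement (1) with the adaptive threshold
    while th > 0 and sum(tr > th for tr, _ in scores.values()) < 0.3 * len(scores):
        th -= step
    candidates = {k: v for k, v in scores.items() if v[0] > th}
    # requirement (2): minimum |train - validation| among survivors
    return min(candidates, key=lambda k: abs(candidates[k][0] - candidates[k][1]))
```

For example, a set with training/validation AUROCs of 0.92/0.90 is preferred over 0.95/0.80, since the smaller gap indicates less overfitting even though the training score is lower.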
Two hyperparameters were optimized following this cross-validation: the parameter C, a measure of the strength of regularization (with lower C values leading towards stronger regularization), and the L1 ratio, which shifts the algorithm penalty from a pure L2 penalty to a pure L1 penalty as its value moves from 0 to 1. The optimal hyperparameters were then selected and a detection model based on LR with elastic net penalty was trained using all 4-windows-sets for the previously mentioned 9 patients present in the training set, and then tested on the test subset. This repeated process allowed us to extract the annotations for all 4-windows-sets present in the full test set (i.e. ML-annotations). In case these received a CA annotation by the detection model based on LR with elastic net penalty, they were named CA-ML-annotations.

Machine learning: performance and effects evaluation
Nine different evaluations of the results obtained with the detection model based on the new direct approach were performed; these are summarized in figure 5.
The first two evaluations (i.e. evaluations 1.1 and 1.2) were grouped together under the name machine learning model evaluations since they were included to obtain a first understanding of the outputs of the detection model based on LR with elastic net penalty. Evaluation 1.1 consisted of the extraction of the mean AUROC over all LOPO CV iterations (i.e. considering each patient left out in the test set in turn). Evaluation 1.2 investigated the feature importance by extracting the coefficients from the detection model, presented as log odds ratios.
The following 3 evaluations (i.e. evaluations 2.1-2.3) were introduced to evaluate the results of the detection using all the 4-windows-sets included in the test set and were therefore grouped together under the name 4-windows-sets evaluations. Evaluation 2.1 was performed by extracting the confusion matrices considering different thresholds associated with the false positive rate (FPR, or fall-out) in the mean ROC curve. This was possible since we fixed an FPR value, chose the corresponding threshold and applied it to the probabilities returned by the detection model. In addition, the recall (TPR), precision (PPV), F1 score, miss rate (or false negative rate, FNR) and specificity (or true negative rate, TNR) were also extracted, in a similar way to what was done for the two-step approach (Varisco et al 2022). An optimal threshold characterized by a high precision and suitable for possible use in clinical practice was then selected and considered for the following evaluations. Evaluation 2.2 was performed by computing the percentages of CA-ML-annotations per patient in each group of 4-windows-sets: pre-CA-sets (i.e. the 7 stable-sets preceding the onset of a CA), CA-sets (separated considering all CA-sets and CA-sets that included more than 10 s of an apneic event, in agreement with most definitions of apnea of prematurity and with the duration that triggers an apnea alarm in most clinical patient monitoring systems), post-CA-sets (i.e. the 7 stable-sets following the end of a CA), pre-rejection-sets, rejection-sets, post-rejection-sets, distant-stable-sets (i.e. all possible stable-sets that were considered at step 5 of the pipeline described in figure 2) and within-apnea-sets (i.e. all stable-sets included between consecutive CAs or rejections which were separated by less than 14 stable-sets). Evaluation 2.3 was performed by computing the percentages per patient of CA-ML-annotations that were detected around the onset of the CAs annotated by the clinical experts. This evaluation allowed us to verify when alarms returned by the detection model were more likely to be triggered after the onset of the CAs. We performed this evaluation 5 times: by considering single CA-ML-annotations returned by the detection model, as well as with smoothing, which considered as true only CA-ML-annotations found for multiple (2, 3, 4 and 5) consecutive ML-annotations. This allowed us to evaluate the overall distribution of false positives in the whole dataset after progressively removing CA-ML-annotations not confirmed over a predefined number of consecutive 4-windows-sets, aiming at decreasing the number of false CA alarms in the whole dataset.
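The smoothing step can be sketched as follows; representing the sequence of ML-annotations as a boolean list is our assumption for illustration.

```python
def smooth_annotations(flags, n):
    """Sketch of the smoothing described in the text: a
    CA-ML-annotation is kept only if it belongs to a run of at least
    n consecutive positive ML-annotations; shorter runs are cleared,
    suppressing isolated false positives."""
    out = [False] * len(flags)
    i = 0
    while i < len(flags):
        if flags[i]:
            j = i
            while j < len(flags) and flags[j]:
                j += 1          # extend the current run of positives
            if j - i >= n:
                for k in range(i, j):
                    out[k] = True
            i = j
        else:
            i += 1
    return out
```

With n = 3, for instance, a run of two consecutive positive annotations is discarded while a run of three survives, which is why larger n values trade recall for precision, as reported in the results.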
The last four evaluations were performed to extract the number of alarms that would occur when using our detection model to perform CA detection on our population of premature infants, therefore simulating its use in a NICU environment. These were called alarm evaluations. Evaluation 3.1 was performed by computing the same metrics presented for evaluation 2.1, considering the total count of CAs and rejections, as annotated by the clinical experts, that received a minimum of 1 CA-ML-annotation from the detection model. This evaluation was performed by considering both the original CA-ML-annotations as well as smoothing. Furthermore, this evaluation allowed for the most appropriate comparison with the two-step approach used in our previous study, where we evaluated the results of the detection only around the onset of suspected apneic events (Varisco et al 2022). Evaluation 3.2 was performed by computing the percentages of all CAs and rejections with different lengths L that received a minimum of 1 CA-ML-annotation from the detection model, considering both the original CA-ML-annotations as well as smoothing. Following the indications included in previous studies (Mohr et al 2015a), CAs and rejections were separated into length ranges (the longest being L < 60 s and 60 s ≤ L < 80 s). Since very few CAs and rejections were found for some patients considering some values of L, this evaluation was performed considering all patients together. It allowed us to verify whether CAs and rejections characterized by a longer L were more likely to be detected as CAs by the detection model. Evaluation 3.3 was performed by extracting the computational time needed to perform CA detection using the direct approach and smoothing. This allowed a comparison with the computational time needed by the best model obtained with the two-step approach (Varisco et al 2022) on a desktop computer. Finally, in evaluation 3.4 we computed the count of false CA alarms per patient per hour, separated into false CA alarms overlapping with stable-sets, rejection-sets, and distant-stable-sets, as annotated by the clinical experts.

Machine learning model evaluations
Results for evaluation 1.1 are shown in figure 6, where the mean ROC curve computed on the test set by the detection model is represented.As a result of this evaluation, we obtained a mean AUROC value equal to 0.86.
The resulting feature relevance, investigated in evaluation 1.2, is shown in figure 7(A). This figure shows the median log odds ratios computed considering each iteration of LOPO CV for the twenty most relevant features, ranked considering their absolute values. Positive log odds ratios indicate increased risk towards CA-sets whereas negative values indicate increased risk towards stable-sets. It was noticed that features extracted from each of the used physiological signals played an important role in the detection, as can be seen from the SDNN, the Mean RRE, the Degree Variation of Nodes and the Standard Deviation of the SpO2 being included in the twenty most relevant features. It was however also noticed that most of the contribution to the detection came from features derived from the respiration RRE signal, whereas none of the features related to patient motion (SII) appeared in this list, indicating a lower contribution of the included movement features. Figure 7(B) shows the boxplots for the four most relevant features and their capability to separate CA-sets from stable-sets.

4-windows-sets evaluations
In evaluation 2.1, different FPR values and the corresponding thresholds in the mean ROC curve were used to extract the confusion matrices and additional metrics. In particular, results obtained with FPR equal to 0.1, 0.15 and 0.2 are presented in table 2. Since in this study we aimed at increasing the precision (i.e. reducing the number of false positives) at the expense of a decrease in recall (i.e. missing some CA-sets) in order to reduce the number of false alarms, an FPR equal to 0.1 was considered an optimal solution and was therefore selected for the following evaluations.
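Fixing the operating point at a given FPR can be sketched as follows; estimating the threshold as a quantile of the scores assigned to negative (stable) sets is our simplified reading of the procedure, not the authors' exact implementation.

```python
import numpy as np

def threshold_for_fpr(neg_scores, fpr=0.1):
    """Sketch: choose the probability threshold as the (1 - fpr)
    quantile of the scores the model assigns to negative (stable)
    sets, so that roughly a fraction `fpr` of them would exceed it
    and be flagged as false positives."""
    return np.quantile(np.asarray(neg_scores), 1.0 - fpr)
```

Applying this threshold to the probabilities returned by the model then yields the confusion matrix and derived metrics (TPR, PPV, F1, FNR, TNR) reported for each chosen FPR.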
The percentages per patient of CA-ML-annotations (i.e. 4-windows-sets that were annotated as CAs by the detection model) found for different groups of 4-windows-sets and computed in evaluation 2.2 are shown in figure 8. Median percentages of correctly detected CA-sets equal to 72.7% and 82.7% were respectively found when only CA-sets that included more than 10 s and more than 20 s of an apneic event were included, whereas a far lower median percentage, equal to 62.6%, was found considering all CA-sets. At the same time, a median percentage of wrongly detected distant-stable-sets equal to 7.6% was found, a result that was considered promising given the possibility to implement an additional smoothing in the following evaluations. A median percentage of wrongly detected rejection-sets equal to 21.9% was also found, possibly due to the similarity that some apneic events annotated as rejections might share with CAs.
Percentages per patient of CA-ML-annotations around the onset of the CAs computed in evaluation 2.3 are shown in figure 9. Irrespective of whether the original CA-ML-annotations were used or smoothing was applied, detection of CA-sets improved when more seconds of a CA were included. On the other hand, smoothing performed with a higher number of consecutive ML-annotations was responsible for a decrease in the detection of CA-sets. As an example, when no smoothing was performed, a median percentage of CA-sets overlapping with the first 25-30 s of a CA equal to 74.9% was detected; this percentage progressively decreased to 58.7% and 40% when smoothing was performed with 4 or 5 consecutive ML-annotations, respectively.
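Our reading of the smoothing step can be sketched as follows: a CA alarm is only issued once N consecutive 4-windows-sets have received a CA-ML-annotation. This is an illustrative stand-in, not the study's code, but it shows why higher N suppresses isolated (likely false) annotations while delaying and occasionally missing true ones.

```python
# Illustrative sketch: smoothing with N consecutive ML-annotations.
def smooth_annotations(ml_annotations, n_consecutive):
    """ml_annotations: list of 0/1 labels, one per 4-windows-set.
    Returns the smoothed sequence: 1 only after n_consecutive 1s in a row."""
    out = []
    run = 0
    for a in ml_annotations:
        run = run + 1 if a else 0   # length of the current run of 1s
        out.append(1 if run >= n_consecutive else 0)
    return out

# Toy example: isolated positives are suppressed, a long run survives
raw = [0, 1, 0, 1, 1, 1, 1, 1, 0]
smoothed = smooth_annotations(raw, 4)
```

With N = 4 the single positive at the second position is discarded, and the run of five positives only raises an alarm from its fourth element onward, which is the extra delay before a CA alarm discussed later in the paper.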

Alarms evaluations
An overview of the total count of CAs and rejections, as annotated by the clinical experts, that received a minimum of 1 CA-ML-annotation by the detection model was computed in evaluation 3.1 and is shown in table 3. Results in this table are presented considering the original CA-ML-annotations as well as smoothing. Results computed with the direct approach showed an increase in precision at the expense of a decrease in recall as the number of consecutive ML-annotations used for smoothing increased. In particular, the detection model followed by smoothing performed with 4 consecutive ML-annotations returned very similar results in terms of recall and precision compared to those achieved using the two-step approach (i.e. 0.75 versus 0.78, 0.45 versus 0.42) (Varisco et al 2022). Smoothing performed with 5 consecutive ML-annotations provided an increase in precision compared to the best results achieved with the two-step approach (i.e. 0.50 versus 0.42) and therefore led to a further decrease in the number of false CA alarms at the expense of a decrease in recall (i.e. 0.70 versus 0.78).

Figure 8. Boxplots for the percentages per patient of CA-ML-annotations found in different groups of 4-windows-sets: pre-CA-sets, CA-sets, CA-sets including more than 10 s of an apneic event, CA-sets including more than 20 s of an apneic event, post-CA-sets, pre-rejection-sets, rejection-sets, post-rejection-sets, distant-stable-sets and within-apnea-sets.
In evaluation 3.2, the percentages of all CAs and rejections with different lengths L that received a minimum of 1 CA-ML-annotation from the detection model were computed and are shown in figure 10. For the sake of simplicity, only results obtained by considering the original CA-ML-annotations and smoothing performed with 5 consecutive ML-annotations are shown. Percentages of correctly detected CAs always remained very high (i.e. ≥ 88%) when the original CA-ML-annotations were considered. They also remained high (i.e. ≥ 81%) irrespective of the different lengths L when smoothing was performed with 5 consecutive ML-annotations, with the only exception of CAs with 10 s ≤ L < 20 s, which were correctly detected 67% of the time. On the other hand, smoothing performed with 5 consecutive ML-annotations also significantly decreased the percentage of wrongly annotated rejections, with a minimum found for rejections with 10 s ≤ L < 20 s (i.e. 21%).
Evaluation 3.3 assessed the computational time needed to perform CA detection using the new direct approach and a desktop computer. This was found as the summation of: (1) the time overlap of the 4-windows-sets (including smoothing) with CAs, (2) the computational time needed to extract features and (3) the testing time needed by the detection model to work with 4 consecutive moving windows. This led to a total of 33 s and 38 s needed to perform CA detection with the direct approach when smoothing was performed with 4 and 5 consecutive ML-annotations, respectively. The computational time needed by the two-step approach on a desktop computer was found as the summation of: (1) the time overlap of the 4-windows-sets with CAs, (2) the time needed by the optimized algorithm to extract suspected apneic events and merge the close ones (Lee et al 2012, Varisco et al 2021b, 2022), (3) the testing time needed by the detection model to work with 4 consecutive moving windows and (4) the computational time needed to extract feature values (in case these cannot be extracted in parallel to the use of the optimized algorithm for CA detection). This led to a minimum of 33-38 s needed to perform CA detection with the two-step approach, a result in line with the one obtained using the direct approach and smoothing performed with 4 and 5 consecutive ML-annotations.

Table 3. Total count of CAs and rejections, as annotated by the clinical experts, that received a minimum of 1 CA-ML-annotation when considering the original CA-ML-annotations as well as smoothing performed with 2, 3, 4 and 5 consecutive ML-annotations. A comparison with the results from the best detection model using the two-step approach (Varisco et al 2022) is also provided.
Table 4 shows the results for evaluation 3.4, displaying the median count of false CA alarms per patient per hour separated considering stable-sets (i.e. complete dataset), rejection-sets and distant-stable-sets (i.e. stable-sets not overlapping with nor located within the 1 min surrounding CAs and rejections). A significant decrease in the median count of false CA alarms per patient per hour was noticed when smoothing was performed with progressively higher numbers of consecutive ML-annotations. In particular, smoothing performed with 5 consecutive ML-annotations allowed a reduction to 2.58 false CA alarms per patient per hour in distant-stable-sets.

Figure 10. Percentages of all CAs and rejections with different lengths L that received a minimum of 1 CA-ML-annotation. CAs and rejections were split considering the following values for L: 10 s ≤ L < 20 s, 20 s ≤ L < 30 s, 30 s ≤ L < 40 s, 40 s ≤ L < 60 s, 60 s ≤ L < 80 s. These were computed by considering (A) the original CA-ML-annotations as well as (B) smoothing performed with 5 consecutive 4-windows-sets.

Table 4. Median count of false CA alarms per patient per hour considering overlaps with stable-sets (i.e. complete dataset), with rejection-sets, and with distant-stable-sets (i.e. stable-sets not overlapping with nor located within the 1 min surrounding CAs and rejections). These were extracted considering the original CA-ML-annotations as well as smoothing performed with 2, 3, 4 and 5 consecutive ML-annotations.

Discussion
Apnea detection in premature infants, performed in current clinical practice by means of standard monitoring techniques, is characterized by a high number of false alarms. In this paper, we proposed a new approach to directly detect CAs from multichannel signals using ML without any additional step, while focusing on minimizing false alarm rates. We then compared our results with those achieved using our two-step approach (Varisco et al 2022).
Our detection model based on LR with elastic net penalty showed a mean AUROC value equal to 0.86 (evaluation 1.1), a result found promising for performing CA detection when combined with the different additional metrics extracted in this study. Features extracted from all physiological signals included in this study played a significant role in achieving this result (evaluation 1.2). An important added value of the current study is that these results were obtained considering a complete dataset, including both an extensive set of stable periods and apneic events, whereas the two-step approach was only investigated for its performance around the onset of suspected apneic events (Varisco et al 2022). Since both approaches use the same sets of features, the predictive value found in both studies is considered promising for the development of direct detection methods in clinical patient monitors. A comparison with other literature studies that investigated apnea detection in premature infants is difficult, considering that most studies used very different sets of features and included demographic information and maternal covariates together with physiological values from entire patient records (Mago et al 2016, Shirwaikar et al 2016, Shirwaikar et al 2019). One of the best comparisons can be made with the study of Zuzarte et al, who targeted apneic events accompanied by bradycardia and hypoxia using a selection of events from a dataset and features extracted from physiological signals (Zuzarte et al 2021). In contrast to that study, we found that movement features did not play a relevant role in the detection of CAs. This result might be due to the different approach used to extract movement features, but future research on this aspect is needed.
To prevent the occurrence of too many false CA alarms, similar to what frequently happens in clinical practice (Cvach 2012, Keller 2012), we aimed at increasing the precision returned by our detection model even at the expense of a slight decrease in recall. This was achieved by (1) identifying an optimal FPR equal to 0.1 (evaluation 2.1) and (2) introducing smoothing performed with consecutive ML-annotations (evaluation 2.3 and onwards). The selected FPR alone proved optimal in separating 4-windows-sets overlapping with CAs (i.e. CA-sets) from 4-windows-sets distant from any suspected apneic event (distant-stable-sets), since it returned a very high median percentage of correctly detected CA-sets (i.e. 82.7% for CA-sets that included more than 20 s of an apneic event) and a low median percentage of wrongly detected distant-stable-sets (i.e. 7.6%) (evaluation 2.2). At the same time, however, this FPR returned a relatively high median percentage of wrongly detected rejection-sets (i.e. 21.9%). This result is possibly also related to the fact that, during the annotation process, the clinical experts noticed similarities between some apneic events annotated as CAs and some annotated as rejections, a characteristic also previously described in literature for the annotation of mixed apneas (Mathew 2011, Picone et al 2014, Eichenwald and Committee on Fetus and Newborn 2016). Rejections were not used to indicate signal artefacts but apneic events that did not comply with all characteristics of CAs, even though they contained some characteristics of apneas and might have clinical relevance.
Smoothing, which provided a sort of annotation averaging over time, proved particularly useful in improving precision at the expense of a decrease in recall around suspected apneic events (evaluations 3.1 and 3.2). However, smoothing also introduced additional time before a CA alarm was issued. Results of the detection around suspected apneic events allowed for a comparison with those obtained with the two-step approach and showed increased precision at the expense of a decrease in recall when smoothing was performed with 5 consecutive ML-annotations (i.e. 0.50 versus 0.42, 0.70 versus 0.78). In addition, smoothing performed with 5 consecutive ML-annotations returned correct detections for most long CAs (i.e. ≥ 81% for length L ≥ 20 s), which are considered among the most threatening apneic events for premature infants (Finer et al 2006, Mohr et al 2015a), while at the same time significantly reducing wrong detections for short rejections (i.e. < 21% for L < 20 s), which could cause several false CA alarms in clinical practice. When compared instead with the results obtained by Zuzarte et al, who also detected apneic events while considering only a set of events (Zuzarte et al 2021), our direct approach with smoothing performed with 5 consecutive ML-annotations achieved the same recall (i.e. 0.70) and a lower fall-out (i.e. 0.20 versus 0.34), a result that indicates a lower occurrence of false alarms. Further evaluations of the detection in the time periods that were excluded from their dataset could provide a better indication of which method performs best in a complete dataset.
Compared to other studies in literature that investigated the occurrence of apnea of prematurity in entire patient records (Mago et al 2016, Shirwaikar et al 2016, Shirwaikar et al 2019), our direct approach takes one step further towards detection of CAs soon after their occurrence, allowing clinicians to monitor each change happening to their patients instead of only evaluating entire patient records retrospectively. One current limitation that still prevents actual direct detection relates to the normalization of the CI by means of the median peak-to-trough amplitude computed considering the 48 h included for a patient. While this solution guaranteed a median peak-to-trough amplitude equal to 1 for each patient, in clinical practice it could be substituted by a normalization with the median peak-to-trough amplitude computed considering the first 24 h of life, since different studies pointed out that the first occurrence of apnea of prematurity happens within 1-2 d after birth (Barrington and Finer 1991, Mohr et al 2015b). This solution could also be compared with a continuous adjustment for each 30 s moving window by means of the median peak-to-trough amplitude from the previous 24 h spent in the NICU. We hypothesize that both these solutions would still allow an accurate estimate of the tidal volume from different respiratory cycles and of the full range of the transthoracic impedance derived from the use of the 3-lead ECG electrodes, but future studies should further confirm this hypothesis.
The false CA alarm rate found in the complete dataset was significantly reduced when including smoothing performed with a high number of consecutive ML-annotations (evaluation 3.4). Smoothing performed with 5 consecutive ML-annotations returned 2.58 false CA alarms per patient per hour considering all distant-stable-sets. It was previously reported that up to 65% of the apnea alarms sounding in clinical practice are false (Lee et al 2012, Amin and Erica 2013), but it is difficult to find studies that report an actual false apnea alarm rate per hour. We can therefore only relate the results obtained in this study to what happens in our NICU, where apnea alarms are oftentimes turned off, a practice that suggests the need for more robust algorithms than the ones currently implemented in patient monitors. False CA alarms were often found in the 1 min surrounding CAs and rejections, supporting previous reports indicating that apneic events often occur close to each other (Joshi et al 2016, Varisco et al 2022). The apneic events triggering these false CA alarms might also be short enough (i.e. < 5 s) not to be picked up by the optimized algorithm for CA detection (Lee et al 2012, Mohr et al 2015a, Varisco et al 2021b) which preceded the annotations from our clinical experts. Future studies might therefore be performed to annotate the events that we considered as false CA alarms in this study.
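The per-patient per-hour rate reported above is a simple normalization; a sketch with hypothetical numbers makes the unit explicit:

```python
# Illustrative sketch: normalize a false-alarm count to alarms per hour.
# The counts and durations below are made up, not taken from the study.
def false_alarms_per_hour(false_alarm_count, monitored_seconds):
    return false_alarm_count / (monitored_seconds / 3600.0)

# e.g. 12 false CA alarms over 4 monitored hours -> 3.0 per hour
rate = false_alarms_per_hour(12, 4 * 3600)
```

In the study this quantity is computed per patient and then summarized as a median across patients, which is more robust to a single patient with an unusually high alarm count.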
An important requirement for apnea detection is that apneas are detected in time to allow for clinical action. The computational time needed to perform CA detection with the direct approach is similar to that achieved using the two-step approach and is in the order of 33-38 s from CA onset (evaluation 3.3). This can be considered a reasonable time, especially considering that it provides clinicians with a clear annotation of the detected apneic event according to most definitions (Zhao et al 2011, Mohr et al 2015a, Eichenwald and Committee on Fetus and Newborn 2016, Fairchild et al 2016).
This study showed that performing CA detection with the new direct approach based on multichannel signals is feasible and can lead to an increase in precision and a reduction of false CA alarms at the expense of a decrease in recall compared to the two-step approach. An additional benefit of the direct approach is that it performs CA detection considering only the physiological signals extracted from premature infants, without requiring an additional step that needs tuning and is also prone to errors (i.e. the use of the optimized algorithm for CA detection). This study has several strengths, including an extensive evaluation of different metrics and of the effect of smoothing, and an evaluation of CA detection not only around suspected apneic events but also in the complete dataset, which is unprecedented for studies aiming at detecting apneic events in premature infants.
Despite the increase in precision compared to the two-step approach and the other advantages that the direct approach holds over other studies focusing on apnea detection present in literature, we consider the still improvable performance in terms of both recall and precision (i.e. false CA alarms) a limitation that should be addressed in future studies. The inclusion of additional new features and more detailed annotations of the suspected apneic events into more apnea classes (i.e. depending on their physiological origin) could potentially reduce the false alarms returned by future detection models.
In addition, the current dataset is small. A bigger dataset could further improve the CA detection results achieved in the current study. Nonetheless, it is important to mention that the limited dataset used here still includes a higher number of continuous hours of data on which CA detection is performed compared to other datasets used in literature with a similar aim (Williamson et al 2013, Mago et al 2016, Shirwaikar et al 2016, Shirwaikar et al 2019, Lim et al 2020, Zuzarte et al 2021). This limitation is primarily due to the fact that annotating CAs in several hours of patient data is very time consuming. Future studies should however consider the inclusion of the 5 patients from our dataset that presented a very low count of CAs, as well as the matched patients. More patients, possibly from other hospitals, should also be included in the future to validate the results on an external dataset. Furthermore, the best methods to generate more balanced training sets should be further investigated, considering both undersampling methods and oversampling methods (such as SMOTE) (Zhang and Mani 2003, He and Garcia 2019).
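The study used NearMiss 3 undersampling; as a simpler stand-in, the sketch below shows plain random undersampling of the majority (stable-set) class down to the minority (CA-set) count. All names and data here are illustrative, and NearMiss additionally selects majority samples by their distance to minority samples rather than at random.

```python
# Illustrative sketch: random undersampling of the majority class to
# balance a training set (simpler than the NearMiss 3 method used in
# the study, which picks majority samples by nearest-neighbour distance).
import random

def random_undersample(X, y, seed=0):
    rng = random.Random(seed)          # fixed seed for reproducibility
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    keep_neg = rng.sample(neg, len(pos))  # drop excess majority examples
    idx = sorted(pos + keep_neg)
    return [X[i] for i in idx], [y[i] for i in idx]

# Toy imbalanced set: 7 stable-sets (0) versus 3 CA-sets (1)
X = [[float(v)] for v in range(10)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
Xb, yb = random_undersample(X, y)
```

After undersampling, the two classes are equally represented, which is what allows the detection model to weigh CA-sets and stable-sets evenly during training.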
Using demographic data such as GA, PMA and birth weight could also help in the task of apnea detection, since this information has already played an important role in the detection of other conditions occurring in premature infants (Peng et al 2022). However, this information could also hinder direct implementation into clinical patient monitors, since such demographic data is typically stored in the electronic medical records but not entered into the patient monitor.
Finally, in this study only a detection model based on LR with elastic net penalty was considered, both for computational reasons (i.e. LR is a much faster algorithm compared to others, such as SVM, when large datasets are considered) and because it performed very well with the two-step approach (Varisco et al 2022). However, future studies would definitely benefit from a comparison with detection models based on different ML or deep learning algorithms.
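For reference, the model family behind the detection model can be sketched as below. This is a minimal gradient-descent implementation on toy data, not the solver or the data used in the study; it only illustrates the elastic net penalty lam * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2) added to the logistic loss.

```python
# Illustrative sketch: logistic regression with an elastic net penalty,
# fitted by plain (sub)gradient descent on a toy 1D dataset.
import math

def train_lr_elastic_net(X, y, lam=0.01, alpha=0.5, lr=0.1, epochs=500):
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi   # sigmoid(z) - y
            for j in range(d):
                gw[j] += err * xi[j] / n
            gb += err / n
        for j in range(d):
            # subgradient of the elastic net penalty (L1 part is 0 at w=0)
            sign = 1.0 if w[j] > 0 else -1.0 if w[j] < 0 else 0.0
            gw[j] += lam * (alpha * sign + (1 - alpha) * w[j])
            w[j] -= lr * gw[j]
        b -= lr * gb
    return w, b

# Toy separable data: feature > 0 -> class 1
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_lr_elastic_net(X, y)
```

The fitted coefficients are the log odds ratios discussed in evaluation 1.2; the L1 part of the penalty pushes uninformative coefficients towards zero, which is what makes elastic net attractive for the 47-feature setting of the study.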

Conclusion
In this study we developed a direct approach to perform CA detection in premature infants based on multichannel signals using ML, with the aim of reducing complexity and improving precision compared to a two-step approach that we recently developed. We trained a detection model based on LR with elastic net penalty and evaluated the CA detection results in a complete dataset consisting of both suspected apneic events and stable periods.
Our detection model returned a mean AUROC value equal to 0.86 and, after the selection of an FPR equal to 0.1, a median percentage per patient of correctly detected CA-sets including more than 20 s of an apneic event equal to 82.7%. By also applying smoothing performed with 5 consecutive ML-annotations, we achieved an increased precision (i.e. 0.50 versus 0.42) at the expense of a slight decrease in recall (i.e. 0.70 versus 0.78) compared to the two-step approach around suspected apneic events, guaranteeing a minimum of 81% of correctly detected CAs with length L ≥ 20 s.
These results are considered promising for application in clinical practice, considering the high number of false apnea alarms that sound every day in NICUs, but further optimizations and validations using more extensive datasets are still needed.

Figure 1. Comparison between the two-step approach (Varisco et al 2022) and the new direct approach from this study.

Figure 2. Pipeline to perform a selection of 4-windows-sets and generate a more balanced training set between CA-sets and stable-sets.

Figure 4. Overview of the steps performed during each iteration of the leave-one-patient-out cross-validation (LOPO CV).

Figure 5. Overview of the different evaluations of the detection model included in this study.

Figure 6. Mean ROC curve and AUROC value computed by using the detection model based on logistic regression (LR) with elastic net penalty, trained using a training set that received NearMiss 3 undersampling and tested on the test set (i.e. the full dataset).

Figure 7. (A) Feature relevance extracted from the detection model based on logistic regression (LR) with elastic net penalty, trained using a training set that received NearMiss 3 undersampling and tested on the test set, and (B) boxplots for the four most relevant features separated considering CA-sets and stable-sets. Feature relevance was investigated by extracting the coefficients, indicating the log odds ratios. Median log odds ratios were then computed and ranked considering their absolute values to represent the twenty most relevant features.

Figure 9. (A) Percentages per patient of CA-ML-annotations found around the onset of CAs computed by considering the original CA-ML-annotations. (B) Median percentages per patient of CA-ML-annotations found around the onset of CAs computed by considering the original CA-ML-annotations as well as smoothing performed with 2, 3, 4 and 5 consecutive ML-annotations.

Table 1. Characteristics of the patients included in the analysis.

Table 2. Confusion matrices and metrics found considering a false positive rate (FPR) equal to (a) 0.1, (b) 0.15 and (c) 0.2 in the mean ROC curve.