
FASSNet: fast apnea syndrome screening neural network based on single-lead electrocardiogram for wearable devices


Published 27 August 2021 © 2021 Institute of Physics and Engineering in Medicine
Citation: Yunkai Yu et al 2021 Physiol. Meas. 42 085005. DOI: 10.1088/1361-6579/ac184e

0967-3334/42/8/085005

Abstract

Objective. Sleep apnea (SA) is a chronic condition that fragments sleep and results in intermittent hypoxemia, which in the long run leads to cardiovascular diseases such as stroke. Diagnosis of SA through polysomnography is costly and inconvenient, and has a long waiting list. Wearable devices provide a low-cost solution for ambulatory detection of SA syndrome in undiagnosed patients. One class of wearables relies on minute-by-minute analysis of a single-lead electrocardiogram (ECG) signal. Processing ECG segments online on wearables contributes to memory conservation and privacy protection in long-term SA monitoring, and light-weight models are required because computation resources are stringent. Approach. We propose the fast apnea syndrome screening neural network (FASSNet), an effective end-to-end neural network for minute-apnea event detection. Low-frequency components of the filtered ECG spectrogram are selected as input. The model initially processes the spectrogram via convolution blocks. Bidirectional long-short-term memory blocks are used along the frequency axis to complement position information of frequency bands. Layer normalisation is implemented to retain in-epoch information, since apnea periods have variable lengths. Experiments were carried out on 70 recordings of the Apnea-ECG database, where each 60 s ECG segment is manually labelled as an apnea or normal minute by a technician. Both ten-fold and patient-agnostic validation protocols are adopted. Main results. FASSNet is light-weight: its number of model parameters and multiply accumulates are 0.06% and 28.33% of those of an AlexNet benchmark, respectively. Meanwhile, FASSNet achieves an accuracy of 87.09%, a sensitivity of 77.96%, a specificity of 91.74%, and an F1 score of 81.61% in apnea event detection. Its accuracy of diagnosing SA syndrome severity exceeds 90% under the patient-agnostic protocol.
Significance. FASSNet is a computationally efficient and accurate neural network for wearables to detect SA events and estimate SA severity based on minute-level diagnosis.


1. Introduction

Sleep apnea (SA) syndrome, including apnea and hypopnea, is a chronic condition that carries serious health risks. American Academy of Sleep Medicine (AASM) guidelines note that patients with untreated SA syndrome may have a higher risk of developing chronic health consequences such as diabetes, and cardiovascular diseases such as stroke (Kushida et al 2005, Iber et al 2007, Lam et al 2012, Kapur et al 2017). According to the report of an American Academy of Sleep Medicine Task Force (1999) and Kapur et al (2017), an apnea event is defined as an intermittent cessation of breathing for over 10 s, and a hypopnea is a reduction of breathing volume to below 50%, associated with a decrease of at least 4% in oxygen saturation. Guilleminault et al (1978) defined SA syndrome according to the number of apnea events per night. The apnea-hypopnea index (AHI), the average number of apnea and hypopnea events per hour, is widely used to evaluate SA syndrome severity. Polysomnography (PSG) is the gold standard for SA detection (Moody et al 2000). However, it is not an optimal solution for home use because of economic constraints and limited medical resources. PSG collects multi-channel signals such as the electrocardiogram (ECG) and electrooculogram, which leads to an unfamiliar sleeping environment, and diagnosis through PSG has a long waiting list.

SA event detection and apnea severity estimation are major outcomes in continuous and ambulatory monitoring of SA syndrome, and wearable devices have become an emerging option for online diagnosis. Processing physiological signals on wearable devices has advantages in real-time performance, large-scale application, and user privacy protection compared to transferring raw data to an authorized cloud (Shi et al 2016), and a real-time analysis scheme frees wearables from storing large volumes of whole-night data. ECG is accessible on medical and consumer-grade wearable devices (Yamakawa et al 2014, Fujiwara et al 2016, Surrel et al 2018, Xu et al 2020). SA symptoms were found to be associated with cardiovascular sequelae (He et al 1988), which implies that the ECG signal has potential for detecting SA syndrome. The modulation between multi-lead ECG and respiration reveals that the ECG signal can be used as an alternative to the respiratory signal.

The feasibility of the single-lead ECG signal in SA syndrome diagnosis has been verified in recent years. The Apnea-ECG database (Goldberger et al 2000, Penzel et al 2000) has been a benchmark database for testing algorithms for minute-apnea event detection. In contrast to the AASM standard, it estimates AHI from minute-level diagnosis when evaluating apnea severity. Features derived from heart-rate variability (HRV) parameters can be implemented on mobile devices in an IF-THEN rule set (Sannino et al 2014). The large amplitude of R-peaks makes HRV noise-resistant for SA screening. This method reported an accuracy of 0.8857 on a subset of the Apnea-ECG database. However, an IF-THEN rule set, which is based on the values of HRV parameters, is personalized because of the discrepancy among patients. Time and frequency domain features of HRV achieved an AUC score of 0.91 (Nakayama et al 2019). Designing these features relies on expert knowledge of SA syndrome. One example is the number of pairs of adjacent intervals between two successive R waves (RR intervals) that differ by more than 50 ms. Another is the band definition in the power spectrum of RR intervals, where the low-frequency band is 0.04–0.15 Hz and the high-frequency band is 0.15–0.4 Hz. Integrating features from the ECG-derived respiration (EDR) signal is a common strategy. Band power and statistical features such as mean value, standard deviation, skewness and kurtosis are derived from the EDR signal as well as from RR intervals (Ravelo-Garcia et al 2015, Varon et al 2015, Song et al 2016), and Song et al reported an accuracy of 0.862. Modern techniques generate the EDR signal from the single-lead ECG signal. The EDR signal is constructed on the basis of the variance of the ECG signal around each R-peak, or a phase reconstruction feature from QRS waves (Janbakhshi and Shamsollahi 2018). In these works, deriving accurate EDR signals is an essential prerequisite.
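As an illustration of the hand-crafted HRV features described above, the following minimal sketch counts the adjacent RR-interval pairs whose difference exceeds 50 ms. The function name `nn50` and the sample intervals are hypothetical and are not taken from the cited studies.

```python
def nn50(rr_ms, threshold_ms=50.0):
    """Count adjacent RR-interval pairs that differ by more than threshold_ms."""
    return sum(
        1
        for prev, cur in zip(rr_ms, rr_ms[1:])
        if abs(cur - prev) > threshold_ms
    )


# Hypothetical RR intervals in milliseconds (illustrative values only).
rr = [800, 860, 850, 790, 795, 900]
count = nn50(rr)
```

A feature of this kind depends on reliable R-peak detection, which is the expert-driven step that end-to-end models aim to avoid.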

Deep neural networks (DNNs) use the ECG signal directly in the SA screening scenario (da Silva Pinho et al 2016, Pathinarupothi et al 2017, Li et al 2018), unlike feature-based methods (Hassan and Haque 2016, 2017, Zarei and Mohammadzadeh Asl 2020). When a DNN was used to extract features, the classifier achieved an accuracy of 0.847 on segments of all 70 recordings of the Apnea-ECG database (Li et al 2018). End-to-end DNNs undertake both feature generation and classification, and require the least expert knowledge in SA screening. Wang et al trained an end-to-end deep ResNet (Wang et al 2019a). The memory required for saving the deep ResNet is around 33 MB. However, the clinical outcomes of neural networks are inferior to those of state-of-the-art feature-based methods, and implementing these networks on wearable devices remains a major challenge because the computation resources of wearable devices are extremely stringent. For example, the wearable platform of Surrel et al used a microcontroller with only 384 KB of flash storage (Surrel et al 2018). Mendonça et al addressed the importance of an optimal performance-complexity ratio (Mendonça et al 2019). This encourages us to make trade-offs between DNN complexity and accuracy in SA syndrome detection.

We propose the fast apnea syndrome screening neural network (FASSNet), a light-weight neural network, to provide wearables with accurate and continuous minute-by-minute apnea diagnosis; apnea severity can then be estimated following the Apnea-ECG protocol. Our strategy for finding efficient model structures is to identify and exclude redundant computation in the SA screening scenario. High accuracy is ensured by feeding FASSNet with a spectrogram, because spectro-temporal features have been effective in identifying SA events (Jarvis and Mitra 2000, McNames and Fraser 2000). The low-frequency dominant nature of ECG spectrograms leads to our light-weight design of the model input. To ensure that FASSNet produces reliable results, we carried out patient-agnostic validation and mechanism analysis. In summary, FASSNet proves to be a workable solution for SA screening with wearable devices.

2. Methods

The overall workflow of the algorithm is shown in figure 1. The raw ECG signal is turned into a small spectrogram after three data pre-processing steps. FASSNet outputs minute-by-minute predictions. Evaluation metrics and experiments are also introduced in this section.


Figure 1. Workflow of the proposed algorithm. The desired outcome of each data pre-processing step is given in brackets. STFT denotes short-time Fourier transformation, and HF denotes high-frequency components. Baseline wander is removed in STFT.


2.1. Data pre-processing

Pre-processing steps are designed to facilitate the learning process of FASSNet. In this study, the sampling frequency of the raw ECG signal is 100 Hz. Interpolation methods can be used if ECG segments have different sampling frequencies. The data pre-processing scheme is illustrated in figure 2. Signal smoothing is the first step. High-quality RR waves have proved useful in previous studies (Sannino et al 2014, Nakayama et al 2019). Hence, we adopt a Chebyshev type-II low-pass filter, which produces trimmed RR waves by smoothing the raw ECG signal. Spectrograms of 60 s ECG segments are selected in view of their well-established effectiveness (Jarvis and Mitra 2000, McNames and Fraser 2000). They are calculated on the basis of the power spectral density via STFT, and provide spectro-temporal patterns of SA syndrome. Baseline wander removal is performed during STFT. High-frequency components in a typical ECG epoch are negligible after signal smoothing (figure 2). Yu et al demonstrated that they are redundant in abnormality diagnosis (Yu et al 2019). Hence, we conclude that high frequencies in the ECG signal are redundant for SA syndrome screening, and that removing them has a limited adverse influence on accuracy. Removing the high-frequency part of the spectrogram reduces the size of the model input, which reduces computation cost. A 'preserve-edge-information' design selects the input size along the frequency axis on the basis of the cut-off frequency (Fs) for signal smoothing. The upper bound frequency of the model input is marginally higher than Fs, which enables models to preserve edge information with minimal input size. The mapping relation can be formulated as equation (1):

Equation (1)

where 65 is the maximum position index of the frequency array in the spectrogram, and the $ceil(\cdot )$ operation returns the smallest integer that is not less than its input. Spectrogram patterns are misleading when ECG signals are completely missing. To avoid wrong feedback during model training, we eliminate epochs with absent signal, which are identified by a low standard deviation.
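The mapping from cut-off frequency to input size can be sketched as follows. The exact form of equation (1) is not reproduced here, so the formula in `input_height` is an assumption: it was chosen because it reproduces the input heights 20, 26 and 33 listed in table 5, which are taken here to correspond to cut-off frequencies of 28, 38 and 48 Hz. The function name is hypothetical.

```python
import math

MAX_INDEX = 65  # maximum position index of the frequency array (from the text)


def input_height(cutoff_hz):
    """Assumed mapping from the cut-off frequency (Hz) to the input size
    along the frequency axis."""
    # Assumption: the index scales with cutoff/100, and one extra row is kept
    # so that the input reaches marginally above the cut-off frequency.
    return math.ceil(MAX_INDEX * cutoff_hz / 100) + 1


heights = {fs: input_height(fs) for fs in (28, 38, 48)}
```

Under this assumed mapping, raising the cut-off by 10 Hz adds six or seven rows to the model input.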


Figure 2. Data pre-processing scheme. In the top two panels, a low-pass filter smooths the ECG signals to produce trimmed RR waves and reduce high-frequency noise, as shown inside the boxes and in the rightmost columns. We discard high-frequency components of the spectrograms to decrease the size of the model input, since high-frequency energy is negligible. The vertical axis of the spectrogram ranges from 0 to 65, which corresponds to a 0–100 Hz frequency interval.


2.2. FASSNet

The FASSNet architecture is shown in figure 3. The model comprises two stages, namely local spectrogram pattern learning and long-range information learning. Convolutional layers extract features automatically during the local spectrogram pattern learning stage. In the long-range information learning stage, we apply bidirectional long-short-term memory (Bi-LSTM) blocks along the frequency axis, prioritising position information across frequency bands and avoiding redundant computation along the time axis. The reason is that there is no need to identify the onset of an SA event during minute-by-minute SA syndrome screening, whilst distinguishing high- and low-frequency bands contributes to effective features (Ravelo-Garcia et al 2015, Varon et al 2015, Song et al 2016). Layer normalisation is a key component of FASSNet that substantially improves classification performance at low computation cost. It gives us insight into designing a robust model for SA syndrome detection.


Figure 3. FASSNet architecture. Blocks in black denote vector operations. The model extracts local spectrogram patterns on the basis of convolutional layers (kernel size = 3, stride = 1). All convolutional layers have 64 kernels, which is a fine-tuned result validated in section 4.1. Bi-LSTM block (hidden size = 16 and depth = 2) and attention block aggregate temporal and channel-wise pixels along the frequency axis for representative long-range information. FC denotes the fully connected layer. The normalisation method of the output of the hidden layers is layer normalisation (LN). Dropout layers randomly deactivate 50% of neurons in the training process.


2.2.1. Local spectrogram pattern learning

Small kernels are adopted to capture fine-grained spectrogram details (Simonyan and Zisserman 2014). We place a pre-activation block using two 2D convolutional layers in the early stage, followed by a layer normalisation block, a rectified linear unit (ReLU) activation function and a dropout block with a probability of 0.5. The ReLU activation function outputs the maximum of its input and 0 (Nair and Hinton 2010). Our study uses more convolutional layers in the pre-activation block than He et al (2016). Afterwards, two convolutional blocks sequentially perform the following operations. All convolutional layers in FASSNet have the same width. Dropout randomly deactivates neurons during training to promote efficient representations and make neural networks generalisable (Srivastava et al 2014). Each pixel of the output is derived from a sufficiently long ECG segment. This is necessary for reporting an apnea/hypopnea event or a normal duration, which usually lasts over 10 s according to Quan et al (1999). Lastly, an adaptive max-pooling layer is adopted to select prominent patterns. In this way, the model automatically adapts to variations in input size. The duration covered by each pixel from local spectrogram learning is viewed as a hyper-parameter of FASSNet. It can be fine-tuned by changing the number of convolutional layers in the pre-activation block or in the convolutional blocks, and empirical experiments are shown in section 4.1.

2.2.2. Long-range information learning

Convolutional layers are insensitive to the location of their input (Liu et al 2018). This suggests that the position information of the spectrogram in our convolution outcomes is insufficient. Position information is unevenly distributed along the time and frequency axes of the spectrogram (figure 2). High-frequency noise can have the same negative impact as low-frequency noise if no position information of frequencies is provided. Furthermore, confusing the low- and high-frequency components of an ECG segment may lead to a misleading diagnosis by FASSNet. In contrast, location information along the time axis in the CNN output is less important, because a short ECG segment of the Apnea-ECG recordings can be classified as an apnea epoch without information about the exact onset time of an SA event. Therefore, a Bi-LSTM network and an attention layer are implemented along the frequency axis to bring long-range frequency correlations into the model decision. The Bi-LSTM is a concatenation of forward and backward long-short-term memory (LSTM) networks, as shown in equations (2)–(5), and the attention layer is expressed in equations (6)–(7):

Equation (2)

Equation (3)

Equation (4)

Equation (5)

Equation (6)

Equation (7)

where $LST{M}^{f}$ and $LST{M}^{b}$ denote the forward and backward LSTMs, respectively. An LSTM (Hochreiter and Schmidhuber 1997) takes an N-length data sequence $\left\{{x}_{t},\,t=0,\,1,\,2,\,3,\ldots ,\,N\right\}$ as input. Their outputs ${o}_{t}^{f}$ and ${o}_{t}^{b}$ contain the output features ${h}_{t}^{f}$ and ${h}_{t}^{b}$ from the last LSTM layer. The LSTM cell states are denoted as ${c}_{t}^{f}$ and ${c}_{t}^{b}$, which control the information flow and preserve long-range information. $LST{M}^{f}$ leverages past information ${h}_{t-1}^{f}$ and ${c}_{t-1}^{f}$ to determine the current feature, and $LST{M}^{b}$ leverages future information ${h}_{t+1}^{b}$ and ${c}_{t+1}^{b}$. Bi-LSTM is powerful in extracting both forward and backward contextual information (Graves et al 2005). In FASSNet, the input sequence of the Bi-LSTM runs along the frequency axis, so the spatial relation between low- and high-frequency components of the ECG spectrogram is learned. The soft attention block (Bahdanau et al 2014, Xu et al 2015) assigns weights in the range [0, 1] according to the importance of the input regions. It provides in-epoch information about the exact time interval at which an apnea event is likely to occur according to $LST{M}^{b}$ and $LST{M}^{f}$.
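The weighting step of a soft attention block can be sketched as below. This is a generic softmax-weighted sum, not the scoring function of equations (6)–(7), which is not reproduced here; the scores are therefore passed in directly, and the function name and the sample values are hypothetical.

```python
import math


def soft_attention(features, scores):
    """Weight feature vectors by softmax(scores) and sum them."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # each weight is in [0, 1]; they sum to 1
    dim = len(features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return weights, context


# Hypothetical Bi-LSTM outputs at three frequency positions, two features each.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give uniform weights
weights, context = soft_attention(h, scores)
```

Because the weights are non-negative and sum to one, they can be inspected directly to see which positions contribute most to the decision.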

2.2.3. Layer normalisation

The sensitivity of gradients with respect to weights in one hidden layer to the outputs of the previous layer is an undesired property in training DNNs (Ba et al 2016). The mean $\mu $ and variance ${\sigma }^{2}$ used for normalising the output $X\in {R}^{C\times H\times W}$ of a layer are defined in equation (8):

Equation (8)

where $C$ is the channel axis and both $H$ and $W$ are spatial axes. If a lengthy event activates additional apnea-sensitive neurons in hidden layers, then the layer-wise output will be accurately normalised by a larger mean value according to the layer normalisation. The proportion of apnea duration is a segment-wise property. Therefore, we abandon the commonly used batch normalisation (Ioffe and Szegedy 2015), where mean values typically come from multiple records in the same batch. Layer normalisation promotes the robustness of FASSNet. We discuss the effect of layer normalisation in section 2.4.2.
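A minimal sketch of this per-sample normalisation is given below. Statistics are taken over all $C\times H\times W$ values of a single sample, so they do not depend on other records in the batch. The stabilising constant `eps` and the omission of learnable gain and bias terms are assumptions made for brevity, and the function name is hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise one sample x[c][h][w] using statistics over all C*H*W values."""
    flat = [v for channel in x for row in channel for v in row]
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / len(flat)
    scale = 1.0 / math.sqrt(var + eps)  # eps is an assumed stabiliser
    normed = [
        [[(v - mu) * scale for v in row] for row in channel] for channel in x
    ]
    return normed, mu, var


# Hypothetical activation map with C = 2 channels, H = 1 and W = 2.
x = [[[1.0, 3.0]], [[5.0, 7.0]]]
normed, mu, var = layer_norm(x)
```

Batch normalisation would instead pool statistics across samples, which mixes segments with different apnea durations.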

2.3. Evaluation metrics

Two sets of criteria are used to evaluate model performance. We use accuracy (Acc), specificity (Spe), sensitivity (Sen), F1 score (F1) and area under the receiver operating characteristic curve (AUC) to describe the classification performance. Meanwhile, the number of model parameters (Params) and multiply accumulates (MACs) in an inference are used to measure real-time performance, because they indicate whether the algorithm fits hardware constraints. Classification performance measurements are expressed as equations (9)–(12):

Equation (9)

Equation (10)

Equation (11)

Equation (12)

where TP, TN, FP and FN are quantities of true positives, true negatives, false positives and false negatives, respectively. SA syndrome epochs are positives in our systems.
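The four measures can be computed from the confusion counts as in the sketch below, which uses the standard definitions of accuracy, sensitivity, specificity and F1 score. The function name and the counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics, with apnea minutes as positives."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)              # true positive rate
    spe = tn / (tn + fp)              # true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return {"Acc": acc, "Sen": sen, "Spe": spe, "F1": f1}


# Hypothetical counts for illustration only.
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```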

AUC is used for determining the hyper-parameters of FASSNet. It is calculated by accumulating the area under a receiver operating characteristics (ROC) curve. An ROC curve takes false positive rate (100%-Spe) at different classification thresholds as horizontal axis and true positive rate (Sen) as vertical axis.
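The accumulation of area under the ROC curve can be sketched with the trapezoidal rule, sweeping the classification threshold from high to low. This is a generic illustration rather than the implementation used in the study; the function name and the sample scores are hypothetical, and both classes are assumed to be present.

```python
def roc_auc(labels, scores):
    """Trapezoidal area under the ROC curve; labels are 1 (apnea) or 0 (normal).

    Assumes at least one positive and one negative label.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Sweep thresholds from high to low, handling tied scores as one step.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    area = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        fpr, tpr = fp / neg, tp / pos
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return area


# Hypothetical labels and predicted apnea probabilities.
auc = roc_auc([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
```

Because AUC summarises all thresholds at once, it is less sensitive to the choice of a single operating point than accuracy.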

There are two real-time metrics to evaluate whether a model for wearable devices is light-weight or not. Params measures memory allocations. The MAC operation is defined in equation (13)

Equation (13)

where $a,$ $b,$ and $c$ are floating-point numbers. MACs is one of the standard measures of computation cost in model inference. The real-time performance of a neural network is calculated by summing the Params/MACs of each block.
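As an illustration of block-wise counting, the sketch below estimates the MACs and Params of a single 2D convolutional layer under a common counting rule: one multiply accumulate per kernel weight per output pixel. The layer sizes are hypothetical, and counting tools may follow slightly different conventions, for example in how bias terms are handled.

```python
def conv2d_cost(c_in, c_out, kernel, out_h, out_w, bias=True):
    """Return (MACs, Params) of one 2D convolution under a common counting rule."""
    macs = c_in * c_out * kernel * kernel * out_h * out_w
    params = c_in * c_out * kernel * kernel + (c_out if bias else 0)
    return macs, params


# Hypothetical layers: 3x3 kernels on a 26 x 92 feature map, equal in/out width.
macs_32, params_32 = conv2d_cost(32, 32, 3, 26, 92)
macs_64, params_64 = conv2d_cost(64, 64, 3, 26, 92)
total_macs = macs_32 + macs_64  # network totals are sums over blocks
```

With equal input and output widths, the MAC count grows with the square of the width, so doubling the width quadruples the cost of such a layer.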

2.4. Experiments

2.4.1. Dataset

The open-source Physionet Apnea-ECG database is used in this study. Table 1 provides the notations of the 70 recordings from 32 subjects and their demographic information, and the recording-wise information is available in the appendix. Subjects were between 27 and 63 years of age, with a BMI between 19.20 and 45.33 kg m−2. Each recording contains a 100 Hz single-lead ECG signal digitised at 12 bit resolution, with a length of approximately 7–10 h. Each recording includes reference annotations for each minute of ECG (Goldberger et al 2000, Penzel et al 2000). Each 60 s ECG segment is labelled as a $Non \mbox{-} apnea\,minute$ or an $Apnea\,minute$. In Apnea-ECG, an $Apnea\,minute$ contains one or more apnea or hypopnea events. However, the number of apnea and hypopnea events in each minute is not given. Hence, instead of using the AHI defined in the AASM standard (equation (14)), we adopted the common AHI computation approach of the previous protocol (Penzel et al 2000), as shown in equation (15), to estimate SA syndrome severity. At present, all data and labels are publicly available.

Equation (14)

Equation (15)
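A minute-based AHI estimate can be sketched as follows. Equation (15) is not reproduced here, so the formula below is an assumption: it uses the common minute-level estimate of 60 times the fraction of apnea minutes. The severity thresholds of 5, 15 and 30 follow table 4; the function names and the sample recording are hypothetical.

```python
def ahi_apnea_ecg(minute_labels):
    """Estimate AHI as apnea minutes per hour of recording (assumed form)."""
    apnea_minutes = sum(minute_labels)  # 1 = apnea minute, 0 = non-apnea minute
    return 60.0 * apnea_minutes / len(minute_labels)


def severity_flags(ahi, thresholds=(5, 15, 30)):
    """Binary severity decision at each AHI threshold."""
    return {t: ahi >= t for t in thresholds}


# Hypothetical 8 h recording (480 minutes) with 60 apnea minutes.
labels = [1] * 60 + [0] * 420
ahi = ahi_apnea_ecg(labels)
flags = severity_flags(ahi)
```

Since one apnea minute may contain several events, a minute-based estimate of this kind cannot exceed 60 and may differ from the event-based AASM index.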

Table 1. Apnea-ECG recordings and demographics (see footnote a).

| Subject ID | Recordings | Subject ID | Recordings | Subject ID | Recordings | Subject ID | Recordings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 01 | a01, a14 | 09 | a09, a18 | 17 | b05, x11 | 25 | c10, x18 |
| 02 | a02, x14 | 10 | a11 | 18 | c01, x35 | 26 | x02 |
| 03 | a03, x19 | 11 | a15, x27, x28 | 19 | c02, c09 | 27 | x06, x24 |
| 04 | a04, a12 | 12 | a17, x12 | 20 | c03, x04 | 28 | x09, x23 |
| 05 | a05, a10, a20, x07 | 13 | a19, x05, x08, x25 | 21 | c04, x29 | 29 | x10 |
| 06 | a06, x15 | 14 | b01, x03 | 22 | c05, x33 | 30 | x13, x26 |
| 07 | a07, a16, x01, x30 | 15 | b02, b03, x16, x21 | 23 | c06 | 31 | x17, x22 |
| 08 | a08, a13, x20 | 16 | b04, c08 | 24 | c07, x34 | 32 | x31, x32 |

Demographics of all 32 subjects

| Length, minutes: mean (std) | Non-apnea minutes: mean (std) | Apnea minutes: mean (std) | AHI $\geqslant$ 5: % | Age, year: mean (std) | Female gender: % | BMI, kg m−2: mean (std) |
| --- | --- | --- | --- | --- | --- | --- |
| 1075.9 (426.4) | 670.7 (362.4) | 405.2 (434.1) | 50 | 43.88 (10.86) | 21.9 | 28.24 (6.90) |

a During the Physionet Challenge 2000, the labels of the 35 recordings with the prefix 'x' were unreleased. Some previous studies also excluded these recordings, as shown in table 2. Currently, the labels of all 70 recordings are available.

2.4.2. Experiment protocols

The diagnosis performance was evaluated using both ten-fold and patient-agnostic validation protocols. K-fold validation is a common protocol for evaluating model performance. Records are divided into training and validation sets at each iteration. We performed ten-fold validation, evaluated FASSNet after every three epochs during training, and retained the test-set results at which the model yielded the maximum accuracy. Outputs on the test sets of all 10 folds were pooled together to compute classification performance. Patient-agnostic validation shows the generalization ability of the algorithm when new patients are introduced. Moreover, the real-time performance of neural networks is given.

Compared with ten-fold cross validation, patient-agnostic validation (or blind validation) rules out the possibility that different records from the same subject are simultaneously distributed in the training and test sets. To the best of our knowledge, the patient-agnostic scheme using the Physionet Challenge 2000 database was reported in Surrel et al (2018) and Wang et al (2019a). The authors intentionally used one recording from the test subject in the training set to demonstrate the importance of the patient-agnostic protocol for preventing leakage of personal information. We adopt the same train-test splits in order to make a fair comparison. Moreover, we ran a leave-one-patient-out experiment to obtain the test results for each subject, where each time FASSNet was trained on ECG segments from all other subjects.
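The leave-one-patient-out protocol amounts to grouping recordings by subject before splitting, so that no recording of the test subject reaches the training set. A minimal sketch is given below, using three subjects from table 1 for illustration; the function and variable names are hypothetical.

```python
# Subject-to-recording mapping for three subjects, taken from table 1.
SUBJECT_RECORDINGS = {
    "01": ["a01", "a14"],
    "10": ["a11"],
    "23": ["c06"],
}


def leave_one_patient_out(subject_recordings):
    """Yield (test_subject, train_recordings, test_recordings) splits."""
    for test_subject, test_recs in subject_recordings.items():
        train_recs = [
            rec
            for subject, recs in subject_recordings.items()
            if subject != test_subject
            for rec in recs
        ]
        yield test_subject, train_recs, list(test_recs)


splits = list(leave_one_patient_out(SUBJECT_RECORDINGS))
```

Splitting by recording instead of by subject would allow, for example, a01 and a14 to fall on opposite sides of the split, which is the leakage this protocol is designed to prevent.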

2.4.3. Experiment setup

An i7 CPU with 8 GB of DDR3 memory was used to pre-process ECG data. The validation procedure was carried out on a server with two GPUs and an i5 CPU. We used the WFDB package to divide ECG recordings into 60 s segments. Signal pre-processing steps were completed on the basis of SciPy and NumPy (Jones et al 2001, Oliphant 2006, Van Der Walt et al 2011). For the STFT, the duration of the Hamming window was 2.56 s and the overlap was 1.28 s. Hence, the duration of one STFT segment is approximately 2–3 heart-beats. Neural networks were implemented using PyTorch (Paszke et al 2019). Models were trained with the Adam optimiser (Kingma and Ba 2014). The learning rate, beta1, beta2 and weight decay factor were set to 0.0003, 0.9, 0.999, and 0.001, respectively. Cross entropy was used as the loss function during training. We iterated for a maximum of 60 epochs. MACs and Params were computed using the THOP package.
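The relation between STFT settings and spectrogram size can be sketched as follows, for a one-sided STFT without padding. The window and hop lengths below are given in samples and are an assumption: they were chosen because, with 50% overlap, they reproduce the 65 × 92 full-spectrogram size listed in table 5. They are not derived from the window duration stated above, and the function name is hypothetical.

```python
def spectrogram_shape(n_samples, window, hop):
    """(frequency bins, time frames) of a one-sided STFT without padding."""
    n_bins = window // 2 + 1
    n_frames = (n_samples - window) // hop + 1
    return n_bins, n_frames


FS = 100             # sampling frequency in Hz (from the text)
N_SAMPLES = 60 * FS  # one 60 s ECG segment
# Assumption: 128-sample window and 64-sample hop (50% overlap).
shape = spectrogram_shape(N_SAMPLES, window=128, hop=64)
```

Other padding or boundary conventions change the number of time frames by a few columns, so the exact shape depends on the STFT routine used.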

3. Results

We evaluated our model in terms of classification and real-time performance on the Apnea-ECG dataset. The classification performance obtained using the ten-fold validation protocol is shown in table 2, together with a comparison of various state-of-the-art SA detection methods. The classification results of the patient-agnostic scheme using the Physionet Challenge 2000 database are compared with those of Surrel et al (2018) and Wang et al (2019a) in table 3, and the results of the leave-one-patient-out protocol are given in table 4. Moreover, the real-time performance of neural networks is given in table 5.

Table 2. Classification results of ten-fold validation using the Apnea-ECG database. Hyper-parameters of FASSNet are organised as (Fs, width). NR denotes the number of recordings. NN refers to RR intervals between normal beats.

| Year | Method | NR | Input | Acc (%) | Sen (%) | Spe (%) | F1 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2014 | Sannino et al (2014) | 35 | Hand-crafted features | 88.57 | - | - | - |
| 2015 | Ravelo-Garcia et al (2015) | 35 | Hand-crafted features | 84.6 | 75.1 | 90.5 | 78.89 |
| 2016 | Song et al (2016) | 35 | Hand-crafted features | 86.2 | 82.6 | 88.4 | 81.95 |
| 2016 | Hassan et al (2016) | 35 | Hand-crafted features | 83.77 | 85.20 | 82.79 | 81.02 |
| 2016 | da Silva Pinho et al (2016) | 35 | Hand-crafted features | 82.12 | 88.41 | 72.29 | 85.78 |
| 2017 | González et al (2017) | 35 | Hand-crafted features | 84.76 | 81.45 | 86.92 | 80.85 |
| 2017 | Hassan and Haque (2017) | 35 | Hand-crafted features | 88.88 | 87.58 | 91.49 | 91.32 |
| 2017 | Kumar and Kanhangad (2017) | 35 | Hand-crafted features | 89.80 | 88.46 | 90.63 | 86.86 |
| 2018 | Surrel et al (2018) | 35 | Hand-crafted features | 84.5 | 79.2 | 88.1 | 80.5 |
| 2018 | Li et al (2018) | 70 | DNN features | 84.7 | 88.9 | 82.1 | 81.63 |
| 2019 | Wang et al (2019b) | 70 | Hand-crafted features | 87.6 | 83.1 | 90.3 | 83.4 |
| 2020 | Zarei and Mohammadzadeh Asl (2020) | 70 | Hand-crafted features | 93.90 | 92.26 | 94.92 | 92.09 |
| End-to-end DNN methods | | | | | | | |
| 2019 | Wang et al (2019a) | 35 | 1 min NN | 83.03 | 78.73 | 85.60 | 77.63 |
| 2020 | FASSNet (28, 32) | 70 | 1 min ECG spectrogram | 85.61 | 74.58 | 92.57 | 80.04 |
| 2020 | FASSNet (38, 64) | 70 | 1 min ECG spectrogram | 86.41 | 77.96 | 91.74 | 81.61 |
| 2020 | FASSNet (48, 32) | 70 | 1 min ECG spectrogram | 85.72 | 74.06 | 93.08 | 80.05 |

Table 3. Results of patient-agnostic validation. NTR is the number of test recordings. The convolutional layer width w and the high-frequency cut-off Fs of the low-pass filter in FASSNet are 64 and 38 Hz. The best performance in each fold is highlighted in bold.

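| Test subjects | NTR | Method | Acc (%) |
| --- | --- | --- | --- |
| 10 | 1 | Wang et al (2019a) | 61 |
| 10 | 1 | FASSNet | 70.8 |
| 11 | 1 | Wang et al (2019a) | 77.6 |
| 11 | 3 | FASSNet | 85.1 |
| 12 | 1 | Wang et al (2019a) | 76.9 |
| 12 | 2 | FASSNet | 88.2 |
| 14 | 1 | Wang et al (2019a) | 83.2 |
| 14 | 1 | FASSNet | 93.1 |
| 24 | 1 | Wang et al (2019a) | 94.5 |
| 24 | 1 | FASSNet | 97.1 |
| 10, 11, 12, 14, 24 | 5 | Wang et al (2019a) | 80.6 |
| 10, 11, 12, 14, 24 | 10 | FASSNet | 86.4 |
| 15 | 2 | Wang et al (2019a) | 70.3 |
| 15 | 4 | Surrel et al (2018) | 69.8 |
| 15 | 4 | FASSNet | 85.6 |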

Table 4. Results of patient-agnostic validation (leave-one-patient-out validation). In apnea severity diagnosis, we denote a result as True if FASSNet makes a correct prediction. The thresholds of apnea severity refer to Kapur et al (2017).

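| Test subject | Per-segment acc (%) | AHI ≥ 5 | AHI ≥ 15 | AHI ≥ 30 |
| --- | --- | --- | --- | --- |
| 01 | 84.97 | True | True | True |
| 02 | 50.10 | True | True | False |
| 03 | 85.88 | True | True | True |
| 04 | 78.93 | True | True | True |
| 05 | 71.56 | True | True | True |
| 06 | 82.34 | True | True | True |
| 07 | 81.51 | True | True | True |
| 08 | 77.48 | True | True | True |
| 09 | 56.00 | True | True | False |
| 10 | 67.17 | True | False | True |
| 11 | 81.46 | True | True | True |
| 12 | 88.33 | True | True | True |
| 13 | 85.50 | True | True | True |
| 14 | 96.32 | True | True | True |
| 15 | 83.71 | False | True | True |
| 16 | 97.47 | True | True | True |
| 17 | 91.69 | True | True | True |
| 18 | 100 | True | True | True |
| 19 | 98.74 | True | True | True |
| 20 | 100 | True | True | True |
| 21 | 99.79 | True | True | True |
| 22 | 99.46 | True | True | True |
| 23 | 99.79 | True | True | True |
| 24 | 99.66 | True | True | True |
| 25 | 98.17 | True | True | True |
| 26 | 88.70 | True | True | True |
| 27 | 99.65 | True | True | True |
| 28 | 88.02 | True | True | True |
| 29 | 97.45 | True | True | True |
| 30 | 65.39 | True | True | True |
| 31 | 99.63 | True | True | True |
| 32 | 91.87 | True | True | True |
| Average | 87.09 | 96.97% | 96.97% | 93.94% |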

Table 5. Real-time performances amongst neural networks. AlexNet (Krizhevsky 2014) and Deep ResNet (Wang et al 2019a) are used as the benchmark of model complexity. The input of Deep ResNet and FASSNet lasts for 1 min. Width is a hyper-parameter of FASSNet. The best real-time performance is highlighted in bold.

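| Model | Width | Input size | MACs (M) | Params (M) |
| --- | --- | --- | --- | --- |
| AlexNet | - | $65\times 92$ | 104.306 | 56.997 |
| Deep ResNet | - | $1\times 360$ | 546.635 | 8.160 |
| FASSNet | 32 | $20\times 92$ | 22.819 | 0.0333 |
| | 64 | $20\times 92$ | 90.188 | 0.115 |
| | 128 | $26\times 92$ | 358.258 | 0.439 |
| | 32 | $26\times 92$ | 29.555 | 0.0333 |
| | 64 | $26\times 92$ | 116.727 | 0.115 |
| | 128 | $33\times 92$ | 464.036 | 0.439 |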

3.1. Classification results of ten-fold validation

Table 2 shows the classification results of ten-fold validation. Hand-crafted-feature methods such as Hassan and Haque (2017) and Zarei and Mohammadzadeh Asl (2020) are the most effective; the best method is that of Zarei and Mohammadzadeh Asl (2020), whose per-segment accuracy is 93.90%. FASSNet (38 Hz, 64) achieves an accuracy, sensitivity, specificity, and F1 of 86.41%, 77.96%, 91.74%, and 81.61%, respectively, and the accuracy of the most light-weight version of FASSNet reaches 85.61%. It outperforms a series of commonly used SA syndrome detection approaches in classification performance and has the highest accuracy among DNNs. In comparison, the accuracy of the Deep ResNet is 83.03% (Wang et al 2019a), and that of the DNN features proposed by Li et al (2018) is 84.7%. Notably, classification performance alone is insufficient to decide whether a model is feasible for portable devices. For instance, Surrel et al (2018) reported an F1 score that is slightly lower than those of previous studies. However, the algorithm has been competitive for wearable devices since 2018, where trade-offs between classification and real-time performance are required.

3.2. Classification results of patient-agnostic validation

FASSNet scores the best accuracy in each fold, as shown in table 3. The accuracy is 86.4% when the test subject set is {10, 11, 12, 14, 24}. However, the accuracy drops to 70.8% when patient 10 is the only test subject. The accuracy of FASSNet is over 15 percentage points higher than that of Wang et al (2019a) when no information on subject 15 is available, which is evidence that features from FASSNet are robust. For some subjects, the accuracy can be extremely high (subjects 14 and 24). The polarization of classification performance across patients is also observed in table 4. The test results of some patients have a per-segment accuracy of over 95% (subjects 22–25), and the overall diagnosis of AHI severity is almost perfect. However, FASSNet fails to classify the ECG segments of some patients, such as subjects 09 and 30. A possible reason is the heterogeneity of ECG patterns. According to table 1, $AH{I}_{AASM}$ and $AH{I}_{Apnea \mbox{-} ECG}$ of these patients are considerably different, which may introduce errors. Comparison with previous studies shows that FASSNet is an excellent method for the SA syndrome screening task from the perspective of classification performance. The polarization of test results demonstrates the importance of patient-agnostic validation in testing SA syndrome screening algorithms.

3.3. Real-time performance

According to table 5, FASSNet (38 Hz, 64) uses 0.2% of the parameters of AlexNet, which is a small neural network for image classification tasks (Krizhevsky 2014). Moreover, FASSNet requires as little as 4.17% of the MACs of the deep ResNet (Wang et al 2019a), which is a remarkable improvement in real-time performance. As to hyper-parameter selection, the complexity of FASSNet quadruples when its width doubles or its input size quadruples. This finding indicates that the width is the dominant factor for model complexity compared to the input size. Time-frequency components of 28–38 Hz contain most of the useful information for diagnosing SA syndrome, whilst providing elements of 38–48 Hz shows very limited improvement. This evidence is consistent with the observation in figure 2 that few spectro-temporal patterns are visible in the 38–48 Hz frequency band.
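The percentages quoted above can be checked against table 5 with a few lines of arithmetic. The values below are copied from table 5; which rows the quoted percentages refer to is an inference from the numbers, and the dictionary keys and function name are hypothetical.

```python
# (MACs in millions, Params in millions), copied from table 5.
TABLE5 = {
    "AlexNet": (104.306, 56.997),
    "Deep ResNet": (546.635, 8.160),
    "FASSNet w=32, 20x92": (22.819, 0.0333),
    "FASSNet w=32, 26x92": (29.555, 0.0333),
    "FASSNet w=64, 26x92": (116.727, 0.115),
}


def percent_of(model, benchmark, column):
    """Cost of `model` as a percentage of `benchmark` (0 = MACs, 1 = Params)."""
    return 100.0 * TABLE5[model][column] / TABLE5[benchmark][column]


params_vs_alexnet = percent_of("FASSNet w=64, 26x92", "AlexNet", 1)
macs_vs_resnet = percent_of("FASSNet w=32, 20x92", "Deep ResNet", 0)
# Ratio of MACs when the width doubles from 32 to 64 at a fixed input size.
width_scaling = TABLE5["FASSNet w=64, 26x92"][0] / TABLE5["FASSNet w=32, 26x92"][0]
```

On these rows, the 0.2% parameter ratio matches the width-64 model, the 4.17% MAC ratio matches the width-32 model with a 20 × 92 input, and doubling the width multiplies MACs by a factor close to four.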

4. Discussion

The principal objective of our study is the evaluation of light-weight neural networks for wearable devices in minute-by-minute SA syndrome detection. FASSNet provides an accuracy of 86.41%, a sensitivity of 77.96%, a specificity of 91.74%, and an F1 score of 81.61% in ten-fold validation with very stringent computation resources (Params: 0.0333 M, MACs: 22.819 M). Moreover, only a quarter of the computation cost and memory is needed for a light-weight version of FASSNet.
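For reference, the four reported metrics can be derived from per-minute confusion counts as in the minimal sketch below. The function name `screening_metrics` is hypothetical, and the counts in the example are illustrative only; they are not results from this study.

```python
def screening_metrics(tp, fn, tn, fp):
    """Compute per-minute screening metrics from confusion counts.

    tp/fn count apnea minutes detected/missed; tn/fp count normal
    minutes correctly passed/wrongly flagged.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Illustrative counts only (not taken from the paper).
m = screening_metrics(tp=80, fn=20, tn=90, fp=10)
```

Because apnea minutes are the minority class in the Apnea-ECG database, the accuracy lies between the sensitivity and the specificity, closer to the latter.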

A strength of this study is the rigorous validation of the robustness of FASSNet. Patient-agnostic validation allows the performance of FASSNet to be generalised to real-world applications. Based on per-segment diagnosis, FASSNet can identify patients with moderate or severe SA syndrome (AHI > 15) at an accuracy of 96.97% in leave-one-patient-out validation. The estimation of AHI ($AHI_{Apnea \mbox{-} ECG}$) follows the protocol proposed in the Physionet Challenge 2000 (Goldberger et al 2000, Penzel et al 2000). The minute-apnea detection criterion has been used in various previous studies (e.g. Zarei and Mohammadzadeh Asl 2020) as a reliable standard for checking algorithms, since it is closely correlated with the AASM guideline.
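A per-recording severity decision can be obtained from minute-level predictions as sketched below. The sketch assumes the commonly used approximation that the estimated AHI equals 60 times the fraction of minutes labelled as apnea; the exact formula used in this study is the one defined with table 1 and may differ. The function names are hypothetical.

```python
def estimate_ahi(minute_labels):
    """Estimate AHI from per-minute labels (1 = apnea, 0 = normal).

    Assumption: AHI is approximated as apnea minutes per hour of
    recording, i.e. 60 * apnea_minutes / total_minutes.
    """
    labels = list(minute_labels)
    if not labels:
        raise ValueError("at least one labelled minute is required")
    return 60.0 * sum(labels) / len(labels)


def exceeds_threshold(minute_labels, threshold=15.0):
    """Return True when the estimated AHI is above the given threshold."""
    return estimate_ahi(minute_labels) > threshold


# Two synthetic 500-minute recordings with 100 and 200 apnea minutes.
mild_night = [1] * 100 + [0] * 400
severe_night = [1] * 200 + [0] * 300
```

With this approximation, the first synthetic recording gives an estimate of 12 and falls below the threshold of 15, whereas the second gives 24 and exceeds it.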

Moreover, mechanism analysis is introduced to understand to what extent FASSNet is reliable for the SA syndrome screening task. The real-time performance and the upper limit of classification performance are determined by the hyper-parameters of FASSNet. A hyper-parameter selection procedure enables us to find the optimal hyper-parameters for wearables and to verify the robustness of the model architecture. Layer normalisation is analysed because it contributes greatly to the classification performance of FASSNet. The attention block provides interpretability for FASSNet and shows that the backward and forward LSTMs produce reliable results. We then summarise the advantages and limitations of FASSNet. State-of-the-art feature-based methods show that higher accuracies are attainable, and future work is proposed to improve FASSNet.

4.1. FASSNet hyper-parameter selection

We investigated the influence of Fs in low-pass filtering and of the width w of the FASSNet convolutional layers using a grid search. The Fs candidates are 28, 38 and 48 Hz, and the width candidates are 32, 64 and 128 kernels. According to the 'preserve-edge-information' design in section 2.2, the corresponding input sizes are $20\times 92$, $26\times 92$ and $33\times 92$. The optimal duration of a single pixel in the outputs of spectrogram pattern learning is explored by removing either a convolution layer in the pre-activation block or a whole convolution block. The duration decreases slightly when there is only one convolution layer in the pre-activation block, and removing a convolution block results in a significant decline. Figure 4 shows the AUC scores on the test set in ten-fold validation. In this study, we select FASSNet (Fs = 38 Hz, width = 64) as the optimal model for wearable devices, because it has the highest AUC score with low MACs and Params (table 5). FASSNet (28 Hz, 32) is an extremely light-weight solution for wearable devices. A small decline in the duration of a pixel of the spectrogram patterns results in slightly weaker model performance, whilst a significant decrease makes the performance degradation evident.
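The grid search over the nine (Fs, width) combinations can be organised as in the following sketch. The `evaluate` callback, which would train and test one configuration and return its AUC and MACs, is hypothetical, as is the ranking rule (highest AUC first, fewer MACs as tie-breaker); the toy callback returns placeholder values, not measured results.

```python
from itertools import product


def grid_search(evaluate, fs_candidates=(28, 38, 48), widths=(32, 64, 128)):
    """Evaluate every (Fs, width) pair and return results sorted best-first.

    evaluate(fs, width) must return a tuple (auc, macs).
    """
    results = []
    for fs, width in product(fs_candidates, widths):
        auc, macs = evaluate(fs, width)
        results.append({"fs": fs, "width": width, "auc": auc, "macs": macs})
    # Highest AUC first; among equal AUCs, prefer the cheaper model.
    results.sort(key=lambda r: (-r["auc"], r["macs"]))
    return results


def toy_evaluate(fs, width):
    """Placeholder scorer: constant AUC and a cost that grows with fs * width."""
    return 0.5, fs * width


ranking = grid_search(toy_evaluate)
best = ranking[0]
```

In practice the callback would wrap the ten-fold validation loop, so that each configuration is scored on held-out data only.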

Figure 4. Classification performances in the grid search approach. Hyper-parameters are organised as (Fs, width). The hyper-parameters of FASSNet (Conv Layer −1) and FASSNet (ConvBlock −1) are (Fs = 38 Hz, width = 64).


4.2. Effect of layer normalisation

Layer normalisation prevents the model from overfitting. It uses the sample-dependent mean and standard deviation to align features across samples. In our view, the randomness of occurrence and the variable duration of an apnea event are partly responsible for the heterogeneity among samples. FASSNet uses layer normalisation to cope with this situation, thereby screening SA syndrome accurately. Layer normalisation significantly improves the classification performance of FASSNet: when batch normalisation was used instead, FASSNet scored an accuracy of only around 70% on the test set in ten-fold validation. The design of layer normalisation adapts to the randomness of SA syndrome, which makes FASSNet a reliable model.
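The difference between the two normalisation schemes can be illustrated with the minimal sketch below, which omits the learnable gain and bias and operates on plain lists. It is a generic illustration rather than the FASSNet implementation: layer normalisation uses the statistics of each sample, whereas batch normalisation uses per-feature statistics across the batch.

```python
from statistics import fmean, pstdev


def layer_norm(sample, eps=1e-5):
    """Normalise one sample with its own mean and standard deviation."""
    mu = fmean(sample)
    var = pstdev(sample) ** 2
    return [(x - mu) / (var + eps) ** 0.5 for x in sample]


def batch_norm(batch, eps=1e-5):
    """Normalise each feature with statistics computed across the batch."""
    columns = list(zip(*batch))
    mus = [fmean(col) for col in columns]
    variances = [pstdev(col) ** 2 for col in columns]
    return [
        [(x - mu) / (var + eps) ** 0.5 for x, mu, var in zip(row, mus, variances)]
        for row in batch
    ]


# Two samples with the same shape but a tenfold difference in scale.
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
ln_rows = [layer_norm(row) for row in batch]
bn_rows = batch_norm(batch)
```

After layer normalisation both samples map to nearly the same values, so a pattern is represented consistently regardless of the scale of the sample, whereas the output of batch normalisation for one sample depends on the other samples in the batch.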

4.3. Attention mechanism for model interpretability

In FASSNet, the forward and backward LSTMs use forward and backward information, respectively, to detect SA syndrome events. The subsequent attention block assigns importance to the position information of the LSTM outputs. For apnea epochs, the two LSTMs are expected to focus on the same period, which shows that both can detect the onset of an apnea event. For epochs in which all in-epoch periods are normal, we expect the attention weights of the two LSTMs to diverge, and this divergence defines a normal state. Such divergence indicates the intrinsic heterogeneity of FASSNet, which is a good sign of robustness. The kernel density of the attention weights for each per-segment label is visualised in figure 5. The attention weight distributions of apnea segments are concentrated in small areas and resemble similar Gaussian distributions. This indicates that forward and backward information agree on the most likely period of an apnea event. In contrast, the attention weights of normal segments are distributed randomly and are dispersed at the edges, where the position index is 0 or 5. This is likely because there is no 'abnormal event' in a 'normal state' at any time point, and central pixels receive additional attention in the convolution and max-pooling operations.
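The role of the attention weights can be illustrated with a generic softmax attention pooling over six positions (indexes 0–5). In the sketch below the scores are given as inputs, because the way FASSNet computes them from the LSTM outputs is not restated here; `attention_pool` and the toy outputs are hypothetical.

```python
import math


def attention_pool(outputs, scores):
    """Turn scores into softmax weights and pool the outputs with them.

    outputs: one feature vector per position; scores: one score per position.
    Returns (weights, context), where context is the weighted sum of outputs.
    """
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(outputs[0])
    context = [
        sum(w * vec[d] for w, vec in zip(weights, outputs)) for d in range(dim)
    ]
    return weights, context


# Six positions, each with a toy two-dimensional LSTM output.
outputs = [[float(i), 1.0] for i in range(6)]

# Equal scores spread the attention evenly over all positions.
flat_weights, flat_context = attention_pool(outputs, [0.0] * 6)

# A dominant score concentrates the attention on position 3.
peak_weights, peak_context = attention_pool(outputs, [0.0, 0.0, 0.0, 4.0, 0.0, 0.0])
```

Concentrated weights correspond to the behaviour described for apnea segments, and dispersed weights to that described for normal segments.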

Figure 5. Kernel density estimation of the attention weights of apnea and normal segments. Green and blue plots denote the attention weight distributions of apnea and normal segments in the Apnea-ECG database, respectively. Forward and backward weights are obtained from the Bi-LSTM and attention block of the well-trained FASSNet (38 Hz, 64). The numerical range of all the x- and y-axes is [0.1, 0.25]. The numbers 0–5 in the top right-hand corner are the position indexes of the outputs of the attention block.


4.4. Advantages and limitations

Compared to light-weight algorithms on wearables (e.g. Sannino et al 2014, Surrel et al 2018), the chief advantage of FASSNet (Fs = 38 Hz, width = 64) is its excellent classification performance (Acc = 86.41%, Sen = 77.96%, Spe = 91.74%, F1 score = 81.61%). Compared to current DNNs (e.g. Wang et al 2019a), apnea screening with FASSNet requires an extremely low computation cost (MACs = 90.188 M, Params = 0.115 M). The accuracy of FASSNet shows an upward trend with an increasing volume of the training set, even though patient-agnostic validation is more difficult than ten-fold validation (tables 2 and 4). This indicates that FASSNet benefits from the expansion of the database, as is typical of deep networks. The architecture of FASSNet is robust, since it takes the characteristics of SA syndrome into account. Theoretically, FASSNet rejects high-frequency disturbances strongly because only low-frequency components are used as model input. Moreover, there is empirical evidence that changing the hyper-parameters of FASSNet causes only a slight degradation in classification performance (figure 4). The light-weight design and reliability of FASSNet make it applicable to various scenarios for wearable devices.

However, FASSNet has some drawbacks compared to previous approaches if we assume that computation cost is not a constraint. Some computer-aided neural networks (Li et al 2018) and feature-based methods (Zarei and Mohammadzadeh Asl 2020) show higher accuracies (table 2). Patient-agnostic validation shows that the diagnosis of some patients can be unsatisfactory even when models achieve high accuracies in cross validation. For instance, the accuracy for subject 10 is 70.8% (see table 3), and the diagnosis of apnea severity for subject 09 is wrong (see table 4). Surrel et al (2018) also reported the same performance deterioration in certain patients. They suggested capturing subject-specific information, since subjects vary demographically in ways that induce different ECG morphological patterns. For example, Laureanti et al (2020) documented that ECG parameters differ statistically between genders. These findings demonstrate the necessity of patient-agnostic validation. Therefore, evaluation on larger-scale databases is important before implementing FASSNet on wearable devices. Furthermore, Apnea-ECG is a database released in the year 2000, and FASSNet needs to be updated according to the recent AASM standard.

4.5. Future work

Large-scale modern datasets contribute to obtaining reliable models. Feeding more data to FASSNet can improve model performance without introducing additional computation loads. Considering that the architecture of FASSNet is robust, it is promising to apply it to other physiological signals. For instance, a dataset of 86 recordings from a hospital was used to evaluate a CNN model (Urtnasan et al 2018). Airflow signal recordings from clinical data and the MIT-BIH PSG database (Ichimaru and Moody 1999) are sufficient to train an LSTM (Yang et al 2019). The optimal hyper-parameters of FASSNet need to be fine-tuned when the input signal is different.

5. Conclusion

FASSNet, an end-to-end SA syndrome screening neural network for wearable devices, is proposed in this study. It learns features from the ECG signal directly. The model design emphasises capturing spectro-temporal patterns. Low-frequency components and edge information of the ECG spectrogram are used as the input to achieve good classification and real-time performance. The model integrates convolution layers, layer normalisation, Bi-LSTM and attention blocks along the frequency axis. The algorithm achieves an accuracy, sensitivity, specificity and F1 score of 86.41%, 77.96%, 91.74% and 81.61%, respectively. Owing to its light-weight design, FASSNet requires very few computation resources (Params = 0.115 M, MACs = 90.188 M). The hyper-parameter selection procedure demonstrates the robustness of the FASSNet architecture. FASSNet is tested rigorously under the patient-agnostic validation protocol, and the mechanism analysis also supports its reliability. Overall, FASSNet is light-weight and accurate enough to be implemented on wearable devices.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (81973744 and 81473579) and the Natural Science Foundation of Beijing Municipality (7173267).

Conflict of interest statement

The authors declare that they have no conflicts of interest.

Appendix. Record-wise demographics

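| Subject ID | Recordings | Non-apnea minutes | Apnea minutes | AHI | Age | Sex | BMI (kg m−2) |
|---|---|---|---|---|---|---|---|
| 01 | a01 | 20 | 470 | 69.6 | 51 | M | 33.31 |
|  | a14 | 127 | 383 | 54.7 |  |  |  |
| 02 | a02 | 109 | 420 | 69.5 | 38 | M | 37.04 |
|  | x14 | 52 | 439 | 79.5 |  |  |  |
| 03 | a03 | 274 | 246 | 39.1 | 54 | M | 28.34 |
|  | x19 | 81 | 407 | 56.2 |  |  |  |
| 04 | a04 | 40 | 453 | 77.4 | 52 | M | 40.42 |
|  | a12 | 44 | 534 | 80.2 |  |  |  |
| 05 | a05 | 179 | 276 | 41 | 58 | M | 25.18 |
|  | a10 | 418 | 100 | 21 |  |  |  |
|  | a20 | 196 | 315 | 41 |  |  |  |
|  | x07 | 270 | 240 | 21 |  |  |  |
| 06 | a06 | 305 | 206 | 24.7 | 63 | M | 32.46 |
|  | x15 | 299 | 200 | 15.9 |  |  |  |
| 07 | a07 | 190 | 322 | 63 | 44 | M | 33.52 |
|  | a16 | 163 | 320 | 41 |  |  |  |
|  | x01 | 149 | 375 | 63 |  |  |  |
|  | x30 | 286 | 326 | 41 |  |  |  |
| 08 | a08 | 313 | 189 | 42 | 51 | M | 27.46 |
|  | a13 | 252 | 244 | 42 |  |  |  |
|  | x20 | 250 | 264 | 43 |  |  |  |
| 09 | a09 | 115 | 381 | 31.7 | 52 | M | 25.88 |
|  | a18 | 52 | 438 | 82.4 |  |  |  |
| 10 | a11 | 245 | 222 | 14 | 58 | M | 36.49 |
| 11 | a15 | 143 | 368 | 52 | 60 | M | 36.48 |
|  | x27 | 11 | 488 | 75 |  |  |  |
|  | x28 | 62 | 434 | 75 |  |  |  |
| 12 | a17 | 328 | 158 | 33 | 44 | M | 29.86 |
|  | x12 | 471 | 57 | 33 |  |  |  |
| 13 | a19 | 298 | 205 | 34 | 55 | M | 28.41 |
|  | x05 | 190 | 316 | 34 |  |  |  |
|  | x08 | 194 | 324 | 48 |  |  |  |
|  | x25 | 220 | 291 | 48 |  |  |  |
| 14 | b01 | 469 | 19 | 0.24 | 44 | F | 21.80 |
|  | x03 | 454 | 12 | 0.13 |  |  |  |
| 15 | b02 | 425 | 93 | 19 | 53 | M | 27.44 |
|  | b03 | 369 | 73 | 24 |  |  |  |
|  | x16 | 451 | 65 | 24 |  |  |  |
|  | x21 | 391 | 120 | 19 |  |  |  |
| 16 | b04 | 420 | 10 | 0.7 | 42 | M | 19.75 |
|  | c08 | 535 | 0 | 0 |  |  |  |
| 17 | b05 | 377 | 57 | 5 | 52 | M | 41.67 |
|  | x11 | 445 | 13 | 1 |  |  |  |
| 18 | c01 | 485 | 0 | 0 | 31 | M | 21.86 |
|  | x35 | 484 | 0 | 0 |  |  |  |
| 19 | c02 | 502 | 1 | 0 | 37 | M | 25.62 |
|  | c09 | 467 | 2 | 0 |  |  |  |
| 20 | c03 | 455 | 0 | 0 | 39 | M | 19.20 |
|  | x04 | 483 | 0 | 0 |  |  |  |
| 21 | c04 | 483 | 0 | 0 | 41 | F | 20.06 |
|  | x29 | 471 | 0 | 0 |  |  |  |
| 22 | c05 | 464 | 3 | 0 | 28 | F | 19.96 |
|  | x33 | 471 | 3 | 0 |  |  |  |
| 23 | c06 | 468 | 1 | 0 | 28 | F | 22.23 |
| 24 | c07 | 450 | 4 | 0 | 30 | F | 25.18 |
|  | x34 | 472 | 4 | 0 |  |  |  |
| 25 | c10 | 431 | 1 | 0 | 27 | M | 21.27 |
|  | x18 | 458 | 2 | 0 |  |  |  |
| 26 | x02 | 261 | 209 | 37.7 | 46 | M | 24.74 |
| 27 | x06 | 451 | 0 | 0 | 31 | M | 22.84 |
|  | x24 | 429 | 1 | 0 |  |  |  |
| 28 | x09 | 342 | 167 | 18.5 | 43 | M | 25.54 |
|  | x23 | 409 | 119 | 14.3 |  |  |  |
| 29 | x10 | 415 | 96 | 10 | 39 | M | 45.33 |
| 30 | x13 | 215 | 292 | 18.7 | 57 | M | 33.17 |
|  | x26 | 177 | 344 | 15.1 |  |  |  |
| 31 | x17 | 400 | 1 | 0 | 27 | F | 21.23 |
|  | x22 | 481 | 2 | 0 |  |  |  |
| 32 | x31 | 42 | 516 | 93.5 | 29 | F | 29.86 |
|  | x32 | 114 | 425 | 71.8 |  |  |  |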
