FASSNet: fast apnea syndrome screening neural network based on single-lead electrocardiogram for wearable devices

Yunkai Yu; Zhihong Yang; Yuyang You; Wenjing Shan

doi:10.1088/1361-6579/ac184e

1. Introduction

Sleep apnea (SA) syndrome, including apnea and hypopnea, is a chronic condition that carries serious health risks. Recent AASM guidelines have noted that patients with untreated SA syndrome may have a higher risk of developing chronic health consequences such as diabetes, and cardiovascular diseases such as stroke (Kushida et al 2005, Iber et al 2007, Lam et al 2012, Kapur et al 2017). According to the report of an American Academy of Sleep Medicine Task Force (1999) and Kapur et al (2017), an apnea event is defined as an intermittent cessation of breathing for over 10 s, and a hypopnea event is a reduction of breathing volume below 50%, associated with a decrease of at least 4% in oxygen saturation. Guilleminault et al (1978) defined SA syndrome according to the number of apnea events per night. The apnea-hypopnea index (AHI), the average number of apnea and hypopnea events per hour, is widely used to evaluate SA syndrome severity. Polysomnography (PSG) is the gold standard for SA detection (Moody et al 2000). However, it is not an optimal solution for home use because of economic constraints and limited medical resources. PSG collects multi-channel signals such as electrocardiogram (ECG) and electrooculogram, which results in an unfamiliar sleeping environment, and diagnosis through PSG has a long waiting list.

SA event detection and apnea severity estimation are major outcomes in continuous and ambulatory monitoring of SA syndrome, and wearable devices have become an emerging option for online diagnosis. Processing physiological signals on wearable devices has advantages in real-time performance, large-scale application, and user privacy protection compared to transferring raw data to an authorized cloud (Shi et al 2016), and a real-time analysis scheme frees wearables from saving large-volume whole-night data. ECG is accessible on medical and consumer-grade wearable devices (Yamakawa et al 2014, Fujiwara et al 2016, Surrel et al 2018, Xu et al 2020). SA symptoms were found to be associated with cardiovascular sequelae (He et al 1988). This fact implies that the ECG signal has potential in detecting SA syndrome. The modulation between multi-lead ECG and respiration reveals that the ECG signal can be used as an alternative to the respiratory signal.

The feasibility of single-lead ECG signals in SA syndrome diagnosis has been verified in recent years. The Apnea-ECG database (Goldberger et al 2000, Penzel et al 2000) has been a benchmark database for testing algorithms for minute-by-minute apnea event detection. Unlike the AASM standard, it estimates AHI from minute-level diagnosis when evaluating apnea severity. Features derived from heart-rate variability (HRV) parameters can be implemented on mobile devices in an IF-THEN rule set (Sannino et al 2014). The large amplitude of R-peaks makes HRV noise-resistant for SA screening. This method reported an accuracy of 0.8857 on a subset of the Apnea-ECG database. However, an IF-THEN rule set, which is based on the values of HRV parameters, has to be personalized because of the discrepancy among patients. Time and frequency domain features of HRV achieved an AUC score of 0.91 (Nakayama et al 2019). Designing these features relies on expert knowledge of SA syndrome. For instance, one feature is the number of pairs of adjacent intervals of two successive R waves (RR intervals) with a difference of 50 ms, and in the power spectrum of RR intervals, the low-frequency band is defined as 0.04–0.15 Hz and the high-frequency band as 0.15–0.4 Hz. Integrating features from the ECG-derived respiration (EDR) signal is a common strategy. Band power and statistical features such as mean value, standard deviation, skewness and kurtosis are derived from the EDR signal in the same way as from RR intervals (Ravelo-Garcia et al 2015, Varon et al 2015, Song et al 2016), and Song et al reported an accuracy of 0.862. Modern techniques generate the EDR signal from a single-lead ECG signal. The EDR signal is constructed from the variance of the ECG signal at each R-peak, or from a phase reconstruction feature of QRS waves (Janbakhshi and Shamsollahi 2018). In these works, deriving accurate EDR signals is an essential prerequisite.

Deep neural networks (DNNs) utilize the ECG signal directly in the SA screening scenario (da Silva Pinho et al 2016, Pathinarupothi et al 2017, Li et al 2018), unlike feature-based methods (Hassan and Haque 2016, 2017, Zarei and Mohammadzadeh Asl 2020). When a DNN was used to extract features, the classifier achieved an accuracy of 0.847 on segments of all 70 recordings of the Apnea-ECG database (Li et al 2018). End-to-end DNNs undertake both feature generation and classification. They require the least expert knowledge in SA screening. Wang et al trained an end-to-end deep ResNet (Wang et al 2019a). The memory required for saving the deep ResNet is around 33 MB. However, the clinical outcomes of neural networks are inferior to those of state-of-the-art feature-based methods, and implementing these networks on wearable devices remains a major challenge because the computation resources of wearable devices are extremely limited. For example, the wearable platform of Surrel et al used a microcontroller with only 384 KB of flash storage (Surrel et al 2018). Mendonça et al addressed the importance of an optimal performance-complexity ratio (Mendonça et al 2019). This encourages us to make trade-offs between DNN complexity and accuracy in SA syndrome detection.

We propose the fast apnea syndrome screening neural network (FASSNet), a light-weight neural network, to provide wearables with accurate and continuous minute-by-minute apnea diagnosis, from which apnea severity can be estimated following the Apnea-ECG protocol. Identifying and excluding redundant computation in the SA screening scenario is our strategy for finding efficient model structures. High accuracy is ensured by feeding FASSNet with spectrograms, because spectro-temporal features have been effective in identifying SA events (Jarvis and Mitra 2000, McNames and Fraser 2000). The low-frequency dominant nature of ECG spectrograms leads to our light-weight design of the model input. In order to ensure that FASSNet produces reliable results, we carried out patient-agnostic validation and mechanism analysis. In summary, FASSNet proves to be a workable solution for SA screening with wearable devices.

2. Methods

The overall workflow of the algorithm is shown in figure 1. The raw ECG signal is converted into a small spectrogram after three data pre-processing steps. FASSNet outputs minute-by-minute predictions. Evaluation metrics and experiments are also introduced in this section.

2.1. Data pre-processing

Pre-processing steps are designed to facilitate the learning process of FASSNet. In this study, the sampling frequency of the raw ECG signal is set to 100 Hz. Interpolation methods can be used if ECG segments have different sampling frequencies. The data pre-processing scheme is illustrated in figure 2. Signal smoothing is the first step. High-quality RR waves have proved useful in previous studies (Sannino et al 2014, Nakayama et al 2019). Hence, we adopt a Chebyshev type-II low-pass filter, which produces trimmed RR waves by smoothing the raw ECG signal. Spectrograms of 60 s ECG segments are selected in view of their well-established effectiveness (Jarvis and Mitra 2000, McNames and Fraser 2000). They are calculated from the power spectral density via the short-time Fourier transform (STFT), and provide spectro-temporal patterns of SA syndrome. Baseline wander removal is performed during STFT. High-frequency components in a typical ECG epoch are negligible after signal smoothing (figure 2). Yu et al demonstrated that they are redundant in abnormal diagnosis (Yu et al 2019). Hence, we conclude that high frequencies in the ECG signal are redundant for SA syndrome screening, and removing them has a limited adverse influence on accuracy. Removing the high-frequency part of the spectrogram reduces the size of the model input, which in turn reduces computation cost. A 'preserve-edge-information' design is used to select the input size along the frequency axis on the basis of the cut-off frequency (Fs) for signal smoothing. The upper bound frequency of the model input is marginally higher than Fs, which enables models to preserve edge information with a minimal input size. The mapping relation can be formulated as equation (1):

$\begin{eqnarray}&&frequency\,input\,size=ceil\left(\displaystyle \frac{65\,{\rm{Fs}}}{100\,{\rm{Hz}}}\right)+1,\end{eqnarray} \tag{ 1 }$

where 65 is the maximum position index of the frequency array in the spectrogram, and the $ceil(\cdot )$ operation returns the smallest integer that is not less than the input. Misleading spectrogram patterns appear when ECG signals are completely missing. To avoid wrong feedback in the model training process, we eliminate epochs with signal absence, which are identified by a low standard deviation.
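Equation (1) and the signal-absence check can be illustrated with a minimal Python sketch. The function names and the threshold `min_std` are hypothetical; the text does not state the threshold used to reject flat epochs.

```python
import math
import statistics

MAX_FREQ_INDEX = 65    # maximum position index of the frequency array (from the text)
REFERENCE_HZ = 100.0   # denominator used in equation (1)


def frequency_input_size(fs_cutoff_hz):
    """Equation (1): number of frequency rows kept for a cut-off frequency Fs."""
    return math.ceil(MAX_FREQ_INDEX * fs_cutoff_hz / REFERENCE_HZ) + 1


def is_signal_absent(epoch, min_std=1e-3):
    """Flag an epoch whose standard deviation falls below min_std.

    min_std is an illustrative value, not the threshold used in the study.
    """
    return statistics.pstdev(epoch) < min_std


# Cut-off frequency candidates explored in section 4.1
sizes = {fs: frequency_input_size(fs) for fs in (28, 38, 48)}
print(sizes)
```

For the three cut-off candidates the sketch returns 20, 26 and 33 rows, which agrees with the input sizes $20\times 92$, $26\times 92$ and $33\times 92$ reported in section 4.1.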

**Figure 2.** Data pre-processing scheme. At the top of the top two panels, low-pass filter smooths ECG signals to produce trimmed RR waves and reduce high-frequency noise, as shown inside the boxes and the rightmost columns. We discard high-frequency components of spectrograms to decrease the size of model input since high-frequency energy is negligible. The vertical axis of the spectrogram ranges from 0 to 65, which is a 0–100 Hz frequency interval.

2.2. FASSNet

The FASSNet architecture is shown in figure 3. The model comprises two stages, namely local spectrogram pattern learning and long-range information learning. Convolutional layers extract features automatically during the local spectrogram pattern learning stage. In the long-range information learning stage, we apply bidirectional long-short-term memory (Bi-LSTM) blocks along the frequency axis, prioritising position information along frequency bands and avoiding redundant computation along the time axis. This choice is made because there is no need to identify the onset of an SA event during minute-by-minute SA syndrome screening, whilst distinguishing high- and low-frequency bands helps to provide effective features (Ravelo-Garcia et al 2015, Varon et al 2015, Song et al 2016). Layer normalisation is a key component of FASSNet that substantially improves classification performance at low computation cost. It gives us insight into designing a robust model for SA syndrome detection.

2.2.1. Local spectrogram pattern learning

Small kernels are adopted to capture fine-grained spectrogram details (Simonyan and Zisserman 2014). We place a pre-activation block using two 2D convolutional layers in the early stage, followed by a layer normalisation block, a rectified linear unit (ReLU) activation function and a dropout block with a probability of 0.5. The ReLU activation function outputs the maximum of its input and 0 (Nair and Hinton 2010). Our study uses more convolutional layers in the pre-activation block than He et al (2016). Afterwards, two convolutional blocks sequentially perform the same operations. The width of all the convolutional layers in FASSNet is the same. Dropout randomly deactivates neurons in the training process to promote efficient representations and make neural networks generalisable (Srivastava et al 2014). Each pixel of the output is derived from a sufficiently long ECG segment. This is necessary for reporting an apnea/hypopnea event or a normal period, which usually lasts over 10 s according to Quan et al (1999). Lastly, an adaptive max-pooling layer is adopted to select prominent patterns. In this way, the model automatically adapts to variations in input size. The duration covered by each pixel from local spectrogram learning is viewed as a hyper-parameter of FASSNet. It can be tuned by changing the number of convolutional layers in the pre-activation block or in the convolutional blocks, and empirical experiments are shown in section 4.1.
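How the number of convolutional layers controls the duration covered by one output pixel can be sketched as follows. This is an illustration under stated assumptions rather than the exact FASSNet configuration: kernels of size 3 along the time axis, stride 1 and no intermediate pooling are assumed, and the STFT window and overlap are taken from section 2.4.3.

```python
WINDOW_S = 2.56   # STFT window duration (section 2.4.3)
HOP_S = 1.28      # window duration minus the 1.28 s overlap


def pixel_duration_s(num_conv_layers, kernel=3):
    """ECG duration seen by one output pixel along the time axis.

    Assumes stride-1 convolutions sharing one (hypothetical) kernel size
    and no pooling before the adaptive max-pooling layer.
    """
    frames = 1 + num_conv_layers * (kernel - 1)   # receptive field in STFT frames
    return WINDOW_S + (frames - 1) * HOP_S


for n in range(1, 5):
    print(n, "layers:", round(pixel_duration_s(n), 2), "s")
```

Under these assumptions, removing a convolutional layer shortens the duration seen by each pixel by two hop lengths, which is the mechanism by which the layer count acts as a hyper-parameter.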

2.2.2. Long-range information learning

Convolutional layers are insensitive to the location of their input (Liu et al 2018). This suggests that the position information of the spectrogram in our convolution outcomes is insufficient. Position information is unevenly distributed along the time and frequency axes in the spectrogram (figure 2). High-frequency noise can have the same negative impact as low-frequency noise if no position information of frequencies is provided. Furthermore, confusing low- and high-frequency components of an ECG segment may lead to a misleading diagnosis by FASSNet. In contrast, location information along the time axis in the CNN output is less important, because a short ECG segment of Apnea-ECG recordings can be classified as an apnea epoch without providing information about the exact time of an SA syndrome event onset. Therefore, a Bi-LSTM network and an attention layer are implemented along the frequency axis to bring long-range frequency correlations to the model decision. The Bi-LSTM is a concatenation of forward and backward long-short-term memory (LSTM), as shown in equations (2)–(5), and the attention layer is expressed in equations (6)–(7):

$\begin{eqnarray}&&{o}_{t}^{f},\,{h}_{t}^{f},\,{c}_{t}^{f}=LST{M}^{f}\left({h}_{t-1}^{f},\,{c}_{t-1}^{f},\,{x}_{t}\right),\end{eqnarray} \tag{ 2 }$

$\begin{eqnarray}&&{o}_{t}^{b},\,{h}_{t}^{b},\,{c}_{t}^{b}=LST{M}^{b}\left({h}_{t+1}^{b},\,{c}_{t+1}^{b},\,{x}_{t}\right),\end{eqnarray} \tag{ 3 }$

$\begin{eqnarray}&&{o}_{t}={o}_{t}^{f}| | {o}_{t}^{b},\end{eqnarray} \tag{ 4 }$

$\begin{eqnarray}&&{h}_{t}={h}_{t}^{f}| | {h}_{t}^{b},\end{eqnarray} \tag{ 5 }$

$\begin{eqnarray}&&attention\,weight=softmax\left({o}_{t}\times {h}_{t}\right),\end{eqnarray} \tag{ 6 }$

$\begin{eqnarray}&&attention\,outputs={o}_{t}^{T}\times attention\,weight,\end{eqnarray} \tag{ 7 }$

where $LST{M}^{f}$ and $LST{M}^{b}$ denote forward and backward LSTMs, respectively. LSTM (Hochreiter and Schmidhuber 1997) takes an N-length data sequence $\left\{{x}_{t},\,t=0,\,1,\,2,\,3,\ldots ,\,N\right\}$ as input. Their outputs ${o}_{t}^{f}$ and ${o}_{t}^{b}$ contain the output features ${h}_{t}^{f}$ and ${h}_{t}^{b}$ from the last LSTM layer. The LSTM cell states are denoted as ${c}_{t}^{f}\,\,$ and ${c}_{t}^{b},$ which control the information flow and preserve long-range information. $LST{M}^{f}$ leverages past information ${h}_{t-1}^{f}$ and ${c}_{t-1}^{f}$ to determine current feature, and $LST{M}^{b}\,\,$ leverages future information ${h}_{t+1}^{b}$ and ${c}_{t+1}^{b}.$ Bi-LSTM is powerful in extracting both forward and backward contextual information (Graves et al 2005). In FASSNet, the input sequence of Bi-LSTM is along the frequency axis. The spatial relation of low- and high-frequency components of ECG spectrogram is learned. Soft attention block (Bahdanau et al 2014, Xu et al 2015) assigns weights ranging in [0, 1] according to the importance of the regions of input. It provides in-epoch information about the exact time interval at which an apnea event is likely to occur according to $LST{M}^{b}$ and $LST{M}^{f}.$
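The attention step in equations (6)–(7) can be sketched in plain Python. The sketch assumes that the hidden state used for scoring is a single summary vector (for instance the final concatenated hidden state) and that the outputs are one vector per frequency position; the function names and the toy values are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(outputs, hidden):
    """outputs: list of Bi-LSTM output vectors, one per frequency position.
    hidden: summary hidden-state vector (assumed here to be the final state).
    """
    # Equation (6): score every position against the hidden state, then softmax
    scores = [sum(o * h for o, h in zip(o_t, hidden)) for o_t in outputs]
    weights = softmax(scores)
    # Equation (7): weighted sum of the outputs
    dim = len(hidden)
    context = [sum(w * o_t[j] for w, o_t in zip(weights, outputs)) for j in range(dim)]
    return weights, context


outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy values, three positions
hidden = [1.0, 1.0]
weights, context = soft_attention(outputs, hidden)
print(weights, context)
```

The weights lie in [0, 1] and sum to one, so the position whose output is most aligned with the hidden state (the third one in this toy input) contributes most to the attention output.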

2.2.3. Layer normalization

The sensitivity of gradients with respect to weights in one hidden layer to outputs of the previous layer is an undesired property in training DNNs (Ba et al 2016). The mean value $\mu$ and variance ${\sigma }^{2}$ for normalising the output $X\in {R}^{C\times H\times W}$ of a layer are defined in equation (8):

$\begin{eqnarray}&&\mu =\displaystyle \frac{\displaystyle {\sum }_{x\in X}x}{C\times H\times W},\,{\sigma }^{2}=\displaystyle \frac{\displaystyle {\sum }_{x\in X}{\left(x-\mu \right)}^{2}}{C\times H\times W},\end{eqnarray} \tag{ 8 }$

where $C$ is the channel axis and both $H$ and $W$ are spatial axes. If a lengthy event activates additional apnea-sensitive neurons in hidden layers, then the layer-wise output will be accurately normalised by a larger mean value according to the layer normalisation. The proportion of apnea duration is a segment-wise property. Therefore, we abandon the commonly used batch normalisation (Ioffe and Szegedy 2015), where mean values typically come from multiple records in the same batch. Layer normalisation promotes the robustness of FASSNet. We discuss the effect of layer normalisation in section 2.4.2.
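Equation (8) can be illustrated with a minimal sketch that normalises a single sample over all of its channel and spatial elements. The constant `eps` and the omission of learnable gain and bias terms are simplifying assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise one sample of shape C x H x W (nested lists) using equation (8).

    eps is the usual small constant for numerical stability (an assumption here).
    """
    flat = [v for channel in x for row in channel for v in row]
    n = len(flat)                                  # C * H * W
    mu = sum(flat) / n
    var = sum((v - mu) ** 2 for v in flat) / n
    scale = 1.0 / math.sqrt(var + eps)
    return [[[(v - mu) * scale for v in row] for row in channel] for channel in x]


sample = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]   # C=2, H=2, W=2
normed = layer_norm(sample)
flat_out = [v for c in normed for r in c for v in r]
print(flat_out)
```

Because the statistics come from one segment only, the result does not depend on which other segments share the batch, in contrast to batch normalisation.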

2.3. Evaluation metrics

Two sets of criteria are used to evaluate model performance. We use accuracy (Acc), specificity (Spe), sensitivity (Sen), F1 score (F1) and area under the receiver operating characteristics curve (AUC) to describe the classification performance. Meanwhile, the number of model parameters (Params) and the number of multiply-accumulate operations (MACs) in an inference are used to measure real-time performance, because they indicate the applicability of the algorithm under hardware constraints. Classification performance measurements are expressed as equations (9)–(12):

$\begin{eqnarray}&&Acc=\,\displaystyle \frac{TP+TN}{TP+TN+FP+FN},\end{eqnarray} \tag{ 9 }$

$\begin{eqnarray}&&Spe=\,\displaystyle \frac{TN}{TN+FP},\end{eqnarray} \tag{ 10 }$

$\begin{eqnarray}&&Sen=\,\displaystyle \frac{TP}{TP+FN},\end{eqnarray} \tag{ 11 }$

$\begin{eqnarray}&&F1=\,\displaystyle \frac{2TP}{2TP+FP+FN},\end{eqnarray} \tag{ 12 }$

where TP, TN, FP and FN are quantities of true positives, true negatives, false positives and false negatives, respectively. SA syndrome epochs are positives in our systems.
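Equations (9)–(12) translate directly into code. In the sketch below the confusion counts are made-up numbers chosen only to exercise the formulas.

```python
def classification_metrics(tp, tn, fp, fn):
    """Acc, Spe, Sen and F1 as defined in equations (9)-(12)."""
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Spe": tn / (tn + fp),
        "Sen": tp / (tp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


# Illustrative counts, not results from the study
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(metrics)
```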

AUC is used for determining the hyper-parameters of FASSNet. It is calculated by accumulating the area under a receiver operating characteristics (ROC) curve. An ROC curve takes false positive rate (100%-Spe) at different classification thresholds as horizontal axis and true positive rate (Sen) as vertical axis.
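A minimal sketch of the AUC computation is given below. It sweeps the classification threshold from high to low and accumulates trapezoids under the ROC curve; the labels and scores in the example are illustrative, and both classes are assumed to be present.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve by the trapezoidal rule.

    labels: 1 for an apnea minute (positive), 0 otherwise.
    scores: predicted probability of the positive class.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    auc = 0.0
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Samples with tied scores cross the threshold together
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        tpr, fpr = tp / pos, fp / neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return auc


print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```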

There are two real-time metrics to evaluate whether a model for wearable devices is light-weight or not. Params measures memory allocations. The MAC operation is defined in equation (13)

$\begin{eqnarray}&&A\,MAC\,operation:a\leftarrow a+b\times c,\end{eqnarray} \tag{ 13 }$

where $a,$ $b,$ and $c$ are floating-point numbers. MACs is one of the standard measures of computation cost in model inference. The real-time performance of neural networks is calculated by summing the Params/MACs of each block.
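The block-wise summation can be sketched for 2D convolutional layers. The layer shapes below are hypothetical and are not the FASSNet configuration; bias additions are not counted as MACs in this sketch.

```python
def conv2d_params(c_in, c_out, k_h, k_w, bias=True):
    """Number of weights (plus optional biases) of a 2D convolutional layer."""
    return c_out * (c_in * k_h * k_w + (1 if bias else 0))


def conv2d_macs(c_in, c_out, k_h, k_w, out_h, out_w):
    """One MAC (a <- a + b * c) per kernel weight and per output element."""
    return c_in * k_h * k_w * c_out * out_h * out_w


# Hypothetical stack: (c_in, c_out, k_h, k_w, out_h, out_w) per layer
layers = [
    (1, 32, 3, 3, 26, 92),
    (32, 32, 3, 3, 26, 92),
]

total_params = sum(conv2d_params(ci, co, kh, kw) for ci, co, kh, kw, _, _ in layers)
total_macs = sum(conv2d_macs(*layer) for layer in layers)
print(total_params, total_macs)
```

Both quantities grow with the product of input and output channels, so doubling the width of two consecutive layers roughly quadruples the cost of the second one.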

2.4. Experiments

2.4.1. Dataset

The open-source Physionet Apnea-ECG database is used in this study. Table 1 lists the 70 recordings from 32 subjects together with demographic information, and the recording-wise information is available in the appendix. Subjects were between 27 and 63 years of age and had a BMI between 19.20 and 45.33 kg m⁻². Each recording contains a 100 Hz single-lead ECG signal digitised at 12 bit resolution, with a length of approximately 7–10 h. Each recording includes reference annotations for each minute of the ECG signal (Goldberger et al 2000, Penzel et al 2000). Each 60 s ECG segment is labelled as a $Non \mbox{-} apnea\,minute$ or an $Apnea\,minute.$ In Apnea-ECG, an $Apnea\,minute$ contains one or more apnea or hypopnea events. However, the number of apnea and hypopnea events in each minute is not given. Hence, instead of using the AHI defined in the AASM standard (equation (14)), we adopted the common AHI computation of the previous protocol (Penzel et al 2000), as shown in equation (15), to estimate SA syndrome severity. At present, all data and labels are publicly available.

$\begin{eqnarray}&&AH{I}_{AASM}=\frac{60\times (Apnea\,Events+Hypopnea\,Events)}{Apnea\,minutes\,+\,Non \mbox{-} apnea\,minutes},\end{eqnarray} \tag{ 14 }$

$\begin{eqnarray}&&AH{I}_{Apnea \mbox{-} ECG}=\frac{60\times Apnea\,minutes}{Apnea\,minutes\,+\,Non \mbox{-} apnea\,minutes}.\end{eqnarray} \tag{ 15 }$
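The two AHI definitions in equations (14) and (15) can be written as a short sketch. The severity thresholds of 5, 15 and 30 are those used in table 4; the minute counts in the example are illustrative.

```python
def ahi_aasm(apnea_events, hypopnea_events, apnea_minutes, non_apnea_minutes):
    """Equation (14): events per hour of recording."""
    return 60 * (apnea_events + hypopnea_events) / (apnea_minutes + non_apnea_minutes)


def ahi_apnea_ecg(apnea_minutes, non_apnea_minutes):
    """Equation (15): apnea minutes per hour of recording."""
    return 60 * apnea_minutes / (apnea_minutes + non_apnea_minutes)


def severity_flags(ahi, thresholds=(5, 15, 30)):
    """Whether the AHI reaches each severity threshold."""
    return {t: ahi >= t for t in thresholds}


# Illustrative 8 h recording with 120 minutes labelled as apnea
ahi = ahi_apnea_ecg(apnea_minutes=120, non_apnea_minutes=360)
print(ahi, severity_flags(ahi))
```

Because each apnea minute counts once regardless of how many events it contains, equation (15) cannot exceed 60, whereas equation (14) can.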

Table 1. Apnea-ECG recordings and demographics^a.

Subject ID	Recordings	Subject ID	Recordings	Subject ID	Recordings	Subject ID	Recordings
01	a01, a14	09	a09, a18	17	b05, x11	25	c10, x18
02	a02, x14	10	a11	18	c01, x35	26	x02
03	a03, x19	11	a15, x27, x28	19	c02, c09	27	x06, x24
04	a04, a12	12	a17, x12	20	c03, x04	28	x09, x23
05	a05, a10, a20, x07	13	a19, x05, x08, x25	21	c04, x29	29	x10
06	a06, x15	14	b01, x03	22	c05, x33	30	x13, x26
07	a07, a16, x01, x30	15	b02, b03, x16, x21	23	c06	31	x17, x22
08	a08, a13, x20	16	b04, c08	24	c07, x34	32	x31, x32

Demographics of all 32 subjects

Length, minutes: mean (std)	Non-apnea minutes: mean (std)	Apnea minutes: mean (std)	AHI $\geqslant$ 5: %	Age, year: mean (std)	Female gender: %	BMI, kg m⁻²: mean (std)
1075.9 (426.4)	670.7 (362.4)	405.2 (434.1)	50	43.88 (10.86)	21.9	28.24 (6.90)

^aDuring the Physionet challenge 2000, the labels of the 35 recordings with the prefix 'x' were unreleased. Some previous studies also excluded these recordings, as shown in table 2. Currently, the labels of all 70 recordings are available.

2.4.2. Experiment protocols

The diagnosis performance was evaluated using both ten-fold and patient-agnostic validation protocols. K-fold validation is a common protocol for evaluating model performance. Records are divided into training and validation sets at each iteration. We performed ten-fold validation, evaluated FASSNet after every three epochs during training, and retained the test-set results at the point where the model yielded the maximum accuracy. Outputs on the test sets of all ten folds were pooled together to compute classification performance. Patient-agnostic validation shows the generalization ability of the algorithm when new patients are introduced. Moreover, the real-time performance of neural networks is given.

Compared with ten-fold cross validation, patient-agnostic validation, or blind validation, rules out the possibility that different records from the same subject are simultaneously distributed in the training and the test sets. To the best of our knowledge, the patient-agnostic scheme using the Physionet Challenge 2000 database was reported in Surrel et al (2018) and Wang et al (2019a). The authors intentionally used one recording from the test subject in the training set to demonstrate the importance of the patient-agnostic protocol for preventing leakage of personal information. We adopt the same train-test splits in order to make a fair comparison. Moreover, we conducted a leave-one-patient-out experiment to obtain the test results for each subject, where each time FASSNet was trained on ECG segments from all other subjects.
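The leave-one-patient-out protocol can be sketched as a split over a subject-to-recordings mapping. Only four subjects from table 1 are included to keep the example short, and the function name is illustrative.

```python
# Subset of table 1: subject ID -> recordings
SUBJECT_RECORDINGS = {
    "10": ["a11"],
    "11": ["a15", "x27", "x28"],
    "12": ["a17", "x12"],
    "15": ["b02", "b03", "x16", "x21"],
}


def leave_one_patient_out(subject_recordings):
    """Yield (test_subject, train_recordings, test_recordings) for every subject."""
    for test_subject in sorted(subject_recordings):
        test = list(subject_recordings[test_subject])
        train = [
            rec
            for subject, recs in sorted(subject_recordings.items())
            if subject != test_subject
            for rec in recs
        ]
        yield test_subject, train, test


splits = list(leave_one_patient_out(SUBJECT_RECORDINGS))
for subject, train, test in splits:
    print(subject, "train:", train, "test:", test)
```

Splitting by subject rather than by recording guarantees that no recording of the test subject appears in the training set.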

2.4.3. Experiment setup

An i7 CPU with 8 GB of DDR3 memory was used to pre-process ECG data. The validation procedure was carried out on a server with two GPUs and an i5 CPU. We used the WFDB package to divide ECG recordings into 60 s segments. Signal pre-processing steps were implemented with SciPy and NumPy (Jones et al 2001, Oliphant 2006, Van Der Walt et al 2011). As for the STFT parameters, the duration of the Hamming window was 2.56 s, whilst the overlap was 1.28 s. Hence, a single STFT segment covers around 2–3 heart-beats. Neural networks were implemented using PyTorch (Paszke et al 2019). Models were trained with the Adam optimiser (Kingma and Ba 2014). The learning rate, beta1, beta2 and weight decay factor were set to 0.0003, 0.9, 0.999, and 0.001, respectively. Cross entropy was used as the loss function during training. We iterated for a maximum of 60 epochs. MACs and Params were computed using the THOP package.
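The STFT settings can be converted into sample counts with a few lines of arithmetic. The heart rates used below (50 to 70 beats per minute) are assumed resting values chosen for illustration.

```python
FS_HZ = 100        # sampling frequency of the ECG signal
WINDOW_S = 2.56    # Hamming window duration
OVERLAP_S = 1.28   # overlap between consecutive windows

window_samples = round(WINDOW_S * FS_HZ)
overlap_samples = round(OVERLAP_S * FS_HZ)
hop_samples = window_samples - overlap_samples


def beats_per_window(heart_rate_bpm):
    """Approximate number of heart-beats inside one STFT window."""
    return WINDOW_S * heart_rate_bpm / 60


print(window_samples, hop_samples)
for bpm in (50, 60, 70):   # assumed heart rates
    print(bpm, "bpm:", round(beats_per_window(bpm), 2), "beats")
```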

3. Results

We evaluated our model in terms of classification and real-time performance on the Apnea-ECG dataset. The classification performance obtained using the ten-fold validation protocol is shown in table 2, together with a comparison of various state-of-the-art SA detection methods. The classification results of the patient-agnostic scheme using the Physionet Challenge 2000 database are compared to Surrel et al (2018) and Wang et al (2019a), as shown in table 3, and the results of the leave-one-patient-out protocol are given in table 4. Moreover, the real-time performance of neural networks is given in table 5.

Table 2. Classification results of ten-fold validation using the Apnea-ECG database. Hyper-parameters of FASSNet are organised as (Fs, width). NR denotes the number of recordings. NN refers to RR intervals between normal beats.

Year	Method	NR	Input	Acc (%)	Sen (%)	Spe (%)	F1 (%)
2014	Sannino et al (2014)	35	Hand-crafted features	88.57	—	—	—
2015	Ravelo-Garcia et al (2015)	35	Hand-crafted features	84.6	75.1	90.5	78.89
2016	Song et al (2016)	35	Hand-crafted features	86.2	82.6	88.4	81.95
2016	Hassan et al (2016)	35	Hand-crafted features	83.77	85.20	82.79	81.02
2016	da Silva Pinho et al (2016)	35	Hand-crafted features	82.12	88.41	72.29	85.78
2017	González et al (2017)	35	Hand-crafted features	84.76	81.45	86.92	80.85
2017	Hassan and Haque (2017)	35	Hand-crafted features	88.88	87.58	91.49	91.32
2017	Kumar and Kanhangad (2017)	35	Hand-crafted features	89.80	88.46	90.63	86.86
2018	Surrel et al (2018)	35	Hand-crafted features	84.5	79.2	88.1	80.5
2018	Li et al (2018)	70	DNN features	84.7	88.9	82.1	81.63
2019	Wang et al (2019b)	70	Hand-crafted features	87.6	83.1	90.3	83.4
2020	Zarei and Mohammadzadeh Asl (2020)	70	Hand-crafted features	93.90	92.26	94.92	92.09

End-to-end DNN methods

2019	Wang et al (2019a)	35	1 min NN	83.03	78.73	85.60	77.63
2020	FASSNet (28, 32)	70	1 min ECG spectrogram	85.61	74.58	92.57	80.04
2020	FASSNet (38, 64)	70	1 min ECG spectrogram	86.41	77.96	91.74	81.61
2020	FASSNet (48, 32)	70	1 min ECG spectrogram	85.72	74.06	93.08	80.05

Table 3. Results of patient-agnostic validation. NTR is the number of test recordings. The convolutional layer width w and the high-frequency cut-off Fs of the low-pass filter in FASSNet are 64 and 38 Hz. The best performance in each fold is highlighted in bold.

Test subjects	NTR	Method	Acc (%)
10	1	Wang et al (2019a)	61
10	1	FASSNet	70.8
11	1	Wang et al (2019a)	77.6
11	3	FASSNet	85.1
12	1	Wang et al (2019a)	76.9
12	2	FASSNet	88.2
14	1	Wang et al (2019a)	83.2
14	1	FASSNet	93.1
24	1	Wang et al (2019a)	94.5
24	1	FASSNet	97.1
10, 11, 12, 14, 24	5	Wang et al (2019a)	80.6
10, 11, 12, 14, 24	10	FASSNet	86.4
15	2	Wang et al (2019a)	70.3
15	4	Surrel et al (2018)	69.8
15	4	FASSNet	85.6

Table 4. Results of patient-agnostic validation (leave-one-patient-out validation). In apnea severity diagnosis, we denote the result as True if FASSNet makes a correct prediction. The thresholds of apnea severity refer to Kapur et al (2017).

Test subject	Per-segment acc (%)	Apnea severity diagnosis
		AHI >= 5	AHI >= 15	AHI >= 30
01	84.97	True	True	True
02	50.10	True	True	False
03	85.88	True	True	True
04	78.93	True	True	True
05	71.56	True	True	True
06	82.34	True	True	True
07	81.51	True	True	True
08	77.48	True	True	True
09	56.00	True	True	False
10	67.17	True	False	True
11	81.46	True	True	True
12	88.33	True	True	True
13	85.50	True	True	True
14	96.32	True	True	True
15	83.71	False	True	True
16	97.47	True	True	True
17	91.69	True	True	True
18	100	True	True	True
19	98.74	True	True	True
20	100	True	True	True
21	99.79	True	True	True
22	99.46	True	True	True
23	99.79	True	True	True
24	99.66	True	True	True
25	98.17	True	True	True
26	88.70	True	True	True
27	99.65	True	True	True
28	88.02	True	True	True
29	97.45	True	True	True
30	65.39	True	True	True
31	99.63	True	True	True
32	91.87	True	True	True
Average	87.09	96.97%	96.97%	93.94%

Table 5. Real-time performances amongst neural networks. AlexNet (Krizhevsky 2014) and Deep ResNet (Wang et al 2019a) are used as the benchmark of model complexity. The input of Deep ResNet and FASSNet lasts for 1 min. Width is a hyper-parameter of FASSNet. The best real-time performance is highlighted in bold.

Model	Width	Input size	MACs (M)	Params (M)
AlexNet	—	$65\times 92$	104.306	56.997
Deep ResNet	—	$1\times 360$	546.635	8.160
FASSNet	32	$20\times 92$	22.819	0.0333
	64	$20\times 92$	90.188	0.115
	128	$26\times 92$	358.258	0.439
	32	$26\times 92$	29.555	0.0333
	64	$26\times 92$	116.727	0.115
	128	$33\times 92$	464.036	0.439

3.1. Classification results of ten-fold validation

Table 2 shows the classification results of ten-fold validation. Methods based on hand-crafted features, such as Hassan and Haque (2017) and Zarei and Mohammadzadeh Asl (2020), are the most effective. The best method is that of Zarei and Mohammadzadeh Asl (2020), whose per-segment accuracy is 93.90%. FASSNet (38 Hz, 64) achieves an accuracy, sensitivity, specificity, and F1 of 86.41%, 77.96%, 91.74%, and 81.61%, respectively, and the accuracy of the most light-weight version of FASSNet reaches 85.61%. It outperforms a series of commonly used SA syndrome detection approaches in classification performance and has the highest accuracy among DNNs. In comparison, the accuracy of the Deep ResNet is 83.03% (Wang et al 2019a), and that of the DNN features proposed by Li et al (2018) is 84.7%. Notably, classification performance alone is insufficient for deciding on a feasible model for portable devices. For instance, Surrel et al (2018) reported an F1 score slightly lower than that of previous studies. However, the algorithm has been competitive for wearable devices since 2018, where trade-offs between classification and real-time performance are required.

3.2. Classification results of patient-agnostic validation

FASSNet achieves the best accuracy in each fold, as shown in table 3. The accuracy is 86.4% when the test subject set is {10, 11, 12, 14, 24}. However, the accuracy drops to 70.8% when patient 10 is the only test subject. The accuracy of FASSNet is over 15% higher than that of Wang et al (2019a) when no information about subject 15 is available, which is evidence that the features from FASSNet are robust. For some subjects, the accuracy can be extremely high (subjects 14 and 24). The polarization of classification performance across patients is also observed in table 4. The test results of some patients have a per-segment accuracy of over 95% (subjects 22–25), and the overall diagnosis of AHI severity is almost perfect. However, FASSNet fails to classify ECG segments of some patients, such as subjects 09 and 30. A possible reason is the heterogeneity of ECG patterns. According to table 1, $AH{I}_{AASM}$ and $AH{I}_{Apnea \mbox{-} ECG}$ of these patients are considerably different, which may introduce errors. The comparison with previous studies indicates that FASSNet performs well in the SA syndrome screening task in terms of classification performance. The polarization of test results demonstrates the importance of patient-agnostic validation in testing SA syndrome screening algorithms.

3.3. Real-time performance

According to table 5, FASSNet (38 Hz, 64) uses 0.2% of the parameters of AlexNet, which is a small neural network in image classification tasks (Krizhevsky 2014). Moreover, it has 4.17% of the MACs of the deep ResNet (Wang et al 2019a), which is a remarkable improvement in real-time performance. As to hyper-parameter selection, the complexity of FASSNet quadruples when its width doubles or its input size quadruples. This finding indicates that the width is a dominant factor for model complexity compared to the input size. Temporal-frequency components of 28–38 Hz contain most of the useful information for diagnosing SA syndrome, whilst providing elements of 38–48 Hz shows very limited improvement. This evidence is consistent with the observation in figure 2 that few spectral-temporal patterns are visible in the 38–48 Hz frequency band.
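The ratios and scaling behaviour quoted above can be recomputed from table 5. The sketch below copies four FASSNet rows (widths 32 and 64 at the $20\times 92$ and $26\times 92$ inputs); the MACs ratio is computed for the lightest row, and the variable names are illustrative.

```python
# (MACs in millions, Params in millions), copied from table 5
ALEXNET = (104.306, 56.997)
DEEP_RESNET = (546.635, 8.160)
FASSNET = {
    # (width, frequency rows of the input): (MACs, Params)
    (32, 20): (22.819, 0.0333),
    (64, 20): (90.188, 0.115),
    (32, 26): (29.555, 0.0333),
    (64, 26): (116.727, 0.115),
}

# Parameters of the width-64 model relative to AlexNet, in percent
params_vs_alexnet = 100 * FASSNET[(64, 26)][1] / ALEXNET[1]
# MACs of the lightest row relative to the deep ResNet, in percent
macs_vs_resnet = 100 * FASSNET[(32, 20)][0] / DEEP_RESNET[0]
# MACs growth when the width doubles at a fixed input size
width_scaling = FASSNET[(64, 26)][0] / FASSNET[(32, 26)][0]
# MACs growth when the input grows from 20 to 26 frequency rows
rows_scaling = FASSNET[(32, 26)][0] / FASSNET[(32, 20)][0]

print(round(params_vs_alexnet, 2), round(macs_vs_resnet, 2))
print(round(width_scaling, 2), round(rows_scaling, 2))
```

In these rows the MACs grow roughly with the square of the width and roughly linearly with the number of frequency rows, which is why the width dominates model complexity.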

4. Discussion

The principal objective of our study is the evaluation of light-weight neural networks for wearable devices in minute-by-minute SA syndrome detection. FASSNet provides an accuracy of 86.41%, a sensitivity of 77.96%, a specificity of 91.74%, and an F1 score of 81.61% in ten-fold validation with very stringent computation resources (Params: 0.0333 M, MACs: 22.819 M). Moreover, only a quarter of the computation cost and memory is needed for a light-weight version of FASSNet.

A strength of this study is the rigorous validation of the robustness of FASSNet. Patient-agnostic validation enables the performance of FASSNet to be generalized to real-world applications. Based on per-segment diagnosis, FASSNet can identify patients with mild or severe SA syndrome (AHI > 15) at an accuracy of 96.97% in leave-one-patient-out validation. The estimation of AHI ( ${AH}{I}_{{Apnea} \mbox{-} {ECG}}$ ) follows the protocol proposed in the Physionet Challenge 2000 (Goldberger et al 2000, Penzel et al 2000). The minute-apnea detection criterion has been used in various previous studies (e.g. Zarei and Mohammadzadeh Asl 2020) as a reliable standard for checking algorithms, since it is closely correlated with the AASM guideline.

Moreover, mechanism analysis is introduced to understand to what extent FASSNet is reliable for the SA syndrome screening task. The real-time performance and the upper limit of classification performance are predetermined by the hyper-parameters of FASSNet. A hyper-parameter selection procedure enables us to find the optimal hyper-parameters for wearables and to verify the robustness of the model architecture. Layer normalisation is analysed because it contributes greatly to the classification performance of FASSNet. The attention block provides interpretability for FASSNet and shows that the backward and forward LSTMs produce reliable results. We summarise the advantages and limitations of FASSNet. State-of-the-art feature-based methods show that higher accuracies are attainable, and future work is proposed to improve FASSNet.

4.1. FASSNet hyper-parameter selection

We examined the influence of Fs in low-pass filtering and of the width w of the FASSNet convolutional layers using a grid search method. The Fs candidates are 28, 38 and 48 Hz, and the width candidates are 32, 64 and 128 kernels. According to the 'preserve-edge-information' design in section 2.2, the corresponding input sizes are $20\times 92$, $26\times 92$ and $33\times 92$. The optimal duration of a single pixel from the spectrogram pattern learning outputs is explored by removing a convolution layer in the pre-activation block or by removing a convolution block. The duration edges down when there is only one convolution layer in the pre-activation block, and removing a convolution block results in a significant decline. Figure 4 shows the AUC scores on the test set in ten-fold validation. In this study, we select FASSNet (Fs = 38 Hz, width = 64) as the optimal model for wearable devices, because it has the highest AUC score with low MACs and Params (table 5). FASSNet (28 Hz, 32) is an extremely light-weight solution for wearable devices. A small decline in the pixel duration of the spectrogram patterns results in slightly weaker model performance, whilst a significant decrease makes the performance degradation evident.
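The grid search over (Fs, width) can be sketched as below. The candidate values and input sizes are those listed above; the `grid_search` helper and the `dummy_evaluate` scoring function are hypothetical. In particular, `dummy_evaluate` returns arbitrary placeholder numbers so that the sketch runs, not the AUC scores of figure 4; in practice it would train and validate a model for each setting.

```python
from itertools import product

FS_CANDIDATES = (28, 38, 48)       # low-pass cut-off in Hz
WIDTH_CANDIDATES = (32, 64, 128)   # kernels per convolutional layer
INPUT_SIZES = {28: (20, 92), 38: (26, 92), 48: (33, 92)}


def grid_search(evaluate):
    """Score every (Fs, width) pair and return the best pair with all scores."""
    results = {}
    for fs, width in product(FS_CANDIDATES, WIDTH_CANDIDATES):
        results[(fs, width)] = evaluate(fs, width, INPUT_SIZES[fs])
    best = max(results, key=results.get)
    return best, results


def dummy_evaluate(fs, width, input_size):
    """Placeholder score. NOT a result from the paper."""
    return 1.0 - abs(fs - 38) / 100 - abs(width - 64) / 1000


best, results = grid_search(dummy_evaluate)
print(len(results), "configurations evaluated; best:", best)
```

A practical selection rule would also weigh MACs and Params against the score, as is done when choosing between the (38 Hz, 64) and (28 Hz, 32) variants.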

**Figure 4.** Classification performances in the grid search approach. Hyper-parameters are organised as (Fs, width). The hyper-parameters of FASSNet (Conv Layer −1) and FASSNet (ConvBlock −1) are (Fs = 38 Hz, width = 64).

4.2. Effect of layer normalisation

Layer normalisation prevents the model from overfitting. It uses the sample-dependent mean and standard deviation to align features across different samples. In our view, the randomness of occurrence and the variable duration of an apnea event are partly responsible for the heterogeneity among samples. FASSNet uses layer normalisation to cope with this situation, thereby screening SA syndrome accurately. Layer normalisation significantly improves the classification performance of FASSNet: FASSNet scored an accuracy of only around 70% on the test set in the ten-fold validation process when batch normalisation was used instead. The design of layer normalisation adapts to the randomness of SA syndrome, which makes FASSNet a reliable model.
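The difference between the two normalisation schemes can be illustrated with a small standard-library sketch. It omits the learnable gain and bias and works on flat feature lists, so it shows only the statistics involved, not the layers used in FASSNet; the function names and toy values are hypothetical.

```python
import math


def layer_norm(sample, eps=1e-5):
    """Normalise one sample with its own mean and standard deviation."""
    mean = sum(sample) / len(sample)
    var = sum((x - mean) ** 2 for x in sample) / len(sample)
    return [(x - mean) / math.sqrt(var + eps) for x in sample]


def batch_norm(batch, eps=1e-5):
    """Normalise each feature with statistics computed across the batch."""
    n, n_feat = len(batch), len(batch[0])
    means = [sum(s[j] for s in batch) / n for j in range(n_feat)]
    variances = [sum((s[j] - means[j]) ** 2 for s in batch) / n
                 for j in range(n_feat)]
    return [[(s[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(n_feat)] for s in batch]


# Two toy samples with the same pattern at different amplitudes.
weak = [1.0, 2.0, 3.0, 4.0]
strong = [10.0, 20.0, 30.0, 40.0]

ln_weak, ln_strong = layer_norm(weak), layer_norm(strong)
bn_weak, bn_strong = batch_norm([weak, strong])

print([round(v, 3) for v in ln_weak])
print([round(v, 3) for v in ln_strong])
print([round(v, 3) for v in bn_weak])
```

Layer normalisation maps both samples to nearly the same pattern regardless of amplitude, whereas the batch-normalised output of one sample depends on which other samples share its batch.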

4.3. Attention mechanism for model interpretability

In FASSNet, the forward and backward LSTMs use forward and backward information to detect SA syndrome events. The subsequent attention block assigns importance to the position information of the LSTM outputs. For apnea epochs, the LSTMs are expected to focus on the same period, which shows that both LSTMs can detect the onset of an apnea event. For epochs in which all in-epoch periods are normal, we expect the attention weights of the two LSTMs to diverge, which characterises a normal state. Such divergence indicates the intrinsic heterogeneity of FASSNet, a good sign of robustness. The kernel density of the attention weights for each per-segment label is visualized in figure 5. The attention weight distributions of apnea segments are concentrated in small areas and are consistent with similar Gaussian distributions. This indicates that the forward and backward information agree on the most likely period of an apnea event. In contrast, the attention weights of normal segments are distributed in a random way. They are dispersed towards the edges, where the position index is 0 or 5. This is likely because there is no 'abnormal event' in a 'normal state' at any time point, and central pixels receive additional attention in convolution and max-pooling operations.
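How an attention block turns LSTM outputs into position weights can be sketched generically. The example assumes a simple dot-product scoring vector followed by a softmax over six positions (indexes 0 to 5, as in figure 5); the exact scoring function of FASSNet is not reproduced here, and the hidden states, the `score_vector` and the `attend` helper are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, score_vector):
    """Return attention weights over positions and the weighted context."""
    scores = [sum(h * v for h, v in zip(state, score_vector))
              for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context


# Six hypothetical LSTM outputs, one per position index 0..5.
hidden = [[0.1, 0.0], [0.2, 0.1], [0.9, 0.8],
          [0.3, 0.2], [0.1, 0.1], [0.0, 0.1]]
weights, context = attend(hidden, score_vector=[1.0, 1.0])

print([round(w, 3) for w in weights])
print("most attended position:", weights.index(max(weights)))
```

With six positions, a completely uninformative attention block would assign 1/6 to each position, so concentrated weights around one index indicate that the model has localised an event.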

**Figure 5.** Kernel density estimation of apnea and normal segments. Green and blue plots denote the attention weight distributions of apnea and normal segments in the Apnea-ECG database, respectively. Forward and backward weights are obtained from the Bi-LSTM and attention block of the well-trained FASSNet (38 Hz, 64). The numerical range of all the x- and y-axes is [0.1, 0.25]. The numbers 0–5 in the top right-hand corner are the position indexes of the outputs of the attention block.

4.4. Advantages and limitations

Compared to light-weight algorithms on wearables (e.g. Sannino et al 2014, Surrel et al 2018), the chief advantage of FASSNet (Fs = 38 Hz, width = 64) is its excellent classification performance (Acc = 86.41%, Sen = 77.96%, Spe = 91.74%, F1 score = 81.61%). Compared to current DNNs (e.g. Wang et al 2019a), apnea screening through FASSNet requires an extremely low computation cost (MACs = 90.188 M, Params = 0.115 M). The accuracy of FASSNet shows an upward trend with an increasing volume of training data, although patient-agnostic validation is more difficult than ten-fold validation (tables 2 and 4). This indicates that FASSNet, like typical deep networks, benefits from the expansion of the database. The architecture of FASSNet is robust, since it takes the characteristics of SA syndrome into account. Theoretically, FASSNet possesses a strong ability to reject high-frequency disturbance because only low-frequency components are used as model input. Moreover, empirical evidence shows that changing the hyper-parameters of FASSNet causes only slight degradation in classification performance (figure 4). The light-weight design and reliability of FASSNet make it applicable to various scenarios for wearable devices.

However, FASSNet has some drawbacks compared to previous approaches if we assume that computation cost is not a constraint. Some computer-aided neural networks (Li et al 2018) and feature-based methods (Zarei and Mohammadzadeh Asl 2020) show higher accuracies (table 2). Patient-agnostic validation shows that the diagnosis of some patients can be unsatisfactory even when models possess high accuracies in cross validation. For instance, the accuracy for subject 10 is 70.8% (see table 3), and the diagnosis of apnea severity for subject 09 is wrong (see table 4). Surrel et al (2018) also reported the same performance deterioration in certain patients. They suggested capturing subject-specific information, since subjects vary demographically in ways that induce different ECG morphological patterns. For example, Laureanti et al (2020) documented that ECG parameters show statistical differences between genders. These findings demonstrate the necessity of patient-agnostic validation. Therefore, evaluation on larger-scale databases is important before implementing FASSNet on wearable devices. Furthermore, Apnea-ECG is a database released in the year 2000, and we need to update FASSNet according to the recent AASM standard.

4.5. Future work

Large-scale modern datasets contribute to obtaining reliable models. Feeding more data to FASSNet can improve model performance without introducing additional computation loads. Considering that the architecture of FASSNet is robust, it is promising to apply it to other physiological signals. For instance, a dataset of 86 recordings from a hospital was used to evaluate a CNN model (Urtnasan et al 2018). Airflow signal recordings from clinical sources and from the MIT-BIH PSG database (Ichimaru and Moody 1999) are sufficient to train an LSTM (Yang et al 2019). The optimal hyper-parameters of FASSNet need to be fine-tuned when the input signal is different.

5. Conclusion

FASSNet, an end-to-end SA syndrome screening neural network for wearable devices, is proposed in this study. It learns features from the ECG signal directly. The model design emphasises capturing spectro-temporal patterns. Low-frequency components and edge information of the ECG spectrogram are used as the input to achieve good classification and real-time performance. The model integrates convolution layers, layer normalisation, Bi-LSTM and attention blocks along the frequency axis. The algorithm achieves feasible classification results with an accuracy, sensitivity, specificity and F1 score of 86.41%, 77.96%, 91.74% and 81.61%, respectively. FASSNet requires extremely few computation resources owing to its light-weight design (Params = 0.155 M, MACs = 90.188 M). The hyper-parameter selection procedure demonstrates the robustness of the FASSNet architecture. FASSNet is tested rigorously by a patient-agnostic validation protocol, and mechanism analysis also supports its reliability. Overall, FASSNet is light-weight and accurate enough to be implemented on wearable devices.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (81973744 and 81473579) and the Natural Science Foundation of Beijing Municipality (7173267).

Conflict of interest statement

The authors declare that they have no conflicts of interest.

Appendix. Record-wise demographics

Subject ID	Recordings	Non-apnea minutes	Apnea minutes	AHI	Age	Sex	BMI (kg m⁻²)
01	a01	20	470	69.6	51	M	33.31
	a14	127	383	54.7
02	a02	109	420	69.5	38	M	37.04
	x14	52	439	79.5
03	a03	274	246	39.1	54	M	28.34
	x19	81	407	56.2
04	a04	40	453	77.4	52	M	40.42
	a12	44	534	80.2
05	a05	179	276	41	58	M	25.18
	a10	418	100	21
	a20	196	315	41
	x07	270	240	21
06	a06	305	206	24.7	63	M	32.46
	x15	299	200	15.9
07	a07	190	322	63	44	M	33.52
	a16	163	320	41
	x01	149	375	63
	x30	286	326	41
08	a08	313	189	42	51	M	27.46
	a13	252	244	42
	x20	250	264	43
09	a09	115	381	31.7	52	M	25.88
	a18	52	438	82.4
10	a11	245	222	14	58	M	36.49
11	a15	143	368	52	60	M	36.48
	x27	11	488	75
	x28	62	434	75
12	a17	328	158	33	44	M	29.86
	x12	471	57	33
13	a19	298	205	34	55	M	28.41
	x05	190	316	34
	x08	194	324	48
	x25	220	291	48
14	b01	469	19	0.24	44	F	21.80
	x03	454	12	0.13
15	b02	425	93	19	53	M	27.44
	b03	369	73	24
	x16	451	65	24
	x21	391	120	19
16	b04	420	10	0.7	42	M	19.75
	c08	535	0	0
17	b05	377	57	5	52	M	41.67
	x11	445	13	1
18	c01	485	0	0	31	M	21.86
	x35	484	0	0
19	c02	502	1	0	37	M	25.62
	c09	467	2	0
20	c03	455	0	0	39	M	19.20
	x04	483	0	0
21	c04	483	0	0	41	F	20.06
	x29	471	0	0
22	c05	464	3	0	28	F	19.96
	x33	471	3	0
23	c06	468	1	0	28	F	22.23
24	c07	450	4	0	30	F	25.18
	x34	472	4	0
25	c10	431	1	0	27	M	21.27
	x18	458	2	0
26	x02	261	209	37.7	46	M	24.74
27	x06	451	0	0	31	M	22.84
	x24	429	1	0
28	x09	342	167	18.5	43	M	25.54
	x23	409	119	14.3
29	x10	415	96	10	39	M	45.33
30	x13	215	292	18.7	57	M	33.17
	x26	177	344	15.1
31	x17	400	1	0	27	F	21.23
	x22	481	2	0
32	x31	42	516	93.5	29	F	29.86
	x32	114	425	71.8
