Paper

Parallel use of a convolutional neural network and bagged tree ensemble for the classification of Holter ECG


Published 12 September 2018 © 2018 Institute of Physics and Engineering in Medicine
Focus on detection of arrhythmia and noise from cardiovascular data. Citation: Filip Plesinger et al 2018 Physiol. Meas. 39 094002. DOI: 10.1088/1361-6579/aad9ee

0967-3334/39/9/094002

Abstract

The automated detection of arrhythmia in a Holter ECG signal is a challenging task due to its complex clinical content and data quantity. It is also challenging due to the fact that Holter ECG is usually affected by noise. Such noise may be the result of the regular activity of patients using the Holter ECG—partially unplugged electrodes, short-time disconnections due to movement, or disturbances caused by electric devices or infrastructure. Furthermore, regular patient activities such as movement also affect the ECG signals and, in connection with artificial noise, may render the ECG non-readable or may lead to misinterpretation of the ECG.

Objective: In accordance with the PhysioNet/CinC Challenge 2017, we propose a method for automated classification of 1-lead Holter ECG recordings. Approach: The proposed method classifies a tested record into one of four classes—'normal', 'atrial fibrillation', 'other arrhythmia' or 'too noisy to classify'. It uses two machine learning methods in parallel. The first—a bagged tree ensemble (BTE)—processes a set of 43 features based on QRS detection and PQRS morphology. The second—a convolutional neural network connected to a shallow neural network (CNN/NN)—uses ECG filtered by nine different filters (8×  envelograms, 1×  band-pass). If the output of CNN/NN reaches a specific level of certainty, its output is used. Otherwise, the BTE output is preferred.

Main results: The proposed method was trained using a reduced version of the public PhysioNet/CinC Challenge 2017 dataset (8183 records) and remotely tested on the hidden dataset on PhysioNet servers (3658 records). The method achieved F1 test scores of 0.92, 0.82 and 0.74 for normal recordings, atrial fibrillation and recordings containing other arrhythmias, respectively. The overall F1 score measured on the hidden test-set was 0.83. Significance: This F1 score led to shared rank #2 in the follow-up PhysioNet/CinC Challenge 2017 ranking.


Introduction

Cardiac arrhythmia in general is a consequence of several cardiac diseases. Different types of cardiac arrhythmias are usually recognized using an electrocardiogram (ECG) signal. Some arrhythmias which are permanently present in an ECG signal may be recognized during a common short-time (30 s) ECG measurement. Other arrhythmias may be present only in a non-persistent form, as is often the case with atrial fibrillation. For this reason, the patient may be fitted with a Holter device for a longer period of days or weeks. However, although Holter ECG is well suited to revealing non-persistent arrhythmias, the processing of these recordings may be very time-demanding.

The design and implementation of fully automated arrhythmia detectors for 1-lead Holter ECG signals was the topic of the PhysioNet/CinC Challenge 2017 [1]. The challenge focused on the processing of short recordings (9–61 s). The goal was to distinguish between normal recordings (N), atrial fibrillation (A), other arrhythmia (O) and recordings too noisy to process (X).

N recordings should contain only a steady heart rhythm at a pace of 60–100 beats per minute (bpm) without any extra beats or other rhythm disturbances, although these limits are only indicative. In contrast, atrial fibrillation represents a rhythm with a fibrillating P-wave resulting in random inter-beat (RR) intervals. A separate class A is reserved for atrial fibrillation because it is associated with a higher risk of heart failure, it is usually episodic, and seeking evidence of atrial fibrillation is a common reason for fitting a patient with a Holter ECG device. Class O for other arrhythmias represents any other rhythm disturbance such as premature ventricular contractions (PVC), premature atrial contractions (PAC), bradycardia, tachycardia and others.

Simple methods for the detection of atrial fibrillation such as [2–5] are based on analysis of inter-beat (RR) intervals, usually using a Poincaré plot or Lorenz plots [6]. However, these approaches may become limited in cases in which other arrhythmias are present in the ECG signal because they also disturb heart rhythm. Furthermore, Holter ECG recordings are acquired during regular patient activity (sleep, movements, employment, sports), meaning that the ECG signal usually contains movement artifacts and power-grid noise and may also suffer from poor contact between the body and ECG electrodes (sweat, especially in connection with movements). The capability of these simple methods may suffer under such circumstances, as can also be deduced from the results of the PhysioNet Challenge 2017, in which methods using only a few well-known features [7–9] showed rather limited performance.

Therefore, the trend in the classification of real Holter ECG signals is towards the extraction of a higher number of features (describing both atrial and ventricular activity) and the use of a machine learning approach such as recurrent neural networks [10, 11], tree forests/ensembles [12–14] or support vector machines [15, 16]. While these approaches use features dependent on QRS detection, a substantial amount of research has led to the use of convolutional neural networks (CNN) [17–19], in which specific features are extracted during the machine learning process itself. Since CNNs were built primarily for image processing, the usual approach is to convert the signal into some form of image, most often using Fourier [20–22] or wavelet transformation. Moreover, transformation to dual-beat coupling [23] and approaches using even 1D ECG signals have already been presented [24, 25]. It is also worth mentioning that some approaches combine deep learning techniques with features based on QRS detection [26, 27].

In this manuscript, we propose a method for automated detection of cardiac arrhythmias in Holter ECG signals. It is an improvement on our previous work [28] with a substantially simplified feature set and a different architecture. These improvements have led to increased capability over the previous solution.

Method

The presented approach (figure 1) uses two machine learning methods in parallel: a bagged tree ensemble (BTE) and a convolutional neural network chained with a shallow neural network (CNN/NN). While the BTE uses a mostly common feature set based on QRS detection (mean inter-beat interval, its variation range, etc), the CNN/NN uses a transformed ECG signal. Both methods compute their results independently; if the CNN/NN branch is sufficiently confident about its result (its output exceeds a given threshold), the CNN/NN result is used. Otherwise, the BTE result is used (with the exception of the noisy class X). This design was expected to increase the robustness of the method in comparison with the use of the CNN/NN or BTE approach alone; it should also suffer less from overtraining (overtraining was an issue for our former approach, in which these methods were used in a chain). The method was implemented in Matlab (r2017a) using the Statistics and Machine Learning Toolbox and the Signal Processing Toolbox.

Figure 1. Flowchart of presented method. Source signal is loaded and transformed to envelograms (A). The method uses two independent branches: a bagged tree ensemble (BTE) branch and a convolutional neural network (CNN/NN) branch. The CNN/NN branch (H)–(J) uses a different feature set to the BTE branch (B)–(F). If the recording is not considered noisy by the BTE branch (G), the two machine learning methods compete at the end of the process (K). N-A-O-X refers to the classes normal-atrial fibrillation-other arrhythmia and noisy.

Used dataset

We used a public dataset from the PhysioNet/CinC Challenge 2017—single-lead Holter ECG, acquired from a device (AliveCor Inc., CA, US) in 12-bit depth and with a sampling frequency of 300 Hz. The dataset contains files categorized by pathology into four classes. The expert labeled dataset was divided into a public (8528) and hidden (3658) part. We removed 345 files from the public part in cases in which we disagreed with the expert labeling. Therefore, a total of 8183 files was used for training and local testing. The hidden part of the dataset was used for remote tests on PhysioNet servers. Labeling corrected after the end of the 'Official Challenge Phase' was not used for training.

Preprocessing—signal transformation into envelograms

The ECG signal is transformed into several amplitude envelopes (envelograms) which are later used for QRS detection and feature extraction (figure 1(A)). Envelograms are computed using Fourier and Hilbert transformations in specific frequency ranges. Envelograms in lower frequencies (LF, 1–6 Hz, figure 2(C)) and middle frequencies (MF, 8–30 Hz, figure 2(B)) serve as sources for QRS detection (figure 1(B), dots). The LF envelogram is more sensitive to T-waves, while the MF envelogram is more sensitive to QRS complexes; specific conditions during QRS detection make use of this difference. These envelograms are also used for distinguishing ventricular contractions from regular QRS complexes using the ratio of the LF and MF envelograms [29]. The input ECG signal is also filtered into the 2–30 Hz band (CF signal) to remove baseline wandering and noise; the signal transformed in this way is prepared for QRS morphology clustering (figure 1(C)).
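
As an illustration of how a band-limited envelogram of this kind can be obtained, the following minimal Python sketch keeps only the positive-frequency Fourier coefficients inside the requested band, which yields a band-limited analytic signal whose absolute value is the amplitude envelope. It is a sketch under stated assumptions, not the authors' Matlab implementation: the function names (`dft`, `idft`, `envelogram`), the rectangular band selection and the synthetic test signal are hypothetical, and a naive O(n²) transform is used only to avoid external dependencies.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (dependency-free)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def envelogram(signal, fs, f_lo, f_hi):
    """Amplitude envelope of `signal` restricted to the band [f_lo, f_hi] Hz."""
    n = len(signal)
    spectrum = dft(signal)
    analytic = [0j] * n
    # Keep strictly positive frequencies inside the band and double them;
    # everything else (DC, Nyquist, negative frequencies) is set to zero.
    for k in range(1, (n + 1) // 2):
        if f_lo <= k * fs / n <= f_hi:
            analytic[k] = 2 * spectrum[k]
    return [abs(v) for v in idft(analytic)]


# Synthetic check: a 10 Hz sine sampled at 100 Hz lies in the MF band
# (8-30 Hz) but not in the LF band (1-6 Hz).
fs = 100
test_signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(100)]
mf = envelogram(test_signal, fs, 8, 30)
lf = envelogram(test_signal, fs, 1, 6)
```

For the synthetic sine, the MF envelogram is close to the sine amplitude over the whole record, while the LF envelogram stays near zero.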

Figure 2. QRS detection. The raw ECG signal (A) is transformed into MF (B) and LF (C) envelograms. While the MF envelogram serves as a source of peaks (checked by maxima intensity in a 600 ms window and standard deviation in its three subregions), the LF envelogram serves for an additional check of QRS existence at a specific point—if area (C)-L is more than 1.5×  larger than area (C)-R, the QRS is resolved as false (important when sharp T-waves are misinterpreted as QRS). The ratio of areas under the LF and MF curves [29] is also used for evaluating several features (table 1 #25–29).

QRS detection

For QRS detection (figure 1(B)), peaks were detected in the MF envelogram. The detection criteria were set to 'Descend' order and the minimal distance to other peaks was set to 150 ms (the maximal detectable heart rate is, therefore, 400 bpm). Each peak must then pass three checks.

First, the maximum of the MF signal in a window of ±300 ms around the peak must not exceed 1.5× the value at the peak. This time window was set with respect to the maximal expected P-R distance and R-T (top of the T-wave) distance. A peak that fails this check is probably a P-wave, a very sharp top of a T-wave or noise; such peaks are removed from further processing.

Second, the same time window around the peak is divided into three equal, consecutive parts (RA, RB and RC in figure 2, each 200 ms long) and the standard deviations inside these parts are compared. If the sum of the standard deviations of the border regions RA and RC is higher than the standard deviation of RB, the peak is removed from further processing.

Third, each peak is checked for low-frequency activity on its borders using the LF envelogram, because low-frequency activity connected to a QRS complex should follow it, owing to the T-wave. Therefore, the sum of the region −250:0 ms from the tested peak (figure 2(C)-L) is tested against the sum of the region 0:250 ms (figure 2(C)-R). If sum(L) > 1.5 × sum(R), the low-frequency activity before the peak is much higher than after it, so the peak is unlikely to be a QRS; it is considered a false QRS and removed from further processing. The 250 ms windows were chosen to capture the areas in the LF envelogram related to the T-wave. However, at a very fast pace, the T-waves both before and after the tested peak may fall into the analyzed ±250 ms area; the 1.5× multiplier was added to cover this situation. Furthermore, during events such as ventricular tachycardia, the majority of the detected QRS may be removed. Therefore, this step is omitted if more than 95% of all peaks are removed during this check.

Finally, if the number of detected QRS is insufficient (<4), the process is aborted and the record is considered noisy.
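
The three per-peak checks described above can be sketched in Python as follows. This is a minimal illustration under assumptions rather than the original Matlab code: the function name `passes_qrs_checks`, the helper `bump`, the handling of record borders and the use of the population standard deviation are all hypothetical choices, and the 95% bypass rule is not included.

```python
import math
from statistics import pstdev


def _std(values):
    # Border regions may be empty or very short near the record edges.
    return pstdev(values) if len(values) > 1 else 0.0


def passes_qrs_checks(mf, lf, peak, fs=300):
    """Apply the three checks to one candidate peak (index `peak`)."""
    w100, w250, w300 = (round(ms * fs / 1000) for ms in (100, 250, 300))
    lo, hi = max(0, peak - w300), min(len(mf), peak + w300)

    # Check 1: no MF value within +/-300 ms may exceed 1.5x the peak value.
    if max(mf[lo:hi]) > 1.5 * mf[peak]:
        return False

    # Check 2: border regions RA and RC (200 ms each) must together vary
    # less than the central region RB.
    a, b = max(lo, peak - w100), min(hi, peak + w100)
    if _std(mf[lo:a]) + _std(mf[b:hi]) > _std(mf[a:b]):
        return False

    # Check 3: LF activity before the peak must not dominate the activity
    # after it (a T-wave is expected to follow a QRS complex).
    left = sum(lf[max(0, peak - w250):peak])
    right = sum(lf[peak:peak + w250])
    return not left > 1.5 * right


def bump(center, width, n=900):
    """Synthetic Gaussian-shaped envelope used only for the example."""
    return [math.exp(-((i - center) / width) ** 2) for i in range(n)]


mf_env = bump(450, 10)
accepted = passes_qrs_checks(mf_env, bump(510, 20), peak=450)  # LF after peak
rejected = passes_qrs_checks(mf_env, bump(390, 20), peak=450)  # LF before peak

# A minimal peak distance of 150 ms bounds the detectable heart rate.
max_rate_bpm = 60_000 / 150
```

In the synthetic example, the candidate followed by low-frequency activity is accepted, while the same candidate preceded by low-frequency activity is rejected by the third check.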

QRS clustering

Because the task is also to find other arrhythmias in addition to atrial fibrillation, we decided to cluster QRS complexes using their morphology (figure 1(C)). The reason for this is that atrial fibrillation, as well as the sinus rhythm, should contain a single QRS morphology, while premature ventricle contractions, for example, should result in additional morphological groups. Clustering is achieved using the Pearson correlation, computed in a  ±150 ms window centered at QRS complexes; correlation is computed using the ECG signal filtered in the 2–30 Hz band (CF). The size of the correlation window was set in respect to QRS duration, which is not expected to be longer than 300 ms.

The clustering process works as follows: the first QRS complex is assigned to the first morphological group. Each following QRS complex is compared to all QRS complexes already assigned (called prototypes). If it correlates sufficiently (>0.9) with a prototype, the tested QRS complex is assigned to the same morphological group as that prototype; otherwise, it starts a new morphological group. Noisy signals typically produce a very high number of morphological groups, while high-quality signals (even with pathological content) should contain two to four morphological groups. Therefore, if an excessive number of morphological groups is found (more than the length of the signal in seconds), the process is aborted and the record is considered noisy.
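
A greedy correlation-based grouping of this kind might look like the following Python sketch. It is an illustration only: the names `pearson`, `cluster_beats` and `too_many_groups` are hypothetical, the rule that the first sufficiently correlated prototype wins is an assumption, and the later merging of time-shifted groups is not shown.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def cluster_beats(beats, threshold=0.9):
    """Assign each beat window to a morphological group index (0-based)."""
    labels = []
    for i, beat in enumerate(beats):
        group = None
        for j in range(i):  # compare with every beat assigned so far
            if pearson(beat, beats[j]) > threshold:
                group = labels[j]
                break
        if group is None:  # no prototype matched: open a new group
            group = max(labels, default=-1) + 1
        labels.append(group)
    return labels


def too_many_groups(labels, duration_s):
    """Noise criterion: more groups than seconds of signal."""
    return max(labels) + 1 > duration_s


shape_a = [0, 1, 4, 1, 0, -1, 0]
shape_b = [-v for v in shape_a]  # inverted shape, correlation -1
labels = cluster_beats([shape_a, [2 * v for v in shape_a], shape_b, shape_a])
```

Note that the scaled copy of `shape_a` lands in the same group as the original, because the Pearson correlation ignores amplitude.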

There is also a risk that the same QRS morphology will be registered multiple times, but with a time shift. To avoid this, an average shape for each morphology group is created and their central parts (±75 ms) are correlated with a shift (also  ±75 ms). If any two groups correlate well enough (>0.95), they are merged together; averaged shapes must be regenerated and the process continues until none of the groups can be merged with another.

Morphological groups are ordered by a count of QRS complexes. Therefore, the first morphological group (G1) should always describe a sinus rhythm (with the exception of bigeminy PVCs). G2 will usually contain the most common morphology of extra beats, G3 will contain a less common morphology, etc. Obviously, the presence of noise will create additional morphological groups, but these should not contain multiple QRS complexes because the noise is expected to be random.

Once morphological groups are defined, we can look for missing QRS complexes (which have not, for some reason, passed the initial checks). The search for missing QRS complexes works as follows: a cross-correlation function is generated from the averaged shape of each morphological group (containing more than one QRS complex) and the CF signal. Next, peaks are detected in this cross-correlation function (>0.94). Next, nominal values in the position of the detected peaks are compared to corresponding samples in averaged shapes of morphological groups. Finally, if a newly detected peak passes these checks and its position is further than 150 ms from a QRS already detected, the new QRS complex is defined and added to the specific morphological group.

Clustering based on the Pearson correlation may lead to incorrect results in cases in which shapes correlate sufficiently but differ in scale. In terms of ECG, beats with the same QRS appearance but with 0.1× the amplitude probably do not belong to the same morphological group and are probably the result of noise. Therefore, the sum of the MF envelogram in the region ±150 ms around each QRS complex is computed, and the median of these sums over group G1 (MedMF) serves as a base for comparison: the sum for every QRS complex must lie between MedMF × 0.2 and MedMF × 4, or the complex is excluded from the list of QRS complexes. Finally, morphological groups are reordered again by their number of QRS complexes.
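
The amplitude gate can be illustrated with a short Python sketch. The function name `amplitude_gate`, the inclusive bounds and the example values are assumptions made for illustration; `energies` stands for the per-beat sums of the MF envelogram in the ±150 ms region.

```python
from statistics import median


def amplitude_gate(energies, g1_indices, low=0.2, high=4.0):
    """Return indices of beats whose MF sum lies within [low, high] x MedMF."""
    med_mf = median(energies[i] for i in g1_indices)
    return [i for i, e in enumerate(energies)
            if low * med_mf <= e <= high * med_mf]


# Hypothetical per-beat MF sums; beats 0, 1, 2 and 5 belong to group G1.
energies = [10, 11, 9, 1, 50, 12]
kept = amplitude_gate(energies, g1_indices=[0, 1, 2, 5])
```

With a G1 median of 10.5, the accepted range is 2.1 to 42, so the beats with sums of 1 and 50 are excluded.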

Feature extraction for the bagged tree ensemble

A list of the 43 features extracted for the bagged tree ensemble is shown in table 1. The first three features (table 1 #1–3) are not dependent on QRS detection and cover the hypothetical situation in which, for example, a block of ventricular fibrillation prevents the QRS detector from detecting single beats. For this, we use a previously developed approach [29]: the longest block in which the LF envelogram is dominant over the MF envelogram is described by its MF to LF area ratio (table 1 #1), its length in seconds (table 1 #2) and the sum of the LF envelogram (table 1 #3).

Table 1. List of 43 features extracted for bagged tree ensemble.

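# | Feature name | Based on | Detailed description
1 | maxLFratioMFLF | Prevalence of LF over MF envelogram | Ratio of MF to LF sums during the longest prevalence
2 | maxLFLength | Prevalence of LF over MF envelogram | Maximal length (s) of this prevalence
3 | maxLFint | Prevalence of LF over MF envelogram | Sum of this prevalence
4 | rrMin | RR intervals derived from all QRS complexes | Minimum
5 | rrMax | RR intervals derived from all QRS complexes | Maximum
6 | rrMean | RR intervals derived from all QRS complexes | Mean
7 | rrMed | RR intervals derived from all QRS complexes | Median
8 | rrSTD | RR intervals derived from all QRS complexes | Standard deviation
9 | rrSTD_mean | RR intervals derived from all QRS complexes | Standard deviation/mean
10 | rrVR | RR intervals derived from all QRS complexes | Variation range
11 | numExRat | RR intervals | Number of peaks on RR series
12 | lower10ratio | RR intervals | Number of RRs < rrMean − 0.1 × (rrMean)
13 | higher10ratio | RR intervals | Number of RRs > rrMean + 0.1 × (rrMean)
14 | ratioQRS1toAll | QRS morphological groups | Ratio of QRS from G1 to all QRS complexes
15 | ratioQRS2toAll | QRS morphological groups | Ratio of QRS from G2 to all QRS complexes
16 | ratioQRS2to1 | QRS morphological groups | Ratio of G2 QRS to G1 QRS complexes
17 | ratioQRS12toAll | QRS morphological groups | Ratio of G1 + G2 QRS complexes to all QRS complexes
18 | twSTDmin | Examination of 6-beat sequence of QRS complexes | Minimal STD(RR)
19 | twSTDmax | Examination of 6-beat sequence of QRS complexes | Maximal STD(RR)
20 | twSTDmed | Examination of 6-beat sequence of QRS complexes | Median STD(RR)
21 | twSTDstd | Examination of 6-beat sequence of QRS complexes | Standard deviation of STD(RR)
22 | twC0Smin | Examination of 6-beat sequence of QRS complexes | Minimal average RR
23 | twC0Smax | Examination of 6-beat sequence of QRS complexes | Maximal average RR
24 | twC0Smed | Examination of 6-beat sequence of QRS complexes | Median average RR
25 | KESmin | Ratio of LF to MF sum around QRS complex (high values should be associated with extra ventricle beats) | Minimum KES value of all QRS
26 | KESmax | Ratio of LF to MF sum around QRS complex | Maximum KES value of all QRS
27 | KESmed | Ratio of LF to MF sum around QRS complex | Median KES value of all QRS
28 | KESmean | Ratio of LF to MF sum around QRS complex | Mean KES value of all QRS
29 | KESstd | Ratio of LF to MF sum around QRS complex | Standard deviation KES value of all QRS
30 | exToAll | RR intervals | Number of unusual RR intervals to all RR intervals
31 | dRRsMin | Derivative of RR intervals series | Minimum
32 | dRRsMax | Derivative of RR intervals series | Maximum
33 | dRRsMed | Derivative of RR intervals series | Median
34 | dRRsMean | Derivative of RR intervals series | Mean
35 | dRRsSTD | Derivative of RR intervals series | Standard deviation
36 | qrs1ratioLFMF | QRS morphological group G1 | Ratio of center of averaged G1 QRS complex in LF envelope to the same sample in MF envelope
37 | pW1 | Regions of expected P-wave presence | Variability using variation range
38 | pW2 | Regions of expected P-wave presence | Variability using STD
39 | pW3 | Regions of expected P-wave presence | Variability using mean
40 | jhCorrRR | RR intervals | Correlation of circularly shifted RR intervals (by 1)
41 | rawNumExA | Averaged shape G1 from RAW ECG | Number of local extremes before R-wave
42 | rawNumExB | Averaged shape G1 from RAW ECG | Number of local extremes around R-wave
43 | rawNumExC | Averaged shape G1 from RAW ECG | Number of local extremes after R-wave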

All other features used for the bagged tree ensemble depend on the detected QRS complexes and their morphological groups. A statistical description of the RR intervals forms seven features (table 1 #4–10); the number of peaks in the series of consecutive RR values, as well as the number of RRs higher or lower than the average RR (outside the ±0.1 tolerance), are additional features (table 1 #11–13).
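
A minimal Python sketch of the basic RR statistics is given below. The dictionary keys follow the feature names in table 1, but the function name `rr_features`, the use of the sample standard deviation and the reading of the variation range as maximum minus minimum are assumptions; feature #11 (number of peaks on the RR series) is omitted because its exact definition is not given here.

```python
from statistics import mean, median, stdev


def rr_features(qrs_positions):
    """RR statistics from QRS positions (any consistent time unit)."""
    rr = [b - a for a, b in zip(qrs_positions, qrs_positions[1:])]
    rr_mean = mean(rr)
    rr_std = stdev(rr)  # sample standard deviation (assumption)
    return {
        "rrMin": min(rr),
        "rrMax": max(rr),
        "rrMean": rr_mean,
        "rrMed": median(rr),
        "rrSTD": rr_std,
        "rrSTD_mean": rr_std / rr_mean,
        "rrVR": max(rr) - min(rr),  # variation range read as max - min
        "lower10ratio": sum(r < rr_mean - 0.1 * rr_mean for r in rr),
        "higher10ratio": sum(r > rr_mean + 0.1 * rr_mean for r in rr),
    }


# Hypothetical QRS positions in milliseconds: one short and one long interval.
features = rr_features([0, 800, 1600, 2400, 2900, 3900])
```

For this example the RR intervals are 800, 800, 800, 500 and 1000 ms, so one interval falls below and one above the ±10% band around the mean of 780 ms.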

Thanks to morphology clustering it is possible to quantify ratios of other morphological groups as well as the ratio between the most important groups G1 and G2 (table 1 #14–17). We also examined the whole sequence of RR intervals in a sliding 6-beat window, where seven features were extracted from the standard deviation and mean RR in this sliding window (table 1 #18–24).
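
The sliding-window features (table 1 #18–24) could be computed along the lines of the following Python sketch. The function name `sliding_window_features`, the step of one interval and the interpretation of a 6-beat window as five consecutive RR intervals are assumptions made for illustration.

```python
from statistics import mean, median, pstdev


def sliding_window_features(rr, width=5):
    """Statistics of per-window STD(RR) and mean RR over a sliding window.

    `width` is the number of RR intervals per window; six beats span five
    intervals (assumption).
    """
    windows = [rr[i:i + width] for i in range(len(rr) - width + 1)]
    if not windows:
        raise ValueError("RR series is shorter than one window")
    stds = [pstdev(w) for w in windows]
    means = [mean(w) for w in windows]
    return {
        "twSTDmin": min(stds),
        "twSTDmax": max(stds),
        "twSTDmed": median(stds),
        "twSTDstd": pstdev(stds),
        "twC0Smin": min(means),
        "twC0Smax": max(means),
        "twC0Smed": median(means),
    }


# Hypothetical RR series (ms): a regular rhythm followed by one short interval.
tw = sliding_window_features([800, 800, 800, 800, 800, 400])
```

The example yields two windows: a perfectly regular one (standard deviation 0, mean 800 ms) and one containing the short interval (standard deviation 160 ms, mean 720 ms).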

Frequency components of all beats are described in five features (figure 3 and table 1 #25–29), where the ratio of LF and MF envelograms is sensitive to abnormal beats, specifically premature ventricle beats. This is also extracted for the average shape from G1, resulting in a specific feature (table 1 #36).

Figure 3. Detected QRS (B) are accompanied by a feature reflecting their frequency components—'KES' vector (A). While regular QRS usually reach values close to or below one, ventricle beats (C) reach a much higher value. A KES vector is described using five statistical features (min, max, median, mean and standard deviation) which enter the bagged tree ensemble (table 1-#25–29). Detected QRS complexes are shown as dots.

Technically, most other arrhythmias as well as atrial fibrillation result in irregularities in RR intervals. Therefore, another feature (table 1 #30) describes the ratio of unusual RR intervals to all RR intervals (an RR change greater than 10% is considered an unusual interval). This is also covered by examining the first derivative of the RR series; five features (table 1 #31–35) statistically describe changes of RR intervals. The last feature describing changes in RR intervals (table 1 #40) is computed as the correlation of the whole RR series with its copy shifted by one RR interval.
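
The RR-irregularity features from this paragraph (table 1 #30–35 and #40) can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions: the 10% change is measured relative to the preceding RR interval, and the ratio is taken over the number of RR intervals.

```python
import math
from statistics import mean, median, pstdev


def pearson(a, b):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def rr_irregularity_features(rr):
    pairs = list(zip(rr, rr[1:]))
    d_rr = [b - a for a, b in pairs]  # first derivative of the RR series
    unusual = sum(abs(b - a) > 0.1 * a for a, b in pairs)
    shifted = rr[1:] + rr[:1]  # circular shift by one interval
    return {
        "exToAll": unusual / len(rr),
        "dRRsMin": min(d_rr),
        "dRRsMax": max(d_rr),
        "dRRsMed": median(d_rr),
        "dRRsMean": mean(d_rr),
        "dRRsSTD": pstdev(d_rr),
        "jhCorrRR": pearson(rr, shifted),
    }


# Hypothetical RR series in milliseconds with one premature beat.
irr = rr_irregularity_features([800, 800, 800, 500, 1000, 800])
```

In this example three of the interval changes exceed 10%, giving a ratio of 0.5 for the six intervals.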

The behavior of the P-wave, or its link to the QRS complex, is essential for identifying pathologies, namely atrial fibrillation, 2nd or 3rd degree AV-block, or atrial tachycardia. However, the detection of the P-wave is a difficult task because it may be buried under background noise, which is common in Holter recordings. Therefore, we decided not to detect the P-wave directly, but to analyze the region in which the P-wave should exist. We statistically describe the stability of this region preceding each QRS complex (figure 4(D)) in three features (table 1 #37–39).

Figure 4. Averaged shapes of normal (solid) and atrial fibrillation (dashed) records. Atrial fibrillation will cause unnatural flattening of the region preceding the QRS complex. The average shape is used to generate features (table 1 #41–43) from the three regions (A) (120 ms), (B) (160 ms) and (C) (120 ms) and also from region (D) (with adaptable width) where the P-wave should be expected (table 1 #37–39).

The last features (table 1 #41–43) used for the bagged tree ensemble describe the number of extrema computed on the averaged shape of the most common morphology G1. Three regions are analyzed: preceding, surrounding and following the average QRS complex (figure 4(A)–(C)).

If any of these 43 features cannot be evaluated or if the mean RR interval (computed using the 1st morphological group only) is longer than 3 s, the process is aborted and the recording is considered noisy. Otherwise, these features are fed into the bagged tree ensemble. This condition correctly detects 21% (N  =  58) of noisy recordings and incorrectly detects 3% (N  =  9) in the training set.

Bagged tree ensemble

A bagged tree ensemble was chosen because it showed the best results among the tested machine learning methods (simple decision trees, shallow neural networks and support vector machines with different kernels). 70% of the public dataset was used for training; the remaining 30% was used for testing. The tree ensemble consisted of 30 trees. The bagged tree ensemble was trained using the Statistics and Machine Learning Toolbox in Matlab (r2017a). The resulting confusion matrix is shown in table 2.

Table 2. Test confusion matrix for the bagged tree ensemble, acquired from 30% of the training dataset. N is normal sinus rhythm, O is other arrhythmia, A is atrial fibrillation and X is noisy signal. Se: sensitivity; Sp: specificity. The overall F1 score is computed as the average over the F1 scores of the N, O and A classes.

Bagged tree labeling | Expert N | Expert O | Expert A | Expert X | Sp | F1
N | 1398 | 84 | 4 | 8 | 0.94 | 0.91
O | 148 | 478 | 20 | 12 | 0.73 | 0.75
A | 8 | 43 | 168 | 0 | 0.77 | 0.81
X | 22 | 14 | 3 | 44 | 0.53 | 0.60
Se | 0.89 | 0.77 | 0.86 | 0.69 | |
Overall F1 | 0.82
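
The per-class values in table 2 can be reproduced from the confusion matrix with a few lines of Python. In this sketch (the names `CM` and `per_class_scores` are hypothetical), the row-wise ratio is the diagonal count divided by the number of records given that label by the bagged tree, which is the value listed in the Sp column and is often called precision or positive predictive value; the column-wise ratio is the diagonal count divided by the number of records with that expert label, which is the value listed in the Se row.

```python
LABELS = ["N", "O", "A", "X"]

# Rows: bagged tree labeling; columns: expert labeling (values from table 2).
CM = [
    [1398, 84, 4, 8],
    [148, 478, 20, 12],
    [8, 43, 168, 0],
    [22, 14, 3, 44],
]


def per_class_scores(cm):
    scores = {}
    for i, label in enumerate(LABELS):
        tp = cm[i][i]
        labeled_by_method = sum(cm[i])               # row total
        labeled_by_expert = sum(row[i] for row in cm)  # column total
        scores[label] = {
            "row_ratio": tp / labeled_by_method,
            "col_ratio": tp / labeled_by_expert,
            "f1": 2 * tp / (labeled_by_method + labeled_by_expert),
        }
    return scores


scores = per_class_scores(CM)
# The overall score averages the F1 of classes N, O and A only.
overall_f1 = sum(scores[c]["f1"] for c in ("N", "O", "A")) / 3
```

Rounded to two decimals, the computed values agree with the table, including the overall F1 score of 0.82.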

Convolutional neural network (CNN)

The presented method combines two—mostly independent—approaches. While the approach with a bagged tree ensemble relies on QRS detection, the convolutional neural network uses only filtered ECG data. The input for the CNN consists of 6 s-long (without zero padding) blocks of input ECG signal. These blocks are pre-processed with zero-phase bandpass digital filtering (Butterworth filter) and envelograms in ranges (1–5, 5–10, ..., 35–40 Hz) are estimated using the absolute value of the Hilbert transform. Furthermore, the RAW signal and computed envelograms are concatenated into a matrix that is used as an input image for CNN.

The CNN architecture consists of 13 layers. The input image of raw signal and envelograms (dimensions 9 × 1800) is processed by the first convolutional layer with a kernel of dimension 9 × 150 and 30 filters. Next, a non-linearity mapping layer with a ReLU activation function is used. The extracted features are then down-sampled with a max-pooling layer (1 × 2). Subsequently, a second convolutional layer (kernel 1 × 15 × 30) with 50 filters, ReLU and dropout (0.5) is used in order to extract more complex features. The resulting tensor is flattened into a vector and passed to a fully connected layer (15 neurons) with ReLU and dropout (0.5). Finally, a fully connected layer (three neurons) with softmax activation is attached and the probabilities for each class are obtained.
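
To make the layer dimensions concrete, the following Python sketch traces the tensor shape through the layers described above. It assumes a stride of 1 and no padding in both convolutions and a stride of 2 in the pooling layer; none of these settings is stated in the text, so the intermediate sizes are illustrative only. The helper `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


FS = 300                      # sampling frequency (Hz)
height, width = 9, 6 * FS     # 9 channels x 1800 samples (6 s block)

# Convolution 1: kernel 9 x 150, 30 filters.
h1, w1 = conv_out(height, 9), conv_out(width, 150)

# Max-pooling 1 x 2 (stride 2 assumed) halves the time axis.
w2 = w1 // 2

# Convolution 2: kernel 1 x 15 over 30 channels, 50 filters.
w3 = conv_out(w2, 15)

# Flattened vector fed to the fully connected layer with 15 neurons.
flat = h1 * w3 * 50
```

Under these assumptions the first convolution collapses the nine input rows into one, and the flattened vector entering the fully connected part has 40 550 elements.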

The proposed CNN system is trained for classification into three groups, i.e. normal rhythm, atrial fibrillation and other arrhythmias. CNN was not trained for noisy recordings due to their low number. A gradient descent algorithm with momentum (100 training examples per minibatch) was used as a training method and cross-entropy error was used as a cost function. The training process was performed for 25 epochs. We have used a data augmentation technique (ECG lead inversion) in order to train CNN for the classification of signals with inverse polarity. On the other hand, we did not see any significant improvement after this augmentation and, therefore, this step should be reconsidered.

Features extracted from CNN and further processing by NN

Because we used a short 6 s window for analysis by the CNN, its output for each of the classes N, O and A is a vector of probabilities along the recording (class X was not recognized by the CNN due to the low number of X recordings). A typical situation for a record containing one premature ventricular beat (resulting in class O) is that most of the record points to class N, but the location of the PVC points to class O (figure 5). This means that quantifying simple maxima may be insufficient for encoding the CNN results. Therefore, we decided to extract several statistical descriptors from the CNN output vectors ${{\overrightarrow{{\rm CNN}}}_{N}}$, ${{\overrightarrow{{\rm CNN}}}_{A}}$ and ${{\overrightarrow{{\rm CNN}}}_{O}}$. These descriptors are used as input features for a shallow neural network. A list of these features is presented in table 3.

Figure 5. Inputs and output of convolutional neural network. The ECG signal (A) is filtered (band-pass 1–40 Hz, (B)); the ECG signal is also transformed into eight envelograms (1–5 Hz, 5–10 Hz, etc up to 40 Hz, (C)). Then, a 6 s floating window moves over the data with a 1 s step and the result of the convolutional neural network (CNN) is computed for each window (D). Therefore, the result of this process are probability vectors for the three classes normal rhythm (blue), atrial fibrillation (red) and other arrhythmia (yellow). The class noisy is not detected using CNN.

Table 3. List of 17 features extracted from the convolutional neural network. These features are mostly statistical descriptors of probability vectors of each class N (normal), A (atrial fibrillation) and O (other arrhythmia). These features are fed into a shallow neural network.

# | Feature name | Description
1 | cnns(1) | mean(${{\overrightarrow{{\rm CNN}}}_{N}}$)
2 | cnns(2) | mean(${{\overrightarrow{{\rm CNN}}}_{A}}$)
3 | cnns(3) | mean(${{\overrightarrow{{\rm CNN}}}_{O}}$)
4 | cnns(4) | std(${{\overrightarrow{{\rm CNN}}}_{N}}$)
5 | cnns(5) | std(${{\overrightarrow{{\rm CNN}}}_{A}}$)
6 | cnns(6) | std(${{\overrightarrow{{\rm CNN}}}_{O}}$)
7 | cnns(7) | max(${{\overrightarrow{{\rm CNN}}}_{N}}$)
8 | cnns(8) | max(${{\overrightarrow{{\rm CNN}}}_{A}}$)
9 | cnns(9) | max(${{\overrightarrow{{\rm CNN}}}_{O}}$)
10 | cnns(10) | min(${{\overrightarrow{{\rm CNN}}}_{N}}$)
11 | cnns(11) | min(${{\overrightarrow{{\rm CNN}}}_{A}}$)
12 | cnns(12) | min(${{\overrightarrow{{\rm CNN}}}_{O}}$)
13 | avgNsubMaxO | mean(${{\overrightarrow{{\rm CNN}}}_{N}}$) − max(${{\overrightarrow{{\rm CNN}}}_{O}}$)
14 | avgNsubMaxA | mean(${{\overrightarrow{{\rm CNN}}}_{N}}$) − max(${{\overrightarrow{{\rm CNN}}}_{A}}$)
15 | maxCodeNAO | Index of class with maximal value (1, 2, 3 for N, A, O)
16 | maxCodeAvgNMaxAO | Index of class with maximal value; class N uses its average instead of maximum
17 | multNprev | avgNsubMaxO (#13) × avgNsubMaxA (#14)
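
The 17 descriptors in table 3 can be sketched in Python as below. The function name `cnn_features` and the example probabilities are hypothetical, the population standard deviation is an assumption, and features #15 and #16 are read as the index of the largest per-class maximum (with the mean substituted for class N in #16).

```python
from statistics import mean, pstdev


def cnn_features(p_n, p_a, p_o):
    """Build the 17-element feature list from per-window class probabilities."""
    vectors = [p_n, p_a, p_o]  # order N, A, O as in table 3
    feats = [mean(v) for v in vectors]       # #1-3
    feats += [pstdev(v) for v in vectors]    # #4-6
    feats += [max(v) for v in vectors]       # #7-9
    feats += [min(v) for v in vectors]       # #10-12

    avg_n_sub_max_o = mean(p_n) - max(p_o)   # #13
    avg_n_sub_max_a = mean(p_n) - max(p_a)   # #14

    maxima = [max(p_n), max(p_a), max(p_o)]
    max_code_nao = maxima.index(max(maxima)) + 1        # #15
    mixed = [mean(p_n), max(p_a), max(p_o)]
    max_code_avg_n = mixed.index(max(mixed)) + 1        # #16

    feats += [avg_n_sub_max_o, avg_n_sub_max_a, max_code_nao,
              max_code_avg_n, avg_n_sub_max_o * avg_n_sub_max_a]  # #17 last
    return feats


# Hypothetical record with a single PVC in the third window.
feats = cnn_features(
    p_n=[0.9, 0.9, 0.1, 0.9],
    p_a=[0.05, 0.05, 0.1, 0.05],
    p_o=[0.05, 0.05, 0.8, 0.05],
)
```

The example reflects the situation described in the text: the plain maxima still favor class N (#15 equals 1), whereas replacing the maximum of N by its mean lets the single PVC window tip the code towards class O (#16 equals 3).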

The computed features are fed into a shallow neural network which points directly to one of the three classes N, A and O. This shallow neural network contains 17 neurons in an input layer, 10 neurons in a hidden layer and three neurons in an output layer. The output layer implements a softmax function. The estimated test confusion matrix (70/30 dataset division for training/testing) is shown in table 4; the output for specific classes is shown in figure 6.

Table 4. Test confusion matrix generated from CNN/NN combination. 30% of training set was used as a test set. N—normal recordings, O—other arrhythmia, A—atrial fibrillation, Se—sensitivity, Sp—specificity.

CNN/NN labeling | Expert N | Expert O | Expert A | Sp | F1
N | 1419 | 56 | 1 | 0.96 | 0.96
O | 49 | 591 | 26 | 0.89 | 0.88
A | 3 | 27 | 197 | 0.87 | 0.87
Se | 0.96 | 0.88 | 0.88 | |
Overall F1 | 0.91

Figure 6. CNN output processed by shallow NN. Three outputs from NN (A, N and O) show strong association with specific classes.

Parallel use of bagged tree and convolutional neural network

Finally, output values from CNN/NN are compared to thresholds. If any of them passes, it means that there is high credibility of the CNN/NN output and the given result is used. Otherwise, the result from the bagged tree ensemble is used. Thresholds (0.8 for classes A and N, 0.75 for class O) were obtained experimentally with regard to the resultant F1 score (figure 7). In the training set, these thresholds were passed in 6636 of 8183 cases meaning that the CNN/NN result was preferred in 81% of cases.
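
The final arbitration between the two branches can be summarized by the following Python sketch. The function name `combine` and the dictionary layout are hypothetical, a threshold is treated as passed when the output is greater than or equal to it, and it is assumed that a recording labeled noisy by the BTE branch is reported as X without consulting the CNN/NN output.

```python
THRESHOLDS = {"N": 0.8, "A": 0.8, "O": 0.75}


def combine(cnn_nn_output, bte_label):
    """Pick the final class from the CNN/NN softmax output and the BTE label."""
    if bte_label == "X":  # noisy recordings are decided by the BTE branch
        return "X"
    passing = [c for c, p in cnn_nn_output.items() if p >= THRESHOLDS[c]]
    if passing:
        # Softmax outputs sum to one and all thresholds exceed 0.5,
        # so at most one class can pass.
        return passing[0]
    return bte_label


confident = combine({"N": 0.10, "A": 0.85, "O": 0.05}, bte_label="N")
uncertain = combine({"N": 0.50, "A": 0.30, "O": 0.20}, bte_label="O")
noisy = combine({"N": 0.95, "A": 0.03, "O": 0.02}, bte_label="X")

# Share of training records in which a threshold was passed (6636 of 8183).
cnn_share = 6636 / 8183
```

In the three examples, the confident CNN/NN output overrides the BTE label, the uncertain output falls back to the BTE label, and the noisy label is kept regardless of the CNN/NN output.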

Figure 7. Threshold selection for acceptance of CNN/NN results. F1 scores were independently generated for each of the classes N (black, solid line), A (blue, dash-dotted line) and O (orange, dashed line). The decrease of F1 score to the right is caused by decreasing sensitivity, which is not problematic due to the fact that if values do not pass any of presented thresholds, the output of the opposite branch—bagged tree (BT)—will be used instead (figure 1). Selected thresholds were passed in 81% of training cases.

Results

The presented method has been tested on a hidden test set containing 3658 recordings. The results in table 5 show a score of 0.92 for normal recordings, 0.82 for atrial fibrillation and 0.74 for other arrhythmia, resulting in an overall score of 0.83. Among a total of 80 competitors, this score resulted in shared rank #2 in the final PhysioNet/CinC 2017 Challenge rankings. During the run on a test subset (N  =  710), the method used 16.1% of the available running time.

Table 5. Estimated F1 scores obtained during training in comparison with real test scores. Estimations were obtained from 30% of recordings available to the public (70/30 training/test division of reduced public dataset, N = 8183); testing results were obtained from a hidden dataset (3658 recordings).

Class Estimated F1 score (bagged tree ensemble) Estimated F1 score (convolutional and shallow neural network) Test F1-score
Normal 0.91 0.96 0.92
Atrial fibrillation 0.81 0.87 0.82
Other arrhythmia 0.75 0.88 0.74
Overall modified accuracy 0.82 0.91 0.83

Discussion

The presented method is a significant simplification of our previous approach [28] during the 2nd phase of the CinC/PhysioNet Challenge 2017. The number of features has been decreased from 277 to 60 (43 in the BTE branch; 17 in the CNN/NN branch); moreover, the interaction between the convolutional neural network and the bagged tree ensemble has been changed (previously, CNN/NN produced features supplied to the BTE, while they now work independently). Both these steps have led to a new solution showing a higher F1 score on the full hidden test set (0.83 versus 0.81). The increase in the score has probably been due to a stronger generalization effect, shown as a smaller decrease between the subset test score (N = 710) and the full test score (N = 3658): from 0.85 to 0.81 in the previous approach, and from 0.84 to 0.83 in the current approach. The challenge organizers have also relabeled part of the data and this has, in all probability, also played a positive role in the score change. However, these newly relabeled data were not used for training of the presented approach; relabeling affected test results only.

From our previous experience, we considered the bagged tree a machine learning technique giving us solid training results. On the other hand, we were also aware that such a solution is solely dependent on QRS detection and, therefore, might not work in ECG signals containing a larger amount of noise caused by body motion or external sources. For this reason, we implemented an independent processing branch with a convolutional neural network. This branch does not use QRS detection and should be robust in cases in which a bagged tree loses its capability. Because the CNN/NN chain produces a probability for its result, we were also able to identify the cases in which the CNN/NN is very confident (i.e. an output passed its threshold, as in figure 7).

Although the parallel use of two machine learning approaches seems to increase the overall score, we found that one part of the detection task, the detection of class X (noisy recordings), remains problematic for any machine learning approach. This was probably due to the relatively low number of X recordings in the training set (N = 278), but also due to the subjective opinion of the specialist performing such an assignment. Therefore, we used a simple logical rule which is not affected by machine learning; further machine learning processing is performed only if this rule is passed. Although this rule affects only a small proportion of X recordings (21% true positive, 3% false positive), it improves the weak results of machine learning for class X. Figure 8 shows two examples of recordings expertly labeled X (too noisy to decide). The top one (figure 8(A)) did not pass the mentioned rule, while the bottom one (figure 8(B)) passed this rule and was processed by the BTE. According to its training, the BTE labeled this recording as noisy, although such labeling may be questionable because QRS complexes are still detectable. It may be worth noting that ideal QRS detection (which would, unlike that presented here, capture all QRS complexes) may be counterproductive in cases like this.


Figure 8. Two examples of data labeled as too noisy (class X). The top recording (A) shows an excessive number of morphological groups of (false) QRS complexes (dots); it was considered noisy without the use of machine learning ('No' in figure 1(E)). Recording (B) passed the checks prior to processing by the bagged tree ensemble, which decided to label it as too noisy to process (correctly with regard to the official PhysioNet/CinC Challenge 2017 labeling; 'Yes' in figure 1(G)).


In comparison to other competing entries, the presented work led to shared rank #2, losing by 0.02 to the top-scoring entry [11]. Another six competing entries (from a total of 80) shared the same rank with a resultant score of 0.83. These other entries with the same ranking used a random forest classifier [13, 30, 31], a binary classifier [32] or an SVM [33], with a feature count from 37 to 380. Only entry [27] implemented an approach similar to ours (a combination of a CNN and an XGBoost ensemble).

The advantage of the proposed method is its high overall F1 score, which was one of the highest scores in the challenge. The proposed method is immune to reversed signal polarity because both the QRS detection algorithm and the CNN work mostly with amplitude envelopes. In comparison to other top-scoring methods, it also extracts a small number of hand-crafted features, though this applies only to the bagged-tree processing branch.
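
The polarity argument can be illustrated numerically. This section does not state how the amplitude envelopes are computed, so the sketch below assumes one common choice, the magnitude of the analytic signal obtained with an FFT-based Hilbert transform. The function `amplitude_envelope` and the synthetic test pulse are hypothetical and serve only to show that such an envelope is unchanged when the signal is inverted.

```python
import numpy as np


def amplitude_envelope(x):
    """Magnitude of the analytic signal of a real 1-D signal.

    Assumption: a Hilbert-style envelope; the source does not specify
    the exact envelope computation in this section.
    """
    n = len(x)
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # zero out negative frequencies.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * weights))


# Hypothetical test pulse: a short oscillation under a Gaussian window.
t = np.linspace(0.0, 1.0, 500, endpoint=False)
signal = np.exp(-((t - 0.5) ** 2) / 0.001) * np.sin(2 * np.pi * 30 * t)

envelope = amplitude_envelope(signal)
envelope_inverted = amplitude_envelope(-signal)
same = np.allclose(envelope, envelope_inverted)
print(same)
```

Because the transform is linear and the magnitude discards the sign, inverting the input leaves the envelope unchanged, so any downstream stage that sees only the envelope behaves identically for both polarities.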

The disadvantage of the presented method is its complexity in comparison with the top-scoring entry [11]. Moreover, although the achieved F1 scores were among the highest in the challenge, they are not sufficient for a fully automated system without a human operator (this applies to F1 scores across all competitors in the PhysioNet Challenge 2017).

Since the most challenging issue with Holter ECG processing is noise, future work in this area should involve accelerometer data, which are already available in some ECG Holters. It would also be appropriate to test (and possibly rebuild) algorithms from the PhysioNet Challenge 2017 on databases that distinguish a larger number of pathologies, as the clinical impact of such a mechanism would be much more important.

Conclusions

The presented approach demonstrated parallel use of two independent machine learning methods, where the first (bagged tree ensemble) uses regular features based on QRS detection, while the second (convolutional neural network and shallow neural network) uses only a transformed ECG signal. This approach was not used for detection of recordings labeled too noisy to process; this was covered by simple logical conditions and a bagged tree ensemble. Finally, it was shown that the presented method may improve results in comparison with either of the machine learning approaches used alone; we achieved an improvement of 1.3% in comparison to our approach from the previous challenge phase. The presented method achieved rank #2 (shared with six other teams) in the follow-up phase of the PhysioNet/CinC Challenge 2017. Its code is publicly available under the GNU GPL license [34].

Acknowledgments

This research was supported by the Czech Science Foundation, project GA17-13830S, project LO1212 by MEYS CR and by project MSM100651602 by the Czech Academy of Sciences.
