Parallel use of a convolutional neural network and bagged tree ensemble for the classification of Holter ECG

Filip Plesinger; Petr Nejedly; Ivo Viscor; Josef Halamek; Pavel Jurak

doi:10.1088/1361-6579/aad9ee

Introduction

Cardiac arrhythmia in general is a consequence of several cardiac diseases. Different types of cardiac arrhythmias are usually recognized using an electrocardiogram (ECG) signal. Arrhythmias that are permanently present in an ECG signal may be recognized during a common short-time (30 s) ECG measurement. Other arrhythmias, however, occur in a non-persistent form, as is often the case with atrial fibrillation. For this reason, the patient may be fitted with a Holter device for a longer period of days or weeks. Although Holter ECG is an excellent source for revealing non-persistent arrhythmias, the processing of these recordings may be very time-demanding.

The design and implementation of fully automated arrhythmia detectors for 1-lead Holter ECG signals was the topic of the PhysioNet/CinC Challenge 2017 [1]. The challenge focused on the processing of short recordings (9–61 s). The goal was to distinguish between normal recordings (N), atrial fibrillation (A), other arrhythmia (O) and recordings too noisy to process (X).

N recordings should contain only a steady heart rhythm at a pace of 60–100 beats per minute (bpm) without any extra beats or any other rhythm disturbances, although these limits are only indicative. In contrast, atrial fibrillation represents a rhythm with a fibrillating P-wave resulting in random inter-beat (RR) intervals. A separate class A is reserved for atrial fibrillation because it is associated with a higher risk of heart failure, it is usually episodic, and seeking evidence of atrial fibrillation is a common reason for fitting a patient with a Holter ECG device. Class O (other arrhythmias) represents any other rhythm disturbance, such as premature ventricular contractions (PVC), premature atrial contractions (PAC), bradycardia, tachycardia and others.

Simple methods for the detection of atrial fibrillation such as [2–5] are based on analysis of inter-beat (RR) intervals, usually using a Poincaré plot or Lorenz plots [6]. However, these approaches may become limited in cases in which other arrhythmias are present in the ECG signal because they also disturb heart rhythm. Furthermore, Holter ECG recordings are acquired during regular patient activity (sleep, movements, employment, sports) meaning that the ECG signal usually contains movement artifacts and power-grid noise and may also suffer from poor contact between the body and ECG electrodes (sweat, especially in connection with movements). The capability of these simple methods may suffer under the given circumstances which may also be deduced from results from the PhysioNet Challenge 2017 in which methods using only a few well-known features [7–9] showed rather limited performance.

Therefore, the trend in the classification of real Holter ECG signals is towards the extraction of a higher number of features (describing both atrial and ventricular activity) and the use of a machine learning approach such as recurrent neural networks [10, 11], tree forests/ensembles [12–14] or support vector machines [15, 16]. While these approaches use features dependent on QRS detection, a substantial amount of research has led to the use of convolutional neural networks (CNN) [17–19], in which specific features are extracted during the machine learning process itself. Since CNNs were built primarily for image processing, the usual approach is to convert the signal into some form of image, most often using Fourier [20–22] or wavelet transformation. Moreover, transformation to dual-beat coupling [23] and approaches that use the 1D ECG signal directly [24, 25] have already been presented. It is also worth mentioning that some approaches combine deep learning techniques with features based on QRS detection [26, 27].

In this manuscript, we propose a method for automated detection of cardiac arrhythmias in Holter ECG signals. It is an improvement on our previous work [28] with a substantially simplified feature set and a different architecture. These improvements have led to increased capability over the previous solution.

Method

The presented approach (figure 1) uses two machine learning methods in parallel: a bagged tree ensemble (BTE) and a convolutional neural network chained with a shallow neural network (CNN/NN). While the BTE uses mostly a common feature set based on QRS detection (mean inter-beat interval, its variation range, etc), the CNN/NN uses a transformed ECG signal. Both methods compute results independently; if the CNN/NN branch is very sure about its result (its output exceeds a given threshold), the CNN/NN result is used. Otherwise, the BTE result is used (with the exception of the noisy class X). This design was expected to increase method robustness in comparison to the use of the CNN/NN or BTE approach alone; it should also suffer less from overtraining (overtraining was an issue for our former approach, in which these methods were used in a chain). The method was implemented in Matlab (r2017a) using the Statistics and Machine Learning Toolbox and the Signal Processing Toolbox.

Used dataset

We used a public dataset from the PhysioNet/CinC Challenge 2017—single-lead Holter ECG, acquired from a device (AliveCor Inc., CA, US) in 12-bit depth and with a sampling frequency of 300 Hz. The dataset contains files categorized by pathology into four classes. The expert labeled dataset was divided into a public (8528) and hidden (3658) part. We removed 345 files from the public part in cases in which we disagreed with the expert labeling. Therefore, a total of 8183 files was used for training and local testing. The hidden part of the dataset was used for remote tests on PhysioNet servers. Labeling corrected after the end of the 'Official Challenge Phase' was not used for training.

Preprocessing—signal transformation into envelograms

The ECG signal is transformed into several amplitude envelopes (envelograms) which are later used for QRS detection and feature extraction (figure 1(A)). Envelograms are computed using Fourier and Hilbert transformations in specific frequency ranges. Envelograms in lower frequencies (LF, 1–6 Hz, figure 2(C)) and middle frequencies (MF, 8–30 Hz, figure 2(B)) serve as sources for QRS detection (figure 1(B), dots). The LF envelogram is more sensitive to T-waves, while the MF envelogram is more sensitive to QRS complexes; specific conditions during QRS detection exploit this difference. These envelograms are also used for distinguishing ventricular contractions from regular QRS complexes using the ratio of the LF and MF envelograms [29]. The input ECG signal is also filtered into the 2–30 Hz band (CF signal) to remove baseline wandering and noise; the signal filtered in this way is prepared for QRS morphology clustering (figure 1(C)).
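A band-limited envelogram of this kind can be sketched as follows. This is a minimal illustration, not the Matlab implementation used for the method: the function name `envelogram` and the synthetic test signal are assumptions, and a naive DFT is used only to keep the example free of dependencies (an FFT would be used in practice). The band is selected in the frequency domain, negative frequencies are zeroed to obtain the analytic (Hilbert) signal, and its magnitude is taken as the envelope.

```python
import cmath
import math

def envelogram(signal, fs, f_lo, f_hi):
    """Amplitude envelope of `signal` restricted to the band [f_lo, f_hi] Hz."""
    n = len(signal)
    # Forward DFT (naive O(n^2); acceptable for a short demonstration).
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]
    # Keep positive-frequency bins inside the band and double them; dropping
    # the negative frequencies gives the analytic signal of the band.
    kept = [
        (k, 2 * c)
        for k, c in enumerate(spectrum)
        if k < n / 2 and f_lo <= k * fs / n <= f_hi
    ]
    # Inverse DFT over the kept bins; the magnitude is the envelope.
    return [
        abs(sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in kept)) / n
        for t in range(n)
    ]

# Synthetic 1 s test signal (an assumption, not ECG data): a 3 Hz component of
# amplitude 1.0 and a 15 Hz component of amplitude 0.5, sampled at 300 Hz.
fs = 300
test_signal = [
    math.sin(2 * math.pi * 3 * i / fs) + 0.5 * math.sin(2 * math.pi * 15 * i / fs)
    for i in range(fs)
]
lf = envelogram(test_signal, fs, 1, 6)    # LF band, 1-6 Hz
mf = envelogram(test_signal, fs, 8, 30)   # MF band, 8-30 Hz
```

For this test signal the LF envelope stays at the amplitude of the 3 Hz component and the MF envelope at that of the 15 Hz component, which is the separation the QRS detector relies on.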

**Figure 2.** QRS detection. The raw ECG signal (A) is transformed into MF (B) and LF (C) envelograms. While the MF envelogram serves as a source of peaks (checked by maxima intensity in a 600 ms window and standard deviation in its three subregions), the LF envelogram serves for an additional check of QRS existence at a specific point—if area (C)-L is more than 1.5× larger than area (C)-R, the QRS is resolved as false (important when sharp T-waves are misinterpreted as QRS). The ratio of areas under the LF and MF curves [29] is also used for evaluating several features (table 1 #25–29).

QRS detection

For QRS detection (figure 1(B)), peaks were detected in the MF envelogram. Detection criteria were set to 'Descend' order and the minimal distance to other peaks was set to 150 ms (the maximal detectable heart rate is, therefore, 400 bpm). Each peak must then pass multiple checks. First, the maximum of the MF signal in a window of ±300 ms around the peak is compared with 1.5× the value at the peak. This time window was set with respect to the maximal expected P-R distance and R-T (top of the T-wave) distance. If the window maximum exceeds this limit, the peak is probably a P-wave, a very sharp top of a T-wave or noise; such peaks are removed from further processing. Next, the same time window around the peak is divided into three equal and consecutive parts (R_A, R_B and R_C in figure 2, each 200 ms long) and the standard deviations inside these parts are compared. If the sum of the standard deviations of the border regions R_A and R_C is higher than the standard deviation of R_B, the peak is removed from further processing. Finally, each peak is checked for low-frequency activity on its borders using the LF envelogram, because we expect the low-frequency activity connected to any QRS complex to follow the QRS due to the T-wave. Therefore, the sum of the region −250:0 ms from the tested peak (figure 2(C)-L) is tested against the region 0:250 ms (figure 2(C)-R). If sum(L) > sum(R) × 1.5, the low-frequency activity before the peak is much higher than after the peak and the peak is unlikely to be a QRS. It is therefore considered a false QRS and is removed from further processing. The time windows of 250 ms were set to capture the areas of the LF envelope related to the T-wave. However, at a very fast pace it is possible that T-waves both before and after the tested peak fall within the analyzed ±250 ms area. For this reason, we added the 1.5× multiplier to cover such a situation.
Furthermore, during events such as ventricular tachycardia it may happen that the majority of the detected QRS complexes are removed. Therefore, this step is omitted if more than 95% of all peaks are removed during this check. Finally, if the number of detected QRS complexes is insufficient (<4), the process is aborted and the record is considered noisy.
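The final LF check and its two safeguards can be sketched as below. The function names `lf_check_passes` and `filter_peaks`, the representation of the LF envelogram as a plain list, and the rounding of the 250 ms window to whole samples are assumptions made for illustration; only the 1.5× factor, the 95% rule and the minimum of four QRS complexes come from the description above.

```python
def lf_check_passes(lf, peak, fs, window_s=0.25, factor=1.5):
    """False when LF activity before the peak dominates the activity after it."""
    w = int(round(window_s * fs))
    left = sum(lf[max(0, peak - w):peak])   # region -250:0 ms
    right = sum(lf[peak:peak + w])          # region 0:250 ms
    return not (left > factor * right)

def filter_peaks(lf, peaks, fs):
    """Apply the LF check; return surviving peaks or None for a noisy record."""
    kept = [p for p in peaks if lf_check_passes(lf, p, fs)]
    removed = len(peaks) - len(kept)
    # Omit the check entirely if it would remove more than 95% of the peaks.
    if peaks and removed > 0.95 * len(peaks):
        kept = list(peaks)
    # Fewer than four QRS complexes: the record is considered noisy.
    return kept if len(kept) >= 4 else None

# Toy LF envelogram (an assumption): a T-wave-like bump 100-200 ms after each beat.
fs = 300
lf = [0.0] * 3000
true_peaks = [300, 600, 900, 1200, 1500]
for p in true_peaks:
    for i in range(p + 30, p + 60):
        lf[i] = 1.0
candidates = sorted(true_peaks + [370])    # 370 sits just after a T-wave bump
surviving = filter_peaks(lf, candidates, fs)
```

In this toy case the candidate at sample 370 has all of its LF activity on the left side and is rejected, while the five true peaks survive.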

QRS clustering

Because the task is also to find other arrhythmias in addition to atrial fibrillation, we decided to cluster QRS complexes using their morphology (figure 1(C)). The reason for this is that atrial fibrillation, as well as the sinus rhythm, should contain a single QRS morphology, while premature ventricular contractions, for example, should result in additional morphological groups. Clustering is achieved using the Pearson correlation, computed in a ±150 ms window centered at QRS complexes; correlation is computed using the ECG signal filtered in the 2–30 Hz band (CF). The size of the correlation window was set with respect to QRS duration, which is not expected to be longer than 300 ms.

The clustering process works as follows: the first QRS complex is assigned to the first morphological group. Each subsequent QRS complex is compared to all QRS complexes that have already been assigned (called prototypes). If it correlates sufficiently (>0.9) with a prototype, the tested complex is assigned to the same morphological group as that prototype; otherwise it starts a new group. Noisy signals typically yield a very high number of morphological groups, while high-quality signals (even with pathological content) should contain two to four morphological groups. Therefore, if an excessive number of morphological groups is found (more than the length of the signal in seconds), the process is aborted and the record is considered noisy.
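The greedy procedure described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the names `pearson`, `cluster_beats` and `too_many_groups` are hypothetical, each beat is assumed to be an equal-length window already cut from the CF signal, and a beat is assigned to the group of the first prototype that exceeds the threshold (the text does not say whether the first or the best match is used).

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def cluster_beats(beats, threshold=0.9):
    """Return a morphological group index for every beat window."""
    labels = []
    for i, beat in enumerate(beats):
        label = None
        for j in range(i):                      # already assigned prototypes
            if pearson(beat, beats[j]) > threshold:
                label = labels[j]
                break
        if label is None:                       # no match: open a new group
            label = max(labels, default=-1) + 1
        labels.append(label)
    return labels

def too_many_groups(labels, duration_s):
    """Noisy-record rule: more groups than seconds of signal."""
    return len(set(labels)) > duration_s

# Toy beat shapes (assumptions): a narrow upright beat and a wide inverted one.
normal = [0, 0, 1, 5, 1, 0, 0, 0, 0]
pvc = [0, -1, -3, -4, -3, -1, 0, 1, 2]
scaled = [1.1 * v for v in normal]
labels = cluster_beats([normal, scaled, pvc, normal, pvc])
```

Because the Pearson correlation ignores scale, the scaled copy of the normal beat lands in the same group as the original.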

There is also a risk that the same QRS morphology will be registered multiple times, but with a time shift. To avoid this, an average shape for each morphology group is created and their central parts (±75 ms) are correlated with a shift (also ±75 ms). If any two groups correlate well enough (>0.95), they are merged together; averaged shapes must be regenerated and the process continues until none of the groups can be merged with another.

Morphological groups are ordered by their count of QRS complexes, largest first. Therefore, the first morphological group (G1) should always describe a sinus rhythm (with the exception of bigeminy PVCs). G2 will usually contain the most common morphology of extra beats, G3 will contain a less common morphology, etc. Obviously, the presence of noise will create additional morphological groups, but these should not contain multiple QRS complexes because the noise is expected to be random.

Once morphological groups are defined, we can look for missing QRS complexes (which have not, for some reason, passed the initial checks). The search for missing QRS complexes works as follows: a cross-correlation function is generated from the averaged shape of each morphological group (containing more than one QRS complex) and the CF signal. Next, peaks are detected in this cross-correlation function (>0.94). Next, nominal values in the position of the detected peaks are compared to corresponding samples in averaged shapes of morphological groups. Finally, if a newly detected peak passes these checks and its position is further than 150 ms from a QRS already detected, the new QRS complex is defined and added to the specific morphological group.

Clustering based on the Pearson correlation may lead to incorrect results in cases in which shapes correlate sufficiently but have a different scale. In terms of ECG, beats with the same QRS appearance but 0.1× the amplitude probably do not belong to the same morphological group and are probably the result of noise. Therefore, the sum of the MF envelogram in the region ±150 ms around each QRS complex is computed, and the median of these sums over group G1 (Med_MF) serves as a reference. The sum for every QRS complex must lie between Med_MF × 0.2 and Med_MF × 4; otherwise the complex is excluded from the list of QRS complexes. Finally, morphological groups are reordered again using their number of QRS complexes.
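The amplitude gate can be illustrated as follows. The function name `amplitude_gate` and the input layout (one precomputed MF sum per QRS complex plus the indices of the G1 members) are assumptions; whether the bounds are inclusive is not stated in the text, and inclusive bounds are assumed here.

```python
from statistics import median

def amplitude_gate(mf_sums, g1_indices, low=0.2, high=4.0):
    """Indices of QRS complexes whose MF sum lies within the allowed range."""
    med_mf = median([mf_sums[i] for i in g1_indices])   # Med_MF from group G1
    return [
        i for i, s in enumerate(mf_sums)
        if low * med_mf <= s <= high * med_mf
    ]

# Toy MF sums (assumptions): beats 3 and 4 are far too weak and far too strong.
mf_sums = [10.0, 11.0, 9.0, 1.0, 50.0, 10.0]
kept = amplitude_gate(mf_sums, g1_indices=[0, 1, 2, 5])
```

With a median of 10.0 the accepted range is 2.0 to 40.0, so the complexes with sums 1.0 and 50.0 are excluded.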

Feature extraction for the bagged tree ensemble

A list of 43 extracted features for the bagged tree ensemble is shown in table 1. The first three features (table 1, #1–3) are not dependent on QRS detection and cover the hypothetical situation in which, for example, a block of ventricular fibrillation prohibits the QRS detector from detecting single beats. For this, we use a previously developed approach [29]. The longest block in which the LF envelogram is dominant over the MF envelogram is described by its MF to LF area ratio (table 1 #1), its length in seconds (table 1 #2) and the sum of the LF envelogram within it (table 1 #3).

Table 1. List of 43 features extracted for bagged tree ensemble.

#	Feature name	Based on	Detailed description
1	maxLFratioMFLF	Prevalence of LF over MF envelogram	Ratio of MF to LF sums during the longest prevalence
2	maxLFLength		Maximal length (s) of this prevalence
3	maxLFint		Sum of this prevalence

4	rrMin	RR intervals derived from all QRS complexes	Minimum
5	rrMax		Maximum
6	rrMean		Mean
7	rrMed		Median
8	rrSTD		Standard deviation
9	rrSTD_mean		Standard deviation/mean
10	rrVR		Variation range

11	numExRat	RR intervals	Number of peaks on RR series
12	lower10ratio		Number of RRs < rrMean − 0.1 × (rrMean)
13	higher10ratio		Number of RRs > rrMean + 0.1 × (rrMean)

14	ratioQRS1toAll	QRS morphological groups	Ratio of QRS from G1 to all QRS complexes
15	ratioQRS2toAll		Ratio of QRS from G2 to all QRS complexes
16	ratioQRS2to1		Ratio of G2 QRS to G1 QRS complexes
17	ratioQRS12toAll		Ratio of G1 + G2 QRS complexes to all QRS complexes

18	twSTDmin	Examination of 6-beat sequence of QRS complexes	Minimal STD(RR)
19	twSTDmax		Maximal STD(RR)
20	twSTDmed		Median STD(RR)
21	twSTDstd		Standard deviation of STD(RR)
22	twC0Smin		Minimal average RR
23	twC0Smax		Maximal average RR
24	twC0Smed		Median average RR

25	KESmin	Ratio of LF to MF sum around QRS complex (high values should be associated with extra ventricle beats)	Minimum KES value of all QRS
26	KESmax		Maximum KES value of all QRS
27	KESmed		Median KES value of all QRS
28	KESmean		Mean KES value of all QRS
29	KESstd		Standard deviation KES value of all QRS

30	exToAll	RR intervals	Number of unusual RR intervals to all RR intervals

31	dRRsMin	Derivative of RR intervals series	Minimum
32	dRRsMax		Maximum
33	dRRsMed		Median
34	dRRsMean		Mean
35	dRRsSTD		Standard deviation

36	qrs1ratioLFMF	QRS morphological group G1	Ratio of center of averaged G1 QRS complex in LF envelope to the same sample in MF envelope

37	pW1	Regions of expected P-wave presence	Variability using variation range
38	pW2		Variability using STD
39	pW3		Variability using mean

40	jhCorrRR	RR intervals	Correlation of circularly shifted RR intervals (by 1)

41	rawNumExA	Averaged shape G1 from RAW ECG	Number of local extremes before R-wave
42	rawNumExB		Number of local extremes around R-wave
43	rawNumExC		Number of local extremes after R-wave

All other features used for the bagged tree ensemble are dependent on detected QRS complexes and their morphological group. Statistical description of the RR intervals forms seven features (table 1 #4–10); the number of peaks in consecutive RR values, as well as the number of RRs higher or lower than average RR (outside ±0.1 tolerance), are additional features (table 1 #11–13).

Thanks to morphology clustering it is possible to quantify ratios of other morphological groups as well as the ratio between the most important groups G1 and G2 (table 1 #14–17). We also examined the whole sequence of RR intervals in a sliding 6-beat window, where seven features were extracted from the standard deviation and mean RR in this sliding window (table 1 #18–24).
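The sliding-window features can be sketched as below. This is only an illustration: the function name `sliding_window_features` is hypothetical, and it is assumed that a 6-beat window spans five consecutive RR intervals and that the sample standard deviation is used; neither detail is specified in the text.

```python
from statistics import mean, median, stdev

def sliding_window_features(rr, beats=6):
    """Features #18-24 from a window sliding over the RR series (seconds)."""
    width = beats - 1                     # assumption: 6 beats = 5 RR intervals
    if len(rr) < width:
        raise ValueError("RR series shorter than one window")
    windows = [rr[i:i + width] for i in range(len(rr) - width + 1)]
    stds = [stdev(w) for w in windows]    # STD(RR) per window
    means = [mean(w) for w in windows]    # average RR per window
    return {
        "twSTDmin": min(stds),
        "twSTDmax": max(stds),
        "twSTDmed": median(stds),
        "twSTDstd": stdev(stds) if len(stds) > 1 else 0.0,
        "twC0Smin": min(means),
        "twC0Smax": max(means),
        "twC0Smed": median(means),
    }

# Toy RR series (an assumption): steady rhythm with one premature beat
# followed by a compensatory pause.
rr = [0.8] * 5 + [0.4, 1.2] + [0.8] * 5
features = sliding_window_features(rr)
```

For the toy series, windows away from the premature beat have zero variability, while windows that contain it raise `twSTDmax`.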

Frequency components of all beats are described in five features (figure 3 and table 1 #25–29), where the ratio of LF and MF envelograms is sensitive to abnormal beats, specifically premature ventricular beats. This is also extracted for the average shape from G1, resulting in a specific feature (table 1 #36).

Technically, most other arrhythmias as well as atrial fibrillation result in irregularities in RR intervals. Therefore, another feature (table 1 #30) describes the ratio of unusual RR intervals to all RR intervals (an RR change greater than 10% is considered an unusual interval). This is also covered by examining the first derivative of the RR series; five features (table 1 #31–35) statistically describe changes of RR intervals. The last feature describing changes in RR intervals (table 1 #40) is computed as the correlation of the whole RR series with its copy shifted by one RR interval.
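Several of the RR-based features can be sketched together. The function names `pearson` and `rr_features` are hypothetical, and a few details are assumptions rather than statements from the text: an RR change is measured relative to the preceding interval, the unusual-interval count is divided by the number of RR intervals, and the sample standard deviation is used (the default normalization of Matlab's `std`).

```python
import math
from statistics import mean, stdev

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def rr_features(rr):
    """A subset of the RR-interval features of table 1 (rr in seconds)."""
    m = mean(rr)
    drr = [b - a for a, b in zip(rr, rr[1:])]          # first derivative
    # Assumption: "unusual" means a change of more than 10% of the previous RR.
    unusual = sum(abs(d) > 0.1 * prev for d, prev in zip(drr, rr))
    shifted = rr[1:] + rr[:1]                          # circular shift by one
    return {
        "rrSTD_mean": stdev(rr) / m,                   # #9
        "lower10ratio": sum(x < m - 0.1 * m for x in rr),   # #12
        "higher10ratio": sum(x > m + 0.1 * m for x in rr),  # #13
        "exToAll": unusual / len(rr),                  # #30
        "dRRsMin": min(drr),                           # #31
        "dRRsMax": max(drr),                           # #32
        "jhCorrRR": pearson(rr, shifted),              # #40
    }

# Toy RR series (an assumption): one short interval followed by a long one.
features = rr_features([0.8, 0.8, 0.5, 1.1, 0.8, 0.8])
```

A strictly alternating series such as 0.6, 1.0, 0.6, 1.0 gives `jhCorrRR` close to −1, since every interval is paired with its opposite after the shift.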

The behavior of the P-wave, or its link to the QRS complex, is essential for identifying pathology, namely atrial fibrillation, 2nd or 3rd degree AV-block, or atrial tachycardia. However, the detection of the P-wave is a difficult task because it may be buried under background noise, which is common in Holter recordings. Therefore, we decided not to detect the P-wave directly, but to analyze the region in which the P-wave should exist. We statistically describe the stability of this region preceding each QRS complex (figure 4(D)) in three features (table 1 #37–39).

**Figure 4.** Averaged shapes of normal (solid) and atrial fibrillation (dashed) records. Atrial fibrillation will cause unnatural flattening of the region preceding the QRS complex. The average shape is used to generate features (table 1 #41–43) from the three regions (A) (120 ms), (B) (160 ms) and (C) (120 ms) and also from region (D) (with adaptable width) where the P-wave should be expected (table 1 #37–39).

The last features (table 1 #41–43) used for the bagged tree ensemble describe the number of extrema computed on the averaged shape of the most common morphology G1. Three regions are analyzed: preceding, surrounding and following the average QRS complex (figures 4(A)–(C)).

If any of these 43 features cannot be evaluated or if the mean RR interval (computed using the 1st morphological group only) is longer than 3 s, the process is aborted and the recording is considered noisy. Otherwise, these features are fed into the bagged tree ensemble. This condition correctly detects 21% (N = 58) of noisy recordings and incorrectly detects 3% (N = 9) in the training set.

Bagged tree ensemble

A bagged tree ensemble was chosen because it showed the best results of the tested machine learning methods (the others being simple decision trees, shallow neural networks and support vector machines with different kernels). Of the public dataset, 70% was used for training and the remaining 30% for testing. The tree ensemble consisted of 30 trees. The bagged tree ensemble was trained using the Statistics and Machine Learning Toolbox in Matlab (r2017a). The resulting confusion matrix is shown in table 2.

Table 2. Test confusion matrix for bagged tree ensemble acquired from 30% of training dataset. N is normal sinus rhythm, O is other arrhythmia, A is atrial fibrillation and X is noisy signal. Se is sensitivity and Sp is specificity. The overall F1 score is computed as the average over the F1 scores of the N, O and A classes.

		Expert labeling
		N	O	A	X	Sp	F1
Bagged tree labeling	N	1398	84	4	8	0.94	0.91
	O	148	478	20	12	0.73	0.75
	A	8	43	168	0	0.77	0.81
	X	22	14	3	44	0.53	0.60
	Se	0.89	0.77	0.86	0.69

		Overall					0.82
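The summary columns of table 2 can be recomputed from the counts. The sketch below is illustrative; the function name `per_class_metrics` and the key names are assumptions. Rows are the bagged tree labels and columns the expert labels, so the column-wise ratio is the sensitivity (the Se row), the row-wise ratio is the value listed in the Sp column, and F1 is twice the diagonal count divided by the sum of the row and column totals.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j]: recordings predicted as labels[i] and expert-labeled labels[j]."""
    metrics = {}
    for k, name in enumerate(labels):
        tp = matrix[k][k]
        predicted = sum(matrix[k])                  # row total
        actual = sum(row[k] for row in matrix)      # column total
        metrics[name] = {
            "row_ratio": tp / predicted,            # tabulated in the Sp column
            "sensitivity": tp / actual,             # tabulated in the Se row
            "f1": 2 * tp / (predicted + actual),
        }
    return metrics

# Counts from table 2 (rows: bagged tree labeling, columns: expert labeling).
labels = ["N", "O", "A", "X"]
table2 = [
    [1398, 84, 4, 8],
    [148, 478, 20, 12],
    [8, 43, 168, 0],
    [22, 14, 3, 44],
]
metrics = per_class_metrics(table2, labels)
# The overall score averages F1 over N, O and A only (class X is excluded).
overall = sum(metrics[c]["f1"] for c in ("N", "O", "A")) / 3
```

Rounded to two decimals, the per-class F1 values come out as 0.91, 0.75, 0.81 and 0.60, and the overall score as 0.82, matching the table.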

Convolutional neural network (CNN)

The presented method combines two—mostly independent—approaches. While the approach with a bagged tree ensemble relies on QRS detection, the convolutional neural network uses only filtered ECG data. The input for the CNN consists of 6 s-long (without zero padding) blocks of input ECG signal. These blocks are pre-processed with zero-phase bandpass digital filtering (Butterworth filter) and envelograms in ranges (1–5, 5–10, ..., 35–40 Hz) are estimated using the absolute value of the Hilbert transform. Furthermore, the RAW signal and computed envelograms are concatenated into a matrix that is used as an input image for CNN.

The CNN architecture consists of 13 layers. The input image of raw signal and envelograms (dimensions 9 × 1800) is processed with the first convolutional layer with kernel dimension (9 × 150) and 30 filters. Next, a non-linearity mapping layer with a ReLU activation function is used. The extracted features are then down-sampled with a max-pooling layer (1 × 2). Subsequently, a second convolutional layer with kernel dimension (1 × 15 × 30) and 50 filters, followed by ReLU and dropout (0.5), is used in order to extract more complex features. The resulting tensor is flattened into vector form and passed to a fully connected layer (15 neurons) with ReLU and dropout (0.5). Finally, a fully connected layer (three neurons) with softmax activation is attached and the probabilities for each class are obtained.
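The tensor sizes through this architecture can be traced with simple arithmetic. The sketch below assumes unpadded convolutions with stride 1 and a pooling stride of 2 with the remainder dropped; these settings are not stated in the text, so the intermediate sizes are illustrative only. The function names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1

def cnn_shapes(fs=300, block_s=6, n_bands=8):
    """Trace (height, width, channels) through the described layers."""
    height, width = 1 + n_bands, fs * block_s           # raw signal + envelograms
    shapes = {"input": (height, width, 1)}               # 9 x 1800
    height, width = conv_out(height, 9), conv_out(width, 150)
    shapes["conv1"] = (height, width, 30)                 # kernel 9 x 150, 30 filters
    width = width // 2                                    # max-pooling 1 x 2
    shapes["pool"] = (height, width, 30)
    width = conv_out(width, 15)                           # kernel 1 x 15 x 30
    shapes["conv2"] = (height, width, 50)                 # 50 filters
    shapes["flatten"] = height * width * 50
    shapes["fc1"] = 15
    shapes["fc2"] = 3                                     # softmax over N, A, O
    return shapes

shapes = cnn_shapes()
```

The input width follows directly from the text (6 s × 300 Hz = 1800 samples; one raw row plus eight band envelograms gives nine rows). Under the stated assumptions the first convolution collapses the nine rows into one, so the rest of the network operates on a one-dimensional sequence of feature maps.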

The proposed CNN system is trained for classification into three groups, i.e. normal rhythm, atrial fibrillation and other arrhythmias. The CNN was not trained for noisy recordings due to their low number. A gradient descent algorithm with momentum (100 training examples per minibatch) was used as the training method and cross-entropy error was used as the cost function. The training process was performed for 25 epochs. We used a data augmentation technique (ECG lead inversion) in order to train the CNN for the classification of signals with inverse polarity. However, we did not see any significant improvement after this augmentation and, therefore, this step should be reconsidered.

Features extracted from CNN and further processing by NN

Because we used a short 6 s window for analysis by the CNN, its output for a whole recording is a sequence of probabilities of being N, O or A, one triplet per window (class X was not recognized by the CNN due to the low number of X recordings). A typical situation for a record containing one premature ventricular beat (resulting in class O) is that most of the record points to class N, but at the location of the PVC it points to class O (figure 5). This means that quantifying simple maxima may be insufficient for encoding CNN results. Therefore, we decided to extract several statistical descriptors from CNN output vectors— ${{\overrightarrow{{\rm CNN}}}_{N}}$ , ${{\overrightarrow{{\rm CNN}}}_{A}}$ and ${{\overrightarrow{{\rm CNN}}}_{O}}$ . These descriptors are used as input features for a shallow neural network. A list of these features is presented in table 3.

**Figure 5.** Inputs and output of convolutional neural network. The ECG signal (A) is filtered (band-pass 1–40 Hz, (B)); the ECG signal is also transformed into eight envelograms (1–5 Hz, 5–10 Hz, etc up to 40 Hz, (C)). Then, a 6 s floating window moves over the data with a 1 s step and the result of the convolutional neural network (CNN) is computed for each window (D). Therefore, the results of this process are probability vectors for the three classes normal rhythm (blue), atrial fibrillation (red) and other arrhythmia (yellow). The class noisy is not detected using CNN.

Table 3. List of 17 features extracted from the convolutional neural network. These features are mostly statistical descriptors of probability vectors of each class N (normal), A (atrial fibrillation) and O (other arrhythmia). These features are fed into a shallow neural network.

#	Feature name	Description
1	cnns(1)	mean( ${{\overrightarrow{{\rm CNN}}}_{N}}$ )
2	cnns(2)	mean( ${{\overrightarrow{{\rm CNN}}}_{A}}$ )
3	cnns(3)	mean( ${{\overrightarrow{{\rm CNN}}}_{O}}$ )
4	cnns(4)	std( ${{\overrightarrow{{\rm CNN}}}_{N}}$ )
5	cnns(5)	std( ${{\overrightarrow{{\rm CNN}}}_{A}}$ )
6	cnns(6)	std( ${{\overrightarrow{{\rm CNN}}}_{O}}$ )
7	cnns(7)	max( ${{\overrightarrow{{\rm CNN}}}_{N}}$ )
8	cnns(8)	max( ${{\overrightarrow{{\rm CNN}}}_{A}}$ )
9	cnns(9)	max( ${{\overrightarrow{{\rm CNN}}}_{O}}$ )
10	cnns(10)	min( ${{\overrightarrow{{\rm CNN}}}_{N}}$ )
11	cnns(11)	min( ${{\overrightarrow{{\rm CNN}}}_{A}}$ )
12	cnns(12)	min( ${{\overrightarrow{{\rm CNN}}}_{O}}$ )

13	avgNsubMaxO	mean( ${{\overrightarrow{{\rm CNN}}}_{N}}$ ) − max( ${{\overrightarrow{{\rm CNN}}}_{O}}$ )
14	avgNsubMaxA	mean( ${{\overrightarrow{{\rm CNN}}}_{N}}$ ) − max( ${{\overrightarrow{{\rm CNN}}}_{A}}$ )
15	maxCodeNAO	Index of class with maximal value (1, 2, 3 for N, A, O)
16	maxCodeAvgNMaxAO	Index of class with maximal value; class N uses its average instead of maximum
17	multNprev	avgNsubMaxO (#13) × avgNsubMaxA (#14)

The computed features are fed into a shallow neural network which points directly to one of the three classes N, A and O. This shallow neural network contains 17 neurons in an input layer, 10 neurons in a hidden layer and three neurons in an output layer. The output layer implements a softmax function. The estimated test confusion matrix (70/30 dataset division for training/testing) is shown in table 4; the output for specific classes is shown in figure 6.
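The 17 descriptors of table 3 can be sketched as follows. The function name `cnn_features` is hypothetical, the sample standard deviation is assumed, and feature #15 is interpreted as the class whose probability vector contains the largest maximum; the table does not spell out these details.

```python
from statistics import mean, stdev

def cnn_features(p_n, p_a, p_o):
    """17 descriptors of the per-window CNN probabilities for N, A and O."""
    vectors = [p_n, p_a, p_o]
    feats = [mean(v) for v in vectors]          # 1-3
    feats += [stdev(v) for v in vectors]        # 4-6 (needs at least 2 windows)
    feats += [max(v) for v in vectors]          # 7-9
    feats += [min(v) for v in vectors]          # 10-12
    avg_n_sub_max_o = mean(p_n) - max(p_o)      # 13
    avg_n_sub_max_a = mean(p_n) - max(p_a)      # 14
    maxima = [max(p_n), max(p_a), max(p_o)]
    max_code_nao = maxima.index(max(maxima)) + 1        # 15: 1, 2, 3 for N, A, O
    mixed = [mean(p_n), max(p_a), max(p_o)]
    max_code_avg = mixed.index(max(mixed)) + 1          # 16: N uses its average
    feats += [avg_n_sub_max_o, avg_n_sub_max_a, max_code_nao, max_code_avg,
              avg_n_sub_max_o * avg_n_sub_max_a]        # 17
    return feats

# Toy CNN output (an assumption): four windows, one of which covers a PVC.
p_n = [0.90, 0.90, 0.20, 0.90]
p_a = [0.05, 0.05, 0.05, 0.05]
p_o = [0.05, 0.05, 0.75, 0.05]
features = cnn_features(p_n, p_a, p_o)
```

In the toy example the plain maximum (feature #15) still points to class N, whereas the variant that uses the average of N (feature #16) points to class O, which illustrates why simple maxima alone may be insufficient.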

Table 4. Test confusion matrix generated from CNN/NN combination. 30% of training set was used as a test set. N—normal recordings, O—other arrhythmia, A—atrial fibrillation, Se—sensitivity, Sp—specificity.

		Expert labeling
		N	O	A	Sp	F1
CNN/NN labeling	N	1419	56	1	0.96	0.96
	O	49	591	26	0.89	0.88
	A	3	27	197	0.87	0.87
	Se	0.96	0.88	0.88

		Overall				0.91

**Figure 6.** CNN output processed by shallow NN. Three outputs from NN (A, N and O) show strong association with specific classes.

Parallel use of bagged tree and convolutional neural network

Finally, output values from CNN/NN are compared to thresholds. If any of them passes, it means that there is high credibility of the CNN/NN output and the given result is used. Otherwise, the result from the bagged tree ensemble is used. Thresholds (0.8 for classes A and N, 0.75 for class O) were obtained experimentally with regard to the resultant F1 score (figure 7). In the training set, these thresholds were passed in 6636 of 8183 cases meaning that the CNN/NN result was preferred in 81% of cases.
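The decision rule can be sketched as below. The names `THRESHOLDS` and `final_label` are hypothetical, and a strict inequality is assumed for "passing" a threshold. Because the shallow network ends in a softmax, its three outputs sum to one, so with thresholds of 0.75 or more at most one class can pass.

```python
# Acceptance thresholds for the CNN/NN outputs, as given in the text.
THRESHOLDS = {"N": 0.8, "A": 0.8, "O": 0.75}

def final_label(nn_probs, bte_label):
    """Use the CNN/NN class if it passes its threshold, else the BTE label."""
    passing = {c: p for c, p in nn_probs.items() if p > THRESHOLDS[c]}
    if passing:
        return max(passing, key=passing.get)
    return bte_label

# Toy outputs (assumptions): a confident and an undecided CNN/NN result.
confident = final_label({"N": 0.85, "A": 0.05, "O": 0.10}, bte_label="O")
undecided = final_label({"N": 0.50, "A": 0.20, "O": 0.30}, bte_label="O")

# Share of training recordings in which the CNN/NN result was accepted.
share = 6636 / 8183
```

The first call returns the CNN/NN class N and the second falls back to the bagged tree label O; the share evaluates to roughly 0.81, in line with the 81% quoted above.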

**Figure 7.** Threshold selection for acceptance of CNN/NN results. F1 scores were independently generated for each of the classes N (black, solid line), A (blue, dash-dotted line) and O (orange, dashed line). The decrease of F1 score to the right is caused by decreasing sensitivity, which is not problematic because, if values do not pass any of the presented thresholds, the output of the opposite branch, the bagged tree (BT), will be used instead (figure 1). Selected thresholds were passed in 81% of training cases.

Results

The presented method has been tested on a hidden test set containing 3658 recordings. The results in table 5 show a score of 0.92 for normal recordings, 0.82 for atrial fibrillation and 0.74 for other arrhythmia, resulting in an overall score of 0.83. Among a total of 80 competitors, this score resulted in shared rank #2 in the final PhysioNet/CinC 2017 Challenge rankings. During the run on a test subset (N = 710), the method used 16.1% of the available running time.

Table 5. Estimated F1 scores obtained during training in comparison with real test scores. Estimations were obtained from 30% of recordings available to the public (70/30 training/test division of reduced public dataset, N = 8183); testing results were obtained from a hidden dataset (3658 recordings).

Class	Estimated F1 score (bagged tree ensemble)	Estimated F1 score (convolutional and shallow neural network)	Test F1-score
Normal	0.91	0.96	0.92
Atrial fibrillation	0.81	0.87	0.82
Other arrhythmia	0.75	0.88	0.74
Overall modified accuracy	0.82	0.91	0.83

Discussion

The presented method is a significant simplification of our previous approach [28] during the 2nd phase of the CinC/PhysioNet Challenge 2017. The number of features has been decreased from 277 to 60 (43 in the BTE branch; 17 in the CNN/NN branch); moreover, the interaction between a convolutional neural network and a bagged tree ensemble has been changed (previously, CNN/NN produced features supplied to the BTE, while they now work independently). Both these steps have led to a new solution showing a higher F1-score on the full hidden test set (0.83 versus 0.81). The increase in the score has probably been due to a stronger generalization effect, shown as a smaller decrease between the subset test (N = 710) score and the full-test (N = 3658) score (from 0.85 to 0.81 in the previous approach; from 0.84 to 0.83 in the current approach). The challenge organizers have also relabeled part of the data and this has, in all probability, also played a positive role in the score change. However, these newly relabeled data were not used for training of the presented approach; relabeling affected test results only.

From our previous experience, we considered the bagged tree a machine learning technique giving us solid training results. On the other hand, we were also aware that such a solution is solely dependent on QRS detection and, therefore, might not work in ECG signals containing a larger amount of noise caused by body motion or external sources. For this reason, we implemented an independent processing branch with a convolutional neural network. This branch does not use QRS detection and should be robust in cases in which a bagged tree loses its capability. Because the CNN/NN chain produces the probability of a result, we were able to decide where the CNN/NN is very confident (i.e. it passed any threshold as in figure 7).

Although the parallel use of two machine learning approaches seems to increase the overall score, we found that part of the detection task—detecting class X with noisy recordings—remains problematic for any machine learning approach. This was probably due to the relatively low number of X recordings in the training set (N = 278), but also due to the subjective opinion of the specialist performing such an assignment. Therefore, we described a simple logical rule which is not affected by machine learning and further machine learning processing is performed only if this rule is passed. Although this rule affects only a small proportion of X recordings (21% true positive, 3% false positive), it improves the weak results of machine learning for class X. Figure 8 shows two examples of recordings labeled X (too noisy to decide; expertly labeled). The top one (figure 8(A)) did not pass the mentioned rule, while the bottom one (figure 8(B)) passed this rule and was processed by the BTE. According to its training, the BTE labeled this recording as noisy, while such labeling may be questionable due to QRS complexes that are still detectable. It may be worth noting that ideal QRS detection (which would, unlike that presented here, capture all QRS complexes) may be counterproductive in cases like this.

**Figure 8.** Two examples of data labeled as too noisy (class X). The top recording (A) shows an excessive number of morphological groups of (false) QRS complexes (dots); it was considered noisy without the use of machine learning ('No' in figure 1(E)). Recording (B) passed the checks prior to processing by the bagged tree ensemble, which decided to label it as too noisy to process (correctly with regard to the official PhysioNet/CinC Challenge 2017 labeling; 'Yes' in figure 1(G)).
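
The decision flow discussed above can be illustrated with a minimal Python sketch. It is not the published implementation: the function name `combine_branches`, the per-class threshold dictionary and the exact order of the steps are assumptions made for illustration only, and the threshold values used in the usage note are hypothetical rather than those of figure 7.

```python
from typing import Dict

# Rhythm classes handled by the CNN/NN branch in this sketch (assumption).
CLASSES = ("N", "A", "O")


def combine_branches(
    passed_noise_rule: bool,
    bte_label: str,
    cnn_probs: Dict[str, float],
    thresholds: Dict[str, float],
) -> str:
    """Return the final label for one recording (illustrative only)."""
    # Step 1: simple logical rule, no machine learning involved.
    if not passed_noise_rule:
        return "X"

    # Step 2: class X is left to the bagged tree ensemble (BTE).
    if bte_label == "X":
        return "X"

    # Step 3: accept the CNN/NN result only when it is confident,
    # i.e. its highest class probability reaches that class's threshold.
    best = max(CLASSES, key=lambda c: cnn_probs.get(c, 0.0))
    if cnn_probs.get(best, 0.0) >= thresholds[best]:
        return best

    # Step 4: otherwise fall back to the BTE label.
    return bte_label
```

For example, with hypothetical thresholds of 0.9 for every class, a recording that passes the rule, is labeled N by the BTE and receives a CNN/NN probability of 0.95 for class A would be labeled A, whereas a CNN/NN probability of 0.5 would leave the BTE label N in place.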

In comparison to other competing entries, the presented work led to a shared rank #2, losing by 0.02 to the top-scoring entry [11]. Another six competing entries (from a total of 80) also shared the same rank with a resultant score of 0.83. These other entries with the same ranking used a random forest classifier [13, 30, 31], a binary classifier [32] or an SVM [33], with a feature count from 37 to 380. Only the entry [27] implemented an approach similar to ours (a combination of a CNN and an XGBoost ensemble).

The advantage of the proposed method is its high overall F1 score, leading to one of the highest scores in the challenge. The proposed method is immune to reversed signal polarity because both the QRS detection algorithm and the CNN work mostly with amplitude envelopes. In comparison to other top-scoring methods, it also extracts a small number of hand-crafted features, though this applies only to the bagged-tree processing branch.

The disadvantage of the presented method is its complexity in comparison with the top-scoring entry [11]. Moreover, although the achieved F1 scores were among the highest in the challenge, they are not sufficient for a fully automated system without a human operator (this applies to F1 scores across all competitors in the PhysioNet Challenge 2017).

Since the most challenging issue with Holter ECG processing is noise, future work in this area should involve accelerometer data, which are already recorded by some Holter ECG devices. Also, it would be appropriate to test (and possibly re-build) algorithms from the PhysioNet Challenge 2017 on databases that separately distinguish a larger number of pathologies, as the clinical impact of such a mechanism would be much more important.

Conclusions

The presented approach demonstrated the parallel use of two independent machine learning methods, where the first (bagged tree ensemble) uses regular features based on QRS detection, while the second (convolutional neural network and shallow neural network) uses only a transformed ECG signal. This parallel approach was not used for the detection of recordings labeled too noisy to process; this was covered by simple logical conditions and a bagged tree ensemble. Finally, it was shown that the presented method may improve results in comparison with either of the machine learning approaches used alone; we achieved an improvement of 1.3% in comparison to our approach from the previous challenge phase. The presented method achieved rank 2 (shared with six other teams) in the follow-up phase of the PhysioNet/CinC Challenge 2017. Its code is publicly available under the GNU GPL license [34].

Acknowledgments

This research was supported by the Czech Science Foundation, project GA17-13830S, project LO1212 by MEYS CR and by project MSM100651602 by the Czech Academy of Sciences.
