3D ECG display with deep learning approach for identification of cardiac abnormalities from a variable number of leads

Objective. The objective of this study is to explore new imaging techniques, combined with a deep learning method, for the identification of cardiac abnormalities present in electrocardiogram (ECG) signals with 2, 3, 4, 6 and 12 leads in the framework of the PhysioNet/Computing in Cardiology Challenge 2021. The training set is a public database of 88,253 twelve-lead ECG recordings lasting from 6 s to 60 s. Each ECG recording has one or more diagnostic labels. The six-lead, four-lead, three-lead, and two-lead sets are reduced-lead versions of the original twelve-lead data. Approach. The deep learning method considers images that are built from raw ECG signals. This technique uses an innovative 3D display of the entire ECG signal that observes the regional constraints of the leads and yields time-spatial images of the 12 leads, where the x-axis is the temporal evolution of the ECG signal, the y-axis is the spatial location of the leads, and the z-axis (color) is the amplitude. These images are used for training Convolutional Neural Networks with GoogleNet for ECG diagnostic classification. Main results. In the official results, our team, named ‘Gio_new_img’, received scores of 0.4, 0.4, 0.39, 0.4 and 0.4 (ranked 18th, 18th, 18th, 18th, 18th out of 39 teams) for the 12-lead, 6-lead, 4-lead, 3-lead, and 2-lead versions of the hidden test set with the Challenge evaluation metric. Significance. The results indicate that all these algorithms behave similarly across the various lead groups; the most surprising and interesting point is that the 2-lead scores are similar to those obtained with the analysis of 12 leads. This made it possible to test the diagnostic potential of reduced-lead ECG recordings. These aspects can be related to the pattern recognition capacity and generalizability of the deep learning approach and/or to the fact that the characteristics of the considered cardiac abnormalities can also be extracted from a reduced set of leads.


Introduction
The automatic detection and classification of cardiac abnormalities from standard twelve-lead ECG signals has long been an area of research interest, as part of a useful diagnostic screening system (Willems et al 1990, Kligfield et al 2007). At present, the limited accessibility of twelve-lead ECG devices provides a rationale for studying the impact of smaller, lower-cost, and easier-to-use devices, and it is conceivable that the standard 12 leads are not all equally important. Several efforts have been made to study the potential of a reduced number of leads, such as the optimal lead locations for ST-segment monitoring, with evidence that subsets of the standard twelve leads can capture useful information (Drew et al 1998, 2002, Green et al 2007). Some cardiac abnormalities are more frequent in some parts of the heart, and there are also blind spots in the standard ECG for certain regions of the heart (Green et al 2007). Consequently, several studies were performed to determine which leads in the standard 12-lead ECG are the best for detecting cardiac abnormalities (Aldrich et al 1987); nevertheless, there is limited evidence to demonstrate the utility of reduced-lead ECGs for capturing a wide range of diagnostic information.
In recent years, there has been growing interest in deep learning architectures, for example for medical images (Litjens et al 2017) or time series classification (Fawaz et al 2019).
In addition, several studies have been performed in the framework of ECG signal processing, with the development of a very large variety of deep learning architectures (Ebrahimi et al 2020, Liu et al 2021). For example, a deep recurrent neural network for the classification of atrial fibrillation using 21 ECG features was considered in (Chocron et al 2020), and in (Ribeiro et al 2020) a deep neural network was trained on a large 12-lead ECG database for the classification of six diagnostic classes, claiming an accuracy 'closer to the standard clinical practice'. In (Chen et al 2020), a deep learning architecture based on several convolutional neural networks was proposed for the classification of nine cardiac arrhythmias.
In (Murat et al 2020), several deep learning methods for the classification of five classes have been tested and compared with the state of the art, providing a schematic overview of several approaches present in the literature.
After a long series of interesting annual Challenges, the PhysioNet/Computing in Cardiology Challenge 2020 and Challenge 2021 (Perez Alday et al 2020, Reyna et al 2021) provide the opportunity to address the complexity of ECG classification from a different point of view and the effect of the analysis of a different number of leads.
The main objective of this study is to test the deep learning approach for the automatic classification of ECG signals from a variable number of leads through participation in the PhysioNet/Computing in Cardiology Challenge 2021. Interesting ECG displays proposed in the literature are taken into consideration, and a new 3D view is proposed as input to the deep learning architecture. A technique based on direct learning from innovative time-spatial ECG images, built from a variable number of leads, is then explored with deep learning methods.

Challenge database
The Challenge provided a large dataset with annotated 12-lead ECG recordings lasting from 6 to 1800 s (Perez Alday et al 2020, Reyna et al 2021, 2022). The ten datasets considered for the learning, validation and testing phases are reported in table 1. In particular, the dataset consists of 88,253 learning records, 6630 validation records (1463 CPSC, 5167 G12EEC), and 36,266 test records (1463 CPSC, 5167 G12EEC, 10,000 Undisclosed, and 19,642 UMich). The initial 133 diagnoses were reduced to the 30 diagnostic classes of special interest considered in the Challenge scoring system, with a further reduction to 26 considering 4 equivalent classes (table 2). This large dataset, considering the patients with one or more selected diagnostic classes and excluding the records with one or more null leads, consists of 80,567 'useful' or 'admissible' records, and 116,882 diagnostic instances. The non-uniform distribution of the diagnostic classes is evident: there are 4 diagnostic classes (BBB, Brady, LPR, PRWP) with fewer than 1000 records and 3 diagnostic classes (NSR, SA, TAB) with more than 10,000 records.
A random under-sampling (RUS) was performed to reduce the size of the learning set through a reduction of the majority groups. In particular, a random selection of ECG records with at most 5000 instances for each of the 26 considered diagnostic classes was determined, for a more balanced class distribution and a more efficient learning phase, obtaining the subset S51K containing 51,522 ECG records. Table 2 reports the distribution of the diagnostic instances present in the entire database, and the weighted number of records (WNR) of the subset S51K.
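The under-sampling step described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the greedy acceptance rule (a record is skipped as soon as one of its classes has reached the cap), the function name `random_under_sample` and the record layout are assumptions made for illustration, since the exact selection rule for multi-label records is not given in the text.

```python
import random
from collections import Counter


def random_under_sample(records, max_per_class=5000, seed=0):
    """Randomly select records so that no class exceeds `max_per_class`.

    `records` is a list of (record_id, labels) pairs, where `labels` is a
    set of diagnostic class codes. ASSUMPTION: records are visited in random
    order and rejected when any of their classes is already at the cap.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    counts = Counter()  # instances kept so far, per diagnostic class
    kept = []
    for record_id, labels in shuffled:
        if any(counts[label] >= max_per_class for label in labels):
            continue  # keeping this record would push a class over the cap
        kept.append((record_id, labels))
        counts.update(labels)
    return kept, counts


# Toy example with a cap of 3 instead of 5000.
toy = [(f"n{i}", {"NSR"}) for i in range(10)]
toy += [("a1", {"AF"}), ("a2", {"AF", "NSR"})]
subset, class_counts = random_under_sample(toy, max_per_class=3)
```

With a strict cap of this kind, minority labels that co-occur with a majority label can also be discarded, which is one reason why a weighted count such as the WNR in table 2 is useful for checking the resulting distribution.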

Methods
For the learning phase, the entire set of 80,567 'admissible' ECG records has been used, as previously described (table 1). Several pre-processing algorithms have been applied to the ECG signals. All ECG data are resampled at 500 Hz, if necessary, with spline interpolation (MATLAB function interp1), for compatibility purposes. The first ten seconds of each ECG signal are considered for the analysis, zero-padding shorter records. The ECG recordings are filtered by quadratic variation reduction (Fasano and Villani 2014) to remove baseline drift, and by a moving average filter over 20 ms for noise reduction (Bortolan et al 2021).
The 12-lead, 6-lead (I, II, III, aVR, aVL, aVF), 4-lead (I, II, III, V2), 3-lead (I, II, V2), and 2-lead (I, II) sets are considered. The 6-lead and 2-lead sets are equivalent from an informative point of view, as are the 4-lead and 3-lead sets: in both cases, the first set can be derived directly from the second. Nevertheless, these situations have been treated as distinct, to test the diagnostic power of the different sets.
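Two of the pre-processing steps above, the fixed ten-second window with zero-padding and the 20 ms moving average, can be sketched in a few lines of Python. This is an illustrative sketch only: resampling and quadratic variation reduction are omitted, the function names are hypothetical, and the centred window truncated at the signal edges is an assumption, since the text does not specify the edge handling.

```python
FS = 500        # sampling rate in Hz (from the text)
WINDOW_S = 10   # analysis window in seconds (from the text)
SMOOTH_MS = 20  # moving-average span in milliseconds (from the text)


def fix_length(signal, fs=FS, seconds=WINDOW_S):
    """Keep the first `seconds` of the signal, zero-padding shorter records."""
    n = fs * seconds
    head = list(signal[:n])
    return head + [0.0] * (n - len(head))


def moving_average(signal, fs=FS, span_ms=SMOOTH_MS):
    """Smooth with a moving average spanning `span_ms` milliseconds.

    ASSUMPTION: the window is roughly centred on each sample and is
    truncated at the two ends of the signal.
    """
    width = max(1, round(fs * span_ms / 1000))  # 10 samples at 500 Hz
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i - half + width)
        window = signal[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# A 6 s record becomes a 10 s record: 3000 samples plus 2000 zeros.
short_record = [1.0] * (6 * FS)
prepared = moving_average(fix_length(short_record))
```

At 500 Hz the 20 ms span corresponds to 10 samples, so the filter mainly attenuates high-frequency noise while leaving the QRS morphology largely intact.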
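The informational equivalence of the 2-lead and 6-lead sets follows from the standard Einthoven and Goldberger relations, by which leads III, aVR, aVL and aVF are linear combinations of leads I and II. The Python sketch below applies these relations sample by sample; the function name `derive_limb_leads` is hypothetical, and the relations hold exactly only for ideal, simultaneously sampled leads.

```python
def derive_limb_leads(lead_i, lead_ii):
    """Derive III, aVR, aVL and aVF from leads I and II.

    Standard relations:
        III = II - I
        aVR = -(I + II) / 2
        aVL = I - II / 2
        aVF = II - I / 2
    """
    pairs = list(zip(lead_i, lead_ii))
    return {
        "I": list(lead_i),
        "II": list(lead_ii),
        "III": [b - a for a, b in pairs],
        "aVR": [-(a + b) / 2 for a, b in pairs],
        "aVL": [a - b / 2 for a, b in pairs],
        "aVF": [b - a / 2 for a, b in pairs],
    }


# Two samples of leads I and II (arbitrary values in mV).
six_leads = derive_limb_leads([1.0, 0.5], [2.0, 1.0])
```

In the same way, the 4-lead set (I, II, III, V2) adds only the derived lead III to the 3-lead set (I, II, V2), so any difference in classification between such paired sets reflects the image representation rather than additional information.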

3D ECG display
The deep learning method considers images that are built from raw ECG signals (The Deep Learning Toolbox 2020, Bortolan et al 2021). Several deep learning architectures can process and classify images, and there are likewise several ways to obtain images from ECG signals to be classified by a deep learning method. This paper uses a 3D display of the entire ECG signal, observing the regional constraints of the leads and obtaining time-spatial images.
Three-dimensional visualizations of the 6 peripheral leads or the 6 limb leads are described in the literature (Chiang et al 2001, Bond et al 2013, Heo et al 2020). For example, Chiang et al (2001) proposed the reverse Cabrera sequence (III, aVF, II, -aVR, I, aVL) and the orderly sequence (V1, V2, V3, V4, V5, V6), which represents the projection of the cardiac signal from the right to the left side of the body. The need to define and test a spatial display or single-page view has been previously outlined (Anderson et al 1994, Selvester 1998). In this paper, the regional constraints reported in table 3 have been considered, and a reasonable way to obtain a single view from (V1, V2, V3, V4, V5, V6) and (aVL, I, -aVR, II, aVF, III) is to connect the antero-lateral and lateral regions. Consequently, a single 3D top-view display of the 12 leads has been defined, following the regional constraints and the lead order reported in table 3. In particular, the display follows the regional and anatomical positions of the Cabrera system, with aVR inverted (-aVR) and placed between I and II.
Therefore, a two-fold image is produced, because the regions are not strictly in the same plane (the precordial leads in a horizontal plane and the limb leads in a frontal plane), but it reflects regional contiguity.
Listing 1 reports the MATLAB code of the procedure that generates and saves a JPEG file of the 3D ECG display, using the MATLAB function griddedInterpolant to perform interpolation on a 2D gridded data set, and ndgrid to produce a rectangular grid in 2D space (The Deep Learning Toolbox 2020). In particular, ten additional samples were added between two contiguous leads with the linear interpolation method, and the 'nearest' extrapolation method was used. This procedure produces a 224 × 224 JPEG image, which is used by the deep learning method described in the next section.
Two examples of this 3D display for an interval of two seconds are reported in figure 1 (two beats) and figure 2 (three beats), where the x-axis represents the temporal evolution of the ECG signal, the y-axis represents the spatial location of the leads and the color (z-axis) represents the voltage of the ECG signal. This three-dimensional display of the 12 leads is a temporal-space ECG representation.
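The construction of the time-spatial matrix behind this display can be sketched in Python as follows. This is not Listing 1 and does not use griddedInterpolant: it is a minimal illustration of stacking the leads in the stated order and inserting ten linearly interpolated rows between contiguous leads. The joined lead order in `DISPLAY_ORDER` is an assumption based on the two sequences given in the text, and the color mapping, the resizing to 224 × 224 and the JPEG export are omitted.

```python
# ASSUMED row order: precordial sequence followed by the limb sequence
# quoted in the text, with aVR inverted.
DISPLAY_ORDER = ["V1", "V2", "V3", "V4", "V5", "V6",
                 "aVL", "I", "-aVR", "II", "aVF", "III"]


def build_display(leads, order=DISPLAY_ORDER, extra_rows=10):
    """Return a 2D list: rows follow lead position, columns follow time.

    `leads` maps a lead name to a list of samples (all of equal length).
    A name prefixed with '-' selects the inverted lead.
    """
    rows = []
    for name in order:
        if name.startswith("-"):
            rows.append([-v for v in leads[name[1:]]])
        else:
            rows.append(list(leads[name]))

    image = []
    for upper, lower in zip(rows, rows[1:]):
        image.append(upper)
        # Insert `extra_rows` linearly interpolated rows between two leads.
        for k in range(1, extra_rows + 1):
            t = k / (extra_rows + 1)
            image.append([(1 - t) * a + t * b for a, b in zip(upper, lower)])
    image.append(rows[-1])
    return image


# Toy input: each lead is a constant signal of 4 samples.
names = ["V1", "V2", "V3", "V4", "V5", "V6", "aVL", "I", "aVR", "II", "aVF", "III"]
toy_leads = {name: [float(i)] * 4 for i, name in enumerate(names)}
matrix = build_display(toy_leads)  # 12 + 11 * 10 = 122 rows
```

Each value of the resulting matrix is then mapped to a color, so that amplitude is read along the z-axis of the display.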

Deep learning networks
The 3D display method described previously, obtained from raw ECG signals only, is then used for training Convolutional Neural Networks for ECG diagnostic classification. A pretrained CNN image classification network, which has already learned to extract powerful and informative features from natural images, has been used as a starting point to learn a new classification task (The Deep Learning Toolbox 2020, Bortolan et al 2021). One pre-trained CNN for image classification has been used: GoogleNet (The Deep Learning Toolbox 2020). This model is pretrained on a subset of the ImageNet database (Russakovsky et al 2015), which is used in the ImageNet Large-Scale Visual Recognition Challenge. GoogleNet is a convolutional neural network characterized by 22 layers, and it is pretrained to classify images into 1000 object categories. Each layer can be considered as a filter; consequently, the first layers characterize more common features, while the deeper ones characterize more specific features that differentiate the considered diagnostic classes (Bortolan et al 2021). Two examples of the 3D 12-lead ECG display for the interval of 10 s that feed the connectionist CNN approach are reported in figure 3 (A008874-NSR) and figure 4 (J00289-AF).
Table 3. Regional limits of the 12 leads and the considered leads in 12, 6, 4, 3 and 2-lead systems. Columns: Lead, Region, Lead system.
The learning procedure, as previously described in (Bortolan et al 2021), is characterized by the binary cross-entropy (BCE) loss function, an initial learning rate of 0.0001, a minibatch size in the interval [20:60] chosen as the minimum factor of the number of elements of the training set, the stochastic gradient descent optimization algorithm with momentum (= 0.9), and a variable number of iterations in the various experiments. In addition, particular techniques have been developed to adapt the general task to this class imbalance and multilabel classification problem:
• Class imbalance is addressed with two data-level methods: (1) a random under-sampling (RUS) to reduce the size of the learning set, as previously described (table 2), decreasing the majority groups; (2) a random over-sampling (ROS) algorithm, with duplication of records from the minority groups in the learning set to obtain a more uniform and balanced distribution of the diagnostic classes (performed for all classes with a number of instances lower than half of the median number of instances of all groups).
• The learning process of the deep learning method was adapted to cope with the classification of multiple diagnoses (comorbidity), adding to the training set one record for every diagnosis of a multi-diagnosis recording, with the corresponding classification.
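The two data-level adaptations listed above, the expansion of multi-diagnosis records and the random over-sampling of minority classes, can be sketched in Python as follows. The selection threshold (half of the median class count) comes from the text; the number of duplicates is not stated, so duplicating until a class reaches that threshold is an assumption, as are the function names and the record layout.

```python
import math
import random
from collections import Counter
from statistics import median


def expand_multilabel(records):
    """Create one (record_id, label) training entry per diagnosis."""
    return [(rid, label) for rid, labels in records for label in sorted(labels)]


def random_over_sample(entries, seed=0):
    """Duplicate entries of classes below half the median class count.

    ASSUMPTION: each minority class is over-sampled, by random duplication,
    until it reaches the threshold (half of the median count).
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in entries)
    threshold = median(counts.values()) / 2

    augmented = list(entries)
    for label, n in counts.items():
        if n >= threshold:
            continue  # not a minority class under this rule
        pool = [entry for entry in entries if entry[1] == label]
        needed = math.ceil(threshold) - n
        augmented.extend(rng.choice(pool) for _ in range(needed))
    return augmented


# Toy learning set: 41 NSR, 20 AF and 2 BBB instances after expansion.
toy = [(f"n{i}", {"NSR"}) for i in range(40)]
toy += [(f"a{i}", {"AF"}) for i in range(18)]
toy += [("m0", {"AF", "NSR"}), ("m1", {"AF", "BBB"}), ("b0", {"BBB"})]
entries = expand_multilabel(toy)
balanced = random_over_sample(entries)
```

In this toy set the median class count is 20, so only BBB (2 instances) falls below the threshold of 10 and is duplicated. Expanding a multi-diagnosis recording into several single-label entries means that the same image appears in the training set once per diagnosis.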

Results and discussion
Our team, named 'Gio_new_img', participated successfully in the 2021 Challenge. During the unofficial and official phases, the twelve-lead, six-lead, four-lead, three-lead, and two-lead models were trained on the public training data, and the trained models were scored on the validation set. At the end of the official phase, the models with the highest scores were tested on the testing sets.
A specific scoring metric, the 'challenge score' (Perez Alday et al 2020, Reyna et al 2021), was chosen for comparing and validating the trained models. It is a generalized version of the traditional accuracy metric, assigning full credit to correct diagnoses and partial credit to misdiagnoses with similar outcomes or treatment.
In the first phase, the learning process considered 12-lead, 6-lead, 3-lead and 2-lead systems. In this phase, the class imbalance technique with a random over-sampling (ROS) algorithm was tested. Table 4 reports the challenge metric with and without the ROS technique. The accuracy with the ROS technique (3D + ROS method) is always lower than the accuracy without the over-sampling algorithm (3D method). Consequently, in order to simplify the computation, over-sampling of the input data was not used in the second phase of the Challenge.
In the official Challenge phase, the deep learning process was performed and tested on the five lead combinations: 12-lead, 6-lead, 4-lead, 3-lead, and 2-lead. A 3-fold cross-validation was performed on the random under-sampled training set of 51,522 records (S51K), previously described, to obtain pre-trained networks and so overcome the time limit (72 h) for the training of the 5 lead groups. The performance of the selected pre-trained networks is shown in table 5, which reports the computed classification indices (Challenge score, AUROC and AUPRC) for the five lead groups. In this case, the computed 'challenge score' takes values in the narrow interval [0.512: 0.535]. At the end of the official phase, the models with the highest scores (best performance) were tested on the testing sets; table 6 reports the official results of the Challenge score in the validation and test sets for the 5 lead groups. The total running time for training the 5 different deep learning networks was approximately 53 h, performing only 4 complete iterations over the entire training set. In order to optimize the training strategy, the submitted algorithms started the training phase from pre-trained networks like those reported in table 5. For example, the deep learning process with the 3D ECG display of 12 leads resumed the training from a previously saved pre-trained network (3-fold cross-trained for 10 iterations) for an additional 4 iterations over the same learning subset, S51K, producing a validation score of 0.446 and a test score of 0.40. Similarly, the 3D 2-lead system, starting from a pre-trained network with 20 iterations and trained for an additional 4 iterations, produced the highest validation score (0.457) and a similar test score (0.40).
Consequently, in the official results, our team 'Gio_new_img' received scores of 0.4, 0.4, 0.39, 0.4 and 0.4 (ranked 18th, 18th, 18th, 18th, 18th out of 39 teams) for the 12-lead, 6-lead, 4-lead, 3-lead, and 2-lead versions of the hidden test set with the Challenge evaluation metric.
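The idea behind the challenge score introduced at the start of this section, full credit for correct diagnoses and partial credit for clinically similar misdiagnoses, can be illustrated with a simplified Python sketch. It is loosely modelled on the general form of the publicly released Challenge scoring (a weighted multi-label confusion matrix, normalized so that a perfect classifier scores 1 and a classifier that always outputs the normal class scores 0), but it is not the official evaluation code: the weight matrix, the three-class setting and the function names are invented for illustration.

```python
def weighted_confusion(labels, outputs):
    """Multi-label confusion matrix; each record contributes a total of
    at most one, shared among its (true class, predicted class) pairs."""
    n = len(labels[0])
    matrix = [[0.0] * n for _ in range(n)]
    for truth, pred in zip(labels, outputs):
        # Number of classes positive in the labels or in the outputs.
        norm = max(sum(1 for j in range(n) if truth[j] or pred[j]), 1)
        for j in range(n):
            if not truth[j]:
                continue
            for k in range(n):
                if pred[k]:
                    matrix[j][k] += 1.0 / norm
    return matrix


def challenge_style_score(labels, outputs, weights, normal_index):
    """Weighted score normalized between an 'always normal' and a perfect classifier."""
    n = len(weights)

    def raw(outs):
        matrix = weighted_confusion(labels, outs)
        return sum(weights[j][k] * matrix[j][k] for j in range(n) for k in range(n))

    inactive_outputs = [[1 if k == normal_index else 0 for k in range(n)]
                        for _ in labels]
    observed, correct, inactive = raw(outputs), raw(labels), raw(inactive_outputs)
    if correct == inactive:
        return 0.0
    return (observed - inactive) / (correct - inactive)


# HYPOTHETICAL weights for three classes (0: NSR, 1: AF, 2: AFL):
# confusing AF with AFL earns half credit.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.5],
     [0.0, 0.5, 1.0]]
truth = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
predicted = [[1, 0, 0], [0, 0, 1], [0, 0, 1]]  # AF mistaken for AFL
score = challenge_style_score(truth, predicted, W, normal_index=0)
```

Because of this normalization, a score around 0.4 should not be read as 40% accuracy: it measures how far the classifier moves from the 'always normal' baseline towards a perfect classification under the chosen weights.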
Surprisingly, the proposed deep learning algorithm had similar performance across the five lead combinations, and the models trained with a higher number of leads did not show increasing accuracy. The results are therefore analysed in more depth from two particular viewpoints. In the first one, the two extreme situations of input data are analyzed, namely the maximum number of leads (12-lead, table 7) and the minimum number of leads (2-lead, table 8), for a comparison of the performance indices (challenge score, AUROC, AUPRC, and F-measure) in the validation set and in the four test sets (CPSC test, G12ECG test, Undisclosed test and UMich test). For example, with the two-lead system (table 8), the challenge score was 0.46 in the validation set and in the range [0.35: 0.47] in the 4 test sets, while with twelve leads (table 7) the challenge score was 0.45 and in the interval [0.35: 0.46], respectively. The results are therefore uniform and similar in the two extreme situations, in both the validation and test sets. This analysis suggests there is no significant difference in the classification task between the minimum and maximum number of leads.
From the second viewpoint, the behaviour of the evaluation scores can be analysed while reducing the number of leads used in the learning/testing process from 12 to 6, 4, 3, and 2 leads (table 9), considering the evaluation indices (challenge score, AUROC and AUPRC) in the two most numerous test sets: Undisclosed (10,000 records) and UMich test (19,642 records). In this table, the results in both test sets show very similar values for all five lead combinations and for all the evaluation indices. For example, the challenge score does not show any significant decrease when the number of considered leads is reduced. Again, this analysis leads to the same conclusion: there are no significant differences among the 5 lead groups. This indicates a positive classification capacity of the deep learning method when the 3D views of the considered ECG leads are used as input images. The results previously described lead to some interesting considerations. There is no evidence that the standard twelve-lead ECGs performed better (or produced any additional diagnostic power) than any reduced set of leads. The validation scores reported in table 6 were in the range [0.440 : 0.457], and the test score was 0.4 for all the combinations of leads. This indicates that the deep learning method produced similar classification results with 12-lead, 6-lead, 4-lead, 3-lead, and 2-lead inputs. In particular, the classification with the minimum number of leads (2-lead) produces scores similar to those obtained with the full ECG signal (12 leads). From a computational point of view this is a very surprising and interesting point, because the addition of learning information (from 2 leads to 12 leads) does not produce a significant improvement in the global classification task.
This aspect is related to two possible explanations: on the one hand, the pattern recognition capacity and generalizability of the deep learning approach with 3D ECG views make it possible to recognize and classify ECG signals with either 2 or 12 leads; on the other hand, the characteristics of the considered cardiac abnormalities can also be extracted from a reduced set of leads. Obviously, these conclusions are valid only for the various rhythms explored in this work.