Paper

Quantifying the accuracy of inter-beat intervals acquired from consumer-grade photoplethysmography wristbands using an electrocardiogram-aided information-based similarity approach

Published 11 March 2024 © 2024 Institute of Physics and Engineering in Medicine
Citation Xingran Cui et al 2024 Physiol. Meas. 45 035002 DOI 10.1088/1361-6579/ad2c14

0967-3334/45/3/035002

Abstract

Objective. Although inter-beat intervals (IBI) and the derived heart rate variability (HRV) can be acquired through consumer-grade photoplethysmography (PPG) wristbands and have been applied in a variety of physiological and psychophysiological conditions, their accuracy is still unsatisfactory. Approach. In this study, 30 healthy participants concurrently wore two wristbands (E4 and Honor 5) and a gold-standard electrocardiogram (ECG) device under four conditions: resting, deep breathing with a frequency of 0.17 Hz and 0.1 Hz, and mental stress tasks. To quantitatively validate the accuracy of IBI acquired from PPG wristbands, this study proposed to apply an information-based similarity (IBS) approach to quantify the pattern similarity of the underlying dynamical temporal structures embedded in IBI time series simultaneously recorded using PPG wristbands and the ECG system. The occurrence frequency of basic patterns and their rankings were analyzed to calculate the IBS distance from gold-standard IBI, and to further calculate the signal-to-noise ratio (SNR) of the wristband IBI time series. Main results. The accuracies of both HRV and mental state classification were not satisfactory due to the low SNR in the wristband IBI. However, by rejecting data segments of SNR < 25, the Pearson correlation coefficients between the wristbands' HRV and the gold-standard HRV were increased from 0.542 ± 0.235 to 0.922 ± 0.120 for E4 and from 0.596 ± 0.227 to 0.859 ± 0.145 for Honor 5. The average accuracy of four-class mental state classification increased from 77.3% to 81.9% for E4 and from 79.3% to 83.3% for Honor 5. Significance. Consumer-grade PPG wristbands are acceptable for HR and HRV monitoring when removing low SNR segments. The proposed method can be applied for quantifying the accuracies of IBI and HRV indices acquired via any non-ECG system.

1. Introduction

Due to their unobtrusive nature and ability to provide continuous and passive measurements of physiological signals, techniques such as photoplethysmography (PPG) (Elgendi 2012), ballistocardiogram (BCG) (Sadek et al 2019), millimeter-wave radar (Churkin and Anishchenko 2015), and Wi-Fi are increasingly used in long-term recordings in laboratory research and during daily life. Among these techniques, the PPG wristband is the most commonly used.

An estimate of the inter-beat intervals (IBI), the inverse of the real-time heart rate (HR), resulting from sinus node depolarizations can be derived from PPG pulse intervals using the distances (in ms) between regular features in the systolic waveforms (e.g. systolic peaks, pulse wave foot points). The IBI time series can be used to compute the real-time HR and average HR, and to derive measures (e.g. heart rate variability, or HRV) related to the parasympathetic nervous system activity. Specifically, resting high-frequency HRV is an index of global health and overall adaptability, reflecting vagal cardiac control, which is involved in psychophysiological processes like mental stress and emotion (Laborde et al 2017). However, the high-frequency HRV index is much more sensitive to noise than the low-frequency index. Thus, the noise ratio of IBI time series, which could be affected by many factors, e.g. the hardware design, ambient light, motion artifacts, etc, has a great influence on the accuracy of HRV-related measures and functions.

Currently, most studies focus on the PPG signals with suggested satisfactory data quality, or acceptable measurement error in averaged HR and some HRV indices. Few studies have assessed the accuracy of wristband IBI time series in terms of the absolute agreement with gold-standard electrocardiogram (ECG) recording systems because the timelines of the two recording systems are difficult to align precisely. The correlation between the two IBI signals is a necessary but not sufficient condition for their agreement, as it simply measures the strength of their linear relation, regardless of systematic biases and depending on the range of measurement (Bland and Altman 1986).

To date, few available methods have been applied to quantify the accuracy of wristband IBI signals. Instead, the existing studies mainly evaluate the quality of PPG pulse wave signals, and then screen the high-quality PPG waves so as to ensure the accuracy of the extracted IBI time series, neglecting the necessity of time continuity in IBI time series. Research methods to evaluate the quality of PPG signals follow two main directions: (1) extracting the characteristics of pulse waves, and (2) analyzing the correlation of HR features extracted separately from ECG signals and PPG pulse wave signals.

For example, Elgendi (2016) proposed a method to evaluate the quality of pulse wave signals by waveform characteristics, such as kurtosis, skewness, signal-to-noise ratio (SNR) and other indicators. Sukor et al (2011) divided the pulse wave signals into three categories according to different qualities by analyzing the waveform amplitude, width and DC component. Couceiro et al (2014) presented an algorithm to distinguish available and unavailable pulse wave signals. They extracted time domain features (amplitude, wave length, amplitude difference of wave crest, wave kurtosis, wave skewness, etc) and frequency-domain features (discrete short time Fourier transform (STFT) of the pulse wave signal, etc) of the pulse waves, and the eight most important features were selected and fed into a support vector machine (SVM) binary classification model. Menghini et al (2019) analyzed the correlations of HRV indices (mean HR, standard deviation of heartbeat interval, high-frequency component and low-frequency component of heartbeat interval, etc) separately calculated from the gold-standard ECG signal and wristband pulse waves, so as to evaluate the quality of pulse waves. Yang et al (2020) divided wristband HR signals into two categories, i.e. good quality and bad quality, using the XGBoost classification model to compare the extracted HR features with gold-standard ECG-derived HR features, including the median value, kurtosis, skewness, entropy, etc. Interestingly, Shcherbina et al (2017) assessed the accuracy of several wearable wristbands in measuring the average HR, and observed higher errors for males, higher body mass index, darker skin, and movement conditions. Recently, Lutin et al (2021) developed a learning-based quality indicator engine, comprising the fundamental steps of frequency-domain feature extraction, feature selection, and classification by an ensemble of decision trees, to aid HR estimation in wrist-worn PPG with motion artifacts. 
Chen et al (2021) used deep learning methods combined with STFT time-frequency spectra to perform signal quality assessment of PPG signals, which could give the result as good or bad quality. Mestrom et al (2021) assessed the agreement between the HR extracted from the wrist-worn optical HR monitor (OHRM) and the gold-standard five-lead ECG connected to a patient monitor during surgery and in the recovery period. They proposed that an OHRM can be considered clinically acceptable for HR monitoring in low-acuity hospitalized patients. Van der Stam et al (2023) compared the HR and respiratory rate measures obtained via a wearable PPG wristband with the measurements of the reference monitor in 62 post-abdominal surgery patients, and found that the device was able to provide accurate measurements for the large majority of the measurements as 98% and 93% of the measurements were within 5 bpm or 3 rpm of the reference signal, which is sufficiently accurate for clinical applications.

Complex physiological signals, like IBI time series, may contain unique dynamical characteristics, and these characteristics may be related to the underlying mechanisms of the original physiological system, which are of great importance, especially for the low- and high-frequency HRV components. To our knowledge, few studies have assessed the accuracy of IBI time series measured by PPG wristbands. Thus, this study focused on the hidden information of IBI fluctuations, and proposed to apply an information-based similarity (IBS) approach to quantify the pattern similarity of the dynamical temporal structures embedded in the IBI time series simultaneously recorded using different PPG wristbands and the gold-standard ECG recording system. In practice, we conducted a mental state induction and classification experiment to validate the accuracy of IBI time series derived from PPG wristbands of different grades, and to further quantify the SNR of the IBI time series so that the accuracy of HRV and HRV-based mental state classification could be improved by rejecting the low-SNR IBI segments.

Furthermore, the proposed approach and its procedure can serve as a universal method to quantify the reliability of heartbeat intervals (or real-time HR) and HRV obtained using any non-ECG system before it is used in healthcare services or studies.

2. Data and materials

2.1. Participants

A total of 30 healthy right-handed college student volunteers (19 males and 11 females) with age ranging from 20 to 26 years (21.5 ± 2.1 years) from the Southeast University in China took part in this study. All participants were informed of the experimental protocol and matters needing attention, then signed the written informed consent prior to participating in the study. This study was approved by the Independent Ethics Committee for clinical research of Zhongda Hospital, affiliated to Southeast University (No. 2019ZDSYLL073-P01). None of the participants had any history of neurological or psychological disorders, and they all had normal vision or were corrected to normal vision (self-report). Three subjects were excluded due to device problems and data recording interruption.

2.2. Data recording systems

The 30 healthy participants concurrently wore two wristbands of different grades on their non-dominant wrists and a gold-standard ECG measurement device, following the manufacturers' instructions.

The two wristbands are both wireless sensors that can continuously record PPG signals and output IBI time series. The E4 wristband (Empatica Inc., Milan, Italy) is specifically designed for both research and commercial purposes, and overcomes the limitations of most commercial devices. The sampling frequency for the E4 PPG signal is 64 Hz. To validate the reliability of commonly used consumer-grade commercial PPG wristbands, the Honor band 5 (Huawei Technologies Co., Ltd, Shenzhen, China) was worn on the same wrist of each participant, and the wearing location of the two wristbands was randomized. In this study, the PPG signal sampling frequency for Honor band 5 was 100 Hz. Honor band 5 is a popular commercialized smart band and it is much cheaper than E4. The raw PPG data were directly exported from the E4 band via a client mobile application. The Honor band 5 is commercially used and restricted by proprietary confidentiality agreements; thus, the raw PPG data were exported from the device using a specific software development kit after signing an agreement with Huawei.

ECG recordings were collected by a US Food and Drug Administration-approved ambulatory electrocardiogram monitor (DynaDx Corporation, Mountain View, CA, USA) with a computer-based data acquisition system. The ECG recording equipment was a single-lead Holter device that can continuously record ECG for over 24 h. The ECG electrodes were placed on the subject's chest. The sampling frequency of the ECG signals was set to 512 Hz.

2.3. Conditions and tasks

This study focused on the accuracy of IBI and HRV under different psychophysiological conditions. Outdoor conditions were not included because the accuracy of PPG during exercise is too poor to obtain reliable IBI time series when using consumer-grade wristbands. The data collection protocol is shown in table 1, and includes nine consecutive conditions: paced breathing at 0.17 Hz and 0.1 Hz, a resting state, and three mental stress tasks, each followed by a resting state.

Table 1. Data collection protocol: conditions and the corresponding durations.

No. | Conditions and tasks | Duration (s)
1 | Paced breathing (0.17 Hz) | 300
2 | Paced breathing (0.1 Hz) | 300
3 | Seated resting state 1 | 500
4 | Stress task 1 | 360
5 | Seated resting state 2 | 500
6 | Stress task 2 | 360
7 | Seated resting state 3 | 500
8 | Stress task 3 | 360
9 | Seated resting state 4 | 500

2.3.1. Paced breathing

Participants were asked to breathe following a respiratory pacer set first at 0.17 Hz (6 s per breathing cycle) and then at 0.1 Hz (10 s per breathing cycle). Each session lasted 5 min. Paced breathing was included to maximize the respiratory sinus arrhythmia, which is associated with increased vagally mediated HRV measures (Vaschillo et al 2006).

2.3.2. Seated resting

Participants were asked to sit comfortably in front of a black screen. This condition comprised five 100 s resting periods, alternating between eyes closed and eyes open. Seated resting is commonly used as the resting or baseline condition in psychophysiological reactivity studies (Cacioppo et al 1998).

2.3.3. Modified Montreal Imaging Stress Tasks (mental stress tasks)

The Montreal Imaging Stress Task (MIST) is a well-validated task for healthy adults and clinical populations (Castro et al 2015, Ming et al 2017, Noack et al 2019). Considering our current understanding of neural stress circuitry and the results of the most relevant MIST studies (Hermans et al 2014, Hariri 2015, Lago et al 2017, Velozo et al 2021), we hypothesize that the MIST will elicit increased self-reported stress ratings, HR, and decreased HF-HRV.

In this study, a modified MIST task was applied. Participants were asked to solve challenging arithmetic operations under extreme time pressure while receiving feedback on their average performance relative to the performance of alleged other participants (figure 1). Each operation involved three numbers (limited to two digits): the first two numbers were multiplied or divided, followed by the addition or subtraction of the third number. To maintain the stress state of the participants, the time limit was reduced by 10% after three consecutive correct answers; to avoid participants giving up on difficult tasks, the time limit was increased by 10% after three consecutive wrong answers.
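
The adaptive time limit described above can be sketched in a few lines. This is only an illustration, not the task software used in the study: the function name, the 10 s starting limit, and the choice to reset the streak counter after each adjustment are assumptions.

```python
def update_time_limit(limit_s, streak, correct, step=0.10, run=3):
    """Return (new_limit_s, new_streak) after one answer.

    streak > 0 counts consecutive correct answers and streak < 0 counts
    consecutive wrong answers. Resetting the streak to 0 after each
    adjustment is an assumption; the text does not specify it.
    """
    if correct:
        streak = streak + 1 if streak > 0 else 1
    else:
        streak = streak - 1 if streak < 0 else -1
    if streak == run:       # three correct in a row: shorten the limit by 10%
        return limit_s * (1 - step), 0
    if streak == -run:      # three wrong in a row: lengthen the limit by 10%
        return limit_s * (1 + step), 0
    return limit_s, streak


limit, streak = 10.0, 0     # hypothetical starting limit of 10 s
for answer in [True, True, True, False, False, False]:
    limit, streak = update_time_limit(limit, streak, answer)
print(round(limit, 3))
```

In this run the limit drops to 90% of its starting value after the three correct answers and then rises by 10% of that value after the three wrong answers.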

Figure 1. User interface during mental stress task.

3. Methods

3.1. PPG signal preprocessing and IBI time series extraction

A fifth-order band-pass Butterworth filter and zero-phase shift filter with a passband of 0.3 Hz to 8 Hz were first used to eliminate the baseline drift, power frequency interference, and high-frequency noise. Then, an adaptive threshold was used to detect the peak point of each pulse cycle and to extract the IBI series. The Matlab function 'findpeaks' was applied (Christensen et al 2015, Lockmann et al 2016, Stobart et al 2018, Babola et al 2021, Bae et al 2021), where the first parameter was 'minpeakheight = 0' (because the DC component was filtered out during preprocessing) and the second parameter was 'minpeakdistance = 0.4*fs' (fs is the sampling frequency of the PPG signal, and 0.4 s was set as the minimum duration of one cardiac cycle).
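
The peak-picking step can be illustrated with a simplified pure-Python stand-in for 'findpeaks'. This is a sketch rather than the study's implementation: it assumes the signal has already been band-pass filtered, the helper names are hypothetical, and the tallest-first handling of the minimum distance only imitates the behaviour of the Matlab function.

```python
import math


def detect_peaks(signal, fs, min_height=0.0, min_distance_s=0.4):
    """Local maxima above min_height that are at least min_distance_s apart.

    Candidates are accepted from tallest to shortest, so a smaller peak that
    lies too close to an already accepted peak is discarded.
    """
    min_dist = int(round(min_distance_s * fs))
    candidates = [
        i for i in range(1, len(signal) - 1)
        if signal[i] > min_height and signal[i - 1] < signal[i] >= signal[i + 1]
    ]
    kept = []
    for i in sorted(candidates, key=lambda i: signal[i], reverse=True):
        if all(abs(i - j) >= min_dist for j in kept):
            kept.append(i)
    return sorted(kept)


def peaks_to_ppi_ms(peaks, fs):
    """Convert peak sample indices to peak-to-peak intervals in milliseconds."""
    return [(b - a) * 1000.0 / fs for a, b in zip(peaks, peaks[1:])]


fs = 100  # Hz, the Honor band 5 sampling frequency given in the text
# Synthetic, already zero-mean pulse-like signal at 1.25 Hz (75 beats per minute).
ppg = [math.sin(2 * math.pi * 1.25 * n / fs) for n in range(10 * fs)]
peaks = detect_peaks(ppg, fs)
ppi_ms = peaks_to_ppi_ms(peaks, fs)
print(len(peaks), ppi_ms[:3])
```

For this synthetic signal every interval equals the 800 ms period, which is a quick sanity check before applying the same steps to real recordings.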

Figure 2 is an example of a PPG peak detection result. The PPG signal was recorded using Honor band 5 during task 1 (paced breathing, 0.17 Hz). The IBI can be obtained from the intervals between the detected pulse wave peaks, known as PP intervals (PPIs).

Figure 2. Representative result of PPG preprocessing and PPI time series extraction: (a) the raw PPG signal, (b) the preprocessed PPG and the peak detection results, (c) the obtained PPI time series.

3.2. Acquisition of gold-standard IBI time series

The gold-standard IBI time series was obtained from the simultaneously recorded ECG signals. The ECG signals were preprocessed as follows: a third-order band-pass Butterworth filter and zero-phase shift filter with a cut-off frequency of 2 Hz to 30 Hz were first used to eliminate the baseline drift, power frequency interference, and high-frequency noise, which makes the QRS complex (including the Q wave, R wave, and S wave) more prominent, and avoids changing the position of R peaks.

At present, plenty of R-peak detection methods have been proposed in ECG signal analysis. However, when applying these detectors on ECG signals collected in daily life, especially via wearable single-lead ECG devices, the R-peak detection accuracies are usually unsatisfactory. As the gold standard, we must guarantee the accuracy of R-peak detection. Thus, a method for extracting high-quality RR intervals (RRI) proposed in our previous study (Cui et al 2020, Xue et al 2020) was applied, which combines five commonly used R-peak detection methods, namely Pan–Tompkins (Pan and Tompkins 1985, Hamilton and Tompkins 1986), jqrs (Johnson et al 2015), mteo (Solnik et al 2008), nqrs (Li and Yang 2013), and sixth power (Dohare et al 2014).

The gold-standard IBI time series can be obtained from the detected RRIs. Both RRIs and PPIs are IBI time series.

3.3. Quantifying the accuracy of PPG-derived IBI time series

3.3.1. Accuracy quantification algorithm based on IBS approach

The accuracy of PPG-derived HR is commonly determined either by analyzing PPG pulse waveform characteristics or by establishing the correlation between the separately calculated HRV indices from PPG and ECG. Both methods rely on qualitative assessments.

In this study, the similarity index of dynamical changing patterns between IBI time series simultaneously recorded using PPG wristbands and the ECG device was calculated by means of symbolic coding and ranking statistics to quantify the accuracy of PPG-derived IBI time series.

The IBS approach was proposed to measure the distance or dissimilarity between two symbolic sequences based on measuring differences in the occurrence of repetitive patterns between two symbolic sequences (Yang et al 2003, Goldberger and Peng 2005, Wang et al 2016). The IBS algorithm is based on symbolic coding and ranking statistics, and thus it is not required to align the timelines of PPI and RRI precisely. In this study, the accuracy quantification algorithm for wristband-measured PPI time series was proposed as follows.

  • (1)  
    Map the IBI time series (RRI and PPI) to binary symbolic sequences, where an increase in the IBI is represented by '1' and no change or a decrease in the IBI is represented by '0'.
  • (2)  
    Map m + 1 successive IBI intervals to a binary sequence of length m, called an m-bit 'word'. Each m-bit word, therefore, represents a unique pattern of fluctuations in a given IBI time series.
  • (3)  
    By shifting one data point at a time, the algorithm produces a collection of m-bit words over the whole time series (a total of 2^m possible words).

Different patterns of dynamics thus produce different distributions of these m-bit words. Figure 3 illustrates this mapping procedure using five-bit words (m = 5) from a part of the IBI time series. For m = 5, there are a total of 32 (= 2^5) possible words. The first binary word (10010) shown in figure 3 is equivalent to the decimal number 18 (1 × 2^4 + 1 × 2^1 = 18); likewise, (00100) and (01001) are termed 4 and 9, respectively.

  • (4)  
    These m-bit words are then sorted according to their frequency of occurrence. The rank-frequency of any given m-bit word may differ between the two sequences mapped from two IBI time series.
  • (5)  
    Plot the rank order of each m-bit word in the PPI symbolic sequence against its rank order in the RRI symbolic sequence. Figure 4 is a representative result of rank order comparison (the original time series were from subject #3 during resting state), where each data point on the graph represents a binary word (m = 5) with its rank on the RRI symbolic sequence plotted on the horizontal axis against that on the PPI symbolic sequence on the vertical axis. If two symbolic sequences are similar in their rank order, the data points will be located near the diagonal line (figure 4(a)), and vice versa (figure 4(b)).
  • (6)  
    Let Dr (ψ1, ψ2) denote a distance value between 0 and 1 of two symbolic sequences, sk denote an m-bit word, and L denote the number of unique m-bit words. Let R and p denote the word's rank and probability, respectively. Let F denote the weight of the word, where F is computed using Shannon's entropy and normalized with the normalization factor Z. The degree of distance between the two symbolic sequences can be defined as (Yang et al 2003, Goldberger and Peng 2005, Wang et al 2016):
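    $$D_r(\psi_1, \psi_2) = \frac{1}{L} \sum_{k=1}^{L} \left| R_1(s_k) - R_2(s_k) \right| F(s_k) \qquad (1)$$
    $$F(s_k) = \frac{1}{Z} \left[ -p_1(s_k) \log p_1(s_k) - p_2(s_k) \log p_2(s_k) \right] \qquad (2)$$
    $$Z = \sum_{k=1}^{L} \left[ -p_1(s_k) \log p_1(s_k) - p_2(s_k) \log p_2(s_k) \right] \qquad (3)$$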
    The sum is divided by the value L to keep Dr (ψ1, ψ2) in the range [0, 1]. A smaller distance value corresponds to a higher degree of similarity between PPI and RRI, and thus indicates high accuracy of PPI.
  • (7)  
    Calculate the IBS distance value D1 between PPI time series and RRI time series. Then calculate the IBS distance between randomly shuffled RRI time series and the original RRI time series, repeat it 3–5 times and get the average distance value D2. The worst case is when D1 is close to D2.
  • (8)  
    Add white noise of different power Wi to the original whole RRI time series (for example, let the SNR change from −15 dB to 75 dB in steps of 1 dB), and calculate the distance between each noisy series and the original RRI time series. Repeat this process 3–5 times and denote the average distance value as Di. The relationship between the distance index Di and the SNR is then obtained as shown in figure 5, where the blue line is a three-order Gaussian fitting curve; the SNR and the IBS distance index Di were strongly negatively correlated (Spearman rank correlation, r = −0.9251, P < 0.001).
  • (9)  
    Record the added white noise power W1 when Di is equal to or closest to D1, and thus estimate the white noise power contained in the PPI time series. For example, the IBS distance for the time series in figure 4(a) is D1 = 0.0871, and the corresponding SNR = 37.11 dB can be obtained using binary search. 'SNR = 37.11 dB' means that the ratio of signal power to noise power is about 5000:1 (10^3.711 ≈ 5140), which indicates that the PPI accuracy is high in this case.
  • (10)  
    Reject 'low-accuracy' data segments. According to the data recording conditions, duration, analytic purpose, etc, define an SNR threshold to reject the data segments whose SNR is smaller than this threshold. For example, in the Results section, SNR < 25 (i.e. IBS > 0.21) was applied for a representative analysis.
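
The distance in step (6) can be sketched from two word-frequency tables. This is an illustrative reading of the definition in the cited IBS literature with the sum divided by L as described above, not the authors' code. The function names are hypothetical, and two details are assumptions: ties in frequency are broken by word order, and a word that never occurs in one series is given probability 0 and ranked last in that series.

```python
import math


def rank_words(freq):
    """Rank 1 is the most frequent word; ties are broken by word order (assumption)."""
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}


def shannon_term(p):
    """Contribution -p*log(p) of one word; defined as 0 for p = 0."""
    return -p * math.log(p) if p > 0 else 0.0


def ibs_distance(freq1, freq2):
    """Entropy-weighted mean absolute rank difference, divided by L."""
    words = sorted(set(freq1) | set(freq2))
    p1 = {w: freq1.get(w, 0.0) for w in words}
    p2 = {w: freq2.get(w, 0.0) for w in words}
    r1, r2 = rank_words(p1), rank_words(p2)
    weight = {w: shannon_term(p1[w]) + shannon_term(p2[w]) for w in words}
    z = sum(weight.values())          # normalization factor Z
    if z == 0:
        return 0.0
    total = sum(abs(r1[w] - r2[w]) * weight[w] / z for w in words)
    return total / len(words)         # L = number of distinct words


# Occurrence frequencies for subject #4, PB-300s, m = 3 (values from table 3).
rri = {"000": 0.201, "001": 0.163, "010": 0.021, "011": 0.156,
       "100": 0.159, "101": 0.017, "110": 0.156, "111": 0.128}
ppi = {"000": 0.211, "001": 0.156, "010": 0.055, "011": 0.142,
       "100": 0.152, "101": 0.042, "110": 0.142, "111": 0.100}
d_table3 = ibs_distance(rri, ppi)
print(d_table3)
```

Because the two rankings in table 3 are identical, every rank difference is zero and the distance is 0 even though the frequencies differ, which matches the m = 3 entry for subject #4 in table 2.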

Figure 3. Schematic illustration of the mapping procedure for five-bit words (m = 5).
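
The mapping of figure 3 can be reproduced with a short sketch. The IBI values below are hypothetical and were chosen only so that the first words match those quoted in the text (10010, 00100, 01001); the function names are likewise illustrative.

```python
from collections import Counter


def symbolize(ibi):
    """1 if the interval increased relative to the previous one, else 0."""
    return [1 if b > a else 0 for a, b in zip(ibi, ibi[1:])]


def sliding_words(symbols, m):
    """All m-bit words obtained by shifting one symbol at a time."""
    return ["".join(str(s) for s in symbols[i:i + m])
            for i in range(len(symbols) - m + 1)]


ibi_ms = [800, 820, 810, 805, 830, 825, 790, 795]   # hypothetical IBI values (ms)
symbols = symbolize(ibi_ms)
word_list = sliding_words(symbols, 5)
codes = [int(w, 2) for w in word_list]               # decimal label of each word
counts = Counter(word_list)                          # occurrence count per word
print(symbols, word_list, codes)
```

Dividing each count by the total number of words gives the occurrence frequencies that are then ranked in step (4).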


Figure 4. A representative result of rank order comparison of (a) the PPI and RRI time series, (b) the randomized RRI and original RRI time series.

3.3.2. Tuning the parameter m

When the data length is short and m is too large, the changing patterns in the data only account for a small part of the 2^m patterns, which will reduce the accuracy of the distance index; when two long data segments have similar patterns and m is small, the distance index can only be roughly measured, and the accuracy assessment will not be reliable. Thus, in this study, the best value of m was chosen by comparing the IBS distance index for different values of m. According to table 1, the time durations of the data collection tasks were different (i.e. 300 s, 500 s, and 360 s), so the data lengths were different. To calculate the IBS distance index corresponding to different m values at different data lengths, we compared, as representative cases, the IBS distance index between the PPI and RRI during tasks with different data lengths: 'Paced breathing (0.17 Hz)' (PB-300s), 'Seated resting state 1' (RS-500s), and 'Stress task 1' (ST-360s). Table 2 presents the results for m = 3, 4, 5 for each participant. In the case m = 3, there are five subjects whose IBS distance values are 0, which indicates that m is too small to contain enough changing patterns (2^3 = 8), resulting in the IBS distance values being inaccurate.

Table 2. Representative results of the IBS distance index corresponding to different m at different data lengths (PB: paced breathing; RS: resting state; ST: stress task).

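Subject # | m = 3: PB-300s | m = 3: RS-500s | m = 3: ST-360s | m = 4: PB-300s | m = 4: RS-500s | m = 4: ST-360s | m = 5: PB-300s | m = 5: RS-500s | m = 5: ST-360s
1 | 0.071 | 0.250 | 0.187 | 0.014 | 0.114 | 0.057 | 0.044 | 0.098 | 0.048
2 | 0.111 | 0.225 | 0.113 | 0.091 | 0.131 | 0.142 | 0.102 | 0.153 | 0.171
3 | 0.153 | 0.273 | 0.240 | 0.154 | 0.216 | 0.252 | 0.080 | 0.289 | 0.243
4 | 0.000 | 0.020 | 0.006 | 0.058 | 0.078 | 0.057 | 0.085 | 0.110 | 0.115
5 | 0.215 | 0.199 | 0.048 | 0.099 | 0.107 | 0.024 | 0.034 | 0.123 | 0.055
6 | 0.066 | 0.150 | 0.051 | 0.022 | 0.070 | 0.114 | 0.054 | 0.106 | 0.103
7 | 0.032 | 0.343 | 0.298 | 0.118 | 0.318 | 0.276 | 0.121 | 0.287 | 0.282
8 | 0.154 | 0.370 | 0.022 | 0.167 | 0.397 | 0.073 | 0.189 | 0.354 | 0.070
9 | 0.000 | 0.221 | 0.412 | 0.021 | 0.166 | 0.295 | 0.065 | 0.195 | 0.283
10 | 0.159 | 0.215 | 0.176 | 0.140 | 0.097 | 0.051 | 0.161 | 0.085 | 0.088
12 | 0.093 | 0.211 | 0.273 | 0.136 | 0.290 | 0.140 | 0.188 | 0.211 | 0.138
13 | 0.153 | 0.205 | 0.069 | 0.170 | 0.258 | 0.161 | 0.142 | 0.287 | 0.080
14 | 0.080 | 0.075 | 0.237 | 0.046 | 0.130 | 0.106 | 0.046 | 0.085 | 0.105
15 | 0.185 | 0.188 | 0.349 | 0.071 | 0.230 | 0.330 | 0.063 | 0.186 | 0.383
16 | 0.144 | 0.206 | 0.379 | 0.176 | 0.326 | 0.396 | 0.156 | 0.240 | 0.333
17 | 0.000 | 0.176 | 0.199 | 0.087 | 0.111 | 0.144 | 0.087 | 0.102 | 0.158
18 | 0.351 | 0.239 | 0.083 | 0.228 | 0.318 | 0.108 | 0.235 | 0.259 | 0.127
19 | 0.194 | 0.102 | 0.290 | 0.136 | 0.236 | 0.344 | 0.154 | 0.205 | 0.278
21 | 0.000 | 0.229 | 0.248 | 0.078 | 0.109 | 0.243 | 0.156 | 0.126 | 0.201
23 | 0.032 | 0.175 | 0.262 | 0.046 | 0.128 | 0.207 | 0.072 | 0.135 | 0.181
24 | 0.332 | 0.340 | 0.365 | 0.205 | 0.401 | 0.358 | 0.277 | 0.382 | 0.368
25 | 0.200 | 0.200 | 0.099 | 0.121 | 0.130 | 0.052 | 0.109 | 0.075 | 0.100
26 | 0.088 | 0.149 | 0.138 | 0.044 | 0.244 | 0.114 | 0.082 | 0.254 | 0.099
27 | 0.000 | 0.304 | 0.345 | 0.133 | 0.321 | 0.302 | 0.068 | 0.253 | 0.249
28 | 0.151 | 0.030 | 0.038 | 0.094 | 0.073 | 0.041 | 0.096 | 0.136 | 0.046
29 | 0.128 | 0.000 | 0.216 | 0.100 | 0.038 | 0.259 | 0.099 | 0.066 | 0.294
30 | 0.034 | 0.057 | 0.155 | 0.080 | 0.145 | 0.204 | 0.066 | 0.153 | 0.143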

Figure 5. Spearman rank correlation (r = −0.9251, P < 0.001) between SNR and IBS distance index Di (the blue line is a three-order Gaussian fitting curve).

Taking the data segment 'PB-300s' from subject #4 (for which the IBS distance at m = 3 is 0 in table 2) as an example, both the PPI and RRI series contain 292 data points. Tables 3 and 4 display the word patterns together with the occurrence frequency and ranking of each pattern for the cases m = 3 and m = 4, respectively. In the case m = 3, although the occurrence frequencies of the eight patterns differ between the two series, their rankings are the same, resulting in an IBS distance value of 0. In the case m = 4, both the occurrence frequencies and their rankings are slightly different. Two patterns (0101 and 1011) did not appear in the RRI time series, and more patterns might be missing in the case m = 5. Considering the data recording durations (300 s, 360 s, and 500 s), we chose m = 4 in this study.
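
A back-of-the-envelope count shows why larger m leaves fewer observations per pattern for a segment of this length. The sketch below is only illustrative (the function name is hypothetical): a series of n IBI values yields n − 1 symbols and therefore n − m sliding words.

```python
def mean_words_per_pattern(n_ibi, m):
    """Average number of observed words per possible m-bit pattern."""
    n_words = (n_ibi - 1) - m + 1    # n_ibi - 1 symbols, window of m symbols
    return n_words / 2 ** m


for m in (3, 4, 5):
    print(m, round(mean_words_per_pattern(292, m), 1))
```

For the 292-point segment this gives roughly 36, 18, and 9 words per pattern for m = 3, 4, and 5, so the rarest patterns are increasingly likely to be missing as m grows.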

Table 3. Word pattern, occurrence frequency, and ranking (case m = 3).

Word pattern | RRI occurrence frequency | RRI ranking | PPI occurrence frequency | PPI ranking
000 | 0.201 | 1 | 0.211 | 1
001 | 0.163 | 2 | 0.156 | 2
010 | 0.021 | 7 | 0.055 | 7
011 | 0.156 | 4 | 0.142 | 4
100 | 0.159 | 3 | 0.152 | 3
101 | 0.017 | 8 | 0.042 | 8
110 | 0.156 | 5 | 0.142 | 5
111 | 0.128 | 6 | 0.100 | 6

Table 4. Word pattern, occurrence frequency, and ranking (case m = 4).

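Word pattern | RRI occurrence frequency | RRI ranking | PPI occurrence frequency | PPI ranking
0000 | 0.080 | 7 | 0.094 | 5
0001 | 0.122 | 3 | 0.118 | 3
0010 | 0.007 | 14 | 0.024 | 13
0011 | 0.156 | 1 | 0.132 | 1
0100 | 0.021 | 11 | 0.031 | 11
0101 | - | - | 0.024 | 14
0110 | 0.063 | 8 | 0.076 | 6
0111 | 0.094 | 5 | 0.066 | 7
1000 | 0.118 | 4 | 0.118 | 4
1001 | 0.042 | 9 | 0.035 | 9
1010 | 0.014 | 13 | 0.031 | 12
1011 | - | - | 0.010 | 16
1100 | 0.139 | 2 | 0.122 | 2
1101 | 0.017 | 12 | 0.017 | 15
1110 | 0.094 | 6 | 0.066 | 8
1111 | 0.035 | 10 | 0.035 | 10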

4. Results

4.1. Accuracy of wristband PPI time series

The IBS distance values were calculated separately between the gold-standard RRI time series and 1) shuffled RRI, 2) E4 band recorded PPI time series, and 3) Honor band 5 recorded PPI time series. The SNR of the PPI time series recorded during each task was estimated using the binary search method and the three-order Gaussian fitting curve, as shown in figure 5. The results are shown in table 5. According to figure 5 and table 5, the IBS values are negatively associated with the SNR; thus, a bigger IBS distance indicates a smaller SNR and low accuracy. The IBS distances of the PPI time series are much smaller than those of shuffled RRI that displayed the worst-case accuracy. In addition, for the PPI time series of the two wristbands, the IBS distances were in the order mental stress tasks > seated resting > paced breathing. During the mental stress tasks the participants made more hand movements, which could lead to more noise. During deep breathing, they were more focused on their breathing, and their PPI and RRI would vary with the breathing cycles; thus, the SNR could be bigger.
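
The SNR estimation can be sketched in three small pieces: converting decibels to a power ratio, adding white noise at a target SNR, and inverting a monotone distance-versus-SNR curve by binary search. Everything here is illustrative. The logistic `toy_curve` merely stands in for the fitted curve of figure 5, and computing the signal power from the mean-removed series is an assumption that the text does not confirm.

```python
import math
import random


def db_to_power_ratio(snr_db):
    """Signal-to-noise power ratio corresponding to an SNR in dB."""
    return 10 ** (snr_db / 10)


def add_white_noise(series, snr_db, rng):
    """Add zero-mean Gaussian noise scaled to the requested SNR.

    Assumption: signal power is the variance of the series (mean removed).
    """
    mean = sum(series) / len(series)
    signal_power = sum((x - mean) ** 2 for x in series) / len(series)
    noise_sd = math.sqrt(signal_power / db_to_power_ratio(snr_db))
    return [x + rng.gauss(0.0, noise_sd) for x in series]


def snr_for_distance(target, distance_at_snr, lo=-15.0, hi=75.0, tol=0.01):
    """Binary search, assuming distance_at_snr decreases monotonically with SNR."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if distance_at_snr(mid) > target:
            lo = mid    # distance still too large, so the SNR must be higher
        else:
            hi = mid
    return (lo + hi) / 2


def toy_curve(snr_db):
    """Hypothetical decreasing calibration curve, not the fit from figure 5."""
    return 0.45 / (1 + math.exp((snr_db - 20) / 8))


ratio = db_to_power_ratio(37.11)
rri = [800 + 40 * math.sin(0.3 * k) for k in range(200)]   # synthetic RRI (ms)
noisy = add_white_noise(rri, 30.0, random.Random(0))
estimate = snr_for_distance(toy_curve(30.0), toy_curve)
print(round(ratio), round(estimate, 2))
```

With a real calibration, `distance_at_snr` would return the averaged IBS distance between the noise-corrupted and original RRI series, or the value of the curve fitted to those averages.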

Table 5. The IBS distance from gold-standard RRI time series and the estimated SNR for PPI time series (data are presented in mean ± SD, PB: paced breathing; RS: resting state; ST: stress task).

Tasks | Shuffled RRI: IBS distance | PPI (E4 band): IBS distance | PPI (E4 band): SNR (dB) | PPI (Honor band 5): IBS distance | PPI (Honor band 5): SNR (dB)
PB-0.17 Hz | 0.457 ± 0.124 | 0.156 ± 0.109 | 29.66 | 0.103 ± 0.102 | 34.75
PB-0.1 Hz | 0.462 ± 0.113 | 0.128 ± 0.072 | 32.00 | 0.104 ± 0.091 | 34.41
RS 1 | 0.468 ± 0.103 | 0.175 ± 0.102 | 28.00 | 0.115 ± 0.095 | 33.39
ST 1 | 0.454 ± 0.118 | 0.188 ± 0.093 | 26.95 | 0.177 ± 0.106 | 27.97
RS 2 | 0.467 ± 0.111 | 0.175 ± 0.073 | 28.00 | 0.109 ± 0.056 | 34.00
ST 2 | 0.428 ± 0.120 | 0.192 ± 0.105 | 26.61 | 0.165 ± 0.110 | 28.98
RS 3 | 0.463 ± 0.103 | 0.171 ± 0.094 | 28.30 | 0.149 ± 0.060 | 30.34
ST 3 | 0.443 ± 0.101 | 0.181 ± 0.102 | 27.62 | 0.153 ± 0.108 | 30.00
RS 4 | 0.448 ± 0.116 | 0.155 ± 0.087 | 29.66 | 0.108 ± 0.088 | 34.07

4.2. Accuracy validation of PPI-measured HRV indices

The correlation analysis between the HRV indices separately calculated using PPI and RRI is one of the most commonly used PPG accuracy assessment methods, and is thus widely applied to validate the reliability of IBI time series-based psychophysiological measures. Hence, an HRV index correlation analysis was performed in this study to evaluate the accuracy of PPI-calculated HRV indices.

The Pearson correlation was analyzed between the same HRV indices of the same task. The HRV indices applied in this study included meanHR (mean heart rate), SDNN (standard deviation of IBI series), RMSSD (root mean square of successive IBI differences), pNN50 (percentage of successive IBI that differ by more than 50 ms), SD1/SD2 (ratio of Poincaré plot standard deviation perpendicular to the line of identity to standard deviation along the line of identity), LFP (percentage of the low-frequency band (0.04–0.15 Hz) power), HFP (percentage of the high-frequency band (0.15–0.4 Hz) power), LF/HF (ratio of LFP to HFP).
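
The time-domain indices in this list follow directly from their definitions, as the sketch below shows. It is not the authors' pipeline: the function name and sample values are hypothetical, mean HR is taken as 60 000 divided by the mean IBI in ms (an assumption, since the text does not say how it was averaged), and the frequency-domain and Poincaré indices are left out.

```python
import math
import statistics


def hrv_time_domain(ibi_ms):
    """Mean HR, SDNN, RMSSD and pNN50 from an IBI series in milliseconds."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return {
        # Assumption: mean HR derived from the mean IBI, in beats per minute.
        "meanHR": 60000.0 / statistics.fmean(ibi_ms),
        # Sample standard deviation of the IBI series.
        "SDNN": statistics.stdev(ibi_ms),
        # Root mean square of successive IBI differences.
        "RMSSD": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        # Percentage of successive differences larger than 50 ms.
        "pNN50": 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs),
    }


ibi_ms = [800, 860, 800, 860, 800]   # hypothetical values for illustration
indices = hrv_time_domain(ibi_ms)
print({name: round(value, 2) for name, value in indices.items()})
```

Applying the same function to the PPI and RRI series of one segment gives the paired values that enter the correlation analysis.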

The color map in figure 6 presents the Pearson correlation coefficients between PPI-derived and RRI-derived HRV indices before and after rejecting the low-SNR segments for each HRV index under nine data recording conditions. Representatively, the 'low-accuracy' segments are defined as SNR < 25 (IBS > 0.21) in this section, and other SNR thresholds are defined case by case. The HRV correlation coefficients in table 6 demonstrate that the PPI accuracy was not satisfactory compared with the gold-standard RRI time series for both the E4 band and Honor band 5, which may decrease the accuracy of PPI-based mental state classification. The results in figure 6 and table 6 indicate that the most direct method to improve the reliability is to reject PPI data segments with low SNR. After rejection, the averaged correlation coefficients for all HRV indices increased from 0.542 ± 0.235 to 0.922 ± 0.120 for E4 and from 0.596 ± 0.227 to 0.859 ± 0.145 for Honor 5. In this case, the accuracy of PPI-based mental state classification would be close to the RRI.
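
The effect of rejecting low-SNR segments on the correlation can be illustrated with made-up numbers. The segment values below are hypothetical and are not data from the study; only the 25 dB threshold comes from the text, and the helper names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def reject_low_snr(segments, threshold_db=25.0):
    """Keep only the segments whose estimated SNR reaches the threshold."""
    return [s for s in segments if s["snr_db"] >= threshold_db]


# Hypothetical segments: estimated SNR plus one HRV index from PPI and from RRI.
segments = [
    {"snr_db": 34.0, "ppi": 42.0, "rri": 41.0},
    {"snr_db": 30.0, "ppi": 55.0, "rri": 56.0},
    {"snr_db": 28.0, "ppi": 33.0, "rri": 32.0},
    {"snr_db": 18.0, "ppi": 70.0, "rri": 38.0},
    {"snr_db": 31.0, "ppi": 61.0, "rri": 63.0},
    {"snr_db": 12.0, "ppi": 25.0, "rri": 50.0},
]
kept = reject_low_snr(segments)
r_before = pearson([s["ppi"] for s in segments], [s["rri"] for s in segments])
r_after = pearson([s["ppi"] for s in kept], [s["rri"] for s in kept])
print(len(kept), round(r_before, 3), round(r_after, 3))
```

The two noisy segments dominate the disagreement, so removing them raises the coefficient sharply, at the cost of discarding part of the recording.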

Figure 6. Pearson correlation coefficients between PPI- and RRI-derived HRV indices. (a) E4 band and (b) Honor band 5 before rejecting low-SNR segments; (c) E4 band and (d) Honor band 5 after rejecting low-SNR segments. The X-axis denotes the eight HRV indices, and the Y-axis denotes the nine data recording conditions. (PB: paced breathing; RS: resting state; ST: stress task).

Table 6. Comparison of the averaged Pearson correlation coefficients between PPI- and RRI-derived HRV indices before and after rejecting low-SNR segments.

| HRV index | E4 band, before | E4 band, after | Honor band 5, before | Honor band 5, after |
| --- | --- | --- | --- | --- |
| meanHR | 0.927 ± 0.075 | 0.976 ± 0.019 | 0.810 ± 0.185 | 0.979 ± 0.016 |
| SDNN | 0.424 ± 0.154 | 0.999 ± 0.001 | 0.568 ± 0.162 | 0.984 ± 0.020 |
| RMSSD | 0.366 ± 0.161 | 0.899 ± 0.130 | 0.402 ± 0.169 | 0.764 ± 0.174 |
| pNN50 | 0.519 ± 0.122 | 0.830 ± 0.147 | 0.766 ± 0.138 | 0.831 ± 0.144 |
| LFP | 0.477 ± 0.191 | 0.906 ± 0.205 | 0.543 ± 0.204 | 0.763 ± 0.228 |
| HFP | 0.789 ± 0.122 | 0.898 ± 0.075 | 0.763 ± 0.182 | 0.810 ± 0.153 |
| LF/HF | 0.467 ± 0.155 | 0.899 ± 0.130 | 0.514 ± 0.223 | 0.764 ± 0.174 |
| SD1/SD2 | 0.363 ± 0.114 | 0.971 ± 0.016 | 0.400 ± 0.120 | 0.958 ± 0.032 |
| All indices (average) | 0.542 ± 0.235 | 0.922 ± 0.120 | 0.596 ± 0.227 | 0.859 ± 0.145 |

As shown in figure 6 and table 6, the accuracies of most HRV indices improved greatly after rejecting low-SNR segments. The exception was the mean HR obtained using E4, whose Pearson correlation coefficient with the gold-standard ECG device was already high before rejection and changed little afterwards (0.927 ± 0.075 vs 0.976 ± 0.019), indicating that E4 gave more reliable mean HR values than Honor 5 before rejection (0.927 ± 0.075 vs 0.810 ± 0.185). However, the original (before rejection) Pearson correlation coefficients of the other HRV indices, except HFP, were higher for Honor 5 than for E4.
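The before/after comparison can be sketched as follows. This is an illustrative outline under stated assumptions, not the analysis code of this study: the function names `pearson_r` and `correlation_before_after` are hypothetical, each list holds one value per data segment, and the IBS distance of every segment is assumed to have been computed already. Segments with IBS distance above 0.21 (SNR < 25) are rejected, following the threshold stated above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def correlation_before_after(ppi_index, rri_index, ibs_distance, ibs_threshold=0.21):
    """Correlate a PPI-derived index with its RRI-derived counterpart,
    first on all segments, then only on segments at or below the IBS threshold."""
    before = pearson_r(ppi_index, rri_index)
    kept = [i for i, d in enumerate(ibs_distance) if d <= ibs_threshold]
    after = pearson_r([ppi_index[i] for i in kept], [rri_index[i] for i in kept])
    return before, after, len(kept)


# Made-up values for illustration only; they are not measurements from the study.
rri_sdnn = [40, 50, 60, 70, 80]
ppi_sdnn = [41, 49, 90, 71, 30]
ibs = [0.10, 0.15, 0.30, 0.12, 0.40]
before, after, n_kept = correlation_before_after(ppi_sdnn, rri_sdnn, ibs)
```

Rejection trades data volume for agreement, so the number of retained segments is worth reporting alongside the correlation coefficient.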

4.3. Accuracy of PPI-HRV-based mental state classification

The HRV indices calculated from the PPI time series were used for mental state classification. A total of 242 data segments (27 subjects × 9 conditions = 243, minus the ninth condition of subject #9, which was not completed) were included for each wristband and were divided into four classes (four mental stress states): paced breathing–0.17 Hz, paced breathing–0.1 Hz, resting state, and stress state.

The HRV indices (i.e. SDNN, LFP, HFP, LF/HF, and SD1/SD2) were fed into an SVM model for mental state classification. In this study, the SVM classifier was implemented using a library for SVMs (LIBSVM) (Chang and Lin 2011) with the radial basis function kernel. Five-fold cross-validation was performed ten times to evaluate the model's predictive performance. In each fold, the data segments were divided into five parts; one part was used as the testing set and the remaining four parts were used as the training set. To avoid information leakage, all segments from a given subject were assigned to either the training set or the testing set, never to both. Before training, the features were normalized using the function scaleforSVM, which processes the training and testing sets by mapping row minimum and maximum values to [−1, 1]. The optimal parameters C and gamma were then found using the function SVMcgForClass.
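The subject-wise split and the [−1, 1] scaling described above can be sketched as follows. This is a minimal illustration, not the LIBSVM, scaleforSVM, or SVMcgForClass code used in this study. The function names `subject_wise_folds`, `fit_minmax`, and `apply_minmax` are hypothetical, features are stored one per column, and the scaling ranges are fitted on the training set only, which is an assumption of this sketch.

```python
import random


def subject_wise_folds(subject_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no subject appears on both sides."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    # Assign whole subjects, not individual segments, to folds.
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    for k in range(n_folds):
        test = [i for i, s in enumerate(subject_ids) if fold_of[s] == k]
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != k]
        yield train, test


def fit_minmax(rows):
    """Return (min, max) for every feature column of the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_minmax(rows, ranges, lo=-1.0, hi=1.0):
    """Map each feature linearly so that its fitted min/max land on lo/hi."""
    scaled_rows = []
    for row in rows:
        scaled = []
        for value, (mn, mx) in zip(row, ranges):
            if mx == mn:
                scaled.append(0.0)  # constant feature: place it mid-range
            else:
                scaled.append(lo + (hi - lo) * (value - mn) / (mx - mn))
        scaled_rows.append(scaled)
    return scaled_rows


# Illustration: 27 subjects with 9 segments each, as in the segment count above.
segment_subjects = [s for s in range(27) for _ in range(9)]
folds = list(subject_wise_folds(segment_subjects, n_folds=5, seed=0))
```

Repeating the procedure with different values of `seed` gives repeated cross-validation; the classifier itself would be trained on the scaled training rows of each fold.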

The averaged precision, sensitivity (recall), specificity, and overall accuracy were used to judge the performance of four-class mental stress state classification. The results are presented in table 7. To improve the reliability of classification, the PPI data segments with IBS distance > 0.21 (m = 4, SNR = 25) were rejected, which caused the precision, sensitivity, and specificity to increase from 76.18%, 75.44%, and 92.73% to 84.29%, 76.58%, and 93.37% for E4; and increase from 77.45%, 77.50%, and 93.00% to 79.99%, 79.75%, and 94.28% for Honor 5. The overall accuracy of the four-class mental stress state classification was increased from 77.3% to 81.9% for E4 and from 79.3% to 83.3% for Honor 5.

Table 7. Precision, sensitivity, specificity, and overall accuracy of four-class mental stress state classification before and after rejecting low-SNR segments.

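| Device | State | Before: Pre | Before: Sen | Before: Spe | Before: Acc (4-class) | After: Pre | After: Sen | After: Spe | After: Acc (4-class) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E4 band | State 1 | 77.8% | 71.8% | 96.1% | | 100.0% | 58.3% | 100.0% | |
| E4 band | State 2 | 67.4% | 67.4% | 94.4% | | 75.0% | 69.2% | 95.7% | |
| E4 band | State 3 | 83.5% | 85.5% | 91.2% | | 82.9% | 93.6% | 88.5% | |
| E4 band | State 4 | 76.0% | 77.0% | 89.3% | | 79.3% | 85.2% | 89.3% | |
| E4 band | Average | 76.2% | 75.4% | 92.7% | 77.3% | 84.3% | 76.6% | 93.4% | 81.9% |
| Honor band 5 | State 1 | 79.41% | 69.23% | 96.55% | | 84.21% | 72.73% | 98.03% | |
| Honor band 5 | State 2 | 63.04% | 76.32% | 91.67% | | 61.54% | 72.73% | 93.42% | |
| Honor band 5 | State 3 | 87.36% | 84.44% | 92.76% | | 92.86% | 87.84% | 95.00% | |
| Honor band 5 | State 4 | 80.00% | 80.00% | 91.02% | | 81.36% | 85.71% | 90.68% | |
| Honor band 5 | Average | 77.45% | 77.50% | 93.00% | 79.3% | 79.99% | 79.75% | 94.28% | 83.3% |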

(State 1: paced breathing–0.17 Hz; State 2: paced breathing–0.1 Hz; State 3: seated resting; State 4: stress state; Pre: precision; Sen: sensitivity; Spe: specificity; Acc: accuracy)
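For reference, the per-class precision, sensitivity, and specificity reported in table 7 follow the usual one-vs-rest definitions, which can be sketched from a confusion matrix as below. The function name `per_class_metrics` and the convention that rows are true classes and columns are predicted classes are assumptions of this sketch; the code is illustrative and was not used to produce the table.

```python
def per_class_metrics(confusion):
    """Compute one-vs-rest metrics from a square confusion matrix.

    confusion[i][j] is the number of segments of true class i predicted as class j.
    Returns (list of per-class dicts, overall accuracy).
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    nan = float("nan")
    metrics = []
    for c in range(n_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                               # class c missed
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp  # others called c
        tn = total - tp - fn - fp
        metrics.append({
            "precision": tp / (tp + fp) if tp + fp else nan,
            "sensitivity": tp / (tp + fn) if tp + fn else nan,
            "specificity": tn / (tn + fp) if tn + fp else nan,
        })
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total
    return metrics, accuracy


# Made-up two-class matrix for illustration only.
per_class, overall = per_class_metrics([[8, 2], [1, 9]])
```

Averaging the per-class values over the four states gives the 'Average' rows, while the overall accuracy is computed once from the diagonal of the matrix.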

5. Discussions and conclusions

5.1. Main findings

The IBI time series and its variability are playing an increasingly significant role in both clinical and daily healthcare services. PPI-based psychophysiological applications have been implemented in consumer-grade smart wristbands; however, the reliability of this measurement has not been fully validated. Beyond the mean HR, the underlying dynamical structure of the PPI time series is more strongly affected by different mental stress states. The most accurate assessment would be a comparison with standard ECG-calculated IBI signals; however, very few studies have assessed wristband-recorded IBI accuracy in terms of absolute agreement with gold-standard ECG recording systems, because the timelines of the two systems are difficult to align precisely. Hence, a quantitative assessment is difficult to conduct. To solve this problem, this study proposed a novel IBS distance indicator for quantitatively assessing the accuracy of PPI time series according to the information in their underlying dynamical structures, and the SNRs of PPI time series measured using PPG wristbands of different grades were obtained. Compared with the gold-standard ECG-measured HRV, the accuracies of PPG-measured HRV indices for both E4 and Honor 5 were not satisfactory, i.e. the averaged HRV correlation coefficients were no more than 0.6; however, the figure reached around 0.9 when low-SNR PPI segments were rejected, indicating that the PPG-measured HRV indices were strongly influenced by noise and that the accuracy could be improved via appropriate de-noising methods. The PPI-HRV-based mental state classification accuracies increased by about 4 percentage points when the low-accuracy data segments were rejected using a designated SNR threshold.

A four-class mental stress state induction experiment, in which subjects wore two popular PPG wristbands of different grades (E4 and Honor 5) and an ECG device, was conducted with the aim of validating the accuracy of PPI time series by comparing the HRV correlations (between PPI-HRV and RRI-HRV) and the mental state classification accuracy before and after applying the proposed method to reject low-SNR segments. The IBS distance and SNR (table 5), HRV correlations (figure 6 and table 6), and four-class mental state classification (table 7) all indicated that the accuracy of PPI time series could be validated and improved using the proposed method. The correlations between HRV indices were greatly improved (table 6), and the four-class mental state classification accuracy increased by about 4 percentage points, indicating that the PPI noise ratio has a great influence on the accuracy of HRV indices. Pursuing the highest mental state classification accuracy was not the goal of this study, and the accuracy could likely be increased by applying more sophisticated machine learning models in the future.

5.2. The accuracy of PPI derived from consumer-grade PPG wristbands

A few preliminary studies have assessed the accuracy of the E4 wristband. For example, one study focused on stress and emotion recognition (Ragot et al 2018). Two other studies focused on the PPG signals and suggested a satisfactory data quality (McCarthy et al 2016) and an acceptable measurement error in mean HR and some HRV measures (Pietilä et al 2017). A few studies have assessed the accuracy of the E4 in measuring mean HR and HRV indices and evaluated how accuracy varies as a function of the measurement condition (Menghini et al 2019, Schuurmans et al 2020, Stuyck et al 2022). E4 was also used in a multimodal assessment of adult attachment (Parra et al 2017) and in urban stress research (Chrisinger and King 2018).

Honor 5 is one of the most popular consumer-grade smart bands commercialized worldwide. To our knowledge, no published study has reported its data quality. Interestingly, based on the IBS distance and the corresponding SNR results in table 5, and the original correlation coefficients between PPI- and RRI-derived HRV indices in table 6, the original PPI accuracy of Honor 5 appears to be slightly better than that of E4. One possible reason is that the sampling frequency of the PPG signals was 100 Hz for Honor 5 and 64 Hz for E4; a higher sampling frequency could generate more accurate PPI time series (higher SNR), which can in turn influence the accuracy of HRV indices.

The mean (average) HR is the most conventionally used function of PPG wristbands, and is the most stable and noise-resistant HRV index. The mean HR is commonly used as the key measure to validate the data quality of PPG wristbands. Although a very high correlation (> 0.9) was found between the mean HR of the E4 band and the ECG device, the other HRV indices and the mental state classification accuracy did not support the same conclusion. Instead, possibly owing to the higher sampling frequency of Honor 5, its HRV indices (except the mean HR and HFP, table 6) as well as its mental state classification accuracy (table 7) appeared to be more reliable than those of E4 in this study.

The poor data quality of PPI derived from consumer-grade PPG wristbands can be influenced by various factors. Researchers should carefully consider and account for the following factors to improve the accuracy and reliability of their findings. (1) Motion artifacts: wrist movements, such as hand gestures or physical activities, can introduce motion artifacts, which may lead to inconsistencies in the signal and impact the reliability of PPI monitoring. (2) Skin contact and fit: inadequate contact between the wristband sensors and the skin or improper fit of the device can result in signal distortions. (3) Skin conditions: dryness, sweat, or excessive hair on the wrist may contribute to noise in the IBI data. (4) Device calibration: inaccurate calibration of the wristband sensors may result in systematic errors in PPI measurements, affecting the overall quality of the data. (5) Battery and hardware issues: low battery levels or hardware malfunctions in the wristband device can impact the quality of signal acquisition. (6) Environmental interference: external factors such as electromagnetic interference or ambient light can affect the performance of optical sensors in wristbands. (7) Signal processing algorithms: the algorithms used for processing the raw sensor data play a significant role. Inaccuracies in signal processing methods may lead to errors in IBI calculations. (8) Device-specific limitations: different wristband models may have specific limitations in terms of sensor technology and data-processing capabilities. Understanding and addressing these limitations is important for mitigating signal quality issues.

5.3. Study limitations and future work

The primary aim of this study was to objectively quantify inter-beat intervals acquired from consumer-grade PPG wristbands using an ECG-aided IBS approach. The information similarity was defined between PPI and RRI, and the PPG band and ECG device were concurrently worn on the same participant. The proposed approach may inadvertently capture certain gender-related patterns or differences. We acknowledge the significance of sex and gender considerations in research that involves human subjects. There might be broader societal or physiological variations that could impact the interpretation of our findings beyond the scope of our study. We hope that our findings contribute to the ongoing discussions in this field and encourage further investigations into the intricate impacts of gender on wearable health technologies.

In this study, we initially assessed the reliability of consumer-grade wristbands in home healthcare management using data from 30 healthy participants. Further research with a larger cohort with diverse populations and specific clinical conditions is necessary to validate the generalizability of our findings. For example, future work could expand to encompass populations affected by heart conditions to rigorously assess the method's performance under diverse cardiac pathologies.

The reliability of SNR estimation using the binary search method and the three-order Gaussian fitting curve depicted in figure 5 could be potentially influenced by the variation in sample space. Binary search is a commonly used method for finding values within a curve, but other numerical techniques could also be considered depending on the context, such as optimization algorithms, curve interpolation or regression, probabilistic estimation, and so on. The choice of method for finding values within a fitted Gaussian curve depends on various factors, including the computational complexity, required accuracy level, and the specific requirements of the application. We are open to exploring alternative methods or complementary approaches for SNR estimation that could potentially offer increased reliability in diverse sample space conditions. It might be worthwhile in the future to compare and benchmark different methods in terms of accuracy and computational efficiency to determine the most suitable approach for specific applications.
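To make the binary search step concrete, the following sketch inverts a monotonically decreasing curve that maps SNR to IBS distance. The curve `toy_ibs_curve` is a hypothetical stand-in chosen only for illustration; it is not the fitted curve of figure 5, and its parameters carry no physical meaning. The function name `invert_monotone` and the search bounds are likewise assumptions.

```python
import math


def toy_ibs_curve(snr):
    """Hypothetical decreasing SNR-to-IBS curve; NOT the fitted curve of figure 5."""
    return 0.6 * math.exp(-((snr / 30.0) ** 2))


def invert_monotone(curve, target, lo, hi, tol=1e-6, max_iter=200):
    """Binary search for x in [lo, hi] such that curve(x) is close to target.

    Assumes curve is monotonically decreasing on [lo, hi].
    """
    if not (curve(hi) <= target <= curve(lo)):
        raise ValueError("target is outside the range of the curve on [lo, hi]")
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if curve(mid) > target:
            lo = mid  # curve still too high: the solution lies at larger x
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0


# Estimate the SNR that corresponds to a given IBS distance on the toy curve.
estimated_snr = invert_monotone(toy_ibs_curve, target=0.3, lo=0.0, hi=100.0)
```

Because the search relies only on monotonicity, the same routine could be reused with any other fitted curve that decreases over the search interval.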

5.4. Main conclusions

This study proposes a methodology that involves concurrently wearing ECG and PPG devices to assess the reliability of the PPI time series by examining the information similarity between PPI and RRI time series before practical applications. Specifically, before the release of a new PPG device or smart band function, rigorous testing and verification of its reliability is crucial. This methodology can find application in various scenarios, including the selection of wristband manufacturers and models before large-scale studies based on PPG wristbands, verification before the development of health management applications utilizing PPG wristbands, the selection of sampling frequency for PPG wristband hardware design, and pre-release testing of new features by wristband companies based on PPG technology.

The proposed IBS-based approach emphasizes the dynamically changing structures of PPI time series, and thus does not require the timelines of the measured signals and the ECG signals to be aligned. More significantly, this method can be used for accuracy validation of any IBI time series measured by non-standard ECG devices, including PPG, BCG, millimeter-wave radar, Wi-Fi, heart sounds, and so on. Verification using public PPG or BCG databases will be carried out in a follow-up paper.

Acknowledgments

The authors wish to thank the participants involved in this study. This work was supported by the National Natural Science Foundation of China under Grants 82274631, 62077013 and 61807007, and in part by the National Key Research and Development Program of China under Grant 2018YFC2001100.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary information files). Data will be available from 1 December 2024.

Conflict of interest statement

The authors declare no conflicts of interest.
