Photoplethysmography based atrial fibrillation detection: a continually growing field

Objective. Atrial fibrillation (AF) is a prevalent cardiac arrhythmia associated with significant health ramifications, including an elevated risk of ischemic stroke, heart disease, and mortality. Photoplethysmography (PPG) has emerged as a promising technology for continuous AF monitoring owing to its cost-effectiveness and widespread integration into wearable devices. Our team previously conducted an exhaustive review of PPG-based AF detection studies published before June 2019. However, since then, more advanced technologies have emerged in this field. Approach. This paper offers a comprehensive review of the latest advancements in PPG-based AF detection, utilizing digital health and artificial intelligence (AI) solutions, within the timeframe spanning from July 2019 to December 2022. Through extensive exploration of scientific databases, we have identified 57 pertinent studies. Significance. Our comprehensive review encompasses an in-depth assessment of the statistical methodologies, traditional machine learning techniques, and deep learning approaches employed in these studies. In addition, we address the challenges encountered in the domain of PPG-based AF detection. Furthermore, we maintain a dedicated website to curate the latest research in this area, which is updated regularly.


Introduction
AF is a highly prevalent cardiac arrhythmia that affects approximately 1%-2% of the general population, and its prevalence is expected to continue to rise worldwide due to population aging (Schnabel et al 2015, Lane et al 2017, Vinter et al 2020). Individuals with AF face a substantially heightened risk of cerebral and cardiovascular complications. Specifically, they have a fivefold higher risk of ischemic stroke (Tsao et al 2022) and an increased risk of ischemic heart disease, sudden cardiac death, and heart failure (Odutayo et al 2016). In general, people with AF have a fourfold increased risk of mortality compared to the general population (Lee et al 2018). The current detection of AF relies heavily on routine medical examinations; however, this approach may overlook paroxysmal AF, in which episodes occur sporadically and self-terminate within 7 days. Additionally, a significant portion of AF patients, estimated at 25%-35%, remain asymptomatic (Rienstra et al 2012), which further reduces their likelihood of seeking care. These factors collectively contribute to delays in the identification of AF cases. Consequently, there has been a surge in efforts from both industry and academia to develop technologies that enable reliable and continuous detection of AF. These advancements aim to transform the screening process for early detection of AF, particularly by identifying asymptomatic cases and potentially altering the course of treatment, although further research is needed to fully understand their impact on patient outcomes (Boriani et al 2014, Chen et al 2018).
To enable consistent and long-term monitoring of atrial fibrillation (AF), a solution needs to be nonintrusive, cost-effective, and convenient, reducing operational complexity and encouraging user compliance.
To this end, photoplethysmography (PPG) has emerged as a preferred technology: it is used in over 71% of wearable devices, given its capacity to capture heart rhythm dynamics (Charlton et al 2023). The physiological foundation of PPG for AF detection lies in the fact that irregular heartbeats induce variations in cardiac output, leading to fluctuations in peripheral blood volume. This results in irregular pulse-to-pulse intervals and altered morphologies in PPG during AF episodes. Exploiting this physiological basis, wearables equipped with PPG sensors and specialized software offer great promise for personalized self-monitoring of AF, enabling individuals to receive timely alerts for potential AF episodes. However, the success of this approach hinges on the accuracy of PPG-based AF detection algorithms. Suboptimal algorithms can easily lead to a surge in false positives, thereby straining healthcare resources through unnecessary or inappropriate medical consultations.
Therefore, developing precise and sensitive PPG-based algorithms for AF detection is of tremendous importance. These algorithms should aim to minimize false detections and optimize the utilization of healthcare resources, ensuring that appropriate clinical guidance is provided to individuals experiencing actual AF episodes. A prior review conducted by Pereira et al provided a comprehensive summary of research on PPG-based AF detection using statistical analysis (STAT), machine learning (ML) and deep learning (DL) approaches up until July 2019 (Pereira et al 2020). The review concluded that PPG holds promise as a viable alternative to ECG for AF detection. However, it also highlighted challenges such as the presence of arrhythmias other than AF, motion artifacts in PPG signals from wearable devices, and labor-intensive data annotation processes, among others.
Given the rapid technological advancements in wearable technology and methodological development in artificial intelligence (AI), there is a well-justified need for an updated review of AF detection using PPG. Building upon the previous work by Pereira et al, this paper aims to fill the gap by providing a comprehensive review of the latest developments in utilizing PPG-based digital health and AI solutions for AF detection in both inpatient and outpatient settings from July 2019 to December 2022. The articles included in this review are classified by the three methodological categories established by Pereira et al (2020), namely, STAT, ML, and DL, to facilitate the tracking of evolving trends in the field. In addition to conducting a thorough analysis of studies on PPG-based AF detection, this study has established an online knowledge database (GitHub). This database encompasses all studies reviewed up to December 2022, including those from our work and Pereira's, along with direct links to the respective papers. Committed to keeping the database current, our team will update it semiannually. Through the creation of this resource, we aim to foster community collaboration and accelerate the development of effective solutions to this critical clinical challenge.

Search criteria
The research team used the SCOPUS, IEEE Xplore, PubMed, Web of Science, and Google Scholar databases to gather appropriate documents for the review. All articles selected were published between July 2019 and December 2022, and reviews were excluded in favor of data-based research studies. The databases function similarly, but not uniformly, so queries needed to be adjusted for each of them. Filters were used in all databases to restrict the date of publication. Table 1 describes the exact search strings used in different databases for initial document screening. After the documents were retrieved (57 studies in total), they were further evaluated for appropriateness for review by two researchers (RX and CD). For the subsequent analysis, only studies focused on developing detection algorithms using PPG for AF detection were included. Review papers, perspectives, commentaries, clinical trials, and meta-analyses were excluded from further analysis. Based on these search criteria, a total of 57 studies were included in the review, comprising 17 STAT, 18 ML, and 22 DL studies. To categorize studies into STAT, ML and DL, the primary classifier adopted in each study was taken as the determining factor. Thus, in mixed methods where, for instance, features traditionally belonging to ML are fed into a DL classifier, the overall assigned category is DL.

Publication trends in the past decade
Figure 1 depicts the trends in the cumulative number of publications in the three method categories over the past 10 years, between January 2013 and December 2022. To maintain consistency, the same screening criteria were applied to identify relevant studies from before the review period of the current study. The figure reveals an accelerated rate of growth in the number of publications in all three categories, indicating the increasing effort devoted to developing PPG-based AF detection algorithms. It is worth noting that studies utilizing DL for AF detection emerged in 2017 and expanded rapidly, outpacing the other two categories. In 2022, the cumulative number of publications using DL for AF detection exceeded that of each of the other two categories for the first time.

Review of recent studies on PPG-based AF detection
Tables 2-4 were adapted and extended based on previous work from Pereira et al (2020). These tables summarize the compiled studies for PPG-based AF detection categorized by three different signal processing methods. It is important to note that within the 57 studies reviewed, some studies employed more than one signal processing approach, leading to their inclusion in multiple tables, allowing for a comprehensive understanding of the various methodologies. More information on data train/test splitting and excluded data due to noisy signals or motion artifacts can be found in tables A1-A7 in appendix A for STAT, ML and DL studies, respectively.
When referring to the measurement devices, we classified them into several categories, namely smartwatch, wristband, fingertip sensor, smart ring, armband, and smartphone. This categorization is based on the implicit location for PPG sensing and the primary utility of the device. For instance, smartwatches and wristbands measure PPG signals at the wrist, while fingertip sensors, smart rings, and armbands measure PPG signals at the fingertip, proximal phalange (i.e. the base of the finger), and various locations within the arm or forearm, respectively. It is important to note that while both smartwatches and wristbands integrate reflective-type PPG sensors at the wrist in all studies, we distinguished between them based on their primary function. Smartwatches, such as the Apple Watch and Samsung Simband, are designed for general-purpose use and may include features like a screen and notification management utilities. On the other hand, wristbands, such as the Empatica E4, are screen-less devices primarily intended for monitoring physiological signals. Additionally, studies using smartphones typically perform PPG measurements at the finger using reflective-type PPG sensors, with the camera and flashlight serving as the photosensitive and photoemitter components, respectively. For studies in which PPG signals were experimentally acquired, the vast majority used a reflective-type PPG sensor, which includes form factors such as the smartwatch, wristband and armband. For studies using PPG signals collected through 'fingertip sensors', the working mode (i.e. reflective versus transmissive mode) was not disclosed. Regarding the wavelength of the PPG sensors, this information was not disclosed in more than half of the studies (approximately 57.6%). Moreover, approximately 30.5% and 13.6% of the studies used one or more devices using green and red/infra-red light, respectively, making the former wavelength the most common one among studies specifying the device's wavelength. More information can be found in table B1 in appendix B.

Updates on PPG-based AF detection using statistical analysis approaches
Studies on PPG-based AF detection employing statistical analysis approaches are summarized in table 2. In the interest of maintaining uniformity and enabling systematic evaluation of the advancement in this field in recent years, our study deliberately replicates the table format of tables 1-3 from Pereira et al (2020) in our tables 2-4. The table provides an overview of these studies in chronological order, including patient cohorts, data characteristics, employed features and methods, care settings (inpatient versus outpatient), and the resultant performance outcomes. It shows that the statistical analysis approach mainly relies on threshold-based rules applied to a selected set of features for AF detection. Under this umbrella, the most frequently employed features for AF detection include the RR interval from the ECG and the inter-beat interval (IBI) from PPG (Kabutoya ...). Consequently, the extracted features undergo analysis in terms of their histograms, both with and without the presence of AF and other cardiac rhythms. This analysis assists in determining optimal thresholds that effectively differentiate the various rhythm classes. Once these thresholds are established, they can be applied to the same features extracted from PPG signals.
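The threshold-based workflow described above can be sketched as follows. This is a minimal illustration rather than the method of any reviewed study: the irregularity feature (coefficient of variation of the IBIs), the toy feature values, the use of Youden's J statistic to pick the cut-off, and all function names are assumptions made for the example.

```python
from statistics import mean, pstdev


def ibi_cv(ibis_ms):
    """Coefficient of variation of inter-beat intervals (illustrative irregularity feature)."""
    return pstdev(ibis_ms) / mean(ibis_ms)


def best_threshold(af_values, non_af_values):
    """Return the cut-off maximizing sensitivity + specificity - 1 (Youden's J)."""
    best_t, best_j = None, -1.0
    for t in sorted(set(af_values) | set(non_af_values)):
        sens = sum(v >= t for v in af_values) / len(af_values)
        spec = sum(v < t for v in non_af_values) / len(non_af_values)
        if sens + spec - 1 > best_j:
            best_t, best_j = t, sens + spec - 1
    return best_t


def classify(ibis_ms, threshold):
    """Apply the learned threshold to a new sequence of IBIs."""
    return "AF" if ibi_cv(ibis_ms) >= threshold else "non-AF"


# Hypothetical feature values taken from annotated reference segments.
af_cv = [0.18, 0.22, 0.25, 0.30]
nsr_cv = [0.02, 0.03, 0.05, 0.08]
threshold = best_threshold(af_cv, nsr_cv)

regular = [800, 810, 795, 805, 800, 790]      # IBIs in ms, steady rhythm
irregular = [620, 910, 540, 1020, 700, 480]   # IBIs in ms, erratic rhythm
print(threshold, classify(regular, threshold), classify(irregular, threshold))
```

In practice the histograms of each feature would come from much larger annotated cohorts, and the operating point would be chosen to suit the intended clinical use rather than a single summary statistic.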
Furthermore, the utilization of identical feature sets with alternative statistical approaches, such as logistic regression, enhances the versatility and comprehensiveness of AF detection studies. By applying logistic regression, researchers can establish a mathematical model that estimates the probability of AF presence based on the input features. The logistic function, also known as the sigmoid function, is employed to transform the output into a range between 0 and 1. This transformed probability serves as an indicator of the likelihood of AF compared to non-AF cases. The advantage of logistic regression lies in its ability to provide a quantitative measure of the probability, allowing for a nuanced understanding of the classification outcome. Also, as reported in table 2, studies incorporating larger patient cohorts tend to use logistic regression (Eerikäinen et al 2019, Avram et al 2021, Han et al 2022) rather than rule-based models. This observation aligns with the trends identified in a previous review study (Pereira et al 2020), further reinforcing the preference for logistic regression in cases involving a higher number of patients.
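As a small illustration of how the logistic function maps a weighted sum of features to a probability, consider the sketch below. The two features, the coefficients, and the bias are hypothetical values chosen for the example; in a real study they would be estimated from annotated data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def af_probability(features, weights, bias):
    """Estimated probability of AF for one segment under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for [IBI coefficient of variation, mean heart rate in bpm].
weights = [25.0, 0.02]
bias = -5.0

p_regular = af_probability([0.02, 65], weights, bias)
p_irregular = af_probability([0.27, 95], weights, bias)
print(round(p_regular, 3), round(p_irregular, 3))
```

Unlike a hard threshold rule, the output can be compared against any decision threshold, which lets the trade-off between sensitivity and specificity be tuned after the model is fitted.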
Compared to the previous review, we observe a rising number of studies using the statistical analysis approach (4.25 studies/year between 2019 and 2022 versus 2 studies/year between 2013 and 2019), which aligns with the rising number of AF detection studies of all types in recent years. It can also be observed that more studies focus on outpatient populations, which might be attributed to the rapid advancement of wearable technology in recent years.

Updates on PPG-based AF detection using machine learning approaches
Table 3 presents a chronological summary of AF detection studies based on machine learning approaches in the last four years. Machine learning has demonstrated promising results in the detection of AF in low-sample settings. The application of ML techniques requires domain expertise for feature engineering to extract features that effectively capture the comprehensive characteristics of PPG waveforms and enable the discrimination of different classes. Commonly extracted features include morphological descriptors, time domain statistics, statistical measures in the frequency domain, nonlinear measures, wavelet-based measures, and cross-correlation measures.
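To make the notion of time domain statistics concrete, the sketch below computes a few widely used interval-variability measures from a sequence of inter-beat intervals. The specific measures (SDNN, RMSSD, pNN50) and the function name are illustrative choices and are not taken from any particular reviewed study.

```python
import math
from statistics import mean, pstdev


def ibi_features(ibis_ms):
    """Simple time-domain features computed from inter-beat intervals in milliseconds."""
    diffs = [b - a for a, b in zip(ibis_ms, ibis_ms[1:])]   # successive differences
    return {
        "mean_ibi": mean(ibis_ms),
        "sdnn": pstdev(ibis_ms),                               # overall variability
        "rmssd": math.sqrt(mean([d * d for d in diffs])),      # beat-to-beat variability
        "pnn50": sum(abs(d) > 50 for d in diffs) / len(diffs), # share of large jumps
    }


regular = ibi_features([800, 810, 795, 805, 800, 790])
irregular = ibi_features([620, 910, 540, 1020, 700, 480])
print(regular)
print(irregular)
```

A feature vector of this kind, possibly extended with morphological or frequency domain descriptors, is what a downstream classifier would receive for each PPG segment.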
Among machine learning algorithms, tree-based algorithms, such as decision trees, random forest, and extreme gradient boosting (XGBoost) (Chen and Guestrin 2016), are the most popular choices and are collectively employed in 12 out of the 18 studies employing machine learning for AF detection. Random Forests have demonstrated strong performance in AF detection tasks using PPG. This ensemble learning algorithm combines multiple decision trees to create a robust classification model. By aggregating the predictions of individual trees, Random Forests can reduce overfitting, handle complex feature interactions, and provide accurate AF detection results. The versatility, interpretability, and resilience to noisy data make Random Forests a popular choice in PPG-AF detection research. XGBoost is a boosting algorithm that combines gradient boosting with decision trees to achieve high predictive accuracy in PPG-AF detection. XGBoost sequentially builds an ensemble of weak models, iteratively improving its performance by minimizing a loss function. It can effectively handle complex feature interactions and capture subtle patterns in PPG signals, leading to improved AF classification results and better detection performance compared to individual decision trees.
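The idea of aggregating many trees trained on resampled data can be sketched with a deliberately simplified ensemble. The example below bags depth-one trees (decision stumps) and takes a majority vote; it is a stand-in for the aggregation principle only, not a full random forest (there is no feature subsampling and no deeper trees) and not XGBoost. The toy feature values and all function names are assumptions made for illustration.

```python
import random


def fit_stump(X, y):
    """Find the (feature, threshold) split with the fewest errors; predicts 1 when value >= threshold."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            errors = sum((row[j] >= t) != bool(label) for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best[1], best[2]


def fit_bagged_stumps(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest


def predict(forest, row):
    """Majority vote over all stumps: 1 = AF, 0 = non-AF."""
    votes = sum(row[j] >= t for j, t in forest)
    return int(2 * votes > len(forest))


# Hypothetical rows of [IBI coefficient of variation, pNN50]; label 1 = AF, 0 = non-AF.
X = [[0.02, 0.05], [0.03, 0.10], [0.05, 0.02], [0.08, 0.12],
     [0.18, 0.70], [0.22, 0.85], [0.25, 0.60], [0.30, 0.90]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_bagged_stumps(X, y)
print(predict(forest, [0.35, 0.95]), predict(forest, [0.01, 0.01]))
```

Because each stump sees a slightly different resample, individual errors tend to average out in the vote, which is the intuition behind the reduced overfitting attributed to tree ensembles.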
The second most popular machine learning classifier for AF detection (used in 8 out of 18 studies) is the support vector machine (SVM) (Cortes and Vapnik 1995), owing to its ability to handle high-dimensional feature spaces. SVM separates PPG signal data into different classes by identifying an optimal hyperplane that maximizes the margin between the classes. By mapping PPG signals into a higher-dimensional space, SVM can capture complex relationships and find effective decision boundaries for accurate AF classification. Other classifiers, such as k-nearest neighbors (KNN) and artificial neural networks (ANN), are also used in the studies, but they are not as widely adopted as the above two classifiers.
Compared to the previous review, we observe a sharp increase in the adoption of machine learning for AF detection using PPG (5 studies/year between 2019 and 2022 versus 1.5 studies/year between 2016 and 2019).

Updates on PPG-based AF detection using deep learning approaches
Deep learning has emerged as a powerful approach for detecting AF in PPG signals, as reported in table 4. Unlike traditional ML methods, DL models can learn comprehensive feature representations in an end-to-end fashion, eliminating the need for complex feature engineering. This is achieved by training deep neural networks, which consist of interconnected layers of computational nodes, on a large number of training samples.
As shown in table 4, studies using DL approaches can be divided into two main categories. The first category (employed in 14 out of 24 studies) is the family of convolutional neural networks (CNNs). CNNs are commonly applied in computer vision tasks, but they have also been successfully adapted for PPG-AF detection. CNNs utilize convolutional layers to automatically extract relevant features from the PPG signal data (Shen et al 2019). These convolutional layers apply numerous filters across the signal, allowing the network to capture local patterns and identify important discriminative features associated with AF. By stacking multiple layers, CNNs can learn increasingly complex representations of the PPG signals, enhancing the accuracy of AF detection. Residual network (ResNet) (He et al 2016), a specific type of CNN, addresses the challenge of training deep neural networks by utilizing skip connections. These connections allow the network to bypass layers and pass information directly to subsequent layers, mitigating the vanishing gradient problem. In the context of PPG-AF detection, ResNet architectures enable the training of deeper networks with improved performance and ease of optimization. By incorporating residual connections, ResNet models can capture fine-grained details and long-range dependencies in PPG signals, leading to enhanced AF detection capabilities.
The second category is the family of sequential DL models, of which long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) is a popular choice (employed in 4 out of 24 studies). LSTM is a recurrent neural network architecture commonly used in PPG-AF detection due to its ability to effectively capture temporal dependencies in sequential data. In the context of PPG signals, LSTM models can analyze the sequential nature of the data, considering the temporal order of the signal samples. This allows LSTM to capture long-term patterns and dynamic changes in the PPG signals, which are crucial for accurate AF detection.
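The two building blocks mentioned for CNNs, a convolutional filter sliding across the signal and a skip connection that adds the input back to the filtered output, can be illustrated without any deep learning framework. The sketch below is a toy, single-channel version with a fixed kernel; real networks learn many kernels per layer from data, and the function names here are hypothetical.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length (odd kernel assumed)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def residual_block(signal, kernel):
    """y = relu(conv(x) + x): the skip connection adds the input to the convolved output."""
    convolved = conv1d_same(signal, kernel)
    return relu([c + x for c, x in zip(convolved, signal)])


pulse = [0.0, 1.0, 2.0, 1.0, 0.0]                   # toy PPG-like pulse
slope = conv1d_same(pulse, [1.0, 0.0, -1.0])        # responds to local slope
passthrough = residual_block(pulse, [0.0, 0.0, 0.0])
print(slope, passthrough)
```

With an all-zero kernel the residual block returns its input unchanged, which shows why skip connections make it easy for a deep network to preserve information across layers.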
To effectively train DL models, a substantial amount of labeled training data is typically required. However, in biomedical applications, the availability of labeled data is often limited. Transfer learning is a potential solution to this challenge, wherein a pre-trained DL model is fine-tuned for a specific task. The number of layers to fine-tune and the complexity of fine-tuning depend on the particular application. For example, in one study, a pre-trained CNN model designed for ECG analysis was fine-tuned to detect AF from PPG segments using a small set of labeled data. Another promising technique is data augmentation, which generates artificial samples to increase the number of training samples and improve the generalizability of model performance.
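A minimal sketch of data augmentation for a one-dimensional PPG segment is given below. The three transformations (amplitude scaling, circular time shift, additive Gaussian noise), their parameter ranges, and the synthetic sine-wave segment are assumptions chosen for illustration; the reviewed studies may use different or more sophisticated schemes.

```python
import math
import random


def augment(segment, rng, noise_sd=0.02, scale_range=(0.9, 1.1), max_shift=10):
    """Return a perturbed copy of the segment: random scale, circular shift, Gaussian noise."""
    scale = rng.uniform(*scale_range)
    shift = rng.randint(-max_shift, max_shift)
    n = len(segment)
    shifted = [segment[(i - shift) % n] for i in range(n)]
    return [scale * v + rng.gauss(0.0, noise_sd) for v in shifted]


# Synthetic stand-in for a PPG segment (a sine wave, not real data).
segment = [math.sin(2 * math.pi * i / 50) for i in range(200)]
rng = random.Random(0)
augmented_copies = [augment(segment, rng) for _ in range(5)]
print(len(augmented_copies), len(augmented_copies[0]))
```

Each augmented copy keeps the label of the original segment, so a small annotated set can be expanded before training. The perturbations should stay small enough that they do not alter the rhythm information on which the label depends.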
DL is the fastest growing of the three approaches for PPG-AF detection. We observe an average of 6 studies employing DL per year between 2019 and 2022, as compared to 3.5 studies/year between 2018 and 2019.

Discussion
While the performance metrics reported in tables 2-4 suggest the promising potential of PPG for AF detection, several challenges remain. In this section, we will delve into these issues, offering insights drawn from our comprehensive analysis of the reviewed studies. Key concerns to be discussed include PPG signal quality, label accuracy, and the impact of concurrent arrhythmias. Studies that have considered these issues are summarized in table 5. Furthermore, we extend our discussions to encompass additional considerations pertaining to PPG-based AF detection. These include algorithmic factors such as performance metrics, data sources, computational efficiency, domain shifts, as well as model explainability and equity.

PPG signal quality
PPG signal quality remains a considerable challenge, which is widely acknowledged within the scientific community. A multitude of complicating factors can compromise the PPG signal quality, including motion artifacts, skin tone variations, sensor pressure variations, respiratory cycles, and ambient light interference, to name a few. The challenge of noise in PPG signals is particularly acute when it comes to the continuous acquisition of PPG, which is crucial for long-term monitoring of AF risk.
As reported in table 5, most of the reviewed studies take signal quality into consideration, with 54% of the reviewed studies implementing measures to exclude PPG signals of poor quality. For example, Han et al (2020) presented a noise artifact detection algorithm. Out of a total of 2728 30 s PPG strips, only 314 strips were deemed suitable for further analysis after applying the algorithm. Similarly, Torres-Soto and Ashley (2020) proposed a multi-tasking framework that incorporated both signal quality assessment and AF detection tasks. Only PPG signals of excellent quality were retained for the purpose of AF detection. This practice, however, harbors potential issues that warrant deeper consideration. Firstly, by systematically discarding vast swaths of signal data considered of inferior quality, the earliest possible detection of AF is inevitably delayed, creating a potentially significant time lag in diagnosis. Secondly, this approach poses a statistical dilemma: the discarded PPG-AF signals could be construed as false positives within the context of the overall analysis. However, such instances are typically overlooked when calculating the positive predictive value or false positive rate, thereby potentially inflating the model's reported performance. Consequently, the reliance on selective data exclusion as a signal quality control strategy may inadvertently compromise the validity of the study's outcomes and the efficacy of predictive models developed therefrom.
We propose a nuanced perspective on PPG signal quality assessment rather than adhering to the dichotomous approach of designating signals as simply acceptable or unacceptable (Charlton et al 2023). Instead, we suggest the computation of a signal quality index (SQI) as a continuous metric (Guo et al 2021b). This calculation would be based on the proportion of motion artifacts present within individual PPG segments, thus providing a more precise estimate of signal quality. Subsequently, an appropriate threshold could be ascertained to filter out PPG signals devoid of meaningful information. Alternatively, one can integrate the signal quality information as part of the model input that controls the uncertainty level of the model output. These approaches would strike a balance between salvaging PPG signals of suboptimal quality for uninterrupted monitoring and preserving model performance.
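One way to read the continuous SQI described above is as the fraction of a segment that is free of motion artifacts. The sketch below follows that reading. It assumes that a per-sample artifact mask is already available from some upstream detector (for example, one driven by accelerometer data), which is a hypothetical input; the lenient threshold value is likewise an arbitrary choice for illustration.

```python
def signal_quality_index(artifact_mask):
    """SQI in [0, 1]: fraction of samples NOT flagged as motion artifact."""
    flagged = sum(bool(m) for m in artifact_mask)
    return 1.0 - flagged / len(artifact_mask)


def triage(segments, min_sqi=0.3):
    """Drop only segments with almost no usable signal; keep the SQI as a confidence weight."""
    kept = []
    for segment_id, mask in segments:
        sqi = signal_quality_index(mask)
        if sqi >= min_sqi:
            kept.append((segment_id, sqi))
    return kept


# Hypothetical masks: 1 marks a sample flagged as motion artifact.
segments = [
    ("a", [0] * 10),            # clean
    ("b", [1] * 5 + [0] * 5),   # half corrupted, still retained
    ("c", [1] * 9 + [0]),       # almost entirely corrupted
]
kept = triage(segments)
print(kept)
```

The retained SQI values could then be supplied to the detection model as an additional input or used to weight its output, instead of discarding every segment that is less than pristine.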

Label noise
The issue of label noise in annotated datasets presents another significant challenge in the application of PPG for AF detection. Accurate and consistent labeling of datasets is crucial for the development and validation of reliable detection algorithms (Song et al 2022). Achieving this usually involves more than two clinical domain experts cross-checking the agreement of annotations, and a reconciliation strategy needs to be in place in the event of disagreement. However, many studies fall short in this respect because annotation is labor-intensive and the number of cardiologists available to annotate the datasets is insufficient. Across the reviewed studies, only 9 out of the 57 studies (Kwon et al 2019, Väliaho et al 2019, Väliaho et al 2021b, Chang et al 2022, Liao et al 2022, Liu et al 2022, Nguyen et al 2022, Zhu et al 2022) employed the expertise of at least two cardiologists for annotation, as reported in table 5. This scarcity of expert annotators can result in imprecise and incomplete labeling of AF events, leading to label noise, which, in turn, may undermine the performance of supervised learning algorithms.
Furthermore, the absence of standardized guidelines to address disagreements among annotators exacerbates this issue. In the event of conflicting annotations, the lack of a clear protocol or consensus mechanism can lead to inconsistencies in the dataset. This variability not only confounds the training of predictive models but also hampers the reproducibility of research findings. Consequently, establishing robust procedures for data annotation, which involve recruiting sufficient expert annotators and defining clear rules for resolving disagreements, is paramount. Addressing these issues would significantly enhance the quality of the annotated PPG datasets, thereby facilitating more reliable and accurate AF detection.
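As an illustration of what a simple annotation protocol could quantify, the sketch below computes Cohen's kappa between two annotators and reconciles labels by majority vote, flagging ties for adjudication. Both the agreement statistic and the reconciliation rule are assumptions introduced for this example; they are not procedures reported by the reviewed studies.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (undefined if expected agreement is 1)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    return (observed - expected) / (1 - expected)


def reconcile(*annotations):
    """Majority vote per segment; ties return None so they can be sent for adjudication."""
    merged = []
    for votes in zip(*annotations):
        (top, count), *rest = Counter(votes).most_common()
        merged.append(None if rest and rest[0][1] == count else top)
    return merged


# Hypothetical labels from three annotators for four segments.
annotator_1 = ["AF", "AF", "NSR", "NSR"]
annotator_2 = ["AF", "NSR", "NSR", "NSR"]
annotator_3 = ["AF", "AF", "NSR", "AF"]

kappa = cohen_kappa(annotator_1, annotator_2)
two_way = reconcile(annotator_1, annotator_2)
three_way = reconcile(annotator_1, annotator_2, annotator_3)
print(kappa, two_way, three_way)
```

With only two annotators every disagreement is a tie, which illustrates why an odd number of experts or an explicit adjudication step is useful when defining the ground truth.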
In addition to the shortage of expert involvement, the field faces another substantial challenge: the absence of clear clinical guidelines for annotating AF events using PPG data. Unlike ECG, which has well-established guidelines for AF event labeling, PPG operates in a far less standardized environment. This lack of formalized guidance further exacerbates the risk of label noise, compromising both algorithmic performance and clinical reliability. Given these constraints, it becomes imperative to consider multimodal signal inputs when annotating data. Incorporating ECG or other established modalities alongside PPG can provide a more robust framework for annotation, thereby improving the quality of labeled data.
Table 6 provides an overview of the methodologies used in obtaining annotated PPG data across the three study categories. Among the 57 studies, a significant majority (47 studies, representing approximately 82% of the studies) relied on annotated ECG data as the primary reference for validating PPG data during the classification phase, making this the predominant approach for generating ground-truth data. Furthermore, four studies used direct PPG data labeling, while three studies adopted mixed annotation techniques, which included simulated data (i.e. PPG data generated based on acquired and annotated ECG signals). Notably, in three instances, the specific methodology for ground-truth generation was not explicitly outlined.

Concurrent arrhythmias
The detection accuracy of AF through PPG can be significantly influenced by the presence of other arrhythmias, notably premature ventricular contractions (PVC), premature atrial contractions (PAC), and atrial flutter (AFL). All of these introduce irregularities into the heart rhythm that can mimic the rhythm irregularities seen in AF, potentially leading to false-positive detections. PVCs and PACs are characterized by early heartbeats originating from the ventricles and atria, respectively (Han et al 2020). These early beats can disrupt the regular rhythm of the heart, resulting in PPG signal patterns that may resemble those associated with AF. In AFL, by contrast, the rhythm is typically more organized and less erratic than in AF, presenting a sawtooth-like pattern in ECG tracings which does not typically manifest in PPG data (Eerikäinen et al 2020). This organized rhythm may not exhibit the characteristic variability and irregularity that PPG-based AF detection models are designed to identify. Consequently, a PPG-based AF detection model might mistakenly classify these rhythms as AF events, thereby reducing the specificity of the model. Furthermore, the simultaneous presence of AF and other arrhythmias in the same patient adds another layer of complexity to the problem. This co-existence can modify the PPG signal's morphology in ways that differ from the signals of patients with AF or PVC/PAC alone, making it more difficult to accurately identify the presence of AF.
It is noteworthy that several studies considered the presence of arrhythmias other than AF, as shown in table 5. For instance, the studies by Eerikäinen et al (2020) and Liao et al (2022) explored the differentiation of PVC and PAC from AF using PPG signals. The results of these investigations demonstrated successful differentiation between PVC/PAC and AF based on PPG signal characteristics. Despite limited research on PPG-based detection of AFL, Eerikäinen et al have shown that PPG can differentiate among AF, AFL, and other rhythms. They employed a Random Forest classifier that utilizes a combination of inter-pulse interval features and PPG waveform characteristics, achieving high sensitivity and specificity (Eerikäinen et al 2020). These findings suggest that PPG-based analysis holds promise for distinguishing various types of arrhythmias beyond AF. Thus, when developing and evaluating PPG-based AF detection models, it is critical to account for the potential influence of other arrhythmias. Robust algorithms should be designed to discriminate between AF and these other rhythm disturbances to maintain high detection accuracy, reinforcing the necessity of comprehensive, diverse, and well-annotated training datasets in the development of these predictive models.

Quantitative metrics for algorithm performance evaluation
The studies reviewed in this work rely on conventional performance metrics, such as the area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity, and F1 score. However, relying solely on these conventional metrics may be insufficient, particularly in continuous health monitoring scenarios (Butkuviene et al 2021). Continuous monitoring through wearable devices produces a dynamic, ever-changing data stream whose complexity exceeds what traditional evaluation metrics were designed to capture. When health-related parameters are tracked without interruption, fluctuations, subtle changes, and long-term trends become highly significant. Conventional metrics, by design, assess performance within discrete segments and can therefore miss the broader temporal context that is intrinsic to continuous health monitoring. This calls for evaluation methodologies that are attuned to temporal dynamics, such as the frequency of AF occurrence (which reflects AF burden), the duration of AF episodes, their variation, and overall trends. For instance, incorporating standards equivalent to the ANSI/AAMI EC57:2012 standard (which is used for ECG) (American Association of Medical Instrumentation 2020) into algorithm evaluation frameworks for PPG-based AF detection could provide guidance for assessing clinical significance in continuous monitoring scenarios.
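The contrast between segment-level metrics and temporally aware measures can be made concrete with a short sketch. The example below computes the conventional metrics from per-segment labels and then derives AF burden and episode durations from the same sequence. The 30 s segment length, the toy label sequences, and the function names are assumptions made for illustration, and the burden definition used here (fraction of monitored segments classified as AF) is one of several possible definitions.

```python
def confusion_metrics(y_true, y_pred):
    """Segment-level sensitivity, specificity, PPV and F1 (1 = AF, 0 = non-AF)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def af_burden(labels):
    """Fraction of monitored segments labeled as AF."""
    return sum(labels) / len(labels)


def episode_durations(labels, segment_s=30):
    """Durations in seconds of consecutive runs of AF segments."""
    durations, run = [], 0
    for label in labels:
        if label:
            run += 1
        elif run:
            durations.append(run * segment_s)
            run = 0
    if run:
        durations.append(run * segment_s)
    return durations


# Hypothetical chronological sequence of consecutive 30 s segments.
y_true = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0]

metrics = confusion_metrics(y_true, y_pred)
print(metrics, af_burden(y_pred), episode_durations(y_pred))
```

In this toy sequence the predicted burden and episode durations happen to match the reference even though individual segments are misclassified, which shows that segment-level and episode-level views can tell different stories and are best reported together.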

Domain shift problem
PPG signals, despite their utility in non-invasive physiological monitoring, present certain complexities linked to the site of acquisition and inter-patient variability. It has been observed that PPG signals sourced from distinct anatomical sites yield diverse morphological patterns (Fleischhauer et al 2023). This is primarily due to the different vascular structures, skin thickness, and other physiological attributes specific to these sites. Such morphological variations can pose significant challenges in interpreting these signals and developing universally applicable models, as the distribution of signal characteristics is inherently contingent on the site of collection.
Moreover, inter-patient variability further compounds this issue by introducing additional variations in the data distribution. These variations stem from a wide array of factors, including demographic attributes (such as age and sex), physiological characteristics (including skin pigmentation and body mass index [BMI]), and medical conditions unique to individual patients (Clifton et al 2007). For instance, an older patient might exhibit a different PPG signal morphology due to increased arterial stiffness, while individuals with darker skin might present a different signal-to-noise ratio because their higher melanin content absorbs more light than lighter skin does.
These site-specific and inter-patient differences can induce what is referred to as a 'domain shift' problem in machine learning (Wang and Deng 2018, Radha et al 2021). Here, a model that is trained on data from a specific group (for example, PPG signals from a certain body site or a particular patient group) may not maintain its performance when it is applied to a different group. Therefore, when harnessing PPG signals for health monitoring and disease prediction, it is paramount to consider these variations and devise strategies that address the domain shift problem so that model performance is reliable and generalizes across groups.
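One simple way to check whether two groups differ in the distribution of a given feature is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The sketch below is an illustrative diagnostic only, not a method used in the reviewed studies. The feature values and the site names are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic (0 = identical ECDFs, 1 = disjoint)."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical values of one morphological feature measured at two body sites.
wrist_feature = [0.8, 0.9, 1.0, 1.1]
finger_feature = [1.0, 1.2, 1.3, 1.4]

shift = ks_statistic(wrist_feature, finger_feature)
```

A value near 0 suggests similar feature distributions across the two groups, while a value near 1 indicates a marked shift. A per-feature check of this kind cannot detect shifts that appear only in the joint distribution of several features.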

Lack of large-scale labeled dataset
In concert with the label noise issue discussed in section 5.2, there is a paucity of large-scale, annotated datasets. Developing robust and reliable algorithms for AF detection, especially when deep learning models are employed, requires extensive labeled datasets. Ideally, these datasets should encompass a broad range of patient demographic groups, diverse health conditions, and various physiological states to ensure generalizable findings. Furthermore, they should contain precise annotations of the AF events in the PPG signal to facilitate effective supervised learning.
Emerging research is increasingly focused on addressing this issue by generating synthetic PPG signals through various data augmentation techniques. These range from traditional computational models that simulate physiologic PPG patterns (e.g. PPGSynth) (Tang et al 2020) to advanced generative models such as generative adversarial networks (GANs) (Goodfellow et al 2020, Ding et al 2023), variational autoencoders (VAEs) (Kingma and Welling 2013), and diffusion models. However, the extent to which these synthesized signals contribute to improved learning outcomes remains an open question. Recent research by Ding et al indicates the existence of a 'performance ceiling', a limit to the improvements achieved by incorporating synthetic signals (Ding et al 2023). This underscores the need for further investigation into more effective algorithms for synthetic signal generation as well as a deeper understanding of this performance ceiling phenomenon.
To sum up, the lack of large, labeled datasets impedes the progress of research in this area, limiting the development and validation of predictive models. It restricts the ability to comprehensively evaluate and compare the performance of different AF detection methods under diverse and challenging conditions. Additionally, it hampers the exploration of more advanced machine learning techniques, which often necessitate large quantities of annotated data to train effectively. Therefore, efforts to collect/generate, share, and consolidate large-scale, well-annotated PPG datasets for AF detection represent a critical step to move the performance needle in this field.

Computational time
With the rapid advancement of graphics processing units (GPUs) and increasing computational power, it is now feasible to train complex, large-scale neural networks that outperform traditional statistical or conventional machine learning methods (Thompson et al 2020). However, this complexity presents new challenges, particularly for model inference. The inference process, which involves generating predictions from new data based on trained models, can be computationally demanding. This poses significant obstacles for wearable technologies that rely on edge computing, as these calculations can quickly deplete battery life, thereby undermining the feasibility of continuous monitoring (Chen and Ran 2019). Alternative solutions include offloading computational tasks to more powerful, tethered smartphones or to cloud-based platforms. Yet, both alternatives require robust and fast data streaming infrastructures.
Research efforts to address these challenges follow two directions. On one hand, there is a burgeoning focus on 'tiny ML', which aims to optimize neural network architectures for efficient edge computing without sacrificing performance. On the other hand, advancements in hardware and battery technology are driving the development of more powerful sensing techniques that enhance the capacity for long-term monitoring. Consequently, tackling these computational challenges necessitates coordinated efforts from both research directions. It also underscores the imperative to keep computational requirements at the forefront when developing PPG-based AF detection algorithms.
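To make the notion of computational requirements concrete, the sketch below estimates the parameter count and the number of multiply-accumulate operations (MACs) for a one-dimensional convolutional layer without padding. The two-layer architecture and the 50 Hz sampling rate are hypothetical choices for illustration and do not describe any reviewed model.

```python
from typing import Tuple


def conv1d_cost(length: int, c_in: int, c_out: int,
                kernel: int, stride: int = 1) -> Tuple[int, int, int]:
    """Return (output_length, parameters, MACs) for an unpadded 1D convolution."""
    out_len = (length - kernel) // stride + 1
    params = c_out * (c_in * kernel + 1)       # weights plus one bias per output channel
    macs = out_len * c_out * c_in * kernel     # one MAC per weight per output sample
    return out_len, params, macs


# Assumed input: one 30 s PPG segment sampled at 50 Hz (1500 samples, 1 channel).
samples = 30 * 50

# Hypothetical two-layer feature extractor.
len1, params1, macs1 = conv1d_cost(samples, c_in=1, c_out=16, kernel=7, stride=2)
len2, params2, macs2 = conv1d_cost(len1, c_in=16, c_out=32, kernel=5, stride=2)

total_params = params1 + params2
total_macs = macs1 + macs2
```

Even this small stack requires roughly one million MACs per segment, almost all of them in the second layer. Estimates of this kind can help compare candidate architectures against a device's energy budget before any model is trained.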

Explainability
Explainability in the context of PPG-based AF detection algorithms is a critical aspect that determines how well we understand the decision-making process of these algorithms. This is particularly important in healthcare, where the decisions made by these algorithms can have significant implications for patient care. Statistical methods are often considered naturally explainable because they rely on well-understood mathematical principles and procedures. For example, a linear regression model, which lies at the intersection of statistical methods and machine learning, makes predictions based on a weighted sum of input features. The weights (or coefficients) assigned to each feature provide a direct measure of the feature's importance in the prediction, making it relatively straightforward to interpret the model's decisions.
Machine learning methods, on the other hand, often involve more complex computations and may not be as directly interpretable as statistical methods. However, techniques have been developed to calculate feature importance, which can provide a certain level of explainability. For instance, in Yang et al (2019), the Fisher score method was employed to calculate the importance of features. The Fisher score is a statistical measure that evaluates the discriminative power of individual features in a classification task. By utilizing this method, the study aimed to assess the relevance and significance of different features in the context of atrial fibrillation detection. Similarly, in Jeanningros et al (2022), each feature was input into the classifier separately, and this sensitivity analysis produced a ranked list of features according to their impact on the overall classification performance.
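As an illustration of feature ranking by Fisher score, the sketch below uses one common definition: the between-class variance of a feature divided by its within-class variance, each weighted by class size. The exact formulation used by Yang et al (2019) is not given here, so this should be read as an assumption. The feature names and values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def fisher_score(values: Sequence[float], labels: Sequence[int]) -> float:
    """Between-class over within-class variance of a single feature."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for vals in groups.values():
        n = len(vals)
        mean = sum(vals) / n
        variance = sum((v - mean) ** 2 for v in vals) / n
        between += n * (mean - overall_mean) ** 2
        within += n * variance
    # A feature with zero within-class spread separates the classes perfectly.
    return between / within if within > 0 else float("inf")


# Hypothetical data: label 1 = AF, label 0 = non-AF.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "interval_irregularity": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],  # separates the classes
    "pulse_amplitude": [0.4, 0.5, 0.6, 0.4, 0.5, 0.6],        # same in both classes
}

scores = {name: fisher_score(vals, labels) for name, vals in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Features whose class means lie far apart relative to their within-class spread receive high scores. Because the score evaluates each feature in isolation, it ignores interactions between features.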
Deep learning models, by contrast, are often referred to as 'black boxes', which make predictions based on intricate, high-dimensional mappings that are difficult for humans to comprehend. While they may achieve high predictive accuracy, it is often challenging to understand which features and feature interactions the models use to make predictions, and how these features contribute to the final decision. This lack of transparency can be a major drawback in healthcare applications, where it is desirable to understand the underlying decision logic in order to gain the trust of end users, such as clinicians and patients.
Several approaches are being explored to improve the explainability of deep learning models, including attention mechanisms, layer-wise relevance propagation, and model-agnostic methods like local interpretable model-agnostic explanations (LIME) and SHapley Additive exPlanations (SHAP) (Binder et al 2016, Ribeiro et al 2016, Zhou et al 2016, Lundberg and Lee 2017). A good example is Liu et al (2022), where the authors used the guided gradient-weighted class activation mapping (Grad-CAM) approach to visualize crucial regions within the PPG signals that enabled the model to predict a specific rhythm category. Despite these advances, explainability in deep learning remains an active area of research, particularly in the context of PPG-based AF detection.

Performance bias and model equity
Disparities in both access to and outcomes from utilizing digital health solutions and biotechnologies manifest across a variety of identity dimensions, including economic status, social background, ethnicity, and gender (Lanier et al 2022). As described by Braveman (2014), health equity means, '…striving for the highest possible standard of health for all people and giving special attention to the needs of those at greatest risk of poor health, based on social conditions.' In the context of PPG-based AF detection, this issue of equity extends across a spectrum of potential causes. It encompasses accessibility issues, particularly for individuals from rural areas or those with disadvantaged socioeconomic statuses, as well as physiological factors like skin tone and obesity, which can influence the reliability of PPG readings (Ajmal et al 2021, Fine et al 2021). Of the studies reviewed, a mere three explicitly touched upon the issue of performance bias and model equity (Aschbacher et al 2020, Avram et al 2021, Zhang et al 2021b). This oversight underscores the pressing need to heighten awareness of equity considerations within the field. To tackle this challenge, a multidisciplinary approach is necessary, and healthcare providers, engineers, and researchers must proactively develop technologies that consider the needs of vulnerable and underrepresented populations.
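A basic step toward surfacing performance bias is to report metrics separately for each subgroup rather than only in aggregate. The sketch below computes sensitivity and specificity per group. The group labels and predictions are hypothetical, and the function is an illustrative assumption that does not reproduce an analysis from the cited studies.

```python
from collections import defaultdict
from typing import Dict, Optional, Sequence


def subgroup_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                     groups: Sequence[str]) -> Dict[str, Dict[str, Optional[float]]]:
    """Sensitivity and specificity per subgroup (1 = AF, 0 = non-AF)."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            key = "tp" if p == 1 else "fn"
        else:
            key = "fp" if p == 1 else "tn"
        counts[g][key] += 1
    results = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        results[g] = {
            # None marks a metric that is undefined for this subgroup.
            "sensitivity": c["tp"] / positives if positives else None,
            "specificity": c["tn"] / negatives if negatives else None,
        }
    return results


# Hypothetical labels, predictions, and subgroup membership for eight recordings.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
groups = ["group_a"] * 4 + ["group_b"] * 4

report = subgroup_metrics(y_true, y_pred, groups)
```

In this toy example the aggregate accuracy is 75%, which hides the fact that every error falls in one subgroup. Gaps of this kind are what stratified reporting is intended to expose.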

Conclusion
In conclusion, this comprehensive review highlights the growing significance of PPG-based AF detection in addressing a critical clinical challenge. The surge in research efforts, especially in machine learning and deep learning approaches, underscores the potential of PPG technology for continuous and accurate AF monitoring. While machine learning techniques offer versatility and promising results, deep learning models demonstrate remarkable performance by automating feature extraction. Nevertheless, challenges related to signal quality, label accuracy, and concurrent arrhythmias persist, necessitating ongoing research and development. Furthermore, the availability of large-scale labeled datasets, computational efficiency, model explainability, and addressing performance bias and equity issues emerge as crucial considerations in advancing PPG-based AF detection technology. This review underscores the importance of continued collaboration between the medical and artificial intelligence communities to refine and deploy effective solutions for AF detection, ultimately improving patient outcomes in the face of this widespread health concern.

Figure 1. Trends in the cumulative numbers of publications in three method categories using PPG for AF detection.

Table 2. Studies on photoplethysmography based AF detection using statistical approaches.

Table 3. Studies on photoplethysmography based AF detection using ML approaches.

Table 4. Studies on photoplethysmography based AF detection using DL approaches.

Table A1. Summary of the train/test data splitting and excluded data due to noisy data and motion artifacts (STAT).

Table A2. Summary of the train/test data splitting and excluded data due to noisy data and motion artifacts (STAT continued).

Table A3. Summary of the train/test data splitting and excluded data due to noisy data and motion artifacts (ML).

Table A4. Summary of the train/test data splitting and excluded data due to noisy data and motion artifacts (ML continued).

Table A5. Summary of the train/test data splitting and excluded data due to noisy data and motion artifacts (DL).
Train/test from multiple datasets. Train: The model is trained on approximately one million simulated unlabeled physiological signals and fine-tuned on a curated dataset of over 500 K labeled signals from over 100 individuals from 3 different wearable devices.

Table A6. Summary of the train/test data splitting and excluded data due to noisy data and motion artifacts (DL continued).
Kwon et al (2020): 70% of the total data set was randomly selected for the training step, 90% of which was used as the training set, and 10% was used as the cross-validation set; Testing step: 30% of the total data. 108,6 h (13 038 30-s PPG)

Table B1. Summary of the color ranges used in PPG sensors. Some studies used wearable devices with more than one color range, which were included in more than one category. Green (Eerikäinen et al 2019, Fallet et al 2019, Kabutoya et al 2019, Rezaei Yousefi et al 2019, Väliaho