Physiological sensor data cleaning with autoencoders

Objective. Physiological sensor data (e.g. photoplethysmography) is important for remotely monitoring patients' vital signs, but is often affected by measurement noise. Existing feature-based models for signal cleaning can be limited because they may not capture the full signal characteristics. Approach. In this work we present a deep learning framework for sensor signal cleaning based on dilated convolutions, which capture the coarse- and fine-grained structure of the signal in order to classify whether it is noisy or clean. Since obtaining annotated physiological data is costly and time-consuming, we also propose an autoencoder-based semi-supervised model that learns a representation of the sensor signal characteristics, adding an element of interpretability. Main results. Our proposed models are over 8% more accurate than existing feature-based approaches, with half the false positive/negative rates. Finally, we show that with careful tuning (which can be improved further), the semi-supervised model outperforms the supervised approaches, suggesting that incorporating the large amounts of available unlabeled data can be advantageous for achieving high accuracy (over 90%) and minimizing the false positive/negative rates. Significance. Our approach reliably separates clean from noisy physiological sensor signals, which can pave the way for the development of reliable features and eventually support decisions regarding drug efficacy in clinical trials.


Introduction
Wearable sensors are broadly used for collecting physiological and behavioral signals for health monitoring (Coravos et al 2019). Different signals can provide valuable information regarding people's health. For example, with photoplethysmography (PPG) or electrocardiogram (ECG) sensors we can detect heart conditions that could alert users to visit their doctors (Raja et al 2019). However, outcomes of such health monitoring tools or medical devices are only as reliable as the sensor data used. Sensor data quality depends on hardware and can be highly prone to noise, causing feature estimates (e.g. heart rate) to vary (Pasadyn et al 2019, Sequeira et al 2020). The signal quality of certain sensors (e.g. PPG, ECG, electroencephalogram (EEG)) is reduced by factors such as motion artifacts, sensor placement, and even blood perfusion or skin type (Bent et al 2020, Liang et al 2023). In order to make reliable predictions or assumptions regarding the well-being of the person wearing the sensor device, we need to be confident that we are using reliable (i.e. noise- and artifact-free) data/signals.
We used PPG technology in this work as an example for evaluating the proposed methodology. The PPG signal measures blood volume variations due to the heartbeat by shining light into the skin and measuring the light that is reflected back (Biswas et al 2019, Allen et al 2021). The PPG signal represents the aggregated expression of many physiological processes within the cardiovascular system (Liang et al 2018). When the PPG signal is reliable, we can compute heart rate (HR) and heart rate variability (HRV) features to understand multiple aspects of a person's physical, psychological and mental state, such as exercise recovery (Bechke et al 2020), cardio conditions (Hoshi et al 2021), sleeping patterns (Hietakoste et al 2020), anxiety (Rodrigues et al 2020, Seipäjärvi et al 2022), and emotional state (Kim et al 2020).
Our goal was to develop a robust and accurate approach to detecting clean subsignals, using as an example continuous PPG signal (labeled and unlabeled) collected from participants performing everyday activities.
However, our approach is general and could be applied to any type of continuous physiological signals that are prone to noise.
For the development of such algorithms that can help us understand a person's physical, psychological and mental state, we need an accurate classification of clean versus noisy sensor signals, for which no standard methodology currently exists. Instead, common approaches tend to use sensor-specific heuristics (Bhowmik et al 2017, Liang et al 2023) or methodologies that report a continuous quality index, which raises the question of how to set the signal quality threshold (Orphanidou et al 2014, Elgendi 2016, Zanon et al 2020).
The main challenges of building models to classify physiological signal quality are twofold. Firstly, expert manual annotation of physiological data needed for training models is expensive and time consuming, so little annotated data is available in the community. Secondly, sensor data can be device specific, so existing annotated data may not be usable when a different or newer sensor is used.
In terms of data modeling, feature-based models are very dependent on the feature-engineering process. Complicated signals make it difficult to capture the full signal characteristics (e.g. Elgendi 2016, Zanon et al 2020) and to provide robust and accurate classification. On the other hand, a deep learning approach using the raw signal as input might help learn more appropriate features.
In this work, we propose both a supervised and a semi-supervised model, which leverage existing large unlabeled datasets and make efficient use of the small amount of labeled data via autoencoders.

Materials and methods
Datasets

PPG signals can be easily extracted from human peripheral tissue, such as fingers, toes, earlobes, wrists, and the forehead; therefore, they have great potential for application in wearable health devices (Liang et al 2018).
In this work, we used PPG signals collected via a wrist-worn smartwatch equipped with LEDs and a photodiode (a Samsung Gear Sport smartwatch), logged using a custom smartwatch application with a sampling frequency of 20 Hz. Data was collected according to the Declaration of Helsinki. All participants were required to provide written informed consent before performing any study-related procedures.
Once the data was collected, we applied a 3rd-order Butterworth bandpass filter with cut-off frequencies of 0.5 and 9 Hz to each participant's daily PPG signal, and then cut the daily signal into 10 second non-overlapping intervals.
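A minimal sketch of this preprocessing step, assuming each participant-day arrives as a 1-D NumPy array; the choice of zero-phase filtering (filtfilt) is ours and is not stated above:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 20           # sampling frequency (Hz)
SEGMENT_SEC = 10  # segment length (s)

def preprocess_daily_ppg(daily_ppg: np.ndarray) -> np.ndarray:
    """Band-pass filter a day of PPG and cut it into 10 s segments."""
    # 3rd-order Butterworth band-pass, 0.5-9 Hz cut-off frequencies
    b, a = butter(N=3, Wn=[0.5, 9.0], btype="bandpass", fs=FS)
    filtered = filtfilt(b, a, daily_ppg)  # zero-phase filtering (our choice)
    # non-overlapping 10 s windows: 200 samples each at 20 Hz
    win = FS * SEGMENT_SEC
    n_win = len(filtered) // win
    return filtered[: n_win * win].reshape(n_win, win)
```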
Different physiological signals and applications require different lengths of signal to be usable. For example, in the case of PPG, a very short signal (i.e. a few seconds) can be used for the calculation of heart rate (Zhang et al 2015), a longer signal (e.g. 1 min) should be used for the calculation of time-domain features (e.g. average value, standard deviation), and an even longer signal (e.g. 5 min) for frequency-domain heart rate variability features (Shaffer and Ginsberg 2017). To enable as many applications as possible, we selected the shortest signal length that allows us to distinguish between the periodicity of a clean signal and the irregularity of a noisy one. A post-processing step on the algorithm output can identify continuous clean signals of the appropriate length depending on the sensor, the application, and the features we need to calculate.

Labeled PPG dataset
Data was collected from 5 healthy volunteers (1 female and 4 male, average age 33) without any supervision, during their normal daily activities or during their nightly sleep. In total, 13 547 10 second non-overlapping PPG signal samples were collected. We manually labeled the signals as clean or noisy according to Elgendi (2016), resulting in 8305 noisy and 5242 clean PPG signals. The labeling took one employee three working days, and validation of random samples by other experts took one more day. A clean PPG signal (figure 1) may have mildly different formations depending on the person wearing the sensor, how the sensor is worn, and other external factors. In all cases, though, clean signals follow a sinusoidal waveform, and we are looking to identify the clear peak of each period, which represents a heart beat. On the other hand, noisy signals can have any other irregular form (figure 2). Each figure depicts 10 s of PPG signal at 20 Hz sampling frequency, resulting in 200 PPG data points/samples.

Unlabeled PPG dataset
Data was also collected from 20 healthy volunteers (4 female and 16 male, average age 32) at a different point in time compared to the previously described dataset (Zanon et al 2020). Data collection was conducted while the participants were performing a series of activities in a supervised manner, where the participant would switch activities every 5 min. The activities included sitting in a resting position, paced breathing, console gameplay, orthostasis, mental stress manipulation, physical activity, and sitting in a resting position again. We expected that certain activities would introduce different levels of motion artifacts (e.g. physical activity, orthostasis and console gameplay), while others would increase the heart rate and modify the PPG waveform (e.g. paced breathing). The PPG signal was collected simultaneously with an ECG signal in order to compare the derived HRV features and eventually estimate an HRV multivariate quality metric. The HRV quality metric was computed for each PPG sample signal according to Zanon et al (2020) as a measure of error, which is inversely proportional to signal quality. If the HRV quality metric value was below 20, the PPG signal was regarded as trustworthy (i.e. clean). In total, 37 564 unlabeled 10 second non-overlapping PPG signal samples were collected.

Independent test set
We randomly selected 1000 out of the 37 564 samples of the previously described dataset, and manually annotated them. We found 796 noisy and 204 clean PPG signals across all activities of the experiment protocol (figure 3).

Model overview
We developed and compared a supervised and a semi-supervised model with the goal of classifying PPG signals as clean or noisy. The datasets used are summarised in table 1. For both models we used a dilated convolutional neural network (CNN) architecture inspired by WaveNet (van den Oord et al 2016), operating directly on the PPG signal (figure 4). This architecture uses stacked causal dilated convolutions. Hence, the receptive field of the network can be increased greatly using only a few layers while maintaining computational efficiency, unlike recurrent neural networks (RNNs). For our application, we needed an increased receptive field because of the length of our signals and because the frequency of the sinusoidal waves of clean PPG signals may vary depending on the breathing/anxiety patterns of each user.
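To make the receptive-field argument concrete: for a stack of $L$ dilated convolutions with kernel size $k$ and dilation rates $d_i$, the receptive field is

$$\mathrm{RF} = 1 + (k-1)\sum_{i=1}^{L} d_i.$$

With the kernel size 3 and dilation rates 1, 2, 4, 8 and 16 used in our architecture (section 2.2; we assume the first layer is an undilated convolution), this gives $\mathrm{RF} = 1 + 2\,(1+2+4+8+16) = 63$ samples, i.e. roughly 3 s of signal at 20 Hz, enough to span several heart beats.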
For both models, the goal was to jointly learn to use the labeled signals to distinguish clean from noisy signals, and to help the network learn more about the physiology of the signal via an autoencoder (figure 5). This was achieved using a secondary output whose goal is to reconstruct the input signal and measure their difference in terms of mean squared error (MSE), alongside a standard classification loss. We expected that models with lower MSE (i.e. more accurate signal reconstruction) would produce more accurate classifications. The MSE loss depends only on the difference between the input signal and the reconstructed signal; however, the better the model has grasped the morphology of the PPG signals and can reconstruct them accurately, the higher the chances that the model will be able to classify between clean and noisy signal. It is both a way to validate that the model is learning the signal morphology correctly and a way to help the network learn the signal properties, since the MSE loss contributes equally with the binary cross-entropy loss to the model training (figure 5). To that end, we used unlabeled data in the semi-supervised model to further improve the signal reconstruction, and potentially improve the classification performance during activities or for subjects from a different population (e.g. with an anxiety disorder) with different characteristics compared to the limited labeled data already included in training.

Supervised model
We developed two versions of a supervised model. Firstly, a baseline one where the supervised model optimizes for the classification loss only (figure 4). Secondly, a model where we add the autoencoder part and optimize both for the classification loss and the MSE loss of the signal reconstruction (figure 5); this version can be trained in a supervised or semi-supervised manner.
For both supervised models, we developed a common core architecture. The model is a CNN architecture consisting of five CNN layers with dilation, causal padding and a rectified linear unit (ReLU) activation function, as depicted in figures 4 and 5. The network had dilation factors of 1, 2, 4, 8 and 16 for the five CNN layers (i.e. the first layer is an undilated convolution). In each layer we used 16 filters and a kernel of size 3, and we experimented with regularization strengths of [0.0005, 0.001, 0.0015, 0.002]. The output of the final convolutional layer was flattened and passed through a dense layer to the network outputs. We used an Adam optimizer with a learning rate of 0.00001, with a decay where the learning rate is halved every 100 epochs. For training, we used a batch size of 128, and tried epochs from 50 up to 500.
The input we provided to the model was filtered PPG signals of 10 seconds each at 20 Hz, as described in the previous section, with 200 data points in total. Moreover, for training we provided the ground truth for each input signal (i.e. indicating whether the signal is noisy or clean).
For the supervised model with no signal reconstruction (figure 4), the class output used the dense layer's output as input, along with a sigmoid activation function and a binary cross-entropy loss function.
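A minimal Keras sketch of this baseline model, under the hyperparameters of section 2.2; the size of the intermediate dense layer and the exact placement of the L2 regularization are our assumptions, and the learning-rate halving every 100 epochs would be added via a scheduler callback:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_baseline(l2_strength: float = 0.001) -> tf.keras.Model:
    """Dilated causal CNN classifier (cf. figure 4)."""
    inp = layers.Input(shape=(200, 1))           # 10 s of PPG at 20 Hz
    x = inp
    for d in (1, 2, 4, 8, 16):                   # dilation per CNN block
        x = layers.Conv1D(16, 3, padding="causal", dilation_rate=d,
                          activation="relu",
                          kernel_regularizer=regularizers.l2(l2_strength))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation="relu")(x)   # dense layer; size assumed
    out = layers.Dense(1, activation="sigmoid", name="clean_prob")(x)
    model = tf.keras.Model(inp, out)
    # learning rate 1e-5; halve every 100 epochs via LearningRateScheduler
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```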
In the case of the supervised model with signal reconstruction (figure 5), the two outputs contributed equally to the learning process. To estimate the MSE output (between the input and the reconstructed signal, with an MSE loss function) we introduced an extra dense layer to produce an output of the same size as the input signal (i.e. 200 data points). We also experimented with different weights for the MSE output.
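A sketch of the two-output variant (cf. figure 5), reusing the core above; the reconstruction head is a dense layer back to 200 points, and the two losses are weighted (equally by default, or with the lower MSE weights explored in section 2.3.1):

```python
def build_with_reconstruction(l2_strength: float = 0.001,
                              mse_weight: float = 1.0) -> tf.keras.Model:
    """Classifier plus signal reconstruction head (cf. figure 5)."""
    inp = layers.Input(shape=(200, 1))
    x = inp
    for d in (1, 2, 4, 8, 16):
        x = layers.Conv1D(16, 3, padding="causal", dilation_rate=d,
                          activation="relu",
                          kernel_regularizer=regularizers.l2(l2_strength))(x)
    shared = layers.Dense(32, activation="relu")(layers.Flatten()(x))
    clean_prob = layers.Dense(1, activation="sigmoid", name="clean_prob")(shared)
    recon = layers.Dense(200, name="reconstruction")(shared)  # same size as input
    model = tf.keras.Model(inp, [clean_prob, recon])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss={"clean_prob": "binary_crossentropy",
                        "reconstruction": "mse"},
                  loss_weights={"clean_prob": 1.0,
                                "reconstruction": mse_weight})
    return model

# the reconstruction target is the input signal itself:
# model.fit(x, {"clean_prob": y, "reconstruction": x.reshape(len(x), 200)}, ...)
```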
For training, we provided a balanced dataset of 9380 labeled signal samples, and we kept 1000 non-overlapping samples from the same dataset as a validation set. For testing, we used another 1000 non-overlapping samples from the same dataset as the training and validation sets, plus an extra independent labeled dataset (table 1).

Semi-supervised model
The semi-supervised model was an extension of the supervised one, with all layers and parameters kept the same for consistency and comparison. We provided an additional input to the model, on top of the filtered 10 second PPG signals, called the label mask (figure 5). Its purpose is to indicate whether an input signal is labeled or not. As the training dataset, we used the same balanced labeled training set as for the supervised model. We also used the unlabeled training set, as described in table 1. Labeled data contributed to both losses of the network. Input data with no labels (i.e. label mask = 0) contributed only to the MSE loss (figure 5).
In each epoch, we trained the network in two steps: firstly, with the labeled training set only, to learn the class label and relevant information for signal reconstruction; secondly, with a random subset of 5000 samples from the unlabeled training set only, to better learn the signal reconstruction and also introduce completely different signal characteristics not included in the labeled training set (e.g. different people, different activities).
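The following sketch shows one way to implement this two-step epoch with the two-output model above; here the label mask is realized as a per-output sample weight that zeroes the classification loss for unlabeled batches, an implementation choice of ours that is equivalent in effect to the mask input described above:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

for epoch in range(n_epochs):
    # step 1: labeled data contribute to both losses
    model.fit(x_lab,
              {"clean_prob": y_lab,
               "reconstruction": x_lab.reshape(len(x_lab), 200)},
              batch_size=128, epochs=1, verbose=0)
    # step 2: random subset of 5000 unlabeled samples; label mask = 0,
    # so only the reconstruction (MSE) loss is active
    idx = rng.choice(len(x_unlab), size=5000, replace=False)
    xu = x_unlab[idx]
    model.fit(xu,
              {"clean_prob": np.zeros(len(xu)),               # dummy labels
               "reconstruction": xu.reshape(len(xu), 200)},
              sample_weight={"clean_prob": np.zeros(len(xu)),  # mask = 0
                             "reconstruction": np.ones(len(xu))},
              batch_size=128, epochs=1, verbose=0)
```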

Experiment overview

Optimal hyperparameter selection
We ran our supervised and semi-supervised models with all hyperparameter combinations described in section 2.2, and picked the optimal hyperparameter combination for each model according to certain evaluation metrics (i.e. maximizing the F1-score). Moreover, we experimented with lower MSE loss weights (i.e. 0.05, 0.1, 0.5) to account for the MSE values ranging higher than 1, which is the maximum of the classification output. This imbalance could lead the algorithm to learn more about the signal reconstruction and less about the actual class output.

Combined model
As an attempt to overcome possible limitations of the separate supervised and semi-supervised models, we created a combination of the two models, taking the average of the probabilities reported by the two models as the combined prediction.

Evaluation metrics
To evaluate model performance we computed confusion matrices and the following metrics: accuracy, precision, false positive rate, false negative rate, F1-score and the area under the ROC curve (AUC). Accuracy is defined as the sum of the true positives and true negatives, divided by the total number of test samples. Precision is the number of true positive results divided by the number of all positive results, including those not identified correctly. The false positive rate is the number of false positive predictions over the number of negative samples, and the false negative rate is the number of false negative predictions over the number of positive samples. These metrics are crucial for many applications of data cleaning of physiological signals that are prone to noise: false positives add noise to our data through unreliable measurements, while false negatives might miss valuable clinical event information.
The F1-score is the harmonic mean of precision and recall. A value of 1 indicates perfect precision/recall, whereas 0 indicates that either the precision or the recall is zero. Finally, a receiver operating characteristic (ROC) curve is created by plotting the true positive rate against the false positive rate at various threshold settings. The AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. These quantities give an overall picture of how each model balances the different types of classification error.
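A short sketch of these metrics, assuming binary ground-truth labels and predicted clean probabilities, with clean as the positive class (our reading of the text):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, f1_score,
                             roc_auc_score, confusion_matrix)

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5):
    """Compute the metrics of this section at a given classification threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "FPR": fp / (fp + tn),   # false positives over negative samples
            "FNR": fn / (fn + tp),   # false negatives over positive samples
            "F1": f1_score(y_true, y_pred),
            "AUC": roc_auc_score(y_true, y_prob)}  # threshold-free
```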

Evaluation datasets
We used two datasets to evaluate the models' performance. Firstly, the test set that consists of 1000 samples from the same original dataset as the training data (section 2.1.1); there was no overlap between the training and test datasets. Secondly, an independent test set, again of 1000 samples, randomly selected from the dataset created in Zanon et al (2020) as described in section 2.1.3 (table 1).
We reported the evaluation metrics for both test datasets, using the standard classification threshold of 0.5 (section 3.2), as well as the optimal threshold estimated via the ROC curve (section 3.3).

Results
We evaluated the performance of our proposed models on a variety of tasks relating to correctly identifying clean versus noisy PPG signal. We investigated the performance of the fully-supervised model with and without signal reconstruction, our semi-supervised model, as well as the combination of the supervised and semi-supervised models obtained by averaging the resulting probabilities of the two models.
We compared our proposed models against the HRV multivariate quality metric described in Zanon et al (2020). The multivariate quality metric is a continuous variable, with values closer to 0 indicating a clean signal and values higher than 20 a noisy signal. To match our classification output, which ranges in [0, 1], we rescaled the multivariate quality metric to the same range, where 0 indicates a noisy and 1 a perfectly clean signal.

Optimal hyperparameter selection
Out of all the hyperparameters we experimented with (section 2.3.1), the optimal parameters for each model were selected based on optimizing the F1-score when tested using the independent test set and 0.5 classification threshold.The optimal hyperparameters selected for each model are described in table 2.

Model performance comparison
First, we evaluated all models' performance on the test set (table 1). For all models, the accuracy was over 98% when using 0.5 as the classification threshold. Specifically, the supervised model with no signal reconstruction, the supervised model with signal reconstruction and the semi-supervised model (the latter two when using the optimal MSE loss weight) achieved accuracies of 98.2%, 98.3% and 98.4%, respectively. The accuracy is expected to be high since the test set comes from the same dataset as the training one (even though the two do not overlap), but we present it here for completeness and validation purposes.
Next, we compared the performance of the different models on the more challenging, independent test set (table 3). Our supervised model performed best in terms of overall accuracy, outperforming the multivariate quality metric by 8.7%. Even though the accuracy and precision did not vary considerably among the remaining methods, the false positive and false negative rates did. Therefore, depending on the application of the algorithm and what is most important for each specific case, one could select the optimal algorithm. For example, if minimizing the false positive rate is crucial for the application at hand and the false negative rate is not important, then the supervised model with the optimal MSE loss weight would be the best option, since it minimizes the false positive rate to 5.5% at an accuracy of 91.9%. On the other hand, if minimizing the false negative rate is crucial, then the semi-supervised model with the optimal MSE loss weight would be the best option, with only a 6.4% false negative rate at the same accuracy.
Even though the semi-supervised model did not achieve the highest accuracy with this classification threshold (i.e. 91.9%, table 3), it optimized for both the false positive and false negative rates. This would be crucial in the case of detecting anxiety via PPG and HRV: we need to avoid misclassifying signals as clean (as they would add noise to our data, possibly misleading us on whether the subject actually has anxiety), while identifying as much of the actually clean signal as possible is equally important to have sufficient data to draw conclusions about subjects. Moreover, being able to evaluate the signal reconstruction output helps us gain confidence that the algorithm is learning the important characteristics of the sensor signal. Figure 6 shows the model predictions for noisy (0) and clean (1) signals. The semi-supervised model was on average the one with the lowest variance in predictions for positive and negative samples, with the highest distance between predictions of the two classes.
On the other hand, the supervised models with and without signal reconstruction had a much higher variance in the predictions for positive samples, showing that the models predict negative samples more confidently than positive ones (validated by the false positive and negative rates in table 3). Moreover, we saw that the rescaled multivariate quality metric had the lowest accuracy (table 3). Note that the multivariate quality metric also had the lowest distance in predictions between the two classes (figure 6), which is expected since we rescaled a positive continuous variable.
When comparing the ROC curves of all models (figure 7), we saw that all models outperform the baseline (i.e. the multivariate quality metric). We computed the ROC curves using the classification probabilities of each model, and the optimal MSE loss weight where applicable. The semi-supervised model had the highest AUC (97.6%), followed by the combined model (97.5%), the supervised model with no signal reconstruction (97%), the supervised model with signal reconstruction (96.6%) and finally the multivariate quality metric (90.5%). This model ordering was also validated in terms of overall performance (i.e. maximizing accuracy, and minimizing false positive and false negative rates) in table 3.
When we split the accuracy of each model per activity in the protocol (figure 8), we saw that all proposed models outperform the multivariate quality metric across all activities, while the differences between the proposed models are much smaller. Interestingly, physical activities (e.g. physical activity and gaming) had higher accuracy than non-physical activities (e.g. resting time and breathing), where we would expect more clean signal. That was because the noisy signal during physical activities was usually so different from a clean signal that it could not be mistaken for one. On the other hand, the noisy signal in the non-physical activities could still be, for example, half clean and half noisy, which made it a harder problem for the classifier, resulting in lower accuracy compared to the physical activities.

Using an optimal classification threshold
Noticing that the different models control for false positives and false negatives in different ways, we further investigated the effect of tuning the classification threshold on the proposed approaches. In the context of this work, we set the optimal classification threshold on the test set as the threshold that maximizes the true positives and minimizes the false positives based on the ROC curve, but we still tested this threshold on the independent dataset. In the future we will use a third dataset for setting the threshold, to be unbiased and generalise better to new data.
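One common way to realize 'maximize true positives, minimize false positives' on the ROC curve is Youden's J statistic (TPR minus FPR); the text does not name its exact criterion, so this sketch is one plausible reading:

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Pick the threshold maximizing TPR - FPR (Youden's J) on the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    return float(thresholds[np.argmax(tpr - fpr)])
```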
In table 4, we see that with careful tuning of the classification threshold, the performance of the semi-supervised model with the optimal MSE loss weight could be significantly improved in terms of accuracy, while achieving the lowest false negative and positive rates and the highest F1-score compared to all other models. Moreover, the same model outperforms all models using the standard 0.5 classification threshold (table 3) in F1-score, while optimizing the trade-off between the false positive and negative rates.

Signal reconstruction
The autoencoder contributed to the learning process via the signal reconstruction and the MSE loss comparing the reconstructed to the original signal (section 2.2). Looking deeper into its impact, we wanted to compare our supervised model with signal reconstruction against the baseline (i.e. the multivariate quality metric). Figure 9 shows two examples of PPG signal that the baseline identified as clean (i.e. multivariate quality metric value below 20), with the original signal in blue/solid and the reconstructed one in red/dotted, even though the right example was a noisy signal. On the other hand, our supervised model not only was able to very accurately reconstruct the peaks of the clean (left) and noisy (right) signal, but it also correctly classified the left signal as clean and the right one as noisy, unlike the multivariate quality metric model.
Leveraging all the new unlabeled data in the semi-supervised model helped improve the signal reconstruction and the predictions in some cases where the supervised model underperformed (e.g. identifying and correcting wrong borderline classifications). Figure 10 shows the same signal reconstructed using the supervised (left) and the semi-supervised (right) model. The signal used is an acceptable, usable signal despite the short peak around the 150th sample, which could be misclassified as noisy by an overly strict algorithm. As described in Elgendi (2016), there can be excellent, acceptable and unfit PPG signals for HR or HRV feature calculation. We wanted to leverage as much signal as possible and therefore wanted acceptable signals labeled as clean in our application. The supervised model performed poorly in the signal reconstruction, wrongly classifying the signal as noisy. The semi-supervised model improved the signal reconstruction (clearer identification of the signal peaks) and also corrected the label of the signal to clean, increasing the model accuracy. Overall, we observed that the semi-supervised model was able to reconstruct the signals more accurately and eventually obtain a higher true positive rate, as seen previously in table 3. Finally, we can use this signal reconstruction component of our proposed models to add an element of interpretability to the output of the classifier, understanding why the given model might perform poorly with certain input test data.

Novelty and principal findings
The novelty of this work is twofold. Firstly, we proposed a novel deep learning approach to physiological sensor data cleaning, leveraging both labeled and unlabeled data. We showed that in the case of PPG sensor signal our approach can achieve more than 90% accuracy while minimizing the false positive and false negative rates, outperforming other similar approaches in the literature. Secondly, our methodology could be applied or transferred to any kind of sinusoidal physiological sensor signal, filling the gap in the literature for a generic model to tackle the sensor data cleaning problem. At the same time, we made use of the vast amount of largely available unlabeled data together with a limited amount of annotated training data, and we avoided the use of additional sensors (e.g. accelerometers) to help identify motion artifacts.

Limitations
A limitation of this work, which we plan to address in the future, is that we have not yet applied the proposed deep learning methodology to other sensor data such as electroencephalogram (EEG) and ballistocardiographic (BCG) signals, to evaluate the model accuracy on these different and challenging physiological signals.
The noise in physiological sensor signals is not specific to a certain device model/brand, but is shared across all sensors of a specific type (e.g. PPG). Certain smartwatches/sensors can be more prone to noise depending on the placement of the sensor and the design of the watch (i.e. how tight or stable the smartwatch is on the wrist). Moreover, noise is tied to movement. Therefore, we could combine the deep learning methodology with heuristic-based results from accelerometer sensors (i.e. to detect movement), and assume a PPG signal is noisy if there has been motion/activity according to the heuristic. However, this approach would make our methodology dependent on the coexistence of accelerometer or other sensors, and it would not scale to different types of sensors whose noise comes from other sources. In this manuscript our aim was to present a generic methodology, which anyone can take and combine with any other relevant sensors and heuristics that apply in their scenario of use.
Another limitation of the current model is that all training and test PPG data (both labeled and unlabeled) are from healthy volunteers with no known heart or anxiety related condition. This might pose an issue when the test dataset comes from populations with heart issues or anxiety disorders. However, we expect the semi-supervised model to be able to handle such scenarios if the unlabeled data provided during training are from the same population. We are currently looking into this limitation by applying our proposed methodology to a clinical study population that consists of participants with autism spectrum disorder (ASD) and various levels of anxiety. The results of this analysis will be included in a future publication focusing only on the clinical study.

Related work
Existing approaches for clean signal segment selection include heuristic and classification-based approaches. In the case of heuristic approaches, no training or annotated data is needed. For example, Bhowmik et al (2017) use physiological signal thresholds to filter out noisy signal: if certain statistics (e.g. estimated heart rate) are within certain physiological thresholds, then the signal is marked as clean/acceptable. This approach, though, could lead to filtering out edge cases, which may be the interesting ones depending on the application at hand (e.g. indicating high anxiety, or other conditions). However, such approaches, as well as Zanon et al (2020), are more easily interpretable because features and physiological thresholds have an easy to understand meaning. By adding a signal reconstruction component to the loss of our proposed models in this work, we also attempt to add an element of interpretability to the output of the classifier, at the cost of more false negatives.
In classification-based approaches, training and annotated data are needed. Pereira et al (2019) use a supervised deep learning model (i.e. ResNet18), reporting an accuracy of 98% on a test set that comes from the same dataset as their training set, matching our supervised model's accuracy on our test set. However, the authors have not evaluated their model's performance on an independent dataset, as we did. Moscato et al (2022) use PPG features together with the accelerometer signal to classify PPG signal as noisy or not, achieving accuracy over 95% using an SVM classifier. The authors employ various machine learning methodologies, with the SVM classifier outperforming the neural network one using the same features as input. In our work, we only used the raw PPG signal (instead of PPG features together with accelerometer signal), which makes the task more challenging for our model, but also makes it applicable to data derived from sensors that are not equipped with an accelerometer. At the same time, we achieve similar accuracy (over 90%) with less information (i.e. no accelerometer signal to filter out motion artifacts).
Elgendi (2016) and Sabeti et al (2019) also use support vector machines (SVMs) and features derived from the PPG signal, reporting an F1-score of 87.2% (compared to our 92.7% F1-score in the case of the supervised model), and an accuracy of 83% on their test set that comes from the same original dataset as the training set (compared to our 98%), respectively. However, features derived by a neural network might better describe the signal characteristics compared to features humans would commonly think to develop. Therefore, a neural network approach might be a better solution to the signal quality classification task than SVMs. Some other approaches include waveform morphology analysis combined with a decision-tree classifier (Sukor et al 2011), developing an artifact detector on the PPG signal (Robles-Rubio et al 2013), or constructing a PPG signal quality index based on features derived from the signal (Orphanidou et al 2014, Zanon et al 2020). Esgalhado et al (2021) proposed a CNN-LSTM classifier for PPG signal quality classification and reported an accuracy of 89.4% (i.e. 3% lower than our best reported accuracy). Moreover, their approach involves more complicated preprocessing of the input data (i.e. a synchrosqueezed Fourier transform) than our simpler and more commonly used filtering of the raw time-series PPG input.
The authors in Yoon et al (2019) use a dataset with ECG signals from the intensive care unit (ICU). ICU datasets usually have very minimal noise, as people are mostly in a resting position, so almost no motion artifacts impact the signal quality. On the other hand, PPG signals from everyday life (like the ones we are using) are much more challenging and closer to real-world data. Moreover, the authors use a very small test dataset with only 300 samples (compared to our 1000 samples), and they propose a very simplistic supervised model with a small receptive field. As far as the results are concerned, the authors propose a very low cutoff value as the classification threshold instead of the standard 0.5, indicating that their model is possibly overfitting towards the noisy signal. Finally, their reported precision and recall are 74% and 89%, respectively, while our semi-supervised model outperforms this work with a precision of 93.2% and a recall of 91.9% at the standard classification threshold of 0.5.

Learnings and future work
Physiological sensor data cleaning is crucial for all applications, including those in decision making for clinical care and drug development. Insights can only be as good and reliable as the quality of the sensor data. If a large amount of noisy and unreliable sensor data is included in the data analysis, it will only add noise to the insights, making them unreliable and possibly misleading. Therefore, we proposed a novel set of deep learning methodologies that can tackle this problem with high accuracy. Our different models can satisfy different optimization criteria in terms of minimizing false positives or false negatives, depending on the needs of each application.
Moreover, we showed that the vast amount of unlabeled data available can be used to train neural networks and to identify meaningful aspects of the signal. That can help the network learn better representations of the physiology of the signal, reconstruct it, and assist in signal quality classification. Our semi-supervised model employed unlabeled data during training and achieves an accuracy of almost 92% while minimizing false positives and negatives, eventually providing more reliable and useful physiological information containing minimal noise and maximum clean signal. As part of our future work, we plan to apply this methodology to unsupervised real-world patient data during clinical trials. A first application would be using our proposed models for cleaning PPG sensor data collected in the context of a clinical study in order to assess the patients' anxiety levels via HRV.
Next, we intend to incorporate the activities performed during each signal used in the semi-supervised model, since different activities imply different levels of mobility, and hence more motion artifacts and a higher likelihood that the signal is not clean. Together with a human activity recognition model already deployed on our clinical trial patient data (Cheng et al 2017), we could make more reliable predictions and minimize false positives even further, by making the model more aware of the especially difficult signals (i.e. half clean and half noisy, as in the case of non-physical activities).
Finally, correctly calibrating the optimal classification threshold requires care. In real clinical settings it may require an additional held-out validation set. It may also be possible to incorporate the reconstruction performance of the semi-supervised model to aid in calibration.

Conclusions
In this work, we tackled the problem of physiological sensor data cleaning, leveraging the limited annotated datasets and the vast amounts of unlabeled data in a novel semi-supervised deep learning framework. Our results showed that in the case of PPG sensor signals, the proposed semi-supervised approach achieves an accuracy of almost 92% while minimizing false positives and negatives. This way we can trust that the signal we retain as 'clean' will contain minimal noise, enabling the most accurate clinical interpretation of the derived features. We showed that with careful tuning of the classification threshold, the semi-supervised model can outperform in accuracy all other models and the baseline HRV multivariate quality metric (by more than 8%). Finally, we added an element of interpretability by including a signal reconstruction component in our models.
The proposed methodology enabled us to reliably separate clean from noisy physiological sensor signal, which can pave the way for the development of reliable features and eventually support decisions regarding drug efficacy in clinical trials.
As a next step, we plan to apply this methodology to PPG data collected during a clinical study to extract reliable HRV features and identify meaningful associations with clinical markers related to anxiety. Finally, in our future work we plan to show how our methodology could be applied to other types of continuous physiological signals that are prone to noise, separating noisy from clean physiological signals for clinical analyses that can improve people's lives.

Figure 4. Deep learning architecture with no signal reconstruction. The blue block shows that L2-regularization was used in a certain layer. The green blocks indicate the activation function (i.e. darker green for the ReLU and lighter green for the sigmoid ones). All CNN blocks have causal padding (yellow block) and blocks two to five have dilation of 2, 4, 8 and 16, respectively (orange block).

Figure 5. Deep learning architecture with signal reconstruction, which can be used for supervised or semi-supervised learning. The blue block shows that L2-regularization was used in a certain layer. The green blocks indicate the activation function (i.e. darker green for the ReLU and lighter green for the sigmoid ones). All CNN blocks have causal padding (yellow block) and blocks two to five have dilation of 2, 4, 8 and 16, respectively (orange block). Note that the label mask input is only needed for the semi-supervised model, where additional unlabeled (i.e. label mask = 0) data is used for learning the signal morphology.

Figure 6. Model raw predictions when using the independent test set (table 1) for all evaluated models. Predictions closer to 0 indicate a predicted noisier signal and predictions closer to 1 a predicted cleaner signal. The label 0 (noisy) or 1 (clean) indicates the ground truth of the specific input test signal. Optimal MSE loss weights used for the models displayed.

Figure 7. ROC curve comparison across all models. Optimal MSE loss weights used for the models displayed.

Figure 8. Accuracy of all models per protocol activity, in the order they were performed. Optimal MSE loss weights used for the models displayed.

Figure 9. Signal reconstruction using the supervised model, with equal weights used for the classification and the signal reconstruction losses. The left figure shows a clean signal identified correctly by both the supervised and the multivariate quality metric models. The right figure depicts a noisy signal that was correctly labeled as noisy by the supervised model, but wrongly labeled as clean by the multivariate quality metric model.

Figure 10. Signal reconstruction example of the same test input signal using the supervised (left) and semi-supervised model (right), with equal weights used for the classification and the signal reconstruction losses. The multivariate quality metric model correctly labeled the signal as clean. The supervised model performed poorly, wrongly classifying the signal as noisy, while the semi-supervised model correctly labeled the signal as clean.
Our proposed models not only outperformed feature-based (Elgendi 2016, Sabeti et al 2019, Zanon et al 2020) and deep learning (Pereira et al 2019, Yoon et al 2019, Esgalhado et al 2021) models, but at the same time the signal reconstruction component in our learning process enables us to visualize and understand why the model might wrongly label certain input signals, adding to the explainability and interpretability of our approach.

Table 1. Datasets used. All signals are non-overlapping.

Table 2. Optimal hyperparameters selected for each model, based on optimizing the F1-score when tested using the independent test set and a 0.5 classification threshold.

Table 3. Model accuracy comparison when using the independent test set (table 1) and a 0.5 classification threshold. FPR: false positive rate; FNR: false negative rate.

Table 4. Model accuracy comparison when using the independent test set (table 1) and the optimal classification threshold. FPR: false positive rate; FNR: false negative rate.