Automatic sleep staging of EEG signals: recent development, challenges, and future directions

Huy Phan; Kaare Mikkelsen

doi:10.1088/1361-6579/ac6049

1. Introduction

Sleep makes up almost one third of our lives. Good sleep is crucial for maintaining one's mental and physical health (Maquet 2001, Siegel 2005), and sleep disorders are linked with a host of different ailments (Perez-Pozuelo et al 2020). The screening, assessment and diagnosis of a sleep disorder requires 30-second epochs of an overnight polysomnogram (PSG) to be assigned a sleep stage. This procedure is known as sleep staging or scoring. The sequence of sleep stages is critical for measuring parameters of the sleep macrostructure, such as the sleep cycles, the time spent in each stage, sleep latency, wake after sleep onset, etc. Sleep stages and cycles, which manifest the underlying neurophysiological processes, are also a rich source for mining the diagnostic markers of a wide range of sleep disorders (Norman et al 2006, Christensen et al 2015, Stephansen et al 2018, Cooray et al 2019), from common ones like obstructive sleep apnea (Redline et al 1999, Senaratna et al 2017) to rare ones such as narcolepsy (Christensen et al 2015, Stephansen et al 2018).
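To make the macrostructure parameters mentioned above concrete, the following minimal sketch derives a few of them from a hypnogram given as one stage label per 30-second epoch. The stage labels, the function name `macrostructure`, and the exact definitions (sleep onset as the first non-wake epoch, recording assumed to start at lights-off, wake after sleep onset counted to the end of the recording) are assumptions made for illustration; clinical definitions may differ in detail.

```python
EPOCH_SEC = 30  # epoch length stated in the text

def macrostructure(hypnogram, epoch_sec=EPOCH_SEC):
    """Summarise a hypnogram (one stage label per epoch); all values in minutes."""
    minutes = epoch_sec / 60.0
    time_in_stage = {}
    for stage in hypnogram:
        time_in_stage[stage] = time_in_stage.get(stage, 0.0) + minutes

    # Assumption: sleep onset is the first epoch not scored as wake ("W").
    onset = next((i for i, s in enumerate(hypnogram) if s != "W"), None)
    if onset is None:  # no sleep was scored at all
        return {"time_in_stage_min": time_in_stage, "sleep_latency_min": None,
                "waso_min": None, "total_sleep_time_min": 0.0}

    # Assumption: WASO counts every wake epoch from sleep onset to the end.
    waso = sum(1 for s in hypnogram[onset:] if s == "W") * minutes
    total_sleep = sum(1 for s in hypnogram if s != "W") * minutes
    return {"time_in_stage_min": time_in_stage,
            "sleep_latency_min": onset * minutes,
            "waso_min": waso,
            "total_sleep_time_min": total_sleep}

# Toy hypnogram of ten epochs (made-up data, far shorter than a real night).
toy_hypnogram = ["W", "W", "N1", "N2", "N2", "W", "N2", "N3", "REM", "W"]
summary = macrostructure(toy_hypnogram)
print(summary)
```

Because every quantity is a simple count over the stage sequence, any error in the staging propagates directly into these parameters, which is one reason staging accuracy matters clinically.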

Sleep staging is still largely carried out by clinicians in sleep clinics and guided by a well-established manual, published by the American Academy of Sleep Medicine (Berry et al 2016). This labor-intensive and time-consuming manual-scoring method is unsuitable for handling large-scale data and cannot be scaled to serve the needs of the millions suffering from sleep disorders (Institute of Medicine 2006, Krieger 2017, Chattu et al 2019). At the same time, there is an increasing need for longitudinal monitoring in home environments. Accurate and cost-effective monitoring of sleep not only has great medical value but also allows individuals to self-assess and self-manage their sleep. Thus, it is imperative for sleep staging to be automated. The fact that it follows a predefined set of rules makes sleep staging a perfect task for automation via machine learning. Furthermore, a machine can perform the task thousands of times faster than a human expert, thus saving a clinician thousands of hours a year and making sleep assessment and diagnosis more widely available.

Indeed, given the standardization of data through the use of PSG, many developments in the field have already taken place. In particular, the existence of increasingly large public datasets online (for example, PhysioNet (Goldberger et al 2000) and the National Sleep Research Resource (Zhang et al 2018)) has enabled the exploitation of deep learning (LeCun et al 2015, Goodfellow et al 2016) to teach a machine to perform sleep staging using a large amount of training data. These efforts have led to more advanced and practical methods (Supratak et al 2017, Biswal et al 2018, Phan et al 2019, Guillot and Thorey 2021, Olesen et al 2021, Phan et al 2021, 2022), which have surpassed the agreement level of experts' scoring and achieved a performance acceptable for clinical use. Notwithstanding this tremendous progress, machine sleep scoring still needs to overcome several technical and clinical barriers to be widely adopted and deliver full clinical value. We see great potential for further development. First, from an algorithmic perspective, we consider sleep staging to be an interesting modelling problem where novel methods can be developed to tackle the foreseeable obstacles and pave the way for clinical usage. Second, new recording platforms are emerging (Mikkelsen et al 2018, Miettinen et al 2018, Arnal et al 2019, Mikkelsen et al 2019) that simplify the sleep setup for home-based monitoring purposes.

In this review article, we begin by discussing the clinical context of automatic sleep scoring, after which we give an overview of, as well as technical insights into, the state-of-the-art methods for automatic scoring of electroencephalography (EEG) data. Readers should note that a few existing reviews, such as those by Fiorilli et al (2019) and Faust et al (2019), have summed up the topic prior to 2019. A review on the broader applications of deep learning to EEG analysis also exists (Roy et al 2019). To avoid re-inventing the wheel, this article focuses on the latest developments in automatic sleep staging. In addition, we limit the scope of this article to fine-grained (i.e., five-stage) sleep staging using PSG and modalities that directly read brain activity, such as mobile EEG, and will not cover research work using other modalities, such as electrocardiogram (ECG)/photoplethysmography (PPG) (Radha et al 2021), actigraphy (Zhai et al 2020), audio (Dafna et al 2018), video (Long et al 2019) and radar (Toften et al 2020, Piriyajitakonkij et al 2021). We then discuss the current challenges and suggest future directions. As we shall discuss in the following sections, much good work has already been done on this problem. However, as will become clear in the rest of this review, we believe that the field has only solved the first, 'entry', problem, and a plethora of new and exciting tasks lie ahead of us.

2. Clinical context

Manual sleep scoring is a somewhat reliable, highly versatile method, which readily yields interpretations and which is standardized across the world. This has made it a good solution, but also a local optimum which is hard to escape. By this we mean that it is not the best possible solution, due to a number of drawbacks:

1. It is very time-consuming (and therefore expensive) to manually score an entire night's recording, even more so if sleep events are also to be annotated.
2. Despite the existence of a sleep-scoring standard, there is still variation between individual scorers.
3. The sleep-scoring manual is based on the PSG recording setup, which is generally considered to be unwieldy and invasive. This, combined with the cost of each recording, means that clinicians will usually have to 'make do' with a single night of data (at most two), which may not be as representative of the patient's usual nights as one would hope.

It should come as no surprise that the properties of manual sleep scoring have shaped the way sleep monitoring is used—few recordings per subject, qualitative (non-data driven) analysis. This can make it hard, within the clinical reality, to immediately see the benefits of a new method (automatic sleep scoring with other sensor setups). Figuratively speaking, if you have learned to solve all problems using nails, it is hard to see how a screwdriver can compete with your hammer.

Automatic sleep scoring can reduce the costs of existing procedures (PSG recordings either in lab or at home), but also open the door to new ways of using sleep clinically, which today would be infeasible. We can imagine population-wide screening for early stages of debilitating diseases (e.g., REM sleep behavior disorder (RBD) is known to be tightly associated with Parkinson's disease (Lin and Chen 2018)), or routine follow-up procedures quantifying patient sleep after they leave the hospital. These procedures could have very real clinical benefits, but they all require changes to how sleep recordings are used and managed, not to abolish existing procedures, but to supplement them.

An algorithm-first approach to clinical sleep can also solve other problems. First, given the costs of a PSG recording, clinicians may often have to 'make do' with whatever recordings they get, even if the quality is questionable. However, if the standard quantum becomes a week's worth of data, automatic discarding of low-quality nights would be trivial. Second, definitions surrounding sleep have been developed and evaluated based on how well they can be used in manual sleep scoring. Computers have far fewer restrictions in this respect, and we can imagine more flexible taxonomies, such as hypnodensity plots (Stephansen et al 2018) or even disease-specific sleep states.
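As an illustration of such a more flexible taxonomy, a hypnodensity can be represented as one probability distribution over the sleep stages per epoch, from which a conventional hypnogram is recovered by taking the most probable stage. The sketch below uses made-up probabilities; the per-epoch entropy as a measure of ambiguity is an illustrative choice of ours and is not taken from the cited work.

```python
import math

STAGES = ["W", "N1", "N2", "N3", "REM"]

def hypnogram_from_hypnodensity(hypnodensity):
    """Collapse each per-epoch distribution to its most probable stage."""
    return [STAGES[max(range(len(p)), key=p.__getitem__)] for p in hypnodensity]

def epoch_entropy(p):
    """Shannon entropy in bits; larger values indicate a more ambiguous epoch."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Made-up hypnodensity for five epochs; each row sums to one.
toy_hypnodensity = [
    [0.90, 0.05, 0.03, 0.01, 0.01],
    [0.30, 0.45, 0.15, 0.05, 0.05],
    [0.05, 0.10, 0.70, 0.10, 0.05],
    [0.02, 0.03, 0.35, 0.55, 0.05],
    [0.10, 0.15, 0.10, 0.05, 0.60],
]
toy_stages = hypnogram_from_hypnodensity(toy_hypnodensity)
toy_entropies = [epoch_entropy(p) for p in toy_hypnodensity]
print(toy_stages)
print([round(h, 2) for h in toy_entropies])
```

The hard hypnogram discards exactly the information that the entropy exposes: the second epoch is labelled N1 but is far less certain than the first, which is labelled wake.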

3. The state-of-the-art sleep scoring

Modern deep learning (LeCun et al 2015, Goodfellow et al 2016) crept into sleep research more slowly than in other fields, such as computer vision, natural language processing and speech recognition. The use of deep neural networks for automatic sleep staging only started around 5 years ago, even though their resurgence was almost a decade ago. Nevertheless, in this short period of time, deep neural networks have produced impactful and meaningful results that were not seen with more conventional machine-learning methods for a long time.

Transitioning from conventional machine learning, the first attempts to use deep learning for automatic sleep staging mainly employed simple networks in a traditional fashion where short input contexts of one to a few sleep epochs around a target epoch were used to predict the sleep stage of the target epoch. Expectedly, an influx of different variants of typical standalone network architectures, such as deep neural networks (Dong et al 2018, Wei et al 2018), convolutional neural networks (CNNs) (Tsinalis et al 2016, Biswal et al 2017, Sun et al 2017, Supratak et al 2017, Vilamala et al 2017, Andreotti et al 2018a, 2018b, Chambon et al 2018, Malafeev et al 2018, Phan et al 2018, Sors et al 2018, Phan et al 2019) and recurrent neural networks (RNNs) (Malafeev et al 2018, Phan et al 2018) (e.g., long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) or gated recurrent units (GRUs) (Cho et al 2014)), was met with limited success. Although these networks are able to learn useful features to represent an input, they are unable to capture long-range dependencies between sleep epochs due to the short input context. The ability to model long-range dependencies plays an important role in improving sleep-staging performance, due to the inherently slow transition nature of the physiological processes behind sleep stages (Hartmann 1968, Feinberg and Floyd 1979). In order to compensate for the lack of long-term modelling ability, once such a network has been trained and each epoch is encoded into an epoch-wise feature vector, an additional RNN (e.g. LSTM) (Supratak et al 2017, Dong et al 2018, Stephansen et al 2018, Sun et al 2020) is separately trained in a second stage to take into account a long sequence of epoch-wise feature vectors prior to a target epoch to classify it. These hybrid networks with two-stage training, initiated by Supratak et al (2017), boosted performance significantly and stood out from the other existing models at the time.

In fact, the positive effect introduced by long-term modelling in the above-mentioned two-stage training scheme is not a surprise. It resembles how manual scoring is done by sleep experts, who normally need to attend to a much larger context around a target epoch in order to determine its label (Berry et al 2016). From a modelling perspective, this was commonly accomplished by using hidden Markov models (for example, see Ghimatgar et al 2020) before the evolution of deep learning. However, the early works (Supratak et al 2017, Dong et al 2018, Stephansen et al 2018, Sun et al 2020) came with some limitations. First, the independent two-stage training of two subnetworks is sub-optimal, since it does not account for the interaction between the epoch-wise feature-learning network and the sequential modelling counterpart, not to mention its inconvenience. Second, even though the sequential-modelling network (i.e., the bidirectional RNN) is structured to receive a sequence of epochs as input, it classifies only one target epoch at a time, which is usually the last epoch in the input sequence. That is, it is tasked with encoding the left-side context of the target epoch in order to make a prediction. Olesen et al (2021) showed that this left-side context often results in lower accuracy than when a more balanced one is used.

Nevertheless, these initial results underscore the importance of long-term modelling in automatic sleep staging. Inspired by these results, since late 2018, the community has witnessed an influx of advanced network architectures with built-in long-term modelling capacity. These networks can be generalized neatly in a common framework, namely the sequence-to-sequence sleep-staging framework (Phan et al 2021). Formally, let us denote an input sequence of $L$ epochs as $(\mathbf{S}_1, \ldots, \mathbf{S}_L)$, where $\mathbf{S}_\ell$ is the $\ell$-th epoch, $1 \le \ell \le L$. In general, the epochs can be in any form, such as raw signals or time-frequency images, and they can be single- or multi-channel. A network adhering to the framework typically consists of two main components: the epoch encoder $\mathcal{F}_E$ and the sequence encoder $\mathcal{F}_S$, as illustrated in Figure 1. The epoch encoder $\mathcal{F}_E: \mathbf{S} \mapsto \mathbf{x}$ acts as an epoch-wise feature extractor which transforms an input epoch $\mathbf{S}$ in the input sequence into a feature vector $\mathbf{x}$ for representation. As a result, the input sequence is transformed into a sequence of feature vectors $(\mathbf{x}_1, \ldots, \mathbf{x}_L)$ (represented by the green circles in the figure). Of note, $\mathcal{F}_E$ can be a hard-coded hand-crafted feature extractor; however, in the deep-learning context, it is often a neural network (e.g., a CNN or an RNN) that learns the feature representation $\mathbf{x}$ automatically from low-level input signals. In turn, at the sequence level, the sequence encoder $\mathcal{F}_S: (\mathbf{x}_1, \ldots, \mathbf{x}_L) \mapsto (\mathbf{z}_1, \ldots, \mathbf{z}_L)$ transforms the sequence $(\mathbf{x}_1, \ldots, \mathbf{x}_L)$ into another sequence $(\mathbf{z}_1, \ldots, \mathbf{z}_L)$ (represented by the red circles in the figure). Intuitively, $\mathbf{z}_\ell$ is a richer representation for the $\ell$-th epoch than $\mathbf{x}_\ell$, as it not only encompasses information about the epoch but also encodes its interaction with other epochs in the sequence. More specifically, $\mathbf{z}_\ell$ is derived from $\mathbf{x}_\ell$, taking into account the left context $(\mathbf{x}_1, \ldots, \mathbf{x}_{\ell-1})$ and the right context $(\mathbf{x}_{\ell+1}, \ldots, \mathbf{x}_L)$. Eventually, the vectors $\mathbf{z}_1, \ldots, \mathbf{z}_L$ are used for classification purposes to obtain the sequence of predicted sleep stages, one for each epoch in the input sequence.
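The data flow of this framework can be sketched in a few lines. The sketch below is deliberately not a neural network: the epoch encoder is a hand-crafted stand-in (which the framework permits), the sequence encoder simply concatenates each feature vector with the averages of its left and right contexts, and `toy_classify` is a hypothetical threshold rule. All function names and the toy data are assumptions for illustration; in the systems reviewed here both encoders are learned and trained jointly.

```python
def epoch_encoder(epoch):
    """Epoch encoder: map one epoch S (a list of samples) to a feature vector x.

    Hand-crafted stand-in: the mean and the mean absolute first difference.
    """
    n = len(epoch)
    mean = sum(epoch) / n
    activity = sum(abs(b - a) for a, b in zip(epoch, epoch[1:])) / max(n - 1, 1)
    return [mean, activity]

def sequence_encoder(xs):
    """Sequence encoder: map (x_1..x_L) to (z_1..z_L).

    Each z_l is x_l concatenated with the average of its left context
    (x_1..x_{l-1}) and the average of its right context (x_{l+1}..x_L).
    """
    dim = len(xs[0])

    def average(vectors):
        if not vectors:  # empty context at either end of the sequence
            return [0.0] * dim
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

    return [x + average(xs[:i]) + average(xs[i + 1:]) for i, x in enumerate(xs)]

def stage_sequence(epochs, classify):
    """Label every epoch in the sequence at once (sequence-to-sequence)."""
    xs = [epoch_encoder(s) for s in epochs]
    zs = sequence_encoder(xs)
    return [classify(z) for z in zs], zs

def toy_classify(z):
    """Hypothetical rule: call an epoch wake if its own activity is high."""
    return "W" if z[1] > 0.5 else "N2"

toy_epochs = [[0, 0, 0, 0], [0, 1, 0, 1], [2, 2, 2, 2]]  # made-up signals
toy_labels, toy_zs = stage_sequence(toy_epochs, toy_classify)
print(toy_labels)
```

Note that the output has one label per input epoch, and that each `z` depends on the whole sequence, so the same epoch would receive a different representation if its neighbours changed.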

**Figure 1.** A schematic diagram of sequence-to-sequence sleep staging. Effectively, different epochs in the input sequence are influenced by different contexts, illustrated by the shaded regions in the sequence encoder block, depending on their absolute position in the sequence. The epoch encoder plays the role of an epoch-wise feature extractor that transforms an epoch into a feature vector representation, illustrated by a green circle. The sequence encoder enriches the representation, illustrated by a red circle, by incorporating the interaction of each epoch with other epochs in its context.

The framework offers two advantages that help networks adhering to it overcome the limitations of the earlier proposals from Supratak et al (2017), Dong et al (2018), Stephansen et al (2018) and Sun et al (2020). First, both $\mathcal{F}_E$ and $\mathcal{F}_S$ are optimized jointly in an end-to-end fashion, allowing the two network components to interact. Second, the networks are tasked with solving a sequence-to-sequence classification problem, i.e., sequence-to-sequence sleep staging. In other words, a network classifies all the epochs in an input sequence at once rather than only targeting the last epoch. Due to this sequence-to-sequence scheme, different epochs in the input sequence are, in essence, influenced by different contexts depending on their absolute position in the sequence. This is illustrated by the shaded regions in Figure 1. Leveraging this property, sampling and advancing the sequence by one epoch at a time will result in $L$ decisions for a particular epoch. These decisions are associated with different contexts; thus, forming an ensemble from them has been shown to lead to performance improvement (Phan et al 2019, 2021).
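The ensemble over overlapping sequences can be sketched as follows. The aggregation rule used here, summing log-probabilities over all windows that cover an epoch, is an assumption made for illustration, as is the toy model `toy_predict_window`, which returns made-up probabilities and is deliberately wrong for one window.

```python
import math

def ensemble_decisions(num_epochs, seq_len, predict_window):
    """Slide a window of seq_len epochs one epoch at a time and combine
    all the decisions made for each epoch.

    predict_window(start) must return seq_len probability vectors for
    epochs start .. start + seq_len - 1. Assumes num_epochs >= seq_len.
    """
    log_scores = [None] * num_epochs
    counts = [0] * num_epochs
    for start in range(num_epochs - seq_len + 1):
        for offset, probs in enumerate(predict_window(start)):
            idx = start + offset
            logs = [math.log(max(q, 1e-12)) for q in probs]
            if log_scores[idx] is None:
                log_scores[idx] = logs
            else:
                log_scores[idx] = [a + b for a, b in zip(log_scores[idx], logs)]
            counts[idx] += 1
    decisions = [max(range(len(s)), key=s.__getitem__) for s in log_scores]
    return decisions, counts

TRUE_STAGES = [0, 0, 1, 1, 0, 0]  # made-up two-class ground truth
SEQ_LEN = 3

def toy_predict_window(start):
    """Hypothetical model: correct with probability 0.8 per epoch, except for
    the window starting at epoch 2, where the true class only gets 0.4."""
    confidence = 0.4 if start == 2 else 0.8
    out = []
    for stage in TRUE_STAGES[start:start + SEQ_LEN]:
        probs = [1.0 - confidence, 1.0 - confidence]
        probs[stage] = confidence
        out.append(probs)
    return out

decisions, counts = ensemble_decisions(len(TRUE_STAGES), SEQ_LEN, toy_predict_window)
print(decisions, counts)
```

In this toy case the window starting at epoch 2 mislabels all of its epochs on its own, yet the aggregated decisions recover the ground truth. The counts also show that only interior epochs receive the full $L$ decisions; epochs near the start and end of a recording are covered by fewer windows.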

In Table 1, we give an overview of the automatic sleep-staging systems that are capable of long-term context modelling. The systems are presented in chronological order. On the one hand, most of the systems exploit CNNs, the cornerstone of deep-learning algorithms, for the epoch encoder. The spectrum of the CNN architectures varies from a very basic one (Supratak et al 2017, Seo et al 2020) to specialized ones, such as ResNet (Olesen et al 2021), U-Net (Perslev et al 2019, 2021) and U²-Net (Jia et al 2021). Epoch-wise features can also be learned by capturing sequential information within 30-second signals using RNNs alone (e.g., LSTM (Hochreiter and Schmidhuber 1997) and GRU (Cho et al 2014)) (Phan et al 2019, Guillot et al 2020, Guillot and Thorey 2021) or hybrid networks (e.g., convolutional recurrent neural networks (CRNNs) (Seo et al 2020, Neng et al 2021)). Emerging network architectures like graph convolutional networks (GCNs) (Jia et al 2020) and Transformer (Phan et al 2022) have also been shown to be useful for epoch encoding. On the other hand, RNNs have primarily been employed for sequence encoding due to their well-established capability in sequential modelling. However, inter-epoch sequence modelling can also be accomplished by non-recursive architectures, such as dilated CNNs (Jia et al 2021), self-attention (Eldele et al 2021) and Transformer (Phan et al 2022). It should be noted that not all of the networks in the table are strictly sequence-to-sequence (e.g., DeepSleepNet (Supratak et al 2017), Stephansen et al (2018) and GraphSleepNet (Jia et al 2020)) or end-to-end (e.g., DeepSleepNet (Supratak et al 2017), Stephansen et al (2018) and Sun et al (2020)). However, in principle, they can be framed into the sequence-to-sequence framework and trained end-to-end, as done with the end-to-end sequence-to-sequence variant of DeepSleepNet in Phan et al (2019).

Table 1. Automatic sleep-staging systems that are capable of long-term context modelling, published since late 2018 and sorted in chronological order. The reported performances are presented in terms of Cohen's kappa (Cohen 1960). Where Cohen's kappa is not available, the macro F1-score (indicated with the superscript ^f) or the overall accuracy (indicated with the superscript ^a) is presented instead. Note that not all of the networks here are strictly sequence-to-sequence and/or end-to-end.

Network	Year	Input	Epoch Encoder	Sequence Encoder	EDF-20 Kemp et al (2000)	EDF-78 Kemp et al (2000)	MASS O'Reilly et al (2014)	Physio-2018 Ghassemi et al (2018)	SHHS Quan et al (1997)	DOD-H Guillot et al (2020)	DOD-O Guillot et al (2020)	ISRUC Khalighi et al (2016)	CAP Terzano et al (2002)	SVUH-UCD Goldberger et al (2000)	MESA Chen et al (2015)	MrOS Blackwell et al (2011), Song et al (2015)	CHAT Redline et al (2011), Marcus et al (2013)	Other
DeepSleepNet Supratak et al (2017), Phan et al (2019)	2017^*	Raw	CNN	RNN	0.760	0.702	0.800	—	—	0.843	0.804	—	—	—	—	—	0.848	—
SeqSleepNet Phan et al (2019)	2018	Time-freq.	RNN	RNN	0.809	0.776	0.815	0.733	0.838	0.804	0.772	—	—	—	—	—	0.854	—
Stephansen et al (2018)	2018	Corr. encoding	CNN	RNN	—	—	—	—	—	—	—	—	—	—	—	—	—	0.868^a
Biswal et al (2018)	2018	Time-freq.	CNN	RNN	—	—	—	—	—	—	—	—	—	—	—	—	—	0.805
SleepEEGNet Mousavi et al (2019)	2019	Raw	CNN	RNN	0.790	0.730	—	—	—	—	—	—	—	—	—	—	—	—
SimpleSleepNet Guillot et al (2020)	2019	Time-freq.	RNN	RNN	—	—	—	—	—	0.846	0.823	—	—	—	—	—	—	—
Chen et al (2019)^†	2019	Raw	CNN	RNN, CRF	0.820	—	—	—	—	—	—	—	—	—	—	—	—	0.670
IITNet Seo et al (2020)	2019	Raw	CRNN	RNN	0.780	0.790	—	—	0.810	—	—	—	—	—	—	—	—	—
U-Time Perslev et al (2019)/U-Sleep Perslev et al (2021)	2019	Raw	U-Net	CNN	0.790^f	0.760^f	0.800^f	0.770^f	0.800^f	0.820^f	0.790^f	0.770^f	0.680^f	0.730^f	0.790^f	0.770^f	0.850^f	0.850^f
TinySleepNet Supratak and Guo (2020)	2020	Raw	CNN	RNN	0.800	0.770	0.782	—	—	—	—	—	—	—	—	—	—	—
GraphSleepNet Jia et al (2020)	2020	Raw	GCN	Attention	—	—	0.834	—	—	—	—	—	—	—	—	—	—	—
Olesen et al (2021)	2020	Raw	ResNet	RNN	—	—	—	—	0.871^a	—	—	0.740^a	—	—	—	0.864^a	—	0.864^a
Jaoude et al (2020)	2020	Raw	CNN	RNN	—	—	—	—	—	—	—	—	—	—	—	—	—	0.740
Sun et al (2020)	2020	Raw, hand-crafted	CNN	RNN	—	—	0.795	—	—	—	—	—	—	—	—	—	—	—
Korkalainen et al (2020)	2020	Raw	CNN	RNN	—	0.780	—	—	—	—	—	—	—	—	—	—	—	0.790
Qu et al (2020)	2020	Raw	CNN, ResNet	Self-attention	0.780	—	0.800	—	—	—	—	—	—	—	—	—	—	—
HNSleepNet Chen et al (2020)	2020	Raw	CNN	RNN, attention	0.780	—	0.810	—	—	—	—	—	—	—	—	—	—	—
Li et al (2020)	2020	Raw	CNN	RNN, attention	0.790	—	—	—	—	—	—	—	—	—	—	—	—	—
FCNN+RNN Phan et al (2021)	2020	Raw	Fully CNN	RNN	0.775	0.759	0.806	0.738	0.809	—	—	—	—	—	—	—	0.847	—
XSleepNet Phan et al (2021)	2020	Raw, time-freq.	CNN, RNN	RNN	0.813	0.778	0.823	0.746	0.847	—	—	—	—	—	—	—	0.857	—
RobustSleepNet Guillot and Thorey (2021)	2021	Time-freq.	RNN	RNN	0.817^f	0.779^f	0.825^f	—	0.800^f	0.851^f	0.827^f	—	0.738^f	—	0.795^f	0.756^f	—	—
CCRRSleepNet Neng et al (2021)	2021	Raw	CRNN	RNN	0.780	—	—	—	—	—	—	—	—	—	—	—	—	—
Eldele et al (2021)	2021	Raw	CNN	Self-attention	0.790	0.740	—	—	0.780	—	—	—	—	—	—	—	—	—
RecSleepNet Nie et al (2021)	2021	Raw	CNN	RNN	0.813^f	0.779^f	—	—	—	—	—	0.779^f	—	0.743^f	—	—	—	—
Coon et al (2021)	2021	Raw	CNN	RNN	—	—	—	—	—	—	—	—	—	—	—	—	—	0.769^a
SalientSleepNet Jia et al (2021)	2021	Raw	U²-Net	Dilated CNN	0.830^f	0.795^f	—	—	—	—	—	—	—	—	—	—	—	—
SleepTransformer Phan et al (2022)	2021	Time-freq.	Transformer	Transformer	—	0.789	—	—	0.828	—	—	—	—	—	—	—	0.842	—

*DeepSleepNet was introduced by Supratak et al (2017) and the end-to-end version was presented as a baseline in the SeqSleepNet work by Phan et al (2019).

We also collate the performance of the systems with regard to common public sleep databases in Table 1. These results are obtained either from the original works or from other works where the systems were evaluated. We can see a Cohen's kappa of ≥ 0.81, i.e., an 'almost perfect' agreement level according to the interpretation of Cohen's kappa (Cohen 1960), achieved on databases with a majority of healthy subjects, for example, EDF-20, MASS, SHHS, DOD-H, and DOD-O. However, performance is still substandard on databases associated with sleep pathologies, such as ISRUC and CAP. It is worth stressing that the performance values presented in the table should not be used out of context to justify a network's efficacy or to compare one network to another. The rationale is that potential discrepancies in evaluation setup (e.g., data subsets, the number of channels, etc.) and modelling (e.g., scratch training versus domain adaptation, supervised versus unsupervised learning, etc.) render such a comparison meaningless.
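For readers less familiar with the three metrics used in Table 1, the following sketch computes overall accuracy, Cohen's kappa and the macro F1-score from two label sequences. The toy labels are made up; the formulas are the standard ones, with kappa defined as $(p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance.

```python
def accuracy(y_true, y_pred):
    """Fraction of epochs on which the two label sequences agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    p_o = accuracy(y_true, y_pred)
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if p_e == 1.0:  # both sequences use a single identical label
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up reference and predicted stages for ten epochs.
ref = ["W", "W", "N2", "N2", "N2", "N2", "REM", "REM", "N3", "N3"]
hyp = ["W", "N2", "N2", "N2", "N2", "N2", "REM", "N2", "N3", "N3"]
acc, kappa, f1 = accuracy(ref, hyp), cohen_kappa(ref, hyp), macro_f1(ref, hyp)
print(round(acc, 3), round(kappa, 3), round(f1, 3))
```

On this toy example the three metrics differ noticeably (0.8 accuracy against a kappa of about 0.71), which is a further reason why table entries reported in different metrics should not be compared directly.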

4. Challenges and future directions

In our view, automatic sleep staging with PSG on healthy people has basically been solved, not only for adults (Biswal et al 2018, Phan et al 2021), but also for children (Phan et al 2022). The comparative study by Phan et al (2022) showed that different network architectures under the same sequence-to-sequence framework result in a similar 'almost perfect' consensus level (according to the interpretation of Cohen's kappa (Cohen 1960)) and little discrepancy was seen among their staging outcomes. This suggests that there is probably little room for accuracy improvement within the same sequence-to-sequence framework. Furthermore, the improvement, if any, is not necessarily meaningful.

While automatic sleep scoring of PSG recordings has come a very long way, as described above, there are still challenges to overcome. These, of course, should be viewed as opportunities for innovation. In Figure 2, we give an overview of the challenges around two critical applications, sleep scoring with PSG in clinical spaces and sleep monitoring with wearable EEG in daily living environments, and directions for future work to address the challenges. Before discussing the individual challenges, we feel it is valuable to highlight the overarching contexts of the two applications:

**Figure 2.** Applications, challenges and directions in automatic sleep staging.

Clinical PSG scoring: It is not sufficient to have high-quality PSG scoring of healthy people only. In a clinical setting, the tools applied should be equally capable when confronted with non-textbook sleep phenotypes, where the sleep EEG may either be masked by disease-related artifacts, or where the sleep EEG itself may be so drastically changed by the patient's condition that a correct sleep scoring either requires specialized routines or may even be impossible. An automatic sleep scoring algorithm must be able to handle this situation transparently and reliably. Thus, to obtain more widespread adoption clinically, automatic sleep scoring should be as robust as manual scoring, and deliver outputs which are easy to fit into clinical workflows. In Figure 2, this relates particularly to 'data heterogeneity', 'model interpretability', 'learning with noisy labels' and 'tailored algorithms'.

Wearable EEG scoring: Medical-grade mobile sleep monitoring has great potential for revolutionizing healthcare in screening, diagnosis and follow-up. A high-quality monitoring platform would allow easy recording of weeks of sleep from each individual, without incurring higher healthcare costs or discomfort for the patient. Such data would be much more representative of the patient's actual sleeping patterns, and alleviate the present issues with phenomena which are only periodic, or which may be impacted by patients sleeping in unfamiliar environments. To reap the full benefits from such a monitoring device, we need analysis tools that can be tailored to the individual, detect subtle changes in sleep patterns and describe a patient's sleep in different, quantitative terms than those found in current single-night hypnograms.

The discussion below is structured following the overview outlined in Figure 2.

4.1. Longitudinal monitoring

A particularly interesting development in sleep monitoring, closely related to automatic scoring, is the feasibility of long-term monitoring. With conventional PSG setups, scored by hand, it is generally too expensive, not to mention inconvenient for the wearer, to perform nightly recordings for weeks on end. However, when scoring is free, and when the hardware can be self-applied (Miettinen et al 2018, Arnal et al 2019, Mikkelsen et al 2019) and is relatively unobtrusive, longitudinal monitoring becomes much more feasible. Interested readers are referred to Imtiaz (2021) for a complete review of devices for wearable sleep staging. These types of devices, and the datasets recorded with them, open up new, very interesting avenues of research.

4.1.1. Individual modelling

Given a large number of nights from an individual, it seems both possible and advantageous to create personal models using this data. Indeed, we may be inspired by more general approaches for semi-supervised machine learning (Padmanabhan et al 1998, Kang et al 2019, Sun et al 2019). We could imagine that a successful approach would not only lead to an individual model, but to a model which could keep up with the possible 'concept drift' that is caused by the long-term changes in a person's sleep patterns. At the same time, such an adaptive approach should be resistant to 'catastrophic forgetting' (French 1999), in which the sequential learning of a neural network causes it to forget how to solve previous problems; in this case, how to deal with types of sleep or events that only happen very infrequently for the given subject. We suspect that simply requiring a long 'memory' is sufficient to solve this problem. Given the performance of models trained as leave-one-subject-out, the size of standard datasets, the fact that inter-subject variation is estimated to be a significant driver of sleep data variation (Buckelmüller et al 2006, Hemmsen et al 2021, Finelli et al 2001, Tucker et al 2007) and studies evaluating the performance of individualized models (Mikkelsen et al 2019, Phan et al 2020), we expect that a memory on the order of 100 nights would be sufficiently long.

4.1.2. Change detection

A natural extension to this discussion is detection of sudden (between nights), serious changes to an individual's sleep. Given the large variation in sleep in a given subject, this is not readily feasible based on a few nights. However, based on perhaps weeks of data before and after a significant event (e.g., surgery, change in medication, disease onset, etc.), it seems possible that we could reliably detect that a patient had started to sleep differently. Conceivably, such change detection will be aided by the development of a continuously updating model (since this should entail detecting the need for significant updates).
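One simple way to frame such a before/after comparison is a two-sample permutation test on a nightly summary feature. This is only an illustrative sketch: the choice of feature (total sleep time in minutes), the made-up values, the number of permutations and the function name `permutation_p_value` are all assumptions, and a practical system would have to account for several features and for slow trends within each period.

```python
import random

def permutation_p_value(before, after, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in means of a nightly
    feature between two periods. A small value suggests a change."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(after) - mean(before))
    pooled = list(before) + list(after)
    n_before = len(before)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of nights to the two periods
        diff = abs(mean(pooled[n_before:]) - mean(pooled[:n_before]))
        if diff >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up total sleep time (minutes) for two weeks before and after an event.
tst_before = [412, 398, 430, 405, 421, 417, 409, 426, 401, 415, 423, 407, 419, 411]
tst_after = [352, 340, 365, 348, 371, 359, 344, 362, 355, 338, 367, 350, 358, 346]

p_change = permutation_p_value(tst_before, tst_after)
p_no_change = permutation_p_value(tst_before, tst_before)
print(p_change, p_no_change)
```

With only a few nights per period the pooled set admits too few distinct reassignments for the test to reach a small p-value, which mirrors the point made above that change detection is not readily feasible from a handful of nights.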

4.1.3. Subject classification

It has been shown that despite the significant night-to-night variation in an individual's sleep, it is possible to define 'trait-like characteristics', special to the individual (Hemmsen et al 2021, Finelli et al 2001, Tucker et al 2007, Chua et al 2014). Ideally, this could mean that given a sufficient amount of sleep recordings from an individual, they could be transformed into a reliable biomarker, which could be used to determine not only changes in sleep patterns, but also whether a given patient's sleep was indicative of certain specific diseases, such as RBD.

4.2. Low signal quality—device limitations for mobile sleep monitoring

Mobile sleep monitoring devices have great potential for helping both healthy and sick users gain more knowledge about their own sleep. However, as has been shown in multiple studies (Arnal et al 2019, Mikkelsen et al 2019), even the best studies do not achieve the same inter-scorer reliability measures as PSG-based approaches (state-of-the-art values for Cohen's kappa seem to be about 0.75 for mobile devices (Arnal et al 2019, Mikkelsen et al 2021), while PSG data leads to values above 0.8). A lower signal-to-noise ratio for these types of data (relative to PSG recordings) is likely the main cause of the degraded performance in mobile solutions. Multiple studies (Mikkelsen et al 2018, 2021) have achieved much better performance from the same number of PSG nights as mobile EEG nights (using concurrent recordings). This means that while small datasets can definitely be an issue, as discussed in Section 4.3, simply making more recordings is unlikely to be sufficient. We can model this phenomenon by imagining the mobile dataset as a projection of the high-dimensional PSG data into a lower-dimensional space, with resulting information loss. Given the differences in neuroanatomy, it is reasonable to consider this projection to be subject dependent, and depending on the device design, we could also expect there to be a variation between recording days (because the device may be mounted differently each time).

Several studies have shown that including subject-specific information increases sleep-scoring performance (Mikkelsen et al 2017, 2019, Phan et al 2020), which is also in line with general EEG studies showing significant differences between individuals (Palaniappan and Mandic 2007, Mikkelsen et al 2021). Mikkelsen et al (2019) found that the differences between algorithms trained on data just from the same subject and data from both the same subject and many other subjects were minimal. This indicates that the benefit from personalizing sleep-scoring algorithms is not that irrelevant data is excluded, but rather that maximally relevant data is included. If this trend were to scale to much larger cohorts, one could imagine that for sufficiently large cohorts, the personalized and general algorithms would perform similarly. However, achieving such a broad training set may be unrealistic in practice.

Individualization of algorithms has been done in multiple ways. Some studies have used random forest ensemble models (Mikkelsen et al 2017, Gangstad et al 2019, Mikkelsen et al 2019), which can be trained using very little data. This makes it possible to create full sleep-scoring models using only a single or few nights of data from an individual. Other studies, using deep neural networks, have instead resorted to variations of fine-tuning population models to individuals. Phan et al (2020) explored a technique where the model, during fine-tuning, was penalized for making large changes to the output in the source domain, effectively limiting the risk of overfitting to the fine-tuning dataset.
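The idea of penalizing large changes to the model output during fine-tuning can be sketched as a regularized loss for a single epoch. The specific form below, a cross-entropy term plus a Kullback-Leibler divergence between the pretrained and fine-tuned outputs weighted by a hypothetical coefficient `lam`, is an assumption for illustration; the direction of the divergence, the data on which the penalty is evaluated and the weighting in the cited work may differ.

```python
import math

def cross_entropy(p_model, label):
    """Negative log-probability the model assigns to the true stage."""
    return -math.log(max(p_model[label], 1e-12))

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same stages."""
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(p_finetuned, p_pretrained, label, lam=0.5):
    """Fit the new label while staying close to the pretrained output.

    lam is a hypothetical trade-off weight: 0 gives plain fine-tuning,
    larger values keep the personalized model nearer to the population model.
    """
    return cross_entropy(p_finetuned, label) + lam * kl_divergence(p_pretrained, p_finetuned)

# Made-up three-class outputs for one epoch whose true stage is class 1.
p_pre = [0.7, 0.2, 0.1]          # pretrained (population) model
p_same = [0.7, 0.2, 0.1]         # fine-tuned model that has not moved
p_moved = [0.2, 0.7, 0.1]        # fine-tuned model that now fits the label

loss_same = regularized_loss(p_same, p_pre, label=1)
loss_moved = regularized_loss(p_moved, p_pre, label=1)
print(round(loss_same, 3), round(loss_moved, 3))
```

The penalty vanishes when the fine-tuned output equals the pretrained one and grows as the two diverge, so fitting the few personal labels is traded off against drifting away from what was learned on the large cohort.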

Note that while some groups focus on developing the entire device, others have strictly worked on developing sleep-scoring algorithms for generic 'single-channel EEG' datasets, without relating them to a specific device (Koley and Dey 2012, Olesen et al 2020). In this discussion, we have lumped the two approaches together. We also note that similar observations regarding mobile device monitoring were made in the review by Chriskos et al (2021).

4.3. Modelling with a small amount of data

Training a deep neural network generally requires a large amount of data. In fact, deep-learning-based sleep-staging models only reach expert-level performance when the training cohort is large, i.e., hundreds or thousands of subjects (Biswal et al 2018, Phan et al 2019, Perslev et al 2021, Phan et al 2021). Networks trained with a small cohort continue to exhibit substandard performance. Unfortunately, in practice, many sleep studies only have access to a small cohort, on the order of a few dozen subjects; for example, when studying a particular sleep disorder (Andreotti et al 2018a, Cooray et al 2019).

This scenario is particularly common in studies exploring the feasibility of a new monitoring device; for example, mobile EEG devices (Mikkelsen et al 2018, 2019, Heremans et al 2021). While the PSG benefits from being an established standard with an enormous user base, new alternatives, by definition, do not. Add to this that many such devices will undergo multiple generations, which may not be mutually compatible. Finally and crucially, new training sets will usually require special recordings of both a device and PSG signals to obtain the necessary ground-truth manual PSG scoring which constitutes the training labels. Altogether, this means that algorithms for new sleep-monitoring devices usually have to be trained with quite small datasets.

The most popular solution to this problem is to use transfer learning, usually in the sense that a neural network is trained on a large sleep dataset, often consisting of PSG recordings. This model is then fine-tuned using the new dataset (Phan et al 2019, Olesen et al 2020, Guillot and Thorey 2021, Heremans et al 2021, Phan et al 2021). Although model fine-tuning often results in better performance than training a model from scratch, the gains are modest in some cases. The problem is that by fine-tuning, we essentially further train a pretrained model with a small amount of data, which easily causes overfitting without proper regularization, especially when the source domain and the target domain are significantly different. This can be remedied by fine-tuning just a part of the pretrained model; however, identifying the most relevant layers for fine-tuning still remains an unsolved question. EEG data augmentation (Fan et al 2020) is another direction to explore.
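Fine-tuning only a part of a pretrained model can be illustrated with a framework-free sketch in which a gradient step is applied solely to the layers selected as trainable. The layer names, parameter values and the choice to fine-tune only the classifier are hypothetical; as noted above, which layers to select remains an open question.

```python
def finetune_step(params, grads, trainable, lr=0.01):
    """One gradient-descent step that updates only the layers named in `trainable`."""
    updated = {}
    for name, weights in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)  # frozen layer: copied unchanged
    return updated

# Hypothetical three-layer model; names and values are illustrative only.
params = {"conv1": [0.5, -0.2], "conv2": [0.1, 0.3], "classifier": [1.0, -1.0]}
grads = {"conv1": [0.4, 0.4], "conv2": [0.2, -0.2], "classifier": [1.0, 2.0]}

# Only the classifier is adapted; the feature extractor stays as pretrained.
new_params = finetune_step(params, grads, trainable={"classifier"}, lr=0.1)
```

Freezing the early layers reduces the number of parameters that the small target dataset has to constrain, which is the intended protection against overfitting.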

4.4. Privacy preservation: a note for sleep monitoring

Brain waves are a rich source of information from which deciphering numerous sensitive pieces of information has been shown possible, such as identity (Palaniappan and Mandic 2007), age (Al Zoubi et al 2018), gender (Wang and Hu 2019), emotion (Alarcao and Fonseca 2019), preference (Sangnark et al 2021), personality (Zhao et al 2018), etc. This poses a challenge to protect this information from the user's perspective and to comply with legal restrictions, such as the General Data Protection Regulation (2021) in the European Union and the Consumer Privacy Bill of Rights (Gaff et al 2014) in the US.

Traditionally, when sleep-staging models are trained, adapted and deployed centrally, EEG signals are expected to be sent via some communication means. Data privacy and security concerns should be heeded in this case; however, they are beyond the scope of this article. From the algorithmic perspective, federated learning (McMahan et al 2017) emerges as a promising solution to address the fundamental problems of privacy and locality of data in the conventional centralized setting. Instead of bringing data to the (centralized) code, federated learning brings the code to (decentralized) data and exploits the distributed resources to train a model collaboratively. Thus, it removes the need for the data to be transferred to a single server. Several open-source federated learning systems are available, e.g. FATE (Webank 2021), PaddleFL (Baidu 2021), TensorflowFL (Tensorflow 2021) and Pysyft (Openmined 2021), that would facilitate future research in this direction.

Developing compact deep neural networks (Xia et al 2020) for sleep staging also becomes relevant. This is particularly suited to longitudinal monitoring, as model personalization and development can be done on devices, such as wearables, smartphones or Internet-of-Things edge devices, and the person's data can stay local. Since these devices operate under run-time energy and memory storage constraints, they can only accommodate compact models, which have reduced energy consumption, memory requirements and inference latency. First, compact versions of existing state-of-the-art models can be derived via quantization (Zhu et al 2017, Hubara et al 2018) to reduce the bit-depth of the weights and pruning (Han et al 2016, Molchanov et al 2017) to remove redundant weights. Second, a network architecture can be hand-designed to remove redundancy, for example via depth-wise convolution (Ma et al 2018) or a shift-based module (Wu et al 2018), and thus improve efficiency. Although this approach often results in good model compactness, there is no guiding principle in hand-engineering a network architecture and most of the existing works are more or less based on trial and error. An alternative approach is to automatically search for network architectures, i.e., neural architecture search (Baker et al 2017, Zoph and Le 2017), which has seen some success, for example in the image domain (Tan et al 2019).
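To make the federated learning idea discussed in this section more concrete, the sketch below shows the aggregation step in its simplest form: each device trains locally and only model weights, not EEG data, are sent to a server, which averages them weighted by the amount of local data. The flat weight lists, the device counts and the function name are assumptions for illustration; real systems add secure communication, client sampling and multiple training rounds.

```python
def federated_average(client_weights, client_sizes):
    """Average client model weights, weighted by each client's number of samples."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Hypothetical example: three devices with locally trained two-parameter models.
local_models = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
nights_per_device = [10, 10, 20]

# The server only ever sees the weights, never the recordings.
global_model = federated_average(local_models, nights_per_device)
```

The device with more recorded nights contributes proportionally more to the global model, which is then sent back to all devices for the next round of local training.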

4.5. Disorders affecting sleep structure

If a subject suffers from severe neurological disorders, this can have a drastic impact on their sleep. It may change the structure, timing and outward characteristics of the individual's sleep (Iranzo 2016), and it can also change the physiological features of the various sleep stages (Santamaria et al 2011). This can eventually lead to several types of issues, which require different types of solutions. First, if the manual scoring of the recordings becomes harder, training and validation of a sleep-staging model becomes equally hard (Stephansen et al 2018, Korkalainen et al 2020). We are not aware of any methodological studies on how to deal with this issue in sleep scoring, but it is a general issue for most types of medical data. Certainly, some of the methods employed in, for instance, medical imaging analysis (Tajbakhsh et al 2020) could be re-purposed for sequence-to-sequence sleep-staging models. Second, if the underlying issue is that distinct sleep stages are becoming less defined (which could be the case in very advanced brain damage), it is possible that more flexible approaches such as continuous sleep depth estimation (Asyali et al 2007, Carrubba et al 2012) are a better fit. Third, even if the above issues do not appear, the changes in transition probabilities or even feature spaces have to be absorbed by a classification algorithm to achieve good sleep-staging performance. Fortunately, some studies have shown that a model's performance can be improved significantly if it is individualized, for instance when the subjects are suffering from epilepsy (Gangstad et al 2019) or RBD (Andreotti et al 2018a).

A particular implementation which is interesting in this context is the ASEEGA algorithm (Berthomier et al 2007) which extracts frequency-based features and scales them according to the individual night. This results in an algorithm which achieves an average Cohen's kappa of 0.8 for healthy adults and full PSG setups (thus, not quite state-of-the-art), but which, on the other hand, appears to be quite resistant to perturbations caused by sleep disorders (Peter-Derex et al 2021) (reaching Cohen's kappa values between 0.75 and 0.80, depending on disorders). Another promising approach, similar to the ASEEGA algorithm, is to feed an entire night into the algorithm at once. In theory, this could enable the algorithm to detect subject- or disease-specific perturbations, and adjust for them. We have seen the approach applied by Li et al (2021) for arousal detection, to great success.
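The idea of scaling features according to the individual night can be sketched as a per-night standardization, in which every feature is normalized using statistics computed from that night alone. The z-scoring used here is an assumption chosen for illustration; the exact scaling used by the ASEEGA algorithm is not specified in this text.

```python
import statistics

def scale_per_night(features):
    """Z-score each feature column using statistics from this night only.

    `features` is a list of epochs, each a list of feature values
    (e.g. band powers); the layout is a hypothetical choice.
    """
    columns = list(zip(*features))
    means = [statistics.fmean(col) for col in columns]
    # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(epoch, means, stds)] for epoch in features]

# Two epochs with two features each, from one night.
night = [[1.0, 10.0], [3.0, 10.0]]
scaled = scale_per_night(night)
```

Because the statistics are recomputed for every night, a subject- or disorder-specific offset in a feature is removed before classification.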

4.6. Black-box criticism and interpretability

4.6.1. Black-box criticism and adoption

Trust, a psychological mechanism to deal with uncertainty, is a crucial factor influencing interactions and relationships between human and AI, particularly between clinicians and AI in the healthcare domain. This is the chief mechanism that shapes the use and adoption of AI in healthcare settings where life is involved (Asan et al 2020). With its capabilities, deep learning has demonstrated its benefits to healthcare in many aspects: superior performance, capability of handling large and complex data, data-driven learning ability, etc. Examples are the application of deep learning to image-based diagnosis (Ting et al 2018), clinical outcome prediction (Rajkomar et al 2018), automatic ECG analysis (Ribeiro et al 2020), automatic sleep analysis (Stephansen et al 2018), mental health screening (Su et al 2020), intelligent assistive technologies for dementia care (Ienca et al 2017), to mention a few. However, the complex nature of these algorithms and their inherent 'black-box'-ness have been deterring medical professionals' trust (Lipton 2018, Rajkomar et al 2018). The way that a deep neural network processes input data through interconnected layers to arrive at its staging decisions poses difficulties in deciphering how it learns to produce the outputs. Expectedly, automatic sleep-staging systems are not an exception as black-box skepticism remains one of the main questions around their clinical value and adoption.

4.6.2. Interpretability

Interpretability is critical for a trustworthy sleep-staging system because sleep stages are often ambiguous and even different human experts tend to disagree to a certain extent (Danker-Hopfe et al 2009, Guillot and Thorey 2021). While addressing this interpretability problem is mandatory to unleash the clinical value of deep-learning-based sleep-staging algorithms, it requires novel technical approaches to understanding the behaviour of these AI systems (i.e., explainable AI (Gunning and Aha 2019)).

A few attempts have been made to introduce interpretability to a deep-learning-based automatic sleep-staging system. Most of them explain the models using feature visualization methods, such as sensitivity maps (Rasmussen et al 2011, Yosinski et al 2015) by Vilamala et al (2017), guided gradient-weighted class activation maps (Selvaraju et al 2017) by Andreotti et al (2018b), the saliency map (Qin et al 2020) by Jia et al (2021), and the self-attention score (Vaswani et al 2017) by Phan et al (2022). In another work, Lee et al (2020) proposed to associate a model's learning process with expert-defined EEG patterns. These patterns were used as templates for the first convolutional kernels of a CNN and were located in a test EEG signal via cosine similarity maximization to achieve interpretability. Al-Hussaini et al (2019) gained interpretability from the perspective of decision rules by coupling a deep-learning model with a regression tree. Prototypes in the high-dimensional embedding space of a CNN were firstly derived and used to generate similarity for each PSG epoch with the expert-defined rules. The similarity scores were then classified by a decision tree. Several resulting splitting rules of the decision tree were found similar to the guidelines for human annotators (Berry et al 2016).

Ultimately, future research towards interpretable automatic sleep staging could benefit from explainable deep-learning research in general. On the one hand, new backpropagation-based methods (e.g., layer-wise relevance propagation (Bach et al 2015, Montavon et al 2018), Deep Learning Important FeaTures (DeepLIFT) (Shrikumar et al 2017) and integrated gradients (Sundararajan et al 2017)) and perturbation-based methods (e.g., occlusion sensitivity (Zeiler and Fergus 2014), representation erasure (Li et al 2016), meaningful perturbation (Fong and Vedaldi 2017) and prediction difference analysis (Zintgraf et al 2017)) can be explored to improve models' explanation via scientific visualization of characteristics of an input that influence the output of a model. Designing intrinsically explainable deep networks, like those from Jia et al (2021) and Phan et al (2022), that can jointly optimize model performance and provide explanations as part of the model output is another potential direction. These intrinsic methods are probably more desirable than the post hoc methods that seek to explain models that were never designed to be explainable in the first place. On the other hand, there are model distillation approaches (Hinton et al 2015, Ribeiro et al 2016) in which the knowledge encoded within a deep-learning model (i.e., the 'black-box' model) is distilled into a 'white-box' model which is meant to identify the decision rules influencing the outputs of the deep-learning model, as shown in Al-Hussaini et al (2019). The distilled models, potentially simple and interpretable, such as decision tree or logistic regression, offer the explanation power while still achieving reasonable performance. In this way, one could alleviate the compromise between interpretability and prediction accuracy to some extent.
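Among the perturbation-based methods listed above, occlusion sensitivity is perhaps the simplest to sketch: parts of the input are masked in turn and the change in the model output is recorded. The toy model below is a hypothetical stand-in for a network output, and the window length and baseline value are assumptions.

```python
def occlusion_sensitivity(model, signal, window, baseline=0.0):
    """Score each window by how much masking it changes the model output."""
    reference = model(signal)
    scores = []
    for start in range(0, len(signal), window):
        occluded = list(signal)
        for i in range(start, min(start + window, len(signal))):
            occluded[i] = baseline  # replace the window with a neutral value
        scores.append(abs(reference - model(occluded)))
    return scores

def toy_model(signal):
    """Hypothetical stand-in for a network output: mean absolute amplitude."""
    return sum(abs(x) for x in signal) / len(signal)

# A short signal whose only activity lies in the middle window.
epoch = [0.0, 0.0, 4.0, 4.0, 0.0, 0.0]
scores = occlusion_sensitivity(toy_model, epoch, window=2)
```

The window with the largest score is the one the output depends on most; with a real sleep-staging network, `model` would return the probability of the predicted stage.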

However, it remains an open question how the explainability of a model could be objectively quantified, evaluated and compared. Our opinion is that an explainable AI system for automatic sleep scoring should be inspired by the way a sleep expert performs manual scoring to provide interpretability regarding (1) whether the features and the rules resulting from an algorithm are clinically relevant to and underpin sleep and (2) how the decision on a target epoch is made under the influence of its neighboring epochs given their strong dependency due to the continuous nature of sleep. However, answering these questions in automatic sleep staging is tricky. First, most of the existing approaches explain the models via the prism of expert-defined rules and features; however, many of these features are not well defined while the majority of the rules for human annotators are vague (Berry et al 2016). Second, these features and rules cannot be used in many scenarios, for instance, wearable EEG (Mikkelsen et al 2017, Sterr et al 2018, Mikkelsen et al 2019), since the underlying signals are different from scalp EEG and not readily interpretable for a human scorer. Third, in practice, many features and rules learned by these networks do not conform to the established features and rules. However, these challenges point in the direction of moving beyond interpretability. We envision that efforts should also be made to understand the nonconforming features and rules that result purely from data. Clinical explanations for them would potentially help us to gain further insights about the underlying neurophysiology of human sleep, which, in turn, could be used to update the manual-scoring features and rules (Al-Hussaini et al 2019, Penzel et al 2013).

4.7. Data mismatch due to distributional shifts between datasets/cohorts

Sleep data typically come from different sources with a wide range of institutions, demographics, diseases, modalities, devices and acquisition conditions. As a result, these mismatches violate the data assumption of being independent and identically distributed required for a machine-learning system. Even when a deep-learning sleep-staging model is trained on large amounts of data, resulting in powerful hierarchical representations, the discrepancies are still computationally significant, degrading the accuracy of sleep-staging models on unseen data with a shift (i.e., mismatch) in their distribution (Phan et al 2021). A naive solution for this problem is to form training data from as many conditions as possible (Olesen et al 2021), ideally from all types of conditions that will be foreseeably encountered in the deployment phase. However, this is expensive, time-consuming and infeasible. In addition, novel setups will likely emerge in the study of particular sleep disorders (Stephansen et al 2018, Cooray et al 2019) or when exploring the feasibility of new monitoring devices (Mikkelsen et al 2018, Myllymaa et al 2016, Mikkelsen et al 2017, Arnal et al 2019, Heremans et al 2021).

As the data mismatches cannot be simply reversed by signal preprocessing, approaches to migrate a pretrained model to a target cohort with an unseen condition (e.g., via transductive transfer learning or domain adaptation (Phan et al 2019, Eldele et al 2021, Heremans et al 2021, Phan et al 2021, Yoo et al 2021, Zhao et al 2021)) have been adopted. While most existing works in this direction utilized a large labelled database for model pretraining in a supervised fashion, semi-supervised (Van Engelen and Hoos 2020) and unsupervised (i.e. self-supervised) (Mohsenvand et al 2020, Banville et al 2021, Jiang et al 2021, Jing and Tian 2021, Yang et al 2021) training regimes would further allow leveraging unprecedentedly large amounts of unlabelled data for this purpose. However, before migrating a pretrained model to a target domain, distributional shifts need to be detected and quantified to indicate the variation of the model and whether the migration is necessary or not. Entropy of the probability outputs could potentially serve this purpose (Mikkelsen et al 2020). Then, methods for model migration from a source domain to a target domain can be categorized depending on the availability of labelled data in the target domain. In the best case, when all data are labelled, supervised domain adaptation methods (Tan et al 2018), such as those by Phan et al (2019), Heremans et al (2021) and Phan et al (2021), appear to be most sensible. In these methods, a pretrained model needs to undergo a fine-tuning process, i.e., the model is further trained in a supervised fashion using the target domain's labelled data. When only a part of the target-domain data are labelled, supervised domain adaptation is, in essence, still feasible if the amount of labelled data is sufficient. Otherwise, fine-tuning using a small amount of data will be exposed to a great risk of overfitting. In any case, a better solution is to exploit semi-supervised domain adaptation methods, which incorporate semi-supervised learning (Van Engelen and Hoos 2020) and domain adaptation (Tan et al 2018), to leverage both labelled and unlabelled data at the same time. For example, a pretrained model can be fine-tuned to simultaneously minimize the sum of supervised classification and unsupervised reconstruction cost functions (Rasmus et al 2018). In the worst case, when all the data are unlabelled, unsupervised domain adaptation will be a natural choice (Wilson and Cook 2020). For example, encouraging results have been reported using adversarial domain adaptation methods to match the feature distributions of the source and target domains via gradient reversal from a domain classifier that is tasked to discriminate between the two domains (Nasiri and Clifford 2020, Yoo et al 2021, Zhao et al 2021). A pretrained network can also be adapted to a target domain by modulating the domain-specific statistics of deep features stored in the network's normalization layers like batch normalization (Fan et al 2022). While the reliance on target data labels is costly as human scoring is required, in general, the performance gains are proportional to the amount of labelled data. With the same target-domain data, the gains from supervised transfer learning methods are expected to be higher than semi-supervised ones which are in turn higher than unsupervised ones. However, success of model migration is also subject to training strategies, network-architecture choices and datasets (Phan et al 2021).
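The adaptation of normalization statistics mentioned above can be sketched for a single feature as follows: the mean and variance stored from the source domain are replaced by statistics re-estimated from unlabelled target data. This is a simplified illustration under assumed values and is not a reproduction of any cited method; the function names are hypothetical.

```python
import math
import statistics

def normalize(features, mean, var, eps=1e-5):
    """Batch-normalization-style scaling with the given statistics."""
    return [(x - mean) / math.sqrt(var + eps) for x in features]

def target_statistics(target_features):
    """Re-estimate the normalization statistics from unlabelled target data."""
    return statistics.fmean(target_features), statistics.pvariance(target_features)

# Statistics stored in the pretrained network (assumed values).
source_mean, source_var = 0.0, 1.0

# Target-domain features with a strong offset, e.g. from a different device.
target_batch = [9.0, 10.0, 11.0]

mismatched = normalize(target_batch, source_mean, source_var)
adapted = normalize(target_batch, *target_statistics(target_batch))
```

With the source statistics, the target features fall far outside the range the subsequent layers were trained on; with the re-estimated statistics they are centred again, and no target labels were needed.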

An alternative approach to domain adaptation is data mapping. This has remained mostly unexplored for sleep data. The difference is that domain adaptation requires modification of the model parameters while data mapping aims to modify the data to map them from one domain to another. To this end, a mapping function can be learned to map from a certain target domain to the source domain. When evaluating a sleep-staging model on a target domain, the test data are first fed to the domain mapping function to make them resemble the source data before sleep staging takes place. Inspired by the success of generative adversarial networks (GAN) (Goodfellow et al 2014) in image-to-image translation, subject-to-subject or sequence-to-sequence PSG mapping could potentially be done similarly with these GAN variants. However, the challenge here is that we may wish to achieve the mapping by modifying the traits of the mismatched factors in data while preserving the sequential nature of the sleep data.

Another approach to align source and target domains is to force a sleep-staging model to learn domain-invariant feature representations (Zhao et al 2019, Heremans et al 2021). In other words, the features learned by the model follow the same distribution regardless of whether the input is from the source or target domain, thus representing the underlying sleep stages while being agnostic to other factors. As a consequence, the model trained on the source domain can generalize well to the target domain without the necessity of modification of the model parameters or data mapping. One way to achieve this is to minimize the distances (e.g., Wasserstein distance (Long et al 2016, Chambon et al 2018)) between the distributions during training in addition to the classification task. An alternative to distance minimization is to incorporate reconstruction losses (Bousmalis et al 2016, Ghifary et al 2016) to encourage the learned features to reconstruct either the target-domain data or both the source and target-domain data. Another possibility is to rely on an adversarial domain classifier, which is tasked to discriminate the source and target domain. In light of adversarial training as in a GAN (Goodfellow et al 2014), the idea is then to train the sleep-staging model to learn the features such that the domain classifier is unable to distinguish from which domain the features originated (Zhao et al 2019). However, this approach requires data from both domains to be available at the training phase.
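The distance-minimization variant can be sketched for a single scalar feature, for which the empirical 1-Wasserstein distance between two equally sized samples reduces to the mean absolute difference of the sorted values. The restriction to one dimension and equal sample sizes, the weighting factor and the function names are assumptions made to keep the example short.

```python
def wasserstein_1d(source, target):
    """Empirical 1-Wasserstein distance between two equal-sized 1D samples."""
    if len(source) != len(target):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(source), sorted(target))
    return sum(abs(s - t) for s, t in pairs) / len(source)

def joint_loss(classification_loss, source_features, target_features, weight=1.0):
    """Classification loss plus a penalty on the source/target feature mismatch."""
    return classification_loss + weight * wasserstein_1d(source_features, target_features)

# Target features are the source features shifted by one unit.
source_features = [0.0, 1.0, 2.0]
target_features = [3.0, 1.0, 2.0]
distance = wasserstein_1d(source_features, target_features)
```

Minimizing the joint loss pushes the feature extractor towards representations whose distribution is the same in both domains, while the classification term keeps them informative about the sleep stage.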

4.8. Heterogeneity: a challenge beyond data mismatch

Apart from the data-mismatch challenge discussed in Section 4.7, heterogeneity is another challenge emerging from data originating from different sources, or even from different subjects of the same cohort. Typically, the number of channels and modalities in PSG recordings varies significantly due to the differences in channel layout and recording setup. This is not a major challenge in many existing network architectures relying on a single channel of one or a few modalities (i.e., EEG, electrooculogram (EOG) and electromyogram (EMG)) (Supratak et al 2017, Phan et al 2019, Perslev et al 2021, Phan et al 2021). However, it limits the applicability of those utilizing a larger number of channels (Chambon et al 2018, Jia et al 2020) when one or more employed channels are missing in test data. In practice, different channels and modalities manifest different perspectives of the underlying neurophysiological processes in human sleep. For example, alpha rhythm appears most clearly in the occipital lobe, sawtooth waves characterizing REM are best captured in the central lobe and K-complex events, the hallmark of N2, are best observed in the frontal lobe (Berry et al 2016). As a result, consolidating information from all available channels of PSG data in a holistic view would potentially improve sleep-staging performance. This will also facilitate model building from an unprecedented amount of data gathered from different sources (Stephansen et al 2018, Guillot and Thorey 2021, Perslev et al 2021).

Fortunately, tackling this challenge probably does not require designing entirely new modelling paradigms, and instead current ones could be built upon. One could devise an intermediate layer that interfaces the heterogeneous raw data and a network. This layer aims to amalgamate all available channels and modalities to form an input with a fixed number of channels that are ready to be fed to any existing network architecture. As a result, existing state-of-the-art models would be invigorated owing to the more informative input. Guillot and Thorey (2021) proposed such a layer using a multi-head attention layer with N heads to map an input with varying number of channels to N channels. Potentially, this could also be done using an across-channel pooling operator, e.g., average pooling or max pooling. Channel mapping, for example via a 1 × 1 convolutional kernel (Lin et al 2014), could also map the varying inputs into any fixed number of channels and further enrich the resulting channels via nonlinear activations. Ideally, such an interface layer should be integrated to an existing sleep-staging model and trained jointly on the classification task in an end-to-end fashion. However, automatic channel quality control should be put in place to exclude bad-quality channels with, for example, high noise levels (Stephansen et al 2018) or excessive missing data (Mikkelsen et al 2021).
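The across-channel pooling option can be sketched as an interface that maps a recording with any number of channels to a fixed pair of channels, namely the sample-wise mean and maximum across channels. The data layout (a list of channels, each a list of samples) and the choice of two output channels are assumptions; unlike the attention-based layer, this operator has no trainable parameters.

```python
def pool_channels(recording):
    """Map any number of channels to two fixed channels: mean and max across channels."""
    per_sample = list(zip(*recording))  # one tuple of channel values per time step
    mean_channel = [sum(values) / len(values) for values in per_sample]
    max_channel = [max(values) for values in per_sample]
    return [mean_channel, max_channel]

# Two recordings with different channel counts yield the same output shape.
two_channel = [[1.0, 2.0], [3.0, 6.0]]
three_channel = [[1.0, 2.0], [3.0, 6.0], [2.0, 1.0]]

fixed_a = pool_channels(two_channel)
fixed_b = pool_channels(three_channel)
```

Both outputs have two channels of the original length and could therefore be fed to the same downstream network, regardless of how many electrodes were recorded.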

4.9. Subjectivity in model building

It is well known that manual scoring of PSG is highly subjective. Many previous studies consistently reported the consensus among human scorers around a Cohen's kappa of 0.76 (Danker-Hopfe et al 2009, Rosenberg and Van Hout 2014). The consensus is particularly poor on epochs manifesting characteristics of more than one sleep stage. Examples are the N1 stage, epochs close to the boundary of two stages, and data from patients with fragmented sleep. So far, it has been common practice to use the subjective and noisy labels annotated by a single scorer for model training as if they were perfect. Thus, the scorer's subjectivity is unavoidably transferred into the trained model. Reducing such subjectivity is necessary but remains largely uncharted.
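For reference, Cohen's kappa, the agreement measure quoted above, corrects the observed agreement between two scorers for the agreement expected by chance. A minimal sketch for two label sequences follows; the example hypnograms are made up for illustration, and the function assumes that chance agreement is below one.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two scorers' epoch-by-epoch sleep-stage labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two scorers' marginal stage frequencies.
    expected = sum(count_a[stage] * count_b[stage] for stage in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical four-epoch hypnograms from two scorers.
scorer_a = ["W", "W", "N2", "N2"]
scorer_b = ["W", "N2", "N2", "N2"]
kappa = cohens_kappa(scorer_a, scorer_b)
```

Here the scorers agree on three of four epochs (0.75), chance agreement is 0.5, and the resulting kappa is 0.5.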

The above-mentioned subjectivity can be alleviated by scoring at a fine-grained temporal resolution (i.e., smaller epochs (Stephansen et al 2018, Olesen et al 2021) or even samples (Perslev et al 2019, 2021)). However, the fine-grained supervision signals in these works were still derived from 30-second epoch annotation, and therefore, the subjectivity continued to exist. A more proper treatment would be to consider the labels as noisy by nature (in practice, one-hot ground-truth labels are unknown or even not in existence) and robust training methods should be devised to manage the noisy labels (Han et al 2020, Nigam et al 2020). It also remains an open question how to train a model with multiple supervision signals (i.e., annotations of two or more human scorers) at the same time rather than a single supervision signal as usual. Doing this will encourage the model to adapt to the scoring style of a cohort of scorers. Note that this is different from averaging labels of multiple scorers (Guillot et al 2020) which eventually results in a single supervision signal.

4.10. Scorer personalization

Since it is very likely that a model trained with the annotation from a single scorer as described in Section 4.9 will mimic (i.e., overfit) the scorer's style, the resulting 'subjective' model poses another challenge from the end-user perspective that is worth discussing separately here. Imagine a clinical scenario in which a clinician is an end user of the scoring system. The trained model can be reasonably viewed as a digital twin of the original scorer who labelled the data; hence, its staging decisions will disagree with those of a new human scorer (i.e., an end user in this case) to a degree similar to the disagreement between two human scorers. For maximum adoption by the clinician, this raises the need for the model to gradually adjust to the scoring style of the new scorer. Readers should note that this scorer personalization problem is orthogonal to the data personalization discussed previously in Section 4.1.1.

Tackling this challenge requires a closed-loop interaction between the model and the end user (Liang et al 2020). On the one hand, the disagreeing staging decisions first need to be identified. This could be done via uncertainty quantification, for example using entropy-based metrics (Mikkelsen et al 2020, Phan et al 2022) or an ensemble with Monte Carlo dropout (Fiorillo et al 2021), as the model's decisions with low confidence are more likely to be disagreeing ones. Although this approach can isolate a large portion of wrong decisions, a pitfall here is that an algorithm may output contentious decisions with a very high confidence. This anomalous behavior has been studied in many prior works (Nguyen et al 2015). While understanding this behavior in the context of human sleep is an interesting subject on its own, methods will also need to be developed to identify wrong decisions associated with high confidence. Model interpretability (see Section 4.6.2) with the aid of convenient user-interface and user-experience design will be equally important to allow the end user to scrutinize the potentially wrong decisions and make necessary corrections in an interactive manner. On the other hand, given the end user's feedback, learning methods will be needed to incrementally adapt the model using the newly labelled data in an open-ended fashion. Approaches for continual learning (Parisi et al 2019), such as the meta-learning method used in Banluesombatkul et al (2021), stand out as promising candidates. While these methods are required to learn from sequential, and potentially small, data, they also need to overcome catastrophic forgetting (Kirkpatrick et al 2017), the central issue in this learning setting.
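The entropy-based route to identifying candidate disagreements can be sketched as follows: the entropy of the model's output distribution is computed for every epoch, and epochs above a threshold are flagged for review by the end user. The normalization to the range 0 to 1, the threshold of 0.5 and the example probabilities are assumptions; as noted above, such a criterion cannot catch wrong decisions made with high confidence.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a stage distribution, scaled to lie between 0 and 1."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))

def flag_uncertain_epochs(epoch_probs, threshold=0.5):
    """Return indices of epochs whose normalized entropy exceeds the threshold."""
    return [i for i, probs in enumerate(epoch_probs) if normalized_entropy(probs) > threshold]

# Hypothetical five-stage outputs for two epochs: one confident, one ambiguous.
epoch_probs = [
    [0.96, 0.01, 0.01, 0.01, 0.01],
    [0.40, 0.40, 0.10, 0.05, 0.05],
]
flagged = flag_uncertain_epochs(epoch_probs)
```

Only the second epoch is flagged; in a closed-loop system, the clinician's corrections on such epochs would form the newly labelled data used to adapt the model.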

5. Conclusions

Benefits of automatic tools for sleep scoring have driven research in automatic sleep staging for many decades. Promising results recently achieved by deep-learning-based methods have given incentives for a large number of research studies, now resulting in solutions that have comparable performance to sleep experts on the sleep-scoring task, at least on healthy individuals. This methodological development has benefited immensely from the initiatives for free access to large collections of de-identified sleep data and open-source tools and techniques from many experiments available for researchers around the world. We perceive this achievement as a small yet important milestone in the use of AI in sleep research. However, many challenges still have to be overcome for these AI tools to prove their clinical usefulness. Next to the challenge of sleep disorders, issues related to data heterogeneity, model explainability and subjectivity will require more attention. In order to bring sleep monitoring outside sleep labs and into daily living environments, prospective studies also have to be conducted to improve robustness to low signal quality of mobile EEG devices and limited amounts of training data and to address issues related to privacy and longitudinal monitoring. Once these challenges have been overcome, we expect AI-based sleep scoring tools to play an important role in day-to-day sleep practice, complementing the increasing need for trained sleep experts and benefiting millions in need of accurate sleep assessment, monitoring and treatment options for many sleep disorders.

Acknowledgments

H Phan is supported by a Turing Fellowship under the EPSRC grant EP/N510129/1.

Author contributions

H Phan and K Mikkelsen contributed equally.
