A causal perspective on brainwave modeling for brain–computer interfaces

Objective. Machine learning (ML) models have opened up enormous opportunities in the field of brain–computer interfaces (BCIs). Despite their great success, they usually face severe limitations when they are employed in real-life applications outside a controlled laboratory setting. Approach. Combining causal reasoning, that is, identifying the causal relationships between variables of interest, with brainwave modeling can change one’s viewpoint on some of these major challenges, which arise at various stages of the ML pipeline, ranging from data collection and data pre-processing to training methods and techniques. Main results. In this work, we employ causal reasoning and present a framework that aims to break down and analyze important challenges of brainwave modeling for BCIs. Significance. Furthermore, we show how general ML practices as well as brainwave-specific techniques can be used to solve some of these identified challenges. Finally, we discuss appropriate evaluation schemes to measure the performance of these techniques and to compare them efficiently with methods that will be developed in the future.


Introduction
Electroencephalography (EEG), which measures voltage fluctuations resulting from ionic current within the neurons of the brain [1], has been a key tool in the field of neuroscience, unraveling new opportunities and paving the way for the newly-established field of brain–computer interfaces (BCIs). By definition, a BCI is a digital system that allows direct communication of the brain with the external world while bypassing normal communication pathways (like muscles and nerves). This is attainable through the analysis of brainwaves captured by various EEG signal recording devices.
BCIs have the potential to revolutionize the way we interact with the world around us. In the future, BCIs could be used to help people with disabilities regain lost functions, enhance our cognitive abilities, and create new forms of entertainment and communication [2][3][4][5]. The possibilities for BCIs are endless and their future heavily relies on developing techniques and models that can accurately analyze the captured brainwave signals.
For several years, neuroengineers have been trying to analyze EEG signals using classical signal processing methods [6][7][8][9][10] and manually crafted features, making the 'expert's knowledge' the center and a vital part of brainwave analysis. In recent years, however, the field of brainwave modeling has been enjoying tremendous progress with the advent of machine learning (ML) and deep learning [11][12][13], which has alleviated the need for manual feature extraction and has demonstrably transformed several other fields like computer vision (e.g. [14,15]), audio/speech recognition (e.g. [16]) and natural language processing (e.g. [17,18]). The application of deep neural networks in various BCI paradigms, e.g. epilepsy detection and prediction [19,20], sleep stage detection [21], anomaly detection [22], muscle activity [23] and motor-imagery (MI) classification [24], has achieved state-of-the-art performance compared to previously employed classical signal processing techniques.
However, several obstacles still stand in the way of a wide-ranging success of ML in the field of BCI: the limited amount of annotated data required to train these networks, as well as the quality (data noise) of, and the fundamental differences (different recording conditions and various demographics) between, the available brainwave datasets. Let us illustrate with a hypothetical scenario how these challenges may arise in practice and pose real threats to various BCI paradigms. Suppose that a university laboratory team has collected some MI data from several students. The laboratory team used these collected data to train an MI-BCI to decode brainwaves. The team was able to demonstrate high performance in this task, outperforming classical signal processing methods, as confirmed via the ground-truth MI labels. Because of their success, they hope to license their model to rehabilitation centers to assist in the restoration of motor function in individuals with paralysis.
The question then arises: is this project bound to succeed or fail? Various issues can jeopardize its success, including the limited amount of annotated data, data mismatch between the training and deployment sets, a different population (the current dataset contains only young students, which might not be the case in the rehabilitation center) and different EEG headsets used for the data collection and for the rehabilitation tasks.
In this work, we discuss the importance of the causal relationships between the variables of interest in brainwave modeling tasks for the design of ML models in the field of BCI. In summary, the contributions of this paper are as follows:
• We employ causal reasoning to accurately describe the task of brainwave modeling using independent causal factors.
• We propose a causal framework that allows us to categorize all BCI paradigms based on two core properties: the presence of task stimulus and the voluntary engagement of the subject.
• We describe solutions to the identified challenges through this causal factorization, which can be either general ML practices or domain-specific techniques.
• We discuss evaluation schemes to efficiently compare brainwave modeling methods in the present and future.
The remainder of the paper is organized as follows: section 2 describes related work in the literature, details the motivation behind the paper and lays down the theoretical background of causal reasoning that we employ in the next sections. Section 3 outlines the proposed causal framework. More specifically, it categorizes brainwave models into encoders and decoders and breaks them down into their independent causal factors. Section 4 describes some of the challenges that are met in brainwave modeling based on this causal analysis and outlines techniques that can be employed to overcome these obstacles. Section 5 presents appropriate evaluation schemes. The last section summarizes and concludes our work and briefly outlines future research steps.
Inspired by [25] in the field of medical imaging, our main aim with this paper is to raise awareness that 'causality matters in brainwave analysis' and we hope that this will serve as a guideline for future research in the field of brainwave modeling leading to massive advances as well as successful commercialization of BCIs.

From physical to statistical models
Our real world constantly evolves over time and we are usually forced to adapt, either actively (by choice) or passively, to these changes. Modeling reality has been an important goal for humankind and the aspiration behind the invention of many modern scientific tools. Physical phenomena and physical systems were among the first mechanisms that scientists managed to successfully model using mathematics.
A commonly-used mathematical tool is differential equations, for example in equations of motion, which provide a comprehensive description of physical/mechanical systems while capturing their underlying causal structures. Reliable predictions under different changes to their variables is a feature that all physical models share. Despite their robustness to changes or shifts that can take the model outside of its original setting, they usually require a human expert to understand the system that needs to be modelled (its structure, its dynamic processes, the causal relationships between its different parts, etc) and come up with these analytical equations.
For certain systems or problems, though, it is often impossible to reach analytical equations that accurately describe them, no matter the available expert knowledge at hand. In these cases, statistical models (widely used in the field of ML) have proved to be useful. This type of model relies heavily on observational data and can uncover associations between the input and output data. With these models, it is possible to make accurate predictions based only on observational input data, as long as there are no changes in the system or problem being modeled.
Models learnt using ML algorithms belong to the category of statistical models, since these algorithms try to make accurate predictions from a given amount of observations. Nowadays, the great success of ML (or deep learning) in various fields is the result of four main factors, as described in [26]: 1. large amounts of data, 2. massive computational power, 3. high-capacity systems and 4. independent and identically distributed (i.i.d.) tasks.
Today's hardware and research practices almost always guarantee factors 2. and 3. Factor 1., on the other hand, tends to hold for the majority of the problems tackled by ML models, although there are still areas where a lengthy data acquisition process and/or a lack of human annotations prevent large amounts of (labelled) data from being available for particular tasks.
Lastly, the i.i.d. assumption (factor 4.) seems to be the greatest vulnerability for the majority of ML models. When this assumption is violated, by introducing changes or distribution shifts that take the models outside of their current settings, it has been observed empirically that these algorithms tend to fail (in ML terms, they do not generalize well). These distribution shifts vary from simple perturbations to the input data (adversarial samples) to different data distributions being used during development and deployment.

Causal models
Unlike physical models, statistical models are not always robust to distribution shifts when at least one of the previously mentioned factors is violated. Generalization across problems and distribution shifts is an open challenge in the ML community and a gap that causal reasoning promises to bridge.
Causal reasoning is the analysis of a task/problem in terms of cause-effect relationships between the different variables of interest [25]: if a variable A is a direct cause of variable B, we express it as A → B (causal diagram: A causes B or B is the effect of A).
Causal modeling lies somewhere in-between these two types of models (physical and statistical). On the one hand, it takes into account causal relationships between the different variables of interest, like physical models, and on the other hand it is data-driven (learnt from observations), like statistical models. Figure 1 graphically demonstrates the difference between a statistical and a causal model.
Given an input X and a target prediction Y, a statistical model tries to estimate P(Y|X). A causal model, although data-driven, extends the statistical setting by additionally considering the cause-effect relationships between the various variables of interest. Using causal terminology, a task can be either [25]:
• Causal: when X → Y, prediction of effect from cause, or
• Anti-causal: when Y → X, prediction of cause from effect.
According to the principle of independent causal mechanisms [26][27][28], the causal breakdown of a system consists of independent mechanisms (modules) that do not influence or inform each other (see footnote 6). In other words, a sudden change or distribution shift in one of them would not change the remaining mechanisms of data generation, which stay invariant and can be used to improve model robustness.
In order to demonstrate the power of causal models, let us consider two observable variables X and Y that are statistically dependent. According to the common cause principle (Reichenbach's principle; see footnote 7), one of the following scenarios holds [26]:
(i) X → Y, where X causes Y
(ii) Y → X, where Y causes X
(iii) X ← Z → Y, where both variables X and Y are caused by another variable Z
A statistical model would not be able to differentiate between these three cases, simply because the joint observable distribution P(X, Y) can be the same in all three cases. By taking into account the causal relationships between the variables of interest, a causal model has more information at its disposal which, if used appropriately by ML algorithms, can lead to models that are more robust to certain types of distribution shifts.
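The point that observational data alone cannot separate the three scenarios can be made concrete with a small linear-Gaussian sketch. Everything below is an illustrative assumption rather than part of the cited works: the correlation value `RHO`, the unit-variance noise sources and the helper `covariance` are hypothetical. Each model writes (X, Y) as a linear mix of independent standard normal sources, so the zero-mean joint distribution is fully determined by its covariance matrix.

```python
import math

RHO = 0.5  # assumed correlation between X and Y (illustrative value only)


def covariance(mixing):
    """Covariance of v = mixing @ noise, for independent unit-variance noise."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in mixing]
            for row_i in mixing]


s = math.sqrt(1.0 - RHO ** 2)
r, q = math.sqrt(RHO), math.sqrt(1.0 - RHO)

# Rows give X and Y as linear mixes of three independent N(0, 1) sources.
MODELS = {
    "X -> Y": [[1.0, 0.0, 0.0],    # X = N1
               [RHO, s, 0.0]],     # Y = RHO * X + s * N2
    "Y -> X": [[RHO, s, 0.0],      # X = RHO * Y + s * N2
               [1.0, 0.0, 0.0]],   # Y = N1
    "X <- Z -> Y": [[r, q, 0.0],   # X = r * Z + q * N1
                    [r, 0.0, q]],  # Y = r * Z + q * N2
}

# Slope of E[Y | do(X = x)] in x: only non-zero when X really causes Y.
INTERVENTION_SLOPE = {"X -> Y": RHO, "Y -> X": 0.0, "X <- Z -> Y": 0.0}

COVARIANCES = {name: covariance(mix) for name, mix in MODELS.items()}

if __name__ == "__main__":
    for name, cov in COVARIANCES.items():
        print(name, [[round(v, 6) for v in row] for row in cov],
              "do(X) slope:", INTERVENTION_SLOPE[name])
```

All three models share the same covariance matrix, and hence the same P(X, Y), yet they respond differently to an intervention on X. This is the extra information a causal model carries.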

Brainwave models
Understanding the brain is arguably one of the most interesting problems of our time. EEG technology has made it possible for researchers to measure the brain activity of a subject in a non-invasive manner, while various tasks are performed. Since the underlying organizational structure of the brain still remains a puzzle to be solved, there is no analytical model (physical modeling) that can accurately describe its functionality for the entirety of human behavior (see footnote 8). Therefore, the success of BCIs mainly relies on the recent advances in ML (statistical models).
6 A causal breakdown of a system can be represented as a directed acyclic graph (DAG) where the nodes are the variables of interest and the edges represent direct causal relationships [29].
7 Common cause principle [26,30]: If two observable variables X and Y are statistically dependent, then either there is a causal relationship between them or there exists a third variable Z that causally influences both and makes them independent when conditioned on Z.
BCI models can be divided into encoding and decoding models (examples are shown in figure 2), known in ML terms as generative and discriminative models:
• Encoding models: given a known performed task/stimulus Y, the goal of these models is to estimate the brain activity X (brainwave) corresponding to that particular task, i.e. P(X|Y)
• Decoding models: given an observed brain activity X (brainwave), the goal of these models is to predict the performed task Y corresponding to that particular brainwave, i.e. P(Y|X)
By introducing causal (cause-effect) relationships between the observed brain activities and their corresponding performed tasks, these two types of brainwave models can be identified with contrasting causality terms:
• If X → Y: encoding models are anti-causal and decoding models are causal
• If Y → X: encoding models are causal and decoding models are anti-causal
Encoding BCI models (BCI systems that modulate the current neural activity so as to bring it within a desirable range associated with some targeted label/state) can be found in real-life applications like transcranial direct current stimulation for limb rehabilitation [45] or Parkinson's disease [46]. However, the vast majority of BCI systems are decoding models, so we will focus only on BCI brainwave decoders in this work.

Causal framework
To make use of causal models, as described in the previous section, it is crucial to break down the task of brainwave analysis into a number of independent variables, according to the principle of independent causal mechanisms, and determine the underlying causal relationships between them. To effectively perform this analysis, the various tasks of BCI paradigms need to be categorized first. Unlike previous research works [47], we argue that this categorization is bound to two core properties: the presence of task stimulus and the voluntary engagement of the subject.
Based on the presence or absence of stimulus, a BCI paradigm can be identified as:
• Exogenous: when the brain activity is modulated by the presence of stimuli, which can take any form of sensory input.
• Endogenous: when the brain activity is inherently modulated, without the intervention of any external stimulus.
On the other hand, based on the subject's voluntary engagement, a BCI paradigm can be identified as:
• Voluntarily Engaged: when the subject willingly generates a particular brain activation or coactivation pattern, as in the case of motor imagery [48].
• Involuntarily Engaged: when the subject has no control over the generated brain activation pattern. In other words, the ongoing brain activity is modulated without any conscious effort from the subject [49].
Therefore, all BCI paradigms can be categorized using a combination of the above mentioned categories.For example, P300-speller [31] (a type of BCI that allows users to spell words and sentences by focusing their attention on specific characters) is exogenous (since there is a visual stimulus) and voluntarily engaged (since the subject initiates the brain activity).
On the other hand, the P300 Lie Detector (LD) [32] (a type of BCI that uses event-related potentials (ERPs) to detect deception) is also exogenous but involuntarily engaged (since the subject has no control over the brain patterns generated by the visual stimulus).
Table 1. Example BCI paradigms across the four categories of the proposed framework: P300 speller [31], P300 lie detector [32], motor imagery [24], seizure detection [19], SSVEP [33], error-related potentials [34], executed movement [35], workload estimation [36], cVEP [37], object recognition ERPs [38], neurofeedback [39], sleep staging [40], auditory attention [41], music appraisal [42] and omitted target [43].
In our causal analysis of all BCI paradigms, we are mostly interested in determining the causal relationship between the observed EEG brainwave signal X and the labelled task Y. But before analyzing each one of these four types of BCI paradigms separately in terms of their independent components and causal relationships, we will introduce a variable that all categories share: the true underlying unobserved BCI-relevant brain activity Z. Inspired by [25], the EEG signal X can be considered a noisy measurement of the true unobserved neural activation patterns Z, i.e. Z → X [29,50].
Given a BCI brainwave decoder that is exogenous, when a stimulus S is applied, there are two possible scenarios:
(i) The task Y causes the stimulus S, which in turn generates the corresponding brain activation pattern Z contained in the measured signal X (exogenous and voluntarily engaged). In this case, the causal diagram of the variables of interest is Y → S → Z → X.
(ii) The task Y and the stimulus S together, as a pair, cause the generation of the corresponding neural activation pattern Z measured within X (exogenous and involuntarily engaged). In this case, the causal diagram of the variables of interest is (Y, S) → Z → X.
For example, in the case of the P300 speller, the user chooses which letter they want to focus on and in turn that letter determines the relevant stimulus, which results in the observed EEG brain signal. On the other hand, in the P300 lie detector case, the participant-suspect is presented with a crime-related image (stimulus) and, depending on whether they recognize it or not, the P300 response may be emitted involuntarily. Given these two causal diagrams, a decoding brainwave model of this category is anti-causal in both cases.
When the system is endogenous, there is no stimulus S applied that can initiate any task. Therefore, we need to investigate further and determine the true source of the neural activation pattern Z as well as the causal sequence of all involved modules.
In the case of endogenous and involuntarily engaged BCIs, no stimulus can initiate a task's brain activity and at the same time the subject has no control over the generated brain signals. Therefore, in order to determine the brainwave source in this case, we will go through two different BCI examples of this category, namely epileptic seizure detection [19,20] and driver's drowsiness detection [51]. In the first example, a brainwave decoding model needs to be able to distinguish between normal and epileptic brain activity. In the second example, a brainwave decoding model needs to be able to distinguish between the brain activity of alertness and drowsiness. In both cases, the true underlying brain activity is changed by factors that are unknown to and uncontrollable by the subject and that essentially affect their current state. As a result, in this category we introduce the environment variable E. The changes of E determine the task label Y (e.g. normal or epileptic brain activity), which in turn generates the corresponding neural activation pattern Z measured by X, i.e. E → Y → Z → X. Given this causal diagram, a decoding brainwave model of this category is anti-causal.
Finally, in the case of endogenous and voluntarily engaged BCIs, producing the causal diagram is not a trivial task. Here, no stimulus can initiate the brain activity of a task, but the subject has full control over the generated brain signals. Therefore, the intention I of the subject can be considered the source of the BCI-relevant neural activation pattern. In this case, though, different parts of the brainwave sequence can be used during modeling. We will elaborate with the following scenarios. Let Dec be a decoding model that uses brainwaves to discriminate between executed left and right hand movements. If Dec uses the brain signal during the execution of the movement, the intention has caused the prediction label Y, which causes the brain activity, i.e. I → Y → Z → X. If Dec uses the motor preparation brain signal X [52], then the signal precedes the final result (the movement execution) but the prediction label Y of the pre-movement signal is still the cause of the brain activity, i.e. I → Y → Z → X. In both cases, there is one common causal diagram and the decoding model is anti-causal.
The causal diagrams of figure 3 show how different factors can influence brainwave activity. Based on these diagrams, we can develop causal factorizations, which are mathematical models that represent the relationships between these factors, to further explore the problem of brainwave modeling. For convenience, we will focus only on the case of brainwave decoding models. In other words, given an input EEG signal X, we want to predict the associated task Y, i.e. P(Y|X) (see footnote 9). The joint distribution of each category factorizes as follows:
• Voluntarily Engaged-Exogenous: P(Y, S, Z, X) = P(Y) P(S|Y) P(Z|S, Y) P(X|Z, S, Y) = P(Y) P(S|Y) P(Z|S) P(X|Z)
• Involuntarily Engaged-Endogenous: P(E, Y, Z, X) = P(E) P(Y|E) P(Z|Y, E) P(X|Z, Y, E) = P(E) P(Y|E) P(Z|Y) P(X|Z)
• Involuntarily Engaged-Exogenous: P(Y, S, Z, X) = P(Y) P(S|Y) P(Z|Y, S) P(X|Z, Y, S) = P(Y) P(S) P(Z|Y, S) P(X|Z), since Y and S are independent.
• Voluntarily Engaged-Endogenous: P(I, Y, Z, X) = P(I) P(Y|I) P(Z|Y, I) P(X|Z, Y, I) = P(I) P(Y|I) P(Z|Y) P(X|Z) (4)
9 A similar causal diagram can also be drawn in the case of brainwave encoders, where the current brain activity is modulated to a desired waveform based on a targeted state. Given the current EEG signal Xnow and the desired state Y, a stimulus S modulates the subject's brain activity Z to a desired state measured by Xmodulated ≡ X, i.e. (Xnow, Y) → S → Z → X. Since we are now interested in P(X|Y), a brainwave encoder is a causal model.
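As a rough sketch of how a causal factorization follows mechanically from a causal diagram, the snippet below stores each of the four diagrams described in the text as a map from a variable to its direct causes and emits one factor P(node|parents) per node. The names `DIAGRAMS`, `factorization` and `decoder_is_anti_causal` are hypothetical helpers introduced for illustration only.

```python
# Parent sets read off the four causal diagrams described in the text.
DIAGRAMS = {
    ("exogenous", "voluntary"): {"Y": [], "S": ["Y"], "Z": ["S"], "X": ["Z"]},
    ("exogenous", "involuntary"): {"Y": [], "S": [], "Z": ["Y", "S"], "X": ["Z"]},
    ("endogenous", "involuntary"): {"E": [], "Y": ["E"], "Z": ["Y"], "X": ["Z"]},
    ("endogenous", "voluntary"): {"I": [], "Y": ["I"], "Z": ["Y"], "X": ["Z"]},
}


def factorization(parents):
    """Return the causal factorization: one factor P(node|parents) per node."""
    factors = []
    for node, pa in parents.items():
        factors.append(f"P({node}|{', '.join(pa)})" if pa else f"P({node})")
    return " ".join(factors)


def decoder_is_anti_causal(parents):
    """A decoder predicts Y from X; it is anti-causal if Y is an ancestor of X."""
    frontier, seen = ["X"], set()
    while frontier:
        node = frontier.pop()
        for pa in parents[node]:
            if pa == "Y":
                return True
            if pa not in seen:
                seen.add(pa)
                frontier.append(pa)
    return False


if __name__ == "__main__":
    for category, parents in DIAGRAMS.items():
        print(category, "->", factorization(parents),
              "| anti-causal decoder:", decoder_is_anti_causal(parents))
```

Under these diagrams every decoder comes out as anti-causal, in line with the case-by-case analysis above.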

Challenges in BCI decoding
Through the causal breakdown described above, we can conclude that ML (statistical) models face major challenges when dealing with BCI brainwave decoding tasks. It is evident that we need ML techniques that do not rely solely on the i.i.d. assumption but are bound to these causal mechanisms [28]. Data distribution shifts constitute the main factor that hurts the generalization of ML models. Using the causal framework described above, it is possible to recognise the possible distribution shifts associated with each BCI paradigm and formulate new strategies to mitigate these problems. Let P_S(·) denote a distribution to which a shift has been applied. Table 2 summarizes some of the various shifts in brainwave decoding as well as the causal sub-module they usually affect. These distribution shifts can have various sources, ranging from the experimental setup and the available EEG data to the nature of the brain itself.
BCI researchers need to understand the sources of all possible changes or distribution shifts associated with these core causal variables of interest when designing ML models for brainwave analysis. In this section, we outline the core challenges that BCI ML models face and we further propose how well-established, as well as newly emerging, techniques can tackle these problems.

Experimental settings
The experimental conditions (under which a BCI decoding task takes place) are crucial and they need to be designed carefully to avoid any possible distribution shift between the training and test EEG sets.BCI researchers need to take into account the following factors:

Training EEG data

Data scarcity
One of the main challenges in brainwave modeling is the lack of labeled data. This is because the acquisition process is difficult and time-consuming. Subjects must spend hours wearing an EEG headset and performing consecutive tasks, which can be tiring and uncomfortable, even when dry electrodes are used.
To increase the amount of available training samples, ML scientists usually turn to data augmentation. Data augmentation is the process of generating artificial samples from the available brainwave data by applying certain shifts to it (e.g. addition of Gaussian noise). This process enriches the available joint distribution P(X, Y), rather than only P(X), which makes it a suitable candidate for both causal and anti-causal brainwave tasks. Through this technique, the new training set includes data from different distribution shifts, which makes the learnt model more robust, partially solves the issue of data scarcity and leads to better generalization. In the field of BCI, various data augmentation techniques have been introduced [53]: generative adversarial networks (e.g. [54][55][56]), noise addition (e.g. [57,58]), overlapping sliding windows (e.g. [59]), recombination of segmentation (e.g. [60]) and geometric approaches (e.g. [61]). Data augmentation can lead to increased performance regardless of the causal direction of the task. For example, [62] demonstrates increased performance in SSVEP classification (voluntarily engaged and exogenous task) while [57] shows improved motor imagery classification performance (voluntarily engaged and endogenous task).
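Two of the augmentation families listed above, noise addition and overlapping sliding windows, can be sketched in a few lines. This is a minimal illustration rather than the procedure of any cited work; the array layout (trials, channels, samples), the function names and the default noise level are assumptions made for the example.

```python
import numpy as np


def add_gaussian_noise(trials, labels, n_copies=2, noise_std=0.1, seed=0):
    """Return the original trials plus n_copies noisy versions of each.

    trials: array of shape (n_trials, n_channels, n_samples)
    labels: array of shape (n_trials,)
    Each noisy copy keeps the label of its source trial, so the augmented
    set still consists of (X, Y) pairs.
    """
    rng = np.random.default_rng(seed)
    scale = noise_std * trials.std()  # noise level relative to the signal
    noisy = [trials + rng.normal(0.0, scale, size=trials.shape)
             for _ in range(n_copies)]
    aug_trials = np.concatenate([trials, *noisy], axis=0)
    aug_labels = np.concatenate([labels] * (n_copies + 1), axis=0)
    return aug_trials, aug_labels


def sliding_windows(trial, window, step):
    """Cut one trial (n_channels, n_samples) into overlapping windows."""
    starts = range(0, trial.shape[1] - window + 1, step)
    return np.stack([trial[:, s:s + window] for s in starts])
```

Because every artificial sample inherits the label of its source trial, the augmentation adds pairs (X, Y) rather than unlabelled signals, which is why it remains applicable regardless of the causal direction of the task.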
Like data augmentation, pre-training aims to enrich the training set distribution. Unlike data augmentation, though, pre-training does not create or obtain more examples from the joint distribution P(X, Y). This technique is based on the assumption that if an ML model is trained on a large and diverse dataset that is close to the targeted brainwave task, it will capture extra information from all of these various distributions. It can benefit both causal and anti-causal models and achieve stronger BCI generalization, as demonstrated in [63] for enhanced intracranial EEG decoding or in [64] for improved MI EEG classification (voluntarily engaged and endogenous task).
Semi-supervised learning aims to leverage large amounts of unlabelled brainwave data in an effort to generate statistical models that generalize better than models learnt using only a limited amount of labelled brainwaves. In essence, this technique utilizes large amounts of unlabelled data to further enhance P(X). Unlike the previous two techniques, as also described in [25,26,65], it is futile when applied to causal tasks (X → Y), since any additional information that can improve P(X) will have no influence on P(Y|X), the quantity that the ML model is concerned with. Therefore, this technique should only be used in anti-causal tasks. Abdelhameed and Bayoumi [66] describe a state-of-the-art BCI system that utilizes semi-supervision for epileptic seizure detection (an involuntarily engaged and endogenous task, anti-causal for brainwave decoders). Gu et al [67] describe an online semi-supervised BCI for the P300 speller (voluntarily engaged and exogenous task).
Finally, self-supervised learning (SSL) consists of two steps: training on a large unlabelled dataset and fine-tuning on a few labelled brainwave samples. As in the case of semi-supervised learning, this technique aims to enrich P(X), a quantity that is only informative for anti-causal tasks (which covers most brainwave decoders) as far as ML is concerned. In e.g. [68,69], SSL-learned networks consistently outperformed purely supervised deep neural networks in anti-causal paradigms like sleep stage detection (involuntarily engaged and endogenous task).

Acquisition shift P_S(X|Z)
Data acquisition can be undertaken with various EEG recorders that have completely different specifications (e.g. number and type of electrodes or sampling frequency). This incompatibility makes it very difficult to combine the available EEG datasets. Although perfectly aligning the EEG recorders after the data collection has taken place is unrealistic, four signal processing steps can be taken in an effort to combine different EEG training sets:
(i) Re-sample all training sets to a common sampling frequency. This process should be treated with care so as to avoid aliasing issues and the loss of important temporal information.
(ii) Retain only the common EEG sensors according to the international 10-20 system [70].
(iii) Perform a type of normalization (e.g. min-max normalization) to map the values of the EEG signals to the same range while maintaining their original distribution.
(iv) Perform signal re-referencing to a common EEG sensor, if it is vital for a specific brainwave analysis.
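The sensor selection, re-referencing and normalization steps above can be sketched as follows. This is an illustrative assumption of how such a harmonization might look, not a prescribed pipeline: the function name `harmonise`, the per-channel min-max scaling and the choice to re-reference before normalizing are all choices made for this example. Re-sampling is deliberately left out, since it is better delegated to a routine with a built-in anti-aliasing filter (for instance `scipy.signal.resample_poly`).

```python
import numpy as np


def harmonise(eeg, channels, common, reference=None):
    """Bring one recording onto a shared channel set and value range.

    eeg: array of shape (n_channels, n_samples)
    channels: list of channel names, one per row of eeg
    common: channel names shared by all datasets, in the desired order
    reference: optional name of a common channel to re-reference to
    """
    # Keep only the sensors that every dataset provides.
    rows = [channels.index(name) for name in common]
    out = eeg[rows].astype(float)

    # Optional re-referencing to a common sensor.
    if reference is not None:
        out = out - out[common.index(reference)]

    # Per-channel min-max normalization to [0, 1]; flat channels map to 0.
    lo = out.min(axis=1, keepdims=True)
    hi = out.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (out - lo) / span
```

Applying the same `common` list to every dataset yields arrays with identical channel order, which can then be concatenated along the trial axis once they share a sampling frequency.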

Subject shift P_S(Z|·)
Our brain is a complex system that is not only structured differently but also functions differently from one person to another, a phenomenon that is often referred to as population shift or inter-subject variability. Each subject has a unique brain anatomy and functionality, which results in a variety of different neural activity patterns that can be observed on the scalp using EEG signals. Additionally, brain functionality differs not only between subjects but across time as well. According to various studies [71], a certain variability in the acquired EEG signal can be observed when the same subject attends the same experiment at different times (sessions), a phenomenon known as inter-session variability. In the case of BCI brainwave decoders, subject shifts can occur in all possible identified scenarios in our causal framework (i.e. P_S(Z|S), P_S(Z|Y, S), P_S(Z|Y), P_S(Z|I)). Therefore, we should turn our attention to domain-specific (BCI-specific) methods that are specifically designed to tackle these issues, which are found only in the area of brainwave analysis. Over the last few years, there has been an increasing interest in this research direction, with strong empirical evidence of improved performance for various BCI models. [72] projects data into a subject-invariant space while [73] proposes an adversarial inference network that learns subject-invariant features. In an effort to design enhanced subject-independent ML models, [74] describes a novel technique to explicitly align feature distributions at various layers of the deep learning model by utilizing both statistical and trainable methods (inspired by covariance-based alignment methods used in Riemannian geometry [72]). Finally, [29,50] approximate the subject distribution shift in the true BCI task-related brain activity among different subjects by using dynamic convolutions based on an input-dependent attention vector.
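To give a flavour of covariance-based alignment, the sketch below whitens each subject's trials by the inverse square root of that subject's mean spatial covariance (often called Euclidean alignment). This is a deliberately simple stand-in chosen for illustration; it is not the method of [72], [73] or [74], and the function name and array layout are assumptions.

```python
import numpy as np


def align_subject(trials):
    """Whiten one subject's trials by their mean spatial covariance.

    trials: array of shape (n_trials, n_channels, n_samples)
    After alignment the mean covariance over trials is the identity matrix,
    which removes one (second-order) source of between-subject differences.
    Assumes the mean covariance is positive definite.
    """
    # Per-trial spatial covariance, then the average over trials.
    covs = np.einsum("tcs,tds->tcd", trials, trials) / trials.shape[2]
    mean_cov = covs.mean(axis=0)

    # Inverse matrix square root via the eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(mean_cov)
    inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

    # Apply the same spatial transform to every trial of this subject.
    return np.einsum("cd,tds->tcs", inv_sqrt, trials)
```

Running `align_subject` separately on each subject (or each session) before pooling the data places all recordings in a common second-order reference frame. It does not address shifts in higher-order structure, which is where the learned approaches cited above come in.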

Artifacts
The acquired EEG signal is extremely vulnerable to undesired noise, which can result in various artifacts [75]. Artifacts can be defined as abnormal signals that contaminate the normal brainwaves and severely interfere with ML-based brainwave decoders, since they can alter the measured EEG signal X. Therefore, these artifacts need to be removed from the EEG sets used with ML decoders [76]. EEG-related artifacts can be categorized into extrinsic artifacts (environmental artifacts and artifacts due to experimental errors) and intrinsic or physiological artifacts (e.g. eye blinks, muscle activity, heartbeat) [77]. Environmental artifacts can be eliminated by simple filtering [78], while a proper experimental procedure can reduce or eliminate the experimental errors [76]. Finally, the elimination of intrinsic artifacts requires specialized techniques, which include regression [79], independent component analysis [80], empirical-mode decomposition [81], blind source separation [82] and wavelet transform [83].

Step-by-step guidelines for designing BCI decoders
Given an EEG dataset D (or multiple datasets):
1. Ensure that all experimental conditions (stimuli, environmental conditions, demographics and subject's intentions) are designed carefully and resemble the deployment environment.
2. Gather all necessary and important information about D. If needed, remove all artifacts. If you combine multiple Ds, then make sure to align the sampling frequency and normalize the EEG signals.
3. Using the proposed causal framework, identify the causal scenario of the brainwave decoding model.
4. If D is scarce, then deploy data augmentation or pre-training techniques for both causal and anti-causal tasks, or self-supervised and semi-supervised learning for anti-causal tasks only.
5. Using the causal factorization of your BCI decoding case, identify the possible distribution shifts and employ appropriate solutions.
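Step 4 of the step-by-step guidelines amounts to a small decision rule, sketched here as a hypothetical helper (the function name and the returned labels are illustrative, not an established API):

```python
def recommended_techniques(decoder_is_anti_causal, data_is_scarce):
    """Map step 4 of the guidelines to a list of candidate techniques."""
    if not data_is_scarce:
        return []
    # Techniques that enrich P(X, Y) help in either causal direction.
    techniques = ["data augmentation", "pre-training"]
    # Techniques that only enrich P(X) are informative for anti-causal tasks.
    if decoder_is_anti_causal:
        techniques += ["semi-supervised learning", "self-supervised learning"]
    return techniques
```

For a decoder that the causal framework marks as anti-causal, all four families are candidates; for a causal task only the first two remain.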

Associated task labels P S (Y)
It is important to make sure that the training set for an ML model is as similar as possible to the deployment set in terms of the distribution of classes (task class balance). If the training set is imbalanced relative to the data seen at deployment, the model may not generalize well to new, unseen data.
In cases where this is not possible, specific techniques should be employed to correct the bias in estimating the training loss. For instance, [84] proposes a BCI method for seizure detection that mitigates the performance bias caused by an imbalanced class distribution by assigning different weights to the EEG samples.
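The idea of re-weighting samples to correct a shift in the label distribution can be sketched as follows. Each training sample receives the weight P_deploy(y) / P_train(y), so that the weighted training loss estimates the loss under the deployment class balance. This is a generic importance-weighting sketch and not the method of [84]; the function names are hypothetical, and the deployment class prior `target_prior` is assumed to be known or estimated separately.

```python
import math
from collections import Counter


def label_shift_weights(train_labels, target_prior):
    """Per-class weights P_deploy(y) / P_train(y), using empirical training frequencies."""
    counts = Counter(train_labels)
    n = len(train_labels)
    return {c: target_prior[c] / (counts[c] / n) for c in counts}


def weighted_nll(probs, labels, weights):
    """Weighted mean negative log-likelihood (normalized by the sum of the weights).

    probs[i][c] is the predicted probability of class c for trial i.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Example: 10% seizure trials (class 1) in training, balanced classes assumed at deployment.
labels = [1] * 10 + [0] * 90
weights = label_shift_weights(labels, {0: 0.5, 1: 0.5})
```

With these weights, the ten minority-class trials contribute as much to the loss as the ninety majority-class trials, which matches the assumed balanced deployment prior.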

Discussion
Causal reasoning enables us to identify possible distribution shifts that can hurt the generalization of ML-based BCI brainwave decoders. In the previous section, we explored how general ML practices as well as specifically designed techniques can contribute to solving many of the identified causal challenges (table 3). Although these techniques are crucial in themselves, we need to identify appropriate evaluation schemes in order to measure and compare their performance and to plan improvements. So far, in brainwave modeling for BCIs, the performance of trained models has been measured on unseen parts of the available EEG dataset. Splitting brainwave datasets into training and testing parts comes with certain limitations and inherent biases. Therefore, EEG datasets need to be diverse in terms of the available subjects (to capture a large part of the population) and include multiple sessions per subject. In that way, various ML models could be tested (with the appropriate splitting) for their performance across various tasks as well as for their robustness to the challenges and data distribution shifts identified above, such as inter-subject and inter-session variability.
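One way to realize the splitting mentioned above is a leave-one-group-out scheme, in which all trials of one group are held out for testing and the remaining trials are used for training. Using subject identifiers as groups probes inter-subject variability; applying the same scheme to the session identifiers of a single subject probes inter-session variability. The sketch below is a minimal illustration, and the function name `leave_one_group_out` is an assumption made here.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for every distinct group.

    groups: one label per trial, e.g. the subject ID or the session ID of that trial.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Example: five trials recorded from three subjects.
subjects = ["s1", "s1", "s2", "s3", "s3"]
folds = list(leave_one_group_out(subjects))
```

Because no trial of the held-out subject appears in the training indices, the resulting test score reflects performance on an unseen subject rather than on unseen trials of a known subject.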
The BCI ML community is in need of appropriate benchmarks, similar to well-established efforts in other areas like computer vision (e.g. [85,86]), that can capture multiple scenarios and act as a valid indicator of the performance of different brainwave decoders. On that front, the Benchmarks for EEG Transfer Learning competition [87] took place at the top-tier Neural Information Processing Systems 2021 conference. This benchmark provides a reference platform for transfer learning strategies and their performance on two widely used BCI paradigms (MI and sleep stage detection). It comprises two challenges (1. transfer across subjects and 2. transfer across subjects and datasets) and promotes the use of big data in brainwave analysis. Although this competition was a significant milestone, more initiatives of this kind are needed to capture more BCI paradigms and distribution shift scenarios.
The proposed causal framework represents a crucial advancement, poised to serve as a guiding principle for the development of advanced BCI brainwave decoders. References [29,50] harness this framework to demonstrate enhanced cross-subject MI EEG decoding capabilities. In these studies, the framework is employed to characterize all potential distribution shifts within the MI task, as presented in one of the four causal diagrams in figure 3. This causal analysis dissects the problem of MI EEG classification into its causal factors, pinpointing all distribution shifts and attributing inter-subject variability to a distribution shift in one of the core variables of interest.
Such analysis facilitates the design of an evaluation setup that accounts for all the identified challenges and distribution shifts. This is achieved by selecting MI datasets that are class-balanced and possess an ample number of trials per subject, and by performing experiments and comparisons within a single dataset each time, ensuring that the EEG data come from the same recording device. Previous studies in the field often rely on a combination of techniques such as data augmentation, which can affect other variables of interest and raises doubts about their effectiveness in addressing inter-subject variability. In contrast, these causality-based works [29,50] are underpinned by a carefully crafted evaluation setup (a direct outcome of the proposed causal framework) designed to specifically target the issue of inter-subject variability.

Conclusion
In this work, we take a causal reasoning approach to analyze the task of brainwave modeling. We propose a causal framework that relies on two core properties (the presence of a task stimulus and the voluntary engagement of the subject) and can break down every BCI paradigm into its independent components. To the best of our knowledge, this is the first study to combine ML and causal reasoning in the field of BCIs and to present a unified causal framework. This is a theoretical study, but we believe that the proposed causal framework has the potential to improve BCIs in the future and can assist us in identifying potential problems with ML systems that attempt to solve the complex problem of exploiting the brain.

Figure 1. Statistical (left) and causal (right) models on a given set of two variables Z and Y. In the case of a statistical model, a single probability distribution is specified, while in the case of a causal model, a set of distributions, one for each possible intervention, is specified. Adapted from [26]. CC BY 4.0.

Figure 2. (Left) An example of an encoding brainwave model that predicts the brainwave activity based on a stimulus. (Right) An example of a decoding brainwave model that predicts the label based on the brainwave activity.

Figure 3. Our proposed framework to break down the task of brainwave decoding into causal sub-modules. For each category, there is a specific directed acyclic graph (DAG) that represents the causal relations between the identified variables of interest: (Y) represents the true task label, (S) the stimulus, (I) the subject's intention and (E) the environment factor.

Table 1. Examples of BCI paradigms categorized based on our proposed framework. (Ex) represents Exogenous, (End) represents Endogenous, (VE) stands for Voluntarily Engaged and (IE) stands for Involuntarily Engaged.
In table 1, we demonstrate examples of some BCI paradigms based on this proposed causal framework.

Table 2. All possible shifts in the joint distribution P(X, Y, Z, •) in brainwave decoding for all causal factorization scenarios described in our proposed causal framework.

Table 3. Step-by-step guidelines for designing BCI decoders using the proposed causal framework and appropriate causal factorizations.