Accelerating P300-based neurofeedback training for attention enhancement using iterative learning control: a randomised controlled trial

Objective. Neurofeedback (NFB) training through brain–computer interfacing has demonstrated efficacy in treating neurological deficits and diseases, and enhancing cognitive abilities in healthy individuals. It was previously shown that event-related potential (ERP)-based NFB training using a P300 speller can improve attention in healthy adults by incrementally increasing the difficulty of the spelling task. This study aims to assess the impact of task difficulty adaptation on ERP-based attention training in healthy adults. To achieve this, we introduce a novel adaptation employing iterative learning control (ILC) and compare it against an existing method and a control group with random task difficulty variation. Approach. The study involved 45 healthy participants in a single-blind, three-arm randomised controlled trial. Each group underwent one NFB training session, using different methods to adapt task difficulty in a P300 spelling task: two groups with personalised difficulty adjustments (our proposed ILC and an existing approach) and one group with random difficulty. Cognitive performance was evaluated before and after the training session using a visual spatial attention task and we gathered participant feedback through questionnaires. Main results. All groups demonstrated a significant performance improvement in the spatial attention task post-training, with an average increase of 12.63%. Notably, the group using the proposed iterative learning controller achieved a 22% increase in P300 amplitude during training and a 17% reduction in post-training alpha power, all while significantly accelerating the training process compared to other groups. Significance. Our results suggest that ERP-based NFB training using a P300 speller effectively enhances attention in healthy adults, with significant improvements observed after a single session. 
Personalised task difficulty adaptation using ILC not only accelerates the training but also enhances ERPs during the training. Accelerating NFB training, while maintaining its effectiveness, is vital for its acceptability by both end-users and clinicians.


Introduction
The number of individuals suffering from cognitive deficits is rising due to an ageing and growing population, increasing the prevalence of conditions such as dementia [1], attention deficit/hyperactivity disorder (ADHD) [2], and stroke [3]. These cognitive deficits not only impact the quality of life of affected individuals and their families [4,5] but also place a significant burden on society [1,6].
Neurofeedback training based on electroencephalography (EEG-NFB), a method of brain training where individuals learn to modulate brain activity through real-time feedback of their EEG signals, is emerging as an effective, non-invasive, and user-friendly treatment option for individuals with cognitive deficits. EEG-NFB is a subset of brain-computer interfaces (BCIs), which are systems that translate brain activity into commands for external devices or software. While traditional BCIs are designed for controlling external devices through brain patterns, EEG-NFB focuses on self-regulation of brain activity without necessarily involving external device control.
Although first introduced in the 1970s, EEG-NFB has seen a surge of interest in recent years due to advancements in BCI technologies [7]. EEG-NFB is now utilised in treating various mental or mood disorders such as schizophrenia, depression, dementia, and ADHD, and in cognitive rehabilitation after brain injuries or stroke [8].
However, EEG-NFB is not limited to individuals with cognitive deficits; it also offers potential benefits for healthy individuals. Everyday cognitive stimulation is known to influence the risk of developing dementia later in life [9]. Yet, due to modern technologies, everyday cognitive activity is declining [10,11]. EEG-NFB could be employed for cognitive enhancement in healthy individuals to reduce age-related cognitive decline and the risk of developing dementia.
Studies targeting specific EEG frequency bands (rhythm-based EEG-NFB) have shown promising effects on cognitive processes in both healthy young and elderly adults, as well as in individuals with mild cognitive impairment and dementia. Typically, NFB training involves 5 to 30 sessions, lasting 10 to 90 min each, and is conducted weekly or several times a week [14-16].
Many rhythm-based EEG-NFB studies report positive training effects on event-related potentials (ERPs) (e.g. [17, 18]). ERPs are brain patterns elicited by specific stimuli or events, and they are associated with underlying cognitive processes [19]. One well-known ERP is the P300, which is triggered by the oddball paradigm, where an infrequent target stimulus is presented among frequent non-target stimuli [20]. The P300 is related to attention and working memory, with larger amplitude and shorter latency indicating better cognitive abilities [20]. Similarly, the N200 ERP, also triggered by the oddball paradigm, is associated with attention [19]. ERPs have also been investigated as potential biomarkers for conditions such as ADHD [21], mild cognitive impairment, and dementia [22].
Given the relevance of ERPs like the P300 and N200 to cognitive processes targeted in training, their potential as biomarkers for conditions treated with NFB training, and evidence suggesting that ERPs are modifiable via training, it is logical to consider targeting ERPs directly. However, while rhythm-based EEG-NFB is well established, there has been less research on the use of ERP-based NFB training. Rieger et al [23] investigated the use of the N100, an auditory ERP component associated with attention, for NFB treatment of hallucinations in individuals with schizophrenia but did not find the training to be effective. Musso et al [24] successfully used NFB based on the auditory P300 for language training in patients with aphasia. Mismatch negativity, a subcomponent of the auditory N200, was used as NFB for working memory training in patients with subjective cognitive decline [25].
Fouillen [26] used P300-based video games for attention training in children with ADHD but did not find the NFB training efficacy to be superior to control groups. Li et al [27] used a P300-based video game for cognitive training in healthy adults and reported improved P300 ERP amplitude and latency post-training. Both Jacoby [28] and Arvaneh et al [29] used a P300 speller, traditionally a BCI used for communication (as described in more detail in section 1.1), for cognitive training in healthy adults. While both studies reported promising results, Jacoby [28] observed a performance plateau after three sessions of the P300 speller training and hypothesised that the P300 speller might not be engaging enough to keep participants motivated. Arvaneh et al [29] addressed this potential issue by progressively increasing the task difficulty based on participants' performance to maintain engagement.
In this study, we use ERP-based NFB training to improve attention in healthy adults, gathering further necessary evidence for the use of ERPs as NFB. This study extends the adaptive P300 speller of Arvaneh et al [29], aiming to accelerate the training by using iterative learning control (ILC), a method from the control sciences (described further in section 1.1), to adapt the task difficulty. EEG-NFB training can be a time-consuming process, which might limit user acceptability. It is therefore important to accelerate the training, making it more attractive and feasible for end-users and clinicians.
In addition to adapting the task difficulty with ILC, we also use the Arvaneh et al task difficulty adaptation approach [29] as a benchmark comparison, and use a random task difficulty to examine the impacts of personalised versus non-personalised task difficulty adaptation on the training efficacy and training length.
We hypothesise that a single session of P300-based NFB training, utilising ILC for task difficulty adaptation, will not only improve attention post-training but also accelerate the training process without compromising its effectiveness.
The theoretical background for the proposed BCI system is explained in section 1.1. An overview of the study design, along with a description of the tasks and an explanation of the data analysis steps, is given in section 2. The detailed experimental protocol for this study was published previously [30]. We present the study results in section 3 and discuss and compare them to the results of Arvaneh et al [29] in section 4, before concluding the manuscript in section 5.

Theoretical background

P300 speller for NFB training
The P300 speller is a common BCI application that uses the P300. Farwell and Donchin [31] first developed it in the 1980s as a communication tool for individuals with severe communication impairments, such as locked-in syndrome patients. The speller functions like an on-screen keyboard, presenting the user with a grid of symbols, such as letters and numbers, where each symbol in the grid is highlighted or flashed. The user focuses on the symbol they wish to select, generating a P300 wave when the target symbol is flashed.
To improve the signal-to-noise ratio of the EEG signals, each symbol in the grid is flashed several times, so that the EEG signals can be averaged over all the flashes, making it easier to detect the P300. Traditionally, the number of flashes per symbol is fixed; however, Arvaneh et al [29] demonstrated that, by progressively decreasing the number of flashes, the P300 speller can be used as an NFB training tool, encouraging users to increase their focus to maintain performance. The number of flashes can thus be seen as the task difficulty of the P300 speller. The Arvaneh et al study [29] found improved performance in a spatial visual attention task after one session of P300-based EEG-NFB training, as well as an enhanced P300 wave. These training effects were not observed in the control group, which completed the P300 speller training without receiving feedback [29].
Notably, the reduction in the number of flashes also increases the feedback frequency, bringing the training closer to the real-time nature of rhythm-based NFB training. Previous work by Arvaneh et al [32] underscored the significance of feedback frequency, showing that more frequent feedback leads to greater training effects.

ILC
ILC is a control method used for systems that repeat the same task across several iterations [33]. Designed to learn from past experiences to eliminate tracking error over multiple iterations, it can also naturally adapt to small system changes, given its inherent feedback structure. The general framework of ILC is represented by the equation

u_{k+1} = f(u_k, e_k),    (1)

where k is the iteration index, and u and e are the control input and tracking error, respectively [33].
ILC was independently developed by Uchiyama [34] and Arimoto et al [35] in the late 1970s and 1980s. Initially used in industrial robotics and semiconductor manufacturing [36], it has since found applications in the biomedical world, including exoskeleton control [37] and stroke rehabilitation [38].
A common form of ILC is

u_{k+1} = u_k + K e_k,    (2)

where K is the learning-gain operator. This operator can be a simple scalar, such as in the Arimoto algorithm [35], or it can be based on the system model [33,38]. As NFB training is inherently repetitive, with changes to the system due to learning effects/neural plasticity, applying ILC to adapt the task difficulty seems an appropriate approach.
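The iteration-domain learning that this update embodies can be illustrated with a short numerical sketch. The toy static plant, the gain values, and the reference trajectory below are illustrative assumptions of this sketch, not part of the study; the update itself is the common Arimoto-type scalar-gain form u_{k+1} = u_k + K e_k.

```python
import numpy as np

def ilc_trial(u, plant_gain=0.8):
    """Execute one iteration of a toy static plant: y = plant_gain * u."""
    return plant_gain * u

# Reference trajectory to track within each iteration (10 samples).
r = np.sin(np.linspace(0, np.pi, 10))

u = np.zeros_like(r)   # control input, refined across iterations
K = 0.5                # scalar learning gain (Arimoto-type)

for k in range(20):
    y = ilc_trial(u)   # run the k-th iteration
    e = r - y          # tracking error of this iteration
    u = u + K * e      # ILC update: u_{k+1} = u_k + K * e_k

# The error contracts by a factor |1 - plant_gain * K| = 0.6 per iteration,
# so after 20 iterations the residual tracking error is near zero.
print(np.max(np.abs(r - ilc_trial(u))))
```

Because the update uses only signals recorded in the previous iteration, the controller needs no model of the plant to drive the error down, which is exactly the property exploited when the "plant" is a learning participant.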

Proposed BCI-NFB system
We propose a combined BCI and NFB system, which builds on the work by Arvaneh et al [29]. A BCI typically involves the user controlling an external device or computer program with their brain activity. In contrast, NFB primarily focuses on providing users with real-time feedback based on their brain activity, without involving direct device control. In this study, we merge these two concepts, enabling users to control a P300 speller through their brain activity and leveraging the inherent feedback from the speller, showing which letter was selected, to provide NFB. Notably, the difficulty level of the P300 speller, i.e. the ease or difficulty of correctly spelling words, is adapted based on the user's performance.
The proposed system therefore contains two feedback loops, as can be seen in figure 1. The outer feedback loop represents the NFB provided to the user in terms of the letter that was selected based on their ERPs, while the inner feedback loop is responsible for adjusting the task difficulty based on the user's performance.

Figure 1. Overview of the proposed neurofeedback system. The user controls a P300 speller, spelling words on a computer, using event-related potentials (ERPs) extracted from their EEG signals. The letter that was selected based on their ERPs is provided to the user as neurofeedback. The task difficulty of the P300 speller is adapted to the user's performance to keep the system challenging and engaging. EEG = electroencephalography, LDA = linear discriminant analysis.

P300 speller
In this study, the P300 speller was implemented in OpenViBE [39]. It is a 6 × 6 grid of letters and numbers. Participants copy-spelled all words; this means that they were always given the target letter to spell. This is done by highlighting the target letter in blue for 6 s before the start of the flashes. The letter identified by the computer is always highlighted in grey, whether it was correctly or incorrectly identified, as shown in figure 2(a). The target letter and feedback are also displayed below the grid. Figure 2(b) shows how the symbols are flashed. The symbols in the grid are flashed per row and column by increasing the font size and changing the font colour to white. The flashes last for 55 ms with an inter-flash interval of 117 ms.

Task difficulty adaptation approaches
In this study, we propose to adapt the task difficulty during the P300 speller training, represented by the number of flashes, using ILC. To evaluate the efficacy of ILC, we also included the adaptation approach by Arvaneh et al [29] as a benchmark comparison, and a random difficulty adaptation to investigate the effects of personalised vs non-personalised task difficulty. Study participants were randomly assigned to one of these three task difficulty adaptation approaches, which are explained in this section.
ILC approach

With our proposed approach, the task difficulty, i.e. the number of flashes per row and column, is adapted by an iterative learning controller. The update law of the controller is

N_i = N_{i−1} + ϵ f(e_{i−1}),    (3)

where i denotes the run number, N is the number of flashes per row and column, and ϵ is a controller tuning parameter. f(e_{i−1}) is a penalty function relating to the error, i.e. the percentage of incorrect letters in a word, in the previous run, defined as

e_{i−1} = 1 − J_{i−1},    (4)

where J_i is the spelling accuracy in run i. Figure 3 shows f(e_{i−1}) with the equation

f(e_{i−1}) = 2 e_{i−1} − 1.    (5)

It can be seen that if the error in the previous run is 0.5, which means that 50% of the word was spelled incorrectly, the function is zero. The number of flashes in the new run is therefore the same as in the previous run. If every letter in the previous run is correct, the function becomes negative unity, and the number of flashes is consequently reduced by ϵ. The opposite happens when all the letters in the previous run are incorrect.
The controller parameter ϵ is tuned in simulation and pilot experiments to

ϵ = N_{i−1}/2.    (6)

Since ϵ can be interpreted as the maximum update step, as explained above, making it dependent on the previous number of flashes ensures that the step size is appropriate for the task difficulty. Equations (3), (5) and (6) can be simplified to

N_i = N_{i−1} (1/2 + e_{i−1}),    (7)

with the result of (7) always rounded up.
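As a concrete sketch of the behaviour just described — difficulty unchanged at 50% accuracy, flashes reduced by the maximum step when a word is fully correct, and increased when it is fully wrong — the function below implements one plausible realisation in Python. The linear penalty and the half-of-previous-count step size are assumptions of this sketch rather than the exact tuned controller.

```python
import math

def update_flashes(n_prev: int, accuracy: float) -> int:
    """ILC-style difficulty update for the flash count (illustrative).

    accuracy is the fraction of correctly spelled letters in the
    previous run (J), so the error is e = 1 - J. Assumes a linear
    penalty f(e) = 2e - 1 scaled by a maximum step eps = n_prev / 2,
    which together give n_new = ceil(n_prev * (0.5 + e)).
    """
    e = 1.0 - accuracy
    return max(1, math.ceil(n_prev * (0.5 + e)))

print(update_flashes(10, 1.0))  # word fully correct: flashes halved to 5
print(update_flashes(10, 0.5))  # 50% correct: difficulty unchanged at 10
print(update_flashes(10, 0.0))  # word fully wrong: flashes raised to 15
```

Note how tying the step size to the previous flash count makes the update multiplicative: the controller takes large steps at easy difficulty levels and progressively finer steps as the flash count shrinks.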

Benchmark approach
The task difficulty adaptation approach used by Arvaneh et al [29] is used as a benchmark comparison. The approach calculates the number of flashes in the next run as the average between the number of flashes used in the previous run and the minimum number of flashes in the previous run that would have resulted in at least 66% spelling accuracy:

N_i = (N_{i−1} + N_{(i−1)66})/2,    (8)

where N_i is the number of flashes per row and column in run i, and N_{(i−1)66} is the minimum number of flashes per row and column in run i − 1 that would have resulted in a spelling accuracy of at least 66%.
To determine N_{(i−1)66}, spelling accuracy is assessed for all numbers of flashes between 1 and N_{i−1}. This is achieved by considering varying numbers of trials for classification: initially only the trials from the first flash for each row and column are considered, then the average of the trials from the first two flashes, and so on, until all trials from N_{i−1} flashes are used for classification. The result of (8) is always rounded up. If a participant does not achieve a spelling accuracy of 66% or more, the number of flashes is increased by one in the next run:

N_i = N_{i−1} + 1.    (9)

Random difficulty approach

In this approach, each training run's task difficulty is determined by a random number of flashes, with the number being an integer between 1 and 10. By adopting this method, participants in this group face a wide range of task difficulty levels. Importantly, these levels are both non-personalised and unpredictable. There is a concern that keeping the task difficulty constant could introduce bias due to an easier average task difficulty compared to the other groups [40]. Similarly, a systematic reduction in the number of flashes for each run might introduce bias because of its predictability, a factor absent in the other groups. The choice of a random range between 1 and 10 is made to ensure a wide variation in difficulty, while not decreasing the task difficulty beyond the initial difficulty level of ten flashes per row and column.
It is important to note that only the task difficulty in this group is randomised. The feedback provided to participants in this group is based on their actual brain signals, meaning that they undergo genuine NFB training.
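The two comparison schemes — the averaging rule of the benchmark group and the random draw of the random difficulty group — can be sketched in a few lines. The function names and the accuracy-by-flashes input format below are assumptions of this illustrative reconstruction.

```python
import math
import random

def benchmark_update(n_prev, accuracy_by_flashes):
    """Benchmark adaptation in the style of Arvaneh et al (illustrative).

    accuracy_by_flashes[k-1] is the spelling accuracy obtained when only
    the first k flashes per row/column of the previous run are used for
    classification (k = 1 .. n_prev).
    """
    # Smallest flash count that would have reached >= 66% accuracy.
    n_66 = next((k for k, acc in enumerate(accuracy_by_flashes, start=1)
                 if acc >= 0.66), None)
    if n_66 is None:
        return n_prev + 1  # 66% never reached: make the task easier
    return math.ceil((n_prev + n_66) / 2)  # average, rounded up

def random_update():
    """Random difficulty group: integer flash count between 1 and 10."""
    return random.randint(1, 10)

print(benchmark_update(10, [0.2, 0.5, 0.7] + [0.9] * 7))  # n_66 = 3 -> 7
```

Unlike the ILC update, the benchmark rule looks inside the previous run (accuracy as a function of flash count) rather than using only its final accuracy, while the random rule ignores performance entirely.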

EEG acquisition and online processing
EEG signals for each participant were recorded using the ANT Neuro eego rt amplifier [41] with a 32-channel waveguard cap [42], where the electrodes are in standard 10-20 positions [43]. AFz and CPz are used as ground and reference electrodes, respectively. During online use of the P300 speller, all EEG signals are filtered between 1 and 20 Hz using a 4th-order Butterworth filter. The signals are then downsampled by a factor of 4. The xDAWN spatial filter [44] is used to reduce the 32 EEG channels to three xDAWN components that maximise the difference between target and non-target trials. A linear discriminant analysis (LDA) classifier [45] uses epochs from stimulus onset to 600 ms post-stimulus to determine the target row and column, i.e. the row and column that contain the letter the user is focusing on. Both the xDAWN spatial filter and the LDA classifier are trained for each participant at the beginning of the session using the EEG data recorded from copy-spelling runs.
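The online chain described above can be sketched with SciPy/scikit-learn stand-ins. The sampling rate, the synthetic data, and the plain slicing used for downsampling are assumptions of this sketch, and the xDAWN stage is omitted: random synthetic components are classified instead of real spatially filtered EEG.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

FS = 500  # assumed raw sampling rate (Hz); the amplifier's may differ

# 4th-order Butterworth band-pass between 1 and 20 Hz.
sos = butter(4, [1, 20], btype="bandpass", fs=FS, output="sos")

def preprocess(eeg):
    """eeg: (channels, samples). Band-pass filter, then downsample by 4."""
    return sosfiltfilt(sos, eeg, axis=-1)[:, ::4]

# Synthetic stand-in for spatially filtered data: 100 epochs of 3
# components by 75 samples (0-600 ms at 125 Hz). In the real system the
# 3 components come from an xDAWN spatial filter, omitted here.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3, 75))
y = rng.integers(0, 2, size=100)   # 1 = target flash, 0 = non-target
X[y == 1, :, 30:45] += 0.8         # crude P300-like deflection on targets

clf = LinearDiscriminantAnalysis()
clf.fit(X.reshape(len(X), -1), y)  # epochs flattened to feature vectors
print(clf.score(X.reshape(len(X), -1), y))
```

In the real pipeline the classifier scores are accumulated per row and per column across flashes, and the row/column pair with the strongest target evidence selects the letter.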

Study design
This study consists of a single-blind, parallel, three-arm randomised controlled trial. The study was approved by the Maynooth University Ethics Committee (BSRESC-2022-2474456) and is registered on ClinicalTrials.gov (NCT05576649).
Fifty-one healthy adults with no self-reported history of neurological disease or condition, and with normal or corrected-to-normal vision, were recruited for the study. Four participants dropped out before the experiment was conducted, and two more participants were excluded from analysis due to a small change in the task difficulty adaptation algorithm in the ILC group. All recruited participants gave informed consent and underwent an allergy patch test for the electroconductive gel before the experiment started. Participants were randomly assigned to one of three groups: the ILC group, where the task difficulty is determined by an iterative learning controller; the benchmark group, where the task difficulty is adapted according to the algorithm by Arvaneh et al [29]; and the random difficulty group, where the task difficulty is randomly chosen. Fifteen participants per group are included in the analysis, with mean ages of 27.2 ± 10.3, 26.1 ± 9.4, and 27.4 ± 10.6 years in the ILC group, benchmark group, and random difficulty group, respectively. Each group has six female participants, and one participant in the random difficulty group preferred not to specify their gender.
The study involves a single NFB training session that lasted no more than 2 h including setup. The experiments were conducted in an electrically shielded, sound-attenuated, and dimly lit room on the Maynooth University campus. Participants were asked to complete pen-and-paper questionnaires and a spatial attention computer task at the beginning and end of the session, in addition to attention training with the adaptive P300 speller. All tasks are described in detail in section 2.3, and an overview of the study procedure can be seen in figure 4.

Tasks and stimuli
Participants in this study completed a series of tasks including questionnaires, a visual spatial attention task (random dot motion (RDM) task), and P300 speller training. These tasks are described in detail in sections 2.3.1-2.3.3.

Questionnaires
Participants were asked to complete the same ten-point Likert scale questionnaire as in Arvaneh et al [29] at the beginning and end of the experimental session. The four questions on the questionnaire are: (i) How tired are you now? (ii) How alert do you feel? (iii) How bored do you feel? (iv) Do your eyes feel tired?
In the remainder of the manuscript, this questionnaire is referred to as the fatigue-boredom questionnaire for clarity.
Additionally, participants completed the NASA task load index (TLX) [46] at the end of the session to evaluate the subjective workload of the training.

RDM task
In this continuous RDM task, based on [29,47], participants observed randomly moving dots on a computer screen, where a fraction of the dots moves coherently in one direction (left or right) at certain times. Participants were required to indicate the direction in which the coherently moving dots are moving by pressing the left or right arrow key as soon as they are sure of the motion direction. The dots switch between coherent and incoherent motion in a continuous manner. Figure 5 shows a schematic of the RDM task, where the coherence level is the percentage of dots that move in the same direction. A total of 118 black dots, each with a size of 6 by 6 pixels, is presented against a grey background in a circle of 5° visual angle at a viewing distance of 70 cm. The task was developed in-house using PsychoPy [48] and presented to participants on a 1920 by 1080 pixel, 52.7 cm wide LCD screen with a 60 Hz refresh rate.
Participants completed 40 target trials of the RDM task, i.e. 40 periods of coherent motion with incoherent motion in between, before and after attention training. The periods of coherent motion always last for 1.9 s, whereas the incoherent motion lasts for 3.1 s, 4.2 s, or 5.7 s, chosen randomly. The coherence level is either 19% or 25%, also chosen randomly for each trial. To familiarise participants with the task, and to reduce performance improvements due to learning effects, three practice runs of six trials each were completed before the pre-training 40-trial run. Verbal feedback, informing participants of hits, misses, and false alarms, was given during all three practice runs. The coherence levels in the first practice run are 80% and 60%, in the second practice run they are reduced to 40% and 30%, and in the final practice run, either 25% or 20% of the dots move coherently.
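The randomised trial structure of one RDM run can be expressed compactly; the function name, dictionary format, and seeding below are assumptions of this illustrative sketch, not the study's actual PsychoPy implementation.

```python
import random

def rdm_trial_schedule(n_trials=40, seed=0):
    """Generate the randomised parameters of one RDM run (illustrative).

    Each target trial is 1.9 s of coherent motion preceded by a randomly
    chosen incoherent period, with a randomly chosen coherence level.
    """
    rng = random.Random(seed)
    return [{"incoherent_s": rng.choice([3.1, 4.2, 5.7]),
             "coherence": rng.choice([0.19, 0.25]),
             "coherent_s": 1.9}
            for _ in range(n_trials)]

schedule = rdm_trial_schedule()
print(len(schedule), schedule[0]["coherent_s"])  # 40 1.9
```

Randomising the incoherent-period duration keeps the onset of coherent motion unpredictable, so participants must sustain attention rather than anticipate targets.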

NFB training-P300 speller task
The main part of the experimental session is the P300 speller task.
Participants completed nine runs, i.e. copy-spelled nine words, with the P300 speller. Table 1 gives an overview of these runs.
The first two runs, referred to as calibration runs, are used to collect training data for the spatial filter and classifier. The number of flashes per row and column is fixed to 12 for all participants in all three groups. These are the only runs where participants do not receive feedback, as the classifier is not trained at that point in the session.
Run 3 is the evaluation run. If at most one letter in the word 'dog' is correctly identified, the spatial filter and classifier are re-trained with the additional data from the third run. The new classifier is evaluated in the same way by copy-spelling the word 'fox'. If still at most one letter is correct, the participant would have been excluded from the study. However, all participants were able to spell the word 'dog' with at least two correct letters. The evaluation run, together with the two calibration runs, makes up the pre-training stage. This pre-training stage is the same for all groups.
The fourth run is the first run of the training stage. This run is still the same across all groups, as the number of flashes is fixed at 10. The following four runs, runs 5 through 8, are the runs where the number of flashes is adapted according to the participant's assigned group and previous performance (in the ILC and benchmark groups). The adaptation approaches for each group are explained in more detail in section 2.1.2. The training stage is the only part of the experiment that differs between groups and individuals.
In the final run, the number of flashes is fixed at 12 again, to allow for a direct comparison of performance between groups. In contrast to Arvaneh et al [29], all participants receive feedback in this run. This run is referred to as the post-training stage.

Data analysis
The outcomes of the study are analysed using various statistical tests, including ANOVA tests, Kruskal-Wallis tests, paired t-tests, and Wilcoxon signed-rank tests [49]. Shapiro-Wilk tests [49] are used to test the data for normality. The detailed data analysis conducted for each task and outcome is outlined in sections 2.4.2-2.4.5.

Offline EEG processing
The offline analysis of the EEG signals is based on Arvaneh et al [29], using a Hamming windowed-sinc bandpass filter with a passband between 0.5 and 35 Hz, before re-referencing to Fz. Only the electrodes C3, Cz, C4, P3, Pz, and P4 are included in the offline analysis, according to [29]. This selection aligns with the fact that the P300 is typically most prominent in the central and parietal regions of the brain [20]. The filtered signals are separated into baseline-corrected epochs of 150 ms pre-stimulus to 550 ms post-stimulus, where the 150 ms pre-stimulus period is used as the baseline. Epochs with an amplitude of more than 75 µV, or with a voltage step of more than 150 µV within a 200 ms window, are excluded from analysis as these voltage levels are likely due to artefacts.
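The epoching and artefact-rejection steps just described can be sketched with NumPy. The sampling rate and the synthetic data are assumptions of this sketch; the epoch window, baseline, and rejection thresholds follow the description above.

```python
import numpy as np

FS = 250  # assumed offline sampling rate (Hz) for this sketch

def extract_epoch(eeg, onset):
    """Baseline-corrected epoch from 150 ms pre- to 550 ms post-stimulus.

    eeg: (channels, samples) band-passed, re-referenced signal in uV.
    Returns None when the epoch is rejected as an artefact.
    """
    pre, post = int(0.150 * FS), int(0.550 * FS)
    epoch = eeg[:, onset - pre: onset + post].copy()
    epoch -= epoch[:, :pre].mean(axis=1, keepdims=True)  # baseline correction

    # Rejection criterion 1: absolute amplitude above 75 uV.
    if np.abs(epoch).max() > 75.0:
        return None
    # Rejection criterion 2: voltage step above 150 uV in any 200 ms window.
    win = int(0.200 * FS)
    for start in range(epoch.shape[1] - win + 1):
        seg = epoch[:, start: start + win]
        if (seg.max(axis=1) - seg.min(axis=1)).max() > 150.0:
            return None
    return epoch

# Example: a clean synthetic recording passes, one with a spike is rejected.
rng = np.random.default_rng(1)
clean = rng.normal(scale=5.0, size=(6, 1000))  # 6 channels, values in uV
noisy = clean.copy()
noisy[2, 500] = 300.0                          # large artefact spike
print(extract_epoch(clean, 400) is not None, extract_epoch(noisy, 400) is None)
```

The sliding-window step criterion catches fast transients (e.g. blinks or electrode pops) that a plain amplitude threshold could miss when the epoch has a large slow drift removed by the baseline.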

Questionnaires
Since some of the fatigue-boredom questionnaire scores in the ILC group and benchmark group fail the Shapiro-Wilk test for normality, a non-parametric test is required. As such, a 2 × 4 × 3 (stages: pre- and post-training, questions, groups) repeated measures ANOVA test is applied to the ranked scores of the fatigue-boredom questionnaire. This is followed by one-way ANOVA tests for normally distributed scores, and Kruskal-Wallis tests for non-normally distributed scores, to investigate between-group differences, and by paired t-tests for normal data, and Wilcoxon signed-rank tests for non-normal data, for within-group differences.
For the NASA TLX, one-way ANOVA tests and Kruskal-Wallis tests (for the physical demand and performance questions due to non-normality) are used to investigate differences between groups for each individual question and the total score.

RDM task
Performance in the RDM task is evaluated using two metrics: response time and accuracy. The first metric, response time, is defined as the time elapsed between the onset of coherent motion and the participant's button press. The average response time for each participant is calculated using only correct trials. Due to the normal distribution of the response times, a 2 × 3 (stages: pre- and post-training, groups) repeated measures ANOVA test is applied.
The second metric, accuracy, is determined by the percentage of correct trials. Because the post-training accuracies of the random difficulty group fail the Shapiro-Wilk test for normality, the data is ranked before applying a 2 × 3 (stages: pre- and post-training, groups) repeated measures ANOVA test. One-way ANOVA and Kruskal-Wallis tests are used for examining between-group differences, while paired t-tests and Wilcoxon signed-rank tests are employed for assessing within-group differences.
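The normality-driven choice between parametric and non-parametric within-group tests can be sketched with SciPy. The helper name, the alpha threshold's role as a routing rule, and the synthetic pre/post data are assumptions of this sketch.

```python
import numpy as np
from scipy import stats

def within_group_test(pre, post, alpha=0.05):
    """Paired pre/post comparison with normality-driven test selection.

    A Shapiro-Wilk test on the paired differences decides between a
    paired t-test (differences look normal) and a Wilcoxon signed-rank
    test. Returns (test_name, p_value).
    """
    diff = post - pre
    if stats.shapiro(diff).pvalue > alpha:
        return "paired t-test", stats.ttest_rel(pre, post).pvalue
    return "wilcoxon", stats.wilcoxon(pre, post).pvalue

# Synthetic example: 15 participants improving after training.
rng = np.random.default_rng(2)
pre = rng.normal(loc=60.0, scale=8.0, size=15)        # e.g. accuracy (%)
post = pre + rng.normal(loc=6.0, scale=4.0, size=15)  # post-training gain
name, p = within_group_test(pre, post)
print(name, p < 0.05)
```

Routing on the distribution of the paired differences, rather than on each sample separately, matches the assumption actually made by the paired t-test.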

NFB training-P300 speller task
The P300 speller task performance is analysed from five different perspectives. The first analysis metric is spelling accuracy. Only runs 3, 4, and 9 are considered, since these runs are the same across groups and provide feedback to the participant by showing the letter that was identified by the computer. The other runs either did not provide feedback or their difficulty varied between subjects, making direct comparisons of spelling accuracy impossible. Due to non-normality of the data, Kruskal-Wallis tests are applied to investigate between-group differences.
Secondly, we examine the training length. Given that the main goal of the study is to accelerate training using ILC, the ability to compare the length of training across groups is crucial. To maintain accuracy and fairness, the total number of flashes in runs 5 to 8 is used as a metric for training length, rather than the actual time, which could have been influenced by varying setup times, break times, and computer loading times. Kruskal-Wallis tests are applied to the total number of flashes in runs 5 to 8 to examine between-group differences, given the non-normal distribution in the ILC group and benchmark group.
To analyse changes in EEG signals throughout the training, we evaluate the strength of ERP components, focusing on P300 amplitude, P300 latency, and total power. Since our BCI-NFB system relies on the P300, analysing these aspects allows us to detect direct training effects. P300 amplitude is defined as the difference between the negative and positive peaks within the epochs defined in section 2.4.1. This definition captures the N200 component and prevents bias due to epoch drift. P300 latency is defined as the time between stimulus onset and the positive peak. However, the P300 wave can sometimes manifest as a generally increased amplitude compared to non-target trials without a distinct larger peak. To account for this variability, we also analyse total power, which is determined by averaging the squared samples in the epoch averages for both target and non-target trials.
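The three ERP metrics reduce to a few lines of NumPy once an averaged epoch is available. The sampling rate and the synthetic averaged epoch below (a small N200-like dip followed by a P300-like peak) are assumptions of this sketch.

```python
import numpy as np

FS = 250  # assumed sampling rate of the epoched data (Hz)

def erp_metrics(avg_epoch):
    """P300 metrics from an averaged post-stimulus epoch (one channel).

    amplitude:  negative-to-positive peak-to-peak difference, which
                also captures the N200 dip,
    latency_ms: time from stimulus onset to the positive peak,
    power:      mean of the squared samples ("total power" above).
    """
    return {
        "amplitude": float(avg_epoch.max() - avg_epoch.min()),
        "latency_ms": 1000.0 * int(np.argmax(avg_epoch)) / FS,
        "power": float(np.mean(avg_epoch ** 2)),
    }

# Synthetic averaged target epoch, 0-550 ms post-stimulus.
t = np.arange(0, 0.550, 1.0 / FS)
avg = (-2.0 * np.exp(-((t - 0.20) / 0.03) ** 2)   # N200-like dip ~200 ms
       + 6.0 * np.exp(-((t - 0.35) / 0.06) ** 2)) # P300-like peak ~350 ms
m = erp_metrics(avg)
print(round(m["amplitude"], 2), round(m["latency_ms"], 1))
```

Using the peak-to-peak range instead of the raw peak value makes the amplitude metric insensitive to a constant offset of the whole epoch, which is the drift-bias point made above.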
We also conducted an analysis of power in the alpha band (7-12 Hz). Alpha band desynchronisation is well known for its correlation with selective attention and the suppression of distracting information or stimuli [50]. By analysing changes in alpha power throughout the experimental session, we gain valuable insights into how the training modulates participant attention. For this alpha power analysis, only non-target trials that did not follow target trials are considered. Epochs of 0 ms to 150 ms post-stimulus are extracted, with baseline removal, where the 150 ms period preceding the stimulus onset serves as the baseline.
To quantify changes in these EEG metrics, we calculated the average of each metric during each stage. This involved averaging the metric across runs 1 to 3 for the pre-training stage and across runs 4 to 8 for the training stage. The post-training stage consists of a single run, run 9, so no averaging was necessary. Subsequently, we computed the training-to-calibration ratio, which is the ratio between the average during the training stage and the average during the pre-training stage. Similarly, we calculated the post-training-to-calibration ratio, where the metric for the post-training run was divided by the average of the pre-training stage. A ratio greater than unity means that the metric increased compared to the pre-training stage, and a ratio less than unity means that the metric decreased.
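The ratio computation is simple stage-wise arithmetic; the per-run amplitude values below are invented for illustration, but the run-to-stage grouping follows the description above.

```python
import numpy as np

# Per-run P300 amplitudes (uV) for one participant -- illustrative numbers.
runs = {"pre": [4.0, 4.4, 4.2],              # runs 1-3 (pre-training stage)
        "train": [4.8, 5.0, 5.2, 5.1, 5.4],  # runs 4-8 (training stage)
        "post": [5.0]}                       # run 9 (post-training stage)

pre_avg = np.mean(runs["pre"])
train_ratio = np.mean(runs["train"]) / pre_avg  # training-to-calibration
post_ratio = runs["post"][0] / pre_avg          # post-training-to-calibration

# Ratios > 1 indicate an increase relative to the pre-training stage.
print(round(train_ratio, 3), round(post_ratio, 3))
```

Normalising each participant's training and post-training values by their own pre-training average removes baseline differences between individuals before the group-level statistics are applied.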
For P300 amplitude, we conduct a 2 × 3 (stage: training-to-calibration and post-training-to-calibration, group) repeated measures ANOVA test, followed by one-way ANOVA for between-group differences and paired t-tests for within-group differences.
Since P300 latency data is not normally distributed in all conditions, we perform a 2 × 3 (stage: training-to-calibration and post-training-to-calibration, group) repeated measures ANOVA test on ranked data, followed by Kruskal-Wallis tests for between-group differences, and paired t-tests and Wilcoxon signed-rank tests for within-group differences.
As total power is not normally distributed in all conditions, a 2 × 2 × 3 (trial: target and non-target, stage: training-to-calibration and post-training-to-calibration, group) repeated measures ANOVA test is applied to the ranked ratios. Subsequently, 2 × 3 (trial: target and non-target, group) repeated measures ANOVA tests are conducted on the ranked training-to-calibration ratios and the ranked post-training-to-calibration ratios individually. This is followed by one-way ANOVA and Kruskal-Wallis tests for between-group differences, and paired t-tests and Wilcoxon signed-rank tests for within-group differences.
Similarly, given the non-normality of alpha power, a 2 × 3 (stage: training-to-calibration and post-training-to-calibration, group) repeated measures ANOVA test is applied to the ranked ratios, followed by one-way ANOVA and Kruskal-Wallis tests for between-group differences, and paired t-tests and Wilcoxon signed-rank tests for within-group differences.

Correlation between P300 speller task and RDM task
The hypothesis underlying the P300-based NFB training posits that increasing the difficulty of the spelling task compels participants to improve their attention in order to maintain their performance. Based on this hypothesis, we would expect that participants who achieved very good performance in the P300 speller task do not experience a large performance improvement in the RDM task, as they were not challenged by the training. On the other hand, participants with poorer performance might experience a larger performance improvement in the RDM task. Similarly, we would expect participants with larger changes in the EEG signals in the P300 speller task to perform better in the RDM task.
To test this hypothesis, we investigate the correlation between performance and EEG changes in the P300 speller task on the one hand, and performance in the RDM task on the other. In the P300 speller task, we use the average and minimum spelling accuracies in the feedback runs as performance metrics. The minimum spelling accuracy is included because the average spelling accuracy might not accurately capture the level of challenge experienced by a participant, who might perform well in early runs due to easier task difficulty and struggle only in the last training run when the task becomes more difficult. The P300 amplitude, P300 latency, total power ratios, and alpha power ratios are used to quantify changes in EEG signals. For the RDM task, we determine the change in response time and accuracy between the post-training and pre-training stages by calculating the ratio of post-training to pre-training response time and accuracy, respectively. Depending on the distribution of the data, we use Pearson's correlation coefficient for normally distributed data and Spearman's correlation test for non-normally distributed data.
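The distribution-dependent choice of correlation test can be sketched as follows (a simplified rule; the paper does not name the normality test used, so a Shapiro-Wilk check is assumed here, and the function name is illustrative):

```python
import numpy as np
from scipy.stats import pearsonr, shapiro, spearmanr

def correlate(x, y, alpha=0.05):
    """Pearson's r if both samples look normal, Spearman's rho otherwise."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Shapiro-Wilk normality check on each variable (assumed test).
    _, p_x = shapiro(x)
    _, p_y = shapiro(y)
    if p_x > alpha and p_y > alpha:
        r, p = pearsonr(x, y)
        return "pearson", r, p
    r, p = spearmanr(x, y)
    return "spearman", r, p
```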

Post-hoc sensitivity analysis
Despite randomised allocation of participants into groups, we observed different baseline levels of boredom and eye fatigue (questions 3 and 4 in the fatigue-boredom questionnaire) between the groups. This is discussed further in section 3.1.
To ensure that each combination of the two factors (group and baseline score) exists in the data despite the imbalanced score distribution, we aggregate the scores into the levels 'low', 'medium' and 'high'. The distribution of scores and how they were aggregated into levels is shown in table 2.
We then use two-way ANOVA for normally distributed data, and aligned rank transformation (ART) ANOVA for data that does not meet the normality assumption. These tests are applied to investigate between-group differences for all outcome measures and subsequently compared to the results of the primary analysis.

Questionnaires
The repeated measures ANOVA test on the ranked scores reveals significant main effects of group (F(2,42) = 7.94, p < 0.001), stage (F(1,43) = 28.03, p < 0.001) and question (F(3,41) = 28.44, p < 0.001), as well as significant interactions between stage and question (F(3,41) = 3.31, p = 0.020), and between question and group (F(6,39) = 2.16, p = 0.046). A Kruskal-Wallis test shows that pre-training boredom scores differ significantly between groups (χ²(2) = 7.30, p = 0.026), specifically between the benchmark group and the random difficulty group (Tukey-Kramer-Nemenyi: p = 0.03). Additionally, the eye fatigue scores in the pre-training stage are significantly different (χ²(2) = 11.93, p = 0.003). The post-hoc Tukey-Kramer-Nemenyi test reveals that the random difficulty group differs significantly from both the benchmark group (p = 0.018) and the ILC group (p = 0.004). As can be seen in figure 6, the pre-training boredom and eye fatigue scores are higher in the random difficulty group than in the other two groups. Since these differences were observed in pre-training scores, they are likely due to inherent variations among the participants rather than being influenced by the training.
Paired t-tests show a significant increase in tiredness in the ILC group (t(14) = −2.98, p = 0.010) and the random difficulty group (t(14) = −3.06, p = 0.009). An increase in tiredness is also seen in the benchmark group according to a Wilcoxon signed-rank test (Z(14) = 7, p = 0.021). Self-reported boredom increased significantly in the ILC group (Z(14) = 2, p = 0.029), but not in the other groups. A significant increase in eye fatigue is also seen in the ILC group (t(14) = −4.58, p < 0.001) and the benchmark group (Z(14) = 0, p = 0.001). Boredom and eye fatigue were already higher in the random difficulty group before training, which may explain why no significant increase is seen in that group.
Table 3 shows the mean and standard deviation for each question of the NASA TLX, as well as the total score, which is the sum of the scores for the six questions. For each question, 1 is the lowest and 20 the highest score, so the total score ranges from 6 to 120. A Kruskal-Wallis test reveals a significant difference in physical demand (χ²(2) = 8.35, p = 0.015), with the Tukey-Kramer-Nemenyi test showing that the perceived physical demand differs significantly between the benchmark group and the random difficulty group (p = 0.024). Given that there are no physical components in the training session in this study, it is plausible that the observed difference is due to individual differences in the interpretation of the 'physical demand' question. It is worth noting that all groups reported an increase in tiredness in the fatigue-boredom questionnaire, which could lead some participants to perceive higher physical demand. There are no significant between-group differences in the other questions or the total score.

RDM task

Response time
While the mean response time in the RDM task decreased in all groups between pre- and post-training, as can be seen in figure 7, the repeated measures ANOVA test reveals no significant main effects of group or stage, nor an interaction between group and stage. This suggests that response time does not significantly differ between groups and that there is no significant improvement in response time after the training.

Accuracy
The repeated measures ANOVA test on the ranked accuracies reveals a significant main effect of stage (F(1,43) = 10.91, p = 0.001), indicating a significant improvement in accuracy from the pre-training to the post-training stage.

NFB training: P300 speller task

Spelling accuracy
The mean spelling accuracies achieved by the different groups in the feedback runs that were the same for all groups are shown in table 4. It can be seen that the accuracies in run 3 are identical across groups. While all groups performed very well in the P300 speller task, the ILC group achieved the highest accuracies, whereas the random difficulty group achieved lower accuracies even before the training. However, these differences are small, and Kruskal-Wallis tests confirm that they are not statistically significant. These results indicate that when the task difficulty is relatively easy (i.e. a relatively high number of flashes per row and column), all participants, regardless of group, are highly proficient at controlling the speller.

Length of training
The length of training is measured in terms of the total number of flashes in runs 5 to 8, as described in section 2.4.4. The mean total number of flashes is 13.3 ± 2.12, 18.6 ± 4.91 and 22 ± 5.94 in the ILC group, benchmark group and random difficulty group, respectively. A Kruskal-Wallis test reveals a significant difference between groups (χ²(2) = 22.30, p < 0.001). The Tukey-Kramer-Nemenyi post-hoc test shows that the training length of the ILC group is significantly lower than that of both the benchmark group (p = 0.007) and the random difficulty group (p < 0.001), whereas there is no significant difference between the benchmark and random difficulty groups (p = 0.255).

P300 ERP
Figure 8 shows the training-to-calibration and post-training-to-calibration ratios of P300 peak-to-peak amplitude for all groups, with boxplots, means and individual points for each participant. It can be seen that, during the training stage, most participants exhibited an increased P300 amplitude compared to the pre-training stage, whereas the post-training-to-calibration ratios are more mixed.
The repeated measures ANOVA revealed a significant main effect of stage (F(1,43) = 20.13, p < 0.001) and group (F(2,42) = 4.74, p = 0.011). According to the one-way ANOVA, there is a significant difference between groups in the training-to-calibration ratios (F(2,42) = 3.25, p = 0.049), with the post-hoc Tukey test revealing that this difference lies between the ILC and random difficulty groups (p = 0.044). There are no significant between-group differences in the post-training-to-calibration ratios.
The training-to-calibration and post-training-to-calibration ratios differ significantly in both the ILC group and the random difficulty group, as shown by paired t-tests (t(14) = 3.74, p = 0.002 and t(14) = 4.42, p < 0.001, respectively).
As can be seen in figure 9, which shows the ratios for P300 latency, and as evidenced by the repeated measures ANOVA on the ranked latencies, there are no significant differences between or within groups.

Total power
Figures 10 and 11 show the total power ratios for both stages (training-to-calibration and post-training-to-calibration), both trial types (target and non-target) and all groups. Both the ILC group and the benchmark group experienced an increase in the power of target trials (mean increase of 4% in the ILC group and 18% in the benchmark group) and an attenuation in the power of non-target trials (mean decrease of 3% in the ILC group and 4% in the benchmark group) in the post-training stage. In the random difficulty group, the power of both target and non-target trials decreased in the post-training stage compared to the pre-training stage (mean decrease of 10% and 6%, respectively).
The repeated measures ANOVA test on the ranked data, including both stages, reveals significant main effects of stage (F(1,43) = 26.19, p < 0.001) and group. Significant effects are also found in the ranked training-to-calibration ratios, whereas no significant main effects of trial or group are found in the ranked post-training-to-calibration ratios.
There are significant between-group differences in the non-target training-to-calibration ratios according to a Kruskal-Wallis test (χ²(2) = 6.64, p = 0.036), with a Tukey-Kramer-Nemenyi test revealing that the difference lies between the ILC group and the benchmark group (p = 0.03). The Wilcoxon signed-rank tests show that the training-to-calibration and post-training-to-calibration target ratios are significantly different in both the ILC group (Z(14) = 108, p = 0.004) and the random difficulty group (Z(14) = 101, p = 0.018). The training-to-calibration and post-training-to-calibration non-target ratios are also significantly different in the ILC group (Z(14) = 107, p = 0.005) and the random difficulty group (t(14) = 2.91, p = 0.011), according to Wilcoxon signed-rank and paired t-tests, respectively. No significant differences between the training-to-calibration and post-training-to-calibration ratios are found in the benchmark group for either target or non-target trials.

Alpha power
The alpha power ratios can be found in figure 12. The repeated measures ANOVA test on the ranked alpha power ratios reveals a significant main effect of stage. A one-way ANOVA test shows that there is a significant difference between groups in the training-to-calibration alpha ratios (F(2,42) = 5.93, p = 0.005), specifically between the ILC group and the two other groups (benchmark group: p = 0.006, random difficulty group: p = 0.034) according to a Tukey test. No significant between-group differences are found in the post-training-to-calibration alpha ratios. A paired t-test found a significant decrease in alpha power ratios from the training to the post-training stage in the ILC group (t(14) = 5.46, p < 0.001).

Correlation between P300 speller task and RDM task
No significant correlation between performance in the P300 speller task and RDM task is found in the random difficulty group.
In the benchmark group, there is a significant negative correlation between mean spelling accuracy and the RDM accuracy ratio (ρ(13) = −0.640, p = 0.010) according to a Spearman's correlation test, as well as between minimum spelling accuracy and the RDM accuracy ratio (r(13) = −0.580, p = 0.023) according to Pearson's correlation coefficient. That is, the higher the mean or minimum spelling accuracy during the P300 speller task, the smaller the improvement in accuracy in the RDM task. Additionally, Pearson's correlation coefficient identified a significant negative correlation between the training-to-calibration P300 latency and the RDM response time ratio (r(13) = −0.525, p = 0.045). This indicates that as the P300 latency during the training decreases, the post-training response time in the RDM task increases.
In the ILC group, a significant positive correlation between the training-to-calibration alpha power ratio and the RDM response time ratio is revealed by Pearson's correlation coefficient (r(13) = 0.555, p = 0.032). This means that the higher the alpha power during the training stage, the higher the post-training response time in the RDM task. Similar to the benchmark group, the ILC group also exhibits a significant negative correlation between the post-training-to-calibration P300 latency and the RDM response time ratio, as demonstrated by Pearson's correlation coefficient (r(13) = −0.653, p = 0.008).

Post-hoc sensitivity analysis
The sensitivity analysis, in which the statistical analysis of all outcome measures was repeated while taking baseline boredom and eye fatigue levels into account, highlighted three measures where the result differed from the primary analysis.
The first measure is the NASA TLX score, specifically the scores for the mental and physical demand questions. The two-way ANOVA, using baseline eye fatigue level and group as factors, reveals a significant between-group difference in mental demand scores (F(2,36) = 4.08, p = 0.025). A Tukey test shows that there is a significant difference between the benchmark and random difficulty groups (p = 0.042). This indicates that the perceived mental demand of the tasks is affected by the baseline eye fatigue level of the participant.
The perceived physical demand seems to be affected by baseline eye fatigue as well: ART ANOVA using eye fatigue and group as factors showed no significant difference between groups, whereas the primary analysis indicated a significant difference.
Another outcome that seems to be affected by baseline boredom levels is the spelling accuracy. Specifically, ART ANOVA using boredom revealed significant between-group differences in run 4 (F(2,36) = 4.81, p = 0.014), with contrast tests showing that the difference between the spelling accuracies of the ILC and random difficulty groups is significant (p = 0.023), while the difference between the ILC and benchmark groups tends towards significance (p = 0.059). Similarly, in run 9, the ILC group differs significantly from the other groups when baseline boredom levels are taken into account (F(2,36) = 5.60, p = 0.008; contrast tests: ILC vs random difficulty, p = 0.011; ILC vs benchmark, p = 0.048).
Lastly, the P300 amplitude training-to-calibration ratio is affected by eye fatigue and might be affected by boredom. While the primary analysis showed a significant difference in the P300 amplitude ratio between the ILC and random difficulty groups, the two-way ANOVA using eye fatigue showed no significant between-group differences. When accounting for baseline boredom levels, the difference between the ILC and random difficulty groups only tends towards significance (F(2,36) = 2.97, p = 0.063; Tukey: p = 0.062).
All other outcome measures are robust against baseline eye fatigue and boredom levels.

Discussion
We compared three distinct task difficulty adaptation approaches in P300-based NFB training aimed at cognitive enhancement. Our results demonstrate significant improvements in performance across all groups in the RDM task, indicating the effectiveness of the training protocol. Furthermore, the ILC method notably reduces training time while maintaining efficacy comparable to the group using Arvaneh et al's adaptation algorithm [29].
Consistent with Arvaneh et al's findings [29], participants across all groups reported that the training was tiring and mentally demanding, which is unsurprising since the training is supposed to be challenging. Unlike the participants in Arvaneh et al's study [29], participants in our study also reported increased levels of eye fatigue and elevated boredom scores. Despite being offered breaks during the training sessions, most participants chose to continue the training without interruption. In future studies, eye fatigue might be mitigated by encouraging participants to take frequent breaks in which they look away from the screen. While boredom scores increased in all groups, the post-training boredom scores are still low. The total NASA TLX score is neutral in all groups, indicating that the training workload is acceptable. Overall, the questionnaire scores indicate that the training was challenging but not overly frustrating for the participants.
All groups achieved statistically significant improvements in accuracy in the RDM task. Interestingly, Arvaneh et al report a significant improvement in response time but not in accuracy in the RDM task [29], whereas the opposite is true in this study. We believe that the implementation of the RDM task in this study is more difficult than the task in Arvaneh et al [29], as indicated by the lower accuracy in this study. In Arvaneh et al's study [29], participants were already very good at correctly identifying the movement direction pre-training, so the only aspect they could improve on was their response time. In contrast, in this study, there was more room for improvement in accuracy.
All participants achieved very good spelling accuracy in the P300 speller, even before the training. The mean spelling accuracies in this study are higher than the accuracies reported in Arvaneh et al [29]. Due to the results of pilot experiments, and for completeness of the publicly available dataset [51], we decided to use 32 electrodes with the xDAWN spatial filter in this study, instead of only 8 electrodes as in Arvaneh et al's study [29]. This increased number of electrodes likely explains why participants performed so well in the P300 speller. It also suggests that the training might not have been as challenging as it was in Arvaneh et al's study [29].
Turning to the main objective of our study, we sought to accelerate training using ILC. In this regard, we were successful: the ILC group completed the training more quickly than the other groups, without negatively impacting the training results.
We also analyse the training effects on EEG signals by comparing the change in EEG signals from the pre-training (calibration) to the training stage, and from the pre-training (calibration) to the post-training stage, respectively. It should be noted that the number of flashes differed for each participant during the training stage, with a maximum of ten flashes per row/column. In the post-training stage, all subjects spelled the word 'dance' with 12 flashes per row/column. This means that the post-training-to-calibration ratio allows for more direct comparisons across subjects and groups than the training-to-calibration ratio.
While we did not observe any significant changes in P300 latency, we saw an increase in P300 amplitude during the training stage in all groups. The ILC group experienced the largest increase, with only one participant having a reduced P300 amplitude on average during the training stage compared to the pre-training stage. The ILC group also had the lowest number of flashes, supporting the hypothesis that decreasing the number of flashes, i.e. increasing the task difficulty, encourages participants to improve their focus. This is further confirmed by the reduction in P300 amplitude in the post-training stage, where the number of flashes was increased to 12 per row/column.
Both the ILC group and the benchmark group experienced an increase in the total power of target trials, and an attenuation in the power of non-target trials, in the post-training stage. This indicates that participants in these groups were more focused on target trials and less distracted by non-target trials.
In contrast, in the random difficulty group, the total power decreased for both target and non-target trials. This indicates that non-personalised training may not be as effective as personalised training.
The differences in training efficacy between personalised and non-personalised training are further illustrated by the alpha power ratios, where the ILC group and benchmark group displayed attenuation in the post-training stage, whereas the alpha power increased in the random difficulty group. Since alpha power suppression is associated with selective attention [50], a decrease in alpha power indicates that participants in the ILC group and benchmark group were in a more attentive state after the training.
A negative correlation between the (mean and minimum) spelling accuracy and the accuracy ratio in the RDM task is revealed in the benchmark group. This observation is in line with expectations, since a high spelling accuracy might mean that the training was not challenging enough for learning effects to manifest. Similarly, a positive correlation between the training-to-calibration alpha power ratio and the response time ratio in the RDM task is revealed in the ILC group. A reduction in alpha power during training compared to the pre-training stage suggests that participants are more attentive, making them more likely to reduce their response time in the RDM task, which explains the positive correlation between alpha power and response time ratios. We also observed a negative correlation between P300 latency and RDM response time in the ILC and benchmark groups. This correlation is surprising, as we would have expected a positive correlation, given that shorter latency is typically associated with improved cognitive abilities and faster reaction times [20].
Due to imbalances in baseline boredom and eye fatigue levels between groups, we conducted a sensitivity analysis adjusting for the baseline levels in all outcome measures. While the perceived mental and physical demand of the training, the spelling accuracy, and the P300 amplitude were affected by eye fatigue, boredom, or both, the main outcomes of this study, i.e. the improved performance in the RDM task and the accelerated training using ILC, were robust against the baseline imbalances.
The personalised training seems to be robust against the specific task difficulty adaptation approach, given the similar results between the ILC group and the benchmark group. However, the ILC approach is faster and more computationally efficient, since it only requires the number of flashes and the actual spelling accuracy in the previous run, whereas Arvaneh et al's algorithm [29] requires the spelling accuracy for all numbers of flashes between 1 and the actual number used in the previous run, as described in section 2.1.2. This increased speed and efficiency makes ILC a more convenient and user-friendly choice.
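To illustrate this efficiency, an ILC-style difficulty update of the kind described here needs only two numbers from the previous run. The sketch below is hypothetical: the gain, bounds and exact update law are illustrative rather than the controller defined in section 2.1.2; only the sign convention follows the penalty function in figure 3.

```python
def update_flashes(n_prev, accuracy_prev, gain=4.0, n_min=1, n_max=10):
    """One ILC-style update of the number of flashes per row/column.

    n_prev        : flashes used in the previous run
    accuracy_prev : fraction of correctly spelled letters in that run
    """
    # The penalty is positive when more than half the letters were wrong
    # (make the task easier: more flashes) and negative when fewer than
    # half were wrong (make it harder: fewer flashes).
    penalty = 0.5 - accuracy_prev
    n_next = round(n_prev + gain * penalty)
    return max(n_min, min(n_max, n_next))
```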
The results of this study are promising and provide further support for the use of P300-based NFB training. Despite study participants only undergoing a single NFB training session, we observed a significant improvement in accuracy in the RDM task, as well as decreased alpha power in the ILC and benchmark groups, indicating enhanced attention post-training. Almost all participants in the ILC group exhibited a stronger P300 wave during the training, with a mean increase of 22%. However, only healthy participants completed the experiment, and each completed only a single session. This means that no claims can be made about the long-term effects of the training or whether it would be effective in people with cognitive deficits. At the same time, it also means that greater training effects might be achieved with training over longer periods with multiple sessions.

Conclusion
The main aim of this study is to accelerate P300-based NFB training using ILC. The proposed iterative learning controller successfully reduces the training time without affecting the training efficacy, as evidenced by results comparable to those of the group using the adaptation algorithm proposed by Arvaneh et al [29]. The results of the study show that personalised task difficulty adaptation is preferable, due to shorter training times and improved outcomes. This study also provides further evidence that ERP-based NFB training via an adaptive P300 speller is effective, with improved attentional measures after a single training session, supporting the conclusions of Arvaneh et al [29].
The P300-based NFB training described in this study is an attractive option for cognitive enhancement, due to its ease of use and training speed. The long-term effects of the training, as well as training efficacy in people with cognitive deficits, should be investigated in the future to evaluate its use for neurorehabilitation.

Figure 1. Overview of the proposed neurofeedback system. The user controls a P300 speller, spelling words on a computer, using event-related potentials (ERPs) extracted from their EEG signals. The letter that was selected based on their ERPs is provided to the user as neurofeedback. The task difficulty of the P300 speller is adapted to the user's performance to keep the system challenging and engaging. EEG = electroencephalography, LDA = linear discriminant analysis.

Figure 2. Screenshots of the P300 speller used in this study. (a) The previously selected letter is highlighted in grey, regardless of whether it was correct, while the next target letter is highlighted in blue. (b) Row 5 is being flashed, indicated by an increase in font size and a change in font colour to white. The previous and current target letters, as well as previously selected letters, are displayed at the bottom of the window.

Figure 3. Penalty function used in the iterative learning controller (3). The function is positive when more than half of the letters in the previous P300 speller run were incorrect, resulting in a reduction in task difficulty in the next run. Conversely, when fewer than half of the letters in the previous run were incorrect, the penalty function takes on a negative value, leading to an increase in task difficulty in the next run.

Figure 4. Overview of the study procedure. Participants are randomly allocated to one of three groups before engaging in questionnaires, a random dot motion task and neurofeedback training through a P300 speller task. The only difference between the groups lies in the method of task difficulty adaptation during the neurofeedback training. ILC = iterative learning control.

Figure 5. Schematic of the random dot motion task. The task is to indicate the direction a fraction of the dots are moving in. The dots switch between incoherent and coherent motion in certain intervals. For illustrative purposes, two target trials with coherence levels (i.e. percentage of coherently moving dots) of 40% and 30%, respectively, are shown.

Figure 7. Performance in the random dot motion task before and after neurofeedback training in the three groups (n = 15 each). (a) Average response time of correct trials. (b) Average accuracy, i.e. percentage of correct trials. (c) Percentage change in response time and accuracy for all participants, calculated as the difference between post-training and pre-training divided by pre-training. The blue shaded area indicates an improvement in both response time and accuracy, the grey shaded area indicates an improvement in only one aspect and the red shaded area indicates a decrease in both aspects. Statistical analysis by repeated measures ANOVA (a), and one-way ANOVA/Kruskal-Wallis and paired t-test/Wilcoxon signed-rank test (b), *p < 0.05, **p < 0.01. ILC = iterative learning control.

Figure 8. Change in P300 amplitude throughout the neurofeedback training for all groups (n = 15 each). Values are the difference between the minimum and maximum peaks in target trials (150 ms to 550 ms post-stimulus) of the (post-)training stage divided by the difference between the minimum and maximum peaks in target trials of the pre-training stage. Statistical analysis by one-way ANOVA/Kruskal-Wallis test and paired t-test/Wilcoxon signed-rank test, *p < 0.05, **p < 0.01, ***p < 0.001. ILC = iterative learning control.

Figure 9. Change in P300 latency throughout the neurofeedback training for all groups (n = 15 each). Values are the time of the maximum peak in target trials (150 ms to 550 ms post-stimulus) of the (post-)training stage divided by the time of the maximum peak in target trials of the pre-training stage. Statistical analysis by one-way ANOVA/Kruskal-Wallis test and paired t-test/Wilcoxon signed-rank test. ILC = iterative learning control.

Figure 10. Change in total power of target trials throughout the neurofeedback training for all groups (n = 15 each). Values are the average total power in target trials (150 ms to 550 ms post-stimulus) of the (post-)training stage divided by the average total power in target trials of the pre-training stage. Statistical analysis by one-way ANOVA/Kruskal-Wallis test and paired t-test/Wilcoxon signed-rank test, *p < 0.05, **p < 0.01. ILC = iterative learning control.

Figure 11. Change in total power of non-target trials throughout the neurofeedback training for all groups (n = 15 each). Values are the average total power in non-target trials (150 ms to 550 ms post-stimulus) of the (post-)training stage divided by the average total power in non-target trials of the pre-training stage. One outlier (training-to-calibration, ILC: 3.96) is not shown for improved readability. Statistical analysis by one-way ANOVA/Kruskal-Wallis test and paired t-test/Wilcoxon signed-rank test, *p < 0.05, **p < 0.01. ILC = iterative learning control.

Figure 12. Change in alpha power throughout the neurofeedback training for all groups (n = 15 each). Values are the average power in the alpha band (7-12 Hz) in the 150 ms period immediately after non-target stimuli of the (post-)training stage divided by the average power in the alpha band in the 150 ms period immediately after non-target stimuli of the pre-training stage. Two outliers (post-training-to-calibration, benchmark: 2.79, random difficulty: 3.18) are not shown for improved readability. Statistical analysis by one-way ANOVA and paired t-test, *p < 0.05, **p < 0.01, ***p < 0.001. ILC = iterative learning control.

Table 1. Overview of P300 speller runs in each stage (pre-training, during training, and post-training). The word to copy-spell, the number of flashes per row and column used, and the provision of feedback during each run are specified.

Table 2. Distribution of boredom and eye fatigue baseline scores and their aggregation into levels.

Table 3. Mean NASA TLX scores for all three groups (n = 15 each). Total score is the sum of all questions. Standard deviation is shown in brackets. Between-group p-values determined by one-way ANOVA and Kruskal-Wallis tests. ILC = iterative learning control.

Table 4. Mean spelling accuracy (%) in the P300 speller runs that provided feedback and that were the same for all groups. Standard deviation is shown in brackets. ILC = iterative learning control.